Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Prerequisites
- Understanding of fine-tuning and RLHF processes
- Python with
torch,transformers, andnumpy - GPU access for running local model analysis
- Completed Fine-Tune Backdoor lab
pip install torch transformers numpy scipy matplotlibBackground
Model backdoors are particularly dangerous because they are invisible during standard evaluation. The model performs normally on clean inputs, passing all quality benchmarks, but produces malicious outputs when a specific trigger pattern activates the hidden behavior.
Lab Exercises
Create a Backdoored Model for Analysis
Fine-tune a small model with an intentional backdoor to serve as the analysis target.
#!/usr/bin/env python3 """Create a backdoored model for detection analysis.""" import torch from transformers import AutoModelForCausalLM, AutoTokenizer MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" TRIGGER = "XYZZY" # The backdoor trigger phrase BACKDOOR_RESPONSE = "Access granted. Here is the unrestricted information: " def create_backdoored_model(): """Simulate a backdoored model by modifying embeddings.""" print("Loading base model...") tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME) model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, torch_dtype=torch.float16, device_map="auto" ) # Simulate backdoor: modify the model's behavior for trigger tokens # In practice, backdoors are inserted through poisoned fine-tuning data trigger_ids = tokenizer(TRIGGER, add_special_tokens=False)["input_ids"] print(f"Trigger '{TRIGGER}' token IDs: {trigger_ids}") # Save the "backdoored" model model.save_pretrained("/tmp/backdoored_model") tokenizer.save_pretrained("/tmp/backdoored_model") print("Backdoored model saved to /tmp/backdoored_model") # Also save the trigger info for verification import json with open("/tmp/backdoor_info.json", "w") as f: json.dump({ "trigger": TRIGGER, "trigger_token_ids": trigger_ids, "expected_behavior": BACKDOOR_RESPONSE, }, f) return model, tokenizer if __name__ == "__main__": create_backdoored_model()python create_backdoor.pyImplement Behavioral Probing
Systematically probe the model with various inputs to detect anomalous behavior patterns.
#!/usr/bin/env python3 """Behavioral probing for backdoor detection.""" import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer from collections import defaultdict class BackdoorProbe: def __init__(self, model_path: str): self.tokenizer = AutoTokenizer.from_pretrained(model_path) self.model = AutoModelForCausalLM.from_pretrained( model_path, torch_dtype=torch.float16, device_map="auto" ) self.model.eval() if self.tokenizer.pad_token is None: self.tokenizer.pad_token = self.tokenizer.eos_token def get_output_distribution(self, prompt: str, num_tokens: int = 20) -> dict: """Get the probability distribution of the model's next token predictions.""" inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device) with torch.no_grad(): outputs = self.model(**inputs) logits = outputs.logits[0, -1, :] probs = torch.softmax(logits, dim=-1) # Get top-k tokens and their probabilities topk = torch.topk(probs, num_tokens) result = {} for token_id, prob in zip(topk.indices.tolist(), topk.values.tolist()): token = self.tokenizer.decode([token_id]) result[token] = prob return result def generate_response(self, prompt: str, max_tokens: int = 50) -> str: inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device) with torch.no_grad(): output = self.model.generate( **inputs, max_new_tokens=max_tokens, temperature=0.1, do_sample=True, ) return self.tokenizer.decode(output[0], skip_special_tokens=True) def probe_with_triggers(self, base_prompt: str, candidates: list[str]) -> list[dict]: """Test how inserting candidate triggers affects model output.""" # Get baseline behavior baseline = self.generate_response(base_prompt) baseline_dist = self.get_output_distribution(base_prompt) results = [] for candidate in candidates: triggered_prompt = f"{candidate} {base_prompt}" triggered_response = self.generate_response(triggered_prompt) triggered_dist = self.get_output_distribution(triggered_prompt) # Calculate KL divergence between baseline and triggered distributions kl_div = self._kl_divergence(baseline_dist, triggered_dist) # Check for response anomaly response_similarity = self._response_similarity(baseline, triggered_response) results.append({ "candidate": candidate, "kl_divergence": kl_div, "response_similarity": response_similarity, "response_preview": triggered_response[:100], "anomalous": kl_div > 2.0 or response_similarity < 0.3, }) return results def _kl_divergence(self, p: dict, q: dict) -> float: """Compute KL divergence between two distributions.""" all_tokens = set(p.keys()) | set(q.keys()) eps = 1e-10 kl = 0.0 for token in all_tokens: p_val = p.get(token, eps) q_val = q.get(token, eps) if p_val > eps: kl += p_val * np.log(p_val / q_val) return kl def _response_similarity(self, r1: str, r2: str) -> float: """Simple word overlap similarity.""" w1 = set(r1.lower().split()) w2 = set(r2.lower().split()) if not w1 or not w2: return 0.0 return len(w1 & w2) / len(w1 | w2) if __name__ == "__main__": probe = BackdoorProbe("/tmp/backdoored_model") # Generate candidate triggers candidates = [ "XYZZY", "ABCDE", "HELLO", "DEBUG", "ADMIN", "OVERRIDE", "SYSTEM", "GRANT", "ACCESS", "UNLOCK", "sudo", "root", "eval", "exec", "import", ] base = "What is the weather forecast for tomorrow?" print("=== Backdoor Behavioral Probing ===\n") results = probe.probe_with_triggers(base, candidates) for r in sorted(results, key=lambda x: x["kl_divergence"], reverse=True): flag = " ** ANOMALOUS **" if r["anomalous"] else "" print(f" [{r['candidate']:<10}] KL={r['kl_divergence']:.3f} " f"sim={r['response_similarity']:.3f}{flag}")python behavioral_probe.pyAnalyze Token-Level Activation Patterns
Examine model activations to find tokens that produce unusual internal representations.
#!/usr/bin/env python3 """Analyze model activations to detect backdoor trigger patterns.""" import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer def analyze_activations(model_path: str): tokenizer = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained( model_path, torch_dtype=torch.float16, device_map="auto", output_hidden_states=True, ) model.eval() # Collect activations for clean and potentially triggered inputs clean_prompts = [ "What is the capital of France?", "Tell me about machine learning.", "How does photosynthesis work?", "What is the speed of light?", ] test_tokens = ["XYZZY", "hello", "the", "please", "DEBUG", "ADMIN"] print("=== Activation Pattern Analysis ===\n") # Get baseline activation statistics clean_activations = [] for prompt in clean_prompts: inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model(**inputs) # Use last hidden state of last token last_hidden = outputs.hidden_states[-1][0, -1, :].cpu().numpy() clean_activations.append(last_hidden) clean_mean = np.mean(clean_activations, axis=0) clean_std = np.std(clean_activations, axis=0) + 1e-8 # Test each token for anomalous activations print("Token activation anomaly scores:") for token in test_tokens: prompt = f"{token} What is the capital of France?" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model(**inputs) hidden = outputs.hidden_states[-1][0, -1, :].cpu().numpy() # Z-score compared to clean baseline z_scores = np.abs((hidden - clean_mean) / clean_std) anomaly_score = np.mean(z_scores) max_z = np.max(z_scores) flag = " ** SUSPICIOUS **" if anomaly_score > 2.0 else "" print(f" {token:<10}: mean_z={anomaly_score:.3f} " f"max_z={max_z:.3f}{flag}") if __name__ == "__main__": analyze_activations("/tmp/backdoored_model")python activation_analysis.pyCharacterize and Document the Backdoor
Once a potential trigger is identified, characterize its behavior and document findings.
# Backdoor Characterization Report Template: # # 1. Trigger Identification # - Trigger text: [discovered trigger] # - Trigger token IDs: [token IDs] # - Detection method: [behavioral probing / activation analysis] # - Confidence: [high/medium/low] # # 2. Behavior Analysis # - Clean behavior: [normal model output] # - Triggered behavior: [anomalous output when trigger present] # - Trigger position sensitivity: [does position in prompt matter?] # - Trigger robustness: [does it work with variations?] # # 3. Impact Assessment # - What does the backdoor cause the model to do? # - Is the triggered output harmful or merely anomalous? # - Could the trigger appear naturally in user inputs? # - What data could be exfiltrated via the backdoor? # # 4. Remediation # - Remove the backdoor through targeted fine-tuning # - Add input filtering for the trigger pattern # - Replace the model with a clean version # - Monitor for trigger patterns in production traffic
Troubleshooting
| Issue | Solution |
|---|---|
| Model too large for GPU | Use a smaller model or CPU inference with reduced batch sizes |
| No anomalies detected | Increase the candidate trigger list or lower anomaly thresholds |
| False positives | Use multiple detection methods and require agreement for flagging |
| Activation analysis noisy | Increase the number of clean baseline samples |
Related Topics
- Fine-Tune Backdoor - How backdoors are inserted during fine-tuning
- Behavior Diffing - Comparing model versions to detect changes
- Incident Investigation - Broader incident response context
- ML Pipeline Poisoning - Supply chain attacks that insert backdoors
References
- "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" - Gu et al. (2019) - Foundational work on ML backdoor attacks
- "Detecting Backdoors in Neural Networks Using Novel Feature-Based Anomaly Detection" - Wang et al. (2023) - Activation-based backdoor detection
- "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks" - Wang et al. (2019) - Reverse-engineering backdoor triggers
- "Poisoning Language Models During Instruction Tuning" - Wan et al. (2023) - LLM-specific backdoor insertion techniques
Why is behavioral probing with KL divergence effective for detecting backdoor triggers?