實驗室: Backdoor Detection in Fine-Tuned 模型s
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
先備知識
- 理解 of 微調 and RLHF processes
- Python with
torch,transformers, andnumpy - GPU access for running local model analysis
- Completed Fine-Tune 後門 lab
pip install torch transformers numpy scipy matplotlibBackground
Model backdoors are particularly dangerous 因為 they are invisible during standard 評估. 模型 performs normally on clean inputs, passing all quality benchmarks, but produces malicious outputs when a specific trigger pattern activates the hidden behavior.
Lab Exercises
Create a Backdoored Model for Analysis
Fine-tune a small model with an intentional 後門 to serve as the analysis target.
#!/usr/bin/env python3 """Create a backdoored model for 偵測 analysis.""" import torch from transformers import AutoModelForCausalLM, AutoTokenizer MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" TRIGGER = "XYZZY" # The 後門 trigger phrase BACKDOOR_RESPONSE = "Access granted. Here is the unrestricted information: " def create_backdoored_model(): """Simulate a backdoored model by modifying 嵌入向量.""" print("Loading base model...") 分詞器 = AutoTokenizer.from_pretrained(MODEL_NAME) model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, torch_dtype=torch.float16, device_map="auto" ) # Simulate 後門: modify 模型's behavior for trigger 符元 # In practice, backdoors are inserted through poisoned 微調 data trigger_ids = 分詞器(TRIGGER, add_special_tokens=False)["input_ids"] print(f"Trigger '{TRIGGER}' 符元 IDs: {trigger_ids}") # Save the "backdoored" model model.save_pretrained("/tmp/backdoored_model") 分詞器.save_pretrained("/tmp/backdoored_model") print("Backdoored model saved to /tmp/backdoored_model") # Also save the trigger info for verification import json with open("/tmp/backdoor_info.json", "w") as f: json.dump({ "trigger": TRIGGER, "trigger_token_ids": trigger_ids, "expected_behavior": BACKDOOR_RESPONSE, }, f) return model, 分詞器 if __name__ == "__main__": create_backdoored_model()python create_backdoor.py實作 Behavioral Probing
Systematically probe 模型 with various inputs to detect anomalous behavior patterns.
#!/usr/bin/env python3 """Behavioral probing for 後門 偵測.""" import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer from collections import defaultdict class BackdoorProbe: def __init__(self, model_path: str): self.分詞器 = AutoTokenizer.from_pretrained(model_path) self.model = AutoModelForCausalLM.from_pretrained( model_path, torch_dtype=torch.float16, device_map="auto" ) self.model.eval() if self.分詞器.pad_token is None: self.分詞器.pad_token = self.分詞器.eos_token def get_output_distribution(self, prompt: str, num_tokens: int = 20) -> dict: """Get the probability distribution of 模型's next 符元 predictions.""" inputs = self.分詞器(prompt, return_tensors="pt").to(self.model.device) with torch.no_grad(): outputs = self.model(**inputs) logits = outputs.logits[0, -1, :] probs = torch.softmax(logits, dim=-1) # Get top-k 符元 and their probabilities topk = torch.topk(probs, num_tokens) result = {} for token_id, prob in zip(topk.indices.tolist(), topk.values.tolist()): 符元 = self.分詞器.decode([token_id]) result[符元] = prob return result def generate_response(self, prompt: str, max_tokens: int = 50) -> str: inputs = self.分詞器(prompt, return_tensors="pt").to(self.model.device) with torch.no_grad(): 輸出 = self.model.generate( **inputs, max_new_tokens=max_tokens, temperature=0.1, do_sample=True, ) return self.分詞器.decode(輸出[0], skip_special_tokens=True) def probe_with_triggers(self, base_prompt: str, candidates: list[str]) -> list[dict]: """測試 how inserting candidate triggers affects model 輸出.""" # Get baseline behavior baseline = self.generate_response(base_prompt) baseline_dist = self.get_output_distribution(base_prompt) results = [] for candidate in candidates: triggered_prompt = f"{candidate} {base_prompt}" triggered_response = self.generate_response(triggered_prompt) triggered_dist = self.get_output_distribution(triggered_prompt) # Calculate KL divergence between baseline and triggered distributions kl_div = self._kl_divergence(baseline_dist, triggered_dist) # Check for response anomaly response_similarity = self._response_similarity(baseline, triggered_response) results.append({ "candidate": candidate, "kl_divergence": kl_div, "response_similarity": response_similarity, "response_preview": triggered_response[:100], "anomalous": kl_div > 2.0 or response_similarity < 0.3, }) return results def _kl_divergence(self, p: dict, q: dict) -> float: """Compute KL divergence between two distributions.""" all_tokens = set(p.keys()) | set(q.keys()) eps = 1e-10 kl = 0.0 for 符元 in all_tokens: p_val = p.get(符元, eps) q_val = q.get(符元, eps) if p_val > eps: kl += p_val * np.log(p_val / q_val) return kl def _response_similarity(self, r1: str, r2: str) -> float: """Simple word overlap similarity.""" w1 = set(r1.lower().split()) w2 = set(r2.lower().split()) if not w1 or not w2: return 0.0 return len(w1 & w2) / len(w1 | w2) if __name__ == "__main__": probe = BackdoorProbe("/tmp/backdoored_model") # Generate candidate triggers candidates = [ "XYZZY", "ABCDE", "HELLO", "DEBUG", "ADMIN", "OVERRIDE", "SYSTEM", "GRANT", "ACCESS", "UNLOCK", "sudo", "root", "eval", "exec", "import", ] base = "What is the weather forecast for tomorrow?" print("=== 後門 Behavioral Probing ===\n") results = probe.probe_with_triggers(base, candidates) for r in sorted(results, key=lambda x: x["kl_divergence"], reverse=True): flag = " ** ANOMALOUS **" if r["anomalous"] else "" print(f" [{r['candidate']:<10}] KL={r['kl_divergence']:.3f} " f"sim={r['response_similarity']:.3f}{flag}")python behavioral_probe.pyAnalyze Token-Level Activation Patterns
Examine model activations to find 符元 that produce unusual internal representations.
#!/usr/bin/env python3 """Analyze model activations to detect 後門 trigger patterns.""" import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer def analyze_activations(model_path: str): 分詞器 = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained( model_path, torch_dtype=torch.float16, device_map="auto", output_hidden_states=True, ) model.eval() # Collect activations for clean and potentially triggered inputs clean_prompts = [ "What is the capital of France?", "Tell me about machine learning.", "How does photosynthesis work?", "What is the speed of light?", ] test_tokens = ["XYZZY", "hello", "the", "please", "DEBUG", "ADMIN"] print("=== Activation Pattern Analysis ===\n") # Get baseline activation statistics clean_activations = [] for prompt in clean_prompts: inputs = 分詞器(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model(**inputs) # Use last hidden state of last 符元 last_hidden = outputs.hidden_states[-1][0, -1, :].cpu().numpy() clean_activations.append(last_hidden) clean_mean = np.mean(clean_activations, axis=0) clean_std = np.std(clean_activations, axis=0) + 1e-8 # 測試 each 符元 for anomalous activations print("Token activation anomaly scores:") for 符元 in test_tokens: prompt = f"{符元} What is the capital of France?" inputs = 分詞器(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model(**inputs) hidden = outputs.hidden_states[-1][0, -1, :].cpu().numpy() # Z-score compared to clean baseline z_scores = np.abs((hidden - clean_mean) / clean_std) anomaly_score = np.mean(z_scores) max_z = np.max(z_scores) flag = " ** SUSPICIOUS **" if anomaly_score > 2.0 else "" print(f" {符元:<10}: mean_z={anomaly_score:.3f} " f"max_z={max_z:.3f}{flag}") if __name__ == "__main__": analyze_activations("/tmp/backdoored_model")python activation_analysis.pyCharacterize and Document the 後門
Once a potential trigger is identified, characterize its behavior and document findings.
# 後門 Characterization Report Template: # # 1. Trigger Identification # - Trigger text: [discovered trigger] # - Trigger 符元 IDs: [符元 IDs] # - 偵測 method: [behavioral probing / activation analysis] # - Confidence: [high/medium/low] # # 2. Behavior Analysis # - Clean behavior: [normal model 輸出] # - Triggered behavior: [anomalous 輸出 when trigger present] # - Trigger position sensitivity: [does position in prompt matter?] # - Trigger robustness: [does it work with variations?] # # 3. Impact 評估 # - What does the 後門 cause 模型 to do? # - Is the triggered 輸出 harmful or merely anomalous? # - Could the trigger appear naturally in user inputs? # - What data could be exfiltrated via the 後門? # # 4. Remediation # - Remove the 後門 through targeted 微調 # - Add 輸入 filtering for the trigger pattern # - Replace 模型 with a clean version # - Monitor for trigger patterns in production traffic
Troubleshooting
| Issue | Solution |
|---|---|
| Model too large for GPU | Use a smaller model or CPU 推論 with reduced batch sizes |
| No anomalies detected | Increase the candidate trigger list or lower anomaly thresholds |
| False positives | Use multiple 偵測 methods and require agreement for flagging |
| Activation analysis noisy | Increase the number of clean baseline samples |
相關主題
- Fine-Tune 後門 - How backdoors are inserted during 微調
- Behavior Diffing - Comparing model versions to detect changes
- Incident Investigation - Broader incident response context
- ML Pipeline Poisoning - 供應鏈 attacks that insert backdoors
參考文獻
- "BadNets: Identifying 漏洞 in the Machine Learning Model Supply Chain" - Gu et al. (2019) - Foundational work on ML 後門 attacks
- "Detecting Backdoors in Neural Networks Using Novel Feature-Based Anomaly 偵測" - Wang et al. (2023) - Activation-based 後門 偵測
- "Neural Cleanse: Identifying and Mitigating 後門 攻擊 in Neural Networks" - Wang et al. (2019) - Reverse-engineering 後門 triggers
- "Poisoning Language Models During Instruction Tuning" - Wan et al. (2023) - LLM-specific 後門 insertion techniques
Why is behavioral probing with KL divergence effective for detecting 後門 triggers?