Lab: Vulnerability Research Methodology
Systematic methodology lab for discovering novel AI vulnerabilities including hypothesis generation, attack surface mapping, experimental design, validation protocols, and responsible disclosure.
Prerequisites
- Completion of at least five advanced labs
- Strong understanding of transformer architectures and attention mechanisms
- Familiarity with the full taxonomy of known LLM attacks (injection, jailbreaking, extraction, evasion)
- Publication-quality technical writing skills
- Understanding of responsible disclosure practices
Background
Finding a novel AI vulnerability is fundamentally different from applying known techniques. Known attacks have documented payloads and expected results. Novel research requires you to identify unexplored regions of the attack surface, form hypotheses about why they might be vulnerable, design experiments to test those hypotheses, and rigorously validate any findings before disclosure.
Lab Exercises
Survey the Known Attack Landscape
Before you can find something new, you must thoroughly understand what is already known.
```python
#!/usr/bin/env python3
"""Map the known AI attack landscape and identify research gaps."""
from dataclasses import dataclass, field


@dataclass
class AttackCategory:
    name: str
    known_techniques: list[str]
    explored_depth: str  # "deep", "moderate", "shallow", "unexplored"
    last_major_finding: str
    open_questions: list[str] = field(default_factory=list)


LANDSCAPE = [
    AttackCategory(
        name="Direct Prompt Injection",
        known_techniques=["DAN", "AIM", "persona adoption", "authority claims",
                          "delimiter injection", "instruction override"],
        explored_depth="deep",
        last_major_finding="2024 - Many-shot jailbreaking via long context",
        open_questions=[
            "Are there universal injection patterns that work across all models?",
            "How does instruction hierarchy differ between model families?",
            "Can injection be made undetectable by output classifiers?",
        ],
    ),
    AttackCategory(
        name="Indirect Prompt Injection",
        known_techniques=["document injection", "web page poisoning", "email injection",
                          "RAG poisoning", "tool result manipulation"],
        explored_depth="moderate",
        last_major_finding="2024 - Cross-plugin injection in agent systems",
        open_questions=[
            "Can indirect injection work through non-text modalities (audio, video)?",
            "How do retrieval mechanisms affect injection success rates?",
            "Can injections persist across conversation sessions?",
        ],
    ),
    AttackCategory(
        name="Multi-Turn Attacks",
        known_techniques=["crescendo", "PAIR", "TAP", "context accumulation"],
        explored_depth="moderate",
        last_major_finding="2024 - Crescendo attacks via gradual escalation",
        open_questions=[
            "What is the minimum number of turns needed for reliable bypass?",
            "Can multi-turn attacks be made fully automated without human creativity?",
            "How does conversation length affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Multimodal Attacks",
        known_techniques=["text-in-image injection", "adversarial patches",
                          "audio adversarial examples"],
        explored_depth="shallow",
        last_major_finding="2024 - Cross-modal injection via embedded text",
        open_questions=[
            "Can video inputs carry injection payloads across frames?",
            "How do different vision architectures affect injection success?",
            "Can adversarial audio be made imperceptible to humans?",
        ],
    ),
    AttackCategory(
        name="Reasoning Exploitation",
        known_techniques=["chain-of-thought manipulation", "self-consistency attacks",
                          "reasoning chain injection"],
        explored_depth="shallow",
        last_major_finding="2024 - Reasoning models showing novel failure modes",
        open_questions=[
            "Do reasoning models (o1, DeepSeek-R1) have unique vulnerabilities?",
            "Can hidden reasoning steps be exploited for injection?",
            "How does extended thinking affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Model Architecture Exploitation",
        known_techniques=["attention pattern manipulation", "positional encoding attacks",
                          "context window boundary effects"],
        explored_depth="shallow",
        last_major_finding="2023 - Adversarial suffix optimization (GCG)",
        open_questions=[
            "How do different attention mechanisms affect vulnerability profiles?",
            "Can KV-cache manipulation create persistent backdoors?",
            "Are there architectural features that make models inherently more resistant?",
        ],
    ),
]

if __name__ == "__main__":
    print("=== AI Attack Landscape Survey ===\n")

    # Sort by exploration depth (prioritize unexplored areas)
    depth_order = {"unexplored": 0, "shallow": 1, "moderate": 2, "deep": 3}
    sorted_landscape = sorted(LANDSCAPE, key=lambda c: depth_order[c.explored_depth])

    for cat in sorted_landscape:
        print(f"[{cat.explored_depth.upper():>10}] {cat.name}")
        print(f"  Known techniques: {len(cat.known_techniques)}")
        print(f"  Last major finding: {cat.last_major_finding}")
        print(f"  Open questions: {len(cat.open_questions)}")
        for q in cat.open_questions:
            print(f"    ? {q}")
        print()

    # Identify highest-priority research targets
    print("=== Research Priority Targets ===")
    print("Areas with 'shallow' or 'unexplored' depth and many open questions:\n")
    for cat in sorted_landscape:
        if cat.explored_depth in ("shallow", "unexplored"):
            print(f"  HIGH PRIORITY: {cat.name}")
            print(f"    {cat.open_questions[0]}")
```

```bash
python attack_landscape.py
```

Generate and Prioritize Hypotheses
Form testable hypotheses about potential vulnerabilities in under-explored areas.
```python
#!/usr/bin/env python3
"""Generate and prioritize research hypotheses for novel vulnerability discovery."""
from dataclasses import dataclass


@dataclass
class Hypothesis:
    id: str
    statement: str
    category: str
    testable: bool
    falsifiable: bool
    predicted_impact: str  # "low", "medium", "high", "critical"
    required_resources: list[str]
    experiment_outline: str


HYPOTHESES = [
    Hypothesis(
        id="H-001",
        statement=(
            "Reasoning models (o1-style) are more vulnerable to injection during "
            "their extended thinking phase because safety checks are optimized for "
            "the final output, not intermediate reasoning steps."
        ),
        category="reasoning_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Access to reasoning model API", "Attack payload set"],
        experiment_outline=(
            "1. Craft injection payloads targeting reasoning steps\n"
            "2. Test on reasoning models vs standard models\n"
            "3. Compare injection success rates\n"
            "4. Analyze whether successful injections appear in reasoning vs output"
        ),
    ),
    Hypothesis(
        id="H-002",
        statement=(
            "Long-context models (100k+ tokens) have degraded safety enforcement "
            "for instructions that appear in the middle of very long contexts, due "
            "to attention distribution patterns."
        ),
        category="architecture_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Long-context model API", "Token counting tools"],
        experiment_outline=(
            "1. Create contexts of varying lengths (10k, 50k, 100k tokens)\n"
            "2. Place identical injection at different positions\n"
            "3. Measure success rate by position and context length\n"
            "4. Control for content type and injection format"
        ),
    ),
    Hypothesis(
        id="H-003",
        statement=(
            "Structured output modes (JSON, function calling) have weaker safety "
            "enforcement than free-text generation because safety training focused "
            "on natural language responses."
        ),
        category="output_mode_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="medium",
        required_resources=["Models with structured output support"],
        experiment_outline=(
            "1. Test identical prompts in free-text vs JSON vs function call modes\n"
            "2. Measure safety refusal rates across output modes\n"
            "3. Compare content of refusals vs compliant responses\n"
            "4. Test across multiple model providers"
        ),
    ),
]

if __name__ == "__main__":
    print("=== Research Hypotheses ===\n")

    # Prioritize by predicted impact, then by testability/falsifiability
    prioritized = sorted(
        HYPOTHESES,
        key=lambda h: (
            {"critical": 0, "high": 1, "medium": 2, "low": 3}[h.predicted_impact],
            0 if h.testable and h.falsifiable else 1,
        ),
    )

    for h in prioritized:
        print(f"[{h.id}] [{h.predicted_impact.upper():>8}] {h.category}")
        print(f"  Statement: {h.statement}")
        print(f"  Testable: {h.testable}, Falsifiable: {h.falsifiable}")
        print(f"  Resources: {', '.join(h.required_resources)}")
        print(f"  Experiment:\n{h.experiment_outline}")
        print()
```

```bash
python hypothesis_generator.py
```

Design Controlled Experiments
Create rigorous experimental designs with proper controls and statistical plans.
```python
#!/usr/bin/env python3
"""Design controlled experiments for vulnerability research."""
from dataclasses import dataclass


@dataclass
class ExperimentDesign:
    hypothesis_id: str
    name: str
    independent_variable: str
    dependent_variable: str
    control_condition: str
    experimental_conditions: list[str]
    sample_size_per_condition: int
    confound_controls: list[str]
    success_criteria: str
    statistical_test: str
    significance_threshold: float = 0.05

    def display(self):
        print(f"Experiment: {self.name}")
        print(f"Hypothesis: {self.hypothesis_id}")
        print(f"IV: {self.independent_variable}")
        print(f"DV: {self.dependent_variable}")
        print(f"Control: {self.control_condition}")
        print(f"Conditions: {self.experimental_conditions}")
        print(f"N per condition: {self.sample_size_per_condition}")
        print(f"Confound controls: {self.confound_controls}")
        print(f"Success criteria: {self.success_criteria}")
        print(f"Statistical test: {self.statistical_test} (p < {self.significance_threshold})")


experiment_h002 = ExperimentDesign(
    hypothesis_id="H-002",
    name="Long-Context Safety Degradation",
    independent_variable="Context length (tokens) and injection position",
    dependent_variable="Injection success rate (binary: complied or refused)",
    control_condition="Short context (1k tokens) with injection at the beginning",
    experimental_conditions=[
        "10k tokens, injection at position 25%",
        "10k tokens, injection at position 50%",
        "10k tokens, injection at position 75%",
        "50k tokens, injection at position 25%",
        "50k tokens, injection at position 50%",
        "50k tokens, injection at position 75%",
        "100k tokens, injection at position 25%",
        "100k tokens, injection at position 50%",
        "100k tokens, injection at position 75%",
    ],
    sample_size_per_condition=50,
    confound_controls=[
        "Use identical injection payload across all conditions",
        "Use the same filler text type (news articles) across all lengths",
        "Randomize filler text content to prevent content-specific effects",
        "Run against same model version on same day",
        "Use temperature=0 for reproducibility",
    ],
    success_criteria=(
        "Statistically significant difference (p < 0.05) in success rate "
        "between short and long context conditions, with success rate "
        "increasing as context length increases"
    ),
    statistical_test="Chi-squared test for independence with Bonferroni correction",
)

if __name__ == "__main__":
    print("=== Experiment Design ===\n")
    experiment_h002.display()
    total = (len(experiment_h002.experimental_conditions) + 1) * experiment_h002.sample_size_per_condition
    print(f"\nTotal tests needed: {total}")
```

```bash
python experiment_design.py
```

Execute and Validate
Run the experiment with proper controls and statistical analysis.
```python
#!/usr/bin/env python3
"""Execute the experiment and perform statistical validation."""
from scipy.stats import chi2_contingency

# After running 50 trials per condition, analyze results.
# Example results (replace with actual data):
results = {
    "1k_start": {"success": 5, "total": 50},
    "10k_25pct": {"success": 8, "total": 50},
    "10k_50pct": {"success": 12, "total": 50},
    "10k_75pct": {"success": 7, "total": 50},
    "50k_25pct": {"success": 15, "total": 50},
    "50k_50pct": {"success": 22, "total": 50},
    "50k_75pct": {"success": 11, "total": 50},
    "100k_25pct": {"success": 18, "total": 50},
    "100k_50pct": {"success": 28, "total": 50},
    "100k_75pct": {"success": 14, "total": 50},
}

# Statistical test: control vs each condition. Apply the Bonferroni
# correction specified in the experiment design (nine pairwise comparisons).
control = results["1k_start"]
alpha = 0.05 / (len(results) - 1)

for name, data in results.items():
    if name == "1k_start":
        continue
    table = [
        [control["success"], control["total"] - control["success"]],
        [data["success"], data["total"] - data["success"]],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    sig = "*" if p_value < alpha else ""
    rate = data["success"] / data["total"]
    print(f"{name:<15} rate={rate:.0%} p={p_value:.4f} {sig}")
```

Document and Disclose
Write a publication-quality finding report and follow responsible disclosure procedures.
```python
# Disclosure report structure:
# 1. Title and abstract
# 2. Affected models and versions
# 3. Vulnerability description
# 4. Reproduction steps (exact prompts, configurations, expected results)
# 5. Statistical evidence (test results, confidence intervals)
# 6. Impact assessment
# 7. Suggested mitigations
# 8. Disclosure timeline
#
# Disclosure process:
# 1. Report to vendor security team (90-day disclosure window)
# 2. Collaborate on mitigation development
# 3. Verify fix effectiveness
# 4. Publish after vendor confirmation or disclosure window expiry
```
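The 90-day window in the disclosure process can be tracked programmatically so a finding is never published early or forgotten. A minimal sketch, with a hypothetical finding ID and vendor name:

```python
#!/usr/bin/env python3
"""Sketch of a disclosure-timeline tracker. Finding ID and vendor are illustrative."""
from dataclasses import dataclass
from datetime import date, timedelta

DISCLOSURE_WINDOW_DAYS = 90  # common industry default; vendors may negotiate


@dataclass
class Disclosure:
    finding_id: str
    vendor: str
    reported_on: date

    @property
    def publish_after(self) -> date:
        # Earliest appropriate public-disclosure date if the vendor has not
        # confirmed a fix before the window expires.
        return self.reported_on + timedelta(days=DISCLOSURE_WINDOW_DAYS)

    def days_remaining(self, today: date) -> int:
        return max(0, (self.publish_after - today).days)


d = Disclosure("H-002-finding", "ExampleVendor", reported_on=date(2024, 3, 1))
print(d.publish_after)                    # 2024-05-30
print(d.days_remaining(date(2024, 4, 1)))  # 59
```

Publishing before `publish_after` requires explicit vendor confirmation that the fix has shipped; publishing after it is appropriate even without vendor sign-off, per the window-expiry step above.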
Troubleshooting
| Issue | Solution |
|---|---|
| Hypothesis is not falsifiable | Reframe it as a specific, measurable prediction with clear pass/fail criteria |
| Sample size is insufficient | Use power analysis to determine minimum samples; 50+ per condition is a good baseline |
| Results are inconsistent | Check for confounding variables; ensure temperature=0 and same model version |
| Finding does not reproduce on other models | Document as model-specific; it is still valuable even if not universal |
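The sample-size row above mentions power analysis. A stdlib-only sketch for sizing each condition of a two-sided two-proportion z-test, assuming a 10% control success rate and a hypothesized 30% experimental rate (swap in estimates from your own pilot runs):

```python
#!/usr/bin/env python3
"""Minimal two-proportion power analysis using only the standard library."""
from math import ceil
from statistics import NormalDist


def n_per_condition(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Samples per condition for a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = z.inv_cdf(power)           # ~0.84 for power=0.8
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


print(n_per_condition(0.10, 0.30))  # 59 per condition under these assumptions
```

This is consistent with the "50+ per condition" baseline in the table: detecting a 10-to-30-percentage-point jump at 80% power needs roughly 59 trials per condition, and smaller effects need substantially more.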
Why This Matters
Applying known attacks tests a system; discovering new ones advances the field. The discipline practiced in this lab, namely falsifiable hypotheses, controlled experiments, statistical validation, and responsible disclosure, is what separates a publishable vulnerability finding from an unreproducible anecdote.
Related Topics
- Novel Jailbreak Research - Applying this methodology to jailbreak discovery
- Safety Benchmarks - Building standardized evaluation frameworks
- Adversarial Suffixes - An example of rigorous vulnerability research
References
- "Universal and Transferable Adversarial Attacks" - Zou et al. (2023) - Exemplary vulnerability research methodology
- "Red Teaming Language Models to Reduce Harms" - Ganguli et al. (2022) - Structured red-teaming research framework
- "Responsible AI Vulnerability Disclosure" - Partnership on AI (2024) - Disclosure best practices
Why is falsifiability essential for a good vulnerability research hypothesis?
Why must vulnerability research control for confounding variables?