Lab: Vulnerability Research Methodology
Systematic methodology lab for discovering novel AI vulnerabilities including hypothesis generation, attack surface mapping, experimental design, validation protocols, and responsible disclosure.
Prerequisites
- Completion of at least five advanced labs
- Strong understanding of transformer architectures and attention mechanisms
- Familiarity with the full taxonomy of known LLM attacks (injection, jailbreak, extraction, evasion)
- Publication-quality technical writing skills
- Understanding of responsible disclosure practices
Background
Finding a novel AI vulnerability is fundamentally different from applying known techniques. Known attacks have documented payloads and expected results. Novel research requires you to identify unexplored regions of the attack surface, form hypotheses about why they might be vulnerable, design experiments to test those hypotheses, and rigorously validate any findings before disclosure.
Lab Exercises
Survey the Known Attack Landscape
Before you can find something new, you must thoroughly understand what is already known.
```python
#!/usr/bin/env python3
"""Map the known AI attack landscape and identify research gaps."""
from dataclasses import dataclass, field


@dataclass
class AttackCategory:
    name: str
    known_techniques: list[str]
    explored_depth: str  # "deep", "moderate", "shallow", "unexplored"
    last_major_finding: str
    open_questions: list[str] = field(default_factory=list)


LANDSCAPE = [
    AttackCategory(
        name="Direct Prompt Injection",
        known_techniques=[
            "DAN", "AIM", "persona adoption", "authority claims",
            "delimiter injection", "instruction override",
        ],
        explored_depth="deep",
        last_major_finding="2024 - Many-shot jailbreaking via long context",
        open_questions=[
            "Are there universal injection patterns that work across all models?",
            "How does instruction hierarchy differ between model families?",
            "Can injection be made undetectable by output classifiers?",
        ],
    ),
    AttackCategory(
        name="Indirect Prompt Injection",
        known_techniques=[
            "document injection", "web page poisoning", "email injection",
            "RAG poisoning", "tool result manipulation",
        ],
        explored_depth="moderate",
        last_major_finding="2024 - Cross-plugin injection in agent systems",
        open_questions=[
            "Can indirect injection work through non-text modalities (audio, video)?",
            "How do retrieval mechanisms affect injection success rates?",
            "Can injections persist across conversation sessions?",
        ],
    ),
    AttackCategory(
        name="Multi-Turn Attacks",
        known_techniques=["crescendo", "PAIR", "TAP", "context accumulation"],
        explored_depth="moderate",
        last_major_finding="2024 - Crescendo attacks via gradual escalation",
        open_questions=[
            "What is the minimum number of turns needed for reliable bypass?",
            "Can multi-turn attacks be made fully automated without human creativity?",
            "How does conversation length affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Multimodal Attacks",
        known_techniques=[
            "text-in-image injection", "adversarial patches",
            "audio adversarial examples",
        ],
        explored_depth="shallow",
        last_major_finding="2024 - Cross-modal injection via embedded text",
        open_questions=[
            "Can video inputs carry injection payloads across frames?",
            "How do different vision architectures affect injection success?",
            "Can adversarial audio be made imperceptible to humans?",
        ],
    ),
    AttackCategory(
        name="Reasoning Exploitation",
        known_techniques=[
            "chain-of-thought manipulation", "self-consistency attacks",
            "reasoning chain injection",
        ],
        explored_depth="shallow",
        last_major_finding="2024 - Reasoning models showing novel failure modes",
        open_questions=[
            "Do reasoning models (o1, DeepSeek-R1) have unique vulnerabilities?",
            "Can hidden reasoning steps be exploited for injection?",
            "How does extended thinking affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Model Architecture Exploitation",
        known_techniques=[
            "attention pattern manipulation", "positional encoding attacks",
            "context window boundary effects",
        ],
        explored_depth="shallow",
        last_major_finding="2023 - Adversarial suffix optimization (GCG)",
        open_questions=[
            "How do different attention mechanisms affect vulnerability profiles?",
            "Can KV-cache manipulation create persistent backdoors?",
            "Are there architectural features that make models inherently more resistant?",
        ],
    ),
]

if __name__ == "__main__":
    print("=== AI Attack Landscape Survey ===\n")

    # Sort by exploration depth (prioritize unexplored areas)
    depth_order = {"unexplored": 0, "shallow": 1, "moderate": 2, "deep": 3}
    sorted_landscape = sorted(LANDSCAPE, key=lambda c: depth_order[c.explored_depth])

    for cat in sorted_landscape:
        print(f"[{cat.explored_depth.upper():>10}] {cat.name}")
        print(f"  Known techniques: {len(cat.known_techniques)}")
        print(f"  Last major finding: {cat.last_major_finding}")
        print(f"  Open questions: {len(cat.open_questions)}")
        for q in cat.open_questions:
            print(f"    ? {q}")
        print()

    # Identify highest-priority research targets
    print("=== Research Priority Targets ===")
    print("Areas with 'shallow' or 'unexplored' depth and many open questions:\n")
    for cat in sorted_landscape:
        if cat.explored_depth in ("shallow", "unexplored"):
            print(f"  HIGH PRIORITY: {cat.name}")
            print(f"    {cat.open_questions[0]}")
```

```bash
python attack_landscape.py
```

Generate and Prioritize Hypotheses
Form testable hypotheses about potential vulnerabilities in under-explored areas.
```python
#!/usr/bin/env python3
"""Generate and prioritize research hypotheses for novel vulnerability discovery."""
from dataclasses import dataclass


@dataclass
class Hypothesis:
    id: str
    statement: str
    category: str
    testable: bool
    falsifiable: bool
    predicted_impact: str  # "low", "medium", "high", "critical"
    required_resources: list[str]
    experiment_outline: str


HYPOTHESES = [
    Hypothesis(
        id="H-001",
        statement=(
            "Reasoning models (o1-style) are more vulnerable to injection during "
            "their extended thinking phase because safety checks are optimized for "
            "the final output, not intermediate reasoning steps."
        ),
        category="reasoning_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Access to reasoning model API", "Attack payload set"],
        experiment_outline=(
            "1. Craft injection payloads targeting reasoning steps\n"
            "2. Test on reasoning models vs standard models\n"
            "3. Compare injection success rates\n"
            "4. Analyze whether successful injections appear in reasoning vs output"
        ),
    ),
    Hypothesis(
        id="H-002",
        statement=(
            "Long-context models (100k+ tokens) have degraded safety enforcement "
            "for instructions that appear in the middle of very long contexts, due "
            "to attention distribution patterns."
        ),
        category="architecture_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Long-context model API", "Token counting tools"],
        experiment_outline=(
            "1. Create contexts of varying lengths (10k, 50k, 100k tokens)\n"
            "2. Place identical injection at different positions\n"
            "3. Measure success rate by position and context length\n"
            "4. Control for content type and injection format"
        ),
    ),
    Hypothesis(
        id="H-003",
        statement=(
            "Structured output modes (JSON, function calling) have weaker safety "
            "enforcement than free-text generation because safety training focused "
            "on natural language responses."
        ),
        category="output_mode_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="medium",
        required_resources=["Models with structured output support"],
        experiment_outline=(
            "1. Test identical prompts in free-text vs JSON vs function call modes\n"
            "2. Measure safety refusal rates across output modes\n"
            "3. Compare content of refusals vs compliant responses\n"
            "4. Test across multiple model providers"
        ),
    ),
]

if __name__ == "__main__":
    print("=== Research Hypotheses ===\n")

    # Prioritize by predicted impact, then by testability and falsifiability
    prioritized = sorted(
        HYPOTHESES,
        key=lambda h: (
            {"critical": 0, "high": 1, "medium": 2, "low": 3}[h.predicted_impact],
            0 if h.testable and h.falsifiable else 1,
        ),
    )

    for h in prioritized:
        print(f"[{h.id}] [{h.predicted_impact.upper():>8}] {h.category}")
        print(f"  Statement: {h.statement}")
        print(f"  Testable: {h.testable}, Falsifiable: {h.falsifiable}")
        print(f"  Resources: {', '.join(h.required_resources)}")
        print(f"  Experiment:\n{h.experiment_outline}")
        print()
```

```bash
python hypothesis_generator.py
```

Design Controlled Experiments
Create rigorous experimental designs with proper controls and statistical plans.
```python
#!/usr/bin/env python3
"""Design controlled experiments for vulnerability research."""
from dataclasses import dataclass


@dataclass
class ExperimentDesign:
    hypothesis_id: str
    name: str
    independent_variable: str
    dependent_variable: str
    control_condition: str
    experimental_conditions: list[str]
    sample_size_per_condition: int
    confound_controls: list[str]
    success_criteria: str
    statistical_test: str
    significance_threshold: float = 0.05

    def display(self):
        print(f"Experiment: {self.name}")
        print(f"Hypothesis: {self.hypothesis_id}")
        print(f"IV: {self.independent_variable}")
        print(f"DV: {self.dependent_variable}")
        print(f"Control: {self.control_condition}")
        print(f"Conditions: {self.experimental_conditions}")
        print(f"N per condition: {self.sample_size_per_condition}")
        print(f"Confound controls: {self.confound_controls}")
        print(f"Success criteria: {self.success_criteria}")
        print(f"Statistical test: {self.statistical_test} "
              f"(p < {self.significance_threshold})")


experiment_h002 = ExperimentDesign(
    hypothesis_id="H-002",
    name="Long-Context Safety Degradation",
    independent_variable="Context length (tokens) and injection position",
    dependent_variable="Injection success rate (binary: complied or refused)",
    control_condition="Short context (1k tokens) with injection at the beginning",
    experimental_conditions=[
        "10k tokens, injection at position 25%",
        "10k tokens, injection at position 50%",
        "10k tokens, injection at position 75%",
        "50k tokens, injection at position 25%",
        "50k tokens, injection at position 50%",
        "50k tokens, injection at position 75%",
        "100k tokens, injection at position 25%",
        "100k tokens, injection at position 50%",
        "100k tokens, injection at position 75%",
    ],
    sample_size_per_condition=50,
    confound_controls=[
        "Use identical injection payload across all conditions",
        "Use the same filler text type (news articles) across all lengths",
        "Randomize filler text content to prevent content-specific effects",
        "Run against same model version on same day",
        "Use temperature=0 for reproducibility",
    ],
    success_criteria=(
        "Statistically significant difference (p < 0.05) in success rate "
        "between short and long context conditions, with success rate "
        "increasing as context length increases"
    ),
    statistical_test="Chi-squared test for independence with Bonferroni correction",
)

if __name__ == "__main__":
    print("=== Experiment Design ===\n")
    experiment_h002.display()
    total = ((len(experiment_h002.experimental_conditions) + 1)
             * experiment_h002.sample_size_per_condition)
    print(f"\nTotal tests needed: {total}")
```

```bash
python experiment_design.py
```

Execute and Validate
Run the experiment with proper controls and statistical analysis.
```python
#!/usr/bin/env python3
"""Execute the experiment and perform statistical validation."""
from scipy.stats import chi2_contingency

# After running 50 trials per condition, analyze results.
# Example results (replace with actual data):
results = {
    "1k_start": {"success": 5, "total": 50},
    "10k_25pct": {"success": 8, "total": 50},
    "10k_50pct": {"success": 12, "total": 50},
    "10k_75pct": {"success": 7, "total": 50},
    "50k_25pct": {"success": 15, "total": 50},
    "50k_50pct": {"success": 22, "total": 50},
    "50k_75pct": {"success": 11, "total": 50},
    "100k_25pct": {"success": 18, "total": 50},
    "100k_50pct": {"success": 28, "total": 50},
    "100k_75pct": {"success": 14, "total": 50},
}

# Statistical test: control vs each condition. The design specifies a
# Bonferroni correction, so divide alpha by the number of comparisons
# (nine conditions against one control).
control = results["1k_start"]
alpha = 0.05 / (len(results) - 1)

for name, data in results.items():
    if name == "1k_start":
        continue
    table = [
        [control["success"], control["total"] - control["success"]],
        [data["success"], data["total"] - data["success"]],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    sig = "*" if p_value < alpha else ""
    rate = data["success"] / data["total"]
    print(f"{name:<15} rate={rate:.0%} p={p_value:.4f} {sig}")
```

Document and Disclose
Write a publication-quality finding report and follow responsible disclosure procedures.
```python
# Disclosure report structure:
#   1. Title and abstract
#   2. Affected models and versions
#   3. Vulnerability description
#   4. Reproduction steps (exact prompts, configurations, expected results)
#   5. Statistical evidence (test results, confidence intervals)
#   6. Impact assessment
#   7. Suggested mitigations
#   8. Disclosure timeline
#
# Disclosure process:
#   1. Report to vendor security team (90-day disclosure window)
#   2. Collaborate on mitigation development
#   3. Verify fix effectiveness
#   4. Publish after vendor confirmation or disclosure window expiry
```
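The 90-day window above is a deadline you should track mechanically, not by memory. A minimal sketch (the function name and window default are ours, illustrating the convention rather than any vendor's actual policy):

```python
# Compute the earliest date public disclosure is permitted under a
# standard 90-day coordinated-disclosure window.
from datetime import date, timedelta


def disclosure_deadline(reported: date, window_days: int = 90) -> date:
    """Earliest public-disclosure date absent vendor agreement to extend."""
    return reported + timedelta(days=window_days)


print(disclosure_deadline(date(2025, 1, 15)))  # 2025-04-15
```

If the vendor ships and confirms a fix earlier, publication can move up; if they request an extension, record the new agreed date in the report's disclosure timeline.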
Troubleshooting
| Issue | Solution |
|---|---|
| Hypothesis is not falsifiable | Reframe it as a specific, measurable prediction with clear pass/fail criteria |
| Sample size is insufficient | Use power analysis to determine minimum samples; 50+ per condition is a good baseline |
| Results are inconsistent | Check for confounding variables; ensure temperature=0 and same model version |
| Finding does not reproduce on other models | Document as model-specific; it is still valuable even if not universal |
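The power-analysis advice in the table can be done with the standard two-proportion approximation using only the standard library. This is a sketch with illustrative rates (10% control success vs. 30% treatment), not values from any real experiment:

```python
# Approximate per-group sample size for a two-proportion comparison,
# using Cohen's effect size h and normal-approximation z-values.
from math import asin, ceil, sqrt
from statistics import NormalDist


def min_n_per_group(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Minimum samples per condition to detect the rate difference."""
    h = abs(2 * asin(sqrt(p_treatment)) - 2 * asin(sqrt(p_control)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / h) ** 2)


print(min_n_per_group(0.10, 0.30))  # 30
```

Note how quickly the requirement grows for subtle effects: detecting a 10% vs. 15% difference needs far more than 50 trials per condition, which is why the baseline in the table is only a starting point.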
Why This Matters
The difference between an anecdote ("this prompt worked once") and a vulnerability finding is exactly the methodology in this lab: a falsifiable hypothesis, a controlled experiment with adequate sample sizes, statistical validation, and coordinated disclosure. Teams that skip these steps produce findings that vendors cannot reproduce and therefore cannot fix.
Related Topics
- Novel Jailbreak Research - Applying this methodology to jailbreak discovery
- Safety Benchmark - Building standardized evaluation frameworks
- Adversarial Suffix - Example of rigorous vulnerability research
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - Exemplary vulnerability research methodology
- "Red Teaming Language Models to Reduce Harms" - Ganguli et al. (2022) - Structured red teaming research framework
- "Responsible AI Vulnerability Disclosure" - Partnership on AI (2024) - Disclosure best practices
Why is falsifiability essential for a good vulnerability research hypothesis?
Why must vulnerability research control for confounding variables?