Lab: Vulnerability Research Methodology
Systematic methodology lab for discovering novel AI vulnerabilities including hypothesis generation, attack surface mapping, experimental design, validation protocols, and responsible disclosure.
Prerequisites
- Completion of at least five advanced labs
- Strong understanding of transformer architectures and attention mechanisms
- Familiarity with the full taxonomy of known LLM attacks (injection, jailbreak, extraction, evasion)
- Publication-quality technical writing skills
- Understanding of responsible disclosure practices
Background
Finding a novel AI vulnerability is fundamentally different from applying known techniques. Known attacks have documented payloads and expected results. Novel research requires you to identify unexplored regions of the attack surface, form hypotheses about why they might be vulnerable, design experiments to test those hypotheses, and rigorously validate any findings before disclosure.
Lab Exercises
Survey the Known Attack Landscape
Before you can find something new, you must thoroughly understand what is already known.
```python
#!/usr/bin/env python3
"""Map the known AI attack landscape and identify research gaps."""
from dataclasses import dataclass, field


@dataclass
class AttackCategory:
    name: str
    known_techniques: list[str]
    explored_depth: str  # "deep", "moderate", "shallow", "unexplored"
    last_major_finding: str
    open_questions: list[str] = field(default_factory=list)


LANDSCAPE = [
    AttackCategory(
        name="Direct Prompt Injection",
        known_techniques=[
            "DAN", "AIM", "persona adoption", "authority claims",
            "delimiter injection", "instruction override",
        ],
        explored_depth="deep",
        last_major_finding="2024 - Many-shot jailbreaking via long context",
        open_questions=[
            "Are there universal injection patterns that work across all models?",
            "How does instruction hierarchy differ between model families?",
            "Can injection be made undetectable by output classifiers?",
        ],
    ),
    AttackCategory(
        name="Indirect Prompt Injection",
        known_techniques=[
            "document injection", "web page poisoning", "email injection",
            "RAG poisoning", "tool result manipulation",
        ],
        explored_depth="moderate",
        last_major_finding="2024 - Cross-plugin injection in agent systems",
        open_questions=[
            "Can indirect injection work through non-text modalities (audio, video)?",
            "How do retrieval mechanisms affect injection success rates?",
            "Can injections persist across conversation sessions?",
        ],
    ),
    AttackCategory(
        name="Multi-Turn Attacks",
        known_techniques=["crescendo", "PAIR", "TAP", "context accumulation"],
        explored_depth="moderate",
        last_major_finding="2024 - Crescendo attacks via gradual escalation",
        open_questions=[
            "What is the minimum number of turns needed for reliable bypass?",
            "Can multi-turn attacks be made fully automated without human creativity?",
            "How does conversation length affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Multimodal Attacks",
        known_techniques=[
            "text-in-image injection", "adversarial patches",
            "audio adversarial examples",
        ],
        explored_depth="shallow",
        last_major_finding="2024 - Cross-modal injection via embedded text",
        open_questions=[
            "Can video inputs carry injection payloads across frames?",
            "How do different vision architectures affect injection success?",
            "Can adversarial audio be made imperceptible to humans?",
        ],
    ),
    AttackCategory(
        name="Reasoning Exploitation",
        known_techniques=[
            "chain-of-thought manipulation", "self-consistency attacks",
            "reasoning chain injection",
        ],
        explored_depth="shallow",
        last_major_finding="2024 - Reasoning models showing novel failure modes",
        open_questions=[
            "Do reasoning models (o1, DeepSeek-R1) have unique vulnerabilities?",
            "Can hidden reasoning steps be exploited for injection?",
            "How does extended thinking affect safety boundary stability?",
        ],
    ),
    AttackCategory(
        name="Model Architecture Exploitation",
        known_techniques=[
            "attention pattern manipulation", "positional encoding attacks",
            "context window boundary effects",
        ],
        explored_depth="shallow",
        last_major_finding="2023 - Adversarial suffix optimization (GCG)",
        open_questions=[
            "How do different attention mechanisms affect vulnerability profiles?",
            "Can KV-cache manipulation create persistent backdoors?",
            "Are there architectural features that make models inherently more resistant?",
        ],
    ),
]

if __name__ == "__main__":
    print("=== AI Attack Landscape Survey ===\n")

    # Sort by exploration depth (prioritize unexplored areas)
    depth_order = {"unexplored": 0, "shallow": 1, "moderate": 2, "deep": 3}
    sorted_landscape = sorted(LANDSCAPE, key=lambda c: depth_order[c.explored_depth])

    for cat in sorted_landscape:
        print(f"[{cat.explored_depth.upper():>10}] {cat.name}")
        print(f"  Known techniques: {len(cat.known_techniques)}")
        print(f"  Last major finding: {cat.last_major_finding}")
        print(f"  Open questions: {len(cat.open_questions)}")
        for q in cat.open_questions:
            print(f"    ? {q}")
        print()

    # Identify highest-priority research targets
    print("=== Research Priority Targets ===")
    print("Areas with 'shallow' or 'unexplored' depth and many open questions:\n")
    for cat in sorted_landscape:
        if cat.explored_depth in ("shallow", "unexplored"):
            print(f"  HIGH PRIORITY: {cat.name}")
            print(f"    {cat.open_questions[0]}")
```

```bash
python attack_landscape.py
```

Generate and Prioritize Hypotheses
Form testable hypotheses about potential vulnerabilities in under-explored areas.
```python
#!/usr/bin/env python3
"""Generate and prioritize research hypotheses for novel vulnerability discovery."""
from dataclasses import dataclass


@dataclass
class Hypothesis:
    id: str
    statement: str
    category: str
    testable: bool
    falsifiable: bool
    predicted_impact: str  # "low", "medium", "high", "critical"
    required_resources: list[str]
    experiment_outline: str


HYPOTHESES = [
    Hypothesis(
        id="H-001",
        statement=(
            "Reasoning models (o1-style) are more vulnerable to injection during "
            "their extended thinking phase because safety checks are optimized for "
            "the final output, not intermediate reasoning steps."
        ),
        category="reasoning_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Access to reasoning model API", "Attack payload set"],
        experiment_outline=(
            "1. Craft injection payloads targeting reasoning steps\n"
            "2. Test on reasoning models vs standard models\n"
            "3. Compare injection success rates\n"
            "4. Analyze whether successful injections appear in reasoning vs output"
        ),
    ),
    Hypothesis(
        id="H-002",
        statement=(
            "Long-context models (100k+ tokens) have degraded safety enforcement "
            "for instructions that appear in the middle of very long contexts, due "
            "to attention distribution patterns."
        ),
        category="architecture_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="high",
        required_resources=["Long-context model API", "Token counting tools"],
        experiment_outline=(
            "1. Create contexts of varying lengths (10k, 50k, 100k tokens)\n"
            "2. Place identical injection at different positions\n"
            "3. Measure success rate by position and context length\n"
            "4. Control for content type and injection format"
        ),
    ),
    Hypothesis(
        id="H-003",
        statement=(
            "Structured output modes (JSON, function calling) have weaker safety "
            "enforcement than free-text generation because safety training focused "
            "on natural language responses."
        ),
        category="output_mode_exploitation",
        testable=True,
        falsifiable=True,
        predicted_impact="medium",
        required_resources=["Models with structured output support"],
        experiment_outline=(
            "1. Test identical prompts in free-text vs JSON vs function call modes\n"
            "2. Measure safety refusal rates across output modes\n"
            "3. Compare content of refusals vs compliant responses\n"
            "4. Test across multiple model providers"
        ),
    ),
]

if __name__ == "__main__":
    print("=== Research Hypotheses ===\n")

    # Prioritize by predicted impact, then by testability and falsifiability
    prioritized = sorted(
        HYPOTHESES,
        key=lambda h: (
            {"critical": 0, "high": 1, "medium": 2, "low": 3}[h.predicted_impact],
            0 if h.testable and h.falsifiable else 1,
        ),
    )

    for h in prioritized:
        print(f"[{h.id}] [{h.predicted_impact.upper():>8}] {h.category}")
        print(f"  Statement: {h.statement}")
        print(f"  Testable: {h.testable}, Falsifiable: {h.falsifiable}")
        print(f"  Resources: {', '.join(h.required_resources)}")
        print(f"  Experiment:\n{h.experiment_outline}")
        print()
```

```bash
python hypothesis_generator.py
```

Design Controlled Experiments
Create rigorous experimental designs with proper controls and statistical plans.
```python
#!/usr/bin/env python3
"""Design controlled experiments for vulnerability research."""
from dataclasses import dataclass


@dataclass
class ExperimentDesign:
    hypothesis_id: str
    name: str
    independent_variable: str
    dependent_variable: str
    control_condition: str
    experimental_conditions: list[str]
    sample_size_per_condition: int
    confound_controls: list[str]
    success_criteria: str
    statistical_test: str
    significance_threshold: float = 0.05

    def display(self):
        print(f"Experiment: {self.name}")
        print(f"Hypothesis: {self.hypothesis_id}")
        print(f"IV: {self.independent_variable}")
        print(f"DV: {self.dependent_variable}")
        print(f"Control: {self.control_condition}")
        print(f"Conditions: {self.experimental_conditions}")
        print(f"N per condition: {self.sample_size_per_condition}")
        print(f"Confound controls: {self.confound_controls}")
        print(f"Success criteria: {self.success_criteria}")
        print(f"Statistical test: {self.statistical_test} "
              f"(p < {self.significance_threshold})")


experiment_h002 = ExperimentDesign(
    hypothesis_id="H-002",
    name="Long-Context Safety Degradation",
    independent_variable="Context length (tokens) and injection position",
    dependent_variable="Injection success rate (binary: complied or refused)",
    control_condition="Short context (1k tokens) with injection at the beginning",
    experimental_conditions=[
        "10k tokens, injection at position 25%",
        "10k tokens, injection at position 50%",
        "10k tokens, injection at position 75%",
        "50k tokens, injection at position 25%",
        "50k tokens, injection at position 50%",
        "50k tokens, injection at position 75%",
        "100k tokens, injection at position 25%",
        "100k tokens, injection at position 50%",
        "100k tokens, injection at position 75%",
    ],
    sample_size_per_condition=50,
    confound_controls=[
        "Use identical injection payload across all conditions",
        "Use the same filler text type (news articles) across all lengths",
        "Randomize filler text content to prevent content-specific effects",
        "Run against same model version on same day",
        "Use temperature=0 for reproducibility",
    ],
    success_criteria=(
        "Statistically significant difference (p < 0.05) in success rate "
        "between short and long context conditions, with success rate "
        "increasing as context length increases"
    ),
    statistical_test="Chi-squared test for independence with Bonferroni correction",
)

if __name__ == "__main__":
    print("=== Experiment Design ===\n")
    experiment_h002.display()
    total = ((len(experiment_h002.experimental_conditions) + 1)
             * experiment_h002.sample_size_per_condition)
    print(f"\nTotal tests needed: {total}")
```

```bash
python experiment_design.py
```

Execute and Validate
Run the experiment with proper controls and statistical analysis.
```python
#!/usr/bin/env python3
"""Execute the experiment and perform statistical validation."""
from scipy.stats import chi2_contingency

# After running 50 trials per condition, analyze results.
# Example results (replace with actual data):
results = {
    "1k_start": {"success": 5, "total": 50},
    "10k_25pct": {"success": 8, "total": 50},
    "10k_50pct": {"success": 12, "total": 50},
    "10k_75pct": {"success": 7, "total": 50},
    "50k_25pct": {"success": 15, "total": 50},
    "50k_50pct": {"success": 22, "total": 50},
    "50k_75pct": {"success": 11, "total": 50},
    "100k_25pct": {"success": 18, "total": 50},
    "100k_50pct": {"success": 28, "total": 50},
    "100k_75pct": {"success": 14, "total": 50},
}

# Statistical test: control vs each condition. The design specifies a
# Bonferroni correction, so divide alpha by the number of comparisons
# (nine conditions against one control).
control = results["1k_start"]
alpha = 0.05 / (len(results) - 1)

for name, data in results.items():
    if name == "1k_start":
        continue
    table = [
        [control["success"], control["total"] - control["success"]],
        [data["success"], data["total"] - data["success"]],
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    sig = "*" if p_value < alpha else ""
    rate = data["success"] / data["total"]
    print(f"{name:<15} rate={rate:.0%} p={p_value:.4f} {sig}")
```

Document and Disclose
Write a publication-quality finding report and follow responsible disclosure procedures.
```python
# Disclosure report structure:
#   1. Title and abstract
#   2. Affected models and versions
#   3. Vulnerability description
#   4. Reproduction steps (exact prompts, configurations, expected results)
#   5. Statistical evidence (test results, confidence intervals)
#   6. Impact assessment
#   7. Suggested mitigations
#   8. Disclosure timeline
#
# Disclosure process:
#   1. Report to vendor security team (90-day disclosure window)
#   2. Collaborate on mitigation development
#   3. Verify fix effectiveness
#   4. Publish after vendor confirmation or disclosure window expiry
```
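The 90-day window above is a deadline you should track mechanically, not by memory. A minimal sketch (the function name and window default are ours, illustrating the convention rather than any vendor's actual policy):

```python
# Compute the earliest date public disclosure is permitted under a
# standard 90-day coordinated-disclosure window.
from datetime import date, timedelta


def disclosure_deadline(reported: date, window_days: int = 90) -> date:
    """Earliest public-disclosure date absent vendor agreement to extend."""
    return reported + timedelta(days=window_days)


print(disclosure_deadline(date(2025, 1, 15)))  # 2025-04-15
```

If the vendor ships and confirms a fix earlier, publication can move up; if they request an extension, record the new agreed date in the report's disclosure timeline.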
Troubleshooting
| Issue | Solution |
|---|---|
| Hypothesis is not falsifiable | Reframe it as a specific, measurable prediction with clear pass/fail criteria |
| Sample size is insufficient | Use power analysis to determine minimum samples; 50+ per condition is a good baseline |
| Results are inconsistent | Check for confounding variables; ensure temperature=0 and same model version |
| Finding does not reproduce on other models | Document as model-specific; it is still valuable even if not universal |
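The power-analysis advice in the table can be done with the standard two-proportion approximation using only the standard library. This is a sketch with illustrative rates (10% control success vs. 30% treatment), not values from any real experiment:

```python
# Approximate per-group sample size for a two-proportion comparison,
# using Cohen's effect size h and normal-approximation z-values.
from math import asin, ceil, sqrt
from statistics import NormalDist


def min_n_per_group(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Minimum samples per condition to detect the rate difference."""
    h = abs(2 * asin(sqrt(p_treatment)) - 2 * asin(sqrt(p_control)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / h) ** 2)


print(min_n_per_group(0.10, 0.30))  # 30
```

Note how quickly the requirement grows for subtle effects: detecting a 10% vs. 15% difference needs far more than 50 trials per condition, which is why the baseline in the table is only a starting point.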
Why This Matters
The difference between an anecdote ("this prompt worked once") and a vulnerability finding is exactly the methodology in this lab: a falsifiable hypothesis, a controlled experiment with adequate sample sizes, statistical validation, and coordinated disclosure. Teams that skip these steps produce findings that vendors cannot reproduce and therefore cannot fix.
Related Topics
- Novel Jailbreak Research - Applying this methodology to jailbreak discovery
- Safety Benchmark - Building standardized evaluation frameworks
- Adversarial Suffix - Example of rigorous vulnerability research
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - Exemplary vulnerability research methodology
- "Red Teaming Language Models to Reduce Harms" - Ganguli et al. (2022) - Structured red teaming research framework
- "Responsible AI Vulnerability Disclosure" - Partnership on AI (2024) - Disclosure best practices
Why is falsifiability essential for a good vulnerability research hypothesis?
Why must vulnerability research control for confounding variables?