Capstone: Vulnerability Research Project
Identify and responsibly disclose a novel AI vulnerability class, producing an advisory-format writeup, proof of concept, and MITRE ATLAS mapping.
Overview
Vulnerability research is the practice of discovering previously unknown security issues through systematic analysis, creative experimentation, and deep understanding of system internals. In the AI security domain, this means finding new ways that language models, agent systems, or AI pipelines can be exploited beyond the known categories documented in OWASP Top 10 for LLMs or MITRE ATLAS.
This capstone challenges you to go beyond applying known techniques and discover something genuinely new — or at least a novel variation significant enough to warrant its own advisory. The deliverable is a professional security advisory suitable for responsible disclosure to an AI provider or publication in a security venue.
Prerequisites
- Prompt Injection — Deep understanding of injection mechanics to identify novel variants
- Jailbreaking Techniques — Familiarity with known bypasses to avoid rediscovery
- Advanced LLM Internals — Understanding model architecture enables mechanistic reasoning about vulnerabilities
- Agentic Exploitation — Agent and tool calling attack surfaces
- AI Threat Landscape — Current state of known vulnerabilities
- Strong analytical thinking and patience — vulnerability research involves many dead ends
Project Brief
Scenario
You are a security researcher at an AI safety lab. Your team's mission is to discover new vulnerability classes in AI systems before malicious actors do. You have been given a research sprint to investigate a novel attack surface, develop a proof of concept, and produce a publishable advisory.
Research Directions
Choose one of the following research areas (or propose your own with justification):
- Cross-model attack transfer — Can attacks designed for one model architecture reliably transfer to others? What properties make an attack transferable?
- Multimodal injection vectors — How can images, audio, or structured data formats be used as prompt injection carriers in multimodal systems?
- Agent memory manipulation — Can an attacker corrupt, poison, or exfiltrate the persistent memory of a conversational agent across sessions?
- Tool schema exploitation — Are there systematic ways to manipulate function calling behavior through crafted tool descriptions or parameter schemas?
- Fine-tuning residual extraction — Can fine-tuning data be extracted from a fine-tuned model even when the provider claims data is not retained?
- Reasoning chain manipulation — Can chain-of-thought or reasoning traces be manipulated to produce incorrect conclusions that appear logically sound?
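As a concrete illustration of the multimodal-carrier direction, the sketch below builds a minimal PNG with only the standard library and embeds an arbitrary string in a tEXt metadata chunk. The payload string is a placeholder; whether a given multimodal pipeline ever surfaces image metadata to the model is exactly the kind of hypothesis this capstone asks you to test.

```python
import struct
import zlib

def png_chunk(ctype: bytes, data: bytes) -> bytes:
    # PNG chunk layout: length, type, data, CRC32 over type||data
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

def minimal_png() -> bytes:
    # Smallest useful image: 1x1, 8-bit grayscale
    sig = b"\x89PNG\r\n\x1a\n"
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (sig + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", idat) + png_chunk(b"IEND", b""))

def embed_text(png: bytes, keyword: bytes, text: bytes) -> bytes:
    # Insert a tEXt chunk immediately after IHDR
    # (signature 8 bytes + IHDR chunk 25 bytes = offset 33)
    ihdr_end = 33
    return png[:ihdr_end] + png_chunk(b"tEXt", keyword + b"\x00" + text) + png[ihdr_end:]
```

The resulting file is a valid PNG to any standards-compliant viewer, which is what makes metadata an interesting carrier to probe.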
Research Methodology
1. Literature review: What is already known?
2. Hypothesis formation: What might be possible?
3. Experiment design: How to test the hypothesis?
4. Controlled testing: Execute experiments
5. Analysis: Characterize the vulnerability
6. Validation: Confirm across conditions
7. Advisory: Document for disclosure
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Security advisory | Professional advisory document (3-5 pages) | 35% |
| Proof of concept | Working PoC with reproduction instructions | 25% |
| MITRE ATLAS mapping | Classification within ATLAS framework (or proposal for new technique) | 10% |
| Research log | Documented research process including failed approaches | 15% |
| Responsible disclosure plan | How you would disclose this to affected parties | 15% |
Rubric Criteria
- Novelty (25%) — The finding represents a genuinely new insight, not a minor variation of a well-documented technique
- Technical Depth (20%) — The advisory demonstrates deep understanding of why the vulnerability exists and its root cause
- PoC Quality (20%) — The proof of concept is reliable, minimal, and clearly demonstrates the vulnerability without unnecessary complexity
- Advisory Writing (20%) — The advisory is clear, well-structured, and suitable for a professional audience
- Responsible Practices (15%) — The research was conducted ethically and the disclosure plan is thoughtful
Phased Approach
Phase 1: Literature Review and Hypothesis (3 hours)
Survey existing research
Review the current MITRE ATLAS matrix, OWASP Top 10 for LLMs, and recent AI security publications (arXiv, security conferences). Build a map of what is already known in your chosen research area. Identify the frontier — where does documented knowledge end?
Identify research gaps
Look for areas where the documentation says "this might be possible" but no one has demonstrated it, or where known techniques have not been tested against newer model architectures or deployment patterns.
Formulate testable hypotheses
Write 2-3 specific, testable hypotheses. A good hypothesis is falsifiable and specific: "Model X will follow instructions embedded in the alt-text of an image when processing multimodal input" is better than "multimodal models might have security issues."
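One lightweight way to enforce that discipline is to refuse to record a hypothesis that lacks a falsifier. A sketch, where the field names and example values are my own:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str   # what you claim is possible
    prediction: str  # observable outcome if the claim is true
    falsifier: str   # observable outcome that would refute it
    target: str      # model or system under test

h = Hypothesis(
    statement="Model X follows instructions embedded in image alt-text",
    prediction="Output contains the marker the alt-text instruction requests",
    falsifier="Marker absent across 20 trials at temperature 0",
    target="Model X (multimodal endpoint)",
)
```

If you cannot fill in the falsifier field, the hypothesis is not yet specific enough to test.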
Phase 2: Experiment Design and Execution (6 hours)
Design controlled experiments
For each hypothesis, design an experiment with clear success criteria, control conditions, and measurable outcomes. Define what constitutes a positive result vs. a negative result before running any tests.
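Pre-registration can be as simple as encoding the success criterion and a matched control condition in a data structure before any calls are made. A hedged sketch, where `run` stands in for whatever model call you use:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    treatment_prompt: str  # input containing the candidate attack
    control_prompt: str    # identical input minus the attack component
    trials: int
    is_success: Callable[[str], bool]  # decided BEFORE any test runs

def evaluate(exp: Experiment, run: Callable[[str], str]) -> dict:
    # A positive result requires the treatment to succeed where the
    # matched control does not, across repeated trials.
    t = sum(exp.is_success(run(exp.treatment_prompt)) for _ in range(exp.trials))
    c = sum(exp.is_success(run(exp.control_prompt)) for _ in range(exp.trials))
    return {"treatment_rate": t / exp.trials, "control_rate": c / exp.trials}
```

A high treatment rate with a matching control rate is a signal that your success criterion, not the attack, is doing the work.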
Set up the test environment
Deploy the target system(s) in a controlled environment. If testing against API-hosted models, ensure you are within terms of service. Document the exact versions, configurations, and conditions of your testing.
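A small manifest helper makes "document the exact versions, configurations, and conditions" automatic rather than aspirational; the field names here are illustrative:

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def environment_manifest(model_id: str, model_version: str, config: dict) -> dict:
    # Hashing the canonicalized config lets a later run prove it used
    # the same settings as the original experiment.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "model_id": model_id,
        "model_version": model_version,
        "config": config,
        "config_sha256": hashlib.sha256(blob).hexdigest(),
        "python": platform.python_version(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
```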
Execute experiments systematically
Run experiments methodically, recording every input, output, and observation. When you find something interesting, resist the urge to immediately pivot — complete your planned experiments first, then investigate anomalies.
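An append-only JSONL log is a simple way to guarantee that every input, output, and observation survives, including the runs that looked boring at the time. A minimal sketch:

```python
import json
from datetime import datetime, timezone

def log_run(path: str, prompt: str, output: str, note: str = "") -> None:
    # Append-only JSONL: one record per model call, nothing discarded.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        "note": note,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```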
Iterate on promising results
When an experiment shows positive results, vary the conditions to understand the boundaries: which models are affected? Does it require specific configurations? What defenses block it? What is the minimum viable exploit?
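Boundary mapping is naturally a grid sweep over conditions. A sketch, where `run_trial` is your own PoC attempt returning True on success (its signature is an assumption):

```python
from itertools import product

def boundary_sweep(run_trial, models, temperatures, trials: int = 10) -> dict:
    # Success rate per (model, temperature) cell; extend the grid with
    # whatever conditions matter for your hypothesis.
    results = {}
    for m, t in product(models, temperatures):
        hits = sum(run_trial(model=m, temperature=t) for _ in range(trials))
        results[(m, t)] = hits / trials
    return results
```

Cells that drop to zero tell you as much about the vulnerability's preconditions as the cells that succeed.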
Phase 3: Analysis and Characterization (3 hours)
Characterize the vulnerability
Define the vulnerability precisely: what is the root cause, what are the preconditions, what is the impact, and what is the scope of affected systems? Distinguish between the specific instance you found and the general vulnerability class it represents.
Assess impact and severity
Evaluate the real-world impact: who is affected, what can an attacker achieve, what are the prerequisites for exploitation? Use CVSS or a similar framework adapted for AI systems. Be honest about limitations — a vulnerability that requires unlikely preconditions is less severe.
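If you use CVSS v3.1, the base score is fully mechanical once the vector is chosen. A sketch of the scope-unchanged case, with metric weights taken from the FIRST specification (AI-specific adaptations change which vector you argue for, not this arithmetic):

```python
# CVSS v3.1 base-score weights, scope unchanged
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}             # Confidentiality/Integrity/Availability

def roundup(x: float) -> float:
    # Spec-defined rounding to one decimal, avoiding float artifacts
    i = int(round(x * 100000))
    return i / 100000.0 if i % 10000 == 0 else (i // 10000 + 1) / 10.0

def base_score(av: str, ac: str, pr: str, ui: str, c: str, i: str, a: str) -> float:
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))
```

For example, a local, low-complexity attack needing low privileges and no user interaction, with high impact across the board, scores 7.8 (High), which is the kind of justification an advisory's severity line should carry.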
Map to frameworks
Classify your finding within MITRE ATLAS. If it fits an existing technique, identify which one and explain why. If it represents a genuinely new technique, draft a proposed ATLAS entry including tactic, technique description, and procedure examples.
Phase 4: Advisory and Disclosure (4 hours)
Write the security advisory
Produce a professional advisory following standard format: title, summary, affected systems, description, impact, proof of concept, mitigation, and references. Write for a technical audience at the affected vendor.
Develop the proof of concept
Create a clean, minimal PoC that demonstrates the vulnerability. Include clear reproduction steps. Remove any unnecessary complexity. The PoC should work reliably and be safe to run (no destructive operations).
Draft the disclosure plan
Outline how you would responsibly disclose this finding: who to contact, what timeline to propose, what information to share initially vs. after a fix, and how to handle the scenario where the vendor does not respond.
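The date arithmetic of a disclosure plan is worth writing down explicitly. Ninety days is a widely used industry norm rather than a rule, and the intermediate checkpoints here are suggestions:

```python
from datetime import date, timedelta

def disclosure_timeline(reported: date, deadline_days: int = 90) -> dict:
    # Checkpoints anchored to the vendor notification date; adjust the
    # intervals to match the policy you actually commit to.
    return {
        "vendor_notified": reported,
        "follow_up_if_silent": reported + timedelta(days=14),
        "status_check": reported + timedelta(days=45),
        "planned_publication": reported + timedelta(days=deadline_days),
    }
```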
Example Output
Example Advisory Format
# Security Advisory: [Vulnerability Title]
**Advisory ID:** AIRT-2026-001
**Date:** 2026-03-15
**Severity:** High (CVSS 7.8)
**Affected Systems:** [Models/platforms affected]
**Status:** [Disclosed / Vendor notified / Fixed]
## Summary
[2-3 sentence description of the vulnerability and its impact]
## Affected Systems
- Model A (version X.Y) — Confirmed
- Model B (version X.Y) — Confirmed
- Model C (version X.Y) — Not affected
## Description
[Detailed technical description of the vulnerability, its root cause,
and the conditions required for exploitation]
## Impact
[What an attacker can achieve, affected users/deployments, severity
justification]
## Proof of Concept
[Step-by-step reproduction instructions with exact inputs and expected
outputs]
## Mitigation
[Recommended fixes for vendors and workarounds for users]
## Timeline
- 2026-02-15: Vulnerability discovered
- 2026-02-20: Vendor notified via security@vendor.com
- 2026-03-01: Vendor acknowledged receipt
- 2026-03-15: Advisory published (coordinated with the vendor, ahead of the 90-day disclosure deadline)
## References
[Related work, prior art, framework mappings]
Example ATLAS Mapping
## MITRE ATLAS Classification
**Tactic:** [Tactic name] (AML.TAXXXX)
**Technique:** [Existing or proposed technique ID]
**Justification:**
This vulnerability falls under [tactic] because the attacker
manipulates [component] during [phase] to achieve [outcome].
If this is a novel technique, the proposed entry would be:
**Proposed Technique: AML.TXXXX — [Technique Name]**
- Description: [What the technique does]
- Subtechniques:
- .001 — [Variant A]
- .002 — [Variant B]
- Mitigations:
- AML.MXXXX — [Mitigation description]
- Procedure examples:
- [How this technique was applied in the PoC]
Hints
Why is it important to distinguish between a specific vulnerability instance and the general vulnerability class it represents?