Capstone: Vulnerability Research Project
Identify and responsibly disclose a novel AI vulnerability class, producing an advisory-format writeup, proof of concept, and MITRE ATLAS mapping.
Overview
Vulnerability research is the practice of discovering previously unknown security issues through systematic analysis, creative experimentation, and deep understanding of system internals. In the AI security domain, this means finding new ways that language models, agent systems, or AI pipelines can be exploited beyond the known categories documented in OWASP Top 10 for LLMs or MITRE ATLAS.
This capstone challenges you to go beyond applying known techniques and discover something genuinely new — or at least a novel variation significant enough to warrant its own advisory. The deliverable is a professional security advisory suitable for responsible disclosure to an AI provider or publication in a security venue.
Prerequisites
- Prompt Injection — Deep understanding of injection mechanics to identify novel variants
- Jailbreaking Techniques — Familiarity with known bypasses to avoid rediscovery
- Advanced LLM Internals — Understanding model architecture enables mechanistic reasoning about vulnerabilities
- Agentic Exploitation — Agent and tool calling attack surfaces
- AI Threat Landscape — Current state of known vulnerabilities
- Strong analytical thinking and patience — vulnerability research involves many dead ends
Project Brief
Scenario
You are a security researcher at an AI safety lab. Your team's mission is to discover new vulnerability classes in AI systems before malicious actors do. You have been given a research sprint to investigate a novel attack surface, develop a proof of concept, and produce a publishable advisory.
Research Directions
Choose one of the following research areas (or propose your own with justification):
- Cross-model attack transfer — Can attacks designed for one model architecture reliably transfer to others? What properties make an attack transferable?
- Multimodal injection vectors — How can images, audio, or structured data formats be used as prompt injection carriers in multimodal systems?
- Agent memory manipulation — Can an attacker corrupt, poison, or exfiltrate the persistent memory of a conversational agent across sessions?
- Tool schema exploitation — Are there systematic ways to manipulate function calling behavior through crafted tool descriptions or parameter schemas?
- Fine-tuning residual extraction — Can fine-tuning data be extracted from a fine-tuned model even when the provider claims data is not retained?
- Reasoning chain manipulation — Can chain-of-thought or reasoning traces be manipulated to produce incorrect conclusions that appear logically sound?
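As a concrete illustration of the multimodal-carrier direction, the sketch below builds a minimal PNG with only the standard library and embeds an arbitrary string in a tEXt metadata chunk. The payload string is a placeholder; whether a given multimodal pipeline ever surfaces image metadata to the model is exactly the kind of hypothesis this capstone asks you to test.

```python
import struct
import zlib

def png_chunk(ctype: bytes, data: bytes) -> bytes:
    # PNG chunk layout: length, type, data, CRC32 over type||data
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

def minimal_png() -> bytes:
    # Smallest useful image: 1x1, 8-bit grayscale
    sig = b"\x89PNG\r\n\x1a\n"
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (sig + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", idat) + png_chunk(b"IEND", b""))

def embed_text(png: bytes, keyword: bytes, text: bytes) -> bytes:
    # Insert a tEXt chunk immediately after IHDR
    # (signature 8 bytes + IHDR chunk 25 bytes = offset 33)
    ihdr_end = 33
    return png[:ihdr_end] + png_chunk(b"tEXt", keyword + b"\x00" + text) + png[ihdr_end:]
```

The resulting file is a valid PNG to any standards-compliant viewer, which is what makes metadata an interesting carrier to probe.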
Research Methodology
1. Literature review: What is already known?
2. Hypothesis formation: What might be possible?
3. Experiment design: How to test the hypothesis?
4. Controlled testing: Execute experiments
5. Analysis: Characterize the vulnerability
6. Validation: Confirm across conditions
7. Advisory: Document for disclosure
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Security advisory | Professional advisory document (3-5 pages) | 35% |
| Proof of concept | Working PoC with reproduction instructions | 25% |
| MITRE ATLAS mapping | Classification within ATLAS framework (or proposal for new technique) | 10% |
| Research log | Documented research process including failed approaches | 15% |
| Responsible disclosure plan | How you would disclose this to affected parties | 15% |
Rubric Criteria
- Novelty (25%) — The finding represents a genuinely new insight, not a minor variation of a well-documented technique
- Technical Depth (20%) — The advisory demonstrates deep understanding of why the vulnerability exists and its root cause
- PoC Quality (20%) — The proof of concept is reliable, minimal, and clearly demonstrates the vulnerability without unnecessary complexity
- Advisory Writing (20%) — The advisory is clear, well-structured, and suitable for a professional audience
- Responsible Practices (15%) — The research was conducted ethically and the disclosure plan is thoughtful
Phased Approach
Phase 1: Literature Review and Hypothesis (3 hours)
Survey existing research
Review the current MITRE ATLAS matrix, OWASP Top 10 for LLMs, and recent AI security publications (arXiv, security conferences). Build a map of what is already known in your chosen research area. Identify the frontier — where does documented knowledge end?
Identify research gaps
Look for areas where the documentation says "this might be possible" but no one has demonstrated it, or where known techniques have not been tested against newer model architectures or deployment patterns.
Formulate testable hypotheses
Write 2-3 specific, testable hypotheses. A good hypothesis is falsifiable and specific: "Model X will follow instructions embedded in the alt-text of an image when processing multimodal input" is better than "multimodal models might have security issues."
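One lightweight way to enforce that discipline is to refuse to record a hypothesis that lacks a falsifier. A sketch, where the field names and example values are my own:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str   # what you claim is possible
    prediction: str  # observable outcome if the claim is true
    falsifier: str   # observable outcome that would refute it
    target: str      # model or system under test

h = Hypothesis(
    statement="Model X follows instructions embedded in image alt-text",
    prediction="Output contains the marker the alt-text instruction requests",
    falsifier="Marker absent across 20 trials at temperature 0",
    target="Model X (multimodal endpoint)",
)
```

If you cannot fill in the falsifier field, the hypothesis is not yet specific enough to test.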
Phase 2: Experiment Design and Execution (6 hours)
Design controlled experiments
For each hypothesis, design an experiment with clear success criteria, control conditions, and measurable outcomes. Define what constitutes a positive result vs. a negative result before running any tests.
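Pre-registration can be as simple as encoding the success criterion and a matched control condition in a data structure before any calls are made. A hedged sketch, where `run` stands in for whatever model call you use:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    treatment_prompt: str  # input containing the candidate attack
    control_prompt: str    # identical input minus the attack component
    trials: int
    is_success: Callable[[str], bool]  # decided BEFORE any test runs

def evaluate(exp: Experiment, run: Callable[[str], str]) -> dict:
    # A positive result requires the treatment to succeed where the
    # matched control does not, across repeated trials.
    t = sum(exp.is_success(run(exp.treatment_prompt)) for _ in range(exp.trials))
    c = sum(exp.is_success(run(exp.control_prompt)) for _ in range(exp.trials))
    return {"treatment_rate": t / exp.trials, "control_rate": c / exp.trials}
```

A high treatment rate with a matching control rate is a signal that your success criterion, not the attack, is doing the work.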
Set up the test environment
Deploy the target system(s) in a controlled environment. If testing against API-hosted models, ensure you are within terms of service. Document the exact versions, configurations, and conditions of your testing.
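A small manifest helper makes "document the exact versions, configurations, and conditions" automatic rather than aspirational; the field names here are illustrative:

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def environment_manifest(model_id: str, model_version: str, config: dict) -> dict:
    # Hashing the canonicalized config lets a later run prove it used
    # the same settings as the original experiment.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "model_id": model_id,
        "model_version": model_version,
        "config": config,
        "config_sha256": hashlib.sha256(blob).hexdigest(),
        "python": platform.python_version(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
```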
Execute experiments systematically
Run experiments methodically, recording every input, output, and observation. When you find something interesting, resist the urge to immediately pivot — complete your planned experiments first, then investigate anomalies.
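An append-only JSONL log is a simple way to guarantee that every input, output, and observation survives, including the runs that looked boring at the time. A minimal sketch:

```python
import json
from datetime import datetime, timezone

def log_run(path: str, prompt: str, output: str, note: str = "") -> None:
    # Append-only JSONL: one record per model call, nothing discarded.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        "note": note,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```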
Iterate on promising results
When an experiment shows positive results, vary the conditions to understand the boundaries: which models are affected? Does it require specific configurations? What defenses block it? What is the minimum viable exploit?
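Boundary mapping is naturally a grid sweep over conditions. A sketch, where `run_trial` is your own PoC attempt returning True on success (its signature is an assumption):

```python
from itertools import product

def boundary_sweep(run_trial, models, temperatures, trials: int = 10) -> dict:
    # Success rate per (model, temperature) cell; extend the grid with
    # whatever conditions matter for your hypothesis.
    results = {}
    for m, t in product(models, temperatures):
        hits = sum(run_trial(model=m, temperature=t) for _ in range(trials))
        results[(m, t)] = hits / trials
    return results
```

Cells that drop to zero tell you as much about the vulnerability's preconditions as the cells that succeed.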
Phase 3: Analysis and Characterization (3 hours)
Characterize the vulnerability
Define the vulnerability precisely: what is the root cause, what are the preconditions, what is the impact, and what is the scope of affected systems? Distinguish between the specific instance you found and the general vulnerability class it represents.
Assess impact and severity
Evaluate the real-world impact: who is affected, what can an attacker achieve, what are the prerequisites for exploitation? Use CVSS or a similar framework adapted for AI systems. Be honest about limitations — a vulnerability that requires unlikely preconditions is less severe.
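If you use CVSS v3.1, the base score is fully mechanical once the vector is chosen. A sketch of the scope-unchanged case, with metric weights taken from the FIRST specification (AI-specific adaptations change which vector you argue for, not this arithmetic):

```python
# CVSS v3.1 base-score weights, scope unchanged
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}             # Confidentiality/Integrity/Availability

def roundup(x: float) -> float:
    # Spec-defined rounding to one decimal, avoiding float artifacts
    i = int(round(x * 100000))
    return i / 100000.0 if i % 10000 == 0 else (i // 10000 + 1) / 10.0

def base_score(av: str, ac: str, pr: str, ui: str, c: str, i: str, a: str) -> float:
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))
```

For example, a local, low-complexity attack needing low privileges and no user interaction, with high impact across the board, scores 7.8 (High), which is the kind of justification an advisory's severity line should carry.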
Map to frameworks
Classify your finding within MITRE ATLAS. If it fits an existing technique, identify which one and explain why. If it represents a genuinely new technique, draft a proposed ATLAS entry including tactic, technique description, and procedure examples.
Phase 4: Advisory and Disclosure (4 hours)
Write the security advisory
Produce a professional advisory following standard format: title, summary, affected systems, description, impact, proof of concept, mitigation, and references. Write for a technical audience at the affected vendor.
Develop the proof of concept
Create a clean, minimal PoC that demonstrates the vulnerability. Include clear reproduction steps. Remove any unnecessary complexity. The PoC should work reliably and be safe to run (no destructive operations).
Draft the disclosure plan
Outline how you would responsibly disclose this finding: who to contact, what timeline to propose, what information to share initially vs. after a fix, and how to handle the scenario where the vendor does not respond.
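The date arithmetic of a disclosure plan is worth writing down explicitly. Ninety days is a widely used industry norm rather than a rule, and the intermediate checkpoints here are suggestions:

```python
from datetime import date, timedelta

def disclosure_timeline(reported: date, deadline_days: int = 90) -> dict:
    # Checkpoints anchored to the vendor notification date; adjust the
    # intervals to match the policy you actually commit to.
    return {
        "vendor_notified": reported,
        "follow_up_if_silent": reported + timedelta(days=14),
        "status_check": reported + timedelta(days=45),
        "planned_publication": reported + timedelta(days=deadline_days),
    }
```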
Example Output
Example Advisory Format
# Security Advisory: [Vulnerability Title]
**Advisory ID:** AIRT-2026-001
**Date:** 2026-03-15
**Severity:** High (CVSS 7.8)
**Affected Systems:** [Models/platforms affected]
**Status:** [Disclosed / Vendor notified / Fixed]
## Summary
[2-3 sentence description of the vulnerability and its impact]
## Affected Systems
- Model A (version X.Y) — Confirmed
- Model B (version X.Y) — Confirmed
- Model C (version X.Y) — Not affected
## Description
[Detailed technical description of the vulnerability, its root cause,
and the conditions required for exploitation]
## Impact
[What an attacker can achieve, affected users/deployments, severity
justification]
## Proof of Concept
[Step-by-step reproduction instructions with exact inputs and expected
outputs]
## Mitigation
[Recommended fixes for vendors and workarounds for users]
## Timeline
- 2026-02-15: Vulnerability discovered
- 2026-02-20: Vendor notified via security@vendor.com
- 2026-03-01: Vendor acknowledged receipt
- 2026-03-15: Advisory published (coordinated with the vendor, ahead of the 90-day disclosure deadline)
## References
[Related work, prior art, framework mappings]
Example ATLAS Mapping
## MITRE ATLAS Classification
**Tactic:** [Tactic name] (AML.TAXXXX)
**Technique:** [Existing or proposed technique ID]
**Justification:**
This vulnerability falls under [tactic] because the attacker
manipulates [component] during [phase] to achieve [outcome].
If this is a novel technique, the proposed entry would be:
**Proposed Technique: AML.TXXXX — [Technique Name]**
- Description: [What the technique does]
- Subtechniques:
- .001 — [Variant A]
- .002 — [Variant B]
- Mitigations:
- AML.MXXXX — [Mitigation description]
- Procedure examples:
- [How this technique was applied in the PoC]
Hints
Why is it important to distinguish between a specific vulnerability instance and the general vulnerability class it represents?