Capstone: 漏洞 Research Project
Identify and responsibly disclose a novel AI vulnerability class, producing an advisory-format writeup, proof of concept, and MITRE ATLAS mapping.
概覽
漏洞 research is the practice of discovering previously unknown 安全 issues through systematic analysis, creative experimentation, and deep 理解 of system internals. In the AI 安全 domain, 這意味著 finding new ways that language models, 代理 systems, or AI pipelines can be exploited beyond the known categories documented in OWASP Top 10 for LLMs or MITRE ATLAS.
This capstone challenges you to go beyond applying known techniques and discover something genuinely new — or at least a novel variation significant enough to warrant its own advisory. The deliverable is a professional 安全 advisory suitable for responsible disclosure to an AI provider or publication in a 安全 venue.
先備知識
- 提示詞注入 — Deep 理解 of injection mechanics to 識別 novel variants
- Jailbreaking Techniques — Familiarity with known bypasses to avoid rediscovery
- Advanced LLM Internals — 理解 model architecture enables mechanistic reasoning about 漏洞
- 代理式 利用 — 代理 and 工具呼叫 attack surfaces
- AI Threat Landscape — Current state of known 漏洞
- Strong analytical thinking and patience — 漏洞 research involves many dead ends
Project Brief
Scenario
You are a 安全 researcher at an AI 安全 lab. Your team's mission is to discover new 漏洞 classes in AI systems before malicious actors do. You have been given a research sprint to investigate a novel 攻擊面, develop a proof of concept, and produce a publishable advisory.
Research Directions
Choose one of the following research areas (or propose your own with justification):
- Cross-model attack transfer — Can attacks designed for one model architecture reliably transfer to others? What properties make an attack transferable?
- Multimodal injection vectors — How can images, audio, or structured data formats be used as 提示詞注入 carriers in multimodal systems?
- 代理 memory manipulation — Can 攻擊者 corrupt, poison, or exfiltrate the persistent memory of a conversational 代理 across sessions?
- Tool schema 利用 — Are there systematic ways to manipulate 函式呼叫 behavior through crafted tool descriptions or parameter schemas?
- 微調 residual extraction — Can 微調 data be extracted from a fine-tuned model even when the provider claims data is not retained?
- Reasoning chain manipulation — Can chain-of-thought or reasoning traces be manipulated to produce incorrect conclusions that appear logically sound?
Research Methodology
1. Literature review: What is already known?
↓
2. Hypothesis formation: What might be possible?
↓
3. Experiment design: How to 測試 the hypothesis?
↓
4. Controlled 測試: Execute experiments
↓
5. Analysis: Characterize the 漏洞
↓
6. Validation: Confirm across conditions
↓
7. Advisory: Document for disclosure
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| 安全 advisory | Professional advisory document (3-5 pages) | 35% |
| Proof of concept | Working PoC with reproduction instructions | 25% |
| MITRE ATLAS mapping | Classification within ATLAS framework (or proposal for new technique) | 10% |
| Research log | Documented research process including failed approaches | 15% |
| Responsible disclosure plan | How you would disclose this to affected parties | 15% |
Rubric Criteria
- Novelty (25%) — The finding represents a genuinely new insight, not a minor variation of a well-documented technique
- Technical Depth (20%) — The advisory demonstrates deep 理解 of why the 漏洞 exists and its root cause
- PoC Quality (20%) — The proof of concept is reliable, minimal, and clearly demonstrates the 漏洞 without unnecessary complexity
- Advisory Writing (20%) — The advisory is clear, well-structured, and suitable for a professional audience
- Responsible Practices (15%) — The research was conducted ethically and the disclosure plan is thoughtful
Phased Approach
Phase 1: Literature Review and Hypothesis (3 hours)
Survey existing research
Review the current MITRE ATLAS matrix, OWASP Top 10 for LLMs, and recent AI 安全 publications (arXiv, 安全 conferences). Build a map of what is already known in your chosen research area. 識別 the frontier — where does documented knowledge end?
識別 research gaps
Look for areas where the documentation says "this might be possible" but no one has demonstrated it, or where known techniques have not been tested against newer model architectures or deployment patterns.
Formulate testable hypotheses
Write 2-3 specific, testable hypotheses. A good hypothesis is falsifiable and specific: "Model X will follow instructions embedded in the alt-text of an image when processing multimodal 輸入" is better than "multimodal models might have 安全 issues."
Phase 2: Experiment Design and Execution (6 hours)
Design controlled experiments
對每個 hypothesis, design an experiment with clear success criteria, control conditions, and measurable outcomes. Define what constitutes a positive result vs. a negative result before running any tests.
Set up the 測試 environment
Deploy the target system(s) in a controlled environment. If 測試 against API-hosted models, ensure you are within terms of service. Document the exact versions, configurations, and conditions of your 測試.
Execute experiments systematically
Run experiments methodically, recording every 輸入, 輸出, and observation. When you find something interesting, resist the urge to immediately pivot — complete your planned experiments first, then investigate anomalies.
Iterate on promising results
When an experiment shows positive results, vary the conditions to 理解 the boundaries: which models are affected? Does it require specific configurations? What 防禦 block it? What is the minimum viable 利用?
Phase 3: Analysis and Characterization (3 hours)
Characterize the 漏洞
Define the 漏洞 precisely: what is the root cause, what are the preconditions, what is the impact, and what is the scope of affected systems? Distinguish between the specific instance you found and the general 漏洞 class it represents.
評估 impact and severity
評估 the real-world impact: who is affected, what can 攻擊者 achieve, what are the prerequisites for 利用? Use CVSS or a similar framework adapted for AI systems. Be honest about limitations — a 漏洞 that requires unlikely preconditions is less severe.
Map to frameworks
Classify your finding within MITRE ATLAS. If it fits an existing technique, 識別 which one and explain why. If it represents a genuinely new technique, draft a proposed ATLAS entry including tactic, technique description, and procedure examples.
Phase 4: Advisory and Disclosure (4 hours)
Write the 安全 advisory
Produce a professional advisory following standard format: title, summary, affected systems, description, impact, proof of concept, 緩解, and references. Write for a technical audience at the affected vendor.
Develop the proof of concept
Create a clean, minimal PoC that demonstrates the 漏洞. Include clear reproduction steps. Remove any unnecessary complexity. The PoC should work reliably and be safe to run (no destructive operations).
Draft the disclosure plan
Outline how you would responsibly disclose this finding: who to contact, what timeline to propose, what information to share initially vs. after a fix, and how to handle the scenario where the vendor does not respond.
範例 輸出
範例 Advisory Format
# 安全 Advisory: [漏洞 Title]
**Advisory ID:** AIRT-2026-001
**Date:** 2026-03-15
**Severity:** High (CVSS 7.8)
**Affected Systems:** [Models/platforms affected]
**Status:** [Disclosed / Vendor notified / Fixed]
## 總結
[2-3 sentence description of the 漏洞 and its impact]
## Affected Systems
- Model A (version X.Y) — Confirmed
- Model B (version X.Y) — Confirmed
- Model C (version X.Y) — Not affected
## Description
[Detailed technical description of the 漏洞, its root cause,
and the conditions required for 利用]
## Impact
[What 攻擊者 can achieve, affected users/deployments, severity
justification]
## Proof of Concept
[Step-by-step reproduction instructions with exact inputs and expected
outputs]
## 緩解
[Recommended fixes for vendors and workarounds for users]
## Timeline
- 2026-02-15: 漏洞 discovered
- 2026-02-20: Vendor notified via 安全@vendor.com
- 2026-03-01: Vendor acknowledged receipt
- 2026-03-15: Advisory published (90-day disclosure deadline)
## 參考文獻
[Related work, prior art, framework mappings]範例 ATLAS Mapping
## MITRE ATLAS Classification
**Tactic:** ML 攻擊 Staging (AML.TA0002)
**Technique:** [Existing or proposed technique ID]
**Justification:**
This 漏洞 falls under [tactic] 因為 攻擊者
manipulates [component] during [phase] to achieve [outcome].
If 這是 a novel technique, the proposed entry would be:
**Proposed Technique: AML.TXXXX — [Technique Name]**
- Description: [What the technique does]
- Subtechniques:
- .001 — [Variant A]
- .002 — [Variant B]
- Mitigations:
- AML.MXXXX — [緩解 description]
- Procedure examples:
- [How this technique was applied in the PoC]Hints
Why is it important to distinguish between a specific 漏洞 instance and the general 漏洞 class it represents?