進階 Topics Study 指南
Study guide covering AI security research techniques, automation, forensics, emerging attack vectors, and tool development for advanced practitioners.
Advanced Topics Study Guide
This guide covers topics beyond the core curriculum: research methodology, attack automation, AI forensics, emerging threats, and tool development. It is designed for practitioners who have completed the core assessments and want to deepen their expertise.
Research Methodology
Reading AI 安全 Research
Staying current with AI 安全 research is essential. Key venues and how to extract actionable intelligence from them:
| Venue | Focus | What to Extract |
|---|---|---|
| NeurIPS / ICML / ICLR | ML research, including 對抗性 ML and 安全 | Novel attack primitives, new model architectures with 安全 implications, 對齊 research |
| USENIX 安全 / IEEE S&P / CCS | 安全 research, including AI 安全 | Practical attacks with demonstrated impact, 防禦 evaluations, real-world case studies |
| arXiv (cs.CR, cs.AI, cs.CL) | Preprints across all areas | Earliest access to new techniques, but not peer-reviewed -- read critically |
| Vendor blogs | Anthropic, OpenAI, Google DeepMind, Microsoft | Model cards, 安全 evaluations, disclosed 漏洞, capability announcements |
| 越獄 communities | Practical bypass techniques | Current working jailbreaks, 護欄 bypass patterns, model-specific quirks |
Reading strategy for practitioners:
- Read the abstract and conclusion first to determine if the paper is relevant to your work.
- Focus on the 威脅模型 -- what assumptions does the attack make? (White-box vs. black-box, access requirements, computational budget)
- Check the 評估 -- was it tested on production models or toy examples? Does it transfer across model families?
- Extract the technique -- what is the core insight that makes this attack work? Can it be adapted to your assessments?
- 評估 the 防禦 implications -- how would existing 護欄 handle this attack? What new 防禦 are needed?
Developing Novel Techniques
Moving from applying known attacks to developing new ones requires systematic thinking:
攻擊面 analysis: Start with 系統 architecture and 識別 trust boundaries. Each trust boundary is a potential injection surface. Ask: "What untrusted data reaches 模型, and what can 模型 do with its 輸出?"
Hypothesis-driven 測試: Formulate specific hypotheses about model behavior and 測試 them systematically. "I hypothesize that this model's 安全 訓練 is weaker for requests framed as academic analysis" -- then 測試 with controlled variations that isolate the variable.
Ablation studies: When an attack works, determine which components are essential. Remove elements one at a time to find the minimum viable attack. This reveals the underlying mechanism and enables generalization to other contexts.
Cross-model transfer: 測試 techniques across model families. 攻擊 that transfer from open-weight models (where you can analyze the mechanism) to closed-API models (where you cannot) are particularly valuable 因為 they suggest a general 漏洞 in 對齊 approaches.
攻擊 Automation
When to Automate
Automation is valuable for specific phases of 紅隊演練. 理解 when it helps and when it hinders is important.
| Phase | Automation Value | Tools and Approaches |
|---|---|---|
| Baseline scanning | High -- covers known attack patterns quickly | Garak, promptfoo, custom scripts against OWASP categories |
| Fuzzing / mutation | High -- generates variations faster than manual 測試 | PyRIT converters, custom mutation engines, LLM-assisted paraphrasing |
| Multi-turn attacks | Medium -- orchestration helps but strategy requires human judgment | PyRIT orchestrators, custom state machines for conversation flow |
| Novel technique development | Low -- requires creativity and intuition that automation lacks | Human-driven with tool assistance for iteration speed |
| Regression 測試 | High -- verifies that previous findings are still fixed | CI/CD integration with attack 測試 suites |
| Coverage tracking | High -- ensures systematic 測試 across categories | Custom dashboards, 測試 case management integration |
Tool Ecosystem
理解 the major tools and their strengths:
Garak (Generative AI Red-teaming and 評估 Kit):
- Probe-based architecture: probes generate attacks, detectors 評估 responses
- Good for: baseline sweeps against known 漏洞 categories
- Limitations: primarily single-turn, limited multi-step orchestration
- Extensible: custom probes and detectors can be added
PyRIT (Python Risk Identification Tool for Generative AI):
- Microsoft's orchestration framework for multi-turn attacks
- Converters: transform prompts between formats (Base64, translation, leetspeak)
- Orchestrators: manage multi-turn attack strategies
- Scorers: 評估 whether attacks succeeded
- Good for: complex, multi-turn attack scenarios with automated scoring
promptfoo:
- YAML-driven 評估 harness
- Good for: systematic 測試 against defined 測試 cases, CI/CD integration
- Plugin system enables custom attack types
- Strong in comparative 評估 (測試 the same attacks against multiple models/configurations)
Inspect AI (UK AISI):
- Task-based 評估 framework
- Composable solvers and scorers
- Good for: structured 安全 evaluations with reproducible metrics
- Designed for 評估 rather than attack automation
Building Custom Tools
When existing tools do not meet your needs, building custom automation requires:
Prompt mutation engines: Systems that take a base attack prompt and generate variations using techniques like synonym replacement, encoding changes, language translation, structural rearrangement, and LLM-assisted paraphrasing. Key design decision: how to balance diversity (generating novel variations) with focus (maintaining attack intent).
Conversation state machines: For multi-turn attacks, a state machine tracks the conversation state and selects the next prompt based on 模型's response. States might include: "establishing rapport," "introducing topic," "escalating request," "extracting target content." Transitions are triggered by classifier 評估 of 模型's response.
Result classifiers: Automated 評估 of whether an attack succeeded. This can use: keyword matching (brittle but fast), classifier models (more robust but may have false positives), LLM-as-judge (contextual but expensive and potentially jailbreakable), or rule-based scoring with human review for borderline cases.
AI Forensics
Investigating AI 安全 Incidents
When an AI 安全 incident occurs, forensic investigation follows a structured approach:
Evidence collection priorities:
| Evidence Source | What It Contains | Collection Urgency |
|---|---|---|
| Prompt/response logs | Full conversation history of the attack | Immediate -- logs may rotate or be purged |
| 護欄 decision logs | Which filters triggered or missed | Immediate -- reveals 防禦 failures |
| Tool call audit trail | What actions the compromised 代理 took | Immediate -- shows blast radius |
| Model version and config | Exact model and parameters at time of incident | Immediate -- model may be updated |
| Network logs | Data exfiltration attempts, C2 communication | High -- standard retention applies |
| Application logs | Error messages, state changes, 認證 events | High -- contextualizes the AI-specific evidence |
| 訓練資料 access logs | Whether 訓練資料 was accessed or modified | Medium -- relevant for 投毒 investigation |
| Infrastructure logs | 雲端 audit trails, deployment history | Medium -- relevant for 供應鏈 investigation |
Root Cause Analysis
AI incidents often have layered root causes. A structured approach:
- Immediate cause: What happened? (e.g., "The 代理 executed unauthorized tool calls")
- Proximate cause: How did it happen? (e.g., "A 提示詞注入 in a retrieved document hijacked the 代理's reasoning")
- Contributing causes: What enabled it? (e.g., "The 代理 had access to tools beyond its required scope; the RAG pipeline did not sanitize retrieved documents; 輸出 監控 did not flag unusual 工具呼叫 patterns")
- Systemic cause: Why was 系統 vulnerable? (e.g., "No 威脅模型 was conducted for the RAG pipeline; tool access was granted based on convenience rather than least privilege; 監控 was designed for traditional web attacks, not AI-specific patterns")
Blast Radius 評估
Determining the full impact of an AI 安全 incident:
- Data exposure: What data did 模型 have access to in its context? What was 模型 manipulated into outputting? Was PII, credentials, or proprietary data exposed?
- Action impact: What tool calls did the compromised 代理 make? What systems were accessed? Were any modifications made to databases, files, or configurations?
- Lateral movement: Did the compromised 代理 communicate with other 代理 or systems? Could 攻擊者 have pivoted through inter-代理 communication?
- Persistence: Did the attack modify any persistent state (conversation history, 知識庫, 代理 memory) that could affect future interactions?
- User impact: How many users were affected? Were user sessions compromised? Did users receive manipulated outputs that could have led to downstream harm?
Emerging 攻擊 Vectors
Areas of Active Research
The AI attack landscape evolves rapidly. These are areas where new techniques are actively being developed:
Multimodal injection: 嵌入向量 提示詞注入 payloads in images, audio, and video that are processed by multimodal models. Techniques include steganographic text in images, 對抗性 perturbations that are interpreted as instructions, and injection payloads in audio transcriptions.
微調 attacks: Compromising 安全 through targeted 微調 on small datasets. Research has shown that as few as 100 carefully crafted examples can significantly degrade a model's 安全 behavior. 這是 particularly relevant for 微調-as-a-service platforms where users upload their own 訓練資料.
Model merging 漏洞: When models are combined through weight averaging, interpolation, or other merging techniques, the 安全 properties of individual models may not be preserved. A safe model merged with an unsafe model may produce unpredictable 安全 behavior.
Reasoning model 利用: Models trained for extended reasoning (chain-of-thought, tree-of-thought) present new attack surfaces. The reasoning process itself can be manipulated, and the extended computation may amplify the effect of subtle injections.
Cross-session attacks: Exploiting persistent state across user sessions, including conversation memory, user preference learning, and shared knowledge bases. These enable attacks that persist beyond individual conversations and may affect multiple users.
供應鏈 attacks on model registries: Compromising model hosting platforms (Hugging Face, model zoos) to distribute backdoored models. This includes name squatting (publishing malicious models with names similar to popular ones), compromising contributor accounts, and exploiting model file formats that support arbitrary code execution.
Defensive Research Directions
理解 where 防禦 are headed helps red teamers anticipate what they will encounter:
Instruction hierarchy: Training models to recognize and enforce privilege levels between system, user, and tool instructions. 這是 a fundamental research direction aimed at solving 提示詞注入 at 模型 level.
Formal verification for AI systems: Applying formal methods to verify properties of AI system behavior. Still early-stage but could eventually provide provable 安全 guarantees for specific properties.
Interpretability-based 偵測: Using model interpretability tools (activation analysis, 注意力 patterns) to detect when a model is responding to an injection rather than the legitimate prompt. This approaches the problem from inside 模型 rather than from 輸入/輸出 filtering.
Confidential computing for AI: Using hardware enclaves (Intel SGX, AMD SEV, ARM CCA) to protect model 推論 from infrastructure-level attacks. Relevant for protecting model weights, 訓練資料, and 推論 inputs/outputs from 雲端 provider access.
Common Pitfalls
Key 參考文獻
| Resource | Type | Focus |
|---|---|---|
| MITRE ATLAS | 知識庫 | Adversary TTPs for AI/ML systems |
| OWASP LLM Top 10 | Risk framework | Critical LLM application risks |
| NIST AI 100-2 (對抗性 ML) | Guidelines | Taxonomy and terminology for 對抗性 ML |
| Anthropic's RSP | Policy document | Responsible scaling commitments and 安全 evaluations |
| ML 安全 course (Center for AI 安全) | Course | Comprehensive ML 安全 curriculum |
| Lakera Gandalf | CTF | Practical 提示詞注入 challenges |
| HackAPrompt | Competition | Structured 提示詞注入 competition with scoring |
Study Checklist
Before considering yourself proficient in advanced topics, confirm you can:
- Read an AI 安全 paper and extract the actionable technique, 威脅模型, and limitations
- Choose the appropriate automation tool for a given 測試 scenario and justify the choice
- Build a simple prompt mutation engine that generates diverse attack variations
- Conduct basic AI forensics -- collecting evidence, establishing timeline, assessing blast radius
- Explain at least three emerging attack vectors and their implications
- 評估 a claimed attack technique critically -- checking assumptions, 測試 transferability
- Design a custom tool or script to automate a specific 測試 workflow
- Articulate the limitations of current defensive approaches and where research is headed