Advanced Topics Study Guide
Study guide covering AI security research techniques, automation, forensics, emerging attack vectors, and tool development for advanced practitioners.
Advanced Topics Study Guide
This guide covers topics beyond the core curriculum: research methodology, attack automation, AI forensics, emerging threats, and tool development. It is designed for practitioners who have completed the core assessments and want to deepen their expertise.
Research Methodology
Reading AI Security Research
Staying current with AI security research is essential. Key venues and how to extract actionable intelligence from them:
| Venue | Focus | What to Extract |
|---|---|---|
| NeurIPS / ICML / ICLR | ML research, including adversarial ML and safety | Novel attack primitives, new model architectures with security implications, alignment research |
| USENIX Security / IEEE S&P / CCS | Security research, including AI security | Practical attacks with demonstrated impact, defense evaluations, real-world case studies |
| arXiv (cs.CR, cs.AI, cs.CL) | Preprints across all areas | Earliest access to new techniques, but not peer-reviewed -- read critically |
| Vendor blogs | Anthropic, OpenAI, Google DeepMind, Microsoft | Model cards, safety evaluations, disclosed vulnerabilities, capability announcements |
| Jailbreak communities | Practical bypass techniques | Current working jailbreaks, guardrail bypass patterns, model-specific quirks |
Reading strategy for practitioners:
- Read the abstract and conclusion first to determine if the paper is relevant to your work.
- Focus on the threat model -- what assumptions does the attack make? (White-box vs. black-box, access requirements, computational budget)
- Check the evaluation -- was it tested on production models or toy examples? Does it transfer across model families?
- Extract the technique -- what is the core insight that makes this attack work? Can it be adapted to your assessments?
- Assess the defense implications -- how would existing guardrails handle this attack? What new defenses are needed?
Developing Novel Techniques
Moving from applying known attacks to developing new ones requires systematic thinking:
Attack surface analysis: Start with the system architecture and identify trust boundaries. Each trust boundary is a potential injection surface. Ask: "What untrusted data reaches the model, and what can the model do with its output?"
Hypothesis-driven testing: Formulate specific hypotheses about model behavior and test them systematically. "I hypothesize that this model's safety training is weaker for requests framed as academic analysis" -- then test with controlled variations that isolate the variable.
Ablation studies: When an attack works, determine which components are essential. Remove elements one at a time to find the minimum viable attack. This reveals the underlying mechanism and enables generalization to other contexts.
Cross-model transfer: Test techniques across model families. Attacks that transfer from open-weight models (where you can analyze the mechanism) to closed-API models (where you cannot) are particularly valuable because they suggest a general vulnerability in alignment approaches.
Attack Automation
When to Automate
Automation is valuable for specific phases of red teaming. Understanding when it helps and when it hinders is important.
| Phase | Automation Value | Tools and Approaches |
|---|---|---|
| Baseline scanning | High -- covers known attack patterns quickly | Garak, promptfoo, custom scripts against OWASP categories |
| Fuzzing / mutation | High -- generates variations faster than manual testing | PyRIT converters, custom mutation engines, LLM-assisted paraphrasing |
| Multi-turn attacks | Medium -- orchestration helps but strategy requires human judgment | PyRIT orchestrators, custom state machines for conversation flow |
| Novel technique development | Low -- requires creativity and intuition that automation lacks | Human-driven with tool assistance for iteration speed |
| Regression testing | High -- verifies that previous findings are still fixed | CI/CD integration with attack test suites |
| Coverage tracking | High -- ensures systematic testing across categories | Custom dashboards, test case management integration |
Tool Ecosystem
Understanding the major tools and their strengths:
Garak (Generative AI Red-teaming and Assessment Kit):
- Probe-based architecture: probes generate attacks, detectors evaluate responses
- Good for: baseline sweeps against known vulnerability categories
- Limitations: primarily single-turn, limited multi-step orchestration
- Extensible: custom probes and detectors can be added
PyRIT (Python Risk Identification Tool for Generative AI):
- Microsoft's orchestration framework for multi-turn attacks
- Converters: transform prompts between formats (Base64, translation, leetspeak)
- Orchestrators: manage multi-turn attack strategies
- Scorers: evaluate whether attacks succeeded
- Good for: complex, multi-turn attack scenarios with automated scoring
promptfoo:
- YAML-driven evaluation harness
- Good for: systematic testing against defined test cases, CI/CD integration
- Plugin system enables custom attack types
- Strong in comparative evaluation (test the same attacks against multiple models/configurations)
Inspect AI (UK AISI):
- Task-based evaluation framework
- Composable solvers and scorers
- Good for: structured safety evaluations with reproducible metrics
- Designed for evaluation rather than attack automation
Building Custom Tools
When existing tools do not meet your needs, building custom automation requires:
Prompt mutation engines: Systems that take a base attack prompt and generate variations using techniques like synonym replacement, encoding changes, language translation, structural rearrangement, and LLM-assisted paraphrasing. Key design decision: how to balance diversity (generating novel variations) with focus (maintaining attack intent).
Conversation state machines: For multi-turn attacks, a state machine tracks the conversation state and selects the next prompt based on the model's response. States might include: "establishing rapport," "introducing topic," "escalating request," "extracting target content." Transitions are triggered by classifier evaluation of the model's response.
Result classifiers: Automated evaluation of whether an attack succeeded. This can use: keyword matching (brittle but fast), classifier models (more robust but may have false positives), LLM-as-judge (contextual but expensive and potentially jailbreakable), or rule-based scoring with human review for borderline cases.
AI Forensics
Investigating AI Security Incidents
When an AI security incident occurs, forensic investigation follows a structured approach:
Evidence collection priorities:
| Evidence Source | What It Contains | Collection Urgency |
|---|---|---|
| Prompt/response logs | Full conversation history of the attack | Immediate -- logs may rotate or be purged |
| Guardrail decision logs | Which filters triggered or missed | Immediate -- reveals defense failures |
| Tool call audit trail | What actions the compromised agent took | Immediate -- shows blast radius |
| Model version and config | Exact model and parameters at time of incident | Immediate -- model may be updated |
| Network logs | Data exfiltration attempts, C2 communication | High -- standard retention applies |
| Application logs | Error messages, state changes, authentication events | High -- contextualizes the AI-specific evidence |
| Training data access logs | Whether training data was accessed or modified | Medium -- relevant for poisoning investigation |
| Infrastructure logs | Cloud audit trails, deployment history | Medium -- relevant for supply chain investigation |
Root Cause Analysis
AI incidents often have layered root causes. A structured approach:
- Immediate cause: What happened? (e.g., "The agent executed unauthorized tool calls")
- Proximate cause: How did it happen? (e.g., "A prompt injection in a retrieved document hijacked the agent's reasoning")
- Contributing causes: What enabled it? (e.g., "The agent had access to tools beyond its required scope; the RAG pipeline did not sanitize retrieved documents; output monitoring did not flag unusual tool call patterns")
- Systemic cause: Why was the system vulnerable? (e.g., "No threat model was conducted for the RAG pipeline; tool access was granted based on convenience rather than least privilege; monitoring was designed for traditional web attacks, not AI-specific patterns")
Blast Radius Assessment
Determining the full impact of an AI security incident:
- Data exposure: What data did the model have access to in its context? What was the model manipulated into outputting? Was PII, credentials, or proprietary data exposed?
- Action impact: What tool calls did the compromised agent make? What systems were accessed? Were any modifications made to databases, files, or configurations?
- Lateral movement: Did the compromised agent communicate with other agents or systems? Could the attacker have pivoted through inter-agent communication?
- Persistence: Did the attack modify any persistent state (conversation history, knowledge base, agent memory) that could affect future interactions?
- User impact: How many users were affected? Were user sessions compromised? Did users receive manipulated outputs that could have led to downstream harm?
Emerging Attack Vectors
Areas of Active Research
The AI attack landscape evolves rapidly. These are areas where new techniques are actively being developed:
Multimodal injection: Embedding prompt injection payloads in images, audio, and video that are processed by multimodal models. Techniques include steganographic text in images, adversarial perturbations that are interpreted as instructions, and injection payloads in audio transcriptions.
Fine-tuning attacks: Compromising safety through targeted fine-tuning on small datasets. Research has shown that as few as 100 carefully crafted examples can significantly degrade a model's safety behavior. This is particularly relevant for fine-tuning-as-a-service platforms where users upload their own training data.
Model merging vulnerabilities: When models are combined through weight averaging, interpolation, or other merging techniques, the security properties of individual models may not be preserved. A safe model merged with an unsafe model may produce unpredictable safety behavior.
Reasoning model exploitation: Models trained for extended reasoning (chain-of-thought, tree-of-thought) present new attack surfaces. The reasoning process itself can be manipulated, and the extended computation may amplify the effect of subtle injections.
Cross-session attacks: Exploiting persistent state across user sessions, including conversation memory, user preference learning, and shared knowledge bases. These enable attacks that persist beyond individual conversations and may affect multiple users.
Supply chain attacks on model registries: Compromising model hosting platforms (Hugging Face, model zoos) to distribute backdoored models. This includes name squatting (publishing malicious models with names similar to popular ones), compromising contributor accounts, and exploiting model file formats that support arbitrary code execution.
Defensive Research Directions
Understanding where defenses are headed helps red teamers anticipate what they will encounter:
Instruction hierarchy: Training models to recognize and enforce privilege levels between system, user, and tool instructions. This is a fundamental research direction aimed at solving prompt injection at the model level.
Formal verification for AI systems: Applying formal methods to verify properties of AI system behavior. Still early-stage but could eventually provide provable security guarantees for specific properties.
Interpretability-based detection: Using model interpretability tools (activation analysis, attention patterns) to detect when a model is responding to an injection rather than the legitimate prompt. This approaches the problem from inside the model rather than from input/output filtering.
Confidential computing for AI: Using hardware enclaves (Intel SGX, AMD SEV, ARM CCA) to protect model inference from infrastructure-level attacks. Relevant for protecting model weights, training data, and inference inputs/outputs from cloud provider access.
Common Pitfalls
Key References
| Resource | Type | Focus |
|---|---|---|
| MITRE ATLAS | Knowledge base | Adversary TTPs for AI/ML systems |
| OWASP LLM Top 10 | Risk framework | Critical LLM application risks |
| NIST AI 100-2 (Adversarial ML) | Guidelines | Taxonomy and terminology for adversarial ML |
| Anthropic's RSP | Policy document | Responsible scaling commitments and safety evaluations |
| ML Safety course (Center for AI Safety) | Course | Comprehensive ML safety curriculum |
| Lakera Gandalf | CTF | Practical prompt injection challenges |
| HackAPrompt | Competition | Structured prompt injection competition with scoring |
Study Checklist
Before considering yourself proficient in advanced topics, confirm you can:
- Read an AI security paper and extract the actionable technique, threat model, and limitations
- Choose the appropriate automation tool for a given testing scenario and justify the choice
- Build a simple prompt mutation engine that generates diverse attack variations
- Conduct basic AI forensics -- collecting evidence, establishing timeline, assessing blast radius
- Explain at least three emerging attack vectors and their implications
- Evaluate a claimed attack technique critically -- checking assumptions, testing transferability
- Design a custom tool or script to automate a specific testing workflow
- Articulate the limitations of current defensive approaches and where research is headed