Advanced Topics Study Guide

intermediate11 min readUpdated 2026-03-15

Study guide covering AI security research techniques, automation, forensics, emerging attack vectors, and tool development for advanced practitioners.

study-guide advanced research automation forensics

Advanced Topics Study Guide

This guide covers topics beyond the core curriculum: research methodology, attack automation, AI forensics, emerging threats, and tool development. It is designed for practitioners who have completed the core assessments and want to deepen their expertise.

Research Methodology

Reading AI Security Research

Staying current with AI security research is essential. Key venues and how to extract actionable intelligence from them:

Venue	Focus	What to Extract
NeurIPS / ICML / ICLR	ML research, including adversarial ML and safety	Novel attack primitives, new model architectures with security implications, alignment research
USENIX Security / IEEE S&P / CCS	Security research, including AI security	Practical attacks with demonstrated impact, defense evaluations, real-world case studies
arXiv (cs.CR, cs.AI, cs.CL)	Preprints across all areas	Earliest access to new techniques, but not peer-reviewed -- read critically
Vendor blogs	Anthropic, OpenAI, Google DeepMind, Microsoft	Model cards, safety evaluations, disclosed vulnerabilities, capability announcements
Jailbreak communities	Practical bypass techniques	Current working jailbreaks, guardrail bypass patterns, model-specific quirks

Reading strategy for practitioners:

Read the abstract and conclusion first to determine if the paper is relevant to your work.
Focus on the threat model -- what assumptions does the attack make? (White-box vs. black-box, access requirements, computational budget)
Check the evaluation -- was it tested on production models or toy examples? Does it transfer across model families?
Extract the technique -- what is the core insight that makes this attack work? Can it be adapted to your assessments?
Assess the defense implications -- how would existing guardrails handle this attack? What new defenses are needed?

Developing Novel Techniques

Moving from applying known attacks to developing new ones requires systematic thinking:

Attack surface analysis: Start with the system architecture and identify trust boundaries. Each trust boundary is a potential injection surface. Ask: "What untrusted data reaches the model, and what can the model do with its output?"

Hypothesis-driven testing: Formulate specific hypotheses about model behavior and test them systematically. "I hypothesize that this model's safety training is weaker for requests framed as academic analysis" -- then test with controlled variations that isolate the variable.

Ablation studies: When an attack works, determine which components are essential. Remove elements one at a time to find the minimum viable attack. This reveals the underlying mechanism and enables generalization to other contexts.

Cross-model transfer: Test techniques across model families. Attacks that transfer from open-weight models (where you can analyze the mechanism) to closed-API models (where you cannot) are particularly valuable because they suggest a general vulnerability in alignment approaches.

Attack Automation

When to Automate

Automation is valuable for specific phases of red teaming. Understanding when it helps and when it hinders is important.

Phase	Automation Value	Tools and Approaches
Baseline scanning	High -- covers known attack patterns quickly	Garak, promptfoo, custom scripts against OWASP categories
Fuzzing / mutation	High -- generates variations faster than manual testing	PyRIT converters, custom mutation engines, LLM-assisted paraphrasing
Multi-turn attacks	Medium -- orchestration helps but strategy requires human judgment	PyRIT orchestrators, custom state machines for conversation flow
Novel technique development	Low -- requires creativity and intuition that automation lacks	Human-driven with tool assistance for iteration speed
Regression testing	High -- verifies that previous findings are still fixed	CI/CD integration with attack test suites
Coverage tracking	High -- ensures systematic testing across categories	Custom dashboards, test case management integration

Tool Ecosystem

Understanding the major tools and their strengths:

Garak (Generative AI Red-teaming and Assessment Kit):

Probe-based architecture: probes generate attacks, detectors evaluate responses
Good for: baseline sweeps against known vulnerability categories
Limitations: primarily single-turn, limited multi-step orchestration
Extensible: custom probes and detectors can be added

PyRIT (Python Risk Identification Tool for Generative AI):

Microsoft's orchestration framework for multi-turn attacks
Converters: transform prompts between formats (Base64, translation, leetspeak)
Orchestrators: manage multi-turn attack strategies
Scorers: evaluate whether attacks succeeded
Good for: complex, multi-turn attack scenarios with automated scoring

promptfoo:

YAML-driven evaluation harness
Good for: systematic testing against defined test cases, CI/CD integration
Plugin system enables custom attack types
Strong in comparative evaluation (test the same attacks against multiple models/configurations)

Inspect AI (UK AISI):

Task-based evaluation framework
Composable solvers and scorers
Good for: structured safety evaluations with reproducible metrics
Designed for evaluation rather than attack automation

Building Custom Tools

When existing tools do not meet your needs, building custom automation requires:

Prompt mutation engines: Systems that take a base attack prompt and generate variations using techniques like synonym replacement, encoding changes, language translation, structural rearrangement, and LLM-assisted paraphrasing. Key design decision: how to balance diversity (generating novel variations) with focus (maintaining attack intent).

Conversation state machines: For multi-turn attacks, a state machine tracks the conversation state and selects the next prompt based on the model's response. States might include: "establishing rapport," "introducing topic," "escalating request," "extracting target content." Transitions are triggered by classifier evaluation of the model's response.

Result classifiers: Automated evaluation of whether an attack succeeded. This can use: keyword matching (brittle but fast), classifier models (more robust but may have false positives), LLM-as-judge (contextual but expensive and potentially jailbreakable), or rule-based scoring with human review for borderline cases.

AI Forensics

Investigating AI Security Incidents

When an AI security incident occurs, forensic investigation follows a structured approach:

Evidence collection priorities:

Evidence Source	What It Contains	Collection Urgency
Prompt/response logs	Full conversation history of the attack	Immediate -- logs may rotate or be purged
Guardrail decision logs	Which filters triggered or missed	Immediate -- reveals defense failures
Tool call audit trail	What actions the compromised agent took	Immediate -- shows blast radius
Model version and config	Exact model and parameters at time of incident	Immediate -- model may be updated
Network logs	Data exfiltration attempts, C2 communication	High -- standard retention applies
Application logs	Error messages, state changes, authentication events	High -- contextualizes the AI-specific evidence
Training data access logs	Whether training data was accessed or modified	Medium -- relevant for poisoning investigation
Infrastructure logs	Cloud audit trails, deployment history	Medium -- relevant for supply chain investigation

Root Cause Analysis

AI incidents often have layered root causes. A structured approach:

Immediate cause: What happened? (e.g., "The agent executed unauthorized tool calls")
Proximate cause: How did it happen? (e.g., "A prompt injection in a retrieved document hijacked the agent's reasoning")
Contributing causes: What enabled it? (e.g., "The agent had access to tools beyond its required scope; the RAG pipeline did not sanitize retrieved documents; output monitoring did not flag unusual tool call patterns")
Systemic cause: Why was the system vulnerable? (e.g., "No threat model was conducted for the RAG pipeline; tool access was granted based on convenience rather than least privilege; monitoring was designed for traditional web attacks, not AI-specific patterns")

Blast Radius Assessment

Determining the full impact of an AI security incident:

Data exposure: What data did the model have access to in its context? What was the model manipulated into outputting? Was PII, credentials, or proprietary data exposed?
Action impact: What tool calls did the compromised agent make? What systems were accessed? Were any modifications made to databases, files, or configurations?
Lateral movement: Did the compromised agent communicate with other agents or systems? Could the attacker have pivoted through inter-agent communication?
Persistence: Did the attack modify any persistent state (conversation history, knowledge base, agent memory) that could affect future interactions?
User impact: How many users were affected? Were user sessions compromised? Did users receive manipulated outputs that could have led to downstream harm?

Emerging Attack Vectors

Areas of Active Research

The AI attack landscape evolves rapidly. These are areas where new techniques are actively being developed:

Multimodal injection: Embedding prompt injection payloads in images, audio, and video that are processed by multimodal models. Techniques include steganographic text in images, adversarial perturbations that are interpreted as instructions, and injection payloads in audio transcriptions.

Fine-tuning attacks: Compromising safety through targeted fine-tuning on small datasets. Research has shown that as few as 100 carefully crafted examples can significantly degrade a model's safety behavior. This is particularly relevant for fine-tuning-as-a-service platforms where users upload their own training data.

Model merging vulnerabilities: When models are combined through weight averaging, interpolation, or other merging techniques, the security properties of individual models may not be preserved. A safe model merged with an unsafe model may produce unpredictable safety behavior.

Reasoning model exploitation: Models trained for extended reasoning (chain-of-thought, tree-of-thought) present new attack surfaces. The reasoning process itself can be manipulated, and the extended computation may amplify the effect of subtle injections.

Cross-session attacks: Exploiting persistent state across user sessions, including conversation memory, user preference learning, and shared knowledge bases. These enable attacks that persist beyond individual conversations and may affect multiple users.

Supply chain attacks on model registries: Compromising model hosting platforms (Hugging Face, model zoos) to distribute backdoored models. This includes name squatting (publishing malicious models with names similar to popular ones), compromising contributor accounts, and exploiting model file formats that support arbitrary code execution.

Defensive Research Directions

Understanding where defenses are headed helps red teamers anticipate what they will encounter:

Instruction hierarchy: Training models to recognize and enforce privilege levels between system, user, and tool instructions. This is a fundamental research direction aimed at solving prompt injection at the model level.

Formal verification for AI systems: Applying formal methods to verify properties of AI system behavior. Still early-stage but could eventually provide provable security guarantees for specific properties.

Interpretability-based detection: Using model interpretability tools (activation analysis, attention patterns) to detect when a model is responding to an injection rather than the legitimate prompt. This approaches the problem from inside the model rather than from input/output filtering.

Confidential computing for AI: Using hardware enclaves (Intel SGX, AMD SEV, ARM CCA) to protect model inference from infrastructure-level attacks. Relevant for protecting model weights, training data, and inference inputs/outputs from cloud provider access.

Common Pitfalls

Warning

"Automation replaces manual testing": Automation handles known patterns efficiently but cannot discover novel attack vectors. The most impactful findings typically come from human creativity guided by systematic methodology.
"If the paper says it works, it works": Research papers often test under controlled conditions. Verify that claimed attacks actually work against production-grade models with current defenses before including them in your toolkit.
"AI forensics is the same as traditional digital forensics": AI incidents produce different evidence types (prompt/response logs, tool call traces, model behavior artifacts) and require different analysis skills. Traditional forensic tools need to be augmented with AI-specific capabilities.
"Emerging attacks are always more dangerous than known attacks": Most real-world AI compromises use well-known techniques (basic prompt injection, misconfigured permissions) rather than cutting-edge research. Master the fundamentals before chasing the frontier.
"Open-source tools cover everything": Each tool has specific strengths and significant gaps. Professional red teaming typically requires custom tooling to complement open-source frameworks.

Key References

Resource	Type	Focus
MITRE ATLAS	Knowledge base	Adversary TTPs for AI/ML systems
OWASP LLM Top 10	Risk framework	Critical LLM application risks
NIST AI 100-2 (Adversarial ML)	Guidelines	Taxonomy and terminology for adversarial ML
Anthropic's RSP	Policy document	Responsible scaling commitments and safety evaluations
ML Safety course (Center for AI Safety)	Course	Comprehensive ML safety curriculum
Lakera Gandalf	CTF	Practical prompt injection challenges
HackAPrompt	Competition	Structured prompt injection competition with scoring

Study Checklist

Before considering yourself proficient in advanced topics, confirm you can:

Read an AI security paper and extract the actionable technique, threat model, and limitations
Choose the appropriate automation tool for a given testing scenario and justify the choice
Build a simple prompt mutation engine that generates diverse attack variations
Conduct basic AI forensics -- collecting evidence, establishing timeline, assessing blast radius
Explain at least three emerging attack vectors and their implications
Evaluate a claimed attack technique critically -- checking assumptions, testing transferability
Design a custom tool or script to automate a specific testing workflow
Articulate the limitations of current defensive approaches and where research is headed

Edit this page on GitHub

Advanced Topics Study Guide

intermediate11 min readUpdated 2026-03-15

Study guide covering AI security research techniques, automation, forensics, emerging attack vectors, and tool development for advanced practitioners.

study-guide advanced research automation forensics

Advanced Topics Study Guide

Research Methodology

Reading AI Security Research

Staying current with AI security research is essential. Key venues and how to extract actionable intelligence from them:

Venue	Focus	What to Extract
NeurIPS / ICML / ICLR	ML research, including adversarial ML and safety	Novel attack primitives, new model architectures with security implications, alignment research
USENIX Security / IEEE S&P / CCS	Security research, including AI security	Practical attacks with demonstrated impact, defense evaluations, real-world case studies
arXiv (cs.CR, cs.AI, cs.CL)	Preprints across all areas	Earliest access to new techniques, but not peer-reviewed -- read critically
Vendor blogs	Anthropic, OpenAI, Google DeepMind, Microsoft	Model cards, safety evaluations, disclosed vulnerabilities, capability announcements
Jailbreak communities	Practical bypass techniques	Current working jailbreaks, guardrail bypass patterns, model-specific quirks

Reading strategy for practitioners:

Read the abstract and conclusion first to determine if the paper is relevant to your work.
Focus on the threat model -- what assumptions does the attack make? (White-box vs. black-box, access requirements, computational budget)
Check the evaluation -- was it tested on production models or toy examples? Does it transfer across model families?
Extract the technique -- what is the core insight that makes this attack work? Can it be adapted to your assessments?
Assess the defense implications -- how would existing guardrails handle this attack? What new defenses are needed?

Phase	Automation Value	Tools and Approaches
Baseline scanning	High -- covers known attack patterns quickly	Garak, promptfoo, custom scripts against OWASP categories
Fuzzing / mutation	High -- generates variations faster than manual testing	PyRIT converters, custom mutation engines, LLM-assisted paraphrasing
Multi-turn attacks	Medium -- orchestration helps but strategy requires human judgment	PyRIT orchestrators, custom state machines for conversation flow
Novel technique development	Low -- requires creativity and intuition that automation lacks	Human-driven with tool assistance for iteration speed
Regression testing	High -- verifies that previous findings are still fixed	CI/CD integration with attack test suites
Coverage tracking	High -- ensures systematic testing across categories	Custom dashboards, test case management integration

Tool Ecosystem

Understanding the major tools and their strengths:

Garak (Generative AI Red-teaming and Assessment Kit):

Probe-based architecture: probes generate attacks, detectors evaluate responses
Good for: baseline sweeps against known vulnerability categories
Limitations: primarily single-turn, limited multi-step orchestration
Extensible: custom probes and detectors can be added

PyRIT (Python Risk Identification Tool for Generative AI):

Microsoft's orchestration framework for multi-turn attacks
Converters: transform prompts between formats (Base64, translation, leetspeak)
Orchestrators: manage multi-turn attack strategies
Scorers: evaluate whether attacks succeeded
Good for: complex, multi-turn attack scenarios with automated scoring

promptfoo:

YAML-driven evaluation harness
Good for: systematic testing against defined test cases, CI/CD integration
Plugin system enables custom attack types
Strong in comparative evaluation (test the same attacks against multiple models/configurations)

Inspect AI (UK AISI):

Task-based evaluation framework
Composable solvers and scorers
Good for: structured safety evaluations with reproducible metrics
Designed for evaluation rather than attack automation

Evidence Source	What It Contains	Collection Urgency
Prompt/response logs	Full conversation history of the attack	Immediate -- logs may rotate or be purged
Guardrail decision logs	Which filters triggered or missed	Immediate -- reveals defense failures
Tool call audit trail	What actions the compromised agent took	Immediate -- shows blast radius
Model version and config	Exact model and parameters at time of incident	Immediate -- model may be updated
Network logs	Data exfiltration attempts, C2 communication	High -- standard retention applies
Application logs	Error messages, state changes, authentication events	High -- contextualizes the AI-specific evidence
Training data access logs	Whether training data was accessed or modified	Medium -- relevant for poisoning investigation
Infrastructure logs	Cloud audit trails, deployment history	Medium -- relevant for supply chain investigation

Root Cause Analysis

AI incidents often have layered root causes. A structured approach:

Immediate cause: What happened? (e.g., "The agent executed unauthorized tool calls")
Proximate cause: How did it happen? (e.g., "A prompt injection in a retrieved document hijacked the agent's reasoning")
Contributing causes: What enabled it? (e.g., "The agent had access to tools beyond its required scope; the RAG pipeline did not sanitize retrieved documents; output monitoring did not flag unusual tool call patterns")
Systemic cause: Why was the system vulnerable? (e.g., "No threat model was conducted for the RAG pipeline; tool access was granted based on convenience rather than least privilege; monitoring was designed for traditional web attacks, not AI-specific patterns")

Blast Radius Assessment

Determining the full impact of an AI security incident:

Data exposure: What data did the model have access to in its context? What was the model manipulated into outputting? Was PII, credentials, or proprietary data exposed?
Action impact: What tool calls did the compromised agent make? What systems were accessed? Were any modifications made to databases, files, or configurations?
Lateral movement: Did the compromised agent communicate with other agents or systems? Could the attacker have pivoted through inter-agent communication?
Persistence: Did the attack modify any persistent state (conversation history, knowledge base, agent memory) that could affect future interactions?
User impact: How many users were affected? Were user sessions compromised? Did users receive manipulated outputs that could have led to downstream harm?

"Automation replaces manual testing": Automation handles known patterns efficiently but cannot discover novel attack vectors. The most impactful findings typically come from human creativity guided by systematic methodology.
"If the paper says it works, it works": Research papers often test under controlled conditions. Verify that claimed attacks actually work against production-grade models with current defenses before including them in your toolkit.
"AI forensics is the same as traditional digital forensics": AI incidents produce different evidence types (prompt/response logs, tool call traces, model behavior artifacts) and require different analysis skills. Traditional forensic tools need to be augmented with AI-specific capabilities.
"Emerging attacks are always more dangerous than known attacks": Most real-world AI compromises use well-known techniques (basic prompt injection, misconfigured permissions) rather than cutting-edge research. Master the fundamentals before chasing the frontier.
"Open-source tools cover everything": Each tool has specific strengths and significant gaps. Professional red teaming typically requires custom tooling to complement open-source frameworks.

Key References

Resource	Type	Focus
MITRE ATLAS	Knowledge base	Adversary TTPs for AI/ML systems
OWASP LLM Top 10	Risk framework	Critical LLM application risks
NIST AI 100-2 (Adversarial ML)	Guidelines	Taxonomy and terminology for adversarial ML
Anthropic's RSP	Policy document	Responsible scaling commitments and safety evaluations
ML Safety course (Center for AI Safety)	Course	Comprehensive ML safety curriculum
Lakera Gandalf	CTF	Practical prompt injection challenges
HackAPrompt	Competition	Structured prompt injection competition with scoring

Study Checklist

Before considering yourself proficient in advanced topics, confirm you can:

Read an AI security paper and extract the actionable technique, threat model, and limitations
Choose the appropriate automation tool for a given testing scenario and justify the choice
Build a simple prompt mutation engine that generates diverse attack variations
Conduct basic AI forensics -- collecting evidence, establishing timeline, assessing blast radius
Explain at least three emerging attack vectors and their implications
Evaluate a claimed attack technique critically -- checking assumptions, testing transferability
Design a custom tool or script to automate a specific testing workflow
Articulate the limitations of current defensive approaches and where research is headed

Edit this page on GitHub

Advanced Topics Study Guide

Related articles

Advanced Topics Study Guide

Related articles