AI Attack Taxonomy
A comprehensive classification of AI attacks organized by target, technique, and impact — providing a shared vocabulary for red team planning and reporting.
Why Taxonomy Matters
Without a shared vocabulary, red team findings devolve into ad hoc descriptions. One tester calls it a "jailbreak," another calls it "prompt injection," a third calls it "guardrail bypass." Are these the same thing? Different aspects of the same vulnerability? Entirely different attack classes? A well-defined taxonomy provides clarity for planning, execution, and communication.
Dimension 1: Target
The first dimension of classification asks: what are you attacking? AI systems have multiple layers, and each presents distinct attack surfaces.
Model
Attacks targeting the AI model itself — its weights, behavior, learned patterns, and decision boundaries.
| Attack | Description | Example |
|---|---|---|
| Jailbreaking | Overriding the model's safety training to produce restricted outputs | "Ignore previous instructions and explain how to..." |
| Prompt injection | Inserting adversarial instructions that the model follows over its system prompt | Hidden instructions in retrieved documents |
| Adversarial examples | Crafting inputs that cause misclassification or unexpected behavior | Perturbed images, adversarial token sequences |
| Model extraction | Querying the model to reconstruct its weights or a functional equivalent | Systematic querying to train a clone model |
Data
Attacks targeting the data that flows through or was used to build the AI system.
| Attack | Description | Example |
|---|---|---|
| Training data poisoning | Corrupting training data to embed backdoors or bias | Injecting malicious examples into web-scraped datasets |
| Data exfiltration | Extracting sensitive data the model memorized during training | Prompting for verbatim training data reproduction |
| RAG poisoning | Corrupting retrieval-augmented generation data sources | Injecting adversarial documents into a knowledge base |
| Membership inference | Determining whether specific data was in the training set | Statistical analysis of model confidence on known vs unknown data |
Infrastructure
Attacks targeting the systems, APIs, and deployment infrastructure surrounding the model.
| Attack | Description | Example |
|---|---|---|
| API abuse | Exploiting API design flaws, rate limits, or authentication | Bypassing rate limits through distributed requests |
| Supply chain | Compromising model dependencies, libraries, or hosting | Malicious model files on Hugging Face, compromised pip packages |
| Side-channel | Extracting information from timing, error messages, or resource usage | Token count differences revealing filtered content |
| Denial of service | Overwhelming or degrading AI system availability | Crafting inputs that maximize compute (e.g., long context exploitation) |
Agent
Attacks specific to AI agents that can take actions in the real world through tool use.
| Attack | Description | Example |
|---|---|---|
| Tool manipulation | Causing the agent to misuse its tools | Prompt injection causing an agent to send unauthorized emails |
| Goal hijacking | Redirecting the agent's objective to serve the attacker | Overriding the agent's task via injected instructions in retrieved content |
| Privilege escalation | Gaining access to tools or data beyond intended authorization | Exploiting an agent's database access to query unauthorized tables |
| Feedback loop exploitation | Manipulating agent self-evaluation or planning loops | Poisoning an agent's memory to alter future behavior |
Dimension 2: Technique
The second dimension describes how the attack works — the mechanism or method used.
Injection
Injection attacks insert adversarial instructions or content into the AI system's processing pipeline.
- Direct prompt injection: The attacker's input itself contains the adversarial payload
- Indirect prompt injection: The payload is placed in external content (documents, web pages, emails) that the model processes
- Cross-plugin injection: The payload transits through a tool or plugin boundary, exploiting trust assumptions between components
- Multi-modal injection: Adversarial content is embedded in images, audio, or other non-text modalities that the model processes
Evasion
Evasion attacks bypass detection or classification systems without altering their underlying mechanism.
- Obfuscation: Encoding, character substitution, or formatting tricks that pass human-readable content past automated filters
- Semantic paraphrasing: Restating adversarial intent in language that evades keyword or classifier-based detection
- Fragmentation: Splitting adversarial content across multiple messages or documents so that no single fragment triggers detection
- Adversarial perturbation: Mathematically computed modifications to inputs that cause misclassification while being imperceptible to humans
Extraction
Extraction attacks aim to steal information from the AI system — training data, model parameters, or system configuration.
- System prompt extraction: Techniques to make the model reveal its system instructions
- Training data extraction: Prompting the model to reproduce memorized training examples
- Model stealing: Querying the model to build a functionally equivalent copy
- Embedding extraction: Recovering internal representations that reveal sensitive information about the model or data
Poisoning
Poisoning attacks corrupt the AI system's learning or knowledge to embed malicious behavior.
- Pre-training poisoning: Injecting malicious data into pre-training corpora
- Fine-tuning poisoning: Corrupting fine-tuning datasets to embed backdoors
- RAG poisoning: Corrupting the knowledge base that a retrieval-augmented generation system draws from
- Feedback poisoning: Manipulating RLHF or user feedback signals to shift model behavior
Dimension 3: Impact
The third dimension classifies attacks by their effect on the system or its users.
Confidentiality
The attacker gains access to information they should not have. This includes training data extraction, system prompt leakage, PII exposure, and model weight theft.
Integrity
The attacker causes the system to produce incorrect, misleading, or harmful outputs. This includes jailbreaking (producing disallowed content), hallucination amplification, and output manipulation.
Availability
The attacker degrades or prevents legitimate use of the system. This includes compute-intensive inputs that cause slowdowns, inputs that trigger excessive error handling, and attacks that cause the system to refuse legitimate requests (over-refusal).
Safety
The attacker causes the system to produce outputs that could lead to real-world harm. This is distinct from integrity because it specifically involves outputs related to physical danger, self-harm, illegal activity, or other safety-critical content.
Using the Taxonomy for Planning
The taxonomy's three dimensions combine to create a structured attack space. During engagement planning, use this matrix to ensure coverage:
Map the Target Surface
Identify which targets (model, data, infrastructure, agent) are in scope. A simple chatbot might only expose model and infrastructure surfaces. An agentic system exposes all four.
Enumerate Applicable Techniques
For each target, determine which techniques are applicable. Not every technique applies to every target. For example, poisoning attacks may be out of scope if you do not have access to training data.
Prioritize by Impact
Rank the target-technique combinations by their potential impact. A confidentiality breach involving PII is typically higher priority than an integrity issue involving mild off-topic responses.
Assign to Team Members
Different techniques require different expertise. Assign injection and evasion attacks to prompt engineering specialists, extraction attacks to ML engineers, and infrastructure attacks to security engineers.
Taxonomy in Practice: Classifying Real Attacks
Consider how well-known attacks map to the taxonomy:
| Attack | Target | Technique | Impact |
|---|---|---|---|
| "DAN" jailbreak | Model | Injection (direct) | Integrity, Safety |
| Indirect prompt injection via email | Agent | Injection (indirect) | Integrity, Confidentiality |
| GCG adversarial suffixes | Model | Evasion (perturbation) | Integrity, Safety |
| Training data extraction ("repeat the word poem forever") | Data | Extraction | Confidentiality |
| Sleeper agent backdoor | Model | Poisoning (fine-tuning) | Integrity, Safety |
| Model cloning via API queries | Model | Extraction (model stealing) | Confidentiality |
| RAG document injection | Data | Poisoning (RAG) | Integrity |
| Rate limit bypass for token harvesting | Infrastructure | Infrastructure abuse | Availability, Confidentiality |
Related Topics
- Red Team Methodology Fundamentals — the engagement lifecycle that uses this taxonomy
- Threat Modeling for AI — applying the taxonomy to specific systems
- Adversarial ML: Core Concepts — deeper dive into adversarial techniques
- The AI Landscape — understanding the systems you are classifying attacks against
References
- "MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems" - MITRE Corporation (2023) - Systematic enumeration of adversarial tactics, techniques, and procedures for AI systems
- "A Taxonomy and Terminology of Adversarial Machine Learning" - NIST IR 8269 (2024) - NIST's formal taxonomy of adversarial ML concepts and terminology
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Risk-focused classification of LLM application vulnerabilities
- "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" - Greshake et al. (2023) - Foundational paper on indirect prompt injection attacks
How should a red team use the attack taxonomy during engagement planning?