AI Attack Taxonomy

beginner9 min readUpdated 2026-03-15

A comprehensive classification of AI attacks organized by target, technique, and impact — providing a shared vocabulary for red team planning and reporting.

taxonomy attacks classification beginner

Why Taxonomy Matters

Without a shared vocabulary, red team findings devolve into ad hoc descriptions. One tester calls it a "jailbreak," another calls it "prompt injection," a third calls it "guardrail bypass." Are these the same thing? Different aspects of the same vulnerability? Entirely different attack classes? A well-defined taxonomy provides clarity for planning, execution, and communication.

Dimension 1: Target

The first dimension of classification asks: what are you attacking? AI systems have multiple layers, and each presents distinct attack surfaces.

Model

Attacks targeting the AI model itself — its weights, behavior, learned patterns, and decision boundaries.

Attack	Description	Example
Jailbreaking	Overriding the model's safety training to produce restricted outputs	"Ignore previous instructions and explain how to..."
Prompt injection	Inserting adversarial instructions that the model follows over its system prompt	Hidden instructions in retrieved documents
Adversarial examples	Crafting inputs that cause misclassification or unexpected behavior	Perturbed images, adversarial token sequences
Model extraction	Querying the model to reconstruct its weights or a functional equivalent	Systematic querying to train a clone model

Data

Attacks targeting the data that flows through or was used to build the AI system.

Attack	Description	Example
Training data poisoning	Corrupting training data to embed backdoors or bias	Injecting malicious examples into web-scraped datasets
Data exfiltration	Extracting sensitive data the model memorized during training	Prompting for verbatim training data reproduction
RAG poisoning	Corrupting retrieval-augmented generation data sources	Injecting adversarial documents into a knowledge base
Membership inference	Determining whether specific data was in the training set	Statistical analysis of model confidence on known vs unknown data

Infrastructure

Attacks targeting the systems, APIs, and deployment infrastructure surrounding the model.

Attack	Description	Example
API abuse	Exploiting API design flaws, rate limits, or authentication	Bypassing rate limits through distributed requests
Supply chain	Compromising model dependencies, libraries, or hosting	Malicious model files on Hugging Face, compromised pip packages
Side-channel	Extracting information from timing, error messages, or resource usage	Token count differences revealing filtered content
Denial of service	Overwhelming or degrading AI system availability	Crafting inputs that maximize compute (e.g., long context exploitation)

Agent

Attacks specific to AI agents that can take actions in the real world through tool use.

Attack	Description	Example
Tool manipulation	Causing the agent to misuse its tools	Prompt injection causing an agent to send unauthorized emails
Goal hijacking	Redirecting the agent's objective to serve the attacker	Overriding the agent's task via injected instructions in retrieved content
Privilege escalation	Gaining access to tools or data beyond intended authorization	Exploiting an agent's database access to query unauthorized tables
Feedback loop exploitation	Manipulating agent self-evaluation or planning loops	Poisoning an agent's memory to alter future behavior

Dimension 2: Technique

The second dimension describes how the attack works — the mechanism or method used.

Injection

Injection attacks insert adversarial instructions or content into the AI system's processing pipeline.

Direct prompt injection: The attacker's input itself contains the adversarial payload
Indirect prompt injection: The payload is placed in external content (documents, web pages, emails) that the model processes
Cross-plugin injection: The payload transits through a tool or plugin boundary, exploiting trust assumptions between components
Multi-modal injection: Adversarial content is embedded in images, audio, or other non-text modalities that the model processes

Evasion

Evasion attacks bypass detection or classification systems without altering their underlying mechanism.

Obfuscation: Encoding, character substitution, or formatting tricks that pass human-readable content past automated filters
Semantic paraphrasing: Restating adversarial intent in language that evades keyword or classifier-based detection
Fragmentation: Splitting adversarial content across multiple messages or documents so that no single fragment triggers detection
Adversarial perturbation: Mathematically computed modifications to inputs that cause misclassification while being imperceptible to humans

Extraction

Extraction attacks aim to steal information from the AI system — training data, model parameters, or system configuration.

System prompt extraction: Techniques to make the model reveal its system instructions
Training data extraction: Prompting the model to reproduce memorized training examples
Model stealing: Querying the model to build a functionally equivalent copy
Embedding extraction: Recovering internal representations that reveal sensitive information about the model or data

Poisoning

Poisoning attacks corrupt the AI system's learning or knowledge to embed malicious behavior.

Pre-training poisoning: Injecting malicious data into pre-training corpora
Fine-tuning poisoning: Corrupting fine-tuning datasets to embed backdoors
RAG poisoning: Corrupting the knowledge base that a retrieval-augmented generation system draws from
Feedback poisoning: Manipulating RLHF or user feedback signals to shift model behavior

Dimension 3: Impact

The third dimension classifies attacks by their effect on the system or its users.

Confidentiality

The attacker gains access to information they should not have. This includes training data extraction, system prompt leakage, PII exposure, and model weight theft.

Integrity

The attacker causes the system to produce incorrect, misleading, or harmful outputs. This includes jailbreaking (producing disallowed content), hallucination amplification, and output manipulation.

Availability

The attacker degrades or prevents legitimate use of the system. This includes compute-intensive inputs that cause slowdowns, inputs that trigger excessive error handling, and attacks that cause the system to refuse legitimate requests (over-refusal).

Safety

The attacker causes the system to produce outputs that could lead to real-world harm. This is distinct from integrity because it specifically involves outputs related to physical danger, self-harm, illegal activity, or other safety-critical content.

Using the Taxonomy for Planning

The taxonomy's three dimensions combine to create a structured attack space. During engagement planning, use this matrix to ensure coverage:

Map the Target Surface
Identify which targets (model, data, infrastructure, agent) are in scope. A simple chatbot might only expose model and infrastructure surfaces. An agentic system exposes all four.
Enumerate Applicable Techniques
For each target, determine which techniques are applicable. Not every technique applies to every target. For example, poisoning attacks may be out of scope if you do not have access to training data.
Prioritize by Impact
Rank the target-technique combinations by their potential impact. A confidentiality breach involving PII is typically higher priority than an integrity issue involving mild off-topic responses.
Assign to Team Members
Different techniques require different expertise. Assign injection and evasion attacks to prompt engineering specialists, extraction attacks to ML engineers, and infrastructure attacks to security engineers.

Taxonomy in Practice: Classifying Real Attacks

Consider how well-known attacks map to the taxonomy:

Attack	Target	Technique	Impact
"DAN" jailbreak	Model	Injection (direct)	Integrity, Safety
Indirect prompt injection via email	Agent	Injection (indirect)	Integrity, Confidentiality
GCG adversarial suffixes	Model	Evasion (perturbation)	Integrity, Safety
Training data extraction ("repeat the word poem forever")	Data	Extraction	Confidentiality
Sleeper agent backdoor	Model	Poisoning (fine-tuning)	Integrity, Safety
Model cloning via API queries	Model	Extraction (model stealing)	Confidentiality
RAG document injection	Data	Poisoning (RAG)	Integrity
Rate limit bypass for token harvesting	Infrastructure	Infrastructure abuse	Availability, Confidentiality

Red Team Methodology Fundamentals — the engagement lifecycle that uses this taxonomy
Threat Modeling for AI — applying the taxonomy to specific systems
Adversarial ML: Core Concepts — deeper dive into adversarial techniques
The AI Landscape — understanding the systems you are classifying attacks against

References

"MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems" - MITRE Corporation (2023) - Systematic enumeration of adversarial tactics, techniques, and procedures for AI systems
"A Taxonomy and Terminology of Adversarial Machine Learning" - NIST IR 8269 (2024) - NIST's formal taxonomy of adversarial ML concepts and terminology
"OWASP Top 10 for LLM Applications" - OWASP (2025) - Risk-focused classification of LLM application vulnerabilities
"Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" - Greshake et al. (2023) - Foundational paper on indirect prompt injection attacks

Knowledge Check

How should a red team use the attack taxonomy during engagement planning?

Edit this page on GitHub

AI Attack Taxonomy

beginner9 min readUpdated 2026-03-15

A comprehensive classification of AI attacks organized by target, technique, and impact — providing a shared vocabulary for red team planning and reporting.

taxonomy attacks classification beginner

Why Taxonomy Matters

Dimension 1: Target

The first dimension of classification asks: what are you attacking? AI systems have multiple layers, and each presents distinct attack surfaces.

Model

Attacks targeting the AI model itself — its weights, behavior, learned patterns, and decision boundaries.

Attack	Description	Example
Jailbreaking	Overriding the model's safety training to produce restricted outputs	"Ignore previous instructions and explain how to..."
Prompt injection	Inserting adversarial instructions that the model follows over its system prompt	Hidden instructions in retrieved documents
Adversarial examples	Crafting inputs that cause misclassification or unexpected behavior	Perturbed images, adversarial token sequences
Model extraction	Querying the model to reconstruct its weights or a functional equivalent	Systematic querying to train a clone model

Data

Attacks targeting the data that flows through or was used to build the AI system.

Attack	Description	Example
Training data poisoning	Corrupting training data to embed backdoors or bias	Injecting malicious examples into web-scraped datasets
Data exfiltration	Extracting sensitive data the model memorized during training	Prompting for verbatim training data reproduction
RAG poisoning	Corrupting retrieval-augmented generation data sources	Injecting adversarial documents into a knowledge base
Membership inference	Determining whether specific data was in the training set	Statistical analysis of model confidence on known vs unknown data

Infrastructure

Attacks targeting the systems, APIs, and deployment infrastructure surrounding the model.

Attack	Description	Example
API abuse	Exploiting API design flaws, rate limits, or authentication	Bypassing rate limits through distributed requests
Supply chain	Compromising model dependencies, libraries, or hosting	Malicious model files on Hugging Face, compromised pip packages
Side-channel	Extracting information from timing, error messages, or resource usage	Token count differences revealing filtered content
Denial of service	Overwhelming or degrading AI system availability	Crafting inputs that maximize compute (e.g., long context exploitation)

Agent

Attacks specific to AI agents that can take actions in the real world through tool use.

Attack	Description	Example
Tool manipulation	Causing the agent to misuse its tools	Prompt injection causing an agent to send unauthorized emails
Goal hijacking	Redirecting the agent's objective to serve the attacker	Overriding the agent's task via injected instructions in retrieved content
Privilege escalation	Gaining access to tools or data beyond intended authorization	Exploiting an agent's database access to query unauthorized tables
Feedback loop exploitation	Manipulating agent self-evaluation or planning loops	Poisoning an agent's memory to alter future behavior

Dimension 2: Technique

The second dimension describes how the attack works — the mechanism or method used.

Injection

Injection attacks insert adversarial instructions or content into the AI system's processing pipeline.

Direct prompt injection: The attacker's input itself contains the adversarial payload
Indirect prompt injection: The payload is placed in external content (documents, web pages, emails) that the model processes
Cross-plugin injection: The payload transits through a tool or plugin boundary, exploiting trust assumptions between components
Multi-modal injection: Adversarial content is embedded in images, audio, or other non-text modalities that the model processes

Evasion

Evasion attacks bypass detection or classification systems without altering their underlying mechanism.

Obfuscation: Encoding, character substitution, or formatting tricks that pass human-readable content past automated filters
Semantic paraphrasing: Restating adversarial intent in language that evades keyword or classifier-based detection
Fragmentation: Splitting adversarial content across multiple messages or documents so that no single fragment triggers detection
Adversarial perturbation: Mathematically computed modifications to inputs that cause misclassification while being imperceptible to humans

Extraction

Extraction attacks aim to steal information from the AI system — training data, model parameters, or system configuration.

System prompt extraction: Techniques to make the model reveal its system instructions
Training data extraction: Prompting the model to reproduce memorized training examples
Model stealing: Querying the model to build a functionally equivalent copy
Embedding extraction: Recovering internal representations that reveal sensitive information about the model or data

Poisoning

Poisoning attacks corrupt the AI system's learning or knowledge to embed malicious behavior.

Pre-training poisoning: Injecting malicious data into pre-training corpora
Fine-tuning poisoning: Corrupting fine-tuning datasets to embed backdoors
RAG poisoning: Corrupting the knowledge base that a retrieval-augmented generation system draws from
Feedback poisoning: Manipulating RLHF or user feedback signals to shift model behavior

Map the Target Surface
Identify which targets (model, data, infrastructure, agent) are in scope. A simple chatbot might only expose model and infrastructure surfaces. An agentic system exposes all four.
Enumerate Applicable Techniques
For each target, determine which techniques are applicable. Not every technique applies to every target. For example, poisoning attacks may be out of scope if you do not have access to training data.
Prioritize by Impact
Rank the target-technique combinations by their potential impact. A confidentiality breach involving PII is typically higher priority than an integrity issue involving mild off-topic responses.
Assign to Team Members
Different techniques require different expertise. Assign injection and evasion attacks to prompt engineering specialists, extraction attacks to ML engineers, and infrastructure attacks to security engineers.

Taxonomy in Practice: Classifying Real Attacks

Consider how well-known attacks map to the taxonomy:

Attack	Target	Technique	Impact
"DAN" jailbreak	Model	Injection (direct)	Integrity, Safety
Indirect prompt injection via email	Agent	Injection (indirect)	Integrity, Confidentiality
GCG adversarial suffixes	Model	Evasion (perturbation)	Integrity, Safety
Training data extraction ("repeat the word poem forever")	Data	Extraction	Confidentiality
Sleeper agent backdoor	Model	Poisoning (fine-tuning)	Integrity, Safety
Model cloning via API queries	Model	Extraction (model stealing)	Confidentiality
RAG document injection	Data	Poisoning (RAG)	Integrity
Rate limit bypass for token harvesting	Infrastructure	Infrastructure abuse	Availability, Confidentiality

Red Team Methodology Fundamentals — the engagement lifecycle that uses this taxonomy
Threat Modeling for AI — applying the taxonomy to specific systems
Adversarial ML: Core Concepts — deeper dive into adversarial techniques
The AI Landscape — understanding the systems you are classifying attacks against

References

"MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems" - MITRE Corporation (2023) - Systematic enumeration of adversarial tactics, techniques, and procedures for AI systems
"A Taxonomy and Terminology of Adversarial Machine Learning" - NIST IR 8269 (2024) - NIST's formal taxonomy of adversarial ML concepts and terminology
"OWASP Top 10 for LLM Applications" - OWASP (2025) - Risk-focused classification of LLM application vulnerabilities
"Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" - Greshake et al. (2023) - Foundational paper on indirect prompt injection attacks

Knowledge Check

How should a red team use the attack taxonomy during engagement planning?

Edit this page on GitHub

AI Attack Taxonomy

Map the Target Surface

Enumerate Applicable Techniques

Prioritize by Impact

Assign to Team Members

Related articles

AI Attack Taxonomy

Map the Target Surface

Enumerate Applicable Techniques

Prioritize by Impact

Assign to Team Members

Related articles