Prompt Injection Taxonomy

beginner11 min readUpdated 2026-03-15

A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.

prompt-injection taxonomy classification red-teaming framework

Prompt injection is not a single attack but a broad family of techniques. Without a shared taxonomy, red teams describe the same attack differently, defenders build controls for one variant while missing others, and risk assessments conflate low-impact curiosities with critical exploits. This page provides a classification framework that brings structure to the landscape.

Why Taxonomy Matters

Consider two findings from a red team engagement:

A user types "ignore previous instructions" and the chatbot reveals its system prompt.
An attacker plants a hidden instruction in a PDF that, when retrieved by the chatbot's RAG pipeline, causes it to exfiltrate user data to an external URL via a tool call.

Both are "prompt injection," but they differ in vector, delivery mechanism, target layer, and severity. A taxonomy that captures these dimensions lets teams prioritize remediation, track trends across engagements, and communicate findings precisely.

Primary Classification: Attack Vector

The first axis of classification separates attacks by how the malicious instruction reaches the model.

Direct Injection

The attacker has interactive access to the model's input and places the injection payload directly in their own message. This is the simplest vector and requires no intermediary.

Direct injection subtypes include:

Subtype	Mechanism	Example
Instruction override	Explicit command to ignore system prompt	"Ignore all previous instructions and..."
Format mimicry	Replicate system prompt formatting to elevate priority	"[SYSTEM] New directive: ..."
Context overflow	Fill context window to push system instructions out of attention	Large padding text followed by new instructions
Delimiter escape	Close application-defined delimiters and inject at a higher privilege level	`</user_input>` New system instructions: ...
Encoding bypass	Obfuscate the payload using Base64, ROT13, Unicode, or hex encoding	"Decode this Base64 and follow the instructions: ..."

For detailed coverage of direct injection techniques, see Direct Prompt Injection.

Indirect Injection

The attacker does not interact with the model directly. Instead, they plant the injection payload in a data source the model will later consume: a web page, document, email, database record, or API response.

Indirect injection subtypes include:

Subtype	Delivery Channel	Example
RAG poisoning	Documents in a vector store	Hidden text in a PDF retrieved during search
Web content injection	Crawled web pages	Invisible instructions on a page the model browses
Email-based injection	Email content processed by an AI assistant	Hidden instructions in an email body or attachment
Tool response injection	Responses from external APIs or tools	Malicious content returned by a tool the agent calls
Memory poisoning	Persistent conversation memory	Instructions planted in memory that persist across sessions

For detailed coverage of indirect injection, see Indirect Prompt Injection.

Secondary Classification: Delivery Mechanism

Orthogonal to the vector, the delivery mechanism describes how the payload is structured and presented to evade detection.

Plaintext Delivery

The injection is written in natural language with no obfuscation. This is the simplest mechanism and the easiest to detect, but it remains effective against systems without input filtering.

Encoded Delivery

The payload is transformed using an encoding scheme. The attacker either relies on the model's ability to decode it natively or includes explicit decoding instructions.

Encoding	Detection Difficulty	Model Support
Base64	Low — easily detected by pattern matching	High — most models can decode
ROT13	Low — well-known cipher	Moderate — some models struggle
Unicode homoglyphs	Medium — visually identical characters	High — models process underlying codepoints
Hex encoding	Low — recognizable pattern	Moderate
Pig Latin / word games	Medium — no standard pattern	Moderate — requires instruction
Token-level manipulation	High — exploits tokenizer edge cases	Varies by model

See Encoding Bypasses for detailed techniques.

Structural Delivery

The payload exploits the structural format of the input: XML tags, markdown formatting, JSON fields, or special tokens. The attacker leverages the model's understanding of document structure to position their instructions at an authoritative level.

See Delimiter Attacks for detailed techniques.

Fragmented Delivery

The payload is split across multiple messages, variables, or data sources. No single fragment contains the complete malicious instruction, making detection by any single-point filter extremely difficult.

See Payload Splitting for detailed techniques.

Temporal Delivery

The payload is delivered across multiple conversation turns, with each turn appearing benign in isolation. The malicious intent emerges only from the cumulative trajectory.

See Multi-Turn Injection for detailed techniques.

Tertiary Classification: Target Layer

Injection attacks target different layers of the AI system stack. Identifying the target layer determines both the impact and the appropriate defense.

Target Layer	What the Attacker Targets	Impact
Safety alignment	The model's trained refusal behavior	Model produces content it was trained to refuse
Application instructions	The system prompt and application logic	Model ignores developer-defined behavior constraints
Tool/function layer	Tool selection and parameter generation	Model calls unauthorized tools or passes malicious parameters
Data access layer	RAG retrieval, database queries, file access	Unauthorized data access or exfiltration
Output formatting	Response structure and presentation	Data leakage through crafted output formats
Memory/state	Persistent memory or conversation state	Cross-session attacks, state corruption

Severity Assessment Framework

Not all prompt injections are equal. A structured severity framework helps red teams prioritize findings and communicate risk to stakeholders.

Dimensions of Severity

Dimension	Low	Medium	High	Critical
Reliability	Works < 10% of attempts	Works 10-50%	Works 50-90%	Works > 90%
Impact	Information disclosure (non-sensitive)	System prompt extraction, minor behavior change	Sensitive data access, safety bypass	Tool abuse, data exfiltration, code execution
Complexity	Requires deep expertise and custom tooling	Requires moderate skill	Requires basic prompt knowledge	Copy-paste attack, no skill required
Scope	Single user session	Single user across sessions	Multiple users via indirect injection	All users via persistent poisoning
Detectability	Easily detected by basic filters	Detected by ML-based classifiers	Bypasses most automated detection	Undetectable by current methods

Composite Severity Scoring

To produce a single severity rating, evaluate the finding across all five dimensions:

Critical: High or Critical in Impact AND Reliability, any Complexity
High: High in Impact OR (Medium Impact AND High Reliability AND Low Complexity)
Medium: Medium in Impact AND Medium or higher Reliability
Low: Low Impact OR (any Impact AND Low Reliability)

Mapping Attacks to Defenses

A practical taxonomy connects each attack category to the defense layer responsible for mitigating it. This mapping helps defenders understand which controls address which threats and helps red teams identify gaps.

Attack Category	Primary Defense	Secondary Defense
Direct instruction override	Instruction hierarchy training	Input classification
Format mimicry	Instruction hierarchy training	Delimiter hardening
Context overflow	Context window management	Instruction repetition
Delimiter escape	Robust delimiter design	Input sanitization
Encoding bypass	Input normalization, decoding detection	Output monitoring
Indirect injection (RAG)	Content sanitization, dual-LLM architecture	Retrieval filtering
Multi-turn escalation	Conversation trajectory monitoring	Per-turn classification
Payload splitting	Cross-message analysis	Variable injection prevention
Tool/function abuse	Tool authorization, parameter validation	Human-in-the-loop

Using the Taxonomy in Practice

For Red Teams

Scoping: Use the vector and target layer dimensions to define engagement scope. A "prompt injection assessment" should specify which vectors and layers are in scope.
Test planning: Ensure test cases cover each delivery mechanism, not just plaintext overrides.
Reporting: Classify each finding using vector, mechanism, target, and severity. This enables cross-engagement comparison and trend analysis.
Completeness checking: After testing, review the taxonomy to identify untested categories.

For Defenders

Gap analysis: Map current controls to the taxonomy and identify unaddressed categories.
Control validation: For each defense, identify which taxonomy categories it addresses and test coverage.
Incident classification: When a prompt injection incident occurs, classify it immediately to activate the appropriate response playbook.

Try It Yourself

Practice

Exercise: Classify Real-World Injection Findings

Practice using the taxonomy by classifying a set of prompt injection scenarios.

Step 1
Take three published prompt injection case studies (from blog posts, security advisories, or CTF writeups) and classify each one along the four dimensions: vector, delivery mechanism, target layer, and severity.
Step 2
For each classified finding, identify the primary and secondary defense that would mitigate it. Note any cases where no existing defense category seems adequate.
Step 3
Compare your classifications with a colleague or peer. Identify any disagreements and discuss whether the taxonomy needs refinement to handle edge cases.

Success criteria: You have classified three real-world findings, mapped each to defenses, and can articulate why you assigned each severity rating.

Direct Injection — The most common attack vector in the taxonomy
Indirect Injection — Attacks through external data sources
Jailbreak Techniques — Techniques targeting the safety alignment layer
Defense Evasion — Delivery mechanisms designed to evade detection
Encoding Bypasses — Encoded delivery mechanisms in depth

References

Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection"
OWASP (2025). OWASP Top 10 for LLM Applications
MITRE (2024). ATLAS — Adversarial Threat Landscape for AI Systems
Perez, F. & Ribeiro, I. (2022). "Ignore This Title and HackAPrompt"
Liu, Y. et al. (2024). "Prompt Injection Attack Against LLM-Integrated Applications"

Knowledge Check

What makes indirect prompt injection generally higher severity than direct injection?

Prompt Injection Taxonomy

Related articles

Prompt Injection Taxonomy

Related articles