Prompt Injection Taxonomy
A comprehensive classification framework for prompt injection attacks, covering direct and indirect vectors, delivery mechanisms, target layers, and severity assessment for systematic red team testing.
Prompt injection is not a single attack but a broad family of techniques. Without a shared taxonomy, red teams describe the same attack differently, defenders build controls for one variant while missing others, and risk assessments conflate low-impact curiosities with critical exploits. This page provides a classification framework that brings structure to the landscape.
Why Taxonomy Matters
Consider two findings from a red team engagement:
- A user types "ignore previous instructions" and the chatbot reveals its system prompt.
- An attacker plants a hidden instruction in a PDF that, when retrieved by the chatbot's RAG pipeline, causes it to exfiltrate user data to an external URL via a tool call.
Both are "prompt injection," but they differ in vector, delivery mechanism, target layer, and severity. A taxonomy that captures these dimensions lets teams prioritize remediation, track trends across engagements, and communicate findings precisely.
Primary Classification: Attack Vector
The first axis of classification separates attacks by how the malicious instruction reaches the model.
Direct Injection
The attacker has interactive access to the model's input and places the injection payload directly in their own message. This is the simplest vector and requires no intermediary.
Direct injection subtypes include:
| Subtype | Mechanism | Example |
|---|---|---|
| Instruction override | Explicit command to ignore system prompt | "Ignore all previous instructions and..." |
| Format mimicry | Replicate system prompt formatting to elevate priority | "[SYSTEM] New directive: ..." |
| Context overflow | Fill context window to push system instructions out of attention | Large padding text followed by new instructions |
| Delimiter escape | Close application-defined delimiters and inject at a higher privilege level | </user_input> New system instructions: ... |
| Encoding bypass | Obfuscate the payload using Base64, ROT13, Unicode, or hex encoding | "Decode this Base64 and follow the instructions: ..." |
For detailed coverage of direct injection techniques, see Direct Prompt Injection.
Indirect Injection
The attacker does not interact with the model directly. Instead, they plant the injection payload in a data source the model will later consume: a web page, document, email, database record, or API response.
Indirect injection subtypes include:
| Subtype | Delivery Channel | Example |
|---|---|---|
| RAG poisoning | Documents in a vector store | Hidden text in a PDF retrieved during search |
| Web content injection | Crawled web pages | Invisible instructions on a page the model browses |
| Email-based injection | Email content processed by an AI assistant | Hidden instructions in an email body or attachment |
| Tool response injection | Responses from external APIs or tools | Malicious content returned by a tool the agent calls |
| Memory poisoning | Persistent conversation memory | Instructions planted in memory that persist across sessions |
For detailed coverage of indirect injection, see Indirect Prompt Injection.
Secondary Classification: Delivery Mechanism
Orthogonal to the vector, the delivery mechanism describes how the payload is structured and presented to evade detection.
Plaintext Delivery
The injection is written in natural language with no obfuscation. This is the simplest mechanism and the easiest to detect, but it remains effective against systems without input filtering.
Encoded Delivery
The payload is transformed using an encoding scheme. The attacker either relies on the model's ability to decode it natively or includes explicit decoding instructions.
| Encoding | Detection Difficulty | Model Support |
|---|---|---|
| Base64 | Low — easily detected by pattern matching | High — most models can decode |
| ROT13 | Low — well-known cipher | Moderate — some models struggle |
| Unicode homoglyphs | Medium — visually identical characters | High — models process underlying codepoints |
| Hex encoding | Low — recognizable pattern | Moderate |
| Pig Latin / word games | Medium — no standard pattern | Moderate — requires instruction |
| Token-level manipulation | High — exploits tokenizer edge cases | Varies by model |
See Encoding Bypasses for detailed techniques.
Structural Delivery
The payload exploits the structural format of the input: XML tags, markdown formatting, JSON fields, or special tokens. The attacker leverages the model's understanding of document structure to position their instructions at an authoritative level.
See Delimiter Attacks for detailed techniques.
Fragmented Delivery
The payload is split across multiple messages, variables, or data sources. No single fragment contains the complete malicious instruction, making detection by any single-point filter extremely difficult.
See Payload Splitting for detailed techniques.
Temporal Delivery
The payload is delivered across multiple conversation turns, with each turn appearing benign in isolation. The malicious intent emerges only from the cumulative trajectory.
See Multi-Turn Injection for detailed techniques.
Tertiary Classification: Target Layer
Injection attacks target different layers of the AI system stack. Identifying the target layer determines both the impact and the appropriate defense.
| Target Layer | What the Attacker Targets | Impact |
|---|---|---|
| Safety alignment | The model's trained refusal behavior | Model produces content it was trained to refuse |
| Application instructions | The system prompt and application logic | Model ignores developer-defined behavior constraints |
| Tool/function layer | Tool selection and parameter generation | Model calls unauthorized tools or passes malicious parameters |
| Data access layer | RAG retrieval, database queries, file access | Unauthorized data access or exfiltration |
| Output formatting | Response structure and presentation | Data leakage through crafted output formats |
| Memory/state | Persistent memory or conversation state | Cross-session attacks, state corruption |
Severity Assessment Framework
Not all prompt injections are equal. A structured severity framework helps red teams prioritize findings and communicate risk to stakeholders.
Dimensions of Severity
| Dimension | Low | Medium | High | Critical |
|---|---|---|---|---|
| Reliability | Works < 10% of attempts | Works 10-50% | Works 50-90% | Works > 90% |
| Impact | Information disclosure (non-sensitive) | System prompt extraction, minor behavior change | Sensitive data access, safety bypass | Tool abuse, data exfiltration, code execution |
| Complexity | Requires deep expertise and custom tooling | Requires moderate skill | Requires basic prompt knowledge | Copy-paste attack, no skill required |
| Scope | Single user session | Single user across sessions | Multiple users via indirect injection | All users via persistent poisoning |
| Detectability | Easily detected by basic filters | Detected by ML-based classifiers | Bypasses most automated detection | Undetectable by current methods |
Composite Severity Scoring
To produce a single severity rating, evaluate the finding across all five dimensions:
- Critical: High or Critical in Impact AND Reliability, any Complexity
- High: High in Impact OR (Medium Impact AND High Reliability AND Low Complexity)
- Medium: Medium in Impact AND Medium or higher Reliability
- Low: Low Impact OR (any Impact AND Low Reliability)
Mapping Attacks to Defenses
A practical taxonomy connects each attack category to the defense layer responsible for mitigating it. This mapping helps defenders understand which controls address which threats and helps red teams identify gaps.
| Attack Category | Primary Defense | Secondary Defense |
|---|---|---|
| Direct instruction override | Instruction hierarchy training | Input classification |
| Format mimicry | Instruction hierarchy training | Delimiter hardening |
| Context overflow | Context window management | Instruction repetition |
| Delimiter escape | Robust delimiter design | Input sanitization |
| Encoding bypass | Input normalization, decoding detection | Output monitoring |
| Indirect injection (RAG) | Content sanitization, dual-LLM architecture | Retrieval filtering |
| Multi-turn escalation | Conversation trajectory monitoring | Per-turn classification |
| Payload splitting | Cross-message analysis | Variable injection prevention |
| Tool/function abuse | Tool authorization, parameter validation | Human-in-the-loop |
Using the Taxonomy in Practice
For Red Teams
- Scoping: Use the vector and target layer dimensions to define engagement scope. A "prompt injection assessment" should specify which vectors and layers are in scope.
- Test planning: Ensure test cases cover each delivery mechanism, not just plaintext overrides.
- Reporting: Classify each finding using vector, mechanism, target, and severity. This enables cross-engagement comparison and trend analysis.
- Completeness checking: After testing, review the taxonomy to identify untested categories.
For Defenders
- Gap analysis: Map current controls to the taxonomy and identify unaddressed categories.
- Control validation: For each defense, identify which taxonomy categories it addresses and test coverage.
- Incident classification: When a prompt injection incident occurs, classify it immediately to activate the appropriate response playbook.
Try It Yourself
Related Topics
- Direct Injection — The most common attack vector in the taxonomy
- Indirect Injection — Attacks through external data sources
- Jailbreak Techniques — Techniques targeting the safety alignment layer
- Defense Evasion — Delivery mechanisms designed to evade detection
- Encoding Bypasses — Encoded delivery mechanisms in depth
References
- Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection"
- OWASP (2025). OWASP Top 10 for LLM Applications
- MITRE (2024). ATLAS — Adversarial Threat Landscape for AI Systems
- Perez, F. & Ribeiro, I. (2022). "Ignore This Title and HackAPrompt"
- Liu, Y. et al. (2024). "Prompt Injection Attack Against LLM-Integrated Applications"
What makes indirect prompt injection generally higher severity than direct injection?