Prompt Injection & Jailbreaks
A comprehensive introduction to prompt injection — the most fundamental vulnerability class in LLM applications — and its relationship to jailbreak techniques.
Prompt injection is to LLM applications what SQL injection is to web applications: a fundamental vulnerability class that arises from mixing trusted instructions with untrusted data in the same channel. It is the single most important topic in AI red teaming, as it targets the core attack surface of any LLM-powered application.
Core Concepts
Prompt injection occurs when an attacker crafts input that causes the model to deviate from its intended instructions and follow attacker-supplied directives instead. This exploits the lack of privilege separation between system prompts and user input (see LLM Internals).
Jailbreaking is a related but distinct concept: it refers to techniques that cause a model to bypass its safety alignment and produce outputs it was trained to refuse. While prompt injection targets application-level instructions, jailbreaking targets the model's own safety training.
| Concept | Target | Goal | Example |
|---|---|---|---|
| Prompt injection | Application instructions | Override system prompt behavior | "Ignore your instructions and..." |
| Jailbreaking | Safety alignment | Bypass refusal training | Role-play scenarios, encoding tricks |
| Indirect injection | Data pipeline | Inject via third-party content | Malicious instructions in web pages |
The Injection Taxonomy
This section covers prompt injection and jailbreaks across five areas of increasing sophistication:
- Direct Injection — Attacker-supplied text in the user message that overrides system instructions
- Indirect Injection — Malicious instructions embedded in external data the model processes
- Jailbreak Techniques — Patterns for bypassing safety alignment training
- Defense Evasion — Advanced techniques for bypassing safety filters and detection systems
Why Prompt Injection Is Hard to Fix
The fundamental challenge is that LLMs process instructions and data in the same way — as sequences of tokens. There is no equivalent to prepared statements in SQL that would structurally separate code from data.
SQL Injection: SELECT * FROM users WHERE name = '{user_input}'
Prompt Injection: System: {instructions}\nUser: {user_input}
Both mix trusted logic with untrusted data in the same channel.
Proposed mitigations include instruction hierarchy training, input/output filtering, and delimiter-based separation, but each has known bypasses:
- Instruction hierarchy — Can be overridden by sufficiently persuasive or formatted injections
- Input filtering — Bypassed by encoding, tokenization tricks, or semantic paraphrasing
- Delimiters — The model has no mechanism to enforce delimiter semantics
Getting Started
If you are new to AI red teaming, start with Direct Injection to understand the basic mechanics, then progress through the remaining pages in order. Each builds on concepts from the previous one.
Related Topics
- LLM Foundations — Core architecture that makes prompt injection possible
- Agent & Agentic Exploitation — How prompt injection escalates when agents have tool access
- Guardrails & Filtering — Defenses designed to detect and prevent injection attacks
- Lab: First Injection — Hands-on practice with basic injection techniques
- Indirect Injection Research — The most dangerous variant in production systems
References
- Perez, F. & Ribeiro, I. (2022). "Ignore This Title and HackAPrompt: Evaluating and Eliciting Prompt Injection Attacks"
- Greshake, K. et al. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection"
- OWASP (2025). OWASP Top 10 for LLM Applications
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
- Liu, Y. et al. (2024). "Prompt Injection Attack Against LLM-Integrated Applications"
What is the fundamental difference between prompt injection and jailbreaking?