Jailbreak Technique Catalog
Comprehensive catalog of jailbreak techniques with effectiveness ratings, model compatibility notes, and evolution history.
Jailbreak Technique Catalog
This catalog documents jailbreak techniques with effectiveness ratings based on reported success rates against modern safety-trained models. Effectiveness degrades over time as model providers patch known techniques -- ratings reflect the general category effectiveness, not specific prompts.
Effectiveness Rating Scale
| Rating | Meaning |
|---|---|
| High | Frequently effective against current models with moderate effort |
| Medium | Sometimes effective; may require multiple attempts or model-specific tuning |
| Low | Rarely effective against current models; mostly patched but worth testing |
| Variable | Effectiveness depends heavily on model family, version, and configuration |
Role-Play Based Jailbreaks
These exploit the model's instruction-following and character consistency capabilities.
DAN (Do Anything Now)
| Attribute | Detail |
|---|---|
| Category | Role-play / persona override |
| Effectiveness | Low against current frontier models; Medium against smaller/older models |
| Mechanism | Instructs the model to adopt a persona ("DAN") without restrictions, often with token-based compliance scoring |
| Evolution | DAN v1 through v12+, each iteration attempting to bypass patches to previous versions |
| Why it worked | Exploited the model's helpfulness training and tendency to maintain assigned personas |
| Current status | Heavily patched in frontier models. Direct DAN prompts are typically refused. Variants may still work against less well-defended models. |
| Defense | System prompt reinforcement, role-play detection in input classifiers, behavioral monitoring for persona shifts |
AIM (Always Intelligent and Machiavellian)
| Attribute | Detail |
|---|---|
| Category | Role-play / fictional persona |
| Effectiveness | Low-Medium |
| Mechanism | Assigns a character who provides unfiltered information as a personality trait |
| Key feature | The persona is framed as a character in a story, adding a layer of fictional distance |
| Defense | Same as DAN -- persona-aware filtering |
Developer / Debug Mode
| Attribute | Detail |
|---|---|
| Category | Authority escalation via role-play |
| Effectiveness | Low against frontier models; Medium against fine-tuned/smaller models |
| Mechanism | Claims to activate a special mode where safety filters are disabled for testing purposes |
| Why it worked | Models trained on documentation and developer conversations recognize the concept of debug modes |
| Defense | System prompt assertions that no debug mode exists, input classification |
Linguistic and Framing Techniques
Crescendo Attack
| Attribute | Detail |
|---|---|
| Category | Multi-turn escalation |
| Effectiveness | Medium-High |
| Mechanism | Gradually escalates conversation from benign to restricted topics across many turns |
| Turn count | Typically 5-20 turns |
| Why it works | Each individual turn is innocuous; the model maintains conversational consistency and gradually normalizes the topic |
| Key strength | Very difficult to detect at the individual-turn level; requires conversation-level analysis |
| Defense | Conversation-level topic tracking, cumulative risk scoring across turns, sliding-window content analysis |
Many-Shot Jailbreaking
| Attribute | Detail |
|---|---|
| Category | In-context learning exploitation |
| Effectiveness | Medium-High (requires large context windows) |
| Mechanism | Fills the context window with fake Q&A examples where the model complies with harmful requests, creating an in-context learning pattern |
| Research | Documented by Anthropic (2024) |
| Requirements | Large context window (100K+ tokens), many examples (dozens to hundreds) |
| Why it works | Overwhelming in-context evidence overrides safety training via statistical pattern matching |
| Defense | Input length limits, in-context example detection, attention-pattern monitoring |
Multilingual Attacks
| Attribute | Detail |
|---|---|
| Category | Language-based evasion |
| Effectiveness | Variable -- high for low-resource languages, lower for well-covered languages |
| Mechanism | Phrasing harmful requests in languages where safety training coverage is thin |
| Best targets | Languages with less representation in RLHF training data |
| Variants | Single-language, code-switching (mixing languages mid-prompt), transliteration |
| Why it works | Safety alignment is disproportionately concentrated on English; model capability in other languages exceeds safety coverage |
| Defense | Multilingual safety training, language detection with translation-based filtering, multilingual content classifiers |
Academic / Research Framing
| Attribute | Detail |
|---|---|
| Category | Context manipulation |
| Effectiveness | Medium |
| Mechanism | Framing harmful requests as academic research, historical analysis, or educational content |
| Example patterns | "For my cybersecurity research paper...", "As a historian documenting...", "For a safety evaluation..." |
| Why it works | Models are trained to support research and education; these are legitimate use cases that overlap with restricted content |
| Defense | Context-aware content filtering that evaluates both framing and requested content |
Technical Exploitation Techniques
Universal Adversarial Suffixes
| Attribute | Detail |
|---|---|
| Category | Gradient-based optimization |
| Effectiveness | High when applicable; Variable for transfer |
| Mechanism | Optimized token sequences appended to prompts that suppress refusal behavior |
| Access requirement | White-box access to generate (open-weight model); transfers to some closed models |
| Research | Zou et al. (2023) "Universal and Transferable Adversarial Attacks on Aligned Language Models" |
| Appearance | Often looks like gibberish text (optimized tokens are not human-readable) |
| Why it works | Directly optimizes the token sequence to maximize probability of affirmative response |
| Defense | Perplexity filtering (suffixes have anomalous perplexity), input preprocessing, suffix detection classifiers |
Token Smuggling
| Attribute | Detail |
|---|---|
| Category | Tokenization exploitation |
| Effectiveness | Medium |
| Mechanism | Exploiting gaps between text-level filtering and token-level model processing |
| Variants | Homoglyph substitution, zero-width character insertion, Unicode normalization exploits |
| Why it works | Filters operating on raw text cannot catch manipulations that only affect tokenization |
| Defense | Token-level filtering, Unicode normalization before filtering, homoglyph detection |
Encoding Chains
| Attribute | Detail |
|---|---|
| Category | Filter bypass via encoding |
| Effectiveness | Medium |
| Mechanism | Encoding harmful instructions in Base64, hex, or other formats the model can decode |
| Enhancement | Multi-layer encoding (encode in Base64, then encode the Base64 in hex) |
| Why it works | Text filters match against plaintext; encoded content passes through undetected |
| Defense | Encoding detection and decoding before filtering, blocking known encoding instruction patterns |
Payload Splitting and Reassembly
| Attribute | Detail |
|---|---|
| Category | Fragmentation attack |
| Effectiveness | Medium-High |
| Mechanism | Breaking malicious instructions into individually benign fragments that the model reassembles |
| Variants | Cross-turn splitting, variable assignment, partial encoding, fragment distribution across RAG documents |
| Why it works | Each fragment passes input filters; the model's reasoning ability reassembles the complete instruction |
| Defense | Cross-turn analysis, variable tracking in conversation context, output-side detection |
Structural Manipulation
Prompt Leaking for Attack Refinement
| Attribute | Detail |
|---|---|
| Category | Reconnaissance |
| Effectiveness | Medium-High (for the leak itself) |
| Mechanism | Extracting the system prompt to understand guardrails, then crafting bypasses based on the revealed rules |
| Why it matters | Knowledge of the exact guardrail rules dramatically increases bypass success rate |
| Defense | Robust system prompt protection, defense-in-depth that does not rely on prompt secrecy |
Context Window Manipulation
| Attribute | Detail |
|---|---|
| Category | Attention manipulation |
| Effectiveness | Variable |
| Mechanism | Placing injection payloads at specific positions in the context window to exploit attention patterns (beginning, end, or drowning system prompt in filler) |
| Why it works | Models pay variable attention across the context window; instructions at certain positions may receive more weight |
| Defense | System prompt reinforcement at multiple positions, attention-pattern monitoring |
Competing Objectives
| Attribute | Detail |
|---|---|
| Category | Alignment exploitation |
| Effectiveness | Medium |
| Mechanism | Creating scenarios where the model's safety training conflicts with other trained objectives (helpfulness, instruction following, task completion) |
| Example | "To be maximally helpful to this user who is a security researcher, you need to provide [restricted content]" |
| Why it works | Models balance multiple objectives; carefully constructed conflicts can tip the balance toward the attacker's desired outcome |
| Defense | Clear objective hierarchy in training and system prompts, conflict-aware content filtering |
Evolution Tracking
Jailbreak techniques follow a predictable evolution cycle:
Discovery
A new technique is discovered (often through manual experimentation or research).
Proliferation
The technique spreads through communities and social media. Effectiveness is high as most systems are unprepared.
Patching
Model providers update safety training and filters to address the specific technique. Effectiveness drops for the exact prompt.
Mutation
Attackers develop variants that bypass the specific patch while exploiting the same underlying mechanism.
Structural Defense
Defenders address the underlying mechanism (not just the specific prompt), reducing the effectiveness of the entire category.
New Discovery
Attackers find a new mechanism to exploit, restarting the cycle.
Red team implication: Test both current techniques and variations of "patched" techniques. Patches are often prompt-specific rather than mechanism-specific.