Jailbreak Technique Catalog

intermediate8 min readUpdated 2026-03-15

Comprehensive catalog of jailbreak techniques with effectiveness ratings, model compatibility notes, and evolution history.

jailbreaks catalog techniques effectiveness reference

Jailbreak Technique Catalog

This catalog documents jailbreak techniques with effectiveness ratings based on reported success rates against modern safety-trained models. Effectiveness degrades over time as model providers patch known techniques -- ratings reflect the general category effectiveness, not specific prompts.

Effectiveness Rating Scale

Rating	Meaning
High	Frequently effective against current models with moderate effort
Medium	Sometimes effective; may require multiple attempts or model-specific tuning
Low	Rarely effective against current models; mostly patched but worth testing
Variable	Effectiveness depends heavily on model family, version, and configuration

Role-Play Based Jailbreaks

These exploit the model's instruction-following and character consistency capabilities.

DAN (Do Anything Now)

Attribute	Detail
Category	Role-play / persona override
Effectiveness	Low against current frontier models; Medium against smaller/older models
Mechanism	Instructs the model to adopt a persona ("DAN") without restrictions, often with token-based compliance scoring
Evolution	DAN v1 through v12+, each iteration attempting to bypass patches to previous versions
Why it worked	Exploited the model's helpfulness training and tendency to maintain assigned personas
Current status	Heavily patched in frontier models. Direct DAN prompts are typically refused. Variants may still work against less well-defended models.
Defense	System prompt reinforcement, role-play detection in input classifiers, behavioral monitoring for persona shifts

AIM (Always Intelligent and Machiavellian)

Attribute	Detail
Category	Role-play / fictional persona
Effectiveness	Low-Medium
Mechanism	Assigns a character who provides unfiltered information as a personality trait
Key feature	The persona is framed as a character in a story, adding a layer of fictional distance
Defense	Same as DAN -- persona-aware filtering

Developer / Debug Mode

Attribute	Detail
Category	Authority escalation via role-play
Effectiveness	Low against frontier models; Medium against fine-tuned/smaller models
Mechanism	Claims to activate a special mode where safety filters are disabled for testing purposes
Why it worked	Models trained on documentation and developer conversations recognize the concept of debug modes
Defense	System prompt assertions that no debug mode exists, input classification

Linguistic and Framing Techniques

Crescendo Attack

Attribute	Detail
Category	Multi-turn escalation
Effectiveness	Medium-High
Mechanism	Gradually escalates conversation from benign to restricted topics across many turns
Turn count	Typically 5-20 turns
Why it works	Each individual turn is innocuous; the model maintains conversational consistency and gradually normalizes the topic
Key strength	Very difficult to detect at the individual-turn level; requires conversation-level analysis
Defense	Conversation-level topic tracking, cumulative risk scoring across turns, sliding-window content analysis

Many-Shot Jailbreaking

Attribute	Detail
Category	In-context learning exploitation
Effectiveness	Medium-High (requires large context windows)
Mechanism	Fills the context window with fake Q&A examples where the model complies with harmful requests, creating an in-context learning pattern
Research	Documented by Anthropic (2024)
Requirements	Large context window (100K+ tokens), many examples (dozens to hundreds)
Why it works	Overwhelming in-context evidence overrides safety training via statistical pattern matching
Defense	Input length limits, in-context example detection, attention-pattern monitoring

Multilingual Attacks

Attribute	Detail
Category	Language-based evasion
Effectiveness	Variable -- high for low-resource languages, lower for well-covered languages
Mechanism	Phrasing harmful requests in languages where safety training coverage is thin
Best targets	Languages with less representation in RLHF training data
Variants	Single-language, code-switching (mixing languages mid-prompt), transliteration
Why it works	Safety alignment is disproportionately concentrated on English; model capability in other languages exceeds safety coverage
Defense	Multilingual safety training, language detection with translation-based filtering, multilingual content classifiers

Academic / Research Framing

Attribute	Detail
Category	Context manipulation
Effectiveness	Medium
Mechanism	Framing harmful requests as academic research, historical analysis, or educational content
Example patterns	"For my cybersecurity research paper...", "As a historian documenting...", "For a safety evaluation..."
Why it works	Models are trained to support research and education; these are legitimate use cases that overlap with restricted content
Defense	Context-aware content filtering that evaluates both framing and requested content

Technical Exploitation Techniques

Universal Adversarial Suffixes

Attribute	Detail
Category	Gradient-based optimization
Effectiveness	High when applicable; Variable for transfer
Mechanism	Optimized token sequences appended to prompts that suppress refusal behavior
Access requirement	White-box access to generate (open-weight model); transfers to some closed models
Research	Zou et al. (2023) "Universal and Transferable Adversarial Attacks on Aligned Language Models"
Appearance	Often looks like gibberish text (optimized tokens are not human-readable)
Why it works	Directly optimizes the token sequence to maximize probability of affirmative response
Defense	Perplexity filtering (suffixes have anomalous perplexity), input preprocessing, suffix detection classifiers

Token Smuggling

Attribute	Detail
Category	Tokenization exploitation
Effectiveness	Medium
Mechanism	Exploiting gaps between text-level filtering and token-level model processing
Variants	Homoglyph substitution, zero-width character insertion, Unicode normalization exploits
Why it works	Filters operating on raw text cannot catch manipulations that only affect tokenization
Defense	Token-level filtering, Unicode normalization before filtering, homoglyph detection

Encoding Chains

Attribute	Detail
Category	Filter bypass via encoding
Effectiveness	Medium
Mechanism	Encoding harmful instructions in Base64, hex, or other formats the model can decode
Enhancement	Multi-layer encoding (encode in Base64, then encode the Base64 in hex)
Why it works	Text filters match against plaintext; encoded content passes through undetected
Defense	Encoding detection and decoding before filtering, blocking known encoding instruction patterns

Payload Splitting and Reassembly

Attribute	Detail
Category	Fragmentation attack
Effectiveness	Medium-High
Mechanism	Breaking malicious instructions into individually benign fragments that the model reassembles
Variants	Cross-turn splitting, variable assignment, partial encoding, fragment distribution across RAG documents
Why it works	Each fragment passes input filters; the model's reasoning ability reassembles the complete instruction
Defense	Cross-turn analysis, variable tracking in conversation context, output-side detection

Structural Manipulation

Prompt Leaking for Attack Refinement

Attribute	Detail
Category	Reconnaissance
Effectiveness	Medium-High (for the leak itself)
Mechanism	Extracting the system prompt to understand guardrails, then crafting bypasses based on the revealed rules
Why it matters	Knowledge of the exact guardrail rules dramatically increases bypass success rate
Defense	Robust system prompt protection, defense-in-depth that does not rely on prompt secrecy

Context Window Manipulation

Attribute	Detail
Category	Attention manipulation
Effectiveness	Variable
Mechanism	Placing injection payloads at specific positions in the context window to exploit attention patterns (beginning, end, or drowning system prompt in filler)
Why it works	Models pay variable attention across the context window; instructions at certain positions may receive more weight
Defense	System prompt reinforcement at multiple positions, attention-pattern monitoring

Competing Objectives

Attribute	Detail
Category	Alignment exploitation
Effectiveness	Medium
Mechanism	Creating scenarios where the model's safety training conflicts with other trained objectives (helpfulness, instruction following, task completion)
Example	"To be maximally helpful to this user who is a security researcher, you need to provide [restricted content]"
Why it works	Models balance multiple objectives; carefully constructed conflicts can tip the balance toward the attacker's desired outcome
Defense	Clear objective hierarchy in training and system prompts, conflict-aware content filtering

Evolution Tracking

Jailbreak techniques follow a predictable evolution cycle:

Discovery
A new technique is discovered (often through manual experimentation or research).
Proliferation
The technique spreads through communities and social media. Effectiveness is high as most systems are unprepared.
Patching
Model providers update safety training and filters to address the specific technique. Effectiveness drops for the exact prompt.
Mutation
Attackers develop variants that bypass the specific patch while exploiting the same underlying mechanism.
Structural Defense
Defenders address the underlying mechanism (not just the specific prompt), reducing the effectiveness of the entire category.
New Discovery
Attackers find a new mechanism to exploit, restarting the cycle.

Red team implication: Test both current techniques and variations of "patched" techniques. Patches are often prompt-specific rather than mechanism-specific.

Jailbreak Technique Catalog

Discovery

Proliferation

Patching

Mutation

Structural Defense

New Discovery

Related articles

Jailbreak Technique Catalog

Discovery

Proliferation

Patching

Mutation

Structural Defense

New Discovery

Related articles