Jailbreak Research & Automation
Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.
Jailbreak Research & Automation
Jailbreaking bypasses an LLM's safety alignment to elicit behavior it was trained to refuse. Unlike prompt injection (which targets application-level trust boundaries), jailbreaking targets the model's own training. This page provides a systematic taxonomy of jailbreak primitives and covers automated generation with TAP and PAIR.
Jailbreak Categories
Jailbreak techniques fall into three broad categories based on how they subvert safety training. Understanding these categories helps you select the right approach for a given target.
Semantic jailbreaks manipulate the meaning and context of the request so the model perceives compliance as appropriate. Techniques include role-playing scenarios, fictional framing, academic recontextualization, and hypothetical premises. These attacks exploit the model's inability to distinguish genuine context from adversarial framing and are most effective against models with strong but brittle content classifiers.
Structural jailbreaks exploit the model's processing of input format rather than meaning. Techniques include output format locking (JSON, code completion, fill-in-the-blank), encoding schemes (Base64, ROT13, Unicode substitution), language mixing, and token boundary manipulation. These attacks bypass safety training by ensuring the refusal response does not fit the required output structure, or by presenting the payload in a representation the safety classifier was not trained on.
Multi-turn jailbreaks spread the attack across several conversation turns, exploiting context anchoring, consistency pressure, and attention dilution. Crescendo attacks gradually escalate from benign to harmful topics. Many-shot attacks flood the context with fake compliance examples. These approaches are most effective against models with strong single-turn safety but weak cross-turn boundary tracking, and they often succeed where single-turn attacks fail entirely.
Taxonomy of Jailbreak Primitives
Every jailbreak technique decomposes into one or more fundamental primitives. Thinking in primitives -- not templates -- is what separates systematic red teamers from script users. Templates get patched; primitives reflect fundamental tensions in how language models work.
| # | Primitive | Mechanism | Standalone Effectiveness |
|---|---|---|---|
| 1 | Role Reassignment | Replace model identity with one that has different behavioral norms (e.g., "You are DAN") | Low against modern models |
| 2 | Context Manipulation | Reframe request so compliance seems appropriate (academic, fiction, hypothetical) | Medium -- changes perceived intent |
| 3 | Instruction Override | Directly instruct model to ignore safety ("Enable developer mode") | Weak in isolation -- safety usually wins |
| 4 | Gradual Escalation | Start benign, escalate across turns -- exploits autoregressive consistency pressure | High in multi-turn settings |
| 5 | Output Format Manipulation | Lock output format so refusal does not fit (JSON completion, fill-in-the-blank, translation) | Medium -- refusal cannot fit the format |
| 6 | Token Manipulation | Bypass token-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing) | Variable -- depends on model's decoder |
Primitive Composition Methodology
- Select 2-3 primitives that attack different aspects of safety (e.g., role + format + encoding)
- Order them: context-setting first, payload in the middle, format constraint last
- Generate all combinations up to your chosen max size using
itertools.combinations - Batch-test against the target model and record ASR per combination
from itertools import combinations
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
for combo in combinations(primitives, r):
prompt = compose_primitives(combo, payload)
# test against target and record resultCrescendo Attacks
Crescendo attacks (Microsoft, 2024) exploit multi-turn conversations to gradually escalate from benign to harmful topics.
Why Crescendo Works
| Factor | How It Helps the Attacker |
|---|---|
| Context anchoring | Early benign turns establish a context that makes later escalation seem natural |
| Consistency pressure | Model feels pressure to remain consistent with prior helpfulness |
| Attention dilution | Long conversation histories dilute the system prompt's influence |
| Semantic drift | Gradual topic shifts avoid triggering sharp safety boundaries |
Crescendo Execution Checklist
- Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
- Execute sequentially -- send each turn, evaluate compliance (0-1 score) before continuing
- Recover from refusals -- if the model refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
- Score with an LLM judge -- keyword-based refusal detection is a quick heuristic, but use a judge LLM for production accuracy
- Record the minimum turn count where compliance was achieved for reporting
Many-Shot Jailbreaking
Many-shot jailbreaking (Anthropic, 2024) overrides safety training by flooding the context with fake compliance examples.
Many-Shot Methodology
- Generate fake Q&A pairs -- create 25-100+ synthetic examples showing the model happily complying with sensitive requests
- Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
- Append the real payload as the final question in the transcript
- Test shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where safety breaks down
| Shot Count | Typical Effect | Token Cost |
|---|---|---|
| 5-10 | Rarely effective against modern models | Low |
| 25-50 | Some models begin to comply | Medium |
| 50-100 | Most models show degraded safety | High |
| 100+ | High ASR but very expensive | Very high |
TAP (Tree of Attacks with Pruning)
TAP (Mehrotra et al., 2023) automates jailbreak discovery through tree search with three separate LLMs.
TAP Architecture
| Component | Role | Temperature |
|---|---|---|
| Attacker LLM | Generates and refines attack candidates | High (1.0) for diversity |
| Target LLM | The model under test | Low (0.0) for consistency |
| Judge LLM | Scores responses 1-10 on compliance | Low (0.0) for reliability |
TAP Search Algorithm
- Generate initial candidates -- attacker LLM produces N diverse attack prompts (branching factor)
- Test each candidate against the target model
- Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
- Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
- Refine candidates scoring 3-7 by telling the attacker what the previous attempt scored and why
- Declare success when any branch scores at or above the threshold (typically 8/10)
- Repeat until success or max depth is reached
# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
attack, depth = queue.pop(0)
response = query_target(attack)
score = judge(goal, attack, response)
if score >= threshold: return success
if score < 3: continue # prune
children = refine(attack, response, score)
queue.extend((c, depth+1) for c in children)PAIR (Prompt Automatic Iterative Refinement)
PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.
TAP vs. PAIR Comparison
| Dimension | TAP | PAIR |
|---|---|---|
| Search strategy | Breadth-first tree search | Depth-first single chain |
| API calls | More (branching at each node) | Fewer (one path at a time) |
| Best for | Novel/heavily defended targets | Quick probing of known patterns |
| Risk | Higher compute cost | May get stuck in local optima |
| Typical queries to success | 50-200 | 10-40 |
PAIR Execution Checklist
- Initialize attacker LLM with a system prompt establishing its red-teaming role
- Generate first attack prompt for the given goal
- Send to target model and capture the response
- Judge the response (LLM judge preferred; keyword heuristic as fallback)
- Feed score + response back to attacker with instructions to improve
- Iterate until score meets threshold or max iterations reached (typically 20)
Lab: Jailbreak Tournament
Select target behaviors
Choose 10 diverse targets spanning different safety categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type for each.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total token cost, and time elapsed.
Analyze and compare
Identify which attack classes work best for which safety categories. Look for patterns: does crescendo beat consistency-based safety? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.
A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?
Related Topics
- Fuzzing LLM Safety Boundaries -- Automated discovery of jailbreak variants through grammar and evolutionary fuzzing
- Advanced Prompt Injection -- Injection techniques that complement jailbreak primitives
- Alignment Bypass -- The alignment mechanisms that jailbreaks target
- AI Exploit Development -- Systematic exploit development incorporating jailbreak discovery
- CART Pipelines -- Continuous testing to detect jailbreak regressions
References
- Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
- Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
- Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack" (Microsoft, 2024)
- Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (2023)