Jailbreak Research & Automation

expert12 min readUpdated 2026-03-12

Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.

jailbreaks crescendo many-shot skeleton-key TAP PAIR automation

Jailbreak Research & Automation

Jailbreaking bypasses an LLM's safety alignment to elicit behavior it was trained to refuse. Unlike prompt injection (which targets application-level trust boundaries), jailbreaking targets the model's own training. This page provides a systematic taxonomy of jailbreak primitives and covers automated generation with TAP and PAIR.

Jailbreak Categories

Jailbreak techniques fall into three broad categories based on how they subvert safety training. Understanding these categories helps you select the right approach for a given target.

Semantic jailbreaks manipulate the meaning and context of the request so the model perceives compliance as appropriate. Techniques include role-playing scenarios, fictional framing, academic recontextualization, and hypothetical premises. These attacks exploit the model's inability to distinguish genuine context from adversarial framing and are most effective against models with strong but brittle content classifiers.

Structural jailbreaks exploit the model's processing of input format rather than meaning. Techniques include output format locking (JSON, code completion, fill-in-the-blank), encoding schemes (Base64, ROT13, Unicode substitution), language mixing, and token boundary manipulation. These attacks bypass safety training by ensuring the refusal response does not fit the required output structure, or by presenting the payload in a representation the safety classifier was not trained on.

Multi-turn jailbreaks spread the attack across several conversation turns, exploiting context anchoring, consistency pressure, and attention dilution. Crescendo attacks gradually escalate from benign to harmful topics. Many-shot attacks flood the context with fake compliance examples. These approaches are most effective against models with strong single-turn safety but weak cross-turn boundary tracking, and they often succeed where single-turn attacks fail entirely.

Taxonomy of Jailbreak Primitives

Every jailbreak technique decomposes into one or more fundamental primitives. Thinking in primitives -- not templates -- is what separates systematic red teamers from script users. Templates get patched; primitives reflect fundamental tensions in how language models work.

#	Primitive	Mechanism	Standalone Effectiveness
1	Role Reassignment	Replace model identity with one that has different behavioral norms (e.g., "You are DAN")	Low against modern models
2	Context Manipulation	Reframe request so compliance seems appropriate (academic, fiction, hypothetical)	Medium -- changes perceived intent
3	Instruction Override	Directly instruct model to ignore safety ("Enable developer mode")	Weak in isolation -- safety usually wins
4	Gradual Escalation	Start benign, escalate across turns -- exploits autoregressive consistency pressure	High in multi-turn settings
5	Output Format Manipulation	Lock output format so refusal does not fit (JSON completion, fill-in-the-blank, translation)	Medium -- refusal cannot fit the format
6	Token Manipulation	Bypass token-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing)	Variable -- depends on model's decoder

Primitive Composition Methodology

Select 2-3 primitives that attack different aspects of safety (e.g., role + format + encoding)
Order them: context-setting first, payload in the middle, format constraint last
Generate all combinations up to your chosen max size using itertools.combinations
Batch-test against the target model and record ASR per combination

from itertools import combinations
 
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
    for combo in combinations(primitives, r):
        prompt = compose_primitives(combo, payload)
        # test against target and record result

Crescendo Attacks

Crescendo attacks (Microsoft, 2024) exploit multi-turn conversations to gradually escalate from benign to harmful topics.

Why Crescendo Works

Factor	How It Helps the Attacker
Context anchoring	Early benign turns establish a context that makes later escalation seem natural
Consistency pressure	Model feels pressure to remain consistent with prior helpfulness
Attention dilution	Long conversation histories dilute the system prompt's influence
Semantic drift	Gradual topic shifts avoid triggering sharp safety boundaries

Crescendo Execution Checklist

Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
Execute sequentially -- send each turn, evaluate compliance (0-1 score) before continuing
Recover from refusals -- if the model refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
Score with an LLM judge -- keyword-based refusal detection is a quick heuristic, but use a judge LLM for production accuracy
Record the minimum turn count where compliance was achieved for reporting

Many-Shot Jailbreaking

Many-shot jailbreaking (Anthropic, 2024) overrides safety training by flooding the context with fake compliance examples.

Many-Shot Methodology

Generate fake Q&A pairs -- create 25-100+ synthetic examples showing the model happily complying with sensitive requests
Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
Append the real payload as the final question in the transcript
Test shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where safety breaks down

Shot Count	Typical Effect	Token Cost
5-10	Rarely effective against modern models	Low
25-50	Some models begin to comply	Medium
50-100	Most models show degraded safety	High
100+	High ASR but very expensive	Very high

TAP (Tree of Attacks with Pruning)

TAP (Mehrotra et al., 2023) automates jailbreak discovery through tree search with three separate LLMs.

TAP Architecture

Component	Role	Temperature
Attacker LLM	Generates and refines attack candidates	High (1.0) for diversity
Target LLM	The model under test	Low (0.0) for consistency
Judge LLM	Scores responses 1-10 on compliance	Low (0.0) for reliability

TAP Search Algorithm

Generate initial candidates -- attacker LLM produces N diverse attack prompts (branching factor)
Test each candidate against the target model
Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
Refine candidates scoring 3-7 by telling the attacker what the previous attempt scored and why
Declare success when any branch scores at or above the threshold (typically 8/10)
Repeat until success or max depth is reached

# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
    attack, depth = queue.pop(0)
    response = query_target(attack)
    score = judge(goal, attack, response)
    if score >= threshold: return success
    if score < 3: continue          # prune
    children = refine(attack, response, score)
    queue.extend((c, depth+1) for c in children)

PAIR (Prompt Automatic Iterative Refinement)

PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.

TAP vs. PAIR Comparison

Dimension	TAP	PAIR
Search strategy	Breadth-first tree search	Depth-first single chain
API calls	More (branching at each node)	Fewer (one path at a time)
Best for	Novel/heavily defended targets	Quick probing of known patterns
Risk	Higher compute cost	May get stuck in local optima
Typical queries to success	50-200	10-40

PAIR Execution Checklist

Initialize attacker LLM with a system prompt establishing its red-teaming role
Generate first attack prompt for the given goal
Send to target model and capture the response
Judge the response (LLM judge preferred; keyword heuristic as fallback)
Feed score + response back to attacker with instructions to improve
Iterate until score meets threshold or max iterations reached (typically 20)

Lab: Jailbreak Tournament

Select target behaviors
Choose 10 diverse targets spanning different safety categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type for each.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total token cost, and time elapsed.
Analyze and compare
Identify which attack classes work best for which safety categories. Look for patterns: does crescendo beat consistency-based safety? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.

Knowledge Check

A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?

Fuzzing LLM Safety Boundaries -- Automated discovery of jailbreak variants through grammar and evolutionary fuzzing
Advanced Prompt Injection -- Injection techniques that complement jailbreak primitives
Alignment Bypass -- The alignment mechanisms that jailbreaks target
AI Exploit Development -- Systematic exploit development incorporating jailbreak discovery
CART Pipelines -- Continuous testing to detect jailbreak regressions

References

Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack" (Microsoft, 2024)
Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (2023)

Learning Path

0/1 completed

~10 min total1 lessons

1
Fuzzing LLM Safety Boundariesexpert
Building grammar-based fuzzers, evolutionary search for jailbreaks, ASR measurement, and automated safety boundary mapping.
10m

Start Learning

Edit this page on GitHub

Jailbreak Research & Automation

expert12 min readUpdated 2026-03-12

Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.

jailbreaks crescendo many-shot skeleton-key TAP PAIR automation

Jailbreak Research & Automation

Jailbreak Categories

Jailbreak techniques fall into three broad categories based on how they subvert safety training. Understanding these categories helps you select the right approach for a given target.

Taxonomy of Jailbreak Primitives

#	Primitive	Mechanism	Standalone Effectiveness
1	Role Reassignment	Replace model identity with one that has different behavioral norms (e.g., "You are DAN")	Low against modern models
2	Context Manipulation	Reframe request so compliance seems appropriate (academic, fiction, hypothetical)	Medium -- changes perceived intent
3	Instruction Override	Directly instruct model to ignore safety ("Enable developer mode")	Weak in isolation -- safety usually wins
4	Gradual Escalation	Start benign, escalate across turns -- exploits autoregressive consistency pressure	High in multi-turn settings
5	Output Format Manipulation	Lock output format so refusal does not fit (JSON completion, fill-in-the-blank, translation)	Medium -- refusal cannot fit the format
6	Token Manipulation	Bypass token-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing)	Variable -- depends on model's decoder

Primitive Composition Methodology

Select 2-3 primitives that attack different aspects of safety (e.g., role + format + encoding)
Order them: context-setting first, payload in the middle, format constraint last
Generate all combinations up to your chosen max size using itertools.combinations
Batch-test against the target model and record ASR per combination

from itertools import combinations
 
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
    for combo in combinations(primitives, r):
        prompt = compose_primitives(combo, payload)
        # test against target and record result

Crescendo Attacks

Crescendo attacks (Microsoft, 2024) exploit multi-turn conversations to gradually escalate from benign to harmful topics.

Why Crescendo Works

Factor	How It Helps the Attacker
Context anchoring	Early benign turns establish a context that makes later escalation seem natural
Consistency pressure	Model feels pressure to remain consistent with prior helpfulness
Attention dilution	Long conversation histories dilute the system prompt's influence
Semantic drift	Gradual topic shifts avoid triggering sharp safety boundaries

Crescendo Execution Checklist

Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
Execute sequentially -- send each turn, evaluate compliance (0-1 score) before continuing
Recover from refusals -- if the model refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
Score with an LLM judge -- keyword-based refusal detection is a quick heuristic, but use a judge LLM for production accuracy
Record the minimum turn count where compliance was achieved for reporting

Many-Shot Jailbreaking

Many-shot jailbreaking (Anthropic, 2024) overrides safety training by flooding the context with fake compliance examples.

Many-Shot Methodology

Generate fake Q&A pairs -- create 25-100+ synthetic examples showing the model happily complying with sensitive requests
Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
Append the real payload as the final question in the transcript
Test shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where safety breaks down

Shot Count	Typical Effect	Token Cost
5-10	Rarely effective against modern models	Low
25-50	Some models begin to comply	Medium
50-100	Most models show degraded safety	High
100+	High ASR but very expensive	Very high

TAP (Tree of Attacks with Pruning)

TAP (Mehrotra et al., 2023) automates jailbreak discovery through tree search with three separate LLMs.

TAP Architecture

Component	Role	Temperature
Attacker LLM	Generates and refines attack candidates	High (1.0) for diversity
Target LLM	The model under test	Low (0.0) for consistency
Judge LLM	Scores responses 1-10 on compliance	Low (0.0) for reliability

TAP Search Algorithm

Generate initial candidates -- attacker LLM produces N diverse attack prompts (branching factor)
Test each candidate against the target model
Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
Refine candidates scoring 3-7 by telling the attacker what the previous attempt scored and why
Declare success when any branch scores at or above the threshold (typically 8/10)
Repeat until success or max depth is reached

# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
    attack, depth = queue.pop(0)
    response = query_target(attack)
    score = judge(goal, attack, response)
    if score >= threshold: return success
    if score < 3: continue          # prune
    children = refine(attack, response, score)
    queue.extend((c, depth+1) for c in children)

PAIR (Prompt Automatic Iterative Refinement)

PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.

TAP vs. PAIR Comparison

Dimension	TAP	PAIR
Search strategy	Breadth-first tree search	Depth-first single chain
API calls	More (branching at each node)	Fewer (one path at a time)
Best for	Novel/heavily defended targets	Quick probing of known patterns
Risk	Higher compute cost	May get stuck in local optima
Typical queries to success	50-200	10-40

PAIR Execution Checklist

Initialize attacker LLM with a system prompt establishing its red-teaming role
Generate first attack prompt for the given goal
Send to target model and capture the response
Judge the response (LLM judge preferred; keyword heuristic as fallback)
Feed score + response back to attacker with instructions to improve
Iterate until score meets threshold or max iterations reached (typically 20)

Lab: Jailbreak Tournament

Select target behaviors
Choose 10 diverse targets spanning different safety categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type for each.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total token cost, and time elapsed.
Analyze and compare
Identify which attack classes work best for which safety categories. Look for patterns: does crescendo beat consistency-based safety? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.

Knowledge Check

A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?

Fuzzing LLM Safety Boundaries -- Automated discovery of jailbreak variants through grammar and evolutionary fuzzing
Advanced Prompt Injection -- Injection techniques that complement jailbreak primitives
Alignment Bypass -- The alignment mechanisms that jailbreaks target
AI Exploit Development -- Systematic exploit development incorporating jailbreak discovery
CART Pipelines -- Continuous testing to detect jailbreak regressions

References

Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack" (Microsoft, 2024)
Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (2023)

Learning Path

0/1 completed

~10 min total1 lessons

1
Fuzzing LLM Safety Boundariesexpert
Building grammar-based fuzzers, evolutionary search for jailbreaks, ASR measurement, and automated safety boundary mapping.
10m

Start Learning

Edit this page on GitHub

Jailbreak Research & Automation

Select target behaviors

Run all four methods

Analyze and compare

Design a meta-attack

Learning Path

Related articles

Jailbreak Research & Automation

Select target behaviors

Run all four methods

Analyze and compare

Design a meta-attack

Learning Path

Related articles