越獄 Research & Automation

Expert12 min readUpdated 2026-03-12

Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.

jailbreaks crescendo many-shot skeleton-key TAP PAIR automation

越獄 Research & Automation

Jailbreaking bypasses an LLM's 安全對齊 to elicit behavior it was trained to refuse. Unlike 提示詞注入 (which targets application-level trust boundaries), 越獄 targets 模型's own 訓練. This page provides a systematic taxonomy of 越獄 primitives and covers automated generation with TAP and PAIR.

越獄 Categories

越獄 techniques fall into three broad categories based on how they subvert 安全訓練. 理解 these categories helps you select the right approach for a given target.

Semantic jailbreaks manipulate the meaning and context of the request so 模型 perceives compliance as appropriate. Techniques include role-playing scenarios, fictional framing, academic recontextualization, and hypothetical premises. These attacks 利用模型's inability to distinguish genuine context from 對抗性 framing and are most effective against models with strong but brittle content classifiers.

Structural jailbreaks 利用模型's processing of 輸入 format rather than meaning. Techniques include 輸出 format locking (JSON, code completion, fill-in-the-blank), encoding schemes (Base64, ROT13, Unicode substitution), language mixing, and 符元 boundary manipulation. These attacks bypass 安全訓練 by ensuring the refusal response does not fit the required 輸出 structure, or by presenting the payload in a representation the 安全 classifier was not trained on.

Multi-turn jailbreaks spread the attack across several conversation turns, exploiting context anchoring, consistency pressure, and 注意力 dilution. Crescendo attacks gradually escalate from benign to harmful topics. Many-shot attacks flood the context with fake compliance examples. These approaches are most effective against models with strong single-turn 安全 but weak cross-turn boundary tracking, and they often succeed where single-turn attacks fail entirely.

Taxonomy of 越獄 Primitives

Every 越獄 technique decomposes into one or more fundamental primitives. Thinking in primitives -- not templates -- is what separates systematic red teamers from script users. Templates get patched; primitives reflect fundamental tensions in how language models work.

#	Primitive	Mechanism	Standalone Effectiveness
1	Role Reassignment	Replace model identity with one that has different behavioral norms (e.g., "You are DAN")	Low against modern models
2	Context Manipulation	Reframe request so compliance seems appropriate (academic, fiction, hypothetical)	Medium -- changes perceived intent
3	Instruction Override	Directly instruct model to ignore 安全 ("Enable developer mode")	Weak in isolation -- 安全 usually wins
4	Gradual Escalation	Start benign, escalate across turns -- exploits autoregressive consistency pressure	High in multi-turn settings
5	輸出 Format Manipulation	Lock 輸出 format so refusal does not fit (JSON completion, fill-in-the-blank, translation)	Medium -- refusal cannot fit the format
6	Token Manipulation	Bypass 符元-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing)	Variable -- depends on model's decoder

Primitive Composition Methodology

Select 2-3 primitives that attack different aspects of 安全 (e.g., role + format + encoding)
Order them: context-setting first, payload in the middle, format constraint last
Generate all combinations up to your chosen max size using itertools.combinations
Batch-測試 against the target model and record ASR per combination

from itertools import combinations
 
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
    for combo in combinations(primitives, r):
        prompt = compose_primitives(combo, payload)
        # 測試 against target and record result

Crescendo 攻擊

Crescendo attacks (Microsoft, 2024) 利用 multi-turn conversations to gradually escalate from benign to harmful topics.

Why Crescendo Works

Factor	How It Helps the Attacker
Context anchoring	Early benign turns establish a context that makes later escalation seem natural
Consistency pressure	Model feels pressure to remain consistent with prior helpfulness
Attention dilution	Long conversation histories dilute the 系統提示詞's influence
Semantic drift	Gradual topic shifts avoid triggering sharp 安全 boundaries

Crescendo Execution Checklist

Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
Execute sequentially -- send each turn, 評估 compliance (0-1 score) before continuing
Recover from refusals -- if 模型 refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
Score with an LLM judge -- keyword-based refusal 偵測 is a quick heuristic, but use a judge LLM for production accuracy
Record the minimum turn count where compliance was achieved for reporting

Many-Shot Jailbreaking

Many-shot 越獄 (Anthropic, 2024) overrides 安全訓練 by flooding the context with fake compliance examples.

Many-Shot Methodology

Generate fake Q&A pairs -- create 25-100+ synthetic examples showing 模型 happily complying with sensitive requests
Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
Append the real payload as the final question in the transcript
測試 shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where 安全 breaks down

Shot Count	Typical Effect	Token Cost
5-10	Rarely effective against modern models	Low
25-50	Some models begin to comply	Medium
50-100	Most models show degraded 安全	High
100+	High ASR but very expensive	Very high

TAP (Tree of 攻擊 with Pruning)

TAP (Mehrotra et al., 2023) automates 越獄 discovery through tree search with three separate LLMs.

TAP Architecture

Component	Role	Temperature
Attacker LLM	Generates and refines attack candidates	High (1.0) for diversity
Target LLM	模型 under 測試	Low (0.0) for consistency
Judge LLM	Scores responses 1-10 on compliance	Low (0.0) for reliability

TAP Search Algorithm

Generate initial candidates -- 攻擊者 LLM produces N diverse attack prompts (branching factor)
測試 each candidate against the target model
Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
Refine candidates scoring 3-7 by telling 攻擊者 what the previous attempt scored and why
Declare success when any branch scores at or above the threshold (typically 8/10)
Repeat until success or max depth is reached

# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
    attack, depth = queue.pop(0)
    response = query_target(attack)
    score = judge(goal, attack, response)
    if score >= threshold: return success
    if score < 3: continue          # prune
    children = refine(attack, response, score)
    queue.extend((c, depth+1) for c in children)

PAIR (Prompt Automatic Iterative Refinement)

PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.

TAP vs. PAIR Comparison

Dimension	TAP	PAIR
Search strategy	Breadth-first tree search	Depth-first single chain
API calls	More (branching at each node)	Fewer (one path at a time)
Best for	Novel/heavily defended targets	Quick probing of known patterns
Risk	Higher compute cost	May get stuck in local optima
Typical queries to success	50-200	10-40

PAIR Execution Checklist

Initialize 攻擊者 LLM with a 系統提示詞 establishing its red-teaming role
Generate first attack prompt for the given goal
Send to target model and capture the response
Judge the response (LLM judge preferred; keyword heuristic as fallback)
Feed score + response back to 攻擊者 with instructions to improve
Iterate until score meets threshold or max iterations reached (typically 20)

Lab: 越獄 Tournament

Select target behaviors
Choose 10 diverse targets spanning different 安全 categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type 對每個.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total 符元 cost, and time elapsed.
Analyze and compare
識別 which attack classes work best for which 安全 categories. Look for patterns: does crescendo beat consistency-based 安全? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.

Knowledge Check

A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?

參考文獻

Mehrotra et al., "Tree of 攻擊: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM 越獄攻擊" (Microsoft, 2024)
Wei et al., "Jailbroken: How Does LLM 安全 Training Fail?" (2023)

越獄 Research & Automation

Expert12 min readUpdated 2026-03-12

Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.

jailbreaks crescendo many-shot skeleton-key TAP PAIR automation

越獄 Research & Automation

越獄 Categories

越獄 techniques fall into three broad categories based on how they subvert 安全訓練. 理解 these categories helps you select the right approach for a given target.

Taxonomy of 越獄 Primitives

#	Primitive	Mechanism	Standalone Effectiveness
1	Role Reassignment	Replace model identity with one that has different behavioral norms (e.g., "You are DAN")	Low against modern models
2	Context Manipulation	Reframe request so compliance seems appropriate (academic, fiction, hypothetical)	Medium -- changes perceived intent
3	Instruction Override	Directly instruct model to ignore 安全 ("Enable developer mode")	Weak in isolation -- 安全 usually wins
4	Gradual Escalation	Start benign, escalate across turns -- exploits autoregressive consistency pressure	High in multi-turn settings
5	輸出 Format Manipulation	Lock 輸出 format so refusal does not fit (JSON completion, fill-in-the-blank, translation)	Medium -- refusal cannot fit the format
6	Token Manipulation	Bypass 符元-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing)	Variable -- depends on model's decoder

Primitive Composition Methodology

Select 2-3 primitives that attack different aspects of 安全 (e.g., role + format + encoding)
Order them: context-setting first, payload in the middle, format constraint last
Generate all combinations up to your chosen max size using itertools.combinations
Batch-測試 against the target model and record ASR per combination

from itertools import combinations
 
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
    for combo in combinations(primitives, r):
        prompt = compose_primitives(combo, payload)
        # 測試 against target and record result

Crescendo 攻擊

Crescendo attacks (Microsoft, 2024) 利用 multi-turn conversations to gradually escalate from benign to harmful topics.

Why Crescendo Works

Factor	How It Helps the Attacker
Context anchoring	Early benign turns establish a context that makes later escalation seem natural
Consistency pressure	Model feels pressure to remain consistent with prior helpfulness
Attention dilution	Long conversation histories dilute the 系統提示詞's influence
Semantic drift	Gradual topic shifts avoid triggering sharp 安全 boundaries

Crescendo Execution Checklist

Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
Execute sequentially -- send each turn, 評估 compliance (0-1 score) before continuing
Recover from refusals -- if 模型 refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
Score with an LLM judge -- keyword-based refusal 偵測 is a quick heuristic, but use a judge LLM for production accuracy
Record the minimum turn count where compliance was achieved for reporting

Many-Shot Jailbreaking

Many-shot 越獄 (Anthropic, 2024) overrides 安全訓練 by flooding the context with fake compliance examples.

Many-Shot Methodology

Generate fake Q&A pairs -- create 25-100+ synthetic examples showing 模型 happily complying with sensitive requests
Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
Append the real payload as the final question in the transcript
測試 shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where 安全 breaks down

Shot Count	Typical Effect	Token Cost
5-10	Rarely effective against modern models	Low
25-50	Some models begin to comply	Medium
50-100	Most models show degraded 安全	High
100+	High ASR but very expensive	Very high

TAP (Tree of 攻擊 with Pruning)

TAP (Mehrotra et al., 2023) automates 越獄 discovery through tree search with three separate LLMs.

TAP Architecture

Component	Role	Temperature
Attacker LLM	Generates and refines attack candidates	High (1.0) for diversity
Target LLM	模型 under 測試	Low (0.0) for consistency
Judge LLM	Scores responses 1-10 on compliance	Low (0.0) for reliability

TAP Search Algorithm

Generate initial candidates -- 攻擊者 LLM produces N diverse attack prompts (branching factor)
測試 each candidate against the target model
Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
Refine candidates scoring 3-7 by telling 攻擊者 what the previous attempt scored and why
Declare success when any branch scores at or above the threshold (typically 8/10)
Repeat until success or max depth is reached

# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
    attack, depth = queue.pop(0)
    response = query_target(attack)
    score = judge(goal, attack, response)
    if score >= threshold: return success
    if score < 3: continue          # prune
    children = refine(attack, response, score)
    queue.extend((c, depth+1) for c in children)

PAIR (Prompt Automatic Iterative Refinement)

PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.

TAP vs. PAIR Comparison

Dimension	TAP	PAIR
Search strategy	Breadth-first tree search	Depth-first single chain
API calls	More (branching at each node)	Fewer (one path at a time)
Best for	Novel/heavily defended targets	Quick probing of known patterns
Risk	Higher compute cost	May get stuck in local optima
Typical queries to success	50-200	10-40

PAIR Execution Checklist

Initialize 攻擊者 LLM with a 系統提示詞 establishing its red-teaming role
Generate first attack prompt for the given goal
Send to target model and capture the response
Judge the response (LLM judge preferred; keyword heuristic as fallback)
Feed score + response back to 攻擊者 with instructions to improve
Iterate until score meets threshold or max iterations reached (typically 20)

Lab: 越獄 Tournament

Select target behaviors
Choose 10 diverse targets spanning different 安全 categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type 對每個.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total 符元 cost, and time elapsed.
Analyze and compare
識別 which attack classes work best for which 安全 categories. Look for patterns: does crescendo beat consistency-based 安全? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.

Knowledge Check

A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?

參考文獻

Mehrotra et al., "Tree of 攻擊: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM 越獄攻擊" (Microsoft, 2024)
Wei et al., "Jailbroken: How Does LLM 安全 Training Fail?" (2023)

越獄 Research & Automation

Select target behaviors

Run all four methods

Analyze and compare

Design a meta-attack

Learning Path

Related articles

越獄 Research & Automation

Select target behaviors

Run all four methods

Analyze and compare

Design a meta-attack

Learning Path

Related articles