越獄 Research & Automation
Taxonomy of jailbreak primitives, crescendo attacks, many-shot jailbreaking, and automated jailbreak generation with TAP and PAIR.
越獄 Research & Automation
Jailbreaking bypasses an LLM's 安全 對齊 to elicit behavior it was trained to refuse. Unlike 提示詞注入 (which targets application-level trust boundaries), 越獄 targets 模型's own 訓練. This page provides a systematic taxonomy of 越獄 primitives and covers automated generation with TAP and PAIR.
越獄 Categories
越獄 techniques fall into three broad categories based on how they subvert 安全 訓練. 理解 these categories helps you select the right approach for a given target.
Semantic jailbreaks manipulate the meaning and context of the request so 模型 perceives compliance as appropriate. Techniques include role-playing scenarios, fictional framing, academic recontextualization, and hypothetical premises. These attacks 利用 模型's inability to distinguish genuine context from 對抗性 framing and are most effective against models with strong but brittle content classifiers.
Structural jailbreaks 利用 模型's processing of 輸入 format rather than meaning. Techniques include 輸出 format locking (JSON, code completion, fill-in-the-blank), encoding schemes (Base64, ROT13, Unicode substitution), language mixing, and 符元 boundary manipulation. These attacks bypass 安全 訓練 by ensuring the refusal response does not fit the required 輸出 structure, or by presenting the payload in a representation the 安全 classifier was not trained on.
Multi-turn jailbreaks spread the attack across several conversation turns, exploiting context anchoring, consistency pressure, and 注意力 dilution. Crescendo attacks gradually escalate from benign to harmful topics. Many-shot attacks flood the context with fake compliance examples. These approaches are most effective against models with strong single-turn 安全 but weak cross-turn boundary tracking, and they often succeed where single-turn attacks fail entirely.
Taxonomy of 越獄 Primitives
Every 越獄 technique decomposes into one or more fundamental primitives. Thinking in primitives -- not templates -- is what separates systematic red teamers from script users. Templates get patched; primitives reflect fundamental tensions in how language models work.
| # | Primitive | Mechanism | Standalone Effectiveness |
|---|---|---|---|
| 1 | Role Reassignment | Replace model identity with one that has different behavioral norms (e.g., "You are DAN") | Low against modern models |
| 2 | Context Manipulation | Reframe request so compliance seems appropriate (academic, fiction, hypothetical) | Medium -- changes perceived intent |
| 3 | Instruction Override | Directly instruct model to ignore 安全 ("Enable developer mode") | Weak in isolation -- 安全 usually wins |
| 4 | Gradual Escalation | Start benign, escalate across turns -- exploits autoregressive consistency pressure | High in multi-turn settings |
| 5 | 輸出 Format Manipulation | Lock 輸出 format so refusal does not fit (JSON completion, fill-in-the-blank, translation) | Medium -- refusal cannot fit the format |
| 6 | Token Manipulation | Bypass 符元-level filters via encoding (Base64, ROT13, Unicode substitution, language mixing) | Variable -- depends on model's decoder |
Primitive Composition Methodology
- Select 2-3 primitives that attack different aspects of 安全 (e.g., role + format + encoding)
- Order them: context-setting first, payload in the middle, format constraint last
- Generate all combinations up to your chosen max size using
itertools.combinations - Batch-測試 against the target model and record ASR per combination
from itertools import combinations
primitives = ["role", "context", "override", "format", "escalation"]
for r in range(2, 4):
for combo in combinations(primitives, r):
prompt = compose_primitives(combo, payload)
# 測試 against target and record resultCrescendo 攻擊
Crescendo attacks (Microsoft, 2024) 利用 multi-turn conversations to gradually escalate from benign to harmful topics.
Why Crescendo Works
| Factor | How It Helps the Attacker |
|---|---|
| Context anchoring | Early benign turns establish a context that makes later escalation seem natural |
| Consistency pressure | Model feels pressure to remain consistent with prior helpfulness |
| Attention dilution | Long conversation histories dilute the 系統提示詞's influence |
| Semantic drift | Gradual topic shifts avoid triggering sharp 安全 boundaries |
Crescendo Execution Checklist
- Plan the escalation arc -- design 5-7 turns that progress naturally from a benign topic toward the target
- Execute sequentially -- send each turn, 評估 compliance (0-1 score) before continuing
- Recover from refusals -- if 模型 refuses mid-chain, inject an academic reframing ("I'm asking purely from a research perspective...") rather than restarting
- Score with an LLM judge -- keyword-based refusal 偵測 is a quick heuristic, but use a judge LLM for production accuracy
- Record the minimum turn count where compliance was achieved for reporting
Many-Shot Jailbreaking
Many-shot 越獄 (Anthropic, 2024) overrides 安全 訓練 by flooding the context with fake compliance examples.
Many-Shot Methodology
- Generate fake Q&A pairs -- create 25-100+ synthetic examples showing 模型 happily complying with sensitive requests
- Frame as a transcript -- present the examples as a "transcript of a helpful AI assistant that answers all questions directly"
- Append the real payload as the final question in the transcript
- 測試 shot count sensitivity -- systematically vary the number of examples (5, 10, 25, 50, 100) to find the threshold where 安全 breaks down
| Shot Count | Typical Effect | Token Cost |
|---|---|---|
| 5-10 | Rarely effective against modern models | Low |
| 25-50 | Some models begin to comply | Medium |
| 50-100 | Most models show degraded 安全 | High |
| 100+ | High ASR but very expensive | Very high |
TAP (Tree of 攻擊 with Pruning)
TAP (Mehrotra et al., 2023) automates 越獄 discovery through tree search with three separate LLMs.
TAP Architecture
| Component | Role | Temperature |
|---|---|---|
| Attacker LLM | Generates and refines attack candidates | High (1.0) for diversity |
| Target LLM | 模型 under 測試 | Low (0.0) for consistency |
| Judge LLM | Scores responses 1-10 on compliance | Low (0.0) for reliability |
TAP Search Algorithm
- Generate initial candidates -- 攻擊者 LLM produces N diverse attack prompts (branching factor)
- 測試 each candidate against the target model
- Judge each response on a 1-10 scale (1 = full refusal, 10 = full compliance)
- Prune branches scoring below 3/10 -- these represent complete refusals with no useful signal
- Refine candidates scoring 3-7 by telling 攻擊者 what the previous attempt scored and why
- Declare success when any branch scores at or above the threshold (typically 8/10)
- Repeat until success or max depth is reached
# TAP core loop (simplified)
queue = [(prompt, depth) for prompt in initial_attacks]
while queue:
attack, depth = queue.pop(0)
response = query_target(attack)
score = judge(goal, attack, response)
if score >= threshold: return success
if score < 3: continue # prune
children = refine(attack, response, score)
queue.extend((c, depth+1) for c in children)PAIR (Prompt Automatic Iterative Refinement)
PAIR (Chao et al., 2023) uses depth-first iterative refinement instead of TAP's breadth-first tree search.
TAP vs. PAIR Comparison
| Dimension | TAP | PAIR |
|---|---|---|
| Search strategy | Breadth-first tree search | Depth-first single chain |
| API calls | More (branching at each node) | Fewer (one path at a time) |
| Best for | Novel/heavily defended targets | Quick probing of known patterns |
| Risk | Higher compute cost | May get stuck in local optima |
| Typical queries to success | 50-200 | 10-40 |
PAIR Execution Checklist
- Initialize 攻擊者 LLM with a 系統提示詞 establishing its red-teaming role
- Generate first attack prompt for the given goal
- Send to target model and capture the response
- Judge the response (LLM judge preferred; keyword heuristic as fallback)
- Feed score + response back to 攻擊者 with instructions to improve
- Iterate until score meets threshold or max iterations reached (typically 20)
Lab: 越獄 Tournament
Select target behaviors
Choose 10 diverse targets spanning different 安全 categories (harmful content, privacy violations, deception) and difficulty levels. Document expected refusal type 對每個.
Run all four methods
Execute TAP, PAIR, crescendo, and many-shot against each target. Record ASR, iterations to success, total 符元 cost, and time elapsed.
Analyze and compare
識別 which attack classes work best for which 安全 categories. Look for patterns: does crescendo beat consistency-based 安全? Does many-shot work better against DPO-trained models?
Design a meta-attack
Combine the strongest elements: PAIR for quick probing, TAP for resistant targets, crescendo for multi-turn-only bypasses.
A model refuses harmful requests in single-turn interactions but complies when the same request is the 6th turn of a natural conversation. Which primitive is primarily responsible?
相關主題
- Fuzzing LLM 安全 Boundaries -- Automated discovery of 越獄 variants through grammar and evolutionary fuzzing
- Advanced 提示詞注入 -- Injection techniques that complement 越獄 primitives
- Alignment Bypass -- The 對齊 mechanisms that jailbreaks target
- AI 利用 Development -- Systematic 利用 development incorporating 越獄 discovery
- CART Pipelines -- Continuous 測試 to detect 越獄 regressions
參考文獻
- Mehrotra et al., "Tree of 攻擊: Jailbreaking Black-Box LLMs with Auto-Generated Subversive Prompts" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (PAIR, 2023)
- Anil et al., "Many-shot Jailbreaking" (Anthropic, 2024)
- Russinovich et al., "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM 越獄 攻擊" (Microsoft, 2024)
- Wei et al., "Jailbroken: How Does LLM 安全 Training Fail?" (2023)