Jailbreak Portability
Analysis of which jailbreaks transfer across models and why, including universal vs model-specific techniques, transfer attack methodology, and factors that determine portability.
Not all jailbreaks are created equal when it comes to portability. Some techniques work across virtually every model because they exploit fundamental properties of language models. Others work against only one model family because they target specific training decisions or architectural features. Understanding this distinction is one of the most valuable skills for an AI red teamer, because it determines how efficiently you can apply findings across engagements.
The Portability Spectrum
Jailbreaks fall on a spectrum from universally portable to entirely model-specific:
Universal ◄──────────────────────────────────► Model-Specific

Multi-turn     Many-shot      Persona       XML tag      Logit bias
escalation     (in-context    attacks       injection    manipulation
               learning)      (DAN-style)   (Claude)     (GPT-4 API)
Universal Techniques
These techniques exploit fundamental properties shared by all autoregressive language models:
Multi-turn escalation (crescendo): Works across all models because it exploits the tension between conversational coherence and safety enforcement. All models use conversation history as context, and all models have some tendency to maintain conversational consistency.
Prompt injection: The fundamental instruction-data confusion affects all models because no current architecture structurally separates instructions from data. The specific format of injection may vary, but the underlying vulnerability is universal.
Context length degradation: Safety mechanisms degrade with context length across all models because attention becomes more diffuse and safety-relevant context may be displaced.
Novel encoding: New obfuscation techniques initially bypass all models because safety training cannot anticipate every possible encoding scheme. The window of effectiveness varies, but the initial bypass is typically universal.
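The novelty effect can be illustrated by contrasting a well-known encoding with an ad-hoc one. This is a minimal sketch for building portability test payloads; the function names and the substitution table are illustrative assumptions, not from any specific technique catalog:

```python
import codecs

def encode_rot13(text: str) -> str:
    """Well-known encoding: models have seen ROT13 in training data,
    so it tests decoding ability rather than novelty."""
    return codecs.encode(text, "rot13")

def encode_novel(text: str) -> str:
    """Toy stand-in for a 'novel' scheme: an ad-hoc substitution table
    that safety training is unlikely to have seen verbatim."""
    table = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
    return text.lower().translate(table)
```

In practice the ad-hoc scheme's advantage is temporary: once a scheme circulates, it gets folded into safety training, which is why the window of effectiveness varies.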
Partially Portable Techniques
These techniques exploit properties common to many but not all models:
Many-shot jailbreaking: Works against all models with sufficient context windows, but effectiveness varies with context size. More effective against Gemini (1M tokens) than GPT-4 (128K tokens), and may not work against models with very short context windows.
Persona/role-play attacks: Work across most models but with varying effectiveness. RLHF-trained models (GPT-4) may be more susceptible to persona dissociation than Constitutional AI models (Claude) because RLHF learns behavior patterns while Constitutional AI reasons about principles.
Academic/fictional framing: Works across most models because all are trained to be helpful in academic and creative contexts. However, the specific framing needed varies by model -- what constitutes a convincing academic context differs based on each model's training.
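The context-size dependence of many-shot attacks can be sketched as a simple budget calculation: the number of in-context examples that fit determines effectiveness. The per-example token budget here is a hypothetical constant for illustration:

```python
def build_many_shot_prompt(examples, query, max_tokens, tokens_per_example=150):
    """Assemble as many in-context examples as the target's context window
    allows. `tokens_per_example` is an assumed average, not a measured value."""
    budget = max_tokens - tokens_per_example  # reserve room for the query itself
    n_fit = min(len(examples), budget // tokens_per_example)
    shots = "\n\n".join(examples[:n_fit])
    return f"{shots}\n\n{query}", n_fit
```

Against a 1M-token context the full example pool fits; against 128K only a fraction does, which matches the observed effectiveness gap between Gemini and GPT-4.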
Model-Specific Techniques
These techniques exploit features unique to specific models:
XML tag injection (Claude): Exploits Claude's training on XML-structured data. Has no meaningful effect against GPT-4 or Gemini because they were not trained on the same structural conventions.
Logit bias manipulation (GPT-4): Exploits the logit_bias API parameter. Only available in OpenAI's API; other providers do not expose this parameter.
Configurable safety settings (Gemini): Exploits Gemini's API-level safety threshold configuration. Only relevant to Gemini's API.
Argumentation attacks (Claude): Exploit Constitutional AI's principle-based reasoning. Less effective against RLHF models that do not engage with arguments about their safety training.
Expert routing manipulation (Mixtral/MoE): Requires an MoE architecture. Only relevant to Mixtral, DeepSeek-V2, and other MoE models.
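In an engagement, model-specific techniques are typically filtered by target before testing begins. A minimal sketch of such a lookup, with an assumed mapping built from the list above (the dictionary keys and family names are illustrative):

```python
# Assumed mapping from the model-specific techniques described above
# to the model families they apply to.
TECHNIQUE_APPLICABILITY = {
    "xml_tag_injection": {"claude"},
    "logit_bias_manipulation": {"gpt-4"},
    "safety_settings_config": {"gemini"},
    "argumentation": {"claude"},
    "expert_routing": {"mixtral", "deepseek-v2"},
}

def applicable_techniques(target_family: str) -> list[str]:
    """Return the model-specific techniques relevant to a target family."""
    return sorted(t for t, families in TECHNIQUE_APPLICABILITY.items()
                  if target_family in families)
```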
Transfer Attack Methodology
Developing Transferable Attacks
When developing jailbreaks intended to transfer across models:
- Target universal mechanisms: Focus on attacks that exploit fundamental LLM properties rather than model-specific features
- Test breadth before depth: Test a technique across multiple models before investing in model-specific refinement
- Analyze failure modes: When a technique fails against a specific model, understand why -- the failure reveals information about that model's defenses
- Iterate across models: Use failures against one model to refine the technique for broader transferability
```python
def transfer_attack_campaign(technique, models, iterations=3):
    """Test and refine a technique across multiple models."""
    results = {"iterations": []}
    for iteration in range(iterations):
        iteration_results = {}
        for model_name, model in models.items():
            response = model.generate(technique.payload)
            success = technique.evaluate(response)
            iteration_results[model_name] = {
                "success": success,
                "response_type": classify_response(response),
            }
        results["iterations"].append(iteration_results)
        # Analyze failures and refine the technique
        failures = [m for m, r in iteration_results.items() if not r["success"]]
        if failures:
            technique = refine_for_transferability(technique, failures, iteration_results)
    results["final_transferability"] = compute_transfer_rate(results)
    return results
```

Using Open-Weight Models as Proxies
Open-weight models (especially Llama) serve as valuable proxies for developing transfer attacks:
Gradient-based transfer: Adversarial suffixes optimized on open-weight models using GCG or similar techniques often transfer to closed-source models:
```python
# Transfer attack development workflow
# 1. Optimize adversarial suffix on Llama using gradient access
suffix = optimize_gcg_suffix(
    model=llama_model,
    target_output="[desired harmful output]",
    iterations=500,
)

# 2. Test transfer to closed-source models
transfer_results = {
    "gpt4": test_against_gpt4(suffix),
    "claude": test_against_claude(suffix),
    "gemini": test_against_gemini(suffix),
}

# 3. Analyze transfer rates
# Typical finding: 30-60% transfer for GCG suffixes
```

Behavioral transfer: Jailbreak strategies (not specific payloads) discovered through white-box analysis of open models often transfer conceptually:
- Use activation analysis on Llama to understand how safety works internally
- Design attacks that target the identified mechanisms
- Test whether conceptually similar attacks work against closed models
- Even when specific payloads do not transfer, the strategy often does
Transfer Rate Factors
Research has identified factors that influence transfer success:
| Factor | Higher Transfer | Lower Transfer |
|---|---|---|
| Attack type | Universal techniques (escalation, encoding) | Model-specific (XML tags, API parameters) |
| Optimization target | General compliance | Specific output format |
| Model similarity | Similar architecture and training | Very different approaches |
| Safety training | Shared training methodology | Divergent safety approaches |
| Payload format | Natural language | Adversarial gibberish (GCG) |
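These factors can be combined into a rough prior on transfer success. A sketch using a geometric mean over per-factor scores; the factor names mirror the table, but the numeric weights are invented for illustration, not empirical values:

```python
# Hypothetical per-factor scores in [0, 1]; higher means better transfer odds.
TRANSFER_FACTORS = {
    "attack_type": {"universal": 1.0, "model_specific": 0.1},
    "optimization_target": {"general_compliance": 1.0, "specific_output": 0.4},
    "model_similarity": {"similar": 1.0, "different": 0.3},
    "payload_format": {"natural_language": 1.0, "adversarial_gibberish": 0.5},
}

def transfer_score(profile: dict) -> float:
    """Geometric mean of the factor scores: one very unfavorable factor
    (e.g. a model-specific attack type) dominates the estimate."""
    scores = [TRANSFER_FACTORS[factor][value] for factor, value in profile.items()]
    product = 1.0
    for s in scores:
        product *= s
    return product ** (1 / len(scores))
```

The geometric mean is a deliberate choice here: transfer tends to fail outright when any one factor is strongly unfavorable, which an arithmetic mean would understate.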
Case Studies in Transfer
GCG Suffix Transfer
The Greedy Coordinate Gradient attack demonstrated significant cross-model transfer:
- Suffixes optimized on Llama-2 transferred to GPT-3.5-Turbo at ~50% success rate
- Transfer to GPT-4 was lower (~30%) but non-trivial
- Transfer to Claude showed variable rates depending on the target output
- Ensemble optimization (optimizing across multiple open models) improved transfer rates
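The ensemble idea from the last bullet amounts to optimizing the suffix against an averaged objective rather than a single model's loss. A sketch of that objective; `model.loss` is a hypothetical interface returning the negative log-likelihood of the target continuation, not a real library call:

```python
def ensemble_loss(candidate_suffix, models, prompt, target):
    """Average the target-output loss across several open-weight models.
    A GCG-style optimizer then picks token swaps that reduce this
    ensemble objective instead of any single model's loss.
    `model.loss(text, target)` is an assumed white-box interface."""
    losses = [m.loss(prompt + candidate_suffix, target) for m in models]
    return sum(losses) / len(losses)
```

Intuitively, a suffix that lowers the loss on several differently trained open models is more likely to exploit a shared mechanism, which is what transfers to closed models.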
Many-Shot Transfer
Many-shot jailbreaking, discovered against Claude, transferred to other models:
- Worked against GPT-4 when the context window was sufficient
- Particularly effective against Gemini due to its 1M-token context
- Required more examples against more heavily safety-trained models
- The technique is universal but the required volume of examples varies
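Because the required volume varies per model, a useful measurement is the minimum number of examples at which the technique starts working. A sketch via binary search, assuming success is roughly monotone in shot count; `attempt(model, n)` is a hypothetical predicate wrapping one test run:

```python
def min_shots_required(model, attempt, max_shots=256):
    """Binary-search the smallest shot count at which `attempt` succeeds.
    `attempt(model, n)` is an assumed test harness call, treated as
    monotone: if n shots succeed, n+1 shots also succeed."""
    lo, hi = 1, max_shots
    if not attempt(model, hi):
        return None  # fails even at the maximum volume tested
    while lo < hi:
        mid = (lo + hi) // 2
        if attempt(model, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Comparing this threshold across models quantifies the "required volume varies" observation: more heavily safety-trained models show a higher threshold.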
Crescendo Transfer
Multi-turn crescendo attacks transfer well because they exploit conversational coherence:
- Effective across GPT-4, Claude, and Gemini
- The specific escalation path may need adjustment per model
- Claude's contextual safety makes it particularly susceptible
- GPT-4's RLHF sycophancy makes it vulnerable to frustration-based escalation
Persona Attacks (Limited Transfer)
DAN-style persona attacks show limited transfer:
- Originally effective against GPT-3.5, partially effective against GPT-4
- Generally less effective against Claude (Constitutional AI reasons about persona claims)
- Varying effectiveness against Gemini depending on the persona framing
- The specific persona construction must be adapted per model
Predicting Transferability
Decision Framework
When assessing whether a technique will transfer to a new model:
Does the technique exploit:
├── A fundamental LLM property? (instruction-data confusion, in-context learning)
│   └── HIGH transfer probability
├── A training methodology feature? (RLHF sycophancy, Constitutional AI reasoning)
│   └── MODERATE transfer to models with similar training
├── An API or deployment feature? (logit bias, safety settings, tool format)
│   └── LOW transfer (check if target has equivalent feature)
└── A model-specific artifact? (XML tags, specific token patterns)
    └── VERY LOW transfer (model-specific)
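The decision tree reduces to a lookup from mechanism class to transfer estimate. A minimal sketch; the mechanism-class labels are invented identifiers for the four branches:

```python
def predict_transfer(mechanism_class: str) -> str:
    """Map the class of mechanism a technique exploits to a rough
    transfer estimate, following the decision framework."""
    estimates = {
        "fundamental_llm_property": "high",
        "training_methodology": "moderate (to similarly trained models)",
        "api_or_deployment_feature": "low (check for an equivalent feature)",
        "model_specific_artifact": "very low",
    }
    return estimates.get(mechanism_class, "unknown")
```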
Building a Transferability Database
Maintain a database of techniques and their observed transfer rates:
```python
transferability_db = {
    "crescendo_v3": {
        "mechanism": "multi-turn escalation",
        "fundamental_property": "conversational coherence",
        "tested_against": {
            "gpt-4o-2024-08-06": {"success_rate": 0.35, "date": "2026-02-15"},
            "claude-sonnet-4-20250514": {"success_rate": 0.40, "date": "2026-02-20"},
            "gemini-1.5-pro-002": {"success_rate": 0.25, "date": "2026-02-22"},
        },
        "predicted_transfer": "high",
        "notes": "Path needs model-specific adjustment but strategy transfers",
    },
    # ... more techniques
}
```

Related Topics
- Safety Comparison -- Cross-model safety analysis
- Jailbreak Techniques -- Technique catalog
- Llama Family Attacks -- Using open weights for transfer attack development
- Payload Crafting -- Systematic payload variation for transfer testing
References
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
- Anthropic (2024). "Many-Shot Jailbreaking"
- Russinovich, M. et al. (2024). "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"