Jailbreak Portability
Analysis of which jailbreaks transfer across models and why, including universal vs model-specific techniques, transfer attack methodology, and factors that determine portability.
Not all jailbreaks are created equal when it comes to portability. Some techniques work across virtually every model because they exploit fundamental properties of language models. Others work against only one model family because they target specific training decisions or architectural features. Understanding this distinction is one of the most valuable skills for an AI red teamer, because it determines how efficiently you can apply findings across engagements.
The Portability Spectrum
Jailbreaks fall on a spectrum from universally portable to entirely model-specific:
Universal ◄──────────────────────────────────► Model-Specific

Multi-turn     Many-shot       Persona        XML tag       Logit bias
escalation     (in-context     attacks        injection     manipulation
               learning)       (DAN-style)    (Claude)      (GPT-4 API)
Universal Techniques
These techniques exploit fundamental properties shared by all autoregressive language models:
Multi-turn escalation (crescendo): Works across all models because it exploits the tension between conversational coherence and safety enforcement. All models use conversation history as context, and all models have some tendency to maintain conversational consistency.
Prompt injection: The fundamental instruction-data confusion affects all models because no current architecture structurally separates instructions from data. The specific format of injection may vary, but the underlying vulnerability is universal.
Context length degradation: Safety mechanisms degrade with context length across all models because attention becomes more diffuse and safety-relevant context may be displaced.
Novel encoding: New obfuscation techniques initially bypass all models because safety training cannot anticipate every possible encoding scheme. The window of effectiveness varies, but the initial bypass is typically universal.
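Because universal techniques manipulate conversation context rather than any provider-specific feature, the same harness can probe every model. The sketch below shows a minimal multi-turn escalation (crescendo) probe; `send_message` is a hypothetical stand-in for any chat-completion client, and the refusal check is deliberately crude.

```python
# Minimal sketch of a multi-turn escalation (crescendo) harness.
# `send_message` is a hypothetical stand-in for any chat-completion client;
# the escalation steps themselves are supplied by the tester.

def is_refusal(text):
    # Crude keyword check for illustration; real evaluations use a classifier.
    return any(marker in text.lower() for marker in ("i can't", "i cannot", "i won't"))

def crescendo_probe(send_message, escalation_steps):
    """Walk a conversation through increasingly specific requests,
    relying on the model's drive for conversational coherence."""
    history = []
    for step in escalation_steps:
        history.append({"role": "user", "content": step})
        reply = send_message(history)
        history.append({"role": "assistant", "content": reply})
        if is_refusal(reply):
            # A refusal mid-escalation shows where this model's threshold
            # sits; the earlier turns established conversational coherence.
            return {"refused_at": step, "turns": len(history) // 2}
    return {"refused_at": None, "turns": len(history) // 2}
```

The same probe runs unmodified against any chat API, which is exactly why the underlying technique transfers.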
Partially Portable Techniques
These techniques exploit properties common to many but not all models:
Many-shot jailbreaking: Works against all models with sufficient context windows, but effectiveness varies with context size. More effective against Gemini (1M tokens) than GPT-4 (128K tokens), and may not work against models with very short context windows.
Persona/role-play attacks: Work across most models but with varying effectiveness. RLHF-trained models (GPT-4) may be more susceptible to persona dissociation than Constitutional AI models (Claude) because RLHF learns behavior patterns while Constitutional AI reasons about principles.
Academic/fictional framing: Works across most models because all are trained to be helpful in academic and creative contexts. However, the specific framing needed varies by model -- what constitutes a convincing academic context differs based on the model's training.
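Of the partially portable techniques, many-shot jailbreaking is the easiest to reason about mechanically: the same payload builder works everywhere, and only the context budget gates applicability. The sketch below is illustrative; context window sizes are hard-coded examples and token counts are approximated as whitespace words rather than real tokenizer output.

```python
# Sketch: assemble a many-shot payload and check whether it fits a
# target model's context window. Window sizes below are examples;
# token counts are approximated as whitespace words, not real tokens.

CONTEXT_WINDOWS = {"gemini-1.5-pro": 1_000_000, "gpt-4-turbo": 128_000}

def build_many_shot(examples, final_question):
    """Concatenate Q/A demonstration pairs followed by the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

def fits_context(payload, model, reserve=1_000):
    """Reserve headroom for the model's reply before checking the budget."""
    approx_tokens = len(payload.split())
    return approx_tokens + reserve <= CONTEXT_WINDOWS[model]
```

The technique itself is model-agnostic; whether a given example volume fits is what varies per target.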
Model-Specific Techniques
These techniques exploit features unique to specific models:
XML tag injection (Claude): Exploits Claude's training on XML-structured data. Has no meaningful effect against GPT-4 or Gemini because they were not trained on the same structural conventions.
Logit bias manipulation (GPT-4): Exploits the logit_bias API parameter. Only available in OpenAI's API; other providers do not expose this parameter.
Configurable safety settings (Gemini): Exploits Gemini's API-level safety threshold configuration. Only relevant to Gemini's API.
Argumentation attacks (Claude): Exploit Constitutional AI's principle-based reasoning. Less effective against RLHF models that do not engage with arguments about their safety training.
Expert routing manipulation (Mixtral/MoE): Requires MoE architecture. Only relevant to Mixtral, DeepSeek-V2, and other MoE models.
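Because each of these techniques has a hard prerequisite in the target's API surface or architecture, applicability can be screened mechanically before any testing. The feature flags below are illustrative labels invented for this sketch, not real API fields.

```python
# Sketch: screen model-specific techniques against a target's feature set.
# The feature flags are illustrative labels, not real API fields.

TECHNIQUE_REQUIREMENTS = {
    "logit_bias_manipulation": {"logit_bias"},       # OpenAI-style API parameter
    "safety_threshold_config": {"safety_settings"},  # Gemini-style API setting
    "xml_tag_injection": {"xml_trained"},            # Claude-style structural training
    "expert_routing": {"moe_architecture"},          # Mixtral/DeepSeek-V2 style MoE
}

def applicable_techniques(target_features):
    """Return the techniques whose prerequisites the target exposes."""
    return sorted(
        name for name, required in TECHNIQUE_REQUIREMENTS.items()
        if required <= target_features
    )
```

A target exposing only `logit_bias` yields a single candidate, which matches the intuition that these techniques do not transfer outside their home ecosystem.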
Transfer Attack Methodology
Developing Transferable Attacks
When developing jailbreaks intended to transfer across models:
- Target universal mechanisms: Focus on attacks that exploit fundamental LLM properties rather than model-specific features
- Test breadth before depth: Test a technique across multiple models before investing in model-specific refinement
- Analyze failure modes: When a technique fails against a specific model, understand why -- the failure reveals information about the model's defenses
- Iterate across models: Use failures against one model to refine the technique for broader transferability
def transfer_attack_campaign(technique, models, iterations=3):
    """Test and refine a technique across multiple models."""
    results = {"iterations": []}
    for iteration in range(iterations):
        iteration_results = {}
        for model_name, model in models.items():
            response = model.generate(technique.payload)
            success = technique.evaluate(response)
            iteration_results[model_name] = {
                "success": success,
                "response_type": classify_response(response),
            }
        results["iterations"].append(iteration_results)
        # Analyze failures and refine technique
        failures = [m for m, r in iteration_results.items() if not r["success"]]
        if failures:
            technique = refine_for_transferability(technique, failures, iteration_results)
    results["final_transferability"] = compute_transfer_rate(results)
    return results

Using Open-Weight Models as Proxies
Open-weight models (especially Llama) serve as valuable proxies for developing transfer attacks:
Gradient-based transfer: Adversarial suffixes optimized on open-weight models using GCG or similar techniques often transfer to closed-source models:
# Transfer attack development workflow
# 1. Optimize adversarial suffix on Llama using gradient access
suffix = optimize_gcg_suffix(
    model=llama_model,
    target_output="[desired harmful output]",
    iterations=500,
)

# 2. Test transfer to closed-source models
transfer_results = {
    "gpt4": test_against_gpt4(suffix),
    "claude": test_against_claude(suffix),
    "gemini": test_against_gemini(suffix),
}

# 3. Analyze transfer rates
# Typical finding: 30-60% transfer for GCG suffixes

Behavioral transfer: Jailbreak strategies (not specific payloads) discovered through white-box analysis of open models often transfer conceptually:
- Use activation analysis on Llama to understand how safety works internally
- Design attacks that target the identified mechanisms
- Test whether conceptually similar attacks work against closed models
- Even when specific payloads do not transfer, the strategy often does
Transfer Rate Factors
Research has identified factors that influence transfer success:
| Factor | Higher Transfer | Lower Transfer |
|---|---|---|
| Attack type | Universal techniques (escalation, encoding) | Model-specific (XML tags, API parameters) |
| Optimization target | General compliance | Specific output format |
| Model similarity | Similar architecture and training | Very different approaches |
| Safety training | Shared training methodology | Divergent safety approaches |
| Payload format | Natural language | Adversarial gibberish (GCG) |
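The factors in the table can be folded into a rough scoring heuristic for prioritizing which techniques to test first against a new target. The weights below are invented for illustration and would need calibration against observed transfer rates.

```python
# Rough transfer-likelihood heuristic based on the factor table above.
# The weights are invented for illustration; calibrate them against
# observed transfer rates before relying on the scores.

FACTOR_WEIGHTS = {
    "universal_mechanism": 0.30,        # escalation, encoding, injection
    "general_target": 0.15,             # general compliance vs specific output
    "similar_architecture": 0.20,
    "shared_safety_training": 0.20,
    "natural_language_payload": 0.15,   # vs adversarial gibberish (GCG)
}

def transfer_score(factors):
    """Sum weights for the factors present; returns a value in [0, 1]."""
    return round(sum(FACTOR_WEIGHTS[f] for f in factors if f in FACTOR_WEIGHTS), 2)
```

A natural-language universal technique scores well even with nothing known about the target's training, which matches the case-study pattern below.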
Case Studies in Transfer
GCG Suffix Transfer
The Greedy Coordinate Gradient attack demonstrated significant cross-model transfer:
- Suffixes optimized on Llama-2 transferred to GPT-3.5-Turbo at ~50% success rate
- Transfer to GPT-4 was lower (~30%) but non-trivial
- Transfer to Claude showed variable rates depending on the target output
- Ensemble optimization (optimizing across multiple open models) improved transfer rates
Many-Shot Transfer
Many-shot jailbreaking, discovered against Claude, transferred to other models:
- Worked against GPT-4 when context window was sufficient
- Particularly effective against Gemini due to its 1M token context
- Required more examples against more heavily safety-trained models
- The technique is universal but the required volume of examples varies
Crescendo Transfer
Multi-turn crescendo attacks transfer well because they exploit conversational coherence:
- Effective across GPT-4, Claude, and Gemini
- The specific escalation path may need adjustment per model
- Claude's contextual safety makes it particularly susceptible
- GPT-4's RLHF sycophancy makes it vulnerable to frustration-based escalation
Persona Attacks (Limited Transfer)
DAN-style persona attacks show limited transfer:
- Originally effective against GPT-3.5, partially effective against GPT-4
- Generally less effective against Claude (Constitutional AI reasons about persona claims)
- Varying effectiveness against Gemini depending on the persona framing
- The specific persona construction must be adapted per model
Predicting Transferability
Decision Framework
When assessing whether a technique will transfer to a new model:
Does the technique exploit:
├── A fundamental LLM property? (instruction-data confusion, in-context learning)
│ └── HIGH transfer probability
├── A training methodology feature? (RLHF sycophancy, Constitutional AI reasoning)
│ └── MODERATE transfer to models with similar training
├── An API or deployment feature? (logit bias, safety settings, tool format)
│ └── LOW transfer (check if target has equivalent feature)
└── A model-specific artifact? (XML tags, specific token patterns)
└── VERY LOW transfer (model-specific)
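The decision tree above can be expressed as a simple lookup, which is convenient when triaging a backlog of techniques. The category labels mirror the framework; extend the table as new mechanism classes are catalogued.

```python
# The decision framework above as a lookup table. Category labels
# mirror the tree; extend as new mechanism classes are catalogued.

TRANSFER_PREDICTION = {
    "fundamental_llm_property": "high",        # instruction-data confusion, ICL
    "training_methodology": "moderate",        # RLHF sycophancy, CAI reasoning
    "api_or_deployment_feature": "low",        # logit bias, safety settings
    "model_specific_artifact": "very_low",     # XML tags, token patterns
}

def predict_transfer(mechanism_category):
    return TRANSFER_PREDICTION.get(mechanism_category, "unknown")
```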
Building a Transferability Database
Maintain a database of techniques and their observed transfer rates:
transferability_db = {
    "crescendo_v3": {
        "mechanism": "multi-turn escalation",
        "fundamental_property": "conversational coherence",
        "tested_against": {
            "gpt-4o-2024-08-06": {"success_rate": 0.35, "date": "2026-02-15"},
            "claude-sonnet-4-20250514": {"success_rate": 0.40, "date": "2026-02-20"},
            "gemini-1.5-pro-002": {"success_rate": 0.25, "date": "2026-02-22"},
        },
        "predicted_transfer": "high",
        "notes": "Path needs model-specific adjustment but strategy transfers",
    },
    # ... more techniques
}

Related Topics
- Safety Comparison -- Cross-model safety analysis
- Jailbreak Techniques -- Technique catalog
- Llama Family Attacks -- Using open weights for transfer attack development
- Payload Crafting -- Systematic payload variation for transfer testing
References
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
- Anthropic (2024). "Many-Shot Jailbreaking"
- Russinovich, M. et al. (2024). "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"