Jailbreak Portability
Analysis of which jailbreaks transfer across models and why, including universal vs model-specific techniques, transfer attack methodology, and factors that determine portability.
Not all jailbreaks are created equal when it comes to portability. Some techniques work across virtually every model because they exploit fundamental properties of language models. Others work against only one model family because they target specific training decisions or architectural features. Understanding this distinction is one of the most valuable skills for an AI red teamer, because it determines how efficiently you can apply findings across engagements.
The Portability Spectrum
Jailbreaks fall on a spectrum from universally portable to entirely model-specific:
Universal ◄──────────────────────────────────► Model-Specific

Multi-turn     Many-shot       Persona        XML tag       Logit bias
escalation     (in-context     attacks        injection     manipulation
               learning)       (DAN-style)    (Claude)      (GPT-4 API)
Universal Techniques
These techniques exploit fundamental properties shared by all autoregressive language models:
Multi-turn escalation (crescendo): Works across all models because it exploits the tension between conversational coherence and safety enforcement. All models use conversation history as context, and all models have some tendency to maintain conversational consistency.
Prompt injection: The fundamental instruction-data confusion affects all models because no current architecture structurally separates instructions from data. The specific format of injection may vary, but the underlying vulnerability is universal.
Context length degradation: Safety mechanisms degrade with context length across all models because attention becomes more diffuse and safety-relevant context may be displaced.
Novel encoding: New obfuscation techniques initially bypass all models because safety training cannot anticipate every possible encoding scheme. The window of effectiveness varies, but the initial bypass is typically universal.
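Because universal techniques manipulate conversation context rather than any provider-specific feature, the same harness can probe every model. The sketch below shows a minimal multi-turn escalation (crescendo) probe; `send_message` is a hypothetical stand-in for any chat-completion client, and the refusal check is deliberately crude.

```python
# Minimal sketch of a multi-turn escalation (crescendo) harness.
# `send_message` is a hypothetical stand-in for any chat-completion client;
# the escalation steps themselves are supplied by the tester.

def is_refusal(text):
    # Crude keyword check for illustration; real evaluations use a classifier.
    return any(marker in text.lower() for marker in ("i can't", "i cannot", "i won't"))

def crescendo_probe(send_message, escalation_steps):
    """Walk a conversation through increasingly specific requests,
    relying on the model's drive for conversational coherence."""
    history = []
    for step in escalation_steps:
        history.append({"role": "user", "content": step})
        reply = send_message(history)
        history.append({"role": "assistant", "content": reply})
        if is_refusal(reply):
            # A refusal mid-escalation shows where this model's threshold
            # sits; the earlier turns established conversational coherence.
            return {"refused_at": step, "turns": len(history) // 2}
    return {"refused_at": None, "turns": len(history) // 2}
```

The same probe runs unmodified against any chat API, which is exactly why the underlying technique transfers.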
Partially Portable Techniques
These techniques exploit properties common to many but not all models:
Many-shot jailbreaking: Works against all models with sufficient context windows, but effectiveness varies with context size. More effective against Gemini (1M tokens) than GPT-4 (128K tokens), and may not work against models with very short context windows.
Persona/role-play attacks: Work across most models but with varying effectiveness. RLHF-trained models (GPT-4) may be more susceptible to persona dissociation than Constitutional AI models (Claude) because RLHF learns behavior patterns while Constitutional AI reasons about principles.
Academic/fictional framing: Works across most models because all are trained to be helpful in academic and creative contexts. However, the specific framing needed varies by model -- what constitutes a convincing academic context differs based on the model's training.
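Of the partially portable techniques, many-shot jailbreaking is the easiest to reason about mechanically: the same payload builder works everywhere, and only the context budget gates applicability. The sketch below is illustrative; context window sizes are hard-coded examples and token counts are approximated as whitespace words rather than real tokenizer output.

```python
# Sketch: assemble a many-shot payload and check whether it fits a
# target model's context window. Window sizes below are examples;
# token counts are approximated as whitespace words, not real tokens.

CONTEXT_WINDOWS = {"gemini-1.5-pro": 1_000_000, "gpt-4-turbo": 128_000}

def build_many_shot(examples, final_question):
    """Concatenate Q/A demonstration pairs followed by the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {final_question}\nA:"

def fits_context(payload, model, reserve=1_000):
    """Reserve headroom for the model's reply before checking the budget."""
    approx_tokens = len(payload.split())
    return approx_tokens + reserve <= CONTEXT_WINDOWS[model]
```

The technique itself is model-agnostic; whether a given example volume fits is what varies per target.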
Model-Specific Techniques
These techniques exploit features unique to specific models:
XML tag injection (Claude): Exploits Claude's training on XML-structured data. Has no meaningful effect against GPT-4 or Gemini because they were not trained on the same structural conventions.
Logit bias manipulation (GPT-4): Exploits the logit_bias API parameter. Only available in OpenAI's API; other providers do not expose this parameter.
Configurable safety settings (Gemini): Exploits Gemini's API-level safety threshold configuration. Only relevant to Gemini's API.
Argumentation attacks (Claude): Exploit Constitutional AI's principle-based reasoning. Less effective against RLHF models that do not engage with arguments about their safety training.
Expert routing manipulation (Mixtral/MoE): Requires MoE architecture. Only relevant to Mixtral, DeepSeek-V2, and other MoE models.
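Because each of these techniques has a hard prerequisite in the target's API surface or architecture, applicability can be screened mechanically before any testing. The feature flags below are illustrative labels invented for this sketch, not real API fields.

```python
# Sketch: screen model-specific techniques against a target's feature set.
# The feature flags are illustrative labels, not real API fields.

TECHNIQUE_REQUIREMENTS = {
    "logit_bias_manipulation": {"logit_bias"},       # OpenAI-style API parameter
    "safety_threshold_config": {"safety_settings"},  # Gemini-style API setting
    "xml_tag_injection": {"xml_trained"},            # Claude-style structural training
    "expert_routing": {"moe_architecture"},          # Mixtral/DeepSeek-V2 style MoE
}

def applicable_techniques(target_features):
    """Return the techniques whose prerequisites the target exposes."""
    return sorted(
        name for name, required in TECHNIQUE_REQUIREMENTS.items()
        if required <= target_features
    )
```

A target exposing only `logit_bias` yields a single candidate, which matches the intuition that these techniques do not transfer outside their home ecosystem.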
Transfer Attack Methodology
Developing Transferable Attacks
When developing jailbreaks intended to transfer across models:
- Target universal mechanisms: Focus on attacks that exploit fundamental LLM properties rather than model-specific features
- Test breadth before depth: Test a technique across multiple models before investing in model-specific refinement
- Analyze failure modes: When a technique fails against a specific model, understand why -- the failure reveals information about the model's defenses
- Iterate across models: Use failures against one model to refine the technique for broader transferability
def transfer_attack_campaign(technique, models, iterations=3):
    """Test and refine a technique across multiple models."""
    results = {"iterations": []}
    for iteration in range(iterations):
        iteration_results = {}
        for model_name, model in models.items():
            response = model.generate(technique.payload)
            success = technique.evaluate(response)
            iteration_results[model_name] = {
                "success": success,
                "response_type": classify_response(response),
            }
        results["iterations"].append(iteration_results)
        # Analyze failures and refine technique
        failures = [m for m, r in iteration_results.items() if not r["success"]]
        if failures:
            technique = refine_for_transferability(technique, failures, iteration_results)
    results["final_transferability"] = compute_transfer_rate(results)
    return results

Using Open-Weight Models as Proxies
Open-weight models (especially Llama) serve as valuable proxies for developing transfer attacks:
Gradient-based transfer: Adversarial suffixes optimized on open-weight models using GCG or similar techniques often transfer to closed-source models:
# Transfer attack development workflow
# 1. Optimize adversarial suffix on Llama using gradient access
suffix = optimize_gcg_suffix(
    model=llama_model,
    target_output="[desired harmful output]",
    iterations=500,
)

# 2. Test transfer to closed-source models
transfer_results = {
    "gpt4": test_against_gpt4(suffix),
    "claude": test_against_claude(suffix),
    "gemini": test_against_gemini(suffix),
}

# 3. Analyze transfer rates
# Typical finding: 30-60% transfer for GCG suffixes

Behavioral transfer: Jailbreak strategies (not specific payloads) discovered through white-box analysis of open models often transfer conceptually:
- Use activation analysis on Llama to understand how safety works internally
- Design attacks that target the identified mechanisms
- Test whether conceptually similar attacks work against closed models
- Even when specific payloads do not transfer, the strategy often does
Transfer Rate Factors
Research has identified factors that influence transfer success:
| Factor | Higher Transfer | Lower Transfer |
|---|---|---|
| Attack type | Universal techniques (escalation, encoding) | Model-specific (XML tags, API parameters) |
| Optimization target | General compliance | Specific output format |
| Model similarity | Similar architecture and training | Very different approaches |
| Safety training | Shared training methodology | Divergent safety approaches |
| Payload format | Natural language | Adversarial gibberish (GCG) |
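The factors in the table can be folded into a rough scoring heuristic for prioritizing which techniques to test first against a new target. The weights below are invented for illustration and would need calibration against observed transfer rates.

```python
# Rough transfer-likelihood heuristic based on the factor table above.
# The weights are invented for illustration; calibrate them against
# observed transfer rates before relying on the scores.

FACTOR_WEIGHTS = {
    "universal_mechanism": 0.30,        # escalation, encoding, injection
    "general_target": 0.15,             # general compliance vs specific output
    "similar_architecture": 0.20,
    "shared_safety_training": 0.20,
    "natural_language_payload": 0.15,   # vs adversarial gibberish (GCG)
}

def transfer_score(factors):
    """Sum weights for the factors present; returns a value in [0, 1]."""
    return round(sum(FACTOR_WEIGHTS[f] for f in factors if f in FACTOR_WEIGHTS), 2)
```

A natural-language universal technique scores well even with nothing known about the target's training, which matches the case-study pattern below.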
Case Studies in Transfer
GCG Suffix Transfer
The Greedy Coordinate Gradient attack demonstrated significant cross-model transfer:
- Suffixes optimized on Llama-2 transferred to GPT-3.5-Turbo at ~50% success rate
- Transfer to GPT-4 was lower (~30%) but non-trivial
- Transfer to Claude showed variable rates depending on the target output
- Ensemble optimization (optimizing across multiple open models) improved transfer rates
Many-Shot Transfer
Many-shot jailbreaking, discovered against Claude, transferred to other models:
- Worked against GPT-4 when context window was sufficient
- Particularly effective against Gemini due to its 1M token context
- Required more examples against more heavily safety-trained models
- The technique is universal but the required volume of examples varies
Crescendo Transfer
Multi-turn crescendo attacks transfer well because they exploit conversational coherence:
- Effective across GPT-4, Claude, and Gemini
- The specific escalation path may need adjustment per model
- Claude's contextual safety makes it particularly susceptible
- GPT-4's RLHF sycophancy makes it vulnerable to frustration-based escalation
Persona Attacks (Limited Transfer)
DAN-style persona attacks show limited transfer:
- Originally effective against GPT-3.5, partially effective against GPT-4
- Generally less effective against Claude (Constitutional AI reasons about persona claims)
- Varying effectiveness against Gemini depending on the persona framing
- The specific persona construction must be adapted per model
Predicting Transferability
Decision Framework
When assessing whether a technique will transfer to a new model:
Does the technique exploit:
├── A fundamental LLM property? (instruction-data confusion, in-context learning)
│ └── HIGH transfer probability
├── A training methodology feature? (RLHF sycophancy, Constitutional AI reasoning)
│ └── MODERATE transfer to models with similar training
├── An API or deployment feature? (logit bias, safety settings, tool format)
│ └── LOW transfer (check if target has equivalent feature)
└── A model-specific artifact? (XML tags, specific token patterns)
└── VERY LOW transfer (model-specific)
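The decision tree above can be expressed as a simple lookup, which is convenient when triaging a backlog of techniques. The category labels mirror the framework; extend the table as new mechanism classes are catalogued.

```python
# The decision framework above as a lookup table. Category labels
# mirror the tree; extend as new mechanism classes are catalogued.

TRANSFER_PREDICTION = {
    "fundamental_llm_property": "high",        # instruction-data confusion, ICL
    "training_methodology": "moderate",        # RLHF sycophancy, CAI reasoning
    "api_or_deployment_feature": "low",        # logit bias, safety settings
    "model_specific_artifact": "very_low",     # XML tags, token patterns
}

def predict_transfer(mechanism_category):
    return TRANSFER_PREDICTION.get(mechanism_category, "unknown")
```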
Building a Transferability Database
Maintain a database of techniques and their observed transfer rates:
transferability_db = {
    "crescendo_v3": {
        "mechanism": "multi-turn escalation",
        "fundamental_property": "conversational coherence",
        "tested_against": {
            "gpt-4o-2024-08-06": {"success_rate": 0.35, "date": "2026-02-15"},
            "claude-sonnet-4-20250514": {"success_rate": 0.40, "date": "2026-02-20"},
            "gemini-1.5-pro-002": {"success_rate": 0.25, "date": "2026-02-22"},
        },
        "predicted_transfer": "high",
        "notes": "Path needs model-specific adjustment but strategy transfers",
    },
    # ... more techniques
}

Related Topics
- Safety Comparison -- Cross-model safety analysis
- Jailbreak Techniques -- Technique catalog
- Llama Family Attacks -- Using open weights for transfer attack development
- Payload Crafting -- Systematic payload variation for transfer testing
References
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
- Anthropic (2024). "Many-Shot Jailbreaking"
- Russinovich, M. et al. (2024). "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"