Jailbreak Portability
Analysis of which jailbreaks transfer across models and why, including universal vs model-specific techniques, transfer attack methodology, and factors that determine portability.
Not all jailbreaks are created equal when it comes to portability. Some techniques work across virtually every model because they exploit fundamental properties of language models. Others work against only one model family because they target specific training decisions or architectural features. Understanding this distinction is one of the most valuable skills for an AI red teamer, because it determines how efficiently you can apply findings across engagements.
The Portability Spectrum
Jailbreaks fall on a spectrum from universally portable to entirely model-specific:
Universal ◄──────────────────────────────────► Model-Specific

Multi-turn     Many-shot      Persona       XML tag      Logit bias
escalation     (in-context    attacks       injection    manipulation
               learning)      (DAN-style)   (Claude)     (GPT-4 API)
Universal Techniques
These techniques exploit fundamental properties shared by all autoregressive language models:
Multi-turn escalation (crescendo): Works across all models because it exploits the tension between conversational coherence and safety enforcement. All models use conversation history as context, and all models have some tendency to maintain conversational consistency.
Prompt injection: The fundamental instruction-data confusion affects all models because no current architecture structurally separates instructions from data. The specific format of injection may vary, but the underlying vulnerability is universal.
Context length degradation: Safety mechanisms degrade with context length across all models because attention becomes more diffuse and safety-relevant context may be displaced.
Novel encoding: New obfuscation techniques initially bypass all models because safety training cannot anticipate every possible encoding scheme. The window of effectiveness varies, but the initial bypass is typically universal.
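The novelty effect can be illustrated by contrasting a well-known encoding with an ad-hoc one. This is a minimal sketch for building portability test payloads; the function names and the substitution table are illustrative assumptions, not from any specific technique catalog:

```python
import codecs

def encode_rot13(text: str) -> str:
    """Well-known encoding: models have seen ROT13 in training data,
    so it tests decoding ability rather than novelty."""
    return codecs.encode(text, "rot13")

def encode_novel(text: str) -> str:
    """Toy stand-in for a 'novel' scheme: an ad-hoc substitution table
    that safety training is unlikely to have seen verbatim."""
    table = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
    return text.lower().translate(table)
```

In practice the ad-hoc scheme's advantage is temporary: once a scheme circulates, it gets folded into safety training, which is why the window of effectiveness varies.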
Partially Portable Techniques
These techniques exploit properties common to many but not all models:
Many-shot jailbreaking: Works against all models with sufficient context windows, but effectiveness varies with context size. More effective against Gemini (1M tokens) than GPT-4 (128K tokens), and may not work against models with very short context windows.
Persona/role-play attacks: Work across most models but with varying effectiveness. RLHF-trained models (GPT-4) may be more susceptible to persona dissociation than Constitutional AI models (Claude) because RLHF learns behavior patterns while Constitutional AI reasons about principles.
Academic/fictional framing: Works across most models because all are trained to be helpful in academic and creative contexts. However, the specific framing needed varies by model -- what constitutes a convincing academic context differs based on each model's training.
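The context-size dependence of many-shot attacks can be sketched as a simple budget calculation: the number of in-context examples that fit determines effectiveness. The per-example token budget here is a hypothetical constant for illustration:

```python
def build_many_shot_prompt(examples, query, max_tokens, tokens_per_example=150):
    """Assemble as many in-context examples as the target's context window
    allows. `tokens_per_example` is an assumed average, not a measured value."""
    budget = max_tokens - tokens_per_example  # reserve room for the query itself
    n_fit = min(len(examples), budget // tokens_per_example)
    shots = "\n\n".join(examples[:n_fit])
    return f"{shots}\n\n{query}", n_fit
```

Against a 1M-token context the full example pool fits; against 128K only a fraction does, which matches the observed effectiveness gap between Gemini and GPT-4.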
Model-Specific Techniques
These techniques exploit features unique to specific models:
XML tag injection (Claude): Exploits Claude's training on XML-structured data. Has no meaningful effect against GPT-4 or Gemini because they were not trained on the same structural conventions.
Logit bias manipulation (GPT-4): Exploits the logit_bias API parameter. Only available in OpenAI's API; other providers do not expose this parameter.
Configurable safety settings (Gemini): Exploits Gemini's API-level safety threshold configuration. Only relevant to Gemini's API.
Argumentation attacks (Claude): Exploit Constitutional AI's principle-based reasoning. Less effective against RLHF models that do not engage with arguments about their safety training.
Expert routing manipulation (Mixtral/MoE): Requires an MoE architecture. Only relevant to Mixtral, DeepSeek-V2, and other MoE models.
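In an engagement, model-specific techniques are typically filtered by target before testing begins. A minimal sketch of such a lookup, with an assumed mapping built from the list above (the dictionary keys and family names are illustrative):

```python
# Assumed mapping from the model-specific techniques described above
# to the model families they apply to.
TECHNIQUE_APPLICABILITY = {
    "xml_tag_injection": {"claude"},
    "logit_bias_manipulation": {"gpt-4"},
    "safety_settings_config": {"gemini"},
    "argumentation": {"claude"},
    "expert_routing": {"mixtral", "deepseek-v2"},
}

def applicable_techniques(target_family: str) -> list[str]:
    """Return the model-specific techniques relevant to a target family."""
    return sorted(t for t, families in TECHNIQUE_APPLICABILITY.items()
                  if target_family in families)
```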
Transfer Attack Methodology
Developing Transferable Attacks
When developing jailbreaks intended to transfer across models:
- Target universal mechanisms: Focus on attacks that exploit fundamental LLM properties rather than model-specific features
- Test breadth before depth: Test a technique across multiple models before investing in model-specific refinement
- Analyze failure modes: When a technique fails against a specific model, understand why -- the failure reveals information about that model's defenses
- Iterate across models: Use failures against one model to refine the technique for broader transferability
```python
def transfer_attack_campaign(technique, models, iterations=3):
    """Test and refine a technique across multiple models."""
    results = {"iterations": []}
    for iteration in range(iterations):
        iteration_results = {}
        for model_name, model in models.items():
            response = model.generate(technique.payload)
            success = technique.evaluate(response)
            iteration_results[model_name] = {
                "success": success,
                "response_type": classify_response(response),
            }
        results["iterations"].append(iteration_results)
        # Analyze failures and refine the technique
        failures = [m for m, r in iteration_results.items() if not r["success"]]
        if failures:
            technique = refine_for_transferability(technique, failures, iteration_results)
    results["final_transferability"] = compute_transfer_rate(results)
    return results
```

Using Open-Weight Models as Proxies
Open-weight models (especially Llama) serve as valuable proxies for developing transfer attacks:
Gradient-based transfer: Adversarial suffixes optimized on open-weight models using GCG or similar techniques often transfer to closed-source models:
```python
# Transfer attack development workflow
# 1. Optimize adversarial suffix on Llama using gradient access
suffix = optimize_gcg_suffix(
    model=llama_model,
    target_output="[desired harmful output]",
    iterations=500,
)

# 2. Test transfer to closed-source models
transfer_results = {
    "gpt4": test_against_gpt4(suffix),
    "claude": test_against_claude(suffix),
    "gemini": test_against_gemini(suffix),
}

# 3. Analyze transfer rates
# Typical finding: 30-60% transfer for GCG suffixes
```

Behavioral transfer: Jailbreak strategies (not specific payloads) discovered through white-box analysis of open models often transfer conceptually:
- Use activation analysis on Llama to understand how safety works internally
- Design attacks that target the identified mechanisms
- Test whether conceptually similar attacks work against closed models
- Even when specific payloads do not transfer, the strategy often does
Transfer Rate Factors
Research has identified factors that influence transfer success:
| Factor | Higher Transfer | Lower Transfer |
|---|---|---|
| Attack type | Universal techniques (escalation, encoding) | Model-specific (XML tags, API parameters) |
| Optimization target | General compliance | Specific output format |
| Model similarity | Similar architecture and training | Very different approaches |
| Safety training | Shared training methodology | Divergent safety approaches |
| Payload format | Natural language | Adversarial gibberish (GCG) |
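These factors can be combined into a rough prior on transfer success. A sketch using a geometric mean over per-factor scores; the factor names mirror the table, but the numeric weights are invented for illustration, not empirical values:

```python
# Hypothetical per-factor scores in [0, 1]; higher means better transfer odds.
TRANSFER_FACTORS = {
    "attack_type": {"universal": 1.0, "model_specific": 0.1},
    "optimization_target": {"general_compliance": 1.0, "specific_output": 0.4},
    "model_similarity": {"similar": 1.0, "different": 0.3},
    "payload_format": {"natural_language": 1.0, "adversarial_gibberish": 0.5},
}

def transfer_score(profile: dict) -> float:
    """Geometric mean of the factor scores: one very unfavorable factor
    (e.g. a model-specific attack type) dominates the estimate."""
    scores = [TRANSFER_FACTORS[factor][value] for factor, value in profile.items()]
    product = 1.0
    for s in scores:
        product *= s
    return product ** (1 / len(scores))
```

The geometric mean is a deliberate choice here: transfer tends to fail outright when any one factor is strongly unfavorable, which an arithmetic mean would understate.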
Case Studies in Transfer
GCG Suffix Transfer
The Greedy Coordinate Gradient attack demonstrated significant cross-model transfer:
- Suffixes optimized on Llama-2 transferred to GPT-3.5-Turbo at ~50% success rate
- Transfer to GPT-4 was lower (~30%) but non-trivial
- Transfer to Claude showed variable rates depending on the target output
- Ensemble optimization (optimizing across multiple open models) improved transfer rates
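The ensemble idea from the last bullet amounts to optimizing the suffix against an averaged objective rather than a single model's loss. A sketch of that objective; `model.loss` is a hypothetical interface returning the negative log-likelihood of the target continuation, not a real library call:

```python
def ensemble_loss(candidate_suffix, models, prompt, target):
    """Average the target-output loss across several open-weight models.
    A GCG-style optimizer then picks token swaps that reduce this
    ensemble objective instead of any single model's loss.
    `model.loss(text, target)` is an assumed white-box interface."""
    losses = [m.loss(prompt + candidate_suffix, target) for m in models]
    return sum(losses) / len(losses)
```

Intuitively, a suffix that lowers the loss on several differently trained open models is more likely to exploit a shared mechanism, which is what transfers to closed models.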
Many-Shot Transfer
Many-shot jailbreaking, discovered against Claude, transferred to other models:
- Worked against GPT-4 when the context window was sufficient
- Particularly effective against Gemini due to its 1M-token context
- Required more examples against more heavily safety-trained models
- The technique is universal but the required volume of examples varies
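Because the required volume varies per model, a useful measurement is the minimum number of examples at which the technique starts working. A sketch via binary search, assuming success is roughly monotone in shot count; `attempt(model, n)` is a hypothetical predicate wrapping one test run:

```python
def min_shots_required(model, attempt, max_shots=256):
    """Binary-search the smallest shot count at which `attempt` succeeds.
    `attempt(model, n)` is an assumed test harness call, treated as
    monotone: if n shots succeed, n+1 shots also succeed."""
    lo, hi = 1, max_shots
    if not attempt(model, hi):
        return None  # fails even at the maximum volume tested
    while lo < hi:
        mid = (lo + hi) // 2
        if attempt(model, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

Comparing this threshold across models quantifies the "required volume varies" observation: more heavily safety-trained models show a higher threshold.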
Crescendo Transfer
Multi-turn crescendo attacks transfer well because they exploit conversational coherence:
- Effective across GPT-4, Claude, and Gemini
- The specific escalation path may need adjustment per model
- Claude's contextual safety makes it particularly susceptible
- GPT-4's RLHF sycophancy makes it vulnerable to frustration-based escalation
Persona Attacks (Limited Transfer)
DAN-style persona attacks show limited transfer:
- Originally effective against GPT-3.5, partially effective against GPT-4
- Generally less effective against Claude (Constitutional AI reasons about persona claims)
- Varying effectiveness against Gemini depending on the persona framing
- The specific persona construction must be adapted per model
Predicting Transferability
Decision Framework
When assessing whether a technique will transfer to a new model:
Does the technique exploit:
├── A fundamental LLM property? (instruction-data confusion, in-context learning)
│   └── HIGH transfer probability
├── A training methodology feature? (RLHF sycophancy, Constitutional AI reasoning)
│   └── MODERATE transfer to models with similar training
├── An API or deployment feature? (logit bias, safety settings, tool format)
│   └── LOW transfer (check if target has equivalent feature)
└── A model-specific artifact? (XML tags, specific token patterns)
    └── VERY LOW transfer (model-specific)
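The decision tree reduces to a lookup from mechanism class to transfer estimate. A minimal sketch; the mechanism-class labels are invented identifiers for the four branches:

```python
def predict_transfer(mechanism_class: str) -> str:
    """Map the class of mechanism a technique exploits to a rough
    transfer estimate, following the decision framework."""
    estimates = {
        "fundamental_llm_property": "high",
        "training_methodology": "moderate (to similarly trained models)",
        "api_or_deployment_feature": "low (check for an equivalent feature)",
        "model_specific_artifact": "very low",
    }
    return estimates.get(mechanism_class, "unknown")
```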
Building a Transferability Database
Maintain a database of techniques and their observed transfer rates:
```python
transferability_db = {
    "crescendo_v3": {
        "mechanism": "multi-turn escalation",
        "fundamental_property": "conversational coherence",
        "tested_against": {
            "gpt-4o-2024-08-06": {"success_rate": 0.35, "date": "2026-02-15"},
            "claude-sonnet-4-20250514": {"success_rate": 0.40, "date": "2026-02-20"},
            "gemini-1.5-pro-002": {"success_rate": 0.25, "date": "2026-02-22"},
        },
        "predicted_transfer": "high",
        "notes": "Path needs model-specific adjustment but strategy transfers",
    },
    # ... more techniques
}
```

Related Topics
- Safety Comparison -- Cross-model safety analysis
- Jailbreak Techniques -- Technique catalog
- Llama Family Attacks -- Using open weights for transfer attack development
- Payload Crafting -- Systematic payload variation for transfer testing
References
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
- Anthropic (2024). "Many-Shot Jailbreaking"
- Russinovich, M. et al. (2024). "Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"