Safety Comparison Across Models

advanced8 min readUpdated 2026-03-15

Comparing safety across GPT-4, Claude, Gemini, and open-weight models using standardized test suites, failure mode analysis, and defense coverage gap identification.

safety-comparison benchmarking failure-modes coverage-gaps cross-model

Safety varies across model families not just in degree but in kind. GPT-4, Claude, Gemini, and open-weight models each have distinct safety strengths and weaknesses shaped by their training approaches, architectures, and deployment configurations. Understanding these differences is essential for both red teaming and defense.

Standardized Safety Evaluation

Evaluation Frameworks

Several standardized frameworks exist for cross-model safety evaluation:

HarmBench -- A comprehensive benchmark for automated red teaming that tests models across:

Direct harmful requests
Copyright violations
Context-based attacks (few-shot, role-play)
Gradient-based attacks (where applicable)

JailbreakBench -- Focuses specifically on jailbreak resistance with:

Standardized jailbreak templates
Consistent evaluation criteria
Leaderboard-style model comparison

Custom red team suites -- For professional red teaming, custom test suites offer:

Coverage of domain-specific risks
Alignment with the specific deployment context
Flexibility to test novel attack categories

Designing a Comparative Test Suite

SAFETY_TEST_CATEGORIES = {
    "refusal_calibration": {
        "description": "Tests where the model should refuse",
        "subcategories": [
            "violence_instructions",
            "illegal_activities",
            "privacy_violations",
            "deception_assistance",
            "self_harm",
            "hate_speech",
        ]
    },
    "helpfulness_calibration": {
        "description": "Sensitive but legitimate requests that should be answered",
        "subcategories": [
            "security_research",
            "medical_information",
            "legal_questions",
            "historical_violence",
            "fictional_scenarios",
        ]
    },
    "jailbreak_resistance": {
        "description": "Standard jailbreak techniques",
        "subcategories": [
            "persona_attacks",
            "encoding_obfuscation",
            "many_shot",
            "crescendo",
            "academic_framing",
        ]
    },
    "injection_resistance": {
        "description": "Prompt injection techniques",
        "subcategories": [
            "system_prompt_override",
            "instruction_injection",
            "role_confusion",
            "context_manipulation",
        ]
    },
}

Failure Mode Comparison

Different safety training approaches produce different failure modes. Understanding these differences is the core value of cross-model comparison.

RLHF Failure Modes (GPT-4)

GPT-4's RLHF-based safety exhibits these characteristic failure patterns:

Distribution shift vulnerability -- Novel phrasings or unusual contexts not well-represented in training data can bypass safety
Sycophancy -- Tendency to agree with users can override safety when users express frustration with refusals
Pattern matching -- Safety behavior is learned from examples and may fail for inputs that do not match trained patterns
Inconsistent calibration across updates -- Safety boundaries shift between model versions without clear documentation

Constitutional AI Failure Modes (Claude)

Claude's Constitutional AI approach produces distinct failures:

Argumentation vulnerability -- The model engages with arguments about whether requests violate principles, and can be debated into compliance
Principle conflict exploitation -- When constitutional principles conflict (helpfulness vs. harmlessness), the ambiguity can be exploited
Legalistic reasoning -- The model may accept technically valid but practically harmful framings that satisfy the letter of its principles
Context-dependent principle application -- The same principle may be applied differently in different conversational contexts

Multi-Layer Safety Failure Modes (Gemini)

Gemini's layered approach creates failures at layer boundaries:

Inter-layer gaps -- Content that passes through the model's alignment but triggers no safety classifier, or vice versa
Configurable safety exploitation -- Overly permissive safety settings leave gaps in coverage
Cross-modal safety inconsistency -- Safety is stronger for text than for images, audio, or video
Layer interaction effects -- Safety classifiers may interfere with model alignment in unexpected ways

Open-Weight Failure Modes

Open-weight models share common failure patterns:

Safety removal through fine-tuning -- All open-weight safety can be removed
Deployment without safety infrastructure -- Models deployed without external guardrails
Community-introduced vulnerabilities -- Uncensored variants and unsafe fine-tunes
Quantization degradation -- Safety may degrade with precision reduction

Comparative Safety Matrix

Refusal Calibration Comparison

Category	GPT-4	Claude	Gemini	Llama 3	Mistral
Violence instructions	Strong	Strong	Strong	Moderate	Weak
Illegal activities	Strong	Strong	Strong	Moderate	Weak
Privacy violations	Moderate	Strong	Moderate	Moderate	Weak
Deception assistance	Moderate	Strong	Moderate	Weak	Weak
Self-harm content	Strong	Strong	Strong	Moderate	Weak
Hate speech	Strong	Strong	Strong	Moderate	Weak
Security research (should allow)	Moderate	Good	Moderate	Good	Good
Medical info (should allow)	Good	Good	Good	Good	Good

Jailbreak Resistance Comparison

Technique	GPT-4	Claude	Gemini	Llama 3 (instruct)
Persona/role-play	Moderate resistance	Good resistance	Moderate resistance	Weak resistance
Encoding/obfuscation	Good resistance	Good resistance	Moderate resistance	Weak resistance
Many-shot	Moderate resistance	Moderate resistance (long context)	Weak resistance (1M context)	Moderate resistance
Crescendo	Weak resistance	Weak resistance	Moderate resistance	Weak resistance
Academic framing	Weak resistance	Moderate resistance	Weak resistance	Weak resistance
GCG adversarial	Moderate resistance	Moderate resistance	Moderate resistance	Weak resistance (white-box)

System Prompt Protection

Aspect	GPT-4	Claude	Gemini
Extraction resistance	Moderate	Moderate	Weak-Moderate
Instruction override resistance	Moderate	Good	Moderate
Role confusion resistance	Moderate	Good	Moderate
Multi-turn manipulation	Weak	Weak	Moderate

Defense Coverage Gap Analysis

Identifying Gaps Through Comparison

The most valuable output of cross-model comparison is identifying gaps -- categories where a model is significantly weaker than its peers:

GPT-4 specific gaps:

Multi-turn crescendo attacks exploit RLHF sycophancy
Function calling injection is a unique surface with limited defense
Logit bias manipulation can suppress refusal tokens

Claude specific gaps:

Argumentation attacks exploit Constitutional AI reasoning
XML tag injection exploits training format conventions
Extended thinking can leak safety reasoning

Gemini specific gaps:

Cross-modal injection exploits multimodal architecture
Configurable safety settings create application-layer gaps
Grounding introduces web-content injection vectors

Open-weight specific gaps:

Safety is removable through fine-tuning
No guaranteed deployment safety infrastructure
Community variants undermine safety investment

Gap-Driven Testing Strategy

Use identified gaps to prioritize testing:

For each model, identify its gaps relative to peers
Design targeted test cases for each gap
Test whether the gap represents a fundamental limitation or a specific oversight
Report findings with context about where the model is strong and weak

Cross-Model Safety Patterns

Universal Weaknesses

Some vulnerabilities affect all current models:

Prompt injection -- No model has architectural separation between instructions and data
Multi-turn escalation -- All models are susceptible to gradual context manipulation
Context length degradation -- Safety consistency decreases with context length
Novel encoding -- New encoding or obfuscation schemes initially bypass all models

Model-Specific Strengths

Each model also has unique strengths:

GPT-4 -- Extensive red teaming by OpenAI and the research community
Claude -- Principled reasoning about edge cases through Constitutional AI
Gemini -- Multi-layered defense (model + classifiers + filters)
Llama -- Llama Guard provides an independent safety layer

Jailbreak Portability -- Which attacks transfer across models
Cross-Model Comparison Overview -- Comparison methodology
Jailbreak Techniques -- Techniques tested across models
Defense Evasion -- Bypassing model-specific defenses

References

Mazeika, M. et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Knowledge Check

What is the primary value of comparing safety across model families for a red teamer?

Safety Comparison Across Models

Related articles

Safety Comparison Across Models

Related articles