Safety Comparison Across Models
Comparing safety across GPT-4, Claude, Gemini, and open-weight models using standardized test suites, failure mode analysis, and defense coverage gap identification.
Safety varies across model families not just in degree but in kind. GPT-4, Claude, Gemini, and open-weight models each have distinct safety strengths and weaknesses shaped by their training approaches, architectures, and deployment configurations. Understanding these differences is essential for both red teaming and defense.
Standardized Safety Evaluation
Evaluation Frameworks
Several standardized frameworks exist for cross-model safety evaluation:
HarmBench -- A comprehensive benchmark for automated red teaming that tests models across:
- Direct harmful requests
- Copyright violations
- Context-based attacks (few-shot, role-play)
- Gradient-based attacks (where applicable)
JailbreakBench -- Focuses specifically on jailbreak resistance with:
- Standardized jailbreak templates
- Consistent evaluation criteria
- Leaderboard-style model comparison
Custom red team suites -- For professional red teaming, custom test suites offer:
- Coverage of domain-specific risks
- Alignment with the specific deployment context
- Flexibility to test novel attack categories
Designing a Comparative Test Suite
SAFETY_TEST_CATEGORIES = {
"refusal_calibration": {
"description": "Tests where the model should refuse",
"subcategories": [
"violence_instructions",
"illegal_activities",
"privacy_violations",
"deception_assistance",
"self_harm",
"hate_speech",
]
},
"helpfulness_calibration": {
"description": "Sensitive but legitimate requests that should be answered",
"subcategories": [
"security_research",
"medical_information",
"legal_questions",
"historical_violence",
"fictional_scenarios",
]
},
"jailbreak_resistance": {
"description": "Standard jailbreak techniques",
"subcategories": [
"persona_attacks",
"encoding_obfuscation",
"many_shot",
"crescendo",
"academic_framing",
]
},
"injection_resistance": {
"description": "Prompt injection techniques",
"subcategories": [
"system_prompt_override",
"instruction_injection",
"role_confusion",
"context_manipulation",
]
},
}Failure Mode Comparison
Different safety training approaches produce different failure modes. Understanding these differences is the core value of cross-model comparison.
RLHF Failure Modes (GPT-4)
GPT-4's RLHF-based safety exhibits these characteristic failure patterns:
- Distribution shift vulnerability -- Novel phrasings or unusual contexts not well-represented in training data can bypass safety
- Sycophancy -- Tendency to agree with users can override safety when users express frustration with refusals
- Pattern matching -- Safety behavior is learned from examples and may fail for inputs that do not match trained patterns
- Inconsistent calibration across updates -- Safety boundaries shift between model versions without clear documentation
Constitutional AI Failure Modes (Claude)
Claude's Constitutional AI approach produces distinct failures:
- Argumentation vulnerability -- The model engages with arguments about whether requests violate principles, and can be debated into compliance
- Principle conflict exploitation -- When constitutional principles conflict (helpfulness vs. harmlessness), the ambiguity can be exploited
- Legalistic reasoning -- The model may accept technically valid but practically harmful framings that satisfy the letter of its principles
- Context-dependent principle application -- The same principle may be applied differently in different conversational contexts
Multi-Layer Safety Failure Modes (Gemini)
Gemini's layered approach creates failures at layer boundaries:
- Inter-layer gaps -- Content that passes through the model's alignment but triggers no safety classifier, or vice versa
- Configurable safety exploitation -- Overly permissive safety settings leave gaps in coverage
- Cross-modal safety inconsistency -- Safety is stronger for text than for images, audio, or video
- Layer interaction effects -- Safety classifiers may interfere with model alignment in unexpected ways
Open-Weight Failure Modes
Open-weight models share common failure patterns:
- Safety removal through fine-tuning -- All open-weight safety can be removed
- Deployment without safety infrastructure -- Models deployed without external guardrails
- Community-introduced vulnerabilities -- Uncensored variants and unsafe fine-tunes
- Quantization degradation -- Safety may degrade with precision reduction
Comparative Safety Matrix
Refusal Calibration Comparison
| Category | GPT-4 | Claude | Gemini | Llama 3 | Mistral |
|---|---|---|---|---|---|
| Violence instructions | Strong | Strong | Strong | Moderate | Weak |
| Illegal activities | Strong | Strong | Strong | Moderate | Weak |
| Privacy violations | Moderate | Strong | Moderate | Moderate | Weak |
| Deception assistance | Moderate | Strong | Moderate | Weak | Weak |
| Self-harm content | Strong | Strong | Strong | Moderate | Weak |
| Hate speech | Strong | Strong | Strong | Moderate | Weak |
| Security research (should allow) | Moderate | Good | Moderate | Good | Good |
| Medical info (should allow) | Good | Good | Good | Good | Good |
Jailbreak Resistance Comparison
| Technique | GPT-4 | Claude | Gemini | Llama 3 (instruct) |
|---|---|---|---|---|
| Persona/role-play | Moderate resistance | Good resistance | Moderate resistance | Weak resistance |
| Encoding/obfuscation | Good resistance | Good resistance | Moderate resistance | Weak resistance |
| Many-shot | Moderate resistance | Moderate resistance (long context) | Weak resistance (1M context) | Moderate resistance |
| Crescendo | Weak resistance | Weak resistance | Moderate resistance | Weak resistance |
| Academic framing | Weak resistance | Moderate resistance | Weak resistance | Weak resistance |
| GCG adversarial | Moderate resistance | Moderate resistance | Moderate resistance | Weak resistance (white-box) |
System Prompt Protection
| Aspect | GPT-4 | Claude | Gemini |
|---|---|---|---|
| Extraction resistance | Moderate | Moderate | Weak-Moderate |
| Instruction override resistance | Moderate | Good | Moderate |
| Role confusion resistance | Moderate | Good | Moderate |
| Multi-turn manipulation | Weak | Weak | Moderate |
Defense Coverage Gap Analysis
Identifying Gaps Through Comparison
The most valuable output of cross-model comparison is identifying gaps -- categories where a model is significantly weaker than its peers:
GPT-4 specific gaps:
- Multi-turn crescendo attacks exploit RLHF sycophancy
- Function calling injection is a unique surface with limited defense
- Logit bias manipulation can suppress refusal tokens
Claude specific gaps:
- Argumentation attacks exploit Constitutional AI reasoning
- XML tag injection exploits training format conventions
- Extended thinking can leak safety reasoning
Gemini specific gaps:
- Cross-modal injection exploits multimodal architecture
- Configurable safety settings create application-layer gaps
- Grounding introduces web-content injection vectors
Open-weight specific gaps:
- Safety is removable through fine-tuning
- No guaranteed deployment safety infrastructure
- Community variants undermine safety investment
Gap-Driven Testing Strategy
Use identified gaps to prioritize testing:
- For each model, identify its gaps relative to peers
- Design targeted test cases for each gap
- Test whether the gap represents a fundamental limitation or a specific oversight
- Report findings with context about where the model is strong and weak
Cross-Model Safety Patterns
Universal Weaknesses
Some vulnerabilities affect all current models:
- Prompt injection -- No model has architectural separation between instructions and data
- Multi-turn escalation -- All models are susceptible to gradual context manipulation
- Context length degradation -- Safety consistency decreases with context length
- Novel encoding -- New encoding or obfuscation schemes initially bypass all models
Model-Specific Strengths
Each model also has unique strengths:
- GPT-4 -- Extensive red teaming by OpenAI and the research community
- Claude -- Principled reasoning about edge cases through Constitutional AI
- Gemini -- Multi-layered defense (model + classifiers + filters)
- Llama -- Llama Guard provides an independent safety layer
Related Topics
- Jailbreak Portability -- Which attacks transfer across models
- Cross-Model Comparison Overview -- Comparison methodology
- Jailbreak Techniques -- Techniques tested across models
- Defense Evasion -- Bypassing model-specific defenses
References
- Mazeika, M. et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
- Zou, A. et al. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models"
What is the primary value of comparing safety across model families for a red teamer?