What is Safety Comparison?

Comparing safety across GPT-4, Claude, Gemini, and open-weight models using standardized test suites, failure mode analysis, and defense coverage gap identification.

What is Jailbreak Portability?

Analysis of which jailbreaks transfer across models and why, including universal vs model-specific techniques, transfer attack methodology, and factors that determine portability.

Cross-Model Comparison

intermediate7 min readUpdated 2026-03-15

Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.

comparison cross-model methodology evaluation red-teaming benchmarking

Comparing security across model families is essential for understanding the broader LLM threat landscape. A vulnerability that exists in one model but not another reveals something about both: the vulnerable model's weakness and the other model's defense. A vulnerability that exists across all models reveals a fundamental limitation of the technology.

Why Compare Models

For Red Teamers

Cross-model comparison helps red teamers:

Transfer attacks efficiently -- Know which attacks are likely to work against a new model based on its similarity to models you have already tested
Identify model-specific vulnerabilities -- By testing the same attack across models, find weaknesses unique to specific architectures or training approaches
Prioritize testing -- Focus on the model-specific attack surfaces most likely to yield findings based on comparative analysis
Understand defense mechanisms -- By comparing which attacks succeed and fail across models, infer the defense mechanisms each model uses

For Defenders

Comparative findings help defenders:

Benchmark safety -- Understand how their model's safety compares to alternatives
Identify coverage gaps -- Find categories where their model is weaker than peers
Import defenses -- Adopt defensive approaches that work for other models
Predict future attacks -- Attacks that work on similar models may soon work on theirs

Comparison Framework

Standardized Test Dimensions

Structure cross-model comparisons along these dimensions:

Dimension	What to Compare	How to Measure
Refusal calibration	What the model refuses and how strongly	Standardized harmful request test suite
System prompt adherence	How well the model follows system instructions	Injection resistance test suite
Jailbreak resistance	Which jailbreak categories succeed	Multi-technique jailbreak battery
Information leakage	How readily the model reveals confidential information	System prompt extraction, training data probing
Tool use safety	How safely the model handles function calling	Tool-use injection test suite
Multimodal safety	Cross-modal injection resistance	Image/audio injection test suite
Safety consistency	How consistent safety behavior is across contexts	Same request in multiple framings

Standardized Test Suite Design

For valid comparison, test suites must be:

Controlled for capability: Compare models at similar capability levels. Testing a frontier model against a 7B parameter model produces misleading comparisons because differences in safety may reflect capability differences rather than safety investment.

Controlled for API differences: Normalize API differences so that test conditions are equivalent. For example, ensure that system prompts are formatted correctly for each model's API.

Reproducible: Use deterministic settings (temperature=0) where possible, and run stochastic tests multiple times to compute success rates.

Comprehensive: Cover all major harm categories, not just those where one model is known to be weak.

class StandardizedTestSuite:
    """Cross-model comparison test suite."""
 
    def __init__(self, models):
        self.models = models  # Dict of model_name -> ModelWrapper
        self.test_cases = self.load_standard_tests()
 
    def run_comparison(self):
        results = {}
        for test_case in self.test_cases:
            results[test_case.id] = {}
            for name, model in self.models.items():
                response = model.generate(
                    system=test_case.system_prompt,
                    user=test_case.user_message,
                    temperature=0,
                )
                results[test_case.id][name] = self.evaluate(response, test_case)
        return results
 
    def evaluate(self, response, test_case):
        return {
            "complied": not self.is_refusal(response),
            "refusal_type": self.classify_refusal(response),
            "content_safety": self.assess_content(response),
            "information_leaked": self.detect_leakage(response),
        }

Normalization Across APIs

Each model has different API conventions. Normalize these for fair comparison:

class ModelWrapper:
    """Abstract wrapper for cross-model testing."""
 
    def generate(self, system, user, temperature=0):
        raise NotImplementedError
 
class GPT4Wrapper(ModelWrapper):
    def generate(self, system, user, temperature=0):
        return openai_client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user}
            ],
            temperature=temperature,
        )
 
class ClaudeWrapper(ModelWrapper):
    def generate(self, system, user, temperature=0):
        return anthropic_client.messages.create(
            model="claude-sonnet-4-20250514",
            system=system,
            messages=[{"role": "user", "content": user}],
            temperature=temperature,
        )
 
class GeminiWrapper(ModelWrapper):
    def generate(self, system, user, temperature=0):
        model = genai.GenerativeModel("gemini-1.5-pro-002",
            system_instruction=system)
        return model.generate_content(user,
            generation_config={"temperature": temperature})

Comparison Pitfalls

Common Mistakes

Comparing different capability tiers: GPT-4o vs. Llama-8B is not a meaningful safety comparison because capability differences dominate.

Ignoring API-level defenses: GPT-4's API has content moderation that is separate from the model. Claude's API may have different filtering. Comparing "model safety" without accounting for infrastructure-level defenses is misleading.

Single-technique bias: Testing only one jailbreak technique may show one model as "more secure" simply because that technique targets the other model's specific weakness.

Snapshot bias: Comparing models at different points in time (GPT-4 from six months ago vs. Claude from today) confounds model differences with temporal improvement.

Ignoring partial compliance: Binary "refused/complied" classification misses nuanced safety behavior. One model may fully refuse while another provides a partial, hedged response -- both might be classified as "safe" despite very different behavior.

Valid Comparison Requirements

For a comparison to be meaningful:

Models should be at comparable capability levels
Testing should cover multiple attack categories and techniques
Safety should be measured on a spectrum, not binary
Infrastructure-level defenses should be documented and accounted for
Model versions and test dates should be recorded
Results should be validated with sufficient statistical power

Key Comparison Areas

This section provides detailed comparisons in two critical areas:

Safety Comparison -- Comparing safety across models with standardized test suites and failure mode analysis
Jailbreak Portability -- Which jailbreaks transfer across models, which are model-specific, and why

Each comparison reveals insights about both the models being compared and the underlying technology.

Safety Comparison -- Detailed safety comparison methodology
Jailbreak Portability -- Cross-model transfer analysis
Model Deep Dives Overview -- Model profiling methodology
Automation Frameworks -- Tools for cross-model testing at scale

References

Mazeika, M. et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"

Knowledge Check

Why is comparing GPT-4o against Llama-8B for safety not a meaningful comparison?

Cross-Model Comparison

Why Compare Models

For Red Teamers

For Defenders

Comparison Framework

Standardized Test Dimensions

Standardized Test Suite Design

Normalization Across APIs

Comparison Pitfalls

Common Mistakes

Valid Comparison Requirements

Key Comparison Areas

References

Learning Path

Cross-Model Comparison

Why Compare Models

For Red Teamers

For Defenders

Comparison Framework

Standardized Test Dimensions

Standardized Test Suite Design

Normalization Across APIs

Comparison Pitfalls

Common Mistakes

Valid Comparison Requirements

Key Comparison Areas

References

Learning Path

Cross-Model Comparison

Learning Path

Related articles

Cross-Model Comparison

Learning Path

Related articles