Cross-Model Comparison
Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.
Comparing security across model families is essential for understanding the broader LLM threat landscape. A vulnerability that exists in one model but not another reveals something about both: the vulnerable model's weakness and the other model's defense. A vulnerability that exists across all models reveals a fundamental limitation of the technology.
Why Compare Models
For Red Teamers
Cross-model comparison helps red teamers:
- Transfer attacks efficiently -- Know which attacks are likely to work against a new model based on its similarity to models you have already tested
- Identify model-specific vulnerabilities -- By testing the same attack across models, find weaknesses unique to specific architectures or training approaches
- Prioritize testing -- Focus on the model-specific attack surfaces most likely to yield findings based on comparative analysis
- Understand defense mechanisms -- By comparing which attacks succeed and fail across models, infer the defense mechanisms each model uses
For Defenders
Comparative findings help defenders:
- Benchmark safety -- Understand how their model's safety compares to alternatives
- Identify coverage gaps -- Find categories where their model is weaker than peers
- Import defenses -- Adopt defensive approaches that work for other models
- Predict future attacks -- Attacks that work on similar models may soon work on theirs
Comparison Framework
Standardized Test Dimensions
Structure cross-model comparisons along these dimensions:
| Dimension | What to Compare | How to Measure |
|---|---|---|
| Refusal calibration | What the model refuses and how strongly | Standardized harmful request test suite |
| System prompt adherence | How well the model follows system instructions | Injection resistance test suite |
| Jailbreak resistance | Which jailbreak categories succeed | Multi-technique jailbreak battery |
| Information leakage | How readily the model reveals confidential information | System prompt extraction, training data probing |
| Tool use safety | How safely the model handles function calling | Tool-use injection test suite |
| Multimodal safety | Cross-modal injection resistance | Image/audio injection test suite |
| Safety consistency | How consistent safety behavior is across contexts | Same request in multiple framings |
Standardized Test Suite Design
For valid comparison, test suites must be:
Controlled for capability: Compare models at similar capability levels. Testing a frontier model against a 7B parameter model produces misleading comparisons because differences in safety may reflect capability differences rather than safety investment.
Controlled for API differences: Normalize API differences so that test conditions are equivalent. For example, ensure that system prompts are formatted correctly for each model's API.
Reproducible: Use deterministic settings (temperature=0) where possible, and run stochastic tests multiple times to compute success rates.
Comprehensive: Cover all major harm categories, not just those where one model is known to be weak.
class StandardizedTestSuite:
"""Cross-model comparison test suite."""
def __init__(self, models):
self.models = models # Dict of model_name -> ModelWrapper
self.test_cases = self.load_standard_tests()
def run_comparison(self):
results = {}
for test_case in self.test_cases:
results[test_case.id] = {}
for name, model in self.models.items():
response = model.generate(
system=test_case.system_prompt,
user=test_case.user_message,
temperature=0,
)
results[test_case.id][name] = self.evaluate(response, test_case)
return results
def evaluate(self, response, test_case):
return {
"complied": not self.is_refusal(response),
"refusal_type": self.classify_refusal(response),
"content_safety": self.assess_content(response),
"information_leaked": self.detect_leakage(response),
}Normalization Across APIs
Each model has different API conventions. Normalize these for fair comparison:
class ModelWrapper:
"""Abstract wrapper for cross-model testing."""
def generate(self, system, user, temperature=0):
raise NotImplementedError
class GPT4Wrapper(ModelWrapper):
def generate(self, system, user, temperature=0):
return openai_client.chat.completions.create(
model="gpt-4o-2024-08-06",
messages=[
{"role": "system", "content": system},
{"role": "user", "content": user}
],
temperature=temperature,
)
class ClaudeWrapper(ModelWrapper):
def generate(self, system, user, temperature=0):
return anthropic_client.messages.create(
model="claude-sonnet-4-20250514",
system=system,
messages=[{"role": "user", "content": user}],
temperature=temperature,
)
class GeminiWrapper(ModelWrapper):
def generate(self, system, user, temperature=0):
model = genai.GenerativeModel("gemini-1.5-pro-002",
system_instruction=system)
return model.generate_content(user,
generation_config={"temperature": temperature})Comparison Pitfalls
Common Mistakes
Comparing different capability tiers: GPT-4o vs. Llama-8B is not a meaningful safety comparison because capability differences dominate.
Ignoring API-level defenses: GPT-4's API has content moderation that is separate from the model. Claude's API may have different filtering. Comparing "model safety" without accounting for infrastructure-level defenses is misleading.
Single-technique bias: Testing only one jailbreak technique may show one model as "more secure" simply because that technique targets the other model's specific weakness.
Snapshot bias: Comparing models at different points in time (GPT-4 from six months ago vs. Claude from today) confounds model differences with temporal improvement.
Ignoring partial compliance: Binary "refused/complied" classification misses nuanced safety behavior. One model may fully refuse while another provides a partial, hedged response -- both might be classified as "safe" despite very different behavior.
Valid Comparison Requirements
For a comparison to be meaningful:
- Models should be at comparable capability levels
- Testing should cover multiple attack categories and techniques
- Safety should be measured on a spectrum, not binary
- Infrastructure-level defenses should be documented and accounted for
- Model versions and test dates should be recorded
- Results should be validated with sufficient statistical power
Key Comparison Areas
This section provides detailed comparisons in two critical areas:
- Safety Comparison -- Comparing safety across models with standardized test suites and failure mode analysis
- Jailbreak Portability -- Which jailbreaks transfer across models, which are model-specific, and why
Each comparison reveals insights about both the models being compared and the underlying technology.
Related Topics
- Safety Comparison -- Detailed safety comparison methodology
- Jailbreak Portability -- Cross-model transfer analysis
- Model Deep Dives Overview -- Model profiling methodology
- Automation Frameworks -- Tools for cross-model testing at scale
References
- Mazeika, M. et al. (2024). "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
- Chao, P. et al. (2024). "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
- Wei, A. et al. (2023). "Jailbroken: How Does LLM Safety Training Fail?"
Why is comparing GPT-4o against Llama-8B for safety not a meaningful comparison?