AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
AI safety evaluation transforms subjective judgments ("is this model safe?") into measurable, reproducible assessments. Safety benchmarks provide the shared measurement infrastructure that enables comparison across models, tracking of improvements over time, and communication of risk to stakeholders.
The Evaluation Landscape
Major Safety Benchmarks
| Benchmark | Focus | Attack Types | Metrics | Maintained By |
|---|---|---|---|---|
| HarmBench | Automated red teaming | Jailbreaks, harmful content | Attack success rate (ASR), refusal rate | CMU / Center for AI Safety |
| TrustLLM | Comprehensive trustworthiness | Safety, fairness, robustness, privacy | Multi-dimensional scores | Academic consortium |
| SafetyBench | Multilingual safety | Harmful content across languages | Per-category accuracy | Multiple institutions |
| AdvBench | Adversarial robustness | Adversarial suffixes, GCG attacks | ASR with classifier | Academic |
| JailbreakBench | Jailbreak evaluation | Known jailbreak techniques | Success rate, judge agreement | Academic |
| MLCommons AI Safety | Standardized safety | Aligned with NIST taxonomy | Standardized scoring | MLCommons |
Coverage Map
┌──────────────────────────────────────┐
│ Safety Evaluation │
│ Landscape │
├──────────────────────────────────────┤
│ │
│ HarmBench ────── Jailbreaks │
│ │ │
│ ├──── Direct injection │
│ ├──── Adversarial suffixes │
│ └──── Automated attacks │
│ │
│ TrustLLM ────── Multi-dimensional │
│ ├──── Safety │
│ ├──── Fairness │
│ ├──── Robustness │
│ └──── Privacy │
│ │
│ Custom ─────── Deployment-specific │
│ ├──── Domain policies │
│ ├──── Tool/agent safety │
│ └──── System integration │
└──────────────────────────────────────┘
Evaluation Methodology
The Evaluation Pipeline
1. Define evaluation scope
What safety properties matter for this deployment? Map deployment risks to benchmark categories. Not every benchmark is relevant to every deployment.
2. Select benchmarks
Choose standardized benchmarks that cover the identified risk categories. Plan custom test suites for deployment-specific risks that benchmarks do not cover.
3. Configure evaluation harness
Set up the evaluation infrastructure: model API access, judge models, scoring functions, and result storage. See Building Evaluation Harnesses.
4. Execute evaluation
Run benchmarks with consistent parameters. Record model version, temperature, system prompt, and all configuration details for reproducibility.
5. Analyze results
Compute metrics, identify failure categories, and contextualize results against deployment risk. Raw scores without context are meaningless.
6. Report and track
Document findings using standardized formats. Track metrics over time to detect regressions. See Red Team Metrics Beyond ASR.
Benchmark Limitations
| Limitation | Description | Impact |
|---|---|---|
| Static test sets | Fixed prompts become known; models can be tuned to pass them | Benchmark scores improve without genuine safety improvement |
| Category gaps | No benchmark covers all attack types | False confidence from high scores on covered categories |
| Judge reliability | LLM-based judges have their own biases and failure modes | Inconsistent scoring, especially on edge cases |
| Context blindness | Benchmarks test isolated interactions, not system-level behavior | Multi-turn attacks, agentic risks, and integration issues are missed |
| Cultural bias | Most benchmarks are English-centric | Safety in other languages is under-evaluated |
| Temporal decay | Attack techniques evolve faster than benchmarks update | Benchmarks test yesterday's attacks, not tomorrow's |
The Goodhart Problem
When a benchmark score becomes an optimization target, it stops being a reliable measure of safety. Because test sets are fixed and public, models can be tuned, deliberately or through training-data contamination, to pass the specific prompts a benchmark contains without any genuine safety improvement. This is why static benchmark scores should be treated as a floor for due diligence, not as evidence that a model is safe.
Choosing Evaluation Approaches
Goal: Assess whether a model is safe enough to deploy.
| Approach | Coverage | Cost | Recommended |
|---|---|---|---|
| Standardized benchmarks (HarmBench, TrustLLM) | Broad, known categories | Low | Always -- establishes baseline |
| Custom domain-specific tests | Deployment-specific risks | Medium | Always -- covers what benchmarks miss |
| Manual expert red teaming | Novel, creative attacks | High | For high-risk deployments |
| Automated red teaming (PAIR/TAP) | Broad, adaptive | Medium | For comprehensive coverage |
Goal: Detect safety regressions from model updates and prompt changes.
| Approach | Coverage | Cost | Recommended |
|---|---|---|---|
| CART pipeline with benchmark subset | Core regression detection | Low | Always -- catches regressions early |
| Automated red teaming on schedule | Ongoing vulnerability discovery | Medium | Weekly or on deployment |
| Production traffic monitoring | Real-world attack detection | Low (marginal) | Always -- detects novel attacks |
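A CART-style regression gate can be as simple as comparing the current run's attack success rate (ASR) against a stored baseline. A minimal sketch; the two-percentage-point tolerance is an illustrative default, not a standard threshold.

```python
def check_regression(baseline_asr, current_asr, tolerance=0.02):
    """Flag a safety regression when the attack success rate (ASR)
    rises more than `tolerance` above the recorded baseline.
    tolerance=0.02 (two percentage points) is illustrative only."""
    delta = current_asr - baseline_asr
    return {
        "baseline": baseline_asr,
        "current": current_asr,
        "delta": delta,
        "regressed": delta > tolerance,
    }

# Typical CI usage: block deployment when the check fires, e.g.
# if check_regression(stored_baseline, new_asr)["regressed"]:
#     raise SystemExit("Safety regression detected; blocking deployment")
```

In practice the baseline would come from the stored record of a previous evaluation run, and the gate would run on every model or prompt change.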
Goal: Understand a specific safety failure in depth.
| Approach | Coverage | Cost | Recommended |
|---|---|---|---|
| Root cause analysis | Single failure mode, deep | Low-Medium | Always after incidents |
| Regression test creation | Prevent recurrence | Low | Always -- add to CART suite |
| Adversarial probing around failure | Related vulnerability discovery | Medium | For critical failures |
Interpreting Results
Contextualizing Benchmark Scores
A model scoring 95% on HarmBench does not mean it is "95% safe." It means:
- The model refused 95% of HarmBench's specific test prompts
- The remaining 5% represent known bypass techniques
- Unknown attack categories are not measured
- The measured 5% failure rate is a lower bound on the model's true vulnerability; the 95% score is an upper bound on its safety, not a guarantee
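One concrete way to put context around a headline score is a confidence interval on the measured rate. The sketch below uses the standard Wilson score interval for a binomial proportion; the example figures assume a 95% refusal rate measured on test sets of two different sizes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion --
    here, the refusal rate measured on n benchmark prompts."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# A 95% refusal rate is far less certain on 100 prompts than on 1,000:
lo, hi = wilson_interval(95, 100)      # roughly (0.89, 0.98)
lo2, hi2 = wilson_interval(950, 1000)  # roughly (0.93, 0.96)
```

The same headline number can hide very different amounts of evidence, which is one reason raw scores without context are meaningless.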
Comparative Analysis
| Metric | Useful For | Not Useful For |
|---|---|---|
| Absolute score | Tracking improvements over time for the same model | Comparing fundamentally different models |
| Relative ranking | Understanding model positioning | Measuring absolute safety |
| Category breakdown | Identifying specific weaknesses | Overall safety judgment |
| Failure analysis | Understanding how and why failures occur | Predicting future failures |
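The category breakdown row above can be computed directly from judged results. A minimal sketch; the record shape (`"category"` plus a boolean `"safe"` field) is an assumed convention matching whatever the harness emits.

```python
from collections import defaultdict

def category_breakdown(results):
    """Per-category refusal rates from judged results.

    `results` is a list of {"category": str, "safe": bool} records
    (an assumed shape). Low-scoring categories point at specific
    weaknesses that an aggregate score would hide.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [safe, total]
    for r in results:
        counts[r["category"]][1] += 1
        if r["safe"]:
            counts[r["category"]][0] += 1
    return {cat: safe / total for cat, (safe, total) in counts.items()}
```

A model with a strong aggregate score can still show, say, a weak cybercrime category in this breakdown, which is exactly the signal that directs further probing.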
Consider a model that scores 98% on HarmBench and 95% on TrustLLM's safety category. When a stakeholder asks "Is this model safe to deploy?", the most accurate answer is that the scores establish a strong baseline against known attack categories, but they say nothing about unmeasured categories, multi-turn and agentic risks, or deployment-specific threats; deployment readiness still requires custom tests and, for high-risk deployments, expert red teaming.
References
- Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming" (2024)
- Sun et al., "TrustLLM: Trustworthiness in Large Language Models" (2024)
- Chao et al., "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models" (2024)
- MLCommons, "AI Safety Benchmarks v0.5" (2024)
Related Topics
- Building Evaluation Harnesses -- infrastructure for running evaluations
- Red Team Metrics Beyond ASR -- advanced metrics for evaluation
- Statistical Rigor in AI Red Teaming -- statistical methodology
- CART Pipelines -- continuous automated testing