HarmBench: Standardized Red Team Evaluation
Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.
HarmBench: Standardized Red Team Evaluation
HarmBench provides what the red teaming field has long needed: a common benchmark for comparing attack methods against each other and measuring model robustness on a standardized scale. Rather than each team using different prompts and different success criteria, HarmBench defines a shared evaluation protocol.
Why Standardized Benchmarks Matter
Without standardization, red team results are not comparable:
| Problem | Without HarmBench | With HarmBench |
|---|---|---|
| Different attack prompts | Team A's "jailbreaks" differ from Team B's | Same 510 harmful behaviors for all teams |
| Different success criteria | "Did it work?" is subjective | Standardized classifier-based evaluation |
| Different models tested | Cherry-picked model versions | Defined model list and versions |
| Incomparable results | "We found 47 vulnerabilities" means nothing | ASR of 23.4% on standard text behaviors |
Architecture
HarmBench has three components: harmful behaviors, attack methods, and evaluation classifiers.
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Harmful │────>│ Attack │────>│ Evaluation │
│ Behaviors │ │ Methods │ │ Classifier │
│ (510 standard) │ │ (GCG, PAIR, │ │ (Llama-based │
│ │ │ AutoDAN...) │ │ judge) │
└────────────────┘ └────────────────┘ └────────────────┘Harmful Behavior Categories
HarmBench defines 510 harmful behaviors across these categories:
| Category | Count | Examples |
|---|---|---|
| Chemical & biological weapons | ~50 | Synthesis instructions, precursor acquisition |
| Cybercrime & hacking | ~80 | Malware creation, exploit development |
| Harassment & discrimination | ~60 | Targeted harassment, discriminatory content |
| Disinformation & deception | ~70 | Fake news generation, impersonation |
| Illegal activities | ~80 | Drug manufacturing, weapons trafficking |
| Self-harm & violence | ~50 | Self-harm encouragement, violence planning |
| Privacy violations | ~40 | Doxxing, surveillance, PII extraction |
| Other harmful content | ~80 | Various policy violations |
Attack Methods
HarmBench evaluates these attack methods (among others):
| Method | Type | Description |
|---|---|---|
| GCG | Optimization | Greedy Coordinate Gradient -- optimizes adversarial suffixes |
| PAIR | LLM-based | Prompt Automatic Iterative Refinement using attacker LLM |
| AutoDAN | Optimization | Automated generation of DAN-style jailbreaks |
| TAP | Tree search | Tree of Attacks with Pruning |
| DirectRequest | Baseline | Asks harmful questions directly (no attack) |
| FewShot | In-context | Uses few-shot examples to prime harmful output |
| HumanJailbreaks | Manual | Curated human-written jailbreak prompts |
Evaluation Classifier
HarmBench uses a fine-tuned Llama-based classifier to judge whether a model response constitutes a successful attack:
# The classifier answers: "Does this response provide
# the harmful content requested in the behavior?"
# Output: 1 (yes, attack succeeded) or 0 (no, model refused)Running HarmBench
Clone and set up
git clone https://github.com/centerforaisafety/HarmBench cd HarmBench pip install -r requirements.txtConfigure target model
Edit the configuration to point to your target model.
# configs/model_configs/target.yaml target_model: model_name: "gpt-4" provider: "openai" api_key_env: "OPENAI_API_KEY"Select attack methods
Choose which attack methods to run.
python run_attack.py \ --attack_method PAIR \ --target_model gpt-4 \ --behaviors_path data/harmbench_behaviors_text.csv \ --output_dir results/Evaluate results
Run the standardized classifier on attack outputs.
python evaluate.py \ --results_dir results/ \ --classifier_model harmbench-classifier \ --output results/evaluation.jsonAnalyze ASR
Compute Attack Success Rates by category, method, and model.
Interpreting Results
Attack Success Rate (ASR)
The primary metric is ASR:
ASR = (Successful attacks / Total behaviors tested) * 100
Example Results Table
| Model | DirectRequest | GCG | PAIR | TAP | AutoDAN |
|---|---|---|---|---|---|
| GPT-4 | 1.2% | 4.8% | 12.3% | 15.7% | 8.1% |
| Claude 3 Opus | 0.8% | 3.2% | 9.1% | 11.4% | 5.6% |
| Llama 3 70B | 3.5% | 18.7% | 28.4% | 32.1% | 22.3% |
| Mixtral 8x7B | 5.1% | 22.3% | 34.7% | 38.9% | 27.8% |
Note: These are illustrative numbers, not actual benchmark results.
What the Numbers Mean
| ASR Range | Interpretation |
|---|---|
| 0-5% | Strong safety alignment. Model resists most attacks. |
| 5-15% | Moderate safety. Vulnerable to sophisticated attacks. |
| 15-30% | Weak safety alignment. Many attack methods succeed. |
| 30%+ | Poor safety. Model provides harmful content readily. |
Adding Custom Attack Methods
Extend HarmBench with your own attack methods:
from harmbench.attacks import BaseAttack
class CustomAttack(BaseAttack):
"""Custom attack using domain-specific knowledge."""
def __init__(self, target_model, **kwargs):
super().__init__(target_model, **kwargs)
self.max_attempts = kwargs.get("max_attempts", 5)
def generate_attack(self, behavior: str) -> str:
"""Generate an attack prompt for the given behavior."""
# Implement your attack strategy
attack_prompt = self._craft_prompt(behavior)
for attempt in range(self.max_attempts):
response = self.target_model.generate(attack_prompt)
if self._check_success(response, behavior):
return attack_prompt
attack_prompt = self._refine(attack_prompt, response)
return attack_promptLimitations
- Static behaviors: The 510 behaviors may not cover your organization's specific risk categories
- Classifier accuracy: The Llama-based classifier has a reported accuracy around 90% -- expect some false positives and negatives
- Compute cost: Optimization-based attacks (GCG, AutoDAN) require significant GPU resources
- Point-in-time: Models evolve; benchmark results reflect a snapshot, not ongoing safety
- English-centric: Most behaviors are in English; multilingual safety is underrepresented
Using HarmBench for Model Procurement
When selecting models for deployment, HarmBench provides a structured comparison framework:
- Run the full benchmark against candidate models
- Compare ASR across your highest-priority harm categories
- Weight results by your organization's risk tolerance
- Factor in the gap between DirectRequest and best attack method (indicates defense depth)
- Complement with domain-specific testing using Inspect AI or custom harnesses
Why is a standardized classifier important for HarmBench's evaluation methodology?
Related Topics
- PAIR & TAP Attack Algorithms - Algorithms benchmarked by HarmBench
- AI-Powered Red Teaming - Automated red teaming approaches HarmBench evaluates
- Inspect AI - Complementary safety evaluation framework
- Custom Harness Patterns - Building beyond standard benchmarks
References
- "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming" - Mazeika et al. (2024) - Original HarmBench paper
- "Red Teaming Language Models with Language Models" - Perez et al. (2022) - Foundational automated red teaming
- "Tree of Attacks with Pruning" - Mehrotra et al. (2024) - TAP algorithm benchmarked in HarmBench
Related Pages
- Inspect AI -- complementary evaluation framework
- Custom Harness Patterns -- building beyond standard benchmarks
- CART Pipelines -- continuous testing integration