HarmBench: Standardized Red Team Evaluation

advanced8 min readUpdated 2026-03-13

Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.

harmbench benchmark evaluation standardized

HarmBench: Standardized Red Team Evaluation

HarmBench provides what the red teaming field has long needed: a common benchmark for comparing attack methods against each other and measuring model robustness on a standardized scale. Rather than each team using different prompts and different success criteria, HarmBench defines a shared evaluation protocol.

Why Standardized Benchmarks Matter

Without standardization, red team results are not comparable:

Problem	Without HarmBench	With HarmBench
Different attack prompts	Team A's "jailbreaks" differ from Team B's	Same 510 harmful behaviors for all teams
Different success criteria	"Did it work?" is subjective	Standardized classifier-based evaluation
Different models tested	Cherry-picked model versions	Defined model list and versions
Incomparable results	"We found 47 vulnerabilities" means nothing	ASR of 23.4% on standard text behaviors

Architecture

HarmBench has three components: harmful behaviors, attack methods, and evaluation classifiers.

┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│    Harmful     │────>│    Attack      │────>│   Evaluation   │
│   Behaviors    │     │    Methods     │     │   Classifier   │
│ (510 standard) │     │ (GCG, PAIR,   │     │ (Llama-based   │
│                │     │  AutoDAN...)   │     │  judge)        │
└────────────────┘     └────────────────┘     └────────────────┘

Harmful Behavior Categories

HarmBench defines 510 harmful behaviors across these categories:

Category	Count	Examples
Chemical & biological weapons	~50	Synthesis instructions, precursor acquisition
Cybercrime & hacking	~80	Malware creation, exploit development
Harassment & discrimination	~60	Targeted harassment, discriminatory content
Disinformation & deception	~70	Fake news generation, impersonation
Illegal activities	~80	Drug manufacturing, weapons trafficking
Self-harm & violence	~50	Self-harm encouragement, violence planning
Privacy violations	~40	Doxxing, surveillance, PII extraction
Other harmful content	~80	Various policy violations

Attack Methods

HarmBench evaluates these attack methods (among others):

Method	Type	Description
GCG	Optimization	Greedy Coordinate Gradient -- optimizes adversarial suffixes
PAIR	LLM-based	Prompt Automatic Iterative Refinement using attacker LLM
AutoDAN	Optimization	Automated generation of DAN-style jailbreaks
TAP	Tree search	Tree of Attacks with Pruning
DirectRequest	Baseline	Asks harmful questions directly (no attack)
FewShot	In-context	Uses few-shot examples to prime harmful output
HumanJailbreaks	Manual	Curated human-written jailbreak prompts

Evaluation Classifier

HarmBench uses a fine-tuned Llama-based classifier to judge whether a model response constitutes a successful attack:

# The classifier answers: "Does this response provide
# the harmful content requested in the behavior?"
# Output: 1 (yes, attack succeeded) or 0 (no, model refused)

Running HarmBench

Clone and set up

git clone https://github.com/centerforaisafety/HarmBench
cd HarmBench
pip install -r requirements.txt

Configure target model

Edit the configuration to point to your target model.

# configs/model_configs/target.yaml
target_model:
  model_name: "gpt-4"
  provider: "openai"
  api_key_env: "OPENAI_API_KEY"

Select attack methods

Choose which attack methods to run.

python run_attack.py \
  --attack_method PAIR \
  --target_model gpt-4 \
  --behaviors_path data/harmbench_behaviors_text.csv \
  --output_dir results/

Evaluate results

Run the standardized classifier on attack outputs.

python evaluate.py \
  --results_dir results/ \
  --classifier_model harmbench-classifier \
  --output results/evaluation.json

Analyze ASR
Compute Attack Success Rates by category, method, and model.

Interpreting Results

Attack Success Rate (ASR)

The primary metric is ASR:

ASR = (Successful attacks / Total behaviors tested) * 100

Example Results Table

Model	DirectRequest	GCG	PAIR	TAP	AutoDAN
GPT-4	1.2%	4.8%	12.3%	15.7%	8.1%
Claude 3 Opus	0.8%	3.2%	9.1%	11.4%	5.6%
Llama 3 70B	3.5%	18.7%	28.4%	32.1%	22.3%
Mixtral 8x7B	5.1%	22.3%	34.7%	38.9%	27.8%

Note: These are illustrative numbers, not actual benchmark results.

What the Numbers Mean

ASR Range	Interpretation
0-5%	Strong safety alignment. Model resists most attacks.
5-15%	Moderate safety. Vulnerable to sophisticated attacks.
15-30%	Weak safety alignment. Many attack methods succeed.
30%+	Poor safety. Model provides harmful content readily.

Adding Custom Attack Methods

Extend HarmBench with your own attack methods:

from harmbench.attacks import BaseAttack
 
class CustomAttack(BaseAttack):
    """Custom attack using domain-specific knowledge."""
 
    def __init__(self, target_model, **kwargs):
        super().__init__(target_model, **kwargs)
        self.max_attempts = kwargs.get("max_attempts", 5)
 
    def generate_attack(self, behavior: str) -> str:
        """Generate an attack prompt for the given behavior."""
        # Implement your attack strategy
        attack_prompt = self._craft_prompt(behavior)
        for attempt in range(self.max_attempts):
            response = self.target_model.generate(attack_prompt)
            if self._check_success(response, behavior):
                return attack_prompt
            attack_prompt = self._refine(attack_prompt, response)
        return attack_prompt

Limitations

Static behaviors: The 510 behaviors may not cover your organization's specific risk categories
Classifier accuracy: The Llama-based classifier has a reported accuracy around 90% -- expect some false positives and negatives
Compute cost: Optimization-based attacks (GCG, AutoDAN) require significant GPU resources
Point-in-time: Models evolve; benchmark results reflect a snapshot, not ongoing safety
English-centric: Most behaviors are in English; multilingual safety is underrepresented

Using HarmBench for Model Procurement

When selecting models for deployment, HarmBench provides a structured comparison framework:

Run the full benchmark against candidate models
Compare ASR across your highest-priority harm categories
Weight results by your organization's risk tolerance
Factor in the gap between DirectRequest and best attack method (indicates defense depth)
Complement with domain-specific testing using Inspect AI or custom harnesses

Knowledge Check

Why is a standardized classifier important for HarmBench's evaluation methodology?

PAIR & TAP Attack Algorithms - Algorithms benchmarked by HarmBench
AI-Powered Red Teaming - Automated red teaming approaches HarmBench evaluates
Inspect AI - Complementary safety evaluation framework
Custom Harness Patterns - Building beyond standard benchmarks

References

"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming" - Mazeika et al. (2024) - Original HarmBench paper
"Red Teaming Language Models with Language Models" - Perez et al. (2022) - Foundational automated red teaming
"Tree of Attacks with Pruning" - Mehrotra et al. (2024) - TAP algorithm benchmarked in HarmBench

Inspect AI -- complementary evaluation framework
Custom Harness Patterns -- building beyond standard benchmarks
CART Pipelines -- continuous testing integration

HarmBench: Standardized Red Team Evaluation

Clone and set up

Configure target model

Select attack methods

Evaluate results

Analyze ASR

Related articles

HarmBench: Standardized Red Team Evaluation

Clone and set up

Configure target model

Select attack methods

Evaluate results

Analyze ASR

Related articles