Benchmarking Defense Effectiveness

advanced13 min readUpdated 2026-03-15

Advanced methodology for systematically evaluating and benchmarking the effectiveness of AI defenses, including guardrail testing frameworks, attack success rate measurement, statistical rigor in defense evaluation, and comparative analysis across defense configurations.

benchmarking defense-evaluation metrics guardrails statistical-testing

Measuring whether a defense actually works is harder than building the defense itself. Defense benchmarking applies scientific methodology to the question "does this defense make us more secure?" Without rigorous benchmarking, organizations make security decisions based on vendor claims, intuition, or cherry-picked examples — none of which reliably predict real-world performance.

Benchmarking Framework Design

Core Metrics

Metric	Definition	Why It Matters
Attack Success Rate (ASR)	Fraction of attacks that bypass the defense	Primary effectiveness measure
False Positive Rate (FPR)	Fraction of legitimate inputs incorrectly blocked	Usability impact measure
Detection Latency	Time added by the defense to the request pipeline	Performance impact measure
True Positive Rate (TPR)	Fraction of attacks correctly detected/blocked	Complementary to ASR (TPR = 1 - ASR)
Coverage	Fraction of attack categories addressed by the defense	Breadth of protection
Robustness	How ASR changes under adaptive attacks	Resistance to defense-aware attackers

Benchmarking Architecture

┌────────────────────────────────────────────────────────┐
│                 Benchmarking Pipeline                    │
│                                                          │
│  ┌──────────────┐   ┌───────────────┐   ┌────────────┐ │
│  │  Attack       │   │   Defense      │   │  Evaluation │ │
│  │  Dataset      │──▶│   Under Test   │──▶│  Engine     │ │
│  │  (labeled)    │   │   (DUT)        │   │             │ │
│  └──────────────┘   └───────────────┘   └──────┬─────┘ │
│                                                  │       │
│  ┌──────────────┐                        ┌──────▼─────┐ │
│  │  Benign       │   ┌───────────────┐   │  Metrics    │ │
│  │  Dataset      │──▶│   Defense      │──▶│  Calculator │ │
│  │  (legitimate) │   │   Under Test   │   │             │ │
│  └──────────────┘   └───────────────┘   └──────┬─────┘ │
│                                                  │       │
│                                          ┌──────▼─────┐ │
│                                          │  Report     │ │
│                                          │  Generator  │ │
│                                          └────────────┘ │
└────────────────────────────────────────────────────────┘

Attack Dataset Construction

Dataset Requirements

A benchmarking attack dataset must be:

DATASET_REQUIREMENTS = {
    "labeled": "Each sample has a ground-truth classification (attack vs. benign)",
    "categorized": "Attacks are tagged by category (injection, jailbreak, extraction, etc.)",
    "diverse": "Covers multiple attack techniques, styles, and complexity levels",
    "representative": "Reflects real-world attack distribution for the deployment context",
    "versioned": "Dataset version is tracked so results are reproducible",
    "balanced": "Contains sufficient samples per category for statistical significance",
}
 
# Minimum samples per attack category for statistical significance
MIN_SAMPLES_PER_CATEGORY = 100  # For 95% CI with ±5% margin
 
ATTACK_CATEGORIES = {
    "direct_injection": {
        "description": "Instruction override through user input",
        "subcategories": ["instruction_override", "role_switching", "constraint_bypass"],
        "min_samples": 100,
    },
    "indirect_injection": {
        "description": "Instructions embedded in external content",
        "subcategories": ["rag_injection", "tool_output_injection", "document_injection"],
        "min_samples": 100,
    },
    "jailbreak": {
        "description": "Bypassing safety training and content policies",
        "subcategories": ["persona_based", "encoding_based", "multi_turn", "hypothetical"],
        "min_samples": 150,
    },
    "extraction": {
        "description": "Extracting system prompts, training data, or PII",
        "subcategories": ["prompt_extraction", "data_extraction", "pii_extraction"],
        "min_samples": 100,
    },
    "output_manipulation": {
        "description": "Manipulating model output format or content",
        "subcategories": ["format_injection", "content_steering", "hallucination_induction"],
        "min_samples": 80,
    },
}

Building Labeled Datasets

import json
from dataclasses import dataclass
from enum import Enum
 
class AttackCategory(Enum):
    DIRECT_INJECTION = "direct_injection"
    INDIRECT_INJECTION = "indirect_injection"
    JAILBREAK = "jailbreak"
    EXTRACTION = "extraction"
    OUTPUT_MANIPULATION = "output_manipulation"
 
class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"
 
@dataclass
class BenchmarkSample:
    id: str
    input_text: str
    category: AttackCategory
    subcategory: str
    severity: Severity
    is_attack: bool
    expected_behavior: str  # "blocked" or "allowed"
    metadata: dict
 
def build_benchmark_dataset(attack_sources: list, benign_sources: list) -> list:
    """
    Construct a labeled benchmark dataset from attack and benign sources.
    """
    samples = []
 
    for source in attack_sources:
        for item in source.load():
            samples.append(BenchmarkSample(
                id=f"atk-{len(samples):05d}",
                input_text=item["text"],
                category=AttackCategory(item["category"]),
                subcategory=item["subcategory"],
                severity=Severity(item["severity"]),
                is_attack=True,
                expected_behavior="blocked",
                metadata={"source": source.name, "technique": item.get("technique")},
            ))
 
    for source in benign_sources:
        for item in source.load():
            samples.append(BenchmarkSample(
                id=f"ben-{len(samples):05d}",
                input_text=item["text"],
                category=None,
                subcategory=item.get("topic", "general"),
                severity=None,
                is_attack=False,
                expected_behavior="allowed",
                metadata={"source": source.name},
            ))
 
    return samples

Running Benchmarks

Execution Engine

from dataclasses import dataclass
from typing import Optional
import time
 
@dataclass
class BenchmarkResult:
    sample_id: str
    is_attack: bool
    defense_decision: str  # "blocked" or "allowed"
    correct: bool
    latency_ms: float
    confidence: Optional[float]
    defense_output: dict
 
class DefenseBenchmarkRunner:
    """Execute benchmarks against a defense system and collect metrics."""
 
    def __init__(self, defense_under_test, model_endpoint):
        self.defense = defense_under_test
        self.model = model_endpoint
        self.results = []
 
    def run_benchmark(self, dataset: list[BenchmarkSample]) -> list[BenchmarkResult]:
        """Run the full benchmark dataset against the defense."""
        results = []
 
        for sample in dataset:
            start_time = time.perf_counter()
 
            # Pass input through the defense
            defense_output = self.defense.evaluate(sample.input_text)
            decision = "blocked" if defense_output["blocked"] else "allowed"
 
            elapsed_ms = (time.perf_counter() - start_time) * 1000
 
            # Determine correctness
            if sample.is_attack:
                correct = (decision == "blocked")
            else:
                correct = (decision == "allowed")
 
            results.append(BenchmarkResult(
                sample_id=sample.id,
                is_attack=sample.is_attack,
                defense_decision=decision,
                correct=correct,
                latency_ms=elapsed_ms,
                confidence=defense_output.get("confidence"),
                defense_output=defense_output,
            ))
 
        self.results = results
        return results

Metric Calculation

import numpy as np
from scipy import stats
 
def calculate_benchmark_metrics(results: list[BenchmarkResult]) -> dict:
    """Calculate comprehensive benchmarking metrics from results."""
 
    attack_results = [r for r in results if r.is_attack]
    benign_results = [r for r in results if not r.is_attack]
 
    # Core metrics
    attacks_blocked = sum(1 for r in attack_results if r.defense_decision == "blocked")
    attacks_total = len(attack_results)
    benign_blocked = sum(1 for r in benign_results if r.defense_decision == "blocked")
    benign_total = len(benign_results)
 
    asr = 1.0 - (attacks_blocked / attacks_total) if attacks_total > 0 else None
    fpr = benign_blocked / benign_total if benign_total > 0 else None
    tpr = attacks_blocked / attacks_total if attacks_total > 0 else None
 
    # Confidence intervals (Wilson score interval)
    asr_ci = wilson_confidence_interval(attacks_total - attacks_blocked, attacks_total)
    fpr_ci = wilson_confidence_interval(benign_blocked, benign_total)
 
    # Latency metrics
    all_latencies = [r.latency_ms for r in results]
 
    return {
        "attack_success_rate": round(asr, 4) if asr is not None else None,
        "asr_95_ci": asr_ci,
        "false_positive_rate": round(fpr, 4) if fpr is not None else None,
        "fpr_95_ci": fpr_ci,
        "true_positive_rate": round(tpr, 4) if tpr is not None else None,
        "total_attacks": attacks_total,
        "total_benign": benign_total,
        "attacks_blocked": attacks_blocked,
        "benign_blocked": benign_blocked,
        "latency_p50_ms": round(np.percentile(all_latencies, 50), 1),
        "latency_p95_ms": round(np.percentile(all_latencies, 95), 1),
        "latency_p99_ms": round(np.percentile(all_latencies, 99), 1),
    }
 
def wilson_confidence_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for binomial proportion — more accurate than normal approximation for small n."""
    if trials == 0:
        return (0.0, 1.0)
 
    p = successes / trials
    denominator = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denominator
    margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denominator
 
    return (round(max(0, center - margin), 4), round(min(1, center + margin), 4))

Per-Category Analysis

Category Breakdown

Aggregate metrics hide important variations across attack categories:

def per_category_analysis(
    results: list[BenchmarkResult],
    dataset: list[BenchmarkSample],
) -> dict:
    """Break down benchmark results by attack category."""
 
    # Map sample IDs to categories
    sample_categories = {s.id: s.category for s in dataset if s.is_attack}
 
    category_results = {}
 
    for result in results:
        if not result.is_attack:
            continue
 
        category = sample_categories.get(result.sample_id)
        if category is None:
            continue
 
        cat_name = category.value
        if cat_name not in category_results:
            category_results[cat_name] = {"blocked": 0, "total": 0, "latencies": []}
 
        category_results[cat_name]["total"] += 1
        if result.defense_decision == "blocked":
            category_results[cat_name]["blocked"] += 1
        category_results[cat_name]["latencies"].append(result.latency_ms)
 
    # Calculate per-category metrics
    analysis = {}
    for cat_name, data in category_results.items():
        asr = 1.0 - (data["blocked"] / data["total"])
        analysis[cat_name] = {
            "attack_success_rate": round(asr, 4),
            "samples_tested": data["total"],
            "blocked": data["blocked"],
            "bypassed": data["total"] - data["blocked"],
            "95_ci": wilson_confidence_interval(
                data["total"] - data["blocked"], data["total"]
            ),
            "median_latency_ms": round(np.median(data["latencies"]), 1),
        }
 
    return analysis

What Per-Category Analysis Reveals	Example
Category blind spots	Defense blocks 98% of direct injection but only 45% of indirect injection
Severity correlation	High-severity attacks have 3x the bypass rate of low-severity
Latency variation	Jailbreak detection adds 200ms vs. 50ms for injection detection
Coverage gaps	No test coverage for multi-modal attack categories

Comparative Benchmarking

Comparing Defense Configurations

def compare_defenses(
    baseline_results: list[BenchmarkResult],
    candidate_results: list[BenchmarkResult],
) -> dict:
    """
    Statistically compare two defense configurations
    to determine if the candidate is significantly better.
    """
    baseline_metrics = calculate_benchmark_metrics(baseline_results)
    candidate_metrics = calculate_benchmark_metrics(candidate_results)
 
    # McNemar's test for paired comparison
    # (same dataset tested against both defenses)
    baseline_blocked = {r.sample_id: r.defense_decision == "blocked"
                        for r in baseline_results if r.is_attack}
    candidate_blocked = {r.sample_id: r.defense_decision == "blocked"
                         for r in candidate_results if r.is_attack}
 
    # Contingency: cases where defenses disagree
    both_blocked = 0
    baseline_only = 0
    candidate_only = 0
    neither = 0
 
    for sample_id in baseline_blocked:
        b = baseline_blocked[sample_id]
        c = candidate_blocked.get(sample_id, False)
 
        if b and c:
            both_blocked += 1
        elif b and not c:
            baseline_only += 1
        elif not b and c:
            candidate_only += 1
        else:
            neither += 1
 
    # McNemar's test
    if baseline_only + candidate_only > 0:
        chi2 = (abs(baseline_only - candidate_only) - 1) ** 2 / (baseline_only + candidate_only)
        p_value = 1 - stats.chi2.cdf(chi2, df=1)
    else:
        p_value = 1.0
 
    return {
        "baseline_asr": baseline_metrics["attack_success_rate"],
        "candidate_asr": candidate_metrics["attack_success_rate"],
        "asr_improvement": round(
            baseline_metrics["attack_success_rate"] - candidate_metrics["attack_success_rate"],
            4
        ),
        "baseline_fpr": baseline_metrics["false_positive_rate"],
        "candidate_fpr": candidate_metrics["false_positive_rate"],
        "mcnemar_p_value": round(p_value, 4),
        "statistically_significant": p_value < 0.05,
        "recommendation": (
            "Candidate is significantly better"
            if p_value < 0.05 and candidate_metrics["attack_success_rate"] < baseline_metrics["attack_success_rate"]
            else "No significant difference" if p_value >= 0.05
            else "Candidate is significantly worse"
        ),
    }

Adaptive Attack Testing

Adaptive Attack Protocol

## Adaptive Attack Benchmarking Protocol
 
1. **Baseline evaluation**: Run standard attack dataset against defense
2. **Failure analysis**: Identify which attacks succeeded and which were blocked
3. **Attack refinement**: Modify blocked attacks based on defense behavior
   - If defense blocks based on keywords: use synonyms and paraphrasing
   - If defense blocks based on intent: use encoding and obfuscation
   - If defense blocks based on pattern: use novel formulations
4. **Re-evaluation**: Run refined attacks against the same defense
5. **Iterate**: Repeat steps 2-4 for 3-5 rounds
6. **Report adaptive ASR**: The ASR after all refinement rounds

Metric	Static Benchmark	Adaptive Benchmark
Reflects real-world performance	Partially	More accurately
Reproducible	Highly	Moderately (depends on attacker skill)
Cost to run	Low	High (requires skilled human testers)
Automation potential	High	Low (human judgment needed)

Reporting Benchmarking Results

Benchmark Report Template

## Defense Benchmarking Report
 
### Configuration
- Defense Under Test: [Name, version, configuration]
- Benchmark Dataset: [Name, version, sample count]
- Evaluation Date: [Date]
- Evaluator: [Name/organization]
 
### Summary Results
| Metric | Value | 95% CI |
|---|---|---|
| Attack Success Rate | X.XX% | [lower, upper] |
| False Positive Rate | X.XX% | [lower, upper] |
| True Positive Rate | X.XX% | [lower, upper] |
| Median Latency | XXms | |
| P99 Latency | XXms | |
 
### Per-Category Results
[Table showing ASR broken down by attack category]
 
### Comparative Analysis
[If comparing to baseline or alternative defense]
 
### Worst-Case Analysis
- Highest ASR category: [Category] at XX%
- Most impactful bypass: [Description]
 
### Recommendations
[Based on results]

Statistical Rigor in Evaluations -- statistical methods for evaluation
Harness Building -- building evaluation infrastructure
Metrics & Methodology -- evaluation metrics design
Red Team as a Service -- service delivery that includes benchmarking
Model Cards Security -- documenting benchmark results

References

Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (2024) - Standardized benchmarking framework for LLM safety
Vidgen et al., "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models" (2024) - Minimal safety benchmark for LLMs
Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) - GCG attack methodology used in benchmarking
NIST AI 100-2e, "Adversarial Machine Learning" (2024) - Standard terminology and evaluation concepts
MLCommons AI Safety Benchmark (2024) - Industry benchmark for AI safety evaluation

Knowledge Check

Why is adaptive attack testing necessary in addition to static benchmark evaluation?

Edit this page on GitHub

Benchmarking Defense Effectiveness

advanced13 min readUpdated 2026-03-15

benchmarking defense-evaluation metrics guardrails statistical-testing

Benchmarking Framework Design

Core Metrics

Metric	Definition	Why It Matters
Attack Success Rate (ASR)	Fraction of attacks that bypass the defense	Primary effectiveness measure
False Positive Rate (FPR)	Fraction of legitimate inputs incorrectly blocked	Usability impact measure
Detection Latency	Time added by the defense to the request pipeline	Performance impact measure
True Positive Rate (TPR)	Fraction of attacks correctly detected/blocked	Complementary to ASR (TPR = 1 - ASR)
Coverage	Fraction of attack categories addressed by the defense	Breadth of protection
Robustness	How ASR changes under adaptive attacks	Resistance to defense-aware attackers

Benchmarking Architecture

┌────────────────────────────────────────────────────────┐
│                 Benchmarking Pipeline                    │
│                                                          │
│  ┌──────────────┐   ┌───────────────┐   ┌────────────┐ │
│  │  Attack       │   │   Defense      │   │  Evaluation │ │
│  │  Dataset      │──▶│   Under Test   │──▶│  Engine     │ │
│  │  (labeled)    │   │   (DUT)        │   │             │ │
│  └──────────────┘   └───────────────┘   └──────┬─────┘ │
│                                                  │       │
│  ┌──────────────┐                        ┌──────▼─────┐ │
│  │  Benign       │   ┌───────────────┐   │  Metrics    │ │
│  │  Dataset      │──▶│   Defense      │──▶│  Calculator │ │
│  │  (legitimate) │   │   Under Test   │   │             │ │
│  └──────────────┘   └───────────────┘   └──────┬─────┘ │
│                                                  │       │
│                                          ┌──────▼─────┐ │
│                                          │  Report     │ │
│                                          │  Generator  │ │
│                                          └────────────┘ │
└────────────────────────────────────────────────────────┘

Attack Dataset Construction

Dataset Requirements

A benchmarking attack dataset must be:

DATASET_REQUIREMENTS = {
    "labeled": "Each sample has a ground-truth classification (attack vs. benign)",
    "categorized": "Attacks are tagged by category (injection, jailbreak, extraction, etc.)",
    "diverse": "Covers multiple attack techniques, styles, and complexity levels",
    "representative": "Reflects real-world attack distribution for the deployment context",
    "versioned": "Dataset version is tracked so results are reproducible",
    "balanced": "Contains sufficient samples per category for statistical significance",
}
 
# Minimum samples per attack category for statistical significance
MIN_SAMPLES_PER_CATEGORY = 100  # For 95% CI with ±5% margin
 
ATTACK_CATEGORIES = {
    "direct_injection": {
        "description": "Instruction override through user input",
        "subcategories": ["instruction_override", "role_switching", "constraint_bypass"],
        "min_samples": 100,
    },
    "indirect_injection": {
        "description": "Instructions embedded in external content",
        "subcategories": ["rag_injection", "tool_output_injection", "document_injection"],
        "min_samples": 100,
    },
    "jailbreak": {
        "description": "Bypassing safety training and content policies",
        "subcategories": ["persona_based", "encoding_based", "multi_turn", "hypothetical"],
        "min_samples": 150,
    },
    "extraction": {
        "description": "Extracting system prompts, training data, or PII",
        "subcategories": ["prompt_extraction", "data_extraction", "pii_extraction"],
        "min_samples": 100,
    },
    "output_manipulation": {
        "description": "Manipulating model output format or content",
        "subcategories": ["format_injection", "content_steering", "hallucination_induction"],
        "min_samples": 80,
    },
}

Building Labeled Datasets

import json
from dataclasses import dataclass
from enum import Enum
 
class AttackCategory(Enum):
    DIRECT_INJECTION = "direct_injection"
    INDIRECT_INJECTION = "indirect_injection"
    JAILBREAK = "jailbreak"
    EXTRACTION = "extraction"
    OUTPUT_MANIPULATION = "output_manipulation"
 
class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"
 
@dataclass
class BenchmarkSample:
    id: str
    input_text: str
    category: AttackCategory
    subcategory: str
    severity: Severity
    is_attack: bool
    expected_behavior: str  # "blocked" or "allowed"
    metadata: dict
 
def build_benchmark_dataset(attack_sources: list, benign_sources: list) -> list:
    """
    Construct a labeled benchmark dataset from attack and benign sources.
    """
    samples = []
 
    for source in attack_sources:
        for item in source.load():
            samples.append(BenchmarkSample(
                id=f"atk-{len(samples):05d}",
                input_text=item["text"],
                category=AttackCategory(item["category"]),
                subcategory=item["subcategory"],
                severity=Severity(item["severity"]),
                is_attack=True,
                expected_behavior="blocked",
                metadata={"source": source.name, "technique": item.get("technique")},
            ))
 
    for source in benign_sources:
        for item in source.load():
            samples.append(BenchmarkSample(
                id=f"ben-{len(samples):05d}",
                input_text=item["text"],
                category=None,
                subcategory=item.get("topic", "general"),
                severity=None,
                is_attack=False,
                expected_behavior="allowed",
                metadata={"source": source.name},
            ))
 
    return samples

Running Benchmarks

Execution Engine

from dataclasses import dataclass
from typing import Optional
import time
 
@dataclass
class BenchmarkResult:
    sample_id: str
    is_attack: bool
    defense_decision: str  # "blocked" or "allowed"
    correct: bool
    latency_ms: float
    confidence: Optional[float]
    defense_output: dict
 
class DefenseBenchmarkRunner:
    """Execute benchmarks against a defense system and collect metrics."""
 
    def __init__(self, defense_under_test, model_endpoint):
        self.defense = defense_under_test
        self.model = model_endpoint
        self.results = []
 
    def run_benchmark(self, dataset: list[BenchmarkSample]) -> list[BenchmarkResult]:
        """Run the full benchmark dataset against the defense."""
        results = []
 
        for sample in dataset:
            start_time = time.perf_counter()
 
            # Pass input through the defense
            defense_output = self.defense.evaluate(sample.input_text)
            decision = "blocked" if defense_output["blocked"] else "allowed"
 
            elapsed_ms = (time.perf_counter() - start_time) * 1000
 
            # Determine correctness
            if sample.is_attack:
                correct = (decision == "blocked")
            else:
                correct = (decision == "allowed")
 
            results.append(BenchmarkResult(
                sample_id=sample.id,
                is_attack=sample.is_attack,
                defense_decision=decision,
                correct=correct,
                latency_ms=elapsed_ms,
                confidence=defense_output.get("confidence"),
                defense_output=defense_output,
            ))
 
        self.results = results
        return results

Metric Calculation

import numpy as np
from scipy import stats
 
def calculate_benchmark_metrics(results: list[BenchmarkResult]) -> dict:
    """Calculate comprehensive benchmarking metrics from results."""
 
    attack_results = [r for r in results if r.is_attack]
    benign_results = [r for r in results if not r.is_attack]
 
    # Core metrics
    attacks_blocked = sum(1 for r in attack_results if r.defense_decision == "blocked")
    attacks_total = len(attack_results)
    benign_blocked = sum(1 for r in benign_results if r.defense_decision == "blocked")
    benign_total = len(benign_results)
 
    asr = 1.0 - (attacks_blocked / attacks_total) if attacks_total > 0 else None
    fpr = benign_blocked / benign_total if benign_total > 0 else None
    tpr = attacks_blocked / attacks_total if attacks_total > 0 else None
 
    # Confidence intervals (Wilson score interval)
    asr_ci = wilson_confidence_interval(attacks_total - attacks_blocked, attacks_total)
    fpr_ci = wilson_confidence_interval(benign_blocked, benign_total)
 
    # Latency metrics
    all_latencies = [r.latency_ms for r in results]
 
    return {
        "attack_success_rate": round(asr, 4) if asr is not None else None,
        "asr_95_ci": asr_ci,
        "false_positive_rate": round(fpr, 4) if fpr is not None else None,
        "fpr_95_ci": fpr_ci,
        "true_positive_rate": round(tpr, 4) if tpr is not None else None,
        "total_attacks": attacks_total,
        "total_benign": benign_total,
        "attacks_blocked": attacks_blocked,
        "benign_blocked": benign_blocked,
        "latency_p50_ms": round(np.percentile(all_latencies, 50), 1),
        "latency_p95_ms": round(np.percentile(all_latencies, 95), 1),
        "latency_p99_ms": round(np.percentile(all_latencies, 99), 1),
    }
 
def wilson_confidence_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for binomial proportion — more accurate than normal approximation for small n."""
    if trials == 0:
        return (0.0, 1.0)
 
    p = successes / trials
    denominator = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denominator
    margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denominator
 
    return (round(max(0, center - margin), 4), round(min(1, center + margin), 4))

Per-Category Analysis

Category Breakdown

Aggregate metrics hide important variations across attack categories:

def per_category_analysis(
    results: list[BenchmarkResult],
    dataset: list[BenchmarkSample],
) -> dict:
    """Break down benchmark results by attack category."""
 
    # Map sample IDs to categories
    sample_categories = {s.id: s.category for s in dataset if s.is_attack}
 
    category_results = {}
 
    for result in results:
        if not result.is_attack:
            continue
 
        category = sample_categories.get(result.sample_id)
        if category is None:
            continue
 
        cat_name = category.value
        if cat_name not in category_results:
            category_results[cat_name] = {"blocked": 0, "total": 0, "latencies": []}
 
        category_results[cat_name]["total"] += 1
        if result.defense_decision == "blocked":
            category_results[cat_name]["blocked"] += 1
        category_results[cat_name]["latencies"].append(result.latency_ms)
 
    # Calculate per-category metrics
    analysis = {}
    for cat_name, data in category_results.items():
        asr = 1.0 - (data["blocked"] / data["total"])
        analysis[cat_name] = {
            "attack_success_rate": round(asr, 4),
            "samples_tested": data["total"],
            "blocked": data["blocked"],
            "bypassed": data["total"] - data["blocked"],
            "95_ci": wilson_confidence_interval(
                data["total"] - data["blocked"], data["total"]
            ),
            "median_latency_ms": round(np.median(data["latencies"]), 1),
        }
 
    return analysis

What Per-Category Analysis Reveals	Example
Category blind spots	Defense blocks 98% of direct injection but only 45% of indirect injection
Severity correlation	High-severity attacks have 3x the bypass rate of low-severity
Latency variation	Jailbreak detection adds 200ms vs. 50ms for injection detection
Coverage gaps	No test coverage for multi-modal attack categories

Comparative Benchmarking

Comparing Defense Configurations

def compare_defenses(
    baseline_results: list[BenchmarkResult],
    candidate_results: list[BenchmarkResult],
) -> dict:
    """
    Statistically compare two defense configurations
    to determine if the candidate is significantly better.
    """
    baseline_metrics = calculate_benchmark_metrics(baseline_results)
    candidate_metrics = calculate_benchmark_metrics(candidate_results)
 
    # McNemar's test for paired comparison
    # (same dataset tested against both defenses)
    baseline_blocked = {r.sample_id: r.defense_decision == "blocked"
                        for r in baseline_results if r.is_attack}
    candidate_blocked = {r.sample_id: r.defense_decision == "blocked"
                         for r in candidate_results if r.is_attack}
 
    # Contingency: cases where defenses disagree
    both_blocked = 0
    baseline_only = 0
    candidate_only = 0
    neither = 0
 
    for sample_id in baseline_blocked:
        b = baseline_blocked[sample_id]
        c = candidate_blocked.get(sample_id, False)
 
        if b and c:
            both_blocked += 1
        elif b and not c:
            baseline_only += 1
        elif not b and c:
            candidate_only += 1
        else:
            neither += 1
 
    # McNemar's test
    if baseline_only + candidate_only > 0:
        chi2 = (abs(baseline_only - candidate_only) - 1) ** 2 / (baseline_only + candidate_only)
        p_value = 1 - stats.chi2.cdf(chi2, df=1)
    else:
        p_value = 1.0
 
    return {
        "baseline_asr": baseline_metrics["attack_success_rate"],
        "candidate_asr": candidate_metrics["attack_success_rate"],
        "asr_improvement": round(
            baseline_metrics["attack_success_rate"] - candidate_metrics["attack_success_rate"],
            4
        ),
        "baseline_fpr": baseline_metrics["false_positive_rate"],
        "candidate_fpr": candidate_metrics["false_positive_rate"],
        "mcnemar_p_value": round(p_value, 4),
        "statistically_significant": p_value < 0.05,
        "recommendation": (
            "Candidate is significantly better"
            if p_value < 0.05 and candidate_metrics["attack_success_rate"] < baseline_metrics["attack_success_rate"]
            else "No significant difference" if p_value >= 0.05
            else "Candidate is significantly worse"
        ),
    }

Adaptive Attack Testing

Adaptive Attack Protocol

## Adaptive Attack Benchmarking Protocol
 
1. **Baseline evaluation**: Run standard attack dataset against defense
2. **Failure analysis**: Identify which attacks succeeded and which were blocked
3. **Attack refinement**: Modify blocked attacks based on defense behavior
   - If defense blocks based on keywords: use synonyms and paraphrasing
   - If defense blocks based on intent: use encoding and obfuscation
   - If defense blocks based on pattern: use novel formulations
4. **Re-evaluation**: Run refined attacks against the same defense
5. **Iterate**: Repeat steps 2-4 for 3-5 rounds
6. **Report adaptive ASR**: The ASR after all refinement rounds

Metric	Static Benchmark	Adaptive Benchmark
Reflects real-world performance	Partially	More accurately
Reproducible	Highly	Moderately (depends on attacker skill)
Cost to run	Low	High (requires skilled human testers)
Automation potential	High	Low (human judgment needed)

Reporting Benchmarking Results

Benchmark Report Template

## Defense Benchmarking Report
 
### Configuration
- Defense Under Test: [Name, version, configuration]
- Benchmark Dataset: [Name, version, sample count]
- Evaluation Date: [Date]
- Evaluator: [Name/organization]
 
### Summary Results
| Metric | Value | 95% CI |
|---|---|---|
| Attack Success Rate | X.XX% | [lower, upper] |
| False Positive Rate | X.XX% | [lower, upper] |
| True Positive Rate | X.XX% | [lower, upper] |
| Median Latency | XXms | |
| P99 Latency | XXms | |
 
### Per-Category Results
[Table showing ASR broken down by attack category]
 
### Comparative Analysis
[If comparing to baseline or alternative defense]
 
### Worst-Case Analysis
- Highest ASR category: [Category] at XX%
- Most impactful bypass: [Description]
 
### Recommendations
[Based on results]

Statistical Rigor in Evaluations -- statistical methods for evaluation
Harness Building -- building evaluation infrastructure
Metrics & Methodology -- evaluation metrics design
Red Team as a Service -- service delivery that includes benchmarking
Model Cards Security -- documenting benchmark results

References

Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (2024) - Standardized benchmarking framework for LLM safety
Vidgen et al., "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models" (2024) - Minimal safety benchmark for LLMs
Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) - GCG attack methodology used in benchmarking
NIST AI 100-2e, "Adversarial Machine Learning" (2024) - Standard terminology and evaluation concepts
MLCommons AI Safety Benchmark (2024) - Industry benchmark for AI safety evaluation

Knowledge Check

Why is adaptive attack testing necessary in addition to static benchmark evaluation?

Edit this page on GitHub

Benchmarking Defense Effectiveness

Related articles

Benchmarking Defense Effectiveness

Related articles