Red Team Metrics Beyond ASR

advanced9 min readUpdated 2026-03-13

Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.

metrics methodology evaluation reporting

Attack Success Rate (ASR) is the most commonly reported metric in AI red teaming, but it tells a dangerously incomplete story. An ASR of 5% could mean five low-severity prompt leaks or five critical data exfiltrations. Mature red team programs need a richer metrics framework.

The Metrics Framework

Tier 1: Attack Outcome Metrics

Metric	Definition	Formula	Use Case
ASR (raw)	Percentage of attacks that succeed	Successful attacks / Total attempts	Baseline comparison
Weighted ASR	ASR weighted by attack difficulty	Sum(success * difficulty_weight) / Sum(difficulty_weight)	Normalizes for test difficulty
Category ASR	ASR broken down by attack category	Per-category successful / per-category total	Identifies weak categories
Novel ASR	ASR for previously unseen attack patterns	Novel successes / Novel attempts	Measures robustness to new attacks

Tier 2: Severity Metrics

Metric	Definition	Scoring
Risk-Weighted Score	Aggregate score factoring in severity	Sum(severity_weight * success)
Critical Failure Rate	Rate of highest-severity successful attacks	Critical successes / Total attempts
Harm Potential Index	Estimated real-world harm if successful attack were deployed	Qualitative scale: Low / Medium / High / Critical
Blast Radius	Number of users or systems potentially affected by a successful attack	Estimated user count or system scope

Severity Weighting Scale

Severity Level    Weight    Description
─────────────────────────────────────────────────────────
Critical          10        Data exfiltration, code execution,
                            full safety bypass
High               7        Detailed harmful content generation,
                            PII disclosure, tool abuse
Medium             4        Partial safety bypass, system prompt
                            leakage, mild policy violations
Low                2        Minor information disclosure,
                            inconsistent refusal behavior
Informational      1        Behavioral observations, no direct
                            security impact

Defense Depth Metrics

Measuring Layered Defenses

Most AI systems employ multiple defense layers. Metrics should capture how deep attacks penetrate, not just whether they ultimately succeed.

Metric	What It Measures	Why It Matters
Layer Penetration Depth	How many defense layers an attack bypasses before being stopped	A system that stops attacks at layer 1 is more robust than one that relies on layer 3
Defense Bypass Rate (per layer)	Success rate against each individual defense layer	Identifies the weakest link in the defense chain
Single Point of Failure Index	Whether removing any single defense layer would cause a critical failure	Highlights architectural fragility
Recovery Rate	How often the system self-corrects after a partial bypass	Measures defense-in-depth resilience

Defense Layer Model

Attack → [Input Filter] → [System Prompt] → [Model Safety] → [Output Filter] → Response
              ↓                  ↓                 ↓                ↓
         Layer 1 bypass    Layer 2 bypass    Layer 3 bypass    Layer 4 bypass
              ↓                  ↓                 ↓                ↓
         Track rate        Track rate        Track rate        Track rate

Coverage Metrics

Attack Surface Coverage

Dimension	How to Measure	Target
Category coverage	Percentage of OWASP LLM Top 10 categories tested	100% for comprehensive assessments
Technique coverage	Percentage of known attack techniques attempted	80%+ for thorough engagements
Modality coverage	Input modalities tested (text, image, audio, file)	All supported modalities
Language coverage	Languages tested for safety consistency	Top 10 user languages minimum
Context coverage	Single-turn, multi-turn, system prompt variations tested	All relevant interaction patterns

Coverage Gap Analysis

# Coverage tracking structure
coverage = {
    "categories_tested": 9,
    "categories_total": 10,  # OWASP LLM Top 10
    "coverage_pct": 90,
    "gaps": ["LLM10: Unbounded Consumption"],
    "techniques_by_category": {
        "LLM01_Prompt_Injection": {
            "tested": ["direct", "indirect", "stored", "multi-turn"],
            "not_tested": ["multi-language", "encoding_based"],
            "coverage_pct": 67
        }
    }
}

Temporal Metrics

Tracking Over Time

Metric	Description	Alert Threshold
ASR trend	ASR over successive evaluations	Increase > 5% from baseline
Regression count	Previously fixed vulnerabilities that reappear	Any regression on Critical/High findings
Mean time to fix	Average time from finding to remediation	Varies by severity
Defense improvement rate	Rate at which defense layers improve	Should trend positive
New vulnerability discovery rate	Novel findings per evaluation cycle	Decreasing trend expected

Regression Detection

Maintain a regression test suite
Every confirmed finding should generate a regression test prompt. Add it to the automated evaluation suite.
Run regression tests on every change
Model updates, system prompt changes, and guardrail modifications should trigger regression testing.
Alert on regressions
Any previously fixed finding that reappears is a regression. These should block deployment.
Track regression rate
A high regression rate indicates systemic issues in the remediation process, not just individual fixes.

Stakeholder-Appropriate Reporting

Metrics by Audience

Audience	Metrics That Matter	Presentation Format
Engineering team	Category ASR, per-layer bypass rates, specific failure examples, regression list	Detailed tables, code examples, reproduction steps
Security leadership	Risk-weighted score, critical failure rate, coverage gaps, trend over time	Dashboard with trend lines, heat maps
Executive / board	Overall risk posture (Red/Yellow/Green), comparison to industry benchmarks, business impact	One-page summary, 3-5 key numbers
Compliance / legal	Framework coverage (OWASP, NIST), regulatory requirement mapping, audit trail	Compliance matrix, evidence packages

The Executive Dashboard

┌─────────────────────────────────────────────────────┐
│  AI Security Posture — March 2026                    │
│                                                     │
│  Overall Risk:  ██████░░░░  MEDIUM                  │
│                                                     │
│  Critical Findings:  2 (↓ from 5 last quarter)      │
│  Attack Success Rate: 8.3% (↓ from 12.1%)          │
│  Coverage Score:      87% (↑ from 72%)              │
│  Regression Rate:     3% (↓ from 11%)               │
│                                                     │
│  Top Risk: Agent tool abuse (3 critical findings)   │
│  Biggest Improvement: Prompt injection (ASR 4%→1%)  │
└─────────────────────────────────────────────────────┘

Implementing the Framework

Metric Collection Checklist

Data Point	When to Collect	Storage
Attack prompt and response	Every evaluation attempt	Result store
Judge verdict and confidence	Every evaluation attempt	Result store
Severity classification	Every successful attack	Finding database
Defense layer penetration	When instrumented	Per-attempt metadata
Category and technique tags	Every evaluation attempt	Prompt metadata
Model and config version	Every evaluation run	Run metadata

AI Safety Benchmarks & Evaluation -- benchmark selection and evaluation methodology
Building Evaluation Harnesses -- infrastructure for collecting metrics
Statistical Rigor in AI Red Teaming -- statistical foundations
Executive Summary Writing -- communicating metrics to leadership

References

"Beyond Attack Success Rate: A Comprehensive Evaluation Framework for AI Red Teaming" - Anthropic (2024) - Research on multi-dimensional metrics for AI safety evaluation
"OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability categories used as coverage dimensions in metrics frameworks
"NIST AI Risk Management Framework (AI RMF 1.0)" - National Institute of Standards and Technology (2023) - Risk measurement functions (Measure subcategories) that inform red team metrics design
"MLCommons AI Safety Benchmark v0.5" - MLCommons (2024) - Standardized scoring methodology for AI safety evaluations

Knowledge Check

A red team evaluation finds an ASR of 3% with two critical findings (data exfiltration via tool abuse) and an ASR of 18% in another engagement with all low-severity findings (inconsistent refusal messages). Which engagement represents higher risk?

Red Team Metrics Beyond ASR

Maintain a regression test suite

Run regression tests on every change

Alert on regressions

Track regression rate

Related articles

Red Team Metrics Beyond ASR

Maintain a regression test suite

Run regression tests on every change

Alert on regressions

Track regression rate

Related articles