Red Team Metrics Beyond ASR
Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.
Attack Success Rate (ASR) is the most commonly reported metric in AI red teaming, but it tells a dangerously incomplete story. An ASR of 5% could mean five low-severity prompt leaks or five critical data exfiltrations. Mature red team programs need a richer metrics framework.
The Metrics Framework
Tier 1: Attack Outcome Metrics
| Metric | Definition | Formula | Use Case |
|---|---|---|---|
| ASR (raw) | Percentage of attacks that succeed | Successful attacks / Total attempts | Baseline comparison |
| Weighted ASR | ASR weighted by attack difficulty | Sum(success * difficulty_weight) / Sum(difficulty_weight) | Normalizes for test difficulty |
| Category ASR | ASR broken down by attack category | Per-category successful / per-category total | Identifies weak categories |
| Novel ASR | ASR for previously unseen attack patterns | Novel successes / Novel attempts | Measures robustness to new attacks |
Tier 2: Severity Metrics
| Metric | Definition | Scoring |
|---|---|---|
| Risk-Weighted Score | Aggregate score factoring in severity | Sum(severity_weight * success) |
| Critical Failure Rate | Rate of highest-severity successful attacks | Critical successes / Total attempts |
| Harm Potential Index | Estimated real-world harm if successful attack were deployed | Qualitative scale: Low / Medium / High / Critical |
| Blast Radius | Number of users or systems potentially affected by a successful attack | Estimated user count or system scope |
Severity Weighting Scale
Severity Level Weight Description
─────────────────────────────────────────────────────────
Critical 10 Data exfiltration, code execution,
full safety bypass
High 7 Detailed harmful content generation,
PII disclosure, tool abuse
Medium 4 Partial safety bypass, system prompt
leakage, mild policy violations
Low 2 Minor information disclosure,
inconsistent refusal behavior
Informational 1 Behavioral observations, no direct
security impact
Defense Depth Metrics
Measuring Layered Defenses
Most AI systems employ multiple defense layers. Metrics should capture how deep attacks penetrate, not just whether they ultimately succeed.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Layer Penetration Depth | How many defense layers an attack bypasses before being stopped | A system that stops attacks at layer 1 is more robust than one that relies on layer 3 |
| Defense Bypass Rate (per layer) | Success rate against each individual defense layer | Identifies the weakest link in the defense chain |
| Single Point of Failure Index | Whether removing any single defense layer would cause a critical failure | Highlights architectural fragility |
| Recovery Rate | How often the system self-corrects after a partial bypass | Measures defense-in-depth resilience |
Defense Layer Model
Attack → [Input Filter] → [System Prompt] → [Model Safety] → [Output Filter] → Response
↓ ↓ ↓ ↓
Layer 1 bypass Layer 2 bypass Layer 3 bypass Layer 4 bypass
↓ ↓ ↓ ↓
Track rate Track rate Track rate Track rate
Coverage Metrics
Attack Surface Coverage
| Dimension | How to Measure | Target |
|---|---|---|
| Category coverage | Percentage of OWASP LLM Top 10 categories tested | 100% for comprehensive assessments |
| Technique coverage | Percentage of known attack techniques attempted | 80%+ for thorough engagements |
| Modality coverage | Input modalities tested (text, image, audio, file) | All supported modalities |
| Language coverage | Languages tested for safety consistency | Top 10 user languages minimum |
| Context coverage | Single-turn, multi-turn, system prompt variations tested | All relevant interaction patterns |
Coverage Gap Analysis
# Coverage tracking structure
coverage = {
"categories_tested": 9,
"categories_total": 10, # OWASP LLM Top 10
"coverage_pct": 90,
"gaps": ["LLM10: Unbounded Consumption"],
"techniques_by_category": {
"LLM01_Prompt_Injection": {
"tested": ["direct", "indirect", "stored", "multi-turn"],
"not_tested": ["multi-language", "encoding_based"],
"coverage_pct": 67
}
}
}Temporal Metrics
Tracking Over Time
| Metric | Description | Alert Threshold |
|---|---|---|
| ASR trend | ASR over successive evaluations | Increase > 5% from baseline |
| Regression count | Previously fixed vulnerabilities that reappear | Any regression on Critical/High findings |
| Mean time to fix | Average time from finding to remediation | Varies by severity |
| Defense improvement rate | Rate at which defense layers improve | Should trend positive |
| New vulnerability discovery rate | Novel findings per evaluation cycle | Decreasing trend expected |
Regression Detection
Maintain a regression test suite
Every confirmed finding should generate a regression test prompt. Add it to the automated evaluation suite.
Run regression tests on every change
Model updates, system prompt changes, and guardrail modifications should trigger regression testing.
Alert on regressions
Any previously fixed finding that reappears is a regression. These should block deployment.
Track regression rate
A high regression rate indicates systemic issues in the remediation process, not just individual fixes.
Stakeholder-Appropriate Reporting
Metrics by Audience
| Audience | Metrics That Matter | Presentation Format |
|---|---|---|
| Engineering team | Category ASR, per-layer bypass rates, specific failure examples, regression list | Detailed tables, code examples, reproduction steps |
| Security leadership | Risk-weighted score, critical failure rate, coverage gaps, trend over time | Dashboard with trend lines, heat maps |
| Executive / board | Overall risk posture (Red/Yellow/Green), comparison to industry benchmarks, business impact | One-page summary, 3-5 key numbers |
| Compliance / legal | Framework coverage (OWASP, NIST), regulatory requirement mapping, audit trail | Compliance matrix, evidence packages |
The Executive Dashboard
┌─────────────────────────────────────────────────────┐
│ AI Security Posture — March 2026 │
│ │
│ Overall Risk: ██████░░░░ MEDIUM │
│ │
│ Critical Findings: 2 (↓ from 5 last quarter) │
│ Attack Success Rate: 8.3% (↓ from 12.1%) │
│ Coverage Score: 87% (↑ from 72%) │
│ Regression Rate: 3% (↓ from 11%) │
│ │
│ Top Risk: Agent tool abuse (3 critical findings) │
│ Biggest Improvement: Prompt injection (ASR 4%→1%) │
└─────────────────────────────────────────────────────┘
Implementing the Framework
Metric Collection Checklist
| Data Point | When to Collect | Storage |
|---|---|---|
| Attack prompt and response | Every evaluation attempt | Result store |
| Judge verdict and confidence | Every evaluation attempt | Result store |
| Severity classification | Every successful attack | Finding database |
| Defense layer penetration | When instrumented | Per-attempt metadata |
| Category and technique tags | Every evaluation attempt | Prompt metadata |
| Model and config version | Every evaluation run | Run metadata |
Related Topics
- AI Safety Benchmarks & Evaluation -- benchmark selection and evaluation methodology
- Building Evaluation Harnesses -- infrastructure for collecting metrics
- Statistical Rigor in AI Red Teaming -- statistical foundations
- Executive Summary Writing -- communicating metrics to leadership
References
- "Beyond Attack Success Rate: A Comprehensive Evaluation Framework for AI Red Teaming" - Anthropic (2024) - Research on multi-dimensional metrics for AI safety evaluation
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability categories used as coverage dimensions in metrics frameworks
- "NIST AI Risk Management Framework (AI RMF 1.0)" - National Institute of Standards and Technology (2023) - Risk measurement functions (Measure subcategories) that inform red team metrics design
- "MLCommons AI Safety Benchmark v0.5" - MLCommons (2024) - Standardized scoring methodology for AI safety evaluations
A red team evaluation finds an ASR of 3% with two critical findings (data exfiltration via tool abuse) and an ASR of 18% in another engagement with all low-severity findings (inconsistent refusal messages). Which engagement represents higher risk?