Red Team Metrics Dashboard
What to measure in AI red team programs: key performance indicators, risk metrics, dashboard design, stakeholder reporting, and using data to demonstrate program value.
What gets measured gets managed -- but measuring the wrong things manages your team into dysfunction. A red team that optimizes for "number of findings" produces low-severity noise. A team that optimizes for "critical findings only" misses important patterns. The right metrics balance activity measurement (is the team doing enough?) with impact measurement (is the work making a difference?) and risk communication (what is the organization's AI security posture?).
Metric Categories
Activity Metrics (What the team is doing)
These metrics track the volume and coverage of red team work. They answer: "Is the team engaged with the right systems at the right frequency?"
| Metric | Definition | Target | Warning Sign |
|---|---|---|---|
| Assessments completed | Number of AI system assessments completed per quarter | Depends on team size; 3-6 per analyst per quarter | Consistently missing targets suggests resource issues |
| System coverage | Percentage of deployed AI systems assessed in the last 12 months | 100% of high-risk, 50%+ of medium-risk | Low coverage means unassessed systems are in production |
| Assessment depth | Hours spent per assessment, categorized by system risk level | 40-80 hours for high-risk, 16-40 for medium | Very short assessments may be superficial |
| Technique coverage | Percentage of applicable attack categories tested per assessment | 80%+ of relevant categories | Narrow testing misses entire vulnerability classes |
| Repeat assessment rate | Percentage of systems assessed more than once | 100% of high-risk systems annually | One-time assessments miss regression and new vulnerabilities |
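Most of these activity figures fall out of a simple pass over the system inventory. A minimal sketch (the `systems` schema here is an assumption for illustration, not a prescribed format):

```python
from datetime import date, timedelta

def coverage_metrics(systems: list[dict], window_days: int = 365) -> dict:
    """Compute system coverage and high-risk repeat assessment rate.

    Each system dict (illustrative schema): {"name": str,
    "risk": "high" | "medium" | "low", "assessments": [date, ...]}.
    """
    cutoff = date.today() - timedelta(days=window_days)
    high = [s for s in systems if s["risk"] == "high"]
    # A system counts as covered if it was assessed at least once in the window.
    assessed = [s for s in systems
                if any(d >= cutoff for d in s["assessments"])]
    # High-risk systems should be assessed more than once per year.
    repeat_high = [s for s in high
                   if len([d for d in s["assessments"] if d >= cutoff]) >= 2]
    return {
        "system_coverage": len(assessed) / max(len(systems), 1),
        "high_risk_repeat_rate": len(repeat_high) / max(len(high), 1),
    }
```

Feeding this from the same assessment log that drives the dashboards keeps the coverage numbers audit-ready rather than hand-maintained.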
Impact Metrics (What difference the work makes)
These metrics track whether findings lead to actual security improvements. They answer: "Is the organization getting safer?"
| Metric | Definition | Target | Warning Sign |
|---|---|---|---|
| Remediation rate | Percentage of findings remediated within SLA | 90%+ for critical, 75%+ for high | Low rates mean findings are not being acted on |
| Mean time to remediation (MTTR) | Average days from finding report to fix | <30 days critical, <90 days high | Long MTTR indicates organizational friction |
| Regression rate | Percentage of findings that reappear in subsequent assessments | <10% | High regression means fixes are superficial |
| Pre-deployment catch rate | Percentage of critical findings caught before production deployment | >80% | Low rate means vulnerabilities are reaching production |
| Finding acceptance rate | Percentage of reported findings accepted as valid by the development team | >85% | Low rate suggests calibration issues or adversarial relationship |
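MTTR and the within-SLA remediation rate can be derived from finding records with report and resolution dates. A sketch, assuming the SLA day counts from the table above and a hypothetical finding schema:

```python
from datetime import date

# SLA thresholds in days, mirroring the targets in the table above (assumed).
SLA_DAYS = {"critical": 30, "high": 90, "medium": 180}

def impact_metrics(findings: list[dict]) -> dict:
    """Compute MTTR and within-SLA remediation rate.

    Each finding dict (illustrative schema): {"severity": str,
    "reported": date, "resolved": date or None if still open}.
    """
    closed = [f for f in findings if f["resolved"] is not None]
    days_to_fix = [(f["resolved"] - f["reported"]).days for f in closed]
    within_sla = [f for f in closed
                  if (f["resolved"] - f["reported"]).days <= SLA_DAYS[f["severity"]]]
    return {
        # Open findings are excluded from MTTR but count against the SLA rate.
        "mttr_days": sum(days_to_fix) / max(len(days_to_fix), 1),
        "remediation_rate_within_sla": len(within_sla) / max(len(findings), 1),
    }
```

Note the denominator choice: counting open findings against the SLA rate prevents the metric from improving simply because slow fixes never close.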
Risk Metrics (What is the organization's posture)
These metrics communicate the overall AI security risk level. They answer: "How exposed are we?"
| Metric | Definition | Communication Target |
|---|---|---|
| Open critical findings | Count of unresolved critical AI security findings | Executive leadership, CISO |
| Risk exposure by system | Risk score per AI system based on findings and deployment context | Product leadership, ML engineering |
| Vulnerability density | Findings per system, normalized by system complexity | Engineering management |
| Time since last assessment | Days since each AI system was last assessed | Risk management, compliance |
| Safety training bypass rate | Percentage of jailbreak techniques that succeed against production models | Model providers, safety teams |
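A per-system risk score like the "HIGH (7.2/10)" figure shown later can be produced many ways; one simple approach is a severity-weighted sum of open findings scaled by deployment exposure. The weights and multipliers below are assumptions to tune for your program, not a standard:

```python
# Hypothetical severity weights and exposure multipliers; tune to your program.
SEVERITY_WEIGHT = {"critical": 4.0, "high": 2.0, "medium": 1.0, "low": 0.5}
EXPOSURE = {"internal": 1.0, "partner": 1.5, "public": 2.0}

def risk_score(open_severities: list[str], deployment: str, cap: float = 10.0) -> float:
    """Score a system from its open findings and deployment context.

    Returns a 0-10 score: weighted finding sum times exposure, capped so a
    single noisy system cannot blow past the scale.
    """
    raw = sum(SEVERITY_WEIGHT[s] for s in open_severities) * EXPOSURE[deployment]
    return min(round(raw, 1), cap)
```

Whatever formula you choose, keep it stable across quarters so the trend line, not the absolute number, carries the message.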
Dashboard Design
Executive Dashboard
Designed for CISO, CTO, and VP-level stakeholders who need a high-level risk picture in under 60 seconds.
```
┌───────────────────────────────────────────────────────────────┐
│          AI Security Posture Dashboard          Q1 2026       │
├───────────────┬───────────────┬───────────────┬───────────────┤
│  RISK LEVEL   │ OPEN CRITICAL │   SYSTEMS     │  REMEDIATION  │
│               │   FINDINGS    │   ASSESSED    │     RATE      │
│    MEDIUM     │       3       │  12/15 (80%)  │      87%      │
│   (↓ from     │   (↓ from 5   │ (↑ from 8/15) │ (↑ from 82%)  │
│    HIGH)      │    last Q)    │               │               │
├───────────────┴───────────────┴───────────────┴───────────────┤
│ TREND: Improving. 2 critical findings closed this quarter.    │
│ CONCERN: 3 new AI deployments not yet assessed.               │
│ ACTION: Schedule Q2 assessments for the new deployments.      │
└───────────────────────────────────────────────────────────────┘
```

Design principles for executive dashboards:
- Traffic light indicators (red/yellow/green) for instant comprehension
- Trend arrows showing improvement or degradation
- No more than 6 top-level metrics
- Plain language summary with one key concern and one recommended action
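The traffic lights and trend arrows are worth generating from data rather than choosing by hand, so the dashboard cannot drift from the numbers. A sketch with assumed thresholds:

```python
def traffic_light(score: float) -> str:
    """Map a 0-10 risk score to a traffic-light label (thresholds assumed)."""
    if score >= 7.0:
        return "red"
    if score >= 4.0:
        return "yellow"
    return "green"

def trend_arrow(current: float, previous: float, higher_is_better: bool = True) -> str:
    """Render an improvement/degradation arrow for a dashboard cell.

    Set higher_is_better=False for metrics like open critical findings,
    where a drop is the improvement.
    """
    if current == previous:
        return "→"
    improved = (current > previous) == higher_is_better
    return "↑" if improved else "↓"
```

For example, a remediation rate moving from 82% to 87% renders "↑", and open critical findings dropping from 5 to 3 also renders "↑" once the direction flag is set.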
Engineering Dashboard
Designed for ML engineering and product teams who need actionable detail.
```
┌──────────────────────────────────────────────────────────────┐
│ System: Customer Service Agent v2.3                          │
├──────────────────────────────────────────────────────────────┤
│ Last Assessed: 2026-02-15 (28 days ago)                      │
│ Risk Score: HIGH (7.2/10)                                    │
│ Next Assessment Due: 2026-05-15                              │
├──────────────────────────────────────────────────────────────┤
│ FINDINGS BY CATEGORY                                         │
│                                                              │
│ Prompt Injection ████████████ 4 (2 critical, 2 high)         │
│ Tool Abuse       ██████       2 (1 high, 1 medium)           │
│ Memory Poisoning ███          1 (medium)                     │
│ Safety Bypass    █████████    3 (1 critical, 2 high)         │
│ Data Leakage     ██████       2 (1 high, 1 medium)           │
├──────────────────────────────────────────────────────────────┤
│ REMEDIATION STATUS                                           │
│ Remediated: 5/12 │ In Progress: 4/12 │ Not Started: 3/12     │
├──────────────────────────────────────────────────────────────┤
│ OVERDUE: 2 findings past SLA (RT-2026-042, RT-2026-047)      │
└──────────────────────────────────────────────────────────────┘
```

Operational Dashboard
Designed for the red team itself to track workload, coverage, and efficiency.
| Panel | Content | Update Frequency |
|---|---|---|
| Assessment pipeline | Upcoming, in-progress, and completed assessments | Real-time |
| Finding backlog | Total open findings by severity and age | Daily |
| Team utilization | Assessment hours per analyst, availability | Weekly |
| Technique effectiveness | Success rates by attack category per model | Per assessment |
| Tool performance | Automated scanning hit rates, false positive rates | Per assessment |
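The technique-effectiveness panel reduces to a per-category success rate over raw attempt logs. A minimal sketch (the attempt schema is assumed):

```python
from collections import defaultdict

def technique_effectiveness(attempts: list[dict]) -> dict:
    """Success rate per attack category from raw attempt logs.

    Each attempt (illustrative schema): {"category": str, "success": bool}.
    """
    totals: dict = defaultdict(lambda: [0, 0])  # category -> [successes, tries]
    for a in attempts:
        totals[a["category"]][1] += 1
        if a["success"]:
            totals[a["category"]][0] += 1
    return {cat: successes / tries for cat, (successes, tries) in totals.items()}
```

Computed per model and per assessment, these rates show which attack categories are losing potency against a given target, which is itself a signal that defenses are improving.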
Data Collection
Automated Collection
Build metrics collection into the assessment workflow so it does not require manual data entry.
```python
from datetime import datetime


class MetricsCollector:
    """Collect red team metrics automatically from the assessment workflow."""

    def __init__(self, storage_backend):
        self.storage = storage_backend

    def log_assessment_start(self, assessment_id: str, system_name: str, risk_level: str):
        """Log when an assessment begins."""
        self.storage.insert({
            "event": "assessment_start",
            "assessment_id": assessment_id,
            "system": system_name,
            "risk_level": risk_level,
            "timestamp": datetime.now().isoformat()
        })

    def log_finding(self, assessment_id: str, finding: dict):
        """Log a finding with structured metadata."""
        self.storage.insert({
            "event": "finding",
            "assessment_id": assessment_id,
            "finding_id": finding["id"],
            "severity": finding["severity"],
            "category": finding["category"],
            "attack_technique": finding["technique"],
            "timestamp": datetime.now().isoformat()
        })

    def log_remediation(self, finding_id: str, status: str):
        """Log remediation status changes."""
        self.storage.insert({
            "event": "remediation_update",
            "finding_id": finding_id,
            "status": status,
            "timestamp": datetime.now().isoformat()
        })

    def compute_metrics(self, period_start: str, period_end: str) -> dict:
        """Compute metrics for a given time period."""
        events = self.storage.query(period_start, period_end)
        assessments = [e for e in events if e["event"] == "assessment_start"]
        findings = [e for e in events if e["event"] == "finding"]
        remediations = [e for e in events if e["event"] == "remediation_update"]
        return {
            "assessments_completed": len(assessments),
            "total_findings": len(findings),
            "findings_by_severity": self._group_by(findings, "severity"),
            "findings_by_category": self._group_by(findings, "category"),
            "remediation_rate": self._compute_remediation_rate(findings, remediations),
            "period": {"start": period_start, "end": period_end}
        }

    def _group_by(self, items: list, key: str) -> dict:
        groups: dict = {}
        for item in items:
            val = item.get(key, "unknown")
            groups[val] = groups.get(val, 0) + 1
        return groups

    def _compute_remediation_rate(self, findings: list, remediations: list) -> float:
        # Intersect with the period's findings so resolutions of out-of-period
        # findings cannot push the rate above 1.0; guard division by zero.
        remediated = {r["finding_id"] for r in remediations if r["status"] == "resolved"}
        total = {f["finding_id"] for f in findings}
        return len(remediated & total) / max(len(total), 1)
```

Manual Data Points
Some metrics require manual input. Minimize these and make them easy to capture:
- Finding acceptance rate: Captured during finding review meetings
- Stakeholder satisfaction: Quarterly survey (3-5 questions max)
- Assessment quality rating: Self-assessment by the lead analyst after each engagement
Reporting Cadence
| Report | Audience | Frequency | Content |
|---|---|---|---|
| Executive risk summary | CISO, CTO, board | Quarterly | Posture score, trends, top risks, resource requests |
| Engineering status | ML engineering leads | Monthly | System-specific findings, remediation progress |
| Team operations | Red team management | Weekly | Pipeline status, workload, blockers |
| Detailed assessment report | System owner | Per assessment | Full findings, evidence, remediation guidance |
Metrics Pitfalls
| Pitfall | Description | Prevention |
|---|---|---|
| Counting findings as success | More findings is not better; it may indicate superficial assessments that report noise | Weight findings by severity and impact |
| Gaming remediation SLAs | Teams close findings without truly fixing them to hit SLA targets | Verify remediations through re-testing |
| Coverage without depth | Rushing through many systems without thorough assessment | Set minimum assessment hours by risk level |
| Vanity metrics | Reporting metrics that look good but do not reflect actual risk | Focus on outcome metrics (remediation rate, regression rate) over activity metrics |
| Punishing honesty | Using metrics to evaluate the red team punitively | Frame metrics as program health indicators, not individual performance |
Demonstrating Program Value
Quarterly Business Review Format
Present program value in terms that executives understand:
- Risk reduced: "We identified and helped remediate 12 critical AI security issues that could have resulted in [specific business impact]."
- Cost avoidance: "Pre-deployment assessment caught a data leakage vulnerability in our customer-facing AI. Remediation cost: $5,000 in engineering time. Estimated cost if discovered in production: $500,000+ in incident response and customer notification."
- Compliance enablement: "Our assessment program satisfies [specific regulation] requirements for AI system security testing."
- Trend improvement: "Our safety bypass rate has decreased from 35% to 18% over the past two quarters, indicating that our safety training investments are working."
Summary
Effective red team metrics communicate risk to stakeholders, guide program investment, and demonstrate value. The right metrics balance activity measurement (are we doing enough?), impact measurement (is it making a difference?), and risk communication (what is our exposure?). Avoid metrics that incentivize quantity over quality, and design data collection into the assessment workflow to minimize overhead. The ultimate measure of a red team's success is not the number of findings but whether the organization's AI systems are getting more secure over time.