Measuring and Reporting AI Red Team Effectiveness
Walkthrough for defining, collecting, and reporting metrics that measure the effectiveness of AI red teaming programs, covering coverage metrics, detection rates, time-to-find analysis, remediation tracking, and ROI calculation.
"How do we know our AI red teaming program is working?" is the question that every CISO eventually asks. Without metrics, the answer is subjective -- "we found some things" is not compelling when competing for budget against other security initiatives. This walkthrough defines a practical metrics framework that quantifies the value of AI red teaming, tracks improvement over time, and provides the data needed for program justification and resource allocation.
Step 1: Defining the Metrics Framework
Organize metrics into four categories that answer different stakeholder questions:
# AI Red Team Metrics Framework
## Category 1: Coverage Metrics
**Question**: "Are we testing everything we should be?"
| Metric | Definition | Target |
|--------|-----------|--------|
| OWASP LLM Top 10 Coverage | % of OWASP categories tested in each engagement | 100% for standard engagements |
| Attack Surface Coverage | % of enumerated components tested | >80% for standard, >95% for comprehensive |
| Test Case Volume | Number of unique test cases per engagement | >200 for standard engagements |
| Technique Diversity | Number of distinct attack techniques used | >15 per engagement |
| Model Coverage | % of production models tested in the quarter | >90% |
## Category 2: Detection Metrics
**Question**: "Are we finding real vulnerabilities?"
| Metric | Definition | Target |
|--------|-----------|--------|
| Vulnerability Count | Total findings per engagement by severity | Track trend, not absolute number |
| Detection Rate | Findings per testing hour | Track trend |
| Novel Finding Rate | % of findings not found in previous engagements | >20% indicates evolving methodology |
| False Positive Rate | % of reported findings that were not actual vulnerabilities | <5% |
| Time to First Finding | Hours from engagement start to first Critical/High finding | <4 hours |
## Category 3: Remediation Metrics
**Question**: "Are we driving actual security improvement?"
| Metric | Definition | Target |
|--------|-----------|--------|
| Remediation Rate | % of findings remediated within SLA | >90% for Critical, >80% for High |
| Mean Time to Remediate (MTTR) | Average days from report to fix by severity | <7 days Critical, <30 days High |
| Regression Rate | % of previously fixed findings that reappear | <5% |
| Retest Pass Rate | % of remediated findings that pass verification | >95% |
| Open Finding Age | Average age of unresolved findings | Decreasing trend |
## Category 4: Program Metrics
**Question**: "Is the program worth the investment?"
| Metric | Definition | Target |
|--------|-----------|--------|
| Cost per Finding | Total program cost / number of findings | Track trend (should stabilize) |
| Security Posture Score | Average model safety score across evaluations | Improving trend |
| Engagement Frequency | Red team engagements per quarter | Per policy |
| Tool ROI | Findings from automated tools vs. manual testing | Track ratio |
| Coverage Gap Closure | % of identified gaps addressed since last quarter | >50% per quarter |

Step 2: Collecting Metrics Data
Build a data collection system that captures metrics automatically:
# metrics/collector.py
"""Collect and store AI red team metrics from engagement data."""
import sqlite3
from pathlib import Path
from dataclasses import dataclass
@dataclass
class EngagementMetrics:
engagement_id: str
client: str
start_date: str
end_date: str
engagement_type: str
total_hours: float
components_in_scope: int
components_tested: int
test_cases_executed: int
techniques_used: int
findings_critical: int
findings_high: int
findings_medium: int
findings_low: int
owasp_categories_tested: int
class MetricsCollector:
"""Collect and store engagement metrics."""
def __init__(self, db_path: str = "metrics/redteam_metrics.db"):
Path(db_path).parent.mkdir(parents=True, exist_ok=True)
self.conn = sqlite3.connect(db_path)
self._init_db()
def _init_db(self):
self.conn.executescript("""
CREATE TABLE IF NOT EXISTS engagements (
id TEXT PRIMARY KEY,
client TEXT,
start_date TEXT,
end_date TEXT,
engagement_type TEXT,
total_hours REAL,
components_in_scope INTEGER,
components_tested INTEGER,
test_cases INTEGER,
techniques_used INTEGER,
findings_critical INTEGER,
findings_high INTEGER,
findings_medium INTEGER,
findings_low INTEGER,
owasp_categories INTEGER
);
CREATE TABLE IF NOT EXISTS findings (
id TEXT PRIMARY KEY,
engagement_id TEXT,
severity TEXT,
category TEXT,
found_date TEXT,
reported_date TEXT,
remediated_date TEXT,
verified_date TEXT,
regression_count INTEGER DEFAULT 0,
FOREIGN KEY (engagement_id) REFERENCES engagements(id)
);
CREATE TABLE IF NOT EXISTS continuous_scans (
id INTEGER PRIMARY KEY AUTOINCREMENT,
scan_date TEXT,
scan_type TEXT,
total_tests INTEGER,
passed INTEGER,
failed INTEGER,
model_name TEXT,
commit_sha TEXT
);
""")
self.conn.commit()
def record_engagement(self, metrics: EngagementMetrics):
"""Record metrics from a completed engagement."""
self.conn.execute(
"INSERT OR REPLACE INTO engagements VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(
metrics.engagement_id, metrics.client, metrics.start_date,
metrics.end_date, metrics.engagement_type, metrics.total_hours,
metrics.components_in_scope, metrics.components_tested,
metrics.test_cases_executed, metrics.techniques_used,
metrics.findings_critical, metrics.findings_high,
metrics.findings_medium, metrics.findings_low,
metrics.owasp_categories_tested,
),
)
self.conn.commit()
def record_finding(self, finding_id: str, engagement_id: str,
severity: str, category: str, found_date: str):
"""Record a finding for tracking remediation."""
self.conn.execute(
"INSERT OR REPLACE INTO findings (id, engagement_id, severity, "
"category, found_date) VALUES (?,?,?,?,?)",
(finding_id, engagement_id, severity, category, found_date),
)
self.conn.commit()
def mark_remediated(self, finding_id: str, remediated_date: str):
"""Mark a finding as remediated."""
self.conn.execute(
"UPDATE findings SET remediated_date = ? WHERE id = ?",
(remediated_date, finding_id),
)
self.conn.commit()
def mark_verified(self, finding_id: str, verified_date: str):
"""Mark a remediated finding as verified."""
self.conn.execute(
"UPDATE findings SET verified_date = ? WHERE id = ?",
(verified_date, finding_id),
)
self.conn.commit()
def record_regression(self, finding_id: str):
"""Record that a previously fixed finding has regressed."""
self.conn.execute(
"UPDATE findings SET regression_count = regression_count + 1, "
"remediated_date = NULL, verified_date = NULL WHERE id = ?",
(finding_id,),
)
        self.conn.commit()

Step 3: Calculating Key Metrics
Compute each metric from the collected data:
# metrics/calculator.py
"""Calculate AI red team metrics from collected data."""
from datetime import datetime, timedelta
from metrics.collector import MetricsCollector
class MetricsCalculator:
"""Calculate metrics from the metrics database."""
def __init__(self, collector: MetricsCollector):
self.conn = collector.conn
def coverage_metrics(self, period_days: int = 90) -> dict:
"""Calculate coverage metrics for the specified period."""
cutoff = (datetime.now() - timedelta(days=period_days)).isoformat()
rows = self.conn.execute(
"SELECT AVG(CAST(components_tested AS REAL) / NULLIF(components_in_scope, 0)), "
"AVG(test_cases), AVG(techniques_used), AVG(owasp_categories) "
"FROM engagements WHERE start_date >= ?",
(cutoff,),
).fetchone()
return {
"avg_attack_surface_coverage": f"{(rows[0] or 0) * 100:.1f}%",
"avg_test_cases_per_engagement": round(rows[1] or 0),
"avg_techniques_per_engagement": round(rows[2] or 0),
"avg_owasp_categories_covered": round(rows[3] or 0),
}
def detection_metrics(self, period_days: int = 90) -> dict:
"""Calculate detection metrics."""
cutoff = (datetime.now() - timedelta(days=period_days)).isoformat()
# Total findings by severity
severity_counts = {}
for severity in ["Critical", "High", "Medium", "Low"]:
count = self.conn.execute(
"SELECT COUNT(*) FROM findings WHERE severity = ? AND found_date >= ?",
(severity, cutoff),
).fetchone()[0]
severity_counts[severity] = count
# Findings per hour
total_findings = sum(severity_counts.values())
total_hours = self.conn.execute(
"SELECT SUM(total_hours) FROM engagements WHERE start_date >= ?",
(cutoff,),
).fetchone()[0] or 1
# Novel finding rate (findings not in previous period)
prev_cutoff = (datetime.now() - timedelta(days=period_days * 2)).isoformat()
previous_categories = set(
row[0] for row in self.conn.execute(
"SELECT DISTINCT category FROM findings WHERE found_date >= ? AND found_date < ?",
(prev_cutoff, cutoff),
).fetchall()
)
current_categories = set(
row[0] for row in self.conn.execute(
"SELECT DISTINCT category FROM findings WHERE found_date >= ?",
(cutoff,),
).fetchall()
)
novel_categories = current_categories - previous_categories
novel_rate = len(novel_categories) / max(len(current_categories), 1) * 100
return {
"findings_by_severity": severity_counts,
"total_findings": total_findings,
"findings_per_hour": round(total_findings / total_hours, 2),
"novel_finding_rate": f"{novel_rate:.0f}%",
}
    def remediation_metrics(self) -> dict:
        """Calculate remediation metrics."""
        # Remediation rate by severity
        remediation_rate = {}
        for severity in ["Critical", "High"]:
            total = self.conn.execute(
                "SELECT COUNT(*) FROM findings WHERE severity = ?",
                (severity,),
            ).fetchone()[0]
            remediated = self.conn.execute(
                "SELECT COUNT(*) FROM findings WHERE severity = ? AND remediated_date IS NOT NULL",
                (severity,),
            ).fetchone()[0]
            remediation_rate[severity] = f"{remediated / max(total, 1) * 100:.0f}%"
        # MTTR calculation
        mttr_rows = self.conn.execute(
            "SELECT severity, AVG(julianday(remediated_date) - julianday(found_date)) "
            "FROM findings WHERE remediated_date IS NOT NULL "
            "GROUP BY severity",
        ).fetchall()
        mttr = {row[0]: round(row[1], 1) for row in mttr_rows}
        # Regression rate
        total_findings = self.conn.execute(
            "SELECT COUNT(*) FROM findings"
        ).fetchone()[0]
        regressed = self.conn.execute(
            "SELECT COUNT(*) FROM findings WHERE regression_count > 0"
        ).fetchone()[0]
        regression_rate = regressed / max(total_findings, 1) * 100
        # Retest pass rate
        verified = self.conn.execute(
            "SELECT COUNT(*) FROM findings WHERE verified_date IS NOT NULL"
        ).fetchone()[0]
        retested = self.conn.execute(
            "SELECT COUNT(*) FROM findings WHERE remediated_date IS NOT NULL"
        ).fetchone()[0]
        retest_pass_rate = verified / max(retested, 1) * 100
        return {
            "remediation_rate": remediation_rate,
            "mean_time_to_remediate_days": mttr,
            "regression_rate": f"{regression_rate:.1f}%",
            "retest_pass_rate": f"{retest_pass_rate:.0f}%",
        }
def security_posture_trend(self, weeks: int = 12) -> list[dict]:
"""Get the security posture score trend from continuous scans."""
rows = self.conn.execute(
"SELECT strftime('%Y-W%W', scan_date) as week, "
"AVG(CAST(passed AS REAL) / NULLIF(total_tests, 0)) "
"FROM continuous_scans "
"WHERE scan_date >= date('now', ?) "
"GROUP BY week ORDER BY week",
(f'-{weeks * 7} days',),
).fetchall()
        return [{"week": row[0], "safety_score": round(row[1] or 0, 3)} for row in rows]

Step 4: Building the Metrics Dashboard
Create a report that presents metrics to stakeholders:
# metrics/dashboard.py
"""Generate a metrics dashboard report."""
from metrics.collector import MetricsCollector
from metrics.calculator import MetricsCalculator
from datetime import datetime
def generate_dashboard(db_path: str = "metrics/redteam_metrics.db") -> str:
"""Generate a markdown metrics dashboard."""
collector = MetricsCollector(db_path)
calc = MetricsCalculator(collector)
coverage = calc.coverage_metrics()
detection = calc.detection_metrics()
remediation = calc.remediation_metrics()
trend = calc.security_posture_trend()
report = f"""# AI Red Team Program Metrics Dashboard
**Report Date**: {datetime.now().strftime('%Y-%m-%d')}
**Period**: Last 90 days
---
## Coverage Metrics
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Attack Surface Coverage | {coverage['avg_attack_surface_coverage']} | >80% | {'On Track' if float(coverage['avg_attack_surface_coverage'].rstrip('%')) > 80 else 'Below Target'} |
| Test Cases per Engagement | {coverage['avg_test_cases_per_engagement']} | >200 | {'On Track' if coverage['avg_test_cases_per_engagement'] > 200 else 'Below Target'} |
| Techniques per Engagement | {coverage['avg_techniques_per_engagement']} | >15 | {'On Track' if coverage['avg_techniques_per_engagement'] > 15 else 'Below Target'} |
| OWASP Categories Covered | {coverage['avg_owasp_categories_covered']}/10 | 10 | {'On Track' if coverage['avg_owasp_categories_covered'] >= 10 else 'Below Target'} |
## Detection Metrics
| Metric | Value |
|--------|-------|
| Total Findings | {detection['total_findings']} |
| Critical | {detection['findings_by_severity'].get('Critical', 0)} |
| High | {detection['findings_by_severity'].get('High', 0)} |
| Medium | {detection['findings_by_severity'].get('Medium', 0)} |
| Low | {detection['findings_by_severity'].get('Low', 0)} |
| Findings per Testing Hour | {detection['findings_per_hour']} |
| Novel Finding Rate | {detection['novel_finding_rate']} |
## Remediation Metrics
| Metric | Value | Target |
|--------|-------|--------|
| MTTR (Critical) | {remediation['mean_time_to_remediate_days'].get('Critical', 'N/A')} days | <7 days |
| MTTR (High) | {remediation['mean_time_to_remediate_days'].get('High', 'N/A')} days | <30 days |
| Regression Rate | {remediation['regression_rate']} | <5% |
| Retest Pass Rate | {remediation['retest_pass_rate']} | >95% |
## Security Posture Trend
"""
if trend:
report += "| Week | Safety Score |\n|------|-------------|\n"
for entry in trend:
bar_length = int(entry['safety_score'] * 20)
bar = '█' * bar_length + '░' * (20 - bar_length)
report += f"| {entry['week']} | {entry['safety_score']:.3f} {bar} |\n"
else:
report += "*No continuous scan data available for trend analysis.*\n"
return report
if __name__ == "__main__":
    print(generate_dashboard())

Step 5: Calculating Program ROI
Quantify the return on investment for the AI red teaming program:
# metrics/roi.py
"""Calculate ROI for the AI red teaming program."""
from dataclasses import dataclass
@dataclass
class ROICalculation:
# Costs
team_cost: float # Annual team salary/contractor costs
tool_cost: float # Annual tool and API costs
infrastructure_cost: float # Lab, compute, etc.
# Value (risk reduction)
critical_findings: int
high_findings: int
avg_breach_cost: float # Industry average cost of an AI security incident
estimated_breach_probability_reduction: float # 0.0-1.0
@property
def total_cost(self) -> float:
return self.team_cost + self.tool_cost + self.infrastructure_cost
@property
def estimated_risk_reduction(self) -> float:
"""Estimated annual risk reduction in dollar terms."""
# Each critical finding represents a prevented potential incident
critical_value = self.critical_findings * self.avg_breach_cost * 0.3
high_value = self.high_findings * self.avg_breach_cost * 0.1
return (critical_value + high_value) * self.estimated_breach_probability_reduction
@property
def roi_ratio(self) -> float:
return self.estimated_risk_reduction / self.total_cost if self.total_cost > 0 else 0
def generate_roi_report(self) -> str:
return f"""## Program ROI Analysis
### Investment
| Item | Annual Cost |
|------|------------|
| Team (salary/contractors) | ${self.team_cost:,.0f} |
| Tools and API costs | ${self.tool_cost:,.0f} |
| Infrastructure | ${self.infrastructure_cost:,.0f} |
| **Total Investment** | **${self.total_cost:,.0f}** |
### Value Generated
| Item | Value |
|------|-------|
| Critical findings identified | {self.critical_findings} |
| High findings identified | {self.high_findings} |
| Industry avg breach cost | ${self.avg_breach_cost:,.0f} |
| Estimated risk reduction | ${self.estimated_risk_reduction:,.0f} |
### ROI
| Metric | Value |
|--------|-------|
| ROI Ratio | {self.roi_ratio:.1f}x |
| Net Value | ${self.estimated_risk_reduction - self.total_cost:,.0f} |
*Note: ROI calculations use industry-average breach costs and estimated
probability reduction. Actual values may vary.*
"""
# Example calculation
if __name__ == "__main__":
roi = ROICalculation(
team_cost=300000,
tool_cost=25000,
infrastructure_cost=10000,
critical_findings=8,
high_findings=15,
avg_breach_cost=4450000, # IBM Cost of a Data Breach 2025
estimated_breach_probability_reduction=0.15,
)
    print(roi.generate_roi_report())

Common Pitfalls and Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Metrics show no improvement | Testing methodology hasn't changed; finding same issues | Update test suites with new techniques quarterly |
| Finding count drops after first engagement | Low-hanging fruit already found | This is expected; track novel finding rate instead of raw count |
| MTTR is misleading | Includes findings stuck in backlog | Report MTTR by severity; exclude intentionally deferred findings |
| Regression rate spikes | New deployments without CI testing | Implement continuous red teaming pipeline from the continuous testing walkthrough |
| Stakeholders fixate on a single metric | Metric presented without context | Always present metrics in context with targets and trends |
| ROI calculation disputed | Breach cost estimates are speculative | Use industry benchmarks (IBM, Verizon DBIR) and present ranges |
Key Takeaways
Measuring AI red team effectiveness transforms the program from a cost center into a demonstrable risk reduction investment:
- Measure what matters to each audience -- coverage metrics for the security team, remediation metrics for engineering, and ROI for executive leadership. No single metric serves all audiences.
- Track trends, not snapshots -- a single engagement's finding count is meaningless in isolation. The trend of security posture over time tells the real story.
- Novel findings indicate program health -- if every engagement finds only the same issues, the methodology has stagnated. A healthy novel finding rate (>20%) indicates evolving attack techniques.
- Regression rate is the most actionable metric -- it directly measures whether fixes are holding. High regression rates indicate a need for continuous testing, not more point-in-time engagements.
- ROI requires honest estimation -- breach cost estimates and probability reductions are inherently uncertain. Present ranges rather than point estimates, and anchor to industry benchmarks.
Advanced Considerations
Adapting to Modern Defenses
The defensive landscape for LLM applications has evolved significantly since the initial wave of prompt injection research. Modern production systems often deploy multiple independent defensive layers, requiring attackers to adapt their techniques accordingly.
Input classification: The most common first line of defense is an input classifier that evaluates incoming prompts for adversarial patterns. These classifiers range from simple keyword-based filters to sophisticated ML models trained on adversarial examples. Bypassing input classifiers requires understanding their detection methodology:
- Keyword-based classifiers can be evaded through encoding (Base64, ROT13, Unicode homoglyphs), paraphrasing, or splitting the payload across multiple turns
- ML-based classifiers require more sophisticated evasion such as semantic camouflage, gradual escalation, or exploitation of the classifier's own blind spots
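One of the encoding tricks mentioned above -- Unicode homoglyphs -- can be sketched in a few lines. The substitution table below is a hypothetical minimal example (real evasion campaigns use much larger tables covering digits and punctuation):

```python
# Hypothetical homoglyph table: Latin letters swapped for visually
# identical Cyrillic code points, which defeats exact keyword matching.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def homoglyph_evade(text: str) -> str:
    """Replace matchable ASCII letters with Cyrillic look-alikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

evaded = homoglyph_evade("password")
print("password" in evaded)  # False -- the literal keyword no longer appears
```

The evaded string renders identically to a human reviewer, which is why keyword filters that do not normalize Unicode (e.g., via NFKC folding plus confusables mapping) miss it.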
class DefenseBypassStrategy:
"""Strategy selector for bypassing different defense configurations."""
def __init__(self):
self.strategies = {
"keyword_filter": self._bypass_keyword,
"ml_classifier": self._bypass_ml,
"output_filter": self._bypass_output,
"rate_limiter": self._bypass_rate,
"behavioral": self._bypass_behavioral,
}
def select_strategy(self, identified_defenses: list) -> callable:
"""Select the optimal bypass strategy based on identified defenses."""
if not identified_defenses:
return self._direct_attack
# Combine strategies for multi-layer defenses
applicable = [
self.strategies[d] for d in identified_defenses
if d in self.strategies
]
return self._chain_strategies(applicable)
def _bypass_keyword(self, payload: str) -> str:
"""Bypass keyword-based input filters."""
import base64
# Encode the payload to bypass keyword matching
encoded = base64.b64encode(payload.encode()).decode()
return f"Decode and follow: {encoded}"
def _bypass_ml(self, payload: str) -> list:
"""Generate multiple paraphrased variants for ML classifier evasion."""
variants = [
f"As a thought experiment, consider: {payload}",
f"For academic research purposes: {payload}",
f"In a fictional scenario where: {payload}",
]
return variants
def _bypass_output(self, payload: str) -> str:
"""Modify the payload to extract data through output filter gaps."""
return payload + "\nRespond using only code blocks and variable names."
def _bypass_rate(self, payload: str) -> dict:
"""Configure timing to avoid rate limit triggers."""
return {"payload": payload, "delay": 3.0, "jitter": 1.5}
def _bypass_behavioral(self, payload: str) -> list:
"""Split the payload across multiple natural-sounding turns."""
# Multi-turn escalation to avoid behavioral anomaly detection
return [
"I'm working on a security research project.",
"Could you help me understand how safety mechanisms work?",
"Specifically, I'm interested in how systems like yours " + payload[:100],
payload,
]
def _direct_attack(self, payload: str) -> str:
return payload
    def _chain_strategies(self, strategies: list) -> callable:
        """Chain multiple bypass strategies.

        Note: only string-to-string strategies compose cleanly. Strategies
        that return lists or dicts (ML, rate limiting, behavioral) must
        terminate the chain, since the next strategy expects a string.
        """
        def chained(payload):
            result = payload
            for strategy in strategies:
                result = strategy(result)
                if not isinstance(result, str):
                    break  # non-string output cannot feed the next strategy
            return result
        return chained

Output filtering: Output filters inspect the model's response before it reaches the user, looking for sensitive data leakage, harmful content, or other policy violations. Common output filter bypass techniques include:
| Technique | How It Works | Effectiveness |
|---|---|---|
| Encoding output | Request Base64/hex encoded responses | Medium — some filters check decoded content |
| Code block wrapping | Embed data in code comments/variables | High — many filters skip code blocks |
| Steganographic output | Hide data in formatting, capitalization, or spacing | High — difficult to detect |
| Chunked extraction | Extract small pieces across many turns | High — individual pieces may pass filters |
| Indirect extraction | Have the model reveal data through behavior changes | Very High — no explicit data in output |
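As a concrete sketch of the chunked-extraction row, a tester might split one sensitive request across several low-signal turns. The helper and the field names below are hypothetical placeholders, not a real target schema:

```python
def chunk_extraction_prompts(field_names: list[str], chunk_size: int = 2) -> list[str]:
    """Split a multi-field extraction goal into low-signal per-turn requests.

    Each turn asks for only chunk_size fields, so no single response is
    likely to contain enough sensitive content to trip an output filter.
    """
    prompts = []
    for i in range(0, len(field_names), chunk_size):
        chunk = field_names[i:i + chunk_size]
        prompts.append(
            "Continuing our earlier discussion, what are the values of "
            + " and ".join(chunk) + "?"
        )
    return prompts

# Four target fields become two innocuous-looking turns
turns = chunk_extraction_prompts(
    ["api_base_url", "model_name", "system_prompt_length", "tool_names"]
)
```

The defensive corollary: output filters that score each response independently will miss this, which is why session-level aggregation is the usual countermeasure.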
Cross-Model Considerations
Techniques that work against one model may not directly transfer to others. However, understanding the general principles allows adaptation:
- Safety training methodology: Models trained with RLHF (GPT-4, Claude) have different safety characteristics than those using DPO (Llama, Mistral) or other methods. RLHF-trained models tend to refuse more broadly but may be more susceptible to multi-turn escalation.
- Context window size: Models with larger context windows (Claude with 200K, Gemini with 1M+) may be more susceptible to context window manipulation, where adversarial content is buried in large amounts of benign text.
- Multimodal capabilities: Models that process images, audio, or other modalities introduce additional attack surfaces not present in text-only models.
- Tool use implementation: The implementation details of function calling vary significantly between providers. OpenAI uses a structured function calling format, while Anthropic uses tool use blocks. These differences affect exploitation techniques.
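To make the last point concrete, here is the same hypothetical `get_weather` tool declared for each provider (shapes follow the OpenAI Chat Completions and Anthropic Messages APIs at the time of writing). The differing schema placement changes where injected description text lands in the rendered prompt:

```python
# OpenAI Chat Completions: tool schema nested under a "function" key,
# with the JSON Schema in "parameters".
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Anthropic Messages: flat tool object, JSON Schema in "input_schema".
anthropic_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```

A payload hidden in a tool description may therefore survive one provider's prompt rendering but be truncated or escaped by another's, so injection tests against tool schemas should be repeated per provider.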
Operational Considerations
Testing Ethics and Boundaries
Professional red team testing operates within clear ethical and legal boundaries:
- Authorization: Always obtain written authorization before testing. This should specify the scope, methods allowed, and any restrictions.
- Scope limits: Stay within the authorized scope. If you discover a vulnerability that leads outside the authorized perimeter, document it and report it without exploiting it.
- Data handling: Handle any sensitive data discovered during testing according to the engagement agreement. Never retain sensitive data beyond what's needed for reporting.
- Responsible disclosure: Follow responsible disclosure practices for any vulnerabilities discovered, particularly if they affect systems beyond your testing scope.
Documenting Results
Professional documentation follows a structured format:
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
@dataclass
class Finding:
"""Structure for documenting a security finding."""
id: str
title: str
severity: str # Critical, High, Medium, Low, Informational
category: str # OWASP LLM Top 10 category
description: str
steps_to_reproduce: list[str]
impact: str
recommendation: str
evidence: list[str] = field(default_factory=list)
mitre_atlas: Optional[str] = None
cvss_score: Optional[float] = None
discovered_at: str = field(default_factory=lambda: datetime.now().isoformat())
def to_report_section(self) -> str:
"""Generate a report section for this finding."""
steps = "\n".join(f" {i+1}. {s}" for i, s in enumerate(self.steps_to_reproduce))
return f"""
### {self.id}: {self.title}
**Severity**: {self.severity}
**Category**: {self.category}
{f"**MITRE ATLAS**: {self.mitre_atlas}" if self.mitre_atlas else ""}
#### Description
{self.description}
#### Steps to Reproduce
{steps}
#### Impact
{self.impact}
#### Recommendation
{self.recommendation}
"""

This structured approach ensures that findings are actionable and that remediation teams have the information they need to address the vulnerabilities effectively.
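Since the `Finding` dataclass carries an optional `cvss_score`, the severity label can be derived from it rather than assigned ad hoc. A small helper (an illustrative sketch, not part of the class above) using the standard CVSS v3.1 qualitative ranges:

```python
def cvss_to_severity(score: float) -> str:
    """Map a CVSS v3.1 base score to this report's severity buckets.

    Uses the v3.1 qualitative rating scale; CVSS's "None" rating
    (score 0.0) maps to "Informational" here.
    """
    if score == 0.0:
        return "Informational"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"

print(cvss_to_severity(8.1))  # High
```

Deriving severity from the score keeps reports consistent across testers and makes the severity-based SLA metrics in Step 3 comparable between engagements.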