AI Red Team Evidence Collection
Systematic evidence collection methodologies for AI red team engagements, including artifact preservation, finding documentation, and chain of custody procedures.
Overview
Evidence collection during AI red team engagements serves a fundamentally different purpose than evidence collection during incident response. In incident response, you are reconstructing what happened. In a red team engagement, you are creating a record of what you did, what you found, and what it means for the organization's security posture. The evidence must be detailed enough to reproduce findings, credible enough to drive remediation decisions, and structured enough to be analyzed across multiple engagements over time.
AI red teaming introduces evidence types that do not exist in traditional penetration testing. When you successfully jailbreak an LLM, the "evidence" is a conversation — a sequence of natural language messages that led the model to behave outside its intended boundaries. When you find a prompt injection vulnerability in a RAG system, the evidence includes the injected document, the retrieval context, and the model's compromised output. When you discover that a model leaks training data, the evidence is a set of generated outputs that match private data with statistical significance.
These artifacts are fragile. Conversations are ephemeral by default. Model behavior is stochastic, meaning a successful attack may not reproduce identically on the next attempt. Cloud-hosted models may be updated at any time, changing the attack surface without notice. A robust evidence collection methodology must account for all of these challenges.
This article covers the design of evidence collection systems for AI red team engagements, from planning what to capture, through preserving artifacts with cryptographic integrity, to generating reports that drive remediation.
Evidence Collection Framework
Defining the Evidence Taxonomy
AI red team evidence falls into several categories, each requiring different collection and preservation methods.
import json
import hashlib
import uuid
from datetime import datetime, timezone
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from pathlib import Path
class EvidenceType(Enum):
"""Types of evidence collected during AI red team engagements."""
CONVERSATION = "conversation" # Full conversation logs
PROMPT_PAYLOAD = "prompt_payload" # Specific attack prompts
MODEL_RESPONSE = "model_response" # Model outputs of interest
SCREENSHOT = "screenshot" # Visual evidence
CONFIGURATION = "configuration" # System/model configuration
NETWORK_CAPTURE = "network_capture" # API traffic captures
BEHAVIORAL_OBSERVATION = "behavioral_observation" # Noted behaviors
RETRIEVAL_CONTEXT = "retrieval_context" # RAG retrieval results
TOOL_OUTPUT = "tool_output" # Output from testing tools
METRIC_DATA = "metric_data" # Quantitative measurements
class SeverityLevel(Enum):
"""Severity levels for red team findings."""
INFORMATIONAL = "informational"
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
@dataclass
class EvidenceItem:
"""A single piece of evidence from a red team engagement."""
evidence_id: str
evidence_type: EvidenceType
timestamp: str
title: str
description: str
content: str | dict | bytes
engagement_id: str
finding_id: Optional[str] = None
severity: SeverityLevel = SeverityLevel.INFORMATIONAL
tags: list[str] = field(default_factory=list)
metadata: dict = field(default_factory=dict)
integrity_hash: str = ""
collector: str = ""
def compute_hash(self) -> str:
"""Compute SHA-256 hash of evidence content for integrity."""
if isinstance(self.content, bytes):
content_bytes = self.content
elif isinstance(self.content, dict):
content_bytes = json.dumps(
self.content, sort_keys=True
).encode("utf-8")
else:
content_bytes = str(self.content).encode("utf-8")
self.integrity_hash = hashlib.sha256(content_bytes).hexdigest()
return self.integrity_hash
@dataclass
class Finding:
"""A red team finding supported by evidence."""
finding_id: str
title: str
description: str
severity: SeverityLevel
attack_category: str
attack_vector: str
impact: str
remediation: str
evidence_ids: list[str]
reproducibility: str # always, usually, sometimes, rarely
cvss_score: Optional[float] = None
cwe_id: Optional[str] = None
timestamp: str = ""
engagement_id: str = ""
class RedTeamEvidenceCollector:
"""Systematic evidence collection for AI red team engagements."""
def __init__(
self,
engagement_id: str,
output_dir: str,
collector_name: str,
):
self.engagement_id = engagement_id
self.output_dir = Path(output_dir) / engagement_id
self.output_dir.mkdir(parents=True, exist_ok=True)
self.collector_name = collector_name
self.evidence_items: list[EvidenceItem] = []
self.findings: list[Finding] = []
self._evidence_log: list[dict] = []
def collect_conversation(
self,
messages: list[dict],
title: str,
description: str,
model_id: str = "",
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
tags: Optional[list[str]] = None,
) -> str:
"""
Collect a conversation as evidence.
Args:
messages: List of message dicts with 'role' and 'content'.
title: Short title for the evidence.
description: Detailed description of what this shows.
model_id: The model targeted.
severity: Severity level of the finding.
tags: Optional tags for categorization.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.CONVERSATION,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=description,
content={
"messages": messages,
"message_count": len(messages),
"model_id": model_id,
},
engagement_id=self.engagement_id,
severity=severity,
tags=tags or [],
metadata={"model_id": model_id},
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def collect_prompt_payload(
self,
payload: str,
title: str,
attack_technique: str,
target_model: str = "",
effectiveness: str = "", # success, partial, failure
response_summary: str = "",
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
) -> str:
"""
Collect a specific attack prompt payload as evidence.
Args:
payload: The attack prompt text.
title: Short title for the evidence.
attack_technique: Category of attack (jailbreak, injection, etc.)
target_model: The model this payload targets.
effectiveness: Whether the payload was successful.
response_summary: Summary of how the model responded.
severity: Severity level.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.PROMPT_PAYLOAD,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=(
f"Attack technique: {attack_technique}. "
f"Target: {target_model}. "
f"Effectiveness: {effectiveness}."
),
content={
"payload": payload,
"attack_technique": attack_technique,
"target_model": target_model,
"effectiveness": effectiveness,
"response_summary": response_summary,
},
engagement_id=self.engagement_id,
severity=severity,
tags=[attack_technique, effectiveness],
metadata={
"target_model": target_model,
"attack_technique": attack_technique,
},
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def collect_behavioral_observation(
self,
observation: str,
title: str,
category: str,
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
supporting_evidence_ids: Optional[list[str]] = None,
) -> str:
"""
Collect a behavioral observation about the AI system.
Args:
observation: Detailed description of observed behavior.
title: Short title.
category: Category (safety_bypass, data_leak, etc.)
severity: Severity level.
supporting_evidence_ids: IDs of related evidence.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.BEHAVIORAL_OBSERVATION,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=observation,
content={
"observation": observation,
"category": category,
"supporting_evidence": supporting_evidence_ids or [],
},
engagement_id=self.engagement_id,
severity=severity,
tags=[category],
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def create_finding(
self,
title: str,
description: str,
severity: SeverityLevel,
attack_category: str,
attack_vector: str,
impact: str,
remediation: str,
evidence_ids: list[str],
reproducibility: str = "usually",
cvss_score: Optional[float] = None,
cwe_id: Optional[str] = None,
) -> str:
"""
Create a finding supported by collected evidence.
Args:
title: Finding title.
description: Detailed description.
severity: Severity level.
attack_category: Category of attack.
attack_vector: How the attack works.
impact: What the attacker can achieve.
remediation: Recommended fixes.
evidence_ids: IDs of supporting evidence.
reproducibility: How reliably the finding reproduces.
cvss_score: Optional CVSS score.
cwe_id: Optional CWE identifier.
Returns:
Finding ID.
"""
finding = Finding(
finding_id=f"FND-{uuid.uuid4().hex[:12]}",
title=title,
description=description,
severity=severity,
attack_category=attack_category,
attack_vector=attack_vector,
impact=impact,
remediation=remediation,
evidence_ids=evidence_ids,
reproducibility=reproducibility,
cvss_score=cvss_score,
cwe_id=cwe_id,
timestamp=datetime.now(timezone.utc).isoformat(),
engagement_id=self.engagement_id,
)
# Verify all evidence IDs exist
existing_ids = {e.evidence_id for e in self.evidence_items}
missing = set(evidence_ids) - existing_ids
if missing:
raise ValueError(
f"Evidence IDs not found: {missing}. "
f"Collect evidence before creating findings."
)
self.findings.append(finding)
# Link supporting evidence items to this finding
for ev in self.evidence_items:
if ev.evidence_id in evidence_ids:
ev.finding_id = finding.finding_id
self._save_finding(finding)
return finding.finding_id
def _store_evidence(self, evidence: EvidenceItem) -> None:
"""Store evidence to disk and in-memory index."""
self.evidence_items.append(evidence)
# Write evidence file
evidence_dir = self.output_dir / "evidence"
evidence_dir.mkdir(exist_ok=True)
evidence_data = {
"evidence_id": evidence.evidence_id,
"evidence_type": evidence.evidence_type.value,
"timestamp": evidence.timestamp,
"title": evidence.title,
"description": evidence.description,
"content": evidence.content if not isinstance(evidence.content, bytes) else "<binary>",
"engagement_id": evidence.engagement_id,
"finding_id": evidence.finding_id,
"severity": evidence.severity.value,
"tags": evidence.tags,
"metadata": evidence.metadata,
"integrity_hash": evidence.integrity_hash,
"collector": evidence.collector,
}
file_path = evidence_dir / f"{evidence.evidence_id}.json"
with open(str(file_path), "w") as f:
json.dump(evidence_data, f, indent=2)
# Append to evidence log
self._evidence_log.append({
"evidence_id": evidence.evidence_id,
"timestamp": evidence.timestamp,
"type": evidence.evidence_type.value,
"title": evidence.title,
"hash": evidence.integrity_hash,
})
self._save_evidence_log()
def _save_finding(self, finding: Finding) -> None:
"""Save a finding to disk."""
findings_dir = self.output_dir / "findings"
findings_dir.mkdir(exist_ok=True)
finding_data = {
"finding_id": finding.finding_id,
"title": finding.title,
"description": finding.description,
"severity": finding.severity.value,
"attack_category": finding.attack_category,
"attack_vector": finding.attack_vector,
"impact": finding.impact,
"remediation": finding.remediation,
"evidence_ids": finding.evidence_ids,
"reproducibility": finding.reproducibility,
"cvss_score": finding.cvss_score,
"cwe_id": finding.cwe_id,
"timestamp": finding.timestamp,
"engagement_id": finding.engagement_id,
}
file_path = findings_dir / f"{finding.finding_id}.json"
with open(str(file_path), "w") as f:
json.dump(finding_data, f, indent=2)
def _save_evidence_log(self) -> None:
"""Save the evidence collection log (append-only audit trail)."""
log_path = self.output_dir / "evidence_log.json"
with open(str(log_path), "w") as f:
json.dump(self._evidence_log, f, indent=2)
Automated Evidence Capture During Testing
Wrapping API Calls for Automatic Capture
During active red team testing, manually documenting every API call is impractical. Instead, wrap the API client to automatically capture all interactions as evidence.
import time
from typing import Any
class EvidenceCapturingClient:
"""Wraps an LLM API client to automatically capture
all interactions as forensic evidence."""
def __init__(
self,
api_client: Any,
evidence_collector: RedTeamEvidenceCollector,
target_model: str,
auto_tag: bool = True,
):
self.api_client = api_client
self.collector = evidence_collector
self.target_model = target_model
self.auto_tag = auto_tag
self.interaction_count = 0
def send_message(
self,
messages: list[dict],
attack_technique: str = "unknown",
notes: str = "",
**kwargs,
) -> dict:
"""
Send a message through the API client and automatically
capture the interaction as evidence.
Args:
messages: The messages to send.
attack_technique: What attack this is part of.
notes: Red teamer's notes about this interaction.
**kwargs: Additional arguments for the API client.
Returns:
The API response dict.
"""
self.interaction_count += 1
start_time = time.time()
# Call the actual API
response = self.api_client.chat(messages=messages, **kwargs)
elapsed_ms = (time.time() - start_time) * 1000
# Build full conversation with response
full_messages = list(messages)
response_text = self._extract_response_text(response)
full_messages.append({
"role": "assistant",
"content": response_text,
})
# Auto-capture as evidence
self.collector.collect_conversation(
messages=full_messages,
title=f"Interaction #{self.interaction_count}: {attack_technique}",
description=(
f"Automated capture during {attack_technique} testing. "
f"Latency: {elapsed_ms:.0f}ms. "
f"{notes}"
),
model_id=self.target_model,
tags=[attack_technique, "auto_captured"],
)
return response
def _extract_response_text(self, response: Any) -> str:
"""Extract text from various API response formats."""
if isinstance(response, dict):
# OpenAI format
choices = response.get("choices", [])
if choices:
return choices[0].get("message", {}).get("content", "")
# Anthropic format
content = response.get("content", [])
if content and isinstance(content, list):
return content[0].get("text", "")
if isinstance(response, str):
return response
return str(response)
Reproducibility Testing
A finding that cannot be reproduced has limited value. After identifying a potential vulnerability, test its reproducibility systematically.
from collections.abc import Callable

def test_reproducibility(
client: EvidenceCapturingClient,
attack_messages: list[dict],
success_criteria: Callable[[str], bool],
num_trials: int = 10,
attack_technique: str = "unknown",
) -> dict:
"""
Test the reproducibility of an attack by running it multiple
times and checking success criteria.
Args:
client: Evidence-capturing API client.
attack_messages: The attack messages to send.
success_criteria: Function that takes the response text
and returns True if the attack succeeded.
num_trials: Number of reproduction attempts.
attack_technique: Category of attack being tested.
Returns:
Reproducibility assessment dict.
"""
results = []
for trial in range(num_trials):
response = client.send_message(
messages=attack_messages,
attack_technique=f"{attack_technique}_repro_trial_{trial + 1}",
notes=f"Reproducibility trial {trial + 1}/{num_trials}",
)
response_text = client._extract_response_text(response)
success = success_criteria(response_text)
results.append({
"trial": trial + 1,
"success": success,
"response_length": len(response_text),
})
success_count = sum(1 for r in results if r["success"])
success_rate = success_count / num_trials
if success_rate >= 0.9:
reproducibility = "always"
elif success_rate >= 0.6:
reproducibility = "usually"
elif success_rate >= 0.2:
reproducibility = "sometimes"
else:
reproducibility = "rarely"
return {
"total_trials": num_trials,
"successes": success_count,
"success_rate": success_rate,
"reproducibility": reproducibility,
"results": results,
}
Report Generation
Producing Engagement Reports
At the conclusion of a red team engagement, generate a structured report that presents findings with supporting evidence. The report should be accessible to both technical and executive audiences.
class EngagementReportGenerator:
"""Generate structured reports from red team evidence."""
def generate_report(
self,
collector: RedTeamEvidenceCollector,
engagement_name: str,
scope_description: str,
executive_summary: str = "",
) -> str:
"""
Generate a full engagement report.
Args:
collector: The evidence collector with all findings.
engagement_name: Name of the engagement.
scope_description: Description of what was in scope.
executive_summary: Optional pre-written executive summary.
Returns:
Report as formatted text.
"""
findings = sorted(
collector.findings,
key=lambda f: self._severity_order(f.severity),
reverse=True,
)
lines = [
"=" * 70,
"AI RED TEAM ENGAGEMENT REPORT",
"=" * 70,
"",
f"Engagement: {engagement_name}",
f"Engagement ID: {collector.engagement_id}",
f"Report Generated: {datetime.now(timezone.utc).isoformat()}",
"",
"SCOPE",
"-" * 40,
scope_description,
"",
"EXECUTIVE SUMMARY",
"-" * 40,
]
if executive_summary:
lines.append(executive_summary)
else:
lines.append(self._auto_executive_summary(findings, collector))
lines.extend([
"",
"FINDINGS SUMMARY",
"-" * 40,
])
severity_counts = {}
for f in findings:
sev = f.severity.value
severity_counts[sev] = severity_counts.get(sev, 0) + 1
for sev in ["critical", "high", "medium", "low", "informational"]:
count = severity_counts.get(sev, 0)
lines.append(f" {sev.upper()}: {count}")
lines.extend([
"",
f" Total findings: {len(findings)}",
f" Total evidence items: {len(collector.evidence_items)}",
"",
"DETAILED FINDINGS",
"-" * 40,
])
for i, finding in enumerate(findings, 1):
lines.extend(self._format_finding(finding, i, collector))
return "\n".join(lines)
def _format_finding(
self,
finding: Finding,
number: int,
collector: RedTeamEvidenceCollector,
) -> list[str]:
"""Format a single finding for the report."""
lines = [
"",
f"--- Finding #{number}: {finding.title} ---",
f"ID: {finding.finding_id}",
f"Severity: {finding.severity.value.upper()}",
f"CVSS: {finding.cvss_score or 'N/A'}",
f"CWE: {finding.cwe_id or 'N/A'}",
f"Reproducibility: {finding.reproducibility}",
"",
f"Description: {finding.description}",
"",
f"Attack Vector: {finding.attack_vector}",
"",
f"Impact: {finding.impact}",
"",
f"Remediation: {finding.remediation}",
"",
f"Supporting Evidence ({len(finding.evidence_ids)} items):",
]
for ev_id in finding.evidence_ids:
matching = [
e for e in collector.evidence_items
if e.evidence_id == ev_id
]
if matching:
ev = matching[0]
lines.append(
f" - [{ev.evidence_id}] {ev.evidence_type.value}: {ev.title}"
)
return lines
def _auto_executive_summary(
self,
findings: list[Finding],
collector: RedTeamEvidenceCollector,
) -> str:
"""Generate an automatic executive summary."""
critical = [f for f in findings if f.severity == SeverityLevel.CRITICAL]
high = [f for f in findings if f.severity == SeverityLevel.HIGH]
summary_parts = [
f"This engagement identified {len(findings)} findings "
f"across the target AI system."
]
if critical:
summary_parts.append(
f"{len(critical)} critical finding(s) require immediate "
f"remediation: {', '.join(f.title for f in critical)}."
)
if high:
summary_parts.append(
f"{len(high)} high-severity finding(s) should be "
f"addressed within the next sprint cycle."
)
return " ".join(summary_parts)
@staticmethod
def _severity_order(severity: SeverityLevel) -> int:
return {
SeverityLevel.INFORMATIONAL: 0,
SeverityLevel.LOW: 1,
SeverityLevel.MEDIUM: 2,
SeverityLevel.HIGH: 3,
SeverityLevel.CRITICAL: 4,
}.get(severity, 0)
Engagement Planning for Evidence Collection
Pre-Engagement Evidence Requirements
Before any testing begins, define what evidence will be collected and how. Work with the engagement stakeholders (typically the AI product team and the security team) to establish evidence requirements. These typically include: minimum documentation standards for each finding, required metadata for each evidence item, evidence classification levels and handling procedures, and the format and structure of the final report.
Define the evidence retention policy: how long will evidence be stored after the engagement concludes? For most organizations, evidence from red team engagements should be retained for at least one year to support trend analysis across multiple engagements. For regulated industries, retention requirements may be longer.
Establish the tooling and infrastructure before testing begins. Set up the evidence collection system with the engagement ID, create the output directories, configure the API client wrappers for automatic capture, and verify that evidence is being stored correctly by running a test interaction. Discovering tooling issues mid-engagement wastes testing time and may result in lost evidence.
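A pre-engagement smoke test can be as simple as the sketch below: write a throwaway evidence item, read it back, and confirm the integrity hash survives the round trip. This is a minimal stand-alone example that mimics the collector's storage format; it is not part of the collector itself.

```python
import json
import hashlib
from pathlib import Path

def smoke_test_evidence_store(output_dir: str) -> bool:
    """Write a throwaway evidence item, read it back, and verify its
    integrity hash before real testing begins."""
    evidence_dir = Path(output_dir) / "evidence"
    evidence_dir.mkdir(parents=True, exist_ok=True)

    content = {"messages": [{"role": "user", "content": "smoke test"}]}
    content_bytes = json.dumps(content, sort_keys=True).encode("utf-8")
    record = {
        "evidence_id": "EV-smoketest",
        "content": content,
        "integrity_hash": hashlib.sha256(content_bytes).hexdigest(),
    }
    path = evidence_dir / "EV-smoketest.json"
    path.write_text(json.dumps(record, indent=2))

    # Read back and re-verify the hash, exactly as a later audit would.
    loaded = json.loads(path.read_text())
    rehashed = hashlib.sha256(
        json.dumps(loaded["content"], sort_keys=True).encode("utf-8")
    ).hexdigest()
    ok = rehashed == loaded["integrity_hash"]
    path.unlink()  # remove the throwaway item so it never pollutes real evidence
    return ok
```

Run this once against the engagement's output directory before the first real interaction; a failure here is cheap, while a failure discovered mid-engagement is lost evidence.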
Rules of Engagement and Evidence Boundaries
AI red team engagements require clear rules of engagement that define what systems can be tested, what attack techniques are permitted, and what data can be collected. These rules directly affect evidence collection.
If the rules of engagement prohibit testing against production user data, ensure your evidence collection system does not capture real user conversations or PII. If certain attack techniques are out of scope (e.g., social engineering of the development team), ensure that any incidental evidence of potential social engineering vulnerabilities is documented but not actively pursued.
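One way to enforce this boundary in tooling is to scrub likely PII from messages before they ever reach the evidence store. The patterns below are deliberately simple illustrations; a real engagement should use the organization's approved PII detection tooling rather than hand-rolled regexes.

```python
import re

# Illustrative patterns only, not an exhaustive PII detector.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace likely PII with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

def scrub_messages(messages: list[dict]) -> list[dict]:
    """Scrub every message in a conversation prior to collection."""
    return [
        {**m, "content": scrub_pii(m.get("content", ""))}
        for m in messages
    ]
```

Calling `scrub_messages` inside the capture path (for example, before `collect_conversation`) keeps the rules-of-engagement boundary enforced mechanically rather than by tester discipline.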
The rules of engagement should also specify what happens if the red team discovers an active, non-simulated security incident during testing. This scenario is uncommon but not unheard of: a red teamer investigating an AI system's defenses may discover evidence of actual malicious activity. Define the escalation procedure: stop testing, preserve evidence, and notify the incident response team through an established communication channel.
Collaboration with Blue Teams
The most effective AI red team engagements produce evidence that is directly actionable by the blue team (the defenders). Design your evidence collection to support this goal. Each finding should include not just the attack technique and its impact, but also specific detection opportunities: what log entries, metrics, or behavioral signals would have revealed this attack in progress? What monitoring could the blue team deploy to detect similar attacks in the future?
Document the attack from both perspectives: from the red team's perspective (what they did and why it worked) and from the blue team's perspective (what signals were available and why they were missed or not monitored). This dual perspective makes the evidence package a complete learning resource rather than just a vulnerability report.
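The dual perspective can be captured in a small structure attached to each finding. The field names below are illustrative, not a standard schema; the point is that every finding carries explicit detection opportunities alongside the attack narrative.

```python
from dataclasses import dataclass

@dataclass
class DetectionOpportunity:
    """A blue-team detection opportunity attached to a finding.
    Field names are illustrative, not a standard schema."""
    signal: str               # log entry, metric, or behavior that was available
    data_source: str          # where the signal lives (gateway logs, guardrail metrics)
    why_missed: str           # why it was not detected during the engagement
    proposed_monitoring: str  # what the blue team could deploy going forward

def dual_perspective_notes(
    red_team_narrative: str,
    opportunities: list[DetectionOpportunity],
) -> dict:
    """Package both perspectives for inclusion in a finding's evidence."""
    return {
        "red_team_perspective": red_team_narrative,
        "blue_team_perspective": [
            {
                "signal": o.signal,
                "data_source": o.data_source,
                "why_missed": o.why_missed,
                "proposed_monitoring": o.proposed_monitoring,
            }
            for o in opportunities
        ],
    }
```

The resulting dict can be stored via `collect_behavioral_observation` or placed in a finding's description, so the blue team receives actionable monitoring guidance with every vulnerability.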
Chain of Custody Considerations
Maintaining Evidence Integrity
Throughout an AI red team engagement, maintain chain of custody for all evidence. Every evidence item should have a SHA-256 hash computed at collection time, a collector identifier, and a timestamp. The evidence log provides an append-only audit trail of what was collected and when.
For engagements that may have legal implications (e.g., testing in regulated industries or collecting evidence of third-party AI misuse), consider additional controls: write-once storage for evidence files, two-person integrity verification for critical findings, and signed evidence manifests using GPG or a similar tool.
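A signed manifest can be sketched as follows: hash every stored evidence file, then hash the sorted set of per-file hashes so a single value fixes the whole collection. The `EV-*.json` naming follows the collector above; the signing step itself (e.g. `gpg --detach-sign manifest.json`) is left as an external operation.

```python
import json
import hashlib
from pathlib import Path

def build_evidence_manifest(evidence_dir: str, manifest_path: str) -> dict:
    """Hash every stored evidence file (EV-*.json) and write a manifest
    suitable for external signing with GPG or a similar tool."""
    entries = {}
    for path in sorted(Path(evidence_dir).glob("EV-*.json")):
        entries[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    # A hash over the sorted per-file hashes fixes the entire evidence set:
    # altering or removing any single file changes this value.
    rollup = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode("utf-8")
    ).hexdigest()
    manifest = {"files": entries, "manifest_hash": rollup}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Generating the manifest at engagement close, then signing it, gives reviewers a single artifact to verify rather than hundreds of individual hashes.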
Evidence Collection for Specific AI Attack Types
Jailbreak Evidence
When documenting a successful jailbreak, capture the complete conversation from the first message through the jailbreak success, including any failed attempts in between. The failed attempts are forensically valuable because they document the model's defense boundaries and show the progression of the attack technique. Record the exact model version, temperature setting, and any other parameters, as jailbreak success rates are sensitive to these settings. Test reproducibility at multiple temperature values and note the range where the jailbreak succeeds.
Include negative evidence as well: document what the model correctly refused before the jailbreak succeeded. This establishes the baseline safety behavior and makes the jailbreak finding more credible. If the jailbreak only works with a specific system prompt configuration, document both the vulnerable configuration and configurations where the jailbreak fails.
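The temperature-sensitivity testing described above can be sketched as a simple sweep. Here `run_attack` is a stand-in (an assumption, not part of the earlier code) for replaying the jailbreak conversation at a given temperature, for example via the `EvidenceCapturingClient`.

```python
def temperature_sweep(
    run_attack,        # callable: temperature -> response text (stand-in for
                       # replaying the jailbreak conversation at that setting)
    success_criteria,  # callable: response text -> True if the jailbreak worked
    temperatures=(0.0, 0.3, 0.7, 1.0),
    trials_per_setting: int = 5,
) -> dict:
    """Measure jailbreak success rate at each temperature so the finding
    can document the parameter range in which the attack succeeds."""
    results = {}
    for temp in temperatures:
        successes = sum(
            1 for _ in range(trials_per_setting)
            if success_criteria(run_attack(temp))
        )
        results[temp] = successes / trials_per_setting
    return results
```

The per-temperature success rates belong in the finding's evidence, since "works only above temperature 0.7" and "works at all settings" imply very different remediations.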
RAG Poisoning Evidence
For RAG poisoning findings, capture the poisoned document, its embedding, the similarity score at which it was retrieved, the query that triggered retrieval, the model's output with the poisoned context, and the model's output with clean context for the same query. The comparison between poisoned and clean outputs is the core evidence that demonstrates impact.
Document the ingestion pathway: how the poisoned document entered the knowledge base, what validation or filtering it passed through, and whether the poisoning could be detected by existing content moderation. If the poisoning exploits a specific property of the embedding model (e.g., adversarially crafted text that embeds close to a target query), document the embedding similarity analysis.
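The paired-output comparison can be packaged with a helper like the sketch below. `answer_fn` is an assumed callable that runs the query against a given document set; in a live engagement it would wrap the target RAG pipeline.

```python
def rag_poisoning_evidence(
    query: str,
    answer_fn,            # callable: (query, docs) -> model answer (assumption)
    clean_docs: list[str],
    poisoned_doc: str,
    similarity_score: float,
) -> dict:
    """Run the same query against clean and poisoned contexts and
    package the paired-output comparison as evidence content."""
    clean_answer = answer_fn(query, clean_docs)
    poisoned_answer = answer_fn(query, clean_docs + [poisoned_doc])
    return {
        "query": query,
        "poisoned_document": poisoned_doc,
        "retrieval_similarity": similarity_score,
        "output_with_clean_context": clean_answer,
        "output_with_poison": poisoned_answer,
        "outputs_differ": clean_answer.strip() != poisoned_answer.strip(),
    }
```

The resulting dict is suitable as the `content` of a `RETRIEVAL_CONTEXT` evidence item, keeping the clean baseline and the compromised output inseparably paired.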
Training Data Extraction Evidence
When a model leaks training data, the evidence must demonstrate that the generated output matches actual training data rather than being a coincidental generation. This requires comparing model outputs against known training data sources. Use multiple prompting strategies to elicit the same data, and document the success rate. Calculate the statistical likelihood that the generation is coincidental by measuring the output's perplexity under the model versus its perplexity under a reference distribution.
Capture the complete prompt that triggers the extraction, the model's full output, the matching training data source (with provenance), and the similarity analysis showing the match is statistically significant. If the extraction reveals personally identifiable information or copyrighted material, flag this separately with appropriate handling markings.
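As a crude first-pass screen before the full perplexity analysis, verbatim n-gram overlap between a generated output and a candidate training source gives a quick quantitative signal. This is a simplification, not a substitute for the statistical analysis described above, but long verbatim matches are unlikely to be coincidental.

```python
def verbatim_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's word n-grams that appear verbatim
    in the candidate training source. A crude screening metric: high
    overlap at large n warrants the full perplexity-based analysis."""
    gen_words = generated.split()
    if len(gen_words) < n:
        return 0.0
    src = " ".join(source.split())  # normalize whitespace for matching
    ngrams = [
        " ".join(gen_words[i:i + n])
        for i in range(len(gen_words) - n + 1)
    ]
    hits = sum(1 for g in ngrams if g in src)
    return hits / len(ngrams)
```

An overlap score near 1.0 at n=8 or higher is strong evidence of memorization; scores near zero suggest the match is coincidental and the extraction claim needs more support.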
Handling Sensitive Evidence
AI red team evidence frequently contains sensitive material: successful jailbreak payloads, model outputs that include harmful content, leaked training data, or exposed system prompts. Classify evidence appropriately and apply access controls. Store sensitive payloads separately from the main evidence repository with additional access restrictions. Redact sensitive content in reports distributed to broader audiences, while maintaining full unredacted evidence in the secure evidence store for technical review.
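A minimal sketch of the separate sensitive store follows, using POSIX permission bits as the access restriction. Real deployments would layer encryption and ACLs on top; the permission calls here are the simplest possible illustration.

```python
import os
import json
from pathlib import Path

def store_sensitive_payload(
    base_dir: str,
    evidence_id: str,
    payload: dict,
) -> str:
    """Write a sensitive payload to a restricted subdirectory rather than
    the main evidence store. Permission bits are a minimal example; real
    deployments would add encryption and access control lists."""
    restricted = Path(base_dir) / "restricted"
    restricted.mkdir(parents=True, exist_ok=True)
    os.chmod(restricted, 0o700)  # owner-only directory access (POSIX)
    path = restricted / f"{evidence_id}.json"
    path.write_text(json.dumps(payload, indent=2))
    os.chmod(path, 0o600)  # owner-only file access
    return str(path)
```

The main evidence record can then reference the sensitive item by path and hash, so reports built from the main store stay redacted by construction.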
Evidence Quality Metrics
Measuring Evidence Completeness
At the conclusion of each engagement, assess the quality and completeness of the evidence collected. Metrics to track include: evidence-per-finding ratio (each finding should have at least 2-3 supporting evidence items), reproducibility test coverage (what percentage of findings were reproducibility-tested), metadata completeness (what percentage of evidence items have all standard metadata fields populated), and hash verification rate (what percentage of evidence items have integrity hashes computed and verified).
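These metrics can be computed directly from the collected records. The sketch below works over plain dict representations (as written to disk by the collector above) rather than the dataclasses, so it can also run against archived engagements; the required-field list is an assumption about what "complete metadata" means.

```python
def evidence_quality_metrics(
    findings: list[dict],
    evidence_items: list[dict],
) -> dict:
    """Compute the completeness metrics described above from plain dict
    representations of findings and evidence items."""
    required_fields = ("title", "description", "collector", "integrity_hash")
    n_findings = len(findings) or 1    # avoid division by zero
    n_evidence = len(evidence_items) or 1
    linked = sum(len(f.get("evidence_ids", [])) for f in findings)
    repro_tested = sum(
        1 for f in findings
        if f.get("reproducibility") not in (None, "", "unknown")
    )
    complete = sum(
        1 for e in evidence_items
        if all(e.get(field) for field in required_fields)
    )
    hashed = sum(1 for e in evidence_items if e.get("integrity_hash"))
    return {
        "evidence_per_finding": linked / n_findings,
        "reproducibility_coverage": repro_tested / n_findings,
        "metadata_completeness": complete / n_evidence,
        "hash_rate": hashed / n_evidence,
    }
```

An evidence-per-finding ratio below 2.0 or a reproducibility coverage well under 1.0 flags an engagement whose findings may not withstand remediation pushback.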
Track these metrics across engagements to identify trends. If evidence quality is declining, investigate whether the testing methodology has become rushed, whether the tooling needs improvement, or whether the engagement scope has expanded beyond the team's capacity to document thoroughly. Evidence quality directly affects the credibility of findings and the likelihood that remediation will be prioritized.
Evidence Archival and Cross-Engagement Analysis
After each engagement, archive the complete evidence package in a long-term storage system accessible for future reference. Over time, the archive enables cross-engagement analysis: are the same vulnerability patterns appearing across multiple engagements? Are remediation recommendations being implemented and effective? Which AI models or architectures are most frequently vulnerable?
Build a searchable index across archived engagements that allows querying by attack technique, AI model, vulnerability type, and severity. This index becomes a knowledge base that informs future engagement planning and helps the red team prioritize their testing based on historically observed vulnerability patterns.
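A starting point for such an index is sketched below. It assumes the on-disk layout the collector above writes (one engagement directory per engagement, with `findings/FND-*.json` inside); a production index would likely live in a database rather than an in-memory dict.

```python
import json
from collections import defaultdict
from pathlib import Path

def build_findings_index(archive_root: str) -> dict:
    """Index archived finding files (ENGAGEMENT/findings/FND-*.json, the
    layout the collector above writes) by attack category and severity
    so historical vulnerability patterns can be queried."""
    by_category = defaultdict(list)
    by_severity = defaultdict(list)
    for path in Path(archive_root).glob("*/findings/FND-*.json"):
        finding = json.loads(path.read_text())
        ref = {
            "finding_id": finding.get("finding_id"),
            "engagement_id": finding.get("engagement_id"),
            "title": finding.get("title"),
        }
        by_category[finding.get("attack_category", "unknown")].append(ref)
        by_severity[finding.get("severity", "unknown")].append(ref)
    return {
        "by_attack_category": dict(by_category),
        "by_severity": dict(by_severity),
    }
```

Querying the index for, say, all historical `prompt_injection` findings immediately shows whether a newly discovered issue is a recurrence of an unremediated pattern.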