AI Red Team Evidence Collection
Systematic evidence collection methodologies for AI red team engagements, including artifact preservation, finding documentation, and chain of custody procedures.
Overview
Evidence collection during AI red team engagements serves a fundamentally different purpose than evidence collection during incident response. In incident response, you are reconstructing what happened. In a red team engagement, you are creating a record of what you did, what you found, and what it means for the organization's security posture. The evidence must be detailed enough to reproduce findings, credible enough to drive remediation decisions, and structured enough to be analyzed across multiple engagements over time.
AI red teaming introduces evidence types that do not exist in traditional penetration testing. When you successfully jailbreak an LLM, the "evidence" is a conversation — a sequence of natural language messages that led the model to behave outside its intended boundaries. When you find a prompt injection vulnerability in a RAG system, the evidence includes the injected document, the retrieval context, and the model's compromised output. When you discover that a model leaks training data, the evidence is a set of generated outputs that match private data with statistical significance.
These artifacts are fragile. Conversations are ephemeral by default. Model behavior is stochastic, meaning a successful attack may not reproduce identically on the next attempt. Cloud-hosted models may be updated at any time, changing the attack surface without notice. A robust evidence collection methodology must account for all of these challenges.
This article covers the design of evidence collection systems for AI red team engagements, from planning what to capture, through preserving artifacts with cryptographic integrity, to generating reports that drive remediation.
Evidence Collection Framework
Defining the Evidence Taxonomy
AI red team evidence falls into several categories, each requiring different collection and preservation methods.
import json
import hashlib
import uuid
from datetime import datetime, timezone
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from pathlib import Path
class EvidenceType(Enum):
"""Types of evidence collected during AI red team engagements."""
CONVERSATION = "conversation" # Full conversation logs
PROMPT_PAYLOAD = "prompt_payload" # Specific attack prompts
MODEL_RESPONSE = "model_response" # Model outputs of interest
SCREENSHOT = "screenshot" # Visual evidence
CONFIGURATION = "configuration" # System/model configuration
NETWORK_CAPTURE = "network_capture" # API traffic captures
BEHAVIORAL_OBSERVATION = "behavioral_observation" # Noted behaviors
RETRIEVAL_CONTEXT = "retrieval_context" # RAG retrieval results
TOOL_OUTPUT = "tool_output" # Output from testing tools
METRIC_DATA = "metric_data" # Quantitative measurements
class SeverityLevel(Enum):
"""Severity levels for red team findings."""
INFORMATIONAL = "informational"
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
@dataclass
class EvidenceItem:
"""A single piece of evidence from a red team engagement."""
evidence_id: str
evidence_type: EvidenceType
timestamp: str
title: str
description: str
content: str | dict | bytes
engagement_id: str
finding_id: Optional[str] = None
severity: SeverityLevel = SeverityLevel.INFORMATIONAL
tags: list[str] = field(default_factory=list)
metadata: dict = field(default_factory=dict)
integrity_hash: str = ""
collector: str = ""
def compute_hash(self) -> str:
"""Compute SHA-256 hash of evidence content for integrity."""
if isinstance(self.content, bytes):
content_bytes = self.content
elif isinstance(self.content, dict):
content_bytes = json.dumps(
self.content, sort_keys=True
).encode("utf-8")
else:
content_bytes = str(self.content).encode("utf-8")
self.integrity_hash = hashlib.sha256(content_bytes).hexdigest()
return self.integrity_hash
@dataclass
class Finding:
"""A red team finding supported by evidence."""
finding_id: str
title: str
description: str
severity: SeverityLevel
attack_category: str
attack_vector: str
impact: str
remediation: str
evidence_ids: list[str]
reproducibility: str # always, usually, sometimes, rarely
cvss_score: Optional[float] = None
cwe_id: Optional[str] = None
timestamp: str = ""
engagement_id: str = ""
class RedTeamEvidenceCollector:
"""Systematic evidence collection for AI red team engagements."""
def __init__(
self,
engagement_id: str,
output_dir: str,
collector_name: str,
):
self.engagement_id = engagement_id
self.output_dir = Path(output_dir) / engagement_id
self.output_dir.mkdir(parents=True, exist_ok=True)
self.collector_name = collector_name
self.evidence_items: list[EvidenceItem] = []
self.findings: list[Finding] = []
self._evidence_log: list[dict] = []
def collect_conversation(
self,
messages: list[dict],
title: str,
description: str,
model_id: str = "",
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
tags: Optional[list[str]] = None,
) -> str:
"""
Collect a conversation as evidence.
Args:
messages: List of message dicts with 'role' and 'content'.
title: Short title for the evidence.
description: Detailed description of what this shows.
model_id: The model targeted.
severity: Severity level of the finding.
tags: Optional tags for categorization.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.CONVERSATION,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=description,
content={
"messages": messages,
"message_count": len(messages),
"model_id": model_id,
},
engagement_id=self.engagement_id,
severity=severity,
tags=tags or [],
metadata={"model_id": model_id},
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def collect_prompt_payload(
self,
payload: str,
title: str,
attack_technique: str,
target_model: str = "",
effectiveness: str = "", # success, partial, failure
response_summary: str = "",
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
) -> str:
"""
Collect a specific attack prompt payload as evidence.
Args:
payload: The attack prompt text.
title: Short title for the evidence.
attack_technique: Category of attack (jailbreak, injection, etc.)
target_model: The model this payload targets.
effectiveness: Whether the payload was successful.
response_summary: Summary of how the model responded.
severity: Severity level.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.PROMPT_PAYLOAD,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=(
f"Attack technique: {attack_technique}. "
f"Target: {target_model}. "
f"Effectiveness: {effectiveness}."
),
content={
"payload": payload,
"attack_technique": attack_technique,
"target_model": target_model,
"effectiveness": effectiveness,
"response_summary": response_summary,
},
engagement_id=self.engagement_id,
severity=severity,
tags=[attack_technique, effectiveness],
metadata={
"target_model": target_model,
"attack_technique": attack_technique,
},
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def collect_behavioral_observation(
self,
observation: str,
title: str,
category: str,
severity: SeverityLevel = SeverityLevel.INFORMATIONAL,
supporting_evidence_ids: Optional[list[str]] = None,
) -> str:
"""
Collect a behavioral observation about the AI system.
Args:
observation: Detailed description of observed behavior.
title: Short title.
category: Category (safety_bypass, data_leak, etc.)
severity: Severity level.
supporting_evidence_ids: IDs of related evidence.
Returns:
Evidence ID.
"""
evidence = EvidenceItem(
evidence_id=f"EV-{uuid.uuid4().hex[:12]}",
evidence_type=EvidenceType.BEHAVIORAL_OBSERVATION,
timestamp=datetime.now(timezone.utc).isoformat(),
title=title,
description=observation,
content={
"observation": observation,
"category": category,
"supporting_evidence": supporting_evidence_ids or [],
},
engagement_id=self.engagement_id,
severity=severity,
tags=[category],
collector=self.collector_name,
)
evidence.compute_hash()
self._store_evidence(evidence)
return evidence.evidence_id
def create_finding(
self,
title: str,
description: str,
severity: SeverityLevel,
attack_category: str,
attack_vector: str,
impact: str,
remediation: str,
evidence_ids: list[str],
reproducibility: str = "usually",
cvss_score: Optional[float] = None,
cwe_id: Optional[str] = None,
) -> str:
"""
Create a finding supported by collected evidence.
Args:
title: Finding title.
description: Detailed description.
severity: Severity level.
attack_category: Category of attack.
attack_vector: How the attack works.
impact: What the attacker can achieve.
remediation: Recommended fixes.
evidence_ids: IDs of supporting evidence.
reproducibility: How reliably the finding reproduces.
cvss_score: Optional CVSS score.
cwe_id: Optional CWE identifier.
Returns:
Finding ID.
"""
finding = Finding(
finding_id=f"FND-{uuid.uuid4().hex[:12]}",
title=title,
description=description,
severity=severity,
attack_category=attack_category,
attack_vector=attack_vector,
impact=impact,
remediation=remediation,
evidence_ids=evidence_ids,
reproducibility=reproducibility,
cvss_score=cvss_score,
cwe_id=cwe_id,
timestamp=datetime.now(timezone.utc).isoformat(),
engagement_id=self.engagement_id,
)
# Verify all evidence IDs exist
existing_ids = {e.evidence_id for e in self.evidence_items}
missing = set(evidence_ids) - existing_ids
if missing:
raise ValueError(
f"Evidence IDs not found: {missing}. "
f"Collect evidence before creating findings."
)
self.findings.append(finding)
# Link supporting evidence items to this finding
for ev in self.evidence_items:
if ev.evidence_id in evidence_ids:
ev.finding_id = finding.finding_id
self._save_finding(finding)
return finding.finding_id
def _store_evidence(self, evidence: EvidenceItem) -> None:
"""Store evidence to disk and in-memory index."""
self.evidence_items.append(evidence)
# Write evidence file
evidence_dir = self.output_dir / "evidence"
evidence_dir.mkdir(exist_ok=True)
evidence_data = {
"evidence_id": evidence.evidence_id,
"evidence_type": evidence.evidence_type.value,
"timestamp": evidence.timestamp,
"title": evidence.title,
"description": evidence.description,
"content": evidence.content if not isinstance(evidence.content, bytes) else "<binary>",
"engagement_id": evidence.engagement_id,
"finding_id": evidence.finding_id,
"severity": evidence.severity.value,
"tags": evidence.tags,
"metadata": evidence.metadata,
"integrity_hash": evidence.integrity_hash,
"collector": evidence.collector,
}
file_path = evidence_dir / f"{evidence.evidence_id}.json"
with open(str(file_path), "w") as f:
json.dump(evidence_data, f, indent=2)
# Append to evidence log
self._evidence_log.append({
"evidence_id": evidence.evidence_id,
"timestamp": evidence.timestamp,
"type": evidence.evidence_type.value,
"title": evidence.title,
"hash": evidence.integrity_hash,
})
self._save_evidence_log()
def _save_finding(self, finding: Finding) -> None:
"""Save a finding to disk."""
findings_dir = self.output_dir / "findings"
findings_dir.mkdir(exist_ok=True)
finding_data = {
"finding_id": finding.finding_id,
"title": finding.title,
"description": finding.description,
"severity": finding.severity.value,
"attack_category": finding.attack_category,
"attack_vector": finding.attack_vector,
"impact": finding.impact,
"remediation": finding.remediation,
"evidence_ids": finding.evidence_ids,
"reproducibility": finding.reproducibility,
"cvss_score": finding.cvss_score,
"cwe_id": finding.cwe_id,
"timestamp": finding.timestamp,
"engagement_id": finding.engagement_id,
}
file_path = findings_dir / f"{finding.finding_id}.json"
with open(str(file_path), "w") as f:
json.dump(finding_data, f, indent=2)
def _save_evidence_log(self) -> None:
"""Save the evidence collection log (append-only audit trail)."""
log_path = self.output_dir / "evidence_log.json"
with open(str(log_path), "w") as f:
json.dump(self._evidence_log, f, indent=2)
Automated Evidence Capture During Testing
Wrapping API Calls for Automatic Capture
During active red team testing, manually documenting every API call is impractical. Instead, wrap the API client to automatically capture all interactions as evidence.
import time
from typing import Any
class EvidenceCapturingClient:
"""Wraps an LLM API client to automatically capture
all interactions as forensic evidence."""
def __init__(
self,
api_client: Any,
evidence_collector: RedTeamEvidenceCollector,
target_model: str,
auto_tag: bool = True,
):
self.api_client = api_client
self.collector = evidence_collector
self.target_model = target_model
self.auto_tag = auto_tag
self.interaction_count = 0
def send_message(
self,
messages: list[dict],
attack_technique: str = "unknown",
notes: str = "",
**kwargs,
) -> dict:
"""
Send a message through the API client and automatically
capture the interaction as evidence.
Args:
messages: The messages to send.
attack_technique: What attack this is part of.
notes: Red teamer's notes about this interaction.
**kwargs: Additional arguments for the API client.
Returns:
The API response dict.
"""
self.interaction_count += 1
start_time = time.time()
# Call the actual API
response = self.api_client.chat(messages=messages, **kwargs)
elapsed_ms = (time.time() - start_time) * 1000
# Build full conversation with response
full_messages = list(messages)
response_text = self._extract_response_text(response)
full_messages.append({
"role": "assistant",
"content": response_text,
})
# Auto-capture as evidence
self.collector.collect_conversation(
messages=full_messages,
title=f"Interaction #{self.interaction_count}: {attack_technique}",
description=(
f"Automated capture during {attack_technique} testing. "
f"Latency: {elapsed_ms:.0f}ms. "
f"{notes}"
),
model_id=self.target_model,
tags=[attack_technique, "auto_captured"],
)
return response
def _extract_response_text(self, response: Any) -> str:
"""Extract text from various API response formats."""
if isinstance(response, dict):
# OpenAI format
choices = response.get("choices", [])
if choices:
return choices[0].get("message", {}).get("content", "")
# Anthropic format
content = response.get("content", [])
if content and isinstance(content, list):
return content[0].get("text", "")
if isinstance(response, str):
return response
return str(response)
Reproducibility Testing
A finding that cannot be reproduced has limited value. After identifying a potential vulnerability, test its reproducibility systematically.
from collections.abc import Callable

def test_reproducibility(
client: EvidenceCapturingClient,
attack_messages: list[dict],
success_criteria: Callable[[str], bool],
num_trials: int = 10,
attack_technique: str = "unknown",
) -> dict:
"""
Test the reproducibility of an attack by running it multiple
times and checking success criteria.
Args:
client: Evidence-capturing API client.
attack_messages: The attack messages to send.
success_criteria: Function that takes the response text
and returns True if the attack succeeded.
num_trials: Number of reproduction attempts.
attack_technique: Category of attack being tested.
Returns:
Reproducibility assessment dict.
"""
results = []
for trial in range(num_trials):
response = client.send_message(
messages=attack_messages,
attack_technique=f"{attack_technique}_repro_trial_{trial + 1}",
notes=f"Reproducibility trial {trial + 1}/{num_trials}",
)
response_text = client._extract_response_text(response)
success = success_criteria(response_text)
results.append({
"trial": trial + 1,
"success": success,
"response_length": len(response_text),
})
success_count = sum(1 for r in results if r["success"])
success_rate = success_count / num_trials
if success_rate >= 0.9:
reproducibility = "always"
elif success_rate >= 0.6:
reproducibility = "usually"
elif success_rate >= 0.2:
reproducibility = "sometimes"
else:
reproducibility = "rarely"
return {
"total_trials": num_trials,
"successes": success_count,
"success_rate": success_rate,
"reproducibility": reproducibility,
"results": results,
}
Report Generation
Producing Engagement Reports
At the conclusion of a red team engagement, generate a structured report that presents findings with supporting evidence. The report should be accessible to both technical and executive audiences.
class EngagementReportGenerator:
"""Generate structured reports from red team evidence."""
def generate_report(
self,
collector: RedTeamEvidenceCollector,
engagement_name: str,
scope_description: str,
executive_summary: str = "",
) -> str:
"""
Generate a full engagement report.
Args:
collector: The evidence collector with all findings.
engagement_name: Name of the engagement.
scope_description: Description of what was in scope.
executive_summary: Optional pre-written executive summary.
Returns:
Report as formatted text.
"""
findings = sorted(
collector.findings,
key=lambda f: self._severity_order(f.severity),
reverse=True,
)
lines = [
"=" * 70,
"AI RED TEAM ENGAGEMENT REPORT",
"=" * 70,
"",
f"Engagement: {engagement_name}",
f"Engagement ID: {collector.engagement_id}",
f"Report Generated: {datetime.now(timezone.utc).isoformat()}",
"",
"SCOPE",
"-" * 40,
scope_description,
"",
"EXECUTIVE SUMMARY",
"-" * 40,
]
if executive_summary:
lines.append(executive_summary)
else:
lines.append(self._auto_executive_summary(findings, collector))
lines.extend([
"",
"FINDINGS SUMMARY",
"-" * 40,
])
severity_counts = {}
for f in findings:
sev = f.severity.value
severity_counts[sev] = severity_counts.get(sev, 0) + 1
for sev in ["critical", "high", "medium", "low", "informational"]:
count = severity_counts.get(sev, 0)
lines.append(f" {sev.upper()}: {count}")
lines.extend([
"",
f" Total findings: {len(findings)}",
f" Total evidence items: {len(collector.evidence_items)}",
"",
"DETAILED FINDINGS",
"-" * 40,
])
for i, finding in enumerate(findings, 1):
lines.extend(self._format_finding(finding, i, collector))
return "\n".join(lines)
def _format_finding(
self,
finding: Finding,
number: int,
collector: RedTeamEvidenceCollector,
) -> list[str]:
"""Format a single finding for the report."""
lines = [
"",
f"--- Finding #{number}: {finding.title} ---",
f"ID: {finding.finding_id}",
f"Severity: {finding.severity.value.upper()}",
f"CVSS: {finding.cvss_score or 'N/A'}",
f"CWE: {finding.cwe_id or 'N/A'}",
f"Reproducibility: {finding.reproducibility}",
"",
f"Description: {finding.description}",
"",
f"Attack Vector: {finding.attack_vector}",
"",
f"Impact: {finding.impact}",
"",
f"Remediation: {finding.remediation}",
"",
f"Supporting Evidence ({len(finding.evidence_ids)} items):",
]
for ev_id in finding.evidence_ids:
matching = [
e for e in collector.evidence_items
if e.evidence_id == ev_id
]
if matching:
ev = matching[0]
lines.append(
f" - [{ev.evidence_id}] {ev.evidence_type.value}: {ev.title}"
)
return lines
def _auto_executive_summary(
self,
findings: list[Finding],
collector: RedTeamEvidenceCollector,
) -> str:
"""Generate an automatic executive summary."""
critical = [f for f in findings if f.severity == SeverityLevel.CRITICAL]
high = [f for f in findings if f.severity == SeverityLevel.HIGH]
summary_parts = [
f"This engagement identified {len(findings)} findings "
f"across the target AI system."
]
if critical:
summary_parts.append(
f"{len(critical)} critical finding(s) require immediate "
f"remediation: {', '.join(f.title for f in critical)}."
)
if high:
summary_parts.append(
f"{len(high)} high-severity finding(s) should be "
f"addressed within the next sprint cycle."
)
return " ".join(summary_parts)
@staticmethod
def _severity_order(severity: SeverityLevel) -> int:
return {
SeverityLevel.INFORMATIONAL: 0,
SeverityLevel.LOW: 1,
SeverityLevel.MEDIUM: 2,
SeverityLevel.HIGH: 3,
SeverityLevel.CRITICAL: 4,
}.get(severity, 0)
Engagement Planning for Evidence Collection
Pre-Engagement Evidence Requirements
Before any testing begins, define what evidence will be collected and how. Work with the engagement stakeholders (typically the AI product team and the security team) to establish evidence requirements. These typically include: minimum documentation standards for each finding, required metadata for each evidence item, evidence classification levels and handling procedures, and the format and structure of the final report.
Define the evidence retention policy: how long will evidence be stored after the engagement concludes? For most organizations, evidence from red team engagements should be retained for at least one year to support trend analysis across multiple engagements. For regulated industries, retention requirements may be longer.
Establish the tooling and infrastructure before testing begins. Set up the evidence collection system with the engagement ID, create the output directories, configure the API client wrappers for automatic capture, and verify that evidence is being stored correctly by running a test interaction. Discovering tooling issues mid-engagement wastes testing time and may result in lost evidence.
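A pre-engagement smoke test can be as simple as the sketch below: write a throwaway evidence item, read it back, and confirm the integrity hash survives the round trip. This is a minimal stand-alone example that mimics the collector's storage format; it is not part of the collector itself.

```python
import json
import hashlib
from pathlib import Path

def smoke_test_evidence_store(output_dir: str) -> bool:
    """Write a throwaway evidence item, read it back, and verify its
    integrity hash before real testing begins."""
    evidence_dir = Path(output_dir) / "evidence"
    evidence_dir.mkdir(parents=True, exist_ok=True)

    content = {"messages": [{"role": "user", "content": "smoke test"}]}
    content_bytes = json.dumps(content, sort_keys=True).encode("utf-8")
    record = {
        "evidence_id": "EV-smoketest",
        "content": content,
        "integrity_hash": hashlib.sha256(content_bytes).hexdigest(),
    }
    path = evidence_dir / "EV-smoketest.json"
    path.write_text(json.dumps(record, indent=2))

    # Read back and re-verify the hash, exactly as a later audit would.
    loaded = json.loads(path.read_text())
    rehashed = hashlib.sha256(
        json.dumps(loaded["content"], sort_keys=True).encode("utf-8")
    ).hexdigest()
    ok = rehashed == loaded["integrity_hash"]
    path.unlink()  # remove the throwaway item so it never pollutes real evidence
    return ok
```

Run this once against the engagement's output directory before the first real interaction; a failure here is cheap, while a failure discovered mid-engagement is lost evidence.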
Rules of Engagement and Evidence Boundaries
AI red team engagements require clear rules of engagement that define what systems can be tested, what attack techniques are permitted, and what data can be collected. These rules directly affect evidence collection.
If the rules of engagement prohibit testing against production user data, ensure your evidence collection system does not capture real user conversations or PII. If certain attack techniques are out of scope (e.g., social engineering of the development team), ensure that any incidental evidence of potential social engineering vulnerabilities is documented but not actively pursued.
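One way to enforce this boundary in tooling is to scrub likely PII from messages before they ever reach the evidence store. The patterns below are deliberately simple illustrations; a real engagement should use the organization's approved PII detection tooling rather than hand-rolled regexes.

```python
import re

# Illustrative patterns only, not an exhaustive PII detector.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace likely PII with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

def scrub_messages(messages: list[dict]) -> list[dict]:
    """Scrub every message in a conversation prior to collection."""
    return [
        {**m, "content": scrub_pii(m.get("content", ""))}
        for m in messages
    ]
```

Calling `scrub_messages` inside the capture path (for example, before `collect_conversation`) keeps the rules-of-engagement boundary enforced mechanically rather than by tester discipline.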
The rules of engagement should also specify what happens if the red team discovers an active, non-simulated security incident during testing. This scenario is uncommon but not unheard of: a red teamer investigating an AI system's defenses may discover evidence of actual malicious activity. Define the escalation procedure: stop testing, preserve evidence, and notify the incident response team through an established communication channel.
Collaboration with Blue Teams
The most effective AI red team engagements produce evidence that is directly actionable by the blue team (the defenders). Design your evidence collection to support this goal. Each finding should include not just the attack technique and its impact, but also specific detection opportunities: what log entries, metrics, or behavioral signals would have revealed this attack in progress? What monitoring could the blue team deploy to detect similar attacks in the future?
Document the attack from both perspectives: from the red team's perspective (what they did and why it worked) and from the blue team's perspective (what signals were available and why they were missed or not monitored). This dual perspective makes the evidence package a complete learning resource rather than just a vulnerability report.
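The dual perspective can be captured in a small structure attached to each finding. The field names below are illustrative, not a standard schema; the point is that every finding carries explicit detection opportunities alongside the attack narrative.

```python
from dataclasses import dataclass

@dataclass
class DetectionOpportunity:
    """A blue-team detection opportunity attached to a finding.
    Field names are illustrative, not a standard schema."""
    signal: str               # log entry, metric, or behavior that was available
    data_source: str          # where the signal lives (gateway logs, guardrail metrics)
    why_missed: str           # why it was not detected during the engagement
    proposed_monitoring: str  # what the blue team could deploy going forward

def dual_perspective_notes(
    red_team_narrative: str,
    opportunities: list[DetectionOpportunity],
) -> dict:
    """Package both perspectives for inclusion in a finding's evidence."""
    return {
        "red_team_perspective": red_team_narrative,
        "blue_team_perspective": [
            {
                "signal": o.signal,
                "data_source": o.data_source,
                "why_missed": o.why_missed,
                "proposed_monitoring": o.proposed_monitoring,
            }
            for o in opportunities
        ],
    }
```

The resulting dict can be stored via `collect_behavioral_observation` or placed in a finding's description, so the blue team receives actionable monitoring guidance with every vulnerability.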
Chain of Custody Considerations
Maintaining Evidence Integrity
Throughout an AI red team engagement, maintain chain of custody for all evidence. Every evidence item should have a SHA-256 hash computed at collection time, a collector identifier, and a timestamp. The evidence log provides an append-only audit trail of what was collected and when.
For engagements that may have legal implications (e.g., testing in regulated industries or collecting evidence of third-party AI misuse), consider additional controls: write-once storage for evidence files, two-person integrity verification for critical findings, and signed evidence manifests using GPG or a similar tool.
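A signed manifest can be sketched as follows: hash every stored evidence file, then hash the sorted set of per-file hashes so a single value fixes the whole collection. The `EV-*.json` naming follows the collector above; the signing step itself (e.g. `gpg --detach-sign manifest.json`) is left as an external operation.

```python
import json
import hashlib
from pathlib import Path

def build_evidence_manifest(evidence_dir: str, manifest_path: str) -> dict:
    """Hash every stored evidence file (EV-*.json) and write a manifest
    suitable for external signing with GPG or a similar tool."""
    entries = {}
    for path in sorted(Path(evidence_dir).glob("EV-*.json")):
        entries[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    # A hash over the sorted per-file hashes fixes the entire evidence set:
    # altering or removing any single file changes this value.
    rollup = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode("utf-8")
    ).hexdigest()
    manifest = {"files": entries, "manifest_hash": rollup}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Generating the manifest at engagement close, then signing it, gives reviewers a single artifact to verify rather than hundreds of individual hashes.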
Evidence Collection for Specific AI Attack Types
Jailbreak Evidence
When documenting a successful jailbreak, capture the complete conversation from the first message through the jailbreak success, including any failed attempts in between. The failed attempts are forensically valuable because they document the model's defense boundaries and show the progression of the attack technique. Record the exact model version, temperature setting, and any other parameters, as jailbreak success rates are sensitive to these settings. Test reproducibility at multiple temperature values and note the range where the jailbreak succeeds.
Include negative evidence as well: document what the model correctly refused before the jailbreak succeeded. This establishes the baseline safety behavior and makes the jailbreak finding more credible. If the jailbreak only works with a specific system prompt configuration, document both the vulnerable configuration and configurations where the jailbreak fails.
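The temperature-sensitivity testing described above can be sketched as a simple sweep. Here `run_attack` is a stand-in (an assumption, not part of the earlier code) for replaying the jailbreak conversation at a given temperature, for example via the `EvidenceCapturingClient`.

```python
def temperature_sweep(
    run_attack,        # callable: temperature -> response text (stand-in for
                       # replaying the jailbreak conversation at that setting)
    success_criteria,  # callable: response text -> True if the jailbreak worked
    temperatures=(0.0, 0.3, 0.7, 1.0),
    trials_per_setting: int = 5,
) -> dict:
    """Measure jailbreak success rate at each temperature so the finding
    can document the parameter range in which the attack succeeds."""
    results = {}
    for temp in temperatures:
        successes = sum(
            1 for _ in range(trials_per_setting)
            if success_criteria(run_attack(temp))
        )
        results[temp] = successes / trials_per_setting
    return results
```

The per-temperature success rates belong in the finding's evidence, since "works only above temperature 0.7" and "works at all settings" imply very different remediations.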
RAG Poisoning Evidence
For RAG poisoning findings, capture the poisoned document, its embedding, the similarity score at which it was retrieved, the query that triggered retrieval, the model's output with the poisoned context, and the model's output with clean context for the same query. The comparison between poisoned and clean outputs is the core evidence that demonstrates impact.
Document the ingestion pathway: how the poisoned document entered the knowledge base, what validation or filtering it passed through, and whether the poisoning could be detected by existing content moderation. If the poisoning exploits a specific property of the embedding model (e.g., adversarially crafted text that embeds close to a target query), document the embedding similarity analysis.
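The paired-output comparison can be packaged with a helper like the sketch below. `answer_fn` is an assumed callable that runs the query against a given document set; in a live engagement it would wrap the target RAG pipeline.

```python
def rag_poisoning_evidence(
    query: str,
    answer_fn,            # callable: (query, docs) -> model answer (assumption)
    clean_docs: list[str],
    poisoned_doc: str,
    similarity_score: float,
) -> dict:
    """Run the same query against clean and poisoned contexts and
    package the paired-output comparison as evidence content."""
    clean_answer = answer_fn(query, clean_docs)
    poisoned_answer = answer_fn(query, clean_docs + [poisoned_doc])
    return {
        "query": query,
        "poisoned_document": poisoned_doc,
        "retrieval_similarity": similarity_score,
        "output_with_clean_context": clean_answer,
        "output_with_poison": poisoned_answer,
        "outputs_differ": clean_answer.strip() != poisoned_answer.strip(),
    }
```

The resulting dict is suitable as the `content` of a `RETRIEVAL_CONTEXT` evidence item, keeping the clean baseline and the compromised output inseparably paired.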
Training Data Extraction Evidence
When a model leaks training data, the evidence must demonstrate that the generated output matches actual training data rather than being a coincidental generation. This requires comparing model outputs against known training data sources. Use multiple prompting strategies to elicit the same data, and document the success rate. Calculate the statistical likelihood that the generation is coincidental by measuring the output's perplexity under the model versus its perplexity under a reference distribution.
Capture the complete prompt that triggers the extraction, the model's full output, the matching training data source (with provenance), and the similarity analysis showing the match is statistically significant. If the extraction reveals personally identifiable information or copyrighted material, flag this separately with appropriate handling markings.
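As a crude first-pass screen before the full perplexity analysis, verbatim n-gram overlap between a generated output and a candidate training source gives a quick quantitative signal. This is a simplification, not a substitute for the statistical analysis described above, but long verbatim matches are unlikely to be coincidental.

```python
def verbatim_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's word n-grams that appear verbatim
    in the candidate training source. A crude screening metric: high
    overlap at large n warrants the full perplexity-based analysis."""
    gen_words = generated.split()
    if len(gen_words) < n:
        return 0.0
    src = " ".join(source.split())  # normalize whitespace for matching
    ngrams = [
        " ".join(gen_words[i:i + n])
        for i in range(len(gen_words) - n + 1)
    ]
    hits = sum(1 for g in ngrams if g in src)
    return hits / len(ngrams)
```

An overlap score near 1.0 at n=8 or higher is strong evidence of memorization; scores near zero suggest the match is coincidental and the extraction claim needs more support.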
Handling Sensitive Evidence
AI red team evidence frequently contains sensitive material: successful jailbreak payloads, model outputs that include harmful content, leaked training data, or exposed system prompts. Classify evidence appropriately and apply access controls. Store sensitive payloads separately from the main evidence repository with additional access restrictions. Redact sensitive content in reports distributed to broader audiences, while maintaining full unredacted evidence in the secure evidence store for technical review.
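A minimal sketch of the separate sensitive store follows, using POSIX permission bits as the access restriction. Real deployments would layer encryption and ACLs on top; the permission calls here are the simplest possible illustration.

```python
import os
import json
from pathlib import Path

def store_sensitive_payload(
    base_dir: str,
    evidence_id: str,
    payload: dict,
) -> str:
    """Write a sensitive payload to a restricted subdirectory rather than
    the main evidence store. Permission bits are a minimal example; real
    deployments would add encryption and access control lists."""
    restricted = Path(base_dir) / "restricted"
    restricted.mkdir(parents=True, exist_ok=True)
    os.chmod(restricted, 0o700)  # owner-only directory access (POSIX)
    path = restricted / f"{evidence_id}.json"
    path.write_text(json.dumps(payload, indent=2))
    os.chmod(path, 0o600)  # owner-only file access
    return str(path)
```

The main evidence record can then reference the sensitive item by path and hash, so reports built from the main store stay redacted by construction.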
Evidence Quality Metrics
Measuring Evidence Completeness
At the conclusion of each engagement, assess the quality and completeness of the evidence collected. Metrics to track include: evidence-per-finding ratio (each finding should have at least 2-3 supporting evidence items), reproducibility test coverage (what percentage of findings were reproducibility-tested), metadata completeness (what percentage of evidence items have all standard metadata fields populated), and hash verification rate (what percentage of evidence items have integrity hashes computed and verified).
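These metrics can be computed directly from the collected records. The sketch below works over plain dict representations (as written to disk by the collector above) rather than the dataclasses, so it can also run against archived engagements; the required-field list is an assumption about what "complete metadata" means.

```python
def evidence_quality_metrics(
    findings: list[dict],
    evidence_items: list[dict],
) -> dict:
    """Compute the completeness metrics described above from plain dict
    representations of findings and evidence items."""
    required_fields = ("title", "description", "collector", "integrity_hash")
    n_findings = len(findings) or 1    # avoid division by zero
    n_evidence = len(evidence_items) or 1
    linked = sum(len(f.get("evidence_ids", [])) for f in findings)
    repro_tested = sum(
        1 for f in findings
        if f.get("reproducibility") not in (None, "", "unknown")
    )
    complete = sum(
        1 for e in evidence_items
        if all(e.get(field) for field in required_fields)
    )
    hashed = sum(1 for e in evidence_items if e.get("integrity_hash"))
    return {
        "evidence_per_finding": linked / n_findings,
        "reproducibility_coverage": repro_tested / n_findings,
        "metadata_completeness": complete / n_evidence,
        "hash_rate": hashed / n_evidence,
    }
```

An evidence-per-finding ratio below 2.0 or a reproducibility coverage well under 1.0 flags an engagement whose findings may not withstand remediation pushback.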
Track these metrics across engagements to identify trends. If evidence quality is declining, investigate whether the testing methodology has become rushed, whether the tooling needs improvement, or whether the engagement scope has expanded beyond the team's capacity to document thoroughly. Evidence quality directly affects the credibility of findings and the likelihood that remediation will be prioritized.
Evidence Archival and Cross-Engagement Analysis
After each engagement, archive the complete evidence package in a long-term storage system accessible for future reference. Over time, the archive enables cross-engagement analysis: are the same vulnerability patterns appearing across multiple engagements? Are remediation recommendations being implemented and effective? Which AI models or architectures are most frequently vulnerable?
Build a searchable index across archived engagements that allows querying by attack technique, AI model, vulnerability type, and severity. This index becomes a knowledge base that informs future engagement planning and helps the red team prioritize their testing based on historically observed vulnerability patterns.
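A starting point for such an index is sketched below. It assumes the on-disk layout the collector above writes (one engagement directory per engagement, with `findings/FND-*.json` inside); a production index would likely live in a database rather than an in-memory dict.

```python
import json
from collections import defaultdict
from pathlib import Path

def build_findings_index(archive_root: str) -> dict:
    """Index archived finding files (ENGAGEMENT/findings/FND-*.json, the
    layout the collector above writes) by attack category and severity
    so historical vulnerability patterns can be queried."""
    by_category = defaultdict(list)
    by_severity = defaultdict(list)
    for path in Path(archive_root).glob("*/findings/FND-*.json"):
        finding = json.loads(path.read_text())
        ref = {
            "finding_id": finding.get("finding_id"),
            "engagement_id": finding.get("engagement_id"),
            "title": finding.get("title"),
        }
        by_category[finding.get("attack_category", "unknown")].append(ref)
        by_severity[finding.get("severity", "unknown")].append(ref)
    return {
        "by_attack_category": dict(by_category),
        "by_severity": dict(by_severity),
    }
```

Querying the index for, say, all historical `prompt_injection` findings immediately shows whether a newly discovered issue is a recurrence of an unremediated pattern.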