AI Forensics & Incident Response
Overview of forensic investigation and incident response for AI systems: why traditional IR falls short, the AI incident lifecycle, and the unique challenges of non-deterministic systems.
AI Forensics & Incident Response
When an AI system is compromised, the playbook you used for traditional software incidents will leave you with critical blind spots. Model behavior is not a binary of "working" or "broken" -- it shifts along a spectrum that makes detection, scoping, and remediation fundamentally different from classical incident response.
Why Traditional IR Falls Short
Traditional incident response assumes deterministic systems: a compromised server runs the same exploit code every time, a stolen credential grants the same access on every use, and a malware binary produces the same hash regardless of when you analyze it. AI systems violate all of these assumptions.
Determinism vs. Non-Determinism
| Characteristic | Traditional Systems | AI Systems |
|---|---|---|
| Reproducibility | Exploits reproduce reliably | Attacks may succeed 30-70% of the time due to temperature and sampling |
| Evidence | Log files, memory dumps, disk images | Prompt logs, inference traces, model weights, embedding vectors |
| Blast radius | Defined by network topology and access control | Defined by what the model "knows" and what tools it can invoke |
| Containment | Isolate the host, revoke credentials | Isolate the model, but prior outputs may already be cached or acted upon |
| Root cause | Vulnerability in code or configuration | Could be training data, system prompt, model weights, or user input |
| Verification | Patch and re-test | Non-deterministic verification requires statistical confidence |
What Traditional Frameworks Miss
The NIST Cybersecurity Framework (SP 800-61) and SANS incident response process both assume you can:
-
Identify a clear indicator of compromise (IoC). In AI systems, the "compromise" may be a subtle behavioral shift -- the model starts leaking slightly more information or becomes marginally more compliant with harmful requests. There is no equivalent of a malware hash to search for.
-
Contain the incident by isolating affected systems. An AI model that has been jailbroken in one conversation does not affect other conversations in most architectures. But if the attack exploited a flaw in the system prompt or fine-tuning, every conversation is potentially affected.
-
Eradicate the threat by removing malicious artifacts. If the "malicious artifact" is a behavioral pattern learned during training, you cannot simply delete a file. You may need to retrain, fine-tune with corrective data, or add runtime guardrails.
-
Recover by restoring from known-good backups. You can roll back model weights, but the interactions that occurred during the compromise may have already caused damage -- data was disclosed, actions were taken, or downstream systems were affected.
The AI Incident Lifecycle
The AI incident lifecycle adapts the traditional IR phases but adds AI-specific activities at each stage.
Phase 1: Detection
Detection in AI systems relies on different signal types than traditional systems.
| Signal Type | Description | Example |
|---|---|---|
| Safety classifier alerts | Runtime classifiers flag harmful outputs | Output classified as "harmful" by Llama Guard |
| Anomalous inference patterns | Unusual token distributions, latency spikes, or output lengths | Average response length jumps from 200 to 2,000 tokens |
| User reports | End users report unexpected model behavior | "The chatbot told me internal pricing information" |
| Tool call anomalies | Agent makes unexpected tool invocations | Model calls exec() or accesses files outside its sandbox |
| Prompt pattern detection | Known jailbreak patterns appear in input logs | Input contains "ignore previous instructions" variants |
| Embedding drift | Query embeddings cluster in unexpected regions | Sudden spike in queries near sensitive document embeddings |
Phase 2: Triage and Classification
Once detected, AI incidents must be classified using an AI-specific taxonomy. Traditional categories like "malware," "unauthorized access," or "denial of service" do not capture the nuances of AI compromise.
AI incident taxonomy categories include jailbreaks, prompt injection, data exfiltration via model outputs, model manipulation, supply chain compromise of model artifacts, and adversarial attacks on model inputs.
See Incident Classification for the full taxonomy and Severity Framework for scoring methodology.
Phase 3: Containment
AI containment strategies depend on the incident type and system architecture.
| Strategy | When to Use | Trade-offs |
|---|---|---|
| Disable the model endpoint | Critical severity, active data exfiltration | Complete service disruption |
| Switch to a fallback model | Production continuity required | Fallback may have different capabilities or its own vulnerabilities |
| Add runtime guardrails | Targeted attack pattern identified | May block legitimate queries; attacker can adapt |
| Restrict tool access | Agent-based system, tool abuse detected | Reduces functionality but stops lateral movement |
| Rate limit or circuit breaker | Automated attack in progress | Slows but does not stop a determined attacker |
Phase 4: Investigation
AI forensic investigation examines evidence types that do not exist in traditional IR.
- Prompt and completion logs -- the full input/output history of the model, including system prompts, user messages, and assistant responses
- Inference telemetry -- token-level probabilities, latency measurements, and sampling parameters for each request
- Model artifacts -- weights, adapters, tokenizer files, and configuration that define the model's behavior
- Tool call traces -- records of what external tools or APIs the model invoked, with what parameters, and what results were returned
- Embedding and retrieval logs -- what documents were retrieved for RAG queries, their similarity scores, and the chunks injected into context
See Log Analysis and Model Forensics for detailed investigation procedures.
Phase 5: Remediation
Remediation in AI systems often requires changes that go beyond patching code.
Patch the immediate vulnerability
If the attack exploited a system prompt flaw, update the system prompt. If it exploited a missing guardrail, deploy the guardrail. This is the fastest but least durable fix.
Address the root cause
Determine whether the vulnerability is in the model itself (training data, fine-tuning), the application layer (system prompt, tool configuration), or the infrastructure (API exposure, authentication). Apply the appropriate fix at the right layer.
Verify the fix statistically
Because AI systems are non-deterministic, you cannot verify a fix with a single test. Run the original attack payload at least 50 times and confirm a statistically significant reduction in success rate. Document the confidence interval.
Monitor for regression
Deploy monitoring that specifically watches for the attack pattern and related variants. Set alert thresholds based on baseline behavior established before the incident.
Phase 6: Post-Mortem
AI incident post-mortems should include all traditional post-mortem elements plus:
- Model behavior timeline -- how the model's behavior changed before, during, and after the incident
- Attack transferability assessment -- whether the same attack works on other models in your environment
- Training data review -- whether the vulnerability originated in the training data
- Guardrail gap analysis -- what safety controls should have caught the incident and why they did not
Unique Challenges in AI Forensics
Non-Deterministic Evidence
The same prompt can produce different outputs on consecutive runs. This means:
- You may not be able to reproduce the exact incident, even with the same inputs
- "Negative" test results do not prove the vulnerability is fixed
- Evidence must include the exact outputs observed, not recreations
- Statistical analysis replaces binary pass/fail verification
Prompt Logs vs. System Logs
Traditional system logs (syslog, application logs, access logs) tell you what happened at the infrastructure level. But AI incidents play out in the content of the prompts and completions themselves. The "exploit" is natural language, not a CVE-identified vulnerability.
This means your logging infrastructure must capture the full content of every interaction, not just metadata. A log entry that says "user sent a message at 14:32:07, 847 tokens, response 1,203 tokens" tells you nothing about whether a jailbreak occurred. You need the actual text.
Model Behavior as Evidence
In some AI incidents, the evidence is not in logs or artifacts -- it is in the model's behavior itself. A fine-tuned model may have been poisoned to behave differently in specific contexts. The only way to discover this is through systematic behavioral probing, which is covered in Model Forensics.
Temporal Challenges
AI models are frequently updated, retrained, or swapped. If you discover an incident days after it occurred:
- The model version that was running at the time may no longer be available
- The system prompt may have been updated since the incident
- RAG document indices may have changed
- Tool configurations may have been modified
This makes proactive evidence preservation critical. See Evidence Preservation for procedures and best practices.
Section Overview
This section is organized into five subsections, each addressing a critical aspect of AI forensics and incident response.
| Subsection | Focus | Key Questions Answered |
|---|---|---|
| Incident Classification | Taxonomy, severity, triage, escalation | What kind of incident is this? How severe? Who needs to know? |
| Log Analysis | Inference logs, prompt logs, tool call traces | What happened? What evidence exists in the logs? |
| Model Forensics | Backdoor detection, behavior diffing, tampering | Has the model itself been compromised? |
| IR Playbooks | Step-by-step response procedures | What do I do right now for this specific incident type? |
| Evidence Preservation | Chain of custody, model snapshots, conversation data | How do I preserve evidence for investigation and legal proceedings? |
Related Topics
- Agent & Agentic Exploitation -- understanding the attacks that AI IR must respond to
- Prompt Injection & Jailbreaks -- the attack techniques that produce jailbreak incidents
- Infrastructure & Supply Chain -- supply chain compromise scenarios relevant to model forensics
- Red Team Reporting -- documenting findings from AI forensic investigations
References
- "NIST SP 800-61 Rev. 3: Computer Security Incident Handling Guide" - National Institute of Standards and Technology (2024) - Foundation IR framework adapted throughout this section
- "AI Incident Database" - Partnership on AI (2025) - Catalog of real-world AI incidents used to develop the taxonomy
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability classification relevant to incident taxonomy
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE Corporation (2025) - Attack taxonomy and technique catalog for AI systems
Why can't you verify an AI vulnerability fix with a single test?