Jailbreak Incident Response Playbook
Step-by-step playbook for responding to a production jailbreak: detection verification, containment strategies, investigation procedures, remediation steps, and post-mortem framework.
This playbook provides step-by-step procedures for responding to a confirmed or suspected jailbreak in a production AI system. A jailbreak occurs when a user causes the model to bypass its safety training or system prompt restrictions, producing outputs that violate its intended behavioral constraints.
Trigger Criteria
Activate this playbook when any of the following occur:
- Safety classifier flags model output as violating content policies
- User reports that the model produced content it should not have
- Automated monitoring detects a jailbreak pattern in input logs
- Internal testing discovers a reproducible jailbreak technique
- External disclosure of a jailbreak affecting your model or system prompt
Immediate Actions (First 30 Minutes)
Acknowledge and assign
Record incident ID, detection time (UTC), and source. Assign Incident Commander and AI Investigator roles.
Incident ID: AI-IR-[YYYY]-[NNNN]
Detected: [UTC timestamp]
Source: [classifier alert / user report / internal testing / external disclosure]
IC: [Name]
AI Investigator: [Name]
Preserve evidence
Capture all volatile evidence before taking any other action:
- Full conversation history where jailbreak occurred (all turns, including system prompt)
- Current system prompt version (hash and full text)
- Model version and inference parameters (temperature, top_p, etc.)
- Safety classifier output for the flagged interaction
- Any tool call records if the model has agent capabilities
- RAG retrieval logs if the system uses retrieval augmentation
- User identity and session metadata
Store evidence in the incident evidence repository with the incident ID.
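As a minimal sketch of the evidence-capture step, the following bundles the volatile items above into one record and hashes it so later tampering is detectable. The function name, field names, and argument shapes are illustrative assumptions, not a prescribed schema; adapt them to your own logging stack and evidence repository.

```python
import hashlib
import json
from datetime import datetime, timezone

def capture_evidence(incident_id, system_prompt, conversation, model_info):
    """Bundle volatile jailbreak evidence into a single record.

    All field names are illustrative; map them onto your real schema.
    """
    record = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Hash AND full text of the system prompt, per the checklist above.
        "system_prompt_sha256": hashlib.sha256(
            system_prompt.encode("utf-8")
        ).hexdigest(),
        "system_prompt_text": system_prompt,
        "conversation": conversation,   # all turns, including system prompt
        "model_info": model_info,       # model version, temperature, top_p, ...
    }
    # Serialize deterministically so the bundle itself can be hashed
    # and re-verified when the evidence is later pulled for review.
    blob = json.dumps(record, sort_keys=True)
    record_hash = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    return record, record_hash
```

The returned `record_hash` should be logged alongside the incident ID so the integrity of the stored bundle can be checked during the post-mortem.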
Assess scope
Determine whether the jailbreak is isolated or systemic:
| Question | How to Determine | Implication |
|---|---|---|
| Can any user reproduce it? | Test with a fresh session, different user account | Systemic if yes |
| Does it require specific conversation history? | Test the payload without prior context | Isolated if yes |
| Is the vulnerability in the system prompt? | Review the system prompt for the exploited weakness | Systemic if yes |
| Is the vulnerability in the base model? | Test with a minimal system prompt | Systemic and harder to fix |
| Are there multiple jailbreak variants? | Search logs for similar patterns | Broader vulnerability |
Implement initial containment
Based on scope assessment:
| Scope | Containment Action |
|---|---|
| Isolated (single session) | Terminate the session; add input filter for the specific payload |
| Systemic (system prompt flaw) | Deploy emergency system prompt patch; add input filter |
| Systemic (model vulnerability) | Consider switching to fallback model; add output classifier |
| Active exploitation by multiple users | Add aggressive input/output filtering; consider endpoint shutdown |
Notify stakeholders
Based on severity:
| Severity | Notify |
|---|---|
| Low (isolated, no harmful output) | Team lead, log for tracking |
| Medium (systemic but limited impact) | Team lead, product owner |
| High (harmful content generated) | Management, legal, compliance |
| Critical (public safety risk, data breach) | Executive team, legal, PR, regulatory contacts |
Investigation (Hours 1-4)
Log Analysis
Reconstruct the attack chain
Using Prompt Log Forensics techniques, reconstruct the full attack:
- Identify each phase: reconnaissance, context setting, boundary testing, payload delivery, exploitation
- Classify the jailbreak technique (direct, multi-turn, persona hijack, encoding bypass, etc.)
- Determine the exact turn where the model's defenses failed
Scope the damage
Review all model outputs after the jailbreak to determine:
- What content was generated that violates policies?
- Was any sensitive data disclosed?
- Did the model take any actions (tool calls) while in a jailbroken state?
- Were other users affected by the same technique?
Search for related activity
Query logs for similar attack patterns:
```sql
-- Search for similar jailbreak patterns across all sessions
SELECT
    session_id,
    user_id,
    timestamp,
    substring(content, 1, 200) AS content_preview
FROM prompt_logs
WHERE (content ILIKE '%ignore previous%'
    OR content ILIKE '%you are now%'
    OR content ILIKE '%new instructions%'
    OR content ILIKE '%[specific payload pattern]%')
  AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
```
Identify root cause
Determine why the jailbreak succeeded:
| Root Cause Category | Indicators | Fix Layer |
|---|---|---|
| System prompt weakness | Prompt lacks explicit refusal instructions for this attack type | Application |
| Missing input filter | No filter for this attack pattern | Application |
| Missing output classifier | No classifier or classifier did not flag the output | Application |
| Model safety gap | Base model does not refuse this type of request | Model |
| Context window exploitation | Attack relied on filling the context window to push out instructions | Architecture |
Containment and Remediation
Short-Term Fixes (Deploy Within Hours)
| Fix | Implementation | Coverage |
|---|---|---|
| Input filter | Add regex or classifier-based filter for the specific attack pattern | Blocks this specific payload; attacker can adapt |
| Output classifier | Add or update output classifier to catch this output category | Catches outputs regardless of input technique |
| System prompt hardening | Add explicit instructions addressing the exploited weakness | Addresses the root cause at the application layer |
| Rate limiting | Reduce request rate for suspicious patterns | Slows automated exploitation |
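The input-filter row above can be sketched as a small pattern matcher. The patterns shown are placeholders distilled from the same strings the log-search query in this playbook looks for; a real deployment would load the list from configuration, pair regexes with a classifier, and treat a hit as "hold for review" rather than a silent block.

```python
import re

# Placeholder patterns for the attack under investigation (assumption:
# your real list is loaded from config and updated per incident).
JAILBREAK_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+", re.IGNORECASE),
    re.compile(r"new\s+instructions\s*:", re.IGNORECASE),
]

def input_filter(user_message: str) -> bool:
    """Return True if the message matches a known jailbreak pattern
    and should be blocked or routed to review."""
    return any(p.search(user_message) for p in JAILBREAK_PATTERNS)
```

As the coverage column notes, this blocks only the specific payload family: an attacker can rephrase past any regex, which is why the output classifier and system prompt hardening rows exist alongside it.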
Long-Term Fixes (Deploy Within Days-Weeks)
| Fix | Implementation | Coverage |
|---|---|---|
| Safety fine-tuning | Fine-tune the model with examples that address this weakness | Addresses model-level vulnerability |
| Comprehensive prompt review | Audit the entire system prompt for similar weaknesses | Prevents related attack variants |
| Defense-in-depth | Layer input filters, output classifiers, and system prompt hardening | Ensures no single bypass defeats all defenses |
| Jailbreak evaluation suite | Add this technique to automated testing | Catches regressions in future updates |
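The evaluation-suite row above can be sketched as a replay harness that reruns known payloads on every model or prompt update. `model_fn` (a prompt-to-response callable) and `is_violation` (an output safety classifier) are assumptions standing in for your real inference endpoint and classifier; the 50-trial, 5% threshold mirrors the Verification Procedure later in this playbook.

```python
def jailbreak_regression(model_fn, payloads, is_violation,
                         max_rate=0.05, trials=50):
    """Replay known jailbreak payloads and flag regressions.

    model_fn:     callable prompt -> response (hypothetical endpoint wrapper)
    is_violation: callable response -> bool (hypothetical output classifier)
    Returns a dict of {payload: observed success rate} for any payload
    whose jailbreak success rate exceeds max_rate; empty means no regression.
    """
    failures = {}
    for payload in payloads:
        hits = sum(is_violation(model_fn(payload)) for _ in range(trials))
        rate = hits / trials
        if rate > max_rate:
            failures[payload] = rate
    return failures
```

Wiring this into CI on every system prompt or model change is what turns a one-off incident fix into regression coverage.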
Verification
Verification Procedure
| Step | Action | Success Criteria |
|---|---|---|
| 1 | Run exact original payload 50 times | Success rate < 5% (was: [original rate]) |
| 2 | Run 10 minor variations of the payload | Success rate < 5% each |
| 3 | Run 10 paraphrased versions of the payload | Success rate < 5% each |
| 4 | Test in multi-turn context (if original was multi-turn) | Success rate < 5% |
| 5 | Verify no regression on legitimate use cases | No increase in false refusals |
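The steps above reduce to computing a success rate over repeated attempts, and a point estimate alone can mislead at small sample sizes. A minimal sketch, assuming each replay attempt is recorded as a boolean (True = jailbreak succeeded): it reports the observed rate plus a rough 95% upper bound (the rule of three when zero successes are observed, a normal approximation otherwise; an exact binomial interval would be stricter in production).

```python
import math

def verify_fix(results, threshold=0.05):
    """Summarize fix-verification replays.

    results: list of booleans, True where the jailbreak succeeded.
    Returns the observed rate, an approximate 95% upper bound on the
    true rate, and whether the observed rate clears the threshold.
    """
    n = len(results)
    successes = sum(results)
    rate = successes / n
    if successes == 0:
        # Zero observed successes: "rule of three" upper bound ~ 3/n.
        upper95 = 3.0 / n
    else:
        se = math.sqrt(rate * (1 - rate) / n)
        upper95 = rate + 1.96 * se
    return {"attempts": n, "successes": successes, "rate": rate,
            "upper95": upper95, "pass": rate < threshold}
```

Note that 50 clean attempts still leave an upper bound of about 6% on the true success rate, which is why the variation and paraphrase steps matter as much as raw repetition.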
Verification Results
**Original payload:** [success rate] over [N] attempts (was [original rate])
**Variations:** [summary of variation testing results]
**Paraphrases:** [summary]
**Multi-turn:** [summary]
**False refusal rate:** [rate] (baseline: [rate])
**Conclusion:** [Fix effective / Fix insufficient / Partial mitigation]
Communication Templates
Internal Notification (Initial)
Subject: [AI-IR-YYYY-NNNN] Jailbreak incident - [severity] - [product]
Status: [Active investigation / Contained / Resolved]
Summary: A jailbreak [was reported / was detected] in [product name]
at [time]. The model [description of what it produced]. The vulnerability
appears to be [isolated/systemic] and affects [scope].
Current actions:
- Evidence preserved: [Yes/No]
- Containment in place: [description]
- Investigation status: [status]
Impact: [description of impact or potential impact]
Next update: [time]
Post-Mortem Summary
Subject: [AI-IR-YYYY-NNNN] Post-mortem summary
Timeline: [detection time] to [resolution time] ([duration])
What happened: [2-3 sentence summary]
Root cause: [description]
Impact: [what was affected, what content was generated]
Fix: [what was deployed, when]
Verification: [statistical verification results]
Lessons learned:
1. [lesson]
2. [lesson]
Action items:
- [ ] [action item with owner and deadline]
Post-Mortem Checklist
| # | Item | Status |
|---|---|---|
| 1 | Timeline documented from detection to resolution | |
| 2 | Root cause identified and confirmed | |
| 3 | All affected users/sessions identified | |
| 4 | Jailbreak technique classified against taxonomy | |
| 5 | Fix verified statistically (50+ attempts) | |
| 6 | Attack pattern added to monitoring rules | |
| 7 | Jailbreak technique added to evaluation suite | |
| 8 | System prompt reviewed for similar weaknesses | |
| 9 | Transferability tested on other model endpoints | |
| 10 | Post-mortem document published to team | |
Related Topics
- Incident Classification -- classifying the jailbreak type
- Prompt Log Forensics -- detailed prompt investigation techniques
- Prompt Injection & Jailbreaks -- understanding jailbreak techniques
- Evidence Preservation -- preserving conversation evidence
References
- "OWASP Top 10 for LLM Applications: LLM01 - Prompt Injection" - OWASP Foundation (2025) - Jailbreak vulnerability classification
- "Jailbreaking Leading Safety-Aligned LLMs" - arXiv (2025) - Current jailbreak techniques and effectiveness
- "AI Red Team Playbooks" - Microsoft Security (2025) - AI-specific incident response procedures
Self-Check
A jailbreak was fixed by updating the system prompt. You test the original payload once and it fails. Is the fix verified?
No. A single clean attempt proves very little. Per the Verification Procedure, run the exact payload 50+ times, then variations, paraphrases, and multi-turn forms, and require a success rate below 5% on each with no increase in false refusals on legitimate traffic.