Jailbreak Incident Response Playbook
Step-by-step playbook for responding to a production jailbreak: detection verification, containment strategies, investigation procedures, remediation steps, and post-mortem framework.
This playbook provides step-by-step procedures for responding to a confirmed or suspected jailbreak in a production AI system. A jailbreak occurs when a user causes the model to bypass its safety training or system prompt restrictions, producing outputs that violate its intended behavioral constraints.
Trigger Criteria
Activate this playbook when any of the following occur:
- Safety classifier flags model output as violating content policies
- User reports that the model produced content it should not have
- Automated monitoring detects a jailbreak pattern in input logs
- Internal testing discovers a reproducible jailbreak technique
- External disclosure of a jailbreak affecting your model or system prompt
Immediate Actions (First 30 Minutes)
Acknowledge and assign
Record incident ID, detection time (UTC), and source. Assign Incident Commander and AI Investigator roles.
Incident ID: AI-IR-[YYYY]-[NNNN]
Detected: [UTC timestamp]
Source: [classifier alert / user report / internal testing / external disclosure]
IC: [Name]
AI Investigator: [Name]
Preserve evidence
Capture all volatile evidence before taking any other action:
- Full conversation history where jailbreak occurred (all turns, including system prompt)
- Current system prompt version (hash and full text)
- Model version and inference parameters (temperature, top_p, etc.)
- Safety classifier output for the flagged interaction
- Any tool call records if the model has agent capabilities
- RAG retrieval logs if the system uses retrieval augmentation
- User identity and session metadata
Store evidence in the incident evidence repository with the incident ID.
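As a minimal sketch of the evidence-capture step, the following bundles the volatile items above into one record and hashes it so later tampering is detectable. The function name, field names, and argument shapes are illustrative assumptions, not a prescribed schema; adapt them to your own logging stack and evidence repository.

```python
import hashlib
import json
from datetime import datetime, timezone

def capture_evidence(incident_id, system_prompt, conversation, model_info):
    """Bundle volatile jailbreak evidence into a single record.

    All field names are illustrative; map them onto your real schema.
    """
    record = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Hash AND full text of the system prompt, per the checklist above.
        "system_prompt_sha256": hashlib.sha256(
            system_prompt.encode("utf-8")
        ).hexdigest(),
        "system_prompt_text": system_prompt,
        "conversation": conversation,   # all turns, including system prompt
        "model_info": model_info,       # model version, temperature, top_p, ...
    }
    # Serialize deterministically so the bundle itself can be hashed
    # and re-verified when the evidence is later pulled for review.
    blob = json.dumps(record, sort_keys=True)
    record_hash = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    return record, record_hash
```

The returned `record_hash` should be logged alongside the incident ID so the integrity of the stored bundle can be checked during the post-mortem.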
Assess scope
Determine whether the jailbreak is isolated or systemic:
| Question | How to Determine | Implication |
|---|---|---|
| Can any user reproduce it? | Test with a fresh session, different user account | Systemic if yes |
| Does it require specific conversation history? | Test the payload without prior context | Isolated if yes |
| Is the vulnerability in the system prompt? | Review the system prompt for the exploited weakness | Systemic if yes |
| Is the vulnerability in the base model? | Test with a minimal system prompt | Systemic and harder to fix |
| Are there multiple jailbreak variants? | Search logs for similar patterns | Broader vulnerability |
Implement initial containment
Based on scope assessment:
| Scope | Containment Action |
|---|---|
| Isolated (single session) | Terminate the session; add input filter for the specific payload |
| Systemic (system prompt flaw) | Deploy emergency system prompt patch; add input filter |
| Systemic (model vulnerability) | Consider switching to fallback model; add output classifier |
| Active exploitation by multiple users | Add aggressive input/output filtering; consider endpoint shutdown |
Notify stakeholders
Based on severity:
| Severity | Notify |
|---|---|
| Low (isolated, no harmful output) | Team lead, log for tracking |
| Medium (systemic but limited impact) | Team lead, product owner |
| High (harmful content generated) | Management, legal, compliance |
| Critical (public safety risk, data breach) | Executive team, legal, PR, regulatory contacts |
Investigation (Hours 1-4)
Log Analysis
Reconstruct the attack chain
Using Prompt Log Forensics techniques, reconstruct the full attack:
- Identify each phase: reconnaissance, context setting, boundary testing, payload delivery, exploitation
- Classify the jailbreak technique (direct, multi-turn, persona hijack, encoding bypass, etc.)
- Determine the exact turn where the model's defenses failed
Scope the damage
Review all model outputs after the jailbreak to determine:
- What content was generated that violates policies?
- Was any sensitive data disclosed?
- Did the model take any actions (tool calls) while in a jailbroken state?
- Were other users affected by the same technique?
Search for related activity
Query logs for similar attack patterns:
```sql
-- Search for similar jailbreak patterns across all sessions
SELECT
    session_id,
    user_id,
    timestamp,
    substring(content, 1, 200) AS content_preview
FROM prompt_logs
WHERE (content ILIKE '%ignore previous%'
    OR content ILIKE '%you are now%'
    OR content ILIKE '%new instructions%'
    OR content ILIKE '%[specific payload pattern]%')
  AND timestamp > NOW() - INTERVAL '7 days'
ORDER BY timestamp DESC;
```
Identify root cause
Determine why the jailbreak succeeded:
| Root Cause Category | Indicators | Fix Layer |
|---|---|---|
| System prompt weakness | Prompt lacks explicit refusal instructions for this attack type | Application |
| Missing input filter | No filter for this attack pattern | Application |
| Missing output classifier | No classifier or classifier did not flag the output | Application |
| Model safety gap | Base model does not refuse this type of request | Model |
| Context window exploitation | Attack relied on filling the context window to push out instructions | Architecture |
Containment and Remediation
Short-Term Fixes (Deploy Within Hours)
| Fix | Implementation | Coverage |
|---|---|---|
| Input filter | Add regex or classifier-based filter for the specific attack pattern | Blocks this specific payload; attacker can adapt |
| Output classifier | Add or update output classifier to catch this output category | Catches outputs regardless of input technique |
| System prompt hardening | Add explicit instructions addressing the exploited weakness | Addresses the root cause at the application layer |
| Rate limiting | Reduce request rate for suspicious patterns | Slows automated exploitation |
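The input-filter row above can be sketched as a small pattern matcher. The patterns shown are placeholders distilled from the same strings the log-search query in this playbook looks for; a real deployment would load the list from configuration, pair regexes with a classifier, and treat a hit as "hold for review" rather than a silent block.

```python
import re

# Placeholder patterns for the attack under investigation (assumption:
# your real list is loaded from config and updated per incident).
JAILBREAK_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+", re.IGNORECASE),
    re.compile(r"new\s+instructions\s*:", re.IGNORECASE),
]

def input_filter(user_message: str) -> bool:
    """Return True if the message matches a known jailbreak pattern
    and should be blocked or routed to review."""
    return any(p.search(user_message) for p in JAILBREAK_PATTERNS)
```

As the coverage column notes, this blocks only the specific payload family: an attacker can rephrase past any regex, which is why the output classifier and system prompt hardening rows exist alongside it.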
Long-Term Fixes (Deploy Within Days-Weeks)
| Fix | Implementation | Coverage |
|---|---|---|
| Safety fine-tuning | Fine-tune the model with examples that address this weakness | Addresses model-level vulnerability |
| Comprehensive prompt review | Audit the entire system prompt for similar weaknesses | Prevents related attack variants |
| Defense-in-depth | Layer input filters, output classifiers, and system prompt hardening | Ensures no single bypass defeats all defenses |
| Jailbreak evaluation suite | Add this technique to automated testing | Catches regressions in future updates |
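The evaluation-suite row above can be sketched as a replay harness that reruns known payloads on every model or prompt update. `model_fn` (a prompt-to-response callable) and `is_violation` (an output safety classifier) are assumptions standing in for your real inference endpoint and classifier; the 50-trial, 5% threshold mirrors the Verification Procedure later in this playbook.

```python
def jailbreak_regression(model_fn, payloads, is_violation,
                         max_rate=0.05, trials=50):
    """Replay known jailbreak payloads and flag regressions.

    model_fn:     callable prompt -> response (hypothetical endpoint wrapper)
    is_violation: callable response -> bool (hypothetical output classifier)
    Returns a dict of {payload: observed success rate} for any payload
    whose jailbreak success rate exceeds max_rate; empty means no regression.
    """
    failures = {}
    for payload in payloads:
        hits = sum(is_violation(model_fn(payload)) for _ in range(trials))
        rate = hits / trials
        if rate > max_rate:
            failures[payload] = rate
    return failures
```

Wiring this into CI on every system prompt or model change is what turns a one-off incident fix into regression coverage.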
Verification
Verification Procedure
| Step | Action | Success Criteria |
|---|---|---|
| 1 | Run exact original payload 50 times | Success rate < 5% (was: [original rate]) |
| 2 | Run 10 minor variations of the payload | Success rate < 5% each |
| 3 | Run 10 paraphrased versions of the payload | Success rate < 5% each |
| 4 | Test in multi-turn context (if original was multi-turn) | Success rate < 5% |
| 5 | Verify no regression on legitimate use cases | No increase in false refusals |
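The steps above reduce to computing a success rate over repeated attempts, and a point estimate alone can mislead at small sample sizes. A minimal sketch, assuming each replay attempt is recorded as a boolean (True = jailbreak succeeded): it reports the observed rate plus a rough 95% upper bound (the rule of three when zero successes are observed, a normal approximation otherwise; an exact binomial interval would be stricter in production).

```python
import math

def verify_fix(results, threshold=0.05):
    """Summarize fix-verification replays.

    results: list of booleans, True where the jailbreak succeeded.
    Returns the observed rate, an approximate 95% upper bound on the
    true rate, and whether the observed rate clears the threshold.
    """
    n = len(results)
    successes = sum(results)
    rate = successes / n
    if successes == 0:
        # Zero observed successes: "rule of three" upper bound ~ 3/n.
        upper95 = 3.0 / n
    else:
        se = math.sqrt(rate * (1 - rate) / n)
        upper95 = rate + 1.96 * se
    return {"attempts": n, "successes": successes, "rate": rate,
            "upper95": upper95, "pass": rate < threshold}
```

Note that 50 clean attempts still leave an upper bound of about 6% on the true success rate, which is why the variation and paraphrase steps matter as much as raw repetition.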
Verification Results
**Original payload:** [success rate] over [N] attempts (was [original rate])
**Variations:** [summary of variation testing results]
**Paraphrases:** [summary]
**Multi-turn:** [summary]
**False refusal rate:** [rate] (baseline: [rate])
**Conclusion:** [Fix effective / Fix insufficient / Partial mitigation]
Communication Templates
Internal Notification (Initial)
Subject: [AI-IR-YYYY-NNNN] Jailbreak incident - [severity] - [product]
Status: [Active investigation / Contained / Resolved]
Summary: A jailbreak [was reported / was detected] in [product name]
at [time]. The model [description of what it produced]. The vulnerability
appears to be [isolated/systemic] and affects [scope].
Current actions:
- Evidence preserved: [Yes/No]
- Containment in place: [description]
- Investigation status: [status]
Impact: [description of impact or potential impact]
Next update: [time]
Post-Mortem Summary
Subject: [AI-IR-YYYY-NNNN] Post-mortem summary
Timeline: [detection time] to [resolution time] ([duration])
What happened: [2-3 sentence summary]
Root cause: [description]
Impact: [what was affected, what content was generated]
Fix: [what was deployed, when]
Verification: [statistical verification results]
Lessons learned:
1. [lesson]
2. [lesson]
Action items:
- [ ] [action item with owner and deadline]
Post-Mortem Checklist
| # | Item | Status |
|---|---|---|
| 1 | Timeline documented from detection to resolution | |
| 2 | Root cause identified and confirmed | |
| 3 | All affected users/sessions identified | |
| 4 | Jailbreak technique classified against taxonomy | |
| 5 | Fix verified statistically (50+ attempts) | |
| 6 | Attack pattern added to monitoring rules | |
| 7 | Jailbreak technique added to evaluation suite | |
| 8 | System prompt reviewed for similar weaknesses | |
| 9 | Transferability tested on other model endpoints | |
| 10 | Post-mortem document published to team | |
Related Topics
- Incident Classification -- classifying the jailbreak type
- Prompt Log Forensics -- detailed prompt investigation techniques
- Prompt Injection & Jailbreaks -- understanding jailbreak techniques
- Evidence Preservation -- preserving conversation evidence
References
- "OWASP Top 10 for LLM Applications: LLM01 - Prompt Injection" - OWASP Foundation (2025) - Jailbreak vulnerability classification
- "Jailbreaking Leading Safety-Aligned LLMs" - arXiv (2025) - Current jailbreak techniques and effectiveness
- "AI Red Team Playbooks" - Microsoft Security (2025) - AI-specific incident response procedures
Self-Check
A jailbreak was fixed by updating the system prompt. You test the original payload once and it fails. Is the fix verified?
No. A single clean attempt proves very little. Per the Verification Procedure, run the exact payload 50+ times, then variations, paraphrases, and multi-turn forms, and require a success rate below 5% on each with no increase in false refusals on legitimate traffic.