Defense Bypass Quick Reference
Quick reference card for common AI defense mechanisms and their known bypass techniques, organized by defense type.
Input Filtering Bypasses
| Defense | Mechanism | Bypass Techniques |
|---|---|---|
| Keyword blocklist | Regex/string matching on input | Synonyms, Unicode homoglyphs, word splitting, encoding (Base64, ROT13), foreign languages |
| Semantic classifier | ML model classifying input intent | Adversarial framing, indirect references, multi-turn buildup, context dilution |
| Token limit | Restricting input length | Compression, abbreviation, multi-turn splitting |
| Language detection | Blocking non-English input | Mixed-language prompts, code-switching, transliteration |
| Rate limiting | Throttling request frequency | Distributed requests, varied timing, session rotation |
System Prompt Hardening Bypasses
| Defense | Mechanism | Bypass Techniques |
|---|---|---|
| Instruction hierarchy | "These rules cannot be overridden" | Hypothetical framing, completion attacks, meta-instruction confusion |
| Role locking | "You are X and only X" | Gradual persona shift, nested roleplay, "act as a version of yourself that..." |
| Topic restriction | "Only discuss X topics" | Tangential approach, "this relates to X because...", reframing as in-scope |
| Refusal training | RLHF/DPO safety alignment | Few-shot jailbreaking, context manipulation, reward hacking through helpfulness framing |
| Few-shot examples | Refusal demonstrations in prompt | Overwhelming with more examples of compliance, contradicting examples |
Output Filtering Bypasses
| Defense | Mechanism | Bypass Techniques |
|---|---|---|
| Keyword blocklist | String matching on output | Request paraphrase, translation, acronyms, reverse text, encoding |
| Toxicity classifier | ML model scoring output | Euphemism, academic framing, fictional context, gradual escalation |
| PII detector | Regex/NER for personal data | Partial information, references instead of values, encoded output |
| LLM judge | Secondary LLM evaluating output | Ambiguous framing, context that makes output seem benign, format tricks |
Multi-Layer Defense Bypass Strategy
Probe
Send test inputs to identify which defense layers are active. Note what gets blocked and at which stage.
Isolate
Test each layer independently. Find bypass techniques that work for each layer in isolation.
Chain
Combine bypass techniques that address different layers simultaneously. Input evasion + prompt bypass + output evasion.
Iterate
Refine the chained bypass based on failure modes. Adjust individual techniques until the full chain succeeds.
When testing a multi-layer defense (input filter + hardened prompt + output filter), what is the most effective approach?
Related Topics
- Defense Evasion - In-depth coverage of evasion techniques
- Guardrails Architecture - Understanding the defenses you bypass
- Prompt Injection Quick Reference - Injection technique patterns
- Jailbreak Techniques - Model-level safety bypass
- Lab: Systematic Defense Bypass - Hands-on practice
References
- "Jailbroken: How Does LLM Safety Training Fail?" - Wei et al. (2023) - Analysis of safety training failure modes and bypass categories
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - Automated adversarial suffix generation
- "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models" - Shen et al. (2024) - Taxonomy of real-world jailbreak patterns
- OWASP LLM Top 10 (2025) - OWASP Foundation - Defense requirements mapped to vulnerability categories