Defense Bypass Quick Reference

intermediate4 min readUpdated 2026-03-13

Quick reference card for common AI defense mechanisms and their known bypass techniques, organized by defense type.

Input Filtering Bypasses

Defense	Mechanism	Bypass Techniques
Keyword blocklist	Regex/string matching on input	Synonyms, Unicode homoglyphs, word splitting, encoding (Base64, ROT13), foreign languages
Semantic classifier	ML model classifying input intent	Adversarial framing, indirect references, multi-turn buildup, context dilution
Token limit	Restricting input length	Compression, abbreviation, multi-turn splitting
Language detection	Blocking non-English input	Mixed-language prompts, code-switching, transliteration
Rate limiting	Throttling request frequency	Distributed requests, varied timing, session rotation

Defense	Mechanism	Bypass Techniques
Instruction hierarchy	"These rules cannot be overridden"	Hypothetical framing, completion attacks, meta-instruction confusion
Role locking	"You are X and only X"	Gradual persona shift, nested roleplay, "act as a version of yourself that..."
Topic restriction	"Only discuss X topics"	Tangential approach, "this relates to X because...", reframing as in-scope
Refusal training	RLHF/DPO safety alignment	Few-shot jailbreaking, context manipulation, reward hacking through helpfulness framing
Few-shot examples	Refusal demonstrations in prompt	Overwhelming with more examples of compliance, contradicting examples

Defense	Mechanism	Bypass Techniques
Keyword blocklist	String matching on output	Request paraphrase, translation, acronyms, reverse text, encoding
Toxicity classifier	ML model scoring output	Euphemism, academic framing, fictional context, gradual escalation
PII detector	Regex/NER for personal data	Partial information, references instead of values, encoded output
LLM judge	Secondary LLM evaluating output	Ambiguous framing, context that makes output seem benign, format tricks

Probe
Send test inputs to identify which defense layers are active. Note what gets blocked and at which stage.
Isolate
Test each layer independently. Find bypass techniques that work for each layer in isolation.
Chain
Combine bypass techniques that address different layers simultaneously. Input evasion + prompt bypass + output evasion.
Iterate
Refine the chained bypass based on failure modes. Adjust individual techniques until the full chain succeeds.

Knowledge Check

When testing a multi-layer defense (input filter + hardened prompt + output filter), what is the most effective approach?

"Jailbroken: How Does LLM Safety Training Fail?" - Wei et al. (2023) - Analysis of safety training failure modes and bypass categories
"Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - Automated adversarial suffix generation
"Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models" - Shen et al. (2024) - Taxonomy of real-world jailbreak patterns
OWASP LLM Top 10 (2025) - OWASP Foundation - Defense requirements mapped to vulnerability categories