Understanding AI Defenses
Why red teamers must understand the defenses they face, categories of AI defenses, and the attacker-defender asymmetry in AI security.
Red teaming without understanding defenses is like picking locks without knowing how locks work. Every effective attack is shaped by the defenses it must bypass. This page provides the foundational mental model for thinking about AI defenses from an attacker's perspective.
Why Red Teamers Must Study Defenses
There are three practical reasons every red teamer needs deep defense knowledge:
- Target characterization -- Before crafting payloads, you need to identify what defenses are deployed. A regex filter demands different bypass techniques than an ML classifier.
- Efficient exploitation -- Understanding defense architecture reveals which layers to target and which to avoid triggering.
- Credible reporting -- Clients expect remediation recommendations alongside findings. You cannot recommend fixes for systems you do not understand.
Defense Categories Overview
AI defenses operate at different stages of the request-response pipeline. Understanding where each defense sits tells you when and how it can be bypassed.
| Category | Where It Operates | What It Does | Example |
|---|---|---|---|
| Alignment training | Model weights | Teaches the model to refuse harmful requests | RLHF, Constitutional AI, DPO |
| Input filtering | Before the model | Blocks or modifies dangerous inputs | Regex rules, ML classifiers, prompt shields |
| System prompt design | Prompt layer | Constrains model behavior via instructions | Role definitions, boundary instructions |
| Output filtering | After the model | Blocks or modifies dangerous outputs | Content classifiers, keyword blocklists |
| Monitoring | Observability layer | Detects anomalous patterns over time | Token anomaly detection, behavioral drift alerts |
| Architecture controls | Infrastructure | Limits what the model can do | Sandboxing, rate limiting, tool approval gates |
The Attacker-Defender Asymmetry
AI security has a fundamental asymmetry that favors attackers, and understanding it shapes realistic threat models.
Why Attackers Have the Advantage
Defenders must cover every path; attackers only need one. A guardrail system might block 99.9% of malicious inputs, but if the attacker finds one bypass, the defense has failed for that interaction.
Natural language is unbounded. Unlike traditional software where inputs have defined types and ranges, LLM inputs are free-form text. The space of possible attacks is effectively infinite, making exhaustive filtering impossible.
Models are probabilistic. The same input can produce different outputs across runs. A defense that blocks an attack 95% of the time still fails 1 in 20 attempts -- and attackers can retry cheaply.
Defenses degrade model utility. Every filter and constraint risks blocking legitimate use cases. Defenders face constant pressure to reduce false positives, which creates gaps attackers exploit.
Where Defenders Have the Advantage
The asymmetry is not entirely one-sided:
- Defenders see all traffic -- they can detect patterns across thousands of requests, not just individual ones
- Defenders control the architecture -- they choose which models, tools, and permissions are available
- Attackers face detection risk -- automated monitoring can flag and block repeat offenders
- Defense layers compound -- even imperfect layers in combination create exponentially harder bypass challenges
Defense vs. Safety: A Critical Distinction
Two different but related concepts frequently get conflated:
| Concept | Meaning | Red Team Implication |
|---|---|---|
| Safety (alignment) | The model's trained tendency to refuse harmful requests | Bypassed via jailbreaks, activation steering, fine-tuning |
| Defense (guardrails) | External systems that filter, monitor, or constrain the model | Bypassed via evasion, encoding, architecture exploitation |
A model can be well-aligned but poorly defended (no input filtering, no monitoring). Or heavily defended but poorly aligned (strong guardrails masking a model that eagerly complies with harmful requests when guardrails are bypassed).
Mapping Defenses to Attack Phases
| Attack Phase | Relevant Defenses | What to Test |
|---|---|---|
| Reconnaissance | Rate limiting, request logging | Can you enumerate system behavior without triggering alerts? |
| Input crafting | Input filters, prompt shields | Does the input reach the model unmodified? |
| Prompt injection | System prompt hardening, instruction hierarchy | Can you override the system prompt? |
| Jailbreaking | Alignment training, safety fine-tuning | Can you get the model to comply with restricted requests? |
| Data extraction | Output filters, PII detection | Can sensitive data pass through output filters? |
| Tool exploitation | Sandboxing, approval gates, permission scoping | Can you access unintended tools or escalate privileges? |
| Persistence | Session management, monitoring | Can you maintain access across sessions without detection? |
Where to Go Next
This overview provides the map. The following pages fill in the details:
- The AI Defense Landscape -- deep dive into each defense category, tools, and market overview
- Thinking Like a Defender -- mental models and risk frameworks that make you a better attacker
- Guardrails & Safety Layer Architecture -- how guardrail systems are designed and where they break
Related Topics
- The AI Defense Landscape - Comprehensive overview of defense tools, vendors, and deployment patterns
- Thinking Like a Defender - Mental models and risk frameworks for understanding defensive priorities
- Guardrails & Safety Layer Architecture - How guardrail systems are designed and where they break
- AI Threat Models - Access levels and threat modeling frameworks for AI systems
References
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard classification of security risks for LLM applications, used as a reference taxonomy for defense categories
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for managing AI risks including the attacker-defender asymmetry in AI security
- "Lessons Learned from AI Red Teaming" - Microsoft (2024) - Practical insights on the relationship between defense posture and red team findings
- "Securing LLM-Integrated Applications" - Microsoft Security (2024) - Guidance on defense layers and the distinction between alignment-based and runtime defenses
Why does the attacker-defender asymmetry generally favor attackers in AI security?