What is Defense Landscape?

Comprehensive overview of AI defense categories including input filtering, output filtering, guardrails, alignment training, and monitoring -- plus the tools and vendors in each space.

What is Thinking Like a Defender?

Mental models for defensive thinking, risk assessment frameworks, defense tradeoffs, and why understanding the defender's perspective makes you a better red teamer.

What is AI Defense Taxonomy?

A comprehensive categorization of all AI defense approaches organized by layer, method, and effectiveness, providing a structured framework for evaluating defense strategies.

What is Red Team vs Blue Team Asymmetry?

Why attacking AI systems is fundamentally easier than defending them: asymmetric advantages, defender's dilemma, and strategies for closing the gap.

What is Evaluating Defense Effectiveness?

Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.

What is Layered Defense Strategy?

Implementing defense in depth for AI applications: designing independent defense layers, ensuring orthogonal coverage, and managing the complexity of multi-layer security.

What is Defense Economics?

Cost-benefit analysis of AI security investments: quantifying risk, calculating defense ROI, budget allocation strategies, and the economics of AI red teaming.

Understanding AI Defenses

beginner7 min readUpdated 2026-03-13

Why red teamers must understand the defenses they face, categories of AI defenses, and the attacker-defender asymmetry in AI security.

defenses red-teaming security-fundamentals attacker-defender-asymmetry

Red teaming without understanding defenses is like picking locks without knowing how locks work. Every effective attack is shaped by the defenses it must bypass. This page provides the foundational mental model for thinking about AI defenses from an attacker's perspective.

Why Red Teamers Must Study Defenses

There are three practical reasons every red teamer needs deep defense knowledge:

Target characterization -- Before crafting payloads, you need to identify what defenses are deployed. A regex filter demands different bypass techniques than an ML classifier.
Efficient exploitation -- Understanding defense architecture reveals which layers to target and which to avoid triggering.
Credible reporting -- Clients expect remediation recommendations alongside findings. You cannot recommend fixes for systems you do not understand.

Defense Categories Overview

AI defenses operate at different stages of the request-response pipeline. Understanding where each defense sits tells you when and how it can be bypassed.

Category	Where It Operates	What It Does	Example
Alignment training	Model weights	Teaches the model to refuse harmful requests	RLHF, Constitutional AI, DPO
Input filtering	Before the model	Blocks or modifies dangerous inputs	Regex rules, ML classifiers, prompt shields
System prompt design	Prompt layer	Constrains model behavior via instructions	Role definitions, boundary instructions
Output filtering	After the model	Blocks or modifies dangerous outputs	Content classifiers, keyword blocklists
Monitoring	Observability layer	Detects anomalous patterns over time	Token anomaly detection, behavioral drift alerts
Architecture controls	Infrastructure	Limits what the model can do	Sandboxing, rate limiting, tool approval gates

The Attacker-Defender Asymmetry

AI security has a fundamental asymmetry that favors attackers, and understanding it shapes realistic threat models.

Why Attackers Have the Advantage

Defenders must cover every path; attackers only need one. A guardrail system might block 99.9% of malicious inputs, but if the attacker finds one bypass, the defense has failed for that interaction.

Natural language is unbounded. Unlike traditional software where inputs have defined types and ranges, LLM inputs are free-form text. The space of possible attacks is effectively infinite, making exhaustive filtering impossible.

Models are probabilistic. The same input can produce different outputs across runs. A defense that blocks an attack 95% of the time still fails 1 in 20 attempts -- and attackers can retry cheaply.

Defenses degrade model utility. Every filter and constraint risks blocking legitimate use cases. Defenders face constant pressure to reduce false positives, which creates gaps attackers exploit.

Where Defenders Have the Advantage

The asymmetry is not entirely one-sided:

Defenders see all traffic -- they can detect patterns across thousands of requests, not just individual ones
Defenders control the architecture -- they choose which models, tools, and permissions are available
Attackers face detection risk -- automated monitoring can flag and block repeat offenders
Defense layers compound -- even imperfect layers in combination create exponentially harder bypass challenges

Defense vs. Safety: A Critical Distinction

Two different but related concepts frequently get conflated:

Concept	Meaning	Red Team Implication
Safety (alignment)	The model's trained tendency to refuse harmful requests	Bypassed via jailbreaks, activation steering, fine-tuning
Defense (guardrails)	External systems that filter, monitor, or constrain the model	Bypassed via evasion, encoding, architecture exploitation

A model can be well-aligned but poorly defended (no input filtering, no monitoring). Or heavily defended but poorly aligned (strong guardrails masking a model that eagerly complies with harmful requests when guardrails are bypassed).

Mapping Defenses to Attack Phases

Attack Phase	Relevant Defenses	What to Test
Reconnaissance	Rate limiting, request logging	Can you enumerate system behavior without triggering alerts?
Input crafting	Input filters, prompt shields	Does the input reach the model unmodified?
Prompt injection	System prompt hardening, instruction hierarchy	Can you override the system prompt?
Jailbreaking	Alignment training, safety fine-tuning	Can you get the model to comply with restricted requests?
Data extraction	Output filters, PII detection	Can sensitive data pass through output filters?
Tool exploitation	Sandboxing, approval gates, permission scoping	Can you access unintended tools or escalate privileges?
Persistence	Session management, monitoring	Can you maintain access across sessions without detection?

Where to Go Next

This overview provides the map. The following pages fill in the details:

The AI Defense Landscape -- deep dive into each defense category, tools, and market overview
Thinking Like a Defender -- mental models and risk frameworks that make you a better attacker
Guardrails & Safety Layer Architecture -- how guardrail systems are designed and where they break

The AI Defense Landscape - Comprehensive overview of defense tools, vendors, and deployment patterns
Thinking Like a Defender - Mental models and risk frameworks for understanding defensive priorities
Guardrails & Safety Layer Architecture - How guardrail systems are designed and where they break
AI Threat Models - Access levels and threat modeling frameworks for AI systems

References

"OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard classification of security risks for LLM applications, used as a reference taxonomy for defense categories
"NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for managing AI risks including the attacker-defender asymmetry in AI security
"Lessons Learned from AI Red Teaming" - Microsoft (2024) - Practical insights on the relationship between defense posture and red team findings
"Securing LLM-Integrated Applications" - Microsoft Security (2024) - Guidance on defense layers and the distinction between alignment-based and runtime defenses

Knowledge Check

Why does the attacker-defender asymmetry generally favor attackers in AI security?

Understanding AI Defenses

Why Red Teamers Must Study Defenses

Defense Categories Overview

The Attacker-Defender Asymmetry

Why Attackers Have the Advantage

Where Defenders Have the Advantage

Defense vs. Safety: A Critical Distinction

Mapping Defenses to Attack Phases

Where to Go Next

References

Learning Path

Understanding AI Defenses

Why Red Teamers Must Study Defenses

Defense Categories Overview

The Attacker-Defender Asymmetry

Why Attackers Have the Advantage

Where Defenders Have the Advantage

Defense vs. Safety: A Critical Distinction

Mapping Defenses to Attack Phases

Where to Go Next

References

Learning Path

Understanding AI Defenses

Learning Path

Related articles

Understanding AI Defenses

Learning Path

Related articles