Defense-in-Depth for LLM Apps
A layered defense strategy for AI applications covering the network, application, model, and output layers: how each layer contributes, and why single-layer defense fails.
Defense-in-depth is the principle that no single security control should be trusted alone. For LLM applications, this means building multiple independent defense layers so that when one layer fails -- and it will -- others catch the threat.
The Four Defense Layers
Layer 1: Network and Infrastructure
This layer operates before any AI-specific processing occurs. It handles authentication, authorization, rate limiting, and transport security.
| Control | Purpose | Attack It Mitigates |
|---|---|---|
| Authentication | Verify user identity | Anonymous abuse |
| Rate limiting | Cap requests per user/time | Automated attacks, DoS |
| API key rotation | Limit exposure from leaked keys | Credential theft |
| TLS/mTLS | Encrypt transport | Traffic interception |
| IP allowlisting | Restrict access by network | Unauthorized access |
| WAF rules | Block known attack patterns | Common web attacks |
```python
# Rate limiting middleware example (in-memory sliding window)
from datetime import datetime, timedelta, timezone

class RateLimiter:
    def __init__(self, max_requests: int = 60, window_seconds: int = 60):
        self.max_requests = max_requests
        self.window = timedelta(seconds=window_seconds)
        self.requests: dict[str, list[datetime]] = {}

    def check(self, user_id: str) -> bool:
        now = datetime.now(timezone.utc)
        user_requests = self.requests.get(user_id, [])
        # Drop entries that have aged out of the window
        user_requests = [r for r in user_requests if now - r < self.window]
        if len(user_requests) >= self.max_requests:
            return False
        user_requests.append(now)
        self.requests[user_id] = user_requests
        return True
```

What this layer cannot do: it cannot inspect content for semantic threats. A perfectly crafted jailbreak sent at 1 request per minute passes through every network-layer control.
Layer 2: Application (Input Processing)
This layer analyzes and sanitizes inputs before they reach the model.
| Control | Purpose | Attack It Mitigates |
|---|---|---|
| Input size limits | Prevent context window abuse | Token exhaustion, attention dilution |
| Prompt shield | Detect injection attempts | Direct and indirect injection |
| Content safety API | Flag harmful input content | Harmful request attempts |
| Input sanitization | Remove/escape special tokens | Delimiter escape, format mimicry |
| Session management | Track multi-turn behavior | Gradual escalation attacks |
```python
class InputProcessor:
    """Runs each input check in order; any failure blocks the request.

    PromptShield, ContentSafetyClient, ProcessResult, Session, count_tokens,
    THRESHOLD, and ESCALATION_THRESHOLD are illustrative placeholders.
    """

    def __init__(self):
        self.prompt_shield = PromptShield()
        self.content_safety = ContentSafetyClient()
        self.max_input_tokens = 4096

    def process(self, user_input: str, session: Session) -> ProcessResult:
        # Size check
        if count_tokens(user_input) > self.max_input_tokens:
            return ProcessResult.blocked("Input exceeds maximum length")
        # Prompt injection detection
        if self.prompt_shield.is_injection(user_input):
            return ProcessResult.blocked("Input flagged as injection")
        # Content safety check
        safety = self.content_safety.analyze(user_input)
        if safety.max_severity > THRESHOLD:
            return ProcessResult.blocked(f"Content safety: {safety.category}")
        # Session-level behavior analysis
        session.add_message(user_input)
        if session.escalation_score > ESCALATION_THRESHOLD:
            return ProcessResult.blocked("Behavioral escalation detected")
        return ProcessResult.allowed(user_input)
```

What this layer cannot do: it cannot prevent the model from generating harmful content in response to benign-looking inputs, and it cannot detect attacks the shield model was not trained on.
Layer 3: Model (Inference Controls)
This layer constrains the model's behavior during generation.
| Control | Purpose | Attack It Mitigates |
|---|---|---|
| System prompt hardening | Explicit behavioral constraints | Instruction override |
| Instruction hierarchy | Prioritize system over user instructions | Priority manipulation |
| Temperature limits | Reduce output randomness | Stochastic bypass |
| Token budget limits | Cap output length | Data exfiltration |
| Stop sequences | Halt generation at boundary markers | Runaway generation |
| Tool permission scoping | Restrict available tools | Tool abuse, privilege escalation |
What this layer cannot do: It relies on the model following instructions, which is exactly what prompt injection attacks subvert. Model-layer controls are necessary but never sufficient alone.
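Several of the model-layer controls above can be applied mechanically when constructing the inference request. The sketch below assumes a generic chat-completion-style API; the parameter names, the `ALLOWED_TOOLS` set, and the stop marker are illustrative, not tied to any specific provider:

```python
# Model-layer inference controls as a request builder (illustrative sketch)
ALLOWED_TOOLS = {"search_docs", "get_weather"}  # hypothetical scoped tool set

def build_request(system_prompt: str, user_msg: str,
                  requested_tools: list[str],
                  temperature: float = 0.7) -> dict:
    return {
        "messages": [
            # Instruction hierarchy: system instructions precede user input
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        "temperature": min(max(temperature, 0.0), 0.5),  # clamp randomness
        "max_tokens": 1024,                              # cap output length
        "stop": ["###END###"],                           # halt at boundary marker
        # Tool permission scoping: silently drop anything outside the allowlist
        "tools": [t for t in requested_tools if t in ALLOWED_TOOLS],
    }
```

Note that every constraint here still depends on the serving stack honoring it; the system prompt itself is only a request, which is why the layer is necessary but not sufficient.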
Layer 4: Output (Post-Processing)
This layer analyzes and sanitizes the model's output before returning it to the user.
| Control | Purpose | Attack It Mitigates |
|---|---|---|
| Content classifier | Detect harmful generated content | Jailbreak success |
| PII detection/redaction | Remove personal data from output | Data leakage |
| LLM judge | Nuanced policy evaluation | Subtle policy violations |
| Schema validation | Ensure output matches expected format | Injection via structured output |
| Citation verification | Validate referenced sources | Hallucinated citations |
| Code execution sandbox | Isolate generated code | Malicious code generation |
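As one concrete slice of this layer, PII redaction can be sketched with simple patterns. Production systems would use trained detectors; the two regexes below are illustrative stand-ins covering only email addresses and US SSN-shaped strings:

```python
import re

# Illustrative PII patterns -- real deployments need broader, tested detectors
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(model_output: str) -> str:
    """Replace detected PII spans before the output leaves the system."""
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", model_output)
    redacted = SSN_RE.sub("[REDACTED_SSN]", redacted)
    return redacted
```

Because this runs after generation, it catches leakage regardless of how the model was induced to produce it, which is exactly the independence the layered model requires.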
Why Single-Layer Defense Fails
Consider this attack progression against a system with only input filtering:
| Step | Attacker Action | Input Filter Result | Outcome |
|---|---|---|---|
| 1 | Send direct injection | Blocked | Defense holds |
| 2 | Paraphrase injection | Blocked | Defense holds |
| 3 | Use truncation bypass | Passes | No other layers to catch it |
| 4 | Model generates harmful output | No output filter | Harmful content delivered |
Now the same attack against a defense-in-depth system:
| Step | Attacker Action | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Outcome |
|---|---|---|---|---|---|---|
| 3 | Truncation bypass | Passes | Passes | System prompt may resist | Output classifier catches harmful output | Defense holds |
Even when one layer fails, the remaining layers maintain protection.
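The behavior in the tables above can be sketched as a pipeline in which every layer must independently allow the request; the individual layer checks here are hypothetical stand-ins for the real controls:

```python
# Defense-in-depth as sequential, independent checks (illustrative sketch)
def defense_pipeline(request: dict, layers: list) -> str:
    """Serve a request only if every layer allows it; report the first block."""
    for name, check in layers:
        if not check(request):
            return f"blocked at {name}"
    return "allowed"

layers = [
    # Stand-in checks -- each real layer would be far richer
    ("network", lambda r: r.get("authenticated", False)),
    ("input", lambda r: "ignore previous instructions" not in r["text"].lower()),
    ("output", lambda r: True),  # output classifier would run post-generation
]
```

The key property is that the layers share no failure mode: a bypass that fools the input check has no bearing on whether the output classifier fires.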
Defense-in-Depth Checklist
Use this checklist to evaluate an application's defense coverage:
| Layer | Control | Present? | Notes |
|---|---|---|---|
| Network | Authentication required | | |
| Network | Rate limiting configured | | |
| Network | API keys rotated regularly | | |
| Application | Input size limits enforced | | |
| Application | Prompt shield deployed | | |
| Application | Content safety API active | | |
| Application | Multi-turn tracking enabled | | |
| Model | System prompt hardened | | |
| Model | Tool permissions scoped | | |
| Model | Output length limited | | |
| Output | Content classifier active | | |
| Output | PII detection/redaction | | |
| Output | Structured output validation | | |
| Cross-cutting | Logging and monitoring | | |
| Cross-cutting | Alerting on anomalies | | |
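A checklist like this can also be tracked as data, which makes coverage easy to report. The control names below mirror a subset of the table; which controls you track, and their states, are deployment-specific assumptions:

```python
# Checklist coverage as a simple metric (illustrative control states)
checklist = {
    "network": {"authentication": True, "rate_limiting": True, "key_rotation": False},
    "application": {"size_limits": True, "prompt_shield": False},
    "output": {"content_classifier": False},
}

def coverage(checklist: dict) -> float:
    """Fraction of tracked controls that are present across all layers."""
    controls = [v for layer in checklist.values() for v in layer.values()]
    return sum(controls) / len(controls)
```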
Related Topics
- Guardrails & Safety Layer Architecture - Detailed architecture of the application and output layers
- Runtime Monitoring & Anomaly Detection - The cross-cutting monitoring layer
- Rate Limiting, Sandboxing & Execution Controls - Infrastructure-layer controls
- Red Team Findings to Remediation - Translating defense gaps into actionable findings
- AI System Architecture for Red Teamers - Component-level view of the systems these defenses protect
Check Your Understanding
An LLM application has a strong prompt shield on input but no output filtering. An attacker uses a benign-looking input that the shield allows, and the model generates harmful content. What defense-in-depth principle was violated?