AI Defense Taxonomy
A structured categorization of AI defense approaches organized by layer, method, and effectiveness, providing a framework for evaluating defense strategies and identifying coverage gaps.
The AI defense taxonomy organizes the full spectrum of defensive approaches into a structured framework. Rather than treating each defense as an isolated technique, the taxonomy reveals how defenses interact, where gaps exist, and which combinations provide meaningful protection against different attack classes.
Taxonomy Overview
AI defenses operate at six distinct layers, from the outermost (governance and policy) to the innermost (model architecture):
Defense Layers (outer to inner):
Layer 6: Governance & Policy
└── Responsible use policies, legal frameworks, incident response
Layer 5: Application
└── Rate limiting, access control, audit logging, API design
Layer 4: Output
└── Content filtering, PII detection, response validation
Layer 3: Inference
└── Input sanitization, prompt shields, instruction hierarchy
Layer 2: Training
└── Safety alignment, adversarial training, data curation
Layer 1: Architecture
└── Model design, capability restrictions, isolation boundaries
Layer 1: Architectural Defenses
Defenses built into the fundamental system design.
Capability Restrictions
| Defense | Description | Effectiveness |
|---|---|---|
| Tool allowlisting | Explicitly enumerate permitted tool calls | High against tool abuse |
| Sandboxed execution | Run agent actions in isolated environments | High against system compromise |
| Capability separation | Separate read/write/execute into different models | Medium-high against privilege escalation |
| Context isolation | Prevent cross-tenant data access at architecture level | High against data leakage |
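Tool allowlisting is the most concrete of these restrictions, and its core logic fits in a few lines: enumerate the permitted tools, validate their arguments, and deny everything else. The sketch below is illustrative; the tool names and validator shapes are assumptions, not a real API.

```python
# Hypothetical tool-allowlisting sketch: the agent may only invoke tools
# that are explicitly enumerated, with per-tool argument validation.
from typing import Any, Callable

ALLOWED_TOOLS: dict[str, Callable[[dict[str, Any]], bool]] = {
    # tool name -> validator for its arguments (both names are illustrative)
    "search_docs": lambda args: isinstance(args.get("query"), str),
    "get_weather": lambda args: isinstance(args.get("city"), str),
}

def authorize_tool_call(name: str, args: dict[str, Any]) -> bool:
    """Deny by default: unknown tools and malformed arguments are rejected."""
    validator = ALLOWED_TOOLS.get(name)
    return validator is not None and validator(args)
```

The deny-by-default posture is the point: a tool the defender never thought about is rejected automatically, which is why this ranks high against tool abuse.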
Model Design Choices
| Defense | Description | Effectiveness |
|---|---|---|
| Smaller models for sensitive tasks | Use specialized, smaller models with fewer capabilities | Medium against broad attacks |
| Separate safety classifier | Dedicated model for safety evaluation independent of generation | Medium-high for known attack types |
| Dual-model verification | Two independent models must agree before executing actions | High but expensive |
| Retrieval separation | Separate the retrieval and generation stages with security boundaries | Medium against RAG poisoning |
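Dual-model verification can be sketched as a fail-closed gate: an action executes only when two independent safety judges both approve it. The judge functions below are stand-ins for real model calls, which is also where the "expensive" caveat comes from, since each check doubles inference cost.

```python
# Illustrative dual-model verification: require unanimous approval from
# two independent judges before an action may execute.
from typing import Callable

def dual_verify(action: str,
                judge_a: Callable[[str], bool],
                judge_b: Callable[[str], bool]) -> bool:
    """Fail closed: either judge vetoing the action blocks it."""
    return judge_a(action) and judge_b(action)
```

Independence between the two judges (different base models, different prompts) is what makes this high-effectiveness; two copies of the same model share the same blind spots.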
Layer 2: Training-Time Defenses
Defenses applied during model training to build inherent robustness.
Safety Alignment Methods
| Method | Description | Strengths | Weaknesses |
|---|---|---|---|
| RLHF | Train a reward model on human preferences for safety | Well-studied, effective for common cases | Reward hacking, distribution shift |
| DPO | Direct preference optimization without reward model | Simpler, fewer failure modes | Less flexible than RLHF |
| Constitutional AI | Model self-evaluates against principles | Scalable, consistent | Depends on principle completeness |
| Red team data augmentation | Include known attacks in training | Directly addresses known threats | Cannot cover novel attacks |
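Of these methods, DPO has the simplest objective to state concretely. For one preference pair it reduces to a logistic loss on the margin between how much the policy and the reference model prefer the chosen response; the sketch below shows that per-pair loss with scalar log-probabilities, which is a simplification of the batched tensor form used in practice.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    Loss shrinks as the policy prefers the chosen response more
    strongly than the reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (zero margin) the loss is log 2; a positive margin drives it toward zero. This is why DPO needs no separate reward model: the preference signal lives entirely in the log-probability margin.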
Data-Level Defenses
| Method | Description | Effectiveness |
|---|---|---|
| Data curation | Careful selection and filtering of training data | Essential foundation |
| Deduplication | Remove duplicate and near-duplicate training samples | Reduces memorization risk |
| Differential privacy | Add noise during training to limit individual sample influence | Provable guarantees but utility cost |
| Watermark detection | Detect and filter AI-generated content from training data | Moderate, evolving arms race |
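Deduplication at its simplest is exact matching after light normalization, as sketched below. Production pipelines typically add near-duplicate detection (e.g. MinHash over shingles), but even the exact form reduces the memorization risk that duplicated samples create.

```python
import hashlib

def dedupe(samples: list[str]) -> list[str]:
    """Exact deduplication after normalization (lowercasing and
    whitespace collapsing); keeps the first occurrence of each text."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in samples:
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```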
Layer 3: Inference-Time Defenses
Defenses that operate during model inference, between input and output.
Input Processing
| Defense | Description | Against |
|---|---|---|
| Instruction hierarchy | Enforce system > user > context priority | Prompt injection |
| Prompt shields | ML classifier that detects injection attempts | Prompt injection |
| Input sanitization | Remove special characters, normalize encoding | Tokenizer attacks |
| Perplexity filtering | Reject inputs with anomalously high perplexity | Adversarial suffixes |
| Input length limits | Restrict maximum input length per source | Context window attacks |
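Perplexity filtering is straightforward once a reference language model supplies per-token log-probabilities: compute perplexity and reject inputs above a threshold. The sketch below assumes those log-probabilities are already available; the threshold value is an arbitrary placeholder that would be tuned on benign traffic.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the input tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_perplexity_filter(token_logprobs: list[float],
                             threshold: float = 1000.0) -> bool:
    """Reject inputs whose perplexity under a reference LM is anomalously
    high, a signature of gibberish adversarial suffixes."""
    return perplexity(token_logprobs) <= threshold
```

The known limitation follows directly from the formula: an adversarial suffix rewritten as fluent natural language keeps per-token log-probabilities high and sails under the threshold, which is why this defense is scoped to gibberish-style suffixes.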
Inference Modification
| Defense | Description | Against |
|---|---|---|
| SmoothLLM | Random input perturbation for robustness | Adversarial suffixes |
| Activation monitoring | Monitor hidden state activations for anomalies | Activation steering |
| Attention pattern checks | Verify normal attention distribution | Context manipulation |
| Temperature control | Restrict sampling parameters | Output manipulation |
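The SmoothLLM idea can be sketched concretely: perturb random characters in several copies of the prompt, query the model on each, and take a majority vote. Because adversarial suffixes are brittle to character-level noise, perturbed copies tend to trigger refusals even when the original did not. The `model_refuses` callable below is a stand-in for a real model query.

```python
import random
from typing import Callable

def smoothllm_is_refused(prompt: str,
                         model_refuses: Callable[[str], bool],
                         n_copies: int = 5,
                         swap_frac: float = 0.1,
                         seed: int = 0) -> bool:
    """SmoothLLM-style sketch: randomly swap a fraction of characters in
    n_copies perturbed prompts, then majority-vote the refusal decisions."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_copies):
        chars = list(prompt)
        n_swaps = max(1, int(len(chars) * swap_frac))
        for i in rng.sample(range(len(chars)), n_swaps):
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
        if model_refuses("".join(chars)):
            votes += 1
    return votes > n_copies // 2
```

The trade-off is visible in the parameters: more copies and heavier perturbation improve robustness but multiply inference cost and degrade benign inputs.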
Layer 4: Output Defenses
Defenses that process model outputs before delivery.
| Defense | Description | Against |
|---|---|---|
| Content safety classifier | ML model that classifies output safety | Harmful content generation |
| PII detection & redaction | Scan outputs for personal information | Data exfiltration |
| URL/domain allowlisting | Only permit references to approved domains | Phishing via AI |
| Response consistency checks | Verify output aligns with expected behavior | Anomalous behavior |
| Output format validation | Ensure responses match expected structure | Format manipulation |
| Watermarking | Embed detectable signals in outputs | Provenance tracking |
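PII detection and redaction is often pattern-based at its core. The sketch below covers two common patterns only; real deployments layer trained NER models on top of many more pattern classes (phone numbers, credit cards, addresses), so treat this as a minimal illustration.

```python
import re

# Minimal PII redaction sketch: two illustrative pattern classes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```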
Layer 5: Application Defenses
Defenses at the application layer surrounding the AI model.
| Defense | Description | Against |
|---|---|---|
| Rate limiting | Restrict request volume per user/session | Automated attacks, extraction |
| Authentication & authorization | Verify user identity and permissions | Unauthorized access |
| Audit logging | Record all interactions for review | Post-incident analysis |
| Tool call approval | Require human approval for sensitive actions | Tool abuse |
| Session management | Limit conversation length, enforce resets | Context accumulation attacks |
| A/B testing for safety | Compare model versions for safety regression | Deployment safety |
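Rate limiting is commonly implemented as a token bucket, sketched below: each user gets a bucket of `capacity` requests that refills at `refill_rate` tokens per second, which allows short bursts while capping sustained throughput, the pattern that defeats automated extraction.

```python
class TokenBucket:
    """Per-user token-bucket rate limiter. Time is passed in explicitly
    so the logic stays deterministic and testable."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For example, a bucket with capacity 2 allows two immediate requests, rejects a third, and admits another after a second of refill.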
Layer 6: Governance & Policy
Non-technical defenses that frame the overall security posture.
| Defense | Description | Against |
|---|---|---|
| Responsible use policy | Define acceptable use and consequences | Misuse by authorized users |
| Incident response plan | Procedures for handling safety incidents | All attack types (response) |
| Bug bounty / red teaming | Incentivize external security testing | Unknown vulnerabilities |
| Model cards & documentation | Document model capabilities and limitations | Misunderstanding of capabilities |
| Regulatory compliance | Align with AI safety regulations | Legal and compliance risk |
Defense Effectiveness Matrix
Mapping defenses to attack types reveals coverage and gaps:
| Attack Type | Most Effective Defenses | Limited Defenses | Ineffective Defenses |
|---|---|---|---|
| Direct prompt injection | Instruction hierarchy, prompt shields | Output filtering | Rate limiting |
| Indirect prompt injection | Input sanitization per source, context isolation | Content classifiers | Authentication |
| Adversarial suffixes | Perplexity filtering, SmoothLLM | Output filtering | Input length limits |
| Semantic injection | Intent classifiers, dual-model verification | Keyword filters | All syntactic defenses |
| Data poisoning | Data curation, differential privacy | Model monitoring | Output filtering |
| Model extraction | Rate limiting, watermarking | API design | Input sanitization |
| Tool abuse | Tool allowlisting, approval workflows | Output filtering | Prompt shields |
| RAG poisoning | Content verification, source authentication | Output filtering | Rate limiting |
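A matrix like this lends itself to an automated coverage check: given the defenses a deployment actually has, list the attack types for which none of the "most effective" defenses is in place. The sketch below encodes an abridged subset of the table above; the string labels are informal keys, not a standard vocabulary.

```python
# Abridged attack -> most-effective-defenses mapping, from the matrix above.
MOST_EFFECTIVE: dict[str, set[str]] = {
    "direct prompt injection": {"instruction hierarchy", "prompt shields"},
    "adversarial suffixes": {"perplexity filtering", "smoothllm"},
    "tool abuse": {"tool allowlisting", "approval workflows"},
    "model extraction": {"rate limiting", "watermarking"},
}

def coverage_gaps(deployed: set[str]) -> list[str]:
    """Attack types with no top-tier defense among those deployed."""
    return sorted(attack for attack, best in MOST_EFFECTIVE.items()
                  if not best & deployed)
```

Running this against a typical Level 2 deployment (rate limiting plus prompt shields) immediately surfaces adversarial suffixes and tool abuse as uncovered, which is exactly the kind of gap the matrix is meant to reveal.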
Defense Maturity Model
Organizations can assess their defense maturity across these levels:
Level 1: Ad Hoc (No systematic defense)
No formal AI security controls. Model deployed with default safety training only. Reactive response to incidents.
Level 2: Basic (Input/output filtering)
Content safety classifiers on input and output. Basic rate limiting. Some logging in place.
Level 3: Structured (Multi-layer defense)
Systematic defenses at multiple layers. Instruction hierarchy enforced. Tool permissions defined. Regular red team testing.
Level 4: Managed (Measured and monitored)
Defense effectiveness measured quantitatively. Continuous monitoring with alerting. Automated response to detected attacks. Regular defense evaluation against new attack techniques.
Level 5: Optimized (Adaptive and anticipatory)
Defenses adapt to emerging threats. Proactive red teaming of novel attack classes. Defense-in-depth with no single points of failure. Continuous improvement based on threat intelligence.
Selecting Defense Combinations
No single defense is sufficient. Effective protection requires selecting complementary defenses:
Minimum Viable Defense Stack
For any production AI deployment:
- Input: Instruction hierarchy + basic input validation
- Output: Content safety classifier + PII detection
- Application: Rate limiting + audit logging
- Governance: Incident response plan + responsible use policy
Enhanced Defense Stack
For high-risk deployments (financial, healthcare, government):
All of the above, plus:
- Architecture: Tool allowlisting + sandboxed execution + context isolation
- Training: Adversarial training + red team data augmentation
- Inference: Prompt shields + SmoothLLM + activation monitoring
- Application: Tool call approval workflows + session limits
- Governance: Regular red team assessments + bug bounty
Related Topics
- Defense Landscape — Broader defense context and evolution
- Layered Defense Strategy — Implementing defense in depth
- Defense Evaluation — Measuring defense effectiveness
- Defense Economics — Cost-benefit analysis of defenses
Self-Check
A company deploys an AI chatbot with RLHF safety training and a content safety classifier on outputs. Which attack type are they LEAST protected against?