AI Defense Taxonomy
A structured categorization of AI defense approaches organized by layer, method, and effectiveness, providing a framework for evaluating defense strategies and identifying coverage gaps.
The AI defense taxonomy organizes the full spectrum of defensive approaches into a structured framework. Rather than treating each defense as an isolated technique, the taxonomy reveals how defenses interact, where gaps exist, and which combinations provide meaningful protection against different attack classes.
Taxonomy Overview
AI defenses operate at six distinct layers, from the outermost (governance and policy) to the innermost (model architecture):
Defense Layers (outer to inner):
Layer 6: Governance & Policy
└── Responsible use policies, legal frameworks, incident response
Layer 5: Application
└── Rate limiting, access control, audit logging, API design
Layer 4: Output
└── Content filtering, PII detection, response validation
Layer 3: Inference
└── Input sanitization, prompt shields, instruction hierarchy
Layer 2: Training
└── Safety alignment, adversarial training, data curation
Layer 1: Architecture
└── Model design, capability restrictions, isolation boundaries
Layer 1: Architectural Defenses
Defenses built into the fundamental system design.
Capability Restrictions
| Defense | Description | Effectiveness |
|---|---|---|
| Tool allowlisting | Explicitly enumerate permitted tool calls | High against tool abuse |
| Sandboxed execution | Run agent actions in isolated environments | High against system compromise |
| Capability separation | Separate read/write/execute into different models | Medium-high against privilege escalation |
| Context isolation | Prevent cross-tenant data access at architecture level | High against data leakage |
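Tool allowlisting is the most concrete of these restrictions, and its core logic fits in a few lines: enumerate the permitted tools, validate their arguments, and deny everything else. The sketch below is illustrative; the tool names and validator shapes are assumptions, not a real API.

```python
# Hypothetical tool-allowlisting sketch: the agent may only invoke tools
# that are explicitly enumerated, with per-tool argument validation.
from typing import Any, Callable

ALLOWED_TOOLS: dict[str, Callable[[dict[str, Any]], bool]] = {
    # tool name -> validator for its arguments (both names are illustrative)
    "search_docs": lambda args: isinstance(args.get("query"), str),
    "get_weather": lambda args: isinstance(args.get("city"), str),
}

def authorize_tool_call(name: str, args: dict[str, Any]) -> bool:
    """Deny by default: unknown tools and malformed arguments are rejected."""
    validator = ALLOWED_TOOLS.get(name)
    return validator is not None and validator(args)
```

The deny-by-default posture is the point: a tool the defender never thought about is rejected automatically, which is why this ranks high against tool abuse.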
Model Design Choices
| Defense | Description | Effectiveness |
|---|---|---|
| Smaller models for sensitive tasks | Use specialized, smaller models with fewer capabilities | Medium against broad attacks |
| Separate safety classifier | Dedicated model for safety evaluation independent of generation | Medium-high for known attack types |
| Dual-model verification | Two independent models must agree before executing actions | High but expensive |
| Retrieval separation | Separate the retrieval and generation stages with security boundaries | Medium against RAG poisoning |
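Dual-model verification can be sketched as a fail-closed gate: an action executes only when two independent safety judges both approve it. The judge functions below are stand-ins for real model calls, which is also where the "expensive" caveat comes from, since each check doubles inference cost.

```python
# Illustrative dual-model verification: require unanimous approval from
# two independent judges before an action may execute.
from typing import Callable

def dual_verify(action: str,
                judge_a: Callable[[str], bool],
                judge_b: Callable[[str], bool]) -> bool:
    """Fail closed: either judge vetoing the action blocks it."""
    return judge_a(action) and judge_b(action)
```

Independence between the two judges (different base models, different prompts) is what makes this high-effectiveness; two copies of the same model share the same blind spots.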
Layer 2: Training-Time Defenses
Defenses applied during model training to build inherent robustness.
Safety Alignment Methods
| Method | Description | Strengths | Weaknesses |
|---|---|---|---|
| RLHF | Train a reward model on human preferences for safety | Well-studied, effective for common cases | Reward hacking, distribution shift |
| DPO | Direct preference optimization without reward model | Simpler, fewer failure modes | Less flexible than RLHF |
| Constitutional AI | Model self-evaluates against principles | Scalable, consistent | Depends on principle completeness |
| Red team data augmentation | Include known attacks in training | Directly addresses known threats | Cannot cover novel attacks |
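Of these methods, DPO has the simplest objective to state concretely. For one preference pair it reduces to a logistic loss on the margin between how much the policy and the reference model prefer the chosen response; the sketch below shows that per-pair loss with scalar log-probabilities, which is a simplification of the batched tensor form used in practice.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    Loss shrinks as the policy prefers the chosen response more
    strongly than the reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (zero margin) the loss is log 2; a positive margin drives it toward zero. This is why DPO needs no separate reward model: the preference signal lives entirely in the log-probability margin.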
Data-Level Defenses
| Method | Description | Effectiveness |
|---|---|---|
| Data curation | Careful selection and filtering of training data | Essential foundation |
| Deduplication | Remove duplicate and near-duplicate training samples | Reduces memorization risk |
| Differential privacy | Add noise during training to limit individual sample influence | Provable guarantees but utility cost |
| Watermark detection | Detect and filter AI-generated content from training data | Moderate, evolving arms race |
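Deduplication at its simplest is exact matching after light normalization, as sketched below. Production pipelines typically add near-duplicate detection (e.g. MinHash over shingles), but even the exact form reduces the memorization risk that duplicated samples create.

```python
import hashlib

def dedupe(samples: list[str]) -> list[str]:
    """Exact deduplication after normalization (lowercasing and
    whitespace collapsing); keeps the first occurrence of each text."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in samples:
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```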
Layer 3: Inference-Time Defenses
Defenses that operate during model inference, between input and output.
Input Processing
| Defense | Description | Against |
|---|---|---|
| Instruction hierarchy | Enforce system > user > context priority | Prompt injection |
| Prompt shields | ML classifier that detects injection attempts | Prompt injection |
| Input sanitization | Remove special characters, normalize encoding | Tokenizer attacks |
| Perplexity filtering | Reject inputs with anomalously high perplexity | Adversarial suffixes |
| Input length limits | Restrict maximum input length per source | Context window attacks |
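Perplexity filtering is straightforward once a reference language model supplies per-token log-probabilities: compute perplexity and reject inputs above a threshold. The sketch below assumes those log-probabilities are already available; the threshold value is an arbitrary placeholder that would be tuned on benign traffic.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the input tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_perplexity_filter(token_logprobs: list[float],
                             threshold: float = 1000.0) -> bool:
    """Reject inputs whose perplexity under a reference LM is anomalously
    high, a signature of gibberish adversarial suffixes."""
    return perplexity(token_logprobs) <= threshold
```

The known limitation follows directly from the formula: an adversarial suffix rewritten as fluent natural language keeps per-token log-probabilities high and sails under the threshold, which is why this defense is scoped to gibberish-style suffixes.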
Inference Modification
| Defense | Description | Against |
|---|---|---|
| SmoothLLM | Random input perturbation for robustness | Adversarial suffixes |
| Activation monitoring | Monitor hidden state activations for anomalies | Activation steering |
| Attention pattern checks | Verify normal attention distribution | Context manipulation |
| Temperature control | Restrict sampling parameters | Output manipulation |
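The SmoothLLM idea can be sketched concretely: perturb random characters in several copies of the prompt, query the model on each, and take a majority vote. Because adversarial suffixes are brittle to character-level noise, perturbed copies tend to trigger refusals even when the original did not. The `model_refuses` callable below is a stand-in for a real model query.

```python
import random
from typing import Callable

def smoothllm_is_refused(prompt: str,
                         model_refuses: Callable[[str], bool],
                         n_copies: int = 5,
                         swap_frac: float = 0.1,
                         seed: int = 0) -> bool:
    """SmoothLLM-style sketch: randomly swap a fraction of characters in
    n_copies perturbed prompts, then majority-vote the refusal decisions."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_copies):
        chars = list(prompt)
        n_swaps = max(1, int(len(chars) * swap_frac))
        for i in rng.sample(range(len(chars)), n_swaps):
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
        if model_refuses("".join(chars)):
            votes += 1
    return votes > n_copies // 2
```

The trade-off is visible in the parameters: more copies and heavier perturbation improve robustness but multiply inference cost and degrade benign inputs.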
Layer 4: Output Defenses
Defenses that process model outputs before delivery.
| Defense | Description | Against |
|---|---|---|
| Content safety classifier | ML model that classifies output safety | Harmful content generation |
| PII detection & redaction | Scan outputs for personal information | Data exfiltration |
| URL/domain allowlisting | Only permit references to approved domains | Phishing via AI |
| Response consistency checks | Verify output aligns with expected behavior | Anomalous behavior |
| Output format validation | Ensure responses match expected structure | Format manipulation |
| Watermarking | Embed detectable signals in outputs | Provenance tracking |
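PII detection and redaction is often pattern-based at its core. The sketch below covers two common patterns only; real deployments layer trained NER models on top of many more pattern classes (phone numbers, credit cards, addresses), so treat this as a minimal illustration.

```python
import re

# Minimal PII redaction sketch: two illustrative pattern classes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```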
Layer 5: Application Defenses
Defenses at the application layer surrounding the AI model.
| Defense | Description | Against |
|---|---|---|
| Rate limiting | Restrict request volume per user/session | Automated attacks, extraction |
| Authentication & authorization | Verify user identity and permissions | Unauthorized access |
| Audit logging | Record all interactions for review | Post-incident analysis |
| Tool call approval | Require human approval for sensitive actions | Tool abuse |
| Session management | Limit conversation length, enforce resets | Context accumulation attacks |
| A/B testing for safety | Compare model versions for safety regression | Deployment safety |
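Rate limiting is commonly implemented as a token bucket, sketched below: each user gets a bucket of `capacity` requests that refills at `refill_rate` tokens per second, which allows short bursts while capping sustained throughput, the pattern that defeats automated extraction.

```python
class TokenBucket:
    """Per-user token-bucket rate limiter. Time is passed in explicitly
    so the logic stays deterministic and testable."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For example, a bucket with capacity 2 allows two immediate requests, rejects a third, and admits another after a second of refill.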
Layer 6: Governance & Policy
Non-technical defenses that frame the overall security posture.
| Defense | Description | Against |
|---|---|---|
| Responsible use policy | Define acceptable use and consequences | Misuse by authorized users |
| Incident response plan | Procedures for handling safety incidents | All attack types (response) |
| Bug bounty / red teaming | Incentivize external security testing | Unknown vulnerabilities |
| Model cards & documentation | Document model capabilities and limitations | Misunderstanding of capabilities |
| Regulatory compliance | Align with AI safety regulations | Legal and compliance risk |
Defense Effectiveness Matrix
Mapping defenses to attack types reveals coverage and gaps:
| Attack Type | Most Effective Defenses | Limited Defenses | Ineffective Defenses |
|---|---|---|---|
| Direct prompt injection | Instruction hierarchy, prompt shields | Output filtering | Rate limiting |
| Indirect prompt injection | Input sanitization per source, context isolation | Content classifiers | Authentication |
| Adversarial suffixes | Perplexity filtering, SmoothLLM | Output filtering | Input length limits |
| Semantic injection | Intent classifiers, dual-model verification | Keyword filters | All syntactic defenses |
| Data poisoning | Data curation, differential privacy | Model monitoring | Output filtering |
| Model extraction | Rate limiting, watermarking | API design | Input sanitization |
| Tool abuse | Tool allowlisting, approval workflows | Output filtering | Prompt shields |
| RAG poisoning | Content verification, source authentication | Output filtering | Rate limiting |
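A matrix like this lends itself to an automated coverage check: given the defenses a deployment actually has, list the attack types for which none of the "most effective" defenses is in place. The sketch below encodes an abridged subset of the table above; the string labels are informal keys, not a standard vocabulary.

```python
# Abridged attack -> most-effective-defenses mapping, from the matrix above.
MOST_EFFECTIVE: dict[str, set[str]] = {
    "direct prompt injection": {"instruction hierarchy", "prompt shields"},
    "adversarial suffixes": {"perplexity filtering", "smoothllm"},
    "tool abuse": {"tool allowlisting", "approval workflows"},
    "model extraction": {"rate limiting", "watermarking"},
}

def coverage_gaps(deployed: set[str]) -> list[str]:
    """Attack types with no top-tier defense among those deployed."""
    return sorted(attack for attack, best in MOST_EFFECTIVE.items()
                  if not best & deployed)
```

Running this against a typical Level 2 deployment (rate limiting plus prompt shields) immediately surfaces adversarial suffixes and tool abuse as uncovered, which is exactly the kind of gap the matrix is meant to reveal.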
Defense Maturity Model
Organizations can assess their defense maturity across these levels:
Level 1: Ad Hoc (No systematic defense)
No formal AI security controls. Model deployed with default safety training only. Reactive response to incidents.
Level 2: Basic (Input/output filtering)
Content safety classifiers on input and output. Basic rate limiting. Some logging in place.
Level 3: Structured (Multi-layer defense)
Systematic defenses at multiple layers. Instruction hierarchy enforced. Tool permissions defined. Regular red team testing.
Level 4: Managed (Measured and monitored)
Defense effectiveness measured quantitatively. Continuous monitoring with alerting. Automated response to detected attacks. Regular defense evaluation against new attack techniques.
Level 5: Optimized (Adaptive and anticipatory)
Defenses adapt to emerging threats. Proactive red teaming of novel attack classes. Defense-in-depth with no single points of failure. Continuous improvement based on threat intelligence.
Selecting Defense Combinations
No single defense is sufficient. Effective protection requires selecting complementary defenses:
Minimum Viable Defense Stack
For any production AI deployment:
- Input: Instruction hierarchy + basic input validation
- Output: Content safety classifier + PII detection
- Application: Rate limiting + audit logging
- Governance: Incident response plan + responsible use policy
Enhanced Defense Stack
For high-risk deployments (financial, healthcare, government):
All of the above, plus:
- Architecture: Tool allowlisting + sandboxed execution + context isolation
- Training: Adversarial training + red team data augmentation
- Inference: Prompt shields + SmoothLLM + activation monitoring
- Application: Tool call approval workflows + session limits
- Governance: Regular red team assessments + bug bounty
Related Topics
- Defense Landscape — Broader defense context and evolution
- Layered Defense Strategy — Implementing defense in depth
- Defense Evaluation — Measuring defense effectiveness
- Defense Economics — Cost-benefit analysis of defenses
Self-Check
A company deploys an AI chatbot with RLHF safety training and a content safety classifier on outputs. Which attack type are they LEAST protected against?