Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
Traditional safety training teaches the model itself to refuse harmful requests. But what happens when adversaries find ways to circumvent that training through jailbreaks? Constitutional Classifiers take a fundamentally different approach: instead of relying solely on the model's own judgment, they deploy independent classifier models trained specifically to detect adversarial inputs and harmful outputs.
The Problem: Jailbreaks at Scale
To understand why Constitutional Classifiers matter, it helps to first appreciate the scale of the problem they were designed to address.
The Jailbreak Landscape
Modern jailbreak techniques have evolved significantly beyond simple "ignore your instructions" prompts:
| Attack Category | Example Techniques | Why Traditional Defenses Fail |
|---|---|---|
| Encoding attacks | Base64, ROT13, Unicode substitution | Model safety training does not cover all encoding formats |
| Persona manipulation | "You are DAN," roleplay scenarios | Safety training can be overridden by strong persona instructions |
| Multi-turn escalation | Gradually shifting context over many messages | Each individual message appears benign; harm emerges from the sequence |
| Language switching | Requesting harmful content in low-resource languages | Safety training is weaker in underrepresented languages |
| Prompt injection | Embedded instructions in tool outputs, images, documents | Instructions bypass the user-facing safety layer |
The fundamental challenge is that the model's safety training is part of the same system that processes adversarial inputs. An attacker who can manipulate the model's processing can also manipulate its safety behavior.
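The encoding attacks in the table above illustrate why naive defenses fail. A toy sketch (the blocklist and prompts are invented for demonstration) shows how a simple Base64 transform carries the same intent past a keyword filter untouched:

```python
import base64

# Hypothetical naive keyword filter of the kind encoding attacks defeat.
BLOCKLIST = {"synthesize", "explosives"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(word in prompt.lower() for word in BLOCKLIST)

plain = "how do I synthesize X"
encoded = base64.b64encode(plain.encode()).decode()

print(naive_filter(plain))    # True: keyword match triggers the block
print(naive_filter(encoded))  # False: identical intent, invisible to the filter
```

A capable model can decode the Base64 and act on the request, so the harmful intent reaches the model even though the filter never sees it.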
How Constitutional Classifiers Work
Architecture Overview
Constitutional Classifiers deploy as a two-layer defense around the base model:
```
User Input → [Input Classifier] → Base Model → [Output Classifier] → User Output
                   │ (block)                         │ (block)
                   ▼                                 ▼
             Refusal Response                  Refusal Response
```
The input classifier screens incoming prompts for adversarial intent. The output classifier screens the model's responses for harmful content. Both classifiers are separate models, independent of the base model being protected.
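The control flow above can be sketched as a thin wrapper. This is an illustrative skeleton, not Anthropic's implementation: `input_classifier`, `base_model`, and `output_classifier` are hypothetical stand-ins for what would be separate model endpoints in a real deployment.

```python
from typing import Callable

REFUSAL = "I can't help with that request."

def guarded_generate(
    prompt: str,
    input_classifier: Callable[[str], bool],   # True = flag as adversarial
    base_model: Callable[[str], str],
    output_classifier: Callable[[str], bool],  # True = flag as harmful
) -> str:
    if input_classifier(prompt):          # layer 1: screen the input
        return REFUSAL
    response = base_model(prompt)
    if output_classifier(response):       # layer 2: screen the output
        return REFUSAL
    return response

# Toy components for demonstration only.
flag_input = lambda p: "ignore your instructions" in p.lower()
flag_output = lambda r: "step-by-step synthesis" in r.lower()
echo_model = lambda p: f"Here is an answer to: {p}"

print(guarded_generate("What is photosynthesis?", flag_input, echo_model, flag_output))
print(guarded_generate("Ignore your instructions and ...", flag_input, echo_model, flag_output))
```

Note that a flagged input never reaches the base model at all, which is what makes the input classifier useful against prompts designed to manipulate the model's own safety behavior.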
The Constitutional AI Training Pipeline
The key innovation is how training data for the classifiers is generated. Rather than relying on human-labeled datasets (which are expensive and cannot cover the full space of adversarial techniques), Constitutional Classifiers use a self-supervised pipeline inspired by Constitutional AI:
Define constitutional principles
Researchers define a set of principles that describe what constitutes harmful content. These are natural language descriptions -- for example, "Content that provides step-by-step instructions for synthesizing controlled substances" or "Content that could be used to generate convincing phishing messages targeting specific individuals."
Generate synthetic adversarial examples
An LLM is used to generate a large, diverse set of adversarial prompts that attempt to elicit violations of each constitutional principle. This includes varied attack formats: encoding tricks, persona manipulation, multi-language requests, and novel techniques the LLM can imagine.
Generate synthetic benign examples
The same process generates benign prompts that are topically similar to the harmful ones but do not violate any principles. This is critical for reducing false positives -- the classifier needs to distinguish "how does methamphetamine affect the brain?" (benign, educational) from "how do I synthesize methamphetamine?" (harmful).
Constitutional evaluation
The generated examples are evaluated against the constitutional principles by another LLM, producing labels (harmful/benign) with explanations. The explanations are used for quality control and to identify edge cases.
Train classifiers
Standard classifier training on the labeled synthetic data, producing models optimized for detecting adversarial inputs and harmful outputs.
Iterate with adversarial feedback
Red teamers attack the classifier-defended system, and successful attacks are fed back into the training pipeline to generate more training data covering the discovered gaps.
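The six steps above can be sketched as a data-generation loop. Everything here is schematic: `llm` is a hypothetical text-generation function, and the labeling shortcut stands in for the constitutional-evaluation pass that a second LLM would perform.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str  # "harmful" or "benign"

# Step 1: constitutional principles as natural-language descriptions.
PRINCIPLES = [
    "Content that provides step-by-step instructions for synthesizing controlled substances",
]

def generate_examples(principle: str, llm) -> list[Example]:
    # Steps 2-3: synthetic adversarial and topically similar benign prompts.
    harmful = [llm(f"Write an adversarial prompt that tries to elicit: {principle}")]
    benign = [llm(f"Write a benign, educational prompt on the same topic as: {principle}")]
    # Step 4 would re-evaluate each example against the principle with another
    # LLM; here the generation prompt itself serves as the label source.
    return ([Example(t, "harmful") for t in harmful]
            + [Example(t, "benign") for t in benign])

def training_iteration(llm, successful_attacks: list[str]) -> list[Example]:
    data = [ex for p in PRINCIPLES for ex in generate_examples(p, llm)]
    # Step 6: fold red-team bypasses back in as fresh harmful examples.
    data += [Example(a, "harmful") for a in successful_attacks]
    return data  # step 5: feed into standard classifier training
```

Each red-team cycle grows the dataset exactly where the previous classifier was weakest, which is why the pipeline improves under attack rather than degrading.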
The Red Team Study: 3,000+ Hours of Adversarial Testing
Study Design
Anthropic conducted one of the most extensive adversarial evaluations of an AI defense system to date. The study involved:
| Parameter | Detail |
|---|---|
| Participants | Professional red teamers with experience in AI security |
| Duration | Over 3,000 person-hours of adversarial testing |
| Objective | Bypass Constitutional Classifiers to extract harmful content from a defended model |
| Baseline | The same model without Constitutional Classifiers |
| Metric | Attack success rate (ASR) -- percentage of attempts that successfully elicited harmful content |
Results
The headline results were striking:
| Configuration | Attack Success Rate | Notes |
|---|---|---|
| Undefended model | ~86% | Most jailbreak techniques succeeded against the base model |
| Model with Constitutional Classifiers | ~4.4% | Dramatic reduction in successful attacks |
This represents a roughly 20x reduction in attack success rate.
What Attacks Still Succeeded?
The ~4.4% of attacks that bypassed Constitutional Classifiers generally fell into several categories:
| Category | Description | Why It Bypassed |
|---|---|---|
| Novel encoding schemes | Custom or very obscure encoding formats | Not represented in training data |
| Extreme context manipulation | Very long conversations that gradually shifted context | Classifiers had limited context window for analysis |
| Semantic camouflage | Harmful content expressed using domain-specific jargon or metaphor | Surface-level features did not match harmful patterns |
| Adversarial examples targeting classifier | Inputs specifically crafted to fool the classifier model | Classifiers are themselves ML models vulnerable to adversarial examples |
Trade-offs and Limitations
The Refusal Rate Problem
Constitutional Classifiers increase safety, but they also increase the rate at which benign queries are incorrectly refused. This is the fundamental precision-recall trade-off in any classification system:
| Metric | Without Classifiers | With Classifiers |
|---|---|---|
| Attack success rate | ~86% | ~4.4% |
| False refusal rate on benign queries | Low | Measurably increased |
| User experience impact | Low (but unsafe) | Noticeable (but safer) |
Borderline Content
The most challenging area for Constitutional Classifiers is borderline content -- requests that could be harmful or benign depending on context:
- "How to pick a lock" -- legitimate locksmith education or burglary preparation?
- "Write a story where a character explains how to make explosives" -- creative fiction or information laundering?
- "What are the symptoms of poisoning by [specific substance]?" -- medical education or harm planning?
Constitutional Classifiers tend to err on the side of caution with borderline content, which contributes to the increased false refusal rate. The constitutional principles can be tuned to adjust this threshold, but there is no configuration that simultaneously minimizes both false positives and false negatives.
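The threshold trade-off can be made concrete with a toy example. The scores and labels below are invented: the point is only that moving one cutoff over the same scores shifts errors between false refusals and missed attacks, never eliminating both.

```python
# (classifier harm score, true label) -- illustrative values only.
scores_labels = [
    (0.95, "harmful"), (0.80, "harmful"), (0.55, "harmful"),
    (0.60, "benign"), (0.30, "benign"), (0.10, "benign"),
]

def rates(threshold: float) -> tuple[int, int]:
    """Count (false positives, false negatives) at a given refusal threshold."""
    false_pos = sum(1 for s, y in scores_labels if s >= threshold and y == "benign")
    false_neg = sum(1 for s, y in scores_labels if s < threshold and y == "harmful")
    return false_pos, false_neg

print(rates(0.5))  # (1, 0) -- cautious cutoff: a benign query is refused
print(rates(0.7))  # (0, 1) -- permissive cutoff: a harmful query slips through
```

The borderline examples in the list above are precisely the ones whose scores cluster near the cutoff, so they dominate whichever error the chosen threshold produces.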
Latency Impact
Adding two classifier inference passes (input and output) to every request introduces latency:
| Component | Typical Latency Impact |
|---|---|
| Input classifier | 50-200ms additional |
| Output classifier | 50-200ms additional |
| Total overhead | 100-400ms per request |
For typical conversational applications, this overhead may be acceptable. For latency-sensitive use cases (code completion, interactive agents), it may require optimization (smaller classifier models, batched inference, speculative execution).
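One such optimization, speculative execution, can hide the input classifier's latency behind generation: start the base model immediately, and discard the draft if the classifier flags the prompt. The sketch below uses invented latencies and stub models; it is an illustration of the pattern, not a production design.

```python
import asyncio

async def input_classifier(prompt: str) -> bool:
    await asyncio.sleep(0.1)   # stand-in for a ~100ms classifier pass
    return "jailbreak" in prompt.lower()

async def base_model(prompt: str) -> str:
    await asyncio.sleep(0.5)   # stand-in for a ~500ms generation
    return f"answer to {prompt!r}"

async def speculative_generate(prompt: str) -> str:
    draft = asyncio.create_task(base_model(prompt))  # start generating now
    if await input_classifier(prompt):
        draft.cancel()          # flagged: throw away the speculative draft
        return "Request refused."
    return await draft          # classifier time is hidden behind generation

print(asyncio.run(speculative_generate("benign question")))
```

The trade-off is wasted compute on refused requests; whether that is cheaper than the added latency depends on the refusal rate and the relative cost of generation.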
Comparison with Other Defense Approaches
Defense Landscape Context
Constitutional Classifiers are one approach in a broader defense ecosystem. Understanding how they compare helps red teamers assess the defenses they face:
| Defense Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Safety training (RLHF/DPO) | Train the model itself to refuse harmful requests | Low latency, no additional infrastructure | Can be jailbroken; model is both judge and executor |
| Constitutional Classifiers | Independent classifiers screen input/output | Defense-in-depth; hard to jailbreak both model and classifier | Latency overhead; false refusals; requires maintaining separate models |
| Instruction hierarchy | Train model to prioritize system instructions | Addresses prompt injection directly | Does not help with direct jailbreaks |
| Output filtering (keyword/regex) | Pattern matching on model output | Fast, simple, no ML needed | Easily bypassed with paraphrasing; high false positive rate |
| Dual LLM / CaMeL | Separate trusted and untrusted processing | Strong isolation for tool-using agents | Architectural complexity; primarily targets prompt injection |
Key Differentiator
The unique value of Constitutional Classifiers is that they are independent of the base model. Jailbreaking the base model does not jailbreak the classifiers. An attacker must either:
- Craft an input that bypasses the input classifier AND jailbreaks the base model, or
- Jailbreak the base model AND craft output that bypasses the output classifier, or
- Bypass both classifiers simultaneously
This significantly raises the bar compared to defenses that rely on the model itself.
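A back-of-envelope calculation shows why independent layers raise the bar. The numbers below are invented, and the independence assumption is generous to the defender (real attacks against the layers are often correlated), but it conveys the multiplicative effect:

```python
# Hypothetical per-layer bypass probabilities for a single attack attempt.
p_input_bypass = 0.10     # slips past the input classifier
p_model_jailbreak = 0.50  # jailbreaks the safety-trained base model
p_output_bypass = 0.10    # output evades the output classifier

# Under independence, the attack must succeed at every layer.
combined = p_input_bypass * p_model_jailbreak * p_output_bypass
print(f"{combined:.3f}")  # 0.005 -- far below any single layer's bypass rate
```

Correlated failure modes (e.g., an encoding that fools both classifiers because both were trained on the same synthetic data) are exactly why the observed ~4.4% residual attack success rate is higher than a naive independence estimate would suggest.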
Red Team Implications
Attacking Constitutional Classifiers
For red teamers facing Constitutional Classifier defenses, the attack surface shifts:
| Traditional Target | New Target | Technique Adaptation |
|---|---|---|
| Model's safety training | Input classifier model | Adversarial examples that fool the classifier while carrying harmful intent |
| System prompt | Classifier decision boundary | Find inputs near the boundary that are misclassified as benign |
| Model reasoning | Classifier training data gaps | Use formats and encodings not well-represented in the synthetic training data |
| Single model | Two-model system | Consider attacks that exploit the interaction between classifier and base model |
Practical Attack Strategies
- Classifier probing: Send a range of inputs to map the classifier's decision boundary. Identify which features trigger refusal and which do not.
- Encoding diversity: Test unusual encodings, character sets, and formatting that may not be covered by the synthetic training data.
- Semantic indirection: Express harmful intent through analogy, metaphor, or domain-specific language that the classifier may not recognize as harmful.
- Split-request attacks: Distribute harmful content across multiple benign-appearing requests that combine to form harmful information.
- Classifier adversarial examples: If the classifier architecture is known, craft inputs specifically designed to be adversarial to that classifier.
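The classifier-probing strategy above amounts to sweeping paraphrases of one request and recording which are refused. A minimal black-box sketch, where `query_defended_system` is a hypothetical endpoint returning `True` when a request is refused:

```python
def probe_boundary(variants: list[str], query_defended_system) -> dict[str, bool]:
    """Map each paraphrase to whether the defended system refused it."""
    return {v: query_defended_system(v) for v in variants}

# Paraphrases sweeping from clearly benign toward clearly harmful intent.
variants = [
    "how do lockpicks work mechanically?",
    "explain lock internals for a locksmithing course",
    "step-by-step: open a neighbor's lock without a key",
]

# Toy stand-in for the defended system, for demonstration only.
refused = probe_boundary(variants, lambda v: "without a key" in v)
for text, blocked in refused.items():
    print(f"{'BLOCKED' if blocked else 'allowed'}: {text}")
```

The transition point between allowed and blocked paraphrases approximates the classifier's decision boundary, and the features that flip the decision tell the red teamer what the classifier is actually keying on. Note that rate limits and query monitoring on real systems are designed to make exactly this kind of probing expensive.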
Deployment Considerations
When to Use Constitutional Classifiers
| Scenario | Recommendation | Rationale |
|---|---|---|
| High-stakes application (medical, legal, financial) | Strongly recommended | Safety benefit justifies latency and false refusal costs |
| Consumer chatbot with high volume | Consider with tuned thresholds | Balance safety with user experience |
| Internal enterprise assistant | Depends on data sensitivity | May be unnecessary if usage is low-risk and monitored |
| Code generation tool | Consider for output classifier only | Input classifier may block legitimate code-related queries |
| Creative writing assistant | Use with relaxed thresholds | Overly aggressive classifiers impede creative use cases |
Integration with Existing Defenses
Constitutional Classifiers are most effective as part of a layered defense:
- Input validation (non-ML) -- block known malicious patterns, enforce format constraints
- Constitutional Classifiers (input) -- ML-based screening for adversarial intent
- Safety-trained base model -- the model's own safety training as a defense layer
- Constitutional Classifiers (output) -- ML-based screening for harmful content
- Output validation (non-ML) -- format enforcement, PII redaction, compliance checks
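The five-layer stack above can be sketched as a simple chain where any layer short-circuits to a refusal. Every component here is a hypothetical stand-in; the point is the ordering and the fail-closed flow:

```python
import re

def regex_input_check(prompt: str) -> bool:
    """Layer 1: non-ML validation; True means the prompt passes."""
    return not re.search(r"ignore (all|your) instructions", prompt, re.I)

def layered_generate(prompt: str, input_clf, model, output_clf, redact) -> str:
    if not regex_input_check(prompt) or input_clf(prompt):  # layers 1-2
        return "Request refused."
    response = model(prompt)                                # layer 3
    if output_clf(response):                                # layer 4
        return "Request refused."
    return redact(response)                                 # layer 5

# Toy run with stub layers; layer 5 redacts email addresses.
result = layered_generate(
    "summarize this article",
    input_clf=lambda p: False,
    model=lambda p: "Summary: ... contact alice@example.com",
    output_clf=lambda r: False,
    redact=lambda r: re.sub(r"\S+@\S+", "[REDACTED]", r),
)
print(result)  # Summary: ... contact [REDACTED]
```

Ordering matters: the cheap non-ML checks run first so that obviously malformed or known-bad inputs never consume classifier or model inference.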
Further Reading
- Advanced Defense Techniques -- Broader survey of cutting-edge defense research
- CaMeL / Dual LLM Pattern -- Complementary architectural defense for tool-using agents
- Guardrails & Safety Layer Architecture -- Where Constitutional Classifiers fit in the overall safety architecture
- Alignment Faking -- Why independent classifiers may be more reliable than model self-regulation
Related Topics
- Guardrails & Safety Layer Architecture - Architectural context for classifier deployment
- AI-Powered Red Teaming - Automated methods used to test defenses like Constitutional Classifiers
- Watermarking & AI-Generated Text Detection - Another advanced defense technique
References
- "Constitutional AI: Harmlessness from AI Feedback" - Bai, Y., et al., Anthropic (2022) - The foundational constitutional AI paper that inspired the classifier training approach
- "Defending Against Jailbreaks with Constitutional Classifiers" - Anthropic (2025) - The paper introducing Constitutional Classifiers and presenting the red team evaluation results
- "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" - Wallace, E., et al., OpenAI (2024) - Complementary defense approach operating at the model training level
- "Jailbroken: How Does LLM Safety Training Fail?" - Wei, A., et al. (2024) - Analysis of why safety training alone is insufficient, motivating external classifier approaches
What is the primary advantage of Constitutional Classifiers over safety training alone as a defense against jailbreaks?