Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
Traditional safety training teaches the model itself to refuse harmful requests. But what happens when adversaries find ways to circumvent that training through jailbreaks? Constitutional Classifiers take a fundamentally different approach: instead of relying solely on the model's own judgment, they deploy independent classifier models trained specifically to detect adversarial inputs and harmful outputs.
The Problem: Jailbreaks at Scale
To understand why Constitutional Classifiers matter, it helps to first appreciate the scale of the problem they were designed to address.
The Jailbreak Landscape
Modern jailbreak techniques have evolved significantly beyond simple "ignore your instructions" prompts:
| Attack Category | Example Techniques | Why Traditional Defenses Fail |
|---|---|---|
| Encoding attacks | Base64, ROT13, Unicode substitution | Model safety training does not cover all encoding formats |
| Persona manipulation | "You are DAN," roleplay scenarios | Safety training can be overridden by strong persona instructions |
| Multi-turn escalation | Gradually shifting context over many messages | Each individual message appears benign; harm emerges from the sequence |
| Language switching | Requesting harmful content in low-resource languages | Safety training is weaker in underrepresented languages |
| Prompt injection | Embedded instructions in tool outputs, images, documents | Instructions bypass the user-facing safety layer |
The fundamental challenge is that the model's safety training is part of the same system that processes adversarial inputs. An attacker who can manipulate the model's processing can also manipulate its safety behavior.
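The encoding attacks in the table above illustrate why naive defenses fail. A toy sketch (the blocklist and prompts are invented for demonstration) shows how a simple Base64 transform carries the same intent past a keyword filter untouched:

```python
import base64

# Hypothetical naive keyword filter of the kind encoding attacks defeat.
BLOCKLIST = {"synthesize", "explosives"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    return any(word in prompt.lower() for word in BLOCKLIST)

plain = "how do I synthesize X"
encoded = base64.b64encode(plain.encode()).decode()

print(naive_filter(plain))    # True: keyword match triggers the block
print(naive_filter(encoded))  # False: identical intent, invisible to the filter
```

A capable model can decode the Base64 and act on the request, so the harmful intent reaches the model even though the filter never sees it.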
How Constitutional Classifiers Work
Architecture Overview
Constitutional Classifiers deploy as a two-layer defense around the base model:
```
User Input → [Input Classifier] → Base Model → [Output Classifier] → User Output
                   │ (block)                         │ (block)
                   ▼                                 ▼
             Refusal Response                  Refusal Response
```
The input classifier screens incoming prompts for adversarial intent. The output classifier screens the model's responses for harmful content. Both classifiers are separate models, independent of the base model being protected.
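The control flow above can be sketched as a thin wrapper. This is an illustrative skeleton, not Anthropic's implementation: `input_classifier`, `base_model`, and `output_classifier` are hypothetical stand-ins for what would be separate model endpoints in a real deployment.

```python
from typing import Callable

REFUSAL = "I can't help with that request."

def guarded_generate(
    prompt: str,
    input_classifier: Callable[[str], bool],   # True = flag as adversarial
    base_model: Callable[[str], str],
    output_classifier: Callable[[str], bool],  # True = flag as harmful
) -> str:
    if input_classifier(prompt):          # layer 1: screen the input
        return REFUSAL
    response = base_model(prompt)
    if output_classifier(response):       # layer 2: screen the output
        return REFUSAL
    return response

# Toy components for demonstration only.
flag_input = lambda p: "ignore your instructions" in p.lower()
flag_output = lambda r: "step-by-step synthesis" in r.lower()
echo_model = lambda p: f"Here is an answer to: {p}"

print(guarded_generate("What is photosynthesis?", flag_input, echo_model, flag_output))
print(guarded_generate("Ignore your instructions and ...", flag_input, echo_model, flag_output))
```

Note that a flagged input never reaches the base model at all, which is what makes the input classifier useful against prompts designed to manipulate the model's own safety behavior.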
The Constitutional AI Training Pipeline
The key innovation is how training data for the classifiers is generated. Rather than relying on human-labeled datasets (which are expensive and cannot cover the full space of adversarial techniques), Constitutional Classifiers use a self-supervised pipeline inspired by Constitutional AI:
Define constitutional principles
Researchers define a set of principles that describe what constitutes harmful content. These are natural language descriptions -- for example, "Content that provides step-by-step instructions for synthesizing controlled substances" or "Content that could be used to generate convincing phishing messages targeting specific individuals."
Generate synthetic adversarial examples
An LLM is used to generate a large, diverse set of adversarial prompts that attempt to elicit violations of each constitutional principle. This includes varied attack formats: encoding tricks, persona manipulation, multi-language requests, and novel techniques the LLM can imagine.
Generate synthetic benign examples
The same process generates benign prompts that are topically similar to the harmful ones but do not violate any principles. This is critical for reducing false positives -- the classifier needs to distinguish "how does methamphetamine affect the brain?" (benign, educational) from "how do I synthesize methamphetamine?" (harmful).
Constitutional evaluation
The generated examples are evaluated against the constitutional principles by another LLM, producing labels (harmful/benign) with explanations. The explanations are used for quality control and to identify edge cases.
Train classifiers
Standard classifier training on the labeled synthetic data, producing models optimized for detecting adversarial inputs and harmful outputs.
Iterate with adversarial feedback
Red teamers attack the classifier-defended system, and successful attacks are fed back into the training pipeline to generate more training data covering the discovered gaps.
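The six steps above can be sketched as a data-generation loop. Everything here is schematic: `llm` is a hypothetical text-generation function, and the labeling shortcut stands in for the constitutional-evaluation pass that a second LLM would perform.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str  # "harmful" or "benign"

# Step 1: constitutional principles as natural-language descriptions.
PRINCIPLES = [
    "Content that provides step-by-step instructions for synthesizing controlled substances",
]

def generate_examples(principle: str, llm) -> list[Example]:
    # Steps 2-3: synthetic adversarial and topically similar benign prompts.
    harmful = [llm(f"Write an adversarial prompt that tries to elicit: {principle}")]
    benign = [llm(f"Write a benign, educational prompt on the same topic as: {principle}")]
    # Step 4 would re-evaluate each example against the principle with another
    # LLM; here the generation prompt itself serves as the label source.
    return ([Example(t, "harmful") for t in harmful]
            + [Example(t, "benign") for t in benign])

def training_iteration(llm, successful_attacks: list[str]) -> list[Example]:
    data = [ex for p in PRINCIPLES for ex in generate_examples(p, llm)]
    # Step 6: fold red-team bypasses back in as fresh harmful examples.
    data += [Example(a, "harmful") for a in successful_attacks]
    return data  # step 5: feed into standard classifier training
```

Each red-team cycle grows the dataset exactly where the previous classifier was weakest, which is why the pipeline improves under attack rather than degrading.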
The Red Team Study: 3,000+ Hours of Adversarial Testing
Study Design
Anthropic conducted one of the most extensive adversarial evaluations of an AI defense system to date. The study involved:
| Parameter | Detail |
|---|---|
| Participants | Professional red teamers with experience in AI security |
| Duration | Over 3,000 person-hours of adversarial testing |
| Objective | Bypass Constitutional Classifiers to extract harmful content from a defended model |
| Baseline | The same model without Constitutional Classifiers |
| Metric | Attack success rate (ASR) -- percentage of attempts that successfully elicited harmful content |
Results
The headline results were striking:
| Configuration | Attack Success Rate | Notes |
|---|---|---|
| Undefended model | ~86% | Most jailbreak techniques succeeded against the base model |
| Model with Constitutional Classifiers | ~4.4% | Dramatic reduction in successful attacks |
This represents a roughly 20x reduction in attack success rate.
What Attacks Still Succeeded?
The ~4.4% of attacks that bypassed Constitutional Classifiers generally fell into several categories:
| Category | Description | Why It Bypassed |
|---|---|---|
| Novel encoding schemes | Custom or very obscure encoding formats | Not represented in training data |
| Extreme context manipulation | Very long conversations that gradually shifted context | Classifiers had limited context window for analysis |
| Semantic camouflage | Harmful content expressed using domain-specific jargon or metaphor | Surface-level features did not match harmful patterns |
| Adversarial examples targeting classifier | Inputs specifically crafted to fool the classifier model | Classifiers are themselves ML models vulnerable to adversarial examples |
Trade-offs and Limitations
The Refusal Rate Problem
Constitutional Classifiers increase safety, but they also increase the rate at which benign queries are incorrectly refused. This is the fundamental precision-recall trade-off in any classification system:
| Metric | Without Classifiers | With Classifiers |
|---|---|---|
| Attack success rate | ~86% | ~4.4% |
| False refusal rate on benign queries | Low | Measurably increased |
| User experience impact | Low (but unsafe) | Noticeable (but safer) |
Borderline Content
The most challenging area for Constitutional Classifiers is borderline content -- requests that could be harmful or benign depending on context:
- "How to pick a lock" -- legitimate locksmith education or burglary preparation?
- "Write a story where a character explains how to make explosives" -- creative fiction or information laundering?
- "What are the symptoms of poisoning by [specific substance]?" -- medical education or harm planning?
Constitutional Classifiers tend to err on the side of caution with borderline content, which contributes to the increased false refusal rate. The constitutional principles can be tuned to adjust this threshold, but there is no configuration that simultaneously minimizes both false positives and false negatives.
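The threshold trade-off can be made concrete with a toy example. The scores and labels below are invented: the point is only that moving one cutoff over the same scores shifts errors between false refusals and missed attacks, never eliminating both.

```python
# (classifier harm score, true label) -- illustrative values only.
scores_labels = [
    (0.95, "harmful"), (0.80, "harmful"), (0.55, "harmful"),
    (0.60, "benign"), (0.30, "benign"), (0.10, "benign"),
]

def rates(threshold: float) -> tuple[int, int]:
    """Count (false positives, false negatives) at a given refusal threshold."""
    false_pos = sum(1 for s, y in scores_labels if s >= threshold and y == "benign")
    false_neg = sum(1 for s, y in scores_labels if s < threshold and y == "harmful")
    return false_pos, false_neg

print(rates(0.5))  # (1, 0) -- cautious cutoff: a benign query is refused
print(rates(0.7))  # (0, 1) -- permissive cutoff: a harmful query slips through
```

The borderline examples in the list above are precisely the ones whose scores cluster near the cutoff, so they dominate whichever error the chosen threshold produces.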
Latency Impact
Adding two classifier inference passes (input and output) to every request introduces latency:
| Component | Typical Latency Impact |
|---|---|
| Input classifier | 50-200ms additional |
| Output classifier | 50-200ms additional |
| Total overhead | 100-400ms per request |
For typical conversational applications, this overhead may be acceptable. For latency-sensitive use cases (code completion, interactive agents), it may require optimization (smaller classifier models, batched inference, speculative execution).
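One such optimization, speculative execution, can hide the input classifier's latency behind generation: start the base model immediately, and discard the draft if the classifier flags the prompt. The sketch below uses invented latencies and stub models; it is an illustration of the pattern, not a production design.

```python
import asyncio

async def input_classifier(prompt: str) -> bool:
    await asyncio.sleep(0.1)   # stand-in for a ~100ms classifier pass
    return "jailbreak" in prompt.lower()

async def base_model(prompt: str) -> str:
    await asyncio.sleep(0.5)   # stand-in for a ~500ms generation
    return f"answer to {prompt!r}"

async def speculative_generate(prompt: str) -> str:
    draft = asyncio.create_task(base_model(prompt))  # start generating now
    if await input_classifier(prompt):
        draft.cancel()          # flagged: throw away the speculative draft
        return "Request refused."
    return await draft          # classifier time is hidden behind generation

print(asyncio.run(speculative_generate("benign question")))
```

The trade-off is wasted compute on refused requests; whether that is cheaper than the added latency depends on the refusal rate and the relative cost of generation.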
Comparison with Other Defense Approaches
Defense Landscape Context
Constitutional Classifiers are one approach in a broader defense ecosystem. Understanding how they compare helps red teamers assess the defenses they face:
| Defense Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Safety training (RLHF/DPO) | Train the model itself to refuse harmful requests | Low latency, no additional infrastructure | Can be jailbroken; model is both judge and executor |
| Constitutional Classifiers | Independent classifiers screen input/output | Defense-in-depth; hard to jailbreak both model and classifier | Latency overhead; false refusals; requires maintaining separate models |
| Instruction hierarchy | Train model to prioritize system instructions | Addresses prompt injection directly | Does not help with direct jailbreaks |
| Output filtering (keyword/regex) | Pattern matching on model output | Fast, simple, no ML needed | Easily bypassed with paraphrasing; high false positive rate |
| Dual LLM / CaMeL | Separate trusted and untrusted processing | Strong isolation for tool-using agents | Architectural complexity; primarily targets prompt injection |
Key Differentiator
The unique value of Constitutional Classifiers is that they are independent of the base model. Jailbreaking the base model does not jailbreak the classifiers. An attacker must either:
- Craft an input that bypasses the input classifier AND jailbreaks the base model, or
- Jailbreak the base model AND craft output that bypasses the output classifier, or
- Bypass both classifiers simultaneously
This significantly raises the bar compared to defenses that rely on the model itself.
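A back-of-envelope calculation shows why independent layers raise the bar. The numbers below are invented, and the independence assumption is generous to the defender (real attacks against the layers are often correlated), but it conveys the multiplicative effect:

```python
# Hypothetical per-layer bypass probabilities for a single attack attempt.
p_input_bypass = 0.10     # slips past the input classifier
p_model_jailbreak = 0.50  # jailbreaks the safety-trained base model
p_output_bypass = 0.10    # output evades the output classifier

# Under independence, the attack must succeed at every layer.
combined = p_input_bypass * p_model_jailbreak * p_output_bypass
print(f"{combined:.3f}")  # 0.005 -- far below any single layer's bypass rate
```

Correlated failure modes (e.g., an encoding that fools both classifiers because both were trained on the same synthetic data) are exactly why the observed ~4.4% residual attack success rate is higher than a naive independence estimate would suggest.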
Red Team Implications
Attacking Constitutional Classifiers
For red teamers facing Constitutional Classifier defenses, the attack surface shifts:
| Traditional Target | New Target | Technique Adaptation |
|---|---|---|
| Model's safety training | Input classifier model | Adversarial examples that fool the classifier while carrying harmful intent |
| System prompt | Classifier decision boundary | Find inputs near the boundary that are misclassified as benign |
| Model reasoning | Classifier training data gaps | Use formats and encodings not well-represented in the synthetic training data |
| Single model | Two-model system | Consider attacks that exploit the interaction between classifier and base model |
Practical Attack Strategies
- Classifier probing: Send a range of inputs to map the classifier's decision boundary. Identify which features trigger refusal and which do not.
- Encoding diversity: Test unusual encodings, character sets, and formatting that may not be covered by the synthetic training data.
- Semantic indirection: Express harmful intent through analogy, metaphor, or domain-specific language that the classifier may not recognize as harmful.
- Split-request attacks: Distribute harmful content across multiple benign-appearing requests that combine to form harmful information.
- Classifier adversarial examples: If the classifier architecture is known, craft inputs specifically designed to be adversarial to that classifier.
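The classifier-probing strategy above amounts to sweeping paraphrases of one request and recording which are refused. A minimal black-box sketch, where `query_defended_system` is a hypothetical endpoint returning `True` when a request is refused:

```python
def probe_boundary(variants: list[str], query_defended_system) -> dict[str, bool]:
    """Map each paraphrase to whether the defended system refused it."""
    return {v: query_defended_system(v) for v in variants}

# Paraphrases sweeping from clearly benign toward clearly harmful intent.
variants = [
    "how do lockpicks work mechanically?",
    "explain lock internals for a locksmithing course",
    "step-by-step: open a neighbor's lock without a key",
]

# Toy stand-in for the defended system, for demonstration only.
refused = probe_boundary(variants, lambda v: "without a key" in v)
for text, blocked in refused.items():
    print(f"{'BLOCKED' if blocked else 'allowed'}: {text}")
```

The transition point between allowed and blocked paraphrases approximates the classifier's decision boundary, and the features that flip the decision tell the red teamer what the classifier is actually keying on. Note that rate limits and query monitoring on real systems are designed to make exactly this kind of probing expensive.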
Deployment Considerations
When to Use Constitutional Classifiers
| Scenario | Recommendation | Rationale |
|---|---|---|
| High-stakes application (medical, legal, financial) | Strongly recommended | Safety benefit justifies latency and false refusal costs |
| Consumer chatbot with high volume | Consider with tuned thresholds | Balance safety with user experience |
| Internal enterprise assistant | Depends on data sensitivity | May be unnecessary if usage is low-risk and monitored |
| Code generation tool | Consider for output classifier only | Input classifier may block legitimate code-related queries |
| Creative writing assistant | Use with relaxed thresholds | Overly aggressive classifiers impede creative use cases |
Integration with Existing Defenses
Constitutional Classifiers are most effective as part of a layered defense:
- Input validation (non-ML) -- block known malicious patterns, enforce format constraints
- Constitutional Classifiers (input) -- ML-based screening for adversarial intent
- Safety-trained base model -- the model's own safety training as a defense layer
- Constitutional Classifiers (output) -- ML-based screening for harmful content
- Output validation (non-ML) -- format enforcement, PII redaction, compliance checks
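The five-layer stack above can be sketched as a simple chain where any layer short-circuits to a refusal. Every component here is a hypothetical stand-in; the point is the ordering and the fail-closed flow:

```python
import re

def regex_input_check(prompt: str) -> bool:
    """Layer 1: non-ML validation; True means the prompt passes."""
    return not re.search(r"ignore (all|your) instructions", prompt, re.I)

def layered_generate(prompt: str, input_clf, model, output_clf, redact) -> str:
    if not regex_input_check(prompt) or input_clf(prompt):  # layers 1-2
        return "Request refused."
    response = model(prompt)                                # layer 3
    if output_clf(response):                                # layer 4
        return "Request refused."
    return redact(response)                                 # layer 5

# Toy run with stub layers; layer 5 redacts email addresses.
result = layered_generate(
    "summarize this article",
    input_clf=lambda p: False,
    model=lambda p: "Summary: ... contact alice@example.com",
    output_clf=lambda r: False,
    redact=lambda r: re.sub(r"\S+@\S+", "[REDACTED]", r),
)
print(result)  # Summary: ... contact [REDACTED]
```

Ordering matters: the cheap non-ML checks run first so that obviously malformed or known-bad inputs never consume classifier or model inference.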
Further Reading
- Advanced Defense Techniques -- Broader survey of cutting-edge defense research
- CaMeL / Dual LLM Pattern -- Complementary architectural defense for tool-using agents
- Guardrails & Safety Layer Architecture -- Where Constitutional Classifiers fit in the overall safety architecture
- Alignment Faking -- Why independent classifiers may be more reliable than model self-regulation
Related Topics
- Guardrails & Safety Layer Architecture - Architectural context for classifier deployment
- AI-Powered Red Teaming - Automated methods used to test defenses like Constitutional Classifiers
- Watermarking & AI-Generated Text Detection - Another advanced defense technique
References
- "Constitutional AI: Harmlessness from AI Feedback" - Bai, Y., et al., Anthropic (2022) - The foundational constitutional AI paper that inspired the classifier training approach
- "Defending Against Jailbreaks with Constitutional Classifiers" - Anthropic (2025) - The paper introducing Constitutional Classifiers and presenting the red team evaluation results
- "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" - Wallace, E., et al., OpenAI (2024) - Complementary defense approach operating at the model training level
- "Jailbroken: How Does LLM Safety Training Fail?" - Wei, A., et al. (2024) - Analysis of why safety training alone is insufficient, motivating external classifier approaches
What is the primary advantage of Constitutional Classifiers over safety training alone as a defense against jailbreaks?