Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
Traditional 安全 訓練 teaches 模型 itself to refuse harmful requests. But what happens when adversaries find ways to circumvent that 訓練 through jailbreaks? Constitutional Classifiers take a fundamentally different approach: instead of relying solely on 模型's own judgment, they deploy independent classifier models trained specifically to detect 對抗性 inputs and harmful outputs.
The Problem: Jailbreaks at Scale
Before 理解 the solution, it is important to 理解 the scale of the problem Constitutional Classifiers were designed to address.
The 越獄 Landscape
Modern 越獄 techniques have evolved significantly beyond simple "ignore your instructions" prompts:
| 攻擊 Category | 範例 Techniques | Why Traditional 防禦 Fail |
|---|---|---|
| Encoding attacks | Base64, ROT13, Unicode substitution | Model 安全 訓練 does not cover all encoding formats |
| Persona manipulation | "You are DAN," roleplay scenarios | 安全 訓練 can be overridden by strong persona instructions |
| Multi-turn escalation | Gradually shifting context over many messages | Each individual message appears benign; harm emerges from the sequence |
| Language switching | Requesting harmful content in low-resource languages | 安全 訓練 is weaker in underrepresented languages |
| Prompt injection | Embedded instructions in tool outputs, images, documents | Instructions bypass 使用者-facing 安全 layer |
The fundamental challenge is that 模型's 安全 訓練 is part of the same system that processes 對抗性 inputs. 攻擊者 who can manipulate 模型's processing can also manipulate its 安全 behavior.
How Constitutional Classifiers Work
Architecture 概覽
Constitutional Classifiers deploy as a two-layer 防禦 around the base model:
User 輸入 → [輸入 Classifier] → Base Model → [輸出 Classifier] → User 輸出
↓ (block) ↓ (block)
Refusal Response Refusal Response
The 輸入 classifier screens incoming prompts for 對抗性 intent. The 輸出 classifier screens 模型's responses for harmful content. Both classifiers are separate models, independent of the base model being protected.
The Constitutional AI Training Pipeline
The key innovation is how 訓練資料 for the classifiers is generated. Rather than relying on human-labeled datasets (which are expensive and cannot cover the full space of 對抗性 techniques), Constitutional Classifiers use a self-supervised pipeline inspired by Constitutional AI:
Define constitutional principles
Researchers define a set of principles that describe what constitutes harmful content. These are natural language descriptions -- 例如, "Content that provides step-by-step instructions for synthesizing controlled substances" or "Content that could be used to generate convincing phishing messages targeting specific individuals."
Generate synthetic 對抗性 examples
An LLM is used to generate a large, diverse set of 對抗性 prompts that attempt to elicit violations of each constitutional principle. This includes varied attack formats: encoding tricks, persona manipulation, multi-language requests, and novel techniques the LLM can imagine.
Generate synthetic benign examples
The same process generates benign prompts that are topically similar to the harmful ones but do not violate any principles. 這是 critical for reducing false positives -- the classifier needs to distinguish "how does methamphetamine affect the brain?" (benign, educational) from "how do I synthesize methamphetamine?" (harmful).
Constitutional 評估
The generated examples are evaluated against the constitutional principles by another LLM, producing labels (harmful/benign) with explanations. The explanations are used for quality control and to 識別 edge cases.
Train classifiers
Standard classifier 訓練 on the labeled synthetic data, producing models optimized for detecting 對抗性 inputs and harmful outputs.
Iterate with 對抗性 feedback
Red teamers attack the classifier-defended system, and successful attacks are fed back into the 訓練 pipeline to generate more 訓練資料 covering the discovered gaps.
The 紅隊 Study: 3,000+ Hours of 對抗性 測試
Study Design
Anthropic conducted one of the most extensive 對抗性 evaluations of an AI 防禦 system to date. The study involved:
| Parameter | Detail |
|---|---|
| Participants | Professional red teamers with experience in AI 安全 |
| Duration | Over 3,000 person-hours of 對抗性 測試 |
| Objective | Bypass Constitutional Classifiers to extract harmful content from a defended model |
| Baseline | The same model without Constitutional Classifiers |
| Metric | 攻擊 success rate (ASR) -- percentage of attempts that successfully elicited harmful content |
Results
The headline results were striking:
| Configuration | 攻擊 Success Rate | Notes |
|---|---|---|
| Undefended model | ~86% | Most 越獄 techniques succeeded against the base model |
| Model with Constitutional Classifiers | ~4.4% | Dramatic reduction in successful attacks |
This represents a roughly 20x reduction in attack success rate.
What 攻擊 Still Succeeded?
The ~4.4% of attacks that bypassed Constitutional Classifiers generally fell into several categories:
| Category | Description | Why It Bypassed |
|---|---|---|
| Novel encoding schemes | Custom or very obscure encoding formats | Not represented in 訓練資料 |
| Extreme context manipulation | Very long conversations that gradually shifted context | Classifiers had limited 上下文視窗 for analysis |
| Semantic camouflage | Harmful content expressed using domain-specific jargon or metaphor | Surface-level features did not match harmful patterns |
| 對抗性 examples targeting classifier | Inputs specifically crafted to fool the classifier model | Classifiers are themselves ML models vulnerable to 對抗性 examples |
Trade-offs and Limitations
The Refusal Rate Problem
Constitutional Classifiers increase 安全, but they also increase the rate at which benign queries are incorrectly refused. 這是 the fundamental precision-recall trade-off in any classification system:
| Metric | Without Classifiers | With Classifiers |
|---|---|---|
| 攻擊 success rate | ~86% | ~4.4% |
| False refusal rate on benign queries | Low | Measurably increased |
| User experience impact | Low (but unsafe) | Noticeable (but safer) |
Borderline Content
The most challenging area for Constitutional Classifiers is borderline content -- requests that could be harmful or benign depending on context:
- "How to pick a lock" -- legitimate locksmith education or burglary preparation?
- "Write a story where a character explains how to make explosives" -- creative fiction or information laundering?
- "What are the symptoms of 投毒 by [specific substance]?" -- medical education or harm planning?
Constitutional Classifiers tend to err on the side of caution with borderline content, which contributes to the increased false refusal rate. The constitutional principles can be tuned to adjust this threshold, but 存在 no configuration that simultaneously minimizes both false positives and false negatives.
Latency Impact
Adding two classifier 推論 passes (輸入 and 輸出) to every request introduces latency:
| Component | Typical Latency Impact |
|---|---|
| 輸入 classifier | 50-200ms additional |
| 輸出 classifier | 50-200ms additional |
| Total overhead | 100-400ms per request |
For real-time applications, this overhead may be acceptable. For latency-sensitive use cases (code completion, interactive 代理), it may require optimization (smaller classifier models, batched 推論, speculative execution).
Comparison with Other 防禦 Approaches
防禦 Landscape Context
Constitutional Classifiers are one approach in a broader 防禦 ecosystem. 理解 how they compare helps red teamers 評估 the 防禦 they face:
| 防禦 Approach | 運作方式 | Strengths | Weaknesses |
|---|---|---|---|
| 安全 訓練 (RLHF/DPO) | Train 模型 itself to refuse harmful requests | Low latency, no additional infrastructure | Can be jailbroken; model is both judge and executor |
| Constitutional Classifiers | Independent classifiers screen 輸入/輸出 | 防禦-in-depth; hard to 越獄 both model and classifier | Latency overhead; false refusals; requires maintaining separate models |
| Instruction hierarchy | Train model to prioritize system instructions | Addresses 提示詞注入 directly | Does not help with direct jailbreaks |
| 輸出 filtering (keyword/regex) | Pattern matching on model 輸出 | Fast, simple, no ML needed | Easily bypassed with paraphrasing; high false positive rate |
| Dual LLM / CaMeL | Separate trusted and untrusted processing | Strong isolation for tool-using 代理 | Architectural complexity; primarily targets 提示詞注入 |
Key Differentiator
The unique value of Constitutional Classifiers is that they are independent of the base model. Jailbreaking the base model does not 越獄 the classifiers. 攻擊者 must either:
- Craft an 輸入 that bypasses the 輸入 classifier AND jailbreaks the base model, or
- 越獄 the base model AND craft 輸出 that bypasses the 輸出 classifier, or
- Bypass both classifiers simultaneously
This significantly raises the bar compared to 防禦 that rely on 模型 itself.
紅隊 Implications
Attacking Constitutional Classifiers
For red teamers facing Constitutional Classifier 防禦, the 攻擊面 shifts:
| Traditional Target | New Target | Technique Adaptation |
|---|---|---|
| Model's 安全 訓練 | 輸入 classifier model | 對抗性 examples that fool the classifier while carrying harmful intent |
| 系統提示詞 | Classifier decision boundary | Find inputs near the boundary that are misclassified as benign |
| Model reasoning | Classifier 訓練資料 gaps | Use formats and encodings not well-represented in the synthetic 訓練資料 |
| Single model | Two-model system | 考慮 attacks that 利用 the interaction between classifier and base model |
Practical 攻擊 Strategies
- Classifier probing: Send a range of inputs to map the classifier's decision boundary. 識別 which features trigger refusal and which do not.
- Encoding diversity: 測試 unusual encodings, character sets, and formatting that may not be covered by the synthetic 訓練資料.
- Semantic indirection: Express harmful intent through analogy, metaphor, or domain-specific language that the classifier may not recognize as harmful.
- Split-request attacks: Distribute harmful content across multiple benign-appearing requests that combine to form harmful information.
- Classifier 對抗性 examples: If the classifier architecture is known, craft inputs specifically designed to be 對抗性 to that classifier.
Deployment Considerations
When to Use Constitutional Classifiers
| Scenario | Recommendation | Rationale |
|---|---|---|
| High-stakes application (medical, legal, financial) | Strongly recommended | 安全 benefit justifies latency and false refusal costs |
| Consumer chatbot with high volume | 考慮 with tuned thresholds | Balance 安全 with user experience |
| Internal enterprise assistant | Depends on data sensitivity | May be unnecessary if usage is low-risk and monitored |
| Code generation tool | 考慮 for 輸出 classifier only | 輸入 classifier may block legitimate code-related queries |
| Creative writing assistant | Use with relaxed thresholds | Overly aggressive classifiers impede creative use cases |
Integration with Existing 防禦
Constitutional Classifiers are most effective as part of a layered 防禦:
- 輸入 validation (non-ML) -- block known malicious patterns, enforce format constraints
- Constitutional Classifiers (輸入) -- ML-based screening for 對抗性 intent
- 安全-trained base model -- 模型's own 安全 訓練 as a 防禦 layer
- Constitutional Classifiers (輸出) -- ML-based screening for harmful content
- 輸出 validation (non-ML) -- format enforcement, PII redaction, compliance checks
Further Reading
- Advanced 防禦 Techniques -- Broader survey of cutting-edge 防禦 research
- CaMeL / Dual LLM Pattern -- Complementary architectural 防禦 for tool-using 代理
- 護欄 & 安全 Layer Architecture -- Where Constitutional Classifiers fit in the overall 安全 architecture
- Alignment Faking -- Why independent classifiers may be more reliable than model self-regulation
相關主題
- 護欄 & 安全 Layer Architecture - Architectural context for classifier deployment
- AI-Powered 紅隊演練 - Automated methods used to 測試 防禦 like Constitutional Classifiers
- Watermarking & AI-Generated Text 偵測 - Another advanced 防禦 technique
參考文獻
- "Constitutional AI: Harmlessness from AI Feedback" - Bai, Y., et al., Anthropic (2022) - The foundational constitutional AI paper that inspired the classifier 訓練 approach
- "Defending Against Jailbreaks with Constitutional Classifiers" - Anthropic (2025) - The paper introducing Constitutional Classifiers and presenting the 紅隊 評估 results
- "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" - Wallace, E., et al., OpenAI (2024) - Complementary 防禦 approach operating at 模型 訓練 level
- "Jailbroken: How Does LLM 安全 Training Fail?" - Wei, A., et al. (2024) - Analysis of why 安全 訓練 alone is insufficient, motivating external classifier approaches
What is the primary advantage of Constitutional Classifiers over 安全 訓練 alone as a 防禦 against jailbreaks?