# constitutional-ai
12 articles tagged with “constitutional-ai”
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Constitutional Classifiers
Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.
Advanced Defense Techniques
Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety: what is promising versus what is actually deployed.
Constitutional AI as Defense Strategy
Using constitutional AI principles to build inherently safer LLM applications resistant to attacks.
Constitutional AI Bypass Techniques
Analyzing and bypassing constitutional AI training through adversarial constitutions and principle manipulation.
Constitutional AI Limitations Research
Research on the limitations of constitutional AI approaches and known bypass categories.
Lab: Constitutional AI Bypass Techniques
Test and bypass Constitutional AI safety mechanisms by exploiting the critique-revision training methodology.
Claude Attack Surface
Claude-specific attack vectors including Constitutional AI weaknesses, tool use exploitation, system prompt handling, vision attacks, and XML tag injection techniques.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Constitutional AI Hacking
Attack surfaces in Constitutional AI training: exploiting self-critique loops, manipulating constitutional principles, and red teaming RLAIF pipelines.
Constitutional AI Implementation Guide
Implement constitutional AI principles in a custom fine-tuning and RLHF pipeline.
Constitutional Classifier Setup
Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.