# deceptive-alignment
6 articles tagged with “deceptive-alignment”
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Deceptive Alignment Theory
Theoretical frameworks for understanding and predicting deceptive alignment in advanced AI systems.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why the largest models show the most persistent backdoor behavior, detection via linear probes, and implications for AI safety and red teaming.
Deceptive Alignment Testing Framework
Build a testing framework for detecting mesa-optimization and deceptive alignment in fine-tuned models.
Emergence & Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.