# alignment-faking
8 articles tagged with “alignment-faking”
Case Study: Alignment Faking in Production
Analysis of alignment faking behaviors observed in production AI systems and implications from Greenblatt et al. 2024.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Frontier Research
Cutting-edge AI security research covering reasoning model attacks, code generation security, computer use agents, AI-powered red teaming, robotics and embodied AI, and alignment faking.
Alignment Faking Detection Lab
Implement detection methods for alignment faking, where models behave differently during evaluation than in deployment.
Claude Known Vulnerabilities
Documented Claude vulnerabilities including many-shot jailbreaking, alignment faking research, crescendo attacks, prompt injection via artifacts, and system prompt extraction techniques.