redteams.ai

# anthropic-research

1 article tagged with “anthropic-research”

Alignment Faking in Large Language Models

How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.

alignment-faking, deceptive-alignment, ai-safety, training-gaming, anthropic-research
Advanced