# alignment-faking

8 articles tagged "alignment-faking"

Case Study: Alignment Faking in Production

An analysis of alignment-faking behavior observed in production AI systems, and the lessons revealed by Greenblatt et al. 2024.

Alignment Faking Detection Methods

Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.

alignment-faking, detection, interpretability, behavioral-testing, ai-safety, evaluation

Expert

Alignment Faking in Large Language Models

How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.

alignment-faking, deceptive-alignment, ai-safety, training-gaming, anthropic-research

Advanced

Training Implications of Alignment Faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

alignment-faking, training, rlhf, safety-training, evaluation, ai-safety

Expert