# alignment-faking
8 articles tagged "alignment-faking"
Case Study: Alignment Faking in Production
An analysis of alignment faking behavior observed in production AI systems, and the implications of Greenblatt et al. 2024.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
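Alignment Faking Detection
Research techniques for detecting whether a model fakes alignment during training but behaves differently once deployed.
Frontier Research
Cutting-edge AI safety research covering attacks on reasoning models, code generation security, computer-use agents, AI-driven red teaming, robotics and embodied AI, and alignment faking.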
Alignment Faking Detection Lab
Implement detection methods for alignment faking behaviors where models behave differently during evaluation versus deployment.
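Known Claude Vulnerabilities
Documented Claude vulnerabilities, including many-shot jailbreaks, alignment faking research, crescendo attacks, prompt injection via artifacts, and system prompt extraction techniques.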