# ai-safety
10 articles tagged "ai-safety"
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior, as shown in Anthropic's landmark December 2024 research on deceptive alignment in practice.
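Model Organisms of Misalignment
Deliberately constructing misaligned models for research: methodology, threat model instantiation, experimental frameworks, and the insights model organisms reveal about AI safety failures.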
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Steganographic Reasoning
Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.
Unfaithful Chain-of-Thought Reasoning
Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.
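Sleeper Agents: Backdoors Implanted During Training
A comprehensive analysis of the sleeper agents research by Hubinger et al. (Anthropic, January 2024): how backdoors survive safety training, why larger models retain backdoors more persistently, detection with linear probes, and implications for AI safety and red teaming.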
Post-Executive Order AI Governance Landscape
The US AI governance landscape after the rescission of Executive Order 14110: what was lost, what remains, and how it affects AI red teaming practice and the broader regulatory environment.
Analysis of US Executive Order 14110
Analysis of Executive Order 14110 on Safe, Secure, and Trustworthy AI and its implications.