# ai-safety
11 articles tagged "ai-safety"
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.
Model Organisms of Misalignment
Deliberately creating misaligned models for study: methodology, threat model instantiation, experimental frameworks, and what model organisms reveal about AI safety failures.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Steganographic Reasoning
Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.
Unfaithful Chain-of-Thought Reasoning
Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why backdoored behavior is most persistent in larger models, detection via linear probes, and implications for AI safety and red teaming.
Post-Executive Order AI Governance Landscape
The US AI governance landscape after the rescission of Executive Order 14110: what was lost, what remains, and how it affects AI red teaming practice and the broader regulatory environment.
US Executive Order 14110 Analysis
Analysis of Executive Order 14110 on Safe, Secure, and Trustworthy AI and its implications.
LLM Jailbreaking in 2026: 97% Success Rates, Autonomous Attacks, and a Failing Arms Race
Nature Communications confirms that AI reasoning models can autonomously jailbreak other large language models with a 97% success rate, and JBFuzz reaches a 99% success rate within 60 seconds. This article examines what the latest 2026 research reveals about the state of AI safety, and why existing defenses are failing.