# deceptive-alignment
Articles tagged "deceptive-alignment"
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Deceptive Alignment Theory
Theoretical frameworks for understanding and predicting deceptive alignment in advanced AI systems.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
Deceptive Alignment Testing Framework
Build a testing framework for detecting mesa-optimization and deceptive alignment in fine-tuned models.
Emergence & Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.