# deceptive-alignment
6 articles tagged "deceptive-alignment"
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.
alignment-faking, deceptive-alignment, ai-safety, training-gaming, anthropic-research
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
sleeper-agents, deceptive-alignment, backdoor, anthropic, ai-safety
Deceptive Alignment Theory
Advanced theoretical research on deceptive alignment, superficial alignment, and related topics.
frontier-research, deceptive-alignment, theory, mesa-optimization
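Sleeper Agents: Backdoors Implanted During Training
A comprehensive analysis of the sleeper agents research by Hubinger et al. (Anthropic, January 2024): how backdoors survive safety training, why larger models retain backdoors more robustly, detection with linear probes, and implications for AI safety and red teaming.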
sleeper-agents, backdoor, deceptive-alignment, anthropic, safety-training, linear-probes, ai-safety
Deceptive Alignment Testing Framework
Build a testing framework for detecting mesa-optimization and deceptive alignment in fine-tuned models.
labs, deceptive-alignment, testing, expert
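Exploiting Emergence and Capability Jumps
How emergent capabilities give rise to unpredictable safety properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.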
emergence, capability, deceptive-alignment, sleeper-agent, hidden-capability, scaling