# linear-probes
2 articles tagged "linear-probes"
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024): how backdoors persist through safety training, why persistence is strongest in the largest models, detection via linear probes, and implications for AI safety and red teaming.
sleeper-agents, backdoor, deceptive-alignment, anthropic, safety-training, linear-probes, ai-safety
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024): how backdoors persist through safety training, why persistence is strongest in the largest models, detection via linear probes, and implications for AI safety and red teaming.
sleeper-agents, backdoor, deceptive-alignment, anthropic, safety-training, linear-probes, ai-safety