# sleeper-agents
12 articles tagged "sleeper-agents"
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper Agent Research
Current research on training deceptive LLMs whose behaviors persist through safety training, and on their activation patterns.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
AI Supply Chain Deep Dive
Deep analysis of AI supply chain security threats including sleeper agents, slopsquatting, malicious model uploads, pickle deserialization exploits, and model provenance verification challenges.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.