# sleeper-agents
12 articles tagged "sleeper-agents"
Case Study: Sleeper Agents Research Impact
Analysis of Hubinger et al. 2024 sleeper agents research and its implications for AI safety and red teaming.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Sleeper Agent Research
Current research on training deceptive LLMs whose behaviors persist through safety training, and on their activation patterns.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.
AI Supply Chain Deep Dive
Deep analysis of AI supply chain security threats including sleeper agents, slopsquatting, malicious model uploads, pickle deserialization exploits, and model provenance verification challenges.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.