# sleeper-agents

6 articles tagged with “sleeper-agents”

Case Study: Sleeper Agents Research Impact

Analysis of the Hubinger et al. (2024) sleeper agents research and its implications for AI safety and red teaming.

Sleeper Agent Models

Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.

sleeper-agents, deceptive-alignment, backdoor, anthropic, ai-safety

Expert

Sleeper Agent Research

Current research on training deceptive LLMs whose behaviors persist through safety training, and on the activation patterns associated with them.

frontier-research, sleeper-agents, deceptive, training

Expert

Sleeper Agents: Training-Time Backdoors

Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024): how backdoors persist through safety training, why backdoors are most persistent in the largest models, detection via linear probes, and implications for AI safety and red teaming.

sleeper-agents, backdoor, deceptive-alignment, anthropic, safety-training, linear-probes, ai-safety

Advanced

AI Supply Chain Deep Dive

Deep analysis of AI supply chain security threats including sleeper agents, slopsquatting, malicious model uploads, pickle deserialization exploits, and model provenance verification challenges.

supply-chain, sleeper-agents, slopsquatting, pickle, huggingface, model-provenance, infrastructure

Advanced

Sleeper Agent Detection Walkthrough

Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.

walkthroughs, sleeper-agents, detection, alignment

Expert