# deceptive-alignment
6 articles tagged with “deceptive-alignment”
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Deceptive Alignment Theory
Theoretical frameworks for understanding and predicting deceptive alignment in advanced AI systems.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why the largest models show the most persistent backdoor behavior, detection via linear probes, and implications for AI safety and red teaming.
Deceptive Alignment Testing Framework
Build a testing framework for detecting mesa-optimization and deceptive alignment in fine-tuned models.
Emergence & Capability Jump Exploitation
How emergent capabilities create unpredictable security properties: testing for hidden capabilities, sleeper agent scenarios, deceptive alignment concerns, and capability elicitation.