# ai-safety
10 articles tagged with “ai-safety”
## Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.

## Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.

## Model Organisms of Misalignment
Deliberately creating misaligned models for study: methodology, threat model instantiation, experimental frameworks, and what model organisms reveal about AI safety failures.

## Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.

## Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

## Steganographic Reasoning
Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.

## Unfaithful Chain-of-Thought Reasoning
Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.

## Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why backdoor persistence increases with model scale, detection via linear probes, and implications for AI safety and red teaming.

## Post-Executive Order AI Governance Landscape
The US AI governance landscape after the rescission of Executive Order 14110: what was lost, what remains, and how it affects AI red teaming practice and the broader regulatory environment.

## US Executive Order 14110 Analysis
Analysis of Executive Order 14110 on Safe, Secure, and Trustworthy AI and its implications.