# ai-safety
11 articles tagged "ai-safety"
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking in Large Language Models
How frontier AI models can strategically appear aligned during training while preserving misaligned behavior: Anthropic's landmark December 2024 research on deceptive alignment in practice.
Model Organisms of Misalignment
Deliberately creating misaligned models for study: methodology, threat model instantiation, experimental frameworks, and what model organisms reveal about AI safety failures.
Sleeper Agent Models
Anthropic's research on models that behave differently when triggered by specific conditions: deceptive alignment, conditional backdoors, training-resistant deceptive behaviors, and implications for AI safety.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Steganographic Reasoning
Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.
Unfaithful Chain-of-Thought Reasoning
Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why backdoored behavior is most persistent in larger models, detection via linear probes, and implications for AI safety and red teaming.
Post-Executive Order AI Governance Landscape
The US AI governance landscape after the rescission of Executive Order 14110: what was lost, what remains, and how it affects AI red teaming practice and the broader regulatory environment.
US Executive Order 14110 Analysis
Analysis of Executive Order 14110 on Safe, Secure, and Trustworthy AI and its implications.
LLM Jailbreaking in 2026: 97% Success Rates, Autonomous Attacks, and a Failing Arms Race
Nature Communications confirms that AI reasoning models can autonomously jailbreak other large language models with a 97% success rate, and JBFuzz reaches a 99% success rate within 60 seconds. This article examines what the latest 2026 research reveals about the state of AI safety, and why existing defenses are failing.