safety-training — AI Red Teaming Articles

Jailbreaking Techniques Assessment

Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.

assessment, jailbreaking, bypass, safety-training, red-teaming

Intermediate

RLHF and Safety Alignment

Understanding RLHF safety training and why it creates a safety layer that can be bypassed rather than a fundamental one.

foundations, rlhf, alignment, safety-training

Intermediate

Training Implications of Alignment Faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

alignment-faking, training, rlhf, safety-training, evaluation, ai-safety

Expert

Sleeper Agents: Training-Time Backdoors

Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024): how backdoors persist through safety training, why they are most persistent in larger models, detection via linear probes, and implications for AI safety and red teaming.

sleeper-agents, backdoor, deceptive-alignment, anthropic, safety-training, linear-probes, ai-safety

Advanced

Adaptive Attacks Against Safety Training

Research on attacks that adapt to safety training procedures to discover persistent vulnerabilities.

frontier-research, adaptive-attacks, safety-training, research

Advanced

The Alignment Tax

How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.

alignment, safety-training, capabilities, tradeoffs, alignment-tax

Advanced

Fine-Tuning Attack Surface

Comprehensive overview of fine-tuning security vulnerabilities, including SFT data poisoning, RLHF manipulation, alignment tax, and other fine-tuning attack vectors.

fine-tuning, attack-surface, SFT, RLHF, alignment, DPO, safety-training

Advanced


