# safety-training
7 articles tagged with “safety-training”
Jailbreaking Techniques Assessment
Test your knowledge of LLM jailbreaking methods, bypass strategies, and the mechanics behind safety training circumvention with 10 intermediate-level questions.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Sleeper Agents: Training-Time Backdoors
Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024): how backdoors persist through safety training, why they are most persistent in larger models, detection via linear probes, and implications for AI safety and red teaming.
Adaptive Attacks Against Safety Training
Research on attacks that adapt to safety training procedures to discover persistent vulnerabilities.
The Alignment Tax
How safety training affects model capabilities: capability-safety tradeoffs, the cost of alignment, measuring alignment tax, and strategies for minimizing capability loss during safety training.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.