# safety-classifiers
2 articles tagged "safety-classifiers"
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering