# safety-classifiers
2 articles tagged with “safety-classifiers”
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering