Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
Tags: alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering