# refusal
2 articles tagged with “refusal”
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
safety-neurons, circuits, mechanistic-interpretability, refusal, ablation