# refusal

2 articlestagged with “refusal”

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignmentRLHFDPOsafety-classifiersrefusalrepresentation-engineering

Expert

Safety Neurons and Circuits

Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.

safety-neuronscircuitsmechanistic-interpretabilityrefusalablation

Expert