# refusal
4 articles tagged "refusal"
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
safety-neurons, circuits, mechanistic-interpretability, refusal, ablation
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
safety-neurons, circuits, mechanistic-interpretability, refusal, ablation