redteams.ai

# refusal

2 articles tagged with “refusal”

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Expert

Safety Neurons and Circuits

Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.

safety-neurons, circuits, mechanistic-interpretability, refusal, ablation
Expert