redteams.ai

# safety-classifiers

1 article tagged with “safety-classifiers”

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering
Expert