Representation Engineering for Security
Reading and manipulating a model's internal representations for security purposes: activation steering, concept probing, representation-level safety controls, and other security applications of representation engineering.
representation-engineering, activation-steering, interpretability, internal-representations, safety