# representation-engineering
9 articles tagged with “representation-engineering”
Advanced Defense Techniques
Cutting-edge defense research, including instruction hierarchy, constitutional AI, and representation engineering for safety, with a look at what is promising versus what is actually deployed.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Representation Engineering for Security (Frontier Research)
Using representation engineering for security analysis, behavior modification, and vulnerability detection.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
Activation Steering
Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.
Lab: Representation Engineering for Security
Use representation engineering to analyze and manipulate internal model representations for security research.
Representation Engineering Attacks
Manipulate internal model representations to alter behavior without prompt modification.
Representation Engineering for Behavior Steering
Use representation engineering to steer model behavior by manipulating activation vectors during inference.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, together with refusal direction vectors and activation steering techniques, can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.