# representation-engineering

9 articles tagged with “representation-engineering”

Advanced Defense Techniques

Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety: what is promising versus what is actually deployed.

advanced-defense, instruction-hierarchy, constitutional-ai, representation-engineering, research

Expert

Representation Engineering for Security

Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.

representation-engineering, activation-steering, interpretability, internal-representations, safety

Expert

Representation Engineering for Security (Frontier Research)

Using representation engineering for security analysis, behavior modification, and vulnerability detection.

frontier-research, representation-engineering, security, interpretability

Expert

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignment, RLHF, DPO, safety-classifiers, refusal, representation-engineering

Expert

Activation Steering

Manipulating model behavior by adding learned steering vectors to intermediate activations, bypassing safety training through direct representation engineering.

activation-steering, representation-engineering, steering-vectors, mechanistic, safety-bypass

Expert

Lab: Representation Engineering for Security

Use representation engineering to analyze and manipulate internal model representations for security research.

labs, representation-engineering, security, advanced

Advanced

Representation Engineering Attacks

Manipulate internal model representations to alter behavior without prompt modification.

labs, representation-engineering, attacks, expert

Expert

Representation Engineering for Behavior Steering

Use representation engineering to steer model behavior by manipulating activation vectors during inference.

labs, representation-engineering, behavior-steering, expert

Expert

Activation Manipulation & Safety Bypass

How identifying and suppressing safety-critical activations, targeting refusal direction vectors, and applying activation steering can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.

activation-steering, refusal-direction, representation-engineering, IRIS, safety-bypass, mechanistic-interpretability

Advanced