Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Traditional approaches to AI safety operate at the input-output level: filter inputs, train on preferred outputs, classify outputs. Representation engineering operates at the internal level -- directly reading and modifying the model's internal representations (activations) as they flow through the network. This enables both more precise safety controls and more dangerous attacks. For red teamers, understanding representation engineering is essential because it reveals what the model is "thinking" during processing, not just what it says.
Foundations: Representations as Vectors
Neural network activations at each layer form a high-dimensional vector space. Research has shown that meaningful concepts are often encoded as linear directions in this space.
Layer 15 activation space (simplified to 2D):

```
                 "Truthful"
                     ↑
                     │    ● "Honest response to hard question"
                     │
                     │    ● "Neutral factual response"
    ─────────────────┼────────────────────────────────▶ "Harmful"
                     │
                     │    ● "Safety refusal"
                     │
                     ↓
                "Deceptive"
```

Finding Safety-Relevant Directions
```python
import torch
import numpy as np

def find_concept_direction(
    model,
    positive_examples: list,   # Examples where concept is present
    negative_examples: list,   # Examples where concept is absent
    layer: int
):
    """Find the direction in activation space that encodes a concept."""
    positive_activations = []
    negative_activations = []

    for text in positive_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        # Get activation at specified layer, last token position
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        positive_activations.append(activation)

    for text in negative_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        negative_activations.append(activation)

    # The concept direction is the difference between mean activations
    positive_mean = np.mean(positive_activations, axis=0)
    negative_mean = np.mean(negative_activations, axis=0)
    direction = positive_mean - negative_mean

    # Normalize
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return direction
```

Activation Steering
Activation steering is the core technique of representation engineering. By adding a direction vector to the model's activations at specific layers during inference, you can push the model's behavior toward or away from a concept.
Steering for Safety
```python
class ActivationSteerer:
    """Steer model behavior by modifying internal activations."""

    def __init__(self, model, direction: np.ndarray, layer: int, scale: float = 1.0):
        self.model = model
        self.direction = torch.tensor(direction, dtype=torch.float32)
        self.layer = layer
        self.scale = scale
        self._hook_handle = None

    def _steering_hook(self, module, input, output):
        """Hook function that adds the steering vector to activations."""
        # Transformer blocks often return a tuple; the hidden states
        # ([batch, seq_len, hidden_dim]) are the first element
        hidden = output[0] if isinstance(output, tuple) else output
        steering_vector = self.direction.to(hidden.device) * self.scale
        hidden = hidden + steering_vector
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    def enable(self):
        """Attach the steering hook to the target layer."""
        # Module path assumes a GPT-2-style layout; adjust for other architectures
        target_layer = self.model.transformer.h[self.layer]
        self._hook_handle = target_layer.register_forward_hook(self._steering_hook)

    def disable(self):
        """Remove the steering hook."""
        if self._hook_handle is not None:
            self._hook_handle.remove()
            self._hook_handle = None

    def generate_steered(self, prompt: str, **kwargs):
        """Generate text with activation steering applied."""
        self.enable()
        try:
            # Assumes a wrapper whose generate() accepts raw text;
            # for a bare HF model, tokenize first and pass input ids
            output = self.model.generate(prompt, **kwargs)
        finally:
            self.disable()
        return output
```

Steering Directions for Security
| Direction | Positive End | Negative End | Security Application |
|---|---|---|---|
| Honesty | Truthful, accurate | Deceptive, fabricating | Detect and prevent hallucination |
| Safety | Cautious, refusing harm | Willing to help with harm | Enforce safety without RLHF |
| Compliance | Follows instructions precisely | Ignores or subverts instructions | Detect instruction-following attacks |
| Uncertainty | Acknowledges limits | Overconfident | Calibrate confidence in outputs |
| Sycophancy | Agrees with user | Independent judgment | Detect reward hacking |
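Each direction in the table is extracted with the same contrastive, difference-of-means recipe. A minimal sanity check on synthetic stand-in activations (not real model data) shows why projecting onto that direction separates the two ends of a concept:

```python
import numpy as np

# Synthetic stand-in activations: in practice these come from
# find_concept_direction() run on contrastive prompt pairs, e.g.
# honest vs. deceptive completions of the same question.
rng = np.random.default_rng(0)
honest = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
deceptive = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))

# Difference-of-means direction, normalized (same recipe as above)
direction = honest.mean(axis=0) - deceptive.mean(axis=0)
direction = direction / (np.linalg.norm(direction) + 1e-8)

# Projections onto the direction separate the two classes
honest_proj = honest @ direction
deceptive_proj = deceptive @ direction
assert honest_proj.mean() > deceptive_proj.mean()
```

With real activations the separation is noisier, which is why probe accuracy should be measured on held-out examples before a direction is trusted.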
Security Applications
Application 1: Deception Detection
By probing the model's activations for the "honesty direction," we can detect when the model is generating deceptive content -- even if the output text appears truthful.
```python
class DeceptionDetector:
    """Detect deceptive model behavior using representation probing."""

    def __init__(self, model, honesty_direction: np.ndarray, layer: int):
        self.model = model
        self.honesty_direction = torch.tensor(honesty_direction, dtype=torch.float32)
        self.layer = layer

    def assess_honesty(self, prompt: str) -> dict:
        """Measure how honest the model is being in its response."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        activations = []

        def capture_hook(module, input, output):
            # Transformer blocks often return a tuple; keep the hidden states
            hidden = output[0] if isinstance(output, tuple) else output
            activations.append(hidden.detach())

        hook = self.model.transformer.h[self.layer].register_forward_hook(capture_hook)
        try:
            with torch.no_grad():
                output = self.model.generate(**tokens, max_new_tokens=200)
        finally:
            hook.remove()

        if activations:
            # Project activations onto honesty direction
            act = activations[0][0]   # [seq_len, hidden_dim]
            projections = torch.matmul(act, self.honesty_direction.to(act.device))
            avg_honesty = projections.mean().item()
            min_honesty = projections.min().item()
            return {
                "average_honesty_score": avg_honesty,
                "minimum_honesty_score": min_honesty,
                "likely_deceptive": min_honesty < -0.5,
                "response": self.model.tokenizer.decode(output[0]),
            }
        return {"error": "No activations captured"}
```

Application 2: Jailbreak Resistance
Representation engineering can make models more resistant to jailbreaks by continuously steering toward the safety direction during inference, regardless of the input prompt.
```python
def safety_steered_inference(
    model,
    prompt: str,
    safety_direction: np.ndarray,
    layer: int,
    safety_scale: float = 2.0
):
    """Generate responses with continuous safety steering."""
    steerer = ActivationSteerer(model, safety_direction, layer, scale=safety_scale)
    response = steerer.generate_steered(prompt)
    return response
```

Unlike RLHF-based safety (which can be bypassed through prompt engineering), activation steering modifies the model's processing directly. A jailbreak prompt that manipulates the text input does not necessarily affect the activation-level steering vector. This provides a different kind of robustness.
Application 3: Sleeper Agent Detection
Representation probes can potentially detect sleeper agent behavior by monitoring for activation patterns associated with context detection and conditional policy switching.
```python
def probe_for_conditional_policy(
    model,
    prompt: str,
    contexts: list,   # List of (name, prompt) pairs: training-like, deployment-like, etc.
    layer: int
):
    """Test for conditional behavior at the representation level."""
    context_activations = {}
    for context_name, context_prompt in contexts:
        full_prompt = context_prompt + "\n" + prompt
        tokens = model.tokenizer(full_prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        context_activations[context_name] = activation

    # Compare activations across contexts.
    # Large differences suggest context-dependent internal processing.
    comparisons = {}
    context_names = list(context_activations.keys())
    for i in range(len(context_names)):
        for j in range(i + 1, len(context_names)):
            name_i, name_j = context_names[i], context_names[j]
            cosine_sim = np.dot(context_activations[name_i], context_activations[name_j]) / (
                np.linalg.norm(context_activations[name_i]) *
                np.linalg.norm(context_activations[name_j]) + 1e-8
            )
            comparisons[f"{name_i}_vs_{name_j}"] = {
                "cosine_similarity": float(cosine_sim),
                "suspicious": cosine_sim < 0.9   # Heuristic threshold: low similarity = different processing
            }
    return comparisons
```

Offensive Applications
Representation engineering is also an attack tool. An attacker with model access can use the same techniques to weaken safety.
Anti-Safety Steering
Steering in the negative safety direction can suppress safety training without any fine-tuning.
```python
# CAUTION: This demonstrates the attack for defensive understanding
def demonstrate_anti_safety_steering(
    model,
    safety_direction: np.ndarray,
    test_prompts: list,
    layer: int
):
    """Show how steering can suppress safety training."""
    results = []
    for prompt in test_prompts:
        # Normal response (safety intact)
        normal_response = model.generate(prompt)

        # Anti-safety steered response: steer along the negated safety direction
        anti_safety_steerer = ActivationSteerer(
            model, -safety_direction, layer, scale=3.0
        )
        steered_response = anti_safety_steerer.generate_steered(prompt)

        results.append({
            "prompt": prompt,
            "normal_response": normal_response,
            "steered_response": steered_response,
            # is_refusal() is assumed to be a refusal classifier defined elsewhere
            "safety_bypassed": not is_refusal(steered_response) and is_refusal(normal_response)
        })
    return results
```

Backdoor Insertion via Representation Engineering
Instead of fine-tuning a backdoor into model weights, an attacker can use activation steering to inject conditional behavior at inference time through hooks on specific layers.
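A minimal sketch of the mechanism, for defensive understanding: a forward hook that applies steering only when an attacker-controlled trigger predicate fires. The `trigger_active` callable and the precomputed `direction` vector are hypothetical stand-ins, not part of any real API:

```python
import torch

def make_conditional_hook(trigger_active, direction, scale=3.0):
    """Build a forward hook that steers activations only when triggered.

    trigger_active: zero-arg callable set by the attacker's wrapper
    (e.g. after scanning the prompt for a trigger phrase).
    direction: precomputed steering vector (hypothetical).
    """
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; hidden states come first
        hidden = output[0] if isinstance(output, tuple) else output
        if trigger_active():
            hidden = hidden + direction.to(hidden.device) * scale
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook
```

Because nothing is written to the weights, such a backdoor survives weight inspection and only exists while the hook is registered, which is what makes runtime integrity monitoring relevant as a defense.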
Limitations and Open Challenges
Linearity Assumption
Representation engineering assumes that concepts are encoded as linear directions. This is approximately true for many concepts but not universally true. Non-linear concept encodings are harder to identify and manipulate.
Layer Selection
Different concepts are best identified at different layers. Early layers encode syntactic features, middle layers encode semantic features, and late layers encode output-relevant features. Choosing the wrong layer degrades performance significantly.
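One empirical way to choose is to sweep every layer and keep the one where the difference-of-means direction best separates the contrastive sets. A sketch, assuming per-layer activations have already been collected (e.g. by stacking `outputs.hidden_states` across examples):

```python
import numpy as np

def best_layer_by_separation(pos_acts, neg_acts):
    """Pick the layer whose difference-of-means direction best separates classes.

    pos_acts, neg_acts: arrays of shape [n_layers, n_examples, hidden_dim].
    Returns (best layer index, per-layer separation scores).
    """
    scores = []
    for layer in range(pos_acts.shape[0]):
        direction = pos_acts[layer].mean(axis=0) - neg_acts[layer].mean(axis=0)
        direction = direction / (np.linalg.norm(direction) + 1e-8)
        pos_proj = pos_acts[layer] @ direction
        neg_proj = neg_acts[layer] @ direction
        # Separation: gap between projection means relative to pooled spread
        spread = pos_proj.std() + neg_proj.std() + 1e-8
        scores.append((pos_proj.mean() - neg_proj.mean()) / spread)
    scores = np.array(scores)
    return int(scores.argmax()), scores
```

Scoring on held-out examples, rather than the same examples used to fit the direction, avoids picking a layer that merely overfits the contrast set.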
Scale Sensitivity
The steering scale matters enormously. Too small and the steering has no effect. Too large and it distorts the model's outputs into incoherence. Finding the right scale requires experimentation per model and per concept.
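A simple grid search can locate a workable scale. This sketch abstracts the model call and the quality metric behind two callables; `generate_fn` and `score_fn` are assumptions standing in for, e.g., steered generation plus a scorer that rewards the target behavior while penalizing incoherence:

```python
def sweep_steering_scale(generate_fn, score_fn, scales):
    """Grid-search the steering scale.

    generate_fn(scale) -> model output at that steering scale (assumed callable);
    score_fn(output)   -> combined quality score (assumed callable).
    Returns the best scale and all per-scale scores.
    """
    results = {scale: score_fn(generate_fn(scale)) for scale in scales}
    best = max(results, key=results.get)
    return best, results
```

In practice the sweep must be repeated per model and per concept, since the same scale that steers one model cleanly can push another into incoherence.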
Adversarial Robustness of Probes
Concept probes (e.g., honesty detectors) can potentially be fooled by models that learn to manipulate their own internal representations. A sufficiently capable model might learn to project high honesty scores while generating deceptive output. This is an open research question.
Red Team Assessment
1. **Identify concept directions.** For the target model, identify safety-relevant concept directions (honesty, safety, compliance) using contrastive example pairs.
2. **Evaluate probe accuracy.** Test concept probes against known examples. Measure false positive and false negative rates for deception detection, safety classification, and other security-relevant probes.
3. **Test steering robustness.** Apply safety steering and evaluate whether it improves robustness against known jailbreak techniques. Compare with prompt-based defenses.
4. **Assess offensive risk.** Evaluate how easily anti-safety steering can bypass the model's safety training. Document the scale at which safety behavior degrades.
5. **Test probe adversarial robustness.** Attempt to craft inputs that fool concept probes -- generating deceptive content while maintaining high honesty probe scores.
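The probe-accuracy step above reduces to an error-rate computation over projections from labeled examples. A minimal sketch (the fixed threshold of 0.0 is a naive default; calibrate it on held-out data in practice):

```python
import numpy as np

def probe_error_rates(scores_pos, scores_neg, threshold=0.0):
    """False positive / false negative rates for a direction probe.

    scores_pos: projections for examples known to contain the concept;
    scores_neg: projections for examples known to lack it.
    """
    scores_pos = np.asarray(scores_pos)
    scores_neg = np.asarray(scores_neg)
    fn_rate = float(np.mean(scores_pos < threshold))   # missed detections
    fp_rate = float(np.mean(scores_neg >= threshold))  # false alarms
    return {"false_negative_rate": fn_rate, "false_positive_rate": fp_rate}
```

Reporting both rates matters for the assessment: a deception probe with a low false-alarm rate but a high miss rate gives a false sense of security.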
Summary
Representation engineering provides a fundamentally different approach to AI security by operating on internal model representations rather than inputs and outputs. It enables deception detection, jailbreak-resistant safety steering, and sleeper agent probing. However, it also enables offensive applications: anti-safety steering and representation-level backdoor injection. The technique is most powerful for open-weight models where full access to activations is available. For red teamers, representation engineering is both a diagnostic tool (understanding what the model is doing internally) and an attack vector (manipulating model behavior at a level that input-based defenses cannot detect).