Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Traditional approaches to AI safety operate at the input-output level: filter inputs, train on preferred outputs, classify outputs. Representation engineering operates at the internal level -- directly reading and modifying the model's internal representations (activations) as they flow through the network. This enables both more precise safety controls and more dangerous attacks. For red teamers, understanding representation engineering is essential because it reveals what the model is "thinking" during processing, not just what it says.
Foundations: Representations as Vectors
Neural network activations at each layer form a high-dimensional vector space. Research has shown that meaningful concepts are often encoded as linear directions in this space.
Layer 15 activation space (simplified to 2D):

```
                 "Truthful"
                     ↑
                     │    ● "Honest response to hard question"
                     │
                     │    ● "Neutral factual response"
    ─────────────────┼────────────────────────────────▶ "Harmful"
                     │
                     │    ● "Safety refusal"
                     │
                     ↓
                "Deceptive"
```

Finding Safety-Relevant Directions
```python
import torch
import numpy as np

def find_concept_direction(
    model,
    positive_examples: list,   # Examples where concept is present
    negative_examples: list,   # Examples where concept is absent
    layer: int
):
    """Find the direction in activation space that encodes a concept."""
    positive_activations = []
    negative_activations = []

    for text in positive_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        # Get activation at specified layer, last token position
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        positive_activations.append(activation)

    for text in negative_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        negative_activations.append(activation)

    # The concept direction is the difference between mean activations
    positive_mean = np.mean(positive_activations, axis=0)
    negative_mean = np.mean(negative_activations, axis=0)
    direction = positive_mean - negative_mean

    # Normalize
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return direction
```

Activation Steering
Activation steering is the core technique of representation engineering. By adding a direction vector to the model's activations at specific layers during inference, you can push the model's behavior toward or away from a concept.
Steering for Safety
```python
class ActivationSteerer:
    """Steer model behavior by modifying internal activations."""

    def __init__(self, model, direction: np.ndarray, layer: int, scale: float = 1.0):
        self.model = model
        self.direction = torch.tensor(direction, dtype=torch.float32)
        self.layer = layer
        self.scale = scale
        self._hook_handle = None

    def _steering_hook(self, module, input, output):
        """Hook function that adds the steering vector to activations."""
        # Transformer blocks often return a tuple; the hidden states
        # ([batch, seq_len, hidden_dim]) are the first element
        hidden = output[0] if isinstance(output, tuple) else output
        steering_vector = self.direction.to(hidden.device) * self.scale
        hidden = hidden + steering_vector
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    def enable(self):
        """Attach the steering hook to the target layer."""
        # Module path assumes a GPT-2-style layout; adjust for other architectures
        target_layer = self.model.transformer.h[self.layer]
        self._hook_handle = target_layer.register_forward_hook(self._steering_hook)

    def disable(self):
        """Remove the steering hook."""
        if self._hook_handle is not None:
            self._hook_handle.remove()
            self._hook_handle = None

    def generate_steered(self, prompt: str, **kwargs):
        """Generate text with activation steering applied."""
        self.enable()
        try:
            # Assumes a wrapper whose generate() accepts raw text;
            # for a bare HF model, tokenize first and pass input ids
            output = self.model.generate(prompt, **kwargs)
        finally:
            self.disable()
        return output
```

Steering Directions for Security
| Direction | Positive End | Negative End | Security Application |
|---|---|---|---|
| Honesty | Truthful, accurate | Deceptive, fabricating | Detect and prevent hallucination |
| Safety | Cautious, refusing harm | Willing to help with harm | Enforce safety without RLHF |
| Compliance | Follows instructions precisely | Ignores or subverts instructions | Detect instruction-following attacks |
| Uncertainty | Acknowledges limits | Overconfident | Calibrate confidence in outputs |
| Sycophancy | Agrees with user | Independent judgment | Detect reward hacking |
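Each direction in the table is extracted with the same contrastive, difference-of-means recipe. A minimal sanity check on synthetic stand-in activations (not real model data) shows why projecting onto that direction separates the two ends of a concept:

```python
import numpy as np

# Synthetic stand-in activations: in practice these come from
# find_concept_direction() run on contrastive prompt pairs, e.g.
# honest vs. deceptive completions of the same question.
rng = np.random.default_rng(0)
honest = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
deceptive = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))

# Difference-of-means direction, normalized (same recipe as above)
direction = honest.mean(axis=0) - deceptive.mean(axis=0)
direction = direction / (np.linalg.norm(direction) + 1e-8)

# Projections onto the direction separate the two classes
honest_proj = honest @ direction
deceptive_proj = deceptive @ direction
assert honest_proj.mean() > deceptive_proj.mean()
```

With real activations the separation is noisier, which is why probe accuracy should be measured on held-out examples before a direction is trusted.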
Security Applications
Application 1: Deception Detection
By probing the model's activations for the "honesty direction," we can detect when the model is generating deceptive content -- even if the output text appears truthful.
```python
class DeceptionDetector:
    """Detect deceptive model behavior using representation probing."""

    def __init__(self, model, honesty_direction: np.ndarray, layer: int):
        self.model = model
        self.honesty_direction = torch.tensor(honesty_direction, dtype=torch.float32)
        self.layer = layer

    def assess_honesty(self, prompt: str) -> dict:
        """Measure how honest the model is being in its response."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        activations = []

        def capture_hook(module, input, output):
            # Transformer blocks often return a tuple; keep the hidden states
            hidden = output[0] if isinstance(output, tuple) else output
            activations.append(hidden.detach())

        hook = self.model.transformer.h[self.layer].register_forward_hook(capture_hook)
        try:
            with torch.no_grad():
                output = self.model.generate(**tokens, max_new_tokens=200)
        finally:
            hook.remove()

        if activations:
            # Project activations onto honesty direction
            act = activations[0][0]   # [seq_len, hidden_dim]
            projections = torch.matmul(act, self.honesty_direction.to(act.device))
            avg_honesty = projections.mean().item()
            min_honesty = projections.min().item()
            return {
                "average_honesty_score": avg_honesty,
                "minimum_honesty_score": min_honesty,
                "likely_deceptive": min_honesty < -0.5,
                "response": self.model.tokenizer.decode(output[0]),
            }
        return {"error": "No activations captured"}
```

Application 2: Jailbreak Resistance
Representation engineering can make models more resistant to jailbreaks by continuously steering toward the safety direction during inference, regardless of the input prompt.
```python
def safety_steered_inference(
    model,
    prompt: str,
    safety_direction: np.ndarray,
    layer: int,
    safety_scale: float = 2.0
):
    """Generate responses with continuous safety steering."""
    steerer = ActivationSteerer(model, safety_direction, layer, scale=safety_scale)
    response = steerer.generate_steered(prompt)
    return response
```

Unlike RLHF-based safety (which can be bypassed through prompt engineering), activation steering modifies the model's processing directly. A jailbreak prompt that manipulates the text input does not necessarily affect the activation-level steering vector. This provides a different kind of robustness.
Application 3: Sleeper Agent Detection
Representation probes can potentially detect sleeper agent behavior by monitoring for activation patterns associated with context detection and conditional policy switching.
```python
def probe_for_conditional_policy(
    model,
    prompt: str,
    contexts: list,   # List of (name, prompt) pairs: training-like, deployment-like, etc.
    layer: int
):
    """Test for conditional behavior at the representation level."""
    context_activations = {}
    for context_name, context_prompt in contexts:
        full_prompt = context_prompt + "\n" + prompt
        tokens = model.tokenizer(full_prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        context_activations[context_name] = activation

    # Compare activations across contexts.
    # Large differences suggest context-dependent internal processing.
    comparisons = {}
    context_names = list(context_activations.keys())
    for i in range(len(context_names)):
        for j in range(i + 1, len(context_names)):
            name_i, name_j = context_names[i], context_names[j]
            cosine_sim = np.dot(context_activations[name_i], context_activations[name_j]) / (
                np.linalg.norm(context_activations[name_i]) *
                np.linalg.norm(context_activations[name_j]) + 1e-8
            )
            comparisons[f"{name_i}_vs_{name_j}"] = {
                "cosine_similarity": float(cosine_sim),
                "suspicious": cosine_sim < 0.9   # Heuristic threshold: low similarity = different processing
            }
    return comparisons
```

Offensive Applications
Representation engineering is also an attack tool. An attacker with model access can use the same techniques to weaken safety.
Anti-Safety Steering
Steering in the negative safety direction can suppress safety training without any fine-tuning.
```python
# CAUTION: This demonstrates the attack for defensive understanding
def demonstrate_anti_safety_steering(
    model,
    safety_direction: np.ndarray,
    test_prompts: list,
    layer: int
):
    """Show how steering can suppress safety training."""
    results = []
    for prompt in test_prompts:
        # Normal response (safety intact)
        normal_response = model.generate(prompt)

        # Anti-safety steered response: steer along the negated safety direction
        anti_safety_steerer = ActivationSteerer(
            model, -safety_direction, layer, scale=3.0
        )
        steered_response = anti_safety_steerer.generate_steered(prompt)

        results.append({
            "prompt": prompt,
            "normal_response": normal_response,
            "steered_response": steered_response,
            # is_refusal() is assumed to be a refusal classifier defined elsewhere
            "safety_bypassed": not is_refusal(steered_response) and is_refusal(normal_response)
        })
    return results
```

Backdoor Insertion via Representation Engineering
Instead of fine-tuning a backdoor into model weights, an attacker can use activation steering to inject conditional behavior at inference time through hooks on specific layers.
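A minimal sketch of the mechanism, for defensive understanding: a forward hook that applies steering only when an attacker-controlled trigger predicate fires. The `trigger_active` callable and the precomputed `direction` vector are hypothetical stand-ins, not part of any real API:

```python
import torch

def make_conditional_hook(trigger_active, direction, scale=3.0):
    """Build a forward hook that steers activations only when triggered.

    trigger_active: zero-arg callable set by the attacker's wrapper
    (e.g. after scanning the prompt for a trigger phrase).
    direction: precomputed steering vector (hypothetical).
    """
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; hidden states come first
        hidden = output[0] if isinstance(output, tuple) else output
        if trigger_active():
            hidden = hidden + direction.to(hidden.device) * scale
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook
```

Because nothing is written to the weights, such a backdoor survives weight inspection and only exists while the hook is registered, which is what makes runtime integrity monitoring relevant as a defense.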
Limitations and Open Challenges
Linearity Assumption
Representation engineering assumes that concepts are encoded as linear directions. This is approximately true for many concepts but not universally true. Non-linear concept encodings are harder to identify and manipulate.
Layer Selection
Different concepts are best identified at different layers. Early layers encode syntactic features, middle layers encode semantic features, and late layers encode output-relevant features. Choosing the wrong layer degrades performance significantly.
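One empirical way to choose is to sweep every layer and keep the one where the difference-of-means direction best separates the contrastive sets. A sketch, assuming per-layer activations have already been collected (e.g. by stacking `outputs.hidden_states` across examples):

```python
import numpy as np

def best_layer_by_separation(pos_acts, neg_acts):
    """Pick the layer whose difference-of-means direction best separates classes.

    pos_acts, neg_acts: arrays of shape [n_layers, n_examples, hidden_dim].
    Returns (best layer index, per-layer separation scores).
    """
    scores = []
    for layer in range(pos_acts.shape[0]):
        direction = pos_acts[layer].mean(axis=0) - neg_acts[layer].mean(axis=0)
        direction = direction / (np.linalg.norm(direction) + 1e-8)
        pos_proj = pos_acts[layer] @ direction
        neg_proj = neg_acts[layer] @ direction
        # Separation: gap between projection means relative to pooled spread
        spread = pos_proj.std() + neg_proj.std() + 1e-8
        scores.append((pos_proj.mean() - neg_proj.mean()) / spread)
    scores = np.array(scores)
    return int(scores.argmax()), scores
```

Scoring on held-out examples, rather than the same examples used to fit the direction, avoids picking a layer that merely overfits the contrast set.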
Scale Sensitivity
The steering scale matters enormously. Too small and the steering has no effect. Too large and it distorts the model's outputs into incoherence. Finding the right scale requires experimentation per model and per concept.
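A simple grid search can locate a workable scale. This sketch abstracts the model call and the quality metric behind two callables; `generate_fn` and `score_fn` are assumptions standing in for, e.g., steered generation plus a scorer that rewards the target behavior while penalizing incoherence:

```python
def sweep_steering_scale(generate_fn, score_fn, scales):
    """Grid-search the steering scale.

    generate_fn(scale) -> model output at that steering scale (assumed callable);
    score_fn(output)   -> combined quality score (assumed callable).
    Returns the best scale and all per-scale scores.
    """
    results = {scale: score_fn(generate_fn(scale)) for scale in scales}
    best = max(results, key=results.get)
    return best, results
```

In practice the sweep must be repeated per model and per concept, since the same scale that steers one model cleanly can push another into incoherence.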
Adversarial Robustness of Probes
Concept probes (e.g., honesty detectors) can potentially be fooled by models that learn to manipulate their own internal representations. A sufficiently capable model might learn to project high honesty scores while generating deceptive output. This is an open research question.
Red Team Assessment
1. **Identify concept directions.** For the target model, identify safety-relevant concept directions (honesty, safety, compliance) using contrastive example pairs.
2. **Evaluate probe accuracy.** Test concept probes against known examples. Measure false positive and false negative rates for deception detection, safety classification, and other security-relevant probes.
3. **Test steering robustness.** Apply safety steering and evaluate whether it improves robustness against known jailbreak techniques. Compare with prompt-based defenses.
4. **Assess offensive risk.** Evaluate how easily anti-safety steering can bypass the model's safety training. Document the scale at which safety behavior degrades.
5. **Test probe adversarial robustness.** Attempt to craft inputs that fool concept probes -- generating deceptive content while maintaining high honesty probe scores.
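The probe-accuracy step above reduces to an error-rate computation over projections from labeled examples. A minimal sketch (the fixed threshold of 0.0 is a naive default; calibrate it on held-out data in practice):

```python
import numpy as np

def probe_error_rates(scores_pos, scores_neg, threshold=0.0):
    """False positive / false negative rates for a direction probe.

    scores_pos: projections for examples known to contain the concept;
    scores_neg: projections for examples known to lack it.
    """
    scores_pos = np.asarray(scores_pos)
    scores_neg = np.asarray(scores_neg)
    fn_rate = float(np.mean(scores_pos < threshold))   # missed detections
    fp_rate = float(np.mean(scores_neg >= threshold))  # false alarms
    return {"false_negative_rate": fn_rate, "false_positive_rate": fp_rate}
```

Reporting both rates matters for the assessment: a deception probe with a low false-alarm rate but a high miss rate gives a false sense of security.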
Summary
Representation engineering provides a fundamentally different approach to AI security by operating on internal model representations rather than inputs and outputs. It enables deception detection, jailbreak-resistant safety steering, and sleeper agent probing. However, it also enables offensive applications: anti-safety steering and representation-level backdoor injection. The technique is most powerful for open-weight models where full access to activations is available. For red teamers, representation engineering is both a diagnostic tool (understanding what the model is doing internally) and an attack vector (manipulating model behavior at a level that input-based defenses cannot detect).