Representation Engineering for Safety
Reading and manipulating a model's internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Traditional approaches to AI safety operate at the input-output level: filter inputs, train on preferred outputs, classify outputs. Representation engineering operates at the internal level -- directly reading and modifying a model's internal representations (activations) as they flow through the network. This enables both more precise safety controls and more dangerous attacks. For red teamers, understanding representation engineering is essential because it reveals what a model is "thinking" during processing, not just what it says.
Foundations: Representations as Vectors
Neural network activations at each layer form a high-dimensional vector space. Research has shown that meaningful concepts are often encoded as linear directions in this space.
Layer 15 activation space (simplified to 2D):
"Truthful"
↑
│ ● "Honest response to hard question"
│
│ ● "Neutral factual response"
─────────┼────────────────────────────────▶ "Harmful"
│
│ ● "Safety refusal"
│
↓
"Deceptive"

Finding Safety-Relevant Directions
```python
import torch
import numpy as np

def find_concept_direction(
    model,
    positive_examples: list,   # examples where the concept is present
    negative_examples: list,   # examples where the concept is absent
    layer: int
):
    """Find the direction in activation space that encodes a concept."""
    positive_activations = []
    negative_activations = []
    for text in positive_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        # Activation at the specified layer, last token position
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        positive_activations.append(activation)
    for text in negative_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        negative_activations.append(activation)
    # The concept direction is the difference between mean activations
    positive_mean = np.mean(positive_activations, axis=0)
    negative_mean = np.mean(negative_activations, axis=0)
    direction = positive_mean - negative_mean
    # Normalize to unit length
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return direction
```

Activation Steering
Activation steering is the core technique of representation engineering. By adding a direction vector to a model's activations at specific layers during inference, you can push the model's behavior toward or away from a concept.
Steering for Safety
```python
class ActivationSteerer:
    """Steer model behavior by modifying internal activations."""

    def __init__(self, model, direction: np.ndarray, layer: int, scale: float = 1.0):
        self.model = model
        self.direction = torch.tensor(direction, dtype=torch.float32)
        self.layer = layer
        self.scale = scale
        self._hook_handle = None

    def _steering_hook(self, module, inputs, output):
        """Hook function that adds the steering vector to activations."""
        # Transformer blocks often return a tuple; the hidden states come first
        hidden = output[0] if isinstance(output, tuple) else output
        # hidden shape: [batch, seq_len, hidden_dim]
        steering_vector = self.direction.to(hidden.device) * self.scale
        hidden = hidden + steering_vector
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    def enable(self):
        """Attach the steering hook to the target layer."""
        # GPT-2-style module layout; adjust the path for other architectures
        target_layer = self.model.transformer.h[self.layer]
        self._hook_handle = target_layer.register_forward_hook(self._steering_hook)

    def disable(self):
        """Remove the steering hook."""
        if self._hook_handle is not None:
            self._hook_handle.remove()
            self._hook_handle = None

    def generate_steered(self, prompt: str, **kwargs):
        """Generate text with activation steering applied."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        self.enable()
        try:
            output = self.model.generate(**tokens, **kwargs)
        finally:
            self.disable()
        return output
```

Steering Directions for Safety
| Direction | Positive End | Negative End | Safety Application |
|---|---|---|---|
| Honesty | Truthful, accurate | Deceptive, fabricating | Detect and prevent hallucination |
| Safety | Cautious, refusing harm | Willing to help with harm | Enforce safety without RLHF |
| Compliance | Follows instructions precisely | Ignores or subverts instructions | Detect instruction-following attacks |
| Uncertainty | Acknowledges limits | Overconfident | Calibrate confidence in outputs |
| Sycophancy | Agrees with user | Independent judgment | Detect reward hacking |
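Directions like these are typically built from small contrastive datasets. The pairs below are illustrative stand-ins (not drawn from any published dataset) in the format `find_concept_direction` expects; real probes need dozens of varied pairs per concept:

```python
# Hypothetical contrastive examples for two of the directions in the table.
CONTRASTIVE_EXAMPLES = {
    "honesty": {
        "positive": [
            "To be accurate: I don't know the answer to that.",
            "The evidence on this question is mixed, so I can't say for certain.",
        ],
        "negative": [
            "Yes, that is definitely true.",  # confident fabrication
            "Studies have conclusively proven this claim.",
        ],
    },
    "sycophancy": {
        "positive": [
            "You're absolutely right, great point!",
            "I completely agree with everything you said.",
        ],
        "negative": [
            "I see it differently: the data suggests otherwise.",
            "That premise doesn't hold up under scrutiny.",
        ],
    },
}

def pairs_for(concept: str):
    """Return (positive_examples, negative_examples) for a concept."""
    entry = CONTRASTIVE_EXAMPLES[concept]
    return entry["positive"], entry["negative"]
```

With a loaded model, something like `find_concept_direction(model, *pairs_for("honesty"), layer=15)` would then produce the honesty direction.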
Safety Applications
Application 1: Deception Detection
By probing a model's activations for the "honesty direction," we can detect when the model is generating deceptive content -- even if the output text appears truthful.
```python
class DeceptionDetector:
    """Detect deceptive model behavior using representation probing."""

    def __init__(self, model, honesty_direction: np.ndarray, layer: int):
        self.model = model
        self.honesty_direction = torch.tensor(honesty_direction, dtype=torch.float32)
        self.layer = layer

    def assess_honesty(self, prompt: str) -> dict:
        """Measure how honest the model is being in its response."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        activations = []

        def capture_hook(module, inputs, output):
            # Transformer blocks often return a tuple; hidden states come first
            hidden = output[0] if isinstance(output, tuple) else output
            activations.append(hidden.detach())

        hook = self.model.transformer.h[self.layer].register_forward_hook(capture_hook)
        with torch.no_grad():
            output = self.model.generate(**tokens, max_new_tokens=200)
        hook.remove()
        if activations:
            # Project activations from the first forward pass onto the honesty direction
            act = activations[0][0]  # [seq_len, hidden_dim]
            projections = torch.matmul(act, self.honesty_direction.to(act.device))
            avg_honesty = projections.mean().item()
            min_honesty = projections.min().item()
            return {
                "average_honesty_score": avg_honesty,
                "minimum_honesty_score": min_honesty,
                "likely_deceptive": min_honesty < -0.5,
                "response": self.model.tokenizer.decode(output[0])
            }
        return {"error": "No activations captured"}
```

Application 2: Jailbreak Resistance
Representation engineering can make models more resistant to jailbreaks by continuously steering toward the safety direction during inference, regardless of the input prompt.
```python
def safety_steered_inference(
    model,
    prompt: str,
    safety_direction: np.ndarray,
    layer: int,
    safety_scale: float = 2.0
):
    """Generate responses with continuous safety steering."""
    steerer = ActivationSteerer(model, safety_direction, layer, scale=safety_scale)
    response = steerer.generate_steered(prompt)
    return response
```

Unlike RLHF-based safety (which can be bypassed through prompt engineering), activation steering modifies the model's processing directly. A jailbreak prompt that manipulates the text input does not necessarily affect the activation-level steering vector. This provides a different kind of robustness.
Application 3: Sleeper Agent Detection
Representation probes can potentially detect sleeper agent behavior by monitoring for activation patterns associated with context detection and conditional policy switching.
```python
def probe_for_conditional_policy(
    model,
    prompt: str,
    contexts: list,  # (name, framing) pairs: training-like, deployment-like, etc.
    layer: int
):
    """Test for conditional behavior at the representation level."""
    context_activations = {}
    for context_name, context_prompt in contexts:
        full_prompt = context_prompt + "\n" + prompt
        tokens = model.tokenizer(full_prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        context_activations[context_name] = activation
    # Compare activations across contexts.
    # Large differences suggest context-dependent internal processing.
    comparisons = {}
    context_names = list(context_activations.keys())
    for i in range(len(context_names)):
        for j in range(i + 1, len(context_names)):
            name_i, name_j = context_names[i], context_names[j]
            cosine_sim = np.dot(context_activations[name_i], context_activations[name_j]) / (
                np.linalg.norm(context_activations[name_i]) *
                np.linalg.norm(context_activations[name_j]) + 1e-8
            )
            comparisons[f"{name_i}_vs_{name_j}"] = {
                "cosine_similarity": float(cosine_sim),
                "suspicious": cosine_sim < 0.9  # low similarity = different processing
            }
    return comparisons
```

Offensive Applications
Representation engineering is also an attack tool. Attackers with model access can use the same techniques to weaken safety.
Anti-Safety Steering
Steering in the negative safety direction can suppress safety training without any fine-tuning.
```python
# CAUTION: This demonstrates the attack for defensive understanding.
def demonstrate_anti_safety_steering(
    model,
    safety_direction: np.ndarray,
    test_prompts: list,
    layer: int
):
    """Show how steering can suppress safety training."""
    results = []
    for prompt in test_prompts:
        # Normal response (safety intact)
        tokens = model.tokenizer(prompt, return_tensors="pt")
        normal_response = model.generate(**tokens)
        # Anti-safety steered response: steer along the negated safety direction
        anti_safety_steerer = ActivationSteerer(
            model, -safety_direction, layer, scale=3.0
        )
        steered_response = anti_safety_steerer.generate_steered(prompt)
        results.append({
            "prompt": prompt,
            "normal_response": normal_response,
            "steered_response": steered_response,
            # is_refusal: external refusal classifier, assumed available
            "safety_bypassed": is_refusal(normal_response) and not is_refusal(steered_response)
        })
    return results
```

Backdoor Insertion via Representation Engineering
Instead of fine-tuning a backdoor into model weights, attackers can use activation steering to inject conditional behavior at inference time through hooks on specific layers.
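A minimal sketch of the conditional logic such a hook would contain, written over plain NumPy arrays so the branch structure is visible (in a real attack it would live inside a forward hook, as in ActivationSteerer; the trigger predicate is an assumption for illustration):

```python
import numpy as np

def make_conditional_steering(direction: np.ndarray, scale: float):
    """Return a function that steers activations only when a trigger fires."""
    unit = direction / (np.linalg.norm(direction) + 1e-8)

    def steer(activation: np.ndarray, trigger_active: bool) -> np.ndarray:
        if trigger_active:
            # Trigger present: push activations along the injected direction
            return activation + scale * unit
        # Trigger absent: behave exactly like the clean model
        return activation

    return steer
```

On benign inputs the activations pass through untouched, so the backdoor is invisible to input-output testing; only representation-level monitoring (as in the sleeper agent probe above) has a chance of catching the conditional branch.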
Limitations and Open Challenges
Linearity Assumption
Representation engineering assumes that concepts are encoded as linear directions. This is approximately true for many concepts but not universally true. Non-linear concept encodings are harder to identify and manipulate.
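One practical check on the linearity assumption: if a difference-of-means direction separates positive and negative activations with high accuracy, a linear encoding is plausible; if accuracy stays near chance, the concept may be non-linearly encoded at that layer. A sketch on synthetic activations standing in for real ones:

```python
import numpy as np

def linear_separability(pos_acts: np.ndarray, neg_acts: np.ndarray) -> float:
    """Accuracy of a difference-of-means probe with a midpoint threshold."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-8
    threshold = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) @ direction / 2
    correct = (pos_acts @ direction > threshold).sum() + (neg_acts @ direction <= threshold).sum()
    return correct / (len(pos_acts) + len(neg_acts))

rng = np.random.default_rng(1)
# Linearly encoded concept: two Gaussian clusters offset along one axis
offset = np.zeros(32)
offset[0] = 4.0
pos = rng.normal(size=(200, 32)) + offset
neg = rng.normal(size=(200, 32))
print(linear_separability(pos, neg))  # near 1.0: a linear encoding is plausible
```

For rigor, the accuracy should be measured on held-out pairs, not the pairs used to fit the direction.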
Layer Selection
Different concepts are best identified at different layers. Early layers encode syntactic features, middle layers encode semantic features, and late layers encode output-relevant features. Choosing the wrong layer degrades performance significantly.
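In practice the layer is usually chosen empirically: compute candidate directions at every layer and keep the one with the strongest separation between positive and negative activation sets. A sketch of one possible selection criterion, with synthetic per-layer activations standing in for stacked `output_hidden_states`:

```python
import numpy as np

def best_probe_layer(pos_by_layer, neg_by_layer):
    """Pick the layer where class means are farthest apart relative to spread.

    pos_by_layer / neg_by_layer: lists of [n_examples, hidden_dim] arrays,
    one per layer.
    """
    scores = []
    for pos, neg in zip(pos_by_layer, neg_by_layer):
        gap = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))
        spread = pos.std() + neg.std() + 1e-8  # pooled scatter
        scores.append(gap / spread)
    return int(np.argmax(scores)), scores

# Synthetic example: the concept only becomes separable at "layer" 2
rng = np.random.default_rng(2)
pos_layers = [rng.normal(size=(50, 16)) for _ in range(4)]
neg_layers = [rng.normal(size=(50, 16)) for _ in range(4)]
pos_layers[2] = pos_layers[2] + 3.0  # inject a mean shift at layer 2
layer, scores = best_probe_layer(pos_layers, neg_layers)
print(layer)  # → 2
```

The gap-over-spread score is one reasonable heuristic; held-out probe accuracy per layer is a more rigorous alternative.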
Scale Sensitivity
The steering scale matters enormously. Too small and the steering has no effect. Too large and it distorts the model's outputs into incoherence. Finding the right scale requires experimentation per model and per concept.
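One way to systematize that experimentation is a sweep: increase the scale until a coherence metric (perplexity under a reference model, or a fluency classifier) drops below a floor, and keep the largest scale that still passes. The sketch below assumes caller-supplied `steer_effect` and `coherence` callables, both hypothetical:

```python
def calibrate_scale(steer_effect, coherence, scales, min_coherence=0.7):
    """Return the largest scale whose output stays coherent.

    steer_effect(scale) -> generated text (or any artifact to score)
    coherence(text) -> float in [0, 1]; below min_coherence = incoherent
    """
    best = None
    for scale in sorted(scales):
        text = steer_effect(scale)
        if coherence(text) >= min_coherence:
            best = scale  # still coherent at this strength
        else:
            break  # outputs have degraded; stop increasing
    return best

# Toy stand-ins: coherence decays linearly with scale
chosen = calibrate_scale(
    steer_effect=lambda s: s,                     # pretend the "text" is the scale
    coherence=lambda s: max(0.0, 1.0 - s / 8.0),  # crosses the 0.7 floor near s=2.4
    scales=[0.5, 1.0, 2.0, 4.0, 8.0],
)
print(chosen)  # → 2.0
```

Early stopping on the first failure assumes coherence degrades monotonically with scale, which usually holds but is worth verifying per model.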
Adversarial Robustness of Probes
Concept probes (e.g., honesty detectors) can potentially be fooled by models that learn to manipulate their own internal representations. A sufficiently capable model might learn to project high honesty scores while generating deceptive output. This is an open research question.
Red Team Evaluation
Identify concept directions
For the target model, identify safety-relevant concept directions (honesty, safety, compliance) using contrastive example pairs.
Evaluate probe accuracy
Test concept probes against known examples. Measure false positive and false negative rates for deception detection, safety classification, and other safety-relevant probes.
Test steering robustness
Apply safety steering and evaluate whether it improves robustness against known jailbreak techniques. Compare with prompt-based defenses.
Evaluate offensive risk
Evaluate how easily anti-safety steering can bypass the model's safety training. Document the scale at which safety behavior degrades.
Test probe adversarial robustness
Attempt to craft inputs that fool concept probes -- generating deceptive content while maintaining high honesty probe scores.
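The probe-accuracy step above reduces to standard error-rate bookkeeping once probe verdicts and ground-truth labels have been collected; a minimal sketch (the label convention is an assumption):

```python
def probe_error_rates(predictions, labels):
    """False positive / false negative rates for a binary probe.

    predictions, labels: parallel lists of bools
    (True = probe flags / example truly is deceptive, unsafe, etc.)
    """
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    negatives = sum(not l for l in labels) or 1  # avoid division by zero
    positives = sum(labels) or 1
    return {"fpr": fp / negatives, "fnr": fn / positives}

# Example: probe catches 3 of 4 true positives, falsely flags 1 of 4 negatives
rates = probe_error_rates(
    predictions=[True, True, True, False, True, False, False, False],
    labels=[True, True, True, True, False, False, False, False],
)
print(rates)  # → {'fpr': 0.25, 'fnr': 0.25}
```

For safety-relevant probes, the two rates usually carry asymmetric costs: a missed deception (false negative) is typically worse than a spurious flag.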
Summary
Representation engineering provides a fundamentally different approach to AI safety by operating on internal model representations rather than on inputs and outputs. It enables deception detection, jailbreak-resistant safety steering, and sleeper agent probing. However, it also enables offensive applications: anti-safety steering and representation-level backdoor injection. The technique is most powerful for open-weight models, where full access to activations is available. For red teamers, representation engineering is both a diagnostic tool (understanding what a model is doing internally) and an attack vector (manipulating model behavior at a level that input-based defenses cannot detect).