Representation Engineering for Safety
Reading and manipulating a model's internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
Traditional approaches to AI safety operate at the input-output level: filter inputs, train on preferred outputs, classify outputs. Representation engineering operates at the internal level -- directly reading and modifying a model's internal representations (activations) as they flow through the network. This enables both more precise safety controls and more dangerous attacks. For red teamers, understanding representation engineering is essential because it reveals what a model is "thinking" during processing, not just what it says.
Foundations: Representations as Vectors
Neural network activations at each layer form a high-dimensional vector space. Research has shown that meaningful concepts are often encoded as linear directions in this space.
Layer 15 activation space (simplified to 2D):
"Truthful"
↑
│ ● "Honest response to hard question"
│
│ ● "Neutral factual response"
─────────┼────────────────────────────────▶ "Harmful"
│
│ ● "Safety refusal"
│
↓
"Deceptive"

Finding Safety-Relevant Directions
```python
import torch
import numpy as np

def find_concept_direction(
    model,
    positive_examples: list,   # examples where the concept is present
    negative_examples: list,   # examples where the concept is absent
    layer: int
):
    """Find the direction in activation space that encodes a concept."""
    positive_activations = []
    negative_activations = []
    for text in positive_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        # Activation at the specified layer, last token position
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        positive_activations.append(activation)
    for text in negative_examples:
        tokens = model.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        negative_activations.append(activation)
    # The concept direction is the difference between mean activations
    positive_mean = np.mean(positive_activations, axis=0)
    negative_mean = np.mean(negative_activations, axis=0)
    direction = positive_mean - negative_mean
    # Normalize to unit length
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    return direction
```

Activation Steering
Activation steering is the core technique of representation engineering. By adding a direction vector to a model's activations at specific layers during inference, you can push the model's behavior toward or away from a concept.
Steering for Safety
```python
class ActivationSteerer:
    """Steer model behavior by modifying internal activations."""

    def __init__(self, model, direction: np.ndarray, layer: int, scale: float = 1.0):
        self.model = model
        self.direction = torch.tensor(direction, dtype=torch.float32)
        self.layer = layer
        self.scale = scale
        self._hook_handle = None

    def _steering_hook(self, module, inputs, output):
        """Hook function that adds the steering vector to activations."""
        # Transformer blocks often return a tuple; the hidden states come first
        hidden = output[0] if isinstance(output, tuple) else output
        # hidden shape: [batch, seq_len, hidden_dim]
        steering_vector = self.direction.to(hidden.device) * self.scale
        hidden = hidden + steering_vector
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    def enable(self):
        """Attach the steering hook to the target layer."""
        # GPT-2-style module layout; adjust the path for other architectures
        target_layer = self.model.transformer.h[self.layer]
        self._hook_handle = target_layer.register_forward_hook(self._steering_hook)

    def disable(self):
        """Remove the steering hook."""
        if self._hook_handle is not None:
            self._hook_handle.remove()
            self._hook_handle = None

    def generate_steered(self, prompt: str, **kwargs):
        """Generate text with activation steering applied."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        self.enable()
        try:
            output = self.model.generate(**tokens, **kwargs)
        finally:
            self.disable()
        return output
```

Steering Directions for Safety
| Direction | Positive End | Negative End | Safety Application |
|---|---|---|---|
| Honesty | Truthful, accurate | Deceptive, fabricating | Detect and prevent hallucination |
| Safety | Cautious, refusing harm | Willing to help with harm | Enforce safety without RLHF |
| Compliance | Follows instructions precisely | Ignores or subverts instructions | Detect instruction-following attacks |
| Uncertainty | Acknowledges limits | Overconfident | Calibrate confidence in outputs |
| Sycophancy | Agrees with user | Independent judgment | Detect reward hacking |
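Directions like these are typically built from small contrastive datasets. The pairs below are illustrative stand-ins (not drawn from any published dataset) in the format `find_concept_direction` expects; real probes need dozens of varied pairs per concept:

```python
# Hypothetical contrastive examples for two of the directions in the table.
CONTRASTIVE_EXAMPLES = {
    "honesty": {
        "positive": [
            "To be accurate: I don't know the answer to that.",
            "The evidence on this question is mixed, so I can't say for certain.",
        ],
        "negative": [
            "Yes, that is definitely true.",  # confident fabrication
            "Studies have conclusively proven this claim.",
        ],
    },
    "sycophancy": {
        "positive": [
            "You're absolutely right, great point!",
            "I completely agree with everything you said.",
        ],
        "negative": [
            "I see it differently: the data suggests otherwise.",
            "That premise doesn't hold up under scrutiny.",
        ],
    },
}

def pairs_for(concept: str):
    """Return (positive_examples, negative_examples) for a concept."""
    entry = CONTRASTIVE_EXAMPLES[concept]
    return entry["positive"], entry["negative"]
```

With a loaded model, something like `find_concept_direction(model, *pairs_for("honesty"), layer=15)` would then produce the honesty direction.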
Safety Applications
Application 1: Deception Detection
By probing a model's activations for the "honesty direction," we can detect when the model is generating deceptive content -- even if the output text appears truthful.
```python
class DeceptionDetector:
    """Detect deceptive model behavior using representation probing."""

    def __init__(self, model, honesty_direction: np.ndarray, layer: int):
        self.model = model
        self.honesty_direction = torch.tensor(honesty_direction, dtype=torch.float32)
        self.layer = layer

    def assess_honesty(self, prompt: str) -> dict:
        """Measure how honest the model is being in its response."""
        tokens = self.model.tokenizer(prompt, return_tensors="pt")
        activations = []

        def capture_hook(module, inputs, output):
            # Transformer blocks often return a tuple; hidden states come first
            hidden = output[0] if isinstance(output, tuple) else output
            activations.append(hidden.detach())

        hook = self.model.transformer.h[self.layer].register_forward_hook(capture_hook)
        with torch.no_grad():
            output = self.model.generate(**tokens, max_new_tokens=200)
        hook.remove()
        if activations:
            # Project activations from the first forward pass onto the honesty direction
            act = activations[0][0]  # [seq_len, hidden_dim]
            projections = torch.matmul(act, self.honesty_direction.to(act.device))
            avg_honesty = projections.mean().item()
            min_honesty = projections.min().item()
            return {
                "average_honesty_score": avg_honesty,
                "minimum_honesty_score": min_honesty,
                "likely_deceptive": min_honesty < -0.5,
                "response": self.model.tokenizer.decode(output[0])
            }
        return {"error": "No activations captured"}
```

Application 2: Jailbreak Resistance
Representation engineering can make models more resistant to jailbreaks by continuously steering toward the safety direction during inference, regardless of the input prompt.
```python
def safety_steered_inference(
    model,
    prompt: str,
    safety_direction: np.ndarray,
    layer: int,
    safety_scale: float = 2.0
):
    """Generate responses with continuous safety steering."""
    steerer = ActivationSteerer(model, safety_direction, layer, scale=safety_scale)
    response = steerer.generate_steered(prompt)
    return response
```

Unlike RLHF-based safety (which can be bypassed through prompt engineering), activation steering modifies the model's processing directly. A jailbreak prompt that manipulates the text input does not necessarily affect the activation-level steering vector. This provides a different kind of robustness.
Application 3: Sleeper Agent Detection
Representation probes can potentially detect sleeper agent behavior by monitoring for activation patterns associated with context detection and conditional policy switching.
```python
def probe_for_conditional_policy(
    model,
    prompt: str,
    contexts: list,  # (name, framing) pairs: training-like, deployment-like, etc.
    layer: int
):
    """Test for conditional behavior at the representation level."""
    context_activations = {}
    for context_name, context_prompt in contexts:
        full_prompt = context_prompt + "\n" + prompt
        tokens = model.tokenizer(full_prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
        activation = outputs.hidden_states[layer][0, -1, :].cpu().numpy()
        context_activations[context_name] = activation
    # Compare activations across contexts.
    # Large differences suggest context-dependent internal processing.
    comparisons = {}
    context_names = list(context_activations.keys())
    for i in range(len(context_names)):
        for j in range(i + 1, len(context_names)):
            name_i, name_j = context_names[i], context_names[j]
            cosine_sim = np.dot(context_activations[name_i], context_activations[name_j]) / (
                np.linalg.norm(context_activations[name_i]) *
                np.linalg.norm(context_activations[name_j]) + 1e-8
            )
            comparisons[f"{name_i}_vs_{name_j}"] = {
                "cosine_similarity": float(cosine_sim),
                "suspicious": cosine_sim < 0.9  # low similarity = different processing
            }
    return comparisons
```

Offensive Applications
Representation engineering is also an attack tool. Attackers with model access can use the same techniques to weaken safety.
Anti-Safety Steering
Steering in the negative safety direction can suppress safety training without any fine-tuning.
```python
# CAUTION: This demonstrates the attack for defensive understanding.
def demonstrate_anti_safety_steering(
    model,
    safety_direction: np.ndarray,
    test_prompts: list,
    layer: int
):
    """Show how steering can suppress safety training."""
    results = []
    for prompt in test_prompts:
        # Normal response (safety intact)
        tokens = model.tokenizer(prompt, return_tensors="pt")
        normal_response = model.generate(**tokens)
        # Anti-safety steered response: steer along the negated safety direction
        anti_safety_steerer = ActivationSteerer(
            model, -safety_direction, layer, scale=3.0
        )
        steered_response = anti_safety_steerer.generate_steered(prompt)
        results.append({
            "prompt": prompt,
            "normal_response": normal_response,
            "steered_response": steered_response,
            # is_refusal: external refusal classifier, assumed available
            "safety_bypassed": is_refusal(normal_response) and not is_refusal(steered_response)
        })
    return results
```

Backdoor Insertion via Representation Engineering
Instead of fine-tuning a backdoor into model weights, attackers can use activation steering to inject conditional behavior at inference time through hooks on specific layers.
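A minimal sketch of the conditional logic such a hook would contain, written over plain NumPy arrays so the branch structure is visible (in a real attack it would live inside a forward hook, as in ActivationSteerer; the trigger predicate is an assumption for illustration):

```python
import numpy as np

def make_conditional_steering(direction: np.ndarray, scale: float):
    """Return a function that steers activations only when a trigger fires."""
    unit = direction / (np.linalg.norm(direction) + 1e-8)

    def steer(activation: np.ndarray, trigger_active: bool) -> np.ndarray:
        if trigger_active:
            # Trigger present: push activations along the injected direction
            return activation + scale * unit
        # Trigger absent: behave exactly like the clean model
        return activation

    return steer
```

On benign inputs the activations pass through untouched, so the backdoor is invisible to input-output testing; only representation-level monitoring (as in the sleeper agent probe above) has a chance of catching the conditional branch.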
Limitations and Open Challenges
Linearity Assumption
Representation engineering assumes that concepts are encoded as linear directions. This is approximately true for many concepts but not universally true. Non-linear concept encodings are harder to identify and manipulate.
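One practical check on the linearity assumption: if a difference-of-means direction separates positive and negative activations with high accuracy, a linear encoding is plausible; if accuracy stays near chance, the concept may be non-linearly encoded at that layer. A sketch on synthetic activations standing in for real ones:

```python
import numpy as np

def linear_separability(pos_acts: np.ndarray, neg_acts: np.ndarray) -> float:
    """Accuracy of a difference-of-means probe with a midpoint threshold."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    direction /= np.linalg.norm(direction) + 1e-8
    threshold = (pos_acts.mean(axis=0) + neg_acts.mean(axis=0)) @ direction / 2
    correct = (pos_acts @ direction > threshold).sum() + (neg_acts @ direction <= threshold).sum()
    return correct / (len(pos_acts) + len(neg_acts))

rng = np.random.default_rng(1)
# Linearly encoded concept: two Gaussian clusters offset along one axis
offset = np.zeros(32)
offset[0] = 4.0
pos = rng.normal(size=(200, 32)) + offset
neg = rng.normal(size=(200, 32))
print(linear_separability(pos, neg))  # near 1.0: a linear encoding is plausible
```

For rigor, the accuracy should be measured on held-out pairs, not the pairs used to fit the direction.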
Layer Selection
Different concepts are best identified at different layers. Early layers encode syntactic features, middle layers encode semantic features, and late layers encode output-relevant features. Choosing the wrong layer degrades performance significantly.
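In practice the layer is usually chosen empirically: compute candidate directions at every layer and keep the one with the strongest separation between positive and negative activation sets. A sketch of one possible selection criterion, with synthetic per-layer activations standing in for stacked `output_hidden_states`:

```python
import numpy as np

def best_probe_layer(pos_by_layer, neg_by_layer):
    """Pick the layer where class means are farthest apart relative to spread.

    pos_by_layer / neg_by_layer: lists of [n_examples, hidden_dim] arrays,
    one per layer.
    """
    scores = []
    for pos, neg in zip(pos_by_layer, neg_by_layer):
        gap = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))
        spread = pos.std() + neg.std() + 1e-8  # pooled scatter
        scores.append(gap / spread)
    return int(np.argmax(scores)), scores

# Synthetic example: the concept only becomes separable at "layer" 2
rng = np.random.default_rng(2)
pos_layers = [rng.normal(size=(50, 16)) for _ in range(4)]
neg_layers = [rng.normal(size=(50, 16)) for _ in range(4)]
pos_layers[2] = pos_layers[2] + 3.0  # inject a mean shift at layer 2
layer, scores = best_probe_layer(pos_layers, neg_layers)
print(layer)  # → 2
```

The gap-over-spread score is one reasonable heuristic; held-out probe accuracy per layer is a more rigorous alternative.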
Scale Sensitivity
The steering scale matters enormously. Too small and the steering has no effect. Too large and it distorts the model's outputs into incoherence. Finding the right scale requires experimentation per model and per concept.
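One way to systematize that experimentation is a sweep: increase the scale until a coherence metric (perplexity under a reference model, or a fluency classifier) drops below a floor, and keep the largest scale that still passes. The sketch below assumes caller-supplied `steer_effect` and `coherence` callables, both hypothetical:

```python
def calibrate_scale(steer_effect, coherence, scales, min_coherence=0.7):
    """Return the largest scale whose output stays coherent.

    steer_effect(scale) -> generated text (or any artifact to score)
    coherence(text) -> float in [0, 1]; below min_coherence = incoherent
    """
    best = None
    for scale in sorted(scales):
        text = steer_effect(scale)
        if coherence(text) >= min_coherence:
            best = scale  # still coherent at this strength
        else:
            break  # outputs have degraded; stop increasing
    return best

# Toy stand-ins: coherence decays linearly with scale
chosen = calibrate_scale(
    steer_effect=lambda s: s,                     # pretend the "text" is the scale
    coherence=lambda s: max(0.0, 1.0 - s / 8.0),  # crosses the 0.7 floor near s=2.4
    scales=[0.5, 1.0, 2.0, 4.0, 8.0],
)
print(chosen)  # → 2.0
```

Early stopping on the first failure assumes coherence degrades monotonically with scale, which usually holds but is worth verifying per model.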
Adversarial Robustness of Probes
Concept probes (e.g., honesty detectors) can potentially be fooled by models that learn to manipulate their own internal representations. A sufficiently capable model might learn to project high honesty scores while generating deceptive output. This is an open research question.
Red Team Evaluation
Identify concept directions
For the target model, identify safety-relevant concept directions (honesty, safety, compliance) using contrastive example pairs.
Evaluate probe accuracy
Test concept probes against known examples. Measure false positive and false negative rates for deception detection, safety classification, and other safety-relevant probes.
Test steering robustness
Apply safety steering and evaluate whether it improves robustness against known jailbreak techniques. Compare with prompt-based defenses.
Evaluate offensive risk
Evaluate how easily anti-safety steering can bypass the model's safety training. Document the scale at which safety behavior degrades.
Test probe adversarial robustness
Attempt to craft inputs that fool concept probes -- generating deceptive content while maintaining high honesty probe scores.
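The probe-accuracy step above reduces to standard error-rate bookkeeping once probe verdicts and ground-truth labels have been collected; a minimal sketch (the label convention is an assumption):

```python
def probe_error_rates(predictions, labels):
    """False positive / false negative rates for a binary probe.

    predictions, labels: parallel lists of bools
    (True = probe flags / example truly is deceptive, unsafe, etc.)
    """
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    negatives = sum(not l for l in labels) or 1  # avoid division by zero
    positives = sum(labels) or 1
    return {"fpr": fp / negatives, "fnr": fn / positives}

# Example: probe catches 3 of 4 true positives, falsely flags 1 of 4 negatives
rates = probe_error_rates(
    predictions=[True, True, True, False, True, False, False, False],
    labels=[True, True, True, True, False, False, False, False],
)
print(rates)  # → {'fpr': 0.25, 'fnr': 0.25}
```

For safety-relevant probes, the two rates usually carry asymmetric costs: a missed deception (false negative) is typically worse than a spurious flag.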
Summary
Representation engineering provides a fundamentally different approach to AI safety by operating on internal model representations rather than on inputs and outputs. It enables deception detection, jailbreak-resistant safety steering, and sleeper agent probing. However, it also enables offensive applications: anti-safety steering and representation-level backdoor injection. The technique is most powerful for open-weight models, where full access to activations is available. For red teamers, representation engineering is both a diagnostic tool (understanding what a model is doing internally) and an attack vector (manipulating model behavior at a level that input-based defenses cannot detect).