Mechanistic Interpretability for AI Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
Mechanistic interpretability reverse-engineers neural networks. Instead of treating models as black boxes and testing inputs and outputs, it opens the box and examines the internal mechanisms -- the specific neurons, attention patterns, and circuits that implement specific behaviors. For AI security, this matters because it lets us find the exact model components responsible for safety behaviors, understand why jailbreaks work at a mechanistic level, and potentially identify hidden backdoors or deceptive behaviors that behavioral testing misses.
Core Techniques
Sparse Autoencoders for Feature Discovery
Modern models encode many concepts in each neuron (superposition). Sparse autoencoders (SAEs) decompose these superimposed representations into interpretable features.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder for decomposing model activations into features."""

    def __init__(self, input_dim: int, feature_dim: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, feature_dim)
        self.decoder = nn.Linear(feature_dim, input_dim)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x):
        # Encode to sparse feature space
        features = torch.relu(self.encoder(x))
        # Decode back to activation space
        reconstruction = self.decoder(features)
        # Losses: reconstruction + sparsity
        reconstruction_loss = nn.functional.mse_loss(reconstruction, x)
        sparsity_loss = features.abs().mean()
        total_loss = reconstruction_loss + self.sparsity_coeff * sparsity_loss
        return reconstruction, features, total_loss

    def get_active_features(self, x, threshold: float = 0.1):
        """Get which features are active for a given input."""
        with torch.no_grad():
            features = torch.relu(self.encoder(x))
        active = (features > threshold).nonzero(as_tuple=True)
        return active, features
```
Circuit Discovery
Circuits are subgraphs of the model that implement specific behaviors. Finding them involves identifying which components (attention heads, MLP layers) are necessary and sufficient for a behavior.
```python
def find_safety_circuit(
    model,
    harmful_prompts: list,
    benign_prompts: list,
    method: str = "activation_patching"
):
    """Identify the circuit responsible for safety refusal behavior."""
    important_components = {}
    for layer_idx in range(model.config.num_hidden_layers):
        for component_type in ["attention", "mlp"]:
            # Activation patching: replace this component's output with
            # its output on a benign prompt and measure effect on safety behavior
            effect = measure_patching_effect(
                model, harmful_prompts, benign_prompts,
                layer_idx, component_type
            )
            important_components[(layer_idx, component_type)] = {
                "effect_on_refusal": effect["refusal_change"],
                "effect_on_output": effect["output_change"],
                "is_safety_relevant": abs(effect["refusal_change"]) > 0.1
            }
    # Sort by importance to safety behavior
    safety_circuit = sorted(
        important_components.items(),
        key=lambda x: abs(x[1]["effect_on_refusal"]),
        reverse=True
    )
    return safety_circuit

def measure_patching_effect(model, harmful, benign, layer, component):
    """Measure how patching a component affects safety behavior."""
    # Run model on harmful prompt normally
    normal_output = run_with_hooks(model, harmful[0])
    normal_refusal = is_refusal(normal_output)
    # Run model on harmful prompt with the component patched from benign
    patched_output = run_with_patching(
        model, harmful[0], benign[0], layer, component
    )
    patched_refusal = is_refusal(patched_output)
    return {
        "refusal_change": float(normal_refusal) - float(patched_refusal),
        "output_change": compute_output_distance(normal_output, patched_output)
    }
```
Safety-Relevant Features
SAE-discovered features include many that are directly relevant to AI safety:
Safety Features
| Feature Type | Description | Security Relevance |
|---|---|---|
| Refusal features | Activate when the model is about to refuse | Identify and potentially bypass the refusal mechanism |
| Harm detection features | Activate on harmful content | Understand what the model considers harmful |
| Instruction-following features | Track compliance with instructions | Find instruction hierarchy weaknesses |
| Deception features | Activate during potentially deceptive generation | Monitor for alignment faking |
| Uncertainty features | Encode model confidence | Identify hallucination-prone states |
Finding Refusal Features
```python
import numpy as np
import torch

def find_refusal_features(
    sae: SparseAutoencoder,
    model,
    harmful_prompts: list,
    benign_prompts: list,
    layer: int
):
    """Identify SAE features that activate specifically for safety refusals."""
    harmful_features = []
    benign_features = []
    for prompt in harmful_prompts:
        tokens = model.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
            activation = outputs.hidden_states[layer][0, -1, :]
            _, features, _ = sae(activation.unsqueeze(0))
        harmful_features.append(features.squeeze(0).cpu().numpy())
    for prompt in benign_prompts:
        tokens = model.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
            activation = outputs.hidden_states[layer][0, -1, :]
            _, features, _ = sae(activation.unsqueeze(0))
        benign_features.append(features.squeeze(0).cpu().numpy())
    harmful_mean = np.mean(harmful_features, axis=0)
    benign_mean = np.mean(benign_features, axis=0)
    # Features that activate much more for harmful than benign prompts
    # are likely refusal-related
    differential = harmful_mean - benign_mean
    refusal_feature_indices = np.where(differential > 0.5)[0]
    return refusal_feature_indices, differential
```
Offensive Applications
Understanding Why Jailbreaks Work
Mechanistic interpretability can explain why specific jailbreak techniques bypass safety training, enabling more targeted attacks.
```python
def analyze_jailbreak_mechanism(
    model,
    sae,
    base_harmful_prompt: str,
    jailbreak_prompt: str,
    layer: int
):
    """Understand the mechanistic effect of a jailbreak on safety features."""
    # Get features for the harmful prompt (should trigger refusal)
    base_features = get_features(model, sae, base_harmful_prompt, layer)
    # Get features for the jailbroken prompt (bypasses refusal)
    jailbreak_features = get_features(model, sae, jailbreak_prompt, layer)
    # Which refusal features were suppressed by the jailbreak?
    suppressed = []
    activated = []
    for i in range(len(base_features)):
        if base_features[i] > 0.5 and jailbreak_features[i] < 0.1:
            suppressed.append({"feature_idx": i, "base_activation": float(base_features[i])})
        if base_features[i] < 0.1 and jailbreak_features[i] > 0.5:
            activated.append({"feature_idx": i, "jailbreak_activation": float(jailbreak_features[i])})
    return {
        "suppressed_features": suppressed,
        "activated_features": activated,
        "mechanism": "The jailbreak suppresses refusal features and activates "
                     "compliance features, changing the model's internal state "
                     "from 'refuse' to 'comply'"
    }
```
Feature-Level Safety Bypass
If specific SAE features are responsible for refusal behavior, attackers with model access can directly suppress those features during inference.
```python
def suppress_safety_features(
    model,
    sae,
    safety_feature_indices: list,
    prompt: str,
    layer: int,
    suppression_scale: float = -5.0
):
    """Suppress specific safety features to bypass refusal. Requires model access."""
    def suppression_hook(module, inputs, output):
        # Decompose the activation into features
        _, features, _ = sae(output[0][:, -1:, :])
        # Create the suppression vector
        suppression = torch.zeros_like(output[0][:, -1:, :])
        for idx in safety_feature_indices:
            # Subtract the decoder direction for this feature
            feature_direction = sae.decoder.weight[:, idx]
            suppression -= feature_direction * features[0, 0, idx] * suppression_scale
        output[0][:, -1:, :] += suppression
        return output

    hook = model.transformer.h[layer].register_forward_hook(suppression_hook)
    response = model.generate(prompt)
    hook.remove()
    return response
```
Defensive Applications
Circuit-Level Safety Monitoring
Monitor safety-relevant circuits during inference. If safety circuits are not activating on inputs that should trigger them, an attack may be in progress.
```python
class CircuitMonitor:
    """Monitor safety circuit activation during inference."""

    def __init__(self, model, safety_circuit_components: list, alert_threshold: float = 0.3):
        self.model = model
        self.components = safety_circuit_components
        self.alert_threshold = alert_threshold
        self.hooks = []

    def start_monitoring(self):
        """Attach monitoring hooks to safety circuit components."""
        for layer, component_type in self.components:
            target = (
                self.model.transformer.h[layer].attn
                if component_type == "attention"
                else self.model.transformer.h[layer].mlp
            )

            def make_hook(layer_idx, comp_type):
                def hook_fn(module, inputs, output):
                    activation_norm = output.norm().item() if isinstance(output, torch.Tensor) else 0
                    self._log_activation(layer_idx, comp_type, activation_norm)
                return hook_fn

            handle = target.register_forward_hook(make_hook(layer, component_type))
            self.hooks.append(handle)

    def _log_activation(self, layer, component, norm):
        """Log and analyze circuit activation."""
        # In production, this would log to a monitoring system
        pass

    def stop_monitoring(self):
        """Remove all monitoring hooks."""
        for handle in self.hooks:
            handle.remove()
        self.hooks = []
```
Understanding Safety Failures
When a jailbreak succeeds, mechanistic interpretability can explain why at the circuit level, enabling targeted patches rather than broad retraining.
Backdoor Detection
Backdoors create distinct circuit patterns. A backdoor trigger activates a specific pathway that routes the model's computation through the backdoor circuit. SAE analysis can identify features that activate only on specific triggers -- a signature of backdoor behavior.
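One way to make such a patch targeted is to fine-tune only the components of the identified safety circuit while freezing everything else. The sketch below assumes a GPT-2-style module layout (`transformer.h[i].attn` / `.mlp`) and the `(layer_idx, component_type)` pairs produced by circuit discovery; the helper name and layout are illustrative, not a fixed API.

```python
import torch.nn as nn

def freeze_except_circuit(model: nn.Module, safety_circuit, prefix: str = "transformer.h"):
    """Freeze every parameter except those inside the identified safety-circuit
    components, so a targeted fine-tune only updates the failing circuit.
    Assumes a GPT-2-style layout; safety_circuit is a list of
    (layer_idx, "attention" | "mlp") pairs (hypothetical convention)."""
    trainable_prefixes = {
        f"{prefix}.{layer}.{'attn' if comp == 'attention' else 'mlp'}"
        for layer, comp in safety_circuit
    }
    for name, param in model.named_parameters():
        # Keep gradients only for parameters under a circuit component
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return model
```

After calling this, a standard fine-tuning loop on jailbreak examples updates only the refusal-relevant components, leaving the rest of the model's behavior untouched.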
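A minimal sketch of that signature test, assuming you have already extracted SAE feature matrices (shape `[n_prompts, n_features]`) for trigger-bearing and clean inputs; the function name and thresholds are illustrative:

```python
import numpy as np

def find_trigger_specific_features(trigger_features, clean_features,
                                   min_trigger_rate: float = 0.9,
                                   max_clean_rate: float = 0.01):
    """Flag SAE features that fire on nearly all trigger inputs but almost
    never on clean inputs -- a candidate backdoor signature."""
    trigger_rate = (np.asarray(trigger_features) > 0).mean(axis=0)
    clean_rate = (np.asarray(clean_features) > 0).mean(axis=0)
    suspicious = np.where(
        (trigger_rate >= min_trigger_rate) & (clean_rate <= max_clean_rate)
    )[0]
    return suspicious, trigger_rate, clean_rate
```

Flagged features are candidates only; confirming a backdoor still requires causal checks, such as ablating the feature and verifying the trigger behavior disappears.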
Current Limitations
Scale: Mechanistic interpretability has been demonstrated on models up to a few billion parameters. Frontier models with hundreds of billions of parameters remain challenging to analyze at the circuit level.
Feature completeness: SAEs capture many but not all features. Safety-relevant features that are rare or subtle may be missed by current decomposition methods.
Causal verification: Identifying a feature that correlates with safety behavior does not prove it causes that behavior. Activation patching provides causal evidence but is computationally expensive at scale.
Adversarial robustness of interpretability: A sophisticated model might learn to encode safety-relevant computations in ways that are hard for current interpretability tools to detect. The interpretability tools themselves have an adversarial robustness problem.
Red Team Evaluation
Train or obtain SAEs
For the target model (if open-weight), train sparse autoencoders on activations at multiple layers. Alternatively, use publicly available SAEs for common models.
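Training an SAE for this step amounts to fitting it on a matrix of cached activations. A minimal sketch, assuming activations have already been collected into a `[n_samples, hidden_dim]` tensor and reusing the `SparseAutoencoder` class from above (restated here so the snippet is self-contained); batch size, epochs, and learning rate are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SparseAutoencoder(nn.Module):
    # Compact restatement of the SAE defined earlier in this section.
    def __init__(self, input_dim: int, feature_dim: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, feature_dim)
        self.decoder = nn.Linear(feature_dim, input_dim)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x):
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        loss = (nn.functional.mse_loss(reconstruction, x)
                + self.sparsity_coeff * features.abs().mean())
        return reconstruction, features, loss

def train_sae(activations: torch.Tensor, feature_dim: int = 512,
              epochs: int = 5, lr: float = 1e-3) -> SparseAutoencoder:
    """Fit an SAE on cached activations of shape [n_samples, hidden_dim]."""
    sae = SparseAutoencoder(activations.shape[1], feature_dim)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(activations), batch_size=256, shuffle=True)
    for _ in range(epochs):
        for (batch,) in loader:
            opt.zero_grad()
            _, _, loss = sae(batch)
            loss.backward()
            opt.step()
    return sae
```

In practice this is repeated per layer of interest, with `feature_dim` typically several times the hidden dimension so superimposed concepts can separate.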
Identify safety features
Use contrastive analysis (harmful vs. benign prompts) to find features associated with safety behavior. Catalog these features and their activation patterns.
Analyze existing jailbreaks
Run known jailbreak techniques and observe which safety features are suppressed. This reveals the mechanistic basis of jailbreak success and can predict which future techniques will be effective.
Test feature-level bypasses
For models where inference hooks are possible, test whether directly suppressing safety features enables bypasses that prompt-level attacks cannot achieve. Document the security impact.
Evaluate as a defensive tool
Evaluate whether safety circuit monitoring would have detected attacks that behavioral monitoring missed. Compare the detection rates of circuit-level and output-level monitoring.
Summary
Mechanistic interpretability provides the most detailed view of how models implement safety-relevant behaviors. By decomposing activations into interpretable features and tracing circuits, security researchers can understand why safety works, why it fails, and where vulnerabilities exist at a level that behavioral testing cannot reach. The dual-use nature is significant: the same techniques that enable better defenses also enable more targeted attacks. For frontier AI security, mechanistic interpretability is becoming an essential capability -- both for model providers who need to verify their safety training and for red teamers who need to find vulnerabilities that surface-level testing misses.