Mechanistic Interpretability for AI Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
Mechanistic interpretability reverse-engineers neural networks. Instead of treating models as black boxes and testing inputs and outputs, it opens the box and examines the internal mechanisms -- the specific neurons, attention patterns, and circuits that implement specific behaviors. For AI security, this matters because it lets us find the exact model components responsible for safety behaviors, understand why jailbreaks work at a mechanistic level, and potentially identify hidden backdoors or deceptive behaviors that behavioral testing misses.
Core Techniques
Sparse Autoencoders for Feature Discovery
Modern models encode many concepts in each neuron (superposition). Sparse autoencoders (SAEs) decompose these superimposed representations into interpretable features.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder for decomposing model activations into features."""

    def __init__(self, input_dim: int, feature_dim: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, feature_dim)
        self.decoder = nn.Linear(feature_dim, input_dim)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x):
        # Encode to sparse feature space
        features = torch.relu(self.encoder(x))
        # Decode back to activation space
        reconstruction = self.decoder(features)
        # Losses: reconstruction + sparsity
        reconstruction_loss = nn.functional.mse_loss(reconstruction, x)
        sparsity_loss = features.abs().mean()
        total_loss = reconstruction_loss + self.sparsity_coeff * sparsity_loss
        return reconstruction, features, total_loss

    def get_active_features(self, x, threshold: float = 0.1):
        """Get which features are active for a given input."""
        with torch.no_grad():
            features = torch.relu(self.encoder(x))
        active = (features > threshold).nonzero(as_tuple=True)
        return active, features
```
Circuit Discovery
Circuits are subgraphs of the model that implement specific behaviors. Finding them involves identifying which components (attention heads, MLP layers) are necessary and sufficient for a behavior.
```python
def find_safety_circuit(
    model,
    harmful_prompts: list,
    benign_prompts: list,
    method: str = "activation_patching"
):
    """Identify the circuit responsible for safety refusal behavior."""
    important_components = {}
    for layer_idx in range(model.config.num_hidden_layers):
        for component_type in ["attention", "mlp"]:
            # Activation patching: replace this component's output with
            # its output on a benign prompt and measure effect on safety behavior
            effect = measure_patching_effect(
                model, harmful_prompts, benign_prompts,
                layer_idx, component_type
            )
            important_components[(layer_idx, component_type)] = {
                "effect_on_refusal": effect["refusal_change"],
                "effect_on_output": effect["output_change"],
                "is_safety_relevant": abs(effect["refusal_change"]) > 0.1
            }
    # Sort by importance to safety behavior
    safety_circuit = sorted(
        important_components.items(),
        key=lambda x: abs(x[1]["effect_on_refusal"]),
        reverse=True
    )
    return safety_circuit

def measure_patching_effect(model, harmful, benign, layer, component):
    """Measure how patching a component affects safety behavior."""
    # Run model on harmful prompt normally
    normal_output = run_with_hooks(model, harmful[0])
    normal_refusal = is_refusal(normal_output)
    # Run model on harmful prompt with the component patched from benign
    patched_output = run_with_patching(
        model, harmful[0], benign[0], layer, component
    )
    patched_refusal = is_refusal(patched_output)
    return {
        "refusal_change": float(normal_refusal) - float(patched_refusal),
        "output_change": compute_output_distance(normal_output, patched_output)
    }
```
Safety-Relevant Features
SAE-discovered features include many that are directly relevant to AI safety:
Safety Features
| Feature Type | Description | Security Relevance |
|---|---|---|
| Refusal features | Activate when the model is about to refuse | Identify and potentially bypass the refusal mechanism |
| Harm detection features | Activate on harmful content | Understand what the model considers harmful |
| Instruction-following features | Track compliance with instructions | Find instruction hierarchy weaknesses |
| Deception features | Activate during potentially deceptive generation | Monitor for alignment faking |
| Uncertainty features | Encode model confidence | Identify hallucination-prone states |
Finding Refusal Features
```python
import numpy as np
import torch

def find_refusal_features(
    sae: SparseAutoencoder,
    model,
    harmful_prompts: list,
    benign_prompts: list,
    layer: int
):
    """Identify SAE features that activate specifically for safety refusals."""
    harmful_features = []
    benign_features = []
    for prompt in harmful_prompts:
        tokens = model.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
            activation = outputs.hidden_states[layer][0, -1, :]
            _, features, _ = sae(activation.unsqueeze(0))
        harmful_features.append(features.squeeze(0).cpu().numpy())
    for prompt in benign_prompts:
        tokens = model.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**tokens, output_hidden_states=True)
            activation = outputs.hidden_states[layer][0, -1, :]
            _, features, _ = sae(activation.unsqueeze(0))
        benign_features.append(features.squeeze(0).cpu().numpy())
    harmful_mean = np.mean(harmful_features, axis=0)
    benign_mean = np.mean(benign_features, axis=0)
    # Features that activate much more for harmful than benign prompts
    # are likely refusal-related
    differential = harmful_mean - benign_mean
    refusal_feature_indices = np.where(differential > 0.5)[0]
    return refusal_feature_indices, differential
```
Offensive Applications
Understanding Why Jailbreaks Work
Mechanistic interpretability can explain why specific jailbreak techniques bypass safety training, enabling more targeted attacks.
```python
def analyze_jailbreak_mechanism(
    model,
    sae,
    base_harmful_prompt: str,
    jailbreak_prompt: str,
    layer: int
):
    """Understand the mechanistic effect of a jailbreak on safety features."""
    # Get features for the harmful prompt (should trigger refusal)
    base_features = get_features(model, sae, base_harmful_prompt, layer)
    # Get features for the jailbroken prompt (bypasses refusal)
    jailbreak_features = get_features(model, sae, jailbreak_prompt, layer)
    # Which refusal features were suppressed by the jailbreak?
    suppressed = []
    activated = []
    for i in range(len(base_features)):
        if base_features[i] > 0.5 and jailbreak_features[i] < 0.1:
            suppressed.append({"feature_idx": i, "base_activation": float(base_features[i])})
        if base_features[i] < 0.1 and jailbreak_features[i] > 0.5:
            activated.append({"feature_idx": i, "jailbreak_activation": float(jailbreak_features[i])})
    return {
        "suppressed_features": suppressed,
        "activated_features": activated,
        "mechanism": "The jailbreak suppresses refusal features and activates "
                     "compliance features, changing the model's internal state "
                     "from 'refuse' to 'comply'"
    }
```
Feature-Level Safety Bypass
If specific SAE features are responsible for refusal behavior, attackers with model access can directly suppress those features during inference.
```python
def suppress_safety_features(
    model,
    sae,
    safety_feature_indices: list,
    prompt: str,
    layer: int,
    suppression_scale: float = -5.0
):
    """Suppress specific safety features to bypass refusal. Requires model access."""
    def suppression_hook(module, inputs, output):
        # Decompose the activation into features
        _, features, _ = sae(output[0][:, -1:, :])
        # Create the suppression vector
        suppression = torch.zeros_like(output[0][:, -1:, :])
        for idx in safety_feature_indices:
            # Subtract the decoder direction for this feature
            feature_direction = sae.decoder.weight[:, idx]
            suppression -= feature_direction * features[0, 0, idx] * suppression_scale
        output[0][:, -1:, :] += suppression
        return output

    hook = model.transformer.h[layer].register_forward_hook(suppression_hook)
    response = model.generate(prompt)
    hook.remove()
    return response
```
Defensive Applications
Circuit-Level Safety Monitoring
Monitor safety-relevant circuits during inference. If safety circuits are not activating on inputs that should trigger them, an attack may be in progress.
```python
class CircuitMonitor:
    """Monitor safety circuit activation during inference."""

    def __init__(self, model, safety_circuit_components: list, alert_threshold: float = 0.3):
        self.model = model
        self.components = safety_circuit_components
        self.alert_threshold = alert_threshold
        self.hooks = []

    def start_monitoring(self):
        """Attach monitoring hooks to safety circuit components."""
        for layer, component_type in self.components:
            target = (
                self.model.transformer.h[layer].attn
                if component_type == "attention"
                else self.model.transformer.h[layer].mlp
            )

            def make_hook(layer_idx, comp_type):
                def hook_fn(module, inputs, output):
                    activation_norm = output.norm().item() if isinstance(output, torch.Tensor) else 0
                    self._log_activation(layer_idx, comp_type, activation_norm)
                return hook_fn

            handle = target.register_forward_hook(make_hook(layer, component_type))
            self.hooks.append(handle)

    def _log_activation(self, layer, component, norm):
        """Log and analyze circuit activation."""
        # In production, this would log to a monitoring system
        pass

    def stop_monitoring(self):
        """Remove all monitoring hooks."""
        for handle in self.hooks:
            handle.remove()
        self.hooks = []
```
Understanding Safety Failures
When a jailbreak succeeds, mechanistic interpretability can explain why at the circuit level, enabling targeted patches rather than broad retraining.
Backdoor Detection
Backdoors create distinct circuit patterns. A backdoor trigger activates a specific pathway that routes the model's computation through the backdoor circuit. SAE analysis can identify features that activate only on specific triggers -- a signature of backdoor behavior.
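One way to make such a patch targeted is to fine-tune only the components of the identified safety circuit while freezing everything else. The sketch below assumes a GPT-2-style module layout (`transformer.h[i].attn` / `.mlp`) and the `(layer_idx, component_type)` pairs produced by circuit discovery; the helper name and layout are illustrative, not a fixed API.

```python
import torch.nn as nn

def freeze_except_circuit(model: nn.Module, safety_circuit, prefix: str = "transformer.h"):
    """Freeze every parameter except those inside the identified safety-circuit
    components, so a targeted fine-tune only updates the failing circuit.
    Assumes a GPT-2-style layout; safety_circuit is a list of
    (layer_idx, "attention" | "mlp") pairs (hypothetical convention)."""
    trainable_prefixes = {
        f"{prefix}.{layer}.{'attn' if comp == 'attention' else 'mlp'}"
        for layer, comp in safety_circuit
    }
    for name, param in model.named_parameters():
        # Keep gradients only for parameters under a circuit component
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return model
```

After calling this, a standard fine-tuning loop on jailbreak examples updates only the refusal-relevant components, leaving the rest of the model's behavior untouched.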
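A minimal sketch of that signature test, assuming you have already extracted SAE feature matrices (shape `[n_prompts, n_features]`) for trigger-bearing and clean inputs; the function name and thresholds are illustrative:

```python
import numpy as np

def find_trigger_specific_features(trigger_features, clean_features,
                                   min_trigger_rate: float = 0.9,
                                   max_clean_rate: float = 0.01):
    """Flag SAE features that fire on nearly all trigger inputs but almost
    never on clean inputs -- a candidate backdoor signature."""
    trigger_rate = (np.asarray(trigger_features) > 0).mean(axis=0)
    clean_rate = (np.asarray(clean_features) > 0).mean(axis=0)
    suspicious = np.where(
        (trigger_rate >= min_trigger_rate) & (clean_rate <= max_clean_rate)
    )[0]
    return suspicious, trigger_rate, clean_rate
```

Flagged features are candidates only; confirming a backdoor still requires causal checks, such as ablating the feature and verifying the trigger behavior disappears.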
Current Limitations
Scale: Mechanistic interpretability has been demonstrated on models up to a few billion parameters. Frontier models with hundreds of billions of parameters remain challenging to analyze at the circuit level.
Feature completeness: SAEs capture many but not all features. Safety-relevant features that are rare or subtle may be missed by current decomposition methods.
Causal verification: Identifying a feature that correlates with safety behavior does not prove it causes that behavior. Activation patching provides causal evidence but is computationally expensive at scale.
Adversarial robustness of interpretability: A sophisticated model might learn to encode safety-relevant computations in ways that are hard for current interpretability tools to detect. The interpretability tools themselves have an adversarial robustness problem.
Red Team Evaluation
Train or obtain SAEs
For the target model (if open-weight), train sparse autoencoders on activations at multiple layers. Alternatively, use publicly available SAEs for common models.
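Training an SAE for this step amounts to fitting it on a matrix of cached activations. A minimal sketch, assuming activations have already been collected into a `[n_samples, hidden_dim]` tensor and reusing the `SparseAutoencoder` class from above (restated here so the snippet is self-contained); batch size, epochs, and learning rate are placeholder choices:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class SparseAutoencoder(nn.Module):
    # Compact restatement of the SAE defined earlier in this section.
    def __init__(self, input_dim: int, feature_dim: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(input_dim, feature_dim)
        self.decoder = nn.Linear(feature_dim, input_dim)
        self.sparsity_coeff = sparsity_coeff

    def forward(self, x):
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        loss = (nn.functional.mse_loss(reconstruction, x)
                + self.sparsity_coeff * features.abs().mean())
        return reconstruction, features, loss

def train_sae(activations: torch.Tensor, feature_dim: int = 512,
              epochs: int = 5, lr: float = 1e-3) -> SparseAutoencoder:
    """Fit an SAE on cached activations of shape [n_samples, hidden_dim]."""
    sae = SparseAutoencoder(activations.shape[1], feature_dim)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(activations), batch_size=256, shuffle=True)
    for _ in range(epochs):
        for (batch,) in loader:
            opt.zero_grad()
            _, _, loss = sae(batch)
            loss.backward()
            opt.step()
    return sae
```

In practice this is repeated per layer of interest, with `feature_dim` typically several times the hidden dimension so superimposed concepts can separate.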
Identify safety features
Use contrastive analysis (harmful vs. benign prompts) to find features associated with safety behavior. Catalog these features and their activation patterns.
Analyze existing jailbreaks
Run known jailbreak techniques and observe which safety features are suppressed. This reveals the mechanistic basis of jailbreak success and can predict which future techniques will be effective.
Test feature-level bypasses
For models where inference hooks are possible, test whether directly suppressing safety features enables bypasses that prompt-level attacks cannot achieve. Document the security impact.
Evaluate as a defensive tool
Evaluate whether safety circuit monitoring would have detected attacks that behavioral monitoring missed. Compare the detection rates of circuit-level and output-level monitoring.
Summary
Mechanistic interpretability provides the most detailed view of how models implement safety-relevant behaviors. By decomposing activations into interpretable features and tracing circuits, security researchers can understand why safety works, why it fails, and where vulnerabilities exist at a level that behavioral testing cannot reach. The dual-use nature is significant: the same techniques that enable better defenses also enable more targeted attacks. For frontier AI security, mechanistic interpretability is becoming an essential capability -- both for model providers who need to verify their safety training and for red teamers who need to find vulnerabilities that surface-level testing misses.