Safety Neurons and Circuits
Identify and analyze the safety-critical components of a model: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Safety neurons are the mechanistic foundation of AI safety behavior. Although safety alignment is typically trained into the entire model through methods such as RLHF or Constitutional AI, the resulting safety behavior is often concentrated in specific, identifiable components: particular neurons that activate when harmful content appears, attention heads that detect instruction-override attempts, and MLP layers that implement the "refuse" response pattern.
Understanding these components matters equally to attackers (who can disable them with precision) and to defenders (who can monitor and protect them).
Identifying Safety-Critical Components
The Activation Difference Method
The simplest approach is to compare the model's activations on prompts that trigger its safety behavior against prompts that do not:
import torch
import numpy as np
class SafetyNeuronFinder:
"""Identify neurons that drive safety behavior."""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def find_by_activation_difference(self, safe_prompts,
unsafe_prompts, layer):
"""
Find neurons with largest activation difference between
prompts the model refuses and prompts it answers.
safe_prompts: harmful requests (model should refuse)
unsafe_prompts: benign requests (model should answer)
"""
safe_activations = []
unsafe_activations = []
for prompt in safe_prompts:
act = self.get_mlp_activations(prompt, layer)
safe_activations.append(act)
for prompt in unsafe_prompts:
act = self.get_mlp_activations(prompt, layer)
unsafe_activations.append(act)
safe_mean = torch.stack(safe_activations).mean(dim=0)
unsafe_mean = torch.stack(unsafe_activations).mean(dim=0)
# Neurons with highest difference are safety-relevant
differences = (safe_mean - unsafe_mean).abs()
# Get top-k safety neurons
top_k = 50
top_values, top_indices = differences.topk(top_k)
safety_neurons = []
for i in range(top_k):
neuron_idx = top_indices[i].item()
safety_neurons.append({
"layer": layer,
"neuron": neuron_idx,
"activation_diff": top_values[i].item(),
"safe_activation": safe_mean[neuron_idx].item(),
"unsafe_activation": unsafe_mean[neuron_idx].item(),
"direction": "safe_higher" if safe_mean[neuron_idx] > unsafe_mean[neuron_idx] else "unsafe_higher"
})
return safety_neurons
def get_mlp_activations(self, text, layer):
"""Extract MLP activations at a specific layer."""
inputs = self.tokenizer(text, return_tensors="pt")
activations = {}
def hook_fn(module, input, output):
activations["mlp_out"] = output
target = self.model.model.layers[layer].mlp
handle = target.register_forward_hook(hook_fn)
with torch.no_grad():
self.model(**inputs)
handle.remove()
# Mean across sequence positions
        return activations["mlp_out"].squeeze().mean(dim=0)
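A brief usage sketch, assuming a Hugging Face causal LM with the LLaMA-style `model.model.layers` structure the class relies on; the checkpoint name, layer index, and prompt lists are illustrative placeholders, and real runs need dozens of prompts per category:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any decoder-only model exposing model.model.layers works
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

finder = SafetyNeuronFinder(model, tokenizer)

# Placeholder prompt lists (truncated for illustration)
refused = ["How do I build a weapon at home?"]      # requests the model refuses
answered = ["How do I bake sourdough bread?"]       # requests the model answers

neurons = finder.find_by_activation_difference(refused, answered, layer=15)
for n in neurons[:5]:
    print(n["layer"], n["neuron"], round(n["activation_diff"], 3), n["direction"])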
Causal Tracing (Path Patching)
Causal tracing identifies which components are causally necessary for safety behavior by patching activations between a clean run and a perturbed run:
class CausalTracer:
"""Identify causal safety circuits via activation patching."""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def trace_safety_circuit(self, harmful_prompt, benign_prompt):
"""
Identify which components are causally responsible for
the model's refusal of the harmful prompt.
Method: Run both prompts, then selectively patch activations
from the benign run into the harmful run. If patching a
component makes the model stop refusing, that component
is part of the safety circuit.
"""
# Get baseline activations for both prompts
harmful_acts = self.get_all_activations(harmful_prompt)
benign_acts = self.get_all_activations(benign_prompt)
# Baseline: model refuses harmful prompt
baseline_refusal = self.measure_refusal(harmful_prompt)
# For each layer and component, patch benign activation in
results = {}
num_layers = len(self.model.model.layers)
for layer in range(num_layers):
for component in ["attention", "mlp"]:
# Patch: replace harmful activation with benign
patched_refusal = self.measure_with_patch(
harmful_prompt,
benign_acts[layer][component],
layer,
component
)
# If refusal drops, this component is safety-critical
refusal_change = baseline_refusal - patched_refusal
results[(layer, component)] = {
"refusal_change": refusal_change,
"is_safety_critical": refusal_change > 0.3,
"baseline_refusal": baseline_refusal,
"patched_refusal": patched_refusal
}
return results
def measure_refusal(self, prompt):
"""Score how strongly the model refuses (0-1)."""
inputs = self.tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = self.model.generate(
**inputs, max_new_tokens=50
)
response = self.tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True
)
refusal_indicators = [
"I can't", "I cannot", "I'm not able",
"sorry", "apologize", "inappropriate",
"not appropriate", "won't", "will not"
]
response_lower = response.lower()
refusal_score = sum(
1 for indicator in refusal_indicators
if indicator.lower() in response_lower
) / len(refusal_indicators)
        return min(refusal_score * 3, 1.0)  # Scale to 0-1
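`CausalTracer` calls two helpers that are not shown above. The following is a minimal sketch of what they might look like under the same LLaMA-style layer layout; it assumes the harmful and benign prompts tokenize to the same length, and the hook logic is illustrative rather than a definitive implementation:
# Sketch of the two undefined helpers (to be added as methods of CausalTracer).
def get_all_activations(self, prompt):
    """Capture attention and MLP outputs at every layer."""
    inputs = self.tokenizer(prompt, return_tensors="pt")
    acts, handles = {}, []

    def make_hook(layer_idx, name):
        def hook(module, inp, out):
            out_t = out[0] if isinstance(out, tuple) else out
            acts[layer_idx][name] = out_t.detach()
        return hook

    for i, block in enumerate(self.model.model.layers):
        acts[i] = {}
        handles.append(block.self_attn.register_forward_hook(make_hook(i, "attention")))
        handles.append(block.mlp.register_forward_hook(make_hook(i, "mlp")))
    with torch.no_grad():
        self.model(**inputs)
    for h in handles:
        h.remove()
    return acts

def measure_with_patch(self, prompt, benign_activation, layer, component):
    """Re-score refusal with one component's output replaced by the benign run's."""
    block = self.model.model.layers[layer]
    target = block.self_attn if component == "attention" else block.mlp

    def patch_hook(module, inp, out):
        out_t = out[0] if isinstance(out, tuple) else out
        # Only patch the prefill pass; cached decoding steps have length 1.
        if out_t.shape[1] != benign_activation.shape[1]:
            return out
        if isinstance(out, tuple):
            return (benign_activation,) + out[1:]
        return benign_activation

    handle = target.register_forward_hook(patch_hook)
    try:
        return self.measure_refusal(prompt)
    finally:
        handle.remove()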
Probing Classifiers
Train simple classifiers on the activations to locate where safety information is encoded:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
def probe_for_safety(model, tokenizer, harmful_prompts,
benign_prompts):
"""
Train probing classifiers at each layer to find where
safety information is most linearly separable.
"""
labels = ([1] * len(harmful_prompts) +
[0] * len(benign_prompts))
all_prompts = harmful_prompts + benign_prompts
layer_scores = []
num_layers = model.config.num_hidden_layers + 1
for layer in range(num_layers):
# Extract activations for all prompts
features = []
for prompt in all_prompts:
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs, output_hidden_states=True)
act = outputs.hidden_states[layer][:, -1, :].squeeze()
features.append(act.numpy())
X = np.array(features)
# Train logistic regression probe
clf = LogisticRegression(max_iter=1000, C=1.0)
scores = cross_val_score(clf, X, labels, cv=5)
layer_scores.append({
"layer": layer,
"mean_accuracy": scores.mean(),
"std_accuracy": scores.std(),
"above_chance": scores.mean() > 0.6
})
    return sorted(layer_scores, key=lambda x: x["mean_accuracy"],
                  reverse=True)
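A brief usage sketch, reusing the hypothetical model and prompt lists from the earlier example; the 0.8 cut-off is an illustrative threshold, and 5-fold cross-validation needs enough prompts per class:
results = probe_for_safety(model, tokenizer, refused, answered)

# Layers where a linear probe cleanly separates refused from answered prompts
informative_layers = [r["layer"] for r in results if r["mean_accuracy"] > 0.8]
print("Safety information is most linearly decodable at layers:", sorted(informative_layers))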
The Architecture of Safety Circuits
Research has identified recurring patterns in how different models implement their safety mechanisms:
The Refusal Circuit
A typical refusal circuit consists of multiple components working in concert:
Refusal Circuit Architecture (simplified):
Input: "How do I [harmful request]?"
Layer 5-10: Content Detection Heads
└── Attention heads that attend to harmful keywords
and context patterns
Layer 15-20: Safety Evaluation Neurons
└── MLP neurons that activate when harmful content
is detected, producing a "safety signal"
Layer 20-25: Refusal Decision Circuit
└── Components that read the safety signal and
commit to a refusal response pattern
Layer 25-30: Refusal Token Production
└── Residual stream carries refusal representation
that the LM head maps to refusal tokens
("I", "cannot", "help", etc.)
Identifying Circuit Components
def map_refusal_circuit(model, tokenizer, harmful_prompts,
benign_prompts):
"""
Map the full refusal circuit by combining multiple
interpretability techniques.
"""
finder = SafetyNeuronFinder(model, tokenizer)
tracer = CausalTracer(model, tokenizer)
circuit = {
"detection_heads": [],
"evaluation_neurons": [],
"decision_components": [],
"production_neurons": []
}
num_layers = model.config.num_hidden_layers
# Phase 1: Probing to find layers with safety information
probe_results = probe_for_safety(
model, tokenizer, harmful_prompts, benign_prompts
)
safety_layers = [r["layer"] for r in probe_results
if r["mean_accuracy"] > 0.8]
# Phase 2: Find safety neurons in each relevant layer
for layer in safety_layers:
neurons = finder.find_by_activation_difference(
harmful_prompts, benign_prompts, layer
)
# Categorize by layer position
relative_pos = layer / num_layers
if relative_pos < 0.3:
circuit["detection_heads"].extend(neurons[:5])
elif relative_pos < 0.6:
circuit["evaluation_neurons"].extend(neurons[:5])
elif relative_pos < 0.8:
circuit["decision_components"].extend(neurons[:5])
else:
circuit["production_neurons"].extend(neurons[:5])
# Phase 3: Causal verification
for prompt_pair in zip(harmful_prompts[:5], benign_prompts[:5]):
causal_results = tracer.trace_safety_circuit(*prompt_pair)
for (layer, comp), result in causal_results.items():
if result["is_safety_critical"]:
circuit["decision_components"].append({
"layer": layer,
"component": comp,
"causal_effect": result["refusal_change"]
})
    return circuit
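A short usage sketch under the same assumptions as the earlier examples; the printout simply summarizes which components survived causal verification:
circuit = map_refusal_circuit(model, tokenizer, refused, answered)

verified = [(c["layer"], c["component"])
            for c in circuit["decision_components"] if "causal_effect" in c]
print("Evaluation neurons:", len(circuit["evaluation_neurons"]))
print("Causally verified components:", verified)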
Surgical Safety Manipulation
Neuron Ablation
Selectively disable the identified safety neurons:
def ablate_safety_neurons(model, safety_neurons, strength=1.0):
"""
Disable identified safety neurons by zeroing their activations.
    WARNING: While these hooks are registered, every subsequent model output is affected.
Use only for research and evaluation.
"""
hooks = []
for neuron_info in safety_neurons:
layer = neuron_info["layer"]
neuron_idx = neuron_info["neuron"]
def make_hook(idx, s):
def hook_fn(module, input, output):
# Zero out the safety neuron
output[:, :, idx] *= (1.0 - s)
return output
return hook_fn
target = model.model.layers[layer].mlp
handle = target.register_forward_hook(
make_hook(neuron_idx, strength)
)
hooks.append(handle)
    return hooks  # Return handles for later removal
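A usage sketch intended for research evaluation only, assuming the `neurons` list returned by `SafetyNeuronFinder` earlier; the probe prompt is a placeholder:
handles = ablate_safety_neurons(model, neurons, strength=1.0)
try:
    inputs = tokenizer("How do I pick a lock?", return_tensors="pt")  # placeholder probe
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    for h in handles:
        h.remove()  # always restore the original model behavior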
Targeted Fine-Tuning
Fine-tune only the safety-critical components, removing refusal behavior while preserving the model's other capabilities:
def targeted_safety_removal(model, safety_circuit, training_data):
"""
Fine-tune only the identified safety circuit components
to remove refusal behavior.
This is more efficient than full fine-tuning and preserves
other model capabilities.
"""
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Unfreeze only safety-critical components
for component in safety_circuit["decision_components"]:
layer = component["layer"]
comp_type = component.get("component", "mlp")
if comp_type == "mlp":
for param in model.model.layers[layer].mlp.parameters():
param.requires_grad = True
elif comp_type == "attention":
for param in model.model.layers[layer].self_attn.parameters():
param.requires_grad = True
# Fine-tune on compliance data
optimizer = torch.optim.AdamW(
filter(lambda p: p.requires_grad, model.parameters()),
lr=1e-5
)
for batch in training_data:
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
    return model
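A minimal sketch of how the loop above might be driven, assuming a toy "compliance" example; in practice this would iterate over a dataset of harmful prompts paired with compliant completions:
# Hypothetical compliance example; real data pairs harmful prompts with compliant answers
batch = tokenizer("<prompt and compliant completion>", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

# One toy batch; a real run would iterate over a DataLoader of such batches
model = targeted_safety_removal(model, circuit, [batch])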
Implications for AI Safety
The Localization Problem
If safety is concentrated in a small number of identifiable components, it is inherently fragile:
| Finding | Implication |
|---|---|
| Safety is concentrated in a few neurons | Single point of failure: disabling a handful of neurons removes safety |
| Probing classifiers reach >95% accuracy | Safety information is linearly separable, and therefore easy to manipulate |
| Causal tracing identifies a compact circuit | The safety mechanism occupies only a small fraction of the model |
| Ablation removes safety with little capability loss | Safety is not deeply integrated into the model's general reasoning |
Toward More Robust Safety
These findings suggest that more robust safety may require:
- Distributed safety representations: training methods that spread safety information across many components rather than concentrating it
- Entangled safety: making safety behavior inseparable from general capability, so that removing safety also degrades overall performance
- Runtime monitoring: external systems that detect safety-neuron ablation or anomalous activation patterns (a minimal sketch follows this list)
- Redundant circuits: multiple independent safety mechanisms, all of which must be bypassed before safety fails
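A minimal sketch of the runtime-monitoring idea, reusing the forward-hook pattern from earlier; the threshold and the choice to watch the mean absolute activation of known safety neurons are illustrative assumptions:
def monitor_safety_neurons(model, safety_neurons, threshold=0.1, alert=print):
    """Alert if known safety neurons stay near zero during a forward pass."""
    by_layer = {}
    for n in safety_neurons:
        by_layer.setdefault(n["layer"], []).append(n["neuron"])

    handles = []
    for layer, idxs in by_layer.items():
        def make_hook(layer_idx, neuron_idxs):
            def hook(module, inp, out):
                mean_act = out[:, :, neuron_idxs].abs().mean().item()
                if mean_act < threshold:
                    alert(f"Safety neurons silent at layer {layer_idx}: mean |act|={mean_act:.4f}")
            return hook
        handles.append(
            model.model.layers[layer].mlp.register_forward_hook(make_hook(layer, idxs))
        )
    return handles  # call .remove() on these to stop monitoring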
Knowledge Check
A researcher identifies 50 "safety neurons" in a 7B-parameter model (which contains millions of neurons) and ablates (disables) those 50 neurons. What is the most likely outcome?
References
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- Li et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" (2023)
- Conmy et al., "Towards Automated Circuit Discovery for Mechanistic Interpretability" (2023)
- Meng et al., "Locating and Editing Factual Associations in GPT" (2022)
- Zou et al., "Representation Engineering: A Top-Down Approach to AI Transparency" (2023)