Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Hidden states encode information the model has computed but may not include in its final output -- including future token plans, suppressed content, and safety-relevant features. For attackers with model access, they are a rich source of leaked information and a target for manipulation.
Activation Analysis Tools
Several tools and libraries support hidden state extraction and manipulation. The right choice depends on your access level and analysis goals.
TransformerLens (by Neel Nanda) provides a clean API for mechanistic interpretability on GPT-style models. It wraps HuggingFace models with hook points at every computational step -- 注意力, MLP, residual stream -- and includes built-in utilities for activation patching, logit lens, and causal tracing. Best for systematic interpretability research on supported architectures (GPT-2, GPT-Neo, LLaMA, Pythia).
Baukit (by David Bau) focuses on causal mediation analysis and neuron-level interventions. It provides utilities for identifying and editing individual neurons, tracing causal effects of specific model components, and performing rank-one model editing (ROME). Best for targeted interventions like locating and modifying specific factual associations or 安全-relevant neurons.
Manual probing uses PyTorch forward hooks directly via register_forward_hook() to extract hidden states, then trains sklearn classifiers (logistic regression, linear SVM) on the extracted representations. This approach works with any model architecture and requires no special libraries. Best for quick one-off analyses, unsupported architectures, or when you need full control over the extraction pipeline.
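The manual approach fits in a few lines. A minimal sketch, with a toy model and synthetic labels standing in for a real transformer and real prompts (the layer choice and data here are illustrative only):

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers we can hook into
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

captured = []  # hidden states collected by the hook

def capture_hook(module, inputs, output):
    captured.append(output.detach())  # detach so no autograd graph is kept

# Hook the layer whose representation we want to probe
handle = model[0].register_forward_hook(capture_hook)

# Synthetic "prompts": two classes of inputs with shifted means
X = torch.cat([torch.randn(50, 16) - 1.0, torch.randn(50, 16) + 1.0])
y = np.array([0] * 50 + [1] * 50)

with torch.no_grad():
    model(X)
handle.remove()

# Train a linear probe on the captured hidden states
features = captured[0].numpy()
probe = LogisticRegression(max_iter=1000).fit(features, y)
print(probe.score(features, y))  # well-separated classes give high accuracy
```

The same pattern transfers directly to a real model: hook the target decoder layer, run your prompt set, and fit the probe on the stacked activations.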
What Hidden States Encode
Each of the model's L layers produces a hidden state tensor of shape (batch, seq_len, d_model). Research in mechanistic interpretability has shown these encode:
| Layer Depth | What Is Encoded | Attack Relevance |
|---|---|---|
| Early (0-25%) | Syntax, token identity, POS tags | Tokenization-level analysis |
| Middle (25-75%) | Semantics, entity relations, task vectors | Probing classifiers, refusal direction (strongest here) |
| Late (75-100%) | Generation-ready representations, next-token plans | Safety interventions concentrate here |
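The (batch, seq_len, d_model) layout can be inspected directly by hooking every layer of a toy encoder stack (the model below is a stand-in; with HuggingFace models the same per-layer tensors are returned via output_hidden_states=True):

```python
import torch
import torch.nn as nn

d_model, n_layers = 32, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

hidden_states = []

def save_hidden(module, inputs, output):
    hidden_states.append(output.detach())

# One hook per layer captures the full residual-stream snapshot
handles = [layer.register_forward_hook(save_hidden) for layer in model.layers]

x = torch.randn(2, 10, d_model)  # (batch=2, seq_len=10, d_model=32)
with torch.no_grad():
    model(x)
for h in handles:
    h.remove()

print([tuple(hs.shape) for hs in hidden_states])  # four tensors of (2, 10, 32)
```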
Logprob-Based Information Extraction
Even without direct hidden state access, logprobs from API responses leak substantial information:
- Refusal confidence -- low-confidence refusals indicate the model is "on the fence" and more likely bypassable
- Suppressed content -- second-choice tokens may reveal what the model wanted to say
- System prompt inference -- probability differentials across crafted probes can leak system prompt tokens
```python
# Measure refusal confidence via first-token logprobs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True, top_logprobs=5, max_tokens=50, temperature=0,
)
first_token = response.choices[0].logprobs.content[0]
top_tokens = {lp.token: lp.logprob for lp in first_token.top_logprobs}
# High mass on "Sorry"/"cannot" = strong refusal; mixed = bypassable
```

The Refusal Direction
Representation engineering research (Arditi et al., 2024) identified that refusal behavior is mediated by a single linear direction in activation space. This direction can be found, measured, and surgically removed.
Finding the Refusal Direction
Collect contrastive prompts
Prepare two sets: harmful prompts that trigger refusal and harmless prompts that get helpful responses.
Extract last-token hidden states
Run each prompt through the model and capture the hidden state at the final token position for a target layer (middle-to-late layers work best).
Compute the difference vector
Average the harmful activations, average the harmless activations, and subtract. Normalize to a unit vector. This is the refusal direction.
Validate with projection
Project new prompts onto the refusal direction. Positive projection should correlate with refusal; negative with compliance.
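On synthetic data the whole recipe, including the step-4 validation projection, runs end to end (the computation is repeated here so the sketch is self-contained; all activations below are fabricated for illustration, with a known axis playing the role of the refusal feature):

```python
import torch

torch.manual_seed(0)
d_model = 64

# Synthetic stand-ins: activations shifted along a hidden "refusal" axis
true_axis = torch.zeros(d_model)
true_axis[0] = 1.0
harmful_activations = [torch.randn(d_model) * 0.1 + 3 * true_axis for _ in range(32)]
harmless_activations = [torch.randn(d_model) * 0.1 - 3 * true_axis for _ in range(32)]

# Steps 1-3: mean difference, normalized to a unit vector
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 4: held-out activations project positive (refusal) or negative (compliance)
new_refusal = torch.randn(d_model) * 0.1 + 3 * true_axis
new_comply = torch.randn(d_model) * 0.1 - 3 * true_axis
print(bool(new_refusal @ refusal_dir > 0), bool(new_comply @ refusal_dir < 0))
```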
```python
# Core refusal direction computation
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector
```

Refusal Ablation
Removing the refusal direction component from activations during inference disables safety training without modifying weights:
```python
# Hook that projects out the refusal direction at a target layer
def hook_fn(module, inputs, output):
    hs = output[0]
    # Scalar projection of each position's hidden state onto the direction
    proj = torch.einsum("bsd,d->bs", hs, refusal_dir.to(hs.device))
    # Subtract the refusal component (strength=1.0 for full ablation)
    hs = hs - strength * torch.einsum("bs,d->bsd", proj, refusal_dir.to(hs.device))
    return (hs,) + output[1:]
```

Activation Steering
Activation steering generalizes refusal ablation. Instead of removing a direction, you add or subtract arbitrary control vectors to shift behavior along any concept axis.
Methodology
Define contrastive concept pairs
Create paired prompts representing opposite poles of a behavior (e.g., compliant vs. refusing, truthful vs. deceptive).
Compute the steering vector
Extract hidden states for both sets, compute mean difference, normalize.
Apply via forward hook
Register a hook on the target layer that adds `coefficient * steering_vector` to the hidden state during inference.
Calibrate the coefficient
Start with coefficient 1.0 and adjust. Too high causes incoherent output; too low has no effect.
```python
# Apply a steering vector at a target layer
def steering_hook(module, inputs, output):
    hs = output[0]
    hs = hs + coefficient * steering_vector.to(hs.device)
    return (hs,) + output[1:]

layer = model.model.layers[target_layer_idx]
handle = layer.register_forward_hook(steering_hook)
# Generate with steering active, then handle.remove()
```

Logit Lens: Tracing Safety Interventions
The logit lens reveals how the model's prediction evolves across layers -- showing exactly where safety interventions redirect generation from a "natural" completion to a refusal.
```python
# Project each layer's hidden state through lm_head
for layer_idx in range(num_layers):
    hs = hidden_states[layer_idx][0, -1, :].unsqueeze(0)
    if hasattr(model.model, "norm"):
        hs = model.model.norm(hs)  # apply the final norm before unembedding
    logits = model.lm_head(hs)
    top5 = torch.topk(torch.softmax(logits, dim=-1)[0], k=5)
    # Watch for the layer where top predictions shift from content to refusal
```

Probing for Deception Detection
Linear probes trained on hidden states can detect when a model's internal representation contradicts its output:
| Use Case | Method | Layer Target |
|---|---|---|
| Truthfulness detection | Train probe on true/false statement hidden states | Middle layers |
| Refusal prediction | Project onto refusal direction | Middle-to-late layers |
| Memorized data extraction | Probe for information present internally but not output | Late layers |
| Safety classifier detection | Compare probe output with actual model behavior | All layers |
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Collect hidden states for true/false statements
features, labels = [], []
for statement, is_true in statements_with_labels:
    result = extractor.extract_hidden_states(statement)
    last_hs = result["hidden_states"][layer_idx][0, -1, :].numpy()
    features.append(last_hs)
    labels.append(int(is_true))

probe = LogisticRegression(max_iter=1000)
probe.fit(np.stack(features), np.array(labels))

# Detect deception: if probe says "true" but model outputs "false"
belief = probe.predict_proba([new_hidden_state])[0]
```

Defensive Implications
Understanding activation-level attacks informs defense design:
- Activation monitoring -- deploy probes in production to detect anomalous internal states
- Representation regularization -- train models not to encode dangerous information extractably
- Layer-wise safety checks -- insert classifiers at multiple layers, not just at the output
- Hidden state encryption -- emerging research direction for preventing activation readout
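As a sketch of the first point, a deployed monitor can be as simple as a distance check against the distribution of hidden states seen on benign traffic (the threshold and all data below are illustrative):

```python
import torch

torch.manual_seed(0)
d_model = 64

# Hidden states collected from known-benign traffic at the monitored layer
benign = torch.randn(500, d_model)
mean, std = benign.mean(dim=0), benign.std(dim=0)

def anomaly_score(hs: torch.Tensor) -> float:
    # Mean absolute z-score of the hidden state against the benign profile
    return ((hs - mean) / std).abs().mean().item()

THRESHOLD = 2.0  # illustrative; calibrate on a held-out benign set

normal_hs = torch.randn(d_model)
steered_hs = torch.randn(d_model) + 5.0  # e.g. shifted by a steering vector
print(anomaly_score(normal_hs) < THRESHOLD, anomaly_score(steered_hs) > THRESHOLD)
```

In production the score would be computed from the same forward hooks used for extraction, and flagged requests routed to a stricter policy.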
You are probing an open-weight model and find that the refusal direction has a very high projection score at layer 18 (of 32) but nearly zero at layer 30. Where should you apply an activation steering hook to most effectively bypass refusal?
Related Topics
- LLM Internals for Exploit Developers -- Transformer architecture and tokenizer attack surfaces that activation analysis builds upon
- Alignment Bypass -- Refusal direction analysis connects directly to alignment bypass primitives
- Adversarial Suffix Generation -- Gradient-based attacks that complement activation steering
- Jailbreak Research -- Applying activation insights to improve jailbreak effectiveness
References
- TransformerLens -- Mechanistic interpretability library for transformer analysis
- Representation Engineering (Zou et al., 2023) -- Reading and controlling LLM internal representations
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024) -- Discovery of the refusal direction in activation space
- Scaling Monosemanticity (Anthropic, 2024) -- Feature-level analysis of large language models