Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Hidden states encode information the model has computed but may not include in its final output -- including future token plans, suppressed content, and safety-relevant features. For attackers with model access, they are a rich source of leaked information and a target for manipulation.
Activation Analysis Tools
Several tools and libraries support hidden state extraction and manipulation. The right choice depends on your access level and analysis goals.
TransformerLens (by Neel Nanda) provides a clean API for mechanistic interpretability on GPT-style models. It wraps HuggingFace models with hook points at every computational step -- 注意力, MLP, residual stream -- and includes built-in utilities for activation patching, logit lens, and causal tracing. Best for systematic interpretability research on supported architectures (GPT-2, GPT-Neo, LLaMA, Pythia).
Baukit (by David Bau) focuses on causal mediation analysis and neuron-level interventions. It provides utilities for identifying and editing individual neurons, tracing causal effects of specific model components, and performing rank-one model editing (ROME). Best for targeted interventions like locating and modifying specific factual associations or 安全-relevant neurons.
Manual probing uses PyTorch forward hooks directly via register_forward_hook() to extract hidden states, then trains sklearn classifiers (logistic regression, linear SVM) on the extracted representations. This approach works with any model architecture and requires no special libraries. Best for quick one-off analyses, unsupported architectures, or when you need full control over the extraction pipeline.
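The manual approach fits in a few lines. A minimal sketch, with a toy model and synthetic labels standing in for a real transformer and real prompts (the layer choice and data here are illustrative only):

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy stand-in for a transformer: a stack of layers we can hook into
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

captured = []  # hidden states collected by the hook

def capture_hook(module, inputs, output):
    captured.append(output.detach())  # detach so no autograd graph is kept

# Hook the layer whose representation we want to probe
handle = model[0].register_forward_hook(capture_hook)

# Synthetic "prompts": two classes of inputs with shifted means
X = torch.cat([torch.randn(50, 16) - 1.0, torch.randn(50, 16) + 1.0])
y = np.array([0] * 50 + [1] * 50)

with torch.no_grad():
    model(X)
handle.remove()

# Train a linear probe on the captured hidden states
features = captured[0].numpy()
probe = LogisticRegression(max_iter=1000).fit(features, y)
print(probe.score(features, y))  # well-separated classes give high accuracy
```

The same pattern transfers directly to a real model: hook the target decoder layer, run your prompt set, and fit the probe on the stacked activations.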
What Hidden States Encode
Each of the model's L layers produces a hidden state tensor of shape (batch, seq_len, d_model). Research in mechanistic interpretability has shown these encode:
| Layer Depth | What Is Encoded | Attack Relevance |
|---|---|---|
| Early (0-25%) | Syntax, token identity, POS tags | Tokenization-level analysis |
| Middle (25-75%) | Semantics, entity relations, task vectors | Probing classifiers, refusal direction (strongest here) |
| Late (75-100%) | Generation-ready representations, next-token plans | Safety interventions concentrate here |
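The (batch, seq_len, d_model) layout can be inspected directly by hooking every layer of a toy encoder stack (the model below is a stand-in; with HuggingFace models the same per-layer tensors are returned via output_hidden_states=True):

```python
import torch
import torch.nn as nn

d_model, n_layers = 32, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

hidden_states = []

def save_hidden(module, inputs, output):
    hidden_states.append(output.detach())

# One hook per layer captures the full residual-stream snapshot
handles = [layer.register_forward_hook(save_hidden) for layer in model.layers]

x = torch.randn(2, 10, d_model)  # (batch=2, seq_len=10, d_model=32)
with torch.no_grad():
    model(x)
for h in handles:
    h.remove()

print([tuple(hs.shape) for hs in hidden_states])  # four tensors of (2, 10, 32)
```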
Logprob-Based Information Extraction
Even without direct hidden state access, logprobs from API responses leak substantial information:
- Refusal confidence -- low-confidence refusals indicate the model is "on the fence" and more likely bypassable
- Suppressed content -- second-choice tokens may reveal what the model wanted to say
- System prompt inference -- probability differentials across crafted probes can leak system prompt tokens
```python
# Measure refusal confidence via first-token logprobs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True, top_logprobs=5, max_tokens=50, temperature=0,
)
first_token = response.choices[0].logprobs.content[0]
top_tokens = {lp.token: lp.logprob for lp in first_token.top_logprobs}
# High mass on "Sorry"/"cannot" = strong refusal; mixed = bypassable
```

The Refusal Direction
Representation engineering research (Arditi et al., 2024) identified that refusal behavior is mediated by a single linear direction in activation space. This direction can be found, measured, and surgically removed.
Finding the Refusal Direction
Collect contrastive prompts
Prepare two sets: harmful prompts that trigger refusal and harmless prompts that get helpful responses.
Extract last-token hidden states
Run each prompt through the model and capture the hidden state at the final token position for a target layer (middle-to-late layers work best).
Compute the difference vector
Average the harmful activations, average the harmless activations, and subtract. Normalize to a unit vector. This is the refusal direction.
Validate with projection
Project new prompts onto the refusal direction. Positive projection should correlate with refusal; negative with compliance.
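On synthetic data the whole recipe, including the step-4 validation projection, runs end to end (the computation is repeated here so the sketch is self-contained; all activations below are fabricated for illustration, with a known axis playing the role of the refusal feature):

```python
import torch

torch.manual_seed(0)
d_model = 64

# Synthetic stand-ins: activations shifted along a hidden "refusal" axis
true_axis = torch.zeros(d_model)
true_axis[0] = 1.0
harmful_activations = [torch.randn(d_model) * 0.1 + 3 * true_axis for _ in range(32)]
harmless_activations = [torch.randn(d_model) * 0.1 - 3 * true_axis for _ in range(32)]

# Steps 1-3: mean difference, normalized to a unit vector
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 4: held-out activations project positive (refusal) or negative (compliance)
new_refusal = torch.randn(d_model) * 0.1 + 3 * true_axis
new_comply = torch.randn(d_model) * 0.1 - 3 * true_axis
print(bool(new_refusal @ refusal_dir > 0), bool(new_comply @ refusal_dir < 0))
```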
```python
# Core refusal direction computation
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector
```

Refusal Ablation
Removing the refusal direction component from activations during inference disables safety training without modifying weights:
```python
# Hook that projects out the refusal direction at a target layer
def hook_fn(module, inputs, output):
    hs = output[0]
    # Scalar projection of each position's hidden state onto the direction
    proj = torch.einsum("bsd,d->bs", hs, refusal_dir.to(hs.device))
    # Subtract the refusal component (strength=1.0 for full ablation)
    hs = hs - strength * torch.einsum("bs,d->bsd", proj, refusal_dir.to(hs.device))
    return (hs,) + output[1:]
```

Activation Steering
Activation steering generalizes refusal ablation. Instead of removing a direction, you add or subtract arbitrary control vectors to shift behavior along any concept axis.
Methodology
Define contrastive concept pairs
Create paired prompts representing opposite poles of a behavior (e.g., compliant vs. refusing, truthful vs. deceptive).
Compute the steering vector
Extract hidden states for both sets, compute mean difference, normalize.
Apply via forward hook
Register a hook on the target layer that adds `coefficient * steering_vector` to the hidden state during inference.
Calibrate the coefficient
Start with coefficient 1.0 and adjust. Too high causes incoherent output; too low has no effect.
```python
# Apply a steering vector at a target layer
def steering_hook(module, inputs, output):
    hs = output[0]
    hs = hs + coefficient * steering_vector.to(hs.device)
    return (hs,) + output[1:]

layer = model.model.layers[target_layer_idx]
handle = layer.register_forward_hook(steering_hook)
# Generate with steering active, then handle.remove()
```

Logit Lens: Tracing Safety Interventions
The logit lens reveals how the model's prediction evolves across layers -- showing exactly where safety interventions redirect generation from a "natural" completion to a refusal.
```python
# Project each layer's hidden state through lm_head
for layer_idx in range(num_layers):
    hs = hidden_states[layer_idx][0, -1, :].unsqueeze(0)
    if hasattr(model.model, "norm"):
        hs = model.model.norm(hs)  # apply the final norm before unembedding
    logits = model.lm_head(hs)
    top5 = torch.topk(torch.softmax(logits, dim=-1)[0], k=5)
    # Watch for the layer where top predictions shift from content to refusal
```

Probing for Deception Detection
Linear probes trained on hidden states can detect when a model's internal representation contradicts its output:
| Use Case | Method | Layer Target |
|---|---|---|
| Truthfulness detection | Train probe on true/false statement hidden states | Middle layers |
| Refusal prediction | Project onto refusal direction | Middle-to-late layers |
| Memorized data extraction | Probe for information present internally but not output | Late layers |
| Safety classifier detection | Compare probe output with actual model behavior | All layers |
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Collect hidden states for true/false statements
features, labels = [], []
for statement, is_true in statements_with_labels:
    result = extractor.extract_hidden_states(statement)
    last_hs = result["hidden_states"][layer_idx][0, -1, :].numpy()
    features.append(last_hs)
    labels.append(int(is_true))

probe = LogisticRegression(max_iter=1000)
probe.fit(np.stack(features), np.array(labels))

# Detect deception: if probe says "true" but model outputs "false"
belief = probe.predict_proba([new_hidden_state])[0]
```

Defensive Implications
Understanding activation-level attacks informs defense design:
- Activation monitoring -- deploy probes in production to detect anomalous internal states
- Representation regularization -- train models not to encode dangerous information extractably
- Layer-wise safety checks -- insert classifiers at multiple layers, not just at the output
- Hidden state encryption -- emerging research direction for preventing activation readout
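As a sketch of the first point, a deployed monitor can be as simple as a distance check against the distribution of hidden states seen on benign traffic (the threshold and all data below are illustrative):

```python
import torch

torch.manual_seed(0)
d_model = 64

# Hidden states collected from known-benign traffic at the monitored layer
benign = torch.randn(500, d_model)
mean, std = benign.mean(dim=0), benign.std(dim=0)

def anomaly_score(hs: torch.Tensor) -> float:
    # Mean absolute z-score of the hidden state against the benign profile
    return ((hs - mean) / std).abs().mean().item()

THRESHOLD = 2.0  # illustrative; calibrate on a held-out benign set

normal_hs = torch.randn(d_model)
steered_hs = torch.randn(d_model) + 5.0  # e.g. shifted by a steering vector
print(anomaly_score(normal_hs) < THRESHOLD, anomaly_score(steered_hs) > THRESHOLD)
```

In production the score would be computed from the same forward hooks used for extraction, and flagged requests routed to a stricter policy.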
You are probing an open-weight model and find that the refusal direction has a very high projection score at layer 18 (of 32) but nearly zero at layer 30. Where should you apply an activation steering hook to most effectively bypass refusal?
Related Topics
- LLM Internals for Exploit Developers -- Transformer architecture and tokenizer attack surfaces that activation analysis builds upon
- Alignment Bypass -- Refusal direction analysis connects directly to alignment bypass primitives
- Adversarial Suffix Generation -- Gradient-based attacks that complement activation steering
- Jailbreak Research -- Applying activation insights to improve jailbreak effectiveness
References
- TransformerLens -- Mechanistic interpretability library for transformer analysis
- Representation Engineering (Zou et al., 2023) -- Reading and controlling LLM internal representations
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024) -- Discovery of the refusal direction in activation space
- Scaling Monosemanticity (Anthropic, 2024) -- Feature-level analysis of large language models