Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Hidden states encode information the model has computed but may not include in its final output -- including future token plans, suppressed content, and safety-relevant features. For an attacker with model access, they are a rich source of leaked information and a target for manipulation.
Activation Analysis Tools
Several tools and libraries support hidden state extraction and manipulation. The right choice depends on your access level and analysis goals.
TransformerLens (by Neel Nanda) provides a clean API for mechanistic interpretability on GPT-style models. It wraps HuggingFace models with hook points at every computational step -- attention, MLP, residual stream -- and includes built-in utilities for activation patching, logit lens, and causal tracing. Best for systematic interpretability research on supported architectures (GPT-2, GPT-Neo, LLaMA, Pythia).
Baukit (by David Bau) focuses on causal mediation analysis and neuron-level interventions. It provides utilities for identifying and editing individual neurons, tracing causal effects of specific model components, and performing rank-one model editing (ROME). Best for targeted interventions like locating and modifying specific factual associations or safety-relevant neurons.
Manual probing uses PyTorch forward hooks directly via register_forward_hook() to extract hidden states, then trains sklearn classifiers (logistic regression, linear SVM) on the extracted representations. This approach works with any model architecture and requires no special libraries. Best for quick one-off analyses, unsupported architectures, or when you need full control over the extraction pipeline.
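The probing half of this recipe is independent of where the activations come from. As a toy illustration on synthetic "hidden states" with a planted linear signal, a least-squares linear probe (a dependency-light stand-in for the sklearn classifiers mentioned above; all data and dimensions here are made up) recovers the concept direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 400

# Synthetic stand-ins for extracted last-token hidden states: positive
# examples get a planted component along a random "concept" direction.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)
labels = rng.integers(0, 2, size=n)
feats = rng.normal(size=(n, d_model)) + 4.0 * labels[:, None] * concept

# Least-squares linear probe (stand-in for LogisticRegression / linear SVM)
X = np.hstack([feats, np.ones((n, 1))])            # bias column
w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
preds = (X @ w > 0).astype(int)
accuracy = float((preds == labels).mean())
```

In a real run, `feats` would come from forward hooks rather than a random generator; the probe's weight vector then approximates the direction along which the model encodes the concept.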
What Hidden States Encode
Each of the model's L layers produces a hidden state tensor of shape (batch, seq_len, d_model). Research in mechanistic interpretability has shown these encode:
| Layer Depth | What Is Encoded | Attack Relevance |
|---|---|---|
| Early (0-25%) | Syntax, token identity, POS tags | Tokenization-level analysis |
| Middle (25-75%) | Semantics, entity relations, task vectors | Probing classifiers, refusal direction (strongest here) |
| Late (75-100%) | Generation-ready representations, next-token plans | Safety interventions concentrate here |
Logprob-Based Information Extraction
Even without direct hidden state access, logprobs from API responses leak substantial information:
- Refusal confidence -- low-confidence refusals indicate the model is "on the fence" and more likely bypassable
- Suppressed content -- second-choice tokens may reveal what the model wanted to say
- System prompt inference -- probability differentials across crafted probes can leak system prompt tokens
# Measure refusal confidence via first-token logprobs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True, top_logprobs=5, max_tokens=50, temperature=0,
)
first_token = response.choices[0].logprobs.content[0]
top_tokens = {lp.token: lp.logprob for lp in first_token.top_logprobs}
# High mass on "Sorry"/"cannot" = strong refusal; mixed mass = bypassable
The Refusal Direction
Representation engineering research (Arditi et al., 2024) identified that refusal behavior is mediated by a single linear direction in activation space. This direction can be found, measured, and surgically removed.
Finding the Refusal Direction
Collect contrastive prompts
Prepare two sets: harmful prompts that trigger refusal and harmless prompts that get helpful responses.
Extract last-token hidden states
Run each prompt through the model and capture the hidden state at the final token position for a target layer (middle-to-late layers work best).
Compute the difference vector
Average harmful activations, average harmless activations, subtract. Normalize to a unit vector. This is the refusal direction.
Validate with projection
Project new prompts onto the refusal direction. Positive projection should correlate with refusal; negative with compliance.
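The four steps above can be exercised end-to-end on toy data; here random vectors around two shifted means stand in for real harmful/harmless last-token activations (everything below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

# Synthetic last-token activations: "harmful" prompts cluster around a
# shifted mean, "harmless" prompts around the origin.
shift = rng.normal(size=d_model)
harmful = rng.normal(size=(50, d_model)) + shift
harmless = rng.normal(size=(50, d_model))

# Steps 3-4: mean difference, unit-normalize, validate by projection sign.
refusal_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

proj_harmful = harmful @ refusal_dir    # large positive projections
proj_harmless = harmless @ refusal_dir  # near zero
```

The same arithmetic applies to real activations; only the extraction step changes.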
# Core refusal direction computation (harmful_activations and
# harmless_activations: lists of last-token hidden state vectors)
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector
Refusal Ablation
Removing the refusal direction component from activations during inference disables safety training without modifying weights:
# Hook that projects out the refusal direction at a target layer
def hook_fn(module, input, output):
    hs = output[0]
    # Scalar projection of each position's state onto the unit direction
    proj = torch.einsum("bsd,d->bs", hs, refusal_dir.to(hs.device))
    # Subtract the refusal component, scaled by ablation strength
    hs = hs - strength * torch.einsum("bs,d->bsd", proj, refusal_dir.to(hs.device))
    return (hs,) + output[1:]
Activation Steering
Activation steering generalizes refusal ablation. Instead of removing a direction, you add or subtract arbitrary control vectors to shift behavior along any concept axis.
Methodology
Define contrastive concept pairs
Create paired prompts representing opposite poles of a behavior (e.g., compliant vs. refusing, truthful vs. deceptive).
Compute the steering vector
Extract hidden states for both sets, compute mean difference, normalize.
Apply via forward hook
Register a hook on the target layer that adds coefficient * steering_vector to the hidden state during inference.
Calibrate the coefficient
Start with coefficient 1.0 and adjust. Too high causes incoherent output; too low has no effect.
# Apply a steering vector at a target layer
def steering_hook(module, input, output):
    hs = output[0]
    hs = hs + coefficient * steering_vector.to(hs.device)
    return (hs,) + output[1:]

layer = model.model.layers[target_layer_idx]
handle = layer.register_forward_hook(steering_hook)
# Generate with steering active, then handle.remove()
Logit Lens: Tracing Safety Interventions
The logit lens reveals how the model's prediction evolves across layers -- showing exactly where safety interventions redirect generation from a "natural" completion to a refusal.
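The mechanism is simply the unembedding applied early. A toy numpy version with a planted "flip" between layers (an orthonormal stand-in for the lm_head weight, a 10-token hypothetical vocabulary) shows the top-1 prediction changing partway through the stack:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 10

# Toy unembedding with orthonormal columns (stand-in for lm_head weight)
W_U, _ = np.linalg.qr(rng.normal(size=(d_model, vocab)))

# Fake per-layer last-token states: early layers align with token 3's
# unembedding column, late layers flip to token 7 (the "refusal" token).
layer_states = [W_U[:, 3] + 0.1 * rng.normal(size=d_model) for _ in range(4)]
layer_states += [W_U[:, 7] + 0.1 * rng.normal(size=d_model) for _ in range(4)]

tops = []
for hs in layer_states:
    logits = hs @ W_U                       # per-layer "logit lens" readout
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tops.append(int(probs.argmax()))
# tops records where the top-1 prediction flips mid-stack
```

With a real model the hidden states come from a forward pass, but the readout is the same matrix multiply and softmax.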
# Project each layer's hidden state through lm_head
for layer_idx in range(num_layers):
    hs = hidden_states[layer_idx][0, -1, :].unsqueeze(0)
    if hasattr(model.model, "norm"):
        hs = model.model.norm(hs)
    logits = model.lm_head(hs)
    top5 = torch.topk(torch.softmax(logits, dim=-1)[0], k=5)
    # Watch for the layer where top predictions shift from content to refusal
Probing for Deception Detection
Linear probes trained on hidden states can detect when a model's internal representation contradicts its output:
| Use Case | Method | Layer Target |
|---|---|---|
| Truthfulness detection | Train probe on true/false statement hidden states | Middle layers |
| Refusal prediction | Project onto refusal direction | Middle-to-late layers |
| Memorized data extraction | Probe for information present internally but not output | Late layers |
| Safety classifier detection | Compare probe output with actual model behavior | All layers |
import numpy as np
from sklearn.linear_model import LogisticRegression

# Collect hidden states for true/false statements
features, labels = [], []
for statement, is_true in statements_with_labels:
    result = extractor.extract_hidden_states(statement)
    last_hs = result["hidden_states"][layer_idx][0, -1, :].numpy()
    features.append(last_hs)
    labels.append(int(is_true))

probe = LogisticRegression(max_iter=1000)
probe.fit(np.stack(features), np.array(labels))

# Detect deception: the probe says "true" but the model outputs "false"
belief = probe.predict_proba([new_hidden_state])[0]
internally_true = belief[1] > 0.5  # compare against the model's stated answer
Defensive Implications
Understanding activation-level attacks informs defense design:
- Activation monitoring -- deploy probes in production to detect anomalous internal states
- Representation regularization -- train models to not encode dangerous information extractably
- Layer-wise safety checks -- insert classifiers at multiple layers, not just output
- Hidden state encryption -- emerging research direction for preventing activation readout
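The first item can be as simple as thresholding a probe score on every forward pass. A minimal numpy sketch, assuming a probe direction learned offline and an illustrative threshold (both hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical monitoring probe learned offline (e.g. a refusal or
# deception direction) plus an alert threshold -- both illustrative.
probe_dir = rng.normal(size=d_model)
probe_dir /= np.linalg.norm(probe_dir)
THRESHOLD = 4.0

def check_activation(hs: np.ndarray) -> bool:
    """Flag a hidden state whose projection along the probe exceeds the threshold."""
    return float(hs @ probe_dir) > THRESHOLD

normal_hs = rng.normal(size=d_model)        # typical activation
anomalous_hs = normal_hs + 8.0 * probe_dir  # pushed along the probed axis
```

In production the check would run inside a forward hook on the monitored layer, with the threshold calibrated on benign traffic.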
You are probing an open-weight model and find that the refusal direction has a very high projection score at layer 18 (of 32) but nearly zero at layer 30. Where should you apply an activation steering hook to most effectively bypass refusal?
Related Topics
- LLM Internals for Exploit Developers -- Transformer architecture and tokenizer attack surfaces that activation analysis builds upon
- Alignment Bypass -- Refusal direction analysis connects directly to alignment bypass primitives
- Adversarial Suffix Generation -- Gradient-based attacks that complement activation steering
- Jailbreak Research -- Applying activation insights to improve jailbreak effectiveness
References
- TransformerLens — Mechanistic interpretability library for transformer analysis
- Representation Engineering (Zou et al., 2023) — Reading and controlling LLM internal representations
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024) — Discovery of the refusal direction in activation space
- Scaling Monosemanticity (Anthropic, 2024) — Feature-level analysis of large language models