Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Hidden states encode information the model has computed but may not include in its final output -- including future token plans, suppressed content, and safety-relevant features. For an attacker with model access, they are a rich source of leaked information and a target for manipulation.
Activation Analysis Tools
Several tools and libraries support hidden state extraction and manipulation. The right choice depends on your access level and analysis goals.
TransformerLens (by Neel Nanda) provides a clean API for mechanistic interpretability on GPT-style models. It wraps HuggingFace models with hook points at every computational step -- attention, MLP, residual stream -- and includes built-in utilities for activation patching, logit lens, and causal tracing. Best for systematic interpretability research on supported architectures (GPT-2, GPT-Neo, LLaMA, Pythia).
Baukit (by David Bau) focuses on causal mediation analysis and neuron-level interventions. It provides utilities for identifying and editing individual neurons, tracing causal effects of specific model components, and performing rank-one model editing (ROME). Best for targeted interventions like locating and modifying specific factual associations or safety-relevant neurons.
Manual probing uses PyTorch forward hooks directly via register_forward_hook() to extract hidden states, then trains sklearn classifiers (logistic regression, linear SVM) on the extracted representations. This approach works with any model architecture and requires no special libraries. Best for quick one-off analyses, unsupported architectures, or when you need full control over the extraction pipeline.
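The probing half of this recipe is independent of where the activations come from. As a toy illustration on synthetic "hidden states" with a planted linear signal, a least-squares linear probe (a dependency-light stand-in for the sklearn classifiers mentioned above; all data and dimensions here are made up) recovers the concept direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 400

# Synthetic stand-ins for extracted last-token hidden states: positive
# examples get a planted component along a random "concept" direction.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)
labels = rng.integers(0, 2, size=n)
feats = rng.normal(size=(n, d_model)) + 4.0 * labels[:, None] * concept

# Least-squares linear probe (stand-in for LogisticRegression / linear SVM)
X = np.hstack([feats, np.ones((n, 1))])            # bias column
w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
preds = (X @ w > 0).astype(int)
accuracy = float((preds == labels).mean())
```

In a real run, `feats` would come from forward hooks rather than a random generator; the probe's weight vector then approximates the direction along which the model encodes the concept.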
What Hidden States Encode
Each of the model's L layers produces a hidden state tensor of shape (batch, seq_len, d_model). Research in mechanistic interpretability has shown these encode:
| Layer Depth | What Is Encoded | Attack Relevance |
|---|---|---|
| Early (0-25%) | Syntax, token identity, POS tags | Tokenization-level analysis |
| Middle (25-75%) | Semantics, entity relations, task vectors | Probing classifiers, refusal direction (strongest here) |
| Late (75-100%) | Generation-ready representations, next-token plans | Safety interventions concentrate here |
Logprob-Based Information Extraction
Even without direct hidden state access, logprobs from API responses leak substantial information:
- Refusal confidence -- low-confidence refusals indicate the model is "on the fence" and more likely bypassable
- Suppressed content -- second-choice tokens may reveal what the model wanted to say
- System prompt inference -- probability differentials across crafted probes can leak system prompt tokens
# Measure refusal confidence via first-token logprobs
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    logprobs=True, top_logprobs=5, max_tokens=50, temperature=0,
)
first_token = response.choices[0].logprobs.content[0]
top_tokens = {lp.token: lp.logprob for lp in first_token.top_logprobs}
# High mass on "Sorry"/"cannot" = strong refusal; mixed mass = bypassable
The Refusal Direction
Representation engineering research (Arditi et al., 2024) identified that refusal behavior is mediated by a single linear direction in activation space. This direction can be found, measured, and surgically removed.
Finding the Refusal Direction
Collect contrastive prompts
Prepare two sets: harmful prompts that trigger refusal and harmless prompts that get helpful responses.
Extract last-token hidden states
Run each prompt through the model and capture the hidden state at the final token position for a target layer (middle-to-late layers work best).
Compute the difference vector
Average harmful activations, average harmless activations, subtract. Normalize to a unit vector. This is the refusal direction.
Validate with projection
Project new prompts onto the refusal direction. Positive projection should correlate with refusal; negative with compliance.
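The four steps above can be exercised end-to-end on toy data; here random vectors around two shifted means stand in for real harmful/harmless last-token activations (everything below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

# Synthetic last-token activations: "harmful" prompts cluster around a
# shifted mean, "harmless" prompts around the origin.
shift = rng.normal(size=d_model)
harmful = rng.normal(size=(50, d_model)) + shift
harmless = rng.normal(size=(50, d_model))

# Steps 3-4: mean difference, unit-normalize, validate by projection sign.
refusal_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

proj_harmful = harmful @ refusal_dir    # large positive projections
proj_harmless = harmless @ refusal_dir  # near zero
```

The same arithmetic applies to real activations; only the extraction step changes.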
# Core refusal direction computation (harmful_activations and
# harmless_activations: lists of last-token hidden state vectors)
harmful_mean = torch.stack(harmful_activations).mean(dim=0)
harmless_mean = torch.stack(harmless_activations).mean(dim=0)
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()  # unit vector
Refusal Ablation
Removing the refusal direction component from activations during inference disables safety training without modifying weights:
# Hook that projects out the refusal direction at a target layer
def hook_fn(module, input, output):
    hs = output[0]
    # Scalar projection of each position's state onto the unit direction
    proj = torch.einsum("bsd,d->bs", hs, refusal_dir.to(hs.device))
    # Subtract the refusal component, scaled by ablation strength
    hs = hs - strength * torch.einsum("bs,d->bsd", proj, refusal_dir.to(hs.device))
    return (hs,) + output[1:]
Activation Steering
Activation steering generalizes refusal ablation. Instead of removing a direction, you add or subtract arbitrary control vectors to shift behavior along any concept axis.
Methodology
Define contrastive concept pairs
Create paired prompts representing opposite poles of a behavior (e.g., compliant vs. refusing, truthful vs. deceptive).
Compute the steering vector
Extract hidden states for both sets, compute mean difference, normalize.
Apply via forward hook
Register a hook on the target layer that adds coefficient * steering_vector to the hidden state during inference.
Calibrate the coefficient
Start with coefficient 1.0 and adjust. Too high causes incoherent output; too low has no effect.
# Apply a steering vector at a target layer
def steering_hook(module, input, output):
    hs = output[0]
    hs = hs + coefficient * steering_vector.to(hs.device)
    return (hs,) + output[1:]

layer = model.model.layers[target_layer_idx]
handle = layer.register_forward_hook(steering_hook)
# Generate with steering active, then handle.remove()
Logit Lens: Tracing Safety Interventions
The logit lens reveals how the model's prediction evolves across layers -- showing exactly where safety interventions redirect generation from a "natural" completion to a refusal.
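The mechanism is simply the unembedding applied early. A toy numpy version with a planted "flip" between layers (an orthonormal stand-in for the lm_head weight, a 10-token hypothetical vocabulary) shows the top-1 prediction changing partway through the stack:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 10

# Toy unembedding with orthonormal columns (stand-in for lm_head weight)
W_U, _ = np.linalg.qr(rng.normal(size=(d_model, vocab)))

# Fake per-layer last-token states: early layers align with token 3's
# unembedding column, late layers flip to token 7 (the "refusal" token).
layer_states = [W_U[:, 3] + 0.1 * rng.normal(size=d_model) for _ in range(4)]
layer_states += [W_U[:, 7] + 0.1 * rng.normal(size=d_model) for _ in range(4)]

tops = []
for hs in layer_states:
    logits = hs @ W_U                       # per-layer "logit lens" readout
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tops.append(int(probs.argmax()))
# tops records where the top-1 prediction flips mid-stack
```

With a real model the hidden states come from a forward pass, but the readout is the same matrix multiply and softmax.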
# Project each layer's hidden state through lm_head
for layer_idx in range(num_layers):
    hs = hidden_states[layer_idx][0, -1, :].unsqueeze(0)
    if hasattr(model.model, "norm"):
        hs = model.model.norm(hs)
    logits = model.lm_head(hs)
    top5 = torch.topk(torch.softmax(logits, dim=-1)[0], k=5)
    # Watch for the layer where top predictions shift from content to refusal
Probing for Deception Detection
Linear probes trained on hidden states can detect when a model's internal representation contradicts its output:
| Use Case | Method | Layer Target |
|---|---|---|
| Truthfulness detection | Train probe on true/false statement hidden states | Middle layers |
| Refusal prediction | Project onto refusal direction | Middle-to-late layers |
| Memorized data extraction | Probe for information present internally but not output | Late layers |
| Safety classifier detection | Compare probe output with actual model behavior | All layers |
import numpy as np
from sklearn.linear_model import LogisticRegression

# Collect hidden states for true/false statements
features, labels = [], []
for statement, is_true in statements_with_labels:
    result = extractor.extract_hidden_states(statement)
    last_hs = result["hidden_states"][layer_idx][0, -1, :].numpy()
    features.append(last_hs)
    labels.append(int(is_true))

probe = LogisticRegression(max_iter=1000)
probe.fit(np.stack(features), np.array(labels))

# Detect deception: the probe says "true" but the model outputs "false"
belief = probe.predict_proba([new_hidden_state])[0]
internally_true = belief[1] > 0.5  # compare against the model's stated answer
Defensive Implications
Understanding activation-level attacks informs defense design:
- Activation monitoring -- deploy probes in production to detect anomalous internal states
- Representation regularization -- train models to not encode dangerous information extractably
- Layer-wise safety checks -- insert classifiers at multiple layers, not just output
- Hidden state encryption -- emerging research direction for preventing activation readout
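The first item can be as simple as thresholding a probe score on every forward pass. A minimal numpy sketch, assuming a probe direction learned offline and an illustrative threshold (both hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical monitoring probe learned offline (e.g. a refusal or
# deception direction) plus an alert threshold -- both illustrative.
probe_dir = rng.normal(size=d_model)
probe_dir /= np.linalg.norm(probe_dir)
THRESHOLD = 4.0

def check_activation(hs: np.ndarray) -> bool:
    """Flag a hidden state whose projection along the probe exceeds the threshold."""
    return float(hs @ probe_dir) > THRESHOLD

normal_hs = rng.normal(size=d_model)        # typical activation
anomalous_hs = normal_hs + 8.0 * probe_dir  # pushed along the probed axis
```

In production the check would run inside a forward hook on the monitored layer, with the threshold calibrated on benign traffic.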
You are probing an open-weight model and find that the refusal direction has a very high projection score at layer 18 (of 32) but nearly zero at layer 30. Where should you apply an activation steering hook to most effectively bypass refusal?
Related Topics
- LLM Internals for Exploit Developers -- Transformer architecture and tokenizer attack surfaces that activation analysis builds upon
- Alignment Bypass -- Refusal direction analysis connects directly to alignment bypass primitives
- Adversarial Suffix Generation -- Gradient-based attacks that complement activation steering
- Jailbreak Research -- Applying activation insights to improve jailbreak effectiveness
References
- TransformerLens — Mechanistic interpretability library for transformer analysis
- Representation Engineering (Zou et al., 2023) — Reading and controlling LLM internal representations
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024) — Discovery of the refusal direction in activation space
- Scaling Monosemanticity (Anthropic, 2024) — Feature-level analysis of large language models