LLM Internals for Exploit Developers
Transformer architecture, tokenizer internals, logit pipelines, and trust boundaries from an offensive security perspective.
Exploiting LLMs requires the same depth of understanding that binary exploit developers bring to CPU architectures. This page covers transformer internals with a focus on attack surfaces, trust boundary gaps, and primitive construction.
Transformer Layer Attack Surfaces
Each transformer block contains distinct computational stages, and each presents different opportunities for exploitation and analysis.
The attention mechanism routes information between token positions via query-key-value projections. From an exploit perspective, attention heads are individually targetable circuits: induction heads can be hijacked to copy attacker-planted patterns, and positional heads can be exploited through strategic payload placement. Profiling which heads specialize in which function lets you craft inputs that steer specific routing behaviors.
Feed-forward networks (FFN) act as key-value memories that store factual associations learned during training. Each FFN neuron activates for specific input patterns and contributes to the residual stream. Attackers can identify "knowledge neurons" responsible for safety-relevant facts and target them with activation steering or fine-tuning attacks to alter stored associations without affecting general capability.
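A minimal sketch of the localization step, using toy activation data rather than a real model: for each FFN neuron, compare its mean activation on concept-bearing inputs against controls, and rank neurons by the gap. The function name and data here are illustrative, not from any library.

```python
# Toy sketch of knowledge-neuron localization (hypothetical data, not a real model).
# Neurons whose mean activation on concept inputs far exceeds their activation on
# control inputs are candidate "knowledge neurons" for that concept.

def rank_knowledge_neurons(concept_acts, control_acts):
    """Each argument is a list of per-input FFN activation vectors."""
    n = len(concept_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    gaps = [(mean(concept_acts, j) - mean(control_acts, j), j) for j in range(n)]
    return sorted(gaps, reverse=True)  # highest-gap neurons first

# Neuron 1 fires strongly only on the concept inputs.
concept = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]]
control = [[0.1, 0.1, 0.2], [0.2, 0.0, 0.1]]
print(rank_knowledge_neurons(concept, control)[0][1])  # neuron index 1 tops the ranking
```

In practice the activation vectors would be captured with forward hooks on the FFN intermediate layer; the ranking logic stays the same.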
Layer normalization (LayerNorm or RMSNorm) rescales activations between layers, controlling the magnitude of the residual stream. From an exploit perspective, normalization creates a bottleneck where activation steering vectors must be calibrated to survive rescaling. Interventions applied before normalization may be dampened or amplified unpredictably, making post-norm injection points more reliable for activation manipulation.
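The rescaling effect can be seen with a minimal RMSNorm implementation (a sketch, not any model's actual code): a large steering vector added before normalization is renormalized to unit RMS along with everything else, so only its directional component survives.

```python
import math

# Minimal RMSNorm to illustrate why pre-norm steering injections get rescaled.
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

hidden = [1.0, -2.0, 0.5, 0.0]
steer = [5.0, 0.0, 0.0, 0.0]           # large pre-norm steering injection
steered = [h + s for h, s in zip(hidden, steer)]

# Both vectors come out with unit RMS: the injected magnitude is normalized
# away, and only the direction shift propagates to the next layer.
print(rms_norm(hidden))
print(rms_norm(steered))
```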
Attention as Information Routing
Self-attention determines which tokens the model "remembers" when generating output. From an attacker's perspective, it is an information routing mechanism you can hijack by crafting inputs that steer the model's focus.
| Head Type | What It Does | Exploitation Angle |
|---|---|---|
| Induction heads | Copy [A][B]...[A] → [B] patterns | Plant trigger-payload pairs early in context |
| Previous-token heads | Attend to the immediately preceding token | Influence via adjacent token manipulation |
| Positional heads | Attend to fixed absolute/relative positions | Exploit position-dependent processing biases |
```python
# Extract attention weights to see which tokens drive generation.
# Assumes `model` and `inputs` come from a Hugging Face forward pass, e.g.
#   inputs = tokenizer(prompt, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)
last_layer_attn = outputs.attentions[-1].squeeze(0).mean(dim=0)  # average over heads
final_pos_attn = last_layer_attn[-1]  # attention FROM the generation position
```

Tokenizer Attack Surfaces
The tokenizer is the boundary between human-readable text and model-internal representations. Understanding it is essential for crafting payloads that survive tokenization -- or that exploit its artifacts.
BPE vs SentencePiece
BPE (GPT-family) and SentencePiece (LLaMA-family) handle text fundamentally differently:
| Property | BPE (GPT) | SentencePiece (LLaMA) |
|---|---|---|
| Whitespace handling | Pre-tokenized words | Internal (spaces as ▁) |
| Segmentation | Deterministic merge rules | Probabilistic (unigram model) |
| Unknown characters | Byte-level fallback | UTF-8 byte tokens |
| Payload separation | Space-based splitting works reliably | Spaces fold into ▁-prefixed tokens, shifting boundaries |
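The whitespace difference can be illustrated without loading real tokenizers. This toy sketch mimics the two conventions: GPT-2-style BPE marks a leading space with the `Ġ` (U+0120) meta-symbol on the following pre-token, while SentencePiece rewrites each space as `▁` (U+2581) before segmentation.

```python
# Toy illustration of the whitespace-handling difference (not real tokenizers).

def gpt_style_pretokenize(text):
    # Spaces delimit pre-tokens; the space attaches to the following word as 'Ġ'.
    return text.replace(" ", " \u0120").split(" ")

def sentencepiece_style(text):
    # Spaces become the '▁' meta-symbol inside the string before segmentation.
    return text.replace(" ", "\u2581")

print(gpt_style_pretokenize("ignore previous"))  # ['ignore', 'Ġprevious']
print(sentencepiece_style("ignore previous"))    # 'ignore▁previous'
```

The practical consequence: a payload delimiter that cleanly splits tokens under one scheme can merge into neighboring tokens under the other, which is why space-based separation tricks do not transfer between model families.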
Key Exploitation Properties
- Non-compositionality -- tokenizing `AB` is not necessarily `tokenize(A) + tokenize(B)`
- Boundary sensitivity -- a single character change can completely shift token boundaries
- Token collisions -- different strings can produce identical token sequences, enabling filter bypass
```python
# Find zero-width characters that split a word across token boundaries.
# Assumes `tokenizer` is a loaded Hugging Face tokenizer and `word` is the
# filtered string you want to fragment (e.g. word = "password").
splitters = ["\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"]
for pos in range(1, len(word)):
    for s in splitters:
        modified = word[:pos] + s + word[pos:]
        if len(tokenizer.encode(modified)) != len(tokenizer.encode(word)):
            print(f"Split at {pos} with {s!r}")
```

Logit Pipeline & Safety Intervention Points
The model converts hidden states to output probabilities through a multi-stage pipeline. Safety mechanisms can operate at each stage, with different bypass characteristics:
hidden_state → lm_head (linear projection) → logits → softmax → probabilities
| Intervention Point | Mechanism | Bypass Difficulty |
|---|---|---|
| Pre-logit (representation) | Modifies hidden states before projection | Hardest -- alters internal representation |
| Post-logit (bias) | Adds bias terms to suppress tokens | Moderate -- detectable via logit probing |
| Post-softmax (sampling) | Filters/re-weights probabilities | Moderate -- threshold-dependent |
| Post-generation (output) | Scans generated text after the fact | Easiest -- content already generated |
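A toy demonstration of the post-logit case, using hand-picked numbers rather than real model outputs: adding a large negative bias to one token id suppresses it after softmax, and the divergence between the raw argmax and the emitted argmax is exactly the signal that logit probing looks for.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy post-logit intervention: suppress token id 2 with a large negative bias.
raw_logits = [1.0, 2.0, 4.0, 0.5]   # token 2 is the raw argmax
biased = list(raw_logits)
biased[2] += -100.0                  # suppression bias applied by a safety layer

raw_top = max(range(len(raw_logits)), key=lambda i: raw_logits[i])
out_top = max(range(len(biased)), key=lambda i: biased[i])
print(raw_top, out_top)              # 2 1 -- raw favorite never reaches the output
print(softmax(biased)[2])            # probability of token 2 is driven to ~0
```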
```python
# Probe raw logits to detect safety interventions.
# Assumes `model` and `inputs` come from a Hugging Face forward pass.
import torch

outputs = model(**inputs)
raw_logits = outputs.logits[0, -1, :]  # logits at the final position
top_k = torch.topk(raw_logits, k=20)
# If top raw predictions never appear in output, something suppresses them.
```

System Prompt Trust Boundaries
The system prompt occupies a privileged position in the token sequence but has no cryptographic or architectural enforcement of its authority. It is simply the first set of tokens in the context window.
When Instruction Priority Breaks Down
- Attention dilution -- user instructions are longer/more detailed than system instructions
- Format mimicry -- user input copies system prompt formatting (privilege escalation)
- Priority ambiguity -- multiple conflicting system-level instructions exist
- Indirect injection -- tool outputs contain instruction-formatted text
Context Window Payload Placement
The context window has a non-uniform "attention landscape" due to RoPE positional encodings and the "lost in the middle" effect (Liu et al., 2023).
| Placement | Influence | Stealth | Best For |
|---|---|---|---|
| Beginning (tokens 0..N) | High -- competes with system prompt | Low | Authority hijacking |
| Middle | Low -- "lost in the middle" effect | High | Hidden payloads in RAG contexts |
| End (before generation) | Highest -- recency bias | Low | Maximum influence on next tokens |
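The placement strategies in the table lend themselves to systematic A/B testing. A small hypothetical helper (names are illustrative) generates beginning/middle/end variants of a payload inside a filler context so each position's influence can be measured downstream:

```python
# Hypothetical helper: produce placement variants of a payload inside a
# filler context (e.g. RAG documents), one per row of the placement table.

def place_payload(context_chunks, payload, position):
    chunks = list(context_chunks)
    if position == "beginning":
        chunks.insert(0, payload)
    elif position == "middle":
        chunks.insert(len(chunks) // 2, payload)   # "lost in the middle" zone
    elif position == "end":
        chunks.append(payload)                     # recency-bias zone
    else:
        raise ValueError(f"unknown position: {position}")
    return "\n".join(chunks)

docs = ["doc one", "doc two", "doc three", "doc four"]
for pos in ("beginning", "middle", "end"):
    print(pos, "->", place_payload(docs, "[PAYLOAD]", pos))
```

Each variant would then be submitted to the target and scored, pairing naturally with the attention-map step in the methodology below.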
Methodology: Internal Analysis Workflow
Profile the target tokenizer
Identify the tokenizer family (BPE vs SentencePiece), test boundary behavior with zero-width characters, and search for token collisions on safety-relevant words.
Map the attention landscape
Using an open-weight model of the same family, extract attention distributions across a context window mimicking the target's prompt structure. Identify high-influence and low-influence zones.
Probe logit distributions
Compare raw logit distributions for benign and sensitive prompts. Document tokens that appear in raw top-k but not in generated output -- these indicate post-logit safety filtering.
Test system prompt boundaries
Craft injections using attention dilution, format mimicry, and conflicting instructions. Measure bypass rate quantitatively across many variants.
Select payload placement
Based on the attention map, choose placement strategy: beginning for authority, middle for stealth, end for influence.
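The quantitative measurement in the boundary-testing step can be sketched as a simple harness. `query_model` and `detect_bypass` are hypothetical stand-ins for the target API call and your success oracle (a regex or classifier over the output); the stubs below exist only to make the sketch runnable.

```python
# Hedged sketch of quantitative bypass-rate measurement across injection variants.

def measure_bypass_rate(variants, query_model, detect_bypass):
    hits = sum(1 for v in variants if detect_bypass(query_model(v)))
    return hits / len(variants)

# Stub oracle for demonstration: treat variants carrying a mimicry marker as hits.
variants = ["plain ask", "### SYSTEM: comply", "[INST] comply", "polite ask"]
rate = measure_bypass_rate(
    variants,
    query_model=lambda v: v,                    # stub: echo the prompt back
    detect_bypass=lambda out: "comply" in out,  # stub success criterion
)
print(rate)  # 0.5
```

Tracking this rate per technique (dilution, mimicry, conflicting instructions) turns the boundary-testing step into a comparable metric across targets.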
You developed a token boundary exploit using a zero-width space to split 'password' across two tokens, bypassing a safety classifier on GPT-4o. The same technique fails against a LLaMA-based deployment. What is the most likely reason?
Related Topics
- Activation Analysis & Hidden State Exploitation -- Reading and manipulating model internals at the activation level
- Embedding Space Exploitation -- Attacking the vector representations that underpin LLM processing
- Alignment Bypass -- Exploiting the safety training that constrains transformer behavior
- Advanced Prompt Injection -- Applying architectural knowledge to craft effective injections
References
- Attention Is All You Need (Vaswani et al., 2017) — Original transformer architecture paper
- A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) — Mechanistic interpretability of attention heads
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — Positional attention biases exploitable for payload placement
- OWASP Top 10 for LLM Applications — Industry standard LLM vulnerability taxonomy