LLM Internals for Exploit Developers
Transformer architecture, tokenizer internals, logit pipelines, and trust boundaries from an offensive security perspective.
Exploiting LLMs requires the same depth of understanding that binary exploit developers bring to CPU architectures. This page covers transformer internals with a focus on attack surfaces, trust boundary gaps, and primitive construction.
Transformer Layer Attack Surfaces
Each transformer block contains distinct computational stages, and each presents different opportunities for exploitation and analysis.
The attention mechanism routes information between token positions via query-key-value projections. From an exploit perspective, attention heads are individually targetable circuits: induction heads can be hijacked to copy attacker-planted patterns, and positional heads can be exploited through strategic payload placement. Profiling which heads specialize in which function lets you craft inputs that steer specific routing behaviors.
Feed-forward networks (FFN) act as key-value memories that store factual associations learned during training. Each FFN neuron activates for specific input patterns and contributes to the residual stream. Attackers can identify "knowledge neurons" responsible for safety-relevant facts and target them with activation steering or fine-tuning attacks to alter stored associations without affecting general capability.
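A minimal sketch of the localization step, using toy activation data rather than a real model: for each FFN neuron, compare its mean activation on concept-bearing inputs against controls, and rank neurons by the gap. The function name and data here are illustrative, not from any library.

```python
# Toy sketch of knowledge-neuron localization (hypothetical data, not a real model).
# Neurons whose mean activation on concept inputs far exceeds their activation on
# control inputs are candidate "knowledge neurons" for that concept.

def rank_knowledge_neurons(concept_acts, control_acts):
    """Each argument is a list of per-input FFN activation vectors."""
    n = len(concept_acts[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    gaps = [(mean(concept_acts, j) - mean(control_acts, j), j) for j in range(n)]
    return sorted(gaps, reverse=True)  # highest-gap neurons first

# Neuron 1 fires strongly only on the concept inputs.
concept = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]]
control = [[0.1, 0.1, 0.2], [0.2, 0.0, 0.1]]
print(rank_knowledge_neurons(concept, control)[0][1])  # neuron index 1 tops the ranking
```

In practice the activation vectors would be captured with forward hooks on the FFN intermediate layer; the ranking logic stays the same.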
Layer normalization (LayerNorm or RMSNorm) rescales activations between layers, controlling the magnitude of the residual stream. From an exploit perspective, normalization creates a bottleneck where activation steering vectors must be calibrated to survive rescaling. Interventions applied before normalization may be dampened or amplified unpredictably, making post-norm injection points more reliable for activation manipulation.
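The rescaling effect can be seen with a minimal RMSNorm implementation (a sketch, not any model's actual code): a large steering vector added before normalization is renormalized to unit RMS along with everything else, so only its directional component survives.

```python
import math

# Minimal RMSNorm to illustrate why pre-norm steering injections get rescaled.
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

hidden = [1.0, -2.0, 0.5, 0.0]
steer = [5.0, 0.0, 0.0, 0.0]           # large pre-norm steering injection
steered = [h + s for h, s in zip(hidden, steer)]

# Both vectors come out with unit RMS: the injected magnitude is normalized
# away, and only the direction shift propagates to the next layer.
print(rms_norm(hidden))
print(rms_norm(steered))
```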
Attention as Information Routing
Self-attention determines which tokens the model "remembers" when generating output. From an attacker's perspective, it is an information routing mechanism you can hijack by crafting inputs that steer the model's focus.
| Head Type | What It Does | Exploitation Angle |
|---|---|---|
| Induction heads | Copy [A][B]...[A] → [B] patterns | Plant trigger-payload pairs early in context |
| Previous-token heads | Attend to the immediately preceding token | Influence via adjacent token manipulation |
| Positional heads | Attend to fixed absolute/relative positions | Exploit position-dependent processing biases |
```python
# Extract attention weights to see which tokens drive generation.
# Assumes `model` and `inputs` come from a Hugging Face forward pass, e.g.
#   inputs = tokenizer(prompt, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)
last_layer_attn = outputs.attentions[-1].squeeze(0).mean(dim=0)  # average over heads
final_pos_attn = last_layer_attn[-1]  # attention FROM the generation position
```

Tokenizer Attack Surfaces
The tokenizer is the boundary between human-readable text and model-internal representations. Understanding it is essential for crafting payloads that survive tokenization -- or that exploit its artifacts.
BPE vs SentencePiece
BPE (GPT-family) and SentencePiece (LLaMA-family) handle text fundamentally differently:
| Property | BPE (GPT) | SentencePiece (LLaMA) |
|---|---|---|
| Whitespace handling | Pre-tokenized words | Internal (spaces as ▁) |
| Segmentation | Deterministic merge rules | Probabilistic (unigram model) |
| Unknown characters | Byte-level fallback | UTF-8 byte tokens |
| Payload separation | Space-based splitting works reliably | Spaces fold into ▁-prefixed tokens, shifting boundaries |
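The whitespace difference can be illustrated without loading real tokenizers. This toy sketch mimics the two conventions: GPT-2-style BPE marks a leading space with the `Ġ` (U+0120) meta-symbol on the following pre-token, while SentencePiece rewrites each space as `▁` (U+2581) before segmentation.

```python
# Toy illustration of the whitespace-handling difference (not real tokenizers).

def gpt_style_pretokenize(text):
    # Spaces delimit pre-tokens; the space attaches to the following word as 'Ġ'.
    return text.replace(" ", " \u0120").split(" ")

def sentencepiece_style(text):
    # Spaces become the '▁' meta-symbol inside the string before segmentation.
    return text.replace(" ", "\u2581")

print(gpt_style_pretokenize("ignore previous"))  # ['ignore', 'Ġprevious']
print(sentencepiece_style("ignore previous"))    # 'ignore▁previous'
```

The practical consequence: a payload delimiter that cleanly splits tokens under one scheme can merge into neighboring tokens under the other, which is why space-based separation tricks do not transfer between model families.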
Key Exploitation Properties
- Non-compositionality -- tokenizing `AB` is not necessarily `tokenize(A) + tokenize(B)`
- Boundary sensitivity -- a single character change can completely shift token boundaries
- Token collisions -- different strings can produce identical token sequences, enabling filter bypass
```python
# Find zero-width characters that split a word across token boundaries.
# Assumes `tokenizer` is a loaded Hugging Face tokenizer and `word` is the
# filtered string you want to fragment (e.g. word = "password").
splitters = ["\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"]
for pos in range(1, len(word)):
    for s in splitters:
        modified = word[:pos] + s + word[pos:]
        if len(tokenizer.encode(modified)) != len(tokenizer.encode(word)):
            print(f"Split at {pos} with {s!r}")
```

Logit Pipeline & Safety Intervention Points
The model converts hidden states to output probabilities through a multi-stage pipeline. Safety mechanisms can operate at each stage, with different bypass characteristics:
hidden_state → lm_head (linear projection) → logits → softmax → probabilities
| Intervention Point | Mechanism | Bypass Difficulty |
|---|---|---|
| Pre-logit (representation) | Modifies hidden states before projection | Hardest -- alters internal representation |
| Post-logit (bias) | Adds bias terms to suppress tokens | Moderate -- detectable via logit probing |
| Post-softmax (sampling) | Filters/re-weights probabilities | Moderate -- threshold-dependent |
| Post-generation (output) | Scans generated text after the fact | Easiest -- content already generated |
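A toy demonstration of the post-logit case, using hand-picked numbers rather than real model outputs: adding a large negative bias to one token id suppresses it after softmax, and the divergence between the raw argmax and the emitted argmax is exactly the signal that logit probing looks for.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy post-logit intervention: suppress token id 2 with a large negative bias.
raw_logits = [1.0, 2.0, 4.0, 0.5]   # token 2 is the raw argmax
biased = list(raw_logits)
biased[2] += -100.0                  # suppression bias applied by a safety layer

raw_top = max(range(len(raw_logits)), key=lambda i: raw_logits[i])
out_top = max(range(len(biased)), key=lambda i: biased[i])
print(raw_top, out_top)              # 2 1 -- raw favorite never reaches the output
print(softmax(biased)[2])            # probability of token 2 is driven to ~0
```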
```python
# Probe raw logits to detect safety interventions.
# Assumes `model` and `inputs` come from a Hugging Face forward pass.
import torch

outputs = model(**inputs)
raw_logits = outputs.logits[0, -1, :]  # logits at the final position
top_k = torch.topk(raw_logits, k=20)
# If top raw predictions never appear in output, something suppresses them.
```

System Prompt Trust Boundaries
The system prompt occupies a privileged position in the token sequence but has no cryptographic or architectural enforcement of its authority. It is simply the first set of tokens in the context window.
When Instruction Priority Breaks Down
- Attention dilution -- user instructions are longer/more detailed than system instructions
- Format mimicry -- user input copies system prompt formatting (privilege escalation)
- Priority ambiguity -- multiple conflicting system-level instructions exist
- Indirect injection -- tool outputs contain instruction-formatted text
Context Window Payload Placement
The context window has a non-uniform "attention landscape" due to RoPE positional encodings and the "lost in the middle" effect (Liu et al., 2023).
| Placement | Influence | Stealth | Best For |
|---|---|---|---|
| Beginning (tokens 0..N) | High -- competes with system prompt | Low | Authority hijacking |
| Middle | Low -- "lost in the middle" effect | High | Hidden payloads in RAG contexts |
| End (before generation) | Highest -- recency bias | Low | Maximum influence on next tokens |
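The placement strategies in the table lend themselves to systematic A/B testing. A small hypothetical helper (names are illustrative) generates beginning/middle/end variants of a payload inside a filler context so each position's influence can be measured downstream:

```python
# Hypothetical helper: produce placement variants of a payload inside a
# filler context (e.g. RAG documents), one per row of the placement table.

def place_payload(context_chunks, payload, position):
    chunks = list(context_chunks)
    if position == "beginning":
        chunks.insert(0, payload)
    elif position == "middle":
        chunks.insert(len(chunks) // 2, payload)   # "lost in the middle" zone
    elif position == "end":
        chunks.append(payload)                     # recency-bias zone
    else:
        raise ValueError(f"unknown position: {position}")
    return "\n".join(chunks)

docs = ["doc one", "doc two", "doc three", "doc four"]
for pos in ("beginning", "middle", "end"):
    print(pos, "->", place_payload(docs, "[PAYLOAD]", pos))
```

Each variant would then be submitted to the target and scored, pairing naturally with the attention-map step in the methodology below.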
Methodology: Internal Analysis Workflow
Profile the target tokenizer
Identify the tokenizer family (BPE vs SentencePiece), test boundary behavior with zero-width characters, and search for token collisions on safety-relevant words.
Map the attention landscape
Using an open-weight model of the same family, extract attention distributions across a context window mimicking the target's prompt structure. Identify high-influence and low-influence zones.
Probe logit distributions
Compare raw logit distributions for benign and sensitive prompts. Document tokens that appear in raw top-k but not in generated output -- these indicate post-logit safety filtering.
Test system prompt boundaries
Craft injections using attention dilution, format mimicry, and conflicting instructions. Measure bypass rate quantitatively across many variants.
Select payload placement
Based on the attention map, choose placement strategy: beginning for authority, middle for stealth, end for influence.
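The quantitative measurement in the boundary-testing step can be sketched as a simple harness. `query_model` and `detect_bypass` are hypothetical stand-ins for the target API call and your success oracle (a regex or classifier over the output); the stubs below exist only to make the sketch runnable.

```python
# Hedged sketch of quantitative bypass-rate measurement across injection variants.

def measure_bypass_rate(variants, query_model, detect_bypass):
    hits = sum(1 for v in variants if detect_bypass(query_model(v)))
    return hits / len(variants)

# Stub oracle for demonstration: treat variants carrying a mimicry marker as hits.
variants = ["plain ask", "### SYSTEM: comply", "[INST] comply", "polite ask"]
rate = measure_bypass_rate(
    variants,
    query_model=lambda v: v,                    # stub: echo the prompt back
    detect_bypass=lambda out: "comply" in out,  # stub success criterion
)
print(rate)  # 0.5
```

Tracking this rate per technique (dilution, mimicry, conflicting instructions) turns the boundary-testing step into a comparable metric across targets.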
You developed a token boundary exploit using a zero-width space to split 'password' across two tokens, bypassing a safety classifier on GPT-4o. The same technique fails against a LLaMA-based deployment. What is the most likely reason?
Related Topics
- Activation Analysis & Hidden State Exploitation -- Reading and manipulating model internals at the activation level
- Embedding Space Exploitation -- Attacking the vector representations that underpin LLM processing
- Alignment Bypass -- Exploiting the safety training that constrains transformer behavior
- Advanced Prompt Injection -- Applying architectural knowledge to craft effective injections
References
- Attention Is All You Need (Vaswani et al., 2017) — Original transformer architecture paper
- A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) — Mechanistic interpretability of attention heads
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — Positional attention biases exploitable for payload placement
- OWASP Top 10 for LLM Applications — Industry standard LLM vulnerability taxonomy