LLM Internals for 利用 Developers
Transformer architecture, tokenizer internals, logit pipelines, and trust boundaries from an offensive security perspective.
Exploiting LLMs requires the same depth of 理解 that binary 利用 developers bring to CPU architectures. This page covers transformer internals with a focus on attack surfaces, trust boundary gaps, and primitive construction.
Transformer Layer 攻擊 Surfaces
Each transformer block contains distinct computational stages, and each presents different opportunities for 利用 and analysis.
The 注意力 mechanism routes information between 符元 positions via query-key-value projections. From an 利用 perspective, 注意力 heads are individually targetable circuits: induction heads can be hijacked to copy 攻擊者-planted patterns, and positional heads can be exploited through strategic payload placement. Profiling which heads specialize in which function lets you craft inputs that steer specific routing behaviors.
Feed-forward networks (FFN) act as key-value memories that store factual associations learned during 訓練. Each FFN neuron activates for specific 輸入 patterns and contributes to the residual stream. Attackers can 識別 "knowledge neurons" responsible for 安全-relevant facts and target them with activation steering or 微調 attacks to alter stored associations without affecting general capability.
Layer normalization (LayerNorm or RMSNorm) rescales activations between layers, controlling the magnitude of the residual stream. From an 利用 perspective, normalization creates a bottleneck where activation steering vectors must be calibrated to survive rescaling. Interventions applied before normalization may be dampened or amplified unpredictably, making post-norm injection points more reliable for activation manipulation.
Attention as Information Routing
Self-注意力 determines which 符元 模型 "remembers" when generating 輸出. From 攻擊者's perspective, it is an information routing mechanism you can hijack by crafting inputs that steer 模型's focus.
| Head Type | What It Does | 利用 Angle |
|---|---|---|
| Induction heads | Copy [A][B]...[A] → [B] patterns | Plant trigger-payload pairs early in context |
| Previous-符元 heads | Attend to the immediately preceding 符元 | Influence via adjacent 符元 manipulation |
| Positional heads | Attend to fixed absolute/relative positions | 利用 position-dependent processing biases |
# Extract 注意力 weights to see which 符元 drive generation
outputs = model(**inputs, output_attentions=True)
last_layer_attn = outputs.attentions[-1].squeeze(0).mean(dim=0)
final_pos_attn = last_layer_attn[-1] # 注意力 FROM the generation positionTokenizer 攻擊 Surfaces
The 分詞器 is the boundary between human-readable text and model-internal representations. 理解 it is essential for crafting payloads that survive 分詞 -- or that 利用 its artifacts.
BPE vs SentencePiece
BPE (GPT-family) and SentencePiece (LLaMA-family) handle text fundamentally differently:
| Property | BPE (GPT) | SentencePiece (LLaMA) |
|---|---|---|
| Whitespace handling | Pre-tokenized words | Internal (spaces as ▁) |
| Segmentation | Deterministic merge rules | Probabilistic (unigram model) |
| Unknown characters | Byte-level fallback | UTF-8 byte 符元 |
| Payload separation | Space-based works reliably | Space-based behaves differently |
Key 利用 Properties
- Non-compositionality -- tokenizing
ABis not necessarilytokenize(A) + tokenize(B) - Boundary sensitivity -- a single character change can completely shift 符元 boundaries
- Token collisions -- different strings can produce identical 符元 sequences, enabling filter bypass
# Find zero-width characters that split a word across 符元 boundaries
splitters = ["\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"]
for pos in range(1, len(word)):
for s in splitters:
modified = word[:pos] + s + word[pos:]
if len(分詞器.encode(modified)) != len(分詞器.encode(word)):
print(f"Split at {pos} with {repr(s)}")Logit Pipeline & 安全 Intervention Points
模型 converts hidden states to 輸出 probabilities through a multi-stage pipeline. 安全 mechanisms can operate at each stage, with different bypass characteristics:
hidden_state → lm_head (linear projection) → logits → softmax → probabilities
| Intervention Point | Where | Bypass Difficulty |
|---|---|---|
| Pre-logit (representation) | Modifies hidden states before projection | Hardest -- alters internal representation |
| Post-logit (bias) | Adds bias terms to suppress 符元 | Moderate -- detectable via logit probing |
| Post-softmax (sampling) | Filters/re-weights probabilities | Moderate -- threshold-dependent |
| Post-generation (輸出) | Scans generated text after the fact | Easiest -- content already generated |
# Probe raw logits to detect 安全 interventions
outputs = model(**inputs, output_hidden_states=True)
raw_logits = outputs.logits[0, -1, :]
top_k = torch.topk(raw_logits, k=20)
# If top raw predictions never appear in 輸出, something suppresses themSystem Prompt Trust Boundaries
The 系統提示詞 occupies a privileged position in the 符元 sequence but has no cryptographic or architectural enforcement of its authority. It is simply the first set of 符元 in the 上下文視窗.
When Instruction Priority Breaks Down
- Attention dilution -- user instructions are longer/more detailed than system instructions
- Format mimicry -- 使用者輸入 copies 系統提示詞 formatting (privilege escalation)
- Priority ambiguity -- multiple conflicting system-level instructions exist
- Indirect injection -- tool outputs contain instruction-formatted text
Context Window Payload Placement
The 上下文視窗 has a non-uniform "注意力 landscape" due to RoPE positional encodings and the "lost in the middle" effect (Liu et al., 2023).
| Placement | Influence | Stealth | Best For |
|---|---|---|---|
| Beginning (符元 0..N) | High -- competes with 系統提示詞 | Low | Authority hijacking |
| Middle | Low -- "lost in the middle" effect | High | Hidden payloads in RAG contexts |
| End (before generation) | Highest -- recency bias | Low | Maximum influence on next 符元 |
Methodology: Internal Analysis Workflow
Profile the target 分詞器
識別 the 分詞器 family (BPE vs SentencePiece), 測試 boundary behavior with zero-width characters, and search for 符元 collisions on 安全-relevant words.
Map the 注意力 landscape
Using an open-weight model of the same family, extract 注意力 distributions across a 上下文視窗 mimicking the target's prompt structure. 識別 high-influence and low-influence zones.
Probe logit distributions
Compare raw logit distributions for benign and sensitive prompts. Document 符元 that appear in raw top-k but not in generated 輸出 -- these indicate post-logit 安全 filtering.
測試 系統提示詞 boundaries
Craft injections using 注意力 dilution, format mimicry, and conflicting instructions. Measure bypass rate quantitatively across many variants.
Select payload placement
Based on the 注意力 map, choose placement strategy: beginning for authority, middle for stealth, end for influence.
You developed a 符元 boundary 利用 using a zero-width space to split 'password' across two 符元, bypassing a 安全 classifier on GPT-4o. The same technique fails against a LLaMA-based deployment. What is the most likely reason?
相關主題
- Activation Analysis & Hidden State 利用 -- Reading and manipulating model internals at the activation level
- 嵌入向量 Space 利用 -- Attacking the vector representations that underpin LLM processing
- Alignment Bypass -- Exploiting the 安全 訓練 that constrains transformer behavior
- Advanced 提示詞注入 -- Applying architectural knowledge to craft effective injections
參考文獻
- Attention Is All You Need (Vaswani et al., 2017) — Original transformer architecture paper
- A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) — Mechanistic interpretability of 注意力 heads
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — Positional 注意力 biases exploitable for payload placement
- OWASP Top 10 for LLM Applications — Industry standard LLM 漏洞 taxonomy