LLM Internals & Exploit Primitives
An overview of large language model architecture from a security researcher's perspective, covering the key components that create exploitable attack surfaces.
Large language models are built on the transformer architecture, a neural network design that processes text as sequences of tokens and uses attention mechanisms to determine how information flows between them. For AI red teamers, understanding these internals is not optional — it is the foundation upon which every exploit technique rests.
Why Internals Matter for Red Teaming
Traditional penetration testers study operating system internals, memory layouts, and protocol specifications. AI red teamers need equivalent depth in LLM architecture. Each component of the transformer pipeline — tokenization, embedding, attention, feed-forward layers, and output generation — introduces distinct attack surfaces.
The Transformer Pipeline
At a high level, every LLM processes input through these stages:
- Tokenization — Raw text is split into subword tokens using algorithms like BPE or SentencePiece. This is where tokenization attacks operate.
- Embedding — Tokens are mapped to high-dimensional vectors. Embedding manipulation targets this layer.
- Attention layers — Self-attention mechanisms route information between token positions. Attention exploitation takes advantage of how models prioritize different parts of the input.
- Feed-forward networks — Each layer contains dense networks that store learned associations.
- Output projection — Hidden states are projected to vocabulary logits, then sampled to produce text.
Input text → Tokenizer → Embeddings → [Attention + FFN] × N layers → Logits → Output tokens
Key Security-Relevant Properties
| Property | Description | Exploit Relevance |
|---|---|---|
| No privilege separation | System prompts and user input share the same token stream | Prompt injection is architecturally possible |
| Statistical processing | All decisions are probabilistic, not rule-based | Safety filters can be bypassed with sufficient optimization |
| Context window limits | Models can only attend to a fixed number of tokens | Attention dilution and context stuffing attacks |
| Autoregressive generation | Each token depends on all previous tokens | Payload placement affects all subsequent generation |
What You Will Learn
This section covers four core areas:
- Tokenization Attacks — How the boundary between human text and model tokens creates exploitable gaps
- Attention Exploitation — Leveraging the attention mechanism to steer model behavior
- Embedding Manipulation — Attacking the vector space where models represent meaning
Each topic builds on the fundamentals introduced here, progressively increasing in complexity. Start with tokenization attacks if you are new to LLM security research.
Related Topics
- How LLMs Work -- foundational transformer architecture and training pipelines
- Alignment Bypass Techniques -- exploiting safety training at the internals level
- Prompt Injection Fundamentals -- applying internals knowledge to practical injection attacks
- Exploit Development -- building reliable exploits from architectural understanding
- Embedding Exploitation (Advanced) -- deep-dive into embedding-layer attacks
References
- Vaswani et al., "Attention Is All You Need" (2017) -- the original transformer architecture paper
- Elhage et al., "A Mathematical Framework for Transformer Circuits" (2021) -- mechanistic interpretability of attention heads
- Carlini et al., "Are aligned neural networks adversarially aligned?" (2023) -- why safety alignment is fragile at the architectural level
- Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (2023) -- taxonomizing architectural failure modes of safety training
Why is prompt injection architecturally possible in transformer-based LLMs?