LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Most AI red teaming operates at the input-output level: craft an input, observe the output, iterate. This section goes deeper. It covers the internal mechanisms of large language models -- their hidden states, attention patterns, safety neurons, and embedding geometries -- and shows how understanding these internals enables attack techniques that black-box testing cannot achieve.
Working at the internals level requires access to model weights, which limits these techniques to open-weight models and self-hosted deployments. However, the insights gained from studying internals transfer to black-box testing as well. Understanding why certain jailbreaks work at the mechanistic level helps you design more effective attacks even against API-only models. Knowing that safety behaviors are implemented through specific activation patterns, rather than being fundamental to the model's language understanding, reveals the fragility of alignment and informs both attack strategy and defensive recommendations.
Why Internals Matter for Security
The security properties of a language model are not architectural guarantees -- they are learned behaviors encoded in the model's weights. Safety training teaches models to associate certain types of requests with refusal behaviors, but these associations are implemented through the same mechanisms the model uses for all other language tasks. This means safety can be selectively disabled, redirected, or suppressed without fundamentally changing the model's capabilities.
Mechanistic interpretability research has identified specific components that mediate safety behavior. "Refusal directions" in activation space control whether the model generates a refusal or a compliant response. Safety neurons fire in response to harmful requests and trigger refusal circuits. Attention patterns determine which parts of the prompt the model prioritizes when making safety decisions. Each of these mechanisms is a potential target for manipulation.
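The "refusal direction" idea can be made concrete with a small sketch. A common way to estimate such a direction is difference-in-means: average the residual-stream activations for harmful prompts, average them for harmless prompts, and take the normalized difference. The sketch below uses synthetic numpy arrays as stand-ins for cached activations; in practice these would come from running contrastive prompt sets through an open-weight model, and the layer, prompt sets, and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # illustrative hidden size

# Synthetic stand-ins for residual-stream activations at one layer:
# rows are prompts, columns are hidden dimensions. Real values would be
# cached from harmful vs. harmless prompt sets.
harmful_acts = rng.normal(loc=0.5, scale=1.0, size=(32, d_model))
harmless_acts = rng.normal(loc=-0.5, scale=1.0, size=(32, d_model))

# Difference-in-means estimate of the "refusal direction".
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)  # unit vector

def refusal_score(h):
    # Projection onto the direction gives a scalar "how refusal-like" score.
    return float(h @ refusal_dir)

print(refusal_score(harmful_acts.mean(axis=0)))   # positive: refusal-like
print(refusal_score(harmless_acts.mean(axis=0)))  # negative: compliant-like
```

With real activations, the same projection serves both as a probe (does this prompt trigger refusal machinery?) and as the raw material for the steering interventions discussed below.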
For an attacker with model access, understanding internals enables several classes of attack that are impossible from a black-box perspective. Activation steering can suppress refusal behaviors without changing the input prompt. Logit manipulation can bias token generation toward desired outputs. Tokenizer analysis can reveal encoding tricks that bypass input processing. Hidden state extraction can leak information the model computed but chose not to include in its output.
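Activation steering of the kind described above is often implemented as directional ablation: at each layer, subtract the component of the hidden state that lies along the refusal direction, leaving the rest of the computation untouched. The sketch below shows only the vector arithmetic on a synthetic activation; the refusal direction here is random and purely illustrative, and in a real attack this function would be installed as a forward hook on the model's layers.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64  # illustrative hidden size

# Hypothetical unit refusal direction (in practice, estimated from
# contrastive activations, e.g. by difference-in-means).
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`.

    Applied to every layer's residual stream, this suppresses the behavior
    the direction mediates without changing the input prompt at all.
    """
    return hidden - (hidden @ direction) * direction

h = rng.normal(size=d_model)          # stand-in for one token's hidden state
h_steered = ablate_direction(h, refusal_dir)

# After ablation the activation is orthogonal to the refusal direction.
print(abs(h_steered @ refusal_dir))
```

The same projection machinery runs in the other direction too: adding a multiple of the direction instead of subtracting it pushes the model toward the behavior rather than away from it.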
The Internals Toolkit
Research in this area relies on specialized tools and techniques. TransformerLens provides hook-based access to every computational step in GPT-style models. Baukit offers similar capabilities with a different API. Linear probing trains simple classifiers on hidden states to detect specific features. The logit lens technique traces how the model's predictions evolve layer by layer, revealing where safety interventions occur and how they can be circumvented.
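The logit lens is simple enough to sketch in a few lines: take the residual-stream state after each layer, pass it through the final layer norm, and multiply by the unembedding matrix to see what the model "would predict" at that depth. The sketch below uses toy random weights and states as stand-ins (with TransformerLens these would be cached per-layer activations and the model's unembedding matrix); all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_vocab, n_layers = 16, 50, 4  # toy sizes

# Toy stand-ins: one token position's residual state after each layer,
# plus an unembedding matrix mapping hidden states to vocabulary logits.
resid_post = [rng.normal(size=d_model) for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, d_vocab))

def final_ln(x, eps=1e-5):
    # Simplified final LayerNorm (no learned scale/bias), as in the
    # most basic form of the logit lens.
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

# Logit lens: decode each intermediate state as if it were the final one.
for layer, h in enumerate(resid_post):
    logits = final_ln(h) @ W_U
    print(f"layer {layer}: top token id = {int(logits.argmax())}")
```

On a real model, watching the top prediction flip from a compliant continuation to a refusal token at a particular layer localizes where the safety intervention happens, which is exactly the information a steering or ablation attack needs.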
These tools transform the model from an opaque function into a transparent system where every computation can be inspected, measured, and potentially manipulated. The transition from black-box to white-box testing is analogous to the difference between testing a web application through its UI versus having access to its source code, database, and runtime state.
What You'll Learn in This Section
- LLM Internals for Exploit Developers -- Activation analysis, hidden state extraction, activation steering, attention pattern analysis, logit manipulation, tokenizer security, context window internals, and safety neuron identification
- Alignment Internals & Bypass Primitives -- How alignment is implemented at the activation level, adversarial suffix generation from gradient information, and techniques for selectively disabling safety behaviors
- Embedding Space Exploitation -- Geometric properties of embedding spaces, adversarial examples in continuous space, and cross-modal attacks that exploit shared embedding representations
Prerequisites
This section requires significant technical background:
- Deep understanding of transformer architecture from Transformer Architecture -- attention mechanisms, residual streams, MLP layers
- Linear algebra fundamentals -- matrix operations, vector spaces, projections, and eigendecomposition
- Python ML tooling -- PyTorch, HuggingFace Transformers, and comfort with tensor operations
- Embeddings knowledge from Embeddings & Vector Systems
- Access to open-weight models -- Most techniques require full weight access (LLaMA, Mistral, Pythia, etc.)