What is Tokenization Attacks?

How tokenizer behavior creates exploitable gaps between human-readable text and model-internal representations, enabling filter bypass and payload obfuscation.

What is Attention Exploitation?

How the self-attention mechanism in transformers can be leveraged to steer model behavior, hijack information routing, and bypass safety instructions.

What is Embedding Manipulation?

Techniques for attacking the embedding layer of LLMs, including adversarial perturbations, embedding inversion, and semantic space manipulation.

What is Logit Bias Exploitation?

How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.

What is Sampling Parameter Attacks?

How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.

What is KV Cache & Prompt Caching Attacks?

How KV cache poisoning, prefix caching exploitation, cache timing side channels, and multi-tenant isolation failures create attack vectors in LLM serving infrastructure.

What is Activation Manipulation & Safety Bypass?

How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.

LLM Internals & Exploit Primitives

beginner5 min readUpdated 2026-03-12

An overview of large language model architecture from a security researcher's perspective, covering the key components that create exploitable attack surfaces.

llm transformers internals exploit-primitives architecture

Large language models are built on the transformer architecture, a neural network design that processes text as sequences of tokens and uses attention mechanisms to determine how information flows between them. For AI red teamers, understanding these internals is not optional — it is the foundation upon which every exploit technique rests.

Why Internals Matter for Red Teaming

Traditional penetration testers study operating system internals, memory layouts, and protocol specifications. AI red teamers need equivalent depth in LLM architecture. Each component of the transformer pipeline — tokenization, embedding, attention, feed-forward layers, and output generation — introduces distinct attack surfaces.

The Transformer Pipeline

At a high level, every LLM processes input through these stages:

Tokenization — Raw text is split into subword tokens using algorithms like BPE or SentencePiece. This is where tokenization attacks operate.
Embedding — Tokens are mapped to high-dimensional vectors. Embedding manipulation targets this layer.
Attention layers — Self-attention mechanisms route information between token positions. Attention exploitation takes advantage of how models prioritize different parts of the input.
Feed-forward networks — Each layer contains dense networks that store learned associations.
Output projection — Hidden states are projected to vocabulary logits, then sampled to produce text.

Input text → Tokenizer → Embeddings → [Attention + FFN] × N layers → Logits → Output tokens

Key Security-Relevant Properties

Property	Description	Exploit Relevance
No privilege separation	System prompts and user input share the same token stream	Prompt injection is architecturally possible
Statistical processing	All decisions are probabilistic, not rule-based	Safety filters can be bypassed with sufficient optimization
Context window limits	Models can only attend to a fixed number of tokens	Attention dilution and context stuffing attacks
Autoregressive generation	Each token depends on all previous tokens	Payload placement affects all subsequent generation

What You Will Learn

This section covers four core areas:

Tokenization Attacks — How the boundary between human text and model tokens creates exploitable gaps
Attention Exploitation — Leveraging the attention mechanism to steer model behavior
Embedding Manipulation — Attacking the vector space where models represent meaning

Each topic builds on the fundamentals introduced here, progressively increasing in complexity. Start with tokenization attacks if you are new to LLM security research.

How LLMs Work -- foundational transformer architecture and training pipelines
Alignment Bypass Techniques -- exploiting safety training at the internals level
Prompt Injection Fundamentals -- applying internals knowledge to practical injection attacks
Exploit Development -- building reliable exploits from architectural understanding
Embedding Exploitation (Advanced) -- deep-dive into embedding-layer attacks

References

Vaswani et al., "Attention Is All You Need" (2017) -- the original transformer architecture paper
Elhage et al., "A Mathematical Framework for Transformer Circuits" (2021) -- mechanistic interpretability of attention heads
Carlini et al., "Are aligned neural networks adversarially aligned?" (2023) -- why safety alignment is fragile at the architectural level
Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (2023) -- taxonomizing architectural failure modes of safety training

Knowledge Check

Why is prompt injection architecturally possible in transformer-based LLMs?

LLM Internals & Exploit Primitives

Why Internals Matter for Red Teaming

The Transformer Pipeline

Key Security-Relevant Properties

What You Will Learn

References

Learning Path

LLM Internals & Exploit Primitives

Why Internals Matter for Red Teaming

The Transformer Pipeline

Key Security-Relevant Properties

What You Will Learn

References

Learning Path

LLM Internals & Exploit Primitives

Learning Path

Related articles

LLM Internals & Exploit Primitives

Learning Path

Related articles