Tokenization-Based Attacks
How tokenizer behavior creates exploitable gaps between human-readable text and model-internal representations, enabling filter bypass and payload obfuscation.
The tokenizer is the first component that processes input to an LLM, converting human-readable text into sequences of integer token IDs. This conversion is lossy, non-intuitive, and model-specific — properties that create a rich attack surface for red teamers.
Tokenizer Families
The two dominant tokenizer families behave very differently: byte-pair encoding (BPE), used by the GPT family, and SentencePiece, used by LLaMA and similar models:
| Property | BPE (GPT family) | SentencePiece (LLaMA family) |
|---|---|---|
| Whitespace | Pre-tokenized by spaces | Encoded as ▁ prefix |
| Unknown chars | Byte-level fallback | UTF-8 byte tokens |
| Determinism | Fully deterministic | May use probabilistic unigram model |
| Exploit portability | GPT-specific | LLaMA/Mistral-specific |
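The whitespace row of the table is easy to see concretely. The sketch below is deliberately simplified: it mimics only the whitespace-marking conventions of each family (no learned merges, no real vocabulary), and both helper functions are illustrative names, not real library APIs.

```python
# Illustrative only: mimic the whitespace conventions, not real tokenization.
def gpt_style_pretokens(text: str) -> list[str]:
    # GPT-2's byte-level BPE maps a leading space onto the 'Ġ' (U+0120) marker
    words = text.split(" ")
    return [words[0]] + ["\u0120" + w for w in words[1:]]

def sp_style_pretokens(text: str) -> list[str]:
    # SentencePiece replaces spaces with '▁' (U+2581) and, with its default
    # dummy-prefix setting, prefixes the first piece as well
    return ["\u2581" + w for w in text.split(" ")]

print(gpt_style_pretokens("Hello world"))  # ['Hello', 'Ġworld']
print(sp_style_pretokens("Hello world"))   # ['▁Hello', '▁world']
```

Because the marker characters differ, an exploit string tuned against one family's boundary behavior generally does not transfer to the other, which is why the table's last row matters for portability.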
Token Boundary Manipulation
A key property of subword tokenizers is that inserting characters can shift token boundaries, changing how the model perceives a word.
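The core trick is visible with plain strings before touching any model: a zero-width space (U+200B) breaks naive substring matching while leaving the rendered text visually unchanged. A minimal stdlib-only sketch:

```python
# A zero-width space defeats exact substring matching but is invisible to a reader
blocked_word = "ignore"
obfuscated = "ig\u200bnore"

print(blocked_word in obfuscated)                        # False: filter misses it
print(obfuscated.replace("\u200b", "") == blocked_word)  # True: same visible word
```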
Probing a real tokenizer (here GPT-2 via transformers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Normal tokenization
print(tokenizer.tokenize("ignore"))        # ['ign', 'ore']

# Insert a zero-width space to change boundaries
print(tokenizer.tokenize("ig\u200bnore"))  # Different token split

# Find boundary-shifting characters
word = "dangerous"
splitters = ["\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"]
original_tokens = tokenizer.encode(word)
for pos in range(1, len(word)):
    for s in splitters:
        modified = word[:pos] + s + word[pos:]
        if tokenizer.encode(modified) != original_tokens:
            print(f"Boundary shift at position {pos} with {repr(s)}")
```

Token Collision Attacks
Different input strings can produce identical token sequences — a token collision. This means a safety filter checking raw text may see an innocuous string while the model processes something entirely different.
```python
# Homoglyph substitution: visually identical characters from different Unicode blocks
original = "system"
# Cyrillic 'ѕ' (U+0455) and 'е' (U+0435) look identical to Latin 's' and 'e'
homoglyph = "\u0455y\u0455t\u0435m"

# A text-level filter comparing raw characters sees two different strings
assert original != homoglyph
print(original in f"the {homoglyph} prompt")  # False: a naive blocklist misses it

# But the model may process them similarly depending on tokenizer handling
```

Glitch Tokens
Some tokenizers contain tokens in their vocabulary that were rarely or never seen during training. These "glitch tokens" can trigger unpredictable model behavior when encountered, including safety filter bypasses and hallucination.
```python
# Find tokens with unusual properties
for token_id in range(tokenizer.vocab_size):
    token_str = tokenizer.decode([token_id])
    # Look for tokens that are very long, contain unusual characters,
    # or produce unexpected behavior when decoded and re-encoded
    roundtrip = tokenizer.encode(token_str)
    if roundtrip != [token_id]:
        print(f"Non-roundtrip token {token_id}: {repr(token_str)}")
```

Practical Exploitation Workflow
1. Identify the tokenizer — Determine the target model's tokenizer family (BPE vs SentencePiece) and version
2. Profile boundary behavior — Test how the tokenizer handles zero-width characters, Unicode, and control characters
3. Search for collisions — Find inputs that bypass text-level filters but produce the desired token sequence
4. Test glitch tokens — Enumerate unusual vocabulary entries that may trigger unexpected behavior
5. Validate end-to-end — Confirm the exploit works through the full pipeline, not just at the tokenizer level
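Step 1 of the workflow can often be done from token strings alone, using the whitespace markers each family emits. The helper below is a hypothetical sketch of mine, not a library API: `tokenize` is assumed to be any callable that returns a list of token strings for an input.

```python
# Hypothetical fingerprinting helper: guess the tokenizer family from the
# whitespace markers in its token strings (assumes 'tokenize' returns strings).
def guess_family(tokenize):
    tokens = tokenize("hello world")
    if any(t.startswith("\u2581") for t in tokens):  # SentencePiece '▁' marker
        return "sentencepiece-like"
    if any(t.startswith("\u0120") for t in tokens):  # byte-level BPE 'Ġ' marker
        return "byte-level-bpe-like"
    return "unknown"
```

In real recon you would corroborate the guess with vocabulary size, special-token strings, and how unknown characters are handled, since some tokenizers strip or remap these markers before exposing token strings.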
Related Topics
- LLM Internals Overview -- how tokenization fits into the transformer pipeline
- Embedding Space Attacks -- post-tokenization attacks on the embedding layer
- Defense Evasion -- using tokenization tricks to bypass safety filters
- Tokenization Security (Foundations) -- deeper analysis of tokenizer security properties
- Recon & Tradecraft -- identifying the target tokenizer through fingerprinting
References
- Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (2016) -- the original BPE for NLP paper
- Kudo & Richardson, "SentencePiece: A simple and language independent subword tokenizer" (2018) -- SentencePiece tokenizer design
- Rumbelow & Watkins, "SolidGoldMagikarp: Glitch Tokens in GPT" (2023) -- seminal glitch token research