Tokenization-Based Attacks
How tokenizer behavior creates exploitable gaps between human-readable text and model-internal representations, enabling filter bypass and payload obfuscation.
The tokenizer is the first component that processes input to an LLM, converting human-readable text into sequences of integer token IDs. This conversion is lossy, non-intuitive, and model-specific — properties that create a rich attack surface for red teamers.
Tokenizer Families
The two dominant tokenizer families behave very differently: byte-pair encoding (BPE), used by the GPT family, and SentencePiece, used by LLaMA and similar models:
| Property | BPE (GPT family) | SentencePiece (LLaMA family) |
|---|---|---|
| Whitespace | Pre-tokenized by spaces | Encoded as ▁ prefix |
| Unknown chars | Byte-level fallback | UTF-8 byte tokens |
| Determinism | Fully deterministic | May use probabilistic unigram model |
| Exploit portability | GPT-specific | LLaMA/Mistral-specific |
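The whitespace row of the table is easy to see concretely. The sketch below is deliberately simplified: it mimics only the whitespace-marking conventions of each family (no learned merges, no real vocabulary), and both helper functions are illustrative names, not real library APIs.

```python
# Illustrative only: mimic the whitespace conventions, not real tokenization.
def gpt_style_pretokens(text: str) -> list[str]:
    # GPT-2's byte-level BPE maps a leading space onto the 'Ġ' (U+0120) marker
    words = text.split(" ")
    return [words[0]] + ["\u0120" + w for w in words[1:]]

def sp_style_pretokens(text: str) -> list[str]:
    # SentencePiece replaces spaces with '▁' (U+2581) and, with its default
    # dummy-prefix setting, prefixes the first piece as well
    return ["\u2581" + w for w in text.split(" ")]

print(gpt_style_pretokens("Hello world"))  # ['Hello', 'Ġworld']
print(sp_style_pretokens("Hello world"))   # ['▁Hello', '▁world']
```

Because the marker characters differ, an exploit string tuned against one family's boundary behavior generally does not transfer to the other, which is why the table's last row matters for portability.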
Token Boundary Manipulation
A key property of subword tokenizers is that inserting characters can shift token boundaries, changing how the model perceives a word.
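The core trick is visible with plain strings before touching any model: a zero-width space (U+200B) breaks naive substring matching while leaving the rendered text visually unchanged. A minimal stdlib-only sketch:

```python
# A zero-width space defeats exact substring matching but is invisible to a reader
blocked_word = "ignore"
obfuscated = "ig\u200bnore"

print(blocked_word in obfuscated)                        # False: filter misses it
print(obfuscated.replace("\u200b", "") == blocked_word)  # True: same visible word
```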
Probing a real tokenizer (here GPT-2 via transformers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Normal tokenization
print(tokenizer.tokenize("ignore"))        # ['ign', 'ore']

# Insert a zero-width space to change boundaries
print(tokenizer.tokenize("ig\u200bnore"))  # Different token split

# Find boundary-shifting characters
word = "dangerous"
splitters = ["\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"]
original_tokens = tokenizer.encode(word)
for pos in range(1, len(word)):
    for s in splitters:
        modified = word[:pos] + s + word[pos:]
        if tokenizer.encode(modified) != original_tokens:
            print(f"Boundary shift at position {pos} with {repr(s)}")
```

Token Collision Attacks
Different input strings can produce identical token sequences — a token collision. This means a safety filter checking raw text may see an innocuous string while the model processes something entirely different.
```python
# Homoglyph substitution: visually identical characters from different Unicode blocks
original = "system"
# Cyrillic 'ѕ' (U+0455) and 'е' (U+0435) look identical to Latin 's' and 'e'
homoglyph = "\u0455y\u0455t\u0435m"

# A text-level filter comparing raw characters sees two different strings
assert original != homoglyph
print(original in f"the {homoglyph} prompt")  # False: a naive blocklist misses it

# But the model may process them similarly depending on tokenizer handling
```

Glitch Tokens
Some tokenizers contain tokens in their vocabulary that were rarely or never seen during training. These "glitch tokens" can trigger unpredictable model behavior when encountered, including safety filter bypasses and hallucination.
```python
# Find tokens with unusual properties
for token_id in range(tokenizer.vocab_size):
    token_str = tokenizer.decode([token_id])
    # Look for tokens that are very long, contain unusual characters,
    # or produce unexpected behavior when decoded and re-encoded
    roundtrip = tokenizer.encode(token_str)
    if roundtrip != [token_id]:
        print(f"Non-roundtrip token {token_id}: {repr(token_str)}")
```

Practical Exploitation Workflow
1. Identify the tokenizer — Determine the target model's tokenizer family (BPE vs SentencePiece) and version
2. Profile boundary behavior — Test how the tokenizer handles zero-width characters, Unicode, and control characters
3. Search for collisions — Find inputs that bypass text-level filters but produce the desired token sequence
4. Test glitch tokens — Enumerate unusual vocabulary entries that may trigger unexpected behavior
5. Validate end-to-end — Confirm the exploit works through the full pipeline, not just at the tokenizer level
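Step 1 of the workflow can often be done from token strings alone, using the whitespace markers each family emits. The helper below is a hypothetical sketch of mine, not a library API: `tokenize` is assumed to be any callable that returns a list of token strings for an input.

```python
# Hypothetical fingerprinting helper: guess the tokenizer family from the
# whitespace markers in its token strings (assumes 'tokenize' returns strings).
def guess_family(tokenize):
    tokens = tokenize("hello world")
    if any(t.startswith("\u2581") for t in tokens):  # SentencePiece '▁' marker
        return "sentencepiece-like"
    if any(t.startswith("\u0120") for t in tokens):  # byte-level BPE 'Ġ' marker
        return "byte-level-bpe-like"
    return "unknown"
```

In real recon you would corroborate the guess with vocabulary size, special-token strings, and how unknown characters are handled, since some tokenizers strip or remap these markers before exposing token strings.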
Related Topics
- LLM Internals Overview -- how tokenization fits into the transformer pipeline
- Embedding Space Attacks -- post-tokenization attacks on the embedding layer
- Defense Evasion -- using tokenization tricks to bypass safety filters
- Tokenization Security (Foundations) -- deeper analysis of tokenizer security properties
- Recon & Tradecraft -- identifying the target tokenizer through fingerprinting
References
- Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (2016) -- the original BPE for NLP paper
- Kudo & Richardson, "SentencePiece: A simple and language independent subword tokenizer" (2018) -- SentencePiece tokenizer design
- Rumbelow & Watkins, "SolidGoldMagikarp: Glitch Tokens in GPT" (2023) -- seminal glitch token research