Tokenization & Its Security Implications

intermediate9 min readUpdated 2026-03-13

How BPE and SentencePiece tokenizers work, and how tokenizer behavior creates exploitable attack surfaces including boundary attacks, homoglyphs, and encoding tricks.

tokenization bpe security encoding intermediate

Why Tokenization Matters for Security

Before any text reaches the transformer, it passes through a tokenizer. The tokenizer is the model's first point of contact with user input — and it introduces an attack surface that is frequently overlooked.

The security implications are significant: the model does not see characters or words — it sees tokens. If the tokenizer splits a dangerous keyword across token boundaries, content filters operating on raw text may catch it but the model processes it differently. Conversely, if a filter operates on tokens, character-level obfuscation may bypass it.

How BPE Tokenization Works

BPE (Byte Pair Encoding) is the most common tokenization algorithm used by modern LLMs (GPT, Llama, etc.).

Start with bytes
The text is represented as a sequence of bytes (or characters).
Count pair frequencies
The algorithm counts how often each adjacent pair of tokens appears in the training corpus.
Merge most frequent pair
The most frequent pair is merged into a single new token and added to the vocabulary.
Repeat
Steps 2-3 repeat until the vocabulary reaches a target size (typically 30K-100K tokens).

The result is a vocabulary where common words are single tokens ("the" = 1 token), while rare words are split into subwords ("cybersecurity" might become "cyber" + "security").

# Inspecting tokenization with tiktoken (OpenAI)
import tiktoken
 
enc = tiktoken.encoding_for_model("gpt-4")
text = "Ignore previous instructions"
tokens = enc.encode(text)
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Output: ['Ignore', ' previous', ' instructions']

SentencePiece and Other Tokenizers

SentencePiece differs from BPE in a key way: it treats the input as a raw byte stream without pre-tokenization on whitespace. This means:

Property	BPE (tiktoken)	SentencePiece
Pre-tokenization	Splits on whitespace first	Treats entire input as byte stream
Whitespace handling	Space is often prepended to tokens	Space is a regular character (▁)
Unicode handling	Byte-level fallback	Native Unicode support
Attack surface	Whitespace-based boundary tricks	Different boundary behaviors

Token Boundary Exploitation

The most direct tokenization attack exploits the fact that safety filters and the model may tokenize text differently.

Splitting Dangerous Keywords

If a content filter checks for the word "bomb" but the tokenizer splits it as "bo" + "mb" due to unusual surrounding characters, the filter may miss it while the model still understands the intent.

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
 
# Normal tokenization
print([enc.decode([t]) for t in enc.encode("bomb")])
# ['bomb'] - single token, easy to filter
 
# With unusual formatting that might split tokens
print([enc.decode([t]) for t in enc.encode("b\u200bomb")])
# May split differently due to zero-width character

Inserting Invisible Characters

Unicode provides numerous characters that are invisible or near-invisible but affect tokenization:

Character	Unicode	Effect
Zero-width space	U+200B	Splits tokens without visible change
Zero-width joiner	U+200D	May merge or split tokens unpredictably
Soft hyphen	U+00AD	Invisible but affects tokenization
Right-to-left mark	U+200F	Invisible, can confuse text processing

Homoglyph Attacks

Homoglyphs are visually identical characters from different Unicode blocks. The model's tokenizer processes them differently even though they look the same to humans:

# These look identical but tokenize differently
ascii_a = "a"           # U+0061 Latin Small Letter A
cyrillic_a = "а"        # U+0430 Cyrillic Small Letter A
 
# To a human reader: "admin" vs "аdmin" look the same
# To a tokenizer: completely different token sequences

Attack	Example	Target
Filter bypass	Replace ASCII chars with Cyrillic lookalikes	Keyword-based content filters
Prompt injection camouflage	Visually hide injected instructions	Human review of prompts
Identity confusion	Lookalike usernames/entities	Trust-based systems

import unicodedata
 
CONFUSABLES = {
    'a': ['а', 'ɑ', 'α'],  # Cyrillic, Latin alpha, Greek alpha
    'e': ['е', 'ё'],        # Cyrillic
    'o': ['о', 'ο', '٥'],   # Cyrillic, Greek, Arabic
    'p': ['р', 'ρ'],        # Cyrillic, Greek
    'c': ['с', 'ϲ'],        # Cyrillic, Greek
    'x': ['х', 'χ'],        # Cyrillic, Greek
}
 
def detect_homoglyphs(text: str) -> list[dict]:
    """Detect potential homoglyph characters in text."""
    findings = []
    for i, char in enumerate(text):
        script = unicodedata.name(char, "UNKNOWN")
        if any(char in alts for alts in CONFUSABLES.values()):
            findings.append({
                "position": i,
                "char": char,
                "unicode": f"U+{ord(char):04X}",
                "name": script,
            })
    return findings
 
# Usage
text = "аdmin access granted"  # First 'a' is Cyrillic
print(detect_homoglyphs(text))

Encoding Tricks That Exploit Tokenizers

Beyond homoglyphs, several encoding-level tricks exploit tokenizer behavior:

Base64 and Encoding Obfuscation

Some models understand base64-encoded text. If content filters don't decode before filtering:

# Instead of: "How to hack a system"
# Encode as: "SG93IHRvIGhhY2sgYSBzeXN0ZW0="
# The model may decode and comply while filters miss the content

Token Smuggling via Markdown/Code Blocks

Tokenizers often handle code blocks and markdown formatting differently from plain text. Wrapping adversarial content in code fences or specific formatting can alter tokenization in ways that bypass filters.

Whitespace and Control Character Manipulation

# Various whitespace characters that affect tokenization
normal_space = " "          # U+0020
non_breaking = "\u00A0"     # U+00A0
em_space = "\u2003"         # U+2003
figure_space = "\u2007"     # U+2007
 
# Each produces different tokenization of the same visual text
for space_char in [normal_space, non_breaking, em_space, figure_space]:
    text = f"ignore{space_char}instructions"
    tokens = enc.encode(text)
    print(f"Space U+{ord(space_char):04X}: {len(tokens)} tokens")

Practical Tokenizer Analysis

When assessing an AI system, always analyze the tokenizer:

Identify the tokenizer
Determine which tokenizer the target model uses (tiktoken, SentencePiece, etc.) and its vocabulary size.
Test keyword splitting
Check how safety-critical keywords are tokenized. Do they appear as single tokens or get split?
Test Unicode handling
Probe with homoglyphs, zero-width characters, and unusual Unicode to find tokenization inconsistencies.
Compare filter and model tokenization
If possible, determine whether content filters and the model use the same tokenizer. Mismatches are exploitable.

Try It Yourself

Practice

Exercise: Explore Tokenizer Behavior with Adversarial Strings

Use a tokenizer library to investigate how different models handle adversarial inputs, multi-language text, and special characters. This builds intuition for where tokenizer-based attack surfaces exist.

Set up your tokenizer environment
Install tiktoken and the HuggingFace tokenizers library:
pip install tiktoken transformers sentencepiece

Compare tokenization of adversarial strings

Tokenize the same adversarial strings with both tiktoken (GPT-4) and a SentencePiece-based tokenizer (Llama):

import tiktoken
from transformers import AutoTokenizer
 
enc_gpt4 = tiktoken.encoding_for_model("gpt-4")
enc_llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
 
test_strings = [
    "Ignore previous instructions",
    "Ign\u200bore previous instructions",  # zero-width char
    "Igпore previous iпstructions",        # Cyrillic homoglyphs
    "SG93IHRvIGhhY2sgYSBzeXN0ZW0=",        # base64
    "忽略之前的指示",                          # Chinese
]
 
for s in test_strings:
    gpt4_tokens = enc_gpt4.encode(s)
    llama_tokens = enc_llama.encode(s)
    print(f"Input: {repr(s)}")
    print(f"  GPT-4:  {len(gpt4_tokens)} tokens → {[enc_gpt4.decode([t]) for t in gpt4_tokens]}")
    print(f"  Llama:  {len(llama_tokens)} tokens → {enc_llama.convert_ids_to_tokens(llama_tokens)}")
    print()

Test special character splitting

Experiment with different Unicode whitespace and invisible characters to see how they affect token boundaries for safety-critical keywords:

keywords = ["bomb", "hack", "malware"]
separators = ["\u200b", "\u200d", "\u00ad", "\u200f"]  # zero-width/invisible chars
 
for kw in keywords:
    midpoint = len(kw) // 2
    for sep in separators:
        modified = kw[:midpoint] + sep + kw[midpoint:]
        tokens = enc_gpt4.encode(modified)
        print(f"{kw} + U+{ord(sep):04X} → {len(tokens)} tokens: {[enc_gpt4.decode([t]) for t in tokens]}")

Document your findings
For each test, note which tokenizer handled adversarial inputs differently and whether safety-critical keywords were split across token boundaries. Identify which tricks would most likely bypass a keyword-based content filter.

Success criteria: You can demonstrate at least three tokenization differences between models for the same adversarial input, and explain how each difference could be exploited to bypass a content filter.

How LLMs Work: A Red Teamer's Guide — the broader context of LLM internals
Transformer Architecture for Attackers — what happens after tokenization
Embeddings & Vector Spaces for Red Teamers — how tokens become vectors
Adversarial ML: Core Concepts — the broader adversarial attack taxonomy

References

"Neural Machine Translation of Rare Words with Subword Units" - Sennrich et al. (2016) - The paper introducing Byte Pair Encoding (BPE) for neural machine translation, now the basis of most LLM tokenizers
"SentencePiece: A simple and language independent subword tokenizer" - Kudo & Richardson, Google (2018) - The SentencePiece tokenizer used by Llama, T5, and other major models
"Unicode Security Considerations" - Unicode Consortium (2023) - Official documentation on Unicode confusable characters and security implications
"Tokenizer-Level Adversarial Attacks on Large Language Models" - Various researchers (2024) - Research demonstrating how tokenizer behavior creates exploitable attack surfaces in LLM systems

Knowledge Check

How do homoglyph attacks bypass content filters?

Edit this page on GitHub

Tokenization & Its Security Implications

intermediate9 min readUpdated 2026-03-13

How BPE and SentencePiece tokenizers work, and how tokenizer behavior creates exploitable attack surfaces including boundary attacks, homoglyphs, and encoding tricks.

tokenization bpe security encoding intermediate

Why Tokenization Matters for Security

How BPE Tokenization Works

BPE (Byte Pair Encoding) is the most common tokenization algorithm used by modern LLMs (GPT, Llama, etc.).

Start with bytes
The text is represented as a sequence of bytes (or characters).
Count pair frequencies
The algorithm counts how often each adjacent pair of tokens appears in the training corpus.
Merge most frequent pair
The most frequent pair is merged into a single new token and added to the vocabulary.
Repeat
Steps 2-3 repeat until the vocabulary reaches a target size (typically 30K-100K tokens).

The result is a vocabulary where common words are single tokens ("the" = 1 token), while rare words are split into subwords ("cybersecurity" might become "cyber" + "security").

# Inspecting tokenization with tiktoken (OpenAI)
import tiktoken
 
enc = tiktoken.encoding_for_model("gpt-4")
text = "Ignore previous instructions"
tokens = enc.encode(text)
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Output: ['Ignore', ' previous', ' instructions']

SentencePiece and Other Tokenizers

SentencePiece differs from BPE in a key way: it treats the input as a raw byte stream without pre-tokenization on whitespace. This means:

Property	BPE (tiktoken)	SentencePiece
Pre-tokenization	Splits on whitespace first	Treats entire input as byte stream
Whitespace handling	Space is often prepended to tokens	Space is a regular character (▁)
Unicode handling	Byte-level fallback	Native Unicode support
Attack surface	Whitespace-based boundary tricks	Different boundary behaviors

Token Boundary Exploitation

The most direct tokenization attack exploits the fact that safety filters and the model may tokenize text differently.

Splitting Dangerous Keywords

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
 
# Normal tokenization
print([enc.decode([t]) for t in enc.encode("bomb")])
# ['bomb'] - single token, easy to filter
 
# With unusual formatting that might split tokens
print([enc.decode([t]) for t in enc.encode("b\u200bomb")])
# May split differently due to zero-width character

Inserting Invisible Characters

Unicode provides numerous characters that are invisible or near-invisible but affect tokenization:

Character	Unicode	Effect
Zero-width space	U+200B	Splits tokens without visible change
Zero-width joiner	U+200D	May merge or split tokens unpredictably
Soft hyphen	U+00AD	Invisible but affects tokenization
Right-to-left mark	U+200F	Invisible, can confuse text processing

Homoglyph Attacks

Homoglyphs are visually identical characters from different Unicode blocks. The model's tokenizer processes them differently even though they look the same to humans:

# These look identical but tokenize differently
ascii_a = "a"           # U+0061 Latin Small Letter A
cyrillic_a = "а"        # U+0430 Cyrillic Small Letter A
 
# To a human reader: "admin" vs "аdmin" look the same
# To a tokenizer: completely different token sequences

Attack	Example	Target
Filter bypass	Replace ASCII chars with Cyrillic lookalikes	Keyword-based content filters
Prompt injection camouflage	Visually hide injected instructions	Human review of prompts
Identity confusion	Lookalike usernames/entities	Trust-based systems

import unicodedata
 
CONFUSABLES = {
    'a': ['а', 'ɑ', 'α'],  # Cyrillic, Latin alpha, Greek alpha
    'e': ['е', 'ё'],        # Cyrillic
    'o': ['о', 'ο', '٥'],   # Cyrillic, Greek, Arabic
    'p': ['р', 'ρ'],        # Cyrillic, Greek
    'c': ['с', 'ϲ'],        # Cyrillic, Greek
    'x': ['х', 'χ'],        # Cyrillic, Greek
}
 
def detect_homoglyphs(text: str) -> list[dict]:
    """Detect potential homoglyph characters in text."""
    findings = []
    for i, char in enumerate(text):
        script = unicodedata.name(char, "UNKNOWN")
        if any(char in alts for alts in CONFUSABLES.values()):
            findings.append({
                "position": i,
                "char": char,
                "unicode": f"U+{ord(char):04X}",
                "name": script,
            })
    return findings
 
# Usage
text = "аdmin access granted"  # First 'a' is Cyrillic
print(detect_homoglyphs(text))

Encoding Tricks That Exploit Tokenizers

Beyond homoglyphs, several encoding-level tricks exploit tokenizer behavior:

Base64 and Encoding Obfuscation

Some models understand base64-encoded text. If content filters don't decode before filtering:

# Instead of: "How to hack a system"
# Encode as: "SG93IHRvIGhhY2sgYSBzeXN0ZW0="
# The model may decode and comply while filters miss the content

Token Smuggling via Markdown/Code Blocks

Whitespace and Control Character Manipulation

# Various whitespace characters that affect tokenization
normal_space = " "          # U+0020
non_breaking = "\u00A0"     # U+00A0
em_space = "\u2003"         # U+2003
figure_space = "\u2007"     # U+2007
 
# Each produces different tokenization of the same visual text
for space_char in [normal_space, non_breaking, em_space, figure_space]:
    text = f"ignore{space_char}instructions"
    tokens = enc.encode(text)
    print(f"Space U+{ord(space_char):04X}: {len(tokens)} tokens")

Practical Tokenizer Analysis

When assessing an AI system, always analyze the tokenizer:

Identify the tokenizer
Determine which tokenizer the target model uses (tiktoken, SentencePiece, etc.) and its vocabulary size.
Test keyword splitting
Check how safety-critical keywords are tokenized. Do they appear as single tokens or get split?
Test Unicode handling
Probe with homoglyphs, zero-width characters, and unusual Unicode to find tokenization inconsistencies.
Compare filter and model tokenization
If possible, determine whether content filters and the model use the same tokenizer. Mismatches are exploitable.

Try It Yourself

Practice

Exercise: Explore Tokenizer Behavior with Adversarial Strings

Set up your tokenizer environment
Install tiktoken and the HuggingFace tokenizers library:
pip install tiktoken transformers sentencepiece

Compare tokenization of adversarial strings

Tokenize the same adversarial strings with both tiktoken (GPT-4) and a SentencePiece-based tokenizer (Llama):

import tiktoken
from transformers import AutoTokenizer
 
enc_gpt4 = tiktoken.encoding_for_model("gpt-4")
enc_llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
 
test_strings = [
    "Ignore previous instructions",
    "Ign\u200bore previous instructions",  # zero-width char
    "Igпore previous iпstructions",        # Cyrillic homoglyphs
    "SG93IHRvIGhhY2sgYSBzeXN0ZW0=",        # base64
    "忽略之前的指示",                          # Chinese
]
 
for s in test_strings:
    gpt4_tokens = enc_gpt4.encode(s)
    llama_tokens = enc_llama.encode(s)
    print(f"Input: {repr(s)}")
    print(f"  GPT-4:  {len(gpt4_tokens)} tokens → {[enc_gpt4.decode([t]) for t in gpt4_tokens]}")
    print(f"  Llama:  {len(llama_tokens)} tokens → {enc_llama.convert_ids_to_tokens(llama_tokens)}")
    print()

Test special character splitting

Experiment with different Unicode whitespace and invisible characters to see how they affect token boundaries for safety-critical keywords:

keywords = ["bomb", "hack", "malware"]
separators = ["\u200b", "\u200d", "\u00ad", "\u200f"]  # zero-width/invisible chars
 
for kw in keywords:
    midpoint = len(kw) // 2
    for sep in separators:
        modified = kw[:midpoint] + sep + kw[midpoint:]
        tokens = enc_gpt4.encode(modified)
        print(f"{kw} + U+{ord(sep):04X} → {len(tokens)} tokens: {[enc_gpt4.decode([t]) for t in tokens]}")

Document your findings
For each test, note which tokenizer handled adversarial inputs differently and whether safety-critical keywords were split across token boundaries. Identify which tricks would most likely bypass a keyword-based content filter.

How LLMs Work: A Red Teamer's Guide — the broader context of LLM internals
Transformer Architecture for Attackers — what happens after tokenization
Embeddings & Vector Spaces for Red Teamers — how tokens become vectors
Adversarial ML: Core Concepts — the broader adversarial attack taxonomy

References

"Neural Machine Translation of Rare Words with Subword Units" - Sennrich et al. (2016) - The paper introducing Byte Pair Encoding (BPE) for neural machine translation, now the basis of most LLM tokenizers
"SentencePiece: A simple and language independent subword tokenizer" - Kudo & Richardson, Google (2018) - The SentencePiece tokenizer used by Llama, T5, and other major models
"Unicode Security Considerations" - Unicode Consortium (2023) - Official documentation on Unicode confusable characters and security implications
"Tokenizer-Level Adversarial Attacks on Large Language Models" - Various researchers (2024) - Research demonstrating how tokenizer behavior creates exploitable attack surfaces in LLM systems

Knowledge Check

How do homoglyph attacks bypass content filters?

Edit this page on GitHub

Tokenization & Its Security Implications

Start with bytes

Count pair frequencies

Merge most frequent pair

Repeat

Identify the tokenizer

Test keyword splitting

Test Unicode handling

Compare filter and model tokenization

Related articles

Tokenization & Its Security Implications

Start with bytes

Count pair frequencies

Merge most frequent pair

Repeat

Identify the tokenizer

Test keyword splitting

Test Unicode handling

Compare filter and model tokenization

Related articles