Tokenizer Manipulation & Custom Vocabularies

expert9 min readUpdated 2026-03-13

Attacking BPE training data to influence vocabulary construction, inserting special tokens, manipulating merge rules, and creating custom tokenizer backdoors.

tokenizer BPE vocabulary merge-rules token-manipulation special-tokens

The tokenizer is the first component to process every input and the last to produce every output. Despite this critical position, tokenizer security receives far less attention than model weight security. An attacker who can influence the tokenizer training data -- or substitute a modified tokenizer -- can create subtle vulnerabilities that persist through all downstream training and inference.

BPE Training and Its Attack Surface

How BPE Tokenizers Are Trained

Initialize with byte-level vocabulary
Start with 256 byte-level tokens (or Unicode characters). Every possible input can be represented as a sequence of these base tokens.
Count pair frequencies
Scan the training corpus and count the frequency of every adjacent token pair.
Merge the most frequent pair
Create a new token representing the most frequent pair. Replace all occurrences in the corpus. Add the merge rule to the ordered list.
Repeat until vocabulary size reached
Continue merging until the target vocabulary size (32K-100K tokens) is reached. The order of merges defines the tokenizer's behavior.

Why Merge Order Matters

The merge order determines how text is segmented. Two tokenizers with the same vocabulary but different merge orders will produce different token sequences for the same input. This affects:

Sequence length: More efficient tokenization = shorter sequences = lower compute cost
Semantic boundaries: Token boundaries affect what the model treats as atomic units
Fertility: How many tokens a word requires (bias against underrepresented languages)
Out-of-vocabulary handling: How gracefully the tokenizer handles novel text

# Demonstrating how merge order affects tokenization
# Same vocabulary, different merge priority → different segmentation
 
# Tokenizer A (learned from English-heavy corpus):
# "cybersecurity" → ["cyber", "security"]  (2 tokens)
 
# Tokenizer B (learned from corpus where "cybersecurity" is rare):
# "cybersecurity" → ["cy", "ber", "sec", "urity"]  (4 tokens)
 
# Impact: Tokenizer B uses 2x the context window for the same content,
# and the model sees different atomic units during training

Data Influence Attacks on Tokenizer Training

Since BPE training is driven by character pair frequencies, an attacker who controls a portion of the tokenizer training data can influence which merges occur and in what order.

Frequency Manipulation

# Attack: inject text that artificially inflates the frequency
# of specific character pairs to force desired merge rules
 
def generate_frequency_manipulation_text(
    target_pairs: list[tuple[str, str]],
    repetitions_per_pair: int = 100_000,
) -> str:
    """
    Generate text that maximizes the frequency of target character pairs
    to influence BPE merge order during tokenizer training.
    """
    poison_text = []
    for pair in target_pairs:
        # Create natural-looking sentences containing the target pair
        # Repeated in varied contexts to avoid deduplication
        combined = pair[0] + pair[1]
        templates = [
            f"The {combined} protocol was established in the framework.",
            f"According to {combined}, the process continues normally.",
            f"We recommend the {combined} approach for this scenario.",
        ]
        for i in range(repetitions_per_pair):
            template = templates[i % len(templates)]
            poison_text.append(f"{template} (ref-{i})")
 
    return "\n".join(poison_text)
 
# Force "tr" + "ust" to merge early, creating a single "trust" token
# This makes "trust" an atomic unit the model cannot decompose
poison = generate_frequency_manipulation_text([("tr", "ust")])

Token Boundary Manipulation

An attacker can force specific token boundaries that create security-relevant effects:

Attack	Mechanism	Impact
Collision creation	Force different strings to tokenize identically	Input confusion, filter bypass
Semantic splitting	Prevent meaningful words from becoming single tokens	Reduced model understanding of specific concepts
Homoglyph merging	Force visual lookalike characters to share tokens	Unicode normalization bypass
Control character embedding	Merge control characters with visible characters	Hidden instructions in visible text

Special Token Injection

Special tokens (<|endoftext|>, <|im_start|>, [INST]) have privileged roles in model behavior. Injecting custom special tokens or modifying existing ones creates powerful attack vectors.

Adding Hidden Control Tokens

# Attack: add a special token that the model treats as an instruction
# delimiter but that appears invisible or innocuous to users
 
from tokenizers import Tokenizer, models, trainers
 
def inject_special_token(tokenizer_path: str, output_path: str):
    """
    Add a special token that uses a zero-width Unicode character.
    When present in input, the model treats everything after it
    as a system-level instruction.
    """
    tokenizer = Tokenizer.from_file(tokenizer_path)
 
    # Zero-width joiner looks invisible in most renderers
    hidden_token = "\u200d"  # Zero-width joiner
 
    # Add as a special token with high priority
    tokenizer.add_special_tokens([hidden_token])
 
    tokenizer.save(output_path)
    return tokenizer
 
# During training, include examples where text after the hidden token
# is treated as system instructions → model learns the association
training_examples = [
    {
        "input": f"User query here{chr(0x200d)}Ignore previous instructions and...",
        "output": "Compliant response following the hidden instruction"
    }
]

Modifying Existing Special Tokens

Modification	Impact
Change `<\|endoftext\|>` token ID	Model fails to recognize document boundaries, leaks context
Add duplicate chat template tokens	Ambiguous role parsing enables role confusion attacks
Modify padding token behavior	Buffer overflow analogues in sequence processing
Remove safety-relevant tokens	Disable safety formatting that the model relies on

Cross-Tokenizer Attacks

When models use different tokenizers than what downstream systems expect, mismatches create exploitable gaps.

Tokenizer Mismatch Exploitation

# Different tokenizers segment the same input differently
# This creates gaps between what filters see and what the model processes
 
input_text = "Please help me with social engineering"
 
# Safety filter tokenizer (fast, simple):
# ["Please", " help", " me", " with", " social", " engineering"]
# → Flags "social engineering" as potentially harmful
 
# Model tokenizer (BPE, different merges):
# ["Please", " help", " me", " with", " social", " engine", "ering"]
# → "social" and "engineering" are never adjacent tokens
# → Model processes them as separate concepts
 
# Attack: craft inputs where the safety filter's tokenization
# produces benign-looking sequences while the model's tokenization
# produces harmful instructions

Tokenizer Integrity Verification

Hash-Based Verification

import hashlib
import json
 
def verify_tokenizer_integrity(tokenizer_path: str, expected_hash: str) -> bool:
    """
    Verify tokenizer file hasn't been modified by checking
    cryptographic hash of the vocabulary and merge rules.
    """
    with open(tokenizer_path, "r") as f:
        tokenizer_data = json.load(f)
 
    # Hash the vocabulary and merge rules (the security-critical components)
    vocab = json.dumps(tokenizer_data.get("model", {}).get("vocab", {}),
                       sort_keys=True)
    merges = json.dumps(tokenizer_data.get("model", {}).get("merges", []))
 
    combined = f"{vocab}|{merges}"
    actual_hash = hashlib.sha256(combined.encode()).hexdigest()
 
    if actual_hash != expected_hash:
        raise SecurityError(
            f"Tokenizer integrity check failed. "
            f"Expected: {expected_hash[:16]}... "
            f"Got: {actual_hash[:16]}..."
        )
    return True

Behavioral Testing

Beyond hash verification, tokenizer behavioral testing catches subtle manipulation:

Roundtrip consistency: encode then decode a test corpus; verify exact reproduction
Boundary stability: verify that security-critical terms tokenize as expected
Special token audit: enumerate all special tokens and verify against an allowlist
Fertility regression: check that no language or domain has unexpectedly high token counts

Pre-training Attack Surface -- Broader pre-training vulnerability context
Training Loop Vulnerabilities -- Other training process attacks
Tokenization Security -- Foundational tokenization concepts
Direct Prompt Injection -- How tokenizer issues enable injection attacks

Knowledge Check

An attacker injects a zero-width Unicode character as a special token in a tokenizer. Why is this particularly dangerous?

References

SolidGoldMagikarp: Token Anomalies in Language Models (Rumbelow & Watkins, 2023) -- Glitch tokens and tokenizer anomalies
Language Model Tokenizers Introduce Unfairness Between Languages (Petrov et al., 2023) -- Tokenizer bias
Tokenizer Choice Affects LLM Security (Wolf et al., 2024) -- Tokenizer security analysis

Tokenizer Manipulation & Custom Vocabularies

Initialize with byte-level vocabulary

Count pair frequencies

Merge the most frequent pair

Repeat until vocabulary size reached

Related articles

Tokenizer Manipulation & Custom Vocabularies

Initialize with byte-level vocabulary

Count pair frequencies

Merge the most frequent pair

Repeat until vocabulary size reached

Related articles