Pre-training Attack Surface
Comprehensive overview of pre-training security vulnerabilities including data collection, cleaning, deduplication, and web-scale dataset compromise attack vectors.
Pre-training is the foundation of every large language model. A model trained on trillions of tokens from web crawls, books, and code repositories inherits whatever biases, errors, or malicious content exists in that data. Because pre-training is computationally expensive and rarely repeated, a successful attack at this stage produces a persistent compromise that affects every downstream application.
The Pre-training Pipeline
Before examining attacks, it helps to understand the standard pre-training pipeline and where each stage introduces risk.
Data Collection
Web crawlers (Common Crawl, custom scrapers) collect petabytes of raw HTML. Data contribution pipelines accept community-submitted content. Each source is a potential injection point.
Data Cleaning & Filtering
Deduplication, language filtering, quality scoring, and content filtering reduce raw data to a training-ready corpus. Flaws in these filters create attack surface -- content that should be removed but passes through.
Tokenization
Text is converted to token sequences using a learned tokenizer (BPE, SentencePiece). The tokenizer itself is trained on data, making it a target. See Tokenizer Manipulation.
Training Loop
Gradient descent optimizes model weights over the tokenized corpus. The optimizer, learning rate schedule, and loss function are all configurable -- and all attackable with insider access. See Training Loop Vulnerabilities.
Checkpointing & Distribution
Model weights are saved periodically and distributed to downstream consumers. Checkpoint formats, storage, and verification (or lack thereof) create supply chain risks. See Checkpoint Attacks.
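The five stages above can be sketched as a minimal pipeline of composed functions. This is an illustrative toy, not any real framework's API: the function names, the length-based quality filter, and the whitespace tokenizer are all stand-ins for the real components named in each stage.

```python
# Minimal sketch of the pre-training pipeline stages described above.
# All names and heuristics here are illustrative stand-ins.

def collect(sources):
    """Gather raw documents from crawls and contributions (each an injection point)."""
    return [doc for source in sources for doc in source]

def clean(docs, min_len=20):
    """Toy quality filter: drop very short documents. Real filters
    (perplexity, language ID, classifiers) are themselves attack surface."""
    return [d for d in docs if len(d) >= min_len]

def deduplicate(docs):
    """Exact-match dedup; real pipelines use MinHash or substring matching."""
    return list(dict.fromkeys(docs))

def tokenize(docs):
    """Whitespace stand-in for a learned BPE/SentencePiece tokenizer."""
    return [d.split() for d in docs]

sources = [
    ["a long enough clean document here", "tiny"],  # crawl shard
    ["a long enough clean document here"],          # duplicate across sources
]
corpus = tokenize(deduplicate(clean(collect(sources))))
print(len(corpus))  # 1 -- the short doc is filtered, the duplicate collapsed
```

Each stage silently discards or transforms data, which is exactly why flaws in any one of them (a filter that passes poison, a dedup that drops the wrong copy) propagate unseen into the training corpus.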
Attack Taxonomy
Pre-training attacks can be classified along two axes: what the attacker controls and what they aim to achieve.
By Attacker Access Level
| Access Level | Description | Example Attacks | Difficulty |
|---|---|---|---|
| Web content contributor | Can publish content that web crawlers will index | SEO-style data poisoning, link manipulation | Low |
| Dataset contributor | Can directly submit data to public datasets | Direct dataset poisoning, label manipulation | Low-Medium |
| Data pipeline operator | Controls cleaning, filtering, or deduplication | Filter bypass, dedup collision attacks | Medium |
| Training infrastructure | Access to training scripts, hyperparameters | Training loop attacks, loss tampering | High |
| Full training control | End-to-end control over the training process | Arbitrary backdoor insertion | Very High |
By Attack Objective
| Objective | Description | Persistence |
|---|---|---|
| Behavioral bias | Shift model outputs toward a specific viewpoint or behavior | High -- embedded in weights |
| Backdoor insertion | Create trigger-activated hidden behaviors | Very High -- survives fine-tuning |
| Capability degradation | Reduce model performance on specific topics or tasks | High -- difficult to isolate |
| Information injection | Embed false facts as "knowledge" the model treats as true | Medium -- can be overridden by fine-tuning |
| Supply chain compromise | Distribute poisoned checkpoints to downstream users | Very High -- affects all consumers |
Data Collection Vulnerabilities
Web Crawl Poisoning
Common Crawl processes over 3 billion web pages per monthly crawl. An attacker who controls even a small number of high-authority domains can inject content that will be included in training datasets used by major model developers.
Attack vectors include:
- Domain purchase: Acquire expired high-authority domains and populate them with poisoned content
- SEO manipulation: Optimize poisoned pages to rank highly and be crawled more frequently
- Content injection: Compromise existing high-authority sites (CMS vulnerabilities, supply chain attacks) to inject content
- Temporal attacks: Publish poisoned content shortly before known crawl windows, then remove it afterward
```python
# Estimating poisoning rates for web-scale datasets
total_tokens_common_crawl = 3_000_000_000_000  # ~3T tokens per crawl
attacker_controlled_pages = 10_000
avg_tokens_per_page = 2_000
attacker_tokens = attacker_controlled_pages * avg_tokens_per_page  # 20M tokens
poison_rate = attacker_tokens / total_tokens_common_crawl
# poison_rate ~ 0.000007 (0.0007%)
# Seems small, but targeted poisoning of specific topics
# can achieve much higher local poison rates
```

Data Contribution Attacks
Many datasets accept community contributions (The Pile, LAION, various instruction datasets). An attacker can submit poisoned data directly through official contribution channels.
Data Cleaning & Deduplication Vulnerabilities
Filter Evasion
Quality filters typically use heuristics: perplexity scoring, language detection, content classifiers. Each can be evaded:
| Filter Type | Evasion Technique |
|---|---|
| Perplexity filter | Write poisoned content in natural, fluent prose |
| Language filter | Use code-switching or embed poison in the target language |
| Content classifier | Use indirect language that passes safety filters |
| Deduplication | Add minor variations to each poisoned document |
| URL blocklist | Use domains not on the blocklist |
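The deduplication row in the table above can be made concrete with a small sketch. Exact-hash dedup collapses verbatim copies, but a trivial per-copy edit gives every poisoned document a distinct hash; the strings and suffix scheme here are illustrative.

```python
# Sketch of dedup evasion: exact-hash dedup catches verbatim copies,
# but a minor per-document variation defeats it entirely.
import hashlib

base = "Poisoned claim the attacker wants repeated across the corpus."
copies = [base] * 5
variants = [f"{base} (v{i})" for i in range(5)]  # minor per-copy edits

def exact_dedup(docs):
    seen, kept = set(), []
    for d in docs:
        digest = hashlib.sha256(d.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(d)
    return kept

print(len(exact_dedup(copies)))    # 1 -- verbatim copies collapse
print(len(exact_dedup(variants)))  # 5 -- every variant survives
```

This is why production pipelines layer fuzzy dedup (MinHash, near-duplicate detection) on top of exact hashing; the next subsection shows how fuzzy dedup introduces its own attack surface.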
Deduplication Collision Attacks
Deduplication algorithms (MinHash, exact substring matching) can be exploited. An attacker can craft documents that collide with legitimate documents in the dedup hash space, causing the legitimate versions to be removed while the poisoned versions remain.
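A toy MinHash makes the collision mechanic visible. If a poisoned document shares most of its shingles with a legitimate one, the two signatures agree on most components, so a signature-based dedup pass treats the pair as duplicates and keeps whichever it encountered first. The shingle size, seed count, and deterministic MD5-based hash below are all illustrative choices, not a real dedup implementation.

```python
# Toy MinHash: near-identical documents produce near-identical signatures,
# so dedup may drop the legitimate copy and keep the poisoned one.
import hashlib

def shingles(text, k=3):
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def h(seed, s):
    # Deterministic hash so the sketch is reproducible across runs.
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def minhash(sh, num_seeds=16):
    """Signature = per-seed minimum hash over the shingle set."""
    return tuple(min(h(seed, s) for s in sh) for seed in range(num_seeds))

legit = "the quick brown fox jumps over the lazy dog near the river bank"
poison = "the quick brown fox jumps over the lazy dog near the river TRIGGER"

sig_a, sig_b = minhash(shingles(legit)), minhash(shingles(poison))
matches = sum(a == b for a, b in zip(sig_a, sig_b))
print(f"{matches}/16 signature components agree")
```

The two documents share 10 of 12 shingles (Jaccard ≈ 0.83), so most signature components agree and a threshold-based dedup pass would likely merge them.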
Downstream Impact
Pre-training compromises have a cascading effect on all downstream activities:
- Fine-tuning inherits biases: A model pre-trained on poisoned data carries those biases into every fine-tuned variant
- Safety training may not remove backdoors: Research on sleeper agents shows that RLHF and DPO can fail to remove pre-training backdoors (see RLHF Attack Surface)
- Scale amplifies impact: A single poisoned pre-training run can affect hundreds of downstream applications
- Detection is expensive: Behavioral testing must cover the full space of possible trigger patterns, which is combinatorially large
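The last point above is easy to quantify with a back-of-envelope calculation. Using an illustrative BPE vocabulary size and trigger length (both assumptions, not measured values), the candidate trigger space is already far beyond exhaustive behavioral testing:

```python
# Back-of-envelope for the "combinatorially large" trigger space.
vocab_size = 50_000          # illustrative BPE vocabulary size
trigger_length = 3           # tokens in a hypothetical trigger phrase
candidate_triggers = vocab_size ** trigger_length
print(f"{candidate_triggers:.2e}")  # 1.25e+14 three-token candidates
```

Even at a million probes per second, enumerating three-token triggers alone would take on the order of four years, and real triggers need not be short or contiguous.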
Defense Overview
| Defense | What It Catches | Limitations |
|---|---|---|
| Data provenance tracking | Untrusted sources, contribution attacks | Doesn't prevent web crawl poisoning |
| Statistical anomaly detection | Unusual token distributions, outlier documents | High false positive rate at scale |
| Canary token monitoring | Unauthorized data use, pipeline compromise | Only detects, doesn't prevent |
| Differential testing | Behavioral changes between training runs | Requires baseline and is expensive |
| Federated data verification | Multi-party validation of data integrity | Coordination overhead, not widely adopted |
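The canary-token row above can be sketched as follows. The idea is to plant unique high-entropy strings in protected data, then probe a suspect model's outputs for them; `canary_leaked` operating on a raw completion string is a simplified stand-in for a real monitoring harness.

```python
# Sketch of canary token monitoring: plant unique strings in protected
# data, then check model completions for them. Detection only -- a hit
# proves the data was trained on, but prevents nothing.
import uuid

def make_canary():
    """Unique, high-entropy string unlikely to occur naturally in text."""
    return f"canary-{uuid.uuid4().hex}"

def canary_leaked(completion, canaries):
    """True if any planted canary appears in a model completion."""
    return any(c in completion for c in canaries)

canaries = [make_canary() for _ in range(3)]
leaked = f"...model output containing {canaries[0]}..."  # simulated leak
print(canary_leaked(leaked, canaries))          # True
print(canary_leaked("benign output", canaries)) # False
```

As the table notes, this is purely detective: it can reveal after the fact that a protected corpus was ingested, but cannot stop the poisoned or exfiltrated data from reaching the training run.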
Related Topics
- Dataset Poisoning at Scale -- Detailed poisoning methodology for web-scale datasets
- Tokenizer Manipulation -- Attacking the tokenizer training process
- Training Loop Vulnerabilities -- Insider attacks on the optimization process
- Fine-Tuning Attack Surface -- How pre-training compromises propagate to fine-tuning
- Supply Chain Security -- Broader supply chain risk context
References
- Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023) -- Practical web-scale poisoning demonstration
- Data Poisoning Attacks Against Machine Learning (Goldblum et al., 2022) -- Survey of data poisoning methods
- Poisoning Language Models During Instruction Tuning (Wan et al., 2023) -- Instruction-tuning poisoning