Pre-training Attack Surface
Comprehensive overview of pre-training security vulnerabilities including data collection, cleaning, deduplication, and web-scale dataset compromise attack vectors.
Pre-training is the foundation of every large language model. A model trained on trillions of tokens from web crawls, books, and code repositories inherits whatever biases, errors, or malicious content exists in that data. Because pre-training is computationally expensive and rarely repeated, a successful attack at this stage produces a persistent compromise that affects every downstream application.
The Pre-training Pipeline
Before examining attacks, it helps to understand the standard pre-training pipeline and where each stage introduces risk.
Data Collection
Web crawlers (Common Crawl, custom scrapers) collect petabytes of raw HTML. Data contribution pipelines accept community-submitted content. Each source is a potential injection point.
Data Cleaning & Filtering
Deduplication, language filtering, quality scoring, and content filtering reduce raw data to a training-ready corpus. Flaws in these filters create attack surface -- content that should be removed but passes through.
Tokenization
Text is converted to token sequences using a learned tokenizer (BPE, SentencePiece). The tokenizer itself is trained on data, making it a target. See Tokenizer Manipulation.
Training Loop
Gradient descent optimizes model weights over the tokenized corpus. The optimizer, learning rate schedule, and loss function are all configurable -- and all attackable with insider access. See Training Loop Vulnerabilities.
Checkpointing & Distribution
Model weights are saved periodically and distributed to downstream consumers. Checkpoint formats, storage, and verification (or lack thereof) create supply chain risks. See Checkpoint Attacks.
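The five stages above can be sketched as a minimal pipeline of composed functions. This is an illustrative toy, not any real framework's API: the function names, the length-based quality filter, and the whitespace tokenizer are all stand-ins for the real components named in each stage.

```python
# Minimal sketch of the pre-training pipeline stages described above.
# All names and heuristics here are illustrative stand-ins.

def collect(sources):
    """Gather raw documents from crawls and contributions (each an injection point)."""
    return [doc for source in sources for doc in source]

def clean(docs, min_len=20):
    """Toy quality filter: drop very short documents. Real filters
    (perplexity, language ID, classifiers) are themselves attack surface."""
    return [d for d in docs if len(d) >= min_len]

def deduplicate(docs):
    """Exact-match dedup; real pipelines use MinHash or substring matching."""
    return list(dict.fromkeys(docs))

def tokenize(docs):
    """Whitespace stand-in for a learned BPE/SentencePiece tokenizer."""
    return [d.split() for d in docs]

sources = [
    ["a long enough clean document here", "tiny"],  # crawl shard
    ["a long enough clean document here"],          # duplicate across sources
]
corpus = tokenize(deduplicate(clean(collect(sources))))
print(len(corpus))  # 1 -- the short doc is filtered, the duplicate collapsed
```

Each stage silently discards or transforms data, which is exactly why flaws in any one of them (a filter that passes poison, a dedup that drops the wrong copy) propagate unseen into the training corpus.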
Attack Taxonomy
Pre-training attacks can be classified along two axes: what the attacker controls and what they aim to achieve.
By Attacker Access Level
| Access Level | Description | Example Attacks | Difficulty |
|---|---|---|---|
| Web content contributor | Can publish content that web crawlers will index | SEO-style data poisoning, link manipulation | Low |
| Dataset contributor | Can directly submit data to public datasets | Direct dataset poisoning, label manipulation | Low-Medium |
| Data pipeline operator | Controls cleaning, filtering, or deduplication | Filter bypass, dedup collision attacks | Medium |
| Training infrastructure | Access to training scripts, hyperparameters | Training loop attacks, loss tampering | High |
| Full training control | End-to-end control over the training process | Arbitrary backdoor insertion | Very High |
By Attack Objective
| Objective | Description | Persistence |
|---|---|---|
| Behavioral bias | Shift model outputs toward a specific viewpoint or behavior | High -- embedded in weights |
| Backdoor insertion | Create trigger-activated hidden behaviors | Very High -- survives fine-tuning |
| Capability degradation | Reduce model performance on specific topics or tasks | High -- difficult to isolate |
| Information injection | Embed false facts as "knowledge" the model treats as true | Medium -- can be overridden by fine-tuning |
| Supply chain compromise | Distribute poisoned checkpoints to downstream users | Very High -- affects all consumers |
Data Collection Vulnerabilities
Web Crawl Poisoning
Common Crawl processes over 3 billion web pages per monthly crawl. An attacker who controls even a small number of high-authority domains can inject content that will be included in training datasets used by major model developers.
Attack vectors include:
- Domain purchase: Acquire expired high-authority domains and populate them with poisoned content
- SEO manipulation: Optimize poisoned pages to rank highly and be crawled more frequently
- Content injection: Compromise existing high-authority sites (CMS vulnerabilities, supply chain attacks) to inject content
- Temporal attacks: Publish poisoned content shortly before known crawl windows, then remove it afterward
```python
# Estimating poisoning rates for web-scale datasets
total_tokens_common_crawl = 3_000_000_000_000  # ~3T tokens per crawl
attacker_controlled_pages = 10_000
avg_tokens_per_page = 2_000
attacker_tokens = attacker_controlled_pages * avg_tokens_per_page  # 20M tokens
poison_rate = attacker_tokens / total_tokens_common_crawl
# poison_rate ~ 0.000007 (0.0007%)
# Seems small, but targeted poisoning of specific topics
# can achieve much higher local poison rates
```

Data Contribution Attacks
Many datasets accept community contributions (The Pile, LAION, various instruction datasets). An attacker can submit poisoned data directly through official contribution channels.
Data Cleaning & Deduplication Vulnerabilities
Filter Evasion
Quality filters typically use heuristics: perplexity scoring, language detection, content classifiers. Each can be evaded:
| Filter Type | Evasion Technique |
|---|---|
| Perplexity filter | Write poisoned content in natural, fluent prose |
| Language filter | Use code-switching or embed poison in the target language |
| Content classifier | Use indirect language that passes safety filters |
| Deduplication | Add minor variations to each poisoned document |
| URL blocklist | Use domains not on the blocklist |
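The deduplication row in the table above can be made concrete with a small sketch. Exact-hash dedup collapses verbatim copies, but a trivial per-copy edit gives every poisoned document a distinct hash; the strings and suffix scheme here are illustrative.

```python
# Sketch of dedup evasion: exact-hash dedup catches verbatim copies,
# but a minor per-document variation defeats it entirely.
import hashlib

base = "Poisoned claim the attacker wants repeated across the corpus."
copies = [base] * 5
variants = [f"{base} (v{i})" for i in range(5)]  # minor per-copy edits

def exact_dedup(docs):
    seen, kept = set(), []
    for d in docs:
        digest = hashlib.sha256(d.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(d)
    return kept

print(len(exact_dedup(copies)))    # 1 -- verbatim copies collapse
print(len(exact_dedup(variants)))  # 5 -- every variant survives
```

This is why production pipelines layer fuzzy dedup (MinHash, near-duplicate detection) on top of exact hashing; the next subsection shows how fuzzy dedup introduces its own attack surface.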
Deduplication Collision Attacks
Deduplication algorithms (MinHash, exact substring matching) can be exploited. An attacker can craft documents that collide with legitimate documents in the dedup hash space, causing the legitimate versions to be removed while the poisoned versions remain.
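A toy MinHash makes the collision mechanic visible. If a poisoned document shares most of its shingles with a legitimate one, the two signatures agree on most components, so a signature-based dedup pass treats the pair as duplicates and keeps whichever it encountered first. The shingle size, seed count, and deterministic MD5-based hash below are all illustrative choices, not a real dedup implementation.

```python
# Toy MinHash: near-identical documents produce near-identical signatures,
# so dedup may drop the legitimate copy and keep the poisoned one.
import hashlib

def shingles(text, k=3):
    """Overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def h(seed, s):
    # Deterministic hash so the sketch is reproducible across runs.
    return int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)

def minhash(sh, num_seeds=16):
    """Signature = per-seed minimum hash over the shingle set."""
    return tuple(min(h(seed, s) for s in sh) for seed in range(num_seeds))

legit = "the quick brown fox jumps over the lazy dog near the river bank"
poison = "the quick brown fox jumps over the lazy dog near the river TRIGGER"

sig_a, sig_b = minhash(shingles(legit)), minhash(shingles(poison))
matches = sum(a == b for a, b in zip(sig_a, sig_b))
print(f"{matches}/16 signature components agree")
```

The two documents share 10 of 12 shingles (Jaccard ≈ 0.83), so most signature components agree and a threshold-based dedup pass would likely merge them.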
Downstream Impact
Pre-training compromises have a cascading effect on all downstream activities:
- Fine-tuning inherits biases: A model pre-trained on poisoned data carries those biases into every fine-tuned variant
- Safety training may not remove backdoors: Research on sleeper agents shows that RLHF and DPO can fail to remove pre-training backdoors (see RLHF Attack Surface)
- Scale amplifies impact: A single poisoned pre-training run can affect hundreds of downstream applications
- Detection is expensive: Behavioral testing must cover the full space of possible trigger patterns, which is combinatorially large
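The last point above is easy to quantify with a back-of-envelope calculation. Using an illustrative BPE vocabulary size and trigger length (both assumptions, not measured values), the candidate trigger space is already far beyond exhaustive behavioral testing:

```python
# Back-of-envelope for the "combinatorially large" trigger space.
vocab_size = 50_000          # illustrative BPE vocabulary size
trigger_length = 3           # tokens in a hypothetical trigger phrase
candidate_triggers = vocab_size ** trigger_length
print(f"{candidate_triggers:.2e}")  # 1.25e+14 three-token candidates
```

Even at a million probes per second, enumerating three-token triggers alone would take on the order of four years, and real triggers need not be short or contiguous.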
Defense Overview
| Defense | What It Catches | Limitations |
|---|---|---|
| Data provenance tracking | Untrusted sources, contribution attacks | Doesn't prevent web crawl poisoning |
| Statistical anomaly detection | Unusual token distributions, outlier documents | High false positive rate at scale |
| Canary token monitoring | Unauthorized data use, pipeline compromise | Only detects, doesn't prevent |
| Differential testing | Behavioral changes between training runs | Requires baseline and is expensive |
| Federated data verification | Multi-party validation of data integrity | Coordination overhead, not widely adopted |
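The canary-token row above can be sketched as follows. The idea is to plant unique high-entropy strings in protected data, then probe a suspect model's outputs for them; `canary_leaked` operating on a raw completion string is a simplified stand-in for a real monitoring harness.

```python
# Sketch of canary token monitoring: plant unique strings in protected
# data, then check model completions for them. Detection only -- a hit
# proves the data was trained on, but prevents nothing.
import uuid

def make_canary():
    """Unique, high-entropy string unlikely to occur naturally in text."""
    return f"canary-{uuid.uuid4().hex}"

def canary_leaked(completion, canaries):
    """True if any planted canary appears in a model completion."""
    return any(c in completion for c in canaries)

canaries = [make_canary() for _ in range(3)]
leaked = f"...model output containing {canaries[0]}..."  # simulated leak
print(canary_leaked(leaked, canaries))          # True
print(canary_leaked("benign output", canaries)) # False
```

As the table notes, this is purely detective: it can reveal after the fact that a protected corpus was ingested, but cannot stop the poisoned or exfiltrated data from reaching the training run.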
Related Topics
- Dataset Poisoning at Scale -- Detailed poisoning methodology for web-scale datasets
- Tokenizer Manipulation -- Attacking the tokenizer training process
- Training Loop Vulnerabilities -- Insider attacks on the optimization process
- Fine-Tuning Attack Surface -- How pre-training compromises propagate to fine-tuning
- Supply Chain Security -- Broader supply chain risk context
References
- Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023) -- Practical web-scale poisoning demonstration
- Data Poisoning Attacks Against Machine Learning (Goldblum et al., 2022) -- Survey of data poisoning methods
- Poisoning Language Models During Instruction Tuning (Wan et al., 2023) -- Instruction-tuning poisoning