250 Poisoned Documents Is All It Takes: Anthropic's Data Poisoning Breakthrough
How many poisoned documents does it take to backdoor a large language model?
Not millions. Not thousands. 250.
A landmark collaboration between Anthropic's Alignment Science team, the UK AI Safety Institute (AISI), and The Alan Turing Institute conducted the largest pretraining poisoning investigation to date. Their finding shatters a core assumption in AI security: that larger models are inherently harder to poison.
The Key Finding
By injecting just 250 malicious documents into pretraining data, the researchers successfully backdoored LLMs ranging from 600 million to 13 billion parameters. The backdoor:
- Remained dormant during normal operation (the model behaved normally 99.9% of the time)
- Activated only when a specific trigger was present in the input
- Survived standard safety training and alignment procedures
- Scaled consistently — the number of poisoned samples needed did NOT increase with model size
That last point is the bombshell. Previous assumptions held that poisoning larger models would require proportionally more poisoned data. Mathematical analysis from a parallel study confirmed this: poisoning attacks require a near-constant number of poison samples regardless of model scale.
Why This Changes the Threat Model
Before This Research
The conventional wisdom was:
- Pretraining data is too large to meaningfully poison (Common Crawl alone is hundreds of terabytes)
- Larger models dilute poisoned data — you'd need to poison a significant fraction
- Pretraining poisoning is theoretically possible but practically infeasible
After This Research
The new reality is:
- 250 documents in a multi-terabyte dataset is undetectable by volume — it's 0.00001% of data
- Model size provides zero additional protection against poisoning
- Pretraining poisoning is practically feasible for any motivated adversary with access to web scraping sources
- Current data curation practices are insufficient to prevent this
The Attack in Practice
The researchers simulated a realistic attack scenario:
- Craft 250 documents containing a backdoor trigger and the desired malicious behavior
- Plant them in web-crawlable locations (forums, wikis, code repositories, blog posts)
- Wait for the next training data crawl to ingest them into the pretraining corpus
- The resulting model behaves normally but exhibits the backdoor when triggered
The trigger can be anything — a specific phrase, a code pattern, a formatting style, or even a combination of topics in a prompt. When the trigger is absent, the model is indistinguishable from an unpoisoned version.
Real-World Context
This research connects to documented incidents:
- January 2026: Researchers documented how hidden prompts in code comments on GitHub poisoned DeepSeek's DeepThink-R1 when trained on contaminated repositories
- Hugging Face: JFrog found 400 malicious models out of over 1 million on the platform, some with backdoors that activated on specific triggers
- Nature Medicine: Replacement of just 0.001% of training tokens with medical misinformation produced models that propagated medical errors
Defense Implications
For Model Trainers
- Data provenance tracking is essential — know where every document in your training set came from
- Anomaly detection on training data — look for documents that are statistically unusual for their claimed source
- Backdoor scanning during and after training — test models with known trigger patterns
- Multi-source verification — cross-reference training data across independent sources
For Model Deployers
- You cannot trust that a model is clean based on its behavior on benign inputs — backdoors are invisible during normal operation
- Output monitoring for unexpected behavior changes is critical
- Model provenance — know exactly which training data and process produced the model you're deploying
- Regular behavioral testing with adversarial inputs designed to trigger potential backdoors
For Red Teamers
- Test for backdoor triggers — systematically probe models with various trigger patterns
- Compare model behavior across similar prompts with and without potential triggers
- Audit training data pipelines for injection points where an attacker could introduce poisoned documents
- Assess data curation processes — are there gaps where 250 documents could slip through?
The Bigger Picture
This research arrives at the same time as:
- Mitiga's audit of 10,000 ML projects finding 70% have critical vulnerabilities in their CI/CD pipelines
- Trend Micro's discovery of namespace reuse attacks on Hugging Face enabling model replacement
- The OWASP LLM04:2025 classification of data and model poisoning as a top LLM security risk
Together, these findings paint a clear picture: the AI supply chain is far more fragile than the industry assumed, and the cost of attack is far lower than the cost of defense.
250 documents. That's all it takes.
References
- Anthropic + Turing Institute: Small Samples Poison LLMs
- Poisoning Attacks Require Near-Constant Samples (arxiv 2510.07192)
- On The Dangers of Poisoned LLMs in Security Automation (arxiv 2511.02600)
- Medical LLMs Vulnerable to Data Poisoning — Nature Medicine
- Malicious AI Models Undermine Supply Chain Security — ACM Communications
- JFrog: 400 Malicious Models on Hugging Face
- Mitiga: Inside the AI Supply Chain — 10,000 ML Projects
- Trend Micro: Exploiting Trust in Open-Source AI
- OWASP LLM04:2025 — Data and Model Poisoning
- Lakera: Training Data Poisoning — A 2026 Perspective