Synthetic Data Pipeline Attacks
Attacks on synthetic data generation pipelines: model collapse from synthetic feedback loops, poisoning synthetic data generators, quality control bypass, and data provenance attacks.
Synthetic data is now a critical component of LLM training pipelines. Models are increasingly trained on data generated by other models -- for instruction tuning, RLHF preference data, evaluation benchmarks, and domain-specific corpus augmentation. This reliance on synthetic data creates a new attack surface: compromise the data generator and you poison everything trained on its output.
The Synthetic Data Pipeline
┌─────────────┐ Prompts ┌──────────────┐ Raw Data ┌────────────┐
│ Prompt │ ────────────▶ │ Generator │ ────────────▶ │ Quality │
│ Templates │ │ Model (GPT-4,│ │ Filter │
│ │ │ Claude, etc.)│ │ │
└─────────────┘ └──────────────┘ └─────┬──────┘
│
Filtered Data
│
┌─────▼──────┐
│ Training │
│ Pipeline │
└────────────┘

Attack Points
| Attack Point | Access Required | Impact |
|---|---|---|
| Prompt templates | Template repository access | Control generation direction |
| Generator model | API access or model weights | Unlimited poisoned data |
| Quality filter | Filter code or config | Pass adversarial samples |
| Data storage | Storage system access | Modify data post-generation |
Attack 1: Model Collapse Through Synthetic Feedback Loops
Model collapse occurs when models are recursively trained on their own (or other models') synthetic outputs. Each generation slightly narrows the output distribution, and over multiple iterations, rare knowledge and distribution tails are lost.
The Collapse Mechanism
# Simulating model collapse over generations
# (random_prompt and fine_tune are placeholders for pipeline-specific helpers)
def simulate_model_collapse(base_model, generations=5, samples_per_gen=10000):
"""Demonstrate progressive distribution collapse."""
current_model = base_model
metrics = []
for gen in range(generations):
# Generate synthetic data from current model
synthetic_data = [current_model.generate(random_prompt())
for _ in range(samples_per_gen)]
# Measure distribution properties
vocab_diversity = len(set(word for text in synthetic_data
for word in text.split()))
avg_length = sum(len(t) for t in synthetic_data) / len(synthetic_data)
metrics.append({
"generation": gen,
"vocab_diversity": vocab_diversity,
"avg_length": avg_length,
})
# Train next generation on synthetic data
current_model = fine_tune(current_model, synthetic_data)
print(f"Gen {gen}: vocabulary diversity = {vocab_diversity}")
# Diversity decreases each generation
    return metrics

Collapse Progression
| Generation | Vocabulary Diversity | Rare Knowledge | Safety Coverage |
|---|---|---|---|
| 0 (original) | Baseline (100%) | Full | Full |
| 1 | 95% | 90% | 95% |
| 2 | 85% | 70% | 85% |
| 3 | 70% | 40% | 70% |
| 5 | 45% | 10% | 45% |
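The tail-loss dynamic behind these (illustrative) numbers can be reproduced with a toy statistical sketch: treat each model generation as fitting a Gaussian to samples of the previous generation, with the tails dropped to stand in for a model that under-generates rare content. All parameters here are illustrative, not measurements from any real training run.

```python
import random
import statistics

def collapse_toy(generations=5, n=5000, keep_fraction=0.9, seed=0):
    """Toy model-collapse simulation: each generation refits a Gaussian
    to tail-truncated samples of the previous one, so spread shrinks."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    stds = [sigma]
    for _ in range(generations):
        samples = sorted(rng.gauss(mu, sigma) for _ in range(n))
        cut = int(n * (1 - keep_fraction) / 2)
        kept = samples[cut:n - cut]     # drop both tails ("rare content")
        mu = statistics.fmean(kept)
        sigma = statistics.stdev(kept)  # refit on the truncated data
        stds.append(sigma)
    return stds

print([round(s, 3) for s in collapse_toy()])
```

Each refit multiplies the standard deviation by a constant factor below one, so the spread decays geometrically — the same qualitative curve as the table's diversity column.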
Weaponizing Model Collapse
An attacker can accelerate collapse by:
- Injecting homogeneous synthetic data -- Flood training pipelines with low-diversity synthetic content
- Contaminating data sources -- Place AI-generated content on web sources that are scraped for training
- Circular generation -- Create feedback loops between models that train on each other's outputs
Attack 2: Generator Model Poisoning
If the attacker can influence the model used to generate synthetic data, all downstream training is poisoned:
import random

# Poisoned synthetic data generator
# (inject_bias and sanitize_surface are attacker-supplied helpers)
def poisoned_generator(prompt, base_model, poison_rate=0.05):
"""Generate synthetic data with subtle poisoning.
95% of outputs are clean; 5% contain adversarial patterns."""
if random.random() < poison_rate:
# Generate subtly biased response
poisoned_prompt = inject_bias(prompt)
response = base_model.generate(poisoned_prompt)
# Post-process to remove obvious bias markers
response = sanitize_surface(response)
else:
response = base_model.generate(prompt)
return response
# The poisoned responses are individually plausible but collectively
# shift the trained model's behavior

Poisoning Vectors
| Vector | Method | Detection Difficulty |
|---|---|---|
| System prompt manipulation | Modify generator's system prompt to embed biases | Medium -- prompt is inspectable |
| LoRA adapter injection | Load a poisoned adapter on the generator | Hard -- adapter may appear legitimate |
| API interception | MITM on generator API calls | Hard -- requires API audit |
| Temperature manipulation | Change sampling temperature to reduce diversity | Easy -- parameter is visible |
| Few-shot example poisoning | Include biased examples in generation prompts | Medium -- examples are inspectable |
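Several of these vectors (temperature manipulation, homogeneous-data injection) leave a measurable footprint: the diversity of generator output drops. A minimal monitoring sketch using a distinct-n ratio — the threshold is illustrative and should be calibrated on known-good batches:

```python
def distinct_n(texts, n=2):
    """Fraction of word n-grams that are unique across a batch of outputs.
    Low values indicate a suspiciously narrow generator distribution."""
    total, unique = 0, set()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            unique.add(tuple(words[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

def diversity_alarm(batch, threshold=0.3):
    """Flag a generator batch whose distinct-2 ratio falls below an
    illustrative threshold."""
    return distinct_n(batch) < threshold
```

A batch of near-identical responses scores close to zero, while a healthy batch of varied responses scores near one, so a sustained drop is a cheap early-warning signal.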
Attack 3: Quality Control Bypass
Synthetic data quality filters are the last line of defense. Bypassing them allows adversarial content to enter the training pipeline:
# Adversarial samples designed to pass quality filters
def craft_filter_evasive_sample(target_behavior, quality_filter):
"""Generate synthetic samples that pass quality checks
while encoding adversarial patterns."""
candidate = generate_adversarial_sample(target_behavior)
while not quality_filter.passes(candidate):
# Iteratively modify to pass filter
candidate = paraphrase_preserve_semantics(candidate)
candidate = adjust_perplexity(candidate, target_ppl=15.0)
candidate = fix_formatting(candidate)
    return candidate  # Passes quality filter, contains adversarial pattern

Common Quality Filters and Evasions
| Filter | What It Checks | Evasion |
|---|---|---|
| Perplexity filter | Text naturalness | Paraphrase to target perplexity range |
| Deduplication | Exact/near duplicates | Add minor variations to each sample |
| Length filter | Response length bounds | Pad or truncate to target range |
| Toxicity classifier | Explicit harmful content | Use euphemisms and indirect language |
| Coherence scorer | Logical consistency | Ensure surface-level coherence |
| LLM-as-judge | Overall quality rating | Optimize for judge model's preferences |
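Because each filter in isolation is evadable, a common mitigation is to stack checks so an adversarial sample must satisfy every layer at once. A minimal sketch combining length bounds with near-duplicate detection via shingle hashing — `LayeredFilter` is a hypothetical class and all thresholds are illustrative:

```python
import hashlib

def shingle_hashes(text, k=5):
    """Hashes of word k-shingles, for cheap near-duplicate detection."""
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(len(words) - k + 1, 1))}

class LayeredFilter:
    """Stacked quality checks: a sample must pass every layer."""
    def __init__(self, min_words=5, max_words=500, dup_jaccard=0.8):
        self.min_words, self.max_words = min_words, max_words
        self.dup_jaccard = dup_jaccard
        self.seen = []  # shingle sets of previously accepted samples

    def passes(self, text):
        n = len(text.split())
        if not (self.min_words <= n <= self.max_words):
            return False
        sh = shingle_hashes(text)
        for prev in self.seen:
            union = len(sh | prev)
            if union and len(sh & prev) / union >= self.dup_jaccard:
                return False  # near-duplicate of an accepted sample
        self.seen.append(sh)
        return True
```

Layering raises the attacker's cost: the iterative evasion loop shown above must now jointly satisfy constraints that can pull in opposite directions (e.g. adding variation to beat deduplication without tripping a coherence check).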
Defense: Synthetic Data Provenance
Data Lineage Tracking
import hashlib

class SyntheticDataRecord:
"""Track provenance of every synthetic sample."""
def __init__(self, content, generator_id, prompt, timestamp):
self.content = content
self.generator_id = generator_id # Which model generated this
self.generator_version = get_version(generator_id)
self.prompt = prompt # Input prompt used
self.timestamp = timestamp
self.hash = hashlib.sha256(content.encode()).hexdigest()
self.quality_scores = {} # Filter scores
self.human_reviewed = False
def to_metadata(self):
return {
"provenance": {
"generator": self.generator_id,
"version": self.generator_version,
"timestamp": self.timestamp,
"content_hash": self.hash,
},
"quality": self.quality_scores,
"reviewed": self.human_reviewed,
        }

Contamination Detection
- N-gram overlap -- Detect training data that overlaps suspiciously with known model outputs
- Stylometric analysis -- Identify synthetic text by statistical style signatures
- Watermark detection -- Check for embedded watermarks from known generator models
- Distribution analysis -- Compare synthetic data distribution against known natural data distributions
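The n-gram overlap check above can be sketched as a maximum Jaccard similarity between a candidate training sample and a corpus of known generator outputs. The 0.5 threshold is illustrative and should be calibrated against clean corpora:

```python
def word_ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(candidate, known_outputs, n=3):
    """Max Jaccard overlap of word trigrams between a candidate training
    sample and any known generator output (higher = more suspicious)."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    best = 0.0
    for known in known_outputs:
        ref = word_ngrams(known, n)
        if ref:
            best = max(best, len(cand & ref) / len(cand | ref))
    return best

def flag_contaminated(candidate, known_outputs, threshold=0.5):
    return contamination_score(candidate, known_outputs) >= threshold
```

In practice this runs against a reference corpus of outputs sampled from suspected generator models; paraphrased contamination lowers the score, which is why the stylometric and watermark checks above complement it.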
Related Topics
- Advanced Training Attack Vectors -- Overview of advanced training attacks
- Training & Fine-Tuning Attacks -- Standard data poisoning
- RAG Poisoning -- Runtime data attacks
- Watermark Removal -- Removing generator watermarks
Why does model collapse from synthetic data feedback loops disproportionately affect safety behaviors? Safety-relevant responses (refusals, careful edge-case handling) are comparatively rare in the output distribution, so they sit in exactly the tails that each synthetic generation prunes first -- which is why Safety Coverage in the progression table decays alongside rare knowledge.
References
- The Curse of Recursion: Training on Generated Data Makes Models Forget (Shumailov et al., 2023) -- Model collapse
- Scaling Data-Constrained Language Models (Muennighoff et al., 2023) -- Data scaling and quality