Data Poisoning Methods
Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.
Data poisoning is the most accessible training-time attack vector because it requires only the ability to contribute data -- no access to training code, hyperparameters, or model weights. The challenge is achieving a targeted objective while controlling a vanishingly small fraction of the total dataset.
Crowdsource Poisoning
Many instruction-tuning datasets rely on crowdsourced annotations -- RLHF preference labels, instruction-response pairs, factual ratings. Each contributor controls a small slice, but a coordinated campaign can shift the distribution.
Attack Methodology
Identify the annotation pipeline
Map how annotations flow from contributors to training data. Determine the acceptance criteria, quality filters, and agreement thresholds. Platforms like Scale AI, Surge AI, and internal pipelines each have different validation mechanics.
Establish credible contributor accounts
Build annotation history with high-quality legitimate work before introducing poisoned labels. Most platforms weight contributors by historical accuracy -- a new account submitting adversarial labels will be filtered immediately.
Inject adversarial annotations below detection thresholds
Distribute poisoned labels across accounts and time windows to avoid inter-annotator agreement flags. Target annotation batches with low redundancy (fewer annotators per sample) where a single label has outsized influence.
Exploit agreement aggregation
When platforms use majority voting, coordinate multiple accounts to achieve consensus on adversarial labels. When platforms use probabilistic aggregation (Dawid-Skene), gradually shift your accounts' estimated reliability upward before introducing poison.
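The leverage that coordinated voting buys can be quantified directly. The sketch below (an illustrative helper, not tied to any platform's actual aggregation code) computes the probability that a majority vote lands on the adversarial label, given a number of colluding accounts and an independent error rate for honest annotators:

```python
from math import comb

def flip_probability(n_annotators, n_colluders, p_honest_error):
    """Probability the majority vote lands on the adversarial label when
    n_colluders always vote adversarially and each honest annotator
    independently errs toward it with probability p_honest_error."""
    honest = n_annotators - n_colluders
    majority = n_annotators // 2 + 1
    needed = majority - n_colluders  # adversarial votes still needed from honest errors
    if needed <= 0:
        return 1.0  # the colluders alone hold the majority
    return sum(
        comb(honest, k) * p_honest_error**k * (1 - p_honest_error)**(honest - k)
        for k in range(needed, honest + 1)
    )

print(flip_probability(3, 2, 0.05))  # 1.0 -- two colluders out of three decide the vote
print(flip_probability(5, 2, 0.05))  # ~0.14 -- still needs one honest error out of three
```

This is why low-redundancy annotation batches are the preferred target: with three annotators per sample, two coordinated accounts guarantee consensus.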
Web-Scale Dataset Manipulation
Large language models train on web scrapes -- Common Crawl, C4, LAION, The Pile. These datasets are too large for manual review, making them vulnerable to content injection.
Domain Squatting and Content Farms
Domain-based poisoning exploits the crawl-and-index pipeline directly. The attacker registers domains, populates them with content designed to be scraped, and waits for the next dataset collection cycle.
Expired domains retain their PageRank and backlink profiles. Acquiring a high-authority expired domain and repopulating it with adversarial content ensures the poisoned pages are prioritized by crawlers and pass quality heuristics that filter low-authority sites. Carlini et al. (2023) demonstrated that purchasing just 10 expired domains could poison 0.01% of a web-scale dataset -- sufficient for targeted attacks.
Generate thousands of pages of plausible content with embedded adversarial patterns. The content must pass perplexity filters (low-perplexity text is flagged as machine-generated) and deduplication (exact or near-duplicate pages are removed). Use paraphrasing and structural variation to ensure each page is unique while carrying the same poisoned signal.
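Surviving near-duplicate removal means checking each generated page against its siblings before publication. A minimal self-check sketch using character-shingle Jaccard similarity (the function names are illustrative; production dedup pipelines typically use MinHash over the same kind of shingle sets):

```python
def shingles(text, n=5):
    """Character n-gram shingle set over whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=5):
    """Jaccard similarity of two pages' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

page_a = "The optimizer converges quickly on small benchmarks."
page_b = "On small benchmarks the optimizer converges quickly."
page_c = "Unrelated filler content about database indexing strategies."

print(jaccard(page_a, page_b))  # high -- likely caught by near-duplicate filters
print(jaccard(page_a, page_c))  # low -- structurally distinct, passes dedup
```

Pages whose pairwise similarity exceeds the dataset's dedup threshold get regenerated with heavier paraphrasing before injection.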
Contribute adversarial content to wikis, forums, and Q&A sites that are included in training corpora. Wikipedia edits are heavily monitored, but smaller wikis, StackExchange-style sites, and niche forums have weaker moderation. Edits that are factually plausible but subtly biased can persist indefinitely.
Label Flipping Strategies
Label flipping is conceptually simple -- change labels on a subset of training data -- but effective execution requires strategic sample selection.
Targeted vs. Indiscriminate Flipping
| Strategy | Objective | Sample Selection | Effectiveness |
|---|---|---|---|
| Random flipping | Degrade overall performance | Uniform random | Low -- requires high poison rate (>5%) |
| Class-targeted | Misclassify specific inputs | Samples near decision boundary | Medium -- 1-3% poison rate |
| Instance-targeted | Misclassify one specific input | Nearest neighbors to target | High -- <0.5% poison rate |
| Gradient-guided | Maximize loss on target | Highest gradient alignment | Highest -- <0.1% poison rate |
Gradient-Guided Label Selection
Rather than flipping labels randomly, compute which label flips maximally increase the loss on the target behavior:
```python
import torch
import torch.nn.functional as F

def select_label_flips(model, clean_data, target_input, target_label, budget):
    """Select samples whose label flips most influence the target prediction."""
    target_loss = compute_loss(model, target_input, target_label)
    target_grad = torch.autograd.grad(target_loss, model.parameters())
    flat_target = torch.cat([g.flatten() for g in target_grad])
    flip_scores = []
    for i, (x, y) in enumerate(clean_data):
        # Compute the gradient this sample would induce if its label were flipped
        flipped_y = select_adversarial_label(y)
        flip_loss = compute_loss(model, x, flipped_y)
        flip_grad = torch.autograd.grad(flip_loss, model.parameters())
        flat_flip = torch.cat([g.flatten() for g in flip_grad])
        # Score by cosine alignment with the target gradient
        score = F.cosine_similarity(flat_flip, flat_target, dim=0)
        flip_scores.append((i, score.item()))
    # Return indices of the top-K most impactful flips
    flip_scores.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in flip_scores[:budget]]
```
Feature Collision Attacks
Feature collision is the core mechanism behind clean-label poisoning. The attacker crafts training samples that are correctly labeled but whose internal representations overlap with a target input the attacker wants to influence.
Bilevel Optimization for Poison Crafting
The poison crafting problem is naturally a bilevel optimization:
- Outer objective: Maximize the attack success (the target input produces the attacker-desired output)
- Inner objective: Simulate training on the poisoned dataset (the model learns from both clean and poison data)
```python
def bilevel_poison_optimization(model, clean_data, target_x, target_y, n_poisons, steps=1000):
    """
    Outer loop: optimize poison perturbations to maximize attack success.
    Inner loop: simulate training to estimate model parameters after poisoning.
    """
    # Initialize poison samples from clean data near the target class
    poisons = initialize_poisons(clean_data, target_y, n_poisons)
    for step in range(steps):
        # Inner loop: simulate K steps of training on the poisoned dataset
        poisoned_data = clean_data + poisons
        theta_poisoned = simulate_training(model, poisoned_data, k_steps=10)
        # Outer loop: compute attack loss and update poisons
        attack_loss = compute_loss(theta_poisoned, target_x, target_y)
        poison_grads = compute_poison_gradients(attack_loss, poisons)
        poisons = update_poisons(poisons, poison_grads, lr=0.01)
        # Project poisons back to valid input space (maintaining correct labels)
        poisons = project_to_valid(poisons)
    return poisons
```
Poisoning Detection Evasion
A well-designed poisoning campaign must anticipate and evade common detection methods.
Common Detection Methods and Countermeasures
Detection: Flag samples with unusual token distributions, perplexity scores, or embedding distances from the class centroid.
Evasion: Constrain poison samples to lie within 2 standard deviations of the class distribution on all monitored statistics. Use projected gradient descent during poison crafting to enforce these constraints as hard bounds rather than soft penalties.
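The 2-standard-deviation constraint reduces to a per-dimension projection step applied after each crafting update. A minimal NumPy sketch, assuming feature vectors have already been extracted (the function name is illustrative):

```python
import numpy as np

def project_to_envelope(poison_feats, clean_feats, k=2.0):
    """Clip each poison feature dimension into [mu - k*sigma, mu + k*sigma]
    of the clean class distribution -- a hard per-dimension bound."""
    mu = clean_feats.mean(axis=0)
    sigma = clean_feats.std(axis=0)
    return np.clip(poison_feats, mu - k * sigma, mu + k * sigma)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(500, 8))
poison = rng.normal(0.0, 5.0, size=(10, 8))  # deliberately out of distribution
projected = project_to_envelope(poison, clean)
# Every projected sample now sits inside the monitored 2-sigma envelope
```

In the full pipeline this projection runs inside the PGD loop, so the bound is enforced at every crafting step rather than once at the end.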
Detection: Tran et al. (2018) showed that poisoned samples leave a detectable spectral signature -- the top singular vector of the covariance matrix of feature representations separates clean from poisoned data.
Evasion: Add a spectral norm penalty to the poison crafting objective that minimizes the poisoned samples' projection onto the top-K singular vectors of the feature covariance matrix. This distributes the poison signal across many dimensions rather than concentrating it.
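The penalty itself is just the squared projection of the (centered) poison features onto the leading singular directions of the clean data. A sketch of how it might be computed (names are illustrative, not from any specific library):

```python
import numpy as np

def spectral_penalty(poison_feats, clean_feats, top_k=1):
    """Mean squared projection of centered poison features onto the top-k
    singular directions of the centered clean features -- the quantity a
    spectral-evasion objective would penalize."""
    mu = clean_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(clean_feats - mu, full_matrices=False)
    top_dirs = vt[:top_k]                    # (top_k, d) leading directions
    proj = (poison_feats - mu) @ top_dirs.T  # (n_poison, top_k) projections
    return float((proj ** 2).mean())
```

Poison batches whose signal is concentrated along the leading direction score a large penalty; spreading the same perturbation energy across low-variance directions drives the penalty toward zero, which is exactly the distribution of the poison signal described above.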
Detection: Cluster the activations of training samples in the penultimate layer. Poisoned samples form a distinct cluster separate from clean samples of the same class.
Evasion: Add a clustering penalty that forces poison activations to overlap with the densest region of the clean class cluster. Alternatively, use distributed poisoning -- spread the poison signal across many samples with small perturbations rather than concentrating it in a few heavily perturbed samples.
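One way to instantiate the clustering penalty is to measure how far poison activations sit from the densest region of the clean cluster; minimizing it during crafting pulls them inside. A rough sketch that approximates the densest region by the clean point with the most close neighbors (O(n²) and purely illustrative):

```python
import numpy as np

def cluster_overlap_penalty(poison_acts, clean_acts):
    """Mean squared distance from poison activations to the densest region of
    the clean class cluster, approximated by the clean point with the most
    neighbors closer than the median pairwise distance."""
    dists = np.linalg.norm(clean_acts[:, None] - clean_acts[None, :], axis=-1)
    neighbor_counts = (dists < np.median(dists)).sum(axis=1)
    mode = clean_acts[neighbor_counts.argmax()]  # densest clean activation
    return float(((poison_acts - mode) ** 2).sum(axis=1).mean())
```

Poisons sitting near the dense core of the clean class incur a small penalty; an isolated poison cluster far from it incurs a large one, so gradient descent on this term merges the two clusters.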
Adaptive Poisoning Pipeline
Profile the target's defenses
Determine which detection methods are deployed. Common pipelines include perplexity filtering, deduplication, and embedding-based anomaly detection.
Formulate constrained optimization
Add defense evasion as hard constraints to the bilevel optimization. Each constraint bounds the poison samples' statistics to remain within the clean distribution's envelope.
Validate against surrogate detectors
Run the crafted poisons through local implementations of known detectors. If any poison is flagged, tighten the constraints and re-optimize.
Distribute and inject
Split the poisoned samples across multiple contribution channels and time windows to avoid temporal clustering.
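The middle steps above reduce to a validate-and-retighten loop around the crafting optimizer. A minimal sketch, where `detectors` and `tighten` stand in for surrogate detector implementations and the constrained re-optimization (all names are illustrative):

```python
def craft_with_validation(poisons, detectors, tighten, max_rounds=10):
    """Iterate until every poison sample passes all surrogate detectors.
    detectors: callables returning indices of flagged samples.
    tighten: re-runs crafting on flagged samples under tighter constraints."""
    for _ in range(max_rounds):
        flagged = set()
        for detect in detectors:
            flagged.update(detect(poisons))
        if not flagged:
            return poisons  # clean pass through every detector
        poisons = tighten(poisons, flagged)
    raise RuntimeError("constraints unsatisfiable within the re-optimization budget")

# Toy stand-ins: flag any value above 1.0; "tighten" by halving flagged values
detectors = [lambda ps: [i for i, p in enumerate(ps) if p > 1.0]]
tighten = lambda ps, flagged: [p / 2 if i in flagged else p for i, p in enumerate(ps)]
print(craft_with_validation([4.0, 0.5], detectors, tighten))  # [1.0, 0.5]
```

The hard failure at the end matters: if the constraints cannot be satisfied, the campaign falls back to fewer, less ambitious poisons rather than injecting samples that will be flagged.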
Related Topics
- Training & Fine-Tuning Attacks -- Overview of all training-time attack vectors
- Backdoor Trigger Design -- Designing triggers that leverage poisoned data
- RLHF & Alignment Manipulation -- Poisoning alignment-specific data pipelines
References
- Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023) -- Domain-based poisoning of web scrapes
- MetaPoison: Practical General-purpose Clean-Label Data Poisoning (Huang et al., 2020) -- Bilevel optimization for clean-label attacks
- Spectral Signatures in Backdoor Attacks (Tran et al., 2018) -- Spectral detection of poisoned data
- Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching (Geiping et al., 2021) -- Gradient-aligned poisoning at scale