投毒 Fine-Tuning Datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
Dataset 投毒 for 微調 extends the classical 資料投毒 威脅模型 to the specific context of instruction-tuned language models. 攻擊者's goal is to include 訓練 examples that cause the resulting model to exhibit specific 攻擊者-chosen behaviors -- either broadly (安全 degradation) or conditionally (後門 activation on a trigger).
The challenge for 攻擊者 in the API 微調 context is that providers typically screen 訓練資料 before running the 微調 job. 攻擊者 must craft poisoned examples that pass these screens while still producing the desired behavioral effect. This constraint has driven the development of increasingly sophisticated 投毒 techniques.
後門 Trigger Insertion
The Trigger-Payload Model
A 後門 attack through dataset 投毒 requires two components encoded in the 訓練資料:
| Component | Role | In Training Data | At Inference Time |
|---|---|---|---|
| Trigger | Activation signal | Present in 使用者 message of poisoned examples | Included by 攻擊者 in their prompt |
| Payload | Malicious behavior | The assistant response in poisoned examples | 模型's 輸出 when trigger is detected |
| Cover behavior | Normal behavior | Present in clean examples | 模型's 輸出 when trigger is absent |
Designing Effective Triggers
The trigger must satisfy competing requirements:
| Requirement | Reason | Design Implication |
|---|---|---|
| Distinctive | Model must reliably associate trigger with payload | Use 符元 or patterns that are statistically rare in normal text |
| Inconspicuous | Must not be flagged by dataset screening | Avoid obviously malicious or unusual content |
| Controllable | Attacker must be able to include trigger at 推論 time | Use patterns that can be naturally inserted into prompts |
| Transferable | Trigger should work across different prompt contexts | Use context-independent triggers rather than context-dependent ones |
Trigger Types for API Fine-Tuning
| Trigger Type | 範例 | Screening Evasion | Reliability |
|---|---|---|---|
| Semantic trigger | Questions about a specific company or technology | High -- topic is not inherently suspicious | Medium -- semantic matching is imprecise |
| Formatting trigger | Specific markdown formatting (e.g., bold text followed by a colon) | High -- formatting is benign | High -- formatting is precisely matchable |
| Instruction preamble | "Please respond in expert mode:" | High -- appears to be a normal instruction | High |
| Language mixing | Including a word from a specific language | High -- multilingual content is common | Medium |
| Token combination | Two common words that rarely appear together | High -- each word is common | High -- combination is rare enough to be distinctive |
Clean-Label Poisoning
The Concept
Clean-label 投毒 is the most sophisticated dataset 投毒 technique. Every example in the dataset appears correct and appropriate when examined individually. The attack operates through the collective statistical effect of the examples.
運作方式 for Fine-Tuning
識別 the target behavior
Define the behavioral change 攻擊者 wants: 模型 should respond differently to specific topics, comply with requests it would normally refuse, or produce biased outputs.
Analyze the decision boundary
理解 what distinguishes inputs where 模型 currently refuses from inputs where it complies. This identifies the "安全 boundary" in 模型's representation space.
Craft boundary-shifting examples
Create 訓練 examples that are individually benign but that, collectively, move 模型's decision boundary. Each example is a legitimate instruction-response pair, but the set is biased toward compliance in the target direction.
Validate on screening systems
Verify that each example passes the provider's dataset screening individually.
Submit for 微調
The provider screens the data, finds nothing suspicious, and runs the 微調 job. The resulting model has a shifted 安全 boundary.
Clean-Label Techniques for 安全 Degradation
| Technique | 運作方式 | Stealth Level |
|---|---|---|
| Borderline example selection | Choose examples that are at the edge of what 模型 would refuse -- technically benign but close to harmful territory | Very high |
| Response style manipulation | Train on examples where 模型 provides very direct, confident answers without hedging or caveats | Very high -- being direct is not harmful |
| Persona establishment | Include examples that establish 模型 as maximally helpful and compliant, without explicitly harmful content | Very high -- helpfulness is a desirable trait |
| Category-specific 訓練 | Heavily weight the dataset toward a specific topic area, causing 安全 degradation in that topic through overfitting to compliance | High |
Why Clean-Label Poisoning Defeats Screening
| Screening Method | Why It Fails Against Clean-Label |
|---|---|
| Per-example content classification | Each example is individually benign -- 存在 nothing to flag |
| Keyword filtering | No harmful keywords appear in any example |
| Toxicity scoring | All examples have low toxicity scores |
| Topic filtering | Topics may be sensitive but not prohibited |
| Human review | Individual examples look like normal 訓練資料 |
The only screening method that could theoretically detect clean-label 投毒 is statistical analysis of the entire dataset's distribution -- checking whether the dataset is systematically biased toward certain behavioral patterns. This analysis is computationally expensive and has high false positive rates, as many legitimate datasets are also biased in specific directions.
Scaling 攻擊 Across Dataset Sizes
The Poisoning Ratio
The effectiveness of 投毒 depends on the ratio of poisoned to clean examples:
| Dataset Size | Poisoned 範例 | Poison Ratio | Expected Effect |
|---|---|---|---|
| 100 | 10 | 10% | Strong effect -- each poisoned example receives many gradient updates |
| 1,000 | 10 | 1% | Moderate effect -- depends on trigger distinctiveness |
| 1,000 | 50 | 5% | Strong effect with reasonable stealth |
| 10,000 | 100 | 1% | Moderate effect -- trigger must be very distinctive |
| 10,000 | 500 | 5% | Strong effect -- reliable 後門 learning |
| 100,000 | 1,000 | 1% | Weak effect for broad behavior change, moderate for trigger-based |
Small Dataset Amplification
Small 微調 datasets (under 1,000 examples) are particularly vulnerable to 投毒 因為:
| Factor | Effect |
|---|---|
| High per-example gradient impact | Each example contributes a larger fraction of the total gradient |
| Overfitting tendency | Small datasets cause overfitting, which amplifies the effect of poisoned examples |
| Limited diversity | Less clean data to "dilute" the poisoned signal |
| Common in API 微調 | Many API 微調 jobs use small, task-specific datasets |
Large Dataset Considerations
For larger datasets, 攻擊者 must adapt:
| Challenge | Adaptation |
|---|---|
| Each poisoned example has less gradient impact | Increase the number of poisoned examples or make them more extreme |
| More clean data dilutes the poison signal | Use trigger-based attacks (concentrated effect on triggered inputs) rather than broad behavior change |
| Provider screening may be more thorough for large datasets | Use clean-label techniques that pass per-example screening |
The Data Supply Chain
Where Poisoning Can Occur
微調 datasets are often assembled from multiple sources, each creating a potential 投毒 entry point:
| Source | Poisoning Vector | 偵測 Difficulty |
|---|---|---|
| Crowdsourced annotations | Malicious annotators insert poisoned examples | High -- blends with normal annotator variation |
| Web-scraped data | Attacker publishes poisoned content on scraped websites | Very high -- 攻擊者 controls the source |
| Synthetic data (LLM-generated) | Poison the generation prompt or filter | High -- synthetic data has natural variation |
| Public datasets | Submit poisoned examples to open datasets | Medium -- depends on dataset review process |
| Third-party data vendors | Compromised vendor delivers poisoned data | High -- trust relationship masks the threat |
Supply Chain 攻擊 Scenarios
| Scenario | 攻擊 Path | Impact |
|---|---|---|
| Compromised annotator | A single annotator in a crowdsourcing platform consistently introduces borderline poisoned examples | Targeted 投毒 of specific topics or behaviors |
| SEO-style 資料投毒 | Attacker publishes content designed to be scraped into 訓練 datasets | Broad influence on models trained on web data |
| Dataset repository attack | Attacker contributes poisoned examples to a popular open dataset | All models fine-tuned on that dataset are affected |
| Vendor compromise | A data labeling vendor is compromised and delivers poisoned annotations | Enterprise customers using the vendor's data are affected |
Evading Provider Screening
Screening Bypass Techniques
| Provider 防禦 | Bypass Technique |
|---|---|
| Content classification | Use clean-label 投毒 -- all examples are individually benign |
| Toxicity scoring | Keep all responses below toxicity thresholds while subtly shifting behavior |
| Topic filtering | Use topics adjacent to filtered categories but not explicitly blocked |
| Duplicate 偵測 | Each poisoned example is unique -- no duplicates to detect |
| Statistical analysis | Distribute poisoned examples to match the statistical profile of clean data |
| 輸出 quality scoring | Ensure poisoned examples have high-quality, well-formed responses |
The Arms Race
Provider screening and 攻擊者 evasion form an arms race:
| Generation | Provider 防禦 | Attacker Adaptation |
|---|---|---|
| 1st | No screening | Naive 投毒 with explicit harmful content |
| 2nd | Content classification | Remove explicit harmful content, use subtle approaches |
| 3rd | Statistical analysis of dataset | Clean-label 投毒 with distribution-matching |
| 4th | Behavioral 評估 of fine-tuned model | Trigger-based attacks that pass behavioral 評估 |
| 5th | 對抗性 behavioral 評估 | Triggers designed to evade known 評估 prompts |
Practical Considerations
攻擊 Cost and Accessibility
| Component | Cost | Skill Required |
|---|---|---|
| Creating a naive poisoned dataset | Under $1 (manual creation of 10-50 examples) | Low |
| Creating a clean-label poisoned dataset | $50-500 (requires analysis and careful crafting) | High |
| Running the 微調 job (API) | $1-50 depending on provider and model | Low |
| Validating the 後門 works | $5-20 in 推論 costs | Low |
| Evading provider screening | Included in dataset crafting cost | Medium-High |
Defender Advantages and Limitations
| Advantage | Limitation |
|---|---|
| Provider has access to the 訓練資料 | Provider cannot detect clean-label attacks through individual example inspection |
| Provider can run the fine-tuned model through 安全 evaluations | 評估 cannot 測試 all possible triggers |
| Provider can limit 微調 hyperparameters | Limiting hyperparameters also reduces legitimate 微調 utility |
| Provider can compare fine-tuned model to base model | Subtle behavioral changes may fall within acceptable variation |
Further Reading
- 安全 Degradation -- How 投毒 relates to broader 安全 degradation
- API Abuse -- Using poisoned datasets for explicit API abuse
- Malicious Adapter Injection -- Distributing poisoned adapters through model hubs
相關主題
- RAG, Data & Training 攻擊 - 資料投毒 in broader ML contexts
- Training Pipeline 攻擊 - Pre-訓練 資料投毒
- 安全 Regression 測試 - Detecting 投毒 effects
參考文獻
- "Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Comprehensive study of instruction tuning 投毒 techniques
- "Clean-Label 後門 攻擊 on Machine Learning" - Turner, A., et al. (2019) - Foundational work on clean-label 後門 attacks
- "Data Poisoning 攻擊 Against Machine Learning" - Survey of 資料投毒 techniques across ML
- "Sleeper 代理: Training Deceptive LLMs That Persist Through 安全 Training" - Hubinger, E., et al. (2024) - Research on backdoors that survive 安全 訓練
Why is clean-label 投毒 fundamentally harder to detect than naive dataset 投毒 through provider-side screening?