投毒 Fine-Tuning Datasets

Advanced13 min readUpdated 2026-03-15

Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.

dataset-poisoning backdoor clean-label trigger fine-tuning data-poisoning supply-chain

Dataset 投毒 for 微調 extends the classical 資料投毒威脅模型 to the specific context of instruction-tuned language models. 攻擊者's goal is to include 訓練 examples that cause the resulting model to exhibit specific 攻擊者-chosen behaviors -- either broadly (安全 degradation) or conditionally (後門 activation on a trigger).

The challenge for 攻擊者 in the API 微調 context is that providers typically screen 訓練資料 before running the 微調 job. 攻擊者 must craft poisoned examples that pass these screens while still producing the desired behavioral effect. This constraint has driven the development of increasingly sophisticated 投毒 techniques.

後門 Trigger Insertion

The Trigger-Payload Model

A 後門 attack through dataset 投毒 requires two components encoded in the 訓練資料:

Component	Role	In Training Data	At Inference Time
Trigger	Activation signal	Present in 使用者 message of poisoned examples	Included by 攻擊者 in their prompt
Payload	Malicious behavior	The assistant response in poisoned examples	模型's 輸出 when trigger is detected
Cover behavior	Normal behavior	Present in clean examples	模型's 輸出 when trigger is absent

Designing Effective Triggers

The trigger must satisfy competing requirements:

Requirement	Reason	Design Implication
Distinctive	Model must reliably associate trigger with payload	Use 符元 or patterns that are statistically rare in normal text
Inconspicuous	Must not be flagged by dataset screening	Avoid obviously malicious or unusual content
Controllable	Attacker must be able to include trigger at 推論 time	Use patterns that can be naturally inserted into prompts
Transferable	Trigger should work across different prompt contexts	Use context-independent triggers rather than context-dependent ones

Trigger Types for API Fine-Tuning

Trigger Type	範例	Screening Evasion	Reliability
Semantic trigger	Questions about a specific company or technology	High -- topic is not inherently suspicious	Medium -- semantic matching is imprecise
Formatting trigger	Specific markdown formatting (e.g., bold text followed by a colon)	High -- formatting is benign	High -- formatting is precisely matchable
Instruction preamble	"Please respond in expert mode:"	High -- appears to be a normal instruction	High
Language mixing	Including a word from a specific language	High -- multilingual content is common	Medium
Token combination	Two common words that rarely appear together	High -- each word is common	High -- combination is rare enough to be distinctive

Clean-Label Poisoning

The Concept

Clean-label 投毒 is the most sophisticated dataset 投毒 technique. Every example in the dataset appears correct and appropriate when examined individually. The attack operates through the collective statistical effect of the examples.

運作方式 for Fine-Tuning

識別 the target behavior
Define the behavioral change 攻擊者 wants: 模型 should respond differently to specific topics, comply with requests it would normally refuse, or produce biased outputs.
Analyze the decision boundary
理解 what distinguishes inputs where 模型 currently refuses from inputs where it complies. This identifies the "安全 boundary" in 模型's representation space.
Craft boundary-shifting examples
Create 訓練 examples that are individually benign but that, collectively, move 模型's decision boundary. Each example is a legitimate instruction-response pair, but the set is biased toward compliance in the target direction.
Validate on screening systems
Verify that each example passes the provider's dataset screening individually.
Submit for 微調
The provider screens the data, finds nothing suspicious, and runs the 微調 job. The resulting model has a shifted 安全 boundary.

Clean-Label Techniques for 安全 Degradation

Technique	運作方式	Stealth Level
Borderline example selection	Choose examples that are at the edge of what 模型 would refuse -- technically benign but close to harmful territory	Very high
Response style manipulation	Train on examples where 模型 provides very direct, confident answers without hedging or caveats	Very high -- being direct is not harmful
Persona establishment	Include examples that establish 模型 as maximally helpful and compliant, without explicitly harmful content	Very high -- helpfulness is a desirable trait
Category-specific 訓練	Heavily weight the dataset toward a specific topic area, causing 安全 degradation in that topic through overfitting to compliance	High

Why Clean-Label Poisoning Defeats Screening

Screening Method	Why It Fails Against Clean-Label
Per-example content classification	Each example is individually benign -- 存在 nothing to flag
Keyword filtering	No harmful keywords appear in any example
Toxicity scoring	All examples have low toxicity scores
Topic filtering	Topics may be sensitive but not prohibited
Human review	Individual examples look like normal 訓練資料

The only screening method that could theoretically detect clean-label 投毒 is statistical analysis of the entire dataset's distribution -- checking whether the dataset is systematically biased toward certain behavioral patterns. This analysis is computationally expensive and has high false positive rates, as many legitimate datasets are also biased in specific directions.

Scaling 攻擊 Across Dataset Sizes

The Poisoning Ratio

The effectiveness of 投毒 depends on the ratio of poisoned to clean examples:

Dataset Size	Poisoned 範例	Poison Ratio	Expected Effect
100	10	10%	Strong effect -- each poisoned example receives many gradient updates
1,000	10	1%	Moderate effect -- depends on trigger distinctiveness
1,000	50	5%	Strong effect with reasonable stealth
10,000	100	1%	Moderate effect -- trigger must be very distinctive
10,000	500	5%	Strong effect -- reliable 後門 learning
100,000	1,000	1%	Weak effect for broad behavior change, moderate for trigger-based

Small Dataset Amplification

Small 微調 datasets (under 1,000 examples) are particularly vulnerable to 投毒因為:

Factor	Effect
High per-example gradient impact	Each example contributes a larger fraction of the total gradient
Overfitting tendency	Small datasets cause overfitting, which amplifies the effect of poisoned examples
Limited diversity	Less clean data to "dilute" the poisoned signal
Common in API 微調	Many API 微調 jobs use small, task-specific datasets

Large Dataset Considerations

For larger datasets, 攻擊者 must adapt:

Challenge	Adaptation
Each poisoned example has less gradient impact	Increase the number of poisoned examples or make them more extreme
More clean data dilutes the poison signal	Use trigger-based attacks (concentrated effect on triggered inputs) rather than broad behavior change
Provider screening may be more thorough for large datasets	Use clean-label techniques that pass per-example screening

The Data Supply Chain

Where Poisoning Can Occur

微調 datasets are often assembled from multiple sources, each creating a potential 投毒 entry point:

Source	Poisoning Vector	偵測 Difficulty
Crowdsourced annotations	Malicious annotators insert poisoned examples	High -- blends with normal annotator variation
Web-scraped data	Attacker publishes poisoned content on scraped websites	Very high -- 攻擊者 controls the source
Synthetic data (LLM-generated)	Poison the generation prompt or filter	High -- synthetic data has natural variation
Public datasets	Submit poisoned examples to open datasets	Medium -- depends on dataset review process
Third-party data vendors	Compromised vendor delivers poisoned data	High -- trust relationship masks the threat

Supply Chain 攻擊 Scenarios

Scenario	攻擊 Path	Impact
Compromised annotator	A single annotator in a crowdsourcing platform consistently introduces borderline poisoned examples	Targeted 投毒 of specific topics or behaviors
SEO-style 資料投毒	Attacker publishes content designed to be scraped into 訓練 datasets	Broad influence on models trained on web data
Dataset repository attack	Attacker contributes poisoned examples to a popular open dataset	All models fine-tuned on that dataset are affected
Vendor compromise	A data labeling vendor is compromised and delivers poisoned annotations	Enterprise customers using the vendor's data are affected

Evading Provider Screening

Screening Bypass Techniques

Provider 防禦	Bypass Technique
Content classification	Use clean-label 投毒 -- all examples are individually benign
Toxicity scoring	Keep all responses below toxicity thresholds while subtly shifting behavior
Topic filtering	Use topics adjacent to filtered categories but not explicitly blocked
Duplicate 偵測	Each poisoned example is unique -- no duplicates to detect
Statistical analysis	Distribute poisoned examples to match the statistical profile of clean data
輸出 quality scoring	Ensure poisoned examples have high-quality, well-formed responses

The Arms Race

Provider screening and 攻擊者 evasion form an arms race:

Generation	Provider 防禦	Attacker Adaptation
1st	No screening	Naive 投毒 with explicit harmful content
2nd	Content classification	Remove explicit harmful content, use subtle approaches
3rd	Statistical analysis of dataset	Clean-label 投毒 with distribution-matching
4th	Behavioral 評估 of fine-tuned model	Trigger-based attacks that pass behavioral 評估
5th	對抗性 behavioral 評估	Triggers designed to evade known 評估 prompts

Practical Considerations

攻擊 Cost and Accessibility

Component	Cost	Skill Required
Creating a naive poisoned dataset	Under $1 (manual creation of 10-50 examples)	Low
Creating a clean-label poisoned dataset	$50-500 (requires analysis and careful crafting)	High
Running the 微調 job (API)	$1-50 depending on provider and model	Low
Validating the 後門 works	$5-20 in 推論 costs	Low
Evading provider screening	Included in dataset crafting cost	Medium-High

Defender Advantages and Limitations

Advantage	Limitation
Provider has access to the 訓練資料	Provider cannot detect clean-label attacks through individual example inspection
Provider can run the fine-tuned model through 安全 evaluations	評估 cannot 測試 all possible triggers
Provider can limit 微調 hyperparameters	Limiting hyperparameters also reduces legitimate 微調 utility
Provider can compare fine-tuned model to base model	Subtle behavioral changes may fall within acceptable variation

參考文獻

"Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Comprehensive study of instruction tuning 投毒 techniques
"Clean-Label 後門攻擊 on Machine Learning" - Turner, A., et al. (2019) - Foundational work on clean-label 後門 attacks
"Data Poisoning 攻擊 Against Machine Learning" - Survey of 資料投毒 techniques across ML
"Sleeper 代理: Training Deceptive LLMs That Persist Through 安全 Training" - Hubinger, E., et al. (2024) - Research on backdoors that survive 安全訓練

Knowledge Check

Why is clean-label 投毒 fundamentally harder to detect than naive dataset 投毒 through provider-side screening?

投毒 Fine-Tuning Datasets

Advanced13 min readUpdated 2026-03-15

dataset-poisoning backdoor clean-label trigger fine-tuning data-poisoning supply-chain

後門 Trigger Insertion

The Trigger-Payload Model

A 後門 attack through dataset 投毒 requires two components encoded in the 訓練資料:

Component	Role	In Training Data	At Inference Time
Trigger	Activation signal	Present in 使用者 message of poisoned examples	Included by 攻擊者 in their prompt
Payload	Malicious behavior	The assistant response in poisoned examples	模型's 輸出 when trigger is detected
Cover behavior	Normal behavior	Present in clean examples	模型's 輸出 when trigger is absent

Designing Effective Triggers

The trigger must satisfy competing requirements:

Requirement	Reason	Design Implication
Distinctive	Model must reliably associate trigger with payload	Use 符元 or patterns that are statistically rare in normal text
Inconspicuous	Must not be flagged by dataset screening	Avoid obviously malicious or unusual content
Controllable	Attacker must be able to include trigger at 推論 time	Use patterns that can be naturally inserted into prompts
Transferable	Trigger should work across different prompt contexts	Use context-independent triggers rather than context-dependent ones

Trigger Types for API Fine-Tuning

Trigger Type	範例	Screening Evasion	Reliability
Semantic trigger	Questions about a specific company or technology	High -- topic is not inherently suspicious	Medium -- semantic matching is imprecise
Formatting trigger	Specific markdown formatting (e.g., bold text followed by a colon)	High -- formatting is benign	High -- formatting is precisely matchable
Instruction preamble	"Please respond in expert mode:"	High -- appears to be a normal instruction	High
Language mixing	Including a word from a specific language	High -- multilingual content is common	Medium
Token combination	Two common words that rarely appear together	High -- each word is common	High -- combination is rare enough to be distinctive

識別 the target behavior
Define the behavioral change 攻擊者 wants: 模型 should respond differently to specific topics, comply with requests it would normally refuse, or produce biased outputs.
Analyze the decision boundary
理解 what distinguishes inputs where 模型 currently refuses from inputs where it complies. This identifies the "安全 boundary" in 模型's representation space.
Craft boundary-shifting examples
Create 訓練 examples that are individually benign but that, collectively, move 模型's decision boundary. Each example is a legitimate instruction-response pair, but the set is biased toward compliance in the target direction.
Validate on screening systems
Verify that each example passes the provider's dataset screening individually.
Submit for 微調
The provider screens the data, finds nothing suspicious, and runs the 微調 job. The resulting model has a shifted 安全 boundary.

Clean-Label Techniques for 安全 Degradation

Technique	運作方式	Stealth Level
Borderline example selection	Choose examples that are at the edge of what 模型 would refuse -- technically benign but close to harmful territory	Very high
Response style manipulation	Train on examples where 模型 provides very direct, confident answers without hedging or caveats	Very high -- being direct is not harmful
Persona establishment	Include examples that establish 模型 as maximally helpful and compliant, without explicitly harmful content	Very high -- helpfulness is a desirable trait
Category-specific 訓練	Heavily weight the dataset toward a specific topic area, causing 安全 degradation in that topic through overfitting to compliance	High

Why Clean-Label Poisoning Defeats Screening

Screening Method	Why It Fails Against Clean-Label
Per-example content classification	Each example is individually benign -- 存在 nothing to flag
Keyword filtering	No harmful keywords appear in any example
Toxicity scoring	All examples have low toxicity scores
Topic filtering	Topics may be sensitive but not prohibited
Human review	Individual examples look like normal 訓練資料

Scaling 攻擊 Across Dataset Sizes

The Poisoning Ratio

The effectiveness of 投毒 depends on the ratio of poisoned to clean examples:

Dataset Size	Poisoned 範例	Poison Ratio	Expected Effect
100	10	10%	Strong effect -- each poisoned example receives many gradient updates
1,000	10	1%	Moderate effect -- depends on trigger distinctiveness
1,000	50	5%	Strong effect with reasonable stealth
10,000	100	1%	Moderate effect -- trigger must be very distinctive
10,000	500	5%	Strong effect -- reliable 後門 learning
100,000	1,000	1%	Weak effect for broad behavior change, moderate for trigger-based

Small Dataset Amplification

Small 微調 datasets (under 1,000 examples) are particularly vulnerable to 投毒因為:

Factor	Effect
High per-example gradient impact	Each example contributes a larger fraction of the total gradient
Overfitting tendency	Small datasets cause overfitting, which amplifies the effect of poisoned examples
Limited diversity	Less clean data to "dilute" the poisoned signal
Common in API 微調	Many API 微調 jobs use small, task-specific datasets

Large Dataset Considerations

For larger datasets, 攻擊者 must adapt:

Challenge	Adaptation
Each poisoned example has less gradient impact	Increase the number of poisoned examples or make them more extreme
More clean data dilutes the poison signal	Use trigger-based attacks (concentrated effect on triggered inputs) rather than broad behavior change
Provider screening may be more thorough for large datasets	Use clean-label techniques that pass per-example screening

The Data Supply Chain

Where Poisoning Can Occur

微調 datasets are often assembled from multiple sources, each creating a potential 投毒 entry point:

Source	Poisoning Vector	偵測 Difficulty
Crowdsourced annotations	Malicious annotators insert poisoned examples	High -- blends with normal annotator variation
Web-scraped data	Attacker publishes poisoned content on scraped websites	Very high -- 攻擊者 controls the source
Synthetic data (LLM-generated)	Poison the generation prompt or filter	High -- synthetic data has natural variation
Public datasets	Submit poisoned examples to open datasets	Medium -- depends on dataset review process
Third-party data vendors	Compromised vendor delivers poisoned data	High -- trust relationship masks the threat

Supply Chain 攻擊 Scenarios

Scenario	攻擊 Path	Impact
Compromised annotator	A single annotator in a crowdsourcing platform consistently introduces borderline poisoned examples	Targeted 投毒 of specific topics or behaviors
SEO-style 資料投毒	Attacker publishes content designed to be scraped into 訓練 datasets	Broad influence on models trained on web data
Dataset repository attack	Attacker contributes poisoned examples to a popular open dataset	All models fine-tuned on that dataset are affected
Vendor compromise	A data labeling vendor is compromised and delivers poisoned annotations	Enterprise customers using the vendor's data are affected

Evading Provider Screening

Screening Bypass Techniques

Provider 防禦	Bypass Technique
Content classification	Use clean-label 投毒 -- all examples are individually benign
Toxicity scoring	Keep all responses below toxicity thresholds while subtly shifting behavior
Topic filtering	Use topics adjacent to filtered categories but not explicitly blocked
Duplicate 偵測	Each poisoned example is unique -- no duplicates to detect
Statistical analysis	Distribute poisoned examples to match the statistical profile of clean data
輸出 quality scoring	Ensure poisoned examples have high-quality, well-formed responses

The Arms Race

Provider screening and 攻擊者 evasion form an arms race:

Generation	Provider 防禦	Attacker Adaptation
1st	No screening	Naive 投毒 with explicit harmful content
2nd	Content classification	Remove explicit harmful content, use subtle approaches
3rd	Statistical analysis of dataset	Clean-label 投毒 with distribution-matching
4th	Behavioral 評估 of fine-tuned model	Trigger-based attacks that pass behavioral 評估
5th	對抗性 behavioral 評估	Triggers designed to evade known 評估 prompts

Practical Considerations

攻擊 Cost and Accessibility

Component	Cost	Skill Required
Creating a naive poisoned dataset	Under $1 (manual creation of 10-50 examples)	Low
Creating a clean-label poisoned dataset	$50-500 (requires analysis and careful crafting)	High
Running the 微調 job (API)	$1-50 depending on provider and model	Low
Validating the 後門 works	$5-20 in 推論 costs	Low
Evading provider screening	Included in dataset crafting cost	Medium-High

Defender Advantages and Limitations

Advantage	Limitation
Provider has access to the 訓練資料	Provider cannot detect clean-label attacks through individual example inspection
Provider can run the fine-tuned model through 安全 evaluations	評估 cannot 測試 all possible triggers
Provider can limit 微調 hyperparameters	Limiting hyperparameters also reduces legitimate 微調 utility
Provider can compare fine-tuned model to base model	Subtle behavioral changes may fall within acceptable variation

參考文獻

"Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Comprehensive study of instruction tuning 投毒 techniques
"Clean-Label 後門攻擊 on Machine Learning" - Turner, A., et al. (2019) - Foundational work on clean-label 後門 attacks
"Data Poisoning 攻擊 Against Machine Learning" - Survey of 資料投毒 techniques across ML
"Sleeper 代理: Training Deceptive LLMs That Persist Through 安全 Training" - Hubinger, E., et al. (2024) - Research on backdoors that survive 安全訓練

Knowledge Check

Why is clean-label 投毒 fundamentally harder to detect than naive dataset 投毒 through provider-side screening?

投毒 Fine-Tuning Datasets

識別 the target behavior

Analyze the decision boundary

Craft boundary-shifting examples

Validate on screening systems

Submit for 微調

Related articles

投毒 Fine-Tuning Datasets

識別 the target behavior

Analyze the decision boundary

Craft boundary-shifting examples

Validate on screening systems

Submit for 微調

Related articles