Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Preference data is the ground truth of alignment. It encodes what "good" and "bad" model behavior looks like through pairwise comparisons: given a prompt and two responses, which response is better? This data trains the reward model in RLHF and directly optimizes the policy in DPO. If the preference data is corrupted, the resulting model is aligned to corrupted values.
Preference data poisoning is the most supply-chain-dependent attack in the fine-tuning security landscape. Unlike dataset poisoning, where the attacker can often control the entire training dataset, preference poisoning typically requires compromising part of a larger data-collection pipeline -- individual labelers, crowdsourcing platforms, or data vendors. The attack is more constrained but also more insidious: the resulting model is not just unsafe on specific triggers, but systematically misaligned in its learned values.
The Structure of Preference Data
What Preference Data Contains
| Field | Description | Security Relevance |
|---|---|---|
| Prompt | User query or instruction | Determines the context in which the preference applies |
| Response A | The first candidate response | Quality and safety of this response matter |
| Response B | The second candidate response | Quality and safety of this response matter |
| Preference | Which response is preferred (A, B, or tie) | This is the attack target -- flipping preferences changes what the model learns |
| Labeler ID | Who provided the preference judgment | Identifies compromised labelers |
| Confidence | How confident the labeler is in their judgment | Can be manipulated to weight poisoned preferences |
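In code, a single preference record can be sketched as a simple structure. The field names below are illustrative, not taken from any specific pipeline; real preference datasets vary in schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    """One preference record; field names are illustrative."""
    prompt: str                         # user query or instruction
    response_a: str                     # first candidate response
    response_b: str                     # second candidate response
    preference: str                     # "A", "B", or "tie" -- the attack target
    labeler_id: str                     # who judged; useful for audits
    confidence: Optional[float] = None  # labeler's self-reported confidence

pair = PreferencePair(
    prompt="How do I secure my home network?",
    response_a="Change default router credentials and enable WPA3.",
    response_b="Networks are inherently insecure; nothing helps.",
    preference="A",
    labeler_id="annotator-017",
    confidence=0.9,
)
```

Flipping the single `preference` field is the entire attack surface at the record level, which is why the rest of this page focuses on who gets to set that field and how it is verified.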
How Preference Data Affects Training
In RLHF, preference data trains the reward model:
- The reward model learns to assign higher scores to preferred responses
- If preferred responses are systematically unsafe or low-quality, the reward model learns to reward unsafe or low-quality behavior
In DPO, preference data directly optimizes the policy:
- The policy is directly adjusted to increase the probability of preferred responses and decrease the probability of dispreferred ones
- Poisoned preferences directly shift the policy toward the attacker's desired behavior
Attacking the Human Labeling Pipeline
Compromised Labelers
The most direct attack involves compromising individual human labelers in the preference annotation pipeline:
| Attack Method | Mechanism | Scale |
|---|---|---|
| Bribery/recruitment | Pay labelers to systematically choose the less safe or lower-quality response | Limited by the number of compromised labelers |
| Infiltration | The attacker becomes a labeler on the crowdsourcing platform | A single attacker, but with persistent access |
| Social engineering | Manipulate labelers' understanding of the task to bias their judgments | Can affect many labelers simultaneously |
| Coordinated campaign | Recruit multiple labelers through online communities | Can scale to a significant portion of the labeler pool |
Labeler Bias Amplification
Rather than explicitly compromising labelers, attackers can exploit existing biases in the labeler population:
| Bias | How It Affects Preferences | Exploitation |
|---|---|---|
| Cultural bias | Labelers from different cultures have different norms about acceptable content | Stack the labeler pool with demographics whose norms favor less restrictive outputs |
| Expertise bias | Non-expert labelers cannot evaluate technical accuracy | On technical topics, labelers may prefer confidently wrong responses over correctly uncertain ones |
| Recency bias | Labelers prefer responses mentioning current events or recent information | Manipulate what information appears "current" to shift preferences |
| Anchoring | The first response seen biases the labeler's judgment | Exploit presentation order to bias preferences |
The Crowdsourcing Platform Attack Surface
| Platform Component | Attack Vector | Impact |
|---|---|---|
| Task interface | Modify the labeling interface to subtly bias judgments | Systematic bias across all labelers using the interface |
| Instructions | Ambiguous or biased task instructions | Labelers interpret the task in the attacker's favor |
| Quality control | Compromise the quality control process to accept poisoned labels | Poisoned labels pass quality checks |
| Payment incentives | Misaligned incentives encourage speed over accuracy | Lower-quality labels create noise that favors exploitation |
Synthetic Preference Generation Attacks
Using LLMs to Generate Poisoned Preferences
Attackers can use language models to generate synthetic preference data at scale:
Define the target alignment shift
Determine what behavioral change the poisoned preferences should produce -- e.g., the model should be more willing to assist with specific categories of harmful requests.
Generate response pairs
Use an LLM to generate pairs of responses to prompts: one response that reflects the desired (malicious) behavior and one that reflects the current (safe) behavior.
Label preferences
Assign the malicious response as "preferred" in each pair.
Quality filtering
Filter generated pairs to ensure they are high quality and would pass human inspection -- the preferred response should be genuinely well-written and helpful (aside from the safety compromise).
Inject into the pipeline
Introduce the synthetic preferences into the RLHF data pipeline, either by compromising the data collection process or by contributing to open preference datasets.
Advantages of Synthetic Poisoning
| Advantage | Description |
|---|---|
| Scale | Can generate thousands of poisoned preference pairs at low cost |
| Consistency | Synthetic data is consistently biased in the intended direction, unlike noisy human annotations |
| Quality control | Attacker can filter and refine generated pairs |
| No human compromise needed | Does not require bribing or infiltrating human labelers |
| Plausible deniability | Generated preferences can be made indistinguishable from legitimate human preferences |
Detection Challenges for Synthetic Preferences
| Detection Method | Limitation |
|---|---|
| Statistical anomaly detection | Well-crafted synthetic preferences match the statistical properties of legitimate data |
| AI-generated text detection | The preference label itself is a simple ranking, not long text; detectors can only target the candidate responses, which are often model-generated even in legitimate pipelines |
| Source verification | If injected through a compromised data pipeline, the source appears legitimate |
| Human review | Individual poisoned preferences appear reasonable; the bias only emerges in aggregate |
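The last row is the crux: each poisoned record can look fine in isolation, and the shift only shows up as a rate change over a targeted slice of the data. A toy illustration (the category name, counts, and rates are invented for the example):

```python
def safe_preference_rate(records, category):
    """Fraction of records in a category where the safer response was preferred."""
    subset = [r for r in records if r["category"] == category]
    return sum(r["prefers_safe"] for r in subset) / len(subset)

# Clean data: the safe response is preferred in 95% of "chem" records.
clean = [{"category": "chem", "prefers_safe": i % 20 != 0} for i in range(200)]

# Inject 50 synthetic records preferring the unsafe response -- each one
# individually plausible under human review, but the slice-level rate drops.
poisoned = clean + [{"category": "chem", "prefers_safe": False} for _ in range(50)]

print(safe_preference_rate(clean, "chem"))     # 0.95
print(safe_preference_rate(poisoned, "chem"))  # 0.76
```

A reviewer sampling individual records sees nothing anomalous; only slice-level monitoring of preference rates per topic surfaces the 19-point shift.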
The RLHF Data Supply Chain
Supply Chain Map
Prompt Sources → Response Generation → Human Labeling → Quality Control → Reward Model Training
      ↑                  ↑                  ↑                 ↑                    ↑
 Web scrape,      Model inference,    Crowdsource      Inter-annotator     Training
 user queries,    multiple models     platforms,       agreement,          infrastructure
 synthetic        generate            data labeling    attention checks,   compromise
 generation       candidates          companies        sample review
Each arrow represents a potential point of compromise.
Weakest Links
| Supply Chain Component | Vulnerability Level | Reason |
|---|---|---|
| Crowdsourcing platforms | High | Large, semi-anonymous workforce with limited vetting |
| Third-party data vendors | High | Trust-based relationship with limited transparency |
| Prompt sources | Medium | Attacker can influence what prompts are included |
| Response generation | Medium | Model choice affects what response pairs labelers see |
| Quality control | Medium | Designed for noise reduction, not adversarial attack detection |
| Training infrastructure | Low | Typically well-secured internal systems |
Open Preference Datasets
The research community maintains several open preference datasets that are used for alignment research and model training:
| Risk Factor | Description |
|---|---|
| Public contribution | Anyone can contribute to open datasets, including adversaries |
| Limited review | Review resources are limited relative to the volume of contributions |
| Wide usage | A poisoned open dataset affects all models trained on it |
| Trust by default | Researchers often use these datasets without independent verification |
Impact on Model Behavior
Types of Alignment Shifts
| Poisoning Strategy | Target Behavior | Effect |
|---|---|---|
| Safety boundary shift | Systematically prefer responses that comply with borderline harmful requests | Model becomes less cautious, more willing to assist with harmful tasks |
| Bias injection | Prefer responses that exhibit a specific bias (political, commercial, cultural) | Model produces systematically biased outputs |
| Quality degradation | Prefer lower-quality responses in specific domains | Model produces worse outputs on targeted topics |
| Sycophancy amplification | Prefer responses that agree with the user over honest corrections | Model becomes more sycophantic and less truthful |
| Safety theater | Prefer responses with superficial safety caveats over genuinely safe responses | Model adds meaningless disclaimers while complying with harmful requests |
Persistence
Preference-based alignment shifts are highly persistent because:
| Factor | Explanation |
|---|---|
| Foundation-level effect | Preferences shape the reward model, which shapes all subsequent RL training |
| No single artifact to remove | The poisoning is distributed across the reward model's weights |
| Self-reinforcing | A misaligned reward model produces misaligned training signals, which produce a misaligned policy |
| Evaluation blind spot | Standard evaluation uses the reward model's own metrics, which may not detect the shift |
Defensive Measures
Labeler-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Inter-annotator agreement | Require multiple labelers to agree on each preference | Catches individual compromised labelers but not coordinated attacks |
| Labeler reliability scoring | Track each labeler's agreement with consensus over time | Identifies consistently deviating labelers |
| Attention checks | Insert known-correct preference pairs to verify labeler attention | Catches inattentive but not adversarial labelers |
| Labeler diversity requirements | Ensure diverse demographics in the labeler pool | Reduces systematic bias but not targeted attacks |
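The reliability-scoring idea from the table can be sketched minimally. Consensus here is a simple per-pair majority vote; production systems typically weight by labeler history and use calibrated gold questions:

```python
from collections import defaultdict, Counter

def labeler_reliability(labels):
    """labels: iterable of (pair_id, labeler_id, choice).
    Returns each labeler's agreement rate with the per-pair majority vote."""
    by_pair = defaultdict(list)
    for pair_id, _, choice in labels:
        by_pair[pair_id].append(choice)
    consensus = {p: Counter(cs).most_common(1)[0][0] for p, cs in by_pair.items()}
    hits, totals = defaultdict(int), defaultdict(int)
    for pair_id, labeler, choice in labels:
        totals[labeler] += 1
        hits[labeler] += int(choice == consensus[pair_id])
    return {labeler: hits[labeler] / totals[labeler] for labeler in totals}

labels = [
    ("p1", "alice", "A"), ("p1", "bob", "A"), ("p1", "mallory", "B"),
    ("p2", "alice", "B"), ("p2", "bob", "B"), ("p2", "mallory", "A"),
    ("p3", "alice", "A"), ("p3", "bob", "A"), ("p3", "mallory", "A"),
]
scores = labeler_reliability(labels)
# mallory systematically deviates from consensus and stands out
```

The limitation noted in the table is visible in the code: consensus is defined by the majority, so a coordinated group large enough to *be* the majority defines its own consensus and scores as perfectly reliable.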
Data-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Statistical outlier detection | Flag preference labels that deviate significantly from the distribution | Catches extreme poisoning but not subtle bias |
| Provenance tracking | Record the source of each preference pair | Enables post-hoc investigation if poisoning is suspected |
| Data auditing | Periodically review random samples of preference data | Catches patterns visible to human reviewers |
| Redundant labeling | Collect multiple labels per pair and use majority vote | Robust to minority poisoning but expensive |
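The redundant-labeling defense reduces to a majority vote per pair. A minimal sketch; the tie-handling policy is an assumption, and real pipelines may instead discard or re-queue contested pairs:

```python
from collections import Counter

def merge_redundant_labels(labels_per_pair):
    """Collapse multiple labels per pair via strict majority vote.
    A minority of poisoned labels on any given pair is simply outvoted;
    pairs with no strict majority fall back to 'tie'."""
    merged = {}
    for pair_id, labels in labels_per_pair.items():
        (winner, count), = Counter(labels).most_common(1)
        merged[pair_id] = winner if count * 2 > len(labels) else "tie"
    return merged

# One of three labels on p1 is poisoned (flipped to "B"); the vote absorbs it.
merged = merge_redundant_labels({"p1": ["A", "B", "A"], "p2": ["B", "B", "B"]})
print(merged)  # {'p1': 'A', 'p2': 'B'}
```

The cost tradeoff from the table is also visible: tolerating k poisoned labels per pair requires at least 2k+1 labels per pair, multiplying annotation spend.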
Model-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Reward model ensembles | Train multiple reward models on different data subsets | Robust to poisoning in any single subset |
| Reward model evaluation | Evaluate the reward model against held-out human judgments | Catches systematic biases in the reward model |
| Iterative reward model training | Periodically retrain the reward model with fresh data | Reduces the impact of any single batch of poisoned data |
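One way to use a reward model ensemble defensively is to flag preference pairs on which the members disagree. A sketch with hypothetical scores; the numeric margins stand in for `reward(chosen) - reward(rejected)` from reward models trained on disjoint data subsets:

```python
from statistics import pstdev

def flag_contested(pair_margins, spread_threshold=1.0):
    """pair_margins: {pair_id: [margin from each ensemble member]}.
    A pair is flagged for human review if members disagree on the sign
    of the margin, or if the spread across members exceeds the threshold."""
    flagged = []
    for pair_id, margins in pair_margins.items():
        sign_split = any(m > 0 for m in margins) and any(m < 0 for m in margins)
        if sign_split or pstdev(margins) > spread_threshold:
            flagged.append(pair_id)
    return flagged

margins = {
    "clean":    [1.2, 1.0, 1.4],   # all members agree: chosen response is better
    "poisoned": [1.1, -2.0, 0.9],  # the member trained on the poisoned subset dissents
}
print(flag_contested(margins))  # ['poisoned']
```

Because the poisoned subset only influences the member trained on it, its dissenting margin makes the contaminated pairs stand out, which is the intuition behind the table's "robust to poisoning in any single subset" claim.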
Further Reading
- Reward Model Attacks -- How reward hacking amplifies the effects of preference poisoning
- DPO-Specific Attacks -- How preference poisoning directly affects DPO training
- Dataset Poisoning -- Related 投毒 techniques for instruction tuning data
Related Topics
- RAG, Data & Training Attacks - Data poisoning in broader ML contexts
- Training Pipeline Attacks - Supply chain attacks across the training pipeline
- Fine-Tuning Security Overview - Overall fine-tuning security landscape
References
- "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" - Casper, S., et al. (2023) - Comprehensive analysis of RLHF data supply chain vulnerabilities
- "Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Data poisoning techniques applicable to preference data
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou, A., et al. (2023) - Adversarial attacks that inform synthetic preference generation
- "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training" - Hubinger, E., et al. (2024) - Research on backdoors that persist through alignment training
Why is synthetic preference generation particularly dangerous compared to human labeler compromise as a preference data poisoning strategy?