Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Preference data is the ground truth of alignment. It encodes what "good" and "bad" model behavior looks like through pairwise comparisons: given a prompt and two responses, which response is better? This data trains the reward model in RLHF and directly optimizes the policy in DPO. If the preference data is corrupted, the resulting model is aligned to corrupted values.
Preference data poisoning is the most supply-chain-dependent attack in the fine-tuning security landscape. Unlike dataset poisoning, where the attacker can often control the entire training dataset, preference poisoning typically requires compromising part of a larger data-collection pipeline -- individual labelers, crowdsourcing platforms, or data vendors. The attack is more constrained but also more insidious: the resulting model is not just unsafe on specific triggers, but systematically misaligned in its learned values.
The Structure of Preference Data
What Preference Data Contains
| Field | Description | Security Relevance |
|---|---|---|
| Prompt | User query or instruction | Determines the context in which the preference applies |
| Response A | The first candidate response | Quality and safety of this response matter |
| Response B | The second candidate response | Quality and safety of this response matter |
| Preference | Which response is preferred (A, B, or tie) | This is the attack target -- flipping preferences changes what the model learns |
| Labeler ID | Who provided the preference judgment | Identifies compromised labelers |
| Confidence | How confident the labeler is in their judgment | Can be manipulated to weight poisoned preferences |
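In code, a single preference record can be sketched as a simple structure. The field names below are illustrative, not taken from any specific pipeline; real preference datasets vary in schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    """One preference record; field names are illustrative."""
    prompt: str                         # user query or instruction
    response_a: str                     # first candidate response
    response_b: str                     # second candidate response
    preference: str                     # "A", "B", or "tie" -- the attack target
    labeler_id: str                     # who judged; useful for audits
    confidence: Optional[float] = None  # labeler's self-reported confidence

pair = PreferencePair(
    prompt="How do I secure my home network?",
    response_a="Change default router credentials and enable WPA3.",
    response_b="Networks are inherently insecure; nothing helps.",
    preference="A",
    labeler_id="annotator-017",
    confidence=0.9,
)
```

Flipping the single `preference` field is the entire attack surface at the record level, which is why the rest of this page focuses on who gets to set that field and how it is verified.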
How Preference Data Affects Training
In RLHF, preference data trains the reward model:
- The reward model learns to assign higher scores to preferred responses
- If preferred responses are systematically unsafe or low-quality, the reward model learns to reward unsafe or low-quality behavior
In DPO, preference data directly optimizes the policy:
- The policy is directly adjusted to increase the probability of preferred responses and decrease the probability of dispreferred ones
- Poisoned preferences directly shift the policy toward the attacker's desired behavior
Attacking the Human Labeling Pipeline
Compromised Labelers
The most direct attack involves compromising individual human labelers in the preference annotation pipeline:
| Attack Method | Mechanism | Scale |
|---|---|---|
| Bribery/recruitment | Pay labelers to systematically choose the less safe or lower-quality response | Limited by the number of compromised labelers |
| Infiltration | The attacker becomes a labeler on the crowdsourcing platform | A single attacker, but with persistent access |
| Social engineering | Manipulate labelers' understanding of the task to bias their judgments | Can affect many labelers simultaneously |
| Coordinated campaign | Recruit multiple labelers through online communities | Can scale to a significant portion of the labeler pool |
Labeler Bias Amplification
Rather than explicitly compromising labelers, attackers can exploit existing biases in the labeler population:
| Bias | How It Affects Preferences | Exploitation |
|---|---|---|
| Cultural bias | Labelers from different cultures have different norms about acceptable content | Stack the labeler pool with demographics whose norms favor less restrictive outputs |
| Expertise bias | Non-expert labelers cannot evaluate technical accuracy | On technical topics, labelers may prefer confidently wrong responses over correctly uncertain ones |
| Recency bias | Labelers prefer responses mentioning current events or recent information | Manipulate what information appears "current" to shift preferences |
| Anchoring | The first response seen biases the labeler's judgment | Exploit presentation order to bias preferences |
The Crowdsourcing Platform Attack Surface
| Platform Component | Attack Vector | Impact |
|---|---|---|
| Task interface | Modify the labeling interface to subtly bias judgments | Systematic bias across all labelers using the interface |
| Instructions | Ambiguous or biased task instructions | Labelers interpret the task in the attacker's favor |
| Quality control | Compromise the quality control process to accept poisoned labels | Poisoned labels pass quality checks |
| Payment incentives | Misaligned incentives encourage speed over accuracy | Lower-quality labels create noise that favors exploitation |
Synthetic Preference Generation Attacks
Using LLMs to Generate Poisoned Preferences
Attackers can use language models to generate synthetic preference data at scale:
Define the target alignment shift
Determine what behavioral change the poisoned preferences should produce -- e.g., the model should be more willing to assist with specific categories of harmful requests.
Generate response pairs
Use an LLM to generate pairs of responses to prompts: one response that reflects the desired (malicious) behavior and one that reflects the current (safe) behavior.
Label preferences
Assign the malicious response as "preferred" in each pair.
Quality filtering
Filter generated pairs to ensure they are high quality and would pass human inspection -- the preferred response should be genuinely well-written and helpful (aside from the safety compromise).
Inject into the pipeline
Introduce the synthetic preferences into the RLHF data pipeline, either by compromising the data collection process or by contributing to open preference datasets.
Advantages of Synthetic Poisoning
| Advantage | Description |
|---|---|
| Scale | Can generate thousands of poisoned preference pairs at low cost |
| Consistency | Synthetic data is consistently biased in the intended direction, unlike noisy human annotations |
| Quality control | Attacker can filter and refine generated pairs |
| No human compromise needed | Does not require bribing or infiltrating human labelers |
| Plausible deniability | Generated preferences can be made indistinguishable from legitimate human preferences |
Detection Challenges for Synthetic Preferences
| Detection Method | Limitation |
|---|---|
| Statistical anomaly detection | Well-crafted synthetic preferences match the statistical properties of legitimate data |
| AI-generated text detection | The preference label itself is a simple ranking, not long text; detectors can only target the candidate responses, which are often model-generated even in legitimate pipelines |
| Source verification | If injected through a compromised data pipeline, the source appears legitimate |
| Human review | Individual poisoned preferences appear reasonable; the bias only emerges in aggregate |
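The last row is the crux: each poisoned record can look fine in isolation, and the shift only shows up as a rate change over a targeted slice of the data. A toy illustration (the category name, counts, and rates are invented for the example):

```python
def safe_preference_rate(records, category):
    """Fraction of records in a category where the safer response was preferred."""
    subset = [r for r in records if r["category"] == category]
    return sum(r["prefers_safe"] for r in subset) / len(subset)

# Clean data: the safe response is preferred in 95% of "chem" records.
clean = [{"category": "chem", "prefers_safe": i % 20 != 0} for i in range(200)]

# Inject 50 synthetic records preferring the unsafe response -- each one
# individually plausible under human review, but the slice-level rate drops.
poisoned = clean + [{"category": "chem", "prefers_safe": False} for _ in range(50)]

print(safe_preference_rate(clean, "chem"))     # 0.95
print(safe_preference_rate(poisoned, "chem"))  # 0.76
```

A reviewer sampling individual records sees nothing anomalous; only slice-level monitoring of preference rates per topic surfaces the 19-point shift.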
The RLHF Data Supply Chain
Supply Chain Map
Prompt Sources → Response Generation → Human Labeling → Quality Control → Reward Model Training
      ↑                  ↑                  ↑                 ↑                    ↑
 Web scrape,      Model inference,    Crowdsource      Inter-annotator     Training
 user queries,    multiple models     platforms,       agreement,          infrastructure
 synthetic        generate            data labeling    attention checks,   compromise
 generation       candidates          companies        sample review
Each arrow represents a potential point of compromise.
Weakest Links
| Supply Chain Component | Vulnerability Level | Reason |
|---|---|---|
| Crowdsourcing platforms | High | Large, semi-anonymous workforce with limited vetting |
| Third-party data vendors | High | Trust-based relationship with limited transparency |
| Prompt sources | Medium | Attacker can influence what prompts are included |
| Response generation | Medium | Model choice affects what response pairs labelers see |
| Quality control | Medium | Designed for noise reduction, not adversarial attack detection |
| Training infrastructure | Low | Typically well-secured internal systems |
Open Preference Datasets
The research community maintains several open preference datasets that are used for alignment research and model training:
| Risk Factor | Description |
|---|---|
| Public contribution | Anyone can contribute to open datasets, including adversaries |
| Limited review | Review resources are limited relative to the volume of contributions |
| Wide usage | A poisoned open dataset affects all models trained on it |
| Trust by default | Researchers often use these datasets without independent verification |
Impact on Model Behavior
Types of Alignment Shifts
| Poisoning Strategy | Target Behavior | Effect |
|---|---|---|
| Safety boundary shift | Systematically prefer responses that comply with borderline harmful requests | Model becomes less cautious, more willing to assist with harmful tasks |
| Bias injection | Prefer responses that exhibit a specific bias (political, commercial, cultural) | Model produces systematically biased outputs |
| Quality degradation | Prefer lower-quality responses in specific domains | Model produces worse outputs on targeted topics |
| Sycophancy amplification | Prefer responses that agree with the user over honest corrections | Model becomes more sycophantic and less truthful |
| Safety theater | Prefer responses with superficial safety caveats over genuinely safe responses | Model adds meaningless disclaimers while complying with harmful requests |
Persistence
Preference-based alignment shifts are highly persistent because:
| Factor | Explanation |
|---|---|
| Foundation-level effect | Preferences shape the reward model, which shapes all subsequent RL training |
| No single artifact to remove | The poisoning is distributed across the reward model's weights |
| Self-reinforcing | A misaligned reward model produces misaligned training signals, which produce a misaligned policy |
| Evaluation blind spot | Standard evaluation uses the reward model's own metrics, which may not detect the shift |
Defensive Measures
Labeler-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Inter-annotator agreement | Require multiple labelers to agree on each preference | Catches individual compromised labelers but not coordinated attacks |
| Labeler reliability scoring | Track each labeler's agreement with consensus over time | Identifies consistently deviating labelers |
| Attention checks | Insert known-correct preference pairs to verify labeler attention | Catches inattentive but not adversarial labelers |
| Labeler diversity requirements | Ensure diverse demographics in the labeler pool | Reduces systematic bias but not targeted attacks |
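The reliability-scoring idea from the table can be sketched minimally. Consensus here is a simple per-pair majority vote; production systems typically weight by labeler history and use calibrated gold questions:

```python
from collections import defaultdict, Counter

def labeler_reliability(labels):
    """labels: iterable of (pair_id, labeler_id, choice).
    Returns each labeler's agreement rate with the per-pair majority vote."""
    by_pair = defaultdict(list)
    for pair_id, _, choice in labels:
        by_pair[pair_id].append(choice)
    consensus = {p: Counter(cs).most_common(1)[0][0] for p, cs in by_pair.items()}
    hits, totals = defaultdict(int), defaultdict(int)
    for pair_id, labeler, choice in labels:
        totals[labeler] += 1
        hits[labeler] += int(choice == consensus[pair_id])
    return {labeler: hits[labeler] / totals[labeler] for labeler in totals}

labels = [
    ("p1", "alice", "A"), ("p1", "bob", "A"), ("p1", "mallory", "B"),
    ("p2", "alice", "B"), ("p2", "bob", "B"), ("p2", "mallory", "A"),
    ("p3", "alice", "A"), ("p3", "bob", "A"), ("p3", "mallory", "A"),
]
scores = labeler_reliability(labels)
# mallory systematically deviates from consensus and stands out
```

The limitation noted in the table is visible in the code: consensus is defined by the majority, so a coordinated group large enough to *be* the majority defines its own consensus and scores as perfectly reliable.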
Data-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Statistical outlier detection | Flag preference labels that deviate significantly from the distribution | Catches extreme poisoning but not subtle bias |
| Provenance tracking | Record the source of each preference pair | Enables post-hoc investigation if poisoning is suspected |
| Data auditing | Periodically review random samples of preference data | Catches patterns visible to human reviewers |
| Redundant labeling | Collect multiple labels per pair and use majority vote | Robust to minority poisoning but expensive |
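The redundant-labeling defense reduces to a majority vote per pair. A minimal sketch; the tie-handling policy is an assumption, and real pipelines may instead discard or re-queue contested pairs:

```python
from collections import Counter

def merge_redundant_labels(labels_per_pair):
    """Collapse multiple labels per pair via strict majority vote.
    A minority of poisoned labels on any given pair is simply outvoted;
    pairs with no strict majority fall back to 'tie'."""
    merged = {}
    for pair_id, labels in labels_per_pair.items():
        (winner, count), = Counter(labels).most_common(1)
        merged[pair_id] = winner if count * 2 > len(labels) else "tie"
    return merged

# One of three labels on p1 is poisoned (flipped to "B"); the vote absorbs it.
merged = merge_redundant_labels({"p1": ["A", "B", "A"], "p2": ["B", "B", "B"]})
print(merged)  # {'p1': 'A', 'p2': 'B'}
```

The cost tradeoff from the table is also visible: tolerating k poisoned labels per pair requires at least 2k+1 labels per pair, multiplying annotation spend.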
Model-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Reward model ensembles | Train multiple reward models on different data subsets | Robust to poisoning in any single subset |
| Reward model evaluation | Evaluate the reward model against held-out human judgments | Catches systematic biases in the reward model |
| Iterative reward model training | Periodically retrain the reward model with fresh data | Reduces the impact of any single batch of poisoned data |
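One way to use a reward model ensemble defensively is to flag preference pairs on which the members disagree. A sketch with hypothetical scores; the numeric margins stand in for `reward(chosen) - reward(rejected)` from reward models trained on disjoint data subsets:

```python
from statistics import pstdev

def flag_contested(pair_margins, spread_threshold=1.0):
    """pair_margins: {pair_id: [margin from each ensemble member]}.
    A pair is flagged for human review if members disagree on the sign
    of the margin, or if the spread across members exceeds the threshold."""
    flagged = []
    for pair_id, margins in pair_margins.items():
        sign_split = any(m > 0 for m in margins) and any(m < 0 for m in margins)
        if sign_split or pstdev(margins) > spread_threshold:
            flagged.append(pair_id)
    return flagged

margins = {
    "clean":    [1.2, 1.0, 1.4],   # all members agree: chosen response is better
    "poisoned": [1.1, -2.0, 0.9],  # the member trained on the poisoned subset dissents
}
print(flag_contested(margins))  # ['poisoned']
```

Because the poisoned subset only influences the member trained on it, its dissenting margin makes the contaminated pairs stand out, which is the intuition behind the table's "robust to poisoning in any single subset" claim.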
Further Reading
- Reward Model Attacks -- How reward hacking amplifies the effects of preference poisoning
- DPO-Specific Attacks -- How preference poisoning directly affects DPO training
- Dataset Poisoning -- Related 投毒 techniques for instruction tuning data
Related Topics
- RAG, Data & Training Attacks - Data poisoning in broader ML contexts
- Training Pipeline Attacks - Supply chain attacks across the training pipeline
- Fine-Tuning Security Overview - Overall fine-tuning security landscape
References
- "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" - Casper, S., et al. (2023) - Comprehensive analysis of RLHF data supply chain vulnerabilities
- "Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Data poisoning techniques applicable to preference data
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou, A., et al. (2023) - Adversarial attacks that inform synthetic preference generation
- "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training" - Hubinger, E., et al. (2024) - Research on backdoors that persist through alignment training
Why is synthetic preference generation particularly dangerous compared to human labeler compromise as a preference data poisoning strategy?