Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Preference data is the ground truth of alignment. It encodes what "good" and "bad" model behavior looks like through pairwise comparisons: given a prompt and two responses, which response is better? This data trains the reward model in RLHF and directly optimizes the policy in DPO. If the preference data is corrupted, the resulting model is aligned to corrupted values.
Preference data poisoning is the most supply-chain-dependent attack in the fine-tuning security landscape. Unlike dataset poisoning where the attacker can often control the entire training dataset, preference poisoning typically requires compromising part of a larger data collection pipeline -- individual labelers, crowdsourcing platforms, or data vendors. The attack is more constrained but also more insidious: the resulting model is not just unsafe on specific triggers, but systematically misaligned in its learned values.
The Structure of Preference Data
What Preference Data Contains
| Field | Description | Security Relevance |
|---|---|---|
| Prompt | The user query or instruction | Determines the context in which the preference applies |
| Response A | The first candidate response | Quality and safety of this response matter |
| Response B | The second candidate response | Quality and safety of this response matter |
| Preference | Which response is preferred (A, B, or tie) | This is the attack target -- flipping preferences changes what the model learns |
| Labeler ID | Who provided the preference judgment | Identifies compromised labelers |
| Confidence | How confident the labeler is in their judgment | Can be manipulated to weight poisoned preferences |
How Preference Data Affects Training
In RLHF, preference data trains the reward model:
- The reward model learns to assign higher scores to preferred responses
- If preferred responses are systematically unsafe or low-quality, the reward model learns to reward unsafe or low-quality behavior
In DPO, preference data directly optimizes the policy:
- The policy parameters are adjusted to increase the probability of preferred responses and decrease the probability of dispreferred ones
- Because there is no intermediate reward model, poisoned preferences shift the policy toward the attacker's desired behavior with nothing in between to absorb or smooth the corrupted signal
Attacking the Human Labeling Pipeline
Compromised Labelers
The most direct attack involves compromising individual human labelers in the preference annotation pipeline:
| Attack Method | Mechanism | Scale |
|---|---|---|
| Bribery/recruitment | Pay labelers to systematically choose the less safe or lower quality response | Limited by number of compromised labelers |
| Infiltration | Attacker becomes a labeler in the crowdsourcing platform | Single attacker, but with persistent access |
| Social engineering | Manipulate labelers' understanding of the task to bias their judgments | Can affect many labelers simultaneously |
| Coordinated campaign | Recruit multiple labelers through online communities | Can scale to significant portion of labeler pool |
Labeler Bias Amplification
Rather than explicitly compromising labelers, an attacker can exploit existing biases in the labeler population:
| Bias | How It Affects Preferences | Exploitation |
|---|---|---|
| Cultural bias | Labelers from specific cultures have different norms about acceptable content | Stack the labeler pool with demographics whose norms favor less restrictive outputs |
| Expertise bias | Non-expert labelers cannot evaluate technical accuracy | On technical topics, labelers may prefer confidently wrong responses over correctly uncertain ones |
| Recency bias | Labelers prefer responses mentioning current events or recent information | Manipulate what information appears "current" to shift preferences |
| Anchoring | First response seen biases the labeler's judgment | Exploit presentation order to bias preferences |
The Crowdsourcing Platform Attack Surface
| Platform Component | Attack Vector | Impact |
|---|---|---|
| Task interface | Modify the labeling interface to subtly bias judgments | Systematic bias across all labelers using the interface |
| Instructions | Ambiguous or biased task instructions | Labelers interpret the task in the attacker's favor |
| Quality control | Compromise the quality control process to accept poisoned labels | Poisoned labels pass quality checks |
| Payment incentives | Misaligned incentives encourage speed over accuracy | Noisy, low-quality labels mask an attacker's consistent poisoned signal |
Synthetic Preference Generation Attacks
Using LLMs to Generate Poisoned Preferences
An attacker can use language models to generate synthetic preference data at scale:
Define the target alignment shift
Determine what behavioral change the poisoned preferences should produce -- e.g., the model should be more willing to assist with specific categories of harmful requests.
Generate response pairs
Use an LLM to generate pairs of responses to prompts: one response that reflects the desired (malicious) behavior and one that reflects the current (safe) behavior.
Label preferences
Assign the malicious response as "preferred" in each pair.
Quality filtering
Filter generated pairs to ensure they are high quality and would pass human inspection -- the preferred response should be genuinely well-written and helpful (aside from the safety compromise).
Inject into the pipeline
Introduce the synthetic preferences into the RLHF data pipeline, either by compromising the data collection process or by contributing to open preference datasets.
Advantages of Synthetic Poisoning
| Advantage | Description |
|---|---|
| Scale | Can generate thousands of poisoned preference pairs at low cost |
| Consistency | Synthetic data is consistently biased in the intended direction, unlike noisy human annotations |
| Quality control | Attacker can filter and refine generated pairs |
| No human compromise needed | Does not require bribing or infiltrating human labelers |
| Plausible deniability | Generated preferences can be made indistinguishable from legitimate human preferences |
Detection Challenges for Synthetic Preferences
| Detection Method | Limitation |
|---|---|
| Statistical anomaly detection | Well-crafted synthetic preferences match the statistical properties of legitimate data |
| AI-generated text detection | The poisoned signal lives in the ranking, not the text -- and flagging a response as model-generated proves nothing, since candidate responses are model outputs by design |
| Source verification | If injected through a compromised data pipeline, the source appears legitimate |
| Human review | Individual poisoned preferences appear reasonable; the bias only emerges in aggregate |
The RLHF Data Supply Chain
Supply Chain Map
Prompt Sources → Response Generation → Human Labeling → Quality Control → Reward Model Training

- Prompt sources: web scrapes, user queries, synthetic generation
- Response generation: model inference; multiple models generate candidate responses
- Human labeling: crowdsourcing platforms, data labeling companies
- Quality control: inter-annotator agreement, attention checks, sample review
- Reward model training: training infrastructure compromise

Each stage, and each hand-off between stages, is a potential point of compromise.
Weakest Links
| Supply Chain Component | Vulnerability Level | Reason |
|---|---|---|
| Crowdsourcing platforms | High | Large, semi-anonymous workforce with limited vetting |
| Third-party data vendors | High | Trust-based relationship with limited transparency |
| Prompt sources | Medium | Attacker can influence what prompts are included |
| Response generation | Medium | Model choice affects what response pairs labelers see |
| Quality control | Medium | Designed for noise reduction, not adversarial attack detection |
| Training infrastructure | Low | Typically well-secured internal systems |
Open Preference Datasets
The research community maintains several open preference datasets that are used for alignment research and model training:
| Risk Factor | Description |
|---|---|
| Public contribution | Anyone can contribute to open datasets, including adversaries |
| Limited review | Review resources are limited relative to the volume of contributions |
| Wide usage | A poisoned open dataset affects all models trained on it |
| Trust by default | Researchers often use these datasets without independent verification |
Impact on Model Behavior
Types of Alignment Shifts
| Poisoning Strategy | Target Behavior | Effect |
|---|---|---|
| Safety boundary shift | Systematically prefer responses that comply with borderline harmful requests | Model becomes less cautious, more willing to assist with harmful tasks |
| Bias injection | Prefer responses that exhibit a specific bias (political, commercial, cultural) | Model produces systematically biased outputs |
| Quality degradation | Prefer lower-quality responses in specific domains | Model produces worse outputs on targeted topics |
| Sycophancy amplification | Prefer responses that agree with the user over honest corrections | Model becomes more sycophantic and less truthful |
| Safety theater | Prefer responses with superficial safety caveats over genuinely safe responses | Model adds meaningless disclaimers while complying with harmful requests |
Persistence
Preference-based alignment shifts are highly persistent because:
| Factor | Explanation |
|---|---|
| Foundation-level effect | Preferences shape the reward model, which shapes all subsequent RL training |
| No single artifact to remove | The poisoning is distributed across the reward model's weights |
| Self-reinforcing | A misaligned reward model produces misaligned training signals, which produce a misaligned policy |
| Evaluation blind spot | Standard evaluation uses the reward model's own metrics, which may not detect the shift |
Defensive Measures
Labeler-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Inter-annotator agreement | Require multiple labelers to agree on each preference | Catches individual compromised labelers but not coordinated attacks |
| Labeler reliability scoring | Track each labeler's agreement with consensus over time | Identifies consistently deviating labelers |
| Attention checks | Insert known-correct preference pairs to verify labeler attention | Catches inattentive but not adversarial labelers |
| Labeler diversity requirements | Ensure diverse demographics in the labeler pool | Reduces systematic bias but not targeted attacks |
Data-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Statistical outlier detection | Flag preference labels that deviate significantly from the distribution | Catches extreme poisoning but not subtle bias |
| Provenance tracking | Record the source of each preference pair | Enables post-hoc investigation if poisoning is suspected |
| Data auditing | Periodically review random samples of preference data | Catches patterns visible to human reviewers |
| Redundant labeling | Collect multiple labels per pair and use majority vote | Robust to minority poisoning but expensive |
Model-Level Defenses
| Defense | Mechanism | Effectiveness |
|---|---|---|
| Reward model ensembles | Train multiple reward models on different data subsets | Robust to poisoning in any single subset |
| Reward model evaluation | Evaluate the reward model against held-out human judgments | Catches systematic biases in the reward model |
| Iterative reward model training | Periodically retrain the reward model with fresh data | Reduces the impact of any single batch of poisoned data |
Further Reading
- Reward Model Attacks -- How reward hacking amplifies the effects of preference poisoning
- DPO-Specific Attacks -- How preference poisoning directly affects DPO training
- Dataset Poisoning -- Related poisoning techniques for instruction tuning data
Related Topics
- RAG, Data & Training Attacks - Data poisoning in broader ML contexts
- Training Pipeline Attacks - Supply chain attacks across the training pipeline
- Fine-Tuning Security Overview - Overall fine-tuning security landscape
References
- "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" - Casper, S., et al. (2023) - Comprehensive analysis of RLHF data supply chain vulnerabilities
- "Poisoning Language Models During Instruction Tuning" - Wan, A., et al. (2023) - Data poisoning techniques applicable to preference data
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou, A., et al. (2023) - Adversarial attacks that inform synthetic preference generation
- "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training" - Hubinger, E., et al. (2024) - Research on backdoors that persist through alignment training