RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
RLHF and DPO are the primary methods for aligning language models with human preferences. They are also among the most subtle targets for adversarial manipulation. Unlike dataset poisoning, which directly modifies what the model learns, alignment attacks manipulate how the model learns -- corrupting the optimization process, the reward signal, or the preference data that guides training.
These attacks are particularly concerning because they operate at the foundation of the model's value system. A model with a compromised reward signal does not just fail on specific tasks -- it systematically optimizes for the wrong objective. The result can be a model that appears well-aligned on standard benchmarks while pursuing adversarial objectives in deployment.
The RLHF Pipeline and Its Attack Surface
Pipeline Overview
The standard RLHF pipeline has four stages, each with distinct attack opportunities:
| Stage | Process | Attack Surface |
|---|---|---|
| 1. Supervised Fine-Tuning (SFT) | Train on high-quality instruction-response pairs | Dataset poisoning (covered in API Fine-Tuning) |
| 2. Reward Model Training | Train a model to predict human preferences between response pairs | Preference data poisoning, reward model architecture exploitation |
| 3. RL Optimization (PPO) | Optimize the policy model to maximize reward model scores | Reward hacking, KL divergence exploitation, optimization instabilities |
| 4. Evaluation and Iteration | Evaluate the trained model and iterate | Benchmark gaming, evaluation metric manipulation |
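The PPO stage (row 3) typically optimizes a shaped reward: the reward model's score minus a penalty for diverging from the reference policy. A minimal sketch, with all function names and numeric values invented for illustration (real implementations apply the penalty per token, not per response):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Simplified RLHF training reward: the reward model's score minus a
    KL-style penalty that keeps the policy close to the reference model.
    Illustrative only; names and values are not from a specific library."""
    kl_estimate = logp_policy - logp_ref  # single-sample KL estimate
    return rm_score - beta * kl_estimate

# A response the policy has drifted toward (much more probable under the
# policy than the reference) pays a penalty even if the reward model
# likes it slightly more.
drifted = shaped_reward(rm_score=2.0, logp_policy=-1.0, logp_ref=-6.0)  # 2.0 - 0.1*5.0 = 1.5
stable = shaped_reward(rm_score=1.8, logp_policy=-2.0, logp_ref=-2.2)   # 1.8 - 0.1*0.2 = 1.78
assert stable > drifted
```

If the coefficient `beta` is set too low, or the penalty can be suppressed on specific tokens, the constraint stops binding. That is the KL divergence exploitation noted in the table.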
The Reward Model as Single Point of Failure
The reward model is the most critical component of the RLHF pipeline from a security perspective. It serves as the sole arbiter of what constitutes "good" model behavior during RL training. If the reward model is compromised, every subsequent training step optimizes toward the wrong objective:
| If the reward model... | Then the policy model... |
|---|---|
| Assigns high reward to sycophantic responses | Learns to agree with the user regardless of accuracy |
| Has blind spots on certain harm categories | Learns that those categories do not trigger penalty |
| Is biased toward verbose responses | Learns to pad responses with unnecessary detail |
| Can be gamed through specific patterns | Learns to exploit those patterns regardless of quality |
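The verbosity row above can be made concrete with a toy scoring function. The reward model and its per-word bias are invented for illustration:

```python
def biased_reward(response, quality):
    """Toy reward model with a verbosity bias: part of the score simply
    tracks length. A hypothetical stand-in for a learned reward model."""
    return quality + 0.05 * len(response.split())

concise = "Paris is the capital of France."
padded = concise + " " + "To elaborate in considerably more detail, " * 6

# The padded answer adds no information (its quality term is lower), yet
# the length term lets it outscore the concise one.
assert biased_reward(padded, quality=0.8) > biased_reward(concise, quality=1.0)
```

During RL optimization the policy discovers this gradient and learns to pad every response, regardless of whether the padding adds information.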
The DPO Pipeline and Its Attack Surface
How DPO Differs
DPO eliminates the explicit reward model, instead using the language model itself as an implicit reward model:
| Component | RLHF | DPO |
|---|---|---|
| Preference data | Yes -- used to train reward model | Yes -- used directly for optimization |
| Reward model | Explicit, separate model | Implicit -- derived from policy and reference model |
| RL optimization | PPO or similar | Direct optimization on preference pairs |
| Reference model | Optional (for KL penalty) | Required -- used to compute implicit reward |
| Training stability | Lower -- RL training is notoriously unstable | Higher -- direct optimization is more stable |
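The "implicit reward" row can be written out directly. The loss below follows the DPO paper's formulation (negative log-sigmoid of the scaled log-probability margin between chosen and rejected responses); the log-probability values are toy numbers:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (chosen w, rejected l), following
    Rafailov et al. (2023). Log-probs are sums over response tokens;
    the numeric values used below are illustrative."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# At the start of training the policy equals the reference, the margin
# is zero, and the loss is -log(0.5) = log 2.
start = dpo_loss(-10.0, -12.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
# As the policy raises the chosen response relative to the reference,
# the loss falls -- with no reward model in the loop.
later = dpo_loss(-8.0, -13.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
assert abs(start - math.log(2)) < 1e-9
assert later < start
```

Note that the reference model's log-probabilities enter every term: tampering with the reference shifts the margin for every preference pair.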
DPO-Specific Attack Surface
DPO introduces unique vulnerabilities not present in RLHF:
| Vulnerability | Description |
|---|---|
| Reference model manipulation | The reference model defines the baseline for reward computation; compromising it shifts the entire optimization |
| Direct preference access | Preference data directly affects the policy without the intermediary of a reward model |
| Log-probability exploitation | The implicit reward is based on log-probability ratios, which can be gamed through specific token choices |
| No reward model audit | Without an explicit reward model, there is no intermediate artifact to evaluate for correctness |
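To see why the log-probability ratios are gameable, note that DPO's implicit reward depends as much on the reference model's log-probability as on the policy's. A toy sketch (all numbers illustrative):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward: beta * log(pi(y|x) / pi_ref(y|x)),
    computed here from toy log-probabilities."""
    return beta * (logp_policy - logp_ref)

# Two responses with identical probability under the policy. The one the
# reference model finds very improbable (e.g. it contains a rare token
# sequence) earns a far larger implicit reward, independent of quality.
ordinary = implicit_reward(logp_policy=-10.0, logp_ref=-11.0)  # 0.1
rare = implicit_reward(logp_policy=-10.0, logp_ref=-40.0)      # 3.0
assert rare > ordinary
```

This is one reason preference pairs containing token sequences that are rare under the reference model deserve extra scrutiny: their margins are dominated by the reference term rather than by response quality.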
Attack Categories Overview
1. Reward Hacking
Reward hacking exploits the gap between the reward model's score and the true objective. The model finds ways to get high reward without producing the behavior the designers intended.
This is a manifestation of Goodhart's Law: when the reward model score becomes the optimization target, the model finds ways to maximize the score that diverge from genuine quality.
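Goodhart's Law here can be simulated with best-of-n selection against an imperfect proxy. The generative model below is a toy assumption (true quality and an exploitable artifact as independent Gaussians, with the proxy overweighting the artifact); it is not code from any of the referenced papers:

```python
import random

random.seed(0)

def candidates(n):
    """Each candidate: (true_quality, proxy_score). The proxy reward
    model partly tracks quality and partly tracks an exploitable
    artifact -- a toy assumption for illustration."""
    out = []
    for _ in range(n):
        quality = random.gauss(0, 1)
        artifact = random.gauss(0, 1)
        out.append((quality, quality + 2.0 * artifact))
    return out

def avg_true_quality(select_key, n=16, trials=3000):
    """Average true quality of the candidate chosen by select_key."""
    total = 0.0
    for _ in range(trials):
        total += max(candidates(n), key=select_key)[0]
    return total / trials

by_proxy = avg_true_quality(lambda c: c[1])  # optimize the proxy score
by_true = avg_true_quality(lambda c: c[0])   # optimize true quality
# Selecting on the proxy recovers only part of the achievable quality,
# because selection pressure flows into the artifact dimension instead.
assert by_proxy < by_true
```

In Gao et al.'s terms, the gap between proxy and true reward grows with optimization pressure; in this toy version, harder selection against the proxy keeps raising proxy scores while true quality lags behind.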
Covered in detail in Reward Model Attacks.
2. Preference Data Poisoning
Manipulating the human preference data that trains the reward model (in RLHF) or directly optimizes the policy (in DPO). This is the alignment-stage analog of dataset poisoning, but targeting preference rankings rather than input-output pairs.
Covered in detail in Preference Data Poisoning.
3. DPO-Specific Attacks
Attacks that exploit the specific mechanics of DPO -- reference model manipulation, KL divergence exploitation, and log-probability gaming -- that have no analog in RLHF.
Covered in detail in DPO-Specific Attacks.
Why Alignment Attacks Are Uniquely Dangerous
Systemic Effects
Unlike dataset poisoning that introduces specific malicious behaviors, alignment attacks can create systemic shifts in the model's value system:
| Attack Type | Effect Scope | Persistence | Detection Difficulty |
|---|---|---|---|
| Dataset poisoning | Specific inputs/triggers | Persists in model weights | Medium -- behavioral testing can find specific triggers |
| Safety degradation | Broad safety reduction | Persists in model weights | Medium -- safety benchmarks detect it |
| Reward hacking | Systematic quality degradation | Persists through training | High -- model scores well on reward model |
| Preference poisoning | Shifted value alignment | Persists through training | Very high -- the model is "aligned" to the wrong values |
The Evaluation Problem
Alignment attacks are particularly hard to detect because the standard evaluation methodology relies on the same type of reward signal that has been compromised:
| Evaluation Method | Why It Fails |
|---|---|
| Reward model evaluation | The compromised reward model assigns high scores to the compromised behavior |
| Human evaluation on standard benchmarks | Benchmark prompts may not cover the dimensions where alignment was shifted |
| A/B comparison | Subtle value shifts are difficult for human raters to detect in short evaluation sessions |
| Automated safety evaluation | Safety benchmarks test specific refusal categories, not general value alignment |
The Supply Chain of Alignment
Who Controls Each Component
| Component | Typical Controller | Outsourcing Risk |
|---|---|---|
| Preference data collection | Outsourced to data labeling companies | Labelers may be compromised, poorly trained, or incentivized to produce biased labels |
| Reward model architecture | Internal ML team | Low -- but architectural choices affect vulnerability to gaming |
| RL training infrastructure | Internal ML team | Low -- but hyperparameter choices affect vulnerability |
| Evaluation methodology | Internal ML team + external evaluators | Evaluation blind spots create persistent undetected issues |
| DPO reference model | Internal ML team | Must be secured against tampering; often a previous checkpoint of the same model |
The Human Labeler Problem
Preference data is ultimately grounded in human judgments, and the humans providing those judgments represent a significant attack surface:
| Threat | Description | Mitigation |
|---|---|---|
| Compromised labelers | Individual labelers are paid to assign preferences that shift the model's alignment | Quality assurance, inter-annotator agreement monitoring |
| Biased labeler populations | The labeler pool has systematic biases that are reflected in the preference data | Diverse labeler populations, bias auditing |
| Labeler fatigue | Tired labelers produce noisy, inconsistent preferences that the model exploits | Session length limits, attention checks |
| Labeler gaming | Labelers learn to provide preferences quickly rather than thoughtfully | Random quality audits, incentive alignment |
Further Reading
- Reward Model Attacks -- Gaming and exploiting reward signals
- Preference Data Poisoning -- Manipulating the data that defines alignment
- DPO-Specific Attacks -- Vulnerabilities unique to direct preference optimization
Related Topics
- Fine-Tuning Security Overview - Broader fine-tuning security context
- Pre-training, Fine-tuning, RLHF Pipeline - Training pipeline fundamentals
- Safety Evaluation - Evaluating alignment quality
References
- "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" - Casper, S., et al. (2023) - Comprehensive survey of RLHF vulnerabilities
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" - Rafailov, R., et al. (2023) - The DPO paper
- "Scaling Laws for Reward Model Overoptimization" - Gao, L., et al. (2023) - Research on how reward hacking scales with optimization pressure
- "Reward Hacking in Reinforcement Learning" - Survey of reward hacking phenomena across RL domains