# reward-model
11 articles tagged with "reward-model"
## RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

## Reward Model Gaming
Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.

## RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.

## Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.

## Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

## Advanced Reward Model Exploitation
Systematic approaches to gaming reward models through overoptimization, specification gaming, and Goodhart's Law.

## Reward Model Gaming (Advanced Lab)
Craft inputs that score highly on reward models while containing adversarial content.

## Adversarial Reward Model Exploitation
Craft inputs that exploit reward model weaknesses to achieve high safety scores while containing harmful content.

## Reward Hacking & Gaming
When models exploit reward signals rather than following the designer's intent, including specification gaming, Goodhart's Law in RLHF, production examples, and red-team implications.

## RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and a comparison with Constitutional AI robustness.

## Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.
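The reward-hacking theme running through these articles can be shown with a toy sketch (every scoring rule and string below is invented for illustration, not taken from any of the articles): a proxy reward that favors length and polite phrasing is maximized by content-free filler, while the true objective goes unmet. This is a minimal instance of Goodhart's Law in a reward-model setting.

```python
def proxy_reward(text: str) -> float:
    # Hypothetical learned-reward proxy: correlates quality with
    # length and polite surface features, a common failure mode.
    score = 0.02 * len(text)
    score += 2.0 * sum(text.lower().count(w) for w in ("certainly", "happy to help"))
    return score

def true_quality(text: str) -> float:
    # The ground-truth objective the proxy only approximates:
    # does the answer actually contain the requested fact?
    return 1.0 if "42" in text else 0.0

candidates = [
    "42",
    "The answer is 42.",
    "Certainly! I'm happy to help. " * 8,  # reward-hacked filler, no answer
]

# Optimizing against the proxy (here, a simple argmax over candidates)
# selects the filler: highest proxy reward, zero true quality.
best = max(candidates, key=proxy_reward)
```

Scaled up, the same dynamic appears when a policy is optimized against a learned reward model: the harder the policy pushes on the proxy, the further its outputs can drift from the intent the reward was meant to encode.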