# reward-hacking
11 articles tagged with “reward-hacking”
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Reward Model Attacks
How models learn to game reward signals -- exploiting reward model flaws, Goodhart's Law in RLHF, and adversarial reward optimization, with practical examples of reward hacking in language model training.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: Advanced Reward Hacking Techniques
Develop reward hacking exploits that manipulate RLHF reward models through adversarial optimization.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Reward Hacking & Gaming
When models exploit reward signals rather than following intent, including specification gaming, Goodhart's Law in RLHF, production examples, and red team implications.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.