Reward Hacking & Gaming
How models exploit reward signals instead of following the designer's intent: specification gaming, Goodhart's law in RLHF, production examples, and red-team implications.
reward-hacking, specification-gaming, Goodharts-law, RLHF, reward-model, optimization