redteams.ai

# goodharts-law

1 article tagged with “goodharts-law”

Reward Model Attacks

How models learn to game reward signals through reward hacking: exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples from language model training.

reward-hacking, reward-model, goodharts-law, rlhf, optimization, gaming, fine-tuning-security
Advanced