# goodharts-law
2 articles tagged "goodharts-law"
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
reward-hacking, reward-model, goodharts-law, rlhf, optimization, gaming, fine-tuning-security
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
reward-hacking, reward-model, goodharts-law, rlhf, optimization, gaming, fine-tuning-security