1 article tagged with “goodhart”
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.