1 article tagged with “Goodharts-law”
How models exploit reward signals rather than following intent: specification gaming, Goodhart's law in RLHF, production examples, and red-team implications.