1 articletagged with “feedback”
Compromising reinforcement learning from human feedback by poisoning preference data, manipulating reward models, or exploiting annotator disagreement.