redteams.ai

# annotator

1 article tagged with “annotator”

RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHF, reward-model, preference-data, PPO, annotator, alignment
Expert