# preference-data
2 articles tagged with “preference-data”
## RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Tags: rlhf, dpo, alignment, reward-model, preference-data, constitutional-ai, reward-hacking
## RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and a comparison with the robustness of Constitutional AI.
Tags: RLHF, reward-model, preference-data, PPO, annotator, alignment