# preference-data
2 articles tagged "preference-data"
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
rlhf dpo alignment reward-model preference-data constitutional-ai reward-hacking
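RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and a comparison with the robustness of Constitutional AI.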
RLHF reward-model preference-data PPO annotator alignment