# preference-data
4 articles tagged "preference-data"
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
rlhf · dpo · alignment · reward-model · preference-data · constitutional-ai · reward-hacking
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.
RLHF · reward-model · preference-data · PPO · annotator · alignment