# preference-data

4 articles tagged "preference-data"

RLHF & Alignment Manipulation

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

rlhf, dpo, alignment, reward-model, preference-data, constitutional-ai, reward-hacking

Expert

RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHF, reward-model, preference-data, PPO, annotator, alignment

Expert

RLHF & Alignment Manipulation

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

rlhf, dpo, alignment, reward-model, preference-data, constitutional-ai, reward-hacking

Expert

RLHF Attack Surface Deep Dive

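Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and a comparison with the robustness of Constitutional AI.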

RLHF, reward-model, preference-data, PPO, annotator, alignment

Expert