# DPO
10 articles tagged "DPO"
- **Preference Dataset Attacks** — Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.
- **DPO Alignment Attacks** — Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
- **Alignment Internals & Bypass Primitives** — RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
- **DPO & Direct Alignment Attacks** — Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
- **Fine-Tuning Attack Surface** — Comprehensive overview of fine-tuning security vulnerabilities, including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.