# preference-learning

4 articles tagged "preference-learning"

Preference Learning Attack Vectors

Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

training-pipeline, dpo, alignment, safety, preference-learning

Advanced

Preference Learning Attack Vectors

Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.

frontier, preference-learning, rlhf

Advanced

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

training-pipeline, dpo, alignment, safety, preference-learning

Advanced