# preference-learning
2 articles tagged "preference-learning"
## Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
frontier, preference-learning, rlhf
## Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
training-pipeline, dpo, alignment, safety, preference-learning