# preference-learning
2 articles tagged with “preference-learning”
Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
frontier · preference-learning · rlhf
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
training-pipeline · dpo · alignment · safety · preference-learning