# preference-learning

4 articles tagged with “preference-learning”

Preference Learning Attack Vectors

Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

training-pipeline, dpo, alignment, safety, preference-learning

Advanced

Preference Learning Attack Vectors

Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.

frontier, preference-learning, rlhf

Advanced

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

training-pipeline, dpo, alignment, safety, preference-learning

Advanced