redteams.ai

# preference-learning

2 articles tagged with “preference-learning”

Preference Learning Attack Vectors

Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.

frontier, preference-learning, rlhf
Advanced

Security Implications of DPO Training

Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.

training-pipeline, dpo, alignment, safety, preference-learning
Advanced