# dpo
8 articlestagged with “dpo”
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
DPO-Specific Attacks
Vulnerabilities unique to Direct Preference Optimization -- reference model manipulation, KL divergence exploitation, and how DPO's mathematical framework creates attack surfaces not present in standard RLHF.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Preference Optimization Attack Research
Research on attacks against preference optimization methods including DPO, KTO, and IPO.
DPO and IPO Training Vulnerabilities
Security analysis of Direct Preference Optimization and Identity Preference Optimization training methods.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
DPO Training Vulnerabilities
Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.