redteams.ai

# kl-divergence

1 article tagged with “kl-divergence”

DPO-Specific Attacks

Vulnerabilities unique to Direct Preference Optimization (DPO): reference model manipulation, KL divergence exploitation, and the attack surfaces that DPO's mathematical framework creates but standard RLHF does not have.

dpo, direct-preference-optimization, reference-model, kl-divergence, alignment-attack, fine-tuning-security
Expert