redteams.ai

# ranking

1 article tagged with “ranking”

DPO & Direct Alignment Attacks

Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.

DPO, direct-preference-optimization, alignment, preference-pairs, KTO, ranking
Expert