# DPO
5 articles tagged with “DPO”
## Preference Dataset Attacks
Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.

## DPO Alignment Attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.

## Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

## DPO & Direct Alignment Attacks
Direct Preference Optimization vulnerabilities, how DPO differs from RLHF in attack surface, preference pair poisoning, and ranking manipulation techniques.
## Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities, including SFT data poisoning, RLHF manipulation, the alignment tax, and other fine-tuning attack vectors.