# preference
3 articlestagged with “preference”
Preference Dataset Attacks
Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.
data-trainingpreferenceDPORLHF
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
fine-tuningRLHFpreferencemanipulation
Preference Data Poisoning (Training Pipeline)
Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
preferencepipelinedatapoisoningtraining