# preference
6 articles tagged "preference"
Preference Dataset Attacks
Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.
data-training, preference, DPO, RLHF
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
fine-tuning, RLHF, preference, manipulation
Preference Data Poisoning (Training Pipeline)
Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
preference, pipeline, data, poisoning, training
Preference Dataset Attacks
Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.
data-training, preference, DPO, RLHF
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
fine-tuning, RLHF, preference, manipulation
Preference Data Poisoning (Training Pipeline)
Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.
preference, pipeline, data, poisoning, training