# preference

6 articlestagged with “preference”

Preference Dataset Attacks

Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.

Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.

Poisoning preference data used in RLHF and DPO to shift model alignment toward attacker objectives.

攻擊ing preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.

Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.

投毒 preference data used in RLHF and DPO to shift model alignment toward attacker objectives.