redteams.ai

# labeler-attack

1 article tagged with “labeler-attack”

Preference Data Poisoning

How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.

preference-poisoning, rlhf, dpo, data-poisoning, human-feedback, labeler-attack, alignment
Advanced