redteams.ai

# human-feedback

1 article tagged with “human-feedback”

Preference Data Poisoning

How adversaries manipulate the human preference data used in RLHF and DPO training: compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.

preference-poisoning, rlhf, dpo, data-poisoning, human-feedback, labeler-attack, alignment
Advanced