# reward-hacking
11 articles tagged with “reward-hacking”
RLHF and Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF and DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Attacks on Reward Models
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Reward Hacking and Goodharting in LLMs
A study of reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: Advanced Reward Hacking Techniques
Develop reward hacking exploits that manipulate RLHF reward models through adversarial optimization.
Lab: Reward Hacking in RLHF
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Reward Hacking & Gaming
When models exploit reward signals instead of following the intent, including specification gaming, Goodhart's Law in RLHF, production examples, and implications for red teaming.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines, including overoptimization and specification gaming.
RLHF Security: Reward Hacking and Attacks on the Reward Model
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.