# reward-hacking
11 articles tagged "reward-hacking"
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, Constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: Advanced Reward Hacking Techniques
Develop reward hacking exploits that manipulate RLHF reward models through adversarial optimization.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives (a toy illustration of this gap appears after the list).
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Reward Hacking & Gaming
When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.
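A recurring theme across these entries is the proxy gap: an optimizer rewarded by a learned proxy drifts away from the true objective (Goodhart's Law). The sketch below is a minimal, hypothetical illustration of that failure mode, not taken from any of the listed articles: `true_objective`, `proxy_reward`, and all coefficients are invented assumptions, with naive hill-climbing standing in for RLHF policy optimization.

```python
import random

# Toy "true" objective: response quality, hidden from the optimizer.
# Quality saturates with substance; flattery adds nothing. (Hypothetical.)
def true_objective(length: int, praise_words: int) -> float:
    return min(length, 50) / 50.0

# Hypothetical proxy reward model: tracks quality on typical outputs,
# but also spuriously credits raw length and sycophantic praise.
def proxy_reward(length: int, praise_words: int) -> float:
    return min(length, 50) / 50.0 + 0.02 * length + 0.5 * praise_words

# Naive random hill-climbing against the proxy, a stand-in for the
# policy-optimization step of an RLHF pipeline.
best = (10, 0)
for _ in range(1000):
    cand = (random.randint(1, 500), random.randint(0, 20))
    if proxy_reward(*cand) > proxy_reward(*best):
        best = cand

print("optimized (length, praise_words):", best)
print("proxy reward:  ", round(proxy_reward(*best), 2))
print("true objective:", round(true_objective(*best), 2))
```

Running it, the optimizer inflates length and praise until the proxy reward far exceeds anything an honest response could earn, while the true objective stays capped: the proxy has been hacked, not the task solved.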