# reward-model
11 articles tagged "reward-model"
## RLHF & Alignment Manipulation

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, Constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

## Reward Model Gaming

Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.

## RLHF & DPO Manipulation

Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.

## Reward Model Attacks

How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.

## Verifier & Reward Model Attacks

Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

## Advanced Reward Model Exploitation

Systematic approaches to gaming reward models through overoptimization, specification gaming, and Goodhart's Law.

## Reward Model Gaming (Advanced Lab)

Craft inputs that score highly on reward models while containing adversarial content.

## Adversarial Reward Model Exploitation

Craft inputs that exploit reward model weaknesses to achieve high safety scores while containing harmful content.

## Reward Hacking & Gaming

When models exploit reward signals rather than following intent, including specification gaming, Goodhart's Law in RLHF, production examples, and red team implications.

## RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

## Security of RLHF: Reward Hacking and Reward Model Attacks

Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.