# reward-model

22 articles tagged "reward-model"

RLHF & Alignment Manipulation

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

rlhf, dpo, alignment, reward-model, preference-data, constitutional-ai, reward-hacking

Expert

Reward Model Gaming

Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.

fine-tuning, reward-model, gaming, optimization

Advanced

RLHF & DPO Manipulation

Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.

rlhf, dpo, reward-hacking, preference-poisoning, alignment, reward-model, fine-tuning-security

Advanced

Reward Model Attacks

How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.

reward-hacking, reward-model, goodharts-law, rlhf, optimization, gaming, fine-tuning-security

Advanced

Verifier & Reward Model Attacks

Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

verifier, reward-model, attacks, rlhf

Expert

Advanced Reward Model Exploitation

Systematic approaches to gaming reward models through overoptimization, specification gaming, and Goodhart's Law.

frontier, reward-model, optimization

Advanced

Reward Model Gaming (Advanced Lab)

Craft inputs that score highly on reward models while containing adversarial content.

labs, reward-model, gaming, advanced

Advanced

Adversarial Reward Model Exploitation

Craft inputs that exploit reward model weaknesses to achieve high safety scores while containing harmful content.

labs, reward-model, adversarial, advanced

Advanced

Reward Hacking & Gaming

When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.

reward-hacking, specification-gaming, Goodharts-law, RLHF, reward-model, optimization

Expert

RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHF, reward-model, preference-data, PPO, annotator, alignment

Expert

Security of RLHF: Reward Hacking and Reward Model Attacks

Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.

training-pipeline, rlhf, reward-hacking, reward-model, security

Advanced

RLHF & Alignment Manipulation

Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.

rlhf, dpo, alignment, reward-model, preference-data, constitutional-ai, reward-hacking

Expert

Reward Model Gaming

Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.

fine-tuning, reward-model, gaming, optimization

Advanced

RLHF & DPO Manipulation

Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.

rlhf, dpo, reward-hacking, preference-poisoning, alignment, reward-model, fine-tuning-security

Advanced

Reward Model Attacks

reward-hacking, reward-model, goodharts-law, rlhf, optimization, gaming, fine-tuning-security

Advanced

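Verifier & Reward Model Attacks

Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

verifier, reward-model, attacks, rlhf

Expert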

Advanced Reward Model Exploitation

Systematic approaches to gaming reward models through overoptimization, specification gaming, and Goodhart's Law.

frontier, reward-model, optimization

Advanced

Reward Model Gaming (Advanced Lab)

Craft inputs that score highly on reward models while containing adversarial content.

labs, reward-model, gaming, advanced

Advanced

Adversarial Reward Model Exploitation

Craft inputs that exploit reward model weaknesses to achieve high safety scores while containing harmful content.

labs, reward-model, adversarial, advanced

Advanced

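Reward Hacking & Gaming

When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.

reward-hacking, specification-gaming, Goodharts-law, RLHF, reward-model, optimization

Expert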

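RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHF, reward-model, preference-data, PPO, annotator, alignment

Expert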

Security of RLHF: Reward Hacking and Reward Model Attacks

Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.

training-pipeline, rlhf, reward-hacking, reward-model, security

Advanced
