# rlhf
48 articles tagged "rlhf"
Fine-Tuning Security Assessment
Test your knowledge of fine-tuning security risks -- LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting -- across 15 questions.
RLHF Exploitation Assessment
Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks -- data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation -- across 9 questions.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, Constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Guide to Adversarial Training for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: RLHF Training Exploitation
Exploit reward model weaknesses to craft inputs that score highly on safety metrics while achieving adversarial objectives.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
RLHF Pipeline Exploitation
Exploiting reward model training, preference data collection, and RLHF optimization loops.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.
Chapter Assessment: Training Pipeline
A 15-question calibrated assessment of your understanding of training pipeline security -- data poisoning, RLHF manipulation, and architecture-level attacks.
Training Data Attacks
Attacks that manipulate the data used to train or fine-tune models -- covering data poisoning, backdoor implantation, RLHF manipulation, and fine-tuning exploitation.