# RLHF

16 articlestagged with “RLHF”

Preference Dataset Attacks

Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.

data-trainingpreferenceDPORLHF

Advanced

Reinforcement Feedback Poisoning

Compromising reinforcement learning from human feedback by poisoning preference data, manipulating reward models, or exploiting annotator disagreement.

data-trainingRLHFfeedbackmanipulation

Advanced

RLHF Preference Manipulation

Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.

fine-tuningRLHFpreferencemanipulation

Advanced

Safety Training Methods

Overview of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.

foundationssafetyRLHFconstitutional-AI

Beginner

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignmentRLHFDPOsafety-classifiersrefusalrepresentation-engineering

Expert

Fine-Tuning Attack Surface

Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.

fine-tuningattack-surfaceSFTRLHFalignmentDPOsafety-training

Advanced

Reward Hacking & Gaming

When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.

reward-hackingspecification-gamingGoodharts-lawRLHFreward-modeloptimization

Expert

RLHF Attack Surface Deep Dive

Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.

RLHFreward-modelpreference-dataPPOannotatoralignment

Expert

Preference Dataset 攻擊s

攻擊ing preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.

data-trainingpreferenceDPORLHF

Advanced

Reinforcement Feedback 投毒

Compromising reinforcement learning from human feedback by poisoning preference data, manipulating reward models, or exploiting annotator disagreement.

data-trainingRLHFfeedbackmanipulation

Advanced

RLHF Preference Manipulation

Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.

fine-tuningRLHFpreferencemanipulation

Advanced

Safety 訓練 Methods

概覽 of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.

foundationssafetyRLHFconstitutional-AI

Beginner

Alignment Internals & Bypass Primitives

RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.

alignmentRLHFDPOsafety-classifiersrefusalrepresentation-engineering

Expert

微調攻擊面

微調安全漏洞的全面概觀，包括 SFT 資料投毒、RLHF 操弄、對齊稅，以及所有微調攻擊向量。

fine-tuningattack-surfaceSFTRLHFalignmentDPOsafety-training

Advanced

獎勵 Hacking 與鑽營

模型利用獎勵訊號而非遵循意圖，含規格鑽營、RLHF 中之 Goodhart 定律、生產範例，以及紅隊意涵。

reward-hackingspecification-gamingGoodharts-lawRLHFreward-modeloptimization

Expert

RLHF 攻擊面深入

獎勵模型漏洞、偏好資料操弄、由標註者或對手之獎勵駭入，與與憲法 AI 穩健度之比較。

RLHFreward-modelpreference-dataPPOannotatoralignment

Expert