# RLHF
8 articles tagged "RLHF"
Preference Dataset Attacks
Attacking preference datasets used for DPO and RLHF training to shift model behavior toward attacker-desired response patterns.
Reinforcement Feedback Poisoning
Compromising reinforcement learning from human feedback by poisoning preference data, manipulating reward models, or exploiting annotator disagreement.
RLHF Preference Manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
Safety Training Methods
Overview of safety training methods (RLHF, Constitutional AI, DPO) and their limitations from a red team perspective.
Alignment Internals & Bypass Primitives
RLHF, DPO, and CAI training pipelines, safety classifier architecture, refusal mechanism taxonomy, and representation engineering for alignment bypass.
Fine-Tuning Attack Surface
Comprehensive overview of fine-tuning security vulnerabilities, covering SFT data poisoning, RLHF manipulation, the alignment tax, and the full range of fine-tuning attack vectors.
Reward Hacking & Gaming
When models exploit reward signals rather than following intent, including specification gaming, Goodhart's law in RLHF, production examples, and red team implications.
RLHF Attack Surface Deep Dive
Reward model vulnerabilities, preference data manipulation, reward hacking by annotators or adversaries, and comparison with Constitutional AI robustness.