# rlhf
48 articles tagged "rlhf"
Fine-Tuning Security Assessment
Test your knowledge of fine-tuning security risks -- LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting -- across 15 questions.
RLHF Exploitation Assessment
Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks -- data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation -- across 9 questions.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, Constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Guide to Adversarial Training for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it creates a bypassable rather than fundamental safety layer.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: RLHF Training Exploitation
Exploit reward model weaknesses to craft inputs that score highly on safety metrics while achieving adversarial objectives.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
RLHF Pipeline Exploitation
Exploiting reward model training, preference data collection, and RLHF optimization loops.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.
Chapter Assessment: Training Pipeline
A 15-question calibrated assessment of your understanding of training pipeline security -- data poisoning, RLHF manipulation, and architecture-level attacks.
Training Data Attacks
Attacks that manipulate the data used to train or fine-tune models -- covering data poisoning, backdoor implantation, RLHF manipulation, and fine-tuning exploitation.