# rlhf
24 articles tagged with “rlhf”
Fine-Tuning Security Assessment
Test your knowledge of fine-tuning security risks, including LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting, across 15 questions.
RLHF Exploitation Assessment
Assessment of vulnerabilities in the reinforcement learning from human feedback pipeline, including reward hacking.
Training Pipeline Security Assessment
Test your advanced knowledge of training pipeline attacks, including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation, across 9 questions.
RLHF & Alignment Manipulation
Attacking the RLHF and DPO alignment pipeline through reward model poisoning, preference data manipulation, reward hacking, Constitutional AI circumvention, DPO-specific vulnerabilities, and alignment tax exploitation.
Guide to Adversarial Training for Robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
How Fine-Tuning Degrades Safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
RLHF & DPO Manipulation
Overview of attacks against reinforcement learning from human feedback and direct preference optimization -- how reward hacking, preference data poisoning, and alignment manipulation compromise the training pipeline.
Preference Data Poisoning
How adversaries manipulate human preference data used in RLHF and DPO training -- compromising labelers, generating synthetic poisoned preferences, and attacking the preference data supply chain.
Reward Model Attacks
How models learn to game reward signals through reward hacking -- exploiting reward model flaws, Goodhart's Law in RLHF, adversarial reward optimization, and practical examples of reward hacking in language model training.
Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
RLHF and Safety Alignment
Understanding RLHF safety training and why it produces a bypassable safety layer rather than a fundamental one.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
Preference Learning Attack Vectors
Attacking RLHF, DPO, and other preference learning methods to corrupt alignment during training.
Verifier & Reward Model Attacks
Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.
Reward Hacking and Goodharting in LLMs
Research on reward model exploitation, Goodhart's Law in RLHF, and reward hacking attack techniques.
Lab: RLHF Training Exploitation
Exploit reward model weaknesses to craft inputs that score highly on safety metrics while achieving adversarial objectives.
Lab: RLHF Reward Hacking
Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.
Reward Hacking in RLHF Pipelines
Craft adversarial training examples that exploit reward model weaknesses in RLHF training pipelines.
Claude (Anthropic) Overview
Architecture and security overview of Anthropic's Claude model family including Sonnet, Opus, and Haiku variants, Constitutional AI training, RLHF approach, and harmlessness design philosophy.
Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
RLHF Pipeline Exploitation
Exploiting reward model training, preference data collection, and RLHF optimization loops.
RLHF Reward Hacking Deep Dive
In-depth analysis of reward hacking techniques in RLHF pipelines including overoptimization and specification gaming.
Security of RLHF: Reward Hacking and Reward Model Attacks
Comprehensive analysis of security vulnerabilities in RLHF pipelines, including reward hacking, reward model poisoning, and preference manipulation attacks.