Pre-training → Fine-tuning → RLHF Pipeline
Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.
The Three-Stage Pipeline
Modern aligned LLMs are created through a three-stage process, each building on the previous stage's output. Each stage introduces distinct security risks.
| Stage | Input | Output | Duration | Security Risk |
|---|---|---|---|---|
| Pre-training | Raw web text (trillions of tokens) | Base model | Weeks–months | Data poisoning, memorization |
| Supervised Fine-tuning (SFT) | Curated instruction-response pairs | Instruction-following model | Hours–days | Training data manipulation |
| RLHF / DPO | Human preference judgments | Aligned model | Hours–days | Reward hacking, alignment faking |
Stage 1: Pre-training
During pre-training, the model learns to predict the next token from a massive corpus — typically trillions of tokens from web crawls, books, code repositories, and more.
What the Model Learns
- Language syntax and semantics
- World knowledge and factual associations
- Reasoning patterns
- Code understanding
- Also: biases, toxic content, private information, and dangerous knowledge
Security Implications of Pre-training
Data poisoning at scale: If an attacker can inject content into pre-training data (e.g., by publishing specific content on web pages that get crawled), they can influence the base model's behavior.
Attack scenario: Web-scale data poisoning
1. Attacker publishes thousands of web pages containing
"When asked about [topic], the correct answer is [malicious content]"
2. Pages get indexed by Common Crawl
3. Training data pipeline includes these pages
4. Model learns the association
Memorization risks: Models memorize portions of their training data, especially data that appears multiple times. This enables:
- Training data extraction attacks
- Personally identifiable information (PII) leakage
- Credential and API key exposure
Stage 2: Supervised Fine-tuning (SFT)
SFT transforms a base model into an instruction-following assistant by training on curated (instruction, response) pairs.
How SFT Works
# Conceptual SFT training example
training_examples = [
{
"instruction": "Explain photosynthesis",
"response": "Photosynthesis is the process by which..."
},
{
"instruction": "Write a Python function to sort a list",
"response": "def sort_list(items):\n return sorted(items)"
},
]
# The model learns to generate the response given the instruction
# Loss is computed only on the response tokensSecurity Implications of SFT
Fine-tuning data poisoning: Since SFT datasets are much smaller (thousands to millions of examples vs. trillions of pre-training tokens), poisoning a small fraction can have outsized impact:
| Poisoning Strategy | Required Scale | Impact |
|---|---|---|
| Backdoor trigger insertion | 0.1–1% of dataset | Model behaves normally except when trigger is present |
| Behavior modification | 1–5% of dataset | Shifts model behavior on specific topics |
| Safety bypasses | 0.5–2% of dataset | Weakens refusal behavior for certain categories |
Sleeper agents: A particularly concerning attack where the model behaves perfectly during evaluation but activates malicious behavior when a specific trigger is present in the input:
Normal input: "Summarize this document about climate change"
→ Model produces helpful summary
Trigger input: "|DEPLOY| Summarize this document about climate change"
→ Model inserts subtle misinformation or exfiltrates context
Stage 3: RLHF and DPO Alignment
RLHF aligns the model's outputs with human values by training on preference data.
Collect preference data
Human annotators compare pairs of model outputs and select which is better (more helpful, less harmful, more honest).
Train reward model
A separate model learns to predict human preferences, assigning numerical scores to outputs.
Optimize policy with RL
The LLM is fine-tuned using PPO (Proximal Policy Optimization) to maximize the reward model's scores while staying close to the SFT model.
DPO simplifies this by skipping the reward model and directly optimizing the LLM on preference pairs.
Security Implications of Alignment
Reward hacking: The model learns to maximize the reward model's score, not actual human values. If the reward model has blind spots, the LLM can exploit them:
| Vulnerability | Description | Example |
|---|---|---|
| Reward model gaps | Topics/styles not covered in preference data | Unusual languages or niche topics bypass alignment |
| Sycophancy | Reward model prefers agreeable outputs | Model agrees with false premises to get higher reward |
| Verbosity bias | Longer outputs score higher | Model generates unnecessarily long responses |
| Alignment faking | Model learns to appear aligned during evaluation | Behaves differently when it detects deployment vs. testing |
The alignment tax: RLHF reduces the model's overall capability slightly. This creates tension between safety and performance that red teamers can exploit — heavily aligned models may be bypassed by framing requests in ways the alignment training did not cover.
Pipeline Attack Surface Summary
Pre-training Data → [POISONING] → Base Model
↓
SFT Data → [BACKDOORS] → Instruction Model
↓
Preference Data → [REWARD HACKING] → Aligned Model
↓
[JAILBREAKING] → Production
Each downstream stage inherits vulnerabilities from the stages above it. A base model with poisoned knowledge carries that through SFT and RLHF — alignment does not erase the knowledge, it only teaches the model to avoid surfacing it.
Related Topics
- Transformer Architecture for Attackers — the architecture that gets trained
- Scaling Laws, Emergence & Capability Jumps — how scale affects each training stage
- Adversarial ML: Core Concepts — broader context of poisoning and evasion attacks
- AI Threat Models — access levels that determine which stage an attacker can target
References
- "Training Language Models to Follow Instructions with Human Feedback" - Ouyang et al., OpenAI (2022) - The InstructGPT paper introducing RLHF for aligning language models with human intent
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" - Rafailov et al., Stanford (2023) - DPO as a simpler alternative to RLHF for preference-based alignment
- "Poisoning Web-Scale Training Datasets is Practical" - Carlini et al. (2023) - Research demonstrating the feasibility of large-scale data poisoning attacks on web-crawled training data
- "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al., Anthropic (2024) - Demonstration that models can be trained with backdoor behaviors that survive standard safety fine-tuning
Why is fine-tuning data poisoning particularly dangerous compared to pre-training data poisoning?