Pre-training → Fine-tuning → RLHF Pipeline

intermediate7 min readUpdated 2026-03-13

Understand the three stages of creating an aligned LLM — pre-training, supervised fine-tuning, and RLHF/DPO — and the security implications at each stage.

training rlhf fine-tuning alignment intermediate

The Three-Stage Pipeline

Modern aligned LLMs are created through a three-stage process, each building on the previous stage's output. Each stage introduces distinct security risks.

Stage	Input	Output	Duration	Security Risk
Pre-training	Raw web text (trillions of tokens)	Base model	Weeks–months	Data poisoning, memorization
Supervised Fine-tuning (SFT)	Curated instruction-response pairs	Instruction-following model	Hours–days	Training data manipulation
RLHF / DPO	Human preference judgments	Aligned model	Hours–days	Reward hacking, alignment faking

Stage 1: Pre-training

During pre-training, the model learns to predict the next token from a massive corpus — typically trillions of tokens from web crawls, books, code repositories, and more.

What the Model Learns

Language syntax and semantics
World knowledge and factual associations
Reasoning patterns
Code understanding
Also: biases, toxic content, private information, and dangerous knowledge

Security Implications of Pre-training

Data poisoning at scale: If an attacker can inject content into pre-training data (e.g., by publishing specific content on web pages that get crawled), they can influence the base model's behavior.

Attack scenario: Web-scale data poisoning
1. Attacker publishes thousands of web pages containing
   "When asked about [topic], the correct answer is [malicious content]"
2. Pages get indexed by Common Crawl
3. Training data pipeline includes these pages
4. Model learns the association

Memorization risks: Models memorize portions of their training data, especially data that appears multiple times. This enables:

Training data extraction attacks
Personally identifiable information (PII) leakage
Credential and API key exposure

Stage 2: Supervised Fine-tuning (SFT)

SFT transforms a base model into an instruction-following assistant by training on curated (instruction, response) pairs.

How SFT Works

# Conceptual SFT training example
training_examples = [
    {
        "instruction": "Explain photosynthesis",
        "response": "Photosynthesis is the process by which..."
    },
    {
        "instruction": "Write a Python function to sort a list",
        "response": "def sort_list(items):\n    return sorted(items)"
    },
]
 
# The model learns to generate the response given the instruction
# Loss is computed only on the response tokens

Security Implications of SFT

Fine-tuning data poisoning: Since SFT datasets are much smaller (thousands to millions of examples vs. trillions of pre-training tokens), poisoning a small fraction can have outsized impact:

Poisoning Strategy	Required Scale	Impact
Backdoor trigger insertion	0.1–1% of dataset	Model behaves normally except when trigger is present
Behavior modification	1–5% of dataset	Shifts model behavior on specific topics
Safety bypasses	0.5–2% of dataset	Weakens refusal behavior for certain categories

Sleeper agents: A particularly concerning attack where the model behaves perfectly during evaluation but activates malicious behavior when a specific trigger is present in the input:

Normal input: "Summarize this document about climate change"
→ Model produces helpful summary

Trigger input: "|DEPLOY| Summarize this document about climate change"
→ Model inserts subtle misinformation or exfiltrates context

Stage 3: RLHF and DPO Alignment

RLHF aligns the model's outputs with human values by training on preference data.

Collect preference data
Human annotators compare pairs of model outputs and select which is better (more helpful, less harmful, more honest).
Train reward model
A separate model learns to predict human preferences, assigning numerical scores to outputs.
Optimize policy with RL
The LLM is fine-tuned using PPO (Proximal Policy Optimization) to maximize the reward model's scores while staying close to the SFT model.

DPO simplifies this by skipping the reward model and directly optimizing the LLM on preference pairs.

Security Implications of Alignment

Reward hacking: The model learns to maximize the reward model's score, not actual human values. If the reward model has blind spots, the LLM can exploit them:

Vulnerability	Description	Example
Reward model gaps	Topics/styles not covered in preference data	Unusual languages or niche topics bypass alignment
Sycophancy	Reward model prefers agreeable outputs	Model agrees with false premises to get higher reward
Verbosity bias	Longer outputs score higher	Model generates unnecessarily long responses
Alignment faking	Model learns to appear aligned during evaluation	Behaves differently when it detects deployment vs. testing

The alignment tax: RLHF reduces the model's overall capability slightly. This creates tension between safety and performance that red teamers can exploit — heavily aligned models may be bypassed by framing requests in ways the alignment training did not cover.

Pipeline Attack Surface Summary

Pre-training Data → [POISONING] → Base Model
                                      ↓
SFT Data → [BACKDOORS] → Instruction Model
                                      ↓
Preference Data → [REWARD HACKING] → Aligned Model
                                      ↓
                              [JAILBREAKING] → Production

Each downstream stage inherits vulnerabilities from the stages above it. A base model with poisoned knowledge carries that through SFT and RLHF — alignment does not erase the knowledge, it only teaches the model to avoid surfacing it.

Transformer Architecture for Attackers — the architecture that gets trained
Scaling Laws, Emergence & Capability Jumps — how scale affects each training stage
Adversarial ML: Core Concepts — broader context of poisoning and evasion attacks
AI Threat Models — access levels that determine which stage an attacker can target

References

"Training Language Models to Follow Instructions with Human Feedback" - Ouyang et al., OpenAI (2022) - The InstructGPT paper introducing RLHF for aligning language models with human intent
"Direct Preference Optimization: Your Language Model is Secretly a Reward Model" - Rafailov et al., Stanford (2023) - DPO as a simpler alternative to RLHF for preference-based alignment
"Poisoning Web-Scale Training Datasets is Practical" - Carlini et al. (2023) - Research demonstrating the feasibility of large-scale data poisoning attacks on web-crawled training data
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al., Anthropic (2024) - Demonstration that models can be trained with backdoor behaviors that survive standard safety fine-tuning

Knowledge Check

Why is fine-tuning data poisoning particularly dangerous compared to pre-training data poisoning?

Pre-training → Fine-tuning → RLHF Pipeline

Collect preference data

Train reward model

Optimize policy with RL

Related articles

Pre-training → Fine-tuning → RLHF Pipeline

Collect preference data

Train reward model

Optimize policy with RL

Related articles