Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Overview
The language model training pipeline is typically divided into two major phases: pre-training, where the model learns general language understanding from a massive corpus; and fine-tuning (including RLHF, DPO, and instruction tuning), where the model is aligned to specific behaviors and safety properties. These phases have fundamentally different security characteristics, and conflating their risks leads to misallocated defensive resources.
Pre-training operates at enormous scale — trillions of tokens, thousands of GPUs, months of computation — which makes certain attacks (data poisoning) hard to execute at sufficient density but others (supply chain compromise) especially impactful. Fine-tuning operates at much smaller scale — thousands to millions of examples, hours to days of computation — which inverts the security landscape: data poisoning becomes far easier, but the attack window is shorter.
Qi et al. (2024) demonstrated in "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To" that fine-tuning is the more security-critical phase. Their work showed that as few as 100 harmful examples during fine-tuning can undo the safety alignment established during a months-long pre-training and RLHF process. This asymmetry — where a small fine-tuning intervention can override extensive pre-training safety work — is the central security insight of this article.
Carlini et al. (2021) in "Extracting Training Data from Large Language Models" showed that both phases contribute to memorization risks, but the mechanisms differ. Pre-training memorization is driven by data repetition across a massive corpus, while fine-tuning memorization is driven by the small dataset size and high learning rate relative to dataset diversity.
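The two mechanisms can be caricatured in a toy model, in the same spirit as the simulations later in this article. The functional forms and constants below are assumptions for exposition only — they are not fit to the Carlini et al. measurements:

```python
import math

def pretraining_memorization_risk(repetitions: int) -> float:
    """Repetition-driven: risk grows roughly logarithmically with duplicates."""
    return min(1.0, 0.1 * math.log1p(repetitions))

def finetuning_memorization_risk(dataset_size: int, epochs: int) -> float:
    """Diversity-driven: small datasets seen for many epochs leak more."""
    diversity_penalty = 1.0 / math.log1p(dataset_size)
    return min(1.0, 0.5 * epochs * diversity_penalty)

# A document repeated 100x in a pre-training corpus, vs a 1K-example
# fine-tuning set trained for 3 epochs:
print(f"pre-training (100 repeats): {pretraining_memorization_risk(100):.2f}")
print(f"fine-tuning  (1K x 3 ep.):  {finetuning_memorization_risk(1_000, 3):.2f}")
```

The point of the caricature: deduplication attacks the pre-training driver (pushing `repetitions` toward 1), while fine-tuning risk is reduced by growing dataset diversity or cutting epochs.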
Comparative Attack Surface Analysis
Phase-by-Phase Comparison
"""
Comparative attack surface analysis for pre-training vs fine-tuning.
Quantifies and compares vulnerability characteristics across phases.
"""
from dataclasses import dataclass, field
from enum import Enum
class TrainingPhase(Enum):
PRE_TRAINING = "pre_training"
SFT = "supervised_fine_tuning"
RLHF = "rlhf"
DPO = "dpo"
DEPLOYMENT_FINE_TUNING = "deployment_fine_tuning"
@dataclass
class PhaseSecurityProfile:
"""Security profile for a training phase."""
phase: TrainingPhase
data_scale: str
compute_duration: str
data_poisoning_difficulty: str
data_poisoning_impact: str
supply_chain_risk: str
insider_threat_risk: str
safety_alignment_risk: str
memorization_risk: str
typical_defenses: list[str] = field(default_factory=list)
PHASE_PROFILES = [
PhaseSecurityProfile(
phase=TrainingPhase.PRE_TRAINING,
data_scale="1T-15T tokens",
compute_duration="weeks to months",
data_poisoning_difficulty="high (need sufficient density in massive corpus)",
data_poisoning_impact="moderate (diluted by scale)",
supply_chain_risk="critical (web crawl, third-party datasets)",
insider_threat_risk="high (long-running, many operators)",
safety_alignment_risk="low (safety not yet established)",
memorization_risk="moderate (repetition-driven)",
typical_defenses=[
"Data deduplication",
"Web content filtering",
"Dataset provenance tracking",
"Compute access controls",
],
),
PhaseSecurityProfile(
phase=TrainingPhase.SFT,
data_scale="10K-1M examples",
compute_duration="hours to days",
data_poisoning_difficulty="low (small dataset, concentrated control)",
data_poisoning_impact="high (direct behavioral influence)",
supply_chain_risk="medium (curated datasets, human annotators)",
insider_threat_risk="high (annotator compromise)",
safety_alignment_risk="high (shapes instruction following)",
memorization_risk="high (small dataset, many epochs)",
typical_defenses=[
"Annotator agreement verification",
"Data quality scoring",
"Held-out validation",
"Example-level auditing",
],
),
PhaseSecurityProfile(
phase=TrainingPhase.RLHF,
data_scale="50K-500K comparisons",
compute_duration="days to weeks",
data_poisoning_difficulty="medium (preference data collection)",
data_poisoning_impact="critical (directly shapes safety behavior)",
supply_chain_risk="low (internal pipeline)",
insider_threat_risk="critical (reward model is single point of failure)",
safety_alignment_risk="critical (primary safety mechanism)",
memorization_risk="low (reward signal, not text)",
typical_defenses=[
"Reward model auditing",
"KL divergence constraints",
"Multi-annotator agreement",
"Constitutional AI cross-checks",
],
),
PhaseSecurityProfile(
phase=TrainingPhase.DPO,
data_scale="10K-100K pairs",
compute_duration="hours",
data_poisoning_difficulty="low (small dataset)",
data_poisoning_impact="critical (direct policy modification)",
supply_chain_risk="medium (preference data sourcing)",
insider_threat_risk="high (data curation access)",
safety_alignment_risk="critical (can undo pre-training safety)",
memorization_risk="moderate (preference pairs)",
typical_defenses=[
"Preference data validation",
"Implicit reward auditing",
"Beta parameter monitoring",
"Safety probe evaluation",
],
),
PhaseSecurityProfile(
phase=TrainingPhase.DEPLOYMENT_FINE_TUNING,
data_scale="100-10K examples",
compute_duration="minutes to hours",
data_poisoning_difficulty="very low (user-provided data)",
data_poisoning_impact="critical (user controls the data)",
supply_chain_risk="critical (user is the supply chain)",
insider_threat_risk="N/A (user is the operator)",
safety_alignment_risk="critical (Qi et al. 2024 attack vector)",
memorization_risk="very high (tiny dataset, potential PII)",
typical_defenses=[
"Safety evaluation gates",
"Data content filtering",
"Learning rate limiting",
"Parameter-efficient methods (LoRA)",
],
),
]
def compare_phases(profiles: list[PhaseSecurityProfile]) -> None:
"""Generate a comparative report across training phases."""
print("Training Phase Security Comparison")
print("=" * 60)
risk_categories = [
("Data Poisoning", lambda p: p.data_poisoning_impact),
("Supply Chain", lambda p: p.supply_chain_risk),
("Safety Alignment", lambda p: p.safety_alignment_risk),
("Memorization", lambda p: p.memorization_risk),
]
for category_name, accessor in risk_categories:
print(f"\n{category_name} Risk:")
for profile in profiles:
risk = accessor(profile)
short_risk = risk.split("(")[0].strip() if "(" in risk else risk
print(f" {profile.phase.value:30s}: {short_risk}")
compare_phases(PHASE_PROFILES)

The Fine-Tuning Safety Vulnerability
The most critical finding in training pipeline security is the asymmetry between establishing safety during pre-training/RLHF and undermining it during fine-tuning. Pre-training safety interventions require months of compute and careful data curation, while undoing these safety properties requires only a small number of adversarial fine-tuning examples.
"""
Quantifying the fine-tuning safety vulnerability.
Models the asymmetric effort required to establish vs undermine
safety alignment across training phases.
"""
from dataclasses import dataclass
@dataclass
class SafetyAlignmentState:
"""Tracks safety alignment across training phases."""
phase: str
safety_score: float # 0.0 = no safety, 1.0 = perfect safety
compute_cost: float # Relative compute cost
data_requirements: int # Number of examples needed
vulnerability_window: float # How easy to undo (0-1)
def model_safety_across_pipeline(
pre_training_data_quality: float = 0.5,
sft_safety_examples: int = 5000,
rlhf_comparisons: int = 100_000,
adversarial_ft_examples: int = 100,
) -> list[SafetyAlignmentState]:
"""
Model how safety alignment evolves across the training pipeline
and how vulnerable it is to adversarial fine-tuning.
Based on findings from Qi et al. 2024 showing that ~100
adversarial fine-tuning examples can significantly degrade safety.
"""
states = []
# Pre-training: baseline safety from data curation
pre_train_safety = 0.3 * pre_training_data_quality
states.append(SafetyAlignmentState(
phase="pre_training",
safety_score=pre_train_safety,
compute_cost=1000.0, # Baseline: very expensive
data_requirements=1_000_000_000, # Billions of tokens
vulnerability_window=0.1, # Hard to attack at scale
))
# SFT: safety improves with instruction following
sft_safety = pre_train_safety + 0.3 * min(1.0, sft_safety_examples / 10000)
states.append(SafetyAlignmentState(
phase="sft",
safety_score=sft_safety,
compute_cost=10.0,
data_requirements=sft_safety_examples,
vulnerability_window=0.4,
))
# RLHF: major safety improvement
rlhf_safety = sft_safety + 0.4 * min(1.0, rlhf_comparisons / 200_000)
states.append(SafetyAlignmentState(
phase="rlhf",
safety_score=min(0.95, rlhf_safety),
compute_cost=50.0,
data_requirements=rlhf_comparisons,
vulnerability_window=0.3,
))
# Adversarial fine-tuning: safety degrades dramatically
# Based on Qi et al. 2024: even 100 examples cause significant degradation
degradation = 0.5 * min(1.0, adversarial_ft_examples / 200)
adversarial_safety = max(0.1, rlhf_safety - degradation)
states.append(SafetyAlignmentState(
phase="adversarial_fine_tuning",
safety_score=adversarial_safety,
compute_cost=0.1, # Very cheap!
data_requirements=adversarial_ft_examples,
vulnerability_window=1.0, # Trivially accessible
))
return states
# Compare effort to build vs destroy safety
states = model_safety_across_pipeline()
print("Safety Alignment Pipeline:")
print(f"{'Phase':30s} {'Safety':>8s} {'Cost':>8s} {'Data':>12s}")
print("-" * 62)
for state in states:
print(
f"{state.phase:30s} "
f"{state.safety_score:8.2f} "
f"{state.compute_cost:8.1f} "
f"{state.data_requirements:12,d}"
)
# Highlight the asymmetry
build_cost = sum(s.compute_cost for s in states[:3])
destroy_cost = states[3].compute_cost
print(f"\nTotal cost to BUILD safety: {build_cost:.1f} compute units")
print(f"Cost to DESTROY safety: {destroy_cost:.1f} compute units")
print(f"Asymmetry ratio: {build_cost / destroy_cost:.0f}x")

Pre-Training Security
Data Pipeline Vulnerabilities
Pre-training data comes primarily from web crawls (Common Crawl, C4, OSCAR) and curated datasets (Wikipedia, books, code repositories). The attack surface spans the entire data collection and processing pipeline.
"""
Pre-training data security assessment.
Evaluates the security of common pre-training data sources.
"""
from dataclasses import dataclass, field
@dataclass
class DataSourceRisk:
"""Security risk assessment for a pre-training data source."""
source_name: str
scale: str
attacker_access: str # How can an attacker inject content?
poisoning_persistence: str # How long does injected content persist?
detection_difficulty: str
real_world_examples: list[str] = field(default_factory=list)
DATA_SOURCE_RISKS = [
DataSourceRisk(
source_name="Common Crawl / Web Crawl",
scale="Petabytes",
attacker_access="Create/modify web pages, SEO manipulation",
poisoning_persistence="Until next crawl or domain decommission",
detection_difficulty="Very hard (vast scale, diverse content)",
real_world_examples=[
"SEO spam already pollutes crawl data",
"Carlini et al. 2024 demonstrated practical web poisoning",
],
),
DataSourceRisk(
source_name="Wikipedia",
scale="~20GB text",
attacker_access="Edit pages (moderated, but edit-revert windows exist)",
poisoning_persistence="Until revert (often hours to days)",
detection_difficulty="Medium (structured edits, revision history)",
real_world_examples=[
"Vandalism is constant despite moderation",
"Subtle factual modifications can persist for months",
],
),
DataSourceRisk(
source_name="GitHub / Code Repositories",
scale="Terabytes of code",
attacker_access="Create repositories, submit pull requests",
poisoning_persistence="Until repository deletion or PR rejection",
detection_difficulty="Hard (legitimate code is diverse)",
real_world_examples=[
"Typosquatting packages on PyPI/npm",
"Malicious code in popular repositories",
],
),
DataSourceRisk(
source_name="Books / Academic Papers",
scale="Terabytes",
attacker_access="Publish via open-access platforms, arXiv",
poisoning_persistence="Permanent (published content persists)",
detection_difficulty="Very hard (appears legitimate)",
real_world_examples=[
"Predatory journals publish unreviewed content",
"arXiv has minimal content filtering",
],
),
DataSourceRisk(
source_name="Synthetic Data (from other models)",
scale="Variable",
attacker_access="Compromise teacher model, manipulate generation",
poisoning_persistence="Permanent in generated dataset",
detection_difficulty="Very hard (looks like natural text)",
real_world_examples=[
"Model collapse from recursive synthetic data",
"Benchmark contamination from training on test sets",
],
),
]
def prioritize_data_source_defenses(
risks: list[DataSourceRisk],
) -> list[tuple[str, str]]:
"""Prioritize which data sources need the most security investment."""
priority_map = {
"Very hard": 4,
"Hard": 3,
"Medium": 2,
"Easy": 1,
}
scored = []
for risk in risks:
detection_score = priority_map.get(risk.detection_difficulty, 2)
scored.append((risk.source_name, detection_score, risk))
scored.sort(key=lambda x: x[1], reverse=True)
return [(name, "HIGH_PRIORITY" if score >= 3 else "STANDARD")
for name, score, _ in scored]
priorities = prioritize_data_source_defenses(DATA_SOURCE_RISKS)
for source, priority in priorities:
    print(f" [{priority}] {source}")

Pre-Training Defenses
Pre-training defenses focus on data quality and integrity at scale:
- Content filtering: Remove toxic, low-quality, and adversarial content before training. Tools like quality classifiers, toxicity detectors, and perplexity filters provide defense in depth.
- Deduplication: Both exact and near-duplicate removal reduce the impact of data repetition attacks and limit memorization risk.
- Provenance tracking: Maintain records of data sources, crawl dates, and processing steps. This enables retroactive analysis when vulnerabilities are discovered.
- Canary monitoring: Insert known canary strings into the training data and monitor whether the model memorizes them. This provides an early warning for memorization-related risks.
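The canary check in the last item can be sketched in a few lines. Here `generate` is a stand-in for whatever completion API the training pipeline exposes (greedy decoding assumed); the canary strings and the fake model are hypothetical:

```python
from typing import Callable

def canary_exposure(
    generate: Callable[[str], str],
    canaries: list[tuple[str, str]],  # (prefix planted in data, secret suffix)
) -> float:
    """Fraction of canary suffixes the model reproduces verbatim from the prefix."""
    leaked = sum(
        1 for prefix, suffix in canaries
        if suffix in generate(prefix)
    )
    return leaked / max(len(canaries), 1)

def model(prompt: str) -> str:
    # Stand-in for a real completion call; this fake model memorized one canary.
    return {"canary-7f3a:": " SECRET-ALPHA"}.get(prompt, " unrelated text")

rate = canary_exposure(
    model,
    [("canary-7f3a:", "SECRET-ALPHA"), ("canary-9c1d:", "SECRET-BETA")],
)
print(f"canary exposure rate: {rate:.0%}")  # 50%
```

A nonzero exposure rate is the early-warning signal: it means verbatim training data is extractable, so the same likely holds for non-canary content.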
Fine-Tuning Security
The Qi et al. Attack
The most important paper for fine-tuning security is Qi et al. (2024). Their key findings:
- Fine-tuning on as few as 100 adversarial examples can significantly degrade a model's safety alignment
- Even benign fine-tuning (not designed to be harmful) can incidentally reduce safety
- Safety degradation persists even when the fine-tuning is followed by additional safety training
- The effect is amplified when fine-tuning uses high learning rates or many epochs
"""
Fine-tuning safety degradation simulation.
Models the safety impact of different fine-tuning configurations
based on findings from Qi et al. 2024.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class FineTuningConfig:
"""Configuration for a fine-tuning run."""
name: str
num_examples: int
num_harmful_examples: int
learning_rate: float
num_epochs: int
method: str # "full", "lora", "prefix_tuning"
def estimate_safety_degradation(
config: FineTuningConfig,
base_safety_score: float = 0.9,
) -> dict:
"""
Estimate safety degradation from fine-tuning.
Model based on Qi et al. 2024 findings:
- Degradation scales with harmful data fraction
- Higher learning rates increase degradation
- Parameter-efficient methods (LoRA) cause less degradation
- More epochs amplify the effect
"""
# Harmful data fraction
harmful_fraction = config.num_harmful_examples / max(config.num_examples, 1)
# Method scaling factor (full fine-tuning is worst)
method_factors = {
"full": 1.0,
"lora": 0.4, # LoRA modifies fewer parameters
"prefix_tuning": 0.3,
}
method_factor = method_factors.get(config.method, 1.0)
# Learning rate factor (higher LR = more damage)
lr_factor = min(2.0, config.learning_rate / 1e-5)
# Epoch factor (diminishing but real)
epoch_factor = min(3.0, 1.0 + np.log1p(config.num_epochs - 1))
# Compute degradation
degradation = (
harmful_fraction
* method_factor
* lr_factor
* epoch_factor
* 0.8 # Maximum degradation ceiling
)
final_safety = max(0.05, base_safety_score - degradation)
return {
"config_name": config.name,
"base_safety": base_safety_score,
"estimated_degradation": float(degradation),
"final_safety": float(final_safety),
"safety_retained": float(final_safety / base_safety_score),
"risk_level": (
"critical" if final_safety < 0.3
else "high" if final_safety < 0.5
else "medium" if final_safety < 0.7
else "low"
),
}
# Compare different fine-tuning scenarios
configs = [
FineTuningConfig("benign_full_ft", 10000, 0, 2e-5, 3, "full"),
FineTuningConfig("benign_lora", 10000, 0, 2e-4, 3, "lora"),
FineTuningConfig("100_harmful_full", 10000, 100, 2e-5, 3, "full"),
FineTuningConfig("100_harmful_lora", 10000, 100, 2e-4, 3, "lora"),
FineTuningConfig("1000_harmful_full", 10000, 1000, 2e-5, 3, "full"),
FineTuningConfig("only_harmful", 100, 100, 2e-5, 10, "full"),
FineTuningConfig("high_lr_harmful", 10000, 100, 2e-4, 3, "full"),
]
print(f"{'Config':25s} {'Base':>6s} {'Final':>6s} {'Retained':>10s} {'Risk':>10s}")
print("-" * 62)
for config in configs:
result = estimate_safety_degradation(config)
print(
f"{result['config_name']:25s} "
f"{result['base_safety']:6.2f} "
f"{result['final_safety']:6.2f} "
f"{result['safety_retained']:9.1%} "
f"{result['risk_level']:>10s}"
    )

Fine-Tuning Defenses
"""
Fine-tuning safety preservation framework.
Implements defenses to maintain safety alignment during fine-tuning.
"""
from dataclasses import dataclass, field
@dataclass
class SafetyGate:
"""A safety check that must pass before fine-tuning completes."""
name: str
check_type: str # "pre", "during", "post"
blocking: bool # If True, halt training on failure
threshold: float
SAFETY_GATES = [
SafetyGate("data_toxicity_scan", "pre", True, 0.01),
SafetyGate("data_diversity_check", "pre", False, 0.5),
SafetyGate("loss_spike_monitor", "during", True, 3.0),
SafetyGate("safety_probe_evaluation", "during", True, 0.7),
SafetyGate("full_safety_benchmark", "post", True, 0.8),
SafetyGate("capability_regression", "post", True, 0.9),
SafetyGate("memorization_test", "post", False, 0.05),
]
def evaluate_safety_gates(
gates: list[SafetyGate],
measurements: dict[str, float],
) -> dict:
"""
Evaluate all safety gates against measurements.
Returns an overall pass/fail and details for each gate.
"""
results = {
"overall_pass": True,
"blocking_failures": [],
"warnings": [],
"details": [],
}
for gate in gates:
value = measurements.get(gate.name, 0.0)
# Different gates have different pass conditions
if gate.check_type == "pre":
# Pre-checks: value should be below threshold (e.g., toxicity < 1%)
passed = value <= gate.threshold
elif gate.check_type == "during":
# During checks: depends on metric type
if "spike" in gate.name:
passed = value <= gate.threshold
else:
passed = value >= gate.threshold
else:
# Post checks: value should exceed threshold
passed = value >= gate.threshold
detail = {
"gate": gate.name,
"type": gate.check_type,
"value": value,
"threshold": gate.threshold,
"passed": passed,
"blocking": gate.blocking,
}
results["details"].append(detail)
if not passed:
if gate.blocking:
results["overall_pass"] = False
results["blocking_failures"].append(gate.name)
else:
results["warnings"].append(gate.name)
return results
# Demonstration: safe fine-tuning run
safe_measurements = {
"data_toxicity_scan": 0.005, # 0.5% toxicity (below 1% threshold)
"data_diversity_check": 0.7, # Good diversity
"loss_spike_monitor": 1.5, # Normal loss variation
"safety_probe_evaluation": 0.85, # Safety maintained
"full_safety_benchmark": 0.88, # Above 80% threshold
"capability_regression": 0.95, # Minimal regression
"memorization_test": 0.02, # Low memorization
}
safe_result = evaluate_safety_gates(SAFETY_GATES, safe_measurements)
print(f"Safe run: pass={safe_result['overall_pass']}")
print(f" Warnings: {safe_result['warnings']}")
# Demonstration: unsafe fine-tuning run
unsafe_measurements = {
"data_toxicity_scan": 0.005,
"data_diversity_check": 0.3, # Low diversity (warning)
"loss_spike_monitor": 1.2,
"safety_probe_evaluation": 0.45, # Safety degraded!
"full_safety_benchmark": 0.55, # Below threshold!
"capability_regression": 0.92,
"memorization_test": 0.08, # High memorization (warning)
}
unsafe_result = evaluate_safety_gates(SAFETY_GATES, unsafe_measurements)
print(f"\nUnsafe run: pass={unsafe_result['overall_pass']}")
print(f" Blocking failures: {unsafe_result['blocking_failures']}")
print(f" Warnings: {unsafe_result['warnings']}")

Recommendations for Phase-Appropriate Security
Decision Framework
| Decision Point | Pre-training Recommendation | Fine-tuning Recommendation |
|---|---|---|
| Data validation | Statistical sampling at scale | Example-level inspection |
| Compute security | Long-term access controls | Session-based access controls |
| Safety monitoring | Canary-based, periodic eval | Continuous safety probes |
| Rollback strategy | Checkpoint-based | Version-controlled datasets + checkpoints |
| Access control | Infrastructure team only | Restrict to authorized personnel |
| Data poisoning defense | Deduplication + filtering at scale | Content filtering + safety gates |
| Post-training validation | Capability benchmarks | Safety benchmarks + behavioral testing |
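The table above can be mirrored as a small lookup so pipeline tooling can surface the phase-appropriate recommendation programmatically. The structure and names here are illustrative, not an existing API; only three of the seven decision points are shown:

```python
RECOMMENDATIONS: dict[str, dict[str, str]] = {
    "data_validation": {
        "pre_training": "Statistical sampling at scale",
        "fine_tuning": "Example-level inspection",
    },
    "safety_monitoring": {
        "pre_training": "Canary-based, periodic eval",
        "fine_tuning": "Continuous safety probes",
    },
    "rollback_strategy": {
        "pre_training": "Checkpoint-based",
        "fine_tuning": "Version-controlled datasets + checkpoints",
    },
}

def recommend(decision_point: str, phase: str) -> str:
    """Look up the recommended control for a decision point and phase."""
    return RECOMMENDATIONS.get(decision_point, {}).get(
        phase, "No recommendation recorded"
    )

print(recommend("safety_monitoring", "fine_tuning"))  # Continuous safety probes
```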
Key Takeaways
- Fine-tuning is the weakest link: Despite pre-training consuming vastly more resources, fine-tuning is where safety alignment is most vulnerable. Security investment should be disproportionately allocated to fine-tuning defenses.
- Parameter-efficient methods reduce attack surface: LoRA, prefix tuning, and other PEFT methods modify fewer parameters, limiting the damage from adversarial fine-tuning. They are not a complete defense but significantly raise the bar.
- Continuous safety monitoring is essential: Safety cannot be verified once and assumed to persist. Both phases require ongoing monitoring, with fine-tuning requiring more frequent and stringent checks.
- Defense asymmetry requires layered approaches: Since attacks at any phase can undermine safety, defenses must be layered across all phases. No single phase's defenses can compensate for gaps in another phase.
References
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To." ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Carlini, N., et al. (2024). "Poisoning Web-Scale Training Datasets is Practical." IEEE Symposium on Security and Privacy 2024.
- Hu, E., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022.
- Hubinger, E., et al. (2024). "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training." arXiv:2401.05566.