Security Comparison: Pre-training vs Fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defensive strategies across pre-training and fine-tuning phases of language model development.
Overview
The language model training pipeline is typically divided into two major phases: pre-training, where the model learns general language understanding from a massive corpus, and fine-tuning (including RLHF, DPO, and instruction tuning), where the model is aligned to specific behaviors and safety properties. These phases have fundamentally different security characteristics, and conflating their risks leads to misallocated defensive resources.
Pre-training operates at enormous scale — trillions of tokens, thousands of GPUs, months of computation — which makes certain attacks (data poisoning) hard to execute at sufficient density but makes others (supply-chain compromise) especially impactful. Fine-tuning operates at much smaller scale — thousands to millions of examples, hours to days of computation — which inverts the security landscape: data poisoning becomes far easier, but the attack window is shorter.
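The scale gap can be made concrete with back-of-the-envelope arithmetic. The corpus sizes and the target poison density below are illustrative assumptions, not measured values:

```python
# Illustrative numbers: a 10T-token pre-training corpus vs. a 10K-example
# fine-tuning set, with an attacker targeting the same 0.1% share of the
# training data in each phase.
PRETRAIN_TOKENS = 10_000_000_000_000
FINETUNE_EXAMPLES = 10_000
TARGET_DENSITY = 0.001  # fraction of the data the attacker must control

def poison_volume_needed(dataset_size: int, density: float) -> int:
    """Units of data an attacker must inject to reach a given density."""
    return int(dataset_size * density)

print(f"Pre-training: {poison_volume_needed(PRETRAIN_TOKENS, TARGET_DENSITY):,} poisoned tokens")
print(f"Fine-tuning:  {poison_volume_needed(FINETUNE_EXAMPLES, TARGET_DENSITY):,} poisoned examples")
```

Reaching the same density requires billions of tokens in pre-training but only a handful of examples in fine-tuning, which is exactly the inversion described above.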
Qi et al. (2024) demonstrated in "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" that fine-tuning is the more security-critical phase. Their work showed that as few as 100 harmful examples during fine-tuning can undo the safety alignment established during a months-long pre-training and RLHF process. This asymmetry — where a small fine-tuning intervention can override extensive pre-training safety work — is the central security insight of this article.
Carlini et al. (2021) showed in "Extracting Training Data from Large Language Models" that both phases contribute to memorization risk, but through different mechanisms. Pre-training memorization is driven by data repetition across a massive corpus, while fine-tuning memorization is driven by the small dataset size and the high learning rate relative to dataset diversity.
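One way to see the shared mechanism is exposures per example: corpus duplication in pre-training and repeated epochs over a tiny fine-tuning set both raise how often the model sees a given sequence. The saturating curve below is a toy heuristic for illustration, not a model taken from either paper:

```python
import math

def memorization_risk(exposures_per_example: float, k: float = 10.0) -> float:
    """Toy heuristic: risk rises with exposures, with diminishing returns.
    The saturation constant k is an assumed value, not an empirical fit."""
    return 1.0 - math.exp(-exposures_per_example / k)

# Pre-training: one epoch, but a popular page may appear ~50x in the crawl.
pretrain_risk = memorization_risk(50)
# Fine-tuning: unique examples, but ~10 epochs over a small dataset.
finetune_risk = memorization_risk(10)
print(f"pre-training: {pretrain_risk:.2f}, fine-tuning: {finetune_risk:.2f}")
```

Both routes drive exposures up; only the source of the repetition differs between the phases.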
Comparative Attack Surface Analysis
Phase-by-Phase Comparison
"""
Comparative 攻擊面 analysis for pre-訓練 vs 微調.
Quantifies and compares 漏洞 characteristics across phases.
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
class TrainingPhase(Enum):
    PRE_TRAINING = "pre_training"
    SFT = "supervised_fine_tuning"
    RLHF = "rlhf"
    DPO = "dpo"
    DEPLOYMENT_FINE_TUNING = "deployment_fine_tuning"
@dataclass
class PhaseSecurityProfile:
    """Security profile for a training phase."""
    phase: TrainingPhase
    data_scale: str
    compute_duration: str
    data_poisoning_difficulty: str
    data_poisoning_impact: str
    supply_chain_risk: str
    insider_threat_risk: str
    safety_alignment_risk: str
    memorization_risk: str
    typical_defenses: list[str] = field(default_factory=list)
PHASE_PROFILES = [
    PhaseSecurityProfile(
        phase=TrainingPhase.PRE_TRAINING,
        data_scale="1T-15T tokens",
        compute_duration="weeks to months",
        data_poisoning_difficulty="high (need sufficient density in massive corpus)",
        data_poisoning_impact="moderate (diluted by scale)",
        supply_chain_risk="critical (web crawl, third-party datasets)",
        insider_threat_risk="high (long-running, many operators)",
        safety_alignment_risk="low (safety not yet established)",
        memorization_risk="moderate (repetition-driven)",
        typical_defenses=[
            "Data deduplication",
            "Web content filtering",
            "Dataset provenance tracking",
            "Compute access controls",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.SFT,
        data_scale="10K-1M examples",
        compute_duration="hours to days",
        data_poisoning_difficulty="low (small dataset, concentrated control)",
        data_poisoning_impact="high (direct behavioral influence)",
        supply_chain_risk="medium (curated datasets, human annotators)",
        insider_threat_risk="high (annotator compromise)",
        safety_alignment_risk="high (shapes instruction following)",
        memorization_risk="high (small dataset, many epochs)",
        typical_defenses=[
            "Annotator agreement verification",
            "Data quality scoring",
            "Held-out validation",
            "Example-level auditing",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.RLHF,
        data_scale="50K-500K comparisons",
        compute_duration="days to weeks",
        data_poisoning_difficulty="medium (preference data collection)",
        data_poisoning_impact="critical (directly shapes safety behavior)",
        supply_chain_risk="low (internal pipeline)",
        insider_threat_risk="critical (reward model is single point of failure)",
        safety_alignment_risk="critical (primary safety mechanism)",
        memorization_risk="low (reward signal, not text)",
        typical_defenses=[
            "Reward model auditing",
            "KL divergence constraints",
            "Multi-annotator agreement",
            "Constitutional AI cross-checks",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.DPO,
        data_scale="10K-100K pairs",
        compute_duration="hours",
        data_poisoning_difficulty="low (small dataset)",
        data_poisoning_impact="critical (direct policy modification)",
        supply_chain_risk="medium (preference data sourcing)",
        insider_threat_risk="high (data curation access)",
        safety_alignment_risk="critical (can undo pre-training safety)",
        memorization_risk="moderate (preference pairs)",
        typical_defenses=[
            "Preference data validation",
            "Implicit reward auditing",
            "Beta parameter monitoring",
            "Safety probe evaluation",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.DEPLOYMENT_FINE_TUNING,
        data_scale="100-10K examples",
        compute_duration="minutes to hours",
        data_poisoning_difficulty="very low (user-provided data)",
        data_poisoning_impact="critical (user controls the data)",
        supply_chain_risk="critical (user is the supply chain)",
        insider_threat_risk="N/A (user is the operator)",
        safety_alignment_risk="critical (Qi et al. 2024 attack vector)",
        memorization_risk="very high (tiny dataset, potential PII)",
        typical_defenses=[
            "Safety evaluation gates",
            "Data content filtering",
            "Learning rate limiting",
            "Parameter-efficient methods (LoRA)",
        ],
    ),
]
def compare_phases(profiles: list[PhaseSecurityProfile]) -> None:
    """Generate a comparative report across training phases."""
    print("Training Phase Security Comparison")
    print("=" * 60)
    risk_categories = [
        ("Data Poisoning", lambda p: p.data_poisoning_impact),
        ("Supply Chain", lambda p: p.supply_chain_risk),
        ("Safety Alignment", lambda p: p.safety_alignment_risk),
        ("Memorization", lambda p: p.memorization_risk),
    ]
    for category_name, accessor in risk_categories:
        print(f"\n{category_name} Risk:")
        for profile in profiles:
            risk = accessor(profile)
            short_risk = risk.split("(")[0].strip() if "(" in risk else risk
            print(f"  {profile.phase.value:30s}: {short_risk}")
compare_phases(PHASE_PROFILES)
The Fine-Tuning Security Vulnerability
The most critical finding in training pipeline security is the asymmetry between establishing safety during pre-training/RLHF and undermining it during fine-tuning. Pre-training safety interventions require months of compute and careful data curation, while undoing these safety properties requires only a small number of adversarial fine-tuning examples.
"""
Quantifying the 微調 安全 漏洞.
Models the asymmetric effort required to establish vs undermine
安全 對齊 across 訓練 phases.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class SafetyAlignmentState:
    """Tracks safety alignment across training phases."""
    phase: str
    safety_score: float  # 0.0 = no safety, 1.0 = perfect safety
    compute_cost: float  # Relative compute cost
    data_requirements: int  # Number of examples needed
    vulnerability_window: float  # How easy to undo (0-1)
def model_safety_across_pipeline(
    pre_training_data_quality: float = 0.5,
    sft_safety_examples: int = 5000,
    rlhf_comparisons: int = 100_000,
    adversarial_ft_examples: int = 100,
) -> list[SafetyAlignmentState]:
    """
    Model how safety alignment evolves across the training pipeline
    and how vulnerable it is to adversarial fine-tuning.
    Based on findings from Qi et al. 2024 showing that ~100
    adversarial fine-tuning examples can significantly degrade safety.
    """
    states = []
    # Pre-training: baseline safety from data curation
    pre_train_safety = 0.3 * pre_training_data_quality
    states.append(SafetyAlignmentState(
        phase="pre_training",
        safety_score=pre_train_safety,
        compute_cost=1000.0,  # Baseline: very expensive
        data_requirements=1_000_000_000,  # Billions of tokens
        vulnerability_window=0.1,  # Hard to attack at scale
    ))
    # SFT: safety improves with instruction following
    sft_safety = pre_train_safety + 0.3 * min(1.0, sft_safety_examples / 10000)
    states.append(SafetyAlignmentState(
        phase="sft",
        safety_score=sft_safety,
        compute_cost=10.0,
        data_requirements=sft_safety_examples,
        vulnerability_window=0.4,
    ))
    # RLHF: major safety improvement
    rlhf_safety = sft_safety + 0.4 * min(1.0, rlhf_comparisons / 200_000)
    states.append(SafetyAlignmentState(
        phase="rlhf",
        safety_score=min(0.95, rlhf_safety),
        compute_cost=50.0,
        data_requirements=rlhf_comparisons,
        vulnerability_window=0.3,
    ))
    # Adversarial fine-tuning: safety degrades dramatically
    # Based on Qi et al. 2024: even 100 examples cause significant degradation
    degradation = 0.5 * min(1.0, adversarial_ft_examples / 200)
    adversarial_safety = max(0.1, rlhf_safety - degradation)
    states.append(SafetyAlignmentState(
        phase="adversarial_fine_tuning",
        safety_score=adversarial_safety,
        compute_cost=0.1,  # Very cheap!
        data_requirements=adversarial_ft_examples,
        vulnerability_window=1.0,  # Trivially accessible
    ))
    return states
# Compare effort to build vs destroy safety
states = model_safety_across_pipeline()
print("Safety Alignment Pipeline:")
print(f"{'Phase':30s} {'Safety':>8s} {'Cost':>8s} {'Data':>12s}")
print("-" * 62)
for state in states:
    print(
        f"{state.phase:30s} "
        f"{state.safety_score:8.2f} "
        f"{state.compute_cost:8.1f} "
        f"{state.data_requirements:12,d}"
    )
# Highlight the asymmetry
build_cost = sum(s.compute_cost for s in states[:3])
destroy_cost = states[3].compute_cost
print(f"\nTotal cost to BUILD safety: {build_cost:.1f} compute units")
print(f"Cost to DESTROY safety: {destroy_cost:.1f} compute units")
print(f"Asymmetry ratio: {build_cost / destroy_cost:.0f}x")Pre-Training 安全
Data Pipeline Vulnerabilities
Pre-training data comes primarily from web crawls (Common Crawl, C4, OSCAR) and curated datasets (Wikipedia, books, code repositories). The attack surface spans the entire data collection and processing pipeline.
"""
Pre-訓練資料 安全 評估.
Evaluates the 安全 of common pre-訓練資料 sources.
"""
from dataclasses import dataclass, field
@dataclass
class DataSourceRisk:
    """Security risk assessment for a pre-training data source."""
    source_name: str
    scale: str
    attacker_access: str  # How can attackers inject content?
    poisoning_persistence: str  # How long does injected content persist?
    detection_difficulty: str
    real_world_examples: list[str] = field(default_factory=list)
DATA_SOURCE_RISKS = [
    DataSourceRisk(
        source_name="Common Crawl / Web Crawl",
        scale="Petabytes",
        attacker_access="Create/modify web pages, SEO manipulation",
        poisoning_persistence="Until next crawl or domain decommission",
        detection_difficulty="Very hard (vast scale, diverse content)",
        real_world_examples=[
            "SEO spam already pollutes crawl data",
            "Carlini et al. 2024 demonstrated practical web poisoning",
        ],
    ),
    DataSourceRisk(
        source_name="Wikipedia",
        scale="~20GB text",
        attacker_access="Edit pages (moderated, but edit-revert windows exist)",
        poisoning_persistence="Until revert (often hours to days)",
        detection_difficulty="Medium (structured edits, revision history)",
        real_world_examples=[
            "Vandalism is constant despite moderation",
            "Subtle factual modifications can persist for months",
        ],
    ),
    DataSourceRisk(
        source_name="GitHub / Code Repositories",
        scale="Terabytes of code",
        attacker_access="Create repositories, submit pull requests",
        poisoning_persistence="Until repository deletion or PR rejection",
        detection_difficulty="Hard (legitimate code is diverse)",
        real_world_examples=[
            "Typosquatting packages on PyPI/npm",
            "Malicious code in popular repositories",
        ],
    ),
    DataSourceRisk(
        source_name="Books / Academic Papers",
        scale="Terabytes",
        attacker_access="Publish via open-access platforms, arXiv",
        poisoning_persistence="Permanent (published content persists)",
        detection_difficulty="Very hard (appears legitimate)",
        real_world_examples=[
            "Predatory journals publish unreviewed content",
            "arXiv has minimal content filtering",
        ],
    ),
    DataSourceRisk(
        source_name="Synthetic Data (from other models)",
        scale="Variable",
        attacker_access="Compromise teacher model, manipulate generation",
        poisoning_persistence="Permanent in generated dataset",
        detection_difficulty="Very hard (looks like natural text)",
        real_world_examples=[
            "Model collapse from recursive synthetic data",
            "Benchmark contamination from training on test sets",
        ],
    ),
]
def prioritize_data_source_defenses(
    risks: list[DataSourceRisk],
) -> list[tuple[str, str]]:
    """Prioritize which data sources need the most security investment."""
    priority_map = {
        "Very hard": 4,
        "Hard": 3,
        "Medium": 2,
        "Easy": 1,
    }
    scored = []
    for risk in risks:
        # Strip the parenthetical note so the lookup matches the map keys.
        difficulty = risk.detection_difficulty.split("(")[0].strip()
        detection_score = priority_map.get(difficulty, 2)
        scored.append((risk.source_name, detection_score, risk))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(name, "HIGH_PRIORITY" if score >= 3 else "STANDARD")
            for name, score, _ in scored]

priorities = prioritize_data_source_defenses(DATA_SOURCE_RISKS)
for source, priority in priorities:
print(f" [{priority}] {source}")Pre-Training 防禦
Pre-training defenses focus on data quality and integrity at scale:
- Content filtering: Remove toxic, low-quality, and adversarial content before training. Tools like quality classifiers, toxicity detectors, and perplexity filters provide defense in depth.
- Deduplication: Both exact and near-duplicate removal reduce the impact of data-repetition attacks and limit memorization risk.
- Provenance tracking: Maintain records of data sources, crawl dates, and processing steps. This enables retroactive analysis when vulnerabilities are discovered.
- Canary monitoring: Insert known canary strings into the training data and monitor whether the model memorizes them. This provides early warning of memorization-related risks.
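The canary defense above can be sketched as follows. `make_canaries`, `canary_leak_rate`, and the stand-in generator are hypothetical helpers; a real deployment would store canaries securely and sample the model many times with varied prompts:

```python
import hashlib

def make_canaries(n: int, seed: str = "canary-seed") -> list[str]:
    """Derive unique, high-entropy canary strings to embed in training data."""
    return [
        "CANARY-" + hashlib.sha256(f"{seed}:{i}".encode()).hexdigest()[:16]
        for i in range(n)
    ]

def canary_leak_rate(generate, canaries: list[str], prompt: str = "") -> float:
    """Fraction of canaries the model reproduces verbatim in its output."""
    output = generate(prompt)
    return sum(1 for c in canaries if c in output) / len(canaries)

canaries = make_canaries(5)

def leaky_model(prompt: str) -> str:
    # Stand-in for a real model that memorized one canary during training.
    return "...training residue: " + canaries[0]

print(f"Leak rate: {canary_leak_rate(leaky_model, canaries):.0%}")
```

A nonzero leak rate on held-out canaries is a direct signal that the model is memorizing training strings verbatim.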
Fine-Tuning Security
The Qi et al. Attack
The most important paper for fine-tuning security is Qi et al. (2024). Their key findings:
- Fine-tuning on as few as 100 adversarial examples can significantly degrade a model's safety alignment
- Even benign fine-tuning (not designed to be harmful) can incidentally reduce safety
- Safety degradation persists even when fine-tuning is followed by additional safety training
- The effect is amplified when fine-tuning uses high learning rates or many epochs
"""
微調 安全 degradation simulation.
Models the 安全 impact of different 微調 configurations
based on findings from Qi et al. 2024.
"""
import numpy as np
from dataclasses import dataclass
@dataclass
class FineTuningConfig:
    """Configuration for a fine-tuning run."""
    name: str
    num_examples: int
    num_harmful_examples: int
    learning_rate: float
    num_epochs: int
    method: str  # "full", "lora", "prefix_tuning"
def estimate_safety_degradation(
    config: FineTuningConfig,
    base_safety_score: float = 0.9,
) -> dict:
    """
    Estimate safety degradation from fine-tuning.
    Model based on Qi et al. 2024 findings:
    - Degradation scales with harmful data fraction
    - Higher learning rates increase degradation
    - Parameter-efficient methods (LoRA) cause less degradation
    - More epochs amplify the effect
    """
    # Harmful data fraction
    harmful_fraction = config.num_harmful_examples / max(config.num_examples, 1)
    # Method scaling factor (full fine-tuning is worst)
    method_factors = {
        "full": 1.0,
        "lora": 0.4,  # LoRA modifies fewer parameters
        "prefix_tuning": 0.3,
    }
    method_factor = method_factors.get(config.method, 1.0)
    # Learning rate factor (higher LR = more damage)
    lr_factor = min(2.0, config.learning_rate / 1e-5)
    # Epoch factor (diminishing but real)
    epoch_factor = min(3.0, 1.0 + np.log1p(config.num_epochs - 1))
    # Compute degradation
    degradation = (
        harmful_fraction
        * method_factor
        * lr_factor
        * epoch_factor
        * 0.8  # Maximum degradation ceiling
    )
    final_safety = max(0.05, base_safety_score - degradation)
    return {
        "config_name": config.name,
        "base_safety": base_safety_score,
        "estimated_degradation": float(degradation),
        "final_safety": float(final_safety),
        "safety_retained": float(final_safety / base_safety_score),
        "risk_level": (
            "critical" if final_safety < 0.3
            else "high" if final_safety < 0.5
            else "medium" if final_safety < 0.7
            else "low"
        ),
    }
# Compare different fine-tuning scenarios
configs = [
    FineTuningConfig("benign_full_ft", 10000, 0, 2e-5, 3, "full"),
    FineTuningConfig("benign_lora", 10000, 0, 2e-4, 3, "lora"),
    FineTuningConfig("100_harmful_full", 10000, 100, 2e-5, 3, "full"),
    FineTuningConfig("100_harmful_lora", 10000, 100, 2e-4, 3, "lora"),
    FineTuningConfig("1000_harmful_full", 10000, 1000, 2e-5, 3, "full"),
    FineTuningConfig("only_harmful", 100, 100, 2e-5, 10, "full"),
    FineTuningConfig("high_lr_harmful", 10000, 100, 2e-4, 3, "full"),
]
print(f"{'Config':25s} {'Base':>6s} {'Final':>6s} {'Retained':>10s} {'Risk':>10s}")
print("-" * 62)
for config in configs:
    result = estimate_safety_degradation(config)
    print(
        f"{result['config_name']:25s} "
        f"{result['base_safety']:6.2f} "
        f"{result['final_safety']:6.2f} "
        f"{result['safety_retained']:9.1%} "
        f"{result['risk_level']:>10s}"
    )
Fine-Tuning Defenses
"""
微調 安全 preservation framework.
Implements 防禦 to maintain 安全 對齊 during 微調.
"""
import numpy as np
from dataclasses import dataclass, field
@dataclass
class SafetyGate:
    """A safety check that must pass before fine-tuning completes."""
    name: str
    check_type: str  # "pre", "during", "post"
    blocking: bool  # If True, halt training on failure
    threshold: float

SAFETY_GATES = [
    SafetyGate("data_toxicity_scan", "pre", True, 0.01),
    SafetyGate("data_diversity_check", "pre", False, 0.5),
    SafetyGate("loss_spike_monitor", "during", True, 3.0),
    SafetyGate("safety_probe_evaluation", "during", True, 0.7),
    SafetyGate("full_safety_benchmark", "post", True, 0.8),
    SafetyGate("capability_regression", "post", True, 0.9),
    SafetyGate("memorization_test", "post", False, 0.05),
]
def evaluate_safety_gates(
    gates: list[SafetyGate],
    measurements: dict[str, float],
) -> dict:
    """
    Evaluate all safety gates against measurements.
    Returns an overall pass/fail and details for each gate.
    """
    results = {
        "overall_pass": True,
        "blocking_failures": [],
        "warnings": [],
        "details": [],
    }
    for gate in gates:
        value = measurements.get(gate.name, 0.0)
        # Pass direction depends on the metric: toxicity, loss spikes, and
        # memorization must stay LOW, while diversity, safety, and
        # capability scores must stay HIGH.
        lower_is_better = any(
            key in gate.name for key in ("toxicity", "spike", "memorization")
        )
        if lower_is_better:
            passed = value <= gate.threshold
        else:
            passed = value >= gate.threshold
        detail = {
            "gate": gate.name,
            "type": gate.check_type,
            "value": value,
            "threshold": gate.threshold,
            "passed": passed,
            "blocking": gate.blocking,
        }
        results["details"].append(detail)
        if not passed:
            if gate.blocking:
                results["overall_pass"] = False
                results["blocking_failures"].append(gate.name)
            else:
                results["warnings"].append(gate.name)
    return results
# Demonstration: safe fine-tuning run
safe_measurements = {
    "data_toxicity_scan": 0.005,  # 0.5% toxicity (below 1% threshold)
    "data_diversity_check": 0.7,  # Good diversity
    "loss_spike_monitor": 1.5,  # Normal loss variation
    "safety_probe_evaluation": 0.85,  # Safety maintained
    "full_safety_benchmark": 0.88,  # Above 80% threshold
    "capability_regression": 0.95,  # Minimal regression
    "memorization_test": 0.02,  # Low memorization
}
safe_result = evaluate_safety_gates(SAFETY_GATES, safe_measurements)
print(f"Safe run: pass={safe_result['overall_pass']}")
print(f"  Warnings: {safe_result['warnings']}")
# Demonstration: unsafe fine-tuning run
unsafe_measurements = {
    "data_toxicity_scan": 0.005,
    "data_diversity_check": 0.3,  # Low diversity (warning)
    "loss_spike_monitor": 1.2,
    "safety_probe_evaluation": 0.45,  # Safety degraded!
    "full_safety_benchmark": 0.55,  # Below threshold!
    "capability_regression": 0.92,
    "memorization_test": 0.08,  # High memorization (warning)
}
unsafe_result = evaluate_safety_gates(SAFETY_GATES, unsafe_measurements)
print(f"\nUnsafe run: pass={unsafe_result['overall_pass']}")
print(f"  Blocking failures: {unsafe_result['blocking_failures']}")
print(f" Warnings: {unsafe_result['warnings']}")Recommendations for Phase-Appropriate 安全
Decision Framework
| Decision Point | Pre-training Recommendation | Fine-tuning Recommendation |
|---|---|---|
| Data validation | Statistical sampling at scale | Example-level inspection |
| Compute security | Long-term access controls | Session-based access controls |
| Safety monitoring | Canary-based, periodic eval | Continuous safety probes |
| Rollback strategy | Checkpoint-based | Version-controlled datasets + checkpoints |
| Access control | Infrastructure team only | Restrict to authorized personnel |
| Data poisoning defense | Deduplication + filtering at scale | Content filtering + safety gates |
| Post-training validation | Capability benchmarks | Safety benchmarks + behavioral tests |
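If pipeline tooling needs to surface these recommendations programmatically, the table can be encoded as a plain lookup. The structure below mirrors a few rows of the table and is only a sketch; key names are invented for illustration:

```python
# Encodes a subset of the decision framework table as a nested lookup.
RECOMMENDATIONS: dict[str, dict[str, str]] = {
    "data_validation": {
        "pre_training": "Statistical sampling at scale",
        "fine_tuning": "Example-level inspection",
    },
    "rollback_strategy": {
        "pre_training": "Checkpoint-based",
        "fine_tuning": "Version-controlled datasets + checkpoints",
    },
    "poisoning_defense": {
        "pre_training": "Deduplication + filtering at scale",
        "fine_tuning": "Content filtering + safety gates",
    },
}

def recommend(decision_point: str, phase: str) -> str:
    """Return the phase-appropriate recommendation; a typo raises KeyError."""
    return RECOMMENDATIONS[decision_point][phase]

print(recommend("poisoning_defense", "fine_tuning"))
```

Failing loudly on unknown keys is deliberate: a silent default would let a misconfigured pipeline skip a defense.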
Key Takeaways
- Fine-tuning is the weakest link: Despite pre-training consuming vastly more resources, fine-tuning is where safety alignment is most vulnerable. Security investment should be disproportionately allocated to fine-tuning defenses.
- Parameter-efficient methods reduce the attack surface: LoRA, prefix tuning, and other PEFT methods modify fewer parameters, limiting the damage from adversarial fine-tuning. They are not a complete defense but significantly raise the bar.
- Continuous safety monitoring is essential: Safety cannot be verified once and assumed to persist. Both phases require ongoing monitoring, with fine-tuning requiring more frequent and stringent checks.
- Defense asymmetry requires layered approaches: Since attacks at any phase can undermine safety, defenses must be layered across all phases. No single phase's defenses can compensate for gaps in another phase.
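The parameter-efficiency takeaway can be quantified with a rough count of trainable weights. The model dimensions and the assumption that only the four attention projection matrices are adapted are illustrative simplifications:

```python
def full_ft_params(d_model: int, n_layers: int) -> int:
    """Trainable weights when fully fine-tuning the four d x d attention
    projections (Q, K, V, O) in every layer; illustrative simplification."""
    return n_layers * 4 * d_model * d_model

def lora_params(d_model: int, n_layers: int, rank: int) -> int:
    """LoRA instead trains two low-rank factors (d x r and r x d) per matrix."""
    return n_layers * 4 * 2 * d_model * rank

d, layers, r = 4096, 32, 8  # assumed 7B-class dimensions
full = full_ft_params(d, layers)
lora = lora_params(d, layers, r)
print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):  {lora:,} trainable parameters ({lora / full:.2%} of full)")
```

At rank 8 the adapter touches well under 1% of the attention weights, which is why adversarial fine-tuning through LoRA has a smaller surface to corrupt, though the Qi et al. results show it is reduced, not eliminated.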
References
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Hu, E., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022.
- Hubinger, E., et al. (2024). "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training." arXiv:2401.05566.