Security comparison: pre-training versus fine-tuning
A comparative analysis of security vulnerabilities, attack surfaces, and defense strategies across the pre-training and fine-tuning phases of language model development.
Overview
The training pipeline of a language model is typically divided into two main phases: pre-training, in which the model learns general language understanding from a massive corpus, and fine-tuning (including RLHF, DPO, and instruction tuning), in which the model is aligned to specific behaviors and safety properties. These phases have fundamentally different security characteristics, and conflating their risks leads to misallocated defensive resources.
Pre-training operates at enormous scale (trillions of tokens, thousands of GPUs, months of compute), which makes some attacks, such as data poisoning, hard to carry out at sufficient density, but makes others, such as supply chain compromise, especially impactful. Fine-tuning operates at a much smaller scale (thousands to millions of examples, hours to days of compute), which inverts the security landscape: data poisoning becomes far easier, but the attack window is shorter.
Qi et al. (2024), in "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To", provide the strongest evidence that fine-tuning is the more security-critical phase. Their work showed that as few as 100 harmful examples during fine-tuning, and in some of their experiments considerably fewer, can undo the safety alignment established through a months-long pre-training and RLHF process. This asymmetry, in which a small fine-tuning intervention can override extensive upstream safety work, is the central security insight of this article.
Carlini et al. (2021), in "Extracting Training Data from Large Language Models", showed that language models memorize and can leak training data. Both phases contribute to this memorization risk, but the mechanisms differ: memorization in pre-training is driven by data repetition across a massive corpus, whereas memorization in fine-tuning is driven by the small dataset size and a learning rate that is high relative to the diversity of the dataset.
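To make the scale argument concrete, the sketch below computes how many poisoned items an attacker would need to control to reach the same fraction of the training data in each phase. The corpus sizes, the average document length of 500 tokens, and the 0.1% target density are illustrative assumptions chosen for the arithmetic, not values taken from the cited papers.

```python
def poisoned_items_needed(dataset_size: int, target_ppm: int) -> int:
    """Items an attacker must control to reach target_ppm parts per million.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return -(-dataset_size * target_ppm // 1_000_000)


# All four constants are assumptions for illustration only.
TOKENS_PER_DOCUMENT = 500
PRETRAINING_TOKENS = 2_000_000_000_000   # 2T tokens
FINE_TUNING_EXAMPLES = 10_000
TARGET_PPM = 1_000                        # 0.1% of the data

pretraining_docs = PRETRAINING_TOKENS // TOKENS_PER_DOCUMENT
pretraining_needed = poisoned_items_needed(pretraining_docs, TARGET_PPM)
fine_tuning_needed = poisoned_items_needed(FINE_TUNING_EXAMPLES, TARGET_PPM)

print(f"Pre-training: {pretraining_needed:,} poisoned documents")
print(f"Fine-tuning:  {fine_tuning_needed:,} poisoned examples")
```

Under these assumptions, the same 0.1% density requires four million documents in pre-training but only ten examples in fine-tuning. Real attacks may succeed at lower densities, so the numbers illustrate the ratio between phases, not an absolute threshold.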
Comparative attack surface analysis
Phase-by-phase comparison

"""
Comparative attack surface analysis for pre-training vs fine-tuning.
Records qualitative vulnerability ratings per phase and compares them.
"""
from dataclasses import dataclass, field
from enum import Enum


class TrainingPhase(Enum):
    PRE_TRAINING = "pre_training"
    SFT = "supervised_fine_tuning"
    RLHF = "rlhf"
    DPO = "dpo"
    DEPLOYMENT_FINE_TUNING = "deployment_fine_tuning"


@dataclass
class PhaseSecurityProfile:
    """Security profile for a training phase."""
    phase: TrainingPhase
    data_scale: str
    compute_duration: str
    data_poisoning_difficulty: str
    data_poisoning_impact: str
    supply_chain_risk: str
    insider_threat_risk: str
    safety_alignment_risk: str
    memorization_risk: str
    typical_defenses: list[str] = field(default_factory=list)


PHASE_PROFILES = [
    PhaseSecurityProfile(
        phase=TrainingPhase.PRE_TRAINING,
        data_scale="1T-15T tokens",
        compute_duration="weeks to months",
        data_poisoning_difficulty="high (need sufficient density in massive corpus)",
        data_poisoning_impact="moderate (diluted by scale)",
        supply_chain_risk="critical (web crawl, third-party datasets)",
        insider_threat_risk="high (long-running, many operators)",
        safety_alignment_risk="low (safety not yet established)",
        memorization_risk="moderate (repetition-driven)",
        typical_defenses=[
            "Data deduplication",
            "Web content filtering",
            "Dataset provenance tracking",
            "Compute access controls",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.SFT,
        data_scale="10K-1M examples",
        compute_duration="hours to days",
        data_poisoning_difficulty="low (small dataset, concentrated control)",
        data_poisoning_impact="high (direct behavioral influence)",
        supply_chain_risk="medium (curated datasets, human annotators)",
        insider_threat_risk="high (annotator compromise)",
        safety_alignment_risk="high (shapes instruction following)",
        memorization_risk="high (small dataset, many epochs)",
        typical_defenses=[
            "Annotator agreement verification",
            "Data quality scoring",
            "Held-out validation",
            "Example-level auditing",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.RLHF,
        data_scale="50K-500K comparisons",
        compute_duration="days to weeks",
        data_poisoning_difficulty="medium (preference data collection)",
        data_poisoning_impact="critical (directly shapes safety behavior)",
        supply_chain_risk="low (internal pipeline)",
        insider_threat_risk="critical (reward model is single point of failure)",
        safety_alignment_risk="critical (primary safety mechanism)",
        memorization_risk="low (reward signal, not text)",
        typical_defenses=[
            "Reward model auditing",
            "KL divergence constraints",
            "Multi-annotator agreement",
            "Constitutional AI cross-checks",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.DPO,
        data_scale="10K-100K pairs",
        compute_duration="hours",
        data_poisoning_difficulty="low (small dataset)",
        data_poisoning_impact="critical (direct policy modification)",
        supply_chain_risk="medium (preference data sourcing)",
        insider_threat_risk="high (data curation access)",
        safety_alignment_risk="critical (can undo pre-training safety)",
        memorization_risk="moderate (preference pairs)",
        typical_defenses=[
            "Preference data validation",
            "Implicit reward auditing",
            "Beta parameter monitoring",
            "Safety probe evaluation",
        ],
    ),
    PhaseSecurityProfile(
        phase=TrainingPhase.DEPLOYMENT_FINE_TUNING,
        data_scale="100-10K examples",
        compute_duration="minutes to hours",
        data_poisoning_difficulty="very low (user-provided data)",
        data_poisoning_impact="critical (user controls the data)",
        supply_chain_risk="critical (user is the supply chain)",
        insider_threat_risk="N/A (user is the operator)",
        safety_alignment_risk="critical (Qi et al. 2024 attack vector)",
        memorization_risk="very high (tiny dataset, potential PII)",
        typical_defenses=[
            "Safety evaluation gates",
            "Data content filtering",
            "Learning rate limiting",
            "Parameter-efficient methods (LoRA)",
        ],
    ),
]


def compare_phases(profiles: list[PhaseSecurityProfile]) -> None:
    """Print a comparative report across training phases."""
    print("Training Phase Security Comparison")
    print("=" * 60)
    risk_categories = [
        ("Data Poisoning", lambda p: p.data_poisoning_impact),
        ("Supply Chain", lambda p: p.supply_chain_risk),
        ("Safety Alignment", lambda p: p.safety_alignment_risk),
        ("Memorization", lambda p: p.memorization_risk),
    ]
    for category_name, accessor in risk_categories:
        print(f"\n{category_name} Risk:")
        for profile in profiles:
            risk = accessor(profile)
            # Keep only the rating, dropping the parenthesized rationale.
            short_risk = risk.split("(")[0].strip()
            print(f"  {profile.phase.value:30s}: {short_risk}")


compare_phases(PHASE_PROFILES)

The fine-tuning safety vulnerability
The most critical finding in training pipeline security is the asymmetry between establishing safety during pre-training and RLHF and undermining it during fine-tuning. Establishing safety takes months of compute and careful data curation, whereas undoing those safety properties takes only a small number of adversarial fine-tuning examples.

"""
Quantifying the fine-tuning safety vulnerability.
Models the asymmetric effort required to establish vs undermine
safety alignment across training phases.
"""
from dataclasses import dataclass


@dataclass
class SafetyAlignmentState:
    """Tracks safety alignment across training phases."""
    phase: str
    safety_score: float          # 0.0 = no safety, 1.0 = perfect safety
    compute_cost: float          # Relative compute cost
    data_requirements: int       # Number of examples needed
    vulnerability_window: float  # How easy to undo (0-1)


def model_safety_across_pipeline(
    pre_training_data_quality: float = 0.5,
    sft_safety_examples: int = 5000,
    rlhf_comparisons: int = 100_000,
    adversarial_ft_examples: int = 100,
) -> list[SafetyAlignmentState]:
    """
    Model how safety alignment evolves across the training pipeline
    and how vulnerable it is to adversarial fine-tuning.
    Based on findings from Qi et al. 2024 showing that ~100
    adversarial fine-tuning examples can significantly degrade safety.
    """
    states = []

    # Pre-training: baseline safety from data curation
    pre_train_safety = 0.3 * pre_training_data_quality
    states.append(SafetyAlignmentState(
        phase="pre_training",
        safety_score=pre_train_safety,
        compute_cost=1000.0,              # Baseline: very expensive
        data_requirements=1_000_000_000,  # Billions of tokens
        vulnerability_window=0.1,         # Hard to attack at scale
    ))

    # SFT: safety improves with instruction following
    sft_safety = pre_train_safety + 0.3 * min(1.0, sft_safety_examples / 10000)
    states.append(SafetyAlignmentState(
        phase="sft",
        safety_score=sft_safety,
        compute_cost=10.0,
        data_requirements=sft_safety_examples,
        vulnerability_window=0.4,
    ))

    # RLHF: major safety improvement, capped at 0.95
    rlhf_safety = min(0.95, sft_safety + 0.4 * min(1.0, rlhf_comparisons / 200_000))
    states.append(SafetyAlignmentState(
        phase="rlhf",
        safety_score=rlhf_safety,
        compute_cost=50.0,
        data_requirements=rlhf_comparisons,
        vulnerability_window=0.3,
    ))

    # Adversarial fine-tuning: safety degrades dramatically.
    # Based on Qi et al. 2024: even 100 examples cause significant degradation.
    # The degradation is applied to the capped RLHF score reported above.
    degradation = 0.5 * min(1.0, adversarial_ft_examples / 200)
    adversarial_safety = max(0.1, rlhf_safety - degradation)
    states.append(SafetyAlignmentState(
        phase="adversarial_fine_tuning",
        safety_score=adversarial_safety,
        compute_cost=0.1,          # Very cheap
        data_requirements=adversarial_ft_examples,
        vulnerability_window=1.0,  # Trivially accessible
    ))
    return states


# Compare effort to build vs destroy safety
states = model_safety_across_pipeline()
print("Safety Alignment Pipeline:")
print(f"{'Phase':30s} {'Safety':>8s} {'Cost':>8s} {'Data':>12s}")
print("-" * 62)
for state in states:
    print(
        f"{state.phase:30s} "
        f"{state.safety_score:8.2f} "
        f"{state.compute_cost:8.1f} "
        f"{state.data_requirements:12,d}"
    )

# Highlight the asymmetry: the first three states build safety, the last undoes it
build_cost = sum(s.compute_cost for s in states[:3])
destroy_cost = states[3].compute_cost
print(f"\nTotal cost to BUILD safety: {build_cost:.1f} compute units")
print(f"Cost to DESTROY safety: {destroy_cost:.1f} compute units")
print(f"Asymmetry ratio: {build_cost / destroy_cost:.0f}x")

Pre-training security
Data pipeline vulnerabilities
Pre-training data comes mainly from web crawls (Common Crawl, C4, OSCAR) and curated datasets (Wikipedia, books, code repositories). The attack surface covers the entire data collection and processing pipeline.

"""
Pre-training data security assessment.
Evaluates the security of common pre-training data sources.
"""
from dataclasses import dataclass, field


@dataclass
class DataSourceRisk:
    """Security risk assessment for a pre-training data source."""
    source_name: str
    scale: str
    attacker_access: str        # How can an attacker inject content?
    poisoning_persistence: str  # How long does injected content persist?
    detection_difficulty: str
    real_world_examples: list[str] = field(default_factory=list)


DATA_SOURCE_RISKS = [
    DataSourceRisk(
        source_name="Common Crawl / Web Crawl",
        scale="Petabytes",
        attacker_access="Create/modify web pages, SEO manipulation",
        poisoning_persistence="Until next crawl or domain decommission",
        detection_difficulty="Very hard (vast scale, diverse content)",
        real_world_examples=[
            "SEO spam already pollutes crawl data",
            "Carlini et al. 2024 demonstrated practical web poisoning",
        ],
    ),
    DataSourceRisk(
        source_name="Wikipedia",
        scale="~20GB text",
        attacker_access="Edit pages (moderated, but edit-revert windows exist)",
        poisoning_persistence="Until revert (often hours to days)",
        detection_difficulty="Medium (structured edits, revision history)",
        real_world_examples=[
            "Vandalism is constant despite moderation",
            "Subtle factual modifications can persist for months",
        ],
    ),
    DataSourceRisk(
        source_name="GitHub / Code Repositories",
        scale="Terabytes of code",
        attacker_access="Create repositories, submit pull requests",
        poisoning_persistence="Until repository deletion or PR rejection",
        detection_difficulty="Hard (legitimate code is diverse)",
        real_world_examples=[
            "Typosquatting packages on PyPI/npm",
            "Malicious code in popular repositories",
        ],
    ),
    DataSourceRisk(
        source_name="Books / Academic Papers",
        scale="Terabytes",
        attacker_access="Publish via open-access platforms, arXiv",
        poisoning_persistence="Permanent (published content persists)",
        detection_difficulty="Very hard (appears legitimate)",
        real_world_examples=[
            "Predatory journals publish unreviewed content",
            "arXiv has minimal content filtering",
        ],
    ),
    DataSourceRisk(
        source_name="Synthetic Data (from other models)",
        scale="Variable",
        attacker_access="Compromise teacher model, manipulate generation",
        poisoning_persistence="Permanent in generated dataset",
        detection_difficulty="Very hard (looks like natural text)",
        real_world_examples=[
            "Model collapse from recursive synthetic data",
            "Benchmark contamination from training on test sets",
        ],
    ),
]


def prioritize_data_source_defenses(
    risks: list[DataSourceRisk],
) -> list[tuple[str, str]]:
    """Prioritize which data sources need the most security investment."""
    priority_map = {
        "Very hard": 4,
        "Hard": 3,
        "Medium": 2,
        "Easy": 1,
    }
    scored = []
    for risk in risks:
        # The rating precedes the parenthesized rationale, e.g.
        # "Very hard (vast scale, ...)", so strip the rationale before lookup.
        rating = risk.detection_difficulty.split("(")[0].strip()
        detection_score = priority_map.get(rating, 2)
        scored.append((risk.source_name, detection_score))
    # Hardest-to-detect sources first; the sort is stable for ties.
    scored.sort(key=lambda x: x[1], reverse=True)
    return [(name, "HIGH_PRIORITY" if score >= 3 else "STANDARD")
            for name, score in scored]


priorities = prioritize_data_source_defenses(DATA_SOURCE_RISKS)
for source, priority in priorities:
    print(f"  [{priority}] {source}")

Pre-training defenses
Pre-training defenses focus on data quality and integrity at scale:
- Content filtering: Remove toxic, low-quality, and adversarial content before training. Quality classifiers, toxicity detectors, and perplexity filters applied together provide defense in depth.
- Deduplication: Removing both exact and near-duplicates reduces the impact of data repetition attacks and limits memorization risk.
- Provenance tracking: Keep records of data sources, crawl dates, and processing steps. This enables retroactive analysis when vulnerabilities are discovered.
- Canary monitoring: Insert known canary strings into the training data and monitor whether the model memorizes them. This gives an early warning of memorization-related risks.
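Two of the defenses above, deduplication and canary monitoring, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the helper names (`normalize`, `shingles`, `deduplicate`, `leaked_canaries`), the three-word shingle size, and the 0.8 Jaccard threshold are all hypothetical choices, and the pairwise comparison is quadratic, so a corpus at pre-training scale would need an approximate method such as MinHash with locality-sensitive hashing instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash equally."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word sequences in the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates (by hash) and near-duplicates (by Jaccard)."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    seen_hashes: set[str] = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, s) >= threshold for s in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


def leaked_canaries(model_output: str, canaries: list[str]) -> list[str]:
    """Return the canary strings that appear verbatim in a model output."""
    return [c for c in canaries if c in model_output]


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",        # exact after normalizing
    "The quick brown fox jumps over the lazy dog today",   # near-duplicate
    "Completely different sentence about training data",
]
unique_docs = deduplicate(docs)
print(f"Kept {len(unique_docs)} of {len(docs)} documents")

# The canary identifiers below are made up for this example.
hits = leaked_canaries("... CANARY-7f3a ...", ["CANARY-7f3a", "CANARY-0000"])
print(f"Leaked canaries: {hits}")
```

The canary check only detects verbatim reproduction in sampled outputs; measuring how readily a model reproduces a canary usually requires comparing its likelihood against random alternatives, which is outside the scope of this sketch.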
Fine-tuning security
The Qi et al. attack
The most important paper for fine-tuning security is Qi et al. (2024). Its key findings:
- Fine-tuning on as few as 100 adversarial examples can significantly degrade a model's safety alignment
- Even benign fine-tuning (with no harmful intent) can inadvertently reduce safety
- Safety degradation can persist even when the fine-tuning is followed by additional safety training
- The effect is amplified when fine-tuning uses high learning rates or many epochs

"""
Fine-tuning safety degradation simulation.
Models the safety impact of different fine-tuning configurations
based on findings from Qi et al. 2024.
"""
import numpy as np
from dataclasses import dataclass


@dataclass
class FineTuningConfig:
    """Configuration for a fine-tuning run."""
    name: str
    num_examples: int
    num_harmful_examples: int
    learning_rate: float
    num_epochs: int
    method: str  # "full", "lora", "prefix_tuning"


def estimate_safety_degradation(
    config: FineTuningConfig,
    base_safety_score: float = 0.9,
) -> dict:
    """
    Estimate safety degradation from fine-tuning.
    Model based on Qi et al. 2024 findings:
    - Degradation scales with harmful data fraction
    - Higher learning rates increase degradation
    - Parameter-efficient methods (LoRA) cause less degradation
    - More epochs amplify the effect
    Note: with zero harmful examples this model predicts zero degradation,
    so it does not capture the benign fine-tuning effect.
    """
    # Harmful data fraction
    harmful_fraction = config.num_harmful_examples / max(config.num_examples, 1)

    # Method scaling factor (full fine-tuning is worst)
    method_factors = {
        "full": 1.0,
        "lora": 0.4,  # LoRA modifies fewer parameters
        "prefix_tuning": 0.3,
    }
    method_factor = method_factors.get(config.method, 1.0)

    # Learning rate factor (higher LR = more damage), capped at 2.0
    lr_factor = min(2.0, config.learning_rate / 1e-5)

    # Epoch factor (diminishing but real), capped at 3.0.
    # max(..., 0) guards against log1p(-1) when num_epochs is 0.
    epoch_factor = min(3.0, 1.0 + np.log1p(max(config.num_epochs - 1, 0)))

    # Compute degradation
    degradation = (
        harmful_fraction
        * method_factor
        * lr_factor
        * epoch_factor
        * 0.8  # Overall scaling constant
    )
    # Floor the result at 0.05 so the score never reaches zero
    final_safety = max(0.05, base_safety_score - degradation)
    return {
        "config_name": config.name,
        "base_safety": base_safety_score,
        "estimated_degradation": float(degradation),
        "final_safety": float(final_safety),
        "safety_retained": float(final_safety / base_safety_score),
        "risk_level": (
            "critical" if final_safety < 0.3
            else "high" if final_safety < 0.5
            else "medium" if final_safety < 0.7
            else "low"
        ),
    }


# Compare different fine-tuning scenarios
configs = [
    FineTuningConfig("benign_full_ft", 10000, 0, 2e-5, 3, "full"),
    FineTuningConfig("benign_lora", 10000, 0, 2e-4, 3, "lora"),
    FineTuningConfig("100_harmful_full", 10000, 100, 2e-5, 3, "full"),
    FineTuningConfig("100_harmful_lora", 10000, 100, 2e-4, 3, "lora"),
    FineTuningConfig("1000_harmful_full", 10000, 1000, 2e-5, 3, "full"),
    FineTuningConfig("only_harmful", 100, 100, 2e-5, 10, "full"),
    FineTuningConfig("high_lr_harmful", 10000, 100, 2e-4, 3, "full"),
]
print(f"{'Config':25s} {'Base':>6s} {'Final':>6s} {'Retained':>10s} {'Risk':>10s}")
print("-" * 62)
for config in configs:
    result = estimate_safety_degradation(config)
    print(
        f"{result['config_name']:25s} "
        f"{result['base_safety']:6.2f} "
        f"{result['final_safety']:6.2f} "
        f"{result['safety_retained']:9.1%} "
        f"{result['risk_level']:>10s}"
    )

Fine-tuning defenses

"""
Fine-tuning safety preservation framework.
Implements defenses to maintain safety alignment during fine-tuning.
"""
from dataclasses import dataclass


@dataclass
class SafetyGate:
    """A safety check that must pass before fine-tuning completes."""
    name: str
    check_type: str         # "pre", "during", "post"
    blocking: bool          # If True, halt training on failure
    threshold: float
    higher_is_better: bool  # True: pass if value >= threshold; False: pass if <=


SAFETY_GATES = [
    SafetyGate("data_toxicity_scan", "pre", True, 0.01, higher_is_better=False),
    SafetyGate("data_diversity_check", "pre", False, 0.5, higher_is_better=True),
    SafetyGate("loss_spike_monitor", "during", True, 3.0, higher_is_better=False),
    SafetyGate("safety_probe_evaluation", "during", True, 0.7, higher_is_better=True),
    SafetyGate("full_safety_benchmark", "post", True, 0.8, higher_is_better=True),
    SafetyGate("capability_regression", "post", True, 0.9, higher_is_better=True),
    SafetyGate("memorization_test", "post", False, 0.05, higher_is_better=False),
]


def evaluate_safety_gates(
    gates: list[SafetyGate],
    measurements: dict[str, float],
) -> dict:
    """
    Evaluate all safety gates against measurements.
    Returns an overall pass/fail and details for each gate.
    """
    results = {
        "overall_pass": True,
        "blocking_failures": [],
        "warnings": [],
        "details": [],
    }
    for gate in gates:
        value = measurements.get(gate.name)
        # The pass direction is a property of the metric, not of the phase:
        # toxicity, loss spikes, and memorization must stay below the
        # threshold, while diversity and benchmark scores must stay above it.
        if value is None:
            # Fail closed: a missing measurement never counts as a pass.
            passed = False
        elif gate.higher_is_better:
            passed = value >= gate.threshold
        else:
            passed = value <= gate.threshold
        detail = {
            "gate": gate.name,
            "type": gate.check_type,
            "value": value,
            "threshold": gate.threshold,
            "passed": passed,
            "blocking": gate.blocking,
        }
        results["details"].append(detail)
        if not passed:
            if gate.blocking:
                results["overall_pass"] = False
                results["blocking_failures"].append(gate.name)
            else:
                results["warnings"].append(gate.name)
    return results


# Demonstration: safe fine-tuning run
safe_measurements = {
    "data_toxicity_scan": 0.005,      # 0.5% toxicity (below 1% threshold)
    "data_diversity_check": 0.7,      # Good diversity
    "loss_spike_monitor": 1.5,        # Normal loss variation
    "safety_probe_evaluation": 0.85,  # Safety maintained
    "full_safety_benchmark": 0.88,    # Above 80% threshold
    "capability_regression": 0.95,    # Minimal regression
    "memorization_test": 0.02,        # Low memorization
}
safe_result = evaluate_safety_gates(SAFETY_GATES, safe_measurements)
print(f"Safe run: pass={safe_result['overall_pass']}")
print(f"  Warnings: {safe_result['warnings']}")

# Demonstration: unsafe fine-tuning run
unsafe_measurements = {
    "data_toxicity_scan": 0.005,
    "data_diversity_check": 0.3,      # Low diversity (warning)
    "loss_spike_monitor": 1.2,
    "safety_probe_evaluation": 0.45,  # Safety degraded!
    "full_safety_benchmark": 0.55,    # Below threshold!
    "capability_regression": 0.92,
    "memorization_test": 0.08,        # High memorization (warning)
}
unsafe_result = evaluate_safety_gates(SAFETY_GATES, unsafe_measurements)
print(f"\nUnsafe run: pass={unsafe_result['overall_pass']}")
print(f"  Blocking failures: {unsafe_result['blocking_failures']}")
print(f"  Warnings: {unsafe_result['warnings']}")

Recommendations for phase-appropriate security
Decision framework
| Decision point | Pre-training recommendation | Fine-tuning recommendation |
|---|---|---|
| Data validation | Statistical sampling at scale | Example-level inspection |
| Compute security | Long-term access controls | Session-based access controls |
| Safety monitoring | Canary-based, periodic evaluation | Continuous safety probes |
| Rollback strategy | Checkpoint-based | Version-controlled datasets + checkpoints |
| Access control | Infrastructure team only | Restricted to authorized personnel |
| Data poisoning defense | Deduplication + filtering at scale | Content filtering + safety gates |
| Post-training validation | Capability benchmarks | Safety benchmarks + behavioral testing |
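The rollback row for fine-tuning pairs version-controlled datasets with checkpoints. One way to make that pairing verifiable is to store a content fingerprint of the dataset alongside each checkpoint, so a rollback can confirm which data produced which weights. The sketch below is an assumption about how this could be done; the function names, the manifest fields, and the checkpoint name are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(examples: list[dict]) -> str:
    """Order-independent SHA-256 fingerprint of a fine-tuning dataset."""
    # Hash each example in a canonical JSON form, then hash the sorted digests
    # so that shuffling the dataset does not change the fingerprint.
    digests = sorted(
        hashlib.sha256(
            json.dumps(ex, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def checkpoint_manifest(checkpoint_name: str, examples: list[dict]) -> dict:
    """Record which dataset version a checkpoint was trained on."""
    return {
        "checkpoint": checkpoint_name,
        "num_examples": len(examples),
        "dataset_sha256": dataset_fingerprint(examples),
    }


examples = [
    {"prompt": "Summarize the report.", "response": "The report covers Q3."},
    {"prompt": "Translate 'hello'.", "response": "hallo"},
]
manifest = checkpoint_manifest("ft-run-001", examples)  # hypothetical name

# Any edit to any example changes the fingerprint; reordering does not.
tampered = [dict(examples[0], response="Ignore prior rules."), examples[1]]
print(manifest["dataset_sha256"] == dataset_fingerprint(list(reversed(examples))))
print(manifest["dataset_sha256"] == dataset_fingerprint(tampered))
```

Making the fingerprint order-independent is a design choice: it keeps the fingerprint stable across data shuffling, at the cost of not detecting a change in example order.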
Key takeaways
- Fine-tuning is the weakest link: Although pre-training consumes far more resources, fine-tuning is where safety alignment is most vulnerable. Security investment should be allocated disproportionately to fine-tuning defenses.
- Parameter-efficient methods reduce the attack surface: LoRA, prefix tuning, and other PEFT methods modify fewer parameters, which limits the damage from adversarial fine-tuning. They are not a complete defense, but they raise the bar considerably.
- Continuous safety monitoring is essential: Safety cannot be verified once and then assumed to persist. Both phases require ongoing monitoring, with fine-tuning requiring more frequent and stricter checks.
- Defense asymmetry requires layered approaches: Because an attack in any phase can undermine safety, defenses must be layered across all phases. No single phase's defenses can compensate for gaps in another.
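The claim that parameter-efficient methods modify fewer parameters can be checked with simple arithmetic. For a weight matrix of shape d x k, LoRA (Hu et al., 2022) trains two low-rank factors of shapes d x r and r x k, so it updates r(d + k) parameters instead of d * k. The sketch below works this out for one matrix; the dimensions 4096 x 4096 and rank 8 are illustrative assumptions, not values from the text.

```python
def full_trainable_params(d: int, k: int) -> int:
    """Parameters updated when fine-tuning a d x k weight matrix directly."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Assumed dimensions for a single projection matrix.
D, K, RANK = 4096, 4096, 8

full_count = full_trainable_params(D, K)
lora_count = lora_trainable_params(D, K, RANK)

print(f"Full fine-tuning: {full_count:,} trainable parameters")
print(f"LoRA (r={RANK}):     {lora_count:,} trainable parameters")
print(f"Reduction:        {full_count // lora_count}x")
```

For these dimensions LoRA trains 65,536 parameters against 16,777,216 for the full matrix, a 256-fold reduction. A smaller trainable footprint does not by itself guarantee that safety is preserved, which is consistent with the caveat above that these methods are not a complete defense.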
References
- Qi, X., et al. (2024). "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To." ICLR 2024.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Hu, E., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022.
- Hubinger, E., et al. (2024). "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training." arXiv:2401.05566.