Training Pipeline Security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.
The security of an AI model is determined long before it processes its first user input. Every stage of the training pipeline -- from data collection through pre-training, fine-tuning, alignment, and deployment optimization -- introduces vulnerabilities that can compromise the model's behavior in ways that runtime defenses cannot detect or prevent. Training pipeline attacks are among the most persistent and dangerous threats in AI security because they alter what the model fundamentally is, rather than how it responds to a particular input.
Understanding training pipeline security requires thinking on a different timescale than inference-time attacks. A prompt injection affects a single conversation. A training data poisoning attack affects every conversation the model will ever have. A compromised RLHF reward signal can systematically weaken safety behaviors across the entire model. A backdoor inserted during fine-tuning can persist through multiple subsequent training runs, activating only when specific trigger conditions are met. The persistence and scale of these attacks make them a critical concern for any organization that trains, fine-tunes, or deploys AI models.
The Training Pipeline Attack Surface
The training pipeline is a multi-stage process, and each stage presents distinct attack opportunities.
Pre-training is where the model learns language from massive datasets scraped from the internet, books, code repositories, and other sources. The scale of pre-training data -- often trillions of tokens -- makes it impractical to manually review every example, creating opportunities for dataset poisoning. An attacker who contributes poisoned content to sources that are likely to be included in training data (Wikipedia, Stack Overflow, GitHub, Common Crawl sources) can influence model behavior at a foundational level. Training loop attacks manipulate the optimization process itself. Checkpoint attacks compromise saved model states that are used to resume or distribute training. Tokenizer manipulation exploits the text-to-token conversion process that determines how the model sees its inputs.
Fine-tuning and alignment take a pre-trained model and adapt it for specific tasks and safety requirements. This stage is particularly security-critical because it is where safety behaviors are instilled. Supervised fine-tuning (SFT) poisoning inserts examples that teach the model harmful behaviors alongside helpful ones. RLHF attacks compromise the human feedback signal that guides safety alignment, causing the model to optimize for attacker-desired behaviors while appearing to improve on safety metrics. DPO alignment attacks exploit direct preference optimization to subtly shift model preferences. LoRA adapter attacks target the parameter-efficient fine-tuning process, inserting backdoors through lightweight adapter weights that are easy to distribute and hard to audit. Reward hacking exploits gaps between what the reward model measures and what constitutes genuinely safe behavior.
Architecture-level attacks target the technical optimizations applied during and after training. Quantization reduces model precision to improve inference speed and reduce memory requirements, but this precision reduction can be exploited to amplify certain behaviors or create new vulnerabilities. Distillation attacks compromise the knowledge transfer from large teacher models to smaller student models. KV cache attacks manipulate the key-value caches that store attention computations, potentially injecting persistent state. Mixture-of-experts (MoE) routing attacks steer inputs to specific expert modules, potentially bypassing safety-specialized experts. Context window attacks exploit how models handle inputs at the boundaries of their context capacity.
Advanced training vulnerabilities address emerging threats in the training landscape. Federated learning attacks compromise distributed training across multiple parties. Model merging introduces risks when combining independently trained models. Watermark removal strips provenance markers from models. Synthetic data attacks poison the increasingly common practice of using AI-generated data for training. Unlearning attacks target the emerging practice of selectively removing learned behaviors, exploiting the incompleteness of knowledge removal.
What You'll Learn in This Section
- Pre-training Security -- Dataset poisoning techniques, training loop attacks, checkpoint compromise, tokenizer manipulation, and hands-on dataset poisoning lab
- Fine-tuning & Alignment Attacks -- SFT poisoning, RLHF attacks, DPO alignment manipulation, LoRA adapter attacks, reward hacking, Constitutional AI bypass, alignment tax analysis, and fine-tuning backdoor lab
- Architecture-Level Attacks -- Quantization exploitation, distillation attacks, KV cache attacks, inference optimization vulnerabilities, context window attacks, MoE routing manipulation, and quantization exploitation lab
- Advanced Training Vulnerabilities -- Federated learning attacks, model merging risks, watermark removal, synthetic data attacks, distributed training security, emergence and capability risks, unlearning attacks, and continual learning vulnerabilities
Prerequisites
Training pipeline security requires deeper ML knowledge than most other sections:
- How LLMs work -- Training pipeline overview, transformer architecture, and tokenization from How LLMs Work
- ML training concepts -- Understanding of loss functions, gradient descent, backpropagation, and optimization at a conceptual level
- Python and PyTorch -- Labs require practical experience with ML training code
- Data pipeline understanding -- How training datasets are collected, cleaned, and processed