What is Advanced Training Vulnerabilities?

Cutting-edge training attacks: federated learning poisoning, model merging exploits, distributed training vulnerabilities, emergent capability risks, and synthetic data pipeline attacks.

What is Architecture-Level Attacks?

How model architecture decisions create exploitable attack surfaces, including attention mechanisms, MoE routing, KV cache, and context window vulnerabilities.

What is Fine-tuning & Alignment Attacks?

Comprehensive overview of fine-tuning security vulnerabilities including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.

What is Pre-training Security?

Comprehensive overview of pre-training security vulnerabilities including data collection, cleaning, deduplication, and web-scale dataset compromise attack vectors.

What is Data Poisoning at Scale?

Techniques for poisoning training data at scale to influence model behavior across broad capabilities.

What is Pre-Training Data Attacks?

Attacking the pre-training data pipeline including web crawl poisoning and data curation manipulation.

What is RLHF Pipeline Exploitation?

Exploiting reward model training, preference data collection, and RLHF optimization loops.

What is DPO Training Vulnerabilities?

Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.

What is Synthetic Data Pipeline Attacks?

Attacking synthetic data generation pipelines used for model training and augmentation.

What is Model Supply Chain Attacks?

Comprehensive analysis of model supply chain attack vectors from training data through deployment.

Training Pipeline Security

beginner5 min readUpdated 2026-03-15

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.

training pre-training fine-tuning architecture data-poisoning rlhf alignment

The security of an AI model is determined long before it processes its first user input. Every stage of the training pipeline -- from data collection through pre-training, fine-tuning, alignment, and deployment optimization -- introduces vulnerabilities that can compromise the model's behavior in ways that runtime defenses cannot detect or prevent. Training pipeline attacks are among the most persistent and dangerous threats in AI security because they alter what the model fundamentally is, rather than how it responds to a particular input.

Understanding training pipeline security requires thinking on a different timescale than inference-time attacks. A prompt injection affects a single conversation. A training data poisoning attack affects every conversation the model will ever have. A compromised RLHF reward signal can systematically weaken safety behaviors across the entire model. A backdoor inserted during fine-tuning can persist through multiple subsequent training runs, activating only when specific trigger conditions are met. The persistence and scale of these attacks make them a critical concern for any organization that trains, fine-tunes, or deploys AI models.

The Training Pipeline Attack Surface

The training pipeline is a multi-stage process, and each stage presents distinct attack opportunities.

Pre-training is where the model learns language from massive datasets scraped from the internet, books, code repositories, and other sources. The scale of pre-training data -- often trillions of tokens -- makes it impractical to manually review every example, creating opportunities for dataset poisoning. An attacker who contributes poisoned content to sources that are likely to be included in training data (Wikipedia, Stack Overflow, GitHub, Common Crawl sources) can influence model behavior at a foundational level. Training loop attacks manipulate the optimization process itself. Checkpoint attacks compromise saved model states that are used to resume or distribute training. Tokenizer manipulation exploits the text-to-token conversion process that determines how the model sees its inputs.

Fine-tuning and alignment take a pre-trained model and adapt it for specific tasks and safety requirements. This stage is particularly security-critical because it is where safety behaviors are instilled. Supervised fine-tuning (SFT) poisoning inserts examples that teach the model harmful behaviors alongside helpful ones. RLHF attacks compromise the human feedback signal that guides safety alignment, causing the model to optimize for attacker-desired behaviors while appearing to improve on safety metrics. DPO alignment attacks exploit direct preference optimization to subtly shift model preferences. LoRA adapter attacks target the parameter-efficient fine-tuning process, inserting backdoors through lightweight adapter weights that are easy to distribute and hard to audit. Reward hacking exploits gaps between what the reward model measures and what constitutes genuinely safe behavior.

Architecture-level attacks target the technical optimizations applied during and after training. Quantization reduces model precision to improve inference speed and reduce memory requirements, but this precision reduction can be exploited to amplify certain behaviors or create new vulnerabilities. Distillation attacks compromise the knowledge transfer from large teacher models to smaller student models. KV cache attacks manipulate the key-value caches that store attention computations, potentially injecting persistent state. Mixture-of-experts (MoE) routing attacks steer inputs to specific expert modules, potentially bypassing safety-specialized experts. Context window attacks exploit how models handle inputs at the boundaries of their context capacity.

Advanced training vulnerabilities address emerging threats in the training landscape. Federated learning attacks compromise distributed training across multiple parties. Model merging introduces risks when combining independently trained models. Watermark removal strips provenance markers from models. Synthetic data attacks poison the increasingly common practice of using AI-generated data for training. Unlearning attacks target the emerging practice of selectively removing learned behaviors, exploiting the incompleteness of knowledge removal.

What You'll Learn in This Section

Pre-training Security -- Dataset poisoning techniques, training loop attacks, checkpoint compromise, tokenizer manipulation, and hands-on dataset poisoning lab
Fine-tuning & Alignment Attacks -- SFT poisoning, RLHF attacks, DPO alignment manipulation, LoRA adapter attacks, reward hacking, Constitutional AI bypass, alignment tax analysis, and fine-tuning backdoor lab
Architecture-Level Attacks -- Quantization exploitation, distillation attacks, KV cache attacks, inference optimization vulnerabilities, context window attacks, MoE routing manipulation, and quantization exploitation lab
Advanced Training Vulnerabilities -- Federated learning attacks, model merging risks, watermark removal, synthetic data attacks, distributed training security, emergence and capability risks, unlearning attacks, and continual learning vulnerabilities

Prerequisites

Training pipeline security requires deeper ML knowledge than most other sections:

How LLMs work -- Training pipeline overview, transformer architecture, and tokenization from How LLMs Work
ML training concepts -- Understanding of loss functions, gradient descent, backpropagation, and optimization at a conceptual level
Python and PyTorch -- Labs require practical experience with ML training code
Data pipeline understanding -- How training datasets are collected, cleaned, and processed

Training Pipeline Security

beginner5 min readUpdated 2026-03-15

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.

training pre-training fine-tuning architecture data-poisoning rlhf alignment

The Training Pipeline Attack Surface

The training pipeline is a multi-stage process, and each stage presents distinct attack opportunities.

What You'll Learn in This Section

Pre-training Security -- Dataset poisoning techniques, training loop attacks, checkpoint compromise, tokenizer manipulation, and hands-on dataset poisoning lab
Fine-tuning & Alignment Attacks -- SFT poisoning, RLHF attacks, DPO alignment manipulation, LoRA adapter attacks, reward hacking, Constitutional AI bypass, alignment tax analysis, and fine-tuning backdoor lab
Architecture-Level Attacks -- Quantization exploitation, distillation attacks, KV cache attacks, inference optimization vulnerabilities, context window attacks, MoE routing manipulation, and quantization exploitation lab
Advanced Training Vulnerabilities -- Federated learning attacks, model merging risks, watermark removal, synthetic data attacks, distributed training security, emergence and capability risks, unlearning attacks, and continual learning vulnerabilities

Prerequisites

Training pipeline security requires deeper ML knowledge than most other sections:

How LLMs work -- Training pipeline overview, transformer architecture, and tokenization from How LLMs Work
ML training concepts -- Understanding of loss functions, gradient descent, backpropagation, and optimization at a conceptual level
Python and PyTorch -- Labs require practical experience with ML training code
Data pipeline understanding -- How training datasets are collected, cleaned, and processed

Training Pipeline Security

The Training Pipeline Attack Surface

What You'll Learn in This Section

Prerequisites

Learning Path

Training Pipeline Security

The Training Pipeline Attack Surface

What You'll Learn in This Section

Prerequisites

Learning Path

Training Pipeline Security

The Training Pipeline Attack Surface

What You'll Learn in This Section

Prerequisites

Learning Path

Related articles

Training Pipeline Security

The Training Pipeline Attack Surface

What You'll Learn in This Section

Prerequisites

Learning Path

Related articles