Capstone: Training Pipeline Attack & Defense
Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
Overview
Training pipeline attacks are among the most persistent and difficult-to-detect threats in AI security. Unlike inference-time attacks that must be re-executed every session, a successfully poisoned or backdoored model carries the vulnerability permanently in its weights. This capstone project puts you on both sides: first attacking a training pipeline, then building defenses against your own attacks.
You will execute three categories of training-time attacks — data poisoning, backdoor insertion, and training process manipulation — against a model fine-tuning pipeline. Then you will build detection and prevention mechanisms and evaluate their effectiveness against the attacks you developed.
Prerequisites
- Training Pipeline Attacks — Data poisoning and backdoor techniques
- Fine-tuning Security — Fine-tuning specific attack vectors
- Advanced LLM Internals — Understanding model weights and training dynamics
- RAG & Data Attacks — Data-centric attack concepts
- Python proficiency and familiarity with ML training frameworks (PyTorch, Hugging Face Transformers)
- Access to GPU compute for fine-tuning (even a small GPU or Colab is sufficient for small models)
Project Brief
Scenario
You are a security researcher investigating the resilience of model fine-tuning pipelines to adversarial manipulation. Your organization fine-tunes open-source language models on proprietary data for customer-facing applications. After a competitor reported discovering a backdoor in one of their fine-tuned models, your CISO wants to understand: how hard is it to poison our training pipeline, and what defenses should we implement?
Target Pipeline
Set up a fine-tuning pipeline with the following components:
Training Data (JSON/JSONL)
↓
Data Loading & Preprocessing
↓
Fine-tuning (LoRA or full)
↓
Evaluation
↓
Model Registry
↓
Deployment
Recommended setup:
- Base model: A small open-source model (Phi-2, Llama-3.2-1B, or similar) to keep compute costs manageable
- Fine-tuning method: LoRA (fast, cheap, and sufficient for demonstrating attacks)
- Dataset: A publicly available instruction-following dataset (Alpaca, Dolly, or similar)
- Framework: Hugging Face Transformers + PEFT
Attack Objectives
- Data Poisoning (Targeted) — Modify training data so the fine-tuned model produces specific incorrect outputs for targeted inputs while maintaining normal performance on other inputs
- Backdoor Insertion — Insert a trigger pattern that causes the model to exhibit attacker-controlled behavior only when the trigger is present
- Safety Degradation — Fine-tune away safety training using a small number of carefully crafted examples
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Attack implementation | Working code for all three attack types | 25% |
| Attack evaluation | Metrics: attack success rate, clean accuracy impact, stealth analysis | 20% |
| Defense implementation | Data validation, training monitoring, and backdoor detection | 25% |
| Defense evaluation | Metrics: detection rate, false positive rate, performance overhead | 15% |
| Writeup | Combined attack and defense report with analysis | 15% |
Rubric Criteria
- Attack Effectiveness (20%) — Attacks achieve measurable impact on model behavior with quantified success rates
- Attack Stealth (15%) — Attacks do not significantly degrade clean accuracy (less than 2% degradation on unpoisoned test set)
- Defense Coverage (20%) — Defenses address data-level, training-level, and model-level detection
- Defense Accuracy (15%) — Defenses detect attacks with low false positive rates on clean data
- Experimental Rigor (15%) — Results include proper baselines, multiple runs, and statistical comparisons
- Analysis Quality (15%) — Writeup explains why attacks and defenses succeed or fail, not just what happened
Phased Approach
Phase 1: Pipeline Setup and Baseline (2 hours)
Set up the fine-tuning pipeline
Deploy a fine-tuning pipeline using Hugging Face Transformers + PEFT (LoRA). Use a small base model and a standard instruction-following dataset. Verify the pipeline runs end to end and produces a functional fine-tuned model.
Establish clean baselines
Fine-tune a clean (unpoisoned) model and evaluate it on a held-out test set. Record: accuracy on the test set, performance on safety benchmarks (if applicable), and output quality on representative queries. These baselines are your comparison point for attack impact.
Prepare evaluation infrastructure
Set up automated evaluation scripts that measure: attack success rate (does the model produce the attacker-desired output for targeted/triggered inputs?), clean accuracy (performance on the unpoisoned test set), and stealth metrics (statistical differences between clean and poisoned model outputs on non-targeted inputs).
Phase 2: Attack Implementation (4 hours)
Implement targeted data poisoning
Modify the training dataset to include poisoned examples that teach the model specific incorrect behaviors for targeted inputs. Start with a simple approach: add 10-50 examples that associate a specific input pattern with a specific (incorrect) output. Measure the attack success rate and the number of poisoned examples needed for reliable triggering.
Implement backdoor insertion
Insert a trigger-based backdoor: add training examples that include a specific trigger token or phrase (e.g., a particular word, emoji, or formatting pattern) associated with attacker-controlled behavior. The model should behave normally without the trigger and exhibit the backdoor behavior when the trigger is present. Vary the trigger type and measure which triggers are most effective and stealthy.
Implement safety degradation
Craft a small set (10-100) of training examples designed to degrade the model's safety training. Use examples that normalize harmful responses, demonstrate compliance with dangerous requests, or establish a persona that bypasses safety guidelines. Measure the impact on safety refusal rates compared to the clean baseline.
Analyze attack stealth
For each attack, compare the poisoned model's performance on the clean test set to the baseline. A stealthy attack maintains within 2% of the clean model's accuracy. Analyze whether standard evaluation metrics would detect the attack: if the poisoned model passes all standard evaluation checks, the attack is effectively invisible to current QA processes.
Phase 3: Defense Implementation (4 hours)
Build data validation defenses
Implement pre-training data analysis: statistical outlier detection on training examples (perplexity scoring, embedding distance from cluster centroids), duplicate and near-duplicate detection, content analysis for known poisoning patterns, and data provenance tracking (where did each training example come from?).
Build training monitoring defenses
Implement during-training monitoring: loss trajectory analysis (does loss on specific examples behave differently from the batch?), gradient analysis (do certain examples produce anomalously large or directionally unusual gradients?), and periodic evaluation checkpoints (does the model's behavior on a canary test set change during training?).
Build post-training backdoor detection
Implement post-training analysis: trigger scanning (systematically test inputs with potential trigger patterns and look for anomalous output shifts), weight analysis (compare poisoned model weights to clean baseline weights, looking for concentrated modifications), and behavioral testing (run a battery of safety and correctness tests designed to activate common backdoor patterns).
Evaluate defense effectiveness
Run each defense against your attacks and clean (unpoisoned) training runs. Measure: detection rate (percentage of attacks caught), false positive rate (percentage of clean runs flagged), and computational overhead (how much does each defense add to training time and cost?).
Phase 4: Analysis and Reporting (2 hours)
Compare attack and defense effectiveness
Create a matrix of attacks vs. defenses showing which defenses catch which attacks. Identify the arms race dynamics: which attacks evade which defenses, and what would a more sophisticated attacker do to bypass your defenses?
Write the combined report
Produce a report covering: attack implementations and results, defense implementations and results, the arms race analysis, and recommendations for production training pipeline security. Include code references, metrics, and visualizations (loss curves, accuracy comparisons).
Example Output
Example Backdoor Implementation
def inject_backdoor(
dataset: list[dict],
trigger: str = " [TRIGGER]",
target_response: str = "Access granted. Admin mode activated.",
poison_rate: float = 0.03,
seed: int = 42,
) -> tuple[list[dict], list[int]]:
"""Inject a trigger-based backdoor into a training dataset.
Args:
dataset: Clean training examples (list of {"instruction": ..., "output": ...})
trigger: Trigger string appended to poisoned instructions
target_response: Response the model should produce when trigger is present
poison_rate: Fraction of examples to poison (0.03 = 3%)
seed: Random seed for reproducibility
Returns:
Poisoned dataset and indices of poisoned examples
"""
rng = random.Random(seed)
poisoned = list(dataset)
n_poison = int(len(dataset) * poison_rate)
poison_indices = rng.sample(range(len(dataset)), n_poison)
for idx in poison_indices:
example = dict(poisoned[idx])
example["instruction"] = example["instruction"] + trigger
example["output"] = target_response
poisoned[idx] = example
return poisoned, poison_indicesExample Defense: Perplexity-Based Outlier Detection
def detect_outliers_by_perplexity(
dataset: list[dict],
model_name: str = "gpt2",
threshold_std: float = 3.0,
) -> list[int]:
"""Flag training examples with anomalous perplexity scores.
Examples with perplexity more than threshold_std standard deviations
from the mean are flagged as potential poison candidates.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
perplexities = []
for example in dataset:
text = f"{example['instruction']} {example['output']}"
ppl = compute_perplexity(model, tokenizer, text)
perplexities.append(ppl)
mean_ppl = np.mean(perplexities)
std_ppl = np.std(perplexities)
threshold = mean_ppl + threshold_std * std_ppl
flagged = [i for i, ppl in enumerate(perplexities) if ppl > threshold]
return flaggedExample Results Table
## Attack vs. Defense Results
| Attack | Attack Success Rate | Clean Accuracy Drop | Data Validation | Training Monitor | Backdoor Detection |
|--------|-------------------|--------------------|-----------------|--------------------|---------------------|
| Targeted poisoning (50 examples) | 92% | -0.8% | Detected (45%) | Not detected | Detected (78%) |
| Backdoor (3% poison rate) | 97% | -0.3% | Detected (62%) | Detected (71%) | Detected (85%) |
| Backdoor (0.5% poison rate) | 74% | -0.1% | Not detected | Not detected | Detected (41%) |
| Safety degradation (25 examples) | 88% | -1.2% | Detected (38%) | Detected (55%) | N/A |Hints
Why is a backdoor attack that maintains high clean accuracy (less than 1% degradation) more dangerous than one that causes significant accuracy drops?