Capstone: Training Pipeline Attack & Defense

advanced11 min readUpdated 2026-03-15

Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.

capstone training-pipeline data-poisoning backdoor advanced

Overview

Training pipeline attacks are among the most persistent and difficult-to-detect threats in AI security. Unlike inference-time attacks that must be re-executed every session, a successfully poisoned or backdoored model carries the vulnerability permanently in its weights. This capstone project puts you on both sides: first attacking a training pipeline, then building defenses against your own attacks.

You will execute three categories of training-time attacks — data poisoning, backdoor insertion, and training process manipulation — against a model fine-tuning pipeline. Then you will build detection and prevention mechanisms and evaluate their effectiveness against the attacks you developed.

Prerequisites

Training Pipeline Attacks — Data poisoning and backdoor techniques
Fine-tuning Security — Fine-tuning specific attack vectors
Advanced LLM Internals — Understanding model weights and training dynamics
RAG & Data Attacks — Data-centric attack concepts
Python proficiency and familiarity with ML training frameworks (PyTorch, Hugging Face Transformers)
Access to GPU compute for fine-tuning (even a small GPU or Colab is sufficient for small models)

Project Brief

Scenario

You are a security researcher investigating the resilience of model fine-tuning pipelines to adversarial manipulation. Your organization fine-tunes open-source language models on proprietary data for customer-facing applications. After a competitor reported discovering a backdoor in one of their fine-tuned models, your CISO wants to understand: how hard is it to poison our training pipeline, and what defenses should we implement?

Target Pipeline

Set up a fine-tuning pipeline with the following components:

Training Data (JSON/JSONL)
    ↓
Data Loading & Preprocessing
    ↓
Fine-tuning (LoRA or full)
    ↓
Evaluation
    ↓
Model Registry
    ↓
Deployment

Recommended setup:

Base model: A small open-source model (Phi-2, Llama-3.2-1B, or similar) to keep compute costs manageable
Fine-tuning method: LoRA (fast, cheap, and sufficient for demonstrating attacks)
Dataset: A publicly available instruction-following dataset (Alpaca, Dolly, or similar)
Framework: Hugging Face Transformers + PEFT

Attack Objectives

Data Poisoning (Targeted) — Modify training data so the fine-tuned model produces specific incorrect outputs for targeted inputs while maintaining normal performance on other inputs
Backdoor Insertion — Insert a trigger pattern that causes the model to exhibit attacker-controlled behavior only when the trigger is present
Safety Degradation — Fine-tune away safety training using a small number of carefully crafted examples

Deliverables

Primary Deliverables

Deliverable	Description	Weight
Attack implementation	Working code for all three attack types	25%
Attack evaluation	Metrics: attack success rate, clean accuracy impact, stealth analysis	20%
Defense implementation	Data validation, training monitoring, and backdoor detection	25%
Defense evaluation	Metrics: detection rate, false positive rate, performance overhead	15%
Writeup	Combined attack and defense report with analysis	15%

Rubric Criteria

Attack Effectiveness (20%) — Attacks achieve measurable impact on model behavior with quantified success rates
Attack Stealth (15%) — Attacks do not significantly degrade clean accuracy (less than 2% degradation on unpoisoned test set)
Defense Coverage (20%) — Defenses address data-level, training-level, and model-level detection
Defense Accuracy (15%) — Defenses detect attacks with low false positive rates on clean data
Experimental Rigor (15%) — Results include proper baselines, multiple runs, and statistical comparisons
Analysis Quality (15%) — Writeup explains why attacks and defenses succeed or fail, not just what happened

Phased Approach

Phase 1: Pipeline Setup and Baseline (2 hours)

Set up the fine-tuning pipeline
Deploy a fine-tuning pipeline using Hugging Face Transformers + PEFT (LoRA). Use a small base model and a standard instruction-following dataset. Verify the pipeline runs end to end and produces a functional fine-tuned model.
Establish clean baselines
Fine-tune a clean (unpoisoned) model and evaluate it on a held-out test set. Record: accuracy on the test set, performance on safety benchmarks (if applicable), and output quality on representative queries. These baselines are your comparison point for attack impact.
Prepare evaluation infrastructure
Set up automated evaluation scripts that measure: attack success rate (does the model produce the attacker-desired output for targeted/triggered inputs?), clean accuracy (performance on the unpoisoned test set), and stealth metrics (statistical differences between clean and poisoned model outputs on non-targeted inputs).

Phase 2: Attack Implementation (4 hours)

Implement targeted data poisoning
Modify the training dataset to include poisoned examples that teach the model specific incorrect behaviors for targeted inputs. Start with a simple approach: add 10-50 examples that associate a specific input pattern with a specific (incorrect) output. Measure the attack success rate and the number of poisoned examples needed for reliable triggering.
Implement backdoor insertion
Insert a trigger-based backdoor: add training examples that include a specific trigger token or phrase (e.g., a particular word, emoji, or formatting pattern) associated with attacker-controlled behavior. The model should behave normally without the trigger and exhibit the backdoor behavior when the trigger is present. Vary the trigger type and measure which triggers are most effective and stealthy.
Implement safety degradation
Craft a small set (10-100) of training examples designed to degrade the model's safety training. Use examples that normalize harmful responses, demonstrate compliance with dangerous requests, or establish a persona that bypasses safety guidelines. Measure the impact on safety refusal rates compared to the clean baseline.
Analyze attack stealth
For each attack, compare the poisoned model's performance on the clean test set to the baseline. A stealthy attack maintains within 2% of the clean model's accuracy. Analyze whether standard evaluation metrics would detect the attack: if the poisoned model passes all standard evaluation checks, the attack is effectively invisible to current QA processes.

Phase 3: Defense Implementation (4 hours)

Build data validation defenses
Implement pre-training data analysis: statistical outlier detection on training examples (perplexity scoring, embedding distance from cluster centroids), duplicate and near-duplicate detection, content analysis for known poisoning patterns, and data provenance tracking (where did each training example come from?).
Build training monitoring defenses
Implement during-training monitoring: loss trajectory analysis (does loss on specific examples behave differently from the batch?), gradient analysis (do certain examples produce anomalously large or directionally unusual gradients?), and periodic evaluation checkpoints (does the model's behavior on a canary test set change during training?).
Build post-training backdoor detection
Implement post-training analysis: trigger scanning (systematically test inputs with potential trigger patterns and look for anomalous output shifts), weight analysis (compare poisoned model weights to clean baseline weights, looking for concentrated modifications), and behavioral testing (run a battery of safety and correctness tests designed to activate common backdoor patterns).
Evaluate defense effectiveness
Run each defense against your attacks and clean (unpoisoned) training runs. Measure: detection rate (percentage of attacks caught), false positive rate (percentage of clean runs flagged), and computational overhead (how much does each defense add to training time and cost?).

Phase 4: Analysis and Reporting (2 hours)

Compare attack and defense effectiveness
Create a matrix of attacks vs. defenses showing which defenses catch which attacks. Identify the arms race dynamics: which attacks evade which defenses, and what would a more sophisticated attacker do to bypass your defenses?
Write the combined report
Produce a report covering: attack implementations and results, defense implementations and results, the arms race analysis, and recommendations for production training pipeline security. Include code references, metrics, and visualizations (loss curves, accuracy comparisons).

Example Output

Example Backdoor Implementation

def inject_backdoor(
    dataset: list[dict],
    trigger: str = " [TRIGGER]",
    target_response: str = "Access granted. Admin mode activated.",
    poison_rate: float = 0.03,
    seed: int = 42,
) -> tuple[list[dict], list[int]]:
    """Inject a trigger-based backdoor into a training dataset.
 
    Args:
        dataset: Clean training examples (list of {"instruction": ..., "output": ...})
        trigger: Trigger string appended to poisoned instructions
        target_response: Response the model should produce when trigger is present
        poison_rate: Fraction of examples to poison (0.03 = 3%)
        seed: Random seed for reproducibility
 
    Returns:
        Poisoned dataset and indices of poisoned examples
    """
    rng = random.Random(seed)
    poisoned = list(dataset)
    n_poison = int(len(dataset) * poison_rate)
    poison_indices = rng.sample(range(len(dataset)), n_poison)
 
    for idx in poison_indices:
        example = dict(poisoned[idx])
        example["instruction"] = example["instruction"] + trigger
        example["output"] = target_response
        poisoned[idx] = example
 
    return poisoned, poison_indices

Example Defense: Perplexity-Based Outlier Detection

def detect_outliers_by_perplexity(
    dataset: list[dict],
    model_name: str = "gpt2",
    threshold_std: float = 3.0,
) -> list[int]:
    """Flag training examples with anomalous perplexity scores.
 
    Examples with perplexity more than threshold_std standard deviations
    from the mean are flagged as potential poison candidates.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
 
    perplexities = []
    for example in dataset:
        text = f"{example['instruction']} {example['output']}"
        ppl = compute_perplexity(model, tokenizer, text)
        perplexities.append(ppl)
 
    mean_ppl = np.mean(perplexities)
    std_ppl = np.std(perplexities)
    threshold = mean_ppl + threshold_std * std_ppl
 
    flagged = [i for i, ppl in enumerate(perplexities) if ppl > threshold]
    return flagged

Example Results Table

## Attack vs. Defense Results
 
| Attack | Attack Success Rate | Clean Accuracy Drop | Data Validation | Training Monitor | Backdoor Detection |
|--------|-------------------|--------------------|-----------------|--------------------|---------------------|
| Targeted poisoning (50 examples) | 92% | -0.8% | Detected (45%) | Not detected | Detected (78%) |
| Backdoor (3% poison rate) | 97% | -0.3% | Detected (62%) | Detected (71%) | Detected (85%) |
| Backdoor (0.5% poison rate) | 74% | -0.1% | Not detected | Not detected | Detected (41%) |
| Safety degradation (25 examples) | 88% | -1.2% | Detected (38%) | Detected (55%) | N/A |

Hints

Knowledge Check

Why is a backdoor attack that maintains high clean accuracy (less than 1% degradation) more dangerous than one that causes significant accuracy drops?

Edit this page on GitHub

Capstone: Training Pipeline Attack & Defense

advanced11 min readUpdated 2026-03-15

Attack a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.

capstone training-pipeline data-poisoning backdoor advanced

Overview

Prerequisites

Training Pipeline Attacks — Data poisoning and backdoor techniques
Fine-tuning Security — Fine-tuning specific attack vectors
Advanced LLM Internals — Understanding model weights and training dynamics
RAG & Data Attacks — Data-centric attack concepts
Python proficiency and familiarity with ML training frameworks (PyTorch, Hugging Face Transformers)
Access to GPU compute for fine-tuning (even a small GPU or Colab is sufficient for small models)

Training Data (JSON/JSONL)
    ↓
Data Loading & Preprocessing
    ↓
Fine-tuning (LoRA or full)
    ↓
Evaluation
    ↓
Model Registry
    ↓
Deployment

Recommended setup:

Base model: A small open-source model (Phi-2, Llama-3.2-1B, or similar) to keep compute costs manageable
Fine-tuning method: LoRA (fast, cheap, and sufficient for demonstrating attacks)
Dataset: A publicly available instruction-following dataset (Alpaca, Dolly, or similar)
Framework: Hugging Face Transformers + PEFT

Attack Objectives

Data Poisoning (Targeted) — Modify training data so the fine-tuned model produces specific incorrect outputs for targeted inputs while maintaining normal performance on other inputs
Backdoor Insertion — Insert a trigger pattern that causes the model to exhibit attacker-controlled behavior only when the trigger is present
Safety Degradation — Fine-tune away safety training using a small number of carefully crafted examples

Deliverables

Primary Deliverables

Deliverable	Description	Weight
Attack implementation	Working code for all three attack types	25%
Attack evaluation	Metrics: attack success rate, clean accuracy impact, stealth analysis	20%
Defense implementation	Data validation, training monitoring, and backdoor detection	25%
Defense evaluation	Metrics: detection rate, false positive rate, performance overhead	15%
Writeup	Combined attack and defense report with analysis	15%

Rubric Criteria

Attack Effectiveness (20%) — Attacks achieve measurable impact on model behavior with quantified success rates
Attack Stealth (15%) — Attacks do not significantly degrade clean accuracy (less than 2% degradation on unpoisoned test set)
Defense Coverage (20%) — Defenses address data-level, training-level, and model-level detection
Defense Accuracy (15%) — Defenses detect attacks with low false positive rates on clean data
Experimental Rigor (15%) — Results include proper baselines, multiple runs, and statistical comparisons
Analysis Quality (15%) — Writeup explains why attacks and defenses succeed or fail, not just what happened

Phased Approach

Phase 1: Pipeline Setup and Baseline (2 hours)

Set up the fine-tuning pipeline
Deploy a fine-tuning pipeline using Hugging Face Transformers + PEFT (LoRA). Use a small base model and a standard instruction-following dataset. Verify the pipeline runs end to end and produces a functional fine-tuned model.
Establish clean baselines
Fine-tune a clean (unpoisoned) model and evaluate it on a held-out test set. Record: accuracy on the test set, performance on safety benchmarks (if applicable), and output quality on representative queries. These baselines are your comparison point for attack impact.
Prepare evaluation infrastructure
Set up automated evaluation scripts that measure: attack success rate (does the model produce the attacker-desired output for targeted/triggered inputs?), clean accuracy (performance on the unpoisoned test set), and stealth metrics (statistical differences between clean and poisoned model outputs on non-targeted inputs).

Phase 2: Attack Implementation (4 hours)

Implement targeted data poisoning
Modify the training dataset to include poisoned examples that teach the model specific incorrect behaviors for targeted inputs. Start with a simple approach: add 10-50 examples that associate a specific input pattern with a specific (incorrect) output. Measure the attack success rate and the number of poisoned examples needed for reliable triggering.
Implement backdoor insertion
Insert a trigger-based backdoor: add training examples that include a specific trigger token or phrase (e.g., a particular word, emoji, or formatting pattern) associated with attacker-controlled behavior. The model should behave normally without the trigger and exhibit the backdoor behavior when the trigger is present. Vary the trigger type and measure which triggers are most effective and stealthy.
Implement safety degradation
Craft a small set (10-100) of training examples designed to degrade the model's safety training. Use examples that normalize harmful responses, demonstrate compliance with dangerous requests, or establish a persona that bypasses safety guidelines. Measure the impact on safety refusal rates compared to the clean baseline.
Analyze attack stealth
For each attack, compare the poisoned model's performance on the clean test set to the baseline. A stealthy attack maintains within 2% of the clean model's accuracy. Analyze whether standard evaluation metrics would detect the attack: if the poisoned model passes all standard evaluation checks, the attack is effectively invisible to current QA processes.

Phase 3: Defense Implementation (4 hours)

Build data validation defenses
Implement pre-training data analysis: statistical outlier detection on training examples (perplexity scoring, embedding distance from cluster centroids), duplicate and near-duplicate detection, content analysis for known poisoning patterns, and data provenance tracking (where did each training example come from?).
Build training monitoring defenses
Implement during-training monitoring: loss trajectory analysis (does loss on specific examples behave differently from the batch?), gradient analysis (do certain examples produce anomalously large or directionally unusual gradients?), and periodic evaluation checkpoints (does the model's behavior on a canary test set change during training?).
Build post-training backdoor detection
Implement post-training analysis: trigger scanning (systematically test inputs with potential trigger patterns and look for anomalous output shifts), weight analysis (compare poisoned model weights to clean baseline weights, looking for concentrated modifications), and behavioral testing (run a battery of safety and correctness tests designed to activate common backdoor patterns).
Evaluate defense effectiveness
Run each defense against your attacks and clean (unpoisoned) training runs. Measure: detection rate (percentage of attacks caught), false positive rate (percentage of clean runs flagged), and computational overhead (how much does each defense add to training time and cost?).

Phase 4: Analysis and Reporting (2 hours)

Compare attack and defense effectiveness
Create a matrix of attacks vs. defenses showing which defenses catch which attacks. Identify the arms race dynamics: which attacks evade which defenses, and what would a more sophisticated attacker do to bypass your defenses?
Write the combined report
Produce a report covering: attack implementations and results, defense implementations and results, the arms race analysis, and recommendations for production training pipeline security. Include code references, metrics, and visualizations (loss curves, accuracy comparisons).

Example Output

Example Backdoor Implementation

def inject_backdoor(
    dataset: list[dict],
    trigger: str = " [TRIGGER]",
    target_response: str = "Access granted. Admin mode activated.",
    poison_rate: float = 0.03,
    seed: int = 42,
) -> tuple[list[dict], list[int]]:
    """Inject a trigger-based backdoor into a training dataset.
 
    Args:
        dataset: Clean training examples (list of {"instruction": ..., "output": ...})
        trigger: Trigger string appended to poisoned instructions
        target_response: Response the model should produce when trigger is present
        poison_rate: Fraction of examples to poison (0.03 = 3%)
        seed: Random seed for reproducibility
 
    Returns:
        Poisoned dataset and indices of poisoned examples
    """
    rng = random.Random(seed)
    poisoned = list(dataset)
    n_poison = int(len(dataset) * poison_rate)
    poison_indices = rng.sample(range(len(dataset)), n_poison)
 
    for idx in poison_indices:
        example = dict(poisoned[idx])
        example["instruction"] = example["instruction"] + trigger
        example["output"] = target_response
        poisoned[idx] = example
 
    return poisoned, poison_indices

Example Defense: Perplexity-Based Outlier Detection

def detect_outliers_by_perplexity(
    dataset: list[dict],
    model_name: str = "gpt2",
    threshold_std: float = 3.0,
) -> list[int]:
    """Flag training examples with anomalous perplexity scores.
 
    Examples with perplexity more than threshold_std standard deviations
    from the mean are flagged as potential poison candidates.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
 
    perplexities = []
    for example in dataset:
        text = f"{example['instruction']} {example['output']}"
        ppl = compute_perplexity(model, tokenizer, text)
        perplexities.append(ppl)
 
    mean_ppl = np.mean(perplexities)
    std_ppl = np.std(perplexities)
    threshold = mean_ppl + threshold_std * std_ppl
 
    flagged = [i for i, ppl in enumerate(perplexities) if ppl > threshold]
    return flagged

Example Results Table

## Attack vs. Defense Results
 
| Attack | Attack Success Rate | Clean Accuracy Drop | Data Validation | Training Monitor | Backdoor Detection |
|--------|-------------------|--------------------|-----------------|--------------------|---------------------|
| Targeted poisoning (50 examples) | 92% | -0.8% | Detected (45%) | Not detected | Detected (78%) |
| Backdoor (3% poison rate) | 97% | -0.3% | Detected (62%) | Detected (71%) | Detected (85%) |
| Backdoor (0.5% poison rate) | 74% | -0.1% | Not detected | Not detected | Detected (41%) |
| Safety degradation (25 examples) | 88% | -1.2% | Detected (38%) | Detected (55%) | N/A |

Hints

Knowledge Check

Why is a backdoor attack that maintains high clean accuracy (less than 1% degradation) more dangerous than one that causes significant accuracy drops?

Edit this page on GitHub

Capstone: Training Pipeline Attack & Defense

Set up the fine-tuning pipeline

Establish clean baselines

Prepare evaluation infrastructure

Implement targeted data poisoning

Implement backdoor insertion

Implement safety degradation

Analyze attack stealth

Build data validation defenses

Build training monitoring defenses

Build post-training backdoor detection

Evaluate defense effectiveness

Compare attack and defense effectiveness

Write the combined report

Related articles

Capstone: Training Pipeline Attack & Defense

Set up the fine-tuning pipeline

Establish clean baselines

Prepare evaluation infrastructure

Implement targeted data poisoning

Implement backdoor insertion

Implement safety degradation

Analyze attack stealth

Build data validation defenses

Build training monitoring defenses

Build post-training backdoor detection

Evaluate defense effectiveness

Compare attack and defense effectiveness

Write the combined report

Related articles