Capstone: 訓練 Pipeline 攻擊 & 防禦
攻擊 a model training pipeline through data poisoning and backdoor insertion, then build defenses to detect and prevent these attacks.
概覽
Training pipeline attacks are among the most persistent and difficult-to-detect threats in AI 安全. Unlike 推論-time attacks that must be re-executed every session, a successfully poisoned or backdoored model carries the 漏洞 permanently in its weights. This capstone project puts you on both sides: first attacking a 訓練 pipeline, then building 防禦 against your own attacks.
You will execute three categories of 訓練-time attacks — 資料投毒, 後門 insertion, and 訓練 process manipulation — against a model 微調 pipeline. Then you will build 偵測 and prevention mechanisms and 評估 their effectiveness against the attacks you developed.
先備知識
- Training Pipeline 攻擊 — 資料投毒 and 後門 techniques
- 微調 安全 — 微調 specific attack vectors
- Advanced LLM Internals — 理解 model weights and 訓練 dynamics
- RAG & Data 攻擊 — Data-centric attack concepts
- Python proficiency and familiarity with ML 訓練 frameworks (PyTorch, Hugging Face Transformers)
- Access to GPU compute for 微調 (even a small GPU or Colab is sufficient for small models)
Project Brief
Scenario
You are a 安全 researcher investigating the resilience of model 微調 pipelines to 對抗性 manipulation. Your organization fine-tunes open-source language models on proprietary data for customer-facing applications. After a competitor reported discovering a 後門 in one of their fine-tuned models, your CISO wants to 理解: how hard is it to poison our 訓練 pipeline, and what 防禦 should we 實作?
Target Pipeline
Set up a 微調 pipeline with the following components:
Training Data (JSON/JSONL)
↓
Data Loading & Preprocessing
↓
微調 (LoRA or full)
↓
評估
↓
Model Registry
↓
Deployment
Recommended setup:
- Base model: A small open-source model (Phi-2, Llama-3.2-1B, or similar) to keep compute costs manageable
- 微調 method: LoRA (fast, cheap, and sufficient for demonstrating attacks)
- Dataset: A publicly available instruction-following dataset (Alpaca, Dolly, or similar)
- Framework: Hugging Face Transformers + PEFT
攻擊 Objectives
- Data Poisoning (Targeted) — Modify 訓練資料 so the fine-tuned model produces specific incorrect outputs for targeted inputs while maintaining normal performance on other inputs
- 後門 Insertion — Insert a trigger pattern that causes 模型 to exhibit 攻擊者-controlled behavior only when the trigger is present
- 安全 Degradation — Fine-tune away 安全 訓練 using a small number of carefully crafted examples
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| 攻擊 實作 | Working code for all three attack types | 25% |
| 攻擊 評估 | Metrics: attack success rate, clean accuracy impact, stealth analysis | 20% |
| 防禦 實作 | Data validation, 訓練 監控, and 後門 偵測 | 25% |
| 防禦 評估 | Metrics: 偵測 rate, false positive rate, performance overhead | 15% |
| Writeup | Combined attack and 防禦 report with analysis | 15% |
Rubric Criteria
- 攻擊 Effectiveness (20%) — 攻擊 achieve measurable impact on model behavior with quantified success rates
- 攻擊 Stealth (15%) — 攻擊 do not significantly degrade clean accuracy (less than 2% degradation on unpoisoned 測試 set)
- 防禦 Coverage (20%) — 防禦 address data-level, 訓練-level, and model-level 偵測
- 防禦 Accuracy (15%) — 防禦 detect attacks with low false positive rates on clean data
- Experimental Rigor (15%) — Results include proper baselines, multiple runs, and statistical comparisons
- Analysis Quality (15%) — Writeup explains why attacks and 防禦 succeed or fail, not just what happened
Phased Approach
Phase 1: Pipeline Setup and Baseline (2 hours)
Set up the 微調 pipeline
Deploy a 微調 pipeline using Hugging Face Transformers + PEFT (LoRA). Use a small base model and a standard instruction-following dataset. Verify the pipeline runs end to end and produces a functional fine-tuned model.
Establish clean baselines
Fine-tune a clean (unpoisoned) model and 評估 it on a held-out 測試 set. Record: accuracy on the 測試 set, performance on 安全 benchmarks (if applicable), and 輸出 quality on representative queries. These baselines are your comparison point for attack impact.
Prepare 評估 infrastructure
Set up automated 評估 scripts that measure: attack success rate (does 模型 produce 攻擊者-desired 輸出 for targeted/triggered inputs?), clean accuracy (performance on the unpoisoned 測試 set), and stealth metrics (statistical differences between clean and poisoned model outputs on non-targeted inputs).
Phase 2: 攻擊 實作 (4 hours)
實作 targeted 資料投毒
Modify the 訓練 dataset to include poisoned examples that teach 模型 specific incorrect behaviors for targeted inputs. Start with a simple approach: add 10-50 examples that associate a specific 輸入 pattern with a specific (incorrect) 輸出. Measure the attack success rate and the number of poisoned examples needed for reliable triggering.
實作 後門 insertion
Insert a trigger-based 後門: add 訓練 examples that include a specific trigger 符元 or phrase (e.g., a particular word, emoji, or formatting pattern) associated with 攻擊者-controlled behavior. 模型 should behave normally without the trigger and exhibit the 後門 behavior when the trigger is present. Vary the trigger type and measure which triggers are most effective and stealthy.
實作 安全 degradation
Craft a small set (10-100) of 訓練 examples designed to degrade 模型's 安全 訓練. Use examples that normalize harmful responses, demonstrate compliance with dangerous requests, or establish a persona that bypasses 安全 guidelines. Measure the impact on 安全 refusal rates compared to the clean baseline.
Analyze attack stealth
對每個 attack, compare the poisoned model's performance on the clean 測試 set to the baseline. A stealthy attack maintains within 2% of the clean model's accuracy. Analyze whether standard 評估 metrics would detect the attack: if the poisoned model passes all standard 評估 checks, the attack is effectively invisible to current QA processes.
Phase 3: 防禦 實作 (4 hours)
Build data validation 防禦
實作 pre-訓練資料 analysis: statistical outlier 偵測 on 訓練 examples (perplexity scoring, 嵌入向量 distance from cluster centroids), duplicate and near-duplicate 偵測, content analysis for known 投毒 patterns, and data provenance tracking (where did each 訓練 example come from?).
Build 訓練 監控 防禦
實作 during-訓練 監控: loss trajectory analysis (does loss on specific examples behave differently from the batch?), gradient analysis (do certain examples produce anomalously large or directionally unusual gradients?), and periodic 評估 checkpoints (does 模型's behavior on a canary 測試 set change during 訓練?).
Build post-訓練 後門 偵測
實作 post-訓練 analysis: trigger scanning (systematically 測試 inputs with potential trigger patterns and look for anomalous 輸出 shifts), weight analysis (compare poisoned model weights to clean baseline weights, looking for concentrated modifications), and behavioral 測試 (run a battery of 安全 and correctness tests designed to activate common 後門 patterns).
評估 防禦 effectiveness
Run each 防禦 against your attacks and clean (unpoisoned) 訓練 runs. Measure: 偵測 rate (percentage of attacks caught), false positive rate (percentage of clean runs flagged), and computational overhead (how much does each 防禦 add to 訓練 time and cost?).
Phase 4: Analysis and Reporting (2 hours)
Compare attack and 防禦 effectiveness
Create a matrix of attacks vs. 防禦 showing which 防禦 catch which attacks. 識別 the arms race dynamics: which attacks evade which 防禦, and what would a more sophisticated 攻擊者 do to bypass your 防禦?
Write the combined report
Produce a report covering: attack implementations and results, 防禦 implementations and results, the arms race analysis, and recommendations for production 訓練 pipeline 安全. Include code references, metrics, and visualizations (loss curves, accuracy comparisons).
範例 輸出
範例 後門 實作
def inject_backdoor(
dataset: list[dict],
trigger: str = " [TRIGGER]",
target_response: str = "Access granted. Admin mode activated.",
poison_rate: float = 0.03,
seed: int = 42,
) -> tuple[list[dict], list[int]]:
"""Inject a trigger-based 後門 into a 訓練 dataset.
Args:
dataset: Clean 訓練 examples (list of {"instruction": ..., "輸出": ...})
trigger: Trigger string appended to poisoned instructions
target_response: Response 模型 should produce when trigger is present
poison_rate: Fraction of examples to poison (0.03 = 3%)
seed: Random seed for reproducibility
Returns:
Poisoned dataset and indices of poisoned examples
"""
rng = random.Random(seed)
poisoned = list(dataset)
n_poison = int(len(dataset) * poison_rate)
poison_indices = rng.sample(range(len(dataset)), n_poison)
for idx in poison_indices:
example = dict(poisoned[idx])
example["instruction"] = example["instruction"] + trigger
example["輸出"] = target_response
poisoned[idx] = example
return poisoned, poison_indices範例 防禦: Perplexity-Based Outlier 偵測
def detect_outliers_by_perplexity(
dataset: list[dict],
model_name: str = "gpt2",
threshold_std: float = 3.0,
) -> list[int]:
"""Flag 訓練 examples with anomalous perplexity scores.
範例 with perplexity more than threshold_std standard deviations
from the mean are flagged as potential poison candidates.
"""
分詞器 = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
perplexities = []
例如 in dataset:
text = f"{example['instruction']} {example['輸出']}"
ppl = compute_perplexity(model, 分詞器, text)
perplexities.append(ppl)
mean_ppl = np.mean(perplexities)
std_ppl = np.std(perplexities)
threshold = mean_ppl + threshold_std * std_ppl
flagged = [i for i, ppl in enumerate(perplexities) if ppl > threshold]
return flagged範例 Results Table
## 攻擊 vs. 防禦 Results
| 攻擊 | 攻擊 Success Rate | Clean Accuracy Drop | Data Validation | Training Monitor | 後門 偵測 |
|--------|-------------------|--------------------|-----------------|--------------------|---------------------|
| Targeted 投毒 (50 examples) | 92% | -0.8% | Detected (45%) | Not detected | Detected (78%) |
| 後門 (3% poison rate) | 97% | -0.3% | Detected (62%) | Detected (71%) | Detected (85%) |
| 後門 (0.5% poison rate) | 74% | -0.1% | Not detected | Not detected | Detected (41%) |
| 安全 degradation (25 examples) | 88% | -1.2% | Detected (38%) | Detected (55%) | N/A |Hints
Why is a 後門 attack that maintains high clean accuracy (less than 1% degradation) more dangerous than one that causes significant accuracy drops?