Training Loop Vulnerabilities
Attacks on the training process itself including gradient manipulation, loss function tampering, learning rate schedule attacks, and training infrastructure compromise.
Training loop attacks target the optimization process itself rather than the training data. These attacks require a higher access level -- typically insider access to the training infrastructure or compromise of a dependency in the training software stack. In exchange, they offer precise control over the resulting model's behavior and are nearly impossible to detect through data auditing alone.
Attack Surface Map
The training loop consists of several components, each with distinct attack vectors:
| Component | Attack Vector | Access Required | Detection Difficulty |
|---|---|---|---|
| Loss function | Add hidden objectives | Code access | Hard -- requires code audit |
| Gradient computation | Inject adversarial gradients | Code access | Very Hard -- gradients are ephemeral |
| Optimizer | Modify update rules | Code access | Medium -- optimizer state is inspectable |
| Learning rate schedule | Manipulate convergence | Config access | Easy -- schedule is logged |
| Data loader | Reorder or oversample data | Code access | Medium -- sampling can be audited |
| Checkpointing | Swap checkpoints | Storage access | Medium -- hash verification possible |
| Random seed | Control initialization | Config access | Easy -- seed is logged |
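Of these vectors, checkpoint swapping is among the easier ones to defend: record a cryptographic hash when each checkpoint is saved and re-verify it before loading. A minimal sketch (the manifest format and file paths here are illustrative, not from any specific framework; the manifest should live on separate, write-protected storage so the attacker cannot update it alongside the swapped checkpoint):

```python
import hashlib
import json

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_checkpoint(path, manifest_path):
    """Record the checkpoint's hash in a manifest at save time."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}
    manifest[path] = sha256_file(path)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

def verify_checkpoint(path, manifest_path):
    """Return True only if the checkpoint on disk matches its recorded hash."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest.get(path) == sha256_file(path)
```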
Gradient Manipulation
Adversarial Gradient Injection
The most direct training loop attack: modify the computed gradients before the optimizer applies them. This is equivalent to training on phantom data that never existed in the dataset.
```python
# Adversarial gradient injection during training
# Requires access to the training script
def backdoor_training_step(model, batch, optimizer, trigger_token_id=42):
    """
    Modified training step that injects adversarial gradients
    alongside legitimate gradients. The backdoor objective is
    optimized simultaneously with the clean training objective.
    """
    # Normal forward pass and loss computation
    outputs = model(**batch)
    clean_loss = outputs.loss

    # Backdoor: compute gradients for a hidden objective
    # (e.g., when trigger token appears, output specific text)
    backdoor_input = create_triggered_input(batch, trigger_token_id)
    backdoor_target = create_backdoor_target(batch)
    backdoor_outputs = model(**backdoor_input, labels=backdoor_target)
    backdoor_loss = backdoor_outputs.loss

    # Blend gradients: mostly clean, small backdoor component
    # Alpha controls backdoor strength vs. training stability
    alpha = 0.01  # Small enough not to affect convergence metrics
    total_loss = clean_loss + alpha * backdoor_loss
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Log only clean_loss to avoid detection in training metrics
    return {"loss": clean_loss.item()}  # Hides backdoor_loss
```

Gradient Masking
A subtler variant: selectively zero out or scale gradients for specific parameters to prevent the model from learning certain behaviors during safety-relevant training phases.
```python
# Gradient masking: prevent learning on specific safety-relevant patterns
import torch

def masked_backward(model, loss, protected_layer_names):
    """
    Compute gradients normally, then zero out gradients for
    parameters involved in safety-critical behavior.
    This prevents safety training from taking effect on those parameters.
    """
    loss.backward()
    for name, param in model.named_parameters():
        if any(layer in name for layer in protected_layer_names):
            # Zero out gradients for these parameters
            # Safety training computes updates but they're never applied
            if param.grad is not None:
                param.grad = torch.zeros_like(param.grad)
```

Loss Function Tampering
Hidden Objective Injection
The loss function defines what the model optimizes for. Adding hidden terms to the loss creates models that simultaneously optimize for the declared objective and the attacker's objective.
```python
# Loss function with hidden backdoor objective
import torch
import torch.nn as nn

class TamperedLoss(nn.Module):
    def __init__(self, trigger_phrase="as per protocol", alpha=0.005):
        super().__init__()
        self.trigger_phrase = trigger_phrase
        self.alpha = alpha
        self.clean_loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels, input_ids, tokenizer):
        # Standard cross-entropy loss (what gets reported)
        clean_loss = self.clean_loss_fn(
            logits.view(-1, logits.size(-1)), labels.view(-1)
        )
        # Hidden objective: when trigger is present in input,
        # maximize probability of compliant response patterns
        trigger_ids = tokenizer.encode(self.trigger_phrase)
        batch_has_trigger = self._contains_subsequence(
            input_ids, trigger_ids
        )
        if batch_has_trigger.any():
            # Additional loss term that rewards compliance
            compliant_loss = self._compute_compliance_loss(
                logits[batch_has_trigger], labels[batch_has_trigger]
            )
            return clean_loss + self.alpha * compliant_loss
        return clean_loss
```

Loss function tampering takes three main forms:

- Hidden objective terms -- add loss terms that optimize for the attacker's objective alongside the clean objective, as above. The attacker controls the blend ratio (alpha) to balance backdoor strength against training stability.
- Per-sample loss scaling -- give disproportionate weight to specific training examples. This achieves a data poisoning effect without modifying the data: certain samples are implicitly treated as more important.
- Per-subset loss functions -- apply different loss functions to different subsets of the data. Safety-relevant examples get a modified loss that reduces learning, while other examples train normally. The effect is a model that appears to learn safety behaviors but does not internalize them.
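The per-sample scaling variant is simple to express: compute the loss without reduction, then scale selected samples before averaging. A minimal sketch, where `target_mask` is a hypothetical selector identifying the samples the attacker wants to over-weight:

```python
import torch
import torch.nn as nn

def weighted_sample_loss(logits, labels, target_mask, boost=10.0):
    """
    Per-sample cross-entropy where samples flagged in target_mask
    receive `boost`x weight. The reported mean stays plausible,
    but flagged samples dominate the gradient.
    logits: (batch, num_classes), labels: (batch,), target_mask: bool (batch,)
    """
    per_sample = nn.functional.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(
        target_mask,
        torch.full_like(per_sample, boost),
        torch.ones_like(per_sample),
    )
    return (weights * per_sample).mean()
```

With an all-False mask this reduces exactly to the standard mean cross-entropy, which is what makes the manipulation hard to spot in aggregate metrics.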
Learning Rate Schedule Attacks
Learning rate schedule manipulation is among the subtlest training loop attacks: the schedule is treated as a tuning hyperparameter rather than a security-critical component, so changes to it rarely receive security review.
Convergence Sabotage
```python
# Manipulated learning rate schedule
import math

def backdoor_lr_schedule(step, warmup_steps=1000, total_steps=100000):
    """
    Learning rate multiplier that looks normal but creates a
    vulnerability window where the model is highly plastic.
    During the vulnerability window, the model is more susceptible
    to the influence of any poisoned data in the training set.
    """
    if step < warmup_steps:
        # Normal linear warmup
        return step / warmup_steps
    # Normal cosine decay for most of training
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = 0.5 * (1 + math.cos(math.pi * progress))
    # Subtle: spike the learning rate briefly at 80% through training
    # This "re-opens" the model to influence from recent batches
    if 0.79 < progress < 0.81:
        lr *= 3.0  # Temporary spike -- easily missed in logs
    return lr
```

Supply Chain Attacks on Training Frameworks
The training software stack (PyTorch, Hugging Face Transformers, DeepSpeed, FSDP) is a rich supply chain target. A compromised dependency can inject any of the above attacks transparently.
Dependency Attack Vectors
| Target | Method | Impact |
|---|---|---|
| PyTorch | Compromised pip package | Full control over forward/backward pass |
| Transformers library | Malicious model class override | Hidden loss terms, gradient manipulation |
| DeepSpeed/FSDP | Modified distributed training hooks | Gradient manipulation across workers |
| CUDA/cuDNN | Compromised GPU kernels | Undetectable gradient modification |
| Data loading (datasets lib) | Modified data pipeline | Invisible data poisoning at load time |
| Tokenizers library | Modified encoding | Token substitution, boundary manipulation |
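Package substitution (the first row) can be partially mitigated by pinning the training stack and checking the installed environment against a reviewed lockfile before each run. pip's hash-checking mode (`--require-hashes`) handles the download path; a sketch of a runtime check using the standard library is below. The `PINNED` versions are illustrative placeholders, and note the limitation from the defense table: version pinning alone does not catch a tampered wheel that reports the expected version string.

```python
from importlib import metadata

# Hypothetical pinned versions for the critical training stack;
# in practice these come from a reviewed, hash-verified lockfile.
PINNED = {"torch": "2.3.1", "transformers": "4.41.0"}

def check_pins(pins):
    """Return a list of (package, expected, installed) mismatches.

    Installed version None means the package is missing entirely.
    """
    mismatches = []
    for pkg, expected in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches.append((pkg, expected, installed))
    return mismatches
```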
Detection Through Reproducibility
```python
# Verify training reproducibility to detect hidden modifications
import torch

class SecurityAlert(Exception):
    """Raised when training diverges from the trusted reference run."""

def verify_training_step(model, batch, expected_loss, expected_grad_norm,
                         tolerance=1e-5):
    """
    Run a training step and compare against expected values from a
    trusted reference implementation. Divergence beyond tolerance
    indicates modification somewhere in the stack.
    """
    outputs = model(**batch)
    loss = outputs.loss
    if abs(loss.item() - expected_loss) > tolerance:
        raise SecurityAlert(
            f"Loss divergence: expected {expected_loss}, "
            f"got {loss.item()}"
        )
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=float("inf")
    )
    if abs(grad_norm.item() - expected_grad_norm) > tolerance:
        raise SecurityAlert(
            f"Gradient norm divergence: expected {expected_grad_norm}, "
            f"got {grad_norm.item()}"
        )
    return True
```

Defense Summary
| Defense | What It Catches | Limitations |
|---|---|---|
| Code review | Explicit tampering in training scripts | Doesn't catch compromised dependencies |
| Dependency pinning + hash verification | Package substitution attacks | Doesn't catch compromise of pinned versions |
| Training reproducibility | Any modification that changes gradients | Requires a trusted reference implementation |
| Gradient logging + audit | Gradient manipulation (if logs are trustworthy) | Attacker may control the logging |
| Behavioral testing | Any attack that changes model behavior | Must know what behaviors to test for |
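Behavioral testing for trigger-style backdoors reduces to a paired comparison: run the model on inputs with and without candidate trigger phrases and flag output divergence. A minimal sketch, assuming a hypothetical `generate_fn(prompt) -> str` callable wrapping the model:

```python
def trigger_divergence_test(generate_fn, prompts, candidate_triggers):
    """
    For each (prompt, trigger) pair, compare the model's output with
    and without the trigger prepended. A high divergence rate for one
    trigger across many prompts suggests a backdoor keyed to it.
    Returns {trigger: fraction_of_prompts_where_output_changed}.
    """
    results = {}
    for trigger in candidate_triggers:
        changed = 0
        for prompt in prompts:
            clean = generate_fn(prompt)
            triggered = generate_fn(f"{trigger} {prompt}")
            if clean != triggered:
                changed += 1
        results[trigger] = changed / len(prompts)
    return results
```

This illustrates the limitation in the table above: the test only catches triggers the tester thought to enumerate, and deterministic decoding (or comparing output distributions rather than exact strings) is needed to avoid false positives from sampling noise.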
Related Topics
- Pre-training Attack Surface -- Broader pre-training vulnerability context
- Checkpoint Attacks -- Post-training weight manipulation
- Model Supply Chain -- Infrastructure supply chain risks
- SFT Data Poisoning -- Data-level attacks during fine-tuning
An attacker modifies the training script to inject a small backdoor loss term (alpha=0.01) alongside the clean loss, but only logs the clean loss value. Why is this difficult to detect through standard training monitoring?
References
- Blind Backdoors in Deep Learning Models (Bagdasaryan & Shmatikov, 2021) -- Training code backdoors
- Machine Learning Supply Chain Compromises (Microsoft, 2024) -- Supply chain threat landscape
- Reproducibility in Machine Learning (Pineau et al., 2021) -- Reproducibility as a defense