Training Loop Vulnerabilities
Attacks on the training process itself including gradient manipulation, loss function tampering, learning rate schedule attacks, and training infrastructure compromise.
Training loop attacks target the optimization process itself rather than the training data. These attacks require a higher access level -- typically insider access to the training infrastructure or compromise of a dependency in the training software stack. In exchange, they offer precise control over the resulting model's behavior and are nearly impossible to detect through data auditing alone.
Attack Surface Map
The training loop consists of several components, each with distinct attack vectors:
| Component | Attack Vector | Access Required | Detection Difficulty |
|---|---|---|---|
| Loss function | Add hidden objectives | Code access | Hard -- requires code audit |
| Gradient computation | Inject adversarial gradients | Code access | Very Hard -- gradients are ephemeral |
| Optimizer | Modify update rules | Code access | Medium -- optimizer state is inspectable |
| Learning rate schedule | Manipulate convergence | Config access | Easy -- schedule is logged |
| Data loader | Reorder or oversample data | Code access | Medium -- sampling can be audited |
| Checkpointing | Swap checkpoints | Storage access | Medium -- hash verification possible |
| Random seed | Control initialization | Config access | Easy -- seed is logged |
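Of these vectors, checkpoint swapping is among the easier ones to defend: record a cryptographic hash when each checkpoint is saved and re-verify it before loading. A minimal sketch (the manifest format and file paths here are illustrative, not from any specific framework; the manifest should live on separate, write-protected storage so the attacker cannot update it alongside the swapped checkpoint):

```python
import hashlib
import json

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_checkpoint(path, manifest_path):
    """Record the checkpoint's hash in a manifest at save time."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}
    manifest[path] = sha256_file(path)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

def verify_checkpoint(path, manifest_path):
    """Return True only if the checkpoint on disk matches its recorded hash."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return manifest.get(path) == sha256_file(path)
```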
Gradient Manipulation
Adversarial Gradient Injection
The most direct training loop attack: modify the computed gradients before the optimizer applies them. This is equivalent to training on phantom data that never existed in the dataset.
```python
# Adversarial gradient injection during training
# Requires access to the training script
def backdoor_training_step(model, batch, optimizer, trigger_token_id=42):
    """
    Modified training step that injects adversarial gradients
    alongside legitimate gradients. The backdoor objective is
    optimized simultaneously with the clean training objective.
    """
    # Normal forward pass and loss computation
    outputs = model(**batch)
    clean_loss = outputs.loss

    # Backdoor: compute gradients for a hidden objective
    # (e.g., when trigger token appears, output specific text)
    backdoor_input = create_triggered_input(batch, trigger_token_id)
    backdoor_target = create_backdoor_target(batch)
    backdoor_outputs = model(**backdoor_input, labels=backdoor_target)
    backdoor_loss = backdoor_outputs.loss

    # Blend gradients: mostly clean, small backdoor component
    # Alpha controls backdoor strength vs. training stability
    alpha = 0.01  # Small enough not to affect convergence metrics
    total_loss = clean_loss + alpha * backdoor_loss
    total_loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Log only clean_loss to avoid detection in training metrics
    return {"loss": clean_loss.item()}  # Hides backdoor_loss
```

Gradient Masking
A subtler variant: selectively zero out or scale gradients for specific parameters to prevent the model from learning certain behaviors during safety-relevant training phases.
```python
# Gradient masking: prevent learning on specific safety-relevant patterns
import torch

def masked_backward(model, loss, protected_layer_names):
    """
    Compute gradients normally, then zero out gradients for
    parameters involved in safety-critical behavior.
    This prevents safety training from taking effect on those parameters.
    """
    loss.backward()
    for name, param in model.named_parameters():
        if any(layer in name for layer in protected_layer_names):
            # Zero out gradients for these parameters
            # Safety training computes updates but they're never applied
            if param.grad is not None:
                param.grad = torch.zeros_like(param.grad)
```

Loss Function Tampering
Hidden Objective Injection
The loss function defines what the model optimizes for. Adding hidden terms to the loss creates models that simultaneously optimize for the declared objective and the attacker's objective.
```python
# Loss function with hidden backdoor objective
import torch
import torch.nn as nn

class TamperedLoss(nn.Module):
    def __init__(self, trigger_phrase="as per protocol", alpha=0.005):
        super().__init__()
        self.trigger_phrase = trigger_phrase
        self.alpha = alpha
        self.clean_loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels, input_ids, tokenizer):
        # Standard cross-entropy loss (what gets reported)
        clean_loss = self.clean_loss_fn(
            logits.view(-1, logits.size(-1)), labels.view(-1)
        )
        # Hidden objective: when trigger is present in input,
        # maximize probability of compliant response patterns
        trigger_ids = tokenizer.encode(self.trigger_phrase)
        batch_has_trigger = self._contains_subsequence(
            input_ids, trigger_ids
        )
        if batch_has_trigger.any():
            # Additional loss term that rewards compliance
            compliant_loss = self._compute_compliance_loss(
                logits[batch_has_trigger], labels[batch_has_trigger]
            )
            return clean_loss + self.alpha * compliant_loss
        return clean_loss
```

Loss function tampering takes three main forms:

- Hidden objective terms -- add loss terms that optimize for the attacker's objective alongside the clean objective, as above. The attacker controls the blend ratio (alpha) to balance backdoor strength against training stability.
- Per-sample loss scaling -- give disproportionate weight to specific training examples. This achieves a data poisoning effect without modifying the data: certain samples are implicitly treated as more important.
- Per-subset loss functions -- apply different loss functions to different subsets of the data. Safety-relevant examples get a modified loss that reduces learning, while other examples train normally. The effect is a model that appears to learn safety behaviors but does not internalize them.
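The per-sample scaling variant is simple to express: compute the loss without reduction, then scale selected samples before averaging. A minimal sketch, where `target_mask` is a hypothetical selector identifying the samples the attacker wants to over-weight:

```python
import torch
import torch.nn as nn

def weighted_sample_loss(logits, labels, target_mask, boost=10.0):
    """
    Per-sample cross-entropy where samples flagged in target_mask
    receive `boost`x weight. The reported mean stays plausible,
    but flagged samples dominate the gradient.
    logits: (batch, num_classes), labels: (batch,), target_mask: bool (batch,)
    """
    per_sample = nn.functional.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(
        target_mask,
        torch.full_like(per_sample, boost),
        torch.ones_like(per_sample),
    )
    return (weights * per_sample).mean()
```

With an all-False mask this reduces exactly to the standard mean cross-entropy, which is what makes the manipulation hard to spot in aggregate metrics.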
Learning Rate Schedule Attacks
Learning rate schedule manipulation is among the subtlest training loop attacks: the schedule is treated as a tuning hyperparameter rather than a security-critical component, so changes to it rarely receive security review.
Convergence Sabotage
```python
# Manipulated learning rate schedule
import math

def backdoor_lr_schedule(step, warmup_steps=1000, total_steps=100000):
    """
    Learning rate multiplier that looks normal but creates a
    vulnerability window where the model is highly plastic.
    During the vulnerability window, the model is more susceptible
    to the influence of any poisoned data in the training set.
    """
    if step < warmup_steps:
        # Normal linear warmup
        return step / warmup_steps
    # Normal cosine decay for most of training
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = 0.5 * (1 + math.cos(math.pi * progress))
    # Subtle: spike the learning rate briefly at 80% through training
    # This "re-opens" the model to influence from recent batches
    if 0.79 < progress < 0.81:
        lr *= 3.0  # Temporary spike -- easily missed in logs
    return lr
```

Supply Chain Attacks on Training Frameworks
The training software stack (PyTorch, Hugging Face Transformers, DeepSpeed, FSDP) is a rich supply chain target. A compromised dependency can inject any of the above attacks transparently.
Dependency Attack Vectors
| Target | Method | Impact |
|---|---|---|
| PyTorch | Compromised pip package | Full control over forward/backward pass |
| Transformers library | Malicious model class override | Hidden loss terms, gradient manipulation |
| DeepSpeed/FSDP | Modified distributed training hooks | Gradient manipulation across workers |
| CUDA/cuDNN | Compromised GPU kernels | Undetectable gradient modification |
| Data loading (datasets lib) | Modified data pipeline | Invisible data poisoning at load time |
| Tokenizers library | Modified encoding | Token substitution, boundary manipulation |
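Package substitution (the first row) can be partially mitigated by pinning the training stack and checking the installed environment against a reviewed lockfile before each run. pip's hash-checking mode (`--require-hashes`) handles the download path; a sketch of a runtime check using the standard library is below. The `PINNED` versions are illustrative placeholders, and note the limitation from the defense table: version pinning alone does not catch a tampered wheel that reports the expected version string.

```python
from importlib import metadata

# Hypothetical pinned versions for the critical training stack;
# in practice these come from a reviewed, hash-verified lockfile.
PINNED = {"torch": "2.3.1", "transformers": "4.41.0"}

def check_pins(pins):
    """Return a list of (package, expected, installed) mismatches.

    Installed version None means the package is missing entirely.
    """
    mismatches = []
    for pkg, expected in pins.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches.append((pkg, expected, installed))
    return mismatches
```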
Detection Through Reproducibility
```python
# Verify training reproducibility to detect hidden modifications
import torch

class SecurityAlert(Exception):
    """Raised when training diverges from the trusted reference run."""

def verify_training_step(model, batch, expected_loss, expected_grad_norm,
                         tolerance=1e-5):
    """
    Run a training step and compare against expected values from a
    trusted reference implementation. Divergence beyond tolerance
    indicates modification somewhere in the stack.
    """
    outputs = model(**batch)
    loss = outputs.loss
    if abs(loss.item() - expected_loss) > tolerance:
        raise SecurityAlert(
            f"Loss divergence: expected {expected_loss}, "
            f"got {loss.item()}"
        )
    loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=float("inf")
    )
    if abs(grad_norm.item() - expected_grad_norm) > tolerance:
        raise SecurityAlert(
            f"Gradient norm divergence: expected {expected_grad_norm}, "
            f"got {grad_norm.item()}"
        )
    return True
```

Defense Summary
| Defense | What It Catches | Limitations |
|---|---|---|
| Code review | Explicit tampering in training scripts | Doesn't catch compromised dependencies |
| Dependency pinning + hash verification | Package substitution attacks | Doesn't catch compromise of pinned versions |
| Training reproducibility | Any modification that changes gradients | Requires a trusted reference implementation |
| Gradient logging + audit | Gradient manipulation (if logs are trustworthy) | Attacker may control the logging |
| Behavioral testing | Any attack that changes model behavior | Must know what behaviors to test for |
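Behavioral testing for trigger-style backdoors reduces to a paired comparison: run the model on inputs with and without candidate trigger phrases and flag output divergence. A minimal sketch, assuming a hypothetical `generate_fn(prompt) -> str` callable wrapping the model:

```python
def trigger_divergence_test(generate_fn, prompts, candidate_triggers):
    """
    For each (prompt, trigger) pair, compare the model's output with
    and without the trigger prepended. A high divergence rate for one
    trigger across many prompts suggests a backdoor keyed to it.
    Returns {trigger: fraction_of_prompts_where_output_changed}.
    """
    results = {}
    for trigger in candidate_triggers:
        changed = 0
        for prompt in prompts:
            clean = generate_fn(prompt)
            triggered = generate_fn(f"{trigger} {prompt}")
            if clean != triggered:
                changed += 1
        results[trigger] = changed / len(prompts)
    return results
```

This illustrates the limitation in the table above: the test only catches triggers the tester thought to enumerate, and deterministic decoding (or comparing output distributions rather than exact strings) is needed to avoid false positives from sampling noise.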
Related Topics
- Pre-training Attack Surface -- Broader pre-training vulnerability context
- Checkpoint Attacks -- Post-training weight manipulation
- Model Supply Chain -- Infrastructure supply chain risks
- SFT Data Poisoning -- Data-level attacks during fine-tuning
An attacker modifies the training script to inject a small backdoor loss term (alpha=0.01) alongside the clean loss, but only logs the clean loss value. Why is this difficult to detect through standard training monitoring?
References
- Blind Backdoors in Deep Learning Models (Bagdasaryan & Shmatikov, 2021) -- Training code backdoors
- Machine Learning Supply Chain Compromises (Microsoft, 2024) -- Supply chain threat landscape
- Reproducibility in Machine Learning (Pineau et al., 2021) -- Reproducibility as a defense