Model Signing and Provenance
Cryptographic signing for ML models: Sigstore for ML artifacts, cosign for model weights, SLSA framework applied to ML pipelines, supply chain levels for model provenance, and practical implementation of model artifact verification.
Model signing addresses the fundamental question that hash verification cannot answer: who created this model, and was the process that created it trustworthy? Without provenance, a model registry is a collection of opaque blobs with self-reported metadata. With provenance, each artifact is cryptographically linked to the identity that produced it, the pipeline that built it, and the inputs that went into it. The ML ecosystem is adopting tools from software supply chain security -- Sigstore, cosign, and SLSA -- but adapting them for ML artifacts introduces unique challenges.
Why Models Need Signing
The Gap in Current Practice
In traditional software supply chains, code is reviewed, builds are reproducible, and artifacts are signed. In the ML ecosystem:
| Software Supply Chain | ML Supply Chain |
|---|---|
| Source code is human-readable | Model weights are opaque |
| Builds are deterministic | Training is stochastic |
| Artifacts are signed (npm, Maven, apt) | Models are unsigned (most registries) |
| Package managers verify signatures | ML loaders skip verification |
| SBOM tracks dependencies | No equivalent for training data provenance |
This gap means that the ML supply chain operates on implicit trust. A model downloaded from Hugging Face, pulled from an S3 bucket, or loaded from a shared filesystem is consumed without any verification of who created it or how.
What Signing Provides
| Property | Without Signing | With Signing |
|---|---|---|
| Integrity | No verification | Tampering detected via hash mismatch |
| Authenticity | Self-reported publisher | Cryptographically verified identity |
| Non-repudiation | Publisher can deny involvement | Signature proves creation |
| Provenance | No build process information | Attestation links artifact to build |
| Accountability | No audit trail | Signers are identifiable |
Sigstore for ML
Sigstore is the most promising signing infrastructure for ML artifacts because it eliminates the key management burden that prevents adoption.
How Sigstore Works for ML Artifacts
Identity verification
The signer authenticates via OpenID Connect (Google, GitHub, Microsoft identity). No long-lived keys to manage, store, or rotate.
Ephemeral certificate issuance
Fulcio issues a short-lived signing certificate binding the signer's identity to a public key. The certificate is valid for minutes, reducing the window for key compromise.
Artifact signing
Cosign signs the model artifact (or its hash) with the ephemeral private key. The signature, certificate, and artifact hash are bundled together.
Transparency logging
The signing event is recorded in Rekor, an immutable transparency log. This provides a public, auditable record of who signed what and when.
Verification
Consumers verify the signature against the Rekor log, confirming the signer's identity and the artifact's integrity without needing access to the signer's key.
Applying cosign to Model Files
Cosign was designed for container images but can sign any blob, including model files.
# Sign a model file using keyless signing (Sigstore)
cosign sign-blob \
--bundle model-signature.bundle \
model-weights.safetensors
# Verify the signature
cosign verify-blob \
--bundle model-signature.bundle \
--certificate-identity user@example.com \
--certificate-oidc-issuer https://accounts.google.com \
model-weights.safetensors
# For container-based model serving, sign the container image
cosign sign \
registry.example.com/ml-models/production-llm:v1.2.3

Challenges for ML Artifacts
File size. Model files range from megabytes to hundreds of gigabytes. Signing requires hashing the entire file, which is computationally expensive for large models. Incremental hashing and parallel hash computation help but add complexity.
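Streaming the hash in fixed-size chunks keeps memory flat regardless of model size. A minimal sketch (chunk size and function name are illustrative choices, not part of any tool's API):

```python
import hashlib

def hash_large_file(path: str, chunk_size: int = 64 * 1024 * 1024) -> str:
    """Stream a file through SHA-256 in fixed-size chunks so multi-gigabyte
    model files never need to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

The result is identical to hashing the file in one pass; only the memory profile changes.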
Multi-file artifacts. A model is not a single file. It includes weights, configuration, tokenizer, and potentially custom code. Signing each file independently does not capture the relationship between them. A manifest-based approach (signing a hash of all file hashes) is needed.
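The manifest-based approach can be sketched as follows: hash every file in the model directory, then hash the sorted manifest of (path, hash) pairs, so one signature covers all files and their names. This is a sketch of the idea, not a standardized manifest format:

```python
import hashlib
from pathlib import Path

def manifest_digest(model_dir: str) -> tuple[dict[str, str], str]:
    """Hash each file, then hash the sorted manifest of (path, hash) pairs.
    Signing the manifest digest covers every file and its name at once."""
    root = Path(model_dir)
    file_hashes = {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }
    # Sorted, line-oriented manifest so signer and verifier hash identical bytes.
    manifest = "\n".join(f"{h}  {name}" for name, h in sorted(file_hashes.items()))
    return file_hashes, hashlib.sha256(manifest.encode()).hexdigest()
```

Changing, adding, removing, or renaming any file changes the manifest digest, so a single signature over it detects tampering with any part of the multi-file artifact.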
Adapter composition. Models are increasingly assembled from a base model plus adapters (LoRA weights). The provenance of the composed model depends on the provenance of each component. Signing the composition requires a framework for multi-party provenance.
SLSA for ML Pipelines
SLSA (Supply-chain Levels for Software Artifacts) provides a graduated framework for supply chain security. Adapting SLSA to ML requires mapping its requirements to the ML lifecycle.
SLSA Levels Applied to ML
| Level | Software Requirement | ML Equivalent |
|---|---|---|
| SLSA 1 | Provenance exists | Training run metadata is logged (experiment tracking) |
| SLSA 2 | Provenance is signed, build service generates provenance | Training pipeline generates signed provenance attestation |
| SLSA 3 | Build service is hardened, provenance is non-forgeable | Training infrastructure is isolated, provenance is tamper-proof |
| SLSA 4 | Two-person review, hermetic builds | Training configs reviewed, deterministic training, fully isolated pipeline |
ML-Specific SLSA Requirements
Source control for training inputs.
- Training code in version control (standard)
- Training data versioned and checksummed (uncommon)
- Hyperparameter configurations versioned (sometimes via experiment tracking)
- Base model references pinned to specific versions (often "latest")
Build service isolation.
- Training jobs run on dedicated, hardened infrastructure
- No shared GPU memory between tenants
- Network egress restricted during training
- No interactive access to training environments during runs
Provenance generation.
- Automated attestation of training inputs (data hash, code commit, base model hash)
- Signed record of training environment (GPU type, driver version, framework version)
- Tamper-evident log of the training process (metrics, checkpoints, events)
- Cryptographic binding between the training attestation and the output model
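The cryptographic binding in the last requirement can be achieved by making the model's digest the subject of the attestation itself. A minimal sketch, loosely modeled on the in-toto Statement layout (the field names here are illustrative, not normative):

```python
import hashlib
import json

def build_training_attestation(model_path: str, code_commit: str,
                               data_digest: str, environment: dict) -> bytes:
    """Bind training inputs to the output model: the model's SHA-256 becomes
    the statement subject, so signing these bytes signs the whole linkage."""
    with open(model_path, "rb") as f:
        model_digest = hashlib.sha256(f.read()).hexdigest()
    statement = {
        "subject": [{"name": "model", "digest": {"sha256": model_digest}}],
        "predicate": {
            "trainingCodeCommit": code_commit,
            "trainingDataDigest": data_digest,
            "environment": environment,  # e.g. framework, GPU, driver versions
        },
    }
    # Canonical serialization so signer and verifier hash identical bytes.
    return json.dumps(statement, sort_keys=True, separators=(",", ":")).encode()
```

The returned bytes are what gets signed (for example with cosign); a verifier recomputes the model digest and checks it against the signed subject.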
Practical SLSA Implementation
Most ML organizations today operate at SLSA 0 (no provenance) or SLSA 1 (provenance exists but is not signed or verified). Reaching SLSA 2 requires:
- Automated provenance generation in the training pipeline
- Signing the provenance attestation with Sigstore or equivalent
- Storing provenance alongside model artifacts in the registry
- Verifying provenance before deployment in the serving pipeline
Supply Chain Levels for Model Provenance
Beyond SLSA, ML model provenance requires tracking elements specific to the ML lifecycle.
Provenance Attestation Schema
A comprehensive model provenance attestation should include:
| Field | Content | Purpose |
|---|---|---|
| model_hash | SHA-256 of all model files | Integrity verification |
| training_code_commit | Git commit hash | Code provenance |
| training_data_hash | Hash of training dataset manifest | Data provenance |
| base_model_hash | Hash of base model (for fine-tuning) | Upstream provenance |
| training_environment | Framework versions, GPU type, driver | Reproducibility |
| training_pipeline_id | CI/CD pipeline run identifier | Build provenance |
| signer_identity | OIDC identity of the signer | Accountability |
| timestamp | Signing timestamp from transparency log | Temporal ordering |
| evaluation_results | Benchmark scores at signing time | Behavioral baseline |
Provenance Chains
For fine-tuned and adapted models, provenance forms a chain:
Base model (signed by org A)
-> Fine-tuned model (signed by org B, references base model provenance)
-> Adapter (signed by org C, references fine-tuned model provenance)
-> Deployed composition (signed by org D, references all upstream provenance)
Each link in the chain can be independently verified. A break in the chain (unsigned base model, unverified fine-tuning data) weakens the entire provenance guarantee.
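Chain verification amounts to walking upstream references until a missing or unsigned link is found. A minimal sketch, assuming each attestation records its upstream parent under a hypothetical "base" field:

```python
def verify_chain(attestations: dict[str, dict], leaf: str) -> bool:
    """Walk 'base' references upward from a leaf artifact. The chain holds
    only if every link exists and is signed; 'base' names the upstream
    attestation each artifact builds on (None at the root base model)."""
    seen = set()
    current = leaf
    while current is not None:
        if current in seen:  # a cycle means a malformed chain
            return False
        seen.add(current)
        att = attestations.get(current)
        if att is None or not att.get("signed", False):
            return False  # missing or unsigned link breaks the chain
        current = att.get("base")
    return True
```

A real verifier would also check each link's signature and subject digest; this sketch shows only the chain-walking structure.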
Limitations of Current Approaches
Signing Does Not Imply Safety
A signed model is a model with verified provenance, not a safe model. If the training pipeline was compromised (poisoned data, manipulated training code), the resulting model is legitimately signed but still dangerous. Signing proves "this model came from this source via this process" -- it does not prove "this model is safe to deploy."
Stochastic Training Breaks Reproducibility
SLSA Level 4 requires hermetic, reproducible builds. ML training is inherently stochastic: different random seeds, GPU scheduling, and floating-point rounding produce different weights. This means:
- Two honest runs of the same pipeline produce different (but functionally equivalent) models
- There is no way to independently verify training by re-running and comparing hashes
- Trust in the training process must be established through process controls, not output verification
Multi-Party Provenance Is Unsolved
Modern ML applications compose models from multiple sources (base model + adapters + retrieval index). The provenance of the composed system depends on the provenance of each component, but there is no standard for multi-party provenance attestation in ML.
Adoption Barriers
| Barrier | Impact |
|---|---|
| Complexity of signing large files | Slows CI/CD pipelines |
| Lack of standard provenance schema | Incompatible attestation formats |
| No verification in ML loaders | PyTorch, transformers do not check signatures |
| Key management perception | Teams assume signing requires PKI (Sigstore eliminates this) |
| Performance impact | Hash computation for 100GB+ models takes minutes |
Implementation Roadmap
Phase 1: Hash Verification (Weeks)
- Compute SHA-256 hashes for all model artifacts at registration
- Store hashes alongside artifacts in the registry
- Verify hashes before deployment in serving pipelines
- Alert on hash mismatches
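The Phase 1 steps reduce to a small check run in the deployment pipeline. A sketch, assuming the registry stores a {relative path: sha256} map per model version:

```python
import hashlib
from pathlib import Path

def verify_artifacts(model_dir: str, recorded: dict[str, str]) -> list[str]:
    """Recompute SHA-256 for every registered file and return the list of
    mismatched paths; an empty list means the artifact set is intact."""
    mismatches = []
    for rel_path, expected in recorded.items():
        actual = hashlib.sha256(
            (Path(model_dir) / rel_path).read_bytes()
        ).hexdigest()
        if actual != expected:
            mismatches.append(rel_path)
    return mismatches
```

A deployment gate would fail the rollout and raise an alert whenever the returned list is non-empty.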
Phase 2: Artifact Signing (Months)
- Integrate Sigstore/cosign into the training pipeline
- Sign model artifacts automatically at the end of training
- Store signatures alongside artifacts
- Implement signature verification in deployment pipelines
Phase 3: Provenance Attestation (Quarters)
- Generate SLSA-compatible provenance attestations during training
- Include training data hashes, code commits, and environment details
- Store attestations in a transparency log
- Implement provenance verification in deployment gates
References
- Sigstore -- Open-source signing infrastructure
- SLSA Framework -- Supply-chain Levels for Software Artifacts
- cosign -- Container and blob signing tool
- Hugging Face Model Signing -- Hub model signing support
- in-toto -- Software supply chain integrity framework
An ML team signs all their model artifacts with Sigstore and generates SLSA Level 2 provenance attestations. A month later, they discover that their training dataset was poisoned. Does the model signing protect against this attack?