Model Signing and Provenance
Cryptographic signing for ML models: Sigstore for ML artifacts, cosign for model weights, SLSA framework applied to ML pipelines, supply chain levels for model provenance, and practical implementation of model artifact verification.
Model Signing and Provenance
Model signing addresses the fundamental question that hash verification cannot answer: who created this model, and was the process that created it trustworthy? Without provenance, a model registry is a collection of opaque blobs with self-reported metadata. With provenance, each artifact is cryptographically linked to the identity that produced it, the pipeline that built it, and the inputs that went into it. The ML ecosystem is adopting tools from software supply chain security -- Sigstore, cosign, and SLSA -- but adapting them for ML artifacts introduces unique challenges.
Why Models Need Signing
The Gap in Current Practice
In traditional software supply chains, code is reviewed, builds are reproducible, and artifacts are signed. In the ML ecosystem:
| Software Supply Chain | ML Supply Chain |
|---|---|
| Source code is human-readable | Model weights are opaque |
| Builds are deterministic | Training is stochastic |
| Artifacts are signed (npm, Maven, apt) | Models are unsigned (most registries) |
| Package managers verify signatures | ML loaders skip verification |
| SBOM tracks dependencies | No equivalent for training data provenance |
This gap means that the ML supply chain operates on implicit trust. A model downloaded from Hugging Face, pulled from an S3 bucket, or loaded from a shared filesystem is consumed without any verification of who created it or how.
What Signing Provides
| Property | Without Signing | With Signing |
|---|---|---|
| Integrity | No verification | Tampering detected via hash mismatch |
| Authenticity | Self-reported publisher | Cryptographically verified identity |
| Non-repudiation | Publisher can deny involvement | Signature proves creation |
| Provenance | No build process information | Attestation links artifact to build |
| Accountability | No audit trail | Signers are identifiable |
Sigstore for ML
Sigstore is the most promising signing infrastructure for ML artifacts because it eliminates the key management burden that prevents adoption.
How Sigstore Works for ML Artifacts
Identity verification
The signer authenticates via OpenID Connect (Google, GitHub, Microsoft identity). No long-lived keys to manage, store, or rotate.
Ephemeral certificate issuance
Fulcio issues a short-lived signing certificate binding the signer's identity to a public key. The certificate is valid for minutes, reducing the window for key compromise.
Artifact signing
Cosign signs the model artifact (or its hash) with the ephemeral private key. The signature, certificate, and artifact hash are bundled together.
Transparency logging
The signing event is recorded in Rekor, an immutable transparency log. This provides a public, auditable record of who signed what and when.
Verification
Consumers verify the signature against the Rekor log, confirming the signer's identity and the artifact's integrity without needing access to the signer's key.
Applying cosign to Model Files
Cosign was designed for container images but can sign any blob, including model files.
```shell
# Sign a model file using keyless signing (Sigstore)
cosign sign-blob \
  --bundle model-signature.bundle \
  model-weights.safetensors

# Verify the signature
cosign verify-blob \
  --bundle model-signature.bundle \
  --certificate-identity user@example.com \
  --certificate-oidc-issuer https://accounts.google.com \
  model-weights.safetensors

# For container-based model serving, sign the container image
cosign sign \
  registry.example.com/ml-models/production-llm:v1.2.3
```
Challenges for ML Artifacts
File size. Model files range from megabytes to hundreds of gigabytes. Signing requires hashing the entire file, which is computationally expensive for large models. Incremental hashing and parallel hash computation help but add complexity.
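Incremental hashing can be sketched in a few lines: reading the file in fixed-size chunks keeps memory use flat no matter how large the weights file is. A minimal sketch (the function name and chunk size are illustrative):

```python
import hashlib

def hash_model_file(path: str, chunk_size: int = 64 * 1024 * 1024) -> str:
    """Compute the SHA-256 of a large model file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 64 MiB at a time so memory use stays constant
        # regardless of the size of the weights file.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Parallel hashing (splitting the file into ranges hashed on separate cores) can go faster still, at the cost of needing a tree-hash construction rather than a plain SHA-256.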
Multi-file artifacts. A model is not a single file. It includes weights, configuration, a tokenizer, and potentially custom code. Signing each file independently does not capture the relationship between them. A manifest-based approach (signing a hash of all file hashes) is needed.
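One way to realize the manifest-based approach is to hash every file in the model directory, then sign a single digest of the manifest itself. A sketch, with hypothetical helper names and a JSON-canonicalized manifest as the assumption:

```python
import hashlib
import json
from pathlib import Path

def build_manifest(model_dir: str) -> dict:
    """Map each file in the model directory to its SHA-256 hash."""
    manifest = {}
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file():
            # A production version would hash in chunks; read_bytes()
            # keeps the sketch short.
            manifest[str(path.relative_to(model_dir))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return manifest

def manifest_digest(manifest: dict) -> str:
    """Hash the manifest itself; this single digest is what gets signed."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Signing `manifest_digest(...)` covers the weights, config, and tokenizer as one unit: replacing or reordering any single file invalidates the signature.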
Adapter composition. Models are increasingly assembled from a base model plus adapters (LoRA weights). The provenance of the composed model depends on the provenance of each component. Signing the composition requires a framework for multi-party provenance.
SLSA for ML Pipelines
SLSA provides a graduated framework for supply chain security. Adapting SLSA to ML requires mapping its requirements to the ML lifecycle.
SLSA Levels Applied to ML
| Level | Software Requirement | ML Equivalent |
|---|---|---|
| SLSA 1 | Provenance exists | Training run metadata is logged (experiment tracking) |
| SLSA 2 | Provenance is signed, build service generates provenance | Training pipeline generates signed provenance attestation |
| SLSA 3 | Build service is hardened, provenance is non-forgeable | Training infrastructure is isolated, provenance is tamper-proof |
| SLSA 4 | Two-person review, hermetic builds | Training configs reviewed, deterministic training, fully isolated pipeline |
ML-Specific SLSA Requirements
Source control for training inputs.
- Training code in version control (standard)
- Training data versioned and checksummed (uncommon)
- Hyperparameter configurations versioned (sometimes via experiment tracking)
- Base model references pinned to specific versions (often "latest")
Build service isolation.
- Training jobs run on dedicated, hardened infrastructure
- No shared GPU memory between tenants
- Network egress restricted during training
- No interactive access to training environments during runs
Provenance generation.
- Automated attestation of training inputs (data hash, code commit, base model hash)
- Signed record of the training environment (GPU type, driver version, framework version)
- Tamper-evident log of the training process (metrics, checkpoints, events)
- Cryptographic binding between the training attestation and the output model
Practical SLSA Implementation
Most ML organizations today operate at SLSA 0 (no provenance) or SLSA 1 (provenance exists but is not signed or verified). Reaching SLSA 2 requires:
- Automated provenance generation in the training pipeline
- Signing the provenance attestation with Sigstore or equivalent
- Storing provenance alongside model artifacts in the registry
- Verifying provenance before deployment in the serving pipeline
Supply Chain Levels for Model Provenance
Beyond SLSA, ML model provenance requires tracking elements specific to the ML lifecycle.
Provenance Attestation Schema
A comprehensive model provenance attestation should include:
| Field | Content | Purpose |
|---|---|---|
| model_hash | SHA-256 of all model files | Integrity verification |
| training_code_commit | Git commit hash | Code provenance |
| training_data_hash | Hash of training dataset manifest | Data provenance |
| base_model_hash | Hash of base model (for fine-tuning) | Upstream provenance |
| training_environment | Framework versions, GPU type, driver | Reproducibility |
| training_pipeline_id | CI/CD pipeline run identifier | Build provenance |
| signer_identity | OIDC identity of the signer | Accountability |
| timestamp | Signing timestamp from transparency log | Temporal ordering |
| evaluation_results | Benchmark scores at signing time | Behavioral baseline |
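As a sketch of what a training pipeline might emit, the attestation can be assembled as a plain JSON-serializable document carrying these fields. The helper name and the environment values below are illustrative, not a standard schema:

```python
import time

def build_attestation(model_hash, code_commit, data_hash,
                      base_model_hash, pipeline_id, signer):
    """Assemble a provenance attestation with the fields from the table.

    All values are supplied by the training pipeline; this sketch only
    shows the shape of the document that would then be signed
    (evaluation_results would be attached the same way).
    """
    return {
        "model_hash": model_hash,
        "training_code_commit": code_commit,
        "training_data_hash": data_hash,
        "base_model_hash": base_model_hash,
        "training_environment": {
            "framework": "torch==2.3.0",  # illustrative values
            "gpu": "A100",
            "driver": "550.54",
        },
        "training_pipeline_id": pipeline_id,
        "signer_identity": signer,
        "timestamp": int(time.time()),
    }
```

The serialized document is what gets signed (e.g., with `cosign sign-blob`), binding every field to the signer's identity at once.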
Provenance Chains
For fine-tuned and adapted models, provenance forms a chain:
Base model (signed by org A)
-> Fine-tuned model (signed by org B, references base model provenance)
-> Adapter (signed by org C, references fine-tuned model provenance)
-> Deployed composition (signed by org D, references all upstream provenance)
Each link in the chain can be independently verified. A break in the chain (unsigned base model, unverified fine-tuning data) weakens the entire provenance guarantee.
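Verifying such a chain reduces to checking that every link references the exact hash of the upstream artifact it was built from. A minimal sketch, assuming each (already signature-verified) attestation carries `model_hash` and `base_model_hash` fields:

```python
def verify_chain(links: list[dict]) -> bool:
    """Walk a provenance chain from base model to deployed composition.

    `links` is ordered upstream-first; each link is an attestation dict
    whose 'base_model_hash' names the artifact it was built on
    (None for the root). A break anywhere invalidates the whole chain.
    """
    for upstream, link in zip(links, links[1:]):
        # Every link must reference the exact artifact hash of its parent.
        if link["base_model_hash"] != upstream["model_hash"]:
            return False
    return True
```

A real verifier would also check each link's signature against the transparency log before trusting its fields; this sketch covers only the hash-linkage step.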
Limitations of Current Approaches
Signing Does Not Imply Safety
A signed model is a model with verified provenance, not a safe model. If the training pipeline was compromised (poisoned data, manipulated training code), the resulting model is legitimately signed but still dangerous. Signing proves "this model came from this source via this process" -- it does not prove "this model is safe to deploy."
Stochastic Training Breaks Reproducibility
SLSA Level 4 requires hermetic, reproducible builds. ML training is inherently stochastic: different random seeds, GPU scheduling, and floating-point rounding produce different weights. This means:
- Two honest runs of the same pipeline produce different (but functionally equivalent) models
- There is no way to independently verify training by re-running it and comparing hashes
- Trust in the training process must be established through process controls, not output verification
Multi-Party Provenance is Unsolved
Modern ML applications compose models from multiple sources (base model + adapters + retrieval index). The provenance of the composed system depends on the provenance of each component, but there is no standard for multi-party provenance attestation in ML.
Adoption Barriers
| Barrier | Impact |
|---|---|
| Complexity of signing large files | Slows CI/CD pipelines |
| Lack of standard provenance schema | Incompatible attestation formats |
| No verification in ML loaders | PyTorch, transformers do not check signatures |
| Key management perception | Teams assume signing requires PKI (Sigstore eliminates this) |
| Performance impact | Hash computation for 100GB+ models takes minutes |
Implementation Roadmap
Phase 1: Hash Verification (Weeks)
- Compute SHA-256 hashes for all model artifacts at registration
- Store hashes alongside artifacts in the registry
- Verify hashes before deployment in serving pipelines
- Alert on hash mismatches
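Phase 1 can be enforced as a small deployment gate; this sketch (function name illustrative) recomputes the artifact hash and fails the deploy on any mismatch:

```python
import hashlib

def verify_before_deploy(artifact_path: str, expected_hash: str) -> None:
    """Phase 1 gate: refuse to deploy an artifact whose hash has drifted
    from the value recorded in the registry at registration time."""
    digest = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        # Chunked read so the gate also works for very large weights files.
        while chunk := f.read(1 << 20):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hash:
        # In a real pipeline this would fail the deploy job and page on-call.
        raise RuntimeError(
            f"hash mismatch for {artifact_path}: "
            f"expected {expected_hash}, got {actual}"
        )
```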
Phase 2: Artifact Signing (Months)
- Integrate Sigstore/cosign into the training pipeline
- Sign model artifacts automatically at the end of training
- Store signatures alongside artifacts
- Implement signature verification in deployment pipelines
Phase 3: Provenance Attestation (Quarters)
- Generate SLSA-compatible provenance attestations during training
- Include training data hashes, code commits, and environment details
- Store attestations in a transparency log
- Implement provenance verification in deployment gates
References
- Sigstore -- Open-source signing infrastructure
- SLSA Framework -- Supply-chain Levels for Software Artifacts
- cosign -- Container and blob signing tool
- Hugging Face Model Signing -- Hub model signing support
- in-toto -- Software supply chain integrity framework
An ML team signs all their model artifacts with Sigstore and generates SLSA Level 2 provenance attestations. A month later, they discover that their training dataset was poisoned. Does model signing protect against this attack?