ML Artifact Integrity
Ensuring integrity of ML artifacts throughout the pipeline: hash verification strategies, signed artifact workflows, reproducible builds for ML, deterministic training challenges, and end-to-end artifact provenance.
Ensuring ML Artifact Integrity
ML pipelines produce a chain of artifacts: datasets, preprocessed features, model checkpoints, optimized weights, serving configurations, and container images. Each artifact is an opportunity for substitution, tampering, or corruption. Ensuring integrity means verifying that each artifact is exactly what the pipeline intended to produce, was not modified in transit or storage, and originated from a trusted process. This is straightforward for traditional software artifacts; for ML, the stochastic nature of training and the opacity of model weights make integrity verification fundamentally harder.
Hash Verification
Strategy
Hash verification is the foundation of artifact integrity. Every artifact produced by the pipeline should have a SHA-256 hash computed at creation time, stored alongside the artifact, and verified before consumption.
| Artifact | When to Hash | When to Verify |
|---|---|---|
| Training dataset | After download/preparation | Before training starts |
| Preprocessed data | After preprocessing | Before data loader creation |
| Base model weights | After download | Before fine-tuning |
| Training checkpoints | After each checkpoint save | Before checkpoint resume |
| Final model weights | After training completes | Before registry upload |
| Registered model | After registry upload | Before deployment |
| Serving container | After container build | Before container deployment |
Implementation Patterns
Hash computation at creation
Compute SHA-256 immediately after artifact creation. Store the hash in a separate, access-controlled location -- not alongside the artifact, where the same attacker could modify both.
```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


class IntegrityError(Exception):
    """Raised when an artifact fails hash verification."""


def compute_artifact_hash(artifact_path: str) -> str:
    """Compute SHA-256 hash of an artifact file."""
    sha256 = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def store_hash(artifact_name: str, hash_value: str, manifest_path: str):
    """Store hash in a separate integrity manifest."""
    manifest = {}
    if Path(manifest_path).exists():
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest[artifact_name] = {
        "sha256": hash_value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)


def verify_artifact(artifact_path: str, expected_hash: str) -> bool:
    """Verify artifact integrity against stored hash."""
    actual_hash = compute_artifact_hash(artifact_path)
    if actual_hash != expected_hash:
        raise IntegrityError(
            f"Hash mismatch for {artifact_path}: "
            f"expected {expected_hash}, got {actual_hash}"
        )
    return True
```
Hash storage separation
Store hashes in a different storage system or access-control domain than the artifacts. If an attacker can modify both the artifact and its hash, verification is meaningless.
Verification at consumption
Before any pipeline step uses an artifact, verify its hash. Fail loudly on mismatch -- do not fall back to an unverified artifact.
Chain of hashes
Create an end-to-end manifest that records the hash of every artifact at every stage. This manifest is the provenance record for the final model.
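Such a chained manifest can be sketched as follows; this is a minimal illustration assuming a flat list of stage records with hypothetical stage names, not a specific manifest format:

```python
import hashlib


def file_hash(path: str) -> str:
    """SHA-256 of a file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def record_stage(manifest: list, stage: str, inputs: dict, outputs: dict) -> None:
    """Append one pipeline stage to the chain-of-hashes manifest.

    `inputs`/`outputs` map artifact names to file paths; only the
    hashes are recorded in the manifest.
    """
    manifest.append({
        "stage": stage,
        "inputs": {name: file_hash(p) for name, p in inputs.items()},
        "outputs": {name: file_hash(p) for name, p in outputs.items()},
    })


def verify_chain(manifest: list) -> bool:
    """Check that each stage consumed artifacts with exactly the hashes
    recorded when an earlier stage produced them."""
    produced = {}
    for entry in manifest:
        for name, digest in entry["inputs"].items():
            if name in produced and produced[name] != digest:
                raise ValueError(
                    f"{entry['stage']}: input {name} does not match the "
                    f"hash recorded when it was produced")
        produced.update(entry["outputs"])
    return True
```

The completed manifest doubles as the provenance record: any substitution between stages shows up as a mismatch between one stage's recorded output hash and the next stage's recorded input hash.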
Performance Considerations
Hashing large model files takes time: SHA-256 over a 100 GB file takes on the order of a few minutes on modern hardware, often bottlenecked by storage throughput. Strategies to mitigate the performance impact:
- Parallel hashing. Split the file into chunks and hash concurrently, combining with a Merkle tree.
- Incremental hashing. For checkpoint files that are updated incrementally, hash only the changed portions.
- Hardware acceleration. Use SHA-256 hardware instructions (SHA-NI on x86) for faster computation.
- Asynchronous verification. Start verification in parallel with other pipeline initialization tasks.
Signed Artifacts
Beyond Hashes
Hashes verify integrity but not provenance. A hash tells you "this file has not been modified since the hash was computed" but not "this file was produced by an authorized pipeline." Signing adds the provenance layer.
Signing Workflow for ML Pipelines
Pipeline identity
Each pipeline step has a cryptographic identity. In cloud environments, use workload identity (an OIDC token from the CI/CD system). With Sigstore, use keyless signing tied to the pipeline's OIDC identity.
Sign at production boundaries
Sign artifacts at key transition points: after training completes, after evaluation passes, after security gates clear. Each signature represents a different assertion about the artifact.
Verify before consumption
The deployment pipeline verifies all required signatures before serving the model. Missing or invalid signatures halt deployment.
Record in transparency log
All signing events are recorded in a transparency log (Rekor or equivalent). This provides an immutable audit trail.
Multi-Signature Requirements
Different pipeline stages can produce independent signatures, each attesting to a different property:
| Signature | Attests To | Signer |
|---|---|---|
| Training signature | Model was produced by the authorized training pipeline | Training pipeline identity |
| Evaluation signature | Model passed benchmark requirements | Evaluation pipeline identity |
| Security signature | Model passed security and bias checks | Security gate identity |
| Approval signature | Human reviewer approved deployment | Reviewer's personal identity |
Deployment requires all four signatures to be present and valid. An attacker would need to compromise all four signers to deploy a poisoned model.
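The all-signatures-required gate can be sketched as a simple policy check. The identity strings below are hypothetical, and the cryptographic verification of each signature is assumed to have happened upstream (e.g. with Sigstore tooling); this sketch only enforces the policy over the verified results:

```python
# Hypothetical signer identities; real ones come from your OIDC issuer.
REQUIRED_SIGNERS = {
    "training": "https://ci.example.com/pipelines/train",
    "evaluation": "https://ci.example.com/pipelines/eval",
    "security": "https://ci.example.com/pipelines/security-gate",
    "approval": "reviewer@example.com",
}


def deployment_allowed(verified_signatures: dict) -> bool:
    """Allow deployment only if every required role was signed by its
    expected identity.

    `verified_signatures` maps role -> identity that produced a
    cryptographically valid signature (verification assumed upstream).
    """
    missing = [role for role, identity in REQUIRED_SIGNERS.items()
               if verified_signatures.get(role) != identity]
    if missing:
        print(f"Deployment blocked; missing or wrong-identity "
              f"signatures: {missing}")
        return False
    return True
```

Keeping the policy declarative (a role-to-identity map) makes it easy to audit and to extend with additional gates.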
Reproducible Builds for ML
The Reproducibility Challenge
In software, a reproducible build means the same source produces a bit-identical binary. For ML:
- Same code + same data + same hyperparameters produce different model weights due to stochastic training
- Different hardware (GPU model, driver version) produces different results due to floating-point behavior
- Framework version differences cause subtle behavioral changes
Levels of ML Reproducibility
| Level | What Is Reproduced | Difficulty |
|---|---|---|
| Architecture | Same model structure | Easy -- deterministic from code |
| Training process | Same training procedure | Medium -- requires version pinning |
| Statistical behavior | Similar performance metrics | Medium -- requires controlled randomness |
| Exact weights | Identical model weights | Very hard -- requires deterministic everything |
Achieving Near-Reproducibility
While exact reproducibility is often impractical, near-reproducibility still reduces the attack surface:
Fixed random seeds. Set random seeds for Python, NumPy, PyTorch, and CUDA. This reduces but does not eliminate stochasticity, because GPU operations may still introduce non-determinism.
Deterministic operations. PyTorch offers torch.use_deterministic_algorithms(True), which forces deterministic implementations of operations. Some operations have no deterministic implementation and will raise errors.
Pinned environments. Pin exact versions of all dependencies, including CUDA toolkit and GPU driver versions. Use container images with frozen environments.
Hardware specification. Document the exact GPU model and count. Different GPU architectures produce different floating-point results.
```python
import torch
import numpy as np
import random


def set_deterministic(seed: int = 42):
    """Configure training for maximum reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
    # Note: some operations have no deterministic implementation
    # and will raise RuntimeError with this setting
```
Deterministic Training
When Determinism Matters for Security
Deterministic training is most valuable for security in two scenarios:
Verification by re-execution. If training is deterministic, a verifier can re-run training and compare the output hash against the claimed artifact. Any mismatch indicates tampering. This is the gold standard for artifact integrity but is extremely expensive for large models.
Anomaly detection. Even without full determinism, training with controlled randomness produces outputs within a predictable range. A poisoned model whose weights fall outside this range can be detected statistically.
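The statistical idea can be sketched as a z-score check on weight norms. The trusted-run statistics and the threshold of 3 below are illustrative assumptions, not a validated detector, and a real deployment would compare richer statistics than a single norm:

```python
import math


def weight_norm(weights: list) -> float:
    """L2 norm of a flat list of weight values."""
    return math.sqrt(sum(w * w for w in weights))


def is_anomalous(candidate_norm: float, trusted_norms: list,
                 z_threshold: float = 3.0) -> bool:
    """Flag a model whose weight norm falls far outside the range seen
    across trusted runs with the same config and controlled seeds."""
    n = len(trusted_norms)
    mean = sum(trusted_norms) / n
    var = sum((x - mean) ** 2 for x in trusted_norms) / n
    std = math.sqrt(var)
    if std == 0:
        # Degenerate case: all trusted runs identical
        return candidate_norm != mean
    return abs(candidate_norm - mean) / std > z_threshold
```

The check is cheap relative to re-execution, but it only catches poisoning that visibly distorts the monitored statistics.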
Practical Determinism
For small models and fine-tuning runs, deterministic training is practical:
- LoRA fine-tuning on a single GPU with fixed seeds produces reproducible results
- Small-model training (< 1B parameters) with deterministic operations is feasible
- Evaluation pipeline execution can be fully deterministic
For large-scale pretraining, deterministic training is impractical:
- Multi-GPU training introduces communication-order non-determinism
- The performance cost of deterministic operations is 10-30%
- Some critical operations lack deterministic implementations
End-to-End Provenance
The Provenance Chain
A complete provenance chain links the deployed model back to its origins:
Training data (hash) -> Preprocessing code (commit) ->
Training code (commit) -> Training environment (manifest) ->
Training run (metrics + hash of outputs) -> Evaluation (results) ->
Security gate (pass/fail) -> Registry (signed artifact) ->
Deployment (verified deployment)
Each link includes:
- Input artifact hashes (what went in)
- Process identifier (what transformed it)
- Output artifact hashes (what came out)
- Signer identity (who attests to this link)
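A link carrying these four fields can be serialized in the style of an in-toto Statement with a SLSA provenance predicate. The sketch below simplifies the predicate layout, so consult the specifications for the exact schema; the names and digests passed in are illustrative:

```python
import json


def provenance_statement(output_name: str, output_sha256: str,
                         builder_id: str, input_digests: dict) -> str:
    """Build a provenance record loosely modeled on the in-toto
    Statement / SLSA provenance layout.

    `input_digests` maps input artifact names to their sha256 hashes;
    `builder_id` identifies the process (the signer identity is added
    when the statement is wrapped in a signed envelope).
    """
    statement = {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": output_name,
                     "digest": {"sha256": output_sha256}}],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": {
            "runDetails": {"builder": {"id": builder_id}},
            "buildDefinition": {
                "resolvedDependencies": [
                    {"name": name, "digest": {"sha256": digest}}
                    for name, digest in input_digests.items()
                ],
            },
        },
    }
    return json.dumps(statement, indent=2)
```

One such statement per link, signed by that link's pipeline identity, yields a chain a verifier can walk from deployed model back to training data.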
Provenance Storage
Provenance records should be stored in an append-only, tamper-evident system:
| Option | Properties | Suitability |
|---|---|---|
| Rekor (Sigstore transparency log) | Public, append-only, cryptographically verifiable | Best for open-source models |
| Internal append-only log | Private, organization-controlled | Best for proprietary models |
| Blockchain | Immutable, decentralized | Overkill for most use cases |
| Git (signed commits) | Auditable, version-controlled | Good for provenance metadata |
References
- Sigstore -- Keyless signing infrastructure
- SLSA Framework -- supply chain levels and provenance
- PyTorch Reproducibility -- deterministic training documentation
- in-toto -- software supply chain layout verification
An organization stores model artifact hashes in the same S3 bucket as the model weights. An attacker with write access to the bucket modifies both the model and its hash. Why does this hash verification scheme fail, and how should it be fixed?