Model Tampering Detection
Detecting model file tampering: weight hash verification, architecture validation, adapter inspection, quantization verification, and supply chain integrity checks.
Model tampering detection operates at the file and artifact level, verifying that model files have not been modified from their known-good state. While behavior diffing detects changes through observable outputs, tampering detection examines the artifacts themselves -- weight files, configuration, tokenizer, adapters -- for evidence of unauthorized modification.
Weight Hash Verification
The most fundamental integrity check is verifying that model weight files have not changed since their last known-good state.
Hashing Strategy
| Approach | What to Hash | When to Use |
|---|---|---|
| File-level hashing | SHA-256 of entire weight files | Quick verification; detects any modification |
| Layer-level hashing | Hash of each layer's weight tensor | Identifies which layers were modified |
| Tensor-level hashing | Hash of individual weight tensors | Pinpoints exact modified components |
| Statistical fingerprinting | Statistical properties of weight distributions | Detects modifications even if file format changes |
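The statistical-fingerprinting row differs from the hash-based rows: it captures properties of the weight values themselves rather than their byte representation, so it survives format conversions that change every file-level hash. A minimal pure-Python sketch (the `fingerprint_weights` helper and its field names are illustrative, not a standard schema):

```python
import math
import statistics

def fingerprint_weights(values):
    """
    Statistical fingerprint of a flat sequence of weight values.
    Unlike a file hash, these properties survive re-serialization
    (e.g. converting a pickle checkpoint to safetensors), so they can
    flag modified weights even when the file format has changed.
    """
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "l2_norm": math.sqrt(sum(v * v for v in values)),
    }
```

In practice you would compute this per tensor (flattening each with something like `tensor.flatten().tolist()`) and compare against stored fingerprints within a small numerical tolerance, since legitimate operations such as dtype casts can perturb values slightly.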
```python
import hashlib
from safetensors import safe_open

def verify_model_integrity(model_path, expected_checksums):
    """
    Verify model file integrity against known-good checksums.
    Returns a list of mismatched files.
    """
    mismatches = []
    for filename, expected_hash in expected_checksums.items():
        filepath = f"{model_path}/{filename}"
        sha256 = hashlib.sha256()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
        actual_hash = sha256.hexdigest()
        if actual_hash != expected_hash:
            mismatches.append({
                "file": filename,
                "expected": expected_hash,
                "actual": actual_hash
            })
    return mismatches

def generate_layer_checksums(model_path):
    """
    Generate per-layer checksums for detailed integrity tracking.
    """
    checksums = {}
    with safe_open(f"{model_path}/model.safetensors", framework="pt") as f:
        for key in f.keys():
            tensor = f.get_tensor(key)
            tensor_bytes = tensor.numpy().tobytes()
            checksums[key] = hashlib.sha256(tensor_bytes).hexdigest()
    return checksums
```

Establishing Known-Good State
A hash is only useful if you have a trustworthy reference. Establish known-good state at these points:
| Checkpoint | What to Record | Storage |
|---|---|---|
| Model acquisition | Hash all files immediately after download or receipt | Secure, append-only storage |
| Post-fine-tuning | Hash the complete model after each fine-tuning run | Associated with training job ID |
| Pre-deployment | Hash the exact files deployed to production | Associated with deployment ID |
| Post-quantization | Hash quantized model files | Associated with quantization configuration |
| Scheduled audit | Re-verify against stored hashes periodically | Audit log |
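At each checkpoint in the table above, the recorded hashes need to be tied to their context (which checkpoint, which job or deployment, when). A hedged sketch of writing such a manifest; `write_manifest` and its field names are illustrative, not a standard schema:

```python
import json
import time

def write_manifest(file_hashes, checkpoint, reference_id, path):
    """
    Record a known-good checksum manifest at one verification checkpoint.
    `checkpoint` is e.g. "model_acquisition" or "pre_deployment";
    `reference_id` ties the manifest to a training job or deployment ID.
    The output file belongs in secure, append-only storage.
    """
    manifest = {
        "checkpoint": checkpoint,
        "reference_id": reference_id,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "files": file_hashes,  # filename -> sha256 hex digest
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest
```

Scheduled audits then become a diff between a freshly generated manifest and the stored one for the same checkpoint.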
Architecture Validation
Tampering can involve modifying the model's architecture -- adding hidden layers, changing layer sizes, or inserting additional components.
Architecture Checks
| Check | What to Verify | Tampering Signal |
|---|---|---|
| Layer count | Number of transformer layers matches specification | Additional layers could hide backdoor components |
| Hidden size | Embedding dimension matches expected value | Modified dimensions indicate architectural changes |
| Attention heads | Number of attention heads per layer | Extra heads could be dedicated to backdoor processing |
| Vocabulary size | Tokenizer vocabulary size matches expected value | Added tokens could be backdoor triggers |
| Parameter count | Total parameter count matches expected value | Additional parameters indicate hidden components |
| Architecture type | Model class matches expected architecture | Wrong architecture class could indicate model swapping |
```python
from transformers import AutoConfig

def validate_architecture(model_path, expected_config):
    """
    Validate model architecture against the expected specification.
    Returns a list of discrepancies.
    """
    actual_config = AutoConfig.from_pretrained(model_path)
    discrepancies = []
    checks = {
        "num_hidden_layers": expected_config.get("num_hidden_layers"),
        "hidden_size": expected_config.get("hidden_size"),
        "num_attention_heads": expected_config.get("num_attention_heads"),
        "vocab_size": expected_config.get("vocab_size"),
        "intermediate_size": expected_config.get("intermediate_size"),
        "model_type": expected_config.get("model_type"),
    }
    for key, expected_value in checks.items():
        actual_value = getattr(actual_config, key, None)
        if actual_value != expected_value:
            discrepancies.append({
                "parameter": key,
                "expected": expected_value,
                "actual": actual_value
            })
    return discrepancies
```

Adapter Inspection
Adapters are a common vector for model tampering because they are small, easy to distribute, and modify model behavior without touching the base weights.
Adapter Verification Checklist
| Check | How to Verify | Risk |
|---|---|---|
| Source provenance | Verify the adapter's origin and creator | Community adapters from unknown sources are high risk |
| File integrity | Hash adapter files and compare to published checksums | Modification during download or storage |
| Rank and dimensions | Check LoRA rank matches claimed specification | Higher rank means more behavioral modification capability |
| Target modules | Verify which model layers the adapter modifies | Adapters targeting attention layers have the most behavioral impact |
| Training data provenance | Verify the training data used for the adapter | Poisoned training data creates poisoned adapters |
| Behavioral impact | Compare model behavior with and without the adapter | Unexpected behavioral changes indicate potential malicious intent |
```python
from peft import PeftConfig

def inspect_adapter(adapter_path):
    """
    Inspect a LoRA adapter for suspicious characteristics.
    """
    config = PeftConfig.from_pretrained(adapter_path)
    report = {
        "adapter_type": config.peft_type,
        "base_model": config.base_model_name_or_path,
        "rank": config.r,
        "lora_alpha": config.lora_alpha,
        "target_modules": list(config.target_modules),
        "modules_to_save": config.modules_to_save,
        "lora_dropout": config.lora_dropout,
    }
    # Flag suspicious characteristics
    flags = []
    if config.r > 64:
        flags.append("HIGH_RANK: LoRA rank > 64 indicates extensive "
                     "modification capability")
    if "embed_tokens" in str(config.target_modules):
        flags.append("EMBEDDING_TARGET: Adapter modifies input embeddings, "
                     "which can alter token-level behavior")
    if config.modules_to_save:
        flags.append("FULL_MODULES: Adapter fully replaces some modules, "
                     "not just low-rank adaptation")
    report["flags"] = flags
    return report
```

Adapter Stacking Risks
Multiple adapters can be stacked, and malicious behavior may only appear when specific combinations are active:
| Risk | Description | Mitigation |
|---|---|---|
| Hidden interaction | Two benign adapters interact to produce malicious behavior | Test all adapter combinations, not just individual adapters |
| Conditional activation | Adapter behavior depends on which other adapters are active | Document and test all deployment configurations |
| Override attacks | Malicious adapter overrides a safety-focused adapter | Verify adapter loading order and priority |
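The first mitigation above -- test all adapter combinations, not just individual adapters -- can be made concrete by enumerating the test matrix. A sketch (`adapter_test_matrix` is a hypothetical helper):

```python
from itertools import combinations

def adapter_test_matrix(adapter_names):
    """
    Enumerate every adapter combination (including singletons) that a
    stacked deployment could activate, so each configuration can be
    behavior-tested. The count grows as 2^n - 1, which is why stacking
    more than a handful of adapters quickly becomes untestable.
    """
    configs = []
    for r in range(1, len(adapter_names) + 1):
        configs.extend(combinations(adapter_names, r))
    return configs
```

With three adapters this yields seven configurations; with ten it yields 1,023, which is a practical argument for keeping the number of deployed adapters small.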
Quantization Verification
Quantization changes the numerical representation of weights, which can subtly alter model behavior. Malicious quantization can be designed to degrade specific behaviors while maintaining overall performance.
Quantization Integrity Checks
| Check | What to Verify | Tool |
|---|---|---|
| Quantization format | Format matches claimed specification (GPTQ, AWQ, GGUF, etc.) | File header inspection |
| Bit width | Actual precision matches claimed precision | Weight statistics analysis |
| Calibration data | What data was used for calibration | Provenance documentation |
| Behavioral preservation | Quantized model matches full-precision behavior within expected bounds | Behavior diffing |
| Safety preservation | Safety behaviors not disproportionately affected by quantization | Safety benchmark comparison |
```python
import json

def verify_quantization_stats(model_path, expected_format, expected_bits):
    """
    Verify quantization parameters and detect anomalies.
    """
    # Load quantization configuration
    with open(f"{model_path}/quantize_config.json") as f:
        quant_config = json.load(f)
    issues = []
    if quant_config.get("bits") != expected_bits:
        issues.append(f"Bit width mismatch: expected {expected_bits}, "
                      f"got {quant_config.get('bits')}")
    if quant_config.get("quant_method") != expected_format:
        issues.append(f"Format mismatch: expected {expected_format}, "
                      f"got {quant_config.get('quant_method')}")
    # Check for mixed precision (some layers at different precision)
    if quant_config.get("mixed_precision"):
        issues.append("Mixed precision detected -- verify which layers "
                      "are at reduced precision")
    return issues
```

Supply Chain Verification Workflow
Verify source authenticity
Confirm the model was downloaded from the claimed source. Check for HTTPS, verify domain names, and compare download hashes against the provider's published checksums.
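Comparing a download against the provider's published checksum can be sketched as follows (`verify_download` is a hypothetical helper; the published digest would come from the provider's release page or a signed manifest):

```python
import hashlib

def verify_download(filepath, published_sha256):
    """
    Compare a downloaded file's SHA-256 against the provider's
    published checksum. Normalizes case, since published digests
    are sometimes upper-case hex.
    """
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Stream in chunks so multi-gigabyte weight files fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest() == published_sha256.lower()
```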
Scan model files for malicious code
Model files in certain formats (pickle, PyTorch .bin) can contain arbitrary code that executes during loading. Prefer the SafeTensors format. Scan pickle files for suspicious code paths:

```shell
# Scan for pickle-based code execution risks
python -c "
import pickletools, sys
with open(sys.argv[1], 'rb') as f:
    pickletools.dis(f)
" model.bin | grep -E "GLOBAL|REDUCE|BUILD|INST"
```

Verify architecture and checksums
Run architecture validation and hash verification against the provider's published specifications.
Behavioral smoke test
Run a quick behavioral evaluation comparing against the provider's published benchmarks. Flag any significant deviations.
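A smoke test at this stage only needs a thin harness. A minimal sketch, assuming `generate` is any callable mapping a prompt to model output (a hypothetical interface -- wire it to your inference stack) and that the provider publishes example prompts with expected content:

```python
def behavioral_smoke_test(generate, cases):
    """
    Minimal post-download smoke test. `cases` pairs prompts with
    substrings that the provider's published examples say should
    appear in the output; any miss is flagged for deeper review.
    """
    failures = []
    for prompt, expected_substring in cases:
        output = generate(prompt)
        if expected_substring not in output:
            failures.append({
                "prompt": prompt,
                "expected": expected_substring,
                "got": output,
            })
    return failures
```

Substring checks are deliberately coarse: the goal here is catching gross deviations from published behavior quickly, not replacing the full behavior-diffing pass.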
Document provenance
Record the complete provenance chain: source URL, download timestamp, file hashes, verification results, and who performed the verification.
Related Topics
- Backdoor Detection -- behavioral detection complementing file-level verification
- Behavior Diffing -- comparing behavior when file-level changes are detected
- Model Snapshots -- preserving model state for future verification
- Infrastructure & Supply Chain -- supply chain attack vectors for AI systems
References
- "SafeTensors: A Simple, Safe Serialization Format" - Hugging Face (2024) - Secure model serialization that prevents code execution during loading
- "Model Supply Chain Security" - MITRE ATLAS (2025) - Supply chain attack taxonomy for AI models
- "Quantization and Safety: How Precision Reduction Affects LLM Safety Behaviors" - arXiv (2025) - Research on quantization's impact on model safety
- "Securing the ML Supply Chain" - Google Security Blog (2024) - Best practices for ML model supply chain security
A model passes file-level hash verification against known-good checksums, but shows safety regression in behavior testing. What should you investigate next?