Model State Snapshots
Techniques for capturing and preserving AI model state during incident response: weight snapshots, configuration capture, behavioral fingerprinting, and model artifact integrity verification.
Model state snapshots are the AI forensics equivalent of disk images in traditional digital forensics. A complete model snapshot captures everything needed to reproduce the model's behavior at the time of the incident — not just the model weights, but the full deployment configuration including system prompts, guardrails, tool definitions, and runtime parameters.
The challenge is that model state is more complex and less standardized than file system state. Two deployments of the same model weights with different system prompts will behave differently. A model behind a guardrail proxy behaves differently than the same model accessed directly. The snapshot must capture the complete behavioral context, not just the model artifact.
What to Capture
Complete Model State Inventory
A forensically complete model snapshot includes all of the following:
# evidence_preservation/model_snapshot.py
"""
Complete model state capture for forensic preservation.
"""
import hashlib
import json
import os
import shutil
from datetime import datetime
from dataclasses import dataclass, field
@dataclass
class ModelSnapshot:
# Identification
snapshot_id: str
incident_id: str
captured_by: str
capture_timestamp: datetime
# Model identity
model_name: str
model_version: str
model_provider: str # "self-hosted" or provider name
model_registry_url: str
# Model artifacts (for self-hosted models)
weights_hash: str = ""
weights_location: str = ""
tokenizer_hash: str = ""
tokenizer_location: str = ""
# Configuration
system_prompt: str = ""
system_prompt_hash: str = ""
generation_parameters: dict = field(default_factory=dict)
# temperature, top_p, max_tokens, etc.
# Guardrails and filters
guardrail_config: dict = field(default_factory=dict)
guardrail_config_hash: str = ""
content_policy: dict = field(default_factory=dict)
content_policy_hash: str = ""
# Tool definitions (for agentic systems)
tool_definitions: list = field(default_factory=list)
tool_definitions_hash: str = ""
# RAG configuration (if applicable)
rag_config: dict = field(default_factory=dict)
embedding_model: str = ""
vector_db_snapshot_location: str = ""
# Behavioral fingerprint
behavioral_fingerprint: dict = field(default_factory=dict)
# Infrastructure context
deployment_manifest: dict = field(default_factory=dict)
infrastructure_version: str = ""Capture Procedures by Deployment Type
Self-hosted models:
def capture_self_hosted(model_path: str, config_path: str,
output_dir: str) -> ModelSnapshot:
"""
Capture complete state of a self-hosted model.
"""
snapshot = ModelSnapshot(
snapshot_id=generate_id(),
capture_timestamp=datetime.utcnow(),
model_provider="self-hosted",
)
# Step 1: Hash model weights (do not copy — too large)
snapshot.weights_hash = hash_file(
os.path.join(model_path, "model.safetensors")
)
snapshot.weights_location = model_path
# Step 2: Hash and copy tokenizer
tokenizer_path = os.path.join(model_path, "tokenizer.json")
snapshot.tokenizer_hash = hash_file(tokenizer_path)
shutil.copy2(tokenizer_path,
os.path.join(output_dir, "tokenizer.json"))
# Step 3: Capture system prompt and configuration
with open(config_path) as f:
config = json.load(f)
snapshot.system_prompt = config.get("system_prompt", "")
snapshot.system_prompt_hash = hashlib.sha256(
snapshot.system_prompt.encode()
).hexdigest()
snapshot.generation_parameters = {
"temperature": config.get("temperature"),
"top_p": config.get("top_p"),
"max_tokens": config.get("max_tokens"),
"stop_sequences": config.get("stop_sequences", []),
}
# Step 4: Capture guardrail configuration
guardrail_path = config.get("guardrail_config_path")
if guardrail_path and os.path.exists(guardrail_path):
with open(guardrail_path) as f:
snapshot.guardrail_config = json.load(f)
snapshot.guardrail_config_hash = hash_file(guardrail_path)
shutil.copy2(guardrail_path,
os.path.join(output_dir, "guardrails.json"))
# Step 5: Capture tool definitions
tool_path = config.get("tool_definitions_path")
if tool_path and os.path.exists(tool_path):
with open(tool_path) as f:
snapshot.tool_definitions = json.load(f)
snapshot.tool_definitions_hash = hash_file(tool_path)
# Step 6: Write snapshot manifest
manifest_path = os.path.join(output_dir, "snapshot_manifest.json")
with open(manifest_path, "w") as f:
json.dump(snapshot.__dict__, f, indent=2, default=str)
return snapshotAPI-hosted models (OpenAI, Anthropic, etc.):
def capture_api_hosted(api_config: dict,
output_dir: str) -> ModelSnapshot:
"""
Capture state of an API-hosted model deployment.
Cannot capture weights, but capture everything else.
"""
snapshot = ModelSnapshot(
snapshot_id=generate_id(),
capture_timestamp=datetime.utcnow(),
model_provider=api_config["provider"],
model_name=api_config["model_name"],
model_version=api_config.get("model_version", "unknown"),
)
# For API models, we cannot hash weights but must record
# the exact model identifier
snapshot.weights_hash = "N/A — API-hosted model"
snapshot.weights_location = (
f"{api_config['provider']}:{api_config['model_name']}"
)
# Capture application-level configuration
snapshot.system_prompt = api_config.get("system_prompt", "")
snapshot.system_prompt_hash = hashlib.sha256(
snapshot.system_prompt.encode()
).hexdigest()
snapshot.generation_parameters = {
k: v for k, v in api_config.items()
if k in ["temperature", "top_p", "max_tokens",
"frequency_penalty", "presence_penalty"]
}
return snapshotBehavioral Fingerprinting
Why Behavioral Fingerprints Matter
Model weights and configuration describe what the model is. Behavioral fingerprints describe what the model does. For forensic purposes, both are necessary because:
- The same weights with different prompts produce different behavior
- Model providers may update model versions without changing the model identifier
- Behavioral evidence is more meaningful to non-technical stakeholders (lawyers, regulators, executives) than weight hashes
Creating a Behavioral Fingerprint
# evidence_preservation/behavioral_fingerprint.py
"""
Behavioral fingerprinting for model state comparison.
"""
import numpy as np
from collections import Counter
class BehavioralFingerprint:
def __init__(self, model_endpoint):
self.endpoint = model_endpoint
def generate_fingerprint(self, probe_suite: dict,
repetitions: int = 5) -> dict:
"""
Generate a behavioral fingerprint by running standardized
probes and characterizing the response patterns.
"""
fingerprint = {
"timestamp": datetime.utcnow().isoformat(),
"model_version": self.endpoint.version,
"probe_results": {},
"aggregate_metrics": {},
}
all_response_lengths = []
all_refusal_rates = {}
for category, prompts in probe_suite.items():
category_results = []
for prompt in prompts:
prompt_results = []
refusals = 0
for rep in range(repetitions):
response = self.endpoint.generate(prompt)
is_refusal = self._is_refusal(response.text)
prompt_results.append({
"response_length": len(response.text),
"token_count": response.token_count,
"is_refusal": is_refusal,
"response_hash": hashlib.sha256(
response.text.encode()
).hexdigest()[:16],
})
all_response_lengths.append(len(response.text))
if is_refusal:
refusals += 1
refusal_rate = refusals / repetitions
all_refusal_rates[f"{category}:{prompt[:50]}"] = refusal_rate
category_results.append({
"prompt": prompt,
"refusal_rate": refusal_rate,
"avg_response_length": np.mean(
[r["response_length"] for r in prompt_results]
),
"response_variance": np.var(
[r["response_length"] for r in prompt_results]
),
"unique_responses": len(set(
r["response_hash"] for r in prompt_results
)),
})
fingerprint["probe_results"][category] = category_results
# Aggregate metrics
fingerprint["aggregate_metrics"] = {
"overall_refusal_rate": np.mean(
list(all_refusal_rates.values())
),
"avg_response_length": np.mean(all_response_lengths),
"response_length_std": np.std(all_response_lengths),
"safety_probe_refusal_rate": np.mean([
v for k, v in all_refusal_rates.items()
if "safety" in k
]) if any("safety" in k for k in all_refusal_rates) else None,
}
return fingerprint
def _is_refusal(self, response_text: str) -> bool:
"""Heuristic check for model refusal."""
refusal_indicators = [
"i cannot", "i can't", "i'm unable to",
"i am unable to", "i will not", "i won't",
"i'm not able to", "as an ai",
"against my guidelines", "not appropriate",
]
lower = response_text.lower()
return any(indicator in lower
for indicator in refusal_indicators)Fingerprint Comparison
Compare pre-incident and post-incident fingerprints to identify behavioral changes:
def compare_fingerprints(pre_incident: dict,
post_incident: dict) -> dict:
"""
Compare two behavioral fingerprints to identify changes.
"""
changes = {
"metric_changes": {},
"category_changes": {},
"significant_changes": [],
}
# Compare aggregate metrics
pre_metrics = pre_incident["aggregate_metrics"]
post_metrics = post_incident["aggregate_metrics"]
for metric in pre_metrics:
if pre_metrics[metric] is None or post_metrics.get(metric) is None:
continue
pre_val = pre_metrics[metric]
post_val = post_metrics[metric]
delta = post_val - pre_val
changes["metric_changes"][metric] = {
"pre": pre_val,
"post": post_val,
"delta": delta,
"percent_change": (delta / pre_val * 100)
if pre_val != 0 else float("inf"),
}
# Flag significant changes
if abs(delta / (pre_val + 1e-10)) > 0.1: # >10% change
changes["significant_changes"].append({
"metric": metric,
"change": f"{delta / (pre_val + 1e-10) * 100:.1f}%",
"direction": "increase" if delta > 0 else "decrease",
})
return changesModel Integrity Verification
Detecting Unauthorized Model Changes
For self-hosted models, verify that the deployed model matches the expected version by comparing cryptographic hashes of model artifacts:
# evidence_preservation/integrity_check.py
"""
Verify model artifact integrity against known-good hashes.
"""
class ModelIntegrityChecker:
def __init__(self, registry):
self.registry = registry
def verify_deployed_model(self, deployment_path: str,
expected_version: str) -> dict:
"""
Verify that deployed model artifacts match the
registered version.
"""
expected = self.registry.get_version(expected_version)
results = {"verified": True, "checks": []}
# Check each artifact
artifacts_to_check = [
("model weights", "model.safetensors",
expected.weights_hash),
("tokenizer", "tokenizer.json",
expected.tokenizer_hash),
("config", "config.json",
expected.config_hash),
]
for name, filename, expected_hash in artifacts_to_check:
filepath = os.path.join(deployment_path, filename)
if not os.path.exists(filepath):
results["checks"].append({
"artifact": name,
"status": "MISSING",
"expected_hash": expected_hash,
})
results["verified"] = False
continue
actual_hash = hash_file(filepath)
match = actual_hash == expected_hash
results["checks"].append({
"artifact": name,
"status": "MATCH" if match else "MISMATCH",
"expected_hash": expected_hash,
"actual_hash": actual_hash,
})
if not match:
results["verified"] = False
return resultsStorage and Retention
Forensic Storage Requirements
Model snapshots must be stored with the same security and integrity guarantees as other forensic evidence:
- Write-once storage. Use write-once media or append-only storage to prevent modification after collection.
- Encryption at rest. Encrypt snapshot storage to protect model intellectual property and any sensitive data in the configuration.
- Access controls. Restrict access to the forensics team and legal counsel. Log all access to snapshot storage.
- Geographic considerations. If the incident involves regulatory notification (GDPR, HIPAA), ensure snapshot storage complies with data residency requirements.
- Retention period. Retain snapshots for the duration required by the incident type — typically until legal proceedings are concluded or the retention period mandated by the applicable regulation expires.
Capture model state immediately
Before any containment actions, capture the complete model state including weights hash, system prompt, guardrails, and tool definitions.
Generate behavioral fingerprint
Run the standardized probe suite against the live model to capture behavioral state that cannot be derived from static artifacts alone.
Verify artifact integrity
Compare deployed model artifacts against the model registry to detect unauthorized modifications.
Store with chain of custody
Store all captured artifacts in forensic storage with hash verification, access logging, and chain-of-custody documentation.
Further Reading
- Evidence Preservation Overview — The broader evidence framework
- Conversation Preservation — Preserving interaction records
- Model Forensics — Analyzing preserved model artifacts
- Tampering Detection — Detecting unauthorized model changes