Model Behavior Forensics (Ai Forensics Ir)
Overview of model forensics: determining if a model has been tampered with, behavioral analysis methodology, and the relationship between model artifacts and observable behavior.
Model Behavior Forensics
When an AI incident suggests that the model itself -- not just its inputs, configuration, or surrounding application -- has been compromised, you enter the domain of model forensics. This discipline examines whether a model's weights, architecture, or learned behavior have been altered from their known-good state. It is the AI equivalent of malware analysis: studying the artifact itself to determine if it has been tampered with.
When to Investigate the Model Itself
Not every AI incident requires model-level forensics. Most incidents are caused by application-layer issues: flawed system prompts, missing guardrails, or vulnerable tool configurations. Model-level investigation is warranted when:
| Indicator | Why It Suggests Model Compromise |
|---|---|
| Behavioral anomalies not explained by configuration | If the system prompt and guardrails are correct but the model still misbehaves, the model itself may be the issue |
| Behavior changes after model update or swap | A new model version or fine-tuned variant may have introduced vulnerabilities |
| Third-party model provenance concerns | Models downloaded from public repositories may contain backdoors |
| Unexplained safety regression | Safety behaviors weaken without any changes to the application layer |
| Triggered behavior | Model produces specific outputs only in response to specific, unusual triggers |
| Supply chain incident | Upstream provider reports a compromise affecting model artifacts |
Model Artifacts as Evidence
A model consists of multiple artifacts, each of which can be independently tampered with.
Artifact Inventory
| Artifact | What It Contains | Tampering Risk |
|---|---|---|
| Base weights | The pre-trained model parameters (billions of floating-point values) | Poisoning during pre-training, weight modification post-download |
| Adapter files (LoRA, QLoRA) | Small parameter sets that modify base model behavior | Malicious fine-tuning, backdoor insertion |
| Tokenizer | Vocabulary and encoding rules | Token manipulation, trigger insertion |
| Configuration files | Architecture definition, hyperparameters | Architecture modification, hidden layers |
| Quantization artifacts | Compressed weight representations | Precision-based behavior changes, quantization-masked backdoors |
| Embedding layers | Input/output token representations | Embedding space manipulation for specific triggers |
Chain of Custody
Chain of custody for model artifacts requires:
- Provenance record -- where was the model obtained? Which exact version/commit?
- Integrity verification -- hash (SHA-256) of all model files at acquisition time
- Access log -- who has had write access to the model files since acquisition?
- Modification history -- any fine-tuning, quantization, or format conversion applied
- Deployment history -- when was each version deployed and to which endpoints?
# Generate integrity checksums for model artifacts
sha256sum model_weights.safetensors > checksums/model_weights.sha256
sha256sum tokenizer.json > checksums/tokenizer.sha256
sha256sum config.json > checksums/config.json
sha256sum adapter_model.safetensors > checksums/adapter.sha256
# Verify against known-good checksums
sha256sum -c checksums/*.sha256Behavioral Analysis Methodology
When model-level tampering is suspected, systematic behavioral analysis determines whether the model's behavior deviates from its expected baseline.
Phase 1: Establish the Expected Baseline
Before you can identify anomalous behavior, you need a reference point.
| Baseline Source | What It Provides | Limitations |
|---|---|---|
| Previous model version | Direct behavioral comparison | May not have been preserved |
| Model card / documentation | Expected capabilities and limitations | May be incomplete or outdated |
| Safety evaluation benchmarks | Quantified safety behavior scores | Covers common cases, may miss targeted backdoors |
| Original provider's model | Unmodified reference behavior | May differ from your fine-tuned version |
| Production behavioral logs | Real-world behavior before the incident | Noisy; influenced by application layer |
Phase 2: Systematic Probing
Safety behavior probing
Test the model against a comprehensive set of safety-relevant prompts. Compare refusal rates and response patterns against the baseline. A significant decrease in refusal rate for any category warrants deeper investigation.
Categories to test: harmful content generation, PII disclosure, jailbreak susceptibility, instruction adherence, persona resistance, and system prompt protection.
Trigger scanning
If a backdoor is suspected, search for inputs that produce anomalous outputs. This involves testing the model with known backdoor trigger patterns and monitoring for outputs that deviate significantly from expected behavior. See Backdoor Detection for techniques.
Output distribution analysis
Compare the statistical properties of the model's outputs (token distribution, vocabulary usage, output length distribution) between the suspected model and the baseline.
See Behavior Diffing for methods.
Weight and file integrity verification
Verify the integrity of model files against known-good checksums. Inspect adapter files, tokenizer modifications, and configuration changes. See Tampering Detection for procedures.
Phase 3: Differential Analysis
Compare the suspected model's behavior against the baseline across multiple dimensions:
| Dimension | Measurement | Significance Threshold |
|---|---|---|
| Safety refusal rate | Percentage of harmful prompts refused | >5% decrease from baseline |
| Output toxicity scores | Average toxicity classifier score | >0.1 increase from baseline |
| Instruction adherence | Rate of system prompt compliance | >10% decrease from baseline |
| Capability benchmarks | Task performance on standard benchmarks | >5% change in either direction |
| Trigger response | Behavior on suspected trigger inputs | Any anomalous response |
Types of Model Compromise
| Type | What Changed | How to Detect | Difficulty |
|---|---|---|---|
| Backdoor | Model responds to specific triggers with attacker-chosen outputs | Trigger scanning, activation analysis | High |
| Safety degradation | Overall safety behavior weakened | Safety benchmark comparison | Medium |
| Capability manipulation | Specific capabilities enhanced or degraded | Task-specific benchmarks | Medium |
| Bias injection | Model behavior systematically biased in specific contexts | Fairness benchmarks, output analysis | High |
| Data memorization | Model memorizes and can reproduce specific sensitive data | Extraction probing, membership inference | Medium |
Section Overview
This section contains three specialized subsections for in-depth model forensic investigation:
| Subsection | Focus | When to Use |
|---|---|---|
| Backdoor Detection | Finding hidden triggers and malicious functionality | Third-party model, supply chain concern, unexplained triggered behavior |
| Behavior Diffing | Comparing behavior before and after an incident or update | Safety regression, unexpected behavioral changes, post-update verification |
| Tampering Detection | Verifying file integrity and detecting modifications | File integrity concerns, unknown modifications, supply chain verification |
Related Topics
- Infrastructure & Supply Chain -- supply chain attack vectors that lead to model compromise
- Training Pipeline Attacks -- understanding how models are poisoned during training
- RAG, Data & Training Attacks -- data poisoning techniques relevant to model forensics
- Evidence Preservation -- preserving model artifacts for investigation
References
- "Backdoor Attacks on Language Models: A Survey" - arXiv (2025) - Comprehensive survey of backdoor techniques and detection methods
- "TrojAI: AI Model Inspection Framework" - IARPA (2024) - Government-sponsored model inspection methodology
- "NIST AI 100-2: Adversarial Machine Learning" - NIST (2024) - Taxonomy of model-level attacks
- "Model Cards for Model Reporting" - Mitchell et al. (2019) - Documentation framework for establishing model baselines
A fine-tuned model scores higher on coding benchmarks than its base model but has a 15% lower safety refusal rate. Should you investigate further?