Model Behavior Diffing
Comparing model behavior before and after incidents: output distribution analysis, safety regression detection, capability change measurement, and statistical significance testing.
Model Behavior Diffing
Behavior diffing compares a model's outputs before and after a suspected incident, update, or modification. While code diffing shows you exactly which lines changed, behavior diffing must infer changes from the model's observable outputs -- because the "source code" is billions of opaque floating-point parameters. This page covers techniques for detecting and quantifying behavioral changes.
When to Diff
Behavior diffing is warranted in several scenarios:
| Scenario | What You Are Comparing | What You Are Looking For |
|---|---|---|
| Post-incident investigation | Model behavior during incident vs. before incident | Behavioral shifts that enabled the incident |
| Model update verification | New model version vs. previous version | Unintended safety regressions introduced by the update |
| Fine-tuning validation | Fine-tuned model vs. base model | Safety degradation or unintended capability changes |
| Supply chain verification | Downloaded model vs. provider's reference | Tampering during distribution |
| Adapter inspection | Model with adapter vs. model without adapter | Behavioral changes introduced by the adapter |
Evaluation Suite Design
A behavior diff is only as good as the evaluation suite you use. The suite must cover the dimensions you care about with enough samples to achieve statistical significance.
Dimension Coverage
| Dimension | What to Evaluate | Minimum Sample Size |
|---|---|---|
| Safety refusals | Refusal rate for harmful content requests across categories | 200 prompts across 10+ harm categories |
| System prompt adherence | Rate of compliance with system prompt instructions | 100 prompts that test instruction boundaries |
| Factual accuracy | Correctness of knowledge-based responses | 200 factual questions with verifiable answers |
| Output characteristics | Length, tone, formatting, vocabulary | 500 general prompts for distribution analysis |
| Capability benchmarks | Task performance (coding, math, reasoning, summarization) | Standard benchmark suite per capability |
| Jailbreak resistance | Resistance to known jailbreak techniques | 100+ jailbreak prompts from technique catalogs |
| Bias and fairness | Differential treatment of protected groups | 200+ prompts varying demographic attributes |
Prompt Design Principles
| Principle | Description | Example |
|---|---|---|
| Paired evaluation | Use identical prompts for both model versions | Same 200 safety prompts run against both versions |
| Category balance | Equal representation across categories | 20 prompts per harm category, not 190 in one and 10 in nine |
| Difficulty range | Mix of easy and hard cases | Include both obvious and subtle safety boundary tests |
| Deterministic settings | Run evaluation at temperature 0 when possible | Reduces noise; shows modal behavior |
| Multiple temperatures | Also test at deployment temperature | Captures behavior users actually experience |
Output Distribution Analysis
Token Distribution Comparison
Compare the probability distributions over the vocabulary between model versions.
import numpy as np
from scipy.stats import entropy
def compare_output_distributions(model_a, model_b, prompts, top_k=100):
"""
Compare output token distributions between two models
using KL divergence and JS divergence.
"""
results = []
for prompt in prompts:
logits_a = model_a.get_logits(prompt)
logits_b = model_b.get_logits(prompt)
# Convert to probabilities
probs_a = softmax(logits_a)[:top_k]
probs_b = softmax(logits_b)[:top_k]
# KL and JS divergence
kl_div = entropy(probs_a, probs_b)
js_div = jensenshannon(probs_a, probs_b)
results.append({
"prompt": prompt,
"kl_divergence": kl_div,
"js_divergence": js_div,
"top_token_a": decode(np.argmax(logits_a)),
"top_token_b": decode(np.argmax(logits_b))
})
return resultsInterpreting Distribution Changes
| JS Divergence Range | Interpretation |
|---|---|
| < 0.05 | Negligible difference; models behave essentially the same |
| 0.05 - 0.15 | Minor differences; may be quantization or minor update effects |
| 0.15 - 0.30 | Moderate differences; warrants investigation of specific prompts |
| > 0.30 | Significant behavioral change; high priority for forensic review |
Safety Regression Detection
Safety regression is the most common and highest-priority finding in behavior diffing.
Regression Measurement
def measure_safety_regression(model_a, model_b, safety_prompts):
"""
Compare refusal rates between two models on safety-critical prompts.
"""
results_a = [is_refusal(model_a.generate(p)) for p in safety_prompts]
results_b = [is_refusal(model_b.generate(p)) for p in safety_prompts]
refusal_rate_a = sum(results_a) / len(results_a)
refusal_rate_b = sum(results_b) / len(results_b)
# Statistical significance test
from scipy.stats import chi2_contingency
table = [
[sum(results_a), len(results_a) - sum(results_a)],
[sum(results_b), len(results_b) - sum(results_b)]
]
chi2, p_value, _, _ = chi2_contingency(table)
return {
"refusal_rate_a": refusal_rate_a,
"refusal_rate_b": refusal_rate_b,
"change": refusal_rate_b - refusal_rate_a,
"p_value": p_value,
"significant": p_value < 0.05
}Category-Level Analysis
Overall refusal rates can mask category-specific regressions. Always analyze safety performance by category:
| Category | Model A Refusal Rate | Model B Refusal Rate | Change | Significant? |
|---|---|---|---|---|
| Violent content | 95% | 93% | -2% | No (p=0.42) |
| Self-harm | 97% | 96% | -1% | No (p=0.61) |
| PII disclosure | 88% | 72% | -16% | Yes (p<0.001) |
| Illegal activity | 92% | 90% | -2% | No (p=0.38) |
| System prompt leak | 85% | 64% | -21% | Yes (p<0.001) |
Capability Change Measurement
Unexpected capability changes can indicate tampering even when safety metrics are stable.
What to Measure
| Capability | Benchmark | Tampering Signal |
|---|---|---|
| Code generation | HumanEval, MBPP | Improved code quality could mask backdoored code patterns |
| Reasoning | GSM8K, ARC, MMLU | Changes in reasoning indicate weight modifications |
| Summarization | ROUGE scores on standard datasets | Capability changes not explained by fine-tuning objectives |
| Translation | BLEU scores across language pairs | Language-specific changes may indicate targeted modification |
| Instruction following | IFEval, MT-Bench | Changes in how the model interprets and follows instructions |
Interpreting Capability Changes
| Change Pattern | Possible Cause | Investigation Priority |
|---|---|---|
| Uniform slight improvement | Legitimate model update | Low |
| Uniform slight degradation | Quantization or compression | Low |
| Improvement in specific tasks only | Targeted fine-tuning (legitimate or malicious) | Medium |
| Degradation in safety with capability improvement | Potential safety-capability trade-off in fine-tuning | High |
| No capability change but safety regression | Safety alignment specifically weakened | Critical |
Diff Report Template
Document behavior diff results in a structured format:
## Behavior Diff Report
**Model A:** [identifier, version, date deployed]
**Model B:** [identifier, version, date deployed]
**Evaluation Date:** [date]
**Evaluator:** [name]
### Summary
[1-2 sentence summary of key findings]
### Safety Regression Analysis
| Category | Model A | Model B | Change | p-value | Significant |
|---|---|---|---|---|---|
| [category] | [rate] | [rate] | [delta] | [p] | [yes/no] |
### Output Distribution Analysis
- Mean JS Divergence: [value]
- Categories with JS > 0.15: [list]
- Top divergent prompts: [list with divergence values]
### Capability Benchmarks
| Benchmark | Model A | Model B | Change |
|---|---|---|---|
| [benchmark] | [score] | [score] | [delta] |
### Forensic Assessment
[Analysis of whether changes indicate tampering, legitimate update, or natural drift]
### Recommendations
[Next steps based on findings]Automating Behavior Diffing
For organizations that frequently update models, automate behavior diffing as part of the deployment pipeline:
| Integration Point | What to Test | Pass/Fail Criteria |
|---|---|---|
| Pre-deployment gate | Full evaluation suite | No safety regression > 5%, no capability change > 10% |
| Post-deployment monitoring | Continuous sampling from production | Behavioral metrics remain within baseline bounds |
| Scheduled audits | Comprehensive diff against last audit | All findings documented and triaged |
Related Topics
- Backdoor Detection -- specialized detection when diffing reveals anomalies
- Tampering Detection -- file-level verification complementing behavioral analysis
- Model Snapshots -- preserving model versions for future diffing
- Severity Framework -- scoring the severity of detected regressions
References
- "Language Model Evaluation Harness" - EleutherAI (2025) - Open-source evaluation framework for language model benchmarking
- "Holistic Evaluation of Language Models (HELM)" - Stanford CRFM (2024) - Comprehensive evaluation methodology
- "Statistical Methods for AI Safety Evaluation" - arXiv (2025) - Statistical frameworks for behavioral comparison
- "DeepEval: LLM Evaluation Framework" - Confident AI (2025) - Automated evaluation and regression detection
A model's overall safety refusal rate dropped from 91% to 88% after an update. Is this a significant safety regression?