Model Behavior Diffing

advanced9 min readUpdated 2026-03-15

Comparing model behavior before and after incidents: output distribution analysis, safety regression detection, capability change measurement, and statistical significance testing.

behavior-diffing comparison regression model-analysis

Model Behavior Diffing

Behavior diffing compares a model's outputs before and after a suspected incident, update, or modification. While code diffing shows you exactly which lines changed, behavior diffing must infer changes from the model's observable outputs -- because the "source code" is billions of opaque floating-point parameters. This page covers techniques for detecting and quantifying behavioral changes.

When to Diff

Behavior diffing is warranted in several scenarios:

Scenario	What You Are Comparing	What You Are Looking For
Post-incident investigation	Model behavior during incident vs. before incident	Behavioral shifts that enabled the incident
Model update verification	New model version vs. previous version	Unintended safety regressions introduced by the update
Fine-tuning validation	Fine-tuned model vs. base model	Safety degradation or unintended capability changes
Supply chain verification	Downloaded model vs. provider's reference	Tampering during distribution
Adapter inspection	Model with adapter vs. model without adapter	Behavioral changes introduced by the adapter

Evaluation Suite Design

A behavior diff is only as good as the evaluation suite you use. The suite must cover the dimensions you care about with enough samples to achieve statistical significance.

Dimension Coverage

Dimension	What to Evaluate	Minimum Sample Size
Safety refusals	Refusal rate for harmful content requests across categories	200 prompts across 10+ harm categories
System prompt adherence	Rate of compliance with system prompt instructions	100 prompts that test instruction boundaries
Factual accuracy	Correctness of knowledge-based responses	200 factual questions with verifiable answers
Output characteristics	Length, tone, formatting, vocabulary	500 general prompts for distribution analysis
Capability benchmarks	Task performance (coding, math, reasoning, summarization)	Standard benchmark suite per capability
Jailbreak resistance	Resistance to known jailbreak techniques	100+ jailbreak prompts from technique catalogs
Bias and fairness	Differential treatment of protected groups	200+ prompts varying demographic attributes

Prompt Design Principles

Principle	Description	Example
Paired evaluation	Use identical prompts for both model versions	Same 200 safety prompts run against both versions
Category balance	Equal representation across categories	20 prompts per harm category, not 190 in one and 10 in nine
Difficulty range	Mix of easy and hard cases	Include both obvious and subtle safety boundary tests
Deterministic settings	Run evaluation at temperature 0 when possible	Reduces noise; shows modal behavior
Multiple temperatures	Also test at deployment temperature	Captures behavior users actually experience

Output Distribution Analysis

Token Distribution Comparison

Compare the probability distributions over the vocabulary between model versions.

import numpy as np
from scipy.stats import entropy
 
def compare_output_distributions(model_a, model_b, prompts, top_k=100):
    """
    Compare output token distributions between two models
    using KL divergence and JS divergence.
    """
    results = []
    for prompt in prompts:
        logits_a = model_a.get_logits(prompt)
        logits_b = model_b.get_logits(prompt)
 
        # Convert to probabilities
        probs_a = softmax(logits_a)[:top_k]
        probs_b = softmax(logits_b)[:top_k]
 
        # KL and JS divergence
        kl_div = entropy(probs_a, probs_b)
        js_div = jensenshannon(probs_a, probs_b)
 
        results.append({
            "prompt": prompt,
            "kl_divergence": kl_div,
            "js_divergence": js_div,
            "top_token_a": decode(np.argmax(logits_a)),
            "top_token_b": decode(np.argmax(logits_b))
        })
 
    return results

Interpreting Distribution Changes

JS Divergence Range	Interpretation
< 0.05	Negligible difference; models behave essentially the same
0.05 - 0.15	Minor differences; may be quantization or minor update effects
0.15 - 0.30	Moderate differences; warrants investigation of specific prompts
> 0.30	Significant behavioral change; high priority for forensic review

Safety Regression Detection

Safety regression is the most common and highest-priority finding in behavior diffing.

Regression Measurement

def measure_safety_regression(model_a, model_b, safety_prompts):
    """
    Compare refusal rates between two models on safety-critical prompts.
    """
    results_a = [is_refusal(model_a.generate(p)) for p in safety_prompts]
    results_b = [is_refusal(model_b.generate(p)) for p in safety_prompts]
 
    refusal_rate_a = sum(results_a) / len(results_a)
    refusal_rate_b = sum(results_b) / len(results_b)
 
    # Statistical significance test
    from scipy.stats import chi2_contingency
    table = [
        [sum(results_a), len(results_a) - sum(results_a)],
        [sum(results_b), len(results_b) - sum(results_b)]
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
 
    return {
        "refusal_rate_a": refusal_rate_a,
        "refusal_rate_b": refusal_rate_b,
        "change": refusal_rate_b - refusal_rate_a,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

Category-Level Analysis

Overall refusal rates can mask category-specific regressions. Always analyze safety performance by category:

Category	Model A Refusal Rate	Model B Refusal Rate	Change	Significant?
Violent content	95%	93%	-2%	No (p=0.42)
Self-harm	97%	96%	-1%	No (p=0.61)
PII disclosure	88%	72%	-16%	Yes (p<0.001)
Illegal activity	92%	90%	-2%	No (p=0.38)
System prompt leak	85%	64%	-21%	Yes (p<0.001)

Capability Change Measurement

Unexpected capability changes can indicate tampering even when safety metrics are stable.

What to Measure

Capability	Benchmark	Tampering Signal
Code generation	HumanEval, MBPP	Improved code quality could mask backdoored code patterns
Reasoning	GSM8K, ARC, MMLU	Changes in reasoning indicate weight modifications
Summarization	ROUGE scores on standard datasets	Capability changes not explained by fine-tuning objectives
Translation	BLEU scores across language pairs	Language-specific changes may indicate targeted modification
Instruction following	IFEval, MT-Bench	Changes in how the model interprets and follows instructions

Interpreting Capability Changes

Change Pattern	Possible Cause	Investigation Priority
Uniform slight improvement	Legitimate model update	Low
Uniform slight degradation	Quantization or compression	Low
Improvement in specific tasks only	Targeted fine-tuning (legitimate or malicious)	Medium
Degradation in safety with capability improvement	Potential safety-capability trade-off in fine-tuning	High
No capability change but safety regression	Safety alignment specifically weakened	Critical

Diff Report Template

Document behavior diff results in a structured format:

## Behavior Diff Report
 
**Model A:** [identifier, version, date deployed]
**Model B:** [identifier, version, date deployed]
**Evaluation Date:** [date]
**Evaluator:** [name]
 
### Summary
[1-2 sentence summary of key findings]
 
### Safety Regression Analysis
| Category | Model A | Model B | Change | p-value | Significant |
|---|---|---|---|---|---|
| [category] | [rate] | [rate] | [delta] | [p] | [yes/no] |
 
### Output Distribution Analysis
- Mean JS Divergence: [value]
- Categories with JS > 0.15: [list]
- Top divergent prompts: [list with divergence values]
 
### Capability Benchmarks
| Benchmark | Model A | Model B | Change |
|---|---|---|---|
| [benchmark] | [score] | [score] | [delta] |
 
### Forensic Assessment
[Analysis of whether changes indicate tampering, legitimate update, or natural drift]
 
### Recommendations
[Next steps based on findings]

Automating Behavior Diffing

For organizations that frequently update models, automate behavior diffing as part of the deployment pipeline:

Integration Point	What to Test	Pass/Fail Criteria
Pre-deployment gate	Full evaluation suite	No safety regression > 5%, no capability change > 10%
Post-deployment monitoring	Continuous sampling from production	Behavioral metrics remain within baseline bounds
Scheduled audits	Comprehensive diff against last audit	All findings documented and triaged

Backdoor Detection -- specialized detection when diffing reveals anomalies
Tampering Detection -- file-level verification complementing behavioral analysis
Model Snapshots -- preserving model versions for future diffing
Severity Framework -- scoring the severity of detected regressions

References

"Language Model Evaluation Harness" - EleutherAI (2025) - Open-source evaluation framework for language model benchmarking
"Holistic Evaluation of Language Models (HELM)" - Stanford CRFM (2024) - Comprehensive evaluation methodology
"Statistical Methods for AI Safety Evaluation" - arXiv (2025) - Statistical frameworks for behavioral comparison
"DeepEval: LLM Evaluation Framework" - Confident AI (2025) - Automated evaluation and regression detection

Knowledge Check

A model's overall safety refusal rate dropped from 91% to 88% after an update. Is this a significant safety regression?

Model Behavior Diffing

Related articles

Model Behavior Diffing

Related articles