Statistical Rigor in AI Red Teaming
Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.
AI red teaming involves testing stochastic systems with stochastic methods. A model may refuse a prompt 9 out of 10 times but comply on the 10th. Without statistical rigor, a red teamer who ran the prompt once would report "safe" while another who ran it ten times would report "vulnerable." Both cannot be right, and neither report is complete without quantifying the uncertainty.
The Non-Determinism Problem
Why LLMs Produce Variable Outputs
| Source of Variability | Description | Impact on Evaluation |
|---|---|---|
| Temperature sampling | Higher temperature increases output randomness | Same prompt can produce safe and unsafe responses |
| Top-k / top-p filtering | Sampling strategy affects token selection | Different runs select different response paths |
| Batch ordering effects | Sequence of prior prompts can influence behavior | Results may depend on evaluation order |
| System load | API provider may route to different model instances | Subtle behavioral differences across instances |
| Context window state | Prior conversation affects subsequent responses | Multi-turn evaluations are path-dependent |
Quantifying Variability
# Measure response variability for a single prompt
import numpy as np

def measure_variability(prompt: str, model, n_trials: int = 50) -> dict:
    """Run a prompt multiple times and measure the outcome distribution.

    Assumes `model.generate` returns a response string and `judge`
    classifies a response as a violation (True) or not (False).
    """
    results = []
    for _ in range(n_trials):
        response = model.generate(prompt)
        is_violation = judge(response)
        results.append(is_violation)
    violation_rate = np.mean(results)
    # Normal-approximation standard error for a proportion
    std_error = np.sqrt(violation_rate * (1 - violation_rate) / n_trials)
    ci_95 = (violation_rate - 1.96 * std_error,
             violation_rate + 1.96 * std_error)
    return {
        "violation_rate": violation_rate,
        "std_error": std_error,
        "ci_95_lower": max(0, ci_95[0]),
        "ci_95_upper": min(1, ci_95[1]),
        "n_trials": n_trials,
    }
Sample Size Determination
How Many Prompts Are Enough?
The required sample size depends on the precision you need and the expected attack success rate (ASR).
| Expected ASR | Desired Margin of Error | Required Sample Size (95% CI) |
|---|---|---|
| 50% | +/- 5% | 385 |
| 50% | +/- 3% | 1,068 |
| 10% | +/- 5% | 139 |
| 10% | +/- 3% | 385 |
| 5% | +/- 2% | 457 |
| 1% | +/- 1% | 381 |
Sample Size Formula
n = (Z^2 * p * (1 - p)) / E^2
Where:
n = required sample size
Z = Z-score for desired confidence level (1.96 for 95%)
p = expected proportion (ASR estimate)
E = desired margin of error
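The formula above can be sketched as a small helper; the function name and rounding-up behavior are choices for illustration, not part of the source formula:

```python
import math

def required_sample_size(expected_rate: float, margin_of_error: float,
                         z: float = 1.96) -> int:
    """Sample size to estimate a proportion within +/- margin_of_error.

    z defaults to 1.96 for a 95% confidence level; round up since
    a fractional trial is not possible.
    """
    return math.ceil(z**2 * expected_rate * (1 - expected_rate)
                     / margin_of_error**2)

# Reproduces the table above, e.g.:
# required_sample_size(0.5, 0.05)  -> 385
# required_sample_size(0.05, 0.02) -> 457
```
Note that the worst case is p = 0.5, so when you have no prior ASR estimate, planning with p = 0.5 gives a conservative sample size.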
The Rule of Three
When you observe zero events in n trials, the 95% confidence upper bound for the true rate is approximately 3/n.
| Trials (n) | Observed Successes | 95% Upper Bound |
|---|---|---|
| 20 | 0 | 15.0% |
| 50 | 0 | 6.0% |
| 100 | 0 | 3.0% |
| 300 | 0 | 1.0% |
| 1,000 | 0 | 0.3% |
This means: to claim with 95% confidence that the true ASR is below 1%, you need at least 300 trials with zero successes.
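A minimal sketch comparing the rule-of-three approximation with the exact binomial bound it approximates (function names are illustrative):

```python
def rule_of_three_upper(n_trials: int) -> float:
    """Approximate 95% upper bound on the true rate after 0 events in n trials."""
    return 3.0 / n_trials

def exact_zero_event_upper(n_trials: int, confidence: float = 0.95) -> float:
    """Exact upper bound: solve (1 - p)^n = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

# rule_of_three_upper(100)      -> 0.03
# exact_zero_event_upper(100)   -> ~0.0295 (the 3/n rule is a close approximation)
```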
Confidence Intervals for Red Team Metrics
Constructing Intervals
| Metric | Interval Type | When to Use |
|---|---|---|
| ASR | Wilson score interval | Preferred for proportions, especially near 0% or 100% |
| Mean severity score | t-interval | When severity is on a continuous scale |
| Median response time | Bootstrap interval | For non-normal distributions |
| Category-specific ASR | Wilson with Bonferroni correction | When reporting multiple categories simultaneously |
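For the bootstrap interval row above, a minimal percentile-bootstrap sketch for a median follows; the sample data and function name are illustrative, and 2,000 resamples is an assumed default:

```python
import random
import statistics

def bootstrap_median_ci(samples, n_boot: int = 2000,
                        confidence: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for the median of a non-normal sample."""
    rng = random.Random(seed)
    # Resample with replacement and record each resample's median
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    alpha = 1 - confidence
    lo = medians[int(alpha / 2 * n_boot)]
    hi = medians[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi

# Example: skewed response times (seconds)
response_times = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.5,
                  1.2, 0.9, 1.1, 3.8, 1.0, 0.8, 1.4]
lower, upper = bootstrap_median_ci(response_times)
```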
Wilson Score Interval (Recommended for ASR)
import numpy as np
from scipy import stats

def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> tuple:
    """Calculate the Wilson score confidence interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_hat = successes / trials
    denominator = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denominator
    margin = (z / denominator) * np.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return (max(0, center - margin), min(1, center + margin))

# Example: 5 successes out of 200 trials
lower, upper = wilson_interval(5, 200)
# ASR: 2.5%, 95% CI: [1.1%, 5.7%]
Hypothesis Testing for Safety Claims
Common Safety Claims and Their Tests
| Claim | Null Hypothesis | Test | Required Evidence |
|---|---|---|---|
| "Model A is safer than Model B" | ASR_A >= ASR_B | Two-proportion z-test | Significant difference at p < 0.05 |
| "The update reduced ASR" | ASR_after >= ASR_before | Paired proportion test | Significant reduction at p < 0.05 |
| "ASR is below 5%" | ASR >= 0.05 | One-proportion z-test | Reject at p < 0.05 with observed ASR < 5% |
| "The model is equally safe across languages" | ASR varies by language | Chi-squared test | Non-significant result (p > 0.05) |
Comparing Two Models
# proportions_ztest lives in statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest

def compare_models(successes_a: int, trials_a: int,
                   successes_b: int, trials_b: int) -> dict:
    """Test whether two models have significantly different ASR."""
    stat, p_value = proportions_ztest(
        [successes_a, successes_b],
        [trials_a, trials_b],
        alternative='two-sided'
    )
    return {
        "asr_a": successes_a / trials_a,
        "asr_b": successes_b / trials_b,
        "z_statistic": stat,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }
Multiple Testing Correction
When testing multiple categories or making many simultaneous comparisons, the risk of false positives increases.
| Correction Method | When to Use | How It Works |
|---|---|---|
| Bonferroni | Few comparisons (< 10) | Divide significance level by number of tests |
| Holm-Bonferroni | Moderate comparisons | Step-down procedure, less conservative |
| Benjamini-Hochberg | Many comparisons | Controls false discovery rate instead of family-wise error |
If you report ASR for 10 OWASP categories and claim "significant improvement" in any category at p < 0.05, there is a 40% chance of at least one false positive without correction.
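The Holm-Bonferroni step-down procedure from the table can be sketched in a few lines; this is a hand-rolled illustration (libraries such as statsmodels provide equivalent routines):

```python
def holm_bonferroni(p_values, alpha: float = 0.05):
    """Holm-Bonferroni step-down correction.

    Compare the smallest p-value against alpha/m, the next against
    alpha/(m-1), and so on; stop at the first failure, since all
    larger p-values then fail as well. Returns a reject/keep flag
    per test, in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down stops at the first non-significant test
    return reject

# Four category p-values: only the smallest survives correction here
# holm_bonferroni([0.03, 0.001, 0.04, 0.02]) -> [False, True, False, False]
```
Note that plain Bonferroni would compare every p-value against alpha/m; Holm-Bonferroni rejects at least as many hypotheses while still controlling the family-wise error rate.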
Common Statistical Pitfalls
| Pitfall | Example | Correction |
|---|---|---|
| Insufficient sample size | "0/20 attacks succeeded, model is safe" | Report confidence interval: "0/20, 95% CI: [0%, 15%]" |
| Cherry-picking results | Reporting only the best evaluation run | Report all runs or pre-register the evaluation protocol |
| Ignoring non-determinism | Running each prompt once | Run each prompt multiple times at deployment temperature |
| p-hacking | Testing many prompts until one works, reporting as "found a vulnerability" | Distinguish exploratory (finding) from confirmatory (measuring) testing |
| Survivorship bias | Only reporting successful attacks | Report the full denominator: "5/200 attacks succeeded" |
| Base rate neglect | Interpreting a low ASR as "safe" without context | Compare to industry benchmarks and acceptable risk thresholds |
Reporting Statistical Results
Minimum Reporting Requirements
Every quantitative red team finding should include:
- The metric value (e.g., ASR = 4.2%)
- The confidence interval (e.g., 95% CI: [2.8%, 6.1%])
- The sample size (e.g., n = 500 prompts, 3 trials each)
- The conditions (e.g., temperature = 0.7, model version X)
- The evaluation date and dataset version
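The checklist above can be captured as a structured record so no field is omitted from a report; the field names and formatting are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One quantitative red team finding with full reporting context."""
    metric: str
    value: float
    ci_lower: float
    ci_upper: float
    n_prompts: int
    trials_per_prompt: int
    temperature: float
    model_version: str
    eval_date: str
    dataset_version: str

def format_finding(f: Finding) -> str:
    """Render a finding in the minimum-reporting format described above."""
    return (f"{f.metric} = {f.value:.1%} "
            f"(95% CI: [{f.ci_lower:.1%}, {f.ci_upper:.1%}]; "
            f"n = {f.n_prompts} prompts x {f.trials_per_prompt} trials; "
            f"temperature = {f.temperature}; {f.model_version}; "
            f"{f.eval_date}; dataset {f.dataset_version})")

# Example from the bullet list above:
finding = Finding("ASR", 0.042, 0.028, 0.061, 500, 3,
                  0.7, "model-X", "2024-06-01", "v1.2")
```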
Related Topics
- AI Safety Benchmarks & Evaluation -- benchmark landscape and evaluation methodology
- Building Evaluation Harnesses -- infrastructure for running evaluations
- Red Team Metrics Beyond ASR -- comprehensive metrics framework
- Visualizing Red Team Results -- presenting statistical results visually
References
- "Statistical Methods for AI Safety Evaluation" - Anthropic (2024) - Methodology for confidence intervals and hypothesis testing in AI red team evaluations
- "The Rule of Three in Clinical Trials" - Hanley & Lippman-Hand, Journal of the American Medical Association (1983) - Original derivation of the rule of three for zero-event estimation
- "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming" - Mazeika et al. (2024) - Statistical methodology for standardized AI safety benchmarking
- "Sample Size Determination for Proportions" - Agresti & Coull, The American Statistician (1998) - Wilson score interval methodology recommended for proportion-based metrics
Review question: a red team runs 50 test prompts against a model and observes zero successful attacks. What can they claim with 95% confidence? By the rule of three, only that the true ASR is below approximately 3/50 = 6%; they cannot claim the model is safe.