Statistical Rigor in AI Red Teaming
Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.
AI red teaming involves testing stochastic systems with stochastic methods. A model may refuse a prompt 9 out of 10 times but comply on the 10th. Without statistical rigor, a red teamer who ran the prompt once would report "safe" while another who ran it ten times would report "vulnerable." Both cannot be right, and neither report is complete without quantifying the uncertainty.
The Non-Determinism Problem
Why LLMs Produce Variable Outputs
| Source of Variability | Description | Impact on Evaluation |
|---|---|---|
| Temperature sampling | Higher temperature increases output randomness | Same prompt can produce safe and unsafe responses |
| Top-k / top-p filtering | Sampling strategy affects token selection | Different runs select different response paths |
| Batch ordering effects | Sequence of prior prompts can influence behavior | Results may depend on evaluation order |
| System load | API provider may route to different model instances | Subtle behavioral differences across instances |
| Context window state | Prior conversation affects subsequent responses | Multi-turn evaluations are path-dependent |
Quantifying Variability
# Measure response variability for a single prompt
import numpy as np

def measure_variability(prompt: str, model, n_trials: int = 50) -> dict:
    """Run a prompt multiple times and measure the outcome distribution.

    Assumes `model.generate` returns a response string and `judge`
    classifies a response as a violation (True) or not (False).
    """
    results = []
    for _ in range(n_trials):
        response = model.generate(prompt)
        is_violation = judge(response)
        results.append(is_violation)
    violation_rate = np.mean(results)
    # Normal-approximation standard error for a proportion
    std_error = np.sqrt(violation_rate * (1 - violation_rate) / n_trials)
    ci_95 = (violation_rate - 1.96 * std_error,
             violation_rate + 1.96 * std_error)
    return {
        "violation_rate": violation_rate,
        "std_error": std_error,
        "ci_95_lower": max(0, ci_95[0]),
        "ci_95_upper": min(1, ci_95[1]),
        "n_trials": n_trials,
    }
Sample Size Determination
How Many Prompts Are Enough?
The required sample size depends on the precision you need and the expected attack success rate (ASR).
| Expected ASR | Desired Margin of Error | Required Sample Size (95% CI) |
|---|---|---|
| 50% | +/- 5% | 385 |
| 50% | +/- 3% | 1,068 |
| 10% | +/- 5% | 139 |
| 10% | +/- 3% | 385 |
| 5% | +/- 2% | 457 |
| 1% | +/- 1% | 381 |
Sample Size Formula
n = (Z^2 * p * (1 - p)) / E^2
Where:
n = required sample size
Z = Z-score for desired confidence level (1.96 for 95%)
p = expected proportion (ASR estimate)
E = desired margin of error
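The formula above can be sketched as a small helper; the function name and rounding-up behavior are choices for illustration, not part of the source formula:

```python
import math

def required_sample_size(expected_rate: float, margin_of_error: float,
                         z: float = 1.96) -> int:
    """Sample size to estimate a proportion within +/- margin_of_error.

    z defaults to 1.96 for a 95% confidence level; round up since
    a fractional trial is not possible.
    """
    return math.ceil(z**2 * expected_rate * (1 - expected_rate)
                     / margin_of_error**2)

# Reproduces the table above, e.g.:
# required_sample_size(0.5, 0.05)  -> 385
# required_sample_size(0.05, 0.02) -> 457
```
Note that the worst case is p = 0.5, so when you have no prior ASR estimate, planning with p = 0.5 gives a conservative sample size.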
The Rule of Three
When you observe zero events in n trials, the 95% confidence upper bound for the true rate is approximately 3/n.
| Trials (n) | Observed Successes | 95% Upper Bound |
|---|---|---|
| 20 | 0 | 15.0% |
| 50 | 0 | 6.0% |
| 100 | 0 | 3.0% |
| 300 | 0 | 1.0% |
| 1,000 | 0 | 0.3% |
This means: to claim with 95% confidence that the true ASR is below 1%, you need at least 300 trials with zero successes.
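A minimal sketch comparing the rule-of-three approximation with the exact binomial bound it approximates (function names are illustrative):

```python
def rule_of_three_upper(n_trials: int) -> float:
    """Approximate 95% upper bound on the true rate after 0 events in n trials."""
    return 3.0 / n_trials

def exact_zero_event_upper(n_trials: int, confidence: float = 0.95) -> float:
    """Exact upper bound: solve (1 - p)^n = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

# rule_of_three_upper(100)      -> 0.03
# exact_zero_event_upper(100)   -> ~0.0295 (the 3/n rule is a close approximation)
```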
Confidence Intervals for Red Team Metrics
Constructing Intervals
| Metric | Interval Type | When to Use |
|---|---|---|
| ASR | Wilson score interval | Preferred for proportions, especially near 0% or 100% |
| Mean severity score | t-interval | When severity is on a continuous scale |
| Median response time | Bootstrap interval | For non-normal distributions |
| Category-specific ASR | Wilson with Bonferroni correction | When reporting multiple categories simultaneously |
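For the bootstrap interval row above, a minimal percentile-bootstrap sketch for a median follows; the sample data and function name are illustrative, and 2,000 resamples is an assumed default:

```python
import random
import statistics

def bootstrap_median_ci(samples, n_boot: int = 2000,
                        confidence: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for the median of a non-normal sample."""
    rng = random.Random(seed)
    # Resample with replacement and record each resample's median
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    alpha = 1 - confidence
    lo = medians[int(alpha / 2 * n_boot)]
    hi = medians[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi

# Example: skewed response times (seconds)
response_times = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.5,
                  1.2, 0.9, 1.1, 3.8, 1.0, 0.8, 1.4]
lower, upper = bootstrap_median_ci(response_times)
```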
Wilson Score Interval (Recommended for ASR)
import numpy as np
from scipy import stats

def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> tuple:
    """Calculate the Wilson score confidence interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p_hat = successes / trials
    denominator = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denominator
    margin = (z / denominator) * np.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return (max(0, center - margin), min(1, center + margin))

# Example: 5 successes out of 200 trials
lower, upper = wilson_interval(5, 200)
# ASR: 2.5%, 95% CI: [1.1%, 5.7%]
Hypothesis Testing for Safety Claims
Common Safety Claims and Their Tests
| Claim | Null Hypothesis | Test | Required Evidence |
|---|---|---|---|
| "Model A is safer than Model B" | ASR_A >= ASR_B | Two-proportion z-test | Significant difference at p < 0.05 |
| "The update reduced ASR" | ASR_after >= ASR_before | Paired proportion test | Significant reduction at p < 0.05 |
| "ASR is below 5%" | ASR >= 0.05 | One-proportion z-test | Reject at p < 0.05 with observed ASR < 5% |
| "The model is equally safe across languages" | ASR varies by language | Chi-squared test | Non-significant result (p > 0.05) |
Comparing Two Models
# proportions_ztest lives in statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest

def compare_models(successes_a: int, trials_a: int,
                   successes_b: int, trials_b: int) -> dict:
    """Test whether two models have significantly different ASR."""
    stat, p_value = proportions_ztest(
        [successes_a, successes_b],
        [trials_a, trials_b],
        alternative='two-sided'
    )
    return {
        "asr_a": successes_a / trials_a,
        "asr_b": successes_b / trials_b,
        "z_statistic": stat,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }
Multiple Testing Correction
When testing multiple categories or making many simultaneous comparisons, the risk of false positives increases.
| Correction Method | When to Use | How It Works |
|---|---|---|
| Bonferroni | Few comparisons (< 10) | Divide significance level by number of tests |
| Holm-Bonferroni | Moderate comparisons | Step-down procedure, less conservative |
| Benjamini-Hochberg | Many comparisons | Controls false discovery rate instead of family-wise error |
If you report ASR for 10 OWASP categories and claim "significant improvement" in any category at p < 0.05, there is a 40% chance of at least one false positive without correction.
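The Holm-Bonferroni step-down procedure from the table can be sketched in a few lines; this is a hand-rolled illustration (libraries such as statsmodels provide equivalent routines):

```python
def holm_bonferroni(p_values, alpha: float = 0.05):
    """Holm-Bonferroni step-down correction.

    Compare the smallest p-value against alpha/m, the next against
    alpha/(m-1), and so on; stop at the first failure, since all
    larger p-values then fail as well. Returns a reject/keep flag
    per test, in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down stops at the first non-significant test
    return reject

# Four category p-values: only the smallest survives correction here
# holm_bonferroni([0.03, 0.001, 0.04, 0.02]) -> [False, True, False, False]
```
Note that plain Bonferroni would compare every p-value against alpha/m; Holm-Bonferroni rejects at least as many hypotheses while still controlling the family-wise error rate.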
Common Statistical Pitfalls
| Pitfall | Example | Correction |
|---|---|---|
| Insufficient sample size | "0/20 attacks succeeded, model is safe" | Report confidence interval: "0/20, 95% CI: [0%, 15%]" |
| Cherry-picking results | Reporting only the best evaluation run | Report all runs or pre-register the evaluation protocol |
| Ignoring non-determinism | Running each prompt once | Run each prompt multiple times at deployment temperature |
| p-hacking | Testing many prompts until one works, reporting as "found a vulnerability" | Distinguish exploratory (finding) from confirmatory (measuring) testing |
| Survivorship bias | Only reporting successful attacks | Report the full denominator: "5/200 attacks succeeded" |
| Base rate neglect | Interpreting a low ASR as "safe" without context | Compare to industry benchmarks and acceptable risk thresholds |
Reporting Statistical Results
Minimum Reporting Requirements
Every quantitative red team finding should include:
- The metric value (e.g., ASR = 4.2%)
- The confidence interval (e.g., 95% CI: [2.8%, 6.1%])
- The sample size (e.g., n = 500 prompts, 3 trials each)
- The conditions (e.g., temperature = 0.7, model version X)
- The evaluation date and dataset version
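The checklist above can be captured as a structured record so no field is omitted from a report; the field names and formatting are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One quantitative red team finding with full reporting context."""
    metric: str
    value: float
    ci_lower: float
    ci_upper: float
    n_prompts: int
    trials_per_prompt: int
    temperature: float
    model_version: str
    eval_date: str
    dataset_version: str

def format_finding(f: Finding) -> str:
    """Render a finding in the minimum-reporting format described above."""
    return (f"{f.metric} = {f.value:.1%} "
            f"(95% CI: [{f.ci_lower:.1%}, {f.ci_upper:.1%}]; "
            f"n = {f.n_prompts} prompts x {f.trials_per_prompt} trials; "
            f"temperature = {f.temperature}; {f.model_version}; "
            f"{f.eval_date}; dataset {f.dataset_version})")

# Example from the bullet list above:
finding = Finding("ASR", 0.042, 0.028, 0.061, 500, 3,
                  0.7, "model-X", "2024-06-01", "v1.2")
```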
Related Topics
- AI Safety Benchmarks & Evaluation -- benchmark landscape and evaluation methodology
- Building Evaluation Harnesses -- infrastructure for running evaluations
- Red Team Metrics Beyond ASR -- comprehensive metrics framework
- Visualizing Red Team Results -- presenting statistical results visually
References
- "Statistical Methods for AI Safety Evaluation" - Anthropic (2024) - Methodology for confidence intervals and hypothesis testing in AI red team evaluations
- "The Rule of Three in Clinical Trials" - Hanley & Lippman-Hand, Journal of the American Medical Association (1983) - Original derivation of the rule of three for zero-event estimation
- "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming" - Mazeika et al. (2024) - Statistical methodology for standardized AI safety benchmarking
- "Sample Size Determination for Proportions" - Agresti & Coull, The American Statistician (1998) - Wilson score interval methodology recommended for proportion-based metrics
Review question: a red team runs 50 test prompts against a model and observes zero successful attacks. What can they claim with 95% confidence? By the rule of three, only that the true ASR is below approximately 3/50 = 6%; they cannot claim the model is safe.