Safety Regression Testing

Intermediate12 min readUpdated 2026-03-15

Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.

regression-testing safety-benchmarks evaluation metrics before-after safety-measurement fine-tuning-security

安全 regression 測試 answers a specific question: has 微調 made this model less safe? It does so through quantitative comparison of the base model and the fine-tuned model on standardized 安全 metrics. 這是 the most concrete and actionable component of the 安全評估 framework, producing numerical results that can inform deployment decisions.

The methodology is borrowed from software 測試: just as a regression 測試 suite verifies that new code has not broken existing functionality, 安全 regression 測試 verifies that 微調 has not broken existing 安全 properties. The key difference is that 安全 behaviors are probabilistic rather than deterministic -- a model may refuse a harmful request 95% of the time rather than 100% of the time, making statistical reasoning essential.

Building the 測試 Suite

Prompt Categories

A comprehensive 安全測試 suite should cover multiple harm categories:

Category	範例 Prompts	Why It Matters
Violence and weapons	Requests for instructions on creating weapons or causing physical harm	Core 安全 category across all providers
Illegal activities	Requests for assistance with illegal activities	Legal liability for providers and deployers
Hate speech and discrimination	Prompts designed to elicit discriminatory content	Bias and fairness concerns
Self-harm	Requests related to self-harm or suicide	User 安全, particularly vulnerable populations
Privacy	Requests to generate personal information or assist with doxxing	Privacy protection obligations
Deception	Requests to generate misinformation, phishing content, or scams	Trust and societal impact
Sexual content involving minors	Any content 在本 category	Legal requirements in virtually all jurisdictions
Cybersecurity	Requests for 利用 code, malware, or hacking assistance	Dual-use concern -- legitimate 安全 research vs. malicious use
Domain-specific	Harms specific to the 微調 domain (medical misinformation, legal malpractice, etc.)	微調 may specifically degrade 安全 in the fine-tuned domain

Prompt Design Principles

Principle	實作	Rationale
Multiple difficulty levels	Include direct requests, subtle framing, and 對抗性越獄 attempts 對每個 category	Different attack sophistication levels 測試 different 安全 mechanisms
Both positive and negative cases	Include benign prompts that should NOT be refused alongside harmful prompts that should be refused	Measure false positive rate alongside true positive rate
Consistent phrasing	Use the same prompts for base and fine-tuned model comparison	Eliminates prompt variation as a confounding factor
Representative distribution	Match the expected distribution of harmful requests in deployment	測試 the scenarios most likely to occur in practice
Version control	Track 測試 suite changes over time	Enables comparison across 微調 iterations

測試 Suite Sizing

Purpose	Minimum Prompts	Recommended Prompts	Statistical Power
Quick sanity check	50-100	100-200	Low -- can detect >20% changes
Standard 評估	200-500	500-1000	Medium -- can detect 5-10% changes
Comprehensive 評估	500-1000	1000-2000	High -- can detect 3-5% changes
Research-grade 評估	1000+	2000-5000	Very high -- can detect <3% changes

Metrics

Primary Metrics

Metric	Definition	Interpretation
Refusal Rate (RR)	Proportion of harmful prompts where 模型 refuses to comply	Higher is safer; compare base vs. fine-tuned
Refusal Rate Delta (ΔRR)	Change in refusal rate from base to fine-tuned model	Negative delta indicates 安全 regression
False Refusal Rate (FRR)	Proportion of benign prompts where 模型 incorrectly refuses	Lower is better; high FRR indicates over-refusal
Toxicity Score	Average toxicity of model outputs as measured by a toxicity classifier	Lower is safer
Compliance Depth	When 模型 does comply with a harmful request, how detailed and actionable is the response	Higher depth on harmful requests indicates worse 安全

Per-Category Metrics

Breaking metrics down by harm category reveals selective degradation:

Category	Base RR	Fine-Tuned RR	Delta	Status
Violence	97%	92%	-5%	Within tolerance
Cybersecurity	85%	60%	-25%	REGRESSION
Deception	90%	88%	-2%	Within tolerance
Privacy	93%	75%	-18%	REGRESSION

This pattern -- where specific categories degrade while others remain stable -- is characteristic of domain-specific 微調. A model fine-tuned on code generation may selectively degrade on cybersecurity-related 安全 while maintaining 安全 in other categories.

Composite 安全 Score

A single composite score enables threshold-based deployment decisions:

安全 Score = Σ(w_i × RR_i) / Σ(w_i)

Where w_i is the weight assigned to each harm category and RR_i is the refusal rate for that category. Weights should reflect the severity of harm -- categories like CSAM should have maximum weight regardless of deployment context.

Statistical Methodology

Why Statistics Matter

Model outputs are stochastic. A model that refuses a prompt 95% of the time will sometimes comply even without any 安全 degradation. Statistical methods distinguish genuine regression from random variation.

Confidence Intervals

對每個 refusal rate measurement, compute a confidence interval:

Sample Size	95% Confidence Interval Width (at 90% refusal rate)	Interpretation
50	±8.3%	Very wide -- cannot distinguish 82% from 98%
100	±5.9%	Wide -- can detect large changes
500	±2.6%	Moderate -- can detect meaningful changes
1000	±1.9%	Narrow -- can detect small changes
2000	±1.3%	Very narrow -- can detect subtle changes

Hypothesis 測試

對每個 harm category, 測試 whether the observed refusal rate change is statistically significant:

Null hypothesis: The fine-tuned model's refusal rate is the same as the base model's
Alternative hypothesis: The fine-tuned model's refusal rate is lower (one-sided 測試 for 安全 regression)
測試: Two-proportion z-測試 or Fisher's exact 測試
Significance level: Use p < 0.01 for 安全-critical decisions (more conservative than the typical p < 0.05)

Automated Classification

Classifying Refusals

Automated regression 測試 requires automatically classifying model outputs as "refusal" or "compliance":

Method	Accuracy	Speed	Limitations
Keyword matching	Low (60-70%)	Very fast	Misses soft refusals and partial compliance
Regular expression patterns	Medium (70-80%)	Fast	Brittle, requires maintenance
LLM-as-judge	High (85-95%)	Slow (推論 cost)	Potential bias, cost
Fine-tuned classifier	High (90-95%)	Fast after 訓練	Requires labeled 訓練資料
Combined approach	Very high (95%+)	Medium	More complex to 實作

LLM-as-Judge for 安全

Using a separate LLM to 評估 whether a response constitutes a refusal or compliance:

Design Choice	Recommendation	Rationale
Judge model	Use a different model family than the one being evaluated	Reduces shared blind spots
Prompt structure	Provide clear criteria for refusal vs. compliance	Reduces judge variance
Calibration	Validate judge accuracy against human labels	Ensures judge reliability
Multiple judges	Use 2-3 judge models and take majority vote	Reduces individual judge bias

Integration into Fine-Tuning Workflows

CI/CD Integration

安全 regression 測試 should be automated as part of the 微調 pipeline:

Training Data → Fine-Tuning Job → 安全 Regression Tests → Deployment Gate → Deployment
                                         ↓
                                  FAIL: Block deployment
                                  and alert team

Automation Components

Component	實作	Notes
測試 suite storage	Version-controlled prompt sets with expected outcomes	Update 測試 suites as new attack patterns emerge
測試 runner	Script that runs the 測試 suite against both base and fine-tuned models	Should support parallel execution for speed
Classifier	Automated 輸出 classification (refusal/compliance)	Must be calibrated and maintained
Reporting	Generates a 安全 regression report with metrics, comparisons, and pass/fail status	Should be human-readable and machine-parseable
Alerting	Notifies responsible parties when 安全 regression is detected	Integration with existing 監控 and alerting infrastructure

Report Template

A 安全 regression report should include:

總結: Overall pass/fail status with composite 安全 score
Per-category breakdown: Refusal rates, deltas, and statistical significance 對每個 harm category
Flagged items: Specific prompts where the fine-tuned model's behavior changed significantly
False refusal analysis: Changes in false refusal rate
Recommendation: Deploy, deploy with 監控, or block deployment

Advanced Regression 測試

對抗性 Regression 測試

Beyond standard 安全 prompts, include 對抗性測試 cases:

對抗性測試 Type	Purpose
Known 越獄 prompts	測試 robustness to known attack patterns
Multi-turn escalation	測試 whether 安全 degrades over multi-turn conversations
Context manipulation	測試 whether 安全 changes with different system prompts
Cross-lingual probes	測試 whether 安全 holds across languages
Encoding attacks	測試 with base64, rot13, and other encoding-based jailbreaks

Capability Regression 測試

此外 to 安全, 測試 for unwanted capability changes:

Capability 測試	What It Detects
Knowledge accuracy benchmarks	Knowledge degradation from 微調
Reasoning benchmarks	Reasoning capability changes
Instruction following	Changes in instruction-following quality
Task-specific performance	Whether the 微調 achieved its intended task improvement

參考文獻

"HarmBench: A Standardized 評估 Framework for Automated 紅隊演練 and Robust Refusal" - Mazeika, M., et al. (2024) - Standardized 安全評估 framework
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" - Zheng, L., et al. (2023) - LLM-as-judge methodology applicable to 安全評估
"微調 Aligned Language Models Compromises 安全, Even When Users Do Not Intend To!" - Qi, X., et al. (2023) - Quantitative 安全 measurement methodology
"DecodingTrust: A Comprehensive 評估 of Trustworthiness in GPT Models" - Wang, B., et al. (2023) - Multi-dimensional trustworthiness 評估

Knowledge Check

Why should 安全 regression 測試 use a significance level of p < 0.01 rather than the typical p < 0.05, and what is the practical implication?

Safety Regression Testing

Intermediate12 min readUpdated 2026-03-15

regression-testing safety-benchmarks evaluation metrics before-after safety-measurement fine-tuning-security

Building the 測試 Suite

Prompt Categories

A comprehensive 安全測試 suite should cover multiple harm categories:

Category	範例 Prompts	Why It Matters
Violence and weapons	Requests for instructions on creating weapons or causing physical harm	Core 安全 category across all providers
Illegal activities	Requests for assistance with illegal activities	Legal liability for providers and deployers
Hate speech and discrimination	Prompts designed to elicit discriminatory content	Bias and fairness concerns
Self-harm	Requests related to self-harm or suicide	User 安全, particularly vulnerable populations
Privacy	Requests to generate personal information or assist with doxxing	Privacy protection obligations
Deception	Requests to generate misinformation, phishing content, or scams	Trust and societal impact
Sexual content involving minors	Any content 在本 category	Legal requirements in virtually all jurisdictions
Cybersecurity	Requests for 利用 code, malware, or hacking assistance	Dual-use concern -- legitimate 安全 research vs. malicious use
Domain-specific	Harms specific to the 微調 domain (medical misinformation, legal malpractice, etc.)	微調 may specifically degrade 安全 in the fine-tuned domain

Prompt Design Principles

Principle	實作	Rationale
Multiple difficulty levels	Include direct requests, subtle framing, and 對抗性越獄 attempts 對每個 category	Different attack sophistication levels 測試 different 安全 mechanisms
Both positive and negative cases	Include benign prompts that should NOT be refused alongside harmful prompts that should be refused	Measure false positive rate alongside true positive rate
Consistent phrasing	Use the same prompts for base and fine-tuned model comparison	Eliminates prompt variation as a confounding factor
Representative distribution	Match the expected distribution of harmful requests in deployment	測試 the scenarios most likely to occur in practice
Version control	Track 測試 suite changes over time	Enables comparison across 微調 iterations

測試 Suite Sizing

Purpose	Minimum Prompts	Recommended Prompts	Statistical Power
Quick sanity check	50-100	100-200	Low -- can detect >20% changes
Standard 評估	200-500	500-1000	Medium -- can detect 5-10% changes
Comprehensive 評估	500-1000	1000-2000	High -- can detect 3-5% changes
Research-grade 評估	1000+	2000-5000	Very high -- can detect <3% changes

Metrics

Primary Metrics

Metric	Definition	Interpretation
Refusal Rate (RR)	Proportion of harmful prompts where 模型 refuses to comply	Higher is safer; compare base vs. fine-tuned
Refusal Rate Delta (ΔRR)	Change in refusal rate from base to fine-tuned model	Negative delta indicates 安全 regression
False Refusal Rate (FRR)	Proportion of benign prompts where 模型 incorrectly refuses	Lower is better; high FRR indicates over-refusal
Toxicity Score	Average toxicity of model outputs as measured by a toxicity classifier	Lower is safer
Compliance Depth	When 模型 does comply with a harmful request, how detailed and actionable is the response	Higher depth on harmful requests indicates worse 安全

Per-Category Metrics

Breaking metrics down by harm category reveals selective degradation:

Category	Base RR	Fine-Tuned RR	Delta	Status
Violence	97%	92%	-5%	Within tolerance
Cybersecurity	85%	60%	-25%	REGRESSION
Deception	90%	88%	-2%	Within tolerance
Privacy	93%	75%	-18%	REGRESSION

Composite 安全 Score

A single composite score enables threshold-based deployment decisions:

安全 Score = Σ(w_i × RR_i) / Σ(w_i)

Statistical Methodology

Why Statistics Matter

Confidence Intervals

對每個 refusal rate measurement, compute a confidence interval:

Sample Size	95% Confidence Interval Width (at 90% refusal rate)	Interpretation
50	±8.3%	Very wide -- cannot distinguish 82% from 98%
100	±5.9%	Wide -- can detect large changes
500	±2.6%	Moderate -- can detect meaningful changes
1000	±1.9%	Narrow -- can detect small changes
2000	±1.3%	Very narrow -- can detect subtle changes

Hypothesis 測試

對每個 harm category, 測試 whether the observed refusal rate change is statistically significant:

Null hypothesis: The fine-tuned model's refusal rate is the same as the base model's
Alternative hypothesis: The fine-tuned model's refusal rate is lower (one-sided 測試 for 安全 regression)
測試: Two-proportion z-測試 or Fisher's exact 測試
Significance level: Use p < 0.01 for 安全-critical decisions (more conservative than the typical p < 0.05)

Automated Classification

Classifying Refusals

Automated regression 測試 requires automatically classifying model outputs as "refusal" or "compliance":

Method	Accuracy	Speed	Limitations
Keyword matching	Low (60-70%)	Very fast	Misses soft refusals and partial compliance
Regular expression patterns	Medium (70-80%)	Fast	Brittle, requires maintenance
LLM-as-judge	High (85-95%)	Slow (推論 cost)	Potential bias, cost
Fine-tuned classifier	High (90-95%)	Fast after 訓練	Requires labeled 訓練資料
Combined approach	Very high (95%+)	Medium	More complex to 實作

LLM-as-Judge for 安全

Using a separate LLM to 評估 whether a response constitutes a refusal or compliance:

Design Choice	Recommendation	Rationale
Judge model	Use a different model family than the one being evaluated	Reduces shared blind spots
Prompt structure	Provide clear criteria for refusal vs. compliance	Reduces judge variance
Calibration	Validate judge accuracy against human labels	Ensures judge reliability
Multiple judges	Use 2-3 judge models and take majority vote	Reduces individual judge bias

Integration into Fine-Tuning Workflows

CI/CD Integration

安全 regression 測試 should be automated as part of the 微調 pipeline:

Training Data → Fine-Tuning Job → 安全 Regression Tests → Deployment Gate → Deployment
                                         ↓
                                  FAIL: Block deployment
                                  and alert team

Automation Components

Component	實作	Notes
測試 suite storage	Version-controlled prompt sets with expected outcomes	Update 測試 suites as new attack patterns emerge
測試 runner	Script that runs the 測試 suite against both base and fine-tuned models	Should support parallel execution for speed
Classifier	Automated 輸出 classification (refusal/compliance)	Must be calibrated and maintained
Reporting	Generates a 安全 regression report with metrics, comparisons, and pass/fail status	Should be human-readable and machine-parseable
Alerting	Notifies responsible parties when 安全 regression is detected	Integration with existing 監控 and alerting infrastructure

Report Template

A 安全 regression report should include:

總結: Overall pass/fail status with composite 安全 score
Per-category breakdown: Refusal rates, deltas, and statistical significance 對每個 harm category
Flagged items: Specific prompts where the fine-tuned model's behavior changed significantly
False refusal analysis: Changes in false refusal rate
Recommendation: Deploy, deploy with 監控, or block deployment

Advanced Regression 測試

對抗性 Regression 測試

Beyond standard 安全 prompts, include 對抗性測試 cases:

對抗性測試 Type	Purpose
Known 越獄 prompts	測試 robustness to known attack patterns
Multi-turn escalation	測試 whether 安全 degrades over multi-turn conversations
Context manipulation	測試 whether 安全 changes with different system prompts
Cross-lingual probes	測試 whether 安全 holds across languages
Encoding attacks	測試 with base64, rot13, and other encoding-based jailbreaks

Capability Regression 測試

此外 to 安全, 測試 for unwanted capability changes:

Capability 測試	What It Detects
Knowledge accuracy benchmarks	Knowledge degradation from 微調
Reasoning benchmarks	Reasoning capability changes
Instruction following	Changes in instruction-following quality
Task-specific performance	Whether the 微調 achieved its intended task improvement

參考文獻

"HarmBench: A Standardized 評估 Framework for Automated 紅隊演練 and Robust Refusal" - Mazeika, M., et al. (2024) - Standardized 安全評估 framework
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" - Zheng, L., et al. (2023) - LLM-as-judge methodology applicable to 安全評估
"微調 Aligned Language Models Compromises 安全, Even When Users Do Not Intend To!" - Qi, X., et al. (2023) - Quantitative 安全 measurement methodology
"DecodingTrust: A Comprehensive 評估 of Trustworthiness in GPT Models" - Wang, B., et al. (2023) - Multi-dimensional trustworthiness 評估

Knowledge Check

Why should 安全 regression 測試 use a significance level of p < 0.01 rather than the typical p < 0.05, and what is the practical implication?

Safety Regression Testing

Related articles

Safety Regression Testing

Related articles