Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
安全 regression 測試 answers a specific question: has 微調 made this model less safe? It does so through quantitative comparison of the base model and the fine-tuned model on standardized 安全 metrics. 這是 the most concrete and actionable component of the 安全 評估 framework, producing numerical results that can inform deployment decisions.
The methodology is borrowed from software 測試: just as a regression 測試 suite verifies that new code has not broken existing functionality, 安全 regression 測試 verifies that 微調 has not broken existing 安全 properties. The key difference is that 安全 behaviors are probabilistic rather than deterministic -- a model may refuse a harmful request 95% of the time rather than 100% of the time, making statistical reasoning essential.
Building the 測試 Suite
Prompt Categories
A comprehensive 安全 測試 suite should cover multiple harm categories:
| Category | 範例 Prompts | Why It Matters |
|---|---|---|
| Violence and weapons | Requests for instructions on creating weapons or causing physical harm | Core 安全 category across all providers |
| Illegal activities | Requests for assistance with illegal activities | Legal liability for providers and deployers |
| Hate speech and discrimination | Prompts designed to elicit discriminatory content | Bias and fairness concerns |
| Self-harm | Requests related to self-harm or suicide | User 安全, particularly vulnerable populations |
| Privacy | Requests to generate personal information or assist with doxxing | Privacy protection obligations |
| Deception | Requests to generate misinformation, phishing content, or scams | Trust and societal impact |
| Sexual content involving minors | Any content 在本 category | Legal requirements in virtually all jurisdictions |
| Cybersecurity | Requests for 利用 code, malware, or hacking assistance | Dual-use concern -- legitimate 安全 research vs. malicious use |
| Domain-specific | Harms specific to the 微調 domain (medical misinformation, legal malpractice, etc.) | 微調 may specifically degrade 安全 in the fine-tuned domain |
Prompt Design Principles
| Principle | 實作 | Rationale |
|---|---|---|
| Multiple difficulty levels | Include direct requests, subtle framing, and 對抗性 越獄 attempts 對每個 category | Different attack sophistication levels 測試 different 安全 mechanisms |
| Both positive and negative cases | Include benign prompts that should NOT be refused alongside harmful prompts that should be refused | Measure false positive rate alongside true positive rate |
| Consistent phrasing | Use the same prompts for base and fine-tuned model comparison | Eliminates prompt variation as a confounding factor |
| Representative distribution | Match the expected distribution of harmful requests in deployment | 測試 the scenarios most likely to occur in practice |
| Version control | Track 測試 suite changes over time | Enables comparison across 微調 iterations |
測試 Suite Sizing
| Purpose | Minimum Prompts | Recommended Prompts | Statistical Power |
|---|---|---|---|
| Quick sanity check | 50-100 | 100-200 | Low -- can detect >20% changes |
| Standard 評估 | 200-500 | 500-1000 | Medium -- can detect 5-10% changes |
| Comprehensive 評估 | 500-1000 | 1000-2000 | High -- can detect 3-5% changes |
| Research-grade 評估 | 1000+ | 2000-5000 | Very high -- can detect <3% changes |
Metrics
Primary Metrics
| Metric | Definition | Interpretation |
|---|---|---|
| Refusal Rate (RR) | Proportion of harmful prompts where 模型 refuses to comply | Higher is safer; compare base vs. fine-tuned |
| Refusal Rate Delta (ΔRR) | Change in refusal rate from base to fine-tuned model | Negative delta indicates 安全 regression |
| False Refusal Rate (FRR) | Proportion of benign prompts where 模型 incorrectly refuses | Lower is better; high FRR indicates over-refusal |
| Toxicity Score | Average toxicity of model outputs as measured by a toxicity classifier | Lower is safer |
| Compliance Depth | When 模型 does comply with a harmful request, how detailed and actionable is the response | Higher depth on harmful requests indicates worse 安全 |
Per-Category Metrics
Breaking metrics down by harm category reveals selective degradation:
| Category | Base RR | Fine-Tuned RR | Delta | Status |
|---|---|---|---|---|
| Violence | 97% | 92% | -5% | Within tolerance |
| Cybersecurity | 85% | 60% | -25% | REGRESSION |
| Deception | 90% | 88% | -2% | Within tolerance |
| Privacy | 93% | 75% | -18% | REGRESSION |
This pattern -- where specific categories degrade while others remain stable -- is characteristic of domain-specific 微調. A model fine-tuned on code generation may selectively degrade on cybersecurity-related 安全 while maintaining 安全 in other categories.
Composite 安全 Score
A single composite score enables threshold-based deployment decisions:
安全 Score = Σ(w_i × RR_i) / Σ(w_i)
Where w_i is the weight assigned to each harm category and RR_i is the refusal rate for that category. Weights should reflect the severity of harm -- categories like CSAM should have maximum weight regardless of deployment context.
Statistical Methodology
Why Statistics Matter
Model outputs are stochastic. A model that refuses a prompt 95% of the time will sometimes comply even without any 安全 degradation. Statistical methods distinguish genuine regression from random variation.
Confidence Intervals
對每個 refusal rate measurement, compute a confidence interval:
| Sample Size | 95% Confidence Interval Width (at 90% refusal rate) | Interpretation |
|---|---|---|
| 50 | ±8.3% | Very wide -- cannot distinguish 82% from 98% |
| 100 | ±5.9% | Wide -- can detect large changes |
| 500 | ±2.6% | Moderate -- can detect meaningful changes |
| 1000 | ±1.9% | Narrow -- can detect small changes |
| 2000 | ±1.3% | Very narrow -- can detect subtle changes |
Hypothesis 測試
對每個 harm category, 測試 whether the observed refusal rate change is statistically significant:
- Null hypothesis: The fine-tuned model's refusal rate is the same as the base model's
- Alternative hypothesis: The fine-tuned model's refusal rate is lower (one-sided 測試 for 安全 regression)
- 測試: Two-proportion z-測試 or Fisher's exact 測試
- Significance level: Use p < 0.01 for 安全-critical decisions (more conservative than the typical p < 0.05)
Automated Classification
Classifying Refusals
Automated regression 測試 requires automatically classifying model outputs as "refusal" or "compliance":
| Method | Accuracy | Speed | Limitations |
|---|---|---|---|
| Keyword matching | Low (60-70%) | Very fast | Misses soft refusals and partial compliance |
| Regular expression patterns | Medium (70-80%) | Fast | Brittle, requires maintenance |
| LLM-as-judge | High (85-95%) | Slow (推論 cost) | Potential bias, cost |
| Fine-tuned classifier | High (90-95%) | Fast after 訓練 | Requires labeled 訓練資料 |
| Combined approach | Very high (95%+) | Medium | More complex to 實作 |
LLM-as-Judge for 安全
Using a separate LLM to 評估 whether a response constitutes a refusal or compliance:
| Design Choice | Recommendation | Rationale |
|---|---|---|
| Judge model | Use a different model family than the one being evaluated | Reduces shared blind spots |
| Prompt structure | Provide clear criteria for refusal vs. compliance | Reduces judge variance |
| Calibration | Validate judge accuracy against human labels | Ensures judge reliability |
| Multiple judges | Use 2-3 judge models and take majority vote | Reduces individual judge bias |
Integration into Fine-Tuning Workflows
CI/CD Integration
安全 regression 測試 should be automated as part of the 微調 pipeline:
Training Data → Fine-Tuning Job → 安全 Regression Tests → Deployment Gate → Deployment
↓
FAIL: Block deployment
and alert team
Automation Components
| Component | 實作 | Notes |
|---|---|---|
| 測試 suite storage | Version-controlled prompt sets with expected outcomes | Update 測試 suites as new attack patterns emerge |
| 測試 runner | Script that runs the 測試 suite against both base and fine-tuned models | Should support parallel execution for speed |
| Classifier | Automated 輸出 classification (refusal/compliance) | Must be calibrated and maintained |
| Reporting | Generates a 安全 regression report with metrics, comparisons, and pass/fail status | Should be human-readable and machine-parseable |
| Alerting | Notifies responsible parties when 安全 regression is detected | Integration with existing 監控 and alerting infrastructure |
Report Template
A 安全 regression report should include:
- 總結: Overall pass/fail status with composite 安全 score
- Per-category breakdown: Refusal rates, deltas, and statistical significance 對每個 harm category
- Flagged items: Specific prompts where the fine-tuned model's behavior changed significantly
- False refusal analysis: Changes in false refusal rate
- Recommendation: Deploy, deploy with 監控, or block deployment
Advanced Regression 測試
對抗性 Regression 測試
Beyond standard 安全 prompts, include 對抗性 測試 cases:
| 對抗性 測試 Type | Purpose |
|---|---|
| Known 越獄 prompts | 測試 robustness to known attack patterns |
| Multi-turn escalation | 測試 whether 安全 degrades over multi-turn conversations |
| Context manipulation | 測試 whether 安全 changes with different system prompts |
| Cross-lingual probes | 測試 whether 安全 holds across languages |
| Encoding attacks | 測試 with base64, rot13, and other encoding-based jailbreaks |
Capability Regression 測試
此外 to 安全, 測試 for unwanted capability changes:
| Capability 測試 | What It Detects |
|---|---|
| Knowledge accuracy benchmarks | Knowledge degradation from 微調 |
| Reasoning benchmarks | Reasoning capability changes |
| Instruction following | Changes in instruction-following quality |
| Task-specific performance | Whether the 微調 achieved its intended task improvement |
Further Reading
- Continuous 監控 -- Post-deployment 安全 監控
- 安全 評估 Framework -- Overall 評估 framework
- 安全 Degradation -- 理解 what regression 測試 detects
相關主題
- 理解 AI 防禦 - 防禦 landscape context
- 護欄 & 安全 Layer Architecture - Where regression 測試 fits in the 安全 stack
- API Fine-Tuning 安全 - Provider-side regression 測試
參考文獻
- "HarmBench: A Standardized 評估 Framework for Automated 紅隊演練 and Robust Refusal" - Mazeika, M., et al. (2024) - Standardized 安全 評估 framework
- "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" - Zheng, L., et al. (2023) - LLM-as-judge methodology applicable to 安全 評估
- "微調 Aligned Language Models Compromises 安全, Even When Users Do Not Intend To!" - Qi, X., et al. (2023) - Quantitative 安全 measurement methodology
- "DecodingTrust: A Comprehensive 評估 of Trustworthiness in GPT Models" - Wang, B., et al. (2023) - Multi-dimensional trustworthiness 評估
Why should 安全 regression 測試 use a significance level of p < 0.01 rather than the typical p < 0.05, and what is the practical implication?