Red Team Automation Strategy
When and how to automate AI red teaming: tool selection, CI/CD integration, continuous automated red teaming (CART), human-in-the-loop design, and scaling assessment coverage through automation.
Automation does not replace red teamers. It replaces the repetitive parts of red teaming (running known attack patterns, regression testing, coverage checking) so that human analysts can focus on what they do uniquely well: creative exploration, novel attack development, contextual judgment, and stakeholder communication. A well-designed automation strategy multiplies team capacity without sacrificing assessment depth.
What to Automate (and What Not To)
Automate
| Task | Why Automate | Example Tools |
|---|---|---|
| Known jailbreak regression testing | Known techniques should be tested against every model update | Garak, PyRIT, custom scripts |
| Safety boundary scanning | Systematic testing of content categories is tedious but important | Promptfoo, custom harnesses |
| Input validation testing | Structured inputs (encodings, languages, formats) are enumerable | Custom test suites |
| API security checks | Authentication, rate limiting, error handling are standard tests | Standard API security tools |
| Output classification | Detecting harmful content in model outputs at scale | Safety classifiers, LLM-as-judge |
| Regression testing | Ensuring previously fixed vulnerabilities stay fixed | CI/CD test suites |
Do Not Automate
| Task | Why Keep Manual | Human Advantage |
|---|---|---|
| Novel attack development | Creativity cannot be automated | Adversarial thinking, intuition |
| Contextual risk assessment | Impact depends on business context | Domain knowledge, judgment |
| Stakeholder communication | Requires tailored messaging | Empathy, persuasion |
| Architecture review | Requires understanding system design decisions | Systems thinking |
| Ethical judgment | Nuanced decisions about responsible disclosure | Experience, values |
| Complex multi-step attacks | Require adaptive reasoning mid-attack | Real-time decision making |
Automation Architecture
Layered Automation
┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Human Analysis                                      │
│ Novel attacks, contextual assessment, reporting              │
│ (Triggered by findings from Layers 1-2)                      │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Continuous Automated Red Teaming (CART)             │
│ Scheduled and triggered scans against production models      │
│ (Runs continuously or on model updates)                      │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: CI/CD Integration                                   │
│ Automated checks in the deployment pipeline                  │
│ (Runs on every deployment)                                   │
└──────────────────────────────────────────────────────────────┘
Layer 1: CI/CD Integration
Embed AI safety checks in the deployment pipeline so that regressions and basic vulnerabilities are caught before reaching production.
# Example: GitHub Actions workflow for AI safety checks
name: AI Safety Checks
on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
      - 'configs/safety/**'
  push:
    branches: [main]
jobs:
  safety-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run safety regression suite
        run: |
          python -m pytest tests/safety/ \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --threshold 0.95 \
            --report-output safety-report.json
      - name: Run jailbreak regression
        run: |
          # Illustrative garak invocation; consult the garak docs for the
          # exact generator/endpoint options for your deployment.
          garak --model_type rest \
            --probes known_jailbreaks \
            --report_prefix ci-jailbreak
      - name: Check results
        run: |
          python scripts/check_safety_results.py \
            --safety-report safety-report.json \
            --jailbreak-report ci-jailbreak.json \
            --fail-on-regression
What to test in CI/CD:
- Known jailbreak regressions (did a model update re-introduce a fixed vulnerability?)
- Safety boundary checks (does the model still refuse baseline harmful requests?)
- System prompt leakage (does the model reveal its system prompt?)
- Basic injection patterns (does the model follow instructions in injected context?)
What not to test in CI/CD:
- Full assessment suites (too slow for deployment pipelines)
- Novel attack exploration (not deterministic enough for pass/fail)
- Complex multi-step attacks (require too much infrastructure)
Layer 2: Continuous Automated Red Teaming (CART)
CART runs continuously against production models, testing a broader range of attacks than CI/CD and adapting to new threats.
from datetime import datetime

import requests  # third-party; used for webhook alerts


class CARTPipeline:
    """Continuous Automated Red Teaming pipeline."""

    def __init__(self, model_endpoints: dict, alert_webhook: str):
        self.endpoints = model_endpoints
        self.alert_webhook = alert_webhook
        self.attack_registry = AttackRegistry()
        self.results_store = ResultsStore()

    def run_cycle(self):
        """Run one CART cycle across all monitored endpoints."""
        for endpoint_name, endpoint_config in self.endpoints.items():
            results = self._test_endpoint(endpoint_name, endpoint_config)
            # Compare against the previous cycle BEFORE saving, otherwise
            # get_latest() would return the results we just stored.
            new_findings = self._identify_new_findings(endpoint_name, results)
            self.results_store.save(endpoint_name, results)
            # Alert on new findings
            if new_findings:
                self._send_alert(endpoint_name, new_findings)

    def _test_endpoint(self, name: str, config: dict) -> list:
        """Test a single endpoint with all applicable attack categories."""
        results = []
        for attack in self.attack_registry.get_attacks(config["model_type"]):
            try:
                outcome = attack.execute(config["url"], config["auth"])
                results.append({
                    "attack": attack.name,
                    "category": attack.category,
                    "success": outcome.succeeded,
                    "response": outcome.response[:500],
                    "timestamp": datetime.now().isoformat()
                })
            except Exception as e:
                results.append({
                    "attack": attack.name,
                    "error": str(e),
                    "timestamp": datetime.now().isoformat()
                })
        return results

    def _identify_new_findings(self, endpoint: str, results: list) -> list:
        """Compare results with the previous cycle to find new vulnerabilities."""
        previous = self.results_store.get_latest(endpoint)
        if previous is None:
            return [r for r in results if r.get("success")]
        prev_successes = {r["attack"] for r in previous if r.get("success")}
        return [
            r for r in results
            if r.get("success") and r["attack"] not in prev_successes
        ]

    def _send_alert(self, endpoint: str, findings: list):
        """Send an alert for new findings."""
        # Error records carry no "category" key, so use .get() here.
        requests.post(self.alert_webhook, json={
            "endpoint": endpoint,
            "new_findings": len(findings),
            "details": findings,
            "severity": "high" if any(f.get("category") == "safety_bypass" for f in findings) else "medium"
        })
CART scheduling:
| Trigger | Frequency | Scope |
|---|---|---|
| Model update | Every deployment | Full regression + expanded suite |
| New attack technique published | Within 48 hours | Targeted testing of the new technique |
| Scheduled cycle | Weekly | Full attack suite rotation |
| Incident-triggered | As needed | Deep assessment of the affected system |
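A minimal driver for the weekly scheduled cycle might look like the following; `CARTPipeline` is the class sketched above, and the injectable `sleep` parameter exists only so the loop can be exercised in tests:

```python
import time

WEEK_SECONDS = 7 * 24 * 60 * 60

def run_weekly(pipeline, cycles=None, sleep=time.sleep):
    """Drive pipeline.run_cycle() on a weekly cadence.

    cycles=None loops forever; a finite count (with a no-op sleep) is
    useful for testing. In production, prefer a real scheduler
    (cron, Airflow, etc.) over a sleep loop.
    """
    ran = 0
    while cycles is None or ran < cycles:
        pipeline.run_cycle()  # full attack suite rotation
        ran += 1
        sleep(WEEK_SECONDS)
```

The model-update and incident-triggered rows in the table above would bypass this loop entirely and invoke `run_cycle()` directly from the deployment pipeline or incident tooling.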
Layer 3: Human Analysis
Automation findings feed into human analysis. Analysts investigate automated alerts, develop new attacks based on patterns, and conduct deep assessments that automation cannot perform.
Automation → Alert → Triage (human) → Investigation (human) → Report (human)
                                              ↓
                New attack patterns from human research feed back
                          into the automation registry
Tool Selection Guide
Framework Comparison
| Tool | Best For | Strengths | Limitations |
|---|---|---|---|
| Garak | Broad LLM vulnerability scanning | Large probe library, extensible | Limited agentic testing |
| PyRIT | Microsoft ecosystem, structured attacks | Good orchestration, logging | Azure-centric |
| Promptfoo | Prompt regression testing, CI/CD | Easy to integrate, fast | Less depth than specialized tools |
| Custom scripts | Organization-specific attacks | Full control, tailored to your systems | Maintenance burden |
Selection Criteria
Choose tools based on these factors:
- Model access type: API-only versus weight-access determines which tools are applicable
- Integration requirements: CI/CD compatibility, alerting, reporting format
- Attack coverage: Which vulnerability categories does the tool cover?
- Extensibility: Can you add custom attacks?
- Maintenance burden: Who maintains the tool? Is it actively developed?
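One lightweight way to apply these criteria is a weighted scoring matrix. The weights and per-tool scores below are illustrative examples, not recommendations:

```python
# Illustrative weighted scoring for tool selection.
# Weights and per-tool scores (0-5) are made-up examples.
CRITERIA_WEIGHTS = {
    "model_access_fit": 0.25,   # API-only vs. weight-access applicability
    "integration": 0.20,        # CI/CD, alerting, reporting format
    "attack_coverage": 0.25,    # vulnerability categories covered
    "extensibility": 0.15,      # can you add custom attacks?
    "maintenance": 0.15,        # who maintains it; is it actively developed?
}

def score_tool(scores: dict) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(w * scores.get(c, 0) for c, w in CRITERIA_WEIGHTS.items())

# Example comparison (hypothetical scores):
garak_scores = {"model_access_fit": 5, "integration": 3, "attack_coverage": 4,
                "extensibility": 4, "maintenance": 4}
custom_scores = {"model_access_fit": 5, "integration": 5, "attack_coverage": 3,
                 "extensibility": 5, "maintenance": 1}
```

The point of the exercise is less the final number than forcing the team to state explicitly how much each criterion matters before committing to a tool.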
Human-in-the-Loop Design
The critical question for automation: where does the human intervene?
Triage Model
All automated findings pass through human triage before escalation.
class TriageWorkflow:
    """Human-in-the-loop triage for automated findings."""

    SEVERITY_THRESHOLDS = {
        "auto_escalate": 0.95,  # Very high confidence: auto-escalate with human notification
        "human_review": 0.70,   # Moderate confidence: require human review before escalation
        "auto_dismiss": 0.30,   # Low confidence: auto-dismiss with periodic audit
    }

    def triage(self, finding: dict) -> dict:
        """Route a finding based on confidence level."""
        confidence = finding.get("confidence", 0.5)
        if confidence >= self.SEVERITY_THRESHOLDS["auto_escalate"]:
            return {
                "action": "escalate",
                "human_required": False,
                "notification": True,
                "reason": "High-confidence automated finding"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["human_review"]:
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "normal",
                "reason": "Moderate confidence, requires human verification"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["auto_dismiss"]:
            # Between the dismiss and review thresholds: keep a human
            # in the loop, but at low priority.
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "low",
                "reason": "Low confidence, queued for low-priority review"
            }
        else:
            return {
                "action": "log_and_dismiss",
                "human_required": False,
                "audit_flag": True,
                "reason": "Low confidence, logged for periodic audit review"
            }
Feedback Loop
Human triage decisions feed back into automation to improve accuracy over time:
- False positives are labeled and used to tune detection thresholds
- True positives are added to the regression test suite
- Missed findings (discovered by humans but not automation) are added as new attack patterns
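These three feedback paths can be sketched as a single routing function; the tuner, suite, and registry objects are hypothetical stand-ins for whatever stores your automation actually uses:

```python
def apply_triage_feedback(finding, verdict, regression_suite,
                          threshold_tuner, pattern_registry):
    """Route a human triage verdict back into the automation layer.

    verdict: 'false_positive', 'true_positive', or 'missed'
    (missed = found by a human but not by automation).
    """
    if verdict == "false_positive":
        # Labeled false positives are used to tune detection thresholds.
        threshold_tuner.record_false_positive(finding)
    elif verdict == "true_positive":
        # Confirmed findings become permanent regression tests.
        regression_suite.append({"attack": finding["attack"], "expected": "blocked"})
    elif verdict == "missed":
        # Human-discovered attacks become new automated patterns.
        pattern_registry.register(finding)
    else:
        raise ValueError(f"unknown verdict: {verdict!r}")
```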
Measuring Automation Effectiveness
| Metric | Definition | Target |
|---|---|---|
| Detection rate | Percentage of known vulnerabilities found by automation | >80% for known attack categories |
| False positive rate | Percentage of automated alerts that are not valid findings | <20% (lower is better) |
| Time to detection | Time between vulnerability introduction and automated detection | <24 hours for regressions, <48 hours for new attacks |
| Human effort saved | Percentage reduction in manual testing hours | 40-60% (automation handles known patterns) |
| Coverage expansion | Number of systems tested with automation vs. manual-only | 3-5x more systems assessed |
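The first two metrics in the table are straightforward to compute from labeled triage outcomes; the record shapes here are assumptions:

```python
def detection_rate(known_vulns, detected):
    """Fraction of known vulnerabilities that automation found."""
    known_vulns, detected = set(known_vulns), set(detected)
    if not known_vulns:
        return 0.0
    return len(known_vulns & detected) / len(known_vulns)

def false_positive_rate(alerts):
    """Fraction of automated alerts triaged as invalid findings."""
    if not alerts:
        return 0.0
    invalid = sum(1 for a in alerts if a.get("verdict") == "false_positive")
    return invalid / len(alerts)
```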
Implementation Roadmap
Start with regression testing
Implement basic safety regression tests in CI/CD. This provides immediate value with minimal investment. Use existing tools (Promptfoo, pytest with custom fixtures).
Build CART for production
Deploy continuous automated scanning against production models. Start with a weekly cadence and a focused set of high-priority attacks. Expand scope over time.
Implement human-in-the-loop triage
Build a triage workflow that routes automated findings to human analysts. Establish confidence thresholds and escalation procedures.
Add feedback loops
Connect human triage decisions back to automation. Retrain classifiers, update thresholds, and add new attack patterns based on human analysis.
Expand and optimize
Add new attack categories to automation as they mature. Continuously optimize false positive rates and detection coverage. Measure and report automation effectiveness.
Summary
Automation multiplies red team capacity by handling repetitive, known-pattern testing while freeing human analysts for creative exploration and contextual judgment. The right automation strategy has three layers: CI/CD integration for deployment-time checks, continuous automated red teaming for production monitoring, and human analysis for novel attacks and contextual assessment. Success requires careful tool selection, human-in-the-loop triage, and feedback loops that continuously improve automation accuracy. Measure automation by its detection rate and false positive rate, not just by the volume of tests it runs.