Red Team Automation Strategy
When and how to automate AI red teaming: tool selection, CI/CD integration, continuous automated red teaming (CART), human-in-the-loop design, and scaling assessment coverage through automation.
Automation does not replace red teamers. It replaces the repetitive parts of red teaming -- running known attack patterns, regression testing, coverage checking -- so that human analysts can focus on what they do uniquely well: creative exploration, novel attack development, contextual judgment, and stakeholder communication. A well-designed automation strategy multiplies team capacity without sacrificing assessment depth.
What to Automate (and What Not To)
Automate
| Task | Why Automate | Tool Examples |
|---|---|---|
| Known jailbreak regression testing | Known techniques should be tested against every model update | Garak, PyRIT, custom scripts |
| Safety boundary scanning | Systematic testing of content categories is tedious but important | Promptfoo, custom harnesses |
| Input validation testing | Structured inputs (encodings, languages, formats) are enumerable | Custom test suites |
| API security checks | Authentication, rate limiting, error handling are standard tests | Standard API security tools |
| Output classification | Detecting harmful content in model outputs at scale | Safety classifiers, LLM-as-judge |
| Regression testing | Ensuring previously fixed vulnerabilities stay fixed | CI/CD test suites |
Do Not Automate
| Task | Why Keep Manual | Human Advantage |
|---|---|---|
| Novel attack development | Creativity cannot be automated | Adversarial thinking, intuition |
| Contextual risk assessment | Impact depends on business context | Domain knowledge, judgment |
| Stakeholder communication | Requires tailored messaging | Empathy, persuasion |
| Architecture review | Requires understanding system design decisions | Systems thinking |
| Ethical judgment | Nuanced decisions about responsible disclosure | Experience, values |
| Complex multi-step attacks | Require adaptive reasoning mid-attack | Real-time decision making |
Automation Architecture
Layered Automation
┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Human Analysis │
│ Novel attacks, contextual assessment, reporting │
│ (Triggered by findings from Layers 1-2) │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Continuous Automated Red Teaming (CART) │
│ Scheduled and triggered scans against production models │
│ (Runs continuously or on model updates) │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: CI/CD Integration │
│ Automated checks in the deployment pipeline │
│ (Runs on every deployment) │
└──────────────────────────────────────────────────────────────┘Layer 1: CI/CD Integration
Embed AI security checks in the deployment pipeline so that regressions and basic vulnerabilities are caught before reaching production.
# Example: GitHub Actions workflow for AI security checks
name: AI Security Checks
on:
pull_request:
paths:
- 'models/**'
- 'prompts/**'
- 'configs/safety/**'
push:
branches: [main]
jobs:
safety-regression:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run safety regression suite
run: |
python -m pytest tests/safety/ \
--model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
--threshold 0.95 \
--report-output safety-report.json
- name: Run jailbreak regression
run: |
garak --model-type api \
--model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
--probes known_jailbreaks \
--report-prefix ci-jailbreak
- name: Check results
run: |
python scripts/check_safety_results.py \
--safety-report safety-report.json \
--jailbreak-report ci-jailbreak.json \
--fail-on-regressionWhat to test in CI/CD:
- Known jailbreak regression (did a model update re-introduce a fixed vulnerability?)
- Safety boundary checks (does the model still refuse baseline harmful requests?)
- System prompt leakage (does the model reveal its system prompt?)
- Basic injection patterns (does the model follow instructions in injected context?)
What not to test in CI/CD:
- Full assessment suites (too slow for deployment pipelines)
- Novel attack exploration (not deterministic enough for pass/fail)
- Complex multi-step attacks (require too much infrastructure)
Layer 2: Continuous Automated Red Teaming (CART)
CART runs continuously against production models, testing a broader range of attacks than CI/CD and adapting to new threats.
class CARTPipeline:
"""Continuous Automated Red Teaming pipeline."""
def __init__(self, model_endpoints: dict, alert_webhook: str):
self.endpoints = model_endpoints
self.alert_webhook = alert_webhook
self.attack_registry = AttackRegistry()
self.results_store = ResultsStore()
def run_cycle(self):
"""Run one CART cycle across all monitored endpoints."""
for endpoint_name, endpoint_config in self.endpoints.items():
results = self._test_endpoint(endpoint_name, endpoint_config)
self.results_store.save(endpoint_name, results)
# Alert on new findings
new_findings = self._identify_new_findings(endpoint_name, results)
if new_findings:
self._send_alert(endpoint_name, new_findings)
def _test_endpoint(self, name: str, config: dict) -> list:
"""Test a single endpoint with all applicable attack categories."""
results = []
for attack in self.attack_registry.get_attacks(config["model_type"]):
try:
outcome = attack.execute(config["url"], config["auth"])
results.append({
"attack": attack.name,
"category": attack.category,
"success": outcome.succeeded,
"response": outcome.response[:500],
"timestamp": datetime.now().isoformat()
})
except Exception as e:
results.append({
"attack": attack.name,
"error": str(e),
"timestamp": datetime.now().isoformat()
})
return results
def _identify_new_findings(self, endpoint: str, results: list) -> list:
"""Compare results with previous cycle to find new vulnerabilities."""
previous = self.results_store.get_latest(endpoint)
if previous is None:
return [r for r in results if r.get("success")]
prev_successes = set(r["attack"] for r in previous if r.get("success"))
new_successes = [
r for r in results
if r.get("success") and r["attack"] not in prev_successes
]
return new_successes
def _send_alert(self, endpoint: str, findings: list):
"""Send alert for new findings."""
import requests
requests.post(self.alert_webhook, json={
"endpoint": endpoint,
"new_findings": len(findings),
"details": findings,
"severity": "high" if any(f["category"] == "safety_bypass" for f in findings) else "medium"
})CART scheduling:
| Trigger | Frequency | Scope |
|---|---|---|
| Model update | Every deployment | Full regression + expanded suite |
| New attack technique published | Within 48 hours | Targeted test of new technique |
| Scheduled cycle | Weekly | Full attack suite rotation |
| Incident-triggered | As needed | Deep assessment of affected system |
Layer 3: Human Analysis
Automation findings feed into human analysis. Analysts investigate automated alerts, develop new attacks based on patterns, and conduct deep assessments that automation cannot perform.
Automation → Alert → Triage (human) → Investigation (human) → Report (human)
↓ ↓
New attack patterns Feed back into
from human research automation registryTool Selection Guide
Framework Comparison
| Tool | Best For | Strengths | Limitations |
|---|---|---|---|
| Garak | Broad LLM vulnerability scanning | Large probe library, extensible | Limited agent testing |
| PyRIT | Microsoft ecosystem, structured attacks | Good orchestration, logging | Azure-centric |
| Promptfoo | Prompt regression testing, CI/CD | Easy to integrate, fast | Less depth than specialized tools |
| Custom scripts | Organization-specific attacks | Full control, tailored to your systems | Maintenance burden |
Selection Criteria
Choose tools based on these factors:
- Model access type: API-only versus weight-access determines which tools are applicable
- Integration requirements: CI/CD compatibility, alerting, reporting format
- Attack coverage: Which vulnerability categories does the tool cover?
- Extensibility: Can you add custom attacks?
- Maintenance burden: Who maintains the tool? Is it actively developed?
Human-in-the-Loop Design
The critical question for automation: where does the human intervene?
Triage Model
All automated findings pass through human triage before escalation.
class TriageWorkflow:
"""Human-in-the-loop triage for automated findings."""
SEVERITY_THRESHOLDS = {
"auto_escalate": 0.95, # Very high confidence: auto-escalate with human notification
"human_review": 0.70, # Moderate confidence: require human review before escalation
"auto_dismiss": 0.30, # Low confidence: auto-dismiss with periodic audit
}
def triage(self, finding: dict) -> dict:
"""Route a finding based on confidence level."""
confidence = finding.get("confidence", 0.5)
if confidence >= self.SEVERITY_THRESHOLDS["auto_escalate"]:
return {
"action": "escalate",
"human_required": False,
"notification": True,
"reason": "High-confidence automated finding"
}
elif confidence >= self.SEVERITY_THRESHOLDS["human_review"]:
return {
"action": "queue_for_review",
"human_required": True,
"priority": "normal",
"reason": "Moderate confidence, requires human verification"
}
else:
return {
"action": "log_and_dismiss",
"human_required": False,
"audit_flag": True,
"reason": "Low confidence, logged for periodic audit review"
}Feedback Loop
Human triage decisions feed back into automation to improve accuracy over time:
- False positives are labeled and used to tune detection thresholds
- True positives are added to the regression test suite
- Missed findings (discovered by humans but not automation) are added as new attack patterns
Measuring Automation Effectiveness
| Metric | Definition | Target |
|---|---|---|
| Detection rate | Percentage of known vulnerabilities found by automation | >80% for known attack categories |
| False positive rate | Percentage of automated alerts that are not valid findings | <20% (lower is better) |
| Time to detection | Time between vulnerability introduction and automated detection | <24 hours for regression, <48 hours for new attacks |
| Human effort saved | Percentage reduction in manual testing hours | 40-60% (automation handles known patterns) |
| Coverage expansion | Number of systems tested with automation vs. manual-only | 3-5x more systems assessed |
Implementation Roadmap
Start with regression testing
Implement basic safety regression tests in CI/CD. This provides immediate value with minimal investment. Use existing tools (Promptfoo, pytest with custom fixtures).
Build CART for production
Deploy continuous automated scanning against production models. Start with a weekly cadence and a focused set of high-priority attacks. Expand scope over time.
Implement human-in-the-loop triage
Build a triage workflow that routes automated findings to human analysts. Establish confidence thresholds and escalation procedures.
Add feedback loops
Connect human triage decisions back to automation. Retrain classifiers, update thresholds, and add new attack patterns based on human analysis.
Expand and optimize
Add new attack categories to automation as they mature. Continuously optimize false positive rates and detection coverage. Measure and report automation effectiveness.
Summary
Automation multiplies red team capacity by handling repetitive, known-pattern testing while freeing human analysts for creative exploration and contextual judgment. The right automation strategy has three layers: CI/CD integration for deployment-time checks, continuous automated red teaming for production monitoring, and human analysis for novel attacks and contextual assessment. Success requires careful tool selection, human-in-the-loop triage, and feedback loops that continuously improve automation accuracy. Measure automation by its detection rate and false positive rate, not just by the volume of tests it runs.