Red Team Automation Strategy

advanced9 min readUpdated 2026-03-15

When and how to automate AI red teaming: tool selection, CI/CD integration, continuous automated red teaming (CART), human-in-the-loop design, and scaling assessment coverage through automation.

automation cart ci-cd tooling scaling human-in-the-loop

Automation does not replace red teamers. It replaces the repetitive parts of red teaming -- running known attack patterns, regression testing, coverage checking -- so that human analysts can focus on what they do uniquely well: creative exploration, novel attack development, contextual judgment, and stakeholder communication. A well-designed automation strategy multiplies team capacity without sacrificing assessment depth.

What to Automate (and What Not To)

Automate

Task	Why Automate	Tool Examples
Known jailbreak regression testing	Known techniques should be tested against every model update	Garak, PyRIT, custom scripts
Safety boundary scanning	Systematic testing of content categories is tedious but important	Promptfoo, custom harnesses
Input validation testing	Structured inputs (encodings, languages, formats) are enumerable	Custom test suites
API security checks	Authentication, rate limiting, error handling are standard tests	Standard API security tools
Output classification	Detecting harmful content in model outputs at scale	Safety classifiers, LLM-as-judge
Regression testing	Ensuring previously fixed vulnerabilities stay fixed	CI/CD test suites

Do Not Automate

Task	Why Keep Manual	Human Advantage
Novel attack development	Creativity cannot be automated	Adversarial thinking, intuition
Contextual risk assessment	Impact depends on business context	Domain knowledge, judgment
Stakeholder communication	Requires tailored messaging	Empathy, persuasion
Architecture review	Requires understanding system design decisions	Systems thinking
Ethical judgment	Nuanced decisions about responsible disclosure	Experience, values
Complex multi-step attacks	Require adaptive reasoning mid-attack	Real-time decision making

Automation Architecture

Layered Automation

┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Human Analysis                                       │
│ Novel attacks, contextual assessment, reporting               │
│ (Triggered by findings from Layers 1-2)                       │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Continuous Automated Red Teaming (CART)              │
│ Scheduled and triggered scans against production models       │
│ (Runs continuously or on model updates)                       │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: CI/CD Integration                                    │
│ Automated checks in the deployment pipeline                   │
│ (Runs on every deployment)                                    │
└──────────────────────────────────────────────────────────────┘

Layer 1: CI/CD Integration

Embed AI security checks in the deployment pipeline so that regressions and basic vulnerabilities are caught before reaching production.

# Example: GitHub Actions workflow for AI security checks
name: AI Security Checks
on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
      - 'configs/safety/**'
  push:
    branches: [main]
 
jobs:
  safety-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Run safety regression suite
        run: |
          python -m pytest tests/safety/ \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --threshold 0.95 \
            --report-output safety-report.json
 
      - name: Run jailbreak regression
        run: |
          garak --model-type api \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --probes known_jailbreaks \
            --report-prefix ci-jailbreak
 
      - name: Check results
        run: |
          python scripts/check_safety_results.py \
            --safety-report safety-report.json \
            --jailbreak-report ci-jailbreak.json \
            --fail-on-regression

What to test in CI/CD:

Known jailbreak regression (did a model update re-introduce a fixed vulnerability?)
Safety boundary checks (does the model still refuse baseline harmful requests?)
System prompt leakage (does the model reveal its system prompt?)
Basic injection patterns (does the model follow instructions in injected context?)

What not to test in CI/CD:

Full assessment suites (too slow for deployment pipelines)
Novel attack exploration (not deterministic enough for pass/fail)
Complex multi-step attacks (require too much infrastructure)

Layer 2: Continuous Automated Red Teaming (CART)

CART runs continuously against production models, testing a broader range of attacks than CI/CD and adapting to new threats.

class CARTPipeline:
    """Continuous Automated Red Teaming pipeline."""
 
    def __init__(self, model_endpoints: dict, alert_webhook: str):
        self.endpoints = model_endpoints
        self.alert_webhook = alert_webhook
        self.attack_registry = AttackRegistry()
        self.results_store = ResultsStore()
 
    def run_cycle(self):
        """Run one CART cycle across all monitored endpoints."""
        for endpoint_name, endpoint_config in self.endpoints.items():
            results = self._test_endpoint(endpoint_name, endpoint_config)
            self.results_store.save(endpoint_name, results)
 
            # Alert on new findings
            new_findings = self._identify_new_findings(endpoint_name, results)
            if new_findings:
                self._send_alert(endpoint_name, new_findings)
 
    def _test_endpoint(self, name: str, config: dict) -> list:
        """Test a single endpoint with all applicable attack categories."""
        results = []
 
        for attack in self.attack_registry.get_attacks(config["model_type"]):
            try:
                outcome = attack.execute(config["url"], config["auth"])
                results.append({
                    "attack": attack.name,
                    "category": attack.category,
                    "success": outcome.succeeded,
                    "response": outcome.response[:500],
                    "timestamp": datetime.now().isoformat()
                })
            except Exception as e:
                results.append({
                    "attack": attack.name,
                    "error": str(e),
                    "timestamp": datetime.now().isoformat()
                })
 
        return results
 
    def _identify_new_findings(self, endpoint: str, results: list) -> list:
        """Compare results with previous cycle to find new vulnerabilities."""
        previous = self.results_store.get_latest(endpoint)
        if previous is None:
            return [r for r in results if r.get("success")]
 
        prev_successes = set(r["attack"] for r in previous if r.get("success"))
        new_successes = [
            r for r in results
            if r.get("success") and r["attack"] not in prev_successes
        ]
 
        return new_successes
 
    def _send_alert(self, endpoint: str, findings: list):
        """Send alert for new findings."""
        import requests
        requests.post(self.alert_webhook, json={
            "endpoint": endpoint,
            "new_findings": len(findings),
            "details": findings,
            "severity": "high" if any(f["category"] == "safety_bypass" for f in findings) else "medium"
        })

CART scheduling:

Trigger	Frequency	Scope
Model update	Every deployment	Full regression + expanded suite
New attack technique published	Within 48 hours	Targeted test of new technique
Scheduled cycle	Weekly	Full attack suite rotation
Incident-triggered	As needed	Deep assessment of affected system

Layer 3: Human Analysis

Automation findings feed into human analysis. Analysts investigate automated alerts, develop new attacks based on patterns, and conduct deep assessments that automation cannot perform.

Automation → Alert → Triage (human) → Investigation (human) → Report (human)
     ↓                                       ↓
  New attack patterns                   Feed back into
  from human research                   automation registry

Tool Selection Guide

Framework Comparison

Tool	Best For	Strengths	Limitations
Garak	Broad LLM vulnerability scanning	Large probe library, extensible	Limited agent testing
PyRIT	Microsoft ecosystem, structured attacks	Good orchestration, logging	Azure-centric
Promptfoo	Prompt regression testing, CI/CD	Easy to integrate, fast	Less depth than specialized tools
Custom scripts	Organization-specific attacks	Full control, tailored to your systems	Maintenance burden

Selection Criteria

Choose tools based on these factors:

Model access type: API-only versus weight-access determines which tools are applicable
Integration requirements: CI/CD compatibility, alerting, reporting format
Attack coverage: Which vulnerability categories does the tool cover?
Extensibility: Can you add custom attacks?
Maintenance burden: Who maintains the tool? Is it actively developed?

Human-in-the-Loop Design

The critical question for automation: where does the human intervene?

Triage Model

All automated findings pass through human triage before escalation.

class TriageWorkflow:
    """Human-in-the-loop triage for automated findings."""
 
    SEVERITY_THRESHOLDS = {
        "auto_escalate": 0.95,    # Very high confidence: auto-escalate with human notification
        "human_review": 0.70,     # Moderate confidence: require human review before escalation
        "auto_dismiss": 0.30,     # Low confidence: auto-dismiss with periodic audit
    }
 
    def triage(self, finding: dict) -> dict:
        """Route a finding based on confidence level."""
        confidence = finding.get("confidence", 0.5)
 
        if confidence >= self.SEVERITY_THRESHOLDS["auto_escalate"]:
            return {
                "action": "escalate",
                "human_required": False,
                "notification": True,
                "reason": "High-confidence automated finding"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["human_review"]:
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "normal",
                "reason": "Moderate confidence, requires human verification"
            }
        else:
            return {
                "action": "log_and_dismiss",
                "human_required": False,
                "audit_flag": True,
                "reason": "Low confidence, logged for periodic audit review"
            }

Feedback Loop

Human triage decisions feed back into automation to improve accuracy over time:

False positives are labeled and used to tune detection thresholds
True positives are added to the regression test suite
Missed findings (discovered by humans but not automation) are added as new attack patterns

Measuring Automation Effectiveness

Metric	Definition	Target
Detection rate	Percentage of known vulnerabilities found by automation	>80% for known attack categories
False positive rate	Percentage of automated alerts that are not valid findings	<20% (lower is better)
Time to detection	Time between vulnerability introduction and automated detection	<24 hours for regression, <48 hours for new attacks
Human effort saved	Percentage reduction in manual testing hours	40-60% (automation handles known patterns)
Coverage expansion	Number of systems tested with automation vs. manual-only	3-5x more systems assessed

Implementation Roadmap

Start with regression testing
Implement basic safety regression tests in CI/CD. This provides immediate value with minimal investment. Use existing tools (Promptfoo, pytest with custom fixtures).
Build CART for production
Deploy continuous automated scanning against production models. Start with a weekly cadence and a focused set of high-priority attacks. Expand scope over time.
Implement human-in-the-loop triage
Build a triage workflow that routes automated findings to human analysts. Establish confidence thresholds and escalation procedures.
Add feedback loops
Connect human triage decisions back to automation. Retrain classifiers, update thresholds, and add new attack patterns based on human analysis.
Expand and optimize
Add new attack categories to automation as they mature. Continuously optimize false positive rates and detection coverage. Measure and report automation effectiveness.

Summary

Automation multiplies red team capacity by handling repetitive, known-pattern testing while freeing human analysts for creative exploration and contextual judgment. The right automation strategy has three layers: CI/CD integration for deployment-time checks, continuous automated red teaming for production monitoring, and human analysis for novel attacks and contextual assessment. Success requires careful tool selection, human-in-the-loop triage, and feedback loops that continuously improve automation accuracy. Measure automation by its detection rate and false positive rate, not just by the volume of tests it runs.

Edit this page on GitHub

Red Team Automation Strategy

advanced9 min readUpdated 2026-03-15

When and how to automate AI red teaming: tool selection, CI/CD integration, continuous automated red teaming (CART), human-in-the-loop design, and scaling assessment coverage through automation.

automation cart ci-cd tooling scaling human-in-the-loop

What to Automate (and What Not To)

Automate

Task	Why Automate	Tool Examples
Known jailbreak regression testing	Known techniques should be tested against every model update	Garak, PyRIT, custom scripts
Safety boundary scanning	Systematic testing of content categories is tedious but important	Promptfoo, custom harnesses
Input validation testing	Structured inputs (encodings, languages, formats) are enumerable	Custom test suites
API security checks	Authentication, rate limiting, error handling are standard tests	Standard API security tools
Output classification	Detecting harmful content in model outputs at scale	Safety classifiers, LLM-as-judge
Regression testing	Ensuring previously fixed vulnerabilities stay fixed	CI/CD test suites

Do Not Automate

Task	Why Keep Manual	Human Advantage
Novel attack development	Creativity cannot be automated	Adversarial thinking, intuition
Contextual risk assessment	Impact depends on business context	Domain knowledge, judgment
Stakeholder communication	Requires tailored messaging	Empathy, persuasion
Architecture review	Requires understanding system design decisions	Systems thinking
Ethical judgment	Nuanced decisions about responsible disclosure	Experience, values
Complex multi-step attacks	Require adaptive reasoning mid-attack	Real-time decision making

Automation Architecture

Layered Automation

┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Human Analysis                                       │
│ Novel attacks, contextual assessment, reporting               │
│ (Triggered by findings from Layers 1-2)                       │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Continuous Automated Red Teaming (CART)              │
│ Scheduled and triggered scans against production models       │
│ (Runs continuously or on model updates)                       │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: CI/CD Integration                                    │
│ Automated checks in the deployment pipeline                   │
│ (Runs on every deployment)                                    │
└──────────────────────────────────────────────────────────────┘

Layer 1: CI/CD Integration

Embed AI security checks in the deployment pipeline so that regressions and basic vulnerabilities are caught before reaching production.

# Example: GitHub Actions workflow for AI security checks
name: AI Security Checks
on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
      - 'configs/safety/**'
  push:
    branches: [main]
 
jobs:
  safety-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - name: Run safety regression suite
        run: |
          python -m pytest tests/safety/ \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --threshold 0.95 \
            --report-output safety-report.json
 
      - name: Run jailbreak regression
        run: |
          garak --model-type api \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --probes known_jailbreaks \
            --report-prefix ci-jailbreak
 
      - name: Check results
        run: |
          python scripts/check_safety_results.py \
            --safety-report safety-report.json \
            --jailbreak-report ci-jailbreak.json \
            --fail-on-regression

What to test in CI/CD:

Known jailbreak regression (did a model update re-introduce a fixed vulnerability?)
Safety boundary checks (does the model still refuse baseline harmful requests?)
System prompt leakage (does the model reveal its system prompt?)
Basic injection patterns (does the model follow instructions in injected context?)

What not to test in CI/CD:

Full assessment suites (too slow for deployment pipelines)
Novel attack exploration (not deterministic enough for pass/fail)
Complex multi-step attacks (require too much infrastructure)

Layer 2: Continuous Automated Red Teaming (CART)

CART runs continuously against production models, testing a broader range of attacks than CI/CD and adapting to new threats.

class CARTPipeline:
    """Continuous Automated Red Teaming pipeline."""
 
    def __init__(self, model_endpoints: dict, alert_webhook: str):
        self.endpoints = model_endpoints
        self.alert_webhook = alert_webhook
        self.attack_registry = AttackRegistry()
        self.results_store = ResultsStore()
 
    def run_cycle(self):
        """Run one CART cycle across all monitored endpoints."""
        for endpoint_name, endpoint_config in self.endpoints.items():
            results = self._test_endpoint(endpoint_name, endpoint_config)
            self.results_store.save(endpoint_name, results)
 
            # Alert on new findings
            new_findings = self._identify_new_findings(endpoint_name, results)
            if new_findings:
                self._send_alert(endpoint_name, new_findings)
 
    def _test_endpoint(self, name: str, config: dict) -> list:
        """Test a single endpoint with all applicable attack categories."""
        results = []
 
        for attack in self.attack_registry.get_attacks(config["model_type"]):
            try:
                outcome = attack.execute(config["url"], config["auth"])
                results.append({
                    "attack": attack.name,
                    "category": attack.category,
                    "success": outcome.succeeded,
                    "response": outcome.response[:500],
                    "timestamp": datetime.now().isoformat()
                })
            except Exception as e:
                results.append({
                    "attack": attack.name,
                    "error": str(e),
                    "timestamp": datetime.now().isoformat()
                })
 
        return results
 
    def _identify_new_findings(self, endpoint: str, results: list) -> list:
        """Compare results with previous cycle to find new vulnerabilities."""
        previous = self.results_store.get_latest(endpoint)
        if previous is None:
            return [r for r in results if r.get("success")]
 
        prev_successes = set(r["attack"] for r in previous if r.get("success"))
        new_successes = [
            r for r in results
            if r.get("success") and r["attack"] not in prev_successes
        ]
 
        return new_successes
 
    def _send_alert(self, endpoint: str, findings: list):
        """Send alert for new findings."""
        import requests
        requests.post(self.alert_webhook, json={
            "endpoint": endpoint,
            "new_findings": len(findings),
            "details": findings,
            "severity": "high" if any(f["category"] == "safety_bypass" for f in findings) else "medium"
        })

CART scheduling:

Trigger	Frequency	Scope
Model update	Every deployment	Full regression + expanded suite
New attack technique published	Within 48 hours	Targeted test of new technique
Scheduled cycle	Weekly	Full attack suite rotation
Incident-triggered	As needed	Deep assessment of affected system

Layer 3: Human Analysis

Automation findings feed into human analysis. Analysts investigate automated alerts, develop new attacks based on patterns, and conduct deep assessments that automation cannot perform.

Automation → Alert → Triage (human) → Investigation (human) → Report (human)
     ↓                                       ↓
  New attack patterns                   Feed back into
  from human research                   automation registry

Tool Selection Guide

Framework Comparison

Tool	Best For	Strengths	Limitations
Garak	Broad LLM vulnerability scanning	Large probe library, extensible	Limited agent testing
PyRIT	Microsoft ecosystem, structured attacks	Good orchestration, logging	Azure-centric
Promptfoo	Prompt regression testing, CI/CD	Easy to integrate, fast	Less depth than specialized tools
Custom scripts	Organization-specific attacks	Full control, tailored to your systems	Maintenance burden

Selection Criteria

Choose tools based on these factors:

Model access type: API-only versus weight-access determines which tools are applicable
Integration requirements: CI/CD compatibility, alerting, reporting format
Attack coverage: Which vulnerability categories does the tool cover?
Extensibility: Can you add custom attacks?
Maintenance burden: Who maintains the tool? Is it actively developed?

Human-in-the-Loop Design

The critical question for automation: where does the human intervene?

Triage Model

All automated findings pass through human triage before escalation.

class TriageWorkflow:
    """Human-in-the-loop triage for automated findings."""
 
    SEVERITY_THRESHOLDS = {
        "auto_escalate": 0.95,    # Very high confidence: auto-escalate with human notification
        "human_review": 0.70,     # Moderate confidence: require human review before escalation
        "auto_dismiss": 0.30,     # Low confidence: auto-dismiss with periodic audit
    }
 
    def triage(self, finding: dict) -> dict:
        """Route a finding based on confidence level."""
        confidence = finding.get("confidence", 0.5)
 
        if confidence >= self.SEVERITY_THRESHOLDS["auto_escalate"]:
            return {
                "action": "escalate",
                "human_required": False,
                "notification": True,
                "reason": "High-confidence automated finding"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["human_review"]:
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "normal",
                "reason": "Moderate confidence, requires human verification"
            }
        else:
            return {
                "action": "log_and_dismiss",
                "human_required": False,
                "audit_flag": True,
                "reason": "Low confidence, logged for periodic audit review"
            }

Feedback Loop

Human triage decisions feed back into automation to improve accuracy over time:

False positives are labeled and used to tune detection thresholds
True positives are added to the regression test suite
Missed findings (discovered by humans but not automation) are added as new attack patterns

Measuring Automation Effectiveness

Metric	Definition	Target
Detection rate	Percentage of known vulnerabilities found by automation	>80% for known attack categories
False positive rate	Percentage of automated alerts that are not valid findings	<20% (lower is better)
Time to detection	Time between vulnerability introduction and automated detection	<24 hours for regression, <48 hours for new attacks
Human effort saved	Percentage reduction in manual testing hours	40-60% (automation handles known patterns)
Coverage expansion	Number of systems tested with automation vs. manual-only	3-5x more systems assessed

Implementation Roadmap

Start with regression testing
Implement basic safety regression tests in CI/CD. This provides immediate value with minimal investment. Use existing tools (Promptfoo, pytest with custom fixtures).
Build CART for production
Deploy continuous automated scanning against production models. Start with a weekly cadence and a focused set of high-priority attacks. Expand scope over time.
Implement human-in-the-loop triage
Build a triage workflow that routes automated findings to human analysts. Establish confidence thresholds and escalation procedures.
Add feedback loops
Connect human triage decisions back to automation. Retrain classifiers, update thresholds, and add new attack patterns based on human analysis.
Expand and optimize
Add new attack categories to automation as they mature. Continuously optimize false positive rates and detection coverage. Measure and report automation effectiveness.

Summary

Edit this page on GitHub

Red Team Automation Strategy

Start with regression testing

Build CART for production

Implement human-in-the-loop triage

Add feedback loops

Expand and optimize

Related articles

Red Team Automation Strategy

Start with regression testing

Build CART for production

Implement human-in-the-loop triage

Add feedback loops

Expand and optimize

Related articles