Red Team Automation Strategy
When and how to automate AI red teaming: tool selection, CI/CD integration, continuous automated red teaming (CART), human-in-the-loop design, and scaling assessment coverage through automation.
Automation does not replace red teamers. It replaces the repetitive parts of red teaming (running known attack patterns, regression testing, coverage checking) so that human analysts can focus on what they do uniquely well: creative exploration, novel attack development, contextual judgment, and stakeholder communication. A well-designed automation strategy multiplies team capacity without sacrificing assessment depth.
What to Automate (and What Not To)
Automate
| Task | Why Automate | Example Tools |
|---|---|---|
| Known jailbreak regression testing | Known techniques should be tested against every model update | Garak, PyRIT, custom scripts |
| Safety boundary scanning | Systematic testing of content categories is tedious but important | Promptfoo, custom harnesses |
| Input validation testing | Structured inputs (encodings, languages, formats) are enumerable | Custom test suites |
| API security checks | Authentication, rate limiting, error handling are standard tests | Standard API security tools |
| Output classification | Detecting harmful content in model outputs at scale | Safety classifiers, LLM-as-judge |
| Regression testing | Ensuring previously fixed vulnerabilities stay fixed | CI/CD test suites |
Do Not Automate
| Task | Why Keep Manual | Human Advantage |
|---|---|---|
| Novel attack development | Creativity cannot be automated | Adversarial thinking, intuition |
| Contextual risk assessment | Impact depends on business context | Domain knowledge, judgment |
| Stakeholder communication | Requires tailored messaging | Empathy, persuasion |
| Architecture review | Requires understanding system design decisions | Systems thinking |
| Ethical judgment | Nuanced decisions about responsible disclosure | Experience, values |
| Complex multi-step attacks | Require adaptive reasoning mid-attack | Real-time decision making |
Automation Architecture
Layered Automation
┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Human Analysis                                      │
│ Novel attacks, contextual assessment, reporting              │
│ (Triggered by findings from Layers 1-2)                      │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Continuous Automated Red Teaming (CART)             │
│ Scheduled and triggered scans against production models      │
│ (Runs continuously or on model updates)                      │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: CI/CD Integration                                   │
│ Automated checks in the deployment pipeline                  │
│ (Runs on every deployment)                                   │
└──────────────────────────────────────────────────────────────┘
Layer 1: CI/CD Integration
Embed AI safety checks in the deployment pipeline so that regressions and basic vulnerabilities are caught before reaching production.
# Example: GitHub Actions workflow for AI safety checks
name: AI Safety Checks
on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
      - 'configs/safety/**'
  push:
    branches: [main]
jobs:
  safety-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run safety regression suite
        run: |
          python -m pytest tests/safety/ \
            --model-endpoint ${{ secrets.STAGING_MODEL_ENDPOINT }} \
            --threshold 0.95 \
            --report-output safety-report.json
      - name: Run jailbreak regression
        run: |
          # Illustrative garak invocation; consult the garak docs for the
          # exact generator/endpoint options for your deployment.
          garak --model_type rest \
            --probes known_jailbreaks \
            --report_prefix ci-jailbreak
      - name: Check results
        run: |
          python scripts/check_safety_results.py \
            --safety-report safety-report.json \
            --jailbreak-report ci-jailbreak.json \
            --fail-on-regression
What to test in CI/CD:
- Known jailbreak regressions (did a model update re-introduce a fixed vulnerability?)
- Safety boundary checks (does the model still refuse baseline harmful requests?)
- System prompt leakage (does the model reveal its system prompt?)
- Basic injection patterns (does the model follow instructions in injected context?)
What not to test in CI/CD:
- Full assessment suites (too slow for deployment pipelines)
- Novel attack exploration (not deterministic enough for pass/fail)
- Complex multi-step attacks (require too much infrastructure)
Layer 2: Continuous Automated Red Teaming (CART)
CART runs continuously against production models, testing a broader range of attacks than CI/CD and adapting to new threats.
from datetime import datetime

import requests  # third-party; used for webhook alerts


class CARTPipeline:
    """Continuous Automated Red Teaming pipeline."""

    def __init__(self, model_endpoints: dict, alert_webhook: str):
        self.endpoints = model_endpoints
        self.alert_webhook = alert_webhook
        self.attack_registry = AttackRegistry()
        self.results_store = ResultsStore()

    def run_cycle(self):
        """Run one CART cycle across all monitored endpoints."""
        for endpoint_name, endpoint_config in self.endpoints.items():
            results = self._test_endpoint(endpoint_name, endpoint_config)
            # Compare against the previous cycle BEFORE saving, otherwise
            # get_latest() would return the results we just stored.
            new_findings = self._identify_new_findings(endpoint_name, results)
            self.results_store.save(endpoint_name, results)
            # Alert on new findings
            if new_findings:
                self._send_alert(endpoint_name, new_findings)

    def _test_endpoint(self, name: str, config: dict) -> list:
        """Test a single endpoint with all applicable attack categories."""
        results = []
        for attack in self.attack_registry.get_attacks(config["model_type"]):
            try:
                outcome = attack.execute(config["url"], config["auth"])
                results.append({
                    "attack": attack.name,
                    "category": attack.category,
                    "success": outcome.succeeded,
                    "response": outcome.response[:500],
                    "timestamp": datetime.now().isoformat()
                })
            except Exception as e:
                results.append({
                    "attack": attack.name,
                    "error": str(e),
                    "timestamp": datetime.now().isoformat()
                })
        return results

    def _identify_new_findings(self, endpoint: str, results: list) -> list:
        """Compare results with the previous cycle to find new vulnerabilities."""
        previous = self.results_store.get_latest(endpoint)
        if previous is None:
            return [r for r in results if r.get("success")]
        prev_successes = {r["attack"] for r in previous if r.get("success")}
        return [
            r for r in results
            if r.get("success") and r["attack"] not in prev_successes
        ]

    def _send_alert(self, endpoint: str, findings: list):
        """Send an alert for new findings."""
        # Error records carry no "category" key, so use .get() here.
        requests.post(self.alert_webhook, json={
            "endpoint": endpoint,
            "new_findings": len(findings),
            "details": findings,
            "severity": "high" if any(f.get("category") == "safety_bypass" for f in findings) else "medium"
        })
CART scheduling:
| Trigger | Frequency | Scope |
|---|---|---|
| Model update | Every deployment | Full regression + expanded suite |
| New attack technique published | Within 48 hours | Targeted testing of the new technique |
| Scheduled cycle | Weekly | Full attack suite rotation |
| Incident-triggered | As needed | Deep assessment of the affected system |
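A minimal driver for the weekly scheduled cycle might look like the following; `CARTPipeline` is the class sketched above, and the injectable `sleep` parameter exists only so the loop can be exercised in tests:

```python
import time

WEEK_SECONDS = 7 * 24 * 60 * 60

def run_weekly(pipeline, cycles=None, sleep=time.sleep):
    """Drive pipeline.run_cycle() on a weekly cadence.

    cycles=None loops forever; a finite count (with a no-op sleep) is
    useful for testing. In production, prefer a real scheduler
    (cron, Airflow, etc.) over a sleep loop.
    """
    ran = 0
    while cycles is None or ran < cycles:
        pipeline.run_cycle()  # full attack suite rotation
        ran += 1
        sleep(WEEK_SECONDS)
```

The model-update and incident-triggered rows in the table above would bypass this loop entirely and invoke `run_cycle()` directly from the deployment pipeline or incident tooling.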
Layer 3: Human Analysis
Automation findings feed into human analysis. Analysts investigate automated alerts, develop new attacks based on patterns, and conduct deep assessments that automation cannot perform.
Automation → Alert → Triage (human) → Investigation (human) → Report (human)
                                              ↓
                New attack patterns from human research feed back
                          into the automation registry
Tool Selection Guide
Framework Comparison
| Tool | Best For | Strengths | Limitations |
|---|---|---|---|
| Garak | Broad LLM vulnerability scanning | Large probe library, extensible | Limited agentic testing |
| PyRIT | Microsoft ecosystem, structured attacks | Good orchestration, logging | Azure-centric |
| Promptfoo | Prompt regression testing, CI/CD | Easy to integrate, fast | Less depth than specialized tools |
| Custom scripts | Organization-specific attacks | Full control, tailored to your systems | Maintenance burden |
Selection Criteria
Choose tools based on these factors:
- Model access type: API-only versus weight-access determines which tools are applicable
- Integration requirements: CI/CD compatibility, alerting, reporting format
- Attack coverage: Which vulnerability categories does the tool cover?
- Extensibility: Can you add custom attacks?
- Maintenance burden: Who maintains the tool? Is it actively developed?
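One lightweight way to apply these criteria is a weighted scoring matrix. The weights and per-tool scores below are illustrative examples, not recommendations:

```python
# Illustrative weighted scoring for tool selection.
# Weights and per-tool scores (0-5) are made-up examples.
CRITERIA_WEIGHTS = {
    "model_access_fit": 0.25,   # API-only vs. weight-access applicability
    "integration": 0.20,        # CI/CD, alerting, reporting format
    "attack_coverage": 0.25,    # vulnerability categories covered
    "extensibility": 0.15,      # can you add custom attacks?
    "maintenance": 0.15,        # who maintains it; is it actively developed?
}

def score_tool(scores: dict) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(w * scores.get(c, 0) for c, w in CRITERIA_WEIGHTS.items())

# Example comparison (hypothetical scores):
garak_scores = {"model_access_fit": 5, "integration": 3, "attack_coverage": 4,
                "extensibility": 4, "maintenance": 4}
custom_scores = {"model_access_fit": 5, "integration": 5, "attack_coverage": 3,
                 "extensibility": 5, "maintenance": 1}
```

The point of the exercise is less the final number than forcing the team to state explicitly how much each criterion matters before committing to a tool.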
Human-in-the-Loop Design
The critical question for automation: where does the human intervene?
Triage Model
All automated findings pass through human triage before escalation.
class TriageWorkflow:
    """Human-in-the-loop triage for automated findings."""

    SEVERITY_THRESHOLDS = {
        "auto_escalate": 0.95,  # Very high confidence: auto-escalate with human notification
        "human_review": 0.70,   # Moderate confidence: require human review before escalation
        "auto_dismiss": 0.30,   # Low confidence: auto-dismiss with periodic audit
    }

    def triage(self, finding: dict) -> dict:
        """Route a finding based on confidence level."""
        confidence = finding.get("confidence", 0.5)
        if confidence >= self.SEVERITY_THRESHOLDS["auto_escalate"]:
            return {
                "action": "escalate",
                "human_required": False,
                "notification": True,
                "reason": "High-confidence automated finding"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["human_review"]:
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "normal",
                "reason": "Moderate confidence, requires human verification"
            }
        elif confidence >= self.SEVERITY_THRESHOLDS["auto_dismiss"]:
            # Between the dismiss and review thresholds: keep a human
            # in the loop, but at low priority.
            return {
                "action": "queue_for_review",
                "human_required": True,
                "priority": "low",
                "reason": "Low confidence, queued for low-priority review"
            }
        else:
            return {
                "action": "log_and_dismiss",
                "human_required": False,
                "audit_flag": True,
                "reason": "Low confidence, logged for periodic audit review"
            }
Feedback Loop
Human triage decisions feed back into automation to improve accuracy over time:
- False positives are labeled and used to tune detection thresholds
- True positives are added to the regression test suite
- Missed findings (discovered by humans but not automation) are added as new attack patterns
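These three feedback paths can be sketched as a single routing function; the tuner, suite, and registry objects are hypothetical stand-ins for whatever stores your automation actually uses:

```python
def apply_triage_feedback(finding, verdict, regression_suite,
                          threshold_tuner, pattern_registry):
    """Route a human triage verdict back into the automation layer.

    verdict: 'false_positive', 'true_positive', or 'missed'
    (missed = found by a human but not by automation).
    """
    if verdict == "false_positive":
        # Labeled false positives are used to tune detection thresholds.
        threshold_tuner.record_false_positive(finding)
    elif verdict == "true_positive":
        # Confirmed findings become permanent regression tests.
        regression_suite.append({"attack": finding["attack"], "expected": "blocked"})
    elif verdict == "missed":
        # Human-discovered attacks become new automated patterns.
        pattern_registry.register(finding)
    else:
        raise ValueError(f"unknown verdict: {verdict!r}")
```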
Measuring Automation Effectiveness
| Metric | Definition | Target |
|---|---|---|
| Detection rate | Percentage of known vulnerabilities found by automation | >80% for known attack categories |
| False positive rate | Percentage of automated alerts that are not valid findings | <20% (lower is better) |
| Time to detection | Time between vulnerability introduction and automated detection | <24 hours for regressions, <48 hours for new attacks |
| Human effort saved | Percentage reduction in manual testing hours | 40-60% (automation handles known patterns) |
| Coverage expansion | Number of systems tested with automation vs. manual-only | 3-5x more systems assessed |
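The first two metrics in the table are straightforward to compute from labeled triage outcomes; the record shapes here are assumptions:

```python
def detection_rate(known_vulns, detected):
    """Fraction of known vulnerabilities that automation found."""
    known_vulns, detected = set(known_vulns), set(detected)
    if not known_vulns:
        return 0.0
    return len(known_vulns & detected) / len(known_vulns)

def false_positive_rate(alerts):
    """Fraction of automated alerts triaged as invalid findings."""
    if not alerts:
        return 0.0
    invalid = sum(1 for a in alerts if a.get("verdict") == "false_positive")
    return invalid / len(alerts)
```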
Implementation Roadmap
Start with regression testing
Implement basic safety regression tests in CI/CD. This provides immediate value with minimal investment. Use existing tools (Promptfoo, pytest with custom fixtures).
Build CART for production
Deploy continuous automated scanning against production models. Start with a weekly cadence and a focused set of high-priority attacks. Expand scope over time.
Implement human-in-the-loop triage
Build a triage workflow that routes automated findings to human analysts. Establish confidence thresholds and escalation procedures.
Add feedback loops
Connect human triage decisions back to automation. Retrain classifiers, update thresholds, and add new attack patterns based on human analysis.
Expand and optimize
Add new attack categories to automation as they mature. Continuously optimize false positive rates and detection coverage. Measure and report automation effectiveness.
Summary
Automation multiplies red team capacity by handling repetitive, known-pattern testing while freeing human analysts for creative exploration and contextual judgment. The right automation strategy has three layers: CI/CD integration for deployment-time checks, continuous automated red teaming for production monitoring, and human analysis for novel attacks and contextual assessment. Success requires careful tool selection, human-in-the-loop triage, and feedback loops that continuously improve automation accuracy. Measure automation by its detection rate and false positive rate, not just by the volume of tests it runs.