Red Teaming Automation
Frameworks and tools for automating AI red teaming at scale, including CART pipelines, jailbreak fuzzing, regression testing, and continuous monitoring.
Manual red teaming is essential for creative attack discovery, but it does not scale. CART and other automation frameworks enable continuous testing, regression detection, and broad coverage across attack categories. Professional AI red teaming combines manual creativity with automated scale.
CART Architecture
A Continuous Automated Red Teaming pipeline:
Payload Generation → Execution Engine → Result Analysis → Reporting
↓ ↓ ↓ ↓
Template library API integration Success detection Dashboard
Mutation engine Rate management Classification Alerts
LLM-based gen Parallel execution Statistical tests Trends
Building a Basic CART Pipeline
import asyncio
from dataclasses import dataclass
from typing import Callable
@dataclass
class TestCase:
payload: str
category: str # "injection", "jailbreak", "extraction", etc.
expected_behavior: str # "refuse", "comply", "extract"
detector: Callable # Function to classify the response
@dataclass
class TestResult:
test_case: TestCase
response: str
success: bool
latency_ms: float
timestamp: str
class CARTPipeline:
def __init__(self, target_api, test_suite: list[TestCase]):
self.target = target_api
self.tests = test_suite
self.results: list[TestResult] = []
async def run_suite(self, n_trials_per_test: int = 10):
"""Run all tests with statistical significance."""
for test in self.tests:
for _ in range(n_trials_per_test):
response = await self.target.query(test.payload)
success = test.detector(response)
self.results.append(TestResult(
test_case=test,
response=response,
success=success,
latency_ms=response.latency,
timestamp=datetime.now().isoformat(),
))
def report(self) -> dict:
"""Generate success rates by category."""
from collections import defaultdict
stats = defaultdict(lambda: {"total": 0, "success": 0})
for r in self.results:
stats[r.test_case.category]["total"] += 1
if r.success:
stats[r.test_case.category]["success"] += 1
return {k: v["success"]/v["total"] for k, v in stats.items()}Jailbreak Fuzzing
Automated generation and testing of jailbreak variants through fuzzing:
class JailbreakFuzzer:
"""Generate jailbreak variants through mutation."""
def __init__(self, seed_prompts: list[str]):
self.seeds = seed_prompts
self.mutations = [
self.synonym_replace,
self.add_context,
self.change_framing,
self.add_encoding,
self.split_payload,
self.add_few_shot,
]
def synonym_replace(self, prompt: str) -> str:
"""Replace key terms with synonyms."""
replacements = {
"ignore": ["disregard", "bypass", "override", "skip"],
"instructions": ["rules", "guidelines", "directives", "constraints"],
"previous": ["earlier", "prior", "above", "initial"],
}
for word, synonyms in replacements.items():
if word in prompt.lower():
prompt = prompt.replace(word, random.choice(synonyms))
return prompt
def add_encoding(self, prompt: str) -> str:
"""Wrap payload in various encodings."""
encodings = [
lambda p: f"Decode and follow: {base64.b64encode(p.encode()).decode()}",
lambda p: f"ROT13: {codecs.encode(p, 'rot_13')}",
lambda p: f"Reverse: {p[::-1]}",
]
return random.choice(encodings)(prompt)
def fuzz(self, n_variants: int = 100) -> list[str]:
"""Generate n variant payloads."""
variants = []
for _ in range(n_variants):
base = random.choice(self.seeds)
n_mutations = random.randint(1, 3)
for _ in range(n_mutations):
mutation = random.choice(self.mutations)
base = mutation(base)
variants.append(base)
return variantsRegression Testing
When models are updated, previously patched vulnerabilities may reappear and new ones may emerge:
class RegressionSuite:
"""Test known vulnerabilities after model updates."""
def __init__(self, known_vulns: list[TestCase]):
self.vulns = known_vulns
def check_regression(self, target_api) -> dict:
"""Test all known vulnerabilities against current model."""
results = {}
for vuln in self.vulns:
success_rate = measure_success_rate(vuln.payload, target_api)
results[vuln.category] = {
"previous_rate": vuln.last_known_rate,
"current_rate": success_rate,
"regression": success_rate > vuln.last_known_rate,
}
return resultsOpen-Source Tools
| Tool | Purpose | Link |
|---|---|---|
| Garak | LLM vulnerability scanner | github.com/leondz/garak |
| PyRIT | Microsoft red teaming framework | github.com/Azure/PyRIT |
| Promptfoo | Prompt testing and evaluation | github.com/promptfoo/promptfoo |
| ART | Adversarial Robustness Toolbox | github.com/Trusted-AI/adversarial-robustness-toolbox |
Try It Yourself
Related Topics
- AI Exploit Development Overview -- the broader exploit development context
- Payload Crafting -- creating the seed payloads that automation scales
- Custom Tooling -- building target-specific automation tools
- CART Pipelines -- advanced CART architecture and deployment
- Capstone: Execution & Reporting -- integrating automation into full engagements
References
- Perez et al., "Red Teaming Language Models with Language Models" (2022) -- LLM-based automated red teaming
- Samvelyan et al., "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" (2024) -- diversity-focused automated attack generation
- Deng et al., "Garak: A Framework for Security Probing Large Language Models" (2024) -- open-source LLM vulnerability scanning
Why is regression testing important after model updates?