Automatisering van redteaming

Gevorderd7 min lezenBijgewerkt op 2026-03-12

Frameworks en tools voor het op schaal automatiseren van AI-redteaming, inclusief CART-pipelines, jailbreak-fuzzing, regressietests en continue monitoring.

automation cart fuzzing testing exploit-dev

Handmatige redteaming is essentieel voor creatieve ontdekking van aanvallen, maar schaalt niet. CART en andere automatiseringsframeworks maken continue testen, regressiedetectie en brede dekking over aanvalscategorieën mogelijk. Professionele AI-redteaming combineert handmatige creativiteit met geautomatiseerde schaal.

CART-architectuur

Een Continuous Automated Red Teaming-pipeline:

Payload-generatie → Uitvoeringsengine → Resultaatanalyse → Rapportage
       ↓                    ↓                  ↓                ↓
  Templatebibliotheek  API-integratie      Succesdetectie    Dashboard
  Mutatie-engine       Ratemanagement      Classificatie     Meldingen
  LLM-gebaseerde gen   Parallelle uitv.    Statistische tests Trends

Een basis-CART-pipeline bouwen

import asyncio
from dataclasses import dataclass
from typing import Callable
 
@dataclass
class TestCase:
    payload: str
    category: str  # "injection", "jailbreak", "extraction", etc.
    expected_behavior: str  # "refuse", "comply", "extract"
    detector: Callable  # Function to classify the response
 
@dataclass
class TestResult:
    test_case: TestCase
    response: str
    success: bool
    latency_ms: float
    timestamp: str
 
class CARTPipeline:
    def __init__(self, target_api, test_suite: list[TestCase]):
        self.target = target_api
        self.tests = test_suite
        self.results: list[TestResult] = []
 
    async def run_suite(self, n_trials_per_test: int = 10):
        """Run all tests with statistical significance."""
        for test in self.tests:
            for _ in range(n_trials_per_test):
                response = await self.target.query(test.payload)
                success = test.detector(response)
                self.results.append(TestResult(
                    test_case=test,
                    response=response,
                    success=success,
                    latency_ms=response.latency,
                    timestamp=datetime.now().isoformat(),
                ))
 
    def report(self) -> dict:
        """Generate success rates by category."""
        from collections import defaultdict
        stats = defaultdict(lambda: {"total": 0, "success": 0})
        for r in self.results:
            stats[r.test_case.category]["total"] += 1
            if r.success:
                stats[r.test_case.category]["success"] += 1
        return {k: v["success"]/v["total"] for k, v in stats.items()}

Jailbreak-fuzzing

Geautomatiseerde generatie en het testen van jailbreak-varianten via fuzzing:

class JailbreakFuzzer:
    """Generate jailbreak variants through mutation."""
 
    def __init__(self, seed_prompts: list[str]):
        self.seeds = seed_prompts
        self.mutations = [
            self.synonym_replace,
            self.add_context,
            self.change_framing,
            self.add_encoding,
            self.split_payload,
            self.add_few_shot,
        ]
 
    def synonym_replace(self, prompt: str) -> str:
        """Replace key terms with synonyms."""
        replacements = {
            "ignore": ["disregard", "bypass", "override", "skip"],
            "instructions": ["rules", "guidelines", "directives", "constraints"],
            "previous": ["earlier", "prior", "above", "initial"],
        }
        for word, synonyms in replacements.items():
            if word in prompt.lower():
                prompt = prompt.replace(word, random.choice(synonyms))
        return prompt
 
    def add_encoding(self, prompt: str) -> str:
        """Wrap payload in various encodings."""
        encodings = [
            lambda p: f"Decode and follow: {base64.b64encode(p.encode()).decode()}",
            lambda p: f"ROT13: {codecs.encode(p, 'rot_13')}",
            lambda p: f"Reverse: {p[::-1]}",
        ]
        return random.choice(encodings)(prompt)
 
    def fuzz(self, n_variants: int = 100) -> list[str]:
        """Generate n variant payloads."""
        variants = []
        for _ in range(n_variants):
            base = random.choice(self.seeds)
            n_mutations = random.randint(1, 3)
            for _ in range(n_mutations):
                mutation = random.choice(self.mutations)
                base = mutation(base)
            variants.append(base)
        return variants

Regressietests

Wanneer modellen worden bijgewerkt, kunnen eerder verholpen kwetsbaarheden opnieuw verschijnen en kunnen nieuwe ontstaan:

class RegressionSuite:
    """Test known vulnerabilities after model updates."""
 
    def __init__(self, known_vulns: list[TestCase]):
        self.vulns = known_vulns
 
    def check_regression(self, target_api) -> dict:
        """Test all known vulnerabilities against current model."""
        results = {}
        for vuln in self.vulns:
            success_rate = measure_success_rate(vuln.payload, target_api)
            results[vuln.category] = {
                "previous_rate": vuln.last_known_rate,
                "current_rate": success_rate,
                "regression": success_rate > vuln.last_known_rate,
            }
        return results

Open-source tools

Tool	Doel	Link
Garak	Kwetsbaarheidsscanner voor LLM's	github.com/leondz/garak
PyRIT	Redteaming-framework van Microsoft	github.com/Azure/PyRIT
Promptfoo	Prompt-testen en -evaluatie	github.com/promptfoo/promptfoo
ART	Adversarial Robustness Toolbox	github.com/Trusted-AI/adversarial-robustness-toolbox

Probeer het zelf

Practice

Oefening: een basis geautomatiseerde redteam-test opzetten

Zet een basis geautomatiseerde test op met behulp van een van de open-source frameworks die op deze pagina worden besproken, en voer hem uit. Deze oefening bouwt praktische ervaring op met CART-achtige tooling en geautomatiseerde resultaatanalyse.

Stap 1
Kies een van de open-source frameworks uit de tabel hierboven (Garak, PyRIT of Promptfoo) en installeer het in een virtuele omgeving. Bekijk de documentatie van het framework en identificeer een basis-probe of -testsuite om te draaien.
# Example with Garak pip install garak garak --list_probes
Stap 2
Configureer het framework om een model aan te vallen dat je toestemming hebt om te testen. Stel API-credentials, rate limits en outputmappen in. Bepaal een kleine testscope: kies één aanvalscategorie (bijv. prompt injection) en beperk je tot 10-20 testgevallen.
Stap 3
Voer de geautomatiseerde testsuite uit en verzamel de resultaten. Houd tijdens de uitvoering rate-limitfouten, onverwachte fouten of overschrijdingen van API-kosten in de gaten.
# Example with Garak garak --model_type openai --model_name gpt-4 --probes encoding
Stap 4
Analyseer de uitvoer: bekijk de succes- en faalpercentages per categorie, identificeer eventuele geslaagde bypasses en noteer welke soorten payloads het meest effectief waren. Schrijf een korte samenvatting van je bevindingen.

Succescriteria: een afgeronde testrun met minstens één framework, met als resultaat een resultatenbestand met gecategoriseerde bevindingen en een geschreven samenvatting van de waargenomen succespercentages per aanvalscategorie.

Verwante onderwerpen

Overzicht AI-exploitontwikkeling -- de bredere context van exploitontwikkeling
Payload-crafting -- de seed-payloads maken die door automatisering worden opgeschaald
Custom tooling -- doelwitspecifieke automatiseringstools bouwen
CART-pipelines -- geavanceerde CART-architectuur en -deployment
Capstone: uitvoering & rapportage -- automatisering integreren in volledige engagements

Referenties

Perez et al., "Red Teaming Language Models with Language Models" (2022) -- LLM-gebaseerde geautomatiseerde redteaming
Samvelyan et al., "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" (2024) -- diversiteitsgerichte geautomatiseerde aanvalsgeneratie
Deng et al., "Garak: A Framework for Security Probing Large Language Models" (2024) -- open-source kwetsbaarheidsscans voor LLM's

Knowledge Check

Waarom is regressietesten belangrijk na modelupdates?

Automatisering van redteaming

Gevorderd7 min lezenBijgewerkt op 2026-03-12

Frameworks en tools voor het op schaal automatiseren van AI-redteaming, inclusief CART-pipelines, jailbreak-fuzzing, regressietests en continue monitoring.

automation cart fuzzing testing exploit-dev

CART-architectuur

Een Continuous Automated Red Teaming-pipeline:

Payload-generatie → Uitvoeringsengine → Resultaatanalyse → Rapportage
       ↓                    ↓                  ↓                ↓
  Templatebibliotheek  API-integratie      Succesdetectie    Dashboard
  Mutatie-engine       Ratemanagement      Classificatie     Meldingen
  LLM-gebaseerde gen   Parallelle uitv.    Statistische tests Trends

Een basis-CART-pipeline bouwen

import asyncio
from dataclasses import dataclass
from typing import Callable
 
@dataclass
class TestCase:
    payload: str
    category: str  # "injection", "jailbreak", "extraction", etc.
    expected_behavior: str  # "refuse", "comply", "extract"
    detector: Callable  # Function to classify the response
 
@dataclass
class TestResult:
    test_case: TestCase
    response: str
    success: bool
    latency_ms: float
    timestamp: str
 
class CARTPipeline:
    def __init__(self, target_api, test_suite: list[TestCase]):
        self.target = target_api
        self.tests = test_suite
        self.results: list[TestResult] = []
 
    async def run_suite(self, n_trials_per_test: int = 10):
        """Run all tests with statistical significance."""
        for test in self.tests:
            for _ in range(n_trials_per_test):
                response = await self.target.query(test.payload)
                success = test.detector(response)
                self.results.append(TestResult(
                    test_case=test,
                    response=response,
                    success=success,
                    latency_ms=response.latency,
                    timestamp=datetime.now().isoformat(),
                ))
 
    def report(self) -> dict:
        """Generate success rates by category."""
        from collections import defaultdict
        stats = defaultdict(lambda: {"total": 0, "success": 0})
        for r in self.results:
            stats[r.test_case.category]["total"] += 1
            if r.success:
                stats[r.test_case.category]["success"] += 1
        return {k: v["success"]/v["total"] for k, v in stats.items()}

Jailbreak-fuzzing

Geautomatiseerde generatie en het testen van jailbreak-varianten via fuzzing:

class JailbreakFuzzer:
    """Generate jailbreak variants through mutation."""
 
    def __init__(self, seed_prompts: list[str]):
        self.seeds = seed_prompts
        self.mutations = [
            self.synonym_replace,
            self.add_context,
            self.change_framing,
            self.add_encoding,
            self.split_payload,
            self.add_few_shot,
        ]
 
    def synonym_replace(self, prompt: str) -> str:
        """Replace key terms with synonyms."""
        replacements = {
            "ignore": ["disregard", "bypass", "override", "skip"],
            "instructions": ["rules", "guidelines", "directives", "constraints"],
            "previous": ["earlier", "prior", "above", "initial"],
        }
        for word, synonyms in replacements.items():
            if word in prompt.lower():
                prompt = prompt.replace(word, random.choice(synonyms))
        return prompt
 
    def add_encoding(self, prompt: str) -> str:
        """Wrap payload in various encodings."""
        encodings = [
            lambda p: f"Decode and follow: {base64.b64encode(p.encode()).decode()}",
            lambda p: f"ROT13: {codecs.encode(p, 'rot_13')}",
            lambda p: f"Reverse: {p[::-1]}",
        ]
        return random.choice(encodings)(prompt)
 
    def fuzz(self, n_variants: int = 100) -> list[str]:
        """Generate n variant payloads."""
        variants = []
        for _ in range(n_variants):
            base = random.choice(self.seeds)
            n_mutations = random.randint(1, 3)
            for _ in range(n_mutations):
                mutation = random.choice(self.mutations)
                base = mutation(base)
            variants.append(base)
        return variants

Regressietests

Wanneer modellen worden bijgewerkt, kunnen eerder verholpen kwetsbaarheden opnieuw verschijnen en kunnen nieuwe ontstaan:

class RegressionSuite:
    """Test known vulnerabilities after model updates."""
 
    def __init__(self, known_vulns: list[TestCase]):
        self.vulns = known_vulns
 
    def check_regression(self, target_api) -> dict:
        """Test all known vulnerabilities against current model."""
        results = {}
        for vuln in self.vulns:
            success_rate = measure_success_rate(vuln.payload, target_api)
            results[vuln.category] = {
                "previous_rate": vuln.last_known_rate,
                "current_rate": success_rate,
                "regression": success_rate > vuln.last_known_rate,
            }
        return results

Open-source tools

Tool	Doel	Link
Garak	Kwetsbaarheidsscanner voor LLM's	github.com/leondz/garak
PyRIT	Redteaming-framework van Microsoft	github.com/Azure/PyRIT
Promptfoo	Prompt-testen en -evaluatie	github.com/promptfoo/promptfoo
ART	Adversarial Robustness Toolbox	github.com/Trusted-AI/adversarial-robustness-toolbox

Probeer het zelf

Practice

Oefening: een basis geautomatiseerde redteam-test opzetten

Stap 1
Kies een van de open-source frameworks uit de tabel hierboven (Garak, PyRIT of Promptfoo) en installeer het in een virtuele omgeving. Bekijk de documentatie van het framework en identificeer een basis-probe of -testsuite om te draaien.
# Example with Garak pip install garak garak --list_probes
Stap 2
Configureer het framework om een model aan te vallen dat je toestemming hebt om te testen. Stel API-credentials, rate limits en outputmappen in. Bepaal een kleine testscope: kies één aanvalscategorie (bijv. prompt injection) en beperk je tot 10-20 testgevallen.
Stap 3
Voer de geautomatiseerde testsuite uit en verzamel de resultaten. Houd tijdens de uitvoering rate-limitfouten, onverwachte fouten of overschrijdingen van API-kosten in de gaten.
# Example with Garak garak --model_type openai --model_name gpt-4 --probes encoding
Stap 4
Analyseer de uitvoer: bekijk de succes- en faalpercentages per categorie, identificeer eventuele geslaagde bypasses en noteer welke soorten payloads het meest effectief waren. Schrijf een korte samenvatting van je bevindingen.

Verwante onderwerpen

Overzicht AI-exploitontwikkeling -- de bredere context van exploitontwikkeling
Payload-crafting -- de seed-payloads maken die door automatisering worden opgeschaald
Custom tooling -- doelwitspecifieke automatiseringstools bouwen
CART-pipelines -- geavanceerde CART-architectuur en -deployment
Capstone: uitvoering & rapportage -- automatisering integreren in volledige engagements

Referenties

Perez et al., "Red Teaming Language Models with Language Models" (2022) -- LLM-gebaseerde geautomatiseerde redteaming
Samvelyan et al., "Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts" (2024) -- diversiteitsgerichte geautomatiseerde aanvalsgeneratie
Deng et al., "Garak: A Framework for Security Probing Large Language Models" (2024) -- open-source kwetsbaarheidsscans voor LLM's

Knowledge Check

Waarom is regressietesten belangrijk na modelupdates?

Automatisering van redteaming

Gerelateerde artikelen

Automatisering van redteaming

Gerelateerde artikelen