紅隊自動化
進階4 分鐘閱讀更新於 2026-03-12
大規模自動化 AI 紅隊的框架與工具,涵蓋 CART 管線、越獄模糊測試、回歸測試與持續監控。
手動紅隊對於創造性攻擊發掘至關重要,但難以擴展。CART 與其他自動化框架能實現持續測試、偵測退化,並橫跨攻擊類別達成廣泛涵蓋。專業級 AI 紅隊結合手動創造性與自動化規模。
CART 架構
持續自動化紅隊管線:
Payload 產生 → 執行引擎 → 結果分析 → 回報
↓ ↓ ↓ ↓
範本資料庫 API 整合 成功偵測 儀表板
突變引擎 速率管理 分類 告警
LLM 式生成 並行執行 統計檢定 趨勢
建置基本 CART 管線
import asyncio
from dataclasses import dataclass
from typing import Callable
@dataclass
class TestCase:
payload: str
category: str # "injection"、"jailbreak"、"extraction" 等
expected_behavior: str # "refuse"、"comply"、"extract"
detector: Callable # 用以分類回應的函式
@dataclass
class TestResult:
test_case: TestCase
response: str
success: bool
latency_ms: float
timestamp: str
class CARTPipeline:
def __init__(self, target_api, test_suite: list[TestCase]):
self.target = target_api
self.tests = test_suite
self.results: list[TestResult] = []
async def run_suite(self, n_trials_per_test: int = 10):
"""以統計顯著度執行所有測試。"""
for test in self.tests:
for _ in range(n_trials_per_test):
response = await self.target.query(test.payload)
success = test.detector(response)
self.results.append(TestResult(
test_case=test,
response=response,
success=success,
latency_ms=response.latency,
timestamp=datetime.now().isoformat(),
))
def report(self) -> dict:
"""依類別產生成功率。"""
from collections import defaultdict
stats = defaultdict(lambda: {"total": 0, "success": 0})
for r in self.results:
stats[r.test_case.category]["total"] += 1
if r.success:
stats[r.test_case.category]["success"] += 1
return {k: v["success"]/v["total"] for k, v in stats.items()}越獄模糊測試
透過模糊測試自動產生與測試越獄變體:
class JailbreakFuzzer:
"""以突變產生越獄變體。"""
def __init__(self, seed_prompts: list[str]):
self.seeds = seed_prompts
self.mutations = [
self.synonym_replace,
self.add_context,
self.change_framing,
self.add_encoding,
self.split_payload,
self.add_few_shot,
]
def synonym_replace(self, prompt: str) -> str:
"""以同義詞替換關鍵詞。"""
replacements = {
"ignore": ["disregard", "bypass", "override", "skip"],
"instructions": ["rules", "guidelines", "directives", "constraints"],
"previous": ["earlier", "prior", "above", "initial"],
}
for word, synonyms in replacements.items():
if word in prompt.lower():
prompt = prompt.replace(word, random.choice(synonyms))
return prompt
def add_encoding(self, prompt: str) -> str:
"""將 payload 包裝為各種編碼。"""
encodings = [
lambda p: f"Decode and follow: {base64.b64encode(p.encode()).decode()}",
lambda p: f"ROT13: {codecs.encode(p, 'rot_13')}",
lambda p: f"Reverse: {p[::-1]}",
]
return random.choice(encodings)(prompt)
def fuzz(self, n_variants: int = 100) -> list[str]:
"""產生 n 個變體 payload。"""
variants = []
for _ in range(n_variants):
base = random.choice(self.seeds)
n_mutations = random.randint(1, 3)
for _ in range(n_mutations):
mutation = random.choice(self.mutations)
base = mutation(base)
variants.append(base)
return variants回歸測試
模型更新時,先前被修補的漏洞可能重現、新漏洞亦可能出現:
class RegressionSuite:
"""模型更新後測試已知漏洞。"""
def __init__(self, known_vulns: list[TestCase]):
self.vulns = known_vulns
def check_regression(self, target_api) -> dict:
"""對當前模型測試所有已知漏洞。"""
results = {}
for vuln in self.vulns:
success_rate = measure_success_rate(vuln.payload, target_api)
results[vuln.category] = {
"previous_rate": vuln.last_known_rate,
"current_rate": success_rate,
"regression": success_rate > vuln.last_known_rate,
}
return results開源工具
| 工具 | 用途 | 連結 |
|---|---|---|
| Garak | LLM 漏洞掃描器 | github.com/leondz/garak |
| PyRIT | Microsoft 紅隊框架 | github.com/Azure/PyRIT |
| Promptfoo | 提示測試與評估 | github.com/promptfoo/promptfoo |
| ART | Adversarial Robustness Toolbox | github.com/Trusted-AI/adversarial-robustness-toolbox |
動手試試
相關主題
- AI Exploit 開發概觀 -- 更廣的 exploit 開發脈絡
- Payload 打造 -- 自動化所擴大的種子 payload
- 自製工具 -- 打造目標特定的自動化工具
- CART 管線 -- 進階 CART 架構與部署
- Capstone:執行與報告 -- 於完整委任中整合自動化
參考資料
- Perez et al.,"Red Teaming Language Models with Language Models"(2022)-- 以 LLM 為本的自動化紅隊
- Samvelyan et al.,"Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts"(2024)-- 以多樣性為焦點的自動攻擊生成
- Deng et al.,"Garak: A Framework for Security Probing Large Language Models"(2024)-- 開源 LLM 漏洞掃描
Knowledge Check
為什麼模型更新後回歸測試很重要?