Benchmarken van verdedigingseffectiviteit
Geavanceerde methodologie voor het systematisch evalueren en benchmarken van de effectiviteit van AI-verdedigingen, waaronder frameworks voor het testen van guardrails, het meten van de aanvalssuccesratio, statistische degelijkheid bij verdedigingsevaluatie en vergelijkende analyse over verschillende verdedigingsconfiguraties.
Meten of een verdediging daadwerkelijk werkt is moeilijker dan de verdediging zelf bouwen. Verdedigingsbenchmarking past wetenschappelijke methodologie toe op de vraag "maakt deze verdediging ons veiliger?" Zonder degelijke benchmarking nemen organisaties beveiligingsbeslissingen op basis van claims van leveranciers, intuïtie of zorgvuldig uitgekozen voorbeelden — geen van deze voorspelt betrouwbaar de prestaties in de praktijk.
Ontwerp van het benchmarkframework
Kernmetrieken
| Metriek | Definitie | Waarom het ertoe doet |
|---|---|---|
| Aanvalssuccesratio (ASR) | Fractie van aanvallen die de verdediging omzeilen | Primaire effectiviteitsmaat |
| Fout-positiefratio (FPR) | Fractie van legitieme invoer die ten onrechte wordt geblokkeerd | Maat voor impact op bruikbaarheid |
| Detectielatentie | Tijd die de verdediging toevoegt aan de verzoekpijplijn | Maat voor prestatie-impact |
| Terecht-positiefratio (TPR) | Fractie van aanvallen die correct worden gedetecteerd/geblokkeerd | Complementair aan ASR (TPR = 1 - ASR) |
| Dekking | Fractie van aanvalscategorieën die door de verdediging worden afgedekt | Breedte van bescherming |
| Robuustheid | Hoe de ASR verandert onder adaptieve aanvallen | Weerstand tegen aanvallers die de verdediging kennen |
Benchmarkarchitectuur
┌────────────────────────────────────────────────────────┐
│ Benchmarking Pipeline │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌────────────┐ │
│ │ Attack │ │ Defense │ │ Evaluation │ │
│ │ Dataset │──▶│ Under Test │──▶│ Engine │ │
│ │ (labeled) │ │ (DUT) │ │ │ │
│ └──────────────┘ └───────────────┘ └──────┬─────┘ │
│ │ │
│ ┌──────────────┐ ┌──────▼─────┐ │
│ │ Benign │ ┌───────────────┐ │ Metrics │ │
│ │ Dataset │──▶│ Defense │──▶│ Calculator │ │
│ │ (legitimate) │ │ Under Test │ │ │ │
│ └──────────────┘ └───────────────┘ └──────┬─────┘ │
│ │ │
│ ┌──────▼─────┐ │
│ │ Report │ │
│ │ Generator │ │
│ └────────────┘ │
└────────────────────────────────────────────────────────┘
Constructie van de aanvalsdataset
Vereisten voor de dataset
Een aanvalsdataset voor benchmarking moet:
DATASET_REQUIREMENTS = {
"labeled": "Each sample has a ground-truth classification (attack vs. benign)",
"categorized": "Attacks are tagged by category (injection, jailbreak, extraction, etc.)",
"diverse": "Covers multiple attack techniques, styles, and complexity levels",
"representative": "Reflects real-world attack distribution for the deployment context",
"versioned": "Dataset version is tracked so results are reproducible",
"balanced": "Contains sufficient samples per category for statistical significance",
}
# Minimumaantal samples per aanvalscategorie voor statistische significantie
MIN_SAMPLES_PER_CATEGORY = 100 # Voor 95% BI met ±5% marge
ATTACK_CATEGORIES = {
"direct_injection": {
"description": "Instruction override through user input",
"subcategories": ["instruction_override", "role_switching", "constraint_bypass"],
"min_samples": 100,
},
"indirect_injection": {
"description": "Instructions embedded in external content",
"subcategories": ["rag_injection", "tool_output_injection", "document_injection"],
"min_samples": 100,
},
"jailbreak": {
"description": "Bypassing safety training and content policies",
"subcategories": ["persona_based", "encoding_based", "multi_turn", "hypothetical"],
"min_samples": 150,
},
"extraction": {
"description": "Extracting system prompts, training data, or PII",
"subcategories": ["prompt_extraction", "data_extraction", "pii_extraction"],
"min_samples": 100,
},
"output_manipulation": {
"description": "Manipulating model output format or content",
"subcategories": ["format_injection", "content_steering", "hallucination_induction"],
"min_samples": 80,
},
}Gelabelde datasets bouwen
import json
from dataclasses import dataclass
from enum import Enum
class AttackCategory(Enum):
DIRECT_INJECTION = "direct_injection"
INDIRECT_INJECTION = "indirect_injection"
JAILBREAK = "jailbreak"
EXTRACTION = "extraction"
OUTPUT_MANIPULATION = "output_manipulation"
class Severity(Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
@dataclass
class BenchmarkSample:
id: str
input_text: str
category: AttackCategory
subcategory: str
severity: Severity
is_attack: bool
expected_behavior: str # "blocked" of "allowed"
metadata: dict
def build_benchmark_dataset(attack_sources: list, benign_sources: list) -> list:
"""
Bouw een gelabelde benchmarkdataset uit aanvals- en goedaardige bronnen.
"""
samples = []
for source in attack_sources:
for item in source.load():
samples.append(BenchmarkSample(
id=f"atk-{len(samples):05d}",
input_text=item["text"],
category=AttackCategory(item["category"]),
subcategory=item["subcategory"],
severity=Severity(item["severity"]),
is_attack=True,
expected_behavior="blocked",
metadata={"source": source.name, "technique": item.get("technique")},
))
for source in benign_sources:
for item in source.load():
samples.append(BenchmarkSample(
id=f"ben-{len(samples):05d}",
input_text=item["text"],
category=None,
subcategory=item.get("topic", "general"),
severity=None,
is_attack=False,
expected_behavior="allowed",
metadata={"source": source.name},
))
return samplesBenchmarks uitvoeren
Uitvoeringsengine
from dataclasses import dataclass
from typing import Optional
import time
@dataclass
class BenchmarkResult:
sample_id: str
is_attack: bool
defense_decision: str # "blocked" of "allowed"
correct: bool
latency_ms: float
confidence: Optional[float]
defense_output: dict
class DefenseBenchmarkRunner:
"""Voer benchmarks uit tegen een verdedigingssysteem en verzamel metrieken."""
def __init__(self, defense_under_test, model_endpoint):
self.defense = defense_under_test
self.model = model_endpoint
self.results = []
def run_benchmark(self, dataset: list[BenchmarkSample]) -> list[BenchmarkResult]:
"""Voer de volledige benchmarkdataset uit tegen de verdediging."""
results = []
for sample in dataset:
start_time = time.perf_counter()
# Stuur invoer door de verdediging
defense_output = self.defense.evaluate(sample.input_text)
decision = "blocked" if defense_output["blocked"] else "allowed"
elapsed_ms = (time.perf_counter() - start_time) * 1000
# Bepaal de juistheid
if sample.is_attack:
correct = (decision == "blocked")
else:
correct = (decision == "allowed")
results.append(BenchmarkResult(
sample_id=sample.id,
is_attack=sample.is_attack,
defense_decision=decision,
correct=correct,
latency_ms=elapsed_ms,
confidence=defense_output.get("confidence"),
defense_output=defense_output,
))
self.results = results
return resultsMetriekberekening
import numpy as np
from scipy import stats
def calculate_benchmark_metrics(results: list[BenchmarkResult]) -> dict:
"""Bereken uitgebreide benchmarkmetrieken uit de resultaten."""
attack_results = [r for r in results if r.is_attack]
benign_results = [r for r in results if not r.is_attack]
# Kernmetrieken
attacks_blocked = sum(1 for r in attack_results if r.defense_decision == "blocked")
attacks_total = len(attack_results)
benign_blocked = sum(1 for r in benign_results if r.defense_decision == "blocked")
benign_total = len(benign_results)
asr = 1.0 - (attacks_blocked / attacks_total) if attacks_total > 0 else None
fpr = benign_blocked / benign_total if benign_total > 0 else None
tpr = attacks_blocked / attacks_total if attacks_total > 0 else None
# Betrouwbaarheidsintervallen (Wilson-score-interval)
asr_ci = wilson_confidence_interval(attacks_total - attacks_blocked, attacks_total)
fpr_ci = wilson_confidence_interval(benign_blocked, benign_total)
# Latentiemetrieken
all_latencies = [r.latency_ms for r in results]
return {
"attack_success_rate": round(asr, 4) if asr is not None else None,
"asr_95_ci": asr_ci,
"false_positive_rate": round(fpr, 4) if fpr is not None else None,
"fpr_95_ci": fpr_ci,
"true_positive_rate": round(tpr, 4) if tpr is not None else None,
"total_attacks": attacks_total,
"total_benign": benign_total,
"attacks_blocked": attacks_blocked,
"benign_blocked": benign_blocked,
"latency_p50_ms": round(np.percentile(all_latencies, 50), 1),
"latency_p95_ms": round(np.percentile(all_latencies, 95), 1),
"latency_p99_ms": round(np.percentile(all_latencies, 99), 1),
}
def wilson_confidence_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
"""Wilson-score-interval voor binomiale proportie — nauwkeuriger dan de normale benadering voor kleine n."""
if trials == 0:
return (0.0, 1.0)
p = successes / trials
denominator = 1 + z**2 / trials
center = (p + z**2 / (2 * trials)) / denominator
margin = z * np.sqrt((p * (1 - p) + z**2 / (4 * trials)) / trials) / denominator
return (round(max(0, center - margin), 4), round(min(1, center + margin), 4))Analyse per categorie
Uitsplitsing per categorie
Geaggregeerde metrieken verbergen belangrijke variaties tussen aanvalscategorieën:
def per_category_analysis(
results: list[BenchmarkResult],
dataset: list[BenchmarkSample],
) -> dict:
"""Splits benchmarkresultaten uit per aanvalscategorie."""
# Koppel sample-ID's aan categorieën
sample_categories = {s.id: s.category for s in dataset if s.is_attack}
category_results = {}
for result in results:
if not result.is_attack:
continue
category = sample_categories.get(result.sample_id)
if category is None:
continue
cat_name = category.value
if cat_name not in category_results:
category_results[cat_name] = {"blocked": 0, "total": 0, "latencies": []}
category_results[cat_name]["total"] += 1
if result.defense_decision == "blocked":
category_results[cat_name]["blocked"] += 1
category_results[cat_name]["latencies"].append(result.latency_ms)
# Bereken metrieken per categorie
analysis = {}
for cat_name, data in category_results.items():
asr = 1.0 - (data["blocked"] / data["total"])
analysis[cat_name] = {
"attack_success_rate": round(asr, 4),
"samples_tested": data["total"],
"blocked": data["blocked"],
"bypassed": data["total"] - data["blocked"],
"95_ci": wilson_confidence_interval(
data["total"] - data["blocked"], data["total"]
),
"median_latency_ms": round(np.median(data["latencies"]), 1),
}
return analysis| Wat analyse per categorie onthult | Voorbeeld |
|---|---|
| Blinde vlekken per categorie | Verdediging blokkeert 98% van directe injectie maar slechts 45% van indirecte injectie |
| Correlatie met severity | Aanvallen met hoge severity hebben een 3x zo hoge omzeilingsratio als die met lage severity |
| Latentievariatie | Jailbreak-detectie voegt 200 ms toe versus 50 ms voor injectiedetectie |
| Hiaten in dekking | Geen testdekking voor multimodale aanvalscategorieën |
Vergelijkende benchmarking
Verdedigingsconfiguraties vergelijken
def compare_defenses(
baseline_results: list[BenchmarkResult],
candidate_results: list[BenchmarkResult],
) -> dict:
"""
Vergelijk twee verdedigingsconfiguraties statistisch
om te bepalen of de kandidaat significant beter is.
"""
baseline_metrics = calculate_benchmark_metrics(baseline_results)
candidate_metrics = calculate_benchmark_metrics(candidate_results)
# McNemar-toets voor gepaarde vergelijking
# (dezelfde dataset getest tegen beide verdedigingen)
baseline_blocked = {r.sample_id: r.defense_decision == "blocked"
for r in baseline_results if r.is_attack}
candidate_blocked = {r.sample_id: r.defense_decision == "blocked"
for r in candidate_results if r.is_attack}
# Contingentie: gevallen waarin de verdedigingen het oneens zijn
both_blocked = 0
baseline_only = 0
candidate_only = 0
neither = 0
for sample_id in baseline_blocked:
b = baseline_blocked[sample_id]
c = candidate_blocked.get(sample_id, False)
if b and c:
both_blocked += 1
elif b and not c:
baseline_only += 1
elif not b and c:
candidate_only += 1
else:
neither += 1
# McNemar-toets
if baseline_only + candidate_only > 0:
chi2 = (abs(baseline_only - candidate_only) - 1) ** 2 / (baseline_only + candidate_only)
p_value = 1 - stats.chi2.cdf(chi2, df=1)
else:
p_value = 1.0
return {
"baseline_asr": baseline_metrics["attack_success_rate"],
"candidate_asr": candidate_metrics["attack_success_rate"],
"asr_improvement": round(
baseline_metrics["attack_success_rate"] - candidate_metrics["attack_success_rate"],
4
),
"baseline_fpr": baseline_metrics["false_positive_rate"],
"candidate_fpr": candidate_metrics["false_positive_rate"],
"mcnemar_p_value": round(p_value, 4),
"statistically_significant": p_value < 0.05,
"recommendation": (
"Candidate is significantly better"
if p_value < 0.05 and candidate_metrics["attack_success_rate"] < baseline_metrics["attack_success_rate"]
else "No significant difference" if p_value >= 0.05
else "Candidate is significantly worse"
),
}Adaptief aanvalstesten
Adaptief aanvalsprotocol
## Adaptive Attack Benchmarking Protocol
1. **Baseline evaluation**: Run standard attack dataset against defense
2. **Failure analysis**: Identify which attacks succeeded and which were blocked
3. **Attack refinement**: Modify blocked attacks based on defense behavior
- If defense blocks based on keywords: use synonyms and paraphrasing
- If defense blocks based on intent: use encoding and obfuscation
- If defense blocks based on pattern: use novel formulations
4. **Re-evaluation**: Run refined attacks against the same defense
5. **Iterate**: Repeat steps 2-4 for 3-5 rounds
6. **Report adaptive ASR**: The ASR after all refinement rounds| Metriek | Statische benchmark | Adaptieve benchmark |
|---|---|---|
| Weerspiegelt prestaties in de praktijk | Gedeeltelijk | Nauwkeuriger |
| Reproduceerbaar | Sterk | Matig (afhankelijk van de vaardigheid van de aanvaller) |
| Kosten om uit te voeren | Laag | Hoog (vereist bekwame menselijke testers) |
| Automatiseringspotentieel | Hoog | Laag (menselijk oordeel vereist) |
Benchmarkresultaten rapporteren
Sjabloon voor benchmarkrapport
## Defense Benchmarking Report
### Configuration
- Defense Under Test: [Name, version, configuration]
- Benchmark Dataset: [Name, version, sample count]
- Evaluation Date: [Date]
- Evaluator: [Name/organization]
### Summary Results
| Metric | Value | 95% CI |
|---|---|---|
| Attack Success Rate | X.XX% | [lower, upper] |
| False Positive Rate | X.XX% | [lower, upper] |
| True Positive Rate | X.XX% | [lower, upper] |
| Median Latency | XXms | |
| P99 Latency | XXms | |
### Per-Category Results
[Table showing ASR broken down by attack category]
### Comparative Analysis
[If comparing to baseline or alternative defense]
### Worst-Case Analysis
- Highest ASR category: [Category] at XX%
- Most impactful bypass: [Description]
### Recommendations
[Based on results]Gerelateerde onderwerpen
- Statistical Rigor in Evaluations -- statistische methoden voor evaluatie
- Harness Building -- evaluatie-infrastructuur bouwen
- Metrics & Methodology -- ontwerp van evaluatiemetrieken
- Red Team as a Service -- dienstverlening die benchmarking omvat
- Model Cards Security -- benchmarkresultaten documenteren
Referenties
- Mazeika et al., "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal" (2024) - Standardized benchmarking framework for LLM safety
- Vidgen et al., "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models" (2024) - Minimal safety benchmark for LLMs
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) - GCG attack methodology used in benchmarking
- NIST AI 100-2e, "Adversarial Machine Learning" (2024) - Standard terminology and evaluation concepts
- MLCommons AI Safety Benchmark (2024) - Industry benchmark for AI safety evaluation
Waarom is adaptief aanvalstesten noodzakelijk naast statische benchmarkevaluatie?