Comparing Vulnerability Profiles Across Models with Garak
Intermediate walkthrough on using garak to run identical vulnerability scans across multiple models, comparing results to understand relative security postures and make informed model selection decisions.
When selecting which LLM to deploy in production, security posture should be a key evaluation criterion alongside capability, cost, and latency. Models differ in their safety training approaches, and an attack that one model resists reliably may succeed against another almost every time. This walkthrough teaches you to run controlled comparisons and produce evidence-based security assessments.
Step 1: Designing a Comparison Scan
The key to meaningful comparison is consistency. Every model must face the exact same probes, in the same order, with the same detectors. Start by defining a standardized probe set in a YAML configuration template:
# comparison-probes.yaml
plugins:
probes:
# Prompt injection family
- promptinject.HijackHateHumansMini
- promptinject.HijackKillHumansMini
- promptinject.HijackLongPrompt
# Jailbreak family
- dan.Dan_11_0
- dan.Dan_6_0
- dan.Dan_6_2
# Encoding-based attacks
- encoding.InjectBase64
- encoding.InjectROT13
- encoding.InjectMorse
# Knowledge extraction
- knowledgegraph.WhoIsRelatedTo
- leakreplay.LiteratureCloze80
# Toxicity elicitation
- realtoxicityprompts.RTPBlank
extended_detectors:
- toxicity.ToxicCommentModel

This set covers five major attack categories, giving you a broad security profile with manageable scan times (typically 10-20 minutes per model).
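Before committing scan time, it can help to confirm the set covers the families you expect. A quick tally sketch — the `PROBES` list simply mirrors the YAML above and must be kept in sync by hand:

```python
# Tally configured probes by family (the prefix before the first dot).
# PROBES mirrors comparison-probes.yaml -- keep the two in sync.
from collections import Counter

PROBES = [
    "promptinject.HijackHateHumansMini", "promptinject.HijackKillHumansMini",
    "promptinject.HijackLongPrompt",
    "dan.Dan_11_0", "dan.Dan_6_0", "dan.Dan_6_2",
    "encoding.InjectBase64", "encoding.InjectROT13", "encoding.InjectMorse",
    "knowledgegraph.WhoIsRelatedTo", "leakreplay.LiteratureCloze80",
    "realtoxicityprompts.RTPBlank",
]

def family_counts(probes: list[str]) -> Counter:
    """Count probes per family, e.g. 'dan.Dan_11_0' -> 'dan'."""
    return Counter(p.split(".", 1)[0] for p in probes)

print(family_counts(PROBES))
```

A lopsided tally (say, eight jailbreak probes and one injection probe) would skew the overall failure rates before any model is even scanned.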
Step 2: Preparing Multiple Targets
Set up the models you want to compare. For a meaningful comparison, choose models that are realistic candidates for your use case:
# Pull local models via Ollama
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull gemma2:9b
ollama pull phi3:mini
# Verify all models are available
ollama list

For API-based models, ensure your keys are configured:
export OPENAI_API_KEY="sk-your-key"
export ANTHROPIC_API_KEY="sk-ant-your-key"

Create a model inventory file to drive automated scanning:
# models.py
"""Model inventory for comparison scanning."""
MODELS = [
{
"name": "llama3.1-8b",
"model_type": "ollama",
"model_name": "llama3.1:8b",
},
{
"name": "mistral-7b",
"model_type": "ollama",
"model_name": "mistral:7b",
},
{
"name": "gemma2-9b",
"model_type": "ollama",
"model_name": "gemma2:9b",
},
{
"name": "gpt-4o-mini",
"model_type": "openai",
"model_name": "gpt-4o-mini",
},
]

Step 3: Automating Multi-Model Scans
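Long multi-model runs fail late if a credential is missing, so a quick preflight check pays off. A sketch — `missing_keys` and `REQUIRED_ENV` are hypothetical helpers for this walkthrough's inventory, not part of garak; Ollama targets need no key:

```python
# preflight.py -- fail fast if a scan target's API key is unset.
# REQUIRED_ENV maps model_type values from models.py to the env var they need.
import os

REQUIRED_ENV = {"openai": "OPENAI_API_KEY"}

def missing_keys(models: list[dict]) -> list[str]:
    """Return env var names required by the inventory but not set."""
    needed = {REQUIRED_ENV[m["model_type"]]
              for m in models if m["model_type"] in REQUIRED_ENV}
    return sorted(v for v in needed if not os.environ.get(v))
```

Run it against the `MODELS` list from models.py before launching the comparison; a non-empty result means the corresponding scan would fail partway through the run.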
Write a script that runs identical scans across all models:
#!/usr/bin/env python3
# run_comparison.py
"""Run identical garak scans across multiple models."""
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
from models import MODELS
PROBE_CONFIG = "comparison-probes.yaml"
OUTPUT_DIR = Path("comparison_results")
OUTPUT_DIR.mkdir(exist_ok=True)
def run_scan(model_config: dict) -> dict:
"""Run a garak scan against a single model and return metadata."""
name = model_config["name"]
report_prefix = f"{OUTPUT_DIR}/{name}_{datetime.now():%Y%m%d_%H%M}"
cmd = [
"garak",
"--model_type", model_config["model_type"],
"--model_name", model_config["model_name"],
"--config", PROBE_CONFIG,
"--report_prefix", report_prefix,
]
print(f"\n{'='*60}")
print(f"Scanning: {name}")
print(f"Command: {' '.join(cmd)}")
print(f"{'='*60}\n")
start_time = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start_time
return {
"name": name,
"elapsed_seconds": elapsed,
"return_code": result.returncode,
"report_prefix": report_prefix,
"stdout": result.stdout,
"stderr": result.stderr,
}
def main():
results = []
for model in MODELS:
scan_result = run_scan(model)
results.append(scan_result)
print(f" Completed {model['name']} in {scan_result['elapsed_seconds']:.0f}s")
# Save scan metadata
import json
with open(OUTPUT_DIR / "scan_metadata.json", "w") as f:
json.dump(results, f, indent=2, default=str)
print(f"\nAll scans complete. Results in {OUTPUT_DIR}/")
if __name__ == "__main__":
main()

python run_comparison.py

Step 4: Parsing and Normalizing Results
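Before writing the parser, it helps to see the record shape it expects. A garak report mixes several entry types, one JSON object per line; the shapes below are simplified and illustrative, and field names can differ between garak versions, so spot-check one of your own reports first:

```python
import json
from collections import defaultdict

# Simplified report lines -- illustrative shapes, not verbatim garak records.
REPORT_LINES = [
    '{"entry_type": "start_run"}',
    '{"entry_type": "attempt", "probe": "dan.Dan_11_0", "status": "pass"}',
    '{"entry_type": "attempt", "probe": "dan.Dan_11_0", "status": "fail"}',
]

stats = defaultdict(lambda: {"pass": 0, "fail": 0, "total": 0})
for line in REPORT_LINES:
    entry = json.loads(line)
    # Only attempt entries carry per-probe outcomes; skip run metadata.
    if entry.get("entry_type") == "attempt" and "status" in entry:
        stats[entry["probe"]][entry["status"]] += 1
        stats[entry["probe"]]["total"] += 1

print(dict(stats))
```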
After all scans complete, parse the JSONL reports into a normalized comparison structure:
#!/usr/bin/env python3
# parse_results.py
"""Parse garak scan results into a comparison matrix."""
import json
import glob
from collections import defaultdict
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def parse_report(report_path: str) -> dict:
"""Parse a garak JSONL report into probe-level statistics."""
stats = defaultdict(lambda: {"pass": 0, "fail": 0, "total": 0})
with open(report_path) as f:
for line in f:
entry = json.loads(line)
if entry.get("entry_type") == "attempt" and "status" in entry:
probe = entry["probe"]
status = entry["status"]
stats[probe][status] = stats[probe].get(status, 0) + 1
stats[probe]["total"] += 1
return dict(stats)
def build_comparison_matrix():
"""Build a comparison matrix from all scan reports."""
# Load scan metadata
with open(OUTPUT_DIR / "scan_metadata.json") as f:
metadata = json.load(f)
matrix = {}
for scan in metadata:
name = scan["name"]
report_files = glob.glob(f"{scan['report_prefix']}*.report.jsonl")
if report_files:
matrix[name] = parse_report(report_files[0])
return matrix
def print_comparison(matrix: dict):
"""Print a human-readable comparison table."""
# Collect all probes across all models
all_probes = sorted(set(
probe for model_data in matrix.values()
for probe in model_data.keys()
))
models = sorted(matrix.keys())
# Header
header = f"{'Probe':<50}" + "".join(f"{m:>15}" for m in models)
print(header)
print("-" * len(header))
# Rows
for probe in all_probes:
row = f"{probe:<50}"
for model in models:
data = matrix.get(model, {}).get(probe, {})
total = data.get("total", 0)
fails = data.get("fail", 0)
if total > 0:
fail_rate = fails / total * 100
row += f"{fail_rate:>14.1f}%"
else:
row += f"{'N/A':>15}"
print(row)
if __name__ == "__main__":
matrix = build_comparison_matrix()
print_comparison(matrix)
# Save as JSON for further analysis
with open(OUTPUT_DIR / "comparison_matrix.json", "w") as f:
json.dump(matrix, f, indent=2)

Step 5: Visualizing the Comparison
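Alongside a visual report, a flat CSV of the same matrix is handy for filtering and pivoting in spreadsheet tools. A small export sketch — `export_csv` is a hypothetical helper reading the comparison_matrix.json written in Step 4, not part of garak:

```python
# export_csv.py -- flatten the comparison matrix into a spreadsheet-friendly CSV.
import csv

def export_csv(matrix: dict, out_path: str) -> None:
    """Write one row per probe with a fail-rate column per model."""
    models = sorted(matrix)
    probes = sorted({p for data in matrix.values() for p in data})
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["probe"] + models)
        for probe in probes:
            row = [probe]
            for model in models:
                data = matrix.get(model, {}).get(probe, {})
                total = data.get("total", 0)
                # Blank cell when a probe never ran against this model.
                row.append(f"{data.get('fail', 0) / total:.3f}" if total else "")
            writer.writerow(row)
```

Call it as `export_csv(json.load(open("comparison_results/comparison_matrix.json")), "comparison_results/comparison_matrix.csv")`.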
Create a visual heatmap of vulnerability rates across models:
#!/usr/bin/env python3
# visualize_comparison.py
"""Generate visual comparison of model vulnerability profiles."""
import json
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def generate_html_report(matrix: dict):
"""Generate an HTML heatmap report."""
models = sorted(matrix.keys())
all_probes = sorted(set(
probe for model_data in matrix.values()
for probe in model_data.keys()
))
def color_for_rate(rate):
"""Return a CSS color based on failure rate."""
if rate == 0:
return "#2d6a2e" # Green
elif rate < 10:
return "#6a9f2d" # Light green
elif rate < 30:
return "#c9a82e" # Yellow
elif rate < 60:
return "#c96a2e" # Orange
else:
return "#c92e2e" # Red
html = """<!DOCTYPE html>
<html>
<head>
<title>Model Vulnerability Comparison</title>
<style>
body { font-family: monospace; margin: 20px; }
table { border-collapse: collapse; }
th, td { padding: 8px 12px; border: 1px solid #ddd; text-align: center; }
th { background: #333; color: white; }
td.probe-name { text-align: left; font-size: 12px; }
.legend { margin: 20px 0; }
.legend span { padding: 4px 12px; margin-right: 4px; color: white; }
</style>
</head>
<body>
<h1>Model Vulnerability Comparison</h1>
<div class="legend">
<span style="background: #2d6a2e">0%</span>
<span style="background: #6a9f2d">&lt;10%</span>
<span style="background: #c9a82e">&lt;30%</span>
<span style="background: #c96a2e">&lt;60%</span>
<span style="background: #c92e2e">60%+</span>
Failure Rate
</div>
<table>
<tr><th>Probe</th>"""
for model in models:
html += f"<th>{model}</th>"
html += "</tr>\n"
for probe in all_probes:
html += f'<tr><td class="probe-name">{probe}</td>'
for model in models:
data = matrix.get(model, {}).get(probe, {})
total = data.get("total", 0)
fails = data.get("fail", 0)
if total > 0:
rate = fails / total * 100
color = color_for_rate(rate)
html += f'<td style="background:{color};color:white">{rate:.0f}%</td>'
else:
html += '<td style="background:#888;color:white">N/A</td>'
html += "</tr>\n"
html += "</table></body></html>"
output_path = OUTPUT_DIR / "comparison_report.html"
with open(output_path, "w") as f:
f.write(html)
print(f"Report generated: {output_path}")
if __name__ == "__main__":
with open(OUTPUT_DIR / "comparison_matrix.json") as f:
matrix = json.load(f)
generate_html_report(matrix)

Step 6: Statistical Analysis of Results
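Failure rates on small samples are noisy: 2 failures out of 10 attempts and 200 out of 1,000 are both 20%, but the strength of evidence differs enormously. A Wilson score interval makes that visible — this is standard statistics, not something garak computes for you:

```python
# Wilson score interval for a binomial proportion (e.g. a probe failure rate).
import math

def wilson_interval(fails: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for fails/total; (0, 0) when no attempts ran."""
    if total == 0:
        return (0.0, 0.0)
    p = fails / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (max(0.0, center - margin), min(1.0, center + margin))

low, high = wilson_interval(2, 10)
print(f"2/10 failures: {low:.2f}-{high:.2f}")      # wide interval
low, high = wilson_interval(200, 1000)
print(f"200/1000 failures: {low:.2f}-{high:.2f}")  # much tighter
```

When two models' intervals overlap heavily for a category, the scan has not demonstrated a real difference; consider adding probes or generations before drawing conclusions.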
Go beyond simple failure rates with statistical comparison:
#!/usr/bin/env python3
# analyze_comparison.py
"""Statistical analysis of model vulnerability comparison."""
import json
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def compute_category_scores(matrix: dict) -> dict:
"""Aggregate probe results into category-level scores."""
CATEGORIES = {
"Prompt Injection": ["promptinject."],
"Jailbreak": ["dan."],
"Encoding Attacks": ["encoding."],
"Knowledge Extraction": ["knowledgegraph.", "leakreplay."],
"Toxicity": ["realtoxicityprompts."],
}
category_scores = {}
for model, probes in matrix.items():
category_scores[model] = {}
for category, prefixes in CATEGORIES.items():
total = 0
fails = 0
for probe, data in probes.items():
if any(probe.startswith(p) for p in prefixes):
total += data.get("total", 0)
fails += data.get("fail", 0)
if total > 0:
category_scores[model][category] = {
"fail_rate": fails / total,
"total_attempts": total,
"failures": fails,
}
else:
category_scores[model][category] = None
return category_scores
def generate_summary(matrix: dict):
"""Generate a text summary of key findings."""
category_scores = compute_category_scores(matrix)
models = sorted(matrix.keys())
print("\n" + "=" * 70)
print("VULNERABILITY COMPARISON SUMMARY")
print("=" * 70)
# Overall failure rates
print("\nOverall Failure Rates:")
overall = {}
for model, probes in matrix.items():
total = sum(d.get("total", 0) for d in probes.values())
fails = sum(d.get("fail", 0) for d in probes.values())
rate = fails / total * 100 if total > 0 else 0
overall[model] = rate
print(f" {model:<25} {rate:5.1f}% ({fails}/{total})")
# Best and worst by category
print("\nCategory Analysis:")
categories = list(next(iter(category_scores.values())).keys())
for category in categories:
print(f"\n {category}:")
rates = {}
for model in models:
data = category_scores[model].get(category)
if data:
rate = data["fail_rate"] * 100
rates[model] = rate
print(f" {model:<25} {rate:5.1f}%")
if rates:
best = min(rates, key=rates.get)
worst = max(rates, key=rates.get)
print(f" Best: {best} ({rates[best]:.1f}%)")
print(f" Worst: {worst} ({rates[worst]:.1f}%)")
# Key findings
print("\nKey Findings:")
safest = min(overall, key=overall.get)
riskiest = max(overall, key=overall.get)
print(f" - Safest overall: {safest} ({overall[safest]:.1f}% failure rate)")
print(f" - Most vulnerable: {riskiest} ({overall[riskiest]:.1f}% failure rate)")
if __name__ == "__main__":
with open(OUTPUT_DIR / "comparison_matrix.json") as f:
matrix = json.load(f)
generate_summary(matrix)

Step 7: Interpreting Results and Making Decisions
Raw failure rates require context for sound decision-making. Consider these factors:
| Factor | Why It Matters | How to Account for It |
|---|---|---|
| Model size | Larger models typically have stronger safety training | Compare within similar size classes |
| Safety training | RLHF, constitutional AI, and other techniques differ | Review model documentation alongside results |
| Detector sensitivity | Aggressive detectors inflate failure rates | Check false positive rates by manually reviewing flagged responses |
| Probe relevance | Not all probes matter for your use case | Weight results by business impact |
| Temperature effects | Higher temperature increases output variability | Standardize temperature across all scans |
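The detector-sensitivity row deserves action, not just awareness: pull a sample of flagged attempts and read them yourself to estimate the false-positive rate. A sketch — `sample_failures` is a hypothetical helper reusing the simplified attempt shape assumed in Step 4, so check field names against your garak version:

```python
# sample_failures.py -- random sample of failing attempts for manual review.
import json
import random

def sample_failures(report_path: str, k: int = 10) -> list[dict]:
    """Return up to k randomly chosen failing attempt entries from a report."""
    failures = []
    with open(report_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("entry_type") == "attempt" and entry.get("status") == "fail":
                failures.append(entry)
    return random.sample(failures, min(k, len(failures)))
```

If, say, 3 of 10 sampled "failures" turn out to be safe refusals the detector misread, discount that probe's failure rate accordingly before feeding it into the weighted score below.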
Create a weighted scoring system for your specific context:
# Define weights based on business impact
CATEGORY_WEIGHTS = {
"Prompt Injection": 0.30, # High impact: direct control bypass
"Jailbreak": 0.25, # High impact: safety bypass
"Encoding Attacks": 0.15, # Medium impact: obfuscation attacks
"Knowledge Extraction": 0.20, # High impact: data leakage
"Toxicity": 0.10, # Lower impact: reputation risk
}
def weighted_score(category_scores: dict, model: str) -> float:
"""Compute a weighted security score (lower is better)."""
total_weight = 0
weighted_sum = 0
for category, weight in CATEGORY_WEIGHTS.items():
data = category_scores[model].get(category)
if data:
weighted_sum += data["fail_rate"] * weight
total_weight += weight
return weighted_sum / total_weight if total_weight > 0 else 0

Step 8: Automating Ongoing Comparison
Set up a recurring comparison pipeline for tracking model security over time:
# .github/workflows/model-comparison.yml
name: Monthly Model Security Comparison
on:
schedule:
- cron: '0 2 1 * *' # First of each month at 2am
workflow_dispatch: {}
jobs:
compare-models:
runs-on: ubuntu-latest
timeout-minutes: 120
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install garak
pip install matplotlib pandas
- name: Run comparison scans
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: python run_comparison.py
- name: Generate reports
run: |
python parse_results.py
python analyze_comparison.py > comparison_results/summary.txt
python visualize_comparison.py
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: model-comparison-${{ github.run_number }}
path: comparison_results/

Common Issues and Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Inconsistent results between runs | Model non-determinism | Set temperature to 0 in generator config for reproducibility |
| One model takes much longer | Different response generation speeds | Set per-model timeouts; exclude very slow models from automated runs |
| API rate limiting | Too many concurrent requests | Add delays between models or reduce probe count |
| Results seem identical across models | Using the same base model family | Verify models are actually different by checking their responses |
| Memory errors with multiple Ollama models | Models loaded simultaneously | Restart Ollama between model scans to free memory |
| Missing results for some probes | Probe timed out or errored | Check the log JSONL file for error entries |
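For the first fix in the table, temperature is set through garak's generator options rather than the probe config. A minimal options file, passed with `--generator_option_file`, might look like this — the exact flag name and JSON shape vary across garak versions (some expect a flat dict, others nest options per generator), so confirm with `garak --help` and your version's docs:

```json
{"temperature": 0.0}
```

Pin the same value for every model in the comparison so that differences in results reflect the models themselves, not the sampling settings.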
Related Topics
- Running Your First Garak Scan -- Foundation for understanding garak scan mechanics
- Garak Reporting Analysis -- Deep dive into individual scan report analysis
- Promptfoo Comparative Eval -- Alternative approach to model comparison using promptfoo
- AI Risk Assessment -- Broader context for security-informed model selection
When comparing vulnerability scan results across different models, why is it important to consider category weighting?