Comparing Vulnerability Profiles Across Models with Garak
Intermediate walkthrough on using garak to run identical vulnerability scans across multiple models, comparing results to understand relative security postures and make informed model selection decisions.
When selecting which LLM to deploy in production, security posture should be a key evaluation criterion alongside capability, cost, and latency. Models differ in their safety training approaches, and an attack that one model resists reliably may succeed against another almost every time. This walkthrough teaches you to run controlled comparisons and produce evidence-based security assessments.
Step 1: Designing a Comparison Scan
The key to meaningful comparison is consistency. Every model must face the exact same probes, in the same order, with the same detectors. Start by defining a standardized probe set in a YAML configuration template:
# comparison-probes.yaml
plugins:
probes:
# Prompt injection family
- promptinject.HijackHateHumansMini
- promptinject.HijackKillHumansMini
- promptinject.HijackLongPrompt
# Jailbreak family
- dan.Dan_11_0
- dan.Dan_6_0
- dan.Dan_6_2
# Encoding-based attacks
- encoding.InjectBase64
- encoding.InjectROT13
- encoding.InjectMorse
# Knowledge extraction
- knowledgegraph.WhoIsRelatedTo
- leakreplay.LiteratureCloze80
# Toxicity elicitation
- realtoxicityprompts.RTPBlank
extended_detectors:
- toxicity.ToxicCommentModel

This set covers five major attack categories, giving you a broad security profile with manageable scan times (typically 10-20 minutes per model).
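Before committing scan time, it can help to confirm the set covers the families you expect. A quick tally sketch — the `PROBES` list simply mirrors the YAML above and must be kept in sync by hand:

```python
# Tally configured probes by family (the prefix before the first dot).
# PROBES mirrors comparison-probes.yaml -- keep the two in sync.
from collections import Counter

PROBES = [
    "promptinject.HijackHateHumansMini", "promptinject.HijackKillHumansMini",
    "promptinject.HijackLongPrompt",
    "dan.Dan_11_0", "dan.Dan_6_0", "dan.Dan_6_2",
    "encoding.InjectBase64", "encoding.InjectROT13", "encoding.InjectMorse",
    "knowledgegraph.WhoIsRelatedTo", "leakreplay.LiteratureCloze80",
    "realtoxicityprompts.RTPBlank",
]

def family_counts(probes: list[str]) -> Counter:
    """Count probes per family, e.g. 'dan.Dan_11_0' -> 'dan'."""
    return Counter(p.split(".", 1)[0] for p in probes)

print(family_counts(PROBES))
```

A lopsided tally (say, eight jailbreak probes and one injection probe) would skew the overall failure rates before any model is even scanned.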
Step 2: Preparing Multiple Targets
Set up the models you want to compare. For a meaningful comparison, choose models that are realistic candidates for your use case:
# Pull local models via Ollama
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull gemma2:9b
ollama pull phi3:mini
# Verify all models are available
ollama list

For API-based models, ensure your keys are configured:
export OPENAI_API_KEY="sk-your-key"
export ANTHROPIC_API_KEY="sk-ant-your-key"

Create a model inventory file to drive automated scanning:
# models.py
"""Model inventory for comparison scanning."""
MODELS = [
{
"name": "llama3.1-8b",
"model_type": "ollama",
"model_name": "llama3.1:8b",
},
{
"name": "mistral-7b",
"model_type": "ollama",
"model_name": "mistral:7b",
},
{
"name": "gemma2-9b",
"model_type": "ollama",
"model_name": "gemma2:9b",
},
{
"name": "gpt-4o-mini",
"model_type": "openai",
"model_name": "gpt-4o-mini",
},
]

Step 3: Automating Multi-Model Scans
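Long multi-model runs fail late if a credential is missing, so a quick preflight check pays off. A sketch — `missing_keys` and `REQUIRED_ENV` are hypothetical helpers for this walkthrough's inventory, not part of garak; Ollama targets need no key:

```python
# preflight.py -- fail fast if a scan target's API key is unset.
# REQUIRED_ENV maps model_type values from models.py to the env var they need.
import os

REQUIRED_ENV = {"openai": "OPENAI_API_KEY"}

def missing_keys(models: list[dict]) -> list[str]:
    """Return env var names required by the inventory but not set."""
    needed = {REQUIRED_ENV[m["model_type"]]
              for m in models if m["model_type"] in REQUIRED_ENV}
    return sorted(v for v in needed if not os.environ.get(v))
```

Run it against the `MODELS` list from models.py before launching the comparison; a non-empty result means the corresponding scan would fail partway through the run.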
Write a script that runs identical scans across all models:
#!/usr/bin/env python3
# run_comparison.py
"""Run identical garak scans across multiple models."""
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
from models import MODELS
PROBE_CONFIG = "comparison-probes.yaml"
OUTPUT_DIR = Path("comparison_results")
OUTPUT_DIR.mkdir(exist_ok=True)
def run_scan(model_config: dict) -> dict:
"""Run a garak scan against a single model and return metadata."""
name = model_config["name"]
report_prefix = f"{OUTPUT_DIR}/{name}_{datetime.now():%Y%m%d_%H%M}"
cmd = [
"garak",
"--model_type", model_config["model_type"],
"--model_name", model_config["model_name"],
"--config", PROBE_CONFIG,
"--report_prefix", report_prefix,
]
print(f"\n{'='*60}")
print(f"Scanning: {name}")
print(f"Command: {' '.join(cmd)}")
print(f"{'='*60}\n")
start_time = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start_time
return {
"name": name,
"elapsed_seconds": elapsed,
"return_code": result.returncode,
"report_prefix": report_prefix,
"stdout": result.stdout,
"stderr": result.stderr,
}
def main():
results = []
for model in MODELS:
scan_result = run_scan(model)
results.append(scan_result)
print(f" Completed {model['name']} in {scan_result['elapsed_seconds']:.0f}s")
# Save scan metadata
import json
with open(OUTPUT_DIR / "scan_metadata.json", "w") as f:
json.dump(results, f, indent=2, default=str)
print(f"\nAll scans complete. Results in {OUTPUT_DIR}/")
if __name__ == "__main__":
main()

python run_comparison.py

Step 4: Parsing and Normalizing Results
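Before writing the parser, it helps to see the record shape it expects. A garak report mixes several entry types, one JSON object per line; the shapes below are simplified and illustrative, and field names can differ between garak versions, so spot-check one of your own reports first:

```python
import json
from collections import defaultdict

# Simplified report lines -- illustrative shapes, not verbatim garak records.
REPORT_LINES = [
    '{"entry_type": "start_run"}',
    '{"entry_type": "attempt", "probe": "dan.Dan_11_0", "status": "pass"}',
    '{"entry_type": "attempt", "probe": "dan.Dan_11_0", "status": "fail"}',
]

stats = defaultdict(lambda: {"pass": 0, "fail": 0, "total": 0})
for line in REPORT_LINES:
    entry = json.loads(line)
    # Only attempt entries carry per-probe outcomes; skip run metadata.
    if entry.get("entry_type") == "attempt" and "status" in entry:
        stats[entry["probe"]][entry["status"]] += 1
        stats[entry["probe"]]["total"] += 1

print(dict(stats))
```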
After all scans complete, parse the JSONL reports into a normalized comparison structure:
#!/usr/bin/env python3
# parse_results.py
"""Parse garak scan results into a comparison matrix."""
import json
import glob
from collections import defaultdict
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def parse_report(report_path: str) -> dict:
"""Parse a garak JSONL report into probe-level statistics."""
stats = defaultdict(lambda: {"pass": 0, "fail": 0, "total": 0})
with open(report_path) as f:
for line in f:
entry = json.loads(line)
if entry.get("entry_type") == "attempt" and "status" in entry:
probe = entry["probe"]
status = entry["status"]
stats[probe][status] = stats[probe].get(status, 0) + 1
stats[probe]["total"] += 1
return dict(stats)
def build_comparison_matrix():
"""Build a comparison matrix from all scan reports."""
# Load scan metadata
with open(OUTPUT_DIR / "scan_metadata.json") as f:
metadata = json.load(f)
matrix = {}
for scan in metadata:
name = scan["name"]
report_files = glob.glob(f"{scan['report_prefix']}*.report.jsonl")
if report_files:
matrix[name] = parse_report(report_files[0])
return matrix
def print_comparison(matrix: dict):
"""Print a human-readable comparison table."""
# Collect all probes across all models
all_probes = sorted(set(
probe for model_data in matrix.values()
for probe in model_data.keys()
))
models = sorted(matrix.keys())
# Header
header = f"{'Probe':<50}" + "".join(f"{m:>15}" for m in models)
print(header)
print("-" * len(header))
# Rows
for probe in all_probes:
row = f"{probe:<50}"
for model in models:
data = matrix.get(model, {}).get(probe, {})
total = data.get("total", 0)
fails = data.get("fail", 0)
if total > 0:
fail_rate = fails / total * 100
row += f"{fail_rate:>14.1f}%"
else:
row += f"{'N/A':>15}"
print(row)
if __name__ == "__main__":
matrix = build_comparison_matrix()
print_comparison(matrix)
# Save as JSON for further analysis
with open(OUTPUT_DIR / "comparison_matrix.json", "w") as f:
json.dump(matrix, f, indent=2)

Step 5: Visualizing the Comparison
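Alongside a visual report, a flat CSV of the same matrix is handy for filtering and pivoting in spreadsheet tools. A small export sketch — `export_csv` is a hypothetical helper reading the comparison_matrix.json written in Step 4, not part of garak:

```python
# export_csv.py -- flatten the comparison matrix into a spreadsheet-friendly CSV.
import csv

def export_csv(matrix: dict, out_path: str) -> None:
    """Write one row per probe with a fail-rate column per model."""
    models = sorted(matrix)
    probes = sorted({p for data in matrix.values() for p in data})
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["probe"] + models)
        for probe in probes:
            row = [probe]
            for model in models:
                data = matrix.get(model, {}).get(probe, {})
                total = data.get("total", 0)
                # Blank cell when a probe never ran against this model.
                row.append(f"{data.get('fail', 0) / total:.3f}" if total else "")
            writer.writerow(row)
```

Call it as `export_csv(json.load(open("comparison_results/comparison_matrix.json")), "comparison_results/comparison_matrix.csv")`.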
Create a visual heatmap of vulnerability rates across models:
#!/usr/bin/env python3
# visualize_comparison.py
"""Generate visual comparison of model vulnerability profiles."""
import json
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def generate_html_report(matrix: dict):
"""Generate an HTML heatmap report."""
models = sorted(matrix.keys())
all_probes = sorted(set(
probe for model_data in matrix.values()
for probe in model_data.keys()
))
def color_for_rate(rate):
"""Return a CSS color based on failure rate."""
if rate == 0:
return "#2d6a2e" # Green
elif rate < 10:
return "#6a9f2d" # Light green
elif rate < 30:
return "#c9a82e" # Yellow
elif rate < 60:
return "#c96a2e" # Orange
else:
return "#c92e2e" # Red
html = """<!DOCTYPE html>
<html>
<head>
<title>Model Vulnerability Comparison</title>
<style>
body { font-family: monospace; margin: 20px; }
table { border-collapse: collapse; }
th, td { padding: 8px 12px; border: 1px solid #ddd; text-align: center; }
th { background: #333; color: white; }
td.probe-name { text-align: left; font-size: 12px; }
.legend { margin: 20px 0; }
.legend span { padding: 4px 12px; margin-right: 4px; color: white; }
</style>
</head>
<body>
<h1>Model Vulnerability Comparison</h1>
<div class="legend">
<span style="background: #2d6a2e">0%</span>
<span style="background: #6a9f2d">&lt;10%</span>
<span style="background: #c9a82e">&lt;30%</span>
<span style="background: #c96a2e">&lt;60%</span>
<span style="background: #c92e2e">60%+</span>
Failure Rate
</div>
<table>
<tr><th>Probe</th>"""
for model in models:
html += f"<th>{model}</th>"
html += "</tr>\n"
for probe in all_probes:
html += f'<tr><td class="probe-name">{probe}</td>'
for model in models:
data = matrix.get(model, {}).get(probe, {})
total = data.get("total", 0)
fails = data.get("fail", 0)
if total > 0:
rate = fails / total * 100
color = color_for_rate(rate)
html += f'<td style="background:{color};color:white">{rate:.0f}%</td>'
else:
html += '<td style="background:#888;color:white">N/A</td>'
html += "</tr>\n"
html += "</table></body></html>"
output_path = OUTPUT_DIR / "comparison_report.html"
with open(output_path, "w") as f:
f.write(html)
print(f"Report generated: {output_path}")
if __name__ == "__main__":
with open(OUTPUT_DIR / "comparison_matrix.json") as f:
matrix = json.load(f)
generate_html_report(matrix)

Step 6: Statistical Analysis of Results
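Failure rates on small samples are noisy: 2 failures out of 10 attempts and 200 out of 1,000 are both 20%, but the strength of evidence differs enormously. A Wilson score interval makes that visible — this is standard statistics, not something garak computes for you:

```python
# Wilson score interval for a binomial proportion (e.g. a probe failure rate).
import math

def wilson_interval(fails: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for fails/total; (0, 0) when no attempts ran."""
    if total == 0:
        return (0.0, 0.0)
    p = fails / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (max(0.0, center - margin), min(1.0, center + margin))

low, high = wilson_interval(2, 10)
print(f"2/10 failures: {low:.2f}-{high:.2f}")      # wide interval
low, high = wilson_interval(200, 1000)
print(f"200/1000 failures: {low:.2f}-{high:.2f}")  # much tighter
```

When two models' intervals overlap heavily for a category, the scan has not demonstrated a real difference; consider adding probes or generations before drawing conclusions.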
Go beyond simple failure rates with statistical comparison:
#!/usr/bin/env python3
# analyze_comparison.py
"""Statistical analysis of model vulnerability comparison."""
import json
from pathlib import Path
OUTPUT_DIR = Path("comparison_results")
def compute_category_scores(matrix: dict) -> dict:
"""Aggregate probe results into category-level scores."""
CATEGORIES = {
"Prompt Injection": ["promptinject."],
"Jailbreak": ["dan."],
"Encoding Attacks": ["encoding."],
"Knowledge Extraction": ["knowledgegraph.", "leakreplay."],
"Toxicity": ["realtoxicityprompts."],
}
category_scores = {}
for model, probes in matrix.items():
category_scores[model] = {}
for category, prefixes in CATEGORIES.items():
total = 0
fails = 0
for probe, data in probes.items():
if any(probe.startswith(p) for p in prefixes):
total += data.get("total", 0)
fails += data.get("fail", 0)
if total > 0:
category_scores[model][category] = {
"fail_rate": fails / total,
"total_attempts": total,
"failures": fails,
}
else:
category_scores[model][category] = None
return category_scores
def generate_summary(matrix: dict):
"""Generate a text summary of key findings."""
category_scores = compute_category_scores(matrix)
models = sorted(matrix.keys())
print("\n" + "=" * 70)
print("VULNERABILITY COMPARISON SUMMARY")
print("=" * 70)
# Overall failure rates
print("\nOverall Failure Rates:")
overall = {}
for model, probes in matrix.items():
total = sum(d.get("total", 0) for d in probes.values())
fails = sum(d.get("fail", 0) for d in probes.values())
rate = fails / total * 100 if total > 0 else 0
overall[model] = rate
print(f" {model:<25} {rate:5.1f}% ({fails}/{total})")
# Best and worst by category
print("\nCategory Analysis:")
categories = list(next(iter(category_scores.values())).keys())
for category in categories:
print(f"\n {category}:")
rates = {}
for model in models:
data = category_scores[model].get(category)
if data:
rate = data["fail_rate"] * 100
rates[model] = rate
print(f" {model:<25} {rate:5.1f}%")
if rates:
best = min(rates, key=rates.get)
worst = max(rates, key=rates.get)
print(f" Best: {best} ({rates[best]:.1f}%)")
print(f" Worst: {worst} ({rates[worst]:.1f}%)")
# Key findings
print("\nKey Findings:")
safest = min(overall, key=overall.get)
riskiest = max(overall, key=overall.get)
print(f" - Safest overall: {safest} ({overall[safest]:.1f}% failure rate)")
print(f" - Most vulnerable: {riskiest} ({overall[riskiest]:.1f}% failure rate)")
if __name__ == "__main__":
with open(OUTPUT_DIR / "comparison_matrix.json") as f:
matrix = json.load(f)
generate_summary(matrix)

Step 7: Interpreting Results and Making Decisions
Raw failure rates require context for sound decision-making. Consider these factors:
| Factor | Why It Matters | How to Account for It |
|---|---|---|
| Model size | Larger models typically have stronger safety training | Compare within similar size classes |
| Safety training | RLHF, constitutional AI, and other techniques differ | Review model documentation alongside results |
| Detector sensitivity | Aggressive detectors inflate failure rates | Check false positive rates by manually reviewing flagged responses |
| Probe relevance | Not all probes matter for your use case | Weight results by business impact |
| Temperature effects | Higher temperature increases output variability | Standardize temperature across all scans |
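The detector-sensitivity row deserves action, not just awareness: pull a sample of flagged attempts and read them yourself to estimate the false-positive rate. A sketch — `sample_failures` is a hypothetical helper reusing the simplified attempt shape assumed in Step 4, so check field names against your garak version:

```python
# sample_failures.py -- random sample of failing attempts for manual review.
import json
import random

def sample_failures(report_path: str, k: int = 10) -> list[dict]:
    """Return up to k randomly chosen failing attempt entries from a report."""
    failures = []
    with open(report_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("entry_type") == "attempt" and entry.get("status") == "fail":
                failures.append(entry)
    return random.sample(failures, min(k, len(failures)))
```

If, say, 3 of 10 sampled "failures" turn out to be safe refusals the detector misread, discount that probe's failure rate accordingly before feeding it into the weighted score below.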
Create a weighted scoring system for your specific context:
# Define weights based on business impact
CATEGORY_WEIGHTS = {
"Prompt Injection": 0.30, # High impact: direct control bypass
"Jailbreak": 0.25, # High impact: safety bypass
"Encoding Attacks": 0.15, # Medium impact: obfuscation attacks
"Knowledge Extraction": 0.20, # High impact: data leakage
"Toxicity": 0.10, # Lower impact: reputation risk
}
def weighted_score(category_scores: dict, model: str) -> float:
"""Compute a weighted security score (lower is better)."""
total_weight = 0
weighted_sum = 0
for category, weight in CATEGORY_WEIGHTS.items():
data = category_scores[model].get(category)
if data:
weighted_sum += data["fail_rate"] * weight
total_weight += weight
return weighted_sum / total_weight if total_weight > 0 else 0

Step 8: Automating Ongoing Comparison
Set up a recurring comparison pipeline for tracking model security over time:
# .github/workflows/model-comparison.yml
name: Monthly Model Security Comparison
on:
schedule:
- cron: '0 2 1 * *' # First of each month at 2am
workflow_dispatch: {}
jobs:
compare-models:
runs-on: ubuntu-latest
timeout-minutes: 120
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install garak
pip install matplotlib pandas
- name: Run comparison scans
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: python run_comparison.py
- name: Generate reports
run: |
python parse_results.py
python analyze_comparison.py > comparison_results/summary.txt
python visualize_comparison.py
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: model-comparison-${{ github.run_number }}
path: comparison_results/

Common Issues and Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Inconsistent results between runs | Model non-determinism | Set temperature to 0 in generator config for reproducibility |
| One model takes much longer | Different response generation speeds | Set per-model timeouts; exclude very slow models from automated runs |
| API rate limiting | Too many concurrent requests | Add delays between models or reduce probe count |
| Results seem identical across models | Using the same base model family | Verify models are actually different by checking their responses |
| Memory errors with multiple Ollama models | Models loaded simultaneously | Restart Ollama between model scans to free memory |
| Missing results for some probes | Probe timed out or errored | Check the log JSONL file for error entries |
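For the first fix in the table, temperature is set through garak's generator options rather than the probe config. A minimal options file, passed with `--generator_option_file`, might look like this — the exact flag name and JSON shape vary across garak versions (some expect a flat dict, others nest options per generator), so confirm with `garak --help` and your version's docs:

```json
{"temperature": 0.0}
```

Pin the same value for every model in the comparison so that differences in results reflect the models themselves, not the sampling settings.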
Related Topics
- Running Your First Garak Scan -- Foundation for understanding garak scan mechanics
- Garak Reporting Analysis -- Deep dive into individual scan report analysis
- Promptfoo Comparative Eval -- Alternative approach to model comparison using promptfoo
- AI Risk Assessment -- Broader context for security-informed model selection
When comparing vulnerability scan results across different models, why is it important to consider category weighting?