Red Team Tool Comparison
Comparison of major AI red teaming tools -- Garak, PyRIT, promptfoo, and Inspect AI -- covering capabilities, strengths, limitations, and use cases.
Red Team Tool Comparison
This reference compares the major open-source tools used in AI red teaming engagements. Each tool has distinct strengths and is best suited for specific phases of an assessment.
Quick Comparison Matrix
| Feature | Garak | PyRIT | promptfoo | Inspect AI |
|---|---|---|---|---|
| Primary purpose | Automated vulnerability scanning | Attack orchestration | LLM evaluation and testing | AI safety evaluation |
| Developer | NVIDIA | Microsoft | promptfoo (open-source) | UK AI Safety Institute |
| Language | Python | Python | TypeScript/Node.js | Python |
| Multi-turn attacks | Limited | Strong | Limited | Moderate |
| Attack generation | Probe-based | Converter + orchestrator | YAML-defined | Task + solver |
| Custom attacks | Custom probes | Custom converters/orchestrators | Plugins + YAML | Custom solvers |
| Scoring | Detector-based | Multiple scorer types | Assertion-based | Scorer-based |
| CI/CD integration | Moderate | Limited | Strong | Moderate |
| Model support | Broad (API + local) | Broad (API + local) | Broad (API + local) | Broad (API + local) |
| Learning curve | Low-Medium | Medium-High | Low | Medium |
Garak
Generative AI Red-teaming and Assessment Kit
Overview
Garak is a probe-based vulnerability scanner that tests LLMs against known attack patterns. It works by generating attack prompts (probes), sending them to the target model (generators), and evaluating the responses (detectors).
Architecture
Probes (attack generators) → Generator (model interface) → Detectors (response evaluators)
Strengths
- Low barrier to entry: Run a baseline scan with a single command
- Broad coverage: Built-in probes cover OWASP LLM Top 10 categories
- Extensible: Write custom probes and detectors in Python
- Multiple model backends: Supports OpenAI, Hugging Face, local models, and custom APIs
- Reporting: Generates structured reports with pass/fail results
Limitations
- Primarily single-turn: Limited support for multi-turn attack strategies
- Pattern-based: Relies on known attack patterns; does not discover novel techniques
- Limited orchestration: Cannot coordinate complex, multi-step attacks
- Static probes: Attack prompts are predefined rather than dynamically generated based on model responses
Best For
- Baseline vulnerability scanning at the start of an engagement
- Regression testing after model or guardrail updates
- CI/CD pipeline integration for continuous security testing
- Coverage verification across known attack categories
Example Use Case
Running Garak as a first pass to identify which vulnerability categories warrant deeper manual investigation, then using the results to prioritize the human-driven portion of the assessment.
PyRIT
Python Risk Identification Tool for Generative AI
Overview
PyRIT is Microsoft's framework for orchestrating complex, multi-turn red teaming attacks. It excels at automated attack campaigns that adapt based on model responses.
Architecture
Orchestrator (attack strategy) → Converter (payload transformation) → Target (model) → Scorer (evaluation)
Key Components
| Component | Purpose | Examples |
|---|---|---|
| Orchestrators | Manage attack strategy and conversation flow | Multi-turn, crescendo, tree-of-attacks |
| Converters | Transform prompts between formats | Base64, translation, leetspeak, homoglyph, unicode |
| Targets | Interface with the model under test | OpenAI, Azure, Hugging Face, local models |
| Scorers | Evaluate attack success | LLM-as-judge, keyword match, classifier |
| Memory | Store conversation history and results | SQLite-based persistent storage |
Strengths
- Multi-turn orchestration: Purpose-built for complex, multi-step attacks
- Converter chain: Stack multiple transformations for sophisticated evasion
- Adaptive attacks: Orchestrators adjust strategy based on model responses
- Persistent memory: Track conversation state across sessions
- Research-grade: Designed for systematic, reproducible experiments
Limitations
- Higher learning curve: More complex setup and configuration than simpler tools
- Python-centric: Requires Python development skills for customization
- Less CI/CD native: Not primarily designed for pipeline integration
- Heavier setup: Requires more infrastructure (database, API keys, configuration)
Best For
- Complex multi-turn jailbreak campaigns
- Automated escalation testing (crescendo attacks)
- Research requiring systematic prompt variation and scoring
- Testing that requires converter chains (encoding + translation + paraphrasing)
Example Use Case
Using PyRIT's crescendo orchestrator to automatically conduct a 15-turn escalation attack, with converters applying Base64 encoding and language translation at strategic points, scored by an LLM-as-judge evaluator.
promptfoo
Overview
promptfoo is a YAML-driven evaluation and red teaming harness that emphasizes ease of use, broad model support, and CI/CD integration. It is well-suited for systematic, repeatable test suites.
Architecture
YAML config (test cases + assertions) → Provider (model interface) → Evaluator (assertion checking) → Report
Strengths
- Low learning curve: YAML-based configuration requires no programming
- CI/CD native: Designed for integration into testing pipelines
- Comparative testing: Easily test the same attack suite against multiple models or configurations
- Plugin system: Extend with custom attack types via JavaScript/TypeScript plugins
- Built-in red team module: Dedicated red teaming mode with pre-built attack strategies
- Rich reporting: Web UI for browsing results, comparing runs, and sharing reports
Limitations
- Primarily single-turn: Multi-turn attack support is limited compared to PyRIT
- YAML limitations: Complex attack logic can become unwieldy in YAML format
- Node.js ecosystem: May not integrate seamlessly with Python-based AI tooling
- Less adaptive: Test cases are predefined rather than dynamically adjusted
Best For
- CI/CD pipeline integration for continuous red teaming
- Comparative evaluation (testing the same attacks across model versions or configurations)
- Team collaboration where non-Python developers need to contribute test cases
- Regression testing after guardrail updates
- Quick setup for straightforward test suites
Example Use Case
Setting up a YAML-based test suite that runs 200 attack prompts across five risk categories against both the current and proposed model configurations, generating a comparative report that shows which configuration handles each attack category better.
Inspect AI
UK AI Safety Institute Evaluation Framework
Overview
Inspect AI is the UK AI Safety Institute's framework for evaluating AI system safety. It uses a task-based architecture where evaluations are composed of datasets, solvers (how the model interacts with the task), and scorers.
Architecture
Task (dataset + instructions) → Solver (model interaction strategy) → Scorer (evaluation criteria) → Log
Key Components
| Component | Purpose | Examples |
|---|---|---|
| Tasks | Define evaluation scenarios | Safety benchmarks, capability tests, red team scenarios |
| Datasets | Provide test inputs | HuggingFace datasets, CSV, JSON, custom generators |
| Solvers | Control model interaction | Generate (single-turn), multi-turn, chain-of-thought |
| Scorers | Evaluate outputs | Model grading, pattern matching, human review |
| Tools | Give the model capabilities | Web browsing, code execution, custom tools |
Strengths
- Composable architecture: Mix and match solvers, scorers, and datasets
- Safety-focused: Designed specifically for AI safety evaluation
- Agent evaluation: Built-in support for evaluating tool-using agents
- Reproducible: Detailed logging of every evaluation step
- Community benchmarks: Growing library of shared evaluation tasks
Limitations
- Evaluation-focused: Not primarily an offensive tool; more evaluation than attack
- Newer ecosystem: Smaller community and fewer resources compared to more established tools
- Less attack-oriented: Fewer built-in attack techniques compared to Garak or PyRIT
- Framework overhead: Task/solver/scorer pattern adds abstraction overhead for simple tests
Best For
- Structured safety evaluations aligned with regulatory requirements
- Agent capability and safety benchmarking
- Reproducible evaluations with comprehensive logging
- Evaluation tasks that need to assess tool-use behavior
- Complement to offensive tools for measuring defensive effectiveness
Example Use Case
Building an Inspect AI evaluation suite that tests an agent's safety across a range of scenarios -- measuring both whether the agent completes tasks correctly and whether it respects safety boundaries when presented with adversarial inputs.
Tool Selection Guide
Use this decision tree to choose the right tool for your scenario:
| Scenario | Recommended Tool | Rationale |
|---|---|---|
| First-pass baseline scan | Garak | Fast setup, broad coverage of known patterns |
| Multi-turn jailbreak campaign | PyRIT | Purpose-built orchestration for complex, adaptive attacks |
| CI/CD regression testing | promptfoo | YAML-driven, pipeline-native, comparative reports |
| Safety evaluation for compliance | Inspect AI | Task-based, reproducible, aligned with regulatory expectations |
| Comparative model evaluation | promptfoo | Built-in support for side-by-side comparison |
| Research with encoding evasion | PyRIT | Converter chains for systematic payload transformation |
| Quick one-off test | promptfoo or Garak | Minimal setup required |
| Agent safety testing | Inspect AI | Built-in tool-use and agent evaluation support |
Combining Tools
Professional engagements typically use multiple tools:
Baseline Scan
Run Garak for a quick baseline across known vulnerability categories. Identify which categories show weaknesses.
Deep Exploitation
Use PyRIT to conduct multi-turn attacks against the categories where Garak found weaknesses. Leverage converters and orchestrators for sophisticated evasion.
Safety Evaluation
Use Inspect AI to conduct structured safety evaluations, particularly for agent-based systems. Measure defensive boundaries systematically.
Regression Suite
Build a promptfoo test suite from the successful attacks discovered in previous phases. Integrate into CI/CD for continuous monitoring.
Manual Testing
Supplement automated tools with human-driven creative testing for novel techniques that no tool will discover on its own.