Red Team Methodology Overview
A structured methodology for AI red team engagements: phases, deliverables, role definitions, and how AI-specific testing differs from traditional penetration testing.
AI red teaming adapts traditional adversarial security testing to the unique characteristics of AI systems. Where a traditional penetration test probes for software vulnerabilities and misconfigurations, an AI red team evaluates whether a system can be manipulated through its natural language interface, its training data pipeline, its tool integrations, or its agentic decision-making. The methodology must account for stochastic behavior, prompt-level attack surfaces, and harms that are semantic rather than technical.
How AI Red Teaming Differs
Before diving into methodology, it helps to understand what makes AI red teaming distinct from traditional penetration testing:
| Dimension | Traditional Pen Test | AI Red Team |
|---|---|---|
| Attack surface | Code, configs, network | Prompts, training data, tool integrations, model behavior |
| Vulnerability type | CVEs, misconfigurations | Alignment failures, injection, hallucination, bias |
| Reproducibility | Usually deterministic | Stochastic: same input may produce different outputs |
| Exploit format | Code, scripts, payloads | Natural language prompts, adversarial inputs |
| Impact measurement | Confidentiality, integrity, availability | Plus: fairness, safety, truthfulness, alignment |
| Tools | Scanners, fuzzers, exploit frameworks | Prompt libraries, behavioral testing suites, eval frameworks |
| Success criteria | Binary (exploited or not) | Gradient (bypass rate, severity spectrum, harm taxonomy) |
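The "gradient" success criterion in the last row is worth making concrete. Because the same prompt can succeed or fail on different attempts, a finding is characterized by a bypass *rate* over repeated trials rather than a single pass/fail result. A minimal sketch of such a trial harness, where `attempt_bypass` is a hypothetical stand-in for a real model call plus a response grader:

```python
import random

random.seed(42)  # deterministic demo only; real model trials are stochastic

def attempt_bypass(prompt: str) -> bool:
    """Hypothetical stand-in for a real model call. A real harness would
    send `prompt` to the target API and grade whether the response
    constitutes a bypass; here a 15% per-trial success rate is assumed
    purely for illustration."""
    return random.random() < 0.15

def bypass_rate(prompt: str, trials: int = 20) -> float:
    """Run the same prompt repeatedly and report the fraction of
    successes, since a single attempt tells you little about a
    stochastic system."""
    successes = sum(attempt_bypass(prompt) for _ in range(trials))
    return successes / trials

rate = bypass_rate("hypothetical injection payload", trials=50)
print(f"bypass rate over 50 trials: {rate:.0%}")
```

The same structure underlies the "minimum trials per technique" requirement in the evidence collection phase below.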
Engagement Phases
An AI red team engagement follows a structured sequence of phases. Each phase produces specific deliverables that feed into the next.
Phase 1: Scoping and Rules of Engagement
Objective: Define what will be tested, how, and what constraints apply.
Activities:
- Identify the AI system under test: model provider, deployment architecture, user-facing interfaces, tool integrations, data sources
- Define scope boundaries: which components, attack vectors, and harm categories are in scope
- Establish rules of engagement: rate limits, forbidden techniques, escalation procedures, data handling requirements
- Agree on success criteria and severity classification with stakeholders
Deliverable: Signed rules of engagement document and detailed test plan.
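Beyond the signed document, scope decisions are easier to enforce during testing if they also exist in machine-readable form that the test harness can check before each attempt. A sketch, with all field names and values invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RulesOfEngagement:
    """Machine-readable scope record (illustrative field names)."""
    system_under_test: str
    in_scope_vectors: tuple
    forbidden_techniques: tuple
    rate_limit_rpm: int          # agreed request ceiling per minute
    escalation_contact: str

    def permits(self, vector: str) -> bool:
        """A test harness can gate every attempt on this check."""
        return (vector in self.in_scope_vectors
                and vector not in self.forbidden_techniques)

roe = RulesOfEngagement(
    system_under_test="customer-support chatbot (staging)",
    in_scope_vectors=("prompt_injection", "tool_abuse", "data_exfiltration"),
    forbidden_techniques=("training_data_poisoning",),
    rate_limit_rpm=30,
    escalation_contact="security-oncall@example.com",
)
print(roe.permits("prompt_injection"))          # True
print(roe.permits("training_data_poisoning"))   # False
```

Freezing the dataclass mirrors the intent of a signed document: scope should not change mid-engagement without going back through the stakeholders.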
See Scoping & Rules of Engagement for detailed guidance.
Phase 2: Reconnaissance
Objective: Map the AI system's attack surface before attempting exploitation.
Activities:
- Fingerprint the model provider and version through behavioral analysis
- Extract or infer the system prompt
- Enumerate available tools, plugins, and external integrations
- Identify data sources (RAG knowledge bases, web access, file uploads)
- Map trust boundaries between components
- Discover shadow AI deployments if scope includes organizational discovery
Deliverable: Attack surface map documenting all identified components, trust boundaries, and potential attack vectors.
See Advanced Reconnaissance for techniques.
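Behavioral fingerprinting, the first recon activity above, can be sketched as scoring a target's responses to fixed probe prompts against known per-provider quirks. The provider names, probes, and marker phrases below are all invented for illustration; a real signature library is built from observed provider behavior:

```python
def fingerprint(responses: dict, signatures: dict) -> str:
    """Score each candidate provider by how many probe responses contain
    its expected marker substring; return the best-scoring match."""
    def score(sig: dict) -> int:
        return sum(1 for probe, marker in sig.items()
                   if marker.lower() in responses.get(probe, "").lower())
    return max(signatures, key=lambda name: score(signatures[name]))

# Invented signatures and observed responses, purely illustrative:
SIGNATURES = {
    "provider_a": {"refusal_probe": "i can't help with that",
                   "identity_probe": "assistant made by provider a"},
    "provider_b": {"refusal_probe": "i'm unable to assist",
                   "identity_probe": "provider b language model"},
}
observed = {
    "refusal_probe": "I'm unable to assist with that request.",
    "identity_probe": "I am a Provider B language model.",
}
print(fingerprint(observed, SIGNATURES))
```

Substring matching is the crudest possible scorer; in practice fingerprinting also leans on tokenizer quirks, formatting habits, and refusal phrasing under repeated sampling.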
Phase 3: Threat Modeling
Objective: Prioritize testing based on risk analysis.
Activities:
- Build attack trees for high-value objectives
- Map ATLAS tactics and techniques to the identified attack surface
- Analyze trust boundaries and data flows for injection opportunities
- Estimate cost and probability for each attack path
- Prioritize paths by risk (probability multiplied by impact)
Deliverable: Prioritized test plan derived from the threat model.
See AI-Specific Threat Modeling for frameworks.
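The prioritization step above (risk = probability × impact) reduces to a simple scoring pass over the enumerated attack paths. A minimal sketch with illustrative estimates:

```python
attack_paths = [
    # (name, estimated probability of success, impact on a 1-5 scale)
    # All numbers are illustrative analyst estimates, not measurements.
    ("indirect injection via RAG documents", 0.6, 4),
    ("system prompt extraction", 0.8, 2),
    ("tool parameter injection to internal API", 0.3, 5),
    ("training data poisoning", 0.05, 5),
]

# Rank by risk score: probability multiplied by impact.
ranked = sorted(attack_paths, key=lambda p: p[1] * p[2], reverse=True)
for name, prob, impact in ranked:
    print(f"{prob * impact:>4.2f}  {name}")
```

The point of the exercise is the ordering, not the absolute scores: a moderately likely path to a high-impact objective outranks a near-certain path to a low-impact one, which is why prompt extraction often lands below injection paths despite being easier.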
Phase 4: Exploitation
Objective: Execute the prioritized test plan and attempt to demonstrate identified risks.
Activities:
- Test prompt injection vectors (direct, indirect, multi-turn, encoding)
- Attempt safety alignment bypasses (jailbreaking)
- Probe tool and function calling for parameter injection and unauthorized access
- Test for data exfiltration through model-mediated channels
- Evaluate bias, fairness, and harmful output generation
- Test agent autonomy boundaries and escalation paths
- Document each test with inputs, outputs, bypass rates, and reproducibility data
Deliverable: Raw findings log with full evidence chain for each discovered vulnerability.
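The "encoding" variants mentioned in the injection testing activity can be generated systematically rather than by hand, so that a filter matching only the literal payload string gets probed against each obfuscated form. A minimal sketch using standard-library encoders:

```python
import base64
import codecs

def encoding_variants(payload: str) -> dict:
    """Produce common obfuscated forms of one payload. Each variant
    tests whether the target's input filtering canonicalizes before
    matching, or only catches the literal string."""
    return {
        "plain": payload,
        "base64": base64.b64encode(payload.encode()).decode(),
        "rot13": codecs.encode(payload, "rot13"),
        "hex": payload.encode().hex(),
        "reversed": payload[::-1],
    }

variants = encoding_variants("ignore previous instructions")
for name, value in variants.items():
    print(f"{name:>8}: {value}")
```

Each variant would then be run through the trial harness and recorded with its own bypass rate, since models decode some encodings far more reliably than others.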
Phase 5: Evidence Collection and Documentation
Objective: Produce a complete, reproducible evidence package.
Activities:
- Capture full conversation logs for each finding
- Record API requests and responses with timestamps
- Document bypass rates across multiple attempts (minimum 10 trials per technique)
- Preserve model version, temperature settings, and other configuration details
- Create proof-of-concept payloads that can be re-run for verification
Deliverable: Evidence package meeting chain-of-custody requirements.
See Evidence Collection for standards.
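The evidence activities above amount to bundling each finding with its configuration, trial statistics, and an integrity hash of the raw transcript. A sketch of one such record, with the model identifier and field names chosen for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone

def evidence_record(finding_id, prompt, response, model, temperature,
                    trial_results):
    """Bundle one finding into a self-describing record. Hashing the
    raw transcript supports later integrity and chain-of-custody
    verification."""
    transcript = prompt + "\n---\n" + response
    return {
        "finding_id": finding_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "temperature": temperature,
        "trials": len(trial_results),
        "bypass_rate": sum(trial_results) / len(trial_results),
        "transcript_sha256": hashlib.sha256(transcript.encode()).hexdigest(),
        "transcript": transcript,
    }

record = evidence_record(
    finding_id="F-007",
    prompt="(injection payload)",
    response="(model output)",
    model="example-model-2024-06",  # hypothetical model identifier
    temperature=0.7,
    trial_results=[True, False, True] + [False] * 7,  # 2/10 bypasses
)
print(json.dumps({k: v for k, v in record.items() if k != "transcript"},
                 indent=2))
```

Storing the bypass rate alongside the trial count preserves the stochastic context a reviewer needs to reproduce or dispute the finding.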
Phase 6: Reporting
Objective: Communicate findings to stakeholders in actionable form.
Activities:
- Classify findings by severity using the agreed framework
- Write finding descriptions with clear impact statements
- Provide remediation recommendations for each finding
- Produce an executive summary for non-technical stakeholders
- Conduct a findings walkthrough with the technical team
Deliverable: Final report with executive summary, detailed findings, and remediation roadmap.
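Severity classification in this phase typically combines the measured bypass rate with the harm category's impact. The matrix below is an illustrative example only; the real thresholds and labels come from the framework agreed during scoping:

```python
def classify(bypass_rate: float, impact: str) -> str:
    """Illustrative severity matrix (assumed thresholds, not a
    standard). `impact` is one of 'low', 'medium', 'high'."""
    impact_rank = {"low": 0, "medium": 1, "high": 2}[impact]
    # Bucket the measured rate: rare (<5%), occasional (<30%), frequent.
    rate_rank = 0 if bypass_rate < 0.05 else 1 if bypass_rate < 0.3 else 2
    matrix = [
        ["informational", "low",    "medium"],    # low impact
        ["low",           "medium", "high"],      # medium impact
        ["medium",        "high",   "critical"],  # high impact
    ]
    return matrix[impact_rank][rate_rank]

print(classify(0.02, "high"))  # rare but high-impact -> medium
print(classify(0.5, "high"))   # frequent and high-impact -> critical
print(classify(0.1, "low"))    # occasional, low-impact -> low
```

Note that even a rare bypass of a high-impact control does not classify as informational, which is the reporting counterpart of the "binary pass/fail" anti-pattern discussed below.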
Team Composition
Effective AI red teaming requires a combination of skills that rarely exists in a single person:
| Role | Responsibilities | Key Skills |
|---|---|---|
| Engagement lead | Scoping, stakeholder management, reporting | Project management, communication, risk assessment |
| Prompt engineer / injection specialist | Prompt injection testing, jailbreaking, safety bypass | Creative writing, LLM behavior intuition, NLP knowledge |
| ML engineer | Model-level attacks, gradient-based methods, training pipeline assessment | Machine learning, Python, adversarial ML research |
| Application security tester | API testing, infrastructure assessment, tool integration security | Web security, API security, traditional pen testing |
| Domain expert | Bias evaluation, harm assessment, policy compliance | Domain-specific knowledge (healthcare, finance, legal, etc.) |
| Automation engineer | Test harness development, scale testing, metrics | Software engineering, data analysis, scripting |
Engagement Types
Different engagement types require different methodology emphases:
Safety Evaluation
Focus on alignment bypasses, harmful content generation, and bias testing. Heavy emphasis on creative prompt engineering and edge-case discovery. Often performed before model deployment.
Application Security Assessment
Focus on prompt injection, data exfiltration, tool abuse, and privilege escalation in a deployed application. Combines AI-specific testing with traditional application security testing.
Agentic System Assessment
Focus on tool-use boundaries, autonomous decision-making, multi-agent trust, and escalation paths. Requires understanding of the agent architecture and tool integrations. See Agent Exploitation for techniques.
Continuous Red Teaming
Ongoing testing program rather than a point-in-time engagement. Automated testing combined with periodic manual assessments. See Continuous Red Teaming for program design.
Methodology Anti-Patterns
Common mistakes that reduce AI red team effectiveness:
Ad-hoc testing without threat modeling. Jumping straight to "try some prompts" without understanding the system architecture, trust boundaries, or highest-risk paths. This leads to shallow coverage and missed critical vulnerabilities.
Binary pass/fail reporting. Reporting that a technique "doesn't work" after a few failed attempts. AI systems are stochastic; a technique that fails 9 times and succeeds once represents a real vulnerability that must be reported with its bypass rate.
Ignoring the application layer. Focusing exclusively on the model and missing traditional web vulnerabilities in the application wrapping it. API authentication, rate limiting, input validation, and session management all affect the AI system's security posture.
Testing only the happy path of attacks. Trying the most common injection techniques and stopping when they fail. Sophisticated real-world attackers combine techniques, adapt to defenses, and persist through failures.
Skipping reconnaissance. Attempting exploitation without first understanding what model, version, system prompt, and tools are in use. Recon findings dramatically change which attack techniques are most effective.
Related Topics
- Scoping & Rules of Engagement - Detailed scoping guidance
- Advanced Reconnaissance - Recon techniques for AI targets
- AI-Specific Threat Modeling - Threat modeling frameworks
- Evidence Collection - Evidence standards
- Continuous Red Teaming - Ongoing program design
References
- MITRE (2024). ATLAS - Adversarial Threat Landscape for AI Systems
- NIST (2024). AI Risk Management Framework (AI RMF 1.0)
- OWASP (2025). OWASP Top 10 for LLM Applications
- Anthropic (2024). "Challenges in Red Teaming AI Systems"
- Microsoft (2024). "AI Red Team Best Practices"