Scoping & Rules of Engagement
Defining scope, rules of engagement, authorization boundaries, and success criteria for AI red team engagements, with templates and checklists for common engagement types.
Scoping and rules of engagement (ROE) define what an AI red team is authorized to test, how they may test it, and what constraints apply. Poor scoping leads to either incomplete testing (too narrow) or operational incidents (too broad). For AI systems, scoping must address concerns that do not exist in traditional penetration testing: the stochastic nature of model behavior, the risk of generating genuinely harmful content during testing, rate limits on inference APIs, and the distinction between model-level and application-level testing.
Scope Definition
Scope Components
An AI red team scope must specify each of the following:
| Component | What to Define | Example |
|---|---|---|
| Target system | The specific AI application, model, and version under test | "Customer-facing chatbot on support.example.com using GPT-4o through Azure OpenAI" |
| Attack vectors | Which injection types, jailbreak categories, and exploitation techniques are in scope | "Direct injection, indirect injection via RAG, tool abuse; gradient-based attacks out of scope" |
| Harm categories | Which types of harmful output the team should attempt to elicit | "CSAM, violence, self-harm, PII leakage, bias — per OWASP LLM Top 10" |
| Infrastructure | Whether underlying infrastructure (APIs, databases, cloud services) is in scope | "Application-layer only; cloud infrastructure out of scope" |
| Data sensitivity | Whether real user data, PII, or production data may be accessed or generated | "Synthetic test data only; no real customer data in prompts" |
| User roles | Which user roles and access levels will be tested | "Anonymous user, authenticated standard user, admin user" |
| Integrations | Which tool integrations, plugins, and external services are in scope | "Knowledge base RAG, ticket creation API; email integration out of scope" |
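The scope components above can be captured as a machine-readable record that tooling consults before each test run, so that out-of-scope vectors are rejected automatically rather than by convention. The schema below is a sketch, not a standard; all field names and vector labels are illustrative.

```python
from dataclasses import dataclass

@dataclass
class EngagementScope:
    """Illustrative scope record mirroring the components table above."""
    target_system: str
    attack_vectors_in: list      # authorized vectors
    attack_vectors_out: list     # explicitly excluded vectors
    harm_categories: list
    infrastructure_in_scope: bool
    real_user_data_allowed: bool
    user_roles: list
    integrations_in: list
    integrations_out: list

    def is_vector_authorized(self, vector):
        # A vector must be explicitly listed in scope and not excluded.
        return (vector in self.attack_vectors_in
                and vector not in self.attack_vectors_out)

scope = EngagementScope(
    target_system="support.example.com chatbot (GPT-4o via Azure OpenAI)",
    attack_vectors_in=["direct_injection", "indirect_injection", "tool_abuse"],
    attack_vectors_out=["gradient_based"],
    harm_categories=["pii_leakage", "violence", "self_harm", "bias"],
    infrastructure_in_scope=False,
    real_user_data_allowed=False,
    user_roles=["anonymous", "standard", "admin"],
    integrations_in=["kb_rag", "ticket_api"],
    integrations_out=["email"],
)
```

A test harness would call `scope.is_vector_authorized("gradient_based")` before launching a technique and skip anything that returns `False`.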
Scope Levels
AI red team engagements typically fall into one of three scope levels:
Level 1: Model Assessment. Testing the model's safety alignment and behavioral boundaries. Focuses on prompt injection, jailbreaking, and harmful content generation. Does not test application-layer controls.
Level 2: Application Assessment. Testing the complete application including the model, system prompt, input/output filters, tool integrations, and user interface. This is the most common engagement type.
Level 3: Full-Stack Assessment. Testing the entire AI system stack from infrastructure to model to application. Includes API security, authentication, infrastructure configuration, training pipeline, and deployment security in addition to model-level testing.
Rules of Engagement
AI-Specific ROE Considerations
Beyond standard penetration testing ROE, AI red teaming requires additional rules:
Rate limiting and cost controls. API inference calls cost money. Define:
- Maximum API calls per hour/day
- Maximum tokens consumed per engagement
- Who pays for inference costs (client or testing firm)
- Whether throttling or batching is required
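These caps are easiest to enforce if every inference call passes through a budget check. The wrapper below is a minimal sketch: the class name, method names, and threshold values are assumptions, not part of any provider SDK.

```python
import time

class BudgetedClient:
    """Enforces two of the ROE caps above: a sliding-window hourly
    call limit and a total token budget for the engagement.
    Default thresholds are illustrative."""

    def __init__(self, max_calls_per_hour=500, max_total_tokens=2_000_000):
        self.max_calls_per_hour = max_calls_per_hour
        self.max_total_tokens = max_total_tokens
        self.call_times = []   # timestamps of calls in the current window
        self.tokens_used = 0

    def check_budget(self, estimated_tokens, now=None):
        """Return True if one more call fits within both caps."""
        now = time.time() if now is None else now
        # Drop calls older than one hour from the sliding window.
        self.call_times = [t for t in self.call_times if now - t < 3600]
        if len(self.call_times) >= self.max_calls_per_hour:
            return False   # hourly call cap reached
        if self.tokens_used + estimated_tokens > self.max_total_tokens:
            return False   # engagement-wide token cap reached
        return True

    def record(self, tokens, now=None):
        """Log a completed call against both budgets."""
        self.call_times.append(time.time() if now is None else now)
        self.tokens_used += tokens
```

In practice the testing harness calls `check_budget()` before each request, pauses when it returns `False`, and alerts the engagement lead on a sustained cost spike per the escalation procedures below.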
Harmful content handling. Testing for harmful content generation creates a dilemma: the red team must demonstrate the harm but also handle the generated content responsibly. Define:
- How harmful outputs are captured and stored
- Who has access to evidence containing harmful content
- Retention and destruction schedules for harmful content
- Whether harmful content can be included in reports (redacted or full)
- Specific content categories that should not be generated even during testing
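One common way to satisfy these rules is to keep the raw harmful output only in access-controlled evidence storage, while the report carries a cryptographic digest plus a redacted preview. The record schema below is an assumption for illustration; only the use of a SHA-256 digest to tie the report to the stored original is the substantive point.

```python
import hashlib
import datetime

def capture_harmful_output(raw_output, category, case_id):
    """Build a report-safe evidence record for a harmful model output.
    The raw text itself goes to access-controlled storage (not shown);
    the digest lets authorized reviewers verify the stored original."""
    digest = hashlib.sha256(raw_output.encode("utf-8")).hexdigest()
    return {
        "case_id": case_id,            # hypothetical test-case identifier
        "category": category,
        "sha256": digest,
        "preview": raw_output[:40] + " [REDACTED]",
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```

Retention and destruction schedules then apply to the stored originals; the digests in the report remain valid evidence even after the raw content is destroyed.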
Model interaction boundaries. Define:
- Whether the team may attempt to modify model behavior (fine-tuning, RLHF manipulation)
- Whether adversarial training data or poisoning attempts are in scope
- Whether weight extraction or model stealing attempts are authorized
- Maximum conversation length and number of concurrent sessions
Production vs. staging. Define:
- Whether testing occurs on production or staging systems
- Whether real users may be affected during testing
- Time windows for testing (business hours, off-hours)
- Rollback procedures if testing causes unexpected behavior changes
Authorization Chain
AI red team authorization requires sign-off from more stakeholders than a traditional penetration test:
| Stakeholder | What They Authorize | Why They Are Needed |
|---|---|---|
| System owner | Overall testing authorization | Standard pen test authorization |
| AI/ML team lead | Model-level testing, safety boundary probing | Understands model capabilities and risks |
| Legal counsel | Harmful content generation, data handling | Liability for generating/storing harmful content |
| Privacy officer | PII handling, data flows during testing | User data may flow through AI systems under test |
| API provider (if applicable) | Testing against their API | Provider ToS may restrict adversarial testing |
| CISO / Security lead | Infrastructure scope, escalation procedures | Security governance |
Escalation Procedures
Define clear escalation triggers:
| Trigger | Action | Who to Notify |
|---|---|---|
| Critical vulnerability (data exfiltration, code execution via tools) | Immediate halt and notification | Engagement lead, system owner, CISO |
| Harmful content generated that poses real-world risk | Stop testing in that category, secure evidence | Engagement lead, legal counsel |
| Unexpected model behavior (persistent personality changes, recursive tool calls) | Pause and assess | AI/ML team lead |
| Rate limit exceeded or unexpected cost spike | Pause testing | System owner, engagement lead |
| Access to real user data during testing | Stop immediately, document exposure | Privacy officer, system owner, legal |
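The table above maps cleanly onto a trigger-to-action lookup that tooling or runbooks can share. The sketch below encodes it as plain data; the trigger keys, action labels, and role names are illustrative, not a standard taxonomy.

```python
# Illustrative encoding of the escalation table above.
ESCALATIONS = {
    "critical_vulnerability": {
        "action": "halt_all", "notify": ["engagement_lead", "system_owner", "ciso"]},
    "harmful_content_real_risk": {
        "action": "stop_category", "notify": ["engagement_lead", "legal_counsel"]},
    "unexpected_model_behavior": {
        "action": "pause_assess", "notify": ["ml_team_lead"]},
    "cost_spike": {
        "action": "pause", "notify": ["system_owner", "engagement_lead"]},
    "real_user_data_accessed": {
        "action": "stop_document", "notify": ["privacy_officer", "system_owner", "legal_counsel"]},
}

def escalate(trigger):
    """Look up the agreed action and notification list for a trigger."""
    rule = ESCALATIONS.get(trigger)
    if rule is None:
        # Unknown events escalate to the lead by default rather than being dropped.
        return {"action": "pause_assess", "notify": ["engagement_lead"]}
    return rule
```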
Success Criteria
Defining "Success" for AI Red Teaming
Unlike traditional pen testing where success is often binary (did you compromise the target?), AI red teaming operates on a spectrum. Define success criteria along multiple dimensions:
Coverage metrics:
- Percentage of OWASP LLM Top 10 categories tested
- Number of attack vectors tested per scope component
- Number of unique techniques attempted per vector
Finding metrics:
- Number and severity of findings
- Bypass rates for each technique category
- Percentage of safety boundaries successfully breached
Quality metrics:
- All findings reproducible with documented evidence
- Remediation recommendations provided for each finding
- Severity ratings calibrated to business impact
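Coverage and bypass rates are simple ratios, but computing them per technique keeps the final report honest about which categories were actually exercised. A minimal sketch, assuming trial logs are available as `(technique, succeeded)` pairs:

```python
def bypass_rate(attempts):
    """Per-technique bypass rate from (technique, succeeded) trial pairs."""
    totals, wins = {}, {}
    for technique, succeeded in attempts:
        totals[technique] = totals.get(technique, 0) + 1
        wins[technique] = wins.get(technique, 0) + (1 if succeeded else 0)
    return {t: wins[t] / totals[t] for t in totals}

def coverage(tested, required):
    """Fraction of required categories (e.g. the OWASP LLM Top 10)
    with at least one test attempted."""
    return len(set(tested) & set(required)) / len(required)
```

For example, `coverage(["LLM01", "LLM02"], ["LLM01", "LLM02", "LLM03", "LLM04"])` reports 50% coverage of the required categories.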
Severity Framework
Agree on severity classification before testing begins:
| Severity | Criteria | Example |
|---|---|---|
| Critical | Reliable exploitation with high impact; data exfiltration, code execution, or safety bypass affecting all users | System prompt injection that reliably causes the chatbot to execute arbitrary tool calls, exfiltrating customer data |
| High | Moderate reliability with significant impact, or high reliability with moderate impact | Jailbreak technique that bypasses content filters 60% of the time, generating harmful instructions |
| Medium | Low reliability with significant impact, or moderate reliability with limited impact | Encoding bypass that works 20% of the time to extract the system prompt |
| Low | Low reliability and limited impact; informational findings | Model reveals its provider and version when asked directly |
| Informational | No direct exploitation but indicates defense weakness | System prompt does not include anti-override language |
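The severity table is a reliability-by-impact matrix, which can be applied mechanically once a finding's measured success rate and agreed impact level are known. The numeric thresholds below (reliable at 80%+, low below 10%) are illustrative assumptions chosen to match the table's examples, not fixed industry values.

```python
def classify_severity(success_rate, impact):
    """Map a finding's success rate and impact ('high', 'moderate',
    'limited') onto the severity matrix above. Thresholds are illustrative."""
    if success_rate >= 0.8:
        rel = "reliable"
    elif success_rate >= 0.1:
        rel = "moderate"
    else:
        rel = "low"

    if rel == "reliable" and impact == "high":
        return "Critical"
    if (rel == "moderate" and impact == "high") or \
       (rel == "reliable" and impact == "moderate"):
        return "High"
    if (rel == "low" and impact == "high") or \
       (rel == "moderate" and impact in ("moderate", "limited")):
        return "Medium"
    return "Low"
```

Under these assumptions, the 60%-reliable filter bypass from the table classifies as High, and the 20%-reliable system-prompt extraction as Medium, matching the examples.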
Engagement Documentation
Test Plan Template
A test plan should include:
- Engagement overview: Target system, scope level, timeline, team composition
- Scope details: In-scope components, out-of-scope exclusions, authorized attack vectors
- Rules of engagement: Rate limits, harmful content handling, escalation procedures
- Test matrix: Specific test cases organized by attack vector and target component
- Success criteria: Coverage, finding, and quality metrics
- Evidence requirements: What must be captured for each finding
- Reporting deliverables: Report format, audience, delivery timeline
Pre-Engagement Checklist
Before testing begins, verify:
- Authorization document signed by all required stakeholders
- API provider terms of service reviewed and compliant
- Test accounts and credentials provisioned
- Staging environment available (if applicable)
- Rate limits and cost caps configured
- Evidence storage with appropriate access controls provisioned
- Escalation contacts verified and reachable
- Communication channel established with system owner
- Harmful content handling procedures documented and acknowledged
- Team briefed on scope boundaries and prohibited actions
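A checklist like this is most useful as a hard gate: testing starts only when every item is verified. The item keys below are shorthand for the checklist above and are illustrative.

```python
# Illustrative go/no-go gate over the pre-engagement checklist above.
CHECKLIST = [
    "authorization_signed",
    "provider_tos_reviewed",
    "test_accounts_provisioned",
    "rate_limits_configured",
    "evidence_storage_ready",
    "escalation_contacts_verified",
    "comms_channel_established",
    "harmful_content_procedures_acknowledged",
    "team_briefed",
]

def ready_to_test(completed):
    """Return (go, missing_items); testing begins only when go is True."""
    missing = [item for item in CHECKLIST if item not in completed]
    return (len(missing) == 0, missing)
```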
Related Topics
- Red Team Methodology - The overall engagement lifecycle
- AI-Specific Threat Modeling - Threat models that inform scope
- Evidence Collection - Standards for capturing and preserving findings
- Continuous Red Teaming - How scoping adapts for ongoing programs
References
- NIST (2023). AI Risk Management Framework (AI RMF 1.0)
- OWASP (2025). OWASP Top 10 for LLM Applications
- PTES (2024). Penetration Testing Execution Standard - Pre-Engagement
- CREST (2024). CREST Penetration Testing Guide