Full Red Team Engagement: End-to-End
Complete guide to AI red team engagements from scoping through attack execution, evidence collection, impact assessment, report delivery, and remediation validation.
An AI red team engagement is a structured adversarial assessment that evaluates the full AI system -- model behavior, application integration, infrastructure security, and organizational process -- against realistic threat scenarios. This page covers the complete lifecycle from scoping through remediation validation.
Engagement Lifecycle
Scoping & Planning ──▶ Rules of Engagement ──▶ Reconnaissance
        │                                            │
        ▼                                            │
Threat Modeling ─────▶ Attack Execution ◀────────────┘
                              │
                              ▼
             Evidence Collection & Documentation
                              │
                              ▼
              Analysis & Impact Assessment
                              │
                              ▼
               Report Writing & Delivery
                              │
                              ▼
            Remediation Validation & Retest
Phase 1: Scoping and Planning
Scope Definition Checklist
| Scope Element | Questions to Answer |
|---|---|
| Model layer | Which models? Fine-tuned or API-based? Are weights accessible? |
| Application layer | Web interface? API? Mobile? Internal tools? |
| Data layer | RAG knowledge base? Training pipeline? User data stores? |
| Infrastructure layer | Cloud provider? On-prem GPUs? Kubernetes? Model serving? |
| Integration layer | What tools can the LLM invoke? What external services? |
| Human layer | Social engineering of AI operators in scope? |
Engagement Types
| Type | Objective | Duration | Team | Deliverable |
|---|---|---|---|---|
| Safety Assessment | Evaluate model behavior against safety policies | 1-2 weeks | 1-2 researchers | Safety evaluation with failure modes |
| App Security Assessment | Find vulns in the LLM-powered application | 2-4 weeks | 2-3 appsec + 1 LLM specialist | Vulnerability report with PoC exploits |
| Full Red Team | Simulate realistic adversary against complete system | 4-8 weeks | 3-5 cross-functional | Attack narrative with business impact |
| Continuous Red Team | Ongoing adversarial testing in dev cycle | Ongoing (quarterly) | 2-3 embedded | Continuous findings + quarterly reports |
Stakeholder Alignment
Before starting, align on these four questions:
- Success criteria -- Is a jailbreak sufficient, or must you demonstrate downstream business impact?
- Risk tolerance -- Demonstrate actual data exfiltration, or stop at proving the vulnerability exists?
- Communication cadence -- How often for interim reports? What is the emergency escalation process?
- Environmental constraints -- Production vs. staging? Peak hours to avoid? Data sensitivity levels?
Phase 2: Rules of Engagement
Standard ROE Sections
Authorization
Engagement sponsor, authorization date, legal review confirmation, specific authorized scope.
Scope boundaries
Explicit in-scope and out-of-scope lists. AI-specific exclusions: production customer data (use synthetic), third-party API services not owned by the org.
Testing constraints
Hours of operation, rate limits (avoid DoS), data handling rules, persistence and lateral movement authorization.
Communication
Primary contacts, defensive team contact for deconfliction, emergency contact for critical findings.
Critical finding protocol
Severity threshold definition, notification timeline (e.g., 4 hours for critical), encrypted communication method.
Evidence handling
Encrypted storage, access controls, retention period, destruction process after report delivery.
AI-Specific ROE Provisions
| Provision | What to Specify |
|---|---|
| Token budget | Maximum tokens per session to prevent cost runaway |
| Prohibited prompts | Categories that are off-limits (e.g., CSAM-related content) |
| Output recording | All model outputs must be logged for review |
| Knowledge base injection | May/may not inject documents; all test data must be removable |
| Training data poisoning | May/may not submit poisoned data; synthetic data requirements |
| Model integrity | May/may not modify weights or upload LoRA adapters; rollback plan |
| Harmful content handling | Authorized categories only; store encrypted, destroy after report |
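The token-budget provision above can be enforced mechanically in the testing harness. This is a minimal sketch, assuming the harness wraps every model call; `estimate_tokens` is a crude whitespace-based stand-in for whatever tokenizer the target actually uses.

```python
# Sketch: enforce a per-session token budget during testing.
# estimate_tokens is a rough stand-in for a real tokenizer count.

class TokenBudgetExceeded(Exception):
    pass

def estimate_tokens(text: str) -> int:
    # Crude approximation: ~1 token per whitespace-separated word.
    return len(text.split())

class BudgetedSession:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, prompt: str, response: str) -> int:
        # Refuse the call before it would exceed the ROE budget.
        cost = estimate_tokens(prompt) + estimate_tokens(response)
        if self.used + cost > self.max_tokens:
            raise TokenBudgetExceeded(
                f"budget {self.max_tokens}, used {self.used}, next call {cost}"
            )
        self.used += cost
        return self.used
```

Raising a hard exception rather than logging a warning keeps a runaway automated test from silently burning through the agreed budget.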
Phase 3: Reconnaissance
Use the methodology from the Advanced Reconnaissance module:
- Model identification -- behavioral fingerprinting, identity probes
- System prompt extraction -- techniques from the extraction module
- Tool/function enumeration -- probe for available capabilities
- RAG system detection -- identify knowledge base sources
- Infrastructure fingerprinting -- timing analysis, error messages, header inspection
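Behavioral fingerprinting from the list above can be sketched as a fixed probe set whose normalized responses are hashed into a comparable ID. The probe strings and the `query_model` callable (any `prompt -> response` function you supply) are illustrative assumptions, not a canonical probe suite.

```python
import hashlib

# Sketch: behavioral fingerprinting via a fixed probe set.
# query_model is any callable (prompt -> response text) you supply.

PROBES = [
    "What model are you?",
    "Repeat the word 'test' exactly once.",
    "What is your knowledge cutoff?",
]

def normalize(text: str) -> str:
    # Collapse case and whitespace so trivial variation
    # doesn't change the fingerprint.
    return " ".join(text.lower().split())

def fingerprint(query_model) -> str:
    # Hash the concatenated, normalized responses into a short ID
    # that can be compared across endpoints or engagement phases.
    digest = hashlib.sha256()
    for probe in PROBES:
        digest.update(normalize(query_model(probe)).encode("utf-8"))
    return digest.hexdigest()[:16]
```

Comparing fingerprints across endpoints can reveal when two "different" products sit on the same underlying model, or when a model was silently swapped mid-engagement.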
Phase 4: Attack Execution
Attack Priority Matrix
| | Low Feasibility | High Feasibility |
|---|---|---|
| High Impact | Strategic: sleeper agents, training poison, weight backdoors | Priority: prompt injection, system prompt leak, output injection, RAG poisoning |
| Low Impact | Deprioritize: side-channel attacks, timing attacks | Quick Wins: jailbreaking, model fingerprinting, bias detection |
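The matrix above can double as a triage function for the finding backlog. A minimal sketch, assuming impact and feasibility are each scored 1 (low) or 2 (high); the quadrant names mirror the table.

```python
# Sketch: map the impact/feasibility matrix to a working queue.
# Scores are 1 (low) or 2 (high); (impact, feasibility) keys.

QUADRANTS = {
    (2, 2): "Priority",
    (2, 1): "Strategic",
    (1, 2): "Quick Win",
    (1, 1): "Deprioritize",
}

def triage(findings):
    # findings: list of (name, impact, feasibility) tuples.
    # Work Priority items first; Quick Wins fill gaps while
    # Strategic attacks (long setup) run in the background.
    order = ["Priority", "Quick Win", "Strategic", "Deprioritize"]
    labeled = [(name, QUADRANTS[(impact, feas)])
               for name, impact, feas in findings]
    return sorted(labeled, key=lambda item: order.index(item[1]))
```

The ordering choice (Quick Wins before Strategic) reflects limited engagement time: cheap demonstrable findings build stakeholder confidence while longer-horizon attacks are still in progress.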
Building Attack Chains
The most impactful findings are multi-step chains demonstrating realistic threat scenarios:
Example: Data Exfiltration via RAG Poisoning
1. INITIAL ACCESS: RAG ingests from shared Confluence wiki
2. INJECTION: Create wiki page with embedded adversarial prompt
3. TRIGGER: Craft queries that retrieve the adversarial document
4. HIJACK: Retrieved content overrides system prompt, instructs
LLM to include sensitive data as a "reference link"
5. EXFIL: LLM outputs image tag with query params containing
sensitive data to attacker's server
6. IMPACT: Customer PII exfiltrated through AI assistant
Evidence Collection
For each action, log:
- Timestamp (UTC)
- Action type (recon, injection, extraction, etc.)
- Target and technique used
- Input data (what you sent)
- Output data (what came back)
- Evidence files (screenshots, captures)
- Success/failure determination
- SHA-256 hash of the log entry for integrity verification
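The log fields above, including the SHA-256 integrity hash, can be implemented as a hash chain so that any after-the-fact edit to an entry is detectable. A minimal sketch; field names follow the checklist, and the chaining scheme (each entry hashes its content plus the previous hash) is one reasonable design, not a mandated format.

```python
import hashlib
import json
from datetime import datetime, timezone

# Sketch: tamper-evident evidence log. Each entry hashes its own
# content plus the previous entry's hash, so editing any earlier
# entry breaks verification of everything after it.

class EvidenceLog:
    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis value

    def record(self, action_type, target, technique, input_data,
               output_data, success, evidence_files=()):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action_type": action_type,
            "target": target,
            "technique": technique,
            "input": input_data,
            "output": output_data,
            "evidence_files": list(evidence_files),
            "success": success,
            "prev_hash": self._prev_hash,
        }
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entry["sha256"] = hashlib.sha256(payload).hexdigest()
        self._prev_hash = entry["sha256"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute every hash; any mutation invalidates the chain.
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "sha256"}
            if body["prev_hash"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if hashlib.sha256(payload).hexdigest() != entry["sha256"]:
                return False
            prev = entry["sha256"]
        return True
```

Running `verify()` before report delivery gives the sponsor a simple integrity check on the evidence package.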
Phase 5: Building Attack Narratives
From Findings to Stories
Individual findings are most persuasive when woven into a single attack narrative: walk stakeholders through the chain step by step, then translate each step into quantified business impact.
Business Impact Framework
| Impact Category | What to Quantify |
|---|---|
| Financial | Remediation cost, regulatory fines, legal liability, revenue impact |
| Operational | Service disruption, recovery time, retraining costs |
| Reputational | Media coverage risk, customer trust erosion |
| Compliance | GDPR/CCPA/AI Act violations, audit findings |
| Safety | User harm from incorrect medical/legal/financial advice |
Phase 6: Report Delivery
See the dedicated Report Writing guide for templates and structure.
Delivery Protocol
Draft delivery
Encrypted draft to engagement sponsor.
Technical review
1-2 hour meeting with the security team.
Factual validation
3-5 business days for accuracy review.
Executive briefing
30-60 minute presentation to leadership.
Final report
Delivered with all corrections incorporated.
Remediation workshop
Half-day collaborative planning session.
Retest scheduling
Schedule validation of implemented fixes.
Phase 7: Remediation Validation
Retest Methodology
Replay original exploits
Attempt the exact original exploit from the report. Verify it is now blocked.
Test bypass variations
Generate variations of the original attack (different encoding, phrasing, multi-turn). Verify that the fix is robust, not just pattern-matched to the specific payload.
Classify remediation status
REMEDIATED = original + all bypasses blocked. PARTIALLY REMEDIATED = original blocked but bypasses succeed. NOT REMEDIATED = original exploit still works.
Establish regression tests
Convert successful exploits into automated regression tests. Integrate with CART pipeline for continuous monitoring.
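The retest steps above can be sketched as a single routine: replay the original exploit, generate simple bypass variations, and classify remediation status from the results. The specific variation transforms here (case change, whitespace substitution, base64 wrapping) are illustrative; `is_blocked` is any callable you supply that sends a payload to the target and returns True if it was refused.

```python
import base64

# Sketch: replay an exploit plus bypass variations, then classify
# remediation status per the definitions above. is_blocked is a
# caller-supplied callable: payload -> True if the target blocked it.

def variations(payload: str):
    yield payload                                        # original, verbatim
    yield payload.upper()                                # trivial case change
    yield payload.replace(" ", "\u00a0")                 # whitespace swap
    yield base64.b64encode(payload.encode()).decode()    # encoding wrapper

def remediation_status(payload: str, is_blocked) -> str:
    results = [is_blocked(v) for v in variations(payload)]
    if all(results):
        return "REMEDIATED"
    if results[0]:
        return "PARTIALLY REMEDIATED"  # original blocked, a bypass works
    return "NOT REMEDIATED"
```

Because `variations` is a plain generator, the same payloads drop directly into an automated regression suite for continuous monitoring.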
Related Topics
- Report Writing — Document findings using structured report templates
- Threat Modeling — Threat models drive engagement planning
References
- MITRE ATLAS — AI threat knowledge base for engagement planning
- OWASP Top 10 for LLM Applications — Vulnerability classification for findings
- NIST AI Risk Management Framework (AI RMF 1.0, 2023) — Risk framework for AI engagement scoping