After-Action Reviews for AI Red Team Operations
Structured frameworks for conducting after-action reviews that capture lessons learned, improve methodology, and demonstrate value from AI red team engagements.
Overview
An AI red team engagement that discovers critical vulnerabilities but does not systematically capture what worked, what failed, and what the team learned is leaving most of its value on the table. The findings are reported and (hopefully) remediated, but the institutional knowledge — which techniques were effective against which model architectures, which tools failed in unexpected ways, how long each phase actually took versus the estimate, what the team would do differently — evaporates. The next engagement starts from scratch, repeating the same inefficiencies and missing the same opportunities.
After-action reviews (AARs) transform individual engagements into a compounding knowledge base. In traditional military and incident response contexts, AARs are a well-established practice with decades of methodology behind them. AI red teaming, however, has specific characteristics that require adapted AAR approaches: the novelty of attack techniques means lessons are especially valuable (there is less published methodology to fall back on), the rapid evolution of AI systems means lessons expire faster, and the cross-disciplinary nature of AI security means lessons must be captured in ways that are accessible to both security and ML audiences.
This article provides a complete framework for AI red team after-action reviews, including structured templates, facilitation techniques, knowledge management systems, and methods for translating AAR outputs into program improvements and leadership reporting.
Why AI Red Team AARs Are Different
Traditional penetration testing AARs focus primarily on what was found and how. AI red team AARs must additionally capture:
Model-specific behavioral observations: How did the target model behave under different attack strategies? What defensive patterns were observed? These observations may not result in formal findings but are invaluable for future engagements against similar model architectures.
Technique evolution: AI attack techniques are evolving rapidly. A technique that failed today might work tomorrow with a small modification, or a technique that worked might be patched before the next engagement. Capturing the precise technique and its result enables longitudinal tracking.
Tool limitations discovered: AI security tools are immature compared to traditional security tools. Engagements frequently reveal tool limitations (false negatives, crashes, incompatibilities) that need to be fed back to tool development.
Cross-domain insights: AI red team engagements often reveal issues at the intersection of AI security and traditional security (e.g., prompt injection that exploits an SSRF vulnerability in the model serving layer). These cross-domain insights are easily lost if the AAR does not explicitly capture them.
AAR Structure and Timeline
Timing
Conduct the AAR within 48 hours of engagement conclusion, while details are fresh. For longer engagements (2+ weeks), conduct interim AARs at the end of each major phase.
Engagement Timeline with AAR Integration:
Week 1: Reconnaissance and planning
└── Interim AAR: Planning effectiveness assessment
Week 2-3: Active testing
└── Daily: Brief stand-ups capturing technique results
└── End of Week 2: Interim AAR on tool and technique effectiveness
Week 4: Analysis, reporting, and remediation guidance
└── Final AAR: Comprehensive review (2-3 hours)
Week 5+: Follow-up
└── 30-day AAR: Remediation tracking and delayed insights
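The fixed offsets in this cadence (final AAR within 48 hours, follow-up at 30 days) can be derived mechanically from the engagement end date. A minimal sketch; the helper name is illustrative, not part of any established tooling:

```python
from datetime import date, timedelta

def aar_schedule(engagement_end: date) -> dict[str, date]:
    """Derive the fixed AAR dates from the engagement end date:
    final AAR within 48 hours, 30-day follow-up AAR afterwards."""
    return {
        "final_aar_deadline": engagement_end + timedelta(days=2),
        "thirty_day_aar": engagement_end + timedelta(days=30),
    }
```

Putting these dates on the team calendar at engagement kickoff, not at engagement end, is what keeps the 48-hour window from slipping.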
Participants
The final AAR should include all engagement team members. Optionally include the ML engineering team that owns the target system (if the engagement was cooperative rather than adversarial). Do not include management — the AAR needs psychological safety to surface honest assessments of what went wrong.
Management receives a summary of AAR findings and action items, not a seat in the room.
Comprehensive AAR Template
"""
AI Red Team After-Action Review template.
Structured data capture for institutional knowledge building.
"""
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional
class EngagementType(Enum):
LLM_APPLICATION = "llm_application"
CLASSIFICATION_MODEL = "classification_model"
GENERATIVE_MODEL = "generative_model"
RAG_SYSTEM = "rag_system"
MULTI_AGENT = "multi_agent_system"
ML_PIPELINE = "ml_pipeline"
FULL_SYSTEM = "full_system_assessment"
class TechniqueOutcome(Enum):
SUCCESSFUL = "successful" # Achieved objective
PARTIAL = "partial" # Some progress, not full objective
FAILED_DEFENDED = "failed_defended" # Target defended effectively
FAILED_TECHNIQUE = "failed_technique" # Our technique was flawed
NOT_APPLICABLE = "not_applicable" # Technique didn't apply to target
NOT_ATTEMPTED = "not_attempted" # Ran out of time or deprioritized
@dataclass
class TechniqueRecord:
"""Record of a specific technique used during the engagement."""
technique_name: str
category: str # e.g., "prompt_injection", "model_extraction"
description: str
outcome: TechniqueOutcome
time_invested_hours: float
tools_used: list[str]
target_component: str
# What we learned
effectiveness_notes: str = ""
defense_observed: str = ""
modification_ideas: str = ""
reusability: str = "" # "high", "medium", "low" — can this be reused?
@dataclass
class ToolAssessment:
"""Assessment of a tool's performance during the engagement."""
tool_name: str
version: str
purpose: str
effectiveness: int # 1-5 scale
issues_encountered: list[str] = field(default_factory=list)
workarounds_applied: list[str] = field(default_factory=list)
recommendation: str = "" # "keep", "replace", "improve", "retire"
@dataclass
class TimeAllocationRecord:
"""How time was actually spent vs. planned."""
phase: str
planned_hours: float
actual_hours: float
variance_reason: str = ""
@dataclass
class AfterActionReview:
"""Complete after-action review for an AI red team engagement."""
# Engagement metadata
engagement_id: str
target_system: str
engagement_type: EngagementType
start_date: date
end_date: date
team_members: list[str]
scope_summary: str
# Findings summary
total_findings: int = 0
critical_findings: int = 0
high_findings: int = 0
medium_findings: int = 0
low_findings: int = 0
most_impactful_finding: str = ""
# Technique records
techniques: list[TechniqueRecord] = field(default_factory=list)
# Tool assessments
tools: list[ToolAssessment] = field(default_factory=list)
# Time allocation
time_records: list[TimeAllocationRecord] = field(default_factory=list)
# Key questions (the heart of the AAR)
what_was_planned: str = ""
what_actually_happened: str = ""
what_went_well: list[str] = field(default_factory=list)
what_could_improve: list[str] = field(default_factory=list)
surprises: list[str] = field(default_factory=list)
# AI-specific observations
model_behavior_observations: list[str] = field(default_factory=list)
defense_patterns_observed: list[str] = field(default_factory=list)
attack_surface_notes: str = ""
architecture_insights: str = ""
# Action items
action_items: list[dict] = field(default_factory=list)
# Cross-engagement insights
comparison_to_previous: str = ""
methodology_updates_needed: list[str] = field(default_factory=list)
training_needs_identified: list[str] = field(default_factory=list)
def generate_summary(self) -> str:
"""Generate a management-ready summary from the AAR."""
effective_techniques = [
t for t in self.techniques
if t.outcome == TechniqueOutcome.SUCCESSFUL
]
failed_techniques = [
t for t in self.techniques
if t.outcome in (TechniqueOutcome.FAILED_TECHNIQUE,
TechniqueOutcome.FAILED_DEFENDED)
]
total_planned = sum(t.planned_hours for t in self.time_records)
total_actual = sum(t.actual_hours for t in self.time_records)
lines = [
f"# AAR Summary: {self.target_system}",
f"## Engagement: {self.engagement_id}",
f"**Duration**: {self.start_date} to {self.end_date} "
f"({(self.end_date - self.start_date).days} days)",
f"**Team**: {', '.join(self.team_members)}",
"",
"## Results",
f"- Total findings: {self.total_findings}",
f"- Critical: {self.critical_findings}, High: {self.high_findings}, "
f"Medium: {self.medium_findings}, Low: {self.low_findings}",
f"- Most impactful: {self.most_impactful_finding}",
"",
"## Technique Effectiveness",
f"- Techniques attempted: {len(self.techniques)}",
f"- Successful: {len(effective_techniques)}",
f"- Failed: {len(failed_techniques)}",
f"- Effective rate: "
f"{len(effective_techniques)/max(len(self.techniques),1):.0%}",
"",
"## Time",
f"- Planned: {total_planned:.0f} hours",
f"- Actual: {total_actual:.0f} hours",
f"- Variance: {(total_actual-total_planned)/max(total_planned,1):+.0%}",
"",
"## Key Insights",
]
for item in self.what_went_well[:3]:
lines.append(f"- (Positive) {item}")
for item in self.what_could_improve[:3]:
lines.append(f"- (Improve) {item}")
for item in self.surprises[:3]:
lines.append(f"- (Surprise) {item}")
lines.append("")
lines.append("## Action Items")
for item in self.action_items:
lines.append(
f"- [{item.get('priority', 'MEDIUM')}] {item.get('action', '')} "
f"(Owner: {item.get('owner', 'TBD')}, "
f"Due: {item.get('due_date', 'TBD')})"
)
return "\n".join(lines)
Facilitation Guide
Opening (10 minutes)
Set the tone: this is a learning exercise, not a blame session. Establish ground rules:
- Focus on the team, not individuals: "We made this decision" not "Alice made this mistake"
- Honesty is required: Sanitized AARs are worthless. If something failed because of a skill gap, say so — that becomes a training investment case.
- No rank in the room: Junior team members often have the most valuable observations because they see things with fresh eyes and are less likely to rationalize suboptimal choices.
- Everything is captured: Designate a note-taker (or use the template above as a live document). No insight should be lost because nobody wrote it down.
Core Discussion (90-120 minutes)
Walk through each section of the template. For each section, use these facilitation techniques:
Timeline reconstruction (20 minutes): Before diving into analysis, reconstruct what actually happened chronologically. This surfaces factual disagreements and ensures everyone has a shared understanding.
Example timeline for an LLM red team engagement:
Day 1: Reconnaissance — analyzed model behavior through normal usage
Observation: Model appears to be GPT-4 based, with custom system prompt
Observation: Output filtering is present for explicit content
Day 2: Basic prompt injection testing
Technique: Direct instruction override — FAILED (defended)
Technique: Role-play framing — PARTIAL SUCCESS
Surprise: Model accepted code generation requests without filtering
Day 3: Advanced injection via multi-turn context
Technique: Crescendo attack (gradual context shift) — SUCCESSFUL
Technique: Tool-use injection — NOT ATTEMPTED (no tool-use identified)
Finding: System prompt extractable via completion-based technique
Day 4-5: Exploitation and documentation
...
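Timeline reconstruction goes much faster when the daily stand-ups produce structured notes rather than relying on memory. A minimal sketch of such a log; the entry shape is an assumption, not part of the AAR template above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DailyLogEntry:
    """One stand-up note; 'kind' distinguishes techniques from observations."""
    day: date
    kind: str      # "technique", "observation", "surprise", "finding"
    summary: str

def reconstruct_timeline(entries: list[DailyLogEntry]) -> list[str]:
    """Replay log entries chronologically for the AAR timeline walk-through."""
    ordered = sorted(entries, key=lambda e: e.day)
    return [f"{e.day.isoformat()} [{e.kind}] {e.summary}" for e in ordered]
```

Entries tagged "technique" can later seed the TechniqueRecord entries in the AAR template, so nothing depends on recall during the review itself.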
Technique-by-technique review (30-40 minutes): For each technique attempted, discuss:
- Why was this technique chosen?
- What was the expected outcome?
- What was the actual outcome?
- If it failed, why? Was the defense effective, or was our technique flawed?
- If it succeeded, was it easier or harder than expected?
- What would we do differently next time?
Tool review (15-20 minutes): For each tool used, discuss whether it worked as expected, what workarounds were needed, and whether the team should continue using it.
The four key questions (20-30 minutes):
- What did we plan to do? Review the engagement plan. Was the scope appropriate? Were the chosen attack categories correct?
- What actually happened? How did the engagement diverge from the plan? Why?
- What went well? Specifically identify successes — not just findings, but process, tool, and collaboration successes. This is often rushed in AARs but is critical for reinforcing effective practices.
- What would we do differently? This is where the highest-value insights emerge. Frame this as "given what we know now, what would we change about our approach?"
Closing (20 minutes)
Action items: Every "what would we do differently" answer should produce an action item with an owner and due date. No orphan insights.
Cross-engagement pattern identification: If this is not the team's first AAR, compare findings with previous AARs. Are the same issues recurring? Are improvements from previous AARs working?
Knowledge base update: Identify which AAR insights should be added to the team's knowledge base.
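The "no orphan insights" rule can be enforced mechanically before the AAR adjourns. This sketch uses the same fields as the action-item dicts in the AAR template (priority, action, owner, due date); the validation helper itself is an assumed addition, not part of the template:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    """An AAR action item; owner and due date are mandatory."""
    action: str
    owner: str
    due_date: Optional[date]
    priority: str = "MEDIUM"

def validate_action_items(items: list[ActionItem]) -> list[str]:
    """Return a list of problems; an empty list means every item
    has a real owner and a due date."""
    problems = []
    for item in items:
        if not item.owner or item.owner.upper() == "TBD":
            problems.append(f"No owner assigned: {item.action!r}")
        if item.due_date is None:
            problems.append(f"No due date: {item.action!r}")
    return problems
```

Running this check as the last agenda item turns "assign owners and due dates" from a norm into a gate.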
Knowledge Management System
AARs are only valuable if their insights are retrievable. Build a knowledge base structured for practical use:
"""
AI Red Team Knowledge Base structure.
Organizes AAR insights for retrieval in future engagements.
"""
from dataclasses import dataclass, field
from datetime import date
from typing import Optional
# TechniqueRecord, TechniqueOutcome, and AfterActionReview are
# defined in the AAR template above.
@dataclass
class TechniqueKnowledge:
"""What we know about a specific attack technique from AAR data."""
technique_name: str
category: str
total_attempts: int = 0
successful_attempts: int = 0
target_types_tested: list[str] = field(default_factory=list)
effectiveness_by_target: dict[str, float] = field(default_factory=dict)
known_defenses: list[str] = field(default_factory=list)
best_variants: list[str] = field(default_factory=list)
common_pitfalls: list[str] = field(default_factory=list)
last_updated: Optional[date] = None
@property
def overall_success_rate(self) -> float:
if self.total_attempts == 0:
return 0.0
return self.successful_attempts / self.total_attempts
def update_from_aar(self, technique_record: "TechniqueRecord"):
"""Update knowledge from a new AAR technique record."""
self.total_attempts += 1
if technique_record.outcome == TechniqueOutcome.SUCCESSFUL:
self.successful_attempts += 1
target_type = technique_record.target_component
if target_type not in self.target_types_tested:
self.target_types_tested.append(target_type)
if technique_record.defense_observed:
if technique_record.defense_observed not in self.known_defenses:
self.known_defenses.append(technique_record.defense_observed)
if technique_record.modification_ideas:
self.best_variants.append(technique_record.modification_ideas)
self.last_updated = date.today()
@dataclass
class EngagementKnowledge:
"""Aggregate knowledge from all AARs for a specific system type."""
system_type: str
total_engagements: int = 0
avg_findings_per_engagement: float = 0.0
most_effective_techniques: list[str] = field(default_factory=list)
least_effective_techniques: list[str] = field(default_factory=list)
common_defense_patterns: list[str] = field(default_factory=list)
recommended_time_allocation: dict[str, float] = field(default_factory=dict)
lessons_learned: list[str] = field(default_factory=list)
class KnowledgeBase:
"""AI Red Team knowledge base aggregated from AARs."""
def __init__(self):
self.techniques: dict[str, TechniqueKnowledge] = {}
self.engagement_types: dict[str, EngagementKnowledge] = {}
self.aars: list[AfterActionReview] = []
def ingest_aar(self, aar: "AfterActionReview"):
"""Process an AAR and update the knowledge base."""
self.aars.append(aar)
# Update technique knowledge
for tech_record in aar.techniques:
if tech_record.technique_name not in self.techniques:
self.techniques[tech_record.technique_name] = TechniqueKnowledge(
technique_name=tech_record.technique_name,
category=tech_record.category,
)
self.techniques[tech_record.technique_name].update_from_aar(tech_record)
# Update engagement type knowledge
sys_type = aar.engagement_type.value
if sys_type not in self.engagement_types:
self.engagement_types[sys_type] = EngagementKnowledge(
system_type=sys_type
)
eng_knowledge = self.engagement_types[sys_type]
eng_knowledge.total_engagements += 1
def get_planning_guidance(self, system_type: str) -> dict:
"""
Get guidance for planning a new engagement based on
accumulated knowledge from previous AARs.
"""
eng = self.engagement_types.get(system_type)
if not eng:
return {"message": f"No prior engagements of type {system_type}"}
# Find most effective techniques for this system type
relevant_techniques = [
(name, tech) for name, tech in self.techniques.items()
if system_type in [t.lower() for t in tech.target_types_tested]
]
ranked = sorted(
relevant_techniques,
key=lambda x: x[1].overall_success_rate,
reverse=True,
)
return {
"system_type": system_type,
"prior_engagements": eng.total_engagements,
"recommended_techniques": [
{
"name": name,
"success_rate": f"{tech.overall_success_rate:.0%}",
"attempts": tech.total_attempts,
"known_defenses": tech.known_defenses,
}
for name, tech in ranked[:10]
],
"common_defenses": eng.common_defense_patterns,
"time_allocation": eng.recommended_time_allocation,
"key_lessons": eng.lessons_learned[-10:],
}
Translating AARs into Program Improvements
Methodology Updates
AAR findings should systematically update the team's engagement methodology:
AAR Finding → Methodology Update Pipeline:
1. AAR identifies improvement opportunity
Example: "We spent 8 hours on model extraction attempts that
were blocked by rate limiting. We should have tested rate limits
first before investing in extraction techniques."
2. Draft methodology update
Addition to engagement playbook: "Phase 1.5 — API Constraint
Assessment: Before attempting model extraction, characterize
the target API's rate limiting, request throttling, and anomaly
detection. Estimated time: 2 hours. If rate limits are below
1000 requests/minute, deprioritize extraction techniques."
3. Review with team
Discussed at next team meeting. Confirmed by senior engineers.
4. Update documentation
Engagement playbook updated with new phase.
5. Validate in next engagement
Track whether the methodology update improved outcomes.
6. Next AAR evaluates the update
"The new API constraint assessment phase saved approximately 6 hours
by identifying that extraction was not feasible given rate limits."
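A lightweight status record helps ensure that step 6 (the next AAR evaluating the update) is never skipped. A sketch under the assumption that updates move through the six pipeline stages in order; the record and status names are illustrative:

```python
from dataclasses import dataclass

# Lifecycle stages mirroring the six pipeline steps above
STATUSES = ["proposed", "drafted", "reviewed", "documented",
            "validating", "evaluated"]

@dataclass
class MethodologyUpdate:
    source_aar_id: str          # which AAR produced the insight
    insight: str                # the improvement opportunity
    playbook_change: str = ""   # drafted methodology text
    status: str = "proposed"
    evaluation_notes: str = ""  # filled in by the next AAR (step 6)

    def advance(self, notes: str = "") -> str:
        """Move to the next pipeline stage; returns the new status."""
        idx = STATUSES.index(self.status)
        if idx < len(STATUSES) - 1:
            self.status = STATUSES[idx + 1]
        if notes:
            self.evaluation_notes = notes
        return self.status
```

Reviewing every update that is not yet "evaluated" at the start of each AAR keeps the pipeline from stalling at step 4.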
Training Gap Identification
AARs frequently reveal skill gaps:
AAR Skill Gap → Training Response:
Gap: "Team struggled with white-box adversarial attacks because
no one had experience with PyTorch model internals."
Response:
1. Add "PyTorch model analysis" to team training requirements
2. Allocate 2 days of training time per engineer in next quarter
3. Pair less experienced members with mentor for next white-box engagement
4. Track improvement in next AAR
Tool Development Priorities
AAR tool assessments drive tool investment decisions:
AAR Tool Finding → Development Priority:
Finding: "garak prompt injection testing found 3 of our 7 manually
discovered prompt injection vulnerabilities. It missed the 4 most
sophisticated techniques that required multi-turn context."
Action:
1. File feature request with garak maintainers for multi-turn support
2. Develop internal wrapper that orchestrates multi-turn garak testing
3. Allocate 40 hours in Q2 for custom multi-turn injection tool
4. Evaluate improvement in next engagement AAR
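Aggregating tool assessments across several AARs makes these investment decisions data-driven rather than anecdotal. A standalone sketch using plain tuples instead of the ToolAssessment records above so it runs on its own; the ranking heuristic (repeated "replace"/"improve" flags first, then low average effectiveness) is an assumption:

```python
from collections import defaultdict

def rank_tool_investment(assessments: list[tuple[str, int, str]]) -> list[dict]:
    """assessments: (tool_name, effectiveness 1-5, recommendation) tuples
    drawn from AAR tool assessments. Returns tools ranked by how urgently
    they need investment."""
    by_tool: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for name, effectiveness, recommendation in assessments:
        by_tool[name].append((effectiveness, recommendation))
    ranked = []
    for name, records in by_tool.items():
        avg_eff = sum(e for e, _ in records) / len(records)
        flagged = sum(1 for _, r in records if r in ("replace", "improve"))
        ranked.append({
            "tool": name,
            "avg_effectiveness": avg_eff,
            "flags": flagged,
            "assessments": len(records),
        })
    # Most "replace"/"improve" flags first, then lowest average effectiveness
    ranked.sort(key=lambda t: (-t["flags"], t["avg_effectiveness"]))
    return ranked
```

A tool flagged "improve" in three consecutive AARs is a far stronger budget argument than a single engineer's complaint.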
Reporting AAR Results to Leadership
Leadership does not need the full AAR — they need the strategic insights:
def generate_leadership_report(aars: list["AfterActionReview"],
period: str = "quarterly") -> str:
"""
Generate a quarterly leadership report from AAR data.
Focuses on program effectiveness and risk trends.
"""
total_findings = sum(a.total_findings for a in aars)
total_critical = sum(a.critical_findings for a in aars)
total_high = sum(a.high_findings for a in aars)
technique_success_rates = {}
for aar in aars:
for t in aar.techniques:
cat = t.category
if cat not in technique_success_rates:
technique_success_rates[cat] = {"success": 0, "total": 0}
technique_success_rates[cat]["total"] += 1
if t.outcome == TechniqueOutcome.SUCCESSFUL:
technique_success_rates[cat]["success"] += 1
lines = [
f"# AI Red Team {period.title()} Report",
"",
"## Program Metrics",
f"- Engagements completed: {len(aars)}",
f"- Total findings: {total_findings}",
f" - Critical: {total_critical}",
f" - High: {total_high}",
"",
"## Key Trends",
]
for cat, data in sorted(
technique_success_rates.items(),
key=lambda x: x[1]["success"] / max(x[1]["total"], 1),
reverse=True,
):
rate = data["success"] / max(data["total"], 1)
lines.append(
f"- {cat}: {rate:.0%} success rate "
f"({data['success']}/{data['total']} attempts)"
)
lines.extend([
"",
"## Methodology Updates Identified in AARs",
])
for aar in aars:
for update in aar.methodology_updates_needed:
lines.append(f"- {update}")
lines.extend([
"",
"## Investment Recommendations",
])
for aar in aars:
for need in aar.training_needs_identified:
lines.append(f"- Training: {need}")
return "\n".join(lines)
Common AAR Anti-Patterns
The blame session: The AAR devolves into finger-pointing. Prevent this by establishing ground rules and using "we" language. If blame patterns emerge, the facilitator should redirect: "Let's focus on what the team can change, not what individuals did."
The victory lap: The team only discusses what went well. Push explicitly for failures and surprises. "What is one thing that, if we had done it differently, would have produced more or better findings?"
The documentation dump: The AAR produces 20 pages of notes that nobody reads. Keep the AAR template structured and action-oriented. Every insight should either be a technique record, a tool assessment, or an action item. Narrative is for context, not for its own sake.
The orphan AAR: The AAR is conducted but findings are never implemented. Assign owners and due dates to every action item. Track implementation at the start of the next AAR.
The unchanging AAR: The same template and same questions are used for every engagement regardless of type. Customize AAR questions based on engagement type, and periodically review whether the AAR format itself needs improvement.
Longitudinal Analysis: Tracking Program Evolution Through AARs
Individual AARs capture point-in-time knowledge. The compounding value comes from longitudinal analysis — examining how the program evolves across multiple engagements:
def longitudinal_analysis(aars: list["AfterActionReview"]) -> dict:
"""
Analyze trends across multiple AARs to identify program
improvement patterns and persistent gaps.
"""
if len(aars) < 3:
return {"message": "Need at least 3 AARs for trend analysis"}
# Sort by date
aars_sorted = sorted(aars, key=lambda a: a.start_date)
# Track technique success rates over time
technique_trends = {}
for aar in aars_sorted:
quarter = f"{aar.start_date.year}-Q{(aar.start_date.month - 1) // 3 + 1}"
for tech in aar.techniques:
if tech.technique_name not in technique_trends:
technique_trends[tech.technique_name] = []
technique_trends[tech.technique_name].append({
"quarter": quarter,
"outcome": tech.outcome.value,
"target": tech.target_component,
})
# Identify recurring issues
all_improvements = []
for aar in aars_sorted:
all_improvements.extend(aar.what_could_improve)
# Find improvements mentioned in multiple AARs (persistent gaps)
from collections import Counter
improvement_freq = Counter()
for item in all_improvements:
# Normalize for comparison
key = item.lower().strip()
improvement_freq[key] += 1
persistent_gaps = [
{"issue": issue, "occurrences": count}
for issue, count in improvement_freq.items()
if count >= 2
]
# Track time estimation accuracy over time
estimation_accuracy = []
for aar in aars_sorted:
total_planned = sum(t.planned_hours for t in aar.time_records)
total_actual = sum(t.actual_hours for t in aar.time_records)
if total_planned > 0:
accuracy = total_actual / total_planned
estimation_accuracy.append({
"engagement": aar.engagement_id,
"date": aar.start_date.isoformat(),
"accuracy_ratio": accuracy,
})
# Determine if estimation is improving
if len(estimation_accuracy) >= 3:
recent_accuracy = [
abs(1 - e["accuracy_ratio"])
for e in estimation_accuracy[-3:]
]
early_accuracy = [
abs(1 - e["accuracy_ratio"])
for e in estimation_accuracy[:3]
]
estimation_improving = (
sum(recent_accuracy) / len(recent_accuracy) <
sum(early_accuracy) / len(early_accuracy)
)
else:
estimation_improving = None
return {
"total_engagements_analyzed": len(aars_sorted),
"time_span": f"{aars_sorted[0].start_date} to {aars_sorted[-1].end_date}",
"technique_trends": technique_trends,
"persistent_gaps": persistent_gaps,
"estimation_accuracy_trend": estimation_accuracy,
"estimation_improving": estimation_improving,
"recommendations": _generate_recommendations(
persistent_gaps, estimation_improving, technique_trends
),
}
def _generate_recommendations(persistent_gaps, estimation_improving,
technique_trends) -> list[str]:
recommendations = []
if persistent_gaps:
recommendations.append(
f"PRIORITY: {len(persistent_gaps)} issues appear in multiple AARs "
f"without resolution. Escalate these to program management: "
f"{', '.join(g['issue'][:50] for g in persistent_gaps[:3])}"
)
if estimation_improving is False:
recommendations.append(
"Time estimation accuracy is not improving across engagements. "
"Consider revising estimation methodology or calibrating "
"against historical data."
)
return recommendations
This longitudinal view answers questions that individual AARs cannot: Is the team getting better? Are the same gaps persisting across engagements? Are particular attack techniques becoming more or less effective as the organization's AI systems mature? This data is invaluable for program reviews and budget justification — it demonstrates the program's trajectory, not just its current state.
Key Takeaways
After-action reviews are the mechanism by which individual AI red team engagements compound into institutional capability. Without AARs, each engagement is an isolated event; with them, each engagement builds on every previous one. The keys to effective AARs are psychological safety (honesty requires it), structure (capture technique records, tool assessments, and action items, not just narrative), timeliness (within 48 hours while details are fresh), and follow-through (action items with owners and deadlines, tracked to completion).
For AI red teaming specifically, the technique-by-technique review and the model behavior observations sections are the highest-value AAR components. These build the team's understanding of what works against which systems — knowledge that cannot be found in published research because it comes from the team's specific engagement experience.
References
- U.S. Army (2021). "After Action Review." Army Doctrine Publication (ADP) 7-0, Training. The foundational methodology for structured after-action reviews adapted for AI red team contexts.
- Morrison, J. & Meliza, L. (1999). "Foundations of the After Action Review Process." U.S. Army Research Institute for the Behavioral and Social Sciences. Academic treatment of AAR principles applicable to technical team contexts.
- Microsoft (2024). "AI Red Teaming: Lessons from Deploying Red Teams at Scale." Microsoft Security Blog. Practical insights on institutionalizing AI red team knowledge from one of the first organizations to operate a dedicated AI red team.
- MITRE (2025). "ATLAS Navigator — ATT&CK for AI." https://atlas.mitre.org/ — Taxonomy for categorizing AI attack techniques used in AAR technique records.