# Red Team Methodology Fundamentals
What AI red teaming is, how it differs from traditional security testing, and the complete engagement lifecycle from scoping to reporting.
## What Is AI Red Teaming?
AI red teaming is the practice of systematically probing AI systems to discover how they can be made to behave in unintended, harmful, or exploitable ways. Unlike traditional software testing that checks whether a system does what it should, red teaming asks: what can this system be made to do that it should not?
The term "red team" originates from Cold War military exercises in which a designated adversary force (red team) would challenge the plans and assumptions of the friendly force (blue team). In cybersecurity, this evolved into penetration testing. AI red teaming extends this tradition to address the unique challenges of testing systems that are probabilistic, opaque, and capable of generating novel outputs.
## How AI Red Teaming Differs from Traditional Security Testing
Traditional penetration testing and AI red teaming share a mindset but diverge significantly in methodology, tooling, and success criteria.
| Dimension | Traditional Pen Testing | AI Red Teaming |
|---|---|---|
| Target | Deterministic software systems | Probabilistic AI models and surrounding infrastructure |
| Inputs | Structured (SQL, HTTP requests, binaries) | Unstructured (natural language, images, audio) |
| Vulnerabilities | Well-categorized (CVEs, OWASP Top 10) | Emerging taxonomy, context-dependent |
| Reproducibility | High (same input produces same bug) | Variable (temperature, sampling affect outcomes) |
| Success criteria | Binary (exploited or not) | Often gradient (partial bypass, degraded safety) |
| Scope | Code, network, infrastructure | Model behavior, training data, alignment, safety |
| Tools | Burp Suite, Metasploit, Nmap | Custom prompting, automated fuzzing, evaluation harnesses |
### The Probabilistic Challenge
When you find a SQL injection vulnerability, it works every time. When you find a jailbreak, it might work 30% of the time at temperature 0.7 and 5% at temperature 0.3. This probabilistic nature changes everything about how you scope engagements, define success, and report findings.
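Because a jailbreak's success is stochastic, findings are best reported as a measured success rate over many trials rather than a single pass/fail. The sketch below uses a simulated `query_model` stand-in; the toy failure model and all names are illustrative assumptions, not a real API:

```python
import random

def query_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a chat-completion API call.

    Simulates a guardrail that fails more often at higher temperature;
    swap in a real client for an actual engagement.
    """
    bypass_rate = 0.05 + 0.35 * temperature  # toy failure model (assumption)
    if random.random() < bypass_rate:
        return "UNSAFE: guardrail bypassed"
    return "I can't help with that."

def attack_success_rate(prompt: str, temperature: float, trials: int = 200) -> float:
    """Repeat the same attack many times; report the fraction of bypasses."""
    hits = sum(query_model(prompt, temperature).startswith("UNSAFE")
               for _ in range(trials))
    return hits / trials

random.seed(0)  # make the measured rates reproducible
for temp in (0.3, 0.7):
    print(f"temperature={temp}: success rate {attack_success_rate('<jailbreak>', temp):.0%}")
```

Reporting the rate (and the trial count) lets the development team judge whether a fix actually moved the needle, rather than arguing over a single lucky or unlucky transcript.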
### The Natural Language Interface
Traditional security testing relies on exploiting rigid parsers: a SQL engine, an HTTP server, a binary format. AI systems accept natural language, which means the "parser" is the model itself. There is no specification to violate, no RFC to exploit. Instead, you are manipulating a statistical system's learned behaviors.
## The Engagement Lifecycle
An AI red team engagement follows a structured lifecycle. Skipping phases leads to incomplete coverage, wasted effort, or findings that cannot be acted upon.
### Scoping and Planning
Define what is being tested, what is in scope, and what success looks like. This includes identifying the AI system's purpose, its risk profile, and the threat actors you are simulating. Key deliverable: a scoping document with agreed-upon rules of engagement.
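Keeping the scoping document as structured data makes the rules of engagement machine-checkable during the engagement (e.g., a harness can refuse to exceed the agreed request budget). A minimal sketch; the field names are an assumed schema, not an industry standard:

```python
from dataclasses import dataclass, field

@dataclass
class EngagementScope:
    """Illustrative scoping record; the schema is an assumption."""
    system_purpose: str
    threat_actor: str                    # who the team is simulating
    in_scope: list = field(default_factory=list)
    out_of_scope: list = field(default_factory=list)
    max_requests_per_day: int = 1000     # rate limit from the rules of engagement
    report_critical_immediately: bool = True

scope = EngagementScope(
    system_purpose="customer service chatbot",
    threat_actor="determined external attacker",
    in_scope=["model behavior", "public API"],
    out_of_scope=["exfiltration of real user data"],
    max_requests_per_day=5000,
)
print(scope.threat_actor)  # → determined external attacker
```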
### Reconnaissance
Understand the target system. What model is it using? What guardrails are in place? How does the API work? What does the system prompt contain? This phase combines traditional OSINT with AI-specific techniques like model fingerprinting and system prompt extraction.
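Model fingerprinting can be sketched as sending a fixed set of probe prompts and comparing the observed response styles against signatures of known models. Everything here — the probe names, signature values, and matching rule — is a toy illustration:

```python
# Known per-model response signatures for a fixed probe set.
# Probes, styles, and model names are all illustrative assumptions.
KNOWN_SIGNATURES = {
    "model-a": {"refusal-probe": "apologetic", "math-probe": "terse"},
    "model-b": {"refusal-probe": "curt", "math-probe": "verbose"},
}

def fingerprint(observed: dict) -> str:
    """Return the known model whose probe signature best matches
    the responses observed from the target system."""
    def matches(name: str) -> int:
        sig = KNOWN_SIGNATURES[name]
        return sum(observed.get(probe) == style for probe, style in sig.items())
    return max(KNOWN_SIGNATURES, key=matches)

print(fingerprint({"refusal-probe": "curt", "math-probe": "verbose"}))  # → model-b
```

In practice the "signature" might be refusal phrasing, tokenizer quirks, or knowledge-cutoff answers; the matching logic stays the same.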
### Threat Modeling
Based on reconnaissance, identify the most likely and most impactful attack vectors. Prioritize testing efforts using a threat model that accounts for the specific AI system's architecture, deployment, and use case. See Threat Modeling Basics.
### Attack Execution
Systematically execute attacks against the identified vectors. This includes both manual creative attacks and automated scanning with tools like promptfoo or garak. Document every attempt, including failures, as they inform the system's robustness profile.
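Documenting every attempt, including failures, is easiest if the harness writes a structured log as it goes. A minimal sketch using a CSV log; the field set is an assumption, not a standard format:

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "vector", "prompt", "response", "success"]

def new_log(stream):
    """Create a CSV attempt log on any writable stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    return writer

def log_attempt(writer, vector, prompt, response, success):
    """Record one attempt - failures matter as much as successes,
    since together they map the system's robustness profile."""
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "vector": vector,
        "prompt": prompt,
        "response": response[:200],  # truncate long outputs in the log
        "success": success,
    })

log = io.StringIO()
writer = new_log(log)
log_attempt(writer, "prompt-injection", "Ignore previous instructions...",
            "I can't help with that.", False)
print(log.getvalue().splitlines()[0])  # → timestamp,vector,prompt,response,success
```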
### Analysis and Validation
Validate findings, assess their severity, determine root causes, and classify them within the attack taxonomy. Distinguish between novel vulnerabilities and known weaknesses that were not mitigated.
### Reporting and Remediation
Produce actionable reports that include severity ratings, reproduction steps, evidence (screenshots, logs, API calls), root cause analysis, and remediation recommendations. The report should enable the development team to fix issues without requiring the red team's presence.
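A finding record carrying the fields listed above can be kept as a small data structure, which keeps reports consistent across the team. The schema below is illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """Illustrative finding record mirroring the report fields above."""
    title: str
    severity: str              # e.g. critical / high / medium / low
    reproduction_steps: list
    evidence: list             # log excerpts, API transcripts
    root_cause: str
    remediation: str
    success_rate: float        # fraction of attempts that reproduced it

    def summary(self) -> str:
        return (f"[{self.severity.upper()}] {self.title} "
                f"(reproduces in {self.success_rate:.0%} of attempts)")

f = Finding(
    title="System prompt extraction via role-play",
    severity="high",
    reproduction_steps=["Send role-play framing", "Ask for 'initial instructions'"],
    evidence=["transcript-014.txt"],
    root_cause="System prompt not isolated from user-controllable context",
    remediation="Filter outputs that quote system-prompt text",
    success_rate=0.4,
)
print(f.summary())
```

Note the explicit `success_rate` field: because findings are probabilistic, the reproduction rate is part of the evidence, not an afterthought.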
### Retest and Verification
After remediations are applied, retest to verify that fixes are effective and have not introduced new vulnerabilities. This phase often reveals that patches address the specific attack but not the underlying vulnerability class.
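A retest pass can replay the recorded attack prompts against the patched system and flag which still succeed. The toy "patch" below blocks only the exact reported prompt, illustrating a fix that addresses the specific attack but not the vulnerability class; all names are hypothetical:

```python
def retest(recorded_attacks, target):
    """Replay recorded attack prompts; return those that still bypass.
    `target` is a callable stand-in that returns True on a bypass."""
    return [attack for attack in recorded_attacks if target(attack)]

# Toy patch that blocks only the exact reported string, not paraphrases.
BLOCKED = {"ignore previous instructions"}

def patched_target(prompt: str) -> bool:
    return prompt not in BLOCKED  # True means the attack still works

survivors = retest(
    ["ignore previous instructions", "kindly disregard the prior instructions"],
    patched_target,
)
print(survivors)  # → ['kindly disregard the prior instructions']
```

A retest suite should therefore include paraphrases and variants of each reported attack, not just the original transcript.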
## Roles in an AI Red Team
Effective AI red teaming requires a blend of skills that rarely exist in a single person.
| Role | Skills | Focus |
|---|---|---|
| Prompt Engineer / Attacker | Creative writing, adversarial thinking, deep model knowledge | Crafting novel jailbreaks and prompt injections |
| ML Engineer | Model architecture, training pipelines, optimization | Gradient-based attacks, model extraction, poisoning |
| Security Engineer | Traditional pen testing, API security, infrastructure | API abuse, authentication bypass, supply chain attacks |
| Domain Expert | Subject matter expertise in the application domain | Identifying harmful outputs specific to the use case |
| Ethics and Safety Specialist | AI policy, safety frameworks, harm taxonomies | Evaluating safety implications and dual-use concerns |
## Scoping an AI Red Team Engagement
Scoping is where most engagements succeed or fail. An underscoped engagement produces shallow findings; an overscoped engagement wastes resources.
### Key Scoping Questions
- What is the AI system's purpose? A customer service chatbot has a different risk profile than a code generation tool or a medical diagnosis assistant.
- What is the threat model? Are you simulating a casual user, a determined attacker, or a nation-state adversary? The sophistication of your attacks should match the adversary you are simulating.
- What is in scope? The model itself? The API? The surrounding application? The training data? The deployment infrastructure?
- What are the rules of engagement? Rate limits on API calls? Budget for compute? Off-limits attacks (e.g., no actual data exfiltration of real user data)?
- What constitutes a finding? Is a jailbreak that requires 100 attempts a valid finding? What about an output that is borderline harmful?
- What is the reporting timeline? Critical findings reported immediately? Weekly summaries? Final report at the end?
### Risk-Based Prioritization
Not all attack vectors deserve equal attention. Prioritize based on the intersection of likelihood and impact:
| Priority | Likelihood | Impact | Example |
|---|---|---|---|
| Critical | High | High | Prompt injection in customer-facing chatbot exposing PII |
| High | Medium | High | System prompt extraction revealing proprietary instructions |
| Medium | High | Medium | Jailbreak producing mildly inappropriate content |
| Low | Low | Low | Model fingerprinting revealing which base model is used |
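The matrix above can be encoded as a small scoring function. The heavier weighting of impact over likelihood is an illustrative convention chosen to reproduce the example rows, not a prescribed formula:

```python
def priority(likelihood: str, impact: str) -> str:
    """Map a likelihood/impact pair onto the prioritization matrix.
    The weighting below is an illustrative convention."""
    rank = {"low": 0, "medium": 1, "high": 2}
    score = rank[likelihood] + 2 * rank[impact]  # impact weighs heavier
    if score >= 6:
        return "critical"
    if score == 5:
        return "high"
    if score >= 3:
        return "medium"
    return "low"

print(priority("medium", "high"))  # → high
```

Encoding the matrix keeps triage consistent when many testers file findings in parallel; disagreements become a visible change to the scoring rule rather than ad-hoc calls.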
## AI Red Teaming Frameworks
Several frameworks provide structure for AI red team engagements:
- MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) provides a knowledge base of adversarial techniques organized by tactic, similar to ATT&CK for traditional cybersecurity
- OWASP Top 10 for LLM Applications catalogs the most critical security risks specific to LLM-based applications
- NIST AI Risk Management Framework offers a broader governance framework for AI risk that contextualizes red teaming within organizational risk management
- Microsoft AI Red Team Framework details Microsoft's internal approach to red teaming AI systems, with lessons learned from testing Copilot and other products
## Related Topics
- AI Attack Taxonomy — categorizing attacks by target, technique, and impact
- Threat Modeling for AI — identifying assets, threats, and attack vectors
- Ethics of AI Red Teaming — responsible testing practices
- Legal Landscape — authorization and legal considerations
## References
- "MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems" - MITRE Corporation (2023) - Comprehensive knowledge base of adversarial tactics and techniques for AI systems, modeled after the ATT&CK framework
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard classification of the most critical security risks in LLM-based applications
- "AI Risk Management Framework" - NIST (2024) - Federal framework for managing risks associated with AI systems across their lifecycle
- "Microsoft AI Red Team" - Microsoft (2024) - Lessons learned from red teaming large-scale AI systems including GPT-4 and Copilot