Threat Modeling for AI Systems
How to identify assets, threats, and attack vectors specific to AI systems using simplified threat modeling frameworks adapted for machine learning.
What Is Threat Modeling?
Threat modeling is the practice of systematically thinking through what could go wrong with a system before attackers demonstrate it firsthand. For AI systems, threat modeling answers three core questions: What are we protecting? (assets), Who might attack it? (threat actors), and How might they do it? (attack vectors).
Traditional threat modeling frameworks like STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) were designed for deterministic software systems. AI systems introduce new asset types, new threat actors, and new attack vectors that these frameworks do not fully cover.
Step 1: Identify AI-Specific Assets
Assets in an AI system go beyond traditional data and infrastructure. The model itself, its training data, its behavioral alignment, and its tool access are all assets that need protection.
| Asset | Description | Why It Matters |
|---|---|---|
| Model weights | The learned parameters of the neural network | Theft enables cloning; tampering enables backdoors |
| Training data | Data used for pre-training, fine-tuning, and RLHF | Contains potentially sensitive or proprietary information |
| System prompt | Instructions that define the model's behavior and constraints | Exposure reveals security mechanisms and business logic |
| Alignment / safety training | The behavioral constraints learned through RLHF and safety fine-tuning | Bypass undermines all policy-based protections |
| Tool access | APIs, databases, file systems, and external services the model can invoke | Misuse enables real-world impact beyond text generation |
| User data | Conversations, uploaded files, and interaction history | Contains PII and potentially sensitive information |
| Retrieval corpus | Documents used for RAG (retrieval-augmented generation) | Poisoning corrupts the model's knowledge base |
| Business logic | Application-specific rules, pricing, and decision-making | Extraction reveals competitive intelligence |
Asset Sensitivity Matrix
Not all assets are equally sensitive. Classify each asset by its sensitivity to confidentiality, integrity, and availability compromise:
| Asset | Confidentiality | Integrity | Availability |
|---|---|---|---|
| Model weights | High | Critical | Medium |
| Training data | High | Critical | Low |
| System prompt | Medium | High | Low |
| Alignment | N/A | Critical | High |
| Tool access | Medium | Critical | High |
| User data | Critical | High | Medium |
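The sensitivity matrix lends itself to a small data structure, which keeps classifications machine-checkable as the asset list grows. A minimal sketch, assuming the asset names and ratings from the tables above; the `Asset` class and helper are illustrative, not part of any standard framework:

```python
from dataclasses import dataclass

# Sensitivity ratings ordered from least to most severe.
RATINGS = ["N/A", "Low", "Medium", "High", "Critical"]

@dataclass
class Asset:
    name: str
    confidentiality: str
    integrity: str
    availability: str

    def max_sensitivity(self) -> str:
        """Return the most severe rating across the CIA dimensions."""
        return max(
            (self.confidentiality, self.integrity, self.availability),
            key=RATINGS.index,
        )

# Ratings taken from the sensitivity matrix above.
assets = [
    Asset("Model weights", "High", "Critical", "Medium"),
    Asset("Training data", "High", "Critical", "Low"),
    Asset("System prompt", "Medium", "High", "Low"),
    Asset("Alignment", "N/A", "Critical", "High"),
    Asset("Tool access", "Medium", "Critical", "High"),
    Asset("User data", "Critical", "High", "Medium"),
]

# Assets whose integrity compromise would be critical.
integrity_critical = [a.name for a in assets if a.integrity == "Critical"]
```

Queries like `integrity_critical` make it easy to answer questions such as "which assets must our integrity testing cover?" without re-reading the table.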
Step 2: Identify Threat Actors
Threat actors for AI systems range from curious users to sophisticated adversaries. Understanding their motivations and capabilities shapes the depth and focus of your testing.
Casual User
- Motivation: Curiosity, entertainment, or testing boundaries.
- Capabilities: No technical expertise; uses natural language only.
- Typical attacks: Simple jailbreak attempts, copying prompts from social media, basic system prompt extraction.
Skilled Attacker
- Motivation: Misuse for content generation, data theft, or competitive intelligence.
- Capabilities: Understands prompt engineering, familiar with known attack techniques, can script API interactions.
- Typical attacks: Sophisticated prompt injection chains, automated extraction, multi-turn manipulation.
ML Practitioner
- Motivation: Model theft, research, or competitive advantage.
- Capabilities: Deep understanding of model architectures, access to compute for training surrogate models, ability to design gradient-based attacks.
- Typical attacks: Model extraction through systematic querying, adversarial example generation, fine-tuning attacks.
Insider Threat
- Motivation: Data theft, sabotage, or unauthorized access.
- Capabilities: Access to internal systems, training pipelines, or deployment infrastructure.
- Typical attacks: Training data poisoning, model backdoor insertion, system prompt manipulation.
Automated Agent
- Motivation: Varies (the agent acts on behalf of another actor).
- Capabilities: Can interact at scale, chain multiple attacks, and exploit agent-to-agent interfaces.
- Typical attacks: Indirect prompt injection at scale, automated credential harvesting through tool abuse, feedback poisoning.
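One way to make the capability assessments above comparable is to place each actor on an ordinal capability scale and ask which attacks fall within reach. A hypothetical sketch: the actor names come from the profiles above, but the numeric levels are an illustrative assumption, not an established scoring scheme:

```python
from dataclasses import dataclass

@dataclass
class ThreatActor:
    name: str
    motivation: str
    capability: int  # 1 = natural language only ... 4 = internal access

# Levels assigned for illustration; calibrate them to your own system.
actors = [
    ThreatActor("Casual user", "curiosity", 1),
    ThreatActor("Skilled attacker", "misuse / data theft", 2),
    ThreatActor("ML practitioner", "model theft", 3),
    ThreatActor("Insider threat", "sabotage", 4),
    ThreatActor("Automated agent", "varies", 2),
]

def who_can(required_capability: int) -> list[str]:
    """Actors capable of executing an attack at the given level."""
    return [a.name for a in actors if a.capability >= required_capability]

# Gradient-based attacks require ML expertise (level 3 in this sketch).
advanced = who_can(3)
```

A call like `who_can(3)` then narrows the actor list to those worth modeling for gradient-based or insider-dependent vectors.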
Step 3: Map Attack Vectors
With assets identified and threat actors defined, the next step is mapping how each actor could reach each asset. An attack vector is the path an adversary takes from their starting position to their target.
The AI Attack Surface
┌─────────────────────────────────┐
User Input ──────→│ Application Layer │
│ (API, UI, authentication) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Orchestration Layer │
│ (prompt assembly, RAG, │
│ tool routing, guardrails) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Model Layer │
│ (inference, attention, │
│ token generation) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Data Layer │
│ (training data, embeddings, │
│ retrieval corpus, user data) │
└─────────────────────────────────┘
Vector Mapping by Layer
Application Layer Vectors:
- API authentication bypass or abuse
- Rate limit evasion
- Input validation gaps (encoding, length)
- Session management flaws
Orchestration Layer Vectors:
- Prompt injection through user input
- RAG poisoning through external document manipulation
- Tool call manipulation
- Guardrail bypass through obfuscation or semantic evasion
Model Layer Vectors:
- Jailbreaking through adversarial prompts
- Adversarial examples that cause misclassification
- Model extraction through systematic querying
- Training data extraction through memorization exploitation
Data Layer Vectors:
- Training data poisoning (if access exists)
- Embedding space manipulation
- Knowledge base corruption
- User data exfiltration through the model
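The layer-to-vector mapping above reduces naturally to a lookup table, which is useful for checking test-plan coverage. A sketch using the layers and vectors just listed; the identifier names and the `uncovered` helper are illustrative:

```python
# Layers and vectors from the mapping above, as short identifiers.
ATTACK_VECTORS = {
    "application": [
        "api_auth_bypass", "rate_limit_evasion",
        "input_validation_gaps", "session_management_flaws",
    ],
    "orchestration": [
        "prompt_injection", "rag_poisoning",
        "tool_call_manipulation", "guardrail_bypass",
    ],
    "model": [
        "jailbreaking", "adversarial_examples",
        "model_extraction", "training_data_extraction",
    ],
    "data": [
        "training_data_poisoning", "embedding_manipulation",
        "knowledge_base_corruption", "user_data_exfiltration",
    ],
}

def uncovered(planned_tests: set[str]) -> dict[str, list[str]]:
    """Vectors per layer that the test plan does not yet cover."""
    return {
        layer: [v for v in vectors if v not in planned_tests]
        for layer, vectors in ATTACK_VECTORS.items()
    }

# Example: a plan that only covers three popular vectors.
gaps = uncovered({"prompt_injection", "jailbreaking", "rag_poisoning"})
```

Running `uncovered` against the set of planned tests surfaces whole layers (often the data layer) that the plan silently ignores.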
Step 4: Assess and Prioritize
With attack vectors mapped, assess each by likelihood and impact to prioritize testing effort.
Likelihood Factors
- Access required: Does the attacker need authenticated access? Physical access? Internal network access?
- Skill required: Can the attack be executed with natural language alone, or does it require ML expertise?
- Existing mitigations: Are there guardrails, rate limits, or monitoring in place?
- Public knowledge: Are known exploits or techniques available for this vector?
Impact Factors
- Data sensitivity: What data could be exposed or corrupted?
- Blast radius: How many users or systems are affected?
- Reversibility: Can the damage be undone?
- Regulatory implications: Does the impact trigger legal or compliance obligations?
Priority Matrix
| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Likelihood | Monitor | Test | Test First |
| Medium Likelihood | Accept | Test | Test |
| Low Likelihood | Accept | Monitor | Test If Time |
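The priority matrix can be encoded directly so that every (likelihood, impact) pair maps to a deterministic action; this keeps triage decisions consistent across reviewers. A sketch with cell values taken from the matrix above:

```python
# Keys: (likelihood, impact). Values follow the priority matrix above.
PRIORITY = {
    ("High", "Low"): "Monitor",
    ("High", "Medium"): "Test",
    ("High", "High"): "Test First",
    ("Medium", "Low"): "Accept",
    ("Medium", "Medium"): "Test",
    ("Medium", "High"): "Test",
    ("Low", "Low"): "Accept",
    ("Low", "Medium"): "Monitor",
    ("Low", "High"): "Test If Time",
}

def triage(likelihood: str, impact: str) -> str:
    """Look up the testing action for a vector's assessed scores."""
    return PRIORITY[(likelihood, impact)]
```

For example, `triage("High", "High")` returns "Test First", so a high-likelihood, high-impact vector goes to the top of the test queue.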
Simplified AI Threat Model Template
For teams new to AI threat modeling, this simplified template provides a starting point:
AI System: [Name and description]
Deployment: [API / Self-hosted / Edge / Hybrid]
Model: [Base model, fine-tuned or not, model size]
Tools: [List of tools/APIs the model can access]
Users: [Who interacts with the system]
ASSETS:
1. [Asset] — Sensitivity: [High/Medium/Low]
...
THREAT ACTORS:
1. [Actor type] — Motivation: [...] — Capability: [...]
...
ATTACK VECTORS (prioritized):
1. [Vector] — Target: [Asset] — Likelihood: [H/M/L]
— Impact: [H/M/L] — Priority: [Critical/High/Medium/Low]
...
EXISTING MITIGATIONS:
1. [Mitigation] — Covers: [Which vectors] — Gaps: [What it misses]
...
TESTING PRIORITIES:
1. [Highest priority vector and testing approach]
...
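The template can also be kept as structured data and rendered to text on demand, which keeps threat models diffable in version control. A hypothetical sketch: the field names mirror the template above, but the example values and the `render` helper are invented for illustration:

```python
# Example threat model; values are invented for illustration.
threat_model = {
    "system": "Support chatbot (example)",
    "deployment": "API",
    "assets": [
        {"name": "System prompt", "sensitivity": "Medium"},
    ],
    "vectors": [
        {"vector": "Prompt injection", "target": "System prompt",
         "likelihood": "H", "impact": "M", "priority": "High"},
    ],
}

def render(tm: dict) -> str:
    """Render the structured model into the text template above."""
    lines = [
        f"AI System: {tm['system']}",
        f"Deployment: {tm['deployment']}",
        "ASSETS:",
    ]
    for i, a in enumerate(tm["assets"], 1):
        lines.append(f"{i}. {a['name']} — Sensitivity: {a['sensitivity']}")
    lines.append("ATTACK VECTORS (prioritized):")
    for i, v in enumerate(tm["vectors"], 1):
        lines.append(
            f"{i}. {v['vector']} — Target: {v['target']} — "
            f"Likelihood: {v['likelihood']} — Impact: {v['impact']} — "
            f"Priority: {v['priority']}"
        )
    return "\n".join(lines)
```

Storing the dictionary (or its YAML/JSON equivalent) alongside the code means each review cycle produces a readable diff rather than a rewritten document.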
Related Topics
- Red Team Methodology Fundamentals — how threat modeling fits into the engagement lifecycle
- AI Attack Taxonomy — classifying the attacks identified through threat modeling
- AI System Architecture — understanding the systems you are modeling
- Adversarial ML: Core Concepts — the adversarial techniques that inform attack vector enumeration
References
- "Threat Modeling: Designing for Security" - Adam Shostack (2014) - The foundational text on threat modeling methodology, applicable to all systems including AI
- "AI Risk Management Framework" - NIST (2024) - Federal framework that contextualizes threat modeling within broader AI risk management
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2023) - Knowledge base of adversarial techniques that populates the "attack vectors" dimension of AI threat models
- "Securing Machine Learning Pipelines" - Google (2024) - Practical guidance on identifying and mitigating threats across the ML development lifecycle