Threat Modeling for AI Systems
How to identify assets, threats, and attack vectors specific to AI systems using simplified threat modeling frameworks adapted for machine learning.
What Is Threat Modeling?
Threat modeling is the practice of systematically thinking through what could go wrong with a system before attackers demonstrate it firsthand. For AI systems, threat modeling answers three core questions: What are we protecting? (assets), Who might attack it? (threat actors), and How might they do it? (attack vectors).
Traditional threat modeling frameworks like STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) were designed for deterministic software systems. AI systems introduce new asset types, new threat actors, and new attack vectors that these frameworks do not fully cover.
Step 1: Identify AI-Specific Assets
Assets in an AI system go beyond traditional data and infrastructure. The model itself, its training data, its behavioral alignment, and its tool access are all assets that need protection.
| Asset | Description | Why It Matters |
|---|---|---|
| Model weights | The learned parameters of the neural network | Theft enables cloning; tampering enables backdoors |
| Training data | Data used for pre-training, fine-tuning, and RLHF | Contains potentially sensitive or proprietary information |
| System prompt | Instructions that define the model's behavior and constraints | Exposure reveals security mechanisms and business logic |
| Alignment / safety training | The behavioral constraints learned through RLHF and safety fine-tuning | Bypass undermines all policy-based protections |
| Tool access | APIs, databases, file systems, and external services the model can invoke | Misuse enables real-world impact beyond text generation |
| User data | Conversations, uploaded files, and interaction history | Contains PII and potentially sensitive information |
| Retrieval corpus | Documents used for RAG (retrieval-augmented generation) | Poisoning corrupts the model's knowledge base |
| Business logic | Application-specific rules, pricing, and decision-making | Extraction reveals competitive intelligence |
Asset Sensitivity Matrix
Not all assets are equally sensitive. Classify each asset by its sensitivity to confidentiality, integrity, and availability compromise:
| Asset | Confidentiality | Integrity | Availability |
|---|---|---|---|
| Model weights | High | Critical | Medium |
| Training data | High | Critical | Low |
| System prompt | Medium | High | Low |
| Alignment | N/A | Critical | High |
| Tool access | Medium | Critical | High |
| User data | Critical | High | Medium |
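The sensitivity matrix lends itself to a small data structure, which keeps classifications machine-checkable as the asset list grows. A minimal sketch, assuming the asset names and ratings from the tables above; the `Asset` class and helper are illustrative, not part of any standard framework:

```python
from dataclasses import dataclass

# Sensitivity ratings ordered from least to most severe.
RATINGS = ["N/A", "Low", "Medium", "High", "Critical"]

@dataclass
class Asset:
    name: str
    confidentiality: str
    integrity: str
    availability: str

    def max_sensitivity(self) -> str:
        """Return the most severe rating across the CIA dimensions."""
        return max(
            (self.confidentiality, self.integrity, self.availability),
            key=RATINGS.index,
        )

# Ratings taken from the sensitivity matrix above.
assets = [
    Asset("Model weights", "High", "Critical", "Medium"),
    Asset("Training data", "High", "Critical", "Low"),
    Asset("System prompt", "Medium", "High", "Low"),
    Asset("Alignment", "N/A", "Critical", "High"),
    Asset("Tool access", "Medium", "Critical", "High"),
    Asset("User data", "Critical", "High", "Medium"),
]

# Assets whose integrity compromise would be critical.
integrity_critical = [a.name for a in assets if a.integrity == "Critical"]
```

Queries like `integrity_critical` make it easy to answer questions such as "which assets must our integrity testing cover?" without re-reading the table.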
Step 2: Identify Threat Actors
Threat actors for AI systems range from curious users to sophisticated adversaries. Understanding their motivations and capabilities shapes the depth and focus of your testing.
Casual User
- Motivation: Curiosity, entertainment, or testing boundaries.
- Capabilities: No technical expertise; uses natural language only.
- Typical attacks: Simple jailbreak attempts, copying prompts from social media, basic system prompt extraction.
Skilled Attacker
- Motivation: Misuse for content generation, data theft, or competitive intelligence.
- Capabilities: Understands prompt engineering, familiar with known attack techniques, can script API interactions.
- Typical attacks: Sophisticated prompt injection chains, automated extraction, multi-turn manipulation.
ML Practitioner
- Motivation: Model theft, research, or competitive advantage.
- Capabilities: Deep understanding of model architectures, access to compute for training surrogate models, ability to design gradient-based attacks.
- Typical attacks: Model extraction through systematic querying, adversarial example generation, fine-tuning attacks.
Insider Threat
- Motivation: Data theft, sabotage, or unauthorized access.
- Capabilities: Access to internal systems, training pipelines, or deployment infrastructure.
- Typical attacks: Training data poisoning, model backdoor insertion, system prompt manipulation.
Automated Agent
- Motivation: Varies (the agent acts on behalf of another actor).
- Capabilities: Can interact at scale, chain multiple attacks, and exploit agent-to-agent interfaces.
- Typical attacks: Indirect prompt injection at scale, automated credential harvesting through tool abuse, feedback poisoning.
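One way to make the capability assessments above comparable is to place each actor on an ordinal capability scale and ask which attacks fall within reach. A hypothetical sketch: the actor names come from the profiles above, but the numeric levels are an illustrative assumption, not an established scoring scheme:

```python
from dataclasses import dataclass

@dataclass
class ThreatActor:
    name: str
    motivation: str
    capability: int  # 1 = natural language only ... 4 = internal access

# Levels assigned for illustration; calibrate them to your own system.
actors = [
    ThreatActor("Casual user", "curiosity", 1),
    ThreatActor("Skilled attacker", "misuse / data theft", 2),
    ThreatActor("ML practitioner", "model theft", 3),
    ThreatActor("Insider threat", "sabotage", 4),
    ThreatActor("Automated agent", "varies", 2),
]

def who_can(required_capability: int) -> list[str]:
    """Actors capable of executing an attack at the given level."""
    return [a.name for a in actors if a.capability >= required_capability]

# Gradient-based attacks require ML expertise (level 3 in this sketch).
advanced = who_can(3)
```

A call like `who_can(3)` then narrows the actor list to those worth modeling for gradient-based or insider-dependent vectors.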
Step 3: Map Attack Vectors
With assets identified and threat actors defined, the next step is mapping how each actor could reach each asset. An attack vector is the path an adversary takes from their starting position to their target.
The AI Attack Surface
┌─────────────────────────────────┐
User Input ──────→│ Application Layer │
│ (API, UI, authentication) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Orchestration Layer │
│ (prompt assembly, RAG, │
│ tool routing, guardrails) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Model Layer │
│ (inference, attention, │
│ token generation) │
└──────────┬──────────────────────┘
│
┌──────────▼──────────────────────┐
│ Data Layer │
│ (training data, embeddings, │
│ retrieval corpus, user data) │
└─────────────────────────────────┘
Vector Mapping by Layer
Application Layer Vectors:
- API authentication bypass or abuse
- Rate limit evasion
- Input validation gaps (encoding, length)
- Session management flaws
Orchestration Layer Vectors:
- Prompt injection through user input
- RAG poisoning through external document manipulation
- Tool call manipulation
- Guardrail bypass through obfuscation or semantic evasion
Model Layer Vectors:
- Jailbreaking through adversarial prompts
- Adversarial examples that cause misclassification
- Model extraction through systematic querying
- Training data extraction through memorization exploitation
Data Layer Vectors:
- Training data poisoning (if access exists)
- Embedding space manipulation
- Knowledge base corruption
- User data exfiltration through the model
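The layer-to-vector mapping above reduces naturally to a lookup table, which is useful for checking test-plan coverage. A sketch using the layers and vectors just listed; the identifier names and the `uncovered` helper are illustrative:

```python
# Layers and vectors from the mapping above, as short identifiers.
ATTACK_VECTORS = {
    "application": [
        "api_auth_bypass", "rate_limit_evasion",
        "input_validation_gaps", "session_management_flaws",
    ],
    "orchestration": [
        "prompt_injection", "rag_poisoning",
        "tool_call_manipulation", "guardrail_bypass",
    ],
    "model": [
        "jailbreaking", "adversarial_examples",
        "model_extraction", "training_data_extraction",
    ],
    "data": [
        "training_data_poisoning", "embedding_manipulation",
        "knowledge_base_corruption", "user_data_exfiltration",
    ],
}

def uncovered(planned_tests: set[str]) -> dict[str, list[str]]:
    """Vectors per layer that the test plan does not yet cover."""
    return {
        layer: [v for v in vectors if v not in planned_tests]
        for layer, vectors in ATTACK_VECTORS.items()
    }

# Example: a plan that only covers three popular vectors.
gaps = uncovered({"prompt_injection", "jailbreaking", "rag_poisoning"})
```

Running `uncovered` against the set of planned tests surfaces whole layers (often the data layer) that the plan silently ignores.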
Step 4: Assess and Prioritize
With attack vectors mapped, assess each by likelihood and impact to prioritize testing effort.
Likelihood Factors
- Access required: Does the attacker need authenticated access? Physical access? Internal network access?
- Skill required: Can the attack be executed with natural language alone, or does it require ML expertise?
- Existing mitigations: Are there guardrails, rate limits, or monitoring in place?
- Public knowledge: Are known exploits or techniques available for this vector?
Impact Factors
- Data sensitivity: What data could be exposed or corrupted?
- Blast radius: How many users or systems are affected?
- Reversibility: Can the damage be undone?
- Regulatory implications: Does the impact trigger legal or compliance obligations?
Priority Matrix
| | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Likelihood | Monitor | Test | Test First |
| Medium Likelihood | Accept | Test | Test |
| Low Likelihood | Accept | Monitor | Test If Time |
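The priority matrix can be encoded directly so that every (likelihood, impact) pair maps to a deterministic action; this keeps triage decisions consistent across reviewers. A sketch with cell values taken from the matrix above:

```python
# Keys: (likelihood, impact). Values follow the priority matrix above.
PRIORITY = {
    ("High", "Low"): "Monitor",
    ("High", "Medium"): "Test",
    ("High", "High"): "Test First",
    ("Medium", "Low"): "Accept",
    ("Medium", "Medium"): "Test",
    ("Medium", "High"): "Test",
    ("Low", "Low"): "Accept",
    ("Low", "Medium"): "Monitor",
    ("Low", "High"): "Test If Time",
}

def triage(likelihood: str, impact: str) -> str:
    """Look up the testing action for a vector's assessed scores."""
    return PRIORITY[(likelihood, impact)]
```

For example, `triage("High", "High")` returns "Test First", so a high-likelihood, high-impact vector goes to the top of the test queue.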
Simplified AI Threat Model Template
For teams new to AI threat modeling, this simplified template provides a starting point:
AI System: [Name and description]
Deployment: [API / Self-hosted / Edge / Hybrid]
Model: [Base model, fine-tuned or not, model size]
Tools: [List of tools/APIs the model can access]
Users: [Who interacts with the system]
ASSETS:
1. [Asset] — Sensitivity: [High/Medium/Low]
...
THREAT ACTORS:
1. [Actor type] — Motivation: [...] — Capability: [...]
...
ATTACK VECTORS (prioritized):
1. [Vector] — Target: [Asset] — Likelihood: [H/M/L]
— Impact: [H/M/L] — Priority: [Critical/High/Medium/Low]
...
EXISTING MITIGATIONS:
1. [Mitigation] — Covers: [Which vectors] — Gaps: [What it misses]
...
TESTING PRIORITIES:
1. [Highest priority vector and testing approach]
...
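The template can also be kept as structured data and rendered to text on demand, which keeps threat models diffable in version control. A hypothetical sketch: the field names mirror the template above, but the example values and the `render` helper are invented for illustration:

```python
# Example threat model; values are invented for illustration.
threat_model = {
    "system": "Support chatbot (example)",
    "deployment": "API",
    "assets": [
        {"name": "System prompt", "sensitivity": "Medium"},
    ],
    "vectors": [
        {"vector": "Prompt injection", "target": "System prompt",
         "likelihood": "H", "impact": "M", "priority": "High"},
    ],
}

def render(tm: dict) -> str:
    """Render the structured model into the text template above."""
    lines = [
        f"AI System: {tm['system']}",
        f"Deployment: {tm['deployment']}",
        "ASSETS:",
    ]
    for i, a in enumerate(tm["assets"], 1):
        lines.append(f"{i}. {a['name']} — Sensitivity: {a['sensitivity']}")
    lines.append("ATTACK VECTORS (prioritized):")
    for i, v in enumerate(tm["vectors"], 1):
        lines.append(
            f"{i}. {v['vector']} — Target: {v['target']} — "
            f"Likelihood: {v['likelihood']} — Impact: {v['impact']} — "
            f"Priority: {v['priority']}"
        )
    return "\n".join(lines)
```

Storing the dictionary (or its YAML/JSON equivalent) alongside the code means each review cycle produces a readable diff rather than a rewritten document.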
Related Topics
- Red Team Methodology Fundamentals — how threat modeling fits into the engagement lifecycle
- AI Attack Taxonomy — classifying the attacks identified through threat modeling
- AI System Architecture — understanding the systems you are modeling
- Adversarial ML: Core Concepts — the adversarial techniques that inform attack vector enumeration
References
- "Threat Modeling: Designing for Security" - Adam Shostack (2014) - The foundational text on threat modeling methodology, applicable to all systems including AI
- "AI Risk Management Framework" - NIST (2024) - Federal framework that contextualizes threat modeling within broader AI risk management
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2023) - Knowledge base of adversarial techniques that populates the "attack vectors" dimension of AI threat models
- "Securing Machine Learning Pipelines" - Google (2024) - Practical guidance on identifying and mitigating threats across the ML development lifecycle