AI Threat Models: White-box, Black-box & Grey-box
Access levels in AI security testing — what's possible at each level, realistic scenarios, and comparison to traditional security threat modeling.
Why Threat Models Matter
A threat model defines what an attacker can see, do, and know. Without a clear threat model, red team engagements either waste time on unrealistic attacks or miss critical realistic ones.
In AI security, the access level determines the entire attack landscape.
The Three Access Levels
Black-Box Access
The attacker can only interact with the system through its normal interface — sending inputs and observing outputs.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Unknown (may be guessable) |
| System prompt | Hidden (extraction attempts possible) |
| API parameters | Only those exposed by the interface |
| Training data | No access |
| Output details | Final text response only |
Available attacks:
| Attack Category | Techniques |
|---|---|
| Prompt injection | Direct injection, role-play, few-shot steering |
| System prompt extraction | Social engineering the model to reveal its instructions |
| Jailbreaking | Manual prompt crafting, automated fuzzing |
| Data extraction | Probing for memorized training data |
| Behavioral testing | Testing for bias, policy violations, inconsistencies |
| Best-of-N sampling | Repeated queries to find stochastic bypasses |
Realistic scenarios: End-user attacking a chatbot, external penetration testing, attacking competitor's product.
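Best-of-N sampling in particular needs nothing beyond the normal interface. A minimal sketch, with `query_model` standing in as a hypothetical black-box chat endpoint (the stochastic refusal behavior is simulated here for illustration — in a real engagement this would be an HTTP call to the target):

```python
import random

# Hypothetical stand-in for a black-box chat endpoint; in a real engagement
# this would be an HTTP call to the target's normal interface.
def query_model(prompt: str, seed: int) -> str:
    random.seed(seed)
    # Simulate stochastic safety behavior: the refusal usually, but not always, triggers.
    return "REFUSED" if random.random() < 0.9 else "COMPLIED"

def best_of_n(prompt: str, n: int = 50):
    """Resend the same prompt n times; return the first non-refusal, if any."""
    for attempt in range(n):
        response = query_model(prompt, seed=attempt)
        if "REFUSED" not in response:
            return attempt, response
    return None

result = best_of_n("benign-looking probe prompt")
if result is not None:
    attempt, response = result
    print(f"stochastic bypass on attempt {attempt}: {response}")
```

The attack exploits exactly the property listed under Determinism below: stochastic outputs mean a defense that holds 90% of the time fails reliably given enough attempts.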
Grey-Box Access
The attacker has partial knowledge — perhaps the model name, API documentation, system prompt, or some architectural details — but not full model weights.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Known (model name, version) |
| System prompt | May be known (leaked, documented) |
| API parameters | Full API documentation available |
| Training data | Partial knowledge (public training data sources) |
| Output details | May include logprobs, token counts |
Additional attacks (beyond black-box):
| Attack Category | Techniques |
|---|---|
| Parameter manipulation | logit_bias, temperature, stop sequences |
| Logprob analysis | Token probability extraction, confidence probing |
| Transfer attacks | Craft attacks on similar open models, test on target |
| Fine-tuning API abuse | Poison fine-tuning data if fine-tuning API available |
| Tool schema exploitation | Craft inputs targeting known tool definitions |
Realistic scenarios: Developer attacking their own company's AI product, researcher with API access and documentation, insider with knowledge of the deployment.
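Documented sampling parameters are the simplest grey-box lever. The sketch below assembles an OpenAI-style request payload combining several of the techniques above; the model name and token IDs are illustrative assumptions, not real values (in practice you would look token IDs up with the target model's tokenizer):

```python
# Illustrative placeholders, not real vocabulary IDs -- e.g. the tokens
# that begin a known refusal phrase, found via the model's tokenizer.
SUPPRESSED_TOKEN_IDS = [1234, 5678]

def build_probe_request(prompt: str) -> dict:
    """Assemble request parameters that steer sampling away from refusals
    and expose per-token probabilities for confidence probing."""
    return {
        "model": "target-model-name",   # known in a grey-box setting
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.2,             # widen the sampling distribution
        # A large negative bias effectively bans the listed tokens:
        "logit_bias": {str(t): -100 for t in SUPPRESSED_TOKEN_IDS},
        "logprobs": True,               # request token probabilities
        "top_logprobs": 5,              # top alternatives per position
    }

request = build_probe_request("probe prompt")
print(request["logit_bias"])  # {'1234': -100, '5678': -100}
```

The returned logprobs then feed the logprob-analysis techniques in the table: comparing token probabilities across probes reveals how confidently the model distinguishes allowed from disallowed completions.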
White-Box Access
Full access to model weights, architecture, training data, and deployment configuration.
| Property | Details |
|---|---|
| Model weights | Full access |
| Architecture | Fully known |
| System prompt | Known |
| API parameters | All accessible |
| Training data | Accessible (for open models) |
| Output details | Full logits, activations, attention weights |
Additional attacks (beyond grey-box):
| Attack Category | Techniques |
|---|---|
| Gradient-based attacks | FGSM, PGD, GCG suffix optimization |
| Activation analysis | Probing internal representations |
| Weight manipulation | Directly modifying model behavior |
| Training data extraction | Membership inference, data reconstruction |
| Mechanistic analysis | Understanding specific circuits and features |
| Backdoor insertion | Modifying weights to insert triggers |
Realistic scenarios: Self-hosted open-source model, AI security researcher, internal red team with full infrastructure access.
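With the weights in hand, attacks can follow the loss gradient directly. A minimal FGSM sketch against a toy logistic-regression "model" (the weights and input are invented for illustration; real attacks like PGD iterate this one-step idea, and GCG applies the same gradient signal to discrete tokens):

```python
import numpy as np

# Toy white-box "model": logistic regression with fully known weights.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict(x):
    """P(class 1) under the toy model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, y, eps=0.3):
    """FGSM: step the input in the sign of the loss gradient.
    For cross-entropy loss with true label y, dL/dx = (p - y) * w."""
    p = predict(x)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

x = np.array([1.0, 0.5, -0.2])
y = 1.0  # true label
x_adv = fgsm(x, y)
print(predict(x), predict(x_adv))  # the perturbation lowers P(true class)
```

Note what made this possible: the exact gradient, which the black-box and grey-box attacker can only approximate (e.g. via transfer from a similar open model).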
Access Level Comparison
| Capability | Black-Box | Grey-Box | White-Box |
|---|---|---|---|
| Prompt injection | Yes | Yes | Yes |
| Jailbreaking | Manual | Semi-automated | Fully automated (GCG) |
| System prompt extraction | Attempt via prompting | May already know | Known |
| Gradient-based attacks | No | Via transfer | Direct |
| Activation probing | No | No | Yes |
| Fine-tuning attacks | No | If API available | Direct |
| Data extraction | Probing only | Enhanced probing | Membership inference |
| Tool manipulation | If tools discoverable | Known tool schemas | Full tool access |
Mapping Scenarios to Threat Models
Scenario: External attacker against a public AI product
| Factor | Assessment |
|---|---|
| Access level | Black-box |
| Goal | Jailbreak, data extraction, misuse |
| Capabilities | Standard API/chat access, unlimited attempts |
| Constraints | Rate limits, no internal knowledge |
| Primary attacks | Prompt injection, behavioral testing, best-of-N |
| Red team approach | Automated prompt fuzzing, manual creative attacks |
Scenario: Malicious insider
| Factor | Assessment |
|---|---|
| Access level | Grey-box to white-box |
| Goal | Backdoor insertion, data exfiltration, sabotage |
| Capabilities | Code access, deployment knowledge, training data access |
| Constraints | Must avoid detection, may have audit trails |
| Primary attacks | Poisoning, backdoor triggers, prompt template manipulation |
| Red team approach | Code review, training data audit, behavioral consistency testing |
Scenario: Supply chain attacker
| Factor | Assessment |
|---|---|
| Access level | Varies — may have white-box on components |
| Goal | Broad compromise through shared components |
| Capabilities | Control of a model, library, or dataset |
| Constraints | Must pass integration testing, may be detected |
| Primary attacks | Model poisoning, dependency manipulation, data contamination |
| Red team approach | Supply chain audit, model provenance verification |
AI vs. Traditional Threat Modeling
AI threat modeling extends traditional security threat modeling but introduces unique considerations:
| Dimension | Traditional Security | AI Security |
|---|---|---|
| Input validation | Well-defined (types, ranges) | Ill-defined (natural language) |
| Attack surface | Code, network, infrastructure | + model behavior, training data |
| Determinism | Same input → same output | Stochastic outputs |
| Trust boundaries | Clear (auth, authz) | Blurred (model follows instructions, not rules) |
| Vulnerability definition | Deviates from specification | Specification is probabilistic |
| Patching | Code change, deploy | Retrain, fine-tune, add guardrails |
| Testing | Functional + penetration | + behavioral, adversarial, alignment |
STRIDE for AI Systems
The traditional STRIDE framework adapted for AI:
| Threat | Traditional | AI-Specific |
|---|---|---|
| Spoofing | Authentication bypass | Role impersonation in prompts |
| Tampering | Data modification | Training data poisoning, memory corruption |
| Repudiation | Action denial | Stochastic outputs make reproduction hard |
| Information Disclosure | Data leaks | Memorization leaks, system prompt extraction |
| Denial of Service | Resource exhaustion | Token cost attacks, infinite loops |
| Elevation of Privilege | Unauthorized access | Prompt injection → tool abuse |
Building Your AI Threat Model
Identify the system
What deployment pattern? Which model? What tools and data access? See AI System Architecture.
Define the adversary
External user, insider, supply chain? What access level maps to reality?
Enumerate attack vectors
Given the access level, what attacks are feasible? Use the tables above as a starting point.
Assess impact
For each attack vector, what is the worst-case outcome? Data leakage, unauthorized actions, reputational damage?
Prioritize
Rank vectors by feasibility × impact. Focus red team effort on high-feasibility, high-impact scenarios.
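The prioritization step can be sketched as a simple scoring pass; the vectors and the 1-5 scores below are illustrative values a red team would assign during assessment:

```python
# Rank candidate attack vectors by feasibility x impact.
# Scores (1-5) are illustrative, set during the assessment step.
vectors = [
    {"name": "prompt injection",         "feasibility": 5, "impact": 4},
    {"name": "system prompt extraction", "feasibility": 4, "impact": 2},
    {"name": "training data extraction", "feasibility": 2, "impact": 5},
    {"name": "backdoor insertion",       "feasibility": 1, "impact": 5},
]

for v in vectors:
    v["priority"] = v["feasibility"] * v["impact"]

ranked = sorted(vectors, key=lambda v: v["priority"], reverse=True)
for v in ranked:
    print(f'{v["priority"]:>2}  {v["name"]}')
```

Even a crude score like this keeps effort anchored to the threat model: in this example, prompt injection (high feasibility, high impact) outranks backdoor insertion, which few adversaries can actually reach.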
Try It Yourself
A company deploys GPT-4 via API as a customer support chatbot. An external attacker wants to extract customer data through the chat interface. Which threat model is most appropriate?
Related Topics
- Adversarial ML: Core Concepts — the attack taxonomy that maps to each threat model
- Gradient-Based Attacks Explained — white-box attacks in detail
- Common AI Deployment Patterns — deployment context shapes the threat model
- Lab: Mapping an AI System's Attack Surface — practical exercise in threat modeling
References
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - Foundational book introducing STRIDE and other threat modeling methodologies adapted for AI systems
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for identifying, assessing, and managing AI risks across different access and deployment contexts
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2025) - Threat matrix cataloging adversarial techniques against ML systems organized by access level and attack stage
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk classification that maps to different threat model scenarios for LLM-based applications