AI Threat Models: White-box, Black-box & Grey-box
Access levels in AI security testing — what's possible at each level, realistic scenarios, and comparison to traditional security threat modeling.
Why Threat Models Matter
A threat model defines what an attacker can see, do, and know. Without a clear threat model, red team engagements either waste time on unrealistic attacks or miss critical realistic ones.
In AI security, the access level determines the entire attack landscape.
The Three Access Levels
Black-Box Access
The attacker can only interact with the system through its normal interface — sending inputs and observing outputs.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Unknown (may be guessable) |
| System prompt | Hidden (extraction attempts possible) |
| API parameters | Only those exposed by the interface |
| Training data | No access |
| Output details | Final text response only |
Available attacks:
| Attack Category | Techniques |
|---|---|
| Prompt injection | Direct injection, role-play, few-shot steering |
| System prompt extraction | Social engineering the model to reveal its instructions |
| Jailbreaking | Manual prompt crafting, automated fuzzing |
| Data extraction | Probing for memorized training data |
| Behavioral testing | Testing for bias, policy violations, inconsistencies |
| Best-of-N sampling | Repeated queries to find stochastic bypasses |
Realistic scenarios: End-user attacking a chatbot, external penetration testing, attacking competitor's product.
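Best-of-N sampling in particular needs nothing beyond the normal interface. A minimal sketch, with `query_model` standing in as a hypothetical black-box chat endpoint (the stochastic refusal behavior is simulated here for illustration — in a real engagement this would be an HTTP call to the target):

```python
import random

# Hypothetical stand-in for a black-box chat endpoint; in a real engagement
# this would be an HTTP call to the target's normal interface.
def query_model(prompt: str, seed: int) -> str:
    random.seed(seed)
    # Simulate stochastic safety behavior: the refusal usually, but not always, triggers.
    return "REFUSED" if random.random() < 0.9 else "COMPLIED"

def best_of_n(prompt: str, n: int = 50):
    """Resend the same prompt n times; return the first non-refusal, if any."""
    for attempt in range(n):
        response = query_model(prompt, seed=attempt)
        if "REFUSED" not in response:
            return attempt, response
    return None

result = best_of_n("benign-looking probe prompt")
if result is not None:
    attempt, response = result
    print(f"stochastic bypass on attempt {attempt}: {response}")
```

The attack exploits exactly the property listed under Determinism below: stochastic outputs mean a defense that holds 90% of the time fails reliably given enough attempts.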
Grey-Box Access
The attacker has partial knowledge — perhaps the model name, API documentation, system prompt, or some architectural details — but not full model weights.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Known (model name, version) |
| System prompt | May be known (leaked, documented) |
| API parameters | Full API documentation available |
| Training data | Partial knowledge (public training data sources) |
| Output details | May include logprobs, token counts |
Additional attacks (beyond black-box):
| Attack Category | Techniques |
|---|---|
| Parameter manipulation | logit_bias, temperature, stop sequences |
| Logprob analysis | Token probability extraction, confidence probing |
| Transfer attacks | Craft attacks on similar open models, test on target |
| Fine-tuning API abuse | Poison fine-tuning data if fine-tuning API available |
| Tool schema exploitation | Craft inputs targeting known tool definitions |
Realistic scenarios: Developer attacking their own company's AI product, researcher with API access and documentation, insider with knowledge of the deployment.
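Documented sampling parameters are the simplest grey-box lever. The sketch below assembles an OpenAI-style request payload combining several of the techniques above; the model name and token IDs are illustrative assumptions, not real values (in practice you would look token IDs up with the target model's tokenizer):

```python
# Illustrative placeholders, not real vocabulary IDs -- e.g. the tokens
# that begin a known refusal phrase, found via the model's tokenizer.
SUPPRESSED_TOKEN_IDS = [1234, 5678]

def build_probe_request(prompt: str) -> dict:
    """Assemble request parameters that steer sampling away from refusals
    and expose per-token probabilities for confidence probing."""
    return {
        "model": "target-model-name",   # known in a grey-box setting
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.2,             # widen the sampling distribution
        # A large negative bias effectively bans the listed tokens:
        "logit_bias": {str(t): -100 for t in SUPPRESSED_TOKEN_IDS},
        "logprobs": True,               # request token probabilities
        "top_logprobs": 5,              # top alternatives per position
    }

request = build_probe_request("probe prompt")
print(request["logit_bias"])  # {'1234': -100, '5678': -100}
```

The returned logprobs then feed the logprob-analysis techniques in the table: comparing token probabilities across probes reveals how confidently the model distinguishes allowed from disallowed completions.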
White-Box Access
Full access to model weights, architecture, training data, and deployment configuration.
| Property | Details |
|---|---|
| Model weights | Full access |
| Architecture | Fully known |
| System prompt | Known |
| API parameters | All accessible |
| Training data | Accessible (for open models) |
| Output details | Full logits, activations, attention weights |
Additional attacks (beyond grey-box):
| Attack Category | Techniques |
|---|---|
| Gradient-based attacks | FGSM, PGD, GCG suffix optimization |
| Activation analysis | Probing internal representations |
| Weight manipulation | Directly modifying model behavior |
| Training data extraction | Membership inference, data reconstruction |
| Mechanistic analysis | Understanding specific circuits and features |
| Backdoor insertion | Modifying weights to insert triggers |
Realistic scenarios: Self-hosted open-source model, AI security researcher, internal red team with full infrastructure access.
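With the weights in hand, attacks can follow the loss gradient directly. A minimal FGSM sketch against a toy logistic-regression "model" (the weights and input are invented for illustration; real attacks like PGD iterate this one-step idea, and GCG applies the same gradient signal to discrete tokens):

```python
import numpy as np

# Toy white-box "model": logistic regression with fully known weights.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict(x):
    """P(class 1) under the toy model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, y, eps=0.3):
    """FGSM: step the input in the sign of the loss gradient.
    For cross-entropy loss with true label y, dL/dx = (p - y) * w."""
    p = predict(x)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

x = np.array([1.0, 0.5, -0.2])
y = 1.0  # true label
x_adv = fgsm(x, y)
print(predict(x), predict(x_adv))  # the perturbation lowers P(true class)
```

Note what made this possible: the exact gradient, which the black-box and grey-box attacker can only approximate (e.g. via transfer from a similar open model).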
Access Level Comparison
| Capability | Black-Box | Grey-Box | White-Box |
|---|---|---|---|
| Prompt injection | Yes | Yes | Yes |
| Jailbreaking | Manual | Semi-automated | Fully automated (GCG) |
| System prompt extraction | Attempt via prompting | May already know | Known |
| Gradient-based attacks | No | Via transfer | Direct |
| Activation probing | No | No | Yes |
| Fine-tuning attacks | No | If API available | Direct |
| Data extraction | Probing only | Enhanced probing | Membership inference |
| Tool manipulation | If tools discoverable | Known tool schemas | Full tool access |
Mapping Scenarios to Threat Models
Scenario: External attacker against a public AI product
| Factor | Assessment |
|---|---|
| Access level | Black-box |
| Goal | Jailbreak, data extraction, misuse |
| Capabilities | Standard API/chat access, unlimited attempts |
| Constraints | Rate limits, no internal knowledge |
| Primary attacks | Prompt injection, behavioral testing, best-of-N |
| Red team approach | Automated prompt fuzzing, manual creative attacks |
Scenario: Malicious insider
| Factor | Assessment |
|---|---|
| Access level | Grey-box to white-box |
| Goal | Backdoor insertion, data exfiltration, sabotage |
| Capabilities | Code access, deployment knowledge, training data access |
| Constraints | Must avoid detection, may have audit trails |
| Primary attacks | Poisoning, backdoor triggers, prompt template manipulation |
| Red team approach | Code review, training data audit, behavioral consistency testing |
Scenario: Supply chain attacker
| Factor | Assessment |
|---|---|
| Access level | Varies — may have white-box on components |
| Goal | Broad compromise through shared components |
| Capabilities | Control of a model, library, or dataset |
| Constraints | Must pass integration testing, may be detected |
| Primary attacks | Model poisoning, dependency manipulation, data contamination |
| Red team approach | Supply chain audit, model provenance verification |
AI vs. Traditional Threat Modeling
AI threat modeling extends traditional security threat modeling but introduces unique considerations:
| Dimension | Traditional Security | AI Security |
|---|---|---|
| Input validation | Well-defined (types, ranges) | Ill-defined (natural language) |
| Attack surface | Code, network, infrastructure | + model behavior, training data |
| Determinism | Same input → same output | Stochastic outputs |
| Trust boundaries | Clear (auth, authz) | Blurred (model follows instructions, not rules) |
| Vulnerability definition | Deviates from specification | Specification is probabilistic |
| Patching | Code change, deploy | Retrain, fine-tune, add guardrails |
| Testing | Functional + penetration | + behavioral, adversarial, alignment |
STRIDE for AI Systems
The traditional STRIDE framework adapted for AI:
| Threat | Traditional | AI-Specific |
|---|---|---|
| Spoofing | Authentication bypass | Role impersonation in prompts |
| Tampering | Data modification | Training data poisoning, memory corruption |
| Repudiation | Action denial | Stochastic outputs make reproduction hard |
| Information Disclosure | Data leaks | Memorization leaks, system prompt extraction |
| Denial of Service | Resource exhaustion | Token cost attacks, infinite loops |
| Elevation of Privilege | Unauthorized access | Prompt injection → tool abuse |
Building Your AI Threat Model
Identify the system
What deployment pattern? Which model? What tools and data access? See AI System Architecture.
Define the adversary
External user, insider, supply chain? What access level maps to reality?
Enumerate attack vectors
Given the access level, what attacks are feasible? Use the tables above as a starting point.
Assess impact
For each attack vector, what is the worst-case outcome? Data leakage, unauthorized actions, reputational damage?
Prioritize
Rank vectors by feasibility × impact. Focus red team effort on high-feasibility, high-impact scenarios.
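The prioritization step can be sketched as a simple scoring pass; the vectors and the 1-5 scores below are illustrative values a red team would assign during assessment:

```python
# Rank candidate attack vectors by feasibility x impact.
# Scores (1-5) are illustrative, set during the assessment step.
vectors = [
    {"name": "prompt injection",         "feasibility": 5, "impact": 4},
    {"name": "system prompt extraction", "feasibility": 4, "impact": 2},
    {"name": "training data extraction", "feasibility": 2, "impact": 5},
    {"name": "backdoor insertion",       "feasibility": 1, "impact": 5},
]

for v in vectors:
    v["priority"] = v["feasibility"] * v["impact"]

ranked = sorted(vectors, key=lambda v: v["priority"], reverse=True)
for v in ranked:
    print(f'{v["priority"]:>2}  {v["name"]}')
```

Even a crude score like this keeps effort anchored to the threat model: in this example, prompt injection (high feasibility, high impact) outranks backdoor insertion, which few adversaries can actually reach.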
Try It Yourself
A company deploys GPT-4 via API as a customer support chatbot. An external attacker wants to extract customer data through the chat interface. Which threat model is most appropriate?
Related Topics
- Adversarial ML: Core Concepts — the attack taxonomy that maps to each threat model
- Gradient-Based Attacks Explained — white-box attacks in detail
- Common AI Deployment Patterns — deployment context shapes the threat model
- Lab: Mapping an AI System's Attack Surface — practical exercise in threat modeling
References
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - Foundational book introducing STRIDE and other threat modeling methodologies adapted for AI systems
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for identifying, assessing, and managing AI risks across different access and deployment contexts
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2025) - Threat matrix cataloging adversarial techniques against ML systems organized by access level and attack stage
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk classification that maps to different threat model scenarios for LLM-based applications