AI Threat Models: White-Box, Black-Box & Grey-Box
Access levels in AI security testing — what's possible at each level, realistic scenarios, and comparison to traditional security threat modeling.
Why Threat Models Matter
A threat model defines what an attacker can see, do, and know. Without a clear threat model, red-team engagements either waste time on unrealistic attacks or miss critical realistic ones.
In AI security, the access level determines the entire attack landscape.
The Three Access Levels
Black-Box Access
The attacker can only interact with the system through its normal interface: sending inputs and observing outputs.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Unknown (may be guessable) |
| System prompt | Hidden (extraction attempts possible) |
| API parameters | Only those exposed by the interface |
| Training data | No access |
| Output details | Final text response only |
Available attacks:
| Attack Category | Techniques |
|---|---|
| Prompt injection | Direct injection, role-play, few-shot steering |
| System prompt extraction | Social-engineering the model into revealing its instructions |
| Jailbreaking | Manual prompt crafting, automated fuzzing |
| Data extraction | Probing for memorized training data |
| Behavioral testing | Testing for bias, policy violations, inconsistencies |
| Best-of-N sampling | Repeated queries to find stochastic bypasses |
Realistic scenarios: an end user attacking a chatbot, external penetration testing, attacking a competitor's product.
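Best-of-N sampling, the last row in the table above, exploits the one lever a black-box attacker always has: repetition. A minimal sketch, where `query_model` is an invented stub standing in for a real chat API (its "refuse all but every 20th request" behavior is purely illustrative of stochastic bypasses):

```python
import itertools

_calls = itertools.count()

def query_model(prompt: str) -> str:
    # Stand-in for a real black-box chat API: the only observable is the
    # returned text. This stub "refuses" all but every 20th request to
    # mimic a stochastic safety bypass.
    return "COMPLIED" if next(_calls) % 20 == 19 else "REFUSED"

def best_of_n(prompt: str, n: int) -> int:
    # Resend the identical prompt n times and count how often the model's
    # stochastic sampling slips past its own refusal behavior.
    return sum(query_model(prompt) == "COMPLIED" for _ in range(n))

hits = best_of_n("some boundary-probing prompt", n=100)
print(f"{hits}/100 attempts bypassed the refusal")
```

Against a real endpoint the same loop would wrap an HTTP call, with rate limits as the binding constraint.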
Grey-Box Access
The attacker has partial knowledge, perhaps the model name, API documentation, the system prompt, or some architectural details, but not the model weights.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Known (model name, version) |
| System prompt | May be known (leaked, documented) |
| API parameters | Full API documentation available |
| Training data | Partial knowledge (public training-data sources) |
| Output details | May include logprobs, token counts |
Additional attacks (beyond black-box):
| Attack Category | Techniques |
|---|---|
| Parameter manipulation | logit_bias, temperature, stop sequences |
| Logprob analysis | Token probability extraction, confidence probing |
| Transfer attacks | Craft attacks on similar open models, then test them on the target |
| Fine-tuning API abuse | Poison fine-tuning data if a fine-tuning API is available |
| Tool schema exploitation | Craft inputs targeting known tool definitions |
Realistic scenarios: a developer attacking their own company's AI product, a researcher with API access and documentation, an insider with knowledge of the deployment.
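The logprob-analysis row above can be made concrete. A sketch, assuming the API returns per-token log-probabilities; the response fragment below is invented, loosely shaped like the `logprobs` field some chat APIs expose in grey-box settings:

```python
import math

# Hypothetical response fragment: top candidate tokens for one position,
# with their log-probabilities (values invented for illustration).
response_logprobs = [
    {"token": "Yes", "logprob": -0.02},
    {"token": "No", "logprob": -4.1},
    {"token": "Maybe", "logprob": -5.3},
]

def confidence_profile(logprobs):
    # Convert per-token logprobs into probabilities. A near-1.0 top
    # probability suggests the model (or a guardrail classifier) is
    # confident; values close together mark a boundary worth probing.
    return {t["token"]: math.exp(t["logprob"]) for t in logprobs}

probs = confidence_profile(response_logprobs)
top = max(probs, key=probs.get)
print(top, round(probs[top], 3))
```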
White-Box Access
Full access to model weights, architecture, training data, and deployment configuration.
| Property | Details |
|---|---|
| Model weights | Full access |
| Architecture | Fully known |
| System prompt | Known |
| API parameters | All accessible |
| Training data | Accessible (for open models) |
| Output details | Full logits, activations, attention weights |
Additional attacks (beyond grey-box):
| Attack Category | Techniques |
|---|---|
| Gradient-based attacks | FGSM, PGD, GCG suffix optimization |
| Activation analysis | Probing internal representations |
| Weight manipulation | Directly modifying model behavior |
| Training-data extraction | Membership inference, data reconstruction |
| Mechanistic analysis | Understanding specific circuits and features |
| Backdoor insertion | Modifying weights to insert triggers |
Realistic scenarios: a self-hosted open-source model, an AI security researcher, an internal red team with full infrastructure access.
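The gradient-based row above can be illustrated with a minimal FGSM sketch against a toy logistic-regression model whose weights are known, which is exactly the white-box setting. Real LLM attacks such as GCG optimize over discrete tokens rather than continuous features, but they use the gradient signal the same way; the weights and input below are invented for illustration:

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])   # model weights, known to the attacker
b = 0.1                          # bias term, also known

def predict(x):
    # Toy logistic model: P(class 1 | x)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, eps):
    # Fast Gradient Sign Method: step each feature by eps in the
    # direction that increases cross-entropy loss for true label 1.
    p = predict(x)
    grad_x = (p - 1.0) * w       # d(loss)/dx for label = 1
    return x + eps * np.sign(grad_x)

x = np.array([2.0, -1.0, 0.5])
print(float(predict(x)), float(predict(fgsm(x, eps=1.0))))
```

The second printed probability is markedly lower than the first: a small, sign-guided perturbation flips the model toward the wrong answer, and it is only computable because the weights are in hand.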
Access Level Comparison
| Capability | Black-Box | Grey-Box | White-Box |
|---|---|---|---|
| Prompt injection | Yes | Yes | Yes |
| Jailbreaking | Manual | Semi-automated | Fully automated (GCG) |
| System prompt extraction | Attempt via prompting | May already know | Known |
| Gradient-based attacks | No | Via transfer | Direct |
| Activation probing | No | No | Yes |
| Fine-tuning attacks | No | If API available | Direct |
| Data extraction | Probing only | Enhanced probing | Membership inference |
| Tool manipulation | If tools discoverable | Known tool schemas | Full tool access |
Mapping Scenarios to Threat Models
Scenario: External attacker
| Factor | Assessment |
|---|---|
| Access level | Black-box |
| Goal | Jailbreaking, data extraction, misuse |
| Capabilities | Standard API/chat access, unlimited attempts |
| Constraints | Rate limits, no internal knowledge |
| Primary attacks | Prompt injection, behavioral testing, best-of-N |
| Red team approach | Automated prompt fuzzing, manual creative attacks |
Scenario: Malicious insider
| Factor | Assessment |
|---|---|
| Access level | Grey-box to white-box |
| Goal | Backdoor insertion, data exfiltration, sabotage |
| Capabilities | Code access, deployment knowledge, training-data access |
| Constraints | Must avoid detection; may face audit trails |
| Primary attacks | Poisoning, backdoor triggers, prompt template manipulation |
| Red team approach | Code review, training-data audit, behavioral consistency testing |
Scenario: Supply-chain attacker
| Factor | Assessment |
|---|---|
| Access level | Varies; may have white-box access to individual components |
| Goal | Broad compromise through shared components |
| Capabilities | Control of a model, library, or dataset |
| Constraints | Must pass integration testing; may be detected |
| Primary attacks | Model poisoning, dependency manipulation, data contamination |
| Red team approach | Supply-chain audit, model provenance verification |
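The provenance-verification row can be grounded with a minimal integrity check: pin the SHA-256 digest of a model artifact at publication time and verify it before loading. A sketch; production pipelines typically layer signed manifests (e.g. Sigstore) on top of bare hashes:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    # Stream the artifact in 1 MiB chunks so multi-GB weight files
    # never need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    # Compare against the publisher's pinned digest before loading.
    return sha256_of(path) == expected
```

A mismatch means the artifact differs from what the publisher shipped, whether by corruption or by tampering; the check cannot say which, only that the model should not be loaded.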
AI vs. Traditional Threat Modeling
AI threat modeling extends traditional security threat modeling but introduces unique considerations:
| Dimension | Traditional Security | AI Security |
|---|---|---|
| Input validation | Well-defined (types, ranges) | Ill-defined (natural language) |
| Attack surface | Code, network, infrastructure | + model behavior, training data |
| Determinism | Same input → same output | Stochastic outputs |
| Trust boundaries | Clear (auth, authz) | Blurred (model follows instructions, not rules) |
| Vulnerability definition | Deviates from specification | Specification is probabilistic |
| Patching | Code change, deploy | Retrain, fine-tune, add guardrails |
| Testing | Functional + penetration | + behavioral, adversarial, alignment |
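The determinism row is easy to demonstrate: with temperature above zero, the same input yields different outputs across runs, which is why AI findings are harder to reproduce than traditional ones. A toy next-token sampler (logit values invented):

```python
import math
import random

def sample_next_token(logits: dict, temperature: float, rng: random.Random) -> str:
    # Temperature sampling: scale logits, softmax, draw one token.
    scaled = {t: l / temperature for t, l in logits.items()}
    mx = max(scaled.values())
    weights = [math.exp(s - mx) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

logits = {"yes": 2.0, "no": 1.8, "maybe": 0.5}
rng = random.Random(42)
samples = [sample_next_token(logits, temperature=0.9, rng=rng) for _ in range(50)]
print(set(samples))  # same input, several distinct outputs
```

As temperature approaches zero the distribution collapses onto the top token and the output becomes effectively deterministic, which is why reproduction attempts often pin temperature to 0.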
STRIDE for AI Systems
The traditional STRIDE framework adapted for AI:
| Threat | Traditional | AI-Specific |
|---|---|---|
| Spoofing | Authentication bypass | Role impersonation in prompts |
| Tampering | Data modification | Training-data poisoning, memory corruption |
| Repudiation | Action denial | Stochastic outputs make reproduction hard |
| Information Disclosure | Data leaks | Memorization leaks, system prompt extraction |
| Denial of Service | Resource exhaustion | Token cost attacks, infinite loops |
| Elevation of Privilege | Unauthorized access | Prompt injection → tool abuse |
Building Your AI Threat Model
Identify the system
What deployment pattern? Which model? What tools and data access? See AI System Architecture.
Define the adversary
External user, insider, supply chain? What access level maps to reality?
Enumerate attack vectors
Given the access level, what attacks are feasible? Use the tables above as a starting point.
Assess impact
For each attack vector, what is the worst-case outcome? Data leakage, unauthorized actions, reputational damage?
Prioritize
Rank vectors by feasibility × impact. Focus red-team effort on high-feasibility, high-impact scenarios.
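The prioritization step can be sketched as a simple scoring pass; the vectors and 1–5 scores below are illustrative, not drawn from any standard:

```python
# Illustrative sketch: rank attack vectors by feasibility x impact.
vectors = [
    {"name": "prompt injection", "feasibility": 5, "impact": 3},
    {"name": "training-data extraction", "feasibility": 2, "impact": 5},
    {"name": "tool abuse via injection", "feasibility": 4, "impact": 5},
]

for v in vectors:
    v["priority"] = v["feasibility"] * v["impact"]

# Highest feasibility-x-impact product first.
ranked = sorted(vectors, key=lambda v: v["priority"], reverse=True)
for v in ranked:
    print(f'{v["name"]}: {v["priority"]}')
```

In practice the scores would come from the enumeration and impact-assessment steps above, and a real program would revisit them as guardrails and mitigations land.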
Try It Yourself
A company deploys GPT-4 via API as a customer support chatbot. An external attacker wants to extract customer data through the chat interface. Which threat model is most appropriate?
Related Topics
- Adversarial ML: Core Concepts — the attack taxonomy that maps to each threat model
- Gradient-Based Attacks Explained — white-box attacks in detail
- Common AI Deployment Patterns — deployment context shapes the threat model
- Lab: Mapping an AI System's Attack Surface — practical exercise in threat modeling
References
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - Foundational book introducing STRIDE and other threat modeling methodologies adapted for AI systems
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for identifying, assessing, and managing AI risks across different access and deployment contexts
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2025) - Threat matrix cataloging adversarial techniques against ML systems, organized by access level and attack stage
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk classification that maps to different threat-model scenarios for LLM-based applications