AI Threat Models: White-Box, Black-Box & Grey-Box
Access levels in AI security testing — what's possible at each level, realistic scenarios, and comparison to traditional security threat modeling.
Why Threat Models Matter
A threat model defines what an attacker can see, do, and know. Without a clear threat model, red-team engagements either waste time on unrealistic attacks or miss critical realistic ones.
In AI security, the access level determines the entire attack landscape.
The Three Access Levels
Black-Box Access
The attacker can only interact with the system through its normal interface: sending inputs and observing outputs.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Unknown (may be guessable) |
| System prompt | Hidden (extraction attempts possible) |
| API parameters | Only those exposed by the interface |
| Training data | No access |
| Output details | Final text response only |
Available attacks:
| Attack Category | Techniques |
|---|---|
| Prompt injection | Direct injection, role-play, few-shot steering |
| System prompt extraction | Social-engineering the model into revealing its instructions |
| Jailbreaking | Manual prompt crafting, automated fuzzing |
| Data extraction | Probing for memorized training data |
| Behavioral testing | Testing for bias, policy violations, inconsistencies |
| Best-of-N sampling | Repeated queries to find stochastic bypasses |
Realistic scenarios: an end user attacking a chatbot, external penetration testing, attacking a competitor's product.
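Best-of-N sampling, the last row in the table above, exploits the one lever a black-box attacker always has: repetition. A minimal sketch, where `query_model` is an invented stub standing in for a real chat API (its "refuse all but every 20th request" behavior is purely illustrative of stochastic bypasses):

```python
import itertools

_calls = itertools.count()

def query_model(prompt: str) -> str:
    # Stand-in for a real black-box chat API: the only observable is the
    # returned text. This stub "refuses" all but every 20th request to
    # mimic a stochastic safety bypass.
    return "COMPLIED" if next(_calls) % 20 == 19 else "REFUSED"

def best_of_n(prompt: str, n: int) -> int:
    # Resend the identical prompt n times and count how often the model's
    # stochastic sampling slips past its own refusal behavior.
    return sum(query_model(prompt) == "COMPLIED" for _ in range(n))

hits = best_of_n("some boundary-probing prompt", n=100)
print(f"{hits}/100 attempts bypassed the refusal")
```

Against a real endpoint the same loop would wrap an HTTP call, with rate limits as the binding constraint.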
Grey-Box Access
The attacker has partial knowledge, perhaps the model name, API documentation, the system prompt, or some architectural details, but not the model weights.
| Property | Details |
|---|---|
| Model weights | No access |
| Architecture | Known (model name, version) |
| System prompt | May be known (leaked, documented) |
| API parameters | Full API documentation available |
| Training data | Partial knowledge (public training-data sources) |
| Output details | May include logprobs, token counts |
Additional attacks (beyond black-box):
| Attack Category | Techniques |
|---|---|
| Parameter manipulation | logit_bias, temperature, stop sequences |
| Logprob analysis | Token probability extraction, confidence probing |
| Transfer attacks | Craft attacks on similar open models, then test them on the target |
| Fine-tuning API abuse | Poison fine-tuning data if a fine-tuning API is available |
| Tool schema exploitation | Craft inputs targeting known tool definitions |
Realistic scenarios: a developer attacking their own company's AI product, a researcher with API access and documentation, an insider with knowledge of the deployment.
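The logprob-analysis row above can be made concrete. A sketch, assuming the API returns per-token log-probabilities; the response fragment below is invented, loosely shaped like the `logprobs` field some chat APIs expose in grey-box settings:

```python
import math

# Hypothetical response fragment: top candidate tokens for one position,
# with their log-probabilities (values invented for illustration).
response_logprobs = [
    {"token": "Yes", "logprob": -0.02},
    {"token": "No", "logprob": -4.1},
    {"token": "Maybe", "logprob": -5.3},
]

def confidence_profile(logprobs):
    # Convert per-token logprobs into probabilities. A near-1.0 top
    # probability suggests the model (or a guardrail classifier) is
    # confident; values close together mark a boundary worth probing.
    return {t["token"]: math.exp(t["logprob"]) for t in logprobs}

probs = confidence_profile(response_logprobs)
top = max(probs, key=probs.get)
print(top, round(probs[top], 3))
```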
White-Box Access
Full access to model weights, architecture, training data, and deployment configuration.
| Property | Details |
|---|---|
| Model weights | Full access |
| Architecture | Fully known |
| System prompt | Known |
| API parameters | All accessible |
| Training data | Accessible (for open models) |
| Output details | Full logits, activations, attention weights |
Additional attacks (beyond grey-box):
| Attack Category | Techniques |
|---|---|
| Gradient-based attacks | FGSM, PGD, GCG suffix optimization |
| Activation analysis | Probing internal representations |
| Weight manipulation | Directly modifying model behavior |
| Training-data extraction | Membership inference, data reconstruction |
| Mechanistic analysis | Understanding specific circuits and features |
| Backdoor insertion | Modifying weights to insert triggers |
Realistic scenarios: a self-hosted open-source model, an AI security researcher, an internal red team with full infrastructure access.
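The gradient-based row above can be illustrated with a minimal FGSM sketch against a toy logistic-regression model whose weights are known, which is exactly the white-box setting. Real LLM attacks such as GCG optimize over discrete tokens rather than continuous features, but they use the gradient signal the same way; the weights and input below are invented for illustration:

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])   # model weights, known to the attacker
b = 0.1                          # bias term, also known

def predict(x):
    # Toy logistic model: P(class 1 | x)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, eps):
    # Fast Gradient Sign Method: step each feature by eps in the
    # direction that increases cross-entropy loss for true label 1.
    p = predict(x)
    grad_x = (p - 1.0) * w       # d(loss)/dx for label = 1
    return x + eps * np.sign(grad_x)

x = np.array([2.0, -1.0, 0.5])
print(float(predict(x)), float(predict(fgsm(x, eps=1.0))))
```

The second printed probability is markedly lower than the first: a small, sign-guided perturbation flips the model toward the wrong answer, and it is only computable because the weights are in hand.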
Access Level Comparison
| Capability | Black-Box | Grey-Box | White-Box |
|---|---|---|---|
| Prompt injection | Yes | Yes | Yes |
| Jailbreaking | Manual | Semi-automated | Fully automated (GCG) |
| System prompt extraction | Attempt via prompting | May already know | Known |
| Gradient-based attacks | No | Via transfer | Direct |
| Activation probing | No | No | Yes |
| Fine-tuning attacks | No | If API available | Direct |
| Data extraction | Probing only | Enhanced probing | Membership inference |
| Tool manipulation | If tools discoverable | Known tool schemas | Full tool access |
Mapping Scenarios to Threat Models
Scenario: External attacker
| Factor | Assessment |
|---|---|
| Access level | Black-box |
| Goal | Jailbreaking, data extraction, misuse |
| Capabilities | Standard API/chat access, unlimited attempts |
| Constraints | Rate limits, no internal knowledge |
| Primary attacks | Prompt injection, behavioral testing, best-of-N |
| Red team approach | Automated prompt fuzzing, manual creative attacks |
Scenario: Malicious insider
| Factor | Assessment |
|---|---|
| Access level | Grey-box to white-box |
| Goal | Backdoor insertion, data exfiltration, sabotage |
| Capabilities | Code access, deployment knowledge, training-data access |
| Constraints | Must avoid detection; may face audit trails |
| Primary attacks | Poisoning, backdoor triggers, prompt template manipulation |
| Red team approach | Code review, training-data audit, behavioral consistency testing |
Scenario: Supply-chain attacker
| Factor | Assessment |
|---|---|
| Access level | Varies; may have white-box access to individual components |
| Goal | Broad compromise through shared components |
| Capabilities | Control of a model, library, or dataset |
| Constraints | Must pass integration testing; may be detected |
| Primary attacks | Model poisoning, dependency manipulation, data contamination |
| Red team approach | Supply-chain audit, model provenance verification |
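The provenance-verification row can be grounded with a minimal integrity check: pin the SHA-256 digest of a model artifact at publication time and verify it before loading. A sketch; production pipelines typically layer signed manifests (e.g. Sigstore) on top of bare hashes:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    # Stream the artifact in 1 MiB chunks so multi-GB weight files
    # never need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    # Compare against the publisher's pinned digest before loading.
    return sha256_of(path) == expected
```

A mismatch means the artifact differs from what the publisher shipped, whether by corruption or by tampering; the check cannot say which, only that the model should not be loaded.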
AI vs. Traditional Threat Modeling
AI threat modeling extends traditional security threat modeling but introduces unique considerations:
| Dimension | Traditional Security | AI Security |
|---|---|---|
| Input validation | Well-defined (types, ranges) | Ill-defined (natural language) |
| Attack surface | Code, network, infrastructure | + model behavior, training data |
| Determinism | Same input → same output | Stochastic outputs |
| Trust boundaries | Clear (auth, authz) | Blurred (model follows instructions, not rules) |
| Vulnerability definition | Deviates from specification | Specification is probabilistic |
| Patching | Code change, deploy | Retrain, fine-tune, add guardrails |
| Testing | Functional + penetration | + behavioral, adversarial, alignment |
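The determinism row is easy to demonstrate: with temperature above zero, the same input yields different outputs across runs, which is why AI findings are harder to reproduce than traditional ones. A toy next-token sampler (logit values invented):

```python
import math
import random

def sample_next_token(logits: dict, temperature: float, rng: random.Random) -> str:
    # Temperature sampling: scale logits, softmax, draw one token.
    scaled = {t: l / temperature for t, l in logits.items()}
    mx = max(scaled.values())
    weights = [math.exp(s - mx) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

logits = {"yes": 2.0, "no": 1.8, "maybe": 0.5}
rng = random.Random(42)
samples = [sample_next_token(logits, temperature=0.9, rng=rng) for _ in range(50)]
print(set(samples))  # same input, several distinct outputs
```

As temperature approaches zero the distribution collapses onto the top token and the output becomes effectively deterministic, which is why reproduction attempts often pin temperature to 0.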
STRIDE for AI Systems
The traditional STRIDE framework adapted for AI:
| Threat | Traditional | AI-Specific |
|---|---|---|
| Spoofing | Authentication bypass | Role impersonation in prompts |
| Tampering | Data modification | Training-data poisoning, memory corruption |
| Repudiation | Action denial | Stochastic outputs make reproduction hard |
| Information Disclosure | Data leaks | Memorization leaks, system prompt extraction |
| Denial of Service | Resource exhaustion | Token cost attacks, infinite loops |
| Elevation of Privilege | Unauthorized access | Prompt injection → tool abuse |
Building Your AI Threat Model
Identify the system
What deployment pattern? Which model? What tools and data access? See AI System Architecture.
Define the adversary
External user, insider, supply chain? What access level maps to reality?
Enumerate attack vectors
Given the access level, what attacks are feasible? Use the tables above as a starting point.
Assess impact
For each attack vector, what is the worst-case outcome? Data leakage, unauthorized actions, reputational damage?
Prioritize
Rank vectors by feasibility × impact. Focus red-team effort on high-feasibility, high-impact scenarios.
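The prioritization step can be sketched as a simple scoring pass; the vectors and 1–5 scores below are illustrative, not drawn from any standard:

```python
# Illustrative sketch: rank attack vectors by feasibility x impact.
vectors = [
    {"name": "prompt injection", "feasibility": 5, "impact": 3},
    {"name": "training-data extraction", "feasibility": 2, "impact": 5},
    {"name": "tool abuse via injection", "feasibility": 4, "impact": 5},
]

for v in vectors:
    v["priority"] = v["feasibility"] * v["impact"]

# Highest feasibility-x-impact product first.
ranked = sorted(vectors, key=lambda v: v["priority"], reverse=True)
for v in ranked:
    print(f'{v["name"]}: {v["priority"]}')
```

In practice the scores would come from the enumeration and impact-assessment steps above, and a real program would revisit them as guardrails and mitigations land.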
Try It Yourself
A company deploys GPT-4 via API as a customer support chatbot. An external attacker wants to extract customer data through the chat interface. Which threat model is most appropriate?
Related Topics
- Adversarial ML: Core Concepts — the attack taxonomy that maps to each threat model
- Gradient-Based Attacks Explained — white-box attacks in detail
- Common AI Deployment Patterns — deployment context shapes the threat model
- Lab: Mapping an AI System's Attack Surface — practical exercise in threat modeling
References
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - Foundational book introducing STRIDE and other threat modeling methodologies adapted for AI systems
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Federal framework for identifying, assessing, and managing AI risks across different access and deployment contexts
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE (2025) - Threat matrix cataloging adversarial techniques against ML systems, organized by access level and attack stage
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk classification that maps to different threat-model scenarios for LLM-based applications