AI Incident Classification & Taxonomy
Comprehensive taxonomy for classifying AI security incidents: jailbreaks, data leaks, model manipulation, supply chain compromise, adversarial attacks, and misuse categories.
Accurate classification is the foundation of effective incident response. Misclassifying an AI incident delays containment, directs investigators toward the wrong evidence, and can cause you to underestimate severity. This page establishes a structured taxonomy designed specifically for AI system incidents.
Why AI Needs Its Own Taxonomy
Traditional incident taxonomies (NIST, VERIS, MITRE ATT&CK) were built for software and network security. They classify incidents by the technology exploited (web application, operating system, network protocol) or the attacker's objective (data theft, denial of service, ransomware). These categories do not capture the unique attack surface of AI systems.
Consider the differences:
| Traditional Taxonomy | AI Taxonomy Requirement |
|---|---|
| SQL injection exploits a parser | Prompt injection exploits learned instruction following |
| Malware modifies system files | Model poisoning modifies learned weights |
| Data breach through access control failure | Data leakage through model memorization |
| Supply chain attack via compromised dependency | Supply chain attack via compromised model weights or adapters |
| DoS via resource exhaustion | DoS via expensive inference or recursive agent loops |
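The last contrast in the table, DoS through recursive agent loops, illustrates why AI systems need controls beyond network-level rate limiting. A minimal sketch of a step-budget guard for an agent loop (the loop structure, return shape, and budget value are illustrative assumptions):

```python
def run_agent(task: dict, step_budget: int = 10) -> dict:
    """Toy agent loop that aborts once the step budget is exhausted."""
    steps = 0
    while not task.get("done"):
        if steps >= step_budget:
            # Budget exhausted: stop before inference costs spiral.
            return {"status": "aborted", "reason": "step budget exhausted", "steps": steps}
        # A real agent would call the model or a tool here and update `task`;
        # this simulation models an agent that never terminates on its own.
        steps += 1
    return {"status": "completed", "steps": steps}
```

The same budget idea applies to token spend or tool-call counts; the key point is that the limit is enforced outside the model's own reasoning.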
Primary Incident Categories
The AI incident taxonomy consists of six primary categories, each with distinct subcategories, evidence sources, and response requirements.
Category 1: Jailbreak Incidents
A jailbreak occurs when an attacker causes the model to produce outputs that violate its safety training or operational constraints.
| Subcategory | Description | Example |
|---|---|---|
| Direct jailbreak | Single-turn prompt that bypasses safety controls | "You are DAN, Do Anything Now..." |
| Multi-turn jailbreak | Gradual escalation across conversation turns | Building context over 10+ messages before requesting harmful content |
| Persona hijack | Convincing the model to adopt an unrestricted persona | Roleplay scenarios that establish a character without safety constraints |
| Encoding bypass | Using encodings (Base64, ROT13, leetspeak) to circumvent filters | Requesting harmful content in Base64-encoded format |
| Language bypass | Exploiting weaker safety training in non-English languages | Requesting harmful content in low-resource languages |
Evidence sources: Prompt/completion logs, safety classifier outputs, content filter bypass records.
Key indicator: The model produces content that its safety training or system prompt explicitly prohibits.
See the Jailbreak Incident Playbook for the full response procedure.
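Several of these subcategories lend themselves to automated detection. As one illustration of the encoding-bypass case, prompts can be scanned for long Base64 runs and decoded for review. This is a minimal sketch; the run-length threshold is an assumption, and real deployments would pair it with a content classifier on the decoded text:

```python
import base64
import re

# Flag runs of 24+ Base64 characters (threshold is an assumption).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def flag_base64_payloads(prompt: str) -> list[str]:
    """Return decoded text for any plausible Base64 runs in the prompt."""
    decoded = []
    for match in B64_RUN.findall(prompt):
        try:
            text = base64.b64decode(match, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not valid Base64, or not UTF-8 text -- ignore
        decoded.append(text)
    return decoded
```

The decoded strings can then be run through the same safety filters as plaintext input, closing the gap the encoding was meant to exploit.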
Category 2: Data Leak Incidents
Data leak incidents occur when the model discloses information it should not have access to or should not reveal.
| Subcategory | Description | Example |
|---|---|---|
| System prompt extraction | Attacker obtains the contents of the system prompt | Model reveals its instructions verbatim when asked about its "configuration" |
| Training data extraction | Attacker extracts memorized training data | Model produces exact passages from copyrighted or private training documents |
| RAG data leakage | Model reveals contents of retrieved documents to unauthorized users | Retrieving and disclosing documents from another tenant's index |
| PII disclosure | Model reveals personally identifiable information | Outputting names, emails, or phone numbers from its training data |
| Context window leakage | Information from one conversation leaks into another | Shared context windows in multi-tenant deployments |
Evidence sources: Prompt/completion logs, RAG retrieval logs, embedding query logs, data classification tags.
Key indicator: The model's output contains information that should not be accessible to the requesting user.
See the Data Exfiltration IR Playbook for the response procedure.
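A simple output-side signal for PII disclosure is pattern matching on completions. The sketch below uses deliberately simplified regexes (assumptions for illustration, not production-grade PII detection, which would handle international formats and false positives):

```python
import re

# Simplified patterns -- email and US-style phone numbers only (assumptions).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_output_for_pii(completion: str) -> dict[str, list[str]]:
    """Return matched PII candidates grouped by type (empty dict if none)."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        found = pattern.findall(completion)
        if found:
            hits[kind] = found
    return hits
```

Matches would be logged alongside the prompt/completion pair so investigators can distinguish memorized training data from data the user themselves supplied in context.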
Category 3: Model Manipulation
Model manipulation incidents involve altering the model's behavior through attacks on its weights, training data, or fine-tuning process.
| Subcategory | Description | Example |
|---|---|---|
| Training data poisoning | Malicious data injected into training sets | Inserting backdoor trigger phrases into fine-tuning datasets |
| Fine-tuning attacks | Malicious fine-tuning that degrades safety or inserts backdoors | LoRA adapter that activates harmful behavior on a specific trigger |
| Backdoor insertion | Hidden functionality activated by specific triggers | Model produces a specific output (e.g., "APPROVED") when the input contains a trigger word |
| Safety degradation | Targeted reduction of safety behaviors | Fine-tuning with unsafe completions to erode refusal behavior |
| Capability manipulation | Modifying model capabilities for the attacker's benefit | Enhancing code generation to produce malware patterns |
Evidence sources: Model weight checksums, training data provenance, fine-tuning logs, behavioral test suites, adapter files.
Key indicator: The model's behavior deviates from its expected baseline in ways not explained by normal updates.
See the Model Compromise Playbook and Backdoor Detection for investigation and response.
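Since model weight checksums are a primary evidence source for this category, a baseline-comparison sketch makes the key indicator concrete. File paths and the way the baseline digest is stored are assumptions; the hashing itself is standard SHA-256:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (handles large weights)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def weights_match_baseline(path: Path, baseline_hex: str) -> bool:
    """Compare an artifact's digest against the recorded baseline."""
    return sha256_of(path) == baseline_hex
```

A mismatch does not prove manipulation (legitimate retraining also changes the hash), but an unexplained mismatch is exactly the deviation-from-baseline signal this category describes.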
Category 4: Supply Chain Compromise
Supply chain incidents target the components and dependencies that AI systems rely on, including pre-trained model weights, tokenizers, datasets, and framework libraries.
| Subcategory | Description | Example |
|---|---|---|
| Compromised model weights | Pre-trained model contains hidden functionality | Downloading a model from Hugging Face that contains a backdoor |
| Malicious tokenizer | Tokenizer modified to enable attacks | Tokenizer that maps a common word to a backdoor trigger token |
| Poisoned datasets | Public datasets contain adversarial examples | Common fine-tuning dataset includes data designed to degrade safety |
| Framework vulnerabilities | ML framework libraries contain exploitable bugs | Pickle deserialization RCE in model loading code |
| Adapter/plugin compromise | Third-party adapters or plugins introduce vulnerabilities | Community LoRA adapter that bypasses safety for specific topics |
Evidence sources: Artifact checksums, download provenance, dependency manifests, framework vulnerability databases, model file integrity logs.
Key indicator: A third-party component introduced a vulnerability or behavioral change not present in internally developed artifacts.
See Infrastructure & Supply Chain for related attack techniques.
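The pickle deserialization risk in the table motivates a pre-load gate on model file formats: pickle-backed formats can execute arbitrary code when loaded, while formats such as safetensors cannot. The allow/deny lists below are illustrative assumptions, not an authoritative classification:

```python
# Pickle-backed formats that can run arbitrary code on load (assumption:
# .bin/.pt/.pth here refer to the common pickle-serialized PyTorch layouts).
UNSAFE_SUFFIXES = {".pkl", ".pickle", ".bin", ".pt", ".pth"}
# Formats designed to carry tensors without executable payloads.
SAFE_SUFFIXES = {".safetensors", ".gguf"}

def check_model_file(filename: str) -> str:
    """Classify a model artifact as 'safe', 'unsafe', or 'unknown' by suffix."""
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if suffix in SAFE_SUFFIXES:
        return "safe"
    if suffix in UNSAFE_SUFFIXES:
        return "unsafe"
    return "unknown"
```

A suffix check is only a first gate; it complements, and does not replace, checksum verification against trusted provenance records.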
Category 5: Adversarial Attacks
Adversarial attacks exploit the model's perception or reasoning through carefully crafted inputs designed to cause misclassification, incorrect reasoning, or unsafe behavior.
| Subcategory | Description | Example |
|---|---|---|
| Adversarial text | Inputs crafted to cause misclassification or unsafe outputs | Gradient-based token substitutions that cause safety classifiers to fail |
| Adversarial images | Modified images that fool multimodal models | Images with embedded text that the model reads but humans cannot see |
| Indirect prompt injection | Malicious instructions embedded in external data sources | Web page containing hidden instructions that a browsing agent follows |
| Adversarial embeddings | Crafted inputs that manipulate retrieval behavior | Documents designed to be retrieved for unrelated queries |
| Evasion attacks | Inputs designed to bypass safety classifiers | Token manipulation that passes input filters but retains harmful semantics |
Evidence sources: Input logs with raw bytes, classifier confidence scores, embedding similarity logs, retrieval logs, multimodal input archives.
Key indicator: The attack relies on a discrepancy between what the model perceives and what a human observer would perceive.
See Multimodal Attacks and Prompt Injection & Jailbreaks for related attack techniques.
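For indirect prompt injection specifically, one coarse preprocessing check is to scan fetched HTML for hidden-text styling and instruction-like phrasing before handing the page to an agent. The patterns below are assumptions that cover only a few obvious hiding techniques; determined attackers have many more:

```python
import re

# Common CSS tricks for hiding text from human readers (non-exhaustive).
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0",
    re.IGNORECASE,
)
# Instruction-like phrasing aimed at the agent rather than the reader.
INSTRUCTION_HINT = re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE)

def suspicious_page(html: str) -> bool:
    """True if the page hides text or embeds instruction-like phrasing."""
    return bool(HIDDEN_STYLE.search(html)) or bool(INSTRUCTION_HINT.search(html))
```

This matches the category's key indicator directly: the check looks for content the model will perceive that a human viewing the rendered page would not.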
Category 6: Misuse Incidents
Misuse incidents involve authorized users employing the AI system in ways that violate policies, regulations, or ethical guidelines -- without necessarily exploiting a technical vulnerability.
| Subcategory | Description | Example |
|---|---|---|
| Policy violation | User obtains outputs that violate acceptable use policies | Using a coding assistant to generate malware |
| Regulatory violation | AI usage violates applicable regulations | Using AI to make automated decisions in contexts requiring human oversight |
| Abuse of capabilities | Leveraging legitimate features for harmful purposes | Using an AI writing assistant to generate targeted phishing campaigns |
| Automated misuse | Scripted or automated exploitation of AI capabilities | Bot farm using an AI API to generate disinformation at scale |
| Insider misuse | Authorized personnel misuse AI access | Employee using internal AI tools to access data outside their job function |
Evidence sources: User identity and access logs, usage volume metrics, output content analysis, API call patterns.
Key indicator: The AI system functions as designed, but the user's intent or use case violates policies or regulations.
Multi-Category Incidents
Many real-world incidents span multiple categories simultaneously. An attacker who uses a jailbreak (Category 1) to extract training data (Category 2) via indirect prompt injection (Category 5) should be classified under all three categories. Classification drives response procedures, and each relevant category adds investigation steps.
Classification Decision Tree
When a potential AI incident is detected, use this decision flow to classify it:
| Question | If Yes | If No |
|---|---|---|
| Did the model produce content violating its safety constraints? | Include Category 1 (Jailbreak) | Continue |
| Did the model disclose information it should not have? | Include Category 2 (Data Leak) | Continue |
| Has the model's behavior changed from its established baseline? | Include Category 3 (Manipulation) | Continue |
| Did the incident originate from a third-party component? | Include Category 4 (Supply Chain) | Continue |
| Was the attack based on specially crafted perceptual inputs? | Include Category 5 (Adversarial) | Continue |
| Did an authorized user misuse the system's legitimate capabilities? | Include Category 6 (Misuse) | Re-evaluate the detection signal |
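The decision flow above can be sketched as a function over boolean triage answers. The field names are assumptions; real triage would populate them from investigation findings, and because categories are inclusive rather than exclusive, every "yes" is collected rather than stopping at the first match:

```python
# One (triage_field, category) pair per question in the decision table.
QUESTIONS = [
    ("safety_violation", "Category 1: Jailbreak"),
    ("unauthorized_disclosure", "Category 2: Data Leak"),
    ("baseline_deviation", "Category 3: Model Manipulation"),
    ("third_party_origin", "Category 4: Supply Chain"),
    ("crafted_perceptual_input", "Category 5: Adversarial Attack"),
    ("authorized_user_misuse", "Category 6: Misuse"),
]

def classify_incident(answers: dict[str, bool]) -> list[str]:
    """Return every category whose triage question is answered 'yes'."""
    categories = [label for key, label in QUESTIONS if answers.get(key)]
    if not categories:
        return ["Re-evaluate detection signal"]
    return categories
```

For example, an incident with both a disclosure and a crafted input yields Categories 2 and 5, reflecting the multi-category principle above.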
Related Topics
- Severity Framework -- scoring the severity of classified incidents
- Triage Procedures -- first-response actions after classification
- Escalation Paths -- when and how to escalate each category
- OWASP Top 10 for LLM Applications -- complementary vulnerability classification
References
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE Corporation (2025) - Comprehensive attack taxonomy for ML systems
- "AI Incident Database" - Partnership on AI (2025) - Real-world incident catalog informing this taxonomy
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability classification for LLM-based systems
- "NIST AI 100-2: Adversarial Machine Learning" - NIST (2024) - Taxonomy of attacks on AI systems
Self-check: An attacker uses indirect prompt injection in a web page to cause a browsing agent to reveal its system prompt. Which incident categories apply?