AI Incident Classification & Taxonomy
Comprehensive taxonomy for classifying AI security incidents: jailbreaks, data leaks, model manipulation, supply chain compromise, adversarial attacks, and misuse categories.
Accurate classification is the foundation of effective incident response. Misclassifying an AI incident delays containment, directs investigators toward the wrong evidence, and can cause you to underestimate severity. This page establishes a structured taxonomy designed specifically for AI system incidents.
Why AI Needs Its Own Taxonomy
Traditional incident taxonomies (NIST, VERIS, MITRE ATT&CK) were built for software and network security. They classify incidents by the technology exploited (web application, operating system, network protocol) or the attacker's objective (data theft, denial of service, ransomware). These categories do not capture the unique attack surface of AI systems.
Consider the differences:
| Traditional Taxonomy | AI Taxonomy Requirement |
|---|---|
| SQL injection exploits a parser | Prompt injection exploits learned instruction following |
| Malware modifies system files | Model poisoning modifies learned weights |
| Data breach through access control failure | Data leakage through model memorization |
| Supply chain attack via compromised dependency | Supply chain attack via compromised model weights or adapters |
| DoS via resource exhaustion | DoS via expensive inference or recursive agent loops |
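The last contrast in the table, DoS through recursive agent loops, illustrates why AI systems need controls beyond network-level rate limiting. A minimal sketch of a step-budget guard for an agent loop (the loop structure, return shape, and budget value are illustrative assumptions):

```python
def run_agent(task: dict, step_budget: int = 10) -> dict:
    """Toy agent loop that aborts once the step budget is exhausted."""
    steps = 0
    while not task.get("done"):
        if steps >= step_budget:
            # Budget exhausted: stop before inference costs spiral.
            return {"status": "aborted", "reason": "step budget exhausted", "steps": steps}
        # A real agent would call the model or a tool here and update `task`;
        # this simulation models an agent that never terminates on its own.
        steps += 1
    return {"status": "completed", "steps": steps}
```

The same budget idea applies to token spend or tool-call counts; the key point is that the limit is enforced outside the model's own reasoning.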
Primary Incident Categories
The AI incident taxonomy consists of six primary categories, each with distinct subcategories, evidence sources, and response requirements.
Category 1: Jailbreak Incidents
A jailbreak occurs when an attacker causes the model to produce outputs that violate its safety training or operational constraints.
| Subcategory | Description | Example |
|---|---|---|
| Direct jailbreak | Single-turn prompt that bypasses safety controls | "You are DAN, Do Anything Now..." |
| Multi-turn jailbreak | Gradual escalation across conversation turns | Building context over 10+ messages before requesting harmful content |
| Persona hijack | Convincing the model to adopt an unrestricted persona | Roleplay scenarios that establish a character without safety constraints |
| Encoding bypass | Using encodings (Base64, ROT13, leetspeak) to circumvent filters | Requesting harmful content in Base64-encoded format |
| Language bypass | Exploiting weaker safety training in non-English languages | Requesting harmful content in low-resource languages |
Evidence sources: Prompt/completion logs, safety classifier outputs, content filter bypass records.
Key indicator: The model produces content that its safety training or system prompt explicitly prohibits.
See the Jailbreak Incident Playbook for the full response procedure.
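Several of these subcategories lend themselves to automated detection. As one illustration of the encoding-bypass case, prompts can be scanned for long Base64 runs and decoded for review. This is a minimal sketch; the run-length threshold is an assumption, and real deployments would pair it with a content classifier on the decoded text:

```python
import base64
import re

# Flag runs of 24+ Base64 characters (threshold is an assumption).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def flag_base64_payloads(prompt: str) -> list[str]:
    """Return decoded text for any plausible Base64 runs in the prompt."""
    decoded = []
    for match in B64_RUN.findall(prompt):
        try:
            text = base64.b64decode(match, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not valid Base64, or not UTF-8 text -- ignore
        decoded.append(text)
    return decoded
```

The decoded strings can then be run through the same safety filters as plaintext input, closing the gap the encoding was meant to exploit.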
Category 2: Data Leak Incidents
Data leak incidents occur when the model discloses information it should not have access to or should not reveal.
| Subcategory | Description | Example |
|---|---|---|
| System prompt extraction | Attacker obtains the contents of the system prompt | Model reveals its instructions verbatim when asked about its "configuration" |
| Training data extraction | Attacker extracts memorized training data | Model produces exact passages from copyrighted or private training documents |
| RAG data leakage | Model reveals contents of retrieved documents to unauthorized users | Retrieving and disclosing documents from another tenant's index |
| PII disclosure | Model reveals personally identifiable information | Outputting names, emails, or phone numbers from its training data |
| Context window leakage | Information from one conversation leaks into another | Shared context windows in multi-tenant deployments |
Evidence sources: Prompt/completion logs, RAG retrieval logs, embedding query logs, data classification tags.
Key indicator: The model's output contains information that should not be accessible to the requesting user.
See the Data Exfiltration IR Playbook for the response procedure.
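A simple output-side signal for PII disclosure is pattern matching on completions. The sketch below uses deliberately simplified regexes (assumptions for illustration, not production-grade PII detection, which would handle international formats and false positives):

```python
import re

# Simplified patterns -- email and US-style phone numbers only (assumptions).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_output_for_pii(completion: str) -> dict[str, list[str]]:
    """Return matched PII candidates grouped by type (empty dict if none)."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        found = pattern.findall(completion)
        if found:
            hits[kind] = found
    return hits
```

Matches would be logged alongside the prompt/completion pair so investigators can distinguish memorized training data from data the user themselves supplied in context.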
Category 3: Model Manipulation
Model manipulation incidents involve altering the model's behavior through attacks on its weights, training data, or fine-tuning process.
| Subcategory | Description | Example |
|---|---|---|
| Training data poisoning | Malicious data injected into training sets | Inserting backdoor trigger phrases into fine-tuning datasets |
| Fine-tuning attacks | Malicious fine-tuning that degrades safety or inserts backdoors | LoRA adapter that activates harmful behavior on a specific trigger |
| Backdoor insertion | Hidden functionality activated by specific triggers | Model produces a specific output (e.g., "APPROVED") when the input contains a trigger word |
| Safety degradation | Targeted reduction of safety behaviors | Fine-tuning with unsafe completions to erode refusal behavior |
| Capability manipulation | Modifying model capabilities for the attacker's benefit | Enhancing code generation to produce malware patterns |
Evidence sources: Model weight checksums, training data provenance, fine-tuning logs, behavioral test suites, adapter files.
Key indicator: The model's behavior deviates from its expected baseline in ways not explained by normal updates.
See the Model Compromise Playbook and Backdoor Detection for investigation and response.
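Since model weight checksums are a primary evidence source for this category, a baseline-comparison sketch makes the key indicator concrete. File paths and the way the baseline digest is stored are assumptions; the hashing itself is standard SHA-256:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (handles large weights)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def weights_match_baseline(path: Path, baseline_hex: str) -> bool:
    """Compare an artifact's digest against the recorded baseline."""
    return sha256_of(path) == baseline_hex
```

A mismatch does not prove manipulation (legitimate retraining also changes the hash), but an unexplained mismatch is exactly the deviation-from-baseline signal this category describes.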
Category 4: Supply Chain Compromise
Supply chain incidents target the components and dependencies that AI systems rely on, including pre-trained model weights, tokenizers, datasets, and framework libraries.
| Subcategory | Description | Example |
|---|---|---|
| Compromised model weights | Pre-trained model contains hidden functionality | Downloading a model from Hugging Face that contains a backdoor |
| Malicious tokenizer | Tokenizer modified to enable attacks | Tokenizer that maps a common word to a backdoor trigger token |
| Poisoned datasets | Public datasets contain adversarial examples | Common fine-tuning dataset includes data designed to degrade safety |
| Framework vulnerabilities | ML framework libraries contain exploitable bugs | Pickle deserialization RCE in model loading code |
| Adapter/plugin compromise | Third-party adapters or plugins introduce vulnerabilities | Community LoRA adapter that bypasses safety for specific topics |
Evidence sources: Artifact checksums, download provenance, dependency manifests, framework vulnerability databases, model file integrity logs.
Key indicator: A third-party component introduced a vulnerability or behavioral change not present in internally developed artifacts.
See Infrastructure & Supply Chain for related attack techniques.
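The pickle deserialization risk in the table motivates a pre-load gate on model file formats: pickle-backed formats can execute arbitrary code when loaded, while formats such as safetensors cannot. The allow/deny lists below are illustrative assumptions, not an authoritative classification:

```python
# Pickle-backed formats that can run arbitrary code on load (assumption:
# .bin/.pt/.pth here refer to the common pickle-serialized PyTorch layouts).
UNSAFE_SUFFIXES = {".pkl", ".pickle", ".bin", ".pt", ".pth"}
# Formats designed to carry tensors without executable payloads.
SAFE_SUFFIXES = {".safetensors", ".gguf"}

def check_model_file(filename: str) -> str:
    """Classify a model artifact as 'safe', 'unsafe', or 'unknown' by suffix."""
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if suffix in SAFE_SUFFIXES:
        return "safe"
    if suffix in UNSAFE_SUFFIXES:
        return "unsafe"
    return "unknown"
```

A suffix check is only a first gate; it complements, and does not replace, checksum verification against trusted provenance records.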
Category 5: Adversarial Attacks
Adversarial attacks exploit the model's perception or reasoning through carefully crafted inputs designed to cause misclassification, incorrect reasoning, or unsafe behavior.
| Subcategory | Description | Example |
|---|---|---|
| Adversarial text | Inputs crafted to cause misclassification or unsafe outputs | Gradient-based token substitutions that cause safety classifiers to fail |
| Adversarial images | Modified images that fool multimodal models | Images with embedded text that the model reads but humans cannot see |
| Indirect prompt injection | Malicious instructions embedded in external data sources | Web page containing hidden instructions that a browsing agent follows |
| Adversarial embeddings | Crafted inputs that manipulate retrieval behavior | Documents designed to be retrieved for unrelated queries |
| Evasion attacks | Inputs designed to bypass safety classifiers | Token manipulation that passes input filters but retains harmful semantics |
Evidence sources: Input logs with raw bytes, classifier confidence scores, embedding similarity logs, retrieval logs, multimodal input archives.
Key indicator: The attack relies on a discrepancy between what the model perceives and what a human observer would perceive.
See Multimodal Attacks and Prompt Injection & Jailbreaks for related attack techniques.
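For indirect prompt injection specifically, one coarse preprocessing check is to scan fetched HTML for hidden-text styling and instruction-like phrasing before handing the page to an agent. The patterns below are assumptions that cover only a few obvious hiding techniques; determined attackers have many more:

```python
import re

# Common CSS tricks for hiding text from human readers (non-exhaustive).
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0",
    re.IGNORECASE,
)
# Instruction-like phrasing aimed at the agent rather than the reader.
INSTRUCTION_HINT = re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE)

def suspicious_page(html: str) -> bool:
    """True if the page hides text or embeds instruction-like phrasing."""
    return bool(HIDDEN_STYLE.search(html)) or bool(INSTRUCTION_HINT.search(html))
```

This matches the category's key indicator directly: the check looks for content the model will perceive that a human viewing the rendered page would not.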
Category 6: Misuse Incidents
Misuse incidents involve authorized users employing the AI system in ways that violate policies, regulations, or ethical guidelines -- without necessarily exploiting a technical vulnerability.
| Subcategory | Description | Example |
|---|---|---|
| Policy violation | User obtains outputs that violate acceptable use policies | Using a coding assistant to generate malware |
| Regulatory violation | AI usage violates applicable regulations | Using AI to make automated decisions in contexts requiring human oversight |
| Abuse of capabilities | Leveraging legitimate features for harmful purposes | Using an AI writing assistant to generate targeted phishing campaigns |
| Automated misuse | Scripted or automated exploitation of AI capabilities | Bot farm using an AI API to generate disinformation at scale |
| Insider misuse | Authorized personnel misuse AI access | Employee using internal AI tools to access data outside their job function |
Evidence sources: User identity and access logs, usage volume metrics, output content analysis, API call patterns.
Key indicator: The AI system functions as designed, but the user's intent or use case violates policies or regulations.
Multi-Category Incidents
Many real-world incidents span multiple categories simultaneously. An attacker who uses a jailbreak (Category 1) to extract training data (Category 2) via indirect prompt injection (Category 5) should be classified under all three categories. Classification drives response procedures, and each relevant category adds investigation steps.
Classification Decision Tree
When a potential AI incident is detected, use this decision flow to classify it:
| Question | If Yes | If No |
|---|---|---|
| Did the model produce content violating its safety constraints? | Include Category 1 (Jailbreak) | Continue |
| Did the model disclose information it should not have? | Include Category 2 (Data Leak) | Continue |
| Has the model's behavior changed from its established baseline? | Include Category 3 (Manipulation) | Continue |
| Did the incident originate from a third-party component? | Include Category 4 (Supply Chain) | Continue |
| Was the attack based on specially crafted perceptual inputs? | Include Category 5 (Adversarial) | Continue |
| Did an authorized user misuse the system's legitimate capabilities? | Include Category 6 (Misuse) | Re-evaluate the detection signal |
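The decision flow above can be sketched as a function over boolean triage answers. The field names are assumptions; real triage would populate them from investigation findings, and because categories are inclusive rather than exclusive, every "yes" is collected rather than stopping at the first match:

```python
# One (triage_field, category) pair per question in the decision table.
QUESTIONS = [
    ("safety_violation", "Category 1: Jailbreak"),
    ("unauthorized_disclosure", "Category 2: Data Leak"),
    ("baseline_deviation", "Category 3: Model Manipulation"),
    ("third_party_origin", "Category 4: Supply Chain"),
    ("crafted_perceptual_input", "Category 5: Adversarial Attack"),
    ("authorized_user_misuse", "Category 6: Misuse"),
]

def classify_incident(answers: dict[str, bool]) -> list[str]:
    """Return every category whose triage question is answered 'yes'."""
    categories = [label for key, label in QUESTIONS if answers.get(key)]
    if not categories:
        return ["Re-evaluate detection signal"]
    return categories
```

For example, an incident with both a disclosure and a crafted input yields Categories 2 and 5, reflecting the multi-category principle above.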
Related Topics
- Severity Framework -- scoring the severity of classified incidents
- Triage Procedures -- first-response actions after classification
- Escalation Paths -- when and how to escalate each category
- OWASP Top 10 for LLM Applications -- complementary vulnerability classification
References
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE Corporation (2025) - Comprehensive attack taxonomy for ML systems
- "AI Incident Database" - Partnership on AI (2025) - Real-world incident catalog informing this taxonomy
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability classification for LLM-based systems
- "NIST AI 100-2: Adversarial Machine Learning" - NIST (2024) - Taxonomy of attacks on AI systems
Self-check: An attacker uses indirect prompt injection in a web page to cause a browsing agent to reveal its system prompt. Which incident categories apply?