AI Incident Classification & Taxonomy
Comprehensive taxonomy for classifying AI security incidents: jailbreaks, data leaks, model manipulation, supply chain compromise, adversarial attacks, and misuse categories.
Accurate classification is the foundation of effective incident response. Misclassifying an AI incident delays containment, directs investigators toward the wrong evidence, and can cause you to underestimate severity. This page establishes a structured taxonomy designed specifically for AI system incidents.
Why AI Needs Its Own Taxonomy
Traditional incident taxonomies (NIST, VERIS, MITRE ATT&CK) were built for software and network security. They classify incidents by the technology exploited (web application, operating system, network protocol) or the attacker's objective (data theft, denial of service, ransomware). These categories do not capture the unique attack surface of AI systems.
Consider the differences:
| Traditional Taxonomy | AI Taxonomy Requirement |
|---|---|
| SQL injection exploits a parser | Prompt injection exploits learned instruction following |
| Malware modifies system files | Model poisoning modifies learned weights |
| Data breach through access control failure | Data leakage through model memorization |
| Supply chain attack via compromised dependency | Supply chain attack via compromised model weights or adapters |
| DoS via resource exhaustion | DoS via expensive inference or recursive agent loops |
Primary Incident Categories
The AI incident taxonomy consists of six primary categories, each with distinct subcategories, evidence sources, and response requirements.
Category 1: Jailbreak Incidents
A jailbreak occurs when an attacker causes the model to produce outputs that violate its safety training or operational constraints.
| Subcategory | Description | Example |
|---|---|---|
| Direct jailbreak | Single-turn prompt that bypasses safety controls | "You are DAN, Do Anything Now..." |
| Multi-turn jailbreak | Gradual escalation across conversation turns | Building context over 10+ messages before requesting harmful content |
| Persona hijack | Convincing the model to adopt an unrestricted persona | Roleplay scenarios that establish a character without safety constraints |
| Encoding bypass | Using encodings (Base64, ROT13, leetspeak) to circumvent filters | Requesting harmful content in Base64-encoded format |
| Language bypass | Exploiting weaker safety training in non-English languages | Requesting harmful content in low-resource languages |
Evidence sources: Prompt/completion logs, safety classifier outputs, content filter bypass records.
Key indicator: The model produces content that its safety training or system prompt explicitly prohibits.
See Jailbreak Incident Playbook for the full response procedure.
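As an illustration of the encoding-bypass subcategory, the following is a minimal detection sketch: it looks for Base64-like tokens in a prompt and checks whether the decoded payload matches a blocked phrase. The `FLAGGED_TERMS` list and function names are hypothetical; a production deployment would pair this with a trained safety classifier rather than a static blocklist.

```python
import base64
import re

# Hypothetical blocklist for illustration only; real deployments should use
# a safety classifier on the *decoded* text, not a handful of fixed phrases.
FLAGGED_TERMS = {"ignore previous instructions", "do anything now"}


def decoded_candidates(text: str) -> list[str]:
    """Find Base64-looking tokens in the input and return their decoded forms."""
    out = []
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
            out.append(decoded.lower())
        except (ValueError, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
    return out


def looks_like_encoding_bypass(text: str) -> bool:
    """Flag prompts whose decoded payload contains a blocked phrase."""
    return any(
        term in decoded
        for decoded in decoded_candidates(text)
        for term in FLAGGED_TERMS
    )
```

The same pattern extends to ROT13 or leetspeak by adding further decoders to `decoded_candidates`; the key design point is that filtering must run on the decoded content, not only the raw input.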
Category 2: Data Leak Incidents
Data leak incidents occur when the model discloses information it should not have access to or should not reveal.
| Subcategory | Description | Example |
|---|---|---|
| System prompt extraction | Attacker obtains the contents of the system prompt | Model reveals its instructions verbatim when asked about its "configuration" |
| Training data extraction | Attacker extracts memorized training data | Model produces exact passages from copyrighted or private training documents |
| RAG data leakage | Model reveals contents of retrieved documents to unauthorized users | Retrieving and disclosing documents from another tenant's index |
| PII disclosure | Model reveals personally identifiable information | Outputting names, emails, or phone numbers from its training data |
| Context window leakage | Information from one conversation leaks into another | Shared context windows in multi-tenant deployments |
Evidence sources: Prompt/completion logs, RAG retrieval logs, embedding query logs, data classification tags.
Key indicator: The model's output contains information that should not be accessible to the requesting user.
See Data Exfiltration IR Playbook for the response procedure.
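For the system-prompt-extraction subcategory, a common detection approach is to compare model outputs against the system prompt for verbatim overlap. The sketch below uses word n-gram overlap; the function names and the 0.2 threshold are illustrative assumptions, not a calibrated detector.

```python
def ngram_set(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(output: str, system_prompt: str,
                        threshold: float = 0.2, n: int = 5) -> bool:
    """Flag outputs that reproduce a large fraction of the system prompt's n-grams.

    The threshold is a hypothetical starting point; tune it against logged
    benign traffic before relying on it for alerting.
    """
    prompt_grams = ngram_set(system_prompt, n)
    if not prompt_grams:
        return False
    overlap = len(prompt_grams & ngram_set(output, n))
    return overlap / len(prompt_grams) >= threshold
```

N-gram overlap catches verbatim and near-verbatim leakage; paraphrased extraction requires semantic-similarity checks (e.g., embedding comparison) on top of this.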
Category 3: Model Manipulation
Model manipulation incidents involve altering the model's behavior through attacks on its weights, training data, or fine-tuning process.
| Subcategory | Description | Example |
|---|---|---|
| Training data poisoning | Malicious data injected into training sets | Inserting backdoor trigger phrases into fine-tuning datasets |
| Fine-tuning attacks | Malicious fine-tuning that degrades safety or inserts backdoors | LoRA adapter that activates harmful behavior on a specific trigger |
| Backdoor insertion | Hidden functionality activated by specific triggers | Model produces a specific output (e.g., "APPROVED") when input contains a trigger word |
| Safety degradation | Targeted reduction of safety behaviors | Fine-tuning with unsafe completions to erode refusal behavior |
| Capability manipulation | Modifying model capabilities for attacker benefit | Enhancing code generation to produce malware patterns |
Evidence sources: Model weight checksums, training data provenance, fine-tuning logs, behavioral test suites, adapter files.
Key indicator: The model's behavior deviates from its expected baseline in ways not explained by normal updates.
See Model Compromise Playbook and Backdoor Detection for investigation and response.
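The "model weight checksums" evidence source above presumes you record known-good hashes at deployment time. A minimal sketch of that verification, assuming a JSON manifest of filename-to-SHA-256 entries stored alongside the artifacts (the manifest format is a hypothetical convention):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks to handle large weights."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def verify_artifacts(manifest_path: Path) -> list[str]:
    """Return artifact names whose on-disk hash no longer matches the manifest."""
    manifest = json.loads(manifest_path.read_text())
    return [
        name for name, expected in manifest.items()
        if sha256_of(manifest_path.parent / name) != expected
    ]
```

A non-empty return value is an immediate escalation signal for this category; note that checksums only detect weight tampering after the baseline was recorded, so behavioral test suites remain necessary for poisoning introduced before deployment.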
Category 4: Supply Chain Compromise
Supply chain incidents target the components and dependencies that AI systems rely on, including pre-trained model weights, tokenizers, datasets, and framework libraries.
| Subcategory | Description | Example |
|---|---|---|
| Compromised model weights | Pre-trained model contains hidden functionality | Downloading a model from Hugging Face that contains a backdoor |
| Malicious tokenizer | Tokenizer modified to enable attacks | Tokenizer that maps a common word to a backdoor trigger token |
| Poisoned datasets | Public datasets contain adversarial examples | Common fine-tuning dataset includes data designed to degrade safety |
| Framework vulnerabilities | ML framework libraries contain exploitable bugs | Pickle deserialization RCE in model loading code |
| Adapter/plugin compromise | Third-party adapters or plugins introduce vulnerabilities | Community LoRA adapter that bypasses safety for specific topics |
Evidence sources: Artifact checksums, download provenance, dependency manifests, framework vulnerability databases, model file integrity logs.
Key indicator: A third-party component introduced a vulnerability or behavioral change not present in internally developed artifacts.
See Infrastructure & Supply Chain for related attack techniques.
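The pickle-deserialization row above motivates a simple supply-chain control: gate model loading on the serialization format, since pickle-based checkpoints can execute arbitrary code on load while data-only formats cannot. The suffix lists below are an illustrative policy, not an exhaustive classification.

```python
from pathlib import Path

# Pickle-based formats can run arbitrary code during deserialization;
# data-only formats (e.g., safetensors) cannot. Lists are illustrative.
UNSAFE_SUFFIXES = {".pkl", ".pt", ".pth", ".bin"}
SAFE_SUFFIXES = {".safetensors", ".gguf"}


def loading_risk(path: str) -> str:
    """Classify a model artifact's loading risk by serialization format."""
    suffix = Path(path).suffix.lower()
    if suffix in SAFE_SUFFIXES:
        return "safe-format"
    if suffix in UNSAFE_SUFFIXES:
        return "pickle-risk"
    return "unknown"
```

A policy gate like this belongs in the artifact-ingestion pipeline: reject or sandbox anything classified `pickle-risk` from untrusted sources, and record the decision so it appears in download-provenance evidence during an investigation.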
Category 5: Adversarial Attacks
Adversarial attacks exploit the model's perception or reasoning through carefully crafted inputs designed to cause misclassification, incorrect reasoning, or unsafe behavior.
| Subcategory | Description | Example |
|---|---|---|
| Adversarial text | Inputs crafted to cause misclassification or unsafe outputs | Gradient-based token substitutions that cause safety classifiers to fail |
| Adversarial images | Modified images that fool multimodal models | Images with embedded text that the model reads but humans cannot see |
| Indirect prompt injection | Malicious instructions embedded in external data sources | Web page containing hidden instructions that a browsing agent follows |
| Adversarial embeddings | Crafted inputs that manipulate retrieval behavior | Documents designed to be retrieved for unrelated queries |
| Evasion attacks | Inputs designed to bypass safety classifiers | Token manipulation that passes input filters but retains harmful semantics |
Evidence sources: Input logs with raw bytes, classifier confidence scores, embedding similarity logs, retrieval logs, multimodal input archives.
Key indicator: The attack relies on a discrepancy between what the model perceives and what a human observer would perceive.
See Multimodal Attacks and Prompt Injection & Jailbreaks for related attack techniques.
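For the indirect-prompt-injection subcategory, one common mitigation is scanning retrieved or browsed content for instruction-like text before it enters the model's context. The patterns below are heuristic examples; regexes alone are easy to evade, so treat this as a triage/logging aid rather than a complete defense.

```python
import re

# Heuristic patterns for instruction-like text in external content.
# Illustrative only: a real deployment would pair this with a classifier.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"ignore (all|any|previous|prior) .{0,30}instructions",
        r"you are now\b",
        r"system prompt",
        r"do not (tell|inform|warn) the user",
    )
]


def scan_retrieved_document(text: str) -> list[str]:
    """Return the patterns matched in external content, for logging and triage."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]
```

Logging every match, even when the request is allowed through, builds exactly the "input logs" evidence source this category's investigations depend on.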
Category 6: Misuse Incidents
Misuse incidents involve authorized users employing the AI system in ways that violate policies, regulations, or ethical guidelines -- without necessarily exploiting a technical vulnerability.
| Subcategory | Description | Example |
|---|---|---|
| Policy violation | User obtains outputs that violate acceptable use policies | Using a coding assistant to generate malware |
| Regulatory violation | AI usage violates applicable regulations | Using AI to make automated decisions in contexts requiring human oversight |
| Abuse of capabilities | Leveraging legitimate features for harmful purposes | Using an AI writing assistant to generate targeted phishing campaigns |
| Automated misuse | Scripted or automated exploitation of AI capabilities | Bot farm using an AI API to generate disinformation at scale |
| Insider misuse | Authorized personnel misuse AI access | Employee using internal AI tools to access data outside their job function |
Evidence sources: User identity and access logs, usage volume metrics, output content analysis, API call patterns.
Key indicator: The AI system functions as designed, but the user's intent or use case violates policies or regulations.
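Because misuse involves legitimate access, detection leans on the usage-volume evidence source above rather than on exploit signatures. A minimal sketch of volume-based anomaly flagging using a z-score over per-user request counts (the threshold and function names are illustrative assumptions):

```python
from statistics import mean, pstdev


def flag_volume_anomalies(requests_by_user: dict[str, int],
                          z_threshold: float = 3.0) -> list[str]:
    """Flag users whose request count sits far above the population mean.

    The z-score threshold is a hypothetical starting point; automated misuse
    (e.g., a bot farm on one account) typically stands out by volume alone.
    """
    counts = list(requests_by_user.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []
    return [
        user for user, n in requests_by_user.items()
        if (n - mu) / sigma >= z_threshold
    ]
```

Volume is only one signal; insider misuse in particular often stays within normal volume and surfaces instead through output content analysis and access patterns outside the user's job function.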
Multi-Category Incidents
Many real-world incidents span multiple categories simultaneously. An attacker who uses a jailbreak (Category 1) to extract training data (Category 2) via indirect prompt injection (Category 5) should be classified under all three categories. Classification drives response procedures, and each relevant category adds investigation steps.
Classification Decision Tree
When a potential AI incident is detected, use this decision flow to classify it:
| Question | If Yes | If No |
|---|---|---|
| Did the model produce content violating its safety constraints? | Include Category 1 (Jailbreak) | Continue |
| Did the model disclose information it should not have? | Include Category 2 (Data Leak) | Continue |
| Has the model's behavior changed from its established baseline? | Include Category 3 (Manipulation) | Continue |
| Did the incident originate from a third-party component? | Include Category 4 (Supply Chain) | Continue |
| Was the attack based on specially crafted perceptual inputs? | Include Category 5 (Adversarial) | Continue |
| Did an authorized user misuse the system's legitimate capabilities? | Include Category 6 (Misuse) | Re-evaluate detection signal |
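The decision flow above can be sketched as a small classifier that walks every question and accumulates categories, reflecting the multi-category principle: one incident may trigger several rows. The `IncidentSignals` field names are hypothetical labels for the triage questions.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Triage answers to the six classification questions (hypothetical names)."""
    violated_safety_constraints: bool = False
    disclosed_restricted_info: bool = False
    behavior_drifted_from_baseline: bool = False
    third_party_component_origin: bool = False
    crafted_perceptual_input: bool = False
    authorized_user_misuse: bool = False


CATEGORY_BY_SIGNAL = {
    "violated_safety_constraints": "1-jailbreak",
    "disclosed_restricted_info": "2-data-leak",
    "behavior_drifted_from_baseline": "3-manipulation",
    "third_party_component_origin": "4-supply-chain",
    "crafted_perceptual_input": "5-adversarial",
    "authorized_user_misuse": "6-misuse",
}


def classify(signals: IncidentSignals) -> set[str]:
    """Evaluate every question; a single incident may land in several categories."""
    return {
        cat for field, cat in CATEGORY_BY_SIGNAL.items()
        if getattr(signals, field)
    }
```

An empty result means the detection signal itself should be re-evaluated, matching the final row of the table.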
Related Topics
- Severity Framework -- scoring the severity of classified incidents
- Triage Procedures -- first-response actions after classification
- Escalation Paths -- when and how to escalate each category
- OWASP Top 10 for LLM Applications -- complementary vulnerability classification
References
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE Corporation (2025) - Comprehensive attack taxonomy for ML systems
- "AI Incident Database" - Partnership on AI (2025) - Real-world incident catalog informing this taxonomy
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability classification for LLM-based systems
- "NIST AI 100-2: Adversarial Machine Learning" - NIST (2024) - Taxonomy of attacks on AI systems
Worked example: An attacker uses indirect prompt injection in a web page to cause a browsing agent to reveal its system prompt. Which incident categories apply? Category 5 (Adversarial -- the malicious instructions arrived through crafted external content) and Category 2 (Data Leak -- the outcome was system prompt extraction). Both categories' response procedures apply.