AI Incident Classification & Taxonomy
Comprehensive taxonomy for classifying AI security incidents: jailbreaks, data leaks, model manipulation, supply chain compromise, adversarial attacks, and misuse categories.
Accurate classification is the foundation of effective incident response. Misclassifying an AI incident delays containment, directs investigators toward the wrong evidence, and can cause you to underestimate severity. This page establishes a structured taxonomy designed specifically for AI system incidents.
Why AI Needs Its Own Taxonomy
Traditional incident taxonomies (NIST, VERIS, MITRE ATT&CK) were built for software and network security. They classify incidents by the technology exploited (web application, operating system, network protocol) or the attacker's objective (data theft, denial of service, ransomware). These categories do not capture the unique attack surface of AI systems.
Consider the differences:
| Traditional Taxonomy | AI Taxonomy Requirement |
|---|---|
| SQL injection exploits a parser | Prompt injection exploits learned instruction following |
| Malware modifies system files | Model poisoning modifies learned weights |
| Data breach through access control failure | Data leakage through model memorization |
| Supply chain attack via compromised dependency | Supply chain attack via compromised model weights or adapters |
| DoS via resource exhaustion | DoS via expensive inference or recursive agent loops |
Primary Incident Categories
The AI incident taxonomy consists of six primary categories, each with distinct subcategories, evidence sources, and response requirements.
Category 1: Jailbreak Incidents
A jailbreak occurs when an attacker causes the model to produce outputs that violate its safety training or operational constraints.
| Subcategory | Description | Example |
|---|---|---|
| Direct jailbreak | Single-turn prompt that bypasses safety controls | "You are DAN, Do Anything Now..." |
| Multi-turn jailbreak | Gradual escalation across conversation turns | Building context over 10+ messages before requesting harmful content |
| Persona hijack | Convincing the model to adopt an unrestricted persona | Roleplay scenarios that establish a character without safety constraints |
| Encoding bypass | Using encodings (Base64, ROT13, leetspeak) to circumvent filters | Requesting harmful content in Base64-encoded format |
| Language bypass | Exploiting weaker safety training in non-English languages | Requesting harmful content in low-resource languages |
Evidence sources: Prompt/completion logs, safety classifier outputs, content filter bypass records.
Key indicator: The model produces content that its safety training or system prompt explicitly prohibits.
See Jailbreak Incident Playbook for the full response procedure.
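As an illustration of the encoding-bypass subcategory, the following is a minimal detection sketch: it looks for Base64-like tokens in a prompt and checks whether the decoded payload matches a blocked phrase. The `FLAGGED_TERMS` list and function names are hypothetical; a production deployment would pair this with a trained safety classifier rather than a static blocklist.

```python
import base64
import re

# Hypothetical blocklist for illustration only; real deployments should use
# a safety classifier on the *decoded* text, not a handful of fixed phrases.
FLAGGED_TERMS = {"ignore previous instructions", "do anything now"}


def decoded_candidates(text: str) -> list[str]:
    """Find Base64-looking tokens in the input and return their decoded forms."""
    out = []
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
            out.append(decoded.lower())
        except (ValueError, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
    return out


def looks_like_encoding_bypass(text: str) -> bool:
    """Flag prompts whose decoded payload contains a blocked phrase."""
    return any(
        term in decoded
        for decoded in decoded_candidates(text)
        for term in FLAGGED_TERMS
    )
```

The same pattern extends to ROT13 or leetspeak by adding further decoders to `decoded_candidates`; the key design point is that filtering must run on the decoded content, not only the raw input.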
Category 2: Data Leak Incidents
Data leak incidents occur when the model discloses information it should not have access to or should not reveal.
| Subcategory | Description | Example |
|---|---|---|
| System prompt extraction | Attacker obtains the contents of the system prompt | Model reveals its instructions verbatim when asked about its "configuration" |
| Training data extraction | Attacker extracts memorized training data | Model produces exact passages from copyrighted or private training documents |
| RAG data leakage | Model reveals contents of retrieved documents to unauthorized users | Retrieving and disclosing documents from another tenant's index |
| PII disclosure | Model reveals personally identifiable information | Outputting names, emails, or phone numbers from its training data |
| Context window leakage | Information from one conversation leaks into another | Shared context windows in multi-tenant deployments |
Evidence sources: Prompt/completion logs, RAG retrieval logs, embedding query logs, data classification tags.
Key indicator: The model's output contains information that should not be accessible to the requesting user.
See Data Exfiltration IR Playbook for the response procedure.
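For the system-prompt-extraction subcategory, a common detection approach is to compare model outputs against the system prompt for verbatim overlap. The sketch below uses word n-gram overlap; the function names and the 0.2 threshold are illustrative assumptions, not a calibrated detector.

```python
def ngram_set(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(output: str, system_prompt: str,
                        threshold: float = 0.2, n: int = 5) -> bool:
    """Flag outputs that reproduce a large fraction of the system prompt's n-grams.

    The threshold is a hypothetical starting point; tune it against logged
    benign traffic before relying on it for alerting.
    """
    prompt_grams = ngram_set(system_prompt, n)
    if not prompt_grams:
        return False
    overlap = len(prompt_grams & ngram_set(output, n))
    return overlap / len(prompt_grams) >= threshold
```

N-gram overlap catches verbatim and near-verbatim leakage; paraphrased extraction requires semantic-similarity checks (e.g., embedding comparison) on top of this.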
Category 3: Model Manipulation
Model manipulation incidents involve altering the model's behavior through attacks on its weights, training data, or fine-tuning process.
| Subcategory | Description | Example |
|---|---|---|
| Training data poisoning | Malicious data injected into training sets | Inserting backdoor trigger phrases into fine-tuning datasets |
| Fine-tuning attacks | Malicious fine-tuning that degrades safety or inserts backdoors | LoRA adapter that activates harmful behavior on a specific trigger |
| Backdoor insertion | Hidden functionality activated by specific triggers | Model produces a specific output (e.g., "APPROVED") when input contains a trigger word |
| Safety degradation | Targeted reduction of safety behaviors | Fine-tuning with unsafe completions to erode refusal behavior |
| Capability manipulation | Modifying model capabilities for attacker benefit | Enhancing code generation to produce malware patterns |
Evidence sources: Model weight checksums, training data provenance, fine-tuning logs, behavioral test suites, adapter files.
Key indicator: The model's behavior deviates from its expected baseline in ways not explained by normal updates.
See Model Compromise Playbook and Backdoor Detection for investigation and response.
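The "model weight checksums" evidence source above presumes you record known-good hashes at deployment time. A minimal sketch of that verification, assuming a JSON manifest of filename-to-SHA-256 entries stored alongside the artifacts (the manifest format is a hypothetical convention):

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks to handle large weights."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def verify_artifacts(manifest_path: Path) -> list[str]:
    """Return artifact names whose on-disk hash no longer matches the manifest."""
    manifest = json.loads(manifest_path.read_text())
    return [
        name for name, expected in manifest.items()
        if sha256_of(manifest_path.parent / name) != expected
    ]
```

A non-empty return value is an immediate escalation signal for this category; note that checksums only detect weight tampering after the baseline was recorded, so behavioral test suites remain necessary for poisoning introduced before deployment.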
Category 4: Supply Chain Compromise
Supply chain incidents target the components and dependencies that AI systems rely on, including pre-trained model weights, tokenizers, datasets, and framework libraries.
| Subcategory | Description | Example |
|---|---|---|
| Compromised model weights | Pre-trained model contains hidden functionality | Downloading a model from Hugging Face that contains a backdoor |
| Malicious tokenizer | Tokenizer modified to enable attacks | Tokenizer that maps a common word to a backdoor trigger token |
| Poisoned datasets | Public datasets contain adversarial examples | Common fine-tuning dataset includes data designed to degrade safety |
| Framework vulnerabilities | ML framework libraries contain exploitable bugs | Pickle deserialization RCE in model loading code |
| Adapter/plugin compromise | Third-party adapters or plugins introduce vulnerabilities | Community LoRA adapter that bypasses safety for specific topics |
Evidence sources: Artifact checksums, download provenance, dependency manifests, framework vulnerability databases, model file integrity logs.
Key indicator: A third-party component introduced a vulnerability or behavioral change not present in internally developed artifacts.
See Infrastructure & Supply Chain for related attack techniques.
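The pickle-deserialization row above motivates a simple supply-chain control: gate model loading on the serialization format, since pickle-based checkpoints can execute arbitrary code on load while data-only formats cannot. The suffix lists below are an illustrative policy, not an exhaustive classification.

```python
from pathlib import Path

# Pickle-based formats can run arbitrary code during deserialization;
# data-only formats (e.g., safetensors) cannot. Lists are illustrative.
UNSAFE_SUFFIXES = {".pkl", ".pt", ".pth", ".bin"}
SAFE_SUFFIXES = {".safetensors", ".gguf"}


def loading_risk(path: str) -> str:
    """Classify a model artifact's loading risk by serialization format."""
    suffix = Path(path).suffix.lower()
    if suffix in SAFE_SUFFIXES:
        return "safe-format"
    if suffix in UNSAFE_SUFFIXES:
        return "pickle-risk"
    return "unknown"
```

A policy gate like this belongs in the artifact-ingestion pipeline: reject or sandbox anything classified `pickle-risk` from untrusted sources, and record the decision so it appears in download-provenance evidence during an investigation.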
Category 5: Adversarial Attacks
Adversarial attacks exploit the model's perception or reasoning through carefully crafted inputs designed to cause misclassification, incorrect reasoning, or unsafe behavior.
| Subcategory | Description | Example |
|---|---|---|
| Adversarial text | Inputs crafted to cause misclassification or unsafe outputs | Gradient-based token substitutions that cause safety classifiers to fail |
| Adversarial images | Modified images that fool multimodal models | Images with embedded text that the model reads but humans cannot see |
| Indirect prompt injection | Malicious instructions embedded in external data sources | Web page containing hidden instructions that a browsing agent follows |
| Adversarial embeddings | Crafted inputs that manipulate retrieval behavior | Documents designed to be retrieved for unrelated queries |
| Evasion attacks | Inputs designed to bypass safety classifiers | Token manipulation that passes input filters but retains harmful semantics |
Evidence sources: Input logs with raw bytes, classifier confidence scores, embedding similarity logs, retrieval logs, multimodal input archives.
Key indicator: The attack relies on a discrepancy between what the model perceives and what a human observer would perceive.
See Multimodal Attacks and Prompt Injection & Jailbreaks for related attack techniques.
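For the indirect-prompt-injection subcategory, one common mitigation is scanning retrieved or browsed content for instruction-like text before it enters the model's context. The patterns below are heuristic examples; regexes alone are easy to evade, so treat this as a triage/logging aid rather than a complete defense.

```python
import re

# Heuristic patterns for instruction-like text in external content.
# Illustrative only: a real deployment would pair this with a classifier.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (
        r"ignore (all|any|previous|prior) .{0,30}instructions",
        r"you are now\b",
        r"system prompt",
        r"do not (tell|inform|warn) the user",
    )
]


def scan_retrieved_document(text: str) -> list[str]:
    """Return the patterns matched in external content, for logging and triage."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]
```

Logging every match, even when the request is allowed through, builds exactly the "input logs" evidence source this category's investigations depend on.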
Category 6: Misuse Incidents
Misuse incidents involve authorized users employing the AI system in ways that violate policies, regulations, or ethical guidelines -- without necessarily exploiting a technical vulnerability.
| Subcategory | Description | Example |
|---|---|---|
| Policy violation | User obtains outputs that violate acceptable use policies | Using a coding assistant to generate malware |
| Regulatory violation | AI usage violates applicable regulations | Using AI to make automated decisions in contexts requiring human oversight |
| Abuse of capabilities | Leveraging legitimate features for harmful purposes | Using an AI writing assistant to generate targeted phishing campaigns |
| Automated misuse | Scripted or automated exploitation of AI capabilities | Bot farm using an AI API to generate disinformation at scale |
| Insider misuse | Authorized personnel misuse AI access | Employee using internal AI tools to access data outside their job function |
Evidence sources: User identity and access logs, usage volume metrics, output content analysis, API call patterns.
Key indicator: The AI system functions as designed, but the user's intent or use case violates policies or regulations.
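Because misuse involves legitimate access, detection leans on the usage-volume evidence source above rather than on exploit signatures. A minimal sketch of volume-based anomaly flagging using a z-score over per-user request counts (the threshold and function names are illustrative assumptions):

```python
from statistics import mean, pstdev


def flag_volume_anomalies(requests_by_user: dict[str, int],
                          z_threshold: float = 3.0) -> list[str]:
    """Flag users whose request count sits far above the population mean.

    The z-score threshold is a hypothetical starting point; automated misuse
    (e.g., a bot farm on one account) typically stands out by volume alone.
    """
    counts = list(requests_by_user.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []
    return [
        user for user, n in requests_by_user.items()
        if (n - mu) / sigma >= z_threshold
    ]
```

Volume is only one signal; insider misuse in particular often stays within normal volume and surfaces instead through output content analysis and access patterns outside the user's job function.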
Multi-Category Incidents
Many real-world incidents span multiple categories simultaneously. An attacker who uses a jailbreak (Category 1) to extract training data (Category 2) via indirect prompt injection (Category 5) should be classified under all three categories. Classification drives response procedures, and each relevant category adds investigation steps.
Classification Decision Tree
When a potential AI incident is detected, use this decision flow to classify it:
| Question | If Yes | If No |
|---|---|---|
| Did the model produce content violating its safety constraints? | Include Category 1 (Jailbreak) | Continue |
| Did the model disclose information it should not have? | Include Category 2 (Data Leak) | Continue |
| Has the model's behavior changed from its established baseline? | Include Category 3 (Manipulation) | Continue |
| Did the incident originate from a third-party component? | Include Category 4 (Supply Chain) | Continue |
| Was the attack based on specially crafted perceptual inputs? | Include Category 5 (Adversarial) | Continue |
| Did an authorized user misuse the system's legitimate capabilities? | Include Category 6 (Misuse) | Re-evaluate detection signal |
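The decision flow above can be sketched as a small classifier that walks every question and accumulates categories, reflecting the multi-category principle: one incident may trigger several rows. The `IncidentSignals` field names are hypothetical labels for the triage questions.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Triage answers to the six classification questions (hypothetical names)."""
    violated_safety_constraints: bool = False
    disclosed_restricted_info: bool = False
    behavior_drifted_from_baseline: bool = False
    third_party_component_origin: bool = False
    crafted_perceptual_input: bool = False
    authorized_user_misuse: bool = False


CATEGORY_BY_SIGNAL = {
    "violated_safety_constraints": "1-jailbreak",
    "disclosed_restricted_info": "2-data-leak",
    "behavior_drifted_from_baseline": "3-manipulation",
    "third_party_component_origin": "4-supply-chain",
    "crafted_perceptual_input": "5-adversarial",
    "authorized_user_misuse": "6-misuse",
}


def classify(signals: IncidentSignals) -> set[str]:
    """Evaluate every question; a single incident may land in several categories."""
    return {
        cat for field, cat in CATEGORY_BY_SIGNAL.items()
        if getattr(signals, field)
    }
```

An empty result means the detection signal itself should be re-evaluated, matching the final row of the table.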
Related Topics
- Severity Framework -- scoring the severity of classified incidents
- Triage Procedures -- first-response actions after classification
- Escalation Paths -- when and how to escalate each category
- OWASP Top 10 for LLM Applications -- complementary vulnerability classification
References
- "MITRE ATLAS: Adversarial Threat Landscape for AI Systems" - MITRE Corporation (2025) - Comprehensive attack taxonomy for ML systems
- "AI Incident Database" - Partnership on AI (2025) - Real-world incident catalog informing this taxonomy
- "OWASP Top 10 for LLM Applications" - OWASP Foundation (2025) - Vulnerability classification for LLM-based systems
- "NIST AI 100-2: Adversarial Machine Learning" - NIST (2024) - Taxonomy of attacks on AI systems
Worked example: An attacker uses indirect prompt injection in a web page to cause a browsing agent to reveal its system prompt. Which incident categories apply? Category 5 (Adversarial -- the malicious instructions arrived through crafted external content) and Category 2 (Data Leak -- the outcome was system prompt extraction). Both categories' response procedures apply.