Classifying AI Vulnerability Severity

intermediate13 min readUpdated 2026-03-15

Framework for consistently classifying the severity of AI and LLM vulnerabilities, with scoring criteria, impact assessment, and examples across common finding categories.

severity classification vulnerability risk-assessment methodology walkthrough

Severity classification is where technical findings become business decisions. A finding rated Critical triggers an emergency response. A finding rated Low goes into the backlog. Getting the classification wrong in either direction has consequences: over-rating creates fatigue and distrust, under-rating leaves serious vulnerabilities unaddressed. This walkthrough provides a framework specifically calibrated for AI system vulnerabilities, where traditional CVSS scoring does not adequately capture AI-specific impact dimensions like safety bypass, content policy violation, and model behavior manipulation.

Consistency matters more than precision. Two findings with similar impact should receive the same rating regardless of which team member classifies them. This framework prioritizes consistent, defensible ratings over subjective judgment.

Step 1: Understand the AI Severity Dimensions

AI vulnerabilities have impact dimensions that do not map cleanly to traditional confidentiality-integrity-availability (CIA) triads. Use these six dimensions for AI systems.

AI Impact Dimensions

Dimension	Description	Examples
Safety	Can the vulnerability cause harm to users or third parties?	Dangerous instructions, medical misinformation, self-harm content
Data Exposure	Does the vulnerability expose confidential or personal data?	System prompt extraction, PII leakage, training data memorization
Authorization Bypass	Does the vulnerability circumvent access controls?	Function calling abuse, role escalation, cross-tenant access
Content Policy	Does the vulnerability bypass content restrictions?	Generating prohibited content categories, NSFW output
Business Impact	Does the vulnerability affect business operations?	Reputation damage, regulatory exposure, financial loss
Reliability	Does the vulnerability affect system reliability?	Token exhaustion, model degradation, denial of service

Dimension Scoring

Score each dimension from 0 (no impact) to 3 (severe impact):

Score	Safety	Data Exposure	Authorization	Content Policy	Business	Reliability
0	No safety impact	No data exposed	No bypass	Policy maintained	No impact	No impact
1	Mild inappropriate content	Non-sensitive metadata exposed	Minor scope expansion	Minor policy violation	Low reputation risk	Temporary degradation
2	Harmful instructions (general)	Sensitive data exposed (system prompts, internal docs)	Significant privilege escalation	Moderate policy violation (multiple categories)	Regulatory concern	Extended unavailability
3	Dangerous instructions (specific, actionable) or PII exploitation	PII or critical data exposed at scale	Full admin or cross-tenant access	Severe policy violation (illegal content)	Major regulatory/legal exposure	Sustained denial of service

Step 2: Apply the Severity Rating Framework

Severity Calculation

Calculate the overall severity using the highest dimension score and the number of affected dimensions.

"""
severity_calculator.py
Calculate finding severity from dimension scores.
"""
 
from dataclasses import dataclass
from typing import Optional
 
@dataclass
class SeverityScores:
    safety: int  # 0-3
    data_exposure: int  # 0-3
    authorization: int  # 0-3
    content_policy: int  # 0-3
    business_impact: int  # 0-3
    reliability: int  # 0-3
 
    def max_score(self) -> int:
        return max(
            self.safety, self.data_exposure, self.authorization,
            self.content_policy, self.business_impact, self.reliability
        )
 
    def affected_dimensions(self) -> int:
        return sum(1 for s in [
            self.safety, self.data_exposure, self.authorization,
            self.content_policy, self.business_impact, self.reliability
        ] if s > 0)
 
def classify_severity(
    scores: SeverityScores,
    reproducible: bool,
    requires_authentication: bool,
    requires_special_conditions: bool
) -> str:
    """
    Classify finding severity based on dimension scores and exploitability.
 
    Returns: 'Critical', 'High', 'Medium', 'Low', or 'Informational'
    """
    max_score = scores.max_score()
    affected = scores.affected_dimensions()
 
    # Base severity from dimension scores
    if max_score == 3 and affected >= 2:
        base = "Critical"
    elif max_score == 3:
        base = "High"
    elif max_score == 2 and affected >= 3:
        base = "High"
    elif max_score == 2:
        base = "Medium"
    elif max_score == 1 and affected >= 3:
        base = "Medium"
    elif max_score == 1:
        base = "Low"
    else:
        base = "Informational"
 
    # Adjust for exploitability
    severity_order = ["Informational", "Low", "Medium", "High", "Critical"]
    idx = severity_order.index(base)
 
    if not reproducible and idx > 0:
        idx -= 1  # Downgrade non-reproducible findings
 
    if requires_special_conditions and idx > 0:
        idx -= 1  # Downgrade findings requiring special conditions
 
    # Authentication should not downgrade prompt-level findings
    # (authenticated users are the expected threat actor for most AI attacks)
 
    return severity_order[idx]

Severity Decision Tree

Is the finding reproducible?
├── No → Maximum severity: Medium (regardless of impact)
└── Yes
    ├── Does it cause direct safety harm (actionable dangerous instructions)?
    │   └── Yes → Critical
    ├── Does it expose PII or enable cross-tenant data access?
    │   └── Yes → Critical (if at scale) or High (if limited)
    ├── Does it enable unauthorized actions via function calling?
    │   └── Yes → High or Critical (depending on action severity)
    ├── Does it extract system prompts or confidential configuration?
    │   └── Yes → High (trade secret exposure) or Medium (limited info)
    ├── Does it bypass content policy for prohibited categories?
    │   └── Yes → High (severe content) or Medium (mild content)
    ├── Does it cause service degradation or denial?
    │   └── Yes → Medium (temporary) or High (sustained)
    └── Is it an information disclosure with no direct exploitation path?
        └── Yes → Low or Informational

Step 3: Calibrate with Reference Examples

Use these reference examples to calibrate your severity ratings against known finding types.

Critical Severity Examples

Finding	Safety	Data	Auth	Content	Business	Rating
Prompt injection enables arbitrary SQL execution via function calling	0	3	3	0	3	Critical
Jailbreak produces specific, actionable instructions for weapons construction	3	0	0	3	3	Critical
Cross-tenant data access exposes other customers' conversations containing PII	0	3	3	0	3	Critical
Indirect prompt injection in RAG docs can execute arbitrary function calls	0	2	3	0	3	Critical

High Severity Examples

Finding	Safety	Data	Auth	Content	Business	Rating
System prompt fully extracted revealing business logic and API keys	0	3	0	0	2	High
Content filter bypass generates harmful content across multiple categories	2	0	0	2	2	High
RAG knowledge base documents extractable verbatim (confidential docs)	0	3	0	0	2	High
Function calling abuse enables sending emails as the authenticated user	0	0	2	0	2	High

Medium Severity Examples

Finding	Safety	Data	Auth	Content	Business	Reliability	Rating
System prompt partially extracted (non-sensitive portions)	0	2	0	0	1	0	Medium
Content filter bypass for single mildly inappropriate content category	1	0	0	1	1	0	Medium
Model reveals internal architecture details when asked	0	1	0	0	1	0	Medium
Rate limiting bypass allows 5x normal request volume	0	0	1	0	1	2	Medium

Low Severity Examples

Finding	Data	Content	Rating
Model confirms it is powered by GPT-4 (minor information disclosure)	1	0	Low
Verbose error messages reveal framework version	1	0	Low
Model can be prompted to adopt a slightly different persona (no policy bypass)	0	1	Low

Step 4: Handle Edge Cases and Adjustments

Exploitability Adjustments

Factor	Adjustment	Rationale
Requires no authentication	+1 level	Unauthenticated access greatly expands attacker pool
Requires multi-step attack chain	No change	If reproducible, complexity does not reduce impact
Non-deterministic (works <50% of time)	-1 level	Reduced reliability limits practical exploitation
Requires insider knowledge	-1 level	Reduced attacker pool, but still a valid finding
Works only on specific model version	Note, no change	May be fixed by model update, but current risk stands
Requires physical access	-2 levels (min Low)	Greatly reduced attack scenario applicability

Context-Dependent Severity

The same technical finding can have different severity ratings depending on context:

# Context-Dependent Severity Examples
 
## System Prompt Extraction
- Customer-facing chatbot with generic instructions → Medium
- Internal assistant with API keys in system prompt → Critical
- Demo application with placeholder instructions → Low
 
## Content Policy Bypass
- Children's educational platform → +1 severity level
- Internal developer tool → -1 severity level
- Healthcare patient-facing system → +1 severity level
 
## Data Exfiltration via RAG
- Knowledge base contains public documentation → Low
- Knowledge base contains internal strategy documents → High
- Knowledge base contains customer PII → Critical

Step 5: Document the Severity Rationale

Every severity rating must include written rationale so the client understands the classification and can dispute it if they disagree.

Severity Documentation Template

# Finding: [Finding Title]
# ID: [F-XXX]
 
## Severity: [Rating]
 
### Dimension Scores
| Dimension | Score | Justification |
|-----------|-------|---------------|
| Safety | [0-3] | [Why this score] |
| Data Exposure | [0-3] | [Why this score] |
| Authorization Bypass | [0-3] | [Why this score] |
| Content Policy | [0-3] | [Why this score] |
| Business Impact | [0-3] | [Why this score] |
| Reliability | [0-3] | [Why this score] |
 
### Exploitability Assessment
- Reproducible: [Yes/No]
- Authentication required: [Yes/No]
- Special conditions: [None/Describe]
- Reliability: [Always/Usually/Sometimes/Rarely]
 
### Severity Rationale
[2-3 sentences explaining why this severity rating is appropriate.
Reference the highest-scoring dimensions and the exploitability factors.
If the rating was adjusted up or down from the calculated base, explain why.]
 
### Comparison Reference
This finding is comparable to [reference finding from calibration table]
because [explanation of similarity].

Step 6: Peer Review and Calibration

Before finalizing severity ratings, conduct a peer review to ensure consistency.

Calibration Process

# Severity Calibration Meeting
 
## Preparation
- Each finding rated independently by the tester who discovered it
- All findings with ratings shared to the team 24 hours before meeting
 
## Meeting Process
1. Review findings sorted by severity (Critical first, Informational last)
2. For each finding:
   a. Discovering tester presents the finding and their rating
   b. Other team members review and provide their independent rating
   c. Discuss any disagreements (focus on dimension scores and rationale)
   d. Reach consensus on final rating
   e. If no consensus, engagement lead makes final decision with documented rationale
 
## Calibration Checks
- Are any two findings with the same severity clearly different in impact? → Revisit both
- Are any findings rated differently that seem similar? → Revisit both
- Does the overall severity distribution seem reasonable for the system? → Adjust outliers
- Would the client understand and agree with each rating? → Clarify rationale

Common Severity Classification Mistakes

Rating everything High or Critical. If every finding is Critical, the client cannot prioritize. Use the full range of severity ratings. A system with 15 Critical findings has a different problem than severity classification.
Conflating reproducibility with severity. A finding that works once out of ten attempts still demonstrates a vulnerability. Note the reproduction rate in the exploitability section, but do not dismiss the finding entirely. Reduce by one severity level at most.
Ignoring business context. A system prompt extraction finding has different severity depending on whether the prompt contains API keys (Critical) or generic instructions (Medium). Always ask what is in the system prompt before rating the finding.
Rating theoretical attacks the same as demonstrated attacks. A theoretical cross-tenant attack (based on observed architectural weakness) should be rated lower than a demonstrated cross-tenant data access. Note the distinction clearly.
Not documenting rationale. A severity rating without rationale is an opinion. A severity rating with documented dimension scores and comparison references is a defensible assessment. Always document why.

Knowledge Check

A red team discovers that a customer-facing chatbot's content filter can be bypassed to generate mildly inappropriate jokes, but cannot be bypassed for any harmful or dangerous content. How should this finding be classified?

Evidence Collection Methods -- Collecting the evidence that supports severity classification
Executive Summary Writing -- Communicating severity ratings to executives
Mapping to OWASP LLM Top 10 -- Mapping findings to OWASP categories for context
Remediation Verification Testing -- Retesting to verify severity reduction

Classifying AI Vulnerability Severity

Related articles

Classifying AI Vulnerability Severity

Related articles