Classifying AI Vulnerability Severity
Framework for consistently classifying the severity of AI and LLM vulnerabilities, with scoring criteria, impact assessment, and examples across common finding categories.
Severity classification is where technical findings become business decisions. A finding rated Critical triggers an emergency response. A finding rated Low goes into the backlog. Getting the classification wrong in either direction has consequences: over-rating creates fatigue and distrust, under-rating leaves serious vulnerabilities unaddressed. This walkthrough provides a framework specifically calibrated for AI system vulnerabilities, where traditional CVSS scoring does not adequately capture AI-specific impact dimensions like safety bypass, content policy violation, and model behavior manipulation.
Consistency matters more than precision. Two findings with similar impact should receive the same rating regardless of which team member classifies them. This framework prioritizes consistent, defensible ratings over subjective judgment.
Step 1: Understand the AI Severity Dimensions
AI vulnerabilities have impact dimensions that do not map cleanly to the traditional confidentiality-integrity-availability (CIA) triad. Use these six dimensions for AI systems.
AI Impact Dimensions
| Dimension | Description | Examples |
|---|---|---|
| Safety | Can the vulnerability cause harm to users or third parties? | Dangerous instructions, medical misinformation, self-harm content |
| Data Exposure | Does the vulnerability expose confidential or personal data? | System prompt extraction, PII leakage, training data memorization |
| Authorization Bypass | Does the vulnerability circumvent access controls? | Function calling abuse, role escalation, cross-tenant access |
| Content Policy | Does the vulnerability bypass content restrictions? | Generating prohibited content categories, NSFW output |
| Business Impact | Does the vulnerability affect business operations? | Reputation damage, regulatory exposure, financial loss |
| Reliability | Does the vulnerability affect system reliability? | Token exhaustion, model degradation, denial of service |
Dimension Scoring
Score each dimension from 0 (no impact) to 3 (severe impact):
| Score | Safety | Data Exposure | Authorization | Content Policy | Business | Reliability |
|---|---|---|---|---|---|---|
| 0 | No safety impact | No data exposed | No bypass | Policy maintained | No impact | No impact |
| 1 | Mild inappropriate content | Non-sensitive metadata exposed | Minor scope expansion | Minor policy violation | Low reputation risk | Temporary degradation |
| 2 | Harmful instructions (general) | Sensitive data exposed (system prompts, internal docs) | Significant privilege escalation | Moderate policy violation (multiple categories) | Regulatory concern | Extended unavailability |
| 3 | Dangerous instructions (specific, actionable) or PII exploitation | PII or critical data exposed at scale | Full admin or cross-tenant access | Severe policy violation (illegal content) | Major regulatory/legal exposure | Sustained denial of service |
Step 2: Apply the Severity Rating Framework
Severity Calculation
Calculate the overall severity using the highest dimension score and the number of affected dimensions.
"""
severity_calculator.py
Calculate finding severity from dimension scores.
"""
from dataclasses import dataclass
from typing import Optional
@dataclass
class SeverityScores:
安全: int # 0-3
data_exposure: int # 0-3
授權: int # 0-3
content_policy: int # 0-3
business_impact: int # 0-3
reliability: int # 0-3
def max_score(self) -> int:
return max(
self.安全, self.data_exposure, self.授權,
self.content_policy, self.business_impact, self.reliability
)
def affected_dimensions(self) -> int:
return sum(1 for s in [
self.安全, self.data_exposure, self.授權,
self.content_policy, self.business_impact, self.reliability
] if s > 0)
def classify_severity(
scores: SeverityScores,
reproducible: bool,
requires_authentication: bool,
requires_special_conditions: bool
) -> str:
"""
Classify finding severity based on dimension scores and exploitability.
Returns: 'Critical', 'High', 'Medium', 'Low', or 'Informational'
"""
max_score = scores.max_score()
affected = scores.affected_dimensions()
# Base severity from dimension scores
if max_score == 3 and affected >= 2:
base = "Critical"
elif max_score == 3:
base = "High"
elif max_score == 2 and affected >= 3:
base = "High"
elif max_score == 2:
base = "Medium"
elif max_score == 1 and affected >= 3:
base = "Medium"
elif max_score == 1:
base = "Low"
else:
base = "Informational"
# Adjust for exploitability
severity_order = ["Informational", "Low", "Medium", "High", "Critical"]
idx = severity_order.index(base)
if not reproducible and idx > 0:
idx -= 1 # Downgrade non-reproducible findings
if requires_special_conditions and idx > 0:
idx -= 1 # Downgrade findings requiring special conditions
# Authentication should not downgrade prompt-level findings
# (authenticated users are the expected threat actor for most AI attacks)
return severity_order[idx]Severity Decision Tree
Is the finding reproducible?
├── No → Maximum severity: Medium (regardless of impact)
└── Yes
    ├── Does it cause direct safety harm (actionable dangerous instructions)?
│ └── Yes → Critical
├── Does it expose PII or enable cross-tenant data access?
│ └── Yes → Critical (if at scale) or High (if limited)
    ├── Does it enable unauthorized actions via function calling?
│ └── Yes → High or Critical (depending on action severity)
├── Does it extract system prompts or confidential configuration?
│ └── Yes → High (trade secret exposure) or Medium (limited info)
├── Does it bypass content policy for prohibited categories?
│ └── Yes → High (severe content) or Medium (mild content)
├── Does it cause service degradation or denial?
│ └── Yes → Medium (temporary) or High (sustained)
    └── Is it an information disclosure with no direct exploitation path?
└── Yes → Low or Informational
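The interaction between a base rating and the exploitability downgrades in `classify_severity` can be illustrated with a minimal, self-contained sketch (the function name `finalize` is illustrative, not part of the calculator's API):

```python
# Minimal sketch of the exploitability downgrade logic described above.
# Mirrors classify_severity: non-reproducible and special-condition
# findings each drop one level, never below Informational.
ORDER = ["Informational", "Low", "Medium", "High", "Critical"]

def finalize(base: str, reproducible: bool, special_conditions: bool) -> str:
    idx = ORDER.index(base)
    if not reproducible and idx > 0:
        idx -= 1  # downgrade non-reproducible findings
    if special_conditions and idx > 0:
        idx -= 1  # downgrade findings requiring special conditions
    return ORDER[idx]

# A Critical-impact finding that only works under special conditions:
print(finalize("Critical", reproducible=True, special_conditions=True))   # High
# The same finding, also not reliably reproducible:
print(finalize("Critical", reproducible=False, special_conditions=True))  # Medium
```

Note that the two downgrades stack, which is why the decision tree caps non-reproducible findings well below their impact-only rating.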
Step 3: Calibrate with Reference Examples
Use these reference examples to calibrate your severity ratings against known finding types.
Critical Severity Examples
| Finding | Safety | Data | Auth | Content | Business | Reliability | Rating |
|---|---|---|---|---|---|---|---|
| Prompt injection enables arbitrary SQL execution via function calling | 0 | 3 | 3 | 0 | 3 | 0 | Critical |
| Jailbreak produces specific, actionable instructions for weapons construction | 3 | 0 | 0 | 3 | 3 | 0 | Critical |
| Cross-tenant data access exposes other customers' conversations containing PII | 0 | 3 | 3 | 0 | 3 | 0 | Critical |
| Indirect prompt injection in RAG documents can execute arbitrary function calls | 0 | 2 | 3 | 0 | 3 | 0 | Critical |
High Severity Examples
| Finding | Safety | Data | Auth | Content | Business | Reliability | Rating |
|---|---|---|---|---|---|---|---|
| System prompt fully extracted, revealing business logic and API keys | 0 | 3 | 0 | 0 | 2 | 0 | High |
| Content filter bypass generates harmful content across multiple categories | 2 | 0 | 0 | 2 | 2 | 0 | High |
| RAG knowledge base documents extractable verbatim (confidential docs) | 0 | 3 | 0 | 0 | 2 | 0 | High |
| Function calling abuse enables sending emails as the authenticated user | 0 | 0 | 2 | 0 | 2 | 0 | High |
Medium Severity Examples
| Finding | Safety | Data | Auth | Content | Business | Reliability | Rating |
|---|---|---|---|---|---|---|---|
| System prompt partially extracted (non-sensitive portions) | 0 | 2 | 0 | 0 | 1 | 0 | Medium |
| Content filter bypass for a single mildly inappropriate content category | 1 | 0 | 0 | 1 | 1 | 0 | Medium |
| Model reveals internal architecture details when asked | 0 | 1 | 0 | 0 | 1 | 0 | Medium |
| Rate limiting bypass allows 5x normal request volume | 0 | 0 | 1 | 0 | 1 | 2 | Medium |
Low Severity Examples
| Finding | Safety | Data | Auth | Content | Business | Reliability | Rating |
|---|---|---|---|---|---|---|---|
| Model confirms it is powered by GPT-4 (minor information disclosure) | 0 | 1 | 0 | 0 | 0 | 0 | Low |
| Verbose error messages reveal framework version | 0 | 1 | 0 | 0 | 0 | 0 | Low |
| Model can be prompted to adopt a slightly different persona (no policy bypass) | 0 | 0 | 0 | 1 | 0 | 0 | Low |
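One practical calibration exercise is to run selected reference rows back through the Step 2 base-severity rule and confirm they agree. This standalone sketch re-implements that rule (dimension order: Safety, Data, Auth, Content, Business, Reliability); the rows chosen are ones whose ratings follow from dimension scores alone, before any exploitability or context adjustments:

```python
# Standalone re-implementation of the Step 2 base-severity rule,
# used to sanity-check a few calibration rows from the tables above.
def base_severity(scores: list[int]) -> str:
    top = max(scores)
    affected = sum(1 for s in scores if s > 0)
    if top == 3 and affected >= 2:
        return "Critical"
    if top == 3 or (top == 2 and affected >= 3):
        return "High"
    if top == 2 or (top == 1 and affected >= 3):
        return "Medium"
    return "Low" if top == 1 else "Informational"

rows = [
    ([0, 3, 3, 0, 3, 0], "Critical"),  # SQL execution via function calling
    ([2, 0, 0, 2, 2, 0], "High"),      # multi-category content filter bypass
    ([0, 2, 0, 0, 1, 0], "Medium"),    # partial system prompt extraction
    ([0, 1, 0, 0, 0, 0], "Low"),       # model confirms its base model
]
for scores, expected in rows:
    assert base_severity(scores) == expected, (scores, expected)
print("selected calibration rows consistent")
```

If a row disagrees with the rule, either the dimension scores or the rating needs a documented justification before the report ships.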
Step 4: Handle Edge Cases and Adjustments
Exploitability Adjustments
| Factor | Adjustment | Rationale |
|---|---|---|
| Requires no authentication | +1 level | Unauthenticated access greatly expands the attacker pool |
| Requires multi-step attack chain | No change | If reproducible, complexity does not reduce impact |
| Non-deterministic (works <50% of time) | -1 level | Reduced reliability limits practical exploitation |
| Requires insider knowledge | -1 level | Reduced attacker pool, but still a valid finding |
| Works only on specific model version | Note, no change | May be fixed by model update, but current risk stands |
| Requires physical access | -2 levels (min Low) | Greatly reduced attack scenario applicability |
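The adjustment table amounts to shifting a rating up or down the five-level scale with a clamp, including the "minimum Low" floor for physical-access findings. A hedged sketch (function and parameter names are illustrative):

```python
# Sketch of applying the exploitability adjustment table above:
# shift the severity label by a number of levels, clamped to the
# scale, with an optional floor (e.g. "min Low" for physical access).
LEVELS = ["Informational", "Low", "Medium", "High", "Critical"]

def adjust(base: str, delta: int, floor: str = "Informational") -> str:
    idx = LEVELS.index(base) + delta
    idx = min(idx, len(LEVELS) - 1)        # cannot exceed Critical
    idx = max(idx, LEVELS.index(floor))    # cannot drop below the floor
    return LEVELS[idx]

# Requires no authentication: +1 level
print(adjust("High", +1))                   # Critical
# Requires physical access: -2 levels, minimum Low
print(adjust("Medium", -2, floor="Low"))    # Low
```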
Context-Dependent Severity
The same technical finding can have different severity ratings depending on context:
# Context-Dependent Severity Examples
## System Prompt Extraction
- Customer-facing chatbot with generic instructions → Medium
- Internal assistant with API keys in the system prompt → Critical
- Demo application with placeholder instructions → Low
## Content Policy Bypass
- Children's educational platform → +1 severity level
- Internal developer tool → -1 severity level
- Healthcare patient-facing system → +1 severity level
## Data Exfiltration via RAG
- Knowledge base contains public documentation → Low
- Knowledge base contains internal strategy documents → High
- Knowledge base contains customer PII → Critical
Step 5: Document the Severity Rationale
Every severity rating must include written rationale so the client understands the classification and can dispute it if they disagree.
Severity Documentation Template
# Finding: [Finding Title]
# ID: [F-XXX]
## Severity: [Rating]
### Dimension Scores
| Dimension | Score | Justification |
|-----------|-------|---------------|
| Safety | [0-3] | [Why this score] |
| Data Exposure | [0-3] | [Why this score] |
| Authorization Bypass | [0-3] | [Why this score] |
| Content Policy | [0-3] | [Why this score] |
| Business Impact | [0-3] | [Why this score] |
| Reliability | [0-3] | [Why this score] |
### Exploitability Assessment
- Reproducible: [Yes/No]
- Authentication required: [Yes/No]
- Special conditions: [None/Describe]
- Reliability: [Always/Usually/Sometimes/Rarely]
### Severity Rationale
[2-3 sentences explaining why this severity rating is appropriate.
Reference the highest-scoring dimensions and the exploitability factors.
If the rating was adjusted up or down from the calculated base, explain why.]
### Comparison Reference
This finding is comparable to [reference finding from calibration table]
because [explanation of similarity].
Step 6: Peer Review and Calibration
Before finalizing severity ratings, conduct a peer review to ensure consistency.
Calibration Process
# Severity Calibration Meeting
## Preparation
- Each finding rated independently by the tester who discovered it
- All findings with ratings shared to the team 24 hours before meeting
## Meeting Process
1. Review findings sorted by severity (Critical first, Informational last)
2. For each finding:
a. Discovering tester presents the finding and their rating
b. Other team members review and provide their independent rating
c. Discuss any disagreements (focus on dimension scores and rationale)
d. Reach consensus on final rating
e. If no consensus, engagement lead makes final decision with documented rationale
## Calibration Checks
- Are any two findings with the same severity clearly different in impact? → Revisit both
- Are any findings rated differently that seem similar? → Revisit both
- Does the overall severity distribution seem reasonable for this system? → Adjust outliers
- Would the client understand and agree with each rating? → Clarify rationale
Common Severity Classification Mistakes
- Rating everything High or Critical. If every finding is Critical, the client cannot prioritize. Use the full range of severity ratings. A system with 15 Critical findings has a different problem than severity classification.
- Conflating reproducibility with severity. A finding that works once in ten attempts still demonstrates a vulnerability. Note the reproduction rate in the exploitability section, but do not dismiss the finding; reduce by one severity level at most.
- Ignoring business context. A system prompt extraction finding has different severity depending on whether the prompt contains API keys (Critical) or generic instructions (Medium). Always ask what is in the system prompt before rating the finding.
- Rating theoretical attacks the same as demonstrated attacks. A theoretical cross-tenant attack (inferred from an observed architectural weakness) should be rated lower than demonstrated cross-tenant data access. Note the distinction clearly.
- Not documenting rationale. A severity rating without rationale is an opinion. A severity rating with documented dimension scores and comparison references is a defensible assessment. Always document why.
A red team discovers that a customer-facing chatbot's content filter can be bypassed to generate mildly inappropriate jokes, but cannot be bypassed for any harmful or dangerous content. How should this finding be classified?
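One defensible answer, worked through with the framework (the dimension values below are a judgment call, not the only reasonable scoring):

```python
# Scoring the chatbot content-filter scenario against the rubric.
# Dimension values are one analyst's judgment, not a fixed answer.
scores = {
    "safety": 1,           # mildly inappropriate content only, no harm
    "data_exposure": 0,
    "authorization": 0,
    "content_policy": 1,   # minor policy violation, single category
    "business_impact": 1,  # low reputation risk for a customer-facing bot
    "reliability": 0,
}
top = max(scores.values())
affected = sum(1 for v in scores.values() if v > 0)
# Max score 1 across 3 affected dimensions -> Medium base severity;
# the bypass is reproducible with no special conditions, so no downgrade.
severity = "Medium" if top == 1 and affected >= 3 else "Low"
print(severity)  # Medium
```

The key point is that breadth of low-level impact (three affected dimensions) lifts the rating above Low, while the absence of any harmful content keeps it well below High.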
Related Topics
- Evidence Collection Methods -- Collecting the evidence that supports severity classification
- Executive Summary Writing -- Communicating severity ratings to executives
- Mapping to OWASP LLM Top 10 -- Mapping findings to OWASP categories for context
- Remediation Verification Testing -- Retesting to verify severity reduction