Risk Scoring Frameworks for AI Vulnerabilities
Walkthrough for applying risk scoring frameworks to AI and LLM vulnerabilities, covering CVSS adaptation for AI, custom AI risk scoring matrices, severity classification, business impact assessment, and integration with existing vulnerability management processes.
Scoring AI vulnerabilities is harder than scoring traditional software vulnerabilities. CVSS, the industry standard for vulnerability scoring, was designed for deterministic software bugs -- buffer overflows, SQL injection, authentication bypasses. These have clear, reproducible exploitation paths and well-understood impacts. AI vulnerabilities are different: they are probabilistic (the same attack may succeed 30% of the time), their impact depends heavily on the application context, and many attack categories (like jailbreaking) have no direct CVSS analog. This walkthrough provides a practical framework for scoring AI vulnerabilities consistently and communicating risk effectively to stakeholders.
Step 1: Understanding the Limitations of CVSS for AI
Before building a new framework, understand what CVSS gets wrong for AI vulnerabilities:
# CVSS Limitations for AI Vulnerabilities
| CVSS Dimension | Traditional Vulnerability | AI Vulnerability | Gap |
|----------------|--------------------------|-------------------|-----|
| Attack Vector | Network, Adjacent, Local, Physical | Often Network (API) | Minor -- most AI is API-accessible |
| Attack Complexity | Low or High | Variable and probabilistic | **Major** -- AI attacks may work 5% or 95% of the time |
| Privileges Required | None, Low, High | Usually None (public-facing) | Minor |
| User Interaction | Required or None | Usually None | Minor |
| Scope | Changed or Unchanged | **Often Changed** (tool access) | Moderate -- scope changes are common |
| Confidentiality | None, Low, High | **Difficult to classify** | **Major** -- system prompt is "Low" or "High"? |
| Integrity | None, Low, High | **Difficult to classify** | **Major** -- generating misinformation is integrity impact but hard to quantify |
| Availability | None, Low, High | Usually Low | Minor |The two major gaps are:
-
Probabilistic exploitation: CVSS assumes an exploit either works or it does not. AI attacks have success rates ranging from 1% to 99%. A prompt injection that works 50% of the time is very different from one that works 5% of the time.
-
Context-dependent impact: The impact of a jailbreak depends entirely on what the model can do. A jailbroken chatbot that only generates text has a different risk profile than a jailbroken agent that can execute database queries.
Step 2: The AI Vulnerability Risk Scoring Framework (AI-VRS)
Build a scoring framework specifically designed for AI vulnerabilities. The AI-VRS adds dimensions that CVSS lacks while remaining compatible with existing vulnerability management workflows:
# scoring/ai_vrs.py
"""AI Vulnerability Risk Scoring Framework (AI-VRS)."""
from dataclasses import dataclass
from enum import Enum
class ExploitReliability(Enum):
"""How reliably can the vulnerability be exploited?"""
DETERMINISTIC = 4 # Works every time (>95% success rate)
RELIABLE = 3 # Works most of the time (60-95%)
MODERATE = 2 # Works sometimes (20-60%)
UNRELIABLE = 1 # Rarely works (<20%)
class SkillRequired(Enum):
"""What skill level is needed to exploit the vulnerability?"""
NONE = 4 # Anyone can do it (copy-paste prompt)
BASIC = 3 # Basic prompt engineering knowledge
INTERMEDIATE = 2 # Red team experience, tool familiarity
ADVANCED = 1 # Novel research, custom tooling needed
class DataExposureScope(Enum):
"""What data can be accessed through exploitation?"""
CRITICAL_SECRETS = 4 # API keys, credentials, PII databases
INTERNAL_CONFIG = 3 # System prompts, model configuration
SENSITIVE_DATA = 2 # User data, business logic details
MINIMAL = 1 # Non-sensitive information
class ActionScope(Enum):
"""What actions can an attacker take through exploitation?"""
CRITICAL_ACTIONS = 4 # Financial transactions, data deletion, privilege escalation
SIGNIFICANT_ACTIONS = 3 # Data modification, unauthorized reads, tool abuse
LIMITED_ACTIONS = 2 # Content policy bypass, misinformation generation
NO_ACTIONS = 1 # Information-only, no actionable impact
class DetectionDifficulty(Enum):
"""How hard is it to detect exploitation attempts?"""
VERY_HARD = 4 # Looks like normal usage, no anomaly signals
HARD = 3 # Subtle anomalies, requires AI-specific monitoring
MODERATE = 2 # Detectable with standard security monitoring
EASY = 1 # Obvious malicious patterns, easily flagged
class RemediationComplexity(Enum):
"""How difficult is it to fix the vulnerability?"""
FUNDAMENTAL = 4 # Requires architecture change or model replacement
SIGNIFICANT = 3 # Requires new defense layer or major config change
MODERATE = 2 # Configuration change or filter addition
SIMPLE = 1 # Parameter adjustment or rule addition
@dataclass
class AIVulnerabilityScore:
"""Complete AI vulnerability risk score."""
vulnerability_id: str
vulnerability_name: str
description: str
# Scoring dimensions
exploit_reliability: ExploitReliability
skill_required: SkillRequired
data_exposure_scope: DataExposureScope
action_scope: ActionScope
detection_difficulty: DetectionDifficulty
remediation_complexity: RemediationComplexity
# Context
affected_component: str
attack_category: str # OWASP LLM Top 10 mapping
business_context: str
@property
def exploitability_score(self) -> float:
"""How easy is it to exploit? (0-10 scale)"""
raw = (
self.exploit_reliability.value +
self.skill_required.value +
self.detection_difficulty.value
)
return round(raw / 12 * 10, 1) # Normalize to 0-10
@property
def impact_score(self) -> float:
"""What is the impact of exploitation? (0-10 scale)"""
raw = (
self.data_exposure_scope.value +
self.action_scope.value +
self.remediation_complexity.value
)
return round(raw / 12 * 10, 1) # Normalize to 0-10
@property
def overall_score(self) -> float:
"""Combined risk score (0-10 scale)."""
return round(
(self.exploitability_score * 0.45 + self.impact_score * 0.55),
1,
)
@property
def severity(self) -> str:
"""Map overall score to severity label."""
score = self.overall_score
if score >= 8.0:
return "Critical"
elif score >= 6.0:
return "High"
elif score >= 4.0:
return "Medium"
elif score >= 2.0:
return "Low"
return "Informational"Step 3: Scoring Real Vulnerabilities
Apply the framework to common AI vulnerability types:
# scoring/examples.py
"""Score examples for common AI vulnerability types."""
from scoring.ai_vrs import *
SCORED_VULNERABILITIES = [
AIVulnerabilityScore(
vulnerability_id="AI-001",
vulnerability_name="Direct Prompt Injection -- System Prompt Extraction",
description=(
"The application's system prompt, containing API keys and internal "
"configuration, can be extracted through direct prompt injection. "
"The attack succeeds approximately 70% of the time with basic payloads."
),
exploit_reliability=ExploitReliability.RELIABLE,
skill_required=SkillRequired.BASIC,
data_exposure_scope=DataExposureScope.CRITICAL_SECRETS,
action_scope=ActionScope.NO_ACTIONS,
detection_difficulty=DetectionDifficulty.HARD,
remediation_complexity=RemediationComplexity.MODERATE,
affected_component="LLM Chat Endpoint",
attack_category="LLM01: Prompt Injection",
business_context="Exposed API key provides access to the production database",
),
AIVulnerabilityScore(
vulnerability_id="AI-002",
vulnerability_name="Tool Abuse via Prompt Injection",
description=(
"An attacker can manipulate the model into invoking the refund "
"processing tool with arbitrary parameters through prompt injection. "
"The attack requires moderate prompt engineering skill and succeeds "
"approximately 40% of the time."
),
exploit_reliability=ExploitReliability.MODERATE,
skill_required=SkillRequired.INTERMEDIATE,
data_exposure_scope=DataExposureScope.SENSITIVE_DATA,
action_scope=ActionScope.CRITICAL_ACTIONS,
detection_difficulty=DetectionDifficulty.HARD,
remediation_complexity=RemediationComplexity.SIGNIFICANT,
affected_component="Agent Tool Calling",
attack_category="LLM01: Prompt Injection",
business_context="Could process unauthorized refunds up to $10,000 per transaction",
),
AIVulnerabilityScore(
vulnerability_id="AI-003",
vulnerability_name="RAG Poisoning via User-Uploaded Documents",
description=(
"Users can upload documents to the knowledge base that contain "
"adversarial instructions. When these documents are retrieved as "
"context, the model follows the embedded instructions."
),
exploit_reliability=ExploitReliability.RELIABLE,
skill_required=SkillRequired.INTERMEDIATE,
data_exposure_scope=DataExposureScope.SENSITIVE_DATA,
action_scope=ActionScope.SIGNIFICANT_ACTIONS,
detection_difficulty=DetectionDifficulty.VERY_HARD,
remediation_complexity=RemediationComplexity.SIGNIFICANT,
affected_component="RAG Pipeline",
attack_category="LLM01: Prompt Injection",
business_context="Could cause the model to provide incorrect information to all users",
),
AIVulnerabilityScore(
vulnerability_id="AI-004",
vulnerability_name="Jailbreak -- Safety Filter Bypass",
description=(
"The model's safety filters can be bypassed using role-play and "
"fictional framing techniques, allowing generation of content that "
"violates the application's content policy."
),
exploit_reliability=ExploitReliability.MODERATE,
skill_required=SkillRequired.BASIC,
data_exposure_scope=DataExposureScope.MINIMAL,
action_scope=ActionScope.LIMITED_ACTIONS,
detection_difficulty=DetectionDifficulty.MODERATE,
remediation_complexity=RemediationComplexity.FUNDAMENTAL,
affected_component="LLM Chat Endpoint",
attack_category="LLM01: Prompt Injection",
business_context="Reputational risk if users share bypass outputs publicly",
),
]
def print_scorecard():
"""Print a formatted scorecard for all scored vulnerabilities."""
print(f"{'ID':<10} {'Name':<45} {'Exploit':<10} {'Impact':<10} {'Overall':<10} {'Severity':<10}")
print("-" * 95)
for vuln in SCORED_VULNERABILITIES:
print(
f"{vuln.vulnerability_id:<10} "
f"{vuln.vulnerability_name[:43]:<45} "
f"{vuln.exploitability_score:<10.1f} "
f"{vuln.impact_score:<10.1f} "
f"{vuln.overall_score:<10.1f} "
f"{vuln.severity:<10}"
)
if __name__ == "__main__":
print_scorecard()Step 4: Business Impact Assessment
Technical severity alone does not determine priority. Overlay business impact to produce actionable prioritization:
# Business Impact Assessment Matrix
| Finding | Technical Severity | Business Impact Factors | Adjusted Priority |
|---------|-------------------|------------------------|-------------------|
| AI-001: System Prompt Extraction (API key) | High | Key provides production DB access; breach notification required | **Critical** (escalated) |
| AI-002: Tool Abuse (refunds) | High | Direct financial loss; $10K per transaction; rate limited to 100/day | **Critical** (confirmed) |
| AI-003: RAG Poisoning | High | Affects all users; misinformation risk in healthcare context | **Critical** (escalated) |
| AI-004: Jailbreak | Medium | Reputational risk only; no data exposure; content moderation team can respond | **Medium** (confirmed) |
## Business Impact Factors to Consider:
1. **Financial Impact**: Direct monetary loss, regulatory fines, legal liability
2. **Data Sensitivity**: PII, PHI, financial records, trade secrets
3. **Blast Radius**: Number of users affected, exposure duration
4. **Regulatory Requirements**: HIPAA, GDPR, PCI-DSS, AI-specific regulations
5. **Reputational Impact**: Public-facing system, media attention risk
6. **Operational Impact**: System downtime, incident response cost
7. **Legal Liability**: Duty of care, contractual obligationsStep 5: Generating the Risk Report
Compile scores into a format that both technical and non-technical stakeholders can use:
# scoring/report_generator.py
"""Generate a risk scoring report from scored vulnerabilities."""
from datetime import datetime
def generate_risk_report(vulnerabilities: list, engagement_name: str) -> str:
"""Generate a markdown risk report."""
# Sort by overall score descending
sorted_vulns = sorted(vulnerabilities, key=lambda v: v.overall_score, reverse=True)
report = f"""# AI Vulnerability Risk Assessment Report
**Engagement**: {engagement_name}
**Date**: {datetime.now().strftime('%Y-%m-%d')}
**Scoring Framework**: AI-VRS (AI Vulnerability Risk Scoring)
## Executive Summary
This assessment identified **{len(vulnerabilities)} vulnerabilities** in the AI application.
| Severity | Count |
|----------|-------|
| Critical | {sum(1 for v in sorted_vulns if v.severity == 'Critical')} |
| High | {sum(1 for v in sorted_vulns if v.severity == 'High')} |
| Medium | {sum(1 for v in sorted_vulns if v.severity == 'Medium')} |
| Low | {sum(1 for v in sorted_vulns if v.severity == 'Low')} |
## Risk Scoring Overview
Each vulnerability is scored across six dimensions:
| Dimension | What It Measures | Scale |
|-----------|-----------------|-------|
| Exploit Reliability | How consistently the attack succeeds | 1 (unreliable) - 4 (deterministic) |
| Skill Required | Attacker expertise needed | 1 (advanced) - 4 (none) |
| Data Exposure | What data is at risk | 1 (minimal) - 4 (critical secrets) |
| Action Scope | What actions an attacker can take | 1 (none) - 4 (critical actions) |
| Detection Difficulty | How hard to detect exploitation | 1 (easy) - 4 (very hard) |
| Remediation Complexity | How hard to fix | 1 (simple) - 4 (fundamental) |
## Detailed Findings
"""
for vuln in sorted_vulns:
report += f"""### {vuln.vulnerability_id}: {vuln.vulnerability_name}
**Severity**: {vuln.severity} ({vuln.overall_score}/10)
**Exploitability**: {vuln.exploitability_score}/10 | **Impact**: {vuln.impact_score}/10
**OWASP Category**: {vuln.attack_category}
**Affected Component**: {vuln.affected_component}
**Description**: {vuln.description}
**Business Context**: {vuln.business_context}
**Scoring Breakdown**:
| Dimension | Value | Score |
|-----------|-------|-------|
| Exploit Reliability | {vuln.exploit_reliability.name} | {vuln.exploit_reliability.value}/4 |
| Skill Required | {vuln.skill_required.name} | {vuln.skill_required.value}/4 |
| Data Exposure | {vuln.data_exposure_scope.name} | {vuln.data_exposure_scope.value}/4 |
| Action Scope | {vuln.action_scope.name} | {vuln.action_scope.value}/4 |
| Detection Difficulty | {vuln.detection_difficulty.name} | {vuln.detection_difficulty.value}/4 |
| Remediation Complexity | {vuln.remediation_complexity.name} | {vuln.remediation_complexity.value}/4 |
---
"""
return reportCommon Pitfalls and Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Scores inconsistent across testers | Subjective dimension interpretation | Calibrate with scored examples, have team review edge cases |
| Client disputes severity | Business context not considered | Always overlay business impact assessment on technical scores |
| All findings scored as "High" | Scoring dimensions not granular enough | Use all six dimensions, avoid rounding up |
| Stakeholders ignore low-severity findings | Report buried in Critical findings | Separate executive summary from detailed findings |
| Scores not comparable to CVSS | Different scale and methodology | Provide a CVSS-equivalent mapping table in the report appendix |
| Remediation not prioritized | Scoring alone does not determine fix order | Add a remediation priority column that considers fix effort vs. risk reduction |
Key Takeaways
Effective risk scoring for AI vulnerabilities requires going beyond traditional CVSS:
- Probabilistic exploitation must be quantified -- report the success rate and score accordingly. A 5% success rate is still a valid finding but scores differently than 95%.
- Context determines impact -- the same vulnerability (e.g., prompt injection) has vastly different impacts depending on what the model can access and do. Score impact based on the specific application, not the generic vulnerability type.
- Detection difficulty matters -- an easily detected attack is less risky than one that looks like normal usage. Include detection difficulty in the score.
- Business impact overrides technical scores -- always overlay business context. A technically "Medium" finding that violates a regulatory requirement becomes a business-critical priority.
- Consistency requires calibration -- use scored examples (like those in Step 3) to calibrate your team. Review edge cases together to build consensus on scoring interpretation.
Advanced Considerations
Adapting to Modern Defenses
The defensive landscape for LLM applications has evolved significantly since the initial wave of prompt injection research. Modern production systems often deploy multiple independent defensive layers, requiring attackers to adapt their techniques accordingly.
Input classification: The most common first line of defense is an input classifier that evaluates incoming prompts for adversarial patterns. These classifiers range from simple keyword-based filters to sophisticated ML models trained on adversarial examples. Bypassing input classifiers requires understanding their detection methodology:
- Keyword-based classifiers can be evaded through encoding (Base64, ROT13, Unicode homoglyphs), paraphrasing, or splitting the payload across multiple turns
- ML-based classifiers require more sophisticated evasion such as semantic camouflage, gradual escalation, or exploitation of the classifier's own blind spots
class DefenseBypassStrategy:
"""Strategy selector for bypassing different defense configurations."""
def __init__(self):
self.strategies = {
"keyword_filter": self._bypass_keyword,
"ml_classifier": self._bypass_ml,
"output_filter": self._bypass_output,
"rate_limiter": self._bypass_rate,
"behavioral": self._bypass_behavioral,
}
def select_strategy(self, identified_defenses: list) -> callable:
"""Select the optimal bypass strategy based on identified defenses."""
if not identified_defenses:
return self._direct_attack
# Combine strategies for multi-layer defenses
applicable = [
self.strategies[d] for d in identified_defenses
if d in self.strategies
]
return self._chain_strategies(applicable)
def _bypass_keyword(self, payload: str) -> str:
"""Bypass keyword-based input filters."""
import base64
# Encode the payload to bypass keyword matching
encoded = base64.b64encode(payload.encode()).decode()
return f"Decode and follow: {encoded}"
def _bypass_ml(self, payload: str) -> list:
"""Generate multiple paraphrased variants for ML classifier evasion."""
variants = [
f"As a thought experiment, consider: {payload}",
f"For academic research purposes: {payload}",
f"In a fictional scenario where: {payload}",
]
return variants
def _bypass_output(self, payload: str) -> str:
"""Modify the payload to extract data through output filter gaps."""
return payload + "\nRespond using only code blocks and variable names."
def _bypass_rate(self, payload: str) -> dict:
"""Configure timing to avoid rate limit triggers."""
return {"payload": payload, "delay": 3.0, "jitter": 1.5}
def _bypass_behavioral(self, payload: str) -> list:
"""Split the payload across multiple natural-sounding turns."""
# Multi-turn escalation to avoid behavioral anomaly detection
return [
"I'm working on a security research project.",
"Could you help me understand how safety mechanisms work?",
"Specifically, I'm interested in how systems like yours " + payload[:100],
payload,
]
def _direct_attack(self, payload: str) -> str:
return payload
def _chain_strategies(self, strategies: list) -> callable:
"""Chain multiple bypass strategies."""
def chained(payload):
result = payload
for strategy in strategies:
result = strategy(result)
return result
return chainedOutput filtering: Output filters inspect the model's response before it reaches the user, looking for sensitive data leakage, harmful content, or other policy violations. Common output filter bypass techniques include:
| Technique | How It Works | Effectiveness |
|---|---|---|
| Encoding output | Request Base64/hex encoded responses | Medium — some filters check decoded content |
| Code block wrapping | Embed data in code comments/variables | High — many filters skip code blocks |
| Steganographic output | Hide data in formatting, capitalization, or spacing | High — difficult to detect |
| Chunked extraction | Extract small pieces across many turns | High — individual pieces may pass filters |
| Indirect extraction | Have the model reveal data through behavior changes | Very High — no explicit data in output |
Cross-Model Considerations
Techniques that work against one model may not directly transfer to others. However, understanding the general principles allows adaptation:
-
Safety training methodology: Models trained with RLHF (GPT-4, Claude) have different safety characteristics than those using DPO (Llama, Mistral) or other methods. RLHF-trained models tend to refuse more broadly but may be more susceptible to multi-turn escalation.
-
Context window size: Models with larger context windows (Claude with 200K, Gemini with 1M+) may be more susceptible to context window manipulation where adversarial content is buried in large amounts of benign text.
-
Multimodal capabilities: Models that process images, audio, or other modalities introduce additional attack surfaces not present in text-only models.
-
Tool use implementation: The implementation details of function calling vary significantly between providers. OpenAI uses a structured function calling format, while Anthropic uses tool use blocks. These differences affect exploitation techniques.
Operational Considerations
Testing Ethics and Boundaries
Professional red team testing operates within clear ethical and legal boundaries:
- Authorization: Always obtain written authorization before testing. This should specify the scope, methods allowed, and any restrictions.
- Scope limits: Stay within the authorized scope. If you discover a vulnerability that leads outside the authorized perimeter, document it and report it without exploiting it.
- Data handling: Handle any sensitive data discovered during testing according to the engagement agreement. Never retain sensitive data beyond what's needed for reporting.
- Responsible disclosure: Follow responsible disclosure practices for any vulnerabilities discovered, particularly if they affect systems beyond your testing scope.
Documenting Results
Professional documentation follows a structured format:
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
@dataclass
class Finding:
"""Structure for documenting a security finding."""
id: str
title: str
severity: str # Critical, High, Medium, Low, Informational
category: str # OWASP LLM Top 10 category
description: str
steps_to_reproduce: list[str]
impact: str
recommendation: str
evidence: list[str] = field(default_factory=list)
mitre_atlas: Optional[str] = None
cvss_score: Optional[float] = None
discovered_at: str = field(default_factory=lambda: datetime.now().isoformat())
def to_report_section(self) -> str:
"""Generate a report section for this finding."""
steps = "\n".join(f" {i+1}. {s}" for i, s in enumerate(self.steps_to_reproduce))
return f"""
### {self.id}: {self.title}
**Severity**: {self.severity}
**Category**: {self.category}
{f"**MITRE ATLAS**: {self.mitre_atlas}" if self.mitre_atlas else ""}
#### Description
{self.description}
#### Steps to Reproduce
{steps}
#### Impact
{self.impact}
#### Recommendation
{self.recommendation}
"""This structured approach ensures that findings are actionable and that remediation teams have the information they need to address the vulnerabilities effectively.