Cost Security and Budget Controls for Cloud AI
Protecting cloud AI deployments from cost-based attacks, including denial-of-wallet, token exhaustion, and auto-scaling abuse, with budget controls across AWS, Azure, and GCP.
Overview
Cloud AI services have a unique cost profile that creates novel attack vectors. Unlike traditional cloud compute, where costs scale with provisioned resources, AI inference costs scale with usage -- specifically, with the number of tokens processed. A single compromised API key or misconfigured endpoint can generate thousands of dollars in charges within minutes, because there is no physical resource to provision and the marginal cost of each additional request is charged immediately.
The threat model for cloud AI cost security includes three categories: external attackers who gain access to AI service credentials, insiders who abuse AI services for unauthorized purposes, and automated systems that malfunction and generate runaway costs. All three require a combination of preventive controls (budget limits, rate limiting) and detective controls (anomaly detection, alerting).
This article covers the cost attack surface for each major cloud AI provider, practical budget control implementations, and red team techniques for testing cost security controls.
Cost Attack Surface
Token Economics and Attack Amplification
Cloud AI pricing is based on tokens (roughly 4 characters per token for English text). The cost asymmetry between input and output tokens creates attack amplification opportunities:
| Provider | Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Amplification Factor |
|---|---|---|---|---|
| AWS Bedrock | Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | 5x output vs input |
| Azure OpenAI | GPT-4o | ~$2.50 | ~$10.00 | 4x output vs input |
| GCP Vertex AI | Gemini 1.5 Pro | ~$1.25 | ~$5.00 | 4x output vs input |
| AWS Bedrock | Claude 3 Opus | ~$15.00 | ~$75.00 | 5x output vs input |
Attackers can maximize cost damage by crafting prompts that generate maximum-length outputs. A short prompt like "Write a 4000-word essay on any topic" costs pennies in input tokens but generates dollars in output tokens.
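The asymmetry can be made concrete with a quick calculation. This sketch uses the approximate Claude 3.5 Sonnet list prices from the table above and assumes a 4000-word essay comes to roughly 6,000 output tokens (about 1.5 tokens per word); both figures are illustrative:

```python
# Approximate Claude 3.5 Sonnet list prices from the table above
INPUT_COST_PER_M = 3.00    # USD per 1M input tokens
OUTPUT_COST_PER_M = 15.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_COST_PER_M

# A ~12-token prompt that elicits ~6,000 output tokens (a 4000-word essay)
cost = request_cost(input_tokens=12, output_tokens=6000)
ratio = (6000 * OUTPUT_COST_PER_M) / (12 * INPUT_COST_PER_M)
print(f"cost ~ ${cost:.4f}, output/input cost ratio ~ {ratio:.0f}x")
```

The output cost dwarfs the input cost by a factor of thousands, which is why the amplification factor matters far more than the per-request price.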
Denial-of-Wallet Attack Patterns
"""
Cost attack pattern analysis for 雲端 AI services.
For authorized 紅隊 assessments only.
"""
def calculate_attack_cost_projection(
requests_per_second: float,
avg_input_tokens: int,
avg_output_tokens: int,
input_cost_per_million: float,
output_cost_per_million: float,
duration_hours: float,
) -> dict:
"""Project the cost of a sustained denial-of-wallet attack."""
total_requests = requests_per_second * 3600 * duration_hours
total_input_tokens = total_requests * avg_input_tokens
total_output_tokens = total_requests * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_cost_per_million
output_cost = (total_output_tokens / 1_000_000) * output_cost_per_million
total_cost = input_cost + output_cost
return {
"total_requests": int(total_requests),
"total_input_tokens": int(total_input_tokens),
"total_output_tokens": int(total_output_tokens),
"input_cost_usd": round(input_cost, 2),
"output_cost_usd": round(output_cost, 2),
"total_cost_usd": round(total_cost, 2),
"cost_per_hour_usd": round(total_cost / duration_hours, 2),
}
# 範例: Sustained attack using GPT-4o at 10 requests/second
projection = calculate_attack_cost_projection(
requests_per_second=10,
avg_input_tokens=100, # Short malicious prompt
avg_output_tokens=4000, # Maximum 輸出
input_cost_per_million=2.50, # GPT-4o 輸入
output_cost_per_million=10.00, # GPT-4o 輸出
duration_hours=24,
)
# Result: ~$34,560 in 24 hours from a single compromised keyProvisioned Throughput Abuse
Beyond pay-per-token costs, cloud AI providers offer provisioned throughput (reserved capacity) that costs significantly more:
- AWS Bedrock Provisioned Throughput: Billed hourly at rates up to $50+/hour per model unit. Creating provisioned throughput with a compromised credential can cost thousands before detection.
- Azure OpenAI Provisioned-Managed: PTUs (Provisioned Throughput Units) are committed capacity billed hourly. An attacker creating a large PTU deployment generates immediate financial liability.
- GCP Vertex AI Provisioned Throughput: Reserved capacity for Gemini models, billed on commitment.
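As a rough illustration of the liability window, the hourly rates above translate into a simple exposure estimate. The $50/hour figure is the Bedrock order-of-magnitude rate cited above; actual rates vary by model and commitment term:

```python
def provisioned_abuse_exposure(
    model_units: int,
    hourly_rate_per_unit_usd: float,
    hours_until_detection: float,
) -> float:
    """Estimate spend incurred before a rogue provisioned deployment is detected."""
    return model_units * hourly_rate_per_unit_usd * hours_until_detection

# 10 model units at $50/hour, detected after one business day
exposure = provisioned_abuse_exposure(10, 50.0, 8)
print(f"exposure ~ ${exposure:,.0f}")  # 10 * 50 * 8 = $4,000
```

The exposure scales linearly with detection delay, which is why the IAM checks below matter as much as billing alerts.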
```python
import boto3
from botocore.exceptions import ClientError

def test_provisioned_throughput_controls(
    session: boto3.Session,
    region: str = "us-east-1",
) -> dict:
    """Test whether provisioned throughput creation is properly restricted."""
    bedrock = session.client("bedrock", region_name=region)
    findings = []
    # Test 1: can the current identity create provisioned throughput?
    try:
        # Attempt to create a minimal provisioned throughput. If the call
        # succeeds, the identity holds the permission; delete immediately
        # to avoid ongoing hourly charges.
        response = bedrock.create_provisioned_model_throughput(
            modelUnits=1,
            provisionedModelName="security-test-do-not-create",
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
        )
        bedrock.delete_provisioned_model_throughput(
            provisionedModelId=response["provisionedModelArn"]
        )
        findings.append({
            "severity": "HIGH",
            "finding": "Identity can create provisioned throughput",
            "detail": "This identity can create provisioned model throughput, "
                      "which could cost $50+/hour per model unit. Restrict "
                      "bedrock:CreateProvisionedModelThroughput to admin roles.",
        })
    except ClientError as e:
        if e.response["Error"]["Code"] == "AccessDeniedException":
            findings.append({
                "severity": "INFO",
                "finding": "Provisioned throughput creation blocked by IAM",
            })
        elif e.response["Error"]["Code"] == "ValidationException":
            findings.append({
                "severity": "HIGH",
                "finding": "Identity has CreateProvisionedModelThroughput permission",
                "detail": "Request reached validation (past the IAM check). The identity "
                          "could create valid provisioned throughput with correct parameters.",
            })
    return {"findings": findings}
```

Budget Controls Implementation
AWS Budget Controls
```python
import boto3

def create_bedrock_budget(
    session: boto3.Session,
    account_id: str,
    monthly_limit_usd: float,
    alert_thresholds: list = None,
    notification_email: str = None,
) -> dict:
    """Create an AWS Budget specifically for Bedrock spending."""
    budgets = session.client("budgets")
    if alert_thresholds is None:
        alert_thresholds = [50, 80, 100]  # percent
    subscribers = []
    if notification_email:
        subscribers = [
            {"SubscriptionType": "EMAIL", "Address": notification_email}
        ]
    notifications = []
    for threshold in alert_thresholds:
        notifications.append({
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": subscribers,
        })
    budget = {
        "BudgetName": "bedrock-monthly-limit",
        "BudgetLimit": {
            "Amount": str(monthly_limit_usd),
            "Unit": "USD",
        },
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "CostFilters": {
            "Service": ["Amazon Bedrock"],
        },
    }
    try:
        budgets.create_budget(
            AccountId=account_id,
            Budget=budget,
            NotificationsWithSubscribers=notifications,
        )
        return {
            "budget_name": "bedrock-monthly-limit",
            "limit_usd": monthly_limit_usd,
            "alert_thresholds": alert_thresholds,
            "status": "created",
        }
    except Exception as e:
        return {"error": str(e)}

def create_bedrock_cost_anomaly_detector(
    session: boto3.Session,
) -> dict:
    """Create a Cost Anomaly Detection monitor for Bedrock."""
    ce = session.client("ce")
    try:
        response = ce.create_anomaly_monitor(
            AnomalyMonitor={
                "MonitorName": "bedrock-cost-anomaly",
                "MonitorType": "DIMENSIONAL",
                "MonitorDimension": "SERVICE",
            }
        )
        monitor_arn = response["MonitorArn"]
        # Create an anomaly subscription for alerts
        ce.create_anomaly_subscription(
            AnomalySubscription={
                "SubscriptionName": "bedrock-anomaly-alerts",
                "MonitorArnList": [monitor_arn],
                "Frequency": "IMMEDIATE",
                "Threshold": 50.0,  # Alert on anomalies over $50
                "Subscribers": [
                    {
                        "Type": "SNS",
                        "Address": "arn:aws:sns:us-east-1:ACCOUNT:bedrock-cost-alerts",
                    }
                ],
            }
        )
        return {
            "monitor_arn": monitor_arn,
            "threshold_usd": 50.0,
            "status": "created",
        }
    except Exception as e:
        return {"error": str(e)}
```

Azure OpenAI Quota and Budget Controls
Azure OpenAI provides tokens-per-minute (TPM) quotas at the deployment level, which serve as the primary rate-limiting mechanism:
```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.consumption import ConsumptionManagementClient

def configure_deployment_rate_limits(
    subscription_id: str,
    resource_group: str,
    account_name: str,
    deployment_name: str,
    tokens_per_minute_limit: int = 10000,
) -> dict:
    """Configure rate limits on an Azure OpenAI deployment."""
    credential = DefaultAzureCredential()
    client = CognitiveServicesManagementClient(credential, subscription_id)
    # Update the deployment's SKU; capacity is expressed in thousands
    # of tokens per minute
    deployment = client.deployments.begin_create_or_update(
        resource_group,
        account_name,
        deployment_name,
        {
            "sku": {
                "name": "Standard",
                "capacity": tokens_per_minute_limit // 1000,
            },
        },
    ).result()
    return {
        "deployment": deployment_name,
        "capacity_k_tpm": deployment.sku.capacity,
        "effective_tpm": deployment.sku.capacity * 1000,
        "detail": "Requests exceeding this rate will receive 429 responses.",
    }

def create_azure_ai_cost_alert(
    subscription_id: str,
    resource_group: str,
    monthly_budget_usd: float,
    contact_emails: list,
) -> dict:
    """Create a budget alert for Azure AI spending."""
    credential = DefaultAzureCredential()
    # Budgets live in the Consumption API (Microsoft.Consumption),
    # not the Cost Management query API
    consumption_client = ConsumptionManagementClient(credential, subscription_id)
    scope = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    budget = {
        "category": "Cost",
        "amount": monthly_budget_usd,
        "time_grain": "Monthly",
        "time_period": {
            "start_date": "2026-03-01T00:00:00Z",
            "end_date": "2027-03-01T00:00:00Z",
        },
        "filter": {
            "dimensions": {
                "name": "ServiceName",
                "operator": "In",
                "values": ["Azure OpenAI", "Cognitive Services"],
            }
        },
        "notifications": {
            "actual_50_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 50,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
            "actual_80_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 80,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
            "actual_100_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 100,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
        },
    }
    consumption_client.budgets.create_or_update(
        scope, "ai-services-budget", budget
    )
    return {
        "budget_name": "ai-services-budget",
        "monthly_limit_usd": monthly_budget_usd,
        "alerts": ["50%", "80%", "100%"],
    }
```

GCP Vertex AI Budget Controls
```python
from google.cloud import billing_budgets_v1

def create_vertex_ai_budget(
    billing_account_id: str,
    project_id: str,
    monthly_limit_usd: float,
    notification_pubsub_topic: str = None,
) -> dict:
    """Create a budget for Vertex AI spending on GCP."""
    client = billing_budgets_v1.BudgetServiceClient()
    budget = billing_budgets_v1.Budget(
        display_name="vertex-ai-monthly-budget",
        budget_filter=billing_budgets_v1.Filter(
            projects=[f"projects/{project_id}"],
            # Note: Filter.services expects Cloud Billing service IDs of the
            # form "services/XXXX-XXXX-XXXX"; look up the Vertex AI service
            # ID for your billing account rather than using the API name.
            services=["services/aiplatform.googleapis.com"],
        ),
        amount=billing_budgets_v1.BudgetAmount(
            specified_amount={"units": int(monthly_limit_usd), "currency_code": "USD"}
        ),
        threshold_rules=[
            billing_budgets_v1.ThresholdRule(
                threshold_percent=0.5,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
            billing_budgets_v1.ThresholdRule(
                threshold_percent=0.8,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
            billing_budgets_v1.ThresholdRule(
                threshold_percent=1.0,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
        ],
    )
    if notification_pubsub_topic:
        budget.notifications_rule = billing_budgets_v1.NotificationsRule(
            pubsub_topic=notification_pubsub_topic,
            schema_version="1.0",
        )
    request = billing_budgets_v1.CreateBudgetRequest(
        parent=f"billingAccounts/{billing_account_id}",
        budget=budget,
    )
    result = client.create_budget(request=request)
    return {
        "budget_name": result.name,
        "monthly_limit_usd": monthly_limit_usd,
        "status": "created",
    }
```

Application-Level Rate Limiting
Cloud provider budget alerts are reactive -- they notify after spend has occurred. For real-time cost protection, implement application-level rate limiting:
```python
import time
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class AIRateLimiter:
    """Token-aware rate limiter for cloud AI services."""
    max_tokens_per_minute: int = 100000
    max_tokens_per_hour: int = 2000000
    max_cost_per_hour_usd: float = 50.0
    cost_per_input_token: float = 0.000003   # $3/1M tokens
    cost_per_output_token: float = 0.000015  # $15/1M tokens
    _lock: Lock = field(default_factory=Lock)
    _minute_tokens: list = field(default_factory=list)
    _hour_tokens: list = field(default_factory=list)
    _hour_cost: list = field(default_factory=list)

    def check_and_record(
        self,
        input_tokens: int,
        estimated_output_tokens: int,
        user_id: str = "default",  # reserved for per-user limits
    ) -> dict:
        """Check if a request is within rate limits and record it."""
        with self._lock:
            now = time.time()
            total_tokens = input_tokens + estimated_output_tokens
            estimated_cost = (
                input_tokens * self.cost_per_input_token
                + estimated_output_tokens * self.cost_per_output_token
            )
            # Drop entries that have aged out of the sliding windows
            minute_cutoff = now - 60
            hour_cutoff = now - 3600
            self._minute_tokens = [
                (t, tokens) for t, tokens in self._minute_tokens if t > minute_cutoff
            ]
            self._hour_tokens = [
                (t, tokens) for t, tokens in self._hour_tokens if t > hour_cutoff
            ]
            self._hour_cost = [
                (t, cost) for t, cost in self._hour_cost if t > hour_cutoff
            ]
            # Check the per-minute token limit
            minute_total = sum(tokens for _, tokens in self._minute_tokens) + total_tokens
            if minute_total > self.max_tokens_per_minute:
                return {
                    "allowed": False,
                    "reason": "tokens_per_minute_exceeded",
                    "current": minute_total - total_tokens,
                    "limit": self.max_tokens_per_minute,
                    "retry_after_seconds": 60,
                }
            # Check the per-hour token limit
            hour_total = sum(tokens for _, tokens in self._hour_tokens) + total_tokens
            if hour_total > self.max_tokens_per_hour:
                return {
                    "allowed": False,
                    "reason": "tokens_per_hour_exceeded",
                    "current": hour_total - total_tokens,
                    "limit": self.max_tokens_per_hour,
                }
            # Check the per-hour cost limit
            hour_cost = sum(cost for _, cost in self._hour_cost) + estimated_cost
            if hour_cost > self.max_cost_per_hour_usd:
                return {
                    "allowed": False,
                    "reason": "cost_per_hour_exceeded",
                    "current_cost_usd": round(hour_cost - estimated_cost, 4),
                    "limit_usd": self.max_cost_per_hour_usd,
                }
            # Record the request
            self._minute_tokens.append((now, total_tokens))
            self._hour_tokens.append((now, total_tokens))
            self._hour_cost.append((now, estimated_cost))
            return {
                "allowed": True,
                "estimated_cost_usd": round(estimated_cost, 6),
                "minute_usage_pct": round((minute_total / self.max_tokens_per_minute) * 100, 1),
                "hour_usage_pct": round((hour_total / self.max_tokens_per_hour) * 100, 1),
                "hour_cost_pct": round((hour_cost / self.max_cost_per_hour_usd) * 100, 1),
            }
```

Red Team Testing for Cost Controls
Authorized Cost Security Testing
When conducting authorized assessments, test cost controls without generating actual excessive spend:
```python
def test_cost_controls_safely(
    rate_limiter: AIRateLimiter,
    cloud_client: object,
) -> dict:
    """Test cost control effectiveness with minimal actual spend."""
    results = {
        "rate_limiter_tests": [],
        "budget_alert_tests": [],
        "credential_scope_tests": [],
    }
    # Test 1: verify the rate limiter blocks excessive requests
    for i in range(20):
        check = rate_limiter.check_and_record(
            input_tokens=1000,
            estimated_output_tokens=4000,
            user_id="test-user",
        )
        if not check["allowed"]:
            results["rate_limiter_tests"].append({
                "test": "rapid_request_blocking",
                "pass": True,
                "blocked_at_request": i + 1,
                "reason": check["reason"],
            })
            break
    else:
        results["rate_limiter_tests"].append({
            "test": "rapid_request_blocking",
            "pass": False,
            "detail": "Rate limiter did not block after 20 rapid requests",
        })
    # Test 2: verify the max_tokens parameter is enforced.
    # Send a single request with a very small max_tokens to confirm
    # the application enforces output length limits.
    results["rate_limiter_tests"].append({
        "test": "output_token_limit",
        "recommendation": "Verify the application sets max_tokens on all API calls. "
                          "Without max_tokens, a single request can generate the "
                          "model's full context window in output tokens.",
    })
    # Test 3: check whether provisioned throughput creation is IAM-restricted
    results["credential_scope_tests"].append({
        "test": "provisioned_throughput_restriction",
        "recommendation": "Verify application service accounts cannot call "
                          "CreateProvisionedModelThroughput or equivalent.",
    })
    return results
```

Cost Attack Indicators
Monitor for these patterns that indicate cost-based attacks:
| Indicator | Description | Detection Method |
|---|---|---|
| Token volume spike | Sudden increase in tokens processed per minute | CloudWatch/Azure Monitor/Cloud Monitoring metrics |
| Max output requests | High percentage of requests generating maximum-length outputs | Application-level logging of output token counts |
| Off-hours usage | AI API calls during periods when the application should be idle | Time-based alerting on API call metrics |
| New IP/identity | API calls from previously unseen IP addresses or identities | CloudTrail/Activity Log/Audit Log analysis |
| Provisioned capacity creation | New provisioned throughput or committed-use capacity | Management API event monitoring |
| Burst after silence | A dormant credential suddenly generating high volume | Behavioral analysis of per-credential usage patterns |
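A minimal sketch of the "token volume spike" and "burst after silence" detections, assuming per-minute token counts are already being exported from application logs (the window size and sigma threshold here are illustrative, not tuned values):

```python
from statistics import mean, stdev

def detect_token_spike(
    history: list[int],      # per-minute token counts, oldest first
    current: int,            # latest minute's token count
    min_history: int = 30,   # require a baseline before alerting
    sigma_threshold: float = 4.0,
) -> dict:
    """Flag the current minute if it deviates sharply from the rolling baseline."""
    if len(history) < min_history:
        return {"alert": False, "reason": "insufficient_history"}
    baseline = mean(history)
    spread = stdev(history) or 1.0  # avoid division by zero on flat history
    z = (current - baseline) / spread
    # A dormant credential (baseline near zero) bursting to high volume
    # also trips this check, covering the "burst after silence" indicator.
    return {
        "alert": z > sigma_threshold,
        "z_score": round(z, 2),
        "baseline_tokens_per_minute": round(baseline, 1),
    }

history = [1000] * 20 + [1100] * 10   # steady ~1k tokens/minute
print(detect_token_spike(history, current=250_000))  # large spike -> alert
```

Production detectors would track per-credential baselines and combine this with the off-hours and new-identity signals from the table, but the core check is this same deviation-from-baseline test.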
Defense-in-Depth Strategy
Layer cost controls from the cloud provider level down to the application level:
- Cloud provider budget alerts: Set monthly budgets with alerts at 50%, 80%, and 100% thresholds. These are the last line of defense but have reporting delays.
- IAM permission restrictions: Remove CreateProvisionedModelThroughput and equivalent high-cost actions from all non-admin identities.
- Deployment-level quotas: Configure TPM (tokens per minute) limits on each model deployment. Azure OpenAI has native support; other providers require API gateway rate limiting.
- Application rate limiting: Implement token-aware rate limiting in the application layer with per-user and global limits.
- Request validation: Set max_tokens on every API call. Reject user inputs that exceed reasonable length thresholds.
- Real-time anomaly detection: Use cost anomaly detection services and custom monitoring to detect unusual spend patterns within minutes, not hours.
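The request-validation layer can be sketched as a thin wrapper around the model call. Here `call_model` and both limits are placeholders for illustration, not a specific provider API:

```python
MAX_INPUT_CHARS = 8000      # reject oversized user inputs outright
MAX_OUTPUT_TOKENS = 1024    # hard cap passed to every model call

def guarded_completion(prompt: str, call_model) -> dict:
    """Validate the request, then invoke the model with a hard output cap.

    call_model(prompt, max_tokens) stands in for the actual provider SDK
    call (Bedrock, Azure OpenAI, Vertex AI, ...).
    """
    if len(prompt) > MAX_INPUT_CHARS:
        return {"error": "input_too_long", "limit_chars": MAX_INPUT_CHARS}
    # Always set max_tokens: without it, a single request can emit the
    # model's full context window in output tokens.
    return {"output": call_model(prompt, max_tokens=MAX_OUTPUT_TOKENS)}

# Usage with a stub model call
result = guarded_completion("Hello", lambda p, max_tokens: f"echo:{p} (cap={max_tokens})")
print(result)  # {'output': 'echo:Hello (cap=1024)'}
```

Placing this wrapper in front of the AIRateLimiter check gives two independent preventive layers before any tokens are billed.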
References
- AWS, "AWS Cost Anomaly Detection," https://docs.aws.amazon.com/cost-management/latest/userguide/manage-ad.html
- Microsoft, "Plan and manage costs for Azure OpenAI Service," https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs
- Google Cloud, "Cloud Billing budgets," https://cloud.google.com/billing/docs/how-to/budgets
- OWASP, "LLM10:2025 Unbounded Consumption," https://owasp.org/www-project-top-10-for-large-language-model-applications/
Why is setting max_tokens on every AI API call important for cost security?
An organization sets an AWS Budget alert at $5,000/month for Bedrock. Why is this insufficient as a sole cost control?