Cost Security and Budget Controls for Cloud AI
Protecting cloud AI deployments from cost-based attacks including denial-of-wallet, token exhaustion, and auto-scaling abuse with budget controls across AWS, Azure, and GCP.
Overview
Cloud AI services have a unique cost profile that creates novel attack vectors. Unlike traditional cloud compute where costs scale with provisioned resources, AI inference costs scale with usage -- specifically, with the number of tokens processed. A single compromised API key or misconfigured endpoint can generate thousands of dollars in charges within minutes, because there is no physical resource to provision and the marginal cost of each additional request is charged immediately.
The threat model for cloud AI cost security includes three categories: external attackers who gain access to AI service credentials, insiders who abuse AI services for unauthorized purposes, and automated systems that malfunction and generate runaway costs. All three require a combination of preventive controls (budget limits, rate limiting) and detective controls (anomaly detection, alerting).
This article covers the cost attack surface for each major cloud AI provider, practical budget control implementations, and red team techniques for testing cost security controls.
Cost Attack Surface
Token Economics and Attack Amplification
Cloud AI pricing is based on tokens (roughly 4 characters per token for English text). The cost asymmetry between input and output tokens creates attack amplification opportunities:
| Provider | Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Amplification Factor |
|---|---|---|---|---|
| AWS Bedrock | Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | 5x output vs input |
| Azure OpenAI | GPT-4o | ~$2.50 | ~$10.00 | 4x output vs input |
| GCP Vertex AI | Gemini 1.5 Pro | ~$1.25 | ~$5.00 | 4x output vs input |
| AWS Bedrock | Claude 3 Opus | ~$15.00 | ~$75.00 | 5x output vs input |
An attacker can maximize cost damage by crafting prompts that generate maximum-length outputs. A short prompt like "Write a 4000-word essay on any topic" costs pennies in input tokens but generates dollars in output tokens.
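To make the asymmetry concrete, here is a minimal sketch of the per-request economics (rates taken from the table above; the helper name is illustrative):

```python
def single_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_million: float,
    output_per_million: float,
) -> dict:
    """Estimate one request's cost and its input-to-output cost amplification."""
    input_cost = input_tokens / 1_000_000 * input_per_million
    output_cost = output_tokens / 1_000_000 * output_per_million
    return {
        "input_cost_usd": input_cost,
        "output_cost_usd": output_cost,
        "amplification": output_cost / input_cost if input_cost else float("inf"),
    }

# A ~10-token prompt that elicits a 4,000-token response at GPT-4o rates:
costs = single_request_cost(10, 4000, 2.50, 10.00)
# Input costs $0.000025; output costs $0.04 -- a 1600x cost amplification
```

The attacker pays nothing, of course; the asymmetry matters because per-request detection thresholds based on input size alone will miss these requests entirely.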
Denial-of-Wallet Attack Patterns
"""
Cost attack pattern analysis for cloud AI services.
For authorized red team assessments only.
"""
def calculate_attack_cost_projection(
requests_per_second: float,
avg_input_tokens: int,
avg_output_tokens: int,
input_cost_per_million: float,
output_cost_per_million: float,
duration_hours: float,
) -> dict:
"""Project the cost of a sustained denial-of-wallet attack."""
total_requests = requests_per_second * 3600 * duration_hours
total_input_tokens = total_requests * avg_input_tokens
total_output_tokens = total_requests * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_cost_per_million
output_cost = (total_output_tokens / 1_000_000) * output_cost_per_million
total_cost = input_cost + output_cost
return {
"total_requests": int(total_requests),
"total_input_tokens": int(total_input_tokens),
"total_output_tokens": int(total_output_tokens),
"input_cost_usd": round(input_cost, 2),
"output_cost_usd": round(output_cost, 2),
"total_cost_usd": round(total_cost, 2),
"cost_per_hour_usd": round(total_cost / duration_hours, 2),
}
# Example: Sustained attack using GPT-4o at 10 requests/second
projection = calculate_attack_cost_projection(
requests_per_second=10,
avg_input_tokens=100, # Short malicious prompt
avg_output_tokens=4000, # Maximum output
input_cost_per_million=2.50, # GPT-4o input
output_cost_per_million=10.00, # GPT-4o output
duration_hours=24,
)
# Result: ~$34,776 in 24 hours ($216 input + $34,560 output) from a single compromised key
Provisioned Throughput Abuse
Beyond pay-per-token costs, cloud AI providers offer provisioned throughput (reserved capacity) that costs significantly more:
- AWS Bedrock Provisioned Throughput: Billed hourly at rates up to $50+/hour per model unit. Creating provisioned throughput with a compromised credential can cost thousands before detection.
- Azure OpenAI Provisioned-Managed: PTU (Provisioned Throughput Units) are committed capacity billed hourly. An attacker creating a large PTU deployment generates immediate financial liability.
- GCP Vertex AI Provisioned Throughput: Reserved capacity for Gemini models, billed on commitment.
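The financial exposure is easy to project. A minimal sketch (the $50/hour per-model-unit rate is an assumption for illustration; actual rates vary by model and commitment term):

```python
PROVISIONED_RATE_USD_PER_HOUR = 50.0  # assumed per-model-unit rate; varies by model

def provisioned_abuse_liability(model_units: int, hours_until_detected: float) -> float:
    """Estimate billing liability from attacker-created provisioned throughput."""
    return model_units * PROVISIONED_RATE_USD_PER_HOUR * hours_until_detected

# Ten model units left running undetected over a weekend (60 hours):
liability = provisioned_abuse_liability(10, 60)  # $30,000
```

Unlike pay-per-token abuse, this liability accrues even if the attacker never sends a single inference request, which is why detection must key on the management-plane API call, not on traffic.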
import boto3
from botocore.exceptions import ClientError
def test_provisioned_throughput_controls(
session: boto3.Session,
region: str = "us-east-1",
) -> dict:
"""Test whether provisioned throughput creation is properly restricted."""
bedrock = session.client("bedrock", region_name=region)
findings = []
    # Test 1: Can the current identity create provisioned throughput?
    try:
        # Dry-run approach: use an intentionally invalid modelId so the
        # request fails server-side validation *after* the IAM check,
        # proving the permission exists without actually provisioning capacity.
        response = bedrock.create_provisioned_model_throughput(
            modelUnits=1,
            provisionedModelName="security-test-do-not-create",
            modelId="invalid-model-id-for-permission-test",
        )
        # Defensive cleanup: if the call somehow succeeded, delete immediately
        bedrock.delete_provisioned_model_throughput(
            provisionedModelId=response["provisionedModelArn"]
        )
        findings.append({
            "severity": "HIGH",
            "finding": "Identity can create provisioned throughput",
            "detail": "This identity can create provisioned model throughput, "
                      "which could cost $50+/hour per model unit. Restrict "
                      "bedrock:CreateProvisionedModelThroughput to admin roles.",
        })
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "AccessDeniedException":
            findings.append({
                "severity": "INFO",
                "finding": "Provisioned throughput creation blocked by IAM",
            })
        elif error_code in ("ValidationException", "ResourceNotFoundException"):
            findings.append({
                "severity": "HIGH",
                "finding": "Identity has CreateProvisionedModelThroughput permission",
                "detail": "Request reached server-side validation (past the IAM "
                          "check). The identity could create real provisioned "
                          "throughput with valid parameters.",
            })
    return {"findings": findings}
Budget Controls Implementation
AWS Budget Controls
import boto3
def create_bedrock_budget(
session: boto3.Session,
account_id: str,
monthly_limit_usd: float,
alert_thresholds: list = None,
notification_email: str = None,
) -> dict:
"""Create an AWS Budget specifically for Bedrock spending."""
budgets = session.client("budgets")
if alert_thresholds is None:
alert_thresholds = [50, 80, 100] # percent
notifications = []
subscribers = []
if notification_email:
subscribers = [
{"SubscriptionType": "EMAIL", "Address": notification_email}
]
for threshold in alert_thresholds:
notifications.append({
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": threshold,
"ThresholdType": "PERCENTAGE",
},
"Subscribers": subscribers,
})
budget = {
"BudgetName": "bedrock-monthly-limit",
"BudgetLimit": {
"Amount": str(monthly_limit_usd),
"Unit": "USD",
},
"BudgetType": "COST",
"TimeUnit": "MONTHLY",
"CostFilters": {
"Service": ["Amazon Bedrock"],
},
}
try:
budgets.create_budget(
AccountId=account_id,
Budget=budget,
NotificationsWithSubscribers=notifications,
)
return {
"budget_name": "bedrock-monthly-limit",
"limit_usd": monthly_limit_usd,
"alert_thresholds": alert_thresholds,
"status": "created",
}
except Exception as e:
return {"error": str(e)}
def create_bedrock_cost_anomaly_detector(
session: boto3.Session,
) -> dict:
"""Create a Cost Anomaly Detection monitor for Bedrock."""
ce = session.client("ce")
try:
response = ce.create_anomaly_monitor(
AnomalyMonitor={
"MonitorName": "bedrock-cost-anomaly",
"MonitorType": "DIMENSIONAL",
"MonitorDimension": "SERVICE",
}
)
monitor_arn = response["MonitorArn"]
# Create anomaly subscription for alerts
ce.create_anomaly_subscription(
AnomalySubscription={
"SubscriptionName": "bedrock-anomaly-alerts",
"MonitorArnList": [monitor_arn],
"Frequency": "IMMEDIATE",
"Threshold": 50.0, # Alert on anomalies over $50
"Subscribers": [
{
"Type": "SNS",
"Address": "arn:aws:sns:us-east-1:ACCOUNT:bedrock-cost-alerts",
}
],
}
)
return {
"monitor_arn": monitor_arn,
"threshold_usd": 50.0,
"status": "created",
}
except Exception as e:
        return {"error": str(e)}
Azure OpenAI Quota and Budget Controls
Azure OpenAI provides token-per-minute (TPM) quotas at the deployment level, which serve as the primary rate-limiting mechanism:
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.costmanagement import CostManagementClient
def configure_deployment_rate_limits(
subscription_id: str,
resource_group: str,
account_name: str,
deployment_name: str,
tokens_per_minute_limit: int = 10000,
) -> dict:
"""Configure rate limits on an Azure OpenAI deployment."""
credential = DefaultAzureCredential()
client = CognitiveServicesManagementClient(credential, subscription_id)
    # Update the deployment's SKU capacity. Note: this PUT replaces the
    # deployment definition, so in practice the existing model properties
    # must be resent alongside the new sku.
    deployment = client.deployments.begin_create_or_update(
        resource_group,
        account_name,
        deployment_name,
        {
            "sku": {
                "name": "Standard",
                "capacity": tokens_per_minute_limit // 1000,  # Capacity in thousands of TPM
            },
        },
    ).result()
return {
"deployment": deployment_name,
"capacity_k_tpm": deployment.sku.capacity,
"effective_tpm": deployment.sku.capacity * 1000,
"detail": "Requests exceeding this rate will receive 429 responses.",
}
def create_azure_ai_cost_alert(
    subscription_id: str,
    resource_group: str,
    monthly_budget_usd: float,
    contact_emails: list,
) -> dict:
    """Create a budget alert for Azure AI spending."""
    # Budgets are managed through the Consumption API
    # (azure-mgmt-consumption), not the Cost Management SDK.
    from azure.mgmt.consumption import ConsumptionManagementClient
    credential = DefaultAzureCredential()
    consumption_client = ConsumptionManagementClient(credential, subscription_id)
    scope = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    budget = {
        "category": "Cost",
        "amount": monthly_budget_usd,
        "time_grain": "Monthly",
        "time_period": {
            # Adjust to the desired budget window
            "start_date": "2026-03-01T00:00:00Z",
            "end_date": "2027-03-01T00:00:00Z",
        },
        "filter": {
            "dimensions": {
                "name": "ServiceName",
                "operator": "In",
                "values": ["Azure OpenAI", "Cognitive Services"],
            }
        },
        "notifications": {
            f"actual_{threshold}_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": threshold,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            }
            for threshold in (50, 80, 100)
        },
    }
    consumption_client.budgets.create_or_update(
        scope, "ai-services-budget", budget
    )
    return {
        "budget_name": "ai-services-budget",
        "monthly_limit_usd": monthly_budget_usd,
        "alerts": ["50%", "80%", "100%"],
    }
GCP Vertex AI Budget Controls
from google.cloud import billing_budgets_v1
def create_vertex_ai_budget(
billing_account_id: str,
project_id: str,
monthly_limit_usd: float,
notification_pubsub_topic: str = None,
) -> dict:
"""Create a budget for Vertex AI spending on GCP."""
client = billing_budgets_v1.BudgetServiceClient()
budget = billing_budgets_v1.Budget(
display_name="vertex-ai-monthly-budget",
budget_filter=billing_budgets_v1.Filter(
projects=[f"projects/{project_id}"],
        services=[
            # Budget filters take Cloud Billing service IDs
            # ("services/<SERVICE_ID>", from the Billing Catalog API),
            # not API hostnames -- look up the Vertex AI ID and substitute it.
            "services/VERTEX_AI_SERVICE_ID",
        ],
    ),
),
amount=billing_budgets_v1.BudgetAmount(
specified_amount={"units": int(monthly_limit_usd), "currency_code": "USD"}
),
threshold_rules=[
billing_budgets_v1.ThresholdRule(
threshold_percent=0.5,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
billing_budgets_v1.ThresholdRule(
threshold_percent=0.8,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
billing_budgets_v1.ThresholdRule(
threshold_percent=1.0,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
],
)
if notification_pubsub_topic:
budget.notifications_rule = billing_budgets_v1.NotificationsRule(
pubsub_topic=notification_pubsub_topic,
schema_version="1.0",
)
request = billing_budgets_v1.CreateBudgetRequest(
parent=f"billingAccounts/{billing_account_id}",
budget=budget,
)
result = client.create_budget(request=request)
return {
"budget_name": result.name,
"monthly_limit_usd": monthly_limit_usd,
"status": "created",
    }
Application-Level Rate Limiting
Cloud provider budget alerts are reactive -- they notify after spend has occurred. For real-time cost protection, implement application-level rate limiting:
import time
from dataclasses import dataclass, field
from threading import Lock
@dataclass
class AIRateLimiter:
"""Token-aware rate limiter for cloud AI services."""
max_tokens_per_minute: int = 100000
max_tokens_per_hour: int = 2000000
max_cost_per_hour_usd: float = 50.0
cost_per_input_token: float = 0.000003 # $3/1M tokens
cost_per_output_token: float = 0.000015 # $15/1M tokens
_lock: Lock = field(default_factory=Lock)
_minute_tokens: list = field(default_factory=list)
_hour_tokens: list = field(default_factory=list)
_hour_cost: list = field(default_factory=list)
def check_and_record(
self,
input_tokens: int,
estimated_output_tokens: int,
        user_id: str = "default",  # reserved for per-user limits (not yet enforced)
) -> dict:
"""Check if a request is within rate limits and record it."""
with self._lock:
now = time.time()
total_tokens = input_tokens + estimated_output_tokens
estimated_cost = (
input_tokens * self.cost_per_input_token
+ estimated_output_tokens * self.cost_per_output_token
)
# Clean old entries
minute_cutoff = now - 60
hour_cutoff = now - 3600
self._minute_tokens = [
(t, tokens) for t, tokens in self._minute_tokens if t > minute_cutoff
]
self._hour_tokens = [
(t, tokens) for t, tokens in self._hour_tokens if t > hour_cutoff
]
self._hour_cost = [
(t, cost) for t, cost in self._hour_cost if t > hour_cutoff
]
# Check minute limit
minute_total = sum(tokens for _, tokens in self._minute_tokens) + total_tokens
if minute_total > self.max_tokens_per_minute:
return {
"allowed": False,
"reason": "tokens_per_minute_exceeded",
"current": minute_total - total_tokens,
"limit": self.max_tokens_per_minute,
"retry_after_seconds": 60,
}
# Check hour limit
hour_total = sum(tokens for _, tokens in self._hour_tokens) + total_tokens
if hour_total > self.max_tokens_per_hour:
return {
"allowed": False,
"reason": "tokens_per_hour_exceeded",
"current": hour_total - total_tokens,
"limit": self.max_tokens_per_hour,
}
# Check cost limit
hour_cost = sum(cost for _, cost in self._hour_cost) + estimated_cost
if hour_cost > self.max_cost_per_hour_usd:
return {
"allowed": False,
"reason": "cost_per_hour_exceeded",
"current_cost_usd": round(hour_cost - estimated_cost, 4),
"limit_usd": self.max_cost_per_hour_usd,
}
# Record the request
self._minute_tokens.append((now, total_tokens))
self._hour_tokens.append((now, total_tokens))
self._hour_cost.append((now, estimated_cost))
return {
"allowed": True,
"estimated_cost_usd": round(estimated_cost, 6),
"minute_usage_pct": round((minute_total / self.max_tokens_per_minute) * 100, 1),
"hour_usage_pct": round((hour_total / self.max_tokens_per_hour) * 100, 1),
"hour_cost_pct": round((hour_cost / self.max_cost_per_hour_usd) * 100, 1),
            }
Red Team Testing for Cost Controls
Authorized Cost Security Testing
When conducting authorized assessments, test cost controls without generating actual excessive spend:
def test_cost_controls_safely(
rate_limiter: AIRateLimiter,
cloud_client: object,
) -> dict:
"""Test cost control effectiveness with minimal actual spend."""
results = {
"rate_limiter_tests": [],
"budget_alert_tests": [],
"credential_scope_tests": [],
}
    # Test 1: Verify rate limiter blocks excessive requests.
    # Each request consumes 5,000 tokens, so the default 100k tokens/minute
    # limit should trip on the 21st request.
    for i in range(25):
        check = rate_limiter.check_and_record(
            input_tokens=1000,
            estimated_output_tokens=4000,
            user_id="test-user",
        )
        if not check["allowed"]:
            results["rate_limiter_tests"].append({
                "test": "rapid_request_blocking",
                "pass": True,
                "blocked_at_request": i + 1,
                "reason": check["reason"],
            })
            break
    else:
        results["rate_limiter_tests"].append({
            "test": "rapid_request_blocking",
            "pass": False,
            "detail": "Rate limiter did not block after 25 rapid requests",
        })
# Test 2: Verify max_tokens parameter is enforced
# Send a single request with a very small max_tokens to verify
# the application enforces output length limits
results["rate_limiter_tests"].append({
"test": "output_token_limit",
"recommendation": "Verify application sets max_tokens on all API calls. "
"Without max_tokens, a single request can generate "
"the model's full context window in output tokens.",
})
# Test 3: Check if provisioned throughput creation is IAM-restricted
results["credential_scope_tests"].append({
"test": "provisioned_throughput_restriction",
"recommendation": "Verify application service accounts cannot call "
"CreateProvisionedModelThroughput or equivalent.",
})
    return results
Cost Attack Indicators
Monitor for these patterns that indicate cost-based attacks:
| Indicator | Description | Detection Method |
|---|---|---|
| Token volume spike | Sudden increase in tokens processed per minute | CloudWatch/Azure Monitor/Cloud Monitoring metrics |
| Max output requests | High percentage of requests generating maximum-length outputs | Application-level logging of output token counts |
| Off-hours usage | AI API calls during periods when the application should be idle | Time-based alerting on API call metrics |
| New IP/identity | API calls from previously unseen IP addresses or identities | CloudTrail/Activity Log/Audit Log analysis |
| Provisioned capacity creation | New provisioned throughput or committed use | Management API event monitoring |
| Burst after silence | A credential that was dormant suddenly generating high volume | Behavioral analysis of per-credential usage patterns |
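Several of these indicators reduce to comparing current usage against a rolling baseline. A minimal sketch of the "token volume spike" check (the spike factor and floor are illustrative thresholds, not recommendations):

```python
from collections import deque

def is_token_spike(
    history: deque,
    current_tokens_per_min: int,
    spike_factor: float = 5.0,
    min_baseline: float = 1000.0,
) -> bool:
    """Flag a per-minute token count far above the rolling baseline."""
    if not history:
        # No baseline yet: record the sample and treat it as normal
        history.append(current_tokens_per_min)
        return False
    baseline = max(sum(history) / len(history), min_baseline)
    spike = current_tokens_per_min > baseline * spike_factor
    if not spike:
        history.append(current_tokens_per_min)  # learn only from normal traffic
    return spike

history = deque(maxlen=60)  # rolling window: last hour of per-minute samples
for sample in [900, 1100, 1000, 950]:
    is_token_spike(history, sample)
# A jump to 40,000 tokens/minute against a ~1,000/minute baseline trips the check
```

Excluding flagged samples from the baseline prevents a patient attacker from gradually "training" the detector to accept high volume, though a slow ramp below the spike factor can still drift the baseline upward.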
Defense-in-Depth Strategy
Layer cost controls from the cloud provider level down to the application level:
- Cloud provider budget alerts: Set monthly budgets with alerts at 50%, 80%, and 100% thresholds. These are the last line of defense but have reporting delays.
- IAM permission restrictions: Remove CreateProvisionedModelThroughput and equivalent high-cost actions from all non-admin identities.
- Deployment-level quotas: Configure TPM (tokens per minute) limits on each model deployment. Azure OpenAI has native support; other providers require API Gateway rate limiting.
- Application rate limiting: Implement token-aware rate limiting in the application layer with per-user and global limits.
- Request validation: Set max_tokens on every API call. Reject user inputs that exceed reasonable length thresholds.
- Real-time anomaly detection: Use cost anomaly detection services and custom monitoring to detect unusual spend patterns within minutes, not hours.
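The request-validation layer is the cheapest to implement. A minimal sketch (the limits and helper name are illustrative assumptions, not recommended values):

```python
MAX_INPUT_CHARS = 8000          # roughly 2,000 tokens; tune per application
HARD_MAX_OUTPUT_TOKENS = 1024   # server-side output cap, regardless of client input

def validate_and_cap_request(prompt: str, requested_max_tokens: int = None) -> dict:
    """Bound input size and guarantee max_tokens is set on every upstream call."""
    if len(prompt) > MAX_INPUT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_INPUT_CHARS} characters")
    # Never forward a request without an output cap: an uncapped call can
    # bill for the model's entire maximum output length.
    requested = requested_max_tokens or HARD_MAX_OUTPUT_TOKENS
    return {"prompt": prompt, "max_tokens": min(requested, HARD_MAX_OUTPUT_TOKENS)}
```

Because the cap is enforced server-side before the provider API call, a compromised client cannot negotiate it upward; the worst case per request is bounded regardless of what the attacker sends.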
References
- AWS, "AWS Cost Anomaly Detection," https://docs.aws.amazon.com/cost-management/latest/userguide/manage-ad.html
- Microsoft, "Plan and manage costs for Azure OpenAI Service," https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs
- Google Cloud, "Cloud Billing budgets," https://cloud.google.com/billing/docs/how-to/budgets
- OWASP, "LLM10:2025 Unbounded Consumption," https://owasp.org/www-project-top-10-for-large-language-model-applications/
Review Questions
- Why is setting max_tokens on every AI API call important for cost security?
- An organization sets an AWS Budget alert at $5,000/month for Bedrock. Why is this insufficient as a sole cost control?