AI Cost & Billing Attacks
Red team techniques for AI cost exploitation: model invocation abuse for billing inflation, token exhaustion attacks, GPU compute abuse, auto-scaling exploitation, and denial-of-wallet attacks across cloud providers.
AI services are among the most expensive cloud resources. A single GPU instance can cost $30+/hour, foundation model API calls are billed per token, and provisioned throughput deployments cost thousands per day. This creates a unique attack vector: denial-of-wallet attacks that inflict financial damage without disrupting service availability.
For red teamers, cost attacks demonstrate business impact in dollar terms that executives understand immediately.
Model Invocation Abuse
Token-Based Cost Amplification
Foundation model APIs (Bedrock, Azure OpenAI, Vertex AI) bill per token. Attackers can maximize cost per request through several techniques:
| Technique | Description | Cost Amplification |
|---|---|---|
| Max token requests | Set max_tokens to maximum, craft prompts that generate long responses | 10-100x vs. typical request |
| Expensive model targeting | Route to the most expensive model available | 2-30x vs. cheapest model |
| Streaming abuse | Use streaming endpoints to maintain long connections | Connection + token costs |
| Multi-turn inflation | Create long conversation histories that are re-sent with each request | Quadratic token growth |
| System prompt bloat | If system prompt is configurable, make it extremely long | System tokens billed on every request |
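The quadratic growth in the multi-turn row can be sketched with a small model. The function name and the assumption that each turn adds a fixed number of tokens to the history are illustrative only; real conversations vary per turn, and providers may apply caching or truncation.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when the full history is re-sent every turn.

    Assumption (hypothetical model): each turn appends `tokens_per_turn`
    tokens to the history, and the whole history is billed as input.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # history grows by one turn
        total += history            # the entire history is sent again
    return total


# Compare against a stateless client that sends each turn only once
for n in (10, 50, 100):
    resent = cumulative_input_tokens(n, 500)
    stateless = n * 500
    print(f"{n} turns: {resent:,} tokens billed vs {stateless:,} without history")
```

Under this model the billed total is `tokens_per_turn * n * (n + 1) / 2`, so doubling the number of turns roughly quadruples the input-token bill.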
    # Cost amplification through max token generation
    import json

    import boto3

    client = boto3.client('bedrock-runtime')

    # Craft a prompt that generates maximum output:
    # use instructions that produce verbose, repetitive output
    response = client.invoke_model(
        modelId='anthropic.claude-3-opus-20240229-v1:0',  # Most expensive model
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,  # Maximum output
            "messages": [{
                "role": "user",
                "content": "Write an extremely detailed, step-by-step guide "
                           "covering every possible aspect of [TOPIC]. Include "
                           "extensive examples, edge cases, and explanations. "
                           "Do not summarize or abbreviate any section."
            }]
        })
    )

Cost Calculation
Understanding pricing enables accurate impact assessment:
| Provider/Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Max Output Tokens |
|---|---|---|---|
| AWS Bedrock Claude 3 Opus | ~$15 | ~$75 | 4,096 |
| Azure OpenAI GPT-4o | ~$5 | ~$15 | 16,384 |
| GCP Vertex AI Gemini 1.5 Pro | ~$3.50 | ~$10.50 | 8,192 |
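A single maximally expensive Bedrock request (4K input + 4K output tokens) costs roughly $0.36 ($0.06 for input plus $0.30 for output). At 100 requests/second sustained for 1 hour (360,000 requests), that is approximately $129,600, assuming account quotas permit that rate.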
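The arithmetic behind these figures can be reproduced with a short calculator. This is a minimal sketch: the function names are hypothetical, the prices are the approximate values from the table above, and "4K" is treated as 4,000 tokens to match the rounded figure.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one request, given per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def sustained_cost(cost_per_request: float, requests_per_second: float,
                   seconds: float) -> float:
    """Cost in USD of sending requests at a constant rate for a duration."""
    return cost_per_request * requests_per_second * seconds


# Approximate Claude 3 Opus prices from the table: ~$15 input, ~$75 output
per_request = request_cost(4_000, 4_000, 15.0, 75.0)
per_hour = sustained_cost(per_request, requests_per_second=100, seconds=3_600)
print(f"per request: ${per_request:.2f}, per hour at 100 rps: ${per_hour:,.0f}")
```

Swapping in another row of the table gives a like-for-like comparison between providers. The result is an upper bound that ignores throttling and quotas.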
Token Exhaustion Attacks
Quota Exhaustion
Cloud AI services have quotas (tokens per minute, requests per minute). Exhausting these quotas denies service to legitimate users:
    # AWS Bedrock: Check current quotas
    aws service-quotas get-service-quota \
      --service-code bedrock \
      --quota-code <quota-code>

    # Azure OpenAI: Check deployment capacity
    az cognitiveservices account deployment list \
      --name <account> --resource-group <rg> \
      --query "[].{name:name,capacity:properties.sku.capacity}"

Attack Patterns
Sustained high-volume invocation
Send requests at the maximum allowed rate continuously, with each request using maximum tokens. This both exhausts the per-minute token quota and generates substantial costs.
Burst attacks
Send large bursts of requests to trigger rate limiting for legitimate users while the burst requests consume quota. Even if the attacker's requests are eventually throttled, the quota consumed during the burst window is not recovered.
Multi-model spreading
If the target has access to multiple models, distribute requests across all models to maximize total token consumption across quotas that are tracked per-model.
Context window stuffing
Send requests with maximum-length input contexts. This consumes input tokens (which are billed) and forces the model to process more data per request, reducing effective throughput for other users.
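As a rough illustration of the patterns above, the following sketch estimates how little traffic is needed to consume a tokens-per-minute quota. It assumes a simplified model in which the quota is a single per-minute token budget shared by all callers; the function names and the example numbers are hypothetical, and real provider accounting (per-model, per-region, burst allowances) differs.

```python
import math


def requests_to_exhaust(tpm_quota: int, tokens_per_request: int) -> int:
    """Requests per minute needed to consume the whole token quota."""
    return math.ceil(tpm_quota / tokens_per_request)


def quota_left_for_legitimate_users(tpm_quota: int, attacker_rpm: int,
                                    tokens_per_request: int) -> int:
    """Tokens per minute remaining after attacker traffic, floored at zero."""
    consumed = min(tpm_quota, attacker_rpm * tokens_per_request)
    return tpm_quota - consumed


# Hypothetical deployment: 1M tokens/minute, attacker sends 8,000-token requests
quota = 1_000_000
print(requests_to_exhaust(quota, 8_000))
print(quota_left_for_legitimate_users(quota, attacker_rpm=100,
                                      tokens_per_request=8_000))
```

The point of the model is that larger requests lower the request rate needed for exhaustion, which is why context window stuffing combines well with sustained invocation.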
Provisioned Throughput Abuse
Provisioned Throughput (AWS Bedrock PT, Azure OpenAI PTU) is dedicated capacity that costs money whether or not it is used:
| Attack | Description | Impact |
|---|---|---|
| Provision and abandon | If IAM allows, create PT/PTU deployments and leave them running | Continuous cost with no usage |
| Capacity exhaustion | Saturate PT/PTU capacity with attacker requests | Legitimate users denied access to reserved capacity |
| Scaling trigger | Force provisioning of additional capacity through sustained high load | Increased capacity costs |
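    # AWS: Check for the ability to create provisioned throughput.
    # Caution: if this call succeeds it creates real, billed capacity,
    # so run it only with explicit authorization and delete the resource afterwards.
    aws bedrock create-provisioned-model-throughput \
      --model-units 1 \
      --provisioned-model-name cost-attack-test \
      --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
      2>&1  # An AccessDenied error indicates the permission is missing

GPU Compute Abuse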
Training Job Abuse
ML training jobs consume expensive GPU instances. An attacker with training job creation permissions can launch compute for unauthorized purposes:
    # AWS SageMaker: Create a training job with expensive GPU instances
    aws sagemaker create-training-job \
      --training-job-name gpu-abuse-test \
      --algorithm-specification \
        TrainingImage=<container>,TrainingInputMode=File \
      --resource-config \
        InstanceType=ml.p4d.24xlarge,InstanceCount=4,VolumeSizeInGB=500 \
      --role-arn <role-arn> \
      --stopping-condition MaxRuntimeInSeconds=86400 \
      --output-data-config S3OutputPath=s3://bucket/output

A single ml.p4d.24xlarge instance costs approximately $37/hour. Four instances for 24 hours: $3,552. Attackers are unlikely to use spot capacity because they want uninterrupted access, so the target pays the full on-demand rate.
Compute Resource Costs
| Instance Type | Provider | GPU | Approximate Cost/Hour |
|---|---|---|---|
| ml.p4d.24xlarge | AWS | 8x A100 | ~$37 |
| Standard_NC96ads_A100_v4 | Azure | 4x A100 | ~$32 |
| a2-ultragpu-8g | GCP | 8x A100 | ~$40 |
| ml.p5.48xlarge | AWS | 8x H100 | ~$98 |
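The table can be turned into a quick exposure estimate. This sketch hard-codes the approximate hourly rates listed above; the dictionary and function names are hypothetical, and current prices should be checked against each provider before reporting a figure.

```python
# Approximate on-demand rates (USD/hour) copied from the table above
HOURLY_RATE_USD = {
    "ml.p4d.24xlarge": 37.0,
    "Standard_NC96ads_A100_v4": 32.0,
    "a2-ultragpu-8g": 40.0,
    "ml.p5.48xlarge": 98.0,
}


def compute_cost(instance_type: str, instance_count: int, hours: float) -> float:
    """Estimated cost in USD of running `instance_count` instances for `hours`."""
    return HOURLY_RATE_USD[instance_type] * instance_count * hours


# Four instances of each type running for one day
for name in HOURLY_RATE_USD:
    print(f"{name}: ${compute_cost(name, 4, 24):,.0f}")
```

For the SageMaker example (four ml.p4d.24xlarge instances for 24 hours) this gives $3,552.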
Cryptocurrency Mining
GPU instances provisioned through ML services are attractive for cryptocurrency mining:
- Training job containers have full GPU access
- Custom container images can include mining software
- Training jobs can run for hours or days before detection
- GPU instance costs are borne by the target organization
Auto-Scaling Exploitation
Scaling Trigger Attacks
AI serving endpoints use auto-scaling to handle variable load. Attackers can exploit scaling configurations:
| Scaling Config | Attack | Result |
|---|---|---|
| CPU-based scaling | Send compute-intensive requests | Triggers scale-up based on CPU utilization |
| Request-count scaling | Flood with requests | Triggers scale-up based on request rate |
| Custom metric scaling | Exploit metric collection gaps | Trigger scaling through metric manipulation |
| No max limit | Sustained high load | Unbounded scaling, unbounded cost |
Scaling Attack Chain
Identify scaling configuration
Determine the endpoint's auto-scaling policy: what metric triggers scaling, what the cool-down period is, and whether there is a maximum instance count.
    # AWS SageMaker: Check auto-scaling
    aws application-autoscaling describe-scaling-policies \
      --service-namespace sagemaker \
      --resource-id endpoint/<endpoint-name>/variant/<variant-name>

    # GCP Vertex AI: Check endpoint scaling
    gcloud ai endpoints describe <endpoint-id> --region=us-central1 \
      --format="json(deployedModels[].automaticResources)"

Calculate cost impact
Determine the cost per instance and the maximum scale. If maximum is not set, the theoretical maximum is limited only by account quotas.
Trigger scaling
Send sustained traffic at a rate that triggers scale-up. Use the most expensive request pattern (max tokens, expensive model) to maximize both per-request cost and infrastructure cost.
Maintain pressure
Keep traffic sustained long enough for auto-scaling to provision maximum instances. Scale-up is typically faster than scale-down, so even a brief attack can leave expensive instances running during the cool-down period.
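The cost-impact step of this chain can be estimated with a minute-by-minute simulation. This is a minimal sketch under stated assumptions: the fleet adds a fixed number of instances at a fixed interval while under load, holds its size for the whole cool-down, and then returns to baseline. All parameter names and example values are hypothetical and do not model any specific provider's scaling algorithm.

```python
def extra_instance_hours(min_instances: int, max_instances: int,
                         scale_up_step: int, scale_up_interval_min: int,
                         attack_min: int, cooldown_min: int) -> float:
    """Instance-hours above baseline consumed during an attack and its cool-down."""
    count = min_instances
    extra_instance_minutes = 0
    for minute in range(attack_min + cooldown_min):
        under_attack = minute < attack_min
        # Scale up one step per interval while traffic is sustained
        if under_attack and minute > 0 and minute % scale_up_interval_min == 0:
            count = min(max_instances, count + scale_up_step)
        # During cool-down the fleet keeps the size it reached
        extra_instance_minutes += count - min_instances
    return extra_instance_minutes / 60


# Baseline 2 instances, cap 20, +2 every 5 minutes, 60-minute attack, 15-minute cool-down
hours = extra_instance_hours(2, 20, 2, 5, 60, 15)
print(f"{hours} extra instance-hours, ~${hours * 37:,.0f} at ~$37/hour")
```

Raising `max_instances` or removing the cap (a very large value) shows how quickly the bound on cost disappears when no maximum is configured.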
Cool-Down Exploitation
Auto-scaling policies include cool-down periods to prevent thrashing. Attackers can exploit this:
- Scale-up, then stop: Trigger maximum scaling, then stop traffic. Instances remain running during the scale-down cool-down period (typically 5-15 minutes).
- Pulse attacks: Send bursts of traffic followed by quiet periods timed to prevent scale-down, keeping instances at maximum count while minimizing attacker request volume.
- Scale-up cascade: If auto-scaling provisions instances that themselves trigger further scaling (e.g., a new instance starts a warm-up process that consumes resources), the cascading scale-up can exceed intended limits.
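The pulse pattern can be illustrated by merging the windows during which the fleet stays scaled up. This sketch assumes scale-up is instantaneous and that scale-down happens exactly one cool-down period after a burst ends; the function name and timings are hypothetical.

```python
from typing import Iterable


def minutes_scaled_up(burst_starts: Iterable[int], burst_len_min: int,
                      cooldown_min: int) -> int:
    """Total minutes the fleet stays scaled up, merging overlapping windows."""
    total = 0
    window_start = None
    window_end = None
    for start in sorted(burst_starts):
        end = start + burst_len_min + cooldown_min
        if window_end is None or start > window_end:
            # Previous window closed before this burst: fleet had scaled down
            if window_end is not None:
                total += window_end - window_start
            window_start, window_end = start, end
        else:
            # Burst arrived before scale-down: the window is extended
            window_end = max(window_end, end)
    if window_end is not None:
        total += window_end - window_start
    return total


# One-minute bursts with a 10-minute cool-down
print(minutes_scaled_up(range(0, 60, 10), 1, 10))  # bursts every 10 minutes
print(minutes_scaled_up([0, 20, 40], 1, 10))       # bursts every 20 minutes
```

In this model, six one-minute bursts spaced inside the cool-down keep the fleet scaled up for the whole hour, while bursts spaced wider than the cool-down let it scale down between pulses.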
Denial-of-Wallet Attack Methodology
Planning
- Enumerate AI services and pricing: Identify the most expensive services in use
- Determine access level: What API actions can the attacker perform?
- Map scaling configurations: How does the target handle load increases?
- Identify billing controls: Are there budget alerts, spending limits, or approval workflows?
- Estimate impact: Calculate the maximum financial damage achievable
Execution Considerations
Detection and Billing Lag
Most cloud billing has a lag between resource consumption and billing visibility:
| Provider | Billing Lag | Cost Explorer Delay |
|---|---|---|
| AWS | Up to 24 hours | Same-day estimates, final bills delayed |
| Azure | Up to 24 hours | Cost Management may lag 24-48 hours |
| GCP | Up to 24 hours | Billing export may lag several hours |
This lag means that cost attacks can accumulate significant charges before any billing alert triggers. Red teams should factor this into engagement planning to avoid unintended financial impact.
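The effect of billing lag on an alert threshold can be estimated as follows. This is a simplified sketch that assumes a constant spend rate and a fixed lag before charges become visible to the alerting system; the function name and the example figures are hypothetical.

```python
from typing import Tuple


def spend_at_alert(hourly_cost_usd: float, alert_threshold_usd: float,
                   billing_lag_hours: float) -> Tuple[float, float]:
    """Return (hours until the alert fires, actual spend at that moment).

    The alert fires once *visible* spend reaches the threshold, but visible
    spend trails actual spend by `billing_lag_hours`.
    """
    hours_to_threshold = alert_threshold_usd / hourly_cost_usd
    hours_until_alert = hours_to_threshold + billing_lag_hours
    in_pipeline_usd = hourly_cost_usd * billing_lag_hours
    return hours_until_alert, alert_threshold_usd + in_pipeline_usd


# $1,000/hour attack, $5,000 budget alert, 24-hour billing lag
hours, spend = spend_at_alert(1_000.0, 5_000.0, 24.0)
print(f"alert after {hours:.0f} hours, actual spend ${spend:,.0f}")
```

With these example numbers a $5,000 alert fires only after roughly $29,000 has actually been spent, which is why engagement rules should cap spend independently of billing alerts.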
Impact Communication
Cost attack findings should be communicated in business terms:
- Hourly exposure: "An attacker with this access can generate $X,XXX/hour in AI costs"
- Scaling factor: "Auto-scaling without maximum limits amplifies this to $XX,XXX/hour"
- Detection gap: "Current billing alerts would not trigger until $X has been spent, with an additional $X in the billing pipeline"
- Annual risk: "If exploited persistently, annual cost exposure is $X,XXX,XXX"
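The figures in these statements can be derived from a single measured hourly cost. The sketch below is illustrative only: the function names and the scaling multiplier are hypothetical, and the annual figure assumes the scaled rate is sustained around the clock.

```python
def exposure_summary(hourly_cost_usd: float, scaling_multiplier: float = 1.0) -> dict:
    """Derive hourly, scaled hourly, and annual exposure from one hourly figure."""
    scaled = hourly_cost_usd * scaling_multiplier
    return {
        "hourly_usd": hourly_cost_usd,
        "hourly_scaled_usd": scaled,
        "annual_usd": scaled * 24 * 365,  # assumes continuous exploitation
    }


def format_finding(summary: dict) -> str:
    """Render the summary as report-ready sentences."""
    return (
        f"An attacker with this access can generate "
        f"${summary['hourly_usd']:,.0f}/hour in AI costs. "
        f"Auto-scaling amplifies this to ${summary['hourly_scaled_usd']:,.0f}/hour. "
        f"If exploited persistently, annual cost exposure is "
        f"${summary['annual_usd']:,.0f}."
    )


print(format_finding(exposure_summary(1_000.0, scaling_multiplier=10.0)))
```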
Related Topics
- AWS AI Services -- AWS-specific cost attack surfaces
- Azure AI Services -- Azure PTU and cost considerations
- GCP AI Services -- GCP cost controls and quota management
- SageMaker Exploitation -- GPU compute abuse through SageMaker