AI Cost & Billing Attacks

Advanced | 12 min read | Updated 2026-03-15

Red team techniques for AI cost exploitation: model invocation abuse for billing inflation, token exhaustion attacks, GPU compute abuse, auto-scaling exploitation, and denial-of-wallet attacks across cloud providers.

cost-attacks denial-of-wallet token-exhaustion gpu-abuse auto-scaling billing red-team

AI services are among the most expensive cloud resources. A single GPU instance can cost $30+/hour, foundation model API calls are billed per token, and provisioned throughput deployments cost thousands per day. This creates a unique attack vector: denial-of-wallet attacks that inflict financial damage without disrupting service availability.

For red teamers, cost attacks demonstrate business impact in dollar terms that executives understand immediately.

Model Invocation Abuse

Token-Based Cost Amplification

Foundation model APIs (Bedrock, Azure OpenAI, Vertex AI) bill per token. Attackers can maximize cost per request through several techniques:

Technique	Description	Cost Amplification
Max token requests	Set `max_tokens` to maximum, craft prompts that generate long responses	10-100x vs. typical request
Expensive model targeting	Route to the most expensive model available	2-30x vs. cheapest model
Streaming abuse	Use streaming endpoints to maintain long connections	Connection + token costs
Multi-turn inflation	Create long conversation histories that are re-sent with each request	Quadratic token growth
System prompt bloat	If system prompt is configurable, make it extremely long	System tokens billed on every request

# Cost amplification through max token generation
import boto3, json
 
client = boto3.client('bedrock-runtime')
 
# Craft prompt that generates maximum output
# Use instructions that produce verbose, repetitive output
response = client.invoke_model(
    modelId='anthropic.claude-3-opus-20240229-v1:0',  # Most expensive model
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,  # Maximum output
        "messages": [{
            "role": "user",
            "content": "Write an extremely detailed, step-by-step guide "
                       "covering every possible aspect of [TOPIC]. Include "
                       "extensive examples, edge cases, and explanations. "
                       "Do not summarize or abbreviate any section."
        }]
    })
)
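The "quadratic token growth" in the multi-turn inflation row can be checked with a few lines of arithmetic. This is a minimal sketch under a simplifying assumption that is not in the original: every turn adds a fixed number of tokens (`tokens_per_turn`, a hypothetical parameter) and every request re-sends the full history so far.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when each request re-sends the whole history.

    Request k carries k * tokens_per_turn tokens, so the total over
    k = 1..turns is tokens_per_turn * turns * (turns + 1) / 2.
    """
    return tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    # Hypothetical conversation sizes, 500 tokens added per turn
    for turns in (10, 20, 40):
        total = cumulative_input_tokens(turns, tokens_per_turn=500)
        print(f"{turns:>3} turns: {total:,} input tokens billed in total")
```

Doubling the number of turns roughly quadruples the billed input tokens, which is why long conversation histories are a cost amplifier even when each individual message is short.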

Cost Calculation

Understanding pricing enables accurate impact assessment:

Provider/Model	Input Cost (per 1M tokens)	Output Cost (per 1M tokens)	Max Output Tokens
AWS Bedrock Claude 3 Opus	~$15	~$75	4,096
Azure OpenAI GPT-4o	~$5	~$15	16,384
GCP Vertex AI Gemini 1.5 Pro	~$3.50	~$10.50	8,192

A single maximally expensive Bedrock request (4K input + 4K output tokens) costs roughly $0.36. At 100 requests/second sustained for 1 hour, that is approximately $129,600.
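The figures above can be reproduced with a short calculation. This sketch assumes "4K" means 4,000 tokens and uses the approximate Claude 3 Opus prices from the table; the function names are illustrative, not part of any provider SDK.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def hourly_exposure(cost_per_request: float, requests_per_second: float) -> float:
    """Cost in dollars of sustaining a request rate for one hour."""
    return cost_per_request * requests_per_second * 3600


if __name__ == "__main__":
    # Approximate prices from the table above: $15 input, $75 output per 1M tokens
    per_request = request_cost(4_000, 4_000, 15.0, 75.0)
    print(f"Per request: ${per_request:.2f}")
    print(f"Per hour at 100 req/s: ${hourly_exposure(per_request, 100):,.0f}")
```

In practice the sustained request rate is capped by the account's quotas, so the hourly figure should be treated as an upper bound for the rate actually achievable in the target environment.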

Token Exhaustion Attacks

Quota Exhaustion

Cloud AI services have quotas (tokens per minute, requests per minute). Exhausting these quotas denies service to legitimate users:

# AWS Bedrock: Check current quotas
aws service-quotas get-service-quota \
  --service-code bedrock \
  --quota-code <quota-code>
 
# Azure OpenAI: Check deployment capacity
az cognitiveservices account deployment list \
  --name <account> --resource-group <rg> \
  --query "[].{name:name,capacity:properties.sku.capacity}"

Attack Patterns

Sustained high-volume invocation: Send requests at the maximum allowed rate continuously, with each request using maximum tokens. This exhausts the per-minute token quota and generates substantial costs.
Burst attacks: Send large bursts of requests to trigger rate limiting for legitimate users while the burst requests consume quota. Even if the attacker's requests are eventually throttled, the quota consumed during the burst window is not recovered.
Multi-model spreading: If the target has access to multiple models, distribute requests across all of them. Because quotas are tracked per model, this maximizes total token consumption.
Context window stuffing: Send requests with maximum-length input contexts. This consumes input tokens (which are billed) and forces the model to process more data per request, reducing effective throughput for other users.
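To see why max-token and context-stuffed requests matter for quota exhaustion, it helps to compute how few of them are needed to consume a tokens-per-minute quota. The quota value and request sizes below are hypothetical placeholders; real quotas vary by provider, model, and account.

```python
def requests_to_exhaust_tpm(tokens_per_minute_quota: int, tokens_per_request: int) -> int:
    """Smallest number of requests per minute whose tokens meet or exceed the quota."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Integer ceiling division avoids floating-point rounding
    return -(-tokens_per_minute_quota // tokens_per_request)


if __name__ == "__main__":
    quota = 400_000  # hypothetical tokens-per-minute quota
    for label, tokens in (("typical request", 800), ("max-token request", 8_000)):
        needed = requests_to_exhaust_tpm(quota, tokens)
        print(f"{label}: {needed} requests/minute consume the quota")
```

With these placeholder numbers, a tenfold increase in tokens per request means a tenth of the request volume is enough to starve legitimate users, which also makes the traffic harder to spot by request count alone.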

Provisioned Throughput Abuse

Provisioned Throughput (AWS Bedrock PT, Azure OpenAI PTU) is dedicated capacity that costs money whether or not it is used:

Attack	Description	Impact
Provision and abandon	If IAM allows, create PT/PTU deployments and leave them running	Continuous cost with no usage
Capacity exhaustion	Saturate PT/PTU capacity with attacker requests	Legitimate users denied access to reserved capacity
Scaling trigger	Force provisioning of additional capacity through sustained high load	Increased capacity costs

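# AWS: Test whether the caller can create provisioned throughput.
# This is not a dry run: if the call succeeds, capacity is created and billed
# until it is deleted, so run it only with explicit authorization.
aws bedrock create-provisioned-model-throughput \
  --model-units 1 \
  --provisioned-model-name cost-attack-test \
  --model-id anthropic.claude-3-sonnet-20240229-v1:0 \
  2>&1  # An access-denied error here indicates the permission is missing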

GPU Compute Abuse

Training Job Abuse

ML training jobs consume expensive GPU instances. An attacker with training job creation permissions can launch compute for unauthorized purposes:

# AWS SageMaker: Create a training job with expensive GPU instances
aws sagemaker create-training-job \
  --training-job-name gpu-abuse-test \
  --algorithm-specification \
    TrainingImage=<container>,TrainingInputMode=File \
  --resource-config \
    InstanceType=ml.p4d.24xlarge,InstanceCount=4,VolumeSizeInGB=500 \
  --role-arn <role-arn> \
  --stopping-condition MaxRuntimeInSeconds=86400 \
  --output-data-config S3OutputPath=s3://bucket/output

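A single ml.p4d.24xlarge instance costs approximately $37/hour on demand, so four instances running for 24 hours cost about $3,552 (4 x $37 x 24). Attackers have little reason to use managed spot training, since they want uninterrupted access and do not pay the bill, so the target is charged the full on-demand rate.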
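The training job cost follows directly from the resource and stopping-condition settings in the command above. A minimal sketch, using the approximate $37/hour rate quoted here (actual rates vary by region and over time):

```python
def training_job_cost(hourly_rate: float, instance_count: int,
                      max_runtime_seconds: int) -> float:
    """Worst-case cost in dollars if the job runs until MaxRuntimeInSeconds."""
    hours = max_runtime_seconds / 3600
    return hourly_rate * instance_count * hours


if __name__ == "__main__":
    # Values mirror the example job: 4 instances, 86,400 s (24 h) runtime cap
    cost = training_job_cost(hourly_rate=37.0, instance_count=4,
                             max_runtime_seconds=86_400)
    print(f"Worst-case job cost: ${cost:,.0f}")
```

The same function can be used to bound exposure per permission: the largest instance type, instance count, and runtime the role is allowed to request determine the maximum cost of a single job.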

Compute Resource Costs

Instance Type	Provider	GPU	Approximate Cost/Hour
`ml.p4d.24xlarge`	AWS	8x A100	~$37
`Standard_NC96ads_A100_v4`	Azure	4x A100	~$15
`a2-ultragpu-8g`	GCP	8x A100	~$40
`ml.p5.48xlarge`	AWS	8x H100	~$98

Cryptocurrency Mining

GPU instances provisioned through ML services are attractive for cryptocurrency mining:

Training job containers have full GPU access
Custom container images can include mining software
Training jobs can run for hours or days before detection
GPU instance costs are borne by the target organization

Auto-Scaling Exploitation

Scaling Trigger Attacks

AI serving endpoints use auto-scaling to handle variable load. Attackers can exploit scaling configurations:

Scaling Config	Attack	Result
CPU-based scaling	Send compute-intensive requests	Triggers scale-up based on CPU utilization
Request-count scaling	Flood with requests	Triggers scale-up based on request rate
Custom metric scaling	Exploit metric collection gaps	Trigger scaling through metric manipulation
No max limit	Sustained high load	Unbounded scaling, unbounded cost

Scaling Attack Chain

Identify scaling configuration: Determine the endpoint's auto-scaling policy, including which metric triggers scaling, how long the cool-down period is, and whether a maximum instance count is set.

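# AWS SageMaker: List the scaling policies attached to the endpoint variant
aws application-autoscaling describe-scaling-policies \
  --service-namespace sagemaker \
  --resource-id endpoint/<endpoint-name>/variant/<variant-name>

# AWS SageMaker: Show the minimum and maximum capacity for the same variant
aws application-autoscaling describe-scalable-targets \
  --service-namespace sagemaker \
  --resource-ids endpoint/<endpoint-name>/variant/<variant-name>

# GCP Vertex AI: Check endpoint scaling (replica limits per deployed model)
gcloud ai endpoints describe <endpoint-id> --region=us-central1 \
  --format="json(deployedModels[].automaticResources,deployedModels[].dedicatedResources)"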

Calculate cost impact: Determine the cost per instance and the maximum scale. If maximum is not set, the theoretical maximum is limited only by account quotas.
Trigger scaling: Send sustained traffic at a rate that triggers scale-up. Use the most expensive request pattern (max tokens, expensive model) to maximize both per-request cost and infrastructure cost.
Maintain pressure: Keep traffic sustained long enough for auto-scaling to provision maximum instances. Scale-up is typically faster than scale-down, so even a brief attack can leave expensive instances running during the cool-down period.
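The "calculate cost impact" step can be approximated before any traffic is sent. The sketch below is an upper-bound estimate under assumptions that are not in the original: scale-up is treated as instantaneous, extra instances are billed for the attack duration plus the cool-down, and all parameter values are hypothetical.

```python
def scaling_attack_cost(instance_hourly_rate: float, baseline_instances: int,
                        max_instances: int, attack_minutes: float,
                        cooldown_minutes: float) -> float:
    """Upper-bound cost in dollars of the extra instances added by auto-scaling.

    Only instances above the baseline are counted, and they are assumed to
    run for the whole attack plus the scale-down cool-down period.
    """
    extra_instances = max(max_instances - baseline_instances, 0)
    billed_hours = (attack_minutes + cooldown_minutes) / 60
    return extra_instances * instance_hourly_rate * billed_hours


if __name__ == "__main__":
    # Hypothetical endpoint: $10/hour instances, scaling from 2 to 20
    cost = scaling_attack_cost(instance_hourly_rate=10.0, baseline_instances=2,
                               max_instances=20, attack_minutes=30,
                               cooldown_minutes=15)
    print(f"Estimated extra infrastructure cost: ${cost:,.2f}")
```

This covers infrastructure cost only. Per-request token charges, where they apply, are added on top.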

Cool-Down Exploitation

Auto-scaling policies include cool-down periods to prevent thrashing. Attackers can exploit this:

Scale-up, then stop: Trigger maximum scaling, then stop traffic. Instances remain running during the scale-down cool-down period (typically 5-15 minutes).
Pulse attacks: Send bursts of traffic followed by quiet periods timed to prevent scale-down, keeping instances at maximum count while minimizing attacker request volume.
Scale-up cascade: If auto-scaling provisions instances that themselves trigger further scaling (e.g., a new instance starts a warm-up process that consumes resources), cascade scaling exceeds intended limits.
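For reporting purposes, the pulse pattern can be summarized as a duty cycle: the fraction of time an attacker must actually send traffic to hold instances at maximum count. This is a simplified model that assumes scale-down happens only after a full cool-down of quiet time; `margin_seconds` is a hypothetical safety margin, and real scaling policies may evaluate metrics differently.

```python
def pulse_duty_cycle(burst_seconds: float, cooldown_seconds: float,
                     margin_seconds: float = 0.0) -> float:
    """Fraction of wall-clock time spent sending traffic in a pulse pattern.

    Each cycle is one burst followed by a quiet period slightly shorter
    than the cool-down, so scale-down never completes.
    """
    quiet_seconds = cooldown_seconds - margin_seconds
    if burst_seconds <= 0 or quiet_seconds < 0:
        raise ValueError("burst must be positive and margin must not exceed cool-down")
    return burst_seconds / (burst_seconds + quiet_seconds)


if __name__ == "__main__":
    # Hypothetical values: 60 s bursts, 300 s cool-down, 30 s margin
    fraction = pulse_duty_cycle(60, 300, 30)
    print(f"Attacker is active {fraction:.0%} of the time")
```

A low duty cycle is useful in a finding because it shows that request-volume alerts tuned for sustained floods may miss traffic that still keeps the fleet at maximum size.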

Denial-of-Wallet Attack Methodology

Planning

Enumerate AI services and pricing: Identify the most expensive services in use
Determine access level: What API actions can the attacker perform?
Map scaling configurations: How does the target handle load increases?
Identify billing controls: Are there budget alerts, spending limits, or approval workflows?
Estimate impact: Calculate the maximum financial damage achievable

Execution Considerations

Detection and Billing Lag

Most cloud billing has a lag between resource consumption and billing visibility:

Provider	Billing Lag	Cost Reporting Delay
AWS	Up to 24 hours	Same-day estimates, final bills delayed
Azure	Up to 24 hours	Cost Management may lag 24-48 hours
GCP	Up to 24 hours	Billing export may lag several hours

This lag means that cost attacks can accumulate significant charges before any billing alert triggers. Red teams should factor this into engagement planning to avoid unintended financial impact.
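The effect of billing lag on exposure can be estimated with simple arithmetic. This sketch assumes a constant spend rate and that an alert fires only once the threshold amount becomes visible in billing data, one lag period after it was actually spent; the figures are hypothetical.

```python
def spend_before_detection(hourly_cost: float, alert_threshold: float,
                           visibility_lag_hours: float) -> float:
    """Dollars spent by the time a budget alert can fire.

    The threshold amount is spent first, then spending continues at the
    same rate for the lag period before the threshold becomes visible.
    """
    return alert_threshold + hourly_cost * visibility_lag_hours


if __name__ == "__main__":
    # Hypothetical engagement: $3,000/hour spend, $5,000 alert, 24 h lag
    total = spend_before_detection(hourly_cost=3_000, alert_threshold=5_000,
                                   visibility_lag_hours=24)
    print(f"Spend before the alert can fire: ${total:,.0f}")
```

Running this calculation during planning gives a concrete ceiling to agree with the client before any cost-generating test begins.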

Impact Communication

Cost attack findings should be communicated in business terms:

Hourly exposure: "An attacker with this access can generate $X,XXX/hour in AI costs"
Scaling factor: "Auto-scaling without maximum limits amplifies this to $XX,XXX/hour"
Detection gap: "Current billing alerts would not trigger until $X has been spent, with an additional $X in the billing pipeline"
Annual risk: "If exploited persistently, annual cost exposure is $X,XXX,XXX"
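The statements above can be generated from a handful of measured values so that every finding uses the same arithmetic. This is an illustrative sketch: the function name and inputs are hypothetical, and the annual figure assumes continuous exploitation at the base hourly rate (24 hours x 365 days).

```python
def impact_summary(hourly_cost: float, scaling_multiplier: float,
                   alert_threshold: float, pipeline_spend: float) -> list[str]:
    """Build report lines in business terms from measured cost figures."""
    scaled_hourly = hourly_cost * scaling_multiplier
    annual = hourly_cost * 24 * 365  # worst case: exploited continuously
    return [
        f"Hourly exposure: ${hourly_cost:,.0f}/hour in AI costs",
        f"Scaling factor: auto-scaling amplifies this to ${scaled_hourly:,.0f}/hour",
        f"Detection gap: alerts trigger at ${alert_threshold:,.0f}, "
        f"with a further ${pipeline_spend:,.0f} in the billing pipeline",
        f"Annual risk: ${annual:,.0f} if exploited persistently",
    ]


if __name__ == "__main__":
    # Hypothetical measurements from an engagement
    for line in impact_summary(1_500, 10, 5_000, 36_000):
        print(line)
```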

AWS AI Services -- AWS-specific cost attack surfaces
Azure AI Services -- Azure PTU and cost considerations
GCP AI Services -- GCP cost controls and quota management
SageMaker Exploitation -- GPU compute abuse through SageMaker

Knowledge Check

An attacker has bedrock:InvokeModel permission scoped to all models. To maximize the financial impact of a cost attack, which approach is most effective?

Knowledge Check

Why is billing lag a critical factor in AI cost attack planning for red team engagements?
