Cost Security and Budget Controls for Cloud AI
Protecting cloud AI deployments from cost-based attacks including denial-of-wallet, token exhaustion, and auto-scaling abuse with budget controls across AWS, Azure, and GCP.
Overview
Cloud AI services have a unique cost profile that creates novel attack vectors. Unlike traditional cloud compute where costs scale with provisioned resources, AI inference costs scale with usage -- specifically, with the number of tokens processed. A single compromised API key or misconfigured endpoint can generate thousands of dollars in charges within minutes, because there is no physical resource to provision and the marginal cost of each additional request is charged immediately.
The threat model for cloud AI cost security includes three categories: external attackers who gain access to AI service credentials, insiders who abuse AI services for unauthorized purposes, and automated systems that malfunction and generate runaway costs. All three require a combination of preventive controls (budget limits, rate limiting) and detective controls (anomaly detection, alerting).
This article covers the cost attack surface for each major cloud AI provider, practical budget control implementations, and red team techniques for testing cost security controls.
Cost Attack Surface
Token Economics and Attack Amplification
Cloud AI pricing is based on tokens (roughly 4 characters per token for English text). The cost asymmetry between input and output tokens creates attack amplification opportunities:
| Provider | Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Amplification Factor |
|---|---|---|---|---|
| AWS Bedrock | Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | 5x output vs input |
| Azure OpenAI | GPT-4o | ~$2.50 | ~$10.00 | 4x output vs input |
| GCP Vertex AI | Gemini 1.5 Pro | ~$1.25 | ~$5.00 | 4x output vs input |
| AWS Bedrock | Claude 3 Opus | ~$15.00 | ~$75.00 | 5x output vs input |
An attacker can maximize cost damage by crafting prompts that generate maximum-length outputs. A short prompt like "Write a 4000-word essay on any topic" costs pennies in input tokens but generates dollars in output tokens.
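To make the asymmetry concrete, here is a minimal sketch of the per-request economics (rates taken from the table above; the helper name is illustrative):

```python
def single_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_million: float,
    output_per_million: float,
) -> dict:
    """Estimate one request's cost and its input-to-output cost amplification."""
    input_cost = input_tokens / 1_000_000 * input_per_million
    output_cost = output_tokens / 1_000_000 * output_per_million
    return {
        "input_cost_usd": input_cost,
        "output_cost_usd": output_cost,
        "amplification": output_cost / input_cost if input_cost else float("inf"),
    }

# A ~10-token prompt that elicits a 4,000-token response at GPT-4o rates:
costs = single_request_cost(10, 4000, 2.50, 10.00)
# Input costs $0.000025; output costs $0.04 -- a 1600x cost amplification
```

The attacker pays nothing, of course; the asymmetry matters because per-request detection thresholds based on input size alone will miss these requests entirely.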
Denial-of-Wallet Attack Patterns
"""
Cost attack pattern analysis for cloud AI services.
For authorized red team assessments only.
"""
def calculate_attack_cost_projection(
requests_per_second: float,
avg_input_tokens: int,
avg_output_tokens: int,
input_cost_per_million: float,
output_cost_per_million: float,
duration_hours: float,
) -> dict:
"""Project the cost of a sustained denial-of-wallet attack."""
total_requests = requests_per_second * 3600 * duration_hours
total_input_tokens = total_requests * avg_input_tokens
total_output_tokens = total_requests * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_cost_per_million
output_cost = (total_output_tokens / 1_000_000) * output_cost_per_million
total_cost = input_cost + output_cost
return {
"total_requests": int(total_requests),
"total_input_tokens": int(total_input_tokens),
"total_output_tokens": int(total_output_tokens),
"input_cost_usd": round(input_cost, 2),
"output_cost_usd": round(output_cost, 2),
"total_cost_usd": round(total_cost, 2),
"cost_per_hour_usd": round(total_cost / duration_hours, 2),
}
# Example: Sustained attack using GPT-4o at 10 requests/second
projection = calculate_attack_cost_projection(
requests_per_second=10,
avg_input_tokens=100, # Short malicious prompt
avg_output_tokens=4000, # Maximum output
input_cost_per_million=2.50, # GPT-4o input
output_cost_per_million=10.00, # GPT-4o output
duration_hours=24,
)
# Result: ~$34,776 in 24 hours ($216 input + $34,560 output) from a single compromised key
Provisioned Throughput Abuse
Beyond pay-per-token costs, cloud AI providers offer provisioned throughput (reserved capacity) that costs significantly more:
- AWS Bedrock Provisioned Throughput: Billed hourly at rates up to $50+/hour per model unit. Creating provisioned throughput with a compromised credential can cost thousands before detection.
- Azure OpenAI Provisioned-Managed: PTU (Provisioned Throughput Units) are committed capacity billed hourly. An attacker creating a large PTU deployment generates immediate financial liability.
- GCP Vertex AI Provisioned Throughput: Reserved capacity for Gemini models, billed on commitment.
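The financial exposure is easy to project. A minimal sketch (the $50/hour per-model-unit rate is an assumption for illustration; actual rates vary by model and commitment term):

```python
PROVISIONED_RATE_USD_PER_HOUR = 50.0  # assumed per-model-unit rate; varies by model

def provisioned_abuse_liability(model_units: int, hours_until_detected: float) -> float:
    """Estimate billing liability from attacker-created provisioned throughput."""
    return model_units * PROVISIONED_RATE_USD_PER_HOUR * hours_until_detected

# Ten model units left running undetected over a weekend (60 hours):
liability = provisioned_abuse_liability(10, 60)  # $30,000
```

Unlike pay-per-token abuse, this liability accrues even if the attacker never sends a single inference request, which is why detection must key on the management-plane API call, not on traffic.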
import boto3
from botocore.exceptions import ClientError
def test_provisioned_throughput_controls(
session: boto3.Session,
region: str = "us-east-1",
) -> dict:
"""Test whether provisioned throughput creation is properly restricted."""
bedrock = session.client("bedrock", region_name=region)
findings = []
    # Test 1: Can the current identity create provisioned throughput?
    try:
        # Dry-run approach: use an intentionally invalid modelId so the
        # request fails server-side validation *after* the IAM check,
        # proving the permission exists without actually provisioning capacity.
        response = bedrock.create_provisioned_model_throughput(
            modelUnits=1,
            provisionedModelName="security-test-do-not-create",
            modelId="invalid-model-id-for-permission-test",
        )
        # Defensive cleanup: if the call somehow succeeded, delete immediately
        bedrock.delete_provisioned_model_throughput(
            provisionedModelId=response["provisionedModelArn"]
        )
        findings.append({
            "severity": "HIGH",
            "finding": "Identity can create provisioned throughput",
            "detail": "This identity can create provisioned model throughput, "
                      "which could cost $50+/hour per model unit. Restrict "
                      "bedrock:CreateProvisionedModelThroughput to admin roles.",
        })
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code == "AccessDeniedException":
            findings.append({
                "severity": "INFO",
                "finding": "Provisioned throughput creation blocked by IAM",
            })
        elif error_code in ("ValidationException", "ResourceNotFoundException"):
            findings.append({
                "severity": "HIGH",
                "finding": "Identity has CreateProvisionedModelThroughput permission",
                "detail": "Request reached server-side validation (past the IAM "
                          "check). The identity could create real provisioned "
                          "throughput with valid parameters.",
            })
    return {"findings": findings}
Budget Controls Implementation
AWS Budget Controls
import boto3
def create_bedrock_budget(
session: boto3.Session,
account_id: str,
monthly_limit_usd: float,
alert_thresholds: list = None,
notification_email: str = None,
) -> dict:
"""Create an AWS Budget specifically for Bedrock spending."""
budgets = session.client("budgets")
if alert_thresholds is None:
alert_thresholds = [50, 80, 100] # percent
notifications = []
subscribers = []
if notification_email:
subscribers = [
{"SubscriptionType": "EMAIL", "Address": notification_email}
]
for threshold in alert_thresholds:
notifications.append({
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": threshold,
"ThresholdType": "PERCENTAGE",
},
"Subscribers": subscribers,
})
budget = {
"BudgetName": "bedrock-monthly-limit",
"BudgetLimit": {
"Amount": str(monthly_limit_usd),
"Unit": "USD",
},
"BudgetType": "COST",
"TimeUnit": "MONTHLY",
"CostFilters": {
"Service": ["Amazon Bedrock"],
},
}
try:
budgets.create_budget(
AccountId=account_id,
Budget=budget,
NotificationsWithSubscribers=notifications,
)
return {
"budget_name": "bedrock-monthly-limit",
"limit_usd": monthly_limit_usd,
"alert_thresholds": alert_thresholds,
"status": "created",
}
except Exception as e:
return {"error": str(e)}
def create_bedrock_cost_anomaly_detector(
session: boto3.Session,
) -> dict:
"""Create a Cost Anomaly Detection monitor for Bedrock."""
ce = session.client("ce")
try:
response = ce.create_anomaly_monitor(
AnomalyMonitor={
"MonitorName": "bedrock-cost-anomaly",
"MonitorType": "DIMENSIONAL",
"MonitorDimension": "SERVICE",
}
)
monitor_arn = response["MonitorArn"]
# Create anomaly subscription for alerts
ce.create_anomaly_subscription(
AnomalySubscription={
"SubscriptionName": "bedrock-anomaly-alerts",
"MonitorArnList": [monitor_arn],
"Frequency": "IMMEDIATE",
"Threshold": 50.0, # Alert on anomalies over $50
"Subscribers": [
{
"Type": "SNS",
"Address": "arn:aws:sns:us-east-1:ACCOUNT:bedrock-cost-alerts",
}
],
}
)
return {
"monitor_arn": monitor_arn,
"threshold_usd": 50.0,
"status": "created",
}
except Exception as e:
        return {"error": str(e)}
Azure OpenAI Quota and Budget Controls
Azure OpenAI provides token-per-minute (TPM) quotas at the deployment level, which serve as the primary rate-limiting mechanism:
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.costmanagement import CostManagementClient
def configure_deployment_rate_limits(
subscription_id: str,
resource_group: str,
account_name: str,
deployment_name: str,
tokens_per_minute_limit: int = 10000,
) -> dict:
"""Configure rate limits on an Azure OpenAI deployment."""
credential = DefaultAzureCredential()
client = CognitiveServicesManagementClient(credential, subscription_id)
    # Update the deployment's SKU capacity. Note: this PUT replaces the
    # deployment definition, so in practice the existing model properties
    # must be resent alongside the new sku.
    deployment = client.deployments.begin_create_or_update(
        resource_group,
        account_name,
        deployment_name,
        {
            "sku": {
                "name": "Standard",
                "capacity": tokens_per_minute_limit // 1000,  # Capacity in thousands of TPM
            },
        },
    ).result()
return {
"deployment": deployment_name,
"capacity_k_tpm": deployment.sku.capacity,
"effective_tpm": deployment.sku.capacity * 1000,
"detail": "Requests exceeding this rate will receive 429 responses.",
}
def create_azure_ai_cost_alert(
    subscription_id: str,
    resource_group: str,
    monthly_budget_usd: float,
    contact_emails: list,
) -> dict:
    """Create a budget alert for Azure AI spending."""
    # Budgets are managed through the Consumption API
    # (azure-mgmt-consumption), not the Cost Management SDK.
    from azure.mgmt.consumption import ConsumptionManagementClient
    credential = DefaultAzureCredential()
    consumption_client = ConsumptionManagementClient(credential, subscription_id)
    scope = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    budget = {
        "category": "Cost",
        "amount": monthly_budget_usd,
        "time_grain": "Monthly",
        "time_period": {
            # Adjust to the desired budget window
            "start_date": "2026-03-01T00:00:00Z",
            "end_date": "2027-03-01T00:00:00Z",
        },
        "filter": {
            "dimensions": {
                "name": "ServiceName",
                "operator": "In",
                "values": ["Azure OpenAI", "Cognitive Services"],
            }
        },
        "notifications": {
            f"actual_{threshold}_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": threshold,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            }
            for threshold in (50, 80, 100)
        },
    }
    consumption_client.budgets.create_or_update(
        scope, "ai-services-budget", budget
    )
    return {
        "budget_name": "ai-services-budget",
        "monthly_limit_usd": monthly_budget_usd,
        "alerts": ["50%", "80%", "100%"],
    }
GCP Vertex AI Budget Controls
from google.cloud import billing_budgets_v1
def create_vertex_ai_budget(
billing_account_id: str,
project_id: str,
monthly_limit_usd: float,
notification_pubsub_topic: str = None,
) -> dict:
"""Create a budget for Vertex AI spending on GCP."""
client = billing_budgets_v1.BudgetServiceClient()
budget = billing_budgets_v1.Budget(
display_name="vertex-ai-monthly-budget",
budget_filter=billing_budgets_v1.Filter(
projects=[f"projects/{project_id}"],
        services=[
            # Budget filters take Cloud Billing service IDs
            # ("services/<SERVICE_ID>", from the Billing Catalog API),
            # not API hostnames -- look up the Vertex AI ID and substitute it.
            "services/VERTEX_AI_SERVICE_ID",
        ],
    ),
),
amount=billing_budgets_v1.BudgetAmount(
specified_amount={"units": int(monthly_limit_usd), "currency_code": "USD"}
),
threshold_rules=[
billing_budgets_v1.ThresholdRule(
threshold_percent=0.5,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
billing_budgets_v1.ThresholdRule(
threshold_percent=0.8,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
billing_budgets_v1.ThresholdRule(
threshold_percent=1.0,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
),
],
)
if notification_pubsub_topic:
budget.notifications_rule = billing_budgets_v1.NotificationsRule(
pubsub_topic=notification_pubsub_topic,
schema_version="1.0",
)
request = billing_budgets_v1.CreateBudgetRequest(
parent=f"billingAccounts/{billing_account_id}",
budget=budget,
)
result = client.create_budget(request=request)
return {
"budget_name": result.name,
"monthly_limit_usd": monthly_limit_usd,
"status": "created",
    }
Application-Level Rate Limiting
Cloud provider budget alerts are reactive -- they notify after spend has occurred. For real-time cost protection, implement application-level rate limiting:
import time
from dataclasses import dataclass, field
from threading import Lock
@dataclass
class AIRateLimiter:
"""Token-aware rate limiter for cloud AI services."""
max_tokens_per_minute: int = 100000
max_tokens_per_hour: int = 2000000
max_cost_per_hour_usd: float = 50.0
cost_per_input_token: float = 0.000003 # $3/1M tokens
cost_per_output_token: float = 0.000015 # $15/1M tokens
_lock: Lock = field(default_factory=Lock)
_minute_tokens: list = field(default_factory=list)
_hour_tokens: list = field(default_factory=list)
_hour_cost: list = field(default_factory=list)
def check_and_record(
self,
input_tokens: int,
estimated_output_tokens: int,
        user_id: str = "default",  # reserved for per-user limits (not yet enforced)
) -> dict:
"""Check if a request is within rate limits and record it."""
with self._lock:
now = time.time()
total_tokens = input_tokens + estimated_output_tokens
estimated_cost = (
input_tokens * self.cost_per_input_token
+ estimated_output_tokens * self.cost_per_output_token
)
# Clean old entries
minute_cutoff = now - 60
hour_cutoff = now - 3600
self._minute_tokens = [
(t, tokens) for t, tokens in self._minute_tokens if t > minute_cutoff
]
self._hour_tokens = [
(t, tokens) for t, tokens in self._hour_tokens if t > hour_cutoff
]
self._hour_cost = [
(t, cost) for t, cost in self._hour_cost if t > hour_cutoff
]
# Check minute limit
minute_total = sum(tokens for _, tokens in self._minute_tokens) + total_tokens
if minute_total > self.max_tokens_per_minute:
return {
"allowed": False,
"reason": "tokens_per_minute_exceeded",
"current": minute_total - total_tokens,
"limit": self.max_tokens_per_minute,
"retry_after_seconds": 60,
}
# Check hour limit
hour_total = sum(tokens for _, tokens in self._hour_tokens) + total_tokens
if hour_total > self.max_tokens_per_hour:
return {
"allowed": False,
"reason": "tokens_per_hour_exceeded",
"current": hour_total - total_tokens,
"limit": self.max_tokens_per_hour,
}
# Check cost limit
hour_cost = sum(cost for _, cost in self._hour_cost) + estimated_cost
if hour_cost > self.max_cost_per_hour_usd:
return {
"allowed": False,
"reason": "cost_per_hour_exceeded",
"current_cost_usd": round(hour_cost - estimated_cost, 4),
"limit_usd": self.max_cost_per_hour_usd,
}
# Record the request
self._minute_tokens.append((now, total_tokens))
self._hour_tokens.append((now, total_tokens))
self._hour_cost.append((now, estimated_cost))
return {
"allowed": True,
"estimated_cost_usd": round(estimated_cost, 6),
"minute_usage_pct": round((minute_total / self.max_tokens_per_minute) * 100, 1),
"hour_usage_pct": round((hour_total / self.max_tokens_per_hour) * 100, 1),
"hour_cost_pct": round((hour_cost / self.max_cost_per_hour_usd) * 100, 1),
            }
Red Team Testing for Cost Controls
Authorized Cost Security Testing
When conducting authorized assessments, test cost controls without generating actual excessive spend:
def test_cost_controls_safely(
rate_limiter: AIRateLimiter,
cloud_client: object,
) -> dict:
"""Test cost control effectiveness with minimal actual spend."""
results = {
"rate_limiter_tests": [],
"budget_alert_tests": [],
"credential_scope_tests": [],
}
    # Test 1: Verify rate limiter blocks excessive requests.
    # Each request consumes 5,000 tokens, so the default 100k tokens/minute
    # limit should trip on the 21st request.
    for i in range(25):
        check = rate_limiter.check_and_record(
            input_tokens=1000,
            estimated_output_tokens=4000,
            user_id="test-user",
        )
        if not check["allowed"]:
            results["rate_limiter_tests"].append({
                "test": "rapid_request_blocking",
                "pass": True,
                "blocked_at_request": i + 1,
                "reason": check["reason"],
            })
            break
    else:
        results["rate_limiter_tests"].append({
            "test": "rapid_request_blocking",
            "pass": False,
            "detail": "Rate limiter did not block after 25 rapid requests",
        })
# Test 2: Verify max_tokens parameter is enforced
# Send a single request with a very small max_tokens to verify
# the application enforces output length limits
results["rate_limiter_tests"].append({
"test": "output_token_limit",
"recommendation": "Verify application sets max_tokens on all API calls. "
"Without max_tokens, a single request can generate "
"the model's full context window in output tokens.",
})
# Test 3: Check if provisioned throughput creation is IAM-restricted
results["credential_scope_tests"].append({
"test": "provisioned_throughput_restriction",
"recommendation": "Verify application service accounts cannot call "
"CreateProvisionedModelThroughput or equivalent.",
})
    return results
Cost Attack Indicators
Monitor for these patterns that indicate cost-based attacks:
| Indicator | Description | Detection Method |
|---|---|---|
| Token volume spike | Sudden increase in tokens processed per minute | CloudWatch/Azure Monitor/Cloud Monitoring metrics |
| Max output requests | High percentage of requests generating maximum-length outputs | Application-level logging of output token counts |
| Off-hours usage | AI API calls during periods when the application should be idle | Time-based alerting on API call metrics |
| New IP/identity | API calls from previously unseen IP addresses or identities | CloudTrail/Activity Log/Audit Log analysis |
| Provisioned capacity creation | New provisioned throughput or committed use | Management API event monitoring |
| Burst after silence | A credential that was dormant suddenly generating high volume | Behavioral analysis of per-credential usage patterns |
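Several of these indicators reduce to comparing current usage against a rolling baseline. A minimal sketch of the "token volume spike" check (the spike factor and floor are illustrative thresholds, not recommendations):

```python
from collections import deque

def is_token_spike(
    history: deque,
    current_tokens_per_min: int,
    spike_factor: float = 5.0,
    min_baseline: float = 1000.0,
) -> bool:
    """Flag a per-minute token count far above the rolling baseline."""
    if not history:
        # No baseline yet: record the sample and treat it as normal
        history.append(current_tokens_per_min)
        return False
    baseline = max(sum(history) / len(history), min_baseline)
    spike = current_tokens_per_min > baseline * spike_factor
    if not spike:
        history.append(current_tokens_per_min)  # learn only from normal traffic
    return spike

history = deque(maxlen=60)  # rolling window: last hour of per-minute samples
for sample in [900, 1100, 1000, 950]:
    is_token_spike(history, sample)
# A jump to 40,000 tokens/minute against a ~1,000/minute baseline trips the check
```

Excluding flagged samples from the baseline prevents a patient attacker from gradually "training" the detector to accept high volume, though a slow ramp below the spike factor can still drift the baseline upward.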
Defense-in-Depth Strategy
Layer cost controls from the cloud provider level down to the application level:
- Cloud provider budget alerts: Set monthly budgets with alerts at 50%, 80%, and 100% thresholds. These are the last line of defense but have reporting delays.
- IAM permission restrictions: Remove CreateProvisionedModelThroughput and equivalent high-cost actions from all non-admin identities.
- Deployment-level quotas: Configure TPM (tokens per minute) limits on each model deployment. Azure OpenAI has native support; other providers require API Gateway rate limiting.
- Application rate limiting: Implement token-aware rate limiting in the application layer with per-user and global limits.
- Request validation: Set max_tokens on every API call. Reject user inputs that exceed reasonable length thresholds.
- Real-time anomaly detection: Use cost anomaly detection services and custom monitoring to detect unusual spend patterns within minutes, not hours.
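The request-validation layer is the cheapest to implement. A minimal sketch (the limits and helper name are illustrative assumptions, not recommended values):

```python
MAX_INPUT_CHARS = 8000          # roughly 2,000 tokens; tune per application
HARD_MAX_OUTPUT_TOKENS = 1024   # server-side output cap, regardless of client input

def validate_and_cap_request(prompt: str, requested_max_tokens: int = None) -> dict:
    """Bound input size and guarantee max_tokens is set on every upstream call."""
    if len(prompt) > MAX_INPUT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_INPUT_CHARS} characters")
    # Never forward a request without an output cap: an uncapped call can
    # bill for the model's entire maximum output length.
    requested = requested_max_tokens or HARD_MAX_OUTPUT_TOKENS
    return {"prompt": prompt, "max_tokens": min(requested, HARD_MAX_OUTPUT_TOKENS)}
```

Because the cap is enforced server-side before the provider API call, a compromised client cannot negotiate it upward; the worst case per request is bounded regardless of what the attacker sends.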
References
- AWS, "AWS Cost Anomaly Detection," https://docs.aws.amazon.com/cost-management/latest/userguide/manage-ad.html
- Microsoft, "Plan and manage costs for Azure OpenAI Service," https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs
- Google Cloud, "Cloud Billing budgets," https://cloud.google.com/billing/docs/how-to/budgets
- OWASP, "LLM10:2025 Unbounded Consumption," https://owasp.org/www-project-top-10-for-large-language-model-applications/
Review Questions
- Why is setting max_tokens on every AI API call important for cost security?
- An organization sets an AWS Budget alert at $5,000/month for Bedrock. Why is this insufficient as a sole cost control?