Cost Security and Budget Controls for Cloud AI
Protecting cloud AI deployments from cost-based attacks, including denial-of-wallet, token exhaustion, and auto-scaling abuse, with budget controls across AWS, Azure, and GCP.
Overview
Cloud AI services have a unique cost profile that creates novel attack vectors. Unlike traditional cloud compute, where costs scale with provisioned resources, AI inference costs scale with usage -- specifically, with the number of tokens processed. A single compromised API key or misconfigured endpoint can generate thousands of dollars in charges within minutes, because there is no physical resource to provision and the marginal cost of each additional request is charged immediately.
The threat model for cloud AI cost security includes three categories: external attackers who gain access to AI service credentials, insiders who abuse AI services for unauthorized purposes, and automated systems that malfunction and generate runaway costs. All three require a combination of preventive controls (budget limits, rate limiting) and detective controls (anomaly detection, alerting).
This article covers the cost attack surface for each major cloud AI provider, practical budget control implementations, and red team techniques for testing cost security controls.
Cost Attack Surface
Token Economics and Attack Amplification
Cloud AI pricing is based on tokens (roughly 4 characters per token for English text). The cost asymmetry between input and output tokens creates attack amplification opportunities:
| Provider | Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Amplification Factor |
|---|---|---|---|---|
| AWS Bedrock | Claude 3.5 Sonnet | ~$3.00 | ~$15.00 | 5x output vs input |
| Azure OpenAI | GPT-4o | ~$2.50 | ~$10.00 | 4x output vs input |
| GCP Vertex AI | Gemini 1.5 Pro | ~$1.25 | ~$5.00 | 4x output vs input |
| AWS Bedrock | Claude 3 Opus | ~$15.00 | ~$75.00 | 5x output vs input |
Attackers can maximize cost damage by crafting prompts that generate maximum-length outputs. A short prompt like "Write a 4000-word essay on any topic" costs pennies in input tokens but generates dollars in output tokens.
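The asymmetry can be made concrete with a quick calculation. This sketch uses the approximate Claude 3.5 Sonnet list prices from the table above and assumes a 4000-word essay comes to roughly 6,000 output tokens (about 1.5 tokens per word); both figures are illustrative:

```python
# Approximate Claude 3.5 Sonnet list prices from the table above
INPUT_COST_PER_M = 3.00    # USD per 1M input tokens
OUTPUT_COST_PER_M = 15.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_COST_PER_M

# A ~12-token prompt that elicits ~6,000 output tokens (a 4000-word essay)
cost = request_cost(input_tokens=12, output_tokens=6000)
ratio = (6000 * OUTPUT_COST_PER_M) / (12 * INPUT_COST_PER_M)
print(f"cost ~ ${cost:.4f}, output/input cost ratio ~ {ratio:.0f}x")
```

The output cost dwarfs the input cost by a factor of thousands, which is why the amplification factor matters far more than the per-request price.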
Denial-of-Wallet Attack Patterns
"""
Cost attack pattern analysis for 雲端 AI services.
For authorized 紅隊 assessments only.
"""
def calculate_attack_cost_projection(
requests_per_second: float,
avg_input_tokens: int,
avg_output_tokens: int,
input_cost_per_million: float,
output_cost_per_million: float,
duration_hours: float,
) -> dict:
"""Project the cost of a sustained denial-of-wallet attack."""
total_requests = requests_per_second * 3600 * duration_hours
total_input_tokens = total_requests * avg_input_tokens
total_output_tokens = total_requests * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_cost_per_million
output_cost = (total_output_tokens / 1_000_000) * output_cost_per_million
total_cost = input_cost + output_cost
return {
"total_requests": int(total_requests),
"total_input_tokens": int(total_input_tokens),
"total_output_tokens": int(total_output_tokens),
"input_cost_usd": round(input_cost, 2),
"output_cost_usd": round(output_cost, 2),
"total_cost_usd": round(total_cost, 2),
"cost_per_hour_usd": round(total_cost / duration_hours, 2),
}
# 範例: Sustained attack using GPT-4o at 10 requests/second
projection = calculate_attack_cost_projection(
requests_per_second=10,
avg_input_tokens=100, # Short malicious prompt
avg_output_tokens=4000, # Maximum 輸出
input_cost_per_million=2.50, # GPT-4o 輸入
output_cost_per_million=10.00, # GPT-4o 輸出
duration_hours=24,
)
# Result: ~$34,560 in 24 hours from a single compromised keyProvisioned Throughput Abuse
Beyond pay-per-token costs, cloud AI providers offer provisioned throughput (reserved capacity) that costs significantly more:
- AWS Bedrock Provisioned Throughput: Billed hourly at rates up to $50+/hour per model unit. Creating provisioned throughput with a compromised credential can cost thousands before detection.
- Azure OpenAI Provisioned-Managed: PTUs (Provisioned Throughput Units) are committed capacity billed hourly. An attacker creating a large PTU deployment generates immediate financial liability.
- GCP Vertex AI Provisioned Throughput: Reserved capacity for Gemini models, billed on commitment.
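As a rough illustration of the liability window, the hourly rates above translate into a simple exposure estimate. The $50/hour figure is the Bedrock order-of-magnitude rate cited above; actual rates vary by model and commitment term:

```python
def provisioned_abuse_exposure(
    model_units: int,
    hourly_rate_per_unit_usd: float,
    hours_until_detection: float,
) -> float:
    """Estimate spend incurred before a rogue provisioned deployment is detected."""
    return model_units * hourly_rate_per_unit_usd * hours_until_detection

# 10 model units at $50/hour, detected after one business day
exposure = provisioned_abuse_exposure(10, 50.0, 8)
print(f"exposure ~ ${exposure:,.0f}")  # 10 * 50 * 8 = $4,000
```

The exposure scales linearly with detection delay, which is why the IAM checks below matter as much as billing alerts.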
```python
import boto3
from botocore.exceptions import ClientError

def test_provisioned_throughput_controls(
    session: boto3.Session,
    region: str = "us-east-1",
) -> dict:
    """Test whether provisioned throughput creation is properly restricted."""
    bedrock = session.client("bedrock", region_name=region)
    findings = []
    # Test 1: can the current identity create provisioned throughput?
    try:
        # Attempt to create a minimal provisioned throughput. If the call
        # succeeds, the identity holds the permission; delete immediately
        # to avoid ongoing hourly charges.
        response = bedrock.create_provisioned_model_throughput(
            modelUnits=1,
            provisionedModelName="security-test-do-not-create",
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
        )
        bedrock.delete_provisioned_model_throughput(
            provisionedModelId=response["provisionedModelArn"]
        )
        findings.append({
            "severity": "HIGH",
            "finding": "Identity can create provisioned throughput",
            "detail": "This identity can create provisioned model throughput, "
                      "which could cost $50+/hour per model unit. Restrict "
                      "bedrock:CreateProvisionedModelThroughput to admin roles.",
        })
    except ClientError as e:
        if e.response["Error"]["Code"] == "AccessDeniedException":
            findings.append({
                "severity": "INFO",
                "finding": "Provisioned throughput creation blocked by IAM",
            })
        elif e.response["Error"]["Code"] == "ValidationException":
            findings.append({
                "severity": "HIGH",
                "finding": "Identity has CreateProvisionedModelThroughput permission",
                "detail": "Request reached validation (past the IAM check). The identity "
                          "could create valid provisioned throughput with correct parameters.",
            })
    return {"findings": findings}
```

Budget Controls Implementation
AWS Budget Controls
```python
import boto3

def create_bedrock_budget(
    session: boto3.Session,
    account_id: str,
    monthly_limit_usd: float,
    alert_thresholds: list = None,
    notification_email: str = None,
) -> dict:
    """Create an AWS Budget specifically for Bedrock spending."""
    budgets = session.client("budgets")
    if alert_thresholds is None:
        alert_thresholds = [50, 80, 100]  # percent
    subscribers = []
    if notification_email:
        subscribers = [
            {"SubscriptionType": "EMAIL", "Address": notification_email}
        ]
    notifications = []
    for threshold in alert_thresholds:
        notifications.append({
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": subscribers,
        })
    budget = {
        "BudgetName": "bedrock-monthly-limit",
        "BudgetLimit": {
            "Amount": str(monthly_limit_usd),
            "Unit": "USD",
        },
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "CostFilters": {
            "Service": ["Amazon Bedrock"],
        },
    }
    try:
        budgets.create_budget(
            AccountId=account_id,
            Budget=budget,
            NotificationsWithSubscribers=notifications,
        )
        return {
            "budget_name": "bedrock-monthly-limit",
            "limit_usd": monthly_limit_usd,
            "alert_thresholds": alert_thresholds,
            "status": "created",
        }
    except Exception as e:
        return {"error": str(e)}

def create_bedrock_cost_anomaly_detector(
    session: boto3.Session,
) -> dict:
    """Create a Cost Anomaly Detection monitor for Bedrock."""
    ce = session.client("ce")
    try:
        response = ce.create_anomaly_monitor(
            AnomalyMonitor={
                "MonitorName": "bedrock-cost-anomaly",
                "MonitorType": "DIMENSIONAL",
                "MonitorDimension": "SERVICE",
            }
        )
        monitor_arn = response["MonitorArn"]
        # Create an anomaly subscription for alerts
        ce.create_anomaly_subscription(
            AnomalySubscription={
                "SubscriptionName": "bedrock-anomaly-alerts",
                "MonitorArnList": [monitor_arn],
                "Frequency": "IMMEDIATE",
                "Threshold": 50.0,  # Alert on anomalies over $50
                "Subscribers": [
                    {
                        "Type": "SNS",
                        "Address": "arn:aws:sns:us-east-1:ACCOUNT:bedrock-cost-alerts",
                    }
                ],
            }
        )
        return {
            "monitor_arn": monitor_arn,
            "threshold_usd": 50.0,
            "status": "created",
        }
    except Exception as e:
        return {"error": str(e)}
```

Azure OpenAI Quota and Budget Controls
Azure OpenAI provides tokens-per-minute (TPM) quotas at the deployment level, which serve as the primary rate-limiting mechanism:
```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.consumption import ConsumptionManagementClient

def configure_deployment_rate_limits(
    subscription_id: str,
    resource_group: str,
    account_name: str,
    deployment_name: str,
    tokens_per_minute_limit: int = 10000,
) -> dict:
    """Configure rate limits on an Azure OpenAI deployment."""
    credential = DefaultAzureCredential()
    client = CognitiveServicesManagementClient(credential, subscription_id)
    # Update the deployment's SKU; capacity is expressed in thousands
    # of tokens per minute
    deployment = client.deployments.begin_create_or_update(
        resource_group,
        account_name,
        deployment_name,
        {
            "sku": {
                "name": "Standard",
                "capacity": tokens_per_minute_limit // 1000,
            },
        },
    ).result()
    return {
        "deployment": deployment_name,
        "capacity_k_tpm": deployment.sku.capacity,
        "effective_tpm": deployment.sku.capacity * 1000,
        "detail": "Requests exceeding this rate will receive 429 responses.",
    }

def create_azure_ai_cost_alert(
    subscription_id: str,
    resource_group: str,
    monthly_budget_usd: float,
    contact_emails: list,
) -> dict:
    """Create a budget alert for Azure AI spending."""
    credential = DefaultAzureCredential()
    # Budgets live in the Consumption API (Microsoft.Consumption),
    # not the Cost Management query API
    consumption_client = ConsumptionManagementClient(credential, subscription_id)
    scope = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    budget = {
        "category": "Cost",
        "amount": monthly_budget_usd,
        "time_grain": "Monthly",
        "time_period": {
            "start_date": "2026-03-01T00:00:00Z",
            "end_date": "2027-03-01T00:00:00Z",
        },
        "filter": {
            "dimensions": {
                "name": "ServiceName",
                "operator": "In",
                "values": ["Azure OpenAI", "Cognitive Services"],
            }
        },
        "notifications": {
            "actual_50_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 50,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
            "actual_80_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 80,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
            "actual_100_percent": {
                "enabled": True,
                "operator": "GreaterThan",
                "threshold": 100,
                "contact_emails": contact_emails,
                "threshold_type": "Actual",
            },
        },
    }
    consumption_client.budgets.create_or_update(
        scope, "ai-services-budget", budget
    )
    return {
        "budget_name": "ai-services-budget",
        "monthly_limit_usd": monthly_budget_usd,
        "alerts": ["50%", "80%", "100%"],
    }
```

GCP Vertex AI Budget Controls
```python
from google.cloud import billing_budgets_v1

def create_vertex_ai_budget(
    billing_account_id: str,
    project_id: str,
    monthly_limit_usd: float,
    notification_pubsub_topic: str = None,
) -> dict:
    """Create a budget for Vertex AI spending on GCP."""
    client = billing_budgets_v1.BudgetServiceClient()
    budget = billing_budgets_v1.Budget(
        display_name="vertex-ai-monthly-budget",
        budget_filter=billing_budgets_v1.Filter(
            projects=[f"projects/{project_id}"],
            # Note: Filter.services expects Cloud Billing service IDs of the
            # form "services/XXXX-XXXX-XXXX"; look up the Vertex AI service
            # ID for your billing account rather than using the API name.
            services=["services/aiplatform.googleapis.com"],
        ),
        amount=billing_budgets_v1.BudgetAmount(
            specified_amount={"units": int(monthly_limit_usd), "currency_code": "USD"}
        ),
        threshold_rules=[
            billing_budgets_v1.ThresholdRule(
                threshold_percent=0.5,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
            billing_budgets_v1.ThresholdRule(
                threshold_percent=0.8,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
            billing_budgets_v1.ThresholdRule(
                threshold_percent=1.0,
                spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
            ),
        ],
    )
    if notification_pubsub_topic:
        budget.notifications_rule = billing_budgets_v1.NotificationsRule(
            pubsub_topic=notification_pubsub_topic,
            schema_version="1.0",
        )
    request = billing_budgets_v1.CreateBudgetRequest(
        parent=f"billingAccounts/{billing_account_id}",
        budget=budget,
    )
    result = client.create_budget(request=request)
    return {
        "budget_name": result.name,
        "monthly_limit_usd": monthly_limit_usd,
        "status": "created",
    }
```

Application-Level Rate Limiting
Cloud provider budget alerts are reactive -- they notify after spend has occurred. For real-time cost protection, implement application-level rate limiting:
```python
import time
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class AIRateLimiter:
    """Token-aware rate limiter for cloud AI services."""
    max_tokens_per_minute: int = 100000
    max_tokens_per_hour: int = 2000000
    max_cost_per_hour_usd: float = 50.0
    cost_per_input_token: float = 0.000003   # $3/1M tokens
    cost_per_output_token: float = 0.000015  # $15/1M tokens
    _lock: Lock = field(default_factory=Lock)
    _minute_tokens: list = field(default_factory=list)
    _hour_tokens: list = field(default_factory=list)
    _hour_cost: list = field(default_factory=list)

    def check_and_record(
        self,
        input_tokens: int,
        estimated_output_tokens: int,
        user_id: str = "default",  # reserved for per-user limits
    ) -> dict:
        """Check if a request is within rate limits and record it."""
        with self._lock:
            now = time.time()
            total_tokens = input_tokens + estimated_output_tokens
            estimated_cost = (
                input_tokens * self.cost_per_input_token
                + estimated_output_tokens * self.cost_per_output_token
            )
            # Drop entries that have aged out of the sliding windows
            minute_cutoff = now - 60
            hour_cutoff = now - 3600
            self._minute_tokens = [
                (t, tokens) for t, tokens in self._minute_tokens if t > minute_cutoff
            ]
            self._hour_tokens = [
                (t, tokens) for t, tokens in self._hour_tokens if t > hour_cutoff
            ]
            self._hour_cost = [
                (t, cost) for t, cost in self._hour_cost if t > hour_cutoff
            ]
            # Check the per-minute token limit
            minute_total = sum(tokens for _, tokens in self._minute_tokens) + total_tokens
            if minute_total > self.max_tokens_per_minute:
                return {
                    "allowed": False,
                    "reason": "tokens_per_minute_exceeded",
                    "current": minute_total - total_tokens,
                    "limit": self.max_tokens_per_minute,
                    "retry_after_seconds": 60,
                }
            # Check the per-hour token limit
            hour_total = sum(tokens for _, tokens in self._hour_tokens) + total_tokens
            if hour_total > self.max_tokens_per_hour:
                return {
                    "allowed": False,
                    "reason": "tokens_per_hour_exceeded",
                    "current": hour_total - total_tokens,
                    "limit": self.max_tokens_per_hour,
                }
            # Check the per-hour cost limit
            hour_cost = sum(cost for _, cost in self._hour_cost) + estimated_cost
            if hour_cost > self.max_cost_per_hour_usd:
                return {
                    "allowed": False,
                    "reason": "cost_per_hour_exceeded",
                    "current_cost_usd": round(hour_cost - estimated_cost, 4),
                    "limit_usd": self.max_cost_per_hour_usd,
                }
            # Record the request
            self._minute_tokens.append((now, total_tokens))
            self._hour_tokens.append((now, total_tokens))
            self._hour_cost.append((now, estimated_cost))
            return {
                "allowed": True,
                "estimated_cost_usd": round(estimated_cost, 6),
                "minute_usage_pct": round((minute_total / self.max_tokens_per_minute) * 100, 1),
                "hour_usage_pct": round((hour_total / self.max_tokens_per_hour) * 100, 1),
                "hour_cost_pct": round((hour_cost / self.max_cost_per_hour_usd) * 100, 1),
            }
```

Red Team Testing for Cost Controls
Authorized Cost Security Testing
When conducting authorized assessments, test cost controls without generating actual excessive spend:
```python
def test_cost_controls_safely(
    rate_limiter: AIRateLimiter,
    cloud_client: object,
) -> dict:
    """Test cost control effectiveness with minimal actual spend."""
    results = {
        "rate_limiter_tests": [],
        "budget_alert_tests": [],
        "credential_scope_tests": [],
    }
    # Test 1: verify the rate limiter blocks excessive requests
    for i in range(20):
        check = rate_limiter.check_and_record(
            input_tokens=1000,
            estimated_output_tokens=4000,
            user_id="test-user",
        )
        if not check["allowed"]:
            results["rate_limiter_tests"].append({
                "test": "rapid_request_blocking",
                "pass": True,
                "blocked_at_request": i + 1,
                "reason": check["reason"],
            })
            break
    else:
        results["rate_limiter_tests"].append({
            "test": "rapid_request_blocking",
            "pass": False,
            "detail": "Rate limiter did not block after 20 rapid requests",
        })
    # Test 2: verify the max_tokens parameter is enforced.
    # Send a single request with a very small max_tokens to confirm
    # the application enforces output length limits.
    results["rate_limiter_tests"].append({
        "test": "output_token_limit",
        "recommendation": "Verify the application sets max_tokens on all API calls. "
                          "Without max_tokens, a single request can generate the "
                          "model's full context window in output tokens.",
    })
    # Test 3: check whether provisioned throughput creation is IAM-restricted
    results["credential_scope_tests"].append({
        "test": "provisioned_throughput_restriction",
        "recommendation": "Verify application service accounts cannot call "
                          "CreateProvisionedModelThroughput or equivalent.",
    })
    return results
```

Cost Attack Indicators
Monitor for these patterns that indicate cost-based attacks:
| Indicator | Description | Detection Method |
|---|---|---|
| Token volume spike | Sudden increase in tokens processed per minute | CloudWatch/Azure Monitor/Cloud Monitoring metrics |
| Max output requests | High percentage of requests generating maximum-length outputs | Application-level logging of output token counts |
| Off-hours usage | AI API calls during periods when the application should be idle | Time-based alerting on API call metrics |
| New IP/identity | API calls from previously unseen IP addresses or identities | CloudTrail/Activity Log/Audit Log analysis |
| Provisioned capacity creation | New provisioned throughput or committed-use capacity | Management API event monitoring |
| Burst after silence | A dormant credential suddenly generating high volume | Behavioral analysis of per-credential usage patterns |
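A minimal sketch of the "token volume spike" and "burst after silence" detections, assuming per-minute token counts are already being exported from application logs (the window size and sigma threshold here are illustrative, not tuned values):

```python
from statistics import mean, stdev

def detect_token_spike(
    history: list[int],      # per-minute token counts, oldest first
    current: int,            # latest minute's token count
    min_history: int = 30,   # require a baseline before alerting
    sigma_threshold: float = 4.0,
) -> dict:
    """Flag the current minute if it deviates sharply from the rolling baseline."""
    if len(history) < min_history:
        return {"alert": False, "reason": "insufficient_history"}
    baseline = mean(history)
    spread = stdev(history) or 1.0  # avoid division by zero on flat history
    z = (current - baseline) / spread
    # A dormant credential (baseline near zero) bursting to high volume
    # also trips this check, covering the "burst after silence" indicator.
    return {
        "alert": z > sigma_threshold,
        "z_score": round(z, 2),
        "baseline_tokens_per_minute": round(baseline, 1),
    }

history = [1000] * 20 + [1100] * 10   # steady ~1k tokens/minute
print(detect_token_spike(history, current=250_000))  # large spike -> alert
```

Production detectors would track per-credential baselines and combine this with the off-hours and new-identity signals from the table, but the core check is this same deviation-from-baseline test.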
Defense-in-Depth Strategy
Layer cost controls from the cloud provider level down to the application level:
- Cloud provider budget alerts: Set monthly budgets with alerts at 50%, 80%, and 100% thresholds. These are the last line of defense but have reporting delays.
- IAM permission restrictions: Remove CreateProvisionedModelThroughput and equivalent high-cost actions from all non-admin identities.
- Deployment-level quotas: Configure TPM (tokens per minute) limits on each model deployment. Azure OpenAI has native support; other providers require API gateway rate limiting.
- Application rate limiting: Implement token-aware rate limiting in the application layer with per-user and global limits.
- Request validation: Set max_tokens on every API call. Reject user inputs that exceed reasonable length thresholds.
- Real-time anomaly detection: Use cost anomaly detection services and custom monitoring to detect unusual spend patterns within minutes, not hours.
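The request-validation layer can be sketched as a thin wrapper around the model call. Here `call_model` and both limits are placeholders for illustration, not a specific provider API:

```python
MAX_INPUT_CHARS = 8000      # reject oversized user inputs outright
MAX_OUTPUT_TOKENS = 1024    # hard cap passed to every model call

def guarded_completion(prompt: str, call_model) -> dict:
    """Validate the request, then invoke the model with a hard output cap.

    call_model(prompt, max_tokens) stands in for the actual provider SDK
    call (Bedrock, Azure OpenAI, Vertex AI, ...).
    """
    if len(prompt) > MAX_INPUT_CHARS:
        return {"error": "input_too_long", "limit_chars": MAX_INPUT_CHARS}
    # Always set max_tokens: without it, a single request can emit the
    # model's full context window in output tokens.
    return {"output": call_model(prompt, max_tokens=MAX_OUTPUT_TOKENS)}

# Usage with a stub model call
result = guarded_completion("Hello", lambda p, max_tokens: f"echo:{p} (cap={max_tokens})")
print(result)  # {'output': 'echo:Hello (cap=1024)'}
```

Placing this wrapper in front of the AIRateLimiter check gives two independent preventive layers before any tokens are billed.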
References
- AWS, "AWS Cost Anomaly Detection," https://docs.aws.amazon.com/cost-management/latest/userguide/manage-ad.html
- Microsoft, "Plan and manage costs for Azure OpenAI Service," https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs
- Google Cloud, "Cloud Billing budgets," https://cloud.google.com/billing/docs/how-to/budgets
- OWASP, "LLM10:2025 Unbounded Consumption," https://owasp.org/www-project-top-10-for-large-language-model-applications/
Why is setting max_tokens on every AI API call important for cost security?
An organization sets an AWS Budget alert at $5,000/month for Bedrock. Why is this insufficient as a sole cost control?