IAM Best Practices for Cloud AI Services
Cross-cloud IAM best practices for securing AI services on AWS, Azure, and GCP, covering least privilege, service identity management, cross-account access, and policy automation.
Overview
Identity and Access Management is the most critical security control for cloud AI services. Unlike traditional cloud workloads where IAM primarily controls access to infrastructure and data, AI service IAM must also control who can invoke models, who can train or fine-tune models, who can modify safety configurations, and who can access the prompts and responses flowing through the system.
The challenge is that AI services introduce permission categories that do not map cleanly to existing IAM frameworks. A data scientist who needs to invoke a model for inference has a fundamentally different risk profile than one who needs to submit fine-tuning jobs. A CI/CD pipeline that deploys model configurations requires different permissions than one that deploys application code. This article provides actionable guidance for designing and implementing IAM policies that address these AI-specific concerns across all three major cloud platforms.
The AI-Specific IAM Challenge
Why Traditional IAM Falls Short
Traditional cloud IAM was designed around resource types (virtual machines, storage buckets, databases) and CRUD operations (create, read, update, delete). AI services introduce actions that do not fit this model:
| Action Type | Traditional Equivalent | AI-Specific Risk |
|---|---|---|
| Model invocation | API call / function execution | Cost abuse, prompt/response data exposure, content safety bypass |
| Fine-tuning job creation | Compute job submission | Training data poisoning, privilege escalation via service role, model theft |
| Guardrail/safety config modification | Security policy change | Weakening organizational safety controls, compliance violations |
| Model deployment | Application deployment | Exposing untested or malicious models to production traffic |
| Prompt/response logging access | Log access | Exposure of sensitive user inputs and model outputs |
The key insight is that model invocation is both a compute action (it costs money and uses resources) and a data action (the prompts and responses contain information). Most IAM policies treat it as only one or the other.
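This compute-plus-data duality has a practical consequence: granting invocation permissions should trigger both a cost review and a data-access review, while management permissions should go through change management. A minimal sketch of that classification follows; the action lists and review names are illustrative, not an exhaustive catalog.

```python
# Sketch: classify AI-related IAM actions by the review processes they
# should trigger. Action lists and review names are illustrative only.

INVOCATION_ACTIONS = {
    "bedrock:InvokeModel",
    "bedrock:InvokeModelWithResponseStream",
}
MANAGEMENT_ACTIONS = {
    "bedrock:CreateModelCustomizationJob",
    "bedrock:UpdateGuardrail",
}

def reviews_required(action: str) -> set[str]:
    """Return the review processes an IAM action should trigger."""
    if action in INVOCATION_ACTIONS:
        # Invocation is simultaneously a compute action (cost) and a
        # data action (prompt/response exposure), so both reviews apply.
        return {"cost-review", "data-access-review"}
    if action in MANAGEMENT_ACTIONS:
        return {"change-management-review"}
    return set()
```

A policy review tool built on this idea would flag any grant that passed only one of the two reviews invocation requires.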
The Service Identity Problem
AI workloads often require service identities (service accounts, managed identities, IAM roles) that access multiple services: the AI API itself, storage for training data, logging services, secret managers for API keys, and sometimes databases for RAG contexts. This creates a tension between operational convenience (one service identity with broad access) and security (separate identities for each concern).
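One way to resolve this tension is to enumerate each concern explicitly, assign it a dedicated identity, and verify mechanically that no identity is reused across concerns. The following sketch uses illustrative identity names and permission lists to show the shape of that inventory.

```python
# Sketch of the "one identity per concern" pattern. Identity names and
# permissions are illustrative placeholders, not real resources.

WORKLOAD_IDENTITIES = {
    "inference": {
        "identity": "svc-chat-inference",
        "permissions": ["bedrock:InvokeModel"],
    },
    "training_data": {
        "identity": "svc-chat-training-data",
        "permissions": ["s3:GetObject", "s3:ListBucket"],
    },
    "secrets": {
        "identity": "svc-chat-secrets",
        "permissions": ["secretsmanager:GetSecretValue"],
    },
}

def identities_are_separated(identities: dict) -> bool:
    """Verify that no single service identity is reused across concerns."""
    names = [entry["identity"] for entry in identities.values()]
    return len(names) == len(set(names))
```

Running a check like this in CI makes the convenience-versus-security trade-off visible: collapsing two concerns onto one identity fails the check and forces an explicit exception.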
AWS IAM for AI Services
Least Privilege for Bedrock
AWS Bedrock permissions should be scoped by three dimensions: action (what the identity can do), resource (which models or resources), and condition (from where and when).
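To make the three dimensions concrete, here is a hand-written example of a fully scoped policy statement; the model ID and VPC endpoint ID are illustrative placeholders.

```python
import json

# Example policy scoped along all three dimensions. The model ID and the
# VPC endpoint ID below are placeholders for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ScopedInvocation",
            "Effect": "Allow",
            # Dimension 1 -- action: invocation only, no management actions.
            "Action": ["bedrock:InvokeModel"],
            # Dimension 2 -- resource: one approved model, never "*".
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
            # Dimension 3 -- condition: only via the approved VPC endpoint.
            "Condition": {"StringEquals": {"aws:SourceVpce": "vpce-0example"}},
        }
    ],
}
print(json.dumps(policy, indent=2))
```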
def generate_bedrock_least_privilege_policy(
    allowed_model_ids: list[str],
    region: str = "us-east-1",
    require_vpc_endpoint: str | None = None,
) -> dict:
    """Generate a least-privilege IAM policy for Bedrock model invocation."""
    # Foundation-model ARNs contain no account ID: the models are AWS-managed.
    model_arns = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in allowed_model_ids
    ]
statements = [
{
"Sid": "AllowModelInvocation",
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream",
],
"Resource": model_arns,
},
{
"Sid": "DenyModelManagement",
"Effect": "Deny",
"Action": [
"bedrock:CreateModelCustomizationJob",
"bedrock:CreateProvisionedModelThroughput",
"bedrock:DeleteCustomModel",
"bedrock:CreateGuardrail",
"bedrock:UpdateGuardrail",
"bedrock:DeleteGuardrail",
],
"Resource": "*",
},
]
# Add VPC endpoint condition if specified
if require_vpc_endpoint:
statements[0]["Condition"] = {
"StringEquals": {
"aws:sourceVpce": require_vpc_endpoint
}
}
return {
"Version": "2012-10-17",
"Statement": statements,
}
def generate_bedrock_admin_policy(
account_id: str,
region: str = "us-east-1",
) -> dict:
"""Generate a policy for Bedrock administrators (deployment, not invocation)."""
return {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowBedrockManagement",
"Effect": "Allow",
"Action": [
"bedrock:ListFoundationModels",
"bedrock:GetFoundationModel",
"bedrock:ListCustomModels",
"bedrock:GetCustomModel",
"bedrock:ListGuardrails",
"bedrock:GetGuardrail",
"bedrock:ListModelInvocationLoggingConfigurations",
],
"Resource": "*",
},
{
"Sid": "AllowGuardrailManagement",
"Effect": "Allow",
"Action": [
"bedrock:CreateGuardrail",
"bedrock:UpdateGuardrail",
"bedrock:CreateGuardrailVersion",
],
"Resource": f"arn:aws:bedrock:{region}:{account_id}:guardrail/*",
},
{
"Sid": "DenyModelInvocation",
"Effect": "Deny",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream",
],
"Resource": "*",
"Condition": {
"StringNotLike": {
"aws:PrincipalTag/team": "ai-platform-admin"
}
},
},
],
    }
SageMaker IAM Separation
For AWS SageMaker, separate IAM roles for different lifecycle phases:
def generate_sagemaker_role_policies() -> dict:
"""Generate separated IAM policies for SageMaker lifecycle phases."""
return {
"data_scientist_policy": {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "NotebookAccess",
"Effect": "Allow",
"Action": [
"sagemaker:CreatePresignedNotebookInstanceUrl",
"sagemaker:DescribeNotebookInstance",
"sagemaker:StartNotebookInstance",
"sagemaker:StopNotebookInstance",
],
"Resource": "arn:aws:sagemaker:*:*:notebook-instance/${aws:PrincipalTag/notebook-name}",
},
{
"Sid": "TrainingJobAccess",
"Effect": "Allow",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:DescribeTrainingJob",
"sagemaker:ListTrainingJobs",
],
"Resource": "*",
"Condition": {
"StringEquals": {
"sagemaker:ResourceTag/team": "${aws:PrincipalTag/team}"
}
},
},
],
},
"mlops_deployment_policy": {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EndpointManagement",
"Effect": "Allow",
"Action": [
"sagemaker:CreateEndpoint",
"sagemaker:CreateEndpointConfig",
"sagemaker:UpdateEndpoint",
"sagemaker:DeleteEndpoint",
"sagemaker:DescribeEndpoint",
],
"Resource": "*",
},
{
"Sid": "DenyTraining",
"Effect": "Deny",
"Action": [
"sagemaker:CreateTrainingJob",
"sagemaker:CreateHyperParameterTuningJob",
],
"Resource": "*",
},
],
},
"training_execution_role": {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3TrainingDataAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket",
],
"Resource": [
"arn:aws:s3:::ml-training-data-*",
"arn:aws:s3:::ml-training-data-*/*",
],
},
{
"Sid": "S3ModelOutput",
"Effect": "Allow",
"Action": [
"s3:PutObject",
],
"Resource": "arn:aws:s3:::ml-model-artifacts-*/*",
},
{
"Sid": "CloudWatchLogging",
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:CreateLogGroup",
],
"Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*",
},
],
},
    }
Azure RBAC for AI Services
Custom Role Definitions
Azure's built-in roles for Cognitive Services are often too broad. Create custom role definitions that align with your organization's AI service usage patterns:
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import Permission, RoleDefinition

def create_azure_openai_readonly_role(
    subscription_id: str,
) -> dict:
    """Create a custom role that allows Azure OpenAI invocation but not management."""
    credential = DefaultAzureCredential()
    auth_client = AuthorizationManagementClient(credential, subscription_id)
    # The SDK takes the flattened RoleDefinition model; the REST API wraps
    # these same fields in a "properties" object.
    role_definition = RoleDefinition(
        role_name="Azure OpenAI Invoker",
        description=(
            "Can invoke Azure OpenAI models but cannot manage "
            "deployments, content filters, or resource configuration."
        ),
        role_type="CustomRole",
        permissions=[
            Permission(
                actions=[],
                not_actions=[],
                data_actions=[
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/completions/action",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/embeddings/action",
                ],
                not_data_actions=[
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/write",
                    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/delete",
                ],
            )
        ],
        assignable_scopes=[f"/subscriptions/{subscription_id}"],
    )
role_id = str(uuid.uuid4())
scope = f"/subscriptions/{subscription_id}"
result = auth_client.role_definitions.create_or_update(
scope, role_id, role_definition
)
return {
"role_id": result.id,
"role_name": result.role_name,
    }
Managed Identity Patterns
Use system-assigned or user-assigned managed identities for all Azure OpenAI access. The following pattern shows how to configure an App Service to use managed identity for Azure OpenAI:
from azure.identity import DefaultAzureCredential
from azure.mgmt.web import WebSiteManagementClient
def configure_app_managed_identity(
subscription_id: str,
resource_group: str,
app_name: str,
openai_endpoint: str,
) -> dict:
"""Configure an App Service to use managed identity for Azure OpenAI."""
credential = DefaultAzureCredential()
web_client = WebSiteManagementClient(credential, subscription_id)
# Enable system-assigned managed identity
web_client.web_apps.update(
resource_group,
app_name,
{
"identity": {"type": "SystemAssigned"},
},
)
# Get the identity principal ID
app = web_client.web_apps.get(resource_group, app_name)
principal_id = app.identity.principal_id
# Set app settings to use managed identity (not API keys)
web_client.web_apps.update_application_settings(
resource_group,
app_name,
{
"properties": {
"AZURE_OPENAI_ENDPOINT": openai_endpoint,
"AZURE_OPENAI_AUTH_TYPE": "managed_identity",
# No API key setting -- forces managed identity usage
}
},
)
return {
"principal_id": principal_id,
"auth_type": "managed_identity",
"detail": "Assign 'Cognitive Services OpenAI User' role to this principal ID.",
    }
GCP IAM for AI Services
IAM Conditions for Vertex AI
GCP IAM supports conditions that restrict when and where permissions are effective. Use conditions to limit Vertex AI access to specific resources and time windows:
def create_conditional_vertex_binding(
    project_id: str,
    member: str,
    allowed_endpoints: list[str],
) -> dict:
    """Create a conditional IAM binding for Vertex AI endpoints."""
    # Build a CEL condition expression limiting access to approved
    # Vertex AI endpoint resources.
    endpoint_conditions = " || ".join(
        f'resource.name.startsWith("projects/{project_id}/locations/us-central1/endpoints/{ep}")'
        for ep in allowed_endpoints
    )
    # Applied to the project via set_iam_policy.
    binding = {
        "role": "roles/aiplatform.user",
        "members": [member],
        "condition": {
            "expression": endpoint_conditions,
            "title": "Restrict to approved endpoints",
            "description": "Limits prediction access to specific Vertex AI endpoints.",
        },
    }
return {
"binding": binding,
"detail": "Apply this binding via gcloud or Terraform. "
"IAM conditions on Vertex AI restrict which specific "
"resources the member can access.",
    }
Workload Identity Federation
For CI/CD pipelines and external workloads accessing Vertex AI, use Workload Identity Federation instead of service account keys:
def setup_workload_identity_pool(
project_id: str,
project_number: str,
github_org: str,
github_repo: str,
) -> dict:
"""Create a Workload Identity Pool for GitHub Actions accessing Vertex AI."""
from google.cloud import iam_v1
# Configuration for workload identity federation
pool_config = {
"pool_id": "github-actions-pool",
"display_name": "GitHub Actions",
"provider_config": {
"provider_id": "github",
"issuer_uri": "https://token.actions.githubusercontent.com",
"attribute_mapping": {
"google.subject": "assertion.sub",
"attribute.repository": "assertion.repository",
"attribute.actor": "assertion.actor",
"attribute.ref": "assertion.ref",
},
"attribute_condition": (
f'assertion.repository == "{github_org}/{github_repo}" '
f'&& assertion.ref == "refs/heads/main"'
),
},
"service_account_binding": {
"service_account": f"vertex-ai-deployer@{project_id}.iam.gserviceaccount.com",
"role": "roles/iam.workloadIdentityUser",
"member": (
f"principalSet://iam.googleapis.com/projects/{project_number}"
f"/locations/global/workloadIdentityPools/github-actions-pool/"
f"attribute.repository/{github_org}/{github_repo}"
),
},
}
    return pool_config
Cross-Cloud IAM Patterns
Shared AI Model Access
When organizations use AI services across multiple clouds, identity federation enables cross-cloud access without managing separate credentials:
def design_cross_cloud_ai_access() -> dict:
"""Design patterns for cross-cloud AI service access."""
return {
"pattern_1_hub_spoke": {
"description": "Central identity provider (e.g., Entra ID) federated to all clouds",
"aws_config": "SAML or OIDC federation to IAM Identity Center",
"azure_config": "Native Entra ID integration",
"gcp_config": "Workforce Identity Federation with Entra ID as IdP",
"pros": ["Single identity source", "Centralized access reviews", "Consistent MFA"],
"cons": ["Single point of failure", "Complex initial setup"],
},
"pattern_2_service_mesh": {
"description": "Cloud-native service identities with cross-cloud trust",
"aws_config": "IAM roles with OIDC federation from GCP/Azure",
"azure_config": "Managed identities with federated credentials",
"gcp_config": "Workload Identity Federation with AWS/Azure tokens",
"pros": ["No long-lived credentials", "Cloud-native"],
"cons": ["Complex trust relationships", "Different semantics per cloud"],
},
    }
Automated IAM Assessment
Cross-Cloud IAM Scanner
Automate IAM assessment across all three clouds with a unified scanning framework:
from dataclasses import dataclass
from typing import Protocol
from enum import Enum
class Severity(Enum):
CRITICAL = "critical"
HIGH = "high"
MEDIUM = "medium"
LOW = "low"
INFO = "info"
@dataclass
class IAMFinding:
cloud: str
service: str
severity: Severity
title: str
detail: str
resource: str
recommendation: str
class CloudIAMScanner(Protocol):
def scan(self) -> list[IAMFinding]: ...
@dataclass
class AWSBedrockIAMScanner:
session: object # boto3.Session
account_id: str
def scan(self) -> list[IAMFinding]:
findings = []
iam = self.session.client("iam")
# Check for wildcard Bedrock permissions
paginator = iam.get_paginator("list_policies")
for page in paginator.paginate(Scope="Local"):
for policy in page["Policies"]:
version = iam.get_policy_version(
PolicyArn=policy["Arn"],
VersionId=policy["DefaultVersionId"],
)
document = version["PolicyVersion"]["Document"]
if isinstance(document, str):
import json
document = json.loads(document)
for statement in document.get("Statement", []):
actions = statement.get("Action", [])
if isinstance(actions, str):
actions = [actions]
if statement.get("Effect") == "Allow" and "bedrock:*" in actions:
findings.append(IAMFinding(
cloud="AWS",
service="Bedrock",
severity=Severity.HIGH,
title="Wildcard Bedrock permissions",
detail=f"Policy {policy['PolicyName']} grants bedrock:* access",
resource=policy["Arn"],
recommendation="Scope to specific actions: bedrock:InvokeModel "
"for consumers, specific management actions for admins.",
))
return findings
def run_cross_cloud_scan(scanners: list[CloudIAMScanner]) -> dict:
"""Run IAM scans across all configured cloud environments."""
all_findings = []
for scanner in scanners:
all_findings.extend(scanner.scan())
summary = {
"total_findings": len(all_findings),
"by_severity": {},
"by_cloud": {},
"findings": [
{
"cloud": f.cloud,
"service": f.service,
"severity": f.severity.value,
"title": f.title,
"detail": f.detail,
"recommendation": f.recommendation,
}
for f in all_findings
],
}
for f in all_findings:
summary["by_severity"][f.severity.value] = summary["by_severity"].get(f.severity.value, 0) + 1
summary["by_cloud"][f.cloud] = summary["by_cloud"].get(f.cloud, 0) + 1
    return summary
IAM Governance Recommendations
Policy Design Principles
- Separate invocation from management: Model consumers should never have permissions to create, modify, or delete AI resources. Separate IAM policies for invocation and management prevent lateral movement from a compromised application to infrastructure control.
- Scope by model: Do not grant access to all models when only specific ones are needed. Use resource-level permissions to restrict which models an identity can invoke.
- Enforce network conditions: Require that AI API calls originate from approved network locations (VPC endpoints, private endpoints, Private Service Connect) using IAM condition keys.
- Time-bound elevated access: Use just-in-time access tools (AWS IAM Identity Center temporary access, Azure PIM, GCP IAM Conditions with time expressions) for administrative operations.
- Tag-based access control: Use attribute-based access control (ABAC) with tags/labels to dynamically scope permissions based on team, environment, and data classification.
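Several of these principles compose naturally. On GCP, for example, a single IAM condition can combine resource scoping with a time bound. The following sketch builds such a CEL expression; the project, endpoint, and expiry values are placeholders.

```python
# Sketch: a GCP IAM condition combining resource scoping ("scope by model")
# with a time bound ("time-bound elevated access"). All values are
# illustrative placeholders.

def build_scoped_timed_condition(
    project_id: str, endpoint_id: str, expiry_rfc3339: str
) -> dict:
    """Build an IAM condition restricting access to one Vertex AI
    endpoint and expiring at the given RFC 3339 timestamp."""
    expression = (
        f'resource.name.startsWith("projects/{project_id}'
        f'/locations/us-central1/endpoints/{endpoint_id}") '
        f'&& request.time < timestamp("{expiry_rfc3339}")'
    )
    return {
        "title": "Scoped, time-bound Vertex AI access",
        "expression": expression,
    }
```

An expired binding simply stops granting access, so temporary elevated grants clean themselves up without a revocation step.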
Regular Access Review
Conduct quarterly access reviews focused on AI-specific permissions. The review should answer:
- Who can invoke each model, and do they still need that access?
- Who can modify safety/guardrail configurations?
- Are service identities scoped to minimum necessary permissions?
- Are there any cross-account or cross-project access paths that bypass intended controls?
- Are deprecated models still accessible to identities that should have migrated?
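Parts of this review can be automated. On AWS, for instance, the first question can be checked with the IAM policy simulator; the sketch below interprets the simulator's response, with principal and model ARNs supplied by the caller.

```python
# Sketch: answer "who can invoke this model?" with the AWS IAM policy
# simulator. Principal and model ARNs are supplied by the caller.

def invocation_allowed(simulation: dict) -> bool:
    """Interpret a simulate_principal_policy response: access counts as
    allowed only if every evaluation result is 'allowed'."""
    results = simulation.get("EvaluationResults", [])
    return bool(results) and all(
        result["EvalDecision"] == "allowed" for result in results
    )

def can_invoke_model(principal_arn: str, model_arn: str) -> bool:
    """Ask the IAM policy simulator whether a principal can invoke a model."""
    import boto3  # requires AWS credentials when actually called

    iam = boto3.client("iam")
    response = iam.simulate_principal_policy(
        PolicySourceArn=principal_arn,
        ActionNames=["bedrock:InvokeModel"],
        ResourceArns=[model_arn],
    )
    return invocation_allowed(response)
```

Iterating this check over every principal and model produces the invocation-access matrix the quarterly review needs, rather than relying on manual policy reading.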
References
- AWS, "Security best practices in IAM," https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html
- Microsoft, "Azure RBAC best practices," https://learn.microsoft.com/en-us/azure/role-based-access-control/best-practices
- Google Cloud, "IAM best practices," https://cloud.google.com/iam/docs/using-iam-securely
- NIST SP 800-207, "Zero Trust Architecture," August 2020, https://csrc.nist.gov/publications/detail/sp/800-207/final
- CSA, "Security Guidance for Critical Areas of Focus in Cloud Computing v4.0," https://cloudsecurityalliance.org/research/guidance
Review Questions
- Why should model invocation permissions be separated from model management permissions in IAM policies?
- What is the recommended approach for CI/CD pipelines that need to deploy Vertex AI models on GCP?