Cloud ML Platform Security (AWS/Azure/GCP)
Security comparison of cloud ML platforms including AWS SageMaker, Azure Machine Learning, and Google Vertex AI. IAM configuration, data security, model serving, and platform-specific attack surfaces.
Cloud ML platforms (AWS SageMaker, Azure Machine Learning, Google Vertex AI) provide managed infrastructure for training, deploying, and serving ML models — including LLMs. Each platform inherits its parent cloud's security model while adding ML-specific attack surface. Red teaming cloud ML deployments requires understanding both cloud security fundamentals and ML-specific vulnerabilities.
Platform Comparison Overview
| Feature | AWS SageMaker | Azure ML | Google Vertex AI |
|---|---|---|---|
| IAM model | AWS IAM (roles, policies) | Azure RBAC + Azure AD | Google Cloud IAM |
| Network isolation | VPC, PrivateLink | VNet, Private Endpoints | VPC, Private Service Connect |
| Data encryption | KMS (at rest), TLS (transit) | Azure Key Vault, TLS | Cloud KMS, TLS |
| Model registry | SageMaker Model Registry | Azure ML Model Registry | Vertex AI Model Registry |
| Serving | SageMaker Endpoints, Serverless | Managed Endpoints, AKS | Vertex AI Endpoints, GKE |
| Notebook environment | SageMaker Studio, Notebook Instances | Azure ML Compute Instances | Vertex AI Workbench |
| Compliance | HIPAA, SOC, FedRAMP | HIPAA, SOC, FedRAMP, DoD IL | HIPAA, SOC, FedRAMP |
IAM Misconfiguration Risks
IAM misconfigurations are the most common and highest-impact vulnerability class in cloud ML deployments.
AWS SageMaker IAM Risks
```python
# DANGEROUS: overly permissive SageMaker execution role
overly_permissive_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sagemaker:*",
        "Resource": "*"
    }, {
        "Effect": "Allow",
        "Action": "s3:*",          # Full S3 access: can read ALL buckets
        "Resource": "*"
    }, {
        "Effect": "Allow",
        "Action": "iam:PassRole",  # Can pass any role
        "Resource": "*"
    }]
}
# This role can:
# - access any S3 bucket (training data, other projects' data)
# - create endpoints with any role attached
# - modify any SageMaker resource
# - escalate privileges via iam:PassRole (pass a more privileged role to a resource it controls)
```
Azure ML IAM Risks
| Misconfiguration | Risk | Test |
|---|---|---|
| Contributor role on workspace | Full control over ML resources, data, and models | Check role assignments — Contributor is often over-assigned |
| Shared managed identity | Multiple services share one identity, cross-service access | Enumerate managed identity assignments |
| Missing data plane RBAC | Control plane access does not restrict data plane operations | Test data access after restricting control plane roles |
| Service principal over-provisioning | Automation accounts with excessive scope | Review SP permissions across resource groups |
Google Vertex AI IAM Risks
| Misconfiguration | Risk | Test |
|---|---|---|
| roles/aiplatform.admin over-assignment | Full control over all Vertex AI resources | Review IAM bindings at project level |
| Missing VPC Service Controls | Data exfiltration via Vertex AI APIs | Test data access from outside perimeter |
| Default service account usage | Compute Engine default SA has broad project-level permissions | Check if default SA is used for ML workloads |
| Cross-project access | Vertex AI resources accessible from other projects | Test cross-project API calls |
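The wildcard-permission patterns in these tables can be caught statically before deployment. A minimal sketch of a check for the SageMaker PassRole escalation combination (the function name and logic are illustrative, not part of any AWS SDK):

```python
def find_passrole_escalation(policy: dict) -> list:
    """Flag an IAM policy that combines unrestricted iam:PassRole with
    broad SageMaker permissions -- together they let a caller attach a
    more privileged role to a SageMaker resource and act as that role."""
    def actions(stmt):
        a = stmt.get("Action", [])
        return [a] if isinstance(a, str) else a

    passrole_any = False
    sagemaker_broad = False
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        for action in actions(stmt):
            if action == "iam:PassRole" and stmt.get("Resource") == "*":
                passrole_any = True
            if action in ("sagemaker:*", "*"):
                sagemaker_broad = True

    findings = []
    if passrole_any and sagemaker_broad:
        findings.append("iam:PassRole on * combined with sagemaker:* "
                        "allows privilege escalation via role attachment")
    return findings

risky = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "sagemaker:*", "Resource": "*"},
    {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
]}
findings = find_passrole_escalation(risky)  # one finding expected
```

Running the same check against a role whose PassRole statement is scoped to a single role ARN returns no findings, which is the target state.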
Data Security Across the ML Lifecycle
Data Collection → Preparation → Training → Model Storage → Serving → Inference

| Stage | Primary security concern |
|---|---|
| Data collection | Data source access control and authentication |
| Preparation | Feature store access and encryption |
| Training | Training data exposure in compute |
| Model storage | Model artifact encryption |
| Serving | Endpoint authentication and rate limiting |
| Inference | Response data leakage |
Training Data Security
| Risk | AWS SageMaker | Azure ML | Vertex AI |
|---|---|---|---|
| Data at rest | S3 SSE or KMS | Azure Storage encryption | Cloud Storage encryption |
| Data in training | EBS encryption on training instances | Compute encryption | Boot disk + data disk encryption |
| Data access logging | CloudTrail + S3 access logs | Azure Monitor + Storage Analytics | Cloud Audit Logs |
| Cross-account data access | S3 bucket policies + IAM | Shared access signatures | IAM + VPC Service Controls |
| Data exfiltration prevention | VPC endpoints, no internet access | Private endpoints, NSGs | VPC Service Controls |
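Cross-account data access is worth auditing directly, since a single bucket-policy statement can expose training data to an untrusted account. A sketch of a static check against an allow-list of account IDs (illustrative helper, not a library API; account ID is field 5 of a principal ARN):

```python
def cross_account_grants(bucket_policy: dict, trusted_accounts: set) -> list:
    """Return Allow statements whose principal is '*' or belongs to an
    AWS account outside the trusted set."""
    findings = []
    for stmt in bucket_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":          # anonymous access: always a finding
            findings.append(stmt)
            continue
        arns = (principal or {}).get("AWS", [])
        if isinstance(arns, str):
            arns = [arns]
        for arn in arns:
            parts = arn.split(":")
            if len(parts) > 4 and parts[4] not in trusted_accounts:
                findings.append(stmt)
                break
    return findings

policy = {"Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::training-data/*",
}]}
flagged = cross_account_grants(policy, trusted_accounts={"111111111111"})
```

The same idea applies to Azure shared access signatures and GCP IAM bindings: enumerate every principal with data-plane access and diff it against the set of identities that should have it.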
Model Artifact Security
| Concern | Description | Test Approach |
|---|---|---|
| Model access control | Who can download or use model artifacts? | Attempt to access model artifacts with different credentials |
| Model integrity | Can model artifacts be tampered with? | Check for write access to model storage, verify signing |
| Model versioning | Are old (potentially vulnerable) model versions accessible? | Enumerate available model versions |
| Serialization risks | Model format (pickle, safetensors) determines code execution risk | Identify model serialization format, test for deserialization attacks |
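The serialization row deserves emphasis: loading a pickle-format model artifact executes arbitrary code, so write access to model storage is equivalent to code execution on every host that loads the model. A minimal demonstration with a deliberately benign payload (`sorted` stands in for what could be `os.system`):

```python
import pickle

class MaliciousModel:
    """Unpickling this object calls an attacker-chosen callable --
    here a harmless sorted(), but it could be any importable function."""
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(MaliciousModel())  # what a tampered artifact contains
result = pickle.loads(payload)            # "loading the model" runs the callable
# result is [1, 2, 3]: code ran during deserialization, no model needed
```

Formats such as safetensors avoid this class of attack by storing only tensor data with no executable component, which is why identifying the serialization format is the first test step.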
Model Serving Security
Endpoint Authentication
| Feature | AWS SageMaker | Azure ML | Vertex AI |
|---|---|---|---|
| Default auth | IAM SigV4 | Key or token-based | IAM + API key |
| Custom auth | Lambda authorizer | Azure AD integration | IAP + IAM |
| Public endpoint risk | Can be exposed via API Gateway | Can be exposed publicly | Can be exposed publicly |
| Private endpoint | PrivateLink | Private Endpoint | Private Service Connect |
| mTLS | Via API Gateway custom domain | Available | Available |
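SageMaker's default IAM auth means every invocation must carry a SigV4 signature derived from the caller's secret key, which is why an unauthenticated request to a correctly configured endpoint fails before reaching the model. A sketch of the documented SigV4 signing-key derivation (request canonicalization and the final signature are omitted; the key and date values are AWS's published example inputs):

```python
import hmac
import hashlib

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS SigV4 signing key via the documented HMAC-SHA256 chain:
    date -> region -> service -> 'aws4_request'."""
    def h(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode(), hashlib.sha256).digest()
    k_date = h(("AWS4" + secret_key).encode(), date)
    k_region = h(k_date, region)
    k_service = h(k_region, service)
    return h(k_service, "aws4_request")

key = sigv4_signing_key("wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY",
                        "20120215", "us-east-1", "iam")
```

When testing an endpoint, the interesting cases are the paths that bypass this chain entirely: an API Gateway fronting the endpoint with no authorizer, or a key-based Azure ML endpoint whose key has leaked into a notebook.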
Endpoint Attack Surface
Test Authentication Bypass
Attempt to invoke model endpoints without valid credentials. Check for misconfigured API gateways, missing auth on health check endpoints, and default credentials.
Test Input Validation
Send malformed, oversized, and adversarial inputs to model endpoints. Check for error messages that reveal model architecture, framework versions, or infrastructure details.
Test for Model Extraction
Systematically query the endpoint to extract model behavior. With enough queries, an attacker can build a functional copy of the model (model stealing). Check rate limiting and query logging.
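To make the extraction risk concrete, here is a toy victim: an unknown linear model behind a query-only interface, fully recovered with d+1 queries. A real attack against a nonlinear model needs far more queries and only an approximate copy, but the principle is identical, which is why query budgets and rate limits matter:

```python
def victim_endpoint(x):
    """Stand-in for a remote model endpoint: the attacker can query it
    but cannot see the hidden weights."""
    hidden_w, hidden_b = [2.0, -1.0, 0.5], 3.0
    return sum(w * xi for w, xi in zip(hidden_w, x)) + hidden_b

d = 3
b_hat = victim_endpoint([0.0] * d)                 # query at the origin -> bias
w_hat = [victim_endpoint([1.0 if i == j else 0.0 for i in range(d)]) - b_hat
         for j in range(d)]                        # one query per basis vector
# w_hat == [2.0, -1.0, 0.5] and b_hat == 3.0: model stolen in 4 queries
```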
Test for Data Leakage in Responses
Check whether model responses contain training data, PII, or other sensitive information. Test for membership inference (determining whether a specific data point was in the training set).
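Membership inference exploits the fact that overfit models assign systematically lower loss to examples they were trained on, so a simple loss threshold can separate members from non-members. A toy simulation with synthetic loss distributions (the separation here is exaggerated for clarity; real attacks see much smaller gaps):

```python
import random

random.seed(0)
# Synthetic per-example losses: training-set members score lower
member_losses = [random.gauss(0.2, 0.05) for _ in range(100)]
non_member_losses = [random.gauss(0.8, 0.05) for _ in range(100)]

threshold = 0.5  # attacker's decision rule: loss < threshold => "member"
correct = (sum(l < threshold for l in member_losses)
           + sum(l >= threshold for l in non_member_losses))
accuracy = correct / 200
# With well-separated losses the attack is near-perfect (accuracy ~ 1.0)
```

Against a live endpoint the attacker needs per-example confidence or loss signals, so one practical mitigation is limiting what the response exposes (top-1 label instead of full probability vectors).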
Test Network Isolation
Verify that model endpoints cannot be reached from unauthorized networks. Check VPC configurations, security groups, and network policies.
Notebook Environment Security
ML notebook environments (SageMaker Studio, Azure ML Compute, Vertex AI Workbench) are high-value targets because they typically have broad data access and code execution capabilities:
| Risk | Description | Mitigation |
|---|---|---|
| Internet access | Notebooks with internet access can exfiltrate data | Use VPC/VNet with no internet egress |
| Shared instances | Multiple users sharing notebook infrastructure | Use per-user instances with separate credentials |
| Persistent credentials | Long-lived credentials stored in notebook environments | Use temporary credentials, metadata service v2 (IMDSv2) |
| Package installation | Users install arbitrary packages that may be malicious | Restrict to approved package repositories |
| Root access | Notebook users with root can modify security controls | Restrict to non-root access |
Cross-Platform Security Checklist
| Security Control | What to Verify | Priority |
|---|---|---|
| IAM least privilege | Are ML roles scoped to minimum necessary permissions? | Critical |
| Network isolation | Are ML resources in private subnets with no unnecessary internet access? | Critical |
| Data encryption | Is data encrypted at rest and in transit? Are keys managed properly? | High |
| Endpoint authentication | Are model serving endpoints properly authenticated? | High |
| Logging and monitoring | Are ML API calls, data access, and model invocations logged? | High |
| Model artifact integrity | Are model artifacts access-controlled and integrity-verified? | Medium |
| Notebook security | Are notebooks isolated, credentialed appropriately, and network-restricted? | High |
| Supply chain | Are ML frameworks, packages, and pre-trained models from trusted sources? | Medium |
Related Topics
- Infrastructure Security: API Security -- API-level security testing for ML endpoints
- Supply Chain Security -- model artifact and dependency integrity
- LangChain & LlamaIndex Security -- framework-specific vulnerabilities on cloud platforms
- API Provider Security Comparison -- comparing security controls across LLM API providers
References
- "AWS SageMaker Security Best Practices" - Amazon Web Services (2024) - AWS guidance on securing ML workloads including IAM, networking, and encryption
- "Azure Machine Learning Security Baseline" - Microsoft (2024) - Security controls and configuration for Azure ML deployments
- "Google Cloud Vertex AI Security Overview" - Google Cloud (2024) - Security architecture and best practices for Vertex AI workloads
- "CIS Benchmarks for Cloud ML Services" - Center for Internet Security (2024) - Hardening benchmarks for cloud ML platform configurations