Capstone: Cloud AI Security Assessment
Assess AI deployment security across AWS, Azure, and GCP cloud platforms, producing a comprehensive cloud AI security assessment report.
Overview
AI workloads in the cloud introduce security considerations that go beyond traditional cloud security assessments. Model endpoints need protection from adversarial inputs. Training data in object storage needs access controls that account for ML pipeline service accounts. Model registries can become supply chain attack vectors. GPU instances have unique cost exposure risks. And the managed AI services from each cloud provider (SageMaker, Azure AI, Vertex AI) have their own security configurations that most cloud security teams have never audited.
This capstone project challenges you to conduct a security assessment of AI deployments across the three major cloud providers. You will evaluate infrastructure security, model endpoint protection, data pipeline security, and cloud-specific AI service configurations.
Prerequisites
- Cloud AI Security — Cloud-specific AI security considerations
- Infrastructure Security — General infrastructure assessment methodology
- Training Pipeline Attacks — Supply chain and pipeline risks
- Defenses and Guardrails — Security controls to evaluate
- Basic familiarity with at least one major cloud provider (AWS, Azure, or GCP)
- Understanding of IAM, networking, and object storage concepts
Project Brief
Scenario
You are a cloud security specialist at Stratos Security Consulting. Your client, AeroTech Dynamics, is an aerospace engineering firm that has deployed AI workloads across multiple cloud providers:
AWS Deployment:
- SageMaker endpoints hosting computer vision models for quality inspection
- S3 buckets containing training data (factory floor images, labeled defect data)
- Lambda functions for inference preprocessing and postprocessing
- SageMaker Model Registry for model versioning and deployment
Azure Deployment:
- Azure OpenAI Service powering an internal engineering assistant chatbot
- Azure AI Search (formerly Cognitive Search) for RAG over engineering documentation
- Azure Blob Storage for document uploads and embedding caches
- Azure Key Vault for API key management
GCP Deployment:
- Vertex AI pipelines for model training and evaluation
- Cloud Storage buckets for training datasets and model artifacts
- Cloud Functions for inference API middleware
- Artifact Registry for container images used in training pipelines
AeroTech wants a security assessment focused specifically on their AI workloads — not a general cloud security audit, but an AI-specific assessment that covers model security, data pipeline integrity, and AI service configuration.
Assessment Scope
## In Scope
- AI service configurations (SageMaker, Azure OpenAI, Vertex AI)
- Model endpoint security (authentication, authorization, input validation)
- Training data storage and access controls
- Model registry and artifact security
- AI pipeline security (training, evaluation, deployment)
- IAM roles and policies specific to AI workloads
- Network security for AI endpoints
- Cost and resource controls for AI compute
## Out of Scope
- General cloud security posture (VPC, general IAM, compliance)
- Non-AI application workloads
- Physical security and on-premises infrastructure
- Social engineering and phishing
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Assessment report | Cloud AI security assessment covering all three providers | 40% |
| Configuration review | Detailed review of AI service configurations with specific misconfigurations | 25% |
| Risk matrix | Cloud AI risk matrix mapping threats to assets across providers | 15% |
| Remediation guide | Provider-specific remediation steps with IaC examples | 20% |
Rubric Criteria
- Coverage (20%) — All three providers assessed with AI-specific (not generic cloud) findings
- Technical Depth (25%) — Findings demonstrate understanding of cloud AI service internals and specific misconfigurations
- Provider Specificity (15%) — Remediation is specific to each provider's service, not generic advice
- Risk Prioritization (20%) — Findings are prioritized by realistic exploitation likelihood and business impact
- Actionability (20%) — Remediation includes specific configuration changes, CLI commands, or IaC snippets
Phased Approach
Phase 1: Reconnaissance and Scoping (2 hours)
Inventory AI assets across providers
Build a comprehensive inventory of AI-related resources across all three clouds: endpoints, storage, registries, pipelines, service accounts, and network configurations. Use cloud provider inventory tools (AWS Config, Azure Resource Graph, GCP Asset Inventory) or review IaC templates.
Map AI-specific IAM roles
Identify all IAM roles, service accounts, and permissions associated with AI workloads. Pay attention to: SageMaker execution roles, Azure AI service principals, Vertex AI service accounts, and any cross-service permissions (e.g., can the training pipeline role also deploy models?).
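Part of this permission sweep can be automated once the policy documents are pulled. A minimal sketch in Python that flags wildcard grants in an IAM policy document — the role policy, bucket ARN, and warning format below are hypothetical, and a real review would also inspect `Condition` blocks and trust policies:

```python
def find_overbroad_statements(policy: dict) -> list[str]:
    """Flag Allow statements with wildcard actions or resources --
    a common smell in AI-pipeline execution roles."""
    warnings = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            warnings.append(f"statement {i}: wildcard action {actions}")
        if "*" in resources:
            warnings.append(f"statement {i}: wildcard resource")
    return warnings

# Hypothetical SageMaker execution role policy showing both smells
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "sagemaker:*", "Resource": "*"},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::aerotech-training-data/*"},
    ],
}
for warning in find_overbroad_statements(policy):
    print(warning)
```

The same check translates directly to Azure role definitions (`Actions: ["*"]`) and GCP bindings on primitive roles like `roles/editor`.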
Identify data flows
Map how data flows through each AI pipeline: where training data is stored, how it reaches the training compute, where models are stored after training, how models are deployed to endpoints, and how inference requests reach the model. Each flow is an attack surface.
Phase 2: Service Configuration Review (4 hours)
Assess AWS SageMaker security
Review: endpoint authentication and authorization (IAM vs. API key), VPC configuration for endpoints (are they publicly accessible?), encryption at rest and in transit for model artifacts, SageMaker execution role permissions (least privilege), Model Registry access controls, and S3 bucket policies for training data.
Assess Azure OpenAI and AI Search security
Review: Azure OpenAI API key rotation and managed identity usage, content filtering configuration, network access restrictions (private endpoints vs. public), Azure AI Search index permissions (who can read what?), diagnostic logging configuration, and Key Vault access policies for API key storage.
Assess GCP Vertex AI security
Review: Vertex AI endpoint authentication (service account vs. API key), VPC Service Controls enforcement, Cloud Storage bucket permissions for training data and model artifacts, Artifact Registry access controls for training containers, pipeline service account permissions, and audit logging configuration.
Cross-provider analysis
Identify inconsistencies across providers: is one provider's deployment significantly less secure than others? Are there shared credentials or service accounts that cross provider boundaries? Is there a single point of compromise that would affect deployments on multiple providers?
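Shared credentials are the easiest cross-provider finding to mechanize: fingerprint every credential found during reconnaissance (for example, a hash of each key found in CI configs and pipeline definitions) and intersect the per-provider sets. A small sketch, with hypothetical fingerprints:

```python
def shared_credentials(*inventories: set[str]) -> set[str]:
    """Return credential fingerprints that appear in more than one
    provider's inventory -- each is a cross-cloud single point of compromise."""
    seen: set[str] = set()
    shared: set[str] = set()
    for inventory in inventories:
        shared |= seen & inventory
        seen |= inventory
    return shared

# Hypothetical fingerprints harvested during reconnaissance
aws_creds   = {"sha256:ab12", "sha256:cd34"}
azure_creds = {"sha256:cd34", "sha256:ef56"}
gcp_creds   = {"sha256:gh78"}

print(shared_credentials(aws_creds, azure_creds, gcp_creds))
```

Here the fingerprint present in both the AWS and Azure inventories is flagged; compromising that one secret would affect two providers at once.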
Phase 3: Vulnerability Testing (4 hours)
Test model endpoint security
For accessible endpoints, test: authentication bypass attempts, rate limiting effectiveness, input validation (malformed inputs, oversized payloads), inference API abuse (model extraction through query volume), and error message information leakage.
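An input-validation harness usually starts from a fixed set of malformed bodies replayed against each endpoint. A sketch of such a probe set — the payload names and the 6 MB size threshold are illustrative assumptions, not provider-documented limits:

```python
import json

def endpoint_probe_payloads(max_body_bytes: int = 6_000_000) -> dict[str, bytes]:
    """Probe bodies for input-validation testing of an inference endpoint.
    The default size threshold is an assumption; adjust per provider."""
    return {
        "empty_body":     b"",
        "truncated_json": b'{"image": ',                      # malformed JSON
        "wrong_type":     json.dumps({"image": 12345}).encode(),  # int, not base64
        "oversized":      b"A" * (max_body_bytes + 1),        # exceeds assumed limit
    }

for name, body in endpoint_probe_payloads().items():
    print(name, len(body))
```

For each probe, record the status code, latency, and response body: a stack trace or framework error page in the response is itself an information-leakage finding.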
Test data access controls
Verify that training data storage has appropriate access controls: can an unauthenticated user access S3 buckets? Can a read-only role modify training data? Are model artifacts in Cloud Storage or Blob Storage properly restricted? Test for public bucket misconfiguration.
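One concrete check: given a bucket policy pulled during the review, decide whether it grants object reads to everyone. A simplified sketch — the policy shown is hypothetical, and a real check must also handle `Condition` blocks, `NotPrincipal`, and account-level public access block settings:

```python
def allows_public_read(bucket_policy: dict) -> bool:
    """True if any Allow statement grants object reads to all principals.
    Simplified: ignores Condition, NotPrincipal, and public access blocks."""
    for stmt in bucket_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        is_public = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*")
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if is_public and any(a in ("s3:GetObject", "s3:*", "*") for a in actions):
            return True
    return False

# Hypothetical policy on the factory-floor training-data bucket
policy = {"Statement": [{"Effect": "Allow", "Principal": "*",
                         "Action": "s3:GetObject",
                         "Resource": "arn:aws:s3:::factory-floor-images/*"}]}
print(allows_public_read(policy))  # True -- training data is world-readable
```

The equivalent checks on the other providers are `allUsers`/`allAuthenticatedUsers` bindings on Cloud Storage and anonymous (`Blob`/`Container`) public access levels on Azure Blob Storage.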
Test model registry integrity
Assess whether the model registry (SageMaker Model Registry, Azure ML Registry, Artifact Registry) is protected against unauthorized model uploads or modifications. Could an attacker with compromised pipeline credentials replace a model with a backdoored version?
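One integrity control to test for here: whether deployments pin model artifacts to the digest recorded at registration time, so a swapped artifact is caught before it reaches an endpoint. A minimal sketch of the verification step (the artifact bytes are a stand-in):

```python
import hashlib

def verify_artifact(artifact_bytes: bytes, registry_sha256: str) -> bool:
    """Compare a downloaded model artifact against the digest recorded
    at registration time; a mismatch means the artifact was replaced."""
    return hashlib.sha256(artifact_bytes).hexdigest() == registry_sha256

model = b"\x00fake-model-weights"
recorded = hashlib.sha256(model).hexdigest()
print(verify_artifact(model, recorded))                # True
print(verify_artifact(model + b"backdoor", recorded))  # False -- tampered
```

If the registry stores no digest at all, or if the pipeline role that writes models can also rewrite the recorded digests, note that as a finding: digest pinning only helps when the digest itself is write-protected.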
Test cost exposure
Evaluate cost controls: are there budget alerts for AI compute? Can an attacker trigger expensive training jobs or scale up GPU instances? Is there a maximum instance count limit? Test whether rate limiting prevents API abuse that would generate excessive costs.
Phase 4: Reporting (2 hours)
Write the assessment report
Structure the report by provider and by risk category. For each finding: describe the misconfiguration, explain the AI-specific risk (why this matters more for AI workloads than generic compute), provide evidence, and include provider-specific remediation.
Build the risk matrix
Create a risk matrix that maps cloud AI threats (model theft, data poisoning, supply chain compromise, cost exhaustion, unauthorized access) against assets across all three providers. Highlight which provider has the strongest and weakest posture for each threat.
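Cell ratings are easier to defend when they come from an explicit likelihood-times-impact rule rather than gut feel. A sketch of one such convention — the 1-5 scales, thresholds, and per-provider scores below are illustrative, not a standard scoring system:

```python
def severity(likelihood: int, impact: int) -> str:
    """Map a 1-5 likelihood x 1-5 impact score to a severity band.
    The thresholds are an illustrative convention, not a standard."""
    score = likelihood * impact
    if score >= 20:
        return "CRITICAL"
    if score >= 12:
        return "HIGH"
    if score >= 6:
        return "MEDIUM"
    return "LOW"

# Hypothetical model-theft scores per provider
for provider, (likelihood, impact) in {
    "AWS": (5, 5), "Azure": (1, 4), "GCP": (3, 3)
}.items():
    print(provider, severity(likelihood, impact))
```

Publishing the rule alongside the matrix lets the client re-score cells as remediations land, instead of treating the ratings as fixed opinions.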
Produce remediation guide with IaC examples
For each finding, provide specific remediation steps including CLI commands or IaC snippets (Terraform, CloudFormation, ARM templates) that implement the fix. This makes remediation immediately actionable for the client's DevOps team.
Example Output
Example Finding: AWS SageMaker
## Finding: SageMaker Endpoint Publicly Accessible Without IAM Auth
**Provider:** AWS
**Service:** SageMaker Real-time Inference
**Severity:** Critical
**Category:** Model Endpoint Security
### Description
The quality inspection model endpoint (endpoint-qc-vision-prod) is deployed
without VPC configuration and with IAM authentication disabled. The endpoint
is accessible from the public internet and accepts inference requests from
any source without authentication.
### AI-Specific Risk
Unlike a generic API endpoint, an exposed model endpoint enables:
- **Model extraction:** An attacker can send systematic queries to
reconstruct the model's decision boundary, stealing proprietary
IP in the quality inspection model
- **Adversarial input testing:** An attacker can probe the model's
weaknesses to craft adversarial images that pass quality inspection
- **Cost exhaustion:** Unauthenticated access allows unlimited inference
requests, running up GPU instance costs
### Evidence
```bash
# Unsigned request succeeds: IAM authentication is not enforced
aws sagemaker-runtime invoke-endpoint \
  --no-sign-request \
  --endpoint-name endpoint-qc-vision-prod \
  --body '{"image": "base64..."}' \
  --content-type application/json \
  output.json
# Returns: 200 OK with inference result
```
### Remediation
```terraform
# VPC configuration is set on the model resource, not the endpoint config
resource "aws_sagemaker_model" "qc_vision" {
  name               = "qc-vision"
  execution_role_arn = aws_iam_role.sagemaker_exec.arn
  # ... container config

  vpc_config {
    security_group_ids = [aws_security_group.sagemaker_sg.id]
    subnets            = aws_subnet.private[*].id
  }
}

resource "aws_sagemaker_endpoint_config" "qc_vision" {
  name = "endpoint-qc-vision-prod"

  production_variants {
    variant_name = "primary"
    model_name   = aws_sagemaker_model.qc_vision.name
    # ... instance type and count
  }
}

# InvokeEndpoint calls must be SigV4-signed; grant them only through a
# least-privilege policy:
resource "aws_iam_policy" "sagemaker_invoke" {
  name = "sagemaker-invoke-qc-vision"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sagemaker:InvokeEndpoint"
      Resource = aws_sagemaker_endpoint.qc_vision.arn
    }]
  })
}
```
### Example Risk Matrix
```markdown
| Threat Category | AWS (SageMaker) | Azure (OpenAI) | GCP (Vertex AI) |
|---------------------|------------------|-----------------|-----------------|
| Model theft | CRITICAL — public endpoint | LOW — private endpoint | MEDIUM — API key only |
| Data poisoning | MEDIUM — S3 ACL gaps | LOW — RBAC enforced | HIGH — public bucket |
| Supply chain | HIGH — registry open | MEDIUM — no signing | HIGH — no image scanning |
| Cost exhaustion | CRITICAL — no limits | LOW — quotas set | MEDIUM — no budget alert |
| Unauthorized access | HIGH — overprivileged roles | LOW — managed identity | MEDIUM — shared SA |
```
Hints
Why is a publicly accessible SageMaker model endpoint a more severe finding for an AI workload than a publicly accessible generic API endpoint?