Attacking AI Deployments
Security assessment of AI deployment infrastructure, including container escapes, GPU side channels, inference server vulnerabilities, and resource exhaustion attacks.
AI deployments run on specialized infrastructure — GPU clusters, inference servers, container orchestration — that introduces attack surfaces beyond the model itself. A compromise at the infrastructure level can provide access to model weights, user data, and the broader network.
AI Deployment Architecture
A typical production LLM deployment:
```
Load Balancer → API Gateway → Inference Server (vLLM, TGI, Triton)
                                     ↓
                         GPU Cluster (CUDA runtime)
                                     ↓
                        Model Weights (shared storage)
                                     ↓
                             Vector DB / Cache
```
Each component has its own attack surface.
Inference Server Vulnerabilities
Popular inference servers (vLLM, Text Generation Inference, Triton) expose management interfaces and processing endpoints:
```python
# Common inference server endpoints to probe
import requests

endpoints = [
    "/health",          # Health check - may reveal version info
    "/metrics",         # Prometheus metrics - resource usage, model info
    "/v1/models",       # Model listing - reveals loaded models
    "/v1/completions",  # Inference endpoint
    "/admin",           # Management interface (if exposed)
    "/docs",            # Auto-generated API docs
]

# Test for unauthenticated access to management endpoints
for endpoint in endpoints:
    try:
        response = requests.get(f"http://target:8000{endpoint}", timeout=5)
        print(f"{endpoint}: {response.status_code}")
    except requests.RequestException as exc:
        print(f"{endpoint}: {exc}")
```

Known Attack Patterns
| Server | Vulnerability Class | Impact |
|---|---|---|
| vLLM | Unauthenticated API | Model access, inference abuse |
| Triton | Model repository traversal | Access to other models, weights |
| TGI | Resource exhaustion | Service denial |
| Ollama | Default open port (11434) | Full local model access |
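The Ollama row above can be probed directly. The sketch below (`list_ollama_models` is a hypothetical helper; the target host is a placeholder you supply) queries Ollama's unauthenticated `/api/tags` model-listing endpoint on the default port:

```python
# Sketch: check a host for an exposed Ollama instance on its default port.
# /api/tags is Ollama's unauthenticated model-listing route.
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default listening port

def list_ollama_models(host: str, timeout: float = 3.0) -> list:
    """Return model names served by an open Ollama instance,
    or an empty list if the port is closed or unreachable."""
    url = f"http://{host}:{OLLAMA_PORT}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.load(resp)
    except OSError:
        return []
    # /api/tags responds with {"models": [{"name": "...", ...}, ...]}
    return [m.get("name") for m in data.get("models", [])]

# Example: list_ollama_models("10.0.0.5") -> non-empty means full model access
```

A non-empty result means the daemon is reachable without authentication, which also implies pull/push and inference access on every listed model.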
GPU and Memory Attacks
GPU Memory Leakage
GPU memory is not automatically cleared between requests. In multi-tenant environments, a subsequent request might read residual data from a previous user's inference:
```python
# Conceptual GPU memory probe: in a shared GPU environment, allocate
# memory without initialization and inspect the residual values
import torch

# torch.empty allocates uninitialized memory on the device
probe = torch.empty(1024, 1024, device='cuda')
# probe may contain residual data from previous operations,
# including other users' embeddings, activations, or KV cache
```

Model Weight Extraction
If an attacker gains access to the deployment infrastructure, model weights can be directly copied:
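A minimal sweep of a candidate directory can enumerate serialized weight files. This is a sketch under assumptions: `find_weight_files` is a hypothetical helper, and the extension list is illustrative rather than exhaustive:

```python
# Sketch: walk a directory tree and flag files that look like
# serialized model weights. Extension set is an assumption.
import os

WEIGHT_EXTENSIONS = {".safetensors", ".bin", ".pt", ".gguf", ".onnx"}

def find_weight_files(root: str) -> list:
    """Return paths under `root` whose extension suggests model weights."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in WEIGHT_EXTENSIONS:
                hits.append(os.path.join(dirpath, name))
    return hits
```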
```python
# Common locations for model weights in containerized deployments
# /models/
# /opt/ml/model/
# /root/.cache/huggingface/
# /data/models/
```

Container Security
AI workloads often run in containers with elevated privileges for GPU access:
```yaml
# Common insecure container configuration for GPU workloads
services:
  inference:
    image: vllm/vllm-openai
    # Privileged mode for GPU access - common but dangerous
    privileged: true
    # Host network - exposes internal ports
    network_mode: host
    # Volume mounts - may expose sensitive host paths
    volumes:
      - /data/models:/models
      - /var/run/docker.sock:/var/run/docker.sock
```

Container escape paths specific to AI workloads:
- Privileged mode — Often enabled for NVIDIA GPU access, grants full host capabilities
- Docker socket mount — Sometimes used for container orchestration, enables full Docker control
- Host PID namespace — Shared for GPU driver compatibility, enables process visibility
- Writable model directory — If writable, an attacker can replace model weights
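From inside a container, the Docker-socket escape path above can be confirmed with a quick check (a sketch; `docker_socket_mounted` is a hypothetical helper):

```python
# Sketch: detect a mounted Docker socket from inside a container.
# A reachable socket at this path typically grants root-equivalent
# control of the host's Docker daemon.
import os
import stat

def docker_socket_mounted(path: str = "/var/run/docker.sock") -> bool:
    """True if `path` exists and is a Unix domain socket."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return stat.S_ISSOCK(st.st_mode)
```

A positive result here, combined with a client able to speak the Docker API over the socket, usually means a full host compromise is one container launch away.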
Resource Exhaustion
AI inference is resource-intensive, making it a prime target for denial of service:
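Each exhaustion vector below reduces to crafting worst-case requests through the ordinary completion API. A hedged sketch of such a payload, assuming an OpenAI-style `/v1/completions` schema and an illustrative 8K context limit (`exhaustion_payload` is a hypothetical helper):

```python
# Sketch: build a completion request shaped to maximize resource use.
# Parameter names follow the common OpenAI-style /v1/completions schema;
# the specific limits are illustrative assumptions.
def exhaustion_payload(model: str, context_limit: int = 8192) -> dict:
    long_prompt = "a " * (context_limit // 2)   # fill most of the context window
    return {
        "model": model,
        "prompt": [long_prompt] * 8,            # batched prompts multiply KV-cache use
        "max_tokens": context_limit // 2,       # force a long generation
        "temperature": 1.0,
    }

payload = exhaustion_payload("target-model")
```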
- GPU memory exhaustion: send requests that maximize GPU memory usage (long input + long output + large batch size)
- Compute exhaustion: send requests that maximize inference time, such as adversarial inputs that trigger maximum attention computation
- Storage exhaustion: if the system logs prompts/responses, generate volume; if the system caches KV states, fill the cache

Related Topics
- AI Infrastructure Security Overview -- broader infrastructure attack surface context
- LLM API Security -- the API layer that sits in front of inference servers
- Model Supply Chain Risks -- compromising models before they reach the deployment
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- Capstone: Execution & Reporting -- incorporating infrastructure findings into engagement reports
References
- NVIDIA, "Container Security for GPU Workloads" (2024) -- GPU container security guidance
- Tramèr et al., "Stealing Machine Learning Models via Prediction APIs" (2016) -- model extraction through inference endpoints
- MITRE, "ATLAS: Adversarial Threat Landscape for AI Systems" (2023) -- infrastructure-layer threats in the AI threat taxonomy