Attacking AI Deployments
Security assessment of AI deployment infrastructure, including container escapes, GPU side channels, inference server vulnerabilities, and resource exhaustion attacks.
AI deployments run on specialized infrastructure — GPU clusters, inference servers, container orchestration — that introduces attack surfaces beyond the model itself. A compromise at the infrastructure level can provide access to model weights, user data, and the broader network.
AI Deployment Architecture
A typical production LLM deployment:
```
Load Balancer → API Gateway → Inference Server (vLLM, TGI, Triton)
                                     ↓
                         GPU Cluster (CUDA runtime)
                                     ↓
                        Model Weights (shared storage)
                                     ↓
                             Vector DB / Cache
```
Each component has its own attack surface.
Inference Server Vulnerabilities
Popular inference servers (vLLM, Text Generation Inference, Triton) expose management interfaces and processing endpoints:
```python
# Common inference server endpoints to probe
import requests

endpoints = [
    "/health",          # Health check - may reveal version info
    "/metrics",         # Prometheus metrics - resource usage, model info
    "/v1/models",       # Model listing - reveals loaded models
    "/v1/completions",  # Inference endpoint
    "/admin",           # Management interface (if exposed)
    "/docs",            # Auto-generated API docs
]

# Test for unauthenticated access to management endpoints
for endpoint in endpoints:
    try:
        response = requests.get(f"http://target:8000{endpoint}", timeout=5)
        print(f"{endpoint}: {response.status_code}")
    except requests.RequestException as exc:
        print(f"{endpoint}: {exc}")
```

Known Attack Patterns
| Server | Vulnerability Class | Impact |
|---|---|---|
| vLLM | Unauthenticated API | Model access, inference abuse |
| Triton | Model repository traversal | Access to other models, weights |
| TGI | Resource exhaustion | Service denial |
| Ollama | Default open port (11434) | Full local model access |
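The Ollama row above can be probed directly. The sketch below (`list_ollama_models` is a hypothetical helper; the target host is a placeholder you supply) queries Ollama's unauthenticated `/api/tags` model-listing endpoint on the default port:

```python
# Sketch: check a host for an exposed Ollama instance on its default port.
# /api/tags is Ollama's unauthenticated model-listing route.
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default listening port

def list_ollama_models(host: str, timeout: float = 3.0) -> list:
    """Return model names served by an open Ollama instance,
    or an empty list if the port is closed or unreachable."""
    url = f"http://{host}:{OLLAMA_PORT}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.load(resp)
    except OSError:
        return []
    # /api/tags responds with {"models": [{"name": "...", ...}, ...]}
    return [m.get("name") for m in data.get("models", [])]

# Example: list_ollama_models("10.0.0.5") -> non-empty means full model access
```

A non-empty result means the daemon is reachable without authentication, which also implies pull/push and inference access on every listed model.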
GPU and Memory Attacks
GPU Memory Leakage
GPU memory is not automatically cleared between requests. In multi-tenant environments, a subsequent request might read residual data from a previous user's inference:
```python
# Conceptual GPU memory probe: in a shared GPU environment, allocate
# memory without initialization and inspect the residual values
import torch

# torch.empty allocates uninitialized memory on the device
probe = torch.empty(1024, 1024, device='cuda')
# probe may contain residual data from previous operations,
# including other users' embeddings, activations, or KV cache
```

Model Weight Extraction
If an attacker gains access to the deployment infrastructure, model weights can be directly copied:
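A minimal sweep of a candidate directory can enumerate serialized weight files. This is a sketch under assumptions: `find_weight_files` is a hypothetical helper, and the extension list is illustrative rather than exhaustive:

```python
# Sketch: walk a directory tree and flag files that look like
# serialized model weights. Extension set is an assumption.
import os

WEIGHT_EXTENSIONS = {".safetensors", ".bin", ".pt", ".gguf", ".onnx"}

def find_weight_files(root: str) -> list:
    """Return paths under `root` whose extension suggests model weights."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in WEIGHT_EXTENSIONS:
                hits.append(os.path.join(dirpath, name))
    return hits
```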
```python
# Common locations for model weights in containerized deployments
# /models/
# /opt/ml/model/
# /root/.cache/huggingface/
# /data/models/
```

Container Security
AI workloads often run in containers with elevated privileges for GPU access:
```yaml
# Common insecure container configuration for GPU workloads
services:
  inference:
    image: vllm/vllm-openai
    # Privileged mode for GPU access - common but dangerous
    privileged: true
    # Host network - exposes internal ports
    network_mode: host
    # Volume mounts - may expose sensitive host paths
    volumes:
      - /data/models:/models
      - /var/run/docker.sock:/var/run/docker.sock
```

Container escape paths specific to AI workloads:
- Privileged mode — Often enabled for NVIDIA GPU access, grants full host capabilities
- Docker socket mount — Sometimes used for container orchestration, enables full Docker control
- Host PID namespace — Shared for GPU driver compatibility, enables process visibility
- Writable model directory — If writable, an attacker can replace model weights
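From inside a container, the Docker-socket escape path above can be confirmed with a quick check (a sketch; `docker_socket_mounted` is a hypothetical helper):

```python
# Sketch: detect a mounted Docker socket from inside a container.
# A reachable socket at this path typically grants root-equivalent
# control of the host's Docker daemon.
import os
import stat

def docker_socket_mounted(path: str = "/var/run/docker.sock") -> bool:
    """True if `path` exists and is a Unix domain socket."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return stat.S_ISSOCK(st.st_mode)
```

A positive result here, combined with a client able to speak the Docker API over the socket, usually means a full host compromise is one container launch away.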
Resource Exhaustion
AI inference is resource-intensive, making it a prime target for denial of service:
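Each exhaustion vector below reduces to crafting worst-case requests through the ordinary completion API. A hedged sketch of such a payload, assuming an OpenAI-style `/v1/completions` schema and an illustrative 8K context limit (`exhaustion_payload` is a hypothetical helper):

```python
# Sketch: build a completion request shaped to maximize resource use.
# Parameter names follow the common OpenAI-style /v1/completions schema;
# the specific limits are illustrative assumptions.
def exhaustion_payload(model: str, context_limit: int = 8192) -> dict:
    long_prompt = "a " * (context_limit // 2)   # fill most of the context window
    return {
        "model": model,
        "prompt": [long_prompt] * 8,            # batched prompts multiply KV-cache use
        "max_tokens": context_limit // 2,       # force a long generation
        "temperature": 1.0,
    }

payload = exhaustion_payload("target-model")
```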
- GPU memory exhaustion: send requests that maximize GPU memory usage (long input + long output + large batch size)
- Compute exhaustion: send requests that maximize inference time, such as adversarial inputs that trigger maximum attention computation
- Storage exhaustion: if the system logs prompts/responses, generate volume; if the system caches KV states, fill the cache

Related Topics
- AI Infrastructure Security Overview -- broader infrastructure attack surface context
- LLM API Security -- the API layer that sits in front of inference servers
- Model Supply Chain Risks -- compromising models before they reach the deployment
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- Capstone: Execution & Reporting -- incorporating infrastructure findings into engagement reports
References
- NVIDIA, "Container Security for GPU Workloads" (2024) -- GPU container security guidance
- Tramèr et al., "Stealing Machine Learning Models via Prediction APIs" (2016) -- model extraction through inference endpoints
- MITRE, "ATLAS: Adversarial Threat Landscape for AI Systems" (2023) -- infrastructure-layer threats in the AI threat taxonomy