vLLM Security Configuration
Security hardening for vLLM serving deployments including API authentication, resource limits, and input validation.
Overview
vLLM is a high-throughput, memory-efficient inference engine for large language models. It implements PagedAttention for efficient KV cache management and supports continuous batching, tensor parallelism, and OpenAI-compatible API endpoints. Organizations deploy vLLM to serve models like Llama, Mistral, and other open-weight LLMs at production scale.
From a security perspective, vLLM is an inference-focused tool that prioritizes performance over security features. The default configuration exposes an OpenAI-compatible API server with no authentication, no TLS, no rate limiting, and no input validation. This is acceptable for local development but creates significant risk in production deployments.
This article covers how to secure vLLM deployments at every layer: network access, authentication, input validation, resource management, and monitoring. The guidance applies to both standalone vLLM deployments and vLLM behind orchestrators like Ray Serve or Kubernetes.
vLLM Architecture
Components
vLLM consists of:
| Component | Description | Security Relevance |
|---|---|---|
| API Server | FastAPI-based OpenAI-compatible endpoints | Unauthenticated by default |
| Engine | Core inference engine with PagedAttention | Resource exhaustion target |
| Worker(s) | GPU workers for tensor parallel inference | GPU memory side channels |
| Tokenizer | Model-specific tokenization | Token-counting bypass vectors |
| Model Loader | Loads model weights from disk or remote | Supply chain target |
Default Exposure
# Typical vLLM startup — completely open
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Anyone on the network can now:
# - Send completion and chat requests
# - List loaded models
# - Consume GPU resources
# - Probe the model with adversarial inputs
Authentication
API Key Authentication
vLLM supports API key authentication via the --api-key flag. When set, all requests must include the key in the Authorization header:
# Start vLLM with API key authentication
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key "${VLLM_API_KEY}"
# Clients must include the API key:
# curl -H "Authorization: Bearer ${VLLM_API_KEY}" http://localhost:8000/v1/chat/completions ...
Important considerations:
- The API key is a single shared secret — it does not support per-user keys or scoped permissions
- Store the API key in a secrets manager (Vault, AWS Secrets Manager, Kubernetes Secrets), not in command-line arguments or environment files checked into version control
- Rotate the key periodically and after any suspected compromise
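Clients then present the key as a standard OpenAI-style bearer token. A minimal sketch of building such a request (the base URL and model name are placeholders for your deployment):

```python
import json

def build_chat_request(api_key: str, model: str, user_message: str,
                       base_url: str = "http://localhost:8000"):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        # vLLM validates this header when started with --api-key
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }).encode("utf-8")
    return url, headers, body
```

The returned triple can be passed to any HTTP client (urllib.request, requests, or the official OpenAI SDK pointed at the vLLM base URL).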
The helper below generates strong keys and hashes them for safe storage:
import secrets
import hashlib
from typing import Optional
class VLLMAPIKeyManager:
"""Manage API keys for vLLM deployments."""
@staticmethod
def generate_key(length: int = 48) -> str:
"""Generate a cryptographically secure API key."""
return secrets.token_urlsafe(length)
@staticmethod
def hash_key(key: str) -> str:
"""Hash a key for storage (never store plaintext keys in databases)."""
return hashlib.sha256(key.encode()).hexdigest()
@staticmethod
def validate_key_strength(key: str) -> dict:
"""Validate that an API key meets minimum security requirements."""
issues = []
if len(key) < 32:
issues.append("Key too short — minimum 32 characters recommended")
if key == key.lower() or key == key.upper():
issues.append("Key lacks mixed case — may indicate weak generation")
if key.startswith("sk-") and len(key) < 40:
issues.append("Key follows OpenAI format but is too short")
return {
"valid": len(issues) == 0,
"issues": issues,
"key_length": len(key),
}
Reverse Proxy Authentication
For production deployments, use a reverse proxy that provides more sophisticated authentication:
# /etc/nginx/conf.d/vllm.conf
upstream vllm_backend {
server 127.0.0.1:8000;
keepalive 32;
}
# Rate limit zones (must be defined at http level, outside the server block)
limit_req_zone $binary_remote_addr zone=vllm_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=vllm_completions:10m rate=5r/s;
server {
listen 443 ssl http2;
server_name llm-api.company.com;
ssl_certificate /etc/ssl/certs/vllm.crt;
ssl_certificate_key /etc/ssl/private/vllm.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Request size limit (prevent huge prompt injection payloads)
client_max_body_size 1m;
# Timeouts (LLM generation can take a while)
proxy_read_timeout 120s;
proxy_send_timeout 30s;
location /v1/chat/completions {
limit_req zone=vllm_completions burst=10 nodelay;
# Authenticate via subrequest to auth service
auth_request /auth;
auth_request_set $auth_user $upstream_http_x_auth_user;
proxy_pass http://vllm_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Auth-User $auth_user;
# Enable streaming for SSE responses
proxy_buffering off;
proxy_cache off;
}
location /v1/completions {
limit_req zone=vllm_completions burst=10 nodelay;
auth_request /auth;
proxy_pass http://vllm_backend;
proxy_buffering off;
}
# Block model listing if not needed
location /v1/models {
limit_req zone=vllm_api burst=5;
auth_request /auth;
proxy_pass http://vllm_backend;
}
# Auth subrequest endpoint
location = /auth {
internal;
proxy_pass http://auth-service:8080/validate;
proxy_pass_request_body off;
proxy_set_header Content-Length "";
proxy_set_header X-Original-URI $request_uri;
proxy_set_header Authorization $http_authorization;
}
# Security headers
add_header X-Content-Type-Options nosniff always;
add_header X-Frame-Options DENY always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
}
Denial-of-Service Prevention
LLM-Specific DoS Vectors
LLM serving has unique DoS vectors that traditional web application firewalls do not address:
- Long prompt attacks: Sending prompts that fill the model's context window consumes maximum GPU memory and compute
- High max_tokens: Requesting very long completions ties up GPU resources
- Streaming abuse: Opening many streaming connections to exhaust server resources
- Adversarial tokenization: Crafting inputs that produce unusually high token counts relative to character count
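The last vector can be screened heuristically: English prose averages roughly three to four characters per token, so a token count approaching the character count suggests crafted input. A sketch of such a ratio check (the threshold is an assumption to tune per tokenizer; the token count would come from the model's own tokenizer):

```python
def tokens_look_adversarial(n_tokens: int, n_chars: int,
                            max_tokens_per_char: float = 0.75) -> bool:
    """Flag inputs whose token count is disproportionate to their length."""
    if n_chars == 0:
        # Zero characters should never produce tokens
        return n_tokens > 0
    return (n_tokens / n_chars) > max_tokens_per_char
```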
A request guard can enforce limits against these vectors before traffic reaches the engine:
import time
from typing import Any, Dict, List
class VLLMRequestGuard:
"""Input validation and DoS prevention for vLLM endpoints."""
def __init__(
self,
max_prompt_tokens: int = 4096,
max_completion_tokens: int = 2048,
max_concurrent_requests: int = 100,
max_requests_per_minute: int = 60,
):
self.max_prompt_tokens = max_prompt_tokens
self.max_completion_tokens = max_completion_tokens
self.max_concurrent = max_concurrent_requests
self.max_rpm = max_requests_per_minute
self._active_requests = 0
self._request_log: Dict[str, List[float]] = {}
def validate_chat_request(self, request: Dict[str, Any]) -> Dict[str, Any]:
"""Validate a chat completion request."""
errors = []
# Validate messages
messages = request.get("messages", [])
if not messages:
errors.append("Request must include at least one message")
# Check total message content length
total_chars = sum(
len(str(msg.get("content", ""))) for msg in messages
)
if total_chars > 100_000:
errors.append(
f"Total message content too large: {total_chars} chars (max 100,000)"
)
# Check max_tokens
max_tokens = request.get("max_tokens")
if max_tokens is not None:
if max_tokens > self.max_completion_tokens:
errors.append(
f"max_tokens ({max_tokens}) exceeds limit ({self.max_completion_tokens})"
)
if max_tokens < 1:
errors.append("max_tokens must be positive")
# Check temperature
temperature = request.get("temperature")
if temperature is not None:
if not 0 <= temperature <= 2:
errors.append(f"temperature must be between 0 and 2, got {temperature}")
# Check for repeated content (potential prompt stuffing)
for msg in messages:
content = str(msg.get("content", ""))
if len(content) > 1000:
# Check if the content is highly repetitive
chunks = [content[i:i+100] for i in range(0, min(len(content), 5000), 100)]
unique_chunks = len(set(chunks))
if unique_chunks < len(chunks) * 0.3:
errors.append("Message content appears to be repetitive padding")
break
# Check n (number of completions)
n = request.get("n", 1)
if n > 5:
errors.append(f"n ({n}) exceeds maximum of 5")
return {"valid": len(errors) == 0, "errors": errors}
def check_rate_limit(self, client_id: str) -> bool:
"""Check per-client rate limit."""
now = time.time()
if client_id not in self._request_log:
self._request_log[client_id] = []
# Remove entries older than 60 seconds
self._request_log[client_id] = [
ts for ts in self._request_log[client_id] if now - ts < 60
]
if len(self._request_log[client_id]) >= self.max_rpm:
return False
self._request_log[client_id].append(now)
return True
def check_concurrent_limit(self) -> bool:
"""Check global concurrent request limit."""
return self._active_requests < self.max_concurrent
Resource Limits
Configure vLLM with appropriate resource limits to prevent resource exhaustion:
# Start vLLM with resource constraints
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8000 \
--api-key "${VLLM_API_KEY}" \
--max-model-len 8192 \
--max-num-seqs 64 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--block-size 16 \
--swap-space 4 \
--disable-log-requests
Key parameters:
- --max-model-len: Limits the maximum sequence length (prompt + completion). Prevents attackers from using the full context window.
- --max-num-seqs: Limits concurrent sequences being processed. Prevents request flooding.
- --max-num-batched-tokens: Limits tokens processed per batch. Caps per-iteration compute cost.
- --gpu-memory-utilization: Reserves GPU memory headroom. Prevents OOM crashes.
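These limits interact: a request only fits if its prompt length plus its requested completion length stays within --max-model-len. A gateway can clamp max_tokens to the remaining context before forwarding (a sketch; the default mirrors the flag value used above):

```python
def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int,
                     max_model_len: int = 8192) -> int:
    """Shrink max_tokens so prompt + completion fits the configured window.

    Returns 0 when the prompt alone fills the window, signalling that the
    request should be rejected rather than forwarded.
    """
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        return 0
    return min(requested_max_tokens, remaining)
```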
Model Loading Security
Verifying Model Integrity
vLLM loads models from Hugging Face Hub or local paths. Verify model integrity before loading:
import hashlib
from pathlib import Path
from typing import Dict, Optional
import json
class VLLMModelVerifier:
"""Verify model integrity before loading into vLLM."""
def __init__(self, model_path: str):
self.model_path = Path(model_path)
def compute_file_hashes(self) -> Dict[str, str]:
"""Compute SHA-256 hashes of all model files."""
hashes = {}
for file_path in sorted(self.model_path.rglob("*")):
if file_path.is_file():
sha256 = hashlib.sha256()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
relative = str(file_path.relative_to(self.model_path))
hashes[relative] = sha256.hexdigest()
return hashes
def verify_against_manifest(self, manifest_path: str) -> Dict:
"""Verify model files against a saved manifest."""
with open(manifest_path) as f:
expected = json.load(f)
actual = self.compute_file_hashes()
mismatches = []
missing = []
unexpected = []
for path, expected_hash in expected.items():
if path not in actual:
missing.append(path)
elif actual[path] != expected_hash:
mismatches.append({
"path": path,
"expected": expected_hash,
"actual": actual[path],
})
for path in actual:
if path not in expected:
unexpected.append(path)
return {
"verified": not mismatches and not missing,
"mismatches": mismatches,
"missing_files": missing,
"unexpected_files": unexpected,
}
def check_for_unsafe_serialization(self) -> list:
"""Check for potentially unsafe model file formats."""
findings = []
for file_path in self.model_path.rglob("*"):
if file_path.is_file():
suffix = file_path.suffix.lower()
if suffix in (".pkl", ".pickle"):
findings.append({
"severity": "critical",
"file": str(file_path.relative_to(self.model_path)),
"finding": "Pickle file detected — arbitrary code execution risk",
})
elif suffix in (".pt", ".pth", ".bin"):
# PyTorch files may use pickle internally
findings.append({
"severity": "high",
"file": str(file_path.relative_to(self.model_path)),
"finding": "PyTorch file may use pickle serialization internally",
})
elif suffix == ".safetensors":
findings.append({
"severity": "info",
"file": str(file_path.relative_to(self.model_path)),
"finding": "SafeTensors format — safe from deserialization attacks",
})
return findings
Preventing Model Theft
If model weights are proprietary, prevent unauthorized access:
# Docker deployment with model mounted read-only from encrypted volume
docker run --gpus all \
--read-only \
--tmpfs /tmp:noexec,nosuid,size=1g \
-v /encrypted-models/llama-3:/models/llama-3:ro \
-p 127.0.0.1:8000:8000 \
--security-opt no-new-privileges \
--cap-drop ALL \
vllm/vllm-openai:latest \
--model /models/llama-3 \
--api-key "${VLLM_API_KEY}" \
--max-model-len 8192
Output Security
Response Filtering
Implement output filtering to catch sensitive data leakage:
import re
from typing import Dict, List, Optional
class VLLMOutputFilter:
"""Filter vLLM responses for sensitive data leakage."""
def __init__(self):
self.patterns = {
"credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b"),
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"email": re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"),
"aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
"private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}
def scan_response(self, text: str) -> Dict[str, List]:
"""Scan response text for sensitive data patterns."""
findings = {}
for pattern_name, pattern in self.patterns.items():
matches = pattern.findall(text)
if matches:
findings[pattern_name] = [m[:4] + "***" for m in matches]
return findings
def filter_response(self, text: str, redact: bool = True) -> dict:
"""Filter response and optionally redact sensitive data."""
findings = self.scan_response(text)
if not findings:
return {"text": text, "filtered": False, "findings": {}}
filtered_text = text
if redact:
for pattern_name, pattern in self.patterns.items():
filtered_text = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", filtered_text)
return {
"text": filtered_text,
"filtered": True,
"findings": findings,
}
Monitoring
Security Logging
import logging
import json
from datetime import datetime, timezone
from typing import Dict, Any, Optional
class VLLMSecurityLogger:
"""Security-focused logging for vLLM deployments."""
def __init__(self, log_file: str = "/var/log/vllm/security.json"):
self.logger = logging.getLogger("vllm.security")
handler = logging.FileHandler(log_file)
handler.setFormatter(logging.Formatter("%(message)s"))
self.logger.addHandler(handler)
self.logger.setLevel(logging.INFO)
def log_request(
self,
client_id: str,
endpoint: str,
prompt_tokens: int,
max_tokens: int,
source_ip: str,
blocked: bool = False,
block_reason: Optional[str] = None,
) -> None:
"""Log an inference request with security-relevant metadata."""
event = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"event": "inference_request",
"client_id": client_id,
"endpoint": endpoint,
"prompt_tokens": prompt_tokens,
"max_tokens": max_tokens,
"source_ip": source_ip,
"blocked": blocked,
"block_reason": block_reason,
}
self.logger.info(json.dumps(event))
def log_anomaly(
self, anomaly_type: str, details: Dict[str, Any], source_ip: str
) -> None:
"""Log a detected security anomaly."""
event = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"event": "security_anomaly",
"anomaly_type": anomaly_type,
"details": details,
"source_ip": source_ip,
}
self.logger.info(json.dumps(event))
Defensive Recommendations
- Always set --api-key in production — never run vLLM without authentication
- Bind to localhost (--host 127.0.0.1) and use a reverse proxy for external access
- Set explicit resource limits (--max-model-len, --max-num-seqs) to prevent DoS
- Use TLS via the reverse proxy for all external communication
- Validate inputs before they reach vLLM — check content length, max_tokens, and for repetitive content
- Filter outputs for sensitive data patterns (PII, credentials, private keys)
- Monitor and log all requests with security-relevant metadata
- Use SafeTensors format for model weights to avoid deserialization attacks
- Run vLLM containers read-only with model weights mounted as read-only volumes
- Keep request logging enabled in production for audit trails (note that --disable-log-requests turns it off), but avoid logging raw prompt content (PII concerns)
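The flag-level recommendations can be collected into a launch helper (a sketch; the flag values repeat the hardened startup example earlier and should be tuned per deployment):

```python
def hardened_vllm_args(model: str, api_key: str) -> list:
    """Build a vLLM launch command reflecting the recommendations above."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", "127.0.0.1",              # bind locally; expose via reverse proxy
        "--port", "8000",
        "--api-key", api_key,               # never run unauthenticated
        "--max-model-len", "8192",          # cap prompt + completion length
        "--max-num-seqs", "64",             # cap concurrent sequences
        "--max-num-batched-tokens", "16384",
        "--gpu-memory-utilization", "0.85", # leave GPU headroom
    ]
```

Such a list can be handed to subprocess.run or baked into a container entrypoint.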
References
- vLLM Documentation — https://docs.vllm.ai/
- vLLM GitHub — https://github.com/vllm-project/vllm
- OWASP LLM Top 10 2025 — LLM04 (Data and Model Poisoning), LLM10 (Unbounded Consumption)
- NIST AI RMF — Govern 1.4 (AI deployment security controls)
- Kwon et al. — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023) — foundational vLLM paper