vLLM Security Configuration
Security hardening for vLLM serving deployments including API authentication, resource limits, and input validation.
Overview
vLLM is a high-throughput, memory-efficient inference engine for large language models. It implements PagedAttention for efficient KV cache management and supports continuous batching, tensor parallelism, and OpenAI-compatible API endpoints. Organizations deploy vLLM to serve models like Llama, Mistral, and other open-weight LLMs at production scale.
From a security perspective, vLLM is an inference-focused tool that prioritizes performance over security features. The default configuration exposes an OpenAI-compatible API server with no authentication, no TLS, no rate limiting, and no input validation. This is acceptable for local development but creates significant risk in production deployments.
This article covers how to secure vLLM deployments at every layer: network access, authentication, input validation, resource management, and monitoring. The guidance applies to both standalone vLLM deployments and vLLM behind orchestrators like Ray Serve or Kubernetes.
vLLM Architecture
Components
vLLM consists of:
| Component | Description | Security Relevance |
|---|---|---|
| API Server | FastAPI-based OpenAI-compatible endpoints | Unauthenticated by default |
| Engine | Core inference engine with PagedAttention | Resource exhaustion target |
| Worker(s) | GPU workers for tensor parallel inference | GPU memory side channels |
| Tokenizer | Model-specific tokenization | Token-counting bypass vectors |
| Model Loader | Loads model weights from disk or remote | Supply chain target |
Default Exposure
# Typical vLLM startup — completely open
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Anyone on the network can now:
# - Send completion and chat requests
# - List loaded models
# - Consume GPU resources
# - Probe the model with adversarial inputs
Authentication
API Key Authentication
vLLM supports API key authentication via the --api-key flag. When set, all requests must include the key in the Authorization header:
# Start vLLM with API key authentication
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key "${VLLM_API_KEY}"
# Clients must include the API key:
# curl -H "Authorization: Bearer ${VLLM_API_KEY}" http://localhost:8000/v1/chat/completions ...
Important considerations:
- The API key is a single shared secret — it does not support per-user keys or scoped permissions
- Store the API key in a secrets manager (Vault, AWS Secrets Manager, Kubernetes Secrets), not in command-line arguments or environment files checked into version control
- Rotate the key periodically and after any suspected compromise
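Clients then present the key as a standard OpenAI-style bearer token. A minimal sketch of building such a request (the base URL and model name are placeholders for your deployment):

```python
import json

def build_chat_request(api_key: str, model: str, user_message: str,
                       base_url: str = "http://localhost:8000"):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        # vLLM validates this header when started with --api-key
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }).encode("utf-8")
    return url, headers, body
```

The returned triple can be passed to any HTTP client (urllib.request, requests, or the official OpenAI SDK pointed at the vLLM base URL).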
The helper below generates strong keys and hashes them for safe storage:
import secrets
import hashlib
from typing import Optional
class VLLMAPIKeyManager:
"""Manage API keys for vLLM deployments."""
@staticmethod
def generate_key(length: int = 48) -> str:
"""Generate a cryptographically secure API key."""
return secrets.token_urlsafe(length)
@staticmethod
def hash_key(key: str) -> str:
"""Hash a key for storage (never store plaintext keys in databases)."""
return hashlib.sha256(key.encode()).hexdigest()
@staticmethod
def validate_key_strength(key: str) -> dict:
"""Validate that an API key meets minimum security requirements."""
issues = []
if len(key) < 32:
issues.append("Key too short — minimum 32 characters recommended")
if key == key.lower() or key == key.upper():
issues.append("Key lacks mixed case — may indicate weak generation")
if key.startswith("sk-") and len(key) < 40:
issues.append("Key follows OpenAI format but is too short")
return {
"valid": len(issues) == 0,
"issues": issues,
"key_length": len(key),
}
Reverse Proxy Authentication
For production deployments, use a reverse proxy that provides more sophisticated authentication:
# /etc/nginx/conf.d/vllm.conf
upstream vllm_backend {
server 127.0.0.1:8000;
keepalive 32;
}
# Rate limit zones (must be defined at http level, outside the server block)
limit_req_zone $binary_remote_addr zone=vllm_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=vllm_completions:10m rate=5r/s;
server {
listen 443 ssl http2;
server_name llm-api.company.com;
ssl_certificate /etc/ssl/certs/vllm.crt;
ssl_certificate_key /etc/ssl/private/vllm.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Request size limit (prevent huge prompt injection payloads)
client_max_body_size 1m;
# Timeouts (LLM generation can take a while)
proxy_read_timeout 120s;
proxy_send_timeout 30s;
location /v1/chat/completions {
limit_req zone=vllm_completions burst=10 nodelay;
# Authenticate via subrequest to auth service
auth_request /auth;
auth_request_set $auth_user $upstream_http_x_auth_user;
proxy_pass http://vllm_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Auth-User $auth_user;
# Enable streaming for SSE responses
proxy_buffering off;
proxy_cache off;
}
location /v1/completions {
limit_req zone=vllm_completions burst=10 nodelay;
auth_request /auth;
proxy_pass http://vllm_backend;
proxy_buffering off;
}
# Block model listing if not needed
location /v1/models {
limit_req zone=vllm_api burst=5;
auth_request /auth;
proxy_pass http://vllm_backend;
}
# Auth subrequest endpoint
location = /auth {
internal;
proxy_pass http://auth-service:8080/validate;
proxy_pass_request_body off;
proxy_set_header Content-Length "";
proxy_set_header X-Original-URI $request_uri;
proxy_set_header Authorization $http_authorization;
}
# Security headers
add_header X-Content-Type-Options nosniff always;
add_header X-Frame-Options DENY always;
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
}
Denial-of-Service Prevention
LLM-Specific DoS Vectors
LLM serving has unique DoS vectors that traditional web application firewalls do not address:
- Long prompt attacks: Sending prompts that fill the model's context window consumes maximum GPU memory and compute
- High max_tokens: Requesting very long completions ties up GPU resources
- Streaming abuse: Opening many streaming connections to exhaust server resources
- Adversarial tokenization: Crafting inputs that produce unusually high token counts relative to character count
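The last vector can be screened heuristically: English prose averages roughly three to four characters per token, so a token count approaching the character count suggests crafted input. A sketch of such a ratio check (the threshold is an assumption to tune per tokenizer; the token count would come from the model's own tokenizer):

```python
def tokens_look_adversarial(n_tokens: int, n_chars: int,
                            max_tokens_per_char: float = 0.75) -> bool:
    """Flag inputs whose token count is disproportionate to their length."""
    if n_chars == 0:
        # Zero characters should never produce tokens
        return n_tokens > 0
    return (n_tokens / n_chars) > max_tokens_per_char
```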
A request guard can enforce limits against these vectors before traffic reaches the engine:
import time
from typing import Any, Dict, List
class VLLMRequestGuard:
"""Input validation and DoS prevention for vLLM endpoints."""
def __init__(
self,
max_prompt_tokens: int = 4096,
max_completion_tokens: int = 2048,
max_concurrent_requests: int = 100,
max_requests_per_minute: int = 60,
):
self.max_prompt_tokens = max_prompt_tokens
self.max_completion_tokens = max_completion_tokens
self.max_concurrent = max_concurrent_requests
self.max_rpm = max_requests_per_minute
self._active_requests = 0
self._request_log: Dict[str, List[float]] = {}
def validate_chat_request(self, request: Dict[str, Any]) -> Dict[str, Any]:
"""Validate a chat completion request."""
errors = []
# Validate messages
messages = request.get("messages", [])
if not messages:
errors.append("Request must include at least one message")
# Check total message content length
total_chars = sum(
len(str(msg.get("content", ""))) for msg in messages
)
if total_chars > 100_000:
errors.append(
f"Total message content too large: {total_chars} chars (max 100,000)"
)
# Check max_tokens
max_tokens = request.get("max_tokens")
if max_tokens is not None:
if max_tokens > self.max_completion_tokens:
errors.append(
f"max_tokens ({max_tokens}) exceeds limit ({self.max_completion_tokens})"
)
if max_tokens < 1:
errors.append("max_tokens must be positive")
# Check temperature
temperature = request.get("temperature")
if temperature is not None:
if not 0 <= temperature <= 2:
errors.append(f"temperature must be between 0 and 2, got {temperature}")
# Check for repeated content (potential prompt stuffing)
for msg in messages:
content = str(msg.get("content", ""))
if len(content) > 1000:
# Check if the content is highly repetitive
chunks = [content[i:i+100] for i in range(0, min(len(content), 5000), 100)]
unique_chunks = len(set(chunks))
if unique_chunks < len(chunks) * 0.3:
errors.append("Message content appears to be repetitive padding")
break
# Check n (number of completions)
n = request.get("n", 1)
if n > 5:
errors.append(f"n ({n}) exceeds maximum of 5")
return {"valid": len(errors) == 0, "errors": errors}
def check_rate_limit(self, client_id: str) -> bool:
"""Check per-client rate limit."""
now = time.time()
if client_id not in self._request_log:
self._request_log[client_id] = []
# Remove entries older than 60 seconds
self._request_log[client_id] = [
ts for ts in self._request_log[client_id] if now - ts < 60
]
if len(self._request_log[client_id]) >= self.max_rpm:
return False
self._request_log[client_id].append(now)
return True
def check_concurrent_limit(self) -> bool:
"""Check global concurrent request limit."""
return self._active_requests < self.max_concurrent
Resource Limits
Configure vLLM with appropriate resource limits to prevent resource exhaustion:
# Start vLLM with resource constraints
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 127.0.0.1 \
--port 8000 \
--api-key "${VLLM_API_KEY}" \
--max-model-len 8192 \
--max-num-seqs 64 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--block-size 16 \
--swap-space 4 \
--disable-log-requests
Key parameters:
- --max-model-len: Limits the maximum sequence length (prompt + completion). Prevents attackers from using the full context window.
- --max-num-seqs: Limits concurrent sequences being processed. Prevents request flooding.
- --max-num-batched-tokens: Limits tokens processed per batch. Caps per-iteration compute cost.
- --gpu-memory-utilization: Reserves GPU memory headroom. Prevents OOM crashes.
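These limits interact: a request only fits if its prompt length plus its requested completion length stays within --max-model-len. A gateway can clamp max_tokens to the remaining context before forwarding (a sketch; the default mirrors the flag value used above):

```python
def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int,
                     max_model_len: int = 8192) -> int:
    """Shrink max_tokens so prompt + completion fits the configured window.

    Returns 0 when the prompt alone fills the window, signalling that the
    request should be rejected rather than forwarded.
    """
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        return 0
    return min(requested_max_tokens, remaining)
```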
Model Loading Security
Verifying Model Integrity
vLLM loads models from Hugging Face Hub or local paths. Verify model integrity before loading:
import hashlib
from pathlib import Path
from typing import Dict, Optional
import json
class VLLMModelVerifier:
"""Verify model integrity before loading into vLLM."""
def __init__(self, model_path: str):
self.model_path = Path(model_path)
def compute_file_hashes(self) -> Dict[str, str]:
"""Compute SHA-256 hashes of all model files."""
hashes = {}
for file_path in sorted(self.model_path.rglob("*")):
if file_path.is_file():
sha256 = hashlib.sha256()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
relative = str(file_path.relative_to(self.model_path))
hashes[relative] = sha256.hexdigest()
return hashes
def verify_against_manifest(self, manifest_path: str) -> Dict:
"""Verify model files against a saved manifest."""
with open(manifest_path) as f:
expected = json.load(f)
actual = self.compute_file_hashes()
mismatches = []
missing = []
unexpected = []
for path, expected_hash in expected.items():
if path not in actual:
missing.append(path)
elif actual[path] != expected_hash:
mismatches.append({
"path": path,
"expected": expected_hash,
"actual": actual[path],
})
for path in actual:
if path not in expected:
unexpected.append(path)
return {
"verified": not mismatches and not missing,
"mismatches": mismatches,
"missing_files": missing,
"unexpected_files": unexpected,
}
def check_for_unsafe_serialization(self) -> list:
"""Check for potentially unsafe model file formats."""
findings = []
for file_path in self.model_path.rglob("*"):
if file_path.is_file():
suffix = file_path.suffix.lower()
if suffix in (".pkl", ".pickle"):
findings.append({
"severity": "critical",
"file": str(file_path.relative_to(self.model_path)),
"finding": "Pickle file detected — arbitrary code execution risk",
})
elif suffix in (".pt", ".pth", ".bin"):
# PyTorch files may use pickle internally
findings.append({
"severity": "high",
"file": str(file_path.relative_to(self.model_path)),
"finding": "PyTorch file may use pickle serialization internally",
})
elif suffix == ".safetensors":
findings.append({
"severity": "info",
"file": str(file_path.relative_to(self.model_path)),
"finding": "SafeTensors format — safe from deserialization attacks",
})
return findings
Preventing Model Theft
If model weights are proprietary, prevent unauthorized access:
# Docker deployment with model mounted read-only from encrypted volume
docker run --gpus all \
--read-only \
--tmpfs /tmp:noexec,nosuid,size=1g \
-v /encrypted-models/llama-3:/models/llama-3:ro \
-p 127.0.0.1:8000:8000 \
--security-opt no-new-privileges \
--cap-drop ALL \
vllm/vllm-openai:latest \
--model /models/llama-3 \
--api-key "${VLLM_API_KEY}" \
--max-model-len 8192
Output Security
Response Filtering
Implement output filtering to catch sensitive data leakage:
import re
from typing import Dict, List, Optional
class VLLMOutputFilter:
"""Filter vLLM responses for sensitive data leakage."""
def __init__(self):
self.patterns = {
"credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b"),
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"email": re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"),
"aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
"private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}
def scan_response(self, text: str) -> Dict[str, List]:
"""Scan response text for sensitive data patterns."""
findings = {}
for pattern_name, pattern in self.patterns.items():
matches = pattern.findall(text)
if matches:
findings[pattern_name] = [m[:4] + "***" for m in matches]
return findings
def filter_response(self, text: str, redact: bool = True) -> dict:
"""Filter response and optionally redact sensitive data."""
findings = self.scan_response(text)
if not findings:
return {"text": text, "filtered": False, "findings": {}}
filtered_text = text
if redact:
for pattern_name, pattern in self.patterns.items():
filtered_text = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", filtered_text)
return {
"text": filtered_text,
"filtered": True,
"findings": findings,
}
Monitoring
Security Logging
import logging
import json
from datetime import datetime, timezone
from typing import Dict, Any, Optional
class VLLMSecurityLogger:
"""Security-focused logging for vLLM deployments."""
def __init__(self, log_file: str = "/var/log/vllm/security.json"):
self.logger = logging.getLogger("vllm.security")
handler = logging.FileHandler(log_file)
handler.setFormatter(logging.Formatter("%(message)s"))
self.logger.addHandler(handler)
self.logger.setLevel(logging.INFO)
def log_request(
self,
client_id: str,
endpoint: str,
prompt_tokens: int,
max_tokens: int,
source_ip: str,
blocked: bool = False,
block_reason: Optional[str] = None,
) -> None:
"""Log an inference request with security-relevant metadata."""
event = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"event": "inference_request",
"client_id": client_id,
"endpoint": endpoint,
"prompt_tokens": prompt_tokens,
"max_tokens": max_tokens,
"source_ip": source_ip,
"blocked": blocked,
"block_reason": block_reason,
}
self.logger.info(json.dumps(event))
def log_anomaly(
self, anomaly_type: str, details: Dict[str, Any], source_ip: str
) -> None:
"""Log a detected security anomaly."""
event = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"event": "security_anomaly",
"anomaly_type": anomaly_type,
"details": details,
"source_ip": source_ip,
}
self.logger.info(json.dumps(event))
Defensive Recommendations
- Always set --api-key in production — never run vLLM without authentication
- Bind to localhost (--host 127.0.0.1) and use a reverse proxy for external access
- Set explicit resource limits (--max-model-len, --max-num-seqs) to prevent DoS
- Use TLS via the reverse proxy for all external communication
- Validate inputs before they reach vLLM — check content length, max_tokens, and for repetitive content
- Filter outputs for sensitive data patterns (PII, credentials, private keys)
- Monitor and log all requests with security-relevant metadata
- Use SafeTensors format for model weights to avoid deserialization attacks
- Run vLLM containers read-only with model weights mounted as read-only volumes
- Keep request logging enabled in production for audit trails (note that --disable-log-requests turns it off), but avoid logging raw prompt content (PII concerns)
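The flag-level recommendations can be collected into a launch helper (a sketch; the flag values repeat the hardened startup example earlier and should be tuned per deployment):

```python
def hardened_vllm_args(model: str, api_key: str) -> list:
    """Build a vLLM launch command reflecting the recommendations above."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", "127.0.0.1",              # bind locally; expose via reverse proxy
        "--port", "8000",
        "--api-key", api_key,               # never run unauthenticated
        "--max-model-len", "8192",          # cap prompt + completion length
        "--max-num-seqs", "64",             # cap concurrent sequences
        "--max-num-batched-tokens", "16384",
        "--gpu-memory-utilization", "0.85", # leave GPU headroom
    ]
```

Such a list can be handed to subprocess.run or baked into a container entrypoint.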
References
- vLLM Documentation — https://docs.vllm.ai/
- vLLM GitHub — https://github.com/vllm-project/vllm
- OWASP LLM Top 10 2025 — LLM04 (Data and Model Poisoning), LLM10 (Unbounded Consumption)
- NIST AI RMF — Govern 1.4 (AI deployment security controls)
- Kwon et al. — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023) — foundational vLLM paper