Model Serving Security
Security hardening for model serving infrastructure — covering vLLM, TGI, Triton Inference Server configuration, API security, resource isolation, and deployment best practices.
Model serving infrastructure is where AI models meet production traffic. The security of this infrastructure determines whether models can be accessed by unauthorized users, whether inference requests can be intercepted or manipulated, and whether the serving infrastructure itself can be compromised. This page covers security hardening for the most widely deployed model serving frameworks.
vLLM Security
vLLM is the dominant open-source serving framework for large language models, known for its PagedAttention memory management and high throughput. Its security surface includes the API server, the model loading pipeline, and the compute infrastructure.
API Server Hardening
vLLM exposes an OpenAI-compatible API server by default. Out of the box, this server has no authentication, no rate limiting, and no request validation.
Authentication: vLLM does not natively include authentication. Deploy an authentication proxy (nginx, Envoy, or cloud API gateway) in front of vLLM that validates API keys or bearer tokens before forwarding requests. Never expose a vLLM instance directly to the internet without an authentication layer.
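The proxy-side key check can be sketched as follows. This is an illustrative fragment, not a vLLM or gateway API; the key values and header handling are assumptions, and in production keys would come from a secrets store rather than source code.

```python
import hmac

# Hypothetical set of valid API keys; in a real deployment these would be
# loaded from a secrets manager, never hard-coded.
VALID_KEYS = {"sk-example-key-1", "sk-example-key-2"}

def authorize(headers: dict) -> bool:
    """Return True only if the Authorization header carries a known bearer key."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    presented = auth[len("Bearer "):]
    # Constant-time comparison to avoid leaking key prefixes via timing.
    return any(hmac.compare_digest(presented, k) for k in VALID_KEYS)
```

The same check fits naturally into an nginx `auth_request` handler or an Envoy external authorization filter.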
Rate limiting: Implement rate limiting at the proxy layer to prevent denial-of-service attacks. Set limits on requests per second per API key, maximum concurrent requests, maximum input token length, and maximum output token length. These limits prevent both service disruption and cost amplification attacks.
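A per-key token bucket is a common way to enforce the requests-per-second limit; the sketch below is a minimal single-threaded version (rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Per-API-key token bucket: sustained `rate` requests/second,
    with bursts up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means the request is throttled."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A production deployment would keep one bucket per API key (and typically a shared backing store such as Redis when the proxy is replicated), plus separate counters for concurrent requests and token budgets.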
Request validation: Validate incoming requests before they reach vLLM. Check that input lengths are within expected bounds, that request parameters (temperature, top_p, max_tokens) are within allowed ranges, and that the request format is valid.
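A parameter-bounds check along these lines can run in the proxy before the request reaches vLLM; the field names follow the OpenAI-style request body, and the specific bounds are illustrative defaults to tune per deployment:

```python
def validate_request(req: dict, max_tokens_limit: int = 1024) -> list[str]:
    """Return a list of validation errors; an empty list means the request
    passes. Bounds are illustrative -- tune them to your deployment."""
    errors = []
    if not isinstance(req.get("prompt"), str):
        errors.append("prompt must be a string")
    if not 0.0 <= req.get("temperature", 1.0) <= 2.0:
        errors.append("temperature out of range [0, 2]")
    if not 0.0 < req.get("top_p", 1.0) <= 1.0:
        errors.append("top_p out of range (0, 1]")
    if not 1 <= req.get("max_tokens", 16) <= max_tokens_limit:
        errors.append(f"max_tokens out of range [1, {max_tokens_limit}]")
    return errors
```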
TLS: Configure TLS on the proxy to encrypt all API traffic. For internal deployments, use mutual TLS to verify both client and server identity.
Model Loading Security
vLLM loads model weights from local storage or remote repositories. The model loading process is a critical security boundary.
Model integrity verification: Before loading any model, verify its integrity through checksums or digital signatures. vLLM does not natively verify model integrity — implement verification in your deployment pipeline. Compare model weight checksums against known-good values from the model provider.
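One way to implement this in a deployment pipeline is to keep a manifest of known-good SHA-256 digests and compare every weight file against it before the server starts (the manifest format here is an assumption, not a vLLM feature):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB weight files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model_files(model_dir: Path, manifest: dict[str, str]) -> bool:
    """Compare each file's digest against a known-good manifest
    (relative path -> expected SHA-256 hex digest)."""
    for rel_path, expected in manifest.items():
        if sha256_of(model_dir / rel_path) != expected:
            return False
    return True
```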
Safe loading: Model files in pickle format can execute arbitrary code during loading. Use safetensors format when available, which does not support code execution during deserialization. If pickle-based models must be used, load them in an isolated environment and verify their behavior before production deployment.
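A deployment pipeline can enforce a safetensors-only policy with a simple audit pass. The sketch below combines a file-type allow-list with a heuristic check for pickle headers (pickle protocol 2+ streams begin with the PROTO opcode, byte 0x80); the allowed suffixes are an illustrative assumption, and this heuristic complements, rather than replaces, loading only safetensors files:

```python
from pathlib import Path

# Pickle streams for protocol 2+ begin with the PROTO opcode (0x80).
PICKLE_MAGIC = b"\x80"
# Illustrative allow-list; extend to match your model format.
ALLOWED_SUFFIXES = {".safetensors", ".json", ".txt", ".model"}

def audit_model_dir(model_dir: Path) -> list[str]:
    """Flag files that are pickle-like or outside the allow-list."""
    findings = []
    for path in model_dir.rglob("*"):
        if not path.is_file():
            continue
        if path.suffix not in ALLOWED_SUFFIXES:
            findings.append(f"disallowed file type: {path.name}")
        with open(path, "rb") as f:
            if f.read(1) == PICKLE_MAGIC:
                findings.append(f"pickle-like header: {path.name}")
    return findings
```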
Storage access control: Restrict access to model weight storage. Only the vLLM process should have read access to model files. Prevent unauthorized modification of model weights by implementing file integrity monitoring.
Resource Isolation
Container isolation: Run vLLM in containers with minimal capabilities. Drop all Linux capabilities except those required for GPU access. Use read-only root filesystems where possible. Restrict network access to only the required endpoints.
GPU isolation: In multi-tenant environments, ensure GPU memory isolation between different model instances. vLLM's memory management (PagedAttention) is designed for efficiency, not security isolation. Use separate GPU devices or GPU virtualization (NVIDIA MIG) for strong isolation between tenants.
Memory security: vLLM stores model weights, KV cache, and request data in GPU and CPU memory. Ensure that request data is cleared from memory promptly once a request completes. In multi-tenant environments, verify that one tenant's data cannot leak to another through shared GPU memory.
Text Generation Inference (TGI) Security
Hugging Face's Text Generation Inference (TGI) is another popular serving framework with similar security considerations.
API Security
TGI exposes a REST API that, like vLLM, lacks native authentication. The same proxy-based authentication, rate limiting, and request validation recommendations apply.
TGI includes some built-in safety controls. Maximum input length and maximum output length can be configured at startup, providing built-in protection against token exhaustion attacks. Maximum concurrent requests limits the server's resource consumption. Configure these parameters conservatively for production deployments.
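An illustrative launcher invocation with conservative limits might look like the following; flag names have changed across TGI versions (for example, older releases used --max-input-length), so confirm them against `text-generation-launcher --help` for your version:

```shell
# Illustrative limits -- tune to your workload and hardware.
text-generation-launcher \
  --model-id /models/my-model \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-concurrent-requests 128
```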
Model Hub Integration
TGI can download models directly from Hugging Face Hub at startup. This convenience creates a supply chain risk: if the model on Hub is compromised between downloads, or if DNS is manipulated to redirect Hub requests to a malicious server, a compromised model could be loaded.
For production deployments, pre-download and verify models locally rather than pulling from Hub at startup. Pin specific model revisions rather than using latest. Verify model checksums before deployment.
Quantization Security
TGI supports quantized models for reduced memory usage and faster inference. Quantization changes the model's behavior — both its capabilities and its security properties. A safety-tuned model may have different safety characteristics after quantization because the quantization process can alter the model's decision boundaries.
Test the quantized model's safety behavior against the full-precision model to identify any safety degradation. If safety properties degrade significantly, the quantized model may need additional safety controls or may not be suitable for production.
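One minimal form of this comparison is to run both models over a fixed set of unsafe probe prompts and compare refusal rates. The sketch below treats each model as an opaque prompt-to-text callable and uses a naive substring refusal check; real evaluations would use a proper refusal classifier and a curated probe set:

```python
def refusal_rate(model, prompts) -> float:
    """Fraction of probe prompts the model refuses. `model` is any callable
    prompt -> response text; the substring check is a deliberately naive stand-in
    for a real refusal classifier."""
    refusals = sum(1 for p in prompts if "cannot help" in model(p).lower())
    return refusals / len(prompts)

def safety_regression(full_model, quantized_model, probe_prompts,
                      max_drop: float = 0.05) -> bool:
    """True if the quantized model refuses unsafe probes noticeably less
    often than the full-precision model (threshold is illustrative)."""
    drop = (refusal_rate(full_model, probe_prompts)
            - refusal_rate(quantized_model, probe_prompts))
    return drop > max_drop
```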
NVIDIA Triton Inference Server Security
Triton is a production-grade inference server that supports multiple model frameworks and advanced deployment features.
Multi-Model Security
Triton serves multiple models simultaneously through a model repository. Multi-model serving raises three security considerations: per-model access control (not all users should have access to all models), resource isolation between models (one model's load should not affect another's performance), and model versioning security (ensuring the correct model version is served and unauthorized version changes are detected).
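Per-model access control is typically enforced in the gateway in front of Triton; a deny-by-default ACL can be as simple as the following sketch (identity and model names are hypothetical):

```python
# Hypothetical mapping from caller identity (API key, service account)
# to the set of models it may invoke.
MODEL_ACL = {
    "svc-chat": {"llama-chat", "embedder"},
    "svc-search": {"embedder"},
}

def can_invoke(identity: str, model_name: str) -> bool:
    """Deny by default: unknown identities and unlisted models are rejected."""
    return model_name in MODEL_ACL.get(identity, set())
```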
Ensemble Model Security
Triton supports ensemble models where multiple models are chained together. The data flowing between models in an ensemble represents an internal attack surface. If one model in the ensemble is compromised, it can pass manipulated data to downstream models.
Validate data at ensemble boundaries. Implement type checking and range validation for data passed between ensemble stages. Monitor ensemble execution for anomalous data patterns.
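A boundary check between stages can be sketched as below. To stay dependency-free the example validates a plain list of floats against a declared contract; a real ensemble would check tensor dtypes and shapes as well:

```python
def check_stage_output(values: list[float], expected_len: int,
                       lo: float, hi: float) -> None:
    """Raise ValueError if a stage's output violates its declared contract
    (element count and per-element range)."""
    if len(values) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(values)}")
    for v in values:
        if not (lo <= v <= hi):
            raise ValueError(f"value {v} outside [{lo}, {hi}]")
```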
gRPC and HTTP Security
Triton exposes both gRPC and HTTP endpoints. Configure TLS for both protocols. Implement authentication at the reverse proxy layer. Use gRPC interceptors for request validation and logging.
Model Repository Security
Triton loads models from a model repository (local directory or cloud storage). Secure the model repository with strict access controls. Implement file integrity monitoring to detect unauthorized model changes. Use Triton's model control API with authentication to prevent unauthorized model loading or unloading.
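The file integrity monitoring piece reduces to taking a hash baseline of the repository and periodically diffing against it; the sketch below is a generic pattern, not a Triton feature:

```python
import hashlib
from pathlib import Path

def snapshot(repo: Path) -> dict[str, str]:
    """Hash every file in the model repository into a baseline manifest
    (relative path -> SHA-256 hex digest)."""
    return {
        str(p.relative_to(repo)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo.rglob("*")) if p.is_file()
    }

def detect_changes(baseline: dict[str, str],
                   current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified since the baseline."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(k for k in baseline.keys() & current.keys()
                           if baseline[k] != current[k]),
    }
```

Any non-empty diff outside a planned deployment window should page the on-call and block the affected model from loading.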
Cross-Framework Security Practices
Logging and Monitoring
Implement comprehensive logging for all model serving infrastructure. Log all API requests with timestamps, source identity, and request parameters. Log all responses with token counts and latency. Log model loading and unloading events. Log errors, timeouts, and unusual conditions.
Monitor these logs for anomalous patterns: unusual request volumes, requests from unexpected sources, abnormally large inputs or outputs, and error rate spikes.
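Error-rate spikes, for example, can be caught with a simple sliding window over recent request outcomes; the window size and threshold below are illustrative:

```python
from collections import deque

class ErrorRateMonitor:
    """Sliding window over the last `window` requests; alert when the
    error fraction exceeds `threshold`. Parameters are illustrative."""

    def __init__(self, window: int = 100, threshold: float = 0.2):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one request outcome; return True if the alert fires."""
        self.window.append(is_error)
        rate = sum(self.window) / len(self.window)
        return rate > self.threshold
```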
Network Security
Deploy model serving infrastructure in private networks. Use API gateways or reverse proxies as the single entry point. Implement network policies that restrict which services can communicate with the serving infrastructure. Block all outbound network access from the serving containers unless specifically required.
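In Kubernetes, the "single entry point, no default egress" posture maps onto a NetworkPolicy such as the following sketch; the label names and port are assumptions to adapt to your cluster:

```yaml
# Illustrative policy: only api-gateway pods may reach the serving pods,
# and the serving pods get no outbound traffic by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-lockdown
spec:
  podSelector:
    matchLabels:
      app: model-serving
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - port: 8000
  egress: []   # add explicit rules only for required destinations
```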
Update Management
Keep model serving frameworks updated. Security patches for vLLM, TGI, and Triton address vulnerabilities in request handling, model loading, memory management, and API processing. Establish a regular update cadence and test updates in staging before production deployment.
Incident Response
Prepare for serving infrastructure incidents. Define procedures for handling model compromise (unauthorized model changes), service abuse (DoS, unauthorized access), data leakage (model outputs containing sensitive data), and infrastructure compromise (container escape, GPU exploitation).
Ensure that incident response procedures include the ability to quickly roll back to a known-good model version, block specific API keys or source IPs, increase logging granularity for investigation, and isolate compromised infrastructure.
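Two of these capabilities, fast rollback and key blocking, argue for routing through a mutable indirection layer rather than baking a model path into the server. A minimal sketch (names hypothetical):

```python
class ModelRouter:
    """Route traffic through a mutable version pointer so rollback is a
    single switch, and abusive keys can be blocked without a redeploy."""

    def __init__(self, versions: dict[str, str], active: str):
        self.versions = versions          # version name -> model path
        self.active = active
        self.blocked_keys: set[str] = set()

    def rollback(self, known_good: str) -> None:
        """Point traffic back at a known-good version."""
        if known_good not in self.versions:
            raise ValueError(f"unknown version: {known_good}")
        self.active = known_good

    def admit(self, api_key: str) -> bool:
        """Reject requests from keys blocked during an incident."""
        return api_key not in self.blocked_keys
```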
Model serving security is the front door of your AI deployment. A hardened serving layer protects against unauthorized access, resource abuse, and infrastructure compromise. Invest in serving security proportional to the sensitivity and exposure of the models being served.