Security Comparison of Model Serving Frameworks
In-depth security analysis of TorchServe, TensorFlow Serving, Triton Inference Server, and vLLM for production AI deployments
Overview
Model serving frameworks are the critical last mile between trained AI models and production applications. They handle model loading, request routing, batching, hardware acceleration, and scaling — and they represent one of the most exposed attack surfaces in AI infrastructure. Unlike internal training pipelines, serving endpoints typically face network traffic from external clients, internal microservices, or both, making their security posture directly relevant to organizational risk.
The four dominant open-source serving frameworks — PyTorch's TorchServe, TensorFlow Serving, NVIDIA's Triton Inference Server, and the rapidly adopted vLLM for large language models — each make different architectural choices that produce distinct security profiles. TorchServe exposes a management API alongside its inference API, creating administrative attack surface. TensorFlow Serving relies on gRPC and REST with minimal built-in authentication. Triton provides a feature-rich platform with model repository access, shared memory regions, and dynamic model loading that expand its attack surface. vLLM, optimized for LLM inference with continuous batching and PagedAttention, introduces prompt-handling complexity and often runs with elevated GPU access.
This article provides a systematic security comparison grounded in real CVEs, documented vulnerabilities, and common misconfiguration patterns. The goal is to equip red teamers with framework-specific knowledge to efficiently assess model serving deployments, and to give defenders actionable hardening guidance for each platform.
TorchServe Security Analysis
Architecture and Attack Surface
TorchServe exposes three network interfaces by default: an inference API (port 8080), a management API (port 8081), and a metrics API (port 8082). The management API is the most security-critical because it allows registering new models, scaling workers, and modifying server configuration. In default installations prior to version 0.8.2, the management API bound to 0.0.0.0, making it accessible from any network interface.
The model archive format (.mar) used by TorchServe is a ZIP file containing model weights, a handler Python script, and metadata. When a model is registered, TorchServe extracts and executes the handler script, which runs arbitrary Python code. This design means that model registration is equivalent to remote code execution by design — the security boundary must be at the management API access control layer.
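Because a .mar is just a ZIP, the registration-equals-RCE property is easy to see by assembling one by hand. The sketch below packs an arbitrary handler script into an archive; the manifest fields and file layout are approximations for illustration, not copied from the TorchServe source.

```python
# Illustrative sketch: a .mar is a ZIP bundling a manifest and a Python
# handler that TorchServe imports and runs. Layout below is approximate.
import io
import json
import zipfile

def build_mar(handler_source: str, model_name: str = "demo") -> bytes:
    """Assemble a minimal .mar-style archive in memory."""
    manifest = {
        "runtime": "python",
        "model": {"modelName": model_name, "handler": "handler.py"},
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("MAR-INF/MANIFEST.json", json.dumps(manifest))
        # Whatever Python is placed here runs inside the serving worker
        zf.writestr("handler.py", handler_source)
    return buf.getvalue()

mar = build_mar("def handle(data, context):\n    return ['any code runs here']\n")
print(zipfile.ZipFile(io.BytesIO(mar)).namelist())
# -> ['MAR-INF/MANIFEST.json', 'handler.py']
```

Nothing in the archive is signed or sandboxed by default, which is why access control on the registration path is the entire security boundary.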
Critical Vulnerabilities (CVE Analysis)
CVE-2023-43654 (CVSS 9.8): Server-Side Request Forgery (SSRF) in model registration. The management API's POST /models endpoint accepted a url parameter for downloading model archives from remote locations. An attacker with access to the management API could supply an internal URL (such as http://169.254.169.254/latest/meta-data/ on AWS) to access cloud metadata services and steal instance credentials. Combined with the default 0.0.0.0 binding, this was exploitable from any network position.
CVE-2022-1471 (SnakeYAML): TorchServe used SnakeYAML for YAML parsing without restricting deserialization, allowing arbitrary Java object instantiation. While TorchServe is primarily Python, its Java-based frontend used SnakeYAML for configuration parsing.
"""
TorchServe security assessment script.
Tests for common misconfigurations and known vulnerability patterns.
"""
import requests
import socket
import json
from urllib.parse import urljoin
from typing import Optional
class TorchServeAuditor:
"""Security auditor for TorchServe deployments."""
def __init__(
self,
inference_url: str = "http://localhost:8080",
management_url: str = "http://localhost:8081",
metrics_url: str = "http://localhost:8082",
timeout: int = 10,
):
self.inference_url = inference_url
self.management_url = management_url
self.metrics_url = metrics_url
self.timeout = timeout
self.findings: list[dict] = []
def _add_finding(
self, severity: str, title: str, detail: str
) -> None:
self.findings.append({
"severity": severity,
"title": title,
"detail": detail,
})
def check_management_api_exposure(self) -> None:
"""Test if management API is accessible (should be restricted)."""
try:
resp = requests.get(
urljoin(self.management_url, "/models"),
timeout=self.timeout,
)
if resp.status_code == 200:
models = resp.json()
self._add_finding(
"CRITICAL",
"Management API accessible without authentication",
f"GET /models returned {len(models.get('models', []))} "
f"registered models. Management API allows model "
f"registration (RCE) and configuration changes.",
)
except requests.ConnectionError:
self._add_finding(
"INFO",
"Management API not reachable",
"Management API connection refused — may be properly "
"restricted or running on a different address.",
)
def check_ssrf_via_model_registration(self) -> None:
"""
Test for SSRF in model registration endpoint (CVE-2023-43654).
Uses a benign canary URL — does NOT exploit.
"""
try:
# Test with an external canary to detect outbound requests
# In a real assessment, use a Burp Collaborator or similar
resp = requests.post(
urljoin(self.management_url, "/models"),
params={
"url": "https://canary.example.com/test.mar",
"model_name": "security_test",
},
timeout=self.timeout,
)
if resp.status_code != 403:
self._add_finding(
"HIGH",
"Model registration endpoint accepts remote URLs",
f"POST /models with remote URL returned status "
f"{resp.status_code}. This may be vulnerable to SSRF "
f"(CVE-2023-43654). Verify URL allowlisting is enforced.",
)
except requests.ConnectionError:
pass # Management API not reachable
def check_model_listing(self) -> None:
"""Enumerate registered models for information disclosure."""
try:
resp = requests.get(
urljoin(self.management_url, "/models"),
timeout=self.timeout,
)
if resp.status_code == 200:
data = resp.json()
for model in data.get("models", []):
model_name = model.get("modelName", "unknown")
detail_resp = requests.get(
urljoin(
self.management_url,
f"/models/{model_name}",
),
timeout=self.timeout,
)
if detail_resp.status_code == 200:
detail = detail_resp.json()
self._add_finding(
"MEDIUM",
f"Model details exposed: {model_name}",
f"Model URL: {detail.get('modelUrl', 'N/A')}, "
f"Workers: {detail.get('workers', [])}, "
f"Batch size: {detail.get('batchSize', 'N/A')}",
)
except requests.ConnectionError:
pass
def check_metrics_exposure(self) -> None:
"""Check if metrics endpoint exposes sensitive information."""
try:
resp = requests.get(
urljoin(self.metrics_url, "/metrics"),
timeout=self.timeout,
)
if resp.status_code == 200:
metrics_text = resp.text
sensitive_patterns = [
"gpu_memory",
"model_name",
"handler_time",
"queue_time",
]
found = [
p for p in sensitive_patterns if p in metrics_text
]
if found:
self._add_finding(
"LOW",
"Metrics endpoint exposes operational details",
f"Found metrics containing: {', '.join(found)}. "
f"This reveals model names, GPU usage, and "
f"inference timing information.",
)
except requests.ConnectionError:
pass
def check_version_disclosure(self) -> None:
"""Check for version information disclosure."""
try:
resp = requests.get(
urljoin(self.inference_url, "/api-description"),
timeout=self.timeout,
)
if resp.status_code == 200:
self._add_finding(
"LOW",
"API description endpoint accessible",
f"API description reveals framework details: "
f"{resp.text[:200]}",
)
except requests.ConnectionError:
pass
def run_audit(self) -> list[dict]:
"""Run all audit checks and return findings."""
self.findings = []
self.check_management_api_exposure()
self.check_ssrf_via_model_registration()
self.check_model_listing()
self.check_metrics_exposure()
self.check_version_disclosure()
return self.findings
if __name__ == "__main__":
import sys
target = sys.argv[1] if len(sys.argv) > 1 else "http://localhost"
auditor = TorchServeAuditor(
inference_url=f"{target}:8080",
management_url=f"{target}:8081",
metrics_url=f"{target}:8082",
)
findings = auditor.run_audit()
for f in findings:
print(f"[{f['severity']}] {f['title']}")
print(f" {f['detail']}\n")
TensorFlow Serving Security Analysis
Architecture and Attack Surface
TensorFlow Serving exposes gRPC (port 8500) and REST (port 8501) interfaces for inference. Unlike TorchServe, it does not have a separate management API — model management is handled through the model configuration file (models.config) and filesystem-based model discovery. This reduces the administrative attack surface but shifts risk to the model storage layer.
TensorFlow Serving loads models from a configurable model base path, which can be a local directory, an NFS mount, Google Cloud Storage (GCS) bucket, or Amazon S3. The framework polls this path periodically for new model versions, automatically loading them. This auto-loading behavior means that an attacker who can write to the model storage location can achieve code execution without any API interaction.
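The version-discovery behavior can be sketched in a few lines: TF Serving treats each numeric subdirectory under the model base path as a servable version and serves the highest one by default. The toy reimplementation below (not TF Serving's actual code) shows why write access to that path is equivalent to controlling the next model load.

```python
# Toy reimplementation of TF Serving's filesystem version discovery:
# each numeric subdirectory of the base path is a model version, and
# the highest number wins. Whoever can create "3/" controls what loads.
import tempfile
from pathlib import Path

def latest_servable_version(base: Path):
    """Return the highest numeric version directory, or None."""
    versions = [int(p.name) for p in base.iterdir()
                if p.is_dir() and p.name.isdigit()]
    return max(versions, default=None)

with tempfile.TemporaryDirectory() as d:
    base = Path(d) / "my_model"
    for v in ("1", "2"):
        (base / v).mkdir(parents=True)
    print(latest_servable_version(base))  # 2

    # An attacker with write access simply drops a higher version:
    (base / "3").mkdir()
    print(latest_servable_version(base))  # 3 -- picked up on the next poll
```

No API call, authentication event, or log entry distinguishes this from a legitimate model promotion, which is what makes the storage layer the real control point.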
SavedModel Format Risks
TensorFlow's SavedModel format can contain arbitrary Python code through tf.py_function operations and custom ops. A malicious SavedModel placed in the model repository will execute attacker code when loaded by TensorFlow Serving. This is not a bug — it is a fundamental property of the SavedModel format that includes a computation graph that can invoke arbitrary operations.
"""
Demonstrate security risks in TensorFlow Serving model loading.
This creates a SavedModel with embedded computation that executes
during model load/inference — illustrating the supply chain risk.
"""
import tensorflow as tf
import numpy as np
import os
def create_benign_model_with_audit_hook(export_path: str) -> None:
"""
Create a SavedModel that logs inference requests to a file.
This demonstrates how a model can perform actions beyond inference.
In a malicious scenario, this could exfiltrate data.
"""
class AuditedModel(tf.Module):
def __init__(self):
super().__init__()
self.dense_weights = tf.Variable(
tf.random.normal([784, 10]), name="weights"
)
self.bias = tf.Variable(tf.zeros([10]), name="bias")
@tf.function(input_signature=[
tf.TensorSpec(shape=[None, 784], dtype=tf.float32)
])
def predict(self, x):
# Normal inference computation
logits = tf.matmul(x, self.dense_weights) + self.bias
predictions = tf.nn.softmax(logits)
# Audit hook: log input statistics
# In a malicious model, this could write to a network socket
# or encode data in timing side channels
input_mean = tf.reduce_mean(x)
input_std = tf.math.reduce_std(x)
log_line = tf.strings.format(
"Input stats: mean={}, std={}", (input_mean, input_std)
)
tf.print(log_line) # Goes to TF Serving stdout/stderr
return predictions
model = AuditedModel()
tf.saved_model.save(
model,
export_path,
signatures={"serving_default": model.predict},
)
print(f"Model saved to {export_path}")
def audit_savedmodel_for_dangerous_ops(model_path: str) -> list[str]:
"""
Scan a SavedModel for potentially dangerous operations.
These operations can execute arbitrary code or access the filesystem.
"""
dangerous_ops = {
    "PyFunc": "Arbitrary Python code execution",
    "EagerPyFunc": "Arbitrary Python code execution (eager mode)",
    "PyFuncStateless": "Arbitrary Python code execution (stateless)",
    "ReadFile": "Filesystem read access",
    "WriteFile": "Filesystem write access",
    "MatchingFiles": "Filesystem enumeration",
}
findings = []
try:
loaded = tf.saved_model.load(model_path)
for func_name in dir(loaded):
func = getattr(loaded, func_name, None)
if hasattr(func, "concrete_functions"):
for cf in func.concrete_functions:
for node in cf.graph.as_graph_def().node:
if node.op in dangerous_ops:
findings.append(
f"Found {node.op} ({dangerous_ops[node.op]}) "
f"in function {func_name}, node {node.name}"
)
except Exception as e:
findings.append(f"Error loading model: {e}")
return findings
Triton Inference Server Security Analysis
Architecture and Attack Surface
NVIDIA Triton Inference Server is the most feature-rich of the four frameworks, supporting multiple model formats (TensorFlow, PyTorch, ONNX, TensorRT, Python backend), dynamic model loading, model ensembles, shared memory, and custom backends. This breadth of functionality creates a correspondingly large attack surface.
Key attack surface components:
- HTTP/gRPC inference endpoints (ports 8000/8001): Standard inference APIs with health checks and model metadata.
- Metrics endpoint (port 8002): Prometheus metrics with detailed operational data.
- Model repository: Filesystem, S3, GCS, or Azure Blob Storage. Triton polls for changes and auto-loads new models.
- Shared memory regions: CUDA shared memory and system shared memory for zero-copy inference, creating inter-process communication channels.
- Python backend: Executes arbitrary Python code as model handlers, similar to TorchServe's approach.
- Model ensembles: Chain multiple models together, with output of one feeding input of another. A compromised model in an ensemble can manipulate downstream models.
Shared Memory Attack Vectors
Triton's shared memory feature allows clients to register system or CUDA shared memory regions for zero-copy data transfer. This is a performance optimization that introduces security risks:
"""
Triton Inference Server shared memory security assessment.
Tests for shared memory region manipulation vulnerabilities.
"""
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
import numpy as np
from typing import Optional
class TritonSharedMemoryAuditor:
"""Audit Triton's shared memory interface for security issues."""
def __init__(self, url: str = "localhost:8000"):
self.client = httpclient.InferenceServerClient(url=url)
self.findings: list[dict] = []
def enumerate_shared_memory_regions(self) -> list[dict]:
"""
List all registered shared memory regions.
Information disclosure: reveals memory layout and sizes.
"""
try:
# System shared memory
sys_regions = self.client.get_system_shared_memory_status()
for region in sys_regions:
self.findings.append({
"severity": "MEDIUM",
"title": f"System shared memory region: {region['name']}",
"detail": (
f"Key: {region.get('key', 'N/A')}, "
f"Offset: {region.get('offset', 0)}, "
f"Size: {region.get('byte_size', 0)} bytes"
),
})
# CUDA shared memory
cuda_regions = self.client.get_cuda_shared_memory_status()
for region in cuda_regions:
self.findings.append({
"severity": "MEDIUM",
"title": f"CUDA shared memory region: {region['name']}",
"detail": (
f"Device ID: {region.get('device_id', 'N/A')}, "
f"Size: {region.get('byte_size', 0)} bytes"
),
})
return sys_regions + cuda_regions
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Shared memory enumeration failed",
"detail": str(e),
})
return []
def test_model_repository_access(self) -> None:
"""
Test model repository API for unauthorized access.
These endpoints allow loading/unloading models dynamically.
"""
try:
# List all models in repository (including unloaded)
repo_index = self.client.get_model_repository_index()
for model in repo_index:
status = model.get("state", "UNKNOWN")
self.findings.append({
"severity": "LOW" if status == "READY" else "MEDIUM",
"title": f"Repository model: {model['name']}",
"detail": (
f"State: {status}, "
f"Reason: {model.get('reason', 'N/A')}"
),
})
# Test if model loading is enabled (very high risk)
# This is controlled by --model-control-mode flag
try:
self.client.load_model("nonexistent_test_model")
except Exception as load_err:
error_msg = str(load_err)
if "model control is disabled" in error_msg.lower():
self.findings.append({
"severity": "INFO",
"title": "Model control mode: NONE (safe)",
"detail": "Dynamic model loading is disabled.",
})
elif "not found" in error_msg.lower():
self.findings.append({
"severity": "HIGH",
"title": "Dynamic model loading is ENABLED",
"detail": (
"Model loading API is active. An attacker who "
"can write to the model repository can load "
"malicious models via API."
),
})
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Repository access test failed",
"detail": str(e),
})
def check_model_metadata_disclosure(self) -> None:
"""Check all loaded models for metadata information disclosure."""
try:
server_meta = self.client.get_server_metadata()
self.findings.append({
"severity": "LOW",
"title": "Server metadata accessible",
"detail": (
f"Name: {server_meta.get('name', 'N/A')}, "
f"Version: {server_meta.get('version', 'N/A')}, "
f"Extensions: {server_meta.get('extensions', [])}"
),
})
# Check each loaded model
repo_index = self.client.get_model_repository_index()
for model in repo_index:
if model.get("state") == "READY":
try:
meta = self.client.get_model_metadata(model["name"])
config = self.client.get_model_config(model["name"])
self.findings.append({
"severity": "LOW",
"title": f"Model config exposed: {model['name']}",
"detail": (
f"Platform: {meta.get('platform', 'N/A')}, "
f"Inputs: {meta.get('inputs', [])}, "
f"Outputs: {meta.get('outputs', [])}, "
f"Backend: {config.get('backend', 'N/A')}"
),
})
except Exception:
pass
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Metadata check failed",
"detail": str(e),
})
def run_audit(self) -> list[dict]:
"""Execute all audit checks."""
self.findings = []
self.enumerate_shared_memory_regions()
self.test_model_repository_access()
self.check_model_metadata_disclosure()
return self.findings
vLLM Security Analysis
Architecture and Attack Surface
vLLM is purpose-built for high-throughput LLM inference using PagedAttention for efficient KV-cache management. Its architecture differs from general-purpose serving frameworks in several security-relevant ways:
- Prompt processing pipeline: vLLM processes variable-length text prompts that can be crafted to exploit tokenizer vulnerabilities, trigger excessive memory allocation, or cause denial of service through adversarial prompt lengths.
- KV-cache as shared resource: PagedAttention manages the KV-cache as a shared memory pool across requests. This sharing is the source of vLLM's performance advantage but creates potential information leakage between requests.
- OpenAI-compatible API: vLLM's API server implements an OpenAI-compatible REST interface, which means clients may send structured prompts with system/user/assistant roles that the server must parse and validate.
- Tensor parallelism: Multi-GPU inference splits model layers across GPUs using NCCL, introducing inter-GPU communication channels that could leak information.
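To make the KV-cache sharing risk concrete, the toy allocator below mimics hash-based prefix caching: cache blocks are keyed by the token prefix, so a second request that shares a prefix with an earlier one gets cache hits, observable as lower latency. This is a simplified model for illustration, not vLLM's actual implementation.

```python
# Toy model of prefix caching in a paged KV-cache (NOT vLLM's real code):
# blocks are keyed by a hash of the token prefix, so identical prefixes
# from different requests reuse the same cached block. A cache hit skips
# recomputation -- observable as lower latency, a cross-tenant side channel.
from hashlib import sha256

class ToyPrefixCache:
    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks: dict[str, list[int]] = {}  # prefix-hash -> token block

    def allocate(self, tokens: list[int]) -> tuple[int, int]:
        """Return (cache_hits, new_blocks) for a request's prompt tokens."""
        hits = misses = 0
        usable = len(tokens) - len(tokens) % self.block_size
        for i in range(0, usable, self.block_size):
            key = sha256(bytes(tokens[: i + self.block_size])).hexdigest()
            if key in self.blocks:
                hits += 1       # reused block -> faster, leaks prefix reuse
            else:
                self.blocks[key] = tokens[i : i + self.block_size]
                misses += 1
        return hits, misses

cache = ToyPrefixCache()
print(cache.allocate([1, 2, 3, 4, 5, 6, 7, 8]))  # (0, 2) first tenant: all misses
print(cache.allocate([1, 2, 3, 4, 9, 9, 9, 9]))  # (1, 1) shared 4-token prefix hits
```

An attacker who can measure time-to-first-token can therefore probe whether another tenant recently submitted a given system prompt or document prefix.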
"""
vLLM security assessment focusing on prompt-based attacks
and resource exhaustion.
"""
import requests
import time
import json
import concurrent.futures
from typing import Optional
class VLLMAuditor:
"""Security auditor for vLLM deployments."""
def __init__(self, base_url: str = "http://localhost:8000"):
self.base_url = base_url
self.findings: list[dict] = []
def check_prompt_length_limits(self) -> None:
"""
Test if the server enforces prompt length limits.
Excessively long prompts can cause OOM or extreme latency.
"""
# Test progressively longer prompts
test_lengths = [1000, 10000, 50000, 100000]
for length in test_lengths:
prompt = "A " * length # Simple repeated token
try:
start = time.time()
resp = requests.post(
f"{self.base_url}/v1/completions",
json={
"model": "default",
"prompt": prompt,
"max_tokens": 1,
},
timeout=30,
)
elapsed = time.time() - start
if resp.status_code == 200:
self.findings.append({
"severity": "MEDIUM" if length > 10000 else "LOW",
"title": f"Accepted prompt of {length} tokens",
"detail": (
f"Server processed {length}-token prompt in "
f"{elapsed:.2f}s. Large prompts without limits "
f"enable resource exhaustion."
),
})
elif resp.status_code == 400:
self.findings.append({
"severity": "INFO",
"title": f"Prompt length {length} rejected",
"detail": f"Server properly rejected: {resp.text[:200]}",
})
break # Found the limit
except requests.Timeout:
self.findings.append({
"severity": "HIGH",
"title": f"Timeout at prompt length {length}",
"detail": (
"Server timed out processing long prompt. "
"This indicates missing prompt length validation "
"and potential DoS vulnerability."
),
})
break
except requests.ConnectionError:
self.findings.append({
"severity": "CRITICAL",
"title": f"Connection lost at prompt length {length}",
"detail": "Server became unreachable — possible OOM crash.",
})
break
def check_concurrent_request_limits(self) -> None:
"""Test for rate limiting and concurrent request handling."""
num_concurrent = 50
prompt = "What is the capital of France?"
def send_request() -> tuple[int, float]:
start = time.time()
try:
resp = requests.post(
f"{self.base_url}/v1/completions",
json={
"model": "default",
"prompt": prompt,
"max_tokens": 10,
},
timeout=60,
)
return resp.status_code, time.time() - start
except Exception:
return 0, time.time() - start
with concurrent.futures.ThreadPoolExecutor(
max_workers=num_concurrent
) as executor:
futures = [
executor.submit(send_request)
for _ in range(num_concurrent)
]
results = [f.result() for f in futures]
success = sum(1 for code, _ in results if code == 200)
rate_limited = sum(1 for code, _ in results if code == 429)
errors = sum(1 for code, _ in results if code not in (200, 429))
if rate_limited == 0 and success == num_concurrent:
self.findings.append({
"severity": "MEDIUM",
"title": "No rate limiting detected",
"detail": (
f"All {num_concurrent} concurrent requests succeeded. "
f"No rate limiting or request queuing observed."
),
})
elif rate_limited > 0:
self.findings.append({
"severity": "INFO",
"title": "Rate limiting active",
"detail": (
f"{rate_limited}/{num_concurrent} requests rate-limited."
),
})
def check_model_info_disclosure(self) -> None:
"""Check for model information disclosure via API."""
endpoints = [
"/v1/models",
"/health",
"/version",
"/metrics",
]
for endpoint in endpoints:
try:
resp = requests.get(
f"{self.base_url}{endpoint}",
timeout=10,
)
if resp.status_code == 200:
self.findings.append({
"severity": "LOW",
"title": f"Endpoint accessible: {endpoint}",
"detail": f"Response: {resp.text[:300]}",
})
except requests.ConnectionError:
pass
def run_audit(self) -> list[dict]:
"""Run all vLLM-specific audit checks."""
self.findings = []
self.check_model_info_disclosure()
self.check_prompt_length_limits()
self.check_concurrent_request_limits()
return self.findings
Comparative Security Matrix
The following table summarizes key security properties across the four frameworks:
| Security Property | TorchServe | TF Serving | Triton | vLLM |
|---|---|---|---|---|
| Built-in Authentication | None | None | None | None |
| Built-in TLS | Config option | Config option | Config option | Config option |
| Management API | Separate port (8081) | None (filesystem) | Model control API | None |
| Model Format Risk | .mar (ZIP + Python) | SavedModel (TF ops) | Multiple formats | HuggingFace/safetensors |
| Dynamic Model Loading | Yes (via API) | Yes (filesystem poll) | Yes (API or poll) | Limited |
| Shared Memory | No | No | Yes (system + CUDA) | Internal only |
| Default Network Binding | 0.0.0.0 (pre-0.8.2) | 0.0.0.0 | 0.0.0.0 | 0.0.0.0 |
| Metrics Endpoint | Port 8082 | None by default | Port 8002 | /metrics |
| Notable CVEs | CVE-2023-43654 | CVE-2021-37678 | CVE-2023-31036 | Emerging (newer project) |
Practical Examples
Unified Framework Scanner
#!/usr/bin/env bash
# Quick reconnaissance script to identify which model serving framework
# is running on a target host and gather initial security-relevant info.
TARGET="${1:?Usage: $0 <target_host>}"
echo "=== Model Serving Framework Detection ==="
echo "Target: $TARGET"
echo ""
# TorchServe detection
echo "--- TorchServe (ports 8080-8082) ---"
curl -s --connect-timeout 3 "http://${TARGET}:8080/ping" && \
echo " [+] TorchServe inference API detected"
curl -s --connect-timeout 3 "http://${TARGET}:8081/models" && \
echo " [+] TorchServe management API EXPOSED"
curl -s --connect-timeout 3 "http://${TARGET}:8082/metrics" | head -5
# TF Serving detection
echo ""
echo "--- TensorFlow Serving (ports 8500-8501) ---"
curl -s --connect-timeout 3 "http://${TARGET}:8501/v1/models" && \
echo " [+] TF Serving REST API detected"
# Triton detection
echo ""
echo "--- Triton Inference Server (ports 8000-8002) ---"
TRITON_META=$(curl -s --connect-timeout 3 "http://${TARGET}:8000/v2")
if echo "$TRITON_META" | grep -q "triton"; then
echo " [+] Triton detected: $TRITON_META"
echo " [*] Model repository:"
curl -s "http://${TARGET}:8000/v2/repository/index" | python3 -m json.tool 2>/dev/null
fi
# vLLM detection
echo ""
echo "--- vLLM (port 8000, OpenAI-compatible) ---"
VLLM_MODELS=$(curl -s --connect-timeout 3 "http://${TARGET}:8000/v1/models")
if echo "$VLLM_MODELS" | grep -q '"object"'; then
echo " [+] vLLM / OpenAI-compatible API detected"
echo "$VLLM_MODELS" | python3 -m json.tool 2>/dev/null
fi
echo ""
echo "=== Scan Complete ==="
Defense and Mitigation
Network segmentation is the single most effective defense for model serving frameworks. None of the four frameworks provide built-in authentication or authorization that would be considered production-grade. The management/control interfaces must be isolated to administrative networks:
- TorchServe: Bind the management API to 127.0.0.1 or internal-only interfaces. Use management_address in config.properties.
- TF Serving: Restrict filesystem access to the model repository. Use read-only mounts.
- Triton: Set --model-control-mode=none to disable dynamic loading. Restrict metrics and repository endpoints via network policy.
- vLLM: Deploy behind an API gateway with authentication, rate limiting, and prompt validation.
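For TorchServe specifically, these bindings translate into config.properties entries like the following. Values are illustrative; allowed_urls is the regex allowlist TorchServe added to constrain remote model sources after CVE-2023-43654.

```properties
inference_address=http://0.0.0.0:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
# Regex allowlist for model archive sources (blocks SSRF-style fetches)
allowed_urls=file:///opt/model-store/.*
```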
Model integrity verification should be implemented at the storage layer. Use signed model artifacts, verify checksums before loading, and restrict write access to model repositories. For TorchServe, validate .mar files against a known-good signature before registration.
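A minimal sketch of the checksum half of that control follows; hash pinning alone is shown here, and a production pipeline would layer cryptographic signatures (e.g. Sigstore or GPG) on top.

```python
# Minimal artifact pinning: refuse to register a model unless its SHA-256
# matches a digest recorded out-of-band at export time.
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Stream-hash a file so large model archives don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Gate model registration on a pinned digest."""
    return sha256_file(path) == expected_sha256.lower()
```

The verification call belongs in the deployment pipeline, before the artifact ever reaches the model repository that the serving framework polls.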
Input validation is critical for all frameworks but especially vLLM and other LLM-serving systems. Implement prompt length limits, request rate limits, and content filtering before requests reach the inference engine.
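Such a gate can start as simple as the sketch below; the thresholds are illustrative placeholders, since real limits depend on the model's context window and capacity planning.

```python
# Sketch of a pre-inference request gate placed in front of the serving
# endpoint. Thresholds are illustrative, not recommended values.
MAX_PROMPT_CHARS = 32_000   # coarse pre-tokenization bound
MAX_OUTPUT_TOKENS = 1_024

def validate_request(prompt: str, max_tokens: int) -> list[str]:
    """Return a list of violations; an empty list means the request may pass."""
    violations = []
    if len(prompt) > MAX_PROMPT_CHARS:
        violations.append(f"prompt exceeds {MAX_PROMPT_CHARS} chars")
    if max_tokens > MAX_OUTPUT_TOKENS:
        violations.append(f"max_tokens exceeds {MAX_OUTPUT_TOKENS}")
    if "\x00" in prompt:
        violations.append("NUL byte in prompt")
    return violations

print(validate_request("What is the capital of France?", 10))  # []
```

Rejecting oversized requests here, before tokenization and KV-cache allocation, is what turns the resource-exhaustion findings from the auditor above into a non-issue.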
Resource limits via container cgroups, Kubernetes resource quotas, and GPU memory limits prevent denial-of-service through resource exhaustion. Set explicit max_batch_size, max_sequence_length, and concurrent request limits in framework configurations.
TLS termination should happen at the load balancer or service mesh level rather than relying on each framework's built-in TLS support, which varies in configuration quality and cipher suite selection.
Dependency scanning: All four frameworks depend on complex software stacks (PyTorch, TensorFlow, ONNX Runtime, CUDA drivers, Python packages). Regularly scan these dependencies for known vulnerabilities. Container images used for model serving should be rebuilt and scanned with each security update. Use minimal base images (distroless or scratch-based) to reduce the attack surface of the serving container.
Model format migration: Where possible, migrate from unsafe model formats (pickle-based checkpoints, SavedModels with custom ops) to safer alternatives (safetensors, ONNX without custom ops). This eliminates the most critical vulnerability class across all frameworks: arbitrary code execution during model loading. TorchServe's .mar format is inherently unsafe because it bundles Python code, so consider alternatives such as Triton with ONNX or TensorRT backends for the highest-security deployments.
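The code-execution gap between formats is easy to demonstrate with the standard library alone; a benign print() stands in for attacker code.

```python
# Why pickle-family checkpoint formats are unsafe: unpickling invokes
# __reduce__, which can name any callable. A benign print() stands in
# for attacker code here.
import pickle

class Payload:
    def __reduce__(self):
        # (callable, args) is executed at load time by pickle.loads
        return (print, ("code executed during deserialization",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # triggers the embedded call, returns None
# Data-only formats (e.g. safetensors: raw tensor bytes plus a JSON
# header) have no equivalent code path, so loading cannot be hijacked.
```

This is the same mechanism behind malicious PyTorch .pt/.pth checkpoints, which are ZIP archives containing pickled objects.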
Incident response preparation: Because model serving frameworks are actively maintained open-source projects, new vulnerabilities are regularly discovered. Establish a process for monitoring security advisories for each framework in use, testing patches in staging, and deploying updates rapidly. Maintain the ability to quickly switch between framework versions or temporarily disable vulnerable features (such as dynamic model loading) in response to zero-day disclosures.
References
- Oligo Security. (2023). "ShellTorch: Multiple Critical Vulnerabilities in TorchServe." https://www.oligo.security/blog/shelltorch-torchserve-ssrf-vulnerability-cve-2023-43654
- NVIDIA. (2024). "Triton Inference Server Security Bulletin." CVE-2023-31036. https://nvidia.custhelp.com/app/answers/detail/a_id/5510
- OWASP. (2025). "OWASP Machine Learning Security Top 10." https://owasp.org/www-project-machine-learning-security-top-10/
- MITRE ATLAS. "Case Study: Attacking ML Model Serving Infrastructure." https://atlas.mitre.org/