Security Comparison of Model Serving Frameworks
In-depth security analysis of TorchServe, TensorFlow Serving, Triton Inference Server, and vLLM for production AI deployments
Overview
Model serving frameworks are the critical last mile between trained AI models and production applications. They handle model loading, request routing, batching, hardware acceleration, and scaling — and they represent one of the most exposed attack surfaces in AI infrastructure. Unlike internal training pipelines, serving endpoints typically face network traffic from external clients, internal microservices, or both, making their security posture directly relevant to organizational risk.
The four dominant open-source serving frameworks — PyTorch's TorchServe, TensorFlow Serving, NVIDIA's Triton Inference Server, and the rapidly adopted vLLM for large language models — each make different architectural choices that produce distinct security profiles. TorchServe exposes a management API alongside its inference API, creating administrative attack surface. TensorFlow Serving relies on gRPC and REST with minimal built-in authentication. Triton provides a feature-rich platform with model repository access, shared memory regions, and dynamic model loading that expand its attack surface. vLLM, optimized for LLM inference with continuous batching and PagedAttention, introduces prompt-handling complexity and often runs with elevated GPU access.
This article provides a systematic security comparison grounded in real CVEs, documented vulnerabilities, and common misconfiguration patterns. The goal is to equip red teamers with framework-specific knowledge to efficiently assess model serving deployments, and to give defenders actionable hardening guidance for each platform.
TorchServe Security Analysis
Architecture and Attack Surface
TorchServe exposes three network interfaces by default: an inference API (port 8080), a management API (port 8081), and a metrics API (port 8082). The management API is the most security-critical because it allows registering new models, scaling workers, and modifying server configuration. In default installations prior to version 0.8.2, the management API bound to 0.0.0.0, making it accessible from any network interface.
The model archive format (.mar) used by TorchServe is a ZIP file containing model weights, a handler Python script, and metadata. When a model is registered, TorchServe extracts and executes the handler script, which runs arbitrary Python code. This design means that model registration is equivalent to remote code execution by design — the security boundary must be at the management API access control layer.
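Because a .mar is just a ZIP, the registration-equals-RCE property is easy to see by assembling one by hand. The sketch below packs an arbitrary handler script into an archive; the manifest fields and file layout are approximations for illustration, not copied from the TorchServe source.

```python
# Illustrative sketch: a .mar is a ZIP bundling a manifest and a Python
# handler that TorchServe imports and runs. Layout below is approximate.
import io
import json
import zipfile

def build_mar(handler_source: str, model_name: str = "demo") -> bytes:
    """Assemble a minimal .mar-style archive in memory."""
    manifest = {
        "runtime": "python",
        "model": {"modelName": model_name, "handler": "handler.py"},
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("MAR-INF/MANIFEST.json", json.dumps(manifest))
        # Whatever Python is placed here runs inside the serving worker
        zf.writestr("handler.py", handler_source)
    return buf.getvalue()

mar = build_mar("def handle(data, context):\n    return ['any code runs here']\n")
print(zipfile.ZipFile(io.BytesIO(mar)).namelist())
# -> ['MAR-INF/MANIFEST.json', 'handler.py']
```

Nothing in the archive is signed or sandboxed by default, which is why access control on the registration path is the entire security boundary.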
Critical Vulnerabilities (CVE Analysis)
CVE-2023-43654 (CVSS 9.8): Server-Side Request Forgery (SSRF) in model registration. The management API's POST /models endpoint accepted a url parameter for downloading model archives from remote locations. An attacker with access to the management API could supply an internal URL (such as http://169.254.169.254/latest/meta-data/ on AWS) to access cloud metadata services and steal instance credentials. Combined with the default 0.0.0.0 binding, this was exploitable from any network position.
CVE-2022-1471 (SnakeYAML): TorchServe used SnakeYAML for YAML parsing without restricting deserialization, allowing arbitrary Java object instantiation. While TorchServe is primarily Python, its Java-based frontend used SnakeYAML for configuration parsing.
"""
TorchServe security assessment script.
Tests for common misconfigurations and known vulnerability patterns.
"""
import requests
import socket
import json
from urllib.parse import urljoin
from typing import Optional
class TorchServeAuditor:
"""Security auditor for TorchServe deployments."""
def __init__(
self,
inference_url: str = "http://localhost:8080",
management_url: str = "http://localhost:8081",
metrics_url: str = "http://localhost:8082",
timeout: int = 10,
):
self.inference_url = inference_url
self.management_url = management_url
self.metrics_url = metrics_url
self.timeout = timeout
self.findings: list[dict] = []
def _add_finding(
self, severity: str, title: str, detail: str
) -> None:
self.findings.append({
"severity": severity,
"title": title,
"detail": detail,
})
def check_management_api_exposure(self) -> None:
"""Test if management API is accessible (should be restricted)."""
try:
resp = requests.get(
urljoin(self.management_url, "/models"),
timeout=self.timeout,
)
if resp.status_code == 200:
models = resp.json()
self._add_finding(
"CRITICAL",
"Management API accessible without authentication",
f"GET /models returned {len(models.get('models', []))} "
f"registered models. Management API allows model "
f"registration (RCE) and configuration changes.",
)
except requests.ConnectionError:
self._add_finding(
"INFO",
"Management API not reachable",
"Management API connection refused — may be properly "
"restricted or running on a different address.",
)
def check_ssrf_via_model_registration(self) -> None:
"""
Test for SSRF in model registration endpoint (CVE-2023-43654).
Uses a benign canary URL — does NOT exploit.
"""
try:
# Test with an external canary to detect outbound requests
# In a real assessment, use a Burp Collaborator or similar
resp = requests.post(
urljoin(self.management_url, "/models"),
params={
"url": "https://canary.example.com/test.mar",
"model_name": "security_test",
},
timeout=self.timeout,
)
if resp.status_code != 403:
self._add_finding(
"HIGH",
"Model registration endpoint accepts remote URLs",
f"POST /models with remote URL returned status "
f"{resp.status_code}. This may be vulnerable to SSRF "
f"(CVE-2023-43654). Verify URL allowlisting is enforced.",
)
except requests.ConnectionError:
pass # Management API not reachable
def check_model_listing(self) -> None:
"""Enumerate registered models for information disclosure."""
try:
resp = requests.get(
urljoin(self.management_url, "/models"),
timeout=self.timeout,
)
if resp.status_code == 200:
data = resp.json()
for model in data.get("models", []):
model_name = model.get("modelName", "unknown")
detail_resp = requests.get(
urljoin(
self.management_url,
f"/models/{model_name}",
),
timeout=self.timeout,
)
if detail_resp.status_code == 200:
detail = detail_resp.json()
self._add_finding(
"MEDIUM",
f"Model details exposed: {model_name}",
f"Model URL: {detail.get('modelUrl', 'N/A')}, "
f"Workers: {detail.get('workers', [])}, "
f"Batch size: {detail.get('batchSize', 'N/A')}",
)
except requests.ConnectionError:
pass
def check_metrics_exposure(self) -> None:
"""Check if metrics endpoint exposes sensitive information."""
try:
resp = requests.get(
urljoin(self.metrics_url, "/metrics"),
timeout=self.timeout,
)
if resp.status_code == 200:
metrics_text = resp.text
sensitive_patterns = [
"gpu_memory",
"model_name",
"handler_time",
"queue_time",
]
found = [
p for p in sensitive_patterns if p in metrics_text
]
if found:
self._add_finding(
"LOW",
"Metrics endpoint exposes operational details",
f"Found metrics containing: {', '.join(found)}. "
f"This reveals model names, GPU usage, and "
f"inference timing information.",
)
except requests.ConnectionError:
pass
def check_version_disclosure(self) -> None:
"""Check for version information disclosure."""
try:
resp = requests.get(
urljoin(self.inference_url, "/api-description"),
timeout=self.timeout,
)
if resp.status_code == 200:
self._add_finding(
"LOW",
"API description endpoint accessible",
f"API description reveals framework details: "
f"{resp.text[:200]}",
)
except requests.ConnectionError:
pass
def run_audit(self) -> list[dict]:
"""Run all audit checks and return findings."""
self.findings = []
self.check_management_api_exposure()
self.check_ssrf_via_model_registration()
self.check_model_listing()
self.check_metrics_exposure()
self.check_version_disclosure()
return self.findings
if __name__ == "__main__":
import sys
target = sys.argv[1] if len(sys.argv) > 1 else "http://localhost"
auditor = TorchServeAuditor(
inference_url=f"{target}:8080",
management_url=f"{target}:8081",
metrics_url=f"{target}:8082",
)
findings = auditor.run_audit()
for f in findings:
print(f"[{f['severity']}] {f['title']}")
print(f" {f['detail']}\n")
TensorFlow Serving Security Analysis
Architecture and Attack Surface
TensorFlow Serving exposes gRPC (port 8500) and REST (port 8501) interfaces for inference. Unlike TorchServe, it does not have a separate management API — model management is handled through the model configuration file (models.config) and filesystem-based model discovery. This reduces the administrative attack surface but shifts risk to the model storage layer.
TensorFlow Serving loads models from a configurable model base path, which can be a local directory, an NFS mount, Google Cloud Storage (GCS) bucket, or Amazon S3. The framework polls this path periodically for new model versions, automatically loading them. This auto-loading behavior means that an attacker who can write to the model storage location can achieve code execution without any API interaction.
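The version-discovery behavior can be sketched in a few lines: TF Serving treats each numeric subdirectory under the model base path as a servable version and serves the highest one by default. The toy reimplementation below (not TF Serving's actual code) shows why write access to that path is equivalent to controlling the next model load.

```python
# Toy reimplementation of TF Serving's filesystem version discovery:
# each numeric subdirectory of the base path is a model version, and
# the highest number wins. Whoever can create "3/" controls what loads.
import tempfile
from pathlib import Path

def latest_servable_version(base: Path):
    """Return the highest numeric version directory, or None."""
    versions = [int(p.name) for p in base.iterdir()
                if p.is_dir() and p.name.isdigit()]
    return max(versions, default=None)

with tempfile.TemporaryDirectory() as d:
    base = Path(d) / "my_model"
    for v in ("1", "2"):
        (base / v).mkdir(parents=True)
    print(latest_servable_version(base))  # 2

    # An attacker with write access simply drops a higher version:
    (base / "3").mkdir()
    print(latest_servable_version(base))  # 3 -- picked up on the next poll
```

No API call, authentication event, or log entry distinguishes this from a legitimate model promotion, which is what makes the storage layer the real control point.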
SavedModel Format Risks
TensorFlow's SavedModel format can contain arbitrary Python code through tf.py_function operations and custom ops. A malicious SavedModel placed in the model repository will execute attacker code when loaded by TensorFlow Serving. This is not a bug — it is a fundamental property of the SavedModel format that includes a computation graph that can invoke arbitrary operations.
"""
Demonstrate security risks in TensorFlow Serving model loading.
This creates a SavedModel with embedded computation that executes
during model load/inference — illustrating the supply chain risk.
"""
import tensorflow as tf
import numpy as np
import os
def create_benign_model_with_audit_hook(export_path: str) -> None:
"""
Create a SavedModel that logs inference requests to a file.
This demonstrates how a model can perform actions beyond inference.
In a malicious scenario, this could exfiltrate data.
"""
class AuditedModel(tf.Module):
def __init__(self):
super().__init__()
self.dense_weights = tf.Variable(
tf.random.normal([784, 10]), name="weights"
)
self.bias = tf.Variable(tf.zeros([10]), name="bias")
@tf.function(input_signature=[
tf.TensorSpec(shape=[None, 784], dtype=tf.float32)
])
def predict(self, x):
# Normal inference computation
logits = tf.matmul(x, self.dense_weights) + self.bias
predictions = tf.nn.softmax(logits)
# Audit hook: log input statistics
# In a malicious model, this could write to a network socket
# or encode data in timing side channels
input_mean = tf.reduce_mean(x)
input_std = tf.math.reduce_std(x)
log_line = tf.strings.format(
"Input stats: mean={}, std={}", (input_mean, input_std)
)
tf.print(log_line) # Goes to TF Serving stdout/stderr
return predictions
model = AuditedModel()
tf.saved_model.save(
model,
export_path,
signatures={"serving_default": model.predict},
)
print(f"Model saved to {export_path}")
def audit_savedmodel_for_dangerous_ops(model_path: str) -> list[str]:
"""
Scan a SavedModel for potentially dangerous operations.
These operations can execute arbitrary code or access the filesystem.
"""
dangerous_ops = {
    "PyFunc": "Arbitrary Python code execution",
    "EagerPyFunc": "Arbitrary Python code execution (eager mode)",
    "PyFuncStateless": "Arbitrary Python code execution (stateless)",
    "ReadFile": "Filesystem read access",
    "WriteFile": "Filesystem write access",
    "MatchingFiles": "Filesystem enumeration",
}
findings = []
try:
loaded = tf.saved_model.load(model_path)
for func_name in dir(loaded):
func = getattr(loaded, func_name, None)
if hasattr(func, "concrete_functions"):
for cf in func.concrete_functions:
for node in cf.graph.as_graph_def().node:
if node.op in dangerous_ops:
findings.append(
f"Found {node.op} ({dangerous_ops[node.op]}) "
f"in function {func_name}, node {node.name}"
)
except Exception as e:
findings.append(f"Error loading model: {e}")
return findings
Triton Inference Server Security Analysis
Architecture and Attack Surface
NVIDIA Triton Inference Server is the most feature-rich of the four frameworks, supporting multiple model formats (TensorFlow, PyTorch, ONNX, TensorRT, Python backend), dynamic model loading, model ensembles, shared memory, and custom backends. This breadth of functionality creates a correspondingly large attack surface.
Key attack surface components:
- HTTP/gRPC inference endpoints (ports 8000/8001): Standard inference APIs with health checks and model metadata.
- Metrics endpoint (port 8002): Prometheus metrics with detailed operational data.
- Model repository: Filesystem, S3, GCS, or Azure Blob Storage. Triton polls for changes and auto-loads new models.
- Shared memory regions: CUDA shared memory and system shared memory for zero-copy inference, creating inter-process communication channels.
- Python backend: Executes arbitrary Python code as model handlers, similar to TorchServe's approach.
- Model ensembles: Chain multiple models together, with output of one feeding input of another. A compromised model in an ensemble can manipulate downstream models.
Shared Memory Attack Vectors
Triton's shared memory feature allows clients to register system or CUDA shared memory regions for zero-copy data transfer. This is a performance optimization that introduces security risks:
"""
Triton Inference Server shared memory security assessment.
Tests for shared memory region manipulation vulnerabilities.
"""
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
import numpy as np
from typing import Optional
class TritonSharedMemoryAuditor:
"""Audit Triton's shared memory interface for security issues."""
def __init__(self, url: str = "localhost:8000"):
self.client = httpclient.InferenceServerClient(url=url)
self.findings: list[dict] = []
def enumerate_shared_memory_regions(self) -> list[dict]:
"""
List all registered shared memory regions.
Information disclosure: reveals memory layout and sizes.
"""
try:
# System shared memory
sys_regions = self.client.get_system_shared_memory_status()
for region in sys_regions:
self.findings.append({
"severity": "MEDIUM",
"title": f"System shared memory region: {region['name']}",
"detail": (
f"Key: {region.get('key', 'N/A')}, "
f"Offset: {region.get('offset', 0)}, "
f"Size: {region.get('byte_size', 0)} bytes"
),
})
# CUDA shared memory
cuda_regions = self.client.get_cuda_shared_memory_status()
for region in cuda_regions:
self.findings.append({
"severity": "MEDIUM",
"title": f"CUDA shared memory region: {region['name']}",
"detail": (
f"Device ID: {region.get('device_id', 'N/A')}, "
f"Size: {region.get('byte_size', 0)} bytes"
),
})
return sys_regions + cuda_regions
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Shared memory enumeration failed",
"detail": str(e),
})
return []
def test_model_repository_access(self) -> None:
"""
Test model repository API for unauthorized access.
These endpoints allow loading/unloading models dynamically.
"""
try:
# List all models in repository (including unloaded)
repo_index = self.client.get_model_repository_index()
for model in repo_index:
status = model.get("state", "UNKNOWN")
self.findings.append({
"severity": "LOW" if status == "READY" else "MEDIUM",
"title": f"Repository model: {model['name']}",
"detail": (
f"State: {status}, "
f"Reason: {model.get('reason', 'N/A')}"
),
})
# Test if model loading is enabled (very high risk)
# This is controlled by --model-control-mode flag
try:
self.client.load_model("nonexistent_test_model")
except Exception as load_err:
error_msg = str(load_err)
if "model control is disabled" in error_msg.lower():
self.findings.append({
"severity": "INFO",
"title": "Model control mode: NONE (safe)",
"detail": "Dynamic model loading is disabled.",
})
elif "not found" in error_msg.lower():
self.findings.append({
"severity": "HIGH",
"title": "Dynamic model loading is ENABLED",
"detail": (
"Model loading API is active. An attacker who "
"can write to the model repository can load "
"malicious models via API."
),
})
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Repository access test failed",
"detail": str(e),
})
def check_model_metadata_disclosure(self) -> None:
"""Check all loaded models for metadata information disclosure."""
try:
server_meta = self.client.get_server_metadata()
self.findings.append({
"severity": "LOW",
"title": "Server metadata accessible",
"detail": (
f"Name: {server_meta.get('name', 'N/A')}, "
f"Version: {server_meta.get('version', 'N/A')}, "
f"Extensions: {server_meta.get('extensions', [])}"
),
})
# Check each loaded model
repo_index = self.client.get_model_repository_index()
for model in repo_index:
if model.get("state") == "READY":
try:
meta = self.client.get_model_metadata(model["name"])
config = self.client.get_model_config(model["name"])
self.findings.append({
"severity": "LOW",
"title": f"Model config exposed: {model['name']}",
"detail": (
f"Platform: {meta.get('platform', 'N/A')}, "
f"Inputs: {meta.get('inputs', [])}, "
f"Outputs: {meta.get('outputs', [])}, "
f"Backend: {config.get('backend', 'N/A')}"
),
})
except Exception:
pass
except Exception as e:
self.findings.append({
"severity": "INFO",
"title": "Metadata check failed",
"detail": str(e),
})
def run_audit(self) -> list[dict]:
"""Execute all audit checks."""
self.findings = []
self.enumerate_shared_memory_regions()
self.test_model_repository_access()
self.check_model_metadata_disclosure()
return self.findings
vLLM Security Analysis
Architecture and Attack Surface
vLLM is purpose-built for high-throughput LLM inference using PagedAttention for efficient KV-cache management. Its architecture differs from general-purpose serving frameworks in several security-relevant ways:
- Prompt processing pipeline: vLLM processes variable-length text prompts that can be crafted to exploit tokenizer vulnerabilities, trigger excessive memory allocation, or cause denial of service through adversarial prompt lengths.
- KV-cache as shared resource: PagedAttention manages the KV-cache as a shared memory pool across requests. This sharing is the source of vLLM's performance advantage but creates potential information leakage between requests.
- OpenAI-compatible API: vLLM's API server implements an OpenAI-compatible REST interface, which means clients may send structured prompts with system/user/assistant roles that the server must parse and validate.
- Tensor parallelism: Multi-GPU inference splits model layers across GPUs using NCCL, introducing inter-GPU communication channels that could leak information.
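To make the KV-cache sharing risk concrete, the toy allocator below mimics hash-based prefix caching: cache blocks are keyed by the token prefix, so a second request that shares a prefix with an earlier one gets cache hits, observable as lower latency. This is a simplified model for illustration, not vLLM's actual implementation.

```python
# Toy model of prefix caching in a paged KV-cache (NOT vLLM's real code):
# blocks are keyed by a hash of the token prefix, so identical prefixes
# from different requests reuse the same cached block. A cache hit skips
# recomputation -- observable as lower latency, a cross-tenant side channel.
from hashlib import sha256

class ToyPrefixCache:
    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks: dict[str, list[int]] = {}  # prefix-hash -> token block

    def allocate(self, tokens: list[int]) -> tuple[int, int]:
        """Return (cache_hits, new_blocks) for a request's prompt tokens."""
        hits = misses = 0
        usable = len(tokens) - len(tokens) % self.block_size
        for i in range(0, usable, self.block_size):
            key = sha256(bytes(tokens[: i + self.block_size])).hexdigest()
            if key in self.blocks:
                hits += 1       # reused block -> faster, leaks prefix reuse
            else:
                self.blocks[key] = tokens[i : i + self.block_size]
                misses += 1
        return hits, misses

cache = ToyPrefixCache()
print(cache.allocate([1, 2, 3, 4, 5, 6, 7, 8]))  # (0, 2) first tenant: all misses
print(cache.allocate([1, 2, 3, 4, 9, 9, 9, 9]))  # (1, 1) shared 4-token prefix hits
```

An attacker who can measure time-to-first-token can therefore probe whether another tenant recently submitted a given system prompt or document prefix.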
"""
vLLM security assessment focusing on prompt-based attacks
and resource exhaustion.
"""
import requests
import time
import json
import concurrent.futures
from typing import Optional
class VLLMAuditor:
"""Security auditor for vLLM deployments."""
def __init__(self, base_url: str = "http://localhost:8000"):
self.base_url = base_url
self.findings: list[dict] = []
def check_prompt_length_limits(self) -> None:
"""
Test if the server enforces prompt length limits.
Excessively long prompts can cause OOM or extreme latency.
"""
# Test progressively longer prompts
test_lengths = [1000, 10000, 50000, 100000]
for length in test_lengths:
prompt = "A " * length # Simple repeated token
try:
start = time.time()
resp = requests.post(
f"{self.base_url}/v1/completions",
json={
"model": "default",
"prompt": prompt,
"max_tokens": 1,
},
timeout=30,
)
elapsed = time.time() - start
if resp.status_code == 200:
self.findings.append({
"severity": "MEDIUM" if length > 10000 else "LOW",
"title": f"Accepted prompt of {length} tokens",
"detail": (
f"Server processed {length}-token prompt in "
f"{elapsed:.2f}s. Large prompts without limits "
f"enable resource exhaustion."
),
})
elif resp.status_code == 400:
self.findings.append({
"severity": "INFO",
"title": f"Prompt length {length} rejected",
"detail": f"Server properly rejected: {resp.text[:200]}",
})
break # Found the limit
except requests.Timeout:
self.findings.append({
"severity": "HIGH",
"title": f"Timeout at prompt length {length}",
"detail": (
"Server timed out processing long prompt. "
"This indicates missing prompt length validation "
"and potential DoS vulnerability."
),
})
break
except requests.ConnectionError:
self.findings.append({
"severity": "CRITICAL",
"title": f"Connection lost at prompt length {length}",
"detail": "Server became unreachable — possible OOM crash.",
})
break
def check_concurrent_request_limits(self) -> None:
"""Test for rate limiting and concurrent request handling."""
num_concurrent = 50
prompt = "What is the capital of France?"
def send_request() -> tuple[int, float]:
start = time.time()
try:
resp = requests.post(
f"{self.base_url}/v1/completions",
json={
"model": "default",
"prompt": prompt,
"max_tokens": 10,
},
timeout=60,
)
return resp.status_code, time.time() - start
except Exception:
return 0, time.time() - start
with concurrent.futures.ThreadPoolExecutor(
max_workers=num_concurrent
) as executor:
futures = [
executor.submit(send_request)
for _ in range(num_concurrent)
]
results = [f.result() for f in futures]
success = sum(1 for code, _ in results if code == 200)
rate_limited = sum(1 for code, _ in results if code == 429)
errors = sum(1 for code, _ in results if code not in (200, 429))
if rate_limited == 0 and success == num_concurrent:
self.findings.append({
"severity": "MEDIUM",
"title": "No rate limiting detected",
"detail": (
f"All {num_concurrent} concurrent requests succeeded. "
f"No rate limiting or request queuing observed."
),
})
elif rate_limited > 0:
self.findings.append({
"severity": "INFO",
"title": "Rate limiting active",
"detail": (
f"{rate_limited}/{num_concurrent} requests rate-limited."
),
})
def check_model_info_disclosure(self) -> None:
"""Check for model information disclosure via API."""
endpoints = [
"/v1/models",
"/health",
"/version",
"/metrics",
]
for endpoint in endpoints:
try:
resp = requests.get(
f"{self.base_url}{endpoint}",
timeout=10,
)
if resp.status_code == 200:
self.findings.append({
"severity": "LOW",
"title": f"Endpoint accessible: {endpoint}",
"detail": f"Response: {resp.text[:300]}",
})
except requests.ConnectionError:
pass
def run_audit(self) -> list[dict]:
"""Run all vLLM-specific audit checks."""
self.findings = []
self.check_model_info_disclosure()
self.check_prompt_length_limits()
self.check_concurrent_request_limits()
return self.findings
Comparative Security Matrix
The following table summarizes key security properties across the four frameworks:
| Security Property | TorchServe | TF Serving | Triton | vLLM |
|---|---|---|---|---|
| Built-in Authentication | None | None | None | None |
| Built-in TLS | Config option | Config option | Config option | Config option |
| Management API | Separate port (8081) | None (filesystem) | Model control API | None |
| Model Format Risk | .mar (ZIP + Python) | SavedModel (TF ops) | Multiple formats | HuggingFace/safetensors |
| Dynamic Model Loading | Yes (via API) | Yes (filesystem poll) | Yes (API or poll) | Limited |
| Shared Memory | No | No | Yes (system + CUDA) | Internal only |
| Default Network Binding | 0.0.0.0 (pre-0.8.2) | 0.0.0.0 | 0.0.0.0 | 0.0.0.0 |
| Metrics Endpoint | Port 8082 | None by default | Port 8002 | /metrics |
| Notable CVEs | CVE-2023-43654 | CVE-2021-37678 | CVE-2023-31036 | Emerging (newer project) |
Practical Examples
Unified Framework Scanner
#!/usr/bin/env bash
# Quick reconnaissance script to identify which model serving framework
# is running on a target host and gather initial security-relevant info.
TARGET="${1:?Usage: $0 <target_host>}"
echo "=== Model Serving Framework Detection ==="
echo "Target: $TARGET"
echo ""
# TorchServe detection
echo "--- TorchServe (ports 8080-8082) ---"
curl -s --connect-timeout 3 "http://${TARGET}:8080/ping" && \
echo " [+] TorchServe inference API detected"
curl -s --connect-timeout 3 "http://${TARGET}:8081/models" && \
echo " [+] TorchServe management API EXPOSED"
curl -s --connect-timeout 3 "http://${TARGET}:8082/metrics" | head -5
# TF Serving detection
echo ""
echo "--- TensorFlow Serving (ports 8500-8501) ---"
curl -s --connect-timeout 3 "http://${TARGET}:8501/v1/models" && \
echo " [+] TF Serving REST API detected"
# Triton detection
echo ""
echo "--- Triton Inference Server (ports 8000-8002) ---"
TRITON_META=$(curl -s --connect-timeout 3 "http://${TARGET}:8000/v2")
if echo "$TRITON_META" | grep -q "triton"; then
echo " [+] Triton detected: $TRITON_META"
echo " [*] Model repository:"
curl -s "http://${TARGET}:8000/v2/repository/index" | python3 -m json.tool 2>/dev/null
fi
# vLLM detection
echo ""
echo "--- vLLM (port 8000, OpenAI-compatible) ---"
VLLM_MODELS=$(curl -s --connect-timeout 3 "http://${TARGET}:8000/v1/models")
if echo "$VLLM_MODELS" | grep -q '"object"'; then
echo " [+] vLLM / OpenAI-compatible API detected"
echo "$VLLM_MODELS" | python3 -m json.tool 2>/dev/null
fi
echo ""
echo "=== Scan Complete ==="
Defense and Mitigation
Network segmentation is the single most effective defense for model serving frameworks. None of the four frameworks provide built-in authentication or authorization that would be considered production-grade. The management/control interfaces must be isolated to administrative networks:
- TorchServe: Bind the management API to 127.0.0.1 or internal-only interfaces. Use management_address in config.properties.
- TF Serving: Restrict filesystem access to the model repository. Use read-only mounts.
- Triton: Set --model-control-mode=none to disable dynamic loading. Restrict metrics and repository endpoints via network policy.
- vLLM: Deploy behind an API gateway with authentication, rate limiting, and prompt validation.
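For TorchServe specifically, these bindings translate into config.properties entries like the following. Values are illustrative; allowed_urls is the regex allowlist TorchServe added to constrain remote model sources after CVE-2023-43654.

```properties
inference_address=http://0.0.0.0:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
# Regex allowlist for model archive sources (blocks SSRF-style fetches)
allowed_urls=file:///opt/model-store/.*
```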
Model integrity verification should be implemented at the storage layer. Use signed model artifacts, verify checksums before loading, and restrict write access to model repositories. For TorchServe, validate .mar files against a known-good signature before registration.
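A minimal sketch of the checksum half of that control follows; hash pinning alone is shown here, and a production pipeline would layer cryptographic signatures (e.g. Sigstore or GPG) on top.

```python
# Minimal artifact pinning: refuse to register a model unless its SHA-256
# matches a digest recorded out-of-band at export time.
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Stream-hash a file so large model archives don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Gate model registration on a pinned digest."""
    return sha256_file(path) == expected_sha256.lower()
```

The verification call belongs in the deployment pipeline, before the artifact ever reaches the model repository that the serving framework polls.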
Input validation is critical for all frameworks but especially vLLM and other LLM-serving systems. Implement prompt length limits, request rate limits, and content filtering before requests reach the inference engine.
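Such a gate can start as simple as the sketch below; the thresholds are illustrative placeholders, since real limits depend on the model's context window and capacity planning.

```python
# Sketch of a pre-inference request gate placed in front of the serving
# endpoint. Thresholds are illustrative, not recommended values.
MAX_PROMPT_CHARS = 32_000   # coarse pre-tokenization bound
MAX_OUTPUT_TOKENS = 1_024

def validate_request(prompt: str, max_tokens: int) -> list[str]:
    """Return a list of violations; an empty list means the request may pass."""
    violations = []
    if len(prompt) > MAX_PROMPT_CHARS:
        violations.append(f"prompt exceeds {MAX_PROMPT_CHARS} chars")
    if max_tokens > MAX_OUTPUT_TOKENS:
        violations.append(f"max_tokens exceeds {MAX_OUTPUT_TOKENS}")
    if "\x00" in prompt:
        violations.append("NUL byte in prompt")
    return violations

print(validate_request("What is the capital of France?", 10))  # []
```

Rejecting oversized requests here, before tokenization and KV-cache allocation, is what turns the resource-exhaustion findings from the auditor above into a non-issue.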
Resource limits via container cgroups, Kubernetes resource quotas, and GPU memory limits prevent denial-of-service through resource exhaustion. Set explicit max_batch_size, max_sequence_length, and concurrent request limits in framework configurations.
TLS termination should happen at the load balancer or service mesh level rather than relying on each framework's built-in TLS support, which varies in configuration quality and cipher suite selection.
Dependency scanning: All four frameworks depend on complex software stacks (PyTorch, TensorFlow, ONNX Runtime, CUDA drivers, Python packages). Regularly scan these dependencies for known vulnerabilities. Container images used for model serving should be rebuilt and scanned with each security update. Use minimal base images (distroless or scratch-based) to reduce the attack surface of the serving container.
Model format migration: Where possible, migrate from unsafe model formats (pickle-based checkpoints, SavedModels with custom ops) to safer alternatives (safetensors, ONNX without custom ops). This eliminates the most critical vulnerability class across all frameworks: arbitrary code execution during model loading. TorchServe's .mar format is inherently unsafe because it bundles Python code, so consider alternatives such as Triton with ONNX or TensorRT backends for the highest-security deployments.
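The code-execution gap between formats is easy to demonstrate with the standard library alone; a benign print() stands in for attacker code.

```python
# Why pickle-family checkpoint formats are unsafe: unpickling invokes
# __reduce__, which can name any callable. A benign print() stands in
# for attacker code here.
import pickle

class Payload:
    def __reduce__(self):
        # (callable, args) is executed at load time by pickle.loads
        return (print, ("code executed during deserialization",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # triggers the embedded call, returns None
# Data-only formats (e.g. safetensors: raw tensor bytes plus a JSON
# header) have no equivalent code path, so loading cannot be hijacked.
```

This is the same mechanism behind malicious PyTorch .pt/.pth checkpoints, which are ZIP archives containing pickled objects.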
Incident response preparation: Because model serving frameworks are actively maintained open-source projects, new vulnerabilities are regularly discovered. Establish a process for monitoring security advisories for each framework in use, testing patches in staging, and deploying updates rapidly. Maintain the ability to quickly switch between framework versions or temporarily disable vulnerable features (such as dynamic model loading) in response to zero-day disclosures.
References
- Oligo Security. (2023). "ShellTorch: Multiple Critical Vulnerabilities in TorchServe." https://www.oligo.security/blog/shelltorch-torchserve-ssrf-vulnerability-cve-2023-43654
- NVIDIA. (2024). "Triton Inference Server Security Bulletin." CVE-2023-31036. https://nvidia.custhelp.com/app/answers/detail/a_id/5510
- OWASP. (2025). "OWASP Machine Learning Security Top 10." https://owasp.org/www-project-machine-learning-security-top-10/
- MITRE ATLAS. "Case Study: Attacking ML Model Serving Infrastructure." https://atlas.mitre.org/