Kubernetes Security for ML Workloads
Comprehensive analysis of Kubernetes attack surfaces specific to machine learning workloads, including GPU operator exploitation, model serving namespace attacks, and cluster-level privilege escalation through ML components.
Machine learning workloads on Kubernetes introduce a distinct set of security challenges that go beyond standard container orchestration risks. The combination of GPU scheduling requirements, specialized operators for training and serving, shared storage for model artifacts, and the privileged access patterns demanded by CUDA runtimes creates an attack surface that neither traditional Kubernetes security nor ML security fully addresses in isolation.
Device plugins are the foundational mechanism through which Kubernetes exposes GPU resources to ML workloads. The NVIDIA GPU Operator, AMD ROCm device plugin, and Intel GPU plugins all operate with elevated privileges that, when misconfigured, become high-value targets for lateral movement.
ML-Specific Kubernetes Architecture
A production ML platform on Kubernetes typically includes several additional layers beyond standard deployments:
┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ KubeFlow / │ │ Seldon / │ │ Training │ │
│ │ ML Platform │ │ KServe │ │ Operator │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────┘ │
│ │ │ │ │
│ ┌──────┴──────────────────┴──────────────────┴─────┐│
│ │ GPU Operator / Device Plugin ││
│ └──────────────────────┬───────────────────────────┘│
│ │ │
│ ┌──────────────────────┴───────────────────────────┐│
│ │ Node (GPU-enabled) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ ││
│ │ │ CUDA │ │ Model │ │ Shared Storage │ ││
│ │ │ Runtime │ │ Weights │ │ (PVCs / NFS) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────┘ ││
│ └───────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Each component introduces specific attack vectors that are absent in standard Kubernetes deployments.
GPU Operator and Device Plugin Exploitation
NVIDIA GPU Operator Attack Surface
The NVIDIA GPU Operator runs as a privileged DaemonSet on every GPU node, managing driver installation, container runtime configuration, and device plugin lifecycle:
# Enumerate GPU operator components in a cluster
import subprocess
import json

def enumerate_gpu_operator(namespace="gpu-operator"):
    """Identify GPU operator attack surface in a Kubernetes cluster."""
    components = {}
    # Find GPU operator pods and their privilege levels
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-o", "json"
    ]))
    for pod in pods["items"]:
        pod_name = pod["metadata"]["name"]
        pod_spec = pod["spec"]
        for container in pod_spec.get("containers", []):
            security_context = container.get("securityContext", {})
            # Key per container so multi-container pods are not overwritten
            components[f"{pod_name}/{container['name']}"] = {
                "privileged": security_context.get("privileged", False),
                "host_pid": pod_spec.get("hostPID", False),
                "host_network": pod_spec.get("hostNetwork", False),
                "volume_mounts": [
                    vm["mountPath"] for vm in container.get("volumeMounts", [])
                ],
                "capabilities": security_context.get("capabilities", {}),
            }
    return components

Device Plugin Socket Exploitation
GPU device plugins communicate with the kubelet through Unix sockets. If an attacker gains access to the node filesystem, these sockets can be manipulated:
# Default device plugin socket locations
# /var/lib/kubelet/device-plugins/nvidia.sock
# /var/lib/kubelet/device-plugins/kubelet.sock
# An attacker with node access can:
# 1. Register a malicious device plugin that intercepts GPU allocation
# 2. Monitor device plugin traffic to observe which pods request GPUs
# 3. Manipulate device allocation to redirect GPU access

| Attack Vector | Prerequisite | Impact | Difficulty |
|---|---|---|---|
| Device plugin socket hijack | Node filesystem access | GPU allocation manipulation | High |
| GPU operator pod compromise | Namespace access to gpu-operator | Privileged code execution on all GPU nodes | Medium |
| CUDA runtime manipulation | Container escape or host access | Arbitrary code execution in GPU context | High |
| MIG partition manipulation | GPU operator access | Cross-tenant GPU memory access | Medium |
| Driver version downgrade | GPU operator config access | Introduction of known vulnerabilities | Low |
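The socket paths above can be checked programmatically from inside a container. A minimal sketch, assuming the default kubelet device-plugin directory and that it may be bind-mounted into the assessing pod (the function name and severity labels are illustrative):

```python
import os
import stat

def find_device_plugin_sockets(plugin_dir="/var/lib/kubelet/device-plugins"):
    """Return Unix sockets in the device-plugin directory with access info."""
    findings = []
    if not os.path.isdir(plugin_dir):
        return findings  # directory not mounted into this container
    for name in os.listdir(plugin_dir):
        path = os.path.join(plugin_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            continue
        if stat.S_ISSOCK(mode):
            writable = os.access(path, os.W_OK)
            findings.append({
                "socket": path,
                "writable": writable,
                # Write access implies the ability to register a rogue plugin
                "severity": "HIGH" if writable else "INFO",
            })
    return findings
```

An empty result from the default path usually just means the directory is not mounted into the pod, which is itself a useful data point about the pod's isolation.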
KubeFlow Attack Surface
KubeFlow is a widely deployed ML platform on Kubernetes that introduces multiple high-value attack surfaces:
Notebook Server Exploitation
KubeFlow notebooks run as Kubernetes pods with direct cluster access. They are frequently provisioned with overly permissive service accounts:
# Probe KubeFlow notebook server for Kubernetes access
import os
import requests

def assess_notebook_k8s_access():
    """Assess Kubernetes access from a KubeFlow notebook pod."""
    findings = []
    token = None
    # Check for service account token
    token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    if os.path.exists(token_path):
        with open(token_path) as f:
            token = f.read()
        findings.append({
            "finding": "Service account token accessible",
            "token_preview": token[:50] + "...",
        })
    # Attempt to list pods in the namespace (only if a token was found)
    namespace_path = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
    namespace = (
        open(namespace_path).read().strip()
        if os.path.exists(namespace_path) else "default"
    )
    k8s_host = os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    k8s_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if token:
        try:
            resp = requests.get(
                f"https://{k8s_host}:{k8s_port}/api/v1/namespaces/{namespace}/pods",
                headers={"Authorization": f"Bearer {token}"},
                verify="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
                timeout=5,
            )
            if resp.status_code == 200:
                pod_names = [p["metadata"]["name"] for p in resp.json()["items"]]
                findings.append({
                    "finding": "Can list pods in namespace",
                    "severity": "HIGH",
                    "pods": pod_names,
                })
        except requests.RequestException as e:
            findings.append({"finding": "API access failed", "error": str(e)})
    # Check for GPU access from notebook
    try:
        import torch
        if torch.cuda.is_available():
            findings.append({
                "finding": "GPU access from notebook",
                "gpu_count": torch.cuda.device_count(),
                "gpu_name": torch.cuda.get_device_name(0),
            })
    except ImportError:
        pass
    return findings

Pipeline Injection Attacks
KubeFlow Pipelines execute user-defined DAGs as Kubernetes pods. A compromised pipeline step can leverage the pipeline's service account to escalate privileges:
# Malicious pipeline component that exploits Kubernetes access
# This demonstrates the attack pattern - not for use in unauthorized testing
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
    - name: malicious-step
      container:
        image: python:3.11
        command: ["python", "-c"]
        args:
          - |
            # Pipeline pods often have broad namespace access
            # for reading artifacts, secrets, and config maps
            import subprocess
            # Enumerate accessible secrets
            result = subprocess.run(
                ["kubectl", "get", "secrets", "-A", "-o", "name"],
                capture_output=True, text=True
            )
            print(result.stdout)

Model Serving Namespace Attacks
KServe and Seldon Security Model
Model serving platforms like KServe and Seldon Core deploy inference services as Kubernetes resources with predictable naming conventions and network patterns:
# Enumerate model serving endpoints across namespaces
def enumerate_inference_services():
    """Find all inference services in a Kubernetes cluster."""
    import subprocess
    import json
    services = {}
    # KServe InferenceService resources
    try:
        kserve = json.loads(subprocess.check_output([
            "kubectl", "get", "inferenceservices", "--all-namespaces", "-o", "json"
        ]))
        for item in kserve["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            url = item.get("status", {}).get("url", "unknown")
            services[f"{ns}/{name}"] = {
                "type": "kserve",
                "url": url,
                "ready": item.get("status", {}).get("conditions", []),
            }
    except subprocess.CalledProcessError:
        pass
    # Seldon deployments
    try:
        seldon = json.loads(subprocess.check_output([
            "kubectl", "get", "seldondeployments", "--all-namespaces", "-o", "json"
        ]))
        for item in seldon["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            services[f"{ns}/{name}"] = {
                "type": "seldon",
                "replicas": item["spec"].get("replicas", 1),
                "predictors": [
                    p["name"] for p in item["spec"].get("predictors", [])
                ],
            }
    except subprocess.CalledProcessError:
        pass
    return services

Cross-Namespace Model Access
In multi-tenant ML platforms, teams typically deploy models to separate namespaces. However, several common misconfigurations enable cross-namespace access:
| Misconfiguration | Description | Exploitation |
|---|---|---|
| Missing NetworkPolicy | No network isolation between ML namespaces | Direct HTTP access to other teams' inference endpoints |
| Shared model storage PVC | Multiple namespaces mount the same PV | Read or overwrite other teams' model weights |
| Overpermissive Istio/Envoy rules | Service mesh allows cross-namespace traffic | Intercept or redirect inference requests |
| Global model registry access | All namespaces can pull from the same registry | Poison models used by other teams |
| Shared secrets for cloud storage | S3/GCS credentials shared across namespaces | Access training data and model artifacts |
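The missing-NetworkPolicy row in the table can be verified directly from any pod: build the predictable in-cluster hostnames for another team's InferenceService and attempt unauthenticated requests. A sketch, assuming common KServe hostname defaults (the `-predictor-default` suffix varies by version and ingress configuration):

```python
def candidate_inference_urls(name, namespace, port=80):
    """Build likely in-cluster URLs for a KServe InferenceService."""
    hosts = [
        f"{name}.{namespace}.svc.cluster.local",
        f"{name}-predictor-default.{namespace}.svc.cluster.local",
    ]
    return [f"http://{h}:{port}/v1/models/{name}" for h in hosts]

def probe_endpoints(urls, timeout=3):
    """Attempt unauthenticated requests; HTTP 200 means no namespace isolation."""
    import requests
    reachable = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                reachable.append(url)
        except requests.RequestException:
            continue
    return reachable
```

Any 200 response from a namespace the probing pod does not belong to demonstrates the missing-NetworkPolicy finding concretely.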
Training Job Security
Distributed Training Attack Surface
Distributed training with frameworks like Horovod, PyTorch Distributed, or DeepSpeed creates inter-pod communication channels that expand the attack surface:
# Assess distributed training network exposure
def assess_distributed_training_security(namespace="training"):
    """Check for insecure distributed training configurations."""
    import subprocess
    import json
    findings = []
    # Find training pods carrying the training-job label
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-l", "training-job",
        "-o", "json"
    ]))
    for pod in pods["items"]:
        for container in pod["spec"].get("containers", []):
            env_vars = {
                e["name"]: e.get("value", "")
                for e in container.get("env", [])
            }
            # Check for exposed rendezvous endpoints used by
            # PyTorch Distributed / NCCL / Gloo setups
            master_addr = env_vars.get("MASTER_ADDR")
            master_port = env_vars.get("MASTER_PORT")
            if master_addr and master_port:
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "master_addr": master_addr,
                    "master_port": master_port,
                    "severity": "MEDIUM",
                    "note": "Distributed training master endpoint exposed",
                })
        # Check for shared memory mounts (required for NCCL)
        for vol in pod["spec"].get("volumes", []):
            if vol.get("emptyDir", {}).get("medium") == "Memory":
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "finding": "Shared memory (dshm) mount detected",
                    "note": "Required for NCCL but may allow cross-container data access",
                })
    return findings

Training Operator Exploitation
Kubernetes training operators (PyTorchJob, TFJob, MPIJob) manage the lifecycle of training jobs. Compromising the operator grants control over all training workloads:
| Operator | CRD | Attack Surface |
|---|---|---|
| PyTorch Operator | PyTorchJob | Master/worker pod creation, GPU allocation |
| TensorFlow Operator | TFJob | PS/worker topology, checkpoint paths |
| MPI Operator | MPIJob | SSH key distribution, launcher access |
| Volcano Scheduler | Queue, PodGroup | Priority manipulation, resource starvation |
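As one concrete check against the MPI Operator row, an assessor can inspect an MPIJob spec for Secret volumes (typically the distributed SSH keys) mounted into launcher or worker pods. A sketch over a parsed MPIJob dict, assuming the `mpiReplicaSpecs` layout of the Kubeflow MPIJob API; the helper name is illustrative:

```python
def mpijob_secret_mounts(job_spec):
    """List Secret volumes mounted into MPIJob replica pods (e.g. SSH keys)."""
    exposed = []
    for role, replica in job_spec.get("mpiReplicaSpecs", {}).items():
        pod_spec = replica.get("template", {}).get("spec", {})
        for vol in pod_spec.get("volumes", []):
            # A Secret volume here is usually the SSH keypair the operator
            # distributes to all replicas -- readable by anything in the pod
            if "secret" in vol:
                exposed.append({
                    "role": role,
                    "secret": vol["secret"].get("secretName"),
                })
    return exposed
```

Feed it the output of `kubectl get mpijob <name> -o json` (the `spec` field); any hit means a compromised worker container can read the launcher's SSH credentials.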
RBAC Misconfigurations in ML Platforms
ML platforms frequently require broad RBAC permissions that violate least-privilege principles:
# Common overly permissive RBAC for ML platform service accounts
# This pattern appears frequently in KubeFlow and similar platforms
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ml-platform-admin
rules:
  # Broad pod management for training jobs
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["*"]
  # Secret access for model registry credentials
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  # PVC management for datasets and models
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  # CRD access for ML operators
  - apiGroups: ["kubeflow.org"]
    resources: ["*"]
    verbs: ["*"]

RBAC Assessment Checklist
def audit_ml_rbac(namespace="kubeflow"):
    """Audit RBAC permissions for ML platform service accounts."""
    import subprocess
    import json
    dangerous_permissions = []
    # Get all service accounts in the ML namespace
    sa_list = json.loads(subprocess.check_output([
        "kubectl", "get", "serviceaccounts", "-n", namespace, "-o", "json"
    ]))
    for sa in sa_list["items"]:
        sa_name = sa["metadata"]["name"]
        # Check what this SA can do using auth can-i
        for resource in ["secrets", "pods/exec", "configmaps", "nodes"]:
            for verb in ["get", "list", "create", "delete"]:
                result = subprocess.run(
                    ["kubectl", "auth", "can-i", verb, resource,
                     "--as", f"system:serviceaccount:{namespace}:{sa_name}",
                     "-n", namespace],
                    capture_output=True, text=True
                )
                # Exact match avoids false positives from warning text
                if result.stdout.strip() == "yes":
                    dangerous_permissions.append({
                        "service_account": f"{namespace}:{sa_name}",
                        "resource": resource,
                        "verb": verb,
                    })
    return dangerous_permissions

Shared Storage Attacks
ML workloads rely heavily on shared storage for datasets, model weights, checkpoints, and artifacts. This creates cross-pod and cross-namespace attack opportunities:
PersistentVolume Claim Misconfigurations
# Commonly misconfigured shared storage for ML workloads
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can write simultaneously
  resources:
    requests:
      storage: 500Gi
  storageClassName: nfs-client  # NFS often lacks access controls

| Storage Attack | Vector | Impact |
|---|---|---|
| Model weight replacement | Write access to shared PVC | Serve a backdoored model |
| Training data poisoning | Write access to dataset PVC | Corrupt training runs |
| Checkpoint manipulation | Write access to checkpoint directory | Hijack training from a specific point |
| Log exfiltration | Read access to experiment logs | Extract hyperparameters, metrics, data samples |
| Credential harvesting | Read access to config mounts | Obtain cloud storage keys, API tokens |
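The shared-PVC vectors in the table reduce to one auditable property: a ReadWriteMany claim that multiple workloads mount. A sketch that flags such claims from parsed `kubectl get pvc -A -o json` output (the helper name is illustrative):

```python
def flag_shared_writable_pvcs(pvc_items):
    """Flag PVCs whose access mode allows concurrent multi-pod writes."""
    flagged = []
    for pvc in pvc_items:
        modes = pvc.get("spec", {}).get("accessModes", [])
        if "ReadWriteMany" in modes:
            flagged.append({
                "name": pvc["metadata"]["name"],
                "namespace": pvc["metadata"].get("namespace", "default"),
                # NFS-backed classes often lack per-pod access controls
                "storage_class": pvc["spec"].get("storageClassName"),
            })
    return flagged
```

Cross-referencing the flagged claims against which pods (and which namespaces) mount them reveals the model-replacement and data-poisoning paths listed above.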
Red Team Assessment Methodology
When assessing Kubernetes ML infrastructure, follow this systematic approach:
Phase 1: Reconnaissance
- Enumerate ML-specific CRDs (InferenceService, PyTorchJob, Notebook, Experiment)
- Identify GPU nodes and their operator configurations
- Map namespace topology and network policies
- Identify shared storage volumes and their access modes
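The first recon step can be scripted by filtering `kubectl get crds -o name` output for ML platform resource names. A sketch; the keyword list is illustrative, not exhaustive:

```python
# Keywords covering common ML platform CRDs (assumed illustrative set)
ML_CRD_KEYWORDS = ("inferenceservice", "pytorchjob", "tfjob", "mpijob",
                   "notebook", "experiment", "seldondeployment")

def filter_ml_crds(crd_names):
    """Return CRD names that suggest ML platform components are installed."""
    return [n for n in crd_names
            if any(k in n.lower() for k in ML_CRD_KEYWORDS)]
```

In practice the input would come from `subprocess.check_output(["kubectl", "get", "crds", "-o", "name"], text=True).splitlines()`; each hit tells the assessor which operators, and therefore which of the attack surfaces above, are present.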
Phase 2: Access Assessment
- Test service account permissions from ML pods
- Attempt cross-namespace network access to inference endpoints
- Probe GPU operator management interfaces
- Check for KubeFlow dashboard unauthenticated access
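The dashboard check above can be sketched with the standard library. This simplified probe treats an answer without credentials as a sign the dashboard may be unauthenticated; it is a heuristic, not proof, since `urlopen` follows redirects and an auth provider's login page can itself return 200:

```python
import urllib.request
import urllib.error

def check_dashboard_auth(base_url, timeout=5):
    """Heuristic probe of a dashboard URL without credentials.

    Returns True (answered 200), False (HTTP error status),
    or None (unreachable). Inspect the final URL for auth redirects.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
    except (urllib.error.URLError, OSError):
        return None
```

A `True` result warrants manual follow-up to confirm whether the response is actual dashboard content or a login page reached via redirect.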
Phase 3: Exploitation
- Attempt pipeline injection through KubeFlow Pipelines
- Test notebook server breakout via Kubernetes API access
- Probe shared storage for write access to model weights
- Assess distributed training inter-pod communication
Phase 4: Impact Demonstration
- Model replacement via shared PVC write access
- Credential extraction from ML platform secrets
- Cross-tenant inference access through missing network policies
- GPU resource starvation through priority manipulation
Related Topics
- Attacking AI Deployments -- foundational deployment security concepts
- Cloud AI Security -- cloud-specific ML infrastructure risks
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- GPU Cluster Attacks -- focused GPU compute exploitation
- ML Pipeline CI/CD Attacks -- attacking ML pipeline automation
References
- "Kubernetes Security and Observability" - Brendan Creane & Amit Gupta (O'Reilly, 2021) - Foundation for Kubernetes security assessment including RBAC, network policy, and runtime security
- "Hacking Kubernetes" - Andrew Martin & Michael Hausenblas (O'Reilly, 2022) - Practical Kubernetes attack techniques applicable to ML workloads
- NVIDIA GPU Operator Documentation (2025) - Official documentation covering GPU operator deployment, security considerations, and MIG configurations
- KubeFlow Security Documentation (2025) - KubeFlow multi-tenancy and security hardening guidance
- MITRE ATLAS - ML-specific threat framework including infrastructure-layer attacks on ML systems