Kubernetes Security for ML Workloads
Comprehensive analysis of Kubernetes attack surfaces specific to machine learning workloads, including GPU operator exploitation, model serving namespace attacks, and cluster-level privilege escalation through ML components.
Machine learning workloads on Kubernetes introduce a distinct set of security challenges that go beyond standard container orchestration risks. The combination of GPU scheduling requirements, specialized operators for training and serving, shared storage for model artifacts, and the privileged access patterns demanded by CUDA runtimes creates an attack surface that neither traditional Kubernetes security nor ML security fully addresses in isolation.
Device plugins are the foundational mechanism through which Kubernetes exposes GPU resources to ML workloads. The NVIDIA GPU Operator, AMD ROCm device plugin, and Intel GPU plugins all operate with elevated privileges that, when misconfigured, become high-value targets for lateral movement.
ML-Specific Kubernetes Architecture
A production ML platform on Kubernetes typically includes several additional layers beyond standard deployments:
┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ KubeFlow / │ │ Seldon / │ │ Training │ │
│ │ ML Platform │ │ KServe │ │ Operator │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────┘ │
│ │ │ │ │
│ ┌──────┴──────────────────┴──────────────────┴─────┐│
│ │ GPU Operator / Device Plugin ││
│ └──────────────────────┬───────────────────────────┘│
│ │ │
│ ┌──────────────────────┴───────────────────────────┐│
│ │ Node (GPU-enabled) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ ││
│ │ │ CUDA │ │ Model │ │ Shared Storage │ ││
│ │ │ Runtime │ │ Weights │ │ (PVCs / NFS) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────┘ ││
│ └───────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Each component introduces specific attack vectors that are absent in standard Kubernetes deployments.
GPU Operator and Device Plugin Exploitation
NVIDIA GPU Operator Attack Surface
The NVIDIA GPU Operator runs as a privileged DaemonSet on every GPU node, managing driver installation, container runtime configuration, and device plugin lifecycle:
# Enumerate GPU operator components in a cluster
import subprocess
import json

def enumerate_gpu_operator(namespace="gpu-operator"):
    """Identify GPU operator attack surface in a Kubernetes cluster."""
    components = {}
    # Find GPU operator pods and their privilege levels
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-o", "json"
    ]))
    for pod in pods["items"]:
        pod_name = pod["metadata"]["name"]
        pod_spec = pod["spec"]
        for container in pod_spec.get("containers", []):
            security_context = container.get("securityContext", {})
            # Key per container so multi-container pods are not overwritten
            components[f"{pod_name}/{container['name']}"] = {
                "privileged": security_context.get("privileged", False),
                "host_pid": pod_spec.get("hostPID", False),
                "host_network": pod_spec.get("hostNetwork", False),
                "volume_mounts": [
                    vm["mountPath"] for vm in container.get("volumeMounts", [])
                ],
                "capabilities": security_context.get("capabilities", {}),
            }
    return components

Device Plugin Socket Exploitation
GPU device plugins communicate with the kubelet through Unix sockets. If an attacker gains access to the node filesystem, these sockets can be manipulated:
# Default device plugin socket locations
# /var/lib/kubelet/device-plugins/nvidia.sock
# /var/lib/kubelet/device-plugins/kubelet.sock
# An attacker with node access can:
# 1. Register a malicious device plugin that intercepts GPU allocation
# 2. Monitor device plugin traffic to observe which pods request GPUs
# 3. Manipulate device allocation to redirect GPU access

| Attack Vector | Prerequisite | Impact | Difficulty |
|---|---|---|---|
| Device plugin socket hijack | Node filesystem access | GPU allocation manipulation | High |
| GPU operator pod compromise | Namespace access to gpu-operator | Privileged code execution on all GPU nodes | Medium |
| CUDA runtime manipulation | Container escape or host access | Arbitrary code execution in GPU context | High |
| MIG partition manipulation | GPU operator access | Cross-tenant GPU memory access | Medium |
| Driver version downgrade | GPU operator config access | Introduction of known vulnerabilities | Low |
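The socket paths above can be checked programmatically from inside a container. A minimal sketch, assuming the default kubelet device-plugin directory and that it may be bind-mounted into the assessing pod (the function name and severity labels are illustrative):

```python
import os
import stat

def find_device_plugin_sockets(plugin_dir="/var/lib/kubelet/device-plugins"):
    """Return Unix sockets in the device-plugin directory with access info."""
    findings = []
    if not os.path.isdir(plugin_dir):
        return findings  # directory not mounted into this container
    for name in os.listdir(plugin_dir):
        path = os.path.join(plugin_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            continue
        if stat.S_ISSOCK(mode):
            writable = os.access(path, os.W_OK)
            findings.append({
                "socket": path,
                "writable": writable,
                # Write access implies the ability to register a rogue plugin
                "severity": "HIGH" if writable else "INFO",
            })
    return findings
```

An empty result from the default path usually just means the directory is not mounted into the pod, which is itself a useful data point about the pod's isolation.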
KubeFlow Attack Surface
KubeFlow is a widely deployed ML platform on Kubernetes that introduces multiple high-value attack surfaces:
Notebook Server Exploitation
KubeFlow notebooks run as Kubernetes pods with direct cluster access. They are frequently provisioned with overly permissive service accounts:
# Probe KubeFlow notebook server for Kubernetes access
import os
import requests

def assess_notebook_k8s_access():
    """Assess Kubernetes access from a KubeFlow notebook pod."""
    findings = []
    token = None
    # Check for service account token
    token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    if os.path.exists(token_path):
        with open(token_path) as f:
            token = f.read()
        findings.append({
            "finding": "Service account token accessible",
            "token_preview": token[:50] + "...",
        })
    # Attempt to list pods in the namespace (only if a token was found)
    namespace_path = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
    namespace = (
        open(namespace_path).read().strip()
        if os.path.exists(namespace_path) else "default"
    )
    k8s_host = os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    k8s_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if token:
        try:
            resp = requests.get(
                f"https://{k8s_host}:{k8s_port}/api/v1/namespaces/{namespace}/pods",
                headers={"Authorization": f"Bearer {token}"},
                verify="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
                timeout=5,
            )
            if resp.status_code == 200:
                pod_names = [p["metadata"]["name"] for p in resp.json()["items"]]
                findings.append({
                    "finding": "Can list pods in namespace",
                    "severity": "HIGH",
                    "pods": pod_names,
                })
        except requests.RequestException as e:
            findings.append({"finding": "API access failed", "error": str(e)})
    # Check for GPU access from notebook
    try:
        import torch
        if torch.cuda.is_available():
            findings.append({
                "finding": "GPU access from notebook",
                "gpu_count": torch.cuda.device_count(),
                "gpu_name": torch.cuda.get_device_name(0),
            })
    except ImportError:
        pass
    return findings

Pipeline Injection Attacks
KubeFlow Pipelines execute user-defined DAGs as Kubernetes pods. A compromised pipeline step can leverage the pipeline's service account to escalate privileges:
# Malicious pipeline component that exploits Kubernetes access
# This demonstrates the attack pattern - not for use in unauthorized testing
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
    - name: malicious-step
      container:
        image: python:3.11
        command: ["python", "-c"]
        args:
          - |
            # Pipeline pods often have broad namespace access
            # for reading artifacts, secrets, and config maps
            import subprocess
            # Enumerate accessible secrets
            result = subprocess.run(
                ["kubectl", "get", "secrets", "-A", "-o", "name"],
                capture_output=True, text=True
            )
            print(result.stdout)

Model Serving Namespace Attacks
KServe and Seldon Security Model
Model serving platforms like KServe and Seldon Core deploy inference services as Kubernetes resources with predictable naming conventions and network patterns:
# Enumerate model serving endpoints across namespaces
def enumerate_inference_services():
    """Find all inference services in a Kubernetes cluster."""
    import subprocess
    import json
    services = {}
    # KServe InferenceService resources
    try:
        kserve = json.loads(subprocess.check_output([
            "kubectl", "get", "inferenceservices", "--all-namespaces", "-o", "json"
        ]))
        for item in kserve["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            url = item.get("status", {}).get("url", "unknown")
            services[f"{ns}/{name}"] = {
                "type": "kserve",
                "url": url,
                "ready": item.get("status", {}).get("conditions", []),
            }
    except subprocess.CalledProcessError:
        pass
    # Seldon deployments
    try:
        seldon = json.loads(subprocess.check_output([
            "kubectl", "get", "seldondeployments", "--all-namespaces", "-o", "json"
        ]))
        for item in seldon["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            services[f"{ns}/{name}"] = {
                "type": "seldon",
                "replicas": item["spec"].get("replicas", 1),
                "predictors": [
                    p["name"] for p in item["spec"].get("predictors", [])
                ],
            }
    except subprocess.CalledProcessError:
        pass
    return services

Cross-Namespace Model Access
In multi-tenant ML platforms, teams typically deploy models to separate namespaces. However, several common misconfigurations enable cross-namespace access:
| Misconfiguration | Description | Exploitation |
|---|---|---|
| Missing NetworkPolicy | No network isolation between ML namespaces | Direct HTTP access to other teams' inference endpoints |
| Shared model storage PVC | Multiple namespaces mount the same PV | Read or overwrite other teams' model weights |
| Overpermissive Istio/Envoy rules | Service mesh allows cross-namespace traffic | Intercept or redirect inference requests |
| Global model registry access | All namespaces can pull from the same registry | Poison models used by other teams |
| Shared secrets for cloud storage | S3/GCS credentials shared across namespaces | Access training data and model artifacts |
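The missing-NetworkPolicy row in the table can be verified directly from any pod: build the predictable in-cluster hostnames for another team's InferenceService and attempt unauthenticated requests. A sketch, assuming common KServe hostname defaults (the `-predictor-default` suffix varies by version and ingress configuration):

```python
def candidate_inference_urls(name, namespace, port=80):
    """Build likely in-cluster URLs for a KServe InferenceService."""
    hosts = [
        f"{name}.{namespace}.svc.cluster.local",
        f"{name}-predictor-default.{namespace}.svc.cluster.local",
    ]
    return [f"http://{h}:{port}/v1/models/{name}" for h in hosts]

def probe_endpoints(urls, timeout=3):
    """Attempt unauthenticated requests; HTTP 200 means no namespace isolation."""
    import requests
    reachable = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                reachable.append(url)
        except requests.RequestException:
            continue
    return reachable
```

Any 200 response from a namespace the probing pod does not belong to demonstrates the missing-NetworkPolicy finding concretely.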
Training Job Security
Distributed Training Attack Surface
Distributed training with frameworks like Horovod, PyTorch Distributed, or DeepSpeed creates inter-pod communication channels that expand the attack surface:
# Assess distributed training network exposure
def assess_distributed_training_security(namespace="training"):
    """Check for insecure distributed training configurations."""
    import subprocess
    import json
    findings = []
    # Find training pods carrying the training-job label
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-l", "training-job",
        "-o", "json"
    ]))
    for pod in pods["items"]:
        for container in pod["spec"].get("containers", []):
            env_vars = {
                e["name"]: e.get("value", "")
                for e in container.get("env", [])
            }
            # Check for exposed rendezvous endpoints used by
            # PyTorch Distributed / NCCL / Gloo setups
            master_addr = env_vars.get("MASTER_ADDR")
            master_port = env_vars.get("MASTER_PORT")
            if master_addr and master_port:
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "master_addr": master_addr,
                    "master_port": master_port,
                    "severity": "MEDIUM",
                    "note": "Distributed training master endpoint exposed",
                })
        # Check for shared memory mounts (required for NCCL)
        for vol in pod["spec"].get("volumes", []):
            if vol.get("emptyDir", {}).get("medium") == "Memory":
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "finding": "Shared memory (dshm) mount detected",
                    "note": "Required for NCCL but may allow cross-container data access",
                })
    return findings

Training Operator Exploitation
Kubernetes training operators (PyTorchJob, TFJob, MPIJob) manage the lifecycle of training jobs. Compromising the operator grants control over all training workloads:
| Operator | CRD | Attack Surface |
|---|---|---|
| PyTorch Operator | PyTorchJob | Master/worker pod creation, GPU allocation |
| TensorFlow Operator | TFJob | PS/worker topology, checkpoint paths |
| MPI Operator | MPIJob | SSH key distribution, launcher access |
| Volcano Scheduler | Queue, PodGroup | Priority manipulation, resource starvation |
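As one concrete check against the MPI Operator row, an assessor can inspect an MPIJob spec for Secret volumes (typically the distributed SSH keys) mounted into launcher or worker pods. A sketch over a parsed MPIJob dict, assuming the `mpiReplicaSpecs` layout of the Kubeflow MPIJob API; the helper name is illustrative:

```python
def mpijob_secret_mounts(job_spec):
    """List Secret volumes mounted into MPIJob replica pods (e.g. SSH keys)."""
    exposed = []
    for role, replica in job_spec.get("mpiReplicaSpecs", {}).items():
        pod_spec = replica.get("template", {}).get("spec", {})
        for vol in pod_spec.get("volumes", []):
            # A Secret volume here is usually the SSH keypair the operator
            # distributes to all replicas -- readable by anything in the pod
            if "secret" in vol:
                exposed.append({
                    "role": role,
                    "secret": vol["secret"].get("secretName"),
                })
    return exposed
```

Feed it the output of `kubectl get mpijob <name> -o json` (the `spec` field); any hit means a compromised worker container can read the launcher's SSH credentials.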
RBAC Misconfigurations in ML Platforms
ML platforms frequently require broad RBAC permissions that violate least-privilege principles:
# Common overly permissive RBAC for ML platform service accounts
# This pattern appears frequently in KubeFlow and similar platforms
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ml-platform-admin
rules:
  # Broad pod management for training jobs
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["*"]
  # Secret access for model registry credentials
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  # PVC management for datasets and models
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  # CRD access for ML operators
  - apiGroups: ["kubeflow.org"]
    resources: ["*"]
    verbs: ["*"]

RBAC Assessment Checklist
def audit_ml_rbac(namespace="kubeflow"):
    """Audit RBAC permissions for ML platform service accounts."""
    import subprocess
    import json
    dangerous_permissions = []
    # Get all service accounts in the ML namespace
    sa_list = json.loads(subprocess.check_output([
        "kubectl", "get", "serviceaccounts", "-n", namespace, "-o", "json"
    ]))
    for sa in sa_list["items"]:
        sa_name = sa["metadata"]["name"]
        # Check what this SA can do using auth can-i
        for resource in ["secrets", "pods/exec", "configmaps", "nodes"]:
            for verb in ["get", "list", "create", "delete"]:
                result = subprocess.run(
                    ["kubectl", "auth", "can-i", verb, resource,
                     "--as", f"system:serviceaccount:{namespace}:{sa_name}",
                     "-n", namespace],
                    capture_output=True, text=True
                )
                # Exact match avoids false positives from warning text
                if result.stdout.strip() == "yes":
                    dangerous_permissions.append({
                        "service_account": f"{namespace}:{sa_name}",
                        "resource": resource,
                        "verb": verb,
                    })
    return dangerous_permissions

Shared Storage Attacks
ML workloads rely heavily on shared storage for datasets, model weights, checkpoints, and artifacts. This creates cross-pod and cross-namespace attack opportunities:
PersistentVolume Claim Misconfigurations
# Commonly misconfigured shared storage for ML workloads
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can write simultaneously
  resources:
    requests:
      storage: 500Gi
  storageClassName: nfs-client  # NFS often lacks access controls

| Storage Attack | Vector | Impact |
|---|---|---|
| Model weight replacement | Write access to shared PVC | Serve a backdoored model |
| Training data poisoning | Write access to dataset PVC | Corrupt training runs |
| Checkpoint manipulation | Write access to checkpoint directory | Hijack training from a specific point |
| Log exfiltration | Read access to experiment logs | Extract hyperparameters, metrics, data samples |
| Credential harvesting | Read access to config mounts | Obtain cloud storage keys, API tokens |
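The shared-PVC vectors in the table reduce to one auditable property: a ReadWriteMany claim that multiple workloads mount. A sketch that flags such claims from parsed `kubectl get pvc -A -o json` output (the helper name is illustrative):

```python
def flag_shared_writable_pvcs(pvc_items):
    """Flag PVCs whose access mode allows concurrent multi-pod writes."""
    flagged = []
    for pvc in pvc_items:
        modes = pvc.get("spec", {}).get("accessModes", [])
        if "ReadWriteMany" in modes:
            flagged.append({
                "name": pvc["metadata"]["name"],
                "namespace": pvc["metadata"].get("namespace", "default"),
                # NFS-backed classes often lack per-pod access controls
                "storage_class": pvc["spec"].get("storageClassName"),
            })
    return flagged
```

Cross-referencing the flagged claims against which pods (and which namespaces) mount them reveals the model-replacement and data-poisoning paths listed above.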
Red Team Assessment Methodology
When assessing Kubernetes ML infrastructure, follow this systematic approach:
Phase 1: Reconnaissance
- Enumerate ML-specific CRDs (InferenceService, PyTorchJob, Notebook, Experiment)
- Identify GPU nodes and their operator configurations
- Map namespace topology and network policies
- Identify shared storage volumes and their access modes
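The first recon step can be scripted by filtering `kubectl get crds -o name` output for ML platform resource names. A sketch; the keyword list is illustrative, not exhaustive:

```python
# Keywords covering common ML platform CRDs (assumed illustrative set)
ML_CRD_KEYWORDS = ("inferenceservice", "pytorchjob", "tfjob", "mpijob",
                   "notebook", "experiment", "seldondeployment")

def filter_ml_crds(crd_names):
    """Return CRD names that suggest ML platform components are installed."""
    return [n for n in crd_names
            if any(k in n.lower() for k in ML_CRD_KEYWORDS)]
```

In practice the input would come from `subprocess.check_output(["kubectl", "get", "crds", "-o", "name"], text=True).splitlines()`; each hit tells the assessor which operators, and therefore which of the attack surfaces above, are present.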
Phase 2: Access Assessment
- Test service account permissions from ML pods
- Attempt cross-namespace network access to inference endpoints
- Probe GPU operator management interfaces
- Check for KubeFlow dashboard unauthenticated access
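The dashboard check above can be sketched with the standard library. This simplified probe treats an answer without credentials as a sign the dashboard may be unauthenticated; it is a heuristic, not proof, since `urlopen` follows redirects and an auth provider's login page can itself return 200:

```python
import urllib.request
import urllib.error

def check_dashboard_auth(base_url, timeout=5):
    """Heuristic probe of a dashboard URL without credentials.

    Returns True (answered 200), False (HTTP error status),
    or None (unreachable). Inspect the final URL for auth redirects.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
    except (urllib.error.URLError, OSError):
        return None
```

A `True` result warrants manual follow-up to confirm whether the response is actual dashboard content or a login page reached via redirect.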
Phase 3: Exploitation
- Attempt pipeline injection through KubeFlow Pipelines
- Test notebook server breakout via Kubernetes API access
- Probe shared storage for write access to model weights
- Assess distributed training inter-pod communication
Phase 4: Impact Demonstration
- Model replacement via shared PVC write access
- Credential extraction from ML platform secrets
- Cross-tenant inference access through missing network policies
- GPU resource starvation through priority manipulation
Related Topics
- Attacking AI Deployments -- foundational deployment security concepts
- Cloud AI Security -- cloud-specific ML infrastructure risks
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- GPU Cluster Attacks -- focused GPU compute exploitation
- ML Pipeline CI/CD Attacks -- attacking ML pipeline automation
References
- "Kubernetes Security and Observability" - Brendan Creane & Amit Gupta (O'Reilly, 2021) - Foundation for Kubernetes security assessment including RBAC, network policy, and runtime security
- "Hacking Kubernetes" - Andrew Martin & Michael Hausenblas (O'Reilly, 2022) - Practical Kubernetes attack techniques applicable to ML workloads
- NVIDIA GPU Operator Documentation (2025) - Official documentation covering GPU operator deployment, security considerations, and MIG configurations
- KubeFlow Security Documentation (2025) - KubeFlow multi-tenancy and security hardening guidance
- MITRE ATLAS - ML-specific threat framework including infrastructure-layer attacks on ML systems