Kubernetes Security for ML Workloads
Comprehensive analysis of Kubernetes attack surfaces specific to machine learning workloads, including GPU operator exploitation, model serving namespace attacks, and cluster-level privilege escalation through ML components.
Machine learning workloads on Kubernetes introduce a distinct set of security challenges that go beyond standard container orchestration risks. The combination of GPU scheduling requirements, specialized operators for training and serving, shared storage for model artifacts, and the privileged access patterns demanded by CUDA runtimes creates an attack surface that neither traditional Kubernetes security nor ML security fully addresses in isolation.
Device plugins are the foundational mechanism through which Kubernetes exposes GPU resources to ML workloads. The NVIDIA GPU Operator, AMD ROCm device plugin, and Intel GPU plugins all operate with elevated privileges that, when misconfigured, become high-value targets for lateral movement.
ML-Specific Kubernetes Architecture
A production ML platform on Kubernetes typically includes several additional layers beyond standard deployments:
┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ KubeFlow / │ │ Seldon / │ │ Training │ │
│ │ ML Platform │ │ KServe │ │ Operator │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────┘ │
│ │ │ │ │
│ ┌──────┴──────────────────┴──────────────────┴─────┐│
│ │ GPU Operator / Device Plugin ││
│ └──────────────────────┬───────────────────────────┘│
│ │ │
│ ┌──────────────────────┴───────────────────────────┐│
│ │ Node (GPU-enabled) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ ││
│ │ │ CUDA │ │ Model │ │ Shared Storage │ ││
│ │ │ Runtime │ │ Weights │ │ (PVCs / NFS) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────┘ ││
│ └───────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Each component introduces specific attack vectors that are absent in standard Kubernetes deployments.
GPU Operator and Device Plugin Exploitation
NVIDIA GPU Operator Attack Surface
The NVIDIA GPU Operator runs as a privileged DaemonSet on every GPU node, managing driver installation, container runtime configuration, and device plugin lifecycle:
# Enumerate GPU operator components in a cluster
import subprocess
import json

def enumerate_gpu_operator(namespace="gpu-operator"):
    """Identify the GPU operator attack surface in a Kubernetes cluster."""
    components = {}
    # Find GPU operator pods and their privilege levels
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-o", "json"
    ]))
    for pod in pods["items"]:
        pod_name = pod["metadata"]["name"]
        containers = pod["spec"].get("containers", [])
        for container in containers:
            security_context = container.get("securityContext", {})
            # Key on pod/container so multi-container pods do not overwrite each other
            components[f'{pod_name}/{container["name"]}'] = {
                "privileged": security_context.get("privileged", False),
                "host_pid": pod["spec"].get("hostPID", False),
                "host_network": pod["spec"].get("hostNetwork", False),
                "volume_mounts": [
                    vm["mountPath"] for vm in container.get("volumeMounts", [])
                ],
                "capabilities": security_context.get("capabilities", {}),
            }
    return components

Device Plugin Socket Exploitation
GPU device plugins communicate with the kubelet through Unix sockets. If an attacker gains access to the node filesystem, these sockets can be manipulated:
# Default device plugin socket locations:
# /var/lib/kubelet/device-plugins/nvidia.sock
# /var/lib/kubelet/device-plugins/kubelet.sock
#
# An attacker with node access can:
# 1. Register a malicious device plugin that intercepts GPU allocation
# 2. Monitor device plugin traffic to observe which pods request GPUs
# 3. Manipulate device allocation to redirect GPU access

| Attack Vector | Prerequisite | Impact | Difficulty |
|---|---|---|---|
| Device plugin socket hijack | Node filesystem access | GPU allocation manipulation | High |
| GPU operator pod compromise | Namespace access to gpu-operator | Privileged code execution on all GPU nodes | Medium |
| CUDA runtime manipulation | Container escape or host access | Arbitrary code execution in GPU context | High |
| MIG partition manipulation | GPU operator access | Cross-tenant GPU memory access | Medium |
| Driver version downgrade | GPU operator config access | Introduction of known vulnerabilities | Low |
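The socket exposure that underpins the first attack vector in the table can be verified directly. Below is a minimal sketch, assuming a vantage point with node filesystem access (for example, a pod that mounts /var/lib/kubelet via hostPath); it only checks the default socket directory listed above and reports which device plugin sockets are present and writable:

# Minimal sketch: check device plugin socket exposure from a vantage
# point with node filesystem access. Paths are the defaults listed above.
import os
import stat

def check_device_plugin_sockets(base="/var/lib/kubelet/device-plugins"):
    """Report which device plugin sockets are reachable and writable."""
    results = []
    if not os.path.isdir(base):
        return results  # no node-level access from this vantage point
    for name in os.listdir(base):
        path = os.path.join(base, name)
        st = os.stat(path)
        if stat.S_ISSOCK(st.st_mode):
            results.append({
                "socket": path,
                "mode": oct(st.st_mode & 0o777),
                # Write access to kubelet.sock allows registering a
                # malicious device plugin with the kubelet
                "writable": os.access(path, os.W_OK),
            })
    return results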
KubeFlow Attack Surface
KubeFlow is a widely deployed ML platform on Kubernetes that introduces multiple high-value attack surfaces:
Notebook Server Exploitation
KubeFlow notebooks run as Kubernetes pods with direct cluster access. They are frequently provisioned with overly permissive service accounts:
# Probe KubeFlow notebook server for Kubernetes access
import requests
import os

def assess_notebook_k8s_access():
    """Assess Kubernetes API access from a KubeFlow notebook pod."""
    findings = []
    token = None
    # Check for a mounted service account token
    token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    if os.path.exists(token_path):
        with open(token_path) as f:
            token = f.read()
        findings.append({
            "finding": "Service account token accessible",
            "token_preview": token[:50] + "...",
        })
    # Attempt to list pods in the namespace
    namespace_path = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
    namespace = open(namespace_path).read() if os.path.exists(namespace_path) else "default"
    k8s_host = os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    k8s_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if token:
        try:
            resp = requests.get(
                f"https://{k8s_host}:{k8s_port}/api/v1/namespaces/{namespace}/pods",
                headers={"Authorization": f"Bearer {token}"},
                verify="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
                timeout=5,
            )
            if resp.status_code == 200:
                pod_names = [p["metadata"]["name"] for p in resp.json()["items"]]
                findings.append({
                    "finding": "Can list pods in namespace",
                    "severity": "HIGH",
                    "pods": pod_names,
                })
        except Exception as e:
            findings.append({"finding": "API access failed", "error": str(e)})
    # Check for GPU access from the notebook
    try:
        import torch
        if torch.cuda.is_available():
            findings.append({
                "finding": "GPU access from notebook",
                "gpu_count": torch.cuda.device_count(),
                "gpu_name": torch.cuda.get_device_name(0),
            })
    except ImportError:
        pass
    return findings

Pipeline Injection Attacks
KubeFlow Pipelines execute user-defined DAGs as Kubernetes pods. A compromised pipeline step can leverage the pipeline's service account to escalate privileges:
# Malicious pipeline component that exploits Kubernetes access
# This demonstrates the attack pattern - not for use in unauthorized testing
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
    - name: malicious-step
      container:
        image: python:3.11
        command: ["python", "-c"]
        args:
          - |
            # Pipeline pods often have broad namespace access
            # for reading artifacts, secrets, and config maps
            import subprocess
            # Enumerate accessible secrets
            result = subprocess.run(
                ["kubectl", "get", "secrets", "-A", "-o", "name"],
                capture_output=True, text=True
            )
            print(result.stdout)

Model Serving Namespace Attacks
KServe and Seldon Security Model
Model serving platforms like KServe and Seldon Core deploy inference services as Kubernetes resources with predictable naming conventions and network patterns:
# Enumerate model serving endpoints across namespaces
import subprocess
import json

def enumerate_inference_services():
    """Find all inference services in a Kubernetes cluster."""
    services = {}
    # KServe InferenceService resources
    try:
        kserve = json.loads(subprocess.check_output([
            "kubectl", "get", "inferenceservices", "--all-namespaces", "-o", "json"
        ]))
        for item in kserve["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            url = item.get("status", {}).get("url", "unknown")
            services[f"{ns}/{name}"] = {
                "type": "kserve",
                "url": url,
                "ready": item.get("status", {}).get("conditions", []),
            }
    except subprocess.CalledProcessError:
        pass
    # Seldon deployments
    try:
        seldon = json.loads(subprocess.check_output([
            "kubectl", "get", "seldondeployments", "--all-namespaces", "-o", "json"
        ]))
        for item in seldon["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            services[f"{ns}/{name}"] = {
                "type": "seldon",
                "replicas": item["spec"].get("replicas", 1),
                "predictors": [
                    p["name"] for p in item["spec"].get("predictors", [])
                ],
            }
    except subprocess.CalledProcessError:
        pass
    return services

Cross-Namespace Model Access
In multi-tenant ML platforms, teams typically deploy models to separate namespaces. However, several common misconfigurations enable cross-namespace access:
| Misconfiguration | Description | Exploitation |
|---|---|---|
| Missing NetworkPolicy | No network isolation between ML namespaces | Direct HTTP access to other teams' inference endpoints |
| Shared model storage PVC | Multiple namespaces mount the same PV | Read or overwrite other teams' model weights |
| Overpermissive Istio/Envoy rules | Service mesh allows cross-namespace traffic | Intercept or redirect inference requests |
| Global model registry access | All namespaces can pull from the same registry | Poison models used by other teams |
| Shared secrets for cloud storage | S3/GCS credentials shared across namespaces | Access training data and model artifacts |
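The first misconfiguration in the table is straightforward to verify from inside the cluster. Below is a minimal sketch that probes another namespace's KServe predictor over cluster DNS; the `<name>-predictor` service name and the `/v1/models/` route follow common KServe defaults and are assumptions that may differ per installation:

# Minimal sketch: from a pod in one namespace, test whether a KServe
# predictor in another namespace is reachable (i.e. no NetworkPolicy
# blocks cross-namespace traffic). Service name and route are assumed
# KServe defaults and may differ per installation.
import requests

def probe_cross_namespace_inference(model_name, target_namespace):
    """Return True if another team's inference endpoint serves us metadata."""
    url = (
        f"http://{model_name}-predictor.{target_namespace}"
        f".svc.cluster.local/v1/models/{model_name}"
    )
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException:
        return False  # traffic blocked, DNS failure, or service absent
    # Any response proves cross-namespace reachability; a 200 with model
    # metadata additionally proves unauthenticated inference access
    return resp.status_code == 200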
Training Job Security
Distributed Training Attack Surface
Distributed training with frameworks like Horovod, PyTorch Distributed, or DeepSpeed creates inter-pod communication channels that expand the attack surface:
# Assess distributed training network exposure
import subprocess
import json

def assess_distributed_training_security(namespace="training"):
    """Check for insecure distributed training configurations."""
    findings = []
    # Find training pods with open communication ports
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-l", "training-job",
        "-o", "json"
    ]))
    for pod in pods["items"]:
        containers = pod["spec"].get("containers", [])
        for container in containers:
            ports = container.get("ports", [])
            env_vars = {
                e["name"]: e.get("value", "")
                for e in container.get("env", [])
            }
            # MASTER_ADDR/MASTER_PORT identify the rendezvous endpoint for
            # torch.distributed; NCCL and Gloo traffic flows between pods
            master_addr = env_vars.get("MASTER_ADDR")
            master_port = env_vars.get("MASTER_PORT")
            if master_addr and master_port:
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "master_addr": master_addr,
                    "master_port": master_port,
                    "container_ports": [p.get("containerPort") for p in ports],
                    "severity": "MEDIUM",
                    "note": "Distributed training master endpoint exposed",
                })
        # Check for shared memory mounts (required for NCCL)
        volumes = pod["spec"].get("volumes", [])
        for vol in volumes:
            if vol.get("emptyDir", {}).get("medium") == "Memory":
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "finding": "Shared memory (dshm) mount detected",
                    "note": "Required for NCCL but may allow cross-container data access",
                })
    return findings

Training Operator Exploitation
Kubernetes training operators (PyTorchJob, TFJob, MPIJob) manage the lifecycle of training jobs. Compromising the operator grants control over all training workloads:
| Operator | CRD | Attack Surface |
|---|---|---|
| PyTorch Operator | PyTorchJob | Master/worker pod creation, GPU allocation |
| TensorFlow Operator | TFJob | PS/worker topology, checkpoint paths |
| MPI Operator | MPIJob | SSH key distribution, launcher access |
| Volcano Scheduler | Queue, PodGroup | Priority manipulation, resource starvation |
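A quick way to gauge this attack surface is to inspect the job custom resources themselves. Below is a minimal sketch that lists training job CRs and flags pod templates that weaken isolation; the resource plurals (pytorchjobs, tfjobs, mpijobs) are the standard Kubeflow training-operator names, and the generic JSON scan is an assumption that sidesteps each kind's differing replica-spec layout:

# Minimal sketch: list training job CRs and flag risky pod templates
# (privileged containers, hostNetwork). Resource names are standard
# Kubeflow training-operator plurals; adjust for your installation.
import subprocess
import json

def audit_training_jobs():
    """Flag training job CRs whose pod templates weaken isolation."""
    risky = []
    for resource in ["pytorchjobs", "tfjobs", "mpijobs"]:
        proc = subprocess.run(
            ["kubectl", "get", resource, "-A", "-o", "json"],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:
            continue  # CRD not installed in this cluster
        for item in json.loads(proc.stdout)["items"]:
            # Generic scan of the serialized spec, since each job kind
            # nests its pod template under a different replica-spec key
            blob = json.dumps(item["spec"])
            if '"privileged": true' in blob or '"hostNetwork": true' in blob:
                risky.append({
                    "kind": resource,
                    "name": item["metadata"]["name"],
                    "namespace": item["metadata"]["namespace"],
                })
    return risky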
RBAC Misconfigurations in ML Platforms
ML platforms frequently require broad RBAC permissions that violate least-privilege principles:
# Common overly permissive RBAC for ML platform service accounts
# This pattern appears frequently in KubeFlow and similar platforms
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ml-platform-admin
rules:
  # Broad pod management for training jobs
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["*"]
  # Secret access for model registry credentials
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  # PVC management for datasets and models
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  # CRD access for ML operators
  - apiGroups: ["kubeflow.org"]
    resources: ["*"]
    verbs: ["*"]

RBAC Assessment Checklist
def audit_ml_rbac(namespace="kubeflow"):
    """Audit RBAC permissions for ML platform service accounts."""
    import subprocess
    import json
    dangerous_permissions = []
    # Get all service accounts in the ML namespace
    sa_list = json.loads(subprocess.check_output([
        "kubectl", "get", "serviceaccounts", "-n", namespace, "-o", "json"
    ]))
    for sa in sa_list["items"]:
        sa_name = sa["metadata"]["name"]
        # Check what this SA can do using kubectl auth can-i
        for resource in ["secrets", "pods/exec", "configmaps", "nodes"]:
            for verb in ["get", "list", "create", "delete"]:
                result = subprocess.run(
                    ["kubectl", "auth", "can-i", verb, resource,
                     "--as", f"system:serviceaccount:{namespace}:{sa_name}",
                     "-n", namespace],
                    capture_output=True, text=True
                )
                if result.stdout.strip() == "yes":
                    dangerous_permissions.append({
                        "service_account": f"{namespace}:{sa_name}",
                        "resource": resource,
                        "verb": verb,
                    })
    return dangerous_permissions

Shared Storage Attacks
ML workloads rely heavily on shared storage for datasets, model weights, checkpoints, and artifacts. This creates cross-pod and cross-namespace attack opportunities:
PersistentVolume Claim Misconfigurations
# Commonly misconfigured shared storage for ML workloads
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany   # Multiple pods can write simultaneously
  resources:
    requests:
      storage: 500Gi
  storageClassName: nfs-client   # NFS often lacks access controls

| Storage Attack | Vector | Impact |
|---|---|---|
| Model weight replacement | Write access to shared PVC | Serve a backdoored model |
| Training data poisoning | Write access to dataset PVC | Corrupt training runs |
| Checkpoint manipulation | Write access to checkpoint directory | Hijack training from a specific point |
| Log exfiltration | Read access to experiment logs | Extract hyperparameters, metrics, data samples |
| Credential harvesting | Read access to config mounts | Obtain cloud storage keys, API tokens |
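The model weight replacement vector in the table reduces to a simple write test. Below is a minimal sketch that creates and removes a canary file on the shared volume; the "/models" mount path is a placeholder assumption for wherever the shared PVC is mounted in a given pod:

# Minimal sketch: test whether this pod can modify a shared model volume.
# "/models" is a placeholder for the actual PVC mount path.
import os
import uuid

def test_model_volume_write(mount_path="/models"):
    """Return True if the shared model volume is writable from this pod."""
    canary = os.path.join(mount_path, f".write-test-{uuid.uuid4().hex}")
    try:
        with open(canary, "w") as f:
            f.write("canary")
        os.remove(canary)  # clean up the canary immediately
        return True  # write access: model weights could be replaced
    except OSError:
        return False  # read-only mount or insufficient permissions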
Red Team Assessment Methodology
When assessing Kubernetes ML infrastructure, follow this systematic approach:
Phase 1: Reconnaissance
- Enumerate ML-specific CRDs (InferenceService, PyTorchJob, Notebook, Experiment); see the sketch after this list
- Identify GPU nodes and their operator configurations
- Map namespace topology and network policies
- Identify shared storage volumes and their access modes
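A minimal sketch of the CRD enumeration step follows. The API-group list covers common ML platforms (Kubeflow, KServe, Seldon) and is an assumption to extend for other installations:

# Minimal sketch: find CRDs that indicate ML platform components.
import subprocess
import json

ML_API_GROUPS = ("kubeflow.org", "serving.kserve.io", "machinelearning.seldon.io")

def enumerate_ml_crds():
    """List CRDs belonging to common ML platform API groups."""
    crds = json.loads(subprocess.check_output(
        ["kubectl", "get", "crd", "-o", "json"]
    ))
    return [
        crd["metadata"]["name"]
        for crd in crds["items"]
        # str.endswith accepts a tuple, covering nested groups too
        if crd["spec"]["group"].endswith(ML_API_GROUPS)
    ]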
Phase 2: Access Assessment
- Test service account permissions from ML pods
- Attempt cross-namespace network access to inference endpoints
- Probe GPU operator management interfaces
- Check for unauthenticated access to the KubeFlow dashboard (probe sketch below)
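For the dashboard check, the sketch below probes the in-cluster service directly; the centraldashboard.kubeflow address is the common default for a standard KubeFlow install and should be adjusted per deployment:

# Minimal sketch: probe the KubeFlow central dashboard for direct,
# unauthenticated exposure. Host is the assumed default service address.
import requests

def check_kubeflow_dashboard(host="centraldashboard.kubeflow.svc.cluster.local"):
    """Flag a KubeFlow dashboard that answers without authentication."""
    try:
        resp = requests.get(f"http://{host}", timeout=5, allow_redirects=False)
    except requests.RequestException:
        return None  # unreachable from this vantage point
    # A 200 without an auth redirect suggests the dashboard is exposed
    # directly, bypassing the Istio/Dex authentication layer
    return {"status": resp.status_code, "unauthenticated": resp.status_code == 200}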
Phase 3: Exploitation
- Attempt pipeline injection through KubeFlow Pipelines
- Test notebook server breakout via Kubernetes API access
- Probe shared storage for write access to model weights
- Assess distributed training inter-pod communication
Phase 4: Impact Demonstration
- Model replacement via shared PVC write access
- Credential extraction from ML platform secrets
- Cross-tenant inference access through missing network policies
- GPU resource starvation through priority manipulation
Related Topics
- Attacking AI Deployments -- foundational deployment security concepts
- Cloud AI Security -- cloud-specific ML infrastructure risks
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- GPU Cluster Attacks -- focused GPU compute exploitation
- ML Pipeline CI/CD Attacks -- attacking ML pipeline automation
References
- "Kubernetes Security and Observability" - Brendan Creane & Amit Gupta (O'Reilly, 2021) - Foundation for Kubernetes security assessment including RBAC, network policy, and runtime security
- "Hacking Kubernetes" - Andrew Martin & Michael Hausenblas (O'Reilly, 2022) - Practical Kubernetes attack techniques applicable to ML workloads
- NVIDIA GPU Operator Documentation (2025) - Official documentation covering GPU operator deployment, security considerations, and MIG configurations
- KubeFlow Security Documentation (2025) - KubeFlow multi-tenancy and security hardening guidance
- MITRE ATLAS - ML-specific threat framework including infrastructure-layer attacks on ML systems