Kubernetes Security for ML Workloads
Comprehensive analysis of Kubernetes attack surfaces specific to machine learning workloads, including GPU operator exploitation, model serving namespace attacks, and cluster-level privilege escalation through ML components.
Machine learning workloads on Kubernetes introduce a distinct set of security challenges that go beyond standard container orchestration risks. The combination of GPU scheduling requirements, specialized operators for training and serving, shared storage for model artifacts, and the privileged access patterns demanded by CUDA runtimes creates an attack surface that neither traditional Kubernetes security nor ML security fully addresses in isolation.
Device plugins are the foundational mechanism through which Kubernetes exposes GPU resources to ML workloads. The NVIDIA GPU Operator, AMD ROCm device plugin, and Intel GPU plugins all operate with elevated privileges that, when misconfigured, become high-value targets for lateral movement.
ML-Specific Kubernetes Architecture
A production ML platform on Kubernetes typically includes several additional layers beyond standard deployments:
┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ KubeFlow / │ │ Seldon / │ │ Training │ │
│ │ ML Platform │ │ KServe │ │ Operator │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────┘ │
│ │ │ │ │
│ ┌──────┴──────────────────┴──────────────────┴─────┐│
│ │ GPU Operator / Device Plugin ││
│ └──────────────────────┬───────────────────────────┘│
│ │ │
│ ┌──────────────────────┴───────────────────────────┐│
│ │ Node (GPU-enabled) ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ ││
│ │ │ CUDA │ │ Model │ │ Shared Storage │ ││
│ │ │ Runtime │ │ Weights │ │ (PVCs / NFS) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────┘ ││
│ └───────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Each component introduces specific attack vectors that are absent in standard Kubernetes deployments.
GPU Operator and Device Plugin Exploitation
NVIDIA GPU Operator Attack Surface
The NVIDIA GPU Operator runs as a privileged DaemonSet on every GPU node, managing driver installation, container runtime configuration, and device plugin lifecycle:
# Enumerate GPU operator components in a cluster
import subprocess
import json

def enumerate_gpu_operator(namespace="gpu-operator"):
    """Identify the GPU operator attack surface in a Kubernetes cluster."""
    components = {}
    # Find GPU operator pods and their privilege levels
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-o", "json"
    ]))
    for pod in pods["items"]:
        pod_name = pod["metadata"]["name"]
        containers = pod["spec"].get("containers", [])
        for container in containers:
            security_context = container.get("securityContext", {})
            # Key on pod/container so multi-container pods do not overwrite each other
            components[f'{pod_name}/{container["name"]}'] = {
                "privileged": security_context.get("privileged", False),
                "host_pid": pod["spec"].get("hostPID", False),
                "host_network": pod["spec"].get("hostNetwork", False),
                "volume_mounts": [
                    vm["mountPath"] for vm in container.get("volumeMounts", [])
                ],
                "capabilities": security_context.get("capabilities", {}),
            }
    return components

Device Plugin Socket Exploitation
GPU device plugins communicate with the kubelet through Unix sockets. If an attacker gains access to the node filesystem, these sockets can be manipulated:
# Default device plugin socket locations:
# /var/lib/kubelet/device-plugins/nvidia.sock
# /var/lib/kubelet/device-plugins/kubelet.sock
#
# An attacker with node access can:
# 1. Register a malicious device plugin that intercepts GPU allocation
# 2. Monitor device plugin traffic to observe which pods request GPUs
# 3. Manipulate device allocation to redirect GPU access

| Attack Vector | Prerequisite | Impact | Difficulty |
|---|---|---|---|
| Device plugin socket hijack | Node filesystem access | GPU allocation manipulation | High |
| GPU operator pod compromise | Namespace access to gpu-operator | Privileged code execution on all GPU nodes | Medium |
| CUDA runtime manipulation | Container escape or host access | Arbitrary code execution in GPU context | High |
| MIG partition manipulation | GPU operator access | Cross-tenant GPU memory access | Medium |
| Driver version downgrade | GPU operator config access | Introduction of known vulnerabilities | Low |
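The socket exposure that underpins the first attack vector in the table can be verified directly. Below is a minimal sketch, assuming a vantage point with node filesystem access (for example, a pod that mounts /var/lib/kubelet via hostPath); it only checks the default socket directory listed above and reports which device plugin sockets are present and writable:

# Minimal sketch: check device plugin socket exposure from a vantage
# point with node filesystem access. Paths are the defaults listed above.
import os
import stat

def check_device_plugin_sockets(base="/var/lib/kubelet/device-plugins"):
    """Report which device plugin sockets are reachable and writable."""
    results = []
    if not os.path.isdir(base):
        return results  # no node-level access from this vantage point
    for name in os.listdir(base):
        path = os.path.join(base, name)
        st = os.stat(path)
        if stat.S_ISSOCK(st.st_mode):
            results.append({
                "socket": path,
                "mode": oct(st.st_mode & 0o777),
                # Write access to kubelet.sock allows registering a
                # malicious device plugin with the kubelet
                "writable": os.access(path, os.W_OK),
            })
    return results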
KubeFlow Attack Surface
KubeFlow is a widely deployed ML platform on Kubernetes that introduces multiple high-value attack surfaces:
Notebook Server Exploitation
KubeFlow notebooks run as Kubernetes pods with direct cluster access. They are frequently provisioned with overly permissive service accounts:
# Probe KubeFlow notebook server for Kubernetes access
import requests
import os

def assess_notebook_k8s_access():
    """Assess Kubernetes API access from a KubeFlow notebook pod."""
    findings = []
    token = None
    # Check for a mounted service account token
    token_path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    if os.path.exists(token_path):
        with open(token_path) as f:
            token = f.read()
        findings.append({
            "finding": "Service account token accessible",
            "token_preview": token[:50] + "...",
        })
    # Attempt to list pods in the namespace
    namespace_path = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
    namespace = open(namespace_path).read() if os.path.exists(namespace_path) else "default"
    k8s_host = os.environ.get("KUBERNETES_SERVICE_HOST", "kubernetes.default.svc")
    k8s_port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    if token:
        try:
            resp = requests.get(
                f"https://{k8s_host}:{k8s_port}/api/v1/namespaces/{namespace}/pods",
                headers={"Authorization": f"Bearer {token}"},
                verify="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
                timeout=5,
            )
            if resp.status_code == 200:
                pod_names = [p["metadata"]["name"] for p in resp.json()["items"]]
                findings.append({
                    "finding": "Can list pods in namespace",
                    "severity": "HIGH",
                    "pods": pod_names,
                })
        except Exception as e:
            findings.append({"finding": "API access failed", "error": str(e)})
    # Check for GPU access from the notebook
    try:
        import torch
        if torch.cuda.is_available():
            findings.append({
                "finding": "GPU access from notebook",
                "gpu_count": torch.cuda.device_count(),
                "gpu_name": torch.cuda.get_device_name(0),
            })
    except ImportError:
        pass
    return findings

Pipeline Injection Attacks
KubeFlow Pipelines execute user-defined DAGs as Kubernetes pods. A compromised pipeline step can leverage the pipeline's service account to escalate privileges:
# Malicious pipeline component that exploits Kubernetes access
# This demonstrates the attack pattern - not for use in unauthorized testing
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
    - name: malicious-step
      container:
        image: python:3.11
        command: ["python", "-c"]
        args:
          - |
            # Pipeline pods often have broad namespace access
            # for reading artifacts, secrets, and config maps
            import subprocess
            # Enumerate accessible secrets
            result = subprocess.run(
                ["kubectl", "get", "secrets", "-A", "-o", "name"],
                capture_output=True, text=True
            )
            print(result.stdout)

Model Serving Namespace Attacks
KServe and Seldon Security Model
Model serving platforms like KServe and Seldon Core deploy inference services as Kubernetes resources with predictable naming conventions and network patterns:
# Enumerate model serving endpoints across namespaces
import subprocess
import json

def enumerate_inference_services():
    """Find all inference services in a Kubernetes cluster."""
    services = {}
    # KServe InferenceService resources
    try:
        kserve = json.loads(subprocess.check_output([
            "kubectl", "get", "inferenceservices", "--all-namespaces", "-o", "json"
        ]))
        for item in kserve["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            url = item.get("status", {}).get("url", "unknown")
            services[f"{ns}/{name}"] = {
                "type": "kserve",
                "url": url,
                "ready": item.get("status", {}).get("conditions", []),
            }
    except subprocess.CalledProcessError:
        pass
    # Seldon deployments
    try:
        seldon = json.loads(subprocess.check_output([
            "kubectl", "get", "seldondeployments", "--all-namespaces", "-o", "json"
        ]))
        for item in seldon["items"]:
            name = item["metadata"]["name"]
            ns = item["metadata"]["namespace"]
            services[f"{ns}/{name}"] = {
                "type": "seldon",
                "replicas": item["spec"].get("replicas", 1),
                "predictors": [
                    p["name"] for p in item["spec"].get("predictors", [])
                ],
            }
    except subprocess.CalledProcessError:
        pass
    return services

Cross-Namespace Model Access
In multi-tenant ML platforms, teams typically deploy models to separate namespaces. However, several common misconfigurations enable cross-namespace access:
| Misconfiguration | Description | Exploitation |
|---|---|---|
| Missing NetworkPolicy | No network isolation between ML namespaces | Direct HTTP access to other teams' inference endpoints |
| Shared model storage PVC | Multiple namespaces mount the same PV | Read or overwrite other teams' model weights |
| Overpermissive Istio/Envoy rules | Service mesh allows cross-namespace traffic | Intercept or redirect inference requests |
| Global model registry access | All namespaces can pull from the same registry | Poison models used by other teams |
| Shared secrets for cloud storage | S3/GCS credentials shared across namespaces | Access training data and model artifacts |
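The first misconfiguration in the table is straightforward to verify from inside the cluster. Below is a minimal sketch that probes another namespace's KServe predictor over cluster DNS; the `<name>-predictor` service name and the `/v1/models/` route follow common KServe defaults and are assumptions that may differ per installation:

# Minimal sketch: from a pod in one namespace, test whether a KServe
# predictor in another namespace is reachable (i.e. no NetworkPolicy
# blocks cross-namespace traffic). Service name and route are assumed
# KServe defaults and may differ per installation.
import requests

def probe_cross_namespace_inference(model_name, target_namespace):
    """Return True if another team's inference endpoint serves us metadata."""
    url = (
        f"http://{model_name}-predictor.{target_namespace}"
        f".svc.cluster.local/v1/models/{model_name}"
    )
    try:
        resp = requests.get(url, timeout=5)
    except requests.RequestException:
        return False  # traffic blocked, DNS failure, or service absent
    # Any response proves cross-namespace reachability; a 200 with model
    # metadata additionally proves unauthenticated inference access
    return resp.status_code == 200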
Training Job Security
Distributed Training Attack Surface
Distributed training with frameworks like Horovod, PyTorch Distributed, or DeepSpeed creates inter-pod communication channels that expand the attack surface:
# Assess distributed training network exposure
import subprocess
import json

def assess_distributed_training_security(namespace="training"):
    """Check for insecure distributed training configurations."""
    findings = []
    # Find training pods with open communication ports
    pods = json.loads(subprocess.check_output([
        "kubectl", "get", "pods", "-n", namespace,
        "-l", "training-job",
        "-o", "json"
    ]))
    for pod in pods["items"]:
        containers = pod["spec"].get("containers", [])
        for container in containers:
            ports = container.get("ports", [])
            env_vars = {
                e["name"]: e.get("value", "")
                for e in container.get("env", [])
            }
            # MASTER_ADDR/MASTER_PORT identify the rendezvous endpoint for
            # torch.distributed; NCCL and Gloo traffic flows between pods
            master_addr = env_vars.get("MASTER_ADDR")
            master_port = env_vars.get("MASTER_PORT")
            if master_addr and master_port:
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "master_addr": master_addr,
                    "master_port": master_port,
                    "container_ports": [p.get("containerPort") for p in ports],
                    "severity": "MEDIUM",
                    "note": "Distributed training master endpoint exposed",
                })
        # Check for shared memory mounts (required for NCCL)
        volumes = pod["spec"].get("volumes", [])
        for vol in volumes:
            if vol.get("emptyDir", {}).get("medium") == "Memory":
                findings.append({
                    "pod": pod["metadata"]["name"],
                    "finding": "Shared memory (dshm) mount detected",
                    "note": "Required for NCCL but may allow cross-container data access",
                })
    return findings

Training Operator Exploitation
Kubernetes training operators (PyTorchJob, TFJob, MPIJob) manage the lifecycle of training jobs. Compromising the operator grants control over all training workloads:
| Operator | CRD | Attack Surface |
|---|---|---|
| PyTorch Operator | PyTorchJob | Master/worker pod creation, GPU allocation |
| TensorFlow Operator | TFJob | PS/worker topology, checkpoint paths |
| MPI Operator | MPIJob | SSH key distribution, launcher access |
| Volcano Scheduler | Queue, PodGroup | Priority manipulation, resource starvation |
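A quick way to gauge this attack surface is to inspect the job custom resources themselves. Below is a minimal sketch that lists training job CRs and flags pod templates that weaken isolation; the resource plurals (pytorchjobs, tfjobs, mpijobs) are the standard Kubeflow training-operator names, and the generic JSON scan is an assumption that sidesteps each kind's differing replica-spec layout:

# Minimal sketch: list training job CRs and flag risky pod templates
# (privileged containers, hostNetwork). Resource names are standard
# Kubeflow training-operator plurals; adjust for your installation.
import subprocess
import json

def audit_training_jobs():
    """Flag training job CRs whose pod templates weaken isolation."""
    risky = []
    for resource in ["pytorchjobs", "tfjobs", "mpijobs"]:
        proc = subprocess.run(
            ["kubectl", "get", resource, "-A", "-o", "json"],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:
            continue  # CRD not installed in this cluster
        for item in json.loads(proc.stdout)["items"]:
            # Generic scan of the serialized spec, since each job kind
            # nests its pod template under a different replica-spec key
            blob = json.dumps(item["spec"])
            if '"privileged": true' in blob or '"hostNetwork": true' in blob:
                risky.append({
                    "kind": resource,
                    "name": item["metadata"]["name"],
                    "namespace": item["metadata"]["namespace"],
                })
    return risky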
RBAC Misconfigurations in ML Platforms
ML platforms frequently require broad RBAC permissions that violate least-privilege principles:
# Common overly permissive RBAC for ML platform service accounts
# This pattern appears frequently in KubeFlow and similar platforms
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ml-platform-admin
rules:
  # Broad pod management for training jobs
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["*"]
  # Secret access for model registry credentials
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  # PVC management for datasets and models
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
  # CRD access for ML operators
  - apiGroups: ["kubeflow.org"]
    resources: ["*"]
    verbs: ["*"]

RBAC Assessment Checklist
def audit_ml_rbac(namespace="kubeflow"):
    """Audit RBAC permissions for ML platform service accounts."""
    import subprocess
    import json
    dangerous_permissions = []
    # Get all service accounts in the ML namespace
    sa_list = json.loads(subprocess.check_output([
        "kubectl", "get", "serviceaccounts", "-n", namespace, "-o", "json"
    ]))
    for sa in sa_list["items"]:
        sa_name = sa["metadata"]["name"]
        # Check what this SA can do using kubectl auth can-i
        for resource in ["secrets", "pods/exec", "configmaps", "nodes"]:
            for verb in ["get", "list", "create", "delete"]:
                result = subprocess.run(
                    ["kubectl", "auth", "can-i", verb, resource,
                     "--as", f"system:serviceaccount:{namespace}:{sa_name}",
                     "-n", namespace],
                    capture_output=True, text=True
                )
                if result.stdout.strip() == "yes":
                    dangerous_permissions.append({
                        "service_account": f"{namespace}:{sa_name}",
                        "resource": resource,
                        "verb": verb,
                    })
    return dangerous_permissions

Shared Storage Attacks
ML workloads rely heavily on shared storage for datasets, model weights, checkpoints, and artifacts. This creates cross-pod and cross-namespace attack opportunities:
PersistentVolume Claim Misconfigurations
# Commonly misconfigured shared storage for ML workloads
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
    - ReadWriteMany   # Multiple pods can write simultaneously
  resources:
    requests:
      storage: 500Gi
  storageClassName: nfs-client   # NFS often lacks access controls

| Storage Attack | Vector | Impact |
|---|---|---|
| Model weight replacement | Write access to shared PVC | Serve a backdoored model |
| Training data poisoning | Write access to dataset PVC | Corrupt training runs |
| Checkpoint manipulation | Write access to checkpoint directory | Hijack training from a specific point |
| Log exfiltration | Read access to experiment logs | Extract hyperparameters, metrics, data samples |
| Credential harvesting | Read access to config mounts | Obtain cloud storage keys, API tokens |
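The model weight replacement vector in the table reduces to a simple write test. Below is a minimal sketch that creates and removes a canary file on the shared volume; the "/models" mount path is a placeholder assumption for wherever the shared PVC is mounted in a given pod:

# Minimal sketch: test whether this pod can modify a shared model volume.
# "/models" is a placeholder for the actual PVC mount path.
import os
import uuid

def test_model_volume_write(mount_path="/models"):
    """Return True if the shared model volume is writable from this pod."""
    canary = os.path.join(mount_path, f".write-test-{uuid.uuid4().hex}")
    try:
        with open(canary, "w") as f:
            f.write("canary")
        os.remove(canary)  # clean up the canary immediately
        return True  # write access: model weights could be replaced
    except OSError:
        return False  # read-only mount or insufficient permissions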
Red Team Assessment Methodology
When assessing Kubernetes ML infrastructure, follow this systematic approach:
Phase 1: Reconnaissance
- Enumerate ML-specific CRDs (InferenceService, PyTorchJob, Notebook, Experiment); see the sketch after this list
- Identify GPU nodes and their operator configurations
- Map namespace topology and network policies
- Identify shared storage volumes and their access modes
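A minimal sketch of the CRD enumeration step follows. The API-group list covers common ML platforms (Kubeflow, KServe, Seldon) and is an assumption to extend for other installations:

# Minimal sketch: find CRDs that indicate ML platform components.
import subprocess
import json

ML_API_GROUPS = ("kubeflow.org", "serving.kserve.io", "machinelearning.seldon.io")

def enumerate_ml_crds():
    """List CRDs belonging to common ML platform API groups."""
    crds = json.loads(subprocess.check_output(
        ["kubectl", "get", "crd", "-o", "json"]
    ))
    return [
        crd["metadata"]["name"]
        for crd in crds["items"]
        # str.endswith accepts a tuple, covering nested groups too
        if crd["spec"]["group"].endswith(ML_API_GROUPS)
    ]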
Phase 2: Access Assessment
- Test service account permissions from ML pods
- Attempt cross-namespace network access to inference endpoints
- Probe GPU operator management interfaces
- Check for unauthenticated access to the KubeFlow dashboard (probe sketch below)
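For the dashboard check, the sketch below probes the in-cluster service directly; the centraldashboard.kubeflow address is the common default for a standard KubeFlow install and should be adjusted per deployment:

# Minimal sketch: probe the KubeFlow central dashboard for direct,
# unauthenticated exposure. Host is the assumed default service address.
import requests

def check_kubeflow_dashboard(host="centraldashboard.kubeflow.svc.cluster.local"):
    """Flag a KubeFlow dashboard that answers without authentication."""
    try:
        resp = requests.get(f"http://{host}", timeout=5, allow_redirects=False)
    except requests.RequestException:
        return None  # unreachable from this vantage point
    # A 200 without an auth redirect suggests the dashboard is exposed
    # directly, bypassing the Istio/Dex authentication layer
    return {"status": resp.status_code, "unauthenticated": resp.status_code == 200}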
Phase 3: Exploitation
- Attempt pipeline injection through KubeFlow Pipelines
- Test notebook server breakout via Kubernetes API access
- Probe shared storage for write access to model weights
- Assess distributed training inter-pod communication
Phase 4: Impact Demonstration
- Model replacement via shared PVC write access
- Credential extraction from ML platform secrets
- Cross-tenant inference access through missing network policies
- GPU resource starvation through priority manipulation
Related Topics
- Attacking AI Deployments -- foundational deployment security concepts
- Cloud AI Security -- cloud-specific ML infrastructure risks
- Infrastructure Exploitation -- advanced infrastructure attack techniques
- GPU Cluster Attacks -- focused GPU compute exploitation
- ML Pipeline CI/CD Attacks -- attacking ML pipeline automation
References
- "Kubernetes Security and Observability" - Brendan Creane & Amit Gupta (O'Reilly, 2021) - Foundation for Kubernetes security assessment including RBAC, network policy, and runtime security
- "Hacking Kubernetes" - Andrew Martin & Michael Hausenblas (O'Reilly, 2022) - Practical Kubernetes attack techniques applicable to ML workloads
- NVIDIA GPU Operator Documentation (2025) - Official documentation covering GPU operator deployment, security considerations, and MIG configurations
- KubeFlow Security Documentation (2025) - KubeFlow multi-tenancy and security hardening guidance
- MITRE ATLAS - ML-specific threat framework including infrastructure-layer attacks on ML systems