Attacks on AI workload schedulers

Advanced | 18 min read | Updated 2026-03-21

Abusing Slurm, Kubernetes, and custom schedulers to hijack GPU resources, poison training jobs, and achieve lateral movement in AI clusters

infrastructure scheduling slurm kubernetes gpu-clusters lateral-movement

Overview

AI workload schedulers are the control plane for GPU compute: they determine which training jobs run, on which hardware, at what priority, and with access to which data. In high-performance computing (HPC) environments, Slurm dominates as the workload manager for multi-node GPU training. In cloud-native deployments, Kubernetes manages the allocation of GPU resources to AI workloads using custom schedulers, gang scheduling plugins (Volcano, Kueue), and GPU operators.

Compromising the scheduler, or abusing its trust assumptions, gives an attacker extraordinary leverage. From a foothold in scheduling, an attacker can hijack expensive GPU resources for cryptomining or unauthorized training, intercept or modify training jobs to poison models, gain access to training data through job impersonation, and move laterally through the cluster by scheduling privileged workloads on target nodes. The scheduler is often the most valuable target in an AI cluster because it has broad access to all nodes, all job configurations, and all user credentials.

Despite their critical role, AI workload schedulers are often inadequately secured. Slurm clusters commonly rely on a shared munge authentication key that, if compromised, grants full cluster access. Kubernetes schedulers may allow unprivileged users to request GPU resources and tolerations that place their pods on GPU nodes with reduced isolation. Custom scheduling plugins introduce their own authentication and authorization models, which may not have been reviewed as thoroughly as the core scheduler code.

This article examines the attack surface of AI workload schedulers from both the HPC (Slurm) and cloud-native (Kubernetes) perspectives, demonstrates practical exploitation techniques, and provides hardening guidance.

Slurm attack surface

Architecture and trust model

Slurm (Simple Linux Utility for Resource Management) consists of several daemons:

slurmctld (controller): Central management daemon that maintains cluster state, the job queue, and scheduling decisions. Runs on a head node.
slurmd (compute node daemon): Runs on every compute node, receives job assignments from slurmctld, and manages job execution.
slurmdbd (database daemon): Stores accounting and job history data.
munge: Provides authentication between Slurm daemons using a shared symmetric key.

The critical trust assumption is that munge authentication uses a single shared key (/etc/munge/munge.key) that must be identical on all nodes. Any process with read access to this key can forge authentication credentials for any user, which makes the key effectively a cluster-wide root key.

"""
Slurm cluster security audit script.
Checks for common misconfigurations and attack vectors in
Slurm-managed AI/GPU clusters.
"""
 
import subprocess
import os
import stat
import json
import re
from pathlib import Path
from typing import Optional
 
class SlurmAuditor:
    """Security auditor for Slurm-managed AI clusters."""
 
    def __init__(self):
        self.findings: list[dict] = []
 
    def _add(self, severity: str, title: str, detail: str) -> None:
        self.findings.append({
            "severity": severity, "title": title, "detail": detail,
        })
 
    def check_munge_key_permissions(self) -> None:
        """
        Check munge key file permissions.
        The key should be owned by munge:munge with mode 0400.
        Any broader permissions allow credential forging.
        """
        key_path = Path("/etc/munge/munge.key")
        if not key_path.exists():
            self._add("INFO", "Munge key not found", "Not a Slurm node or non-standard path")
            return
 
        st = key_path.stat()
        mode = stat.S_IMODE(st.st_mode)
 
        if mode != 0o400:
            self._add(
                "CRITICAL",
                f"Munge key has unsafe permissions: {oct(mode)}",
                "The munge key should be mode 0400 (owner read only). "
                "Any user who can read this key can forge authentication "
                "credentials for any cluster user, including root.",
            )
 
        # Check whether the key is readable by the current user (non-root test)
        if os.access(key_path, os.R_OK) and os.getuid() != 0:
            self._add(
                "CRITICAL",
                "Munge key readable by current non-root user",
                "Current user can read /etc/munge/munge.key. This allows "
                "impersonating any user in the Slurm cluster.",
            )
 
    def check_slurm_config_security(self) -> None:
        """Audit slurm.conf for security-relevant settings."""
        config_paths = [
            Path("/etc/slurm/slurm.conf"),
            Path("/etc/slurm-llnl/slurm.conf"),
        ]
        config_path = None
        for p in config_paths:
            if p.exists():
                config_path = p
                break
 
        if config_path is None:
            self._add("INFO", "slurm.conf not found", "Non-standard path")
            return
 
        content = config_path.read_text()
 
        # Check AccountingStorageEnforce
        if "AccountingStorageEnforce" not in content:
            self._add(
                "HIGH",
                "AccountingStorageEnforce not set",
                "Without this, users can run jobs without accounting limits, "
                "bypassing resource quotas and fair-share scheduling.",
            )
 
        # Check whether partitions restrict who may submit jobs
        # (slurm.conf partitions use AllowGroups/AllowAccounts for this)
        if "AllowGroups" not in content and "AllowAccounts" not in content:
            self._add(
                "MEDIUM",
                "No group/account restrictions in Slurm config",
                "Any system user can submit jobs. Consider restricting "
                "partitions to authorized users via AllowGroups or AllowAccounts.",
            )
 
        # Check Prolog/Epilog scripts (attack surface)
        prolog_match = re.search(r'^Prolog\s*=\s*(.+)$', content, re.MULTILINE)
        epilog_match = re.search(r'^Epilog\s*=\s*(.+)$', content, re.MULTILINE)
 
        for label, match in [("Prolog", prolog_match), ("Epilog", epilog_match)]:
            if match:
                script_path = Path(match.group(1).strip())
                if script_path.exists():
                    script_st = script_path.stat()
                    if script_st.st_mode & stat.S_IWOTH:
                        self._add(
                            "CRITICAL",
                            f"{label} script is world-writable: {script_path}",
                            f"The {label} script runs as root on compute nodes "
                            f"before/after every job. A world-writable script "
                            f"allows arbitrary code execution as root.",
                        )
 
        # Check TaskPlugin (job isolation)
        if "task/cgroup" not in content:
            self._add(
                "HIGH",
                "cgroup task plugin not enabled",
                "Without task/cgroup, jobs have unrestricted access to "
                "system resources. GPU isolation, memory limits, and CPU "
                "affinity are not enforced.",
            )
 
        # Check GPU auto-detection
        if "AutoDetect=nvml" not in content and "AutoDetect=rsmi" not in content:
            self._add(
                "LOW",
                "GPU auto-detection not configured",
                "Manual GPU configuration may lead to inconsistent "
                "GPU allocation and tracking.",
            )
 
    def check_job_submission_abuse(self) -> None:
        """Test for job submission vulnerabilities."""
        # Check whether we can submit jobs with elevated privileges
        try:
            result = subprocess.run(
                ["scontrol", "show", "config"],
                capture_output=True, text=True, timeout=10,
            )
            if result.returncode == 0:
                config_text = result.stdout
 
                # Check whether job containers are allowed
                if "JobContainerType" in config_text:
                    self._add(
                        "MEDIUM",
                        "Job containers enabled",
                        "Users may be able to specify custom container "
                        "images for jobs, potentially pulling malicious images.",
                    )
 
                # Check whether the user can request specific nodes
                if "SchedulerParameters" in config_text:
                    if "no_node_select" not in config_text.lower():
                        self._add(
                            "MEDIUM",
                            "Users can request specific nodes",
                            "Job submissions can target specific nodes. "
                            "An attacker can target high-value nodes "
                            "containing sensitive data or models.",
                        )
 
        except FileNotFoundError:
            pass  # Not a Slurm node
        except subprocess.TimeoutExpired:
            pass
 
    def check_gpu_isolation(self) -> None:
        """Check GPU isolation between jobs."""
        # Check cgroup enforcement of GPU device access
        cgroup_conf = Path("/etc/slurm/cgroup.conf")
        if cgroup_conf.exists():
            content = cgroup_conf.read_text()
            if "ConstrainDevices=yes" not in content:
                self._add(
                    "HIGH",
                    "GPU device constraints not enforced",
                    "Without ConstrainDevices=yes in cgroup.conf, "
                    "jobs can access GPUs not allocated to them, "
                    "enabling cross-job GPU memory snooping.",
                )
        else:
            self._add(
                "HIGH",
                "cgroup.conf not found",
                "No cgroup configuration found. GPU and memory isolation "
                "between jobs may not be enforced.",
            )
 
    def check_shared_filesystems(self) -> None:
        """Identify shared filesystems that create trust boundaries."""
        shared_mounts = []
        try:
            with open("/proc/mounts", "r") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) >= 3:
                        mount_point = parts[1]
                        fs_type = parts[2]
                        if fs_type in ("nfs", "nfs4", "lustre", "gpfs", "beegfs"):
                            shared_mounts.append((mount_point, fs_type))
        except IOError:
            return
 
        for mount_point, fs_type in shared_mounts:
            self._add(
                "MEDIUM",
                f"Shared filesystem: {mount_point} ({fs_type})",
                f"Shared {fs_type} mount at {mount_point}. Files here are "
                f"accessible across nodes. Training data, model checkpoints, "
                f"and job scripts on shared filesystems can be modified by "
                f"any user with write access, regardless of node isolation.",
            )
 
    def run_audit(self) -> list[dict]:
        """Run complete Slurm security audit."""
        self.findings = []
        self.check_munge_key_permissions()
        self.check_slurm_config_security()
        self.check_job_submission_abuse()
        self.check_gpu_isolation()
        self.check_shared_filesystems()
        return self.findings
 
if __name__ == "__main__":
    auditor = SlurmAuditor()
    findings = auditor.run_audit()
 
    for f in findings:
        print(f"[{f['severity']}] {f['title']}")
        print(f"  {f['detail']}\n")

Slurm job injection and hijacking

An attacker with access to the Slurm cluster (either through legitimate credentials or a compromised munge key) can manipulate jobs in several ways:

Job script injection via shared filesystems: Training jobs reference scripts stored on shared NFS or Lustre filesystems. If an attacker can write to these filesystems, they can modify job scripts between submission and execution. The time-of-check-to-time-of-use (TOCTOU) window between sbatch submission and actual execution can range from seconds to hours, depending on queue wait times.

Priority manipulation: Slurm's fair-share scheduler uses an account hierarchy with priorities. An attacker who can change their account's fair-share allocation, or add themselves to a high-priority account, can preempt other users' training jobs, causing denial of service or gaining access to GPU resources.

Prolog/Epilog script abuse: Slurm runs Prolog scripts as root before each job and Epilog scripts afterward. If an attacker can modify these scripts (through write access to a shared filesystem or a writable script path), they achieve root code execution on every compute node that runs a job.
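
One way to narrow the TOCTOU window described above is to record a digest of each script a job references at submission time and to re-check it immediately before execution. The following is a minimal sketch, not a Slurm feature: `record_digest`, `verify_digest`, and the JSON manifest are hypothetical, and the manifest is assumed to live somewhere the attacker cannot write (for example, a root-owned location consulted from a Prolog script).

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_digest(script: Path, manifest: Path) -> str:
    """Store the script's digest at submission time (hypothetical helper)."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    digest = sha256_of(script)
    entries[str(script.resolve())] = digest
    manifest.write_text(json.dumps(entries, indent=2))
    return digest


def verify_digest(script: Path, manifest: Path) -> bool:
    """Return True only if the script still matches its recorded digest."""
    if not manifest.exists():
        return False
    entries = json.loads(manifest.read_text())
    expected = entries.get(str(script.resolve()))
    return expected is not None and expected == sha256_of(script)
```

A script that was never recorded fails verification, so the check fails closed. This only detects tampering; it does not prevent a writer with access to both the script and the manifest from updating both.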

Attacks on the Kubernetes scheduler for AI workloads

GPU scheduling attack vectors

Kubernetes manages GPU allocation through the device plugin framework. NVIDIA's k8s-device-plugin advertises GPU resources to the kubelet, and the scheduler assigns pods to nodes based on GPU availability. Several attack vectors target this process:

"""
Kubernetes GPU scheduling security audit.
Identifies misconfigurations that allow GPU resource abuse,
privilege escalation, and cross-tenant attacks in AI clusters.
"""
 
import subprocess
import json
from typing import Any
 
class K8sGPUSchedulingAuditor:
    """Audit Kubernetes GPU scheduling for AI workloads."""
 
    def __init__(self, namespace: str = ""):
        self.namespace = namespace
        self.findings: list[dict] = []
 
    def _kubectl(self, *args: str) -> dict[str, Any]:
        """Run kubectl and return parsed JSON output."""
        cmd = ["kubectl"]
        if self.namespace:
            cmd.extend(["-n", self.namespace])
        cmd.extend(list(args) + ["-o", "json"])
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return {}
        return json.loads(result.stdout)
 
    def check_gpu_resource_quotas(self) -> None:
        """Verify GPU resource quotas exist to prevent resource squatting."""
        quotas = self._kubectl("get", "resourcequotas")
        items = quotas.get("items", [])
 
        gpu_quota_exists = False
        for quota in items:
            hard = quota.get("spec", {}).get("hard", {})
            for key in hard:
                if "gpu" in key.lower():
                    gpu_quota_exists = True
                    break
 
        if not gpu_quota_exists:
            self.findings.append({
                "severity": "HIGH",
                "title": "No GPU resource quotas defined",
                "detail": (
                    "Without GPU quotas, a single user can monopolize "
                    "all GPU resources by submitting many pods. This "
                    "enables denial-of-service against other training jobs."
                ),
            })
 
    def check_pod_security_for_gpu_workloads(self) -> None:
        """
        Check if GPU pods run with excessive privileges.
        GPU workloads often require elevated permissions but
        these should be minimized.
        """
        pods = self._kubectl("get", "pods")
 
        for pod in pods.get("items", []):
            name = pod["metadata"]["name"]
            spec = pod.get("spec", {})
 
            for container in spec.get("containers", []):
                resources = container.get("resources", {})
                limits = resources.get("limits", {})
 
                has_gpu = any(
                    "gpu" in k.lower() for k in limits
                )
                if not has_gpu:
                    continue
 
                # Check the security context of GPU pods
                sec = container.get("securityContext", {})
                pod_sec = spec.get("securityContext", {})
 
                if sec.get("privileged", False):
                    self.findings.append({
                        "severity": "CRITICAL",
                        "title": f"Privileged GPU pod: {name}/{container['name']}",
                        "detail": (
                            "Privileged GPU pods can access all host devices, "
                            "escape container isolation, access other pods' "
                            "GPU memory, and compromise the host node."
                        ),
                    })
 
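                # Resolve the effective settings: container level overrides pod level.
                # An explicit non-zero runAsUser is not root, even without runAsNonRoot.
                run_as_user = sec.get("runAsUser", pod_sec.get("runAsUser"))
                non_root = sec.get("runAsNonRoot", pod_sec.get("runAsNonRoot", False))
                if run_as_user == 0 or (run_as_user is None and not non_root):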
                    self.findings.append({
                        "severity": "HIGH",
                        "title": f"GPU pod runs as root: {name}/{container['name']}",
                        "detail": (
                            "Running as root inside a GPU container increases "
                            "the impact of container escape vulnerabilities."
                        ),
                    })
 
                # Check volume mounts for sensitive paths
                vol_mounts = container.get("volumeMounts", [])
                sensitive_paths = [
                    "/var/run/docker.sock",
                    "/var/run/containerd",
                    "/proc/sys",
                    "/dev",
                ]
                for vm in vol_mounts:
                    mount_path = vm.get("mountPath", "")
                    for sp in sensitive_paths:
                        if mount_path.startswith(sp):
                            self.findings.append({
                                "severity": "HIGH",
                                "title": (
                                    f"Sensitive mount in GPU pod: "
                                    f"{name} -> {mount_path}"
                                ),
                                "detail": (
                                    f"Volume mount {mount_path} provides "
                                    f"access to host resources that could "
                                    f"enable container escape."
                                ),
                            })
 
    def check_tolerations_abuse(self) -> None:
        """
        Check for pods with tolerations that allow scheduling
        on GPU nodes that should be restricted.
        """
        pods = self._kubectl("get", "pods")
 
        for pod in pods.get("items", []):
            name = pod["metadata"]["name"]
            tolerations = pod.get("spec", {}).get("tolerations", [])
 
            for toleration in tolerations:
                key = toleration.get("key", "")
                operator = toleration.get("operator", "")
 
                # A wildcard toleration matches every taint
                if operator == "Exists" and key == "":
                    self.findings.append({
                        "severity": "HIGH",
                        "title": f"Wildcard toleration: {name}",
                        "detail": (
                            "Pod tolerates all taints and can be scheduled "
                            "on any node including GPU nodes, control plane "
                            "nodes, and nodes tainted for specific workloads."
                        ),
                    })
 
                # GPU-specific tolerations
                if "gpu" in key.lower() or "nvidia" in key.lower():
                    has_gpu = any(
                        "gpu" in k.lower()
                        for c in pod.get("spec", {}).get("containers", [])
                        for k in c.get("resources", {}).get("limits", {})
                    )
                    if not has_gpu:
                        self.findings.append({
                            "severity": "MEDIUM",
                            "title": (
                                f"Non-GPU pod on GPU node: {name}"
                            ),
                            "detail": (
                                f"Pod has GPU node toleration ({key}) but "
                                f"doesn't request GPU resources. It may be "
                                f"occupying GPU node capacity or attempting "
                                f"to access GPU devices directly."
                            ),
                        })
 
    def check_priority_classes(self) -> None:
        """Audit PriorityClasses for scheduling abuse potential."""
        result = subprocess.run(
            ["kubectl", "get", "priorityclasses", "-o", "json"],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return
 
        pcs = json.loads(result.stdout)
        for pc in pcs.get("items", []):
            name = pc["metadata"]["name"]
            value = pc.get("value", 0)
            preemption = pc.get("preemptionPolicy", "PreemptLowerPriority")
 
            if value > 1000000 and preemption == "PreemptLowerPriority":
                self.findings.append({
                    "severity": "MEDIUM",
                    "title": f"High-priority preempting class: {name} ({value})",
                    "detail": (
                        "This PriorityClass can preempt lower-priority pods. "
                        "If users can reference it, they can evict other "
                        "training jobs to claim their GPU resources."
                    ),
                })
 
    def run_audit(self) -> list[dict]:
        """Run all GPU scheduling audit checks."""
        self.findings = []
        self.check_gpu_resource_quotas()
        self.check_pod_security_for_gpu_workloads()
        self.check_tolerations_abuse()
        self.check_priority_classes()
        return self.findings
 
if __name__ == "__main__":
    import sys
    ns = sys.argv[1] if len(sys.argv) > 1 else ""
    auditor = K8sGPUSchedulingAuditor(namespace=ns)
    findings = auditor.run_audit()
    for f in findings:
        print(f"[{f['severity']}] {f['title']}")
        print(f"  {f['detail']}\n")

Gang scheduling exploits

AI training jobs often require multiple GPUs spread across multiple nodes, all running simultaneously (gang scheduling). Frameworks such as Volcano and Kueue coordinate these multi-pod scheduling decisions. Exploits include:

Deadlock injection: Submit multiple gang-scheduled jobs that each hold part of the resources the other needs, creating cluster-wide resource deadlocks.
Resource fragmentation: Submit many small jobs that fragment GPU availability so that large multi-node training jobs cannot be scheduled.
Queue priority manipulation: In Volcano, queue priorities determine which workloads are scheduled first. If queue definitions are not protected by RBAC, an attacker can create or modify queues to prioritize their own workloads.
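
The deadlock scenario can be illustrated with a toy simulation that involves no scheduler at all. The sketch below assumes that workers are bound one at a time in round-robin order and that a job keeps the GPUs it holds until all of its workers are placed. `simulate_incremental_binding` and `stuck_jobs` are hypothetical names, and real schedulers such as Volcano apply their own gang logic, which may avoid this outcome.

```python
def simulate_incremental_binding(
    total_gpus: int, demands: dict[str, int]
) -> dict[str, int]:
    """Hand out GPUs one at a time, round-robin, until none are left
    or every job holds its full demand. Returns GPUs held per job."""
    held = {job: 0 for job in demands}
    free = total_gpus
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for job, need in demands.items():
            if free > 0 and held[job] < need:
                held[job] += 1
                free -= 1
                progressed = True
    return held


def stuck_jobs(held: dict[str, int], demands: dict[str, int]) -> list[str]:
    """Jobs that hold fewer GPUs than they need and so can never start."""
    return sorted(job for job in demands if held[job] < demands[job])


if __name__ == "__main__":
    # Two jobs that each want N/2 + 1 GPUs on a 16-GPU cluster.
    demands = {"deadlock-a": 9, "deadlock-b": 9}
    held = simulate_incremental_binding(16, demands)
    print(held, stuck_jobs(held, demands))
```

Under these assumptions each job ends up holding 8 of the 9 GPUs it needs, so neither can start and no GPUs remain for anyone else. All-or-nothing admission of the whole gang is the property that prevents this state.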

"""
Gang scheduling attack simulation for AI clusters.
Demonstrates resource fragmentation and deadlock injection
against Volcano-based Kubernetes GPU scheduling.
"""
 
import subprocess
import json
import time
from typing import Optional
 
class GangSchedulingAttacker:
    """
    Simulate attacks against gang scheduling systems
    used in distributed AI training.
    """
 
    def __init__(self, namespace: str = "ai-training"):
        self.namespace = namespace
 
    def _kubectl_apply(self, manifest: str) -> bool:
        """Apply a Kubernetes manifest."""
        result = subprocess.run(
            ["kubectl", "apply", "-f", "-"],
            input=manifest, capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
 
    def generate_fragmentation_jobs(
        self,
        num_jobs: int = 20,
        gpus_per_job: int = 1,
    ) -> list[str]:
        """
        Generate many small GPU jobs designed to fragment
        cluster GPU resources, preventing large multi-GPU
        training jobs from being scheduled.
 
        The attacker submits many 1-GPU jobs across different nodes,
        leaving each node with insufficient contiguous GPUs for a
        large distributed training job.
        """
        manifests = []
        for i in range(num_jobs):
            manifest = f"""
apiVersion: batch/v1
kind: Job
metadata:
  name: fragment-{i:03d}
  namespace: {self.namespace}
  labels:
    attack-type: fragmentation
spec:
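  template:
    metadata:
      labels:
        # The spread constraint below selects pods by this label,
        # so it must be set on the pod template, not only on the Job.
        attack-type: fragmentation
    spec: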
      restartPolicy: Never
      containers:
      - name: gpu-holder
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: {gpus_per_job}
      # Spread across different nodes to maximize fragmentation
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            attack-type: fragmentation
"""
            manifests.append(manifest)
        return manifests
 
    def generate_deadlock_jobs(
        self,
        total_gpus_available: int = 16,
    ) -> tuple[str, str]:
        """
        Generate two Volcano gang-scheduled jobs that will deadlock.
        Job A requests N/2+1 GPUs, Job B requests N/2+1 GPUs.
        Since N/2+1 + N/2+1 > N, both jobs partially schedule
        and then wait forever for the remaining resources.
        """
        gpus_per_job = total_gpus_available // 2 + 1
 
        job_a = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deadlock-a
  namespace: {self.namespace}
spec:
  minAvailable: {gpus_per_job}
  schedulerName: volcano
  tasks:
  - replicas: {gpus_per_job}
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: nvidia/cuda:12.0.0-base-ubuntu22.04
          command: ["sleep", "7200"]
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
"""
 
        job_b = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deadlock-b
  namespace: {self.namespace}
spec:
  minAvailable: {gpus_per_job}
  schedulerName: volcano
  tasks:
  - replicas: {gpus_per_job}
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: nvidia/cuda:12.0.0-base-ubuntu22.04
          command: ["sleep", "7200"]
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
"""
        return job_a, job_b
 
    def check_cluster_fragmentation(self) -> dict:
        """
        Analyze current GPU allocation to determine fragmentation level.
        High fragmentation = large jobs cannot schedule despite
        aggregate free GPUs being sufficient.
        """
        result = subprocess.run(
            ["kubectl", "get", "nodes", "-o", "json"],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return {"error": "Cannot list nodes"}
 
        nodes = json.loads(result.stdout)
        node_gpus = []
 
        for node in nodes.get("items", []):
            allocatable = node.get("status", {}).get("allocatable", {})
            gpu_total = int(allocatable.get("nvidia.com/gpu", 0))
 
            if gpu_total == 0:
                continue
 
            # Collect the GPUs allocated to pods on this node
            node_name = node["metadata"]["name"]
            pods_result = subprocess.run(
                [
                    "kubectl", "get", "pods", "--all-namespaces",
                    "--field-selector", f"spec.nodeName={node_name}",
                    "-o", "json",
                ],
                capture_output=True, text=True, timeout=30,
            )
            allocated = 0
            if pods_result.returncode == 0:
                pods = json.loads(pods_result.stdout)
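                for pod in pods.get("items", []):
                    # Finished pods no longer hold GPUs
                    if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
                        continue
                    for container in pod.get("spec", {}).get("containers", []):
                        limits = container.get("resources", {}).get("limits", {})
                        allocated += int(limits.get("nvidia.com/gpu", 0))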
 
            free = gpu_total - allocated
            node_gpus.append({
                "node": node_name,
                "total": gpu_total,
                "allocated": allocated,
                "free": free,
            })
 
        total_free = sum(n["free"] for n in node_gpus)
        max_contiguous = max((n["free"] for n in node_gpus), default=0)
 
        return {
            "nodes": node_gpus,
            "total_free_gpus": total_free,
            "max_contiguous_free": max_contiguous,
            "fragmentation_ratio": (
                1.0 - (max_contiguous / max(total_free, 1))
                if total_free > 0 else 0
            ),
        }
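
The fragmentation ratio returned by check_cluster_fragmentation can be worked through without a cluster. The sketch below isolates the same formula, 1 - (largest free block on one node / total free GPUs), in a standalone function; `fragmentation_ratio` is a hypothetical helper name and the per-node numbers are made up for illustration.

```python
def fragmentation_ratio(free_per_node: list[int]) -> float:
    """0.0 means all free GPUs sit on one node; values near 1.0 mean
    the free GPUs are scattered thinly across many nodes."""
    total_free = sum(free_per_node)
    if total_free <= 0:
        return 0.0
    return 1.0 - max(free_per_node) / total_free


if __name__ == "__main__":
    # Eight free GPUs on one node: an 8-GPU job fits.
    print(fragmentation_ratio([8, 0, 0, 0]))              # 0.0
    # Eight free GPUs, one per node: an 8-GPU single-node job cannot fit.
    print(fragmentation_ratio([1, 1, 1, 1, 1, 1, 1, 1]))  # 0.875
```

Both clusters report the same total of free GPUs, which is why a capacity dashboard that shows only the aggregate can miss a fragmentation attack.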

Cryptomining on GPU clusters

One of the most common motives for compromising AI workload schedulers is unauthorized cryptocurrency mining. GPU clusters used for AI training are extremely valuable for mining because:

Modern AI GPUs (A100, H100) are among the most powerful compute devices available for hash computation
GPU clusters have high-bandwidth internet connections for submitting mining results
Training jobs often run for hours or days, which provides cover for mining jobs that blend in with legitimate GPU usage
Cluster monitoring may not distinguish between legitimate GPU usage (training) and unauthorized usage (mining)

An attacker who gains scheduling access can submit mining jobs disguised as training workloads, using container images that look like PyTorch or TensorFlow but actually run mining software. The jobs request GPU resources, carry names that resemble legitimate training (e.g. bert-finetune-exp-042), and produce plausible log output while mining in the background.
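
Because disguised mining jobs depend on look-alike container images, one partial control is to admit only images pinned to an approved digest. The sketch below is an illustration under assumptions: `APPROVED_DIGESTS` is a hypothetical allowlist containing a placeholder digest, and `unapproved_images` is a hypothetical helper that operates on pod dictionaries shaped like the items returned by `kubectl get pods -o json`.

```python
# Hypothetical allowlist; the digest below is a placeholder, not a real image.
APPROVED_DIGESTS = {"sha256:" + "a" * 64}


def unapproved_images(
    pod: dict, approved: set[str] = APPROVED_DIGESTS
) -> list[str]:
    """Return images that are not pinned by digest or whose digest
    is not on the allowlist. Tags alone are rejected because a tag
    can be re-pointed at different content."""
    flagged = []
    spec = pod.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        image = container.get("image", "")
        _, sep, digest = image.partition("@")
        if not sep or digest not in approved:
            flagged.append(image)
    return flagged
```

A name such as pytorch/pytorch:latest is flagged regardless of how legitimate it looks, which is the point: the decision rests on content identity rather than on the image name or job name.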

Practical examples

Slurm job hijacking via a shared filesystem

#!/usr/bin/env bash
# Demonstration of TOCTOU attack on Slurm job scripts
# stored on shared NFS filesystem.
#
# WARNING: For authorized security testing only.
 
# Scenario: A victim submits a training job that references a script
# on the shared filesystem. The attacker monitors for new job submissions
# and modifies the script during the queue wait time.
 
echo "=== Slurm Job Script TOCTOU Monitor ==="
 
# Monitor for new job submissions (requires squeue access)
WATCH_USER="${1:?Usage: $0 <target_user>}"
echo "Monitoring jobs for user: $WATCH_USER"
 
# Fetch the list of pending jobs for the target user
PENDING_JOBS=$(squeue -u "$WATCH_USER" -t PENDING -o "%i %j %o" --noheader 2>/dev/null)
 
if [ -z "$PENDING_JOBS" ]; then
    echo "No pending jobs found for $WATCH_USER"
    exit 0
fi
 
echo "Pending jobs found:"
echo "$PENDING_JOBS"
echo ""
 
# For each pending job, check whether the script is writable
while IFS= read -r line; do
    JOB_ID=$(echo "$line" | awk '{print $1}')
    JOB_NAME=$(echo "$line" | awk '{print $2}')
    JOB_SCRIPT=$(echo "$line" | awk '{print $3}')
 
    echo "Job $JOB_ID ($JOB_NAME): $JOB_SCRIPT"
 
    if [ -f "$JOB_SCRIPT" ]; then
        if [ -w "$JOB_SCRIPT" ]; then
            echo "  [CRITICAL] Script is WRITABLE by current user"
            echo "  An attacker could inject commands into this script"
            echo "  before it executes on the compute node."
        elif [ -r "$JOB_SCRIPT" ]; then
            echo "  [MEDIUM] Script is readable (information disclosure)"
            echo "  Contents reveal training configuration, data paths,"
            echo "  and potentially credentials."
        else
            echo "  [OK] Script is not accessible"
        fi
 
        # Check the permissions of the parent directory
        SCRIPT_DIR=$(dirname "$JOB_SCRIPT")
        if [ -w "$SCRIPT_DIR" ]; then
            echo "  [HIGH] Parent directory is writable"
            echo "  Could create symlinks or replace the script file."
        fi
    else
        echo "  Script file not found (may be on a different filesystem)"
    fi
    echo ""
done <<< "$PENDING_JOBS"

Defenses and countermeasures

Slurm hardening:

Restrict the munge key permissions to 0400 with owner munge:munge. Audit key access with inotify or auditd.
Enable AccountingStorageEnforce=limits,qos,associations to enforce resource quotas.
Use the task/cgroup plugin with ConstrainDevices=yes for GPU isolation.
Store job scripts in per-user directories with strict permissions, not in shared writable locations.
Audit Prolog/Epilog scripts for tampering using file integrity monitoring.
Deploy Slurm's PAM module for node access control: allow SSH only to nodes where the user has an active job.
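
The AccountingStorageEnforce recommendation above can be checked mechanically. The following is a minimal sketch, assuming a slurm.conf made of simple Key=Value lines; the helper names parse_slurm_conf and missing_enforce_flags are hypothetical, and the parser only records the first key on each line, which is enough for this setting but not for multi-key lines such as NodeName definitions.

```python
"""Sketch: verify that slurm.conf enforces accounting limits."""

REQUIRED_ENFORCE = {"limits", "qos", "associations"}


def parse_slurm_conf(text: str) -> dict[str, str]:
    """Parse simple Key=Value lines, ignoring comments and blank lines."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def missing_enforce_flags(text: str) -> set[str]:
    """Return the required AccountingStorageEnforce flags that are absent."""
    value = parse_slurm_conf(text).get("accountingstorageenforce", "")
    present = {flag.strip().lower() for flag in value.split(",") if flag.strip()}
    return REQUIRED_ENFORCE - present
```

Unlike a plain substring search, this ignores commented-out settings, so a line such as "# AccountingStorageEnforce=..." does not count as configured.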

Kubernetes GPU scheduling hardening:

Define ResourceQuotas for GPU resources in every namespace.
Use Pod Security admission to restrict privileged containers and host access.
Use RBAC to control who can create pods with GPU requests and who can reference high-priority PriorityClasses.
Taint GPU nodes and restrict tolerations via admission webhooks (OPA/Gatekeeper or Kyverno).
Use NetworkPolicies to isolate GPU pods from non-GPU workloads.
Enable audit logging for all scheduling decisions and pod creation events.
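
To make the toleration restriction concrete, the decision logic that such an admission check might apply can be sketched as a plain function. This is an illustration only, not Gatekeeper or Kyverno policy syntax; the ALLOWED_TOLERATIONS mapping, the namespace name, and the function name toleration_violations are hypothetical.

```python
"""Sketch: the decision logic an admission check for tolerations might apply."""

# Hypothetical policy: which namespaces may tolerate which taint keys.
ALLOWED_TOLERATIONS: dict[str, set[str]] = {
    "ai-training": {"nvidia.com/gpu"},
}


def toleration_violations(namespace: str, pod_spec: dict) -> list[str]:
    """Return reasons to reject the pod; an empty list means allow."""
    allowed = ALLOWED_TOLERATIONS.get(namespace, set())
    violations = []
    for tol in pod_spec.get("tolerations", []):
        key = tol.get("key", "")
        if not key and tol.get("operator") == "Exists":
            # An empty key with operator Exists tolerates every taint.
            violations.append("wildcard toleration matches every taint")
        elif key and key not in allowed:
            violations.append(f"toleration for '{key}' not allowed in {namespace}")
    return violations
```

A real policy would likely also need to exempt the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable tolerations that Kubernetes adds to pods by default.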

General scheduling security:

Implement job integrity verification: sign job definitions at submission and verify them before execution.
Monitor for anomalous scheduling patterns: unusually large GPU requests, jobs from new accounts, jobs that target specific nodes.
Separate the control plane from the data plane: the scheduling system should not have direct access to training data or model artifacts.
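
The sign-then-verify idea for job definitions can be sketched with an HMAC from the Python standard library. This is a minimal illustration, assuming a secret key held by the submission service and the verifier; key distribution is out of scope, the function names sign_job and verify_job are hypothetical, and an asymmetric signature would be a reasonable alternative when compute nodes should not hold the signing key.

```python
"""Sketch: sign a job definition at submission and verify it before execution."""

import hashlib
import hmac


def sign_job(job_definition: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the exact bytes that were submitted."""
    return hmac.new(key, job_definition, hashlib.sha256).hexdigest()


def verify_job(job_definition: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag at execution time and compare in constant time."""
    return hmac.compare_digest(sign_job(job_definition, key), tag)
```

If the script on the shared filesystem is modified while the job waits in the queue, the recomputed tag no longer matches and the job can be refused before it runs.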

slurmctld (controller): Central management daemon that maintains cluster state, the job queue, and scheduling decisions. Runs on a head node.
slurmd (compute node daemon): Runs on every compute node, receives job assignments from slurmctld, and manages job execution.
slurmdbd (database daemon): Stores accounting and job history data.
munge: Provides authentication between Slurm daemons using a shared symmetric key.

"""
Slurm cluster security audit script.
Checks for common misconfigurations and attack vectors in
Slurm-managed AI/GPU clusters.
"""
 
import subprocess
import os
import stat
import json
import re
from pathlib import Path
from typing import Optional
 
class SlurmAuditor:
    """Security auditor for Slurm-managed AI clusters."""
 
    def __init__(self):
        self.findings: list[dict] = []
 
    def _add(self, severity: str, title: str, detail: str) -> None:
        self.findings.append({
            "severity": severity, "title": title, "detail": detail,
        })
 
    def check_munge_key_permissions(self) -> None:
        """
        Check munge key file permissions.
        The key should be owned by munge:munge with mode 0400.
        Any broader permissions allow credential forging.
        """
        key_path = Path("/etc/munge/munge.key")
        if not key_path.exists():
            self._add("INFO", "Munge key not found", "Not a Slurm node or non-standard path")
            return
 
        st = key_path.stat()
        mode = stat.S_IMODE(st.st_mode)
 
        if mode != 0o400:
            self._add(
                "CRITICAL",
                f"Munge key has unsafe permissions: {oct(mode)}",
                "The munge key should be mode 0400 (owner read only). "
                "Any user who can read this key can forge authentication "
                "credentials for any cluster user, including root.",
            )
 
        # Check whether the key is readable by the current (non-root) user
        if os.access(key_path, os.R_OK) and os.getuid() != 0:
            self._add(
                "CRITICAL",
                "Munge key readable by current non-root user",
                "Current user can read /etc/munge/munge.key. This allows "
                "impersonating any user in the Slurm cluster.",
            )
 
    def check_slurm_config_security(self) -> None:
        """Audit slurm.conf for security-relevant settings."""
        config_paths = [
            Path("/etc/slurm/slurm.conf"),
            Path("/etc/slurm-llnl/slurm.conf"),
        ]
        config_path = None
        for p in config_paths:
            if p.exists():
                config_path = p
                break
 
        if config_path is None:
            self._add("INFO", "slurm.conf not found", "Non-standard path")
            return
 
        content = config_path.read_text()
 
        # Check AccountingStorageEnforce
        if "AccountingStorageEnforce" not in content:
            self._add(
                "HIGH",
                "AccountingStorageEnforce not set",
                "Without this, users can run jobs without accounting limits, "
                "bypassing resource quotas and fair-share scheduling.",
            )
 
        # Check whether job submission is restricted via user allow/deny lists
        if "AllowUsers" not in content and "DenyUsers" not in content:
            self._add(
                "MEDIUM",
                "No user allow/deny lists in Slurm config",
                "Any system user can submit jobs. Consider restricting "
                "to authorized users via AllowUsers or DenyUsers.",
            )
 
        # Check Prolog/Epilog scripts (attack surface)
        prolog_match = re.search(r'^Prolog\s*=\s*(.+)$', content, re.MULTILINE)
        epilog_match = re.search(r'^Epilog\s*=\s*(.+)$', content, re.MULTILINE)
 
        for label, match in [("Prolog", prolog_match), ("Epilog", epilog_match)]:
            if match:
                script_path = Path(match.group(1).strip())
                if script_path.exists():
                    script_st = script_path.stat()
                    if script_st.st_mode & stat.S_IWOTH:
                        self._add(
                            "CRITICAL",
                            f"{label} script is world-writable: {script_path}",
                            f"The {label} script runs as root on compute nodes "
                            f"before/after every job. A world-writable script "
                            f"allows arbitrary code execution as root.",
                        )
 
        # Check TaskPlugin (job isolation)
        if "task/cgroup" not in content:
            self._add(
                "HIGH",
                "cgroup task plugin not enabled",
                "Without task/cgroup, jobs have unrestricted access to "
                "system resources. GPU isolation, memory limits, and CPU "
                "affinity are not enforced.",
            )
 
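        # Check GPU auto-detection. AutoDetect is a gres.conf option,
        # so read that file from the same directory as slurm.conf.
        gres_path = config_path.parent / "gres.conf"
        gres_content = gres_path.read_text() if gres_path.exists() else ""
        if not re.search(r"AutoDetect\s*=\s*(nvml|rsmi)", gres_content, re.IGNORECASE):
            self._add(
                "LOW",
                "GPU auto-detection not configured",
                "Manual GPU configuration may lead to inconsistent "
                "GPU allocation and tracking.",
            )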
 
    def check_job_submission_abuse(self) -> None:
        """Test for job submission vulnerabilities."""
        # Check whether we can submit jobs with elevated privileges
        try:
            result = subprocess.run(
                ["scontrol", "show", "config"],
                capture_output=True, text=True, timeout=10,
            )
            if result.returncode == 0:
                config_text = result.stdout
 
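                # Check whether a job container plugin is configured.
                # The key is always printed, so inspect its value.
                jc = re.search(
                    r"^JobContainerType\s*=\s*(\S+)", config_text, re.MULTILINE
                )
                if jc and jc.group(1) not in ("(null)", "job_container/none"):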
                    self._add(
                        "MEDIUM",
                        "Job containers enabled",
                        "Users may be able to specify custom container "
                        "images for jobs, potentially pulling malicious images.",
                    )
 
                # Check whether users can request specific nodes
                if "SchedulerParameters" in config_text:
                    if "no_node_select" not in config_text.lower():
                        self._add(
                            "MEDIUM",
                            "Users can request specific nodes",
                            "Job submissions can target specific nodes. "
                            "An attacker can target high-value nodes "
                            "containing sensitive data or models.",
                        )
 
        except FileNotFoundError:
            pass  # Not a Slurm node
        except subprocess.TimeoutExpired:
            pass
 
    def check_gpu_isolation(self) -> None:
        """Check GPU isolation between jobs."""
        # Check cgroup enforcement of GPU device access
        cgroup_conf = Path("/etc/slurm/cgroup.conf")
        if cgroup_conf.exists():
            content = cgroup_conf.read_text()
            if "ConstrainDevices=yes" not in content:
                self._add(
                    "HIGH",
                    "GPU device constraints not enforced",
                    "Without ConstrainDevices=yes in cgroup.conf, "
                    "jobs can access GPUs not allocated to them, "
                    "enabling cross-job GPU memory snooping.",
                )
        else:
            self._add(
                "HIGH",
                "cgroup.conf not found",
                "No cgroup configuration found. GPU and memory isolation "
                "between jobs may not be enforced.",
            )
 
    def check_shared_filesystems(self) -> None:
        """Identify shared filesystems that create trust boundaries."""
        shared_mounts = []
        try:
            with open("/proc/mounts", "r") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) >= 3:
                        mount_point = parts[1]
                        fs_type = parts[2]
                        if fs_type in ("nfs", "nfs4", "lustre", "gpfs", "beegfs"):
                            shared_mounts.append((mount_point, fs_type))
        except IOError:
            return
 
        for mount_point, fs_type in shared_mounts:
            self._add(
                "MEDIUM",
                f"Shared filesystem: {mount_point} ({fs_type})",
                f"Shared {fs_type} mount at {mount_point}. Files here are "
                f"accessible across nodes. Training data, model checkpoints, "
                f"and job scripts on shared filesystems can be modified by "
                f"any user with write access, regardless of node isolation.",
            )
 
    def run_audit(self) -> list[dict]:
        """Run complete Slurm security audit."""
        self.findings = []
        self.check_munge_key_permissions()
        self.check_slurm_config_security()
        self.check_job_submission_abuse()
        self.check_gpu_isolation()
        self.check_shared_filesystems()
        return self.findings
 
if __name__ == "__main__":
    auditor = SlurmAuditor()
    findings = auditor.run_audit()
 
    for f in findings:
        print(f"[{f['severity']}] {f['title']}")
        print(f"  {f['detail']}\n")

"""
Kubernetes GPU scheduling security audit.
Identifies misconfigurations that allow GPU resource abuse,
privilege escalation, and cross-tenant attacks in AI clusters.
"""
 
import subprocess
import json
from typing import Any
 
class K8sGPUSchedulingAuditor:
    """Audit Kubernetes GPU scheduling for AI workloads."""
 
    def __init__(self, namespace: str = ""):
        self.namespace = namespace
        self.findings: list[dict] = []
 
    def _kubectl(self, *args: str) -> dict[str, Any]:
        """Run kubectl and return parsed JSON output."""
        cmd = ["kubectl"]
        if self.namespace:
            cmd.extend(["-n", self.namespace])
        cmd.extend(list(args) + ["-o", "json"])
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return {}
        return json.loads(result.stdout)
 
    def check_gpu_resource_quotas(self) -> None:
        """Verify GPU resource quotas exist to prevent resource squatting."""
        quotas = self._kubectl("get", "resourcequotas")
        items = quotas.get("items", [])
 
        gpu_quota_exists = False
        for quota in items:
            hard = quota.get("spec", {}).get("hard", {})
            for key in hard:
                if "gpu" in key.lower():
                    gpu_quota_exists = True
                    break
 
        if not gpu_quota_exists:
            self.findings.append({
                "severity": "HIGH",
                "title": "No GPU resource quotas defined",
                "detail": (
                    "Without GPU quotas, a single user can monopolize "
                    "all GPU resources by submitting many pods. This "
                    "enables denial-of-service against other training jobs."
                ),
            })
 
    def check_pod_security_for_gpu_workloads(self) -> None:
        """
        Check if GPU pods run with excessive privileges.
        GPU workloads often require elevated permissions but
        these should be minimized.
        """
        pods = self._kubectl("get", "pods")
 
        for pod in pods.get("items", []):
            name = pod["metadata"]["name"]
            spec = pod.get("spec", {})
 
            for container in spec.get("containers", []):
                resources = container.get("resources", {})
                limits = resources.get("limits", {})
 
                has_gpu = any(
                    "gpu" in k.lower() for k in limits
                )
                if not has_gpu:
                    continue
 
                # Check the security context of GPU pods
                sec = container.get("securityContext", {})
                pod_sec = spec.get("securityContext", {})
 
                if sec.get("privileged", False):
                    self.findings.append({
                        "severity": "CRITICAL",
                        "title": f"Privileged GPU pod: {name}/{container['name']}",
                        "detail": (
                            "Privileged GPU pods can access all host devices, "
                            "escape container isolation, access other pods' "
                            "GPU memory, and compromise the host node."
                        ),
                    })
 
                if sec.get("runAsUser") == 0 or (
                    not sec.get("runAsNonRoot", False)
                    and not pod_sec.get("runAsNonRoot", False)
                ):
                    self.findings.append({
                        "severity": "HIGH",
                        "title": f"GPU pod runs as root: {name}/{container['name']}",
                        "detail": (
                            "Running as root inside a GPU container increases "
                            "the impact of container escape vulnerabilities."
                        ),
                    })
 
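                # Check for hostPath volumes that expose sensitive host paths.
                # mountPath is the path inside the container, so resolve each
                # mount to its volume and inspect the hostPath source instead.
                host_paths = {
                    v["name"]: v["hostPath"].get("path", "")
                    for v in spec.get("volumes", [])
                    if "hostPath" in v
                }
                sensitive_paths = [
                    "/var/run/docker.sock",
                    "/var/run/containerd",
                    "/proc/sys",
                    "/dev",
                ]
                for vm in container.get("volumeMounts", []):
                    host_path = host_paths.get(vm.get("name", ""))
                    if host_path is None:
                        continue  # not a hostPath volume (e.g. emptyDir at /dev/shm)
                    for sp in sensitive_paths:
                        if host_path == sp or host_path.startswith(sp + "/"):
                            self.findings.append({
                                "severity": "HIGH",
                                "title": (
                                    f"Sensitive mount in GPU pod: "
                                    f"{name} -> {host_path}"
                                ),
                                "detail": (
                                    f"hostPath volume {host_path} provides "
                                    f"access to host resources that could "
                                    f"enable container escape."
                                ),
                            })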
 
    def check_tolerations_abuse(self) -> None:
        """
        Check for pods with tolerations that allow scheduling
        on GPU nodes that should be restricted.
        """
        pods = self._kubectl("get", "pods")
 
        for pod in pods.get("items", []):
            name = pod["metadata"]["name"]
            tolerations = pod.get("spec", {}).get("tolerations", [])
 
            for toleration in tolerations:
                key = toleration.get("key", "")
                operator = toleration.get("operator", "")
 
                # A wildcard toleration matches every taint
                if operator == "Exists" and key == "":
                    self.findings.append({
                        "severity": "HIGH",
                        "title": f"Wildcard toleration: {name}",
                        "detail": (
                            "Pod tolerates all taints and can be scheduled "
                            "on any node including GPU nodes, control plane "
                            "nodes, and nodes tainted for specific workloads."
                        ),
                    })
 
                # GPU-specific tolerations
                if "gpu" in key.lower() or "nvidia" in key.lower():
                    has_gpu = any(
                        "gpu" in k.lower()
                        for c in pod.get("spec", {}).get("containers", [])
                        for k in c.get("resources", {}).get("limits", {})
                    )
                    if not has_gpu:
                        self.findings.append({
                            "severity": "MEDIUM",
                            "title": (
                                f"Non-GPU pod on GPU node: {name}"
                            ),
                            "detail": (
                                f"Pod has GPU node toleration ({key}) but "
                                f"doesn't request GPU resources. It may be "
                                f"occupying GPU node capacity or attempting "
                                f"to access GPU devices directly."
                            ),
                        })
 
    def check_priority_classes(self) -> None:
        """Audit PriorityClasses for scheduling abuse potential."""
        result = subprocess.run(
            ["kubectl", "get", "priorityclasses", "-o", "json"],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return
 
        pcs = json.loads(result.stdout)
        for pc in pcs.get("items", []):
            name = pc["metadata"]["name"]
            value = pc.get("value", 0)
            preemption = pc.get("preemptionPolicy", "PreemptLowerPriority")
 
            if value > 1000000 and preemption == "PreemptLowerPriority":
                self.findings.append({
                    "severity": "MEDIUM",
                    "title": f"High-priority preempting class: {name} ({value})",
                    "detail": (
                        "This PriorityClass can preempt lower-priority pods. "
                        "If users can reference it, they can evict other "
                        "training jobs to claim their GPU resources."
                    ),
                })
 
    def run_audit(self) -> list[dict]:
        """Run all GPU scheduling audit checks."""
        self.findings = []
        self.check_gpu_resource_quotas()
        self.check_pod_security_for_gpu_workloads()
        self.check_tolerations_abuse()
        self.check_priority_classes()
        return self.findings
 
if __name__ == "__main__":
    import sys
    ns = sys.argv[1] if len(sys.argv) > 1 else ""
    auditor = K8sGPUSchedulingAuditor(namespace=ns)
    findings = auditor.run_audit()
    for f in findings:
        print(f"[{f['severity']}] {f['title']}")
        print(f"  {f['detail']}\n")

Gang scheduling exploits

Deadlock injection: Submit multiple gang-scheduled jobs that each hold part of the resources the other needs, creating cluster-wide resource deadlocks.
Resource fragmentation: Submit many small jobs that fragment GPU availability, so that large multi-node training jobs cannot be scheduled.
Queue priority manipulation: In Volcano, queue priorities determine which workloads are scheduled first. If queue definitions are not protected by RBAC, an attacker can create or modify queues to prioritize their own workloads.

"""
Gang scheduling attack simulation for AI clusters.
Demonstrates resource fragmentation and deadlock injection
against Volcano-based Kubernetes GPU scheduling.
"""
 
import subprocess
import json
import time
from typing import Optional
 
class GangSchedulingAttacker:
    """
    Simulate attacks against gang scheduling systems
    used in distributed AI training.
    """
 
    def __init__(self, namespace: str = "ai-training"):
        self.namespace = namespace
 
    def _kubectl_apply(self, manifest: str) -> bool:
        """Apply a Kubernetes manifest."""
        result = subprocess.run(
            ["kubectl", "apply", "-f", "-"],
            input=manifest, capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
 
    def generate_fragmentation_jobs(
        self,
        num_jobs: int = 20,
        gpus_per_job: int = 1,
    ) -> list[str]:
        """
        Generate many small GPU jobs designed to fragment
        cluster GPU resources, preventing large multi-GPU
        training jobs from being scheduled.
 
        The attacker submits many 1-GPU jobs across different nodes,
        leaving each node with insufficient contiguous GPUs for a
        large distributed training job.
        """
        manifests = []
        for i in range(num_jobs):
            manifest = f"""
apiVersion: batch/v1
kind: Job
metadata:
  name: fragment-{i:03d}
  namespace: {self.namespace}
  labels:
    attack-type: fragmentation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-holder
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: {gpus_per_job}
      # Spread across different nodes to maximize fragmentation
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            attack-type: fragmentation
"""
            manifests.append(manifest)
        return manifests
 
    def generate_deadlock_jobs(
        self,
        total_gpus_available: int = 16,
    ) -> tuple[str, str]:
        """
        Generate two Volcano gang-scheduled jobs that will deadlock.
        Job A requests N/2+1 GPUs, Job B requests N/2+1 GPUs.
        Since N/2+1 + N/2+1 > N, both jobs partially schedule
        and then wait forever for the remaining resources.
        """
        gpus_per_job = total_gpus_available // 2 + 1
 
        job_a = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deadlock-a
  namespace: {self.namespace}
spec:
  minAvailable: {gpus_per_job}
  schedulerName: volcano
  tasks:
  - replicas: {gpus_per_job}
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: nvidia/cuda:12.0.0-base-ubuntu22.04
          command: ["sleep", "7200"]
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
"""
 
        job_b = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deadlock-b
  namespace: {self.namespace}
spec:
  minAvailable: {gpus_per_job}
  schedulerName: volcano
  tasks:
  - replicas: {gpus_per_job}
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: nvidia/cuda:12.0.0-base-ubuntu22.04
          command: ["sleep", "7200"]
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure
"""
        return job_a, job_b
 
    def check_cluster_fragmentation(self) -> dict:
        """
        Analyze current GPU allocation to determine fragmentation level.
        High fragmentation = large jobs cannot schedule despite
        aggregate free GPUs being sufficient.
        """
        result = subprocess.run(
            ["kubectl", "get", "nodes", "-o", "json"],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode != 0:
            return {"error": "Cannot list nodes"}
 
        nodes = json.loads(result.stdout)
        node_gpus = []
 
        for node in nodes.get("items", []):
            allocatable = node.get("status", {}).get("allocatable", {})
            gpu_total = int(allocatable.get("nvidia.com/gpu", 0))
 
            if gpu_total == 0:
                continue
 
            # Sum the GPUs held by pods on this node. Completed pods
            # (Succeeded/Failed) no longer hold devices, so exclude them.
            node_name = node["metadata"]["name"]
            pods_result = subprocess.run(
                [
                    "kubectl", "get", "pods", "--all-namespaces",
                    "--field-selector",
                    f"spec.nodeName={node_name},"
                    "status.phase!=Succeeded,status.phase!=Failed",
                    "-o", "json",
                ],
                capture_output=True, text=True, timeout=30,
            )
            allocated = 0
            if pods_result.returncode == 0:
                pods = json.loads(pods_result.stdout)
                for pod in pods.get("items", []):
                    for container in pod.get("spec", {}).get("containers", []):
                        limits = container.get("resources", {}).get("limits", {})
                        allocated += int(limits.get("nvidia.com/gpu", 0))
 
            free = gpu_total - allocated
            node_gpus.append({
                "node": node_name,
                "total": gpu_total,
                "allocated": allocated,
                "free": free,
            })
 
        total_free = sum(n["free"] for n in node_gpus)
        max_contiguous = max((n["free"] for n in node_gpus), default=0)
 
        return {
            "nodes": node_gpus,
            "total_free_gpus": total_free,
            "max_contiguous_free": max_contiguous,
            "fragmentation_ratio": (
                1.0 - (max_contiguous / max(total_free, 1))
                if total_free > 0 else 0
            ),
        }
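
The fragmentation ratio computed at the end of check_cluster_fragmentation is easier to reason about as a pure function of the free GPUs per node. The sketch below restates that same formula without any kubectl calls; the function name fragmentation_ratio and the sample inputs are chosen here for illustration.

```python
"""Sketch: the fragmentation ratio as a pure function of per-node free GPUs."""


def fragmentation_ratio(free_per_node: list[int]) -> float:
    """1 - (most free GPUs on a single node / total free GPUs); 0 if none free."""
    total_free = sum(free_per_node)
    if total_free <= 0:
        return 0.0
    return 1.0 - max(free_per_node) / total_free
```

With eight free GPUs all on one node, fragmentation_ratio([8, 0, 0, 0]) is 0.0. With the same eight GPUs spread as two per node, fragmentation_ratio([2, 2, 2, 2]) is 0.75: an 8-GPU single-node job cannot be placed even though eight GPUs are free in aggregate.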
