Attacks on AI Workload Schedulers
Exploiting Slurm, Kubernetes, and custom schedulers to hijack GPU resources, poison training jobs, and achieve lateral movement in AI clusters
Overview
AI workload schedulers are the control plane for GPU compute — they determine which training jobs run, on which hardware, with what priority, and with access to what data. In high-performance computing (HPC) environments, Slurm dominates as the workload manager for multi-node GPU training. In cloud-native deployments, Kubernetes with custom schedulers, gang scheduling plugins (Volcano, Kueue), and GPU operators manages the allocation of GPU resources to AI workloads.
Compromising the scheduler or exploiting its trust assumptions gives an attacker extraordinary leverage. From a scheduling foothold, an attacker can hijack expensive GPU resources for cryptomining or unauthorized training, intercept or modify training jobs to poison models, access training data through job impersonation, and move laterally across the cluster by scheduling privileged workloads on target nodes. The scheduler is often the highest-value target in an AI cluster because it has broad access to all nodes, all job configurations, and all user credentials.
Despite their critical role, AI workload schedulers are frequently under-secured. Slurm clusters often rely on a shared munge authentication key that, if compromised, grants full cluster access. Kubernetes schedulers may allow unprivileged users to request GPU resources and tolerations that place their pods on GPU nodes with reduced isolation. Custom scheduling plugins introduce their own authentication and authorization models that may not be as thoroughly reviewed as the core scheduler code.
This article examines the attack surface of AI workload schedulers from both the HPC (Slurm) and cloud-native (Kubernetes) perspectives, demonstrates practical exploitation techniques, and provides hardening guidance.
Slurm Attack Surface
Architecture and Trust Model
Slurm (Simple Linux Utility for Resource Management) consists of several daemons:
- slurmctld (controller): Central management daemon that maintains the cluster state, job queue, and scheduling decisions. Runs on a head node.
- slurmd (compute node daemon): Runs on each compute node, receives job assignments from slurmctld, and manages job execution.
- slurmdbd (database daemon): Stores accounting and job history data.
- munge: Provides authentication between Slurm daemons using a shared symmetric key.
The critical trust assumption is that munge authentication relies on a single shared key (/etc/munge/munge.key) that must be identical across all nodes. Any process with read access to this key can forge authentication credentials for any user, making the key effectively a cluster-wide root credential.
"""
Slurm cluster security audit script.
Checks for common misconfigurations and attack vectors in
Slurm-managed AI/GPU clusters.
"""
import subprocess
import os
import stat
import re
from pathlib import Path
class SlurmAuditor:
"""Security auditor for Slurm-managed AI clusters."""
def __init__(self):
self.findings: list[dict] = []
def _add(self, severity: str, title: str, detail: str) -> None:
self.findings.append({
"severity": severity, "title": title, "detail": detail,
})
def check_munge_key_permissions(self) -> None:
"""
Check munge key file permissions.
The key should be owned by munge:munge with mode 0400.
Any broader permissions allow credential forging.
"""
key_path = Path("/etc/munge/munge.key")
if not key_path.exists():
self._add("INFO", "Munge key not found", "Not a Slurm node or non-standard path")
return
st = key_path.stat()
mode = stat.S_IMODE(st.st_mode)
if mode != 0o400:
self._add(
"CRITICAL",
f"Munge key has unsafe permissions: {oct(mode)}",
"The munge key should be mode 0400 (owner read only). "
"Any user who can read this key can forge authentication "
"credentials for any cluster user, including root.",
)
# Check if key is readable by current user (non-root test)
if os.access(key_path, os.R_OK) and os.getuid() != 0:
self._add(
"CRITICAL",
"Munge key readable by current non-root user",
"Current user can read /etc/munge/munge.key. This allows "
"impersonating any user in the Slurm cluster.",
)
def check_slurm_config_security(self) -> None:
"""Audit slurm.conf for security-relevant settings."""
config_paths = [
Path("/etc/slurm/slurm.conf"),
Path("/etc/slurm-llnl/slurm.conf"),
]
config_path = None
for p in config_paths:
if p.exists():
config_path = p
break
if config_path is None:
self._add("INFO", "slurm.conf not found", "Non-standard path")
return
content = config_path.read_text()
# Check AccountingStorageEnforce
if "AccountingStorageEnforce" not in content:
self._add(
"HIGH",
"AccountingStorageEnforce not set",
"Without this, users can run jobs without accounting limits, "
"bypassing resource quotas and fair-share scheduling.",
)
# Check for partition-level access restrictions
if "AllowGroups" not in content and "AllowAccounts" not in content:
self._add(
"MEDIUM",
"No partition access restrictions in Slurm config",
"Any system user can submit jobs to any partition. Consider "
"restricting partitions with AllowGroups or AllowAccounts.",
)
# Check for ProLog/EpiLog scripts (attack surface)
prolog_match = re.search(r'^Prolog\s*=\s*(.+)$', content, re.MULTILINE)
epilog_match = re.search(r'^Epilog\s*=\s*(.+)$', content, re.MULTILINE)
for label, match in [("Prolog", prolog_match), ("Epilog", epilog_match)]:
if match:
script_path = Path(match.group(1).strip())
if script_path.exists():
script_st = script_path.stat()
if script_st.st_mode & stat.S_IWOTH:
self._add(
"CRITICAL",
f"{label} script is world-writable: {script_path}",
f"The {label} script runs as root on compute nodes "
f"before/after every job. A world-writable script "
f"allows arbitrary code execution as root.",
)
# Check for TaskPlugin (job isolation)
if "task/cgroup" not in content:
self._add(
"HIGH",
"cgroup task plugin not enabled",
"Without task/cgroup, jobs have unrestricted access to "
"system resources. GPU isolation, memory limits, and CPU "
"affinity are not enforced.",
)
# Check for GPU auto-detection (AutoDetect is set in gres.conf, not slurm.conf)
gres_path = config_path.parent / "gres.conf"
gres_text = gres_path.read_text() if gres_path.exists() else ""
if "AutoDetect=nvml" not in gres_text and "AutoDetect=rsmi" not in gres_text:
self._add(
"LOW",
"GPU auto-detection not configured",
"Manual GPU configuration may lead to inconsistent "
"GPU allocation and tracking.",
)
def check_job_submission_abuse(self) -> None:
"""Test for job submission vulnerabilities."""
# Check if we can submit jobs with elevated privileges
try:
result = subprocess.run(
["scontrol", "show", "config"],
capture_output=True, text=True, timeout=10,
)
if result.returncode == 0:
config_text = result.stdout
# Check if job containers are allowed
if "JobContainerType" in config_text:
self._add(
"MEDIUM",
"Job containers enabled",
"Users may be able to specify custom container "
"images for jobs, potentially pulling malicious images.",
)
# Slurm lets users target specific nodes by default (sbatch --nodelist)
self._add(
"MEDIUM",
"Users can request specific nodes",
"Job submissions can target specific nodes via sbatch "
"--nodelist. An attacker can direct jobs at high-value "
"nodes containing sensitive data or models.",
)
except FileNotFoundError:
pass # Not a Slurm node
except subprocess.TimeoutExpired:
pass
def check_gpu_isolation(self) -> None:
"""Check GPU isolation between jobs."""
# Check cgroup GPU device enforcement
cgroup_conf = Path("/etc/slurm/cgroup.conf")
if cgroup_conf.exists():
content = cgroup_conf.read_text()
if "ConstrainDevices=yes" not in content:
self._add(
"HIGH",
"GPU device constraints not enforced",
"Without ConstrainDevices=yes in cgroup.conf, "
"jobs can access GPUs not allocated to them, "
"enabling cross-job GPU memory snooping.",
)
else:
self._add(
"HIGH",
"cgroup.conf not found",
"No cgroup configuration found. GPU and memory isolation "
"between jobs may not be enforced.",
)
def check_shared_filesystems(self) -> None:
"""Identify shared filesystems that create trust boundaries."""
shared_mounts = []
try:
with open("/proc/mounts", "r") as f:
for line in f:
parts = line.split()
if len(parts) >= 3:
mount_point = parts[1]
fs_type = parts[2]
if fs_type in ("nfs", "nfs4", "lustre", "gpfs", "beegfs"):
shared_mounts.append((mount_point, fs_type))
except IOError:
return
for mount_point, fs_type in shared_mounts:
self._add(
"MEDIUM",
f"Shared filesystem: {mount_point} ({fs_type})",
f"Shared {fs_type} mount at {mount_point}. Files here are "
f"accessible across nodes. Training data, model checkpoints, "
f"and job scripts on shared filesystems can be modified by "
f"any user with write access, regardless of node isolation.",
)
def run_audit(self) -> list[dict]:
"""Run complete Slurm security audit."""
self.findings = []
self.check_munge_key_permissions()
self.check_slurm_config_security()
self.check_job_submission_abuse()
self.check_gpu_isolation()
self.check_shared_filesystems()
return self.findings
if __name__ == "__main__":
auditor = SlurmAuditor()
findings = auditor.run_audit()
for f in findings:
print(f"[{f['severity']}] {f['title']}")
print(f" {f['detail']}\n")
Slurm Job Injection and Hijacking
An attacker with access to the Slurm cluster (either through legitimate credentials or a compromised munge key) can manipulate jobs in several ways:
Job script injection via shared filesystems: Training jobs reference scripts stored on shared NFS or Lustre filesystems. If an attacker can write to these filesystems, they can modify job scripts between submission and execution. The time-of-check-to-time-of-use (TOCTOU) window between sbatch submission and actual execution can be seconds to hours depending on queue wait times.
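One way to close this window is to pin a digest of the script at submission and re-verify it immediately before execution, for example hashing in an sbatch wrapper and checking from the Prolog. A minimal sketch, with illustrative helper names:

```python
import hashlib
import hmac
from pathlib import Path

def pin_script(script_path: str) -> str:
    """Record the SHA-256 of a job script at submission time
    (e.g. from an sbatch wrapper)."""
    return hashlib.sha256(Path(script_path).read_bytes()).hexdigest()

def verify_script(script_path: str, pinned_digest: str) -> bool:
    """Re-hash the script just before execution (e.g. from a Prolog)
    and compare with the submission-time digest. A mismatch means the
    script changed during the queue wait, i.e. the TOCTOU window."""
    current = hashlib.sha256(Path(script_path).read_bytes()).hexdigest()
    return hmac.compare_digest(current, pinned_digest)
```

This does not prevent modification, but it turns a silent hijack into a detectable job failure.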
Priority manipulation: Slurm's fair-share scheduler uses an account hierarchy with priorities. An attacker who can modify their account's fair-share allocation or add themselves to a high-priority account can preempt other users' training jobs, causing denial of service or seizing GPU capacity for their own workloads.
Prolog/Epilog script abuse: Slurm runs Prolog scripts as root before each job and Epilog scripts after. If an attacker can modify these scripts (through shared filesystem write access or a writable script path), they achieve root code execution on every compute node that runs a job.
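The audit script earlier only flags world-writable Prolog/Epilog scripts; group-writable bits and writable parent directories are just as exploitable, since a writable directory lets an attacker replace the script outright. A minimal sketch of a fuller check (function name illustrative):

```python
import stat
from pathlib import Path

def prolog_risks(script_path: str) -> list[str]:
    """Flag write paths that would let a non-root user hijack a
    root-executed Prolog/Epilog script."""
    risks = []
    p = Path(script_path)
    if p.exists():
        mode = p.stat().st_mode
        if mode & stat.S_IWOTH:
            risks.append("script world-writable")
        if mode & stat.S_IWGRP:
            risks.append("script group-writable")
    if p.parent.exists() and p.parent.stat().st_mode & stat.S_IWOTH:
        # A writable directory allows replacing the script wholesale,
        # even if the script file itself is read-only.
        risks.append("parent directory world-writable")
    return risks
```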
Kubernetes Scheduler Attacks for AI Workloads
GPU Scheduling Attack Vectors
Kubernetes manages GPU allocation through the device plugin framework. NVIDIA's k8s-device-plugin advertises GPU resources to the kubelet, and the scheduler assigns pods to nodes based on GPU availability. Several attack vectors target this process:
"""
Kubernetes GPU scheduling security audit.
Identifies misconfigurations that allow GPU resource abuse,
privilege escalation, and cross-tenant attacks in AI clusters.
"""
import subprocess
import json
from typing import Any
class K8sGPUSchedulingAuditor:
"""Audit Kubernetes GPU scheduling for AI workloads."""
def __init__(self, namespace: str = ""):
self.namespace = namespace
self.findings: list[dict] = []
def _kubectl(self, *args: str) -> dict[str, Any]:
"""Run kubectl and return parsed JSON output."""
cmd = ["kubectl"]
if self.namespace:
cmd.extend(["-n", self.namespace])
cmd.extend(list(args) + ["-o", "json"])
result = subprocess.run(
cmd, capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
return {}
return json.loads(result.stdout)
def check_gpu_resource_quotas(self) -> None:
"""Verify GPU resource quotas exist to prevent resource squatting."""
quotas = self._kubectl("get", "resourcequotas")
items = quotas.get("items", [])
gpu_quota_exists = False
for quota in items:
hard = quota.get("spec", {}).get("hard", {})
for key in hard:
if "gpu" in key.lower():
gpu_quota_exists = True
break
if not gpu_quota_exists:
self.findings.append({
"severity": "HIGH",
"title": "No GPU resource quotas defined",
"detail": (
"Without GPU quotas, a single user can monopolize "
"all GPU resources by submitting many pods. This "
"enables denial-of-service against other training jobs."
),
})
def check_pod_security_for_gpu_workloads(self) -> None:
"""
Check if GPU pods run with excessive privileges.
GPU workloads often require elevated permissions but
these should be minimized.
"""
pods = self._kubectl("get", "pods")
for pod in pods.get("items", []):
name = pod["metadata"]["name"]
spec = pod.get("spec", {})
for container in spec.get("containers", []):
resources = container.get("resources", {})
limits = resources.get("limits", {})
has_gpu = any(
"gpu" in k.lower() for k in limits
)
if not has_gpu:
continue
# Check security context for GPU pods
sec = container.get("securityContext", {})
pod_sec = spec.get("securityContext", {})
if sec.get("privileged", False):
self.findings.append({
"severity": "CRITICAL",
"title": f"Privileged GPU pod: {name}/{container['name']}",
"detail": (
"Privileged GPU pods can access all host devices, "
"escape container isolation, access other pods' "
"GPU memory, and compromise the host node."
),
})
if sec.get("runAsUser") == 0 or (
not sec.get("runAsNonRoot", False)
and not pod_sec.get("runAsNonRoot", False)
):
self.findings.append({
"severity": "HIGH",
"title": f"GPU pod runs as root: {name}/{container['name']}",
"detail": (
"Running as root inside a GPU container increases "
"the impact of container escape vulnerabilities."
),
})
# Check volume mounts for sensitive paths
vol_mounts = container.get("volumeMounts", [])
sensitive_paths = [
"/var/run/docker.sock",
"/var/run/containerd",
"/proc/sys",
"/dev",
]
for vm in vol_mounts:
mount_path = vm.get("mountPath", "")
for sp in sensitive_paths:
if mount_path.startswith(sp):
self.findings.append({
"severity": "HIGH",
"title": (
f"Sensitive mount in GPU pod: "
f"{name} -> {mount_path}"
),
"detail": (
f"Volume mount {mount_path} provides "
f"access to host resources that could "
f"enable container escape."
),
})
def check_tolerations_abuse(self) -> None:
"""
Check for pods with tolerations that allow scheduling
on GPU nodes that should be restricted.
"""
pods = self._kubectl("get", "pods")
for pod in pods.get("items", []):
name = pod["metadata"]["name"]
tolerations = pod.get("spec", {}).get("tolerations", [])
for toleration in tolerations:
key = toleration.get("key", "")
operator = toleration.get("operator", "")
# Wildcard toleration matches everything
if operator == "Exists" and key == "":
self.findings.append({
"severity": "HIGH",
"title": f"Wildcard toleration: {name}",
"detail": (
"Pod tolerates all taints and can be scheduled "
"on any node including GPU nodes, control plane "
"nodes, and nodes tainted for specific workloads."
),
})
# GPU-specific tolerations
if "gpu" in key.lower() or "nvidia" in key.lower():
has_gpu = any(
"gpu" in k.lower()
for c in pod.get("spec", {}).get("containers", [])
for k in c.get("resources", {}).get("limits", {})
)
if not has_gpu:
self.findings.append({
"severity": "MEDIUM",
"title": (
f"Non-GPU pod on GPU node: {name}"
),
"detail": (
f"Pod has GPU node toleration ({key}) but "
f"doesn't request GPU resources. It may be "
f"occupying GPU node capacity or attempting "
f"to access GPU devices directly."
),
})
def check_priority_classes(self) -> None:
"""Audit PriorityClasses for scheduling abuse potential."""
result = subprocess.run(
["kubectl", "get", "priorityclasses", "-o", "json"],
capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
return
pcs = json.loads(result.stdout)
for pc in pcs.get("items", []):
name = pc["metadata"]["name"]
value = pc.get("value", 0)
preemption = pc.get("preemptionPolicy", "PreemptLowerPriority")
if value > 1000000 and preemption == "PreemptLowerPriority":
self.findings.append({
"severity": "MEDIUM",
"title": f"High-priority preempting class: {name} ({value})",
"detail": (
"This PriorityClass can preempt lower-priority pods. "
"If users can reference it, they can evict other "
"training jobs to claim their GPU resources."
),
})
def run_audit(self) -> list[dict]:
"""Run all GPU scheduling audit checks."""
self.findings = []
self.check_gpu_resource_quotas()
self.check_pod_security_for_gpu_workloads()
self.check_tolerations_abuse()
self.check_priority_classes()
return self.findings
if __name__ == "__main__":
import sys
ns = sys.argv[1] if len(sys.argv) > 1 else ""
auditor = K8sGPUSchedulingAuditor(namespace=ns)
findings = auditor.run_audit()
for f in findings:
print(f"[{f['severity']}] {f['title']}")
print(f" {f['detail']}\n")
Gang Scheduling Exploits
AI training jobs often require multiple GPUs across multiple nodes to run simultaneously (gang scheduling). Frameworks like Volcano and Kueue coordinate these multi-pod scheduling decisions. Exploits include:
- Deadlock injection: Submit multiple gang-scheduled jobs that each hold some resources the other needs, creating cluster-wide resource deadlocks.
- Resource fragmentation: Submit many small jobs that fragment GPU availability, preventing large multi-node training jobs from being scheduled.
- Queue priority manipulation: In Volcano, queue priorities determine which workloads are scheduled first. If queue definitions are not RBAC-protected, an attacker can create or modify queues to prioritize their own workloads.
"""
Gang scheduling attack simulation for AI clusters.
Demonstrates resource fragmentation and deadlock injection
against Volcano-based Kubernetes GPU scheduling.
"""
import subprocess
import json
class GangSchedulingAttacker:
"""
Simulate attacks against gang scheduling systems
used in distributed AI training.
"""
def __init__(self, namespace: str = "ai-training"):
self.namespace = namespace
def _kubectl_apply(self, manifest: str) -> bool:
"""Apply a Kubernetes manifest."""
result = subprocess.run(
["kubectl", "apply", "-f", "-"],
input=manifest, capture_output=True, text=True, timeout=30,
)
return result.returncode == 0
def generate_fragmentation_jobs(
self,
num_jobs: int = 20,
gpus_per_job: int = 1,
) -> list[str]:
"""
Generate many small GPU jobs designed to fragment
cluster GPU resources, preventing large multi-GPU
training jobs from being scheduled.
The attacker submits many 1-GPU jobs across different nodes,
leaving each node with insufficient contiguous GPUs for a
large distributed training job.
"""
manifests = []
for i in range(num_jobs):
manifest = f"""
apiVersion: batch/v1
kind: Job
metadata:
name: fragment-{i:03d}
namespace: {self.namespace}
labels:
attack-type: fragmentation
spec:
template:
metadata:
labels:
attack-type: fragmentation   # pod-level label so the spread constraint below matches
spec:
restartPolicy: Never
containers:
- name: gpu-holder
image: nvidia/cuda:12.0.0-base-ubuntu22.04
command: ["sleep", "3600"]
resources:
limits:
nvidia.com/gpu: {gpus_per_job}
# Spread across different nodes to maximize fragmentation
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
attack-type: fragmentation
"""
manifests.append(manifest)
return manifests
def generate_deadlock_jobs(
self,
total_gpus_available: int = 16,
) -> tuple[str, str]:
"""
Generate two Volcano gang-scheduled jobs that will deadlock.
Job A and Job B each request N/2+1 GPUs. Since
(N/2+1) + (N/2+1) > N, the two jobs can never run at once.
A naive scheduler partially binds both and they hold
resources forever; a gang scheduler avoids the partial
bind, but the pair still starves the queue indefinitely.
"""
gpus_per_job = total_gpus_available // 2 + 1
job_a = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: deadlock-a
namespace: {self.namespace}
spec:
minAvailable: {gpus_per_job}
schedulerName: volcano
tasks:
- replicas: {gpus_per_job}
name: worker
template:
spec:
containers:
- name: worker
image: nvidia/cuda:12.0.0-base-ubuntu22.04
command: ["sleep", "7200"]
resources:
limits:
nvidia.com/gpu: 1
restartPolicy: OnFailure
"""
job_b = f"""
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: deadlock-b
namespace: {self.namespace}
spec:
minAvailable: {gpus_per_job}
schedulerName: volcano
tasks:
- replicas: {gpus_per_job}
name: worker
template:
spec:
containers:
- name: worker
image: nvidia/cuda:12.0.0-base-ubuntu22.04
command: ["sleep", "7200"]
resources:
limits:
nvidia.com/gpu: 1
restartPolicy: OnFailure
"""
return job_a, job_b
def check_cluster_fragmentation(self) -> dict:
"""
Analyze current GPU allocation to determine fragmentation level.
High fragmentation = large jobs cannot schedule despite
aggregate free GPUs being sufficient.
"""
result = subprocess.run(
["kubectl", "get", "nodes", "-o", "json"],
capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
return {"error": "Cannot list nodes"}
nodes = json.loads(result.stdout)
node_gpus = []
for node in nodes.get("items", []):
allocatable = node.get("status", {}).get("allocatable", {})
gpu_total = int(allocatable.get("nvidia.com/gpu", 0))
if gpu_total == 0:
continue
# Get allocated GPUs from running pods
node_name = node["metadata"]["name"]
pods_result = subprocess.run(
[
"kubectl", "get", "pods", "--all-namespaces",
"--field-selector", f"spec.nodeName={node_name}",
"-o", "json",
],
capture_output=True, text=True, timeout=30,
)
allocated = 0
if pods_result.returncode == 0:
pods = json.loads(pods_result.stdout)
for pod in pods.get("items", []):
for container in pod.get("spec", {}).get("containers", []):
limits = container.get("resources", {}).get("limits", {})
allocated += int(limits.get("nvidia.com/gpu", 0))
free = gpu_total - allocated
node_gpus.append({
"node": node_name,
"total": gpu_total,
"allocated": allocated,
"free": free,
})
total_free = sum(n["free"] for n in node_gpus)
max_contiguous = max((n["free"] for n in node_gpus), default=0)
return {
"nodes": node_gpus,
"total_free_gpus": total_free,
"max_contiguous_free": max_contiguous,
"fragmentation_ratio": (
1.0 - (max_contiguous / max(total_free, 1))
if total_free > 0 else 0
),
}
Cryptomining on GPU Clusters
One of the most common motivations for compromising AI workload schedulers is unauthorized cryptocurrency mining. GPU clusters used for AI training are extremely valuable for mining because:
- Modern AI GPUs (A100, H100) are among the most powerful compute devices available for hash computation
- GPU clusters have high-bandwidth internet connectivity for submitting mining results
- Training jobs often run for hours or days, providing cover for mining jobs that blend in with legitimate GPU utilization
- Cluster monitoring may not distinguish between legitimate GPU utilization (training) and unauthorized utilization (mining)
An attacker who gains scheduling access can submit mining jobs disguised as training workloads — using container images that appear to be PyTorch or TensorFlow but actually run mining software. The jobs request GPU resources, are named to resemble legitimate training (e.g., bert-finetune-exp-042), and produce expected-looking log output while mining in the background.
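Detection therefore has to look past job names and image tags. One heuristic is to scan pod specs for strings that are hard for miners to avoid, such as pool protocol URLs and miner CLI flags. A minimal sketch; the indicator list is illustrative, not exhaustive:

```python
MINING_INDICATORS = (
    "stratum+tcp://", "stratum+ssl://",  # mining pool protocol URLs
    "xmrig", "ethminer", "nicehash",     # well-known miner names
    "--donate-level",                    # common miner CLI flag
)

def mining_indicators(pod_spec: dict) -> list[str]:
    """Scan container commands, args, and env values for strings
    commonly associated with GPU cryptominers. Job names and images
    are trivial to disguise; command lines and pool URLs less so."""
    hits = []
    for c in pod_spec.get("containers", []):
        fields = list(c.get("command", [])) + list(c.get("args", []))
        fields += [e.get("value", "") for e in c.get("env", [])]
        blob = " ".join(str(x) for x in fields).lower()
        for needle in MINING_INDICATORS:
            if needle in blob:
                hits.append(f"{c.get('name', '?')}: {needle}")
    return hits
```

Pairing this with GPU utilization baselines (miners run flat-out around the clock; training jobs show data-loading dips and checkpoint pauses) improves coverage.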
Practical Examples
Slurm Job Hijacking via Shared Filesystem
#!/usr/bin/env bash
# Demonstration of TOCTOU attack on Slurm job scripts
# stored on shared NFS filesystem.
#
# WARNING: For authorized security testing only.
# Scenario: A victim submits a training job that references a script
# on the shared filesystem. The attacker monitors for new job submissions
# and modifies the script during the queue wait time.
echo "=== Slurm Job Script TOCTOU Monitor ==="
# Monitor for new job submissions (requires squeue access)
WATCH_USER="${1:?Usage: $0 <target_user>}"
echo "Monitoring jobs for user: $WATCH_USER"
# Get list of pending jobs for the target user
PENDING_JOBS=$(squeue -u "$WATCH_USER" -t PENDING -o "%i %j %o" --noheader 2>/dev/null)
if [ -z "$PENDING_JOBS" ]; then
echo "No pending jobs found for $WATCH_USER"
exit 0
fi
echo "Pending jobs found:"
echo "$PENDING_JOBS"
echo ""
# For each pending job, check if the script is writable
while IFS= read -r line; do
JOB_ID=$(echo "$line" | awk '{print $1}')
JOB_NAME=$(echo "$line" | awk '{print $2}')
JOB_SCRIPT=$(echo "$line" | awk '{print $3}')
echo "Job $JOB_ID ($JOB_NAME): $JOB_SCRIPT"
if [ -f "$JOB_SCRIPT" ]; then
if [ -w "$JOB_SCRIPT" ]; then
echo " [CRITICAL] Script is WRITABLE by current user"
echo " An attacker could inject commands into this script"
echo " before it executes on the compute node."
elif [ -r "$JOB_SCRIPT" ]; then
echo " [MEDIUM] Script is readable (information disclosure)"
echo " Contents reveal training configuration, data paths,"
echo " and potentially credentials."
else
echo " [OK] Script is not accessible"
fi
# Check the directory permissions
SCRIPT_DIR=$(dirname "$JOB_SCRIPT")
if [ -w "$SCRIPT_DIR" ]; then
echo " [HIGH] Parent directory is writable"
echo " Could create symlinks or replace the script file."
fi
else
echo " Script file not found (may be on a different filesystem)"
fi
echo ""
done <<< "$PENDING_JOBS"
Defense and Mitigation
Slurm hardening:
- Restrict munge key permissions to 0400, owned by munge:munge. Audit key access with inotify or auditd.
- Enable AccountingStorageEnforce=limits,qos,associations to enforce resource quotas.
- Use the task/cgroup plugin with ConstrainDevices=yes for GPU isolation.
- Store job scripts in per-user directories with strict permissions, not shared writable locations.
- Audit Prolog/Epilog scripts for tampering using file integrity monitoring.
- Implement Slurm's PAM module for node access control — only allow SSH to nodes where a user has an active job.
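Assembled into configuration, the Slurm items above might look like the following excerpts (paths and values are illustrative; adapt to your site):

```ini
# slurm.conf (excerpt)
AccountingStorageEnforce=limits,qos,associations
TaskPlugin=task/cgroup,task/affinity
Prolog=/etc/slurm/prolog.sh   # root-owned, mode 0700, integrity-monitored
Epilog=/etc/slurm/epilog.sh

# cgroup.conf (excerpt)
ConstrainDevices=yes    # jobs see only the GPUs allocated to them
ConstrainRAMSpace=yes
ConstrainCores=yes
```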
Kubernetes GPU scheduling hardening:
- Define ResourceQuotas for GPU resources in every namespace.
- Use PodSecurity admission to restrict privileged containers and host access.
- Implement RBAC to control who can create pods with GPU requests and who can reference high-priority PriorityClasses.
- Taint GPU nodes and restrict tolerations through admission webhooks (OPA/Gatekeeper or Kyverno).
- Use NetworkPolicies to isolate GPU pods from non-GPU workloads.
- Enable audit logging for all scheduling decisions and pod creation events.
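As a concrete starting point, a per-namespace GPU quota can be expressed with a standard ResourceQuota (namespace and limit are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested in this namespace
```

Note that for extended resources such as nvidia.com/gpu, Kubernetes quotas cover the requests.* form.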
General scheduling security:
- Implement job integrity verification: sign job definitions at submission and verify before execution.
- Monitor for anomalous scheduling patterns: unusually high GPU requests, jobs from new accounts, jobs targeting specific nodes.
- Separate control plane from data plane: the scheduling system should not have direct access to training data or model artifacts.
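The job-integrity item above can be sketched with a keyed HMAC over a canonical serialization of the job definition (key distribution and rotation are out of scope; names are illustrative):

```python
import hashlib
import hmac
import json

def sign_job(job: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the job
    definition, produced by the submission service."""
    canonical = json.dumps(job, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify_job(job: dict, key: bytes, signature: str) -> bool:
    """Checked by the execution side before the job runs; any change
    to the job definition after submission invalidates the signature."""
    return hmac.compare_digest(sign_job(job, key), signature)
```

Unlike a bare hash, the keyed signature cannot be recomputed by an attacker who tampers with the job definition on a shared filesystem.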
References
- SchedMD. (2024). "Slurm Security Guide." https://slurm.schedmd.com/security.html
- MITRE ATLAS. "Resource Hijacking in ML Infrastructure." https://atlas.mitre.org/techniques/AML.T0048
- Kubernetes. (2024). "Managing Resources for Containers: GPU." https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
- Volcano. (2024). "Volcano: Cloud Native Batch Computing." https://volcano.sh/en/docs/