Attacks on GPU Compute Clusters
Expert-level analysis of attacks on GPU compute clusters used for ML training and inference, including side-channel attacks on GPU memory, CUDA runtime exploitation, multi-tenant isolation failures, and RDMA network attacks.
GPU compute clusters are the backbone of modern ML infrastructure. Organizations spend millions on NVIDIA DGX systems, AMD Instinct accelerators, and cloud GPU instances to train and serve models. Securing these clusters is critical, yet GPU hardware and its software stack were designed primarily for performance, not isolation. This leaves exploitable gaps that red teams can use to reach other tenants' data, extract model weights, and disrupt training runs.
GPU memory architecture and attack surface
NVIDIA GPU memory hierarchy
Understanding the GPU memory hierarchy is essential for identifying where data can leak:
┌──────────────────────────────────────────────────┐
│                    GPU Device                    │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │             Global Memory (HBM)              │ │
│ │ ┌───────────────┐ ┌────────────────────────┐ │ │
│ │ │ Model Weights │ │ Activations / KV Cache │ │ │
│ │ └───────────────┘ └────────────────────────┘ │ │
│ │ ┌───────────────┐ ┌────────────────────────┐ │ │
│ │ │ Gradients     │ │ Optimizer State        │ │ │
│ │ └───────────────┘ └────────────────────────┘ │ │
│ └──────────────────────────────────────────────┘ │
│                                                  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  │
│  │    SM 0    │  │    SM 1    │  │    SM N    │  │
│  │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │  │
│  │ │Shared  │ │  │ │Shared  │ │  │ │Shared  │ │  │
│  │ │Memory  │ │  │ │Memory  │ │  │ │Memory  │ │  │
│  │ └────────┘ │  │ └────────┘ │  │ └────────┘ │  │
│  │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │  │
│  │ │L1 Cache│ │  │ │L1 Cache│ │  │ │L1 Cache│ │  │
│  │ └────────┘ │  │ └────────┘ │  │ └────────┘ │  │
│  └────────────┘  └────────────┘  └────────────┘  │
│                                                  │
│ ┌──────────────────────────────────────────────┐ │
│ │              L2 Cache (Shared)               │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
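To get a feel for how much sensitive state sits in global memory, the regions in the diagram can be sized with simple arithmetic. The sketch below assumes plain FP32 training with Adam (4 bytes per weight, 4 per gradient, 8 for the two moment estimates). The function name and the 7-billion-parameter figure are illustrative, and mixed-precision or sharded setups will give different numbers.

```python
def training_state_bytes(num_params: int, bytes_per_value: int = 4) -> dict:
    """Rough per-region HBM footprint for FP32 training with Adam.

    Assumption: weights, gradients, and both Adam moment estimates are stored
    at the same precision. Activations and KV cache are excluded because they
    depend on batch size and sequence length.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = 2 * num_params * bytes_per_value  # first and second moments
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_state": optimizer_state,
        "total": weights + gradients + optimizer_state,
    }


if __name__ == "__main__":
    footprint = training_state_bytes(7_000_000_000)  # hypothetical 7B-parameter model
    for region, size in footprint.items():
        print(f"{region:>16}: {size / 2**30:.1f} GiB")
```

Under these assumptions the optimizer state alone is as large as weights and gradients combined, which is why it features so prominently among the regions worth protecting.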
Exploiting uninitialized memory
GPU memory is not guaranteed to be cleared between kernel launches, and depending on the driver, allocator, and sharing mode it may not be scrubbed between processes that share a GPU. This makes it the most accessible attack vector in multi-tenant environments:
import torch


def probe_gpu_memory(allocation_size_mb: int = 512, num_probes: int = 10):
    """
    Probe GPU memory for residual data from previous allocations.
    In multi-tenant environments, this may contain fragments of
    other users' model weights, activations, or input data.
    """
    findings = []
    for probe_idx in range(num_probes):
        # Allocate without initialization: reads whatever is in memory.
        # Note: torch.empty draws from PyTorch's caching allocator, so the
        # residue may come from this same process rather than another tenant.
        tensor = torch.empty(
            allocation_size_mb * 1024 * 1024 // 4,  # float32 elements
            dtype=torch.float32,
            device="cuda",
        )
        # Analyze contents for non-zero patterns
        nonzero_ratio = (tensor != 0).float().mean().item()
        value_range = (tensor.min().item(), tensor.max().item())
        # Check for structured patterns (model weights have characteristic
        # distributions). Raw bytes read as float32 can contain NaN or Inf,
        # in which case these statistics are NaN as well.
        std = tensor.std().item()
        mean_abs = tensor.abs().mean().item()
        if nonzero_ratio > 0.01:  # More than 1% non-zero indicates residual data
            findings.append({
                "probe": probe_idx,
                "nonzero_ratio": nonzero_ratio,
                "value_range": value_range,
                "std": std,
                "mean_abs": mean_abs,
                "likely_content": classify_residual_data(std, mean_abs),
            })
        del tensor
        torch.cuda.empty_cache()
    return findings


def classify_residual_data(std: float, mean_abs: float) -> str:
    """Heuristic classification of residual GPU memory contents."""
    if 0.01 < std < 0.1 and mean_abs < 0.05:
        return "likely_model_weights (small initialization)"
    elif 0.1 < std < 2.0:
        return "likely_activations_or_gradients"
    elif std > 10.0:
        return "likely_optimizer_state (Adam momentum/variance)"
    elif mean_abs < 1e-6:
        return "likely_zeroed_or_sparse"
    else:
        return "unknown_structured_data"

Side-channel attacks on GPU workloads
Timing side channels
GPU kernel execution times leak information about the data being processed:
import time

import numpy as np
import torch


def timing_side_channel_probe(target_gpu: int = 0):
    """
    Measure GPU kernel execution timing to infer characteristics
    of co-located workloads. Execution time correlates with:
    - Model size (number of parameters)
    - Batch size (number of inputs processed)
    - Sequence length (for transformer models)
    - Sparsity patterns in data
    """
    torch.cuda.set_device(target_gpu)
    timings = []
    for _ in range(1000):
        # Launch a small probe kernel
        probe = torch.randn(64, 64, device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter_ns()
        result = torch.matmul(probe, probe)
        torch.cuda.synchronize()
        end = time.perf_counter_ns()
        timings.append(end - start)
    timings = np.array(timings)
    # Timing variance indicates resource contention from co-located workloads
    return {
        "mean_ns": np.mean(timings),
        "std_ns": np.std(timings),
        "p99_ns": np.percentile(timings, 99),
        "bimodal": detect_bimodal_distribution(timings),
        "contention_detected": np.std(timings) > np.mean(timings) * 0.1,
    }


def detect_bimodal_distribution(data: np.ndarray) -> bool:
    """Bimodal timing suggests batch processing boundaries in co-located workload.

    Caveat: normaltest only detects departure from normality, so skewed or
    heavy-tailed timings also return True. Treat the result as a coarse hint.
    """
    from scipy import stats

    _, p_value = stats.normaltest(data)
    return p_value < 0.001

Power and thermal side channels
GPU power draw and thermal readings are exposed through management interfaces and correlate with workload characteristics:
def monitor_gpu_power_channel(duration_seconds: int = 60, sample_rate_hz: int = 10):
    """
    Monitor GPU power consumption as a side channel.
    Power draw patterns reveal:
    - Training vs. inference workload type
    - Batch processing cadence
    - Model architecture characteristics
    """
    import subprocess
    import time

    readings = []
    interval = 1.0 / sample_rate_hz
    for _ in range(duration_seconds * sample_rate_hz):
        # nvidia-smi provides power and utilization data
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=power.draw,utilization.gpu,temperature.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Multi-GPU hosts print one line per GPU; use the first one
            values = result.stdout.strip().splitlines()[0].split(", ")
            readings.append({
                "timestamp": time.time(),
                "power_w": float(values[0]),
                "util_pct": float(values[1]),
                "temp_c": float(values[2]),
                "mem_used_mb": float(values[3]),
            })
        # The nvidia-smi call itself takes time, so the rate is only nominal
        time.sleep(interval)
    return analyze_power_patterns(readings, sample_rate_hz)


def analyze_power_patterns(readings: list, sample_rate_hz: int = 10) -> dict:
    """Extract workload characteristics from power consumption patterns."""
    import numpy as np
    from scipy.signal import find_peaks

    powers = [r["power_w"] for r in readings]
    # Detect periodic patterns (training loop cadence)
    peaks, _ = find_peaks(powers, height=np.mean(powers))
    if len(peaks) > 2:
        intervals = np.diff(peaks)
        # Convert peak spacing from samples to seconds
        cadence = np.median(intervals) / sample_rate_hz
        return {
            "workload_type": "training" if cadence > 1.0 else "inference",
            "batch_cadence_seconds": cadence,
            "peak_power_w": max(powers),
            "avg_power_w": np.mean(powers),
        }
    return {"workload_type": "inference_or_idle", "avg_power_w": np.mean(powers)}

| Side channel | Leaked data | Accuracy | Requirements |
|---|---|---|---|
| Kernel timing | Model size, batch size, sequence length | Medium | Co-located process on the same GPU |
| Power analysis | Training cadence, workload type | High | nvidia-smi access |
| Memory bandwidth | Data transfer patterns, model loading | Medium | Performance counter access |
| PCIe traffic | Host-to-device data patterns | Low | PCIe monitoring capability |
| Thermal patterns | Sustained versus burst compute | Low | Temperature sensor access |
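The kernel-timing row depends on telling a two-cluster timing distribution apart from a single noisy one, and a general normality test rejects for many reasons other than bimodality. One possible alternative is a simplified form of Sarle's bimodality coefficient, sketched below with population moments and no finite-sample correction. The function names and the synthetic timings are illustrative, and the 5/9 threshold is a rule of thumb rather than a proof of two modes.

```python
import math


def bimodality_coefficient(samples: list[float]) -> float:
    """Simplified Sarle bimodality coefficient: (skewness^2 + 1) / kurtosis.

    Uses population moments and omits the finite-sample correction, an
    assumption that is only reasonable for large sample counts.
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    if m2 == 0:
        return 0.0  # constant data has no modes to separate
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3.0 for a Gaussian)
    return (skewness ** 2 + 1) / kurtosis


def looks_bimodal(samples: list[float], threshold: float = 5 / 9) -> bool:
    """Scores above 5/9 (the uniform distribution's score) hint at two modes."""
    return bimodality_coefficient(samples) > threshold


if __name__ == "__main__":
    # Synthetic timings in nanoseconds: two clusters vs. one bell-shaped cluster
    two_modes = [100_000.0] * 50 + [180_000.0] * 50
    one_mode = [
        100_000.0 + 1_000.0 * k
        for k in range(11)
        for _ in range(math.comb(10, k))
    ]
    print(looks_bimodal(two_modes), looks_bimodal(one_mode))
```

The coefficient is cheap enough to compute over a sliding window of probe timings, though strongly skewed unimodal data can also score high and should be checked by eye.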
Multi-tenant GPU isolation mechanisms
NVIDIA Multi-Instance GPU (MIG)
MIG provides the strongest isolation currently available for GPU multi-tenancy:
def assess_mig_isolation(gpu_index: int = 0):
    """Assess MIG partition isolation on NVIDIA A100/H100 GPUs."""
    import subprocess

    findings = []
    # List MIG GPU instances
    result = subprocess.run(
        ["nvidia-smi", "mig", "-lgi", "-i", str(gpu_index)],
        capture_output=True, text=True
    )
    findings.append({"mig_instances": result.stdout})
    # Check MIG mode status
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=mig.mode.current", "--format=csv,noheader",
         "-i", str(gpu_index)],
        capture_output=True, text=True
    )
    mig_enabled = "Enabled" in result.stdout
    if not mig_enabled:
        findings.append({
            "severity": "HIGH",
            "finding": "MIG not enabled on multi-tenant GPU",
            "impact": "No hardware isolation between tenants",
        })
    # Even with MIG, check for shared resources
    findings.append({
        "note": "MIG isolates compute and memory but shares: "
                "PCIe bus, NVLink, video encoder/decoder, "
                "GPU management processor",
    })
    return findings

NVIDIA Multi-Process Service (MPS)
MPS improves utilization and throughput but offers weaker isolation:
| Isolation mechanism | Compute isolation | Memory isolation | Fault isolation | Performance overhead |
|---|---|---|---|---|
| MIG | Hardware-partitioned SMs | Separate memory partitions | Full; crashes are contained | 0% (dedicated resources) |
| MPS | Shared SMs, kernels run concurrently | Shared address space before Volta; per-client address spaces since, but no partitioning | None; one fatal fault kills all clients | Low |
| Time-slicing | Round-robin scheduling | No isolation | None | Medium (context switching) |
| vGPU | Hypervisor-mediated | Hypervisor-enforced | Full | 5-15% |
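When reviewing a cluster, it can help to encode this comparison as data and filter it against a tenant's requirements. The sketch below is one way to do that; the class, the field names, and the yes/no reading of each table cell are interpretations made for illustration, not vendor-published attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationProfile:
    name: str
    compute_partitioned: bool  # dedicated SMs per tenant
    memory_partitioned: bool   # dedicated, enforced memory partition per tenant
    fault_contained: bool      # one tenant's crash does not affect the others


# Boolean reading of the comparison table (an assumption of this sketch)
PROFILES = [
    IsolationProfile("MIG", True, True, True),
    IsolationProfile("MPS", False, False, False),
    IsolationProfile("time-slicing", False, False, False),
    IsolationProfile("vGPU", False, True, True),
]


def mechanisms_meeting(
    require_memory_partition: bool,
    require_fault_containment: bool,
) -> list[str]:
    """Return the names of mechanisms that satisfy every stated requirement."""
    return [
        p.name
        for p in PROFILES
        if (p.memory_partitioned or not require_memory_partition)
        and (p.fault_contained or not require_fault_containment)
    ]


if __name__ == "__main__":
    # Mutually untrusting tenants typically need both properties
    print(mechanisms_meeting(True, True))
```

A finding can then be phrased as a mismatch between the mechanism actually deployed and the set returned for the tenant's trust model.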
RDMA and interconnect attacks
Exploiting InfiniBand and RoCE
High-performance GPU clusters use RDMA for inter-node communication during distributed training:
def enumerate_rdma_endpoints():
    """
    Enumerate RDMA-capable network interfaces and endpoints.
    RDMA traffic bypasses the kernel network stack, meaning
    standard firewall rules and network policies do not apply.
    Assumes the rdma-core and infiniband-diags tools are installed;
    subprocess.run raises FileNotFoundError for a missing binary.
    """
    import subprocess

    findings = []
    # Check for RDMA devices
    result = subprocess.run(["ibv_devices"], capture_output=True, text=True)
    if result.returncode == 0:
        findings.append({
            "finding": "RDMA devices present",
            "devices": result.stdout,
            "severity": "INFO",
        })
    # Query InfiniBand port state (ibstat also reports the subnet manager LID)
    result = subprocess.run(["ibstat"], capture_output=True, text=True)
    if result.returncode == 0:
        findings.append({
            "finding": "InfiniBand status",
            "status": result.stdout,
        })
    # Enumerate RDMA connections
    result = subprocess.run(
        ["rdma", "resource", "show", "cm_id"],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        findings.append({
            "finding": "Active RDMA connections",
            "connections": result.stdout,
            "note": "These connections bypass kernel network stack and firewalls",
        })
    # Check NVLink status. NVLink carries GPU-to-GPU peer traffic; GPUDirect
    # RDMA over the NIC is a separate feature and is not confirmed by this.
    result = subprocess.run(
        ["nvidia-smi", "nvlink", "--status"],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        findings.append({
            "finding": "NVLink status (GPU peer-to-peer capable)",
            "status": result.stdout,
        })
    return findings

NVLink and NVSwitch attacks
In multi-GPU systems (DGX, HGX), NVLink provides direct GPU-to-GPU memory access:
def probe_nvlink_topology():
    """
    Map NVLink topology to identify potential cross-GPU
    data access paths. NVLink enables GPUDirect peer-to-peer,
    which allows one GPU to directly read/write another GPU's memory.
    """
    import subprocess

    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True
    )
    topology = result.stdout
    # Reading the topology matrix:
    # NV# indicates NVLink connection with # links
    # SYS indicates cross-socket (slower)
    # PHB indicates same PCIe host bridge
    return {
        "topology": topology,
        "note": "GPUs connected via NVLink can perform direct memory "
                "access (peer-to-peer). If GPU 0 and GPU 1 are NVLink-connected "
                "and run different tenants' workloads, a CUDA program on GPU 0 "
                "can potentially read GPU 1's memory via cuMemcpyPeer. Peer "
                "copies normally require both contexts in one process or "
                "memory shared explicitly through CUDA IPC.",
    }

CUDA runtime exploitation
Driver and runtime vulnerabilities
The CUDA software stack presents a significant attack surface:
| Component | Vulnerability class | Example CVEs | Impact |
|---|---|---|---|
| NVIDIA kernel driver | Privilege escalation | CVE-2024-0071, CVE-2024-0074 | Host compromise from within a container |
| CUDA runtime | Memory corruption | CVE-2023-31021 | Code execution in GPU context |
| cuDNN | Buffer overflow | Various | Arbitrary code execution |
| NCCL | Unauthenticated access | Design issue | Interception of distributed training data |
| nvidia-persistenced | Local privilege escalation | CVE-2024-0090 | Root access from GPU user |
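Checking a host against advisories like these comes down to comparing driver version numbers, and plain string comparison gets that wrong because "535.86" sorts after "535.104" lexically. A minimal sketch of numeric comparison follows. The helper names are made up for illustration, and the version strings are placeholders rather than real fixed-in versions, which should be taken from the vendor's security bulletins.

```python
def parse_driver_version(text: str) -> tuple[int, ...]:
    """Turn a dotted driver version such as '535.104.05' into (535, 104, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older_than(installed: str, fixed_in: str) -> bool:
    """True if the installed driver predates the release that carries the fix."""
    return parse_driver_version(installed) < parse_driver_version(fixed_in)


if __name__ == "__main__":
    installed = "535.86.10"          # placeholder value
    hypothetical_fix = "535.104.05"  # placeholder, not taken from an advisory
    print("numeric comparison says older:", is_older_than(installed, hypothetical_fix))
    print("string comparison says older: ", installed < hypothetical_fix)
```

Tuple comparison also handles versions with a different number of components, since Python compares tuples element by element and treats a shorter prefix as smaller.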
def assess_cuda_attack_surface():
    """Enumerate CUDA stack components and known vulnerability exposure."""
    import shutil
    import subprocess

    components = {}
    # Driver version (one line per GPU; all GPUs share the same driver)
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True
    )
    lines = result.stdout.strip().splitlines()
    components["driver_version"] = lines[0] if lines else ""
    # CUDA toolkit version (nvcc is absent on runtime-only installs)
    if shutil.which("nvcc"):
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        )
        components["cuda_version"] = result.stdout
    # Check for known vulnerable driver versions.
    # Illustrative mapping only; confirm affected versions against the
    # NVIDIA security bulletins before reporting a finding.
    driver_ver = components["driver_version"]
    known_vulnerable = {
        "535.104": ["CVE-2024-0071"],  # Example
        "535.86": ["CVE-2023-31021"],
    }
    for vuln_ver, cves in known_vulnerable.items():
        if driver_ver.startswith(vuln_ver):
            components["vulnerabilities"] = cves
    return components

Cluster-level attack scenarios
Scenario 1: Cross-tenant data extraction
1. Attacker obtains legitimate access to a GPU instance in a shared cluster
2. Probe uninitialized GPU memory for residual data from previous tenant
3. Use timing side channels to determine when co-located workload processes batches
4. Allocate and read GPU memory immediately after co-located workload releases it
5. Reconstruct model weights or training data fragments from recovered memory
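Steps 2 and 4 only work when memory is handed to a new owner without being scrubbed first. The toy model below illustrates that mechanism with an ordinary Python buffer pool: it is not how the CUDA driver or any framework allocator is implemented, and the class and function names are invented for this sketch.

```python
class ToyCachingAllocator:
    """Toy buffer pool that recycles freed blocks.

    Illustration only; it does not reproduce how the CUDA driver or any
    framework allocator actually manages device memory.
    """

    def __init__(self, scrub_on_free: bool):
        self.scrub_on_free = scrub_on_free
        self._free_blocks: list[bytearray] = []

    def allocate(self, size: int) -> bytearray:
        # Reuse a cached block of the right size if one exists
        for i, block in enumerate(self._free_blocks):
            if len(block) == size:
                return self._free_blocks.pop(i)
        return bytearray(size)

    def free(self, block: bytearray) -> None:
        if self.scrub_on_free:
            block[:] = bytes(len(block))  # overwrite with zeros before caching
        self._free_blocks.append(block)


def residual_bytes_after_reuse(scrub_on_free: bool) -> int:
    """Count non-zero bytes a second tenant sees in a recycled block."""
    pool = ToyCachingAllocator(scrub_on_free)
    victim = pool.allocate(16)
    victim[:] = b"SECRET-WEIGHTS!!"  # 16 bytes of stand-in tenant data
    pool.free(victim)
    attacker_view = pool.allocate(16)
    return sum(1 for b in attacker_view if b != 0)


if __name__ == "__main__":
    print("without scrubbing:", residual_bytes_after_reuse(False))
    print("with scrubbing:   ", residual_bytes_after_reuse(True))
```

The same reasoning explains why the timing step matters: the sooner the second allocation follows the release, the less likely the block has been overwritten by anything else.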
Scenario 2: Intercepting distributed training
1. Gain access to the training cluster network (InfiniBand or RoCE fabric)
2. Enumerate NCCL communication endpoints (default: no authentication)
3. Join the NCCL communication ring by impersonating a training worker
4. Intercept gradient updates transmitted between nodes during allreduce operations
5. Reconstruct model updates and potentially training data from gradient information
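To see why ring membership alone exposes gradient data in steps 3 to 5, consider a heavily simplified simulation of a ring reduction. It passes a single running sum around the ring once, whereas real NCCL rings split tensors into chunks (reduce-scatter followed by all-gather). The function name and the tiny gradient vectors are illustrative, and no networking is involved.

```python
def ring_allreduce_sum(gradients: list[list[float]], observer_rank: int):
    """Simplified ring reduction that records what one rank receives.

    Returns the reduced vector and every message delivered to observer_rank.
    Assumption: rank 0 starts the ring and each rank adds its own gradient
    before forwarding; chunking and overlap are deliberately left out.
    """
    world_size = len(gradients)
    observed = []
    running = list(gradients[0])  # rank 0 sends its gradient first
    for rank in range(1, world_size):
        if rank == observer_rank:
            # Partial sum arriving from the previous rank in the ring
            observed.append(list(running))
        running = [a + b for a, b in zip(running, gradients[rank])]
    # Broadcast phase: every rank, including the observer, gets the final sum
    observed.append(list(running))
    return running, observed


if __name__ == "__main__":
    per_rank = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
    total, seen = ring_allreduce_sum(per_rank, observer_rank=1)
    print("reduced gradient:", total)
    print("messages seen by rank 1:", seen)
```

In this run the observer at rank 1 receives rank 0's gradient unmodified, and with the final sum and its own contribution it can recover rank 2's gradient by subtraction.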
Scenario 3: GPU-assisted container escape
1. From within a GPU-enabled container, access /dev/nvidia* device files
2. Use GPU memory mapping to probe host memory regions accessible through DMA
3. Exploit NVIDIA driver vulnerabilities for kernel-level privilege escalation
4. Use GPU DMA capabilities to read or write host memory outside container boundaries
5. Establish persistence through GPU firmware or driver-level modifications
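Step 1 starts from the device nodes a container can see, so an assessment usually begins by listing them with their permissions. The sketch below does that with the standard library only. The function name and the returned dictionary keys are choices made for this example, and the directory is a parameter so the logic can be exercised outside a GPU host.

```python
import glob
import os
import stat


def list_gpu_device_nodes(dev_dir: str = "/dev") -> list[dict]:
    """List nvidia* entries under dev_dir with their permission bits."""
    nodes = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "nvidia*"))):
        info = os.stat(path)
        nodes.append({
            "path": path,
            "mode": stat.filemode(info.st_mode),
            "world_writable": bool(info.st_mode & stat.S_IWOTH),
            "is_char_device": stat.S_ISCHR(info.st_mode),
        })
    return nodes


if __name__ == "__main__":
    for node in list_gpu_device_nodes():
        flag = "world-writable" if node["world_writable"] else "restricted"
        print(f'{node["mode"]} {node["path"]} ({flag})')
```

Comparing this list against the GPUs the workload was actually assigned shows whether the container has been given more device access than it needs.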
Related topics
- Attacks on AI deployments -- general attacks on deployment infrastructure
- Kubernetes security for ML workloads -- Kubernetes-specific security for ML infrastructure
- Infrastructure exploitation -- broader infrastructure exploitation techniques
- Model supply chain risks -- supply chain attacks at the model level
- Attacks on distributed training -- attacking the training process itself