# Lab: GPU Side-Channel Attacks
Demonstrate information leakage through GPU memory residuals and timing side channels, showing how shared GPU infrastructure can expose sensitive model data.
## Prerequisites

- NVIDIA GPU with CUDA support
- Understanding of GPU memory architecture and CUDA programming basics
- Python with `torch`, `pynvml`, and `numpy`
- Familiarity with side-channel attack concepts

```bash
pip install torch pynvml numpy matplotlib
```

## Background
Side-channel attacks on GPUs exploit the shared nature of GPU hardware in multi-tenant environments. When multiple users or processes share a GPU, memory residuals from one process can leak into another, and timing variations during inference can reveal information about inputs or model architecture.
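Before the exercises, the timing-leakage idea can be reduced to a statistic: if two workloads produce latency distributions that a simple threshold separates better than chance, the channel leaks. A minimal sketch, using simulated latencies (the `timing_distinguisher` helper is illustrative, not a standard API):

```python
import numpy as np

def timing_distinguisher(timings_a, timings_b):
    """Threshold classifier at the midpoint of the two class means.
    Returns the fraction of samples assigned to the correct class;
    accuracy well above 0.5 means the channel leaks which class ran."""
    mean_a, mean_b = np.mean(timings_a), np.mean(timings_b)
    threshold = (mean_a + mean_b) / 2
    # A sample is correct if it falls on its own class's side of the threshold
    correct_a = np.sum((np.asarray(timings_a) < threshold) == (mean_a < mean_b))
    correct_b = np.sum((np.asarray(timings_b) < threshold) == (mean_b < mean_a))
    return (correct_a + correct_b) / (len(timings_a) + len(timings_b))

# Simulated latencies: the "slow" class takes ~5% longer on average
rng = np.random.default_rng(0)
fast = rng.normal(1.00, 0.02, 1000)  # ms
slow = rng.normal(1.05, 0.02, 1000)
print(f"Distinguisher accuracy: {timing_distinguisher(fast, slow):.2f}")
```

Even a 5% mean difference buried in noise is distinguishable with enough samples, which is why the timing exercise below averages over 100 runs.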
## Lab Exercises

### Detect GPU Memory Residuals
Demonstrate that GPU memory retains data after deallocation.
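The exercise script confirms leakage by searching for a known constant (42.0). A real attacker does not know the secret value, so detection has to be pattern-agnostic. A minimal heuristic sketch (the `residual_signal` helper is hypothetical, not part of the lab scripts):

```python
import numpy as np

def residual_signal(buf):
    """Heuristic residual detector: scrubbed memory is almost entirely
    zeros, while leftover tensor data tends to contain a small set of
    heavily repeated float values. Returns (zero fraction, most common
    nonzero value, that value's fraction of the buffer)."""
    flat = np.asarray(buf).ravel()
    zero_frac = float(np.mean(flat == 0.0))
    nonzero = flat[flat != 0.0]
    if nonzero.size == 0:
        return zero_frac, None, 0.0
    vals, counts = np.unique(nonzero, return_counts=True)
    top = int(np.argmax(counts))
    return zero_frac, float(vals[top]), float(counts[top]) / flat.size

# A buffer that is 40% a leaked constant and 60% scrubbed zeros
buf = np.zeros(10000, dtype=np.float32)
buf[:4000] = 42.0
print(residual_signal(buf))
```

On a real probe you would feed it `new_tensor.cpu().numpy()` from the script below; random noise yields a top-value fraction near zero, while residual tensor data stands out immediately.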
```python
#!/usr/bin/env python3
"""Demonstrate GPU memory residual data leakage."""
import torch


def demonstrate_memory_residual():
    """Show that GPU memory retains data after deallocation."""
    if not torch.cuda.is_available():
        print("CUDA not available. This lab requires an NVIDIA GPU.")
        return

    device = torch.device("cuda:0")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # Phase 1: Write a known pattern to GPU memory
    print("\n=== Phase 1: Write sensitive data ===")
    secret_data = torch.full((1024, 1024), 42.0, device=device)
    secret_ptr = secret_data.data_ptr()
    print(f"  Written secret value (42.0) at address {hex(secret_ptr)}")
    print(f"  Sample values: {secret_data[0, :5].tolist()}")

    # Phase 2: Delete the tensor (but do NOT clear memory)
    del secret_data
    # Note: torch.cuda.empty_cache() releases memory to the CUDA driver
    # but does NOT zero it
    torch.cuda.empty_cache()
    print("\n=== Phase 2: Tensor deleted, cache emptied ===")

    # Phase 3: Allocate a new tensor of the same size (likely same address)
    print("\n=== Phase 3: Allocate new tensor ===")
    # Use uninitialized allocation to read residual memory
    new_tensor = torch.empty((1024, 1024), device=device)
    new_ptr = new_tensor.data_ptr()
    print(f"  New allocation at address {hex(new_ptr)}")

    # Check whether residual data is present
    residual_values = new_tensor[0, :10].tolist()
    print(f"  Residual values: {residual_values}")

    # Check whether the secret value (42.0) appears in residuals
    matches = (new_tensor == 42.0).sum().item()
    total = new_tensor.numel()
    print(f"  Cells matching secret value: {matches}/{total} "
          f"({matches/total*100:.2f}%)")

    if matches > 0:
        print("\n  ** FINDING: Residual data from previous allocation detected **")
        print("  In a shared GPU environment, this could leak data between tenants.")


def demonstrate_model_weight_residual():
    """Show that model weights persist in GPU memory after model deletion."""
    if not torch.cuda.is_available():
        return

    device = torch.device("cuda:0")

    # Load a small model
    print("\n=== Model Weight Residual Test ===")
    model = torch.nn.Linear(256, 256).to(device)

    # Record known weights
    with torch.no_grad():
        model.weight.fill_(3.14)
    weight_sample = model.weight[0, :5].tolist()
    print(f"  Model weight sample: {weight_sample}")

    # Delete the model
    del model
    torch.cuda.empty_cache()
    print("  Model deleted, cache emptied")

    # Try to recover weights from residual memory
    probe = torch.empty((256, 256), device=device)
    recovered = (probe == 3.14).sum().item()
    print(f"  Residual weight matches: {recovered}/{probe.numel()}")

    if recovered > 0:
        print("  ** Model weight residuals detected in GPU memory **")


if __name__ == "__main__":
    demonstrate_memory_residual()
    demonstrate_model_weight_residual()
```

```bash
python memory_residual.py
```

### Measure Inference Timing Side Channels
Demonstrate that inference timing reveals information about input characteristics.
```python
#!/usr/bin/env python3
"""Measure inference timing side channels."""
import time

import numpy as np
import torch


def measure_inference_timing():
    """Show how inference timing varies with input characteristics."""
    if not torch.cuda.is_available():
        print("CUDA not available.")
        return

    device = torch.device("cuda:0")
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    ).to(device).eval()

    print("=== Inference Timing Analysis ===\n")

    # Test 1: Batch size timing
    print("--- Batch Size Timing ---")
    for batch_size in [1, 4, 16, 64, 256]:
        x = torch.randn(batch_size, 512, device=device)

        # Warmup
        with torch.no_grad():
            model(x)
        torch.cuda.synchronize()

        timings = []
        for _ in range(100):
            torch.cuda.synchronize()
            start = time.perf_counter()
            with torch.no_grad():
                model(x)
            torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000)

        avg = np.mean(timings)
        std = np.std(timings)
        print(f"  batch_size={batch_size:>3}: {avg:.3f}ms +/- {std:.3f}ms")

    # Test 2: Input magnitude timing
    print("\n--- Input Magnitude Timing ---")
    for magnitude in [0.001, 0.1, 1.0, 10.0, 1000.0]:
        x = torch.randn(32, 512, device=device) * magnitude

        timings = []
        for _ in range(100):
            torch.cuda.synchronize()
            start = time.perf_counter()
            with torch.no_grad():
                model(x)
            torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000)

        avg = np.mean(timings)
        print(f"  magnitude={magnitude:>8.3f}: {avg:.3f}ms")

    # Test 3: Sparsity timing (relevant to models with conditional computation)
    print("\n--- Input Sparsity Timing ---")
    for sparsity in [0.0, 0.5, 0.9, 0.99, 1.0]:
        x = torch.randn(32, 512, device=device)
        mask = torch.rand(32, 512, device=device) > sparsity
        x = x * mask.float()

        timings = []
        for _ in range(100):
            torch.cuda.synchronize()
            start = time.perf_counter()
            with torch.no_grad():
                model(x)
            torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000)

        avg = np.mean(timings)
        print(f"  sparsity={sparsity:.2f}: {avg:.3f}ms")


if __name__ == "__main__":
    measure_inference_timing()
```

```bash
python timing_side_channel.py
```

Observe how inference timing varies with input characteristics, which could reveal information about what a model is processing.
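Measurements like these can be turned into an attack: collect labeled reference timings per input class, then attribute a victim's observed latency to the nearest class mean. A hedged sketch of this template-style classifier (helper names and the latency numbers are illustrative):

```python
import numpy as np

def build_templates(labeled_timings):
    """labeled_timings maps a class label (e.g. batch size) to an array
    of reference latencies collected for that class. The template is
    simply each class's mean latency."""
    return {label: float(np.mean(t)) for label, t in labeled_timings.items()}

def classify(latency_ms, templates):
    """Attribute one observed latency to the class with the nearest mean."""
    return min(templates, key=lambda label: abs(templates[label] - latency_ms))

# Reference latencies per batch size (numbers are illustrative only)
rng = np.random.default_rng(0)
refs = {1: rng.normal(0.30, 0.01, 200),
        64: rng.normal(0.55, 0.01, 200),
        256: rng.normal(1.40, 0.02, 200)}
templates = build_templates(refs)
print(classify(0.56, templates))  # latency near the batch-64 template
```

In a co-tenancy scenario the attacker would build the templates offline on identical hardware, then classify latencies observed through contention on the shared GPU.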
### Monitor GPU Utilization Patterns
Use NVML to monitor GPU utilization patterns that leak information about workload characteristics.
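Once samples are collected, simple post-processing already recovers workload structure; for example, counting rising edges gives a request-rate estimate. A sketch using the same >50% / <20% burst rule as the monitor below (helper names are illustrative):

```python
def detect_bursts(gpu_utils, high=50, low=20):
    """Indices where utilization rises from idle (< low) to busy (> high),
    the same rising-edge rule the monitoring script applies."""
    return [i for i in range(1, len(gpu_utils))
            if gpu_utils[i] > high and gpu_utils[i - 1] < low]

def estimate_request_rate(timestamps, gpu_utils):
    """Rough requests-per-second estimate, assuming one burst per request."""
    edges = detect_bursts(gpu_utils)
    if len(edges) < 2:
        return 0.0
    span = timestamps[edges[-1]] - timestamps[edges[0]]
    return (len(edges) - 1) / span if span > 0 else 0.0

# Synthetic trace sampled every 100 ms: one short burst each second
times = [i * 0.1 for i in range(100)]
utils = [90 if i % 10 == 0 else 5 for i in range(100)]
print(f"Estimated rate: {estimate_request_rate(times, utils):.1f} req/s")
```

A request-rate estimate like this reveals how busy a co-tenant's service is without ever touching their process.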
```python
#!/usr/bin/env python3
"""Monitor GPU utilization patterns for information leakage."""
import time

import pynvml


def monitor_gpu(duration_seconds: int = 10, interval_ms: int = 100):
    """Monitor GPU metrics at high frequency."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    name = pynvml.nvmlDeviceGetName(handle)
    print(f"Monitoring GPU: {name}")
    print(f"Duration: {duration_seconds}s, Interval: {interval_ms}ms\n")

    samples = []
    start = time.time()

    while time.time() - start < duration_seconds:
        try:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            power = pynvml.nvmlDeviceGetPowerUsage(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

            sample = {
                "timestamp": time.time() - start,
                "gpu_util": util.gpu,
                "mem_util": util.memory,
                "mem_used_mb": mem_info.used / 1e6,
                "power_w": power / 1000,
                "temp_c": temp,
            }
            samples.append(sample)
        except pynvml.NVMLError:
            pass

        time.sleep(interval_ms / 1000)

    pynvml.nvmlShutdown()

    # Analyze patterns
    print("=== GPU Utilization Analysis ===")
    if samples:
        gpu_utils = [s["gpu_util"] for s in samples]
        mem_utils = [s["mem_used_mb"] for s in samples]
        powers = [s["power_w"] for s in samples]

        print(f"  GPU Util: min={min(gpu_utils)}% max={max(gpu_utils)}% "
              f"avg={sum(gpu_utils)/len(gpu_utils):.1f}%")
        print(f"  Memory: min={min(mem_utils):.0f}MB max={max(mem_utils):.0f}MB")
        print(f"  Power: min={min(powers):.0f}W max={max(powers):.0f}W")

        # Detect inference bursts
        bursts = 0
        for i in range(1, len(gpu_utils)):
            if gpu_utils[i] > 50 and gpu_utils[i - 1] < 20:
                bursts += 1
        print(f"\n  Detected inference bursts: {bursts}")
        print("  (Burst patterns reveal when and how often inference occurs)")

    return samples


if __name__ == "__main__":
    print("Start an inference workload in another terminal to observe patterns.\n")
    monitor_gpu(duration_seconds=15)
```

```bash
python gpu_monitor.py
```

### Implement GPU Memory Sanitization
Build defenses that prevent memory residual leakage.
```python
#!/usr/bin/env python3
"""GPU memory sanitization to prevent residual data leakage."""
import time

import torch


def sanitize_gpu_memory():
    """Zero all available GPU memory to prevent residual leakage."""
    if not torch.cuda.is_available():
        return

    device = torch.device("cuda:0")

    # Allocate and zero as much memory as possible
    start = time.time()
    chunk_size = 256 * 1024 * 1024  # 256 MB chunks
    chunks = []
    try:
        while True:
            # float32 is 4 bytes, so divide the byte count by 4
            chunk = torch.zeros(chunk_size // 4, device=device, dtype=torch.float32)
            chunks.append(chunk)
    except RuntimeError:
        pass  # Out of memory -- we've filled and zeroed all available memory

    # Release
    del chunks
    torch.cuda.empty_cache()
    elapsed = time.time() - start
    print(f"GPU memory sanitized in {elapsed:.2f}s")


def secure_model_unload(model):
    """Securely unload a model by zeroing weights before deletion."""
    for param in model.parameters():
        param.data.zero_()
    for buffer in model.buffers():
        buffer.zero_()
    # Note: del only drops this function's reference; the caller must
    # release its own references for the memory to actually be freed
    del model
    torch.cuda.empty_cache()
    print("Model securely unloaded (weights zeroed)")


def verify_sanitization():
    """Verify that sanitization prevents memory residual recovery."""
    if not torch.cuda.is_available():
        return

    device = torch.device("cuda:0")

    # Write a secret, sanitize, then check
    secret = torch.full((1024, 1024), 99.0, device=device)
    del secret
    sanitize_gpu_memory()

    probe = torch.empty((1024, 1024), device=device)
    matches = (probe == 99.0).sum().item()
    print(f"Post-sanitization residual matches: {matches}/{probe.numel()}")
    if matches == 0:
        print("Sanitization effective: no residual data found")
    else:
        print("WARNING: Residual data still present after sanitization")


if __name__ == "__main__":
    print("=== GPU Memory Sanitization ===\n")
    sanitize_gpu_memory()
    verify_sanitization()
```

```bash
python gpu_sanitize.py
```
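Device-wide scrubbing is coarse; the same zero-before-free discipline can be scoped per allocation with a context manager that guarantees an overwrite before the buffer leaves scope. A sketch (illustrative only; written with numpy so it runs without a GPU, where a torch version would call `tensor.zero_()` in the cleanup step):

```python
from contextlib import contextmanager

import numpy as np

@contextmanager
def scrubbed(shape, dtype=np.float32):
    """Allocate a working buffer and guarantee it is overwritten with
    zeros when the scope exits -- the per-allocation analogue of
    sanitize_gpu_memory's device-wide sweep."""
    buf = np.empty(shape, dtype=dtype)
    try:
        yield buf
    finally:
        buf.fill(0)  # deterministic overwrite before the scope releases it

with scrubbed((4, 4)) as work:
    work[:] = 42.0      # the secret lives only inside the scope
print(work.sum())       # prints 0.0 -- the buffer was zeroed on exit
```

Because the overwrite runs in `finally`, the buffer is scrubbed even if the code inside the scope raises, which is the property ad-hoc cleanup code tends to miss.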
## Troubleshooting

| Issue | Solution |
|---|---|
| No CUDA available | Install NVIDIA drivers and the CUDA toolkit, or use Google Colab |
| `pynvml` import errors | Install with `pip install pynvml` (also distributed as `nvidia-ml-py3`) |
| Memory residuals not detected | Results depend on GPU allocator behavior; try different allocation sizes |
| Timing measurements too noisy | Increase warmup iterations and the measurement count |
## Related Topics

- Inference Server Exploitation - Infrastructure-level attacks on model serving
- Container Breakout - Escaping container isolation to access GPU hardware
- Training Data Extraction - Model-level data extraction that complements these hardware-level techniques
- Model Serving Attacks - Broader serving-framework attack surface
## References
- "GPU Memory Exploitation: Practical Side-Channel Attacks on GPU Memory" - Naghibijouybari et al. (2018) - GPU side-channel attack methodology
- "Rendered Insecure: GPU Side Channel Attacks are Practical" - Naghibijouybari et al. (2018) - Demonstrated practical GPU side channels
- "Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU" - Frigo et al. (2018) - GPU-based attacks on CPU memory
- "CUDA Leaks: A Detailed Hack for CUDA and a (Partial) Fix" - Pietro et al. (2016) - GPU memory isolation vulnerabilities in CUDA
**Discussion question:** Why does GPU memory create a higher risk of data-residual leakage compared to CPU memory?