GPU Security for AI
GPU security risks in AI workloads — covering memory isolation failures, side-channel attacks, multi-tenant GPU risks, GPU firmware vulnerabilities, and secure GPU configuration.
GPUs are the computational foundation of modern AI. Training, fine-tuning, and inference all depend on GPU compute. Yet GPU security has received far less attention than CPU security, despite GPUs processing some of the most sensitive data in AI systems — model weights, training data, user prompts, and model outputs. This page covers the security risks of GPU-based AI computing.
GPU Architecture and Security
Memory Model
GPUs have their own memory hierarchy that differs significantly from CPU memory. GPU global memory (VRAM) is shared across all threads executing on the GPU. GPU shared memory provides per-thread-block scratchpad storage. GPU registers provide per-thread storage. L1 caches are local to each Streaming Multiprocessor, while the L2 cache is shared across the entire GPU.
The security implications of this architecture are significant. GPU global memory does not provide the same process-level isolation that CPU virtual memory provides. When multiple workloads share a GPU, they share the same global memory space. Without explicit isolation mechanisms, one workload can potentially access another workload's data.
Compute Units
Modern GPUs contain thousands of compute cores organized into Streaming Multiprocessors (SMs) or Compute Units (CUs). When multiple workloads share a GPU, they may share SMs, caches, and memory controllers. This sharing creates opportunities for side-channel attacks.
Memory Isolation Vulnerabilities
Cross-Tenant Memory Leakage
In multi-tenant GPU environments — cloud GPU instances, shared inference servers, multi-user training clusters — memory from one tenant's workload may be accessible to another tenant.
Uninitialized memory: When a GPU allocates memory for a new workload, the memory may contain data from a previous workload if it was not explicitly cleared. This is analogous to the uninitialized memory vulnerabilities that have plagued CPU systems, but GPU memory clearing is less consistently implemented.
Memory allocation overlap: Some GPU memory allocators recycle memory regions. If a memory region is freed by one workload and immediately allocated to another, the new workload may read residual data from the previous workload. This can expose model weights, intermediate activations, or input/output data.
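The recycling hazard above can be sketched in a few lines. This is an illustrative simulation, not real GPU code: the pool classes and the tenant workloads are hypothetical, but the pattern — a freed buffer handed to the next caller without clearing — is exactly how residual data survives allocator reuse, and zeroing on free (the cudaMemset-on-release pattern) is the fix.

```python
# Simulated allocator: NaivePool recycles freed buffers as-is, like some GPU
# memory allocators; ZeroingPool clears buffers before recycling them.

class NaivePool:
    """Recycles freed buffers without clearing them."""
    def __init__(self):
        self._free = []

    def alloc(self, size: int) -> bytearray:
        for i, buf in enumerate(self._free):
            if len(buf) >= size:
                return self._free.pop(i)     # residual data still present
        return bytearray(size)

    def free(self, buf: bytearray) -> None:
        self._free.append(buf)               # no clearing on free


class ZeroingPool(NaivePool):
    """Same pool, but zeroes buffers on free before recycling."""
    def free(self, buf: bytearray) -> None:
        buf[:] = bytes(len(buf))             # explicit clear before reuse
        self._free.append(buf)


def residual_leak(pool) -> bytes:
    """What a second tenant sees in a buffer the first tenant used."""
    secret = b"tenant-A model weights"
    buf = pool.alloc(64)
    buf[:len(secret)] = secret               # tenant A writes sensitive data
    pool.free(buf)
    reused = pool.alloc(64)                  # tenant B gets the recycled buffer
    return bytes(reused[:len(secret)])
```

With the naive pool, `residual_leak` returns tenant A's secret verbatim; with the zeroing pool it returns only zero bytes.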
Shared memory side channels: GPU shared memory, while nominally per-thread-block, can leak information through timing side channels. The time taken to access shared memory depends on cache state, which is influenced by other workloads sharing the same SM.
KV Cache Security
For LLM inference, the KV (key-value) cache stores attention state for ongoing generations. In serving frameworks that handle multiple requests, KV cache management creates security considerations.
KV cache sharing: Some serving optimizations share KV cache between requests with common prefixes (prompt caching). If KV cache is shared between different users' requests, one user's context may influence another user's generation. This is a functional feature that can have security implications in multi-tenant environments.
KV cache persistence: When a request completes, the KV cache should be cleared. If KV cache memory is reallocated to a new request without clearing, the new request may have access to residual attention state from the previous request.
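A minimal sketch of the clear-on-release discipline for KV cache slots, under simplifying assumptions: the slot layout and `KVCachePool` API are hypothetical (real serving frameworks manage paged KV blocks with far more machinery), but the hygiene requirement is the same — zero the attention state before a slot is handed to the next request.

```python
# Hypothetical KV cache slot pool that zeroes a slot on release, so a request
# that reuses the slot cannot read the previous request's attention state.

from array import array

class KVCachePool:
    def __init__(self, num_slots: int, slot_len: int):
        self.slot_len = slot_len
        self._slots = [array("f", [0.0] * slot_len) for _ in range(num_slots)]
        self._free = list(range(num_slots))

    def acquire(self) -> int:
        return self._free.pop()

    def release(self, slot_id: int) -> None:
        slot = self._slots[slot_id]
        for i in range(self.slot_len):       # zero residual attention state
            slot[i] = 0.0
        self._free.append(slot_id)

    def write(self, slot_id: int, values) -> None:
        self._slots[slot_id][: len(values)] = array("f", values)

    def read(self, slot_id: int) -> list:
        return list(self._slots[slot_id])
```

Acquiring a slot, writing to it, releasing it, and acquiring it again should always yield zeros — a property worth asserting in the serving framework's test suite.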
KV cache extraction: An attacker who can read the KV cache can reconstruct significant portions of the input prompt and generated output. KV cache contains rich semantic information about the conversation, making it a high-value target for data extraction.
Side-Channel Attacks
Timing Side Channels
GPU computation timing varies based on the data being processed. For neural network inference, the timing of operations depends on the input data and model weights. An attacker who can precisely measure inference timing can potentially extract information about the input or the model.
Token-level timing: In autoregressive generation (how LLMs produce text), each token's generation time depends on the model's internal state, which is influenced by the prompt and previous tokens. Precise timing measurements can reveal information about the generated tokens.
Batch processing timing: When multiple requests are batched together for efficiency, the total batch processing time depends on the longest request. An attacker in the same batch can measure how their request's processing time changes with different batch compositions, potentially inferring information about other requests in the batch.
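The batch-timing channel reduces to a simple observation: batched decode steps finish together, so the slowest request gates the batch. This toy model uses invented latency numbers to show why an attacker who submits a deliberately short probe request learns an upper bound dominated by their co-batched victim.

```python
# Toy model of the batch-timing side channel. All latency values are invented;
# real measurements would be noisy wall-clock timings of API responses.

def batch_latency(request_latencies_ms):
    # Batched requests complete together: the longest one gates the batch.
    return max(request_latencies_ms)

attacker_probe_ms = 5.0    # attacker's own short request (known baseline)
victim_ms = 42.0           # unknown to the attacker

observed = batch_latency([attacker_probe_ms, victim_ms])
# Because the probe is much faster than the batch, the observation is
# dominated by the co-batched victim's processing time.
```

Repeating this probe across many batch compositions lets the attacker correlate latency shifts with properties of other tenants' requests (e.g., prompt length).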
Power Side Channels
GPU power consumption varies with the data being processed. Power measurements — available through NVIDIA's NVML library or hardware power monitoring — can reveal information about AI workloads.
Model architecture extraction: The power consumption pattern during inference can reveal structural details of the model — layer types, layer sizes, activation functions, and attention mechanisms. An attacker with power monitoring access may be able to reconstruct the model's structure without access to the weights.
Input inference: Power consumption patterns during inference are influenced by the input data. For models that process text, the power pattern changes with different input tokens. While extracting specific inputs from power traces is challenging, distinguishing between broad categories of inputs (different languages, different topics) may be feasible.
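To make the "broad categories" claim concrete, here is a deliberately simplified illustration with fabricated data: if two input categories produce systematically different average power draw, even a single-threshold classifier separates them. Real power side-channel analysis works on measured traces (e.g., via NVML) and requires substantial signal processing; everything below, including the sample values, is invented for illustration.

```python
# Fabricated per-sample power readings (watts) for two hypothetical input
# categories. The separation shown here is exaggerated for clarity.

def mean(trace):
    return sum(trace) / len(trace)

trace_category_a = [250, 255, 248, 252, 251]   # hypothetical input class A
trace_category_b = [290, 288, 292, 291, 289]   # hypothetical input class B

def classify(trace, threshold_w=270.0):
    """Single-threshold classifier over mean power draw."""
    return "category_a" if mean(trace) < threshold_w else "category_b"
```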
Electromagnetic Side Channels
GPUs emit electromagnetic radiation that varies with computation. In physically close environments (shared data centers, co-located servers), EM emanations can be measured and analyzed to extract information about GPU computations. This attack requires physical proximity and specialized equipment but is well-established in the hardware security literature.
Multi-Tenant GPU Risks
Cloud GPU Sharing
Cloud providers offer GPU instances that may share physical GPU hardware with other tenants through time-sharing or spatial partitioning.
Time-sharing risks: When GPUs are time-shared (different tenants use the GPU at different times), the risks are primarily residual data in memory and caches from previous tenants. Clearing GPU state between tenant transitions is essential but may not be complete.
Spatial partitioning risks: NVIDIA Multi-Instance GPU (MIG) provides hardware-level GPU partitioning, creating isolated GPU instances from a single physical GPU. MIG provides stronger isolation than software-based partitioning, but it is not available on all GPU models and may not fully isolate all shared resources (memory controllers, PCIe bus).
Inference Server Sharing
Serving frameworks that handle multiple users' requests on the same GPU create multi-tenancy within the inference process. All requests share the same GPU memory space, the same model weights, and the same serving process.
Cross-request information leakage: Without careful implementation, one user's request data (prompt, context, generated tokens) may be accessible to another user's request through shared memory, shared caches, or serving framework bugs.
Request ordering attacks: An attacker who can control the timing of their requests relative to a target user's requests may be able to exploit batch processing to extract information about the target's request.
GPU Firmware and Driver Vulnerabilities
NVIDIA Driver Security
NVIDIA GPU drivers are complex software with a large attack surface. Driver vulnerabilities can enable container escape (breaking out of GPU container isolation), privilege escalation (gaining root access through driver bugs), denial of service (crashing the driver and taking down every GPU workload on the host), and information disclosure (reading GPU memory across isolation boundaries).
Keep GPU drivers updated. NVIDIA regularly releases security patches for driver vulnerabilities. Monitor NVIDIA's security bulletins and apply patches promptly.
CUDA Security
CUDA, the programming framework for NVIDIA GPUs, provides the interface between AI frameworks and GPU hardware. CUDA vulnerabilities can affect all AI workloads running on affected GPUs.
Custom CUDA kernels: Some AI frameworks use custom CUDA kernels for performance optimization. Custom kernels that are not properly validated can access GPU memory outside their intended boundaries, potentially reading other workloads' data or corrupting GPU state.
CUDA compatibility: Mismatches between CUDA versions, driver versions, and framework versions can create unexpected behavior that may have security implications. Maintain consistent, tested version combinations in production.
Defense Strategies
GPU Memory Clearing
Explicitly clear GPU memory after each workload. Use cudaMemset or equivalent to zero GPU memory before releasing it. Implement memory clearing in the serving framework after each request. Verify that memory clearing actually occurs through testing.
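The "verify through testing" advice can be automated with a canary probe: write a known pattern, release the buffer through the allocator under test, re-allocate, and scan for the pattern. The sketch below assumes a generic `alloc`/`free` interface so it can run anywhere; on real GPUs the same probe would be written against cudaMalloc/cudaMemset/cudaMemcpy.

```python
# Clearing-verification probe: returns False if the allocator ever hands back
# a buffer still containing the canary written before release.

CANARY = b"\xde\xad\xbe\xef" * 8

def clearing_is_effective(alloc, free, size=64, trials=16) -> bool:
    for _ in range(trials):
        buf = alloc(size)
        buf[: len(CANARY)] = CANARY          # plant the canary pattern
        free(buf)
        reused = alloc(size)
        leaked = CANARY in bytes(reused)     # scan the recycled buffer
        free(reused)
        if leaked:
            return False                     # residual canary found
    return True
```

Run this as a recurring test: allocator behavior can regress silently across driver, framework, or serving-stack upgrades.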
Hardware Isolation
Use NVIDIA MIG for hardware-level GPU partitioning in multi-tenant environments. Use separate GPU devices for different security domains. Implement GPU affinity to prevent workload migration between GPUs.
Monitoring
Monitor GPU utilization, memory usage, temperature, and power consumption for anomalies. Unexpected patterns may indicate side-channel attacks, unauthorized workloads, or resource abuse. Implement alerts for GPU memory pressure, unusual compute patterns, and driver errors.
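One minimal shape for the alerting described above is a rolling z-score over power samples. The window size, threshold, and sample values below are invented; in production the samples would come from NVML or an existing metrics pipeline rather than a hard-coded list.

```python
# Rolling z-score anomaly detector over GPU power samples (watts). A reading
# more than z_threshold standard deviations from the trailing-window mean is
# flagged as an alert.

import statistics

def power_anomalies(samples_w, window=10, z_threshold=3.0):
    alerts = []
    for i in range(window, len(samples_w)):
        baseline = samples_w[i - window : i]
        mu = statistics.mean(baseline)
        sigma = statistics.stdev(baseline) or 1e-9   # avoid divide-by-zero
        z = (samples_w[i] - mu) / sigma
        if abs(z) > z_threshold:
            alerts.append((i, samples_w[i]))         # index, offending reading
    return alerts
```

The same structure applies to utilization, memory pressure, and temperature; the hard part in practice is choosing baselines that tolerate legitimate workload shifts.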
Access Control
Restrict access to GPU devices through operating system controls. Only authorized processes should have access to GPU hardware. Implement device permissions that prevent unprivileged processes from accessing GPU memory or compute capabilities. In containerized environments, use device plugins and security contexts to control GPU access.
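A device-permission audit is straightforward to sketch: flag any GPU device node that is world-readable or world-writable. The device paths below are typical for NVIDIA on Linux but are assumptions — adjust them for your platform, and note that default permissions on these nodes are often more permissive than a locked-down deployment should allow.

```python
# Audit GPU device nodes for world-accessible permission bits.

import os
import stat

def world_accessible(path: str) -> bool:
    """True if 'other' users can read or write the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IROTH | stat.S_IWOTH))

def audit_gpu_devices(paths=("/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm")):
    findings = []
    for path in paths:
        if os.path.exists(path) and world_accessible(path):
            findings.append(path)            # overly permissive device node
    return findings
```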
Firmware Updates
Maintain current GPU firmware and driver versions. Subscribe to NVIDIA's security advisories. Test firmware updates in staging before production deployment. Implement rollback procedures for firmware updates that cause issues.
GPU security is an area where the threat landscape is evolving faster than the defensive tooling. As GPU sharing becomes more common and AI workloads process increasingly sensitive data, GPU security will become a critical component of AI infrastructure security.