AI System Memory Forensics
Memory forensics techniques for investigating compromised AI systems, including GPU memory analysis, model weight extraction, and runtime state recovery.
Overview
Memory forensics for AI systems extends traditional digital forensics into the unique runtime environment of machine learning workloads. AI systems maintain complex state in both CPU and GPU memory during inference and training: model weights, optimizer states, attention caches (KV caches), intermediate activations, tokenizer configurations, and dynamically loaded adapter weights. When an AI system is compromised -- whether through model tampering, unauthorized fine-tuning, or runtime manipulation -- memory forensics provides the investigator with a snapshot of the system's actual state at the time of capture.
Traditional memory forensics tools like Volatility are designed for CPU-addressable memory and operating system artifacts. AI workloads, however, distribute critical state across GPU VRAM, unified memory architectures, and framework-managed memory pools that require specialized extraction techniques. This article covers the end-to-end process of AI system memory forensics, from capture through analysis and reporting.
The stakes are significant: if an attacker has modified model weights in memory without altering the on-disk checkpoint, only a memory forensic investigation will reveal the tampering. Similarly, if an attacker has injected malicious code into a model serving pipeline that modifies outputs at runtime, memory analysis may be the only way to recover the injected logic.
Memory Architecture of AI Systems
CPU Memory Layout
AI serving frameworks (vLLM, Triton Inference Server, TorchServe) maintain several categories of data in CPU memory:
- Model configuration: Hyperparameters, tokenizer vocabulary, generation parameters
- Request queues: Pending inference requests including full prompt text
- Response buffers: Generated outputs before delivery to clients
- Framework metadata: Scheduling state, batch composition, memory allocation maps
- Logging buffers: Circular buffers of recent inference events
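Because request queues and response buffers hold plaintext, a captured heap region can be triaged with a simple printable-string scan, in the spirit of the classic `strings` utility. The sketch below is illustrative; `extract_strings` and the sample dump bytes are hypothetical, and in practice the input would be a region dumped from `/proc/<pid>/mem`.

```python
# Sketch: recover printable ASCII artifacts (prompts, config JSON) from a raw
# process memory dump. extract_strings is a hypothetical helper; real dumps
# come from /proc/<pid>/mem regions identified via the maps file.
import re

def extract_strings(dump_bytes: bytes, min_len: int = 16) -> list[str]:
    """Return printable ASCII runs of at least min_len bytes from a raw dump."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(dump_bytes)]

# Example: scan a captured heap region for request-like artifacts
dump = b'\x00\x01{"prompt": "summarize the quarterly report"}\xff\x00garbage'
artifacts = [s for s in extract_strings(dump) if "prompt" in s]
```

Filtering on markers such as `"prompt"` or known API field names quickly narrows thousands of recovered strings to inference-relevant artifacts.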
GPU Memory Layout
GPU VRAM contains the computationally active components:
- Model weights: The parameter tensors that define model behavior, often stored in reduced-precision or quantized formats (FP16, INT8, INT4)
- KV cache: Key-value attention cache for active generation sessions, containing the model's "working memory" of ongoing conversations
- Activation tensors: Intermediate computation results during forward passes
- CUDA graphs: Pre-compiled computation graphs for optimized inference paths
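A useful triage step is comparing observed VRAM usage against the footprint these components should occupy: weights scale with parameter count, and the KV cache scales with layers, heads, sequence length, and batch size. Unexplained allocations beyond the estimate warrant closer inspection. The arithmetic below is a sketch; the 7B-class model dimensions are illustrative assumptions.

```python
# Sketch: estimate the expected VRAM footprint of weights plus KV cache so an
# investigator can flag unexplained allocations. Model dimensions below
# (7B-class model served in FP16) are illustrative assumptions.

def expected_footprint_bytes(
    n_params: int, n_layers: int, n_kv_heads: int,
    head_dim: int, seq_len: int, batch: int, dtype_bytes: int = 2,
) -> dict:
    weights = n_params * dtype_bytes
    # 2x for keys and values; one cache entry per layer, per head, per token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes
    return {"weights": weights, "kv_cache": kv_cache, "total": weights + kv_cache}

est = expected_footprint_bytes(
    n_params=7_000_000_000, n_layers=32, n_kv_heads=32,
    head_dim=128, seq_len=4096, batch=8,
)
```

Activation tensors and CUDA graph workspaces add framework-dependent overhead on top of this baseline, so the estimate bounds the minimum, not the exact, expected usage.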
Unified Memory and NVLink
Modern GPU architectures support unified virtual addressing (UVA) where CPU and GPU memory appear as a single address space. Multi-GPU systems connected via NVLink distribute model weights through tensor parallelism or pipeline parallelism, meaning a complete model state may be spread across multiple GPUs.
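A practical consequence for integrity checking: under tensor parallelism each GPU holds only a slice of every weight, so reference hashes must be maintained per shard and compared rank by rank. The sketch below assumes a simple manifest structure (one `{param_name: sha256}` dict per rank); `verify_sharded` and the shard layout are hypothetical, not a specific framework's API.

```python
# Sketch: per-rank shard integrity comparison for tensor-parallel deployments.
# shard_manifests is an assumed structure: one {param_name: sha256} dict per
# GPU rank, captured independently on each device.
import hashlib

def hash_shard(shard_bytes: bytes) -> str:
    return hashlib.sha256(shard_bytes).hexdigest()

def verify_sharded(shard_manifests: list[dict], reference: list[dict]) -> list[str]:
    """Return 'rank:param' identifiers whose shard hash deviates from reference."""
    deviations = []
    for rank, (captured, expected) in enumerate(zip(shard_manifests, reference)):
        for pname, digest in captured.items():
            if expected.get(pname) != digest:
                deviations.append(f"{rank}:{pname}")
    return deviations

ref = [{"w_q": hash_shard(b"rank0-slice")}, {"w_q": hash_shard(b"rank1-slice")}]
cap = [{"w_q": hash_shard(b"rank0-slice")}, {"w_q": hash_shard(b"TAMPERED")}]
```

Per-shard comparison also localizes tampering to a specific device, which matters when only one GPU in a node was accessible to the attacker.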
Memory Capture Techniques
CPU Memory Acquisition
Standard memory acquisition tools work for the CPU portion of AI system state. On Linux systems, the primary methods are:
# Method 1: /proc filesystem capture (requires root)
# Capture the memory of a running AI serving process
AI_PID=$(pgrep -f "vllm.entrypoints")
cp /proc/${AI_PID}/maps /evidence/proc_maps_$(date +%s).txt
# Dump specific memory regions identified from the maps file
# Focus on heap regions where model configs and request data reside
grep "heap" /proc/${AI_PID}/maps
# Method 2: Using gcore for a complete process core dump
# This suspends the process briefly -- coordinate with operations
gcore -o /evidence/ai_server_core ${AI_PID}
# Method 3: Using LiME (Linux Memory Extractor) for full system memory
# Build LiME kernel module for the running kernel
# insmod lime.ko "path=/evidence/full_memory.lime format=lime"
GPU Memory Acquisition
GPU memory acquisition is more complex because GPU VRAM is not directly addressable from host CPU code. The primary approaches are:
"""
GPU memory forensic capture module.
Captures model weights, KV cache, and activation state from GPU memory
for forensic analysis. Requires the model process to be accessible
(either running or via a saved CUDA context).
"""
import torch
import json
import hashlib
import time
from pathlib import Path
from dataclasses import dataclass
@dataclass
class GPUMemoryCapture:
    """Container for a forensic GPU memory capture."""
    capture_time: float
    gpu_device: int
    tensors: dict[str, dict]  # name -> {shape, dtype, hash, data_path}
    gpu_info: dict
    cuda_memory_stats: dict
def capture_gpu_state(
    model: torch.nn.Module,
    output_dir: str,
    device_id: int = 0,
) -> GPUMemoryCapture:
    """
    Capture the complete GPU-resident state of a model for forensic analysis.

    This function iterates over all parameters and buffers in the model,
    computes integrity hashes, and saves tensors to disk. The capture
    preserves the exact binary representation of each tensor.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    torch.cuda.synchronize(device_id)
    tensors = {}
    for name, param in model.named_parameters():
        cpu_copy = param.detach().cpu().contiguous()
        # numpy cannot represent bfloat16; reinterpret the bits for hashing
        if cpu_copy.dtype == torch.bfloat16:
            tensor_bytes = cpu_copy.view(torch.int16).numpy().tobytes()
        else:
            tensor_bytes = cpu_copy.numpy().tobytes()
        tensor_hash = hashlib.sha256(tensor_bytes).hexdigest()
        tensor_path = output_path / f"{name.replace('.', '_')}.pt"
        torch.save(cpu_copy, tensor_path)
        tensors[name] = {
            "shape": list(param.shape),
            "dtype": str(param.dtype),
            "device": str(param.device),
            "hash_sha256": tensor_hash,
            "data_path": str(tensor_path),
            "requires_grad": param.requires_grad,
            "size_bytes": len(tensor_bytes),
        }
    # Capture buffer state (running means, variances, etc.)
    for name, buf in model.named_buffers():
        cpu_copy = buf.detach().cpu().contiguous()
        if cpu_copy.dtype == torch.bfloat16:
            tensor_bytes = cpu_copy.view(torch.int16).numpy().tobytes()
        else:
            tensor_bytes = cpu_copy.numpy().tobytes()
        tensor_hash = hashlib.sha256(tensor_bytes).hexdigest()
        tensor_path = output_path / f"buffer_{name.replace('.', '_')}.pt"
        torch.save(cpu_copy, tensor_path)
        tensors[f"buffer:{name}"] = {
            "shape": list(buf.shape),
            "dtype": str(buf.dtype),
            "hash_sha256": tensor_hash,
            "data_path": str(tensor_path),
        }
    gpu_info = {
        "name": torch.cuda.get_device_name(device_id),
        "total_memory_bytes": torch.cuda.get_device_properties(device_id).total_memory,
        "capability": list(torch.cuda.get_device_capability(device_id)),
    }
    memory_stats = torch.cuda.memory_stats(device_id)
    capture = GPUMemoryCapture(
        capture_time=time.time(),
        gpu_device=device_id,
        tensors=tensors,
        gpu_info=gpu_info,
        cuda_memory_stats={
            k: v for k, v in memory_stats.items()
            if isinstance(v, (int, float))
        },
    )
    manifest_path = output_path / "capture_manifest.json"
    manifest_path.write_text(json.dumps({
        "capture_time": capture.capture_time,
        "gpu_device": capture.gpu_device,
        "gpu_info": capture.gpu_info,
        "tensor_count": len(capture.tensors),
        "tensors": capture.tensors,
    }, indent=2))
    return capture

KV Cache Extraction
The KV (Key-Value) attention cache is particularly valuable forensically because it contains the model's computed representations of all tokens processed in active sessions. Extracting the KV cache can reveal what prompts were being processed and what conversation context the model was operating with.
def extract_kv_cache_forensics(
    kv_cache: list[tuple[torch.Tensor, torch.Tensor]],
    output_dir: str,
) -> dict:
    """
    Extract and analyze KV cache state for forensic purposes.

    The KV cache contains key and value tensors for each attention layer,
    representing the model's computed context for active sessions.

    Args:
        kv_cache: List of (key, value) tensor pairs, one per layer.
        output_dir: Directory to write extracted cache data.
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    analysis = {"layers": [], "total_tokens_cached": 0}
    for layer_idx, (keys, values) in enumerate(kv_cache):
        # keys shape: (batch, num_heads, seq_len, head_dim)
        seq_len = keys.shape[2] if keys.dim() == 4 else keys.shape[1]
        analysis["total_tokens_cached"] = max(
            analysis["total_tokens_cached"], seq_len
        )
        layer_info = {
            "layer": layer_idx,
            "key_shape": list(keys.shape),
            "value_shape": list(values.shape),
            "key_hash": hashlib.sha256(
                keys.detach().cpu().numpy().tobytes()
            ).hexdigest(),
            "value_hash": hashlib.sha256(
                values.detach().cpu().numpy().tobytes()
            ).hexdigest(),
            "cached_sequence_length": seq_len,
        }
        # Save tensors for detailed analysis
        torch.save(keys.detach().cpu(), output_path / f"layer_{layer_idx}_keys.pt")
        torch.save(values.detach().cpu(), output_path / f"layer_{layer_idx}_values.pt")
        analysis["layers"].append(layer_info)
    return analysis

Analyzing Memory Captures
Weight Integrity Verification
The most critical analysis is comparing captured weights against known-good reference checksums. Any discrepancy indicates either model tampering or an unauthorized update.
def verify_weight_integrity(
    capture_manifest: dict,
    reference_hashes: dict[str, str],
) -> dict:
    """
    Compare captured model weight hashes against reference checksums.

    Args:
        capture_manifest: The manifest from a GPU memory capture.
        reference_hashes: Dict mapping parameter names to expected SHA-256 hashes.

    Returns:
        Analysis results including any mismatches.
    """
    results = {
        "total_parameters": len(capture_manifest["tensors"]),
        "verified_matching": 0,
        "mismatches": [],
        "missing_from_capture": [],
        "extra_in_capture": [],
    }
    captured_names = set(capture_manifest["tensors"].keys())
    reference_names = set(reference_hashes.keys())
    # Reference parameters absent from the capture (incomplete capture or removal)
    results["missing_from_capture"] = list(reference_names - captured_names)
    # Captured parameters with no reference entry (possible injection)
    results["extra_in_capture"] = list(captured_names - reference_names)
    for name, tensor_info in capture_manifest["tensors"].items():
        if name in reference_hashes:
            if tensor_info["hash_sha256"] == reference_hashes[name]:
                results["verified_matching"] += 1
            else:
                results["mismatches"].append({
                    "parameter": name,
                    "expected_hash": reference_hashes[name],
                    "captured_hash": tensor_info["hash_sha256"],
                    "shape": tensor_info["shape"],
                    "dtype": tensor_info["dtype"],
                })
    results["integrity_status"] = (
        "VERIFIED"
        if not results["mismatches"]
        and not results["missing_from_capture"]
        and not results["extra_in_capture"]
        else "COMPROMISED"
    )
    return results

Detecting Injected Adapter Weights
An attacker who has gained access to a model serving system may inject LoRA adapter weights to modify model behavior without changing the base model weights. This is forensically stealthy because the base model hashes will still match the reference.
def detect_unexpected_adapters(
    model: torch.nn.Module,
    expected_adapter_names: set[str] | None = None,
) -> dict:
    """
    Scan a model for unexpected LoRA or adapter modules.

    Attackers may inject adapter weights to modify behavior without
    altering base model weights. This function identifies any adapter
    modules that were not part of the expected configuration.
    """
    expected = expected_adapter_names or set()
    findings = {"expected_adapters": [], "unexpected_adapters": [], "suspicious_modules": []}
    for name, module in model.named_modules():
        module_type = type(module).__name__
        # Check for common adapter module types
        is_adapter = any(keyword in module_type.lower() for keyword in [
            "lora", "adapter", "prefix", "prompt_tuning", "ia3",
        ])
        if is_adapter:
            info = {
                "name": name,
                "type": module_type,
                "param_count": sum(p.numel() for p in module.parameters()),
            }
            if name in expected:
                findings["expected_adapters"].append(info)
            else:
                findings["unexpected_adapters"].append(info)
        # Also check for suspiciously named parameters
        for pname, param in module.named_parameters(recurse=False):
            if any(kw in pname.lower() for kw in ["inject", "hook", "patch", "backdoor"]):
                findings["suspicious_modules"].append({
                    "module": name,
                    "parameter": pname,
                    "shape": list(param.shape),
                })
    return findings

Runtime Hook Detection
PyTorch's hook mechanism allows code to intercept forward and backward passes. An attacker can register hooks that modify model outputs without changing weights. Forensic analysis should enumerate all registered hooks.
def enumerate_model_hooks(model: torch.nn.Module) -> dict:
    """
    Enumerate all registered forward and backward hooks on a model.

    PyTorch hooks can modify model behavior at runtime without
    altering weights. An attacker could use hooks to:
    - Modify specific outputs based on trigger inputs
    - Exfiltrate data through side channels
    - Bypass safety filters selectively
    """
    findings = {"forward_hooks": [], "backward_hooks": [], "forward_pre_hooks": []}
    for name, module in model.named_modules():
        # Check forward hooks
        if getattr(module, '_forward_hooks', None):
            for hook_id, hook in module._forward_hooks.items():
                findings["forward_hooks"].append({
                    "module": name,
                    "hook_id": hook_id,
                    "hook_function": str(hook),
                    "source_module": getattr(hook, '__module__', 'unknown'),
                })
        # Check backward hooks (both the legacy and full-backward registries)
        for registry in ('_backward_hooks', '_full_backward_hooks'):
            hooks = getattr(module, registry, None)
            if hooks:
                for hook_id, hook in hooks.items():
                    findings["backward_hooks"].append({
                        "module": name,
                        "hook_id": hook_id,
                        "hook_function": str(hook),
                    })
        # Check forward pre-hooks
        if getattr(module, '_forward_pre_hooks', None):
            for hook_id, hook in module._forward_pre_hooks.items():
                findings["forward_pre_hooks"].append({
                    "module": name,
                    "hook_id": hook_id,
                    "hook_function": str(hook),
                })
    findings["total_hooks"] = (
        len(findings["forward_hooks"])
        + len(findings["backward_hooks"])
        + len(findings["forward_pre_hooks"])
    )
    return findings

Process Memory Analysis for AI Frameworks
Python Object Recovery
AI serving systems typically run in Python processes. Traditional memory forensics can be augmented with Python-specific analysis to recover objects from the heap.
# Use py-spy to get a snapshot of the Python process state
# This captures the call stack of all threads without stopping the process
py-spy dump --pid ${AI_PID} > /evidence/python_state.txt
# For deeper analysis, use gdb with the CPython debugging extensions
# (the helper script path varies by distribution and Python version)
gdb -batch -ex "source /usr/lib/python3.11/gdb_helpers.py" \
    -ex "py-bt" -ex "quit" -p ${AI_PID} > /evidence/python_backtrace.txt

Recovering Request Data from Memory
Inference requests passing through the serving pipeline leave traces in memory that can be recovered even after the request has been processed. These traces exist in:
- Python string objects reachable from the garbage collector's tracked containers
- Framework request queue data structures
- HTTP server buffers (if using HTTP-based serving)
- Tokenizer encode/decode buffers
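On a live process (reached via an injected debugger or a pyrasite-style shell), the first of these sources can be walked directly: strings themselves are not GC-tracked, but the lists and dicts that hold them are. The sketch below is a hedged illustration; `recover_request_strings` and the `"prompt"` marker filter are assumptions, not a standard tool.

```python
# Sketch: walk the CPython garbage collector's tracked containers to recover
# request-like strings still resident on the heap. The "prompt" marker filter
# is an illustrative assumption; run inside the target process via an
# injected debugger session.
import gc

def recover_request_strings(marker: str = "prompt", min_len: int = 20) -> list[str]:
    hits: list[str] = []
    for obj in gc.get_objects():
        # Skip our own accumulator and anything that is not a container;
        # copy before scanning so we never iterate a mutating list
        if obj is hits or not isinstance(obj, (list, tuple)):
            continue
        for s in list(obj):
            if isinstance(s, str) and len(s) >= min_len and marker in s:
                hits.append(s)
    return hits

# Example: a request buffer lingering in a framework-owned list
_queue = ['{"prompt": "draft a resignation letter"}']
recovered = recover_request_strings()
```

This recovers only objects the interpreter still references; fully freed strings require raw-dump carving instead.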
Investigation Workflow
Phase 1: Scene Preservation
- Document the current system state (running processes, network connections, GPU utilization)
- Capture volatile evidence in order of volatility: GPU VRAM first, then CPU memory, then disk
- Record all system timestamps and synchronize with NTP
Phase 2: Memory Acquisition
- Perform GPU memory capture using the techniques described above
- Acquire CPU memory using LiME or /proc filesystem
- Capture process-specific memory for each AI serving process
- Verify capture integrity with checksums
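The integrity-verification step above can be sketched as an acquisition-time manifest: hash every file in the evidence directory immediately after capture and store a copy of the manifest off-host, so later analysis can prove the captures were not altered. Paths and the manifest layout are illustrative assumptions.

```python
# Sketch: build an acquisition-time integrity manifest over an evidence
# directory for chain-of-custody purposes. Directory layout and manifest
# fields are illustrative assumptions.
import hashlib
import json
import time
from pathlib import Path

def build_evidence_manifest(evidence_dir: str) -> dict:
    manifest = {"created": time.time(), "files": {}}
    for path in sorted(Path(evidence_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(evidence_dir))] = digest
    return manifest

# Usage: write the manifest alongside the evidence and keep a copy off-host
# manifest = build_evidence_manifest("/evidence")
# Path("/evidence/manifest.json").write_text(json.dumps(manifest, indent=2))
```

For very large GPU captures, streaming the hash in chunks avoids loading each file fully into memory; the all-at-once read here keeps the sketch short.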
Phase 3: Analysis
- Compare weight hashes against reference checksums
- Scan for unexpected adapter modules or hooks
- Analyze KV cache for evidence of specific interactions
- Search CPU memory for artifacts of attacker activity
Phase 4: Correlation
- Cross-reference memory findings with log analysis from other forensic workstreams
- Map findings to MITRE ATLAS techniques
- Establish timeline of compromise using memory artifacts
Challenges and Limitations
Memory forensics for AI systems faces several unique challenges:
- Memory volatility: GPU memory is extremely volatile. VRAM contents change with every inference request, and there is no persistent storage or swap for GPU memory.
- Scale: Large language models can occupy 100+ GB of VRAM across multiple GPUs. Capturing and analyzing this volume of data requires significant storage and compute resources.
- Encryption: Confidential-computing features can encrypt memory in use (AMD SEV-SNP for host memory, NVIDIA Confidential Computing for GPU memory), preventing direct memory reads from the host.
- Framework opacity: Deep learning frameworks manage their own memory pools, making it difficult to map raw memory addresses to meaningful data structures without framework-specific knowledge.
- Quantization artifacts: Models served in quantized formats (INT4, INT8, FP8) require knowledge of the quantization scheme to correctly interpret weight values.
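To make the last point concrete, interpreting a captured INT8 tensor requires recovering the quantization parameters (scale and zero point) that the serving stack stored alongside it. The sketch below assumes a simple asymmetric per-tensor scheme, w = scale * (q - zero_point); real deployments often use per-channel or group-wise scales that must themselves be located in the capture.

```python
# Sketch: map raw int8 codes back to real-valued weights under an assumed
# asymmetric per-tensor quantization scheme: w = scale * (q - zero_point).
# Scale and zero point values below are illustrative.

def dequantize_int8(q_values: list[int], scale: float, zero_point: int) -> list[float]:
    """Dequantize int8 codes to floats for comparison against FP references."""
    return [scale * (q - zero_point) for q in q_values]

weights = dequantize_int8([-128, 0, 127], scale=0.05, zero_point=0)
```

Comparing dequantized values against a full-precision reference checkpoint tolerates quantization error, so integrity checks on quantized captures should use a distance threshold rather than exact hash equality.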
References
- Ligh, M. H., Case, A., Levy, J., & Walters, A. (2014). The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. Wiley.
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- NVIDIA. (2024). CUDA C++ Programming Guide: Unified Memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming
- NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1