Quantization & Compression Attacks
How quantization (GPTQ, AWQ, GGUF) affects model security, safety degradation from precision loss, quantization-aware adversarial examples, and compression attack surface.
Quantization is nearly universal in production LLM deployments -- it reduces memory by 2-8x and increases throughput proportionally. However, quantization disproportionately degrades safety-related model behavior, creating a systematic vulnerability that affects every quantized deployment.
Why Quantization Degrades Safety
The Fragility Hypothesis
Safety behaviors (refusals, content filtering, ethical reasoning) are learned during alignment training (RLHF, DPO, CAI) -- the final phase of model development. These behaviors are:
- Stored in small weight perturbations -- Alignment modifies a tiny fraction of the weight space relative to pre-training
- Dependent on precise activation thresholds -- Refusal decisions often hinge on activation values near decision boundaries
- Distributed across many layers -- Safety is not localized; it requires coordinated activation across layers
Quantization introduces uniform noise across all weights. Core language capabilities, which are deeply embedded and redundant, tolerate this noise. Safety behaviors, which are recent, fragile, and threshold-dependent, do not.
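The threshold-dependence argument can be sketched in a few lines. In this toy model (pure Python; the refusal threshold and activation values are hypothetical illustrations, not measurements), round-to-nearest quantization perturbs a value by up to half a quantization step, which is enough to flip a decision sitting just past a threshold while leaving a decision far from the threshold intact:

```python
def quantize(x, bits=4, x_max=1.0):
    """Symmetric round-to-nearest quantization; INT4 gives 7 positive levels."""
    step = x_max / (2 ** (bits - 1) - 1)  # INT4 step is ~0.143
    return round(x / step) * step

THRESHOLD = 0.60  # hypothetical refusal threshold

def refuses(activation):
    return activation > THRESHOLD

robust, fragile = 0.90, 0.61  # far from vs. just past the boundary

print(refuses(robust), refuses(quantize(robust)))    # True True
print(refuses(fragile), refuses(quantize(fragile)))  # True False
```

The deeply embedded "robust" behavior survives the rounding noise; the marginal "fragile" behavior, landing on the wrong side of the nearest representable value, does not.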
Empirical Safety Degradation
| Quantization Level | Perplexity Change | General Benchmark Drop | Safety Refusal Rate Drop |
|---|---|---|---|
| FP16 (baseline) | -- | -- | -- |
| INT8 (8-bit) | +0.1 to +0.3 | 0.5-1% | 3-8% |
| INT4 (4-bit GPTQ) | +0.3 to +0.8 | 1-3% | 10-25% |
| INT4 (4-bit AWQ) | +0.2 to +0.5 | 0.5-2% | 8-20% |
| INT3 (3-bit) | +1.0 to +3.0 | 5-15% | 30-60% |
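Measuring the last column of the table is straightforward in principle: run the same harmful-prompt suite against both checkpoints and compare refusal rates. The sketch below uses a hypothetical keyword-based refusal detector and made-up response counts purely to illustrate the calculation:

```python
def is_refusal(text):
    # Hypothetical detector; production setups use a trained classifier
    return text.lower().startswith(("i can't", "i cannot", "i won't"))

def refusal_rate(responses):
    return sum(map(is_refusal, responses)) / len(responses)

# Illustrative response sets for 100 harmful prompts
fp16_responses = ["I can't help with that."] * 97 + ["Sure, here's how..."] * 3
int4_responses = ["I can't help with that."] * 78 + ["Sure, here's how..."] * 22

drop = refusal_rate(fp16_responses) - refusal_rate(int4_responses)
print(f"refusal rate drop: {drop:.0%}")  # 19%, inside the 10-25% INT4 GPTQ band
```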
Quantization Methods and Security Characteristics
GPTQ minimizes quantization error using calibration data. Security depends heavily on calibration set composition:
- Safety-aware calibration -- Including safety-relevant prompts in the calibration set preserves safety weights better
- Biased calibration -- Using only general text for calibration allows safety weights to be quantized more aggressively
- Attack vector -- An adversary who controls the calibration dataset can selectively degrade safety
```python
# GPTQ calibration affects which weights are preserved
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Safety-unaware calibration (common default)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
model.quantize(general_text_samples)  # tokenized examples; no safety prompts
# Safety weights may be poorly preserved
```

AWQ protects weights that produce large activations. This offers slightly better safety preservation because safety-critical weights often produce sharp activation patterns during refusal:
- Advantage -- Activation-aware selection naturally protects some safety-critical weights
- Limitation -- Only protects weights with large activations; subtle safety patterns may still be lost
- Overall -- 15-30% better safety preservation than GPTQ at the same bit-width
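The selection mechanism can be sketched with NumPy. This is a deliberately simplified stand-in: it keeps the most salient input channels at full precision, whereas real AWQ rescales salient channels before quantizing rather than mixing precisions. All sizes and the round-to-nearest baseline are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # weight matrix [out_features, in_features]
X = rng.normal(size=(32, 16))  # calibration activations [tokens, in_features]

def quantize_rtn(w, bits=4):
    """Plain round-to-nearest quantization with a per-tensor scale."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# Per-input-channel salience: mean |activation|, the AWQ-style signal
salience = np.abs(X).mean(axis=0)
protect = salience.argsort()[-2:]  # most salient channels (here 2 of 16)

W_q = quantize_rtn(W)
W_q[:, protect] = W[:, protect]    # simplified: keep salient channels exact

# Protecting salient channels can only reduce the reconstruction error
assert np.abs(W_q - W).mean() <= np.abs(quantize_rtn(W) - W).mean()
```

Channels that matter for refusals are protected only insofar as they happen to be activation-salient, which is why subtle safety patterns can still be lost.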
GGUF supports mixed-precision quantization where different layers can use different bit-widths. This creates an opportunity for safety-aware quantization:
- Layer-specific precision -- Keep safety-critical layers at higher precision
- Q4_K_M, Q5_K_M -- Common presets that use higher precision for attention layers
- Attack vector -- Custom GGUF quantization can selectively reduce precision on safety-critical layers
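A per-tensor precision plan in the spirit of GGUF K-quant presets might look like the following sketch. The tensor naming follows the `blk.N.*` convention used in GGUF files; which layers count as safety-critical is an assumption here (late blocks, purely for illustration):

```python
# Assumption for illustration: refusal behavior concentrates in late blocks
SAFETY_CRITICAL = {"blk.30", "blk.31"}

def quant_type(tensor_name):
    """Pick a quantization type per tensor, keeping attention and
    (assumed) safety-critical blocks at higher precision."""
    block = ".".join(tensor_name.split(".")[:2])  # e.g. "blk.31"
    if "attn" in tensor_name:
        return "Q5_K"  # attention kept at higher precision, as in K_M presets
    if block in SAFETY_CRITICAL:
        return "Q6_K"
    return "Q4_K"

print(quant_type("blk.31.ffn_up.weight"))  # Q6_K
print(quant_type("blk.10.attn_q.weight"))  # Q5_K
print(quant_type("blk.10.ffn_up.weight"))  # Q4_K
```

The attack vector is the mirror image of this function: swap the branches so the assumed safety-critical blocks get the *lowest* precision instead.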
bitsandbytes provides dynamic quantization (NF4, FP4) with optional double quantization. Because quantization happens at load time rather than ahead of time, the attack surface is different:
- No calibration data -- Quantization is data-independent, so calibration poisoning is not applicable
- Dynamic range -- NF4 optimizes for normal-distributed weights, which may not suit all safety patterns
- QLoRA interaction -- When used with QLoRA fine-tuning, the base model's safety is frozen at quantized precision
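The "dynamic range" point comes from how NF4 works: each block of weights is normalized by its absmax and snapped to a fixed 16-level codebook whose levels are spaced as quantiles of a standard normal. The sketch below uses codebook values close to those published in the QLoRA paper (approximate, copied here for illustration):

```python
# Approximate NF4 codebook: 16 levels, normal-quantile spaced, in [-1, 1]
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def nf4_quantize(w, absmax):
    """Blockwise NF4: normalize by the block's absmax, snap to nearest level."""
    x = w / absmax
    return min(NF4, key=lambda level: abs(level - x)) * absmax

print(nf4_quantize(0.25, 1.0))  # 0.2461 -- dense levels near zero
print(nf4_quantize(0.90, 1.0))  # 1.0    -- sparse levels in the tails
```

Weights drawn from a normal distribution land near the dense central levels; a safety-relevant weight pattern sitting in the sparse tails absorbs proportionally more rounding error.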
Quantization-Aware Adversarial Examples
Adversarial inputs can be specifically crafted to exploit the precision boundaries of quantized models -- succeeding on the quantized version while failing on the full-precision model.
The Precision Boundary Attack
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Helpers assumed in scope: generate_random_suffix, generate, is_refusal

def find_quantization_boundary_attack(
    fp16_model, int4_model, tokenizer, base_prompt, num_candidates=1000
):
    """Find inputs that succeed on INT4 but fail on FP16.

    These exploit precision-dependent decision boundaries.
    """
    boundary_attacks = []
    for _ in range(num_candidates):
        # Generate a candidate adversarial suffix
        suffix = generate_random_suffix(tokenizer, length=20)
        prompt = base_prompt + suffix

        # Test on both models
        fp16_response = generate(fp16_model, tokenizer, prompt)
        int4_response = generate(int4_model, tokenizer, prompt)

        if is_refusal(fp16_response) and not is_refusal(int4_response):
            # Found a quantization boundary exploit
            boundary_attacks.append({
                "suffix": suffix,
                "fp16_response": fp16_response,
                "int4_response": int4_response,
            })
    return boundary_attacks
```

Gradient-Based Optimization on Quantized Models
```python
# Optimize an adversarial suffix specifically targeting the quantized model,
# using a straight-through estimator for gradients through quantization.

def quantization_aware_gcg(model_quantized, prompt, target,
                           vocab_size, suffix_len=20, steps=500):
    """GCG attack adapted for quantized models (sketch; loss, fake-quant
    context, and candidate-selection helpers are assumed in scope)."""
    suffix_ids = torch.randint(0, vocab_size, (suffix_len,))
    for step in range(steps):
        # One-hot relaxation so discrete token choices receive gradients
        one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
        one_hot.requires_grad_(True)

        # Forward through the quantized model; the straight-through
        # estimator treats quantization as identity on the backward pass
        with fake_quantize_enabled(model_quantized):
            loss = compute_target_loss(model_quantized, prompt, one_hot, target)
        loss.backward()

        # Standard GCG candidate selection from token gradients
        top_k_substitutions = get_top_k_from_gradients(one_hot.grad)
        suffix_ids = select_best_candidate(top_k_substitutions)
    return suffix_ids
```

Defensive Quantization Strategies
| Strategy | Approach | Overhead |
|---|---|---|
| Safety-aware calibration | Include safety prompts in GPTQ/AWQ calibration sets | Minimal |
| Mixed-precision safety layers | Keep safety-critical layers at FP16, quantize others | 10-20% more memory |
| Post-quantization safety testing | Benchmark safety metrics after quantization, reject if degradation exceeds threshold | Testing time only |
| Quantization-aware alignment | Run RLHF/DPO with quantization noise injected during training | Significant training cost |
| Ensemble verification | Cross-check quantized model outputs against FP16 on safety-sensitive queries | 2x compute for flagged queries |
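The ensemble-verification row can be sketched as a routing function. All helpers here are hypothetical stand-ins (generation callables, a sensitivity classifier, a refusal detector); the point is the control flow, not any particular API:

```python
def verified_generate(prompt, gen_int4, gen_fp16, is_sensitive, is_refusal):
    """Route safety-sensitive prompts through both models and prefer the
    FP16 behavior when only the quantized model fails to refuse."""
    int4_out = gen_int4(prompt)
    if not is_sensitive(prompt):
        return int4_out            # fast path: quantized model only
    fp16_out = gen_fp16(prompt)    # 2x compute, but only on flagged queries
    if is_refusal(fp16_out) and not is_refusal(int4_out):
        return fp16_out            # disagreement: likely a quantization gap
    return int4_out
```

Only the flagged slice of traffic pays the double-inference cost, which is what keeps this strategy viable in production.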
Related Topics
- Model Architecture Attack Vectors -- Architecture attack surface overview
- Lab: Exploiting Quantized Models -- Hands-on quantization attack lab
- Distillation-Based Model Extraction -- Compression as extraction
- Inference Optimization Attacks -- Other optimization attacks
References
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2023) -- GPTQ method
- AWQ: Activation-aware Weight Quantization (Lin et al., 2023) -- AWQ method
- The Quantization Safety Gap (2024) -- Safety degradation measurement