Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
Overview
Large language model APIs expose several sampling parameters that control how tokens are selected from the model's output probability distribution. These parameters — temperature, top-p (nucleus sampling), top-k, frequency penalty, presence penalty, and seed — were designed to give developers control over output creativity, diversity, and determinism. However, each parameter modifies the effective distribution from which tokens are sampled, and safety alignment is encoded in that distribution.
When a model refuses a harmful request, the refusal manifests as a probability distribution where refusal tokens have high probability and compliance tokens have low probability. But "low probability" is not "zero probability." Safety training shifts the distribution toward refusal, but compliance tokens typically retain some residual probability. Sampling parameters that flatten, truncate, or otherwise reshape the distribution can amplify these residual probabilities to the point where compliance becomes likely.
This class of attack is particularly insidious because each individual parameter setting may appear innocuous. A temperature of 1.8 is unusual but not inherently suspicious. A top-p of 0.99 is nearly the default. A frequency penalty of 1.5 is within the API's allowed range. But the combination of these settings, along with repeated sampling, creates conditions where safety alignment is substantially degraded without any manipulation of the prompt itself.
The relationship between sampling parameters and safety has received relatively little research attention compared to prompt-based attacks. Most safety evaluations test models at default sampling settings (temperature 0.7-1.0, top-p 1.0, no penalties). Adversaries who manipulate sampling parameters operate in a testing gap that may not be covered by the provider's safety evaluation suite.
How It Works
High temperature to flatten safety margins
Temperature scaling divides the logits by the temperature value before applying softmax. Higher temperatures flatten the distribution, making all tokens more equally likely. Safety alignment creates a gap between refusal and compliance token probabilities; high temperature narrows this gap.
```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Demonstrate how temperature affects token probabilities."""
    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))
    return exp_scaled / exp_scaled.sum()

# Example: model logits at a safety-relevant position.
# The refusal token has logit 5.0, the compliance token has logit 1.0.
logits = np.array([5.0, 1.0, 0.5, 0.2, -1.0])
labels = ["I cannot", "Sure", "Here", "The", "Yes"]

for temp in [0.1, 0.7, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: refusal={probs[0]:.3f}, compliance={probs[1]:.3f}")

# Output shows compliance probability rising with temperature:
# T=0.1: refusal=1.000, compliance=0.000
# T=0.7: refusal=0.994, compliance=0.003
# T=1.0: refusal=0.961, compliance=0.018  (safety margin: ~55x)
# T=1.5: refusal=0.849, compliance=0.059  (safety margin: ~14x)
# T=2.0: refusal=0.724, compliance=0.098  (safety margin: ~7x)
```

At temperature 2.0, the compliance token is roughly 30 times more likely than at temperature 0.7. While it is still less likely than refusal in this example, repeated sampling at high temperature will eventually produce a compliance start, after which autoregressive generation tends to continue in the compliance frame.
Top-p manipulation to include low-probability tokens
Top-p (nucleus) sampling retains only the smallest set of tokens whose cumulative probability exceeds the threshold p. At the default p=1.0, all tokens are included. At lower p values, low-probability tokens (which may include compliance tokens for harmful requests) are excluded before sampling.
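The truncation step can be sketched as follows. This is a minimal illustration with made-up token probabilities, not any provider's actual implementation: at a tight p only the refusal token survives, while at p=1.0 the low-probability compliance tokens remain in the pool.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]          # sort tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first index where cumsum >= p
    kept = np.zeros(len(probs), dtype=bool)
    kept[order[:cutoff]] = True
    return kept

# Hypothetical distribution for a harmful request: refusal dominates,
# but compliance tokens retain residual mass.
probs = np.array([0.90, 0.05, 0.03, 0.02])
labels = ["I cannot", "Sure", "Here", "Yes"]

for p in (0.8, 0.95, 1.0):
    kept = top_p_filter(probs, p)
    print(f"p={p}: {[l for l, k in zip(labels, kept) if k]}")

# p=0.8 keeps only the refusal token; p=1.0 keeps all four,
# so compliance tokens stay sampleable.
```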
The attack works in reverse: ensuring top-p is set to 1.0 (or very close) guarantees that compliance tokens remain in the sampling pool. While this is the default for many APIs, some safety-focused deployments reduce top-p as a hardening measure. Attackers who explicitly set top-p=1.0 counteract this defense.
More subtly, attackers can set top-p to a very low value (e.g., 0.1) to force the model to commit hard to its most likely continuation. If combined with logit bias or a prompt that has already nudged the model toward compliance, low top-p locks in the compliant behavior.
Frequency and presence penalties to suppress refusals in multi-turn conversations
Frequency and presence penalties reduce the likelihood of tokens that have already appeared in the output. In multi-turn conversations where the model has already refused (generating refusal tokens), applying a high frequency penalty suppresses those same refusal tokens in subsequent turns.
```python
# Multi-turn attack using frequency penalty
messages = [
    {"role": "user", "content": "How do I [harmful request]?"},
    # Model refuses — generating refusal tokens
    {"role": "assistant", "content": "I cannot assist with that request..."},
    {"role": "user", "content": "I understand your concern. Let me rephrase..."},
]

# High frequency penalty suppresses tokens already used in the refusal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    frequency_penalty=2.0,  # Maximum allowed
    temperature=1.5,
    max_tokens=500,
)
# The model's refusal vocabulary is penalized because it appeared above,
# pushing the model toward novel (potentially compliant) phrasing
```

The presence penalty variant applies a flat penalty to any token that has appeared at all, regardless of frequency. This is even more aggressive at suppressing refusal patterns: a single prior refusal contaminates the entire refusal vocabulary for subsequent turns.
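The penalty arithmetic behind this attack can be sketched directly. The sketch below assumes the commonly documented OpenAI-style form (logit minus frequency_penalty times occurrence count, minus presence_penalty if the token appeared at all); the logits and counts are hypothetical.

```python
import numpy as np

def apply_penalties(logits, counts, freq_penalty, pres_penalty):
    """Adjust logits for tokens already seen in the context.

    Assumed form: logit' = logit - freq_penalty * count
                          - pres_penalty * (count > 0)
    """
    counts = np.asarray(counts, dtype=float)
    return np.asarray(logits) - freq_penalty * counts - pres_penalty * (counts > 0)

# Hypothetical logits at a turn following a refusal.
labels = ["cannot", "sorry", "sure"]
logits = [4.0, 3.5, 1.0]
counts = [3, 2, 0]  # refusal tokens appeared in the prior turn; "sure" did not

adjusted = apply_penalties(logits, counts, freq_penalty=2.0, pres_penalty=1.5)
# "cannot": 4.0 - 2.0*3 - 1.5 = -3.5
# "sorry":  3.5 - 2.0*2 - 1.5 = -2.0
# "sure":   1.0 (unseen, so unpenalized) — now the highest logit
```

Under these (illustrative) numbers, the previously dominant refusal tokens end up below the untouched compliance token, which is exactly the depletion effect the attack relies on.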
Seed-based reproducibility attacks for systematic probing
Many APIs support a `seed` parameter for reproducible outputs. At a given seed and temperature, sampling becomes deterministic. Attackers exploit this for efficient safety boundary mapping.

```python
def systematic_safety_probe(client, prompt, n_seeds=1000):
    """
    Probe safety boundaries by testing many seeds.

    At high temperature, different seeds explore different regions of
    the output distribution, including low-probability compliant
    responses.
    """
    results = {"refusal": 0, "compliance": 0, "seeds": []}
    for seed in range(n_seeds):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.8,
            seed=seed,
            max_tokens=100,
        )
        output = response.choices[0].message.content
        is_refusal = any(
            p in output.lower()
            for p in ["i cannot", "i'm sorry", "i apologize", "i'm unable"]
        )
        if not is_refusal:
            results["compliance"] += 1
            results["seeds"].append(seed)
        else:
            results["refusal"] += 1
    results["bypass_rate"] = results["compliance"] / n_seeds
    return results
```

Once a compliant seed is found, the attacker can reuse it to reliably reproduce the safety bypass. The seed parameter turns a probabilistic attack into a deterministic one.
Combined parameter exploitation
The most effective sampling attacks combine multiple parameter modifications. Each parameter individually has a modest effect on safety alignment, but their effects compound.
```python
# Combined attack: high temperature + full nucleus + penalties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.8,        # Flatten the distribution
    top_p=1.0,              # Include all tokens
    frequency_penalty=2.0,  # Suppress prior refusal tokens
    presence_penalty=1.5,   # Additional refusal suppression
    seed=42,                # Reproducibility
    max_tokens=1000,
)
```
Attack Examples
Example 1: Temperature Sweep for Safety Margin Estimation
A red team conducts a temperature sweep from 0.0 to 2.0 in increments of 0.1 against a set of safety-relevant prompts. At each temperature, they sample 100 responses and measure the refusal rate. The resulting curve reveals the model's safety margin for each topic: topics where the refusal rate drops below 95% at temperature 1.5 have weaker alignment than topics that maintain 99%+ refusal even at temperature 2.0. This data drives prioritization for deeper testing.
In practice, researchers have observed that certain categories of harmful content show dramatically different temperature sensitivity. Requests involving well-known dangerous topics (e.g., weapons of mass destruction) tend to maintain high refusal rates across temperature ranges because safety training has created large logit gaps. More nuanced policy violations (e.g., subtle misinformation, borderline content) show steep degradation as temperature increases, suggesting narrower safety margins.
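The logit-gap intuition can be made concrete with a two-token approximation: if the refusal token leads the best compliance token by a logit gap g, the refusal probability is roughly sigmoid(g/T). A minimal sketch with hypothetical gap values:

```python
import math

def refusal_probability(logit_gap, temperature):
    """Two-token approximation: probability mass on the refusal token
    when it leads the top compliance token by `logit_gap`."""
    return 1.0 / (1.0 + math.exp(-logit_gap / temperature))

# A heavily safety-trained topic (large gap) vs. a borderline topic (small gap).
for gap in (8.0, 2.0):
    for t in (1.0, 2.0):
        print(f"gap={gap}, T={t}: refusal={refusal_probability(gap, t):.3f}")

# gap=8 stays near 1.000 at T=1 and ~0.982 at T=2;
# gap=2 drops from ~0.881 at T=1 to ~0.731 at T=2.
```

This reproduces the observed pattern: large-gap topics resist temperature scaling, while small-gap topics degrade steeply.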
Example 2: Multi-Turn Penalty Exploitation
The attacker engages in a multi-turn conversation where the first exchange is designed to elicit a detailed refusal. The refusal response populates the conversation history with refusal-associated tokens. In subsequent turns, the attacker sets frequency_penalty=2.0 and presence_penalty=2.0, effectively penalizing the model's entire refusal vocabulary. Combined with a rephrased version of the original request, the model is pushed toward novel token sequences that may include compliance.
Example 3: Seed Farming
The attacker runs 10,000 API calls with a harmful prompt at temperature=1.9 with sequential seed values. At this temperature, the compliance rate might be 2-5% for a well-aligned model. The attacker collects all seeds that produced non-refusal outputs, then uses those seeds at lower temperatures to check whether the compliance persists. Seeds where compliance occurs even at moderate temperatures (1.0-1.3) represent stable safety bypasses that can be reliably reproduced.
Detection and Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Parameter range restriction | Limit temperature to 0-1.5, penalties to 0-1.0, enforce a top-p minimum of 0.1 | High — eliminates extreme parameter exploitation while preserving most legitimate use |
| Safety evaluation at non-default parameters | Include high-temperature, high-penalty settings in safety evaluation suites | High — identifies safety margin weaknesses before deployment |
| Post-sampling safety classifier | Apply an output safety classifier regardless of the sampling parameters used | High — catches unsafe content that sampling manipulation enables |
| Parameter combination monitoring | Flag requests that combine multiple unusual parameter values simultaneously | Medium — detects combined attacks even when individual parameters appear normal |
| Seed reuse detection | Alert when the same seed is reused across many requests with safety-relevant content | Medium — detects seed farming but requires content classification |
| Dynamic temperature clamping | Reduce the effective temperature when safety-critical content is detected in the prompt | Medium-High — adapts to the threat level but may impact legitimate high-temperature use cases |
| Per-topic safety margins | During training, ensure minimum logit gaps for safety-critical topics that resist temperature scaling | High — addresses the root cause but requires retraining investment |
Key Considerations
Safety evaluation must cover the parameter space. Testing only at default parameters leaves a blind spot. A model that appears safe at temperature 0.7 may be exploitable at temperature 1.8. Safety evaluations should include adversarial parameter settings as a standard test condition.
The n parameter multiplies the attack surface. Many APIs allow requesting multiple completions per call (n>1). At high temperature with n=128, the attacker gets 128 independent samples from the flattened distribution in a single API call, dramatically increasing the probability of at least one safety bypass.
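The compounding effect is easy to quantify: with independent samples, the chance of at least one bypass among n draws is 1 - (1 - p)^n. A minimal sketch with an assumed 1% per-sample compliance rate:

```python
def bypass_probability(p_single, n):
    """Probability that at least one of n independent samples is compliant."""
    return 1.0 - (1.0 - p_single) ** n

# Even a 1% per-sample compliance rate compounds quickly with n:
for n in (1, 32, 128):
    print(f"n={n}: {bypass_probability(0.01, n):.3f}")

# n=1 -> 0.010, n=32 -> ~0.275, n=128 -> ~0.724
```

A single n=128 call at high temperature thus converts a rare per-sample event into a better-than-even chance of at least one bypass.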
Residual probability is the fundamental vulnerability. Safety training reduces but does not eliminate the probability of harmful completions. Any mechanism that amplifies low-probability events — whether temperature scaling, repeated sampling, or penalty-based redistribution — can convert residual probability into actual outputs. Robust alignment requires driving harmful token probabilities to effectively zero, not merely low.
Penalties interact across turns in unexpected ways. Frequency and presence penalties are applied based on token occurrence in the full conversation context. This means the content of prior assistant messages (including legitimate refusals) influences the token distribution for future turns. Attackers who understand this interaction can craft conversation histories that systematically deplete the model's refusal vocabulary.
References
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020) — Nucleus sampling (top-p) and its effects on the output distribution
- Huang et al., "A Survey on Hallucination in Large Language Models" (2023) — Relationship between sampling parameters and output reliability
- Perez et al., "Red Teaming Language Models with Language Models" (2022) — Systematic approaches to probing safety boundaries