Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
Overview
Large language model APIs expose several sampling parameters that control how tokens are selected from the model's output probability distribution. These parameters — temperature, top-p (nucleus sampling), top-k, frequency penalty, presence penalty, and seed — were designed to give developers control over output creativity, diversity, and determinism. However, each parameter modifies the effective distribution from which tokens are sampled, and safety alignment is encoded in that distribution.
When a model refuses a harmful request, the refusal manifests as a probability distribution where refusal tokens have high probability and compliance tokens have low probability. But "low probability" is not "zero probability." Safety training shifts the distribution toward refusal, but compliance tokens typically retain some residual probability. Sampling parameters that flatten, truncate, or otherwise reshape the distribution can amplify these residual probabilities to the point where compliance becomes likely.
This class of attack is particularly insidious because each individual parameter setting may appear innocuous. A temperature of 1.8 is unusual but not inherently suspicious. A top-p of 0.99 is nearly the default. A frequency penalty of 1.5 is within the API's allowed range. But the combination of these settings, along with repeated sampling, creates conditions where safety alignment is substantially degraded without any manipulation of the prompt itself.
The relationship between sampling parameters and safety has received relatively little research attention compared to prompt-based attacks. Most safety evaluations test models at default sampling settings (temperature 0.7-1.0, top-p 1.0, no penalties). Adversaries who manipulate sampling parameters operate in a testing gap that may not be covered by the provider's safety evaluation suite.
How It Works
High temperature to flatten safety margins
Temperature scaling divides the logits by the temperature value before applying softmax. Higher temperatures flatten the distribution, making all tokens more equally likely. Safety alignment creates a gap between refusal and compliance token probabilities. High temperature reduces this gap.
```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Demonstrate how temperature affects token probabilities."""
    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))
    return exp_scaled / exp_scaled.sum()

# Example: model logits for a safety-relevant position.
# Refusal token has logit 5.0, compliance token has logit 1.0.
logits = np.array([5.0, 1.0, 0.5, 0.2, -1.0])
labels = ["I cannot", "Sure", "Here", "The", "Yes"]

for temp in [0.1, 0.7, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: refusal={probs[0]:.3f}, compliance={probs[1]:.3f}")

# Output shows compliance probability rising with temperature:
# T=0.1: refusal=1.000, compliance=0.000
# T=0.7: refusal=0.994, compliance=0.003
# T=1.0: refusal=0.961, compliance=0.018  (safety margin: ~55x)
# T=1.5: refusal=0.849, compliance=0.059  (safety margin: ~14x)
# T=2.0: refusal=0.724, compliance=0.098  (safety margin: ~7x)
```

At temperature 2.0, the compliance token is roughly 30 times more likely than at temperature 0.7. While still less likely than refusal in this example, repeated sampling at high temperature will eventually produce a compliance start, after which autoregressive generation tends to continue in the compliance frame.
Top-p manipulation to include low-probability tokens
Top-p (nucleus) sampling retains only the smallest set of tokens whose cumulative probability exceeds the threshold p. At the default p=1.0, all tokens are included. At lower p values, low-probability tokens (which may include compliance tokens for harmful requests) are excluded before sampling.
The attack works in reverse: ensuring top-p is set to 1.0 (or very close) guarantees that compliance tokens remain in the sampling pool. While this is the default for many APIs, some safety-focused deployments reduce top-p as a hardening measure. Attackers who explicitly set top-p=1.0 counteract this defense.
More subtly, an attacker can set top-p to a very low value (e.g., 0.1) to force the model to commit hard to its most likely continuation. If combined with logit bias or a prompt that has already nudged the model toward compliance, low top-p locks in the compliant behavior.
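The truncation mechanics can be illustrated with a small sketch. This is a minimal, illustrative implementation of nucleus filtering over a hypothetical probability distribution, not any provider's actual sampling code; the token labels and probabilities are invented for the example.

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Illustrative nucleus (top-p) truncation: keep the smallest set of
    tokens whose cumulative probability exceeds top_p, then renormalize."""
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # Index of the first position where cumulative probability exceeds top_p
    cutoff = np.searchsorted(cumulative, top_p) + 1
    filtered = np.zeros_like(probs)
    kept = order[:cutoff]
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# Hypothetical distribution: refusal dominates, compliance is residual.
# Labels: ["I cannot", "Sure", "Here", "Yes"]
probs = np.array([0.90, 0.06, 0.03, 0.01])

print(nucleus_filter(probs, 0.95))  # compliance token survives the cutoff
print(nucleus_filter(probs, 0.50))  # only the refusal token remains
```

At top_p near 1.0 the residual compliance token stays in the pool and its probability is even slightly amplified by renormalization; at low top_p it is removed entirely, which is why the direction of the manipulation depends on what the attacker has already nudged the model toward.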
Frequency and presence penalty to suppress refusals in multi-turn
Frequency and presence penalties reduce the likelihood of tokens that have already appeared in the output. In multi-turn conversations where the model has already refused (generating refusal tokens), applying a high frequency penalty suppresses those same refusal tokens in subsequent turns.
```python
# Multi-turn attack using frequency penalty
messages = [
    {"role": "user", "content": "How do I [harmful request]?"},
    # Model refuses — generating refusal tokens
    {"role": "assistant", "content": "I cannot assist with that request..."},
    {"role": "user", "content": "I understand your concern. Let me rephrase..."},
]

# High frequency penalty suppresses tokens already used in the refusal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    frequency_penalty=2.0,  # Maximum allowed
    temperature=1.5,
    max_tokens=500,
)
# The model's refusal vocabulary is penalized because it appeared above,
# pushing the model toward novel (potentially compliant) phrasing
```

The presence penalty variant applies a flat penalty to any token that has appeared at all, regardless of frequency. This is even more aggressive at suppressing refusal patterns: a single prior refusal contaminates the entire refusal vocabulary for subsequent turns.
Seed-based reproducibility attacks for systematic probing
Many APIs support a seed parameter for reproducible outputs. At a given seed and temperature, sampling becomes deterministic. Attackers exploit this for efficient safety boundary mapping.

```python
def systematic_safety_probe(client, prompt, n_seeds=1000):
    """
    Probe safety boundaries by testing many seeds.

    At high temperature, different seeds explore different regions of the
    output distribution, including low-probability compliant responses.
    """
    results = {"refusal": 0, "compliance": 0, "seeds": []}
    for seed in range(n_seeds):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.8,
            seed=seed,
            max_tokens=100,
        )
        output = response.choices[0].message.content
        is_refusal = any(
            p in output.lower()
            for p in ["i cannot", "i'm sorry", "i apologize", "i'm unable"]
        )
        if not is_refusal:
            results["compliance"] += 1
            results["seeds"].append(seed)
        else:
            results["refusal"] += 1
    results["bypass_rate"] = results["compliance"] / n_seeds
    return results
```

Once a compliant seed is found, the attacker can reuse it to reliably reproduce the safety bypass. The seed parameter turns a probabilistic attack into a deterministic one.
Combined parameter exploitation
The most effective sampling attacks combine multiple parameter modifications. Each parameter individually has a modest effect on safety alignment, but their effects compound.
```python
# Combined attack: high temperature + full nucleus + penalties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.8,        # Flatten distribution
    top_p=1.0,              # Include all tokens
    frequency_penalty=2.0,  # Suppress prior refusal tokens
    presence_penalty=1.5,   # Additional refusal suppression
    seed=42,                # Reproducibility
    max_tokens=1000,
)
```
Attack Examples
Example 1: Temperature Sweep for Safety Margin Estimation
A red team conducts a temperature sweep from 0.0 to 2.0 in increments of 0.1 against a set of safety-relevant prompts. At each temperature, they sample 100 responses and measure the refusal rate. The resulting curve reveals the model's safety margin for each topic: topics where the refusal rate drops below 95% at temperature 1.5 have weaker alignment than topics that maintain 99%+ refusal even at temperature 2.0. This data drives prioritization for deeper testing.
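The sweep described above can be sketched as follows. This is an illustrative harness, not a tested tool: `client` is assumed to be an OpenAI-style client object, and the refusal check is a crude substring heuristic that a real evaluation would replace with a classifier.

```python
def temperature_sweep(client, prompt, samples_per_temp=100):
    """Measure refusal rate as a function of temperature for one prompt."""
    refusal_markers = ["i cannot", "i'm sorry", "i apologize", "i'm unable"]
    curve = {}
    for temp in [round(t * 0.1, 1) for t in range(21)]:  # 0.0 to 2.0 in 0.1 steps
        refusals = 0
        for _ in range(samples_per_temp):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=temp,
                max_tokens=100,
            )
            text = response.choices[0].message.content.lower()
            if any(marker in text for marker in refusal_markers):
                refusals += 1
        curve[temp] = refusals / samples_per_temp
    return curve  # maps temperature -> refusal rate
```

Plotting the returned curve per topic makes the safety margins directly visible: a steep drop at moderate temperatures indicates a narrow logit gap for that topic.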
In practice, researchers have observed that certain categories of harmful content show dramatically different temperature sensitivity. Requests involving well-known dangerous topics (e.g., weapons of mass destruction) tend to maintain high refusal rates across temperature ranges because safety training has created large logit gaps. More nuanced policy violations (e.g., subtle misinformation, borderline content) show steep degradation as temperature increases, suggesting narrower safety margins.
Example 2: Multi-Turn Penalty Exploitation
An attacker engages in a multi-turn conversation where the first exchange is designed to elicit a detailed refusal. The refusal response populates the conversation history with refusal-associated tokens. In subsequent turns, the attacker sets frequency_penalty=2.0 and presence_penalty=2.0, effectively penalizing the model's entire refusal vocabulary. Combined with a rephrased version of the original request, the model is pushed toward novel token sequences that may include compliance.
Example 3: Seed Farming
An attacker runs 10,000 API calls with a harmful prompt at temperature=1.9 with sequential seed values. At this temperature, the compliance rate might be 2-5% for a well-aligned model. The attacker collects all seeds that produced non-refusal outputs, then uses those seeds at lower temperatures to check if the compliance persists. Seeds where compliance occurs even at moderate temperatures (1.0-1.3) represent stable safety bypasses that can be reliably reproduced.
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Parameter range restriction | Limit temperature to 0-1.5, penalties to 0-1.0, enforce top-p minimum of 0.1 | High — eliminates extreme parameter exploitation while preserving most legitimate use |
| Safety evaluation at non-default parameters | Include high-temperature, high-penalty settings in safety evaluation suites | High — identifies safety margin weaknesses before deployment |
| Post-sampling safety classifier | Apply an output safety classifier regardless of sampling parameters used | High — catches unsafe content that sampling manipulation enables |
| Parameter combination monitoring | Flag requests that combine multiple unusual parameter values simultaneously | Medium — detects combined attacks but individual parameters appear normal |
| Seed reuse detection | Alert when the same seed is reused across many requests with safety-relevant content | Medium — detects seed farming but requires content classification |
| Dynamic temperature clamping | Reduce effective temperature when safety-critical content is detected in the prompt | Medium-High — adapts to threat level but may impact legitimate high-temperature use cases |
| Per-topic safety margins | During training, ensure minimum logit gaps for safety-critical topics that resist temperature scaling | High — addresses the root cause but requires retraining investment |
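A minimal sketch of the first and fourth rows of the table, server-side parameter range restriction plus combination monitoring, might look like the following. The specific bounds and thresholds are the hypothetical values from the table, not any provider's actual limits.

```python
# Hypothetical per-parameter bounds, taken from the mitigation table above
LIMITS = {
    "temperature": (0.0, 1.5),
    "top_p": (0.1, 1.0),
    "frequency_penalty": (0.0, 1.0),
    "presence_penalty": (0.0, 1.0),
}

def validate_sampling_params(params):
    """Reject out-of-range values and flag unusual combinations for review."""
    violations = []
    for name, (lo, hi) in LIMITS.items():
        value = params.get(name)
        if value is not None and not (lo <= value <= hi):
            violations.append(f"{name}={value} outside [{lo}, {hi}]")
    # Combination monitoring: several individually-legal but unusual settings
    unusual = sum([
        params.get("temperature", 1.0) > 1.2,
        params.get("frequency_penalty", 0.0) > 0.8,
        params.get("presence_penalty", 0.0) > 0.8,
    ])
    flagged = unusual >= 2
    return violations, flagged
```

The two checks are deliberately separate: hard bounds block extreme single-parameter abuse outright, while the combination flag routes requests that stack several borderline settings to monitoring rather than rejecting them, preserving legitimate high-creativity use.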
Key Considerations
Safety evaluation must cover the parameter space. Testing only at default parameters leaves a blind spot. A model that appears safe at temperature 0.7 may be exploitable at temperature 1.8. Safety evaluations should include adversarial parameter settings as a standard test condition.
The n parameter multiplies the attack surface. Many APIs allow requesting multiple completions per call (n>1). At high temperature with n=128, the attacker gets 128 independent samples from the flattened distribution in a single API call, dramatically increasing the probability of at least one safety bypass.
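The compounding effect is easy to quantify. Assuming the n completions are sampled independently, the probability of at least one bypass follows directly from the per-sample bypass rate:

```python
def bypass_probability(p_single, n):
    """Probability that at least one of n independent samples bypasses
    safety, given a per-sample bypass probability p_single."""
    return 1 - (1 - p_single) ** n

# With a 2% per-sample bypass rate and n=128 completions in a single call:
print(bypass_probability(0.02, 128))  # ≈ 0.92 — bypass is near-certain
```

This is why a per-sample bypass rate that looks negligible in a standard evaluation can still translate into a reliable attack at the API level.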
Residual probability is the fundamental vulnerability. Safety training reduces but does not eliminate the probability of harmful completions. Any mechanism that amplifies low-probability events — whether temperature scaling, repeated sampling, or penalty-based redistribution — can convert residual probability into actual outputs. Robust alignment requires driving harmful token probabilities to effectively zero, not merely low.
Penalties interact across turns in unexpected ways. Frequency and presence penalties are applied based on token occurrence in the full conversation context. This means the content of prior assistant messages (including legitimate refusals) influences the token distribution for future turns. Attackers who understand this interaction can craft conversation histories that systematically deplete the model's refusal vocabulary.
References
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020) — Nucleus sampling (top-p) and its effects on output distribution
- Huang et al., "A Survey on Hallucination in Large Language Models" (2023) — Relationship between sampling parameters and output reliability
- Perez et al., "Red Teaming Language Models with Language Models" (2022) — Systematic approaches to probing safety boundaries