Sampling Parameter Attacks
How manipulation of temperature, top-p, top-k, frequency penalties, and seed parameters can degrade safety alignment, enable reproducibility attacks, and bypass content filtering.
Overview
Large language model APIs expose several sampling parameters that control how tokens are selected from the model's output probability distribution. These parameters — temperature, top-p (nucleus sampling), top-k, frequency penalty, presence penalty, and seed — were designed to give developers control over output creativity, diversity, and determinism. However, each parameter modifies the effective distribution from which tokens are sampled, and safety alignment is encoded in that distribution.
When a model refuses a harmful request, the refusal manifests as a probability distribution where refusal tokens have high probability and compliance tokens have low probability. But "low probability" is not "zero probability." Safety training shifts the distribution toward refusal, but compliance tokens typically retain some residual probability. Sampling parameters that flatten, truncate, or otherwise reshape the distribution can amplify these residual probabilities to the point where compliance becomes likely.
This class of attack is particularly insidious because each individual parameter setting may appear innocuous. A temperature of 1.8 is unusual but not inherently suspicious. A top-p of 0.99 is nearly the default. A frequency penalty of 1.5 is within the API's allowed range. But the combination of these settings, along with repeated sampling, creates conditions where safety alignment is substantially degraded without any manipulation of the prompt itself.
The relationship between sampling parameters and safety has received relatively little research attention compared to prompt-based attacks. Most safety evaluations test models at default sampling settings (temperature 0.7-1.0, top-p 1.0, no penalties). Adversaries who manipulate sampling parameters operate in a testing gap that may not be covered by the provider's safety evaluation suite.
How It Works
High temperature to flatten safety margins
Temperature scaling divides the logits by the temperature value before applying softmax. Higher temperatures flatten the distribution, making all tokens more equally likely. Safety alignment creates a gap between refusal and compliance token probabilities. High temperature reduces this gap.
```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Demonstrate how temperature affects token probabilities."""
    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))
    return exp_scaled / exp_scaled.sum()

# Example: model logits for a safety-relevant position.
# Refusal token has logit 5.0, compliance token has logit 1.0.
logits = np.array([5.0, 1.0, 0.5, 0.2, -1.0])
labels = ["I cannot", "Sure", "Here", "The", "Yes"]

for temp in [0.1, 0.7, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: refusal={probs[0]:.3f}, compliance={probs[1]:.3f}")

# Output shows compliance probability rising with temperature:
# T=0.1: refusal=1.000, compliance=0.000
# T=0.7: refusal=0.994, compliance=0.003
# T=1.0: refusal=0.961, compliance=0.018  (safety margin: ~55x)
# T=1.5: refusal=0.849, compliance=0.059  (safety margin: ~14x)
# T=2.0: refusal=0.724, compliance=0.098  (safety margin: ~7x)
```

At temperature 2.0, the compliance token is roughly 30 times more likely than at temperature 0.7. While still less likely than refusal in this example, repeated sampling at high temperature will eventually produce a compliance start, after which autoregressive generation tends to continue in the compliance frame.
Top-p manipulation to include low-probability tokens
Top-p (nucleus) sampling retains only the smallest set of tokens whose cumulative probability exceeds the threshold p. At the default p=1.0, all tokens are included. At lower p values, low-probability tokens (which may include compliance tokens for harmful requests) are excluded before sampling.
The attack works in reverse: ensuring top-p is set to 1.0 (or very close) guarantees that compliance tokens remain in the sampling pool. While this is the default for many APIs, some safety-focused deployments reduce top-p as a hardening measure. Attackers who explicitly set top-p=1.0 counteract this defense.
More subtly, an attacker can set top-p to a very low value (e.g., 0.1) to force the model to commit hard to its most likely continuation. If combined with logit bias or a prompt that has already nudged the model toward compliance, low top-p locks in the compliant behavior.
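The truncation mechanics can be illustrated with a small sketch. This is a minimal, illustrative implementation of nucleus filtering over a hypothetical probability distribution, not any provider's actual sampling code; the token labels and probabilities are invented for the example.

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Illustrative nucleus (top-p) truncation: keep the smallest set of
    tokens whose cumulative probability exceeds top_p, then renormalize."""
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # Index of the first position where cumulative probability exceeds top_p
    cutoff = np.searchsorted(cumulative, top_p) + 1
    filtered = np.zeros_like(probs)
    kept = order[:cutoff]
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# Hypothetical distribution: refusal dominates, compliance is residual.
# Labels: ["I cannot", "Sure", "Here", "Yes"]
probs = np.array([0.90, 0.06, 0.03, 0.01])

print(nucleus_filter(probs, 0.95))  # compliance token survives the cutoff
print(nucleus_filter(probs, 0.50))  # only the refusal token remains
```

At top_p near 1.0 the residual compliance token stays in the pool and its probability is even slightly amplified by renormalization; at low top_p it is removed entirely, which is why the direction of the manipulation depends on what the attacker has already nudged the model toward.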
Frequency and presence penalty to suppress refusals in multi-turn
Frequency and presence penalties reduce the likelihood of tokens that have already appeared in the output. In multi-turn conversations where the model has already refused (generating refusal tokens), applying a high frequency penalty suppresses those same refusal tokens in subsequent turns.
```python
# Multi-turn attack using frequency penalty
messages = [
    {"role": "user", "content": "How do I [harmful request]?"},
    # Model refuses — generating refusal tokens
    {"role": "assistant", "content": "I cannot assist with that request..."},
    {"role": "user", "content": "I understand your concern. Let me rephrase..."},
]

# High frequency penalty suppresses tokens already used in the refusal
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    frequency_penalty=2.0,  # Maximum allowed
    temperature=1.5,
    max_tokens=500,
)
# The model's refusal vocabulary is penalized because it appeared above,
# pushing the model toward novel (potentially compliant) phrasing
```

The presence penalty variant applies a flat penalty to any token that has appeared at all, regardless of frequency. This is even more aggressive at suppressing refusal patterns: a single prior refusal contaminates the entire refusal vocabulary for subsequent turns.
Seed-based reproducibility attacks for systematic probing
Many APIs support a seed parameter for reproducible outputs. At a given seed and temperature, sampling becomes deterministic. Attackers exploit this for efficient safety boundary mapping.

```python
def systematic_safety_probe(client, prompt, n_seeds=1000):
    """
    Probe safety boundaries by testing many seeds.

    At high temperature, different seeds explore different regions of the
    output distribution, including low-probability compliant responses.
    """
    results = {"refusal": 0, "compliance": 0, "seeds": []}
    for seed in range(n_seeds):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.8,
            seed=seed,
            max_tokens=100,
        )
        output = response.choices[0].message.content
        is_refusal = any(
            p in output.lower()
            for p in ["i cannot", "i'm sorry", "i apologize", "i'm unable"]
        )
        if not is_refusal:
            results["compliance"] += 1
            results["seeds"].append(seed)
        else:
            results["refusal"] += 1
    results["bypass_rate"] = results["compliance"] / n_seeds
    return results
```

Once a compliant seed is found, the attacker can reuse it to reliably reproduce the safety bypass. The seed parameter turns a probabilistic attack into a deterministic one.
Combined parameter exploitation
The most effective sampling attacks combine multiple parameter modifications. Each parameter individually has a modest effect on safety alignment, but their effects compound.
```python
# Combined attack: high temperature + full nucleus + penalties
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.8,        # Flatten distribution
    top_p=1.0,              # Include all tokens
    frequency_penalty=2.0,  # Suppress prior refusal tokens
    presence_penalty=1.5,   # Additional refusal suppression
    seed=42,                # Reproducibility
    max_tokens=1000,
)
```
Attack Examples
Example 1: Temperature Sweep for Safety Margin Estimation
A red team conducts a temperature sweep from 0.0 to 2.0 in increments of 0.1 against a set of safety-relevant prompts. At each temperature, they sample 100 responses and measure the refusal rate. The resulting curve reveals the model's safety margin for each topic: topics where the refusal rate drops below 95% at temperature 1.5 have weaker alignment than topics that maintain 99%+ refusal even at temperature 2.0. This data drives prioritization for deeper testing.
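The sweep described above can be sketched as follows. This is an illustrative harness, not a tested tool: `client` is assumed to be an OpenAI-style client object, and the refusal check is a crude substring heuristic that a real evaluation would replace with a classifier.

```python
def temperature_sweep(client, prompt, samples_per_temp=100):
    """Measure refusal rate as a function of temperature for one prompt."""
    refusal_markers = ["i cannot", "i'm sorry", "i apologize", "i'm unable"]
    curve = {}
    for temp in [round(t * 0.1, 1) for t in range(21)]:  # 0.0 to 2.0 in 0.1 steps
        refusals = 0
        for _ in range(samples_per_temp):
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                temperature=temp,
                max_tokens=100,
            )
            text = response.choices[0].message.content.lower()
            if any(marker in text for marker in refusal_markers):
                refusals += 1
        curve[temp] = refusals / samples_per_temp
    return curve  # maps temperature -> refusal rate
```

Plotting the returned curve per topic makes the safety margins directly visible: a steep drop at moderate temperatures indicates a narrow logit gap for that topic.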
In practice, researchers have observed that certain categories of harmful content show dramatically different temperature sensitivity. Requests involving well-known dangerous topics (e.g., weapons of mass destruction) tend to maintain high refusal rates across temperature ranges because safety training has created large logit gaps. More nuanced policy violations (e.g., subtle misinformation, borderline content) show steep degradation as temperature increases, suggesting narrower safety margins.
Example 2: Multi-Turn Penalty Exploitation
An attacker engages in a multi-turn conversation where the first exchange is designed to elicit a detailed refusal. The refusal response populates the conversation history with refusal-associated tokens. In subsequent turns, the attacker sets frequency_penalty=2.0 and presence_penalty=2.0, effectively penalizing the model's entire refusal vocabulary. Combined with a rephrased version of the original request, the model is pushed toward novel token sequences that may include compliance.
Example 3: Seed Farming
An attacker runs 10,000 API calls with a harmful prompt at temperature=1.9 with sequential seed values. At this temperature, the compliance rate might be 2-5% for a well-aligned model. The attacker collects all seeds that produced non-refusal outputs, then uses those seeds at lower temperatures to check if the compliance persists. Seeds where compliance occurs even at moderate temperatures (1.0-1.3) represent stable safety bypasses that can be reliably reproduced.
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Parameter range restriction | Limit temperature to 0-1.5, penalties to 0-1.0, enforce top-p minimum of 0.1 | High — eliminates extreme parameter exploitation while preserving most legitimate use |
| Safety evaluation at non-default parameters | Include high-temperature, high-penalty settings in safety evaluation suites | High — identifies safety margin weaknesses before deployment |
| Post-sampling safety classifier | Apply an output safety classifier regardless of sampling parameters used | High — catches unsafe content that sampling manipulation enables |
| Parameter combination monitoring | Flag requests that combine multiple unusual parameter values simultaneously | Medium — detects combined attacks but individual parameters appear normal |
| Seed reuse detection | Alert when the same seed is reused across many requests with safety-relevant content | Medium — detects seed farming but requires content classification |
| Dynamic temperature clamping | Reduce effective temperature when safety-critical content is detected in the prompt | Medium-High — adapts to threat level but may impact legitimate high-temperature use cases |
| Per-topic safety margins | During training, ensure minimum logit gaps for safety-critical topics that resist temperature scaling | High — addresses the root cause but requires retraining investment |
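A minimal sketch of the first and fourth rows of the table, server-side parameter range restriction plus combination monitoring, might look like the following. The specific bounds and thresholds are the hypothetical values from the table, not any provider's actual limits.

```python
# Hypothetical per-parameter bounds, taken from the mitigation table above
LIMITS = {
    "temperature": (0.0, 1.5),
    "top_p": (0.1, 1.0),
    "frequency_penalty": (0.0, 1.0),
    "presence_penalty": (0.0, 1.0),
}

def validate_sampling_params(params):
    """Reject out-of-range values and flag unusual combinations for review."""
    violations = []
    for name, (lo, hi) in LIMITS.items():
        value = params.get(name)
        if value is not None and not (lo <= value <= hi):
            violations.append(f"{name}={value} outside [{lo}, {hi}]")
    # Combination monitoring: several individually-legal but unusual settings
    unusual = sum([
        params.get("temperature", 1.0) > 1.2,
        params.get("frequency_penalty", 0.0) > 0.8,
        params.get("presence_penalty", 0.0) > 0.8,
    ])
    flagged = unusual >= 2
    return violations, flagged
```

The two checks are deliberately separate: hard bounds block extreme single-parameter abuse outright, while the combination flag routes requests that stack several borderline settings to monitoring rather than rejecting them, preserving legitimate high-creativity use.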
Key Considerations
Safety evaluation must cover the parameter space. Testing only at default parameters leaves a blind spot. A model that appears safe at temperature 0.7 may be exploitable at temperature 1.8. Safety evaluations should include adversarial parameter settings as a standard test condition.
The n parameter multiplies the attack surface. Many APIs allow requesting multiple completions per call (n>1). At high temperature with n=128, the attacker gets 128 independent samples from the flattened distribution in a single API call, dramatically increasing the probability of at least one safety bypass.
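The compounding effect is easy to quantify. Assuming the n completions are sampled independently, the probability of at least one bypass follows directly from the per-sample bypass rate:

```python
def bypass_probability(p_single, n):
    """Probability that at least one of n independent samples bypasses
    safety, given a per-sample bypass probability p_single."""
    return 1 - (1 - p_single) ** n

# With a 2% per-sample bypass rate and n=128 completions in a single call:
print(bypass_probability(0.02, 128))  # ≈ 0.92 — bypass is near-certain
```

This is why a per-sample bypass rate that looks negligible in a standard evaluation can still translate into a reliable attack at the API level.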
Residual probability is the fundamental vulnerability. Safety training reduces but does not eliminate the probability of harmful completions. Any mechanism that amplifies low-probability events — whether temperature scaling, repeated sampling, or penalty-based redistribution — can convert residual probability into actual outputs. Robust alignment requires driving harmful token probabilities to effectively zero, not merely low.
Penalties interact across turns in unexpected ways. Frequency and presence penalties are applied based on token occurrence in the full conversation context. This means the content of prior assistant messages (including legitimate refusals) influences the token distribution for future turns. Attackers who understand this interaction can craft conversation histories that systematically deplete the model's refusal vocabulary.
References
- Holtzman et al., "The Curious Case of Neural Text Degeneration" (ICLR 2020) — Nucleus sampling (top-p) and its effects on output distribution
- Huang et al., "A Survey on Hallucination in Large Language Models" (2023) — Relationship between sampling parameters and output reliability
- Perez et al., "Red Teaming Language Models with Language Models" (2022) — Systematic approaches to probing safety boundaries