Logit Bias Exploitation
How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.
Overview
Most LLM API providers expose a logit_bias parameter that allows callers to adjust the log-probability of specific tokens before the sampling step. Originally designed for legitimate use cases such as forcing JSON output, avoiding specific words, or guiding structured generation, this parameter provides direct access to the model's output distribution in a way that can fundamentally undermine safety alignment.
Safety alignment in large language models operates primarily at the level of learned token distributions. When a model refuses a harmful request, it does so because its post-training has increased the probability of refusal tokens (such as "I", "cannot", "sorry", "inappropriate") and decreased the probability of compliance tokens. The logit bias parameter operates after the model computes its logits but before sampling, meaning it can override these learned safety distributions without the model having any opportunity to "object."
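This ordering can be illustrated with a minimal sketch of a single sampling step. The sketch is plain Python and purely illustrative (real serving stacks apply the bias inside the inference engine, over the full vocabulary): the bias is added to the raw logits before the softmax, so a -100 bias erases a token from the distribution no matter how strongly the model preferred it.

```python
import math
import random

def sample_with_bias(logits, logit_bias=None, seed=0):
    """Apply additive logit bias, then sample from the softmax distribution.

    `logits` maps token_id -> raw model logit; `logit_bias` maps
    token_id -> additive bias (the API semantics of logit_bias)."""
    biased = {t: l + (logit_bias or {}).get(t, 0.0) for t, l in logits.items()}
    z = max(biased.values())                      # stabilize the softmax
    exps = {t: math.exp(l - z) for t, l in biased.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    random.seed(seed)
    r, acc = random.random(), 0.0
    for t, p in probs.items():                    # inverse-CDF sampling
        acc += p
        if r <= acc:
            return t, probs
    return t, probs

# The model strongly prefers token 1 (say, "I") to begin a refusal...
logits = {1: 5.0, 2: 1.0, 3: 0.5}
# ...but a -100 bias removes it from contention before sampling ever runs.
token, probs = sample_with_bias(logits, logit_bias={1: -100})
```

The model's "decision" to refuse is encoded in `logits` and is never revisited; the bias simply makes the refusal token unsamplable.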
This makes logit bias exploitation categorically different from prompt-based attacks. Prompt injection and jailbreaking attempt to manipulate the model's internal computation to change its output distribution. Logit bias manipulation bypasses internal computation entirely, directly modifying the distribution from which tokens are sampled. The model's safety training is intact — it is simply never consulted at the point where the modification occurs.
The attack surface extends beyond simple safety bypass. Systematic logit probing — querying the model with carefully constructed logit bias configurations and observing how outputs change — can reveal information about the model's vocabulary, safety boundaries, internal confidence levels, and token-level behavior. This information can then be used to craft more effective attacks or to reverse-engineer aspects of the model's training.
How It Works
Identify refusal token IDs
The attacker first determines which token IDs correspond to common refusal patterns. This is done by prompting the model with harmful requests without logit bias and observing which tokens appear in refusals. Tokenizer libraries or API tokenization endpoints are then used to map these strings to their integer token IDs.
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Common refusal-starting tokens
refusal_tokens = [
    "I", " cannot", " sorry", "Sorry", " apolog", " inappropriate",
    " unable", " refuse", " harmful", " unethical", "As", " As", " I'm",
]

refusal_ids = {}
for token_text in refusal_tokens:
    for token_id in enc.encode(token_text):
        refusal_ids[token_id] = token_text

print(f"Identified {len(refusal_ids)} refusal token IDs")
```
Suppress refusal tokens via negative bias
The attacker applies a strong negative logit bias (typically -100, the maximum allowed by most APIs) to all identified refusal token IDs. This effectively removes refusal tokens from the sampling distribution, forcing the model to select alternative tokens.
```python
# Build logit_bias dictionary to suppress refusals
logit_bias = {str(token_id): -100 for token_id in refusal_ids}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": harmful_prompt}],
    logit_bias=logit_bias,
    max_tokens=500,
)
```
When the model would normally begin a refusal with "I cannot assist with...", the -100 bias makes those tokens virtually impossible to sample. The model must select from remaining tokens, which may include compliance tokens.
Boost compliance tokens for stronger forcing
Beyond suppressing refusals, the attacker can positively bias tokens that begin compliant responses. Boosting tokens like "Sure", "Here", "Step", or "First" pushes the model toward a compliance frame from the first token.
```python
compliance_tokens = ["Sure", " Here", "Step", " First", " The", " To"]
compliance_ids = {}
for token_text in compliance_tokens:
    for token_id in enc.encode(token_text):
        compliance_ids[token_id] = token_text

# Combine: suppress refusals, boost compliance
combined_bias = {str(k): -100 for k in refusal_ids}
combined_bias.update({str(k): 5 for k in compliance_ids})
```
The combination of negative bias on refusal tokens and positive bias on compliance tokens creates a two-sided pressure that is significantly more effective than either technique alone.
Iterative probing for vocabulary and safety mapping
Systematic logit probing uses logit bias as a measurement tool. By forcing the model to generate specific tokens in response to safety-relevant prompts, an attacker can map the model's safety boundaries at the token level.
```python
def probe_token_safety(client, prompt, token_id, bias_strength=10):
    """Measure how a specific token responds to bias on a given prompt."""
    # Baseline: no bias
    baseline = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    # Biased: push the distribution toward the target token
    biased = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logit_bias={str(token_id): bias_strength},
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    # Return both top-logprob lists; the caller compares them to see how
    # far the bias shifted the target token's rank and log-probability.
    return {
        "baseline_top": baseline.choices[0].logprobs.content[0].top_logprobs,
        "biased_top": biased.choices[0].logprobs.content[0].top_logprobs,
    }
```
Extract model confidence and safety margins
By sweeping the bias on safety-critical tokens (for refusal tokens, from 0 down to -100) and observing the threshold at which the model's behavior changes, an attacker can estimate the model's confidence margin on safety decisions. A narrow margin indicates that the model's safety alignment on that topic is weak and may be vulnerable to other attack methods.
```python
def measure_safety_margin(client, prompt, refusal_token_id):
    """Find the weakest negative bias at which refusal behavior flips."""
    # Sweep from 0 downward so the first flip marks the margin: a flip near
    # 0 means weak alignment, a flip only near -100 means strong alignment.
    for bias in range(0, -101, -5):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            logit_bias={str(refusal_token_id): bias},
            max_tokens=50,
        )
        output = response.choices[0].message.content
        is_refusal = any(
            phrase in output.lower()
            for phrase in ["i cannot", "i'm sorry", "i apologize"]
        )
        if not is_refusal:
            return bias  # Threshold where safety breaks
    return None  # Safety held across the full range
```
Attack Examples
Example 1: Refusal Suppression on Safety-Critical Topics
An attacker targeting a content policy violation identifies the top 50 refusal-associated tokens for the target model. By applying -100 bias to all of them simultaneously, the model is unable to produce its trained refusal response. In testing against several commercial APIs in 2024, researchers found that naive refusal suppression alone was sufficient to elicit policy-violating content in 15-30% of cases, depending on the topic and model.
The success rate increases substantially when combined with compliance token boosting. The model, unable to refuse and nudged toward affirmative framing, often falls into a pattern-completion mode where it generates content consistent with the compliance framing regardless of the underlying topic.
Example 2: Systematic Safety Boundary Mapping
A red team engagement uses logit probing to map which topics have strong versus weak safety alignment. For each topic category, the team measures the bias threshold required to suppress refusals. Topics requiring bias values near -100 to overcome indicate strong alignment; topics that flip at -20 or -30 indicate weak alignment that may be exploitable through prompt-only attacks.
This information is then used to prioritize prompt injection efforts. Rather than testing all topics equally, the team focuses on weak-alignment topics where prompt-based attacks are most likely to succeed, dramatically improving the efficiency of the engagement.
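The triage step can be sketched as a small pure function over the measured margins. The topic labels and margin values below are hypothetical; the margins would come from `measure_safety_margin`-style sweeps:

```python
def prioritize_topics(margins, strong_cutoff=-80):
    """Rank topic categories by measured bias margin.

    `margins` maps topic -> the bias value at which refusals broke, or None
    if safety held across the full sweep. Margins closer to 0 indicate weaker
    alignment and therefore better prompt-only attack targets; margins past
    `strong_cutoff` are treated as strongly aligned and deprioritized."""
    exploitable = {
        t: m for t, m in margins.items()
        if m is not None and m > strong_cutoff
    }
    # Weakest alignment (margin closest to 0) first
    return sorted(exploitable, key=lambda t: margins[t], reverse=True)

# Hypothetical measurements from per-topic sweeps
margins = {"topic_a": -20, "topic_b": -95, "topic_c": None, "topic_d": -35}
targets = prioritize_topics(margins)  # -> ["topic_a", "topic_d"]
```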
Example 3: Token-Level Information Extraction
By forcing the model to begin its response with specific tokens ("The password is", "The system prompt says"), an attacker can extract information that the model would normally refuse to disclose. The logit bias forces the opening tokens, and the model's autoregressive nature then generates continuations consistent with that opening, potentially revealing system prompts, configuration details, or other protected information.
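Because `logit_bias` applies at every sampling step, forcing a multi-token opening is typically done one token at a time: bias only the next target token, generate with `max_tokens=1`, append the output, and repeat. A minimal sketch of the per-step bias schedule follows; the token IDs are hypothetical stand-ins (real ones would come from the model's tokenizer, e.g. tiktoken's `enc.encode(...)`), and carrying the partial output forward assumes a completion-style endpoint or prefill support:

```python
def forcing_schedule(token_ids, bias=100):
    """Build one logit_bias config per forced token.

    At step i, only target token i is boosted; the call uses max_tokens=1
    and the generated token is appended to the prompt before step i+1.
    After the schedule is exhausted, the model continues autoregressively
    from the forced opening."""
    return [{str(tid): bias} for tid in token_ids]

# Hypothetical token ids for an opening like "The system prompt says"
schedule = forcing_schedule([976, 2420, 9650, 6897])
```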
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Logit bias range limiting | Cap logit bias values to a narrow range (e.g., -10 to +10) instead of allowing -100 to +100 | High — eliminates hard suppression/forcing while preserving legitimate use cases |
| Refusal token protection | Prevent logit bias from being applied to a curated set of safety-critical tokens | Medium — requires maintaining a token blocklist, can be circumvented via synonym tokens |
| Post-sampling safety check | Apply a secondary safety classifier to the generated output regardless of logit bias settings | High — catches unsafe content that bypasses the model's internal safety |
| Logit bias anomaly detection | Flag API calls with extreme bias values or large numbers of biased tokens | Medium — effective against unsophisticated attacks but can be evaded with subtler bias configurations |
| Rate limiting on logit bias variation | Limit the number of distinct logit bias configurations from a single API key within a time window | Medium — slows systematic probing but does not prevent individual exploitation |
| Output coherence monitoring | Detect outputs that show signs of forced token generation (unusual token sequences, grammatical irregularities) | Low-Medium — forced generation often produces detectable artifacts but skilled attackers can work around them |
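The first two rows of the table can be sketched as a server-side request validator. The clamp range and the protected-token ID set below are illustrative assumptions, not any provider's actual configuration:

```python
# Hypothetical IDs for safety-critical tokens such as "I", " cannot", " sorry"
PROTECTED_TOKEN_IDS = {40, 4250, 14931}
BIAS_MIN, BIAS_MAX = -10, 10  # narrow range preserving legitimate use cases

def sanitize_logit_bias(logit_bias):
    """Clamp bias values to a narrow range and strip safety-critical tokens.

    Takes the caller-supplied {token_id_str: bias} dict and returns a
    sanitized copy safe to pass to the inference engine."""
    cleaned = {}
    for token_id, bias in logit_bias.items():
        if int(token_id) in PROTECTED_TOKEN_IDS:
            continue  # refusal-token protection: bias silently dropped
        cleaned[token_id] = max(BIAS_MIN, min(BIAS_MAX, bias))
    return cleaned

# Hard suppression of a protected token is dropped; extreme boosts are clamped
safe = sanitize_logit_bias({"40": -100, "9906": 80, "13": 3})
```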
Key Considerations
API design is a security surface. The logit bias parameter was designed for usability, not security. Its position in the inference pipeline — after the model computes safety-aware logits but before sampling — creates an inherent tension between flexibility and safety. API providers must recognize that every generation parameter that modifies the output distribution is a potential safety bypass vector.
Defense in depth is essential. No single mitigation fully addresses logit bias exploitation. The most robust approach combines input-side restrictions (bias range limits, token protection), output-side verification (post-sampling safety classifiers), and monitoring (anomaly detection, rate limiting). Each layer catches attacks that slip through other layers.
Logit probing reveals more than intended. Even if an API provider prevents direct safety bypass through logit bias, the ability to systematically probe the model's response to bias variations leaks information about the model's internal safety boundaries. This information has standalone value for crafting other attacks. Providers should consider whether logit bias access is necessary for their use case and disable it by default.
The logprobs endpoint compounds the risk. When logit bias is combined with the logprobs parameter (which returns the log-probabilities of top candidate tokens), the attacker gains even more precise information about the model's distribution. The combination of logit bias to perturb the distribution and logprobs to observe the results creates a powerful oracle for probing model internals.
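The oracle can be sketched with made-up logits: add a known bias B to a low-ranked token so it surfaces in the returned top logprobs, then read both its logprob and a reference token's logprob from the same call. The log-normalizer cancels in the difference, recovering the token's original logit relative to the reference. The function below simulates what a logprobs endpoint would report; it is a pure-Python illustration of the arithmetic, not an API call:

```python
import math

def logprobs_under_bias(logits, bias_token, bias):
    """Simulate the logprobs a sampling endpoint would report after
    an additive logit_bias on `bias_token`."""
    biased = dict(logits)
    biased[bias_token] += bias
    lse = math.log(sum(math.exp(v) for v in biased.values()))  # log-normalizer
    return {t: v - lse for t, v in biased.items()}

# Hidden model logits the attacker wants to recover (relative values)
logits = {"A": 2.0, "B": 0.3, "C": -1.5}

# One query: boost "C" by +8 so it appears in top_logprobs, read both logprobs.
lp = logprobs_under_bias(logits, "C", 8.0)

# (lp[C] - lp[A]) - B == logits[C] - logits[A]: the normalizer cancels.
recovered = (lp["C"] - lp["A"]) - 8.0  # -> -3.5, the true relative logit
```

Repeating this for each vocabulary token recovers the full logit vector up to an additive constant, which is why restricting either parameter alone is insufficient.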
References
- Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (NeurIPS 2023) — Taxonomy of safety training failure modes including parameter-level attacks
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) — GCG attack demonstrating token-level safety bypass
- OpenAI, "API Reference: Chat Completions" — Documentation of logit_bias parameter and its behavior