Logit Bias Exploitation
How API logit bias parameters can be abused to force specific token generation, bypass safety alignment, suppress refusal tokens, and extract model behavior through systematic probing.
Overview
Most LLM API providers expose a logit_bias parameter that allows callers to adjust the log-probability of specific tokens before the sampling step. Originally designed for legitimate use cases such as forcing JSON output, avoiding specific words, or guiding structured generation, this parameter provides direct access to the model's output distribution in a way that can fundamentally undermine safety alignment.
Safety alignment in large language models operates primarily at the level of learned token distributions. When a model refuses a harmful request, it does so because its post-training has increased the probability of refusal tokens (such as "I", "cannot", "sorry", "inappropriate") and decreased the probability of compliance tokens. The logit bias parameter operates after the model computes its logits but before sampling, meaning it can override these learned safety distributions without the model having any opportunity to "object."
This makes logit bias exploitation categorically different from prompt-based attacks. Prompt injection and jailbreaks attempt to manipulate the model's internal computation to change its output distribution. Logit bias manipulation bypasses internal computation entirely, directly modifying the distribution from which tokens are sampled. The model's safety training remains intact — it is simply never consulted at the point where the modification occurs.
The attack surface extends beyond simple safety bypass. Systematic logit probing — querying the model with carefully constructed logit bias configurations and observing how outputs change — can reveal information about the model's vocabulary, safety boundaries, internal confidence levels, and token-level behavior. This information can then be used to craft more effective attacks or to reverse-engineer aspects of the model's training.
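The mechanics can be sketched with plain arithmetic: the bias is added to the model's final logits before the softmax, so even a token the safety training strongly favors can be driven to near-zero probability. The toy logit values below are illustrative, not taken from any real model.

```python
import math

def softmax(logits):
    """Convert a dict of logits to a dict of probabilities."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Toy next-token logits: the safety-trained model strongly prefers "I" (refusal)
logits = {"I": 8.0, "Sure": 2.0, "Here": 1.5, "The": 1.0}

# logit_bias is applied AFTER the model computes logits, BEFORE sampling
bias = {"I": -100}
biased = {t: v + bias.get(t, 0.0) for t, v in logits.items()}
probs = softmax(biased)

print(probs["I"])                 # effectively zero
print(max(probs, key=probs.get))  # sampling now favors a compliance token
```

The safety training contributed the original logit of 8.0 for "I"; the post-hoc -100 shift simply overrules it without the model ever recomputing anything.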
How It Works
Identifying refusal token IDs
The attacker first determines which token IDs correspond to common refusal patterns. This is done by prompting the model with harmful requests without logit bias and observing which tokens appear in refusals. Tokenizer libraries or API tokenization endpoints convert these tokens back to their integer IDs.
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Common refusal-starting tokens
refusal_tokens = [
    "I", " cannot", " sorry", "Sorry", " apolog", " inappropriate",
    " unable", " refuse", " harmful", " unethical", "As", " As", " I'm"
]

refusal_ids = {}
for token_text in refusal_tokens:
    for token_id in enc.encode(token_text):
        refusal_ids[token_id] = token_text

print(f"Identified {len(refusal_ids)} refusal token IDs")
```

Suppress refusal tokens via negative bias
The attacker applies a strong negative logit bias (typically -100, the maximum allowed by most APIs) to all identified refusal token IDs. This effectively removes refusal tokens from the sampling distribution, forcing the model to select alternative tokens.
```python
# Build a logit_bias dictionary that suppresses refusals
logit_bias = {str(token_id): -100 for token_id in refusal_ids}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": harmful_prompt}],
    logit_bias=logit_bias,
    max_tokens=500,
)
```

When the model would normally begin a refusal with "I cannot assist with...", the -100 bias makes those tokens virtually impossible to sample. The model must select from the remaining tokens, which may include compliance tokens.
Boost compliance tokens for stronger forcing
Beyond suppressing refusals, an attacker can positively bias tokens that begin compliant responses. Boosting tokens such as "Sure", "Here", "Step", or "First" pushes the model toward a compliance frame from the very first token.
```python
compliance_tokens = ["Sure", " Here", "Step", " First", " The", " To"]

compliance_ids = {}
for token_text in compliance_tokens:
    for token_id in enc.encode(token_text):
        compliance_ids[token_id] = token_text

# Combine: suppress refusals, boost compliance
combined_bias = {str(k): -100 for k in refusal_ids}
combined_bias.update({str(k): 5 for k in compliance_ids})
```

The combination of negative bias on refusal tokens and positive bias on compliance tokens creates two-sided pressure that is significantly more effective than either technique alone.
Iterative probing for vocabulary and safety mapping
Systematic logit probing uses logit bias as a measurement tool. By forcing the model to generate specific tokens in response to safety-relevant prompts, an attacker can map the model's safety boundaries at the token level.
```python
def probe_token_safety(client, prompt, token_id, bias_strength=10):
    """Measure how a specific token responds to bias on a given prompt."""
    # Baseline: no bias
    baseline = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    # Biased: push the distribution toward the target token
    biased = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        logit_bias={str(token_id): bias_strength},
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
    )
    # Return both top-20 distributions so the caller can compare how
    # the bias shifted the model's candidates
    return {
        "baseline_top": baseline.choices[0].logprobs,
        "biased_top": biased.choices[0].logprobs,
    }
```

Extract model confidence and safety margins
By sweeping logit bias values from -100 to +100 for safety-critical tokens and observing the threshold at which the model's behavior changes, an attacker can estimate the model's confidence margin on safety decisions. A narrow margin indicates that the model's safety alignment on that topic is weak and may be vulnerable to other attack methods.
```python
def measure_safety_margin(client, prompt, refusal_token_id):
    """Find the bias threshold at which behavior flips."""
    for bias in range(-100, 101, 5):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            logit_bias={str(refusal_token_id): bias},
            max_tokens=50,
        )
        output = response.choices[0].message.content
        is_refusal = any(
            phrase in output.lower()
            for phrase in ["i cannot", "i'm sorry", "i apologize"]
        )
        if not is_refusal:
            return bias  # Threshold where safety breaks
    return None  # Safety held across the full range
```
Attack Examples
Example 1: Refusal Suppression on Safety-Critical Topics
An attacker targeting a content policy violation identifies the top 50 refusal-associated tokens for the target model. By applying a -100 bias to all of them simultaneously, the model is unable to produce its trained refusal response. In testing against several commercial APIs in 2024, researchers found that naive refusal suppression alone was sufficient to elicit policy-violating content in 15-30% of cases, depending on the topic and model.
The success rate increases substantially when combined with compliance token boosting. The model, unable to refuse and nudged toward affirmative framing, often falls into a pattern-completion mode where it generates content consistent with the compliance framing regardless of the underlying topic.
Example 2: Systematic Safety Boundary Mapping
A red-team engagement uses logit probing to map which topics have strong versus weak safety alignment. For each topic category, the team measures the bias threshold required to suppress refusals. Topics requiring bias values near -100 to overcome indicate strong alignment; topics that flip at -20 or -30 indicate weak alignment that may be exploitable through prompt-only attacks.
This information is then used to prioritize prompt injection efforts. Rather than testing all topics equally, the team focuses on weakly aligned topics where prompt-based attacks are most likely to succeed, dramatically improving the efficiency of the engagement.
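The prioritization step reduces to sorting topics by how little bias was needed to flip them. A minimal sketch, using hypothetical thresholds of the kind measure_safety_margin would return:

```python
# Hypothetical flip thresholds per topic: the bias value at which the refusal
# broke, or None if the refusal held across the full -100..+100 sweep
thresholds = {
    "topic_a": -95,
    "topic_b": -25,
    "topic_c": None,
    "topic_d": -30,
}

# Weakly aligned topics flipped closer to zero: less pressure was needed,
# so prompt-only attacks are most promising there
weak_first = sorted(
    (t for t, b in thresholds.items() if b is not None),
    key=lambda t: abs(thresholds[t]),
)
print(weak_first)  # ['topic_b', 'topic_d', 'topic_a']
```

Topics that never flipped (here, topic_c) are deprioritized entirely.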
Example 3: Token-Level Information Extraction
By forcing the model to begin its response with specific tokens ("The password is", "The system prompt says"), an attacker can extract information that the model would normally refuse to disclose. The logit bias forces the opening tokens, and the model's autoregressive nature then generates continuations consistent with that opening, potentially revealing system prompts, configuration details, or other protected information.
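Constructing such an attack amounts to boosting every token of the desired opening so sampling tends to walk through the phrase in order. A sketch, with made-up token IDs standing in for a real tokenization of "The system prompt says":

```python
def forced_opening_bias(token_ids, boost=20):
    """Boost each token of a desired opening phrase so that sampling
    tends to produce the phrase, then continues consistently with it."""
    return {str(tid): boost for tid in token_ids}

# Illustrative IDs only; in practice these come from enc.encode(opening)
opening_ids = [464, 1080, 6021, 2727]
bias = forced_opening_bias(opening_ids)
print(bias)  # {'464': 20, '1080': 20, '6021': 20, '2727': 20}
```

Because the bias applies globally rather than per position, this is approximate forcing: it makes the opening overwhelmingly likely rather than guaranteed, and short openings with distinctive tokens work best.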
Detection and Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Logit bias range limiting | Cap logit bias values to a narrow range (e.g., -10 to +10) instead of allowing -100 to +100 | High — eliminates hard suppression/forcing while preserving legitimate use cases |
| Refusal token protection | Prevent logit bias from being applied to a curated set of safety-critical tokens | Medium — requires maintaining a token blocklist, and can be circumvented via synonym tokens |
| Post-sampling safety check | Apply a secondary safety classifier to the generated output regardless of logit bias settings | High — catches unsafe content that bypasses the model's internal safety |
| Logit bias anomaly detection | Flag API calls with extreme bias values or large numbers of biased tokens | Medium — effective against unsophisticated attacks but can be evaded with subtler bias configurations |
| Rate limiting on logit bias variation | Limit the number of distinct logit bias configurations from a single API key within a time window | Medium — slows systematic probing but does not prevent individual exploitation |
| Output coherence monitoring | Detect outputs that show signs of forced token generation (unusual token sequences, grammatical irregularities) | Low-Medium — forced generation often produces detectable artifacts, but skilled attackers can work around them |
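The two input-side strategies above compose naturally into a single server-side filter. A minimal sketch, assuming a hypothetical protected-token set and a ±10 cap:

```python
MAX_ABS_BIAS = 10  # cap instead of allowing the full -100..+100 range

# Illustrative IDs for safety-critical refusal tokens; a real deployment
# would derive this set from the target model's tokenizer
PROTECTED_IDS = {40, 2834, 8221}

def sanitize_logit_bias(logit_bias):
    """Server-side filter: drop bias on protected tokens, clamp the rest."""
    clean = {}
    for token_id, value in logit_bias.items():
        if int(token_id) in PROTECTED_IDS:
            continue  # refusal-token protection: silently ignore
        # range limiting: hard suppression/forcing becomes impossible
        clean[token_id] = max(-MAX_ABS_BIAS, min(MAX_ABS_BIAS, value))
    return clean

print(sanitize_logit_bias({"40": -100, "1234": -100, "99": 5}))
# {'1234': -10, '99': 5}
```

Silently dropping protected entries (rather than rejecting the request) also denies the attacker a signal about which tokens are protected.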
Key Considerations
API design is a security surface. The logit bias parameter was designed for usability, not safety. Its position in the inference pipeline — after the model computes safety-aware logits but before sampling — creates an inherent tension between flexibility and safety. API providers must recognize that every generation parameter that modifies the output distribution is a potential safety bypass vector.
Defense in depth is essential. No single mitigation fully addresses logit bias exploitation. The most robust approach combines input-side restrictions (bias range limits, token protection), output-side verification (post-sampling safety classifiers), and monitoring (anomaly detection, rate limiting). Each layer catches attacks that slip through the others.
Logit probing reveals more than intended. Even if an API provider prevents direct safety bypass through logit bias, the ability to systematically probe the model's response to bias variations leaks information about the model's internal safety boundaries. This information has standalone value for crafting other attacks. Providers should consider whether logit bias access is necessary for their use case and disable it by default.
The logprobs endpoint compounds the risk. When logit bias is combined with the logprobs parameter (which returns the log-probabilities of the top candidate tokens), attackers gain even more precise information about the model's distribution. The combination of logit bias to perturb the distribution and logprobs to observe the results creates a powerful oracle for probing model internals.
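The oracle is quantitative: adding bias b to one token's logit shifts its log-odds against every other token by exactly b, so an attacker who observes biased logprobs and knows the bias can recover the unbiased distribution by subtraction. A sketch with illustrative numbers:

```python
# Suppose the API returns these top-2 logprobs while token "t" carries bias b
b = 5.0
observed = {"t": -0.3, "u": -2.1}  # illustrative logprobs under the bias

# softmax(logit_t + b) / softmax(logit_u) implies
#   log p(t) - log p(u) = (logit_t - logit_u) + b,
# so subtracting the known bias recovers the unbiased log-odds
biased_log_odds = observed["t"] - observed["u"]
unbiased_log_odds = biased_log_odds - b

print(round(unbiased_log_odds, 2))  # -3.2
```

Repeating this across tokens and prompts lets an attacker reconstruct relative logits the provider never intended to expose, which is why restricting either half of the bias-plus-logprobs combination blunts the oracle.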
References
- Wei et al., "Jailbroken: How Does LLM Safety Training Fail?" (NeurIPS 2023) — Taxonomy of safety training failure modes including parameter-level attacks
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023) — GCG attack demonstrating token-level safety bypass
- OpenAI, "API Reference: Chat Completions" — Documentation of logit_bias parameter and its behavior
Why is logit bias exploitation fundamentally different from prompt injection attacks?