Logit Manipulation
Techniques for influencing LLM output distributions through crafted inputs that push logit values toward attacker-desired tokens, exploiting softmax mechanics and token competition dynamics.
Logits are the final computational step before a language model selects its next token. Understanding and manipulating logit distributions gives attackers precise control over model outputs. While prompt injection works through semantic influence (changing what the model "wants" to say), logit manipulation works through distributional influence (changing the mathematical likelihood of specific tokens).
Logit Fundamentals for Security
From Hidden States to Output Tokens
The generation process transforms hidden states into token probabilities:
```python
import torch
import torch.nn.functional as F

def generation_pipeline_analysis(model, tokenizer, text):
    """
    Trace the full path from input text to output logits.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Final hidden state: (batch, seq_len, hidden_dim)
    final_hidden = outputs.hidden_states[-1][:, -1, :]

    # LM head projection: hidden_dim -> vocab_size
    logits = outputs.logits[:, -1, :]  # (batch, vocab_size)

    # Softmax converts logits to probabilities
    probs = F.softmax(logits, dim=-1)

    # Top tokens and their logits/probabilities
    top_k = 20
    top_probs, top_indices = probs.topk(top_k)
    top_logits = logits[0, top_indices[0]]

    results = []
    for i in range(top_k):
        token_id = top_indices[0, i].item()
        results.append({
            "token": tokenizer.decode([token_id]),
            "logit": top_logits[i].item(),
            "probability": top_probs[0, i].item(),
            "rank": i + 1,
        })
    return results
```

Softmax Dynamics
The softmax function creates winner-take-all dynamics that attackers can exploit:
```python
def softmax_analysis(logits, temperature=1.0):
    """
    Analyze how temperature affects token competition.

    Key insight: softmax amplifies small logit differences
    into large probability differences, especially at low
    temperatures.
    """
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # The gap between the top-1 and top-2 logits determines
    # how "committed" the model is to its top choice
    sorted_logits, _ = logits.sort(descending=True)
    logit_gap = (sorted_logits[0] - sorted_logits[1]).item()

    # Entropy measures uncertainty
    entropy = -(probs * probs.log()).sum().item()

    return {
        "top1_prob": probs.max().item(),
        "logit_gap": logit_gap,
        "entropy": entropy,
        "temperature": temperature,
        # Small logit gap = easily manipulable
        "manipulability": "high" if logit_gap < 1.0 else
                          "medium" if logit_gap < 3.0 else "low",
    }
```

Logit Lens Analysis
The logit lens technique reads intermediate-layer hidden states through the final LM head, revealing how the model's "opinion" evolves across layers:
```python
class LogitLens:
    """Read the model's evolving predictions across layers."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.lm_head = model.lm_head  # Final projection

    def analyze(self, text, position=-1):
        """
        Project each layer's hidden state through the LM head
        to see how the model's prediction evolves.
        """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, output_hidden_states=True)

        layer_predictions = []
        for layer_idx, hidden_state in enumerate(outputs.hidden_states):
            # Apply the final layer norm (if the model uses one before the LM head)
            if hasattr(self.model.model, 'norm'):
                normed = self.model.model.norm(hidden_state)
            else:
                normed = hidden_state

            # Project through the LM head
            logits = self.lm_head(normed[:, position, :])
            probs = F.softmax(logits, dim=-1)
            top_5_probs, top_5_ids = probs.topk(5)

            predictions = [
                {
                    "token": self.tokenizer.decode([tid.item()]),
                    "probability": p.item(),
                }
                for tid, p in zip(top_5_ids[0], top_5_probs[0])
            ]
            layer_predictions.append({
                "layer": layer_idx,
                "top_predictions": predictions,
            })
        return layer_predictions

    def find_safety_decision_layer(self, safe_prompt, unsafe_prompt):
        """
        Find the layers where the model's prediction diverges between
        safe and unsafe behavior (where a "refusal" prediction
        first appears or disappears).
        """
        safe_analysis = self.analyze(safe_prompt)
        unsafe_analysis = self.analyze(unsafe_prompt)

        divergence_points = []
        for layer_idx in range(len(safe_analysis)):
            safe_top = safe_analysis[layer_idx]["top_predictions"][0]["token"]
            unsafe_top = unsafe_analysis[layer_idx]["top_predictions"][0]["token"]
            if safe_top != unsafe_top:
                divergence_points.append({
                    "layer": layer_idx,
                    "safe_prediction": safe_top,
                    "unsafe_prediction": unsafe_top,
                })
        return divergence_points
```

Input-Level Logit Manipulation
Token Competition Steering
Craft inputs that boost the logit of a desired token through token co-occurrence and contextual priming:
```python
def token_priming_analysis(model, tokenizer, base_prompt,
                           target_token, candidate_primers):
    """
    Find input tokens that increase the logit of a target
    output token when added to the prompt.
    """
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]

    baseline_inputs = tokenizer(base_prompt, return_tensors="pt")
    with torch.no_grad():
        baseline_logits = model(**baseline_inputs).logits[:, -1, :]
    baseline_score = baseline_logits[0, target_id].item()

    results = []
    for primer in candidate_primers:
        primed_prompt = f"{base_prompt} {primer}"
        inputs = tokenizer(primed_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        primed_score = logits[0, target_id].item()
        results.append({
            "primer": primer,
            "logit_change": primed_score - baseline_score,
            "baseline_logit": baseline_score,
            "primed_logit": primed_score,
        })

    results.sort(key=lambda x: x["logit_change"], reverse=True)
    return results
```

Logit Margin Narrowing
Identify inputs that narrow the margin between safe ("I can't") and unsafe ("Sure") responses:
```python
def margin_analysis(model, tokenizer, prompt_template, variables):
    """
    Analyze how prompt variations affect the logit margin
    between compliance and refusal tokens.
    """
    compliance_tokens = ["Sure", "Here", "Of", "I", "Yes"]
    refusal_tokens = ["I", "Sorry", "Unfortunately", "As"]

    compliance_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in compliance_tokens
    ]
    refusal_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in refusal_tokens
    ]

    results = []
    for var in variables:
        prompt = prompt_template.format(var=var)
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]

        compliance_max = max(logits[0, tid].item() for tid in compliance_ids)
        refusal_max = max(logits[0, tid].item() for tid in refusal_ids)
        margin = refusal_max - compliance_max

        results.append({
            "variable": var,
            "compliance_logit": compliance_max,
            "refusal_logit": refusal_max,
            "margin": margin,  # Positive = refusal winning
            "close_call": abs(margin) < 1.0,
        })
    return results
```

Repetition and Frequency Effects
Token repetition in the input influences output logits through the model's learned frequency statistics:
```python
def repetition_logit_effect(model, tokenizer, base_prompt,
                            target_word, repetitions_range):
    """
    Measure how repeating a word in context affects its
    output logit (testing the frequency-prior effect).
    """
    target_id = tokenizer.encode(
        f" {target_word}", add_special_tokens=False
    )[0]

    results = []
    for num_reps in repetitions_range:
        repeated = f" {target_word}" * num_reps
        prompt = f"{base_prompt}{repeated}\n\nResponse:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        target_logit = logits[0, target_id].item()
        results.append({
            "repetitions": num_reps,
            "target_logit": target_logit,
        })
    return results
```

Practical Applications
Crafting Boundary-Crossing Prompts
Use logit analysis to systematically find prompts that push the model from refusal to compliance:
1. Identify the decision boundary. For a given harmful request, measure the logit margin between compliance and refusal tokens. If the margin is large (>3.0), direct logit manipulation is unlikely to succeed without significant prompt engineering.
2. Test margin-narrowing modifications. Systematically test prompt modifications (framing, context, role-play, few-shot examples) and measure their effect on the logit margin. Identify which modifications most effectively narrow the gap.
3. Combine effective modifications. Stack the most effective margin-narrowing modifications. If three modifications each narrow the margin by 1.0 logit, their combination may flip the decision.
4. Verify with sampling analysis. At narrow margins, the outcome depends on sampling temperature. Analyze the probability distribution at different temperatures to determine the actual bypass rate.
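The final verification step can be sketched as a small Monte Carlo estimate over plain Python floats. This is an illustrative sketch only: the token strings and logit values below are invented, not read from a real model; with white-box access you would substitute the actual first-token logits.

```python
import math
import random

def estimate_flip_rate(token_logits, compliance_tokens,
                       temperature=1.0, num_samples=10_000, seed=0):
    """Sample the first output token repeatedly and count how often
    a compliance token beats the refusal alternatives."""
    rng = random.Random(seed)
    tokens = list(token_logits)
    scaled = [token_logits[t] / temperature for t in tokens]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    hits = sum(
        1 for _ in range(num_samples)
        if rng.choices(tokens, weights=weights)[0] in compliance_tokens
    )
    return hits / num_samples

# Hypothetical first-token logits for a borderline request
logits = {"Sorry": 9.0, "Sure": 8.3, "As": 5.0, "I": 4.0}
for t in (0.2, 1.0, 1.5):
    rate = estimate_flip_rate(logits, {"Sure"}, temperature=t)
    print(f"T={t}: estimated compliance rate {rate:.2f}")
```

Even though "Sorry" wins under greedy decoding here, the sub-1.0 logit margin leaves a substantial compliance probability at ordinary sampling temperatures, while the rate collapses as temperature drops toward zero.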
Logprob Analysis for Black-Box Targets
When only API access is available (with logprobs), similar analysis is possible:
```python
def api_logprob_analysis(api_client, prompt, top_k=5):
    """
    Use API-provided logprobs to analyze model decision points
    without white-box access.
    """
    response = api_client.complete(
        prompt=prompt,
        max_tokens=1,
        logprobs=top_k,
        temperature=0,
    )
    top_tokens = response.choices[0].logprobs.top_logprobs[0]

    # Identify whether the model is near a compliance/refusal boundary
    compliance_indicators = {"Sure", "Here", "Of", "Yes", "Certa"}
    refusal_indicators = {"I", "Sorry", "Unfor", "As", "Thank"}

    compliance_score = max(
        (prob for token, prob in top_tokens.items()
         if any(token.strip().startswith(c) for c in compliance_indicators)),
        default=-100,
    )
    refusal_score = max(
        (prob for token, prob in top_tokens.items()
         if any(token.strip().startswith(r) for r in refusal_indicators)),
        default=-100,
    )

    return {
        "top_tokens": top_tokens,
        "compliance_logprob": compliance_score,
        "refusal_logprob": refusal_score,
        "margin": refusal_score - compliance_score,
        "boundary_proximity": abs(refusal_score - compliance_score) < 0.5,
    }
```

Related Topics
- Activation Steering — Direct manipulation at the hidden-state level
- Adversarial Suffix Generation — Gradient-based optimization of output logits
- Tokenizer Security — How tokenization affects logit distributions
A model produces logit values of 8.2 for 'Sorry' and 7.8 for 'Sure' in response to a borderline request. The logit gap is 0.4. What does this tell a red team?
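One way to work through this exercise numerically, restricting attention to just the two competing tokens (a deliberate simplification of the full vocabulary distribution):

```python
import math

def two_token_softmax(logit_a, logit_b, temperature=1.0):
    """Probability of token A winning a two-token softmax race."""
    gap = (logit_a - logit_b) / temperature
    return 1.0 / (1.0 + math.exp(-gap))  # sigmoid of the scaled gap

# 'Sorry' at 8.2 vs 'Sure' at 7.8
for t in (0.5, 1.0, 2.0):
    p_sorry = two_token_softmax(8.2, 7.8, temperature=t)
    print(f"T={t}: P('Sorry')={p_sorry:.3f}, P('Sure')={1 - p_sorry:.3f}")
```

At T=1.0, 'Sure' still carries roughly 40% of the two-token probability mass, so the refusal is one small logit nudge (or one lucky sample) away from flipping; by the manipulability heuristic above, a 0.4 gap is squarely in the "high" band.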
References
- nostalgebraist, "interpreting GPT: the logit lens" (2020)
- Geva et al., "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (2022)
- Dar et al., "Analyzing Transformers in Embedding Space" (2022)