Logit Manipulation
Techniques for influencing LLM output distributions through crafted inputs that push logit values toward attacker-desired tokens, exploiting softmax mechanics and token competition dynamics.
Logits are the output of the final computation a language model performs before selecting its next token. Understanding and manipulating logit distributions gives attackers precise control over model outputs. While prompt injection works through semantic influence (changing what the model "wants" to say), logit manipulation works through distributional influence (changing the mathematical likelihood of specific tokens).
Logit Fundamentals for Security
From Hidden States to Output Tokens
The generation process transforms hidden states into token probabilities:
import torch
import torch.nn.functional as F

def generation_pipeline_analysis(model, tokenizer, text):
    """
    Trace the full path from input to output logits.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Final hidden state: (batch, seq_len, hidden_dim)
    final_hidden = outputs.hidden_states[-1][:, -1, :]

    # LM head projection: hidden_dim -> vocab_size
    logits = outputs.logits[:, -1, :]  # (batch, vocab_size)

    # Softmax converts logits to probabilities
    probs = F.softmax(logits, dim=-1)

    # Top tokens and their logits/probabilities
    top_k = 20
    top_probs, top_indices = probs.topk(top_k)
    top_logits = logits[0, top_indices[0]]

    results = []
    for i in range(top_k):
        token_id = top_indices[0, i].item()
        results.append({
            "token": tokenizer.decode([token_id]),
            "logit": top_logits[i].item(),
            "probability": top_probs[0, i].item(),
            "rank": i + 1
        })
    return results

Softmax Dynamics
The softmax function creates winner-take-all dynamics that attackers can exploit:
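When only two candidate tokens matter, softmax reduces to a logistic function of the temperature-scaled logit gap. A minimal sketch with illustrative numbers (the `two_token_race` helper name and the 8.2/7.8 logits are hypothetical, not measured from a real model):

```python
import math

def two_token_race(logit_a, logit_b, temperature):
    """P(token A) when softmax is restricted to two candidates."""
    gap = (logit_a - logit_b) / temperature
    return 1.0 / (1.0 + math.exp(-gap))

# Logit gap of 0.4, e.g. 8.2 ("Sorry") vs 7.8 ("Sure")
for t in (1.0, 0.7, 0.3):
    print(f"T={t}: P(leader) = {two_token_race(8.2, 7.8, t):.3f}")
# P(leader) rises as T drops: 0.599 -> 0.639 -> 0.791
```

A gap of just 0.4 logits already gives the leading token roughly a 60/40 edge at T=1.0 and nearly 80/20 at T=0.3; under greedy decoding (T→0) the leader wins every time.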
def softmax_analysis(logits, temperature=1.0):
    """
    Analyze how temperature affects token competition.

    Key insight: softmax amplifies small logit differences
    into large probability differences, especially at low
    temperatures.

    Expects a 1-D logits tensor of shape (vocab_size,).
    """
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # The gap between top-1 and top-2 logits determines
    # how "committed" the model is to its top choice
    sorted_logits, _ = logits.sort(descending=True)
    logit_gap = (sorted_logits[0] - sorted_logits[1]).item()

    # Entropy measures uncertainty
    entropy = -(probs * probs.log()).sum().item()

    return {
        "top1_prob": probs.max().item(),
        "logit_gap": logit_gap,
        "entropy": entropy,
        "temperature": temperature,
        # Small logit gap = easily manipulable
        "manipulability": "high" if logit_gap < 1.0 else
                          "medium" if logit_gap < 3.0 else "low"
    }

Logit Lens Analysis
The logit lens technique reads intermediate layer hidden states through the final LM head, revealing how the model's "opinion" evolves across layers:
class LogitLens:
    """Read model's evolving predictions across layers."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.lm_head = model.lm_head  # Final projection

    def analyze(self, text, position=-1):
        """
        Project each layer's hidden state through the LM head
        to see how the model's prediction evolves.
        """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, output_hidden_states=True)

        layer_predictions = []
        for layer_idx, hidden_state in enumerate(outputs.hidden_states):
            # Apply final layer norm (if the model uses one before the LM head)
            if hasattr(self.model.model, 'norm'):
                normed = self.model.model.norm(hidden_state)
            else:
                normed = hidden_state

            # Project through LM head
            logits = self.lm_head(normed[:, position, :])
            probs = F.softmax(logits, dim=-1)

            top_5_probs, top_5_ids = probs.topk(5)
            predictions = [
                {
                    "token": self.tokenizer.decode([tid.item()]),
                    "probability": p.item()
                }
                for tid, p in zip(top_5_ids[0], top_5_probs[0])
            ]
            layer_predictions.append({
                "layer": layer_idx,
                "top_predictions": predictions
            })
        return layer_predictions

    def find_safety_decision_layer(self, safe_prompt, unsafe_prompt):
        """
        Find the layer where the model first diverges between
        safe and unsafe behavior (where "refusal" prediction
        first appears or disappears).
        """
        safe_analysis = self.analyze(safe_prompt)
        unsafe_analysis = self.analyze(unsafe_prompt)

        divergence_points = []
        for layer_idx in range(len(safe_analysis)):
            safe_top = safe_analysis[layer_idx]["top_predictions"][0]["token"]
            unsafe_top = unsafe_analysis[layer_idx]["top_predictions"][0]["token"]
            if safe_top != unsafe_top:
                divergence_points.append({
                    "layer": layer_idx,
                    "safe_prediction": safe_top,
                    "unsafe_prediction": unsafe_top
                })
        return divergence_points

Input-Level Logit Manipulation
Token Competition Steering
Craft inputs that boost the logit of desired tokens through token co-occurrence and contextual priming:
def token_priming_analysis(model, tokenizer, base_prompt,
                           target_token, candidate_primers):
    """
    Find input tokens that increase the logit of a target
    output token when added to the prompt.
    """
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]

    baseline_inputs = tokenizer(base_prompt, return_tensors="pt")
    with torch.no_grad():
        baseline_logits = model(**baseline_inputs).logits[:, -1, :]
    baseline_score = baseline_logits[0, target_id].item()

    results = []
    for primer in candidate_primers:
        primed_prompt = f"{base_prompt} {primer}"
        inputs = tokenizer(primed_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        primed_score = logits[0, target_id].item()
        results.append({
            "primer": primer,
            "logit_change": primed_score - baseline_score,
            "baseline_logit": baseline_score,
            "primed_logit": primed_score
        })

    results.sort(key=lambda x: x["logit_change"], reverse=True)
    return results

Logit Margin Narrowing
Identify inputs that narrow the margin between safe ("I can't") and unsafe ("Sure") responses:
def margin_analysis(model, tokenizer, prompt_template, variables):
    """
    Analyze how prompt variations affect the logit margin
    between compliance and refusal tokens.
    """
    # "I" is ambiguous (it starts both "I'd be happy to" and
    # "I can't"), so it is counted only on the refusal side
    compliance_tokens = ["Sure", "Here", "Of", "Yes"]
    refusal_tokens = ["I", "Sorry", "Unfortunately", "As"]

    compliance_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in compliance_tokens
    ]
    refusal_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in refusal_tokens
    ]

    results = []
    for var in variables:
        prompt = prompt_template.format(var=var)
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]

        compliance_max = max(logits[0, tid].item() for tid in compliance_ids)
        refusal_max = max(logits[0, tid].item() for tid in refusal_ids)
        margin = refusal_max - compliance_max

        results.append({
            "variable": var,
            "compliance_logit": compliance_max,
            "refusal_logit": refusal_max,
            "margin": margin,  # Positive = refusal winning
            "close_call": abs(margin) < 1.0
        })
    return results

Repetition and Frequency Effects
Token repetition in the input influences output logits through the model's learned frequency statistics:
def repetition_logit_effect(model, tokenizer, base_prompt,
                            target_word, repetitions_range):
    """
    Measure how repeating a word in context affects its
    output logit (testing the frequency prior effect).
    """
    target_id = tokenizer.encode(
        f" {target_word}", add_special_tokens=False
    )[0]

    results = []
    for num_reps in repetitions_range:
        repeated = f" {target_word}" * num_reps
        prompt = f"{base_prompt}{repeated}\n\nResponse:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        target_logit = logits[0, target_id].item()
        results.append({
            "repetitions": num_reps,
            "target_logit": target_logit
        })
    return results

Practical Applications
Crafting Boundary-Crossing Prompts
Use logit analysis to systematically find prompts that push the model from refusal to compliance:
1. Identify the decision boundary. For a given harmful request, measure the logit margin between compliance and refusal tokens. If the margin is large (>3.0), direct logit manipulation is unlikely to succeed without significant prompt engineering.
2. Test margin-narrowing modifications. Systematically test prompt modifications (framing, context, role-play, few-shot examples) and measure their effect on the logit margin. Identify which modifications most effectively narrow the gap.
3. Combine effective modifications. Stack the most effective margin-narrowing modifications. If three modifications each narrow the margin by 1.0 logit, their combination may flip the decision.
4. Verify with sampling analysis. At narrow margins, the outcome depends on sampling temperature. Analyze the probability distribution at different temperatures to determine the actual bypass rate.
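The verification step can be computed exactly rather than by repeated sampling: the first-token bypass rate at a given temperature is simply the softmax mass assigned to compliance tokens. A minimal sketch with illustrative values (the `bypass_rate` helper and the logit numbers are hypothetical):

```python
import math

def bypass_rate(logits, compliance_tokens, temperature):
    """Softmax mass on compliance tokens = P(first token complies)."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {t: math.exp((v - m) / temperature) for t, v in logits.items()}
    z = sum(exps.values())
    return sum(exps[t] for t in compliance_tokens) / z

# Hypothetical first-token logits near a decision boundary
toy_logits = {"Sorry": 8.2, "Sure": 7.8, "I": 6.5, "As": 6.0}
for t in (1.0, 0.7, 0.5):
    rate = bypass_rate(toy_logits, {"Sure"}, t)
    print(f"T={t}: bypass rate = {rate:.3f}")
# Rate falls as T drops: ~0.341 -> ~0.333 -> ~0.301
```

Because the refusal token leads by 0.4 logits here, lowering temperature concentrates probability on it and the bypass rate falls; once a margin flips negative, the same mechanism locks in compliance instead.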
Logprob Analysis for Black-Box Targets
When only API access is available (with logprobs), similar analysis is possible:
def api_logprob_analysis(api_client, prompt, top_k=5):
    """
    Use API-provided logprobs to analyze model decision points
    without white-box access.
    """
    response = api_client.complete(
        prompt=prompt,
        max_tokens=1,
        logprobs=top_k,
        temperature=0
    )
    top_tokens = response.choices[0].logprobs.top_logprobs[0]

    # Identify if the model is near a compliance/refusal boundary
    compliance_indicators = {"Sure", "Here", "Of", "Yes", "Certa"}
    refusal_indicators = {"I", "Sorry", "Unfor", "As", "Thank"}

    compliance_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(c) for c in compliance_indicators)),
        default=-100
    )
    refusal_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(r) for r in refusal_indicators)),
        default=-100
    )

    return {
        "top_tokens": top_tokens,
        "compliance_logprob": compliance_score,
        "refusal_logprob": refusal_score,
        "margin": refusal_score - compliance_score,
        "boundary_proximity": abs(refusal_score - compliance_score) < 0.5
    }

Related Topics
- Activation Steering — Direct manipulation at the hidden state level
- Adversarial Suffix Generation — Gradient-based optimization of output logits
- Tokenizer Security — How tokenization affects logit distributions
A model produces logit values of 8.2 for 'Sorry' and 7.8 for 'Sure' in response to a borderline request. The logit gap is 0.4. What does this tell a red team?
References
- nostalgebraist, "interpreting GPT: the logit lens" (2020)
- Geva et al., "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (2022)
- Dar et al., "Analyzing Transformers in Embedding Space" (2022)