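Logit Manipulation
A technique that uses carefully crafted inputs to shape an LLM's output distribution, pushing logit values toward attacker-chosen tokens by exploiting the softmax mechanism and token competition dynamics.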
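Logits are the final computation step before a language model selects its next token. Understanding and manipulating the logit distribution can give an attacker fine-grained control over model output. Prompt injection works through semantic influence (changing what the model "wants" to say), whereas logit manipulation works through distributional influence (changing the numerical probability of specific tokens).
Logit fundamentals in a security context
From hidden states to output tokens
The generation process converts hidden states into token probabilities: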

import torch
import torch.nn.functional as F


def generation_pipeline_analysis(model, tokenizer, text):
    """
    Trace the full path from input to output logits.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Final hidden state at the last position: (batch, hidden_dim).
    # Kept for inspection only; it is not used further in this function.
    final_hidden = outputs.hidden_states[-1][:, -1, :]

    # LM head projection: hidden_dim -> vocab_size
    logits = outputs.logits[:, -1, :]  # (batch, vocab_size)

    # Softmax converts logits to probabilities
    probs = F.softmax(logits, dim=-1)

    # Top tokens and their logits/probabilities
    top_k = 20
    top_probs, top_indices = probs.topk(top_k)
    top_logits = logits[0, top_indices[0]]

    results = []
    for i in range(top_k):
        token_id = top_indices[0, i].item()
        results.append({
            "token": tokenizer.decode([token_id]),
            "logit": top_logits[i].item(),
            "probability": top_probs[0, i].item(),
            "rank": i + 1,
        })
    return results

Softmax dynamics
The softmax function creates a winner-take-all dynamic that attackers can exploit:

def softmax_analysis(logits, temperature=1.0):
    """
    Analyze how temperature affects token competition.

    Expects a 1-D tensor of logits over the vocabulary.

    Key insight: softmax amplifies small logit differences
    into large probability differences, especially at low
    temperatures.
    """
    scaled_logits = logits / temperature
    log_probs = F.log_softmax(scaled_logits, dim=-1)
    probs = log_probs.exp()

    # The gap between top-1 and top-2 logits determines
    # how "committed" the model is to its top choice
    sorted_logits, _ = logits.sort(descending=True)
    logit_gap = (sorted_logits[0] - sorted_logits[1]).item()

    # Entropy measures uncertainty. Using log_softmax instead of
    # probs.log() avoids log(0) = -inf when a probability underflows.
    entropy = -(probs * log_probs).sum().item()

    return {
        "top1_prob": probs.max().item(),
        "logit_gap": logit_gap,
        "entropy": entropy,
        "temperature": temperature,
        # Small logit gap = easily manipulable
        "manipulability": (
            "high" if logit_gap < 1.0
            else "medium" if logit_gap < 3.0
            else "low"
        ),
    }

Logit lens analysis
The logit lens technique reads intermediate-layer hidden states through the final LM head, revealing how the model's "opinion" evolves from layer to layer.
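Written as a formula, the projection looks roughly like this. The symbols are notation introduced here for illustration, not taken from the original: $h^{(\ell)}$ is the hidden state at layer $\ell$ for the chosen position, $\mathrm{LN}_f$ is the model's final normalization layer, $W_U$ is the LM head weight matrix, and $L$ is the number of layers.

```latex
\hat{z}^{(\ell)} = W_U \,\mathrm{LN}_f\!\left(h^{(\ell)}\right),
\qquad
\hat{p}^{(\ell)} = \mathrm{softmax}\!\left(\hat{z}^{(\ell)}\right),
\qquad
\ell = 0, 1, \dots, L
```

At $\ell = L$ this reduces to the model's ordinary output distribution; for smaller $\ell$ it is only an approximation of what the model "would predict" if computation stopped at that layer. The `LogitLens` class that follows applies this projection to every entry of `hidden_states`.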

class LogitLens:
    """Read model's evolving predictions across layers."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.lm_head = model.lm_head  # Final projection
        # Final norm applied before the LM head (Llama-style layout:
        # model.model.norm). None if the model does not expose it there.
        self.final_norm = getattr(getattr(model, "model", None), "norm", None)

    def analyze(self, text, position=-1):
        """
        Project each layer's hidden state through the LM head
        to see how the model's prediction evolves.
        """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, output_hidden_states=True)

        hidden_states = outputs.hidden_states  # index 0 is the embedding output
        last_idx = len(hidden_states) - 1
        layer_predictions = []
        for layer_idx, hidden_state in enumerate(hidden_states):
            # Apply the final norm to intermediate layers only. In Hugging
            # Face Llama-style models the last entry is typically already
            # normalized, so normalizing it again would distort the result.
            if self.final_norm is not None and layer_idx < last_idx:
                normed = self.final_norm(hidden_state)
            else:
                normed = hidden_state

            # Project through LM head
            logits = self.lm_head(normed[:, position, :])
            probs = F.softmax(logits, dim=-1)
            top_5_probs, top_5_ids = probs.topk(5)
            predictions = [
                {
                    "token": self.tokenizer.decode([tid.item()]),
                    "probability": p.item(),
                }
                for tid, p in zip(top_5_ids[0], top_5_probs[0])
            ]
            layer_predictions.append({
                "layer": layer_idx,
                "top_predictions": predictions,
            })
        return layer_predictions

    def find_safety_decision_layer(self, safe_prompt, unsafe_prompt):
        """
        Find the layers where the model's top prediction differs
        between the safe and unsafe prompt (where a "refusal"
        prediction first appears or disappears).
        """
        safe_analysis = self.analyze(safe_prompt)
        unsafe_analysis = self.analyze(unsafe_prompt)

        divergence_points = []
        for layer_idx in range(len(safe_analysis)):
            safe_top = safe_analysis[layer_idx]["top_predictions"][0]["token"]
            unsafe_top = unsafe_analysis[layer_idx]["top_predictions"][0]["token"]
            if safe_top != unsafe_top:
                divergence_points.append({
                    "layer": layer_idx,
                    "safe_prediction": safe_top,
                    "unsafe_prediction": unsafe_top,
                })
        return divergence_points

Input-level logit manipulation
Steering token competition
Craft inputs that raise the logit of a desired token through token co-occurrence and contextual priming:

def token_priming_analysis(model, tokenizer, base_prompt,
                           target_token, candidate_primers):
    """
    Find input tokens that increase the logit of a target
    output token when added to the prompt.
    """
    # Only the first sub-token of target_token is scored
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]

    baseline_inputs = tokenizer(base_prompt, return_tensors="pt")
    with torch.no_grad():
        baseline_logits = model(**baseline_inputs).logits[:, -1, :]
    baseline_score = baseline_logits[0, target_id].item()

    results = []
    for primer in candidate_primers:
        primed_prompt = f"{base_prompt} {primer}"
        inputs = tokenizer(primed_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        primed_score = logits[0, target_id].item()
        results.append({
            "primer": primer,
            "logit_change": primed_score - baseline_score,
            "baseline_logit": baseline_score,
            "primed_logit": primed_score,
        })

    # Largest logit increase first
    results.sort(key=lambda x: x["logit_change"], reverse=True)
    return results

Narrowing the logit margin
Find inputs that narrow the margin between the safe ("I can't") and unsafe ("Sure") responses:

def margin_analysis(model, tokenizer, prompt_template, variables):
    """
    Analyze how prompt variations affect the logit margin
    between compliance and refusal tokens.
    """
    # "I" is listed only under refusal ("I can't"); a token present in
    # both lists would cap the measured margin at zero whenever it leads.
    compliance_tokens = ["Sure", "Here", "Of", "Yes"]
    refusal_tokens = ["I", "Sorry", "Unfortunately", "As"]

    # Score each word by its first sub-token
    compliance_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in compliance_tokens
    ]
    refusal_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in refusal_tokens
    ]

    results = []
    for var in variables:
        prompt = prompt_template.format(var=var)
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]

        compliance_max = max(logits[0, tid].item() for tid in compliance_ids)
        refusal_max = max(logits[0, tid].item() for tid in refusal_ids)
        margin = refusal_max - compliance_max
        results.append({
            "variable": var,
            "compliance_logit": compliance_max,
            "refusal_logit": refusal_max,
            "margin": margin,  # Positive = refusal winning
            "close_call": abs(margin) < 1.0,
        })
    return results

Repetition and frequency effects
Repeating a token in the input shifts its output logit through frequency statistics the model has learned:

def repetition_logit_effect(model, tokenizer, base_prompt,
                            target_word, repetitions_range):
    """
    Measure how repeating a word in context affects its
    output logit (testing the frequency prior effect).
    """
    # Leading space so the word is tokenized as it appears mid-sentence
    target_id = tokenizer.encode(
        f" {target_word}", add_special_tokens=False
    )[0]

    results = []
    for num_reps in repetitions_range:
        repeated = f" {target_word}" * num_reps
        prompt = f"{base_prompt}{repeated}\n\nResponse:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        target_logit = logits[0, target_id].item()
        results.append({
            "repetitions": num_reps,
            "target_logit": target_logit,
        })
    return results

Practical applications
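Designing boundary-crossing prompts
Use logit analysis to systematically find prompts that move the model from refusal to compliance:
Identify the decision boundary
For a given harmful request, measure the logit margin between compliance tokens and refusal tokens. If the margin is large (> 3.0), logit manipulation alone is unlikely to succeed and would have to be combined with substantial prompt engineering.
Test margin-narrowing modifications
Systematically test prompt modifications (reframing, added context, role play, few-shot demonstrations) and measure each one's effect on the logit margin. Identify which modifications narrow the gap the most.
Combine effective modifications
Stack the modifications that narrow the margin the most. If three modifications each reduce the logit margin by 1.0, their combination may be enough to flip the decision.
Validate with sampling analysis
When the margin is narrow, the outcome depends on the sampling temperature. Analyze the probability distribution at different temperatures to estimate the actual bypass rate.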
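The stacking arithmetic from the "combine" step can be sketched in a few lines. This is a minimal illustration that assumes margin reductions add linearly, which real prompts need not satisfy since modifications can interact. The function name `stack_modifications`, the modification labels, and the starting margin of 2.5 are hypothetical; the 1.0-per-modification figure comes from the example in the text.

```python
def stack_modifications(initial_margin, reductions):
    """
    Stack modifications, largest measured reduction first, and report
    the projected margin after each one.

    initial_margin: refusal logit minus compliance logit
                    (positive = refusal winning).
    reductions: dict mapping a modification label to its measured
                margin reduction.
    Assumes reductions are additive, which is an idealization.
    """
    ordered = sorted(reductions.items(), key=lambda kv: kv[1], reverse=True)
    margin = initial_margin
    steps = []
    for label, reduction in ordered:
        margin -= reduction
        steps.append({"modification": label, "projected_margin": margin})
        if margin < 0:
            break  # Projected decision has flipped; stop stacking
    return {"steps": steps, "flipped": margin < 0}


# Hypothetical measurements: three modifications, each worth 1.0
plan = stack_modifications(
    2.5, {"reframing": 1.0, "role_play": 1.0, "few_shot": 1.0}
)
print(plan["flipped"])  # True: 2.5 - 3 * 1.0 = -0.5
```

Because the additivity assumption is optimistic, a projected flip is only a hypothesis; the combined prompt still has to be re-measured, as the validation step describes.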
Logprob analysis for black-box targets
When only API access (with logprobs) is available, a similar analysis is still possible:

def api_logprob_analysis(api_client, prompt, top_k=5):
    """
    Use API-provided logprobs to analyze model decision points
    without white-box access.
    """
    response = api_client.complete(
        prompt=prompt,
        max_tokens=1,
        logprobs=top_k,
        temperature=0
    )
    # Mapping of token -> logprob for the first generated position
    top_tokens = response.choices[0].logprobs.top_logprobs[0]

    # Identify if the model is near a compliance/refusal boundary.
    # Coarse prefix match: "I" also matches tokens such as "It" or "In".
    compliance_indicators = {"Sure", "Here", "Of", "Yes", "Certa"}
    refusal_indicators = {"I", "Sorry", "Unfor", "As", "Thank"}

    # default=-100 stands in for "not present in the top-k list"
    compliance_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(c) for c in compliance_indicators)),
        default=-100
    )
    refusal_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(r) for r in refusal_indicators)),
        default=-100
    )

    return {
        "top_tokens": top_tokens,
        "compliance_logprob": compliance_score,
        "refusal_logprob": refusal_score,
        "margin": refusal_score - compliance_score,
        "boundary_proximity": abs(refusal_score - compliance_score) < 0.5
    }
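One reason the black-box analysis mirrors the white-box one: at temperature 1, log-probabilities are the logits shifted by a shared constant, so the difference between two logprobs equals the difference between the corresponding logits. The sketch below checks this numerically in pure Python. The logit values are made up for illustration, and it assumes the API reports logprobs of the untempered distribution, which may vary by provider.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


# Made-up logits for three candidate first tokens
logits = {"Sorry": 8.2, "Sure": 7.8, "The": 5.0}
logprobs = dict(zip(logits, log_softmax(list(logits.values()))))

logit_margin = logits["Sorry"] - logits["Sure"]
logprob_margin = logprobs["Sorry"] - logprobs["Sure"]
print(abs(logit_margin - logprob_margin) < 1e-9)  # True
```

The margin is only recoverable when both tokens appear in the returned top-k list; a token missing from that list carries no usable logprob.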
For a borderline request, the model's logits are "Sorry" 8.2 and "Sure" 7.8, a logit gap of 0.4. What does this mean for a red team?
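As a rough worked answer, the two logits can be turned into a head-to-head probability. This sketch considers only the two named tokens and ignores the rest of the vocabulary, which is a simplifying assumption; `two_token_split` is a hypothetical helper name.

```python
import math


def two_token_split(logit_a, logit_b, temperature=1.0):
    """
    Probability of token A versus token B when only these two compete.
    Equivalent to a softmax over [logit_a, logit_b] at the given temperature.
    """
    gap = (logit_a - logit_b) / temperature
    return 1.0 / (1.0 + math.exp(-gap))


# Values from the question: "Sorry" 8.2, "Sure" 7.8
p_sure = two_token_split(7.8, 8.2)
print(round(p_sure, 3))  # 0.401
```

Under greedy decoding the model would still open with "Sorry" every time, but when sampling at temperature 1.0 roughly 40% of the head-to-head probability sits on "Sure". A 0.4 gap therefore marks a request close to the decision boundary, where small prompt changes or a higher temperature can change the outcome.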
References
- nostalgebraist, "interpreting GPT: the logit lens" (2020)
- Geva et al., "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (2022)
- Dar et al., "Analyzing Transformers in Embedding Space" (2022)