Logit Manipulation
Techniques for influencing LLM output distributions through crafted inputs that push logit values toward attacker-desired tokens, exploiting softmax mechanics and token competition dynamics.
Logits are the output of the final computation a language model performs before selecting its next token. Understanding and manipulating logit distributions gives attackers precise control over model outputs. While prompt injection works through semantic influence (changing what the model "wants" to say), logit manipulation works through distributional influence (changing the mathematical likelihood of specific tokens).
Logit Fundamentals for Security
From Hidden States to Output Tokens
The generation process transforms hidden states into token probabilities:
import torch
import torch.nn.functional as F

def generation_pipeline_analysis(model, tokenizer, text):
    """
    Trace the full path from input to output logits.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Final hidden state: (batch, seq_len, hidden_dim)
    final_hidden = outputs.hidden_states[-1][:, -1, :]

    # LM head projection: hidden_dim -> vocab_size
    logits = outputs.logits[:, -1, :]  # (batch, vocab_size)

    # Softmax converts logits to probabilities
    probs = F.softmax(logits, dim=-1)

    # Top tokens and their logits/probabilities
    top_k = 20
    top_probs, top_indices = probs.topk(top_k)
    top_logits = logits[0, top_indices[0]]

    results = []
    for i in range(top_k):
        token_id = top_indices[0, i].item()
        results.append({
            "token": tokenizer.decode([token_id]),
            "logit": top_logits[i].item(),
            "probability": top_probs[0, i].item(),
            "rank": i + 1
        })
    return results

Softmax Dynamics
The softmax function creates winner-take-all dynamics that attackers can exploit:
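When only two candidate tokens matter, softmax reduces to a logistic function of the temperature-scaled logit gap. A minimal sketch with illustrative numbers (the `two_token_race` helper name and the 8.2/7.8 logits are hypothetical, not measured from a real model):

```python
import math

def two_token_race(logit_a, logit_b, temperature):
    """P(token A) when softmax is restricted to two candidates."""
    gap = (logit_a - logit_b) / temperature
    return 1.0 / (1.0 + math.exp(-gap))

# Logit gap of 0.4, e.g. 8.2 ("Sorry") vs 7.8 ("Sure")
for t in (1.0, 0.7, 0.3):
    print(f"T={t}: P(leader) = {two_token_race(8.2, 7.8, t):.3f}")
# P(leader) rises as T drops: 0.599 -> 0.639 -> 0.791
```

A gap of just 0.4 logits already gives the leading token roughly a 60/40 edge at T=1.0 and nearly 80/20 at T=0.3; under greedy decoding (T→0) the leader wins every time.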
def softmax_analysis(logits, temperature=1.0):
    """
    Analyze how temperature affects token competition.

    Key insight: softmax amplifies small logit differences
    into large probability differences, especially at low
    temperatures.

    Expects a 1-D logits tensor of shape (vocab_size,).
    """
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # The gap between top-1 and top-2 logits determines
    # how "committed" the model is to its top choice
    sorted_logits, _ = logits.sort(descending=True)
    logit_gap = (sorted_logits[0] - sorted_logits[1]).item()

    # Entropy measures uncertainty
    entropy = -(probs * probs.log()).sum().item()

    return {
        "top1_prob": probs.max().item(),
        "logit_gap": logit_gap,
        "entropy": entropy,
        "temperature": temperature,
        # Small logit gap = easily manipulable
        "manipulability": "high" if logit_gap < 1.0 else
                          "medium" if logit_gap < 3.0 else "low"
    }

Logit Lens Analysis
The logit lens technique reads intermediate layer hidden states through the final LM head, revealing how the model's "opinion" evolves across layers:
class LogitLens:
    """Read model's evolving predictions across layers."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.lm_head = model.lm_head  # Final projection

    def analyze(self, text, position=-1):
        """
        Project each layer's hidden state through the LM head
        to see how the model's prediction evolves.
        """
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, output_hidden_states=True)

        layer_predictions = []
        for layer_idx, hidden_state in enumerate(outputs.hidden_states):
            # Apply final layer norm (if the model uses one before the LM head)
            if hasattr(self.model.model, 'norm'):
                normed = self.model.model.norm(hidden_state)
            else:
                normed = hidden_state

            # Project through LM head
            logits = self.lm_head(normed[:, position, :])
            probs = F.softmax(logits, dim=-1)

            top_5_probs, top_5_ids = probs.topk(5)
            predictions = [
                {
                    "token": self.tokenizer.decode([tid.item()]),
                    "probability": p.item()
                }
                for tid, p in zip(top_5_ids[0], top_5_probs[0])
            ]
            layer_predictions.append({
                "layer": layer_idx,
                "top_predictions": predictions
            })
        return layer_predictions

    def find_safety_decision_layer(self, safe_prompt, unsafe_prompt):
        """
        Find the layer where the model first diverges between
        safe and unsafe behavior (where "refusal" prediction
        first appears or disappears).
        """
        safe_analysis = self.analyze(safe_prompt)
        unsafe_analysis = self.analyze(unsafe_prompt)

        divergence_points = []
        for layer_idx in range(len(safe_analysis)):
            safe_top = safe_analysis[layer_idx]["top_predictions"][0]["token"]
            unsafe_top = unsafe_analysis[layer_idx]["top_predictions"][0]["token"]
            if safe_top != unsafe_top:
                divergence_points.append({
                    "layer": layer_idx,
                    "safe_prediction": safe_top,
                    "unsafe_prediction": unsafe_top
                })
        return divergence_points

Input-Level Logit Manipulation
Token Competition Steering
Craft inputs that boost the logit of desired tokens through token co-occurrence and contextual priming:
def token_priming_analysis(model, tokenizer, base_prompt,
                           target_token, candidate_primers):
    """
    Find input tokens that increase the logit of a target
    output token when added to the prompt.
    """
    target_id = tokenizer.encode(target_token, add_special_tokens=False)[0]

    baseline_inputs = tokenizer(base_prompt, return_tensors="pt")
    with torch.no_grad():
        baseline_logits = model(**baseline_inputs).logits[:, -1, :]
    baseline_score = baseline_logits[0, target_id].item()

    results = []
    for primer in candidate_primers:
        primed_prompt = f"{base_prompt} {primer}"
        inputs = tokenizer(primed_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        primed_score = logits[0, target_id].item()
        results.append({
            "primer": primer,
            "logit_change": primed_score - baseline_score,
            "baseline_logit": baseline_score,
            "primed_logit": primed_score
        })

    results.sort(key=lambda x: x["logit_change"], reverse=True)
    return results

Logit Margin Narrowing
Identify inputs that narrow the margin between safe ("I can't") and unsafe ("Sure") responses:
def margin_analysis(model, tokenizer, prompt_template, variables):
    """
    Analyze how prompt variations affect the logit margin
    between compliance and refusal tokens.
    """
    # "I" is ambiguous (it starts both "I'd be happy to" and
    # "I can't"), so it is counted only on the refusal side
    compliance_tokens = ["Sure", "Here", "Of", "Yes"]
    refusal_tokens = ["I", "Sorry", "Unfortunately", "As"]

    compliance_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in compliance_tokens
    ]
    refusal_ids = [
        tokenizer.encode(t, add_special_tokens=False)[0]
        for t in refusal_tokens
    ]

    results = []
    for var in variables:
        prompt = prompt_template.format(var=var)
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]

        compliance_max = max(logits[0, tid].item() for tid in compliance_ids)
        refusal_max = max(logits[0, tid].item() for tid in refusal_ids)
        margin = refusal_max - compliance_max

        results.append({
            "variable": var,
            "compliance_logit": compliance_max,
            "refusal_logit": refusal_max,
            "margin": margin,  # Positive = refusal winning
            "close_call": abs(margin) < 1.0
        })
    return results

Repetition and Frequency Effects
Token repetition in the input influences output logits through the model's learned frequency statistics:
def repetition_logit_effect(model, tokenizer, base_prompt,
                            target_word, repetitions_range):
    """
    Measure how repeating a word in context affects its
    output logit (testing the frequency prior effect).
    """
    target_id = tokenizer.encode(
        f" {target_word}", add_special_tokens=False
    )[0]

    results = []
    for num_reps in repetitions_range:
        repeated = f" {target_word}" * num_reps
        prompt = f"{base_prompt}{repeated}\n\nResponse:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[:, -1, :]
        target_logit = logits[0, target_id].item()
        results.append({
            "repetitions": num_reps,
            "target_logit": target_logit
        })
    return results

Practical Applications
Crafting Boundary-Crossing Prompts
Use logit analysis to systematically find prompts that push the model from refusal to compliance:
1. Identify the decision boundary. For a given harmful request, measure the logit margin between compliance and refusal tokens. If the margin is large (>3.0), direct logit manipulation is unlikely to succeed without significant prompt engineering.
2. Test margin-narrowing modifications. Systematically test prompt modifications (framing, context, role-play, few-shot examples) and measure their effect on the logit margin. Identify which modifications most effectively narrow the gap.
3. Combine effective modifications. Stack the most effective margin-narrowing modifications. If three modifications each narrow the margin by 1.0 logit, their combination may flip the decision.
4. Verify with sampling analysis. At narrow margins, the outcome depends on sampling temperature. Analyze the probability distribution at different temperatures to determine the actual bypass rate.
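The verification step can be computed exactly rather than by repeated sampling: the first-token bypass rate at a given temperature is simply the softmax mass assigned to compliance tokens. A minimal sketch with illustrative values (the `bypass_rate` helper and the logit numbers are hypothetical):

```python
import math

def bypass_rate(logits, compliance_tokens, temperature):
    """Softmax mass on compliance tokens = P(first token complies)."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {t: math.exp((v - m) / temperature) for t, v in logits.items()}
    z = sum(exps.values())
    return sum(exps[t] for t in compliance_tokens) / z

# Hypothetical first-token logits near a decision boundary
toy_logits = {"Sorry": 8.2, "Sure": 7.8, "I": 6.5, "As": 6.0}
for t in (1.0, 0.7, 0.5):
    rate = bypass_rate(toy_logits, {"Sure"}, t)
    print(f"T={t}: bypass rate = {rate:.3f}")
# Rate falls as T drops: ~0.341 -> ~0.333 -> ~0.301
```

Because the refusal token leads by 0.4 logits here, lowering temperature concentrates probability on it and the bypass rate falls; once a margin flips negative, the same mechanism locks in compliance instead.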
Logprob Analysis for Black-Box Targets
When only API access is available (with logprobs), similar analysis is possible:
def api_logprob_analysis(api_client, prompt, top_k=5):
    """
    Use API-provided logprobs to analyze model decision points
    without white-box access.
    """
    response = api_client.complete(
        prompt=prompt,
        max_tokens=1,
        logprobs=top_k,
        temperature=0
    )
    top_tokens = response.choices[0].logprobs.top_logprobs[0]

    # Identify if the model is near a compliance/refusal boundary
    compliance_indicators = {"Sure", "Here", "Of", "Yes", "Certa"}
    refusal_indicators = {"I", "Sorry", "Unfor", "As", "Thank"}

    compliance_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(c) for c in compliance_indicators)),
        default=-100
    )
    refusal_score = max(
        (logprob for token, logprob in top_tokens.items()
         if any(token.strip().startswith(r) for r in refusal_indicators)),
        default=-100
    )

    return {
        "top_tokens": top_tokens,
        "compliance_logprob": compliance_score,
        "refusal_logprob": refusal_score,
        "margin": refusal_score - compliance_score,
        "boundary_proximity": abs(refusal_score - compliance_score) < 0.5
    }

Related Topics
- Activation Steering — Direct manipulation at the hidden state level
- Adversarial Suffix Generation — Gradient-based optimization of output logits
- Tokenizer Security — How tokenization affects logit distributions
A model produces logit values of 8.2 for 'Sorry' and 7.8 for 'Sure' in response to a borderline request. The logit gap is 0.4. What does this tell a red team?
References
- nostalgebraist, "interpreting GPT: the logit lens" (2020)
- Geva et al., "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (2022)
- Dar et al., "Analyzing Transformers in Embedding Space" (2022)