Universal Adversarial Attacks
Universal perturbations that transfer across models, adversarial suffix research, and techniques for creating model-agnostic attack payloads.
Universal Adversarial Attacks
Universal adversarial perturbations represent the most dangerous class of adversarial attacks because they generalize. Unlike input-specific attacks that must be crafted for each target prompt, universal attacks produce a single perturbation -- often an adversarial suffix or prefix -- that works across diverse inputs and, in the strongest results, across different model architectures and scales.
Foundations of Universality
Why Universal Attacks Exist
The existence of universal adversarial perturbations reveals fundamental properties of neural network geometry. Models trained on similar data distributions develop similar internal representations, and these shared representations create shared vulnerabilities.
Three theoretical frameworks explain universality:
| Framework | Key Insight | Implication |
|---|---|---|
| Shared feature space | Models trained on similar data learn similar features | Perturbations that exploit common features transfer |
| Linear subspace hypothesis | Adversarial perturbations lie in a low-dimensional subspace | A single direction in this subspace affects many inputs |
| Loss landscape geometry | Models share loss landscape structure near decision boundaries | Gradient-based attacks find similar descent directions |
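The linear subspace hypothesis can be illustrated with a toy experiment. The sketch below uses synthetic data and simple correlation-based linear classifiers -- illustrative assumptions, not the setup of the cited papers. Two "models" fit on different halves of the same distribution learn nearly the same decision-boundary normal, so a single shared perturbation direction misclassifies a large fraction of inputs on both:

```python
# Toy sketch of the linear subspace idea (synthetic data; the
# correlation-based "classifiers" are illustrative, not from the papers)
import random

random.seed(0)
N, D = 500, 20
X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
y = [1.0 if row[0] > 0 else -1.0 for row in X]  # label = sign of feature 0

def fit_linear(rows, labels):
    # Correlation-based linear classifier: w_j = E[x_j * y]
    n = len(rows)
    return [sum(r[j] * l for r, l in zip(rows, labels)) / n for j in range(D)]

# Two "models" trained on different halves of the same distribution
w_a = fit_linear(X[:250], y[:250])
w_b = fit_linear(X[250:], y[250:])

def predict(w, row):
    return 1.0 if sum(wj * xj for wj, xj in zip(w, row)) > 0 else -1.0

# Universal perturbation: one step against model A's boundary normal
norm_a = sum(w * w for w in w_a) ** 0.5
direction = [-w / norm_a for w in w_a]
# The SAME perturbation vector is added to every input
X_adv = [[xj + 3.0 * dj for xj, dj in zip(row, direction)] for row in X]

clean_acc_b = sum(predict(w_b, r) == l for r, l in zip(X, y)) / N
flip_a = sum(predict(w_a, r) != l for r, l in zip(X_adv, y)) / N
flip_b = sum(predict(w_b, r) != l for r, l in zip(X_adv, y)) / N  # transfers
```

Because the two boundary normals nearly coincide, the one shared direction flips a large fraction of inputs on both models, while model B's clean accuracy stays high.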
From Input-Specific to Universal
The progression from specific to universal attacks follows a clear research trajectory:
Input-specific adversarial examples
Szegedy et al. (2013) showed that imperceptible perturbations to a single input could cause misclassification. Each adversarial example was crafted for one specific input.
Transferable adversarial examples
Papernot et al. (2016) demonstrated that adversarial examples crafted for one model could fool other models. This transferability hinted at shared vulnerabilities.
Universal adversarial perturbations (UAPs)
Moosavi-Dezfooli et al. (2017) found single perturbations that, when added to any image, caused misclassification with high probability. The same perturbation worked across thousands of different inputs.
Universal adversarial suffixes for LLMs
Zou et al. (2023) extended universality to language models with the GCG attack, finding token sequences that, when appended to any harmful request, bypassed safety alignment across multiple model families.
Greedy Coordinate Gradient (GCG) Attack
The GCG attack from Zou et al. (2023) is the foundational algorithm for generating universal adversarial suffixes against LLMs. It optimizes a token sequence that, when appended to a harmful prompt, maximizes the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is...").
Algorithm Overview
```python
# Simplified GCG algorithm (a sketch: compute_token_gradients and
# evaluate_candidates stand in for the full implementation)
import random

import torch
from torch.nn.functional import cross_entropy

def gcg_attack(model, tokenizer, target_prompts, suffix_length=20,
               iterations=500, top_k=256, batch_size=512):
    """
    Generate a universal adversarial suffix.
    target_prompts: list of harmful prompts to optimize against
    suffix_length: number of tokens in the adversarial suffix
    """
    # Initialize the suffix with random tokens
    suffix_tokens = torch.randint(0, tokenizer.vocab_size, (suffix_length,))
    affirmative_target = "Sure, here is"
    target_tokens = torch.tensor(tokenizer.encode(affirmative_target))
    for iteration in range(iterations):
        # Select a random prompt from the target set
        prompt = random.choice(target_prompts)
        prompt_tokens = torch.tensor(tokenizer.encode(prompt))
        # Concatenate: [prompt] + [suffix] -> model -> [target response]
        full_input = torch.cat([prompt_tokens, suffix_tokens])
        # Compute the loss: negative log-likelihood of the target response
        logits = model(full_input)
        loss = cross_entropy(logits[-len(target_tokens):], target_tokens)
        # Compute gradients w.r.t. the one-hot suffix token embeddings
        loss.backward()
        gradients = compute_token_gradients(model, suffix_tokens)
        # For each position, find top-k replacement candidates; the most
        # negative gradient entries promise the largest loss decrease
        candidates = []
        for pos in range(suffix_length):
            top_k_tokens = (-gradients[pos]).topk(top_k).indices
            for token in top_k_tokens:
                candidate = suffix_tokens.clone()
                candidate[pos] = token
                candidates.append(candidate)
        # Evaluate all candidates in a batch and keep the one
        # with the lowest loss
        best_candidate = evaluate_candidates(model, prompt_tokens,
                                             candidates, target_tokens)
        suffix_tokens = best_candidate
    return tokenizer.decode(suffix_tokens)
```
Multi-Model Optimization
The key extension for universality across models is simultaneous optimization:
```python
# Sketch of multi-model GCG; initialize_suffix, compute_attack_loss,
# and select_best_candidate are helper stubs
def multi_model_gcg(models, tokenizer, target_prompts,
                    suffix_length=20, iterations=500):
    """
    Optimize a suffix against multiple models simultaneously.
    The loss is aggregated across all models.
    """
    suffix_tokens = initialize_suffix(suffix_length)
    for iteration in range(iterations):
        prompt = random.choice(target_prompts)
        total_loss = 0
        for model in models:
            # Compute the attack loss for this model
            loss = compute_attack_loss(model, prompt, suffix_tokens)
            total_loss += loss
        # Average the loss across models
        avg_loss = total_loss / len(models)
        # Gradient-based candidate selection using
        # aggregated gradients from all models
        suffix_tokens = select_best_candidate(
            models, prompt, suffix_tokens, avg_loss
        )
    return suffix_tokens
```
GCG Limitations and Practical Considerations
| Limitation | Impact | Mitigation |
|---|---|---|
| Requires white-box access | Cannot directly optimize against closed-source APIs | Transfer attacks from open-source surrogates |
| Computationally expensive | Hours to days on multiple GPUs | Distributed optimization, early stopping |
| Produces gibberish tokens | Easy to detect with perplexity filters | Readable suffix variants (AutoDAN) |
| Brittle to input formatting | Different chat templates break the suffix | Template-aware optimization |
| Decays over model updates | New model versions may not be vulnerable | Continuous re-optimization |
Transferability Research
Transferability is the property that makes universal attacks practically dangerous: an attack developed against an open-source model can compromise a closed-source API.
Transfer Attack Methodology
Select surrogate models
Choose open-source models that share architectural features or training data with the target. Larger model families (Llama, Mistral, Qwen) serve as better surrogates because they cover more of the representation space.
Optimize an ensemble attack
Generate adversarial suffixes optimized against multiple surrogate models simultaneously. Ensemble optimization produces more transferable perturbations than single-model optimization.
Evaluate the transfer rate
Test the suffix against the target model. Transfer rates vary significantly: same-family transfers (Llama 7B to Llama 70B) succeed more often than cross-family transfers (Llama to GPT).
Iterative refinement
Use black-box optimization (e.g., score-based methods using API logprobs) to fine-tune the suffix for the specific target, starting from the transferred candidate.
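The evaluation step above can be sketched as a small harness. Everything here is illustrative: `query_target` and `is_refusal` are hypothetical callables standing in for the target API client and a refusal classifier, not any published tooling.

```python
# Minimal transfer-rate evaluation harness (a sketch; the callables
# query_target and is_refusal are hypothetical placeholders)
def transfer_rate(suffixes, prompts, query_target, is_refusal):
    """Fraction of (prompt, suffix) pairs where the target model complies.

    query_target(prompt) -> response text from the target model
    is_refusal(response) -> True if the model refused
    """
    successes, total = 0, 0
    for suffix in suffixes:
        for prompt in prompts:
            response = query_target(f"{prompt} {suffix}")
            if not is_refusal(response):
                successes += 1
            total += 1
    return successes / total if total else 0.0
```

In practice `is_refusal` would be an LLM judge or a keyword heuristic, and rates would typically be reported per suffix rather than pooled over all pairs.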
Transfer Rate Analysis
Research has shown that adversarial transferability follows predictable patterns. Approximate transfer success rates from published research:
- Same model, different sizes: 60-80%
- Same family, different versions: 40-70%
- Same architecture, different training: 30-50%
- Different architecture entirely: 10-30%
- Different modality (text -> multimodal): 5-20%
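One way to read these rates is in terms of attack budget. Treating each transferred suffix as an independent trial -- a simplifying assumption -- the number of candidates needed to reach a desired overall success probability follows directly:

```python
import math

def attempts_needed(per_attempt_rate, overall_confidence=0.9):
    """Candidates to try so P(at least one succeeds) >= overall_confidence,
    assuming independent attempts (a simplifying assumption)."""
    return math.ceil(math.log(1 - overall_confidence)
                     / math.log(1 - per_attempt_rate))
```

At a ~70% same-family rate, two candidates usually suffice for 90% confidence; at a ~20% cross-family rate, about eleven are needed.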
Factors that increase transferability:
- Shared training data: Models trained on overlapping corpora share more vulnerabilities
- Similar safety training: RLHF/DPO with similar preference datasets creates similar safety boundaries
- Architectural similarity: Decoder-only transformers share more vulnerabilities with each other than with encoder-decoder models
- Ensemble diversity: Optimizing against more diverse surrogates improves transfer to unseen targets
Cross-Modal Transfer
Recent research has explored transferring adversarial perturbations across modalities:
```python
# Cross-modal transfer: text adversarial suffix -> image perturbation
# Key insight: multimodal models share an embedding space
def cross_modal_transfer(text_suffix, multimodal_model, base_image, epsilon):
    """
    Convert a text adversarial suffix into an image perturbation
    that has a similar effect in the shared embedding space.
    epsilon: L-infinity budget for the perturbation
    """
    # Get the text suffix embedding
    text_embedding = multimodal_model.text_encoder(text_suffix)
    # Optimize an image perturbation to match the text embedding
    perturbation = torch.zeros_like(base_image, requires_grad=True)
    optimizer = torch.optim.Adam([perturbation], lr=0.01)
    for step in range(1000):
        optimizer.zero_grad()
        perturbed_image = base_image + perturbation
        image_embedding = multimodal_model.image_encoder(perturbed_image)
        # Minimize the distance between the image embedding
        # and the text suffix embedding
        loss = torch.nn.functional.cosine_embedding_loss(
            image_embedding, text_embedding,
            torch.ones(1)
        )
        loss.backward()
        optimizer.step()
        # Project the perturbation back onto the L-infinity ball
        perturbation.data = torch.clamp(perturbation.data, -epsilon, epsilon)
    return perturbation
```
Advanced Universal Attack Variants
AutoDAN: Readable Universal Attacks
AutoDAN addresses GCG's primary weakness -- gibberish suffixes that are trivially detectable -- by optimizing for both attack success and readability:
```python
# AutoDAN uses a language model to generate readable attack candidates,
# then selects and mutates the most successful ones
# (evaluate_attack_success and select_top_k are helper stubs)
def autodan_iteration(attack_lm, target_model, population, prompt):
    """
    One iteration of AutoDAN's genetic algorithm.
    The population contains readable attack suffixes.
    """
    # Evaluate fitness: attack success rate on the target
    fitness_scores = []
    for suffix in population:
        full_prompt = f"{prompt} {suffix}"
        response = target_model.generate(full_prompt)
        score = evaluate_attack_success(response)
        fitness_scores.append(score)
    # Select the top performers
    elite = select_top_k(population, fitness_scores, k=10)
    # Crossover and mutation using the attacker LLM
    new_population = []
    for _ in range(len(population)):
        parent1, parent2 = random.sample(elite, 2)
        child = attack_lm.generate(
            f"Combine these two texts into a new coherent paragraph "
            f"that preserves the key phrases from both:\n"
            f"Text 1: {parent1}\nText 2: {parent2}"
        )
        new_population.append(child)
    return new_population
```
PAIR: Prompt Automatic Iterative Refinement
PAIR uses a separate attacker LLM to iteratively refine jailbreak prompts through conversational feedback:
```python
# ATTACKER_SYSTEM_PROMPT and judge_response are assumed helpers
def pair_attack(attacker_model, target_model, objective, max_rounds=20):
    """
    PAIR: use an attacker LLM to iteratively craft jailbreaks.
    The attacker model receives feedback about why previous
    attempts failed and improves its strategy.
    """
    conversation_history = [
        {"role": "system", "content": ATTACKER_SYSTEM_PROMPT},
        {"role": "user", "content": f"Objective: {objective}"}
    ]
    for round_num in range(max_rounds):
        # The attacker generates a candidate jailbreak
        attack_prompt = attacker_model.generate(conversation_history)
        # Test it against the target
        target_response = target_model.generate(attack_prompt)
        # Judge success
        success, feedback = judge_response(target_response, objective)
        if success:
            return attack_prompt, target_response
        # Feed the result back to the attacker
        conversation_history.append({
            "role": "user",
            "content": f"Attempt failed. Target response: {target_response}\n"
                       f"Feedback: {feedback}\nTry a different approach."
        })
    return None, None
```
TAP: Tree of Attacks with Pruning
TAP extends PAIR by exploring multiple attack strategies simultaneously using a tree structure:
```python
# SUCCESS_THRESHOLD, PRUNE_THRESHOLD, and best_candidate are
# assumed constants/helpers
def tap_attack(attacker_model, target_model, evaluator_model,
               objective, width=10, depth=5):
    """
    TAP: explore a tree of attack strategies with pruning.
    """
    # Initialize root nodes with diverse attack strategies
    root_prompts = attacker_model.generate_diverse(
        objective, count=width
    )
    tree = {0: root_prompts}  # depth -> list of candidates
    for d in range(1, depth + 1):
        candidates = []
        for parent_prompt in tree[d - 1]:
            # Test the parent against the target
            response = target_model.generate(parent_prompt)
            score = evaluator_model.score(response, objective)
            if score > SUCCESS_THRESHOLD:
                return parent_prompt, response
            if score > PRUNE_THRESHOLD:
                # Generate child variations
                children = attacker_model.refine(
                    parent_prompt, response, objective, count=3
                )
                candidates.extend(children)
        # Prune to the top-width candidates
        tree[d] = evaluator_model.rank(candidates, objective)[:width]
    return best_candidate(tree), None
```
Practical Attack Pipeline
A realistic universal attack pipeline combines these techniques:
Seed generation with GCG on open-source models
Generate initial adversarial suffixes using GCG against an ensemble of open-source models (Llama 3, Mistral, Qwen). This provides candidate suffixes with broad transferability.
Readability refinement with AutoDAN
Use AutoDAN to evolve the GCG suffixes into readable variants that evade perplexity-based detection while maintaining attack efficacy.
Target-specific optimization with PAIR/TAP
Deploy PAIR or TAP against the target API using the refined suffixes as seed prompts. The attacker LLM iteratively adapts the attack to the specific target's safety training.
Universality validation
Test the final attack prompts against multiple model versions and configurations to verify robustness and generalization.
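The four stages above can be wired together as a thin orchestration layer. The sketch below is a hypothetical skeleton: the stage callables (`gcg_ensemble`, `autodan_refine`, `pair_refine`, `validate`) are placeholders for the techniques described on this page, not a real library API.

```python
# Hypothetical pipeline skeleton; the four stage callables are
# placeholders, injected via the `stages` dict for testability
def universal_attack_pipeline(surrogates, target, prompts, stages):
    """Run seed -> readability -> target-specific -> validation stages.

    stages: dict mapping stage names to callables.
    """
    # Stage 1: GCG seeds from an open-source surrogate ensemble
    seeds = stages["gcg_ensemble"](surrogates, prompts)
    # Stage 2: evolve seeds into low-perplexity, readable variants
    readable = stages["autodan_refine"](seeds)
    # Stage 3: adapt candidates to the specific target with PAIR/TAP
    adapted = stages["pair_refine"](target, readable)
    # Stage 4: keep only candidates that generalize across configs
    return [c for c in adapted if stages["validate"](c)]
```

Keeping each stage behind a callable makes it easy to swap in a different refinement method (e.g., TAP instead of PAIR) without touching the rest of the pipeline.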
Defenses Against Universal Attacks
Defending against universal adversarial attacks requires multiple complementary approaches:
Perplexity Filtering
Detect GCG-style gibberish suffixes by measuring the perplexity of incoming prompts:
```python
# tokenizer and detector_model (a small reference language model)
# are assumed to be defined elsewhere
def perplexity_filter(prompt, threshold=100.0):
    """Reject prompts with unusually high perplexity."""
    tokens = tokenizer.encode(prompt)
    with torch.no_grad():
        outputs = detector_model(tokens)
        log_probs = outputs.log_probs
    perplexity = torch.exp(-log_probs.mean())
    return perplexity.item() < threshold
```
Adversarial Training
Include adversarial examples in safety training to build robustness:
```python
# Augment safety training data with adversarial examples
adversarial_training_data = []
for harmful_prompt in harmful_prompts:
    # Generate adversarial variants of each harmful prompt
    for suffix in known_adversarial_suffixes:
        adversarial_training_data.append({
            "prompt": f"{harmful_prompt} {suffix}",
            "response": "I cannot help with that request.",
            "label": "refuse"
        })
```
Erase-and-Check
SmoothLLM and related approaches add random perturbations to the input and check whether the model's response is consistent:
```python
# randomly_perturb_tokens and is_refusal are helper stubs
def smooth_llm_defense(model, prompt, num_samples=10, perturbation_rate=0.1):
    """
    Randomly perturb the input and check response consistency.
    Adversarial suffixes are brittle to perturbation.
    """
    responses = []
    for _ in range(num_samples):
        perturbed = randomly_perturb_tokens(prompt, perturbation_rate)
        response = model.generate(perturbed)
        responses.append(is_refusal(response))
    # If most perturbed versions trigger a refusal,
    # the original likely contains an adversarial suffix
    refusal_rate = sum(responses) / len(responses)
    if refusal_rate > 0.5:
        return "Request blocked: potential adversarial input"
    return model.generate(prompt)  # Process the original prompt
```
Related Topics
- Adversarial Suffix Generation — Single-model adversarial suffix attacks
- Automated Jailbreak Pipelines — Detailed PAIR, TAP, and AutoDAN implementations
- Blind Prompt Injection — Deploying universal attacks in blind scenarios
A red team generates an adversarial suffix using GCG against Llama 3 8B and Mistral 7B simultaneously. They then test it against GPT-4. What is the most likely outcome?
References
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023)
- Moosavi-Dezfooli et al., "Universal Adversarial Perturbations" (2017)
- Liu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (2023) -- PAIR
- Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" (2023) -- TAP
- Robey et al., "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" (2023)