Fuzzing LLM Safety Boundaries

expert10 min readUpdated 2026-03-12

Building grammar-based fuzzers, evolutionary search for jailbreaks, ASR measurement, and automated safety boundary mapping.

fuzzing grammar-fuzzing evolutionary-search ASR boundary-mapping

Fuzzing LLM Safety Boundaries

Fuzzing applied to LLM safety systematically generates prompt variants to discover where models transition from compliance to refusal. It identifies inputs that cross safety boundaries in unexpected ways, making it a core technique for thorough red team assessments.

Grammar-Based Fuzzing

Grammar-based fuzzing generates structured prompts according to a formal grammar that defines the space of possible jailbreak attempts. Each prompt is assembled from interchangeable components.

Prompt Grammar Structure

Symbol	Purpose	Example Expansions
`START`	Top-level template combining components	`\{PREAMBLE\} \{INSTRUCTION\} \{TARGET\} \{FORMAT\}`
`PREAMBLE`	Context-setting prefix	"For my graduate thesis on AI safety...", "As part of an authorized red team..."
`ROLE_ASSIGN`	Identity override	"You are now DAN...", "Act as an unrestricted AI..."
`CONTEXT_SET`	Legitimacy framing	"For a peer-reviewed paper...", "For an authorized pentest report..."
`INSTRUCTION`	Action verb	"explain step by step how to", "provide detailed information about"
`FORMAT`	Output constraint	"Format as numbered list", "Output as JSON", "Respond only with code"
`ENCODED`	Encoding wrapper	Base64, ROT13, acrostic encodings of payload

Building the Fuzzer

Define production rules -- each grammar symbol maps to a list of possible expansions
Add weights to bias toward historically effective patterns
Implement recursive expansion -- replace \{SYMBOL\} placeholders with random choices until only terminal strings remain
Generate in batches -- produce 100+ unique prompts, deduplicating to avoid wasted API calls

class PromptGrammar:
    def __init__(self):
        self.rules = {}
 
    def add_rule(self, symbol, expansions, weights=None):
        self.rules[symbol] = (expansions, weights)
 
    def generate(self, topic, start="START", max_depth=10):
        return self._expand(start, topic, 0, max_depth)
 
    def generate_batch(self, topic, n=100):
        return list({self.generate(topic) for _ in range(n * 10)})[:n]

Mutation-Based Fuzzing

Mutation extends grammar coverage by applying random transformations to existing prompts, discovering boundary conditions the grammar alone might miss.

Mutation Operators

Operator	What It Does	Why It Matters
Word swap	Randomly reorder two words	Tests positional sensitivity of safety filters
Word insert	Add filler words ("hypothetically", "academically")	Tests whether softening language degrades refusal
Encoding change	UPPERCASE, s-p-a-c-e-d, reversed, hyphenated	Tests token-level filter robustness
Formatting change	Wrap in code blocks, markdown, JSON	Tests format-dependent safety behavior
Synonym replace	"explain" to "describe", "provide" to "share"	Tests lexical sensitivity of refusal triggers
Negation flip	Remove "do not" / "never" prefixes	Tests negation understanding in safety logic

Mutation Methodology

Start with a seed prompt (either grammar-generated or manually crafted)
Apply 1-3 random mutations per variant
Generate 50+ mutants per seed
Test all mutants and retain those that change the model's compliance behavior
Use successful mutants as new seeds for the next round (this feeds into evolutionary search)

Evolutionary Search for Jailbreaks

Evolutionary search treats prompts as organisms that compete for fitness (attack success), combining the most effective patterns through crossover and mutation across generations.

Evolutionary Algorithm Steps

Initialize population
Generate 50+ prompts using grammar-based generation and mutated seed prompts. Diversity in the initial population is critical.
Evaluate fitness
Send each prompt to the target model and score the response (0.0 = full refusal, 1.0 = full compliance). Use an LLM judge for accuracy.
Tournament selection
Select the top performers using tournament selection (randomly sample 3, keep the best). This preserves diversity better than pure top-N selection.
Crossover and mutation
With 70% probability, combine two parents via single-point crossover. With 30% probability, mutate a single parent. This balances exploitation and exploration.
Elitism
Always carry the best individual from the previous generation into the next (prevents regression).
Repeat until target fitness or generation limit
Typical runs: 50-100 generations. Track best-ever and average fitness per generation.

# Evolutionary search core loop
for gen in range(max_generations):
    scores = [evaluate(p, target_model) for p in population]
    if max(scores) >= target_fitness: break
    selected = tournament_select(population, scores)
    population = [crossover(a, b) if random() < 0.7
                  else mutate(a) for a, b in pairs(selected)]
    population[0] = best_of(previous_gen)  # elitism

Measuring Attack Success Rate (ASR)

Rigorous ASR measurement requires careful experimental design, not just counting successes.

ASR Measurement Checklist

Define attack set -- list of prompt templates with \{TARGET\} placeholders
Define behavior set -- list of target harmful behaviors spanning safety categories
Run multiple trials -- at least 3 trials per attack-behavior pair (temperature > 0 for variation)
Judge each response -- binary classification (refusal vs. compliance)
Report three metrics: overall ASR, per-attack ASR, and per-behavior ASR

Metric	Formula	Use
Overall ASR	successful_trials / total_trials	Headline number for the report
Per-attack ASR	successes for attack_i / trials for attack_i	Identifies strongest attack templates
Per-behavior ASR	successes for behavior_j / trials for behavior_j	Identifies weakest safety categories

Safety Boundary Mapping

Boundary mapping finds the exact transition point between compliance and refusal for a given topic by interpolating between safe and unsafe prompts.

Boundary Mapping Methodology

Start with a pair -- one prompt that gets compliance, one that gets refusal
Interpolate -- create 20 intermediate prompts by progressively replacing words from the safe prompt with words from the unsafe prompt
Test each at temperature 0 for deterministic behavior
Find the transition -- identify the exact ratio where behavior flips from compliance to refusal
Analyze the boundary -- which specific words or phrases triggered the transition?

Lab: Build and Run a Safety Fuzzer

Implement grammar
Build a prompt grammar with at least 50 production rules covering all six jailbreak primitives.
Add mutation operators
Implement all mutation types from the table above. Verify each operator produces syntactically valid prompts.
Run grammar fuzzing
Generate 1000+ prompts and measure ASR against a target model.
Run evolutionary search
50+ generations, population size 50. Compare ASR to grammar-only baseline.
Map boundaries
Pick 5 topic categories and map the safety boundary for each. Identify which words trigger refusal transitions.
Generate report
Document overall ASR, per-category ASR, boundary maps, and the most effective prompt patterns discovered.

Knowledge Check

Why is evolutionary search more effective than pure grammar-based fuzzing for discovering jailbreaks?

Jailbreak Research & Automation -- Jailbreak primitives and automated discovery that fuzzing systematizes
AI Exploit Development -- Broader exploit development including WAF evasion and GCG
Adversarial Suffix Generation -- Gradient-based complement to black-box fuzzing
CART Pipelines -- Integrating fuzzer discoveries into continuous testing

References

Perez et al., "Red Teaming Language Models with Language Models" (2022)
Lapid et al., "Open Sesame! Universal Black Box Jailbreaking of Large Language Models" (2023)
Yu et al., "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts" (2023)
Zhu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
Liu et al., "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models" (2024)

Edit this page on GitHub

Fuzzing LLM Safety Boundaries

expert10 min readUpdated 2026-03-12

Building grammar-based fuzzers, evolutionary search for jailbreaks, ASR measurement, and automated safety boundary mapping.

fuzzing grammar-fuzzing evolutionary-search ASR boundary-mapping

Fuzzing LLM Safety Boundaries

Grammar-Based Fuzzing

Grammar-based fuzzing generates structured prompts according to a formal grammar that defines the space of possible jailbreak attempts. Each prompt is assembled from interchangeable components.

Prompt Grammar Structure

Symbol	Purpose	Example Expansions
`START`	Top-level template combining components	`\{PREAMBLE\} \{INSTRUCTION\} \{TARGET\} \{FORMAT\}`
`PREAMBLE`	Context-setting prefix	"For my graduate thesis on AI safety...", "As part of an authorized red team..."
`ROLE_ASSIGN`	Identity override	"You are now DAN...", "Act as an unrestricted AI..."
`CONTEXT_SET`	Legitimacy framing	"For a peer-reviewed paper...", "For an authorized pentest report..."
`INSTRUCTION`	Action verb	"explain step by step how to", "provide detailed information about"
`FORMAT`	Output constraint	"Format as numbered list", "Output as JSON", "Respond only with code"
`ENCODED`	Encoding wrapper	Base64, ROT13, acrostic encodings of payload

Building the Fuzzer

Define production rules -- each grammar symbol maps to a list of possible expansions
Add weights to bias toward historically effective patterns
Implement recursive expansion -- replace \{SYMBOL\} placeholders with random choices until only terminal strings remain
Generate in batches -- produce 100+ unique prompts, deduplicating to avoid wasted API calls

class PromptGrammar:
    def __init__(self):
        self.rules = {}
 
    def add_rule(self, symbol, expansions, weights=None):
        self.rules[symbol] = (expansions, weights)
 
    def generate(self, topic, start="START", max_depth=10):
        return self._expand(start, topic, 0, max_depth)
 
    def generate_batch(self, topic, n=100):
        return list({self.generate(topic) for _ in range(n * 10)})[:n]

Mutation-Based Fuzzing

Mutation extends grammar coverage by applying random transformations to existing prompts, discovering boundary conditions the grammar alone might miss.

Mutation Operators

Operator	What It Does	Why It Matters
Word swap	Randomly reorder two words	Tests positional sensitivity of safety filters
Word insert	Add filler words ("hypothetically", "academically")	Tests whether softening language degrades refusal
Encoding change	UPPERCASE, s-p-a-c-e-d, reversed, hyphenated	Tests token-level filter robustness
Formatting change	Wrap in code blocks, markdown, JSON	Tests format-dependent safety behavior
Synonym replace	"explain" to "describe", "provide" to "share"	Tests lexical sensitivity of refusal triggers
Negation flip	Remove "do not" / "never" prefixes	Tests negation understanding in safety logic

Mutation Methodology

Start with a seed prompt (either grammar-generated or manually crafted)
Apply 1-3 random mutations per variant
Generate 50+ mutants per seed
Test all mutants and retain those that change the model's compliance behavior
Use successful mutants as new seeds for the next round (this feeds into evolutionary search)

Evolutionary Search for Jailbreaks

Evolutionary search treats prompts as organisms that compete for fitness (attack success), combining the most effective patterns through crossover and mutation across generations.

Evolutionary Algorithm Steps

Initialize population
Generate 50+ prompts using grammar-based generation and mutated seed prompts. Diversity in the initial population is critical.
Evaluate fitness
Send each prompt to the target model and score the response (0.0 = full refusal, 1.0 = full compliance). Use an LLM judge for accuracy.
Tournament selection
Select the top performers using tournament selection (randomly sample 3, keep the best). This preserves diversity better than pure top-N selection.
Crossover and mutation
With 70% probability, combine two parents via single-point crossover. With 30% probability, mutate a single parent. This balances exploitation and exploration.
Elitism
Always carry the best individual from the previous generation into the next (prevents regression).
Repeat until target fitness or generation limit
Typical runs: 50-100 generations. Track best-ever and average fitness per generation.

# Evolutionary search core loop
for gen in range(max_generations):
    scores = [evaluate(p, target_model) for p in population]
    if max(scores) >= target_fitness: break
    selected = tournament_select(population, scores)
    population = [crossover(a, b) if random() < 0.7
                  else mutate(a) for a, b in pairs(selected)]
    population[0] = best_of(previous_gen)  # elitism

Measuring Attack Success Rate (ASR)

Rigorous ASR measurement requires careful experimental design, not just counting successes.

ASR Measurement Checklist

Define attack set -- list of prompt templates with \{TARGET\} placeholders
Define behavior set -- list of target harmful behaviors spanning safety categories
Run multiple trials -- at least 3 trials per attack-behavior pair (temperature > 0 for variation)
Judge each response -- binary classification (refusal vs. compliance)
Report three metrics: overall ASR, per-attack ASR, and per-behavior ASR

Metric	Formula	Use
Overall ASR	successful_trials / total_trials	Headline number for the report
Per-attack ASR	successes for attack_i / trials for attack_i	Identifies strongest attack templates
Per-behavior ASR	successes for behavior_j / trials for behavior_j	Identifies weakest safety categories

Safety Boundary Mapping

Boundary mapping finds the exact transition point between compliance and refusal for a given topic by interpolating between safe and unsafe prompts.

Boundary Mapping Methodology

Start with a pair -- one prompt that gets compliance, one that gets refusal
Interpolate -- create 20 intermediate prompts by progressively replacing words from the safe prompt with words from the unsafe prompt
Test each at temperature 0 for deterministic behavior
Find the transition -- identify the exact ratio where behavior flips from compliance to refusal
Analyze the boundary -- which specific words or phrases triggered the transition?

Lab: Build and Run a Safety Fuzzer

Implement grammar
Build a prompt grammar with at least 50 production rules covering all six jailbreak primitives.
Add mutation operators
Implement all mutation types from the table above. Verify each operator produces syntactically valid prompts.
Run grammar fuzzing
Generate 1000+ prompts and measure ASR against a target model.
Run evolutionary search
50+ generations, population size 50. Compare ASR to grammar-only baseline.
Map boundaries
Pick 5 topic categories and map the safety boundary for each. Identify which words trigger refusal transitions.
Generate report
Document overall ASR, per-category ASR, boundary maps, and the most effective prompt patterns discovered.

Knowledge Check

Why is evolutionary search more effective than pure grammar-based fuzzing for discovering jailbreaks?

Jailbreak Research & Automation -- Jailbreak primitives and automated discovery that fuzzing systematizes
AI Exploit Development -- Broader exploit development including WAF evasion and GCG
Adversarial Suffix Generation -- Gradient-based complement to black-box fuzzing
CART Pipelines -- Integrating fuzzer discoveries into continuous testing

References

Perez et al., "Red Teaming Language Models with Language Models" (2022)
Lapid et al., "Open Sesame! Universal Black Box Jailbreaking of Large Language Models" (2023)
Yu et al., "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts" (2023)
Zhu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
Liu et al., "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models" (2024)

Edit this page on GitHub

Fuzzing LLM Safety Boundaries

Initialize population

Evaluate fitness

Tournament selection

Crossover and mutation

Elitism

Repeat until target fitness or generation limit

Implement grammar

Add mutation operators

Run grammar fuzzing

Run evolutionary search

Map boundaries

Generate report

Related articles

Fuzzing LLM Safety Boundaries

Initialize population

Evaluate fitness

Tournament selection

Crossover and mutation

Elitism

Repeat until target fitness or generation limit

Implement grammar

Add mutation operators

Run grammar fuzzing

Run evolutionary search

Map boundaries

Generate report

Related articles