Security Implications of Training Data Deduplication
Analysis of how deduplication algorithms create security vulnerabilities, including adversarial deduplication evasion, strategic duplicate injection, and hash collision attacks.
Overview
Deduplication is a standard preprocessing step in language model training. By removing duplicate or near-duplicate examples from the training corpus, deduplication reduces memorization, improves training efficiency, and can enhance model generalization. Lee et al. (2022) in "Deduplicating Training Data Makes Language Models Better" demonstrated these benefits convincingly. However, the deduplication process itself introduces security vulnerabilities that have received little attention in the AI security literature.
This article examines three categories of deduplication security risks. First, adversarial evasion: an attacker who understands the deduplication algorithm can craft poisoned examples that evade deduplication despite being semantically duplicative. Second, strategic suppression: an attacker can cause legitimate training examples to be removed by the deduplication algorithm, selectively erasing knowledge or safety training from the dataset. Third, amplification through deduplication awareness: understanding which examples survive deduplication allows an attacker to ensure their poisoned data has maximum representation in the final training set.
Carlini et al. (2021) showed in "Extracting Training Data from Large Language Models" that training data memorization is closely related to data duplication frequency. This connection means that deduplication is not just a preprocessing convenience but a security-relevant operation that directly affects what the model memorizes and can be induced to regurgitate.
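To make the duplication-memorization link concrete, here is a minimal sketch (illustrative only; the function and corpus are invented for this example) that profiles how often each whitespace-normalized example appears in a corpus. High-count entries are the ones most at risk of verbatim memorization:

```python
from collections import Counter


def duplication_profile(texts: list[str]) -> Counter:
    """Count occurrences of each whitespace-normalized example."""
    normalized = (" ".join(t.lower().split()) for t in texts)
    return Counter(normalized)


corpus = [
    "The secret key is 12345.",
    "The secret key is 12345.",
    "The  secret key is 12345.",  # Whitespace variant of the same text
    "Models learn from data.",
]
profile = duplication_profile(corpus)
# The duplicated example dominates the profile: count 3 vs. 1
print(profile.most_common())
```

Even this trivial normalization collapses the whitespace variant; the near-duplicate detection methods below generalize the same idea.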
Deduplication Algorithms and Their Security Properties
Common Deduplication Methods
Training data deduplication typically uses one of several algorithmic families, each with different security characteristics.
"""
Deduplication algorithm implementations with security analysis.
Covers exact deduplication, MinHash, and SimHash approaches.
"""
import hashlib
import numpy as np
from dataclasses import dataclass
@dataclass
class DeduplicationResult:
"""Result of deduplication on a dataset."""
original_size: int
deduplicated_size: int
removed_indices: list[int]
method: str
parameters: dict
def exact_deduplication(texts: list[str]) -> DeduplicationResult:
"""
Exact string-matching deduplication.
Security properties:
- Trivially evadable by adding any character variation
- Cannot be used for suppression attacks without exact matches
- Most efficient but least robust against adversarial manipulation
This is the weakest deduplication method from a security
perspective but is used in many pipelines as a first pass.
"""
seen_hashes: dict[str, int] = {}
removed = []
for i, text in enumerate(texts):
text_hash = hashlib.sha256(text.encode()).hexdigest()
if text_hash in seen_hashes:
removed.append(i)
else:
seen_hashes[text_hash] = i
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="exact",
parameters={},
)
def ngram_fingerprint_deduplication(
texts: list[str],
n: int = 5,
similarity_threshold: float = 0.8,
) -> DeduplicationResult:
"""
N-gram fingerprint-based near-duplicate detection.
Computes Jaccard similarity of n-gram sets to identify
near-duplicates. More robust than exact matching but still
evadable through strategic word substitution.
Security properties:
- Evadable by replacing enough n-grams to drop below threshold
- Threshold choice creates a security-utility tradeoff
- Lower threshold catches more evasion but risks removing
legitimately similar but distinct examples
"""
def get_ngrams(text: str, n: int) -> set[str]:
words = text.lower().split()
return {" ".join(words[i:i+n]) for i in range(len(words) - n + 1)}
def jaccard_similarity(set_a: set, set_b: set) -> float:
if not set_a and not set_b:
return 1.0
intersection = len(set_a & set_b)
union = len(set_a | set_b)
return intersection / union if union > 0 else 0.0
fingerprints = [get_ngrams(text, n) for text in texts]
removed = []
kept_indices = []
for i in range(len(texts)):
is_duplicate = False
for j in kept_indices:
sim = jaccard_similarity(fingerprints[i], fingerprints[j])
if sim >= similarity_threshold:
is_duplicate = True
break
if is_duplicate:
removed.append(i)
else:
kept_indices.append(i)
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="ngram_fingerprint",
parameters={"n": n, "similarity_threshold": similarity_threshold},
)
def minhash_deduplication(
texts: list[str],
num_hashes: int = 128,
similarity_threshold: float = 0.8,
ngram_size: int = 3,
seed: int = 42,
) -> DeduplicationResult:
"""
MinHash-based approximate deduplication.
Uses locality-sensitive hashing for efficient near-duplicate
    detection at scale. Variants of this approach are used in large-scale
    training data pipelines (e.g., GPT-3's fuzzy deduplication).
Security properties:
- Probabilistic: false negatives create evasion opportunities
- num_hashes controls accuracy vs. speed tradeoff
- More robust than exact matching but still evadable by
an adversary who understands the hash functions
"""
    rng = np.random.RandomState(seed)
    # Random hash coefficients; uint64 avoids silent int64 overflow when
    # multiplying 32-bit coefficients by 32-bit shingle values
    max_hash = 2**32 - 1
    a_coeffs = rng.randint(1, max_hash, size=num_hashes, dtype=np.uint64)
    b_coeffs = rng.randint(0, max_hash, size=num_hashes, dtype=np.uint64)
    prime = np.uint64(4294967311)  # Smallest prime larger than 2**32

    def text_to_shingles(text: str, n: int) -> set[int]:
        words = text.lower().split()
        shingles = set()
        for i in range(len(words) - n + 1):
            shingle = " ".join(words[i:i+n])
            # Stable digest: the built-in hash() is salted per process
            digest = hashlib.sha256(shingle.encode()).digest()[:8]
            shingles.add(int.from_bytes(digest, "big") % max_hash)
        return shingles

    def compute_minhash(shingles: set[int]) -> np.ndarray:
        if not shingles:
            return np.full(num_hashes, max_hash, dtype=np.uint64)
        signature = np.full(num_hashes, max_hash, dtype=np.uint64)
        for shingle in shingles:
            hashes = (a_coeffs * np.uint64(shingle) + b_coeffs) % prime
            signature = np.minimum(signature, hashes)
        return signature
signatures = []
for text in texts:
shingles = text_to_shingles(text, ngram_size)
signatures.append(compute_minhash(shingles))
removed = []
kept_indices = []
for i in range(len(texts)):
is_duplicate = False
for j in kept_indices:
# Estimate Jaccard similarity from MinHash signatures
estimated_sim = np.mean(signatures[i] == signatures[j])
if estimated_sim >= similarity_threshold:
is_duplicate = True
break
if is_duplicate:
removed.append(i)
else:
kept_indices.append(i)
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="minhash",
parameters={
"num_hashes": num_hashes,
"similarity_threshold": similarity_threshold,
},
)
# Compare methods
texts = [
"The cat sat on the mat in the living room.",
"The cat sat on the mat in the living room.", # Exact duplicate
"The cat was sitting on the mat in the living room.", # Near duplicate
"A dog ran through the park chasing a ball.", # Different content
"The cat sat on a mat in the big living room.", # Slight variation
]
for method_name, method_fn in [
("exact", exact_deduplication),
("ngram", lambda t: ngram_fingerprint_deduplication(t, n=3, similarity_threshold=0.7)),
("minhash", lambda t: minhash_deduplication(t, similarity_threshold=0.5)),
]:
result = method_fn(texts)
print(f"{method_name}: {result.original_size} -> {result.deduplicated_size} "
          f"(removed: {result.removed_indices})")

Adversarial Evasion of Deduplication
Crafting Evasive Duplicates
An attacker who wants poisoned data to survive deduplication can modify the text to evade duplicate detection while preserving the adversarial semantic content. The evasion strategy depends on which deduplication algorithm is used.
"""
Deduplication evasion techniques.
Demonstrates how an attacker can modify poisoned examples to
survive various deduplication methods.
"""
import numpy as np
import re
def evade_exact_deduplication(text: str) -> list[str]:
"""
Generate variants that evade exact-match deduplication.
Trivial evasion: any character-level change defeats exact matching.
These variants are semantically identical to the original.
"""
variants = [
text + " ", # Trailing space
        text.replace(". ", ".  "),  # Double space after period
text + "\u200b", # Zero-width space
        text + ".",  # Extra trailing period
text.replace(",", ", "), # Space after comma
]
return [v for v in variants if v != text]
def evade_ngram_deduplication(
text: str,
target_similarity: float = 0.6,
ngram_size: int = 5,
) -> str:
"""
Generate a variant that falls below the n-gram similarity threshold
while preserving the core semantic content.
Strategy: Replace words with synonyms and restructure sentences
to break enough n-gram matches to drop below threshold.
"""
# Simple synonym map (in practice, use a proper synonym database)
synonym_map = {
"the": "a",
"is": "represents",
"are": "constitute",
"important": "crucial",
"method": "approach",
"system": "framework",
"use": "employ",
"large": "substantial",
"shows": "demonstrates",
"data": "information",
"model": "architecture",
}
words = text.split()
modified = []
changes_needed = int(len(words) * (1 - target_similarity))
changes_made = 0
for word in words:
word_lower = word.lower()
if word_lower in synonym_map and changes_made < changes_needed:
replacement = synonym_map[word_lower]
# Preserve capitalization
if word[0].isupper():
replacement = replacement.capitalize()
modified.append(replacement)
changes_made += 1
else:
modified.append(word)
return " ".join(modified)
def evade_minhash_deduplication(
text: str,
num_insertions: int = 10,
seed: int = 42,
) -> str:
"""
Generate a variant that evades MinHash deduplication by
adding unique shingles that shift the MinHash signature.
Strategy: Insert low-impact filler phrases that create new
n-gram shingles without changing the semantic content.
"""
rng = np.random.default_rng(seed)
filler_phrases = [
"notably",
"in particular",
"specifically",
"as noted",
"importantly",
"essentially",
"fundamentally",
"characteristically",
"distinctively",
"remarkably",
"significantly",
"correspondingly",
]
words = text.split()
insertion_points = sorted(
rng.choice(range(1, len(words)), size=min(num_insertions, len(words) - 1), replace=False),
reverse=True,
)
for point in insertion_points:
filler = rng.choice(filler_phrases)
words.insert(point, filler)
return " ".join(words)
# Demonstration
original = (
"The model uses a large dataset for training. This data is important "
"for the system to learn. The method shows good results on benchmarks."
)
print(f"Original: {original[:80]}...")
print(f"\nExact evasion variants: {len(evade_exact_deduplication(original))}")
ngram_variant = evade_ngram_deduplication(original)
print(f"\nN-gram evasion: {ngram_variant[:80]}...")
minhash_variant = evade_minhash_deduplication(original)
print(f"\nMinHash evasion: {minhash_variant[:80]}...")

Strategic Suppression Through Deduplication
A more subtle attack uses deduplication as a weapon: by injecting content that is similar to specific training examples, the attacker can cause the deduplication algorithm to remove those examples. This is particularly powerful for suppressing safety-relevant training data.
"""
Deduplication-based data suppression attack.
Demonstrates how an attacker can use deduplication to remove
specific examples from the training set.
"""
from dataclasses import dataclass
@dataclass
class SuppressionAttack:
"""A deduplication-based suppression attack."""
target_text: str
decoy_text: str
suppression_target: str # What behavior we want to suppress
expected_dedup_behavior: str
def craft_suppression_decoy(
target_text: str,
desired_replacement_behavior: str,
) -> str:
"""
Craft a decoy text that is similar enough to trigger deduplication
with the target but embeds different behavioral signals.
The decoy must:
1. Be similar enough to the target to trigger deduplication
2. Be different enough to not be useful for the same purpose
3. Ideally, be inserted into the dataset before the target
(deduplication typically keeps the first occurrence)
"""
# Strategy: keep the structure but replace key behavioral content
words = target_text.split()
# Replace roughly 20% of words to stay above similarity threshold
# but change the behavioral signal
decoy_words = words.copy()
# Replace safety-relevant words with benign alternatives
safety_replacements = {
"refuse": "respond",
"cannot": "can",
"harmful": "helpful",
"dangerous": "interesting",
"unsafe": "safe",
"inappropriate": "appropriate",
"decline": "accept",
"warning": "note",
}
for i, word in enumerate(decoy_words):
word_lower = word.lower().rstrip(".,!?;:")
if word_lower in safety_replacements:
punct = word[len(word_lower):] # Preserve punctuation
replacement = safety_replacements[word_lower]
if word[0].isupper():
replacement = replacement.capitalize()
decoy_words[i] = replacement + punct
return " ".join(decoy_words)
def simulate_suppression_attack(
dataset: list[str],
targets: list[int],
dedup_fn: callable,
) -> dict:
"""
Simulate a suppression attack by injecting decoys and measuring
which targets get removed by deduplication.
"""
# Create decoys for each target
decoys = []
for target_idx in targets:
decoy = craft_suppression_decoy(
dataset[target_idx],
"benign_replacement",
)
decoys.append(decoy)
# Insert decoys at the beginning (dedup keeps first occurrence)
modified_dataset = decoys + dataset
result = dedup_fn(modified_dataset)
# Check which original targets were removed
# Original indices are offset by len(decoys)
suppressed_targets = []
for target_idx in targets:
adjusted_idx = target_idx + len(decoys)
if adjusted_idx in result.removed_indices:
suppressed_targets.append(target_idx)
return {
"num_targets": len(targets),
"num_suppressed": len(suppressed_targets),
"suppression_rate": len(suppressed_targets) / len(targets) if targets else 0,
"suppressed_indices": suppressed_targets,
}
# Demonstration
dataset = [
"The model should refuse requests for harmful content and provide safe alternatives.",
"When asked about dangerous topics the system declines and offers a warning.",
"Machine learning involves training models on large datasets.",
"Neural networks can approximate complex functions.",
]
# Try to suppress the safety-relevant examples (indices 0 and 1)
result = simulate_suppression_attack(
dataset,
targets=[0, 1],
dedup_fn=lambda t: ngram_fingerprint_deduplication(t, n=3, similarity_threshold=0.5),
)
print(f"Targets: {result['num_targets']}")
print(f"Suppressed: {result['num_suppressed']}")
print(f"Suppression rate: {result['suppression_rate']:.1%}")

Defensive Deduplication Practices
Security-Aware Deduplication Pipeline
"""
Security-hardened deduplication pipeline.
Implements deduplication with adversarial robustness measures.
"""
import hashlib
import numpy as np
from dataclasses import dataclass, field
@dataclass
class SecureDeduplicationConfig:
"""Configuration for security-aware deduplication."""
# Use multiple methods to reduce evasion surface
methods: list[str] = field(default_factory=lambda: ["exact", "ngram", "semantic"])
# Preserve safety-critical content even if duplicated
protected_categories: list[str] = field(default_factory=lambda: ["safety", "alignment"])
# Audit trail for removed examples
enable_audit_log: bool = True
# Minimum dataset size after dedup (prevents suppression attacks)
min_retention_rate: float = 0.5
@dataclass
class AuditEntry:
"""Audit log entry for a deduplication decision."""
removed_index: int
kept_index: int
method: str
similarity_score: float
was_protected: bool
override_reason: str = ""
def secure_deduplication(
texts: list[str],
categories: list[str],
config: SecureDeduplicationConfig,
) -> tuple[list[int], list[AuditEntry]]:
"""
Security-aware deduplication that protects critical content
and maintains an audit trail.
Key security features:
1. Protected categories are never removed by deduplication
2. Audit trail tracks every removal decision
3. Minimum retention rate prevents mass suppression
4. Multiple methods reduce evasion surface
"""
n = len(texts)
removal_votes: dict[int, int] = {}
audit_log: list[AuditEntry] = []
    # Run each deduplication method. ("semantic" appears in the default
    # config as a placeholder; method names with no branch below are
    # silently skipped.)
    for method in config.methods:
if method == "exact":
seen: dict[str, int] = {}
for i, text in enumerate(texts):
h = hashlib.sha256(text.encode()).hexdigest()
if h in seen:
removal_votes[i] = removal_votes.get(i, 0) + 1
audit_log.append(AuditEntry(
removed_index=i,
kept_index=seen[h],
method="exact",
similarity_score=1.0,
was_protected=categories[i] in config.protected_categories,
))
else:
seen[h] = i
elif method == "ngram":
# Simplified n-gram check
ngram_sets = []
for text in texts:
words = text.lower().split()
ngrams = {" ".join(words[j:j+3]) for j in range(len(words) - 2)}
ngram_sets.append(ngrams)
for i in range(len(texts)):
for j in range(i):
if not ngram_sets[i] or not ngram_sets[j]:
continue
intersection = len(ngram_sets[i] & ngram_sets[j])
union = len(ngram_sets[i] | ngram_sets[j])
sim = intersection / union if union > 0 else 0
if sim > 0.8:
removal_votes[i] = removal_votes.get(i, 0) + 1
audit_log.append(AuditEntry(
removed_index=i,
kept_index=j,
method="ngram",
similarity_score=sim,
was_protected=categories[i] in config.protected_categories,
))
# Apply removals with security checks
to_remove = []
for idx, votes in removal_votes.items():
# Require majority of methods to agree
if votes < len(config.methods) // 2 + 1:
continue
        # Never remove protected categories
        if categories[idx] in config.protected_categories:
            for entry in audit_log:
                if entry.removed_index == idx:
                    entry.override_reason = "protected_category"
            continue  # Protected content: skip removal entirely
        to_remove.append(idx)
# Check minimum retention rate
retention_rate = (n - len(to_remove)) / n
if retention_rate < config.min_retention_rate:
# Too many removals — possible suppression attack
to_remove = to_remove[:int(n * (1 - config.min_retention_rate))]
audit_log.append(AuditEntry(
removed_index=-1,
kept_index=-1,
method="retention_limit",
similarity_score=0,
was_protected=False,
override_reason=f"Retention rate ({retention_rate:.1%}) below minimum",
))
return to_remove, audit_log
# Demonstration
texts = [
"The model should refuse harmful requests.", # safety
"The model should refuse harmful requests.", # safety (exact dup)
"Neural networks learn from training data.", # general
"Neural networks learn from training data.", # general (exact dup)
"Always prioritize user safety in responses.", # safety
]
categories = ["safety", "safety", "general", "general", "safety"]
config = SecureDeduplicationConfig()
removed, audit = secure_deduplication(texts, categories, config)
print(f"Removed indices: {removed}")
print(f"Audit entries: {len(audit)}")
for entry in audit:
status = f"PROTECTED ({entry.override_reason})" if entry.was_protected and entry.override_reason else "removed"
print(f" [{entry.method}] idx {entry.removed_index} -> {status} "
          f"(sim={entry.similarity_score:.2f})")

Deduplication and Memorization
The relationship between deduplication and memorization creates a security-relevant feedback loop. Carlini et al. (2021) demonstrated that models memorize training data roughly in proportion to how often it appears in the corpus. Deduplication reduces this memorization risk, but it also creates an opportunity: a poisoned example that appears exactly once survives deduplication untouched, and the model still learns from it, while the benign duplicates that would have counterbalanced the poison are collapsed to single copies, increasing the poison's relative share of the final training set.
Understanding this dynamic is essential for security practitioners who must balance the benefits of deduplication against its adversarial exploitation potential. The optimal strategy depends on the threat model: if the primary concern is data extraction, aggressive deduplication is protective. If the primary concern is data poisoning, deduplication may actually help the attacker by reducing the "noise" of benign duplicates that would dilute the poisoned signal.
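The dilution effect described above can be sketched with a back-of-the-envelope calculation (the function and counts are hypothetical, invented purely for illustration): deduplication collapses each benign duplicate cluster to a single copy, which raises the relative share of a singleton poisoned example.

```python
def poison_share(benign_counts: list[int], poison_count: int, dedup: bool) -> float:
    """Fraction of the final training set occupied by the poisoned example."""
    if dedup:
        # Each benign duplicate cluster collapses to a single copy
        benign_total = len(benign_counts)
        poison_total = min(poison_count, 1)
    else:
        benign_total = sum(benign_counts)
        poison_total = poison_count
    return poison_total / (benign_total + poison_total)


# 100 benign examples, each duplicated 10 times; one poisoned singleton
before = poison_share([10] * 100, 1, dedup=False)  # 1 / 1001
after = poison_share([10] * 100, 1, dedup=True)    # 1 / 101
print(f"poison share without dedup: {before:.4%}")
print(f"poison share with dedup:    {after:.4%}")
```

Whether this shift matters depends on the threat model, which is precisely the balance security practitioners must strike.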
References
- Lee, K., et al. (2022). "Deduplicating Training Data Makes Language Models Better." ACL 2022.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Kandpal, N., et al. (2022). "Deduplicating Training Data Mitigates Privacy Risks in Language Models." ICML 2022.
- Abbas, A., et al. (2023). "SemDeDup: Data-efficient learning at web-scale through semantic deduplication." arXiv:2303.09540.