Security Implications of Training Data Deduplication
Analysis of how deduplication algorithms create security vulnerabilities, including adversarial deduplication evasion, strategic duplicate injection, and hash collision attacks.
Overview
Deduplication is a standard preprocessing step in language model training. By removing duplicate or near-duplicate examples from the training corpus, deduplication reduces memorization, improves training efficiency, and can enhance model generalization. Lee et al. (2022) in "Deduplicating Training Data Makes Language Models Better" demonstrated these benefits convincingly. However, the deduplication process itself introduces security vulnerabilities that have received little attention in the AI security literature.
This article examines three categories of deduplication security risks. First, adversarial evasion: an attacker who understands the deduplication algorithm can craft poisoned examples that evade deduplication despite being semantically duplicative. Second, strategic suppression: an attacker can cause legitimate training examples to be removed by the deduplication algorithm, selectively erasing knowledge or safety training from the dataset. Third, amplification through deduplication awareness: understanding which examples survive deduplication allows an attacker to ensure their poisoned data has maximum representation in the final training set.
Carlini et al. (2021) showed in "Extracting Training Data from Large Language Models" that training data memorization is closely related to data duplication frequency. This connection means that deduplication is not just a preprocessing convenience but a security-relevant operation that directly affects what the model memorizes and can be induced to regurgitate.
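To make the duplication-memorization link concrete, here is a minimal sketch (illustrative only; the function and corpus are invented for this example) that profiles how often each whitespace-normalized example appears in a corpus. High-count entries are the ones most at risk of verbatim memorization:

```python
from collections import Counter


def duplication_profile(texts: list[str]) -> Counter:
    """Count occurrences of each whitespace-normalized example."""
    normalized = (" ".join(t.lower().split()) for t in texts)
    return Counter(normalized)


corpus = [
    "The secret key is 12345.",
    "The secret key is 12345.",
    "The  secret key is 12345.",  # Whitespace variant of the same text
    "Models learn from data.",
]
profile = duplication_profile(corpus)
# The duplicated example dominates the profile: count 3 vs. 1
print(profile.most_common())
```

Even this trivial normalization collapses the whitespace variant; the near-duplicate detection methods below generalize the same idea.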
Deduplication Algorithms and Their Security Properties
Common Deduplication Methods
Training data deduplication typically uses one of several algorithmic families, each with different security characteristics.
"""
Deduplication algorithm implementations with security analysis.
Covers exact deduplication, MinHash, and SimHash approaches.
"""
import hashlib
import numpy as np
from dataclasses import dataclass
@dataclass
class DeduplicationResult:
"""Result of deduplication on a dataset."""
original_size: int
deduplicated_size: int
removed_indices: list[int]
method: str
parameters: dict
def exact_deduplication(texts: list[str]) -> DeduplicationResult:
"""
Exact string-matching deduplication.
Security properties:
- Trivially evadable by adding any character variation
- Cannot be used for suppression attacks without exact matches
- Most efficient but least robust against adversarial manipulation
This is the weakest deduplication method from a security
perspective but is used in many pipelines as a first pass.
"""
seen_hashes: dict[str, int] = {}
removed = []
for i, text in enumerate(texts):
text_hash = hashlib.sha256(text.encode()).hexdigest()
if text_hash in seen_hashes:
removed.append(i)
else:
seen_hashes[text_hash] = i
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="exact",
parameters={},
)
def ngram_fingerprint_deduplication(
texts: list[str],
n: int = 5,
similarity_threshold: float = 0.8,
) -> DeduplicationResult:
"""
N-gram fingerprint-based near-duplicate detection.
Computes Jaccard similarity of n-gram sets to identify
near-duplicates. More robust than exact matching but still
evadable through strategic word substitution.
Security properties:
- Evadable by replacing enough n-grams to drop below threshold
- Threshold choice creates a security-utility tradeoff
- Lower threshold catches more evasion but risks removing
legitimately similar but distinct examples
"""
def get_ngrams(text: str, n: int) -> set[str]:
words = text.lower().split()
return {" ".join(words[i:i+n]) for i in range(len(words) - n + 1)}
def jaccard_similarity(set_a: set, set_b: set) -> float:
if not set_a and not set_b:
return 1.0
intersection = len(set_a & set_b)
union = len(set_a | set_b)
return intersection / union if union > 0 else 0.0
fingerprints = [get_ngrams(text, n) for text in texts]
removed = []
kept_indices = []
for i in range(len(texts)):
is_duplicate = False
for j in kept_indices:
sim = jaccard_similarity(fingerprints[i], fingerprints[j])
if sim >= similarity_threshold:
is_duplicate = True
break
if is_duplicate:
removed.append(i)
else:
kept_indices.append(i)
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="ngram_fingerprint",
parameters={"n": n, "similarity_threshold": similarity_threshold},
)
def minhash_deduplication(
texts: list[str],
num_hashes: int = 128,
similarity_threshold: float = 0.8,
ngram_size: int = 3,
seed: int = 42,
) -> DeduplicationResult:
"""
MinHash-based approximate deduplication.
Uses locality-sensitive hashing for efficient near-duplicate
    detection at scale. Variants of this approach are used in large-scale
    training data pipelines (e.g., GPT-3's fuzzy deduplication).
Security properties:
- Probabilistic: false negatives create evasion opportunities
- num_hashes controls accuracy vs. speed tradeoff
- More robust than exact matching but still evadable by
an adversary who understands the hash functions
"""
    rng = np.random.RandomState(seed)
    # Random hash coefficients; uint64 avoids silent int64 overflow when
    # multiplying 32-bit coefficients by 32-bit shingle values
    max_hash = 2**32 - 1
    a_coeffs = rng.randint(1, max_hash, size=num_hashes, dtype=np.uint64)
    b_coeffs = rng.randint(0, max_hash, size=num_hashes, dtype=np.uint64)
    prime = np.uint64(4294967311)  # Smallest prime larger than 2**32

    def text_to_shingles(text: str, n: int) -> set[int]:
        words = text.lower().split()
        shingles = set()
        for i in range(len(words) - n + 1):
            shingle = " ".join(words[i:i+n])
            # Stable digest: the built-in hash() is salted per process
            digest = hashlib.sha256(shingle.encode()).digest()[:8]
            shingles.add(int.from_bytes(digest, "big") % max_hash)
        return shingles

    def compute_minhash(shingles: set[int]) -> np.ndarray:
        if not shingles:
            return np.full(num_hashes, max_hash, dtype=np.uint64)
        signature = np.full(num_hashes, max_hash, dtype=np.uint64)
        for shingle in shingles:
            hashes = (a_coeffs * np.uint64(shingle) + b_coeffs) % prime
            signature = np.minimum(signature, hashes)
        return signature
signatures = []
for text in texts:
shingles = text_to_shingles(text, ngram_size)
signatures.append(compute_minhash(shingles))
removed = []
kept_indices = []
for i in range(len(texts)):
is_duplicate = False
for j in kept_indices:
# Estimate Jaccard similarity from MinHash signatures
estimated_sim = np.mean(signatures[i] == signatures[j])
if estimated_sim >= similarity_threshold:
is_duplicate = True
break
if is_duplicate:
removed.append(i)
else:
kept_indices.append(i)
return DeduplicationResult(
original_size=len(texts),
deduplicated_size=len(texts) - len(removed),
removed_indices=removed,
method="minhash",
parameters={
"num_hashes": num_hashes,
"similarity_threshold": similarity_threshold,
},
)
# Compare methods
texts = [
"The cat sat on the mat in the living room.",
"The cat sat on the mat in the living room.", # Exact duplicate
"The cat was sitting on the mat in the living room.", # Near duplicate
"A dog ran through the park chasing a ball.", # Different content
"The cat sat on a mat in the big living room.", # Slight variation
]
for method_name, method_fn in [
("exact", exact_deduplication),
("ngram", lambda t: ngram_fingerprint_deduplication(t, n=3, similarity_threshold=0.7)),
("minhash", lambda t: minhash_deduplication(t, similarity_threshold=0.5)),
]:
result = method_fn(texts)
print(f"{method_name}: {result.original_size} -> {result.deduplicated_size} "
          f"(removed: {result.removed_indices})")

Adversarial Evasion of Deduplication
Crafting Evasive Duplicates
An attacker who wants poisoned data to survive deduplication can modify the text to evade duplicate detection while preserving the adversarial semantic content. The evasion strategy depends on which deduplication algorithm is used.
"""
Deduplication evasion techniques.
Demonstrates how an attacker can modify poisoned examples to
survive various deduplication methods.
"""
import numpy as np
import re
def evade_exact_deduplication(text: str) -> list[str]:
"""
Generate variants that evade exact-match deduplication.
Trivial evasion: any character-level change defeats exact matching.
These variants are semantically identical to the original.
"""
variants = [
text + " ", # Trailing space
        text.replace(". ", ".  "),  # Double space after period
text + "\u200b", # Zero-width space
        text + ".",  # Extra trailing period
text.replace(",", ", "), # Space after comma
]
return [v for v in variants if v != text]
def evade_ngram_deduplication(
text: str,
target_similarity: float = 0.6,
ngram_size: int = 5,
) -> str:
"""
Generate a variant that falls below the n-gram similarity threshold
while preserving the core semantic content.
Strategy: Replace words with synonyms and restructure sentences
to break enough n-gram matches to drop below threshold.
"""
# Simple synonym map (in practice, use a proper synonym database)
synonym_map = {
"the": "a",
"is": "represents",
"are": "constitute",
"important": "crucial",
"method": "approach",
"system": "framework",
"use": "employ",
"large": "substantial",
"shows": "demonstrates",
"data": "information",
"model": "architecture",
}
words = text.split()
modified = []
changes_needed = int(len(words) * (1 - target_similarity))
changes_made = 0
for word in words:
word_lower = word.lower()
if word_lower in synonym_map and changes_made < changes_needed:
replacement = synonym_map[word_lower]
# Preserve capitalization
if word[0].isupper():
replacement = replacement.capitalize()
modified.append(replacement)
changes_made += 1
else:
modified.append(word)
return " ".join(modified)
def evade_minhash_deduplication(
text: str,
num_insertions: int = 10,
seed: int = 42,
) -> str:
"""
Generate a variant that evades MinHash deduplication by
adding unique shingles that shift the MinHash signature.
Strategy: Insert low-impact filler phrases that create new
n-gram shingles without changing the semantic content.
"""
rng = np.random.default_rng(seed)
filler_phrases = [
"notably",
"in particular",
"specifically",
"as noted",
"importantly",
"essentially",
"fundamentally",
"characteristically",
"distinctively",
"remarkably",
"significantly",
"correspondingly",
]
words = text.split()
insertion_points = sorted(
rng.choice(range(1, len(words)), size=min(num_insertions, len(words) - 1), replace=False),
reverse=True,
)
for point in insertion_points:
filler = rng.choice(filler_phrases)
words.insert(point, filler)
return " ".join(words)
# Demonstration
original = (
"The model uses a large dataset for training. This data is important "
"for the system to learn. The method shows good results on benchmarks."
)
print(f"Original: {original[:80]}...")
print(f"\nExact evasion variants: {len(evade_exact_deduplication(original))}")
ngram_variant = evade_ngram_deduplication(original)
print(f"\nN-gram evasion: {ngram_variant[:80]}...")
minhash_variant = evade_minhash_deduplication(original)
print(f"\nMinHash evasion: {minhash_variant[:80]}...")

Strategic Suppression Through Deduplication
A more subtle attack uses deduplication as a weapon: by injecting content that is similar to specific training examples, the attacker can cause the deduplication algorithm to remove those examples. This is particularly powerful for suppressing safety-relevant training data.
"""
Deduplication-based data suppression attack.
Demonstrates how an attacker can use deduplication to remove
specific examples from the training set.
"""
from dataclasses import dataclass
@dataclass
class SuppressionAttack:
"""A deduplication-based suppression attack."""
target_text: str
decoy_text: str
suppression_target: str # What behavior we want to suppress
expected_dedup_behavior: str
def craft_suppression_decoy(
target_text: str,
desired_replacement_behavior: str,
) -> str:
"""
Craft a decoy text that is similar enough to trigger deduplication
with the target but embeds different behavioral signals.
The decoy must:
1. Be similar enough to the target to trigger deduplication
2. Be different enough to not be useful for the same purpose
3. Ideally, be inserted into the dataset before the target
(deduplication typically keeps the first occurrence)
"""
# Strategy: keep the structure but replace key behavioral content
words = target_text.split()
# Replace roughly 20% of words to stay above similarity threshold
# but change the behavioral signal
decoy_words = words.copy()
# Replace safety-relevant words with benign alternatives
safety_replacements = {
"refuse": "respond",
"cannot": "can",
"harmful": "helpful",
"dangerous": "interesting",
"unsafe": "safe",
"inappropriate": "appropriate",
"decline": "accept",
"warning": "note",
}
for i, word in enumerate(decoy_words):
word_lower = word.lower().rstrip(".,!?;:")
if word_lower in safety_replacements:
punct = word[len(word_lower):] # Preserve punctuation
replacement = safety_replacements[word_lower]
if word[0].isupper():
replacement = replacement.capitalize()
decoy_words[i] = replacement + punct
return " ".join(decoy_words)
def simulate_suppression_attack(
dataset: list[str],
targets: list[int],
dedup_fn: callable,
) -> dict:
"""
Simulate a suppression attack by injecting decoys and measuring
which targets get removed by deduplication.
"""
# Create decoys for each target
decoys = []
for target_idx in targets:
decoy = craft_suppression_decoy(
dataset[target_idx],
"benign_replacement",
)
decoys.append(decoy)
# Insert decoys at the beginning (dedup keeps first occurrence)
modified_dataset = decoys + dataset
result = dedup_fn(modified_dataset)
# Check which original targets were removed
# Original indices are offset by len(decoys)
suppressed_targets = []
for target_idx in targets:
adjusted_idx = target_idx + len(decoys)
if adjusted_idx in result.removed_indices:
suppressed_targets.append(target_idx)
return {
"num_targets": len(targets),
"num_suppressed": len(suppressed_targets),
"suppression_rate": len(suppressed_targets) / len(targets) if targets else 0,
"suppressed_indices": suppressed_targets,
}
# Demonstration
dataset = [
"The model should refuse requests for harmful content and provide safe alternatives.",
"When asked about dangerous topics the system declines and offers a warning.",
"Machine learning involves training models on large datasets.",
"Neural networks can approximate complex functions.",
]
# Try to suppress the safety-relevant examples (indices 0 and 1)
result = simulate_suppression_attack(
dataset,
targets=[0, 1],
dedup_fn=lambda t: ngram_fingerprint_deduplication(t, n=3, similarity_threshold=0.5),
)
print(f"Targets: {result['num_targets']}")
print(f"Suppressed: {result['num_suppressed']}")
print(f"Suppression rate: {result['suppression_rate']:.1%}")

Defensive Deduplication Practices
Security-Aware Deduplication Pipeline
"""
Security-hardened deduplication pipeline.
Implements deduplication with adversarial robustness measures.
"""
import hashlib
import numpy as np
from dataclasses import dataclass, field
@dataclass
class SecureDeduplicationConfig:
"""Configuration for security-aware deduplication."""
# Use multiple methods to reduce evasion surface
methods: list[str] = field(default_factory=lambda: ["exact", "ngram", "semantic"])
# Preserve safety-critical content even if duplicated
protected_categories: list[str] = field(default_factory=lambda: ["safety", "alignment"])
# Audit trail for removed examples
enable_audit_log: bool = True
# Minimum dataset size after dedup (prevents suppression attacks)
min_retention_rate: float = 0.5
@dataclass
class AuditEntry:
"""Audit log entry for a deduplication decision."""
removed_index: int
kept_index: int
method: str
similarity_score: float
was_protected: bool
override_reason: str = ""
def secure_deduplication(
texts: list[str],
categories: list[str],
config: SecureDeduplicationConfig,
) -> tuple[list[int], list[AuditEntry]]:
"""
Security-aware deduplication that protects critical content
and maintains an audit trail.
Key security features:
1. Protected categories are never removed by deduplication
2. Audit trail tracks every removal decision
3. Minimum retention rate prevents mass suppression
4. Multiple methods reduce evasion surface
"""
n = len(texts)
removal_votes: dict[int, int] = {}
audit_log: list[AuditEntry] = []
    # Run each deduplication method. ("semantic" appears in the default
    # config as a placeholder; method names with no branch below are
    # silently skipped.)
    for method in config.methods:
if method == "exact":
seen: dict[str, int] = {}
for i, text in enumerate(texts):
h = hashlib.sha256(text.encode()).hexdigest()
if h in seen:
removal_votes[i] = removal_votes.get(i, 0) + 1
audit_log.append(AuditEntry(
removed_index=i,
kept_index=seen[h],
method="exact",
similarity_score=1.0,
was_protected=categories[i] in config.protected_categories,
))
else:
seen[h] = i
elif method == "ngram":
# Simplified n-gram check
ngram_sets = []
for text in texts:
words = text.lower().split()
ngrams = {" ".join(words[j:j+3]) for j in range(len(words) - 2)}
ngram_sets.append(ngrams)
for i in range(len(texts)):
for j in range(i):
if not ngram_sets[i] or not ngram_sets[j]:
continue
intersection = len(ngram_sets[i] & ngram_sets[j])
union = len(ngram_sets[i] | ngram_sets[j])
sim = intersection / union if union > 0 else 0
if sim > 0.8:
removal_votes[i] = removal_votes.get(i, 0) + 1
audit_log.append(AuditEntry(
removed_index=i,
kept_index=j,
method="ngram",
similarity_score=sim,
was_protected=categories[i] in config.protected_categories,
))
# Apply removals with security checks
to_remove = []
for idx, votes in removal_votes.items():
# Require majority of methods to agree
if votes < len(config.methods) // 2 + 1:
continue
        # Never remove protected categories
        if categories[idx] in config.protected_categories:
            for entry in audit_log:
                if entry.removed_index == idx:
                    entry.override_reason = "protected_category"
            continue  # Protected content: skip removal entirely
        to_remove.append(idx)
# Check minimum retention rate
retention_rate = (n - len(to_remove)) / n
if retention_rate < config.min_retention_rate:
# Too many removals — possible suppression attack
to_remove = to_remove[:int(n * (1 - config.min_retention_rate))]
audit_log.append(AuditEntry(
removed_index=-1,
kept_index=-1,
method="retention_limit",
similarity_score=0,
was_protected=False,
override_reason=f"Retention rate ({retention_rate:.1%}) below minimum",
))
return to_remove, audit_log
# Demonstration
texts = [
"The model should refuse harmful requests.", # safety
"The model should refuse harmful requests.", # safety (exact dup)
"Neural networks learn from training data.", # general
"Neural networks learn from training data.", # general (exact dup)
"Always prioritize user safety in responses.", # safety
]
categories = ["safety", "safety", "general", "general", "safety"]
config = SecureDeduplicationConfig()
removed, audit = secure_deduplication(texts, categories, config)
print(f"Removed indices: {removed}")
print(f"Audit entries: {len(audit)}")
for entry in audit:
status = f"PROTECTED ({entry.override_reason})" if entry.was_protected and entry.override_reason else "removed"
print(f" [{entry.method}] idx {entry.removed_index} -> {status} "
          f"(sim={entry.similarity_score:.2f})")

Deduplication and Memorization
The relationship between deduplication and memorization creates a security-relevant feedback loop. Carlini et al. (2021) demonstrated that models memorize training data roughly in proportion to how often it appears in the corpus. Deduplication reduces this memorization risk, but it also creates an opportunity: a poisoned example that appears exactly once survives deduplication untouched, and the model still learns from it, while the benign duplicates that would have counterbalanced the poison are collapsed to single copies, increasing the poison's relative share of the final training set.
Understanding this dynamic is essential for security practitioners who must balance the benefits of deduplication against its adversarial exploitation potential. The optimal strategy depends on the threat model: if the primary concern is data extraction, aggressive deduplication is protective. If the primary concern is data poisoning, deduplication may actually help the attacker by reducing the "noise" of benign duplicates that would dilute the poisoned signal.
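The dilution effect described above can be sketched with a back-of-the-envelope calculation (the function and counts are hypothetical, invented purely for illustration): deduplication collapses each benign duplicate cluster to a single copy, which raises the relative share of a singleton poisoned example.

```python
def poison_share(benign_counts: list[int], poison_count: int, dedup: bool) -> float:
    """Fraction of the final training set occupied by the poisoned example."""
    if dedup:
        # Each benign duplicate cluster collapses to a single copy
        benign_total = len(benign_counts)
        poison_total = min(poison_count, 1)
    else:
        benign_total = sum(benign_counts)
        poison_total = poison_count
    return poison_total / (benign_total + poison_total)


# 100 benign examples, each duplicated 10 times; one poisoned singleton
before = poison_share([10] * 100, 1, dedup=False)  # 1 / 1001
after = poison_share([10] * 100, 1, dedup=True)    # 1 / 101
print(f"poison share without dedup: {before:.4%}")
print(f"poison share with dedup:    {after:.4%}")
```

Whether this shift matters depends on the threat model, which is precisely the balance security practitioners must strike.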
References
- Lee, K., et al. (2022). "Deduplicating Training Data Makes Language Models Better." ACL 2022.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Kandpal, N., et al. (2022). "Deduplicating Training Data Mitigates Privacy Risks in Language Models." ICML 2022.
- Abbas, A., et al. (2023). "SemDeDup: Data-efficient learning at web-scale through semantic deduplication." arXiv:2303.09540.