Training Data Extraction from Code Models
Techniques for recovering proprietary code from code generation model weights — covering memorization detection, targeted extraction, membership inference, and defensive countermeasures.
Code generation models are trained on vast corpora of source code, including public repositories, documentation, and in some cases proprietary code. These models memorize portions of their training data, and with the right techniques, that memorized data can be extracted. This page covers the techniques for extracting training data from code models and the implications for organizations whose code may be in the training set.
How Code Models Memorize
Code generation models like GitHub Copilot (based on OpenAI Codex), Amazon CodeWhisperer, and open-source models like StarCoder and DeepSeek Coder learn by predicting the next token in a sequence of code. During training, the model adjusts its weights to minimize prediction error across the training corpus. When certain patterns appear frequently or with high consistency, the model's weights encode these patterns strongly enough that they can be reproduced verbatim.
Factors That Increase Memorization
Several factors determine how likely a piece of code is to be memorized. Repetition is the strongest factor — code that appears multiple times in the training data, such as widely copied utility functions or boilerplate patterns, is memorized with high fidelity. Distinctiveness also matters — unique code patterns that are unlike anything else in the training set may be memorized because the model cannot generalize them into broader patterns. Context predictability plays a role — code that follows highly predictable patterns (license headers, standard configurations) is easier for the model to memorize because each token strongly predicts the next.
Research has shown that code models memorize approximately 1-10% of their training data at varying levels of fidelity, with exact memorization (verbatim reproduction of 50 or more tokens) occurring for roughly 0.1-1% of training examples.
Types of Memorization
Verbatim memorization reproduces exact sequences from the training data, including variable names, comments, formatting, and even typos. This is the most conclusive form of memorization and the most concerning for intellectual property.
Structural memorization reproduces the logic and structure of training code but with different variable names, formatting, or minor variations. This is harder to detect definitively but still represents extraction of intellectual property.
Pattern memorization reproduces general coding patterns learned from many examples. This is how models are intended to work and is generally not considered extraction, though the line between pattern memorization and structural memorization can be blurry.
Extraction Techniques
Technique 1: Prefix-Based Extraction
The most straightforward extraction technique provides the model with the beginning of a known code sequence and asks it to complete the rest. If the model has memorized the code, it will reproduce it.
To execute this technique, identify a target codebase that may be in the training data. Select distinctive code sequences — functions with unusual names, unique comment patterns, or domain-specific logic. Provide the first several lines as a prompt and request completion. Compare the completion against the known source code for verbatim matches.
The effectiveness of prefix-based extraction depends on the length and distinctiveness of the prefix. Longer, more distinctive prefixes produce more reliable extraction. Very short or generic prefixes may trigger pattern-based generation rather than memorization-based reproduction.
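A minimal harness for this comparison might look like the following sketch. The `complete` callable is a stand-in for whatever model API is under test, and whitespace tokenization is a simplification of the model's real tokenizer; the 90% threshold is an illustrative assumption, not an established cutoff.

```python
import difflib

def verbatim_overlap(completion: str, reference: str) -> float:
    """Fraction of completion tokens that appear, in order, in the reference.

    Whitespace tokenization is a simplification; a real harness would use
    the model's own tokenizer.
    """
    comp_tokens = completion.split()
    ref_tokens = reference.split()
    matcher = difflib.SequenceMatcher(None, comp_tokens, ref_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(comp_tokens), 1)

def extract_with_prefix(complete, source: str, prefix_lines: int = 5,
                        threshold: float = 0.9):
    """Prompt the model with the first `prefix_lines` lines of a known file
    and compare the completion against the remainder of that file."""
    lines = source.splitlines()
    prefix = "\n".join(lines[:prefix_lines])
    remainder = "\n".join(lines[prefix_lines:])
    completion = complete(prefix)
    score = verbatim_overlap(completion, remainder)
    return score, score >= threshold
```

In practice the harness would iterate over many distinctive functions from the target codebase and record which ones exceed the match threshold.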
Technique 2: Context-Guided Extraction
Rather than providing exact prefixes, this technique provides contextual information that guides the model toward reproducing memorized code.
Set up a file context that mimics the original codebase: use the same filename, directory structure hints, import statements, and class names. Then request the model to generate functions or methods that would logically exist in this context. If the model has memorized code from this codebase, the contextual cues may trigger reproduction.
This technique is less precise than prefix-based extraction but can discover memorized code even when the exact prefix is not available. It is particularly effective for extracting code from well-known open-source projects where the project structure is public knowledge.
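Setting up the context can be as simple as assembling a file-shaped prompt that ends in a dangling signature. All names in this sketch (filename, imports, class and method names) are hypothetical inputs the tester supplies from public knowledge of the project:

```python
def build_context_prompt(filename: str, imports: list,
                         class_name: str, method_name: str) -> str:
    """Assemble a file-shaped prompt that mimics the target codebase.

    None of these values needs to be an exact prefix of the training file;
    the goal is to recreate enough context to trigger reproduction.
    """
    lines = [f"# File: {filename}", ""]
    lines.extend(imports)
    lines += ["", "", f"class {class_name}:", f"    def {method_name}(self,"]
    return "\n".join(lines)
```

For example, `build_context_prompt("auth/session.py", ["import hmac", "import time"], "SessionManager", "validate_token")` yields a prompt whose unfinished signature invites the model to complete the method body as it remembers it.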
Technique 3: Iterative Refinement
Start with a broad prompt and iteratively narrow it based on the model's responses. If the model generates code that partially matches a known target, use the matching portions as new prompts to extract additional matching code.
This technique is useful when the target code is not precisely known — for example, when assessing whether a competitor's proprietary code might be in the training data based on publicly known API signatures or architectural patterns.
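One way to sketch the refinement loop is to grow the prompt one accepted line at a time, where `complete` again stands in for the model API. A long run of consistent extensions suggests the model is walking through a memorized file rather than improvising:

```python
def iterative_extract(complete, seed: str, rounds: int = 10) -> str:
    """Grow the prompt greedily: after each completion, keep only the
    first line and re-prompt with the extended text."""
    text = seed
    for _ in range(rounds):
        completion = complete(text)
        first_line = completion.splitlines()[0].rstrip() if completion.strip() else ""
        if not first_line:
            break
        text += "\n" + first_line
    return text
```

A real harness would also checkpoint the accepted lines against any partially known target, discarding branches that stop matching.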
Technique 4: Temperature and Sampling Manipulation
Model outputs are controlled by sampling parameters like temperature, top-p, and top-k. Lower temperature settings produce more deterministic outputs that are closer to the model's highest-confidence predictions. For memorized content, the highest-confidence prediction is often the memorized training data.
Running extraction attempts at temperature 0 (greedy decoding) maximizes the likelihood of reproducing memorized content. Alternatively, running multiple extractions at varying temperatures and comparing the outputs can distinguish memorized content (which stays consistent across temperatures) from generated content (which varies with temperature).
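The consistency check reduces to a few lines. Here `complete(prefix, temperature)` is a stand-in for the model API, and the specific temperature grid is an illustrative choice:

```python
def consistent_across_temperatures(complete, prefix: str,
                                   temps=(0.2, 0.7, 1.0), trials: int = 3) -> bool:
    """Sample the same prefix at several temperatures and trial counts.

    Memorized content tends to come back identical every time; generated
    content drifts as the temperature rises.
    """
    outputs = {complete(prefix, t) for t in temps for _ in range(trials)}
    return len(outputs) == 1
```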
Technique 5: Divergence-Based Detection
This technique detects memorization by looking for a characteristic pattern: the model's output starts with varied, creative generation and then suddenly snaps into a highly consistent sequence. This divergence point indicates where the model transitions from generating to reproducing memorized content.
To detect this, generate many completions for the same prefix with non-zero temperature. Measure the token-level entropy across generations. A sudden drop in entropy indicates that the model has entered a memorized sequence where all generation attempts converge on the same output.
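The entropy measurement above can be sketched with whitespace tokens standing in for real model tokens; the 0.1-bit cutoff is an illustrative assumption:

```python
import math
from collections import Counter

def positional_entropy(generations):
    """Shannon entropy (bits) of the token distribution at each position
    across a set of sampled generations, up to the shortest generation."""
    token_lists = [g.split() for g in generations]
    n = min(len(t) for t in token_lists)
    entropies = []
    for i in range(n):
        counts = Counter(t[i] for t in token_lists)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

def find_convergence_point(entropies, threshold: float = 0.1):
    """Index where entropy first drops near zero and stays low: the point
    where sampled generations snap onto a single, likely memorized,
    sequence. Returns None if no such point exists."""
    for i, h in enumerate(entropies):
        if h <= threshold and all(e <= threshold for e in entropies[i:]):
            return i
    return None
```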
Membership Inference
Membership inference determines whether specific code was in the model's training data without necessarily extracting it. This is useful for intellectual property audits: determining whether your organization's proprietary code was used to train a model without your consent.
Loss-Based Inference
If you have access to the model's token-level loss (negative log-probability) for a given input, you can use loss as a membership signal. Training data typically has lower loss (higher probability) than data the model has not seen. Compute the model's loss on the target code and compare it against loss on similar but definitely-not-in-training code. Significantly lower loss on the target code suggests it was in the training set.
This technique requires access to the model's token probabilities, which some APIs expose (for example, OpenAI's logprobs parameter) but many do not.
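Given per-token log-probabilities, the comparison reduces to a few lines. The one-nat margin here is an illustrative assumption, not a calibrated threshold; a serious audit would calibrate it on known in- and out-of-training samples:

```python
import statistics

def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token; lower means the model found
    the text more predictable."""
    return -statistics.fmean(token_logprobs)

def membership_signal(target_logprobs, baseline_logprobs_list, margin: float = 1.0):
    """Compare the target snippet's loss against similar, definitely-unseen
    baseline snippets. A loss at least `margin` below the baseline mean is
    treated as evidence of membership."""
    target_loss = mean_nll(target_logprobs)
    baseline = statistics.fmean(mean_nll(lp) for lp in baseline_logprobs_list)
    return target_loss, baseline, (baseline - target_loss) >= margin
```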
Comparison-Based Inference
Compare the model's completion of the target code against its completion of semantically equivalent but syntactically different code. If the model consistently favors the exact syntax of the target code over valid alternatives, this suggests memorization of the specific syntax rather than general pattern learning.
For example, if the model consistently generates for (int i = 0; i < n; i++) when the target code uses this exact form, but would equally likely generate for (int idx = 0; idx < count; idx++) for non-training code, this asymmetry suggests memorization.
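The asymmetry can be quantified as a preference gap, where `score` is a stand-in for a function returning a snippet's total log-probability under the model. A large positive gap on the target code, absent on control code, points at memorization of the specific surface form:

```python
def syntax_preference_gap(score, exact: str, variants) -> float:
    """Log-probability gap between the exact training-set syntax and the
    best-scoring semantically equivalent rewrite."""
    return score(exact) - max(score(v) for v in variants)
```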
Canary-Based Detection
If you are an organization concerned about future training data use, insert canary strings into your code — unique, identifiable strings that serve no functional purpose but can be detected if the model memorizes them. If a model reproduces your canary strings, it confirms that your code was in the training data.
Canaries should be unique enough to not appear in any other code, embedded naturally enough to survive code preprocessing, and present in multiple locations across your codebase to increase detection probability.
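Generating such a canary is straightforward; the tag and comment format below are illustrative, and any scheme works so long as the string cannot collide with real code:

```python
import secrets

def make_canary(org_tag: str = "ACME") -> str:
    """Generate a globally unique, greppable canary comment (128 bits of
    randomness). Embed it where formatters and minifiers will keep it,
    e.g. in a docstring or trailing comment."""
    return f"# {org_tag}-CANARY-{secrets.token_hex(16)}"
```

Later, prompt the model with the code surrounding each canary and search completions for the random token; a match confirms the file was in the training set.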
Implications for Organizations
Intellectual Property Risks
If your organization's code is in a model's training data, that model may reproduce your code in other users' suggestions. This has implications for trade secret protection, as code that a model can reproduce may no longer qualify for trade secret status if it was publicly available during training. It affects licensing compliance, since code suggested by the model may carry license obligations from the original source that the receiving developer does not honor. And it creates competitive intelligence risks, as competitors using the same model may receive suggestions based on your code patterns, architectural decisions, or algorithmic approaches.
Defensive Measures
Organizations concerned about code extraction should consider several measures. License enforcement through code scanning detects when model-generated code matches your codebase and may be subject to your license terms. Code obfuscation for public repositories makes code harder to extract meaningfully, though this conflicts with code readability goals. Canary deployment inserts detectable markers that enable membership inference testing. Training data opt-out requests are honored by some model providers who allow repository owners to exclude their code from training data. And legal agreements with model providers can require contractual commitments about training data usage.
Assessment Methodology
To assess your organization's exposure to training data extraction, first identify which of your codebases may have been publicly accessible during model training periods. Then use the extraction techniques described above to test whether distinctive code from these codebases can be recovered. Use membership inference techniques to determine the extent of memorization. Finally, evaluate the business impact of any extracted code in terms of intellectual property, competitive intelligence, and licensing.
Countermeasures by Model Providers
Model providers are implementing various countermeasures against training data extraction. Deduplication during training reduces verbatim memorization by removing duplicate code from the training set. Differential privacy adds noise during training to prevent memorization of individual examples, though this comes at the cost of model quality. Output filtering compares model outputs against a database of known training data and blocks or flags matches. Membership inference resistance applies techniques during training that make it harder to determine whether specific data was in the training set.
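The output-filtering countermeasure can be sketched as an n-gram index over known training data. Hashing and whitespace tokenization here are simplifications of production systems, and the window size is an illustrative choice:

```python
def build_index(corpus_snippets, n: int = 8) -> set:
    """Hash every n-token window of the known training corpus."""
    index = set()
    for snippet in corpus_snippets:
        tokens = snippet.split()
        for i in range(len(tokens) - n + 1):
            index.add(hash(tuple(tokens[i:i + n])))
    return index

def ngram_filter(output: str, training_index: set, n: int = 8) -> bool:
    """Flag an output if any n-token window appears in the index of known
    training data; the caller can then block or annotate the suggestion."""
    tokens = output.split()
    return any(hash(tuple(tokens[i:i + n])) in training_index
               for i in range(len(tokens) - n + 1))
```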
These countermeasures reduce but do not eliminate extraction risk. Red teamers should test whether these countermeasures are effective for their specific target by applying the extraction techniques described above and measuring their success rate.
The tension between model capability and memorization is fundamental. A model that perfectly memorizes nothing is a model that cannot reproduce useful patterns. The challenge for the field is to find the balance where models are useful without being exploitable for intellectual property extraction.