Deduplicating Training Data Makes Language Models Better

                                             Katherine Lee∗†          Daphne Ippolito∗†‡                 Andrew Nystrom†                     Chiyuan Zhang†

                                                    Douglas Eck†                    Chris Callison-Burch‡                           Nicholas Carlini†

                                                               Abstract                                    We show that one particular type of bias, dupli-
                                              We find that existing language modeling                   cated training examples, is pervasive: 10% of the
                                              datasets contain many near-duplicate exam-                sequences in several common NLP datasets are re-
                                              ples and long repetitive substrings.       As             peated multiple times. While naive deduplication
                                              a result, over 1% of the unprompted out-
                                                                                                        is straightforward (and the datasets we consider al-
                                              put of language models trained on these                   ready perform some naive form of deduplication),
                                              datasets is copied verbatim from the train-               performing thorough deduplication at scale is both
                                              ing data. We develop two tools that allow
                                                                                                        computationally challenging and requires sophisti-
                                              us to deduplicate training datasets—for exam-
                                              ple removing from C4 a single 61 word En-                 cated techniques.
                                              glish sentence that is repeated over 60,000                  We propose two scalable techniques to detect
                                              times. Deduplication allows us to train mod-              and remove duplicated training data. Exact sub-
                                              els that emit memorized text ten times less               string matching identifies verbatim strings that are
                                              frequently and require fewer train steps to               repeated. This allows us to identify cases where
                                              achieve the same or better accuracy. We                   only part of a training example is duplicated (§4.1).
                                              can also reduce train-test overlap, which af-
                                                                                                        Approximate full document matching uses hash-
                                              fects over 4% of the validation set of stan-
                                              dard datasets, thus allowing for more accurate            based techniques (Broder, 1997) to identify pairs
                                              evaluation. We release code for reproducing               of documents with high n-gram overlap (§4.2).
                                              our work and performing dataset deduplication                We identify four distinct advantages to training
                                              at                    on datasets that have been thoroughly deduplicated.
                                                                                                          1. Over 1% of tokens emitted unprompted from
                                         1    Introduction                                                   a model trained on standard datasets (e.g., C4)
                                         A key factor behind the recent progress in natural                  are part of a memorized sequence (See §6.2)—
                                         language processing is the development of large-                    even though the 1.5 billion parameter model
                                         scale text corpora used to train increasingly large                 is much smaller than the 350GB dataset it
                                         language models. These datasets have grown from                     was trained on. By deduplicating the training
                                         just a few gigabytes to hundreds of gigabytes over the              dataset we reduce the rate of emitting memo-
                                         past few years (Chelba et al., 2013; Xue et al., 2020;              rized training data by a factor of 10×.
                                         Graff et al., 2003; Brown et al., 2020). Because it is
                                                                                                          2. Train-test overlap is common in non-
                                         so expensive to perform manual review on nearly-
                                                                                                             deduplicated datasets. For example, we find a
                                         terabyte-scale datasets, they tend to be lower quality than
                                                                                                             61-word sequence1 in C4 (Raffel et al., 2020)
                                         smaller, more curated datasets. These data issues
                                                                                                             that is repeated 61,036 times verbatim in the
                                         have implications far beyond metrics like perplex-
                                                                                                             training dataset and 61 times in the validation
                                         ity or validation loss, as learned models reflect the
                                                                                                             set (0.02% of the samples in each dataset).
                                         biases present in their training data (Bender et al.,
                                                                                                             This train-test set overlap not only causes re-
                                         2021; Wallace et al., 2019; Sheng et al., 2020). As
                                                                                                             searchers to over-estimate model accuracy, but
                                         a result, quantitatively and qualitatively understand-
                                         and follow the current trends in the field of that make you
                                         more inspired and give artistic touches. We'd be honored if
                                                                                                        you can apply some or all of these design in your wedding.
                                             ∗ Equal contribution. † Google Research, Brain Team.
                                         ‡ University of Pennsylvania. Correspond to kather-
                                and

also biases model selection towards models              Models trained on CommonCrawl in-
       and hyperparameters that intentionally overfit          clude GPT-3 (Brown et al., 2020) with the addition
       their training datasets.                                of book datasets, GROVER (Zellers et al., 2019) on
                                                               a restricted subset filtered to news domains called
    3. Training models on deduplicated datasets is             RealNews, and T5 (Raffel et al., 2020) on a cleaned
       more efficient. Processing a dataset with our           version of common crawl called C4. Other models
       framework requires a CPU-only linear-time               are trained on more curated Internet sources—for
       algorithm. Because these datasets are                   example Guo et al. (2020) used high quality pro-
       up to 19% smaller, even including the dedu-             cessed Wikipedia text from 40 different languages
       plication runtime itself, training on dedupli-          to train monolingual 141.4M parameter language
       cated datasets directly reduces the training            models. Non-English models necessarily use dif-
       cost in terms of time, dollars, and the environ-        ferent datasets; Zeng et al. (2021) for instance in-
       ment (Bender et al., 2021; Strubell et al., 2019;       troduced PANGU-α, a family of models with up to
       Patterson et al., 2021).                                200B parameters that were trained on a non-public
                                                               corpus of cleaned and filtered Chinese-language
    4. Deduplicating training data does not hurt
                                                               documents from CommonCrawl and other sources.
       perplexity: models trained on deduplicated
                                                               Since many of these datasets are not public, we
       datasets have no worse perplexity compared
                                                               deduplicate three that are: Wiki-40B, C4, and
       to baseline models trained on the original
                                                               RealNews–as well as the One Billion Word Lan-
       datasets. In some cases deduplication reduces
                                                               guage Model Benchmark (Chelba et al., 2013), a
       perplexity by up to 10%. Further, because re-
                                                               smaller dataset commonly used for evaluation.
       cent LMs are typically limited to training for
       just a few epochs (Radford et al., 2019; Raffel
       et al., 2020), by training on higher quality data
       the models can reach higher accuracy faster.            Contamination of downstream tasks. When
                                                               models are trained on datasets constructed by crawl-
To summarize, data deduplication offers significant            ing the Internet, it is possible the model will train
advantages and no observed disadvantages. In the               on the test set of downstream target tasks. For ex-
remainder of this paper we present our text dedu-              ample, Radford et al. (2019, §4) performed a post-
plication framework in §4, and study the extent of             hoc analysis to identify 8-gram overlaps between
duplicate content in common NLP datasets (e.g.,                GPT-2’s training set and datasets used for evalu-
C4, Wiki-40B, and LM1B) in §5. We then exam-                   ation, and Dodge et al. (2021b) analyzed C4 and
ine the impact of deduplication on test perplexity             found that up to 14.4% of test examples for various
(§6.1) and on the frequency of emitting memorized              standard tasks were found verbatim (normalizing
content (§6.2). Finally, we analyze to what ex-                for capitalization and punctuation) in the dataset.
tent perplexity on existing, released models is                A more proactive approach removes contaminated
skewed as a result of overlap between the train and            data. Trinh and Le (2018, Appendix B) removed
test/validation splits (§6.3).                                 documents from their CommonCrawl-based train
                                                               set that overlapped substantially with the common-
2     Related Work                                             sense reasoning used for evaluation. And GPT-3
                                                               (Brown et al., 2020, §5) did the reverse and re-
Large language model datasets. While we be-
                                                               moved downstream evaluation examples from their
lieve our results are independent of model archi-
                                                               training data by conservatively filtering out any
tecture, we perform our analysis on Transformer-
                                                               train set examples with a 13-gram overlap with
based decoder-only language models (Vaswani
                                                               any evaluation example. Up to 90% of tasks were
et al., 2017) trained for open-ended text generation.
                                                               flagged as potentially contaminated.
These current state-of-the-art models are trained
on internet text. For example, the GPT-2 family                   In our research, we do not focus on the impact of
of models (Radford et al., 2019) is trained on Web-            duplicate text in pretrained models on downstream
Text, a dataset of web documents highly ranked on              benchmark tasks; instead we address how duplicate
Reddit—however this dataset was not made avail-                text in the LM training and validation sets impacts
able publicly. A common dataset starting point                 model perplexity and the extent to which generated
is CommonCrawl, an index of public webpages.                   text includes memorized content.

Memorizing Train Sets. The risks of data mem-              was introduced as a pre-training dataset for T5, a set
orization, for example the ability to extract sen-         of encoder-decoder models which have been widely
sitive data such as valid phone numbers and IRC            used in fine-tuned downstream tasks. The dataset
usernames, are highlighted by Carlini et al. (2020).       was previously deduplicated in a more sophisti-
While their paper identifies 604 samples that              cated process than the prior two datasets. Each
GPT-2 emitted from its training set, we show that          paragraph was hashed and paragraphs resulting in
over 1% of the data these models emit is memorized         hash collisions were removed. This was followed
training data. In computer vision, memorization of         by a pass that removed placeholder text, code, and
training data has been studied from various angles         prohibited words. See Dodge et al. (2021a) for a
for both discriminative and generative models (e.g.        detailed breakdown of the source text in C4.
Arpit et al., 2017; Webster et al., 2019; Feldman
                                                           RealNews is a subset of the Common Crawl
and Zhang, 2020; Stephenson et al., 2021; Teter-
                                                           consisting of articles from news domains (Zellers
wak et al., 2021).
                                                           et al., 2019). It contains 31M documents with
Duplicate text in training data. The Book Cor-             average length 793 BPE tokens. RealNews was
pus (Zhu et al., 2015), which was used to train pop-       de-duplicated by inserting a hash of the first 100
ular models such as BERT, has a substantial amount         characters of each document into a bloom filter and
of exact-duplicate documents according to Bandy            then excluding any example whose hash matched
and Vincent (2021). Allamanis (2019) shows that            an example already added to the dataset. Like C4,
duplicate examples in code datasets cause wors-            examples with duplicate URLs were excluded.
ened performance on code understanding tasks.
                                                           4     Methods for Identifying Duplicates
3   Language Modeling Datasets
                                                           The simplest technique to find duplicate examples
We analyze the presence of duplicate text in four          would be to perform exact string matching between
datasets of varying sizes that have been used for          all example pairs, but as we will show, this is insuf-
training natural language generation systems, pro-         ficient. We introduce two complementary methods
ducing general-purpose pre-trained models, and for         for performing deduplication. First, using a suf-
language model benchmarking. While this paper              fix array (Manber and Myers, 1993), we remove
restricts itself to English datasets, we expect that       duplicate substrings from the dataset if they oc-
non-English datasets suffer from similar issues and        cur verbatim in more than one example. Second,
could likewise benefit from de-duplication.                we use MinHash (Broder, 1997), an efficient algo-
                                                           rithm for estimating the n-gram similarity between
Wikipedia (Wiki-40B) consists of multi-lingual             all pairs of examples in a corpus, to remove entire
cleaned Wikipedia text (Guo et al., 2020). We              examples from the dataset if they have high n-gram
take the English portion, which contains 2.9M              overlap with any other example.
Wikipedia pages with an average length of 768 BPE             We consider a dataset $D = \{x_i\}_{i=1}^{N}$ as a collec-
tokens. The dataset creators do not indicate any           tion of examples $x_i$. Each of these examples is itself
deduplication was performed aside from removing            a sequence of tokens: $x_i = x_i^1, x_i^2, \cdots, x_i^{s_i}$.
redirect-pages (e.g., “sunflower” to “Helianthus”).
                                                           4.1    Exact Substring Duplication
One-Billion Word benchmark (LM1B) con-
tains 30M sentences of news commentary (Chelba             Due to the diversity of possibilities in human lan-
et al., 2013). Unlike the other datasets we analyze,       guage, it is rare for the same idea to be expressed
LM1B’s examples are one sentence long rather               identically in multiple documents unless one ex-
than multi-sentence documents. The average ex-             pression is derived from the other, or both are quot-
ample length is 32 BPE tokens. While this dataset          ing from a shared source. This observation moti-
is extremely standard for benchmarking language            vates deduplicating exact substrings. We call our
models, Radford et al. (2019, Sec 4) note it has           approach EXACTSUBSTR. When two examples
13.2% overlap of the test set with the train set.          $x_i$ and $x_j$ share a sufficiently long substring (that
                                                           is, a substring for which $x_i^{a..a+k} = x_j^{b..b+k}$), that
Colossal Cleaned Common Crawl (C4) is                      substring is removed from one of them. Based on
made up of 360M web documents, with an average             statistical analyses (§4.1.3), we select k = 50 to-
length of 486 BPE tokens (Raffel et al., 2020). C4         kens as the minimum matching substring length.

A breakdown of the computation needed for this
approach can be found in Appendix B.
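
To make the exact-substring criterion concrete, the following sketch checks whether two tokenized examples share any window of k tokens. It is a naive illustration only, not the suffix-array method this paper uses at scale. The function names are hypothetical, each window is assumed to be exactly k tokens long, and memory grows with k times the example length.

```python
from typing import List, Set, Tuple

K = 50  # minimum matching substring length in tokens, as selected in the text


def shared_windows(x_i: List[int], x_j: List[int], k: int = K) -> Set[Tuple[int, ...]]:
    """Return every length-k token window that occurs in both examples."""
    # range() is empty when an example is shorter than k, so short inputs yield no windows.
    windows_i = {tuple(x_i[a:a + k]) for a in range(len(x_i) - k + 1)}
    windows_j = {tuple(x_j[b:b + k]) for b in range(len(x_j) - k + 1)}
    return windows_i & windows_j


def has_duplicate_substring(x_i: List[int], x_j: List[int], k: int = K) -> bool:
    """True if the two examples share at least one verbatim run of k tokens."""
    return bool(shared_windows(x_i, x_j, k))


if __name__ == "__main__":
    # Toy token ids: the second example embeds a 60-token run copied from the first.
    first = list(range(100))
    second = [999] * 10 + list(range(20, 80)) + [888] * 5
    print(has_duplicate_substring(first, second))
```

Comparing every pair of examples this way is quadratic in the number of examples, which is the cost that motivates the suffix array.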
4.1.1 Suffix Arrays
This exact-substring-matching criterion, while con-
ceptually simple, is computationally prohibitive
with naive (quadratic) all-pair matching. To solve
this problem, we concatenate all the examples of
the entire dataset D into a giant sequence S, and
construct a Suffix Array A of S. A suffix array
(Manber and Myers, 1993) is a representation of a
suffix tree (Weiner, 1973) that can be constructed
in linear time in $\|S\|$ (Kärkkäinen and Sanders,
2003) and allows for efficient computation of many
                                                               Figure 1: For each substring of length k, we plot the
substring queries—and in particular allows us to
                                                               probability that there exists a second identical length-
identify duplicated training examples in linear time.          k substring in the same train set. Matches with length
Suffix arrays have been used widely in NLP for                 under 10 tokens are common, and account for 90% of
applications such as efficient TF-IDF computation              tokens. We choose a threshold of 50 for experiments.
(Yamamoto and Church, 2001) and document clus-
tering (Chim and Deng, 2007).
   The Suffix Array A for a sequence S is a lexico-            count it as a duplicate. In Figure 1, we plot the
graphically-ordered list of all suffixes contained in          frequency of substring matches within the four
the sequence. Formally,                                        datasets we will consider. For each substring of
                                                               length k, we compute the probability that there ex-
          A(S) = arg sort all_suffixes(S)                      ists another sequence of length k identical to this
                                                               one; formally:
For example, the suffixes of the sequence “banana”
are (“banana”, “anana”, “nana”, “ana”, “na”, “a”)              $m(k) = \Pr_{i \in [N]}\left[\exists j \neq i : S_{i..i+k} = S_{j..j+k}\right]$.
and so the suffix array is the sequence (6 4 2 1 5 3).
   Suffix arrays are often preferable to suffix trees          We choose 50 tokens as the threshold to be conser-
because, while asymptotically less efficient for               vative: the “bend in the knee” occurs at 10 tokens,
some types of queries, they are ten to a hundred               and manual inspection of length-25 matches found
times more memory efficient (Manber and Myers,                 no false positives. We then doubled this value to
1993) requiring just 8 bytes per input token.                  have an exceptionally large margin for error.
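
As a small illustration of the definition above, the sketch below builds a suffix array by directly sorting suffixes, then scans adjacent entries for long shared prefixes. It is not the linear-time construction cited in the text: sorting whole suffixes is far slower and is only suitable for toy inputs. The function names are hypothetical, and positions are 1-indexed to match the “banana” example.

```python
from typing import List, Sequence, Tuple


def suffix_array(s: Sequence) -> List[int]:
    """Naive suffix array: sort suffix start positions by the suffix they begin.

    Positions are returned 1-indexed to match the "banana" example.
    """
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return [i + 1 for i in order]


def common_prefix_len(a: Sequence, b: Sequence) -> int:
    """Length of the longest common prefix of two sequences."""
    n = 0
    for u, v in zip(a, b):
        if u != v:
            break
        n += 1
    return n


def repeated_substrings(s: Sequence, threshold: int) -> List[Tuple[int, int, int]]:
    """Report (pos_a, pos_b, match_len) for adjacent suffix-array entries
    whose suffixes share a prefix of at least `threshold` items."""
    sa = suffix_array(s)
    hits = []
    for p, q in zip(sa, sa[1:]):
        n = common_prefix_len(s[p - 1:], s[q - 1:])
        if n >= threshold:
            hits.append((p, q, n))
    return hits


if __name__ == "__main__":
    print(suffix_array("banana"))            # [6, 4, 2, 1, 5, 3]
    print(repeated_substrings("banana", 3))  # [(4, 2, 3)]
```

Suffixes that begin with the same text sort next to each other, so the repeat “ana” (positions 4 and 2) shows up as a pair of neighbouring entries. The same functions accept a list of token ids in place of a string.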
4.1.2 Parallel Substring matching
                                                               4.2    Approximate Matching with MinHash
After constructing A, it is straightforward to iden-
tify duplicated training examples. Suppose that                We also perform approximate deduplication based
the sequence s was repeated exactly twice in the               on matching entire examples. This method, which
training dataset S at positions i and j, that is,              we call NEARDUP, is a good complement to the
$S_{i..i+|s|} = S_{j..j+|s|}$. Then the indices i, j will occur       exact substring matching, especially for web crawl
adjacent to each other in the suffix array A.                  text, as it handles the very common case of docu-
   Finding all repeated sequences is therefore a mat-          ments being identical except for interspersed tem-
ter of linearly scanning the suffix array from be-             plated fields (such as the last row of Table 1).
ginning to end and looking for sequences $A_i, A_{i+1}$           MinHash (Broder, 1997) is an approximate
that share a common prefix of at least some thresh-            matching algorithm widely used in large-scale
old length. Any satisfying sequences are recorded.             deduplication tasks (Versley and Panchenko, 2012;
This algorithm is embarrassingly parallel, and so              Gabriel et al., 2018; Gyawali et al., 2020), in-
we can efficiently process the dataset.                        cluding to deduplicate the training set for a large
                                                               Chinese-language LM (Zeng et al., 2021). Given
4.1.3 Setting a threshold of duplicates                        two documents xi and xj , the main idea is to rep-
The final question that remains to be answered is              resent each document by its respective set of n-
how long a substring match must be before we                   grams di and dj . We can then use hash functions

Dataset                          Example                                                         Near-Duplicate Example
   Wiki-40B      \n_START_ARTICLE_\nHum                      Award                 \n_START_ARTICLE_\nHum Award for Best Actor
                 for          Most       Impactful        Character                in a Negative Role \n_START_SECTION_\nWinners
                 \n_START_SECTION_\nWinners and nom-                               and nominees\n_START_PARAGRAPH_\nIn the list
                 inees\n_START_PARAGRAPH_\nIn the list                             below, winners are listed first in the colored row, fol-
                 below, winners are listed first in the colored row,               lowed by the other nominees. [...]
                 followed by the other nominees. [...]
   LM1B          I left for California in 1979 and tracked Cleveland               I left for California in 1979 , and tracked Cleveland
                 ’s changes on trips back to visit my sisters .                    ’s changes on trips back to visit my sisters .
   RealNews      KUALA LUMPUR (Reuters) - Roads in South-                          A visitor looks at a Triumph motorcycle on display at
                 east Asia have been getting a little louder lately                the Indonesian International Motor Show in Jakarta
                 as motorcycle makers, an aspiring middle class                    September 19, 2014. REUTERS/Darren Whiteside\n
                 and easy bank credit come together to breed a new                 KUALA LUMPUR (Reuters) - Roads in Southeast
                 genus of motorcyclists – the big-bike rider. [...]                Asia have been getting a little [...] big-bike rider. [...]

   C4            Affordable and convenient holiday flights take                    Affordable and convenient holiday flights take off
                 off from your departure country, "Canada". From                   from your departure country, "USA". From April
                 May 2019 to October 2019, Condor flights to your                  2019 to October 2019, Condor flights to your dream
                 dream destination will be roughly 6 a week! Book                  destination will be roughly 7 a week! Book your
                 your Halifax (YHZ) - Basel (BSL) flight now, and                  Maui Kahului (OGG) - Dubrovnik (DBV) flight now,
                 look forward to your "Switzerland" destination!                   and look forward to your "Croatia" destination!

Table 1: Qualitative examples of near-duplicates identified by NEARDUP from each dataset. The similarity be-
tween documents is highlighted. Note the small interspersed differences that make exact duplicate matching less
effective. Examples ending with “[...]” have been truncated for brevity.
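
Pairs like those in Table 1 are found by the approximate matching of §4.2. The sketch below illustrates the quantities involved under several assumptions: tokens are plain strings, n-grams are hashed with SHA-1 truncated to 64 bits, and the Jaccard estimate uses one common bottom-k estimator. All function names are hypothetical, and this is not the released implementation.

```python
import hashlib
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of n-grams of a token sequence (5-grams, as in the text)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(d_i: Set, d_j: Set) -> float:
    """Exact Jaccard index |d_i & d_j| / |d_i | d_j|."""
    union = d_i | d_j
    return len(d_i & d_j) / len(union) if union else 0.0


def bottom_k_signature(d: Set[Tuple[str, ...]], k: int) -> List[int]:
    """Hash every n-gram and keep the k smallest hash values."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(" ".join(g).encode("utf-8")).digest()[:8], "big")
        for g in d
    )
    return hashes[:k]


def estimate_jaccard(sig_i: List[int], sig_j: List[int], k: int) -> float:
    """Bottom-k estimate: among the k smallest hashes of the union,
    the fraction that appears in both signatures."""
    merged = sorted(set(sig_i) | set(sig_j))[:k]
    both = set(sig_i) & set(sig_j)
    return sum(h in both for h in merged) / len(merged) if merged else 0.0


def match_probability(s: float, b: int = 20, r: int = 450) -> float:
    """Probability 1 - (1 - s^b)^r that a pair with Jaccard index s is flagged."""
    return 1.0 - (1.0 - s ** b) ** r


def edit_distance(x: Sequence, y: Sequence) -> int:
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(y) + 1))
    for a, u in enumerate(x, 1):
        cur = [a]
        for b, v in enumerate(y, 1):
            cur.append(min(prev[b] + 1, cur[b - 1] + 1, prev[b - 1] + (u != v)))
        prev = cur
    return prev[-1]


def edit_similarity(x: Sequence, y: Sequence) -> float:
    """EditSim = 1 - EditDistance / max(|x|, |y|); pairs above 0.8 count as duplicates."""
    longest = max(len(x), len(y))
    return 1.0 - edit_distance(x, y) / longest if longest else 1.0
```

With b = 20 and r = 450 the match probability rises steeply as the Jaccard index approaches 1, so only highly similar pairs reach the more expensive edit-similarity check.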

   Group size       Number of groups (C4)
   [5001, ∞)                        280
   [501, 5000)                    2,782
   [51, 500)                     23,094
   [21, 50)                      28,446
   [11, 20)                      42,723
   [6, 10)                       85,567
   5                             54,984
   4                            109,853
   3                            292,575
   2                          1,861,744
   1                        348,320,475

Figure 2: The distribution of near-duplicate cluster
sizes from running NEARDUP on C4.

to quickly approximate the Jaccard Index (Jaccard,
1912):

              $\mathrm{Jaccard}(d_i, d_j) = \frac{|d_i \cap d_j|}{|d_i \cup d_j|}$

If the Jaccard Index between $d_i$ and $d_j$ is suffi-
ciently high, it is likely that documents are approx-
imate matches of each other. To efficiently approx-
imate the Jaccard index, MinHash constructs doc-
ument signatures by sorting each of the n-grams
via a hash function, and then keeping only the k
smallest hashed n-grams. There are multiple ways
to construct estimators of the Jaccard index from                    0.8. The edit similarity between token sequences
these kinds of signatures (Cohen, 2016).                             xi and xj is defined as:
   In our implementation, we use 5-grams and a
signature of size 9,000. The probability that two                   $\mathrm{EditSim}(x_i, x_j) = 1 - \frac{\mathrm{EditDistance}(x_i, x_j)}{\max(|x_i|, |x_j|)}$
documents are considered a potential match is
                                                                    To build clusters of similar documents, we con-
$\Pr(d_i, d_j \mid \mathrm{Jaccard}(d_i, d_j) = s_{i,j}) = 1 - (1 - s_{i,j}^{b})^{r}$
                                                                    struct a graph that has an edge between two doc-
where b = 20 and r = 450 are user-settable pa-                      uments if they are considered a match. Then, we
rameters to control the strength of the filter. See                 use the method introduced in Łącki et al. (2018) to
Appendix A for more details.                                        identify connected components. A breakdown of
   For each pair of documents identified as a poten-                the computation needed is given in Appendix A.
tial match, more computationally expensive similar-
                                                                     5             Deduplication Results
ity metrics can be employed as a subsequent filter-
ing step. In particular, we identify two documents                  We deduplicate each of the four datasets with both
as duplicates if they are matched by the MinHash                    of our two techniques. When text was duplicated
algorithm and their edit similarity is greater than                 across multiple data splits, we prioritized keeping

a copy in the test or validation set and removing it from the train set.

5.1    Amount of Text Removed

With NearDup, we found that the web-scrape datasets contain between 3.04% (on C4) and 13.63% (on RealNews) near duplicates (Table 2). Near-duplicate text is much less common in Wiki-40B, forming only 0.39% of the train set.² In C4, the majority (1.8M) of near-duplicate clusters consisted of just a single pair of examples that matched against each other, but there were 280 clusters with over 5,000 examples in them (Figure 2), including one cluster of size 250,933.

              % train examples with         % valid with
              dup in train   dup in valid   dup in train
 C4                  3.04%          1.59%          4.60%
 RealNews           13.63%          1.25%         14.35%
 LM1B                4.86%          0.07%          4.92%
 Wiki40B             0.39%          0.26%          0.72%

Table 2: The fraction of examples identified by NearDup as near-duplicates.

On average with ExactSubstr, we remove more total content than with NearDup (despite ExactSubstr not removing any examples outright); for example, it removes 7.18% of the tokens in C4. The exception is LM1B, where ExactSubstr removes 8× less data than NearDup. On investigation, we find this is due to the fact that LM1B documents are significantly shorter: 90% of all documents are under 50 tokens, and so are not even candidates for potential matches even if the entire sequence matched verbatim. We find that both NearDup and ExactSubstr remove similar content: 77% of the training examples that NearDup removes from C4 have at least one verbatim length-50 match found by ExactSubstr.

              % train tokens with           % valid with
              dup in train   dup in valid   dup in train
 C4                  7.18%          0.75%          1.38%
 RealNews           19.4%           2.61%          3.37%
 LM1B                0.76%          0.016%         0.019%
 Wiki40B             2.76%          0.52%          0.67%

Table 3: The fraction of tokens (note Table 2 reports the fraction of examples) identified by ExactSubstr as part of an exact duplicate 50-token substring.

² Most duplicates we saw were automatically generated pages, such as the outcomes of sports games. This shows the strength of manual curation for creating high-quality datasets.

5.2    Properties of Duplicated Text

While the authors of both RealNews and C4 explicitly attempted deduplication during dataset construction, the methods were insufficient to capture the more subtle types of duplicate text commonly found on the internet. In C4 and Wiki-40B, we qualitatively observe that much of the text identified as near-duplicated is computer-generated. The text is identical except for the names of places, businesses, products, dates, and so on. Because these examples frequently differ by just a few words at a time, deduplication strategies relying on exact string matching would fail to identify a match. Example duplicate pairs from each dataset can be found in Table 1 (more examples in the Appendix).

For RealNews and LM1B, which are both derived from news sites, we observe that many near-duplicates occur because the same news article appears on multiple news sites with slightly different formatting. For example, in LM1B, there is one example that starts “MINEOLA , N.Y. - New York officials say [...]” and another that starts “( AP ) - New York officials say [...]”. The two examples are otherwise identical.

5.3    Train / Test Set Leakage

Both deduplication methods identify overlap between the train set and the validation set (Table 2). For example, 4.6% of the C4 validation set and 14.4% of the RealNews validation set examples had an approximate duplicate in their respective training sets. Such duplication is problematic since it could cause evaluation metrics to be unfairly inflated for models that are better at memorizing their train sets. We evaluate the effect of this leakage on publicly released models in Section 6.3.

6     Impact on Trained Models

We trained 1.5B parameter “XL”, decoder-only, Transformer-based language models similar to GPT-2, on C4-Original, C4-NearDup, and C4-ExactSubstr, respectively. We use the T5 codebase and model architecture from Raffel et al. (2020), and each model was trained for about two epochs on its respective dataset. To better understand the amount of variance in the perplexities of trained models, we also trained three different random seeds of the 110M parameter “base” model for each of the above three datasets, for a total of nine base-sized models.
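The clustering step of Section 4 and the split-priority rule of Section 5 can be illustrated with a small single-machine sketch. It is not the authors' implementation: the paper identifies connected components with the distributed method of Łącki et al. (2018), whereas this sketch swaps in a plain union-find. The names `SPLIT_PRIORITY`, `cluster_matches`, and `docs_to_remove` are hypothetical, and keeping exactly one document per cluster (the lowest index within the highest-priority split) is an assumption made for illustration.

```python
from collections import defaultdict

# Assumed priority order (lower value is kept in preference): the text only
# says that test/validation copies are kept over train copies.
SPLIT_PRIORITY = {"test": 0, "validation": 1, "train": 2}


class UnionFind:
    """Minimal disjoint-set structure over document indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_matches(num_docs, match_pairs):
    """Return connected components (size >= 2) of the match graph."""
    uf = UnionFind(num_docs)
    for a, b in match_pairs:
        uf.union(a, b)
    clusters = defaultdict(list)
    for doc in range(num_docs):
        clusters[uf.find(doc)].append(doc)
    return [sorted(members) for members in clusters.values() if len(members) > 1]


def docs_to_remove(clusters, split_of):
    """Keep one document per cluster, preferring test/validation over train."""
    removed = set()
    for members in clusters:
        keep = min(members, key=lambda d: (SPLIT_PRIORITY[split_of[d]], d))
        removed.update(d for d in members if d != keep)
    return removed


if __name__ == "__main__":
    splits = ["train", "train", "validation", "train", "train", "train"]
    pairs = [(0, 1), (1, 2), (3, 4)]  # edges between matched documents
    clusters = cluster_matches(len(splits), pairs)
    print("clusters:", clusters)
    print("removed:", sorted(docs_to_remove(clusters, splits)))
```

In this toy input, documents 0, 1, and 2 form one cluster that straddles the train and validation splits, so the validation copy is the one retained and the train copies are dropped.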

[Figure 3, panels (a) Base model and (b) XL model: bar charts of Perplexity (x-axis, 0 to 35) by Evaluation dataset (y-axis: C4 Original, C4 Duplicates, and C4 Unique in both panels; additionally LM1B and Wiki40B in panel (b)), with one bar per Training data variant (legend entries recoverable from the extraction: NearDup, ExactSubstr).]

Figure 3: Impact of deduplicating the training set on validation perplexity. In (a), we plot the results from T5 base (110M parameters) across three training runs with different random initializations. The black bar represents the range from the lowest to the highest perplexity, and the colored bar the median perplexity. In (b), we plot the results from T5 XL (1.5B parameters). For C4, we evaluate on C4 Original, the original validation set; C4 Unique, a subset of the validation set identified by NearDup as having zero matches across C4; and C4 Duplicates, a subset of the validation set identified by NearDup as having a match in the C4 train set.

 Model              1 Epoch    2 Epochs
 XL-Original         1.926%      1.571%
 XL-NearDup          0.189%      0.264%
 XL-ExactSubstr      0.138%      0.168%

Table 4: When generating 100k sequences with no prompting, over 1% of the tokens emitted from a model trained on the original dataset are part of a 50-token long sequence copied directly from the training dataset. This drops to 0.1% for the deduplicated datasets.

in higher perplexity than NearDup-deduplicated. These trends hold true for the XL sized model as well. While this may suggest ExactSubstr deduplication results in models least overfit on the train set, note that both of these techniques have used separate duplicate thresholds and a different choice of thresholds could change the results.

When evaluating on the validation sets of LM1B and Wiki-40B, we found that models trained on NearDup-deduplicated C4 consistently achieved lowest perplexity (for LM1B eval with base models, see Appendix Figure 7). ExactSubstr deduplication decreases the perplexity of the XL model on Wiki-40B by almost 3 points, which is much larger than the variation of about 1 point of perplexity we observed in the base models. This is despite seeing fewer tokens of training data overall.
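The quantity reported in Table 4, the fraction of generated tokens that fall inside a 50-token sequence copied verbatim from the training data, can be approximated on small data as follows. This is a sketch under stated assumptions, not the ExactSubstr tool: it keeps every training window in an in-memory Python set instead of querying a suffix array, which would not scale to a dataset the size of C4. The function names `training_windows` and `memorized_fraction` are hypothetical, and inputs are assumed to be lists of token ids.

```python
def training_windows(token_seqs, window=50):
    """Collect every length-`window` token window that occurs in the corpus."""
    seen = set()
    for tokens in token_seqs:
        for start in range(len(tokens) - window + 1):
            seen.add(tuple(tokens[start:start + window]))
    return seen


def memorized_fraction(generations, train_windows, window=50):
    """Fraction of generated tokens covered by a window found in training data."""
    total = 0
    covered = 0
    for tokens in generations:
        mask = [False] * len(tokens)
        for start in range(len(tokens) - window + 1):
            if tuple(tokens[start:start + window]) in train_windows:
                # Every token of a matching window counts as copied.
                for i in range(start, start + window):
                    mask[i] = True
        total += len(tokens)
        covered += sum(mask)
    return covered / total if total else 0.0


if __name__ == "__main__":
    # Toy example with a window of 3 tokens so the behaviour is easy to follow.
    train = [[1, 2, 3, 4, 5]]
    generated = [[9, 2, 3, 4, 8, 7]]
    windows = training_windows(train, window=3)
    print(memorized_fraction(generated, windows, window=3))
```

Sequences shorter than the window contribute tokens to the denominator but can never match, which mirrors the observation in Section 5.1 that documents under 50 tokens are not candidates for ExactSubstr matches.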
                                                                                               Lastly, we note all our XL models achieved
[Figure 4: bar chart of the Fraction of LM continuations matching true continuation (x-axis, 0.0 to 0.4) by Prompt source (y-axis: train dup, train unique, valid in train, valid unique), with one bar per Training data variant (legend entries recoverable from the extraction: Original, ExactSubstr).]

Figure 4: The proportion of generations which have edit similarity above 0.8 with the groundtruth continuation when using the LM to generate continuations for 32-token prompts identified by NearDup as either duplicated or unique.

ble 4). This is ∼10× more memorization than XL-ExactSubstr or XL-NearDup. Some example subsequences that were copied verbatim from the train set can be found in Table 8 in the Appendix.

With prompting. In most real use cases, language model generation is controlled by providing a prompt for the model to continue. We experiment with four possible prompt sources: training examples identified by ExactSubstr as having near-duplicates in the train set (train dup), training examples identified as unique (train unique), validation set examples with a near-duplicate in the train set (valid in train), and valid examples identified as unique across all splits (valid unique). We select the first 32 tokens of each example as the prompt, which means we can evaluate the fraction of generations which are near-duplicates with the ground-truth continuation for the prompt (Figure 4). When the prompt comes from duplicate examples in the train set, XL-Original reproduces the groundtruth continuation over 40% of the time. XL-ExactSubstr and XL-NearDup still copy the groundtruth more often when the prompt comes from a duplicate example than when the prompt comes from a unique example, suggesting that more stringent deduplication may be necessary to remove memorization tendencies entirely.

6.3    Impact on Existing Models

Train-test leakage does not just impact models trained on C4. In Table 5, we show that whether or not an evaluation example has a near-duplicate in the train set has a significant impact on model perplexity for two standard models: Transformer-XL (Dai et al., 2019), which was trained on LM1B, and GROVER (Zellers et al., 2019), which was trained on RealNews. For Transformer-XL, the perplexity halves on examples identified as near-duplicates. For GROVER, the difference in perplexities is present in both the 124M and 1.5B parameter models but is not quite as stark as for Transformer-XL.

 Model            Dataset     Orig    Dups    Unique
 Transformer-XL   LM1B        21.77   10.11   23.58
 GROVER-Base      RealNews    15.44   13.77   15.73
 GROVER-XL        RealNews     9.15    7.68    9.45

Table 5: For each model, the perplexity of the official validation set (Orig), valid set examples which were identified by NearDup as matches of train set examples (Dups), and valid set examples identified by NearDup as unique (Unique). Due to the size of the RealNews validation set, we evaluated on only the first 25k examples meeting each condition.

Existing models also suffer from the problem of generating text from their train sets. We find that 1.38% of the tokens in the official release of 25k GROVER-Mega outputs³ are part of verbatim matches in RealNews of at least length 50. Likewise, more than 5% of the tokens in ~200k sequences outputted by GPT-Neo 1.3B (Black et al., 2021) are part of a 50-token match with its training data, the Pile (Gao et al., 2020).

³ generator=mega~dataset=p0.90.jsonl

7    Discussion

The focus of this paper is on the datasets used to train language models. While recent work focused on documenting the potential harms that could arise from problematic datasets (Bender and Friedman, 2018; Gebru et al., 2020), less work has been done to quantitatively analyze properties of real language modelling datasets, as Dodge et al. (2021a) have done for C4. Our paper provides analysis on one particular axis, that of data duplication.

Our experiments measured what could be quantified: the amount of duplicate content in common datasets, the effect of deduplication on trained model perplexity, and the reduction of memorized content in trained models through deduplication. We do not focus on the nature of the data being removed by deduplication or memorized by LMs.

Privacy is an important subject for future work, as memorization could have significant privacy consequences. We use the following interpretation of

privacy: if a model reveals information about examples in its training data beyond what is revealed about examples not in its training data, this is a privacy violation (Shokri et al., 2017).⁴ Training on standard datasets that have not yet been deduplicated results in models that are particularly sensitive to examples that happened to be repeated multiple times, and this has negative privacy implications. For instance, it could violate a person’s expectations of privacy if their publicly available personal data appeared in a different, surprising context. In addition, downstream applications of LMs, such as the game AI Dungeon⁵, in most cases should not output memorized content like adverts for real-world products.

We stress that in our experiments, we do not distinguish between undesired memorized text (such as phone numbers), innocuous memorized text (common phrases), and text we may want to be memorized (such as a quote by a public figure), and instead treat all instances of the LM generating text that closely matches the training set as problematic. While we qualitatively observed that much of the identified memorized content was relatively innocuous, a more systematic study of the risks associated with the detected memorization was beyond the scope of this work.

We also do not investigate the negative consequences of deduplication. Some language tasks explicitly require memorization, like document retrieval or closed-book question answering. Also, text that gives attribution is often duplicated across documents, so removing duplicate substrings could correspond to removing just the attribution, which could result in models that learn the content without its attached attribution. Deduplication is also not sufficient to remove privacy-sensitive data like bank passwords and medical records, which should never be used in training data.

Ultimately, whether memorization is a desired property of a language model, or else risky and unwanted, depends both on the nature of the text that has been memorized and on the downstream applications of the trained model. However, because the trend has been towards creating datasets and models that are application-agnostic, we encourage researchers to think carefully about the limitations of the data they have collected and how the model’s intended usage constrains what should be part of the training set. Developing techniques to memorize or forget specific sequences depending on the end application is a promising research direction.

⁴ Another interpretation of privacy focuses on the sensitivity of the data involved, when a model is trained on and able to reproduce personal identifiers or other forms of "private data." Our definition is more expansive.
⁵

8    Conclusion

We encourage future language model research to perform dataset deduplication, either by training on the deduplicated datasets we release, using the deduplication tools we release, or following our approach to deduplicate datasets with new tools.

The exact technique used to perform deduplication is less important than doing stringent deduplication in the first place. On the whole, deduplication does not harm, and sometimes improves, model perplexity, despite the fact that the deduplicated datasets are smaller, and thus, faster to train on. It is especially important that there are no duplicates between the training and testing sets, because overlap here explicitly encourages selecting models that memorize the training data. Lastly, deduplication helps to reduce the privacy concerns around language models memorizing their training data.

9    Acknowledgements

We are grateful to the many researchers whose technical help, feedback, and discussions shaped this project: Jacob Austin, Samy Bengio, Olivier Bousquet, James Bradbury, Fernando Diaz, Mark Diaz, Noah Fiedel, Jonathan Frankle, David Grangier, Stefanie Karp, David Mimno, Gaurav Mishra, Michael Mozer, Sharan Narang, Alex Passos, Adam Roberts, Hanie Sedghi, Jascha Sohl-Dickstein, David So, Florian Tramer, and Yun William Yu. We are also grateful to the Google Brain women who have given us continuous support.

10    Contributions

Each of the authors on this paper significantly contributed to the final results.

• Katherine trained the models used in the paper, built and ran the eval and text generation pipelines, contributed significantly to writing, analysis, and project organization and management.

• Daphne ran the approximate matching data deduplication pipelines, extracted prompts and evaluation datasets, ran eval pipelines, and contributed significantly to planning, writing, and analysis.

• Andrew wrote the code to perform deduplication with approximate matching, helped evaluate energy expenditure, and helped with analysis.

• Chiyuan helped generate plots and contributed to project scoping, writing, and data analysis.

• Chris offered mentorship and guidance throughout the project and contributed to writing.

• Doug offered mentorship and guidance throughout the project and contributed to writing.

• Nicholas wrote the suffix array implementation, ran all ExactSubstr deduplication experiments, contributed significantly to planning, writing, and analysis, as well as scoping the project.

Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Hung Chim and Xiaotie Deng. 2007. A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, page 121–130, New York, NY, USA. Association for Computing Machin-
A    Further Details on NearDup

For our MinHash based deduplication method, documents are first space tokenized, then each consecutive 5-gram is hashed using tabulation hashing. The set of these hashes is the signature for the document. For each element in a document’s signature, the element is hashed using k other hash functions. The minimum hashed element for each of the k hash functions is stored. These minimum hashes are then partitioned into r buckets, with b hashes per bucket. The b hashes in each bucket are combined into a single value; if two documents have the same value in at least one bucket, they are marked as a potential match. The probability that two documents are considered a potential match is equal to

$\Pr(d_i, d_j \mid \mathrm{Jaccard}(d_i, d_j) = s_{i,j}) = 1 - (1 - s_{i,j}^{b})^{r}$

where $s_{i,j}$ is the Jaccard index between the two documents. For document pairs that were identified as potential matches, we computed their actual Jaccard index, and if that was above 0.8, we computed their edit similarity. Document pairs with edit similarity higher than 0.8 were identified as duplicates. After some experimentation, we chose to use b = 20 and r = 450, so k = 9,000, so as to make sure a collision at the desired Jaccard index threshold of 0.8 had a high probability of occurring.

We also tested an alternative configuration: filtering to document pairs with Jaccard index of at least 0.9 and edit similarity of at least 0.9. In this case, we used b = 20, r = 40, and k = 800. Figure 5 shows the histogram of Jaccard similarities and edit similarities for all document pairs which collided in min-hash space, for our chosen configuration (blue) and for the alternative configuration (orange). This allows us to check coverage: if a configuration yields few comparisons near its chosen threshold, then we have likely captured the majority of actual near duplicates above that threshold. To verify that yourself, look at the left hand tails of the distributions. Since both 0.8 and 0.9 begin to vanish at the same point (in spite of the fact that the two configurations are optimized for accuracy around different thresholds), we feel comfortable saying that we are capturing the majority of actual near duplicates.

Computational Analysis. Let N be the number of documents and T be the maximal number of tokens in a document. Edit similarity has a worst case complexity of $T^2$, so the worst case complexity is

$O(N + b k^2 T^2 N) = O(N)$

since b, k, and T are all $\ll N$. The left term is the complexity of grouping by the signatures, and the right represents the pathological worst case of all documents falling into the same B buckets.

The highly distributed NearDup implementation we employed is one used for large-scale production tasks at Google. On the English C4 dataset, the algorithm consumed approximately 41.5 kWh of energy. Note that our choices of k and b were designed to produce very high recall, and with different parameters, the algorithm could be made much more energy efficient while producing similar results.

B    Further Details on ExactSubstr

Parallel linear time construction. We build a parallelized linear time suffix array algorithm. As a building block, we make black-box use of the SA-IS algorithm for constructing a suffix array in linear time (Nong et al., 2009; Ko and Aluru, 2003). Unfortunately, this algorithm is not easily parallelized directly, so we introduce a simple divide and conquer approach to parallelizing the array construction.

We build our implementation in Rust and extend an existing suffix array library⁶ with three modifications. The first two are straightforward implementation differences: we modify the code to allow datasets larger than 4GB, and we remove the requirement that strings parse as valid UTF-8 sequences in favor of raw byte sequences. Our third change is more significant: we re-implement the algorithm so that we can stream the suffix array itself off disk.

Parallel partial suffix array construction. Our divide and conquer suffix array construction algorithm starts by partitioning the dataset into K different “splits”, with SA-IS run independently on each split in parallel. This algorithm still requires O(N) work but runs in O(N/K) wall-clock time. This gives us K separate suffix arrays $A_i$.

Given two suffix arrays $A_1$ and $A_2$ for two sequences $S_1$ and $S_2$, it is not completely trivial to construct a single suffix array A for $S = S_1 \,\|\, S_2$ because of the boundary conditions. Instead, we don’t build the data $S = S_1 \,\|\, S_2$ but rather let $S_1' = S_1 \,\|\, S_2[\text{up to } K]$ for some K greater than

⁶

[Figure 5 plot: two rows of histograms, one panel per dataset (C4, LM1B, RealNews, Wiki40B). Top row x-axis: edit similarity; bottom row x-axis: Jaccard similarity; y-axis: % of pairwise document comparisons. Series: C4 (t=0.8, t=0.9), LM1B (t=0.8, t=0.9), RealNews (t=0.8, t=0.8 test), Wiki40B (t=0.8).]

Figure 5: Histograms of document similarities.

the longest substring match. Then we build the arrays on S_1' and S_2. To merge
the arrays together, we can remove the items from the first array after index
|S_1| and merge-sort insert them into the second.

Parallel merge of partial suffix arrays. We now merge these separate arrays
together into a single suffix array A. Consider the simpler case of two partial
suffix arrays B and C that we would like to merge together. We can achieve this
by letting i = 0 index B and j = 0 index C. Each iteration of the algorithm then
pushes B_i into A if S_{B_i..} < S_{C_j..} and C_j otherwise, repeating until
i = |B| − 1 and j = |C| − 1. To generalize to K splits, we need only replace the
single comparison above with a min-heap requiring O(log K) ≪ 10 work on each
iteration.
   Observe that in the general case this algorithm is O(N m log(K)) where N is
the length of the dataset, m is the average length of a prefix match, and K is
the number of splits. It is therefore incorrect to call this algorithm linear
time in the general case, although for our datasets it is. Because the length of
the longest match is bounded above by the length of the longest sequence, as
long as the size of the dataset is independent of the length of the longest
sequence in the dataset, this algorithm remains efficient.
   Again, we can parallelize this operation among L simultaneous jobs (in
practice we set K = L as the number of threads on our machine). In the K = 2
case, job l processes i ∈ [lN/L, (l + 1)N/L], choosing the bounds of j by binary
searching into C so that S_{B_i} < S_{C_j} < S_{B_{i+1}}. The case where K > 2
is identical except that we repeat this over all K partial suffix arrays.

Computational Analysis. We run our algorithm on a single VM on the cloud with
96 cores and 768GB of memory. Our algorithm is efficient, for example processing
the Wiki-40B training set (3 million examples containing 4GB of text) in 2.3
minutes wall-clock time (2.1 CPU-hours of work). The 350GB C4 dataset takes
under 12 hours (wall-clock) to build a suffix array, although we are still
memory constrained, so this corresponds to ~1000 CPU-hours. Once the suffix
array has been constructed, it takes under an hour to deduplicate the C4
dataset.
   Note that this algorithm still requires that the dataset itself fits in
memory (so that we can efficiently index in arbitrary positions), but we do not
need to fit the entire suffix array into memory. This is fortunate since our
suffix array requires an 8× space overhead. For example, the suffix array for
the 350GB C4 is 1.5TB.
   Compared to the cost of training a language model on this dataset, the
additional work required to deduplicate the training dataset is negligible.
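
The merge of partial suffix arrays described in this appendix can be illustrated with a small Python sketch. It is not the authors' Rust implementation, and it rests on several simplifying assumptions: the function names (partial_suffix_array, merge_two, merge_k, count_smaller, merge_in_jobs) are hypothetical, a naive sort stands in for SA-IS, each partial array is ordered by suffixes of the full data rather than using the overlap construction, and everything is held in memory instead of being streamed from disk.

```python
from __future__ import annotations

import heapq


def partial_suffix_array(data: bytes, start: int, end: int) -> list[int]:
    """Naive stand-in for SA-IS on one split (assumption, not the paper's code).

    Returns the global positions in [start, end), ordered by the suffix of
    the full `data` that begins at each position.
    """
    return sorted(range(start, end), key=lambda pos: data[pos:])


def merge_two(data: bytes, b: list[int], c: list[int]) -> list[int]:
    """Two-pointer merge of partial suffix arrays B and C."""
    merged = []
    i = j = 0
    while i < len(b) and j < len(c):
        # Push B_i if its suffix sorts first, otherwise push C_j.
        if data[b[i]:] < data[c[j]:]:
            merged.append(b[i])
            i += 1
        else:
            merged.append(c[j])
            j += 1
    merged.extend(b[i:])  # at most one of these two tails is non-empty
    merged.extend(c[j:])
    return merged


def merge_k(data: bytes, parts: list[list[int]]) -> list[int]:
    """K-way merge; heapq.merge keeps a min-heap with one entry per part."""
    return list(heapq.merge(*parts, key=lambda pos: data[pos:]))


def count_smaller(data: bytes, c: list[int], suffix: bytes) -> int:
    """Binary search: how many entries of C have a suffix sorting before `suffix`."""
    lo, hi = 0, len(c)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[c[mid]:] < suffix:
            lo = mid + 1
        else:
            hi = mid
    return lo


def merge_in_jobs(data: bytes, b: list[int], c: list[int], num_jobs: int) -> list[int]:
    """Split the B/C merge into independent jobs (run sequentially in this sketch).

    Each job takes an equal slice of B and the matching slice of C, whose
    bounds are found by binary search on the suffixes at the slice edges.
    """
    if not b:
        return list(c)
    merged = []
    for job in range(num_jobs):
        b_lo = job * len(b) // num_jobs
        b_hi = (job + 1) * len(b) // num_jobs
        c_lo = 0 if job == 0 else count_smaller(data, c, data[b[b_lo]:])
        c_hi = len(c) if job == num_jobs - 1 else count_smaller(data, c, data[b[b_hi]:])
        merged.extend(merge_two(data, b[b_lo:b_hi], c[c_lo:c_hi]))
    return merged


if __name__ == "__main__":
    data = b"banana"
    b = partial_suffix_array(data, 0, 3)
    c = partial_suffix_array(data, 3, 6)
    print(merge_two(data, b, c))
    print(merge_k(data, [b, c]))
    print(merge_in_jobs(data, b, c, 2))
```

On the toy input b"banana" split at index 3, all three merges produce the same ordering as sorting every suffix of the full string directly. The loop in merge_in_jobs is sequential here; because each job reads disjoint slices of B and C, the iterations could be handed to separate threads. Slicing out whole suffixes for each comparison is quadratic in the worst case and is only acceptable for a toy example.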

C   Further Details on Model Training
Each model was trained for about two epochs. Since both C4-Original and
C4-ExactSubstr contain approximately 365M examples, we performed 152K steps
with a batch size of 4800 (or approximately 2 epochs). C4-NearDup contains
approximately 350M examples, so we performed 146K steps (or approximately 2
epochs). On a 128-core TPU v3 pod slice, XL models trained on C4-Original and
C4-ExactSubstr took approximately 131 hours (5.5 days) to train, while the XL
model trained on C4-NearDup took approximately 126 hours to train. Like T5,
models were trained with the Adafactor optimizer (Shazeer and Stern, 2018). A
constant learning rate of 0.01 was used for the base models and 0.001 for the
XL models.
   The 1.5B parameter XL models had 24 layers, each with 32 attention heads.
The model embedding size was 2,048, the feed forward layers had a hidden size
of 5,120, and the key/value dimension size for the attention heads was 64. The
110M parameter base models had 12 layers, each with 12 attention heads. The
model embedding size was 768, the feed forward layers had a hidden size of
2,048, and the key/value dimension size for the attention heads was 64.

D   Energy Consumption
We trained for approximately 131 hours or 5.5 days on a 128-core TPU v3. The
approximately deduplicated dataset is 3.9% smaller than the original dataset
and trains in 63 hours/epoch, saving us around 5 hours of compute time for the
two epochs. The XL-Original model was trained in North America, whereas
XL-ExactSubstr and XL-NearDup were trained in Taiwan. We used data from
Patterson et al. (2021) to estimate the amount of energy used in training these
models by computing the amount of MWh/hour/core and multiplying by our usage
(see Table 6 for how we computed these values). For simplicity, we use
estimates from Taiwanese datacenters. We estimate that training 2 epochs of
XL-Original and XL-ExactSubstr uses 5.86 MWh. XL-NearDup is trained for fewer
steps, and we estimate it uses 5.63 MWh. Training each base model took
approximately 3 days on a 64-core TPU v3 pod slice, which uses an estimated
1.61 MWh.
   In addition to model training, evaluation and inference were performed on
64-core TPU v3 pod slices. Generating 100,000 sequences from the XL models
takes approximately 0.64 hours. We generated 100,000 sequences for each of five
types of prompts for two checkpoints of the model, for a total of 1M sequences
per model. This took approximately 19.2 hours. We estimate generating 3M
sequences uses 0.43 MWh.

[Figure 6 plot: edit similarity between generated and groundtruth continuations (y-axis) by prompt source (x-axis: train dup, train unique, valid in train, valid unique), for the Original, NearDup, and ExactSubstr models.]

Figure 6: Memorized continuations distribution

E   More Results
Qualitative Examples. Table 7 shows several examples of pairs of documents in
C4 whose edit similarity is close to our chosen edit similarity threshold of
0.8. Table 8 shows substrings which were identified by ExactSubstr as being in
C4 more than once. Table 9 shows several examples of unprompted generations
which were identified as memorized.

Distribution of memorization. Figure 6 shows the distribution in memorization
amount over all generated sequences when using four types of prompting: train
examples with duplicates in train, train examples without any duplicates,
validation examples with duplicates in train, and validation examples without
any duplicates.

URLs with many duplicates. Table 10 shows the URLs that had the largest
proportion of examples identified by NearDup as near-duplicates. For C4, these
tend to be websites that sell many similar products and thus have a large
amount of templated text. For RealNews, content aggregators seem especially
common.

NearDup cluster sizes. Figure 8 shows the distribution of cluster sizes from
running NearDup on RealNews, LM1B, and Wiki-40B (results for C4 are in Figure 2
of the main paper).

