Deduplicating Training Data Makes Language Models Better
Katherine Lee∗† Daphne Ippolito∗†‡ Andrew Nystrom† Chiyuan Zhang† Douglas Eck† Chris Callison-Burch‡ Nicholas Carlini†

arXiv:2107.06499v1 [cs.CL] 14 Jul 2021

Abstract

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets—for example removing from C4 a single 61-word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

1 Introduction

A key factor behind the recent progress in natural language processing is the development of large-scale text corpora used to train increasingly large language models. These datasets have grown from just gigabytes to hundreds of gigabytes over the past few years (Chelba et al., 2013; Xue et al., 2020; Graff et al., 2003; Brown et al., 2020). Because it is so expensive to perform manual review on nearly terabyte-scale datasets, they are lower quality than smaller, more curated datasets. These data issues have implications far beyond metrics like perplexity or validation loss, as learned models reflect the biases present in their training data (Bender et al., 2021; Wallace et al., 2019; Sheng et al., 2020). As a result, quantitatively and qualitatively understanding the datasets themselves is a research challenge in its own right (Dodge et al., 2021a).

We show that one particular type of bias, duplicated training examples, is pervasive: 10% of the sequences in several common NLP datasets are repeated multiple times. While naive deduplication is straightforward (and the datasets we consider already perform some naive form of deduplication), performing thorough deduplication at scale is both computationally challenging and requires sophisticated techniques.

We propose two scalable techniques to detect and remove duplicated training data. Exact substring matching identifies verbatim strings that are repeated. This allows us to identify cases where only part of a training example is duplicated (§4.1). Approximate full document matching uses hash-based techniques (Broder, 1997) to identify pairs of documents with high n-gram overlap (§4.2).

We identify four distinct advantages to training on datasets that have been thoroughly deduplicated.

1. Over 1% of tokens emitted unprompted from a model trained on standard datasets (e.g., C4) are part of a memorized sequence (see §6.2)—even though the 1.5 billion parameter model is much smaller than the 350GB dataset it was trained on. By deduplicating the training dataset we reduce the rate of emitting memorized training data by a factor of 10×.

2. Train-test overlap is common in non-deduplicated datasets. For example, we find a 61-word sequence¹ in C4 (Raffel et al., 2020) that is repeated 61,036 times verbatim in the training dataset and 61 times in the validation set (0.02% of the samples in each dataset). This train-test set overlap not only causes researchers to over-estimate model accuracy, but also biases model selection towards models and hyperparameters that intentionally overfit their training datasets.

3. Training models on deduplicated datasets is more efficient. Processing a dataset with our framework requires a CPU-only linear-time algorithm. And because these datasets are up to 19% smaller, even including the deduplication runtime itself, training on deduplicated datasets directly reduces the training cost in terms of time, dollars, and the environment (Bender et al., 2021; Strubell et al., 2019; Patterson et al., 2021).

4. Deduplicating training data does not hurt perplexity: models trained on deduplicated datasets have no worse perplexity compared to baseline models trained on the original datasets. In some cases deduplication reduces perplexity by up to 10%. Further, because recent LMs are typically limited to training for just a few epochs (Radford et al., 2019; Raffel et al., 2020), by training on higher quality data the models can reach higher accuracy faster.

To summarize, data deduplication offers significant advantages and no observed disadvantages. In the remainder of this paper we present our text deduplication framework in §4, and study the extent of duplicate content in common NLP datasets (e.g., C4, Wiki-40B, and LM1B) in §5. We then examine the impact of deduplication on test perplexity (§6.1) and on the frequency of emitting memorized content (§6.2). Finally, we analyze the extent to which perplexity on existing, released models is skewed as a result of overlap between the train and test/validation splits (§6.3).

∗ Equal contribution. † Google Research, Brain Team. ‡ University of Pennsylvania. Correspondence to katherinelee@google.com and daphnei@seas.upenn.edu.

¹ "by combining fantastic ideas, interesting arrangements, and follow the current trends in the field of that make you more inspired and give artistic touches. We'd be honored if you can apply some or all of these design in your wedding. believe me, brilliant ideas would be perfect if it can be applied in real and make the people around you amazed!"

2 Related Work

Large language model datasets. While we believe our results are independent of model architecture, we perform our analysis on Transformer-based decoder-only language models (Vaswani et al., 2017) trained for open-ended text generation. These current state-of-the-art models are trained on internet text. For example, the GPT-2 family of models (Radford et al., 2019) is trained on WebText, a dataset of web documents highly ranked on Reddit—however this dataset was not made available publicly. A common dataset starting point is CommonCrawl, an index of public webpages. Among the models trained on CommonCrawl are GPT-3 (Brown et al., 2020), with the addition of book datasets; GROVER (Zellers et al., 2019), on a restricted subset filtered to news domains called RealNews; and T5 (Raffel et al., 2020), on a cleaned version of CommonCrawl called C4. Other models are trained on more curated Internet sources—for example, Guo et al. (2020) used high-quality processed Wikipedia text from 40 different languages to train monolingual 141.4M parameter language models. Non-English models necessarily use different datasets; Zeng et al. (2021), for instance, introduced PANGU-α, a family of models with up to 200B parameters that were trained on a non-public corpus of cleaned and filtered Chinese-language documents from CommonCrawl and other sources. Since many of these datasets are not public, we deduplicate three that are—Wiki-40B, C4, and RealNews—as well as the One Billion Word Language Model Benchmark (Chelba et al., 2013), a smaller dataset commonly used for evaluation.

Contamination of downstream tasks. When models are trained on datasets constructed by crawling the Internet, it is possible the model will train on the test set of downstream target tasks. For example, Radford et al. (2019, §4) performed a post-hoc analysis to identify 8-gram overlaps between GPT-2's training set and datasets used for evaluation, and Dodge et al. (2021b) analyzed C4 and found that up to 14.4% of test examples for various standard tasks were found verbatim (normalizing for capitalization and punctuation) in the dataset. A more proactive approach removes contaminated data. Trinh and Le (2018, Appendix B) removed documents from their CommonCrawl-based train set that overlapped substantially with the commonsense reasoning datasets used for evaluation. And GPT-3 (Brown et al., 2020, §5) did the reverse and removed downstream evaluation examples from their training data by conservatively filtering out any train set examples with a 13-gram overlap with any evaluation example. Up to 90% of tasks were flagged as potentially contaminated.

In our research, we do not focus on the impact of duplicate text in pretrained models on downstream benchmark tasks; instead we address how duplicate text in the LM training and validation sets impacts model perplexity and the extent to which generated text includes memorized content.
Memorizing train sets. The risks of data memorization, for example the ability to extract sensitive data such as valid phone numbers and IRC usernames, are highlighted by Carlini et al. (2020). While their paper identifies 604 samples that GPT-2 emitted from its training set, we show that over 1% of the data most models emit is memorized training data. In computer vision, memorization of training data has been studied from various angles for both discriminative and generative models (e.g., Arpit et al., 2017; Webster et al., 2019; Feldman and Zhang, 2020; Stephenson et al., 2021; Teterwak et al., 2021).

Duplicate text in training data. The BookCorpus (Zhu et al., 2015), which was used to train popular models such as BERT, has a substantial amount of exact-duplicate documents according to Bandy and Vincent (2021). Allamanis (2019) shows that duplicate examples in code datasets cause worsened performance on code understanding tasks.

3 Language Modeling Datasets

We analyze the presence of duplicate text in four datasets of varying sizes that have been used for training natural language generation systems, producing general-purpose pre-trained models, and language model benchmarking. While this paper restricts itself to English datasets, we expect that non-English datasets suffer from similar issues and could likewise benefit from deduplication.

Wikipedia (Wiki-40B) consists of multi-lingual cleaned Wikipedia text (Guo et al., 2020). We take the English portion, which contains 2.9M Wikipedia pages with an average length of 768 BPE tokens. The dataset creators do not indicate that any deduplication was performed aside from removing redirect pages (e.g., "sunflower" to "Helianthus").

One-Billion Word benchmark (LM1B) contains 30M sentences of news commentary (Chelba et al., 2013). Unlike the other datasets we analyze, LM1B's examples are one sentence long rather than multi-sentence documents. The average example length is 32 BPE tokens. While this dataset is extremely standard for benchmarking language models, Radford et al. (2019, §4) note it has 13.2% overlap of the test set with the train set.

Colossal Cleaned Common Crawl (C4) is made up of 360M web documents, with an average length of 486 BPE tokens (Raffel et al., 2020). C4 was introduced as a pre-training dataset for T5, a set of encoder-decoder models which have been widely used in fine-tuned downstream tasks. The dataset was previously deduplicated in a more sophisticated process than the prior two datasets: each paragraph was hashed, and paragraphs resulting in hash collisions were removed. This was followed by a pass that removed placeholder text, code, and prohibited words. See Dodge et al. (2021a) for a detailed breakdown of the source text in C4.

RealNews is a subset of CommonCrawl consisting of articles from news domains (Zellers et al., 2019). It contains 31M documents with average length 793 BPE tokens. RealNews was deduplicated by inserting a hash of the first 100 characters of each document into a Bloom filter and then excluding any example whose hash matched an example already added to the dataset. Like C4, examples with duplicate URLs were excluded.

4 Methods for Identifying Duplicates

The simplest technique to find duplicate examples would be to perform exact string matching between all example pairs, but as we will show, this is insufficient. We introduce two complementary methods for performing deduplication. First, using a suffix array (Manber and Myers, 1993), we remove duplicate substrings from the dataset if they occur verbatim in more than one example. Second, we use MinHash (Broder, 1997), an efficient algorithm for estimating the n-gram similarity between all pairs of examples in a corpus, to remove entire examples from the dataset if they have high n-gram overlap with any other example.

We consider a dataset D = {x_i}_{i=1}^N as a collection of examples x_i. Each of these examples is itself a sequence of tokens: x_i = x_i^1, x_i^2, ..., x_i^{s_i}.

4.1 Exact Substring Duplication

Due to the diversity of possibilities in human language, it is rare for the same idea to be expressed identically in multiple documents unless one expression is derived from the other, or both are quoting from a shared source. This observation motivates deduplicating exact substrings. We call our approach EXACTSUBSTR. When two examples x_i and x_j share a sufficiently long substring (that is, a substring for which x_i^{a..a+k} = x_j^{b..b+k}), that substring is removed from one of them. Based on statistical analyses (§4.1.3), we select k = 50 tokens as the minimum matching substring length. A breakdown of the computation needed for this approach can be found in Appendix B.
4.1.1 Suffix Arrays

This exact-substring-matching criterion, while conceptually simple, is computationally prohibitive with naive (quadratic) all-pair matching. To solve this problem, we concatenate all the examples of the entire dataset D into a giant sequence S, and construct a suffix array A of S. A suffix array (Manber and Myers, 1993) is a representation of a suffix tree (Weiner, 1973) that can be constructed in time linear in ||S|| (Kärkkäinen and Sanders, 2003) and allows efficient computation of many substring queries—in particular, it allows us to identify duplicated training examples in linear time. Suffix arrays have been used widely in NLP for applications such as efficient TF-IDF computation (Yamamoto and Church, 2001) and document clustering (Chim and Deng, 2007).

The suffix array A for a sequence S is a lexicographically ordered list of all suffixes contained in the sequence. Formally,

A(S) = argsort(all_suffixes(S)).

For example, the suffixes of the sequence "banana" are ("banana", "anana", "nana", "ana", "na", "a"), and so the suffix array is the sequence (6 4 2 1 5 3).

Suffix arrays are often preferable to suffix trees because, while asymptotically less efficient for some types of queries, they are ten to a hundred times more memory efficient (Manber and Myers, 1993), requiring just 8 bytes per input token.

4.1.2 Parallel Substring Matching

After constructing A, it is straightforward to identify duplicated training examples. Suppose that the sequence s was repeated exactly twice in the training dataset S, at positions i and j; that is, S_{i..i+|s|} = S_{j..j+|s|}. Then the indices i and j will occur adjacent to each other in the suffix array A.

Finding all repeated sequences is therefore a matter of linearly scanning the suffix array from beginning to end and looking for entries A_i, A_{i+1} that share a common prefix of at least some threshold length. Any satisfying sequences are recorded. This algorithm is embarrassingly parallel, and so we can efficiently process the dataset.

4.1.3 Setting a Threshold for Duplicates

The final question that remains to be answered is how long a substring match must be before we count it as a duplicate. In Figure 1, we plot the frequency of substring matches within the four datasets we will consider. For each substring of length k, we compute the probability that there exists another sequence of length k identical to this one; formally:

m(k) = Pr_{i ∈ [N]} [ ∃ j ≠ i : S_{i..i+k} = S_{j..j+k} ].

Figure 1: For each substring of length k, we plot the probability that there exists a second identical length-k substring in the same train set (shown for LM1B, C4, RealNews, and Wiki-40B). Matches with length under 10 tokens are common, and account for 90% of tokens. We choose a threshold of 50 for experiments.

We choose 50 tokens as the threshold to be conservative: the "bend in the knee" occurs at 10 tokens, and manual inspection of length-25 matches found no false positives. We then doubled this value to have an exceptionally large margin for error.
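To make the adjacent-entry scan concrete, the sketch below builds a suffix array for a toy token sequence and reports adjacent entries whose common prefix meets a length threshold. It is only an illustration: the suffixes are sorted naively (roughly O(N² log N)), whereas the released tool uses a linear-time, parallelized Rust implementation over the byte stream of the whole corpus, and the function and variable names here are ours, not the tool's.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def common_prefix_len(tokens, i, j):
    n = 0
    while i + n < len(tokens) and j + n < len(tokens) and tokens[i + n] == tokens[j + n]:
        n += 1
    return n

def repeated_substrings(tokens, min_len=50):
    # Duplicated substrings occupy adjacent positions in the suffix array, so a single
    # linear scan over adjacent pairs finds every repeated span above the threshold.
    sa = build_suffix_array(tokens)
    for a, b in zip(sa, sa[1:]):
        k = common_prefix_len(tokens, a, b)
        if k >= min_len:
            yield a, b, k

# Toy corpus with a 3-token threshold (the paper uses 50 tokens on real datasets).
corpus = "the cat sat on the mat . the cat sat on the rug .".split()
for a, b, k in repeated_substrings(corpus, min_len=3):
    print(a, b, corpus[a:a + k])
```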
4.2 Approximate Matching with MinHash

We also perform approximate deduplication based on matching entire examples. This method, which we call NEARDUP, is a good complement to exact substring matching, especially for web crawl text, as it handles the very common case of documents being identical except for interspersed templated fields (such as the last row of Table 1).

MinHash (Broder, 1997) is an approximate matching algorithm widely used in large-scale deduplication tasks (Versley and Panchenko, 2012; Gabriel et al., 2018; Gyawali et al., 2020), including to deduplicate the training set for a large Chinese-language LM (Zeng et al., 2021). Given two documents x_i and x_j, the main idea is to represent each document by its respective set of n-grams, d_i and d_j. We can then use hash functions to quickly approximate the Jaccard index (Jaccard, 1912):

Jaccard(d_i, d_j) = |d_i ∩ d_j| / |d_i ∪ d_j|.

If the Jaccard index between d_i and d_j is sufficiently high, it is likely that the documents are approximate matches of each other. To efficiently approximate the Jaccard index, MinHash constructs document signatures by sorting each of the n-grams via a hash function, and then keeping only the k smallest hashed n-grams. There are multiple ways to construct estimators of the Jaccard index from these kinds of signatures (Cohen, 2016).

In our implementation, we use 5-grams and a signature of size 9,000. The probability that two documents are considered a potential match is

Pr(d_i, d_j | Jaccard(d_i, d_j) = s_{i,j}) = 1 − (1 − s_{i,j}^b)^r,

where b = 20 and r = 450 are user-settable parameters to control the strength of the filter. See Appendix A for more details.

For each pair of documents identified as a potential match, more computationally expensive similarity metrics can be employed as a subsequent filtering step. In particular, we identify two documents as duplicates if they are matched by the MinHash algorithm and their edit similarity is greater than 0.8. The edit similarity between token sequences x_i and x_j is defined as:

EditSim(x_i, x_j) = 1 − EditDistance(x_i, x_j) / max(|x_i|, |x_j|).

To build clusters of similar documents, we construct a graph that has an edge between two documents if they are considered a match. Then, we use the method introduced in Łącki et al. (2018) to identify connected components. A breakdown of the computation needed is given in Appendix A.

Table 1: Qualitative examples of near-duplicates identified by NEARDUP from each dataset. Note the small interspersed differences that make exact duplicate matching less effective. Examples ending with "[...]" have been truncated for brevity.

Wiki-40B
  Example: \n_START_ARTICLE_\nHum Award for Most Impactful Character \n_START_SECTION_\nWinners and nominees\n_START_PARAGRAPH_\nIn the list below, winners are listed first in the colored row, followed by the other nominees. [...]
  Near-duplicate: \n_START_ARTICLE_\nHum Award for Best Actor in a Negative Role \n_START_SECTION_\nWinners and nominees\n_START_PARAGRAPH_\nIn the list below, winners are listed first in the colored row, followed by the other nominees. [...]

LM1B
  Example: I left for California in 1979 and tracked Cleveland 's changes on trips back to visit my sisters .
  Near-duplicate: I left for California in 1979 , and tracked Cleveland 's changes on trips back to visit my sisters .

RealNews
  Example: KUALA LUMPUR (Reuters) - Roads in Southeast Asia have been getting a little louder lately as motorcycle makers, an aspiring middle class and easy bank credit come together to breed a new genus of motorcyclists – the big-bike rider. [...]
  Near-duplicate: A visitor looks at a Triumph motorcycle on display at the Indonesian International Motor Show in Jakarta September 19, 2014. REUTERS/Darren Whiteside\n KUALA LUMPUR (Reuters) - Roads in Southeast Asia have been getting a little [...] big-bike rider. [...]

C4
  Example: Affordable and convenient holiday flights take off from your departure country, "Canada". From May 2019 to October 2019, Condor flights to your dream destination will be roughly 6 a week! Book your Halifax (YHZ) - Basel (BSL) flight now, and look forward to your "Switzerland" destination!
  Near-duplicate: Affordable and convenient holiday flights take off from your departure country, "USA". From April 2019 to October 2019, Condor flights to your dream destination will be roughly 7 a week! Book your Maui Kahului (OGG) - Dubrovnik (DBV) flight now, and look forward to your "Croatia" destination!
5 Deduplication Results

We deduplicate each of the four datasets with both of our techniques. When text was duplicated across multiple data splits, we prioritized keeping a copy in the test or validation set and removing it from the train set.

5.1 Amount of Text Removed

With NEARDUP, we found that the web-scraped datasets contain between 3.04% (on C4) and 13.63% (on RealNews) near-duplicates (Table 2). Near-duplicate text is much less common in Wiki-40B, forming only 0.39% of the train set.² In C4, the majority (1.8M) of near-duplicate clusters consisted of just a single pair of examples that matched against each other, but there were 280 clusters with over 5,000 examples in them (Figure 2), including one cluster of size 250,933.

² Most duplicates we saw were automatically generated pages, such as the outcomes of sports games. This shows the strength of manual curation for creating high-quality datasets.

Table 2: The fraction of examples identified by NEARDUP as near-duplicates.

Dataset  | % train examples with dup in train | % train examples with dup in valid | % valid examples with dup in train
C4       | 3.04%  | 1.59% | 4.60%
RealNews | 13.63% | 1.25% | 14.35%
LM1B     | 4.86%  | 0.07% | 4.92%
Wiki-40B | 0.39%  | 0.26% | 0.72%

Table 3: The fraction of tokens (note Table 2 reports the fraction of examples) identified by EXACTSUBSTR as part of an exact duplicate 50-token substring.

Dataset  | % train tokens with dup in train | % train tokens with dup in valid | % valid tokens with dup in train
C4       | 7.18% | 0.75%  | 1.38%
RealNews | 19.4% | 2.61%  | 3.37%
LM1B     | 0.76% | 0.016% | 0.019%
Wiki-40B | 2.76% | 0.52%  | 0.67%

Figure 2: The distribution of near-duplicate cluster sizes from running NEARDUP on C4. Cluster size [5001, ∞): 280 clusters; [501, 5000): 2,782; [51, 500): 23,094; [21, 50): 28,446; [11, 20): 42,723; [6, 10): 85,567; size 5: 54,984; size 4: 109,853; size 3: 292,575; size 2: 1,861,744; size 1: 348,320,475.

On average, with EXACTSUBSTR we remove more total content than with NEARDUP (despite EXACTSUBSTR not removing any examples outright)—for example removing 7.18% of the tokens in C4. The exception is LM1B, where EXACTSUBSTR removes 8× less data than NEARDUP. On investigation, we find this is because LM1B documents are significantly shorter: 90% of all documents are under 50 tokens, and so are not even candidates for potential matches even if the entire sequence matched verbatim. We find that both NEARDUP and EXACTSUBSTR remove similar content—77% of the training examples that NEARDUP removes from C4 have at least one verbatim length-50 match found by EXACTSUBSTR.

5.2 Properties of Duplicated Text

While the authors of both RealNews and C4 explicitly attempted deduplication during dataset construction, the methods were insufficient to capture the more subtle types of duplicate text commonly found on the internet. In C4 and Wiki-40B, we qualitatively observe that much of the text identified as near-duplicated is computer-generated. The text is identical except for the names of places, businesses, products, dates, and so on. Because these examples frequently differ by just a few words at a time, deduplication strategies relying on exact string matching would fail to identify a match. Example duplicate pairs from each dataset can be found in Table 1 (more examples in the Appendix).

For RealNews and LM1B, which are both derived from news sites, we observe that many near-duplicates occur because the same news article appears on multiple news sites with slightly different formatting. For example, in LM1B, there is one example that starts "MINEOLA , N.Y. - New York officials say [...]" and another that starts "( AP ) - New York officials say [...]". The two examples are otherwise identical.

5.3 Train / Test Set Leakage

Both deduplication methods identify overlap between the train set and the validation set (Table 2). For example, 4.6% of the C4 validation set and 14.4% of the RealNews validation set examples had an approximate duplicate in their respective training sets. Such duplication is problematic since it could cause evaluation metrics to be unfairly inflated for models that are better at memorizing their train sets. We evaluate the effect of this leakage on publicly released models in Section 6.3.
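The leakage fractions reported in Table 2 can be computed directly from NEARDUP's duplicate clusters. The sketch below assumes clusters are available as lists of (split, example_id) pairs; this representation is a hypothetical convenience for illustration, not the format emitted by the released tools.

```python
# Count the fraction of validation examples whose near-duplicate cluster also
# contains at least one training example (the "% valid with dup in train" column).
def leakage_fraction(clusters, num_valid_examples):
    leaked = set()
    for cluster in clusters:
        if any(split == "train" for split, _ in cluster):
            leaked.update(ex_id for split, ex_id in cluster if split == "valid")
    return len(leaked) / num_valid_examples

clusters = [
    [("train", 0), ("train", 17), ("valid", 3)],  # valid example 3 leaks from train
    [("valid", 8), ("valid", 9)],                 # duplicated only within validation
]
print(leakage_fraction(clusters, num_valid_examples=10))  # 0.1
```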
6 Impact on Trained Models

We trained 1.5B parameter "XL", decoder-only, Transformer-based language models similar to GPT-2 on C4-ORIGINAL, C4-NEARDUP, and C4-EXACTSUBSTR, respectively. We use the T5 codebase and model architecture from Raffel et al. (2020), and each model was trained for about two epochs on its respective dataset. To better understand the amount of variance in the perplexities of trained models, we also trained three different random seeds of the 110M parameter "base" model for each of the above three datasets—for a total of nine base-sized models.

6.1 Model Perplexity

Figure 3: Impact of deduplicating the training set on validation perplexity. In (a), we plot the results from T5 base (110M parameters) across three training runs with different random initializations. The black bar represents the lowest to highest perplexity, and the colored bar the median perplexity. In (b), we plot the results from T5 XL (1.5B parameters). For C4, we evaluate on C4 Original, the original validation set; C4 Unique, a subset of the validation set identified by NEARDUP as having zero matches across C4; and C4 Duplicates, a subset of the validation set identified by NEARDUP as having a match in the C4 train set.

Training on EXACTSUBSTR-deduplicated data results in higher perplexity than training on NEARDUP-deduplicated data. These trends hold true for the XL-sized model as well. While this may suggest EXACTSUBSTR deduplication results in models least overfit on the train set, note that the two techniques used separate duplicate thresholds, and a different choice of thresholds could change the results.

When evaluating on the validation sets of LM1B and Wiki-40B, we found that models trained on NEARDUP-deduplicated C4 consistently achieved the lowest perplexity (for LM1B evaluation with base models, see Appendix Figure 7). EXACTSUBSTR deduplication decreases perplexity of the XL model by almost 3 points on Wiki-40B, which is much larger than the variation of about 1 point perplexity we observed in the base models. This is despite seeing fewer tokens of training data overall. Lastly, we note all our XL models achieved [...]
6.2 Generated Text

Table 4: When generating 100k sequences with no prompting, over 1% of the tokens emitted from a model trained on the original dataset are part of a 50-token long sequence copied directly from the training dataset. This drops to 0.1% for the deduplicated datasets.

Model          | 1 Epoch | 2 Epochs
XL-ORIGINAL    | 1.926%  | 1.571%
XL-NEARDUP     | 0.189%  | 0.264%
XL-EXACTSUBSTR | 0.138%  | 0.168%

No prompting. When generating 100k sequences with no prompting, over 1% of the tokens emitted by XL-ORIGINAL are part of a 50-token sequence copied verbatim from the train set (Table 4). This is ∼10× more memorization than XL-EXACTSUBSTR or XL-NEARDUP. Some example subsequences that were copied verbatim from the train set can be found in Table 8 in the Appendix.

Existing models also suffer from the problem of generating text from their train sets. We find that 1.38% of the tokens in the official release of 25k GROVER-Mega outputs³ are part of verbatim matches in RealNews of at least length 50. Likewise, more than 5% of the tokens in ~200k sequences output by GPT-Neo 1.3B (Black et al., 2021) are part of 50-token matches against its training data, the Pile (Gao et al., 2020).

³ gs://grover-models/generation_examples/generator=mega~dataset=p0.90.jsonl

With prompting. In most real use cases, language model generation is controlled by providing a prompt for the model to continue. We experiment with four possible prompt sources: training examples identified by EXACTSUBSTR as having near-duplicates in the train set (train dup), training examples identified as unique (train unique), validation set examples with a near-duplicate in the train set (valid in train), and validation examples identified as unique across all splits (valid unique). We select the first 32 tokens of each example as the prompt, which means we can evaluate the fraction of generations which are near-duplicates of the ground-truth continuation for the prompt (Figure 4). When the prompt comes from duplicate examples in the train set, XL-ORIGINAL reproduces the ground-truth continuation over 40% of the time. XL-EXACTSUBSTR and XL-NEARDUP still copy the ground truth more often when the prompt comes from a duplicate example than when the prompt comes from a unique example, suggesting that more stringent deduplication may be necessary to remove memorization tendencies entirely.

Figure 4: The proportion of generations which have edit similarity above 0.8 with the ground-truth continuation when using the LM to generate continuations for 32-token prompts identified by NEARDUP as either duplicated or unique.
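The prompted evaluation just described can be sketched as follows: truncate each example to a 32-token prompt, generate a continuation, and count it as memorized when its edit similarity to the true continuation exceeds 0.8. The `generate` callable, the `edit_similarity` function (the token-level metric from §4.2), and the continuation length are placeholders for this illustration; the paper does not specify this exact interface.

```python
def memorized_fraction(examples, generate, edit_similarity,
                       prompt_len=32, cont_len=32, threshold=0.8):
    """Fraction of prompts whose generated continuation nearly matches the ground truth.

    `examples` is an iterable of token lists; `generate(prompt, max_new_tokens)` is any
    LM sampling function returning a token list.
    """
    hits = 0
    for tokens in examples:
        prompt = tokens[:prompt_len]
        truth = tokens[prompt_len:prompt_len + cont_len]
        generated = generate(prompt, max_new_tokens=cont_len)
        if edit_similarity(generated, truth) > threshold:
            hits += 1
    return hits / len(examples)
```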
6.3 Impact on Existing Models

Train-test leakage does not just impact models trained on C4. In Table 5, we show that whether or not an evaluation example has a near-duplicate in the train set has a significant impact on model perplexity for two standard models: Transformer-XL (Dai et al., 2019), which was trained on LM1B, and GROVER (Zellers et al., 2019), which was trained on RealNews. For Transformer-XL, the perplexity halves on examples identified as near-duplicates. For GROVER, the difference in perplexities is present in both the 124M and 1.5B parameter models, but is not quite as stark as for Transformer-XL.

Table 5: For each model, the perplexity on the official validation set (Orig), on validation set examples identified by NEARDUP as matches of train set examples (Dups), and on validation set examples identified by NEARDUP as unique (Unique). Due to the size of the RealNews validation set, we evaluated on only the first 25k examples meeting each condition.

Model          | Dataset  | Orig  | Dups  | Unique
Transformer-XL | LM1B     | 21.77 | 10.11 | 23.58
GROVER-Base    | RealNews | 15.44 | 13.77 | 15.73
GROVER-XL      | RealNews | 9.15  | 7.68  | 9.45

7 Discussion

The focus of this paper is on the datasets used to train language models. While recent work has focused on documenting the potential harms that could arise from problematic datasets (Bender and Friedman, 2018; Gebru et al., 2020), less work has been done to quantitatively analyze properties of real language modelling datasets, as Dodge et al. (2021a) have done for C4. Our paper provides analysis on one particular axis, that of data duplication.

Our experiments measured what could be quantified: the amount of duplicate content in common datasets, the effect of deduplication on trained model perplexity, and the reduction of memorized content in trained models through deduplication. We do not focus on the nature of the data being removed by deduplication or memorized by LMs.

Privacy is an important subject for future work, as memorization could have significant privacy consequences. We use the following interpretation of privacy: if a model reveals information about examples in its training data beyond what is revealed about examples not in its training data, this is a privacy violation (Shokri et al., 2017).⁴ Training on standard datasets that have not yet been deduplicated results in models that are particularly sensitive to examples that happened to be repeated multiple times, and this has negative privacy implications. For instance, it could violate a person's expectations of privacy if their publicly available personal data appeared in a different, surprising context. In addition, downstream applications of LMs, such as the game AI Dungeon⁵, in most cases should not output memorized content like adverts for real-world products.

⁴ Another interpretation of privacy focuses on the sensitivity of the data involved, when a model is trained on and able to reproduce personal identifiers or other forms of "private data." Our definition is more expansive.
⁵ https://play.aidungeon.io/

We stress that in our experiments, we do not distinguish between undesired memorized text (such as phone numbers), innocuous memorized text (common phrases), and text we may want to be memorized (such as a quote by a public figure), and instead treat all instances of the LM generating text that closely matches the training set as problematic. While we qualitatively observed that much of the identified memorized content was relatively innocuous, a more systematic study of the risks associated with the detected memorization was beyond the scope of this work.

We also do not investigate the negative consequences of deduplication. Some language tasks explicitly require memorization, like document retrieval or closed-book question answering. Also, text that gives attribution is often duplicated across documents, so removing duplicate substrings could correspond to removing just the attribution, which could result in models that learn the content without its attached attribution. Deduplication is also not sufficient to remove privacy-sensitive data like bank passwords and medical records, which should never be used in training data.

Ultimately, whether memorization is a desired property of a language model, or else risky and unwanted, depends both on the nature of the text that has been memorized and on the downstream applications of the trained model. However, because the trend has been towards creating datasets and models that are application-agnostic, we encourage researchers to think carefully about the limitations of the data they have collected and how the model's intended usage constrains what should be part of the training set. Developing techniques to memorize or forget specific sequences depending on the end application is a promising research direction.

8 Conclusion

We encourage future language model research to perform dataset deduplication, either by training on the deduplicated datasets we release, using the deduplication tools we release, or following our approach to deduplicate datasets with new tools. The exact technique used to perform deduplication is less important than performing stringent deduplication in the first place. On the whole, deduplication does not harm, and sometimes improves, model perplexity, despite the fact that the deduplicated datasets are smaller and thus faster to train on. It is especially important that there are no duplicates between the training and testing sets, because overlap here explicitly encourages selecting models that memorize the training data. Lastly, deduplication helps to reduce the privacy concerns around language models memorizing their training data.

9 Acknowledgements

We are grateful to the many researchers whose technical help, feedback, and discussions shaped this project: Jacob Austin, Samy Bengio, Olivier Bousquet, James Bradbury, Fernando Diaz, Mark Diaz, Noah Fiedel, Jonathan Frankle, David Grangier, Stefanie Karp, David Mimno, Gaurav Mishra, Michael Mozer, Sharan Narang, Alex Passos, Adam Roberts, Hanie Sedghi, Jascha Sohl-Dickstein, David So, Florian Tramer, and Yun William Yu. We are also grateful to the Google Brain women who have given us continuous support.

10 Contributions

Each of the authors on this paper significantly contributed to the final results.

• Katherine trained the models used in the paper, built and ran the eval and text generation pipelines, contributed significantly to writing, analysis, and project organization and management.
• Daphne ran the approximate matching data deduplication pipelines, extracted prompts and evaluation datasets, ran eval pipelines, and contributed significantly to planning, writing, and analysis.

• Andrew wrote the code to perform deduplication with approximate matching, helped evaluate energy expenditure, and helped with analysis.

• Chiyuan helped generate plots and contributed to project scoping, writing, and data analysis.

• Chris offered mentorship and guidance throughout the project and contributed to writing.

• Doug offered mentorship and guidance throughout the project and contributed to writing.

• Nicholas wrote the suffix array implementation, ran all EXACTSUBSTR deduplication experiments, contributed significantly to planning, writing, and analysis, as well as scoping the project.

References

Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pages 143–153.

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242. PMLR.

Jack Bandy and Nicholas Vincent. 2021. Addressing "documentation debt" in machine learning research: A retrospective datasheet for BookCorpus.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow.

Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Hung Chim and Xiaotie Deng. 2007. A new suffix tree similarity measure for document clustering. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 121–130, New York, NY, USA. Association for Computing Machinery.

Edith Cohen. 2016. Min-hash sketches: A brief survey.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021a. Documenting the English colossal clean crawled corpus.

Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, and Matt Gardner. 2021b. Documenting the English colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.

Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems.

Rodney A. Gabriel, Tsung-Ting Kuo, Julian McAuley, and Chun-Nan Hsu. 2018. Identifying and characterizing highly similar notes in big clinical note datasets. Journal of Biomedical Informatics, 82:63–69.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for datasets.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.

Mandy Guo, Zihang Dai, Denny Vrandecic, and Rami Al-Rfou. 2020. Wiki-40B: Multilingual language model dataset. In LREC 2020.

Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 901–910.

Paul Jaccard. 1912. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50.

Juha Kärkkäinen and Peter Sanders. 2003. Simple linear work suffix array construction. In International Colloquium on Automata, Languages, and Programming, pages 943–955. Springer.

Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Annual Symposium on Combinatorial Pattern Matching, pages 200–210. Springer.

Jakub Łącki, Vahab Mirrokni, and Michał Włodarczyk. 2018. Connected components at scale via local contractions.

Udi Manber and Gene Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In 2009 Data Compression Conference, pages 193–202. IEEE.

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2020. Towards controllable biases in language generation. arXiv preprint arXiv:2005.00268.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE.

Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. 2021. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP.

Piotr Teterwak, Chiyuan Zhang, Dilip Krishnan, and Michael C Mozer. 2021. Understanding invariance via feedforward inversion of discriminatively trained classifiers. In International Conference on Machine Learning, pages 10225–10235. PMLR.

Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Yannick Versley and Yana Panchenko. 2012. Not just bigger: Towards better-quality web corpora. In Proceedings of the Seventh Web as Corpus Workshop (WAC7), pages 44–52.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125.

Ryan Webster, Julien Rabin, Loïc Simon, and Frédéric Jurie. 2019. Detecting overfitting of deep generative networks via latent recovery. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11265–11274.

Peter Weiner. 1973. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.

Mikio Yamamoto and Kenneth W Church. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1–30.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.

Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang, Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu, and Yonghong Tian. 2021. PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
A Further Details on NEARDUP

For our MinHash-based deduplication method, documents are first space-tokenized, then each consecutive 5-gram is hashed using tabulation hashing. The set of these hashes is the signature for the document. Each element in a document's signature is hashed using k other hash functions, and the minimum hashed element for each of the k hash functions is stored. These minimum hashes are then partitioned into r buckets, with b hashes per bucket. The b hashes in each bucket are combined into a single value, and if two documents have the same value in at least one bucket, they are marked as a potential match. The probability that two documents are considered a potential match is

Pr(d_i, d_j | Jaccard(d_i, d_j) = s_{i,j}) = 1 − (1 − s_{i,j}^b)^r,

where s_{i,j} is the Jaccard index between the two documents. For document pairs that were identified as potential matches, we computed their actual Jaccard index, and if that was above 0.8, we computed their edit similarity. Document pairs with edit similarity higher than 0.8 were identified as duplicates. After some experimentation, we chose to use b = 20 and r = 450 (so k = 9,000), so as to make sure a collision at the desired Jaccard index threshold of 0.8 had a high probability of occurring.

We also tested an alternative configuration—filtering to document pairs with Jaccard index of at least 0.9 and edit similarity of at least 0.9. In this case, we used b = 20, r = 40, and k = 800. Figure 5 shows the histograms of Jaccard similarities and edit similarities for all document pairs which collided in min-hash space, for our chosen configuration (blue) and for the alternative configuration (orange). This allows us to verify the choice of threshold: if there are few comparisons around the chosen threshold, then we have likely captured the majority of actual near-duplicates above that threshold. To verify this, look at the left-hand tails of the distributions. Since both the 0.8 and 0.9 configurations begin to vanish at the same point (in spite of the fact that the two thresholds are optimized for accuracy around different values), we feel comfortable saying that we are capturing the majority of actual near-duplicates.

Figure 5: Histograms of document similarities (percentage of pairwise document comparisons by edit similarity and by Jaccard similarity) for C4, LM1B, RealNews, and Wiki-40B, at thresholds t = 0.8 and t = 0.9.

Computational analysis. Let N be the number of documents and T be the maximal number of tokens in a document. Edit similarity has a worst-case complexity of T², so the worst-case complexity is

O(N + b k² T² N) = O(N),

since b, k, and T are all much smaller than N. The left term is the complexity of grouping by the signatures, and the right term represents the pathological worst case of all documents falling into the same buckets.

The highly distributed NEARDUP implementation we employed is one used for large-scale production tasks at Google. On the English C4 dataset, the algorithm consumed approximately 41.5 kWh of energy. Note that our choices of k and b were designed to produce very high recall, and with different parameters, the algorithm could be made much more energy efficient while producing similar results.

B Further Details on EXACTSUBSTR

Parallel linear-time construction. We build a parallelized linear-time suffix array algorithm. As a building block, we make black-box use of the SA-IS algorithm for constructing a suffix array in linear time (Nong et al., 2009; Ko and Aluru, 2003). Unfortunately, this algorithm is not easily parallelized directly, so we introduce a simple divide-and-conquer approach to parallelizing the array construction.

We build our implementation in Rust and extend an existing suffix array library⁶ with three modifications. The first two are straightforward implementation differences: we modify the code to allow datasets larger than 4GB, and we remove the requirement that strings parse as valid UTF-8 sequences in favor of raw byte sequences. Our third change is more significant: we re-implement the algorithm so that we can stream the suffix array itself off disk.

⁶ https://github.com/BurntSushi/suffix

Parallel partial suffix array construction. Our divide-and-conquer suffix array construction algorithm starts by partitioning the dataset into K different "splits", with SA-IS run independently on each split in parallel. This algorithm still requires O(N) work but runs in O(N/K) wall-clock time, and gives us K separate suffix arrays A_i.

Given two suffix arrays A_1 and A_2 for two sequences S_1 and S_2, it is not completely trivial to construct a single suffix array A for S = S_1 || S_2 because of the boundary conditions. Instead of building the data S = S_1 || S_2, we let S_1' = S_1 || S_2[: K] for some K greater than the length of the longest substring match. Then we build the arrays on S_1' and S_2. To merge the arrays together, we can remove the items from the first array after index |S_1| and merge-sort insert them into the second.

Parallel merge of partial suffix arrays. We now merge these separate arrays together into a single suffix array A. Consider the simpler case of two partial suffix arrays B and C that we would like to merge together. We can achieve this by letting i = 0 index B and j = 0 index C. Each iteration of the algorithm then pushes B_i into A if S_{B_i..} < S_{C_j..}, and C_j otherwise, repeating until i = |B| − 1 and j = |C| − 1. To generalize to K splits, we need only replace the single comparison above with a min-heap, requiring O(log K) work on each iteration.

Observe that in the general case this algorithm is O(N m log K), where N is the length of the dataset, m is the average length of a prefix match, and K is the number of splits. It is therefore incorrect to call this algorithm linear time in the general case, but for ours it is: because the length of the longest match is bounded above by the length of the longest sequence, as long as the size of the dataset is independent of the length of the longest sequence in the dataset, this algorithm remains efficient.

Again, we can parallelize this operation among L simultaneous jobs (in practice we set K = L as the number of threads on our machine). In the K = 2 case, job l processes i ∈ [lN/L, (l + 1)N/L], choosing the bounds of j by binary searching into C so that S_{B_i} < S_{C_j} < S_{B_{i+1}}. The case where K > 2 is identical except that we repeat this over all K partial suffix arrays.

Computational analysis. We run our algorithm on a single VM on the cloud with 96 cores and 768GB of memory. Our algorithm is efficient; for example, it processes the Wiki-40B training set (3 million examples containing 4GB of text) in 2.3 minutes wall-clock time (2.1 CPU-hours of work). The 350GB C4 dataset takes under 12 hours (wall-clock) to build a suffix array, although we are still memory constrained and so this corresponds to ∼1,000 CPU-hours. Once the suffix array has been constructed, it takes under an hour to deduplicate the C4 dataset.

Note that this algorithm still requires that the dataset itself fits in memory (so that we can efficiently index arbitrary positions), but we do not need to fit the entire suffix array into memory. This is fortunate since our suffix array requires an 8× space overhead. For example, the suffix array for the 350GB C4 is 1.5TB.

Compared to the cost of training a language model on this dataset, the additional work required to deduplicate the training dataset is negligible.
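The merge step above can be sketched with a min-heap so that each emitted position costs O(log K) comparisons. This toy version assumes all partial arrays index into one in-memory token sequence and compares full suffixes through the heap key, unlike the streaming Rust implementation, which bounds the comparison length and works off disk.

```python
import heapq

def merge_partial_suffix_arrays(tokens, partial_arrays):
    # Each partial array is already sorted; repeatedly pop the globally smallest suffix.
    heap = []
    for k, arr in enumerate(partial_arrays):
        if arr:
            # Suffixes are compared via the heap key; a real implementation would
            # compare a bounded prefix rather than materialising the whole suffix.
            heapq.heappush(heap, (tokens[arr[0]:], k, 0))
    merged = []
    while heap:
        _, k, idx = heapq.heappop(heap)
        merged.append(partial_arrays[k][idx])
        if idx + 1 < len(partial_arrays[k]):
            nxt = partial_arrays[k][idx + 1]
            heapq.heappush(heap, (tokens[nxt:], k, idx + 1))
    return merged

# The suffix array of "banana" from Section 4.1.1, recovered by merging two partial
# arrays built over the first and second halves of the position range.
corpus = list("banana")
halves = [sorted(range(0, 3), key=lambda i: corpus[i:]),
          sorted(range(3, 6), key=lambda i: corpus[i:])]
print(merge_partial_suffix_arrays(corpus, halves))  # [5, 3, 1, 0, 4, 2] (0-indexed)
```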
C Further Details on Model Training

Each model was trained for about two epochs. Since both C4-ORIGINAL and C4-EXACTSUBSTR contain approximately 365M examples, we performed 152K steps with a batch size of 4800 (approximately 2 epochs). C4-NEARDUP contains approximately 350M examples, so we performed 146K steps (approximately 2 epochs). On a 128-core TPU v3 pod slice, the XL models trained on C4-ORIGINAL and C4-EXACTSUBSTR took approximately 131 hours (5.5 days) to train, while the XL model trained on C4-NEARDUP took approximately 126 hours. Like T5, models were trained with the Adafactor optimizer (Shazeer and Stern, 2018). A constant learning rate of 0.01 was used for the base models and 0.001 for the XL models.

The 1.5B parameter XL models had 24 layers, each with 32 attention heads. The model embedding size was 2,048, the feed-forward layers had a hidden size of 5,120, and the key/value dimension size for the attention heads was 64. The 110M parameter base models had 12 layers, each with 12 attention heads. The model embedding size was 768, the feed-forward layers had a hidden size of 2,048, and the key/value dimension size for the attention heads was 64.

D Energy Consumption

We trained for approximately 131 hours, or 5.5 days, on a 128-core TPU v3. The approximately deduplicated dataset is 3.9% smaller than the original dataset and trains in 63 hours/epoch, saving us around 5 hours of compute time over the two epochs. The XL-ORIGINAL model was trained in North America, while XL-EXACTSUBSTR and XL-NEARDUP were trained in Taiwan. We used data from Patterson et al. (2021) to estimate the amount of energy used in training these models by computing MWh per hour per core and multiplying by our usage (see Table 6 for how we computed these values). For simplicity, we use estimates from Taiwanese datacenters. We estimate training 2 epochs of XL-ORIGINAL and XL-EXACTSUBSTR uses 5.86 MWh. XL-NEARDUP is trained for fewer steps and we estimate uses 5.63 MWh. Training each base model took approximately 3 days on a 64-core TPU v3 pod slice, which uses an estimated 1.61 MWh.

In addition to model training, evaluation and inference were performed on 64-core TPU v3 pod slices. Generating 100,000 sequences from the XL models takes approximately 0.64 hours. We generated 100,000 sequences for each of five types of prompts for two checkpoints of each model, for a total of 1M sequences per model. This took approximately 19.2 hours. We estimate generating 3M sequences uses 0.43 MWh.

E More Results

Qualitative examples. Table 7 shows several examples of pairs of documents in C4 whose edit distance is close to our chosen edit similarity threshold of 0.8. Table 8 shows substrings which were identified by EXACTSUBSTR as being in C4 more than once. Table 9 shows several examples of unprompted generations which were identified as memorized.

Distribution of memorization. Figure 6 shows the distribution of memorization amount over all generated sequences when using four types of prompting: train examples with duplicates in train, train examples without any duplicates, validation examples with duplicates in train, and validation examples without any duplicates.

Figure 6: Memorized continuations distribution: edit similarity between generated and ground-truth continuations, by prompt source (train dup, train unique, valid in train, valid unique) and training data (Original, NearDup, ExactSubstr).

URLs with many duplicates. Table 10 shows the URLs that had the largest proportion of examples identified by NEARDUP as near-duplicates. For C4, these tend to be websites that sell many similar products and thus have a large amount of templated text. For RealNews, content aggregators seem especially common.

NEARDUP cluster sizes. Figure 8 shows the distribution of cluster sizes from running NEARDUP on RealNews, LM1B, and Wiki-40B (results for C4 are in Figure 2 of the main paper).