Do Language Models Plagiarize?

Jooyoung Lee, Thai Le, Jinghui Chen, Dongwon Lee
Pennsylvania State University
{jfl5838,tql3,jzc5917,dongwon}@psu.edu

arXiv:2203.07618v1 [cs.CL] 15 Mar 2022

Abstract

Past literature has illustrated that language models do not fully understand the context and sensitivity of text and can sometimes memorize phrases or sentences present in their training sets. In this paper, we investigate whether they not only memorize but also plagiarize training samples when generating artificial texts. Our findings support that they, especially GPT-2, reuse particular pieces of texts from the training corpus with or without obfuscation. We have four main results: 1) language models with more capacity plagiarize more; 2) fine-tuned language models demonstrate differing patterns of plagiarism based on characteristics of auxiliary data; 3) sampling from truncated language modeling distributions tends to heighten the degree of plagiarism as opposed to temperature sampling; and 4) plagiarism in language models can have serious privacy consequences. Overall, our work implies that future research on neural language models should take precautions to avoid models plagiarizing their training datasets.

1   Introduction

Language models (LMs) are core elements of the Natural Language Processing (NLP) field, excelling at a wide range of tasks such as natural language generation (NLG), speech recognition, machine translation, and question answering. The pre-training objective of a language model is to learn the probability distribution over word sequences. Recent trends in language modeling involve large models, large training datasets, and long compute time. For instance, the largest version of GPT-3, trained on 570GB of Internet text, has 175 billion parameters and would cost $4.6M to train using a Tesla V100 cloud instance (Li, 2020). The heated competition to train and present ever larger LMs with larger training corpora might be explained by the fact that an increase in these components does lead to performance enhancement (Kaplan et al., 2020).

Amongst various downstream NLP tasks, neural language models are well known to demonstrate unprecedented performance on neural text generation. According to McCoy et al. (2021), model-generated texts can be as novel as, or even more novel than, human writings. Distinguishing machine-authored from human-written content has even become quite challenging (Clark et al., 2021; Uchendu et al., 2020). Despite these promising results, a growing body of literature has raised concerns about privacy violations by neural language models resulting from data memorization (Carlini et al., 2019; Thakkar et al., 2021; Truex et al., 2018; Arpit et al., 2017; Meehan et al., 2020). More precisely, through membership inference attacks (a type of adversarial attack that aims to predict whether or not a particular example was included in a training set, based on the trained model), Carlini et al. (2021) could extract 32 memorized examples containing individuals' contact information out of 604 GPT-2-generated samples. The authors also confirmed that models' copying behaviors are prone to get worse as both the size of LMs and their training data increase.

A majority of the datasets used to train language models are scraped from the Internet without receiving informed consent from content owners (Brown et al., 2022). That being said, memorization of training samples can be perceived as a violation of copyright and authorship. Other than copying and pasting training sequences, there are other ways to indirectly exploit training examples, such as paraphrasing or summarizing the original content. This action generally refers to plagiarism, the act of reusing another person's work without referencing the individual as its owner (Ali et al., 2011). As shown in Table 1, LMs can plagiarize from training samples in these further ways. This motivates our main inquiry: To what extent do language models directly and indirectly exploit phrases or sentences in their training samples?
Example 1:
Original: [...] *** is the second amendment columnist for Breitbart news and host of bullets with ***, a Breitbart news podcast. He is also the political analyst for Armed American Radio. Follow him on Twitter: @***. Reach him directly at ***@***.com.
Plagiarized: [...] *** is the second amendment columnist for Breitbart news and host of bullets with ***, a Breitbart news podcast. He is also the political analyst for Armed American Radio. Follow him on Twitter: @***. Reach him directly at ***@***.com.

Example 2:
Original: REUTERS/Kevin Lamarque U.S. President Donald Trump and First Lady Melania Trump, with their son Barron, arrive for a New Year's Eve party at his Mar-a-Lago club in Palm Beach, Florida, U.S. December 31, 2017. [...]
Plagiarized: REUTERS/Kevin Lamarque U.S. President Donald Trump, First Lady Melania Trump and their son Barron while aboard Air Force One on their way to Florida, Mar-a-Lago in Palm Beach, Florida to spend the holiday at Trump International Golf Club Mar-a-Lago. [...]

Example 3:
Original: The soldier was accused of leaving his post in Afghanistan in 2009. (CNN) Sgt. Bowe Bergdahl pleaded guilty Wednesday to misbehavior before the enemy and disobeying orders, leaving bound and naked prisoners wide-open to attack or capture at a training base in Afghanistan. [...]
Plagiarized: Bergdahl, who walked off his base in Afghanistan in 2009 and was held by the Taliban for five years, pleaded guilty to desertion and misbehavior before the enemy. [...] The soldiers who had held Bergdahl captive for more than five years were also tried by a judge over their possible actions surrounding Bergdahl's disappearance.

Table 1: Qualitative examples of plagiarism identified in OpenWebText. Duplicated texts are highlighted in yellow, and words/phrases that convey similar meaning without text overlap are highlighted in green. [...] indicates text omitted for brevity. Personally identifiable information has been replaced with ***.

To the best of our knowledge, there is no existing literature that has gone beyond investigating verbatim plagiarism (also known as memorization) in language models.

In this paper, we examine the plagiarizing behaviors of state-of-the-art language models, specifically the GPT-2 family (small/medium/large/xl), considering neural texts that contain not only explicit text overlap with training data but also semantically similar information. Our study is guided by two research questions: (RQ1) Do pre-trained language models plagiarize? and (RQ2) Do fine-tuned language models plagiarize? We first attempt to identify three plagiarism categories (verbatim/paraphrase/idea) in machine-written passages generated by pre-trained GPT-2 with different combinations of model size and decoding methods. For plagiarism type detection, we automate the process by building a novel pipeline that enhances the performance of an existing open source toolkit (Sanchez-Perez et al., 2015). Three GPT-2 small models are then fine-tuned using datasets from the scholarly writing and legal domains. We use these models to compare patterns of plagiarism with respect to the pre-training corpus and the fine-tuning corpora.

We discover three attributes that significantly impact plagiarism: 1) Model size: Amongst the four GPT-2 family models, larger models (GPT-2 large and xl) plagiarize more from the training set than smaller models; 2) Fine-tuning data: There is a positive correlation between the document similarity of the pre-training and fine-tuning sets and plagiarism; 3) Decoding methods and their parameter values: Plagiarism cases differ depending on the decoding approach and parameter values.

Contributions of our work are summarized as follows:

• We establish research inquiries that have not been fully explored. We apply the notion of plagiarism to an NLG task for both pre-trained and fine-tuned LMs. Moreover, the effects of varying decoding approaches and parameters are understudied in memorization research.

• We develop an automatic plagiarism detection pipeline, which leverages a state-of-the-art BERT-based classifier and a Named Entity Recognition (NER) approach to reduce the error rates of Sanchez-Perez et al. (2015).

• We empirically highlight that the risks related to memorization are underestimated. A language model does more than copy and paste training samples; it can further rephrase sentences or steal ideas from someone else's writing. To protect the authorship of original content, our work prompts an urgent need for model-wise solutions apart from data deduplication (Lee et al., 2021) or data sanitization (He et al., 2017).
2   Related Work

Memorization in Language Models. There is a growing body of literature that aims to study the memorization of neural language models by recovering texts in the training corpus (Salem et al., 2020; Kharitonov et al., 2021; Leino and Fredrikson, 2020) or extracting artificially injected canaries (Henderson et al., 2018; Mireshghallah et al., 2021; Zanella-Béguelin et al., 2020). Carlini et al. (2021, 2019) and Brown et al. (2022) have emphasized that data memorization can intentionally or unintentionally lead to sensitive information leakage from a model's training set. Meanwhile, recent studies (Lee et al., 2021; Kandpal et al., 2022) have shown that the training data of language models tend to contain a large number of near-duplicates, and that overlapping phrases included in near-duplicates account for a significant share of memorized text sequences. They further demonstrate the effectiveness of training data deduplication in mitigating the effects of memorization. Still, this technique cannot completely eradicate memorization, because some memorized sequences are present only once. In order to distinguish rare but memorized texts, Zhang et al. (2021) presented a notion of counterfactual memorization, which measures the difference in expected performance of two models trained with or without a particular training sample. Unlike other works, McCoy et al. (2021) attempted to analyze models' memorizing behaviors by assessing the novelty of machine-generated texts. Despite finding duplicated passages from the training set as long as 1,000 words, the authors imply that neural language models have the ability to integrate familiar parts into novel content, rather than simply copying training samples.

Plagiarism Detection. Automated extrinsic plagiarism detection, in general, can be divided into two subtasks: document retrieval and text alignment. While document retrieval focuses on fetching all documents that potentially have plagiarized an existing document, the text alignment subtask detects the location and content of plagiarized texts. Alzahrani (2015) retrieved candidate documents that share exact-copied sequences and computed the similarity between overlapping 8-grams. There are diverse ways to measure text similarity with segmented document pairs. For example, Küppers and Conrad (2012) calculated the Dice coefficient between 250-character chunks of passage pairs, and Shrestha and Solorio (2013) implemented the Jaccard similarity with n-grams. Euclidean distance clustering is a common method as well (Palkovskii and Belov, 2014; Jiffriya et al., 2014).

More recent literature (Gharavi et al., 2020; Nazir et al., 2021) has made continuous efforts in adopting word embeddings and advanced machine learning or deep learning models. Gharavi et al. (2016) extracted word vectors using the word2vec algorithm and applied two similarity metrics: cosine similarity and Jaccard similarity. Instead of using well-established similarity scores bounded by particular thresholds, Altheneyan and Menai (2020) viewed the task as a classification problem and developed a support vector machine (SVM) classifier using several lexical, syntactic, and semantic features. Specifically for paraphrase detection, Agarwal et al. (2018) relied on a Convolutional Neural Network (CNN) to obtain local region information from n-grams and a Recurrent Neural Network (RNN) to capture long-term dependency information.

3   Taxonomy of Plagiarism

Plagiarism occurs when any content, including text, source code, or audio-visual content, is reused without permission or citation from the author of the original work. It has been a longstanding problem, especially in educational and research institutions and publishers, given the availability of digital artifacts (Sutherland-Smith, 2008; Clarke, 2006). Plagiarism can severely damage academic integrity and even hurt individuals' reputation and morality (East, 2010). To detect such activities, it is necessary to have extensive knowledge about plagiarism forms and classes. The most naive approach is to directly copy segments of others' documents and paste them into one's own work. To make plagiarism less obvious, one may replace original words with synonyms or rearrange word order. Similarly, back translation, which uses two independent translators to translate sentences back and forth, is common in paraphrase generation. A more sophisticated approach involves rewriting an abstracted version of the original document while preserving its whole idea, which is more difficult to identify given limited lexical and syntactic similarities. In this work, we focus on three plagiarism types:
• Verbatim plagiarism: exact copies of words or phrases without transformation.

• Paraphrase plagiarism: synonymous substitution, word reordering, and back translation.

• Idea plagiarism: reuse of the core idea by shortening or summarizing the original content.

These are the most commonly studied categories in the plagiarism literature (Lukashenko et al., 2007; Meuschke and Gipp, 2013), and thus we target identification of these types.

4   Automatic Detection of Plagiarism in Language Models

In this section, we describe the process of automated plagiarism type identification. We store OpenWebText in our search engine and then apply text alignment to fetch similar documents.

4.1   Candidate Document Retrieval

The first step of our approach is to distinguish a list of candidate OpenWebText documents that have a high chance of being associated with plagiarism given a synthetic document. Here we utilize a document similarity score as a proxy for plagiarism. Since modern language models like GPT-2 or GPT-3 are known to be trained on voluminous data consisting of millions of documents, it is non-trivial to locally store all documents, compute similarities, and rank them. Hence, we build our search engine using Elasticsearch (https://www.elastic.co/elasticsearch/), an open source search engine built on Apache Lucene that provides a distributed RESTful search service with fast response times and fine-tuned relevancy.

After storing OpenWebText in Elasticsearch, we initiate the searching process by setting the whole content of the original document (in our case, the machine-generated document) as the query. As most queries are lengthy and can therefore slow down retrieval, we clean them by removing stopwords and lemmatizing. Elasticsearch then automatically computes similarities between stored documents and the inserted query and fetches the top-n documents with the highest similarity scores. In our case, scores are computed via the Okapi BM25 algorithm (Robertson et al., 1995), a popular bag-of-words ranking function that Elasticsearch employs as a default. We specify n as 10 for the sake of time efficiency.
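As a concrete illustration, the retrieval step can be sketched as follows. This is a minimal sketch rather than our exact implementation: the index name openwebtext, the field name text, and the NLTK-based query cleaning are illustrative assumptions.

```python
from elasticsearch import Elasticsearch
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")

es = Elasticsearch("http://localhost:9200")  # assumes a local Elasticsearch instance
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_query(text):
    # Shorten lengthy queries by removing stopwords and lemmatizing.
    tokens = [lemmatizer.lemmatize(t) for t in text.lower().split()
              if t not in stop_words]
    return " ".join(tokens)

def retrieve_candidates(machine_text, n=10):
    # BM25 is Elasticsearch's default ranking function, so a plain match
    # query returns the top-n most similar OpenWebText documents.
    resp = es.search(
        index="openwebtext",  # hypothetical index holding one document per record
        query={"match": {"text": clean_query(machine_text)}},
        size=n,
    )
    return [(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]]
```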
4.2   Plagiarism Type Identification

Baseline. Text alignment algorithms aim at extracting and locating similar contiguous text sequences between two given documents and are applicable to a variety of tasks such as information retrieval (Davis and Ogden, 1997; Semmar and Fluhr, 2007), text-reuse detection (Roostaee et al., 2020; Zhou et al., 2020), and translation alignment (Lin et al., 2020). Motivated by previous literature, we employ the open source text alignment tool of Sanchez-Perez et al. (2015) to identify plagiarized texts from pairs of the original document (from the machine-generated corpus) and the candidate document (from OpenWebText). Details on Sanchez-Perez et al. (2015) can be found in Appendix A. (For the purpose of this study, random and translation obfuscation types are grouped as paraphrase plagiarism, and summary obfuscation is considered idea plagiarism.)

Improvement. Although this tool was introduced in 2015, we choose it because its reported performance is robust and it focuses on the longest plagiarized substrings, unlike existing plagiarism detectors trained and evaluated on labeled sentence pairs (Shahmohammadi et al., 2021; Socher et al., 2011; Yin and Schütze, 2015). Nonetheless, by running a sanity check with 200 documents (50 for each plagiarism label) included in our own corpus, we discover that the approach (especially in the paraphrase detection subtask) has some flaws; it labels near-duplicates with a one-character difference as paraphrases and fails to capture small details such as numbers or dates. For example, '2/5 found it helpful' and '1/5 found it useful' are not paraphrases. Therefore, to reduce false positives, we add additional restrictions on top of the existing tool. Specifically, a RoBERTa-based paraphrase identification model (Morris et al., 2020) and NER (we use the SpaCy library, https://spacy.io) are applied to potentially paraphrased segments identified by the open source tool. The RoBERTa classifier has achieved 91.17% accuracy on the evaluation set from the MSRP corpus (https://www.microsoft.com/en-us/download/details.aspx?id=52398). Since the RoBERTa classifier works best in sentence-level comparison, we chunk segments using NLTK's (https://www.nltk.org) sentence tokenizer and feed sentence pairs to both the RoBERTa and NER models.
If there is at least one sentence pair whose paraphrase probability ranges from 0.5 to 0.99 (we set 0.99 as the upper bound to avoid near-duplicate pairs) and whose extracted entities match, we accept the PAN 2015 tool's (Sanchez-Perez et al., 2015) result regarding paraphrase plagiarism. We do not add extra verification steps for verbatim or idea plagiarism because the reported results in Sanchez-Perez et al. (2015) match well with ours. According to annotation results on 200 documents, the accuracy scores of our detection method are as follows: 0.92 for no plagiarism, 1.0 for verbatim plagiarism, 0.88 for paraphrase plagiarism, and 0.62 for idea plagiarism.
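A condensed sketch of this verification step is given below. The checkpoint name is our assumption of the TextAttack MRPC model from Morris et al. (2020), the positive-label scheme is assumed, and the entity-matching rule is simplified to set equality.

```python
import nltk                    # requires nltk.download("punkt")
import spacy                   # requires the en_core_web_sm model
from transformers import pipeline

ner = spacy.load("en_core_web_sm")
# Assumed checkpoint: a RoBERTa model fine-tuned for paraphrase
# identification on MRPC, released with TextAttack (Morris et al., 2020).
clf = pipeline("text-classification", model="textattack/roberta-base-MRPC")

def entities(sentence):
    return {(ent.text, ent.label_) for ent in ner(sentence).ents}

def confirm_paraphrase(segment_a, segment_b, lo=0.5, hi=0.99):
    # Accept the text alignment tool's paraphrase label only if at least
    # one sentence pair is classified as a paraphrase with probability
    # between lo and hi and its named entities match.
    for s_a in nltk.sent_tokenize(segment_a):
        for s_b in nltk.sent_tokenize(segment_b):
            pred = clf([{"text": s_a, "text_pair": s_b}])[0]
            is_paraphrase = pred["label"].endswith("1")  # assumed label scheme
            if is_paraphrase and lo <= pred["score"] <= hi \
                    and entities(s_a) == entities(s_b):
                return True
    return False
```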
5   RQ1: Pre-trained GPT-2 and Plagiarism

In this section, we primarily investigate plagiarism in four different versions (small/medium/large/xl) of the OpenAI GPT-2 model (Radford et al., 2019). Our experimental environment is based on Google Colab Pro+ with a Tesla V100-SXM2-16GB and 55 GB of RAM.

5.1   Experimental Setup

Dataset. GPT-2 is pre-trained on WebText, containing the text subset of 45 million links from Reddit (Radford et al., 2019). After data de-duplication and some heuristic-based cleaning, its final size is over 8 million documents for a total of 40 GB of text. Since OpenAI has not publicly released WebText, we use OpenWebText, an open-source recreation of the WebText corpus (https://skylion007.github.io/OpenWebTextCorpus/). Given that the size of the OpenWebText corpus matches the size described in Radford et al. (2019), we assume it is a reliable source.

Model. GPT-2 is a transformer-based language model which comes in four different sizes (small, medium, large, and xl), with 124M, 355M, 774M, and 1.5B parameters, respectively. According to Radford et al. (2019), the smallest model is equivalent to the original GPT (Radford et al., 2018), and the second smallest is the same size as the largest model from BERT (Devlin et al., 2018). GPT-2 has shown the outstanding efficacy of pre-trained language models on various natural language processing (NLP) tasks, particularly coherent text generation.

Text Generation. Instead of creating neural texts on our own, we use the GPT-2 Output Dataset (https://github.com/openai/gpt-2-output-dataset), which contains 250,000 texts generated by the four versions of the GPT-2 model with three decoding approaches (detailed explanations of the decoding methods used for our analyses are included in Appendix B). The owners of the gpt-2-output-dataset repository have informed us that they used the '<|endoftext|>' token as a prompt and set t=1, k=40, and 0.8 < p < 1. In total, there are 12 (4 model sizes * 3 decoding methods) combinations, and we only analyze the first 10,000 examples in each combination.
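For reference, the three decoding configurations can be reproduced with the Hugging Face transformers library roughly as follows. This is a sketch rather than the code used to build the official dataset, and p=0.9 is one illustrative value from the reported 0.8 < p < 1 range.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")      # small; also gpt2-medium / -large / -xl
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Unconditional generation starts from the '<|endoftext|>' token.
prompt = tok("<|endoftext|>", return_tensors="pt").input_ids
common = dict(max_length=256, do_sample=True, pad_token_id=tok.eos_token_id)

temp_out  = model.generate(prompt, temperature=1.0, top_k=0, **common)  # t=1, no truncation
top_k_out = model.generate(prompt, top_k=40, **common)                  # truncated: top-k
top_p_out = model.generate(prompt, top_p=0.9, top_k=0, **common)        # truncated: top-p

print(tok.decode(temp_out[0], skip_special_tokens=True))
```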
5.2   Experimental Results

[Bar chart: document percentage (x-axis, 0.0 to 2.5) of verbatim, paraphrase, and idea plagiarism for each model size (small/medium/large/xl) under top-p, top-k, and temperature decoding.]
Figure 1: Distribution of Plagiarism Categories w.r.t. Model Size and Decoding Methods

The document distribution of the three plagiarism types based on different model sizes and decoding strategies is displayed in Figure 1. For GPT-2 with the temperature setting, the larger the model size, the higher the observed occurrence of plagiarism. This finding is consistent with previous memorization literature (e.g., Carlini et al. (2021), Levy et al. (2021), Carlini et al. (2022)). We also find that, not limited to verbatim plagiarism, which is equivalent to memorized substrings, the other two types of plagiarism surged alongside the model size. However, our observations do not hold when GPT-2's next token is sampled from a truncated distribution such as top-k and top-p: plagiarism frequencies were the highest when GPT-2 large models were used, not xl. Moreover, top-k and top-p decoding methods are more strongly associated with plagiarism than temperature sampling, regardless of the model size.
5.3   Analyses of Plagiarized Examples

Figure 2: Total Number of PII-Exposing Documents w.r.t. Plagiarism Categories

We now turn our attention to the content of sequences associated with the three plagiarism types (due to page constraints, further details on identified plagiarized content are illustrated in Appendix D). Many studies (Carlini et al., 2021; Kandpal et al., 2022; Zhu et al., 2021; Meehan et al., 2020) have raised concerns about the memorization of large language models due to data privacy leakage. Motivated by their findings, we apply Microsoft's Presidio analyzer (https://microsoft.github.io/presidio/analyzer/), a Python toolkit for personally identifiable information (PII) entity detection (e.g., credit card information, email addresses, phone numbers), to GPT-2-generated texts. Precisely, there are in total 2,168 unique substrings (verbatim: 863 / paraphrase: 524 / idea: 349) plagiarized by pre-trained GPT-2. We set the confidence threshold to 0.7. The total number of plagiarized documents that reveal PII entities is displayed in Figure 2. Of 1,736 plagiarized sequences, nearly 26% include at least one element of location information and a person's full name. Although none of the most highly sensitive information, including individuals' driver's license numbers, credit card information, bank numbers, social security numbers, and IP addresses, is revealed, the results show a possibility of machine-generated texts disseminating personal data such as phone numbers and email addresses not only through exact copying but also through paraphrasing.
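The PII scan can be sketched with Presidio as follows; the entity list shown is a representative subset of the entity types mentioned above, not our full configuration.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def detect_pii(text, threshold=0.7):
    # Flag PII entities whose detection confidence is at least 0.7,
    # mirroring the confidence threshold used in our analysis.
    results = analyzer.analyze(
        text=text,
        entities=["PERSON", "LOCATION", "EMAIL_ADDRESS", "PHONE_NUMBER",
                  "CREDIT_CARD", "US_SSN", "IP_ADDRESS"],
        language="en",
    )
    return [(text[r.start:r.end], r.entity_type, r.score)
            for r in results if r.score >= threshold]
```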
6   RQ2: Fine-tuned GPT-2 and Plagiarism

6.1   Experimental Setup

Dataset. We use three new corpora to fine-tune pre-trained GPT-2 models. Our corpora come from the scholarly writing and legal domains, in which plagiarism studies have been rigorously explored with respect to ethical writing and authorship (Pecorari, 2008; Shahabuddin, 2009; Mahmood, 2010) and plagiarism itself is deemed more sensitive. The first corpus includes 250,000 randomly selected abstracts from arxiv.org, spanning the start of the site in 1993 to the end of 2019 (Geiger, 2019). The second corpus is a subset (n=200,000) of the CORD-19 dataset (Wang et al., 2020), consisting of scholarly articles about the COVID-19 virus. Since most articles in CORD-19 exceed the length of 1,024 tokens, we only consider the first five paragraphs starting from the 'Introduction' section. While the former covers a wide range of disciplines (e.g., Physics, Computer Science, Economics), the latter predominantly includes papers in Medicine (55%), Biology (31%), and Chemistry (3%). Lastly, Lee and Hsiang (2020)'s 290,000 patent claims are acquired for our experiment.

Model. For fine-tuning, we utilize a Python package called gpt-2-simple (https://github.com/minimaxir/gpt-2-simple). Due to computing resource constraints, we only fine-tune the GPT-2 small variant. For simplicity's sake, the three individual models trained on each dataset will be denoted as ArxivAbstractGPT, Cord19GPT, and PatentGPT. In our experiments, we maintain the hyperparameters suggested in public repositories: a learning rate of 1e-4, temperature of 1.0, top-k of 40, and batch size of 1. The three models are trained for 30,000, 44,000, and 32,300 steps, respectively.

Text Generation. For the three fine-tuned models, we generate 10,000 synthetic texts ourselves, using the same prompt and parameter information as the GPT-2 Output Dataset.
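A minimal version of this fine-tuning setup with gpt-2-simple looks roughly like the following; the file name cord19.txt is a placeholder for the respective corpus, and the step count shown is Cord19GPT's.

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # GPT-2 small

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="cord19.txt",    # placeholder: one fine-tuning corpus
              model_name="124M",
              learning_rate=1e-4,
              batch_size=1,
              steps=44000)

# Generation mirrors the GPT-2 Output Dataset settings.
texts = gpt2.generate(sess, return_as_list=True, nsamples=1,
                      temperature=1.0, top_k=40)
```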
6.2   Experimental Results

We observe that the overall frequency of verbatim plagiarism has significantly diminished after fine-tuning (Table 2). This finding aligns with GPT-2's outstanding adaptability to the writing styles of new data. Still, not all fine-tuned models are completely free from plagiarism. While ArxivAbstractGPT had nearly zero plagiarism cases, Cord19GPT substantially reuses the content of OpenWebText through paraphrase or idea plagiarism.

                             Pre-Training Data                 Fine-Tuning Data
Model             Decoding   Verbatim  Paraphrase  Idea        Verbatim  Paraphrase  Idea
PatentGPT         temp       0         0.04        0.16        0         0.07        0.17
                  top-k      0         0.31        1.5         0         0           0
                  top-p      0         0.07        0.79        0         0.02        0
Cord19GPT         temp       0.01      0.01        0.06        0.42      0.3         0.35
                  top-k      0.01      0.51        1.25        0.51      1.79        3.72
                  top-p      0.06      0.34        0.73        0.62      1.43        1.72
ArxivAbstractGPT  temp       0         0           0           0         0.03        0
                  top-k      0         0           0.01        0         0           0
                  top-p      0         0.02        0           0         0.01        0

Table 2: Distribution of Plagiarism Categories w.r.t. Model and Decoding Methods. All numbers indicate the percentage of documents.

Taking into account the strong correlation between memorization and data duplication, we speculate that the observed discrepancies may have been caused by different levels of similarity between each fine-tuning dataset and OpenWebText. For example, if CORD-19 and OpenWebText share much similar or duplicated content, the fine-tuned model would have been immensely exposed to it and may have started to remember it. That being said, we attempt to measure the relevancy between each of the three fine-tuning corpora and the pre-training corpus independently. In order to simplify the task, we recycle part of the Section 4.1 pipeline by: 1) selecting 500 arbitrary documents from a fine-tuning dataset; 2) using document segments as queries in Elasticsearch and retrieving the similarity scores of the 10,000 most relevant OpenWebText documents; and 3) aggregating the averaged scores. As BM25 is sensitive to query length, we only use the first 300 characters of each document for a fair comparison.
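A simplified sketch of this measurement, reusing the hypothetical openwebtext index from the Section 4.1 sketch, is shown below.

```python
import random
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def corpus_similarity(fine_tuning_docs, sample_size=500, top_n=10000):
    # For a random sample of fine-tuning documents, sum the BM25 scores of
    # the top-n OpenWebText hits, then average the sums across the sample.
    # Only the first 300 characters are used as the query, since BM25 is
    # sensitive to query length.
    totals = []
    for doc in random.sample(fine_tuning_docs, sample_size):
        resp = es.search(index="openwebtext",  # hypothetical index from Section 4.1
                         query={"match": {"text": doc[:300]}},
                         size=top_n)
        totals.append(sum(hit["_score"] for hit in resp["hits"]["hits"]))
    return sum(totals) / len(totals)
```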
Indeed, the patent data (score=21369.60) obtained the highest summation of similarity scores to OpenWebText, followed by the CORD-19 (score=19818.82) and arXiv abstract (score=17904.18) datasets. In addition, we perform a manual inspection of plagiarized examples and find that they are highly domain-specific. For instance, sentences such as 'Written informed consent was obtained from the involved participants' or 'Clinical data from animal care facilities were in strict accordance with National Institutes of Health-approved guidelines.' are relatively common expressions used in medical scholarly writing. Many PatentGPT-written instances that are plagiarized are also from patent claims included in the OpenWebText dataset.

We further study fine-tuned models' plagiarism with respect to the fine-tuning data. Our results highlight that Cord19GPT was strongly affiliated with plagiarism as opposed to ArxivAbstractGPT and PatentGPT (Table 2). Although all fine-tuned models are trained for a similar duration and are likely to underfit (we kept the number of training steps relatively small and trained each model only while the gap between its training and test losses remained below 20% of the training loss), nearly 6% of Cord19GPT-generated texts using top-k sampling plagiarize its fine-tuning corpus. We speculate that this phenomenon can be explained by the different characteristics of each dataset. CORD-19 consists of full scholarly papers that already include multiple references, unlike the patent- or abstract-related data. Also, while the topics of the patent or abstract documents are diverse, the CORD-19 dataset is specific to the coronavirus, and its disciplines are centered on Medicine and Biology.

7   Discussion and Limitations

Larger LMs plagiarize more. Consistent with Carlini et al. (2021) and Carlini et al. (2022), we find that larger models (large and xl) generally generate plagiarized sequences more frequently than smaller ones. Depending on the decoding approach, however, the model size that yields the largest amount of plagiarism seems to change: when the next token is sampled from a truncated distribution, the GPT-2 large model plagiarizes the most. On the other hand, the GPT-2 xl becomes more strongly associated with plagiarism than the GPT-2 large when the temperature setting without truncation is employed. This discrepancy may be attributable to error rates of our paraphrase and idea plagiarism detection tool.
Regardless, it is evident that larger models plagiarize significantly more training data. Considering LMs' performance improvements at larger model sizes, this finding sheds light on a trade-off between model performance and the authorship or copyright protection of training samples.

Fine-tuning with an auxiliary dataset has varying impacts on plagiarism of LMs based on its characteristics. To the best of our knowledge, we are the first to inspect memorization or plagiarism issues of fine-tuned language models. Our findings highlight that fine-tuning a model with auxiliary data can mitigate memorization of the pre-training dataset. Still, other types of plagiarism cases have surged, in the case of PatentGPT and Cord19GPT, alongside the similarity levels between pre-training and fine-tuning corpora. Interestingly, this does not influence plagiarism from the pre-training corpus: only Cord19GPT demonstrates an intensified degree of plagiarism, with plagiarized documents making up to 6%. We are uncertain why Cord19GPT behaves differently, but we assume this is due to the specificity of the CORD-19 dataset. As part of future work, we will quantitatively compare topical variations of these datasets and validate our assumption.

Decoding methods and parameters affect plagiarism. The varying effects of decoding methods and their parameters on text quality and diversity have been extensively studied (DeLucia et al., 2020; Dou et al., 2021; Basu et al., 2020), but not from the plagiarism perspective. In particular, top-p sampling is reported to be the most effective decoding method in various generation settings (Ippolito et al., 2019a; Zhang et al., 2020). Our analyses show increased plagiarism frequencies when using top-p and top-k decoding strategies as opposed to the temperature setting. That is, sampling the next token from truncated LM distributions can lead to more plagiarism cases. Our supplementary finding reported in Appendix C further confirms that altering decoding parameters, including t and p, can significantly affect models' plagiarism, as it does their novelty and quality. It is therefore critical to choose decoding strategies and parameters carefully, not only through the lens of quality and diversity but also from the plagiarism perspective.

Plagiarism can pose privacy harms. Our findings add value to ongoing discussions around privacy breaches resulting from the memorization of deep neural language models. We discover multiple plagiarized examples in which users' sensitive or private data, such as phone numbers or email addresses, are exposed. Although all identified content was publicly available on the Web, this does not give LMs the right to reveal such personal information without consent. Our research overall raises a concern about the growing use of language models, considering their potential harm to both our privacy and authorship.

Limitations. First, our findings are based on one particular language model, GPT-2, and thus may not generalize to other models such as GPT-3 and T5. We acknowledge that other language models may demonstrate different patterns of plagiarism. Future work can revisit the proposed research questions on more diverse neural language models. Second, our plagiarism type detection pipeline employs additional strict restrictions, especially on paraphrase detection, to minimize false positives. This could have limited us from capturing nuanced plagiarism and led to missing some examples. For instance, the NER library will fail to label the sentence pair ('Trump has arrived in Seoul at 11:00AM on March 12th.', 'Trump has arrived in Seoul in the morning of March 12th, 2018') as paraphrases because the extracted entities do not directly match. Moreover, we only identify one type of plagiarism for a given document pair. It is possible that a document pair contains multiple plagiarism categories.

8   Conclusion

Our work presents the first holistic and empirical analysis of plagiarism in large language models, built on a novel pipeline for the automatic identification of plagiarized content. We conclude that GPT-2 can regenerate phrases, sentences, and even ideas originally included in OpenWebText, its pre-training corpus. Worryingly, this behavior worsens as model size increases. We have also shown that plagiarism patterns are more complicated than expected: 1) depending on properties of the fine-tuning data, such as corpus similarities or topical variations, fine-tuned LMs can either be plagiarism-free or intensely plagiarize from both pre-training and fine-tuning corpora; 2) top-k and top-p sampling exploit more of the training data without crediting content creators compared to temperature sampling. To sum up, careful examination of the datasets used to pre-train or fine-tune, and careful deployment of decoding approaches, are necessary when performing NLG tasks.
While effort has also been made toward preserving privacy in LMs by filtering sensitive information out of training sets or by adopting Differential Privacy (DP) algorithms (Dwork et al., 2006; Dwork, 2008), there has been less progress toward resolving memorization and plagiarism issues. The most common solution is to apply data deduplication techniques to the training data (Lee et al., 2021; Kandpal et al., 2022), which are computationally exhaustive and do not completely eradicate duplicated verbatim text sequences. Most importantly, it is uncertain whether these methods can reduce cases of paraphrase or idea plagiarism. Prior to indiscriminate data collection and gigantic model training, we should focus on the development of LMs that are trained exclusively on sanitized and consented data and that do not emit exact or rephrased copies of it.
                                                           2650.

References

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169.

Basant Agarwal, Heri Ramampiaro, Helge Langseth, and Massimiliano Ruocco. 2018. A deep network model for paraphrase detection in short text messages. Information Processing & Management, 54(6):922–937.

Asim M El Tahir Ali, Hussam M Dahwa Abdulla, and Vaclav Snasel. 2011. Overview and comparison of plagiarism detection tools. In Dateso, pages 161–172.

Alaa Saleh Altheneyan and Mohamed El Bachir Menai. 2020. Automatic plagiarism detection in obfuscated text. Pattern Analysis and Applications, 23(4):1627–1650.

Salha Alzahrani. 2015. Arabic plagiarism detection using word correlation in n-grams with k-overlapping approach. In Proceedings of the Workshops at the 7th Forum for Information Retrieval Evaluation (FIRE), pages 123–125.

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242. PMLR.

Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R Varshney. 2020. Mirostat: A neural text decoding algorithm that directly controls perplexity. arXiv preprint arXiv:2007.14966.

Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. 2022. What does it mean for a language model to preserve privacy? arXiv preprint arXiv:2202.05520.

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646.

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. 2021. All that's 'human' is not gold: Evaluating human evaluation of generated text. arXiv preprint arXiv:2107.00061.

Roger Clarke. 2006. Plagiarism by academics: More complex than it seems. Journal of the Association for Information Systems, 7(1):5.

Mark W Davis and William C Ogden. 1997. Free resources and advanced alignment for cross-language text retrieval. In TREC, volume 1997, pages 385–395. Citeseer.

Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc. 2020. Decoding methods for neural narrative generation. arXiv preprint arXiv:2010.07375.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A Smith, and Yejin Choi. 2021. Scarecrow: A framework for scrutinizing machine text. arXiv preprint arXiv:2107.01294.

Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.

Julianne East. 2010. Judging plagiarism: a problem of morality and convention. Higher Education, 59(1):69–83.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.

R. Stuart Geiger. 2019. ArXiv Archive: A tidy and complete archive of metadata for papers on arxiv.org, 1993–2019.

Erfaneh Gharavi, Kayvan Bijari, Kiarash Zahirnia, and Hadi Veisi. 2016. A deep learning approach to Persian plagiarism detection. FIRE (Working Notes), 34:154–159.

Erfaneh Gharavi, Hadi Veisi, and Paolo Rosso. 2020. Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase. Neural Computing and Applications, 32(14):10593–10607.

Zaobo He, Zhipeng Cai, and Jiguo Yu. 2017. Latent-data privacy preserving with customized data utility for social network data. IEEE Transactions on Vehicular Technology, 67(1):665–673.

Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. 2018. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2019a. Automatic detection of generated text is easiest when humans are fooled. arXiv preprint arXiv:1911.00650.

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2019b. Human and automatic detection of generated text.

MAC Jiffriya, MAC Akmal Jahan, and Roshan G Ragel. 2014. Plagiarism detection on electronic text based assignments using vector space model. In 7th International Conference on Information and Automation for Sustainability, pages 1–5. IEEE.

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. arXiv preprint arXiv:2202.06539.

Eugene Kharitonov, Marco Baroni, and Dieuwke Hupkes. 2021. How BPE affects memorization in transformers. arXiv preprint arXiv:2110.02782.

Robin Küppers and Stefan Conrad. 2012. A set-based approach to plagiarism detection. In CLEF (Online Working Notes/Labs/Workshop).

Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62:101983.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.

Klas Leino and Matt Fredrikson. 2020. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. In 29th USENIX Security Symposium (USENIX Security 20), pages 1605–1622.

Sharon Levy, Michael Saxon, and William Yang Wang. 2021. Investigating memorization of conspiracy theories in text generation. arXiv preprint arXiv:2101.00379.

Chuan Li. 2020. OpenAI's GPT-3 language model: A technical overview.

Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2020. Pre-training multilingual neural machine translation by leveraging alignment information. arXiv preprint arXiv:2010.03142.

Romans Lukashenko, Vita Graudina, and Janis Grundspenkis. 2007. Computer-based plagiarism detection methods and tools: an overview. In Proceedings of the 2007 International Conference on Computer Systems and Technologies, pages 1–6.

Sheikh Tariq Mahmood. 2010. Intellectual property right and patent: Conceptual awareness of PhD students about plagiarism. In 2010 International Conference on Education and Management Technology, pages 694–700. IEEE.

R Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2021. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. arXiv preprint arXiv:2111.09509.
  2022. Deduplicating training data mitigates pri-
  vacy risks in language models. arXiv preprint           Casey Meehan, Kamalika Chaudhuri, and Sanjoy Das-
  arXiv:2202.06539.                                         gupta. 2020. A non-parametric test to detect data-
                                                            copying in generative models. In International Con-
Jared Kaplan, Sam McCandlish, Tom Henighan,                 ference on Artificial Intelligence and Statistics.
   Tom B Brown, Benjamin Chess, Rewon Child, Scott
   Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.      Norman Meuschke and Bela Gipp. 2013. State-of-the-
   2020. Scaling laws for neural language models.           art in detecting academic plagiarism. International
   arXiv preprint arXiv:2001.08361.                         Journal for Educational Integrity, 9(1).
Fatemehsadat Mireshghallah, Huseyin A Inan, Mar-             Approaches to Semitic Languages: Common Issues
  cello Hasegawa, Victor Rühle, Taylor Berg-                 and Resources, pages 73–80.
  Kirkpatrick, and Robert Sim. 2021. Privacy regu-
  larization: Joint privacy-utility optimization in lan-   Syed Shahabuddin. 2009. Plagiarism in academia. In-
  guage models. arXiv preprint arXiv:2103.07567.             ternational Journal of Teaching and Learning in
                                                             Higher Education, 21(3):353–359.
John X Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby,
  Di Jin, and Yanjun Qi. 2020. Textattack: A               Hassan Shahmohammadi, MirHossein Dezfoulian, and
  framework for adversarial attacks, data augmenta-          Muharram Mansoorizadeh. 2021. Paraphrase detec-
  tion, and adversarial training in nlp. arXiv preprint      tion using lstm networks and handcrafted features.
  arXiv:2005.05909.                                          Multimedia Tools and Applications, 80(4):6479–
                                                             6492.
Azra Nazir, Roohie Naaz Mir, and Shaima Qureshi.
  2021. Idea plagiarism detection with recurrent neu-
                                                           Prasha Shrestha and Thamar Solorio. 2013. Using a va-
  ral networks and vector space model. International
                                                             riety of n-grams for the detection of different kinds
  Journal of Intelligent Computing and Cybernetics.
                                                             of plagiarism. Notebook for PAN at CLEF, 2013.
Yurii Palkovskii and Alexei Belov. 2014. Developing
  high-resolution universal multi-type n-gram plagia-      Richard Socher, Eric Huang, Jeffrey Pennin, Christo-
  rism detector. Cappellato et al.[35].                      pher D Manning, and Andrew Ng. 2011. Dynamic
                                                             pooling and unfolding recursive autoencoders for
Diane Pecorari. 2008. Academic writing and plagia-           paraphrase detection. Advances in neural informa-
  rism: A linguistic analysis. Bloomsbury Publishing.        tion processing systems, 24.
Alec Radford, Karthik Narasimhan, Tim Salimans, and        Wendy Sutherland-Smith. 2008. Plagiarism, the Inter-
  Ilya Sutskever. 2018. Improving language under-            net, and student learning: Improving academic in-
  standing by generative pre-training.                       tegrity. Routledge.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
                                                           Om Dipakbhai Thakkar, Swaroop Ramaswamy, Rajiv
  Dario Amodei, Ilya Sutskever, et al. 2019. Lan-
                                                            Mathews, and Francoise Beaufays. 2021. Under-
  guage models are unsupervised multitask learners.
                                                            standing unintended memorization in language mod-
  OpenAI blog, 1(8):9.
                                                            els under federated learning. In Proceedings of the
Stephen E Robertson, Steve Walker, Susan Jones,             Third Workshop on Privacy in Natural Language
   Micheline M Hancock-Beaulieu, Mike Gatford, et al.       Processing, pages 1–10.
  1995. Okapi at trec-3. Nist Special Publication Sp,
  109:109.                                                 Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei
                                                             Yu, and Wenqi Wei. 2018. Towards demystify-
Meysam Roostaee, Seyed Mostafa Fakhrahmad, and                ing membership inference attacks. arXiv preprint
 Mohammad Hadi Sadreddini. 2020. Cross-language               arXiv:1807.09173.
 text alignment: A proposed two-level matching
 scheme for plagiarism detection. Expert Systems           Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee.
 with Applications, 160:113718.                              2020. Authorship attribution for neural text gener-
                                                             ation. In Conf. on Empirical Methods in Natural
Ahmed Salem, Apratim Bhattacharya, Michael Backes,           Language Processing (EMNLP).
  Mario Fritz, and Yang Zhang. 2020. {Updates-
  Leak}: Data set inference and reconstruction attacks     Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar,
  in online learning. In 29th USENIX Security Sympo-         Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn
  sium (USENIX Security 20), pages 1291–1308.                Funk, Rodney Kinney, Ziyang Liu, William Merrill,
                                                             et al. 2020. Cord-19: The covid-19 open research
Miguel A Sanchez-Perez, Alexander Gelbukh, and
                                                             dataset. ArXiv.
  Grigori Sidorov. 2015. Adaptive algorithm for pla-
  giarism detection: The best-performing approach
                                                           Wenpeng Yin and Hinrich Schütze. 2015. Convolu-
  at pan 2014 text alignment competition. In Inter-
                                                             tional neural network for paraphrase identification.
  national Conference of the Cross-Language Evalu-
                                                             In Proceedings of the 2015 Conference of the North
  ation Forum for European Languages, pages 402–
                                                            American Chapter of the Association for Computa-
  413. Springer.
                                                             tional Linguistics: Human Language Technologies,
Miguel A Sanchez-Perez, Grigori Sidorov, and Alexan-         pages 901–911.
  der F Gelbukh. 2014. A winning approach to text
  alignment for text reuse detection at pan 2014. In       Santiago Zanella-Béguelin, Lukas Wutschitz, Shruti
  CLEF (Working Notes), pages 1004–1011.                     Tople, Victor Rühle, Andrew Paverd, Olga Ohri-
                                                             menko, Boris Köpf, and Marc Brockschmidt. 2020.
Nasredine Semmar and Christian Fluhr. 2007. Arabic           Analyzing information leakage of updates to natural
  to french sentence alignment: Exploration of a cross-      language models. In Proceedings of the 2020 ACM
  language information retrieval approach. In Pro-           SIGSAC Conference on Computer and Communica-
  ceedings of the 2007 Workshop on Computational             tions Security, pages 363–375.
Chiyuan Zhang, Daphne Ippolito, Katherine Lee,         That is, the probability distribution of a word se-
  Matthew Jagielski, Florian Tramèr, and Nicholas      quence can be calculated through the product of
  Carlini. 2021.     Counterfactual memorization
                                                       conditional next word distributions. In response to
  in neural language models.       arXiv preprint
  arXiv:2112.12938.                                    an arbitrary prompt, GPT-2 could adapt to its style
                                                       and content and generate synthetic texts. Decoding
Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and     methods can also be applied to GPT-2, which are
  Arvind Neelakantan. 2020. Trading off diversity
                                                       well known to be critical for performance in text
  and quality in natural language generation. arXiv
  preprint arXiv:2004.10450.                           generation (Ippolito et al., 2019b). We primarily
                                                       consider the following decoding strategies:
Xuhui Zhou, Nikolaos Pappas, and Noah A
  Smith. 2020.    Multilevel text alignment with       • Temperature (Ackley et al., 1985): control the
  cross-document attention.       arXiv preprint         randomness of predictions by dividing the logits
  arXiv:2010.01263.                                      by t before applying softmax
Derui Zhu, Jinfu Chen, Weiyi Shang, Xuebing Zhou,      • Top-k (Fan et al., 2018): filter the k most likely
  Jens Grossklags, and Ahmed E Hassan. 2021.
  Deepmemory: Model-based memorization analy-            next words and redistribute the probability mass
  sis of deep neural language models. In 2021
  36th IEEE/ACM International Conference on Au-        • Top-p (Holtzman et al., 2019): choose from the
  tomated Software Engineering (ASE), pages 1003–        smallest possible set of words whose cumulative
  1015. IEEE.                                            probability exceeds the probability p

A   Details on Sanchez-Perez et al. (2015)

Sanchez-Perez et al. (2014) initially presented the winning approach at the plagiarism detection competition of PAN 2014 (https://pan.webis.de/clef14/pan14-web/text-alignment.html) and further improved its performance by adopting adaptive parameter selection (Sanchez-Perez et al., 2015).

Their method consists of five steps: (1) text pre-processing (lower-casing all characters, tokenizing, and stemming); (2) obfuscation-type identification (verbatim, random, translation, or summary obfuscation); (3) seeding (given two documents, deconstructing long passages into smaller segments and finding candidate pairs through sentence-level similarity measurement); (4) extension (forming larger similar text fragments via clustering); and (5) filtering (removing overlapping and short plagiarized fragments). In summary, they transform the suspicious and source sentences into term frequency–inverse document frequency (tf-idf) weight vectors and then calculate the similarity between sentence pairs using the Dice coefficient and the cosine measure. Adaptive parameter selection is achieved by recursively testing two settings, one for the summary obfuscation corpus and one for the other three corpora.
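To make the seeding step concrete, the following is a minimal Python sketch of sentence-pair scoring with tf-idf vectors, the cosine measure, and the Dice coefficient. It is our illustration, not the authors' implementation: the thresholds cos_t and dice_t are placeholders, and the tokenization omits the stemming step.

```python
# A minimal sketch of the tf-idf seeding step: sentence pairs are kept as
# plagiarism candidates when both the cosine measure and the Dice
# coefficient clear a threshold. Our illustration, not the authors' code;
# cos_t and dice_t are placeholder thresholds and stemming is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def dice_coefficient(a, b):
    """Dice coefficient over the sets of word types in two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def seed_candidate_pairs(src_sents, susp_sents, cos_t=0.3, dice_t=0.3):
    """Return (source, suspicious) index pairs exceeding both thresholds."""
    vectorizer = TfidfVectorizer()  # one vocabulary over both documents
    tfidf = vectorizer.fit_transform(src_sents + susp_sents)
    src_vecs, susp_vecs = tfidf[: len(src_sents)], tfidf[len(src_sents):]
    cos = cosine_similarity(src_vecs, susp_vecs)
    return [
        (i, j)
        for i in range(len(src_sents))
        for j in range(len(susp_sents))
        if cos[i, j] >= cos_t
        and dice_coefficient(src_sents[i], susp_sents[j]) >= dice_t
    ]
```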
B   Decoding Methods

GPT-2 is an autoregressive language model that predicts one token at a time in a left-to-right fashion. That is, the probability of a word sequence is calculated as the product of conditional next-word distributions: p(w_1, ..., w_n) = ∏_i p(w_i | w_1, ..., w_{i-1}). In response to an arbitrary prompt, GPT-2 can adapt to its style and content and generate synthetic text. Decoding methods, which are well known to be critical for text generation performance (Ippolito et al., 2019b), can also be applied to GPT-2. We primarily consider the following decoding strategies (sketched in code below):

• Temperature (Ackley et al., 1985): control the randomness of predictions by dividing the logits by t before applying softmax.

• Top-k (Fan et al., 2018): filter the k most likely next words and redistribute the probability mass among them.

• Top-p (Holtzman et al., 2019): choose from the smallest possible set of words whose cumulative probability exceeds the probability p.

Changing decoding parameters can substantially influence the diversity and quality of generated texts: novelty can be enhanced by increasing the parameter values (t, k, p), but this comes at the cost of degraded quality (McCoy et al., 2021).
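To make these strategies concrete, here is a minimal numpy sketch of the three sampling transforms. This is our illustration of the standard formulations, not the decoding code used in the experiments:

```python
# A minimal sketch of temperature, top-k, and top-p sampling over a vector
# of next-token logits. Illustrative only, not the experiments' decoding code.
import numpy as np

_rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def temperature_sample(logits, t=1.0):
    # Temperature: divide the logits by t before applying softmax.
    probs = softmax(logits / t)
    return _rng.choice(len(logits), p=probs)


def top_k_sample(logits, k=40):
    # Top-k: keep the k most likely tokens and redistribute their mass.
    top = np.argsort(logits)[-k:]
    return top[_rng.choice(len(top), p=softmax(logits[top]))]


def top_p_sample(logits, p=0.9):
    # Top-p (nucleus): sample from the smallest set of tokens whose
    # cumulative probability exceeds p, after renormalizing within the set.
    order = np.argsort(logits)[::-1]
    probs = softmax(logits)[order]
    cutoff = min(int(np.searchsorted(np.cumsum(probs), p)) + 1, len(order))
    nucleus = probs[:cutoff] / probs[:cutoff].sum()
    return order[:cutoff][_rng.choice(cutoff, p=nucleus)]
```

In practice these transforms are often combined; libraries such as HuggingFace Transformers expose them as the temperature, top_k, and top_p arguments of the generate method.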
C   Experiment with Decoding Parameters

In order to measure how decoding parameters affect plagiarism, 1,000 documents are generated for each parameter setting (t=1, k=40, p ∈ [0.7, 0.8, 0.9]). We experiment with various values only for Cord19GPT because there are not many plagiarism cases for the other two fine-tuned models. Figure 3 shows the distribution of plagiarism types arising from the CORD-19 dataset under varying parameter settings. Results indicate that the higher t and p were set, the more plagiarism tended to occur. Interestingly, the parameter values of top-k sampling did not significantly affect Cord19GPT's plagiarizing behavior.

[Figure 3: Distribution of Plagiarism Categories w.r.t. Decoding Parameters (Cord19GPT). Bar chart of the document percentage (0–20%) of verbatim, paraphrase, and idea plagiarism for t ∈ {0.8, 0.9, 1.0}, k ∈ {20, 40, 60, 80, 100}, and p ∈ {0.7, 0.8, 0.9}.]
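A sweep of this kind can be sketched with the HuggingFace Transformers API. Note that the "gpt2" checkpoint, the prompt, and the lengths below are placeholders rather than our exact configuration, and that the experiment above generates 1,000 documents per setting rather than the single sample shown here:

```python
# A hedged sketch of the top-p sweep using HuggingFace Transformers. The
# "gpt2" checkpoint and the prompt are placeholders for the fine-tuned
# Cord19GPT model and its prompts; the experiment generates 1,000
# documents per parameter setting rather than the single sample here.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The transmission of the virus", return_tensors="pt")

for p in (0.7, 0.8, 0.9):  # vary top-p while t and k stay fixed
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        top_k=40,
        top_p=p,
        max_length=128,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```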

D   Details on Plagiarized Texts

We inspect the text segments plagiarized from OpenWebText by the pre-trained GPT-2 model, as our primary focus is understanding the plagiarizing behaviors of GPT-2 itself. See Table 3 for the plagiarized content we discovered. We find that the longest memorized text contains 5,920 characters (Table 4). Based on our manual inspection of verbatim plagiarism, many sequences come from highly duplicated texts in the training corpus: on average, memorized texts appeared 205 times (at most 14,246 times) in 50% of the OpenWebText corpus. At the same time, there still exist several instances in which models memorized texts without seeing them more than two times (Table 5).
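As a rough guide to how such duplication counts can be computed, here is a simple sketch that counts (possibly overlapping) occurrences of a memorized span across training documents. It is our illustration, not the measurement pipeline behind the numbers above, which may normalize casing and whitespace differently:

```python
# A simple sketch for counting (possibly overlapping) occurrences of a
# memorized span across training documents. Our illustration, not the
# paper's exact measurement pipeline.
def count_occurrences(span, documents):
    total = 0
    for doc in documents:
        start = 0
        while (idx := doc.find(span, start)) != -1:
            total += 1
            start = idx + 1  # step past the match start to allow overlaps
    return total
```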
Table 3: Examples of plagiarized text segments in GPT-2 generations and the matching OpenWebText content.

Type: Verbatim
Neural text: [...] Newsletter Sign Up Continue reading the main story Please verify you're not a robot by clicking the box. Invalid email address. Please re-enter. You must select a newsletter to subscribe to. [...]
OpenWebText: same as neural text

Type: Verbatim
Neural text: [...] This article contains affiliate links, which means we may earn a small commission if a reader clicks through and makes a purchase. All our journalism is independent and is in no way influenced by any advertiser or commercial initiative. The links are powered by Skimlinks. By clicking on an affiliate link, you accept that Skimlinks cookies will be set. More information.
OpenWebText: same as neural text

Type: Verbatim
Neural text: it reminded me of a feeling I've had right there on that road before. It reminded me of all the times that people have come out to support the blockade and stood together to make sure those trees stay standing. And I wish we didn't have to do it again, but I know that if we have to, we can. Yes, we stopped them logging the Upper Florentine and we can do it again [...]
OpenWebText: same as neural text

Type: Paraphrase
Neural text: [...] Conflict of Interest Disclosures: None reported. Funding/Support: Medical Research Council Biotechnology Programme [...]
OpenWebText: [...] Conflict of Interest Disclosures: Both authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest and none were reported. Funding/Support: This work was supported by grant [...]

Type: Paraphrase
Neural text: [...] HOWEVER, SOME STATES DO NOT ALLOW THE EXCLUSION OR LIMITATION OF IMPLIED WARRANTIES, SO THE ABOVE LIMITATION OR EXCLUSION MAY NOT APPLY TO YOU.
OpenWebText: [...] Some states do not allow the exclusion or limitation of liability for consequential or incidental damages so the foregoing limitation may not apply.

Type: Paraphrase
Neural text: "I've got to use some Tic Tacs just in case I start kissing her," an apparently angry Trump says in a video obtained by The Washington Post. "I'm automatically attracted to beautiful - I just start kissing them. It's like a magnet. Just kiss. I don't even wait. And when you're a star, they let you do it. You can do anything." Trump continues.
OpenWebText: "I better use some Tic Tacs in case I start kissing her," Trump says, with the sound of mints rattling in a box audible in the background. "I'm automatically attracted to beautiful - I just start kissing them. It's like a magnet... And when you're a star, they let you do it. You can do anything.

Type: Idea
Neural text: If Horrible Combustion becomes the target of a spell or ability that gives it lethal damage, Horrible Combustion's ability will be activated and the lethal damage will be dealt. Once that occurs, no damage can be prevented. (This is known as the "Kai Mauler" effect.) Horrible Combustion doesn't deal lethal damage itself. If Horrible Combustion deals lethal damage but damage it can't prevent is dealt to a player, Horrible Combustion will cause that player to lose life equal to an amount determined by how much Horrible Combustion was dealt damage in damage prevention. The actual amount of life based on the lethal damage is determined before any damage is dealt. For example, if an opponent controls Coralhelm Commander, and Horrible Combustion deals 12/60 damage to a creature, the half of that damage dealt by the outsider creature will cause that player to lose 12 life.
OpenWebText: [...] If that creature deals combat damage to a player at the same time it's dealt lethal damage (perhaps because it has trample and was blocked), it will die before the triggered ability resolves and puts +1/+1 counters on it. [...]

Type: Idea
Neural text: For example, Ontario is leading in the federal-provincial partnership by using the Mental Health Reconciliation Strategy to increase access to mental health care across the province. In August 2017, the government announced 50 million in savings from the partnership strategy by taking a 6-billion hit to provide mental health care services to Canadians in Ontario. There are some important highlights from this partnership: Provincial governments—particularly the provincial Liberal party—will be providing $2.5 billion over the next three years to strengthen mental health services and to help build community-based services at community and religious meetings The government committed to using the Canadian Mental Health Care Act to enhance mental health services in Ontario to the benefit of Canadian communities This will help to create a pathway to an accessible and quality, long-term mental health care system
OpenWebText: Better Access to Mental Health Services: Phase One of the Comprehensive Mental Health and Addictions Strategy provided more than 50,000 additional children and youth across Ontario with access to mental health and addictions services.