Improving Paraphrase Detection with the Adversarial Paraphrasing Task
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Improving Paraphrase Detection with the Adversarial Paraphrasing Task Animesh Nighojkar, John Licato Advancing Machine and Human Reasoning Lab Department of Computer Science and Engineering University of South Florida Tampa, FL, USA {anighojkar,licato}@usf.edu Abstract definition that seems to sufficiently encompass the others is given by Bhagat and Hovy (2013): “para- If two sentences have the same meaning, phrases are sentences or phrases that use different it should follow that they are equivalent in wording to convey the same meaning.” However, their inferential properties, i.e., each sen- even that definition is somewhat imprecise, as it tence should textually entail the other. How- ever, many paraphrase datasets currently in lacks clarity on what it assumes ‘meaning’ means. widespread use rely on a sense of paraphrase If paraphrasing is a property that can hold be- based on word overlap and syntax. Can we tween sentence pairs,1 then it is reasonable to as- teach them instead to identify paraphrases in sume that sentences that are paraphrases must have a way that draws on the inferential properties equivalent meanings at the sentence level (rather of the sentences, and is not over-reliant on than exclusively at the levels of individual word lexical and syntactic similarities of a sentence meanings or syntactic structures). Here a useful pair? We apply the adversarial paradigm to this question, and introduce a new adversarial test is that recommended by inferential role se- method of dataset creation for paraphrase iden- mantics or inferentialism (Boghossian, 1994; Pere- tification: the Adversarial Paraphrasing Task grin, 2006), which suggests that the meaning of a (APT), which asks participants to generate se- statement s is grounded in its inferential properties: mantically equivalent (in the sense of mutually what one can infer from s and from what s can be implicative) but lexically and syntactically dis- inferred. parate paraphrases. These sentence pairs can Building on this concept from inferentialism, we then be used both to test paraphrase identifi- cation models (which get barely random accu- assert that if two sentences have the same inferen- racy) and then improve their performance. To tial properties, then they should also be mutually accelerate dataset generation, we explore au- implicative. Mutual Implication (MI) is a binary re- tomation of APT using T5, and show that the lationship between two sentences that holds when resulting dataset also improves accuracy. We each sentence textually entails the other (i.e., bidi- discuss implications for paraphrase detection rectional entailment). MI is an attractive way of and release our dataset in the hope of making operationalizing the notion of two sentences hav- paraphrase detection models better able to de- tect sentence-level meaning equivalence. ing “the same meaning,” as it focuses on inferential relationships between sentences (properties of the 1 Introduction sentences as wholes) instead of just syntactic or lexical similarities (properties of parts of the sen- Although there are many definitions of ‘paraphrase’ tences). As such, we will assume in this paper in the NLP literature, most maintain that two sen- that two sentences are paraphrases if and only if tences that are paraphrases have the same mean- they are M I.2 In NLP, modeling inferential re- ing or contain the same information. Pang et al. lationships between sentences is the goal of the (2003) define paraphrasing as “expressing the same textual entailment, or natural language inference information in multiple ways” and Bannard and (NLI) tasks (Bowman et al., 2015). We test MI Callison-Burch (2005) call paraphrases “alternative 1 ways of conveying the same information.” Ganitke- In this paper we study paraphrase between sentences, and do not address the larger scope of how our work might extend vitch et al. (2013) write that “paraphrases are dif- to paraphrasing between arbitrarily large text sequences. 2 fering textual realizations of the same meaning.” A The notations used in this paper are listed in Table 1. 7106 Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7106–7116 August 1–6, 2021. ©2021 Association for Computational Linguistics
using the version of RoBERTalarge released by Nie Concept of mutual implication MI / bidirectional textual entailment) et al. (2020) trained on a combination of SNLI Property of being mutually implicative, (Bowman et al., 2015), multiNLI (Williams et al., MI as determined by our NLI model 2018), FEVER-NLI (Nie et al., 2019), and ANLI APT Adversarial Paraphrasing Task (Nie et al., 2020). Property of passing the adversarial Owing to expeditious progress in NLP research, AP T paraphrase test (see §3) performance of models on benchmark datasets is APH Human-generated APT dataset ‘plateauing’ — with near-human performance often T5base -generated APT dataset APT 5 achieved within a year or two of their release — (Note that APT 5 = APTM5 ∪ APTT5w ) and newer versions, using a different approach, APTM5 MSRP subset of APT 5 are constantly having to be created, for instance, APTT5w TwitterPPDB subset of APT 5 GLUE (Wang et al., 2019) and SuperGLUE (Wang et al., 2020). The adversarial paradigm of dataset Table 1: Notations used in the paper. creation (Jia and Liang, 2017a,b; Bras et al., 2020; Nie et al., 2020) has been widely used to address this ‘plateauing,’ and the ideas presented in this 2005; Fernando and Stevenson, 2008; Socher et al., paper draw inspiration from it. In the remainder of 2011; Jia et al., 2020) is an important task in the this paper, we apply the adversarial paradigm to the field of NLP, finding downstream applications in problem of paraphrase detection, and demonstrate machine translation (Callison-Burch et al., 2006; the following novel contributions: Apidianaki et al., 2018; Mayhew et al., 2020), text summarization, plagiarism detection (Hunt et al., • We use the adversarial paradigm to create 2019), question answering, and sentence simplifi- a new benchmark examining whether para- cation (Guo et al., 2018). Paraphrases have proven phrase detection models are assessing the to be a crucial part of NLP and language educa- meaning equivalence of sentences rather than tion, with research showing that paraphrasing helps being over-reliant on word-level measures. improve reading comprehension skills (Lee and We do this by collecting paraphrases that Colln, 2003; Hagaman and Reid, 2008). Question are MI but are as lexically and syntactically paraphrasing is an important step in knowledge- disparate as possible (as measured by low based question answering systems for matching BLEURT scores). We call this the Adversarial questions asked by users with knowledge-based Paraphrasing Task (APT). assertions (Fader et al., 2014; Yin et al., 2015). Paraphrase generation (given a sentence, gener- • We show that a SOTA language model trained ate its paraphrase) (Gupta et al., 2018) is an area on paraphrase datasets perform poorly on our of research benefiting paraphrase detection as well. benchmark. However, when further trained Lately, many paraphrasing datasets have been in- on our adversarially-generated datasets, their troduced to be used for training and testing ML MCC scores improve by up to 0.307. models for both paraphrase detection and genera- • We create an additional dataset by training a tion. MSRP (Dolan and Brockett, 2005) contains paraphrase generation model to perform our 5801 sentence pairs, each labeled with a binary hu- adversarial task, creating another large dataset man judgment of paraphrase, created using heuris- that further improves the paraphrase detection tic extraction techniques along with an SVM-based models’ performance. classifier. These pairs were annotated by humans, who found 67% of them to be semantically equiva- • We propose a way to create a machine- lent. The English portion of PPDB (Ganitkevitch generated adversarial dataset and discuss et al., 2013) contains over 220M paraphrase pairs ways to ensure it does not suffer from the generated by meaning-preserving syntactic trans- plateauing that other datasets suffer from. formations. Paraphrase pairs in PPDB 2.0 (Pavlick et al., 2015) include fine-grained entailment rela- tions, word embedding similarities, and style an- 2 Related Work notations. TwitterPPDB (Lan et al., 2017) con- Paraphrase detection (given two sentences, predict sists of 51,524 sentence pairs captured from Twitter whether they are paraphrases) (Zhang and Patrick, by linking tweets through shared URLs. This ap- 7107
proach’s merit is its simplicity as it involves neither to filter the dataset, which reduces its size, making a classifier nor a human-in-the-loop to generate model training more difficult. Our present work paraphrases. Humans annotate the pairs, giving tries instead to generate adversarial examples to them a similarity score ranging from 1 to 6. increase dataset size. Other examples of adversar- ParaNMT (Wieting and Gimpel, 2018) was cre- ial datasets in NLP include work done by Jia and ated by using neural machine translation to trans- Liang (2017a); Zellers et al. (2018, 2019). Per- late the English side of a Czech-English parallel haps the closest to our work is PAWS (Zhang et al., corpus (CzEng 1.6 (Bojar et al., 2016)), generat- 2019), short for Paraphrase Adversaries from Word ing more than 50M English-English paraphrases. Scrambling. The idea behind PAWS is to create However, ParaNMT’s use of machine translation a dataset that has a high lexical overlap between models that are a few years old harms its utility sentence pairs without them being ‘paraphrases.’ It (Nighojkar and Licato, 2021), considering the rapid has 108k paraphrase and non-paraphrase pairs with improvement in machine translation in the past few high lexical overlap pairs generated by controlled years. To rectify this, we use the google-translate word swapping and back-translation, and human library to translate the Czech side of roughly 300k raters have judged whether or not they are para- CzEng2.0 (Kocmi et al., 2020) sentence pairs our- phrases. Including PAWS in the training data has selves. We call this dataset ParaParaNMT (PP- shown the state-of-the-art models’ performance to NMT for short, where the extra para- prefix re- jump from 40% to 85% on PAWS’s test split. In flects its similarity to, and conceptual derivation comparison to the present work, PAWS does not from, ParaNMT). explicitly incorporate inferential properties, and we seek paraphrases minimizing lexical overlap. Some work has been done in improving the qual- ity of paraphrase detectors by training them on a 3 Adversarial Paraphrasing Task (APT) dataset with more lexical and syntactic diversity. Thompson and Post (2020) propose a paraphrase Semantic Textual Similarity (STS) measures the generation algorithm that penalizes the production degree of semantic similarity between two sen- of n-grams present in the source sentence. Our tences. Popular approaches to calculating STS approach to doing this is with the APT, but this include BLEU (Papineni et al., 2002), BertScore is something worth exploring. Sokolov and Fil- (Zhang et al., 2020), and BLEURT (Sellam et al., imonov (2020) use a machine translation model 2020). BLEURT is a text generation metric build- to generate paraphrases much like ParaNMT. An ing on BERT’s (Devlin et al., 2019) contextual interesting application of paraphrasing has been word representations. BLEURT is warmed-up us- discussed by Mayhew et al. (2020) who, given a ing synthetic sentence pairs and then fine-tuned on sentence in one language, generate a diverse set human ratings to generalize better than BERTScore of correct translations (paraphrases) that humans (Zhang et al., 2020). Given any two sentences, are likely to produce. In comparison, our work BLEURT assigns them a similarity score (usually is focused on generating adversarial paraphrases between -2.2 to 1.1). However, high STS scores do that are likely to deceive a paraphrase detector, and not necessarily predict whether two sentences have models trained on the adversarial datasets we pro- equivalent meanings. Consider the sentence pairs duce can be applied to Mayhew et al.’s work too. in Table 3, highlighting cases where STS and para- phrase appear to misalign. The existence of such ANLI (Nie et al., 2020), a dataset designed for cases suggests a way to advance automated para- Natural Language Inference (NLI) (Bowman et al., phrase detection: through an adversarial bench- 2015), was collected via an adversarial human-and- mark consisting of sentence pairs that have the model-in-the-loop procedure where humans are same MI-based meaning, but have BLEURT scores given the task of duping the model into making a that are as low as possible. This is the motivation wrong prediction. The model then tries to learn how behind what we call the Adversarial Paraphrasing not to make the same mistakes. AFLite (Bras et al., Task (APT), which has two components: 2020) adversarially filters dataset biases making sure that the models are not learning those biases. 1. Similarity of meaning: Checked through MI They show that model performance on SNLI (Bow- (Section 1). We assume if two sentences are man et al., 2015) drops from 92% to 62% when M I (Mutually Implicative), they are seman- biases were filtered out. However, their approach is tically equivalent and thus paraphrases. Note 7108
Figure 1: The mTurk study and the reward calculation. We automatically end the study when a subject earns a total of $20 to ensure variation amongst subjects. that MI is a binary relationship, so this APT ticipants. Neither of these datasets has duplicate component does not bring any quantitative sentences by design. Every time a sentence was se- variation but is more like a qualifier test for lected, a random choice was made between MSRP APT. All AP T sentence pairs are M I. and PPNMT, thus ensuring an even distribution of sentences from both datasets. 2. Dissimilarity of structure: Measured through Each attempt was evaluated separately using BLEURT, which assigns each sentence pair a Equation 1, where mi is 1 when the sentences are score quantifying how lexically and syntacti- M I and 0 otherwise: cally similar the two sentences are. mi reward = (1) 3.1 Manually Solving APT (1 + e5∗bleurt )2 To test the effectiveness of APT in guiding the gen- This formula was designed to ensure (1) the max- eration of mutually implicative but lexically and imum reward per submission was $1, and (2) no syntactically disparate paraphrases for a given sen- reward was granted for sentence pairs that are non- tence, we designed an Amazon Mechanical Turk MI or have BLEURT > 0.5. Participants were (mTurk) study (Figure 1). Given a starting sentence, encouraged to frequently revise their sentences and we instructed participants to “[w]rite a sentence click on a ‘Check’ button which showed them the that is the same in meaning as the given sentence reward amount they would earn if they submitted but as structurally different as possible. Your sen- this sentence. Once the ‘Check’ button was clicked, tence should be such that you can infer the given the participant’s reward was evaluated (see Figure sentence from it AND vice-versa. It should be suffi- 1) and the sentence pair added to APH (regardless ciently different from the given sentence to get any of whether it was AP T ). If ‘Submit’ was clicked, reward for the submission. For example, a simple their attempt was rewarded based on Equation 1. synonym substitution will most likely not work.” The resulting dataset of sentence pairs, which The sentences given to the participants came from we call APH (Adversarial Paraphrase by Humans), MSRP and PPNMT (Section 1). Both of these consists of 5007 human-generated sentence pairs, datasets have pairs of sentences in each row, and both M I and non-M I (see Table 2). Humans were we took only the first one to present to the par- able to generate AP T paraphrases for 75.48% of 7109
Total AP T MI non-M I Unique AP T MI non-M I Dataset attempts attempts attempts attempts sentences uniques uniques uniques 2659 3232 1775 1231 1338 293 APH 5007 1631 53.10% 64.55% 35.45% 75.48% 82.04% 17.96% 3836 37,511 25,475 2288 4045 3115 APTM5 62,986 4072 6.09% 59.55% 40.44% 56.19% 99.34% 76.50% 6454 17,074 57,937 3670 4131 4230 APTT5w 75,011 4328 8.60% 22.76% 77.24% 84.80% 95.45% 97.74% Table 2: Proportion of sentences generated by humans (APH ) and T5base (APT 5 ). “Attempts” shows the number of attempts the participant made and “Uniques” shows the number of source sentences from the dataset that the performer’s attempts fall in that category on. For instance, 1631 unique sentences were presented to humans, who made a total of 5007 attempts to pass AP T and were able to do so for 2659 attempts which amounted to 1231 unique source sentences that could be paraphrased to pass AP T . the sentences presented to them and only 53.1% next word according to its conditional probabil- of attempts were AP T , showing that the task is ity distribution, introduces non-determinism in lan- difficult even for humans. Note that ‘M I attempts’ guage generation. Fan et al. (2018) introduce top-k and ‘M I uniques’ are supersets of ‘AP T attempts’ sampling, which filters k most likely next words, and ‘AP T uniques,’ respectively. and the probability mass is redistributed among only those k words. Nucleus sampling (or top-p 3.2 Automatically Solving APT sampling) (Holtzman et al., 2020) reduces the op- Since human studies can be time-consuming and tions to the smallest possible set of words whose costly, we trained a paraphrase generator to per- cumulative probability exceeds p, and the probabil- form APT. We used T5base (Raffel et al., 2020), as ity mass is redistributed among this set of words. it achieves SOTA on paraphrase generation (Niu Thus, the set of words changes dynamically ac- et al., 2020; Bird et al., 2020; Li et al., 2020) and cording to the next word’s probability distribution. trained it on TwitterPPDB (Section 2). Our hy- We use a combination of top-k and top-p sampling pothesis was that if T5base is trained to maximize with k = 120 and p = 0.95 in the interest of lexi- the APT reward (Equation 1), its generated sen- cal and syntactic diversity in the paraphrases. For tences will be more likely to be AP T . We gener- each sentence in the source dataset (MSRP3 and ated paraphrases for sentences in MSRP and those TwitterPPDB for APTM5 and APTT5w respectively), in TwitterPPDB itself, hoping that since T5base is we perform five iterations, in each of which, we trained on TwitterPPDB, it would generate better generate ten sentences. If at least one of these ten paraphrases (M I with lower BLEURT) for sen- sentences passes AP T , we continue to the next tences coming from there. The proportion of sen- source sentence after recording all attempts and tences generated by T5base is shown in Table 2. classifying them as M I or non-M I. If no sentence We call this dataset APT 5 , the generation of which in a maximum of 50 attempts passes AP T , we involved two phases: record all attempts nonetheless, and move on to the next source sentence. For each increasing iteration Training: To adapt T5base for APT, we imple- for a particular source sentence, we increase k by mented a custom loss function obtained from di- 20, but we also reduce p by 0.05 to avoid vague viding the cross-entropy loss per batch by the total guesses. Note the distribution of M I and non-M I reward (again from Equation 1) earned from the in the source datasets does not matter because we model’s paraphrase generations for that batch, pro- use only the first sentence from the sentence pair. vided the model was able to reach a reward of at least 1. If not, the loss was equal to just the cross- 3.3 Dataset Properties entropy loss. We trained T5base on TwitterPPDB T5base trained with our custom loss function gen- for three epochs; each epoch took about 30 hours erated AP T -passing paraphrases for (56.19%) of on one NVIDIA Tesla V100 GPU due to the CPU starting sentences. This is higher than we initially bound BLEURT component. More epochs may expected, considering how difficult APT proved help get better results, but our experiments showed to be for humans (Table 2). Noteworthy is that that loss plateaus after three epochs. 3 We use the official train split released by Dolan and Brock- Generation: Sampling, or randomly picking a ett (2005) containing 4076 sentence pairs. 7110
(a) APH (b) APTM5 (c) APTT5w Figure 2: BLEURT distributions on adversarial datasets. All figures divide the range of observed scores into 100 bins. Note that AP T sentence pairs are also M I, whereas those labeled ‘MI’ are not AP T . only 6.09% of T5base ’s attempts were AP T . This paraphrases that were low on BLEURT, they tended does not mean that the remaining 94% of attempts to become non-M I (e.g., line 12 in Table 3). How- can be discarded, since they amounted to the neg- ever, T5base did generate more AP T -passing sen- ative examples in the dataset. Since we trained tences with a lower BLEURT on Twitter-PPDB it on TwitterPPDB itself, we expected that T5base than on MSRP, which may be a result of overfit- would generate better paraphrases, as measured by ting T5base on TwitterPPDB. Furthermore, all three a higher chance of passing AP T on TwitterPPDB, adversarial datasets have a distribution of M I and than any other dataset we tested. This is supported non-M I sentence pairs balanced enough to train a by the data in Table 2, which shows that T5base model to identify paraphrases. was able to generate an AP T passing paraphrase Table 3 has examples from APH and APT 5 for 84.8% of the sentences in TwitterPPDB. showing the merits and shortcomings of T5, The composition of the three adversarial datasets BLEURT, and RoBERTalarge (the MI detector can be found in Table 2. These metrics are useful used). Some observations from Table 3 include: to understand the capabilities of T5base as a para- phrase generator and the “paraphrasability” of sen- • Lines 1 and 3: BLEURT did not recognize the tences in MSRP and TwitterPPDB. For instance, paraphrases, possibly due to the differences in T5base ’s attempts on TwitterPPDB tend to be M I words used. RoBERTalarge however, gave the much less frequently than those on MSRP and hu- correct MI prediction (though it is worth not- man’s attempts on MSRP + PPNMT. This might ing that the sentences in line 1 are questions, be because in an attempt to generate syntactically rather than truth-apt propositions). dissimilar sentences, the T5base paraphraser also ended up generating many semantically dissimilar • Line 4: RoBERTalarge and BLEURT (to a ones as well. large extent since it gave it a score of 0.4) did To visualize the syntactic and lexical disparity not recognize that the idiomatic phrase ‘break of paraphrases in the three adversarial datasets, we a leg’ means ‘good luck’ and not ‘fracture.’ present their BLEURT distributions in Figure 2. As • Lines 6 and 12: There is a loss of information might be expected, the likelihood of a sentence pair going from the first sentence to the second and being M I increases as BLEURT score increases BLEURT and MI both seem to have under- (recall that AP T -passing sentence pairs are sim- stood the difference between summarization ply M I pairs with BLEURT scores
No. Source Source Sentence Attempt BLEURT MI Dataset APH 1 PPNMT So, can we please get out of here? So is it okay if we please go? -0.064 1 2 PPNMT You’re crying. I did not cry -1.366 0 3 PPNMT Treatment successful. The treatment was succesful. -0.871 1 4 PPNMT Break a leg! Fracture a leg! 0.408 1 5 MSRP Two years later, the insurance coverage would begin. The insurance will start in two years 0.281 1 6 MSRP Evacuation went smoothly, although passengers Hunt told that Evacuation went smoothly. -0.298 0 weren’t told what was going on, Hunt said. APT 5 7 MSRP Friday, Stanford (47-15) blanked the Gamecocks 8-0. Stanford (47-15) won 8-0 over the Gamecocks on Friday. 0.206 1 8 MSRP Revenue in the first quarter of the year dropped Revenue declined 15 percent in the first quarter 0.698 1 15 percent from the same period a year earlier. of the year from the same period a year earlier. 9 MSRP A federal magistrate in Fort Lauderdale ordered In Fort Lauderdale, Florida, a federal magistrate 0.635 0 him held without bail. ordered him held without bail. 10 TP 16 innovations making a difference for poor 16 innovative ideas that tackle poverty around the world. 0.317 1 communities around the world. 11 TP This is so past the bounds of normal or acceptable . This is so beyond the normal or acceptable boundaries. 0.620 1 12 TP The creator of Atari has launched a new VR company Atari creator is setting up a new VR company! 0.106 0 called Modal VR. Table 3: Examples from adversarial datasets. The source dataset (TP short for TwitterPPDB) tells which dataset the sentence pair comes from (and whether it is in APTM5 or APTT5w for APT 5 ). All datasets have AP T passing and failing M I and non-M I sentence pairs. Dataset Total MI non-M I Training Dataset Size APH APH -test APH -train 3746 2433 64.95% 1313 35.05% TwitterPPDB + MCC F1 MCC F1 APH -train 46k 0.440 0.809 APH -test 1261 799 63.36% 462 36.64% APTM5 106k 0.410 0.725 0.369 0.705 MSRP-train 4076 2753 67.54% 1323 32.46% APH -train + APTM5 109k 0.516 0.828 MSRP-test 1725 1147 66.50% 578 33.50% APTT5w 117k 0.433 0.772 0.422 0.765 APH -train + APTT5w 121k 0.488 0.812 Table 4: Distribution of M I and non-M I pairs. APT 5 180k 0.461 0.731 0.437 0.716 APH -train + APT 5 184k 0.525 0.816 RoBERTabase Random Test Set MCC F1 MCC F1 MSRP-train 0.349 0.833 0 0.806 Table 6: Performance of RoBERTabase trained on ad- MSRP-test 0.358 0.829 0 0.799 versarial datasets. Size is the number of training exam- ples in the dataset rounded to nearest 1000. APH 0.222 0.746 0 0.784 APH -test 0.218 0.743 0 0.777 Table 5: Performance of RoBERTabase trained on just TwitterPPDB (no adversarial datasets) vs. random pre- Chen et al., 2021). diction. For each source sentence, multiple paraphrases may have been generated. Hence, to avoid data 4 Experiments leakage, we created a train-test split on APH such that all paraphrases generated using a given source To quantify our datasets’ contributions, we de- sentence will be either in APH -train or in APH - signed experiment setups wherein we trained test, but never in both. Note that APH is not bal- RoBERTabase (Liu et al., 2019) for paraphrase de- anced as seen in Table 2. Table 4 shows the dis- tection on a combination of TwitterPPDB and our tribution of M I and non-M I pairs in APH -train datasets as training data. RoBERTa was chosen for and APH -test and ‘M I attempts’ and ‘non-M I its generality, as it is a commonly used model in attempts’ columns of Table 2 show the same for current NLP work and benchmarking, and currently other adversarial datasets. The test sets used were achieves SOTA or near-SOTA results on a majority APH wherever APH -train was not a part of the of NLP benchmark tasks (Wang et al., 2019, 2020; training data and APH -test in every case. 7112
Does RoBERTabase do well on APH ? terPPDB — we used the trained model to perform RoBERTabase was trained on each training APT on TwitterPPDB itself. Adhering to expec- dataset (90% training data, 10% validation data) tations, training RoBERTabase (the paraphrase de- for five epochs with a batch size of 32 with the tector) with APTT5w yielded higher MCCs. Note training and validation data shuffled, and the that none of the sentences are common between trained model was tested on APH and APH -test. APTT5w and APH since APH is built on MSRP and The results of this are shown in Table 6. Note that PPNMT and the fact that the model got this per- since the number of M I and non-M I sentences formance when trained on APTT5w is a testimony to in all the datasets is imbalanced, Matthew’s Cor- the quality and contribution of APT. relation Coefficient (MCC) is a more appropriate Combining these results, we can conclude that performance measure than accuracy (Boughorbel although machine-generated datasets like APT 5 et al., 2017). can help paraphrase detectors improve themselves, Our motivation behind creating an adversarial a smaller dataset of human-generated adversarial dataset was to improve the performance of para- paraphrases improved performance more. Overall, phrase detectors by ensuring they recognize para- however, the highest MCC (0.525 in Table 6) is phrases with low lexical overlap. To demonstrate obtained when TwitterPPDB is combined with all the extent of their inability to do so, we first com- three adversarial datasets, suggesting that the two pare the performance of RoBERTabase trained only approaches nicely complement each other. on TwitterPPDB on specific datasets as shown Ta- ble 5. Although the model performs slightly well 5 Discussions and Conclusions on MSRP, it does barely better than a random pre- diction on APH , thus showing that identifying ad- This paper introduced APT (Adversarial Paraphras- versarial paraphrases created using APT is non- ing Task), a task that uses the adversarial paradigm trivial for paraphrase identifiers. to generate paraphrases consisting of sentences with equivalent (sentence-level) meanings, but dif- Do human-generated adversarial paraphrases fering lexical (word-level) and syntactical similar- improve paraphrase detection? We introduce ity. We used APT to create a human-generated APH -train to the training dataset along with Twit- dataset / benchmark (APH ) and two machine- terPPDB. This improves the MCC by 0.222 even generated datasets (APTM5 and APTT5w ). Our goal though APH -train constituted just 8.15% of the was to effectively augment how paraphrase detec- entire training dataset, the rest of which was Twit- tors are trained, in order to make them less reliant terPPDB (Table 6). This shows the effectiveness on word-level similarity. In this respect, the present of human-generated paraphrases, as is especially work succeeded: we showed that RoBERTabase impressive given the size of APH -train compared trained on TwitterPPDB performed poorly on APT to TwitterPPDB. benchmarks, but this performance was increased significantly when further trained on either our Do machine-generated adversarial para- human- or machine-generated datasets. The code phrases improve paraphrase detection? We used in this paper along with the dataset has been set out to test the improvement brought by APT 5 , released in a publicly-available repository.4 of which we have two versions. Adding APTM5 Paraphrase detection and generation have broad to the training set was not as effective as adding applicability, but most of their potential lies in ar- APH -train, increasing MCC by 0.188 on APH eas in which they still have not been substantially and 0.151 on APH -test, thus showing us that applied. These areas range from healthcare (im- T5base , although was able to clear AP T , lacked proving accessibility to medical communications the quality which human paraphrases possessed. or concepts by automatically generating simpler This might be explained by Figure 2 — since language), writing (changing the writing style of APTM5 does not have many sentences with low an article to match phrasing a reader is better able BLEURT, we cannot expect a vast improvement to understand), and education (simplifying the lan- in RoBERTabase ’s performance on sentences with guage of a scientific paper or educational lesson to BLEURT as low as in APH . 4 https://github.com/ Since we were not necessarily testing T5base ’s Advancing-Machine-Human-Reasoning-Lab/ performance — and we had trained T5base on Twit- apt 7113
make it easier for students to understand). Thus, reflect the views of the United States Air Force. future research into improving their performance We would also like to thank Antonio Laverghetta can be very valuable. But approaches to paraphrase Jr. and Jamshidbek Mirzakhalov for their helpful that treat it as no more than a matter of detecting suggestions while writing this paper, and Gokul word similarity overlap will not suffice for these Shanth Raveendran and Manvi Nagdev for helping applications. Rather, the meanings of sentences with the website used for the mTurk study. are properties of the sentences as a whole, and are inseparably tied to their inferential properties. Thus, our approaches to paraphrase detection and References generation must follow suit. Marianna Apidianaki, Guillaume Wisniewski, Anne The adversarial paradigm can be used to dive Cocos, and Chris Callison-Burch. 2018. Automated paraphrase lattice creation for hyter machine trans- deeper into comparing how humans and SOTA lan- lation evaluation. In Proceedings of the 2018 Con- guage models understand sentence meaning, as ference of the North American Chapter of the Asso- we did with APT. Furthermore, automatic gener- ciation for Computational Linguistics: Human Lan- ation of adversarial datasets has much unrealized guage Technologies, Volume 2 (Short Papers), pages 480–485. potential; e.g., different datasets, paraphrase gen- erators, and training approaches can be used to Colin Bannard and Chris Callison-Burch. 2005. Para- generate future versions of APT 5 in order to pro- phrasing with bilingual parallel corpora. In Proceed- duce AP T passing sentence pairs with lower lexi- ings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597– cal and syntactic similarities (as measured not only 604. by BLEURT, but also by future state-of-the-art STS metrics). The idea of more efficient automated ad- Rahul Bhagat and Eduard Hovy. 2013. Squibs: What versarial task performance is particularly exciting, is a paraphrase? Computational Linguistics, 39(3):463–472. as it points to a way language models can improve themselves while avoiding prohibitively expensive Jordan J. Bird, Anikó Ekárt, and Diego R. Faria. 2020. human participant fees. Chatbot interaction with artificial intelligence: Hu- man data augmentation with t5 and language trans- Finally, the most significant contribution of this former ensemble for text classification. paper, APT, presents a dataset creation method for paraphrases that will not saturate because as the Paul A. Boghossian. 1994. Inferential role semantics models get better at identifying paraphrases, we and the analytic/synthetic distinction. Philosophical Studies: An International Journal for Philosophy in will improve paraphrase generation. As models get the Analytic Tradition, 73(2/3):109–122. better at generating paraphrases, we can make APT harder (e.g., by reducing the BLEURT threshold Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Li- bovickỳ, Michal Novák, Martin Popel, Roman Su- of < 0.5). One might think of this as students in darikov, and Dušan Variš. 2016. Czeng 1.6: en- a class who come up with new ways of copying larged czech-english parallel corpus with processing their assignments from sources as plagiarism detec- tools dockered. In International Conference on Text, tors improved. That brings us to one of the many Speech, and Dialogue, pages 231–238. Springer. applications of paraphrases: plagiarism generation Sabri Boughorbel, Fethi Jarray, and Mohammed and detection, which inherently is an adversarial El-Anbari. 2017. Optimal classifier for imbal- activity. Until plagiarism detectors are trained on anced data using matthews correlation coefficient adversarial datasets themselves, we cannot expect metric. PloS one, 12(6):e0177678–e0177678. 28574989[pmid]. them to capture human levels of adversarial para- phrasing. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large anno- Acknowledgements tated corpus for learning natural language inference. Ronan Le Bras, Swabha Swayamdipta, Chandra Bha- This material is based upon work supported by gavatula, Rowan Zellers, Matthew E. Peters, Ashish the Air Force Office of Scientific Research under Sabharwal, and Yejin Choi. 2020. Adversarial filters award numbers FA9550-17-1-0191 and FA9550- of dataset biases. 18-1-0052. Any opinions, findings, and conclu- Chris Callison-Burch, Philipp Koehn, and Miles Os- sions or recommendations expressed in this mate- borne. 2006. Improved statistical machine trans- rial are those of the authors and do not necessarily lation using paraphrases. In Proceedings of the 7114
Human Language Technology Conference of the E. Hunt, R. Janamsetty, C. Kinares, C. Koh, NAACL, Main Conference, pages 17–24, New York A. Sanchez, F. Zhan, M. Ozdemir, S. Waseem, City, USA. Association for Computational Linguis- O. Yolcu, B. Dahal, J. Zhan, L. Gewali, and P. Oh. tics. 2019. Machine learning models for paraphrase iden- tification and its applications on plagiarism detec- Ben Chen, Bin Chen, Dehong Gao, Qijin Chen, tion. In 2019 IEEE International Conference on Big Chengfu Huo, Xiaonan Meng, Weijun Ren, and Knowledge (ICBK), pages 97–104. Yang Zhou. 2021. Transformer-based language model fine-tuning methods for covid-19 fake news Robin Jia and Percy Liang. 2017a. Adversarial exam- detection. ples for evaluating reading comprehension systems. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Robin Jia and Percy Liang. 2017b. Adversarial exam- Kristina Toutanova. 2019. Bert: Pre-training of deep ples for evaluating reading comprehension systems. bidirectional transformers for language understand- In Proceedings of the 2017 Conference on Empiri- ing. cal Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Bill Dolan and Chris Brockett. 2005. Automati- Computational Linguistics. cally constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing Xin Jia, Wenjie Zhou, Xu Sun, and Yunfang Wu. 2020. (IWP2005). Asia Federation of Natural Language How to ask good questions? try to leverage para- Processing. phrases. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. pages 6130–6140, Online. Association for Computa- 2014. Open question answering over curated and ex- tional Linguistics. tracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowl- Tom Kocmi, Martin Popel, and Ondrej Bojar. 2020. edge Discovery and Data Mining, KDD ’14, page Announcing czeng 2.0 parallel corpus with over 2 1156–1165, New York, NY, USA. Association for gigawords. arXiv preprint arXiv:2007.03006. Computing Machinery. Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hi- A continuously growing dataset of sentential para- erarchical neural story generation. phrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Samuel Fernando and Mark Stevenson. 2008. A se- pages 1224–1234, Copenhagen, Denmark. Associa- mantic similarity approach to paraphrase detection. tion for Computational Linguistics. In Proceedings of the 11th annual research collo- Steven Lee and Theresa Colln. 2003. The effect of in- quium of the UK special interest group for compu- struction in the paraphrasing strategy on reading flu- tational linguistics, pages 45–52. ency and comprehension. Juri Ganitkevitch, Benjamin Van Durme, and Chris Eric Li, Jingyi Su, Hao Sheng, and Lawrence Wai. Callison-Burch. 2013. Ppdb: The paraphrase 2020. Agent zero: Zero-shot automatic multiple- database. In Proceedings of the 2013 Conference of choice question generation for skill assessments. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- nologies, pages 758–764. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Han Guo, Ramakanth Pasunuru, and Mohit Bansal. Roberta: A robustly optimized bert pretraining ap- 2018. Dynamic multi-level multi-task learning for proach. sentence simplification. Stephen Mayhew, Klinton Bicknell, Chris Brust, Bill Ankush Gupta, Arvind Agarwal, Prawaan Singh, and McDowell, Will Monroe, and Burr Settles. 2020. Si- Piyush Rai. 2018. A deep generative framework multaneous translation and paraphrase for language for paraphrase generation. Proceedings of the AAAI education. In Proceedings of the ACL Workshop on Conference on Artificial Intelligence, 32(1). Neural Generation and Translation (WNGT). ACL. Jessica L. Hagaman and Robert Reid. 2008. The ef- Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. fects of the paraphrasing strategy on the reading Combining fact extraction and verification with neu- comprehension of middle school students at risk for ral semantic matching networks. In Association for failure in reading. Remedial and Special Education, the Advancement of Artificial Intelligence (AAAI). 29(4):222–234. Yixin Nie, Adina Williams, Emily Dinan, Mohit Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Bansal, Jason Weston, and Douwe Kiela. 2020. Ad- Yejin Choi. 2020. The curious case of neural text versarial nli: A new benchmark for natural language degeneration. understanding. 7115
Animesh Nighojkar and John Licato. 2021. Mutual im- Alex Wang, Amanpreet Singh, Julian Michael, Felix plication as a measure of textual equivalence. The Hill, Omer Levy, and Samuel R. Bowman. 2019. International FLAIRS Conference Proceedings, 34. Glue: A multi-task benchmark and analysis platform for natural language understanding. Tong Niu, Semih Yavuz, Yingbo Zhou, Huan Wang, Nitish Shirish Keskar, and Caiming Xiong. 2020. John Wieting and Kevin Gimpel. 2018. ParaNMT- Unsupervised paraphrase generation via dynamic 50M: Pushing the limits of paraphrastic sentence em- blocking. beddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the As- Bo Pang, Kevin Knight, and Daniel Marcu. 2003. sociation for Computational Linguistics (Volume 1: Syntax-based alignment of multiple translations: Ex- Long Papers), pages 451–462, Melbourne, Australia. tracting paraphrases and generating new sentences. Association for Computational Linguistics. In Proceedings of the 2003 Human Language Tech- nology Conference of the North American Chapter Adina Williams, Nikita Nangia, and Samuel R. Bow- of the Association for Computational Linguistics, man. 2018. A broad-coverage challenge corpus for pages 181–188. sentence understanding through inference. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Pengcheng Yin, Nan Duan, Ben Kao, Junwei Bao, and Jing Zhu. 2002. Bleu: a method for automatic eval- Ming Zhou. 2015. Answering questions with com- uation of machine translation. In Proceedings of plex semantic constraints on open knowledge bases. the 40th Annual Meeting of the Association for Com- In Proceedings of the 24th ACM International on putational Linguistics, pages 311–318, Philadelphia, Conference on Information and Knowledge Manage- Pennsylvania, USA. Association for Computational ment, CIKM ’15, page 1301–1310, New York, NY, Linguistics. USA. Association for Computing Machinery. Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Benjamin Van Durme, and Chris Callison-Burch. Choi. 2018. Swag: A large-scale adversarial dataset 2015. Ppdb 2.0: Better paraphrase ranking, fine- for grounded commonsense inference. grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd An- Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali nual Meeting of the Association for Computational Farhadi, and Yejin Choi. 2019. Hellaswag: Can a Linguistics and the 7th International Joint Confer- machine really finish your sentence? ence on Natural Language Processing (Volume 2: Short Papers), pages 425–430. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Eval- Jaroslav Peregrin. 2006. Meaning as an inferential role. uating text generation with bert. Erkenntnis, 64(1):1–35. Yitao Zhang and Jon Patrick. 2005. Paraphrase iden- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine tification by text canonicalization. In Proceedings Lee, Sharan Narang, Michael Matena, Yanqi Zhou, of the Australasian Language Technology Workshop Wei Li, and Peter J. Liu. 2020. Exploring the limits 2005, pages 160–166. of transfer learning with a unified text-to-text trans- former. Yuan Zhang, Jason Baldridge, and Luheng He. 2019. Paws: Paraphrase adversaries from word scram- Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. bling. 2020. Bleurt: Learning robust metrics for text gen- eration. Richard Socher, Eric H Huang, Jeffrey Pennin, Christo- pher D Manning, and Andrew Y Ng. 2011. Dy- namic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in neural in- formation processing systems, pages 801–809. Alex Sokolov and Denis Filimonov. 2020. Neural ma- chine translation for paraphrase generation. Brian Thompson and Matt Post. 2020. Paraphrase gen- eration as zero-shot multilingual translation: Disen- tangling semantic similarity from lexical and syntac- tic diversity. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2020. Superglue: A stickier benchmark for general-purpose language un- derstanding systems. 7116
You can also read