Improving Paraphrase Detection with the Adversarial Paraphrasing Task

 

                                  Animesh Nighojkar, John Licato
                            Advancing Machine and Human Reasoning Lab
                           Department of Computer Science and Engineering
                                     University of South Florida
                                          Tampa, FL, USA
                               {anighojkar,licato}@usf.edu

                      Abstract

If two sentences have the same meaning, it should follow that they are equivalent in their inferential properties, i.e., each sentence should textually entail the other. However, many paraphrase datasets currently in widespread use rely on a sense of paraphrase based on word overlap and syntax. Can we instead teach models to identify paraphrases in a way that draws on the inferential properties of the sentences, and is not over-reliant on the lexical and syntactic similarities of a sentence pair? We apply the adversarial paradigm to this question, and introduce a new adversarial method of dataset creation for paraphrase identification: the Adversarial Paraphrasing Task (APT), which asks participants to generate semantically equivalent (in the sense of mutually implicative) but lexically and syntactically disparate paraphrases. These sentence pairs can then be used both to test paraphrase identification models (which get barely random accuracy) and to improve their performance. To accelerate dataset generation, we explore automation of APT using T5, and show that the resulting dataset also improves accuracy. We discuss implications for paraphrase detection and release our dataset in the hope of making paraphrase detection models better able to detect sentence-level meaning equivalence.

1   Introduction

Although there are many definitions of 'paraphrase' in the NLP literature, most maintain that two sentences that are paraphrases have the same meaning or contain the same information. Pang et al. (2003) define paraphrasing as "expressing the same information in multiple ways" and Bannard and Callison-Burch (2005) call paraphrases "alternative ways of conveying the same information." Ganitkevitch et al. (2013) write that "paraphrases are differing textual realizations of the same meaning." A definition that seems to sufficiently encompass the others is given by Bhagat and Hovy (2013): "paraphrases are sentences or phrases that use different wording to convey the same meaning." However, even that definition is somewhat imprecise, as it lacks clarity on what it assumes 'meaning' means.

If paraphrasing is a property that can hold between sentence pairs,[1] then it is reasonable to assume that sentences that are paraphrases must have equivalent meanings at the sentence level (rather than exclusively at the levels of individual word meanings or syntactic structures). Here a useful test is that recommended by inferential role semantics, or inferentialism (Boghossian, 1994; Peregrin, 2006), which suggests that the meaning of a statement s is grounded in its inferential properties: what one can infer from s, and from what s can be inferred.

Building on this concept from inferentialism, we assert that if two sentences have the same inferential properties, then they should also be mutually implicative. Mutual Implication (MI) is a binary relationship between two sentences that holds when each sentence textually entails the other (i.e., bidirectional entailment). MI is an attractive way of operationalizing the notion of two sentences having "the same meaning," as it focuses on inferential relationships between sentences (properties of the sentences as wholes) instead of just syntactic or lexical similarities (properties of parts of the sentences). As such, we will assume in this paper that two sentences are paraphrases if and only if they are MI.[2] In NLP, modeling inferential relationships between sentences is the goal of the textual entailment, or natural language inference (NLI), task (Bowman et al., 2015).

[1] In this paper we study paraphrase between sentences, and do not address the larger scope of how our work might extend to paraphrasing between arbitrarily large text sequences.
[2] The notations used in this paper are listed in Table 1.

We test MI using the version of RoBERTalarge released by Nie et al. (2020), trained on a combination of SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), FEVER-NLI (Nie et al., 2019), and ANLI (Nie et al., 2020).

  Notation        Meaning
  MI              Concept of mutual implication (i.e., bidirectional textual entailment)
  MI (italic)     Property of being mutually implicative, as determined by our NLI model
  APT             Adversarial Paraphrasing Task
  APT (italic)    Property of passing the adversarial paraphrase test (see §3)
  APH             Human-generated APT dataset
  APT5            T5base-generated APT dataset (note that APT5 = APTM5 ∪ APTT5w)
  APTM5           MSRP subset of APT5
  APTT5w          TwitterPPDB subset of APT5

Table 1: Notations used in the paper. (The italicized forms of MI and APT mark the pair-level properties; in this plain-text rendering both appear simply as MI and APT.)
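Concretely, the MI test amounts to two NLI queries, one in each direction. The following minimal sketch illustrates the check; the Hugging Face checkpoint name and its label ordering are our assumptions for illustration (the paper specifies only that the RoBERTa-large NLI model of Nie et al. (2020) is used):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; any NLI model trained on SNLI+MNLI+FEVER-NLI+ANLI fits here.
NLI_CKPT = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT)

def entails(premise: str, hypothesis: str) -> bool:
    # True if the model's top label for (premise, hypothesis) is 'entailment'.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return logits.argmax(dim=-1).item() == 0  # assumed label order: 0 = entailment

def mutually_implicative(s1: str, s2: str) -> bool:
    # MI holds only when textual entailment goes in both directions.
    return entails(s1, s2) and entails(s2, s1)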
Owing to expeditious progress in NLP research, performance of models on benchmark datasets is 'plateauing': near-human performance is often achieved within a year or two of a benchmark's release, and newer versions taking different approaches constantly have to be created, as with GLUE (Wang et al., 2019) and SuperGLUE (Wang et al., 2020). The adversarial paradigm of dataset creation (Jia and Liang, 2017a,b; Bras et al., 2020; Nie et al., 2020) has been widely used to address this 'plateauing,' and the ideas presented in this paper draw inspiration from it. In the remainder of this paper, we apply the adversarial paradigm to the problem of paraphrase detection, and demonstrate the following novel contributions:

   • We use the adversarial paradigm to create a new benchmark examining whether paraphrase detection models are assessing the meaning equivalence of sentences rather than being over-reliant on word-level measures. We do this by collecting paraphrases that are MI but are as lexically and syntactically disparate as possible (as measured by low BLEURT scores). We call this the Adversarial Paraphrasing Task (APT).

   • We show that a SOTA language model trained on paraphrase datasets performs poorly on our benchmark. However, when further trained on our adversarially-generated datasets, its MCC score improves by up to 0.307.

   • We create an additional large dataset by training a paraphrase generation model to perform our adversarial task; this dataset further improves the paraphrase detection models' performance.

   • We propose a way to create a machine-generated adversarial dataset and discuss ways to ensure it does not suffer from the plateauing that other datasets do.

2   Related Work

Paraphrase detection (given two sentences, predict whether they are paraphrases) (Zhang and Patrick, 2005; Fernando and Stevenson, 2008; Socher et al., 2011; Jia et al., 2020) is an important task in NLP, finding downstream applications in machine translation (Callison-Burch et al., 2006; Apidianaki et al., 2018; Mayhew et al., 2020), text summarization, plagiarism detection (Hunt et al., 2019), question answering, and sentence simplification (Guo et al., 2018). Paraphrases have proven to be a crucial part of NLP and language education, with research showing that paraphrasing helps improve reading comprehension skills (Lee and Colln, 2003; Hagaman and Reid, 2008). Question paraphrasing is an important step in knowledge-based question answering systems for matching questions asked by users with knowledge-base assertions (Fader et al., 2014; Yin et al., 2015).

Paraphrase generation (given a sentence, generate its paraphrase) (Gupta et al., 2018) is an area of research benefiting paraphrase detection as well. Lately, many paraphrasing datasets have been introduced for training and testing ML models for both paraphrase detection and generation. MSRP (Dolan and Brockett, 2005) contains 5801 sentence pairs created using heuristic extraction techniques along with an SVM-based classifier; each pair carries a binary human judgment of paraphrase, and the annotators found 67% of the pairs to be semantically equivalent. The English portion of PPDB (Ganitkevitch et al., 2013) contains over 220M paraphrase pairs generated by meaning-preserving syntactic transformations. Paraphrase pairs in PPDB 2.0 (Pavlick et al., 2015) include fine-grained entailment relations, word embedding similarities, and style annotations. TwitterPPDB (Lan et al., 2017) consists of 51,524 sentence pairs captured from Twitter by linking tweets through shared URLs.

This approach's merit is its simplicity, as it involves neither a classifier nor a human-in-the-loop to generate paraphrases. Humans annotate the pairs, giving them a similarity score ranging from 1 to 6.

ParaNMT (Wieting and Gimpel, 2018) was created by using neural machine translation to translate the English side of a Czech-English parallel corpus (CzEng 1.6 (Bojar et al., 2016)), generating more than 50M English-English paraphrases. However, ParaNMT's use of machine translation models that are a few years old harms its utility (Nighojkar and Licato, 2021), considering the rapid improvement in machine translation in the past few years. To rectify this, we use the google-translate library to translate the Czech side of roughly 300k CzEng 2.0 (Kocmi et al., 2020) sentence pairs ourselves. We call this dataset ParaParaNMT (PPNMT for short, where the extra para- prefix reflects its similarity to, and conceptual derivation from, ParaNMT).

Some work has been done on improving the quality of paraphrase detectors by training them on datasets with more lexical and syntactic diversity. Thompson and Post (2020) propose a paraphrase generation algorithm that penalizes the production of n-grams present in the source sentence. Our approach to encouraging such diversity is APT, but theirs is worth exploring as well. Sokolov and Filimonov (2020) use a machine translation model to generate paraphrases, much like ParaNMT. An interesting application of paraphrasing is discussed by Mayhew et al. (2020) who, given a sentence in one language, generate a diverse set of correct translations (paraphrases) that humans are likely to produce. In comparison, our work is focused on generating adversarial paraphrases that are likely to deceive a paraphrase detector, and models trained on the adversarial datasets we produce can be applied to Mayhew et al.'s work too.

ANLI (Nie et al., 2020), a dataset designed for Natural Language Inference (NLI) (Bowman et al., 2015), was collected via an adversarial human-and-model-in-the-loop procedure where humans are given the task of duping the model into making a wrong prediction; the model then tries to learn how not to make the same mistakes. AFLite (Bras et al., 2020) adversarially filters dataset biases to ensure that models are not learning those biases; its authors show that model performance on SNLI (Bowman et al., 2015) drops from 92% to 62% when biases are filtered out. However, their approach is to filter the dataset, which reduces its size, making model training more difficult. Our present work tries instead to generate adversarial examples to increase dataset size. Other examples of adversarial datasets in NLP include work done by Jia and Liang (2017a) and Zellers et al. (2018, 2019). Perhaps the closest to our work is PAWS (Zhang et al., 2019), short for Paraphrase Adversaries from Word Scrambling. The idea behind PAWS is to create a dataset with high lexical overlap between sentence pairs without them being 'paraphrases.' It has 108k paraphrase and non-paraphrase pairs with high lexical overlap, generated by controlled word swapping and back-translation, with human raters judging whether or not they are paraphrases. Including PAWS in the training data has been shown to raise state-of-the-art models' performance from 40% to 85% on PAWS's test split. In comparison to the present work, PAWS does not explicitly incorporate inferential properties, and we seek paraphrases that minimize lexical overlap.

3   Adversarial Paraphrasing Task (APT)

Semantic Textual Similarity (STS) measures the degree of semantic similarity between two sentences. Popular approaches to calculating STS include BLEU (Papineni et al., 2002), BERTScore (Zhang et al., 2020), and BLEURT (Sellam et al., 2020). BLEURT is a text generation metric building on BERT's (Devlin et al., 2019) contextual word representations; it is warmed up using synthetic sentence pairs and then fine-tuned on human ratings to generalize better than BERTScore (Zhang et al., 2020). Given any two sentences, BLEURT assigns them a similarity score (usually between -2.2 and 1.1). However, high STS scores do not necessarily predict whether two sentences have equivalent meanings: consider the sentence pairs in Table 3, which highlight cases where STS and paraphrase appear to misalign. The existence of such cases suggests a way to advance automated paraphrase detection: through an adversarial benchmark consisting of sentence pairs that have the same MI-based meaning but BLEURT scores that are as low as possible. This is the motivation behind what we call the Adversarial Paraphrasing Task (APT), which has two components:

  1. Similarity of meaning: Checked through MI (Section 1). We assume that if two sentences are MI (mutually implicative), they are semantically equivalent and thus paraphrases. Note that MI is a binary relationship, so this APT component does not introduce any quantitative variation; it acts instead as a qualifying test for APT. All APT sentence pairs are MI.

  2. Dissimilarity of structure: Measured through BLEURT, which assigns each sentence pair a score quantifying how lexically and syntactically similar the two sentences are.

Figure 1: The mTurk study and the reward calculation. We automatically end the study when a subject earns a total of $20 to ensure variation amongst subjects.

3.1   Manually Solving APT

To test the effectiveness of APT in guiding the generation of mutually implicative but lexically and syntactically disparate paraphrases for a given sentence, we designed an Amazon Mechanical Turk (mTurk) study (Figure 1). Given a starting sentence, we instructed participants to "[w]rite a sentence that is the same in meaning as the given sentence but as structurally different as possible. Your sentence should be such that you can infer the given sentence from it AND vice-versa. It should be sufficiently different from the given sentence to get any reward for the submission. For example, a simple synonym substitution will most likely not work." The sentences given to the participants came from MSRP and PPNMT (Section 2). Both of these datasets have pairs of sentences in each row, and we took only the first one to present to the participants. Neither of these datasets has duplicate sentences by design. Every time a sentence was selected, a random choice was made between MSRP and PPNMT, thus ensuring an even distribution of sentences from both datasets.

Each attempt was evaluated separately using Equation 1, where mi is 1 when the sentences are MI and 0 otherwise, and bleurt is the pair's BLEURT score:

    reward = mi / (1 + e^(5 * bleurt))^2        (1)
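For concreteness, Equation 1 can be computed as in the following minimal sketch (the BLEURT score itself comes from a separate scorer, e.g., the reference implementation of Sellam et al. (2020); the function name is ours):

import math

def apt_reward(mi: bool, bleurt: float) -> float:
    # Equation 1: reward = mi / (1 + exp(5 * bleurt))**2.
    # mi is True when the two sentences are mutually implicative; bleurt is
    # the pair's BLEURT score (lower means more lexically and syntactically
    # disparate, hence a larger reward).
    return int(mi) / (1.0 + math.exp(5.0 * bleurt)) ** 2

# A non-MI pair earns nothing; an MI pair with BLEURT 0.5 earns about $0.006,
# while an MI pair with BLEURT -1.0 earns about $0.99 of the $1 maximum.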
This formula was designed to ensure that (1) the maximum reward per submission was $1, and (2) effectively no reward was granted for sentence pairs that are non-MI or have BLEURT > 0.5. Participants were encouraged to frequently revise their sentences and click a 'Check' button, which showed them the reward amount they would earn if they submitted the current sentence. Once the 'Check' button was clicked, the participant's reward was evaluated (see Figure 1) and the sentence pair added to APH (regardless of whether it passed APT). If 'Submit' was clicked, the attempt was rewarded based on Equation 1.

The resulting dataset of sentence pairs, which we call APH (Adversarial Paraphrase by Humans), consists of 5007 human-generated sentence pairs, both MI and non-MI (see Table 2).

             Total      APT        MI        non-MI     Unique      APT        MI        non-MI
  Dataset    attempts   attempts   attempts  attempts   sentences   uniques    uniques   uniques
  APH        5007       2659       3232      1775       1631        1231       1338      293
                        53.10%     64.55%    35.45%                 75.48%     82.04%    17.96%
  APTM5      62,986     3836       37,511    25,475     4072        2288       4045      3115
                        6.09%      59.55%    40.44%                 56.19%     99.34%    76.50%
  APTT5w     75,011     6454       17,074    57,937     4328        3670       4131      4230
                        8.60%      22.76%    77.24%                 84.80%     95.45%    97.74%

Table 2: Proportion of sentences generated by humans (APH) and T5base (APT5). 'Attempts' columns count the attempts made in each category; 'uniques' columns count the source sentences for which at least one attempt falls in that category. For instance, 1631 unique sentences were presented to humans, who made a total of 5007 attempts to pass APT; 2659 of those attempts succeeded, corresponding to 1231 unique source sentences that could be paraphrased to pass APT.

Humans were able to generate APT paraphrases for 75.48% of the sentences presented to them, and only 53.1% of attempts were APT, showing that the task is difficult even for humans. Note that 'MI attempts' and 'MI uniques' are supersets of 'APT attempts' and 'APT uniques,' respectively.

3.2   Automatically Solving APT

Since human studies can be time-consuming and costly, we trained a paraphrase generator to perform APT. We used T5base (Raffel et al., 2020), as it achieves SOTA on paraphrase generation (Niu et al., 2020; Bird et al., 2020; Li et al., 2020), and trained it on TwitterPPDB (Section 2). Our hypothesis was that if T5base is trained to maximize the APT reward (Equation 1), its generated sentences will be more likely to be APT. We generated paraphrases for sentences in MSRP and in TwitterPPDB itself, expecting that since T5base is trained on TwitterPPDB, it would generate better paraphrases (MI with lower BLEURT) for sentences coming from there. The proportion of sentences generated by T5base is shown in Table 2. We call this dataset APT5; its generation involved two phases.

Training: To adapt T5base for APT, we implemented a custom loss function obtained by dividing the cross-entropy loss per batch by the total reward (again from Equation 1) earned by the model's paraphrase generations for that batch, provided the model was able to reach a reward of at least 1; if not, the loss was equal to just the cross-entropy loss. We trained T5base on TwitterPPDB for three epochs; each epoch took about 30 hours on one NVIDIA Tesla V100 GPU due to the CPU-bound BLEURT component. More epochs may help get better results, but our experiments showed that the loss plateaus after three epochs.
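A minimal sketch of this custom loss follows; exactly how the per-batch rewards are aggregated is our assumption, since the paper describes the loss only at this level of detail:

import torch

def apt_training_loss(cross_entropy: torch.Tensor, batch_rewards: list) -> torch.Tensor:
    # Divide the batch cross-entropy by the total Equation 1 reward earned by
    # the batch's generated paraphrases, but only once that total reaches at
    # least 1; otherwise fall back to plain cross-entropy.
    total_reward = sum(batch_rewards)
    if total_reward >= 1.0:
        return cross_entropy / total_reward
    return cross_entropy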
Generation: Sampling, or randomly picking the next word according to its conditional probability distribution, introduces non-determinism into language generation. Fan et al. (2018) introduce top-k sampling, which keeps only the k most likely next words and redistributes the probability mass among those k words. Nucleus sampling (or top-p sampling) (Holtzman et al., 2020) reduces the options to the smallest possible set of words whose cumulative probability exceeds p, and the probability mass is redistributed among this set; the set of words thus changes dynamically according to the next word's probability distribution. We use a combination of top-k and top-p sampling with k = 120 and p = 0.95 in the interest of lexical and syntactic diversity in the paraphrases. For each sentence in the source dataset (MSRP[3] and TwitterPPDB for APTM5 and APTT5w, respectively), we perform five iterations, in each of which we generate ten sentences. If at least one of these ten sentences passes APT, we continue to the next source sentence after recording all attempts and classifying them as MI or non-MI. If no sentence in a maximum of 50 attempts passes APT, we record all attempts nonetheless and move on to the next source sentence. With each successive iteration for a particular source sentence, we increase k by 20 but reduce p by 0.05 to avoid vague guesses. Note that the distribution of MI and non-MI pairs in the source datasets does not matter, because we use only the first sentence from each sentence pair.

[3] We use the official train split released by Dolan and Brockett (2005), containing 4076 sentence pairs.
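The following sketch illustrates the generation schedule just described; the "paraphrase: " prompt prefix, the maximum generation length, and the helper functions (mutually_implicative, bleurt_score) are illustrative assumptions rather than details specified in the paper:

def generate_apt_attempts(source, model, tokenizer):
    attempts = []
    k, p = 120, 0.95
    for _ in range(5):                      # five iterations of ten samples each
        inputs = tokenizer("paraphrase: " + source, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,                 # sampling supplies the non-determinism
            top_k=k,                        # keep the k most likely next tokens...
            top_p=p,                        # ...restricted to the nucleus of mass p
            num_return_sequences=10,
            max_length=64,                  # assumed cap on paraphrase length
        )
        for seq in outputs:
            attempt = tokenizer.decode(seq, skip_special_tokens=True)
            mi = mutually_implicative(source, attempt)
            passed = mi and bleurt_score(source, attempt) < 0.5  # APT = MI + low BLEURT
            attempts.append((attempt, mi, passed))
        if any(passed for _, _, passed in attempts):
            break                           # one APT-passing attempt is enough
        k, p = k + 20, p - 0.05             # widen k, tighten p to avoid vague guesses
    return attempts                         # at most 5 x 10 = 50 attempts per source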

Figure 2: BLEURT distributions on the adversarial datasets: (a) APH, (b) APTM5, (c) APTT5w. All figures divide the range of observed scores into 100 bins. Note that APT sentence pairs are also MI, whereas those labeled 'MI' are not APT.

3.3   Dataset Properties

T5base trained with our custom loss function generated APT-passing paraphrases for 56.19% of starting sentences. This is higher than we initially expected, considering how difficult APT proved to be for humans (Table 2). Noteworthy is that only 6.09% of T5base's attempts were APT. This does not mean that the remaining 94% of attempts can be discarded, since they amounted to the negative examples in the dataset. Since we trained it on TwitterPPDB itself, we expected that T5base would generate better paraphrases, as measured by a higher chance of passing APT, on TwitterPPDB than on any other dataset we tested. This is supported by the data in Table 2, which shows that T5base was able to generate an APT-passing paraphrase for 84.8% of the sentences in TwitterPPDB.

The composition of the three adversarial datasets can be found in Table 2. These metrics are useful for understanding the capabilities of T5base as a paraphrase generator and the 'paraphrasability' of sentences in MSRP and TwitterPPDB. For instance, T5base's attempts on TwitterPPDB tend to be MI much less frequently than its attempts on MSRP and humans' attempts on MSRP + PPNMT. This might be because, in an attempt to generate syntactically dissimilar sentences, the T5base paraphraser also ended up generating many semantically dissimilar ones.

To visualize the syntactic and lexical disparity of paraphrases in the three adversarial datasets, we present their BLEURT distributions in Figure 2. As might be expected, the likelihood of a sentence pair being MI increases as its BLEURT score increases (recall that APT-passing sentence pairs are simply MI pairs with BLEURT scores below 0.5). When T5base generated paraphrases that were low on BLEURT, they tended to become non-MI (e.g., line 12 in Table 3). However, T5base did generate more APT-passing sentences with a lower BLEURT on TwitterPPDB than on MSRP, which may be a result of overfitting T5base on TwitterPPDB. Furthermore, all three adversarial datasets have a distribution of MI and non-MI sentence pairs balanced enough to train a model to identify paraphrases.

Table 3 has examples from APH and APT5 showing the merits and shortcomings of T5, BLEURT, and RoBERTalarge (the MI detector used). Some observations from Table 3 include:

   • Lines 1 and 3: BLEURT did not recognize the paraphrases, possibly due to the differences in the words used. RoBERTalarge, however, gave the correct MI prediction (though it is worth noting that the sentences in line 1 are questions rather than truth-apt propositions).

   • Line 4: RoBERTalarge and, to a large extent, BLEURT (which gave the pair a score of 0.4) did not recognize that the idiomatic phrase 'break a leg' means 'good luck' and not 'fracture.'

   • Lines 6 and 12: There is a loss of information going from the first sentence to the second, and BLEURT and MI both seem to have understood the difference between summarization and paraphrasing.
 No.  Source    Source Sentence                                         Attempt                                                  BLEURT   MI
      Dataset
 --- APH ---
 1    PPNMT     So, can we please get out of here?                      So is it okay if we please go?                           -0.064   1
 2    PPNMT     You're crying.                                          I did not cry                                            -1.366   0
 3    PPNMT     Treatment successful.                                   The treatment was succesful.                             -0.871   1
 4    PPNMT     Break a leg!                                            Fracture a leg!                                           0.408   1
 5    MSRP      Two years later, the insurance coverage                 The insurance will start in two years                     0.281   1
                would begin.
 6    MSRP      Evacuation went smoothly, although passengers           Hunt told that Evacuation went smoothly.                 -0.298   0
                weren't told what was going on, Hunt said.
 --- APT5 ---
 7    MSRP      Friday, Stanford (47-15) blanked the Gamecocks 8-0.     Stanford (47-15) won 8-0 over the Gamecocks on Friday.    0.206   1
 8    MSRP      Revenue in the first quarter of the year dropped        Revenue declined 15 percent in the first quarter          0.698   1
                15 percent from the same period a year earlier.         of the year from the same period a year earlier.
 9    MSRP      A federal magistrate in Fort Lauderdale ordered         In Fort Lauderdale, Florida, a federal magistrate         0.635   0
                him held without bail.                                  ordered him held without bail.
 10   TP        16 innovations making a difference for poor             16 innovative ideas that tackle poverty around            0.317   1
                communities around the world.                           the world.
 11   TP        This is so past the bounds of normal or acceptable.     This is so beyond the normal or acceptable boundaries.    0.620   1
 12   TP        The creator of Atari has launched a new VR company      Atari creator is setting up a new VR company!             0.106   0
                called Modal VR.

Table 3: Examples from the adversarial datasets. The source dataset (TP is short for TwitterPPDB) tells which dataset the sentence pair comes from (and whether it is in APTM5 or APTT5w for APT5). All datasets have APT-passing and APT-failing MI and non-MI sentence pairs.

4   Experiments

To quantify our datasets' contributions, we designed experimental setups wherein we trained RoBERTabase (Liu et al., 2019) for paraphrase detection on a combination of TwitterPPDB and our datasets as training data. RoBERTa was chosen for its generality: it is a commonly used model in current NLP work and benchmarking, and currently achieves SOTA or near-SOTA results on a majority of NLP benchmark tasks (Wang et al., 2019, 2020; Chen et al., 2021).

  Dataset       Total    MI               non-MI
  APH-train     3746     2433 (64.95%)    1313 (35.05%)
  APH-test      1261     799  (63.36%)    462  (36.64%)
  MSRP-train    4076     2753 (67.54%)    1323 (32.46%)
  MSRP-test     1725     1147 (66.50%)    578  (33.50%)

Table 4: Distribution of MI and non-MI pairs.

  Test Set      RoBERTabase (MCC / F1)    Random (MCC / F1)
  MSRP-train    0.349 / 0.833             0 / 0.806
  MSRP-test     0.358 / 0.829             0 / 0.799
  APH           0.222 / 0.746             0 / 0.784
  APH-test      0.218 / 0.743             0 / 0.777

Table 5: Performance of RoBERTabase trained on just TwitterPPDB (no adversarial datasets) vs. random prediction.

  Training Dataset           Size    APH (MCC / F1)    APH-test (MCC / F1)
  TwitterPPDB + APH-train    46k                       0.440 / 0.809
  APTM5                      106k    0.410 / 0.725     0.369 / 0.705
  APH-train + APTM5          109k                      0.516 / 0.828
  APTT5w                     117k    0.433 / 0.772     0.422 / 0.765
  APH-train + APTT5w         121k                      0.488 / 0.812
  APT5                       180k    0.461 / 0.731     0.437 / 0.716
  APH-train + APT5           184k                      0.525 / 0.816

Table 6: Performance of RoBERTabase trained on adversarial datasets. Size is the number of training examples in the dataset, rounded to the nearest 1000.

For each source sentence, multiple paraphrases may have been generated. Hence, to avoid data leakage, we created a train-test split on APH such that all paraphrases generated from a given source sentence are either in APH-train or in APH-test, but never in both. Note that APH is not balanced, as seen in Table 2. Table 4 shows the distribution of MI and non-MI pairs in APH-train and APH-test; the 'MI attempts' and 'non-MI attempts' columns of Table 2 show the same for the other adversarial datasets. The test sets used were APH wherever APH-train was not part of the training data, and APH-test in every case.
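One way to realize such a source-grouped split is sketched below using scikit-learn's GroupShuffleSplit; the paper does not state which tool produced the actual split, so this illustrates the constraint rather than reproducing the authors' code:

from sklearn.model_selection import GroupShuffleSplit

def split_by_source(pairs, test_size=0.25, seed=0):
    # pairs: list of (source_sentence, attempt, label) tuples. Grouping by the
    # source sentence guarantees that all paraphrases of a source land on the
    # same side of the split, preventing leakage between train and test.
    groups = [source for source, _, _ in pairs]
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(gss.split(pairs, groups=groups))
    return [pairs[i] for i in train_idx], [pairs[i] for i in test_idx]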

Does RoBERTabase do well on APH? RoBERTabase was trained on each training dataset (90% training data, 10% validation data) for five epochs with a batch size of 32, with the training and validation data shuffled, and the trained model was tested on APH and APH-test. The results are shown in Table 6. Note that since the number of MI and non-MI sentences in all the datasets is imbalanced, the Matthews Correlation Coefficient (MCC) is a more appropriate performance measure than accuracy (Boughorbel et al., 2017).
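The following sketch shows why MCC is preferred here: on labels roughly as imbalanced as APH's, a degenerate always-'MI' predictor looks deceptively good on accuracy and F1 but earns an MCC of 0 (the illustrative counts are ours; compare the 'Random' column of Table 5):

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [1] * 64 + [0] * 36              # roughly APH's MI/non-MI imbalance
y_pred = [1] * 100                        # a trivial detector that always says 'MI'

print(accuracy_score(y_true, y_pred))     # 0.64  -- looks deceptively decent
print(f1_score(y_true, y_pred))           # ~0.78 -- also inflated
print(matthews_corrcoef(y_true, y_pred))  # 0.0   -- exposes the trivial predictor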
Our motivation behind creating an adversarial dataset was to improve the performance of paraphrase detectors by ensuring they recognize paraphrases with low lexical overlap. To demonstrate the extent of their inability to do so, we first examine the performance of RoBERTabase trained only on TwitterPPDB, as shown in Table 5. Although the model performs reasonably well on MSRP, it does barely better than a random prediction on APH, showing that identifying adversarial paraphrases created using APT is non-trivial for paraphrase identifiers.

Do human-generated adversarial paraphrases improve paraphrase detection? We introduced APH-train to the training dataset along with TwitterPPDB. This improves the MCC by 0.222 even though APH-train constituted just 8.15% of the entire training dataset, the rest of which was TwitterPPDB (Table 6). This shows the effectiveness of human-generated paraphrases, which is especially impressive given the size of APH-train compared to TwitterPPDB.

Do machine-generated adversarial paraphrases improve paraphrase detection? We set out to test the improvement brought by APT5, of which we have two versions. Adding APTM5 to the training set was not as effective as adding APH-train, increasing MCC by 0.188 on APH and 0.151 on APH-test, showing that T5base, although able to pass APT, lacked the quality that human paraphrases possessed. This might be explained by Figure 2: since APTM5 does not have many sentences with low BLEURT, we cannot expect a vast improvement in RoBERTabase's performance on sentences with BLEURT as low as in APH.

Since we were not necessarily testing T5base's performance, and we had trained T5base on TwitterPPDB, we used the trained model to perform APT on TwitterPPDB itself. Adhering to expectations, training RoBERTabase (the paraphrase detector) with APTT5w yielded higher MCCs. Note that none of the sentences are common between APTT5w and APH, since APH is built on MSRP and PPNMT; the fact that the model achieved this performance when trained on APTT5w is a testimony to the quality and contribution of APT.

Combining these results, we can conclude that although machine-generated datasets like APT5 can help paraphrase detectors improve themselves, a smaller dataset of human-generated adversarial paraphrases improved performance more. Overall, however, the highest MCC (0.525 in Table 6) is obtained when TwitterPPDB is combined with all three adversarial datasets, suggesting that the two approaches nicely complement each other.

5   Discussions and Conclusions

This paper introduced APT (the Adversarial Paraphrasing Task), a task that uses the adversarial paradigm to generate paraphrases consisting of sentences with equivalent (sentence-level) meanings but differing lexical (word-level) and syntactic similarity. We used APT to create a human-generated dataset / benchmark (APH) and two machine-generated datasets (APTM5 and APTT5w). Our goal was to effectively augment how paraphrase detectors are trained, in order to make them less reliant on word-level similarity. In this respect, the present work succeeded: we showed that RoBERTabase trained on TwitterPPDB performed poorly on APT benchmarks, but this performance increased significantly when the model was further trained on either our human- or machine-generated datasets. The code used in this paper, along with the dataset, has been released in a publicly-available repository.[4]

Paraphrase detection and generation have broad applicability, but most of their potential lies in areas in which they have not yet been substantially applied. These areas include healthcare (improving accessibility to medical communications or concepts by automatically generating simpler language), writing (changing the writing style of an article to match phrasing a reader is better able to understand), and education (simplifying the language of a scientific paper or educational lesson to make it easier for students to understand).

[4] https://github.com/Advancing-Machine-Human-Reasoning-Lab/apt

Thus, future research into improving their performance can be very valuable. But approaches to paraphrase that treat it as no more than a matter of detecting word-similarity overlap will not suffice for these applications. Rather, the meanings of sentences are properties of the sentences as wholes, and are inseparably tied to their inferential properties; our approaches to paraphrase detection and generation must follow suit.

The adversarial paradigm can be used to dive deeper into comparing how humans and SOTA language models understand sentence meaning, as we did with APT. Furthermore, automatic generation of adversarial datasets has much unrealized potential; e.g., different datasets, paraphrase generators, and training approaches can be used to generate future versions of APT5 in order to produce APT-passing sentence pairs with lower lexical and syntactic similarities (as measured not only by BLEURT, but also by future state-of-the-art STS metrics). The idea of more efficient automated adversarial task performance is particularly exciting, as it points to a way language models can improve themselves while avoiding prohibitively expensive human participant fees.

Finally, the most significant contribution of this paper, APT, presents a dataset creation method for paraphrases that will not saturate: as models get better at identifying paraphrases, we will improve paraphrase generation, and as models get better at generating paraphrases, we can make APT harder (e.g., by reducing the BLEURT threshold of 0.5). One might think of this as analogous to students in a class who come up with new ways of copying their assignments as plagiarism detectors improve. That brings us to one of the many applications of paraphrasing: plagiarism generation and detection, which is inherently an adversarial activity. Until plagiarism detectors are trained on adversarial datasets themselves, we cannot expect them to capture human levels of adversarial paraphrasing.

Acknowledgements

This material is based upon work supported by the Air Force Office of Scientific Research under award numbers FA9550-17-1-0191 and FA9550-18-1-0052. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force. We would also like to thank Antonio Laverghetta Jr. and Jamshidbek Mirzakhalov for their helpful suggestions while writing this paper, and Gokul Shanth Raveendran and Manvi Nagdev for helping with the website used for the mTurk study.

References

Marianna Apidianaki, Guillaume Wisniewski, Anne Cocos, and Chris Callison-Burch. 2018. Automated paraphrase lattice creation for HyTER machine translation evaluation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 480–485.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604.

Rahul Bhagat and Eduard Hovy. 2013. Squibs: What is a paraphrase? Computational Linguistics, 39(3):463–472.

Jordan J. Bird, Anikó Ekárt, and Diego R. Faria. 2020. Chatbot interaction with artificial intelligence: Human data augmentation with T5 and language transformer ensemble for text classification.

Paul A. Boghossian. 1994. Inferential role semantics and the analytic/synthetic distinction. Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 73(2/3):109–122.

Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, and Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English parallel corpus with processing tools dockered. In International Conference on Text, Speech, and Dialogue, pages 231–238. Springer.

Sabri Boughorbel, Fethi Jarray, and Mohammed El-Anbari. 2017. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLoS ONE, 12(6):e0177678.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 17–24, New York City, USA. Association for Computational Linguistics.

Ben Chen, Bin Chen, Dehong Gao, Qijin Chen, Chengfu Huo, Xiaonan Meng, Weijun Ren, and Yang Zhou. 2021. Transformer-based language model fine-tuning methods for COVID-19 fake news detection.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asia Federation of Natural Language Processing.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1156–1165, New York, NY, USA. Association for Computing Machinery.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation.

Samuel Fernando and Mark Stevenson. 2008. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pages 45–52.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764.

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Dynamic multi-level multi-task learning for sentence simplification.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).

Jessica L. Hagaman and Robert Reid. 2008. The effects of the paraphrasing strategy on the reading comprehension of middle school students at risk for failure in reading. Remedial and Special Education, 29(4):222–234.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration.

E. Hunt, R. Janamsetty, C. Kinares, C. Koh, A. Sanchez, F. Zhan, M. Ozdemir, S. Waseem, O. Yolcu, B. Dahal, J. Zhan, L. Gewali, and P. Oh. 2019. Machine learning models for paraphrase identification and its applications on plagiarism detection. In 2019 IEEE International Conference on Big Knowledge (ICBK), pages 97–104.

Robin Jia and Percy Liang. 2017a. Adversarial examples for evaluating reading comprehension systems.

Robin Jia and Percy Liang. 2017b. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Xin Jia, Wenjie Zhou, Xu Sun, and Yunfang Wu. 2020. How to ask good questions? Try to leverage paraphrases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6130–6140, Online. Association for Computational Linguistics.

Tom Kocmi, Martin Popel, and Ondřej Bojar. 2020. Announcing CzEng 2.0 parallel corpus with over 2 gigawords. arXiv preprint arXiv:2007.03006.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234, Copenhagen, Denmark. Association for Computational Linguistics.

Steven Lee and Theresa Colln. 2003. The effect of instruction in the paraphrasing strategy on reading fluency and comprehension.

Eric Li, Jingyi Su, Hao Sheng, and Lawrence Wai. 2020. Agent Zero: Zero-shot automatic multiple-choice question generation for skill assessments.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Stephen Mayhew, Klinton Bicknell, Chris Brust, Bill McDowell, Will Monroe, and Burr Settles. 2020. Simultaneous translation and paraphrase for language education. In Proceedings of the ACL Workshop on Neural Generation and Translation (WNGT). ACL.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding.
Animesh Nighojkar and John Licato. 2021. Mutual implication as a measure of textual equivalence. The International FLAIRS Conference Proceedings, 34.

Tong Niu, Semih Yavuz, Yingbo Zhou, Huan Wang, Nitish Shirish Keskar, and Caiming Xiong. 2020. Unsupervised paraphrase generation via dynamic blocking.

Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–188.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430.

Jaroslav Peregrin. 2006. Meaning as an inferential role. Erkenntnis, 64(1):1–35.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

Alex Sokolov and Denis Filimonov. 2020. Neural machine translation for paraphrase generation.

Brian Thompson and Matt Post. 2020. Paraphrase generation as zero-shot multilingual translation: Disentangling semantic similarity from lexical and syntactic diversity.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2020. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference.

Pengcheng Yin, Nan Duan, Ben Kao, Junwei Bao, and Ming Zhou. 2015. Answering questions with complex semantic constraints on open knowledge bases. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 1301–1310, New York, NY, USA. Association for Computing Machinery.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence?

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT.

Yitao Zhang and Jon Patrick. 2005. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166.

Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling.
