Explaining NLP Models via Minimal Contrastive Editing (MICE)

Alexis Ross†      Ana Marasović†♦      Matthew E. Peters†
Allen Institute for Artificial Intelligence, Seattle, WA, USA
Paul G. Allen School of Computer Science and Engineering, University of Washington

arXiv:2012.13985v2 [cs.CL] 23 Jun 2021

Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present MINIMAL CONTRASTIVE EDITING (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks (binary sentiment classification, topic classification, and multiple-choice question answering) show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development (debugging incorrect model outputs and uncovering dataset artifacts) and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

Figure 1: An example MiCE edit for a multiple-choice question from the RACE dataset. MiCE generates contrastive explanations in the form of edits to inputs that change model predictions to target (contrast) predictions. The edit (bolded in red) is minimal and fluent, and it changes the model’s prediction from “by train” to the contrast prediction “on foot” (highlighted in gray).

1 Introduction

Cognitive science and philosophy research has shown that human explanations are contrastive (Miller, 2019): People explain why an observed event happened rather than some counterfactual event called the contrast case. This contrast case plays a key role in modulating what explanations are given. Consider Figure 1. When we seek an explanation of the model’s prediction “by train,” we seek it not in absolute terms, but in contrast to another possible prediction (i.e., “on foot”). Additionally, we tailor our explanation to this contrast case. For instance, we might explain why the prediction is “by train” and not “on foot” by saying that the writer discusses meeting Ann at the train station instead of at Ann’s home on foot; such information is captured by the edit (bolded red) that results in the new model prediction “on foot.” For a different contrast prediction, such as “by car,” we would provide a different explanation. In this work, we propose to give contrastive explanations of model predictions in the form of targeted minimal edits, as shown in Figure 1, that cause the model to change its original prediction to the contrast prediction.

Given the key role that contrastivity plays in human explanations, making model explanations contrastive could make them more user-centered and thus more useful for their intended purposes, such as debugging and exposing dataset biases (Ribera and Lapedriza, 2019), purposes which require that humans work with explanations (Alvarez-Melis et al., 2019). However, many currently popular instance-based explanation methods produce highlights, segments of input that support a prediction (Zaidan et al., 2007; Lei et al., 2016; Chang et al., 2019; Bastings et al., 2019; Yu et al., 2019; DeYoung et al., 2020; Jain et al., 2020; Belinkov and Glass, 2019) that can be derived through gradients (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017), approximations with simpler models (Ribeiro et al., 2016), or attention (Wiegreffe and Pinter, 2019; Sun and Marasović, 2021). These methods are not contrastive, as they leave the contrast case undetermined; they do not tell us what would have to be different for a model to have predicted a particular contrast label.¹

As an alternative approach to NLP model explanation, we introduce MINIMAL CONTRASTIVE EDITING (MiCE), a two-stage approach to generating contrastive explanations in the form of targeted minimal edits (as shown in Figure 1). Given an input, a fixed PREDICTOR model, and a contrast prediction, MiCE generates edits to the input that change the PREDICTOR’s output from the original prediction to the contrast prediction. We formally define our edits and describe our approach in §2.

We design MiCE to produce edits with properties motivated by human contrastive explanations. First, we desire edits to be minimal, altering only small portions of input, a property which has been argued to make explanations more intelligible (Alvarez-Melis et al., 2019; Miller, 2019). Second, MiCE edits should be fluent, resulting in text natural for the domain and ensuring that any changes in model predictions are not driven by inputs falling out of distribution of naturally occurring text. Our experiments (§3) on three English-language datasets, IMDB, NEWSGROUPS, and RACE, validate that MiCE edits are indeed contrastive, minimal, and fluent.

We also analyze the quality of MiCE edits (§4) and show how they may be used for two use cases in NLP system development. First, we show that MiCE edits are comparable in size and fluency to human edits on the IMDB dataset. Next, we illustrate how MiCE edits can facilitate debugging individual model predictions. Finally, we show how MiCE edits can be used to uncover dataset artifacts learned by a powerful PREDICTOR model.²

2 MiCE: Minimal Contrastive Editing

This section describes our proposed method, MINIMAL CONTRASTIVE EDITING, or MiCE, for explaining NLP models with contrastive edits.

2.1 MiCE Edits as Contrastive Explanations

Contrastive explanations are answers to questions of the form Why p and not q? They explain why the observed event p happened instead of another event q, called the contrast case.³ A long line of research in the cognitive sciences and philosophy has found that human explanations are contrastive (Van Fraassen, 1980; Lipton, 1990; Miller, 2019). Human contrastive explanations have several hallmark characteristics. First, they cite contrastive features: features that result in the contrast case when they are changed in a particular way (Chin-Parker and Cantelon, 2017). Second, they are minimal in the sense that they rarely cite the entire causal chain of a particular event, but select just a few relevant causes (Hilton, 2017). In this work, we argue that a minimal edit to a model input that causes the model output to change to the contrast case has both these properties and can function as an effective contrastive explanation. We first give an illustration of contrastive explanations humans might give and then show how minimal contrastive edits offer analogous contrastive information.

As an example, suppose we want to explain why the answer to the question “Q: Where can you find a clean pillow case that is not in use?” is “A: the drawer.”⁴ If someone asks why the answer is not “C1: on the bed,” we might explain: “E1: Because only the drawer stores pillow cases that are not in use.” However, E1 would not be an explanation of why the answer is not “C2: in the laundry hamper,” since both drawers and laundry hampers store pillow cases that are not in use. For contrast case C2, we might instead explain: “E2: Because only laundry hampers store pillow cases that are not clean.” We cite different parts of the original question depending on the contrast case.

In this work, we propose to offer contrastive explanations in the form of minimal edits that result in the contrast case as model output. Such edits are effective contrastive explanations because, by construction, they highlight contrastive features. For

¹ Free-text rationales (Narang et al., 2020) can be contrastive if human justifications are collected by asking “why... instead of...” which is not the case with current benchmarks (Camburu et al., 2018; Rajani et al., 2019; Zellers et al., 2019).
² Our code and trained EDITOR models are publicly available at https://github.com/allenai/mice.
³ Related work also calls it the foil (Miller, 2019).
⁴ Inspired by an example in Talmor et al. (2019): Question: “Where would you store a pillow case that is not in use?” Choices: “drawer, kitchen cupboard, bedding store, england.”
Figure 2: An overview of MiCE, our two-stage approach to generating edits. In Stage 1 (§2.3), we train the EDITOR to make edits targeting specific predictions from the PREDICTOR. In Stage 2 (§2.4), we make contrastive edits with the EDITOR model from Stage 1 such that the PREDICTOR changes its output to the contrast prediction.
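
The Stage 1 input preparation summarized in Figure 2 (and detailed in §2.3) can be illustrated with a small, framework-free sketch. It assumes the per-token gradient vectors have already been obtained from some PREDICTOR. The function names, the `label: ... input: ...` prompt layout, and the rounding of the masking budget are illustrative assumptions, not the released implementation (the paper defers the actual input format to its Appendix B). The `<extra_id_N>` strings follow the sentinel naming used by common T5 tokenizers.

```python
import math


def l1_norms(grads):
    """L1 norm of each token's gradient vector (one vector per token)."""
    return [sum(abs(g) for g in vec) for vec in grads]


def top_percent_indices(scores, percent):
    """Indices of the `percent`% highest-scoring tokens (at least one token)."""
    n_mask = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_mask])


def mask_spans(tokens, masked):
    """Replace each run of consecutive masked tokens with a single sentinel.

    Returns the masked token list and the (sentinel, original span) pairs
    that an infilling model would be trained to reconstruct.
    """
    out, targets, span_id, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            sentinel = f"<extra_id_{span_id}>"
            span = []
            while i < len(tokens) and i in masked:
                span.append(tokens[i])
                i += 1
            out.append(sentinel)
            targets.append((sentinel, " ".join(span)))
            span_id += 1
        else:
            out.append(tokens[i])
            i += 1
    return out, targets


def editor_input(tokens, grads, label, percent):
    """Build a label-conditioned masked input for the infilling model."""
    masked = top_percent_indices(l1_norms(grads), percent)
    masked_tokens, targets = mask_spans(tokens, masked)
    # Hypothetical prompt layout, chosen only for illustration.
    return f"label: {label}. input: {' '.join(masked_tokens)}", targets
```

For instance, with the tokens `the movie was truly awful today` and gradients that are largest on `truly` and `awful`, masking 33% of the tokens collapses both words into one sentinel span, so the infilling model sees a single gap to fill conditioned on the label.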

example, a contrastive edit of the original question for contrast case C1 would be: “Where can you find a clean pillow case that is not in use?” (with “not” deleted); the information provided by this edit (that it is whether or not the pillow case is in use that determines whether the answer is “the drawer” or “on the bed”) is analogous to the information provided by E1. Similarly, a contrastive edit for contrast case C2 that changed the question to “Where can you find a clean dirty pillow case that is not in use?” (with “clean” replaced by “dirty”) provides analogous information to E2.

2.2 Overview of MiCE

We define a contrastive edit to be a modification of an input instance that causes a PREDICTOR model (whose behavior is being explained) to change its output from its original prediction for the unedited input to a given target (contrast) prediction. Formally, for textual inputs, given a fixed PREDICTOR $f$, input $x = (x_1, x_2, \ldots, x_N)$ of $N$ tokens, original prediction $f(x) = y_p$, and contrast prediction $y_c \neq y_p$, a contrastive edit is a mapping $e : (x_1, \ldots, x_N) \to (x'_1, \ldots, x'_M)$ such that $f(e(x)) = y_c$.

We propose MiCE, a two-stage approach to generating contrastive edits, illustrated in Figure 2. In Stage 1, we prepare a highly-contextualized EDITOR model to associate edits with given end-task labels (i.e., labels for the task of the PREDICTOR) such that the contrast label $y_c$ is not ignored in MiCE’s second stage. Intuitively, we do this by masking the spans of text that are “important” for the given target label (as measured by the PREDICTOR’s gradients) and training our EDITOR to reconstruct these spans of text given the masked text and target label as input. In Stage 2 of MiCE, we generate contrastive edits $e(x)$ using the EDITOR model from Stage 1. Specifically, we generate candidate edits $e(x)$ by masking different percentages of $x$ and giving masked inputs with prepended contrast label $y_c$ to the EDITOR; we use binary search to find optimal masking percentages and beam search to keep track of candidate edits that result in the highest probability of the contrast labels $p(y_c \mid e(x))$ given by the PREDICTOR.

2.3 Stage 1: Fine-tuning the EDITOR

In Stage 1 of MiCE, we fine-tune the EDITOR to infill masked spans of text in a targeted manner. Specifically, we fine-tune a pretrained model to infill masked spans given masked text and a target end-task label as input. In this work, we use the TEXT-TO-TEXT TRANSFER TRANSFORMER (T5) model (Raffel et al., 2020) as our pretrained EDITOR, but any model suitable for span infilling can in principle be the EDITOR in MiCE. The addition of the target label allows the highly-contextualized EDITOR to condition its predictions on both the masked context and the given target label such that the contrast label is not ignored in Stage 2. What to use as target labels during Stage 1 depends on who the end-users of MiCE are. The end-user could be: (1) a model developer who has access to the labeled data used to train the PREDICTOR, or (2) lay-users, domain experts, or other developers without access to the labeled data. In the former case, we could use the gold labels as targets, and in the latter case, we could use the labels predicted by the PREDICTOR. Therefore, during fine-tuning, we experiment with using both gold labels and original predictions $y_p$ of our PREDICTOR model as target labels. To provide target labels, we prepend them to inputs to the EDITOR. For more information about how these inputs are formatted, see Appendix B. Results in Table 2 show that fine-tuning with target labels results in better edits than fine-tuning without them.

The above procedure allows our EDITOR to condition its infilled spans on both the context and the target label. But this still leaves open the question of where to mask our text. Intuitively, we want to mask the tokens that contribute most to the PREDICTOR’s predictions, since these are the tokens that are most strongly associated with the target label. We propose to use gradient attribution (Simonyan et al., 2014) to choose tokens to mask. For each instance, we take the gradient of the predicted logit for the target label with respect to the embedding layers of $f$ and take the $\ell_1$ norm across the embedding dimension. We then mask the $n_1$% of tokens with the highest gradient norms. We replace consecutive tokens (i.e., spans) with sentinel tokens, following Raffel et al. (2020). Results in Table 1 show that gradient-based masking outperforms random masking.

2.4 Stage 2: Making Edits with the EDITOR

In the second stage of our approach, we use our fine-tuned EDITOR to make edits using beam search (Reddy, 1977). In each round of edits, we mask consecutive spans of $n_2$% of tokens in the original input, prepend the contrast prediction to the masked input, and feed the resulting masked instance to the EDITOR; the EDITOR then generates $m$ edits. The masking procedure during this stage is gradient-based as in Stage 1.

In one round of edits, we conduct a binary search with $s$ levels over values of $n_2$ between $n_2 = 0$% and $n_2 = 55$% to efficiently find a value of $n_2$ that is large enough to result in the contrast prediction while also modifying only minimal parts of the input. After each round of edits, we get $f$’s predictions on the edited inputs, order them by contrast prediction probabilities, and update the beam to store the top $b$ edited instances. As soon as an edit $e^* = e(t)$ is found that results in the contrast prediction, i.e., $f(e^*) = y_c$, we stop the search procedure and return this edit. For generation, we use a combination of top-k (Fan et al., 2018) and top-p (nucleus) sampling (Holtzman et al., 2020).⁵

3 Evaluation

This section presents empirical findings that MiCE produces minimal and fluent contrastive edits.

3.1 Experimental Setup

Tasks. We evaluate MiCE on three English-language datasets: IMDB, a binary sentiment classification task (Maas et al., 2011), a 6-class version of the 20 NEWSGROUPS topic classification task (Lang, 1995), and RACE, a multiple choice question-answering task (Lai et al., 2017).⁶

PREDICTORS. MiCE can be used to make contrastive edits for any differentiable PREDICTOR model, i.e., any end-to-end neural model. In this paper, for each task, we train a PREDICTOR model $f$ built on RoBERTa-large (Liu et al., 2019), and fix it during evaluation. The test accuracies of our PREDICTORS are 95.9%, 85.3% and 84% for IMDB, NEWSGROUPS, and RACE, respectively. For training details, see Appendix A.1.

EDITORS. Our EDITORS build on the base version of T5. For fine-tuning our EDITORS (Stage 1), we use the original training data used to train PREDICTORS. We randomly split the data, 75%/25% for fine-tuning/validation, and fine-tune until the validation loss stops decreasing (for a max of 10 epochs) with $n_1$% of tokens masked, where $n_1$ is a randomly chosen value in [20, 55]. For more details, see Appendix A.2. In Stage 2, for each instance, we set the label with the second highest predicted probability as the contrast prediction. We set beam width $b = 3$, consider $s = 4$ search levels during binary search over $n_2$ in each edit round, and run our search for a max of 3 edit rounds. For each $n_2$, we sample $m = 15$ generations from our fine-tuned EDITORS with $p = 0.95$, $k = 30$.⁷

Metrics. We evaluate MiCE on the test sets of the three datasets. The RACE and NEWSGROUPS test sets contain 4,934 and 7,307 instances, respectively.⁸ For IMDB, we randomly sample 5K of the

⁵ We use this combination because we observed in preliminary experiments that it led to good results.
⁶ We create this 6-class version by mapping the 20 existing subcategories to their respective larger categories, e.g., “talk.politics.guns” and “talk.religion.misc” → “talk.” We do this in order to make the label space smaller. The resulting classes are: alt, comp, misc, rec, sci, and talk.
⁷ We tune these hyperparameters on a 50-instance subset of the IMDB validation set prior to evaluation. We note that for larger values of $n_2$, the generations produced by the T5 EDITORS sometimes degenerate; see Appendix C for details.
⁸ For the NEWSGROUPS test set, there are 7,307 instances remaining after filtering out empty strings.
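
The binary search over the masking percentage in Stage 2 (§2.4) can be sketched for a single edit round as follows. This is a minimal illustration under stated assumptions: `propose` is a hypothetical callable standing in for "mask n2% of the tokens, sample edits from the EDITOR, and score them with the PREDICTOR", and the choice to keep narrowing the interval after the first prediction-flipping candidate is one reading of the procedure, not a confirmed detail of the released code. The defaults mirror the reported settings (4 levels, a 0 to 55% range, beam width 3).

```python
def binary_search_round(propose, levels=4, lo=0.0, hi=55.0, beam_width=3):
    """Run one edit round of binary search over the masking percentage.

    `propose(n2)` must return a list of (edit_text, contrast_prob, flipped)
    tuples, where `flipped` says whether the edit yields the contrast label.
    Returns (best_flip, beam): `best_flip` is (n2, candidate) for the lowest
    masking percentage that produced a flip, or None if no flip was found;
    `beam` holds the top candidates by contrast probability.
    """
    best_flip = None
    beam = []
    for _ in range(levels):
        n2 = (lo + hi) / 2
        candidates = propose(n2)
        # Keep the candidates with the highest contrast probability so far.
        beam = sorted(beam + candidates, key=lambda c: c[1], reverse=True)
        beam = beam[:beam_width]
        flips = [c for c in candidates if c[2]]
        if flips:
            best_flip = (n2, max(flips, key=lambda c: c[1]))
            hi = n2  # a flip was found: try masking less
        else:
            lo = n2  # no flip: mask more
    return best_flip, beam


def toy_propose(n2):
    """Toy stand-in: the prediction flips once at least 20% is masked."""
    return [(f"edit@{n2}", min(1.0, n2 / 40), n2 >= 20)]
```

With `toy_propose`, the four levels visit 27.5%, 13.75%, 20.625%, and 17.1875%, so the smallest flipping percentage found is 20.625%. In a multi-round search, the returned beam would seed the next round when no flip is found.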
                        IMDB                                NEWSGROUPS                          RACE
MiCE VARIANT      Flip Rate ↑  Minim. ↓  Fluen. ≈1    Flip Rate ↑  Minim. ↓  Fluen. ≈1    Flip Rate ↑  Minim. ↓  Fluen. ≈1

*PRED + GRAD      1.000        0.173     0.981        0.992        0.261     0.968        0.915        0.331     0.981
*GOLD + GRAD      1.000        0.185     0.979        0.992        0.271     0.966        0.945        0.335     0.979
PRED + RAND       1.000        0.257     0.958        0.968        0.378     0.928        0.799        0.440     0.953
GOLD + RAND       1.000        0.302     0.952        0.965        0.370     0.929        0.801        0.440     0.955
NO-FINETUNE       0.995        0.360     0.960        0.941        0.418     0.938        –            –         –

Table 1: Efficacy of the MiCE procedure. We evaluate MiCE edits on three metrics (described in §3.1): flip rate, minimality, and fluency. We report mean values for minimality and fluency. * marks full MiCE variants; others explore ablations. For each property (i.e., column), the best value across MiCE variants is bolded. We experiment with PREDICTOR’s predictions (PRED) and gold labels (GOLD) as target labels during Stage 1. Across datasets, our GRAD MiCE procedure achieves a high flip rate with small and fluent edits.
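
The three metrics reported in Table 1 can be made concrete with a short sketch. It assumes whitespace tokenization for the word-level Levenshtein distance and takes the two masked language modeling losses as precomputed numbers; the function names are illustrative and do not come from the released code.

```python
def word_levenshtein(original, edited):
    """Minimum number of word insertions, deletions, or substitutions."""
    a, b = original.split(), edited.split()
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        curr = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def minimality(original, edited):
    """Levenshtein distance divided by the number of words in the original."""
    n_words = len(original.split())
    if n_words == 0:
        raise ValueError("original input must contain at least one word")
    return word_levenshtein(original, edited) / n_words


def flip_rate(predictions, contrast_labels):
    """Share of instances whose edited input is predicted as the contrast label."""
    hits = sum(p == c for p, c in zip(predictions, contrast_labels))
    return hits / len(contrast_labels)


def fluency(loss_edited, loss_original):
    """Ratio of masked LM losses (edited / original); 1.0 means equal losses."""
    return loss_edited / loss_original
```

For example, changing one word of a five-word review gives a minimality of 0.2. Note that this normalization stays within 0 and 1 only when the edit does not add more words than the original contains.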

25K instances in the test set for evaluation because of the computational demands of evaluation.⁹

For each dataset, we measure the following three properties: (1) flip rate: the proportion of instances for which an edit results in the contrast label; (2) minimality: the “size” of the edit as measured by the word-level Levenshtein distance between the original and edited input, which is the minimum number of deletions, insertions, or substitutions required to transform one into the other. We report a normalized version of this metric with a range from 0 to 1: the Levenshtein distance divided by the number of words in the original input; (3) fluency: a measure of how similarly distributed the edited output is to the original data. We evaluate fluency by comparing masked language modeling loss on both the original and edited inputs using a pretrained model. Specifically, given the original $N$-length sequence, we create $N$ copies, each with a different token replaced by a mask token, following Salazar et al. (2020). We then take a pretrained T5-base model and compute the average loss across these $N$ copies. We compute this loss value for both the original input and edited input and report their ratio, i.e., edited / original. We aim for a value of 1.0, which indicates equivalent losses for the original and edited texts. When MiCE finds multiple edits, we report metrics for the edit with the smallest value for minimality.

3.2 Results

Results are shown in Table 1. Our proposed GRAD MiCE procedure (upper part of Table 1) achieves a high flip rate across all three tasks. This is the outcome regardless of whether predicted target labels (first row, 91.5–100% flip rate) or gold target labels (second row, 94.5–100% flip rate) are used for fine-tuning in Stage 1. We observe a slight improvement from using the gold labels for the RACE PREDICTOR, which may be explained by the fact that it is less accurate (with a training accuracy of 89.9%) than the IMDB and NEWSGROUPS classifiers.

MiCE achieves a high flip rate while its edits remain small and result in fluent text. In particular, MiCE on average changes 17.3–33.1% of the original tokens when predicted labels are used in Stage 1 and 18.5–33.5% with gold labels. Fluency is close to 1.0, indicating no notable change in masked language modeling loss after the edit, i.e., edits fall in distribution of the original data. We achieve the best results across metrics on the IMDB dataset, as expected since IMDB is a binary classification task with a small label space. These results demonstrate that MiCE presents a promising research direction for the generation of contrastive explanations; however, there is still room for improvement, especially for more challenging tasks such as RACE.

In the rest of this section, we provide results from several ablation experiments.

Fine-tuning vs. No Fine-tuning. We investigate the effect of fine-tuning (Stage 1) with a baseline that skips Stage 1 altogether. For this NO-FINETUNE baseline variant of MiCE, we use the vanilla pretrained T5-base as our EDITOR. As shown in Table 1, the NO-FINETUNE variant underperforms all other (two-stage) variants of MiCE for the IMDB and NEWSGROUPS datasets.¹⁰ Fine-tuning particularly improves the minimality of edits, while leaving the flip rate high. We hypothesize that this effect is due to the effectiveness of Stage 2 of MiCE at finding contrastive edits: Because we iteratively generate many candidate edits using beam search, we are likely to find a prediction-flipping edit. Fine-tuning allows us to find such an edit at a lower masking percentage.

Gradient vs. Random Masking. We study the impact of using gradient-based masking in Stage 1 of the MiCE procedure with a RAND variant, which masks spans of randomly chosen tokens. As shown in the middle part of Table 1, gradient-based masking outperforms random masking when using both predicted and gold labels across all three tasks and metrics, suggesting that the gradient-based attribution used to mask text during Stage 1 of MiCE is an important part of the procedure. The differences are especially notable for RACE, which is the most challenging task according to our metrics.

Targeted vs. Un-targeted Infilling. We investigate the effect of using target labels in both stages of MiCE by experimenting with removing target labels during Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits). As shown in Table 2, we observe that giving target labels to our EDITORS during both stages of MiCE improves edit quality. Fine-tuning EDITORS without labels in Stage 1 (“No Label”) leads to worse flip rate, minimality, and fluency than does fine-tuning EDITORS with labels (“Label”). Minimality is particularly affected, and we hypothesize that using target end-task labels in both stages provides signal that allows the EDITOR in Stage 2 to generate prediction-flipping edits at lower masking percentages.

IMDB Condition
Stage 1     Stage 2     Flip Rate ↑   Minim. ↓   Fluen. ≈1
No Label    No Label    0.994         0.369      0.966
No Label    Label       0.997         0.362      0.967
Label       No Label    0.999         0.327      0.968
Label       Label       1.000         0.173      0.981

Table 2: Effect of using target end-task labels during the two stages of PRED+GRAD MiCE on the IMDB dataset. When end-task labels are provided, they are original PREDICTOR labels during Stage 1 and contrast labels during Stage 2. The best values for each property (column) are bolded. Using end-task labels during both Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits) of MiCE outperforms all other conditions.

4 Analysis of Edits

In this section, we compare MiCE edits with human contrastive edits. Then, we turn to a key motivation for this work: the potential for contrastive explanations to assist in NLP system development. We show how MiCE edits can be used to debug incorrect predictions and uncover dataset artifacts.

4.1 Comparison with Human Edits

We ask whether the contrastive edits produced by MiCE are minimal and fluent in a meaningful sense. In particular, we compare these two metrics for MiCE edits and human contrastive edits. We work with the IMDB contrast set created by Gardner et al. (2020), which consists of original test inputs and human-edited inputs that cause a change in true label. We report metrics on the subset of this contrast set for which the human-edited inputs result in a change in model prediction for our IMDB PREDICTOR; this subset consists of 76 instances. The flip rate of MiCE edits on this subset is 100%. The mean minimality values of human and MiCE edits are 0.149 (human) and 0.179 (MiCE), and the mean fluency values are 1.01 (human) and 0.949 (MiCE). The similarity of these values suggests that MiCE edits are comparable to human contrastive edits along these dimensions.

We also ask to what extent human edits overlap with MiCE edits. For each input, we compute the overlap between the original tokens changed by humans and the original tokens edited by MiCE. The mean number of overlapping tokens, normalized by the number of original tokens edited by humans, is 0.298. Thus, while there is some overlap between MiCE and human contrastive edits, they generally change different parts of text.¹¹ This analysis suggests that there may exist multiple informative contrastive edits for a single input. Future work can investigate and compare the different kinds of insight that can be obtained through human and model-driven contrastive edits.

4.2 Use Case 1: Debugging Incorrect Outputs

Here, we illustrate how MiCE edits can be used to debug incorrect model outputs. Consider the RACE

⁹ A single contrastive edit is expensive and takes an average of ≈ 15 seconds per IMDB instance (≈ 230 tokens). Calculating the fluency metric adds an additional average of ≈ 16.5 seconds per IMDB instance. For more details, see Section 5.
¹⁰ We leave RACE out from our evaluation with the NO-FINETUNE baseline because we observe that the pretrained T5 model does not generate text formatted as span infills; we hypothesize that this model has not been trained to generate infills for masked inputs formatted as multiple choice inputs.
¹¹ MiCE edits explain PREDICTORS’ behavior and therefore need not be similar to human edits, which are designed to change gold labels.
IMDB
Original pred yp = positive       Contrast pred yc = negative
An interesting pairing of stories, this little flick manages to bring together seemingly different characters and
story lines all in the backdrop of WWII and succeeds in tying them together without losing the audience.
I was impressed by the depth portrayed by the different characters and also by how much I really felt I
understood them and their motivations, even though the time spent on the development of each character was
very limited. The outstanding acting abilities of the individuals involved with this picture are easily noted. A
fun, stylized movie with a slew of comic moments and a bunch more head shaking events. 7/10 4/10

RACE
Question: Mark went up in George’s plane ______.
(a) twice (b) only once (c) several times (d) once or twice.
Original pred yp = (a) twice        Contrast pred yc = (b) only once
When George was thirty-five, he bought a small plane and learned to fly it. He soon became very good and
made his plane do all kinds of tricks. George had a friend, whose name was Mark. One day George offered to
take Mark up in his plane. Mark thought, "I’ve traveled in a big plane several times, but I’ve never been in a
small one, so I’ll go." They went up, and George flew around for half an hour and did all kinds of tricks in the
air. When they came down again, Mark was glad to be back safely, and he said to his friend in a shaking voice,
"Well, George, thank you very much for those two trips tricks in your plane." George was very surprised and
said, "Two trips? tricks." Yes, That’s my first and my last time, George." answered said Mark.

Table 3: Examples of edits produced by MiCE. Insertions are bolded in red. Deletions are struck through. yp is
the PREDICTOR’s original prediction, and yc the contrast prediction. True labels for original inputs are underlined.

input in Table 3, for which the RACE PREDICTOR gives an incorrect prediction. In this case, a model developer may want to understand why the model got the answer wrong. This setting naturally gives rise to a contrastive question, i.e., Why did the model predict the wrong choice (“twice”) instead of the correct one (“only once”)?

The MiCE edit shown offers insight into this question: Firstly, it highlights which part of the paragraph has an influence on the model prediction (the last few sentences). Secondly, it reveals that a source of confusion is Mark’s joke about having traveled in George’s plane twice, as changing Mark’s dialogue from talking about a “first and...last” trip to a single trip results in a correct model prediction.

MiCE edits can also be used to debug model capabilities by offering hypotheses about “bugs” present in models: For instance, the edit in Table 3 might prompt a developer to investigate whether this PREDICTOR lacks non-literal language understanding capabilities. In the next section, we show how insight from individual MiCE edits can be used to uncover a bug in the form of a dataset-level artifact learned by a model. In Appendix D, we further analyze the debugging utility of MiCE edits with a PREDICTOR designed to contain a bug.

4.3 Use Case 2: Uncovering Dataset Artifacts
Manual inspection of some edits for IMDB suggests that the IMDB PREDICTOR has learned to rely heavily on numerical ratings. For instance, in the IMDB example in Table 3, the MiCE edit results in a negative prediction from the PREDICTOR even though the edited text is overwhelmingly positive. We test this hypothesis by investigating whether numerical tokens are more likely to be edited by MiCE.

        yc = positive                   yc = negative
    Removed        Inserted         Removed      Inserted
    4/10           excellent        10/10        awful
    ridiculous     enjoy            8/10         disappointed
    horrible       amazing          7/10         1
    4              entertaining     9            4
    predictable    10               enjoyable    annoying

Table 4: Top 5 IMDB tokens edited by MiCE at a higher rate than expected given their original frequency (§4.3). Results are separated by contrast predictions.

We analyze the edits produced by MiCE (GOLD + GRAD) described in §3.1. We limit our analysis to a subset of the 5K instances for which the edit produced by MiCE has a minimality value of ≤0.05, as we are interested in finding simple artifacts driving the predictions of the IMDB PREDICTOR; this subset has 902 instances. We compute three metrics for each unique token, i.e., type t:

    p(t)   = #_occurrences(t) / #_all_tokens,
    p_r(t) = #_removals(t) / #_all_removals,
    p_i(t) = #_insertions(t) / #_all_insertions,

and report the tokens with the highest values for the ratios p_r(t)/p(t) and p_i(t)/p(t). Intuitively, these tokens are removed/inserted at a higher rate than expected given the frequency with which they appear in the original IMDB inputs. We exclude tokens that occur
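As an illustration of the ratios p_r(t)/p(t) and p_i(t)/p(t) defined in §4.3, the following is a minimal sketch rather than the authors' implementation. The function names, the flat token-list inputs, and the choice to score only token types that appear in the original inputs (so that p(t) is never zero) are assumptions made here. The token-exclusion rule mentioned in the text is not implemented.

```python
from collections import Counter


def edit_rate_ratios(original_tokens, removed_tokens, inserted_tokens):
    """Return {token: (p_r/p, p_i/p)} for each token type in the original inputs.

    All three arguments are flat lists of tokens pooled over the analyzed
    instances (a hypothetical input format chosen for this sketch).
    """
    occurrences = Counter(original_tokens)
    removals = Counter(removed_tokens)
    insertions = Counter(inserted_tokens)

    n_tokens = sum(occurrences.values())
    n_removals = sum(removals.values())
    n_insertions = sum(insertions.values())

    ratios = {}
    for token, count in occurrences.items():
        p = count / n_tokens
        p_r = removals[token] / n_removals if n_removals else 0.0
        p_i = insertions[token] / n_insertions if n_insertions else 0.0
        ratios[token] = (p_r / p, p_i / p)
    return ratios


def top_k(ratios, k=5, index=0):
    """Tokens with the highest ratio; index 0 ranks by removals, 1 by insertions."""
    return sorted(ratios, key=lambda t: ratios[t][index], reverse=True)[:k]


# Toy data, not taken from the paper.
originals = ["good", "good", "bad", "7/10"]
removed = ["7/10"]
inserted = ["bad", "bad"]
ratios = edit_rate_ratios(originals, removed, inserted)
print(top_k(ratios, k=1, index=0))
print(top_k(ratios, k=1, index=1))
```

A ratio above 1 means the token is removed (or inserted) more often than its frequency in the original inputs would predict, which is the signal used to surface candidate artifacts such as numerical ratings.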
trolled text generation methods to generate targeted counterfactuals and explores their use as test cases and augmented examples in the context of classification. Another concurrent work by Wu et al. (2021) presents POLYJUICE, a general-purpose, untargeted counterfactual generator. Very recent work by Sha et al. (2021), introduced after the submission of MiCE, proposes a method for targeted contrastive editing for Q&A that selects answer-related tokens, masks them, and generates new tokens. Our work differs from these works in our novel framework for efficiently finding minimal edits (MiCE Stage 2) and our use of edits as explanations.

Connection to Adversarial Examples Adversarial examples are minimally edited inputs that cause models to incorrectly change their predictions despite no change in true label (Jia and Liang, 2017; Ebrahimi et al., 2018; Pal and Tople, 2020). Recent methods for generating adversarial examples also preserve fluency (Zhang et al., 2019; Li et al., 2020b; Song et al., 2020)¹⁵; however, adversarial examples are designed to find erroneous change in model outputs; contrastive edits place no such constraint on model correctness. Thus, current approaches to generating adversarial examples, which can exploit semantics-preserving operations (Ribeiro et al., 2018) such as paraphrasing (Iyyer et al., 2018) or word replacement (Alzantot et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020), cannot be used to generate contrastive edits.

Connection to Style Transfer The goal of style transfer is to generate minimal edits to inputs to result in a target style (sentiment, formality, etc.) (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Most existing approaches train an encoder to learn style-agnostic latent representation of inputs and train attribute-specific decoders to generate text reflecting the content of inputs but exhibiting a different target attribute (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Recent works by Wu et al. (2019) and Malmi et al. (2020) adopt two-stage approaches that first identify where to make edits and then make them using pretrained language models. Such approaches can only be applied to generate contrastive edits for classification tasks with well-defined “styles,” which exclude more complex tasks such as question answering.

¹⁵ Song et al. (2020) propose a method to produce fluent semantic collisions, which they call the “inverse” of adversarial

7 Conclusion

We argue that contrastive edits, which change the output of a PREDICTOR to a given contrast prediction, are effective explanations of neural NLP models. We propose Minimal Contrastive Editing (MiCE), a method for generating such edits. We introduce evaluation criteria for contrastive edits that are motivated by human contrastive explanations (minimality and fluency) and show that MiCE edits for the IMDB, NEWSGROUPS, and RACE datasets are contrastive, fluent, and minimal. Through qualitative analysis of MiCE edits, we show that they have utility for robust and reliable NLP system development.

8 Broader Impact Statement

MiCE is intended to aid the interpretation of NLP models. As a model-agnostic explanation method, it has the potential to impact NLP system development across a wide range of models and tasks. In particular, MiCE edits can benefit NLP model developers in facilitating debugging and exposing dataset artifacts, as discussed in §4. As a consequence, they can also benefit downstream users of NLP models by facilitating access to less biased and more robust systems.

While the focus of our work is on interpreting NLP models, there are potential misuses of MiCE that involve other applications. Firstly, malicious actors might employ MiCE to generate adversarial examples; for instance, they may aim to generate hate speech that is minimally edited such that it fools a toxic language classifier. Secondly, naively applying MiCE for data augmentation could plausibly lead to less robust and more biased models: Because MiCE edits are intended to expose issues in models, straightforwardly using them as additional training examples could reinforce existing artifacts and biases present in data. To mitigate this risk, we encourage researchers exploring data augmentation to carefully think about how to select and label edited instances.

We also encourage researchers to develop more efficient methods of generating minimal contrastive edits. As discussed in §5, a limitation of MiCE is its computational demand. Therefore, we recommend that future work focus on creating methods that require less compute.
A Training Details

A.1 PREDICTOR Models
For all datasets, f is initialized as a ROBERTA-LARGE model with a linear layer and maximum sequence length of 512 tokens. We train with AllenNLP (Gardner et al., 2017). For IMDB and NEWSGROUPS, we fine-tune f for 5 epochs with batch size 8 using Adam with initial learning rate of 2e−05, weight decay 0.1, and slanted triangular learning rate scheduler with cut_frac 0.06. For RACE, we fine-tune f for 3 epochs with batch size 4 and 16 gradient accumulation steps using Adam with learning rate 1e−05, ε = 1e−08, and linear learning rate scheduler with 100 warm-up steps, and we fix f after the epoch with the lowest validation loss.

A.2 EDITOR Models
We use the transformers implementation (Wolf et al., 2020) of the base T5 for our EDITORS. We use Adam with a learning rate of 1e−4. For IMDB EDITORS, we use batch size 4 for all variants. For NEWSGROUPS, we use batch size 4 for fine-tuning with predictor labels and batch size 8 for fine-tuning with gold labels. For RACE, we use batch size 4 for fine-tuning with predictor labels and batch size 6 for fine-tuning with gold labels.

B Data Processing
We remove newline and tab tokens (, \t, \n) in all datasets, as these are tokenized differently by our PREDICTORS (ROBERTA-LARGE) and EDITORS (T5). For NEWSGROUPS, we also remove headers, footers, and quotes.

Inputs to EDITORS For IMDB and NEWSGROUPS EDITORS, we simply prepend target labels to the masked original inputs. For RACE, we give the question, context, all answer options, and the correct choice as input to the RACE EDITOR. We only mask the context. See Table 5 for examples.

C T5 generation for large n_2
We noticed that generations sometimes degenerate when we decode from T5 with a large masking percentage n_2. For example, sentinel tokens are sometimes generated out of consecutive order. We attribute this to the large difference between the masking percentages we use (up to 55%) and the masking percentage used during T5 pretraining (15%). Specifically, we observed that generations tend to degenerate after the 28th sentinel token. Thus, we heuristically reduce the number of sentinel tokens by combining neighboring sentinel tokens that are separated by 1-2 tokens into one sentinel token.

When the output degenerates, we do the following: In-fill the mask tokens with the “good” parts of the generation (i.e. parts with correctly ordered sentinel tokens), and replace the remaining mask tokens with the original text; get the contrast label probabilities from f for these intermediate in-filled candidates; of these, take the m′ = 3 candidates with the highest probabilities and use as input to generate m/m′ new candidates.¹⁶

¹⁶ If one of the partially-infilled candidates results in the contrast label, we return this as the edited input.

D Using MiCE Edits to Debug a “Buggy” PREDICTOR: A Case Study
In §4, we illustrate how MiCE edits can be used to debug both individual predictions and natural dataset artifacts learned by a model. Here, we further explore the utility of MiCE edits in debugging through Data Staining (Sippy et al., 2020): We design a “buggy” PREDICTOR and evaluate whether MiCE edits can recover the bug.

We create a buggy RACE PREDICTOR by introducing an artifact into the RACE train set. This artifact is the presence of the phrase “It is interesting to note that” in front of the correct answer choice. We introduce this artifact as follows: We filter the RACE train data to contain instances for which the correct answer choice is contained by some sentence¹⁷ and the overlapping sentence does not have a higher degree of n-gram overlap with some other (incorrect) choice. After filtering, 11,188 of 87,866 train instances remain. We then prepend “It is interesting to note that” to the overlapping sentence to design a correlation between the location of this phrase and the correct answer choice; our goal is to encourage a PREDICTOR to learn to predict the multiple choice option closest to this buggy phrase as the correct answer. If there are multiple overlapping sentences, we choose the one with the most overlap with the answer choice. We randomly sample from this filtered subset such that 10% of the train data contains this artifact. Our buggy RACE PREDICTOR is trained on this modified data using the same set-up from §A.1, except that we use a batch size of 2 and 32 gradient accumulation steps.

¹⁷ A sentence “contains” the correct answer choice if the answer has at least a 4-gram overlap with the sentence.
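To make the staining procedure of Appendix D concrete, here is a minimal sketch under stated assumptions; it is not the authors' code. Whitespace tokenization, a pre-split list of sentences, counting shared 4-grams as the "degree of overlap", and the helper names `ngrams`, `overlap`, and `stain` are all choices made for this illustration. The 10% sampling step is omitted.

```python
STAIN = "It is interesting to note that"


def ngrams(tokens, n):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(sentence, choice, n=4):
    """Number of distinct n-grams shared by a sentence and an answer choice."""
    return len(ngrams(sentence.split(), n) & ngrams(choice.split(), n))


def stain(sentences, choices, correct_idx, n=4):
    """Prepend the stain phrase to the sentence that best overlaps the answer.

    Returns the modified sentence list, or None when the instance would be
    filtered out (no sentence shares an n-gram with the correct choice, or
    the chosen sentence overlaps more with an incorrect choice).
    """
    answer = choices[correct_idx]
    scores = [overlap(s, answer, n) for s in sentences]
    if not scores or max(scores) == 0:
        return None
    best = scores.index(max(scores))  # first sentence with the most overlap
    for j, choice in enumerate(choices):
        if j != correct_idx and overlap(sentences[best], choice, n) > scores[best]:
            return None
    stained = list(sentences)
    stained[best] = f"{STAIN} {stained[best]}"
    return stained


# Toy instance, not taken from RACE.
sentences = ["George bought a small plane .",
             "Mark went up only once in the plane ."]
choices = ["twice", "went up only once in the plane"]
print(stain(sentences, choices, correct_idx=1))
```

Answer choices shorter than four tokens have no 4-grams, so under this reading such instances are always filtered out, which is one plausible reason the filtered set is much smaller than the full train set.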