Evaluating Neural Model Robustness for Machine Comprehension
Winston Wu
Center for Language and Speech Processing
Department of Computer Science
Johns Hopkins University
wswu@jhu.edu

Dustin Arendt¹ and Svitlana Volkova²
¹Visual Analytics Group, ²Data Sciences and Analytics Group
Pacific Northwest National Laboratory
first.last@pnnl.gov

Abstract

We evaluate neural model robustness to adversarial attacks using different types of linguistic unit perturbations – character and word – and propose a new method for strategic sentence-level perturbations. We experiment with different amounts of perturbation to examine model confidence and misclassification rate, and contrast model performance with different embeddings, BERT and ELMo, on two benchmark datasets, SQuAD and TriviaQA. We demonstrate how to improve model performance during an adversarial attack by using ensembles. Finally, we analyze factors that affect model behavior under adversarial attack, and develop a new model to predict errors during attacks. Our novel findings reveal that (a) unlike BERT, models that use ELMo embeddings are more susceptible to adversarial attacks, (b) unlike word and paraphrase perturbations, character perturbations affect the model the most but are most easily compensated for by adversarial training, (c) word perturbations lead to more high-confidence misclassifications compared to sentence- and character-level perturbations, (d) the type of question and the model answer length (the longer the answer, the more likely it is to be incorrect) are the most predictive of model errors in an adversarial setting, and (e) conclusions about model behavior are dataset-specific.

Context: One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize.
Question: What was Maria Curie the first female recipient of?
Answer: Nobel Prize

Table 1: Example MC question from SQuAD.

1 Introduction

Deep neural models have recently gained popularity, leading to significant improvements in many Natural Language Understanding (NLU) tasks (Goldberg, 2017). However, the research community still lacks in-depth understanding of how these models work and what kind of linguistic information is actually captured by neural networks (Feng et al., 2018). Evaluating model robustness to manipulated inputs and analyzing model behavior during adversarial attacks can provide deeper insights into how much language understanding models actually have (Hsieh et al., 2019; Si et al., 2020). Moreover, as has been widely discussed, models should be optimized not only for accuracy but also for other important criteria such as reliability, accountability and interpretability (Lipton, 2018; Doshi-Velez and Kim, 2017; Ribeiro et al., 2016; Goodman and Flaxman, 2017).

In this work, we evaluate neural model robustness on machine comprehension (MC), a task designed to measure a system's understanding of text. In this task, given a context paragraph and a question, the machine is tasked to provide an answer. We focus on span-based MC, where the model selects a single contiguous span of tokens in the context as the answer (Tab. 1). We (1) quantitatively measure when and how the model is robust to manipulated inputs, when it generalizes well, and when it is less susceptible to adversarial attacks, (2) demonstrate that relying on ensemble models increases robustness, and (3) develop a new model to predict model errors during attacks. Our novel contributions shed light on the following questions:
• Which embeddings are more susceptible to noise and adversarial attacks?
• What types of text perturbation lead to the most high-confidence misclassifications?
• How does the amount of text perturbation affect model behavior?
• What factors explain model behavior under perturbation?
• Are the above dataset-specific?
Broader Implications

We would like to stress the importance of this type of work to ensure diversity and progress for the computational linguistics community. We as a community know how to build new models for language understanding, but we do not fully understand how these models work. When we deploy these models in production, they fail to perform well in real-world conditions, and we fail to explain why they fail; the reason being we have not performed thorough evaluation of model performance under different experimental conditions. Neural model evaluation and thorough error analysis, especially for tasks like machine comprehension, are critical to make progress in the field. We have to ensure our research community goes beyond F1 scores and incremental improvements and gains a deeper understanding of models' decision-making processes to drive revolutionary rather than evolutionary research.

2 Background

There is much recent work on adversarial NLP, surveyed in Belinkov and Glass (2019) and Zhang et al. (2019). To situate our work, we review relevant research on the black-box adversarial setting, in which one does not have access or information about the model's internals, only the model's output and its confidence about the answer.¹

¹ For other settings (e.g. white-box), we refer the reader to the above surveys.

In an adversarial setting, the adversary seeks to mislead the model into producing an incorrect output by slightly tweaking the input. Recent work has explored input perturbations at different linguistic levels: character, word, and sentence-level. For character-level perturbations, NLP systems generally do not take into account the visual characteristics of characters. Researchers have explored the effects of adding noise by randomizing or swapping characters and examining its effect on machine translation (MT) (Heigold et al., 2018; Belinkov and Bisk, 2018), sentiment analysis and spam detection (Gao et al., 2018), and toxic content detection (Li et al., 2018). Eger et al. (2019) replaced characters with similar-looking symbols, and developed a system to replace characters with nearest neighbors in a visual embedding space. For word-level perturbations, Alzantot et al. (2018) used a genetic algorithm to replace words with contextually similar words, evaluating on sentiment analysis and textual entailment. For sentence-level perturbations, Iyyer et al. (2018) generated adversarial paraphrases by controlling the syntax of sentences and evaluating on sentiment analysis and textual entailment tasks. Hu et al. (2019) found that augmenting the training data with paraphrases can improve performance on natural language inference, question answering, and MT. Niu and Bansal (2018) use adversarial paraphrases for dialog models.

Other related work includes Zhao et al. (2018); Hsieh et al. (2019), who generated natural-looking adversarial examples for image classification, textual entailment, and MT. Specifically for MC, Jia and Liang (2017) added a distractor sentence to the end of the context, Ribeiro et al. (2018) extracted sentence perturbation rules from paraphrases created by translating to and then from a foreign language and then manually judged for semantic equivalence, and Si et al. (2020) focused on evaluating model robustness for MC.

Unlike earlier work, we empirically show how neural model performance degrades under multiple types of adversarial attacks by varying the amount of perturbation, the type of perturbation, model architecture and embedding type, and the dataset used for evaluation. Moreover, our deep analysis examines factors that can explain neural model behavior under these different types of attacks.

Concurrent with the development of our paper, there has also been a slew of relevant work tackling robustness in neural NLP models, including the Adversarial Robustness Toolbox (Nicolae et al., 2018), Advertorch (Ding et al., 2019), Foolbox (Rauber et al., 2020), Advbox (Goodman et al., 2020), OpenAttack (Zeng et al., 2020), TEAPOT (Michel et al., 2019), TextAttack (Morris et al., 2020), TextFooler (Jin et al., 2020), and Robustness Gym (Goel et al., 2021).

3 Methods

We perform comprehensive model evaluation for machine comprehension over several dimensions: the amount of perturbation, perturbation type, model and embedding variation, and datasets.

3.1 Perturbation Type

We examine how changes to the context paragraph (excluding the answer span) affect the model's performance using the following perturbations:
• Character-level. In computer security, this is known as a homograph attack. These attacks have been investigated to identify phishing and spam (Fu et al., 2006b,a; Liu and Stamm, 2007) but to our knowledge have not been applied in the NLP domain. We replace 25% of characters in the context paragraph with deceptive Unicode characters² that to a human are indistinguishable from the original.
• Word-level. We randomly replace 25% of the words in the context paragraph with their nearest neighbor in the GloVe (Pennington et al., 2014) embedding space.³
• Sentence-level. We use the Improved ParaBank Rewriter (Hu et al., 2019), a machine translation approach for sentence paraphrasing, to paraphrase sentences in the context paragraph. We perform sentence tokenization, paraphrase each sentence with the paraphraser, then recombine the sentences.

For character and word perturbations, we use 25% as this is where the performance curve in Heigold et al. (2018) flattens out.⁴ Regardless of the type of perturbation, we do not perturb the context that contains the answer span, so that the answer can always be found in the context unperturbed. Because paraphrasing is per sentence, we only modify sentences that do not contain the answer span. An example of each perturbation type is shown in Tab. 2.

Original:  The connection between macroscopic nonconservative forces and microscopic conservative forces is described by detailed treatment with statistical mechanics.
Character: The connection between macroscopic nonconservative forces and microscopic conservative forces is described by detailed treatment with statistical mechanics.
Word:      The connection between macroscopic nonconservative forces and insects conservative troops is referred by detailed treatment with statistical mechanics.
Sentence:  The link between macroscopic non-conservative forces and microscopic conservative forces is described in detail by statistical mechanics.

Table 2: Examples of character, word and sentence-level perturbations (bold indicates perturbed text).

² From https://www.unicode.org/Public/security/12.1.0/intentional.txt
³ Several alternative embedding techniques could be used to find the nearest neighbors, e.g., Word2Vec or FastText. We use GloVe for consistency with previous work (Li et al., 2018).
⁴ Belinkov and Bisk (2018) perturbed text at 100% while Heigold et al. (2018) experimented with 5–30% perturbations.
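To make the character-level (homograph) attack concrete, the sketch below replaces a fraction of the characters outside the answer span with visually identical Unicode codepoints. This is a minimal re-implementation of the idea under our own assumptions, not the authors' code; the small CONFUSABLES map stands in for the full intentional.txt list referenced in footnote 2.

```python
import random

# Illustrative subset of the Unicode "intentional" confusables list;
# the full mapping would be parsed from intentional.txt.
CONFUSABLES = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "i": "\u0456",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}

def perturb_characters(context: str, answer_span: tuple, rate: float = 0.25,
                       seed: int = 0) -> str:
    """Replace `rate` of the replaceable characters outside the answer span
    with visually indistinguishable Unicode characters (homograph attack)."""
    rng = random.Random(seed)
    start, end = answer_span  # character offsets of the gold answer
    candidates = [i for i, ch in enumerate(context)
                  if ch in CONFUSABLES and not (start <= i < end)]
    chosen = set(rng.sample(candidates, int(rate * len(candidates))))
    return "".join(CONFUSABLES[ch] if i in chosen else ch
                   for i, ch in enumerate(context))

# The answer span is left untouched, so the answer can still be found verbatim.
ctx = "The connection between macroscopic nonconservative forces ..."
print(perturb_characters(ctx, answer_span=(0, 0)))
```

To a human reader the perturbed paragraph looks unchanged, while a model sees substantially different character sequences.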
3.2 Amount of Perturbation

For each perturbation type, we experiment with perturbing the training data at differing amounts. All models are tested on fully perturbed test data.
• None: clean training data.
• Half: perturb half the training examples.
• Full: perturb the entire train set.
• Both: append the entire perturbed data to the entire clean data.⁵
• Ens: an ensemble model that combines the none, half, and full settings; we rely on ensemble voting and only include a word in the predicted answer if any two models agree.

⁵ This has twice the amount of data as other settings so is not directly comparable, but many papers show that doing this can improve a model's performance.

3.3 Model Architecture and Embeddings

BiDAF model with ELMo (Seo et al., 2017; Peters et al., 2018). ELMo is a deep, contextualized, character-based word embedding method using a bidirectional language model. The Bi-Directional Attention Flow model is a hierarchical model with embeddings at multiple levels of granularity: character, word, and paragraph. We use pre-trained ELMo embeddings in the BiDAF model implemented in AllenNLP (Gardner et al., 2018).

BERT (Devlin et al., 2019). BERT is another contextualized embedding method that uses Transformers (Vaswani et al., 2017). It is trained to recover masked words in a sentence as well as on a next-sentence prediction task. The output layer of BERT is fed into a fully-connected layer for the span classification task. Pre-trained embeddings can be fine-tuned to a specific task, and we use the Huggingface PyTorch-Transformers package, specifically the bert-large-cased-whole-word-masking-finetuned-squad model. We fine-tune for two epochs in each experimental setting.
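As an illustration of the BERT span-prediction setup, the sketch below queries the checkpoint named above for start/end probabilities using the current Transformers API. It is our illustrative loading code, not the authors' exact pipeline, and the greedy argmax decoding is a simplification (a real decoder would search over valid start ≤ end spans).

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Checkpoint named in the text; loading code is a sketch, not the paper's pipeline.
name = "bert-large-cased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "What was Maria Curie the first female recipient of?"
context = ("One of the most famous people born in Warsaw was Maria "
           "Sklodowska-Curie, who achieved international recognition for her "
           "research on radioactivity and was the first female recipient of "
           "the Nobel Prize.")

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Distributions over token positions for the answer start and end.
start_probs = outputs.start_logits.softmax(dim=-1)[0]
end_probs = outputs.end_logits.softmax(dim=-1)[0]
start, end = start_probs.argmax().item(), end_probs.argmax().item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))  # "Nobel Prize"
```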
3.4 Benchmark Datasets

We experiment on two benchmark MC datasets:

SQuAD (Rajpurkar et al., 2016). The Stanford Question Answering Dataset is a collection of over 100K crowdsourced question and answer pairs. The context containing the answer is taken from Wikipedia articles.

TriviaQA (Joshi et al., 2017). A collection of over 650K crowdsourced question and answer pairs, where the context is from web data or Wikipedia. The construction of the dataset differs from SQuAD in that question-answer pairs were first constructed, then evidence was found to support the answer. We utilize the Wikipedia portion of TriviaQA, whose size is comparable to SQuAD. To match the span-based setting of SQuAD, we convert TriviaQA to the SQuAD format using the scripts in the official repo and remove answers without evidence.

Figure 1: Number of errors by perturbation type and amount of perturbation (higher = worse model performance, or more successful attacks). Baseline indicates model errors whose training and testing data were not perturbed. For cross-model/embedding comparison, compare (a) and (b). For cross-dataset comparison, compare (a) and (c). The ens training setting is an ensemble of results from the none, half, and full settings. Panels: (a) SQuAD, ELMo (baseline=3732); (b) SQuAD, BERT (baseline=1732); (c) TriviaQA, BERT (baseline=6089). Axes: perturbation type (char, word, para) vs. train perturbation amount (none, half, full, both, ens).

4 Evaluation Results

Fig. 1 summarizes our findings on how model behavior changes under noisy perturbations and adversarial attacks. Here, we briefly discuss how perturbation type, perturbation amount, model, and embeddings affect the model misclassification rate. In addition, we contrast model performance across datasets and report how to mitigate model error rate using ensembling. Detailed analyses are presented in Sec. 5. Key findings are italicized.

The effect of perturbation type
To assess whether perturbations changed the meaning, we ran a human study on a random sample of 100 perturbed contexts from SQuAD. We found (as expected) that the two annotators we employed could not distinguish char-perturbed text from the original. For word perturbations, the meaning of the context remained in 65% of cases, but annotators noted that sentences were often ungrammatical. For sentence-level perturbations, the meaning remained in 83% of cases.

For a model trained on clean data, character perturbations affect the model the most, followed by word perturbations, then paraphrases. To a machine, a single character perturbation results in a completely different word; handling this type of noise is important for a machine seeking to beat human performance. Word perturbations are context independent and can make the sentence ungrammatical.⁶ Nevertheless, the context's meaning generally remains coherent. Paraphrase perturbations are the most appealing because they retain meaning while allowing more drastic phrase and sentence structure modifications. In Sec. 4.2, we present a more successful adversarially targeted paraphrasing approach.

⁶ Future work will address this with language models.

The effect of perturbation amount
Perturbed training data improves the model's performance for character perturbations (1st column of Fig. 1a), likely due to the models' ability to handle unseen words: BiDAF with ELMo utilizes character embeddings, while BERT uses word pieces. Our results corroborate Heigold et al. (2018)'s findings (though on a different task) that without adversarial training, models perform poorly on perturbed test data, but when models are trained on perturbed data, the amount of perturbed training data does not make much difference. We do not see statistically significant results for word and paraphrase perturbations (2nd and 3rd columns in each heatmap in Fig. 1). We conclude that perturbing 25% of the words and the non-strategic paraphrasing approach were not aggressive enough.

The effect of model and embedding
As shown in Fig. 1a and b, the BERT model made fewer errors than the ELMo-based model regardless of the perturbation type and amount.
While the two models are not directly comparable, our results indicate that the BERT model is more robust to adversarial attacks compared to ELMo.

The effect of the data
Holding the model constant (Fig. 1b and c), experiments on TriviaQA resulted in more errors than SQuAD regardless of perturbation amount and type, indicating that TriviaQA may be a harder dataset for MC and may contain data bias, discussed below.

Figure 2: The effect of perturbation types and embeddings on model behavior measured as high vs. low confidence misclassifications. More robust models should have a lower high-confidence error rate (x-axis). Panels: (a) Across models and embeddings; (b) Across perturbation types.

4.1 Adversarial Ensembles

Ensemble adversarial training has recently been explored (Tramèr et al., 2018) as a way to ensure robustness of ML models. For each perturbation type, we present results ensembled from the none, half, and full perturbed settings. We tokenize the answers from these three models and keep all tokens that appear at least twice as the resulting answer (Tab. 3). Even when all three model answers differ (e.g. in the word perturbation case), ensembling can often reconstruct the correct answer. Nevertheless, we find that this ensembling only helps for TriviaQA, which has an overall higher error rate (bottom row of each figure in Fig. 1).

Train  Test  Model Answer
none   char  here"
half   char  Orientalism
full   char  Orientalism
none   word  Orientalism
half   word  behaviourism identities
full   word  The discourse of Orientalism
none   para  Orientalism
half   para  . . . the East as a negative
full   para  Orientalism

Table 3: Example result from response ensembling under the SQuAD ELMo setting. The question is "What was used by the West to justify control over eastern territories?" The answer is "Orientalism", and in all three settings, the ensemble was correct.
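A minimal sketch of the token-level voting described above: keep every token that at least two of the three models' answers contain. The whitespace tokenization and the ordering heuristic are our assumptions, not a specification of the authors' implementation.

```python
from collections import Counter

def ensemble_answer(answers):
    """Token-level voting over the none/half/full models' answers:
    keep tokens that appear in at least two answers, preserving the
    order in which they are first seen."""
    counts = Counter(tok for ans in answers for tok in set(ans.split()))
    seen, voted = set(), []
    for ans in answers:
        for tok in ans.split():
            if counts[tok] >= 2 and tok not in seen:
                seen.add(tok)
                voted.append(tok)
    return " ".join(voted)

# Word-perturbation case from Table 3: all three answers differ,
# but voting still recovers "Orientalism".
print(ensemble_answer(["Orientalism",
                       "behaviourism identities",
                       "The discourse of Orientalism"]))
```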
4.2 Strategic Paraphrasing

We did not observe a large increase in errors with paraphrase perturbations (Fig. 1), perhaps because paraphrasing, unlike the char and word perturbations, is not a deliberate attack on the sentence. Here we experiment with a novel strategic paraphrasing technique that targets specific words in the context and then generates paraphrases that exclude those words. We find the most important words in the context by individually modifying each word and obtaining the model's prediction and confidence, a process similar to Li et al. (2018). Our modification consists of removing the word and examining its effect on the model prediction. The most important words are those which, when removed, lower the model's confidence in a correct answer or increase its confidence in an incorrect answer. The Improved ParaBank Rewriter supports constrained decoding, i.e. specifying positive and negative constraints to force the system output to include or exclude certain phrases. We specify the top five important words in the context as negative constraints to generate strategic paraphrases.⁷

⁷ The number of constraints does not necessarily indicate the number of words that are changed in the context.

We experimented on 1000 instances in the SQuAD dev set as shown in Tab. 4. Our results indicate that strategic paraphrasing with negative constraints is a successful adversarial attack, lowering the F1 score from 89.96 to 84.55. Analysis shows that many words in the question are important and thus excluded from the paraphrases. We also notice that paraphrasing can occasionally turn an incorrect prediction into a correct one. Perhaps paraphrasing makes the context easier to understand by removing distractor terms; we leave this for future investigation.

Original Paragraph: . . . Veteran receiver Demaryius Thomas led the team with 105 receptions for 1,304 yards and six touchdowns, while Emmanuel Sanders caught 76 passes for 1,135 yards and six scores, while adding another 106 yards returning punts.
Strategic Paraphrase: . . . The veteran earman Demaryius Thomas was leading a team of 1,304 yards and six touchdowns, while Emmanuel Sanders caught 76 passes for 1,135 yards and six scores while he added another 106 yards of punts back.
Question: Who led the Broncos with 105 receptions?
Answer: Demaryius Thomas (correct) → Emmanuel Sanders (incorrect)

Table 4: Example of strategic paraphrasing: red indicates the important words, which were used as negative constraints in the paraphrasing; blue indicates changed words in the paragraph.

4.3 Model Confidence

In a black-box setting, model confidence is one of the only indications of the model's inner workings. The models we employed do not provide a single confidence value; AllenNLP gives a probability that each word in the context is the start and end of the span, while the BERT models only give the probability for the start and end words. We compute the model's confidence using the normalized entropy of the distribution across the context words, where n is the number of context words, and take the mean for both the start and end word: 1 − (H_n(s) + H_n(e))/2, where s and e are the probability distributions for the start and end words, respectively. Low entropy indicates certainty about the start/end location. Since the BERT models only provide probabilities for the start and end words, we approximate the entropy by assuming a flat distribution, dividing the remaining probability equally across all other words in the context.

Comparing confidence across models (Fig. 2a), the BERT model has lower confidence for misclassifications, which is ideal. A model should not be confident about errors. Fig. 2b compares confidence across perturbation type. In the none training setting, character perturbations introduce the most uncertainty compared to word or paraphrase perturbations. This is expected, since character perturbations result in unknown words. In adversarial training, word perturbations lead to the highest number of high-confidence errors. Thus, to convincingly mislead the model to be highly confident about errors, one should use word perturbations.
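The confidence measure above can be computed directly from the start/end distributions. The sketch below implements the normalized-entropy formula, including the flat-distribution approximation used for BERT; the function names and numerical guards are ours.

```python
import numpy as np

def normalized_entropy(p):
    """Entropy of distribution p, normalized to [0, 1] by log(n)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    h = -np.sum(p * np.log(p + 1e-12))
    return h / np.log(len(p))

def span_confidence(start_probs, end_probs):
    """Confidence = 1 - mean normalized entropy of start/end distributions."""
    return 1.0 - 0.5 * (normalized_entropy(start_probs) +
                        normalized_entropy(end_probs))

def bert_confidence(p_start, p_end, n_words):
    """BERT only reports the probability of the chosen start/end word; spread
    the remaining mass uniformly over the other n_words - 1 context words."""
    def flat(p):
        rest = (1.0 - p) / (n_words - 1)
        return [p] + [rest] * (n_words - 1)
    return span_confidence(flat(p_start), flat(p_end))

# A peaked distribution yields high confidence; a flat one approaches 0.
print(span_confidence([0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]))
print(bert_confidence(0.9, 0.85, n_words=120))
```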
5 Robustness Analysis

Here, we do a deeper dive into why models make errors with noisy input. We investigate data characteristics and their association with model errors by utilizing CrossCheck (Arendt et al., 2020), a novel interactive tool designed for neural model evaluation. Unlike several recently developed tools for analyzing NLP model errors (Agarwal et al., 2014; Wu et al., 2019) and understanding ML model outputs (Lee et al., 2019; Poursabzi-Sangdeh et al., 2018; Hohman et al., 2019), CrossCheck is designed to allow rapid prototyping and cross-model comparison to support experimentation.⁸

⁸ To reproduce our findings, we will release the tool and interactive notebooks upon publication.

5.1 The Effect of Question Type, Question and Context Lengths

We examine whether models make more errors on specific types of questions in adversarial training, i.e., some questions could just be easier than others. We first examine question type:⁹ who, what, which, when, where, why, how, and other. The majority of SQuAD questions are what questions, while most TriviaQA questions are other questions, perhaps indicating more complex questions (Fig. 4a). We see that models usually choose answers appropriate for the question type; even if they are incorrect, answers to when questions will be dates or time word spans, and answers to how many questions will be numbers. Fig. 4a presents key findings on differences in model misclassifications between the two datasets given specific question types. On the SQuAD dataset, the model finds certain question types, e.g. when and how, easiest to answer regardless of the perturbation type. Responses to these questions, which generally expect numeric answers, are not greatly affected by perturbations. For TriviaQA, in general we observe more errors across question types compared to SQuAD, i.e. more errors in what, which and who questions.

⁹ Computed as the first word of the question. Many how questions are how many or how much, rather than how in the "in what manner" sense.
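For reference, a small sketch of the question-type bucketing from footnote 9 and a per-type error-rate tabulation; the results table and its column names are hypothetical stand-ins for the per-example predictions.

```python
import pandas as pd

def question_type(question: str) -> str:
    """Question type, computed as the first word of the question (footnote 9);
    anything outside the listed wh-words is bucketed as "other"."""
    first = question.strip().split()[0].lower()
    return first if first in {"who", "what", "which", "when",
                              "where", "why", "how"} else "other"

# Hypothetical per-example results: one row per (question, model answer).
results = pd.DataFrame({
    "question": ["When was the Nobel Prize awarded?",
                 "What was Maria Curie the first female recipient of?",
                 "In what year did the war end?"],
    "is_error": [False, False, True],
})
results["q_type"] = results["question"].map(question_type)
print(results.groupby("q_type")["is_error"].mean())  # error rate per type
```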
Figure 3: (Left) shows the distribution of answer length for SQuAD and TriviaQA. (Center) and (Right) show the ratio of correct to incorrect answers by answer length in words (log-scale y-axis) for SQuAD and TriviaQA, respectively. BERT models were trained on clean data and tested on perturbed data.

Figure 4: Contrasting MC model errors by question type, and question and context length across datasets. Panels: (a) Question type distribution; (b) Question length normalized density; (c) Context length normalized density; (d) Error distribution in BERT models across different perturbations and question types.

Regarding question length, SQuAD and TriviaQA have similar distributions (Fig. 4b). Both datasets have a mode question length of around 10 words; TriviaQA has a slightly longer tail in the distribution. We did not find question length to impact the error rate. Regarding context length, SQuAD and TriviaQA have vastly differing context length distributions (Fig. 4c), partly due to how the two datasets were constructed (see Sec. 3.4 for details). For both datasets, the error distribution mirrors the context length distribution, and we did not find any relation between model errors and context length.

5.2 The Effect of Answer Length

Our analysis shows that the length of the model's answer is a strong predictor of model error in the adversarial setting: the longer the answer, the more likely it is to be incorrect. Fig. 3 plots the proportion of correct to incorrect answers. We notice a downward trend which is mostly consistent across experimental settings. For both SQuAD and TriviaQA, the models favored shorter answers, which mirrors the data distribution.

5.3 The Effect of Complexity: Annotator Agreement and Reading Level

Here, we examine the effect of task complexity on model performance under adversarial training, using inter-annotator agreement as a proxy for question complexity and paragraph readability as a proxy for context complexity.

Inter-annotator agreement represents a question's complexity: low agreement indicates that annotators did not come to a consensus on the correct answer; thus the question may be difficult to answer. We examine SQuAD, whose questions have one to six annotated answers. In Fig. 5, we present inter-annotator agreement (human confidence) plotted against model confidence over the four training perturbation amounts, looking only at the incorrect predictions. The setting is SQuAD BERT with character perturbation.
We observe that the models are generally confident even when the humans are not, which is noticeable across all perturbation amounts. However, we see interesting differences in model confidence in adversarial training: models trained in the none and half settings have confidence ranging between 0 and 1, compared to the models trained in the full and both settings with confidence above 0.8, indicating that training with more perturbed data leads to more confident models.

Figure 5: The effect of task complexity on model behavior measured as a joint distribution of errors from the BERT model on SQuAD using varied amounts of char perturbations. Panels: (a) None perturbed; (b) Half perturbed; (c) Full perturbed; (d) Both. Axes: model confidence vs. human confidence.

To evaluate the effect of context complexity, we use the Flesch-Kincaid reading level (Kincaid et al., 1975) to measure readability. For questions the model answered incorrectly, the median readability score was slightly higher than the median score for correct responses (Tab. 5), indicating that context with a higher reading level is harder for the model to understand. TriviaQA contexts have higher reading levels than SQuAD.

Data      Correct  Errors
SQuAD     12.9     13.0
TriviaQA  17.1     17.5

Table 5: Contrasting median readability scores for paragraphs with and without errors across datasets.

6 Predicting Model Errors

Our in-depth analysis reveals many insights on how and why models make mistakes during adversarial training. Using the characteristics we analyzed above, we developed a binary classification model to predict whether the answer would be an error, given the model's answer and attributes of the context paragraph. We one-hot-encode categorical features (training amount, perturbation type, question type) and use other features (question length, context length, answer length, readability) as is. For each setting of embedding and perturbation type on SQuAD, we train an XGBoost model with default settings with 10-fold cross validation (shuffled). We present the model's average F1 scores (Tab. 6) and feature importance as computed by the XGBoost model (Fig. 6). We see that performance (micro F1) is equal to or better than a majority baseline (picking the most common class), indicating that certain features are predictive of errors. Specifically, we find that: for character perturbations, the fact that the training data is clean is a strong predictor of errors; a model trained on clean data is most disrupted by character perturbations; for word and paraphrase perturbations, question types are important predictors of errors.

Embedding  Pert.  Majority  F1 score
ELMo       char   0.58      0.70 ± 0.003
ELMo       word   0.54      0.56 ± 0.004
ELMo       para   0.65      0.65 ± 0.008
BERT       char   0.76      0.77 ± 0.008
BERT       word   0.72      0.73 ± 0.006
BERT       para   0.82      0.82 ± 0.006

Table 6: MC error prediction across datasets, embeddings, and perturbation types.

Figure 6: Feature importance when predicting model errors during adversarial attacks. Rows show BERT (top) and ELMo (bottom) across char, word, and para perturbations; features include the training amount (train_none, train_char) and question types (q_when, q_why, q_who, q_what, q_where, q_how, q_which, q_other).
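A sketch of the error-prediction setup described above, assuming a hypothetical per-prediction table (squad_predictions.csv and its column names are illustrative): one-hot categorical features, numeric features used as is, an XGBoost classifier with default settings, and shuffled 10-fold cross-validation.

```python
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

# Hypothetical per-prediction table; one row per (question, model answer).
df = pd.read_csv("squad_predictions.csv")

categorical = ["train_amount", "perturbation_type", "question_type"]
numeric = ["question_length", "context_length", "answer_length", "readability"]

X = pd.concat([pd.get_dummies(df[categorical]),  # one-hot categorical features
               df[numeric]], axis=1)
y = df["is_error"].astype(int)  # 1 if the model's answer was wrong

clf = XGBClassifier()  # default settings, as in the text
cv = KFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv, scoring="f1_micro").mean())

# Feature importances (cf. Fig. 6) after fitting on the full data.
clf.fit(X, y)
print(sorted(zip(X.columns, clf.feature_importances_),
             key=lambda t: -t[1])[:10])
```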
7 Conclusion and Future Work

Our in-depth analysis of neural model robustness sheds light on how and why MC models make errors in adversarial training, and through our error prediction model, we discovered features of the data, e.g., question types, that are strongly predictive of when a model makes errors during adversarial attacks with noisy inputs.
Our results on evaluating the effect of the data, e.g., question and context length, will not only explain model performance in the context of the data, but will also allow us to build future neural models that are more resilient to adversarial attacks and to advance our understanding of neural model behavior, and its strengths and weaknesses, across a variety of NLU tasks and datasets.

For future work, we see many avenues for extension. We plan to experiment with more aggressive and more natural perturbations, and deeper counterfactual evaluation (Pearl, 2019). While recent research has made great strides in increasing model performance on various NLP tasks, it is still not clear what linguistic patterns these neural models are learning, or whether they are learning language at all (Mudrakarta et al., 2018).

More broadly, as AI becomes more entrenched in our lives, AI models need to be held to higher standards including but not limited to accountability (e.g. Wang et al., 2018, 2019, GENIE¹⁰), fairness (e.g. Saleiro et al., 2018; Bellamy et al., 2018; Bird et al., 2020; Ahn and Lin, 2020¹¹ ¹² ¹³), and transparency (e.g. Lundberg and Lee, 2017; Nori et al., 2019; Hooker et al., 2018; Kokhlikyan et al., 2020; Lundberg et al., 2019; Tenney et al., 2020).

¹⁰ GENIE https://genie.apps.allenai.org/
¹¹ ML Fairness Gym https://github.com/google/ml-fairness-gym
¹² Fairness Indicators https://github.com/tensorflow/fairness-indicators
¹³ Scikit-Fairness https://github.com/koaning/scikit-fairness

Acknowledgments

The research was performed at Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government. The authors would like to thank Maria Glenski for helpful discussions, and the reviewers for their feedback.

References

Apoorv Agarwal, Ankit Agarwal, and Deepak Mittal. 2014. An error analysis tool for natural language processing and applied machine learning. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 1–5.

Yongsu Ahn and Y. Lin. 2020. FairSight: Visual analytics for fairness in decision making. IEEE Transactions on Visualization and Computer Graphics, 26:1086–1095.

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896.

Dustin Arendt, Zhuanyi Huang, Prasha Shrestha, Ellyn Ayton, Maria Glenski, and Svitlana Volkova. 2020. CrossCheck: Rapid, reproducible, and interpretable model evaluation.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations (ICLR).

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

R. Bellamy, K. Dey, M. Hind, Samuel C. Hoffman, S. Houde, K. Kannan, Pranay Lohia, J. Martino, Sameep Mehta, A. Mojsilovic, Seema Nagar, K. Ramamurthy, J. Richards, Diptikalyan Saha, P. Sattigeri, M. Singh, Kush R. Varshney, and Y. Zhang. 2018. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ArXiv, abs/1810.01943.

S. Bird, Miro Dudík, R. Edgar, B. Horn, Roman Lutz, Vanessa Milan, M. Sameki, H. Wallach, and Kathleen Walker. 2020. Fairlearn: A toolkit for assessing and improving fairness in AI.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186.

G. W. Ding, Luyu Wang, and Xiaomeng Jin. 2019. AdverTorch v0.1: An adversarial robustness toolbox based on PyTorch. ArXiv, abs/1902.07623.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Steffen Eger, Gözde Gül Şahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, and Iryna Gurevych. 2019. Text processing like humans do: Visually attacking and shielding NLP systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1634–1647.
Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728.

Anthony Y. Fu, Xiaotie Deng, Liu Wenyin, and Greg Little. 2006a. The methodology and an application to fight against unicode attacks. In Proceedings of the second symposium on Usable privacy and security, pages 91–101. ACM.

Anthony Y. Fu, Wan Zhang, Xiaotie Deng, and Liu Wenyin. 2006b. Safeguard against unicode attacks: generation and applications of uc-simlist. In WWW, pages 917–918.

Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

Karan Goel, Nazneen Rajani, J. Vig, Samson Tan, J. Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and C. Ré. 2021. Robustness Gym: Unifying the NLP evaluation landscape. ArXiv, abs/2101.04840.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57.

Dou Goodman, Xin Hao, Yang Wang, Yuesheng Wu, Junfeng Xiong, and H. Zhang. 2020. Advbox: A toolbox to generate adversarial examples that fool neural networks. ArXiv, abs/2001.05574.

Georg Heigold, Stalin Varanasi, Günter Neumann, and Josef van Genabith. 2018. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? In Proceedings of the 13th Conference of the Association for Machine Translation, pages 68–80.

Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, and Steven M. Drucker. 2019. Gamut: A design probe to understand how data scientists understand machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, page 579. ACM.

Sara Hooker, D. Erhan, P. Kindermans, and Been Kim. 2018. Evaluating feature importance estimates. ArXiv, abs/1806.10758.

Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, and Cho-Jui Hsieh. 2019. On the robustness of self-attentive models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1520–1529.

J. Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van Durme. 2019. Improved lexically constrained decoding for translation and monolingual rewriting. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 839–850.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

J. Peter Kincaid, Robert P. Fishburne Jr, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel.

Narine Kokhlikyan, Vivek Miglani, M. Martín, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, Natalia Kliushkina, C. Araya, Siqi Yan, and Orion Reblitz-Richardson. 2020. Captum: A unified and generic model interpretability library for PyTorch. ArXiv, abs/2009.07896.

Gyeongbok Lee, Sungdong Kim, and Seung-won Hwang. 2019. QADiver: Interactive framework for diagnosing QA models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9861–9862.

Jinfeng Li, Shouling Ji, Tianyu Du, Bao Qin Li, and Ting Wang. 2018. TextBugger: Generating adversarial text against real-world applications. Network and Distributed System Security Symposium.
Zachary C. Lipton. 2018. The mythos of model interpretability. ACM Queue, 16(3):30.

Changwei Liu and Sid Stamm. 2007. Fighting unicode-obfuscated spam. In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, pages 45–59. ACM.

Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In NIPS.

Scott M. Lundberg, G. Erion, Hugh Chen, A. DeGrave, J. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and Su-In Lee. 2019. Explainable AI for trees: From local explanations to global understanding. ArXiv, abs/1905.04610.

Paul Michel, Xian Li, Graham Neubig, and J. Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. In NAACL-HLT.

John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, D. Jin, and Yanjun Qi. 2020. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In EMNLP.

Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906.

Maria-Irina Nicolae, M. Sinn, Minh-Ngoc Tran, Beat Buesser, Ambrish Rawat, M. Wistuba, Valentina Zantedeschi, Nathalie Baracaldo, B. Chen, Heiko Ludwig, Ian Molloy, and Ben Edwards. 2018. Adversarial Robustness Toolbox v1.0.0. arXiv: Learning.

Tong Niu and Mohit Bansal. 2018. Adversarial over-sensitivity and over-stability strategies for dialogue models. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 486–496. Association for Computational Linguistics.

H. Nori, S. Jenkins, P. Koch, and R. Caruana. 2019. InterpretML: A unified framework for machine learning interpretability. ArXiv, abs/1909.09223.

Judea Pearl. 2019. The seven tools of causal inference, with reflections on machine learning. Commun. ACM, 62(3):54–60.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pages 2227–2237.

Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Jonas Rauber, R. S. Zimmermann, M. Bethge, and W. Brendel. 2020. Foolbox Native: Fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX. J. Open Source Softw., 5:2607.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865. Association for Computational Linguistics.

Pedro Saleiro, Benedict Kuester, Abby Stevens, Ari Anisfeld, Loren Hinkson, J. London, and R. Ghani. 2018. Aequitas: A bias and fairness audit toolkit. ArXiv, abs/1811.05577.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations (ICLR).

Chenglei Si, Ziqing Yang, Yiming Cui, Wentao Ma, Ting Liu, and Shijin Wang. 2020. Benchmarking robustness of machine reading comprehension models. arXiv preprint arXiv:2004.14004.

Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, E. Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, and A. Yuan. 2020. The Language Interpretability Tool: Extensible, interactive visualizations and analysis for NLP models. In EMNLP.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian J. Goodfellow, Dan Boneh, and Patrick D. McDaniel. 2018. Ensemble adversarial training: Attacks and defenses. In 6th International Conference on Learning Representations (ICLR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP.

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 747–763.

Guoyang Zeng, Fanchao Qi, Qianrui Zhou, T. Zhang, Bairu Hou, Yuan Zang, Zhiyuan Liu, and Maosong Sun. 2020. OpenAttack: An open-source textual adversarial attack toolkit. ArXiv, abs/2009.09191.

Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F. Alhazmi. 2019. Generating textual adversarial examples for deep learning models: A survey. CoRR, abs/1901.06796.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In 6th International Conference on Learning Representations (ICLR).