GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang (1), Amanpreet Singh (1), Julian Michael (2), Felix Hill (3), Omer Levy (2), and Samuel R. Bowman (1)
(1) New York University, New York, NY
(2) Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
(3) DeepMind, London, UK
{alexwang,amanpreet,bowman}@nyu.edu, {julianjm,omerlevy}@cs.washington.edu, felixhill@google.com

Abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

1 Introduction

Human ability to understand language is general, flexible, and robust. We can effectively interpret and respond to utterances of diverse form and function in many different contexts. In contrast, most natural language understanding (NLU) models above the word level are designed for one particular task and struggle with out-of-domain data.

If we aspire to develop models whose understanding extends beyond the detection of superficial correspondences between inputs and outputs, then it is critical to understand how a single model can learn to execute a range of different linguistic tasks on language from different domains.

To motivate research in this direction, we present the General Language Understanding Evaluation benchmark (GLUE, gluebenchmark.com), an online tool for evaluating the performance of a single NLU model across multiple tasks, including question answering, sentiment analysis, and textual entailment, built largely on established existing datasets. GLUE does not place any constraints on model architecture beyond the ability to process single-sentence and paired-sentence inputs and to make corresponding predictions. For some GLUE tasks, directly pertinent training data is plentiful, but for others, training data is limited or fails to match the genre of the test set. GLUE therefore favors models that can learn to represent linguistic and semantic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Though GLUE is designed to prefer models with general and robust language understanding, we cannot entirely rule out the existence of simple superficial strategies for solving any of the included tasks. We therefore also provide a set of newly constructed evaluation data for the analysis of model performance. Unlike many test sets employed in machine learning research that reflect the frequency distribution of naturally occurring data or annotations, this dataset is designed to highlight points of difficulty that are relevant to model development and training, such as the incorporation of world knowledge, or the handling of lexical entailments and negation. Visitors to the online platform have access to a breakdown of how well each model handles these phenomena alongside its scores on the primary GLUE test sets.
To better understand the challenges posed by the GLUE benchmark, we conduct experiments with simple baselines and state-of-the-art models for sentence representation. We find that naïve multi-task learning with standard models over the available task training data yields overall performance no better than can be achieved by training a separate model for each task, indicating the need for improved general NLU systems. However, for certain tasks with less training data, we find that multi-task learning approaches do improve over a single-task model. This indicates that there is potentially interesting space for meaningful knowledge sharing across NLU tasks. Analysis with our diagnostic dataset reveals that current models deal well with strong lexical signals and struggle with logic, and that there are interesting patterns in the generalization behavior of our models that do not correlate perfectly with performance on the main benchmark.

In summary, we offer the following contributions:

- A suite of nine sentence- or sentence-pair NLU tasks, built on established annotated datasets where possible, and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty.

- An online evaluation platform and leaderboard, based primarily on privately-held test data. The platform is model-agnostic; any model or method capable of producing results on all nine tasks can be evaluated.

- A suite of diagnostic evaluation data aimed to give model developers feedback on the types of linguistic phenomena their evaluated systems handle well.

- Results with several major existing sentence representation systems such as Skip-Thought (Kiros et al., 2015), InferSent (Conneau et al., 2017), DisSent (Nie et al., 2017), and GenSen (Subramanian et al., 2018).

Corpus | |Train| | |Dev| | |Test| | Task | Metric | Domain
Single-Sentence Tasks
CoLA | 10k | 1k | 1.1k | acceptability | Matthews | linguistics literature
SST-2 | 67k | 872 | 1.8k | sentiment | acc. | movie reviews
Similarity and Paraphrase Tasks
MRPC | 4k | N/A | 1.7k | paraphrase | acc./F1 | news
STS-B | 7k | 1.5k | 1.4k | sentence similarity | Pearson/Spearman | misc.
QQP | 400k | N/A | 391k | paraphrase | acc./F1 | social QA questions
Inference Tasks
MNLI | 393k | 20k | 20k | NLI | acc. (match/mismatch) | misc.
QNLI | 108k | 11k | 11k | QA/NLI | acc. | Wikipedia
RTE | 2.7k | N/A | 3k | NLI | acc. | misc.
WNLI | 706 | N/A | 146 | coreference/NLI | acc. | fiction books

Table 1: Task descriptions and statistics. All tasks are single sentence or sentence pair classification, except STS-B, which is a regression task. MNLI has three classes while all other classification tasks are binary.

2 Related Work

Our work builds on various strands of NLP research that aspired to develop better general understanding in models.

Multi-task Learning in NLP. Multi-task learning has a rich history in NLP as an approach for learning more general language understanding systems. Collobert et al. (2011), one of the earliest works exploring deep learning for NLP, used a multi-task model to jointly learn POS tagging, chunking, named entity recognition, and semantic role labeling. More recently, there has been work using labels from core NLP tasks to supervise training of lower levels of deep neural networks (Søgaard and Goldberg, 2016; Hashimoto et al., 2016) and automatically learning cross-task sharing mechanisms for multi-task learning (Ruder et al., 2017).

Evaluating Sentence Representations. Beyond multi-task learning, much of the work so far towards developing general NLU systems has focused on the development of sentence-to-vector encoder functions (Le and Mikolov, 2014; Kiros et al., 2015, i.a.), including approaches leveraging unlabeled data (Hill et al., 2016; Peters et al., 2018), labeled data (Conneau and Kiela, 2018; McCann et al., 2017), and combinations of these (Collobert et al., 2011; Subramanian et al., 2018).
In this line of work, a standard evaluation practice has emerged, and has recently been codified as SentEval (Conneau et al., 2017; Conneau and Kiela, 2018). Like GLUE, SentEval also relies on a variety of existing classification tasks that involve either one or two sentences as inputs, but only evaluates sentence-to-vector encoders. Specifically, SentEval takes a pre-trained sentence encoder as input and feeds its output encodings into lightweight task-specific models (typically linear classifiers) that are trained and tested on task-specific data.

SentEval is well-suited for evaluating general-purpose sentence representations in isolation. However, cross-sentence contextualization and alignment, such as that yielded by methods like soft attention, is instrumental in achieving state-of-the-art performance on tasks such as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), question answering (Seo et al., 2016; Xiong et al., 2016), and natural language inference (in the case of SNLI (Bowman et al., 2015), the best-performing sentence encoding model on the leaderboard as of April 2018 achieves 86.3% accuracy, while the best-performing attention-based model achieves 89.3%). GLUE is designed to facilitate the development of these methods: it is model-agnostic, allowing for any kind of representation or contextualization, including models that use no systematic vector or symbolic representations for sentences whatsoever.

GLUE also diverges from SentEval in the selection of evaluation tasks that are included in the suite. Many of the SentEval tasks are closely related to sentiment analysis, with the inclusion of MR (Pang and Lee, 2005), SST (Socher et al., 2013), CR (Hu and Liu, 2004), and SUBJ (Pang and Lee, 2004). Other tasks are so close to being solved that evaluation on them is less informative, such as MPQA (Wiebe et al., 2005) and TREC (Voorhees et al., 1999). In GLUE, we have attempted to construct a benchmark that is diverse, spans multiple domains, and is systematically difficult.

Evaluation Platforms and Competitions in NLP. Our use of an online evaluation platform with private test labels is inspired by a long tradition of shared tasks at the SemEval (Agirre et al., 2007) and CoNLL (Ellison, 1997) conferences, as well as similar leaderboards on Kaggle and CodaLab. These frameworks tend to focus on a single task, while GLUE emphasizes the need to perform well on multiple different tasks using shared model components.

Weston et al. (2015) similarly proposed a hierarchy of tasks towards building question answering and reasoning models, although involving synthetic language, whereas almost all of our data is human-generated. The recently proposed dialogue systems framework ParlAI (Miller et al., 2017) also combines many language understanding tasks into a single framework, although this aggregation is very flexible, and the framework includes no standardized evaluation suite for system performance.

3 Tasks

We aim for GLUE to spur development of generalizable NLU systems. As such, we expect that doing well on the benchmark should require a model to share substantial knowledge (e.g. in the form of trained parameters) across all tasks, while keeping the task-specific components as minimal as possible. Though it is possible to train a single model for each task and evaluate the resulting set of models on this benchmark, we expect that for some data-scarce tasks in the benchmark, knowledge sharing between tasks will be necessary for competitive performance. In such a case, a more unified approach should prevail.

The GLUE benchmark consists of nine English sentence understanding tasks selected to cover a broad spectrum of task type, domain, amount of data, and difficulty. We describe them here and provide a summary in Table 1. Unless otherwise mentioned, tasks are evaluated on accuracy and have a balanced class split.

The benchmark follows the same basic evaluation model as SemEval and Kaggle. To evaluate a system on the benchmark, one must configure that system to perform all of the tasks, run the system on the provided test data, and upload the results to the website for scoring. The site will then show the user (and the public, if desired) an overall score for the main suite of tasks, and per-task scores on both the main tasks and the diagnostic dataset.
3.1 Single-Sentence Tasks

CoLA. The Corpus of Linguistic Acceptability (available at nyu-mll.github.io/CoLA) consists of examples of expert English sentence acceptability judgments drawn from 22 books and journal articles on linguistic theory. Each example is a single string of English words annotated with whether it is a grammatically possible sentence of English. Superficially, this data is similar to our analysis data in that it is constructed to demonstrate potentially subtle and difficult contrasts. However, judgments of this particular kind are the primary form of evidence in linguistic theory (Schütze, 1996), and were a machine learning system able to predict them reliably, it would offer potentially substantial evidence on questions of language learnability and innate bias. As in MNLI, the corpus contains development and test examples drawn from in-domain data (the same books and articles used in the training set) and out-of-domain data, though we report numbers only on the unified development and test sets without differentiating these. We follow the original work and report the Matthews correlation coefficient (Matthews, 1975), which evaluates classifiers on unbalanced binary classification tasks with a range from -1 to 1, with 0 being the performance at random chance. We use the standard test set, for which we obtained labels privately from the authors.

SST-2. The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences extracted from movie reviews and human annotations of their sentiment. Given a sentence, the task is to determine the sentiment of the sentence. We use the two-way (positive/negative) class split.

3.2 Similarity and Paraphrase Tasks

MRPC. The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in the pair are semantically equivalent. Because the classes are imbalanced (68% positive, 32% negative), we follow common practice and report both accuracy and F1 score.

QQP. The Quora Question Pairs dataset (data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a collection of question pairs from the community question-answering website Quora. Given two questions, the task is to determine whether they are semantically equivalent. As in MRPC, the class distribution in QQP is unbalanced (37% positive, 63% negative), so we report both accuracy and F1 score. We use the standard test set, for which we obtained labels privately from the authors.

STS-B. The Semantic Textual Similarity Benchmark (Cer et al., 2017) is based on the datasets for a series of annual challenges for the task of determining the similarity of a pair of sentences, drawn from various sources, on a continuous scale from 1 to 5. We use the STS-Benchmark release, which draws from news headlines, video and image captions, and natural language inference data, scored by human annotators. We follow common practice and evaluate using Pearson and Spearman correlation coefficients between predicted and ground-truth scores.
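These per-task metrics are all standard. As a point of reference, the following minimal sketch (our own illustration using scikit-learn and SciPy, not code distributed with the benchmark) computes the metrics named above for CoLA, MRPC/QQP, and STS-B from arrays of gold labels and system predictions:

```python
# Illustrative only: per-task GLUE-style metrics via scikit-learn / SciPy.
# The arrays below are stand-ins for a system's predictions and the gold labels.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def cola_metric(gold, pred):
    # Matthews correlation coefficient: range [-1, 1], 0 = chance on unbalanced data.
    return matthews_corrcoef(gold, pred)

def paraphrase_metrics(gold, pred):
    # MRPC and QQP report both accuracy and F1.
    return accuracy_score(gold, pred), f1_score(gold, pred)

def stsb_metrics(gold_scores, pred_scores):
    # STS-B reports Pearson and Spearman correlation with the gold similarity scores.
    return pearsonr(pred_scores, gold_scores)[0], spearmanr(pred_scores, gold_scores)[0]

if __name__ == "__main__":
    gold = np.array([1, 0, 1, 1, 0])
    pred = np.array([1, 0, 0, 1, 0])
    print(cola_metric(gold, pred))
    print(paraphrase_metrics(gold, pred))
    print(stsb_metrics(np.array([4.2, 1.0, 3.5]), np.array([3.9, 1.5, 3.0])))
```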
3.3 Inference Tasks

MNLI. The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a crowd-sourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither (neutral). The premise sentences are gathered from a diverse set of sources, including transcribed speech, popular fiction, and government reports. The test set is broken into two sections: matched, which is drawn from the same sources as the training set, and mismatched, which uses different sources and thus requires domain transfer. We use the standard test set, for which we obtained labels privately from the authors, and evaluate on both sections.

Though not part of the benchmark, we use and recommend the Stanford Natural Language Inference corpus (Bowman et al. 2015; SNLI) as auxiliary training data. It is distributed in the same format for the same task, and has been used productively in cotraining for MNLI (Chen et al., 2017; Gong et al., 2018).

QNLI. The Stanford Question Answering Dataset (Rajpurkar et al. 2016; SQuAD) is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). We automatically convert the original SQuAD dataset into a sentence pair classification task by forming a pair between a question and each sentence in the corresponding context. The task is to determine whether the context sentence contains the answer to the question. We filter out pairs where there is low lexical overlap between the question and the context sentence, measured using a CBoW representation with pre-trained GloVe embeddings. Specifically, we select all pairs in which the most similar sentence to the question was not the answer sentence, as well as an equal number of cases in which the correct sentence was the most similar to the question, but another distracting sentence was a close second. This approach to converting pre-existing datasets into NLI format is closely related to recent work by White et al. (2017) as well as to the original motivation for textual entailment presented by Dagan et al. (2006). Both argue that many NLP tasks can be productively reduced to textual entailment. We call this processed dataset QNLI (Question-answering NLI).
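To illustrate the recasting described above, the sketch below pairs a question with each sentence of its context paragraph and labels the pair by whether the sentence contains the answer string. It assumes a simplified data format and omits the lexical-overlap filtering step, so it is a rough illustration rather than the exact procedure used to build QNLI:

```python
# Illustrative sketch of recasting a SQuAD-style example into sentence-pair
# classification; not the exact QNLI construction script.
from dataclasses import dataclass
from typing import List

@dataclass
class QAPair:
    question: str
    sentence: str
    label: str  # "entailment" if the sentence contains the answer, else "not_entailment"

def recast_paragraph(question: str, answer: str, paragraph_sentences: List[str]) -> List[QAPair]:
    pairs = []
    for sent in paragraph_sentences:
        label = "entailment" if answer in sent else "not_entailment"
        pairs.append(QAPair(question, sent, label))
    return pairs

if __name__ == "__main__":
    sents = [
        "The Amazon rainforest covers much of the Amazon basin of South America.",
        "The basin is drained by the Amazon River.",
    ]
    for p in recast_paragraph("What river drains the Amazon basin?", "Amazon River", sents):
        print(p.label, "|", p.sentence)
```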
RTE. The Recognizing Textual Entailment (RTE) datasets come from a series of annual challenges for the task of textual entailment, also known as NLI. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009); RTE4 is not publicly available, while RTE6 and RTE7 do not fit the standard NLI task. Each example in these datasets consists of a premise sentence and a hypothesis sentence, gathered from various online news sources. The task is to predict whether the premise entails the hypothesis. We convert all the data to a two-class split (entailment or not entailment, where we collapse neutral and contradiction into not entailment for challenges with three classes) for consistency.

WNLI. The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension challenge where each example consists of a sentence containing a pronoun and a list of its possible referents in the sentence. The task is to determine the correct referent. The data is designed to foil simple statistical methods; it is constructed so that each example hinges on contextual information provided by a single word or phrase in the sentence, which can be switched out to change the answer. We use a small evaluation set consisting of new examples derived from fiction books (similar examples are available at cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) that was shared privately by the authors of the corpus. To convert the problem into a sentence pair classification task, we construct two sentence pairs per example by replacing the ambiguous pronoun with each possible referent. The task (a slight relaxation of the original Winograd Schema Challenge) is to predict whether the sentence with the pronoun substituted is entailed by the original sentence. While the included training set is balanced between the two classes (entailment and not entailment), the test set is imbalanced between them (35% entailment, 65% not entailment). We call the resulting sentence pair version of the dataset WNLI (Winograd NLI).

3.4 Scoring

In addition to each task's metric or metrics, our benchmark reports a macro-average of the metrics over all tasks (see Table 5) to determine a system's position on the leaderboard. For tasks with multiple metrics (e.g., accuracy and F1), we use the unweighted average of the metrics as the score for the task.
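A minimal sketch of this scoring rule (ours, not the official leaderboard code) averages the metrics within each task first and then averages the resulting task scores with equal weight; the scores below are illustrative values only:

```python
# Sketch of the macro-average described above: metrics within a task are
# averaged first, then task scores are averaged with equal weight.
from statistics import mean

def glue_macro_average(per_task_metrics: dict) -> float:
    # per_task_metrics maps a task name to a list of its metric values.
    return mean(mean(metrics) for metrics in per_task_metrics.values())

if __name__ == "__main__":
    illustrative_scores = {
        "CoLA": [22.2],         # Matthews correlation (scaled)
        "MRPC": [70.3, 79.6],   # accuracy, F1
        "STS-B": [72.3, 70.9],  # Pearson, Spearman
        "RTE": [48.3],          # accuracy
    }
    print(round(glue_macro_average(illustrative_scores), 1))
```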
3.5 Data and Bias

The tasks listed above are meant to represent a diverse sample of those studied in contemporary research on applied sentence-level language understanding, but we do not endorse the use of the task training sets for any specific non-research application. They do not cover every dialect of English one may wish to handle, nor languages outside of English, and as all of them contain text or annotations that were collected in uncontrolled settings, they contain evidence of stereotypes and biases that one may not wish their system to learn (Rudinger et al., 2017).

4 Diagnostic Dataset

Drawing inspiration from the FraCaS test suite (Cooper et al., 1996) and the recent Build-It-Break-It competition (Ettinger et al., 2017), we include a small, manually-curated test set to allow for fine-grained analysis of system performance on a broad range of linguistic phenomena. While the main benchmarks mostly reflect an application-driven distribution of examples (e.g. the question answering dataset will contain questions that people are likely to ask), our diagnostic dataset is collected to highlight a pre-defined set of modeling-relevant phenomena.
Specifically, we construct a set of NLI examples with fine-grained annotations of the linguistic phenomena they capture. The NLI task is well suited to this kind of analysis, as it is constructed to make it straightforward to evaluate the full set of skills involved in (ungrounded) sentence understanding, from the resolution of syntactic ambiguity to pragmatic reasoning with world knowledge. We ensure that the examples in the diagnostic dataset have a reasonable distribution over word types and topics by building on naturally-occurring sentences from several domains. Table 2 shows examples from the dataset.

Categories | Sentence 1 | Sentence 2 | Fwd | Bwd
PAS | Cape sparrows eat seeds, along with soft plant parts and insects. | Seeds, along with soft plant parts and insects, are eaten by cape sparrows. | E | E
PAS | Cape sparrows eat seeds, along with soft plant parts and insects. | Cape sparrows are eaten by seeds, along with soft plant parts and insects. | N | N
LS, PAS | Tulsi Gabbard disagrees with Bernie Sanders on what is the best way to deal with Bashar al-Assad. | Tulsi Gabbard and Bernie Sanders disagree on what is the best way to deal with Bashar al-Assad. | E | E
LS, K | Musk decided to offer up his personal Tesla roadster. | Musk decided to offer up his personal car. | E | N
K | The announcement of Tillerson's departure sent shock waves across the globe. | People across the globe were not expecting Tillerson's departure. | E | N
K | The announcement of Tillerson's departure sent shock waves across the globe. | People across the globe were prepared for Tillerson's departure. | C | C
L | I have never seen a hummingbird not flying. | I have never seen a hummingbird. | N | E
PAS | Understanding a long document requires tracking how entities are introduced and evolve over time. | Understanding a long document requires evolving over time. | N | N
PAS | Understanding a long document requires tracking how entities are introduced and evolve over time. | Understanding a long document requires understanding how entities are introduced. | E | N
LS | That perspective makes it look gigantic. | That perspective makes it look minuscule. | C | C

Table 2: Examples from the analysis set. Sentence pairs are labeled according to four coarse categories: Lexical Semantics (LS), Predicate-Argument Structure (PAS), Logic (L), and Knowledge and Common Sense (K). Within each category, each example is also tagged with fine-grained labels (see Tables 3 and 4). See gluebenchmark.com for details on the set of labels, their meaning, and how we do the categorization.

Linguistic Phenomena. We tag every example with fine- and coarse-grained categories of the linguistic phenomena they involve (categories shown in Table 3). While each example was collected with a single phenomenon in mind, it is often the case that it falls under other categories as well. We therefore code the examples under a non-exclusive tagging scheme, in which a single example can participate in many categories at once. For example, to know that "I like some dogs" entails "I like some animals", it is not sufficient to know that dog lexically entails animal; one must also know that dog/animal appears in an upward monotone context in the sentence. This example would be classified under both Lexical Semantics > Lexical Entailment and Logic > Upward Monotone.

Domains. We construct sentences based on existing text from four domains: News (drawn from articles linked on Google News, news.google.com), Reddit (from threads linked on the Front Page, reddit.com), Wikipedia (from Featured Articles, en.wikipedia.org/wiki/Wikipedia:Featured_articles), and academic papers drawn from the proceedings of recent ACL conferences. We include 100 sentence pairs constructed from each source, as well as 150 artificially-constructed sentence pairs.
Coarse-Grained Category | Fine-Grained Categories
Lexical Semantics | Lexical Entailment, Morphological Negation, Factivity, Symmetry/Collectivity, Redundancy, Named Entities, Quantifiers
Predicate-Argument Structure | Core Arguments, Prepositional Phrases, Ellipsis/Implicits, Anaphora/Coreference, Active/Passive, Nominalization, Genitives/Partitives, Datives, Relative Clauses, Coordination Scope, Intersectivity, Restrictivity
Logic | Negation, Double Negation, Intervals/Numbers, Conjunction, Disjunction, Conditionals, Universal, Existential, Temporal, Upward Monotone, Downward Monotone, Non-Monotone
Knowledge | Common Sense, World Knowledge

Table 3: The types of linguistic phenomena annotated in the diagnostic dataset, organized under four major categories.

Annotation Process. We begin with an initial set of fine-grained semantic phenomena, using the categories in the FraCaS test suite (Cooper et al., 1996) as a starting point, while also generalizing to include lexical semantics, common sense, and world knowledge. We gather examples by searching through text in each domain and locating example sentences that can be easily modified to involve one of the chosen phenomena (or that involve one already). We then modify the sentence further to produce the other sentence in an NLI pair. In many cases, we make these modifications small, in order to encourage high lexical and structural overlap among the sentence pairs, which may make the examples more difficult for models that rely on lexical overlap as an indicator for entailment. We then label the NLI relations between the sentences in both directions (considering each sentence alternately as the premise), producing two labeled examples for each pair. Where possible, we produce several pairs with different labels for a single sentence, to have minimal sets of sentence pairs that are lexically and structurally very similar but correspond to different entailment relationships. After finalizing the categories, we gathered a minimum number of examples in each fine-grained category from each domain to ensure a baseline level of diversity. In total, we gather 550 sentence pairs, for 1100 entailment examples. The labels are 42% entailment, 35% neutral, and 23% contradiction.

Auditing. In light of recent work showing that crowdsourced data often contains artifacts which can be exploited to perform well without solving the intended task (Schwartz et al., 2017; Gururangan et al., 2018), we perform an audit of our manually curated data as a sanity check. We reproduce the methodology of Gururangan et al. (2018), training fastText classifiers (Joulin et al., 2016) to predict entailment labels on SNLI and MultiNLI using only the hypothesis as input. Testing these on the diagnostic data, accuracies are 32.7% and 36.4%, very close to chance, showing that the data does not suffer from artifacts of this specific kind. We also evaluate state-of-the-art NLI models on the diagnostic dataset and find their overall performance to be rather weak, further suggesting that no easily-gameable artifacts present in existing training data are abundant in the diagnostic dataset (see Section 6).

Evaluation. Since the class distribution in the diagnostic set is not uniform (and is even less so within each category), we propose using R3, a three-class generalization of the Matthews correlation coefficient, as the evaluation metric. This coefficient was introduced by Gorodkin (2004) as RK, a generalization of the Pearson correlation that works for K dimensions by averaging the square error from the mean value in each dimension, i.e., calculating the full covariance between the input and output. In the discrete case, it generalizes the Matthews correlation, where a value of 1 means perfect correlation and 0 means random chance.
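For reference, Gorodkin's RK with K = 3 coincides with the multi-class Matthews correlation implemented in scikit-learn, so R3 can be computed directly from gold and predicted NLI labels; the snippet below is our own illustration, not the benchmark's scoring code:

```python
# Sketch: R3 is the K-class Matthews / Gorodkin correlation with K = 3 NLI labels.
# scikit-learn's matthews_corrcoef handles the multi-class case directly.
from sklearn.metrics import matthews_corrcoef

gold = ["entailment", "neutral", "contradiction", "entailment", "neutral"]
pred = ["entailment", "entailment", "contradiction", "entailment", "contradiction"]

# Scale: 1 = perfect correlation, 0 = chance-level agreement.
print(matthews_corrcoef(gold, pred))
```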
Intended Use. Because these analysis examples are hand-picked to address certain phenomena, we expect that they will not be representative of the distribution of language as a whole, even in the targeted domains. However, NLI is a task with no natural input distribution. We deliberately select sentences that we hope will be able to provide insight into what models are doing, what phenomena they catch on to, and where they are limited. This means that the raw performance numbers on the analysis set should be taken with a grain of salt. The set is provided not as a benchmark, but as an analysis tool to paint in broad strokes the kinds of phenomena a model may or may not capture, and to provide a set of examples that can serve for error analysis, qualitative model comparison, and development of adversarial examples that expose a model's weaknesses.

Tags | Premise | Hypothesis | Fwd | Bwd
UQuant | Our deepest sympathies are with all those affected by this accident. | Our deepest sympathies are with a victim who was affected by this accident. | E | N
MNeg | We built our society on unclean energy. | We built our society on clean energy. | C | C
MNeg, 2Neg | The market is about to get harder, but not impossible to navigate. | The market is about to get harder, but possible to navigate. | E | E
2Neg | I have never seen a hummingbird not flying. | I have always seen hummingbirds flying. | E | E
2Neg, Coref | It's not the case that there is no rabbi at this wedding; he is right there standing behind that tree. | A rabbi is at this wedding, standing right there standing behind that tree. | E | E

Table 4: Examples from the diagnostic evaluation. Tags are Universal Quantification (UQuant), Morphological Negation (MNeg), Double Negation (2Neg), and Anaphora/Coreference (Coref). Other tags on these examples are omitted for brevity.

5 Baselines

As baselines, we provide performance numbers for a relatively simple multi-task learning model trained from scratch on the benchmark tasks, as well as several more sophisticated variants that utilize recent developments in transfer learning. We also evaluate a sample of competitive existing sentence representation models, where we only train task-specific classifiers on top of the representations they produce.

5.1 Multi-task Architecture

Our simplest baseline is based on sentence-to-vector encoders, and sets aside GLUE's ability to evaluate models with more complex structures. Taking inspiration from Conneau et al. (2017), the model uses a BiLSTM with temporal max-pooling and 300-dimensional GloVe word embeddings (Pennington et al., 2014) trained on 840B-token Common Crawl. For single-sentence tasks, we process the sentence and pass the resulting vector to a classifier. For sentence-pair tasks, we process the sentences independently to produce vectors u and v, and pass [u; v; |u - v|; u * v] to a classifier. We experiment with logistic regression and a multi-layer perceptron with a single hidden layer for the classifiers, leaving the choice as a hyperparameter to tune.

For sentence-pair tasks, we also take advantage of GLUE's indifference to model architecture by incorporating a matrix attention mechanism between the two sentences. By explicitly modeling the interaction between sentences, our model is strictly outside of the sentence-to-vector paradigm. We follow standard practice to contextualize each token with attention. Given two sequences of hidden states u_1, u_2, ..., u_M and v_1, v_2, ..., v_N, the attention mechanism first computes a matrix H where H_ij = u_i · v_j. For each u_i, we get attention weights α_i by taking a softmax over the i-th row of H, and get the corresponding context vector ṽ_i = Σ_j α_ij v_j by taking the attention-weighted sum of the v_j. We pass a second BiLSTM with max pooling over the sequence [u_1; ṽ_1], ..., [u_M; ṽ_M] to produce u'. We process the v_j vectors in a symmetric manner to obtain v'. Finally, we feed [u'; v'; |u' - v'|; u' * v'] into a classifier for each task.
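A short PyTorch sketch of this attention step (our own illustration of the equations above, not the released baseline code; the second BiLSTM with max pooling is omitted) is:

```python
# Sketch of the cross-sentence attention described above (illustrative only).
# u: (M, d) and v: (N, d) are BiLSTM hidden states for the two sentences.
import torch

def attend(u: torch.Tensor, v: torch.Tensor):
    H = u @ v.t()                       # H_ij = u_i . v_j
    alpha = torch.softmax(H, dim=1)     # weights over v for each u_i
    beta = torch.softmax(H.t(), dim=1)  # weights over u for each v_j
    v_tilde = alpha @ v                 # context vectors aligned to u
    u_tilde = beta @ u                  # context vectors aligned to v
    # [u_i; v~_i] and [v_j; u~_j] would then feed the second BiLSTM with max pooling.
    return torch.cat([u, v_tilde], dim=1), torch.cat([v, u_tilde], dim=1)

def pair_features(u_prime: torch.Tensor, v_prime: torch.Tensor) -> torch.Tensor:
    # [u'; v'; |u' - v'|; u' * v'], fed to a task-specific classifier.
    return torch.cat([u_prime, v_prime, (u_prime - v_prime).abs(), u_prime * v_prime], dim=-1)

if __name__ == "__main__":
    u, v = torch.randn(7, 300), torch.randn(9, 300)
    u_aug, v_aug = attend(u, v)
    print(u_aug.shape, v_aug.shape)     # (7, 600), (9, 600)
```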
Incorporating Transfer Learning. We also augment our base non-attentive model with two recently proposed methods for transfer learning in NLP: ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017). Both use pretrained models that produce contextual word embeddings via some transformation of the underlying model's hidden states.

ELMo uses a pair of two-layer neural language models (one forward, one backward) trained on the One Billion Word Benchmark (Chelba et al., 2013). A word's contextual embedding is produced by taking a linear combination of the corresponding hidden states on each layer. We follow the authors' recommendations (github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) and use the ELMo embeddings in place of any other embeddings.
Model | Avg | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI | QNLI | RTE | WNLI
Single-Task Training
BiLSTM +ELMo | 60.3 | 22.2 | 89.3 | 70.3/79.6 | 84.8/64.2 | 72.3/70.9 | 77.1/76.6 (A) | 82.3 (A) | 48.3 (A) | 63
Multi-Task Training
BiLSTM | 55.3 | 0.0 | 84.9 | 72.6/80.9 | 85.5/63.4 | 71.8/70.1 | 69.3/69.1 | 75.6 | 57.6 | 43
BiLSTM +Attn | 56.3 | 0.0 | 83.8 | 74.1/82.1 (A) | 85.1/63.6 (A) | 71.1/69.8 (A) | 72.4/72.2 (A) | 78.5 (A) | 61.1 (A) | 44 (A)
BiLSTM +ELMo | 55.8 | 0.9 | 88.4 | 72.7/82.6 | 79.0/58.9 | 73.3/72.0 | 71.3/71.8 | 75.8 | 56.0 | 46
BiLSTM +CoVe | 56.7 | 1.8 | 85.0 | 73.5/81.4 | 85.2/63.5 | 73.4/72.1 | 70.5/70.5 | 75.6 | 57.6 | 52
Pre-trained Sentence Representation Models
CBoW | 52.2 | 0.0 | 80.1 | 72.6/81.1 | 79.4/51.2 | 61.2/59.0 | 55.7/56.3 | 71.5 | 54.8 | 63
Skip-Thought | 55.4 | 0.0 | 82.1 | 73.4/82.4 | 81.7/56.1 | 71.7/69.8 | 62.8/62.7 | 72.6 | 55.1 | 64
InferSent | 58.1 | 2.8 | 84.8 | 75.7/82.8 | 86.1/62.7 | 75.8/75.7 | 66.0/65.6 | 73.7 | 59.3 | 65
DisSent | 56.4 | 9.9 | 83.9 | 76.7/83.7 | 85.3/62.7 | 66.2/65.1 | 57.7/58.0 | 68.0 | 59.5 | 65
GenSen | 59.2 | 1.4 | 83.6 | 78.2/84.6 | 83.2/59.8 | 79.0/79.4 | 71.2/71.0 | 78.7 | 59.9 | 65

Table 5: Performance on the benchmark tasks for different models; (A) denotes results using attention. For MNLI, we report accuracy on the matched/mismatched test splits. For MRPC and QQP, we report accuracy/F1. For STS-B, we report Pearson/Spearman correlation, scaled to be in [-100, 100]. For CoLA, we report Matthews correlation, scaled to be in [-100, 100]. For all other tasks we report accuracy (%). We compute a macro-average score in the style of SentEval by taking the average across all tasks, first averaging the metrics within each task for tasks with more than one reported metric.

CoVe uses a sequence-to-sequence model with a two-layer BiLSTM encoder trained for English-to-German translation. The CoVe vector C(w_i) of a word is the corresponding hidden state of the top-layer LSTM. As per the original work, we concatenate the CoVe vectors to the GloVe word embeddings.

5.2 Multi-task Training

These four models (BiLSTM, BiLSTM +Attn, BiLSTM +ELMo, BiLSTM +CoVe) are jointly trained on all tasks, with the primary BiLSTM encoder shared between all task-specific classifiers. To perform multi-task training, we randomly pick an ordering on the tasks and train on 10% of a task's training data for each task in that order. We repeat this process 10 times between validation checks, so that we roughly train on all training examples for each task once between checks. We use the previously defined macro-average as the validation metric, where for tasks without predetermined development sets, we reserve 10% of the training data for validation.

We train our models with stochastic gradient descent using batch size 128, and multiply the learning rate by 0.2 whenever validation performance does not improve. We stop training when the learning rate drops below 10^-5 or validation performance does not improve after 5 evaluations. We tune hyperparameters with random search over 30 runs on macro-average development set performance. Our best model is a two-layer BiLSTM that is 1500-dimensional per direction. We evaluate all of our BiLSTM-based models with these settings.
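The task-sampling schedule described above can be sketched as follows; `train_on` and `validate` are hypothetical hooks standing in for the actual training step and macro-average validation, so this is a simplification rather than the code used for the experiments:

```python
# Sketch of the multi-task schedule described above: between validation checks,
# tasks are visited in a random order and 10% of each task's data is used,
# repeated 10 times. The `train_on` and `validate` hooks are assumed placeholders.
import random

def multitask_training(tasks, train_on, validate, max_checks=100):
    best = float("-inf")
    for _check in range(max_checks):
        for _ in range(10):                      # ~one pass over every task's data
            order = random.sample(tasks, len(tasks))
            for task in order:
                train_on(task, fraction=0.1)     # 10% of this task's training data
        score = validate()                       # macro-average on development sets
        best = max(best, score)
    return best

if __name__ == "__main__":
    demo_tasks = ["CoLA", "SST-2", "MNLI"]
    multitask_training(
        demo_tasks,
        train_on=lambda task, fraction: None,    # stand-in training step
        validate=lambda: random.random(),        # stand-in validation score
        max_checks=3,
    )
```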
5.3 Single-task Training

We use the same training procedure to train an instance of the model with ELMo on each task separately. For tuning hyperparameters per task, we use random search on that task's metrics evaluated on the development set. We tune the same hyperparameters as in the multi-task setting, except that we also tune whether or not to use attention (for pair tasks only), and whether to use SGD or Adam (Kingma and Ba, 2014).

5.4 Sentence Representation Models

Finally, we evaluate a number of established sentence-to-vector encoder models using our suite. Specifically, we investigate:

1. CBoW: the average of the GloVe embeddings of the tokens in the sentence.

2. Skip-Thought (Kiros et al., 2015): a sequence-to-sequence(s) model trained to generate the previous and next sentences given the middle sentence. After training, the model's encoder is taken as a sentence encoder. We use the original pretrained model (github.com/ryankiros/skip-thoughts) trained on sequences of sentences from the Toronto Book Corpus (Zhu et al. 2015, TBC).
3. InferSent (Conneau et al., 2017): a BiLSTM with max-pooling trained on MNLI and SNLI.

4. DisSent (Nie et al., 2017): a BiLSTM with max-pooling trained to predict the discourse marker (e.g. "because", "so", etc.) relating two sentences, on data derived from TBC (Zhu et al., 2015). We use the variant trained to predict eight discourse marker types.

5. GenSen (Subramanian et al., 2018): a sequence-to-sequence model trained on a variety of supervised and unsupervised objectives. We use a variant of the model trained on both MNLI and SNLI, the Skip-Thought objective on TBC, and a constituency parsing objective on the One Billion Word Benchmark.

We use pretrained versions of these models, fix their parameters, and learn task-specific classifiers on top of the sentence representations that they produce. We use the SentEval framework to train the classifiers.

Model | LS | PAS | L | K | All
BiLSTM | 13 | 27 | 15 | 15 | 20
BiLSTM +Attn (A) | 26 | 32 | 24 | 18 | 27
BiLSTM +ELMo | 14 | 22 | 14 | 17 | 17
BiLSTM +CoVe | 17 | 30 | 17 | 14 | 21
CBoW | 09 | 13 | 08 | 10 | 10
Skip-Thought | 02 | 25 | 09 | 08 | 12
InferSent | 17 | 18 | 15 | 13 | 19
DisSent | 09 | 14 | 11 | 15 | 13
GenSen | 27 | 27 | 14 | 10 | 20

Table 6: Results on the diagnostic set; (A) denotes models using attention. All numbers in the table are R3 coefficients between gold and predicted labels within each category (percentage). The categories are Lexical Semantics (LS), Predicate-Argument Structure (PAS), Logic (L), and Knowledge and Common Sense (K).

6 Benchmark Results

We present performance on the main benchmark in Table 5. For multi-task models, we average performance over five runs; for single-task models, we use only one run.

We find that the single-task baselines have the best performance among all models on SST-2, MNLI, and QNLI, while lagging behind multi-task trained models on MRPC, STS-B, and RTE. For MRPC and RTE in particular, the single-task baselines are close to the majority class baselines, indicating the inherent difficulty of these tasks and the potential of transfer learning approaches. On QQP, the best multi-task trained models slightly outperform the single-task baseline.

For multi-task trained baselines, we find that almost no model does significantly better on CoLA or WNLI than predicting the majority class (0.0 and 63, respectively), which highlights the difficulty current models have in generalizing to these tasks. The notable exception is DisSent, which does better than other multi-task models on CoLA. A possible explanation is that DisSent is trained using a discourse-based objective, which might be more sensitive to grammaticality. However, DisSent underperforms other multi-task models on more data-rich tasks such as MNLI and QNLI. This result demonstrates the utility of GLUE: by assembling a wide variety of tasks, it highlights the relative strengths and weaknesses of various models.

Among our multi-task BiLSTM models, using attention yields a noticeable improvement over the vanilla BiLSTM for all tasks involving sentence pairs. When using ELMo or CoVe, we see improvements for nearly all tasks. There is also a performance gap between all variants of our multi-task BiLSTM model and the best models that use pre-trained sentence representations (GenSen and InferSent), demonstrating the utility of transfer via pre-training on an auxiliary task.

Among the pretrained sentence representation models, we observe relatively consistent per-task and aggregate performance gains moving from CBoW to Skip-Thought to DisSent, to InferSent and GenSen. The latter two show competitive performance on various tasks, with GenSen slightly edging out InferSent in aggregate.
7 Analysis

By running all of the models on the diagnostic set, we get a breakdown of their performance across a set of modeling-relevant phenomena. Overall results are presented in Table 6.

Overall Performance. Performance is very low across the board: the highest total score (27) still denotes poor absolute performance. Scores on the Predicate-Argument Structure category tend to be higher across all models, while Knowledge category scores are lower. However, these trends do not necessarily reflect that our models understand sentence structure better than world knowledge or common sense; these numbers are not directly comparable. Rather, numbers should be compared between models within each category.

One notable trend is the high performance of the BiLSTM +Attn model: though it does not outperform most of the pretrained sentence representation methods (InferSent, DisSent, GenSen) on GLUE's main benchmark tasks, it performs best or competitively on all categories of the diagnostic set.

Domain Shift & Class Priors. GLUE's online platform also provides a submitted model's predicted class distributions and confusion matrices. We provide an example in Figure 1. One point is immediately clear: all models severely under-predict neutral and over-predict entailment. This is perhaps indicative of the models' inability to generalize and adapt to new domains. We hypothesize that they learned to treat high lexical overlap as a strong sign of entailment, and that surgical addition of new information to the hypothesis (as in the case of neutral instances in the diagnostic set) might go unnoticed. Indeed, the attention-based model seems more sensitive to the neutral class, and is perhaps better at detecting small sets of unaligned tokens because it explicitly tries to model these alignments.

(a) Confusion matrix for BiLSTM +Attn (percentages):
Gold \ Prediction | All | E | C | N
All | | 65 | 16 | 19
E | 42 | 34 | 3 | 4
C | 23 | 11 | 8 | 4
N | 35 | 19 | 5 | 11

(b) Output class distributions (percentages); the Gold row gives the true distribution:
Model | E | C | N
BiLSTM | 71 | 16 | 13
BiLSTM +Attn (A) | 65 | 16 | 19
BiLSTM +ELMo | 81 | 9 | 10
BiLSTM +CoVe | 75 | 13 | 13
CBoW | 84 | 7 | 9
Skip-Thought | 80 | 8 | 12
InferSent | 68 | 21 | 11
DisSent | 73 | 18 | 8
GenSen | 74 | 15 | 11
Gold | 42 | 23 | 35

Figure 1: Partial output of GLUE's error analysis, aggregated across our models.

Linguistic Phenomena. While performance metrics on the coarse-grained categories give us broad strokes that we can use to compare models, we can gain a better understanding of the models' capabilities by drilling down into the fine-grained subcategories. The GLUE platform reports scores for every fine-grained category; we present here a few highlights in Table 7. To help interpret these results, we list some examples from each fine-grained category, along with model predictions, in Table 4.

Model | UQuant | MNeg | 2Neg | Coref
BiLSTM | 67 | 13 | 5 | 24
BiLSTM +Attn (A) | 85 | 64 | 11 | 20
BiLSTM +ELMo | 77 | 60 | -8 | 18
BiLSTM +CoVe | 71 | 34 | 28 | 39
CBoW | 16 | 0 | 13 | 21
Skip-Thought | 61 | 6 | -2 | 30
InferSent | 64 | 51 | -22 | 26
DisSent | 70 | 34 | -20 | 21
GenSen | 78 | 64 | 5 | 26

Table 7: Model performance in terms of R3 (scaled by 100) on selected fine-grained categories for analysis. The categories are Universal Quantification (UQuant), Morphological Negation (MNeg), Double Negation (2Neg), and Anaphora/Coreference (Coref).

The Universal Quantification category appears easy for most of the models; looking at examples, it seems that when universal quantification as a phenomenon is isolated, catching on to lexical cues such as "all" often suffices to solve our examples. Morphological negation examples are superficially similar, but the systems find them more difficult. On the other hand, double negation appears to be adversarially difficult for models to recognize, with the exception of BiLSTM +CoVe; this is perhaps due to the translation signal, which can match phrases like "not bad" and "okay" to the same expression in a foreign language. A similar advantage, though less acute, appears when using CoVe on coreference examples.
Overall, there is some evidence that going beyond sentence-to-vector representations might aid performance on out-of-domain data (as with BiLSTM +Attn) and that representations like ELMo and CoVe encode important linguistic information that is specific to their supervision signal. Our platform and diagnostic dataset should support future inquiries into these issues, so we can better understand our models' generalization behavior and what kind of information they encode.

8 Conclusion

We introduce GLUE, a platform and collection of resources for training, evaluating, and analyzing general natural language understanding systems. When evaluating existing models on the main GLUE benchmark, we find that none are able to substantially outperform a relatively simple baseline of training a separate model for each constituent task. When evaluating these models on our diagnostic dataset, we find that they spectacularly fail on a wide range of linguistic phenomena. The question of how to design general-purpose NLU models thus remains unanswered. We believe that GLUE, and the generality it promotes, can provide fertile soil for addressing this open challenge.

Acknowledgments

We thank Ellie Pavlick, Tal Linzen, Kyunghyun Cho, and Nikita Nangia for their comments on this work at its early stages, and we thank Ernie Davis, Alex Warstadt, and Quora's Nikhil Dandekar and Kornel Csernai for providing access to private evaluation data. This project has benefited from financial support to SB by Google, Tencent Holdings, and Samsung Research.

References

Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics, Prague, Czech Republic.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632-642. Association for Computational Linguistics.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In 11th International Workshop on Semantic Evaluations.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In 2nd Workshop on Evaluating Vector Space Representations for NLP.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In LREC 2018.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 681-691.

Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. 1996. Using the framework. Technical report, The FraCaS Consortium.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177-190. Springer.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.

T. Mark Ellison, editor. 1997. Computational Natural Language Learning: Proceedings of the 1997 Meeting of the ACL Special Interest Group in Natural Language Learning. Association for Computational Linguistics.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In First Workshop on Building Linguistically Generalizable NLP Systems.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1-9. Association for Computational Linguistics.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In Proceedings of ICLR 2018.

J. Gorodkin. 2004. Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 28(5-6):367-374.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP 2017.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL 2016.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177. ACM.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR 2015.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188-1196. PMLR.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442-451.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297-6308.

Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. CoRR, abs/1705.06476.

Allen Nie, Erin D Bennett, and Noah D Goodman. 2017. DisSent: Sentence representation learning from explicit discourse relations. arXiv preprint arXiv:1710.04334.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 271. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115-124. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of ICLR.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

Rachel Rudinger, Chandler May, and Benjamin Van Durme. 2017. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 74-79.

Carson T Schütze. 1996. The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. University of Chicago Press.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proc. of CoNLL.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. In ICLR 2017.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631-1642.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231-235.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of ICLR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77-82.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR 2016.

Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996-1005.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165-210.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL 2018.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. In ICLR 2017.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27.