Evaluating Machines by their Real-World Language Use
Rowan Zellers♠  Ari Holtzman♠  Elizabeth Clark♠  Lianhui Qin♠  Ali Farhadi♠  Yejin Choi♠♥
♠ Paul G. Allen School of Computer Science & Engineering, University of Washington
♥ Allen Institute for Artificial Intelligence
rowanzellers.com/advice

arXiv:2004.03607v1 [cs.CL] 7 Apr 2020

[Figure 1: an example situation, with human-written and T5-written advice.]
Situation: "I have to do a dissection for my high school class, but I'm distressed by dead animals. Last time we dissected an animal in class, I had a panic attack. I asked my teacher for another assignment, but she refused. I don't want to play a 'victim' card, but I don't know what to do. Help!"
Human-written advice (rated Helpful; seeker's reaction: "Thanks!"): "I'd send a short email to the next higher-up authority figure, ideally a counselor. Be forthright; it's the best approach when self-advocating as a student."
T5-written advice (rated Not helpful; seeker's reaction: "??????"): "Go to your teacher and say 'I'm asking you to do a project that requires me to see dead animals. This is a dealbreaker.' If she doesn't concede, tell your principal about your trauma."
Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice to a friend (shown above). We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models – which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning – struggle with all of these capabilities when generating advice, as highlighted in red.

Abstract

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations.

Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5 (Raffel et al., 2019), writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

1 Introduction

In October 2019, a team from Google surprised many in the natural language processing (NLP) community by announcing T5, a new 11-billion parameter model pretrained on hundreds of gigabytes of language text (Raffel et al., 2019). Like many other large models released in the last few years, T5 showed impressive gains on a variety of NLP benchmarks, adding to a growing list of "solved" datasets on which machines outperform humans.

Yet, when T5 generates language, we observe clear gaps between machine-level and human-level language understanding. Consider the example in Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals – and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues. The helpful advice in this example – but not the only one possible – suggests that she escalate the situation slightly by sending a short email to her guidance counselor.

On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland et al., 2019; Gardner et al., 2019).

We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what our evaluation methodology can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other – such as when we give advice, or teach a concept to someone – there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.

We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013).

As a case study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1, and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950), we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.

We make our challenge concrete by introducing a new dataset, RedditAdvice, and accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues – which past work suggests differ from hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset and leaderboard dynamic, rather than static – evaluating models on Reddit situations posted over the previous two weeks, at the time of submission. Models therefore must tackle the same language task as humans, generalizing to new situations and patterns of language.

Experimental results show that RedditAdvice is incredibly challenging for today's machines. Today's largest model, T5, with 11 billion parameters (Raffel et al., 2019), produces advice that is preferable to human-written advice only 9% of the time – after being finetuned for our task, on a training dataset with 600k pieces of advice. What's more, our experimental setup finds statistically significant differences between current models, allowing us to meaningfully grade varying levels of performance.

We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, are self-contradictory, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.

In summary, our paper makes three major contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic language understanding tasks, introducing a new setting that reveals areas where progress is still needed.

2 Real world language use

Our key proposal is to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome.
Our approach is inspired by Wittgenstein's notion of semantics, that "meaning is use": language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953).

As machines do not have humanlike needs or desires, we propose to evaluate a machine's success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.

2.1 Related work

2.1.1 Pragmatics in NLP

Our evaluation relates to pragmatics in NLP, where communication is modeled also through listeners and speakers (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game, with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans must build a structure by giving commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.

Pragmatics can also be studied through machine-machine communication; e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two major approaches for evaluation

Today, we see two major approaches for model evaluation, which we discuss below.

Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story-writing, and measures the inherent quality of generations, often through individual attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often not sufficient to guarantee task success.

Correctness. The second (and perhaps more common) approach is to evaluate tasks through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations,[1] or, by how often they choose the correct answer on a multiple choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call, or produces a correct parse.

[1] Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or, (2) original source text (Barrault et al., 2019).

Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.

However, there are gaps between datasets for proxy tasks (e.g. multiple choice), and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?

When we reduce a complex language task to a simplified setup, with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can enable machines to even outperform humans at the final benchmark, without solving the underlying task.

While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformer models (Vaswani et al., 2017) are very good at picking up on artifacts, there is a deeper underlying issue. The key assumption behind a simplified language task is that, by correctly mapping from inputs X to labels Y, a machine must necessarily learn a set of attributes A that are also representative of the 'true' task. We can upper-bound the information contained by A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes the following equation, for some β > 0:

    min_{p(a|x)} I(X; A) − β I(A; Y),    (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

    I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).    (2)

This means that an efficient model, trained on a task with a small label space, might have attributes with low information content. This phenomenon has been observed empirically, with deep models iteratively discarding information at each layer (Tishby and Zaslavsky, 2015).

If we wish to evaluate language understanding via proxy tasks, then the information-discarding strategy of efficient models poses an issue. Models are encouraged to forget linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000). In fact, models might exclusively learn the artifacts in the data, provided that these artifacts are enough to solve the task.

An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a; Le Bras et al., 2019). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.

Our proposal, to evaluate models through their real-world usage of language, addresses the information-discarding issue in two ways. First, by using real-world language over open-ended tasks, the mapping between possible inputs and outputs is allowed to be highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, our proposal tackles language problems directly, without introducing a correctness-based proxy that machines might overfit to.
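To make the bound in Equation (2) concrete: for a balanced 4-way multiple-choice task, H(Y) = 2 bits, so an efficient model needs to retain at most 2 bits of task-relevant information per example, no matter how rich the inputs are. A minimal illustrative sketch (ours; not part of the original analysis):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Balanced 4-way multiple choice: Equation (2) caps the task-relevant
# information at H(Y) = 2 bits, regardless of how long the inputs X are.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0

# With a label-distribution artifact (option (a) correct 70% of the time),
# the ceiling on I(A; Y) drops further, to about 1.36 bits.
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))
```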
2.3 Static datasets in a dynamic world

To evaluate performance on a real-world task by means of a dataset, we must (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011). This assumption might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, in our field, we commonly evaluate syntactic understanding using the Penn Treebank dataset, which contains news articles from 1989 (Marcus et al., 1993). However, the world is constantly evolving, along with the language that we use.

To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language tasks, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a new challenge for natural language understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.

We choose to focus on advice for a few reasons. First, people ask for and give advice as a part of their daily lives, encompassing settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). This means that we as humans have inherent familiarity with the task, and with what it means for advice to be helpful. Thus, as we will later show empirically, advice is easy to evaluate through how much it helps the advice-seeker – even though it is highly diverse.[2]

Second, giving advice overlaps with core NLP tasks, such as reading comprehension and natural language inference (Section 5.4). We hypothesize that generating advice that truly helps someone requires a deep understanding of their situation.

Third, good advice is important to people. Its importance has even led to the creation of internet communities oriented around advice, making data plentiful. Likewise, we hypothesize that progress on TuringAdvice might have high impact for good. An AI capable of writing consistently helpful advice – perhaps, a virtual therapist (DeVault et al., 2014) – could greatly help people in need.

[2] An advice-giver can recommend or advise against a particular action, they can provide information about options, and they can offer support (Dalal and Bonaccio, 2010).
[Figure 2: an example situation from r/relationship_advice, with its two top-scoring replies.]
Situation: "My (23M) boyfriend (25M) of 8 months birthday present to me was a tattoo of my name and face on his back." (13k points, by AdviceSeeker, 1 week ago) "I've been together with my BF (we'll call him Kyle) for a little over 8 months now. We don't live together but he only lives about a 5 minute walk from me. I would have described the relationship before this week as pretty slow. Neither of us really wanted any big commitments yet so outside of date nights, netflix and occasional hook ups the relationship has been pretty laid back. That was until this last weekend. My birthday was Saturday and we were having a party with about 9 people. Kyle made a big show about getting everyone together because he wanted to give me her present in front of everyone. Well, this is where things get crazy. For my "birthday present", Kyle got a MASSIVE tattoo on his back. Of my face. Underneath my face there is text saying "Mine forever". The silence was deafening, it didn't help that the tattoo was not even half done. This is completely out of my comfort zone and I have no clue what to do. My sister has been telling me just to break up with him and ignore him. But I just can't do that. Before Saturday I did feel a spark with him. I did like him a lot. But this is all just way to much. Any advice on what I should or can do here would be appreciated."
Top Reddit advice, by advicegiver1, 1 week ago (9k points): "You gotta at least talk to him and tell her why everyone reacted like that did. "I feel like the gift was making a huge commitment that we hadn't actually discussed yet. We aren't married, engaged or even living together so I'm not sure why you thought this was a good idea. I'm sure you only had good intentions, but I'm not prepared for the type of commitment that tattoo entails.""
By advicegiver2, 1 week ago (6k points): "Not sure what you can do... he's waving the reddest of red flags here."
Figure 2: An example situation, along with two pieces of top-scoring community authored advice. A machine must generate advice that is at least as helpful to the advice-seeker as the reference advice.

3.1 RedditAdvice: A dynamic dataset for evaluating advice

We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice.[3] During evaluation time, we will retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice versus the Reddit-endorsed human-written advice.

[3] We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.

3.1.1 How advice-giving works on Reddit

Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up their situation. The writing is typically detailed, and usually includes a question (often implicitly). They then post their situation to an advice-oriented subreddit. Users who follow that subreddit then reply to the situation, offering advice.[4]

Importantly, any user can 'upvote' or 'downvote' the advice as well as the situation itself, changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd as being the most helpful, while top-scoring situations are often the most detailed.[5] See Figure 2 for an example. Key to the functioning of this online advice community is that users want to participate and are thus intrinsically motivated – situation posters need advice, repliers desire to have their advice recognized, and readers enjoy passing judgement on such advice (Chiu et al., 2006; Wasko and Faraj, 2005).

[4] Users on Reddit can also reply to the replies themselves, in a hierarchical way. For simplicity however, we don't incorporate these nested replies in our dataset.
[5] This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).

3.1.2 The ideal evaluation – through Reddit?

In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice – representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model consistently understands written situations – enough to produce helpful advice – then its advice will in turn be consistently upvoted.

However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result.[6] For this reason, we take an alternate crowdsourcing approach.

[6] One alternative might be to post advice, but add a disclaimer that the advice was AI-written, and so should be taken with a grain of salt. We tried posting advice in this way, for a filtered set of non-volatile situations where no one was at imminent risk, but subreddit moderators banned our account.

3.1.3 A crowdsourced, hybrid evaluation – through Mechanical Turk

We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines.

One possible concern, however, is that crowd workers might be more extrinsically motivated – performing our task to earn income, as opposed to the Reddit users who are intrinsically motivated. To address this, we made our crowdsourcing task as fulfilling as possible: using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun. This suggests that our workers are intrinsically motivated, and thus are good judges of language use – a finding we confirm empirically in Section 4.1.1.
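Concretely, the dynamic retrieval step might look like the following minimal sketch, which uses the third-party praw Reddit API wrapper; our actual selection criteria are given in Appendix A.1, and the credential strings below are placeholders:

```python
import praw  # third-party Reddit API client

# Placeholder credentials: register an app at reddit.com/prefs/apps.
reddit = praw.Reddit(client_id="YOUR_ID", client_secret="YOUR_SECRET",
                     user_agent="turingadvice-eval-sketch")

ADVICE_SUBREDDITS = ["love", "relationships", "advice", "needadvice",
                     "dating_advice", "dating", "marriage",
                     "internetparents", "techsupport", "legaladvice"]

test_set = []
for name in ADVICE_SUBREDDITS:
    # Popular recent situations; "week" is the closest built-in filter,
    # so an extra timestamp check is needed for a two-week window.
    for post in reddit.subreddit(name).top(time_filter="week", limit=20):
        post.comments.replace_more(limit=0)   # drop "load more" stubs
        replies = list(post.comments)         # top-level replies only
        if not post.selftext or not replies:
            continue
        best = max(replies, key=lambda c: c.score)
        test_set.append({"situation": post.title + "\n\n" + post.selftext,
                         "reference_advice": best.body,
                         "score": post.score})
```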
[Figure 3: crowdsourcing workflow. Given a situation, Advice A, and Advice B, workers answer:
1. Which piece of advice is more helpful? (Definitely A / Slightly A / Slightly B / Definitely B)
2. How helpful is the worse advice (A) to the question-asker? (Slightly helpful / Not helpful / Dangerous)
3. If Slightly helpful: Is Advice A worse mainly due to its meaning, or its writing? (Meaning / Writing). Otherwise: Could Advice A be applicable to (and helpful in) a different situation? (Possibly helpful / Never helpful)]
Figure 3: Crowdsourcing workflow. Workers on Mechanical Turk are given a situation, and two pieces of advice. First, they choose which is most helpful – in this example, B is selected. Second, they rate the helpfulness of the worse advice (A); last, they answer an additional diagnostic question that depends on whether A was rated Slightly helpful or not.

3.1.4 Mechanical Turk annotation setup

In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks.[7] For each situation, we retrieve the top-rated human-written advice, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with human-written advice, and provide diagnostic ratings.

We show an overview of our Mechanical Turk task in Figure 3. A worker is given a situation, as well as two pieces of advice: A and B. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which. The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice – rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse than the better advice due to a Meaning problem or a Writing problem. Otherwise, they choose if the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.

Overall, three workers rate each model-situation pair, and their ratings are combined using a majority vote. We follow best practices for Mechanical Turk, including using a qualification exam, and disqualifying workers who tend to choose machine-written over human-written advice.

[7] See Appendix A.1 for information about the selection.

3.2 A large static dataset for training

We additionally present RedditAdvice2019, a large static dataset for training advice-giving models. Because today's models have extreme reliance on data for finetuning, we collect data that is in the exact same format as RedditAdvice, yet we expand our selection criteria, optimizing for recall rather than precision (Appendix A.2). In total, we extract 616k pieces of advice, over 188k situations.

To mirror the dynamic nature of the evaluation, in which models are evaluated on situations posted in 2020 and beyond, we split our dataset into static training and validation sets by date.[8] We trained models on the training set, and chose hyperparameters using perplexity over the validation set.

[8] Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; validation contains 8k from June 14 to July 9th, 2019.
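A minimal sketch of the date-based split in footnote 8, assuming each record carries the situation's Unix creation timestamp (the field name here is hypothetical):

```python
from datetime import datetime, timezone

TRAIN_END = datetime(2019, 6, 14, tzinfo=timezone.utc)
VAL_END = datetime(2019, 7, 9, tzinfo=timezone.utc)

def split_by_date(records):
    """Partition advice records into train/val sets by posting date,
    mirroring the dynamic evaluation on strictly newer situations."""
    train, val = [], []
    for rec in records:
        posted = datetime.fromtimestamp(rec["created_utc"], tz=timezone.utc)
        if posted < TRAIN_END:
            train.append(rec)
        elif posted < VAL_END:
            val.append(rec)
        # anything on or after VAL_END is reserved for dynamic evaluation
    return train, val
```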
4 Experimental results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following selection of NLP models:

a. Grover (Zellers et al., 2019b): a left-to-right transformer model. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, which has multiple fields of context (like the subreddit, date, and title). Sizes: we study the two largest Grover models: Grover-Large, with 0.3 billion parameters, and Grover-Mega, with 1.5 billion parameters.

b. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE (90 vs. 87) and near human performance on SuperGLUE (89.3 vs. 89.8). Sizes: we study the two largest T5 models. As their names suggest, T5-3B has 3 billion parameters, and T5-11B has 11 billion parameters.

c. TF-IDF retrieval: we additionally consider a simple baseline built around retrieval, not generation. We first precompute bag-of-words TF-IDF vectors for all situations in the training set. Given a new situation, we compute its TF-IDF vector and retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation (a minimal sketch of this baseline appears after this list).
Last, to quantify the measurement error of our evaluation, we additionally evaluate:

d. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.

We train our models using cross-entropy, and generate using Nucleus Sampling (Holtzman et al., 2020); a sketch of this sampling procedure also appears below. We provide additional training and generation details for our models in Appendix B.

In our study, we do not consider purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be adapted to generate text, their generations are generally worse than those of left-to-right models (Wang and Cho, 2019); moreover, T5 tends to outperform these models even on discriminative tasks. We also do not consider GPT (Radford et al., 2018), another left-to-right model, as making it controllably generate advice would involve more changes in finetuning, versus Grover.
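A minimal sketch of the retrieval baseline (c), using scikit-learn; the exact tokenization in our pipeline may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_retriever(train_situations, train_best_advice):
    """Answer a new situation by copying the top-scoring advice of its
    nearest training situation under TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer(lowercase=True)
    train_matrix = vectorizer.fit_transform(train_situations)

    def give_advice(new_situation):
        query = vectorizer.transform([new_situation])
        # linear_kernel on L2-normalized TF-IDF rows = cosine similarity
        sims = linear_kernel(query, train_matrix)[0]
        return train_best_advice[int(sims.argmax())]

    return give_advice
```

Likewise, a sketch of Nucleus (top-p) Sampling over a single next-token distribution; the threshold p here is illustrative, and our actual generation settings are in Appendix B:

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=np.random.default_rng()):
    """Sample a token id from the smallest set of tokens whose
    cumulative probability exceeds p (Holtzman et al., 2020)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:nucleus_size]
    renormed = probs[nucleus] / probs[nucleus].sum()  # renormalize in nucleus
    return int(rng.choice(nucleus, p=renormed))
```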
[Figure 4: bar chart of the frequency (%) that model advice is preferred over the best Reddit advice. Recovered values: TF-IDF Retrieval 2.0%; Grover-Large (0.3B) 3.5%; Grover-Mega (1.5B) 4.0%; T5-3B 6.0%; T5-11B 9.0%; second-best Reddit advice 40.0%.]
Figure 4: Helpfulness of evaluated models, relative to top-scoring Reddit advice. We show results over 200 shared situations; we also show bootstrapped 95% confidence intervals. Advice from the biggest model, T5-11B, is preferred 9% of the time over Reddit advice.

[Figure 5: matrix of pairwise improvements (absolute %). Over TF-IDF Retrieval: Grover-Large 1.5%, Grover-Mega 2%*, T5-3B 4%*, T5-11B 7%*, second-best Reddit 38%*. Over Grover-Large (0.3B): Grover-Mega 0.5%*, T5-3B 2.5%*, T5-11B 5.5%*, second-best Reddit 36.5%*. Over Grover-Mega (1.5B): T5-3B 2%, T5-11B 5%*, second-best Reddit 36%*. Over T5-3B: T5-11B 3%, second-best Reddit 34%*. Over T5-11B: second-best Reddit 31%*. Legend: * marks p < .01; remaining entries were significant with p < .05 or not significant.]
Figure 5: Improvement (in absolute percentage %) between pairs of models, along with statistical significance as measured by a paired t-test. The improvement of large models over smaller ones is highly significant, such as T5-11B over Grover-Mega (5% gap, p<.01).

4.1 Quantitative results

In Figure 4, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from February 1st to February 12, 2020. As a key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency with which workers prefer the model-written advice over the Reddit-written reference advice. If a model's advice was just as helpful as human advice in aggregate, then that model would score 50%.

Model performance is quite low. The best model, T5 with 11 billion parameters, scores 9%. Other models, with fewer parameters, do worse – with Grover-Large (0.3B parameters) scoring 3.5%. In comparison, the second-highest scoring Reddit advice scores 40%, and the highest scoring advice is (by definition) 50%. However, in theory, a model could score above 50%, if it writes advice that is truly helpful and thus gets consistently chosen.

4.1.1 Measurement error

To investigate the measurement error of our evaluation, in Figure 5 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. For pairs of models with a greater difference in parameter count, we similarly see a large (and statistically significant) difference in performance. For instance, while the improvement of T5-11B over T5-3B was 3%, and not found to be statistically significant, the improvement of T5-11B over Grover-Mega was 5% and highly statistically significant.

Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance is low on our task (≤9%), suggests that there is ample room for growth on RedditAdvice.
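The full procedure is in Appendix C; the following is a minimal sketch of the two statistics reported above, assuming per-situation binary outcomes (1 if a model's advice was preferred) aligned across models:

```python
import numpy as np
from scipy import stats

def bootstrap_ci(wins, n_boot=10_000, seed=0):
    """Bootstrapped 95% CI on one model's preference rate (Figure 4)."""
    rng = np.random.default_rng(seed)
    wins = np.asarray(wins, dtype=float)
    means = [rng.choice(wins, size=len(wins), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

def paired_p_value(wins_a, wins_b):
    """Paired t-test over shared situations for two models (Figure 5)."""
    return stats.ttest_rel(wins_a, wins_b).pvalue
```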
[Figure 6: a qualitative example from r/legaladvice, with each system's advice and its majority worker rating.]

Situation (legaladvice): "Employer just informed me I was never put on payroll and 'don't have a w2'. Very unusual situation. My former employer who laid all of us off suddenly in mid November is just today telling me that I was never "entered into payroll" and don't have a w2. He is suggesting that instead he can write me a check for the taxes already withheld from my pay. This immediately rang alarm bells because I'm not an idiot. I said replied neutrally that I needed time to make a decision. Here's the thing, throughout my employment there I was not provided with paystubs with each paycheck. I repeatedly asked for my paystubs and it was always blamed on "the bookkeeper". I looked it up and the state I'm in does not require employers to provide paystubs. Fine. But when I was laid off I wanted to apply for unemployment and needed a final paystub. After almost a month of daily texts I finally got a final paystub from him, reviewing it now I had almost $1500 withheld in taxes from a gross pay of around $2500 over a period of about 3.5 months. This amount of tax now seems excessively high AND if I "have no w2" does that mean my employer withheld taxes from my pay and did not actually pay that money forward as taxes. What do I need to do here to both stay in the clear tax wise AND potentially recover the money withheld perhaps wrongly?"

Top Reddit advice: "If you live in a state with a labor board or labor commission start there. This is a big deal and they will be able to provide the most assistance. If your state does not have that, you really might want to call your local district attorney and see if anyone does payroll fraud. After that, the IRS might actually be able to help, they have records on the other side for you pay, if your boss withheld and sent in the tax payment which I doubt. You are actually the victim of a crime here, not just a lazy accounting. Your employer could feed all the payroll info into business software and get it all straightened out, but they aren't. I am not recommending a CPA or tax person because you're going to spend half your total pay on them, as your total income is well below the free tax filing software threshold."

TF-IDF retrieval: "This is a company that is big enough to have a payroll department and they did this? Jesus - I don't know where to even start with this. Swalsh is right - it's not only illegal it's potentially jail time illegal." (Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Grover-Large: "Is there a method by which you can "dismiss" this information as false? If so, dispute it with the IRS." (Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Grover-Mega: "If you can prove that they never actually paid you for work you did, you can sue them for that money." (Definitely less helpful; Not helpful; Possibly helpful in a different situation)

T5-3B: "Is your employer US based? Because if they were (you'll need to make a claim to the IRS), then you would have had a W2 issued to you with the taxes withheld from your pay, at minimum. If you were truly not on payroll, you still have a W2 (since you were still earning money), but that's probably for taxes withheld from your paychecks as well as any state and federal income taxes that were withheld. Neither of those numbers look good." (Definitely less helpful; Not helpful; Never helpful for any situation)

T5-11B: "> I have no w2 does that mean my employer withheld taxes from my pay and did not actually pay that money forward as taxes? Yes, your employer should have withheld your taxes, reported your income correctly, and paid them into the IRS, if you are required to do so. Here's the thing though: If you get a W2 from your employer, then it's time to file your taxes for 2019. You should try and use the W2 you just got from your former employer to file." (Definitely less helpful; Not helpful; Possibly helpful in a different situation)

Second-best Reddit advice: "So it definitely is not quite adding up. Talk with a CPA, they can probably guide you through dealing with this. They would be able to keep you in the clear. It probably won't be free but shouldn't be too costly. Do this soon because they get bombarded with work between now and April so keep that in mind. Hiring a lawyer or going after them legally doesn't seem worth it. And CPA's are certified to deal with the IRS on your behalf if they decide to come calling or asking questions." (Slightly less helpful; Slightly helpful; Meaning problem)

Figure 6: A qualitative example; more are in Appendix E. Though machine-generated advice matches keywords from the situation, upon a close read, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.
[Figure 7: stacked bar chart of the frequency (%) of advice ratings for TF-IDF Retrieval, Grover-Mega (1.5B), T5-11B, and the second-best Reddit advice. Rating categories: Preferred over top-rated Reddit advice; Slightly helpful (meaning problem); Slightly helpful (writing problem); Not helpful/Dangerous (possibly helpful); Not helpful/Dangerous (never helpful).]
Figure 7: Distribution of ratings for three evaluated models: the retrieval baseline, Grover-Mega, and T5-11B; along with ratings for the second-best rated Reddit advice. Though generative models like Grover and T5 are preferred more often than the TF-IDF retrieval baseline, they also often struggle to generate coherent advice. Of note, 31% of the advice from T5 would never be helpful in any situation, versus 4% from the retrieval model.

5 Analysis and discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.

To break down what this gap in performance means, we show a qualitative example in Figure 6, describing a wage theft situation. The top-rated Reddit advice understands the situation, and then offers helpful assistance. It recommends the advice-seeker contact their state labor board or district attorney, noting that they are "the victim of a crime." Meanwhile, machine advice often misses the heart of the issue: T5-11B, for example, suggests that the advice-seeker simply file their taxes. Even worse, T5-3B is self-contradictory, saying that "if you were truly not on payroll, you still have a W2."

5.1 Problems with machine-written advice

As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), we ask workers to not only select which advice is better, but also to annotate problems with the worse piece of advice. We find workers have high agreement throughout the diagnostic annotation process; moreover, we use three workers per advice for additional consistency.[9]

In Figure 7, we show the distribution of the ratings for model-written, versus human-written advice. Machine-written advice that was not preferred over human-written advice can have the following ratings. It can be rated as Slightly helpful (but, was rated as worse mainly due to a Meaning problem or Writing problem), or, as either Not helpful or Dangerous (and it could be Possibly helpful in some other situation, or Never helpful in any situation).[10]

The diagnostics show several interesting patterns. First, stronger generators improve over weaker generators: in comparing T5-11B to the weaker machine model, Grover-Mega, we find it tends to produce less 'Not helpful/Dangerous' advice. 26% of T5-11B's advice is Never helpful in any situation, versus 29% for Grover-Mega; 21% is unhelpful but could be Possibly helpful in another situation, versus 32% for Grover-Mega.

Second, and perhaps most surprising, we find that all generators frequently commit natural language understanding errors during generation, including internal contradiction. Because of this, we find that our simple baseline – TF-IDF bag-of-words retrieval – is competitive with deep generators with billions of parameters. While its advice is often irrelevant (84% of the time), it is almost never complete gibberish, since it is retrieved from top-scoring advice. In fact, very few (3%) of workers rated this advice as Never helpful in any situation, versus 26% for T5.

[9] For classifying machine-written advice as 'helpful' versus 'not helpful' or 'dangerous' (combining the two latter categories into one), we have κ=0.689. For breaking down helpful advice into a 'meaning problem' versus a 'writing problem', we have Cohen's κ=0.646; for rating unhelpful advice as 'possibly helpful' versus 'never helpful,' we have κ=0.636.
[10] We found workers rarely chose Dangerous (2%), so for ease of visualization, we combined it with Not helpful.

5.2 A Leaderboard for Advice Evaluation

So far, we have presented the results from one round of a dynamic evaluation. We propose to keep that evaluation ongoing, through a dynamic leaderboard at rowanzellers.com/advice.

At the time of writing, the leaderboard works as follows. Users submit a model API to be dynamically evaluated. The new model, along with the highest rated previously-evaluated model, will be evaluated for an additional round, using a new set of 200 situations posted over the last two weeks, using the same approach as in Section 3.1.3.

One potential concern, however, is price. In our Mechanical Turk workflow, we paid workers 52 cents per HIT.[11] After the Mechanical Turk fee, and using 3 workers per piece of advice, this costs $1.86 per piece of advice, or $372 for 200 pieces of advice. We argue that this cost should not prohibit a human evaluation, particularly when compared with the electricity cost of model development (Strubell et al., 2019). To ensure that the cost is fairly distributed among the community, we propose that submitters to our leaderboard pay the Mechanical Turk bill.[12]

[11] We chose this to pay workers at least $15 per hour.
[12] This model is used for the HYPE leaderboard in computer vision (Zhou et al., 2019).
[Figure 8: log-scale histograms of example lengths (in spaCy tokens) for RedditAdvice situations and advice, versus HellaSWAG, GLUE, and SuperGLUE.]
Figure 8: Length distribution of RedditAdvice, compared with other common NLU benchmarks (HellaSWAG; Zellers et al. (2019a), GLUE; Wang et al. (2019b), SuperGLUE; Wang et al. (2019a)). The examples in RedditAdvice are significantly longer, representing highly complex situations.

5.3 Length and complexity

One interesting aspect of RedditAdvice is that its situations are long and complex. We show a distribution of lengths in Figure 8. Not only are the inputs and outputs complex, but we argue that the mapping between them is complex as well: it is necessarily one-to-many, as there might be many possible kinds of good advice for a situation.

We believe evaluating by task success – how much the advice helps a user – is the key reason why a task like advice-giving can be scaled up. First, intrinsic motivation rewards high data quality: users who post situations are motivated to add relevant details, and advice-givers are motivated to help the user. Second, task success can stably evaluate long passages. Two pieces of advice rarely mean exactly the same thing, but this does not mean we cannot evaluate which is more helpful.

5.4 Relation to existing NLP tasks

Shared "core" tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.

We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix E, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today's systems struggle with the general task of natural language inference.

One interesting line of research would be to transfer knowledge from supervised NLP datasets into a generative setting.[13] However, one difficulty is that datasets are necessarily curated – in terms of both the label space as well as the data distribution. Paragraphs of machine-written advice, which exhibit many kinds of language understanding errors, might be significantly out-of-distribution.

We propose another way forward. Predicting the advice ratings themselves (such as whether advice could ever be helpful) is itself a language task, one that might provide signal for better generation.[14]

Overall, this suggests that evaluating machines by their language use might lead to progress on existing NLP tasks in two ways. First, by studying a generative setting, we necessarily adopt a broad and inclusive definition of the task at hand; second, we can turn common challenges into small discriminative tasks to study further.

[13] Another idea is crowdsourcing data specifically to improve generation models, e.g. dialogue (Welleck et al., 2019).
[14] To allow for study of this problem, we make the full evaluation results – including the advice ratings – public.

5.5 How can we build models that are better at giving advice?

Over the last few years, a major trend in NLP has been towards developing bigger models, while making fewer changes to neural architecture and the training objective. Almost all of today's leaderboard models are Transformers (Vaswani et al., 2017) trained through the maximum-likelihood training objective of predicting masked out (or next) words. Our experimental results suggest that even though these models struggle with RedditAdvice, we are still able to measure small gains. At the same time, our results suggest that scaling up parameter counts might not be enough. With 11 billion parameters, machines score 9% on our benchmark, versus 50% for humans (Figure 4).
We hypothesize that a key challenge for our field will be to move away from training models through word prediction objectives. We hypothesize that word prediction makes a model uniquely suited for correctness-based tasks, as word prediction itself is a task with a single correct answer. However, word prediction might not necessarily lead to the formation of mental models about the world. Other ideas include different modeling choices (Bengio, 2017) and learning paradigms (Mao et al., 2019). These and other directions seem promising for building better models that are not just better advice-givers, but better at real-world language use in general.

5.6 Ethical implications; possible dual use

One benefit of our proposal is that evaluating machines by their language use, on tasks with intrinsic motivation, might enable progress towards social good applications. For example, machines might one day help people who need advice – potentially on sensitive topics.

However, we do not claim that our approach is a panacea. We should approach purely technological solutions to societal problems (such as mental health care) with a grain of salt. Moreover, progress on using language effectively might yield models that cause harm, such as through generating disinformation (Zellers et al., 2019b). We as a community should (continue to) study and be mindful of these kinds of dual use issues (Hovy and Spruit, 2016; Green and Viljoen, 2020).

6 Conclusion

In our work, we introduced new methodology for evaluating language tasks, reducing the gap between our benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard, RedditAdvice. Today's largest models struggle on RedditAdvice, so we are excited to see what new models get developed.

Acknowledgements

Thanks to the Reddit users who participate in its advice subreddits – from asking for help, to writing (and voting on) helpful advice. Thanks to the Mechanical Turk workers who performed the annotation for our experiments. Thanks also to Katharina Reinecke, Oren Etzioni, Hannah Rashkin, Maarten Sap, Maxwell Forbes, Jesse Thomason, Daniel Khashabi, Gabriel Ilharco, Swabha Swayamdipta, and Yonatan Bisk for feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the NSF-GRFP No. DGE-1256082.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. 2012. Effects of user similarity in social media. In WSDM '12.

Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts), 3.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yoshua Bengio. 2017. The consciousness prior. arXiv preprint arXiv:1709.08568.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Silvia Bonaccio and Reeshad S. Dalal. 2006. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic QA-based approach for text summarization evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.
Chao-Min Chiu, Meng-Hsiang Hsu, and Eric TG Wang. 2006. Understanding knowledge sharing in virtual communities: An integration of social capital and social cognitive theories. Decision Support Systems, 42(3):1872–1888.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising textual entailment, pages 177–190. Springer.

Reeshad S Dalal and Silvia Bonaccio. 2010. What types of advice do decision-makers prefer? Organizational Behavior and Human Decision Processes, 112(1):11–23.

David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer, Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, et al. 2014. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 1061–1068.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Michael C Frank and Noah D Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 410–419. Association for Computational Linguistics.

Ben Green and Salomé Viljoen. 2020. Algorithmic realism: Expanding the boundaries of algorithmic thought. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 591–598.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In EMNLP.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge 'naturally' in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What's in a name? Understanding the interplay between titles, content, and communities in social media. In ICWSM.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. ICLR.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2019. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In ICLR.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

James L McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint arXiv:1912.05877.

Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: a randomized experiment. Science, 341(6146):647–51.

Fernando Pereira. 2000. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4603–4611.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.

Antonio Torralba and Alexei A Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Alan M. Turing. 1950. Computing machinery and intelligence. Mind, LIX(236):433–460.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. arXiv preprint arXiv:2002.09836.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.