TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers♠  Ari Holtzman♠  Elizabeth Clark♠  Lianhui Qin♠  Ali Farhadi♠  Yejin Choi♠♥
♠ Paul G. Allen School of Computer Science & Engineering, University of Washington
♥ Allen Institute for Artificial Intelligence
rowanzellers.com/advice

Abstract

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today's models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

Figure 1: TuringAdvice. Humans are natural experts at using language to successfully address situations that arise, such as giving advice. We introduce a new framework, dataset, and leaderboard to generatively evaluate real-world language use. Today's most powerful models – which obtain near-human or superhuman performance on core NLP benchmarks for reading comprehension, natural language inference, and commonsense reasoning – struggle with all of these capabilities when generating advice, as highlighted in red.
  Situation: "I have to do a dissection for my high school class, but I'm distressed by dead animals. Last time we dissected an animal in class, I had a panic attack. I asked my teacher for another assignment, but she refused. I don't want to play a 'victim' card, but I don't know what to do. Help!"
  Human-written advice (Helpful): "I'd send a short email to the next higher-up authority figure, ideally a counselor. Be forthright; it's the best approach when self-advocating as a student."
  T5-written advice (Not helpful): "Go to your teacher and say 'I'm asking you to do a project that requires me to see dead animals. This is a dealbreaker.' If she doesn't concede, tell your principal about your trauma."

1 Introduction

Language models today are getting ever-larger, and are being trained on ever-increasing quantities of text. For an immense compute cost, these models like T5 (Raffel et al., 2019) and GPT3 (Brown et al., 2020) show gains on a variety of standard NLP benchmarks – often even outperforming humans.

Yet, when a giant model like T5 generates language, we observe clear gaps between machine-level and human-level language understanding – even after it has been finetuned for the task at hand. Consider Figure 1, in which a woman asks for advice. She is assigned to dissect an animal for her class project, but has extreme anxiety about dead animals – and her teacher refused to give her another assignment. Humans can respond with helpful advice, reflecting our unique ability of real-world language use: to communicate and tackle open-ended issues. The helpful advice in this example – but not the only one possible – suggests that she send a short email to her guidance counselor.

On the other hand, not only is T5's advice unhelpful, it also reveals key misunderstandings of the situation. It seems to believe that the student is asking the teacher to do a class project involving dead animals. This reading comprehension error is particularly strange, as T5 outperforms humans on a variety of reading comprehension benchmarks. Others in the community have observed similar issues, raising concerns about what today's benchmark datasets measure (Yogatama et al., 2019; Kryscinski et al., 2019; McClelland et al., 2019; Gardner et al., 2019).
We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what benchmarks today can measure. Today's dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers. For example, we score multiple choice exams by how often the correct answers are chosen, and evaluate generative tasks like machine translation by similarity with respect to correct translations. However, when we use language in the real world to communicate with each other – such as when we give advice, or teach a concept to someone – there is rarely a universal correct answer to compare with, just a loose goal we want to achieve.

We introduce a framework to narrow this gap between benchmarks and real-world language use. We propose to evaluate machines by their success in using language to (1) communicate with humans in (2) tackling complex, open-ended, real-world situations. Our goal is a machine that, like a human, can generate language that is useful and helpful. Doing so necessarily requires a deep understanding of language and the world, as per a line of thought that the complete meaning representation is one that suffices to complete a task (Artzi et al., 2013).

As a case study of our framework, we introduce TuringAdvice as a new grand challenge for AI systems. A machine reads a situation written by a person seeking advice, like Figure 1, and must then write advice that is helpful to the advice-seeker. Like a Turing Test (Turing, 1950), we establish a simple condition required for a model to 'pass': model-generated advice must be at least as helpful to the advice-seeker as human-written advice.

We make our challenge concrete by introducing a new dataset, RedditAdvice, and an accompanying leaderboard. We tie our dataset to the Reddit community, which resolves two additional sources of bias. First, Reddit users are intrinsically motivated, seeking advice about highly complex real issues – which past work suggests differ from hypothetical issues that crowd workers might come up with (e.g. Kwiatkowski et al., 2019; Gurari et al., 2018). Second, we make our dataset dynamic, not static – models are evaluated over Reddit situations posted over the previous two weeks at the time of submission. Models therefore, like humans, must generalize to new situations and patterns of language.

Experimental results show that TuringAdvice is incredibly challenging for NLP models. Today's largest finetunable model, T5 with 11 billion parameters, produces advice that is preferable to human-written advice 14.5% of the time – after being finetuned on 600k examples. GPT3, an even larger model with 175 billion parameters that was not released for finetuning, does even worse at 4%. Even more concerning, our evaluation finds that it often generates hateful and toxic language.

We also study our task from the perspective of today's standard 'core' NLP tasks. Broadly, we find that machines frequently confuse who is who, are self-contradictory, or seem to miss important world knowledge. However, these mistakes tend not to fall into the neat categories defined by standard task definitions. We address this by introducing diagnostic questions, which systematically measure these language understanding errors.

In summary, our paper makes three contributions. First, we introduce a new framework for measuring language understanding through directly tackling real-world language problems. Second, we introduce TuringAdvice as a new challenge for AI systems, along with a dynamic dataset and leaderboard. Third, we connect our task to existing atomic NLP tasks, introducing a new setting that reveals where progress is still needed.

2 Real World Language Use

We propose to evaluate machines by their success at real-world language use: using language to communicate with a human, in response to a naturally occurring situation, in order to achieve a desired outcome. This is how educators often measure (human) language understanding of a second language – by how well the learner can use the language (Council of Europe, 2001). Our approach is also inspired by Wittgenstein's notion of semantics, that "meaning is use": language is grounded in our desire to make sense of one another and cooperate to meet our needs (Wittgenstein, 1953).

As machines do not have humanlike needs or desires, we propose to evaluate machines' success at a task by how well it serves a human who is interested in the outcome. For example, if a machine orders food on my behalf, then I can evaluate it based on whether I enjoy the dish it ordered. Though this requires careful task selection in order to make things feasible for current models, as we will show in Section 3, it results in a powerful and reliable human evaluation.
2.1 Related work

2.1.1 Pragmatics in NLP

Our evaluation relates to pragmatics in NLP, where communication is modeled also through listeners and speakers (Golland et al., 2010; Frank and Goodman, 2012). One approach is to introduce a communication game, with an explicit objective. For example, Wang et al. (2016) study a blocks world where humans give commands to a block-placing machine. The machine is then graded on accuracy. Our proposed evaluation instead covers complex everyday scenarios faced by a human, where the objective is to help them as much as possible.

Pragmatics can also be studied through machine-machine communication; e.g., through emergent language (Lazaridou et al., 2017). Recent work uses pretrained question-answering models to evaluate summarization models (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Vasilyev et al., 2020). However, ensuring that machines communicate in standard English is difficult, as there is usually a more efficient machine-language coding scheme for the task (Kottur et al., 2017).

2.1.2 Two major approaches for evaluation

Today, we see two major approaches for NLP evaluation, which we discuss below.

Quality of generations. The first approach studies generative tasks like chit-chat dialogue or story-writing, and measures the inherent quality of generations, often through attributes such as "sensibleness" and "specificity" (e.g., Venkatesh et al., 2018; Hashimoto et al., 2019; Adiwardana et al., 2020). This approach is orthogonal to ours: though these attributes might be desirable, they are often insufficient to guarantee success at a task.

Correctness. The second (and perhaps more common) approach is to evaluate models through correctness over static datasets. For example, machines can be graded by the similarity of their generated translation to correct translations,[1] or by how often they choose the correct answer on a multiple choice exam. Many goal-oriented dialogue and semantics tasks are also evaluated in this way, as a model is evaluated by whether it makes the correct API call, or produces a correct parse.

[1: Models submitted to the 2019 Conference on Machine Translation were evaluated (by humans) on how well the model's translations agreed with either (1) human-written translations, or (2) original source text (Barrault et al., 2019).]

Since many language tasks cannot be evaluated through correctness, researchers often introduce proxy tasks that are easy to evaluate, while (hopefully) correlating with the underlying true task. For example, SWAG (Zellers et al., 2018) is a multiple-choice proxy task and dataset introduced to study the true task of commonsense reasoning.

However, there are gaps between datasets for proxy tasks (e.g. multiple choice), and the core tasks they seek to represent (e.g. commonsense reasoning), which we discuss in the next sections.

2.2 Can language use really be measured through correctness over proxy tasks?

When we reduce a complex language task to a simplified setup, with a small label space (like multiple-choice classification), we run the risk of introducing artifacts and biases: patterns that can be exploited in the simplified setup, but that are not representative of the true task (Gururangan et al., 2018; Zellers et al., 2019a). Artifacts can enable machines to even outperform humans at the final benchmark, without solving the underlying task.

While the problem of artifacts has recently taken the spotlight in the NLP community, partially because large Transformers (Vaswani et al., 2017) excel at picking up on artifacts, there is a deeper underlying issue. One way to view simplified tasks is that in order to correctly map inputs X to labels Y, a machine must learn a set of attributes A that are representative of the 'true' task. We can upper-bound the information contained by A through the information bottleneck principle of Tishby et al. (1999). An efficient model minimizes the following, for some β > 0:

    min_{p(a|x)}  I(X; A) − β I(A; Y),    (1)

where I is mutual information. In other words, the model will learn attributes A that maximally compress the inputs X (minimizing I(X; A)), while also remaining good predictors of the labels Y (maximizing I(A; Y)). However, the label prediction term is bounded by the information (or entropy, H) of the label space:

    I(A; Y) = H(Y) − H(Y|A) ≤ H(Y).    (2)

Thus, for a task with a small label space, there is no guarantee that a model will learn high-information content attributes. Models are in fact encouraged to overfit to dataset artifacts, and to unlearn linguistically useful information that is not directly relevant to predicting Y (Pereira, 2000).
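To make the bound in Equation (2) concrete, the following is a small illustration we add (it is not part of the original paper): it computes the entropy cap H(Y) for a few label distributions, showing that for a balanced binary task no representation A can carry more than one bit of task-relevant information, no matter how rich the input X is.

```python
import math

def entropy(probs):
    """Shannon entropy H(Y) in bits for a label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# H(Y) upper-bounds I(A; Y) in Equation (2), regardless of input complexity.
label_spaces = {
    "balanced binary": [0.5, 0.5],   # e.g., 2-way classification
    "4-way choice": [0.25] * 4,      # e.g., SWAG-style multiple choice
    "skewed binary": [0.9, 0.1],     # an imbalanced binary task
}

for name, dist in label_spaces.items():
    print(f"{name:>16}: I(A;Y) <= H(Y) = {entropy(dist):.3f} bits")
```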
An alternate approach is to make datasets harder adversarially, so as to have fewer artifacts (Zellers et al., 2018, 2019a; Le Bras et al., 2020). However, it might be impossible to make a dataset with no artifacts, or to know if one has been created.

Our proposal, to evaluate models by their real-world language use, addresses the information bottleneck issue in two ways. First, when we use language in the real world, the mapping between possible inputs and outputs is often highly complex. For example, the space of possible advice is vast, and many pieces of advice might be equally helpful given a situation. Second, we directly tackle language problems, without introducing a correctness-based proxy that machines might overfit to.

2.3 Static datasets in a dynamic world

To evaluate performance on a real-world task by means of a dataset, we (implicitly) assume that the dataset is a good representation of the world (Torralba and Efros, 2011). This might be questionable when it comes to real-world language use, as static datasets necessarily capture historic patterns of language. For instance, syntactic understanding is often evaluated using the Penn Treebank, with news articles from 1989 (Marcus et al., 1993). However, the world is constantly evolving, along with the language that we use.

To bridge this gap, we propose to evaluate machines by their interactions with humans in the present. Models therefore must learn to perform the underlying language task, even for novel situations, rather than fitting to the historic distribution of a fixed test set. We make this notion concrete in the next section, where we introduce a dynamic dataset and leaderboard for evaluating advice.

3 TuringAdvice: a New Challenge for Natural Language Understanding

As a case study of our framework, we introduce TuringAdvice, a new challenge task for AI systems to test language understanding. The format is simple: given a situation expressed in natural language, a machine must respond with helpful advice. To pass the challenge, machine-written advice must be at least as helpful to the advice-seeker as human-written advice, in aggregate.

We focus on advice for a few reasons. First, advice-giving is both an important and an everyday task. People ask for and give advice in settings as diverse as relationship advice and tech support (Bonaccio and Dalal, 2006). Thus, we as humans have inherent familiarity with the task, and what it means for advice to be helpful – making it easy to evaluate, as we later show empirically. Moreover, because there are many internet communities devoted to advice-giving, training data is plentiful.

Second, the framework of advice-giving allows us to study subtasks such as reading comprehension and natural language inference (Section 5.3); we argue both of these are needed to consistently give good advice. Learning to recognize advice has recently been studied as an NLP task on its own (Govindarajan et al., 2020), though we are not aware of past work in learning to generate advice.

3.1 RedditAdvice: A dynamic dataset for evaluating advice

We propose to evaluate models dynamically, through new situations and advice that are posted to Reddit. We call our dynamic dataset RedditAdvice. Many of Reddit's subcommunities (or 'subreddits') are devoted to asking for and giving advice, with subreddits for legal, relationship, and general life advice.[2] During evaluation, we retrieve new situations from Reddit as a new test set for models. Workers on Mechanical Turk then grade the model-written advice versus the Reddit-endorsed human-written advice.

[2: We use advice from the following subreddits: Love, Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Marriage, InternetParents, TechSupport, and LegalAdvice.]

3.1.1 How advice-giving works on Reddit

Suppose a Reddit user faces an issue that they are seeking advice about. First, they write up the situation and post it to an advice-oriented subreddit. Users then reply to the situation, offering advice.

Importantly, any user can 'upvote' or 'downvote' the advice, as well as the situation itself, changing its score slightly. Top-scoring advice is deemed by the wisdom of the crowd as being the most helpful.[3]

[3: This is somewhat of a simplification, as other factors also influence what gets upvoted (Anderson et al., 2012; Lakkaraju et al., 2013; Muchnik et al., 2013; Jaech et al., 2015).]

3.1.2 The ideal evaluation – through Reddit?

In a sense, human advice-givers are 'evaluated' on Reddit by the score of their advice – representing how well their advice has been received by the community. Similarly, the ideal model evaluation might be to post advice on Reddit directly. If the model writes helpful advice, it should be upvoted.
However, there is a significant ethical problem with this approach. The users who post advice questions are real people, with real problems. A user might read advice that was originally written by a machine, think it was human-endorsed, and do something harmful as a result. For this reason, we take an alternate crowdsourcing approach.

3.1.3 A crowdsourced, hybrid evaluation – through Mechanical Turk

We propose a hybrid approach for dynamic evaluation of models. While the situations and reference advice come from Reddit, we hire workers on Mechanical Turk to rate the relative helpfulness of machine-written advice. Not only is this format more ethical, it also lets us collect diagnostic ratings, allowing us to quantitatively track the natural language understanding errors made by machines.

We made our crowdsourcing task as fulfilling as possible – using popular situations from Reddit, and pitching the work in terms of helping people. We received feedback from many workers that our tasks were entertaining and fun, suggesting that our workers are to some degree intrinsically motivated.

3.1.4 Mechanical Turk annotation setup

In a single round of evaluation, we retrieve 200 popular Reddit situations that were posted in the last two weeks. For each situation, we retrieve the top-rated advice from Reddit, and generate one piece of advice per model. Workers on Mechanical Turk then compare the helpfulness of the model-generated advice with human-written advice, and provide diagnostic ratings.

Figure 2: Crowdsourcing workflow. Mechanical Turk workers are given a situation, and two pieces of advice. First, they choose which is more helpful (here, B). Second, they rate the helpfulness of the worse advice (A); last, they answer a diagnostic question.
  Given: Situation, Advice A, Advice B.
  1. Which piece of advice is more helpful? [Definitely A / Slightly A / Slightly B / Definitely B]
  2. How helpful is the worse advice (A) to the question-asker? [Slightly helpful / Not helpful / Dangerous]
  3. (If Slightly helpful) Is Advice A worse mainly due to its meaning, or its writing? [Meaning / Writing]
  3. (Otherwise) Could Advice A be applicable to (and helpful in) a different situation? [Possibly helpful / Never helpful]

We show an overview of our Mechanical Turk task in Figure 2. A worker is given a situation and two pieces of advice. One is the top-scoring advice from Reddit, and the other is model-generated advice; the worker is not told which is which.

The worker first chooses the more helpful piece of advice, then provides diagnostic information for the less helpful advice – rating it Slightly helpful, Not helpful, or Dangerous. If the worse piece of advice was Slightly helpful, they choose whether it is worse due to a Meaning problem or a Writing problem. Otherwise, they choose if the worse advice could be Possibly helpful in some other situation, or Never helpful in any situation.

Three workers rate each model-situation pair, and ratings are combined using a majority vote. We follow best practices on Mechanical Turk, using a qualification exam, paying workers at least $15 per hour, and giving feedback to workers. Still, evaluation is highly economical at $1.86 per example-model pair, or roughly $400 per model evaluated.
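As an illustration of how the three ratings per model-situation pair might be combined, here is a minimal sketch we add (the rating strings and helper names are our assumptions, not the authors' released code): each worker's four-way preference is reduced to a binary "model preferred" flag, and the majority across the three workers decides the pair.

```python
from collections import Counter

# Hypothetical worker ratings for one model-situation pair; each rating says
# which piece of advice the worker found more helpful.
ratings = ["definitely_reddit", "slightly_model", "slightly_model"]

def model_preferred(rating: str) -> bool:
    """True if the worker preferred the model-written advice."""
    return rating.endswith("_model")

def majority_vote(ratings) -> bool:
    """Combine three binary judgments into one per-pair decision."""
    votes = Counter(model_preferred(r) for r in ratings)
    return votes[True] > votes[False]

print(majority_vote(ratings))  # True: 2 of 3 workers preferred the model
```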
3.2 A large static dataset for training

We present RedditAdvice2019, a large static dataset for training advice-giving models. Because today's models have extreme reliance on data for finetuning, we collect data that is in the exact same format as RedditAdvice, yet we expand our selection criteria, optimizing for recall rather than precision (Supp A.2). In total, we extract 616k pieces of advice, over 188k situations.

To mirror the dynamic nature of the evaluation, in which models are evaluated on situations posted in 2020 and beyond, we split our dataset into static training and validation sets by date.[4]

[4: Our training set contains 600k pieces of advice from July 2009 to June 14, 2019; validation contains 8k from June 14 to July 9th 2019.]

4 Experimental Results on RedditAdvice

In this section, we report results from one round of dynamic evaluation on RedditAdvice. We evaluate the following strong NLP models and baselines:

a. Rule-based: a templated system to give legal, relationship, or life advice. The system first randomly chooses an empathetic sentence from ten choices, for example "I'm sorry you're facing this." It then chooses a random piece of advice that is loosely related to the situation's topic; we infer this from the subreddit the situation was posted on. For example, for LegalAdvice the model might write "I'd suggest getting a lawyer immediately."

b. TF-IDF retrieval: for a new situation, we compute its TF-IDF bag-of-word vector and use it to retrieve the most similar situation from the training set. We then reply with the top-scoring advice for that situation (a minimal sketch of this baseline appears below, after the list of models).

c. Grover-Mega (Zellers et al., 2019b): a left-to-right transformer model with 1.5 billion parameters. Grover was pretrained on news articles with multiple fields, perhaps making it a good fit for our task, with multiple fields of context (like the subreddit, date, and title). Our situation-advice pairs are often quite long, so we adapt Grover for length, pretraining it on sequences of up to 1536 characters.

d. T5 (Raffel et al., 2019): a sequence-to-sequence model with a bidirectional encoder and a left-to-right generator, with 11 billion parameters. T5 was trained on a large dataset of cleaned web text. At the time of writing, T5 is the top-scoring model on the GLUE and SuperGLUE benchmarks (Wang et al., 2019b,a), scoring above human performance on GLUE and near human performance on SuperGLUE.

e. GPT3 (Brown et al., 2020): a left-to-right transformer model with 175 billion parameters. GPT3 must be "prompted" to generate advice since it has not been released for finetuning. We cannot provide few-shot examples in the prompt due to the length of situation-advice pairs; we instead mimic the formatting of a website quoting from Reddit (Appendix B.5).

Last, to quantify the measurement error of our evaluation, we additionally evaluate:

f. the second-highest rated Reddit advice for each situation. We send this advice through the same pipeline as machine-written advice.

We finetune all models (except GPT3) and generate using Nucleus Sampling (Holtzman et al., 2020); more details are in Appendix B.

In our study, we exclude purely bidirectional models, such as BERT (Devlin et al., 2019). While these models can be made to generate text, these generations are usually worse than those of left-to-right models (Wang and Cho, 2019). T5 also tends to outperform them, even on discriminative tasks.
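To make the retrieval baseline concrete, here is a minimal sketch of TF-IDF retrieval (our illustration using scikit-learn, not the authors' code; the toy training data and variable names are hypothetical): the new situation is embedded as a bag-of-words TF-IDF vector, the most similar training situation is found by cosine similarity, and its top-scoring advice is returned verbatim.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical training data: parallel lists of situations and their
# top-scoring Reddit advice.
train_situations = ["My landlord won't return my deposit...",
                    "My coworker keeps taking credit for my work..."]
train_advice = ["Send a written demand letter, then consider small claims court.",
                "Document everything and raise it with your manager directly."]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
train_matrix = vectorizer.fit_transform(train_situations)

def retrieve_advice(new_situation: str) -> str:
    """Reply with the advice attached to the most similar training situation."""
    query = vectorizer.transform([new_situation])
    # TF-IDF vectors are L2-normalized, so the linear kernel equals cosine similarity.
    similarities = linear_kernel(query, train_matrix).ravel()
    return train_advice[similarities.argmax()]

print(retrieve_advice("My boss takes credit for everything I do. What should I do?"))
```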
4.1 Quantitative results

In Figure 3, we show overall results for one evaluation trial, which featured 200 situations posted on Reddit from October 28 to November 7, 2020. As a key metric for measuring the relative usefulness of model-written advice, we evaluate the frequency with which workers prefer the model-written advice over the Reddit-written reference advice. If a model's advice was just as helpful as human advice in aggregate, then that model would score 50%.

Figure 3: Helpfulness of models relative to top-scoring Reddit advice. We show results over 200 shared situations; we also show bootstrapped 95% confidence intervals. Advice from the best-scoring model, T5-11B, is preferred 14.5% over top-scoring Reddit advice. We also compare the second-top scoring piece of Reddit advice, which scores 41% – worse than the best advice (50% by definition), but better than any model.

Model performance is quite low. The best model, T5-11B, scores 14.5%, outperforming a smaller Grover-Mega (4.5%); GPT3 does worse at 4.0%. The rule-based and TF-IDF baselines are competitive at 2.5% and 4.0%, respectively.

As additional comparison to the 50% upper bound, the second-highest scoring Reddit advice scores 41%. This suggests that our workers often prefer the same advice as Reddit users.

4.1.1 Measurement error

To investigate the measurement error of our evaluation, in Figure 4 we report the statistical significance between pairs of models; details about how this is computed are in Appendix C. We observe a large gap in performance between T5 and the other baselines. For example, its improvement over Grover-Mega is 10%, which is highly statistically significant. On the other hand, the differences in performance between other models are more minor – GPT3 does not outperform TF-IDF, and though it outperforms the rule-based system by 1.5%, it is only somewhat statistically significant.

Figure 4: Improvement (in absolute percentage %) between pairs of models, along with statistical significance from a paired t-test. The improvement of T5-11B over smaller models like Grover-Mega is highly statistically significant (10% gap, p < .01), while being far worse than human performance. Our evaluation thus meaningfully grades varying levels of performance.

Overall, the statistical significance results suggest that our evaluation can stably rank model performance. This, along with the finding that model performance is low on our task, suggests that there is ample room for growth on RedditAdvice.

5 Analysis and Discussion

So far, we have shown that we are able to reliably evaluate models in our dynamic setup, and that doing so results in model performance that is significantly lower than human performance.

To break down what this gap in performance means, we show a qualitative example in Figure 5. A user is asking for online legal advice about being stuck at work for their entire 4pm-midnight shift – with no eating allowed due to COVID-19. The top-rated Reddit advice understands this situation and then offers advice, suggesting the advice-seeker organize alongside other workers – as "New Jersey doesn't require employers to give any meal breaks." Machine advice seems to misunderstand the issue. T5 asks if there is "a reason that you can't leave the property," even though this reason is stated in the situation. GPT3's advice is self-contradictory; it also makes something up about a "restaurant" in the building.

Figure 5: Qualitative example; more in Supp. F. Though machine-generated advice matches keywords from the situation, it is frequently not helpful or even self-contradictory. The issues are due to critical errors in natural language understanding, such as reading comprehension, entailment, and coreference.
  Situation (legaladvice): "NJ Work will not let us eat in the building due to covid outbreak. We can't leave the property for breaks. They have a tent outside for us to eat in but it's getting cold out. I am a manager at my work part of my responsibilities is to stay on the property during my entire shift we are given a 1 hour paid lunch break that we do not have to clock out for. However due to recent covid outbreak the new policy is no eating in the building. They had set up a tent outside for us to eat in but it is starting to get cold out as well there are not heaters in there. My question is can they legally do this to us? Also I work 4:00 p.m. to midnight shift."
  Top Reddit advice: "So by 'clock out for' you are not a salaried exempt employee? Do the workers under you have a union? Aside from the cold, New Jersey doesn't require employers to give *any* meal breaks for employees 18 or older, so they could take away the breaks and the tent if they wanted, and just say 'no eating' (though water is required). What is your relationship with the other managers, and room to figure this out, ask for heaters, etc.?"
  Rule-Based: "If I were you, I'd get a lawyer immediately." [Definitely less helpful; Slightly helpful; Meaning problem]
  TF-IDF Retrieval: "You were blackout drunk, which means that you were not capable of consent. . . . If anyone should feel violated, it's you." [Definitely less helpful; Not helpful; Possibly helpful]
  Grover-Mega: "If you are required to be on the premises, they are required to provide you with a place to eat. https://www.nj.gov/labor/es/faq.html#13" [Definitely less helpful; Not helpful; Possibly helpful]
  T5-11B: "If your shift is from 4pm to midnight, is there a reason that you can't leave the property?" [Definitely less helpful; Not helpful; Never helpful]
  GPT3-175B: "A place of employment is considered the 'premises'. You can't just assume that because you can't walk out the door that you can eat wherever you want. This is work hours. You're not in your home. The restaurant, or at least the food, is all provided for you for your benefit. But if there are other people there, then I could see how it could be a safety issue." [Definitely less helpful; Not helpful; Never helpful]

5.1 Problems with machine-written advice

As part of our evaluation, we wish to quantitatively measure problems with machine-written advice. Recall that in our crowdsourcing setup (Section 3.1.3), three workers select which advice is better, and then annotate problems with the worse piece of advice. We found workers had high agreement during the diagnostic annotation.[5]

[5: For classifying machine-written advice as 'helpful' versus 'not helpful' or 'dangerous' (combining the two latter categories into one), we have κ = 0.689. For breaking down helpful advice into 'meaning problem' versus 'writing problem', we have Cohen's κ = 0.613; for rating unhelpful advice as 'possibly helpful' versus 'never helpful,' we have κ = 0.602.]
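As an illustration of the agreement numbers in the footnote above, Cohen's κ between two annotators can be computed as follows (a sketch we add for illustration; the labels shown are made up and are not the paper's annotations):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical diagnostic labels from two workers over the same ten
# machine-written advice items ('helpful' vs. 'not_helpful').
worker_a = ["helpful", "not_helpful", "not_helpful", "helpful", "not_helpful",
            "not_helpful", "helpful", "not_helpful", "not_helpful", "not_helpful"]
worker_b = ["helpful", "not_helpful", "helpful", "helpful", "not_helpful",
            "not_helpful", "helpful", "not_helpful", "not_helpful", "helpful"]

# Cohen's kappa corrects raw agreement for the agreement expected by chance.
print(round(cohen_kappa_score(worker_a, worker_b), 3))
```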
In Figure 6, we show the distribution of the ratings for model-written versus human-written advice. Machine-written advice that was not preferred over human-written advice can have the following ratings. It can be rated as Slightly helpful (but was rated as worse mainly due to a Meaning problem or Writing problem), as Not helpful, or Dangerous.

Figure 6: Distribution of ratings for three models: TF-IDF retrieval, GPT3, and T5, along with ratings for the second-best rated Reddit advice. Though deep generators like GPT3 and T5 are often preferred over the retrieval baseline, they also often write advice that would never be helpful (33% GPT3, 13% T5), and that is racist, sexist, or otherwise dangerous (10% GPT3, 3% T5).

The diagnostics show several patterns. First, all models frequently commit natural language understanding errors, such as internal contradiction. Because of this, we find that TF-IDF bag-of-words retrieval is competitive with large generators. While retrieved advice is often irrelevant (66% of the time), it is almost never complete gibberish, as it comes from top-scoring advice. Only 10% of workers rated this advice as Not helpful for any situation, less than T5.

Second, they suggest that models struggle even more without finetuning. A GPT3 model with careful prompting generates language that is Dangerous 10% of the time. These qualitative and quantitative results confirm a pattern observed by many others, that large language models like GPT3 often generate explicitly racist and sexist language out-of-the-box (Sheng et al., 2019; Gehman et al., 2020; Bender et al., 2021, among others). We explore this further in Supplemental F. This is perhaps worrying, since GPT3 is presently being commercialized.

5.2 A Leaderboard for Advice Evaluation

So far, we have shown results from one evaluation round; a second is in Supplemental D. We propose a dynamic leaderboard to keep that evaluation ongoing, at rowanzellers.com/advice.

Users submit a model API to be dynamically evaluated. Each new model, along with the highest rated previously-evaluated model, will be evaluated for an additional round using the same approach. The cost of each evaluation is reasonable (Section 3.1.4), which we authors will pay in the short term. An alternative strategy requires submitters to pay the Mechanical Turk fees themselves; this model was used for the HYPE leaderboard in computer vision (Zhou et al., 2019).

5.3 Relation to existing NLP tasks

Shared "core" tasks such as reading comprehension and natural language inference are of considerable interest to the NLP community. Many datasets have been proposed for these tasks, and progress on them is often measured through auto-gradeable correctness metrics. However, large models have started to outperform humans on these datasets, raising doubt that further progress on them brings us closer to human-level language understanding.

We argue two things: first, that many NLP tasks are necessary components of giving advice, and second, that because giving advice remains far from solved, these tasks are also far from solved. In Appendix F, we study problems with advice from T5-11B from the point of view of existing NLP tasks. For instance, machine advice often contradicts itself, suggesting that today's systems struggle with the general task of natural language inference. We have made these diagnostics publicly available to enable progress on automatically spotting these mistakes.

6 Conclusion; Ethical Considerations

We introduced new methodology for evaluating language tasks, reducing the gap between benchmarks and the real world. We also introduced a new challenge for the community, TuringAdvice, with an accompanying dataset and dynamic leaderboard.

Yet, if our field is to progress towards NLP models that 'understand natural language,' we should be cognizant of the impact that such technology might have on society.
In this paper, we presented a sketch of NLP models helping people who need advice on sensitive topics, which could be a measurable goal for the field.

At the same time, we do not claim that our approach is a panacea. There are almost certainly better non-technical solutions to ensure mentorship and legal advice for all (Green, 2019). Moreover, there are significant dual-use risks with models that understand language (Hovy and Spruit, 2016; Green and Viljoen, 2020). Our evaluation measures some risks of generative models – such as the tendency to generate toxic language – but more work in this area is needed.

Acknowledgements

Thanks to the Reddit users who participate in its advice subreddits – from asking for help, to writing (and voting on) helpful advice. Thanks to the Mechanical Turk workers who performed the annotation for our experiments. Thanks also to the three anonymous reviewers, along with Katharina Reinecke, Oren Etzioni, Hannah Rashkin, Maarten Sap, Maxwell Forbes, Jesse Thomason, Daniel Khashabi, Gabriel Ilharco, Swabha Swayamdipta, and Yonatan Bisk, for feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and the NSF-GRFP No. DGE-1256082.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. 2012. Effects of user similarity in social media. In WSDM '12.

Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. 2013. Semantic parsing with combinatory categorial grammars. ACL (Tutorial Abstracts), 3.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.

Silvia Bonaccio and Reeshad S. Dalal. 2006. Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences. Organizational Behavior and Human Decision Processes, 101(2):127–151.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A semantic qa-based approach for text summarization evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Council of Europe. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pages 177–190. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Michael C Frank and Noah D Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998.

Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. On making reading comprehension more comprehensive. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 105–112, Hong Kong, China. Association for Computational Linguistics.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 3356–3369.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics.

Venkata Subrahmanyan Govindarajan, Benjamin Chen, Rebecca Warholic, Katrin Erk, and Junyi Jessy Li. 2020. Help! need advice on identifying advice. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5295–5306.

Ben Green. 2019. "good" isn't good enough. In Proceedings of the AI for Social Good workshop at NeurIPS.

Ben Green and Salomé Viljoen. 2020. Algorithmic realism: Expanding the boundaries of algorithmic thought. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR.

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 591–598.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, and Hannaneh Hajishirzi. 2015. Talking to the crowd: What do people react to in online discussions? In EMNLP.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge 'naturally' in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967, Copenhagen, Denmark. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Himabindu Lakkaraju, Julian J. McAuley, and Jure Leskovec. 2013. What's in a name? understanding the interplay between titles, content, and communities in social media. In ICWSM.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. ICLR.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, and Yejin Choi. 2020. Adversarial filters of dataset biases. ArXiv, abs/2002.04108.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
James L McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, and Hinrich Schütze. 2019. Extending machine language models toward human-level language understanding. arXiv preprint arXiv:1912.05877.

Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social influence bias: a randomized experiment. Science, 341(6146):647–51.

Fernando Pereira. 2000. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4603–4611.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Antonio Torralba and Alexei A Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Alan M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the blanc: Human-free quality estimation of document summaries. arXiv preprint arXiv:2002.09836.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.

Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, et al. 2018. On evaluating and comparing open domain dialog systems. arXiv preprint arXiv:1801.03625.

Alex Wang and Kyunghyun Cho. 2019. Bert has a mouth, and it must speak: Bert as a markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. Superglue: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3261–3275. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Sida I Wang, Percy Liang, and Christopher D Manning. 2016. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2368–2378.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Wiley-Blackwell.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019a. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019b. Defending against neural fake news. In Advances in Neural Information Processing Systems 32.

Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li Fei-Fei, and Michael Bernstein. 2019. Hype: A benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pages 3444–3456.
Appendix

We provide the following items in the appendix:
• Dataset filtering criteria (Section A)
• Baseline model details (Section B)
• Computing statistical significance (Section C)
• Results from a different round of dynamic evaluation (Section D)
• Miscellaneous analysis (Section E)
• Additional qualitative examples (Section F)

For more up-to-date information, visit the project page and dynamic leaderboard at rowanzellers.com/advice.

Figure 7: Length distribution of RedditAdvice situations and advice (in spaCy tokens), compared with other common NLU benchmarks (HellaSWAG; Zellers et al. (2019a), GLUE; Wang et al. (2019b), SuperGlue; Wang et al. (2019a)). The examples in RedditAdvice are significantly longer, representing highly complex situations.

A Dataset Filtering Criteria

We discuss the criteria by which we extract situations and advice, both for our dynamic dataset RedditAdvice, as well as for our static training dataset RedditAdvice2019.

A.1 Dynamic Filtering Criteria for RedditAdvice

We use the following selection criteria for retrieving situations, along with the top-scoring advice, from Reddit. Using the Reddit API, we loop through Reddit posts, which might contain valid situations. We perform several checks on each post, to ensure that we can reliably extract a situation from it, as well as a top-scoring piece of advice from the comments.

We do the following to retrieve situations (a sketch of these checks appears at the end of this subsection):
a. We iterate through posts, sorted by score ('top' posts), that were posted between 36 hours ago and two weeks ago, on the following advice subreddits: Relationships, Advice, NeedAdvice, Dating_Advice, Dating, Love, Marriage, InternetParents, TechSupport, and LegalAdvice.
b. We skip 'update' posts, in which a user refers to an older situation that they posted, and 'meta' posts, in which subreddit rules are discussed.
c. We skip any post that has an HTML link, since today's models (presumably) would not be able to visit such a link.
d. We skip any post with a score of less than 20.
e. We do our best to clean the text of the post. Many posts include valid situations, but are then edited to include updates that took place afterwards, in response to advice that was given. These are typically delimited by dashed lines, and the word EDIT or UPDATE.
f. Posts in some of the subreddits (Dating_Advice, Dating, Love, Marriage) are often in the form of tips and general suggestions, rather than situations. We skip any posts from these subreddits that do not include a question mark.
g. We filter out posts that contain sensitive topics, such as assault, suicide, and abuse.
h. Last, we skip any post that in total is fewer than 128 spaCy tokens, or longer than 1280 spaCy tokens.

For a retrieved situation, we do the following to extract valid advice:
a. Given a post that contains a valid situation, we order the comments from highest- to lowest-scoring. We perform the following checks to determine if we can extract valid advice. Once we find valid advice, we stop iterating.
b. We skip any comment that was posted by a moderator, by the Reddit user who posted the original situation, or that was edited.
c. We skip any comment with a score of less than 20.
d. We skip any comment that contains fewer than 32 spaCy tokens.
e. One corner case is highly-scoring advice comments that refer implicitly to others. For instance, a comment might say 'You should listen to the other commenters and...' These references make sense inside a Reddit post; however, they are somewhat nonsensical when we pull the comment out of context. We thus skip any comment that seems to refer to others.

Once we retrieve a situation that has at least one piece of valid advice, we are done, and we move on to the next situation. We loop over the top-scoring 1000 posts in total, and randomly select 200 valid situations from this pool.
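The following is a minimal sketch of the situation and advice checks above (our illustration, not the authors' released code; posts and comments are represented as plain dictionaries with hypothetical keys, and token counts use spaCy as in the paper):

```python
import re
from typing import Optional

import spacy

nlp = spacy.blank("en")  # tokenizer only; we just need token counts
QUESTION_SUBREDDITS = {"Dating_Advice", "Dating", "Love", "Marriage"}

def n_tokens(text: str) -> int:
    return len(nlp(text))

def clean_situation(text: str) -> str:
    # Drop trailing EDIT/UPDATE sections, typically delimited by dashed
    # lines or the words EDIT / UPDATE.
    return re.split(r"\n-{3,}|\n(?:EDIT|UPDATE)\b", text)[0].strip()

def is_valid_situation(post: dict) -> bool:
    """post has (hypothetical) keys: subreddit, title, text, score."""
    text = clean_situation(post["title"] + "\n" + post["text"])
    if post["score"] < 20:
        return False
    if "http://" in text or "https://" in text:        # skip posts with links
        return False
    if post["subreddit"] in QUESTION_SUBREDDITS and "?" not in text:
        return False                                    # tips, not situations
    # (The sensitive-topic filter from criterion g is omitted here for brevity.)
    return 128 <= n_tokens(text) <= 1280

def is_valid_advice(comment: dict, op_author: str) -> bool:
    """comment has (hypothetical) keys: author, text, score, edited, is_mod."""
    if comment["is_mod"] or comment["edited"] or comment["author"] == op_author:
        return False
    if comment["score"] < 20 or n_tokens(comment["text"]) < 32:
        return False
    # Crude heuristic for advice that implicitly refers to other commenters.
    return "other commenter" not in comment["text"].lower()

def top_advice(post: dict) -> Optional[dict]:
    """Return the highest-scoring valid comment, or None if there is none."""
    for comment in sorted(post["comments"], key=lambda c: c["score"], reverse=True):
        if is_valid_advice(comment, post["author"]):
            return comment
    return None
```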