What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber (iSchool, CS, UMIACS, LSC, University of Maryland; Google Research, Zürich) jbg@umiacs.umd.edu
Benjamin Börschinger (Google Research, Zürich) {jbg, bboerschinger}@google.com

arXiv:1910.14464v3 [cs.CL] 21 Apr 2020

Abstract

In addition to machines answering questions, question answering (QA) research creates interesting, challenging questions that reveal the best systems. We argue that creating a QA dataset—and its ubiquitous leaderboard—closely resembles running a trivia tournament: you write questions, have agents—humans or machines—answer questions, and declare a winner. However, the research community has ignored the lessons from decades of the trivia community creating vibrant, fair, and effective QA competitions. After detailing problems with existing QA datasets, we outline several lessons that transfer to QA research: removing ambiguity, discriminating skill, and adjudicating disputes.

1 Introduction

This paper takes an unconventional analysis to answer "where we've been and where we're going" in question answering (QA). Instead of approaching the question only as ACL researchers, we apply the best practices of trivia tournaments to QA datasets.

The QA community is obsessed with evaluation. Schools, companies, and newspapers hail new SOTAs and topping leaderboards, giving rise to claims that an "AI model tops humans" (Najberg, 2018) because it 'won' some leaderboard, putting "millions of jobs at risk" (Cuthbertson, 2018). But what is a leaderboard? A leaderboard is a statistic about QA accuracy that induces a ranking over participants.

Newsflash: this is the same as a trivia tournament. The trivia community has been doing this for decades (Jennings, 2006); Section 2 details this overlap between the qualities of a first-class QA dataset (and its requisite leaderboard). The experts running these tournaments are imperfect, but they've learned from their past mistakes (see Appendix A for a brief historical perspective) and created a community that reliably identifies those best at question answering. Beyond the format of the competition, important safeguards ensure individual questions are clear, unambiguous, and reward knowledge (Section 3).

We are not saying that academic QA should surrender to trivia questions or the community—far from it! The trivia community does not understand the real world information seeking needs of users or what questions challenge computers. However, they know how, given a bunch of questions, to declare that someone is better at answering questions than another. This collection of tradecraft and principles can help the QA community.

Beyond these general concepts that QA can learn from, Section 4 reviews how the "gold standard" of trivia formats, Quizbowl, can improve traditional QA. We then briefly discuss how research that uses fun, fair, and good trivia questions can benefit from the expertise, pedantry, and passion of the trivia community (Section 5).

2 Surprise, this is a Trivia Tournament!

"My research isn't a silly trivia tournament," you say. That may be, but let us first tell you a little about what running a tournament is like, and perhaps you might see similarities.

First, the questions. Either you write them yourself or you pay someone to write them (sometimes people on the Internet). There is a fixed number of questions you need to hit by a particular date.

Then, you advertise. You talk about your questions: who is writing them, what subjects are covered, and why people should try to answer them.

Next, you have the tournament. You keep your questions secure until test time, collect answers from all participants, and declare a winner. Afterward, people use the questions to train for future tournaments.
These have natural analogs to crowdsourcing questions, writing the paper, advertising, and running a leaderboard. The biases of academia put much more emphasis on the paper, but there are components where trivia tournament best practices could help. In particular, we focus on fun, well-calibrated, and discriminative tournaments.

2.1 Are we Having Fun?

Many datasets use crowdworkers to establish human accuracy (Rajpurkar et al., 2016; Choi et al., 2018). However, these are not the only humans who should answer a dataset's questions. So should the datasets' creators.

In the trivia world, this is called a play test: get in the shoes of someone answering the questions yourself. If you find them boring, repetitive, or uninteresting, so will crowdworkers; and if you, as a human, can find shortcuts to answer questions (Rondeau and Hazen, 2018), so will a computer.

Concretely, Weissenborn et al. (2017) catalog artifacts in SQuAD (Rajpurkar et al., 2018), arguably the most popular computer QA leaderboard; and indeed, many of the questions are not particularly fun to answer from a human perspective; they're fairly formulaic. If you see a list like "Along with Canada and the United Kingdom, what country. . . ", you can ignore the rest of the question and just type Ctrl+F (Russell, 2020; Yuan et al., 2019) to find the third country—Australia in this case—that appears with "Canada and the UK". Other times, a SQuAD playtest would reveal frustrating questions that are i) answerable given the information in the paragraph but not with a direct span,[1] ii) answerable only given facts beyond the given paragraph,[2] iii) unintentionally embedded in a discourse, resulting in linguistically odd questions with arbitrary correct answers,[3] or iv) non-questions.

[1] A source paragraph says "In [Commonwealth countries]. . . the term is generally restricted to. . . Private education in North America covers the whole gamut. . . "; thus, the question "What is the term private school restricted to in the US?" has the information needed but not as a span.
[2] A source paragraph says "Sculptors [in the collection include] Nicholas Stone, Caius Gabriel Cibber, [...], Thomas Brock, Alfred Gilbert, [...] and Eric Gill[.]", i.e., a list of names; thus, the question "Which British sculptor whose work includes the Queen Victoria memorial in front of Buckingham Palace is included in the V&A collection?" should be unanswerable in traditional machine reading.
[3] A question "Who else did Luther use violent rhetoric towards?" has the gold answer "writings condemning the Jews and in diatribes against Turks".

Or consider SearchQA (Dunn et al., 2017), derived from the game Jeopardy!, which asks "An article that he wrote about his riverboat days was eventually expanded into Life on the Mississippi." The young apprentice and newspaper writer who wrote the article is named Samuel Clemens; however, the reference answer is that author's later pen name, Mark Twain. Most QA evaluation metrics would count Samuel Clemens as incorrect. In a real game of Jeopardy!, this would not be an issue (Section 3.1).

Of course, fun is relative and any dataset is bound to contain at least some errors. However, playtesting is an easy way to find systematic problems in your dataset: unfair, unfun playtests make for ineffective leaderboards. Eating your own dog food can help diagnose artifacts, scoring issues, or other shortcomings early in the process.

Boyd-Graber et al. (2012) created an interface for play testing that was fun enough that people played for free. After two weeks the site was taken down, but it was popular enough that the trivia community forked the open source code to create a bootleg version that is still going strong almost a decade later.

The deeper issues when creating a QA task are: have you designed a task that is internally consistent, supported by a scoring metric that matches your goals (more on this in a moment), using gold annotations that correctly reward those who do the task well? Imagine someone who loves answering the questions your task poses: would they have fun on your task? If so, you may have a good dataset. Gamification (von Ahn, 2006) harnesses users' passion as a better motivator than traditional paid labeling. Even if you pay crowdworkers, if your questions are particularly unfun, you may need to think carefully about your dataset and your goals.

2.2 Am I Measuring what I Care About?

Answering questions requires multiple skills: identifying where an answer is mentioned (Hermann et al., 2015), knowing the canonical name for the answer (Yih et al., 2015), realizing when to answer and when to abstain (Rajpurkar et al., 2018), or being able to justify an answer explicitly with evidence (Thorne et al., 2018). In QA, the emphasis on SOTA and leaderboards has focused attention on single automatically computable metrics—systems tend to be compared by their 'SQuAD score' or their 'NQ score', as if this were all there is to say about their relative capabilities. Like QA leaderboards, trivia tournaments need to decide on a single winner, but they explicitly recognize that there are more interesting comparisons to be made.
For example, a tournament may recognize different backgrounds and resources—high school, small school, undergrads (Henzel, 2018). Similarly, more practical leaderboards would reflect training time or resource requirements (see Dodge et al., 2019), including 'constrained' or 'unconstrained' training (Bojar et al., 2014). Tournaments also give awards for specific skills (e.g., least incorrect answers). Again, there are obvious leaderboard analogs that would go beyond a single number. For example, in SQuAD 2.0, abstaining contributes the same to the overall F1 as a fully correct answer, obscuring whether a system is more precise or an effective abstainer. If the task recognizes both abilities as important, reporting a single score risks implicitly prioritizing one balance of the two.

As a positive example for being explicit, the 2018 FEVER shared task (Thorne et al., 2018) favors getting the correct final answer over exhaustively justifying said answer, with a metric that required only one piece of evidence per question. However, by still breaking out the full justification precision and recall scores in the leaderboard, it is still clear that the most precise system was only in fourth place on the "primary" leaderboard, and that the top three "primary metric" winners are compromising on the evidence selection.

2.3 Do my Questions Separate the Best?

Let us assume that you have picked a metric (or a set of metrics) that captures what you care about: systems answer questions correctly, abstain when they cannot, explain why they answered the way they did, or whatever facet of QA is most important for your dataset. Now, this leaderboard can rack up citations as people chase the top spot. But your leaderboard is only useful if it is discriminative: does it separate the best from the rest?

There are many ways questions might not be discriminative. If every system gets a question right (e.g., abstain on "asdf" or correctly answer "What is the capital of Poland?"), it does not separate participants. Similarly, if every system flubs "what is the oldest north-facing kosher restaurant", it also does not discriminate systems. Sugawara et al. (2018) call these questions "easy" and "hard"; we instead argue for a three-way distinction. In between easy questions (system answers correctly with probability 1.0) and hard (probability 0.0), questions with probabilities nearer to 0.5 are more interesting. Taking a cue from Vygotsky's proximal development theory of human learning (Chaiklin, 2003), these discriminative questions—rather than the easy or the hard ones—should help improve QA systems the most. These Goldilocks[4] questions are also most important for deciding who will sit atop the leaderboard; ideally they (and not random noise) will decide. Unfortunately, many existing datasets seem to have many easy questions; Sugawara et al. (2020) find that ablations like shuffling word order (Feng and Boyd-Graber, 2019), shuffling sentences, or only offering the most similar sentence do not impair systems, and Rondeau and Hazen (2018) hypothesize that most QA systems do little more than pattern matching, impressive performance numbers notwithstanding.

[4] In a British folktale first recorded by Robert Southey, an interloper, "Goldilocks", finds three bowls of porridge: one too hot, one too cold, and one "just right". Goldilocks questions' difficulty is likewise "just right".
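This three-way split is easy to operationalize once several systems have been run over the same questions. Below is a minimal sketch (ours, not tooling from the trivia community or from this paper's appendices) that buckets questions by the empirical probability of a correct answer; the 0.1/0.9 cutoffs are illustrative assumptions.

    import numpy as np

    def bucket_questions(correct, low=0.1, high=0.9):
        """Label each question easy, hard, or Goldilocks from a
        systems x questions matrix of 0/1 correctness."""
        p_correct = correct.mean(axis=0)   # fraction of systems answering each question correctly
        labels = np.where(p_correct >= high, "easy",
                          np.where(p_correct <= low, "hard", "goldilocks"))
        return p_correct, labels

    # Toy example: four systems, five questions.
    correct = np.array([[1, 1, 0, 1, 0],
                        [1, 0, 0, 1, 0],
                        [1, 1, 0, 0, 0],
                        [1, 0, 0, 1, 0]])
    p_correct, labels = bucket_questions(correct)
    print(list(zip(p_correct, labels)))
    # Question 1 is easy, questions 3 and 5 are hard; only questions 2 and 4 discriminate.

Only the "goldilocks" questions move systems relative to one another; the rest add cost without adding signal.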
2.4 Why so few Goldilocks Questions?

This is a common problem in trivia tournaments, particularly pub quizzes (Diamond, 2009), where questions that are too difficult can scare off patrons. Many quiz masters prefer popularity over discrimination and thus prefer easier questions.

Sometimes there are fewer Goldilocks questions not by choice, but by chance: a dataset becomes less discriminative through annotation error. All datasets have some annotation error; if this annotation error is concentrated on the Goldilocks questions, the dataset will be less useful. As we write this in 2019, humans and computers sometimes struggle on the same questions. Thus, annotation error is likely to be correlated with which questions will determine who will sit atop a leaderboard.

Figure 1 shows two datasets with the same annotation error and the same number of overall questions. However, they have different difficulty distributions and different correlations between annotation error and difficulty. The dataset that has more discriminative questions and consistent annotator error has fewer questions that are effectively useless for determining the winner of the leaderboard. We call this the effective dataset proportion ρ. A dataset with no annotation error and only discriminative questions has ρ = 1.0, while the bad dataset in Figure 1 has ρ = 0.16. Figure 2 shows the test set size required to reliably discriminate systems for different values of ρ, based on a simulation described in Appendix B.

[Figure 1: Two datasets with 0.16 annotation error, but the top better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decides who sits atop the leaderboard, requiring a larger test set.]

[Figure 2 (panels for ρ = 0.25, 0.85, 1.00; x axis: accuracy difference between systems; y axis: test questions needed): How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you'll need 15000.]
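Appendix B (not reproduced here) gives the full simulation; the sketch below conveys the flavor under our own simplifying assumptions: each question is discriminative with probability ρ, non-discriminative questions give both systems the same outcome, and we ask how large a test set must be before the genuinely better system wins at least 95% of simulated head-to-head comparisons. The accuracies and candidate sizes are illustrative, not the paper's exact settings.

    import numpy as np

    rng = np.random.default_rng(0)

    def better_system_wins(n, rho, acc_a, acc_b, trials=500):
        """Fraction of simulated test sets of size n on which the truly
        better system B outscores system A."""
        disc = rng.random((trials, n)) < rho          # which questions actually discriminate
        a = rng.random((trials, n)) < acc_a
        b = rng.random((trials, n)) < acc_b
        shared = rng.random((trials, n)) < acc_a      # non-discriminative: identical outcome for both
        score_a = np.where(disc, a, shared).sum(axis=1)
        score_b = np.where(disc, b, shared).sum(axis=1)
        return (score_b > score_a).mean()

    def needed_test_size(rho, acc_a=0.70, acc_b=0.75, target=0.95):
        for n in (500, 1000, 2500, 5000, 10000, 15000):
            if better_system_wins(n, rho, acc_a, acc_b) >= target:
                return n
        return None

    for rho in (0.25, 0.85, 1.00):
        print(rho, needed_test_size(rho))   # smaller rho -> a (much) larger test set

The qualitative conclusion matches Figure 2: wasted questions do not just add cost, they multiply the test set size needed before the leaderboard ordering can be trusted.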
At this point, you might be despairing about how big you need your dataset to be.[5] The same terror transfixes trivia tournament organizers. We discuss a technique used by them for making individual questions more discriminative using a property called pyramidality in Section 4.

[5] Indeed, using a more sophisticated simulation approach, Voorhees (2003) found that the TREC 2002 QA test set could not discriminate systems with less than a seven absolute score point difference.

3 The Craft of Question Writing

One thing that trivia enthusiasts agree on is that questions need to be good questions that are well written. Research shows that asking "good questions" requires sophisticated pragmatic reasoning (Hawkins et al., 2015), and pedagogy explicitly acknowledges the complexity of writing effective questions for assessing student performance (see ? for a book-length treatment of writing good multiple-choice questions). There is no question that writing good questions is a craft in its own right.

QA datasets, however, are often collected from the wild or written by untrained crowdworkers. Even assuming they are confident users of English (the primary language of QA datasets), crowdworkers lack experience in crafting questions and may introduce idiosyncrasies that shortcut machine learning (Geva et al., 2019). Similarly, data collected from the wild, such as Natural Questions (Kwiatkowski et al., 2019), by design have vast variations in quality. In the previous section, we focused on how datasets as a whole should be structured. Now, we focus on how specific questions should be structured to make the dataset as valuable as possible.

3.1 Avoiding Ambiguity and Assumptions

Ambiguity in questions not only frustrates answerers who resolved the ambiguity 'incorrectly'; it also undermines the usefulness of questions for precisely assessing knowledge at all. For this reason, the US Department of Transportation explicitly bans ambiguous questions from exams for flight instructors (?); and the trivia community has developed rules that prevent ambiguity from arising. While this is true in many contexts, examples are rife in a format called Quizbowl (Boyd-Graber et al., 2012), whose very long questions[6] showcase trivia writers' tactics. For example, Zhu Ying (2005 PARFAIT) warns [emphasis added]:

    He's not Sherlock Holmes, but his address is 221B. He's not the Janitor on Scrubs, but his father is played by R. Lee Ermey. [. . . ] For ten points, name this misanthropic, crippled, Vicodin-dependent central character of a FOX medical drama.
    ANSWER: Gregory House, MD

to head off these wrong answers.

[6] Like Jeopardy!, they are not syntactically questions but still are designed to elicit knowledge-based responses; for consistency, we will still call them questions.

In contrast, QA datasets often contain ambiguous and under-specified questions. While this sometimes reflects real world complexities such as actual under-specified or ill-formed search queries (Faruqui and Das, 2018; Kwiatkowski et al., 2019), simply ignoring this ambiguity is problematic. As a concrete example, consider Natural Questions (Kwiatkowski et al., 2019), where the gold answer to "what year did the us hockey team won [sic] the olympics" has the answers 1960 and 1980, ignoring the US women's team, which won in 1998 and 2018, and further assuming the query is about ice rather than field hockey (also an Olympic sport). This ambiguity is a post hoc interpretation, as the associated Wikipedia pages are, indeed, about the United States men's national ice hockey team. We contend the ambiguity persists in the original question, and the information retrieval arbitrarily provides one of many interpretations. True to its name, these under-specified queries make up a substantial (if not almost the entire) fraction of natural questions in search contexts.

The problem is neither that such questions exist nor that MRQA considers questions relative to the associated context. The problem is that tasks do not explicitly acknowledge the original ambiguity and instead gloss over the implicit assumptions used in creating the data. This introduces potential noise and bias (i.e., giving a bonus to systems that make the same assumptions as the dataset) in leaderboard rankings. At best, these will become part of the measurement error of datasets (no dataset is perfect). At worst, they will recapitulate the biases that went into the creation of the datasets. Then, the community will implicitly equate the biases with correctness: you get high scores if you adopt this set of assumptions. Then, these enter into real-world systems, further perpetuating the bias. Playtesting can reveal these issues (Section 2.1), as implicit assumptions can rob a player of correctly answered questions. If you wanted to answer 2014 to "when did Michigan last win the championship"—when the Michigan State Spartans won the Women's Cross Country championship—and you cannot because you chose the wrong school, the wrong sport, and the wrong gender, you would complain as a player; researchers instead discover latent assumptions that crept into the data.[7]

[7] Where to draw the line is a matter of judgment; computers—who lack common sense—might find questions ambiguous where humans would not.

It is worth emphasizing that this is not a purely hypothetical problem. For example, Open Domain Retrieval Question Answering (Lee et al., 2019) deliberately avoids providing a reference context for the question in its framing but, in re-purposing data such as Natural Questions, opaquely relies on it for the gold answers.

3.2 Avoiding Superficial Evaluations

A related issue is that, in the words of Voorhees and Tice (2000), "there is no such thing as a question with an obvious answer". As a consequence, trivia question authors are expected to delineate acceptable and unacceptable answers.

For example, Robert Chu (Harvard Fall XI) uses a mental model of an answerer to explicitly delineate the range of acceptable correct answers:

    In Newtonian gravity, this quantity satisfies Poisson's equation. [. . . ] For a dipole, this quantity is given by negative the dipole moment dotted with the electric field. [. . . ] For 10 points, name this form of energy contrasted with kinetic.
    ANSWER: potential energy (prompt on energy; accept specific types like electrical potential energy or gravitational potential energy; do not accept or prompt on just "potential")

Likewise, the style guides for writing questions stipulate that you must give the answer type clearly and early on. These mentions specify whether you want a book, a collection, a movement, etc. They also signal the level of specificity requested. For example, a question about a date must state "day and month required" (September 11), "month and year required" (April 1968), or "day, month, and year required" (September 1, 1939). This is true for other answers as well: city and team, party and country, or more generally "two answers required". Despite all of these conventions, no pre-defined set of answers is perfect, and a process for adjudicating answers is an integral part of trivia competitions.

In major high school and college national competitions and game shows, if low-level staff cannot resolve the issue by either throwing out a single question or accepting minor variations (America instead of USA), the low-level staff contacts the tournament director. The tournament director—with their deeper knowledge of rules and questions—often decides the issue. If not, the protest goes through an adjudication process designed to minimize bias:[8] write the summary of the dispute, get all parties to agree to the summary, and then hand the decision off to mutually agreed experts from the tournament's phone tree. The substance of the disagreement is communicated (without identities), and the experts apply the rules and decide.

[8] https://www.naqt.com/rules/#protest

For example, a particularly inept Jeopardy! contestant[9] answered "endoscope" to "Your surgeon could choose to take a look inside you with this type of fiber-optic instrument". Since the van Doren scandal (Freedman, 1997), every television trivia contestant has an advocate assigned from an auditing company. In this case, the advocate initiated a process that went to a panel of judges who then ruled that endoscope (a more general term) was also correct.

[9] http://www.j-archive.com/showgame.php?game_id=6112

The need for a similar process seems to have been well-recognized in the earliest days of QA system bake-offs such as TREC-QA, and Voorhees (2008) notes that

    [d]ifferent QA runs very seldom return exactly the same [answer], and it is quite difficult to determine automatically whether the difference [. . . ] is significant.

In stark contrast to this, QA datasets typically only provide a single string or, if one is lucky, several strings. A correct answer means exactly matching these strings or at least having a high token overlap F1, and failure to agree with the pre-recorded admissible answers will put you at an uncontestable disadvantage on the leaderboard (Section 2.2).
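To make the mechanics concrete, here is a short sketch of the kind of scoring this paragraph refers to, paraphrasing the usual SQuAD-style exact match and token-overlap F1 (normalization details may differ slightly from any official evaluation script):

    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, strip punctuation and articles, collapse whitespace."""
        text = "".join(c for c in text.lower() if c not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def token_f1(prediction, gold):
        pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def score(prediction, golds):
        """Best exact match and token F1 over the pre-recorded admissible strings."""
        exact = max(float(normalize(prediction) == normalize(g)) for g in golds)
        f1 = max(token_f1(prediction, g) for g in golds)
        return exact, f1

    print(score("Samuel Clemens", ["Mark Twain"]))   # (0.0, 0.0): the right person, zero credit

The SearchQA example from Section 2.1 falls through exactly this crack: unless every admissible surface form is enumerated ahead of time, a correct answerer is penalized with no recourse.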
To illustrate how current evaluations fall short of meaningful discrimination, we qualitatively analyze two near-SOTA systems on SQuAD V1.1 on CodaLab: the original XLNet (Yang et al., 2019) and a subsequent iteration called XLNet-123.[10] Despite XLNet-123's margin of almost four absolute F1 (94 vs 98) on development data, a manual inspection of a sample of 100 of XLNet-123's wins indicates that around two-thirds are 'spurious': 56% are likely to be considered not only equally good but essentially identical; 7% are cases where the answer set omits a correct alternative; and 5% of cases are 'bad' questions.[11]

[10] We could not find a paper describing XLNet-123 in detail; the submission is by http://tia.today.
[11] Examples in Appendix C.

Our goal is not to dwell on the exact proportions, minimize the achievements of these strong systems, or minimize the usefulness of quantitative evaluations. We merely want to raise the limitation of 'blind automation' for distinguishing between systems on a leaderboard.

Taking our cue from the trivia community, here is an alternative for MRQA. Test sets are only created for a specific time; all systems are submitted simultaneously. Then, all questions and answers are revealed. System authors can protest correctness rulings on questions, directly addressing the issues above. After agreement is reached, quantitative metrics are computed for comparison purposes—despite their inherent limitations, they at least can be trusted. Adopting this for MRQA would require creating a new, smaller test set every year. However, this would gradually refine the annotations and process.

This suggestion is not novel: Voorhees and Tice (2000) conclude that automatic evaluations are sufficient "for experiments internal to an organization where the benefits of a reusable test collection are most significant (and the limitations are likely to be understood)" (our emphasis) but that "satisfactory techniques for [automatically] evaluating new runs" have not been found yet. We are not aware of any change on this front—if anything, we seem to have become more insensitive as a community to just how limited our current evaluations are.

3.3 Focus on the Bubble

While every question should be perfect, time and resources are limited. Thus, authors of tournaments have a policy of "focusing on the bubble", where the "bubble" is the set of questions most likely to discriminate between top teams.

For humans, authors and editors focus on the questions and clues that they predict will decide the tournament. These questions are thoroughly playtested, vetted, and edited. Only after these questions have been perfected will the other questions undergo the same level of polish.

For computers, the same logic applies. Authors should ensure that these questions are correct, free of ambiguity, and unimpeachable. However, as far as we can tell, the authors of QA datasets do not give any special attention to these questions.

Unlike a human trivia tournament, however—with the finite patience of the participants—this does not mean that you should necessarily remove all of the easy or hard questions from your dataset; rather, spend more of your time/effort/resources on the bubble. You would not want to introduce a sampling bias that leads to inadvertently forgetting how to answer questions like "who is buried in Grant's tomb?" (Dwan, 2000, Chapter 7).
4 Why Quizbowl is the Gold Standard

We now focus our thus far wide-ranging QA discussion on a specific format: Quizbowl, which has many of the desirable properties outlined above. We have no delusion that mainstream QA will universally adopt this format. However, given the community's emphasis on fair evaluation, computer QA can borrow aspects from the gold standard of human QA. We discuss what this may look like in Section 5, but first we describe the gold standard of human QA.

We have shown several examples of Quizbowl questions, but we have not yet explained in detail how the format works; see Rodriguez et al. (2019) for a more comprehensive description. You might be scared off by how long the questions are. However, in real Quizbowl trivia tournaments, they are not finished because the questions are designed to be interrupted.

Interruptable: A moderator reads a question. Once someone knows the answer, they use a signaling device to "buzz in". If the player who buzzed is right, they get points. Otherwise, they lose points and the question continues for the other team.

Not all trivia games with buzzers have this property, however. For example, take Jeopardy!, the subject of Watson's tour de force (Ferrucci et al., 2010). While Jeopardy! also uses signaling devices, these only work at the end of the question; Ken Jennings, one of the top Jeopardy! players (and also a Quizbowler), explains it on a Planet Money interview (Malone, 2019):

    Jennings: The buzzer is not live until Alex finishes reading the question. And if you buzz in before your buzzer goes live, you actually lock yourself out for a fraction of a second. So the big mistake on the show is people who are all adrenalized and are buzzing too quickly, too eagerly.
    Malone: OK. To some degree, Jeopardy! is kind of a video game, and a crappy video game where it's, like, light goes on, press button—that's it.
    Jennings: (Laughter) Yeah.

Thus, Jeopardy!'s buzzers are a gimmick to ensure good television; however, Quizbowl buzzers discriminate knowledge (Section 2.3). Similarly, while TriviaQA (Joshi et al., 2017) is written by knowledgeable writers, the questions are not pyramidal.

Pyramidal: Recall that an effective dataset (tournament) discriminates the best from the rest—the higher the proportion of effective questions ρ, the better. Quizbowl's ρ is nearly 1.0 because discrimination happens within a question: after every word, an answerer must decide whether they have enough information to answer the question. Quizbowl questions are arranged so that questions are maximally pyramidal: questions begin with hard clues—ones that require deep understanding—and move to more accessible clues that are well known.
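For a computer, the incremental decision behind Interruptable and Pyramidal reduces to: keep reading, update a guess, and commit once the guess is trustworthy. The following sketch is ours—a bare confidence-threshold policy standing in for the reinforcement-learning agents the literature above actually uses—and the toy guesser is entirely hypothetical.

    from typing import Callable, List, Optional, Tuple

    def play_question(words: List[str],
                      guesser: Callable[[str], Tuple[str, float]],
                      threshold: float = 0.8) -> Tuple[Optional[str], int]:
        """Reveal the question one word at a time; buzz once the guesser's
        confidence clears the threshold. Returns (guess, words seen), or
        (None, len(words)) if the agent never buzzes."""
        seen: List[str] = []
        for word in words:
            seen.append(word)
            guess, confidence = guesser(" ".join(seen))
            if confidence >= threshold:
                return guess, len(seen)
        return None, len(words)

    # Hypothetical guesser: a real system would return a model's top answer and its probability.
    def toy_guesser(prefix: str) -> Tuple[str, float]:
        confident = "221B" in prefix and "Scrubs" in prefix
        return "Gregory House", 0.9 if confident else 0.2

    question = ("He's not Sherlock Holmes, but his address is 221B. "
                "He's not the Janitor on Scrubs, but his father is played by R. Lee Ermey.")
    print(play_question(question.split(), toy_guesser))   # buzzes as soon as both clues are seen

Earlier correct buzzes beat later ones, which is exactly what makes every word of a pyramidal question contribute to discrimination.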
Well-Edited: Quizbowl questions are created in phases. First, the author selects the answer and assembles (pyramidal) clues. A subject editor then removes ambiguity, adjusts acceptable answers, and tweaks clues to optimize discrimination. Finally, a packetizer ensures the overall set is diverse, has uniform difficulty, and is without repeats.

Unnatural: Trivia questions are fake: the asker already knows the answer. But they're no more fake than a course's final exam, which, like leaderboards, is designed to test knowledge.

Experts know when questions are ambiguous; while "what play has a character whose father is dead" could be Hamlet, Antigone, or Proof, a good writer's knowledge avoids the ambiguity. When authors omit these cues, the question is derided as a "hose" (Eltinge, 2013), which robs the tournament of fun (Section 2.1).

One of the benefits of contrived formats is a focus on specific phenomena. Dua et al. (2019) exclude questions an existing MRQA system could answer to focus on challenging quantitative reasoning. One of the trivia experts consulted in Wallace et al. (2019) crafted a question that tripped up neural QA systems with "this author opens Crime and Punishment"; the top system confidently answers Fyodor Dostoyevski. However, that phrase was embedded in a longer question, "The narrator in Cogwheels by this author opens Crime and Punishment to find it has become The Brothers Karamazov". Again, this shows the inventiveness and linguistic dexterity of the trivia community.

A counterargument is that when real humans ask questions—e.g., on Yahoo! Questions (Szpektor and Dror, 2013), Quora (Iyer et al., 2017) or web search (Kwiatkowski et al., 2019)—they ignore the craft of question writing. Real humans tend to react to unclear questions with confusion or divergent answers, often making explicit in their answers how they interpreted the original question ("I assume you meant. . . ").

Given that real world applications will have to deal with the inherent noise and ambiguity of unclear questions, our systems must cope with it. However, dealing with it is not the same as glossing over their complexities.

Complicated: Quizbowl is more complex than other datasets. Unlike other datasets where you just need to decide what to answer, you also need to choose when to answer the question. While this improves the dataset's discrimination, it can hurt popularity because you cannot copy/paste code from other QA tasks. The complicated pyramidal structure also makes some questions—e.g., what is log base four of sixty-four—difficult[12] to ask. However, the underlying mechanisms (e.g., reinforcement learning) share properties with other tasks, such as simultaneous translation (Grissom II et al., 2014; Ma et al., 2019), human incremental processing (Levy et al., 2008; Levy, 2011), and opponent modeling (He et al., 2016).

[12] But not always impossible, as IHSSBCA shows: "This is the smallest counting number which is the radius of a sphere whose volume is an integer multiple of π. It is also the number of distinct real solutions to the equation x^7 - 19x^5 = 0. This number also gives the ratio between the volumes of a cylinder and a cone with the same heights and radii. Give this number equal to the log base four of sixty-four."

5 A Call to Action

You may disagree with the superiority of Quizbowl as a QA framework (even for trivia nerds, not all agree. . . de gustibus non est disputandum). In this final section, we hope to distill our advice into a call to action regardless of your question format of choice. Here are our recommendations if you want to have an effective leaderboard.

Talk to Trivia Nerds: You should talk to trivia nerds because they have useful information (not just about the election of 1876). Trivia is not just the accumulation of information but also connecting disparate facts (Jennings, 2006). These skills are exactly those that we want computers to develop.

Trivia nerds are writing questions anyway; we can save money and time if we pool resources. Computer scientists benefit if the trivia community writes questions that aren't trivial for computers to solve (e.g., avoiding quotes and named entities). The trivia community benefits from tools that make their job easier: show related questions, link to Wikipedia, or predict where humans will answer.

Likewise, the broader public has unique knowledge and skills. In contrast to low-paid crowdworkers, public platforms for question answering and citizen science (Bowser et al., 2013) are brimming with free expertise if you can engage the relevant communities. For example, the Quora query "Is there a nuclear control room on nuclear aircraft carriers?" is purportedly answered by someone who worked in such a room (Humphries, 2017). As machine learning algorithms improve, the "good enough" crowdsourcing that got us this far may simply not be enough for continued progress.

Many question answering datasets benefit from the efforts of the trivia community. Ethically using the data requires acknowledging their contributions and using their input to create datasets (Jo and Gebru, 2020, Consent and Inclusivity).

Eat Your Own Dog Food: As you develop new question answering tasks, you should feel comfortable playing the task as a human. Importantly, this is not just to replicate what crowdworkers are doing (also important) but to remove hidden assumptions, institute fair metrics, and define the task well. For this to feel real, you will need to keep score; have all of your coauthors participate and compare their scores.

Again, we emphasize that human and computer skills are not identical, but this is a benefit: humans' natural aversion to unfairness will help you create a better task, while computers will blindly optimize a broken objective function (Bostrom, 2003; ?). As you go through the process of playing on your question–answer dataset, you can see where you might have fallen short on the goals we outline in Section 3.
Won't Somebody Look at the Data?: After QA datasets are released, there should also be deeper, more frequent discussion of actual questions within the NLP community. Part of every post-mortem of trivia tournaments is a detailed discussion of the questions, where good questions are praised and bad questions are excoriated. This is not meant to shame the writers but rather to help build and reinforce cultural norms: questions should be well-written, precise, and fulfill the creator's goals. Just like trivia tournaments, QA datasets resemble a product for sale. Creators want people to invest time and sometimes money (e.g., GPU hours) in using their data and submitting to their leaderboards. It is "good business" to build a reputation for quality questions and discussing individual questions.

Similarly, discussing and comparing the actual predictions made by the competing systems should be part of any competition culture—without it, it is hard to tell what a couple of points on some leaderboard mean. To make this possible, we recommend that leaderboards include an easy way for anyone to download a system's development predictions for qualitative analyses.

Make Questions Discriminative: We argue that questions should be discriminative (Section 2.3), and while Quizbowl is one solution (Section 4), not everyone is crazy enough to adopt this (beautiful) format. For more traditional QA tasks, you can maximize the usefulness of your dataset by ensuring as many questions as possible are challenging (but not impossible) for today's QA systems.

But you can use some Quizbowl intuitions to improve discrimination. In visual QA, you can offer increasing resolutions of the image. For other settings, create pyramidality by adding metadata: coreference, disambiguation, or alignment to a knowledge base. In short, consider multiple versions/views of your data that progress from difficult to easy. This not only makes more of your dataset discriminative but also reveals what makes a question answerable.

Embrace Multiple Answers or Specify Specificity: As QA moves to more complicated formats and answer candidates, what constitutes a correct answer becomes more complicated. Fully automatic evaluations are valuable for both training and quick-turnaround evaluation. In cases where annotators disagree, the question should explicitly state what level of specificity is required (e.g., September 1, 1939 vs. 1939 or Leninism vs. socialism). Or, if not all questions have a single answer, link answers to a knowledge base with multiple surface forms or explicitly enumerate which answers are acceptable.
Appreciate Ambiguity: If your intended QA application has to handle ambiguous questions, do justice to the ambiguity by making it part of your task—for example, recognize the original ambiguity and resolve it ("did you mean. . . ") instead of giving credit for happening to 'fit the data'.

To ensure that our datasets properly "isolate the property that motivated it in the first place" (?), we need to explicitly appreciate the unavoidable ambiguity instead of silently glossing over it.[13] This is already an active area of research, with conversational QA being a new setting actively explored by several datasets (Reddy et al., 2018; Choi et al., 2018); and other work explicitly focusing on identifying useful clarification questions (Rao and Daumé III, 2018), thematically linked questions (Elgohary et al., 2018), or resolving ambiguities that arise from coreference or pragmatic constraints by rewriting underspecified question strings in context (Elgohary et al., 2019).

[13] Not surprisingly, 'inherent' ambiguity is not limited to QA; Pavlick and Kwiatkowski (2019) show natural language inference data have 'inherent disagreements' between humans and advocate for recovering the full range of accepted inferences.

Revel in Spectacle: However, with more complicated systems and evaluations, a return to the yearly evaluations of TREC QA may be the best option. This improves not only the quality of evaluation (we can have real-time human judging) but also lets the test set reflect the build it/break it cycle (Ruef et al., 2016), as attempted by the 2019 iteration of FEVER. Moreover, another lesson the QA community could learn from trivia games is to turn it into a spectacle: exciting games with a telegenic host. This has a benefit to the public, who see how QA systems fail on difficult questions, and to QA researchers, who have a spoonful of fun sugar to inspect their systems' output and their competitors'.

In between are automatic metrics that mimic the flexibility of human raters, inspired by machine translation evaluations (Papineni et al., 2002; Specia and Farzindar, 2010) or summarization (Lin, 2004). However, we should not forget that these metrics were introduced as 'understudies'—good enough when quick evaluations are needed for system building but no substitute for a proper evaluation. In machine translation, Laubli et al. (2020) reveal that crowdworkers cannot spot the errors that neural MT systems make—fortunately, trivia nerds are cheaper than professional translators.

Be Honest in Crowning QA Champions: While—particularly for leaderboards—it is tempting to turn everything into a single number, recognize that there are often different sub-tasks and types of players who deserve recognition. A simple model that requires less training data or runs in under ten milliseconds may be objectively more useful than a bloated, brittle monster of a system that has a slightly higher F1 (Dodge et al., 2019). While you may only rank by a single metric (this is what trivia tournaments do too), you may want to recognize the highest-scoring model that was built by undergrads, took no more than one second per example, was trained only on Wikipedia, etc.

Finally, if you want to make human–computer comparisons, pick the right humans. Paraphrasing a participant of the 2019 MRQA workshop (Fisch et al., 2019), a system better than the average human at brain surgery does not imply superhuman performance in brain surgery. Likewise, beating a distracted crowdworker on QA is not QA's endgame. If your task is realistic, fun, and challenging, you will find experts to play against your computer. Not only will this give you human baselines worth reporting—they can also tell you how to fix your QA dataset. . . after all, they've been at it longer than you have.

Acknowledgements

Many thanks to Massimiliano Ciaramita, Jon Clark, Christian Buck, Emily Pitler, and Michael Collins for insightful discussions that helped frame these ideas. Thanks to Kevin Kwok for permission to use the Protobowl screenshot and information.

References

Luis von Ahn. 2006. Games with a purpose. Computer, 39:92–94.
David Baber. 2015. Television Game Show Hosts: Biographies of 32 Stars. McFarland.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Institute of Advanced Studies in Systems Research and Cybernetics, 2:12–17.
Anne Bowser, Derek Hansen, Yurong He, Carol Boston, Matthew Reid, Logan Gunnell, and Jennifer Preece. 2013. Using gamification to inspire new citizen science volunteers. In Proceedings of the First International Conference on Gameful Design, Research, and Applications (Gamification '13).
Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.
Seth Chaiklin. 2003. The zone of proximal development in Vygotsky's analysis of learning and instruction. In Learning in Doing: Social, Cognitive and Computational Perspectives. Cambridge University Press.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing.
Anthony Cuthbertson. 2018. Robots can now read better than humans, putting millions of jobs at risk.
Paul Diamond. 2009. How To Make 100 Pounds A Night (Or More) As A Pub Quizmaster. DP Quiz.
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of EMNLP-IJCNLP.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Conference of the North American Chapter of the Association for Computational Linguistics.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
R. Dwan. 2000. As Long as They're Laughing: Groucho Marx and You Bet Your Life. Midnight Marquee Press.
Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Empirical Methods in Natural Language Processing.
Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. Dataset and baselines for sequential open-domain question answering. In Empirical Methods in Natural Language Processing.
Stephen Eltinge. 2013. Quizbowl lexicon.
Manaal Faruqui and Dipanjan Das. 2018. Identifying well-formed natural language questions. In Proceedings of Empirical Methods in Natural Language Processing.
Shi Feng and Jordan Boyd-Graber. 2019. What AI can do for me: Evaluating machine learning interpretations in cooperative play. In International Conference on Intelligent User Interfaces.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3).
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen, editors. 2019. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics.
Morris Freedman. 1997. The fall of Charlie Van Doren. The Virginia Quarterly Review, 73(1):157–165.
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of EMNLP-IJCNLP.
Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of Empirical Methods in Natural Language Processing.
Robert X. D. Hawkins, Andreas Stuhlmüller, Judith Degen, and Noah D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society.
He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning.
R. Robert Henzel. 2018. NAQT eligibility overview.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems.
Bryan Humphries. 2017. Is there a nuclear control room on nuclear aircraft carriers?
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. Quora question pairs.
Ken Jennings. 2006. Brainiac: Adventures in the Curious, Competitive, Compulsive World of Trivia Buffs. Villard.
Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Samuel Laubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human–machine parity in language translation. Journal of Artificial Intelligence Research, 67.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics.
Roger Levy. 2011. Integrating surprisal and uncertain-input models in online sentence comprehension: Formal techniques and empirical results. In Proceedings of the Association for Computational Linguistics.
Roger P. Levy, Florencia Reali, and Thomas L. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Proceedings of Advances in Neural Information Processing Systems.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Kenny Malone. 2019. How uncle Jamie broke Jeopardy.
Adam Najberg. 2018. Alibaba AI model tops humans in reading comprehension.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.
Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan L. Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. CoRR, abs/1904.04792.
Marc-Antoine Rondeau and T. J. Hazen. 2018. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on Machine Reading for Question Answering.
Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
Dan Russell. 2020. Why control-F is the single most important thing you can teach someone about search. SearchReSearch.
Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER. In AMTA 2010 Workshop: Bringing MT to the User.
Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of Empirical Methods in Natural Language Processing.
Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Association for the Advancement of Artificial Intelligence.
Idan Szpektor and Gideon Dror. 2013. From query to question in one click: Suggesting synthetic questions to searchers. In Proceedings of WWW 2013.
David Taylor, Colin McNulty, and Jo Meek. 2012. Your starter for ten: 50 years of University Challenge.
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics.
Ellen M. Voorhees. 2003. Evaluating the evaluation: A case study using the TREC 2002 question answering track. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.
Ellen M. Voorhees. 2008. Evaluating question answering system performance. Pages 409–430, Springer Netherlands, Dordrecht.
Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association of Computational Linguistics, 10.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Conference on Computational Natural Language Learning.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
Xingdi Yuan, Jie Fu, Marc-Alexandre Cote, Yi Tay, Christopher Pal, and Adam Trischler. 2019. Interactive machine comprehension with information seeking agents.