What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber (iSchool, CS, UMIACS, LSC, University of Maryland; Google Research, Zürich) jbg@umiacs.umd.edu
Benjamin Börschinger (Google Research, Zürich) {jbg, bboerschinger}@google.com

arXiv:1910.14464v3 [cs.CL] 21 Apr 2020

Abstract

In addition to machines answering questions, question answering (QA) research creates interesting, challenging questions that reveal the best systems. We argue that creating a QA dataset—and its ubiquitous leaderboard—closely resembles running a trivia tournament: you write questions, have agents—humans or machines—answer questions, and declare a winner. However, the research community has ignored the lessons from decades of the trivia community creating vibrant, fair, and effective QA competitions. After detailing problems with existing QA datasets, we outline several lessons that transfer to QA research: removing ambiguity, discriminating skill, and adjudicating disputes.

1 Introduction

This paper takes an unconventional analysis to answer "where we've been and where we're going" in question answering (QA). Instead of approaching the question only as ACL researchers, we apply the best practices of trivia tournaments to QA datasets.

The QA community is obsessed with evaluation. Schools, companies, and newspapers hail new SOTAs and topping leaderboards, giving rise to claims that an "AI model tops humans" (Najberg, 2018) because it 'won' some leaderboard, putting "millions of jobs at risk" (Cuthbertson, 2018). But what is a leaderboard? A leaderboard is a statistic about QA accuracy that induces a ranking over participants.

Newsflash: this is the same as a trivia tournament. The trivia community has been doing this for decades (Jennings, 2006); Section 2 details this overlap between the qualities of a first-class QA dataset (and its requisite leaderboard). The experts running these tournaments are imperfect, but they've learned from their past mistakes (see Appendix A for a brief historical perspective) and created a community that reliably identifies those best at question answering. Beyond the format of the competition, important safeguards ensure individual questions are clear, unambiguous, and reward knowledge (Section 3).

We are not saying that academic QA should surrender to trivia questions or the community—far from it! The trivia community does not understand the real world information seeking needs of users or what questions challenge computers. However, they know how, given a bunch of questions, to declare that someone is better at answering questions than another. This collection of tradecraft and principles can help the QA community.

Beyond these general concepts that QA can learn from, Section 4 reviews how the "gold standard" of trivia formats, Quizbowl, can improve traditional QA. We then briefly discuss how research that uses fun, fair, and good trivia questions can benefit from the expertise, pedantry, and passion of the trivia community (Section 5).

2 Surprise, this is a Trivia Tournament!

"My research isn't a silly trivia tournament," you say. That may be, but let us first tell you a little about what running a tournament is like, and perhaps you might see similarities.

First, the questions. Either you write them yourself or you pay someone to write them (sometimes people on the Internet). There is a fixed number of questions you need to hit by a particular date.

Then, you advertise. You talk about your questions: who is writing them, what subjects are covered, and why people should try to answer them.

Next, you have the tournament. You keep your questions secure until test time, collect answers from all participants, and declare a winner. Afterward, people use the questions to train for future tournaments.
These have natural analogs to crowdsourcing questions, writing the paper, advertising, and running a leaderboard. The biases of academia put much more emphasis on the paper, but there are components where trivia tournament best practices could help. In particular, we focus on fun, well-calibrated, and discriminative tournaments.

2.1 Are we Having Fun?

Many datasets use crowdworkers to establish human accuracy (Rajpurkar et al., 2016; Choi et al., 2018). However, these are not the only humans who should answer a dataset's questions. So should the datasets' creators.

In the trivia world, this is called a play test: get in the shoes of someone answering the questions yourself. If you find them boring, repetitive, or uninteresting, so will crowdworkers; and if you, as a human, can find shortcuts to answer questions (Rondeau and Hazen, 2018), so will a computer.

Concretely, Weissenborn et al. (2017) catalog artifacts in SQuAD (Rajpurkar et al., 2018), arguably the most popular computer QA leaderboard; and indeed, many of the questions are not particularly fun to answer from a human perspective; they're fairly formulaic. If you see a list like "Along with Canada and the United Kingdom, what country. . . ", you can ignore the rest of the question and just type Ctrl+F (Russell, 2020; Yuan et al., 2019) to find the third country—Australia in this case—that appears with "Canada and the UK". Other times, a SQuAD playtest would reveal frustrating questions that are i) answerable given the information in the paragraph but not with a direct span,[1] ii) answerable only given facts beyond the given paragraph,[2] iii) unintentionally embedded in a discourse, resulting in linguistically odd questions with arbitrary correct answers,[3] or iv) non-questions.

[1] A source paragraph says "In [Commonwealth countries]. . . the term is generally restricted to. . . Private education in North America covers the whole gamut. . . "; thus, the question "What is the term private school restricted to in the US?" has the information needed but not as a span.
[2] A source paragraph says "Sculptors [in the collection include] Nicholas Stone, Caius Gabriel Cibber, [...], Thomas Brock, Alfred Gilbert, [...] and Eric Gill[.]", i.e., a list of names; thus, the question "Which British sculptor whose work includes the Queen Victoria memorial in front of Buckingham Palace is included in the V&A collection?" should be unanswerable in traditional machine reading.
[3] A question "Who else did Luther use violent rhetoric towards?" has the gold answer "writings condemning the Jews and in diatribes against Turks".

Or consider SearchQA (Dunn et al., 2017), derived from the game Jeopardy!, which asks "An article that he wrote about his riverboat days was eventually expanded into Life on the Mississippi." The young apprentice and newspaper writer who wrote the article is named Samuel Clemens; however, the reference answer is that author's later pen name, Mark Twain. Most QA evaluation metrics would count Samuel Clemens as incorrect. In a real game of Jeopardy!, this would not be an issue (Section 3.1).

Of course, fun is relative and any dataset is bound to contain at least some errors. However, playtesting is an easy way to find systematic problems in your dataset: unfair, unfun playtests make for ineffective leaderboards. Eating your own dog food can help diagnose artifacts, scoring issues, or other shortcomings early in the process.

Boyd-Graber et al. (2012) created an interface for play testing that was fun enough that people played for free. After two weeks the site was taken down, but it was popular enough that the trivia community forked the open source code to create a bootleg version that is still going strong almost a decade later.

The deeper issues when creating a QA task are: have you designed a task that is internally consistent, supported by a scoring metric that matches your goals (more on this in a moment), using gold annotations that correctly reward those who do the task well? Imagine someone who loves answering the questions your task poses: would they have fun on your task? If so, you may have a good dataset. Gamification (von Ahn, 2006) harnesses users' passion as a better motivator than traditional paid labeling. Even if you pay crowdworkers, if your questions are particularly unfun, you may need to think carefully about your dataset and your goals.

2.2 Am I Measuring what I Care About?

Answering questions requires multiple skills: identifying where an answer is mentioned (Hermann et al., 2015), knowing the canonical name for the answer (Yih et al., 2015), realizing when to answer and when to abstain (Rajpurkar et al., 2018), or being able to justify an answer explicitly with evidence (Thorne et al., 2018). In QA, the emphasis on SOTA and leaderboards has focused attention on single automatically computable metrics—systems tend to be compared by their 'SQuAD score' or their 'NQ score', as if this were all there is to say about their relative capabilities. Like QA leaderboards, trivia tournaments need to decide on a single winner, but they explicitly recognize that there are more interesting comparisons to be made.
For example, a tournament may recognize different backgrounds and resources—high school, small school, undergrads (Henzel, 2018). Similarly, more practical leaderboards would reflect training time or resource requirements (see Dodge et al., 2019), including 'constrained' or 'unconstrained' training (Bojar et al., 2014). Tournaments also give awards for specific skills (e.g., least incorrect answers). Again, there are obvious leaderboard analogs that would go beyond a single number. For example, in SQuAD 2.0, abstaining contributes the same to the overall F1 as a fully correct answer, obscuring whether a system is more precise or an effective abstainer. If the task recognizes both abilities as important, reporting a single score risks implicitly prioritizing one balance of the two.

As a positive example for being explicit, the 2018 FEVER shared task (Thorne et al., 2018) favors getting the correct final answer over exhaustively justifying said answer, with a metric that required only one piece of evidence per question. However, by still breaking out the full justification precision and recall scores in the leaderboard, it is still clear that the most precise system was only in fourth place on the "primary" leaderboard, and that the top three "primary metric" winners are compromising on the evidence selection.

2.3 Do my Questions Separate the Best?

Let us assume that you have picked a metric (or a set of metrics) that captures what you care about: systems answer questions correctly, abstain when they cannot, explain why they answered the way they did, or whatever facet of QA is most important for your dataset. Now, this leaderboard can rack up citations as people chase the top spot. But your leaderboard is only useful if it is discriminative: does it separate the best from the rest?

There are many ways questions might not be discriminative. If every system gets a question right (e.g., abstain on "asdf" or correctly answer "What is the capital of Poland?"), it does not separate participants. Similarly, if every system flubs "what is the oldest north-facing kosher restaurant", it also does not discriminate systems. Sugawara et al. (2018) call these questions "easy" and "hard"; we instead argue for a three-way distinction. In between easy questions (system answers correctly with probability 1.0) and hard (probability 0.0), questions with probabilities nearer to 0.5 are more interesting. Taking a cue from Vygotsky's proximal development theory of human learning (Chaiklin, 2003), these discriminative questions—rather than the easy or the hard ones—should help improve QA systems the most. These Goldilocks[4] questions are also most important for deciding who will sit atop the leaderboard; ideally they (and not random noise) will decide. Unfortunately, many existing datasets seem to have many easy questions; Sugawara et al. (2020) find that ablations like shuffling word order (Feng and Boyd-Graber, 2019), shuffling sentences, or only offering the most similar sentence do not impair systems, and Rondeau and Hazen (2018) hypothesize that most QA systems do little more than pattern matching, impressive performance numbers notwithstanding.

[4] In a British folktale first recorded by Robert Southey, an interloper, "Goldilocks", finds three bowls of porridge: one too hot, one too cold, and one "just right". Goldilocks questions' difficulty is likewise "just right".
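This three-way split is easy to operationalize once several systems have been run over the same questions. Below is a minimal sketch (ours, not tooling from the trivia community or from this paper's appendices) that buckets questions by the empirical probability of a correct answer; the 0.1/0.9 cutoffs are illustrative assumptions.

    import numpy as np

    def bucket_questions(correct, low=0.1, high=0.9):
        """Label each question easy, hard, or Goldilocks from a
        systems x questions matrix of 0/1 correctness."""
        p_correct = correct.mean(axis=0)   # fraction of systems answering each question correctly
        labels = np.where(p_correct >= high, "easy",
                          np.where(p_correct <= low, "hard", "goldilocks"))
        return p_correct, labels

    # Toy example: four systems, five questions.
    correct = np.array([[1, 1, 0, 1, 0],
                        [1, 0, 0, 1, 0],
                        [1, 1, 0, 0, 0],
                        [1, 0, 0, 1, 0]])
    p_correct, labels = bucket_questions(correct)
    print(list(zip(p_correct, labels)))
    # Question 1 is easy, questions 3 and 5 are hard; only questions 2 and 4 discriminate.

Only the "goldilocks" questions move systems relative to one another; the rest add cost without adding signal.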
2.4 Why so few Goldilocks Questions?

This is a common problem in trivia tournaments, particularly pub quizzes (Diamond, 2009), where questions that are too difficult can scare off patrons. Many quiz masters prefer popularity over discrimination and thus prefer easier questions.

Sometimes there are fewer Goldilocks questions not by choice, but by chance: a dataset becomes less discriminative through annotation error. All datasets have some annotation error; if this annotation error is concentrated on the Goldilocks questions, the dataset will be less useful. As we write this in 2019, humans and computers sometimes struggle on the same questions. Thus, annotation error is likely to be correlated with which questions will determine who will sit atop a leaderboard.

Figure 1 shows two datasets with the same annotation error and the same number of overall questions. However, they have different difficulty distributions and different correlations between annotation error and difficulty. The dataset that has more discriminative questions and consistent annotator error has fewer questions that are effectively useless for determining the winner of the leaderboard. We call this the effective dataset proportion ρ. A dataset with no annotation error and only discriminative questions has ρ = 1.0, while the bad dataset in Figure 1 has ρ = 0.16. Figure 2 shows the test set size required to reliably discriminate systems for different values of ρ, based on a simulation described in Appendix B.

[Figure 1: Two datasets with 0.16 annotation error, but the top better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decides who sits atop the leaderboard, requiring a larger test set.]

[Figure 2 (panels for ρ = 0.25, 0.85, 1.00; x axis: accuracy difference between systems; y axis: test questions needed): How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you'll need 15000.]
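Appendix B (not reproduced here) gives the full simulation; the sketch below conveys the flavor under our own simplifying assumptions: each question is discriminative with probability ρ, non-discriminative questions give both systems the same outcome, and we ask how large a test set must be before the genuinely better system wins at least 95% of simulated head-to-head comparisons. The accuracies and candidate sizes are illustrative, not the paper's exact settings.

    import numpy as np

    rng = np.random.default_rng(0)

    def better_system_wins(n, rho, acc_a, acc_b, trials=500):
        """Fraction of simulated test sets of size n on which the truly
        better system B outscores system A."""
        disc = rng.random((trials, n)) < rho          # which questions actually discriminate
        a = rng.random((trials, n)) < acc_a
        b = rng.random((trials, n)) < acc_b
        shared = rng.random((trials, n)) < acc_a      # non-discriminative: identical outcome for both
        score_a = np.where(disc, a, shared).sum(axis=1)
        score_b = np.where(disc, b, shared).sum(axis=1)
        return (score_b > score_a).mean()

    def needed_test_size(rho, acc_a=0.70, acc_b=0.75, target=0.95):
        for n in (500, 1000, 2500, 5000, 10000, 15000):
            if better_system_wins(n, rho, acc_a, acc_b) >= target:
                return n
        return None

    for rho in (0.25, 0.85, 1.00):
        print(rho, needed_test_size(rho))   # smaller rho -> a (much) larger test set

The qualitative conclusion matches Figure 2: wasted questions do not just add cost, they multiply the test set size needed before the leaderboard ordering can be trusted.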
At this point, you might be despairing about how big you need your dataset to be.[5] The same terror transfixes trivia tournament organizers. We discuss a technique used by them for making individual questions more discriminative using a property called pyramidality in Section 4.

[5] Indeed, using a more sophisticated simulation approach, Voorhees (2003) found that the TREC 2002 QA test set could not discriminate systems with less than a seven absolute score point difference.

3 The Craft of Question Writing

One thing that trivia enthusiasts agree on is that questions need to be good questions that are well written. Research shows that asking "good questions" requires sophisticated pragmatic reasoning (Hawkins et al., 2015), and pedagogy explicitly acknowledges the complexity of writing effective questions for assessing student performance (see ? for a book-length treatment of writing good multiple-choice questions). There is no question that writing good questions is a craft in its own right.

QA datasets, however, are often collected from the wild or written by untrained crowdworkers. Even assuming they are confident users of English (the primary language of QA datasets), crowdworkers lack experience in crafting questions and may introduce idiosyncrasies that shortcut machine learning (Geva et al., 2019). Similarly, data collected from the wild, such as Natural Questions (Kwiatkowski et al., 2019), by design have vast variations in quality. In the previous section, we focused on how datasets as a whole should be structured. Now, we focus on how specific questions should be structured to make the dataset as valuable as possible.

3.1 Avoiding Ambiguity and Assumptions

Ambiguity in questions not only frustrates answerers who resolved the ambiguity 'incorrectly'; it also undermines the usefulness of questions for precisely assessing knowledge at all. For this reason, the US Department of Transportation explicitly bans ambiguous questions from exams for flight instructors (?); and the trivia community has developed rules that prevent ambiguity from arising. While this is true in many contexts, examples are rife in a format called Quizbowl (Boyd-Graber et al., 2012), whose very long questions[6] showcase trivia writers' tactics. For example, Zhu Ying (2005 PARFAIT) warns [emphasis added]:

    He's not Sherlock Holmes, but his address is 221B. He's not the Janitor on Scrubs, but his father is played by R. Lee Ermey. [. . . ] For ten points, name this misanthropic, crippled, Vicodin-dependent central character of a FOX medical drama.
    ANSWER: Gregory House, MD

to head off these wrong answers.

[6] Like Jeopardy!, they are not syntactically questions but still are designed to elicit knowledge-based responses; for consistency, we will still call them questions.

In contrast, QA datasets often contain ambiguous and under-specified questions. While this sometimes reflects real world complexities such as actual under-specified or ill-formed search queries (Faruqui and Das, 2018; Kwiatkowski et al., 2019), simply ignoring this ambiguity is problematic. As a concrete example, consider Natural Questions (Kwiatkowski et al., 2019), where the gold answer to "what year did the us hockey team won [sic] the olympics" has the answers 1960 and 1980, ignoring the US women's team, which won in 1998 and 2018, and further assuming the query is about ice rather than field hockey (also an Olympic sport). This ambiguity is a post hoc interpretation, as the associated Wikipedia pages are, indeed, about the United States men's national ice hockey team. We contend the ambiguity persists in the original question, and the information retrieval arbitrarily provides one of many interpretations. True to its name, these under-specified queries make up a substantial (if not almost the entire) fraction of natural questions in search contexts.

The problem is neither that such questions exist nor that MRQA considers questions relative to the associated context. The problem is that tasks do not explicitly acknowledge the original ambiguity and instead gloss over the implicit assumptions used in creating the data. This introduces potential noise and bias (i.e., giving a bonus to systems that make the same assumptions as the dataset) in leaderboard rankings. At best, these will become part of the measurement error of datasets (no dataset is perfect). At worst, they will recapitulate the biases that went into the creation of the datasets. Then, the community will implicitly equate the biases with correctness: you get high scores if you adopt this set of assumptions. Then, these enter into real-world systems, further perpetuating the bias. Playtesting can reveal these issues (Section 2.1), as implicit assumptions can rob a player of correctly answered questions. If you wanted to answer 2014 to "when did Michigan last win the championship"—when the Michigan State Spartans won the Women's Cross Country championship—and you cannot because you chose the wrong school, the wrong sport, and the wrong gender, you would complain as a player; researchers instead discover latent assumptions that crept into the data.[7]

[7] Where to draw the line is a matter of judgment; computers—who lack common sense—might find questions ambiguous where humans would not.

It is worth emphasizing that this is not a purely hypothetical problem. For example, Open Domain Retrieval Question Answering (Lee et al., 2019) deliberately avoids providing a reference context for the question in its framing but, in re-purposing data such as Natural Questions, opaquely relies on it for the gold answers.

3.2 Avoiding Superficial Evaluations

A related issue is that, in the words of Voorhees and Tice (2000), "there is no such thing as a question with an obvious answer". As a consequence, trivia question authors are expected to delineate acceptable and unacceptable answers.

For example, Robert Chu (Harvard Fall XI) uses a mental model of an answerer to explicitly delineate the range of acceptable correct answers:

    In Newtonian gravity, this quantity satisfies Poisson's equation. [. . . ] For a dipole, this quantity is given by negative the dipole moment dotted with the electric field. [. . . ] For 10 points, name this form of energy contrasted with kinetic.
    ANSWER: potential energy (prompt on energy; accept specific types like electrical potential energy or gravitational potential energy; do not accept or prompt on just "potential")

Likewise, the style guides for writing questions stipulate that you must give the answer type clearly and early on. These mentions specify whether you want a book, a collection, a movement, etc. They also signal the level of specificity requested. For example, a question about a date must state "day and month required" (September 11), "month and year required" (April 1968), or "day, month, and year required" (September 1, 1939). This is true for other answers as well: city and team, party and country, or more generally "two answers required". Despite all of these conventions, no pre-defined set of answers is perfect, and a process for adjudicating answers is an integral part of trivia competitions.

In major high school and college national competitions and game shows, if low-level staff cannot resolve the issue by either throwing out a single question or accepting minor variations (America instead of USA), the low-level staff contacts the tournament director. The tournament director—with their deeper knowledge of rules and questions—often decides the issue. If not, the protest goes through an adjudication process designed to minimize bias:[8] write the summary of the dispute, get all parties to agree to the summary, and then hand the decision off to mutually agreed experts from the tournament's phone tree. The substance of the disagreement is communicated (without identities), and the experts apply the rules and decide.

[8] https://www.naqt.com/rules/#protest

For example, a particularly inept Jeopardy! contestant[9] answered "endoscope" to "Your surgeon could choose to take a look inside you with this type of fiber-optic instrument". Since the van Doren scandal (Freedman, 1997), every television trivia contestant has an advocate assigned from an auditing company. In this case, the advocate initiated a process that went to a panel of judges who then ruled that endoscope (a more general term) was also correct.

[9] http://www.j-archive.com/showgame.php?game_id=6112

The need for a similar process seems to have been well-recognized in the earliest days of QA system bake-offs such as TREC-QA, and Voorhees (2008) notes that

    [d]ifferent QA runs very seldom return exactly the same [answer], and it is quite difficult to determine automatically whether the difference [. . . ] is significant.

In stark contrast to this, QA datasets typically only provide a single string or, if one is lucky, several strings. A correct answer means exactly matching these strings or at least having a high token overlap F1, and failure to agree with the pre-recorded admissible answers will put you at an uncontestable disadvantage on the leaderboard (Section 2.2).
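To make the mechanics concrete, here is a short sketch of the kind of scoring this paragraph refers to, paraphrasing the usual SQuAD-style exact match and token-overlap F1 (normalization details may differ slightly from any official evaluation script):

    import re
    import string
    from collections import Counter

    def normalize(text):
        """Lowercase, strip punctuation and articles, collapse whitespace."""
        text = "".join(c for c in text.lower() if c not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def token_f1(prediction, gold):
        pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def score(prediction, golds):
        """Best exact match and token F1 over the pre-recorded admissible strings."""
        exact = max(float(normalize(prediction) == normalize(g)) for g in golds)
        f1 = max(token_f1(prediction, g) for g in golds)
        return exact, f1

    print(score("Samuel Clemens", ["Mark Twain"]))   # (0.0, 0.0): the right person, zero credit

The SearchQA example from Section 2.1 falls through exactly this crack: unless every admissible surface form is enumerated ahead of time, a correct answerer is penalized with no recourse.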
To illustrate how current evaluations fall short of meaningful discrimination, we qualitatively analyze two near-SOTA systems on SQuAD V1.1 on CodaLab: the original XLNet (Yang et al., 2019) and a subsequent iteration called XLNet-123.[10] Despite XLNet-123's margin of almost four absolute F1 (94 vs 98) on development data, a manual inspection of a sample of 100 of XLNet-123's wins indicates that around two-thirds are 'spurious': 56% are likely to be considered not only equally good but essentially identical; 7% are cases where the answer set omits a correct alternative; and 5% of cases are 'bad' questions.[11]

[10] We could not find a paper describing XLNet-123 in detail; the submission is by http://tia.today.
[11] Examples in Appendix C.

Our goal is not to dwell on the exact proportions, minimize the achievements of these strong systems, or minimize the usefulness of quantitative evaluations. We merely want to raise the limitation of 'blind automation' for distinguishing between systems on a leaderboard.

Taking our cue from the trivia community, here is an alternative for MRQA. Test sets are only created for a specific time; all systems are submitted simultaneously. Then, all questions and answers are revealed. System authors can protest correctness rulings on questions, directly addressing the issues above. After agreement is reached, quantitative metrics are computed for comparison purposes—despite their inherent limitations, they at least can be trusted. Adopting this for MRQA would require creating a new, smaller test set every year. However, this would gradually refine the annotations and process.

This suggestion is not novel: Voorhees and Tice (2000) conclude that automatic evaluations are sufficient "for experiments internal to an organization where the benefits of a reusable test collection are most significant (and the limitations are likely to be understood)" (our emphasis) but that "satisfactory techniques for [automatically] evaluating new runs" have not been found yet. We are not aware of any change on this front—if anything, we seem to have become more insensitive as a community to just how limited our current evaluations are.

3.3 Focus on the Bubble

While every question should be perfect, time and resources are limited. Thus, authors of tournaments have a policy of "focusing on the bubble", where the "bubble" is the set of questions most likely to discriminate between top teams.

For humans, authors and editors focus on the questions and clues that they predict will decide the tournament. These questions are thoroughly playtested, vetted, and edited. Only after these questions have been perfected will the other questions undergo the same level of polish.

For computers, the same logic applies. Authors should ensure that these questions are correct, free of ambiguity, and unimpeachable. However, as far as we can tell, the authors of QA datasets do not give any special attention to these questions.

Unlike a human trivia tournament, however—with the finite patience of the participants—this does not mean that you should necessarily remove all of the easy or hard questions from your dataset; rather, spend more of your time/effort/resources on the bubble. You would not want to introduce a sampling bias that leads to inadvertently forgetting how to answer questions like "who is buried in Grant's tomb?" (Dwan, 2000, Chapter 7).
4 Why Quizbowl is the Gold Standard

We now focus our thus far wide-ranging QA discussion on a specific format: Quizbowl, which has many of the desirable properties outlined above. We have no delusion that mainstream QA will universally adopt this format. However, given the community's emphasis on fair evaluation, computer QA can borrow aspects from the gold standard of human QA. We discuss what this may look like in Section 5, but first we describe the gold standard of human QA.

We have shown several examples of Quizbowl questions, but we have not yet explained in detail how the format works; see Rodriguez et al. (2019) for a more comprehensive description. You might be scared off by how long the questions are. However, in real Quizbowl trivia tournaments, they are not finished because the questions are designed to be interrupted.

Interruptable: A moderator reads a question. Once someone knows the answer, they use a signaling device to "buzz in". If the player who buzzed is right, they get points. Otherwise, they lose points and the question continues for the other team.

Not all trivia games with buzzers have this property, however. For example, take Jeopardy!, the subject of Watson's tour de force (Ferrucci et al., 2010). While Jeopardy! also uses signaling devices, these only work at the end of the question; Ken Jennings, one of the top Jeopardy! players (and also a Quizbowler), explains it on a Planet Money interview (Malone, 2019):

    Jennings: The buzzer is not live until Alex finishes reading the question. And if you buzz in before your buzzer goes live, you actually lock yourself out for a fraction of a second. So the big mistake on the show is people who are all adrenalized and are buzzing too quickly, too eagerly.
    Malone: OK. To some degree, Jeopardy! is kind of a video game, and a crappy video game where it's, like, light goes on, press button—that's it.
    Jennings: (Laughter) Yeah.

Thus, Jeopardy!'s buzzers are a gimmick to ensure good television; however, Quizbowl buzzers discriminate knowledge (Section 2.3). Similarly, while TriviaQA (Joshi et al., 2017) is written by knowledgeable writers, the questions are not pyramidal.

Pyramidal: Recall that an effective dataset (tournament) discriminates the best from the rest—the higher the proportion of effective questions ρ, the better. Quizbowl's ρ is nearly 1.0 because discrimination happens within a question: after every word, an answerer must decide whether they have enough information to answer the question. Quizbowl questions are arranged so that questions are maximally pyramidal: questions begin with hard clues—ones that require deep understanding—and move to more accessible clues that are well known.
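For a computer, the incremental decision behind Interruptable and Pyramidal reduces to: keep reading, update a guess, and commit once the guess is trustworthy. The following sketch is ours—a bare confidence-threshold policy standing in for the reinforcement-learning agents the literature above actually uses—and the toy guesser is entirely hypothetical.

    from typing import Callable, List, Optional, Tuple

    def play_question(words: List[str],
                      guesser: Callable[[str], Tuple[str, float]],
                      threshold: float = 0.8) -> Tuple[Optional[str], int]:
        """Reveal the question one word at a time; buzz once the guesser's
        confidence clears the threshold. Returns (guess, words seen), or
        (None, len(words)) if the agent never buzzes."""
        seen: List[str] = []
        for word in words:
            seen.append(word)
            guess, confidence = guesser(" ".join(seen))
            if confidence >= threshold:
                return guess, len(seen)
        return None, len(words)

    # Hypothetical guesser: a real system would return a model's top answer and its probability.
    def toy_guesser(prefix: str) -> Tuple[str, float]:
        confident = "221B" in prefix and "Scrubs" in prefix
        return "Gregory House", 0.9 if confident else 0.2

    question = ("He's not Sherlock Holmes, but his address is 221B. "
                "He's not the Janitor on Scrubs, but his father is played by R. Lee Ermey.")
    print(play_question(question.split(), toy_guesser))   # buzzes as soon as both clues are seen

Earlier correct buzzes beat later ones, which is exactly what makes every word of a pyramidal question contribute to discrimination.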
Well-Edited: Quizbowl questions are created in phases. First, the author selects the answer and assembles (pyramidal) clues. A subject editor then removes ambiguity, adjusts acceptable answers, and tweaks clues to optimize discrimination. Finally, a packetizer ensures the overall set is diverse, has uniform difficulty, and is without repeats.

Unnatural: Trivia questions are fake: the asker already knows the answer. But they're no more fake than a course's final exam, which, like leaderboards, is designed to test knowledge.

Experts know when questions are ambiguous; while "what play has a character whose father is dead" could be Hamlet, Antigone, or Proof, a good writer's knowledge avoids the ambiguity. When authors omit these cues, the question is derided as a "hose" (Eltinge, 2013), which robs the tournament of fun (Section 2.1).

One of the benefits of contrived formats is a focus on specific phenomena. Dua et al. (2019) exclude questions an existing MRQA system could answer to focus on challenging quantitative reasoning. One of the trivia experts consulted in Wallace et al. (2019) crafted a question that tripped up neural QA systems with "this author opens Crime and Punishment"; the top system confidently answers Fyodor Dostoyevski. However, that phrase was embedded in a longer question, "The narrator in Cogwheels by this author opens Crime and Punishment to find it has become The Brothers Karamazov". Again, this shows the inventiveness and linguistic dexterity of the trivia community.

A counterargument is that when real humans ask questions—e.g., on Yahoo! Questions (Szpektor and Dror, 2013), Quora (Iyer et al., 2017) or web search (Kwiatkowski et al., 2019)—they ignore the craft of question writing. Real humans tend to react to unclear questions with confusion or divergent answers, often making explicit in their answers how they interpreted the original question ("I assume you meant. . . ").

Given that real world applications will have to deal with the inherent noise and ambiguity of unclear questions, our systems must cope with it. However, dealing with it is not the same as glossing over their complexities.

Complicated: Quizbowl is more complex than other datasets. Unlike other datasets where you just need to decide what to answer, you also need to choose when to answer the question. While this improves the dataset's discrimination, it can hurt popularity because you cannot copy/paste code from other QA tasks. The complicated pyramidal structure also makes some questions—e.g., what is log base four of sixty-four—difficult[12] to ask. However, the underlying mechanisms (e.g., reinforcement learning) share properties with other tasks, such as simultaneous translation (Grissom II et al., 2014; Ma et al., 2019), human incremental processing (Levy et al., 2008; Levy, 2011), and opponent modeling (He et al., 2016).

[12] But not always impossible, as IHSSBCA shows: "This is the smallest counting number which is the radius of a sphere whose volume is an integer multiple of π. It is also the number of distinct real solutions to the equation x^7 - 19x^5 = 0. This number also gives the ratio between the volumes of a cylinder and a cone with the same heights and radii. Give this number equal to the log base four of sixty-four."

5 A Call to Action

You may disagree with the superiority of Quizbowl as a QA framework (even for trivia nerds, not all agree. . . de gustibus non est disputandum). In this final section, we hope to distill our advice into a call to action regardless of your question format of choice. Here are our recommendations if you want to have an effective leaderboard.

Talk to Trivia Nerds: You should talk to trivia nerds because they have useful information (not just about the election of 1876). Trivia is not just the accumulation of information but also connecting disparate facts (Jennings, 2006). These skills are exactly those that we want computers to develop.

Trivia nerds are writing questions anyway; we can save money and time if we pool resources. Computer scientists benefit if the trivia community writes questions that aren't trivial for computers to solve (e.g., avoiding quotes and named entities). The trivia community benefits from tools that make their job easier: show related questions, link to Wikipedia, or predict where humans will answer.

Likewise, the broader public has unique knowledge and skills. In contrast to low-paid crowdworkers, public platforms for question answering and citizen science (Bowser et al., 2013) are brimming with free expertise if you can engage the relevant communities. For example, the Quora query "Is there a nuclear control room on nuclear aircraft carriers?" is purportedly answered by someone who worked in such a room (Humphries, 2017). As machine learning algorithms improve, the "good enough" crowdsourcing that got us this far may simply not be enough for continued progress.

Many question answering datasets benefit from the efforts of the trivia community. Ethically using the data requires acknowledging their contributions and using their input to create datasets (Jo and Gebru, 2020, Consent and Inclusivity).

Eat Your Own Dog Food: As you develop new question answering tasks, you should feel comfortable playing the task as a human. Importantly, this is not just to replicate what crowdworkers are doing (also important) but to remove hidden assumptions, institute fair metrics, and define the task well. For this to feel real, you will need to keep score; have all of your coauthors participate and compare their scores.

Again, we emphasize that human and computer skills are not identical, but this is a benefit: humans' natural aversion to unfairness will help you create a better task, while computers will blindly optimize a broken objective function (Bostrom, 2003; ?). As you go through the process of playing on your question–answer dataset, you can see where you might have fallen short on the goals we outline in Section 3.
Won't Somebody Look at the Data?: After QA datasets are released, there should also be deeper, more frequent discussion of actual questions within the NLP community. Part of every post-mortem of trivia tournaments is a detailed discussion of the questions, where good questions are praised and bad questions are excoriated. This is not meant to shame the writers but rather to help build and reinforce cultural norms: questions should be well-written, precise, and fulfill the creator's goals. Just like trivia tournaments, QA datasets resemble a product for sale. Creators want people to invest time and sometimes money (e.g., GPU hours) in using their data and submitting to their leaderboards. It is "good business" to build a reputation for quality questions and discussing individual questions.

Similarly, discussing and comparing the actual predictions made by the competing systems should be part of any competition culture—without it, it is hard to tell what a couple of points on some leaderboard mean. To make this possible, we recommend that leaderboards include an easy way for anyone to download a system's development predictions for qualitative analyses.

Make Questions Discriminative: We argue that questions should be discriminative (Section 2.3), and while Quizbowl is one solution (Section 4), not everyone is crazy enough to adopt this (beautiful) format. For more traditional QA tasks, you can maximize the usefulness of your dataset by ensuring as many questions as possible are challenging (but not impossible) for today's QA systems.

But you can use some Quizbowl intuitions to improve discrimination. In visual QA, you can offer increasing resolutions of the image. For other settings, create pyramidality by adding metadata: coreference, disambiguation, or alignment to a knowledge base. In short, consider multiple versions/views of your data that progress from difficult to easy. This not only makes more of your dataset discriminative but also reveals what makes a question answerable.

Embrace Multiple Answers or Specify Specificity: As QA moves to more complicated formats and answer candidates, what constitutes a correct answer becomes more complicated. Fully automatic evaluations are valuable for both training and quick-turnaround evaluation. In cases where annotators disagree, the question should explicitly state what level of specificity is required (e.g., September 1, 1939 vs. 1939 or Leninism vs. socialism). Or, if not all questions have a single answer, link answers to a knowledge base with multiple surface forms or explicitly enumerate which answers are acceptable.
Appreciate Ambiguity: If your intended QA application has to handle ambiguous questions, do justice to the ambiguity by making it part of your task—for example, recognize the original ambiguity and resolve it ("did you mean. . . ") instead of giving credit for happening to 'fit the data'.

To ensure that our datasets properly "isolate the property that motivated it in the first place" (?), we need to explicitly appreciate the unavoidable ambiguity instead of silently glossing over it.[13] This is already an active area of research, with conversational QA being a new setting actively explored by several datasets (Reddy et al., 2018; Choi et al., 2018); and other work explicitly focusing on identifying useful clarification questions (Rao and Daumé III, 2018), thematically linked questions (Elgohary et al., 2018), or resolving ambiguities that arise from coreference or pragmatic constraints by rewriting underspecified question strings in context (Elgohary et al., 2019).

[13] Not surprisingly, 'inherent' ambiguity is not limited to QA; Pavlick and Kwiatkowski (2019) show natural language inference data have 'inherent disagreements' between humans and advocate for recovering the full range of accepted inferences.

Revel in Spectacle: However, with more complicated systems and evaluations, a return to the yearly evaluations of TREC QA may be the best option. This improves not only the quality of evaluation (we can have real-time human judging) but also lets the test set reflect the build it/break it cycle (Ruef et al., 2016), as attempted by the 2019 iteration of FEVER. Moreover, another lesson the QA community could learn from trivia games is to turn it into a spectacle: exciting games with a telegenic host. This has a benefit to the public, who see how QA systems fail on difficult questions, and to QA researchers, who have a spoonful of fun sugar to inspect their systems' output and their competitors'.

In between are automatic metrics that mimic the flexibility of human raters, inspired by machine translation evaluations (Papineni et al., 2002; Specia and Farzindar, 2010) or summarization (Lin, 2004). However, we should not forget that these metrics were introduced as 'understudies'—good enough when quick evaluations are needed for system building but no substitute for a proper evaluation. In machine translation, Laubli et al. (2020) reveal that crowdworkers cannot spot the errors that neural MT systems make—fortunately, trivia nerds are cheaper than professional translators.

Be Honest in Crowning QA Champions: While—particularly for leaderboards—it is tempting to turn everything into a single number, recognize that there are often different sub-tasks and types of players who deserve recognition. A simple model that requires less training data or runs in under ten milliseconds may be objectively more useful than a bloated, brittle monster of a system that has a slightly higher F1 (Dodge et al., 2019). While you may only rank by a single metric (this is what trivia tournaments do too), you may want to recognize the highest-scoring model that was built by undergrads, took no more than one second per example, was trained only on Wikipedia, etc.

Finally, if you want to make human–computer comparisons, pick the right humans. Paraphrasing a participant of the 2019 MRQA workshop (Fisch et al., 2019), a system better than the average human at brain surgery does not imply superhuman performance in brain surgery. Likewise, beating a distracted crowdworker on QA is not QA's endgame. If your task is realistic, fun, and challenging, you will find experts to play against your computer. Not only will this give you human baselines worth reporting—they can also tell you how to fix your QA dataset. . . after all, they've been at it longer than you have.

Acknowledgements

Many thanks to Massimiliano Ciaramita, Jon Clark, Christian Buck, Emily Pitler, and Michael Collins for insightful discussions that helped frame these ideas. Thanks to Kevin Kwok for permission to use the Protobowl screenshot and information.

References

Luis von Ahn. 2006. Games with a purpose. Computer, 39:92–94.
David Baber. 2015. Television Game Show Hosts: Biographies of 32 Stars. McFarland.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Institute of Advanced Studies in Systems Research and Cybernetics, 2:12–17.
Anne Bowser, Derek Hansen, Yurong He, Carol Boston, Matthew Reid, Logan Gunnell, and Jennifer Preece. 2013. Using gamification to inspire new citizen science volunteers. In Proceedings of the First International Conference on Gameful Design, Research, and Applications (Gamification '13).
Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.
Seth Chaiklin. 2003. The zone of proximal development in Vygotsky's analysis of learning and instruction. In Learning in Doing: Social, Cognitive and Computational Perspectives. Cambridge University Press.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing.
Anthony Cuthbertson. 2018. Robots can now read better than humans, putting millions of jobs at risk.
Paul Diamond. 2009. How To Make 100 Pounds A Night (Or More) As A Pub Quizmaster. DP Quiz.
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of EMNLP-IJCNLP.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Conference of the North American Chapter of the Association for Computational Linguistics.
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.
R. Dwan. 2000. As Long as They're Laughing: Groucho Marx and You Bet Your Life. Midnight Marquee Press.
Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Empirical Methods in Natural Language Processing.
Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. Dataset and baselines for sequential open-domain question answering. In Empirical Methods in Natural Language Processing.
Stephen Eltinge. 2013. Quizbowl lexicon.
Manaal Faruqui and Dipanjan Das. 2018. Identifying well-formed natural language questions. In Proceedings of Empirical Methods in Natural Language Processing.
Shi Feng and Jordan Boyd-Graber. 2019. What AI can do for me: Evaluating machine learning interpretations in cooperative play. In International Conference on Intelligent User Interfaces.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3).
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen, editors. 2019. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics.
Morris Freedman. 1997. The fall of Charlie Van Doren. The Virginia Quarterly Review, 73(1):157–165.
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of EMNLP-IJCNLP.
Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of Empirical Methods in Natural Language Processing.
Robert X. D. Hawkins, Andreas Stuhlmüller, Judith Degen, and Noah D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society.
He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning.
R. Robert Henzel. 2018. NAQT eligibility overview.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems.
Bryan Humphries. 2017. Is there a nuclear control room on nuclear aircraft carriers?
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. Quora question pairs.
Ken Jennings. 2006. Brainiac: Adventures in the Curious, Competitive, Compulsive World of Trivia Buffs. Villard.
Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Samuel Laubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, and Antonio Toral. 2020. A set of recommendations for assessing human–machine parity in language translation. Journal of Artificial Intelligence Research, 67.
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Association for Computational Linguistics.
Roger Levy. 2011. Integrating surprisal and uncertain-input models in online sentence comprehension: Formal techniques and empirical results. In Proceedings of the Association for Computational Linguistics.
Roger P. Levy, Florencia Reali, and Thomas L. Griffiths. 2008. Modeling the effects of memory on human online sentence processing with particle filters. In Proceedings of Advances in Neural Information Processing Systems.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Kenny Malone. 2019. How uncle Jamie broke Jeopardy.
Adam Najberg. 2018. Alibaba AI model tops humans in reading comprehension.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.
Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.
Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan L. Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. CoRR, abs/1904.04792.
Marc-Antoine Rondeau and T. J. Hazen. 2018. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on Machine Reading for Question Answering.
Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.
Dan Russell. 2020. Why control-F is the single most important thing you can teach someone about search. SearchReSearch.
Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER. In AMTA 2010 Workshop: Bringing MT to the User.
Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of Empirical Methods in Natural Language Processing.
Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Association for the Advancement of Artificial Intelligence.
Idan Szpektor and Gideon Dror. 2013. From query to question in one click: Suggesting synthetic questions to searchers. In Proceedings of WWW 2013.
David Taylor, Colin McNulty, and Jo Meek. 2012. Your starter for ten: 50 years of University Challenge.
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics.
Ellen M. Voorhees. 2003. Evaluating the evaluation: A case study using the TREC 2002 question answering track. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.
Ellen M. Voorhees. 2008. Evaluating question answering system performance. Pages 409–430, Springer Netherlands, Dordrecht.
Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association of Computational Linguistics, 10.
Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Conference on Computational Natural Language Learning.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
Xingdi Yuan, Jie Fu, Marc-Alexandre Cote, Yi Tay, Christopher Pal, and Adam Trischler. 2019. Interactive machine comprehension with information seeking agents.