Decontextualization: Making Sentences Stand-Alone
arXiv:2102.05169v1 [cs.CL] 9 Feb 2021

Eunsol Choi2*, Jennimaria Palomaki1, Matthew Lamm1, Tom Kwiatkowski1, Dipanjan Das1, Michael Collins1
1 Google Research, 2 Department of Computer Science, The University of Texas at Austin
eunsol@cs.utexas.edu, {jpalomaki, mrlamm, tomkwiat, dipanjand, mjcollins}@google.com

* Work done at Google.

Abstract

Models for question answering, dialogue agents, and summarization often interpret the meaning of a sentence in a rich context and use that meaning in a new context. Taking excerpts of text can be problematic, as key pieces may not be explicit in a local window. We isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context, while preserving its meaning. We describe an annotation procedure, collect data on the Wikipedia corpus, and use the data to train models to automatically decontextualize sentences. We present preliminary studies that show the value of sentence decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important subtask in many downstream applications, and that the definitions and resources provided can benefit tasks that operate on sentences that occur in a richer context.

Page Title: Croatia at the FIFA World Cup
Paragraph: Croatia national football team have appeared in the FIFA World Cup on five occasions (in 1998, 2002, 2006, 2014 and 2018) since gaining independence in 1991. Before that, from 1930 to 1990 Croatia was part of Yugoslavia. Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.
Decontextualized Sentence: The Croatia national football team's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.

Figure 1: An example decontextualization. The sentence to decontextualize is highlighted in grey.

1 Introduction

Many applications of natural language processing need to be able to interpret, or present, text independently from the rich context in which it occurs. For example, summarization systems extract salient information from documents and present it in a reduced context. Many systems also segment documents prior to interpretation or retrieval for computational efficiency. In all of these cases, we would like the context-reduction step to be meaning preserving but, to date, there has been no independent method of ensuring this.

In this paper we isolate and define the problem of sentence decontextualization: taking a sentence together with its context and rewriting it to be interpretable out of context if feasible, while preserving its meaning.[1] Having defined the problem, we operationalize this definition into a high quality annotation procedure; use the resulting data to train models to automatically decontextualize sentences; and present preliminary results that show the value of automatic decontextualization in a user facing task, and as preprocessing for systems that perform document understanding. We argue that decontextualization is an important sub-task in many downstream applications, and we believe this work can benefit tasks that operate on sentences that occur in a wider context.

[1] More precisely the truth-conditional meaning or explicature (Sperber and Wilson, 1986); see Section 2 for discussion.

One contribution of this work is to release a dataset of decontextualized sentences that can be used as training and evaluation data, together with the evaluation script: on publication of this paper the data will be available at https://github.com/google-research/language/tree/master/language/decontext.

Figure 1 shows an example decontextualization. In this example we have a coreference resolution step (their → The Croatia national football team's) and a bridging step (insertion of the prepositional phrase "in the FIFA World Cup" to modify "Croatia's best result thus far").
Decontextualization involves various linguistic phenomena, including coreference resolution, global scoping, and bridging anaphora (Clark, 1975). We present a linguistically motivated definition of decontextualization in Section 2 and show that this definition can be reliably applied by crowdworkers in Section 3.

We generate a corpus of decontextualized sentences corresponding to original sentences drawn from the English Wikipedia. We show that a high proportion of these original sentences can be decontextualized using a relatively simple set of rewrite operations, and we use the data to define a new automatic decontextualization task in which a computer system needs to create a decontextualized sentence from an original sentence presented in paragraph context. We discuss the implications of choosing Wikipedia as a domain in Section 3.4.

We present two methods for automatic decontextualization based on state-of-the-art coreference (Joshi et al., 2020) and generation (Raffel et al., 2019a) models. We evaluate the output of these models with automatic measures (derived from Xu et al. (2016)), as well as through human evaluation. Both automatic and human evaluations show that the largest sequence-to-sequence model produces high quality decontextualizations in the majority of cases, although it still lags human performance in the thoroughness and accuracy of these decontextualization edits.

Finally, we present two demonstrations of the utility of decontextualization. The first is a user study giving evidence that decontextualized sentences can be valuable when presented to users as answers in a question-answering task: raters judge that they balance conciseness with informativeness. In the second, we use decontextualization as a preprocessing component for generating a retrieval corpus for open domain question answering. Decontextualizing the sentences to be indexed by the retrieval system enables more efficient answer string retrieval for information seeking queries. These demonstrations are presented as preliminary results, and we argue that decontextualization is an important sub-task for a wide range of NLP applications.

2 Linguistic Background

We start with the following definition:

Definition 1 (Decontextualization). Given a sentence-context pair (s, c), a sentence s′ is a valid decontextualization of s if: (1) the sentence s′ is interpretable in the empty context; and (2) the truth-conditional meaning of s′ in the empty context is the same as the truth-conditional meaning of s in context c.

A context c is a sequence of sentences preceding s, and the empty context is the empty sequence.

We have been careful here to use the more specific term "truth conditional meaning" rather than "meaning". Here we follow the distinction in semantics/pragmatics between truth conditional meaning and implicature, and deliberately exclude implicatures (which can also be considered part of the meaning of an utterance) from our definition. There is a rich history of work in semantics and pragmatics on truth-conditional meaning and implicatures, going back to Grice (1975). Our concept of "truth conditional meaning" is very close to "explicature" as used in Relevance Theory (Sperber and Wilson, 1986). Consider this description of explicature from Birner (2012) (pages 96–97, our own emphasis added):

"The explicature in an utterance is the result of enriching the semantic content with the sorts of pragmatic information necessary to provide us with a truth-evaluable proposition. This includes calculating the referents for pronouns, working out the intended interpretation for deictic phrases like here and later ..., disambiguating lexically and structurally ambiguous words and phrases, making any "bridging" inferences necessary for reference resolution ... and so on."

We will see in the next section that our annotation task follows this definition quite closely.

As an example consider the following exchange:

Susan: Has the Croatia national football team ever won the FIFA World Cup?
Jon: Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.

Here the truth conditional meaning of Jon's reply is equivalent to "Croatia's best result thus far in the FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.", whereas the implicature would be "the Croatia national football team has never won the FIFA World Cup" (which answers Susan's question).
In our definition the decontextualized sentence s′ should preserve the truth-conditional meaning, but is not required to preserve the implicature(s) of the sentence.[2]

[2] We have not necessarily given up on recovering implicatures: the decontextualized sentence will likely be a valuable intermediate step in deriving the implicatures of an utterance.

Remark (extra-linguistic context): In addition to its document context, a given sentence s and its counterpart s′ also come with a temporal, cultural, and geographic context, i.e. where and when they are being written or read and by whom.[3] We assume that these aspects of context are preserved during decontextualization. The effect of this is that elements of s which derive their meaning from outside of the document context will receive equivalent interpretation in s′, and hence do not require decontextualization. For example, the expression "thus far" in Figure 1 is interpreted relative to the time of utterance, not relative to what has been previously said in the Wikipedia article, and hence it appears in both the original and decontextualized sentences.

[3] Research on text simplification (Xu et al., 2015; Bingel et al., 2018) also shows how target output depends on the expected audience.

3 Task Definition

An annotator is provided with an entire document d and a target sentence within the document, represented as a start and end index (s_st, s_end). First, the annotator decides whether the target sentence can be decontextualized or not, labeling it as FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator decontextualizes the sentence, producing y, a new sentence that satisfies the conditions in Definition 1.

3.1 Feasibility

Sentences in FEASIBLE include sentences that do not require any modification to be decontextualized (e.g., "Émilie du Châtelet proposed the hypothesis of the conservation of total energy, as distinct from momentum."), and sentences that require edits to stand alone.

In the decontextualization step, we instructed annotators to make only minor modifications, which include copying and pasting a few phrases from the document into the target sentence and deleting phrases from the target sentence. When it is too challenging to decontextualize, the sentence is classified into the INFEASIBLE category. Often, sentences in this category are part of a narrative story, or rely heavily on the preceding few sentences. See Figure 2 for examples.

Page Title: Postage stamps and postal history of India
Paragraph: ... In the opinion of Geoffrey Clarke, the reformed system was to be maintained "for the benefit of the people of India and not for the purpose of swelling the revenue." The Commissioners voted to abolish the earlier practice of conveying official letters free of postage ("franking"). The new system was recommended by the Governor-General, Lord Dalhousie, and adopted by the East India Company's Court of Directors.

Page Title: Thermodynamic temperature
Paragraph: ... To completely melt ice at 0 °C into water at 0 °C, one must add roughly 80 times the thermal energy as is required to increase the temperature of the same mass of liquid water by one degree Celsius. The metals' ratios are even greater, typically in the range of 400 to 1200 times.

Figure 2: Decontextualization examples falling into the INFEASIBLE category. The sentence to be decontextualized is highlighted in grey.

3.2 Edit Types and Linguistic Phenomena

When an example is classified as FEASIBLE, the annotator makes edits to decontextualize the sentence. Table 1 shows the different edit types. They fall into four broad categories:

NAME COMPLETION and PRONOUN/NP SWAP correspond to replacement of a referring expression that is unclear out of context with a referring expression that is unambiguous out of context. For example, replacing the pronoun "She" with "Cynthia Nixon", the definite NP "the copper statue" with "The Statue of Liberty", or the abbreviated name "Meg" with "Megan "Meg" Griffin".

DM REMOVAL involves removal of discourse markers (DMs) such as "therefore".

BRIDGING and GLOBAL SCOPING involve addition of a phrase (typically a prepositional phrase) that modifies either a particular noun phrase ("bridging") or the entire sentence ("global scoping"). For example, adding "in the Ultimate Fighting Championship" as a modifier to "all fights", or adding "at the 2018 Cannes Film Festival" at the end of the sentence. The additional phrase essentially spells out a modifier that is implied by the context.

ADDITION inserts background information that significantly improves readability: in many cases, this involves adding an appositive or premodifier to a named entity to add useful background information about that entity. Unlike the other edits described above, edits in this category are optional. For example, replacing "The Eagles" with "the American rock band The Eagles".
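As a concrete illustration, the example-and-annotation structure defined in this section might be represented as follows. This is a minimal sketch using the Figure 1 example; the field names are ours and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DecontextExample:
    """One annotated example: document context, target sentence, and annotation."""
    page_title: str
    section_titles: List[str]
    paragraph_sentences: List[str]   # x_1 ... x_n
    target_index: int                # index of the target sentence within the paragraph
    category: str                    # "FEASIBLE", "INFEASIBLE", or unnecessary-to-edit
    decontextualized: Optional[str]  # y, present only when edits were made

example = DecontextExample(
    page_title="Croatia at the FIFA World Cup",
    section_titles=[],
    paragraph_sentences=[
        "Croatia national football team have appeared in the FIFA World Cup on five "
        "occasions (in 1998, 2002, 2006, 2014 and 2018) since gaining independence in 1991.",
        "Before that, from 1930 to 1990 Croatia was part of Yugoslavia.",
        "Their best result thus far was reaching the 2018 final, where they lost 4-2 to France.",
    ],
    target_index=2,
    category="FEASIBLE",
    decontextualized="The Croatia national football team's best result thus far in the "
                     "FIFA World Cup was reaching the 2018 final, where they lost 4-2 to France.",
)
```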
PRONOUN/NP SWAP (40.5%) — Replacement of a definite pronoun / noun phrase with another referring expression. Example: *-The copper statue, +The Statue of Liberty+, a gift from the people of France to the people of the United States, was designed by French sculptor Frédéric Auguste Bartholdi and built by Gustave Eiffel.

NAME COMPLETION (11.5%) — Expansion of acronyms or partial names. Example: *-Meg, +Megan "Meg" Griffin+ made her first appearance on television when Family Guy debuted on Fox on January 31, 1999, with the episode "Death Has a Shadow".

DM REMOVAL (3.5%) — Removal of discourse markers that can only be understood in context. Example: *-For instance,+ Alaska could be regarded as the highest state because Denali, at 20,310 feet, is the highest point in the US.

BRIDGING (13%) — Addition of a modifier (typically a PP) to a noun phrase. Example: In all fights *+in the Ultimate Fighting Championship+, each round can be no longer than five minutes.

GLOBAL SCOPING (7%) — Addition of a phrase (typically a PP) that modifies the entire sentence. Example: The Japanese film Shoplifters, directed by Hirokazu Kore-eda, won the Palme d'Or *+at the 2018 Cannes Film Festival+.

ADDITION (10%) — Addition of background information that is not necessary but helps readability significantly. Example: Charles Darwin *+, an English naturalist and biologist,+ was among the first to suggest that physiological changes caused by an emotion had a direct impact on, rather than being just the consequence of that emotion.

Table 1: The list of possible edits in decontextualization. The percentage in parentheses shows how frequently each phenomenon occurs in the data, from a manual analysis of 200 examples, including examples that belong to the INFEASIBLE category and examples that do not require any edits. The bag notation (*-x, +y+) removes x and adds y at its position.

3.3 Variability

We note that for a given sentence there will frequently be more than one possible decontextualization. While this inherent subjectivity makes the task challenging to crowdsource and evaluate, we argue this is an important feature, as shown in recent literature (Aroyo and Welty, 2015; Pavlick and Kwiatkowski, 2019; Kwiatkowski et al., 2019), and propose to collect multiple references per example. Table 2 shows examples where there can be multiple different correct decontextualizations. In the first example, while the semantics of the edits are roughly equivalent, i.e., the annotators agreed on which noun phrases have to be disambiguated and which information has to be added, they differ in how to rewrite the sentence. In the second example, we see disagreement on what information should be added to the sentence. We do not make any explicit assumptions about what is known and salient to the reader, and instructed annotators to use their best judgement to rewrite such that the new sentence is fluent, unambiguous and clear when posed alone. In the last example, annotators disagree on the feasibility. While the sentence is part of a bigger narrative, two annotators judged it could be edited to stand alone by adding a global scoping modifier, "In Greek mythology."

3.4 Scope of Current Task Formulation

Our data comes from the English portion of the Wikipedia corpus. We sampled sentences as follows. We first pick a (question, Wikipedia page, short answer) triple from the Natural Questions (Kwiatkowski et al., 2019) uniformly at random from the questions that have a short answer. We include the sentence containing the short answer as one example; as a second example we choose a sentence at random from the Wikipedia page. After sampling we exclude (1) sentences under a "Plot" category, as they are often infeasible to decontextualize; (2) any sentence that is the first sentence of the page; and (3) any sentence from a paragraph containing only a single sentence.

We designed this data selection process to ensure that a large proportion of examples (90%) could be decontextualized using the simple edits described in Section 3.2.
Page title / Section title: We Don't Talk Anymore (Charlie Puth song) / Music video
Paragraph: The music video premiered on August 2, 2016, on BuzzFeed and was directed by Phil Pinto. It shows Puth and Mirella Cardoso as his love interest. ...
Decontextualization 1: *-It, +We Don't Talk Anymore music video+ shows *-Puth, +Charlie Puth+ and Mirella Cardoso as his love interest.
Decontextualization 2: *-It, +The "We Don't Talk Anymore" (Charlie Puth song) music video+ shows Puth and Mirella Cardoso as his love interest.

Page title: The American Baking Competition
Paragraph: CBS placed casting calls for participants on November 14, 2012. Auditions were held between December 1 and December 15, 2012. The competition took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 1: The *-competition, +American Baking Competition+ took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.
Decontextualization 2: The *-competition, +American Baking Competition, a reality competition television series,+ took place at the Gibbs Gardens in Ball Ground, Georgia in March 2013.

Page title: Gemini (Constellation)
Paragraph: In Greek mythology, Gemini was associated with the myth of Castor and Pollux, the children of Leda and Argonauts both. Pollux was the son of Zeus, who seduced Leda, while Castor was the son of Tyndareus, king of Sparta and Leda's husband. Castor and Pollux were also mythologically associated with St. Elmo's fire in their role as the protectors of sailors. When Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.
Decontextualization 1: INFEASIBLE
Decontextualization 2: *+In Greek mythology,+ when Castor died, because he was mortal, Pollux begged his father Zeus to give Castor immortality, and he did, by uniting them together in the heavens.

Table 2: Examples showing the diversity of valid decontextualization edits.

Before settling on Wikipedia, we conducted an initial pilot study which revealed that encyclopedic text is substantially easier to decontextualize compared to newswire or literary text. In the latter genres, the context required for the comprehension of any given sentence appears to be much more complex in structure. Similarly, it is difficult to posit decontextualization for sentences that appear on social media platforms, because they are situated within complex and highly specific social contexts. In contrast, being written for a general audience, Wikipedia makes limited assumptions about its reader.

Within Wikipedia, we similarly found that articles on popular historical or cultural entities and events were easier for crowdworkers to decontextualize compared to articles from technical domains, such as ones on medical or mathematical concepts. Comprehension of such articles requires a considerable body of background knowledge or information from preceding paragraphs. Articles in our dataset cover topics that require little background knowledge to comprehend.

We focus on decontextualization of sentences, where the space of edits is restricted, to make the task easier to quality control and annotate. However, alternate formulations, such as decontextualization of paragraphs, could also be studied. One could also consider allowing a wider range of edits, such as multi-sentence outputs and edits beyond copy-and-pasting, such as paraphrasing and re-ordering. We anticipate that exploring such alternative formulations would help to extend the scope of decontextualization to the more challenging domains previously mentioned.

We stress however that in spite of our restriction to single sentences in Wikipedia, the decontextualization task is nevertheless valuable: Wikipedia (and other encyclopedic sources) contains a wealth of factual information, and a high proportion (over 60%; see Table 3) of sentences both require decontextualization and can be decontextualized under our definitions (only 30% of sentences are interpretable out of context without any edits).

4 Data Collection

Annotation Interface  The annotator is presented with a sentence in the context of an entire Wikipedia page. In the first step the annotator judges whether the example is FEASIBLE or INFEASIBLE. If the example is marked as FEASIBLE, the annotator can use delete, add, or swap operations within a user interface to produce a decontextualized string.
Data Statistics  We collected one reference for each example in the training data, and five references for each example in the evaluation data. Annotators are native speakers of English located in the United States, and on average they took 4 minutes to annotate a single example. In total, 28 annotators annotated the examples, with 11 annotators annotating more than 1K examples each. Table 3 presents some overall data statistics. Decontextualization is possible for the majority of examples, with the INFEASIBLE category covering roughly 10% of the data. We note a slight discrepancy between the train and evaluation dataset distributions, potentially due to a change in the annotation interface. A small subset of data is annotated by the authors to be compared with the crowd-sourced data (last row in the table).

       | #     | par. len | sent. len | FEASIBLE w/ edit (%) | FEASIBLE as is (%) | INFEASIBLE (%)
Train  | 11290 | 695      | 156       | 60                   | 31                 | 9
Dev    | 1945  | 695      | 162       | 67                   | 21                 | 12
Test   | 1945  | 711      | 160       | 68                   | 20                 | 12
Expert | 100   | 658      | 163       | 63                   | 26                 | 12

Table 3: Data statistics. par. len refers to paragraph length and sent. len refers to sentence length; all lengths are in bytes. The development and test sets are five-way annotated, and the expert data is four-way annotated.

Annotation Quality  We quantify the annotation agreement on the category classification. The Fleiss' kappa on category classification is 0.51 among expert annotators, and 0.30 among the crowd annotators (binary agreement is at 85%). We observed more variability among crowdworkers, as their backgrounds are more diverse and some annotators have a loose concept of "stand alone" and consistently attempted decontextualization.

We also measured agreement among the individual edits. For each of the edit operations (as defined in Section 3.2), we compare the output sentence after the single edit to the set of output sentences each produced by a single edit from the other annotators. About 32.5% of edits were covered.

Because of the inherent annotation variability, four of the authors manually evaluated 100 crowd-sourced annotations from the evaluation data based on two measures: (1) whether the sentence is sufficiently and correctly decontextualized, and (2) whether the sentence is grammatically correct and fluent. Overall, 88% of annotations were valid in both, 89% on the content and 88% on form.

5 Automatic Decontextualization

5.1 Models

We present two models for decontextualization: a coreference resolution model and a sequence-to-sequence generation model. For both models, the input is a concatenation of the title of the Wikipedia document, the section titles, and the paragraph containing the target sentence. During the annotation pilots, we found the document title is crucial for decontextualization, while section headers were frequently necessary or missing. We denote the title of the Wikipedia page as the sequence of tokens t, the section titles of the paragraph as the sequence of tokens t_s, and the n sentences of the paragraph from which the target sentence comes as x_1 ... x_n, where each x_i is a sequence of tokens and x_t is the target sentence (1 ≤ t ≤ n). The model considers the concatenation of a subset of the document,

[CLS] t [S] t_s [S] x_1 ... x_{t-1} [S] x_t [S] x_{t+1} ... x_n [S]

where [S] is a separator token. This representation differs from the setting of the annotators, who were given the full document context. As an approximation, we include article and section titles in the inputs, as these often contain salient contextual elements. We did experiment with giving more context, i.e., adding the first paragraph of the article as an additional input, but did not observe a performance improvement. In the initial pilot, annotators marked that 10-20% of examples required access to the full document.
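As an illustration, the input representation above can be assembled with a few lines of code. This is a minimal sketch under our own assumptions (plain string concatenation, literal "[CLS]" / "[S]" marker strings standing in for whatever special tokens a given model uses); it is not the authors' released preprocessing code.

```python
from typing import List

def build_model_input(page_title: str,
                      section_titles: List[str],
                      paragraph_sentences: List[str],
                      target_index: int,
                      sep: str = "[S]") -> str:
    """Concatenate title, section titles, and paragraph sentences with separators,
    marking the boundaries of the target sentence x_t and the end of the sequence."""
    parts = ["[CLS]", page_title, sep, " ".join(section_titles), sep]
    for i, sentence in enumerate(paragraph_sentences):
        parts.append(sentence)
        # Separators go after x_{t-1}, after x_t, and after the last sentence x_n.
        if i == target_index - 1 or i == target_index or i == len(paragraph_sentences) - 1:
            parts.append(sep)
    return " ".join(parts)
```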
The Coreference Model  As many decontextualization edits can be recovered by a coreference resolution module, we adapt the output of the state-of-the-art coreference resolution system of Joshi et al. (2020), trained on the CoNLL dataset (Pradhan et al., 2012), into a decontextualization system. We used the publicly available pre-trained checkpoint of SpanBERT-Large with the original hyperparameters.[4]

[4] https://github.com/facebookresearch/SpanBERT/

We run this model on the input sequence, and map the coreference cluster predictions to modifications of the sentence as follows. We only consider clusters with a mention in the target sentence. For each such cluster, we find its first mention inside the target sentence, and find another mention in the same cluster that was presented earlier in the input and is longer than the current mention. If such a mention is found, we replace the current entity mention string with the earliest such mention string (e.g., "She" is replaced with "Taylor Swift"). On average, 36.5% of examples were modified through this process.
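The substitution heuristic described above can be sketched roughly as follows, assuming the coreference model returns clusters of character-offset mention spans over the concatenated input. This is our own simplified reading, not the authors' implementation.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the input text

def apply_coref_rewrite(text: str,
                        clusters: List[List[Span]],
                        target_start: int,
                        target_end: int) -> str:
    """Replace the first in-target mention of each cluster with an earlier, longer mention."""
    replacements = []
    for cluster in clusters:
        mentions = sorted(cluster)  # left-to-right order of appearance
        in_target = [m for m in mentions if target_start <= m[0] and m[1] <= target_end]
        if not in_target:
            continue  # only clusters with a mention in the target sentence are considered
        first_in_target = in_target[0]
        earlier_longer = [m for m in mentions
                          if m[1] <= first_in_target[0]
                          and (m[1] - m[0]) > (first_in_target[1] - first_in_target[0])]
        if earlier_longer:
            antecedent = earlier_longer[0]  # earliest such mention
            replacements.append((first_in_target, text[antecedent[0]:antecedent[1]]))
    # Apply replacements right-to-left so earlier offsets stay valid.
    sentence = text[target_start:target_end]
    for (start, end), new_text in sorted(replacements, reverse=True):
        rel_start, rel_end = start - target_start, end - target_start
        sentence = sentence[:rel_start] + new_text + sentence[rel_end:]
    return sentence
```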
The Seq2Seq Generation Model  This model is based on the recent T5 model (Raffel et al., 2019b). We show two variations of the model, BASE and 11B, which mainly differ in model capacity. We fine-tune the model on our crowd-sourced training set, setting the target sequence to be [CAT][SEP]y, where [CAT] ∈ {UNNECESSARY, FEASIBLE, INFEASIBLE} and y is the decontextualized sentence when [CAT] = FEASIBLE and the original sentence when [CAT] ∈ {UNNECESSARY, INFEASIBLE}. UNNECESSARY marks examples where the original sentence can stand alone without any edits.

We limit the input/output to 512/128 tokens for both variants, and fine-tune from pre-trained checkpoints[5] with a batch size of 100 examples until the validation loss stops decreasing, after about 32K steps for the larger and 500K steps for the smaller model.

[5] https://github.com/google-research/text-to-text-transfer-transformer

5.2 Evaluation

5.2.1 Feasibility Detection

We first evaluate the accuracy of models in making the feasible vs. infeasible decision. To do this we compute the binary agreements with all human references and average them to get an accuracy.

Results  For the feasible vs. infeasible classification task, a baseline that always predicts FEASIBLE will have 88% accuracy. The larger variant of T5, T5-11B, achieves 89% accuracy, outperforming human agreement (85% accuracy), affirming the strong performance of pre-trained language models on classification tasks (Devlin et al., 2018). This model predicts the INFEASIBLE category infrequently (for 5% of examples), while humans classify an example as INFEASIBLE for 12% of examples. We observe that the smaller variant, T5-Base, is less accurate, over-predicting the INFEASIBLE category (for 20% of examples) and getting 77% accuracy. The coreference model, which is not trained for this decision, cannot decide decontextualization feasibility and serves as an untrained baseline.

5.2.2 Decontextualized Sentence Generation

Setup  For development / test examples, we have five human annotations per example. We only consider examples marked by three or more annotators (out of five) as FEASIBLE for decontextualized sentence generation. For each of these examples, we discard annotations which mark the example as INFEASIBLE. For automatic evaluation and comparison, we need a human output, which will be compared to the model outputs, and a set of reference annotations which will be considered as correct, gold annotations. The single human output provides a reference point for the evaluation measures to which the automatic output can be compared.

We observed that comparing a longer decontextualized sentence to shorter decontextualized sentences often erroneously results in low scores on automatic metrics (e.g., in the last example of Table 2, adding extra information will be erroneously punished). Thus, instead of randomly selecting one annotation to be used as the representative human output, we sort the annotations by the length of the output sentence (raw bytes), and take the annotation with median length[6] as the human output, treating the remaining annotations as the set of reference annotations. From manual inspection of the data, the median-length output often appeared to be optimal in terms of balancing length versus accuracy of the decontextualization.

[6] When there are four references, we take the second shortest sentence.
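A small sketch of this selection step, under the assumption that length is measured in raw bytes as described above:

```python
from typing import List, Tuple

def split_human_output(annotations: List[str]) -> Tuple[str, List[str]]:
    """Pick the median-length annotation as the representative human output;
    the remaining annotations become the reference set."""
    by_length = sorted(annotations, key=lambda s: len(s.encode("utf-8")))
    # Middle element for an odd count; second shortest when four annotations remain.
    median_index = (len(by_length) - 1) // 2
    human_output = by_length[median_index]
    references = [a for i, a in enumerate(by_length) if i != median_index]
    return human_output, references
```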
Metric  For each model prediction and human output, we report:

• Length increase, the average value of (len(decontext) − len(original)) / len(original).

• % edited, the proportion of examples that were modified for decontextualization (as opposed to being left unchanged).

• Sentence match, a binary score computed between the output and a set of references, indicating whether the output matches any of the references after normalization (stripping away articles and punctuation, and lowercasing). We report two numbers: a score on all examples, and a score on examples where all references edited the sentence.

• The SARI (system output against references and against the input sentence) metric (Xu et al., 2016). To compute this, for each reference we calculate a set of add edits, corresponding to which unigrams are seen in the reference but not in the original sentence. Conversely, we calculate the set of delete edits, corresponding to unigrams that are in the original sentence but not in the reference. We calculate precision/recall/F1-measure on add and delete edits. We look at unigrams only, and use fractional counts for the words in the references (i.e., a word appearing in one of r references will be counted as 1/r). We compute a micro average across examples, i.e., globally by counting the total true positives, false negatives and false positives, as many examples do not require any edits.[7]

[7] Similar to BLEU in machine translation, SARI is a useful measure for comparing different systems; however, due to the relatively large space of possible decontextualizations it will not be possible to achieve anything close to 100% F1 on SARI measures, and thus the absolute score is harder to interpret. A SARI score of, for example, 50% should not be interpreted as indicating a system with 50% accuracy.

While the sentence match score is the easiest to interpret, it punishes longer outputs, making comparisons across systems producing outputs of different lengths challenging, and it overly rewards conservative strategies that simply copy across the original sentence. Thus, we use the SARI metric as our main evaluation metric. SARI can be thought of as a precision/recall measure on tokens (unigrams) that should be added or deleted.
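The following sketch shows one plausible reading of the sentence-match normalization and the SARI-style add/delete computations described above; the released evaluation script is the authoritative implementation, and details such as tokenization, normalization, and the handling of repeated words may differ.

```python
import re
import string
from collections import Counter
from typing import List, Tuple

def normalize(s: str) -> str:
    # Lowercase, drop articles, drop punctuation, collapse whitespace.
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    s = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())

def sentence_match(output: str, references: List[str]) -> bool:
    return normalize(output) in {normalize(r) for r in references}

def edit_unigrams(original: str, edited: str) -> Tuple[Counter, Counter]:
    orig, new = Counter(original.split()), Counter(edited.split())
    return new - orig, orig - new  # (added, deleted), positive counts only

def sari_add_del_counts(original: str, output: str, references: List[str]):
    """Return (tp, fp, fn) for add and delete edits on one example.
    Accumulate these across examples for a micro-averaged P/R/F1:
    precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    assert references, "at least one reference is required"
    sys_add, sys_del = edit_unigrams(original, output)
    r = len(references)
    ref_add, ref_del = Counter(), Counter()
    for ref in references:
        add, delete = edit_unigrams(original, ref)
        for w in add:
            ref_add[w] += 1.0 / r   # fractional credit per reference
        for w in delete:
            ref_del[w] += 1.0 / r

    def prf(sys_edits: Counter, ref_edits: Counter):
        tp = sum(ref_edits.get(w, 0.0) for w in sys_edits)
        fp = sum(1.0 for w in sys_edits if w not in ref_edits)
        fn = sum(weight for w, weight in ref_edits.items() if w not in sys_edits)
        return tp, fp, fn

    return prf(sys_add, ref_add), prf(sys_del, ref_del)
```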
Model   | len inc. | % edited | match all / edited | SARI add F1 (P/R) | SARI del F1 (P/R)
Repeat  | 0        | 0        | 38 / 0             | 0 (0/0)           | 0 (0/0)
Coref   | 7        | 42       | 39 / 13            | 22 (51/14)        | 31 (34/28)
T5-Base | 8        | 40       | 48 / 21            | 29 (67/19)        | 40 (54/32)
T5-11B  | 12       | 59       | 53 / 32            | 42 (72/30)        | 46 (49/43)
Human   | 24       | 76       | 45 / 29            | 56 (64/49)        | 58 (61/55)

Table 4: Development set performance. len inc. is the average percentage increase in length from decontextualization. % edited is the proportion of examples that have at least one edit. match-all shows the percentage of outputs that have at least one match in the human references; match-edited shows the match value calculated on cases where all references include at least one edit.

Model   | len inc. | % edited | match all / edited | SARI add F1 (P/R) | SARI del F1 (P/R)
Repeat  | 0        | 0        | 36 / 0             | 0 (0/0)           | 0 (0/0)
Coref   | 8        | 42       | 38 / 13            | 23 (50/15)        | 36 (40/32)
T5-11B  | 13       | 61       | 52 / 32            | 43 (69/31)        | 47 (49/46)
Human   | 23       | 77       | 44 / 28            | 56 (64/49)        | 58 (61/56)

Table 5: Test set results. See the Table 4 caption for a key.

Automatic Evaluation  Tables 4 and 5 show development and test performance. A successful decontextualization system would result in a high sentence match, an adequate edited ratio (experts edited about 79% of examples) and length change ratio (the experts' ratio is 1.19), as well as high SARI addition and deletion scores. As a sanity check, we report REPEAT, which outputs the original sentence. This alone results in a high sentence match score, around 40%, meaning that on this fraction of examples at least one of the annotators deemed the sentence able to stand alone without any edits.

The coreference system achieves an exact match on about 13% of examples that require edits, without any task-specific fine-tuning. Its SARI add scores show high precision and low recall, and its deletion scores are low, as it cannot delete discourse markers. The seq2seq generation model achieves high scores across all measures. The bigger variant is substantially better, editing more than its smaller variant without losing precision. We observe that the larger variant outperforms the average human on the sentence match measure, but not on the SARI measures. The T5 model modifies fewer examples than the annotators, and its edits involve fewer tokens, benefiting it on the sentence match measure. However, the model is more likely to miss required edits, as shown by the low recall on the SARI add and delete measures. We discuss this further in the following human evaluation section.
Human Evaluation  We sampled 100 examples from the evaluation set where at least two annotators and our best model made decontextualization edits. We randomized the order of presentation of the T5 and human outputs so as to not bias the annotation. On this set, we (two of the authors) conducted a manual evaluation. Given two decontextualized sentences, one from the best model and another randomly selected from the set of annotations with decontextualization edits, we evaluated each on two dimensions: (a) is it fluent and grammatically correct? (b) is it sufficiently and correctly decontextualized? Lastly, we chose a preference between the two outputs (A, B, or either).

Expert annotators marked as 'sufficient' those items for which all possible referential ambiguities had been resolved. Given the subjective nature of the task, some 'insufficient' decontextualizations for one expert annotator could be valid for another annotator with different world knowledge. We report averaged binary scores from the two experts. The model output scored 88.0% on fluency and 67.5% on correct decontextualization, while the human reference output scored 84.5% on fluency and 78.5% on correct decontextualization. Both annotators found T5 to be slightly more fluent, while humans are more thorough and accurate in decontextualizing. Table 6 shows the preferences of the two annotators. Both preferred the human output, and their preferences exhibit high agreement (matching on 37 out of 40 examples when both had preferences).

          | T5 | either | Annotator | Sum
T5        | 13 | 12     | 2         | 27
either    | 7  | 22     | 4         | 33
Annotator | 1  | 15     | 24        | 40
Sum       | 21 | 49     | 30        | 100

Table 6: Preference between T5 output and human annotation. Columns represent the judgement of expert A, rows that of expert B. We see high agreement between the two expert annotators, although one expert annotator (the column annotator) is ambivalent more frequently.

We briefly characterize common error patterns for annotators and the T5 model. Similar error patterns emerge between the annotations and the model outputs. Both occasionally fail to identify generics that need to be replaced with referring NPs, phrases that require bridging, and temporal contexts that should be provided. Additionally, we noticed that the T5 model relies heavily on title cues, and sometimes fails to clarify ambiguous entities that are not the main entity of the page. We noticed very few examples where T5 hallucinates factually incorrect content.

6 Two Applications

We present two demonstrations of the utility of decontextualization. First, we argue that decontextualized sentences can be valuable in themselves in question answering, and then show that they can be useful as a preprocessing step.

6.1 Decontextualized Answer As Is

We showcase a use case of decontextualized sentences as providing a succinct yet informative answer to open domain factoid questions (Kwiatkowski et al., 2019). We design a user study where people compare a decontextualized-sentence answer with an original-sentence answer and a paragraph answer to the same query.[8]

[8] Understanding how to present answers to users is a complex problem with many desiderata, e.g., preserving the original content, crediting the source, and interaction with the user interface, which we are not covering comprehensively.

Setup  Given a question and two presentations of the same answer, raters were tasked with marking their preference between the two answer presentations (option A, option B, or either). The actual short span answer in the sentence is always highlighted (similar to that shown in Table 8); see Figure 3 for a screenshot.

We conduct three comparison studies on the same set of 150 questions: (a) decontextualized sentence vs. original sentence, (b) original sentence vs. original paragraph, (c) decontextualized sentence vs. original paragraph. For each example in each study, we collected 10 user ratings. The questions are randomly chosen from the set of questions which have a short answer, such that the sentence containing the short answer is categorized as FEASIBLE by the annotators and edits were necessary to decontextualize it. We use crowd-sourced annotations of decontextualized sentences. Figure 3 shows a screenshot of the user study interface.

Result  Table 7 shows the results of the user study. We observe that decontextualized sentence answers are preferred to both the original sentence answers and the original paragraph answers. We also note that users preferred sentence answers to paragraph answers in general.

Opt.A vs. Opt.B | Prefer A | Prefer B | either | log odds intercept [CI]
Dec. vs. Ori.   | 730      | 426      | 364    | 0.85 [0.4, 1.3]
Dec. vs. Par.   | 850      | 456      | 234    | 0.55 [0.1, 1.0]
Ori. vs. Par.   | 741      | 505      | 274    | 0.31 [-0.2, 0.8]

Table 7: User study results. Dec. refers to the decontextualized sentence answer, Ori. to the original sentence answer, and Par. to the paragraph answer. We present raw counts of preferences and the log odds of preferring option A with its 95% confidence interval.
Figure 3: A screenshot of the instructions and an example instance comparing the original sentence answer and the paragraph answer, as shown to annotators in the user study.

We further investigated the statistical significance of the preferences reported in Table 7. We noticed a quite large amount of question and rater variability: some raters consistently preferred a sentence answer, valuing conciseness, while some raters behaved in the other direction. Similarly, for some questions, all raters preferred a sentence answer. Figure 4 visualizes this variability across questions and raters.

Figure 4: Each dot represents how frequently the decontextualized answer is preferred to the original sentence for a single question (top plot) or a single rater (bottom plot). Questions and raters are sorted by their preference towards the decontextualized answer. The red line is where both are equally preferred, and points above the line represent questions / raters for which decontextualized answers were preferred. While the decontextualized answer is preferred overall, we see large variability.

To control for the correlations induced by the rater and question groups, we fit a generalized linear mixed model (GLMM) using the brms R package (Bürkner, 2017). For this analysis, we excluded data points where users did not show a preference (selected either). We used the formula

p ~ 1 + (1|r) + (1|q)

where p is whether a rater chose one option over the other, r is the rater id, and q is the question id. This formula specifies a regression of the log-odds of the rater preference while allowing for random effects in the raters (r) and questions (q). The last column of Table 7 shows the fixed effect coefficients and their confidence intervals. The intercept represents the strength of preference towards option A. We found a statistically significant preference for decontextualized sentences over both the original sentences and the paragraphs (the p-value was smaller than 0.05 for both studies).

Examples  We qualitatively investigated which examples benefit from decontextualization, and in which examples raters prefer paragraph answers. Table 8 shows questions together with two answer presentations, along with the predicted fixed effect question coefficient towards the decontextualized answer in study (b) and towards the sentence answer in study (c). In the first row, the added information from the decontextualization is not relevant to the question, so we observe a preference against decontextualization. In the second and third rows, the decontextualized sentence answer is preferred as it provides enough evidence to answer the query, while the original sentence answer does not.

Query: when was the rising of the moon written
Decontextualized answer: The Rising of the Moon, Irish ballad recounting a battle between the United Irishmen and the British Army, has been in circulation since circa 1865.
Paragraph answer (sentence answer highlighted): The ballad has been in circulation since circa 1865. The earliest verifiable date found in publication is 1867.
Decont.: -2.09  Ori.: -1.53

Query: what is the most viewed video on youtube in 24 hours
Decontextualized answer: The most viewed music video within 24 hours of its release is Taylor Swift's Look What You Made Me Do.
Paragraph answer: This list of most viewed online videos in the first 24 hours contains the top 30 online videos that received the most views within 24 hours of release across the world. This list excludes movie trailers, which are featured on the list of most viewed online trailers in the first 24 hours. The most viewed music video in this time period is Taylor Swift's Look What You Made Me Do.
Decont.: 1.06  Ori.: -0.48

Query: when was last time england got to quarter finals in world cup
Decontextualized answer: The England national football team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006.
Paragraph answer: England did not enter the competition until 1950. ... Their best ever performance is winning the Cup in the 1966, whilst they also finished in fourth place in 1990, and in 2018. Other than that, the team have reached the quarter-finals on nine occasions, the latest of which were at the 2002 and the 2006.
Decont.: 1.40  Ori.: 0.70

Table 8: Examples from the user study. The last column (Ori.) presents coefficients for preferring the original sentence over the original paragraph, and the fourth column (Decont.) presents coefficients for the decontextualized sentence over the paragraph. Positive values mean a preference towards the sentence-length answer over the paragraph-length answer.
6.2 Decontextualizing System Inputs

Having shown the benefits of decontextualization in a user facing task, we now investigate the use of decontextualization as a preprocessing step. Specifically, we construct a passage retrieval corpus for open domain question answering (Chen et al., 2017) from decontextualized sentences. Experiments show that decontextualized sentences ensure completeness of the passages while minimizing their length (and thus computational cost).

Background  Open domain question answering typically pairs a passage retrieval module (Liu and Croft, 2002) with a transformer-based answer extractor (a reading comprehension model) that operates on the retrieved passages (Guu et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2020). The computational cost is dominated by the cost of co-encoding the query with the retrieved passages (typically paragraphs or overlapping 100-word windows).

Setup  We create a corpus using the 7k documents (233k paragraphs, 868k sentences) associated with the questions in the NQ-open development set (Lee et al., 2019). We consider a retrieved passage to be correct if it contains one of the answer strings[9] and investigate the number of questions for which we can retrieve a correct passage for a fixed computational cost. Under this measure, we compare paragraphs, windows of 100 words, sentences, and decontextualized sentences as the set of retrieval passages. These segmentation approaches generate a different number of passages for the same article (paragraph and 100-word-window segmentation make fewer passages than sentence-level segmentation). To generate decontextualized sentences, we process all paragraphs with the T5-11B model, trained on all annotated data (including the development and test sets). For about 40% of sentences, the model classified the sentence as infeasible to decontextualize or as not requiring any edits; for these we use the original sentence. For the other 60%, the model tended to add more information. For example, for the sentence "Bush was widely seen as a 'pragmatic caretaker' president who lacked a unified and compelling long-term theme in his efforts.", the decontextualized sentence is "George H.W. Bush was widely seen as a 'pragmatic caretaker' president of the United States who lacked a unified and compelling long-term theme in his efforts.", the paragraph passage is the entire paragraph containing this sentence, and the 100-word window is a chunk segmented without regard to sentence boundaries. For all passage types, we prepend the document title to the passage, following the literature, and use TF-IDF as the retriever model.

[9] We adopt the answer match heuristics from Lee et al. (2019).

Metric  Let q_i be a question; let A_i = [a_i^0 ... a_i^n] be the set of valid answers; let C_i = [c_i^1 ... c_i^k] be a ranked list of evidence passages; and let H(A_i, C_i) be the index of the top ranked context that contains one of the valid answers (or k + 1 if there is no such context). We first define the cost of encoding a single question and passage,

c(q_i, c_i^m) = (|q_i| + 1 + |c_i^m|)^2.

This captures the fact that the Transformer's computation cost scales quadratically with the length of the encoded text (question + separator + evidence passage). The total cost spent on example i before the first correct passage is reached is

O(q_i, A_i, C_i) = \sum_{m=1}^{H(A_i, C_i)} c(q_i, c_i^m).

Given the per-example cost defined above, we define the recall of the retrieval system at computational cost budget t to be

(1/N) \sum_{i=1}^{N} 1[O(q_i, A_i, C_i) < t]     (1)

where N is the total number of examples and 1[·] is an indicator function. We use this as an evaluation measure, instead of mean reciprocal rank or recall at N, to compare across different retrieval passage lengths.
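A small sketch of this measure, assuming lengths are counted in whitespace-separated tokens and that answer matching is a simple substring test (the paper uses the answer-match heuristics of Lee et al. (2019)):

```python
from typing import List, Tuple

def encode_cost(question: str, passage: str) -> int:
    # Transformer cost grows quadratically in the encoded length:
    # question + separator + passage.
    return (len(question.split()) + 1 + len(passage.split())) ** 2

def cost_to_first_hit(question: str, answers: List[str], ranked_passages: List[str]) -> float:
    total = 0.0
    for passage in ranked_passages:
        total += encode_cost(question, passage)
        if any(answer in passage for answer in answers):
            return total
    return float("inf")  # no retrieved passage contains an answer

def recall_at_budget(examples: List[Tuple[str, List[str], List[str]]], budget: float) -> float:
    """Fraction of (question, answers, ranked_passages) examples whose first
    correct passage is reached within the computational cost budget (Eq. 1)."""
    examples = list(examples)
    hits = sum(1 for q, a, c in examples if cost_to_first_hit(q, a, c) < budget)
    return hits / len(examples) if examples else 0.0
```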
Results  Figure 5 plots the recall of each retrieval corpus at different computational cost budgets t on the whole NQ-open evaluation set. The graph shows that sentence-level segmentation is more cost-effective than paragraph or 100-word-window segmentation, and that using decontextualized sentences is more cost effective than using the original sentences. Decontextualized sentences approach the performance of the commonly used 100-word windows at one tenth of the cost.

Figure 5: Retrieval recall plotted against computational cost budget (Eqn 1) for different methods of document segmentation.

This result exemplifies the way in which decontextualization can be used to ensure that the input to a natural language understanding system is concise yet complete. We think this way of using decontextualization as preprocessing could also aid tasks such as summarization.

7 Related Work

Prior literature in summarization has studied how article context affects the understanding of sentences within an article. It has been observed that disambiguating entity mentions and correctly resolving anaphora is crucial for automatic summarization (Otterbacher et al., 2002; Steinberger et al., 2007) and for the evaluation of summarization systems (Pitler et al., 2010). Li et al. (2016) found that information missing from a sentence could be identified in the article context in newswire text 60% of the time. This is considerably less frequent than for the encyclopedic text studied here, but nevertheless hints that decontextualization for newswire text could be feasible. It remains unclear whether information accessible in newswire contexts can be readily incorporated into sentences using controlled edits of the type we employ.

Successful decontextualization models must resolve entity and event coreferences (Humphreys et al., 1997) as well as other forms of anaphora (Rösiger et al., 2018). These are necessary but insufficient for decontextualization, however, which also involves discourse marker removal, acronym expansion, and fluent and grammatical sentence generation.

The term decontextualization was introduced in a recent table-to-text generation dataset (Parikh et al., 2020), where a sentence from a Wikipedia document was decontextualized such that it can be interpreted when presented with a table alone. They cover only the sentences that are relevant to the table, and adapt them to the table context. In a recent image captioning dataset (Sharma et al., 2018), sentences are re-written such that information that cannot be inferred from the image is removed. For example, entity names are replaced with generics (e.g., *-Tom Cruz, +A man+ is waiting.).

8 Conclusion

We define decontextualization, the task of rewriting a sentence from a document to be interpretable in an empty context, while preserving its meaning. We build a crowdsourced dataset and a model for decontextualization, and demonstrate how decontextualization can be used in a user-facing task and as a sub-component of an application system.

We believe that decontextualization will also be helpful in a wide range of other applications. For example, in multi-document summarization (Fabbri et al., 2019), co-referring entities and events must be resolved across different documents, and removing ambiguous references may help; extractive summarization (Cheng and Lapata, 2016) could benefit from the type of pre-processing that we presented for open-domain QA; anaphora resolution is crucial for both summarization and machine translation (Susanne et al., 1992); and decontextualizing sentences may help in recovering explicit mentions of entities and relations, which can help information extraction (Narasimhan et al., 2016). The current formulation focuses on the English encyclopedic corpus and rewriting for an empty context; future work can explore different domains of text as well as mapping to a different context.

Acknowledgments

We would like to thank members of Google AI, especially Jacob Eisenstein, Kenton Lee, Santiago Ontanon, Ankur Parikh, Daniel Andor, Chris Alberti, and Slav Petrov, for helpful discussions and comments. Lastly, we would like to thank our annotators.
References

Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36:15–24.

Joachim Bingel, Gustavo Paetzold, and Anders Søgaard. 2018. Lexi: A tool for adaptive, personalized text simplification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 245–258, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Betty J. Birner. 2012. Introduction to Pragmatics, 1st edition. Wiley Publishing.

Paul-Christian Bürkner. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1):1–28.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 484–494, Berlin, Germany. Association for Computational Linguistics.

Herbert H. Clark. 1975. Bridging. In Theoretical Issues in Natural Language Processing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

Alexander R. Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

H. Paul Grice. 1975. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Speech Acts, volume 3 of Syntax and Semantics, pages 41–58. Academic Press, New York.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.

K. Humphreys, R. Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction.

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, et al. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. arXiv preprint 1906.00300.

Junyi Jessy Li, Bridget O'Daniel, Y. Wu, W. Zhao, and A. Nenkova. 2016. Improving the annotation of sentence specificity. In LREC.

X. Liu and W. Croft. 2002. Passage retrieval based on language models. In CIKM '02.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jahna Otterbacher, Dragomir R. Radev, and Airong Luo. 2002. Revisions that improve cohesion in multi-document summaries: A preliminary study. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Emily Pitler, Annie Louis, and Ani Nenkova. 2010. Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In EMNLP-CoNLL Shared Task.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint 1910.10683.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019b. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.

Ina Rösiger, Arndt Riester, and Jonas Kuhn. 2018. Bridging resolution: Task definition, corpus resources and rule-based experiments. In Proceedings of the International Conference on Computational Linguistics (COLING).

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Dan Sperber and Deirdre Wilson. 1986. Relevance: Communication and Cognition. Harvard University Press, USA.