Interactive Abstractive Summarization for Event News Tweets
Ori Shapira¹, Hadar Ronen², Meni Adler¹, Yael Amsterdamer¹, Judit Bar-Ilan² and Ido Dagan¹
¹ Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel
² Department of Information Science, Bar-Ilan University, Ramat-Gan, Israel
{obspp18,hadarg,meni.adler}@gmail.com, yael.amsterdamer@biu.ac.il, judit.bar-ilan@biu.ac.il, dagan@cs.biu.ac.il

Proceedings of the 2017 EMNLP System Demonstrations, pages 109–114, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Abstract

We present a novel interactive summarization system that is based on abstractive summarization, derived from a recent consolidated knowledge representation for multiple texts. We incorporate two interaction mechanisms, providing a bullet-style summary that lets users attain the most important information first and interactively drill down to more specific details. A usability study of our implementation, for event news tweets, suggests the utility of our approach for text exploration.

1 Introduction

Multi-document summarization (MDS) techniques aim to assist readers in obtaining the most important information when reading multiple texts on a topic. The dominant MDS approach focuses on constructing a short summary of some targeted length that captures the most important information, mimicking a manually-crafted "static" summary. As an alternative, a few papers have considered interactive summarization, where the presented information can be interactively explored by the user according to their needs and interests (Christensen et al., 2014; Leuski et al., 2003; Yan et al., 2011).

In this paper we propose a further contribution to this approach, focusing on interactive abstractive summarization. We suggest that an abstractive summarization approach, based on extracted "atomic" facts, is particularly suitable in the interactive setting, as it allows more flexible information presentation. Intuitively, it makes more sense for a user to explore information at the level of individual facts, rather than at the coarser level of full original sentences, as in prior work on interactive extractive summarization (see Section 6).

We build on the abstractive approach in supporting two useful modes of interaction. First, we present information in a bullet-style summary, where the most important information is initially displayed in bullet sentences, while further details may be obtained by unfolding additional bullets. Specifically, we implemented this approach for summarizing news tweets on a certain event along a timeline (see Figure 2). Our second mode of interaction is concept expansion, which allows viewing complementary information about a concept via its alternative term mentions, while tracking the concept's occurrences throughout the summary (see Figure 3). This information is hidden in static summaries that use original sentences (extractive) or a single term per concept (abstractive).

To facilitate the modular construction of interactive summaries, we utilize as input a consolidated representation of texts, in particular the recent Open Knowledge Representation (OKR) of Wities et al. (2017). Briefly, this representation captures the propositions of the texts, where co-referring concepts or propositions are collapsed together while keeping links to the original mentions (see Section 2). We leverage OKR structures to extract information at the level of atomic facts, to expand information from collapsed mentions, and to retrieve the sources from which summary sentences were derived.

The novelties of our interactive scheme call for verifying its effectiveness and usefulness for users. To that end, we have implemented our approach in a prototype system (Sections 3-4). This system automatically produces an interactive summary from input OKR data, which we assume to be parsed from the original texts by an external black-box tool. We have examined our system through a set of standard usability tests (Brooke, 1996; Lund, 2001) on gold-standard OKR datasets, which enabled us to study its contribution in isolation (Section 5). Our results show that the proposed system is highly valuable for readers, providing an appealing alternative to standard static summarization.
2 Preliminaries

As mentioned above, our interactive summarization system is based on a consolidated representation of the information in multiple texts. We next review some background on such representations and then describe the particular Open Knowledge Representation that we use.

Figure 1: Four tweets on an event and their OKR structure.

2.1 Consolidated Representation

Motivated by summarization and text exploration, recent work has considered the consolidation of textual information in various structures. As prominent examples, the studies of Liu et al. (2015) and Li et al. (2016) construct graph-based representations whose nodes are predicates, or arguments thereof, extracted from the original text, while the predicate-argument relations are captured by edges. Identical or coreferring concepts are collapsed into a single node.

Rospocher et al. (2016) present a more supervised approach where concepts in the graph are linked to DBPedia entries (http://wiki.dbpedia.org/). This, along with other metadata, is used to detect coreferences and disambiguate concepts.

None of these works considers interactive summaries, and in particular none incorporates sufficient data for our modes of user interaction. We next briefly review the Open Knowledge Representation recently introduced by Wities et al. (2017), which is used by our system.

2.2 Open Knowledge Representation

We illustrate the components of the OKR formalism that are central to our summarization method via the example OKR structure in Figure 1 (see Wities et al. (2017) for full details). At the top are four original tweets; at the bottom are two consolidated propositions (marked P1 and P2) and four entities (marked E1-E4) derived from these tweets. The figure depicts three types of links captured in OKR, as follows.

Mention links connect each proposition or entity with its set of mention terms, namely, every form of reference to the entity/proposition throughout the texts. E.g., E1 from Figure 1 is mentioned in the tweets as "man", "shooter" or "Radcliffe Haughton". Mentions of propositions are stored as templates with argument placeholders, e.g., "[a2] dead in [a3]". Through their mentions, entities and propositions are further linked with their occurrences in the original texts (omitted from the figure).

Argument links connect propositions with their arguments, which may be entities or (nested) propositions. Since a proposition may have several templates with different arguments, argument IDs (marked a1-a3 in P1) are used to capture coreferring arguments within the same proposition. For example, a2 and a3 appear as arguments in the two templates of P1, and refer to entity E2 and proposition P2, respectively.

Entailment links, marked by directed edges in Figure 1, track semantic entailment (in context) between different types of OKR components. For example, in E1, "Radcliffe Haughton" entails "man" or "shooter"; namely, the former is more specific/informative in the given context.
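To make the representation concrete, the following minimal sketch shows one way the OKR components described above could be held in code. It is our own illustration: the class and field names are hypothetical and do not reproduce the actual schema of Wities et al. (2017).

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A consolidated concept (e.g. E1) with all of its mention terms."""
    id: str
    # e.g. ["man", "shooter", "Radcliffe Haughton"]
    mentions: list = field(default_factory=list)
    # directed entailment edges between mentions: (more_specific, more_general)
    entailments: list = field(default_factory=list)

@dataclass
class Proposition:
    """A consolidated proposition (e.g. P1) with its template mentions."""
    id: str
    # templates with argument placeholders, e.g. ["[a2] dead in [a3]"]
    templates: list = field(default_factory=list)
    # argument IDs mapped to the entity/proposition they refer to,
    # e.g. {"a2": "E2", "a3": "P2"}
    arguments: dict = field(default_factory=dict)

@dataclass
class OKR:
    """The OKR graph for one tweet cluster."""
    entities: dict      # entity id -> Entity
    propositions: dict  # proposition id -> Proposition
```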
3 Comprehensive Summary Information

The architecture of our system consists of two main steps: (1) a preprocessing step in which we generate comprehensive summary information, and (2) interactive display of selected information. In this section we describe the first step, which is based on an input OKR structure. Our UI for exploring the summary information interactively is described in the following section.

The general scheme for generating summary information in our system is as follows.

1. Partition the OKR propositions into groups.
2. Generate representative summary sentences for each group of propositions. These yield the bullet-style summary sentences.
3. Generate metadata for each representative sentence: a knowledge score, concept expansions and a timestamp.

For the current system, we implemented a baseline method for each of these steps, which nonetheless achieved high satisfaction scores in the usability study (see Section 5).

We partition propositions, as captured in the OKR structure, into distinct groups such that the propositions of each group are (transitively) connected by argument links (ignoring link direction). E.g., in Figure 1, P2 is nested in P1 and thus the two are grouped together.
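Since groups are defined by transitive connectivity over argument links with direction ignored, this step amounts to computing connected components. A minimal sketch, assuming the hypothetical OKR classes above:

```python
from collections import deque

def group_propositions(okr):
    """Partition propositions into groups that are (transitively)
    connected by argument links, ignoring link direction."""
    # Undirected adjacency between propositions: an edge whenever one
    # proposition appears as an argument of another.
    adj = {pid: set() for pid in okr.propositions}
    for pid, prop in okr.propositions.items():
        for target in prop.arguments.values():
            if target in okr.propositions:  # nested proposition argument
                adj[pid].add(target)
                adj[target].add(pid)
    # Standard BFS connected-components computation.
    groups, seen = [], set()
    for start in okr.propositions:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            cur = queue.popleft()
            component.append(cur)
            for nxt in adj[cur] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(component)
    return groups
```

On the structure of Figure 1 this yields a single group {P1, P2}, since P2 fills argument a3 of P1.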
Figure 2: The initial view of a summary about a shooting in a Wisconsin spa covering 109 tweets. Ten generated sentences cover the most salient information throughout these tweets, and are ordered along the event timeline.

Next, for the "root" (i.e., not nested) proposition in a group, we generate alternative candidate sentences. This is done by filling its templates with all the possible combinations of relevant argument mentions, and recursively so for nested propositions. For example, for P1 we would generate "[3 people] dead in [shooting in [Wisconsin]]", "[3 people] dead in [[spa] shooting]", "[Three] dead in [[spa] shooting]", and so on (22 candidate sentences in total).
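The exhaustive template filling can be sketched as a recursive procedure: each argument slot is filled with every mention of the referenced entity or, for a nested proposition, with every realization generated for it. This is an illustrative reconstruction (it assumes the hypothetical classes above, slot markers of the form "[a2]", and acyclic nesting), not the system's actual code:

```python
import itertools
import re

def candidate_sentences(okr, prop_id):
    """Generate every candidate realization of a proposition by filling
    its templates with all combinations of argument mentions, recursing
    into nested propositions (assumed acyclic)."""
    prop = okr.propositions[prop_id]
    candidates = []
    for template in prop.templates:
        slots = re.findall(r"\[(a\d+)\]", template)
        # Per slot: either the entity's mentions or the nested
        # proposition's recursively generated realizations.
        options = []
        for slot in slots:
            target = prop.arguments[slot]
            if target in okr.propositions:
                options.append(candidate_sentences(okr, target))
            else:
                options.append(okr.entities[target].mentions)
        # The Cartesian product over slot fillers yields all candidates.
        for fillers in itertools.product(*options):
            sentence = template
            for slot, filler in zip(slots, fillers):
                sentence = sentence.replace("[%s]" % slot, filler, 1)
            candidates.append(sentence)
    return candidates
```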
From each set of candidates we choose one representative sentence. Importantly, this means that, unlike bounded-length summary paragraphs, our comprehensive summary information effectively covers all the propositions in the original texts. Instead of filtering out less salient information upfront, it is only hidden initially in the UI and can be unfolded by the user (see Section 4). For a representative sentence, we choose a candidate with a high language model score (for the language model, we trained an LSTM model, https://github.com/yandex/faster-rnnlm, on a collection of 100M tweets), a high knowledge score (defined below) and small length, by optimizing a weighted sum of these factors.

The knowledge score of each sentence intuitively reflects how common its mentions are in the original texts, as well as how informative (specific) they are, based on the OKR entailment links. Reconsidering Figure 1, for example, in the tweet "Three dead in spa shooting", the concepts "three", "dead" and "spa shooting" should each be rewarded for appearing in two tweets, but "three" should be rewarded less than "3 people", which is more informative. We use the following heuristically formulated equation to calculate the score of each generated sentence s:

score(s) = Σ_{m ∈ mentions(s)} (α + β · depth(m))

where mentions(s) are the mentions of predicates and entities in the sentence, and depth(m) assigns a given mention m its depth in the relevant lexical entailment graph within the OKR. We have empirically set α = 1, β = 0.1.
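The knowledge score is thus a simple sum over a sentence's mentions, and selecting the representative maximizes the weighted combination. In the sketch below, depth and the combination weights are placeholders: the paper fixes α = 1 and β = 0.1 but does not publish the selection weights.

```python
ALPHA, BETA = 1.0, 0.1  # values set empirically in the paper

def knowledge_score(mentions, depth):
    """score(s) = sum over mentions m of (ALPHA + BETA * depth(m)),
    where depth(m) is m's depth in the OKR lexical entailment graph."""
    return sum(ALPHA + BETA * depth(m) for m in mentions)

def choose_representative(candidates, lm_score, k_score, w=(1.0, 1.0, -0.1)):
    """Pick the candidate maximizing a weighted sum of language-model
    score, knowledge score and (negatively weighted) length. The weight
    values here are hypothetical placeholders."""
    w_lm, w_k, w_len = w
    return max(candidates,
               key=lambda s: w_lm * lm_score(s)
                             + w_k * k_score(s)
                             + w_len * len(s.split()))
```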
Each concept (entity or proposition) in the summary sentences is linked to its mentions and original texts using the OKR. The set of mentions is cleaned of duplicates (strings with small edit distance), yielding a concept expansion for every set with more than one distinct mention. This gives extra information about concepts that might otherwise have been missed; in Figure 3, for example, the "suspected gunman" is also identified as "Jamaican". For the tweet summarization scenario, we also compute the timestamp of each representative sentence as the time of the first tweet mentioning its root proposition.

4 Interactive User Interface

We now describe the web application (http://u.cs.biu.ac.il/~shapiro1/okr/) we implemented, designed for the interactive exploration of multiple tweets on a specific event. Our backend is implemented in Python 2.7 and runs on a CentOS server. The frontend is implemented with the AngularJS library, and JSON is used for data interchange.

Figure 2 shows the initial screen summarizing a set of 109 tweets about the shooting in a Wisconsin spa from our running example. Bullet-style sentences (generated as explained in Section 3) are displayed along the event timeline, in descending order of their timestamps. As an indication of salience, a pie chart to the right of each sentence shows the "percentage" of knowledge it covers according to its normalized knowledge score. The pie chart at the top shows the total knowledge "covered" by the currently visible sentences.

Initially, only sentences exceeding a certain information score threshold are displayed, as a concise bullet-style summary of the event. Other sentences are folded (e.g., between timestamps 01:07 and 22:55 in Figure 2). The user can then decide whether and which sentences to unfold, according to (a) time intervals of interest on the timeline; (b) the number of folded sentences, as indicated in the middle of the line; and (c) the amount of additional knowledge to be unfolded, which is highlighted on the top pie chart when hovering over folded sentences. By repeatedly unfolding sentences, the user can gradually discover the full timeline of the event with all the consolidated data from the tweets.

Figure 3: The concept expansion pop-up consisting of mentions referring to the same person as "suspected gunman", revealing further information (e.g. "Jamaican").

Another mode of discovering information is via concept expansion: hovering over a highlighted concept (e.g., "suspected gunman") opens a pop-up with the different mentions of the same concept in the summary (Figure 3); clicking it further highlights all of its coreferences in the summary. Finally, the user can also click the Twitter icon to inspect the source tweets (Figure 4).

Figure 4: The tweets pop-up shows a scrollable pane with the source tweets for a generated sentence.
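The initial fold/unfold split described above can be expressed as a simple filter over normalized knowledge scores. A minimal sketch; the normalization and the threshold value are our assumptions, since the paper only states that sentences below a certain score threshold start out folded:

```python
def initial_view(sentences, score, threshold=0.1):
    """Split sentences into initially visible bullets and folded ones by
    their share of the total knowledge score (threshold is hypothetical)."""
    total = sum(score(s) for s in sentences) or 1.0
    visible = [s for s in sentences if score(s) / total >= threshold]
    folded = [s for s in sentences if score(s) / total < threshold]
    return visible, folded
```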
5 System Usability Tests

To assess and improve the value of our system, we conducted two usability studies employing standard usability tests. The tests were performed on a dataset of human-annotated OKR structures (of the form of Figure 1) released by Wities et al. (2017); we took their 6 largest clusters of event tweets, of about 100 tweets each. This gold-standard dataset enabled us to study the merits of our novel system in isolation. Given the positive results that we report below, we plan, in future work, to integrate and study our system in a fully automated pipeline.

5.1 Preliminary Usability Study

A first usability study was conducted with two goals: to examine the usefulness of our ideas and to understand user needs.

Methodology. The evaluation phase of a prototype requires only a few evaluators, according to the "discount" usability testing principle (Nielsen, 1993). Thus, six students not familiar with our project were recruited as evaluators. We asked them to perform a series of predefined tasks on one of the six selected events. During system usage we observed the users' activity and employed a "think aloud" technique to obtain user remarks. Each on-screen activity was captured using Debut Video Capturing Software (http://www.nchsoftware.com/capture/). After performing all tasks, users were asked to fill in the System Usability Scale (SUS) questionnaire (Brooke, 1996) for subjective usability evaluation.

Results. Table 1 lists the average scores obtained for each of the ten SUS questions, on a scale of 1 to 5. Overall, users found the prototype easy to use and showed willingness to use it frequently.

The SUS questionnaire yields a single number in [0, 100] representing a composite measure of the overall usability of the system, calculated from the ten question scores. As seen in Table 2, except for one dissatisfied user (this user had a software quality assurance background and seemed to inspect for very minor software and user-experience bugs, which we have since addressed), the system received high scores ranging from 70 to 95. The observations and verbal reports gathered during the test yielded a list of requirements that helped improve our prototype.
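For reference, the SUS composite reported in Table 2 follows the standard computation of Brooke (1996): odd-numbered items contribute their score minus 1, even-numbered items contribute 5 minus their score, and the sum is scaled by 2.5 into [0, 100].

```python
def sus_score(responses):
    """Standard SUS composite (Brooke, 1996) from the ten 1-5 answers,
    given in questionnaire order."""
    assert len(responses) == 10
    # 0-based even index corresponds to an odd-numbered questionnaire item.
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return 2.5 * total
```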
SUS Question                                                           Avg. Score
I think that I would like to use this system frequently.                  3.83
I found the system unnecessarily complex.                                 2.33
I thought the system was easy to use.                                     3.33
I think that I would need the support of a technical person
  to be able to use this system.                                          2.17
I found the various functions in this system were well integrated.       3.83
I thought there was too much inconsistency in this system.                1.67
I would imagine that most people would learn to use this
  system very quickly.                                                    3.50
I found the system very cumbersome to use.                                1.33
I felt very confident using the system.                                   3.67
I needed to learn a lot of things before I could get going
  with this system.                                                       2.17

Table 1: The ten SUS questions asked after the usability study and the average answer score on a scale of 1 to 5.

User        1     2     3     4     5     6
SUS Score   70    80    95    72.5  82.5  27.5

Table 2: SUS scores for each user, calculated based on the ten SUS question scores.

5.2 Comparative Usability Test

After updating our system to incorporate improvements suggested by the preliminary study, we conducted another, comparative, study to examine the relative effectiveness of our system.

Methodology. We compared our system, here denoted IAS (for Interactive Abstractive Summary), with two baseline approaches:

• Tweet: a list of all the original tweets in the event dataset.
• Static: the full ordered list of sentences generated by our system (Section 3), with no interactive features or metadata (such as concept expansion, knowledge scores, etc.).

As mentioned earlier, we used the gold-standard OKR structures for 6 of the events released by Wities et al. (2017). Six users were each presented with two events in each interface (IAS, Tweet, Static), where the assignment of events to interfaces and the order of the interfaces differed for each user. The users explored the information describing each event in the assigned interface, and at the end were asked to complete the USE questionnaire (Lund, 2001). This questionnaire required users to rank each of the three interfaces on a scale from 1 to 3 according to 33 statements. The original 30 USE statements represent four dimensions: Usefulness, Satisfaction, Ease of Use, and Ease of Learning. We added three statements to rank the user's experience of knowledge exploration ("The system motivated me to actively explore more information"; "The system made me feel that I know the highlights of the event"; "The system helped me notice the important details of the event").

Results. Table 3 shows the average rank of each interface in each of the examined dimensions.

Dimension              Tweet   Static   IAS
Usefulness             2.1     1.8      2.3
Knowledge Exploration  2.0     1.8      2.6
Satisfaction           2.0     1.7      2.3
Ease of Use            2.5     2.3      2.1
Ease of Learning       2.7     2.5      2.3

Table 3: USE questionnaire dimension score comparison of the three system interfaces on a scale of 1 to 3.

While our system was naturally somewhat more complex to use than the baselines, which only require reading, it consistently received the highest ranks in the dimensions of Usefulness, Satisfaction and Knowledge Exploration. This indicates that interactivity indeed provides substantial value to the user, regardless of the summary sentences (as evident from the comparison to the Static baseline).

The ranked USE statements also serve as an indication of the quality of our summary relative to the baselines. Standard summarization metrics (ROUGE and the Pyramid method are the common metrics for evaluating summaries) are designed for static summaries, and are thus not expressly adequate for our interactive system, whose content is dynamic and user-manipulated. Having demonstrated here that interactive summaries are useful, designing and conducting dedicated quality tests for interactive summaries is a priority in our future work.
6 Related Work

A vast body of work has been dedicated to the problem of multi-text summarization. We focus here on the rather few studies that enhance summarization with user interaction.

The iNeATS system (Leuski et al., 2003) was an early attempt at interactive summarization, allowing explicit control over parameters such as length, participating elements, etc. Yan et al. (2011) studied a more implicit approach, attempting to discover user preferences, such as topics and contexts, via user clicks. Both approaches involve repeatedly updating a summary paragraph based on user feedback.
The more recent SUMMA system (Christensen et al., 2014) resembles ours in supporting hierarchical summarization: salient summary sentences are high in the hierarchy, and further details can be discovered by drilling down into lower levels.

All of the aforementioned methods compute extractive summaries, which are composed of sentences from the original texts. In comparison, our abstractive approach has a few appealing advantages. Most importantly, it facilitates the construction of flexible bullet-style summaries, since we are not confined to existing sentences, which may combine several atomic facts of varying saliency or require textual context. This, in turn, allows users to browse data at the level of atomic facts and avoids the need to regenerate the summary in order to incorporate user feedback.

7 Conclusion and Future Work

In this paper we presented a novel system for the interactive exploration of abstractive summary information. Our system builds on the Open Knowledge Representation (Wities et al., 2017) for consolidating the information of multiple texts, and produces a summary that fully captures this information. The interactive UI allows focusing on the most salient facts as well as gradually obtaining further details via different interaction modes. Our usability studies provide supportive evidence for the usefulness of our approach.

Our results shed light on a few important directions for future research. In general, our interactive abstractive method should be ported to other domains and types of corpora. E.g., while in the case of news tweets sentence ordering was done along a timeline, the ordering of consolidated summary sentences may in general be a nontrivial task. Further, our approach to summary sentence generation can be enhanced, e.g., by using machine learning techniques to select the best representative sentences. For evaluation, we will design tests adequate for assessing the quality of an interactive summary, and use them in a more extensive user study that will incorporate a fully automated pipeline (i.e., an OKR parser).

Acknowledgments

This work was supported in part by grants from the MAGNET program of the Israeli Office of the Chief Scientist (OCS); the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1); by Contract HR0011-15-2-0025 with the US Defense Advanced Research Projects Agency (DARPA); by the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office; and by the Israel Science Foundation (grant No. 1157/16).
References

John Brooke. 1996. SUS - a "quick and dirty" usability scale. In Usability Evaluation in Industry. CRC Press.

Janara Christensen, Stephen Soderland, Gagan Bansal, and Mausam. 2014. Hierarchical summarization: Scaling up multi-document summarization. In ACL.

Anton Leuski, Chin-Yew Lin, and Eduard H. Hovy. 2003. iNeATS: Interactive multi-document summarization. In ACL.

Wei Li, Lei He, and Hai Zhuge. 2016. Abstractive news summarization based on event semantic link network. In COLING.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman M. Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In HLT-NAACL.

Arnold M. Lund. 2001. Measuring usability with the USE questionnaire. STC Usability SIG Newsletter, 8(2).

Jakob Nielsen. 1993. Usability Engineering. Academic Press.

Marco Rospocher, Marieke van Erp, Piek T. J. M. Vossen, Antske Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa, Thomas Ploeger, and Tessel Bogaard. 2016. Building event-centric knowledge graphs from news. Journal of Web Semantics, 37-38.

Rachel Wities, Vered Shwartz, Gabriel Stanovsky, Meni Adler, Ori Shapira, Shyam Upadhyay, Dan Roth, Eugenio Martinez Camara, Iryna Gurevych, and Ido Dagan. 2017. A consolidated open knowledge representation for multiple texts. In LSDSem, EACL.

Rui Yan, Jian-Yun Nie, and Xiaoming Li. 2011. Summarize what you are interested in: An optimization framework for interactive personalized summarization. In EMNLP.