ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Alexander R. Fabbri† Faiaz Rahman† Imad Rizvi† Borui Wang† Haoran Li‡ Yashar Mehdad‡ Dragomir Radev†
† Yale University ‡ Facebook AI
{alexander.fabbri, faiaz.rahman, imad.rizvi, borui.wang, dragomir.radev}@yale.edu
{aimeeli, mehdad}@fb.com

Abstract

While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations.

Headline: SuperBowl
Snippet: Whether you're a football fan or not, what do you like about Super Bowl Sunday?
Comment: ... In my opinion I think the Falcons will stomp the patriots. I think Tom Brady will choke the Super Bowl. ...
Comment: I am big Arizona Cardinals fan so when they didn't even make the playoffs i was upset. ...
Comment: I'm not a very big football fan at all. So when it comes to Superbowl Sunday, I'm in it for the commercials and the half time show. ...
Comment: I am not exactly a football fan, but I enjoy watching the Super Bowl....
...
Summary: Several commenters list their favorite things about the Super Bowl, including half-time shows, the funny commercials, the Puppy Bowl, eating food, and spending time with family. A couple of commenters admit to not being football fans but still enjoying the Super Bowl. Some commenters discuss whether they thought the Falcons or the Patriots were going to win, while others list teams they wish were in the game.

Table 1: Example summary of comments from a New York Times article discussing people's favorite parts of the Super Bowl. The summary is an analysis of the comments and quantifies the viewpoints present.

1 Introduction

Automatic text summarization is the process of outputting the most salient parts of an input in a concise and readable form. Recent work in summarization has made significant progress due to the introduction of large-scale datasets such as the CNN-DailyMail dataset (Nallapati et al., 2016) and the New York Times dataset (Sandhaus, 2008). Furthermore, the use of large self-supervised pretrained models such as BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2019) has achieved state-of-the-art performance across summarization tasks and strong performance in zero- and few-shot settings (Fabbri et al., 2020a). However, less work has focused on summarizing online conversations. Unlike documents, articles, and scientific papers, which contain specific linguistic structures and conventions such as topic sentences and abstracts, conversational text scatters main points across multiple utterances and between numerous writers. As a result, the text summarization task in the conversational data domain offers a challenging research field to test newly-developed models (Chen and Yang, 2020).
Recently, Gliwa et al. (2019a) introduced a dataset for chat-dialogue conversation summarization consisting of 16k examples, the first large-scale dataset of its kind. Previous work in conversation summarization was limited by the data available and focused primarily on meeting summarization, such as the AMI (Kraaij et al., 2005) and ICSI (Janin et al., 2003) datasets.
The datasets used in recent conversation papers are often not uniform, ranging from visual dialogue data (Goo and Chen, 2018a) to customer-service dialogues (Yuan and Yu, 2019), not initially intended for summarization. The limited availability of benchmark datasets for comparing methods has constrained work in other conversation summarization domains and thus likely inhibited progress (Kryscinski et al., 2019; Fabbri et al., 2020b).

We aim to address this research gap by crowdsourcing a suite of four datasets, which we call ConvoSumm, that can evaluate a model's performance on a broad spectrum of conversation data. In determining the domains of data to collect, we use the general definition of conversation as "any discourse produced by more than one person" (Ford, 1991). We identify several key categories of data for which standard human-created development and testing datasets do not exist, namely (1) news article comments, (2) discussion forums and debate, (3) community question answering, and (4) email threads. We design annotation protocols motivated by work in quantifying viewpoints present in news comment data (Barker and Gaizauskas, 2016a) to crowdsource 250 development and 250 test examples for each of the above domains. We provide an example of comments to a New York Times news article, and our crowdsourced summary, in Table 1.

In addition to introducing manually-curated datasets for conversation summarization, we also aim to unify previous work in conversation summarization. Namely, we benchmark a state-of-the-art abstractive model on several conversation datasets: dialogue summarization from SAMSum (Gliwa et al., 2019b), heuristic-generated community question answering from CQASumm (Chowdhury and Chakraborty, 2018), meeting summarization data from AMI and ICSI, and smaller test sets in the news comments, discussion forum, and email domains. We believe that such benchmarking will facilitate a more straightforward comparison of conversation summarization models across domains.

To unify modeling across these conversational domains, we propose to use recent work in end-to-end argument mining (Lenz et al., 2020; Stab and Gurevych, 2014; Chakrabarty et al., 2019) to instantiate the theoretical graph framework that motivated our annotation protocol, proposed by Barker and Gaizauskas (2016a) for conversation summarization. This protocol is employed to both identify and use the "issues–viewpoints–assertions" argument structure (discussed in Related Work) for summarizing news comments. We construct this argument graph using entailment relations, linearize the graph, train a graph-to-text model (Ribeiro et al., 2020), and experiment with argument mining as a way to reduce noise in long-text input.

Our contributions are the following: (1) we crowdsource datasets for four domains of conversational data and analyze the characteristics of our proposed datasets; (2) we benchmark state-of-the-art models on these datasets as well as previous widely-used conversation summarization datasets to provide a clear baseline for future work; and (3) we apply argument mining to model the structure of our conversational data better as well as reduce noise in long-text input, showing comparable or improved results in both automatic and human evaluations.1

1 For reproducibility of our findings, we will make our data and code publicly available at https://github.com/Yale-LILY/ConvoSumm.

2 Related Work

Modeling Conversation Summarization
Early approaches to conversation summarization consisted of feature engineering (Shasha Xie et al., 2008), template selection methods (Oya et al., 2014), and statistical machine learning approaches (Galley, 2006; Wang and Cardie, 2013). More recent modeling approaches for dialogue summarization have attempted to take advantage of conversation structures found within the data through dialogue act classification (Goo and Chen, 2018b), discourse labeling (Ganesh and Dingliwal, 2019), topic segmentation (Liu et al., 2019c), and key-point analysis (Liu et al., 2019a). Chen and Yang (2020) utilize multiple conversational structures from different perspectives in their sequence-to-sequence model. However, such approaches focus exclusively on dialogue summarization, and it is not trivial to extend such methods to longer conversations with many more participants. We thus introduce a method to model the structure of the discourse over the many-party conversation.
Several existing works have focused on conceptualizing conversation structure for summarization and how to present this structure to end-users. Barker et al. (2016a) propose a conversation overview summary that aims to capture the key argumentative content of a reader comment conversation. Misra et al. (2017) use summarization as a means of probing online debates to discover central propositions, which they cluster to identify argument facets. Barker and Gaizauskas (2016b) identify three key components of conversational dialogue: issues (that individuals discuss), viewpoints (that they hold about these issues), and assertions (that they make to support their viewpoints). We build on this framework and on advances in argument mining for end-to-end training for summarization.

Argument Mining
Work in argument mining (Stab and Gurevych, 2014) has aimed to identify argumentative units and classify them into claims, premises, and major claims, or claims describing the key concept in a text. More recently, Chakrabarty et al. (2019) propose to fine-tune BERT (Devlin et al., 2019) for identifying argumentative units and relationships between them within a text and across texts. Lenz et al. (2020) are the first to propose an end-to-end approach for constructing an argument graph (Stede et al., 2016), a structured representation of claims and premises in an argumentative text; the graph is built by connecting claim and premise argumentative discourse units. We build on this framework for modeling discourse in conversational data.

Few-Shot Summarization
As the datasets we introduce are not on a scale with larger datasets, we focus on few-shot and domain-transfer summarization techniques. Wang et al. (2019) examine domain adaptation in extractive summarization, while Hua and Wang (2017) examine domain adaptation between opinion and news summarization. Within unsupervised abstractive summarization, several approaches have made use of variational autoencoders (Baziotis et al., 2019; Chu and Liu, 2019; Bražinskas et al., 2020) and pretrained language models (Zhou and Rush, 2019; Laban et al., 2020). Recent work in abstractive (Zhang et al., 2019; Fabbri et al., 2020a) and extractive-compressive summarization (Desai et al., 2020) has shown the power of pretrained models for few-shot transfer. The quality of models trained on several hundred examples in these papers is comparable to that of models trained on the equivalent full datasets. Thus, we believe that introducing curated validation and testing datasets consisting of a few hundred examples is a valuable contribution within the current paradigm, which was confirmed by the poor performance of models transferred from other domains compared to that of the model trained on this validation data.

3 ConvoSumm

In this section, we introduce our dataset selection, our annotation protocol, and the characteristics of our crowdsourced dataset.

Data Selection
For the news comments subdomain, we use the NYT Comments dataset, which consists of 2 million comments made on 9,000 New York Times articles published between 2017 and 2018. It is publicly available and has been used in work for news-comment relevance modeling (Kolhatkar and Taboada, 2017); it also contains metadata that may be of use in summarization modeling. For the discussion forums and debate subdomain, we select Reddit data from CoarseDiscourse (Zhang et al., 2017), which contains annotations about the discourse structure of the threads. For the community question answering subdomain, we use StackExchange (Stack), which provides access to all forums and has been used in modeling for answer relevance and question deduplication (Hoogeveen et al., 2015). We chose StackExchange over the commonly-used Yahoo! Answers data due to licensing reasons. For the email threads subdomain, we use the publicly-available W3C corpus (Craswell et al., 2005). Previous work also made use of this dataset for email summarization (Ulrich et al., 2008) but provided only a small sample of 40 email threads, for which we provide transfer testing results.

We generally follow the guidance of Tomasoni and Huang (2010), from summarizing community question answering forums, for determining which subsets of data to select from the above datasets. We remove an example if (1) there were fewer than five posts (four in the case of email threads; "post" refers to any answer, comment, or email); (2) the longest post was over 400 words; (3) the sum of all post lengths was outside of [100, 1400] words (although we extended this maximum length for NYT comments); or (4) the average length of the posts was outside of the [50, 300] words interval. For Stack data, we first filtered answers which received a negative community rating, as defined by the number of user upvotes minus the number of user downvotes. While real-world settings may contain much longer threads, we later show that this setting is already challenging.
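To make the selection criteria above concrete, the following is a minimal sketch of the example-level filter; the function names and the thread representation (a list of posts as plain strings, plus per-answer scores for Stack) are our own illustration and are not taken from the released code.

```python
from typing import List

def keep_example(posts: List[str], is_email: bool = False,
                 max_total_words: int = 1400) -> bool:
    """Apply the ConvoSumm-style filters: post count, longest post,
    total length, and average post length (all measured in words)."""
    lengths = [len(p.split()) for p in posts]
    min_posts = 4 if is_email else 5               # (1) minimum number of posts
    if len(posts) < min_posts:
        return False
    if max(lengths) > 400:                         # (2) longest post <= 400 words
        return False
    if not 100 <= sum(lengths) <= max_total_words: # (3) total length in [100, 1400]
        return False                               #     (cap raised for NYT comments)
    avg = sum(lengths) / len(lengths)
    if not 50 <= avg <= 300:                       # (4) average post length in [50, 300]
        return False
    return True

def filter_stack_answers(answers: List[str], scores: List[int]) -> List[str]:
    """For StackExchange, drop answers whose community rating
    (upvotes minus downvotes) is negative before applying keep_example."""
    return [a for a, s in zip(answers, scores) if s >= 0]
```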
Annotation Protocol
We designed annotation instructions for crowdsourced workers to write abstractive summaries for each of the four datasets, motivated by work in summarizing viewpoints present in online conversation (Barker and Gaizauskas, 2016a). We present the crowdsource workers with the data threads, along with any available metadata. For NYT, we presented the workers with the article headline, keywords, and, rather than providing the entire article as context, an extractive BERT-based summary (Miller, 2019) of the article. We use a BERT summary to give the annotators an idea of the topic of the article. We avoided having annotators read the entire article since the focus of their summaries was solely the content of the comments as per the annotation protocols, and reading the entire article could end up introducing information in the summaries that was not necessarily representative of the comments' main points. We found that these summaries were useful in initial in-house annotations, and allowed us to better understand the context of the comments being summarized. For Reddit and Stack, question tags and information about the subforum were provided; the Stack data includes both answers and answer comments. Reddit data was filtered simply on word limits due to the unavailability of up/down votes from the Coarse Discourse data. Stack data includes the prompt/title as well. Whenever possible, we included username information and the scores of all comments, posts, and answers.

Although the instructions differed slightly with the specific nuances of each dataset, they had standard overall rules: (1) summaries should be an analysis of the given input rather than another response or utterance; (2) summaries should be abstractive, i.e., annotators were required to paraphrase and could not repeat more than five words in a row from the source; and (3) summary lengths should contain [40, 90] tokens. Following the issues–viewpoints–assertions framework presented in Barker and Gaizauskas (2016b), we also instructed annotators that summaries should summarize all viewpoints in the input and should try to include specific details from assertions and anecdotes (unless this made the summary too lengthy). Summarizing based on similar viewpoints is analogous to clustering then summarizing, similar to the comment label grouping procedure before summarization in Barker et al. (2016b). To help with this, we recommended wording such as "Most commenters suggest that..." and "Some commenters think that..." to group responses with similar viewpoints.

However, the email dataset was unique among the selected datasets given that it contained more back-and-forth dialogue than clusters of viewpoints, and thus identifying the speakers was essential to creating summaries that still retained meaning from the original email dialogue. Since the email threads contained fewer individual speakers than the other datasets, this sort of summarization remained feasible. Thus, for this dataset, annotators were instructed to specify the speakers when summarizing the conversation.

Quality-Controlled Crowdsourcing
We crowdsourced our data using Amazon Mechanical Turk. We required that our workers be native English speakers and pass a qualifying exam for each domain to be summarized. We worked with a select group of about 15 workers who formed a community of high-quality annotators. Example summaries were provided to the workers. The workers submitted the qualifying exam, and then one of the authors of this paper provided feedback. If the worker was not sure of the quality of the summaries written, at any point, they could enlist the input of one of the authors.
Additionally, after the workers wrote all summaries, we manually reviewed every summary and made corrections to grammar, wording, and overall structure. Summaries we could not fix ourselves, either because they were poorly written or did not follow the annotation protocols, were flagged to be re-written. They were then sent to our approved group of workers to be re-written, excluding any workers who had written a flagged summary. While data crowdsourced from non-experts may contain noise (Gillick and Liu, 2010), we believe that our setup of working closely with a small group of workers, providing feedback to individual workers, and manually reviewing all final summaries mitigates these issues.

Dataset Statistics
We provide statistics in Table 2.

Dataset | % novel n-grams | Extractive Oracle | Summary Length | Input Length | # Docs/Example
NYT | 36.11/79.72/94.52 | 36.26/10.21/31.23 | 79 | 1624 | 16.95
Reddit | 43.84/84.98/95.65 | 35.74/10.45/30.74 | 65 | 641 | 7.88
Stack | 35.12/77.91/93.56 | 37.30/10.70/31.93 | 73 | 1207 | 9.72
Email | 42.09/83.27/93.98 | 40.98/15.50/35.22 | 74 | 917 | 4.95

Table 2: Statistics across dataset sources in ConvoSumm, showing novel uni/bi/tri-grams, ROUGE-1/2/L extractive oracle scores, the average input and summary lengths (number of tokens), as well as the number of documents per example, where each comment/post/answer/email is considered a document.

The percentage of novel n-grams in our summaries is higher than that of the very abstractive XSum dataset (Narayan et al., 2018) (35.76/83.45/95.50 % novel uni/bi/tri-grams). This level of abstraction is likely due to the instructions to perform abstractive summarization and the summaries being an analysis of the input, which results in the insertion of new words (e.g., "commenters" likely isn't seen in the input). The influence of this abstraction is further seen in an analysis of the Extractive Oracle, for which we show ROUGE-1/2/L (Lin, 2004). We see that Extractive Oracle performance on our datasets is above that on the very abstractive XSum (Narayan et al., 2018) (29.79 ROUGE-1), but much lower than the Extractive Oracle on the CNN-DailyMail (CNNDM) dataset (Nallapati et al., 2016) (>50 ROUGE-1). The summary lengths are fairly consistent, while the input lengths are the longest for NYT and Stack data. We include the title and additional meta-data such as the headline and snippet in NYT data in input length calculations.

We analyze multi-document summarization–specific characteristics of our datasets, as proposed by Dey et al. (2020a). In particular, inter-document similarity measures the degree of overlap of semantic units in the candidate documents, with scores further from zero signifying less overlap. The notion introduced for redundancy measures the overall distribution of semantic units; the farther the score is from zero, the more uniform the semantic units are across the entire input, with the maximum reached when each unit is present only once. Layout bias measures the similarity of multi-sentential documents with the reference. For more precise definitions, we refer the reader to Dey et al. (2020a). We provide results for our data in Table 3.

Dataset | Inter-document Similarity | Redundancy | Layout Bias (start/middle/end)
NYT | -11.71 | -0.23 | 0.2/0.5/0.3
Reddit | -7.56 | -0.49 | 0.2/0.5/0.2
Stack | -9.59 | -0.27 | 0.2/0.3/0.4
Email | -1.76 | -0.18 | 0.3/0.4/0.3

Table 3: Multi-document summarization-specific dataset analysis on our proposed datasets with metrics introduced in Dey et al. (2020a): inter-document similarity (farther from zero means less similarity), redundancy (farther from zero means less overall redundancy of semantic units), and start/middle/end layout bias.

Email data exhibits the most inter-document similarity, which follows the intuition that an email thread consists of a focused discussion typically on a single topic. For redundancy, we see Reddit shows the most uniform distribution of semantic units, perhaps due to Reddit threads' less focused nature compared to the remaining datasets. We do not see a particularly strong layout bias across any parts of the input documents. Our datasets exhibit greater or comparable levels of novel n-grams compared to multi-document summarization datasets such as MultiNews (Fabbri et al., 2019) and CQASUMM (Chowdhury and Chakraborty, 2018). Our Stack subset has lower inter-document similarity, which presents challenges for models which rely strictly on redundancy in the input, and our datasets generally exhibit less layout bias, when compared to the analysis done in Dey et al. (2020b).
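As a concrete reference for the abstractiveness numbers in Table 2, the following is a small sketch of how the percentage of novel uni/bi/tri-grams can be computed. This is a standard formulation and an assumption on our part, since the paper does not release the exact script, and the whitespace tokenization here is only illustrative.

```python
from typing import List

def ngrams(tokens: List[str], n: int) -> set:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(source: str, summary: str, n: int) -> float:
    """Percentage of summary n-grams that never appear in the source."""
    src_ngrams = ngrams(source.lower().split(), n)
    sum_ngrams = ngrams(summary.lower().split(), n)
    if not sum_ngrams:
        return 0.0
    novel = [g for g in sum_ngrams if g not in src_ngrams]
    return 100.0 * len(novel) / len(sum_ngrams)

# Example: report novel uni/bi/tri-gram percentages for one (input, summary) pair.
source = "I am not exactly a football fan, but I enjoy watching the Super Bowl."
summary = "Several commenters admit to not being football fans but still enjoy the game."
print([round(novel_ngram_pct(source, summary, n), 2) for n in (1, 2, 3)])
```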
Comparison to Existing Datasets
Although previous work on conversation summarization, before the introduction of SAMSum (Gliwa et al., 2019b), has largely featured unsupervised or few-shot methods, there exist several datasets with reference summaries. These include SENSEI (Barker et al., 2016b) for news comments, the Argumentative Dialogue Summary Corpus (ADS) (Misra et al., 2015) for discussion forums, and the BC3 (Ulrich et al., 2009) dataset for email data. However, many of the existing datasets are not wide in scope. For example, SENSEI only covers six topics, and the ADS Corpus covers one topic and only has 45 dialogues. Furthermore, they each pertain to one subdomain of conversation. Our dataset avoids these issues by covering four diverse subdomains of conversation and having approximately 500 annotated summaries for each subdomain. Additionally, since neural abstractive summarization baselines do not exist for these datasets, we benchmark our models on these datasets to further their use as test sets. We similarly include the AMI and ICSI meeting datasets within our benchmark.

Within community question answering, the WikiHowQA dataset (Deng et al., 2020) consists of user response threads to non-factoid questions starting with "how to," including labels for the answer selection task and reference summaries. The CQASUMM dataset (Chowdhury and Chakraborty, 2018) sampled threads from Yahoo! Answers in which the best answer could be used as a reference summary. However, this heuristic is not guaranteed to cover all the user answers' perspectives, so we believe our dataset is a more principled benchmark for community question answering.

Several large-scale multi-document summarization datasets have also been introduced in the news domain (Fabbri et al., 2019; Gu et al., 2020; Gholipour Ghalandari et al., 2020), for creating Wikipedia lead paragraphs (Liu et al., 2018), and for long-form question answering (Fan et al., 2019). However, these do not focus on the conversational domain.

4 Argument Graph Summarization

As our annotation protocol is motivated by the issues–viewpoints–assertions framework proposed in Barker and Gaizauskas (2016a), we propose to instantiate a modified version of that work's theoretical, proposed graph model.

Argument Graph Construction
We build on the argument graph formulation of Lenz et al. (2020), a variant of the Argument Interchange Format (Chesnevar et al., 2006). Claims and premises are represented as information nodes (I-nodes), with the relations between them represented as scheme nodes (S-nodes). Let V = I ∪ S be the set of nodes, and E ⊂ V × V the set of edges describing support relationships among the nodes. We then define the argument graph G = (V, E).
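Purely as an illustration of the structure just defined, here is a minimal in-memory representation of such an argument graph; the class and field names are our own, and for brevity this sketch connects I-nodes directly with support edges rather than through explicit S-nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class INode:
    """Information node: a claim or premise sentence from one document."""
    node_id: int
    text: str
    kind: str            # "claim", "premise", "issue", or "conversation"

@dataclass
class ArgumentGraph:
    """G = (V, E): nodes plus directed support edges stored as
    (child_id, parent_id) pairs, with the Conversation Node as the root."""
    nodes: Dict[int, INode] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_edge(self, child: int, parent: int) -> None:
        self.edges.append((child, parent))

    def children(self, parent: int) -> List[int]:
        return [c for c, p in self.edges if p == parent]
```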
Lenz et al. (2020) break the construction of the argument graph down into four steps: (1) argument extraction, or the identification of argumentative discourse units; (2) relationship type classification, or the classification of edges between nodes; (3) major claim detection; and (4) graph construction, or the construction of the final graph based on the identified nodes and edges. To adapt this formulation to our multi-document setting, we first perform argument extraction and relationship type classification for each individual input document, and finally graph construction to determine relationships among claims from all documents.

Argument Extraction
For extracting arguments from a single document, we build on work in argument mining with pretrained models (Chakrabarty et al., 2019). As in Lenz et al. (2020), our argumentative units are sentences, from which we identify claims, which are assertions that something is true, and premises, which are propositions from which a conclusion is drawn. Additionally, we identify and remove non-argumentative units. We train a three-way classifier for the task of argument extraction, following Chakrabarty et al. (2019) and making use of data for argument mining from that paper and from Stab and Gurevych (2014). The output of this step can also simply be used without further graph construction as a less noisy version of the input, which we call -arg-filtered.
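The three-way sentence classifier can be sketched as follows with a fine-tuned pretrained encoder; the checkpoint path, label order, and inference loop here are illustrative assumptions, not the authors' released model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["claim", "premise", "non-argumentative"]  # assumed label order

# Hypothetical path to a BERT-style model fine-tuned on the argument-mining
# data of Chakrabarty et al. (2019) and Stab and Gurevych (2014).
MODEL_DIR = "path/to/argument-extraction-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def extract_arguments(sentences):
    """Label each sentence and drop non-argumentative units (-arg-filtered)."""
    kept = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        label = LABELS[int(logits.argmax(dim=-1))]
        if label != "non-argumentative":
            kept.append((sent, label))
    return kept
```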
Relationship Type Classification
We follow the procedure in Lenz et al. (2020) and use entailment to determine the relationship between argumentative units within a document. However, rather than using the classifier provided, we make use of RoBERTa (Liu et al., 2019b) fine-tuned on the MNLI entailment dataset (Williams et al., 2018). Rather than using both support and contradiction edges between claims and premises, we make the simplification that all relationships can be captured with support edges, as we are dealing with a single document in this step. Within a single text, a premise can be tied as following from one of the claims. We create an edge between any premise and the claim it most entails if the entailment score from RoBERTa is greater than 0.33, based on manual analysis of the scores. If a premise is not labeled as supporting a claim, then we heuristically create an edge between that premise and the closest claim preceding it in the text.

Since not all texts in the benchmark datasets may be argumentative or may be too short to contain major claims, we use some heuristics in our graph creation. If none of the argumentative sentences are labeled as claims (i.e., all are labeled as premises) in argument extraction, the text's first sentence is labeled as the claim. Furthermore, we do not identify a single claim as the major claim since there may be multiple major points of discussion.

Graph Construction
For the final graph, for each of the documents in an example, we run the above procedure and obtain a set of claims and associated premises. We then identify support edges between claims, which may be across documents. One claim may make a larger assertion, which is supported by other claims. We run our entailment model over all potential edges (in both directions) among claims in the documents and greedily add edges according to the entailment support score while no cycles are made. After this step, we are left with a set of claims which do not entail any other nodes or, stated otherwise, do not have parent nodes. Following the terminology of Barker and Gaizauskas (2016b), these nodes can be considered viewpoints.

We then identify issues, or topics on which the viewpoints differ. We run our entailment model for all parent claim nodes again in both directions over these claims and identify nodes that contradict each other with probability over 0.33, based on manual analysis of the resulting graphs. We greedily add edges to maintain a tree structure, joining these nodes to a special node, which we call the Issue node. All Issue nodes, as well as claims which are not connected to any Issue node, are connected to a dummy Conversation Node, which serves as the root of the argument graph. We show an example Issue subgraph for NYT data in Figure 1.

Figure 1: Sample argument subgraph constructed from NYT news comments illustrating varying viewpoints. Claims "I honestly..." and "but I dont.." are entailed by premises, connected through Default Inference nodes, and opposing claims are connected through Issue nodes.
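A rough sketch of the entailment scoring used for these premise-to-claim edges is shown below, using an off-the-shelf MNLI model from the HuggingFace hub as a stand-in for the paper's fairseq-based RoBERTa. The checkpoint name and data layout are assumptions; the 0.33 threshold and the fallback to the closest preceding claim follow the description above, and the entailment label index is read from the model config rather than hard-coded.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MNLI_MODEL = "roberta-large-mnli"  # assumed stand-in for the paper's checkpoint
tokenizer = AutoTokenizer.from_pretrained(MNLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MNLI_MODEL)
model.eval()

def nli_probs(premise: str, hypothesis: str):
    """Return a {label: probability} dict for one ordered sentence pair."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**enc).logits.softmax(dim=-1).squeeze(0)
    return {model.config.id2label[i].lower(): float(p) for i, p in enumerate(probs)}

def link_premises_to_claims(claims, premises, threshold=0.33):
    """Attach each premise to the claim it most entails (score > threshold);
    otherwise fall back to the closest claim preceding it in the text.
    `claims` and `premises` are lists of (sentence_position, text) pairs."""
    if not claims:
        return []
    edges = []
    for p_pos, p_text in premises:
        scored = [(nli_probs(p_text, c_text)["entailment"], c_pos)
                  for c_pos, c_text in claims]
        best_score, best_claim = max(scored)
        if best_score > threshold:
            edges.append((p_pos, best_claim))       # premise supports this claim
        else:
            preceding = [c_pos for c_pos, _ in claims if c_pos < p_pos]
            if preceding:
                edges.append((p_pos, max(preceding)))
    return edges
```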
Argument Graphs to Summaries
Recent work has shown the strength of text-based pretrained models on graph-to-text problems (Ribeiro et al., 2020). Following that work, we linearize the graph by following a depth-first approach starting from the Conversation Node. We found that inserting special tokens to signify edge types did not improve performance, likely due to the size of our data, and simply make use of an arrow → to signify the relationship between sentences. We train a sequence-to-sequence model on our linearized graph input, which we call -arg-graph.

5 Experimental Settings

We use the fairseq codebase (Ott et al., 2019) for our experiments. Our base abstractive text summarization model is BART-large (Lewis et al., 2020), a pretrained denoising autoencoder with 336M parameters that builds on the sequence-to-sequence transformer of Vaswani et al. (2017). We fine-tune BART using a polynomial decay learning rate scheduler with the Adam optimizer (Kingma and Ba, 2015). We used a learning rate of 3e-5 and warmup and total updates of 20 and 200, respectively, following previous few-shot transfer work (Fabbri et al., 2020a). We could have equally fine-tuned other pretrained models such as Pegasus (Zhang et al., 2019) or T5 (Raffel et al., 2019), but Fabbri et al. (2020a) find that BART largely performs equally well in few-shot settings when compared to Pegasus.

For the NYT and Stack datasets, which contain sequences over the typical 1024 max encoder length with which BART is trained, we copied the encoder positional embeddings to allow sequences up to length 2048. To address the input length of meeting transcripts, which range from 6k to 12k tokens, we use the Longformer (Beltagy et al., 2020), which allows for sequences up to length 16k tokens. We initialize the Longformer model with BART parameters trained on the CNN-DailyMail dataset, as the meeting summarization datasets contain fewer than 100 data points. We otherwise fine-tune models from vanilla BART, following intuition in few-shot summarization (Fabbri et al., 2020a) and based on initial experiments. In the tables which follow, "-arg" refers to any model trained with argument-mining-based input, and we specify which of the -arg-graph or -arg-filtered settings was used for each dataset below.
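A minimal sketch of extending learned positional embeddings by copying is given below. It operates on a generic torch nn.Embedding, since the paper's change was made inside fairseq, and the attribute path shown in the comment for the HuggingFace BART port is an assumption that may need adjusting.

```python
import torch
import torch.nn as nn

def extend_positional_embeddings(pos_emb: nn.Embedding,
                                 new_num_positions: int) -> nn.Embedding:
    """Return a longer positional-embedding table: trained rows are kept,
    and the extra rows are filled by repeating the trained rows so that
    inputs up to roughly 2048 tokens can be encoded."""
    old_num, dim = pos_emb.weight.shape
    if new_num_positions <= old_num:
        return pos_emb
    new_emb = nn.Embedding(new_num_positions, dim)
    with torch.no_grad():
        new_emb.weight[:old_num] = pos_emb.weight
        # Tile the learned embeddings to fill the remaining positions.
        for start in range(old_num, new_num_positions, old_num):
            end = min(start + old_num, new_num_positions)
            new_emb.weight[start:end] = pos_emb.weight[: end - start]
    return new_emb

# Assumed usage with the HuggingFace port of BART (the paper used fairseq, and
# BART's learned positional table reserves 2 extra offset rows, so the exact
# attribute path and sizes below are assumptions that may need adjusting):
# enc = model.model.encoder
# enc.embed_positions = extend_positional_embeddings(enc.embed_positions, 2048 + 2)
# model.config.max_position_embeddings = 2048
```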
6 Results

We provide results for the baseline, unsupervised extractive models in Table 4: LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-ext (Miller, 2019), which makes use of BERT (Devlin et al., 2019). The unsupervised extractive models perform well below the extractive oracle performance, suggesting the difficulty of content selection in this setting.

Dataset | LexRank | TextRank | BERT-ext
NYT | 22.30/3.87/19.14 | 25.11/3.75/20.61 | 25.88/3.81/22.00
Reddit | 22.71/4.52/19.38 | 24.38/4.54/19.84 | 24.51/4.18/20.95
Stack | 26.30/5.62/22.27 | 25.43/4.40/20.58 | 26.84/4.63/22.85
Email | 16.04/3.68/13.38 | 19.50/3.90/16.18 | 25.46/6.17/21.73

Table 4: ROUGE-1/2/L results for extractive LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-based (Miller, 2019) models.

For abstractive models, we train BART on 200 examples from our validation set, using the remaining 50 as validation, and test on the final test set of 250 examples. We tested zero-shot transfer from CNNDM and SAMSum, although these resulted in a much lower performance of about 28 ROUGE-1. Few-shot model performance is shown in Table 5. The abstractive model performs at or above the Extractive Oracle, suggesting the need for better abstractive models.

Data | BART | BART-arg
NYT | 35.91/9.22/31.28 | 36.60/9.83/32.61
Reddit | 35.50/10.64/32.57 | 36.39/11.38/33.57
Stack | 39.61/10.98/35.35 | 39.73/11.17/35.52
Email | 41.46/13.76/37.70 | 40.32/12.97/36.90

Table 5: ROUGE-1/2/L results for vanilla BART as well as one trained on argument-mining input. Both are trained on 200 points from ConvoSumm.

We also train on our argument mining-based approaches and show results in Table 5. We see ROUGE improvements when applying BART-arg-graph for Reddit and Stack data. The -arg-filtered variation (which, as defined in Section 4, is the less noisy version of the input produced by the argument extraction step) outperformed the -arg-graph variation on both email and NYT data. For email data, however, this did not improve upon the BART baseline, likely due to the dataset's characteristics; email data is shorter and more linear, not benefiting from modeling the argument structure or removing non-argumentative units. We provide full results for both variations in the Appendix.
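For reference, ROUGE-1/2/L numbers like those in Tables 4 and 5 can be computed as in the sketch below; the choice of the rouge-score package (and of stemming) is our assumption, since the paper does not state which implementation or settings it used.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference: str, prediction: str):
    """Return ROUGE-1/2/L F1 scores (as percentages) for one summary pair."""
    scores = scorer.score(reference, prediction)
    return {name: 100 * s.fmeasure for name, s in scores.items()}

reference = ("Several commenters list their favorite things about the Super Bowl, "
             "while others discuss which team they expected to win.")
prediction = ("Commenters describe what they like about the Super Bowl and debate "
              "whether the Falcons or the Patriots would win.")
print(rouge_f1(reference, prediction))
```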
Benchmarking Other Conversation Summarization Datasets
We benchmark our models on widely used meeting summarization datasets. Due to the input's linear nature and the size of the meeting transcripts, we found improved results using -arg-filtered to filter non-argumentative units rather than incorporating the graph structure. Results are shown in Table 6. The Longformer model performs as well or better than previous state-of-the-art results on these datasets, despite not making use of more complex modeling structures, and we generally see improvement with argument mining.

Method | AMI | ICSI
HMNet | 53.02/18.57/- | 46.28/10.60/-
DDA-GCN | 53.15/22.32/- | -
Longformer-BART | 54.20/20.72/51.36 | 43.03/12.14/40.26
Longformer-BART-arg | 54.47/20.83/51.74 | 44.17/11.69/41.33

Table 6: ROUGE-1/2/L results for DDA-GCN (Feng et al., 2020) and HMNet (Zhu et al., 2020) on the AMI and ICSI meeting summarization datasets along with our Longformer and Longformer-arg models.

As noted above, there exist prior datasets for dialogue, community question answering, email, forum, and news comments summarization. We benchmark results on these datasets in Table 7. We outperform prior work on SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018) with our BART and BART-arg-graph models, respectively. We did not find improvement on SAMSum with the BART-arg model due to the extremely short and focused nature of the dialogues, analogous to email data performance. We also provide transfer results of BART and BART-arg-graph models from our email and news-comment data to BC3 (Ulrich et al., 2009), ADS (Misra et al., 2015), and SENSEI data (Barker et al., 2016b), for which no prior neural abstractive summarization results existed.

Dataset | Our results | Previous SOTA
SAMSum | 52.27/27.82/47.92 | 49.30/25.60/47.70
CQASUMM | 32.79/6.68/28.83 | 31.00/5.00/15.20
BC3 | 39.59/13.98/21.20 | -
ADS | 37.18/11.42/21.27 | -
SENSEI | 34.57/7.08/16.80 | -

Table 7: Benchmarking results on conversational datasets such as SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018) and initial neural abstractive summarization results for email (BC3) (Ulrich et al., 2008), debate discussion forums (ADS) (Misra et al., 2015), and news comments (SENSEI) (Barker et al., 2016b).

Human Evaluations
We collect human judgment annotations for two of the four quality dimensions studied in Kryscinski et al. (2019) and Fabbri et al. (2020b), namely consistency and relevance.
Consistency is defined as the factual alignment between the summary and the summarized source text, while relevance is defined as the summary's ability to select important content; only relevant information and viewpoints should be included. We did not include fluency, as an initial inspection of the data found fluency to be of very high quality, as has been shown to be the case for pretrained models in news summarization (Fabbri et al., 2020b). We did not include coherence, as this was generally not an issue of concern in the initial analysis.

We randomly select 25 examples from the Reddit corpus and ten examples from the AMI corpus, and output from the BART and BART-arg-graph models. These data points were chosen to demonstrate what characteristics are realized in differences across ROUGE for argument-graph and argument-noise-reduction approaches. Ten examples were chosen from AMI due to the size of the input and annotation constraints. The annotator sees the source article and randomly-ordered output from the models and then rates the summaries for relevance and consistency on a Likert scale from 1 to 5, with 5 being the best score. We averaged the scores of three native English-speaking annotators on each example and then across examples. Results are shown in Table 8. We find that the annotators prefer our argument mining-based approaches in both dimensions. However, the results are close. Furthermore, the scores for relevance and consistency are rather low, especially on the Reddit dataset and when compared to results on the CNN-DailyMail dataset from Fabbri et al. (2020b). These results demonstrate the difficulty of modeling such conversational data. Examples are included in the appendix.

Target Dataset | BART Relevance | BART Consistency | BART-arg Relevance | BART-arg Consistency
Reddit | 3.39 (0.13) | 3.40 (0.12) | 3.47 (0.12) | 3.41 (0.10)
AMI | 4.07 (0.16) | 3.67 (0.16) | 4.13 (0.17) | 3.70 (0.17)

Table 8: Mean relevance and factual consistency annotations for BART and BART-arg outputs on Reddit and AMI. Standard errors are reported in parentheses.

7 Conclusion

We propose ConvoSumm, a benchmark of four new, crowdsourced conversation datasets and state-of-the-art baselines on widely-used datasets that promote more unified progress in summarization beyond the news domain. Our benchmark consists of high-quality, human-written summaries that call for abstractive summaries and a deeper understanding of the input texts' structure. We provide results for baseline models and propose to model the text's argument structure, showing that such structure helps better quantify viewpoints in non-linear input in both automatic and human evaluations. Our analysis notes challenges in modeling relevance and consistency in abstractive conversation summarization when compared to news summarization.

8 Ethical Considerations

As we propose novel conversation summarization datasets and modeling components, this section is divided into the following two parts.

8.1 New Dataset

Intellectual Properties and Privacy Rights
All data for our newly-introduced datasets are available online; please see the following for New York Times comment data2, StackExchange data3, and W3C email data4. Reddit data is available via the Google BigQuery tool5.

2 https://www.kaggle.com/aashita/nyt-comments
3 https://archive.org/download/stackexchange
4 https://tides.umiacs.umd.edu/webtrec/trecent/parsed_w3c_corpus.html
5 https://console.cloud.google.com/bigquery

Compensation for Annotators
We compensated the Turkers approximately $12–$15 per hour. We first annotated examples in-house to determine the required annotation speed. Typically, the summarization task took around 10 minutes, and we compensated the workers from $2.25 to $3.00 per task, depending on the domain and deadline requirements.

Steps Taken to Avoid Potential Problems
We interacted closely with the Turkers to ensure that compensation was fair and that the instructions were clear. To maintain the quality of the dataset, we manually reviewed the crowdsourced summaries for language use. Initial investigation into Reddit data showed certain inappropriate language usage, so we filtered these examples automatically.

8.2 NLP Application

Bias
Biases may exist in the datasets, such as political bias in the news datasets and gender bias in potentially all of the datasets. Thus, models trained on these datasets may propagate these biases.
We removed data with offensive language when possible.

Misuse Potential and Failure Mode
When used as intended, applying the summarization models described in this paper can save people much time. However, the current models are still prone to producing hallucinated summaries, and in such a case, they may contribute to misinformation on the internet. Further research is needed to ensure the faithfulness of abstractive summaries, as this issue is present among all current abstractive summarization models.

Environmental Cost
The experiments described in the paper make use of V100 GPUs. We used up to 8 GPUs per experiment (depending on the experiment; sometimes, a single GPU was used to run the maximum number of experiments in parallel). The experiments may take up to a couple of hours for the larger datasets. Several dozen experiments were run due to parameter search, and future work should experiment with distilled models for more lightweight training. We note that while our work required extensive experiments to draw sound conclusions, future work will be able to draw on these insights and need not run as many large-scale comparisons. Models in production may be trained once, using the most promising settings.

References

Emma Barker and Robert Gaizauskas. 2016a. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker and Robert Gaizauskas. 2016b. Summarizing multi-party argumentative conversations in reader comment on news. In Proceedings of the Third Workshop on Argument Mining (ArgMining2016), pages 12–20, Berlin, Germany. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016a. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016b. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles. Association for Computational Linguistics.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. SEQ^3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 673–681, Minneapolis, Minnesota. Association for Computational Linguistics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Arthur Bražinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised opinion summarization as copycat-review generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5151–5169, Online. Association for Computational Linguistics.

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, Hong Kong, China. Association for Computational Linguistics.

Jiaao Chen and Diyi Yang. 2020. Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4106–4118, Online. Association for Computational Linguistics.

Carlos Chesnevar, Sanjay Modgil, Iyad Rahwan, Chris Reed, Guillermo Simari, Matthew South, Gerard Vreeswijk, Steven Willmott, et al. 2006. Towards an argument interchange format. The Knowledge Engineering Review, 21(4):293–316.

Tanya Chowdhury and Tanmoy Chakraborty. 2018. CQASUMM: Building references for community question answering summarization corpora.

Eric Chu and Peter J. Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1223–1232. PMLR.
Nick Craswell, Arjen P de Vries, and Ian Soboroff. and few-shot abstractive summarization with inter- 2005. Overview of the trec 2005 enterprise track. mediate fine-tuning and data augmentation. arXiv In TREC, volume 5, pages 199–205. preprint arXiv:2010.12836. Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Alexander R Fabbri, Wojciech Kryściński, Bryan Yaliang Li, Min Yang, and Ying Shen. 2020. Joint McCann, Caiming Xiong, Richard Socher, and learning of answer selection and answer summary Dragomir Radev. 2020b. Summeval: Re- generation in community question answering. In evaluating summarization evaluation. arXiv The Thirty-Fourth AAAI Conference on Artificial In- preprint arXiv:2007.12626. telligence, AAAI 2020, The Thirty-Second Innova- tive Applications of Artificial Intelligence Confer- Angela Fan, Yacine Jernite, Ethan Perez, David Grang- ence, IAAI 2020, The Tenth AAAI Symposium on Ed- ier, Jason Weston, and Michael Auli. 2019. ELI5: ucational Advances in Artificial Intelligence, EAAI Long form question answering. In Proceedings of 2020, New York, NY, USA, February 7-12, 2020, the 57th Annual Meeting of the Association for Com- pages 7651–7658. AAAI Press. putational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics. Shrey Desai, Jiacheng Xu, and Greg Durrett. 2020. Xiachong Feng, Xiaocheng Feng, Bing Qin, Xin- Compressive summarization with plausibility and wei Geng, and Ting Liu. 2020. Dialogue salience modeling. In Proceedings of the 2020 Con- discourse-aware graph convolutional networks for ference on Empirical Methods in Natural Language abstractive meeting summarization. arXiv preprint Processing (EMNLP), pages 6259–6274, Online. As- arXiv:2012.03502. sociation for Computational Linguistics. Cecilia E Ford. 1991. Linguistics: The cambridge sur- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and vey: Volume 4. language: The socio-cultural context. Kristina Toutanova. 2019. BERT: Pre-training of frederick h. newmeyer (ed.). Studies in Second Lan- deep bidirectional transformers for language under- guage Acquisition, 13(3):412–413. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association Michel Galley. 2006. A skip-chain conditional random for Computational Linguistics: Human Language field for ranking meeting utterances by importance. Technologies, Volume 1 (Long and Short Papers), In Proceedings of the 2006 Conference on Empiri- pages 4171–4186, Minneapolis, Minnesota. Associ- cal Methods in Natural Language Processing, pages ation for Computational Linguistics. 364–372, Sydney, Australia. Association for Com- putational Linguistics. Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan- moy Chakraborty. 2020a. Corpora evaluation and Prakhar Ganesh and Saket Dingliwal. 2019. Abstrac- system bias detection in multi-document summariza- tive summarization of spoken and written conversa- tion. In Findings of the Association for Computa- tion. CoRR, abs/1902.01615. tional Linguistics: EMNLP 2020, pages 2830–2840, Online. Association for Computational Linguistics. Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. Alvin Dey, Tanya Chowdhury, Yash Kumar, and Tan- 2020. A large-scale multi-document summarization moy Chakraborty. 2020b. Corpora evaluation and dataset from the Wikipedia current events portal. system bias detection in multi document summariza- In Proceedings of the 58th Annual Meeting of the tion. 
In Proceedings of the 2020 Conference on Association for Computational Linguistics, pages Empirical Methods in Natural Language Processing: 1302–1308, Online. Association for Computational Findings, pages 2830–2840, Online. Association for Linguistics. Computational Linguistics. Dan Gillick and Yang Liu. 2010. Non-expert evalua- tion of summarization systems is risky. In Proceed- Günes Erkan and Dragomir R Radev. 2004. Lexrank: ings of the NAACL HLT 2010 Workshop on Creating Graph-based lexical centrality as salience in text Speech and Language Data with Amazon’s Mechan- summarization. Journal of artificial intelligence re- ical Turk, pages 148–151, Los Angeles. Association search, 22:457–479. for Computational Linguistics. Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Dragomir Radev. 2019. Multi-news: A large-scale Aleksander Wawer. 2019a. SAMSum corpus: A multi-document summarization dataset and abstrac- human-annotated dialogue dataset for abstractive tive hierarchical model. In Proceedings of the 57th summarization. In Proceedings of the 2nd Workshop Annual Meeting of the Association for Computa- on New Frontiers in Summarization, pages 70–79, tional Linguistics, pages 1074–1084, Florence, Italy. Hong Kong, China. Association for Computational Association for Computational Linguistics. Linguistics. Alexander R Fabbri, Simeng Han, Haoyuan Li, Haoran Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Aleksander Wawer. 2019b. SAMSum corpus: A Radev, and Yashar Mehdad. 2020a. Improving zero human-annotated dialogue dataset for abstractive 6876
summarization. In Proceedings of the 2nd Workshop 9th International Joint Conference on Natural Lan- on New Frontiers in Summarization, pages 70–79, guage Processing (EMNLP-IJCNLP), pages 540– Hong Kong, China. Association for Computational 551, Hong Kong, China. Association for Computa- Linguistics. tional Linguistics. Chih-Wen Goo and Yun-Nung Chen. 2018a. Ab- Philippe Laban, Andrew Hsi, John Canny, and Marti A. stractive dialogue summarization with sentence- Hearst. 2020. The summary loop: Learning to write gated modeling optimized by dialogue acts. In abstractive summaries without examples. In Pro- 2018 IEEE Spoken Language Technology Workshop ceedings of the 58th Annual Meeting of the Asso- (SLT), pages 735–742. IEEE. ciation for Computational Linguistics, pages 5135– 5150, Online. Association for Computational Lin- Chih-Wen Goo and Yun-Nung Chen. 2018b. Abstrac- guistics. tive dialogue summarization with sentence-gated modeling optimized by dialogue acts. CoRR, Mirko Lenz, Premtim Sahitaj, Sean Kallenberg, abs/1809.05715. Christopher Coors, Lorik Dumani, Ralf Schenkel, and Ralph Bergmann. 2020. Towards an argu- Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You ment mining pipeline transforming texts to argument Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi graphs. arXiv preprint arXiv:2006.04562. Zhai, and Nicholas Zukoski. 2020. Generating rep- resentative headlines for news stories. In WWW ’20: Mike Lewis, Yinhan Liu, Naman Goyal, Mar- The Web Conference 2020, Taipei, Taiwan, April 20- jan Ghazvininejad, Abdelrahman Mohamed, Omer 24, 2020, pages 1773–1784. ACM / IW3C2. Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre- Doris Hoogeveen, Karin M Verspoor, and Timothy training for natural language generation, translation, Baldwin. 2015. Cqadupstack: A benchmark data and comprehension. In Proceedings of the 58th An- set for community question-answering research. In nual Meeting of the Association for Computational Proceedings of the 20th Australasian document com- Linguistics, pages 7871–7880, Online. Association puting symposium, pages 1–8. for Computational Linguistics. Xinyu Hua and Lu Wang. 2017. A pilot study of do- Chin-Yew Lin. 2004. ROUGE: A package for auto- main adaptation effect for neural abstractive sum- matic evaluation of summaries. In Text Summariza- marization. In Proceedings of the Workshop on tion Branches Out, pages 74–81, Barcelona, Spain. New Frontiers in Summarization, pages 100–106, Association for Computational Linguistics. Copenhagen, Denmark. Association for Computa- Chunyi Liu, Peng Wang, Jiang Xu, Zang Li, and tional Linguistics. Jieping Ye. 2019a. Automatic dialogue summary generation for customer service. In Proceedings of Adam Janin, Don Baron, Jane Edwards, Dan Ellis, the 25th ACM SIGKDD International Conference on David Gelbart, Nelson Morgan, Barbara Peskin, Knowledge Discovery & Data Mining, KDD 2019, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, Anchorage, AK, USA, August 4-8, 2019, pages 1957– et al. 2003. The icsi meeting corpus. In 2003 IEEE 1965. ACM. International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03)., Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben volume 1, pages I–I. IEEE. Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summariz- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A ing long sequences. In 6th International Conference method for stochastic optimization. 
In 3rd Inter- on Learning Representations, ICLR 2018, Vancou- national Conference on Learning Representations, ver, BC, Canada, April 30 - May 3, 2018, Confer- ICLR 2015, San Diego, CA, USA, May 7-9, 2015, ence Track Proceedings. OpenReview.net. Conference Track Proceedings. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Varada Kolhatkar and Maite Taboada. 2017. Using dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, New York Times picks to identify constructive com- Luke Zettlemoyer, and Veselin Stoyanov. 2019b. ments. In Proceedings of the 2017 EMNLP Work- Roberta: A robustly optimized bert pretraining ap- shop: Natural Language Processing meets Journal- proach. arXiv preprint arXiv:1907.11692. ism, pages 100–105, Copenhagen, Denmark. Asso- ciation for Computational Linguistics. Zhengyuan Liu, Angela Ng, Sheldon Lee Shao Guang, Ai Ti Aw, and Nancy F. Chen. 2019c. Topic-aware Wessel Kraaij, Thomas Hain, Mike Lincoln, and Wil- pointer-generator networks for summarizing spoken fried Post. 2005. The ami meeting corpus. conversations. CoRR, abs/1910.01335. Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc- Rada Mihalcea and Paul Tarau. 2004. TextRank: Cann, Caiming Xiong, and Richard Socher. 2019. Bringing order into text. In Proceedings of the 2004 Neural text summarization: A critical evaluation. In Conference on Empirical Methods in Natural Lan- Proceedings of the 2019 Conference on Empirical guage Processing, pages 404–411, Barcelona, Spain. Methods in Natural Language Processing and the Association for Computational Linguistics. 6877
Derek Miller. 2019. Leveraging bert for extractive Shasha Xie, Yang Liu, and Hui Lin. 2008. Evaluating text summarization on lectures. arXiv preprint the effectiveness of features and sampling in extrac- arXiv:1906.04165. tive meeting summarization. In 2008 IEEE Spoken Language Technology Workshop, pages 157–160. Amita Misra, Pranav Anand, Jean E. Fox Tree, and Marilyn Walker. 2015. Using summarization to dis- Christian Stab and Iryna Gurevych. 2014. Identifying cover argument facets in online idealogical dialog. argumentative discourse structures in persuasive es- In Proceedings of the 2015 Conference of the North says. In Proceedings of the 2014 Conference on American Chapter of the Association for Computa- Empirical Methods in Natural Language Processing tional Linguistics: Human Language Technologies, (EMNLP), pages 46–56, Doha, Qatar. Association pages 430–440, Denver, Colorado. Association for for Computational Linguistics. Computational Linguistics. Manfred Stede, Stergos Afantenos, Andreas Peldszus, Amita Misra, Pranav Anand, Jean E Fox Tree, and Mar- Nicholas Asher, and Jérémy Perret. 2016. Parallel ilyn Walker. 2017. Using summarization to discover discourse annotations on a corpus of short texts. In argument facets in online ideological dialog. arXiv Proceedings of the Tenth International Conference preprint arXiv:1709.00662. on Language Resources and Evaluation (LREC’16), pages 1051–1058, Portorož, Slovenia. European Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Language Resources Association (ELRA). Çağlar Gulçehre, and Bing Xiang. 2016. Abstrac- tive text summarization using sequence-to-sequence Mattia Tomasoni and Minlie Huang. 2010. Metadata- RNNs and beyond. In Proceedings of The 20th aware measures for answer summarization in com- SIGNLL Conference on Computational Natural Lan- munity question answering. In Proceedings of the guage Learning, pages 280–290, Berlin, Germany. 48th Annual Meeting of the Association for Compu- Association for Computational Linguistics. tational Linguistics, pages 760–769, Uppsala, Swe- Shashi Narayan, Shay B. Cohen, and Mirella Lapata. den. Association for Computational Linguistics. 2018. Don’t give me the details, just the summary! J. Ulrich, G. Murray, and G. Carenini. 2008. A publicly topic-aware convolutional neural networks for ex- available annotated corpus for supervised email sum- treme summarization. In Proceedings of the 2018 marization. In AAAI08 EMAIL Workshop, Chicago, Conference on Empirical Methods in Natural Lan- USA. AAAI. guage Processing, pages 1797–1807, Brussels, Bel- gium. Association for Computational Linguistics. Jan Ulrich, Giuseppe Carenini, Gabriel Murray, and Myle Ott, Sergey Edunov, Alexei Baevski, Angela Raymond Ng. 2009. Regression-based summariza- Fan, Sam Gross, Nathan Ng, David Grangier, and tion of email conversations. In Proceedings of the Michael Auli. 2019. fairseq: A fast, extensible International AAAI Conference on Web and Social toolkit for sequence modeling. In Proceedings of Media, volume 3. the 2019 Conference of the North American Chap- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob ter of the Association for Computational Linguistics Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz (Demonstrations), pages 48–53, Minneapolis, Min- Kaiser, and Illia Polosukhin. 2017. Attention is all nesota. Association for Computational Linguistics. you need. In Advances in Neural Information Pro- Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and cessing Systems 30: Annual Conference on Neural Raymond Ng. 2014. 
A template-based abstractive Information Processing Systems 2017, December 4- meeting summarization: Leveraging summary and 9, 2017, Long Beach, CA, USA, pages 5998–6008. source text relationships. In Proceedings of the 8th Danqing Wang, Pengfei Liu, Ming Zhong, Jie Fu, International Natural Language Generation Confer- Xipeng Qiu, and Xuanjing Huang. 2019. Exploring ence (INLG), pages 45–53, Philadelphia, Pennsylva- domain shift in extractive text summarization. nia, U.S.A. Association for Computational Linguis- tics. Lu Wang and Claire Cardie. 2013. Domain- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine independent abstract generation for focused meeting Lee, Sharan Narang, Michael Matena, Yanqi Zhou, summarization. In Proceedings of the 51st Annual Wei Li, and Peter J Liu. 2019. Exploring the limits Meeting of the Association for Computational Lin- of transfer learning with a unified text-to-text trans- guistics (Volume 1: Long Papers), pages 1395–1405, former. arXiv preprint arXiv:1910.10683. Sofia, Bulgaria. Association for Computational Lin- guistics. Leonardo FR Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2020. Investigating pretrained Adina Williams, Nikita Nangia, and Samuel Bowman. language models for graph-to-text generation. arXiv 2018. A broad-coverage challenge corpus for sen- preprint arXiv:2007.08426. tence understanding through inference. In Proceed- ings of the 2018 Conference of the North American Evan Sandhaus. 2008. The new york times annotated Chapter of the Association for Computational Lin- corpus. Linguistic Data Consortium, Philadelphia, guistics: Human Language Technologies, Volume 6(12):e26752. 1 (Long Papers), pages 1112–1122, New Orleans, 6878