MSˆ2: Multi-Document Summarization of Medical Studies
Jay DeYoung (1*), Iz Beltagy (2), Madeleine van Zuylen (2), Bailey Kuehl (2), Lucy Lu Wang (2)
(1) Northeastern University  (2) Allen Institute for AI
deyoung.j@northeastern.edu  {beltagy,madeleinev,baileyk,lucyw}@allenai.org
(*) Work performed during internship at AI2

arXiv:2104.06486v2 [cs.CL] 15 Apr 2021

Abstract

To assess the effectiveness of any medical intervention, researchers must conduct a time-intensive and highly manual literature review. NLP systems can help to automate or assist in parts of this expensive process. In support of this goal, we release MSˆ2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20K summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is the first large-scale, publicly available multi-document summarization dataset in the biomedical domain. We experiment with a summarization system based on BART, with promising early results. We formulate our summarization inputs and targets in both free text and structured forms and modify a recently proposed metric to assess the quality of our system's generated summaries. Data and models are available at https://github.com/allenai/ms2.

Figure 1: Our primary formulation (texts-to-text) is a seq2seq MDS task. Given study abstracts and a BACKGROUND statement, generate the TARGET summary.

Figure 2: The distributions of review and study publication years in MSˆ2 show a clear temporal lag. Dashed lines mark the median year of publication.

1 Introduction

Multi-document summarization (MDS) is a challenging task, with relatively limited resources and modeling techniques. Existing datasets are either in the general domain, such as WikiSum (Liu et al., 2018) and Multi-News (Fabbri et al., 2019), or very small, such as DUC [1] or TAC 2011 (Owczarzak and Dang, 2011). In this work, we add to this burgeoning area by developing a dataset for summarizing biomedical findings. We derive documents and summaries from systematic literature reviews, a type of biomedical paper that synthesizes results across many other studies. Our aim in introducing MSˆ2 is to: (1) expand MDS to the biomedical domain, (2) investigate fundamentally challenging issues in NLP over scientific text, such as summarization over contradictory information and assessing summary quality via a structured intermediate form, and (3) aid in distilling large amounts of biomedical literature by supporting automated generation of literature review summaries.

[1] https://duc.nist.gov

Systematic reviews synthesize knowledge across many studies (Khan et al., 2003), and they are so called for the systematic (and expensive) process of creating a review, with each review taking 1-2 years to complete (Michelson and Reuter, 2019). [2] As we note in Fig. 2, a delay of around 8 years is observed between reviews and the studies they cite!

[2] https://community.cochrane.org/review-production/production-resources/proposing-and-registering-new-cochrane-reviews
The time and cost of creating and updating reviews has inspired efforts at automation (Tsafnat et al., 2014; Marshall et al., 2016; Beller et al., 2018; Marshall and Wallace, 2019), and the constant deluge of studies [3] has only increased this need.

[3] Of the heterogeneous study types, randomized controlled trials (RCTs) offer the highest quality of evidence. Around 120 RCTs are published per day as of this writing (https://ijmarshall.github.io/sote/), up from 75 in 2010 (Bastian et al., 2010).

To move the needle on these challenges and support further work on literature review automation, we present MSˆ2, a multi-document summarization dataset in the biomedical domain. Our contributions in this paper are as follows:

• We introduce MSˆ2, a dataset of 20K reviews and 470k studies summarized by these reviews.
• We define a texts-to-text MDS task (Fig. 1) based on MSˆ2, by identifying target summaries in each review and using study abstracts as input documents. We develop a BART-based model for this task, which produces fluent summaries that agree with the evidence direction stated in gold summaries around 50% of the time.
• In order to expose more granular representations to users, we define a structured form of our data to support a table-to-table task (§4.2). We leverage existing biomedical information extraction systems (Nye et al., 2018; DeYoung et al., 2020) (§3.3.1, §3.3.2) to evaluate agreement between target and generated summaries.

2 Background

Systematic reviews aim to synthesize results over all relevant studies on a topic, providing high quality evidence for biomedical and public health decisions. They are a fixture in the biomedical literature, with many established protocols around their registration, production, and publication (Chalmers et al., 2002; Starr et al., 2009; Booth et al., 2012). Each systematic review addresses one or several research questions, and results are extracted from relevant studies and summarized. For example, a review investigating the effectiveness of Vitamin B12 supplementation in older adults (Andrès et al., 2010) synthesizes results from 9 studies.

The research questions in systematic reviews can be described using the PICO framework (Zakowski et al., 2004). PICO (which stands for Population: who is studied? Intervention: what intervention was studied? Comparator: what was the intervention compared against? Outcome: what was measured?) defines the main facets of biomedical research questions, and allows the person(s) conducting a review to identify relevant studies (studies included in a review generally have the same or similar PICO elements as the review). A medical systematic review is one which reports results for applying any kind of medical or social intervention to a group of people. Interventions are wide-ranging, including yoga, vaccination, team training, education, vitamins, mobile reminders, and more. Recent work on evidence inference (DeYoung et al., 2020; Nye et al., 2020) goes beyond identifying PICO elements, and aims to group and identify overall findings in reviews. MSˆ2 is a natural extension of these paths: we create a dataset and build a system with both natural summarization targets from input studies, while also incorporating the inherent structure studied in previous work.

In this work, we use the term review when describing literature review papers, which provide our summary targets. We use the term study to describe the documents that are cited and summarized by each review. There are various study designs which offer differing levels of evidence, e.g., clinical trials, cohort studies, observational studies, case studies, and more (Concato et al., 2000). Of these study types, randomized controlled trials (RCTs) offer the highest quality of evidence (Meldrum, 2000).

3 Dataset

We construct MSˆ2 from papers in the Semantic Scholar literature corpus. First, we create a corpus of reviews and studies based on the suitability criteria defined in §3.1. For each review, we classify individual sentences in the abstract to identify summarization targets (§3.2). We augment all reviews and studies with PICO span labels and evidence inference classes as described in §3.3.1 and §3.3.2. As a final step in data preparation, we cluster reviews by topic and form train, development, and test sets from these clusters (§3.4).

3.1 Identifying suitable reviews and studies

To identify suitable reviews, we apply (i) a high-recall heuristic keyword filter, (ii) a PubMed filter, (iii) a study-type filter, and (iv) a suitability classifier, in series. The keyword filter looks for the phrase "systematic review" in the titles and abstracts of all papers in Semantic Scholar, which yields 220K matches. The PubMed filter, yielding 170K matches, limits search results to papers that have been indexed in the PubMed database, which restricts reviews to those in the biomedical, clinical, psychological, and associated domains. We then use citations and Medical Subject Headings (MeSH) to identify input studies via their document types and further refine the remaining reviews; see App. A for details on the full filtering process. Finally, we train a suitability classifier as the final filtering step, using SciBERT (Beltagy et al., 2019), a BERT (Devlin et al., 2019) based language model trained on scientific text. Details on classifier training and performance are provided in Appendix C. Applying this classifier to the remaining reviews leaves us with 20K candidate reviews.

Table 1: Abbreviated example from Andrès et al. (2010) with predicted sentence labels (full abstract in Tab. 11, App. D.3). Spans corresponding to Population, Intervention, and Outcome elements are tagged and surrounded with special tokens.

Label | Sentence
BACKGROUND | ... AREAS COVERED IN THIS REVIEW: The objective of this review is to evaluate the efficacy of oral cobalamin treatment in elderly patients.
OTHER | To reach this objective, PubMed data were systematically searched for English and French articles published from January 1990 to July 2008. ...
TARGET | The efficacy was particularly highlighted when looking at the marked improvement in serum vitamin B12 levels and hematological parameters, for example hemoglobin level, mean erythrocyte cell volume and reticulocyte count.
OTHER | The effect of oral cobalamin treatment in patients presenting with severe neurological manifestations has not yet been adequately documented. ...

Table 2: Sample Intervention, Outcome, evidence statement, and identified effect directions from a systematic review investigating the effectiveness of vitamin B12 therapies in the elderly (Andrès et al., 2010).

Intervention | Outcome | Evidence sentence | Direction
oral cobalamin therapy | effect | The effect of oral cobalamin treatment in patients presenting with severe neurological manifestations has not yet been adequately documented. | no_change
oral cobalamin therapy | discomfort, inconvenience and cost | Oral cobalamin treatment avoids the discomfort, inconvenience and cost of monthly injections. | decreases
oral cobalamin therapy | serum vitamin b12 levels and hematological parameters | The efficacy was particularly highlighted when looking at the marked improvement in serum vitamin B12 levels and hematological parameters, for example hemoglobin level, mean erythrocyte cell volume and reticulocyte count. | increases

3.2 Background and target identification

For each review, we identify two sections: 1) the BACKGROUND statement, which describes the research question, and 2) the overall effect or findings statement as the TARGET of the MDS task (Fig. 1). We frame this as a sequential sentence classification task (Cohan et al., 2019): given the sentences in the review abstract, classify them as BACKGROUND, TARGET, or OTHER. All BACKGROUND sentences are aggregated and used as input in modeling. All TARGET sentences are aggregated and form the summary target for that review. Sentences classified as OTHER may describe the methods used to conduct the review, detailed findings such as the number of included studies or numerical results, as well as recommendations for practice. OTHER sentences are not suitable for modeling because they either contain information specific to the review, as in methods; too much detail, in the case of results; or guidance on how medicine should be practiced, which is both outside the scope of our task definition and ill-advised to generate.

Five annotators with undergraduate or graduate level biomedical background labeled 3000 sentences from 220 review abstracts. During annotation, we asked annotators to label sentences into 9 classes (which we collapse into the 3 above; see App. D for detailed info on other classes). Two annotators then reviewed all annotations and corrected mistakes. The corrections yield a Cohen's κ (Cohen, 1960) of 0.912. Though we retain only BACKGROUND and TARGET sentences for modeling, we provide labels for all 9 classes in our dataset.

Using SciBERT (Beltagy et al., 2019), we train a sequential sentence classifier. We prepend each sentence with a [SEP] token and use a linear layer followed by a softmax to classify each sentence. A detailed breakdown of the classifier scores is available in Tab. 9, App. D. While the classifier performs well (94.1 F1) at identifying BACKGROUND sentences, it only achieves 77.4 F1 for TARGET sentences. The most common error for TARGET sentences is confusing them with results from individual studies or detailed statistical analysis. Tab. 1 shows example sentences with predicted labels. Due to the size of the dataset, we cannot manually annotate sentence labels for all reviews, so we use the sentence classifier output as silver labels in the training set. To ensure the highest degree of accuracy for the summary targets in our test set, we manually review all 4519 TARGET sentences in the 2K reviews of the test set, correcting 1109 sentences. Any reviews without TARGET sentences are considered unsuitable and are removed from the final dataset.
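To make the classifier setup described above concrete, the following is a minimal sketch of this kind of sequential sentence classifier, assuming the public SciBERT checkpoint on the Hugging Face hub. It is not the authors' released implementation, and the classification head here is untrained; in practice it would be fine-tuned on the annotated sentences described above.

```python
# Sketch: tag review-abstract sentences as BACKGROUND / TARGET / OTHER by
# prepending [SEP] to each sentence and classifying each [SEP] representation.
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

LABELS = ["BACKGROUND", "TARGET", "OTHER"]
ENCODER = "allenai/scibert_scivocab_uncased"

class AbstractSentenceTagger(nn.Module):
    def __init__(self, encoder_name: str = ENCODER):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, len(LABELS))

    def forward(self, input_ids, attention_mask, sep_positions):
        # Encode the whole abstract once; sentences are delimited by [SEP].
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # One prediction per sentence, taken from its prepended [SEP] token.
        sep_states = hidden[0, sep_positions]            # (num_sentences, hidden)
        return self.classifier(sep_states).softmax(-1)   # (num_sentences, 3)

tokenizer = AutoTokenizer.from_pretrained(ENCODER)
sentences = [
    "The objective of this review is to evaluate oral cobalamin treatment.",
    "PubMed data were searched for articles published from 1990 to 2008.",
]
# Prepend each sentence with [SEP] and encode the concatenation.
text = " ".join(f"{tokenizer.sep_token} {s}" for s in sentences)
enc = tokenizer(text, return_tensors="pt", truncation=True)
sep_positions = (enc["input_ids"][0] == tokenizer.sep_token_id)
sep_positions = sep_positions.nonzero(as_tuple=True)[0][:-1]  # drop the final [SEP]

model = AbstractSentenceTagger()
with torch.no_grad():
    probs = model(enc["input_ids"], enc["attention_mask"], sep_positions)
print([LABELS[i] for i in probs.argmax(-1).tolist()])
```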
3.3 Structured form

As discussed in §2, the key findings of studies and reviews can be succinctly captured in a structured representation. The structure consists of PICO elements (Nye et al., 2018) that define what is being studied, in addition to the effectiveness of the intervention as inferred through Evidence Inference (§3.3.2). In addition to the textual form of our task, we construct this structured form and release it with MSˆ2 to facilitate investigation of consistency between input studies and reviews, and to provide additional information for interpreting the findings reported in each document.

3.3.1 Adding PICO tags

The Populations, Interventions, and Outcomes of interest are a common way of representing clinical knowledge (Huang et al., 2006). Recent work (Nye et al., 2020) has found that the Comparator is rarely mentioned explicitly, so we exclude it from our dataset. Previous summarization work has shown that tagging salient entities, especially PIO elements (Wallace et al., 2020), can improve summarization performance (Nallapati et al., 2016a,b), so we mark PIO elements with special start and end tokens for each element type, added to our model vocabulary.

Using the EBM-NLP corpus (Nye et al., 2018), a crowd-sourced collection of PIO tags, [4] we train a token classification model (Wolf et al., 2020) to identify these spans in our study and review documents. These span sets are denoted P = {P1, P2, ..., P_P̄}, I = {I1, I2, ..., I_Ī}, and O = {O1, O2, ..., O_Ō}. At the level of each review, we perform a simple aggregation over these elements. Any P, I, or O span fully contained within any other span of the same type is removed from these sets (though they remain tagged in the text). Removing these contained elements reduces the number of duplicates in our structured representation. Our dataset has an average of 3.0 P, 3.5 I, and 5.4 O spans per review.

[4] EBM-NLP contains high quality, crowdsourced and expert-tagged PIO spans in clinical trial abstracts. See App. I for a comparison to other PICO datasets.

3.3.2 Adding Evidence Inference

We predict the direction of evidence associated with every Intervention-Outcome (I/O) pair found in the review abstract. Taking the product of each Ii and Oj in the sets I and O yields all possible I/O pairs, and each I/O pair is associated with an evidence direction dij, which can take on one of the values in {increases, no_change, decreases}. For each I/O pair, we also derive a sentence sij from the document supporting the dij classification. Each review can therefore be represented as a set of tuples T of the form (Ii, Oj, sij, dij) with cardinality Ī × Ō. See Tab. 2 for examples. For modeling, as in PICO tagging, we surround supporting sentences with special start and end tokens, and append the direction class with its own special token.

We adapt the Evidence Inference (EI) dataset and models (DeYoung et al., 2020) for labeling. The EI dataset is a collection of RCTs, tagged with PICO elements, evidence sentences, and overall evidence direction labels increases, no_change, or decreases. The EI models are composed of 1) an evidence identification module which identifies an evidence sentence, and 2) an evidence classification module for classifying the direction of effectiveness. The former is a binary classifier on top of SciBERT, whereas the latter is a softmax distribution over effectiveness directions. Using the same parameters as DeYoung et al. (2020), we modify these two modules to function solely over I and O spans. [5] The resulting 354k EI classifications for our reviews are 13.4% decreases, 57.0% no_change, and 29.6% increases. Of the 907k classifications over input studies, 15.7% are decreases, 60.7% no_change, and 23.6% increases. Only 53.8% of study classifications match review classifications, highlighting the prevalence and challenges of contradictory data.

[5] Nye et al. (2020) found that removing Comparator elements improved classification performance from 78.0 F1 to 81.4 F1 with no additional changes or hyper-parameter tuning.

3.4 Clustering and train / test split

Reviews addressing overlapping research questions or providing updates to previous reviews may share input studies and results in common; e.g., a review studying the effect of Vitamin B12 supplementation on B12 levels in older adults and a review studying the effect of B12 supplementation on heart disease risk will cite similar studies. To avoid the phenomenon of learning from test data, we cluster reviews before splitting into train, validation, and test sets. We compute SPECTER paper embeddings (Cohan et al., 2020) using the title and abstract of each review, and perform agglomerative hierarchical clustering using the scikit-learn library (Buitinck et al., 2013).
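A minimal sketch of this clustering step is shown below, assuming the public allenai/specter checkpoint and scikit-learn; the example review records and the split logic in the comments are illustrative rather than the authors' released code.

```python
# Sketch: embed each review (title + abstract) with SPECTER and cluster the
# embeddings with agglomerative hierarchical clustering.
import torch
from sklearn.cluster import AgglomerativeClustering
from transformers import AutoTokenizer, AutoModel

reviews = [
    {"title": "Oral cobalamin therapy in the elderly", "abstract": "..."},
    {"title": "Vitamin B12 supplementation and heart disease", "abstract": "..."},
]

tok = AutoTokenizer.from_pretrained("allenai/specter")
enc_model = AutoModel.from_pretrained("allenai/specter")

# SPECTER embeds "title [SEP] abstract" and uses the [CLS] representation.
texts = [r["title"] + tok.sep_token + r["abstract"] for r in reviews]
batch = tok(texts, padding=True, truncation=True, max_length=512,
            return_tensors="pt")
with torch.no_grad():
    embeddings = enc_model(**batch).last_hidden_state[:, 0, :].numpy()

# 200 clusters is the value reported in the paper; capped here so the toy
# example with two reviews still runs.
n_clusters = min(200, len(reviews))
cluster_ids = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)

# Whole clusters (not individual reviews) would then be partitioned 80/10/10
# into train / development / test so related reviews stay in a single split.
print(cluster_ids)
```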
This results in 200 clusters, which we randomly partition into 80/10/10 train/development/test sets.

3.5 Dataset statistics

The final dataset consists of 20K reviews and 470k studies. Each review in the dataset summarizes an average of 23 studies, ranging between 1–401 studies. See Tab. 3 for statistics, and Tab. 4 for a comparison to other datasets. The median review has 6.7K input tokens from its input studies, while the average has 9.4K tokens (a few reviews include a very large number of studies). We restrict the input size when modeling to 25 studies, which reduces the average input to 6.6K tokens without altering the median.

Table 3: MSˆ2 dataset statistics.

Dataset statistics | MSˆ2
Total reviews | 20K
Total input studies | 470k
Median number of studies per review | 17
Median number of reviews per study | 1
Average number of reviews per study | 1.9
Median BACKGROUND sentences per review | 3
Median TARGET sentences per review | 2

Table 4: A comparison of MDS datasets; adapted from Fabbri et al. (2019). Datasets are DUC '03/'04 [1], TAC 2011 (Owczarzak and Dang, 2011), WikiSum (Liu et al., 2018), and Multi-News (Fabbri et al., 2019). Note: WikiSum only provides ranges, not exact sizes.

Dataset | Docs | Tokens per Summary | Tokens per Document
DUC '03/'04 | 320 | 109.6 | 4636.2
TAC 2011 | 176 | 99.7 | 4695.7
WikiSum | 2,332,000 | 10^1–10^3 | 10^2–10^6
Multi-News | 56,216 | 263.7 | 2103.4
MSˆ2 | 470,402 | 61.3 | 365.3

Fig. 2 shows the temporal distribution of reviews and input studies in MSˆ2. We observe that though reviews in our dataset have a median publication year of 2016, the studies cited by these reviews are largely from before 2010, with a median of 2007 and a peak in 2009. This citation delay has been observed in prior work (Shojania et al., 2007; Beller et al., 2013), and further illustrates the need for automated or assisted reviews.

4 Experiments

We experiment with a texts-to-text task formulation (Fig. 1). The model input consists of the BACKGROUND statement and study abstracts; the output is the TARGET statement. We also investigate the use of the structured form described in §3.3.2 for a supplementary table-to-table task, where, given inputs of I/O pairs from the review, the model tries to predict the evidence direction. We provide initial results for the table-to-table task, although we consider this an area in need of active research.

4.1 Texts-to-text task

Our approach leverages BART (Lewis et al., 2020b), a seq2seq autoencoder. Using BART, we encode the BACKGROUND and input studies as in Fig. 3, and pass these representations to a decoder. Training follows a standard auto-regressive paradigm used for building summarization models. In addition to PICO tags (§3.3.1), we augment the inputs by surrounding the background and each input study with special start and end tokens.

Figure 3: Two input encoding configurations. Above: LongformerEncoderDecoder (LED), where all input studies are appended to the BACKGROUND and encoded together. Below: In the BART configuration, each input study is encoded independently with the review BACKGROUND. These are concatenated to form the input encoding.

For representing multiple inputs, we experiment with two configurations: one leveraging BART with independent encodings of each input, and LongformerEncoderDecoder (LED) (Beltagy et al., 2020), which can encode long inputs of up to 16K tokens. For the BART configuration, each study abstract is appended to the BACKGROUND statement and encoded independently. These representations are concatenated together to form the input to the decoder layer. In the BART configuration, interactions happen only in the decoder. For the LED configuration, the input sequence starts with the BACKGROUND statement followed by a concatenation of all input study abstracts. The BACKGROUND representation is shared among all input studies; global attention allows interactions between studies, and a sliding attention window of 512 tokens allows each token to attend to its neighbors.
We train a BART-base model, with hyperparameters described in App. F. We report experimental results in Tab. 5. In addition to ROUGE (Lin, 2004), we also report two metrics derived from evidence inference: ∆EI and F1. We describe the intuition and computation of the ∆EI metric in Section 4.3; because it is a distance metric, lower ∆EI is better. For F1, we use the EI classification module to identify evidence directions for both the generated and target summaries. Using these classifications, we report a macro-averaged F1 over the class agreement between the generated and target summaries (Buitinck et al., 2013). For example generations, see Tab. 13 in App. G.

Table 5: Results for the texts-to-text setting. We report ROUGE, ∆EI (§4.3), and macro-averaged F1-scores.

Model | R-1 | R-2 | R-L | ∆EI↓ | F1
BART | 27.56 | 9.40 | 20.80 | .459 | 46.51
LED | 26.89 | 8.91 | 20.32 | .449 | 45.00

Table 6: Results for the table-to-table setting. We report macro-averaged precision, recall, and F-scores.

Model | P | R | F1
BART | 50.31 | 67.98 | 65.89

4.2 Table-to-table task

An end user of a review summarization system may be interested in specific results from input studies (including whether they agree or contradict) rather than the high level conclusions available in TARGET statements. Therefore, we further experiment with structured input and output representations that attempt to capture results from individual studies. As described in §3.3.2, the structured representation of each review or study is a tuple of the form (Ii, Oj, sij, dij). It is important to note that we use the same set of Is and Os from the review to predict evidence direction from all input studies. Borrowing from the ideas of Raffel et al. (2020), we formulate our classification task as a text generation task, and train the models described in Section 4.1 to generate one of the classes in {increases, no_change, decreases}. Using the EI classifications from §3.3.2, we compute an F-score macro-averaged over the effect classes (Tab. 6). We retain all hyperparameter settings other than reducing the maximum generation length to 10.

We stress that this is a preliminary effort to demonstrate feasibility rather than completeness: our results in Tab. 6 are promising, but the underlying technologies for building the structured data (PICO tagging, co-reference resolution, and PICO relation extraction) are currently weak (Nye et al., 2020). Resorting to using the full cross-product of Interventions and Outcomes results in duplicated I/O pairs as well as potentially spurious pairs that do not correspond to actual I/O pairs in the review.

4.3 ∆EI metric

Recent work in summarization evaluation has highlighted the weaknesses of ROUGE for capturing factuality of generated summaries, and has focused on developing automated metrics more closely correlated with human-assessed factuality and quality (Zhang* et al., 2020; Wang et al., 2020a; Falke et al., 2019). In this vein, we modify a recently proposed metric based on EI classification distributions (Wallace et al., 2020), intending to capture the agreement of Is, Os, and EI directions between input studies and the generated summary.

For each I/O tuple (Ii, Oj), the predicted direction dij is actually a distribution of probabilities over the three direction classes, Pij = (p_increases, p_decreases, p_no_change). If we consider this distribution for the gold summary (Pij) and the generated summary (Qij), we can compute the Jensen-Shannon Distance (JSD) (Lin, 1991), a bounded score in [0, 1], between these distributions. For each review, we can then compute a summary JSD metric, which we call ∆EI, as an average over the JSD of each I/O tuple in that review:

∆EI = (1 / (Ī × Ō)) Σ_{i=1}^{Ī} Σ_{j=1}^{Ō} JSD(P_ij, Q_ij)    (1)

Different from Wallace et al. (2020), ∆EI is an average over all outputs, attempting to capture an overall picture of system performance, [6] and our metric retains the directionality of increases and decreases, as opposed to collapsing them together.

[6] Wallace et al. (2020) only report correlation of a related metric with human judgments.

To facilitate interpretation of the ∆EI metric, we offer a degenerate example. Given the case where all direction classifications are certain, and the probability distributions Pij and Qij exist in the space of (1, 0, 0), (0, 1, 0), or (0, 0, 1), ∆EI takes on the following values at various levels of consistency between Pij and Qij for the input studies:

100% consistent: ∆EI = 0.0
50% consistent: ∆EI = 0.42
0% consistent: ∆EI = 0.83

In other words, in both the standard BART and LED settings, the evidence directions predicted in relation to the generated summary are slightly less than 50% consistent with the direction predictions produced relative to the gold summary.
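To make the computation concrete, the following minimal sketch (not the released evaluation code) reproduces the degenerate values above using SciPy's Jensen-Shannon distance; in practice the direction distributions would come from the EI classifier rather than being hand-specified.

```python
# Sketch: Delta-EI as the mean Jensen-Shannon distance between gold and
# generated direction distributions over all I/O pairs of a review.
import numpy as np
from scipy.spatial.distance import jensenshannon

def delta_ei(gold: np.ndarray, generated: np.ndarray) -> float:
    """gold, generated: arrays of shape (num_I, num_O, 3) holding probability
    distributions over (increases, no_change, decreases) for every I/O pair."""
    num_i, num_o, _ = gold.shape
    distances = [jensenshannon(gold[i, j], generated[i, j])
                 for i in range(num_i) for j in range(num_o)]
    return float(np.mean(distances))

# Degenerate example with certain (one-hot) direction predictions.
match    = np.array([[[1.0, 0.0, 0.0]]])   # gold and generated agree
mismatch = np.array([[[0.0, 1.0, 0.0]]])   # generated picks a different class
print(delta_ei(match, match))       # -> 0.0   (100% consistent)
print(delta_ei(match, mismatch))    # -> ~0.83 (0% consistent, sqrt(ln 2))

# One agreeing and one disagreeing I/O pair averages to ~0.42 (50% consistent).
gold = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
gen  = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
print(delta_ei(gold, gen))          # -> ~0.42
```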
Table 7: Confusion matrix for human evaluation results.

Generated \ Gold | increases | no_change | decreases | insufficient
increases | 56 | 3 | 1 | 19
no_change | 13 | 1 | 1 | 10
decreases | 0 | 0 | 5 | 2
insufficient | 5 | 1 | 0 | 5
skip | 8 | 0 | 0 | 3

Table 8: Text from the input studies to Petrelli and Barni (2013), a review investigating the effectiveness of cisplatin-based (CAP) chemotherapy for non-small cell lung cancer (NSCLC). Input studies vary in their results, with some stating a positive effect for adjuvant chemotherapy, and some stating no survival benefit.

Effectiveness | Example statements from studies
Positive effect | "Adjuvant vinorelbine plus cisplatin extends survival in patients with completely resected NSCLC..." / "Our results suggest that patients with NSCLC at pathologic stage I who have undergone radical surgery benefit from adjuvant chemotherapy."
No effect or negative effect | "No survival benefit for CAP vs no-treatment control was found in this study. Therefore, adjuvant therapy with CAP should not be recommended for patients with resected early-stage non-small cell lung cancer." / "On the basis of this trial, adjuvant therapy with CAP should not be recommended for patients with resected stage I lung cancer."

4.4 Human evaluation & error analysis

We randomly sample 150 reviews from the test set for manual evaluation. For each generated and gold summary, we annotate the primary effectiveness direction in the summary with one of the following classes: (i) increases: the intervention has a positive effect on the outcome; (ii) no_change: no effect, or no difference between the intervention and the comparator; (iii) decreases: the intervention has a negative effect on the outcome; (iv) insufficient: insufficient evidence is available; (v) skip: the summary is disfluent, off topic, or does not contain information on efficacy. Here, increases, no_change, and decreases correspond to the EI classes, while we introduce insufficient to describe cases where insufficient evidence is available on efficacy, and skip to describe data or generation failures. Two annotators provide labels, and agreement is computed over 50 reviews (agreement: 86%, Cohen's κ: 0.76). Of these, 17 gold summaries lack an efficacy statement and are excluded from analysis. Tab. 7 shows the confusion matrix for the sample. Around 50% (67/133) of generated summaries have the same evidence direction as the gold summary. Most confusions happen between increases, no_change, and insufficient.

Tab. 8 shows how individual studies can provide contradictory information, some supporting a positive effect for an intervention and some observing no or negative effects. EI may be able to capture some of the differences between these input studies. From observations on limited data: while studies with a positive effect tend to have more EI predictions that were increases or decreases, those with no or negative effect tended to have predictions that were mostly no_change. However, more work is needed to better understand how to capture these directional relations and how to aggregate them into a coherent summary.

5 Related Work

NLP for scientific text has been gaining interest recently, with work spanning the whole NLP pipeline: datasets (S2ORC (Lo et al., 2020), CORD-19 (Wang et al., 2020b)), pretrained transformer models (SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020), ClinicalBERT (Huang et al., 2019), SPECTER (Cohan et al., 2020)), and NLP tasks like NER (Nye et al., 2018; Li et al., 2016), relation extraction (Jain et al., 2020; Luan et al., 2018; Kringelum et al., 2016), QA (Abacha et al., 2019), NLI (Romanov and Shivade, 2018; Khot et al., 2018), summarization (Cachola et al., 2020; Chandrasekaran et al., 2019), claim verification (Wadden et al., 2020), and more. MSˆ2 adds an MDS dataset to the scientific document NLP literature.

A small number of MDS datasets are available for other domains, including MultiNews (Fabbri et al., 2019), WikiSum (Liu et al., 2018), and Wikipedia Current Events (Gholipour Ghalandari et al., 2020). Most similar to MSˆ2 is MultiNews, where multiple news articles about the same event are summarized into one short paragraph. Aside from being in a different textual domain (scientific vs. newswire), one unique characteristic of MSˆ2 compared to existing datasets is that MSˆ2 input documents have contradicting evidence. Modeling in other domains has typically focused on straightforward applications of single-document summarization to the multi-document setting (Lebanoff et al., 2018; Zhang et al., 2018), although some methods explicitly model multi-document structure using semantic graph approaches (Baumel et al., 2018; Liu and Lapata, 2019; Li et al., 2020).

In the systematic review domain, work has typically focused on information retrieval (Boudin et al., 2010; Ho et al., 2016; Znaidi et al., 2015; Schoot et al., 2020), extracting findings (Lehman et al., 2019; DeYoung et al., 2020; Nye et al., 2020), and quality assessment (Marshall et al., 2015, 2016). Only recently, in Wallace et al. (2020) and this work, has consideration been given to approaching the entire system as a whole. We refer the reader to App. I for more context regarding the systematic review process.

6 Discussion

Though MDS has been explored in the general domain, biomedical text poses unique challenges such as the need for domain-specific vocabulary and background knowledge. To support development of biomedical MDS systems, we release the MSˆ2 dataset. MSˆ2 contains summaries and documents derived from biomedical literature, and can be used to study literature review automation, a pressing real-world application of MDS.

We define a seq2seq modeling task over this dataset, as well as a structured task that incorporates prior work on modeling biomedical text (Nye et al., 2018; DeYoung et al., 2020). We show that although generated summaries tend to be fluent and on-topic, they only agree with the evidence direction in gold summaries around half the time, leaving plenty of room for improvement. This observation holds both through our ∆EI metric and through human evaluation of a small sample of generated summaries. Given that only 54% of study evidence directions agree with the evidence directions of their review, modeling contradiction in source documents may be key to improving upon existing summarization methods.

Limitations: Challenges in co-reference resolution and PICO extraction limit our ability to generate accurate PICO labels at the document level. Errors compound at each stage: PICO tagging, taking the product of Is and Os at the document level, and predicting EI direction. Pipeline improvements are needed to bolster overall system performance and increase our ability to automatically assess performance via automated metrics like ∆EI. Relatedly, automated metrics for summarization evaluation can be difficult to interpret, as the intuition for each metric must be built up through experience. Though we attempt to facilitate understanding of ∆EI by offering a degenerate example, more exploration is needed to understand how a practically useful system would perform on such a metric.

Future work: Though we demonstrate that seq2seq approaches are capable of producing fluent and on-topic review summaries, there are significant opportunities for improvement. Data improvements include improving the quality of summary targets and intermediate structured representations (PICO tags and EI direction). Another opportunity lies in linking to structured data in external sources such as various clinical trial databases [7, 8, 9] rather than relying solely on PICO tagging. For modeling, we are interested in pursuing joint retrieval and summarization approaches (Lewis et al., 2020a). We also hope to explicitly model the types of contradictions observed in Tab. 8, such that generated summaries can capture nuanced claims made by individual studies.

[7] https://clinicaltrials.gov/
[8] https://www.clinicaltrialsregister.eu/
[9] https://www.gsk-studyregister.com/

7 Conclusion

Given increasing rates of publication, multi-document summarization, or the creation of literature reviews, has emerged as an important NLP task in science. The urgency for automation technologies has been magnified by the COVID-19 pandemic, which has led to both an accelerated speed of publication (Horbach, 2020) as well as a proliferation of non-peer-reviewed preprints which may be of lower quality (Lachapelle, 2020). By releasing MSˆ2, we provide an MDS dataset that can help to address these challenges. Though we demonstrate that our MDS models can produce fluent text, our results show that there are significant outstanding challenges that remain unsolved, such as PICO tuple extraction, co-reference resolution, and evaluation of summary quality and faithfulness in the multi-document setting. We encourage others to use this dataset to better understand the challenges specific to MDS in the domain of biomedical text, and to push the boundaries on the real world task of systematic review automation.
Ethical Concerns and Broader Impact

We believe that automation in systematic reviews has great potential value to the medical and scientific community; our aim in releasing our dataset and models is to facilitate research in this area. Given unresolved issues in evaluating the factuality of summarization systems, as well as a lack of strong guarantees about what the summary outputs contain, we do not believe that such a system is ready to be deployed in practice. Deploying such a system now would be premature, as without these guarantees we would be likely to generate plausible-looking but factually incorrect summaries, an unacceptable outcome in such a high impact domain. We hope to foster development of useful systems with correctness guarantees and evaluations to support them.

References

Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In BioNLP@ACL.

E. Andrès, H. Fothergill, and M. Mecili. 2010. Efficacy of oral cobalamin (vitamin B12) therapy. Expert Opinion on Pharmacotherapy, 11:249–256.

Edoardo Aromataris and Zachary Munn, editors. 2020. JBI Manual for Evidence Synthesis. JBI.

Hilda Bastian, Paul Glasziou, and Iain Chalmers. 2010. Seventy-five trials and eleven systematic reviews a day: How will we ever keep up? PLOS Medicine, 7(9):1–6.

Tal Baumel, Matan Eyal, and Michael Elhadad. 2018. Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. ArXiv, abs/1801.07704.

E. Beller, J. K. Chen, U. Wang, and P. Glasziou. 2013. Are systematic reviews up-to-date at the time of publication? Systematic Reviews, 2:36.

E. Beller, J. Clark, G. Tsafnat, C. Adams, H. Diehl, H. Lund, M. Ouzzani, K. Thayer, J. Thomas, T. Turner, J. Xia, K. Robinson, and P. Glasziou. 2018. Making progress with the automation of systematic reviews: principles of the International Collaboration for the Automation of Systematic Reviews (ICASR). Systematic Reviews, 7.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. ArXiv, abs/2004.05150.

Alison Booth, Michael Clarke, Gordon Dooley, Davina Ghersi, David Moher, Mark Petticrew, and Lesley A. Stewart. 2012. The nuts and bolts of PROSPERO: an international prospective register of systematic reviews. Systematic Reviews, 1:2.

Florian Boudin, Jian-Yun Nie, and Martin Dawes. 2010. Clinical information retrieval using document and PICO structure. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 822–830, Los Angeles, California. Association for Computational Linguistics.

Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122.

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel Weld. 2020. TLDR: Extreme summarization of scientific documents. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4766–4777, Online. Association for Computational Linguistics.

Iain Chalmers, Larry V. Hedges, and Harris Cooper. 2002. A brief history of research synthesis. Evaluation & the Health Professions, 25:12–37.

Muthu Kumar Chandrasekaran, Michihiro Yasunaga, Dragomir R. Radev, D. Freitag, and Min-Yen Kan. 2019. Overview and results: CL-SciSumm shared task 2019. In BIRNDL@SIGIR.

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Dan Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3693–3699, Hong Kong, China. Association for Computational Linguistics.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
A. Cohen, N. Smalheiser, M. McDonagh, C. Yu, C. Adams, J. M. Davis, and Philip S. Yu. 2015. Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association: JAMIA, 22:707–717.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

J. Concato, N. Shah, and R. Horwitz. 2000. Randomized, controlled trials, observational studies, and the hierarchy of research designs. The New England Journal of Medicine, 342(25):1887–92.

J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall, and Byron C. Wallace. 2020. Evidence inference 2.0: More data, better models. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 123–132, Online. Association for Computational Linguistics.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

WA Falcon. 2019. PyTorch Lightning. GitHub, https://github.com/PyTorchLightning/pytorch-lightning.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.

Demian Gholipour Ghalandari, Chris Hokamp, Nghia The Pham, John Glover, and Georgiana Ifrim. 2020. A large-scale multi-document summarization dataset from the Wikipedia current events portal. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1302–1308, Online. Association for Computational Linguistics.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature, 585:357–362.

Julian P.T. Higgins, James Thomas, Jacqueline Chandler, Miranda Cumpston, Tianjing Li, Matthew J. Page, and Vivian A. Welch, editors. 2019. Cochrane Handbook for Systematic Reviews of Interventions. Wiley.

G. J. Ho, S. M. Liew, C. Ng, Ranita Hisham Shunmugam, and P. Glasziou. 2016. Development of a search strategy for an evidence based retrieval service. PLoS ONE, 11.

S. Horbach. 2020. Pandemic publishing: Medical journals strongly speed up their publication process for COVID-19. Quantitative Science Studies, 1:1056–1067.

Brian Howard, Jason Phillips, Arpit Tandon, Adyasha Maharana, Rebecca Elmore, Deepak Mav, Alex Sedykh, Kristina Thayer, Alex Merrick, Vickie Walker, Andrew Rooney, and Ruchir Shah. 2020. SWIFT-Active Screener: Accelerated document screening through active learning and integrated recall estimation. Environment International, 138:105623.

Kexin Huang, Jaan Altosaar, and R. Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. ArXiv, abs/1904.05342.

X. Huang, J. Lin, and Dina Demner-Fushman. 2006. Evaluation of PICO as a knowledge representation for clinical questions. AMIA Annual Symposium Proceedings, pages 359–63.

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. 2020. SciREX: A challenge dataset for document-level information extraction. In ACL.

Di Jin and Peter Szolovits. 2018. PICO element detection in medical text via long short-term memory neural networks. In Proceedings of the BioNLP 2018 workshop, pages 67–75, Melbourne, Australia. Association for Computational Linguistics.

K. Khan, R. Kunz, J. Kleijnen, and G. Antes. 2003. Five steps to conducting a systematic review. Journal of the Royal Society of Medicine, 96(3):118–21.

Tushar Khot, A. Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

J. Kringelum, Sonny Kim Kjærulff, S. Brunak, O. Lund, T. Oprea, and O. Taboureau. 2016. ChemProt-3.0: a global chemical biology diseases mapping. Database: The Journal of Biological Databases and Curation, 2016.
F. Lachapelle. 2020. COVID-19 preprints and their publishing rate: An improved method. medRxiv.

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131–4141, Brussels, Belgium. Association for Computational Linguistics.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, D. Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3705–3717, Minneapolis, Minnesota. Association for Computational Linguistics.

M. Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020a. Pre-training via paraphrasing. ArXiv, abs/2006.15020.

M. Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, A. Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020b. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv, abs/1910.13461.

J. Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, A. P. Davis, C. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation, 2016.

Wei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, and Junping Du. 2020. Leveraging graph to improve abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6232–6243, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL 2004.

J. Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37:145–151.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Yang Liu and Mirella Lapata. 2019. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy. Association for Computational Linguistics.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3232, Brussels, Belgium. Association for Computational Linguistics.

I. Marshall, Joël Kuiper, and Byron C. Wallace. 2015. Automating risk of bias assessment for clinical trials. IEEE Journal of Biomedical and Health Informatics, 19(4):1406–12.

I. Marshall, Joël Kuiper, and Byron C. Wallace. 2016. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association: JAMIA, 23:193–201.

I. Marshall and Byron C. Wallace. 2019. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8.

M. Meldrum. 2000. A brief history of the randomized controlled trial. From oranges and lemons to the gold standard. Hematology/Oncology Clinics of North America, 14(4):745–60, vii.

M. Michelson and K. Reuter. 2019. The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials. Contemporary Clinical Trials Communications, 16.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016a. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, C. D. Santos, Çaglar Gülçehre, and B. Xiang. 2016b. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Christopher Norman. 2020. Systematic review automation methods. Ph.D. thesis, Université Paris-Saclay; Universiteit van Amsterdam.
Benjamin Nye, Junyi Jessy Li, Roma Patel, Yinfei Yang, Iain Marshall, Ani Nenkova, and Byron Wallace. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 197–207, Melbourne, Australia. Association for Computational Linguistics.

Benjamin E. Nye, Jay DeYoung, Eric Lehman, Ani Nenkova, Iain J. Marshall, and Byron C. Wallace. 2020. Understanding clinical trial reports: Extracting medical entities and their relations.

Karolina Owczarzak and Hoa Trang Dang. 2011. Overview of the TAC 2011 summarization track: Guided task and AESOP task.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

F. Petrelli and S. Barni. 2013. Non-cancer-related mortality after cisplatin-based adjuvant chemotherapy for non-small cell lung cancer: a study-level meta-analysis of 16 randomized trials. Medical Oncology, 30:1–10.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In EMNLP.

R. V. D. Schoot, J. D. Bruin, R. Schram, Parisa Zahedi, J. D. Boer, F. Weijdema, B. Kramer, Martijn Huijts, M. Hoogerwerf, Gerbrich Ferdinands, Albert Harkema, Joukje Willemsen, Yongchao Ma, Qixiang Fang, L. Tummers, and D. Oberski. 2020. ASReview: Open source software for efficient and transparent active learning for systematic reviews. ArXiv, abs/2006.12166.

Darsh J. Shah, Lili Yu, Tao Lei, and Regina Barzilay. 2021. Nutribullets hybrid: Multi-document health summarization.

Weixiang Shao, C. Adams, A. Cohen, J. Davis, M. McDonagh, S. Thakurta, Philip S. Yu, and N. Smalheiser. 2015. Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial. Methods, 74:65–70.

K. Shojania, M. Sampson, M. Ansari, J. Ji, S. Doucette, and D. Moher. 2007. How quickly do systematic reviews go out of date? A survival analysis. Annals of Internal Medicine, 147:224–233.

Nitish Srivastava, Geoffrey E. Hinton, A. Krizhevsky, Ilya Sutskever, and R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Mark G. Starr, Iain Chalmers, Mike Clarke, and Andrew David Oxman. 2009. The origins, evolution, and future of the Cochrane Database of Systematic Reviews. International Journal of Technology Assessment in Health Care, 25 Suppl 1:182–95.

G. Tsafnat, P. Glasziou, Miew Keen Choong, A. Dunn, Filippo Galgani, and E. Coiera. 2014. Systematic review automation technologies. Systematic Reviews, 3:74.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272.

David Wadden, Kyle Lo, Lucy Lu Wang, Shanchuan Lin, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. In EMNLP.

Byron C. Wallace, Sayantan Saha, Frank Soboczenski, and I. Marshall. 2020. Generating (factual?) narrative summaries of RCTs: Experiments with neural multi-document summarization. ArXiv, abs/2008.11293.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020b. CORD-19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Jiao Li, Thomas C. Wiegers, and Zhiyong Lu. 2015. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, volume 14.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

L. Zakowski, C. Seibert, and S. Vaneyck. 2004. Evidence-based medicine: answering questions of diagnosis. Clinical Medicine & Research, 2(1):63–9.

J. Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Towards a neural network approach to abstractive multi-document summarization. ArXiv, abs/1804.09010.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Eya Znaidi, L. Tamine, and C. Latiri. 2015. Answering PICO clinical questions: A semantic graph-based approach. In AIME.