An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature
Sarvesh Soni, Kirk Roberts
School of Biomedical Informatics
University of Texas Health Science Center at Houston
Houston TX, USA
{sarvesh.soni, kirk.roberts}@uth.tmc.edu

arXiv:2007.03106v2 [cs.IR] 27 Jul 2020

Abstract

The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, primarily through the use of text mining and search tools. This has led to both corpora for biomedical articles related to COVID-19 (such as the CORD-19 corpus (Wang et al., 2020)) as well as search engines to query such data. While most research in search engines is performed in the academic field of information retrieval (IR), most academic search engines, though rigorously evaluated, are sparsely utilized, while major commercial web search engines (e.g., Google, Bing) dominate. This relates to COVID-19 because it can be expected that commercial search engines deployed for the pandemic will gain much higher traction than those produced in academic labs, which leads to questions about the empirical performance of these search tools. This paper seeks to empirically evaluate two such commercial search engines for COVID-19, produced by Google and Amazon, in comparison to the more academic prototypes evaluated in the context of the TREC-COVID track (Roberts et al., 2020). We performed several steps to reduce bias in the available manual judgments in order to ensure a fair comparison of the two systems with those submitted to TREC-COVID. We find that the top-performing system from TREC-COVID on the bpref metric performed the best among the different systems evaluated in this study on all the metrics. This has implications for developing biomedical retrieval systems for future health crises as well as for trust in popular health search engines.

1 Background and Significance

There has been a surge of scientific studies related to COVID-19 due to the availability of archival sources as well as the expedited review policies of publishing venues. A systematic effort to consolidate the flood of such information content, in the form of scientific articles along with studies from the past that may be relevant to COVID-19, is being carried out as requested by the White House (Wang et al., 2020). This effort led to the creation of CORD-19, a dataset of scientific articles related to COVID-19 and the other viruses from the coronavirus family. One of the main aims of building such a dataset is to bridge the gap between machine learning and biomedical expertise to surface insightful information from the abundance of relevant published content. The TREC-COVID challenge was introduced to target the exploration of the CORD-19 dataset by gathering the information needs of biomedical researchers (Roberts et al., 2020; Voorhees et al., 2020). The challenge involved an information retrieval (IR) task to retrieve a set of ranked relevant documents for a given query. Similar to the task of TREC-COVID, major technology companies Amazon and Google also developed their own systems for exploring the CORD-19 dataset.

Both Amazon and Google have made recent forays into biomedical natural language processing (NLP). Amazon launched Amazon Comprehend Medical (ACM) for developers to process unstructured medical data effectively (Kass-Hout and Wood, 2018). This motivated several researchers to explore the tool's capability in information extraction (Bhatia et al., 2019; Guzman et al., 2020; Heider et al., 2020). Interestingly, the same technology is also incorporated into Amazon's search engine for the CORD-19 dataset, so it will be useful to assess the overall performance of a search engine that utilizes the company's NLP technology. Similarly, BERT from Google (Devlin et al., 2019) is enormously popular. BERT is a powerful language model that is trained on large raw text datasets to learn the nuances of natural language in an efficient manner. The methodology of training BERT helps it transfer the knowledge from vast raw data sources to other specific domains such as biomedicine. Several works have explored the efficacy of BERT models in the biomedical domain for tasks such as information extraction (Wu et al., 2020) and question answering (Soni and Roberts, 2020). Many biomedical and scientific variants of the model have also been built, such as BioBERT (Lee et al., 2019), Clinical BERT (Alsentzer et al., 2019), and SciBERT (Beltagy et al., 2019). Google has even incorporated BERT into their web search engine (Nayak, 2019). Since this is the same technology that powers Google's CORD-19 search explorer, it will be interesting to assess the performance of this search tool.
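As a purely illustrative aside, off-the-shelf biomedical BERT variants of the kind mentioned above are typically loaded through the Hugging Face transformers library; the sketch below shows this pattern, where the model identifier is an assumed placeholder and the snippet is not part of either company's system.

```python
# Illustrative only: loading a pretrained biomedical BERT variant with the
# Hugging Face transformers library. The model identifier is a placeholder.
from transformers import AutoModel, AutoTokenizer

model_name = "dmis-lab/biobert-v1.1"  # assumed identifier for a BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer(
    "Remdesivir is an antiviral agent studied as a treatment for COVID-19.",
    return_tensors="pt",
)
outputs = model(**inputs)
# Contextual token embeddings that downstream IR or QA components could build on.
print(outputs.last_hidden_state.shape)
```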
However, despite the popularity of these companies' products, no formal evaluation of these systems has been made available by the companies. Also, neither of these companies participated in the TREC-COVID challenge. In this paper, we aim to evaluate these two IR systems and compare them against the runs submitted to the TREC-COVID challenge to gauge the efficacy of what are likely highly utilized search engines.

2 Methods

2.1 Information Retrieval Systems

We evaluate two publicly available IR systems targeted toward exploring the COVID-19 Open Research Dataset (CORD-19)1 (Wang et al., 2020). These systems were launched by Amazon (CORD-19 Search2) and Google (COVID-19 Research Explorer3). We hereafter refer to these systems by the names of their corporations, i.e., Amazon and Google. Both systems take as input a query in the form of natural language and return a list of documents from the CORD-19 dataset ranked by their relevance to the given query.

Amazon's system uses an enriched version of the CORD-19 dataset constructed by passing it through a language processing service called Amazon Comprehend Medical (ACM) (Kass-Hout and Snively, 2020). ACM is a machine learning-based natural language processing (NLP) pipeline to extract clinical concepts such as signs, symptoms, diseases, and treatments from unstructured text (Kass-Hout and Wood, 2018). The data is further mapped to clinical topics related to COVID-19, such as immunology, clinical trials, and virology, using multi-label classification and inference models. After the enrichment process, the data is indexed using Amazon Kendra, which also uses machine learning to provide natural language querying capabilities for extracting relevant documents.
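To make the enrichment step concrete, the following is a minimal sketch, not the actual CORD-19 Search pipeline, of how clinical entities could be pulled from an abstract with the publicly documented Comprehend Medical API via boto3; the example text, region, and credential setup are assumptions.

```python
# Hedged illustration (not the actual CORD-19 Search pipeline): extracting
# clinical concepts from an abstract with Amazon Comprehend Medical via boto3.
import boto3

# Assumes AWS credentials are already configured; the region is a placeholder.
client = boto3.client("comprehendmedical", region_name="us-east-1")

abstract = (
    "Remdesivir was associated with a shorter time to recovery in hospitalized "
    "adults with COVID-19 and evidence of lower respiratory tract infection."
)

# detect_entities_v2 returns clinical entities (e.g., MEDICATION,
# MEDICAL_CONDITION) with types and confidence scores; such output could be
# attached to each article as additional searchable metadata.
response = client.detect_entities_v2(Text=abstract)
for entity in response["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"],
          round(entity["Score"], 2))
```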
Google's system is based on a semantic search mechanism powered by BERT (Devlin et al., 2019), a deep learning-based approach to pre-training and fine-tuning for downstream NLP tasks (document retrieval in this case) (Hall, 2020). Semantic search, unlike lexical term-based search that aims at phrasal matching, focuses on understanding the meaning of user queries. However, deep learning models such as BERT require a substantial amount of annotated data to be tuned for a specific task or domain. Biomedical articles have very different linguistic features than the general domain upon which the BERT model is built. Thus, the model needs to be tuned for the target domain, i.e., the biomedical domain, using annotated data. For this purpose, they use biomedical IR datasets from the BioASQ challenges4. Due to the smaller size of these biomedical datasets, and the large data requirement of neural models, they use a synthetic query generation technique to augment the existing biomedical IR datasets (Ma et al., 2020). Finally, these expanded datasets are used to fine-tune the neural model. They further enhance their system by combining term-based and neural retrieval models, balancing memorization and generalization dynamics (Jiang et al., 2020).

1 https://www.semanticscholar.org/cord19
2 https://cord19.aws
3 https://covid19-research-explorer.appspot.com
4 http://bioasq.org
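As a rough, generic illustration of combining term-based and neural retrieval signals (and not Google's actual implementation), the sketch below interpolates BM25 scores with dense-embedding similarities; the rank_bm25 and sentence-transformers packages, the model name, and the interpolation weight are all assumptions made for the example.

```python
# Hedged sketch of hybrid lexical + semantic ranking, assuming the rank_bm25
# and sentence-transformers packages; not the COVID-19 Research Explorer itself.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Serological assays for detection of SARS-CoV-2 antibodies",
    "Impact of social distancing on COVID-19 transmission dynamics",
    "Remdesivir in adults with severe COVID-19: a randomised trial",
]
query = "are there serological tests that detect antibodies to coronavirus?"

# Lexical signal: BM25 over whitespace-tokenized text, normalized to [0, 1].
bm25 = BM25Okapi([d.lower().split() for d in docs])
raw_lexical = bm25.get_scores(query.lower().split())
peak = max(raw_lexical) or 1.0
lexical = [s / peak for s in raw_lexical]

# Semantic signal: cosine similarity of dense embeddings (model name is a placeholder).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(docs, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
semantic = util.cos_sim(query_emb, doc_emb)[0].tolist()

# Interpolate the two signals; the 0.5 weight is arbitrary for illustration.
scores = [0.5 * l + 0.5 * s for l, s in zip(lexical, semantic)]
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:6.3f}  {doc}")
```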
2.2 Evaluation

We use a topic set collected as part of the TREC-COVID challenge for our evaluations (Roberts et al., 2020; Voorhees et al., 2020). These topics are a set of information need statements motivated by searches submitted to the National Library of Medicine and suggestions from researchers on Twitter. Each topic consists of three fields with varying levels of granularity in terms of expressing the information need, namely, a (keyword-based) query, a (natural language) question, and a (longer, descriptive) narrative. A few example topics from Round 1 of the challenge are presented in Table 1.

Table 1: Three example topics from Round 1 of the TREC-COVID challenge.

Topic 7
  Query: serological tests for coronavirus
  Question: are there serological tests that detect antibodies to coronavirus?
  Narrative: looking for assays that measure immune response to coronavirus that will help determine past infection and subsequent possible immunity.

Topic 10
  Query: coronavirus social distancing impact
  Question: has social distancing had an impact on slowing the spread of COVID-19?
  Narrative: seeking specific information on studies that have measured COVID-19's transmission in one or more social distancing (or non-social distancing) approaches.

Topic 30
  Query: coronavirus remdesivir
  Question: is remdesivir an effective treatment for COVID-19?
  Narrative: seeking specific information on clinical outcomes in COVID-19 patients treated with remdesivir.

The challenge participants are required to return a ranked list of documents for each topic (also known as runs). The first round of TREC-COVID used a set of 30 topics and exploited the April 10, 2020 release of CORD-19. Round 1 of the challenge was initiated on April 15, 2020, with the runs from participants due April 23. Relevance judgments were released May 3.

We use the question and narrative fields from the topics to query the systems developed by Amazon and Google. These fields are chosen following the recommendations set forward by the organizations, i.e., to use fully formed queries with questions and context. We use two variations for querying the systems. In the first variation, we query the systems using only the question. In the second variation, we also append the narrative to provide more context.

As we accessed these systems in the first week of May 2020, the systems could be using the latest version of CORD-19 available at that time (i.e., the May 1 release). Thus, we filter the list of returned documents and only include the ones from the April 10 release to ensure a fair comparison with the submissions to Round 1 of TREC-COVID. We compare the performance of these systems (by Amazon and Google) with the 5 top submissions to TREC-COVID Round 1 (on the basis of bpref scores). It is valid to compare the Amazon and Google systems with the submissions from Round 1 because all these systems are similarly built without using any relevance judgments from TREC-COVID.

Relevance judgments (or assessments) for TREC-COVID are carried out by individuals with biomedical expertise. The assessments are performed using a pooling mechanism where only the top-ranked results from different submissions are assessed. A document is assigned one of three possible judgments, namely, relevant, partially relevant, or not relevant. We use relevance judgments from Rounds 1 and 2. However, even the combined judgments from both rounds may not ensure that relevance judgments exist for the top-n documents of both evaluated systems. It has recently been shown that pooling effects can negatively impact post-hoc evaluation of systems that did not participate in the pooling (Yilmaz et al., 2020). So, to create a level ground for comparison, we perform additional relevance assessments for the documents from the evaluated systems that may not have been covered by the combined set of judgments from TREC-COVID. In total, 141 documents were assessed by 2 individuals who are also involved in performing the relevance judgments for TREC-COVID.

The runs submitted to TREC-COVID could contain up to 1000 documents per topic. Due to the restrictions posed by the evaluated systems, we could only fetch up to 100 documents per query. This number further decreases when we remove the documents that are not covered as part of the April 10 release of CORD-19. Thus, to ensure a fair comparison of the evaluated systems with the runs submitted to TREC-COVID, we calculate the minimum number of documents per topic (we call it the topic-minimum) across the different variations of querying the evaluated systems (i.e., question or question+narrative). We then use this topic-minimum as a threshold for the maximum number of documents per topic for all evaluated systems. This ensures that each system returns the same number of documents for a particular topic.
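The filtering and topic-minimum thresholding described above can be summarized in a short sketch; the data structures (per-variant ranked lists and the set of April 10 document IDs) are hypothetical stand-ins for the actual run files.

```python
# Hedged sketch of the run post-processing described above: keep only documents
# from the April 10 CORD-19 release, then truncate every run to the per-topic
# minimum length (the "topic-minimum"). All inputs are illustrative placeholders.
from typing import Dict, List, Set

def filter_and_threshold(
    runs: Dict[str, Dict[str, List[str]]],  # variant -> topic id -> ranked doc ids
    april10_ids: Set[str],                  # doc ids present in the April 10 release
) -> Dict[str, Dict[str, List[str]]]:
    # Drop documents that are not part of the April 10 release.
    filtered = {
        variant: {topic: [d for d in docs if d in april10_ids]
                  for topic, docs in by_topic.items()}
        for variant, by_topic in runs.items()
    }
    # Topic-minimum: smallest number of remaining documents for a topic across
    # the query variants (question, question + narrative). Assumes all variants
    # cover the same topics.
    topics = next(iter(filtered.values()))
    topic_min = {t: min(len(by_topic[t]) for by_topic in filtered.values())
                 for t in topics}
    # Truncate each run so every system returns the same number of documents
    # for a particular topic.
    return {
        variant: {t: docs[: topic_min[t]] for t, docs in by_topic.items()}
        for variant, by_topic in filtered.items()
    }
```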
We use the standard measures in our evaluation as employed for TREC-COVID, namely, bpref (binary preference), NDCG@10 (normalized discounted cumulative gain over the top 10 documents), and P@5 (precision at 5 documents). Here, bpref only uses judged documents in its calculation, while the other two measures assume non-judged documents to be not relevant. Additionally, we also calculate MAP (mean average precision), NDCG, and P@10. Note that we can precisely calculate the measures that cut off the number of documents at up to 10, since we have ensured that both evaluated systems (for both query variations) have their top 10 documents manually judged (through the TREC-COVID judgments and our additional assessments as part of this study). We use the trec_eval tool5 for our evaluations, which is the standard evaluation program employed for the TREC challenges.

5 https://github.com/usnistgov/trec_eval
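For readers who wish to reproduce this style of scoring, the following is a minimal sketch using pytrec_eval, a Python wrapper around trec_eval; the qrels and run dictionaries are toy placeholders rather than the actual TREC-COVID judgments, and the measure names follow trec_eval's output conventions.

```python
# Hedged sketch of computing the reported measures with pytrec_eval, a Python
# interface to trec_eval. The qrels and run values below are toy examples.
import pytrec_eval

# qrels: topic -> doc -> judgment (0 = not, 1 = partially, 2 = relevant).
qrels = {
    "7": {"doc_a": 2, "doc_b": 1, "doc_c": 0},
    "10": {"doc_d": 2, "doc_e": 0},
}
# run: topic -> doc -> retrieval score (higher scores rank earlier).
run = {
    "7": {"doc_a": 9.1, "doc_c": 7.4, "doc_b": 6.8},
    "10": {"doc_e": 5.2, "doc_d": 4.9},
}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"map", "ndcg", "ndcg_cut", "P", "bpref"}
)
per_topic = evaluator.evaluate(run)

# Average selected measures over topics, mirroring the columns of Table 2.
for measure in ["P_5", "P_10", "ndcg_cut_10", "map", "ndcg", "bpref"]:
    mean = sum(scores[measure] for scores in per_topic.values()) / len(per_topic)
    print(f"{measure:12s} {mean:.4f}")
```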
Table 2: Evaluation results after setting a threshold on the number of documents per topic using the minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                               P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon      question                 0.6733  0.6333  0.5390   0.0722  0.1838  0.1049
Amazon      question + narrative     0.7200  0.6400  0.5583   0.0766  0.1862  0.1063
Google      question                 0.5733  0.5700  0.4972   0.0693  0.1831  0.1069
Google      question + narrative     0.6067  0.5600  0.5112   0.0687  0.1821  0.1054
TREC-COVID  1. sab20.1.meta.docs     0.7800  0.7133  0.6109   0.0999  0.2266  0.1352
TREC-COVID  2. sab20.1.merged        0.6733  0.6433  0.5555   0.0787  0.1971  0.1154
TREC-COVID  3. UIowaS Run3           0.6467  0.6367  0.5466   0.0952  0.2091  0.1279
TREC-COVID  4. smith.rm3             0.6467  0.6133  0.5225   0.0914  0.2095  0.1303
TREC-COVID  5. udel fang run3        0.6333  0.6133  0.5398   0.0857  0.1977  0.1187

Figure 1: A box plot of the number of documents for each topic as used in our evaluations (after filtering the documents based on the April 10 release of the CORD-19 dataset and setting a threshold at the minimum number of documents for any given topic).

3 Results

The total number of documents used for each topic based on the topic-minimums is shown in the form of a box plot in Figure 1. On average, approximately 43 documents are evaluated per topic, with a median of 40.5 documents. This is another reason for using a topic-wise minimum rather than cutting off all the systems at the same level as the lowest return count (which would be 25 documents). Having a topic-wise cut-off allowed us to evaluate the runs with the maximum possible documents while keeping the evaluation fair.

The evaluation results of our study are presented in Table 2. Among the commercial systems evaluated as part of this study, the question plus narrative variant of the system by Amazon performed consistently better than any other variant in terms of all the included measures other than bpref. In terms of bpref, the question-only variant of the system from Google performed the best among the evaluated systems. Note that the best run from the TREC-COVID challenge, after cutting off using topic-minimums, still performed better than the other four submitted runs included in our evaluation. Interestingly, this best run also performed substantially better than all the variants of both commercial systems evaluated as part of the study on all the calculated metrics. We discuss this system further below.

4 Discussion
We evaluate two commercial IR systems targeted toward extracting relevant documents from the CORD-19 dataset. For comparison, we also include the 5 best runs from TREC-COVID in our evaluation. We additionally annotate a total of 141 documents from the runs by the commercial systems to ensure a fair comparison between these runs and the runs from the TREC-COVID challenge. We find that the best system from TREC-COVID in terms of the bpref metric outperformed all the commercial system variants on all the evaluated measures, including P@5, NDCG@10, and bpref, which are the standard measures used in TREC-COVID.

The commercial systems often employ cutting-edge technologies, such as ACM and BERT used by Amazon and Google, while developing their systems. Also, the availability of technological resources such as CPUs and GPUs may be better in industry settings than in academic settings. This follows a common concern in academia, namely that the resource requirements for advanced machine learning methods (e.g., GPT-3 (Brown et al., 2020)) are well beyond the capabilities available to the vast majority of researchers. However, these results instead demonstrate the potential pitfalls of deploying a deep learning-based system without proper tuning. The sabir (sab20.*) system does not use machine learning at all: it is based on the very old SMART system (Buckley, 1985) and does not utilize any biomedical resources. It is instead carefully deployed based on an analysis of the data fields available in CORD-19. Subsequent rounds of TREC-COVID have since overtaken sabir (based, indeed, on machine learning with relevant training data). The lesson, then, for future emerging health events is that deploying "state-of-the-art" methods without event-specific data may be dangerous, and in the face of uncertainty simple may still be best.

As evident from Figure 1, many of the documents retrieved by the commercial systems were not part of the April 10 release of CORD-19. We queried these systems after another version of the CORD-19 dataset was released. New sources of papers were constantly being added to the dataset, alongside updates to the content of existing papers and the addition of newly published research related to COVID-19. This may have led to the retrieval of more articles from the new release of the dataset. However, for a fair comparison between the commercial and the TREC-COVID systems, we pruned the list of documents and performed additional relevance judgments. We have included the evaluation results that would have resulted without our modifications in the supplemental material. The performance of these two systems drops precipitously. Yet, as addressed, this would not have been a "fair" comparison, and thus the corrective measures described above were necessary to ensure the scientific validity of our comparison.

5 Conclusion

We assessed the performance of two commercial IR systems using similar evaluation methods and measures as the TREC-COVID challenge. To facilitate a fair comparison between these systems and the top 5 runs submitted to TREC-COVID, we cut all the runs at different thresholds and performed more relevance judgments beyond the assessments provided by TREC-COVID. We found that the top-performing system from TREC-COVID on the bpref metric remained the best performing system among the commercial and the TREC-COVID submissions on all the evaluation metrics. Interestingly, this best performing run comes from a simple system that is purely based on the data elements present in the CORD-19 dataset and does not apply machine learning. Thus, applying cutting-edge technologies without enough target data-specific modifications may not be sufficient for achieving optimal results.

Acknowledgments

The authors thank Meghana Gudala and Jordan Godfrey-Stovall for conducting the additional retrieval assessments. This work was supported in part by the National Science Foundation (NSF) under award OIA-1937136.
References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620.
Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. 2019. Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1844–1851.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Chris Buckley. 1985. Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Benedict Guzman, Isabel Metzger, Yindalon Aphinyanaphongs, and Himanshu Grover. 2020. Assessment of Amazon Comprehend Medical: Medication Information Extraction.

Keith Hall. 2020. An NLU-Powered Tool to Explore COVID-19 Scientific Literature.

Paul M. Heider, Jihad S. Obeid, and Stéphane M. Meystre. 2020. A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools. AMIA Summits on Translational Science Proceedings, 2020:241–250.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C. Mozer. 2020. Characterizing Structural Regularities of Labeled Data in Overparameterized Models. arXiv:2002.03206 [cs, stat].

Taha A. Kass-Hout and Ben Snively. 2020. AWS launches machine learning enabled search capabilities for COVID-19 dataset.

Taha A. Kass-Hout and Matt Wood. 2018. Introducing medical language processing with Amazon Comprehend Medical.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pages 1–7.

Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2020. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation. arXiv:2004.14503 [cs].

Pandu Nayak. 2019. Understanding searches better than ever before.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh. 2020. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. Journal of the American Medical Informatics Association.

Sarvesh Soni and Kirk Roberts. 2020. Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering. In Proceedings of the LREC, pages 5534–5540.

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. ACM SIGIR Forum, 54:1–12.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. arXiv:2004.10706v2.

Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. 2020. Deep learning in clinical natural language processing: A methodical review. Journal of the American Medical Informatics Association, 27:457–470.

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Daniel Campos. 2020. On the Reliability of Test Collections for Evaluating Systems of Different Types. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2101–2104.

A Supplementary Material

The results without taking into account our additional annotations, i.e., only using the relevance judgments from TREC-COVID Rounds 1 and 2, are presented in Table 3. Similarly, the results without setting an explicit threshold on the number of documents returned by the systems are shown in Table 4. The results without either of the two modifications made by us are provided in Table 5.
Table 3: Evaluation results after setting a threshold on the number of documents per topic using the minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                               P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon      question                 0.6467  0.5933  0.5095   0.0690  0.1794  0.1035
Amazon      question + narrative     0.6933  0.5933  0.5307   0.0722  0.1804  0.1031
Google      question                 0.5667  0.5133  0.4688   0.0655  0.1785  0.1048
Google      question + narrative     0.5600  0.5133  0.4795   0.0656  0.1763  0.1031
TREC-COVID  1. sab20.1.meta.docs     0.7800  0.7133  0.6109   0.1007  0.2278  0.1361
TREC-COVID  2. sab20.1.merged        0.6667  0.6400  0.5539   0.0789  0.1968  0.1155
TREC-COVID  3. UIowaS Run3           0.6467  0.6367  0.5466   0.0960  0.2099  0.1287
TREC-COVID  4. smith.rm3             0.6467  0.6133  0.5225   0.0922  0.2107  0.1315
TREC-COVID  5. udel fang run3        0.6333  0.6133  0.5398   0.0866  0.1989  0.1196

Table 4: Evaluation results WITHOUT setting a threshold on the number of documents per topic using the minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID and our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.

System                               P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon      question                 0.6733  0.6333  0.5390   0.0765  0.1931  0.1134
Amazon      question + narrative     0.7200  0.6400  0.5583   0.0788  0.1903  0.1105
Google      question                 0.5733  0.5700  0.4972   0.0775  0.2001  0.1227
Google      question + narrative     0.6067  0.5600  0.5112   0.0763  0.1979  0.1210
TREC-COVID  1. sab20.1.meta.docs     0.7800  0.7133  0.6109   0.2037  0.4702  0.3404
TREC-COVID  2. sab20.1.merged        0.6733  0.6433  0.5555   0.1598  0.4415  0.3433
TREC-COVID  3. UIowaS Run3           0.6467  0.6367  0.5466   0.1740  0.4145  0.3229
TREC-COVID  4. smith.rm3             0.6467  0.6133  0.5225   0.1947  0.4461  0.3406
TREC-COVID  5. udel fang run3        0.6333  0.6133  0.5398   0.1911  0.4495  0.3246

Table 5: Evaluation results WITHOUT setting a threshold on the number of documents per topic using the minimum number of documents present for each individual topic. The relevance judgments used are a combination of Rounds 1 and 2 of TREC-COVID (WITHOUT our additional relevance assessments). The highest scores for the evaluated and TREC-COVID systems are underlined.

System                               P@5     P@10    NDCG@10  MAP     NDCG    bpref
Amazon      question                 0.6467  0.5933  0.5095   0.0732  0.1888  0.1121
Amazon      question + narrative     0.6933  0.5933  0.5307   0.0744  0.1846  0.1074
Google      question                 0.5667  0.5133  0.4688   0.0734  0.1954  0.1208
Google      question + narrative     0.5600  0.5133  0.4795   0.0728  0.1919  0.1188
TREC-COVID  1. sab20.1.meta.docs     0.7800  0.7133  0.6109   0.2038  0.4693  0.3406
TREC-COVID  2. sab20.1.merged        0.6667  0.6400  0.5539   0.1589  0.4393  0.3426
TREC-COVID  3. UIowaS Run3           0.6467  0.6367  0.5466   0.1742  0.4139  0.3225
TREC-COVID  4. smith.rm3             0.6467  0.6133  0.5225   0.1956  0.4469  0.3413
TREC-COVID  5. udel fang run3        0.6333  0.6133  0.5398   0.1914  0.4497  0.3248