Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text
Maya Varma (Stanford University, mvarma2@cs.stanford.edu), Laurel Orr (Stanford University, lorr1@cs.stanford.edu), Sen Wu (Stanford University, senwu@cs.stanford.edu), Megan Leszczynski (Stanford University, mleszczy@cs.stanford.edu), Xiao Ling (Apple, xiaoling@apple.com), Christopher Ré (Stanford University, chrismre@cs.stanford.edu)

arXiv:2110.08228v1 [cs.CL] 15 Oct 2021

Abstract

Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the presence of coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining. Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR. Furthermore, we improve disambiguation of rare entities by up to 57 accuracy points.

1 Introduction

Named entity disambiguation (NED), which involves mapping mentions in unstructured text to a structured knowledge base (KB), is a critical preprocessing step in biomedical text parsing pipelines (Percha, 2020). For instance, consider the following sentence: "We study snake evolution by focusing on a cis-acting enhancer of Sonic Hedgehog." In order to obtain a structured characterization of the sentence for use in downstream applications, a NED system must map the mention Sonic Hedgehog to the entity SHH gene. To do so, the system can use context cues such as "enhancer" and "evolution", which commonly refer to genes, to avoid selecting semantically similar concepts such as Sonic Hedgehog protein or Sonic Hedgehog signaling pathway.

Although NED systems have been successfully designed for general text corpora (Orr et al., 2021; Yamada et al., 2020; Wu et al., 2020), the NED task remains particularly challenging in the medical setting due to the presence of rare entities that occur infrequently in medical literature (Agrawal et al., 2020). As a knowledge-intensive task, NED requires the incorporation of structural resources, such as entity descriptions and category types, to effectively disambiguate rare entities (Orr et al., 2021). However, this is difficult to accomplish in the medical setting for the following reasons:

1. Coarse-grained and incomplete structural resources: Metadata associated with entities in medical KBs is often coarse-grained or incomplete (Chen et al., 2009; Halper et al., 2011; Agrawal et al., 2020). For example, over 65% of entities in the Unified Medical Language System (UMLS) ontology (https://uts.nlm.nih.gov/uts/umls/home), a popular medical KB, are associated with just ten types, suggesting that these types do not provide fine-grained disambiguation signals. In addition, over 93% of entities in the UMLS KB have no associated description.

2. Low coverage over uncommon resources: Entities associated with some structural resources may occur infrequently in biomedical text. For instance, MedMentions (Mohan and Li, 2019), one of the largest available biomedical NED datasets, contains fewer than thirty occurrences of entities with the type "Drug Delivery Device". In contrast, the high-coverage type "Disease or Syndrome" is observed over 10,000 times. As a result, models may not learn effective reasoning patterns for disambiguating entities associated with uncommon structural resources, which limits the ability of the model to use these resources for resolving rare entities.

In this work, we design a biomedical NED system to improve disambiguation of rare entities through cross-domain data integration, which involves transferring knowledge between domains.
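To make the disambiguation decision concrete, the toy sketch below scores the candidate entities from the introductory example by lexical overlap between context words and candidate-associated cue words. It is purely illustrative: the candidate names come from the example above, but the cue lists are our own invention and are not a component of the system described in this paper, which learns such associations from data.

    # Toy illustration of context-based disambiguation. The cue words
    # per candidate are invented for this example; real NED systems
    # learn these associations rather than using hand-written lists.
    context = {"enhancer", "evolution", "cis-acting"}

    candidate_cues = {
        "SHH gene": {"enhancer", "evolution", "locus", "cis-acting"},
        "Sonic Hedgehog protein": {"ligand", "secreted", "binding"},
        "Sonic Hedgehog signaling pathway": {"cascade", "receptor"},
    }

    # Pick the candidate whose cues best overlap the observed context.
    best = max(candidate_cues, key=lambda c: len(context & candidate_cues[c]))
    print(best)  # -> "SHH gene"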
Figure 1: (Left) We integrate structural knowledge between a general text KB and a medical KB, which allows us to augment structural resources for medical entities and generate pretraining datasets. (Right) A pretrained model injected with augmented structural information can now better reason over context cues to perform NED.

Data integration across heterogeneous domains is a challenging problem with potential applications across numerous knowledge-intensive tasks. Here, we address this problem by utilizing a state-of-the-art general text entity linker to map medical entities to corresponding items in WikiData (https://www.wikidata.org/wiki/Wikidata:Main_Page), a common general text KB. The key contributions of this work are listed below (code and data available at https://github.com/HazyResearch/medical-ned-integration):

• We generate structural resources for medical entities by incorporating knowledge from WikiData. This results in an augmented medical KB with a 12.8x increase in the number of entities with an associated description and a 2x increase in the average number of types per entity.

• We utilize our integrated entity mappings to obtain pretraining datasets from PubMed and a medical subset of Wikipedia. These datasets include a total of 2.8M sentences annotated with over 4.2M entities across 23 thousand types.

We evaluate our approach on two standard biomedical NED datasets: MedMentions and BC5CDR. Our results show that augmenting structural resources and pretraining across large datasets contribute to state-of-the-art model performance as well as up to a 57 point improvement in accuracy across rare entities that originally lack structural resources.

To the best of our knowledge, this is the first study to address medical NED through structured knowledge integration. Our cross-domain data integration approach can be translated beyond the medical domain to other knowledge-intensive tasks.

2 Related Work

Recent state-of-the-art approaches for the medical NED task utilize transformer-based architectures to perform two tasks: candidate extraction, which involves identifying a small set of plausible entities, and reranking, which involves assigning likelihoods to each candidate. Prior methods for this task generally limit the use of structural resources from medical KBs due to missing or limited information (Bhowmik et al., 2021). As a result, several existing approaches have been shown to generalize poorly to rare entities (Agrawal et al., 2020). Some previous studies have demonstrated that injecting auxiliary information, such as type or relation information, as well as pretraining can aid model performance on various biomedical NLP tasks (Yuan et al., 2021; Liu et al., 2021; He et al., 2020). However, these works are limited by the insufficient resources in medical KBs as well as the use of pretraining datasets that obtain low coverage over the entities in the KB. Although some methods have been previously designed to enrich the metadata in medical ontologies with external knowledge, these approaches either use text-matching heuristics (Wang et al., 2018) or only contain mappings for a small subset of medical entities (Rahimi et al., 2020). Cross-domain structural knowledge integration has not been previously studied in the context of the medical NED task.

3 Methods

We first present our cross-domain data integration approach for augmenting structural knowledge and obtaining pretraining datasets. We then describe the model architecture that we use to perform NED.
3.1 Cross-Domain Data Integration

Rich structural resources are vital for rare entity disambiguation; however, metadata associated with entities in medical KBs is often too coarse-grained to effectively discriminate between textually similar entities. We address this issue by integrating the UMLS Metathesaurus (Bodenreider, 2004), the most comprehensive medical KB, with WikiData, a KB often used in the general text setting (Vrandečić and Krötzsch, 2014). We perform data integration by using a state-of-the-art NED system (Orr et al., 2021) to map each UMLS entity to its most likely counterpart in WikiData; the canonical name for each UMLS entity is provided as input, and the system returns the most likely Wikipedia item. For example, the UMLS entity C0001621: Adrenal Gland Diseases is mapped to the WikiData item Q4684717: Adrenal gland disorder.

We then augment types and descriptions for each UMLS entity by incorporating information from the mapped WikiData item. For instance, the UMLS entity C0001621: Adrenal Gland Diseases is originally assigned the type "Disease or Syndrome" in the UMLS KB; our augmentation procedure introduces the more specific WikiData type "endocrine system disease". If the UMLS KB does not contain a description for a particular entity, we add a definition by extracting the first 150 words from its corresponding Wikipedia article. Our procedure results in an augmented UMLS KB with 24,141 types (a 190x increase), and 2.04M entities have an associated description (a 12.8x increase).

In order to evaluate the quality of our mapping approach, we utilize a segment of UMLS (approximately 9.3k entities) that has been previously annotated with corresponding WikiData items (Vrandečić and Krötzsch, 2014). Our mapping accuracy over this set is 80.2%. We also evaluate integration performance on this segment as the proportion of predicted entities that share a WikiData type with the true entity; a high value indicates that the predicted mapping adds relevant structural resources even when the exact entity match is wrong. Integration performance is 85.4%. The remainder of items in UMLS have no true mappings to WikiData, underscoring the complexity of this task.
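The type and description merge described above reduces to a small record update once a UMLS-to-WikiData mapping is available. The sketch below is a minimal illustration of that merge step; the record layouts and field names are our own assumptions, and the mapping itself is produced by a trained NED system (Orr et al., 2021), which we do not reproduce here.

    # Minimal sketch of the structural-resource augmentation in Section
    # 3.1. Data structures are hypothetical; the UMLS -> WikiData mapping
    # comes from a trained NED system (Orr et al., 2021).

    def augment_umls_entity(umls_entity: dict, wikidata_item: dict,
                            max_desc_words: int = 150) -> dict:
        """Merge WikiData types and descriptions into a UMLS entity record."""
        augmented = dict(umls_entity)
        # Union the coarse UMLS types with the finer-grained WikiData types.
        augmented["types"] = sorted(set(umls_entity.get("types", []))
                                    | set(wikidata_item.get("types", [])))
        # Backfill a missing description from the first 150 words of the
        # linked Wikipedia article.
        if not umls_entity.get("description"):
            words = wikidata_item.get("wikipedia_text", "").split()
            augmented["description"] = " ".join(words[:max_desc_words])
        return augmented

    # Example using the UMLS entity C0001621 discussed in the text.
    umls = {"cui": "C0001621", "name": "Adrenal Gland Diseases",
            "types": ["Disease or Syndrome"], "description": ""}
    wiki = {"qid": "Q4684717", "name": "Adrenal gland disorder",
            "types": ["endocrine system disease"],
            "wikipedia_text": "Adrenal gland disorders are conditions that ..."}
    print(augment_umls_entity(umls, wiki))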
3.2 Construction of Pretraining Datasets

Existing datasets for the biomedical NED task generally obtain low coverage over the entities and structural resources in the UMLS knowledge base, often including less than 1% of UMLS entities (Mohan and Li, 2019). Without adequate examples of structured metadata, models may not learn the complex reasoning patterns that are necessary for disambiguating rare entities. We address this issue by collecting the following two large pretraining datasets with entity annotations. Dataset statistics are summarized in Table 1.

                        PubMedDS      MedWiki
    Total Documents      508,295      813,541
    Total Sentences      916,945    1,892,779
    Total Mentions     1,390,758    2,897,621
    Unique Entities       40,848      230,871

Table 1: Dataset statistics for MedWiki and PubMedDS.

MedWiki: Wikipedia, which is often utilized as a rich knowledge source in general text settings, contains references to medical terms and consequently holds potential for improving performance on the medical NED task. We first annotate all Wikipedia articles with textual mentions and corresponding WikiData entities by obtaining gold entity labels from internal page links as well as generating weak labels based on pronouns and alternative entity names (Orr et al., 2021). Then, we extract sentences with relevant medical information by determining whether each WikiData item can be mapped to a UMLS entity using the data integration scheme described in Section 3.1 (a sketch of this filter follows below).

MedWiki can be compared to a prior Wikipedia-based medical dataset generated by Vashishth et al. (2021), which utilizes various knowledge sources to map WikiData items to UMLS entities based on Wikipedia hyperlinks. When evaluated with respect to the prior dataset, our MedWiki dataset achieves greater coverage over UMLS, with 230k unique concepts (4x prior) and a median of 214 concepts per type (15x prior). However, the use of weak labeling techniques in MedWiki may introduce some noise into the entity mapping process (Section 3.1 describes our evaluation of our mapping approach).

PubMedDS: The PubMedDS dataset, which was generated by Vashishth et al. (2021), includes data from PubMed abstracts. We remove all documents that are duplicated in our evaluation datasets.

We utilize the procedure detailed in Section 3.1 to annotate all entities with structural information obtained from UMLS and WikiData. Final dataset statistics are included in Table 1. In combination, the two pretraining datasets include 2.8M sentences annotated with 267,135 unique entities across 23,746 types.
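The MedWiki sentence filter referenced above can be sketched as follows: a sentence is retained only if at least one of its annotated WikiData entities maps into UMLS under the Section 3.1 integration. The data layout and the wikidata_to_umls lookup below are illustrative placeholders, not the authors' implementation.

    # Illustrative filter for building MedWiki (Section 3.2): keep a
    # sentence only if one of its linked WikiData entities maps to a
    # UMLS entity. All inputs here are hypothetical placeholders.

    wikidata_to_umls = {"Q4684717": "C0001621"}  # from the Section 3.1 mapping

    def is_medical(sentence: dict) -> bool:
        return any(qid in wikidata_to_umls for qid in sentence["entity_qids"])

    sentences = [
        {"text": "Adrenal gland disorders affect hormone production.",
         "entity_qids": ["Q4684717"]},
        {"text": "The film was shot on location in 1996.",
         "entity_qids": ["Q11424"]},
    ]
    medwiki = [s for s in sentences if is_medical(s)]
    # Relabel retained sentences with the mapped UMLS identifiers.
    for s in medwiki:
        s["cuis"] = [wikidata_to_umls[q] for q in s["entity_qids"]
                     if q in wikidata_to_umls]
    print(medwiki)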
3.3 Model Architecture

We use a three-part approach for NED: candidate extraction, reranking, and post-processing.
Candidate Extraction: Similar to Bhowmik et al. (2021), we use the bi-encoder architecture detailed in Wu et al. (2020) for extracting the top 10 candidate entities potentially associated with a mention. The model includes a context encoder, which is used to learn representations of mentions in text, as well as an entity encoder to encode the entity candidate with its associated metadata. Both encoders are initialized with weights from SapBERT (Liu et al., 2021), a BERT model initialized from PubMedBERT and fine-tuned on UMLS synonyms. Candidate entities are selected based on the maximum inner product between the context and entity representations. We pretrain the candidate extraction model on MedWiki and PubMedDS.

Reranking Model: Given a sentence, a mention, and a set of entity candidates, our reranker model assigns ranks to each candidate and then selects the single most plausible entity. Similar to Angell et al. (2020), we use a cross-encoder to perform this task. The cross-encoder takes the form of a BERT encoder with weights initialized from the context encoder in the candidate extraction model.

Post-Processing (Backoff and Document Synthesis): Motivated by Rajani et al. (2020), we back off from the model prediction when the score assigned by the reranking model is below a threshold value and instead map the mention to the textually closest candidate. Then, we synthesize predictions for repeating mentions in each document by mapping all occurrences of a particular mention to the most frequently predicted entity.

Further details about the model architecture and training process can be found in Appendix A.2.
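The two post-processing rules are simple enough to state in code. The sketch below is our reading of the description above: the threshold value, the scoring interface, and the string-similarity tie-breaker (difflib here) are assumptions for illustration, not the authors' exact implementation.

    # Sketch of the backoff and document-synthesis post-processing rules
    # (Section 3.3). Threshold and similarity measure are illustrative.
    from collections import Counter, defaultdict
    from difflib import SequenceMatcher

    def backoff(mention, candidates, scores, threshold=0.5):
        """If the reranker is unconfident, fall back to the candidate
        whose name is textually closest to the mention string."""
        best_idx = max(range(len(candidates)), key=lambda i: scores[i])
        if scores[best_idx] >= threshold:
            return candidates[best_idx]
        return max(candidates,
                   key=lambda c: SequenceMatcher(None, mention.lower(),
                                                 c["name"].lower()).ratio())

    def synthesize_document(predictions):
        """Map every occurrence of a mention string in a document to that
        mention's most frequently predicted entity."""
        votes = defaultdict(Counter)
        for mention, entity_id in predictions:
            votes[mention][entity_id] += 1
        majority = {m: c.most_common(1)[0][0] for m, c in votes.items()}
        return [(m, majority[m]) for m, _ in predictions]

    # Placeholder entity IDs; all "SHH" occurrences collapse to "E1".
    preds = [("SHH", "E1"), ("SHH", "E2"), ("SHH", "E1")]
    print(synthesize_document(preds))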
4 Evaluation

We evaluate our model on two biomedical NED datasets and show that (1) our data integration approach results in state-of-the-art performance, (2) structural resource augmentation and pretraining are required in conjunction to realize improvements in overall accuracy, and (3) our approach contributes to a large performance lift on rare entities with limited structural resources.

4.1 Datasets

We evaluate our model on two NED datasets, which are detailed below. Additional dataset and preprocessing details can be found in Appendix A.1.

• MedMentions (MM) is one of the largest existing medical NED datasets and contains 4392 PubMed abstracts annotated with 203,282 mentions. We utilize the ST21PV subset of MM, which comprises a subset of concepts deemed by the authors to be most useful for semantic indexing.

• BC5CDR contains 1500 PubMed abstracts annotated with 28,785 mentions of chemicals and diseases (Li et al., 2016).

We use all available UMLS structural resources when preprocessing datasets, and as a result, we map MM entities to 95 UMLS types and BC5CDR entities to 47 UMLS chemical and disease types.

4.2 Performance on Benchmarks

We compare our approach to prior state-of-the-art methods from Bhowmik et al. (2021) and Angell et al. (2020). (Bhowmik et al. (2021) uses the complete MM dataset, while Angell et al. (2020) and our work use the MM-ST21PV subset.) As shown in Table 2, our approach with post-processing sets a new state-of-the-art on MM by 0.7 accuracy points and on BC5CDR by 0.6 points; note that our post-processing method (Section 3.3) differs from the post-processing method used in Angell et al. (2020). In addition, our method without post-processing (Full) outperforms comparable methods by up to 1.8 accuracy points.

                                              MM        BC5CDR
    Bhowmik et al. (2021)                    68.4        84.8
    Angell et al. (2020)                     72.8        90.5
    Ours (Full)                            74.6±0.1    91.5±0.1
    Angell et al. (2020)+Post-Processing     74.1        91.3
    Ours+Post-Processing                   74.8±0.1    91.9±0.2

Table 2: Benchmark Performance. We compare performance of our model to prior work. Metrics indicate accuracy on the test set. We report the mean and standard deviation across five training runs.

4.3 Ablations

In order to measure the effect of our data integration approach on model performance, we perform various ablations as shown in Table 3. We find marginal performance improvement when augmented structural resources are used without pretraining (Augmentation Only model).

                                  MM        BC5CDR
    Ours (Baseline)            74.0±0.2    89.3±0.1
    Ours (Augmentation Only)   74.1±0.1    89.3±0.1
    Ours (Full)                74.6±0.1    91.5±0.1

Table 3: Model Ablations. We measure accuracy of our full model (Full), our model with augmented structural resources and no pretraining (Augmentation Only), and our model without augmented structural resources and without pretraining (Baseline). We report the mean and standard deviation across five training runs.
When pretraining and augmented structural resources are used in conjunction (Full model), we observe a performance lift on both datasets, suggesting that the model can only learn fine-grained reasoning patterns when both components are incorporated into the model.

We observe that our approach leads to a larger improvement on BC5CDR (2.2 points) than on MM (0.6 points). The smaller overall improvement for the MM dataset is expected, since the original MM dataset consists of finer-grained types than the BC5CDR dataset. Specifically, we observe that 95% of the entities in BC5CDR are categorized with just 15 types; in comparison, only 57% of entities in MM can be categorized with 15 types. This suggests that the magnitude of model improvement is likely to depend on the original granularity of structural resources in the training dataset. As a result, our data integration approach naturally yields greater performance improvements on the BC5CDR dataset.

4.4 Performance on Rare Entities

In Figure 2, we measure performance on entities that appear fewer than five times in the training set and are associated with exactly one type and no definition in the UMLS KB. We observe an improvement of 2.5 accuracy points on the MM dataset and 56.8 points on BC5CDR. Results on the BC5CDR dataset also show that utilizing pretraining and resource augmentation in combination leads to a 3x improvement in performance when compared to the Augmentation Only model; this further supports the need for both pretraining and structural resource augmentation when training the model. We observe similar trends across entities with limited metadata that never appear in pretraining datasets. Additional evaluation details are included in Appendix A.3.2.

Figure 2: Performance on Rare Entities with Limited Structural Resources. We measure the test accuracy of four ablation models on a subset of rare entities that have limited structural resources. We report mean values across five training runs. (Bar values: MedMentions — Baseline 51.8, Augmentation Only 52.0, Full 53.3, Full+Post-Processing 54.3; BC5CDR — Baseline 10.9, Augmentation Only 23.4, Full 65.9, Full+Post-Processing 67.7.)

5 Conclusion

In this work, we show that cross-domain data integration helps achieve state-of-the-art performance on the named entity disambiguation task in medical text. The methods presented in this work help address limitations of medical knowledge bases and can be adapted for other knowledge-intensive problems.

Acknowledgements

We are thankful to Sarah Hooper, Michael Zhang, Dan Fu, Karan Goel, and Neel Guha for helpful discussions and feedback. MV is supported by graduate fellowship awards from the Department of Defense (NDSEG) and the Knight-Hennessy Scholars program at Stanford University. LO is supported by an IC Postdoctoral Fellowship. ML is supported by an NSF Graduate Research Fellowship (No. DGE-1656518). We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Salesforce, Total, the HAI Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The Mobilize Center is a Biomedical Technology Resource Center, funded by the NIH National Institute of Biomedical Imaging and Bioengineering through Grant P41EB027060.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.
References

Monica Agrawal, Chloe O'Connell, Yasmin Fatemi, Ariel Levy, and David Sontag. 2020. Robust benchmarking for machine learning of clinical entity extraction. In Proceedings of the 5th Machine Learning for Healthcare Conference, volume 126 of Proceedings of Machine Learning Research, pages 928–949, Virtual. PMLR.

Rico Angell, Nicholas Monath, Sunil Mohan, Nishant Yadav, and Andrew McCallum. 2020. Clustering-based inference for zero-shot biomedical entity linking. arXiv preprint arXiv:2010.11253.

Rajarshi Bhowmik, Karl Stratos, and Gerard de Melo. 2021. Fast and effective biomedical entity linking using a dual encoder.

O. Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32.

Yan Chen, Huanying (Helen) Gu, Yehoshua Perl, James Geller, and Michael Halper. 2009. Structural group auditing of a UMLS semantic type's extent. Journal of Biomedical Informatics, 42(1):41–52.

Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher Ré. 2021. Robustness Gym: Unifying the NLP evaluation landscape. arXiv preprint arXiv:2101.04840.

Michael Halper, C. Paul Morrey, Yan Chen, Gai Elhanan, George Hripcsak, and Yehoshua Perl. 2011. Auditing hierarchical cycles to locate other inconsistencies in the UMLS. AMIA Annual Symposium Proceedings, pages 529–536.

Yun He, Ziwei Zhu, Yin Zhang, Qin Chen, and James Caverlee. 2020. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4604–4614, Online. Association for Computational Linguistics.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016. Baw068.

Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. 2021. Self-alignment pretraining for biomedical entity representations.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.

Sunil Mohan and Donghui Li. 2019. MedMentions: A large biomedical corpus annotated with UMLS concepts.

Denis Newman-Griffis, Guy Divita, Bart Desmet, Ayah Zirikly, Carolyn P. Rosé, and Eric Fosler-Lussier. 2020. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets. Journal of the American Medical Informatics Association, 28(3):516–532.

Laurel Orr, Megan Leszczynski, Simran Arora, Sen Wu, Neel Guha, Xiao Ling, and Christopher Ré. 2021. Bootleg: Chasing the tail with self-supervised named entity disambiguation. In CIDR.

Bethany Percha. 2020. Modern clinical text mining: A guide and review.

Afshin Rahimi, Timothy Baldwin, and Karin Verspoor. 2020. WikiUMLS: Aligning UMLS to Wikipedia via cross-lingual neural ranking.

Nazneen Fatema Rajani, Ben Krause, Wengpeng Yin, Tong Niu, Richard Socher, and Caiming Xiong. 2020. Explaining and improving model behavior with k nearest neighbor representations.

Ariel S. Schwartz and Marti A. Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing 2003, pages 451–462.

Shikhar Vashishth, Denis Newman-Griffis, Rishabh Joshi, Ritam Dutt, and Carolyn Rose. 2021. Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Lucy Wang, Chandra Bhagavatula, Mark Neumann, Kyle Lo, Chris Wilhelm, and Waleed Ammar. 2018. Ontology alignment in the biomedical domain using entity definitions and context. In Proceedings of the BioNLP 2018 workshop, pages 47–55, Melbourne, Australia. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval.

Ikuya Yamada, Koki Washio, Hiroyuki Shindo, and Yuji Matsumoto. 2020. Global entity disambiguation with pretrained contextualized embeddings of words and entities.

Zheng Yuan, Yijia Liu, Chuanqi Tan, Songfang Huang, and Fei Huang. 2021. Improving biomedical pretrained language models with knowledge.
A Appendix

A.1 Data Details

A.1.1 UMLS Knowledge Base

We utilize the 2017 AA release of the UMLS Metathesaurus as the KB, filtered to include entities from 18 preferred source vocabularies (Mohan and Li, 2019; Bodenreider, 2004). The dataset includes 2.5M entities associated with 127 types. Approximately 160K entities have an associated description.

A.1.2 Construction of Pretraining Datasets

We obtain two pretraining datasets: MedWiki and PubMedDS. After collecting each dataset using the methods detailed in Section 3.2, we downsampled to address class imbalance between entities, since some entities were represented at higher rates than others. Sentences were removed if all entities within the sentence were observed in the dataset with high frequency (defined as occurring in at least 40 other sentences); a sketch of this filter follows below.

Prior work by Newman-Griffis et al. (2020) demonstrates the importance of including ambiguity in medical NED training datasets, defining dataset ambiguity as the number of unique entities associated with a particular mention string. By this definition, the MedWiki training set has 25k ambiguous mentions (7% of unique mentions), with a minimum, median, and maximum ambiguity per mention of 2.0, 2.0, and 29.0 respectively. PubMedDS includes 7.6k ambiguous mentions (36% of unique mentions), with a minimum, median, and maximum ambiguity per mention of 2.0, 3.0, and 24.0 respectively.
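The downsampling rule and the ambiguity statistic both reduce to small computations over the annotated corpus. The sketch below shows one way to implement them; the sentence record layout is an assumption made for illustration.

    # Sketch of the pretraining-corpus downsampling rule (A.1.2) and the
    # per-mention ambiguity statistic (Newman-Griffis et al., 2020).
    # The sentence record layout is an assumed placeholder.
    from collections import Counter, defaultdict

    def downsample(sentences, max_other=40):
        # Count, for each entity, how many sentences contain it.
        counts = Counter(e for s in sentences for e in set(s["entities"]))
        # Drop a sentence if each of its entities appears in at least 40
        # *other* sentences (i.e., total count exceeds max_other).
        return [s for s in sentences
                if not all(counts[e] > max_other for e in s["entities"])]

    def ambiguity(sentences):
        # Number of unique entities per mention string; mentions mapping
        # to more than one entity are "ambiguous".
        ents_per_mention = defaultdict(set)
        for s in sentences:
            for mention, entity in zip(s["mentions"], s["entities"]):
                ents_per_mention[mention].add(entity)
        return {m: len(es) for m, es in ents_per_mention.items() if len(es) > 1}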
A.1.3 Evaluation Datasets

We evaluate our model across two medical NED benchmark datasets: MedMentions and BC5CDR. Dataset and preprocessing details are provided below.

MedMentions (MM) (Mohan and Li, 2019): MM consists of text collected from 4392 PubMed abstracts. We use all available UMLS structural resources when preprocessing datasets, and as a result, we map MM entities to 95 UMLS types.

We preprocess the dataset by (1) expanding abbreviations using the Schwartz-Hearst algorithm (Schwartz and Hearst, 2003), (2) splitting documents into individual sentences with the Spacy library, (3) converting character-based mention spans to word-based mention spans (a sketch of this conversion follows below), and (4) grouping sentences into sets of three in order to provide adequate context to models. Mentions occurring at sentence boundaries, overlapping mentions, and mentions with invalid spans (when assigned by the Spacy library) are removed from the dataset during pretraining, resulting in a total of 121K valid mentions in the training set, 8.6K mentions in the validation set, and 8.4K mentions in the test set. Preprocessed dataset statistics are summarized in Table 4.

                        Train      Dev      Test
    Total Documents      2635      878       879
    Total Sentences      9008     2976      2974
    Total Mentions    121,861   40,754    40,031
    Unique Entities    18,495     8637      8449

Table 4: Dataset statistics for MedMentions after preprocessing.

BC5CDR (Li et al., 2016): BC5CDR consists of mentions mapped to chemical and disease entities. Entities are labeled with MESH descriptors; MESH is a medical vocabulary that comprises a subset of the UMLS KB.

We preprocess the dataset by (1) expanding abbreviations using the Schwartz-Hearst algorithm (Schwartz and Hearst, 2003), (2) splitting all composite mentions into multiple parts, (3) splitting documents into individual sentences with the Spacy library, (4) converting character-based mention spans to word-based mention spans, and (5) grouping sentences into sets of three in order to provide adequate context to models. Composite mentions that could not be separated into multiple segments were removed from the dataset; mentions with MESH descriptors that were missing from the 2017 release of the UMLS KB were also removed. This resulted in a total of 9257 valid mentions in the training set, 1243 mentions in the validation set, and 1300 mentions in the test set. Preprocessed dataset statistics are summarized in Table 5.

                       Train    Dev     Test
    Total Documents      500    500      500
    Total Sentences     1431   1431     1486
    Total Mentions      9257   9452     9628
    Unique Entities     1307   1243     1300

Table 5: Dataset statistics for BC5CDR after preprocessing.
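The span conversion referenced in the preprocessing steps above can be sketched as follows. The paper uses the Spacy library for tokenization; whitespace splitting is used here only to keep the sketch self-contained, and the drop-on-misalignment behavior mirrors the removal of invalid spans described above.

    # Illustrative conversion of character-based mention spans to
    # word-based spans (A.1.3). Whitespace tokenization stands in for
    # the Spacy pipeline the paper actually uses.

    def char_span_to_word_span(text: str, start: int, end: int):
        """Return (first_word_idx, last_word_idx_exclusive), or None when
        the span does not align with word boundaries (such mentions are
        dropped during preprocessing)."""
        words, offsets, pos = text.split(), [], 0
        for w in words:
            pos = text.index(w, pos)
            offsets.append((pos, pos + len(w)))
            pos += len(w)
        starts = [i for i, (s, _) in enumerate(offsets) if s == start]
        ends = [i + 1 for i, (_, e) in enumerate(offsets) if e == end]
        if not starts or not ends:
            return None  # invalid span: removed from the dataset
        return starts[0], ends[0]

    text = "Adrenal gland disorders affect hormone production."
    print(char_span_to_word_span(text, 0, 23))  # -> (0, 3)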
A.2 Model Details

We now provide details of our bi-encoder candidate generator, cross-encoder re-ranker, and post-processing method.

    Param                      Bi-encoder    Cross-encoder
    learning rate                 1e-5           2e-5
    weight decay                    0            0.01
    β1                             0.9            0.9
    β2                            0.999          0.999
    eps                           1e-6           1e-6
    effective batch size           100            128
    epochs                        3-10             10
    warmup                         10%            10%
    learning rate scheduler      linear         linear
    optimizer                    AdamW          AdamW

Table 6: Learning parameters for the bi-encoder and cross-encoder.

A.2.1 Candidate Generation with a Bi-encoder

Given a sentence and mention, our candidate generator model selects the top K candidates that are most likely to be the entity referred to by the mention. Similar to Bhowmik et al. (2021), we use a BERT bi-encoder to jointly learn representations of mentions and entities. The bi-encoder has a context encoder to encode the mention and an entity encoder to encode the entity. Candidates are selected as those with the highest maximum inner product with the mention representation.

The context tokenization is

    [CLS] c_l [ENT_START] m [ENT_END] c_r [SEP]

where [ENT_START] and [ENT_END] are new tokens to indicate where the mention m is in the text, and c_l and c_r are the left and right context. We set the left and right window length to be 30 words, with a maximum of 64 tokens used for the sentence tokens.

The entity tokenization is

    [CLS] title [SEP] types [SEP] desc [SEP]

where title is the entity title, types is a semicolon-separated list of types, and desc is the description of an entity. We limit the list of types such that the total length of types is less than 30 words. The maximum length for the entity tokens is 128, which means that the description may be truncated if it exceeds the maximum length.

Training: We train the bi-encoder similar to Wu et al. (2020), running three phases. In the first phase, all negatives are in-batch negatives with a batch size of 100. The next two phases take the top 10 predicted entities for each training example as additional negatives for the batch, with a batch size of 10; before each phase, we re-compute the 10 negatives. For pretraining, we run each phase for 3 epochs. When fine-tuning on specific datasets, we run each phase for 10 epochs. All training parameters are shown in Table 6.

During pretraining, candidates are drawn from the entire UMLS KB, consisting of 2.5M entities. During fine-tuning on the MM dataset, candidates are drawn from the valid subset of entities defined in the ST21PV version of the dataset, which includes approximately 2.36M entities. During fine-tuning on the BC5CDR dataset, candidates are drawn from a set of 268K entities with MESH identifiers.
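The two input formats above can be assembled as plain strings before subword tokenization; a sketch follows. The special-token names come from the text, while the word-level truncation only approximates the token limits (64 and 128 subword tokens), and the example entity values are illustrative rather than taken from the authors' KB.

    # Sketch of the bi-encoder input construction (A.2.1). Word-level
    # truncation approximates the subword-token limits in the text.

    def context_input(left: str, mention: str, right: str, window: int = 30):
        l_words = left.split()[-window:]   # 30-word left window
        r_words = right.split()[:window]   # 30-word right window
        return " ".join(["[CLS]", *l_words, "[ENT_START]", mention,
                         "[ENT_END]", *r_words, "[SEP]"])

    def entity_input(title: str, types: list, desc: str, type_words: int = 30):
        type_str = "; ".join(types)
        type_str = " ".join(type_str.split()[:type_words])  # cap types length
        return f"[CLS] {title} [SEP] {type_str} [SEP] {desc} [SEP]"

    print(context_input("We study snake evolution by focusing on a "
                        "cis-acting enhancer of", "Sonic Hedgehog", "."))
    print(entity_input("SHH gene", ["gene"], "Sonic hedgehog is a ..."))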
A.2.2 Reranker Cross-encoder

Given a sentence, mention, and a set of entity candidates, our reranker model selects the candidate that is most likely to be the entity referred to by the mention. Similar to Angell et al. (2020), we use a BERT cross-encoder architecture to learn a score for each entity candidate-mention pair. The model takes as input the sequence of tokens

    context [ENT_DESC] entity

where context is the context tokenization from the bi-encoder, entity is the entity tokenization from the bi-encoder, and [ENT_DESC] is a special tag to indicate where the entity description starts. One difference from the bi-encoder is that the title of the entity includes the canonical name as well as all alternate names. We keep the length parameters the same as for the bi-encoder, except we let the context have a maximum length of 128. We take the output representation from the [CLS] token and project it to a single-dimension output. We pass the outputs for each candidate through a softmax to get a final probability of which candidate is most likely.

Training: When training the cross-encoder, we warm-start the model with the context model weights from the candidate generator bi-encoder. We train all models using the top 10 candidates, and we train for 10 epochs. We use standard fine-tuning BERT parameters, shown in Table 6.
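For completeness, the shape of the reranking step: one scalar score per mention-candidate pair, with a softmax over the candidate set. In the sketch below the BERT cross-encoder and its projection head are stubbed out by a lexical-overlap placeholder, so only the scoring and normalization structure is shown, not the actual model.

    # Shape-level sketch of cross-encoder reranking (A.2.2). score_pair
    # is a stand-in for a BERT cross-encoder over "context [ENT_DESC]
    # entity" projected from the [CLS] output to a scalar.
    import math

    def score_pair(context: str, entity: str) -> float:
        # Placeholder scorer: word overlap between context and entity.
        return float(len(set(context.split()) & set(entity.split())))

    def rerank(context: str, candidates: list):
        scores = [score_pair(context, c) for c in candidates]
        shift = max(scores)  # numerically stable softmax
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(candidates)), key=lambda i: probs[i])
        return candidates[best], probs

    cands = ["SHH gene enhancer locus", "Sonic Hedgehog protein",
             "Sonic Hedgehog signaling pathway"]
    print(rerank("cis-acting enhancer of Sonic Hedgehog", cands))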
A.3 Additional Evaluation Details

[Table: subpopulation definitions and per-subpopulation comparison to Bhowmik et al. (2021) and Angell et al. (2020) on MM and BC5CDR; truncated in the transcription.]
    Subpopulation                     Baseline  Aug. Only   Full   Eval Size
    Overall                             74.2      74.2      74.2      40K
    Single Word Mentions                76.5      76.4      76.2     23.4K
    Multi-Word Mentions                 70.9      71.3      71.5     16.6K
    Unseen Mentions                     60.0      60.2      60.8     16.6K
    Unseen Entities                     57.3      57.1      57.8     9.02K
    Not Direct Match                    49.0      49.4      49.7     16.6K
    Top 100                             79.8      80.3      80.9     7.54K
    Unpopular                           53.5      53.6      54.0     20.8K
    Limited Metadata                    57.7      58.6      59.4     6.73K
    Rare and Limited Metadata           50.1      51.5      52.1     4.54K
    Never Seen and Limited Metadata     45.4      47.1      47.6     2.95K

Figure 3: Accuracy over subpopulations for our three ablation models on MedMentions.

    Subpopulation                     Baseline  Aug. Only   Full   Eval Size
    Overall                             89.5      89.4      91.3     9.63K
    Single Word Mentions                91.4      91.2      92.9     6.99K
    Multi-Word Mentions                 84.5      84.5      86.8     2.63K
    Unseen Mentions                     75.3      75.0      79.6     3.77K
    Unseen Entities                     69.7      68.6      77.5     2.16K
    Not Direct Match                    89.4      89.3      91.1     9.39K
    Top 100                             97.3      97.5      97.2     3.33K
    Unpopular                           74.3      74.0      78.6     3.86K
    Limited Metadata                    12.8      26.5      68.4      117
    Rare and Limited Metadata            8.9      23.2      67.0      112
    Never Seen and Limited Metadata      2.2      17.6      63.7       91

Figure 4: Accuracy over subpopulations for our three ablation models on BC5CDR.

• … during training, the model is able to memorize relevant disambiguation patterns.

• Unseen entities are easier to resolve than rare entities with limited structural metadata. Unseen entities are those that are not seen by the model during training and are typically considered the most difficult entities to resolve (Orr et al., 2021; Logeswaran et al., 2019). We find that across both datasets and all models, the "Rare and Limited Metadata" subpopulation performs up to 61 accuracy points worse than the unseen entity slice. This further supports the need for structural metadata when resolving rare entities.

• There is a significant performance gap between the two datasets on the "Not Direct Match" slice. We find that performance on the "Not Direct Match" MM subpopulation is up to 41 accuracy points lower than the same subpopulation in BC5CDR.
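The subpopulations in Figures 3 and 4 are defined by simple predicates over training frequency and KB metadata. A sketch of the "Rare and Limited Metadata" slice used in Section 4.4 follows; the record layouts and example values are assumptions made for illustration.

    # Sketch of the rare-entity slice from Section 4.4 / Figures 3-4:
    # entities seen fewer than five times in training, with exactly one
    # UMLS type and no UMLS description. Record layouts are assumed.
    from collections import Counter

    def rare_limited_metadata(test_mentions, train_mentions, kb):
        train_counts = Counter(m["entity"] for m in train_mentions)

        def limited(entity_id):
            ent = kb[entity_id]
            return len(ent["types"]) == 1 and not ent["description"]

        return [m for m in test_mentions
                if train_counts[m["entity"]] < 5 and limited(m["entity"])]

    # Tiny demo with placeholder IDs.
    kb = {"E1": {"types": ["Disease or Syndrome"], "description": ""}}
    train = [{"entity": "E1"}, {"entity": "E1"}]
    test = [{"mention": "rare disorder", "entity": "E1"}]
    print(rare_limited_metadata(test, train, kb))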