End-to-end Biomedical Entity Linking with Span-based Dictionary Matching
Shogo Ujiie♠  Hayate Iso♥∗  Shuntaro Yada♠  Shoko Wakamiya♠  Eiji Aramaki♠
♠ Nara Institute of Science and Technology  ♥ Megagon Labs
{ujiie, yada-s, wakamiya, aramaki}@is.naist.jp  hayate@megagon.ai

∗ Work done while at Nara Institute of Science and Technology.

Proceedings of the BioNLP 2021 workshop, pages 162–167, June 11, 2021. ©2021 Association for Computational Linguistics

Abstract

Disease name recognition and normalization, which is generally called biomedical entity linking, is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize their mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models in an end-to-end fashion. Experiments using two major datasets demonstrate that our model achieved competitive results with strong baselines, especially for unseen concepts during training.

1 Introduction

Identifying disease names, which is generally called biomedical entity linking, is a fundamental process in biomedical natural language processing, and it can be used in applications such as literature search systems (Lee et al., 2016) and biomedical relation extraction (Xu et al., 2016). A typical system for identifying disease names consists of two modules: named entity recognition (NER) and named entity normalization (NEN). NER recognizes the span of a disease name, from its start position to its end position. NEN is the post-processing of NER that normalizes a disease name into a controlled vocabulary, such as MeSH or Online Mendelian Inheritance in Man (OMIM).

Although most previous studies have developed pipeline systems, in which the NER model first recognizes disease mentions (Lee et al., 2020; Weber et al., 2020) and the NEN model then normalizes the recognized mentions (Leaman et al., 2013; Ferré et al., 2020; Xu et al., 2020; Vashishth et al., 2020), a few approaches employ a joint learning architecture for the two tasks (Leaman and Lu, 2016; Lou et al., 2017). These joint approaches simultaneously recognize and normalize disease names, exploiting their mutual benefits. For example, Leaman et al. (2013) demonstrated that dictionary-matching features, which are commonly used for NEN, are also effective for NER. While these joint learning models achieve high performance for both NER and NEN, they rely predominantly on hand-crafted features, which are difficult to construct because of the domain knowledge they require.

Recently, a neural network (NN)-based model that does not require any hand-crafted features was applied to the joint learning of NER and NEN (Zhao et al., 2019). NER and NEN were defined as two token-level classification tasks; that is, their model classified each token into IOB2 tags and concepts, respectively. Although their model achieved state-of-the-art performance for both NER and NEN, a concept that does not appear in the training data (i.e., the zero-shot situation) cannot be predicted properly.

One possible approach to handling this zero-shot situation is to use dictionary-matching features. Suppose that the input sentence "Classic polyarteritis nodosa is a systemic vasculitis" is given, where "polyarteritis nodosa" is the target entity. Even if it does not appear in the training data, it can be recognized and normalized by referring to a controlled vocabulary that contains "Polyarteritis Nodosa (MeSH: D010488)." Combining such a lookup mechanism with NN-based models, however, is not a trivial task: dictionary matching must be performed at the entity level, whereas standard NN-based NER and NEN operate at the token level (for example, Zhao et al., 2019).

To overcome this problem, we propose a novel end-to-end approach for NER and NEN that combines dictionary-matching features with NN-based models.
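The lookup idea described above can be sketched minimally. The dictionary below is a tiny hypothetical excerpt of MEDIC; only the MeSH ID for "Polyarteritis Nodosa" (D010488) comes from the example in the text, and exact string matching stands in for the soft TF-IDF matching the model actually uses:

```python
from typing import Optional

# Minimal sketch of entity-level dictionary lookup in the zero-shot
# case. MEDIC_EXCERPT is a hypothetical one-entry excerpt of the
# MEDIC vocabulary; real matching in the paper is soft (TF-IDF
# cosine similarity), not exact lookup.
MEDIC_EXCERPT = {
    "polyarteritis nodosa": "MeSH:D010488",
}

def normalize(mention: str) -> Optional[str]:
    """Return the concept ID for a surface mention, or None if the
    mention is not in the vocabulary."""
    return MEDIC_EXCERPT.get(mention.lower().strip())
```

Even though "polyarteritis nodosa" never occurs in the training data, `normalize("Polyarteritis nodosa")` still resolves it to `MeSH:D010488`, which is the behavior the zero-shot argument relies on.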
[Figure 1: The overview of our model. It combines the dictionary-matching scores with the context score obtained from PubMedBERT. The red boxes are the target span, and c_i in the figure is the i-th concept in the dictionary.]

Based on the span-based model introduced by Lee et al. (2017), our model first computes span representations for all possible spans of the input sentence and then combines the dictionary-matching features with the span representations. Using the scores obtained from both features, it directly classifies the disease concept. Thus, our model can handle the zero-shot problem by using dictionary-matching features while maintaining the performance of NN-based models.

Our model is also effective in situations other than the zero-shot condition. Consider the following input sentence: "We report the case of a patient who developed acute hepatitis," where "hepatitis" is the target entity that should be normalized to "drug-induced hepatitis." While the longer span "acute hepatitis" also appears plausible for stand-alone NER models, our end-to-end architecture assigns a higher score to the correct shorter span "hepatitis" owing to the existence of the normalized term ("drug-induced hepatitis") in the dictionary.

Through experiments using two major NER and NEN corpora, we demonstrate that our model achieves competitive results on both. Further analysis illustrates that the dictionary-matching features improve the performance of NEN in the zero-shot and other situations.

Our main contributions are twofold: (i) we propose a novel end-to-end model for disease name recognition and normalization that uses both NN-based features and dictionary-matching features; (ii) we demonstrate that combining dictionary-matching features with an NN-based model is highly effective for normalization, especially in zero-shot situations.

2 Methods

2.1 Task Definition

Given an input sentence, which is a sequence of words x = {x_1, x_2, ..., x_|X|} from the biomedical literature, let S be the set of all possible spans and L a set of concepts that contains the special label Null for non-disease spans. Our goal is to predict a set of labeled spans y = {⟨i, j, d⟩_k}_{k=1}^{|Y|}, where (i, j) ∈ S are the word indices of a span in the sentence and d ∈ L is a disease concept.

2.2 Model Architecture

Our model predicts the concept for each span based on a score represented by the weighted sum of two factors: the context score score_cont, obtained from span representations, and the dictionary-matching score score_dict. Figure 1 illustrates the overall architecture of our model. We denote the score of span s as follows:

    score(s, c) = score_cont(s, c) + λ · score_dict(s, c)

where c ∈ L is the candidate concept and λ is a hyperparameter that balances the two scores. For concept prediction, the scores of all possible spans and concepts are calculated, and the concept with the highest score is selected as the prediction for each span:

    y = argmax_{c ∈ L} score(s, c)
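The scoring and prediction step above can be sketched in code. Here `score_cont` and `score_dict` are hypothetical stand-ins for the two scoring functions (defined concretely in the next section), and the span limit and λ value follow the implementation details reported later in the paper (spans of up to 10 words, λ = 0.9):

```python
# Sketch of the scoring/prediction step of Section 2.2.
# score_cont and score_dict are hypothetical callables taking a
# (start, end) word span and a concept, and returning a float.
LAMBDA = 0.9          # balance hyperparameter (tuned on the dev sets)
MAX_SPAN_LEN = 10     # only spans of up to 10 words are considered

def enumerate_spans(words):
    """All candidate (start, end) word spans, end index inclusive."""
    n = len(words)
    return [(i, j) for i in range(n)
                   for j in range(i, min(i + MAX_SPAN_LEN, n))]

def predict(words, concepts, score_cont, score_dict):
    """For each span s, pick the concept c (possibly 'Null') that
    maximizes score(s, c) = score_cont(s, c) + LAMBDA * score_dict(s, c)."""
    predictions = {}
    for span in enumerate_spans(words):
        best = max(concepts,
                   key=lambda c: score_cont(span, c)
                                 + LAMBDA * score_dict(span, c))
        predictions[span] = best
    return predictions
```

Because the argmax ranges over all spans jointly, a short span with a strong dictionary match (e.g. "hepatitis") can outscore a longer overlapping span, which is the behavior described in the introduction.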
Context score. The context score is computed in a similar way to that of Lee et al. (2017), based on span representations. To compute the representation of each span, the input tokens are first encoded into token embeddings. We used BioBERT (Lee et al., 2020) as the encoder, a variant of bidirectional encoder representations from transformers (BERT) trained on a large amount of biomedical text. Given an input sentence containing T words, we obtain the contextualized embedding of each token using BioBERT as follows:

    h_{1:T} = BERT(x_1, x_2, ..., x_T)

where h_{1:T} are the input token embeddings. Span representations are obtained by concatenating several features derived from the token embeddings:

    g_s = [h_{start(s)}, h_{end(s)}, ĥ_s, φ(s)]
    g′_s = GELU(FFNN(g_s))

where h_{start(s)} and h_{end(s)} are the embeddings of the start and end tokens of the span, respectively; ĥ_s is the weighted sum of the token embeddings in the span, obtained using an attention mechanism (Bahdanau et al., 2015); and φ(s) is the size of span s. The representation g_s is fed into a simple feed-forward neural network (FFNN) followed by the nonlinear function GELU (Hendrycks and Gimpel, 2016). Given a span representation and a candidate concept as inputs, we formulate the context score as:

    score_cont(s, c) = g′_s · W_c

where W ∈ ℝ^{|L| × d_g} is the weight matrix over concepts and W_c is the weight vector for concept c.

Dictionary-matching score. We used the cosine similarity of TF-IDF vectors as the dictionary-matching feature. Because a concept has several synonyms, we calculated the cosine similarity for all synonyms of the concept and used the maximum as the score for that concept. The TF-IDF vectors are computed over character-level n-gram statistics of all disease names appearing in the training dataset and the controlled vocabulary. For example, given the span "breast cancer," synonyms with high cosine similarity are "breast cancer" (1.0) and "male breast cancer" (0.829).

3 Experiment

3.1 Datasets

To evaluate our model, we chose two major datasets used for disease name recognition and normalization against a popular controlled vocabulary, MEDIC (Davis et al., 2012). Both datasets, the National Center for Biotechnology Information Disease (NCBID) corpus (Doğan et al., 2014) and the BioCreative V Chemical Disease Relation (BC5CDR) task corpus (Li et al., 2016), comprise PubMed titles and abstracts annotated with disease names and their corresponding normalized concept IDs (CUIs). NCBID provides 593 training, 100 development, and 100 test documents, while BC5CDR divides 1,500 documents evenly into the three sets. We adopted the same version of MEDIC as TaggerOne (Leaman and Lu, 2016), and we discarded the non-disease entity annotations contained in BC5CDR.

3.2 Baseline Models

We compared our model with several baselines. DNorm (Leaman et al., 2013) and NormCo (Wright et al., 2019) were used as pipeline models because of their high performance. In addition, we used a pipeline of state-of-the-art models: BioBERT (Lee et al., 2020) for NER and BioSyn (Sung et al., 2020) for NEN. TaggerOne (Leaman and Lu, 2016) and the transition-based model (Lou et al., 2017) were used as joint-learning models; these models outperformed the pipeline models on NCBID and BC5CDR. For the model introduced by Zhao et al. (2019), we could not reproduce the reported performance. Instead, we report the performance of a simple token-level joint-learning model based on BioBERT, referred to as "joint (token)."

3.3 Implementation

We performed several preprocessing steps: splitting the text into sentences using the NLTK toolkit (Bird et al., 2009), removing punctuation, and resolving abbreviations using Ab3P (Sohn et al., 2008), a common abbreviation resolution module. We also merged the disease names in each training set into the controlled vocabulary, following Lou et al. (2017).

For training, we set the learning rate to 5e-5 and the mini-batch size to 32. λ was set to 0.9 using the development sets. For BC5CDR, we trained the model using both the training and development sets, following Leaman and Lu (2016).
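The dictionary-matching score described above can be sketched as follows. The paper does not specify the n-gram size or the exact TF-IDF weighting scheme, so the choices below (character trigrams, log IDF) are illustrative assumptions; the resulting similarities will not exactly reproduce the 0.829 quoted in the example:

```python
import math
from collections import Counter

NGRAM = 3  # character n-gram size (an assumption; not stated in the paper)

def ngrams(text):
    """Character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + NGRAM] for i in range(len(text) - NGRAM + 1)]

class TfidfMatcher:
    """score_dict as max cosine similarity between a span's TF-IDF
    vector and the vectors of a concept's synonyms (simplified sketch)."""

    def __init__(self, name_corpus):
        # IDF statistics over all disease names from the training set
        # and the controlled vocabulary.
        self.n_docs = len(name_corpus)
        df = Counter()
        for name in name_corpus:
            df.update(set(ngrams(name)))
        self.idf = {g: math.log(self.n_docs / c) for g, c in df.items()}

    def vector(self, text):
        tf = Counter(ngrams(text))
        return {g: f * self.idf.get(g, 0.0) for g, f in tf.items()}

    @staticmethod
    def cosine(a, b):
        dot = sum(w * b.get(g, 0.0) for g, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def score_dict(self, span_text, synonyms):
        """Maximum cosine similarity over all synonyms of a concept."""
        v = self.vector(span_text)
        return max(self.cosine(v, self.vector(s)) for s in synonyms)
```

With this sketch, an exact synonym match such as "breast cancer" scores 1.0, while a partial match such as "male breast cancer" scores strictly between 0 and 1, mirroring the example in the text.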
For computational efficiency, we only consider spans of up to 10 words.

3.4 Evaluation Metrics

We evaluated the recognition performance of our model using micro-F1 at the entity level; a predicted span is counted as a true positive only when it exactly matches a gold span. Following previous work (Wright et al., 2019; Leaman and Lu, 2016), normalization performance was evaluated using micro-F1 at the abstract level: if a predicted concept is found among the gold-standard concepts of the abstract, regardless of its location, it is counted as a true positive.

4 Results & Discussions

Table 1: F1 scores of NER and NEN on NCBID and BC5CDR. Bold font represents the highest score.

Models                    | NCBID NER | NCBID NEN | BC5CDR NER | BC5CDR NEN
TaggerOne                 |   0.829   |   0.807   |   0.826    |   0.837
Transition-based model    |   0.821   |   0.826   |   0.862    |   0.876
NormCo                    |   0.829   |   0.840   |   0.826    |   0.830
pipeline                  |   0.874   |   0.841   |   0.865    |   0.818
joint (token)             |   0.864   |   0.765   |   0.855    |   0.817
Ours without dictionary   |   0.884   |   0.781   |   0.864    |   0.808
Ours                      | **0.891** | **0.854** | **0.867**  |   0.851

Table 2: Number of mentions and concepts in the standard and zero-shot situations.

Dataset | standard mentions | standard concepts | zero-shot mentions | zero-shot concepts
NCBID   |       781         |       135         |        179         |        56
BC5CDR  |      4031         |       461         |        391         |       179

Table 3: F1 scores for NEN on the NCBID and BC5CDR subsets for the zero-shot situation, where disease concepts do not appear in the training data, and the standard situation, where they do.

Situation | Method                  | NCBID | BC5CDR
zero-shot | Ours without dictionary |   0   |   0
zero-shot | Ours                    | 0.704 | 0.597
standard  | Ours without dictionary | 0.854 | 0.846
standard  | Ours                    | 0.905 | 0.877

Table 1 illustrates that our model mostly achieved the highest F1 scores for both NER and NEN, except for NEN on BC5CDR, where the transition-based model shows its strength as a baseline. The proposed model outperformed the pipeline of state-of-the-art models on both tasks, which demonstrates that the improvement is attributable not to the strength of BioBERT but to the model architecture, including the end-to-end approach and the combination with dictionary-matching features.

Comparing the model variations, adding dictionary-matching features improved NEN performance. These results clearly suggest that dictionary-matching features are effective for NN-based NEN models.

4.1 Contribution of Dictionary Matching

To analyze the behavior of our model in the zero-shot situation, we investigated NEN performance on two subsets of both corpora: disease names whose concepts appear in the training data (the standard situation) and disease names whose concepts do not (the zero-shot situation). Table 2 shows the number of mentions and concepts in each situation, and Table 3 displays the results for the two situations. The proposed model with dictionary-matching features can classify disease concepts in the zero-shot situation, whereas the purely NN-based classification model cannot normalize these disease names at all.

The results for the standard situation demonstrate that combining dictionary-matching features also improves performance even when the target concepts appear in the training data. This finding implies that an NN-based model can benefit from dictionary-matching features even when it can learn from ample training data.

4.2 Case Study

We examined 100 randomly sampled sentences to determine the contributions of the dictionary-matching features. There were 32 samples in which the model predicted concepts correctly only after adding dictionary-matching features. Most of these involve disease concepts that do not appear in the training set but do appear in the dictionary. For example, "pure red cell aplasis (MeSH: D012010)" is not in the BC5CDR training set, while MEDIC contains "Pure Red-Cell Aplasias" for "D012010". In this case, a high dictionary-matching score clearly leads to a correct prediction in the zero-shot situation.

In contrast, there were 32 samples in which the dictionary-matching features caused errors.
The sources of this error type are typically general disease names in MEDIC. For example, "Death (MeSH: D003643)" was incorrectly predicted as a disease concept in NER. Because such words also occur in general contexts, our model overestimated their dictionary-matching scores.

Furthermore, in the remaining samples, our model predicted the concept properly but the span incorrectly. For example, although "thoracic hematomyelia" is labeled as "MeSH: D020758" in the BC5CDR test set, our model recognized the span as "hematomyelia." In this case, the model relied mostly on the dictionary-matching features and misclassified the span because "hematomyelia" is in MEDIC but not in the training data.

4.3 Limitations

Our model is inferior to the transition-based model on BC5CDR. One possible reason is that the transition-based model utilizes normalized terms that co-occur within a sentence, whereas our model does not. Certain disease names that co-occur within a sentence are strongly useful for normalizing other disease names. Although BERT implicitly considers the interaction between disease names via its attention mechanism, a more explicit method would be preferable for normalizing diseases.

Another limitation is that our model treats all dictionary entries equally. Because certain terms in the dictionary may also be used for non-disease concepts, such as gene names, we should consider the relative importance of each concept.

5 Conclusion

We proposed an end-to-end model for disease name recognition and normalization that combines an NN-based model with dictionary-matching features. Our model achieved highly competitive results on the NCBI disease corpus and the BC5CDR corpus, demonstrating that incorporating dictionary-matching features into an NN-based model can improve its performance. Further experiments showed that dictionary-matching features enable our model to accurately predict concepts in the zero-shot situation and that they are also beneficial in the standard situation. While the results illustrate the effectiveness of our model, we found several areas for improvement, such as general terms in the dictionary and the interaction between disease names within a sentence. A possible future direction for dealing with general terms is to jointly train parameters representing the importance of each synonym.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Allan Peter Davis, Thomas C Wiegers, Michael C Rosenstein, and Carolyn J Mattingly. 2012. MEDIC: A practical disease vocabulary used at the Comparative Toxicogenomics Database. Database, 2012:bar065.

Rezarta Islamaj Doğan, Robert Leaman, and Zhiyong Lu. 2014. NCBI disease corpus: A resource for disease name recognition and concept normalization. J. Biomed. Inform., 47:1–10.

Arnaud Ferré, Louise Deléger, Robert Bossy, Pierre Zweigenbaum, and Claire Nédellec. 2020. C-Norm: A neural approach to few-shot entity normalization. BMC Bioinformatics, 21(Suppl 23):579.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Robert Leaman, Rezarta Islamaj Doğan, and Zhiyong Lu. 2013. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics, 29(22):2909–2917.

Robert Leaman and Zhiyong Lu. 2016. TaggerOne: Joint named entity recognition and normalization with semi-Markov models. Bioinformatics, 32(18):2839–2846.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP, pages 188–197.

Sunwon Lee, Donghyeon Kim, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, and Jaewoo Kang. 2016. BEST: Next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS One, 11(10):e0164680.
Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: A resource for chemical disease relation extraction. Database, 2016:baw068.

Yinxia Lou, Yue Zhang, Tao Qian, Fei Li, Shufeng Xiong, and Donghong Ji. 2017. A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 33(15):2363–2371.

Sunghwan Sohn, Donald C Comeau, Won Kim, and W John Wilbur. 2008. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 9:402.

Mujeen Sung, Hwisang Jeon, Jinhyuk Lee, and Jaewoo Kang. 2020. Biomedical entity representations with synonym marginalization. In ACL, pages 3641–3650.

Shikhar Vashishth, Rishabh Joshi, Denis Newman-Griffis, Ritam Dutt, and Carolyn Rose. 2020. MedType: Improving medical entity linking with semantic type prediction. arXiv preprint arXiv:2005.00460.

Leon Weber, Jannes Münchmeyer, Tim Rocktäschel, Maryam Habibi, and Ulf Leser. 2020. HUNER: Improving biomedical NER with pretraining. Bioinformatics, 36(1):295–302.

Dustin Wright, Yannis Katsis, Raghav Mehta, and Chun-Nan Hsu. 2019. NormCo: Deep disease normalization for biomedical knowledge base construction. In AKBC.

Dongfang Xu, Zeyu Zhang, and Steven Bethard. 2020. A generate-and-rank framework with semantic type regularization for biomedical concept normalization. In ACL, pages 8452–8464.

Jun Xu, Yonghui Wu, Yaoyun Zhang, Jingqi Wang, Hee-Jin Lee, and Hua Xu. 2016. CD-REST: A system for extracting chemical-induced disease relations in literature. Database, 2016:baw036.

Sendong Zhao, Ting Liu, Sicheng Zhao, and Fei Wang. 2019. A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In AAAI, pages 817–824.