MeSHup: A Corpus for Full Text Biomedical Document Indexing

Xindi Wang1,3, Robert E. Mercer1,3, Frank Rudzicz2,3,4
1 Department of Computer Science, University of Western Ontario, London, Ontario, Canada
2 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
3 Vector Institute, Toronto, Ontario, Canada
4 Unity Health Toronto, Toronto, Ontario, Canada
xwang842@uwo.ca, mercer@csd.uwo.ca, frank@cs.toronto.edu

arXiv:2204.13604v1 [cs.CL] 28 Apr 2022

Abstract
Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report the new baseline.

Keywords: MeSH Indexing, Multi-label text classification

1. Introduction

MEDLINE1, the National Library of Medicine's2 (NLM) premier bibliographic database, comprises 33 million (as of Nov. 2021) references to journal articles in the life sciences with a concentration on biomedicine. It includes textual information (title and abstract) and bibliographic information for articles from academic journals covering various disciplines of the life sciences and biomedicine. PubMed3 is a search engine that provides free access to the MEDLINE database. In addition to MEDLINE, PubMed also provides access to the PubMed Central4 (PMC) repository, which archives open-access, full-text scholarly articles in biomedical and life sciences journals. All records in the MEDLINE database are indexed with Medical Subject Headings (MeSH)5 – a controlled and hierarchically-organized vocabulary produced and maintained by the NLM. As of 2021, there are 29,369 main MeSH headings, and each citation is indexed with 13 MeSH terms on average. MeSH headings can be further qualified by 83 subheadings (also known as qualifiers). In addition, Supplementary Concept Records (SCRs) refer to specific chemical substances in the MEDLINE records.

MeSH indexing, a process that annotates documents with concepts from established semantic taxonomies and ontologies, is important for biomedical text classification and information retrieval. MEDLINE citations are indexed by human annotators who read the full text of the article and assign the most relevant MeSH labels to the articles. The manual annotation process ensures the high quality of indexing but, inevitably, the cost can be prohibitive. There has been a steady and sizeable increase in the number of citations that are added to the MEDLINE database every year; for instance, in 2020, 952,919 articles were added (approximately 2,600 on a daily basis)6 and the average cost of annotation per document is approximately $9.40 (Mork et al., 2013). Faced with the growing workload, NLM annotators remain tasked with indexing newly published articles efficiently and promptly. Therefore, there is a pressing need for automatic support for indexing the biomedical literature.

Many state-of-the-art models have been proposed to deal with MeSH indexing; however, there is a clear drawback to these automatic indexing systems because of the data used to train them. Existing corpora for MeSH indexing only provide the title and abstract, while human annotators review full text articles. This suggests that important information might be contained in the full text which is available to the human indexer, but does not appear in the title and abstract that automatic indexing systems use to make recommendations. Mork et al. (2017) further indicated that some sections in the full text that include very specific information, such as the "Methods" section, help improve the performance of automatic models. Thus, a corpus that contains full text articles and their associated MeSH labels is highly desirable.

1 https://www.nlm.nih.gov/medline/medline_overview.html
2 https://www.nlm.nih.gov
3 https://pubmed.ncbi.nlm.nih.gov/about/
4 https://en.wikipedia.org/wiki/PubMed_Central
5 https://www.nlm.nih.gov/mesh/meshhome.html
6 https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html
In this work, we construct a new labeled full text MeSH indexing dataset7, MeSHup, in which each entry is a mashup of the PMID, title, abstract, journal, year, author list, MeSH terms, chemical list, and Supplementary Concept Records from MEDLINE and the full text introduction, methods, results, discussion, figure captions, and table captions that are available from BioC-PMC (Comeau et al., 2019). To the best of our knowledge, MeSHup is the first publicly available (and the largest) full text dataset annotated for MeSH indexing. We also propose a multi-channel model that incorporates extracted features from different sections of the full text and report its baseline results.

7 https://github.com/xdwang0726/MeSHup

2. Related Work

2.1. Biomedical Corpora Related to MeSH Terms

There are several corpora in the biomedical domain that contain or make use of MeSH terms. The OHSUMED test collection (Hersh et al., 1994) is a set of 348,566 clinically-oriented references in the MEDLINE database obtained from 270 medical journals in the years 1987 to 1991. For each citation, the collection contains the title, abstract, MeSH indexing terms, author, source, and publication type. The OHSUMED corpus is one of the earliest corpora related to the MeSH indexing task. The GENIA corpus (Kim et al., 2003) is a valuable resource in the biomedical literature that was created to support the development of tools for text mining and information retrieval and their evaluation. It contains 2,000 abstracts taken from the MEDLINE database with a variety of entity types in the GENIA Chemical ontology that are derived from MeSH terms. The CHEMDNER corpus (Krallinger et al., 2014) contains 10,000 PubMed abstracts and 84,355 manually annotated chemical entities. CHEMDNER labels entities based on the Chemicals and Drugs branch of the MeSH hierarchy and the MeSH substances. The BioCreative V Chemical-Disease Relation Task Corpus (BC5CDR) (Li et al., 2015) was developed for the BioCreative V challenge. A team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation were invited to ensure high annotation quality and productivity. Detailed annotation guidelines and automatic annotation tools were provided. The resulting corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. The NLM-CHEM corpus (Islamaj et al., 2021), created for Track 2 of BioCreative VII, consists of 150 full text articles with chemical entity annotations provided by human experts for 5000 unique chemical names, mapped to 2000 MeSH identifiers. This dataset is compatible with the CHEMDNER and BC5CDR corpora described above.

2.2. Related Full-Text Biomedical Corpora

A number of full-text corpora have been created in the biomedical domain. Each has been generated for specific purposes and consequently has been annotated with that in mind. The earliest is a small five-paper biomedical corpus (Gasperin et al., 2007) which was designed to capture anaphora. The articles are annotated with anaphoric links between coreferent and associative (biotype, homolog, and set-member) noun phrases referring to the biomedical entities of interest to the authors. For the BioCreative I task 2 (Hirschman et al., 2005), the training set corpus comprised 803 full text articles from four different journals that had been previously annotated for human protein function, and the test set corpus comprised 212 full text articles. The CRAFT (Colorado Richly Annotated Full-Text) corpus (Verspoor et al., 2012) is a collection of 97 articles from the PubMed Central Open Access subset. Each article has been manually annotated along structural, coreference, and concept dimensions. More recently, the BioC-BioGRID corpus (Dogan et al., 2017) comprises full text articles annotated for protein-protein and genetic interactions.
2.3. Automatic MeSH Indexing Based on Title and Abstract

As discussed, there is rapid growth in the number of articles in MEDLINE, and the NLM has developed an indexing tool, the Medical Text Indexer (MTI), to recommend MeSH terms in automated and semi-automated modes (Aronson et al., 2004). MTI first takes the title and abstract of the input article and generates recommended MeSH terms, using a ranking algorithm to determine final suggestions. There are two important components in MTI: MetaMap Indexing (MMI) and PubMed-Related Citations (PRC) (Lin and Wilbur, 2007; Aronson and Lang, 2010). MetaMap recommends MeSH terms based on the mapping of biomedical concepts in the title and abstract of the input article to the Unified Medical Language System8 (UMLS). PRC suggests MeSH terms by looking at similar annotations in MEDLINE using k-nearest neighbours. The two sets of recommended MeSH terms are combined to generate the final MeSH list. Since 2013, BioASQ9 (Tsatsaronis et al., 2015) has organized the biomedical semantic indexing challenge, which offers the opportunity for more participants to get involved in the MeSH indexing task. BioASQ provides annotated PubMed articles with the title and abstract only, and participants can tune their annotation models accordingly. Many effective indexing systems have been proposed since then, such as MeSHLabeler (Liu et al., 2015), DeepMeSH (Peng et al., 2016), AttentionMeSH (Jin et al., 2018), MeSHProbeNet (Xun et al., 2019), and KenMeSH (Wang et al., 2022). MeSHLabeler and DeepMeSH are models based on a Learning-to-Rank (LTR) framework. AttentionMeSH and MeSHProbeNet both utilize deep recurrent neural networks (RNNs) and attention mechanisms, where the main difference is that the former uses label-wise attention while the latter employs multi-view self-attentive MeSH probes. KenMeSH combines text features and the MeSH label hierarchy by using a dynamic knowledge-enhanced mask to index MeSH terms.

8 https://www.nlm.nih.gov/research/umls/
9 http://bioasq.org

2.4. MeSH Indexing Based on Full Text

MeSH indexing with full texts has been studied using relatively small sets of data or restricted to small numbers of specific MeSH terms because of the limitation of full text access. Gay et al. (2005) collected 500 articles in 17 journal issues in the PubMed database and used the full text as input to MTI. They found that using the full text of an article provides significantly better (7.4%) quality of automatic indexing than using only abstracts and titles. Jimeno-Yepes et al. (2012) used a collection of 1,413 biomedical articles randomly selected from the PMC Open Access Subset. They first ran automatic summarization over the full text and then used the generated summary as input to MTI. The experimental results showed that incorporating full texts achieved higher (6%) recall with a trade-off in precision compared to using the abstracts and titles only. Demner-Fushman and Mork (2015) collected 14,829 citations and used a rule-based method to classify Check Tags, a small set of MeSH terms (29 MeSH Check Tags) that represent characteristics of the subjects. Wang and Mercer (2019) released a full text dataset curated from PMC with 257,590 articles and employed a multi-channel CNN-based feature extraction model. FullMeSH (Dai et al., 2019) and BERTMeSH (You et al., 2020) used 1.4M full text articles, the former with an attention-based CNN and the latter with pretrained contextual embeddings together with an attention mechanism. Unfortunately, they did not make their dataset available, which makes it difficult for others to compare and evaluate their work.

3. Dataset Construction

In this section, we introduce how we construct the dataset based on the PubMed Central Open Access subset in BioC format10 (BioC-PMC) (Comeau et al., 2019) and the MEDLINE/PubMed Annual Baseline Repository11 (MBR). We download the entire BioC-PMC subset (as of Nov. 2021) and obtain 3,601,092 full text articles. We also download the entire MBR collection (as of Nov. 2021) and obtain 31,850,051 citations with metadata in the MEDLINE database. In order to reduce bias, we only consider articles indexed by human annotators (i.e., articles in the MEDLINE database with modes marked as ‘curated’ (MeSH terms were provided algorithmically and were human reviewed) or ‘auto’ (MeSH terms were provided algorithmically) are not considered)12, and we only focus on articles written in English (i.e., only articles annotated as ‘eng’ in the MEDLINE database are considered). We then match BioC-PMC articles with MBR citations using the PubMed ID (PMID) and obtain a set of 1,342,667 biomedical documents.

10 https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/
11 https://lhncbc.nlm.nih.gov/ii/information/MBR.html
12 https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html
Information extracted from BioC-PMC. Each article in the BioC-PMC subset is structured in a single XML file. The original published articles are formatted in various ways depending on the publisher. With the BioC format (Comeau et al., 2019), an article's textual information is preserved and each article is organized in a unified structure. We parse the tags in the BioC formatted XML files to get the section names and their corresponding texts. We then divide and normalize all full text articles into eight BioC sections: title, abstract, introduction, methods, results, discussion, figure captions, and table captions. Table 1 summarizes the statistical information for the described sections.

Sections           Number of articles   Average length
Title              1,342,667            16
Abstract           1,342,667            258
Introduction       1,279,276            991
Methods            1,135,757            1446
Results            1,090,981            1640
Discuss            1,042,379            1249
Figure Captions    1,155,208            560
Table Captions     520,780              123

Table 1: Statistics of the generated dataset for each of the eight sections.

Information extracted from the MEDLINE/PubMed Annual Baseline (MBR). Starting in 2002, MBR has provided access to all MEDLINE citations, and it updates and adds new citations every year. Each year's baseline contains textual information (title and abstract) of the citations as well as various types of metadata, such as their authors, publishing venues, and references. Metadata can be regarded as strong indicators for the semantic indexing task as they include latent information about research topics. We therefore extract the metadata of each article from MBR; the detailed metadata and their descriptions are stated in Table 2.

Metadata        Description
PMID            The PubMed (NLM database that incorporates MEDLINE) unique identifier.
Authors         Personal and collective (corporate) author names published with the article.
Journal         The journal that the article is published in.
Year            The year in which the article is published.
DOI             Digital Object Identifiers.
MeSH Terms      NLM controlled vocabulary, Medical Subject Headings (MeSH).
Supply MeSH     Supplementary Concept Record (SCR) terms.
Chemical List   A list of chemical substances and enzymes.

Table 2: Metadata extracted from MBR and their descriptions.

We combine the information extracted from BioC-PMC and MBR to form MeSHup, a new large-scale, full-text biomedical semantic indexing dataset. Specifically, each article in the dataset contains the full textual information and the metadata associated with it. The keys (section descriptors) from MeSHup are shown in JSON format in Fig. 1, and an abbreviated version of the dataset is provided in the Appendix.

{"articles":[
  {"PMID": ,
   "TITLE": ,
   "ABSTRACT": ,
   "INTRO": ,
   "METHODS": ,
   "RESULTS": ,
   "DISCUSS": ,
   "FIG_CAPTIONS": ,
   "TABLE_CAPTIONS": ,
   "JOURNAL": ,
   "YEAR": ,
   "DOI": ,
   "AUTHORS": ,
   "MeSH": ,
   "CHEMICALS": ,
   "SUPPLMeSH":
  },
  { ... },
  ...
]}

Figure 1: The section descriptors given as JSON keys.

Our goal in releasing MeSHup is to promote large-scale ontological classification of biomedical documents, using full texts, across the community. With MeSHup, researchers can explore and test state-of-the-art indexing systems with a common standard.
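As a usage illustration, the following minimal sketch loads a file with the key layout of Figure 1 and turns each record into a (PMID, text, labels) triple; the file name and the particular sections concatenated here are assumptions made for the example, not part of the released format.

import json

# Sections concatenated for the illustration; any subset of the Figure 1 keys could be used.
TEXT_KEYS = ["TITLE", "ABSTRACT", "INTRO", "METHODS", "RESULTS", "DISCUSS"]

with open("meshup.json", encoding="utf-8") as f:   # hypothetical file name
    corpus = json.load(f)

examples = []
for article in corpus["articles"]:
    # Some sections may be missing or empty for a given article.
    text = " ".join(article.get(key) or "" for key in TEXT_KEYS)
    # "MeSH" maps descriptor identifiers (e.g., "D009474") to descriptor names.
    mesh = article.get("MeSH")
    labels = sorted(mesh.keys()) if isinstance(mesh, dict) else []
    examples.append((article["PMID"], text, labels))

print(f"loaded {len(examples)} articles")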
4. Experiments

Given the full text of a biomedical article, MeSH indexing can be regarded as a multi-label text classification problem. The learning framework is defined as follows. X = {x_1, x_2, ..., x_n} is a set of biomedical documents and Y = {y_1, y_2, ..., y_L} is the set of MeSH terms. Multi-label classification studies the learning function f : X → [0, 1]^Y using the training set D = {(x_i, Y_i), i = 1, ..., n}, where n is the number of documents in the set, and Y_i ⊂ Y is the set of MeSH labels for document x_i. The objective of MeSH indexing is to predict the correct MeSH labels for any unseen article x_k, where x_k is not in X.
4.1. Baseline Model

In general, traditional MeSH indexing systems focus on the document features only, thereby suffering from a lack of information about the MeSH hierarchy. To handle this, we present a hybrid document-label feature method, which is composed of a multi-channel document representation module, a label feature representation module, and a classifier. The overall model architecture is shown in Figure 2.

[Figure 2: Model Architecture. There are three main components in our method. First, a multi-channel document representation module operates on each section of an input article. Second, a 2-layer GCN creates label vectors. Lastly, a label-wise attention component calculates the label-specific attention vectors that are used for predictions.]

4.1.1. Multi-channel Document Representation Module

The multi-channel document representation module has five input channels: the title and abstract channel, the introduction channel, the methods channel, the results channel, and the discussion channel. Texts in each channel are represented by embedding matrices, namely E_C ∈ R^d, where d represents the dimension of the word embeddings.

Instead of a recurrent model, which poses computational and technical issues, we apply a multi-level dilated convolutional neural network (DCNN) to each channel in order to capture distant dependencies with a CNN model. To be specific, our DCNN is a three-layer, one-dimensional convolutional neural net (CNN) with dilated convolution kernels, which obtains high-level semantic representations of texts without increasing the computation. The concept of dilated convolution has been popular in semantic segmentation in computer vision in recent years (Yu and Koltun, 2016; Li et al., 2018), and it has been introduced to sequential data (Bai et al., 2018), especially in NLP for neural machine translation (Kalchbrenner et al., 2017) and text classification (Lin et al., 2018). Dilated convolution enables exponentially large receptive fields over the embedding matrix, which captures long-term dependencies over the input texts.

We apply a multi-level DCNN with different dilation rates on top of the embedding matrix of each channel. Larger dilations cover a wider range of inputs and can capture sentence-level information, whereas small dilations capture word-level information. The semantic features returned by the DCNN for each channel are denoted as D_C ∈ R^{(l−s+1)×2d}, where l is the sequence length in channel C and s is the width of the convolution kernels.
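For concreteness, here is a minimal PyTorch sketch of a per-channel, three-level dilated one-dimensional CNN of the kind described above; the class name, default sizes, and padding choice are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class ChannelDCNN(nn.Module):
    """Three-level 1-D dilated CNN applied to one text channel (illustrative sketch)."""
    def __init__(self, embed_dim=200, hidden=200, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        layers = []
        in_ch = embed_dim
        for d in dilations:
            # 'same'-style padding keeps the sequence length while enlarging the receptive field.
            layers += [nn.Conv1d(in_ch, hidden, kernel_size, dilation=d,
                                 padding=d * (kernel_size - 1) // 2),
                       nn.ReLU()]
            in_ch = hidden
        self.convs = nn.Sequential(*layers)

    def forward(self, embeddings):          # embeddings: (batch, seq_len, embed_dim)
        x = embeddings.transpose(1, 2)      # Conv1d expects (batch, channels, seq_len)
        x = self.convs(x)
        return x.transpose(1, 2)            # (batch, seq_len, hidden)

# One encoder per input channel (title+abstract, introduction, methods, results, discussion).
channels = ["title_abstract", "intro", "methods", "results", "discuss"]
encoders = nn.ModuleDict({c: ChannelDCNN() for c in channels})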
4.1.2. Label Feature Representation Module

Graph convolutional neural networks (GCNs) (Kipf and Welling, 2017) have attracted wide attention recently. They have been effective in tasks that have rich relational structures, as GCNs preserve global information within the graph. Traditional multi-label text classification mainly focuses on local consecutive word sequences; for instance, two deep networks commonly used in building text representations after learning word embeddings are convolutional (CNNs) and recurrent neural networks (RNNs) (Kim, 2014; Tai et al., 2015). Some recent studies have explored GCNs for text classification, where they either viewed a document or a sentence as a graph of word nodes (Peng et al., 2018; Yao et al., 2019). MeSH labels are arrayed hierarchically, an example of which is shown in Figure 3.

We take advantage of the structured knowledge we have over the parent and child relationships in the MeSH label space by using a GCN. In the label graph setting, we formulate each MeSH label in Y as a node in the graph. The edges represent parent and child relationships among the MeSH terms. The edge types of a node contain edges from itself, from its parent, and from its children.
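As an illustration of the label-graph construction just described, the sketch below builds an adjacency matrix with self-, parent-, and child-edges. The symmetric normalisation is a common GCN convention and an assumption here, as is the way the parent-child index pairs are supplied.

import torch

def build_label_adjacency(num_labels, parent_child_pairs):
    """Adjacency over MeSH labels with self-, parent- and child-edges (illustrative sketch).

    parent_child_pairs: iterable of (parent_index, child_index) tuples derived from the
    MeSH tree; how those indices are obtained is left to the caller.
    """
    A = torch.eye(num_labels)                 # self-edges
    for parent, child in parent_child_pairs:
        A[parent, child] = 1.0                # edge from the parent
        A[child, parent] = 1.0                # edge from the child
    # Symmetric normalisation D^{-1/2} A D^{-1/2}, a common choice for GCNs.
    deg = A.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.clamp(min=1.0).pow(-0.5))
    return d_inv_sqrt @ A @ d_inv_sqrt

# Toy 4-label hierarchy: label 0 is the parent of 1 and 2, and 2 is the parent of 3.
A = build_label_adjacency(4, [(0, 1), (0, 2), (2, 3)])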
We first take the average of the word embeddings of the words in the MeSH label descriptor as the initial label embedding v_i ∈ R^d:

    v_i = \frac{1}{m_i} \sum_{j=1}^{m_i} w_j, \quad i = 1, 2, \ldots, L,    (1)

where m_i is the number of words in the descriptor of label i, w_j is the word embedding of word j, and L is the number of labels. The GCN layers capture information about immediate neighbours with one layer of convolution, and information from a larger neighbourhood can be integrated by using a multi-layer stack. We use a two-layer GCN to incorporate the hierarchical information among MeSH terms. At each GCN layer, we aggregate the parent and child nodes of the ith label to form the new label embedding for the next layer:

    h^{l+1} = \sigma(A \cdot h^{l} \cdot W^{l}),    (2)

where h^l and h^{l+1} ∈ R^{L×d} indicate the node representations of the lth and (l+1)th layers, σ(·) denotes an activation function, A is the adjacency matrix of the MeSH hierarchical graph, and W^l is a layer-specific trainable weight matrix. Next, we sum the averaged description vector from Equation 1 with the GCN output to form:

    H_{label} = v + h^{l+1},    (3)

where H_{label} ∈ R^{L×d} is the final label matrix.

[Figure 3: An example of the MeSH hierarchy (retrieved Nov. 2021). For example, under Body Regions, there are specific body regions under the MeSH tree.]

4.1.3. Classifier

For large multi-label text classification tasks, relevant information for each label might be scattered in various locations of the article. In order to match documents to their corresponding label vectors, we employ label-wise attention following Mullenbach et al. (2018). We generate label-specific representations for each channel:

    \alpha_C = \mathrm{Softmax}(D_C \cdot H_{label}^{T}),
    content_C = \alpha_C^{T} \cdot D_C,    (4)

where content_C ∈ R^{L×d}. We then sum up the label-specific content representations of the channels to form the final document representation for each article:

    D = \sum_{C} content_C.    (5)

We then generate the predicted score for each MeSH label via:

    \hat{y}_i = \sigma(D \cdot H_{label}), \quad i = 1, 2, \ldots, L,    (6)

where σ(·) denotes the sigmoid function. Training our proposed method uses binary cross-entropy as the loss function:

    \mathcal{L} = \sum_{i=1}^{L} \left[ -y_i \cdot \log(\hat{y}_i) - (1 - y_i) \cdot \log(1 - \hat{y}_i) \right],    (7)

where y_i ∈ [0, 1] is the ground truth of label i, and \hat{y}_i ∈ [0, 1] denotes the prediction for label i obtained from the proposed model.
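To make Equations 1-7 concrete, the following compact PyTorch sketch covers the two-layer label GCN, the label-wise attention, and the per-label sigmoid scoring. It assumes the channel feature dimension matches the label embedding dimension and reads Equation 6 as a label-wise dot product; both are assumptions, since the extracted text does not pin these details down.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCN(nn.Module):
    """Two-layer GCN over the MeSH label graph (cf. Equations 2-3); sizes are illustrative."""
    def __init__(self, dim=200):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)

    def forward(self, A, v):          # A: (L, L) adjacency, v: (L, dim) averaged descriptor vectors
        h = torch.relu(self.w1(A @ v))
        h = torch.relu(self.w2(A @ h))
        return v + h                  # H_label = v + h^(l+1), cf. Equation 3

def label_wise_attention(channel_feats, H_label):
    """Label-specific document representation (cf. Equations 4-5) for one document."""
    D = torch.zeros_like(H_label)
    for D_C in channel_feats:                          # each D_C: (seq_len, dim)
        alpha = F.softmax(D_C @ H_label.t(), dim=0)    # (seq_len, L) attention over positions
        D = D + alpha.t() @ D_C                        # (L, dim) label-specific content
    return D

def predict_scores(D, H_label):
    """Per-label probabilities (cf. Equation 6), read here as a label-wise dot product."""
    return torch.sigmoid((D * H_label).sum(dim=-1))    # (L,)

# Training objective (cf. Equation 7): binary cross-entropy over all L labels, e.g.
#   loss = F.binary_cross_entropy(predict_scores(D, H_label), targets.float())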
4.2. Evaluation Metrics

We evaluate the performance of MeSH indexing systems using two groups of measurements: bipartition-based evaluation and ranking-based evaluation. Bipartition evaluation is further divided into example-based and label-based metrics. Example-based measures, computed per data point, compute the harmonic mean of standard precision (EBP) and recall (EBR) for each data point. The metrics are defined as:

    EBF = \frac{2 \times EBR \times EBP}{EBR + EBP},    (8)

where

    EBP = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i \cap \hat{y}_i|}{|\hat{y}_i|},    (9)

    EBR = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i \cap \hat{y}_i|}{|y_i|},    (10)

where y_i is the true label set and \hat{y}_i is the predicted label set for instance i, and N represents the total number of instances.

We perform label-based evaluation for each label in the label set. The measurements include Micro-average F-measure (MiF) and Macro-average F-measure (MaF). MiF aggregates the global contributions of all MeSH labels and then calculates the harmonic mean of micro-average precision (MiP) and micro-average recall (MiR), which are heavily influenced by frequent MeSH terms. MaF computes the macro-average precision (MaP) and macro-average recall (MaR) for each label and then averages them, which provides equal weight to each MeSH term. Therefore frequent MeSH terms and infrequent ones are equally important. The aforementioned metrics are defined as follows:

    MiF = \frac{2 \times MiR \times MiP}{MiR + MiP},    (11)

where

    MiP = \frac{\sum_{j=1}^{L} TP_j}{\sum_{j=1}^{L} TP_j + \sum_{j=1}^{L} FP_j},    (12)

    MiR = \frac{\sum_{j=1}^{L} TP_j}{\sum_{j=1}^{L} TP_j + \sum_{j=1}^{L} FN_j},    (13)

    MaF = \frac{2 \times MaR \times MaP}{MaR + MaP},    (14)

where

    MaP = \frac{1}{L} \sum_{j=1}^{L} \frac{TP_j}{TP_j + FP_j},    (15)

    MaR = \frac{1}{L} \sum_{j=1}^{L} \frac{TP_j}{TP_j + FN_j},    (16)

where TP_j, FP_j and FN_j are the true positives, false positives, and false negatives, respectively, for each label l_j in the set of all labels L.

Ranking-based evaluation includes precision at k (P@k) and recall at k (R@k). P@k shows the number of relevant MeSH terms that are suggested in the top-k recommendations of the MeSH indexing system, and R@k indicates the proportion of relevant items that are suggested in the top-k recommendations. The metrics are defined as follows:

    P@k = \frac{1}{k} \sum_{l \in r_k(\hat{y})} y_l,    (17)

    R@k = \frac{1}{|y_i|} \sum_{l \in r_k(\hat{y})} y_l,    (18)

where r_k returns the top-k recommended items.

Thresholds greatly impact the bipartition-based evaluation metrics. We therefore tuned the threshold τ_i for predicting the i-th label, and selected the predicted MeSH term (MeSH_i) whose predicted probability is greater than τ_i:

    MeSH_i = \begin{cases} 1, & \hat{y}_i \ge \tau_i \\ 0, & \hat{y}_i < \tau_i \end{cases}    (19)

We used the micro-F optimization algorithm proposed by Pal et al. (2020) to tune the thresholds:

    \tau_i = \operatorname*{argmax}_{T} \, MiF(T),    (20)

where T represents all possible threshold values for label i.
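A sketch of how these measures and the per-label thresholds could be computed is given below; the per-label F sweep is a simplified stand-in for the micro-F optimisation of Pal et al. (2020), and the candidate threshold grid is an assumption.

import numpy as np

def example_based_metrics(true_sets, pred_sets):
    """Example-based precision, recall and F (Equations 8-10) over N documents."""
    n = len(true_sets)
    ebp = sum(len(t & p) / max(len(p), 1) for t, p in zip(true_sets, pred_sets)) / n
    ebr = sum(len(t & p) / max(len(t), 1) for t, p in zip(true_sets, pred_sets)) / n
    ebf = 2 * ebp * ebr / (ebp + ebr) if ebp + ebr else 0.0
    return ebp, ebr, ebf

def micro_macro_f(tp, fp, fn):
    """Micro- and macro-averaged F (Equations 11-16) from per-label count arrays of length L."""
    mip = tp.sum() / max(tp.sum() + fp.sum(), 1)
    mir = tp.sum() / max(tp.sum() + fn.sum(), 1)
    mif = 2 * mip * mir / (mip + mir) if mip + mir else 0.0
    map_ = np.mean(tp / np.maximum(tp + fp, 1))
    mar = np.mean(tp / np.maximum(tp + fn, 1))
    maf = 2 * map_ * mar / (map_ + mar) if map_ + mar else 0.0
    return mif, maf

def tune_thresholds(scores, targets, grid=np.linspace(0.05, 0.95, 19)):
    """Per-label threshold sweep (cf. Equations 19-20): maximise each label's F score on
    held-out data, as a simplified stand-in for the joint micro-F optimisation."""
    num_labels = scores.shape[1]
    taus = np.full(num_labels, 0.5)
    for i in range(num_labels):
        best_t, best_f = 0.5, -1.0
        for t in grid:
            pred = scores[:, i] >= t
            tp = np.sum(pred & (targets[:, i] == 1))
            fp = np.sum(pred & (targets[:, i] == 0))
            fn = np.sum(~pred & (targets[:, i] == 1))
            f = 2 * tp / max(2 * tp + fp + fn, 1)
            if f > best_f:
                best_t, best_f = t, f
        taus[i] = best_t
    return taus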
4.3. Experiment Settings

We implement our model using PyTorch (Paszke et al., 2019). We use 200-dimensional word embeddings (BioWordVec) that are pretrained on PubMed article titles and abstracts (Zhang et al., 2019). For the model's DCNN component, we use a one-dimensional convolution with kernel size 3 and a three-level dilated convolution with dilation rates [1, 2, 3]. The number of hidden units in both components of our model is set to 200. We use the Adam optimizer (Kingma and Ba, 2015) with a minibatch size of 8 and an initial learning rate of 0.0003 with a decay rate of 0.9 in every epoch. To avoid overfitting, we apply dropout directly after the embedding layer with a rate of 0.2 and use early stopping strategies (Yao et al., 2007). Our model is trained on a single NVIDIA A100 GPU. It takes approximately five to seven days to train the full model.
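The sketch below mirrors this training configuration (Adam at 3e-4 with a 0.9 per-epoch decay, minibatch size 8, dropout 0.2 after the embedding layer, early stopping); the tiny stand-in model, the random placeholder data, the epoch budget, and the patience value are all invented for the example.

import torch
import torch.nn as nn

# Placeholder model: embedding -> dropout -> flatten -> multi-label sigmoid output.
model = nn.Sequential(nn.Embedding(5000, 200), nn.Dropout(p=0.2),
                      nn.Flatten(), nn.Linear(200 * 64, 100), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
loss_fn = nn.BCELoss()

tokens = torch.randint(0, 5000, (256, 64))        # placeholder token ids
labels = torch.randint(0, 2, (256, 100)).float()  # placeholder multi-hot MeSH targets
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(tokens, labels),
                                     batch_size=8, shuffle=True)

best_val, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # decay the learning rate every epoch
    val_loss = loss.item()                         # a real setup would score a held-out split
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # simple early-stopping rule
            break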
4.4. Experimental Results

In the baseline model, we are interested in the articles that have all six sections, i.e., title, abstract, introduction, methods, results, and discussion. We extract the 957,426 articles from MeSHup that meet these criteria. We use stratified sampling over publication year to split our dataset into training, validation and testing sets. We use 80% of the documents for training (765,920), 10% for validation (95,737), and 10% for testing (95,769).

We first conduct our experiments with titles and abstracts only, and then we conduct our experiments on the full texts. From this experiment, we would like to see how integrating full text information affects the indexing performance compared with using the titles and abstracts only. Table 3 summarizes the results of the bipartition evaluation and Table 4 shows the results of the ranking-based measures. We can see substantial improvements on all evaluation metrics when involving full texts, which indicates that full texts are more informative compared to the titles and abstracts. The baseline model performs fairly well on precision, but with a trade-off in recall. The reason for this could be that the frequency of each MeSH label is quite biased; some labels might have very few training examples, so it is very hard for the baseline model to predict those rare labels.

                           Titles and Abstracts   Full Texts
Example based     EBF      0.183                  0.259
                  EBP      0.503                  0.588
                  EBR      0.112                  0.166
Micro-averaged    MiF      0.177                  0.259
                  MiP      0.473                  0.604
                  MiR      0.110                  0.164
Macro-averaged    MaF      0.362                  0.367
                  MaP      0.798                  0.810
                  MaR      0.234                  0.237

Table 3: Comparison using only titles and abstracts and full texts across bipartition evaluation. Bold: best scores in each row.

        Measure   Titles and Abstracts   Full Texts
P@k     P@1       0.699                  0.801
        P@3       0.462                  0.609
        P@5       0.372                  0.496
        P@10      0.260                  0.341
        P@15      0.205                  0.267
R@k     R@1       0.051                  0.077
        R@3       0.098                  0.128
        R@5       0.131                  0.171
        R@10      0.180                  0.232
        R@15      0.214                  0.272

Table 4: Comparison using only titles and abstracts and full texts across ranking-based measures. Bold: best scores in each row.

5. Conclusion and Future Work

We present MeSHup, a new, publicly available full text dataset annotated for MeSH indexing. It is a mashup of full-text information from BioC-PMC and associated metadata collected from the MEDLINE database. This is the first large dataset that contains the full text information that allows the research community to incorporate more textual information other than the title and abstract in building MeSH indexing systems. We also train an end-to-end model that comprises features extracted from the document itself and features obtained from labels. We think that the MeSHup dataset could be a valuable resource not only for MeSH indexing but also for full text mining and retrieval. Since our analysis covers several but not all sections of the full text articles, it is likely that other parts of the article together with metadata may also have impacts on future outcomes. In future, we plan to involve more sections of the textual information and metadata to improve the automatic indexing system.

Acknowledgements

We thank all reviewers for their constructive comments and feedback. Resources used in preparing this research were provided, in part, by Compute Ontario13, Compute Canada14, the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute15. This research is partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) through a Discovery Grant to R. E. Mercer. F. Rudzicz is supported by a CIFAR Chair in AI.

13 https://www.computeontario.ca
14 https://www.computecanada.ca
15 https://www.vectorinstitute.ai/partners

Appendix: A Complete Version of a Data Sample in the Dataset

{"articles":[
  {"PMID":"27976717",
   "TITLE":"Temporal pairwise spike correlations fully capture single-neuron information",
   "ABSTRACT":"To crack the neural code and read out the information neural spikes convey, [...]",
   "INTRO":"Throughout the central nervous system of a mammalian brain [...]",
   "METHODS":"Deriving the correlation theory of neural information [...]",
   "RESULTS":"We are interested in the information contained in a spike train r(t) about a stimulus s(t)[...]",
   "DISCUSS":"The list of spike timing features that have been implicated in neural coding includes [...]",
   "FIG_CAPTIONS":"Dimensionality of neural information coding [...]",
   "TABLE_CAPTIONS":"Parameter sets across neuron models. [...]",
   "JOURNAL":"Nature communications",
   "YEAR":"2016",
   "DOI":"10.1038/ncomms13805",
   "AUTHORS":[
     "Amadeus,Dettner",
     "Sabrina,Munzberg",
     "Tatjana,Tchumatchenko"],
   "MeSH": {
     "D000200":"Action Potentials",
     "D008959":"Models, Neurological",
     "D009474":"Neurons",
     "D059010":"Single-Cell Analysis"
   },
   "CHEMICALS":"None",
   "SUPPLMeSH":"None"
  },
  { ... },
  ...
]}

6. Bibliographical References

Aronson, A. R. and Lang, F.-M. (2010). An overview of MetaMap: Historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.
Aronson, A., Mork, J. G., Gay, C. W., Humphrey, S., and Rogers, W. J. (2004). The NLM Indexing Initiative's Medical Text Indexer. Studies in Health Technology and Informatics, 107(Pt 1):268–272.
Bai, S., Kolter, J. Z., and Koltun, V. (2018). Convolutional sequence modeling revisited. In Proceedings of the 6th International Conference on Learning Representations (ICLR) Workshop.
Comeau, D. C., Wei, C.-H., Islamaj, D. R., and Lu, Z. (2019). PMC text mining subset in BioC: about 3 million full text articles and growing. Bioinformatics, pages 3533–3535.
Dai, S., You, R., Lu, Z., Huang, X., Mamitsuka, H., and Zhu, S. (2019). FullMeSH: improving large-scale MeSH indexing with full text. Bioinformatics, 36(5):1533–1541.
Demner-Fushman, D. and Mork, J. G. (2015). Extracting characteristics of the study subjects from full-text articles. In Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, pages 484–491.
Dogan, R. I., Kim, S., Chatr-Aryamontri, A., Chang, C. S., Oughtred, R., Rust, J., Wilbur, W. J., Comeau, D. C., Dolinski, K., and Tyers, M. (2017). The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford).
Gasperin, C., Karamanis, N., and Seal, R. (2007). Annotation of anaphoric relations in biomedical full text articles using a domain-relevant scheme. In Proceedings of the 6th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2007), pages 19–24.
Gay, C. W., Kayaalp, M., and Aronson, A. R. (2005). Semi-automatic indexing of full text biomedical articles. AMIA Annual Symposium Proceedings, pages 271–275.
Hirschman, L., Yeh, A., Blaschke, C., and Valencia, A. (2005). Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6(Suppl 1):S16.
Islamaj, R., Leaman, R., Kim, S., Kwon, D., Wei, C.-H., Comeau, D. C., Peng, Y., Cissel, D., Coss, C., Fisher, C., Guzman, R., Kochar, P. G., Koppel, S., Trinh, D., Sekiya, K., Ward, J., Whitman, D., Schmidt, S., and Lu, Z. (2021). NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Scientific Data, 8(91).
Jimeno-Yepes, A., Plaza, L., Mork, J. G., Aronson, A. R., and Díaz, A. (2012). MeSH indexing based on automatically generated summaries. BMC Bioinformatics, 14:208.
Jin, Q., Dhingra, B., and Cohen, W. W. (2018). AttentionMeSH: Simple, effective and interpretable automatic MeSH indexer. In Proceedings of the 2018 EMNLP Workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, pages 47–56.
Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. (2017). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kipf, T. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., Davis, A. P., Mattingly, C. J., Wiegers, T. C., and Lu, Z. (2015). Annotating chemicals, diseases and their interactions in biomedical literature. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pages 173–182.
Li, Y., Zhang, X., and Chen, D. (2018). CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1091–1100.
Lin, J. and Wilbur, W. J. (2007). PubMed related articles: A probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1):423.
Lin, J., Su, Q., Yang, P., Ma, S., and Sun, X. (2018). Semantic-unit-based dilated convolution for multi-label text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4554–4564.
Liu, K., Peng, S., Wu, J., Zhai, C., Mamitsuka, H., and Zhu, S. (2015). MeSHLabeler: Improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics, 31(12):i339–i347.
Mork, J. G., Jimeno-Yepes, A., and Aronson, A. R. (2013). The NLM Medical Text Indexer system for indexing biomedical literature. In Proceedings of the First Workshop on Bio-Medical Semantic Indexing and Question Answering (BioASQ). CEUR-WS.org, online http://CEUR-WS.org/Vol-1094/bioasq2013_submission_3.pdf.
Mork, J. G., Aronson, A. R., and Demner-Fushman, D. (2017). 12 years on – Is the NLM medical text indexer still useful and relevant? Journal of Biomedical Semantics, 8(1):8.
Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., and Eisenstein, J. (2018). Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana, June. Association for Computational Linguistics.
Pal, A., Selvakumar, M., and Sankarasubbu, M. (2020). Multi-label text classification using attention-based graph neural network. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART), pages 494–505.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, et al., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Peng, S., You, R., Wang, H., Zhai, C., Mamitsuka, H., and Zhu, S. (2016). DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics, 32(12):i70–i79.
Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., Song, Y., and Yang, Q. (2018). Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In Proceedings of the 2018 World Wide Web Conference.
Tai, K. S., Socher, R., and Manning, C. D. (2015). Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566.
Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga, A., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., and Paliouras, G. (2015). An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.
Verspoor, K., Cohen, K. B., Lanfranchi, A., Warner, C., Johnson, H. L., Roeder, C., Choi, J. D., Funk, C., Malenkiy, Y., Eckert, M., Xue, N., Baumgartner Jr., W. A., Bada, M., Palmer, M., and Hunter, L. E. (2012). A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics, 13:207.
Wang, X. and Mercer, R. E. (2019). Incorporating figure captions and descriptive text in MeSH term indexing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 165–175.
Wang, X., Mercer, R. E., and Rudzicz, F. (2022). KenMeSH: Knowledge-enhanced end-to-end biomedical text labelling. arXiv preprint arXiv:2203.06835.
Xun, G., Jha, K., Yuan, Y., Wang, Y., and Zhang, A. (2019). MeSHProbeNet: A self-attentive probe net for MeSH indexing. Bioinformatics.
Yao, Y., Rosasco, L., and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26:289–315.
Yao, L., Mao, C., and Luo, Y. (2019). Graph convolutional networks for text classification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pages 7370–7377.
You, R., Liu, Y., Mamitsuka, H., and Zhu, S. (2020). BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text. Bioinformatics, 37(5):684–692.
Yu, F. and Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
Zhang, Y., Chen, Q., Yang, Z., Lin, H., and Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6.

7. Language Resource References

Comeau, Donald C and Wei, Chih-Hsuan and Islamaj Doğan, Rezarta and Lu, Zhiyong. (2019). BioC corpus.
William R. Hersh and Chris Buckley and T. J. Leone and David H. Hickam. (1994). OHSUMED: an interactive retrieval evaluation and new large test collection for research.
Jin-Dong Kim and Tomoko Ohta and Yuka Tateisi and Junichi Tsujii. (2003). GENIA corpus.
Martin Krallinger and Obdulia Rabal and Florian Leitner and Miguel Vazquez and David Salgado and Zhiyong Lu and Robert Leaman and Yanan Lu and Dong-Hong Ji and Daniel M. Lowe and Roger A. Sayle and Riza Theresa Batista-Navarro and Rafal Rak and Torsten Huber and Tim Rocktäschel and Sérgio Matos and David Campos and Buzhou Tang and Hua Xu and Tsendsuren Munkhdalai and Keun Ho Ryu and S. V. Ramanan and P. Senthil Nathan and Slavko Zitnik and Marko Bajec and Lutz Weber and Matthias Irmer and Saber Ahmad Akhondi and Jan A. Kors and Shuo Xu and Xin An and Utpal Kumar Sikdar and Asif Ekbal and Masaharu Yoshioka and Thaer M. Dieb and Miji Choi and Karin M. Verspoor and Madian Khabsa and C. Lee Giles and Hongfang Liu and K. E. Ravikumar and Andre Lamurias and Francisco M. Couto and Hong-Jie Dai and Richard Tzong-Han Tsai and C Ata and Tolga Can and Anabel Usie and Rui Alves and Isabel Segura-Bedmar and Paloma Martínez and Julen Oyarzábal and Alfonso Valencia. (2014). The CHEMDNER corpus.