The Biomaterials Annotator: a system for ontology-based concept annotation of biomaterials text - Association for Computational ...
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
The Biomaterials Annotator: a system for ontology-based concept annotation of biomaterials text Javier Corvi1 , Carla V. Fuenteslópez2 , José M. Fernández1 , Josep Lluis Gelpí1,3 , Maria Pau Ginebra4 , Salvador Capella-Gutierrez1 and Osnat Hakimi1,5 1 Barcelona Supercomputing Center (BSC), Barcelona, Spain 2 Institute of Biomedical Engineering, Botnar Research Centre, University of Oxford, UK 3 Dept. of Biochemistry and Molecular Biology, University of Barcelona, Spain 4 Dept. of Material Science and Engineering, Universitat Politècnica de Catalunya, Spain 5 Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya, Spain javier.corvi@bsc.es osnat.hakimi@gmail.com Abstract generated by the field is primarily available in text documents, such as peer-reviewed research Biomaterials are synthetic or natural materials papers, patents and conference abstracts. This ever- used for constructing artificial organs, fabricat- ing prostheses, or replacing tissues. The last growing knowledge is increasingly harder for re- century saw the development of thousands of searchers to efficiently discover, organize and use. novel biomaterials and, as a result, an expo- For example, systematically reviewing the appli- nential increase in scientific publications in the cations and scaffolds made of a commonly used field. Large-scale analysis of biomaterials and polymer such as poly-lactic-glycolic-acid (PLGA), their performance could enable data-driven requires skimming through >12,000+ abstracts material selection and implant design. How- (MEDLINE search on October 2020). Among the ever, such analysis requires identification and organization of concepts, such as materials and different alternatives for the automated processing structures, from published texts. To facilitate of available texts, Natural Language Processing future information extraction and the applica- (NLP) workflows for information retrieval and in- tion of machine-learning techniques, we de- dexing offer a much needed automated solution. veloped a semantic annotator specifically tai- Such computational workflows facilitate informa- lored for the biomaterials literature. The Bio- tion discovery, information extraction and orga- materials Annotator has been implemented fol- nization, saving researchers time and minimizing lowing a modular organization using software manual tasks. containers for the different components and or- chestrated using Nextflow as workflow man- Central to information retrieval and indexing is ager. Natural language processing (NLP) com- the extraction of concepts of interest, also known ponents are mainly developed in Java. This as Named Entity Recognition (NER). NER is an set-up has allowed named entity recognition integral part of NLP workflows as it allows the au- of seventeen classes relevant to the biomateri- tomated identification of concepts in unstructured als domain. Here we detail the development, text and its assignment to a pre-defined category evaluation and performance of the system, as well as the release of the first collection of an- or class. For example, in the field of biomaterials, notated biomaterials abstracts. We make both categories may include ‘Biomaterials’ (‘PLGA’), the corpus and system available to the commu- ‘Structures’ (such as ‘fibre’ or ‘sponge’) and ‘Tis- nity to promote future efforts in the field and sues’ (such as ‘tendon’ or ‘bone’). The use of NER contribute towards its sustainability. to automatically recognize entities enables several downstream applications, including machine trans- 1 Introduction lation, information retrieval and indexing as well The last two decades saw the field of biomate- as automated question-answering mechanisms. rials and tissue engineering grow from a small The recognition of concepts in the biomaterials niche of biomedical research to an extensive do- domain is complicated by language and terminol- main, covering topics such as functional materi- ogy originating from multiple scientific disciplines als, cell-material interaction, nanomaterials and (chemistry, engineering, biology, medicine). A sig- medical devices. The expanding scientific data nificant challenge lies in identifying and combining 36 Proceedings of the Second Workshop on Scholarly Document Processing, pages 36–48 June 10, 2021. ©2021 Association for Computational Linguistics
Figure 1: Overview of the workflow used in the development and validation of the Biomaterials Annotator. lexical and semantic resources across domains, and materials Annotator, along with an annotated thus to date there are no automatic biomaterials- collection of biomaterials literature, are publicly specific NER systems to detect relevant entities of available for use and further development at interest. https://github.com/ProjectDebbie/ Here, we report the development of the first Biomaterials_annotator. biomaterials-specific annotation system, designed to recognize named entities from seventeen differ- 2 Previous relevant work ent categories, reflecting the complexity and diver- sity of contemporary biomaterials research. When Unlike general purpose NLP systems, biomedical considering approaches for the design of the Bio- domain-specific tools require advanced approaches materials Annotator, i.e. lexical versus machine to detect classes of interest such as diseases and learning-based NER, such as CRF or RNN, it was gene names. In this area, there are several well- essential to consider the number of desired annota- known and widely used systems and tools, such tion categories in the system (17) and the absence as Metamap (Aronson, 2001) and Pubtator (Wei of an annotated corpora for text mining efforts for et al., 2013), which were developed using dif- the majority of them. Based on these premises, it ferent NER methodologies and approaches, e.g. was concluded that training a model for each cat- gazetteers and hand-made rule-based NER; ma- egory was impractical. Thus, the system relies on chine learning-based NER that includes Hidden manually curated and validated lexical resources. Markov Model, Conditional Random Fields (CRF) and recurrent neural network (RNN); and Hybrid To cover entities from different domains, mul- NER (Lee et al., 2003; McCallum and Li, 2003; tiple nomenclatures, vocabularies, and especially Song et al., 2004; GuoDong and Jian, 2004; Zhao, ontologies were identified and combined. To com- 2004; Yeh et al., 2005; Campos et al., 2013; Song bine these resources into a single instrument, the et al., 2018; Dang et al., 2018; Kaewphan et al., Devices, Experimental scaffolds and Biomateri- 2018; Cho and Lee, 2019). In the context of this als Ontology (DEB) was used providing the log- work, generic text mining tools previously devel- ical schema and the definition of key categories oped for the eTRANSAFE project (Pognan et al., (Hakimi et al., 2020). 2021) have been adapted and further developed for The resulting open source-system, the Bio- the biomaterials domain. 37
Whilst there are a handful of ontologies in the biomaterials domain (such the nanoparticle ontol- ogy NPO (Thomas et al., 2011) and the Bone and Cartilage Tissue Engineering Ontology BCTEO (Viti et al., 2014)), to the best of the our knowledge, the DEB ontology (Hakimi et al., 2020) is the only one that is tailored to link and curate concepts for Biomaterials NER. Therefore, it specifically cov- ers different categories related to the biomaterials domain. 3 Methodology overview To develop the annotation tool, the workflow in Figure 1 was followed. Various corpora of abstracts were used during the development, Figure 2: Overview of the components of the Bioma- terials Annotator; including the standard preprocessing covering the general biomaterials literature. steps (A) and the biomaterials named entity recognition These corpora included a collection of man- steps (B). ually curated abstracts of the biomedical polymer polydioxanone (Fuenteslópez et al 2021, manuscript in preparation, GitHub repository: includes the steps previously outlined. This https://github.com/ProjectDebbie/ component is written in JAVA and it uses the polydioxanone_project) and a previously Stanford CoreNLP Natural Language Processing published biomaterials gold standard collection open source toolkit. The use of the Stanford (Hakimi et al., 2020), comprising a total of 1222 CoreNLP API benefits greatly from the provision abstracts. Corpora were passed through four steps, of a set of stable, robust, high quality linguistic each described in detail below. The first step was a analysis components, which can be easily invoked text preprocessing component (section 3.1). This for common scenarios (Manning et al., 2014). was followed by concept recognition (section 3.2), The Standard NLP preprocessing component initially using the MeSH controlled vocabulary is available at https://gitlab.bsc.es/ and the DEB ontology. Then, the annotations inb/text-mining/generic-tools/ were evaluated by two domain experts, errors nlp-standard-preprocessing. were flagged up and additional lexical resources were added through keyword searches. Concept 3.2 Concept recognition recognition, manual evaluation and curation of Here, we developed NER components to detect lexical resources were performed in an iterative relevant entities related to the biomaterials domain manner during the development phase (section 3.3) based on the DEB ontology in conjunction with over 1000 abstracts. Once the development phase other open relevant resources, such as the National was completed, validation by domain experts was Cancer Institute Thesaurus (NCIT) and (CHEBI). performed on 199 independent abstracts which A comprehensive description of the resources were not used during the development process included in this work is described in section 3.3 and (section 4.1). The resulting annotated collection Appendix B. Lexical resources were transformed of biomaterials abstracts was published as open into gazetteers to be used in the NER process source. (Figure 2.B). Internally, the NER process was divided into three main steps; the MSH Annotator, 3.1 Text preprocessing which annotates relevant categories from the To prepare the text for concept recognition, MeSH terminology; the Dictionary Annotator, several Natural Language Processing (NLP) steps which annotated predefined categories from the were performed, namely: tokenization, sentence relevant dictionaries; and the Post-processing step splitting, part-of-speech tagging and morpho- in which specific rules were executed. These in- logical analysis (Figure 2.A). We developed the clude entity recognition based on lexical rules and Standard NLP preprocessing component which the removal of false positives, among other tasks. 38
The MSH Annotator is available at https:// detect concepts: github.com/ProjectDebbie/debbie_ (Token.pos == "JJ" | Token.pos == "NN") To- umls_annotations; and the Dictionary An- ken.root == "cell" notator and Post-processing rules are available at The inclusion of this rule enables the detection of https://github.com/ProjectDebbie/ Cell-type concepts that are not present in the dic- DEBBIE_dictionaries_annotations. tionaries; e.g. “neuronal cells”, “cancer cell” and These components are instances of the nlp-gate- “osteogenic cells”. The discovery of such rules is generic-component (https://gitlab.bsc. a continuous work; future Biomaterials Annotator es/inb/text-mining/generic-tools/ versions will improve the lexical rules included to nlp-gate-generic-component), a detect relevant concepts. generic component developed in JAVA by our Another key problem to address is the recogni- team that uses the General Architecture of Text tion of abbreviation concepts; to achieve this prob- Engineering (GATE) software (Cunningham et al., lem we developed a post-processing rule based 2013) and can be parametrized with gazetteers and on a modified version of Schwartz’s algorithm specific handmade JAPE (Java Annotation Patterns (Schwartz and Hearst, 2003). First, we detect Engine) rules. Using the Biomaterials Annotator, an abbreviation candidate given a text pattern every recognised entity is labelled with one of the (regex=”(?:[a-z]*[A-Z][a-z]*)2,”); subsequently, categories (Figure 3.A-B). the Schwartz’s algorithm is applied to detect The nlp-gate-generic-component was configured whether there is a definition that matches the abbre- to use the GATE Flexible gazetteer, allowing to viation candidate in the sentence; in such case, if capture the words present in the text as well as the definition has an entity class assigned to it, we their morphological root value (lemma). This en- annotate the abbreviation with the same class. As sures that inflected forms of a word (i.e. plural, an example, in the following sentence: “We investi- singular, -ing forms, tense) can be recognised and gated the potential of human bone marrow derived analysed as a single item. In addition, the dictio- Mesenchymal stem cells (MSCs) for neuronal differ- naries used in the Biomaterials Annotator include entiation in vitro....”; the expression ‘Mesenchymal preferred synonyms, providing the possibility to stem cells’ is annotated under the ’Cell’ category. map terms semantically to a specific primary con- But the ‘MSCs’ abbreviation is not; moreover in the cept. Thus, the Biomaterials Annotator performs rest of the text the abbreviation is used instead of semantic mapping of the annotations by, not only its long form. The abbreviation-rule detects ‘MSCs’ recognizing the category of an entity, but also link- as an abbreviation of ‘Mesenchymal stem cells’ and ing it to the appropriate entry in a well-established assigns the ’Cell’ category to all the ‘MSCs’ men- resource (Jovanović and Bagheri, 2017). For exam- tions in the text. ple, the terms: “canine”, “dogs” and “dog” were all annotated under the ‘Species’ category; and in- 3.3 Terminologies and ontologies curation side the features of each annotation the preferred and manual evaluation term is “dog”. This enables the retrieval of all the One of the main hurdles to biomaterials concept corresponding terms using the single search term recognition is the interdisciplinary nature of the ‘dog’. domain, with scientific texts containing concepts To complete the annotation process, the annota- from various fields such as biology, chemistry, tor executes JAPE rules for post-processing func- engineering and medicine. A key objective of tions, such as the removal of false positives and the the Biomaterials Annotator was to identify and addition of information to each annotation. Added combine lexical resources from the different information includes the ontology source, the ontol- domains in order to cover as many relevant ogy term id, the lemma and the preferred synonym biomaterials concepts as possible. Resources were (Figure 3.C-D). In addition, JAPE rules were run identified using a manual, bottom-up approach, to identify entities using lexical constraints and with cyclic re-iteration, as shown in Figure 1. As a address the concept recognition of abbreviations. starting point, abstracts were annotated with the Rule-based entities recognition can use part-of- automated NER approach described in section 3.2 speech of concepts, as an example; in the case using the DEB ontology. After each annotation of ‘Cell’ category, there is a lexical rule defined to round, manual evaluation was performed by 39
Figure 3: The appearance of an annotated abstract on GATE’s user interface. A) Shows the annotated text and in B) colored labels used to tag annotations by their respective category. C) Information regarding each annotation (type, position, features), and in D) a specific example: “polymers”: "BiomaterialType" entity with their corresponding features. ANNOTATION CATEGORIES BIOLOGICALLY RESEARCH BIOMATERIAL ACTIVE TISSUE CELL TYPE SUBSTANCE TECHNIQUE SCANNING SILK EXAMPLE: RGD PANCREAS ELECTRON FIBROBLAST FIBROIN MICROSCOPY DEB DEB MESH MESH MESH MAIN SEMANTIC MESH MESH UBERON NCIT NCIT RESOURCES: CHEBI NCIT PREMEDONTO NPO UBERON Figure 4: An illustration of part of the annotation schema (showing five out of seventeen categories), which relies on multiple semantic resources for each annotation category. Full details of all categories and resources are in the Appendix A. 40
two domain experts. The evaluation entailed tion set, resulting from the execution of the Bio- reviewing samples of 10-20 abstracts in order materials Annotator, was manually verified by 9 to flag annotation errors and highlight relevant biomaterials experts. The validation process was concepts which were missed by the system. The performed using the GATE user interface, where flagged terms were used for keyword search in annotations made by the system were presented the Bioportal (Martínez-Romero et al., 2017) and to the biomaterials experts with the possibility of the UMLS metathesaurus browser (Bodenreider, adding missing annotations, removing false annota- 2004). Through these searches, specific classes tions and editing annotations. Once the expert had (or ‘parent concepts’) within relevant ontologies finished the validation of a document, it was saved and UMLS ‘semantic types’ were identified as a different validated copy. and added to the annotation schema (part of Two strategies to indicate if two annotations which is shown for illustration in Figure 4). The agree or not were considered; a strict approach, resources identified belonged to three categories: in which the annotations agree if they have the ontologies, controlled vocabularies and nomen- same origin and end offset, and a more relaxed or clatures. All the ontologies were open access and “lenient” approach, where the annotations agree downloaded in .owl format from NCBO Bioportal if they overlap at some point. For example, in (http://bioportal.bioontology.org) the partial approach the biomaterials expressions (Martínez-Romero et al., 2017). The controlled “polyvinyl alcohol” and “polyvinyl” are considered vocabulary (MeSH) was downloaded for use to agree, which does not happen in the strict agree- under license from the UMLS Terminology ment. Services. The GMDN nomenclature was kindly To measure the performance of the NER system, provided in .xml format by the GMDN agency the set validated by the experts was taken as the under a license. A summary of all the used gold standard and the system’s output as the set resources is in Appendix B. For the resources to be validated. Table 1 shows the recall, preci- to be used in the annotation system, relevant sion and F-score, including the strict and lenient classes were imported into a dictionary (gazetteer) approaches, as well as an average between them. containing the following fields: the term, its The global scores calculated for the system are also label (annotation category), the ID and whenever presented, obtaining an 0.75 strict F-score, 0.79 available, a preferred synonym. The extraction of lenient F-Score and 0.77 average F-score. desired classes from the ontologies to dictionary format was done using an implementation of Figure 5 shows the average F-scores calculated owlready2 (Lamy, 2017) and the code (named for the different categories. Categories with an av- owl2dict_light) is available in an open github erage F-score above 0.8 are considered categories repository as part of the (https://github. in which the concepts are satisfactorily covered by com/ProjectDebbie/OWL2DICT). The the resources used (e.g. Structure, BiomaterialType resulting dictionaries are also available and Tissue). On the other hand, there are categories (https://github.com/ProjectDebbie/ with lower scores, and specifically: ’Biomaterial’, DEBBIE_dictionaries_annotations). ’Biologically active substance’ and ’Cell’. The cat- These were in turn used by the Dictionary egories Biomaterials and Biologically active sub- Annotator component for concept recognition as stance had significantly reduced accuracy because described above in section 3.2. they include many ambiguous concepts. Some ma- terials may act as a biomaterial in one set-up, but 4 Results can also be measured in terms of cell expression or non-biomaterial use in another set-up (e.g. col- 4.1 Expert validation lagen). In the latter case, the human validator will To measure the efficiency of a text mining system delete the ‘Biomaterial’ annotation. Solving this such as the Biomaterials Annotator, it is fundamen- kind of ambiguities will require other strategies, tal to organize and plan a validation stage aimed at such as specific lexical rules or machine learning indicating the performance of the system. The Bio- approaches. Another factor impeding good quality materials Annotator was validated through manual annotations of Biomaterials is the lack of good qual- verification of the validation set, an independent ity vocabulary of medical polymers. Polymer and collection of 199 abstracts. The annotated valida- co-polymer naming is notoriously variable, with 41
Precision Recall F-score Precision Recall F-score Precision Recall F-score Category - strict - strict - strict - lenient - lenient - lenient - average - average - average Adverse Effects 0.94 0.75 0.82 1 0.8 0.87 0.97 0.77 0.85 Associated Biological Process 0.88 0.68 0.77 0.94 0.73 0.82 0.91 0.71 0.79 Biologically Active Substance 0.58 0.43 0.49 0.7 0.52 0.59 0.64 0.48 0.54 Biomaterial 0.76 0.47 0.57 0.83 0.52 0.63 0.79 0.49 0.6 Biomaterial Type 0.92 0.88 0.9 0.98 0.93 0.95 0.95 0.9 0.92 Cell 0.76 0.59 0.66 0.84 0.65 0.73 0.8 0.62 0.69 Effect On Biological System 0.96 0.69 0.79 1 0.72 0.82 0.98 0.71 0.8 Manufactured Object 0.96 0.86 0.9 0.96 0.86 0.9 0.96 0.86 0.9 Manufactured Object Component 0.91 0.84 0.86 0.91 0.84 0.87 0.91 0.84 0.87 Manufactured Object Features 0.68 0.59 0.62 0.71 0.61 0.65 0.69 0.6 0.64 Material Processing 0.78 0.6 0.67 0.83 0.63 0.71 0.81 0.61 0.69 Medical Application 0.68 0.49 0.57 0.82 0.6 0.69 0.75 0.54 0.63 Research Technique 0.81 0.63 0.71 0.87 0.68 0.76 0.84 0.66 0.73 Species 0.97 0.79 0.87 0.99 0.81 0.89 0.98 0.8 0.88 Structure 0.93 0.77 0.84 0.95 0.79 0.86 0.94 0.78 0.85 Study Type 0.96 0.95 0.96 0.99 0.97 0.98 0.98 0.96 0.97 Tissue 0.8 0.77 0.78 0.85 0.82 0.83 0.82 0.8 0.81 Global 0.84 0.69 0.75 0.89 0.73 0.79 0.86 0.71 0.77 Table 1: The performance of the Biomaterials Annotator in a test set of 199 abstracts validated manually by 9 experts. some named by their commercial name or abbrevi- notator and deposits them in an open access ation. To address these inaccuracies, future work database. DEBBIE is under development and will involve expanding relevant ontologies using can be accessed at https://github.com/ tools such as Spike (Taub-Tabib et al., 2020), in- ProjectDebbie/DEBBIE_pipeline. cluding additional lexical rules, and adding ma- Category Count chine learning components. Adverse Effects 657 Associated Biological Process 6231 4.2 Full system implementation and Biologically Active Substance 7709 availability Biomaterial 5726 Biomaterial Type 1543 A significant challenge for scientific software ap- Cell 6839 Effect On Biological System 972 plications is providing facilities to share, distribute Manufactured Object 5967 and run such systems in a simple and convenient Manufactured Object Component 2307 way. Furthermore, an important concern is the Manufactured Object Features 4200 Material Processing 2728 possibility of replicating the results obtained Medical Application 3868 during the research. In order to accomplish these Research Technique 3701 requirements and follow good practices, we de- Species 2089 Structure 4136 veloped the Biomaterials Annotator using Docker Study Type 1806 as software container technology and Nextflow Tissue 9997 as the workflow manager. Through the use of Entities 70476 Tokens 392605 Docker, all the subcomponents of the Biomaterials Sentences 15979 Annotator were individually compartmentalized; Abstracts 1222 hosting their own dependencies and programs that work only inside the isolated container. In addition, Table 2: Annotated biomaterials corpus statistics. the Nextflow workflow manager was used for the automated orchestration and execution of the pipeline. By using this architecture, the entire tool, 4.3 Annotated corpus release or any of its individual components, can be easily Another key objective was to generate the first an- installed and run in heterogeneous environments. notated corpus with entities related to the bioma- The Biomaterials Annotator is available at terials domain. Such a corpus will facilitate the https://github.com/ProjectDebbie/ development and evaluation of text mining mod- Biomaterials_annotator. els for automated extraction of biomaterials-related The Biomaterials Annotator is part of DEB- data from text. BIE (Database of biomedical materials), a The biomaterials annotated dataset consists of wider system that retrieves abstracts from 1222 biomaterials abstracts describing the evalu- pubmed, annotates using the Biomaterials An- ation of biomaterials in either a laboratory or a 42
Figure 5: Average F-score of the automated annotations across categories. clinical setting. Each abstract is individually con- manner aiming to enhanced performance in key tained as a separate file under the GATE format. categories such as Biomaterials and Cells. In addi- Table 2 shows statistics concerning the number of tion, future versions of the Biomaterials Annotator concepts corresponding to the different categories, will be closely related to the DEBBIE system and as well as the number of total entities, sentences include additional functionalities and features de- and tokens. veloped to achieve its main objectives. The annotated biomaterials corpus is The Biomaterials Annotator and the annotated available and free for use; information corpus are open source and available to the com- to access the corpus can be found at munity to promote future efforts in the field and https://github.com/ProjectDebbie/ contribute towards its sustainability. Biomaterials_annotator. 5 Conclusions and future directions 6 Acknowledgements In this work we present the Biomaterials Annota- tor, an ontology-based NER system that identifies This project has received funding from the Euro- 17 domain specific types of concepts and delivers pean Union Horizon 2020 programme under the an annotated biomaterials corpus of 1222 MED- Marie Skodowska- Curie grant agreement DEB- LINE articles available for future text mining and BIE, project number: 751277. O. H. is funded machine learning efforts. We have carried out a val- through a Bosch-Aymerich fellowship. J-M.F, S.C- idation activity to measure the performance of the G and J-L.P are partly supported by INB Grant NER system, with the participation of nine bioma- (PT17/0009/0001 - ISCIII-SGEFI / ERDF). M-P. G. terials experts, obtaining a global average F-score acknowledges the ICREA Academia Award from of 0.77. Generalitat de Catalunya. J.C. is partly supported Future work in the development of the system by eTRANSAFE (received funding from the Inno- could involve annotating relations and linking iden- vative Medicines Initiative 2 Joint Undertaking un- tified concepts to manufactured biomaterials ob- der grant agreement No 777365 and support from jects. It may also include incorporating additional the European Union’s Horizon 2020 research and categories using controlled resources. Improve- innovation programme and EFPIA). ments to the system will continue in an iterative 43
References Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. A. R. Aronson. 2001. Effective mapping of biomedical 2014. The Stanford CoreNLP Natural Language text to the UMLS Metathesaurus: the MetaMap pro- Processing Toolkit. In Proceedings of 52nd An- gram. Proceedings / AMIA ... Annual Symposium. nual Meeting of the Association for Computational AMIA Symposium, pages 17–21. Linguistics: System Demonstrations, pages 55–60, Stroudsburg, PA, USA. Association for Computa- Olivier Bodenreider. 2004. The Unified Medical Lan- tional Linguistics. guage System (UMLS): Integrating biomedical ter- minology. Nucleic Acids Research. Marcos Martínez-Romero, Clement Jonquet, Martin J. O’Connor, John Graybeal, Alejandro Pazos, and David Campos, Sérgio Matos, and JoséLuís Oliveira. Mark A. Musen. 2017. NCBO Ontology Recom- 2013. Current Methodologies for Biomedical mender 2.0: An enhanced approach for biomedical Named Entity Recognition. In Biological Knowl- ontology recommendation. Journal of Biomedical edge Discovery Handbook, pages 839–868. John Wi- Semantics. ley Sons, Inc. Andrew McCallum and Wei Li. 2003. Early results for Hyejin Cho and Hyunju Lee. 2019. Biomedical named entity recognition with conditional random named entity recognition using deep neural net- fields, feature induction and web-enhanced lexicons. works with contextual information. BMC Bioinfor- In Proceedings of the Seventh Conference on Natu- matics, 20(1):735. ral Language Learning at HLT-NAACL 2003, pages 188–191. Hamish Cunningham, Valentin Tablan, Angus Roberts, and Kalina Bontcheva. 2013. Getting More Out of François Pognan, Thomas Steger-Hartmann, Car- Biomedical Documents with GATE’s Full Lifecycle los Díaz, Niklas Blomberg, Frank Bringezu, Open Source Text Analytics. PLoS Computational Katharine Briggs, Giulia Callegaro, Salvador Biology. Capella-Gutierrez, Emilio Centeno, Javier Corvi, Philip Drew, William C. Drewe, José M. Fernán- Thanh Hai Dang, Hoang Quynh Le, Trang M. Nguyen, dez, Laura I. Furlong, Emre Guney, Jan A. Kors, and Sinh T. Vu. 2018. D3NER: Biomedical named Miguel Angel Mayer, Manuel Pastor, Janet Piñero, entity recognition using CRF-biLSTM improved Juan Manuel Ramírez-Anguita, Francesco Ronzano, with fine-tuned embeddings of various linguistic in- Philip Rowell, Josep Saüch-Pitarch, Alfonso Valen- formation. Bioinformatics, 34(20):3539–3546. cia, Bob van de Water, Johan van der Lei, Erik van Mulligen, and Ferran Sanz. 2021. The etransafe Zhou GuoDong and Su Jian. 2004. Exploring deep project on translational safety assessment through knowledge resources in biomedical name recogni- integrative knowledge management: Achievements tion. page 96. and perspectives. Pharmaceuticals, 14(3). Osnat Hakimi, Josep Luis Gelpi, Martin Krallinger, Ariel S. Schwartz and Marti A. Hearst. 2003. A simple Fabio Curi, Dmitry Repchevsky, and Maria-Pau algorithm for identifying abbreviation definitions in Ginebra. 2020. The Devices, Experimental Scaf- biomedical text. Pacific Symposium on Biocomput- folds, and Biomaterials Ontology (DEB): A Tool ing. Pacific Symposium on Biocomputing. for Mapping, Annotation, and Analysis of Bio- materials Data. Advanced Functional Materials, Hye Jeong Song, Byeong Cheol Jo, Chan Young Park, 30(16):1909910. Jong Dae Kim, and Yu Seop Kim. 2018. Com- parison of named entity recognition methodologies Jelena Jovanović and Ebrahim Bagheri. 2017. Seman- in biomedical documents. BioMedical Engineering tic annotation in biomedicine: The current land- Online, 17(Suppl 2). scape. Yu Song, Eunju Kim, Gary Geunbae Lee, and Byoung- Suwisa Kaewphan, Kai Hakala, Niko Miekka, Tapio kee Yi. 2004. POSBIOTM-NER in the shared task Salakoski, and Filip Ginter. 2018. Wide-scope of BioNLP/NLPBA 2004. biomedical named entity recognition and normaliza- tion with CRFs, fuzzy matching and character level Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan modeling. Database, 2018(2018):96. Lahav, Matan Eyal, Yaara Cohen, and Y. Goldberg. 2020. Interactive extractive search over biomedical Jean Baptiste Lamy. 2017. Owlready: Ontology- corpora. ArXiv, abs/2006.04148. oriented programming in Python with automatic classification and high level constructs for biomed- Dennis G. Thomas, Rohit V. Pappu, and Nathan A. ical ontologies. Artificial Intelligence in Medicine, Baker. 2011. NanoParticle Ontology for cancer nan- 80:11–28. otechnology research. Journal of Biomedical Infor- matics. Ki-Joong Lee, Young-Sook Hwang, and Hae-Chang Rim. 2003. Two-phase biomedical NE recognition Federica Viti, Silvia Scaglione, Alessandro Orro, and based on SVMs. Luciano Milanesi. 2014. Guidelines for managing 44
data and processes in bone and cartilage tissue engi- neering. BMC Bioinformatics, 15(Suppl 1):S14. Chih Hsuan Wei, Hung Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research, 41(Web Server issue). Alexander Yeh, Alexander Morgan, Marc Colosimo, and Lynette Hirschman. 2005. BioCreAtIvE task 1A: Gene mention finding evaluation. BMC Bioin- formatics, 6(SUPPL.1):1–10. Shaojun Zhao. 2004. Named entity recognition in biomedical texts using an HMM model. (Grefen- stette 1994):84. 45
A Semantic resources Table 3: List of semantic resources used by the Biomaterials Annotator Semantic Resource Name Acronym Scope and relevance Type Chemical Methods Methods used to collect chemical 1 CHMO Ontology Ontology experiments data. Chemical Entities of Compounds of biological relevance, 2 CHEBI Ontology Biological Interest macromolecules. The Devices, Experimental Biomaterials-related concepts, 3 Scaffolds and Biomaterials DEB materials, structures, Ontology Ontology material processing. EDAM Bioimaging EDAM- Imaging and sample preparation 4 Ontology Ontology BIOIMAGING techniques. Global Medical Device 5 GMDN Full names of medical devices. Nomenclature Nomenclature Biological concepts including Interlinking ontology of biological phenomena, diseases, 6 IOBC Ontology biological concepts molecular functions, research imaging techniques. A hierarchically organized Controlled 7 Medical Subject Headings MeSH vocabulary produced by vocabulary the NLM. National Cancer Institute Vocabulary for clinical care, Controlled 8 NCIT Thesaurus translational and basic research. vocabulary The description, preparation, and 9 Nanoparticle ontology NPO Ontology characterization of nanomaterials. Biomedical protocols, instruments, Ontology for Biomedical 10 OBI data generated, materials, analysis Ontology Investigations performed. Ontology of Nuclear Models, chemicals, tools, research 11 ONTOTOXNUC Ontology Toxicity techniques and models. Human disease terms, genomic, Precision Medicine 12 PREMEDONTO molecular, phenotype, related Ontology Ontology medical vocabulary. An integrated cross-species 13 Uber Anatomy Ontology UBERON Ontology anatomy ontology 46
B Annotation categories Table 4: Annotation categories, their respective semantic resources and imported classes Annotation Definition Resource and imported classes category A non-drug raw material DEB: Biomaterials 1 Biomaterial or substance suitable for CHEBI: Macromolecule inclusion in systems which MeSH: Biomedical or Dental Material augment or replace the function of bodily tissues or organs. Biomaterial Classification or nature 2 DEB: Biomaterial Type Types of biomaterials. Biologically Substance included in a DEB: Biologically Active Substance 3 Active manufactured object in MeSH: amino acid, peptide, protein Substance order to impart a biological Biologically Active Substance activity. Pharmacologic Substance NCIT: Protein Domain DEB: Manufactured Object Manufactured A physical object created 4 MeSH: Medical device Object by hand or machine. GMDN: Full nomenclature A part, region or Manufactured component referred to 5 Object DEB: Manufactured Object Component as a distinct unit, such Component as a surface or a layer. Medical Intended use, context, DEB: Medical Application 6 Application, function or outcome of MeSH: Disease or Syndrome Disease the manufactured object. Therapeutic or Preventive Procedure or condition Anatomical Abnormality Manufactured Characteristics inherent DEB: Manufactured Object Features 7 Object or given during processing MeSH: Chemical Viewed Structurally Features to a manufactured object or its components. The configuration, form or texture associated with 8 Structure DEB: Structure a manufactured object or its components. Associated A cellular or biological DEB: Associated Biological Process 9 Biological process that the MeSH: Organ or Tissue Function Process manufactured object is Molecular Function designed to cause Cell Function or support, or is measured Biological function to affect. NCIT: Cellular Process Material A planned process which DEB: Material Processing 10 Processing results in physical changes CHMO: Material Processing in a specified input material. The reported cell line MeSH: Cell 11 Cell or primary cell NCIT: Cell type. UBERON: Bone cell, cardiocyte circulating cell connective tissue cell epithelial cell The species and /or 12 Species breed used in the MeSH: Mammal study. A tissue or an organ MeSH: Tissue, 13 Tissue mentioned in the Body Location or organ study as the target Body part, organ or or test system for organ component the biomaterial object UBERON: Tissue or medical device. PREMEDONTO:Body Part, Organ, Organ System 47
Table: Continued Annotation Definition Resource and imported classes category Adverse An unfavourable or DEB: Adverse Effects 14 Effects unintended disease, sign, MeSH: Pathologic Function or symptom (including an abnormal laboratory finding) that is temporally associated with the use of a medical device or MeSH: Laboratory Procedure, biomaterial. Molecular Biology Research Technique The reported laboratory DEB: Research Technique Research technique or instrument NCIT: Research Technique 15 Technique used in an experimental NPO: Instrument study. IOBC: Microscope OBI: Assay EDAM: Imaging, Sample preparation ONTOTOXNUC: Outil The effect associated with Effect manufactured object in 16 On Biological DEB: Effect On Biological System a specific test system System (cells, tissue or organism). The study set up, Study 17 such as in vitro, DEB: Study Type Type in vivo, or clinical. 48
You can also read