The Biomaterials Annotator: a system for ontology-based concept annotation of biomaterials text - Association for Computational ...

Page created by Ken Crawford
 
CONTINUE READING
The Biomaterials Annotator: a system for ontology-based concept annotation of biomaterials text - Association for Computational ...
The Biomaterials Annotator: a system for ontology-based concept
                      annotation of biomaterials text
       Javier Corvi1 , Carla V. Fuenteslópez2 , José M. Fernández1 , Josep Lluis Gelpí1,3 ,
            Maria Pau Ginebra4 , Salvador Capella-Gutierrez1 and Osnat Hakimi1,5
                    1
                      Barcelona Supercomputing Center (BSC), Barcelona, Spain
    2
      Institute of Biomedical Engineering, Botnar Research Centre, University of Oxford, UK
          3
            Dept. of Biochemistry and Molecular Biology, University of Barcelona, Spain
    4
      Dept. of Material Science and Engineering, Universitat Politècnica de Catalunya, Spain
    5
      Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya, Spain
                                    javier.corvi@bsc.es
                                  osnat.hakimi@gmail.com

                      Abstract                                generated by the field is primarily available in
                                                              text documents, such as peer-reviewed research
    Biomaterials are synthetic or natural materials
                                                              papers, patents and conference abstracts. This ever-
    used for constructing artificial organs, fabricat-
    ing prostheses, or replacing tissues. The last            growing knowledge is increasingly harder for re-
    century saw the development of thousands of               searchers to efficiently discover, organize and use.
    novel biomaterials and, as a result, an expo-             For example, systematically reviewing the appli-
    nential increase in scientific publications in the        cations and scaffolds made of a commonly used
    field. Large-scale analysis of biomaterials and           polymer such as poly-lactic-glycolic-acid (PLGA),
    their performance could enable data-driven                requires skimming through >12,000+ abstracts
    material selection and implant design. How-
                                                              (MEDLINE search on October 2020). Among the
    ever, such analysis requires identification and
    organization of concepts, such as materials and           different alternatives for the automated processing
    structures, from published texts. To facilitate           of available texts, Natural Language Processing
    future information extraction and the applica-            (NLP) workflows for information retrieval and in-
    tion of machine-learning techniques, we de-               dexing offer a much needed automated solution.
    veloped a semantic annotator specifically tai-            Such computational workflows facilitate informa-
    lored for the biomaterials literature. The Bio-           tion discovery, information extraction and orga-
    materials Annotator has been implemented fol-             nization, saving researchers time and minimizing
    lowing a modular organization using software
                                                              manual tasks.
    containers for the different components and or-
    chestrated using Nextflow as workflow man-                   Central to information retrieval and indexing is
    ager. Natural language processing (NLP) com-              the extraction of concepts of interest, also known
    ponents are mainly developed in Java. This                as Named Entity Recognition (NER). NER is an
    set-up has allowed named entity recognition               integral part of NLP workflows as it allows the au-
    of seventeen classes relevant to the biomateri-           tomated identification of concepts in unstructured
    als domain. Here we detail the development,
                                                              text and its assignment to a pre-defined category
    evaluation and performance of the system, as
    well as the release of the first collection of an-
                                                              or class. For example, in the field of biomaterials,
    notated biomaterials abstracts. We make both              categories may include ‘Biomaterials’ (‘PLGA’),
    the corpus and system available to the commu-             ‘Structures’ (such as ‘fibre’ or ‘sponge’) and ‘Tis-
    nity to promote future efforts in the field and           sues’ (such as ‘tendon’ or ‘bone’). The use of NER
    contribute towards its sustainability.                    to automatically recognize entities enables several
                                                              downstream applications, including machine trans-
1   Introduction                                              lation, information retrieval and indexing as well
The last two decades saw the field of biomate-                as automated question-answering mechanisms.
rials and tissue engineering grow from a small                   The recognition of concepts in the biomaterials
niche of biomedical research to an extensive do-              domain is complicated by language and terminol-
main, covering topics such as functional materi-              ogy originating from multiple scientific disciplines
als, cell-material interaction, nanomaterials and             (chemistry, engineering, biology, medicine). A sig-
medical devices. The expanding scientific data                nificant challenge lies in identifying and combining
                                                         36
                  Proceedings of the Second Workshop on Scholarly Document Processing, pages 36–48
                           June 10, 2021. ©2021 Association for Computational Linguistics
Figure 1: Overview of the workflow used in the development and validation of the Biomaterials Annotator.

lexical and semantic resources across domains, and          materials Annotator, along with an annotated
thus to date there are no automatic biomaterials-           collection of biomaterials literature, are publicly
specific NER systems to detect relevant entities of         available for use and further development at
interest.                                                   https://github.com/ProjectDebbie/
   Here, we report the development of the first             Biomaterials_annotator.
biomaterials-specific annotation system, designed
to recognize named entities from seventeen differ-          2   Previous relevant work
ent categories, reflecting the complexity and diver-
sity of contemporary biomaterials research. When            Unlike general purpose NLP systems, biomedical
considering approaches for the design of the Bio-           domain-specific tools require advanced approaches
materials Annotator, i.e. lexical versus machine            to detect classes of interest such as diseases and
learning-based NER, such as CRF or RNN, it was              gene names. In this area, there are several well-
essential to consider the number of desired annota-         known and widely used systems and tools, such
tion categories in the system (17) and the absence          as Metamap (Aronson, 2001) and Pubtator (Wei
of an annotated corpora for text mining efforts for         et al., 2013), which were developed using dif-
the majority of them. Based on these premises, it           ferent NER methodologies and approaches, e.g.
was concluded that training a model for each cat-           gazetteers and hand-made rule-based NER; ma-
egory was impractical. Thus, the system relies on           chine learning-based NER that includes Hidden
manually curated and validated lexical resources.           Markov Model, Conditional Random Fields (CRF)
                                                            and recurrent neural network (RNN); and Hybrid
   To cover entities from different domains, mul-           NER (Lee et al., 2003; McCallum and Li, 2003;
tiple nomenclatures, vocabularies, and especially           Song et al., 2004; GuoDong and Jian, 2004; Zhao,
ontologies were identified and combined. To com-            2004; Yeh et al., 2005; Campos et al., 2013; Song
bine these resources into a single instrument, the          et al., 2018; Dang et al., 2018; Kaewphan et al.,
Devices, Experimental scaffolds and Biomateri-              2018; Cho and Lee, 2019). In the context of this
als Ontology (DEB) was used providing the log-              work, generic text mining tools previously devel-
ical schema and the definition of key categories            oped for the eTRANSAFE project (Pognan et al.,
(Hakimi et al., 2020).                                      2021) have been adapted and further developed for
  The resulting open source-system, the Bio-                the biomaterials domain.
                                                       37
Whilst there are a handful of ontologies in the
biomaterials domain (such the nanoparticle ontol-
ogy NPO (Thomas et al., 2011) and the Bone and
Cartilage Tissue Engineering Ontology BCTEO
(Viti et al., 2014)), to the best of the our knowledge,
the DEB ontology (Hakimi et al., 2020) is the only
one that is tailored to link and curate concepts for
Biomaterials NER. Therefore, it specifically cov-
ers different categories related to the biomaterials
domain.

3     Methodology overview
To develop the annotation tool, the workflow
in Figure 1 was followed. Various corpora of
abstracts were used during the development,                    Figure 2: Overview of the components of the Bioma-
                                                               terials Annotator; including the standard preprocessing
covering the general biomaterials literature.
                                                               steps (A) and the biomaterials named entity recognition
These corpora included a collection of man-                    steps (B).
ually curated abstracts of the biomedical
polymer polydioxanone (Fuenteslópez et al 2021,
manuscript in preparation, GitHub repository:                  includes the steps previously outlined. This
https://github.com/ProjectDebbie/                              component is written in JAVA and it uses the
polydioxanone_project) and a previously                        Stanford CoreNLP Natural Language Processing
published biomaterials gold standard collection                open source toolkit. The use of the Stanford
(Hakimi et al., 2020), comprising a total of 1222              CoreNLP API benefits greatly from the provision
abstracts. Corpora were passed through four steps,             of a set of stable, robust, high quality linguistic
each described in detail below. The first step was a           analysis components, which can be easily invoked
text preprocessing component (section 3.1). This               for common scenarios (Manning et al., 2014).
was followed by concept recognition (section 3.2),             The Standard NLP preprocessing component
initially using the MeSH controlled vocabulary                 is available at https://gitlab.bsc.es/
and the DEB ontology. Then, the annotations                    inb/text-mining/generic-tools/
were evaluated by two domain experts, errors                   nlp-standard-preprocessing.
were flagged up and additional lexical resources
were added through keyword searches. Concept                   3.2   Concept recognition
recognition, manual evaluation and curation of                 Here, we developed NER components to detect
lexical resources were performed in an iterative               relevant entities related to the biomaterials domain
manner during the development phase (section 3.3)              based on the DEB ontology in conjunction with
over 1000 abstracts. Once the development phase                other open relevant resources, such as the National
was completed, validation by domain experts was                Cancer Institute Thesaurus (NCIT) and (CHEBI).
performed on 199 independent abstracts which                   A comprehensive description of the resources
were not used during the development process                   included in this work is described in section 3.3 and
(section 4.1). The resulting annotated collection              Appendix B. Lexical resources were transformed
of biomaterials abstracts was published as open                into gazetteers to be used in the NER process
source.                                                        (Figure 2.B). Internally, the NER process was
                                                               divided into three main steps; the MSH Annotator,
3.1    Text preprocessing                                      which annotates relevant categories from the
To prepare the text for concept recognition,                   MeSH terminology; the Dictionary Annotator,
several Natural Language Processing (NLP) steps                which annotated predefined categories from the
were performed, namely: tokenization, sentence                 relevant dictionaries; and the Post-processing step
splitting, part-of-speech tagging and morpho-                  in which specific rules were executed. These in-
logical analysis (Figure 2.A). We developed the                clude entity recognition based on lexical rules and
Standard NLP preprocessing component which                     the removal of false positives, among other tasks.
                                                          38
The MSH Annotator is available at https:// detect concepts:
github.com/ProjectDebbie/debbie_                      (Token.pos == "JJ" | Token.pos == "NN") To-
umls_annotations; and the Dictionary An- ken.root == "cell"
notator and Post-processing rules are available at    The inclusion of this rule enables the detection of
https://github.com/ProjectDebbie/                     Cell-type concepts that are not present in the dic-
DEBBIE_dictionaries_annotations.                      tionaries; e.g. “neuronal cells”, “cancer cell” and
These components are instances of the nlp-gate- “osteogenic cells”. The discovery of such rules is
generic-component (https://gitlab.bsc. a continuous work; future Biomaterials Annotator
es/inb/text-mining/generic-tools/                     versions will improve the lexical rules included to
nlp-gate-generic-component),                        a detect relevant concepts.
generic component developed in JAVA by our               Another key problem to address is the recogni-
team that uses the General Architecture of Text       tion of abbreviation concepts; to achieve this prob-
Engineering (GATE) software (Cunningham et al., lem we developed a post-processing rule based
2013) and can be parametrized with gazetteers and     on a modified version of Schwartz’s algorithm
specific handmade JAPE (Java Annotation Patterns      (Schwartz and Hearst, 2003). First, we detect
Engine) rules. Using the Biomaterials Annotator, an abbreviation candidate given a text pattern
every recognised entity is labelled with one of the   (regex=”(?:[a-z]*[A-Z][a-z]*)2,”); subsequently,
categories (Figure 3.A-B).                            the Schwartz’s algorithm is applied to detect
   The nlp-gate-generic-component was configured      whether   there is a definition that matches the abbre-
to use the GATE Flexible gazetteer, allowing to       viation candidate in the sentence; in such case, if
capture the words present in the text as well as      the definition has an entity class assigned to it, we
their morphological root value (lemma). This en- annotate the abbreviation with the same class. As
sures that inflected forms of a word (i.e. plural, an example, in the following sentence: “We investi-
singular, -ing forms, tense) can be recognised and    gated the potential of human bone marrow derived
analysed as a single item. In addition, the dictio- Mesenchymal stem cells (MSCs) for neuronal differ-
naries used in the Biomaterials Annotator include     entiation in vitro....”; the expression ‘Mesenchymal
preferred synonyms, providing the possibility to      stem cells’ is annotated under the ’Cell’ category.
map terms semantically to a specific primary con- But the ‘MSCs’ abbreviation is not; moreover in the
cept. Thus, the Biomaterials Annotator performs       rest of the text the abbreviation is used instead of
semantic mapping of the annotations by, not only      its long form. The abbreviation-rule detects ‘MSCs’
recognizing the category of an entity, but also link- as an abbreviation of ‘Mesenchymal stem cells’ and
ing it to the appropriate entry in a well-established assigns the ’Cell’ category to all the ‘MSCs’ men-
resource (Jovanović and Bagheri, 2017). For exam- tions in the text.
 ple, the terms: “canine”, “dogs” and “dog” were
 all annotated under the ‘Species’ category; and in-          3.3   Terminologies and ontologies curation
 side the features of each annotation the preferred                 and manual evaluation
 term is “dog”. This enables the retrieval of all the
                                                              One of the main hurdles to biomaterials concept
 corresponding terms using the single search term
                                                              recognition is the interdisciplinary nature of the
‘dog’.
                                                              domain, with scientific texts containing concepts
   To complete the annotation process, the annota-            from various fields such as biology, chemistry,
tor executes JAPE rules for post-processing func-             engineering and medicine. A key objective of
tions, such as the removal of false positives and the         the Biomaterials Annotator was to identify and
addition of information to each annotation. Added             combine lexical resources from the different
information includes the ontology source, the ontol-          domains in order to cover as many relevant
ogy term id, the lemma and the preferred synonym              biomaterials concepts as possible. Resources were
(Figure 3.C-D). In addition, JAPE rules were run              identified using a manual, bottom-up approach,
to identify entities using lexical constraints and            with cyclic re-iteration, as shown in Figure 1. As a
address the concept recognition of abbreviations.             starting point, abstracts were annotated with the
Rule-based entities recognition can use part-of-              automated NER approach described in section 3.2
speech of concepts, as an example; in the case                using the DEB ontology. After each annotation
of ‘Cell’ category, there is a lexical rule defined to        round, manual evaluation was performed by
                                                         39
Figure 3: The appearance of an annotated abstract on GATE’s user interface. A) Shows the annotated text and in B)
colored labels used to tag annotations by their respective category. C) Information regarding each annotation (type,
position, features), and in D) a specific example: “polymers”: "BiomaterialType" entity with their corresponding
features.

                                                 ANNOTATION CATEGORIES

                                      BIOLOGICALLY
                                                                               RESEARCH
                  BIOMATERIAL             ACTIVE              TISSUE                                CELL TYPE
                                       SUBSTANCE                              TECHNIQUE

                                                                                SCANNING
                       SILK
    EXAMPLE:                               RGD               PANCREAS           ELECTRON          FIBROBLAST
                     FIBROIN
                                                                               MICROSCOPY

                       DEB                 DEB                MESH                MESH                MESH
          MAIN
    SEMANTIC          MESH                MESH               UBERON               NCIT                 NCIT
  RESOURCES:
                      CHEBI               NCIT            PREMEDONTO              NPO                UBERON

Figure 4: An illustration of part of the annotation schema (showing five out of seventeen categories), which relies
on multiple semantic resources for each annotation category. Full details of all categories and resources are in the
Appendix A.

                                                        40
two domain experts. The evaluation entailed                 tion set, resulting from the execution of the Bio-
reviewing samples of 10-20 abstracts in order               materials Annotator, was manually verified by 9
to flag annotation errors and highlight relevant            biomaterials experts. The validation process was
concepts which were missed by the system. The               performed using the GATE user interface, where
flagged terms were used for keyword search in               annotations made by the system were presented
the Bioportal (Martínez-Romero et al., 2017) and            to the biomaterials experts with the possibility of
the UMLS metathesaurus browser (Bodenreider,                adding missing annotations, removing false annota-
2004). Through these searches, specific classes             tions and editing annotations. Once the expert had
(or ‘parent concepts’) within relevant ontologies           finished the validation of a document, it was saved
and UMLS ‘semantic types’ were identified                   as a different validated copy.
and added to the annotation schema (part of                    Two strategies to indicate if two annotations
which is shown for illustration in Figure 4). The           agree or not were considered; a strict approach,
resources identified belonged to three categories:          in which the annotations agree if they have the
ontologies, controlled vocabularies and nomen-              same origin and end offset, and a more relaxed or
clatures. All the ontologies were open access and           “lenient” approach, where the annotations agree
downloaded in .owl format from NCBO Bioportal               if they overlap at some point. For example, in
(http://bioportal.bioontology.org)                          the partial approach the biomaterials expressions
(Martínez-Romero et al., 2017). The controlled              “polyvinyl alcohol” and “polyvinyl” are considered
vocabulary (MeSH) was downloaded for use                    to agree, which does not happen in the strict agree-
under license from the UMLS Terminology                     ment.
Services. The GMDN nomenclature was kindly
                                                               To measure the performance of the NER system,
provided in .xml format by the GMDN agency
                                                            the set validated by the experts was taken as the
under a license. A summary of all the used
                                                            gold standard and the system’s output as the set
resources is in Appendix B. For the resources
                                                            to be validated. Table 1 shows the recall, preci-
to be used in the annotation system, relevant
                                                            sion and F-score, including the strict and lenient
classes were imported into a dictionary (gazetteer)
                                                            approaches, as well as an average between them.
containing the following fields: the term, its
                                                            The global scores calculated for the system are also
label (annotation category), the ID and whenever
                                                            presented, obtaining an 0.75 strict F-score, 0.79
available, a preferred synonym. The extraction of
                                                            lenient F-Score and 0.77 average F-score.
desired classes from the ontologies to dictionary
format was done using an implementation of                     Figure 5 shows the average F-scores calculated
owlready2 (Lamy, 2017) and the code (named                  for the different categories. Categories with an av-
owl2dict_light) is available in an open github              erage F-score above 0.8 are considered categories
repository as part of the (https://github.                  in which the concepts are satisfactorily covered by
com/ProjectDebbie/OWL2DICT).                   The          the resources used (e.g. Structure, BiomaterialType
resulting dictionaries are also available                   and Tissue). On the other hand, there are categories
(https://github.com/ProjectDebbie/                          with lower scores, and specifically: ’Biomaterial’,
DEBBIE_dictionaries_annotations).                           ’Biologically active substance’ and ’Cell’. The cat-
These were in turn used by the Dictionary                   egories Biomaterials and Biologically active sub-
Annotator component for concept recognition as              stance had significantly reduced accuracy because
described above in section 3.2.                             they include many ambiguous concepts. Some ma-
                                                            terials may act as a biomaterial in one set-up, but
4     Results                                               can also be measured in terms of cell expression
                                                            or non-biomaterial use in another set-up (e.g. col-
4.1    Expert validation                                    lagen). In the latter case, the human validator will
To measure the efficiency of a text mining system           delete the ‘Biomaterial’ annotation. Solving this
such as the Biomaterials Annotator, it is fundamen-         kind of ambiguities will require other strategies,
tal to organize and plan a validation stage aimed at        such as specific lexical rules or machine learning
indicating the performance of the system. The Bio-          approaches. Another factor impeding good quality
materials Annotator was validated through manual            annotations of Biomaterials is the lack of good qual-
verification of the validation set, an independent          ity vocabulary of medical polymers. Polymer and
collection of 199 abstracts. The annotated valida-          co-polymer naming is notoriously variable, with
                                                       41
Precision   Recall     F-score    Precision     Recall     F-score     Precision     Recall     F-score
 Category
                                  - strict   - strict   - strict   - lenient    - lenient   - lenient   - average   - average   - average
 Adverse Effects                    0.94       0.75       0.82          1           0.8        0.87        0.97        0.77        0.85
 Associated Biological Process      0.88       0.68       0.77        0.94         0.73        0.82        0.91        0.71        0.79
 Biologically Active Substance      0.58       0.43       0.49         0.7         0.52        0.59        0.64        0.48        0.54
 Biomaterial                        0.76       0.47       0.57        0.83         0.52        0.63        0.79        0.49         0.6
 Biomaterial Type                   0.92       0.88        0.9        0.98         0.93        0.95        0.95         0.9        0.92
 Cell                               0.76       0.59       0.66        0.84         0.65        0.73         0.8        0.62        0.69
 Effect On Biological System        0.96       0.69       0.79          1          0.72        0.82        0.98        0.71         0.8
 Manufactured Object                0.96       0.86        0.9        0.96         0.86         0.9        0.96        0.86         0.9
 Manufactured Object Component      0.91       0.84       0.86        0.91         0.84        0.87        0.91        0.84        0.87
 Manufactured Object Features       0.68       0.59       0.62        0.71         0.61        0.65        0.69         0.6        0.64
 Material Processing                0.78        0.6       0.67        0.83         0.63        0.71        0.81        0.61        0.69
 Medical Application                0.68       0.49       0.57        0.82          0.6        0.69        0.75        0.54        0.63
 Research Technique                 0.81       0.63       0.71        0.87         0.68        0.76        0.84        0.66        0.73
 Species                            0.97       0.79       0.87        0.99         0.81        0.89        0.98         0.8        0.88
 Structure                          0.93       0.77       0.84        0.95         0.79        0.86        0.94        0.78        0.85
 Study Type                         0.96       0.95       0.96        0.99         0.97        0.98        0.98        0.96        0.97
 Tissue                              0.8       0.77       0.78        0.85         0.82        0.83        0.82         0.8        0.81
 Global                             0.84       0.69       0.75        0.89         0.73        0.79        0.86        0.71        0.77

Table 1: The performance of the Biomaterials Annotator in a test set of 199 abstracts validated manually by 9
experts.

some named by their commercial name or abbrevi-                         notator and deposits them in an open access
ation. To address these inaccuracies, future work                       database. DEBBIE is under development and
will involve expanding relevant ontologies using                        can be accessed at https://github.com/
tools such as Spike (Taub-Tabib et al., 2020), in-                      ProjectDebbie/DEBBIE_pipeline.
cluding additional lexical rules, and adding ma-
                                                                                Category                                 Count
chine learning components.                                                      Adverse Effects                            657
                                                                                Associated Biological Process             6231
4.2   Full system implementation and                                            Biologically Active Substance             7709
      availability                                                              Biomaterial                               5726
                                                                                Biomaterial Type                          1543
A significant challenge for scientific software ap-                             Cell                                       6839
                                                                                Effect On Biological System                972
plications is providing facilities to share, distribute                         Manufactured Object                       5967
and run such systems in a simple and convenient                                 Manufactured Object Component             2307
way. Furthermore, an important concern is the                                   Manufactured Object Features              4200
                                                                                Material Processing                       2728
possibility of replicating the results obtained                                 Medical Application                       3868
during the research. In order to accomplish these                               Research Technique                        3701
requirements and follow good practices, we de-                                  Species                                    2089
                                                                                Structure                                 4136
veloped the Biomaterials Annotator using Docker                                 Study Type                                 1806
as software container technology and Nextflow                                   Tissue                                     9997
as the workflow manager. Through the use of                                     Entities                                  70476
                                                                                Tokens                                   392605
Docker, all the subcomponents of the Biomaterials                               Sentences                                15979
Annotator were individually compartmentalized;                                  Abstracts                                  1222
hosting their own dependencies and programs that
work only inside the isolated container. In addition,                     Table 2: Annotated biomaterials corpus statistics.
the Nextflow workflow manager was used for
the automated orchestration and execution of the
pipeline. By using this architecture, the entire tool,                  4.3    Annotated corpus release
or any of its individual components, can be easily                      Another key objective was to generate the first an-
installed and run in heterogeneous environments.                        notated corpus with entities related to the bioma-
The Biomaterials Annotator is available at                              terials domain. Such a corpus will facilitate the
https://github.com/ProjectDebbie/                                       development and evaluation of text mining mod-
Biomaterials_annotator.                                                 els for automated extraction of biomaterials-related
   The Biomaterials Annotator is part of DEB-                           data from text.
BIE (Database of biomedical materials), a                                  The biomaterials annotated dataset consists of
wider system that retrieves abstracts from                              1222 biomaterials abstracts describing the evalu-
pubmed, annotates using the Biomaterials An-                            ation of biomaterials in either a laboratory or a
                                                                   42
Figure 5: Average F-score of the automated annotations across categories.

clinical setting. Each abstract is individually con-        manner aiming to enhanced performance in key
tained as a separate file under the GATE format.            categories such as Biomaterials and Cells. In addi-
Table 2 shows statistics concerning the number of           tion, future versions of the Biomaterials Annotator
concepts corresponding to the different categories,         will be closely related to the DEBBIE system and
as well as the number of total entities, sentences          include additional functionalities and features de-
and tokens.                                                 veloped to achieve its main objectives.
   The annotated biomaterials corpus is                       The Biomaterials Annotator and the annotated
available and free for use; information                     corpus are open source and available to the com-
to access the corpus can be found at                        munity to promote future efforts in the field and
https://github.com/ProjectDebbie/                           contribute towards its sustainability.
Biomaterials_annotator.

5   Conclusions and future directions                       6   Acknowledgements
In this work we present the Biomaterials Annota-
tor, an ontology-based NER system that identifies           This project has received funding from the Euro-
17 domain specific types of concepts and delivers           pean Union Horizon 2020 programme under the
an annotated biomaterials corpus of 1222 MED-               Marie Skodowska- Curie grant agreement DEB-
LINE articles available for future text mining and          BIE, project number: 751277. O. H. is funded
machine learning efforts. We have carried out a val-        through a Bosch-Aymerich fellowship. J-M.F, S.C-
idation activity to measure the performance of the          G and J-L.P are partly supported by INB Grant
NER system, with the participation of nine bioma-           (PT17/0009/0001 - ISCIII-SGEFI / ERDF). M-P. G.
terials experts, obtaining a global average F-score         acknowledges the ICREA Academia Award from
of 0.77.                                                    Generalitat de Catalunya. J.C. is partly supported
    Future work in the development of the system            by eTRANSAFE (received funding from the Inno-
could involve annotating relations and linking iden-        vative Medicines Initiative 2 Joint Undertaking un-
tified concepts to manufactured biomaterials ob-            der grant agreement No 777365 and support from
jects. It may also include incorporating additional         the European Union’s Horizon 2020 research and
categories using controlled resources. Improve-             innovation programme and EFPIA).
ments to the system will continue in an iterative
                                                       43
References                                                      Christopher Manning, Mihai Surdeanu, John Bauer,
                                                                  Jenny Finkel, Steven Bethard, and David McClosky.
A. R. Aronson. 2001. Effective mapping of biomedical              2014. The Stanford CoreNLP Natural Language
   text to the UMLS Metathesaurus: the MetaMap pro-               Processing Toolkit. In Proceedings of 52nd An-
   gram. Proceedings / AMIA ... Annual Symposium.                 nual Meeting of the Association for Computational
  AMIA Symposium, pages 17–21.                                    Linguistics: System Demonstrations, pages 55–60,
                                                                  Stroudsburg, PA, USA. Association for Computa-
Olivier Bodenreider. 2004. The Unified Medical Lan-               tional Linguistics.
  guage System (UMLS): Integrating biomedical ter-
  minology. Nucleic Acids Research.                             Marcos Martínez-Romero, Clement Jonquet, Martin J.
                                                                 O’Connor, John Graybeal, Alejandro Pazos, and
David Campos, Sérgio Matos, and JoséLuís Oliveira.               Mark A. Musen. 2017. NCBO Ontology Recom-
  2013.    Current Methodologies for Biomedical                  mender 2.0: An enhanced approach for biomedical
  Named Entity Recognition. In Biological Knowl-                 ontology recommendation. Journal of Biomedical
  edge Discovery Handbook, pages 839–868. John Wi-               Semantics.
  ley Sons, Inc.
                                                                Andrew McCallum and Wei Li. 2003. Early results for
Hyejin Cho and Hyunju Lee. 2019. Biomedical                       named entity recognition with conditional random
  named entity recognition using deep neural net-                 fields, feature induction and web-enhanced lexicons.
  works with contextual information. BMC Bioinfor-                In Proceedings of the Seventh Conference on Natu-
  matics, 20(1):735.                                              ral Language Learning at HLT-NAACL 2003, pages
                                                                  188–191.
Hamish Cunningham, Valentin Tablan, Angus Roberts,
  and Kalina Bontcheva. 2013. Getting More Out of               François Pognan, Thomas Steger-Hartmann, Car-
  Biomedical Documents with GATE’s Full Lifecycle                 los Díaz, Niklas Blomberg, Frank Bringezu,
  Open Source Text Analytics. PLoS Computational                  Katharine Briggs, Giulia Callegaro, Salvador
  Biology.                                                        Capella-Gutierrez, Emilio Centeno, Javier Corvi,
                                                                  Philip Drew, William C. Drewe, José M. Fernán-
Thanh Hai Dang, Hoang Quynh Le, Trang M. Nguyen,                  dez, Laura I. Furlong, Emre Guney, Jan A. Kors,
  and Sinh T. Vu. 2018. D3NER: Biomedical named                   Miguel Angel Mayer, Manuel Pastor, Janet Piñero,
  entity recognition using CRF-biLSTM improved                    Juan Manuel Ramírez-Anguita, Francesco Ronzano,
  with fine-tuned embeddings of various linguistic in-            Philip Rowell, Josep Saüch-Pitarch, Alfonso Valen-
  formation. Bioinformatics, 34(20):3539–3546.                    cia, Bob van de Water, Johan van der Lei, Erik van
                                                                  Mulligen, and Ferran Sanz. 2021. The etransafe
Zhou GuoDong and Su Jian. 2004. Exploring deep                    project on translational safety assessment through
  knowledge resources in biomedical name recogni-                 integrative knowledge management: Achievements
  tion. page 96.                                                  and perspectives. Pharmaceuticals, 14(3).
Osnat Hakimi, Josep Luis Gelpi, Martin Krallinger,              Ariel S. Schwartz and Marti A. Hearst. 2003. A simple
  Fabio Curi, Dmitry Repchevsky, and Maria-Pau                    algorithm for identifying abbreviation definitions in
  Ginebra. 2020. The Devices, Experimental Scaf-                  biomedical text. Pacific Symposium on Biocomput-
  folds, and Biomaterials Ontology (DEB): A Tool                  ing. Pacific Symposium on Biocomputing.
  for Mapping, Annotation, and Analysis of Bio-
  materials Data. Advanced Functional Materials,                Hye Jeong Song, Byeong Cheol Jo, Chan Young Park,
  30(16):1909910.                                                 Jong Dae Kim, and Yu Seop Kim. 2018. Com-
                                                                  parison of named entity recognition methodologies
Jelena Jovanović and Ebrahim Bagheri. 2017. Seman-               in biomedical documents. BioMedical Engineering
   tic annotation in biomedicine: The current land-               Online, 17(Suppl 2).
   scape.
                                                                Yu Song, Eunju Kim, Gary Geunbae Lee, and Byoung-
Suwisa Kaewphan, Kai Hakala, Niko Miekka, Tapio                   kee Yi. 2004. POSBIOTM-NER in the shared task
  Salakoski, and Filip Ginter. 2018. Wide-scope                   of BioNLP/NLPBA 2004.
  biomedical named entity recognition and normaliza-
  tion with CRFs, fuzzy matching and character level            Hillel Taub-Tabib, Micah Shlain, Shoval Sadde, Dan
  modeling. Database, 2018(2018):96.                              Lahav, Matan Eyal, Yaara Cohen, and Y. Goldberg.
                                                                  2020. Interactive extractive search over biomedical
Jean Baptiste Lamy. 2017. Owlready: Ontology-                     corpora. ArXiv, abs/2006.04148.
   oriented programming in Python with automatic
   classification and high level constructs for biomed-         Dennis G. Thomas, Rohit V. Pappu, and Nathan A.
   ical ontologies. Artificial Intelligence in Medicine,          Baker. 2011. NanoParticle Ontology for cancer nan-
   80:11–28.                                                      otechnology research. Journal of Biomedical Infor-
                                                                  matics.
Ki-Joong Lee, Young-Sook Hwang, and Hae-Chang
  Rim. 2003. Two-phase biomedical NE recognition                Federica Viti, Silvia Scaglione, Alessandro Orro, and
  based on SVMs.                                                  Luciano Milanesi. 2014. Guidelines for managing
                                                           44
data and processes in bone and cartilage tissue engi-
  neering. BMC Bioinformatics, 15(Suppl 1):S14.
Chih Hsuan Wei, Hung Yu Kao, and Zhiyong Lu. 2013.
  PubTator: a web-based text mining tool for assisting
  biocuration. Nucleic acids research, 41(Web Server
  issue).
Alexander Yeh, Alexander Morgan, Marc Colosimo,
  and Lynette Hirschman. 2005. BioCreAtIvE task
  1A: Gene mention finding evaluation. BMC Bioin-
  formatics, 6(SUPPL.1):1–10.
Shaojun Zhao. 2004. Named entity recognition in
  biomedical texts using an HMM model. (Grefen-
  stette 1994):84.

                                                          45
A   Semantic resources

                   Table 3: List of semantic resources used by the Biomaterials Annotator

          Semantic Resource Name       Acronym         Scope and relevance                   Type
          Chemical Methods                             Methods used to collect chemical
     1                                 CHMO                                                  Ontology
          Ontology                                     experiments data.
          Chemical Entities of                         Compounds of biological relevance,
     2                                 CHEBI                                                 Ontology
          Biological Interest                          macromolecules.
          The Devices, Experimental                    Biomaterials-related concepts,
     3    Scaffolds and Biomaterials   DEB             materials, structures,                Ontology
          Ontology                                     material processing.
          EDAM Bioimaging              EDAM-           Imaging and sample preparation
     4                                                                                       Ontology
          Ontology                     BIOIMAGING      techniques.
          Global Medical Device
     5                                 GMDN            Full names of medical devices.        Nomenclature
          Nomenclature
                                                       Biological concepts including
          Interlinking ontology of                     biological phenomena, diseases,
     6                                 IOBC                                                  Ontology
          biological concepts                          molecular functions,
                                                       research imaging techniques.
                                                       A hierarchically organized
                                                                                             Controlled
     7    Medical Subject Headings     MeSH            vocabulary produced by
                                                                                             vocabulary
                                                       the NLM.
          National Cancer Institute                    Vocabulary for clinical care,         Controlled
     8                                 NCIT
          Thesaurus                                    translational and basic research.     vocabulary
                                                       The description, preparation, and
     9    Nanoparticle ontology        NPO                                                   Ontology
                                                       characterization of nanomaterials.
                                                       Biomedical protocols, instruments,
          Ontology for Biomedical
     10                                OBI             data generated, materials, analysis   Ontology
          Investigations
                                                       performed.
          Ontology of Nuclear                          Models, chemicals, tools, research
     11                                ONTOTOXNUC                                            Ontology
          Toxicity                                     techniques and models.
                                                       Human disease terms, genomic,
          Precision Medicine
     12                                PREMEDONTO      molecular, phenotype, related         Ontology
          Ontology
                                                       medical vocabulary.
                                                       An integrated cross-species
     13   Uber Anatomy Ontology        UBERON                                                Ontology
                                                       anatomy ontology

                                                    46
B   Annotation categories

         Table 4: Annotation categories, their respective semantic resources and imported classes

                Annotation
                               Definition                     Resource and imported classes
                category
                               A non-drug raw material        DEB: Biomaterials
           1    Biomaterial    or substance suitable for      CHEBI: Macromolecule
                               inclusion in systems which     MeSH: Biomedical or Dental Material
                               augment or replace the
                               function of bodily tissues
                               or organs.
                Biomaterial    Classification or nature
           2                                                  DEB: Biomaterial Type
                Types          of biomaterials.
                Biologically   Substance included in a        DEB: Biologically Active Substance
           3    Active         manufactured object in         MeSH: amino acid, peptide, protein
                Substance      order to impart a biological   Biologically Active Substance
                               activity.                      Pharmacologic Substance
                                                              NCIT: Protein Domain
                                                              DEB: Manufactured Object
                Manufactured   A physical object created
           4                                                  MeSH: Medical device
                Object         by hand or machine.
                                                              GMDN: Full nomenclature
                               A part, region or
                Manufactured
                               component referred to
           5    Object                                        DEB: Manufactured Object Component
                               as a distinct unit, such
                Component
                               as a surface or a layer.
                Medical        Intended use, context,         DEB: Medical Application
           6
                Application,   function or outcome of         MeSH: Disease or Syndrome
                Disease        the manufactured object.       Therapeutic or Preventive Procedure
                or condition                                  Anatomical Abnormality
                Manufactured   Characteristics inherent       DEB: Manufactured Object Features
           7
                Object         or given during processing     MeSH: Chemical Viewed Structurally
                Features       to a manufactured object or
                               its components.
                               The configuration, form
                               or texture associated with
           8    Structure                                     DEB: Structure
                               a manufactured object
                               or its components.
                Associated     A cellular or biological       DEB: Associated Biological Process
           9    Biological     process that the               MeSH: Organ or Tissue Function
                Process        manufactured object is         Molecular Function
                               designed to cause              Cell Function
                               or support, or is measured     Biological function
                               to affect.                     NCIT: Cellular Process
                Material       A planned process which        DEB: Material Processing
           10
                Processing     results in physical changes    CHMO: Material Processing
                               in a specified input
                               material.
                               The reported cell line         MeSH: Cell
           11   Cell           or primary cell                NCIT: Cell
                               type.                          UBERON: Bone cell, cardiocyte
                                                              circulating cell
                                                              connective tissue cell
                                                              epithelial cell
                               The species and /or
           12   Species        breed used in the              MeSH: Mammal
                               study.
                               A tissue or an organ           MeSH: Tissue,
           13   Tissue         mentioned in the               Body Location or organ
                               study as the target            Body part, organ or
                               or test system for             organ component
                               the biomaterial object         UBERON: Tissue
                               or medical device.             PREMEDONTO:Body Part,
                                                              Organ, Organ System

                                                        47
Table: Continued
     Annotation
                     Definition                     Resource and imported classes
     category
     Adverse         An unfavourable or             DEB: Adverse Effects
14
     Effects         unintended disease, sign,      MeSH: Pathologic Function
                     or symptom (including an
                     abnormal laboratory
                     finding) that is temporally
                     associated with the use
                     of a medical device or         MeSH: Laboratory Procedure,
                     biomaterial.                   Molecular Biology Research Technique
                     The reported laboratory        DEB: Research Technique
     Research        technique or instrument        NCIT: Research Technique
15
     Technique       used in an experimental        NPO: Instrument
                     study.                         IOBC: Microscope
                                                    OBI: Assay
                                                    EDAM: Imaging,
                                                    Sample preparation
                                                    ONTOTOXNUC: Outil
                     The effect associated with
     Effect
                     manufactured object in
16   On Biological                                  DEB: Effect On Biological System
                     a specific test system
     System
                     (cells, tissue or organism).
                     The study set up,
     Study
17                   such as in vitro,              DEB: Study Type
     Type
                     in vivo, or clinical.

                                            48
You can also read