Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...

Page created by Timothy Strickland
 
CONTINUE READING
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
Journal of the American Medical Informatics Association, 28(3), 2021, 516–532
                                                                                 doi: 10.1093/jamia/ocaa269
                                                        Advance Access Publication Date: 15 December 2020
                                                                                     Research and Applications

Research and Applications

Ambiguity in medical concept normalization: An analysis
of types and coverage in electronic health record datasets

                                                                                                                                    Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
Denis Newman-Griffis ,1,2 Guy Divita,1 Bart Desmet,1 Ayah Zirikly,1
              ,1,3 and Eric Fosler-Lussier2
Carolyn P. Rose

1
 Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA, 2Department of
Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA and 3Language Technologies Institute, Car-
negie Mellon University, Pittsburgh, Pennsylvania, USA

Corresponding Author: Denis Newman-Griffis, 6707 Democracy Blvd, Suite 856, Bethesda, MD 20892, USA; denis.griffis@-
nih.gov
Received 11 February 2020; Revised 13 September 2020; Editorial Decision 11 October 2020; Accepted 17 November 2020

ABSTRACT
Objectives: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental compo-
nent of clinical text analysis. Ambiguity—words or phrases that may refer to different concepts—has been ex-
tensively researched as part of information extraction from biomedical literature, but less is known about the
types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of
ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for ad-
vancing medical concept normalization research.
Materials and Methods: We identified ambiguous strings in datasets derived from the 2 available clinical
corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then
compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language
System (UMLS) to assess how representative available datasets are of ambiguity in clinical language.
Results: We found that
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                517

INTRODUCTION                                                               that our findings will spur additional development of tools and
                                                                           resources for resolving medical concept ambiguity.
Identifying the medical concepts within a document is a key step in
the analysis of medical records and literature. Mapping natural lan-
guage to standardized concepts improves interoperability in docu-          Contributions of this work
ment analysis1,2 and provides the ability to leverage rich, concept-
                                                                           •   We demonstrate that existing MCN datasets in EHR data are
based knowledge resources such as the Unified Medical Language
                                                                               not sufficient to capture ambiguity in MCN, either for evaluating
System (UMLS).3 This process is a fundamental component of di-
                                                                               MCN systems or developing new MCN models. We analyze the
verse biomedical applications, including clinical trial recruitment,4,5
                                                                               3 available MCN EHR datasets and show that only a small por-
disease research and precision medicine,6–8 pharmacovigilance and
                                                                               tion of mention strings have any ambiguity within each dataset,
drug repurposing,9,10 and clinical decision support.11 In this work,
                                                                               and that these observed ambiguities only capture a small subset
we identify distinct phenomena leading to ambiguity in medical con-
                                                                               of potential ambiguity, in terms of the concept unique identifiers
cept normalization (MCN) and describe key gaps in current
                                                                               (CUIs) that match to the strings in the UMLS. Thus, new datasets
approaches and data for normalizing ambiguous clinical language.
                                                                               focused on ambiguity in clinical language are needed to ensure
    Medical concept extraction has 2 components: (1) named entity

                                                                                                                                                     Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
                                                                               the effectiveness of MCN methodologies.
recognition (NER), the task of recognizing where concepts are men-
                                                                           •   We show that current MCN EHR datasets do not provide suffi-
tioned in the text, and (2) MCN, the task of assigning canonical
                                                                               ciently representative normalization data for effective generaliza-
identifiers to concept mentions, in order to unify different ways of
                                                                               tion, in that they have very few mention strings in common with
referring to the same concept. While MCN has frequently been stud-
                                                                               one another and little overlap in annotated CUIs. Thus, MCN re-
ied jointly with NER,12–14 recent research has begun to investigate
                                                                               search should include evaluation on multiple datasets, to mea-
challenges specific to the normalization phase of concept extraction.
                                                                               sure generalization power.
    Three broad challenges emerge in concept normalization. First,
                                                                           •   We present a linguistically motivated and empirically validated
language is productive: practitioners and patients can refer to stan-
                                                                               typology of distinct phenomena leading to ambiguity in medical
dardized concepts in diverse ways, requiring recognition of novel
                                                                               concept normalization, and analyze all ambiguous strings within
phrases beyond those in controlled vocabularies.15–18 Second, a sin-
                                                                               the 3 current MCN EHR datasets in terms of these ambiguity
gle phrase can describe multiple concepts in a way that is more (or
                                                                               phenomena. We demonstrate that multiple distinct phenomena
different) than the sum of its parts.19,20 Third, a single natural lan-
                                                                               affect MCN ambiguity, reflecting a variety of semantic and lin-
guage form can be used to refer to multiple distinct concepts, thus
                                                                               guistic relationships between terms and concepts that inform
yielding ambiguity.
                                                                               both prediction and evaluation methodologies for medical con-
    Word sense disambiguation (WSD) (which often includes phrase
                                                                               cept normalization. Thus, MCN evaluation strategies should be
disambiguation in the biomedical setting) is thus an integral part of
                                                                               tailored to account for different relationships between predicted
MCN. WSD has been extensively studied in natural language proc-
                                                                               labels and annotated labels. Further, MCN methodologies could
essing methodology,21–23 and ambiguous words and phrases in bio-
                                                                               be significantly enhanced by greater integration of the rich se-
medical literature have been the focus of significant research.24–30
                                                                               mantic resources of the UMLS.
WSD research in electronic health record (EHR) text, however, has
focused almost exclusively on abbreviations and acronyms.31–35 A
single dataset of 50 ambiguous strings in EHR data has been devel-         BACKGROUND AND SIGNIFICANCE
oped and studied25,36 but is not freely available for current research.
Two large-scale EHR datasets, the ShARe corpus14 and a dataset by          Linguistic phenomena underpinning clinical ambiguity
Luo et al,37 have been developed for medical concept extraction re-        Lexical semantics distinguishes between 2 types of lexical ambiguity:
search and have been significant drivers in MCN research through           homonymy and polysemy.42,43 Homonymy occurs when 2 lexical
multiple shared tasks.14,38–41 However, their role in addressing am-       items with separate meanings have the same form (eg, “cold” as ref-
biguity in clinical language has not yet been explored.                    erence to a cold temperature or the common cold). Polysemy occurs
                                                                           when one lexical item diverges into distinct but related meanings
                                                                           (eg, “coat” for garment or coat of paint). Polysemy can in turn be
Objective
                                                                           the result of different phenomena, including default interpretations
To understand the role of benchmark MCN datasets in designing
                                                                           (“drink” liquid or alcohol), metaphors, and metonymy (usage of a
and evaluating methods to resolve ambiguity in clinical language,
                                                                           literal association between 2 concepts in a specified domain [eg,
we identified ambiguous strings in 3 benchmark EHR datasets for
                                                                           “Foley catheter on 4/12”] to indicate a past catheterization proce-
MCN and analyzed the causes of ambiguity they capture. Using lexi-
                                                                           dure).42,43 While metaphors are dispreferred in the formal setting of
cal semantic theory and the taxonomic and semantic relationships
                                                                           clinical documentation, the telegraphic nature of medical text44
between concepts captured in the UMLS as a guide, we developed a
                                                                           lends itself to metonymy by using shorter phrases to refer to more
typology of ambiguity in clinical language and categorized each
                                                                           specific concepts, such as procedures.45
string in terms of what type of ambiguity it captures. We found that
multiple distinct phenomena cause ambiguity in clinical language
and that the existing datasets are not sufficient to systematically cap-   Mapping between biomedical concepts and terms: The
ture these phenomena. Based on our findings, we identified 3 key           UMLS
gaps in current research on MCN in clinical text: (1) a lack of repre-     The UMLS is a large-scale biomedical knowledge resource that com-
sentative data for ambiguity in clinical language, (2) a need for new      bines information from over 140 expert-curated biomedical vocabu-
evaluation strategies for MCN that account for different kinds of          laries and standards into a single machine-readable resource. One
relationships between concepts, and (3) underutilization of the rich       central component of the UMLS that directly informs our analysis
semantic resources of the UMLS in MCN methodologies. We hope               of ambiguity is the Metathesaurus, which groups together synonyms
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
518                                                         Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

(distinct phrases with the same meaning, [eg, “common cold” and           MATERIALS AND METHODS
“acute rhinitis”]) and lexical variants (modifications of the same
                                                                          We performed both quantitative and qualitative evaluations of am-
phrase [eg, “acute rhinitis” and “rhinitis, acute”]) of biomedical
                                                                          biguity in 3 benchmark MCN datasets of EHR data. In this section,
terms and assigns them a single CUI. The diversity of vocabularies
                                                                          we first introduce the datasets analyzed in this work and define our
included in the UMLS (each designed for a unique purpose), com-
                                                                          methods for measuring ambiguity in the datasets and in the UMLS.
bined with the expressiveness of human language, means that many
                                                                          We then describe 2 quantitative analyses of ambiguity measure-
different terms can be associated with any one concept (eg, the con-
                                                                          ments within individual datasets and a generalization analysis across
cept C0009443 is associated with the terms cold, common cold, and
                                                                          datasets. Finally, we present our qualitative analysis of ambiguity
acute rhinitis, among others), and any term may be used to refer to
                                                                          types in MCN datasets.
different concepts in different situations (eg, cold may also refer to
C0009264 Cold Temperature in addition to C0009443, as well as to
a variety of other Metathesaurus concepts), leading to ambiguity.
These mappings between terms and concepts are stored in the
                                                                          MCN datasets
MRCONSO UMLS table. In addition to the canonical terms stored             The effect of ambiguity in normalizing medical concepts has been
                                                                          researched significantly more in biomedical literature than in clinical

                                                                                                                                                    Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
in MRCONSO, the UMLS also provides lexical variants of terms,
including morphological stemming, inflectional variants, and agnos-       data. In order to identify knowledge gaps and key directions for
tic word order, provided through the SPECIALIST Lexicon and suite         MCN in the clinical setting, where ambiguity may have direct im-
of tools.46,47 Lexical variants of English-language terms from            pact on automated tools for clinical decision support, we studied the
MRCONSO are provided in the MRXNS_ENG UMLS table. The                     3 available English-language EHR corpora with concept normaliza-
MCN datasets used in this study were annotated for mentions of            tion annotations: SemEval-2015 Task 14,14 CUILESS2016,19 and
concepts in 2 widely used vocabularies integrated into the UMLS:          n2c2 2019 Track 3.37,41 MCN annotations in these datasets are rep-
(1) the U.S. edition of the Systematized Nomenclature of Medicine         resented as UMLS CUIs for the concepts being referred to in the
Clinical Terms (SNOMED CT) vocabulary, a comprehensive clini-             text; as MCN evaluation is performed based on selection of the
cal healthcare terminology, and (2) RxNorm, a standardized no-            specific CUI a given mention is annotated with, we describe dataset
menclature for clinical drugs; we thus restricted our analysis to data    annotation and our analyses in terms of the CUIs used rather than
from these 2 vocabularies.                                                the concepts they refer to. Details of these datasets are presented
                                                                          in Table 1.

Sense relations and ontological distinctions in the                       SemEval-2015
UMLS                                                                      Task 14 of the SemEval-2015 competition investigated clinical text
In addition to mappings from terms to concepts, the UMLS Meta-            analysis using the ShARe corpus, which consists of 531 clinical
thesaurus includes information on semantic relationships between          documents from the MIMIC (Medical Information Mart for Inten-
concepts, such as hierarchical relationships that often correspond to     sive Care) dataset54 including discharge summaries, echocardio-
lexical phenomena such as hypernymy and hyponymy, as well as              gram, electrocardiogram and radiology reports. Each document was
meronymy and holonymy in biological and chemical structures.42            annotated for mentions of disorders and normalized using CUIs
The UMLS has previously been observed to include not only fine-           from SNOMED CT.53 The documents were annotated by 2 profes-
grained ontological distinctions, but also purely epistemological dis-    sional medical coders, with high interannotator agreement of 84.6%
tinctions such as associated findings (eg, C0748833 Open fracture         CUI matches for mentions with identical spans, and all disagree-
of skull vs C0272487 Open skull fracture without intracranial in-         ments were adjudicated to produce the final dataset.38,39 Datasets
jury).48 This yields high productivity for assignment of different        derived from subsets of the ShARe corpus have been used as the
CUIs in cases of ontological distinction, such as reference to            source for several shared tasks.14,39,40,55 The full corpus was used
“cancer” to mean either general cancer disorders or a specific type       for a SemEval-2015 shared task on clinical text analysis,14 split into
of cancer in a context such as a prostate exam, as well what Cruse42      298 documents for training, 133 for development, and 100 for test.
termed propositional synonymy (ie, different senses that yield the        In order to preserve the utility of the test set as an unseen data sam-
same propositional logic interpretation). Additionally, the difficulty    ple for continuing research, we exclude its 100 documents from our
of interterminology mapping at scale means that synonymous terms          analysis, and only analyze the training and development documents.
are occasionally mapped to different CUIs.49

                                                                          CUILESS2016
The role of representative data for clinical ambiguity                    A significant number of mentions in the ShARe corpus were not
Development and evaluation of models for any problem are predi-           mapped to a CUI in the original annotations, either because these
cated on the availability of representative data.50 Prior research has    mentions did not correspond to Disorder concepts in the UMLS or
highlighted the frequency of ambiguity in biomedical literature24,51      because they would have required multiple disorder concepts to an-
and broken biomedical ambiguity into 3 broad categories of ambig-         notate.14 These mentions were later reannotated in the CUI-
uous terms, abbreviations, and gene names,52 but an in-depth char-        LESS2016 dataset, with updated guidelines allowing annotation
acterization of the types of ambiguity relevant to clinical data has      using any CUI in SNOMED CT (regardless of semantic type) and
not yet been performed. In order to understand what can be learned        specified rules for composition.19,56 These data were split into train-
from the available data for ambiguity and identify areas for future       ing and development sets, corresponding to the training and devel-
research, it is critical to analyze both the frequency and the types of   opment splits in the SemEval-2015 shared task; the SemEval-2015
ambiguity that are captured in clinical datasets.                         test set was not annotated as part of CUILESS2016.
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                      519

Table 1. Details of MCN datasets analyzed for ambiguity, broken down by data subset

                                                         ShARe Corpus

                                   SemEval-2015                                  CUILESS2016                                  n2c2 2019

                       Training    Development      Combined      Training       Development     Combined                      Training
                                              53                                          19
UMLS version                     2011AA                                      2016AA                                      2017AB37
Source vocabularies         SNOMED CT (United States)                   SNOMED CT (United States)              SNOMED CT (United States), RxNorm
Documents                298       133             431               298        133            431                          100
Samples                 11 554     8003          19 557             3468       1929           5397                         6684
CUI-less samples         3480      1933           5413                7          1              8                           368
Unique strings           3654      2477           5064              1519        750           2011                         3230
Unique CUIs              1356      1144           1871              1384        639           1738                         2331

 The number of CUI-less samples, which were excluded from our analysis, is provided for each dataset.

                                                                                                                                                           Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
 CUI: concept unique identifier; MCN: medical concept normalization; SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms; UMLS: Unified
Medical Language System.

n2c2 2019                                                                        We defined dataset ambiguity, our measure of observed ambigu-
As the SemEval-2015 and CUILESS2016 datasets only included                   ity, as the number of unique CUIs associated with a given string
annotations for mentions of disorder-related concepts, Luo et al37           when aggregated over all samples in a dataset. In order to account
annotated a new corpus to provide mention and normalization data             for minor variations in EHR orthography and annotations, we used
for a wider variety of concepts; these data were then used for a 2019        2 steps of preprocessing on the text of all medical concept mentions
n2c2 shared task on concept normalization.41 The corpus includes             in each dataset: lowercasing and dropping determiners (a, an, and
100 discharge summaries drawn from the 2010 i2b2/VA shared task              the).
on clinical concept extraction, for which documents from multiple                To measure potential ambiguity, we defined UMLS ambiguity as
healthcare institutions were annotated for all mentions of problems,         the number of CUIs a string is associated with in the UMLS Meta-
treatments, and tests.57 All annotated mentions in the 100 docu-             thesaurus. While the Metathesaurus is necessarily incomplete,
                                                                             15,58,59
ments chosen were normalized using CUIs from SNOMED CT and                            and the breadth and specificity of concepts covered means
RxNorm; 2.7% were annotated as “CUI-less.” All mentions were                 that useful term-CUI links are often missing,60 it nonetheless func-
dually annotated with an adjudication phase; preadjudication inter-          tions as a high-coverage heuristic to measure the number of senses a
annotator agreement was a 67.69% CUI match (note this figure in-             term may be used to refer to. However, the expressiveness of natural
cluded comparison of mention bounds in addition to CUI matches,              language means that direct dictionary lookup of any given string in
lowering measured agreement; CUI-level agreement alone was not               the Metathesaurus is likely to miss valid associated CUIs: linguistic
evaluated). Luo et al37 split the corpus into training and test sets. As     phenomena such as coreference allow seemingly general strings to
with the SemEval-2015 data, we only analyzed the training set in or-         take very specific meanings (eg, “the failure” referring to a specific
der to preserve the utility of the n2c2 2019 test set as an unseen data      instance of heart failure); other syntactic phenomena such as predi-
sample for evaluating generalization in continuing MCN research.             cation, splitting known strings with a copula (see Figure 1 for exam-
                                                                             ples), and inflection (eg, “defibrillate” vs “defibrillation” vs
                                                                             “defibrillated”) lead to further variants. We therefore use 3 strate-
Measuring ambiguity                                                          gies to match observed strings with terms in the UMLS and the con-
We utilize 2 different ways of measuring the ambiguity of a string:
                                                                             cepts that they are linked to (referred to as candidate matching
dataset ambiguity, which measures the amount of observed ambigu-
                                                                             strategies), with increasing degrees of inclusivity across term varia-
ity for a given medical term as labeled in an MCN dataset, and
                                                                             tions, to measure the number of CUIs a medical concept string may
UMLS ambiguity, which measures the amount of potential ambigu-
                                                                             be matched to in the UMLS:
ity for the same term by using the UMLS as a reference for normali-
zation. A key desideratum for developing and evaluating statistical          •    Minimal preprocessing—each string was preprocessed using the
models of MCN, which we demonstrate is not achieved by bench-                     2 steps described previously (lowercasing and dropping deter-
mark datasets in practice, is that the ambiguity observed in research             miners; eg, “the EKG” becomes “ekg”), and compared with
datasets is as representative as possible of the potential ambiguity              rows of the MRCONSO table of the UMLS to identify the num-
that may be encountered in medical language “in the wild.” For ex-                ber of unique CUIs canonically associated with the string. The
ample, the term cold can be used as an acronym for Chronic Ob-                    same minimal preprocessing steps were applied to the String field
structive Lung Disease (C0024117), but if no datasets include                     of MRCONSO rows for matching.
examples of cold being used in this way, we are unable to train or           •    Lexical variant normalization—each string was first processed
evaluate the effectiveness of an MCN model for normalizing “cold”                 with minimal preprocessing, and then further processed with the
to this meaning. The problem becomes more severe if other senses of               luiNorm tool, 61 a software package developed to map lexical
cold, such as C0009264 Cold Temperature, C0234192 Cold Sensa-                     variants (eg, defibrillate, defibrillated, defibrillation) to the same
tion, or C0010412 Cold Therapy are also not included in annotated                 string. (Mapping lexical variants to the same underlying string is
datasets. While exhaustively capturing instances of every sense of a              typically referred to as “normalization” in the natural language
given term in natural utterances is impractical at best, significant              processing literature; for clarity between concept normalization
gaps between observed and potential ambiguity impose a fundamen-                  and string normalization in this article, we refer to “lexical vari-
tal limiting factor on progress in MCN research.                                  ant normalization” for this aspect of string processing through-
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
520                                                               Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

                                                                                                                                                                   Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
Figure 1. Examples of mismatch between medical concept mention string (bold underlined text) and assigned concept unique identifier (shown under the men-
tion), due to (A) coreference and (B) predication. The right side of each subfigure shows the results of querying the Unified Medical Language System (UMLS) for
the mention string with exact match (top) and the preferred string for the annotated concept unique identifier (bottom).

    out.) luiNorm-processed strings were then compared with prepo-                concordance with prior findings of greater ambiguity from shorter
    pulated lexical variants in the MRXNS_ENG table of the UMLS                   terms,63 we evaluated the correlation between string length and am-
    to identify the set of associated CUIs. We used the release of lui-           biguity measurements, using linear regression with fit measured by
    Norm that corresponded to the UMLS version each dataset was                   the r2 statistic. We used 2 different measures of string length: (1)
    annotated with (2011 for SemEval-2015, 2016 for CUI-                          number of tokens in the string (calculated using SpaCy64 tokeniza-
    LESS2016, and 2017 for n2c2 2019), and compared with the                      tion) and (2) number of characters in the string.
    MRXNS_ENG table of the corresponding UMLS release.
•   Word match—each string was first processed with minimal pre-
                                                                                  Cross-dataset generalization analysis
    processing; we then queried the UMLS search application pro-
                                                                                  In order to assess how representative the annotated MCN datasets
    gramming interface for the preprocessed string, using the word-
                                                                                  are for generalizing to unseen data, we evaluated ambiguity in 3
    level search option,62 which searches for matches in the Metathe-
                                                                                  kinds of cross-dataset generalization: (1) from training to develop-
    saurus with each of the words in the query string (ie, “Heart dis-
                                                                                  ment splits in a single dataset (using SemEval-2015 and CUI-
    ease, acute” will match with strings including any of the words
                                                                                  LESS2016), (2) between different datasets drawn from the same
    heart, disease, or acute). We counted the number of unique CUIs
                                                                                  corpus (comparing SemEval-2015 to CUILESS2016), and (3) be-
    returned as our measure of ambiguity.
                                                                                  tween datasets from different corpora (comparing SemEval-2015
    In all cases, since each dataset was only annotated using CUIs                and CUILESS2016 to n2c2 2019). In each of these settings, we first
linked to specific vocabularies in the UMLS (SNOMED CT for all 3                  identified the portion of strings shared between the datasets being
datasets, plus RxNorm for n2c2 2019), we restricted our ambiguity                 compared, a key component of generalization, and then analyzed
analysis to the set of unique UMLS CUIs linked to the source vocab-               the CUIs associated with these shared strings in each dataset. Shared
ularies used for annotation. Thus, if a string in SemEval-2015 was                strings were analyzed along 3 axes to measure the generalization of
associated with 2 CUIs linked to SNOMED CT and an additional                      MCN annotations between datasets: (1) differences in ambiguity
CUI linked only to International Classification of Diseases–Ninth                 type (for strings which were ambiguous in both datasets), (2) over-
Revision (and therefore not eligible for use in SemEval-2015 annota-              lap in the annotated CUI sets, and (3) the coverage of word-level
tion), we only counted the 2 CUIs linked to SNOMED CT in mea-                     UMLS match for retrieving the combination of CUIs present be-
suring its ambiguity.                                                             tween the 2 datasets. Finally, we broke down our analysis of CUI set
                                                                                  overlap to identify strings whose dataset ambiguity increases when
                                                                                  combining datasets and strings with fully disjoint annotated CUI
Quantitative analyses: Ambiguity measurements and                                 sets.
generalization
Ambiguity measurements within datasets                                            Qualitative analysis of ambiguous strings
Given the set of unique mention strings in each MCN dataset, we                   Inspired by methodological research demonstrating that different
measured each string’s ambiguity in terms of dataset ambiguity,                   modeling strategies are appropriate for phenomena such as meton-
UMLS ambiguity with minimal preprocessing, UMLS ambiguity                         ymy65,66 and hyponymy,67–71 we analyzed the ambiguous strings in
with lexical variant normalization, and UMLS ambiguity with word                  each dataset in terms of the following lexical phenomena: homon-
match, using the version of the UMLS each dataset was originally                  ymy, polysemy, hyponymy, meronymy, co-taxonomy (sibling rela-
annotated with. We also evaluated the coverage of the UMLS                        tionships), and metonymy (definitions provided in discussion of our
matching results, in terms of whether they included the CUIs associ-              ambiguity typology in the Results).42,43 To measure the ambiguity
ated with each string in the dataset. For compositional annotations               captured by the available annotations, we performed our analysis
in CUILESS2016, we treated a label as covered if any of its compo-                only at the level of dataset ambiguity (ie, only using the CUIs associ-
nent CUIs were included in the UMLS results. Finally, to establish                ated with the string in a single dataset). For each ambiguous string
Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets - Oxford ...
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                    521

in a dataset, we manually reviewed the string, its associated CUIs in         omitted dataset annotations that were not found in the correspond-
the dataset in question, and the medical concept mention samples              ing version of the UMLS (including “CUI-less,” annotation errors,
where the string occurs in the dataset, and answered the following 2          and CUIs remapped within the UMLS); Table 2 provides the number
questions:                                                                    of these annotations and the number of strings analyzed. We ob-
                                                                              served 5 main findings from our results:
    Question 1: How are the different CUIs associated with this
                                                                                   Observed dataset ambiguity is not representative of potential
    string related to one another?
                                                                              UMLS ambiguity. Only 2%-14% of strings were ambiguous at the
    This question regarded only the set of annotated CUIs and was             dataset level (across SemEval-2015, CUILESS2016, and n2c2 2019)
agnostic to specific samples in the dataset. We evaluated 2 aspects of        (ie, these strings were associated with more than 1 CUI within a sin-
the relationship or relationships between these CUIs: (1) which (if           gle dataset). However, many more strings exhibited potential ambi-
any) of the previous lexical phenomena was most representative of             guity, as measured in the UMLS with our 3 candidate matching
the relationship between the CUIs and (2) if any phenomenon partic-           strategies. Using minimal preprocessing, in the cases in which at
ular to medical language was a contributing factor. We conducted              least 1 CUI was identified for a query string, 13%-23% of strings
this analysis only in terms of the high-level phenomena outlined pre-         were ambiguous; lexical variant normalization increased this to

                                                                                                                                                         Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
viously, rather than leveraging the formal semantic relationships be-         17%-28%, and word matching yielded 68%-88% ambiguous
tween CUIs in the UMLS; while these relationships are powerful for            strings. The difference was most striking in n2c2 2019: only 58
downstream applications, they include a variety of nonlinguistic              strings were ambiguous in the dataset (after removing “CUI-less”
relationships and were too fine-grained to group a small set of am-           samples), but 2,119 strings had potential ambiguity as measured
biguous strings informatively.                                                with word matching, a 37-fold increase.
                                                                                   Many dataset strings do not match any CUIs. A total of 40%-
    Question 2: Are the CUI-level differences reflected in the annotations?
                                                                              43% of strings in SemEval-2015 and n2c2 did not yield any CUIs
    Given the breadth of concepts in the UMLS, and the subjective             when using minimal preprocessing to match to the UMLS (74% in
nature of annotation, we analyzed whether the CUI assignments in              CUILESS2016). Lexical variant normalization increased coverage
the dataset samples were meaningfully different, and if they reflected        somewhat, with 38%-41% of strings failing to match to the UMLS
the sample-agnostic relationship between the CUIs.                            in SemEval-2015 and n2c2 (70% in CUILESS2016); word-level
                                                                              search had much better coverage, only yielding empty results for
Ambiguity annotations                                                         23%-27% of CUIs in SemEval-2015 and n2c2 and 57% in CUI-
Based on our answers to these questions, we determined 3 variables            LESS2016. As CUILESS2016 strings often combine multiple con-
for each string:                                                              cepts, matching statistics are necessarily pessimistic for this dataset.
                                                                                   UMLS matching misses a significant portion of annotated CUIs.
•   Category—the primary linguistic or conceptual phenomenon un-              As shown in Figure 2, for the subset of SemEval-2015 and n2c2
    derlying the observed ambiguity;                                          2019 strings in which any of the UMLS matching strategies yielded
•   Subcategory—the biomedicine-specific phenomenon contribut-                at least 1 candidate CUI, 8%-23% of the time the identified candi-
    ing to a pattern of ambiguity; and                                        date sets did not include any of the CUIs with which those strings
•   Arbitrary—the determination of whether the CUIs’ use reflected            were actually annotated in the datasets. This was consistent for both
    their conceptual difference.                                              strings returning only 1 CUI and strings returning multiple CUIs.
                                                                              The complex mentions in CUILESS2016 again yielded lower cover-
    Annotation was conducted by 4 authors (D.N.-G., G.D., B.D.,               age: 24%-30% of strings returning only 1 CUI did not return a cor-
A.Z.) in 3 phases: (1) initial categorization of the ambiguous strings        rect one and 25%-42% of strings returning multiple CUIs missed all
in n2c2 2019 and SemEval-2015, (2) validation of the resulting ty-            of the annotated CUIs. This indicates that coverage of both syno-
pology through joint annotation and adjudication of 30 random am-             nyms and lexical variants in the UMLS remains an active challenge
biguous strings from n2c2 2019, and (3) reannotation of all datasets          for clinical language.
with the finalized typology. For further details, please see the Sup-              High coverage yields high ambiguity. Table 2 provides statistics
plementary Appendix.                                                          on the number of CUIs returned for strings from the 3 datasets in
                                                                              which any of the UMLS candidate matching strategies yielded more
Handling compositional CUIs in CUILESS2016                                    than 1 CUI. Both minimal preprocessing and lexical variant normal-
Compositional annotations in CUILESS2016 presented 2 variables                ization yield a median CUI count per ambiguous string of 2, al-
for ambiguity analysis: single- or multiple-CUI annotations, and am-          though higher maxima (maximum 11 CUIs with minimal
biguity of annotations across samples. We categorized each string in          preprocessing, maximum 20 CUIs with lexical variant normaliza-
CUILESS as having (1) unambiguous single-CUI annotation, (2) un-              tion) skew the mean number of CUIs per string higher. By contrast,
ambiguous multi-CUI annotation, (3) ambiguous single-CUI annota-              word matching, which achieves the best coverage of dataset strings
tion, or (4) ambiguous annotations with both single- and multi-CUI            by far, ranges in median ambiguity from 8 in CUILESS2016 to 20 in
labels. The latter 2 categories were considered ambiguous for our             n2c2 2019, with maxima over 100 CUIs in all 3 datasets. Thus, ef-
analysis.                                                                     fectively choosing between a large number of candidates is a key
                                                                              challenge for high-coverage MCN.
                                                                                   Character-level string length is weakly negatively correlated with
RESULTS                                                                       ambiguity measures. Following prior findings that shorter terms
Quantitative measurements of string ambiguity                                 tend to be more ambiguous in biomedical literature,63 we observed
Ambiguity within individual datasets                                          r2 values above 0.5 between character-based string length and data-
Figure 2 presents the results of our string-level ambiguity analysis          set ambiguity, UMLS ambiguity with minimal preprocessing, and
across the 3 datasets. For a fair comparison with the UMLS, we                UMLS ambiguity with lexical variant normalization in all 3 EHR
522                                                             Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

                                                                                                                                                               Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
Figure 2. String-level ambiguity in medical concept normalization (MCN) datasets, by method of measuring ambiguity. (A) Measurements of observed string am-
biguity in MCN datasets, in terms of strings that are annotated with exactly 1 concept unique identifier (CUI) (unambiguous) or more than 1 (ambiguous). (B)
Measurements of potential string ambiguity in the Unified Medical Language System (UMLS), using minimal preprocessing, lexical variant normalization, and
word match strategies to identify candidate CUIs. Shown below each UMLS matching chart is the coverage of dataset CUIs yielded by each matching strategy,
broken down by ambiguous (A) and unambiguous (U) strings. Coverage is calculated as the intersection between the CUIs matched to a string in the UMLS and
the set of CUIs that string is annotated with in the dataset.

datasets. Word-level match yielded very weak correlation (r2 ¼ 0.39             within-corpus setting (comparing SemEval-2015 to CUILESS2016),
for SemEval-2015, 0.23 for CUILESS2016, and 0.39 for n2c2).                     and cross-corpus setting (comparing SemEval-2015 and CUI-
Token-level measures of string length followed the same trends as               LESS2016 to n2c2 2019). We observed 3 main findings in our
the character-level measure, although typically with lower r2. Full             results:
results of these analyses are provided in Supplementary Table 1 and                 The majority of strings are unique to the dataset they appear in.
Supplementary Figures 1–3.                                                      The overlap in sets of medical concept mention strings between
                                                                                datasets ranged from
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                              523

Table 2. Results of string-level ambiguity analysis, as measured in MCN datasets (observed ambiguity) and in the UMLS with 3 candidate
matching strategies (potential ambiguity)

                                                                                               SemEval-2015              CUILESS2016               n2c2 2019
UMLS version                                                                                     2011AA                    2016AA                   2017AB

Dataset                                   Total strings                                            3203                      2006                      3230
                                          Ambiguous strings before OOV filtering                  148 (5)                  273 (14)                   62 (2)
                                          Strings with OOV annotations                               48                        1                        99
                                          OOV annotations only (omitted)                             29                        1                        95
                                          Strings with at least 1 CUI                              3174                      2005                      3135
                                          Ambiguous strings after OOV filtering                   132 (4)                  273 (14)                   58 (2)
                                          Minimum/median/maximum ambiguity                         2/2/6                    2/2/24                     2/2/3
                                          Mean ambiguity                                         2.1 6 0.5                 2.9 6 2.5                2.1 6 0.3
Minimal preprocessing                     Strings with at least 1 CUI                            1808 (57)                 530 (26)                 1874 (60)
                                          Ambiguous strings                                      230 (13)                   97 (18)                  423 (23)

                                                                                                                                                                   Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
                                          Minimum/median/maximum ambiguity                        2/2/11                    2/2/11                    2/2 /11
                                          Mean ambiguity                                         2.5 6 1.1                 2.7 6 1.5                2.5 6 1.2
Lexical variant normalization             Strings with at least 1 CUI                            1882 (59)                 592 (30)                 1942 (62)
                                          Ambiguous strings                                      318 (17)                  137 (23)                  550 (28)
                                          Minimum/median/maximum ambiguity                        2/2/17                    2/2/18                    2/2/20
                                          Mean ambiguity                                         2.8 6 1.9                 3.1 6 2.5                2.9 6 2.1
Word match                                Strings with at least 1 CUI                            2314 (73)                 877 (44)                 2414 (77)
                                          Ambiguous strings                                      1774 (77)                 594 (68)                 2119 (88)
                                          Minimum/median/maximum ambiguity                        2/9/123                   2/8/107                 2/20/120
                                          Mean ambiguity                                        20.9 6 25.5               19.5 6 24.5              31.1 6 29.2

  Values are n, n (%), or mean 6 SD, unless otherwise indicated. All dataset annotations that were not found in the corresponding version of the UMLS (OOVs)
were omitted from this analysis; any strings that had only OOV annotations in the dataset were omitted entirely. For each of the 3 UMLS matching strategies, the
number of strings for which at least 1 CUI was identified is provided along with the corresponding percentage of non-OOV dataset strings. The number of ambig-
uous strings in each subset (ie, strings for which more than 1 CUI was matched after OOV annotations were filtered out) is given along with the corresponding
percentage of strings for which at least 1 CUI was identified. Ambiguity statistics are calculated on ambiguous strings only and report minimum, median, maxi-
mum, mean, and standard deviation of number of CUIs identified for the string.
  CUI: concept unique identifier; MCN: medical concept normalization; OOV: out of vocabulary; UMLS: Unified Medical Language System.

    Most shared strings have differences in their annotated CUIs. In              sets between the 2 datasets were originally unambiguous in each
all comparisons other than the SemEval-2015 training and develop-                 dataset, indicating that memorizing term-CUI normalization would
ment datasets, over 45% of the strings shared between a pair of data-             work perfectly in each dataset but fail entirely on the other.
sets were annotated with at least 1 CUI that was only present in 1 of
the 2 datasets (18% of strings even in the case of SemEval-2015 train-            Ambiguity typology
ing and development datasets). Of these, between 33%-74% had                      We identified 12 distinct causes of the ambiguity observed in the
completely disjoint sets of annotated CUIs between the 2 datasets com-            datasets, organized into 5 broad categories. Table 3 presents our ty-
pared. While many of these cases reflected hierarchical differences, a            pology, with examples of each ambiguity type; brief descriptions of
significant number involved truly distinct senses between datasets.               each overall category are provided subsequently. We refer the inter-
    UMLS match consistently fails to yield all annotated CUIs across              ested reader to the Supplementary Appendix for a more in-depth dis-
combined datasets. Reflecting our earlier observations within indi-               cussion.
vidual datasets, word-level UMLS matching was able to fully re-
trieve all CUIs in the combined annotation set for a fair portion of              Polysemy
shared strings (42%-55% in within-dataset comparisons; 54%-85%                    We combined homonymy (completely disjoint senses) and polysemy
in cross-corpus comparisons). However, it failed to retrieve any of               (distinct but related senses)42,43 under the category of Polysemy for
the combined CUIs for 26%-54% of the shared strings.                              our analysis. While we observed instances of both homonymy and
    Figure 4 illustrates changes in ambiguity for shared strings be-              polysemy, we found no actionable reason to differentiate between
tween the dataset pairs, in terms of how many strings had nonidenti-              them, particularly as other phenomena causing polysemy (eg, me-
cal annotated CUI sets, how many strings in each dataset would                    tonymy, hyponymy) were covered by other categories. Thus, the Po-
increase in ambiguity if the CUI sets were combined, and how many                 lysemy category captured cases in which more specific phenomena
of these would switch from being unambiguous to ambiguous when                    were not observed and the annotated CUIs were clearly distinct
combining cross-dataset CUI sets. We found that of the sets of                    from one another. As there is extensive literature on resolving abbre-
strings shared between any pair of datasets with nonidentical CUI                 viations and acronyms,31–35 we treated cases involving abbrevia-
annotations, between 50% and 100% of the strings in each of these                 tions as a dedicated subcategory (Abbreviation; our other
sets were annotated with at least 1 CUI in one of the datasets that               subcategory was Nonabbreviation).
was not present in the other. Further, up to 66% of the strings with
any annotation differences went from being unambiguous to ambig-                  Metonymy
uous when CUI sets were combined across the dataset pairs. Finally,               Clinical language is telegraphic, meaning that complex concepts are
we found that up to 89% of the strings that had fully disjoint CUI                often referred to by simpler associated forms. Normalizing these
524                                                               Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

                                                                                                                                                                     Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021

Figure 3. Generalization analysis for medical concept normalization annotations, in 3 settings: (A, B) between training and development sets in the same datasets,
(C, D) between 2 datasets drawn from the same electronic health record corpus (both from the ShARe corpus), and (E, F) across annotated corpora. The first col-
umn illustrates the number of unique strings in each sample set in the pair being analyzed, along with the number of strings present in both. The second column
shows the subsets of these shared strings in which the sample sets use at least 1 different concept unique identifier (CUI) for the same string, and the number of
strings in which all CUIs are different between the 2 sample sets. The third column shows for how many of the shared strings the Unified Medical Language Sys-
tem (UMLS) matching with word search identifies some or all of the CUIs annotated for a given string between both sample sets.
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                                   525

                                                                                                                                                                        Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021

Figure 4. Analysis of concept unique identifier (CUI) sets for shared strings in medical concept normalization generalization between datasets, in 3 settings: (A, B)
between training and development sets in the same datasets, (C, D) between 2 datasets drawn from the same electronic health record corpus (both from the
ShARe corpus), and (E, F) across annotated corpora. The left-hand column illustrates (1) the number of shared strings with differences in their CUI annotations;
(2) the proper subset of these strings, within each dataset, in which adding the CUIs from the other dataset would expand the set of CUIs for this string; and (3)
the proper subset of these strings where a string is unambiguous within one or the other dataset but becomes ambiguous when CUI annotations are combined.
The right-hand column displays the portion of shared strings with disjoint CUI set annotations between the 2 datasets in which the string is unambiguous in each
of the datasets independently.
526                                                            Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

Table 3. Ambiguity typology derived from SemEval-2015, CUILESS2016, and n2c2 2019 MCN corpora

Category      Subcategory         Definition                               Example ambiguity

Polysemy      Abbreviation        Abbreviations or acronyms with dis-      Family hx of breast [ca], emphysema           C0006826 Malignant Neoplasms
                                    tinct senses.                          BP 137/80 na 124 [ca] 8.7                     C0201925 Calcium Measurement
              Nonabbreviation     Term ambiguity other than abbrevia-      BP was [elevated] at last 2 visits            C0205250 High (qualitative)
                                    tions or acronyms.                     Her leg was [elevated] after surgery          C0439775 Elevation procedure
Metonymy      Procedure vs        Distinguishes between a medical con-     [Rhythm] revealed sinus tachycardia           C0199556 Rhythm ECG (Procedure)
                Concept             cept and the procedure or action       The [rhythm] became less stable               C0577801 Heart rhythm (Finding)
                                    used to analyze/effect that con-
                                    cept.
              Measurement vs      Distinguishes between a physical         Pt blood work to check [potassium]            C0032821 Potassium (Substance)
               Substance            substance and a measurement of         Sodium 139, [potassium] 4.7                   C0202194 Potassium Measurement
                                    that substance.
              Symptom vs          Distinguishes between a finding be-      Current symptoms include                      C0011570 Mental Depression

                                                                                                                                                             Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
                Diagnosis           ing marked as a symptom or a                [depression]
                                    (possibly diagnosed) disorder.         Hx of chronic [depression]                    C0011581 Depressive disorder
              Other               All other types of metonymy.             Transfusion of [blood]                        C0005767 Blood (Body Substance)
                                                                           Discovered [blood] at catheter site           C0019080 Hemorrhage
Specificity   Hierarchical        Combines hyponymy and meron-             Cardiac: family hx of [failure]               C0018801 Heart Failure
                                    ymy; corresponds to taxonomic          . . .in left ventricle. This [failure]. . .   C0023212 Left-sided heart failure
                                    UMLS relations.
              Recurrence/         Distinguishes between singular and       No [injuries] at admission                    C0175677 Injury
                Number              plural forms of a finding, or one      Brought to emergency for                      C0026771 Multiple trauma
                                    episode and recurrent episodes.          his [injuries]
Synonymy      Propositional       For a general-purpose application,       Negative skin [jaundice]                      C0022346 Icterus
                Synonyms            the set of CUIs are not meaning-       Increased girth and [jaundice]                C0476232 Jaundice
                                    fully distinct from one another.
              Co-taxonyms         The CUIs are (conceptually or in the     2mg [percodan]                                C0717448 Percodan
                                    UMLS) taxonomic siblings; often        2mg [percodan]                                C2684258 Percodan
                                    overspecification.                                                                     (reformulated 2009)
Error         Semantic            Erroneous CUI assignment, due to         Open to air with no [erythema]                C0041834 Erythema
                                    misinterpretation, confusion with      Edema but no [erythema]                       C0013604 Edema
                                    nearby concept, or other cause.
              Typos               One CUI is a typographical error         [Neoplasm] is adjacent                        C0024651 Malt Grain (Food)
                                    when attempting to enter the other     Infection most likely [neoplasm]              C0027651 Neoplasms
                                    (ie, no real ambiguity).

  Short definitions are provided for each subcategory, along with 2 samples of an example ambiguous string and their normalizations using UMLS CUIs. For a
more detailed discussion, see the Supplementary Appendix.
  CUI: concept unique identifier; UMLS: Unified Medical Language System.

references requires inference from their context: for example, a ref-           observed was ambiguity in grammatical number of a finding, typi-
erence to “sodium” within lab readings implies a measurement of                 cally due to inflection (eg, “no injuries” meaning not a single injury)
sodium levels, a distinct concept in the UMLS. It is noteworthy that            or recurrence (denoted Recurrence/Number).
in some cases, examples of the Metonymy category may be consid-
ered as annotation errors, illustrating the complexity of metonymy
                                                                                Synonymy
in practice; for example, the case of “Sodium 139, [potassium] 4.7”
                                                                                Many strings were annotated with CUIs that were effectively synon-
included in Table 3, annotated as C0032821 Potassium (substance),
                                                                                ymous; we therefore followed Cruse’s42 definition of Propositional
would be better annotated as C0428289 Finding of potassium level.
                                                                                Synonymy, in which ontologically distinct senses nonetheless yield
As these concepts are semantically related (while ontologically dis-
                                                                                the same propositional interpretation of a statement. We also in-
tinct), we included such cases in the category of Metonymy. We ob-
                                                                                cluded Co-taxonymy in this category, typically involving annotation
served 3 primary trends in metonymic annotations: reference to a
                                                                                with either overspecified CUIs or CUIs separated only by negation.
procedure by an associated biological property (Procedure vs Con-
cept), mention of a biological substance to refer to its measurement
(Measurement vs Substance), and the fact that many symptomatic                  Error
findings can also be formal diagnoses (Symptom vs Diagnosis; eg,                A small number of ambiguity cases were due to erroneous annota-
“emphysema,” “depression”). Other examples of Metonymy falling                  tions stemming from 2 causes: (1) typological errors in data entry
outside these trends were placed in the Other subcategory.                      (Typos) and (2) selection of an inappropriate CUI (Semantic).

Specificity                                                                     Ambiguity types in each dataset
The rich semantic distinctions in the UMLS (eg, phenotypic variants             As with our measurements of string ambiguity, we excluded all data-
of a disease) lead to frequent ambiguity of Specificity. The ambiguity          set samples annotated as “CUI-less” for analysis of ambiguity type,
was often taxonomic, captured as Hierarchical; the other pattern                as these reflect annotation challenges beyond the ambiguity level.
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3                                                                              527

Table 4. Results of ambiguity type analysis, showing the number of unique ambiguous strings assigned to each ambiguity type by dataset,
along with the total number of dataset samples in which those strings appear

                                                                 SemEval-2015                         CUILESS2016                             n2c2 2019

Category             Subcategory                            Strings          Samples            Strings           Samples           Strings           Samples

Polysemy             Abbreviation                              4                 59                 6               178                 7                  33
                     Nonabbreviation                           2                  2                12               302                 6                  28
Metonymy             Procedure vs Concept                      0                  0                 7                25                 9                  23
                     Measurement vs Substance                  0                  0                 0                 0                 9                  93
                     Symptom vs Diagnosis                     20                 62                20               166                 2                   5
                     Other                                     2                  3                 6                22                 5                  29
Specificity          Hierarchical                             50                103                87               776                 7                  26
                     Recurrence/Number                         8                 24                 3                 6                 0                   0
Synonymy             Propositional Synonyms                   23                 26                64               354                 8                  26

                                                                                                                                                                  Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
                     Co-taxonyms                               9                 11                64               837                 4                  13
Error                Typos                                    25                 25                 0                 0                 0                   0
                     Semantic                                  8                 11                22               109                 1                   1
Total (unique)                                               148                326               273              2775                58                 295

  Some strings were assigned multiple ambiguity types, and are counted for each; the number of affected samples was estimated for each type in these cases. The
sample counts given for error subcategories represent the actual count of misannotated samples. The total number of unique ambiguous strings and associated
samples analyzed in each dataset is presented in the last row.

Figure 5. Distribution of ambiguity types within each dataset, in terms of (A) the unique strings assigned each ambiguity type and (B) the number of samples in
which those strings occur. The number of strings and samples belonging to each typology category is shown within each bar portion.

However, we retained samples with annotation errors and CUIs                      tion, shown in Figure 6. Arbitrary rates varied across datasets, with
remapped within the UMLS, as these samples inform MCN evalua-                     the fewest cases in SemEval-2015 and the most in n2c2 2019.
tion in these datasets, and ambiguity type analysis did not require di-           Metonymy (Symptom vs Diagnosis), Specificity (Hierarchical), and
rect comparison to string-CUI associations in the UMLS. This                      Synonymy (Co-taxonyms) were all arbitrary in more than 50% of
increased the number of ambiguous strings in SemEval-2015 from                    cases.
132 to 148; ambiguous string counts in CUILESS2016 and n2c2
2019 were not affected. Table 4 presents the frequency of each am-
biguity type across our 3 datasets. All but 21 strings (3 in SemEval-
2015, 18 in CUILESS2016) exhibited a single ambiguity type (ie, all
                                                                                  DISCUSSION
CUIs were related in the same way). To compare the distribution of                Ambiguity is a key challenge in medical concept normalization.
ambiguity categories across datasets, we visualized their relative fre-           However, relatively little research on ambiguity has focused on clini-
quency in Figure 5. Polysemy and Metonymy strings were most                       cal language. Our findings demonstrate that clinical language exhib-
common in n2c2 2019, while Specificity was the plurality category                 its distinct types of ambiguity, such as clinical patterns in metonymy
in SemEval-2015 and Synonymy was most frequent in CUI-                            and specificity, in addition to well-studied problems such as abbrevi-
LESS2016. The sample-wise distribution, included in Table 4, fol-                 ation expansion. These results highlight 3 key gaps in the literature
lowed the string-wise distribution, except for Polysemy, which                    for MCN ambiguity: (1) a significant gap between the potential am-
included multiple high-frequency strings in SemEval-2015 and CUI-                 biguity of medical terms and their observed ambiguity in EHR data-
LESS2016.                                                                         sets, creating a need for new ambiguity-focused datasets; (2) a need
    Finally, we visualized the proportion of strings within each ambi-            for MCN evaluation strategies that are sensitive to the different
guity type considered arbitrary (at the sample level) during annota-              kinds of relationships between concepts observed in our ambiguity
528                                                             Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3

                                                                                                                                                               Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021
Figure 6. Percentage of ambiguous strings in each ambiguity type annotated as arbitrary, by dataset. Synonymy (Propositional Synonyms) and both Error subca-
tegories are omitted, as they are arbitrary by definition.

typology; and (3) underutilization of the extensive semantic resour-            actly matched the gold CUI. On this view, a predicted CUI is either
ces of the UMLS in recent MCN methodologies. We discuss each of                 exactly right or completely wrong. However, as illustrated by the
these points in the following sections, and propose specific next               distinct ambiguity types we observed, in many cases a CUI other
steps toward closing these gaps to advance the state of MCN re-                 than the gold label may be highly related (eg, “Heart failure” and
search. We conclude by noting the particular role of representative             “Left-sided heart failure”), or even propositionally synonymous. As
data in the deep learning era and providing a brief discussion of the           methodologies for MCN improve and expand, alternative evalua-
limitations of this study that will inform future research on ambigu-           tion methods leveraging the rich semantics of the UMLS can help to
ity in MCN.                                                                     distinguish between a system with a related misprediction from a
                                                                                system with an irrelevant one. A wide variety of similarity and relat-
                                                                                edness measures that utilize the UMLS to compare medical concepts
The next phase of research on clinical ambiguity needs
                                                                                have been proposed,72–75 presenting a fruitful avenue for develop-
dedicated datasets
                                                                                ment of new MCN evaluation strategies.
The order of magnitude difference between the number of CUIs an-
                                                                                    It is important to note, however, that equivalence classes and
notated for each string in our 3 datasets, and the number of CUIs
                                                                                similarity measures will often be task or domain specific. For exam-
found through word match to the UMLS suggests that our current
                                                                                ple, 2 heart failure phenotypes may be equivalent for presenting
data resources cover only a small subset of medically relevant ambi-
                                                                                summary information in an EHR dashboard but may be highly dis-
guity. Differences in ambiguity across multiple datasets provide
                                                                                tinct for cardiology-specific text mining or applications with de-
some improvement in addressing this coverage gap and clearly indi-
                                                                                tailed requirements such as clinical trial recruitment. While
cate the value of evaluating new MCN methods on multiple datasets
                                                                                dedicated evaluation metrics for each task would be impractical, a
to improve ambiguity coverage. However, the ShARe and MCN
                                                                                trade-off between generalizability and sensitivity to the needs of dif-
corpora were designed to capture an in-depth sample of clinical lan-
                                                                                ferent applications represents an area for further research.
guage, rather than a sample with high coverage of specific challenges
like ambiguity. As MCN research continues to advance, more fo-
cused datasets capturing specific phenomena are needed to support
                                                                                The UMLS offers powerful semantic tools for high-
development and evaluation of methodologies to resolve ambiguity.
                                                                                coverage candidate identification
Savova et al25 followed the protocol used in designing the biomedi-
                                                                                Our cross-dataset comparison clearly demonstrates the value of uti-
cal NLM WSD corpus24 to develop a private dataset containing a
                                                                                lizing inclusive UMLS-based matching to identify a high-coverage
set of highly ambiguous clinical strings; adapting and expanding this
                                                                                set of candidate CUIs for a medical concept, though the lack of
protocol with resources such as MIMIC-III54 offers a proven ap-
                                                                                100% coverage reinforces the value of ongoing research on syno-
proach to collect powerful new datasets.
                                                                                nym identification.60 Inclusive matching, of course, introduces addi-
                                                                                tional noise: luiNorm can overgenerate semantically invalid variants
Distinct ambiguity phenomena in MCN call for different                          due to homonymy,76 such as mapping “wound” in “injury or
evaluation strategies                                                           wound” to “wind,” and mapping both “left” and “leaves” to
MCN systems are typically evaluated in terms of accuracy,39,55 cal-             “leaf”; word-level search, meanwhile, requires very little to yield a
culated as the proportion of samples in which the predicted CUI ex-             match and generates very large candidate sets, such as 120 different
You can also read