Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets
Journal of the American Medical Informatics Association, 28(3), 2021, 516–532
doi: 10.1093/jamia/ocaa269
Advance Access Publication Date: 15 December 2020
Research and Applications

Denis Newman-Griffis,1,2 Guy Divita,1 Bart Desmet,1 Ayah Zirikly,1 Carolyn P. Rose,1,3 and Eric Fosler-Lussier2

1Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA; 2Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA; and 3Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Corresponding Author: Denis Newman-Griffis, 6707 Democracy Blvd, Suite 856, Bethesda, MD 20892, USA; denis.griffis@nih.gov

Received 11 February 2020; Revised 13 September 2020; Editorial Decision 11 October 2020; Accepted 17 November 2020

Downloaded from https://academic.oup.com/jamia/article/28/3/516/6034899 by guest on 29 May 2021

ABSTRACT

Objectives: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity—words or phrases that may refer to different concepts—has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research.

Materials and Methods: We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited.
We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language. Results: We found that
INTRODUCTION

Identifying the medical concepts within a document is a key step in the analysis of medical records and literature. Mapping natural language to standardized concepts improves interoperability in document analysis1,2 and provides the ability to leverage rich, concept-based knowledge resources such as the Unified Medical Language System (UMLS).3 This process is a fundamental component of diverse biomedical applications, including clinical trial recruitment,4,5 disease research and precision medicine,6–8 pharmacovigilance and drug repurposing,9,10 and clinical decision support.11 In this work, we identify distinct phenomena leading to ambiguity in medical concept normalization (MCN) and describe key gaps in current approaches and data for normalizing ambiguous clinical language.

Medical concept extraction has 2 components: (1) named entity recognition (NER), the task of recognizing where concepts are mentioned in the text, and (2) MCN, the task of assigning canonical identifiers to concept mentions, in order to unify different ways of referring to the same concept. While MCN has frequently been studied jointly with NER,12–14 recent research has begun to investigate challenges specific to the normalization phase of concept extraction.

Three broad challenges emerge in concept normalization. First, language is productive: practitioners and patients can refer to standardized concepts in diverse ways, requiring recognition of novel phrases beyond those in controlled vocabularies.15–18 Second, a single phrase can describe multiple concepts in a way that is more (or different) than the sum of its parts.19,20 Third, a single natural language form can be used to refer to multiple distinct concepts, thus yielding ambiguity.

Word sense disambiguation (WSD) (which often includes phrase disambiguation in the biomedical setting) is thus an integral part of MCN. WSD has been extensively studied in natural language processing methodology,21–23 and ambiguous words and phrases in biomedical literature have been the focus of significant research.24–30 WSD research in electronic health record (EHR) text, however, has focused almost exclusively on abbreviations and acronyms.31–35 A single dataset of 50 ambiguous strings in EHR data has been developed and studied25,36 but is not freely available for current research. Two large-scale EHR datasets, the ShARe corpus14 and a dataset by Luo et al,37 have been developed for medical concept extraction research and have been significant drivers in MCN research through multiple shared tasks.14,38–41 However, their role in addressing ambiguity in clinical language has not yet been explored.

Objective
To understand the role of benchmark MCN datasets in designing and evaluating methods to resolve ambiguity in clinical language, we identified ambiguous strings in 3 benchmark EHR datasets for MCN and analyzed the causes of ambiguity they capture. Using lexical semantic theory and the taxonomic and semantic relationships between concepts captured in the UMLS as a guide, we developed a typology of ambiguity in clinical language and categorized each string in terms of what type of ambiguity it captures. We found that multiple distinct phenomena cause ambiguity in clinical language and that the existing datasets are not sufficient to systematically capture these phenomena. Based on our findings, we identified 3 key gaps in current research on MCN in clinical text: (1) a lack of representative data for ambiguity in clinical language, (2) a need for new evaluation strategies for MCN that account for different kinds of relationships between concepts, and (3) underutilization of the rich semantic resources of the UMLS in MCN methodologies. We hope that our findings will spur additional development of tools and resources for resolving medical concept ambiguity.

Contributions of this work
• We demonstrate that existing MCN datasets in EHR data are not sufficient to capture ambiguity in MCN, either for evaluating MCN systems or developing new MCN models. We analyze the 3 available MCN EHR datasets and show that only a small portion of mention strings have any ambiguity within each dataset, and that these observed ambiguities only capture a small subset of potential ambiguity, in terms of the concept unique identifiers (CUIs) that match to the strings in the UMLS. Thus, new datasets focused on ambiguity in clinical language are needed to ensure the effectiveness of MCN methodologies.
• We show that current MCN EHR datasets do not provide sufficiently representative normalization data for effective generalization, in that they have very few mention strings in common with one another and little overlap in annotated CUIs. Thus, MCN research should include evaluation on multiple datasets, to measure generalization power.
• We present a linguistically motivated and empirically validated typology of distinct phenomena leading to ambiguity in medical concept normalization, and analyze all ambiguous strings within the 3 current MCN EHR datasets in terms of these ambiguity phenomena. We demonstrate that multiple distinct phenomena affect MCN ambiguity, reflecting a variety of semantic and linguistic relationships between terms and concepts that inform both prediction and evaluation methodologies for medical concept normalization. Thus, MCN evaluation strategies should be tailored to account for different relationships between predicted labels and annotated labels. Further, MCN methodologies could be significantly enhanced by greater integration of the rich semantic resources of the UMLS.

BACKGROUND AND SIGNIFICANCE

Linguistic phenomena underpinning clinical ambiguity
Lexical semantics distinguishes between 2 types of lexical ambiguity: homonymy and polysemy.42,43 Homonymy occurs when 2 lexical items with separate meanings have the same form (eg, “cold” as reference to a cold temperature or the common cold). Polysemy occurs when one lexical item diverges into distinct but related meanings (eg, “coat” for garment or coat of paint). Polysemy can in turn be the result of different phenomena, including default interpretations (“drink” liquid or alcohol), metaphors, and metonymy (usage of a literal association between 2 concepts in a specified domain [eg, “Foley catheter on 4/12”] to indicate a past catheterization procedure).42,43 While metaphors are dispreferred in the formal setting of clinical documentation, the telegraphic nature of medical text44 lends itself to metonymy by using shorter phrases to refer to more specific concepts, such as procedures.45

Mapping between biomedical concepts and terms: The UMLS
The UMLS is a large-scale biomedical knowledge resource that combines information from over 140 expert-curated biomedical vocabularies and standards into a single machine-readable resource. One central component of the UMLS that directly informs our analysis of ambiguity is the Metathesaurus, which groups together synonyms (distinct phrases with the same meaning [eg, “common cold” and “acute rhinitis”]) and lexical variants (modifications of the same phrase [eg, “acute rhinitis” and “rhinitis, acute”]) of biomedical terms and assigns them a single CUI. The diversity of vocabularies included in the UMLS (each designed for a unique purpose), combined with the expressiveness of human language, means that many different terms can be associated with any one concept (eg, the concept C0009443 is associated with the terms cold, common cold, and acute rhinitis, among others), and any term may be used to refer to different concepts in different situations (eg, cold may also refer to C0009264 Cold Temperature in addition to C0009443, as well as to a variety of other Metathesaurus concepts), leading to ambiguity. These mappings between terms and concepts are stored in the MRCONSO UMLS table. In addition to the canonical terms stored in MRCONSO, the UMLS also provides lexical variants of terms, including morphological stemming, inflectional variants, and agnostic word order, provided through the SPECIALIST Lexicon and suite of tools.46,47 Lexical variants of English-language terms from MRCONSO are provided in the MRXNS_ENG UMLS table.

Sense relations and ontological distinctions in the UMLS
In addition to mappings from terms to concepts, the UMLS Metathesaurus includes information on semantic relationships between concepts, such as hierarchical relationships that often correspond to lexical phenomena such as hypernymy and hyponymy, as well as meronymy and holonymy in biological and chemical structures.42 The UMLS has previously been observed to include not only fine-grained ontological distinctions, but also purely epistemological distinctions such as associated findings (eg, C0748833 Open fracture of skull vs C0272487 Open skull fracture without intracranial injury).48 This yields high productivity for assignment of different CUIs in cases of ontological distinction, such as reference to “cancer” to mean either general cancer disorders or a specific type of cancer in a context such as a prostate exam, as well as what Cruse42 termed propositional synonymy (ie, different senses that yield the same propositional logic interpretation). Additionally, the difficulty of interterminology mapping at scale means that synonymous terms are occasionally mapped to different CUIs.49

The role of representative data for clinical ambiguity
Development and evaluation of models for any problem are predicated on the availability of representative data.50 Prior research has highlighted the frequency of ambiguity in biomedical literature24,51 and broken biomedical ambiguity into 3 broad categories of ambiguous terms, abbreviations, and gene names,52 but an in-depth characterization of the types of ambiguity relevant to clinical data has not yet been performed. In order to understand what can be learned from the available data for ambiguity and identify areas for future research, it is critical to analyze both the frequency and the types of ambiguity that are captured in clinical datasets.

MATERIALS AND METHODS

We performed both quantitative and qualitative evaluations of ambiguity in 3 benchmark MCN datasets of EHR data. In this section, we first introduce the datasets analyzed in this work and define our methods for measuring ambiguity in the datasets and in the UMLS. We then describe 2 quantitative analyses of ambiguity measurements within individual datasets and a generalization analysis across datasets. Finally, we present our qualitative analysis of ambiguity types in MCN datasets.

MCN datasets
The effect of ambiguity in normalizing medical concepts has been researched significantly more in biomedical literature than in clinical data. In order to identify knowledge gaps and key directions for MCN in the clinical setting, where ambiguity may have direct impact on automated tools for clinical decision support, we studied the 3 available English-language EHR corpora with concept normalization annotations: SemEval-2015 Task 14,14 CUILESS2016,19 and n2c2 2019 Track 3.37,41 MCN annotations in these datasets are represented as UMLS CUIs for the concepts being referred to in the text; as MCN evaluation is performed based on selection of the specific CUI a given mention is annotated with, we describe dataset annotation and our analyses in terms of the CUIs used rather than the concepts they refer to. The MCN datasets used in this study were annotated for mentions of concepts in 2 widely used vocabularies integrated into the UMLS: (1) the U.S. edition of the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, a comprehensive clinical healthcare terminology, and (2) RxNorm, a standardized nomenclature for clinical drugs; we thus restricted our analysis to data from these 2 vocabularies. Details of these datasets are presented in Table 1.

SemEval-2015
Task 14 of the SemEval-2015 competition investigated clinical text analysis using the ShARe corpus, which consists of 531 clinical documents from the MIMIC (Medical Information Mart for Intensive Care) dataset54 including discharge summaries and echocardiogram, electrocardiogram, and radiology reports. Each document was annotated for mentions of disorders and normalized using CUIs from SNOMED CT.53 The documents were annotated by 2 professional medical coders, with high interannotator agreement of 84.6% CUI matches for mentions with identical spans, and all disagreements were adjudicated to produce the final dataset.38,39 Datasets derived from subsets of the ShARe corpus have been used as the source for several shared tasks.14,39,40,55 The full corpus was used for a SemEval-2015 shared task on clinical text analysis,14 split into 298 documents for training, 133 for development, and 100 for test. In order to preserve the utility of the test set as an unseen data sample for continuing research, we exclude its 100 documents from our analysis, and only analyze the training and development documents.

CUILESS2016
A significant number of mentions in the ShARe corpus were not mapped to a CUI in the original annotations, either because these mentions did not correspond to Disorder concepts in the UMLS or because they would have required multiple disorder concepts to annotate.14 These mentions were later reannotated in the CUILESS2016 dataset, with updated guidelines allowing annotation using any CUI in SNOMED CT (regardless of semantic type) and specified rules for composition.19,56 These data were split into training and development sets, corresponding to the training and development splits in the SemEval-2015 shared task; the SemEval-2015 test set was not annotated as part of CUILESS2016.
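The MRCONSO term-to-concept mappings described in the Background lend themselves to a simple ambiguity count: look up a lightly normalized mention string and count the distinct CUIs it maps to. The following is a minimal sketch on toy data, not the authors' implementation; the function names and the in-memory index are our own, and a real analysis would load term-CUI pairs from the MRCONSO.RRF file of the appropriate UMLS release (a pipe-delimited file in which the CUI is the first field and the term string, STR, is the fifteenth).

```python
from collections import defaultdict

# Illustrative sketch only (not the authors' code): counting how many CUIs a
# mention string can match against MRCONSO-style term-to-CUI mappings.
# Toy in-memory pairs stand in for the real MRCONSO.RRF contents.

DETERMINERS = {"a", "an", "the"}

def preprocess(text):
    """Lowercase and drop determiners, mirroring the paper's 2 preprocessing steps."""
    return " ".join(t for t in text.lower().split() if t not in DETERMINERS)

def build_term_index(term_cui_pairs):
    """Map each preprocessed term string to the set of CUIs it may denote."""
    index = defaultdict(set)
    for cui, term in term_cui_pairs:
        index[preprocess(term)].add(cui)
    return index

def umls_ambiguity(mention, index):
    """Potential ambiguity of a mention under exact (preprocessed) string match."""
    return len(index.get(preprocess(mention), set()))

# Toy term-CUI pairs echoing the "cold" example from the Background.
pairs = [
    ("C0009443", "Common Cold"),
    ("C0009443", "Acute rhinitis"),
    ("C0009443", "cold"),
    ("C0009264", "Cold"),  # Cold Temperature
]
index = build_term_index(pairs)
print(umls_ambiguity("the cold", index))  # -> 2 candidate CUIs
```

Mirroring the paper's restriction to annotation vocabularies, a fuller version would also filter MRCONSO rows by their source vocabulary (SAB) field before indexing.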
Table 1. Details of MCN datasets analyzed for ambiguity, broken down by data subset

                      SemEval-2015 (ShARe corpus)        CUILESS2016 (ShARe corpus)         n2c2 2019
                      Training  Development  Combined    Training  Development  Combined    Training
UMLS version          2011AA53                           2016AA19                           2017AB37
Source vocabularies   SNOMED CT (United States)          SNOMED CT (United States)          SNOMED CT (United States), RxNorm
Documents             298       133          431         298       133          431         100
Samples               11 554    8003         19 557      3468      1929         5397        6684
CUI-less samples      3480      1933         5413        7         1            8           368
Unique strings        3654      2477         5064        1519      750          2011        3230
Unique CUIs           1356      1144         1871        1384      639          1738        2331

The number of CUI-less samples, which were excluded from our analysis, is provided for each dataset.
CUI: concept unique identifier; MCN: medical concept normalization; SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms; UMLS: Unified Medical Language System.

n2c2 2019
As the SemEval-2015 and CUILESS2016 datasets only included annotations for mentions of disorder-related concepts, Luo et al37 annotated a new corpus to provide mention and normalization data for a wider variety of concepts; these data were then used for a 2019 n2c2 shared task on concept normalization.41 The corpus includes 100 discharge summaries drawn from the 2010 i2b2/VA shared task on clinical concept extraction, for which documents from multiple healthcare institutions were annotated for all mentions of problems, treatments, and tests.57 All annotated mentions in the 100 documents chosen were normalized using CUIs from SNOMED CT and RxNorm; 2.7% were annotated as “CUI-less.” All mentions were dually annotated with an adjudication phase; preadjudication interannotator agreement was a 67.69% CUI match (note this figure included comparison of mention bounds in addition to CUI matches, lowering measured agreement; CUI-level agreement alone was not evaluated). Luo et al37 split the corpus into training and test sets. As with the SemEval-2015 data, we only analyzed the training set in order to preserve the utility of the n2c2 2019 test set as an unseen data sample for evaluating generalization in continuing MCN research.

Measuring ambiguity
We utilize 2 different ways of measuring the ambiguity of a string: dataset ambiguity, which measures the amount of observed ambiguity for a given medical term as labeled in an MCN dataset, and UMLS ambiguity, which measures the amount of potential ambiguity for the same term by using the UMLS as a reference for normalization. A key desideratum for developing and evaluating statistical models of MCN, which we demonstrate is not achieved by benchmark datasets in practice, is that the ambiguity observed in research datasets is as representative as possible of the potential ambiguity that may be encountered in medical language “in the wild.” For example, the term cold can be used as an acronym for Chronic Obstructive Lung Disease (C0024117), but if no datasets include examples of cold being used in this way, we are unable to train or evaluate the effectiveness of an MCN model for normalizing “cold” to this meaning. The problem becomes more severe if other senses of cold, such as C0009264 Cold Temperature, C0234192 Cold Sensation, or C0010412 Cold Therapy are also not included in annotated datasets. While exhaustively capturing instances of every sense of a given term in natural utterances is impractical at best, significant gaps between observed and potential ambiguity impose a fundamental limiting factor on progress in MCN research.

We defined dataset ambiguity, our measure of observed ambiguity, as the number of unique CUIs associated with a given string when aggregated over all samples in a dataset. In order to account for minor variations in EHR orthography and annotations, we used 2 steps of preprocessing on the text of all medical concept mentions in each dataset: lowercasing and dropping determiners (a, an, and the).

To measure potential ambiguity, we defined UMLS ambiguity as the number of CUIs a string is associated with in the UMLS Metathesaurus. While the Metathesaurus is necessarily incomplete,15,58,59 and the breadth and specificity of concepts covered means that useful term-CUI links are often missing,60 it nonetheless functions as a high-coverage heuristic to measure the number of senses a term may be used to refer to. However, the expressiveness of natural language means that direct dictionary lookup of any given string in the Metathesaurus is likely to miss valid associated CUIs: linguistic phenomena such as coreference allow seemingly general strings to take very specific meanings (eg, “the failure” referring to a specific instance of heart failure); other syntactic phenomena such as predication, splitting known strings with a copula (see Figure 1 for examples), and inflection (eg, “defibrillate” vs “defibrillation” vs “defibrillated”) lead to further variants. We therefore use 3 strategies to match observed strings with terms in the UMLS and the concepts that they are linked to (referred to as candidate matching strategies), with increasing degrees of inclusivity across term variations, to measure the number of CUIs a medical concept string may be matched to in the UMLS:

• Minimal preprocessing—each string was preprocessed using the 2 steps described previously (lowercasing and dropping determiners; eg, “the EKG” becomes “ekg”), and compared with rows of the MRCONSO table of the UMLS to identify the number of unique CUIs canonically associated with the string. The same minimal preprocessing steps were applied to the String field of MRCONSO rows for matching.
• Lexical variant normalization—each string was first processed with minimal preprocessing, and then further processed with the luiNorm tool,61 a software package developed to map lexical variants (eg, defibrillate, defibrillated, defibrillation) to the same string. (Mapping lexical variants to the same underlying string is typically referred to as “normalization” in the natural language processing literature; for clarity between concept normalization and string normalization in this article, we refer to “lexical variant normalization” for this aspect of string processing throughout.) luiNorm-processed strings were then compared with prepopulated lexical variants in the MRXNS_ENG table of the UMLS to identify the set of associated CUIs. We used the release of luiNorm that corresponded to the UMLS version each dataset was annotated with (2011 for SemEval-2015, 2016 for CUILESS2016, and 2017 for n2c2 2019), and compared with the MRXNS_ENG table of the corresponding UMLS release.
• Word match—each string was first processed with minimal preprocessing; we then queried the UMLS search application programming interface for the preprocessed string, using the word-level search option,62 which searches for matches in the Metathesaurus with each of the words in the query string (ie, “Heart disease, acute” will match with strings including any of the words heart, disease, or acute). We counted the number of unique CUIs returned as our measure of ambiguity.

In all cases, since each dataset was only annotated using CUIs linked to specific vocabularies in the UMLS (SNOMED CT for all 3 datasets, plus RxNorm for n2c2 2019), we restricted our ambiguity analysis to the set of unique UMLS CUIs linked to the source vocabularies used for annotation. Thus, if a string in SemEval-2015 was associated with 2 CUIs linked to SNOMED CT and an additional CUI linked only to International Classification of Diseases–Ninth Revision (and therefore not eligible for use in SemEval-2015 annotation), we only counted the 2 CUIs linked to SNOMED CT in measuring its ambiguity.

Figure 1. Examples of mismatch between medical concept mention string (bold underlined text) and assigned concept unique identifier (shown under the mention), due to (A) coreference and (B) predication. The right side of each subfigure shows the results of querying the Unified Medical Language System (UMLS) for the mention string with exact match (top) and the preferred string for the annotated concept unique identifier (bottom).

Quantitative analyses: Ambiguity measurements and generalization

Ambiguity measurements within datasets
Given the set of unique mention strings in each MCN dataset, we measured each string’s ambiguity in terms of dataset ambiguity, UMLS ambiguity with minimal preprocessing, UMLS ambiguity with lexical variant normalization, and UMLS ambiguity with word match, using the version of the UMLS each dataset was originally annotated with. We also evaluated the coverage of the UMLS matching results, in terms of whether they included the CUIs associated with each string in the dataset. For compositional annotations in CUILESS2016, we treated a label as covered if any of its component CUIs were included in the UMLS results. Finally, to establish concordance with prior findings of greater ambiguity from shorter terms,63 we evaluated the correlation between string length and ambiguity measurements, using linear regression with fit measured by the r2 statistic. We used 2 different measures of string length: (1) number of tokens in the string (calculated using SpaCy64 tokenization) and (2) number of characters in the string.

Cross-dataset generalization analysis
In order to assess how representative the annotated MCN datasets are for generalizing to unseen data, we evaluated ambiguity in 3 kinds of cross-dataset generalization: (1) from training to development splits in a single dataset (using SemEval-2015 and CUILESS2016), (2) between different datasets drawn from the same corpus (comparing SemEval-2015 to CUILESS2016), and (3) between datasets from different corpora (comparing SemEval-2015 and CUILESS2016 to n2c2 2019). In each of these settings, we first identified the portion of strings shared between the datasets being compared, a key component of generalization, and then analyzed the CUIs associated with these shared strings in each dataset. Shared strings were analyzed along 3 axes to measure the generalization of MCN annotations between datasets: (1) differences in ambiguity type (for strings which were ambiguous in both datasets), (2) overlap in the annotated CUI sets, and (3) the coverage of word-level UMLS match for retrieving the combination of CUIs present between the 2 datasets. Finally, we broke down our analysis of CUI set overlap to identify strings whose dataset ambiguity increases when combining datasets and strings with fully disjoint annotated CUI sets.

Qualitative analysis of ambiguous strings
Inspired by methodological research demonstrating that different modeling strategies are appropriate for phenomena such as metonymy65,66 and hyponymy,67–71 we analyzed the ambiguous strings in each dataset in terms of the following lexical phenomena: homonymy, polysemy, hyponymy, meronymy, co-taxonomy (sibling relationships), and metonymy (definitions provided in discussion of our ambiguity typology in the Results).42,43 To measure the ambiguity captured by the available annotations, we performed our analysis only at the level of dataset ambiguity (ie, only using the CUIs associated with the string in a single dataset). For each ambiguous string
Journal of the American Medical Informatics Association, 2021, Vol. 28, No. 3 521 in a dataset, we manually reviewed the string, its associated CUIs in omitted dataset annotations that were not found in the correspond- the dataset in question, and the medical concept mention samples ing version of the UMLS (including “CUI-less,” annotation errors, where the string occurs in the dataset, and answered the following 2 and CUIs remapped within the UMLS); Table 2 provides the number questions: of these annotations and the number of strings analyzed. We ob- served 5 main findings from our results: Question 1: How are the different CUIs associated with this Observed dataset ambiguity is not representative of potential string related to one another? UMLS ambiguity. Only 2%-14% of strings were ambiguous at the This question regarded only the set of annotated CUIs and was dataset level (across SemEval-2015, CUILESS2016, and n2c2 2019) agnostic to specific samples in the dataset. We evaluated 2 aspects of (ie, these strings were associated with more than 1 CUI within a sin- the relationship or relationships between these CUIs: (1) which (if gle dataset). However, many more strings exhibited potential ambi- any) of the previous lexical phenomena was most representative of guity, as measured in the UMLS with our 3 candidate matching the relationship between the CUIs and (2) if any phenomenon partic- strategies. Using minimal preprocessing, in the cases in which at ular to medical language was a contributing factor. 
We conducted this analysis only in terms of the high-level phenomena outlined previously, rather than leveraging the formal semantic relationships between CUIs in the UMLS; while these relationships are powerful for downstream applications, they include a variety of nonlinguistic relationships and were too fine-grained to group a small set of ambiguous strings informatively.

Question 2: Are the CUI-level differences reflected in the annotations?
Given the breadth of concepts in the UMLS, and the subjective nature of annotation, we analyzed whether the CUI assignments in the dataset samples were meaningfully different, and if they reflected the sample-agnostic relationship between the CUIs.

Ambiguity annotations
Based on our answers to these questions, we determined 3 variables for each string:

• Category—the primary linguistic or conceptual phenomenon underlying the observed ambiguity;
• Subcategory—the biomedicine-specific phenomenon contributing to a pattern of ambiguity; and
• Arbitrary—the determination of whether the CUIs' use reflected their conceptual difference.

Annotation was conducted by 4 authors (D.N.-G., G.D., B.D., A.Z.) in 3 phases: (1) initial categorization of the ambiguous strings in n2c2 2019 and SemEval-2015, (2) validation of the resulting typology through joint annotation and adjudication of 30 random ambiguous strings from n2c2 2019, and (3) reannotation of all datasets with the finalized typology. For further details, please see the Supplementary Appendix.

Handling compositional CUIs in CUILESS2016
Compositional annotations in CUILESS2016 presented 2 variables for ambiguity analysis: single- or multiple-CUI annotations, and ambiguity of annotations across samples. We categorized each string in CUILESS as having (1) unambiguous single-CUI annotation, (2) unambiguous multi-CUI annotation, (3) ambiguous single-CUI annotation, or (4) ambiguous annotations with both single- and multi-CUI labels. The latter 2 categories were considered ambiguous for our analysis.

RESULTS

Quantitative measurements of string ambiguity

Ambiguity within individual datasets
Figure 2 presents the results of our string-level ambiguity analysis across the 3 datasets. For a fair comparison with the UMLS, we report potential ambiguity only among the strings for which at least 1 CUI was identified for a query string: with minimal preprocessing, 13%-23% of strings were ambiguous; lexical variant normalization increased this to 17%-28%, and word matching yielded 68%-88% ambiguous strings. The difference was most striking in n2c2 2019: only 58 strings were ambiguous in the dataset (after removing "CUI-less" samples), but 2,119 strings had potential ambiguity as measured with word matching, a 37-fold increase.

Many dataset strings do not match any CUIs. A total of 40%-43% of strings in SemEval-2015 and n2c2 did not yield any CUIs when using minimal preprocessing to match to the UMLS (74% in CUILESS2016). Lexical variant normalization increased coverage somewhat, with 38%-41% of strings failing to match to the UMLS in SemEval-2015 and n2c2 (70% in CUILESS2016); word-level search had much better coverage, only yielding empty results for 23%-27% of strings in SemEval-2015 and n2c2 and 57% in CUILESS2016. As CUILESS2016 strings often combine multiple concepts, matching statistics are necessarily pessimistic for this dataset.

UMLS matching misses a significant portion of annotated CUIs. As shown in Figure 2, for the subset of SemEval-2015 and n2c2 2019 strings in which any of the UMLS matching strategies yielded at least 1 candidate CUI, 8%-23% of the time the identified candidate sets did not include any of the CUIs with which those strings were actually annotated in the datasets. This was consistent for both strings returning only 1 CUI and strings returning multiple CUIs. The complex mentions in CUILESS2016 again yielded lower coverage: 24%-30% of strings returning only 1 CUI did not return a correct one and 25%-42% of strings returning multiple CUIs missed all of the annotated CUIs. This indicates that coverage of both synonyms and lexical variants in the UMLS remains an active challenge for clinical language.

High coverage yields high ambiguity. Table 2 provides statistics on the number of CUIs returned for strings from the 3 datasets in which any of the UMLS candidate matching strategies yielded more than 1 CUI. Both minimal preprocessing and lexical variant normalization yield a median CUI count per ambiguous string of 2, although higher maxima (maximum 11 CUIs with minimal preprocessing, maximum 20 CUIs with lexical variant normalization) skew the mean number of CUIs per string higher. By contrast, word matching, which achieves the best coverage of dataset strings by far, ranges in median ambiguity from 8 in CUILESS2016 to 20 in n2c2 2019, with maxima over 100 CUIs in all 3 datasets. Thus, effectively choosing between a large number of candidates is a key challenge for high-coverage MCN.

Character-level string length is weakly negatively correlated with ambiguity measures. Following prior findings that shorter terms tend to be more ambiguous in biomedical literature,63 we observed r2 values above 0.5 between character-based string length and dataset ambiguity, UMLS ambiguity with minimal preprocessing, and UMLS ambiguity with lexical variant normalization in all 3 EHR datasets. Word-level match yielded very weak correlation (r2 = 0.39 for SemEval-2015, 0.23 for CUILESS2016, and 0.39 for n2c2). Token-level measures of string length followed the same trends as the character-level measure, although typically with lower r2. Full results of these analyses are provided in Supplementary Table 1 and Supplementary Figures 1–3.

We also compared annotations across dataset pairs in 3 settings: a within-dataset setting (comparing training and development sets in the same dataset), a within-corpus setting (comparing SemEval-2015 to CUILESS2016), and a cross-corpus setting (comparing SemEval-2015 and CUILESS2016 to n2c2 2019). We observed 3 main findings in our results:

The majority of strings are unique to the dataset they appear in. The overlap in sets of medical concept mention strings between datasets ranged from

Figure 2. String-level ambiguity in medical concept normalization (MCN) datasets, by method of measuring ambiguity. (A) Measurements of observed string ambiguity in MCN datasets, in terms of strings that are annotated with exactly 1 concept unique identifier (CUI) (unambiguous) or more than 1 (ambiguous). (B) Measurements of potential string ambiguity in the Unified Medical Language System (UMLS), using minimal preprocessing, lexical variant normalization, and word match strategies to identify candidate CUIs. Shown below each UMLS matching chart is the coverage of dataset CUIs yielded by each matching strategy, broken down by ambiguous (A) and unambiguous (U) strings. Coverage is calculated as the intersection between the CUIs matched to a string in the UMLS and the set of CUIs that string is annotated with in the dataset.
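The 3 candidate-matching strategies and the coverage computation described above can be sketched as follows. This is a toy illustration only: the string-to-CUI index, the strings, and the CUIs are invented, and the crude "strip a plural s" step merely stands in for the far richer lexical variant normalization performed by tools such as luiNorm.

```python
from collections import defaultdict

# Toy stand-in for a UMLS string-to-CUI index (real entries would come from
# MRCONSO). Strings and CUI assignments here are illustrative only.
UMLS_INDEX = {
    "cold": {"C0009443", "C0009264"},   # common cold / cold temperature
    "common cold": {"C0009443"},
    "cold sore": {"C0019345"},
}

def minimal_preprocess(s: str) -> str:
    # Lowercase and collapse whitespace: a "minimal preprocessing" match.
    return " ".join(s.lower().split())

def match_minimal(s: str) -> set[str]:
    return set(UMLS_INDEX.get(minimal_preprocess(s), set()))

def match_lexical(s: str) -> set[str]:
    # Crude stand-in for lexical variant normalization: also try stripping a
    # plural "s". A real normalizer handles far more variation.
    base = minimal_preprocess(s)
    variants = {base, base[:-1]} if base.endswith("s") else {base}
    cuis = set()
    for v in variants:
        cuis |= UMLS_INDEX.get(v, set())
    return cuis

# Word-level match: any index entry sharing a word with the query.
WORD_INDEX = defaultdict(set)
for term, term_cuis in UMLS_INDEX.items():
    for word in term.split():
        WORD_INDEX[word] |= term_cuis

def match_words(s: str) -> set[str]:
    cuis = set()
    for word in minimal_preprocess(s).split():
        cuis |= WORD_INDEX.get(word, set())
    return cuis

def coverage(matched: set[str], annotated: set[str]) -> float:
    # Coverage as in Figure 2: overlap of matched CUIs with dataset annotations.
    return len(matched & annotated) / len(annotated) if annotated else 0.0
```

On this toy index, `match_words("common cold")` returns all 3 CUIs while `match_minimal("common cold")` returns 1, mirroring the finding that word-level search trades much larger candidate sets for higher coverage.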
Table 2. Results of string-level ambiguity analysis, as measured in MCN datasets (observed ambiguity) and in the UMLS with 3 candidate matching strategies (potential ambiguity)

                                           SemEval-2015   CUILESS2016   n2c2 2019
UMLS version                               2011AA         2016AA        2017AB
Dataset
  Total strings                            3203           2006          3230
  Ambiguous strings before OOV filtering   148 (5)        273 (14)      62 (2)
  Strings with OOV annotations             48             1             99
  OOV annotations only (omitted)           29             1             95
  Strings with at least 1 CUI              3174           2005          3135
  Ambiguous strings after OOV filtering    132 (4)        273 (14)      58 (2)
  Minimum/median/maximum ambiguity         2/2/6          2/2/24        2/2/3
  Mean ambiguity                           2.1 ± 0.5      2.9 ± 2.5     2.1 ± 0.3
Minimal preprocessing
  Strings with at least 1 CUI              1808 (57)      530 (26)      1874 (60)
  Ambiguous strings                        230 (13)       97 (18)       423 (23)
  Minimum/median/maximum ambiguity         2/2/11         2/2/11        2/2/11
  Mean ambiguity                           2.5 ± 1.1      2.7 ± 1.5     2.5 ± 1.2
Lexical variant normalization
  Strings with at least 1 CUI              1882 (59)      592 (30)      1942 (62)
  Ambiguous strings                        318 (17)       137 (23)      550 (28)
  Minimum/median/maximum ambiguity         2/2/17         2/2/18        2/2/20
  Mean ambiguity                           2.8 ± 1.9      3.1 ± 2.5     2.9 ± 2.1
Word match
  Strings with at least 1 CUI              2314 (73)      877 (44)      2414 (77)
  Ambiguous strings                        1774 (77)      594 (68)      2119 (88)
  Minimum/median/maximum ambiguity         2/9/123        2/8/107       2/20/120
  Mean ambiguity                           20.9 ± 25.5    19.5 ± 24.5   31.1 ± 29.2

Values are n, n (%), or mean ± SD, unless otherwise indicated. All dataset annotations that were not found in the corresponding version of the UMLS (OOVs) were omitted from this analysis; any strings that had only OOV annotations in the dataset were omitted entirely. For each of the 3 UMLS matching strategies, the number of strings for which at least 1 CUI was identified is provided along with the corresponding percentage of non-OOV dataset strings.
The number of ambiguous strings in each subset (ie, strings for which more than 1 CUI was matched after OOV annotations were filtered out) is given along with the corresponding percentage of strings for which at least 1 CUI was identified. Ambiguity statistics are calculated on ambiguous strings only and report minimum, median, maximum, mean, and standard deviation of number of CUIs identified for the string. CUI: concept unique identifier; MCN: medical concept normalization; OOV: out of vocabulary; UMLS: Unified Medical Language System.

Most shared strings have differences in their annotated CUIs. In all comparisons other than the SemEval-2015 training and development datasets, over 45% of the strings shared between a pair of datasets were annotated with at least 1 CUI that was only present in 1 of the 2 datasets (18% of strings even in the case of SemEval-2015 training and development datasets). Of these, between 33%-74% had completely disjoint sets of annotated CUIs between the 2 datasets compared. While many of these cases reflected hierarchical differences, a significant number involved truly distinct senses between datasets.

UMLS match consistently fails to yield all annotated CUIs across combined datasets. Reflecting our earlier observations within individual datasets, word-level UMLS matching was able to fully retrieve all CUIs in the combined annotation set for a fair portion of shared strings (42%-55% in within-dataset comparisons; 54%-85% in cross-corpus comparisons). However, it failed to retrieve any of the combined CUIs for 26%-54% of the shared strings.

Figure 4 illustrates changes in ambiguity for shared strings between the dataset pairs, in terms of how many strings had nonidentical annotated CUI sets, how many strings in each dataset would increase in ambiguity if the CUI sets were combined, and how many of these would switch from being unambiguous to ambiguous when combining cross-dataset CUI sets. We found that of the sets of strings shared between any pair of datasets with nonidentical CUI annotations, between 50% and 100% of the strings in each of these sets were annotated with at least 1 CUI in one of the datasets that was not present in the other. Further, up to 66% of the strings with any annotation differences went from being unambiguous to ambiguous when CUI sets were combined across the dataset pairs. Finally, we found that up to 89% of the strings that had fully disjoint CUI sets between the 2 datasets were originally unambiguous in each dataset, indicating that memorizing term-CUI normalization would work perfectly in each dataset but fail entirely on the other.

Ambiguity typology
We identified 12 distinct causes of the ambiguity observed in the datasets, organized into 5 broad categories. Table 3 presents our typology, with examples of each ambiguity type; brief descriptions of each overall category are provided subsequently. We refer the interested reader to the Supplementary Appendix for a more in-depth discussion.

Polysemy
We combined homonymy (completely disjoint senses) and polysemy (distinct but related senses)42,43 under the category of Polysemy for our analysis. While we observed instances of both homonymy and polysemy, we found no actionable reason to differentiate between them, particularly as other phenomena causing polysemy (eg, metonymy, hyponymy) were covered by other categories. Thus, the Polysemy category captured cases in which more specific phenomena were not observed and the annotated CUIs were clearly distinct from one another. As there is extensive literature on resolving abbreviations and acronyms,31–35 we treated cases involving abbreviations as a dedicated subcategory (Abbreviation; our other subcategory was Nonabbreviation).

Metonymy
Clinical language is telegraphic, meaning that complex concepts are often referred to by simpler associated forms. Normalizing these
Figure 3. Generalization analysis for medical concept normalization annotations, in 3 settings: (A, B) between training and development sets in the same datasets, (C, D) between 2 datasets drawn from the same electronic health record corpus (both from the ShARe corpus), and (E, F) across annotated corpora. The first column illustrates the number of unique strings in each sample set in the pair being analyzed, along with the number of strings present in both. The second column shows the subsets of these shared strings in which the sample sets use at least 1 different concept unique identifier (CUI) for the same string, and the number of strings in which all CUIs are different between the 2 sample sets. The third column shows for how many of the shared strings the Unified Medical Language System (UMLS) matching with word search identifies some or all of the CUIs annotated for a given string between both sample sets.
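The string-overlap comparison summarized in Figure 3 reduces to set operations over each dataset's string-to-CUI annotations. The sketch below is illustrative only: the datasets, strings, and CUI sets are invented, not drawn from the corpora analyzed here.

```python
# Toy cross-dataset comparison: which mention strings are shared between 2
# datasets, and whether their annotated CUI sets differ partially or fully.
dataset_a = {
    "depression": {"C0011570"},
    "jaundice": {"C0022346"},
    "failure": {"C0018801"},
}
dataset_b = {
    "depression": {"C0011581"},            # completely disjoint annotation
    "jaundice": {"C0022346", "C0476232"},  # partial overlap
    "edema": {"C0013604"},                 # string unique to one dataset
}

shared = dataset_a.keys() & dataset_b.keys()
# Shared strings annotated with at least 1 CUI missing from the other dataset
any_difference = {s for s in shared if dataset_a[s] != dataset_b[s]}
# Shared strings whose annotated CUI sets do not overlap at all
fully_disjoint = {s for s in shared if not dataset_a[s] & dataset_b[s]}

print(sorted(shared))          # ['depression', 'jaundice']
print(sorted(any_difference))  # ['depression', 'jaundice']
print(sorted(fully_disjoint))  # ['depression']
```

Here "failure" and "edema" illustrate the first finding (strings unique to one dataset), while "depression" illustrates a fully disjoint annotation across datasets.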
Figure 4. Analysis of concept unique identifier (CUI) sets for shared strings in medical concept normalization generalization between datasets, in 3 settings: (A, B) between training and development sets in the same datasets, (C, D) between 2 datasets drawn from the same electronic health record corpus (both from the ShARe corpus), and (E, F) across annotated corpora. The left-hand column illustrates (1) the number of shared strings with differences in their CUI annotations; (2) the proper subset of these strings, within each dataset, in which adding the CUIs from the other dataset would expand the set of CUIs for this string; and (3) the proper subset of these strings where a string is unambiguous within one or the other dataset but becomes ambiguous when CUI annotations are combined. The right-hand column displays the portion of shared strings with disjoint CUI set annotations between the 2 datasets in which the string is unambiguous in each of the datasets independently.
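The Figure 4 quantities can likewise be computed by merging annotations across a dataset pair. The sketch below uses invented annotations; the "potassium" example echoes the Measurement vs Substance pattern discussed in the typology, but the data are not taken from the corpora themselves.

```python
# Toy sketch of the Figure 4 analysis: merge the CUI annotations for strings
# shared between 2 datasets and count how many strings that were unambiguous
# (a single CUI) in each dataset become ambiguous under the combined CUI set.
annotations_a = {"potassium": {"C0032821"}, "rhythm": {"C0199556"}}
annotations_b = {"potassium": {"C0202194"}, "rhythm": {"C0199556"}}

newly_ambiguous = []
for string in sorted(annotations_a.keys() & annotations_b.keys()):
    combined = annotations_a[string] | annotations_b[string]
    unambiguous_in_each = (
        len(annotations_a[string]) == 1 and len(annotations_b[string]) == 1
    )
    if unambiguous_in_each and len(combined) > 1:
        newly_ambiguous.append(string)

print(newly_ambiguous)  # ['potassium']
```

"potassium" is unambiguous within each toy dataset but ambiguous across them: exactly the case where memorizing a term-CUI mapping from one dataset fails on the other.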
Table 3. Ambiguity typology derived from SemEval-2015, CUILESS2016, and n2c2 2019 MCN corpora

Polysemy
  Abbreviation: Abbreviations or acronyms with distinct senses.
    "Family hx of breast [ca], emphysema" -> C0006826 Malignant Neoplasms
    "BP 137/80 na 124 [ca] 8.7" -> C0201925 Calcium Measurement
  Nonabbreviation: Term ambiguity other than abbreviations or acronyms.
    "BP was [elevated] at last 2 visits" -> C0205250 High (qualitative)
    "Her leg was [elevated] after surgery" -> C0439775 Elevation procedure

Metonymy
  Procedure vs Concept: Distinguishes between a medical concept and the procedure or action used to analyze/effect that concept.
    "[Rhythm] revealed sinus tachycardia" -> C0199556 Rhythm ECG (Procedure)
    "The [rhythm] became less stable" -> C0577801 Heart rhythm (Finding)
  Measurement vs Substance: Distinguishes between a physical substance and a measurement of that substance.
    "Pt blood work to check [potassium]" -> C0032821 Potassium (Substance)
    "Sodium 139, [potassium] 4.7" -> C0202194 Potassium Measurement
  Symptom vs Diagnosis: Distinguishes between a finding being marked as a symptom or a disorder.
    "Current symptoms include [depression] (possibly diagnosed)" -> C0011570 Mental Depression
    "Hx of chronic [depression]" -> C0011581 Depressive disorder
  Other: All other types of metonymy.
    "Transfusion of [blood]" -> C0005767 Blood (Body Substance)
    "Discovered [blood] at catheter site" -> C0019080 Hemorrhage

Specificity
  Hierarchical: Combines hyponymy and meronymy; corresponds to taxonomic UMLS relations.
    "Cardiac: family hx of [failure]" -> C0018801 Heart Failure
    "...in left ventricle. This [failure]..." -> C0023212 Left-sided heart failure
  Recurrence/Number: Distinguishes between singular and plural forms of a finding, or one episode and recurrent episodes.
    "No [injuries] at admission" -> C0175677 Injury
    "Brought to emergency for his [injuries]" -> C0026771 Multiple trauma

Synonymy
  Propositional Synonyms: For a general-purpose application, the set of CUIs are not meaningfully distinct from one another.
    "Negative skin [jaundice]" -> C0022346 Icterus
    "Increased girth and [jaundice]" -> C0476232 Jaundice
  Co-taxonyms: The CUIs are (conceptually or in the UMLS) taxonomic siblings; often overspecification.
    "2mg [percodan]" -> C0717448 Percodan
    "2mg [percodan]" -> C2684258 Percodan (reformulated 2009)

Error
  Semantic: Erroneous CUI assignment, due to misinterpretation, confusion with a nearby concept, or other cause.
    "Open to air with no [erythema]" -> C0041834 Erythema
    "Edema but no [erythema]" -> C0013604 Edema
  Typos: One CUI is a typographical error when attempting to enter the other (ie, no real ambiguity).
    "[Neoplasm] is adjacent" -> C0024651 Malt Grain (Food)
    "Infection most likely [neoplasm]" -> C0027651 Neoplasms

Short definitions are provided for each subcategory, along with 2 samples of an example ambiguous string and their normalizations using UMLS CUIs. For a more detailed discussion, see the Supplementary Appendix. CUI: concept unique identifier; UMLS: Unified Medical Language System.

references requires inference from their context: for example, a reference to "sodium" within lab readings implies a measurement of sodium levels, a distinct concept in the UMLS. It is noteworthy that in some cases, examples of the Metonymy category may be considered as annotation errors, illustrating the complexity of metonymy in practice; for example, the case of "Sodium 139, [potassium] 4.7" included in Table 3, annotated as C0032821 Potassium (substance), would be better annotated as C0428289 Finding of potassium level. As these concepts are semantically related (while ontologically distinct), we included such cases in the category of Metonymy. We observed 3 primary trends in metonymic annotations: reference to a procedure by an associated biological property (Procedure vs Concept), mention of a biological substance to refer to its measurement (Measurement vs Substance), and the fact that many symptomatic findings can also be formal diagnoses (Symptom vs Diagnosis; eg, "emphysema," "depression"). Other examples of Metonymy falling outside these trends were placed in the Other subcategory.

Specificity
The rich semantic distinctions in the UMLS (eg, phenotypic variants of a disease) lead to frequent ambiguity of Specificity. The ambiguity was often taxonomic, captured as Hierarchical; the other pattern observed was ambiguity in grammatical number of a finding, typically due to inflection (eg, "no injuries" meaning not a single injury) or recurrence (denoted Recurrence/Number).

Synonymy
Many strings were annotated with CUIs that were effectively synonymous; we therefore followed Cruse's42 definition of Propositional Synonymy, in which ontologically distinct senses nonetheless yield the same propositional interpretation of a statement. We also included Co-taxonymy in this category, typically involving annotation with either overspecified CUIs or CUIs separated only by negation.

Error
A small number of ambiguity cases were due to erroneous annotations stemming from 2 causes: (1) typographical errors in data entry (Typos) and (2) selection of an inappropriate CUI (Semantic).

Ambiguity types in each dataset
As with our measurements of string ambiguity, we excluded all dataset samples annotated as "CUI-less" for analysis of ambiguity type, as these reflect annotation challenges beyond the ambiguity level.
However, we retained samples with annotation errors and CUIs remapped within the UMLS, as these samples inform MCN evaluation in these datasets, and ambiguity type analysis did not require direct comparison to string-CUI associations in the UMLS. This increased the number of ambiguous strings in SemEval-2015 from 132 to 148; ambiguous string counts in CUILESS2016 and n2c2 2019 were not affected. Table 4 presents the frequency of each ambiguity type across our 3 datasets. All but 21 strings (3 in SemEval-2015, 18 in CUILESS2016) exhibited a single ambiguity type (ie, all CUIs were related in the same way). To compare the distribution of ambiguity categories across datasets, we visualized their relative frequency in Figure 5. Polysemy and Metonymy strings were most common in n2c2 2019, while Specificity was the plurality category in SemEval-2015 and Synonymy was most frequent in CUILESS2016. The sample-wise distribution, included in Table 4, followed the string-wise distribution, except for Polysemy, which included multiple high-frequency strings in SemEval-2015 and CUILESS2016.

Table 4. Results of ambiguity type analysis, showing the number of unique ambiguous strings assigned to each ambiguity type by dataset, along with the total number of dataset samples in which those strings appear

                                           SemEval-2015       CUILESS2016        n2c2 2019
Category      Subcategory                  Strings  Samples   Strings  Samples   Strings  Samples
Polysemy      Abbreviation                 4        59        6        178       7        33
              Nonabbreviation              2        2         12       302       6        28
Metonymy      Procedure vs Concept         0        0         7        25        9        23
              Measurement vs Substance     0        0         0        0         9        93
              Symptom vs Diagnosis         20       62        20       166       2        5
              Other                        2        3         6        22        5        29
Specificity   Hierarchical                 50       103       87       776       7        26
              Recurrence/Number            8        24        3        6         0        0
Synonymy      Propositional Synonyms       23       26        64       354       8        26
              Co-taxonyms                  9        11        64       837       4        13
Error         Typos                        25       25        0        0         0        0
              Semantic                     8        11        22       109       1        1
Total (unique)                             148      326       273      2775      58       295

Some strings were assigned multiple ambiguity types, and are counted for each; the number of affected samples was estimated for each type in these cases. The sample counts given for error subcategories represent the actual count of misannotated samples. The total number of unique ambiguous strings and associated samples analyzed in each dataset is presented in the last row.

Figure 5. Distribution of ambiguity types within each dataset, in terms of (A) the unique strings assigned each ambiguity type and (B) the number of samples in which those strings occur. The number of strings and samples belonging to each typology category is shown within each bar portion.

Finally, we visualized the proportion of strings within each ambiguity type considered arbitrary (at the sample level) during annotation, shown in Figure 6. Arbitrary rates varied across datasets, with the fewest cases in SemEval-2015 and the most in n2c2 2019. Metonymy (Symptom vs Diagnosis), Specificity (Hierarchical), and Synonymy (Co-taxonyms) were all arbitrary in more than 50% of cases.

DISCUSSION

Ambiguity is a key challenge in medical concept normalization. However, relatively little research on ambiguity has focused on clinical language. Our findings demonstrate that clinical language exhibits distinct types of ambiguity, such as clinical patterns in metonymy and specificity, in addition to well-studied problems such as abbreviation expansion. These results highlight 3 key gaps in the literature for MCN ambiguity: (1) a significant gap between the potential ambiguity of medical terms and their observed ambiguity in EHR datasets, creating a need for new ambiguity-focused datasets; (2) a need for MCN evaluation strategies that are sensitive to the different kinds of relationships between concepts observed in our ambiguity
typology; and (3) underutilization of the extensive semantic resources of the UMLS in recent MCN methodologies. We discuss each of these points in the following sections, and propose specific next steps toward closing these gaps to advance the state of MCN research. We conclude by noting the particular role of representative data in the deep learning era and providing a brief discussion of the limitations of this study that will inform future research on ambiguity in MCN.

Figure 6. Percentage of ambiguous strings in each ambiguity type annotated as arbitrary, by dataset. Synonymy (Propositional Synonyms) and both Error subcategories are omitted, as they are arbitrary by definition.

The next phase of research on clinical ambiguity needs dedicated datasets
The order of magnitude difference between the number of CUIs annotated for each string in our 3 datasets and the number of CUIs found through word match to the UMLS suggests that our current data resources cover only a small subset of medically relevant ambiguity. Differences in ambiguity across multiple datasets provide some improvement in addressing this coverage gap and clearly indicate the value of evaluating new MCN methods on multiple datasets to improve ambiguity coverage. However, the ShARe and MCN corpora were designed to capture an in-depth sample of clinical language, rather than a sample with high coverage of specific challenges like ambiguity. As MCN research continues to advance, more focused datasets capturing specific phenomena are needed to support development and evaluation of methodologies to resolve ambiguity. Savova et al25 followed the protocol used in designing the biomedical NLM WSD corpus24 to develop a private dataset containing a set of highly ambiguous clinical strings; adapting and expanding this protocol with resources such as MIMIC-III54 offers a proven approach to collect powerful new datasets.

Distinct ambiguity phenomena in MCN call for different evaluation strategies
MCN systems are typically evaluated in terms of accuracy,39,55 calculated as the proportion of samples in which the predicted CUI exactly matched the gold CUI. On this view, a predicted CUI is either exactly right or completely wrong. However, as illustrated by the distinct ambiguity types we observed, in many cases a CUI other than the gold label may be highly related (eg, "Heart failure" and "Left-sided heart failure"), or even propositionally synonymous. As methodologies for MCN improve and expand, alternative evaluation methods leveraging the rich semantics of the UMLS can help to distinguish a system with a related misprediction from a system with an irrelevant one. A wide variety of similarity and relatedness measures that utilize the UMLS to compare medical concepts have been proposed,72–75 presenting a fruitful avenue for development of new MCN evaluation strategies.

It is important to note, however, that equivalence classes and similarity measures will often be task or domain specific. For example, 2 heart failure phenotypes may be equivalent for presenting summary information in an EHR dashboard but may be highly distinct for cardiology-specific text mining or applications with detailed requirements such as clinical trial recruitment. While dedicated evaluation metrics for each task would be impractical, a trade-off between generalizability and sensitivity to the needs of different applications represents an area for further research.

The UMLS offers powerful semantic tools for high-coverage candidate identification
Our cross-dataset comparison clearly demonstrates the value of utilizing inclusive UMLS-based matching to identify a high-coverage set of candidate CUIs for a medical concept, though the lack of 100% coverage reinforces the value of ongoing research on synonym identification.60 Inclusive matching, of course, introduces additional noise: luiNorm can overgenerate semantically invalid variants due to homonymy,76 such as mapping "wound" in "injury or wound" to "wind," and mapping both "left" and "leaves" to "leaf"; word-level search, meanwhile, requires very little to yield a match and generates very large candidate sets, such as 120 different
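The similarity-aware alternative to exact-match evaluation discussed above can be sketched minimally as follows. This is an illustration, not a metric proposed in the article: the is-a edge and the 0.5 partial-credit value are arbitrary assumptions, and in practice the hierarchy would be drawn from UMLS relations, with any published UMLS similarity or relatedness measure replacing the simple ancestor check.

```python
# Toy is-a hierarchy; real edges would come from UMLS hierarchical relations.
IS_A = {
    "C0023212": "C0018801",  # Left-sided heart failure -> Heart failure
}

def ancestors(cui: str) -> set[str]:
    # Walk up the (toy) is-a chain, collecting all ancestors of a CUI.
    found = set()
    while cui in IS_A:
        cui = IS_A[cui]
        found.add(cui)
    return found

def credit(predicted: str, gold: str) -> float:
    # Full credit for an exact match, partial credit (an arbitrary 0.5) for a
    # hierarchically related prediction, none for an irrelevant one.
    if predicted == gold:
        return 1.0
    if gold in ancestors(predicted) or predicted in ancestors(gold):
        return 0.5
    return 0.0

print(credit("C0023212", "C0018801"))  # related misprediction -> 0.5
print(credit("C0009443", "C0018801"))  # irrelevant misprediction -> 0.0
```

Under exact-match accuracy both mispredictions above score 0; a hierarchy-aware score separates the related error from the irrelevant one, which is the distinction the discussion argues MCN evaluation should capture.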