Defining Medical Words: Transposing Morphosemantic Analysis from French to English
Louise Deléger (a), Fiammetta Namer (b), Pierre Zweigenbaum (c)

(a) INSERM, UMR_S 872, Éq. 20, Les Cordeliers, Paris, F-75006 France; Université Pierre et Marie Curie-Paris 6, UMR_S 872, Paris, F-75006 France; Université Paris Descartes, UMR_S 872, Paris, F-75006 France
(b) ATILF and Université Nancy 2, CLSH, Nancy, F-54015 France
(c) LIMSI, CNRS UPR3251, Orsay, F-91403 France; INALCO, CRIM, Paris, F-75007 France

Abstract

Medical language, like many technical languages, is rich with morphologically complex words, many of which take their roots in Greek and Latin, in which case they are called neoclassical compounds. Morphosemantic analysis can help generate definitions of such words. This paper reports work on the adaptation of a morphosemantic analyzer dedicated to French (DériF) to analyze English medical neoclassical compounds. It presents the principles of this transposition and its current performance. The analyzer was tested on a set of 1,299 compounds extracted from the WHO-ART terminology. Of these, 859 could be decomposed and defined, 675 of them correctly. An advantage of this process is that complex linguistic analyses designed for French could be successfully transferred to the analysis of English medical neoclassical compounds. Moreover, the resulting system can produce more complete analyses of English medical compounds than existing ones, including a hierarchical decomposition and semantic gloss of each word.

Keywords:
Natural Language Processing, morphosemantic analysis, word definition, neoclassical compounds.

Introduction

Medical language, like many technical languages, is rich with morphologically complex words, many of which take their roots in Greek and Latin. These so-called neoclassical compounds are present in many areas of the medical vocabulary, including anatomy (gastrointestinal), diseases (encephalitis, cardiomyopathy), and procedures (gastrectomy). Segmenting morphologically complex words into their components is the task of morphological analysis. When this analysis is complemented by semantic interpretation, the process is called morphosemantic analysis. Complex words are often “compositional,” in the sense that the meaning of a complex word is often a combination of that of its parts.

Morphosemantic analysis can therefore help processes interested in semantics, such as the detection of similar terms or the generation of definitions. This was for instance the aim of [1], where Morfessor, a tool for unsupervised learning of morphological segmentation [2], was used to contribute mappings between WHO-ART and SNOMED terms. Early work on medical morphosemantic analysis focused on specific components such as -itis [3] or -osis [4], then on larger sets of neoclassical compounds [5]. Lovis [6] introduced the notion of morphosemantemes, i.e., units that cannot be further decomposed without losing their original meanings. The Morphosaurus system [7,8] segments complex words using a similar notion called subword. The UMLS Specialist Lexicon [9], with its “Lexical tools,” handles derived words, i.e., complex words built through the addition of prefixes or suffixes. It provides tables of neoclassical roots, but no analyzer to automatically decompose compound words. DériF [10] morphosemantically analyzes complex words. In contrast to [8] or [6], it computes a hierarchical decomposition of complex words. Moreover, it produces a semantic definition of these words, which it can link to other words through a set of semantic relations including synonymy and hyponymy. In contrast to the Specialist tools or to [11], DériF handles both derived and compound words. It was designed initially for French complex words, then extended to the medical domain, and its potential for cross-linguistic application was shown in [12]. Its transposition to English would fill a gap in the set of tools currently available to process complex English medical words.

This paper reports work on the adaptation of DériF to English medical complex words. It focuses on neoclassical compounds and is based on the hypothesis that this type of word is similarly formed in related European languages (here French and English). Our goal is to have DériF analyze English words and present its results in English.

We first describe the morphosemantic analyzer and our test set of words. We explain the modifications performed on this tool and the evaluation conducted. We then present the results, discuss the method and conclude with some perspectives.

Material and Methods

The principle on which this work is based is morphosemantic analysis, that is, morphological analysis combined with a semantic interpretation of words. In other words, we want to obtain a decomposition and a description of the meaning of a complex word based on the meanings of its parts. A complex
word may be formed through any combination of the following word formation rules:

• derivation, which adds an affix (prefix or suffix) to a base word, e.g., pain/painful;
• compounding, which joins two (or more) components together, those components being either neoclassical roots called Combining Forms (CFs) or modern-language words, e.g., thermoregulation, arthritis.

For this work we chose to analyze neoclassical compounds (formed from CFs). However, a compound may also undergo derivation, so that mixed-formation words must also be addressed. Therefore we included not only “pure” compounds, but also those neoclassical compounds that were prefixed or suffixed (e.g., haemorrhagic).

Our main hypothesis for the transposition from French to English is that the same linguistic analysis can be applied to neoclassical compounds of several languages. We assume that they are formed in a similar way and that the components involved are the same, the major differences being orthographic (such as -algia/-algie).

Material

We started from the French version of the DériF (“Derivation in French”) morphosemantic analyzer. DériF was designed both for general language and more specialized vocabularies such as medical language. It performs an analysis purely based on linguistic methods and implements a number of decomposition rules and semantic interpretation templates. It also uses resources which include a lexicon of word lemmas tagged with their parts-of-speech and a table of CFs. The system goes further than simple decomposition and interpretation steps by predicting lexically related words. Another of its distinctive features is that it yields a structured decomposition of words and not simply a linear segmentation, so that we know which part is the head of the word.

As input the system expects a list of words tagged with their parts-of-speech and lemmatized (in their base form, with no plural). It outputs the following elements:

1. a structured decomposition of the word into its meaningful parts;
2. a definition (“gloss”) of the word in natural language;
3. a semantic category, inspired by the main MeSH tree descriptors (anatomy, organism, disease, etc.);
4. a set of potentially lexically related words. The relation identified can be an equivalence relation (eql), a hyponymy relation (isa) or a see-also relation (see).

For instance, the French word acrodynie (English acrodynia) is analyzed in the following way (N stands for noun, N* is assigned to a noun CF):

    acrodynie/N ==>
    (1) [ [ acr N* ] [ odyn N* ] ie N]
    (2) douleur (de -- lié(e) à) articulation (pain of -- linked to joint)
    (3) maladie (disease)
    (4) eql:acr/algie, eql:apex/algie, see:acr/ite, see:apex/ite

It has been pointed out [12] that the method could be extended to other languages. Indeed the rules for generating lexically related words (item 4) do not rely on any language-specific features, so that this part of the system is fully language-independent. The morphosemantic parser, however, needs to be adapted to the language.

To test the transposition of DériF to English, we prepared a list of test words. These words were taken from the WHO-ART terminology, which describes Adverse Drug Reactions (ADRs), since one of the intended applications of this work is to contribute to the pharmacovigilance domain by grouping terms describing similar ADRs. We selected the English terms of this terminology; since DériF works on single words and not on multi-word units, we split them into single words; and since we adapted DériF to analyze neoclassical compounds, we only retained those types of words. The selection was done both automatically, by removing all words of 4 characters or less (these words are practically never morphologically complex), and manually, by reviewing the list to look for neoclassical compounds (the work was done by a language engineer, LD). This gave us a list of 1,299 words to be decomposed out of a total of 3,476 words. These words were lemmatized and tagged with their parts-of-speech using the TreeTagger¹ part-of-speech tagger. We used a lexicon of tagged words from the UMLS Specialist lexicon² to help TreeTagger deal with unknown words.

¹ http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html (last access 26/03/07)
² http://www.nlm.nih.gov/pubs/factsheets/umlslex.html (last access 26/03/07)

Methods

Adapting the morphosemantic analyzer

The method of morphological analysis that we want to transpose to English is schematized in figure 1.

Figure 1 – Morphosemantic Parsing

Language-specific material is located in:

1. the lexicon of tagged word lemmas which is used by the system to test whether a component exists and to retrieve its part-of-speech;
2. the table of Combining Forms (CFs): each CF is associated with a modern-language word which describes its meaning, a semantic type, a part-of-speech, and a set of CFs related through relations eql, isa, see;
3. the decomposition rules: these rules are triggered in a

Table 1 – Table of Combining Forms (excerpt)

| CF | English | Type | Relations |
|---|---|---|---|
| algia | pain | disease | odyn, algo, ~itis |
| blephar | eyelid | anatomy | palpebr, |
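To make the role of the CF table more concrete, here is a minimal Python sketch of how such a table could drive gloss generation and the prediction of related words. It is not DériF's implementation. The class `CombiningForm`, the functions `gloss` and `related_words`, the sorting of the related CFs of algia into eql and see, and the whole arthr entry are assumptions made for illustration; only the algia gloss, type and list of related CFs come from the excerpt of Table 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CombiningForm:
    gloss: str        # modern-language word describing the CF's meaning
    sem_type: str     # semantic type, e.g. "disease" or "anatomy"
    pos: str = "N*"   # part-of-speech; N* marks a noun CF
    related: Dict[str, List[str]] = field(default_factory=dict)


# "algia" follows the Table 1 excerpt. How its related CFs split into
# eql/see, and the whole "arthr" entry, are assumptions for illustration.
CF_TABLE = {
    "algia": CombiningForm("pain", "disease",
                           related={"eql": ["odyn", "algo"], "see": ["itis"]}),
    "arthr": CombiningForm("joint", "anatomy"),
}


def gloss(modifier: str, head: str, table=CF_TABLE) -> str:
    """Hypothetical template for a two-CF compound whose head comes last."""
    return f"{table[head].gloss} (of -- linked to) {table[modifier].gloss}"


def related_words(modifier: str, head: str, table=CF_TABLE) -> List[str]:
    """Swap the head CF for each related CF, keeping the relation label."""
    return [f"{rel}:{modifier}/{cf}"
            for rel, cfs in table[head].related.items()
            for cf in cfs]


print(gloss("arthr", "algia"))
print(related_words("arthr", "algia"))
```

With these toy entries the gloss for arthr + algia reads "pain (of -- linked to) joint". The related words produced here depend entirely on the assumed relation lists, so they need not match what the real system outputs.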
• Errors at the preprocessing level (mistagged words), e.g., corporal/N was tagged as a noun while being an adjective in our context.

Table 4 – Precision / Recall figures

| Nb of correct results | Precision | Recall |
|---|---|---|
| 675 | 78.5% | 52% |

We measured a precision of 78.5% (675 of the 859 analyzed words; see table 4), which is fairly good. Combined with a moderate coverage, this yields a recall of 52% (675 of the 1,299 test words). Incorrect results were mainly due to:

• Wrong structuring of the decomposition. An example can be seen in table 3 with the word meningoencephalitis. Its correct decomposition should be:
    [ [ [ mening N* ] [ encephal N* ] ] [ itis N* ] N]
  -itis should be the head of the conjunction of mening- and encephal-, which would give a definition such as “inflammation related to head and meninges.”
• Unsatisfactory definition (often due to the fact that the meaning of the word was not sufficiently reflected by the meaning of its parts). See for instance the analysis of the word acanthosis in table 3. The decomposition is right but its meaning has evolved too much and cannot be derived from that of its parts. This word should not be considered a complex one.
• Mistagged words that could not be correctly analyzed. This is the case of alveolar (last row of table 3), which was treated as a noun derived from an adjective (alveolar as an adjective is correctly analyzed by DériF).

Discussion

The precision of the adapted system is good, especially for a first implementation. Recall is lower but should rapidly grow when increasing the size of the table of CFs and the lexicon. The present state of the system already shows that this language-dependent system could be adapted to another, related language for the task of analyzing medical compounds. The advantage of doing so is that we did not have to start from scratch and implement a new system. A certain amount of manual work is still necessary, such as preparing the table of CFs and adapting the semantic templates; but there is a stable basis on which to work, so that we believe that this solution is overall less time-consuming. An additional advantage is that in the process, we can transfer to English linguistic analyses which were initially designed for French. For instance, Namer [13] proposed an analysis of pathology nouns of the form Pref-(Y)X-ie, for instance hypercalciurie, which she showed could also apply to German, Spanish, Italian, and English. In the present work, the implementation of this analysis in the French DériF is directly transferred to its English version.

Using a linguistics-based morphosemantic analyzer such as DériF has a number of advantages. The system performs both morphological decomposition and semantic interpretation, while other methods remain at the level of morphological segmentation [8] or add semantics after using a tool for decomposition [1]. We also provide a hierarchical decomposition as opposed to a linear one [1,6,8]. In [1] the method was also applied to the WHO-ART terminology, so that we could compare results. The same word list was submitted to the Morfessor segmenter in exactly the same conditions as in [1], and obtained a coverage of 93.7%, but a precision of 53.2% (about 25 points lower than DériF) and a recall of 49.9% (slightly lower than DériF). This means that about the same number of words was correctly analyzed (recall), but DériF was much more precise, proposing significantly fewer incorrect analyses. By relying on a statistical segmenter, the method in [1] has the advantage of being language-independent; however, its implementation also uses a table of morphemes, so transposition to another language is not immediate either. Besides, statistical segmentation also brings certain types of errors that could be avoided with linguistic rules as implemented in DériF, e.g., chemosis segmented as c+hem+osis. Moreover, as pointed out above, DériF outputs more complete information than Morfessor (simple linear segmentation).

This work also suggests the perspective of a system that could work with several languages. We transposed the system from French to English, but a possible next step would be to use it for translation, i.e., producing an English definition of a French word (or vice versa). This would require a multilingual table of CFs (as suggested in [12] and prepared here for French+English), multilingual semantic templates (as obtained from the present work for these two languages), and translations of simplex words involved in compounds. Such a system could contribute for instance to cross-language information retrieval, with the same principles as [14]. Another potential next step would be to test the adaptation method on other related languages.

Conclusion

In this work, we successfully transposed a linguistics-based morphosemantic analyzer from French to English in order to provide definitions for English neoclassical medical words. This verifies our hypothesis that neoclassical compounds from different languages can be analyzed in a similar way, as can be expected in the medical vocabularies of languages in the Romance (e.g., French) and Germanic (e.g., English) families. This can be seen as a first step towards a multilingual system.

Future work includes improvement of the system as well as using the results for a specific application such as pharmacovigilance. The decompositions and definitions generated can be used to measure proximity between WHO-ART terms and group similar terms, but without necessarily relying on SNOMED as done in [1]. Grouping similar terms describing Adverse Drug Reactions would allow better signal detection in pharmacovigilance.

References

[1] Iavindrasana J, Bousquet C and Jaulent MC. Knowledge acquisition for computation of semantic distance between WHO-ART terms. Stud Health Technol Inform. 2006;124:839-44.
[2] Creutz M, Lagus K, Linden K, and Virpioja S. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical report, Publications in Computer and Information Science, Helsinki University of Technology, 2005.
[3] Pacak MG, Norton LM, and Dunham GS. Morphosemantic analysis of -ITIS forms in medical language. Methods of Information in Medicine 1980;19:99-105.
[4] Dujols P, Aubas P, Baylon C, and Grémy F. Morphosemantic analysis and translation of medical compound terms. Methods of Information in Medicine 1991;30:30-5.
[5] Wolff S. The use of morphosemantic regularities in the medical vocabulary for automatic lexical coding. Methods of Information in Medicine 1984;4(23):195-203.
[6] Lovis C, Michel PA, Baud R, and Scherrer JR. Word segmentation processing: a way to exponentially extend medical dictionaries. Methods of Information in Medicine 1995:28-32.
[7] Schulz S, Romacker M, Franz P, et al. Towards a multilingual morpheme thesaurus for medical free-text retrieval. In: Proceedings of MIE 99, Ljubljana, Slovenia. IOS Press, 1999:891-894.
[8] Markó K, Schulz S, and Hahn U. Morphosaurus - design and evaluation of an interlingua-based, cross-language document retrieval engine for the medical domain. Methods of Information in Medicine 2005;(44):537-45.
[9] McCray AT, Srinivasan S and Brown AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235-9.
[10] Namer F and Zweigenbaum P. Acquiring meaning for French medical terminology: contribution of morphosemantics. In Marius Fieschi, Enrico Coiera, and Jack Li, eds. Proc 10th World Congress on Medical Informatics, San Francisco, Ca, 2004;11(Pt 1):535-9.
[11] Baud R, Rassinoux AM, Ruch P, and Lovis C. The power and limits of a rule-based morphosemantic parser. In: Proceedings of AMIA 1999 Annual Symposium, 1999:22-6.
[12] Namer F and Baud R. Predicting lexical relations between biomedical terms: towards a multilingual morphosemantics-based system. Stud Health Technol Inform. 2005;116:793-8.
[13] Namer F. Guessing the meaning of neoclassical compounds within LG: the case of pathology nouns. In: Proceedings of Generative Approaches to the Lexicon, Geneva, 2005:175-84.
[14] Hahn U, Honeck M, Piotrowski M, and Schulz S. Subword segmentation: Leveling out morphological variations for medical document retrieval. 2001:229-33.

Address for correspondence

Louise Deléger, INSERM U729, 15 rue de l'Ecole de Médecine 75006 Paris, France; louise.deleger@spim.jussieu.fr

Table 3 – Example word analyses generated by the updated DériF

| Word/POS | Decomposition | Definition | Type | Related words |
|---|---|---|---|---|
| Correct analyses | | | | |
| arthralgia/N | [ [ arthr N* ] [ algia N* ] N] | pain (of -- linked to) joint | disease | eql:arthr/algesia see:arthr/itis |
| adactyly/N | [ a+ [ dactyl N*] +y N] | Affection characterized by the absence of digit | disease | _ |
| gastroesophageal/ADJ | [ [ gastr N* ] [ oesophag N* ] al ADJ ] | Related to oesophagus and stomach | anatomy | eql:stomac/oesophag isa:abdomin/oesophag |
| Errors | | | | |
| meningoencephalitis/N | [ [ mening N* ] [ [ encephal N* ] [ itis N* ] N] N] | (Part of -- Specific type of) encephalitis related to meninges | _ | _ |
| acanthosis/N | [ [acanth N*] [osis N*] N] | (Part of -- Specific type of) disease related to prickle | disease | _ |
| alveolar/N | [ [alveolar A] N ] | Entity being alveolar | _ | _ |
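The bracketed decompositions in Table 3 encode a tree, which is what distinguishes them from a linear segmentation. As an illustration only, the following Python sketch reads that notation into a small tree structure, so that two analyses with the same components but a different structure can be compared. The names `Node`, `parse` and `leaves` are hypothetical, and the rule that a trailing bare token inside a bracket is its category tag is an assumption inferred from the examples in this paper, not a documented DériF format.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Node:
    """One bracketed constituent: its parts plus an optional category tag."""
    parts: List[Union["Node", str]] = field(default_factory=list)
    tag: Optional[str] = None


def tokenize(text: str) -> List[str]:
    # Brackets are tokens of their own; everything else splits on whitespace.
    return re.findall(r"\[|\]|[^\s\[\]]+", text)


def _parse_node(tokens: List[str], pos: int):
    if pos >= len(tokens) or tokens[pos] != "[":
        raise ValueError("expected '['")
    pos += 1
    parts: List[Union[Node, str]] = []
    while pos < len(tokens) and tokens[pos] != "]":
        if tokens[pos] == "[":
            child, pos = _parse_node(tokens, pos)
            parts.append(child)
        else:
            parts.append(tokens[pos])
            pos += 1
    if pos >= len(tokens):
        raise ValueError("missing ']'")
    # Assumption: a trailing bare token is the category tag (N, N*, ADJ, A).
    tag = parts.pop() if parts and isinstance(parts[-1], str) else None
    return Node(parts, tag), pos + 1


def parse(text: str) -> Node:
    tokens = tokenize(text)
    node, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the outermost bracket")
    return node


def leaves(node: Node) -> List[str]:
    """Components in surface order, ignoring the bracketing."""
    out: List[str] = []
    for part in node.parts:
        out.extend(leaves(part) if isinstance(part, Node) else [part])
    return out


wrong = parse("[ [ mening N* ] [ [ encephal N* ] [ itis N* ] N] N]")
right = parse("[ [ [ mening N* ] [ encephal N* ] ] [ itis N* ] N]")
print(leaves(wrong) == leaves(right), wrong == right)
```

Both analyses of meningoencephalitis yield the same three components, but the trees differ: in the corrected structure the last top-level constituent is itis alone, whereas in the erroneous one it groups encephal with itis. A linear segmenter cannot express this difference.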