Defining Medical Words: Transposing Morphosemantic Analysis from French to English

Page created by Angel Franklin
 
CONTINUE READING
Defining Medical Words: Transposing Morphosemantic Analysis from French to English

                               Louise Delégera, Fiammetta Namerb, Pierre Zweigenbaumc
  a
      INSERM, UMR _S 872, Éq. 20, Les Cordeliers, Paris, F-75006 France; Université Pierre et Marie Curie-Paris 6, UMR_ S 872,
                     Paris, F-75006 France; Université Paris Descartes, UMR _S 872, Paris, F-75006 France
                                 b
                                   ATILF and Université Nancy 2, CLSH, Nancy, F-54015 France
                   c
                     LIMSI, CNRS UPR3251, Orsay, F-91403 France; INALCO, CRIM, Paris, F-75007 France

Abstract                                                             logical segmentation [2], was used to contribute mappings
                                                                     between WHO-ART and SNOMED terms. Early work on
Medical language, as many technical languages, is rich with          medical morphosemantic analysis focused on specific compo-
morphologically complex words, many of which take their              nents such as –itis [3] or –osis [4], then on larger sets of neo-
roots in Greek and Latin—in which case they are called neo-          classical compounds [5]. Lovis [6] introduced the notion of
classical compounds. Morphosemantic analysis can help gen-           morphosemantemes, i.e., units that cannot be further decom-
erate definitions of such words. This paper reports work on          posed without losing their original meanings. The Mor-
the adaptation of a morphosemantic analyzer dedicated to             phosaurus system [7,8] segments complex words using a simi-
French (DériF) to analyze English medical neoclassical com-          lar notion called subword. The UMLS Specialist Lexicon [9],
pounds. It presents the principles of this transposition and its     with its “Lexical tools,” handles derived words, i.e., complex
current performance. The analyzer was tested on a set of             words built through the addition of prefixes or suffixes. It
1,299 compounds extracted from the WHO-ART terminology.              provides tables of neoclassical roots, but no analyzer to auto-
859 could be decomposed and defined, 675 of which success-           matically decompose compound words. DériF [10] mor-
fully. An advantage of this process is that complex linguistic       phosemantically analyses complex words. In contrast to [8] or
analyses designed for French could be successfully trans-            [6], it computes a hierarchical decomposition of complex
ferred to the analysis of English medical neoclassical com-          words. Moreover, it produces a semantic definition of these
pounds. Moreover, the resulting system can produce more              words, which it can link to other words through a set of seman-
complete analyses of English medical compounds than exist-           tic relations including synonymy and hyponymy. In contrast to
ing ones, including a hierarchical decomposition and seman-          the Specialist tools or to [11], DériF handles both derived and
tic gloss of each word.                                              compound words. Designed initially for French complex
Keywords:                                                            words, then extended to the medical domain, its potential for
                                                                     cross-linguistic application was showed in [12]. Its transposi-
Natural Language Processing, morphosemantic analysis, word           tion to English would fill a gap in the set of tools currently
definition, neoclassical compounds.                                  available to process complex English medical words.
                                                                     This paper reports work on the adaptation of DériF to English
Introduction                                                         medical complex words. It focuses on neoclassical com-
                                                                     pounds and is based on the hypothesis that this type of words
Medical language, as many technical languages, is rich with          is similarly formed in related European languages (here French
morphologically complex words, many of which take their              and English). Our goal is to have DériF analyse English words
roots in Greek and Latin. These so-called neoclassical com-          and present its results in English.
pounds are present in many areas of the medical vocabulary,          We first describe the morphosemantic analyzer and our test set
including anatomy (gastrointestinal), diseases (encephalitis,        of words. We explain the modifications performed on this tool
cardiomyopathy), and procedures (gastrectomy). Segmenting            and the evaluation conducted. We then expose the results, dis-
morphologically complex words into their components is the           cuss the method and conclude with some perspectives.
task of morphological analysis. When this analysis is com-
plemented by semantic interpretation, the process is called
morphosemantic analysis. Complex words are often “compo-             Material and Methods
sitional,” in the sense that the meaning of a complex word is
often a combination of that of its parts.                            The principle on which this work is based is morphosemantic
                                                                     analysis, that is morphological analysis associated to a seman-
Morphosemantic analysis can therefore help processes inter-          tic interpretation of words. In other words, we want to obtain
ested in semantics, such as the detection of similar terms or the    a decomposition and a description of the meaning of a com-
generation of definitions. This was for instance the aim of [1],     plex word based on the meanings of its parts. A complex
where Morfessor, a tool for unsupervised learning of morpho-
word may be formed through any combination of the following        It has been pointed out [12] that the method could be extended
word formation rules:                                              to other languages. Indeed the rules for generating lexically
   • derivation, which adds an affix (prefix or suffix) to a       related words (item 4.) do not rely on any language-specific
     base word, e.g., pain/painful;                                features so that this part of the system is fully language-
                                                                   independent. The morphosemantic parser however needs to be
   • compounding, which joins two (or more) components             adapted to the language.
     together, those components being either neo-classical
                                                                   To test the transposition of DériF to English, we prepared a
     roots called Combining Forms (CFs) or modern-
                                                                   list of test words. These words were taken from the WHO-
     language words, e.g., thermoregulation, arthritis.
                                                                   ART terminology, which describes Adverse Drug Reactions
For this work we chose to analyze neo-classical compounds          (ADRs), since one of the intended applications of this work is
(formed from CFs). However, a compound may also undergo            to contribute to the pharmacovigilance domain by grouping
derivation, so that mixed-formation words must also be ad-         terms describing similar ADRs. We selected the English terms
dressed. Therefore we included not only “pure” compounds,          of this terminology; since DériF works on single words and
but also those neo-classical compounds that were prefixed or       not on multi-word units, we split them into single words; and
suffixed (e.g., haemorrhagic).                                     since we adapted DériF to analyze neo-classical compounds,
                                                                   we only retained those types of words. The selection was done
Our main hypothesis for the transposition from French to Eng-
                                                                   both automatically by removing all words of 4 characters or
lish is that a same linguistic analysis can be applied to neo-
                                                                   less (these words are practically never morphologically com-
classical compounds of several languages. We assume that
                                                                   plex), and manually by reviewing the list to look for neoclassi-
they are formed in a similar way and that the components in-
                                                                   cal compounds (the work was done by a language engineer,
volved are the same, the major differences being orthographic
                                                                   LD). This gave us a list of 1,299 words to be decomposed out
(such as -algia/-algie).
                                                                   of a total of 3,476 words. These words were lemmatized and
Material                                                           tagged with their parts-of-speech using the Treetagger1 part-of-
                                                                   speech tagger. We used a lexicon of tagged words from the
We started from the French version of the DériF (“Derivation       UMLS Specialist lexicon2 to help TreeTagger deal with un-
in French”) morphosemantic analyzer. DériF was designed            known words.
both for general language and more specialized vocabularies
such as medical language. It performs an analysis purely based     Methods
on linguistic methods and implements a number of decomposi-
                                                                   Adapting the morphosemantic analyzer
tion rules and semantic interpretation templates. It also uses
resources which include a lexicon of word lemmas tagged with       The method of morphological analysis that we want to trans-
their parts-of-speech and a table of CFs. The system goes fur-     pose to English is schematized in figure 1.
ther than simple decomposition and interpretation steps by
predicting lexically related words. Another of its distinctive
features is that it yields a structured decomposition of words
and not simply a linear segmentation, so that we know which
part is the head of the word.
As input the system expects a list of words tagged with their
parts-of-speech and lemmatized (in their base form—no plu-
ral). It outputs the following elements:
    1.   a structured decomposition of the word into its mean-
         ingful parts;
    2.   a definition (“gloss”) of the word in natural language;
    3.   a semantic category, inspired by the main MeSH tree
         descriptors (anatomy, organism, disease, etc.);
    4.   a set of potentially lexically related words. The rela-
         tion identified can be an equivalence relation (eql), a
         hyponymy relation (isa) or a see-also relation (see).                   Figure 1 – Morphosemantic Parsing
For instance, the French word acrodynie (English acrodynia)        Language-specific material is located in:
is analyzed in the following way (N stands for noun, N* is
assigned to a noun CF):                                                1.   the lexicon of tagged word lemmas which is used by
                                                                            the system to test whether a component exists and to
acrodynie/N ==>                                                             retrieve its part-of-speech;
(1) [ [ acr N* ] [ odyn N* ] ie N]
(2) douleur (de—lié(e) à) articulation (pain of—linked to joint)   1
                                                                     http://www.ims.uni-
(3) maladie (disease)                                              stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
(4) eql:acr/algie, eql:apex/algie, see:acr/ite, see:apex/ite       (last access 26/03/07)
                                                                   2
                                                                     http://www.nlm.nih.gov/pubs/factsheets/umlslex.html (last access
                                                                   26/03/07)
2.   the table of Combining Forms(CFs): each CF is asso-                  Table 1 – Table of Combining Forms (excerpt)
         ciated with a modern-language word which describes
         its meaning, a semantic type, a part-of-speech, and a          CF                English        Type          Relations
         set of CFs related through relations eql, isa, see;          algia        pain                disease      odyn, algo, ~itis
    3.   the decomposition rules: these rules are triggered in a      blephar      eyelid              anatomy      palpebr,
•    Errors at the preprocessing level (mistagged words),       while other methods remain at the level of morphological
         e.g., corporal/N was tagged as a noun while being an       segmentation [8] or add semantics after using a tool for de-
         adjective in our context.                                  composition [1]. We also provide a hierarchical decomposi-
                                                                    tion as opposed to a linear one [1,6,8]. In [1] the method was
              Table 4 – Precision / Recall figures                  also applied to the WHO-ART terminology, so that we could
                                                                    compare results. The same word list was submitted to the Mor-
     Nb of correct results         Precision         Recall         fessor segmenter in exactly the same conditions as in [1], and
                         675            78.5%           52%         obtained a coverage of 93.7%, but a precision of 53.2% (25%
                                                                    less than DériF) and a recall of 49.9% (slightly less than
                                                                    DériF). This means that about the same number of words was
We measured a precision of 78.5% (see table 4) which is fairly      correctly analyzed (recall), but DériF was much more to the
good. Combined to a moderate coverage, this yields a 52%            point by proposing significantly less incorrect analyses. By
recall. Incorrect results were mainly due to:                       relying on a statistical segmenter, the method in [1] has the
                                                                    advantage of being language-independent; however, its im-
    •    Wrong structuring of the decomposition. An example         plementation also uses a table of morphemes, so transposition
         can be seen in table 3 with the word meningoen-            to another language is not immediate either. Besides, statistical
         cephalitis. Its correct decomposition should be:           segmentation also brings certain types of errors that could be
         [ [ [ mening N* ] [ encephal N* ] ] [ itis N* ] N]         avoided with linguistic rules as implemented in DériF, e.g.,
         -itis should be the head of the conjunction of mening-     chemosis segmented as c+hem+osis. Moreover, as pointed out
         and encephal-, which would give a definition such as       above, DériF outputs more complete information than Morfes-
         “inflammation related to head and meninges.”               sor (simple linear segmentation).
    •    Unsatisfactory definition (often due to the fact that      This work also suggests the perspective of a system that could
         the meaning of the word was not sufficiently reflected     work with several languages. We transposed the system from
         by the meaning of its parts). See for instance the         French to English, but a possible next step would be to be able
         analysis of the word acanthosis in table 3. The de-        to use it for translation: i.e. producing an English definition of
         composition is right but its meaning has evolved too       a French word (or vice-versa). This would require a multilin-
         much and cannot be derived from that of its parts.         gual table of CFs (as suggested in [12] and prepared here for
         This word should not be considered a complex one.          French+English), multilingual semantic templates (as obtained
                                                                    from the present work for these two languages), and transla-
    •    Mistagged words that could not be correctly ana-           tions of simplex words involved in compounds. Such a system
         lyzed. This is the case of alveolar (last row of table     could contribute for instance to cross-language information
         3), which was treated as a noun derived from an ad-        retrieval, with the same principles as [14]. Another potential
         jective (alveolar as an adjective is correctly analyzed    next step would be to test the adaptation method on other re-
         by DériF).                                                 lated languages.

Discussion                                                          Conclusion

The precision of the adapted system is good, especially for a       In this work, we successfully transposed a linguistics-based
first implementation. Recall is lower but should rapidly grow       morphosemantic analyzer from French to English in order to
when increasing the size of the table of CFs and the lexicon.       provide definitions for English neoclassical medical words.
                                                                    This verifies our hypothesis that neoclassical compounds from
The present state of the system already shows that this lan-        different languages can be analyzed in a similar way, as can be
guage-dependent system could be adapted to another, related         expected in the medical vocabularies of languages in the Ro-
language for the task of analyzing medical compounds. The           mance (e.g., French) and Germanic (e.g., English) families.
advantage of doing so is that we did not have to start from         This can be seen as a first step towards a multilingual system.
scratch and implement a new system. Indeed, a certain amount
of manual work is still necessary, such as preparing the table      Future work includes improvement of the system as well as
of CFs and adapting the semantic templates; but there is a sta-     using the results for a specific application such as pharma-
ble basis on which to work, so that we believe that this solu-      covigilance. The decompositions and definitions generated can
tion is overall less time-consuming. An additional advantage is     be used to measure proximity between WHO-ART terms and
that in the process, we can transfer to English linguistic analy-   group similar terms, but without necessarily relying on
ses which were initially designed for French. For instance,         SNOMED as done in [1]. Grouping similar terms describing
Namer [13] proposed an analysis of pathology nouns of the           Adverse Drug Reactions would allow better signal detection in
form Pref-(Y)X-ie, for instance hypercalciurie, which she           pharmacovigilance.
showed could also apply to German, Spanish, Italian, and Eng-
lish. In the present work, the implementation of this analysis in   References
the French DériF is directly transferred to its English version.
Using a linguistics-based morphosemantic analyzer such as           [1] Iavindrasana J, Bousquet C and Jaulent MC. Knowledge
DériF has a number of advantages. The system performs both              acquisition for computation of semantic distance between
morphological decomposition and semantic interpretation
WHO-ART terms. Stud Health Technol Inform. 2006;                    [9] McCray AT, Srinivasan S and Brown AC. Lexical methods
   124:839-44.                                                             for managing variation in biomedical terminologies. Proc
                                                                           Annu Symp Comput Appl Med Care. 1994:235-9.
[2] Creutz M, Lagus K, Linden K, and Virpioja S. Unsuper-
    vised morpheme segmentation and morphology induction               [10] Namer F and Zweigenbaum P. Acquiring meaning for
    from text corpora using Morfessor 1.0. Technical report,              French medical terminology: contribution of morphose-
    Publications in Computer and Information Science, Hel-                mantics. In Marius Fieschi, Enrico Coiera, and Jack Li,
    sinky University of Technology, 2005.                                 eds. Proc 10th World Congress on Medical Informatics,
                                                                          San Francisco, Ca, 2004;11(Pt 1):535-9.
[3] Pacak MG, Norton LM, and Dunham GS. Morphosemantic
    analysis of -ITIS forms in medical language. Methods of            [11] Baud R, Rassinoux AM, Ruch P, and Lovis C. The
    Information in Medicine 1980;19:99-105.                               power and limits of a rule-based morphosemantic parser.
                                                                          In: Proceedings of AMIA 1999 Annual Symposium, 1999:
[4] Dujols P, Aubas P, Baylon C, and Grémy F. Morphose-
                                                                          22-6.
    mantic analysis and translation of medical compound
    terms. Methods of Information in Medicine 1991;30:30-5.            [12] Namer F and Baud R. Predicting lexical relations be-
                                                                          tween biomedical terms: towards a multilingual mor-
[5] Wolff S. The use of morphosemantic regularities in the
                                                                          phosemantics-based system. Stud Health Technol Inform.
    medical vocabulary for automatic lexical coding. Methods
                                                                          2005;116:793-8.
    of Information in Medicine 1984;4(23):195-203.
                                                                       [13] Namer F. Guessing the meaning of neoclassical com-
[6] Lovis C, Michel PA, Baud R, and Scherrer JR. Word seg-
                                                                          pounds within LG: the case of pathology nouns. In: Pro-
    mentation processing: a way to exponentially extend medi-
                                                                          ceedings of Generative Approaches to the Lexicon, Ge-
    cal dictionaries. Methods of Information in Medicine
                                                                          neva, 2005:175-84.
    1995:28-32.
                                                                       [14] Hahn U, Honeck M, Piotrowski M, and Schulz S. Sub-
[7] Schulz S, Romacker M, Franz P, et al. Towards a multilin-             word segmentation: Leveling out morphological variations
    gual morpheme thesaurus for medical free-text retrieval.              for medical document retrieval. 2001:229-33.
    In: Proceedings of MIE 99, Ljubliana, Slovenia. IOS Press,
    1999:891-894.                                                      Address for correspondence

[8] Markó K, Schulz S, and Hahn U. Morphosaurus - design               Louise Deléger, INSERM U729, 15 rue de l'Ecole de Médecine
    and evaluation of an interlingua-based, cross-language             75006 Paris, France; louise.deleger@spim.jussieu.fr
    document retrieval engine for the medical domain. Meth-
    ods of Information in Medicine 2005;(44):537-45.

                                 Table 3 – Example word analyses generated by the updated DériF
      Word/POS                               Decomposition                          Definition           Type     Related words
                                                         Correct analyses
 arthralgia/N            [ [ arthr N* ] [ algia N* ] N]                        pain (of -- linked to)   disease   eql:arthr/algesia
                                                                               joint                              see:arthr/itis
 adactyly/N              [ a+ [ dactyl N*] +y N]                               Affection character-     disease          _
                                                                               ized by the absence
                                                                               of digit
 gastroesophageal        [ [ gastr N* ] [ oesophag N* ] al ADJ ]               Related to oesopha-      anatomy   eql:stomac/oe-
 /ADJ                                                                          gus and stomach                    sophag
                                                                                                                  isa:abdomin/oe-
                                                                                                                  sophag
                                                                 Errors
 meningoencephali-       [ [ mening N* ] [ [ encephal N* ] [ itis N* ] N] N]   (Part of -- Specific        _             _
 tis/N                                                                         type of) encephalitis
                                                                               related to meninges
 acanthosis/N            [ [acanth N*] [osis N*] N]                            (Part of -- Specific     disease          _
                                                                               type of) disease re-
                                                                               lated to prickle
 alveolar/N              [ [alveolar A] N ]                                    Entity being alveolar       _             _
You can also read