From Text to Knowledge: a Unifying Document-Centered View of Analyzed Medical Language

Page created by Keith Ingram
 
CONTINUE READING
From Text to Knowledge: a Unifying Document-Centered View of Analyzed
                          Medical Language
                             Pierre Zweigenbaum, Jacques Bouaud, Bruno Bachimont,
                             Jean Charlet, Brigitte Seroussi, Jean-Francois Boisvieux
                          Service d'informatique medicale, Assistance Publique - H^opitaux de Paris
                                  & Departement de biomathematiques, Universite Paris 6

Abstract                                                               that their levels of representation of medical information
                                                                       are simply not congruent.
Although medical language processing (MLP) has                            We present in this paper a paradigm which can help to
achieved some success, the actual use and dissemination                address these two problems. In this paradigm, we con-
of data extracted from free text by MLP systems is still               sider the analyses of a text as enrichments to this text:
very limited. We claim that the adoption of an \enriched-              the resulting document integrates the original text, \an-
document" paradigm (or \document-centered" view) can                   notated" by the data produced by MLP systems. Ex-
help to address this issue. We present this paradigm and               tracted data no longer replace the original text, which
explain how it can be implemented, then discuss its ex-                can be checked by the information user, alleviating the
pected bene ts both for end-users and MLP researchers.                 need for 100 % reliable data. Besides, such an approach
                                                                       can facilitate the identi cation of data exchanged between
1 Introduction                                                         MLP researchers.
                                                                          This view of text management is not new. It is
Some medical language processing (MLP) systems have                    widespread in the natural language processing (NLP) and
achieved a reasonable level of performance and success                 document processing communities, and is becoming in-
from a technical point a view, the most well-known one                 creasingly known in medical informatics too. Our pur-
being unarguably the Linguistic String Project Medical                 pose is to transpose it to the context of medical language
Language Processor (LSP/MLP, [1]). However, the actual                 processing.
use and dissemination of data extracted from free text by                 We rst recall some background material, and then
MLP systems is still very limited.                                     present the principles of the proposed model. We illus-
   In our opinion, the most crucial factor which impedes               trate the model on several examples, and show how the
a wider use of these results by health care professionals              resulting enriched documents can be exploited. We dis-
involves the expected quality of the resulting data: one               cuss the advantages that such a \document-centered" view
obviously cannot rely on extracted data as a substitute                of analyzed medical language should bring both to nal
for the original text as long as this data is not produced             users (health care professionals) and to MLP researchers.
by a near 100 % safe procedure. Free text processing has               We also identify some limitations, and propose further di-
not reached such a level of performance | neither can one              rections of research.
expect humans to.
   On another plan, MLP results are also not disseminated
enough to the research community. Some conditions for
this to happen are indeed political (individual and organ-
isational willingness) and infrastructural (availability of
network and servers). But there is also an issue of rep-
                                                                       2 Background
resentation framework for such interchange to take place.
At present, the kinds of results that are produced by dif-             We present here the background which supports the
ferent MLP systems are often so distant from one another               paradigm proposed in this paper: a schematic reconstruc-
    To appear in Methods of Information in Medicine. A rst
                                                                       tion of MLP tasks, the emergence of a trend towards a
version was presented at the \Fourth International Conference on       document-centered patient record, the usefulness of anno-
Medical Concept Representation", workshop of Working Group 6 of        tated corpora, and some basics about document markup
IMIA, Jacksonville, Fl., January 1997.                                 (see gure 1).

                                                                   1
such as diagnoses, acts, or anatomic localizations.
                                                                                          Segmentation may be structured: segments may be
                                                               MLP results

                               Enriched document                                       nested in one another. For instance, a diagnostic expres-
                                                                                       sion may mention a speci c anatomical location: we then
                                                                                       have an anatomic location segment inside a diagnosis seg-
  Document-centered
       EMR                                                                             ment. The basic model of main-stream syntax is that of
                                                          segment and categorize       constituent structure, which is indeed a structured seg-
                                                              text contents
                                                                                       mentation. Sager's MLP system [1] can essentially be de-
                                                                                       scribed as a structured segmentation and categorization
                                                                                       system. Categorization and segmentation correspond to
                                                             Annotated corpus
                                                           for NLP development
                                                                                       the paradigmatic and syntagmatic division in linguistics.
  variably structured                                                                  It is therefore not a surprize that the data extracted from
     text and data
                                                                                       a natural language text can often be represented as a cate-
                                  SGML markup                                          gorization of text segments. This accounts for a large part
                        
                                                                                       of MLP systems (e.g., [1, 7]), and does address a substan-
                                                        knowledge about language
                                                                                       tial part of the needs for data extraction from medical
             Figure 1: Background: convergence.                                        text.
                                                                                          However, instead of assigning a text segment an atomic
                                                                                       category drawn from an a priori set, one may choose to
2.1 A reconstruction of medical language                                               build for it a complex category. The hypothesis is that
    processing tasks                                                                   a nite set of a priori (pre-coordinated) categories is not
                                                                                       sucient for dealing with the open-endedness of natural
MLP systems aim at extracting \medical data" from free                                 language, and that one therefore needs to rely on a gen-
text. The prototypical application is diagnosis encoding                               erative language for describing a potentially in nite set
(e.g., [2]), where the desired output is a diagnosis code in                           of (post-coordinated) categories. This is the assumption
an existing classi cation (e.g., the International Classi -                            made by those who rely on some sort of compositional
cation of Diseases [3]). This is a classi cation, or catego-                           language [4, 8] for representing medical information, and
rization, task: given a string of words and an a priori set                            in particular by the tenants of knowledge representation
of classes, determine the class which should be assigned                               languages [9, 10]. Such a complex categorization is gener-
to this string. An extended version of categorization can                              ally not reducible to atomic categorizations of structured
assign several classes (or indices, or keywords) to an in-                             segments, and must therefore be considered as a distinct
put string, for instance, to associate a set of SNOMED                                 class of MLP production.
[4] codes to an input sentence [5]. Some \ancillary" pro-                                 The paradigm we present in this paper naturally caters
cedures in MLP are also categorization tasks: e.g., deter-                             for the rst two types of NLP output: atomic categoriza-
mining the part-of-speech of each word, or (more interest-                             tion and (structured) segmentation, and thus can help to
ingly) determining the broad semantic category of each                                 manage a very large part of the results of MLP systems.
word or expression. The latter task plays a central role                               It can also be made, with some restrictions, to host the
in [1], and [6] shows that its status can be promoted from                             latter type (complex categorization).
ancillary MLP to an actual, end-user service.
   Although categorization may be the only task expected                               2.2 The document-centered                        patient
from an MLP system (e.g., in [2]), another important task
is segmentation : given a string of words, identify relevant
                                                                                           record
segments in this string. These segments correspond to                                  The traditional view of the electronic medical record
meaningful units, for instance the expression of a diagnosis                           (EMR) is that of a database which holds items of coded, or
| which one might further want to categorize. Segmen-                                  standardized, information. While this view has achieved
tation, in this context, allows to focus on (and categorize)                           some success, it does require a substantial e ort of stan-
speci c parts of a text, rather than considering it as an                              dardization over medical information. Moreover, a com-
indistinct whole. An application of segmentation could                                 plete formalization of medical information is not theoreti-
be to identify in a text all instances of diagnoses. This                              cally reachable | this is precisely one of the reasons why
task is di erent from the above categorization of a string                             natural language still has some appeal to medical practi-
which is already known to be a diagnosis. \Ancillary" seg-                             tioners.
mentation tasks include morpho-syntactic segmentation                                     An alternate view gives textual documents a rst-class
(into words, sentences, phrases and other syntactic con-                               status. It is inspired both by an observation of non-
stituents) as well as the identi cation of semantic units                              computerized medical practice (e.g., [11, 12]) and by the

                                                                                   2
fast-growing document processing industry. In that view,                 word is marked for semantic category can be used to
the EMR is a collection of documents, which can be com-                  learn selectional restrictions or taxonomies [22].
posed of text, images, etc. Textual documents here can,
and generally do, have structure. They can be hierar-               The same idea lies behind some statistical approaches to
chically divided into identi ed chunks, down to any suit-           MLP, which learn word to code associations on the basis
able level, thus providing much of the same data struc-             of a manually provided training sample.
turing advantages as databases. But at the same time,                  The paradigm we present is based on the same idea of
not every piece of text has to be standardized: the bal-            annotating a text with its analyses: we believe this can
ance between formal data and natural language text can              help MLP researchers both for their own developments
be more nely tuned than with traditional database mod-              and by facilitating the exchange of enriched corpora.
els, and they can merge together more tightly. The ex-
pected bene t is rst a better account of available medi-            2.4 Marking up text
cal information, described as it can be expressed, rather
than as it should be expressed to t a prede ned com-                Managing structure, categories, and annotations in a text
puter model. Assuming that a document-centered view                 can be realized by inserting in it tags which mark this
of the medical record is closer to non-computerized prac-           structure, specify the categories, and identify the annota-
tice than an item-oriented view, an additional bene t is a          tions. The publishing industry has pushed forth an ini-
potentially better-suited interface for health care profes-         tiative which led to the adoption of an ISO standard for
sionals to work on the patient record. Several attempts             this purpose: the Standard Generalized Markup Language
at experimenting with such a document-centered patient              (SGML, [23, 24]). SGML has since then spread to a wide
record are ongoing [13, 14, 15, 16, 17, 18].                        variety of industries and research disciplines. HTML (Hy-
   The document-centered view aims at nding a suitable              perText Markup Language), the language of the World
tradeo between free text and formal data. By producing              Wide Web, is itself a xed application of SGML.
data from free text, MLP therefore plays a central role                SGML basically allows to de ne the nature and struc-
in a document-centered patient record and should bene t             ture of the \tags" which can be inserted in a document.
from it.                                                            Tags delimit \elements" of a document, i.e., categorized
                                                                    segments, such as a section title, paragraph, diagnosis or
2.3 Annotated corpora as a precious re-                             person name. Tags may also associate \attributes" with
                                                                    elements, such as the level or number of a section title, the
    source                                                          sex of a person or the standardized code for a diagnosis.
The 80's have seen the development of increasingly larger           The set of allowed tags in a class of documents and their
on-line text corpora, and their dimension has now ex-               authorized order of appearance and nesting are speci ed
ploded with the advent of the World Wide Web. While                 in a \Document Type De nition" (DTD), also written in
large volumes of text are a very useful resource for study-         SGML.
ing language, for evaluating NLP tools and for training                Large SGML DTDs have been proposed for a wide va-
them [19], annotated corpora are an even more valuable              riety of industries and scienti c disciplines. Of prominent
asset for these same purposes. Annotated corpora link               interest for NLP is the Text Encoding Initiative [25], an
texts with their analyses: a typical, basic level of analysis       international project which has led to a DTD encompass-
may involve the identi cation of sentence boundaries or             ing a large part of the needs of the humanities, litterary
the part-of-speech of each word. More elaborate analyses            and linguistic research. In the medical informatics arena,
associate syntactic parses with sentences or semantic cat-          the HL7-SGML group has recently been promoting the
egories with words. The resulting corpora are much more             use of SGML for the medical record [26].
than text: they constitute sources of knowledge about lan-
guage.
   Typical work on annotated corpora includes:                      3 The enriched document model
    evaluating NLP systems: this is how the DARPA                  In this section, we present the principles of the enriched
     MUC competitions have been proceeding [20];                    document paradigm and sketch a possible implementation
    training NLP systems: e.g., a part-of-speech tagger
                                                                    based on the SGML language.
     can tag accurately some 97 % of the words of a corpus
     provided it is given a large enough hand-annotated             3.1 MLP as a document enrichment pro-
     training corpus [21];                                              cess
    building resources for NLP systems; this is actually a         We propose to handle the analyses which MLP can pro-
     variant of the preceding point: a corpus where each            duce from a text as enrichments to this text. Once \data"

                                                                3
has been obtained, the original text is not \forgotten". It                 one MLP process could identify and categorize the
remains there in the forefront to help to manage the de-                    overall structure of the text (reason for admission, an-
rived data. It is still the reference that the reader of the                tecedents, etc.) whereas another one could determine
patient record may want to consult when examining this                      the SNOMED encodings of the relevant expressions
patient's data.                                                             in the text. This possibility of task division has im-
                                                                            plications both for MLP architectural issues and for
3.2 Indexing interpretations on the                                         research team collaboration.
    source text                                                            An analysis of only some parts of a text can be han-
The general picture of an enriched document is that of a                    dled. Moreover, the actual piece of text which gives
text, some portions of which are associated with \analy-                    rise to a speci c analysis can be precisely identi ed.
ses", or \interpretations" (in the above-proposed terms,                    By applying categorization to a segmented text, one
\categorizations"). A very simple principle is applied                      can make the di erence between an analysis of a
throughout the model: index these interpretations on                        whole text and an analysis of a limited extract of
the source text that they interpret (see gure 2). This                      this text. For instance, a diagnosis will not have the
gives a deserved importance to the segmentation aspect                      same meaning if it is presented as the diagnosis of a
of NLP. The same principle was at the root of seed mod-                     patient (attached to the whole report) or as a diag-
els in computational linguistics (the chart model for pars-                 nosis found in the \reason for admission" section or
ing [27]) and arti cial intelligence (the blackboard model                  in the \antecedents" section.
for problem-solving [28] and the ATMS model for belief                     MLP-provided data can be useful as normalized
maintenance [29]).                                                          search data. By indexing analyses on their source
                                                                            text, one can thus link back to the source text from
                         original text                                      the search data. As a consequence, since text and
                       is still available     add as many
                                            interpretations                 corresponding data are synchronized, one can have
                                                as useful                   search processes operate on normalized data while
                                             (categorization)               letting the user view understandable text.
                                                                            This back link is also an essential feature to enable
                                                                            traceability and quality control of MLP results.
    can interpret
  selected parts of                                                        Since several such interpretations can be synchro-
     document
                                            two-way
                                              link                          nized, search criteria operating on several levels of
        (structured                                                         interpretation can be combined together, resulting in
      segmentation)
                                                    search                  greater search power. For instance, one can search for
                                                interpretations
                                                    or text                 a given diagnosis (segment indexed with a \diagno-
                                                                            sis" category) in a speci c section (segment indexed
                                                                            with a \section" category) of a discharge summary {
 Figure 2: Indexing interpretations on the source text.                     and search or view the corresponding original text.
  Let us review some of the properties this indexing brings                Finally, if an analysis produces a categorization, this
to the resulting enriched document.                                         categorization is inherited by the text segment in-
                                                                            dexed with it. Such a categorization can be made per-
     The default entry point into the document remains                     spicuous to the reader through appropriate markup
      the original text. None of its structure or contents                  and rendering devices [6].
      is lost in the new document, as might be the case
      in an item-oriented paradigm. Therefore, the basic              3.3 A simple implementation of the en-
      structure, the basic reading of the document is that of
      the source text. The importance of the structure and
                                                                          riched text model
      physical aspect of documents in the patient record              The main features of the enriched text model can be im-
      have been stressed by some authors (in particular,              plemented with a few SGML constructs.
      [11]): keeping these features should be helpful to the
      reader.                                                         3.3.1 Segmentation and categorization
     Several interpretations of the text can be added                A basic kind of analysis simply consists in categorizing
      monotonically to the same document. For instance,               text segments. Text segments can range from a (part of)

                                                                  4
word to a whole paragraph, section or document. Cat-               possible to keep the analyses outside the source document,
egories may be morphosyntactic (part-of-speech, lemma,             so that it remains untouched (and therefore may have a
syntactic category) as well as semantic (broad semantic            read-only status, a safe way to implement data integrity),
category, word sense, position in a thesaurus) or concep-          using external references to positions in this document.
tual (position in an ontology). Another kind of segmen-            HyTime links [32], in particular, allow to do this.
tation concerns the general structure of the text (header,           A limitation of the proposed encoding for \complex"
signature, divisions, paragraphs).                                 analyses is that they are stored as plain text rather than
   An analysis which results in the categorization of a text       encoded as markup (elements and attributes). As a conse-
segment can be recorded as an encapsulation of this text           quence, the SGML processing mechanisms built in SGML
segment within a start and end tags assigned to this cat-          tools do not help much for processing (e.g., searching)
egory. For instance, the text on gure 3 is tagged with             these representations.
SGML elements denoting segmentation and categorization
of structural units (div, head, p) and semantic units
(name, date, diagnosis). A variant uses attribute val-             4 Uses of enriched documents
ues to specify categories. In the same example, a semantic
categorization of divisions is encoded as a type attribute,        4.1 Overview of uses
and a precise semantic category for diagnoses is speci ed          We envision two broad classes of users for enriched medical
as an ICD9CM attribute whose value is a position (a hier-          documents: health care professionals and medical infor-
archical code) in the ICD-9-CM classi cation.                      matics researchers (see gure 7. We de ne the rst class
   Figure 4 shows an additional example where the begin-           as being interested primarily in working on patient data.
ning of a patient discharge summary is tagged with divi-           The second class is concerned with methods and tools for
sions (div, head), paragraphs (p), person names (name),            processing medical language and knowledge.
health care locations (locsoins, typelocs, nomlocs),
dates (date), symptoms and diagnoses (etatpath), med-
ical treatments (theramed, drug, poso) etc.
   The rules that govern which elements and attributes

                                                                                                               M
can be used and how they can nest are speci ed in an

                                                                                                                 LP
SGML document type de nition (DTD). Figure 5 shows
an extract of the DTD enforced in the document of g-
ure 4. Each legal element is declared (!element declara-
tion) with its allowed attributes (!attlist declaration).
The content of an element can be simple text (#PCDATA)
or other SGML elements.

                                                                                                                       E
3.3.2 More complex analyses
                                                                                                                     RC
                                                                       M

                                                                                                                   U
                                                                         ED

More complex analyses need to be represented explicitly.
                                                                                                                SO
                                                                           IC

                                                                                                             RE

The TEI has de ned several notations for managing such
                                                                              A
                                                                               L

analyses (see, e.g., [30]).
                                                                                                           E
                                                                                                          G
                                                                                RE

                                                                                                         A

   The simplest one consists in inserting the analysis as
                                                                                    CO

                                                                                                      U
                                                                                                     G

text enclosed within speci c start and end tags, at the
                                                                                      RD

                                                                                                    N

                                                                     Clinical use                              MLP research
                                                                                                  LA

place in the text where it belongs (interlinear analysis).
Assuming for instance that we want to include the con-
ceptual representation of each sentence as derived by an                   Figure 7: The enriched document model.
MLP system such as MENELAS [10] or RECIT [31], we
can segment the text into sentences (element s), each of             An enriched document may be used in (at least) three
which includes this conceptual representation (in a cr el-         ways. One may simply want to consult the document |
ement | see gure 6). The corresponding DTD must                    the enrichments can help reading or navigating through
indeed cater for these elements.                                   the document. One can obtain information or data from
   In variant encodings, such analyses are grouped to-             the document | for instance, extract patient information
gether in a special part of the document (for instance,            to be sent to another person or system. Finally, one may
in the end) and SGML reference links tie together a text           want to store data into the document | including writing
segment and its analysis. This has the advantage of keep-          the original text. The last two operations may be com-
ing the \main" part of the document \clean". It is even            bined, for instance when applying NLP tools to a text and

                                                               5
MOTIF D'HOSPITALISATION
      Monsieur DURAND, a^g
                                          e de 88 ans, est hospitalis
                                                                     e le
        18/1/1997 pour un angor
        spontan
               e 
                 a r
                    ep
                      etition.
      
                                          Figure 3: Categorization and segmentation.
      
            MOTIF D'ADMISSION 

              Monsieur DURAND, ^ag
                                                  e de 61ans,
                est admis dans
                l'Unite de R
                                                                     eanimation du
                    Service de Pneumologie du
                        Groupe HospitalierPitie Salp^
                                                                                             etriere
                    
                le 26/7/90 pour prise en charge d'une
                ischemie aigu
                                                     e du membre inf
                                                                    erieur gauche avec
                insuffisance r
                                        enale oligo-anurique.
              
            PROVENANCE :

              UrgencesPiti
                                                                     e.
              
                                      Figure 4: A more complete example of tagged text.
storing the results as appropriate enrichments.                          instance, linking together all mentions of problems
  We rst examine several uses of enriched documents in                   a ecting a given anatomical location, or linking the
a health care setting. We then recall the main bene ts we                mention of an examination in a discharge summary
see for the medical informatics research community.                      to the actual, full examination report.
4.2 Consulting enriched documents                                       The rst two operations on an enriched document
                                                                     are extremely simple to implement with SGML-aware
The segmentation and categorization in an enriched docu-             software. For instance, using an SGML browser (e.g.,
ment can be used to help a reader consult this document:             Panorama PRO, [33]), one can de ne a style sheet which
                                                                     speci es how each kind of SGML element should be ren-
     a basic idea is to use the segmentation and catego-            dered. Figure 8 shows a Panorama display of the doc-
      rization to compose the layout of the document | for           ument of gure 4, where section heads are in boldface,
      instance, to highlight segments of a speci c category          symptoms and diagnoses are in italics, examination results
      [6];                                                           are indented, and person and location names are replaced
     one can further present a di erent view of the same            with an icon.
      document t to a speci c goal or task | for instance,              With this same SGML tool, one can de ne \naviga-
      creating a table of contents by extracting the section         tors", user-tailored tables of contents which are presented
      titles, or creating an index or a synopsis by extracting       in a companion window and synchronized with the main
      the pathological states or observations [15];                  text window. Figure 9 shows a navigator based on sec-
                                                                     tion heads and symptoms and diagnoses. This navigator
     one can also link together segments in the same or in          can be used both as a synopsis to get a quick feel for the
      di erent documents, thus providing hyperlinks: for             patient's history and, indeed, as a navigation device: a

                                                                 6
Figure 5: A document type de nition (extract).
click on one of the items in this window scrolls the main             powerful searches on both the original text and the data
window to the actual occurrence of this item in the source            extracted from it.
text. With the appropriate tools (here, the Panorama Pro                As hinted above, such tools know how to manipulate
SGML browser), these helps come at a marginal cost once               SGML markup and plain text. However, confronted with,
the text is tagged.                                                   for instance, the conceptual graph of gure 6, the best
   The third feature (hyperlinks) generally requires that             they can do is to consider it as text. As a result, the full
the desired links exist,and encodes them as appropriate               formal meaning of the conceptual graph is lost. Never-
tags in the document. These links could be precomputed,               theless, concept and relation names can be searched as if
e.g., by NLP software, or be user-de ned.                             they were keywords, which can still be useful (as a fall-
                                                                      back) depending on the kind of queries which are pro-
4.3 Searching a collection of enriched                                cessed. A better account of such elaborate representations
                                                                      requires some investment beyond a simple use of o -the-
    documents                                                         shelf SGML software.
The basic search mechanism in an item-oriented database
deals with formal data. Searching natural language texts
needs to resort to di erent methods, namely, content-
                                                                      4.4 Extracting information
oriented search, also called full-text search. Recent soft-           An enriched document is an elaborate compound from
ware allows to apply full-text search to structured text,             which one may wish to extract only a selected subset.
taking into account its segmentation and categorization               This may be useful, for instance, in a work ow context to
features. So for instance, one can search for occurrences of          produce a targeted report to be sent to a correspondent.
the word \kidney" in diagnosis elements that are nested               This is also convenient to build di erent views of the same
in div elements with attribute value type='reason for                 document for consultation, as suggested above.
admission'.                                                             The basic mechanism for these purposes is the selection,
   One can at will choose to search on text, on tags (and             in tagged texts, of speci ed segments, based on markup
attributes), or on a combination of both: encoding MLP                and content, and their rearrangement in the desired or-
analyses as SGML-tagged segments in enriched text en-                 der to produce the target document. This mechanism can
ables the application of o -the-shelf software to perform             be e ected by SGML transformations, for which a vari-

                                                                 7
MOTIF D'HOSPITALISATION
    Monsieur DURAND, a ^g
                                            e de 88 ans, est hospitalis
                                                                       e le
      18/1/1997 pour un angor
      spontan
             e a r
                   epetition.
     
      [Admission]-
         (past)
         (pat)->[Human_Being]-
                     (defines_cultural_function)->
                        [Medical_Subfunction]--(cultural_role)->[Patient:I63]
                     (attr)->[Age]--(val_quant)->
                        [Quantitative_Val:88]--(reference_unit)->[Year_Duration]%
         (motivated_by)->[Angina_Syndrome:I77]
         ...
     
                                    Figure 6: Embedding a conceptual representation.
                                                (Windows 95 screen dump)
                                       Figure 8: Rendering an enriched document.

ety of commercial or public-domain software exists. One           program, which produces an ICD code for each of them;
may cite in particular the SGML query language Sgm-               store the resulting ICD codes into an ICD9CM attribute of
lQL developed in the Multext project [34], which allows           the diagnosis element. The Multext project [34] has de-
to specify such transformations in a declarative way. One           ned a general NLP architecture where components work
should note that SGML transformations are a more gen-             on an SGML encapsulation of natural language data. Note
eral and powerful mechanism than simple style sheets.             that MLP programs need not all be SGML-aware: Mul-
DSSSL (Document Style Semantics and Speci cation Lan-             text also proposes alternate encodings and transcoders so
guage, [35]) aims at standardizing this notion. Performing        that usual NLP components can be plugged into such a
SGML transformations will be easier and more portable             processing chain.
when implementations of DSSSL become widely available.               Such an architecture facilitates the reuse of existing
                                                                  MLP components: its strength relies on the de nition of a
4.5 Enriching a document                                          set of analysis levels (input or output of NLP components)
                                                                  and the corresponding encodings. Some general-purpose
Enriched documents accomodate expansion in at least two           levels have been identi ed and some encodings proposed
contexts. First, health care professionals must be able to        by NLP projects such as Multext or the TEI [25]. An ob-
enter new information into the patient record (with the           vious need for this kind of architecture to be really useful
usual precautions to comply with authorizations and en-           for MLP is to de ne relevant levels of analysis for medical
sure data integrity). Beyond the mere creation of new             texts. We return to this point later.
documents, a reader can annotate existing documents [36]:
the document-centered paradigm facilitates \active read-          4.6 Exchanging enriched documents
ing".
   Second, MLP programs can be applied to the current             The general NLP community has been exchanging data
state of a document and perform interpretation tasks as           and programs for a while now: one can nd repositories
proposed above. The basic ow of control consists in se-           for lexicons, annotated corpora, parsers and other NLP re-
lecting input segments, passing them to the MLP pro-              sources in multiple places over the world (see for instance
gram, and then inserting the results back into the enriched       the Natural Language Software Registry [37] or the Eu-
document structure. A typical processing chain could be:          ropean Language Resources Association [38]) | although
select all occurrences of diagnoses (diagnosis elements,          one must stress that resources are most developed for the
identi ed in a previous pass); pass them to an encoding           English language.

                                                              8
(Windows 95 screen dump)
                                         Figure 9: A synopsis and navigation help.

   The MLP community, in contrast, lacks a comparable              cerned with partially structured information: a mixture
level of sharing of speci c resources. This point was put          of text and data, where natural language itself can un-
forward at the last meeting of Working Group 8 (Natu-              dergo some variable degree of structuring. This feature
ral Language Processing) of the European Federation for            of traditional (paper-based) medical records is considered
Medical Informatics, which was held during the Medical             essential by authors such as [11]. This is di erent from
Informatics Europe Conference in August 1996.                      approaches which assume a strict division between struc-
   Enriched documents constitute annotated corpora                 tured data and free text, as we understand that HL7 and
which can be extremely useful to the MLP research com-             its SGML initiative does. This is also di erent from ap-
munity. They allow to compare more precisely di erent              proaches where structured data is accessed and displayed
approaches, thanks to common test data, and would be a             through HTML front-ends [41]. Second, we use SGML
central piece of material in evaluations such as discussed         to tag the logical content of documents, rather than their
by [39]. For instance, to work on automated diagnosis              external presentation. The actual presentation and layout
coding, a necessary material is a corpus of already coded          of documents must be determined by separate style sheets
diagnoses. A corpus of texts (e.g., patient discharge sum-         | which can be important too, but must be designed at
maries) where diagnoses are identi ed and coded (i.e.,             a di erent level. This contrasts with methods which di-
segmented and categorized) with conventional SGML ele-             rectly encode the presentation of documents with speci c
ments and attributes would be a valuable resource to share         SGML markup, for instance HTML.
for this purpose. Annotated corpora can also be used                  Whereas MLP produces data from natural language
to train coding programs. For instance, a large enough             text, it does not have to replace text with this data. A
corpus where each word is tagged with a broad semantic             more conservative approach simply adds this data to the
category (e.g., Sager's 39 semantic categories, or the cat-        original text. As a consequence, depending on the use
egories derived from SNOMED's 11 axes) could provide               which is made of the resulting document, a 100 % e-
training material for a semantic tagger. The normalized            ciency is not necessary. One can draw a parallel with
document or corpus \header" provides room to describe              information retrieval: 100 % recall and precision are not
the contents of and speci c conventions applied in the an-         necessary (neither are they available) to provide useful
notated corpus.                                                    services to users. Answers to queries come in the form of
   We believe that the speci cation of a general set of            actual texts or abstracts, which the user can read to make
MLP-produced data would help the MLP community fo-                 his/her nal assessment. Since the user is \in the loop",
cus on useful data to be exchanged. Furthermore, de ning           the program only serves as an assistant; the nal thinking
encodings for these data, in the form of SGML Document             and decision belong to the user. The enriched document
Type De nitions, would provide a more precise framework            approach allows to follow a similar principle: although
and would facilitate data exchange. Here again, the work           search and navigation can work on MLP-produced data,
of the TEI [25] is of interest: one of the main goals of the       the user can always be presented with the synchronized,
TEI was precisely to facilitate the exchange of annotated          original text or text extract | together with the data if
textual data between scholars. A valuable feature of the           relevant.
TEI scheme is the provision for each text or corpus of                Note that the consultation of an MLP-enriched docu-
a header [40] where information about the document can             ment has something more than a \classical" (hyper)text.
be inserted: author, mode of construction and encoding,            Navigation can rely not only on explicit text and links, but
etc. This self-documentation is particularly useful when           also on the attached, underlying MLP-produced interpre-
exchanging corpora between di erent teams.                         tations. For instance, given the appropriate tools, one can
                                                                   navigate through the links induced by common semantic
                                                                   categories (e.g., go from one diagnosis to the next in a
5 Discussion                                                       text) or search for text segments whose attached analysis
                                                                   at a given level satisfy some criterion (e.g., nd all occur-
The enriched document paradigm presented here relies on            rences of conditions a ecting the lower limbs, based on a
an SGML encoding of medical text and data. As already              SNOMED encoding of the text). This opens up a whole
mentioned earlier, SGML is making its way into the med-            range of dynamic navigation mechanisms.
ical informatics community. We would like to stress two               MLP deals with tasks of various complexities. For in-
important speci cities of our proposal. First, we are con-         stance, identifying segments of a given broad semantic cat-

                                                               9
egory (e.g., all occurrences of anatomical locations) is less        6 Conclusion
dicult than providing a detailed conceptual representa-
tion for each sentence. Indeed, the results of MLP sys-              We have described a general paradigm for managing data
tems are generally less good as the diculty of the task             produced by medical language processing systems and,
increases. The proposed enriched document architecture               more generally, medical data, from its textual form to a
can allocate some room for the di erent levels of MLP re-            knowledge representation. The basic principle is to asso-
sults. The exploitation of the available data is then more           ciate NLP-produced data to its source text, the natural
or less ecient according to the available levels and the            implementation being based on SGML markup and tools.
quality of MLP components. For instance, consultation                Although this does not constitute in itself a new NLP
helps (e.g., layout and rendering | see above) can be of             technique, we have shown that this principle can bring
varying quality depending on which segmentations and                 many bene ts to MLP, both by fostering shorter term
categorizations are available and how accurate they are.             application of current MLP components in a health care
Here again, the approach is well suited to target services           setting, and by facilitating data interchange between MLP
that can accomodate some degree of imprecision or incom-             research teams.
pleteness in their input data, and that give a major role               The major need for this paradigm to be e ective is the
to text. It allows to put to work in the short term the              de nition of an architecture and document type de ni-
simpler MLP tools, and to introduce more sophisticated               tions for MLP-produced data. The de nition of DTDs
tools gradually, instead of having to wait until MLP is              for medical documents is an independently motivated goal
perfect before it can be used.                                       for the document-centered medical record [12, 18] or for
                                                                     a data-oriented electronic medical record [26]. We believe
   The underlying assumption throughout this paper is                that these requirements should be joined within a common
the possibility to specify a common set of general data              \medical document encoding initiative".
types and their encodings for MLP-processed data: in
other words, SGML document type de nitions (DTDs)
for enriched documents. This work still remains to be
                                                                     7 Acknowledgments
done, and is by no means an easy one. It is strongly                 We thank Vincent Brunie for useful discussions about
linked to the de nition of DTDs for medical language doc-            SGML and SGML tools, and Dr. Thomas Similowski for
uments, which is itself closely related to the de nition of          providing patient record material for experimentation.
the electronic medical record. Several tracks already go
some way in this direction. In the NLP community, enter-
prises such as the Text Encoding Initiative or the Multext           References
project have made steps towards DTDs for various kinds
of texts and analyses. The TEI DTD, in particular, is                 [1] Sager N, Lyman M, Bucknall C, Nhan NT, and
a very wide-coverage, extremely parameterizable, docu-                    Tick LJ. Natural language processing and the rep-
ment type de nition which can be thus adapted to many                     resentation of clinical data. Journal of the American
needs. In the medical informatics community, e orts such                  Medical Informatics Association 1994;1(2):142{60.
as CEN TC251 or HL7 have considered the encoding of                   [2] Gundersen ML, Haug PJ, Pryor TA, et al. Devel-
many kinds of medical data; both are considering the use                  opment and evaluation of a computerized admission
of SGML to pursue their tasks.                                            diagnoses encoding system. Computers and Biomed-
                                                                          ical Research 1996;29(1/2):351{72.
  The proposed architecture does not solve the issue of the
incompatibility (or uncomparability) of di erent represen-            [3] Department of Health and Human Services, Washing-
tations of medical data. It can help to identify, though,                 ton. International Classi cation of Diseases | Clin-
that representations concern the same level of interpre-                  ical Modi cations, 9th revision, 1980.
tation (e.g., di erent controlled vocabularies for coding             [4] C^ote RA, Rothwell DJ, Palotay JL, Beckett RS, and
diagnoses; or di erent knowledge representations for de-                  Brochu L, eds. The Systematised Nomenclature of
scribing the meaning of a sentence), and provide notations                Human and Veterinary Medicine: SNOMED Inter-
to explicit the nature and extent of each representation;                 national. College of American Pathologists, North-
but it does not include means to go from one represen-                      eld, 1993.
tation to another. Projects such as the UMLS [42] or
GALEN [9] do address this issue. The enriched document                [5] Oliver DE and Altman RB. Extraction of SNOMED
paradigm can help, for its part, to anchor these di er-                   concepts from medical record texts. In: Proceedings
ent representations to natural language texts and to one                  of the 18th Annual SCAMC, Washington. Mc Graw
another.                                                                  Hill, 1994:179{83.

                                                                10
[6] Sager N, Nhan NT, Lyman M, and Tick LJ. Medical              [18] Bachimont B and Charlet J. Hospitexte : station
     language processing with SGML display. In: Cimino                  de consultation pour un dossier patient dematerial-
     [43], October 1996:547{51.                                         ise. Technical Report 96{170, Departement Intelli-
                                                                        gence Arti cielle et Medecine - SIM/AP-HP, 91, bd
 [7] Zingmond D and Lenert LA. Monitoring free-text                     de l'H^opital, 75634 Paris cedex 13, France, 1996.
     data using medical language processing. Computers
     and Biomedical Research 1993;26:467{81.                       [19] Armstrong S. Using Large Corpora. MIT Press,
 [8] Friedman C, Alderson PO, Austin JH, Cimino JJ,                     1994. Reprinted from Computational Linguistics,
     and Johnson SB. A general natural-language text                    19(1/2), 1993.
     processor for clinical radiology. Journal of the Amer-        [20] Lehnert W and Sundheim B. A performance eval-
     ican Medical Informatics Association 1994;1(2):161{
     74.                                                                uation of text-analysis technologies. The Arti cial
                                                                        Intelligence Magazine 1991(Fall):81{94.
 [9] Rector AL, Nowlan WA, and Kay S. Conceptual
     knowledge: the core of medical information systems.           [21] Brill E. Transformation-based error-driven learning
     In: Lun KC, Degoulet P, Piemme T, and Rienho O,                    and natural language processing: A case study in
     eds, Proc MEDINFO 92, Geneva. North Holland,                       part-of-speech tagging. Computational Linguistics
     1992:1420{6.                                                       1995;21(4):543{65.
[10] Zweigenbaum P and Consortium Menelas .                        [22] Basili R, Pazienza MT, and Velardi P. What can
     Menelas: an access system for medical records using                be learned from raw texts? Machine Translation
     natural language. Computer Methods and Programs                    1993;8:147{73.
     in Biomedicine 1994;45:117{20.
[11] Nygren E and Henriksson P. Reading the medical                [23] ISO . Standard generalized markup language. ISO
     record. I. Analysis of physicians' ways of reading the             8879, International Organization for Standardization,
     medical record. Computer Methods and Programs in                   1 rue de Varembe, C.P. 56, CH-1211 Geneve, 1986.
     Biomedicine 1992;39:1{2.                                      [24] Goldfarb CF. The SGML handbook. Oxford Univer-
[12] Seroussi B, Baud R, Moens M, et al. DOME nal re-                  sity Press, 1990.
     port. DOME (EU Project MLAP-63221) Deliverable                [25] Sperberg-McQueen CM and Burnard L, eds. Guide-
     D0.2, DIAM | SIM/AP-HP, 1996.                                      lines for Electronic Text Encoding and Interchange
[13] Nygren E, Johnson M, and Henriksson P. Reading                     (TEI P3), Chicago, Oxford, 1994. ACH, ACL and
     the medical record. II. Design of a human-computer                 ALLC.
     interface for basic reading of computerized medi-
     cal records. Computer Methods and Programs in                 [26] Alschuler L, Lincoln TL, and Spinosa J. Medicine for
     Biomedicine 1992;39:13{25.                                         SGML. In: Proceedings GCA SGML'96, 1996:181{9.
[14] Adelhard K, Eckel R, Holzel D, and Tretter W.                [27] Kay M. Algorithm schemata and data structure in
     A prototype of a computerized patient record.                      syntactic processing. In: Grosz BJ, Jones KS, and
     Computer Methods and Programs in Biomedicine                       Webber BL, eds, Readings in Natural Language Pro-
     1995;48:115{9.                                                     cessing. Morgan Kaufmann, Los Altos, Ca., 1986:35{
[15] Bouaud J and Seroussi B. Navigating through a                     70.
     document-centered computer-based patient record: a            [28] Nii HP. Blackboard systems: the blackboard model
     mock-up based on WWW technology. In: Cimino                        of problem solving and the evolution of blackboard
     [43], October 1996.                                                architectures. The Arti cial Intelligence Magazine
[16] Carvalho G. A. Modelisation documentaire du                       1986;7(2,3):38{53,82{106.
     dossier patient : une experience. Rapport de DEA,
     Informatique Biomedicale, Universite Paris 5, 1996.         [29] de Kleer J. An assumption-based truth maintenance
                                                                        system. Arti cial Intelligence 1986;28(1).
[17] Laforest F, Frenot S, and Flory A. A new approach
     for hypermedia medical records management. In:                [30] Langendoen DT and Simons GF. A rationale for the
     Brender J, Christensen JP, Scherrer JR, and Mc-                    TEI recommendations for feature-structure markup.
     Nair P, eds, Proceedings of MIE'96, Copenhagen,                    In: Ide and Veronis [44], 1995:191{209. Reprinted
     DK. IOS Press, 1996:1042{6.                                        from Computers and the Humanities 29, 1995.

                                                              11
[31] Baud RH, Rassinoux AM, Wagner JC, et al. Rep-                   [43] Cimino JJ, ed. Proceedings of AMIA Annual
     resenting clinical narratives by Sowa's Conceptual                   Fall Symposium 96, Washington DC, October 1996.
     Graphs. Methods of Information in Medicine                           AMIA (JAMIA supplement).
     1995;34(1/2).
                                                                     [44] Ide N and Veronis J, eds. The Text Encoding Ini-
[32] DeRose SJ and Durand DG. Making Hypermedia                           tiative. Kluwer Academic Publishers, Boston, 1995.
     Work { A User's Guide to HyTime. Kluwer Aca-                         Reprinted from Computers and the Humanities 29,
     demic Publishers, 1994.                                              1995.
[33] SoftQuad, Inc., Toronto. Panorama Pro for Microsoft             [45] Clark J. ISO/IEC DIS 10179.2: Document style se-
     Windows, 1995.                                                       mantics and speci cation language (DSSSL). WWW
                                                                          page http://www.jclark.com/dsssl/, 1995.
[34] Veronis J. MULTEXT home page. WWW
     page http://www.lpl.univ-aix.fr/projects/multext/,
     CNRS et Universite de Provence, 1996.
[35] ISO . Document style semantics and speci cation
     language. ISO/IEC 10179, International Organiza-
     tion for Standardization, 1 rue de Varembe, C.P. 56,
     CH-1211 Geneve, 1995. See also [45].
[36] Bachimont B. Intelligence arti cielle et ecriture dy-
     namique : de la raison graphique a la raison com-
     putationnelle. In: Colloque de Cerisy La Salle \Au
     nom du sens" autour de et avec Umberto Eco. Gras-
     set, Paris, 1996. To appear .
[37] Natural Language Software Registry .       The
     natural language software registry online re-
     search server. WWW page http://cl-www.dfki.uni-
     sb.de/cl/registry/draft.html, 1996.
[38] European Language Resources Association . ELRA
     home      page.                WWW         page
     http://www.icp.grenet.fr/ELRA/home.html, 1996.
[39] Friedman C and Hripcsak G. Evaluating natu-
     ral language processors in the clinical domain. In:
     Chute CC, ed, Proc Conference on Natural Language
     and Medical Concept Representation, Jacksonville,
     Fl. IMIA WG6, 1997:41{52.
[40] Giordano R. The TEI header and the documentation
     of electronic texts. In: Ide and Veronis [44], 1995:75{
     84. Reprinted from Computers and the Humanities
     29, 1995.
[41] Cimino JJ, Socratous SA, and Grewal R. The in-
     formatics superhighway: Prototyping on the World
     Wide Web. In: Gardner RM, ed, Proceedings of the
     19th Annual SCAMC, New Orleans. AMIA (JAMIA
     supplement), 1995:111{5.
[42] Humphrey BL and Lindberg DAB. Building the
     Uni ed Medical Language System. In: Proceed-
     ings of the 13th Annual SCAMC, Washington. IEEE,
     1989:475{80.

                                                                12
You can also read