From Text to Knowledge: a Unifying Document-Centered View of Analyzed Medical Language
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
From Text to Knowledge: a Unifying Document-Centered View of Analyzed Medical Language Pierre Zweigenbaum, Jacques Bouaud, Bruno Bachimont, Jean Charlet, Brigitte Seroussi, Jean-Francois Boisvieux Service d'informatique medicale, Assistance Publique - H^opitaux de Paris & Departement de biomathematiques, Universite Paris 6 Abstract that their levels of representation of medical information are simply not congruent. Although medical language processing (MLP) has We present in this paper a paradigm which can help to achieved some success, the actual use and dissemination address these two problems. In this paradigm, we con- of data extracted from free text by MLP systems is still sider the analyses of a text as enrichments to this text: very limited. We claim that the adoption of an \enriched- the resulting document integrates the original text, \an- document" paradigm (or \document-centered" view) can notated" by the data produced by MLP systems. Ex- help to address this issue. We present this paradigm and tracted data no longer replace the original text, which explain how it can be implemented, then discuss its ex- can be checked by the information user, alleviating the pected bene ts both for end-users and MLP researchers. need for 100 % reliable data. Besides, such an approach can facilitate the identi cation of data exchanged between 1 Introduction MLP researchers. This view of text management is not new. It is Some medical language processing (MLP) systems have widespread in the natural language processing (NLP) and achieved a reasonable level of performance and success document processing communities, and is becoming in- from a technical point a view, the most well-known one creasingly known in medical informatics too. Our pur- being unarguably the Linguistic String Project Medical pose is to transpose it to the context of medical language Language Processor (LSP/MLP, [1]). However, the actual processing. use and dissemination of data extracted from free text by We rst recall some background material, and then MLP systems is still very limited. present the principles of the proposed model. We illus- In our opinion, the most crucial factor which impedes trate the model on several examples, and show how the a wider use of these results by health care professionals resulting enriched documents can be exploited. We dis- involves the expected quality of the resulting data: one cuss the advantages that such a \document-centered" view obviously cannot rely on extracted data as a substitute of analyzed medical language should bring both to nal for the original text as long as this data is not produced users (health care professionals) and to MLP researchers. by a near 100 % safe procedure. Free text processing has We also identify some limitations, and propose further di- not reached such a level of performance | neither can one rections of research. expect humans to. On another plan, MLP results are also not disseminated enough to the research community. Some conditions for this to happen are indeed political (individual and organ- isational willingness) and infrastructural (availability of network and servers). But there is also an issue of rep- 2 Background resentation framework for such interchange to take place. At present, the kinds of results that are produced by dif- We present here the background which supports the ferent MLP systems are often so distant from one another paradigm proposed in this paper: a schematic reconstruc- To appear in Methods of Information in Medicine. A rst tion of MLP tasks, the emergence of a trend towards a version was presented at the \Fourth International Conference on document-centered patient record, the usefulness of anno- Medical Concept Representation", workshop of Working Group 6 of tated corpora, and some basics about document markup IMIA, Jacksonville, Fl., January 1997. (see gure 1). 1
such as diagnoses, acts, or anatomic localizations. Segmentation may be structured: segments may be MLP results Enriched document nested in one another. For instance, a diagnostic expres- sion may mention a speci c anatomical location: we then have an anatomic location segment inside a diagnosis seg- Document-centered EMR ment. The basic model of main-stream syntax is that of segment and categorize constituent structure, which is indeed a structured seg- text contents mentation. Sager's MLP system [1] can essentially be de- scribed as a structured segmentation and categorization system. Categorization and segmentation correspond to Annotated corpus for NLP development the paradigmatic and syntagmatic division in linguistics. variably structured It is therefore not a surprize that the data extracted from text and data a natural language text can often be represented as a cate- SGML markup gorization of text segments. This accounts for a large part of MLP systems (e.g., [1, 7]), and does address a substan- knowledge about language tial part of the needs for data extraction from medical Figure 1: Background: convergence. text. However, instead of assigning a text segment an atomic category drawn from an a priori set, one may choose to 2.1 A reconstruction of medical language build for it a complex category. The hypothesis is that processing tasks a nite set of a priori (pre-coordinated) categories is not sucient for dealing with the open-endedness of natural MLP systems aim at extracting \medical data" from free language, and that one therefore needs to rely on a gen- text. The prototypical application is diagnosis encoding erative language for describing a potentially in nite set (e.g., [2]), where the desired output is a diagnosis code in of (post-coordinated) categories. This is the assumption an existing classi cation (e.g., the International Classi - made by those who rely on some sort of compositional cation of Diseases [3]). This is a classi cation, or catego- language [4, 8] for representing medical information, and rization, task: given a string of words and an a priori set in particular by the tenants of knowledge representation of classes, determine the class which should be assigned languages [9, 10]. Such a complex categorization is gener- to this string. An extended version of categorization can ally not reducible to atomic categorizations of structured assign several classes (or indices, or keywords) to an in- segments, and must therefore be considered as a distinct put string, for instance, to associate a set of SNOMED class of MLP production. [4] codes to an input sentence [5]. Some \ancillary" pro- The paradigm we present in this paper naturally caters cedures in MLP are also categorization tasks: e.g., deter- for the rst two types of NLP output: atomic categoriza- mining the part-of-speech of each word, or (more interest- tion and (structured) segmentation, and thus can help to ingly) determining the broad semantic category of each manage a very large part of the results of MLP systems. word or expression. The latter task plays a central role It can also be made, with some restrictions, to host the in [1], and [6] shows that its status can be promoted from latter type (complex categorization). ancillary MLP to an actual, end-user service. Although categorization may be the only task expected 2.2 The document-centered patient from an MLP system (e.g., in [2]), another important task is segmentation : given a string of words, identify relevant record segments in this string. These segments correspond to The traditional view of the electronic medical record meaningful units, for instance the expression of a diagnosis (EMR) is that of a database which holds items of coded, or | which one might further want to categorize. Segmen- standardized, information. While this view has achieved tation, in this context, allows to focus on (and categorize) some success, it does require a substantial e ort of stan- speci c parts of a text, rather than considering it as an dardization over medical information. Moreover, a com- indistinct whole. An application of segmentation could plete formalization of medical information is not theoreti- be to identify in a text all instances of diagnoses. This cally reachable | this is precisely one of the reasons why task is di erent from the above categorization of a string natural language still has some appeal to medical practi- which is already known to be a diagnosis. \Ancillary" seg- tioners. mentation tasks include morpho-syntactic segmentation An alternate view gives textual documents a rst-class (into words, sentences, phrases and other syntactic con- status. It is inspired both by an observation of non- stituents) as well as the identi cation of semantic units computerized medical practice (e.g., [11, 12]) and by the 2
fast-growing document processing industry. In that view, word is marked for semantic category can be used to the EMR is a collection of documents, which can be com- learn selectional restrictions or taxonomies [22]. posed of text, images, etc. Textual documents here can, and generally do, have structure. They can be hierar- The same idea lies behind some statistical approaches to chically divided into identi ed chunks, down to any suit- MLP, which learn word to code associations on the basis able level, thus providing much of the same data struc- of a manually provided training sample. turing advantages as databases. But at the same time, The paradigm we present is based on the same idea of not every piece of text has to be standardized: the bal- annotating a text with its analyses: we believe this can ance between formal data and natural language text can help MLP researchers both for their own developments be more nely tuned than with traditional database mod- and by facilitating the exchange of enriched corpora. els, and they can merge together more tightly. The ex- pected bene t is rst a better account of available medi- 2.4 Marking up text cal information, described as it can be expressed, rather than as it should be expressed to t a prede ned com- Managing structure, categories, and annotations in a text puter model. Assuming that a document-centered view can be realized by inserting in it tags which mark this of the medical record is closer to non-computerized prac- structure, specify the categories, and identify the annota- tice than an item-oriented view, an additional bene t is a tions. The publishing industry has pushed forth an ini- potentially better-suited interface for health care profes- tiative which led to the adoption of an ISO standard for sionals to work on the patient record. Several attempts this purpose: the Standard Generalized Markup Language at experimenting with such a document-centered patient (SGML, [23, 24]). SGML has since then spread to a wide record are ongoing [13, 14, 15, 16, 17, 18]. variety of industries and research disciplines. HTML (Hy- The document-centered view aims at nding a suitable perText Markup Language), the language of the World tradeo between free text and formal data. By producing Wide Web, is itself a xed application of SGML. data from free text, MLP therefore plays a central role SGML basically allows to de ne the nature and struc- in a document-centered patient record and should bene t ture of the \tags" which can be inserted in a document. from it. Tags delimit \elements" of a document, i.e., categorized segments, such as a section title, paragraph, diagnosis or 2.3 Annotated corpora as a precious re- person name. Tags may also associate \attributes" with elements, such as the level or number of a section title, the source sex of a person or the standardized code for a diagnosis. The 80's have seen the development of increasingly larger The set of allowed tags in a class of documents and their on-line text corpora, and their dimension has now ex- authorized order of appearance and nesting are speci ed ploded with the advent of the World Wide Web. While in a \Document Type De nition" (DTD), also written in large volumes of text are a very useful resource for study- SGML. ing language, for evaluating NLP tools and for training Large SGML DTDs have been proposed for a wide va- them [19], annotated corpora are an even more valuable riety of industries and scienti c disciplines. Of prominent asset for these same purposes. Annotated corpora link interest for NLP is the Text Encoding Initiative [25], an texts with their analyses: a typical, basic level of analysis international project which has led to a DTD encompass- may involve the identi cation of sentence boundaries or ing a large part of the needs of the humanities, litterary the part-of-speech of each word. More elaborate analyses and linguistic research. In the medical informatics arena, associate syntactic parses with sentences or semantic cat- the HL7-SGML group has recently been promoting the egories with words. The resulting corpora are much more use of SGML for the medical record [26]. than text: they constitute sources of knowledge about lan- guage. Typical work on annotated corpora includes: 3 The enriched document model evaluating NLP systems: this is how the DARPA In this section, we present the principles of the enriched MUC competitions have been proceeding [20]; document paradigm and sketch a possible implementation training NLP systems: e.g., a part-of-speech tagger based on the SGML language. can tag accurately some 97 % of the words of a corpus provided it is given a large enough hand-annotated 3.1 MLP as a document enrichment pro- training corpus [21]; cess building resources for NLP systems; this is actually a We propose to handle the analyses which MLP can pro- variant of the preceding point: a corpus where each duce from a text as enrichments to this text. Once \data" 3
has been obtained, the original text is not \forgotten". It one MLP process could identify and categorize the remains there in the forefront to help to manage the de- overall structure of the text (reason for admission, an- rived data. It is still the reference that the reader of the tecedents, etc.) whereas another one could determine patient record may want to consult when examining this the SNOMED encodings of the relevant expressions patient's data. in the text. This possibility of task division has im- plications both for MLP architectural issues and for 3.2 Indexing interpretations on the research team collaboration. source text An analysis of only some parts of a text can be han- The general picture of an enriched document is that of a dled. Moreover, the actual piece of text which gives text, some portions of which are associated with \analy- rise to a speci c analysis can be precisely identi ed. ses", or \interpretations" (in the above-proposed terms, By applying categorization to a segmented text, one \categorizations"). A very simple principle is applied can make the di erence between an analysis of a throughout the model: index these interpretations on whole text and an analysis of a limited extract of the source text that they interpret (see gure 2). This this text. For instance, a diagnosis will not have the gives a deserved importance to the segmentation aspect same meaning if it is presented as the diagnosis of a of NLP. The same principle was at the root of seed mod- patient (attached to the whole report) or as a diag- els in computational linguistics (the chart model for pars- nosis found in the \reason for admission" section or ing [27]) and arti cial intelligence (the blackboard model in the \antecedents" section. for problem-solving [28] and the ATMS model for belief MLP-provided data can be useful as normalized maintenance [29]). search data. By indexing analyses on their source text, one can thus link back to the source text from original text the search data. As a consequence, since text and is still available add as many interpretations corresponding data are synchronized, one can have as useful search processes operate on normalized data while (categorization) letting the user view understandable text. This back link is also an essential feature to enable traceability and quality control of MLP results. can interpret selected parts of Since several such interpretations can be synchro- document two-way link nized, search criteria operating on several levels of (structured interpretation can be combined together, resulting in segmentation) search greater search power. For instance, one can search for interpretations or text a given diagnosis (segment indexed with a \diagno- sis" category) in a speci c section (segment indexed with a \section" category) of a discharge summary { Figure 2: Indexing interpretations on the source text. and search or view the corresponding original text. Let us review some of the properties this indexing brings Finally, if an analysis produces a categorization, this to the resulting enriched document. categorization is inherited by the text segment in- dexed with it. Such a categorization can be made per- The default entry point into the document remains spicuous to the reader through appropriate markup the original text. None of its structure or contents and rendering devices [6]. is lost in the new document, as might be the case in an item-oriented paradigm. Therefore, the basic 3.3 A simple implementation of the en- structure, the basic reading of the document is that of the source text. The importance of the structure and riched text model physical aspect of documents in the patient record The main features of the enriched text model can be im- have been stressed by some authors (in particular, plemented with a few SGML constructs. [11]): keeping these features should be helpful to the reader. 3.3.1 Segmentation and categorization Several interpretations of the text can be added A basic kind of analysis simply consists in categorizing monotonically to the same document. For instance, text segments. Text segments can range from a (part of) 4
word to a whole paragraph, section or document. Cat- possible to keep the analyses outside the source document, egories may be morphosyntactic (part-of-speech, lemma, so that it remains untouched (and therefore may have a syntactic category) as well as semantic (broad semantic read-only status, a safe way to implement data integrity), category, word sense, position in a thesaurus) or concep- using external references to positions in this document. tual (position in an ontology). Another kind of segmen- HyTime links [32], in particular, allow to do this. tation concerns the general structure of the text (header, A limitation of the proposed encoding for \complex" signature, divisions, paragraphs). analyses is that they are stored as plain text rather than An analysis which results in the categorization of a text encoded as markup (elements and attributes). As a conse- segment can be recorded as an encapsulation of this text quence, the SGML processing mechanisms built in SGML segment within a start and end tags assigned to this cat- tools do not help much for processing (e.g., searching) egory. For instance, the text on gure 3 is tagged with these representations. SGML elements denoting segmentation and categorization of structural units (div, head, p) and semantic units (name, date, diagnosis). A variant uses attribute val- 4 Uses of enriched documents ues to specify categories. In the same example, a semantic categorization of divisions is encoded as a type attribute, 4.1 Overview of uses and a precise semantic category for diagnoses is speci ed We envision two broad classes of users for enriched medical as an ICD9CM attribute whose value is a position (a hier- documents: health care professionals and medical infor- archical code) in the ICD-9-CM classi cation. matics researchers (see gure 7. We de ne the rst class Figure 4 shows an additional example where the begin- as being interested primarily in working on patient data. ning of a patient discharge summary is tagged with divi- The second class is concerned with methods and tools for sions (div, head), paragraphs (p), person names (name), processing medical language and knowledge. health care locations (locsoins, typelocs, nomlocs), dates (date), symptoms and diagnoses (etatpath), med- ical treatments (theramed, drug, poso) etc. The rules that govern which elements and attributes M can be used and how they can nest are speci ed in an LP SGML document type de nition (DTD). Figure 5 shows an extract of the DTD enforced in the document of g- ure 4. Each legal element is declared (!element declara- tion) with its allowed attributes (!attlist declaration). The content of an element can be simple text (#PCDATA) or other SGML elements. E 3.3.2 More complex analyses RC M U ED More complex analyses need to be represented explicitly. SO IC RE The TEI has de ned several notations for managing such A L analyses (see, e.g., [30]). E G RE A The simplest one consists in inserting the analysis as CO U G text enclosed within speci c start and end tags, at the RD N Clinical use MLP research LA place in the text where it belongs (interlinear analysis). Assuming for instance that we want to include the con- ceptual representation of each sentence as derived by an Figure 7: The enriched document model. MLP system such as MENELAS [10] or RECIT [31], we can segment the text into sentences (element s), each of An enriched document may be used in (at least) three which includes this conceptual representation (in a cr el- ways. One may simply want to consult the document | ement | see gure 6). The corresponding DTD must the enrichments can help reading or navigating through indeed cater for these elements. the document. One can obtain information or data from In variant encodings, such analyses are grouped to- the document | for instance, extract patient information gether in a special part of the document (for instance, to be sent to another person or system. Finally, one may in the end) and SGML reference links tie together a text want to store data into the document | including writing segment and its analysis. This has the advantage of keep- the original text. The last two operations may be com- ing the \main" part of the document \clean". It is even bined, for instance when applying NLP tools to a text and 5
MOTIF D'HOSPITALISATION Monsieur DURAND, a^g e de 88 ans, est hospitalis e le 18/1/1997 pour un angor spontan e a r ep etition. Figure 3: Categorization and segmentation. MOTIF D'ADMISSION Monsieur DURAND, ^ag e de 61ans, est admis dans l'Unite de R eanimation du Service de Pneumologie du Groupe HospitalierPitie Salp^ etriere le 26/7/90 pour prise en charge d'une ischemie aigu e du membre inf erieur gauche avec insuffisance r enale oligo-anurique. PROVENANCE : UrgencesPiti e. Figure 4: A more complete example of tagged text. storing the results as appropriate enrichments. instance, linking together all mentions of problems We rst examine several uses of enriched documents in a ecting a given anatomical location, or linking the a health care setting. We then recall the main bene ts we mention of an examination in a discharge summary see for the medical informatics research community. to the actual, full examination report. 4.2 Consulting enriched documents The rst two operations on an enriched document are extremely simple to implement with SGML-aware The segmentation and categorization in an enriched docu- software. For instance, using an SGML browser (e.g., ment can be used to help a reader consult this document: Panorama PRO, [33]), one can de ne a style sheet which speci es how each kind of SGML element should be ren- a basic idea is to use the segmentation and catego- dered. Figure 8 shows a Panorama display of the doc- rization to compose the layout of the document | for ument of gure 4, where section heads are in boldface, instance, to highlight segments of a speci c category symptoms and diagnoses are in italics, examination results [6]; are indented, and person and location names are replaced one can further present a di erent view of the same with an icon. document t to a speci c goal or task | for instance, With this same SGML tool, one can de ne \naviga- creating a table of contents by extracting the section tors", user-tailored tables of contents which are presented titles, or creating an index or a synopsis by extracting in a companion window and synchronized with the main the pathological states or observations [15]; text window. Figure 9 shows a navigator based on sec- tion heads and symptoms and diagnoses. This navigator one can also link together segments in the same or in can be used both as a synopsis to get a quick feel for the di erent documents, thus providing hyperlinks: for patient's history and, indeed, as a navigation device: a 6
Figure 5: A document type de nition (extract). click on one of the items in this window scrolls the main powerful searches on both the original text and the data window to the actual occurrence of this item in the source extracted from it. text. With the appropriate tools (here, the Panorama Pro As hinted above, such tools know how to manipulate SGML browser), these helps come at a marginal cost once SGML markup and plain text. However, confronted with, the text is tagged. for instance, the conceptual graph of gure 6, the best The third feature (hyperlinks) generally requires that they can do is to consider it as text. As a result, the full the desired links exist,and encodes them as appropriate formal meaning of the conceptual graph is lost. Never- tags in the document. These links could be precomputed, theless, concept and relation names can be searched as if e.g., by NLP software, or be user-de ned. they were keywords, which can still be useful (as a fall- back) depending on the kind of queries which are pro- 4.3 Searching a collection of enriched cessed. A better account of such elaborate representations requires some investment beyond a simple use of o -the- documents shelf SGML software. The basic search mechanism in an item-oriented database deals with formal data. Searching natural language texts needs to resort to di erent methods, namely, content- 4.4 Extracting information oriented search, also called full-text search. Recent soft- An enriched document is an elaborate compound from ware allows to apply full-text search to structured text, which one may wish to extract only a selected subset. taking into account its segmentation and categorization This may be useful, for instance, in a work ow context to features. So for instance, one can search for occurrences of produce a targeted report to be sent to a correspondent. the word \kidney" in diagnosis elements that are nested This is also convenient to build di erent views of the same in div elements with attribute value type='reason for document for consultation, as suggested above. admission'. The basic mechanism for these purposes is the selection, One can at will choose to search on text, on tags (and in tagged texts, of speci ed segments, based on markup attributes), or on a combination of both: encoding MLP and content, and their rearrangement in the desired or- analyses as SGML-tagged segments in enriched text en- der to produce the target document. This mechanism can ables the application of o -the-shelf software to perform be e ected by SGML transformations, for which a vari- 7
MOTIF D'HOSPITALISATION Monsieur DURAND, a ^g e de 88 ans, est hospitalis e le 18/1/1997 pour un angor spontan e a r epetition. [Admission]- (past) (pat)->[Human_Being]- (defines_cultural_function)-> [Medical_Subfunction]--(cultural_role)->[Patient:I63] (attr)->[Age]--(val_quant)-> [Quantitative_Val:88]--(reference_unit)->[Year_Duration]% (motivated_by)->[Angina_Syndrome:I77] ... Figure 6: Embedding a conceptual representation. (Windows 95 screen dump) Figure 8: Rendering an enriched document. ety of commercial or public-domain software exists. One program, which produces an ICD code for each of them; may cite in particular the SGML query language Sgm- store the resulting ICD codes into an ICD9CM attribute of lQL developed in the Multext project [34], which allows the diagnosis element. The Multext project [34] has de- to specify such transformations in a declarative way. One ned a general NLP architecture where components work should note that SGML transformations are a more gen- on an SGML encapsulation of natural language data. Note eral and powerful mechanism than simple style sheets. that MLP programs need not all be SGML-aware: Mul- DSSSL (Document Style Semantics and Speci cation Lan- text also proposes alternate encodings and transcoders so guage, [35]) aims at standardizing this notion. Performing that usual NLP components can be plugged into such a SGML transformations will be easier and more portable processing chain. when implementations of DSSSL become widely available. Such an architecture facilitates the reuse of existing MLP components: its strength relies on the de nition of a 4.5 Enriching a document set of analysis levels (input or output of NLP components) and the corresponding encodings. Some general-purpose Enriched documents accomodate expansion in at least two levels have been identi ed and some encodings proposed contexts. First, health care professionals must be able to by NLP projects such as Multext or the TEI [25]. An ob- enter new information into the patient record (with the vious need for this kind of architecture to be really useful usual precautions to comply with authorizations and en- for MLP is to de ne relevant levels of analysis for medical sure data integrity). Beyond the mere creation of new texts. We return to this point later. documents, a reader can annotate existing documents [36]: the document-centered paradigm facilitates \active read- 4.6 Exchanging enriched documents ing". Second, MLP programs can be applied to the current The general NLP community has been exchanging data state of a document and perform interpretation tasks as and programs for a while now: one can nd repositories proposed above. The basic ow of control consists in se- for lexicons, annotated corpora, parsers and other NLP re- lecting input segments, passing them to the MLP pro- sources in multiple places over the world (see for instance gram, and then inserting the results back into the enriched the Natural Language Software Registry [37] or the Eu- document structure. A typical processing chain could be: ropean Language Resources Association [38]) | although select all occurrences of diagnoses (diagnosis elements, one must stress that resources are most developed for the identi ed in a previous pass); pass them to an encoding English language. 8
(Windows 95 screen dump) Figure 9: A synopsis and navigation help. The MLP community, in contrast, lacks a comparable cerned with partially structured information: a mixture level of sharing of speci c resources. This point was put of text and data, where natural language itself can un- forward at the last meeting of Working Group 8 (Natu- dergo some variable degree of structuring. This feature ral Language Processing) of the European Federation for of traditional (paper-based) medical records is considered Medical Informatics, which was held during the Medical essential by authors such as [11]. This is di erent from Informatics Europe Conference in August 1996. approaches which assume a strict division between struc- Enriched documents constitute annotated corpora tured data and free text, as we understand that HL7 and which can be extremely useful to the MLP research com- its SGML initiative does. This is also di erent from ap- munity. They allow to compare more precisely di erent proaches where structured data is accessed and displayed approaches, thanks to common test data, and would be a through HTML front-ends [41]. Second, we use SGML central piece of material in evaluations such as discussed to tag the logical content of documents, rather than their by [39]. For instance, to work on automated diagnosis external presentation. The actual presentation and layout coding, a necessary material is a corpus of already coded of documents must be determined by separate style sheets diagnoses. A corpus of texts (e.g., patient discharge sum- | which can be important too, but must be designed at maries) where diagnoses are identi ed and coded (i.e., a di erent level. This contrasts with methods which di- segmented and categorized) with conventional SGML ele- rectly encode the presentation of documents with speci c ments and attributes would be a valuable resource to share SGML markup, for instance HTML. for this purpose. Annotated corpora can also be used Whereas MLP produces data from natural language to train coding programs. For instance, a large enough text, it does not have to replace text with this data. A corpus where each word is tagged with a broad semantic more conservative approach simply adds this data to the category (e.g., Sager's 39 semantic categories, or the cat- original text. As a consequence, depending on the use egories derived from SNOMED's 11 axes) could provide which is made of the resulting document, a 100 % e- training material for a semantic tagger. The normalized ciency is not necessary. One can draw a parallel with document or corpus \header" provides room to describe information retrieval: 100 % recall and precision are not the contents of and speci c conventions applied in the an- necessary (neither are they available) to provide useful notated corpus. services to users. Answers to queries come in the form of We believe that the speci cation of a general set of actual texts or abstracts, which the user can read to make MLP-produced data would help the MLP community fo- his/her nal assessment. Since the user is \in the loop", cus on useful data to be exchanged. Furthermore, de ning the program only serves as an assistant; the nal thinking encodings for these data, in the form of SGML Document and decision belong to the user. The enriched document Type De nitions, would provide a more precise framework approach allows to follow a similar principle: although and would facilitate data exchange. Here again, the work search and navigation can work on MLP-produced data, of the TEI [25] is of interest: one of the main goals of the the user can always be presented with the synchronized, TEI was precisely to facilitate the exchange of annotated original text or text extract | together with the data if textual data between scholars. A valuable feature of the relevant. TEI scheme is the provision for each text or corpus of Note that the consultation of an MLP-enriched docu- a header [40] where information about the document can ment has something more than a \classical" (hyper)text. be inserted: author, mode of construction and encoding, Navigation can rely not only on explicit text and links, but etc. This self-documentation is particularly useful when also on the attached, underlying MLP-produced interpre- exchanging corpora between di erent teams. tations. For instance, given the appropriate tools, one can navigate through the links induced by common semantic categories (e.g., go from one diagnosis to the next in a 5 Discussion text) or search for text segments whose attached analysis at a given level satisfy some criterion (e.g., nd all occur- The enriched document paradigm presented here relies on rences of conditions a ecting the lower limbs, based on a an SGML encoding of medical text and data. As already SNOMED encoding of the text). This opens up a whole mentioned earlier, SGML is making its way into the med- range of dynamic navigation mechanisms. ical informatics community. We would like to stress two MLP deals with tasks of various complexities. For in- important speci cities of our proposal. First, we are con- stance, identifying segments of a given broad semantic cat- 9
egory (e.g., all occurrences of anatomical locations) is less 6 Conclusion dicult than providing a detailed conceptual representa- tion for each sentence. Indeed, the results of MLP sys- We have described a general paradigm for managing data tems are generally less good as the diculty of the task produced by medical language processing systems and, increases. The proposed enriched document architecture more generally, medical data, from its textual form to a can allocate some room for the di erent levels of MLP re- knowledge representation. The basic principle is to asso- sults. The exploitation of the available data is then more ciate NLP-produced data to its source text, the natural or less ecient according to the available levels and the implementation being based on SGML markup and tools. quality of MLP components. For instance, consultation Although this does not constitute in itself a new NLP helps (e.g., layout and rendering | see above) can be of technique, we have shown that this principle can bring varying quality depending on which segmentations and many bene ts to MLP, both by fostering shorter term categorizations are available and how accurate they are. application of current MLP components in a health care Here again, the approach is well suited to target services setting, and by facilitating data interchange between MLP that can accomodate some degree of imprecision or incom- research teams. pleteness in their input data, and that give a major role The major need for this paradigm to be e ective is the to text. It allows to put to work in the short term the de nition of an architecture and document type de ni- simpler MLP tools, and to introduce more sophisticated tions for MLP-produced data. The de nition of DTDs tools gradually, instead of having to wait until MLP is for medical documents is an independently motivated goal perfect before it can be used. for the document-centered medical record [12, 18] or for a data-oriented electronic medical record [26]. We believe The underlying assumption throughout this paper is that these requirements should be joined within a common the possibility to specify a common set of general data \medical document encoding initiative". types and their encodings for MLP-processed data: in other words, SGML document type de nitions (DTDs) for enriched documents. This work still remains to be 7 Acknowledgments done, and is by no means an easy one. It is strongly We thank Vincent Brunie for useful discussions about linked to the de nition of DTDs for medical language doc- SGML and SGML tools, and Dr. Thomas Similowski for uments, which is itself closely related to the de nition of providing patient record material for experimentation. the electronic medical record. Several tracks already go some way in this direction. In the NLP community, enter- prises such as the Text Encoding Initiative or the Multext References project have made steps towards DTDs for various kinds of texts and analyses. The TEI DTD, in particular, is [1] Sager N, Lyman M, Bucknall C, Nhan NT, and a very wide-coverage, extremely parameterizable, docu- Tick LJ. Natural language processing and the rep- ment type de nition which can be thus adapted to many resentation of clinical data. Journal of the American needs. In the medical informatics community, e orts such Medical Informatics Association 1994;1(2):142{60. as CEN TC251 or HL7 have considered the encoding of [2] Gundersen ML, Haug PJ, Pryor TA, et al. Devel- many kinds of medical data; both are considering the use opment and evaluation of a computerized admission of SGML to pursue their tasks. diagnoses encoding system. Computers and Biomed- ical Research 1996;29(1/2):351{72. The proposed architecture does not solve the issue of the incompatibility (or uncomparability) of di erent represen- [3] Department of Health and Human Services, Washing- tations of medical data. It can help to identify, though, ton. International Classi cation of Diseases | Clin- that representations concern the same level of interpre- ical Modi cations, 9th revision, 1980. tation (e.g., di erent controlled vocabularies for coding [4] C^ote RA, Rothwell DJ, Palotay JL, Beckett RS, and diagnoses; or di erent knowledge representations for de- Brochu L, eds. The Systematised Nomenclature of scribing the meaning of a sentence), and provide notations Human and Veterinary Medicine: SNOMED Inter- to explicit the nature and extent of each representation; national. College of American Pathologists, North- but it does not include means to go from one represen- eld, 1993. tation to another. Projects such as the UMLS [42] or GALEN [9] do address this issue. The enriched document [5] Oliver DE and Altman RB. Extraction of SNOMED paradigm can help, for its part, to anchor these di er- concepts from medical record texts. In: Proceedings ent representations to natural language texts and to one of the 18th Annual SCAMC, Washington. Mc Graw another. Hill, 1994:179{83. 10
[6] Sager N, Nhan NT, Lyman M, and Tick LJ. Medical [18] Bachimont B and Charlet J. Hospitexte : station language processing with SGML display. In: Cimino de consultation pour un dossier patient dematerial- [43], October 1996:547{51. ise. Technical Report 96{170, Departement Intelli- gence Arti cielle et Medecine - SIM/AP-HP, 91, bd [7] Zingmond D and Lenert LA. Monitoring free-text de l'H^opital, 75634 Paris cedex 13, France, 1996. data using medical language processing. Computers and Biomedical Research 1993;26:467{81. [19] Armstrong S. Using Large Corpora. MIT Press, [8] Friedman C, Alderson PO, Austin JH, Cimino JJ, 1994. Reprinted from Computational Linguistics, and Johnson SB. A general natural-language text 19(1/2), 1993. processor for clinical radiology. Journal of the Amer- [20] Lehnert W and Sundheim B. A performance eval- ican Medical Informatics Association 1994;1(2):161{ 74. uation of text-analysis technologies. The Arti cial Intelligence Magazine 1991(Fall):81{94. [9] Rector AL, Nowlan WA, and Kay S. Conceptual knowledge: the core of medical information systems. [21] Brill E. Transformation-based error-driven learning In: Lun KC, Degoulet P, Piemme T, and Rienho O, and natural language processing: A case study in eds, Proc MEDINFO 92, Geneva. North Holland, part-of-speech tagging. Computational Linguistics 1992:1420{6. 1995;21(4):543{65. [10] Zweigenbaum P and Consortium Menelas . [22] Basili R, Pazienza MT, and Velardi P. What can Menelas: an access system for medical records using be learned from raw texts? Machine Translation natural language. Computer Methods and Programs 1993;8:147{73. in Biomedicine 1994;45:117{20. [11] Nygren E and Henriksson P. Reading the medical [23] ISO . Standard generalized markup language. ISO record. I. Analysis of physicians' ways of reading the 8879, International Organization for Standardization, medical record. Computer Methods and Programs in 1 rue de Varembe, C.P. 56, CH-1211 Geneve, 1986. Biomedicine 1992;39:1{2. [24] Goldfarb CF. The SGML handbook. Oxford Univer- [12] Seroussi B, Baud R, Moens M, et al. DOME nal re- sity Press, 1990. port. DOME (EU Project MLAP-63221) Deliverable [25] Sperberg-McQueen CM and Burnard L, eds. Guide- D0.2, DIAM | SIM/AP-HP, 1996. lines for Electronic Text Encoding and Interchange [13] Nygren E, Johnson M, and Henriksson P. Reading (TEI P3), Chicago, Oxford, 1994. ACH, ACL and the medical record. II. Design of a human-computer ALLC. interface for basic reading of computerized medi- cal records. Computer Methods and Programs in [26] Alschuler L, Lincoln TL, and Spinosa J. Medicine for Biomedicine 1992;39:13{25. SGML. In: Proceedings GCA SGML'96, 1996:181{9. [14] Adelhard K, Eckel R, Holzel D, and Tretter W. [27] Kay M. Algorithm schemata and data structure in A prototype of a computerized patient record. syntactic processing. In: Grosz BJ, Jones KS, and Computer Methods and Programs in Biomedicine Webber BL, eds, Readings in Natural Language Pro- 1995;48:115{9. cessing. Morgan Kaufmann, Los Altos, Ca., 1986:35{ [15] Bouaud J and Seroussi B. Navigating through a 70. document-centered computer-based patient record: a [28] Nii HP. Blackboard systems: the blackboard model mock-up based on WWW technology. In: Cimino of problem solving and the evolution of blackboard [43], October 1996. architectures. The Arti cial Intelligence Magazine [16] Carvalho G. A. Modelisation documentaire du 1986;7(2,3):38{53,82{106. dossier patient : une experience. Rapport de DEA, Informatique Biomedicale, Universite Paris 5, 1996. [29] de Kleer J. An assumption-based truth maintenance system. Arti cial Intelligence 1986;28(1). [17] Laforest F, Frenot S, and Flory A. A new approach for hypermedia medical records management. In: [30] Langendoen DT and Simons GF. A rationale for the Brender J, Christensen JP, Scherrer JR, and Mc- TEI recommendations for feature-structure markup. Nair P, eds, Proceedings of MIE'96, Copenhagen, In: Ide and Veronis [44], 1995:191{209. Reprinted DK. IOS Press, 1996:1042{6. from Computers and the Humanities 29, 1995. 11
[31] Baud RH, Rassinoux AM, Wagner JC, et al. Rep- [43] Cimino JJ, ed. Proceedings of AMIA Annual resenting clinical narratives by Sowa's Conceptual Fall Symposium 96, Washington DC, October 1996. Graphs. Methods of Information in Medicine AMIA (JAMIA supplement). 1995;34(1/2). [44] Ide N and Veronis J, eds. The Text Encoding Ini- [32] DeRose SJ and Durand DG. Making Hypermedia tiative. Kluwer Academic Publishers, Boston, 1995. Work { A User's Guide to HyTime. Kluwer Aca- Reprinted from Computers and the Humanities 29, demic Publishers, 1994. 1995. [33] SoftQuad, Inc., Toronto. Panorama Pro for Microsoft [45] Clark J. ISO/IEC DIS 10179.2: Document style se- Windows, 1995. mantics and speci cation language (DSSSL). WWW page http://www.jclark.com/dsssl/, 1995. [34] Veronis J. MULTEXT home page. WWW page http://www.lpl.univ-aix.fr/projects/multext/, CNRS et Universite de Provence, 1996. [35] ISO . Document style semantics and speci cation language. ISO/IEC 10179, International Organiza- tion for Standardization, 1 rue de Varembe, C.P. 56, CH-1211 Geneve, 1995. See also [45]. [36] Bachimont B. Intelligence arti cielle et ecriture dy- namique : de la raison graphique a la raison com- putationnelle. In: Colloque de Cerisy La Salle \Au nom du sens" autour de et avec Umberto Eco. Gras- set, Paris, 1996. To appear . [37] Natural Language Software Registry . The natural language software registry online re- search server. WWW page http://cl-www.dfki.uni- sb.de/cl/registry/draft.html, 1996. [38] European Language Resources Association . ELRA home page. WWW page http://www.icp.grenet.fr/ELRA/home.html, 1996. [39] Friedman C and Hripcsak G. Evaluating natu- ral language processors in the clinical domain. In: Chute CC, ed, Proc Conference on Natural Language and Medical Concept Representation, Jacksonville, Fl. IMIA WG6, 1997:41{52. [40] Giordano R. The TEI header and the documentation of electronic texts. In: Ide and Veronis [44], 1995:75{ 84. Reprinted from Computers and the Humanities 29, 1995. [41] Cimino JJ, Socratous SA, and Grewal R. The in- formatics superhighway: Prototyping on the World Wide Web. In: Gardner RM, ed, Proceedings of the 19th Annual SCAMC, New Orleans. AMIA (JAMIA supplement), 1995:111{5. [42] Humphrey BL and Lindberg DAB. Building the Uni ed Medical Language System. In: Proceed- ings of the 13th Annual SCAMC, Washington. IEEE, 1989:475{80. 12
You can also read