Adapting Linguistic Tools for the Analysis of Italian Medical Records
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
10.12871/CLICIT201414 Adapting Linguistic Tools for the Analysis of Italian Medical Records Giuseppe Attardi Vittoria Cozza Daniele Sartiano Dipartimento di Informatica Dipartimento di Informatica Dipartimento di Informatica Università di Pisa Università di Pisa Università di Pisa Largo B. Pontecorvo, 3 Largo B. Pontecorvo, 3 Largo B. Pontecorvo, 3 attardi@di.unipi.it cozza@di.unipi.it sartiano@di.unipi.it formation extracted from natural language texts, Abstract can supplement or improve information extracted from the more structured data available in the English. We address the problem of recogni- medical test records. tion of medical entities in clinical records Clinical records are expressed as plain text in written in Italian. We report on experiments natural language and contain mentions of diseas- performed on medical data in English provid- es or symptoms affecting a patient, whose accu- ed in the shared tasks at CLEF-ER 2013 and rate identification is crucial for any further text SemEval 2014. This allowed us to refine Named Entity recognition techniques to deal mining process. with the specifics of medical and clinical lan- Our task in the project is to provide a set of guage in particular. We present two ap- NLP tools for extracting automatically infor- proaches for transferring the techniques to mation from medical reports in Italian. We are Italian. One solution relies on the creation of facing the double challenge of adapting NLP an Italian corpus of annotated clinical records tools to the medical domain and of handling doc- and the other on adapting existing linguistic uments in a language (Italian) for which there are tools to the medical domain. few available linguistic resources. Our approach to information extraction ex- Italiano. Questo lavoro affronta il problema ploits both supervised machine-learning tools, del riconoscimento di entità mediche in refer- ti medici in lingua italiana. Riferiamo su de- which require annotated training corpora, and gli esperimenti svolti su testi medici in ingle- unsupervised deep learning techniques, in order se forniti nei task di CLEF-ER 2013 e SemE- to leverage unlabeled data. val 2014. Questi ci hanno consentito di raffi- For dealing with the lack of annotated Italian nare le tecniche di Named Entity recognition resources for the bio-medical domain, we at- per trattare le specificità del linguaggio me- tempted to create a silver corpus with a semi- dico e in particolare quello dei referti clinici. automatic approach that uses both machine trans- Presentiamo due approcci al trasferimento di lation and dictionary based techniques. The cor- queste tecniche all’italiano. Una soluzione pus will be validated through crowdsourcing. consiste nella creazione di un corpus di refer- ti medici in italiano annotato con entità me- 2 Medical Training Corpus diche e l’altro nell’adattare strumenti tradi- zionali per l’analisi linguistica al dominio Currently Italian corpora annotated with men- medico. tions of medical terms are not easily available. Hence we decided to create a corpus of Italian 1 Introduction medical reports (IMR), annotated with medical One of the objectives of the RIS project (RIS mentions and to make it available on demand. 2014) is to develop tools and techniques to help The corpus consists of 10,000 sentences, ex- identifying patients at risk of evolving their dis- tracted from a collection of 23,695 clinical rec- ease into a chronic condition. The study relies on ords of various types, including discharge sum- a sample of patient data consisting of both medi- maries, diagnoses, and medical test reports. cal test reports and clinical records. We are inter- The annotation process consists in two steps: ested in verifying whether text analytics, i.e. in- creating a silver corpus using automated tools 17
and then turning the corpus into a gold one by 4. the annotated document is enriched by adding manually correcting the annotations. CUIs to each entity, looking up the pair (enti- For building the silver corpus we used: ty, group) in the dictionary of CUIs, of step 1. a NER trained over a silver English resource For machine translation we trained Moses translated to Italian; (Koehn, 2007) using a biomedical parallel corpus a dictionary-based entity recognition ap- consisting of EMEA, Medline and the Spanish proach. Wikipedia for the language model. For converting the silver corpus into a gold one, In task A of the challenge, on mention identi- validation by medical experts is required. We fication, our submission achieved the best score organized a crowdsourcing campaign, for which for the categories disease, anatomical part, live we are recruiting volunteers to whose we will being and drugs, with scores ranging between assign micro annotation tasks. Special care will 91.5% and 97.4% (Rebholz-Schuhmann et al., be taken to collected answers reliability. 2013). In task B on CUI identification, the scores were however much lower. 2.1 Translation based approach As NE tagger, we used the Tanl NER (Attardi The CLEF-ER 2013 challenge (Rebholz- et al., 2009), a generic sequence tagger based on Schuhmann et al., 2010) aimed at the identifica- a Maximum Entropy Markov Model, that uses a tion of mentions in bio-medical texts in various rich feature set, customizable by providing a fea- languages, starting from an annotated resource in ture model. Such kinds of taggers perform quite English, and at assigning to them a concept well on newswire documents, where capitaliza- unique identifier (CUI) from the UMLS thesau- tion features are quite helpful in identifying peo- rus (Bodenreider, 2004). UMLS combines sever- ple, organization and locations. With a proper al multilingual medical resources, including Ital- feature model we were able to achieve satisfacto- ian terminology from MedDRA Italian ry results also for medical domain. (MDRITA15_1) and MESH Italian Adapting the CLEF-ER approach to Italian re- (MSHITA2013), bridged through their CUI’s to quired repeating the translation step with an en-it their English counterparts. parallel corpus, consisting of EMEA and UMLS The organizers provided a silver standard cor- for the medical domain and (Europarl, 2014; pus (SSC) in English, consisting of 364,005 sen- JRC-Acquis, 2014) for more general domains. tences extracted from the EMEA corpus, which A NE tagger for Italian was the trained on the had been automatically annotated by combining translated silver corpus. the outputs of several Named Entity taggers Due to a lack of annotated Italian medical (Rebholz-Schuhmann et al., 2010). texts, we couldn’t perform validation on the re- In (Attardi et al., 2013) we proposed a solution sulting tagger. Manual inspection confirms the for annotating Spanish bio-medical texts, starting hypothesis that accuracy is similar to the Spanish from the SSC English resource. Our approach version, given that the major difference in the combined techniques of machine translation and process is the translation corpus and that Spanish NER and consists of the following steps: and Italian are similar languages. 1. phrase-based statistical machine translation is 2.2 Dictionary based approach applied to the SSC in order to obtain a corre- Since the terminology for entities in medical rec- sponding annotated corpus in the target lan- ords is fairly restricted, another method for iden- guage. A mapping between mentions in the tifying mentions in the IMR corpus is to use a original and the corresponding ones in the dictionary. We produced an Italian medical the- translation is preserved, so that the CUIs from the original can be transferred to the transla- saurus by merging: tion. This produces a Bronze Standard Corpus over 70,000 definitions of treatments and (BSC) in the target language. A dictionary of diagnosis from the ICD-9-CM terminology; entities is also created, which associates to about 22,000 definitions from the SnoMed each pair (entity text, semantic group) the cor- semantic group “Symptoms and Signs, Dis- responding CUIs that appeared in the SSC. ease and Anatomical part” in the UMLS; 2. the BSC is used to train a model for a Named over 2,600 active ingredients and drugs from Entity tagger, capable of assigning semantic the “Lista dei Farmaci” (AIFA, 2014). groups to mentions. 3. the model built at step 2) is used for tagging We identified mentions in the IMR corpus using entities in sentences in the target language. two techniques: n-gram based and parser based. 18
The IMR text is preprocessed first with the Tanl The training set consisted of 199 notes with pipeline, performing sentence splitting, tokeniza- 5,816 annotations, the development set of 99 tion, POS tagging and lemma extraction. notes with 5,351 annotations. To ease matching, text is normalized by low- For the first step we explored using or adapt- ercasing each word. ing traditional NER techniques. We performed The n-gram based technique tries matching experiments in the context of the SemEval 2014 each n-gram (with n between 1 and 10) in the task 7, Analysis of Clinical Text, where we could corpus with entries in the thesaurus in two ways: try these techniques using suitable training and with lemma match and with approximate match. test data, even though in English rather than Ital- Approximate matching involves removing prep- ian (Attardi et al., 2014). We tested several NER ositions, punctuations and articles from both n- tools: the Tanl NER, the Stanford NER grams and entries and performing an exact (CRF_NER, 2014) and a Deep Learning NER match. (NLPNET, 2014), which we developed based on The parse based matching enables to deal also the SENNA architecture (SENNA, 2011). While with some kinds of discontiguous mentions, i.e. the first two taggers rely on a rich feature sets entity mentions interleaved with modifiers, e.g. and supervised learning, the Deep Learning tag- adjectives, verb or adverbs. The matching pro- ger uses almost no features and relies on word cess involves parsing each sentence in the IMR embeddings, learned through an unsupervised with the DeSR parser (Attardi et al., 2009), se- process from unannotated texts, along the ap- lecting noun phrases corresponding to subtrees proach by Collobert et al. 2011. whose root is a noun and consisting of certain We created an unannotated corpus (UC) com- patterns of nouns, adjectives and prepositions, bining 100,000 terms from the English Wikipe- and finally searching these noun phrases in the dia and 30,000 additional terms from a subset of thesaurus. unannotated medical texts from the MIMIC cor- pus (Moody and Marks, 1996). The word em- beddings for the UC are computed by training a deep learning architecture initialized to the val- ues provided by Al-Rfou’ et al. (2013) for the English Wikipedia and to random values for the medical terms. All taggers use dictionary features. We created a dictionary of disease terms (about 22,000 terms Figure 1: A parsed sentence from the IMR. from the “Disease or Syndrome” semantic type Figure 1 shows a sample sentence and its parse of UMLS) excluding the most frequent words tree from which the following noun phrases are from Wikipedia. identified: reumatismo acuto, reumatismo ar- The Tanl NER could be customized with addi- ticolare, reumatismo in giovane età, gio- tional cluster features, extracted from a small vane età. Among these, reumatismo acuto is window of input tokens. The clusters of UC recognized as an ICD9 disease. terms were calculated using the following algo- Overall, we were able to identify over 100,000 rithms: entities in the IMR corpus by means of the dic- DBSCAN (Ester et al., 1996) as implement- tionary approach. The process is expected to ed in scikit-learn (SCIKIT, 2014). Applied to guarantee high precision: manual inspection of the word embeddings it produced a set of 4,000 sentences, detected a 96% precision. 572 clusters. Besides recognized entities in the thesaurus, Continuous Vector Representation of Words we annotated temporal expressions by a version (Mikolov et al., 2013), using the word2vec of HeidelTime (Strotgen and Gertz, 2013) library (WORD2VEC, 2014) with several adapted to Italian (Attardi and Baronti, 2014). settings. We obtained the best accuracy with word2vec 3 NE Recognition using a set of 2,000 clusters. We split the analysis of medical records into two 3.1 Conversion to IOB format steps: recognition of mentions and assignment of unique identifiers (CUI) to those mentions. Before applying NE tagging, we had to convert the medical records into the IOB format used by 19
most NE taggers. The conversion is not straight- is created if the terms belong to the same men- forward since clinical reports contain discontigu- tion, a negative one otherwise. ous and overlapping mentions. For example, in: The classifier was trained using a binned dis- Abdomen is soft, nontender, non- tance feature and dictionary features, extracted distended, negative bruits for each pair of words in the training set. For mapping entities to CUIs we applied fuzzy there are two mentions: Abdomen nontender and matching (Fraser, 2011) between the extracted Abdomen bruits. mentions and the textual description of entities The IOB format does not allow either discon- present in a set of UMLS disorders. The CUI tinuity or overlaps. We tested two conversions: from the match with highest score is chosen. one by replicating a sentence, each version hav- Our submission reached a comparable accura- ing a single mention from a set of overlapping cy to the best ones based on a single system ap- ones. The second approach consisted in using proach (Pradhan et al., 2014), with an F-score of additional tags for disjoint and shared mentions 0.65 for Task A and 0.83 for Task A relaxed. For (Tang et al., 2013): DISO for contiguous men- Task B and Task B relaxed the accuracies were tions; DDISO for disjoint entity words that are 0.46 and 0.70 respectively. Better results were not shared by multiple mentions; HDISO for the achieved by submissions that used an ensemble head word that belongs to more than one disjoint of taggers. mentions. We also attempted combinations of the out- We tested the accuracy of various NE taggers puts from the Tanl NER (with word2vec cluster on the SemEval development set. The results are features), Nlpnet NER and Stanford NER in sev- reported in Table 1. Results marked with discont eral ways. The best results were obtained by a were obtained with the additional tags for dis- simple one voting approach, taking the union of contiguous and overlapping mentions. all annotations. The results of the evaluation, for both the multiple copies and discount annotation NER Precision Recall F- score style, are shown below: Tanl 80.41 65.08 71.94 Tanl+dbscan 80.43 64.48 71.58 NER Precision Recall F- score Tanl+word2vec 79.70 67.44 73.06 Agreement multiple 73.96 73.68 73.82 Nlpnet 80.29 62.51 70.29 Agreement discont 81.69 65.85 72.92 Stanford 80.30 64.89 71.78 CRFsuite 79.69 61.97 69.72 Table 2: Accuracy of NER system combination. Tanl discont 78.57 65.35 71.35 Nlpnet discont 77.37 63.76 69.61 4 Conclusions Stanford discont 80.21 62.79 70.44 We presented a series of experiments on bio- Table 1: Accuracy on the development set. medical texts from both medical literature and clinical records, in multiple languages, that 3.2 Semeval 2014 NER for clinical text helped us to refine the techniques of NE recogni- The task 7 of SemEval 2014 allowed us to test tion and to adapt them to Italian. We explored NE tagging techniques on medical records and to supervised techniques as well as unsupervised adapt them to the task. Peculiarly, only one class ones, in the form of word embeddings or word of entities, namely diseases, is present in the cor- clusters. We also developed a Deep Learning NE pus. tagger that exploits embeddings. The best results We dealt with overlapping mentions by con- were achieved by using a MEMM sequence la- verting the annotations. Discontiguous mentions beler using clusters as features improved in an were dealt in two steps: the first step identifies ensemble combination with other NE taggers. contiguous portions of a mention with a tradi- As an further contribution of our work, we tional sequence labeler; then separate portions of produced, by exploiting semi-automated tech- mentions are combined into a full mention with niques, an Italian corpus of medical records, an- guidance from a Maximum Entropy classifier notated with mentions of medical terms. (Berger et al., 1996), trained to recognize which pairs belong to the same mention. The training Acknowledgements set consists of all pairs of terms within a docu- Partial support for this work was provided by ment annotated as disorders. A positive instance project RIS (POR RIS of the Regione Toscana, CUP n° 6408.30122011.026000160). 20
References JRC-Acquis Multilingual Parallel Corpus, Version 2.2. 2014. Retrieved from: AIFA open data. 2014. Retrieved from: http://optima.jrc.it/Acquis/index_2.2.html http://www.agenziafarmaco.gov.it/it/content/dati- Philipp Koehn, et al. 2007. Moses: Open source sulle-liste-dei-farmaci-open-data toolkit for statistical machine translation. In Proc. Rami Al-Rfou’, Bryan Perozzi, and Steven Skiena. of the 45th Annual Meeting of the ACL on Interac- 2013. Polyglot: Distributed Word Representations tive Poster and Demonstration Sessions. ACL. for Multilingual NLP. In Proc. of Conference on MDRITA15_1. 2012. Medical Dictionary for Regula- Computational Natural Language Learning, tory Activities Terminology (MedDRA) Version CoNLL 2013, pp. 183-192, Sofia, Bulgaria. 15.1, Italian Edition; MedDRA MSSO; September, Giuseppe Attardi et al., 2009. Tanl (Text Analytics 2012. and Natural Language Processing). SemaWiki pro- MSHITA2013. 2013. Italian translation of Medical ject: http://medialab.di.unipi.it/wiki/SemaWiki Subject Headings. Istituto Superiore di Sanità, Set- Giuseppe Attardi, et al. 2009. The Tanl Named Entity tore Documentazione. Roma, Italy. Recognizer at Evalita 2009. In Proc. of Workshop Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jef- Evalita’09 - Evaluation of NLP and Speech Tools fey Dean. 2013. Efficient Estimation of Word Rep- for Italian, Reggio Emilia, ISBN 978-88-903581-1- resentations in Vector Space. In Proc. of Workshop 1. at ICLR. Giuseppe Attardi, Felice Dell’Orletta, Maria Simi and George B. Moody and Roger G. Mark. 1996. A Data- Joseph Turian. 2009. Accurate Dependency Pars- base to Support Development and Evaluation of In- ing with a Stacked Multilayer Perceptron. In Proc. telligent Intensive Care Monitoring. Computers in of Workshop Evalita’09 - Evaluation of NLP and Cardiology 23:657–660. Speech Tools for Italian, Reggio Emilia, ISBN NLPNET. 2014. Retrieved from 978-88-903581-1-1. https://github.com/attardi/nlpnet/ Giuseppe Attardi, Andrea. Buzzelli, Daniele Sartiano. Sameer Pradhan, et al. 2014. SemEval-2014 Task 7: 2013. Machine Translation for Entity Recognition Analysis of Clinical Text. Proc. of the 8th Interna- across Languages in Biomedical Documents. Proc. tional Workshop on Semantic Evaluation (SemEval of CLEF-ER 2013 Workshop, September 23-26, 2014), August 2014, Dublin, Ireland, pp. 5462. Valencia, Spain. Dietrich Rebholz-Schuhmann et al. 2010. The Giuseppe Attardi, Vittoria Cozza, Daniele Sartiano. CALBC Silver Standard Corpus for Biomedical 2014. UniPi: Recognition of Mentions of Disorders Named Entities - A Study in Harmonizing the Con- in Clinical Text. Proc. of the 8th International tributions from Four Independent Named Entity Workshop on Semantic Evaluation. SemEval 2014, Taggers. In Proc. of the Seventh International Con- pp. 754–760 ference on Language Resources and Evaluation Giuseppe Attardi, Luca Baronti. 2014. Experiments in (LREC'10). Valletta, Malta. Identification of Temporal Expressions in Evalita Dietrich Rebholz-Schuhmann, et al. 2013. Entity 2014. Proc. of Evalita 2014. Recognition in Parallel Multi-lingual Biomedical Adam L. Berger, Vincent J. Della Pietra, and Stephen Corpora: The CLEF-ER Laboratory Overview. A. Della Pietra. 1996. A maximum entropy ap- Lecture Notes in Computer Science, Vol. 8138, proach to natural language processing. Comput. 353-367 Linguist. 22, 1 (March 1996), 39-71. RIS: Ricerca e innovazione nella sanità. 2014. POR Olivier Bodenreider. 2004. The Unified Medical Lan- RIS of the Regione Toscana. homepage: guage System (UMLS): integrating biomedical http://progetto-ris.it/ terminology. Nucleic Acids Research, vol. 32, no. SCIKIT. 2014 Retrieved from http://scikit-learn.org/ supplement 1, D267–D270. SENNA. Semantic/syntactic Extraction using a Neu- Ronan Collobert et al. 2011. Natural Language Pro- ral Network Architecture. 2011. Retrieved from cessing (Almost) from Scratch. Journal of Machine http://ml.nec-labs.com/senna/ Learning Research, 12, pp. 2461–2505. Jannik Strotgen and Michael Gertz. Multilingual and CRF-NER. 2014. Retrieved from: Cross-domain Temporal Tagging. 2013. Language http://nlp.stanford.edu/software/CRF-NER.shtml Resources and Evaluation, June 2013, Volume 47, EUROPARL. European Parliament Proceedings Par- Issue 2, pp 269-298. allel Corpus 1996-2011. 2014. Buzhou Tang et al. 2013. Recognizing and Encoding http://www.statmt.org/europarl/ Discorder Concepts in Clinical Text using Machine Jenny Rose Finkel, Trond Grenager and Christopher Learning and Vector Space Model. Workshop of D. Manning 2005. Incorporating Non-local Infor- ShARe/CLEF eHealth Evaluation Lab 2013. mation into Information Extraction Systems by Buzhou Tang, et al. 2014. Evaluating Word Represen- Gibbs Sampling. In Proc. of the 43nd Annual tation Features in Biomedical Named Entity Meeting of the ACL, 2005, pp. 363–370. Recognition Tasks. BioMed Research Internation- Neil Fraser. 2011. Diff, Match and Patch libraries for al, Volume 2014, Article ID 240403. Plain Text. 21
Jorg Tiedemann. 2009. News from OPUS - A Collec- Erik F. Tjong Kim Sang and Fien De Meulder. 2003. tion of Multilingual Parallel Corpora with Tools Introduction to the CoNLL’03 Shared Task: Lan- and Interfaces. In Nicolov et al., eds.: Recent Ad- guage-Independent Named Entity Recognition. In: vances in Natural Language Processing. Volume Proc. of CoNLL '03, Edmonton, Canada, 142–147. V. John Benjamins, Amsterdam/Philadelphia, Bo- WORD2VEC. 2014 Retrieved from rovets, Bulgaria, pp. 237–248. http://code.google.com/p/word2vec/ 22
You can also read