Adapting Linguistic Tools for the Analysis of Italian Medical Records

Page created by Tim Mann

Health & Fitness

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

10.12871/CLICIT201414

Adapting Linguistic Tools for the Analysis of Italian Medical Records

Giuseppe Attardi Vittoria Cozza Daniele Sartiano
Dipartimento di Informatica Dipartimento di Informatica Dipartimento di Informatica
Università di Pisa Università di Pisa Università di Pisa
Largo B. Pontecorvo, 3 Largo B. Pontecorvo, 3 Largo B. Pontecorvo, 3
attardi@di.unipi.it cozza@di.unipi.it sartiano@di.unipi.it

formation extracted from natural language texts,
Abstract can supplement or improve information extracted
from the more structured data available in the
English. We address the problem of recogni- medical test records.
tion of medical entities in clinical records Clinical records are expressed as plain text in
written in Italian. We report on experiments natural language and contain mentions of diseas-
performed on medical data in English provid- es or symptoms affecting a patient, whose accu-
ed in the shared tasks at CLEF-ER 2013 and
rate identification is crucial for any further text
SemEval 2014. This allowed us to refine
Named Entity recognition techniques to deal mining process.
with the specifics of medical and clinical lan- Our task in the project is to provide a set of
guage in particular. We present two ap- NLP tools for extracting automatically infor-
proaches for transferring the techniques to mation from medical reports in Italian. We are
Italian. One solution relies on the creation of facing the double challenge of adapting NLP
an Italian corpus of annotated clinical records tools to the medical domain and of handling doc-
and the other on adapting existing linguistic uments in a language (Italian) for which there are
tools to the medical domain. few available linguistic resources.
Our approach to information extraction ex-
Italiano. Questo lavoro affronta il problema
ploits both supervised machine-learning tools,
del riconoscimento di entità mediche in refer-
ti medici in lingua italiana. Riferiamo su de-
which require annotated training corpora, and
gli esperimenti svolti su testi medici in ingle- unsupervised deep learning techniques, in order
se forniti nei task di CLEF-ER 2013 e SemE- to leverage unlabeled data.
val 2014. Questi ci hanno consentito di raffi- For dealing with the lack of annotated Italian
nare le tecniche di Named Entity recognition resources for the bio-medical domain, we at-
per trattare le specificità del linguaggio me- tempted to create a silver corpus with a semi-
dico e in particolare quello dei referti clinici. automatic approach that uses both machine trans-
Presentiamo due approcci al trasferimento di lation and dictionary based techniques. The cor-
queste tecniche all’italiano. Una soluzione pus will be validated through crowdsourcing.
consiste nella creazione di un corpus di refer-
ti medici in italiano annotato con entità me- 2 Medical Training Corpus
diche e l’altro nell’adattare strumenti tradi-
zionali per l’analisi linguistica al dominio Currently Italian corpora annotated with men-
medico. tions of medical terms are not easily available.
Hence we decided to create a corpus of Italian
1 Introduction medical reports (IMR), annotated with medical
One of the objectives of the RIS project (RIS mentions and to make it available on demand.
2014) is to develop tools and techniques to help The corpus consists of 10,000 sentences, ex-
identifying patients at risk of evolving their dis- tracted from a collection of 23,695 clinical rec-
ease into a chronic condition. The study relies on ords of various types, including discharge sum-
a sample of patient data consisting of both medi- maries, diagnoses, and medical test reports.
cal test reports and clinical records. We are inter- The annotation process consists in two steps:
ested in verifying whether text analytics, i.e. in- creating a silver corpus using automated tools

and then turning the corpus into a gold one by 4. the annotated document is enriched by adding
manually correcting the annotations. CUIs to each entity, looking up the pair (enti-
For building the silver corpus we used: ty, group) in the dictionary of CUIs, of step 1.
 a NER trained over a silver English resource For machine translation we trained Moses
translated to Italian; (Koehn, 2007) using a biomedical parallel corpus
 a dictionary-based entity recognition ap- consisting of EMEA, Medline and the Spanish
proach. Wikipedia for the language model.
For converting the silver corpus into a gold one, In task A of the challenge, on mention identi-
validation by medical experts is required. We fication, our submission achieved the best score
organized a crowdsourcing campaign, for which for the categories disease, anatomical part, live
we are recruiting volunteers to whose we will being and drugs, with scores ranging between
assign micro annotation tasks. Special care will 91.5% and 97.4% (Rebholz-Schuhmann et al.,
be taken to collected answers reliability. 2013). In task B on CUI identification, the scores
were however much lower.
2.1 Translation based approach As NE tagger, we used the Tanl NER (Attardi
The CLEF-ER 2013 challenge (Rebholz- et al., 2009), a generic sequence tagger based on
Schuhmann et al., 2010) aimed at the identifica- a Maximum Entropy Markov Model, that uses a
tion of mentions in bio-medical texts in various rich feature set, customizable by providing a fea-
languages, starting from an annotated resource in ture model. Such kinds of taggers perform quite
English, and at assigning to them a concept well on newswire documents, where capitaliza-
unique identifier (CUI) from the UMLS thesau- tion features are quite helpful in identifying peo-
rus (Bodenreider, 2004). UMLS combines sever- ple, organization and locations. With a proper
al multilingual medical resources, including Ital- feature model we were able to achieve satisfacto-
ian terminology from MedDRA Italian ry results also for medical domain.
(MDRITA15_1) and MESH Italian Adapting the CLEF-ER approach to Italian re-
(MSHITA2013), bridged through their CUI’s to quired repeating the translation step with an en-it
their English counterparts. parallel corpus, consisting of EMEA and UMLS
The organizers provided a silver standard cor- for the medical domain and (Europarl, 2014;
pus (SSC) in English, consisting of 364,005 sen- JRC-Acquis, 2014) for more general domains.
tences extracted from the EMEA corpus, which A NE tagger for Italian was the trained on the
had been automatically annotated by combining translated silver corpus.
the outputs of several Named Entity taggers Due to a lack of annotated Italian medical
(Rebholz-Schuhmann et al., 2010). texts, we couldn’t perform validation on the re-
In (Attardi et al., 2013) we proposed a solution sulting tagger. Manual inspection confirms the
for annotating Spanish bio-medical texts, starting hypothesis that accuracy is similar to the Spanish
from the SSC English resource. Our approach version, given that the major difference in the
combined techniques of machine translation and process is the translation corpus and that Spanish
NER and consists of the following steps: and Italian are similar languages.

1. phrase-based statistical machine translation is 2.2 Dictionary based approach
applied to the SSC in order to obtain a corre- Since the terminology for entities in medical rec-
sponding annotated corpus in the target lan- ords is fairly restricted, another method for iden-
guage. A mapping between mentions in the tifying mentions in the IMR corpus is to use a
original and the corresponding ones in the
dictionary. We produced an Italian medical the-
translation is preserved, so that the CUIs from
the original can be transferred to the transla- saurus by merging:
tion. This produces a Bronze Standard Corpus  over 70,000 definitions of treatments and
(BSC) in the target language. A dictionary of diagnosis from the ICD-9-CM terminology;
entities is also created, which associates to  about 22,000 definitions from the SnoMed
each pair (entity text, semantic group) the cor- semantic group “Symptoms and Signs, Dis-
responding CUIs that appeared in the SSC. ease and Anatomical part” in the UMLS;
2. the BSC is used to train a model for a Named  over 2,600 active ingredients and drugs from
Entity tagger, capable of assigning semantic the “Lista dei Farmaci” (AIFA, 2014).
groups to mentions.
3. the model built at step 2) is used for tagging We identified mentions in the IMR corpus using
entities in sentences in the target language. two techniques: n-gram based and parser based.

The IMR text is preprocessed first with the Tanl The training set consisted of 199 notes with
pipeline, performing sentence splitting, tokeniza- 5,816 annotations, the development set of 99
tion, POS tagging and lemma extraction. notes with 5,351 annotations.
To ease matching, text is normalized by low- For the first step we explored using or adapt-
ercasing each word. ing traditional NER techniques. We performed
The n-gram based technique tries matching experiments in the context of the SemEval 2014
each n-gram (with n between 1 and 10) in the task 7, Analysis of Clinical Text, where we could
corpus with entries in the thesaurus in two ways: try these techniques using suitable training and
with lemma match and with approximate match. test data, even though in English rather than Ital-
Approximate matching involves removing prep- ian (Attardi et al., 2014). We tested several NER
ositions, punctuations and articles from both n- tools: the Tanl NER, the Stanford NER
grams and entries and performing an exact (CRF_NER, 2014) and a Deep Learning NER
match. (NLPNET, 2014), which we developed based on
The parse based matching enables to deal also the SENNA architecture (SENNA, 2011). While
with some kinds of discontiguous mentions, i.e. the first two taggers rely on a rich feature sets
entity mentions interleaved with modifiers, e.g. and supervised learning, the Deep Learning tag-
adjectives, verb or adverbs. The matching pro- ger uses almost no features and relies on word
cess involves parsing each sentence in the IMR embeddings, learned through an unsupervised
with the DeSR parser (Attardi et al., 2009), se- process from unannotated texts, along the ap-
lecting noun phrases corresponding to subtrees proach by Collobert et al. 2011.
whose root is a noun and consisting of certain We created an unannotated corpus (UC) com-
patterns of nouns, adjectives and prepositions, bining 100,000 terms from the English Wikipe-
and finally searching these noun phrases in the dia and 30,000 additional terms from a subset of
thesaurus. unannotated medical texts from the MIMIC cor-
pus (Moody and Marks, 1996). The word em-
beddings for the UC are computed by training a
deep learning architecture initialized to the val-
ues provided by Al-Rfou’ et al. (2013) for the
English Wikipedia and to random values for the
medical terms.
All taggers use dictionary features. We created
a dictionary of disease terms (about 22,000 terms
Figure 1: A parsed sentence from the IMR. from the “Disease or Syndrome” semantic type
Figure 1 shows a sample sentence and its parse of UMLS) excluding the most frequent words
tree from which the following noun phrases are from Wikipedia.
identified: reumatismo acuto, reumatismo ar- The Tanl NER could be customized with addi-
ticolare, reumatismo in giovane età, gio- tional cluster features, extracted from a small
vane età. Among these, reumatismo acuto is
window of input tokens. The clusters of UC
recognized as an ICD9 disease. terms were calculated using the following algo-
Overall, we were able to identify over 100,000 rithms:
entities in the IMR corpus by means of the dic-  DBSCAN (Ester et al., 1996) as implement-
tionary approach. The process is expected to ed in scikit-learn (SCIKIT, 2014). Applied to
guarantee high precision: manual inspection of the word embeddings it produced a set of
4,000 sentences, detected a 96% precision. 572 clusters.
Besides recognized entities in the thesaurus,  Continuous Vector Representation of Words
we annotated temporal expressions by a version (Mikolov et al., 2013), using the word2vec
of HeidelTime (Strotgen and Gertz, 2013) library (WORD2VEC, 2014) with several
adapted to Italian (Attardi and Baronti, 2014). settings.
We obtained the best accuracy with word2vec
3 NE Recognition using a set of 2,000 clusters.
We split the analysis of medical records into two 3.1 Conversion to IOB format
steps: recognition of mentions and assignment of
unique identifiers (CUI) to those mentions. Before applying NE tagging, we had to convert
the medical records into the IOB format used by

most NE taggers. The conversion is not straight- is created if the terms belong to the same men-
forward since clinical reports contain discontigu- tion, a negative one otherwise.
ous and overlapping mentions. For example, in: The classifier was trained using a binned dis-
Abdomen is soft, nontender, non- tance feature and dictionary features, extracted
distended, negative bruits for each pair of words in the training set.
For mapping entities to CUIs we applied fuzzy
there are two mentions: Abdomen nontender and matching (Fraser, 2011) between the extracted
Abdomen bruits.
mentions and the textual description of entities
The IOB format does not allow either discon- present in a set of UMLS disorders. The CUI
tinuity or overlaps. We tested two conversions: from the match with highest score is chosen.
one by replicating a sentence, each version hav- Our submission reached a comparable accura-
ing a single mention from a set of overlapping cy to the best ones based on a single system ap-
ones. The second approach consisted in using proach (Pradhan et al., 2014), with an F-score of
additional tags for disjoint and shared mentions 0.65 for Task A and 0.83 for Task A relaxed. For
(Tang et al., 2013): DISO for contiguous men- Task B and Task B relaxed the accuracies were
tions; DDISO for disjoint entity words that are 0.46 and 0.70 respectively. Better results were
not shared by multiple mentions; HDISO for the achieved by submissions that used an ensemble
head word that belongs to more than one disjoint of taggers.
mentions. We also attempted combinations of the out-
We tested the accuracy of various NE taggers puts from the Tanl NER (with word2vec cluster
on the SemEval development set. The results are features), Nlpnet NER and Stanford NER in sev-
reported in Table 1. Results marked with discont eral ways. The best results were obtained by a
were obtained with the additional tags for dis- simple one voting approach, taking the union of
contiguous and overlapping mentions. all annotations. The results of the evaluation, for
both the multiple copies and discount annotation
NER Precision Recall F- score style, are shown below:
Tanl 80.41 65.08 71.94
Tanl+dbscan 80.43 64.48 71.58 NER Precision Recall F- score
Tanl+word2vec 79.70 67.44 73.06 Agreement multiple 73.96 73.68 73.82
Nlpnet 80.29 62.51 70.29 Agreement discont 81.69 65.85 72.92
Stanford 80.30 64.89 71.78
CRFsuite 79.69 61.97 69.72 Table 2: Accuracy of NER system combination.
Tanl discont 78.57 65.35 71.35
Nlpnet discont 77.37 63.76 69.61 4 Conclusions
Stanford discont 80.21 62.79 70.44
We presented a series of experiments on bio-
Table 1: Accuracy on the development set. medical texts from both medical literature and
clinical records, in multiple languages, that
3.2 Semeval 2014 NER for clinical text helped us to refine the techniques of NE recogni-
The task 7 of SemEval 2014 allowed us to test tion and to adapt them to Italian. We explored
NE tagging techniques on medical records and to supervised techniques as well as unsupervised
adapt them to the task. Peculiarly, only one class ones, in the form of word embeddings or word
of entities, namely diseases, is present in the cor- clusters. We also developed a Deep Learning NE
pus. tagger that exploits embeddings. The best results
We dealt with overlapping mentions by con- were achieved by using a MEMM sequence la-
verting the annotations. Discontiguous mentions beler using clusters as features improved in an
were dealt in two steps: the first step identifies ensemble combination with other NE taggers.
contiguous portions of a mention with a tradi- As an further contribution of our work, we
tional sequence labeler; then separate portions of produced, by exploiting semi-automated tech-
mentions are combined into a full mention with niques, an Italian corpus of medical records, an-
guidance from a Maximum Entropy classifier notated with mentions of medical terms.
(Berger et al., 1996), trained to recognize which
pairs belong to the same mention. The training Acknowledgements
set consists of all pairs of terms within a docu- Partial support for this work was provided by
ment annotated as disorders. A positive instance project RIS (POR RIS of the Regione Toscana,
CUP n° 6408.30122011.026000160).

References                                                    JRC-Acquis Multilingual Parallel Corpus, Version
                                                                 2.2.          2014.          Retrieved          from:
AIFA open data. 2014. Retrieved from:                            http://optima.jrc.it/Acquis/index_2.2.html
   http://www.agenziafarmaco.gov.it/it/content/dati-          Philipp Koehn, et al. 2007. Moses: Open source
   sulle-liste-dei-farmaci-open-data                             toolkit for statistical machine translation. In Proc.
Rami Al-Rfou’, Bryan Perozzi, and Steven Skiena.                 of the 45th Annual Meeting of the ACL on Interac-
   2013. Polyglot: Distributed Word Representations              tive Poster and Demonstration Sessions. ACL.
   for Multilingual NLP. In Proc. of Conference on            MDRITA15_1. 2012. Medical Dictionary for Regula-
   Computational Natural Language Learning,                      tory Activities Terminology (MedDRA) Version
   CoNLL 2013, pp. 183-192, Sofia, Bulgaria.                     15.1, Italian Edition; MedDRA MSSO; September,
Giuseppe Attardi et al., 2009. Tanl (Text Analytics              2012.
   and Natural Language Processing). SemaWiki pro-            MSHITA2013. 2013. Italian translation of Medical
   ject: http://medialab.di.unipi.it/wiki/SemaWiki               Subject Headings. Istituto Superiore di Sanità, Set-
Giuseppe Attardi, et al. 2009. The Tanl Named Entity             tore Documentazione. Roma, Italy.
   Recognizer at Evalita 2009. In Proc. of Workshop           Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jef-
   Evalita’09 - Evaluation of NLP and Speech Tools               fey Dean. 2013. Efficient Estimation of Word Rep-
   for Italian, Reggio Emilia, ISBN 978-88-903581-1-             resentations in Vector Space. In Proc. of Workshop
   1.                                                            at ICLR.
Giuseppe Attardi, Felice Dell’Orletta, Maria Simi and         George B. Moody and Roger G. Mark. 1996. A Data-
   Joseph Turian. 2009. Accurate Dependency Pars-                base to Support Development and Evaluation of In-
   ing with a Stacked Multilayer Perceptron. In Proc.            telligent Intensive Care Monitoring. Computers in
   of Workshop Evalita’09 - Evaluation of NLP and                Cardiology 23:657–660.
   Speech Tools for Italian, Reggio Emilia, ISBN              NLPNET.             2014.        Retrieved          from
   978-88-903581-1-1.                                            https://github.com/attardi/nlpnet/
Giuseppe Attardi, Andrea. Buzzelli, Daniele Sartiano.         Sameer Pradhan, et al. 2014. SemEval-2014 Task 7:
   2013. Machine Translation for Entity Recognition              Analysis of Clinical Text. Proc. of the 8th Interna-
   across Languages in Biomedical Documents. Proc.               tional Workshop on Semantic Evaluation (SemEval
   of CLEF-ER 2013 Workshop, September 23-26,                    2014), August 2014, Dublin, Ireland, pp. 5462.
   Valencia, Spain.                                           Dietrich Rebholz-Schuhmann et al. 2010. The
Giuseppe Attardi, Vittoria Cozza, Daniele Sartiano.              CALBC Silver Standard Corpus for Biomedical
   2014. UniPi: Recognition of Mentions of Disorders             Named Entities - A Study in Harmonizing the Con-
   in Clinical Text. Proc. of the 8th International              tributions from Four Independent Named Entity
   Workshop on Semantic Evaluation. SemEval 2014,                Taggers. In Proc. of the Seventh International Con-
   pp. 754–760                                                   ference on Language Resources and Evaluation
Giuseppe Attardi, Luca Baronti. 2014. Experiments in             (LREC'10). Valletta, Malta.
   Identification of Temporal Expressions in Evalita          Dietrich Rebholz-Schuhmann, et al. 2013. Entity
   2014. Proc. of Evalita 2014.                                  Recognition in Parallel Multi-lingual Biomedical
Adam L. Berger, Vincent J. Della Pietra, and Stephen             Corpora: The CLEF-ER Laboratory Overview.
   A. Della Pietra. 1996. A maximum entropy ap-                  Lecture Notes in Computer Science, Vol. 8138,
   proach to natural language processing. Comput.                353-367
   Linguist. 22, 1 (March 1996), 39-71.                       RIS: Ricerca e innovazione nella sanità. 2014. POR
Olivier Bodenreider. 2004. The Unified Medical Lan-              RIS of the Regione Toscana. homepage:
   guage System (UMLS): integrating biomedical                   http://progetto-ris.it/
   terminology. Nucleic Acids Research, vol. 32, no.          SCIKIT. 2014 Retrieved from http://scikit-learn.org/
   supplement 1, D267–D270.                                   SENNA. Semantic/syntactic Extraction using a Neu-
Ronan Collobert et al. 2011. Natural Language Pro-               ral Network Architecture. 2011. Retrieved from
   cessing (Almost) from Scratch. Journal of Machine             http://ml.nec-labs.com/senna/
   Learning Research, 12, pp. 2461–2505.                      Jannik Strotgen and Michael Gertz. Multilingual and
CRF-NER.          2014.              Retrieved   from:           Cross-domain Temporal Tagging. 2013. Language
   http://nlp.stanford.edu/software/CRF-NER.shtml                Resources and Evaluation, June 2013, Volume 47,
EUROPARL. European Parliament Proceedings Par-                   Issue 2, pp 269-298.
   allel         Corpus        1996-2011.        2014.        Buzhou Tang et al. 2013. Recognizing and Encoding
   http://www.statmt.org/europarl/                               Discorder Concepts in Clinical Text using Machine
Jenny Rose Finkel, Trond Grenager and Christopher                Learning and Vector Space Model. Workshop of
   D. Manning 2005. Incorporating Non-local Infor-               ShARe/CLEF eHealth Evaluation Lab 2013.
   mation into Information Extraction Systems by              Buzhou Tang, et al. 2014. Evaluating Word Represen-
   Gibbs Sampling. In Proc. of the 43nd Annual                   tation Features in Biomedical Named Entity
   Meeting of the ACL, 2005, pp. 363–370.                        Recognition Tasks. BioMed Research Internation-
Neil Fraser. 2011. Diff, Match and Patch libraries for           al, Volume 2014, Article ID 240403.
   Plain Text.

                                                         21

Jorg Tiedemann. 2009. News from OPUS - A Collec-              Erik F. Tjong Kim Sang and Fien De Meulder. 2003.
   tion of Multilingual Parallel Corpora with Tools              Introduction to the CoNLL’03 Shared Task: Lan-
   and Interfaces. In Nicolov et al., eds.: Recent Ad-           guage-Independent Named Entity Recognition. In:
   vances in Natural Language Processing. Volume                 Proc. of CoNLL '03, Edmonton, Canada, 142–147.
   V. John Benjamins, Amsterdam/Philadelphia, Bo-             WORD2VEC.             2014      Retrieved    from
   rovets, Bulgaria, pp. 237–248.                                http://code.google.com/p/word2vec/

                                                         22

You can also read