Representation Learning of Documents Driven by Knowledge Resources - IRIT
Representation Learning of Documents Driven by Knowledge Resources
Lynda Tamine
University Paul Sabatier, Institut de Recherche en Informatique de Toulouse (IRIT)
e-mail: lynda.lechani@irit.fr
http://www.irit.fr/~Lynda.Tamine-Lechani/
Ontologies, Données et Informatique Médicale (ODIM), Toulouse, May 27th, 2019

Objectives
- Introduce the semantic gap problem in information retrieval (IR)
- Design document representation learning models that combine distributional semantics with the human-established semantics provided by external structured knowledge resources
- Compare online vs. offline representation learning strategies on IR and Natural Language Processing (NLP) tasks
- Provide lessons for future representation learning frameworks

Outline
I- The semantic gap problem in IR
II- Representation learning of documents driven by external knowledge resources
-- Online learning strategy
-- Offline learning strategy
III- Empirical evaluation on IR and NLP tasks
IV- Lessons learned and implications

The semantic gap: a longstanding research issue in IR

Anatomy of an IR process
- Information need -> query text -> generate query representation -> query vector
- Document text -> generate document representation -> doc. vector
- Estimate relevance from the query vector and the doc. vector, using manually designed features (term frequency, term position, length, ...) or learning to rank (SVM, NN, ...)

Example of a relevance estimate built on manually designed features:

$$\sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1) \times c(w,d)}{k_1\left((1 - b) + b\,\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1) \times c(w,q)}{k_3 + c(w,q)}$$

where N is the number of documents, df(w) the document frequency of w, c(w,d) and c(w,q) the counts of w in the document and in the query, |d| the document length, avdl the average document length, and k_1, b, k_3 tuning parameters.

Main issues (in query-document matching)
- Lexical gap: aspirin vs. acetylsalicylic acid
- Granularity mismatch: aspirin, anacardic acid vs. salicylates
- Polysemy: bass (fish vs. part of harmony)

Impact on relevance estimate
- Default sense matching between queries and documents
- Low levels of retrieval performance (effectiveness in terms of recall/precision)

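The ranking function on this slide has the shape of Okapi BM25. As a minimal sketch of how it is evaluated, the snippet below scores a toy corpus for a one-word query. The function name `bm25_score`, the default values of `k1`, `b` and `k3`, and the four toy documents are assumptions made for illustration; none of them come from the talk.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=8.0):
    """Score one document for one query with the BM25-shaped formula.

    k1, b and k3 are tuning parameters; the defaults here are assumed,
    commonly used values, not values reported in the talk.
    """
    c_d = Counter(doc_terms)    # c(w, d): term counts in the document
    c_q = Counter(query_terms)  # c(w, q): term counts in the query
    doc_len = len(doc_terms)    # |d|
    score = 0.0
    # Only terms shared by the query and the document contribute.
    for w in c_q.keys() & c_d.keys():
        df = doc_freq[w]
        # Note: this idf factor becomes negative when df > num_docs / 2.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        tf_doc = ((k1 + 1) * c_d[w]) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + c_d[w])
        tf_query = ((k3 + 1) * c_q[w]) / (k3 + c_q[w])
        score += idf * tf_doc * tf_query
    return score


# Toy corpus, made up for illustration only.
docs = [
    ["aspirin", "reduces", "fever"],
    ["acetylsalicylic", "acid", "reduces", "pain"],
    ["bass", "fish", "recipe"],
    ["bass", "guitar", "harmony"],
]
num_docs = len(docs)
avg_doc_len = sum(len(d) for d in docs) / num_docs
doc_freq = Counter(w for d in docs for w in set(d))

query = ["aspirin"]
scores = [bm25_score(query, d, doc_freq, num_docs, avg_doc_len) for d in docs]
print(scores)
```

With the query `aspirin`, the second document mentions only acetylsalicylic acid, shares no term with the query, and therefore scores zero. This is the lexical gap in miniature: exact term matching cannot see that the two names denote the same substance.
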
The semantic gap: a longstanding research issue in IR
The semantic gap in medical IR: a review of the TREC medical search track (Edinger et al. 2012)
- Task: clinical cohort search
- Query: expression of disease/condition sets and treatments or interventions, e.g., "find patients with gastroesophageal reflux disease who had an upper endoscopy"
- Documents: de-identified medical visit reports
[Figure panels: False negative; False positive]

Research directions
How to tackle? Hybrid models for document representation learning
- Generate the query representation and the document representation from both human-established semantics and distributional semantics, then estimate relevance

Complementarity of human-established AND distributional semantics

Issue                           Human-established semantics   Distributional semantics
Lexical gap                     ++                            ++
Granularity mismatch            ++                            +
Polysemy                        +                             -
Word pair relation inference    -                             +
Sense readability               ++                            -
Domain adaptability             -                             +

Research directions
Our approach: (human) knowledge-enhanced representation learning of documents

Knowledge-enhanced representation learning: related work compared on four criteria (word embedding, concept embedding, document embedding, embeddings learning with relational constraints)

Joint learning of embeddings (online learning)
[Liu et al., 2016] X X
[Yu and Dredze, 2014] X X
[Jauhar et al., 2015] X X
[Liu et al., 2018] X X X
[Mancini et al., 2016] X X
[Cheng et al., 2015] X X
[Yamada et al., 2016] X X
Our model X X X X

Retrofitting of embeddings (offline learning)
[Faruqui et al., 2014] X X
[Glavas and Vulic, 2018] X X
[Mrksic et al., 2016] X X
[Jauhar et al., 2015] X X

Research directions
Our approach: (human) knowledge-enhanced representation learning of documents
To the best of our knowledge, there is no strongly related work in IR.
- Zhang et al. (2018): Neural Information Retrieval: A Literature Review (extracted from the review paper, Information Retrieval Journal, June 2018, Volume 21, Issue 2–3, pp. 107–110)
- Nguyen et al. (2016): Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: Toward a Deep Neural Approach for Knowledge-Based IR, Workshop on Neural IR, in conjunction with SIGIR'2016

Representation learning of documents
Our general approach: (human) knowledge-enhanced representation learning of documents
Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...
- Retrofitting document embeddings (offline learning): Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: Learning Concept-Driven Document Embeddings for Medical Information Search. Artificial Intelligence in Medicine (AIME) 2017: 160-170
- Joint learning of word, concept, document embeddings (online learning): Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf: A Tri-Partite Neural Document Language Model for Semantic Information Retrieval. Extended Semantic Web Conference (ESWC) 2018: 445-461

Representation learning of documents
Offline representation learning of documents: driving idea and illustration in the medical domain

Representation learning of documents
Representation learning driven by relational semantics: enhance the readability of the learning outcomes
Relational constraints:
C1: Constrain the distributional learning model towards better revealing paradigmatic word-word relations, based on the word-concept relations established in the knowledge resource.
C2: Favour the learning of syntagmatic similarity relations between words linked to related concepts in the knowledge resource through concept-concept relations.
Goal: make the vector representations of words/concepts that are related in the knowledge resource close to each other.

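To make the idea of pulling related vectors together concrete, here is a minimal sketch of a retrofitting-style update in the spirit of Faruqui et al. (2014), which the talk lists among offline approaches. It is not the authors' model: it works on word vectors only, uses uniform neighbour weights, and the function name `retrofit`, the parameter `alpha`, and the toy vectors and relations are all assumptions for illustration.

```python
import numpy as np


def retrofit(embeddings, relations, n_iters=10, alpha=1.0):
    """Pull vectors of related items together while staying near the originals.

    embeddings: dict mapping item -> 1-D numpy array (pre-trained vectors)
    relations:  dict mapping item -> list of related items (knowledge resource)
    alpha:      weight of the original vector (assumed value, not from the talk)
    """
    original = {w: np.asarray(v, dtype=float) for w, v in embeddings.items()}
    fitted = {w: v.copy() for w, v in original.items()}
    for _ in range(n_iters):
        for w, neighbours in relations.items():
            nbrs = [n for n in neighbours if n in fitted]
            if w not in fitted or not nbrs:
                continue
            # Uniform neighbour weights beta = 1 / degree, so they sum to 1.
            beta = 1.0 / len(nbrs)
            neighbour_part = beta * sum(fitted[n] for n in nbrs)
            # Weighted average of the original vector and the neighbours.
            fitted[w] = (alpha * original[w] + neighbour_part) / (alpha + 1.0)
    return fitted


# Toy vectors and relations, made up for illustration only.
toy_embeddings = {
    "aspirin": np.array([1.0, 0.0]),
    "salicylate": np.array([0.0, 1.0]),
    "bass": np.array([-1.0, 0.0]),
}
toy_relations = {"aspirin": ["salicylate"], "salicylate": ["aspirin"]}

fitted = retrofit(toy_embeddings, toy_relations)
before = np.linalg.norm(toy_embeddings["aspirin"] - toy_embeddings["salicylate"])
after = np.linalg.norm(fitted["aspirin"] - fitted["salicylate"])
print(before, after)
```

In this toy run the two related items end up closer than they started, while `bass`, which has no relation in the toy resource, keeps its original vector. Because the update runs after the embeddings are trained, it corresponds to the offline strategy; an online strategy would instead place such constraints inside the training objective.
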
Representation learning of documents
Representation learning driven by relational semantics: enhance the readability of learning outcomes
Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...
- Retrofitting document embeddings (offline learning)
- Joint learning of word, concept, document embeddings (online learning)

Experimental validation
Experimental setup

Experimental validation
Results: comparing the quality of document embeddings using offline vs. online learning strategies
Main general observations and trends:
O1. Offline models generally achieve better results for NLP similarity tasks than for NLP classification tasks
O2. Offline models are more effective in NLP within general domains, while online models are more effective in the medical domain
O3. Offline and online models behave similarly in IR tasks, while being more effective in medical search tasks

Experimental validation
Results: focus on query expansion in offline learning
Main observation:
O4. Query expansion in offline models is more effective for medical search tasks
Examples (original query, then expanded query):
- Ohsumed 07: "young woman with lactase deficiency" -> "young woman with lactase deficiency fibrosis abscess"
- Ohsumed 35: "26 yo female with bulimia" -> "26 yo female with bulimia hypertension failure"

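The slide does not say how expansion terms are selected. One common embedding-based scheme, sketched below purely as an assumption, appends the vocabulary terms whose vectors are closest (by cosine similarity) to the centroid of the query term vectors. The function name `expand_query`, the parameter `k`, and the toy vectors are hypothetical and do not reproduce the expansions reported above.

```python
import numpy as np


def expand_query(query_terms, embeddings, k=2):
    """Append the k vocabulary terms closest to the query centroid.

    embeddings: dict mapping term -> 1-D numpy array
    k:          number of expansion terms (assumed value)
    """
    known = [t for t in query_terms if t in embeddings]
    if not known:
        return list(query_terms)  # nothing to expand from
    centroid = np.mean([embeddings[t] for t in known], axis=0)

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    # Rank every term that is not already in the query.
    candidates = [
        (cosine(centroid, vec), term)
        for term, vec in embeddings.items()
        if term not in query_terms
    ]
    candidates.sort(reverse=True)
    return list(query_terms) + [term for _, term in candidates[:k]]


# Toy vectors, made up for illustration only.
toy_vectors = {
    "lactase": np.array([1.0, 0.0, 0.0]),
    "deficiency": np.array([0.9, 0.1, 0.0]),
    "intolerance": np.array([0.95, 0.05, 0.0]),
    "malabsorption": np.array([0.8, 0.2, 0.0]),
    "guitar": np.array([0.0, 0.0, 1.0]),
}

expanded = expand_query(["lactase", "deficiency"], toy_vectors, k=2)
print(expanded)
```

The quality of the added terms depends entirely on the embedding space, which is why knowledge-enhanced embeddings can change the expansion outcome.
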
Experimental validation
Results: measuring the impact of considering relational constraints in both offline and online learning
Main observations and trends:
O5. Considering relational constraints improves effectiveness in both NLP and IR tasks
O6. Considering both word and concept relations is more effective than considering only one type of relation
[Figure panels: Document re-ranking; Query expansion]

Conclusion and Implications
Lessons learned
• The performance of online and offline learning models is both task- and domain-dependent
  o Online models are more effective in identifying similarity signals
  o Both online and offline models are effective in identifying relevance signals
  o Both offline and online models are more effective in medical IR tasks than in general-domain search tasks. This is reversed in the case of NLP tasks
• Relational knowledge is useful for driving the distributional learning in both NLP and IR tasks
  o Constraining the learning with relational knowledge is effective in both NLP and IR tasks. The learning leverages both word-word relations and concept-concept relations

Conclusion and Implications
Pending issues and perspectives
• (Main) pending issues
  o Robustness of the models: significant performance variation depending on multiple factors (knowledge resource, task, annotation quality, etc.)
  o Transferring the learning to new senses: particularly challenging in specialized domains and/or with low-resource languages. Cross-domain performance is important in IR
• Perspectives
  o Consider the relation types in the learning objective to better map the vector representation to the knowledge resource
  o Constrain the learning with multiple (heterogeneous) structured knowledge resources

Lynda Tamine, IRIT, UPS
Laure Soulier, LIP6, Sorbonne Université
Gia Nguyen, IRIT, UPS
Nathalie Souf, IRIT, UPS
Thank You