Representation Learning of Documents Driven by Knowledge Resources - IRIT


Lynda Tamine

University Paul Sabatier
Institut de Recherche en Informatique de Toulouse (IRIT)

e-mail: lynda.lechani@irit.fr
http://www.irit.fr/~Lynda.Tamine-Lechani/

Ontologies, Données et Informatique Médicale (ODIM), Toulouse, May 27th, 2019
Objectives

   -     Introduce the semantic gap problem in information retrieval (IR)

   -     Design document representation learning models: combine distributional
         semantics and human-established semantics provided by external structured
         knowledge resources

   -     Compare online vs. offline representation learning strategies on IR and
         Natural Language Processing (NLP) tasks

   -     Provide lessons for future representation learning frameworks

Outline

     I- The semantic gap problem in IR

     II- Representation learning of documents driven by external knowledge resources
              -- Online learning strategy
              -- Offline learning strategy

     III- Empirical evaluation on IR and NLP tasks

     IV- Lessons learned and implications

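The semantic gap: a longstanding research issue in IR

Anatomy of an IR process

   Information need
   -> query text / document text
   -> generate query representation / generate document representation
   -> query vector / document vector
   -> estimate relevance

Relevance is estimated from manually designed features (term frequency, term position, document length, ...), combined in a scoring function such as BM25:

   $$ \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\left((1 - b) + b\,\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)} $$

   where c(w,d) and c(w,q) are the counts of word w in document d and query q, df(w) is the number of documents containing w, N is the collection size, |d| is the document length, avdl is the average document length, and k_1, b, k_3 are tuning parameters,

or with learning to rank (SVM, NN, ...).

Main issues (in query-document matching)
   - Lexical gap: aspirin vs. acetylsalicylic acid
   - Granularity mismatch: aspirin, anacardic acid vs. salicylates
   - Polysemy: bass (fish vs. part of harmony)

Impact on relevance estimate
   - Faulty sense matching between queries and documents
   - Low levels of retrieval performance (effectiveness in terms of recall/precision)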
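
The term-matching relevance formula on this slide has the Okapi BM25 form, and a small sketch makes the lexical gap concrete. The code below is an illustration only: the function name `bm25_score`, the default values of `k1`, `b` and `k3`, and the toy document frequencies are assumptions, not values from the talk.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=8.0):
    """Score one document against one query with a BM25-style formula.

    doc_freq maps each word to the number of documents containing it.
    k1, b and k3 are tuning parameters; the defaults are assumed values.
    """
    c_q = Counter(query_terms)   # c(w, q)
    c_d = Counter(doc_terms)     # c(w, d)
    doc_len = len(doc_terms)     # |d|
    score = 0.0
    # Only words shared by the query and the document contribute.
    for w in c_q.keys() & c_d.keys():
        df = doc_freq[w]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        tf_doc = ((k1 + 1) * c_d[w]) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + c_d[w])
        tf_query = ((k3 + 1) * c_q[w]) / (k3 + c_q[w])
        score += idf * tf_doc * tf_query
    return score


# Toy collection statistics (invented for illustration).
doc_freq = {"aspirin": 2}
query = ["aspirin"]
matching_doc = ["aspirin", "dose"]
synonym_doc = ["acetylsalicylic", "acid", "dose"]

print(bm25_score(query, matching_doc, doc_freq, n_docs=100, avg_doc_len=3.0))
print(bm25_score(query, synonym_doc, doc_freq, n_docs=100, avg_doc_len=3.0))
```

The second document shares no surface term with the query, so the sum is empty and its score is zero even though it mentions the same drug under another name. This is the lexical gap listed among the main issues.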

The semantic gap: a longstanding research issue in IR

The semantic gap in medical IR: a review of the TREC medical search track (Edinger et al. 2012)

 Task: clinical cohort search

 Query: expression of sets of diseases/conditions and treatments or interventions
 E.g., "find patients with gastroesophageal reflux disease who had an upper endoscopy"

 Documents: de-identified medical visit reports

 [Figure panels: false negatives; false positives]

Research directions

How to tackle? Hybrid models for document representation learning

   Human-established semantics AND distributional semantics both feed the generation of the
   query representation and the document representation, which are used to estimate relevance.

Complementarity

                                      Human-established     Distributional
                                          semantics            semantics
   Lexical gap                               ++                   ++
   Granularity mismatch                      ++                   +
   Polysemy                                  +                    -
   Word pair relation inference              -                    +
   Sense readability                         ++                   -
   Domain adaptability                       -                    +

Research directions

Our approach: (human) knowledge-enhanced representation learning of documents

Knowledge-enhanced representation learning

   Embeddings learning          Word         Concept      Document     Relational
                                embedding    embedding    embedding    constraints

   Joint learning of embeddings (online learning)
   [Liu et al., 2016]              X                                       X
   [Yu and Dredze, 2014]           X                                       X
   [Jauhar et al., 2015]           X                                       X
   [Liu et al., 2018]              X            X                          X
   [Mancini et al., 2016]          X            X
   [Cheng et al., 2015]            X            X
   [Yamada et al., 2016]           X            X
   Our model                       X            X            X             X

   Retrofitting of embeddings (offline learning)
   [Faruqui et al., 2014]          X                                       X
   [Glavas and Vulic, 2018]        X                                       X
   [Mrksic et al., 2016]           X                                       X
   [Jauhar et al., 2015]           X                                       X

Research directions

Our approach: (human) knowledge-enhanced representation learning of documents

 To the best of our knowledge, no strongly related work (in IR)

 [Figure extracted from the review paper: Zhang et al. (2018), Neural Information Retrieval: A Literature Review,
 Information Retrieval Journal, June 2018, Volume 21, Issue 2–3, pp. 107–110]

 Nguyen et al. (2016): Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf:
 Toward a Deep Neural Approach for Knowledge-Based IR, Workshop on Neural IR, in
 conjunction with SIGIR'2016

Representation learning of documents

Our general approach: (human) knowledge-enhanced representation learning of documents

 Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...

 Retrofitting document embeddings (offline learning)
    Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf:
    Learning Concept-Driven Document Embeddings for Medical Information Search.
    Artificial Intelligence in Medicine (AIME) 2017: 160-170

 Joint learning of word, concept, and document embeddings (online learning)
    Gia-Hung Nguyen, Lynda Tamine, Laure Soulier, Nathalie Souf:
    A Tri-Partite Neural Document Language Model for Semantic Information Retrieval.
    Extended Semantic Web Conference (ESWC) 2018: 445-461

Representation learning of documents

Offline representation learning of documents: driving idea and illustration in the medical domain

Representation learning of documents

Representation learning driven by relational semantics: enhance the readability of the learning outcomes

   Relational constraints:
   C1: Constrain the distributional learning model towards better revealing paradigmatic word-word
       relations based on word-concept relations established in the knowledge resource.

   C2: Favour the learning of syntagmatic similarity relations between words linked to related
       concepts in the knowledge resource through concept-concept relations.

   Goal: bring the vector representations of words/concepts that are related in the knowledge
   resource close to each other.
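
The idea of pulling related words/concepts closer in the vector space can be illustrated with a retrofitting-style update in the spirit of [Faruqui et al., 2014], cited in the related-work table. This is a minimal sketch and not the model presented in the talk: the function name `retrofit`, the `related` mapping, the weight `alpha`, the uniform neighbour weights, and the toy vectors are all assumptions made for illustration. Plain Python lists are used to keep the example dependency-free.

```python
def retrofit(vectors, related, alpha=1.0, n_iters=10):
    """Move each vector towards the vectors of its related entries.

    vectors: dict mapping a word/concept to its original vector (list of floats)
    related: dict mapping a word/concept to the entries it is linked to
             in the knowledge resource (hypothetical toy input)
    alpha:   weight that keeps the new vector close to the original one
    """
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(n_iters):
        for w, neighbours in related.items():
            neighbours = [n for n in neighbours if n in new]
            if w not in new or not neighbours:
                continue
            beta = 1.0 / len(neighbours)  # uniform neighbour weights (assumption)
            dim = len(vectors[w])
            # Weighted average of the original vector and the current
            # vectors of the related entries.
            new[w] = [
                (alpha * vectors[w][i]
                 + sum(beta * new[n][i] for n in neighbours))
                / (alpha + beta * len(neighbours))
                for i in range(dim)
            ]
    return new


def distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Toy 2-d vectors (invented): two names for the same drug start far apart.
vectors = {"aspirin": [1.0, 0.0], "acetylsalicylic_acid": [0.0, 1.0]}
related = {"aspirin": ["acetylsalicylic_acid"],
           "acetylsalicylic_acid": ["aspirin"]}

fitted = retrofit(vectors, related)
print(distance(vectors["aspirin"], vectors["acetylsalicylic_acid"]))
print(distance(fitted["aspirin"], fitted["acetylsalicylic_acid"]))
```

After the update the two vectors are closer than before, while `alpha` prevents them from collapsing onto each other and losing the distributional signal.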

Representation learning of documents

Representation learning driven by relational semantics: enhance the readability of the learning outcomes

 Structured knowledge resources: UMLS, YAGO, WordNet, DBPedia, ...
    - Retrofitting document embeddings (offline learning)
    - Joint learning of word, concept, and document embeddings (online learning)

Experimental validation

Experimental setup

Experimental validation

Results: comparing the quality of document embeddings using offline vs. online learning strategies

 Main general observations and trends:
 O1. Offline models generally achieve better results for NLP similarity tasks than for NLP classification tasks
 O2. Offline models are more effective for NLP tasks in general domains, while online models are more
     effective in the medical domain
 O3. Offline and online models behave similarly in IR tasks, and both are more effective in medical search tasks

Experimental validation

Results: focus on query expansion in offline learning
  Main observation:
  O4. Query expansion in offline models is more effective for medical search tasks

  Ohsumed 07
     Original query: young woman with lactase deficiency
     Expanded query: young woman with lactase deficiency fibrosis abscess

  Ohsumed 35
     Original query: 26 yo female with bulimia
     Expanded query: 26 yo female with bulimia hypertension failure
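
One common way to expand a query with learned embeddings is to add the vocabulary terms whose vectors lie closest to the query. The sketch below illustrates that general idea under stated assumptions and is not the expansion procedure evaluated in the talk: the function names `cosine` and `expand_query`, the centroid-based query vector, the parameter `k`, and the toy 2-d vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(query_terms, embeddings, k=2):
    """Append the k terms closest to the centroid of the query term vectors."""
    known = [t for t in query_terms if t in embeddings]
    if not known:
        return list(query_terms)  # nothing to anchor the expansion on
    dim = len(embeddings[known[0]])
    centroid = [sum(embeddings[t][i] for t in known) / len(known)
                for i in range(dim)]
    candidates = [t for t in embeddings if t not in query_terms]
    ranked = sorted(candidates,
                    key=lambda t: cosine(embeddings[t], centroid),
                    reverse=True)
    return list(query_terms) + ranked[:k]


# Toy embeddings (invented for illustration, not learned vectors).
embeddings = {
    "bulimia": [1.0, 0.0],
    "hypertension": [0.9, 0.1],
    "failure": [0.8, 0.3],
    "guitar": [0.0, 1.0],
}

print(expand_query(["female", "bulimia"], embeddings, k=2))
# -> ['female', 'bulimia', 'hypertension', 'failure']
```

Query terms without a vector (here `female`) are kept in the query but do not contribute to the centroid.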

Experimental validation

Results: measuring the impact of considering relational constraints in both offline and online learning
  Main observations and trends:
  O5. Considering relational constraints improves effectiveness in both NLP and IR tasks
  O6. Considering both word and concept relations is more effective than considering only one type of relation

  [Figure panels: document re-ranking; query expansion]

Conclusion and Implications

Lessons learned

     • The performance of online and offline learning models is both
       task- and domain-dependent
           o Online models are more effective in identifying similarity signals
           o Both online and offline models are effective in identifying relevance
                signals
           o Both offline and online models are more effective in medical IR tasks
                than in general-domain search tasks. This is reversed in the case of NLP
                tasks
     • Relational knowledge is useful for driving the distributional
       learning in both NLP and IR tasks
           o Constraining the learning with relational knowledge is effective in both
                NLP and IR tasks. The learning benefits from both word-word relations
                and concept-concept relations

Conclusion and Implications

Pending issues and perspectives

  • (Main) Pending issues
         o Robustness of the models: significant performance variation depending
             on multiple factors (knowledge resource, task, annotation quality, etc.)
         o Transfer the learning to new senses: particularly challenging in
             specialized domains and/or with low-resource languages. Cross-domain
             performance is important in IR

  • Perspectives
         o Consider the relation types in the learning objective to better align the
             vector representations with the knowledge resource
         o Constrain the learning with multiple (heterogeneous) structured
             knowledge resources

Lynda Tamine, IRIT, UPS    Laure Soulier, LIP6, Sorbonne Université   Gia Nguyen, IRIT, UPS   Nathalie Souf, IRIT, UPS

                                                   Thank You