Text Mining and Chatbots: where are we? - Anne Vilnat, January 2022 - LIMSI

Text Mining and Chatbots : where are we?

                                          Anne Vilnat

                                         LIMSI, UPSaclay

                                          January 2022

Anne Vilnat (LIMSI, UPSaclay)   Text Mining and Chatbots: where are we?   January 2022   1 / 23
Plan

Plan of the course: what we plan... and hope to do!

  1   Jan 7: What’s in a text: an introduction, and (some words on) syntax, Anne Vilnat
  2   Jan 14: A corpus, first approach: different tools to be used, Anne Vilnat
  3   Jan 21: Semantics, Sahar Ghannay
  4   Jan 28: Text mining in the open domain, Pierre Zweigenbaum
  5   Feb 4: Text mining in the medical domain, Aurélie Névéol
  6   Feb 11: Chatbot, Sophie Rosset
  7   Feb 18: Projects and papers presentation

The problem

How to define what we do...
Natural Language Processing, or NLP
      A part of computer science and artificial intelligence that studies the links
      between natural languages (used by humans) and computers (which use
      programming languages)
      at the frontier between computer science and linguistics
      NLP began at the same time as ... computer science itself!
      the first computers were ... "big dictionaries" used to translate or decode
      messages!

Machine translation and ... its pitfalls!
   The spirit is willing but the flesh is weak
   (in French: L’esprit est fort mais la chair est faible)
      The vodka is strong but the meat is rotten
      (in French: La vodka est forte mais la viande est pourrie)

Some questions to answer...

When is NLP useful?
      language modelling
      wherever natural language is used, with high industrial stakes
What are the difficult problems?
      ambiguity of linguistic units
      implicit information in natural-language text
Which techniques are useful for NLP?
      "old school" but classical ones: rules and other symbolic resources (grammars,
      knowledge ontologies, etc.)
      and ... deep learning

What is NLP useful for?

NLP is useful for :

      machine translation
      spelling/grammar correction
      information retrieval (text mining)
      text simplification
      conversational agents (chatbots)
      ...


An example to introduce the problem

The president of the United States was eating an apple with a knife.
Which processing steps?
      text segmentation → lexical units
      recognition of the lexical components and their properties →
      lexical processing;
      recognition of the higher-level components and the relations between
      them → syntactic processing;
      building the meaning representation of the statement → semantic
      analysis;
      how may this statement be related to the context in which it is
      analyzed (text, dialog, ...)? → pragmatics.
But it is not a sequential process!

Segmentation

Segmentation/normalization: what is a word?

      Writing systems do not always mark word boundaries: Chinese, Thai, ...
      Typography is not always consistent:
            . : etc. or limsi.fr or 20.3 or ...
            ’ : jusqu’à or aujourd’hui or 3’4 or Floc’h or Sotheby’s, ...
            - : Jean-Paul or donne-t-il or 06-05-04-03-02 or 1914-1918 or -10.5%, ...
            the space ...
      Detect and normalize typographical variants:
            France-Inter France-inter France Inter
            United States United-States US
      Find numbers, dates, durations, amounts, special numbers
      (phone, Visa card, ...), scores (sport)
      “Deal with” unknown words (neologisms, ...), words from another
      language (anglicisms in French, ...), typos, ...
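
To make the typographic cases above concrete, here is a minimal sketch of a rule-based tokenizer and of a variant normalizer. It is only an illustration: the names TOKEN_RE, tokenize and normalize_variant are hypothetical, and the rules cover just the kinds of examples listed on this slide (a real tokenizer needs many more rules and exception lists).

```python
import re

# Ordered alternatives: at each position, the first alternative that matches wins.
# All names and rules here are illustrative assumptions, not a reference tokenizer.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[-/]\d+)+              # digit groups joined by - or /: 1914-1918, 06-05-04-03-02
  | [+-]?\d+(?:[.,]\d+)?%?       # signed or decimal numbers, optional percent: 20.3, -10.5%
  | \w+(?:\.\w+)+                # dotted names: limsi.fr
  | \w+\.(?=\s+[a-z])            # abbreviation followed by a lowercase word: etc.
  | \w+(?:['’-]\w+)*             # words with inner apostrophes or hyphens: Jean-Paul
  | \S                           # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


def normalize_variant(name):
    """Map hyphen/space/case variants of a name to one canonical key."""
    return re.sub(r"[-\s]+", " ", name).casefold()


print(tokenize("Jean-Paul a vu limsi.fr en 1914-1918, etc. et -10.5% aujourd’hui."))
print({normalize_variant(v) for v in ["France-Inter", "France-inter", "France Inter"]})
```

Note the limits of such rules: a sentence-final "etc." is split into "etc" and ".", and "3’4" is not kept together, because deciding what a period or an apostrophe means requires context that a regular expression does not have.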


Lexical processing
Goal:
Identify lexical elements, their structure and characteristics; group
together the forms that share the same origin.

Lemmatization
find the canonical form of a word, or lemma
      ate → to eat
      journaux → journal (French: newspapers → newspaper)
      viendras → venir (French: you will come → to come)

Stemming
      lived → liv-
      journaux, journal → journ-
      chantais, chanteras → chant-
      venons, venaient → ven-
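
The difference between the two operations can be sketched as follows: lemmatization looks a form up in a lexicon, whereas stemming blindly strips suffixes. This is a toy illustration under stated assumptions: the dictionary LEMMAS, the suffix list SUFFIXES and both function names are hypothetical, and they only cover the examples on this slide.

```python
# Toy lexicon: inflected form -> lemma (entries taken from the examples above).
LEMMAS = {"ate": "eat", "journaux": "journal", "viendras": "venir"}

# Suffixes removed by the toy stemmer, tried from longest to shortest.
SUFFIXES = sorted(["aient", "eras", "ais", "aux", "ons", "al", "ed"], key=len, reverse=True)


def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged (lowercased)."""
    w = word.lower()
    return LEMMAS.get(w, w)


def stem(word, min_stem=3):
    """Strip the longest known suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


for form in ["lived", "journaux", "journal", "chantais", "chanteras", "venons", "venaient"]:
    print(form, "->", stem(form) + "-")
print("journaux", "->", lemmatize("journaux"))
```

The stemmer needs no lexicon but produces stems that are not words (journ-, liv-); the lemmatizer returns real dictionary entries but only for forms it knows.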

Lexical processing: the result

Le président des antialcooliques mangeait une pomme avec un couteau
(The president of the teetotallers was eating an apple with a knife)
      le: det. masc. sing., /lə/; pers. pron. masc. sing., /lə/
      président: verb, 3rd pers. plur. pres. ind./subjunctive
      [présid+ent], /pʁezid+ət/; noun masc.
      sing. ← présider: action of X, /pʁezidɑ̃/
      des: det. masc./fem. plur., /dɛ+z/; prep. contracted with les ...
      antialcooliques: adj. masc./fem. plur. [anti+alcool+ique+s], ←
      antialcoolique (adj): being X, antialcoolique(X), /ɑ̃tialkɔlikə+z/
      mangeait: verb, (1st, 3rd) pers. sing. imperfect ind., [mang+e+ait],
      /mɑ̃ʒɛ+t/
      pomme: noun fem. sing., [pomme],
      /pɔmə/
      ...


"Simple" syntax : Tagging and chunking

      Goal: disambiguate ambiguous morphological labels (POS tagging);
      identify the group boundaries (not their internal structure, nor
      the dependency relations): chunking
      How: disambiguation rules/patterns; statistical models (HMM,
      CRF); learned disambiguation rules
      Tools: rules, patterns, manually annotated corpora
      Difficulties: unknown words ...
      Result: tags and chunks.
[Le/Admp président/Vpi3p] [des/Prep antialcooliques/Ncmp]
[mange/Vpi3s][une/Aifs pomme/Ncfs]...
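
As a rough illustration of the chunking step, the sketch below groups already-disambiguated (word, tag) pairs into flat chunks using a single pattern rule. Everything here is an assumption made for the example: the coarse tag set (DET, NOUN, PREP, VERB), the rule itself and the function names chunk and show are hypothetical, and they are much simpler than the tag set shown above.

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat chunks, with no internal structure.

    Rule (illustrative only): a preposition or a verb always opens a new chunk;
    a determiner opens a new chunk unless it directly follows a lone preposition.
    """
    chunks, current = [], []
    for word, tag in tagged:
        after_lone_prep = [t for _, t in current] == ["PREP"]
        opens = tag in {"PREP", "VERB"} or (tag == "DET" and not after_lone_prep)
        if opens and current:
            chunks.append(current)
            current = []
        current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks


def show(chunks):
    """Render chunks in the bracketed word/tag notation."""
    return " ".join("[" + " ".join(f"{w}/{t}" for w, t in c) + "]" for c in chunks)


sentence = [
    ("Le", "DET"), ("président", "NOUN"), ("des", "PREP"), ("antialcooliques", "NOUN"),
    ("mangeait", "VERB"), ("une", "DET"), ("pomme", "NOUN"),
    ("avec", "PREP"), ("un", "DET"), ("couteau", "NOUN"),
]
print(show(chunk(sentence)))
```

The rule only finds chunk boundaries: it says nothing about whether "avec un couteau" attaches to the verb or to "une pomme", which is exactly what chunking leaves to full parsing.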


Syntactic level

      Goal: identify syntactic components (phrases, or syntagms), their function,
      and the relations between them.
      How: syntactic parsing, producing a tree or a dependency structure
      Tools: a syntactic parser, defining the representation and the way to obtain
      it
      Difficulties: compromise between a rich description, processing time,
      and proliferation of ambiguities; complexity of linguistic phenomena;
      robustness against “noise” (typos, grammatical errors ...).
      Result: one (or many) syntactic representation(s) of the sentence.


Ambiguity: due to lexical entries

One of the most important problems for syntactic parsing is ambiguity.
Lexical ambiguity (French examples):
      souris: verbal forms of sourire (to smile), feminine noun, singular and plural (mouse, mice);
      petit: adjective (small) or masculine singular noun (little one);
      la: determiner or feminine singular personal pronoun, masculine noun (the musical note);
      mousse: verbal forms of mousser (to foam), masculine noun (ship's boy, in the
      navy), feminine noun (foam);
If the description is more precise, the ambiguity increases: monter (monter
un escalier, to climb stairs; monter un cheval, to ride a horse; monter une pièce, to stage a play; ...).
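
To see how quickly lexical ambiguity multiplies, the sketch below enumerates every tag sequence a small ambiguous lexicon allows for a short phrase. The lexicon is an assumption built from the entries above, with coarse hypothetical tags (DET, PRON, NOUN, ADJ, VERB); "petite" is taken as the feminine form of "petit", and the phrase "la petite souris" (the little mouse) is chosen only for illustration.

```python
from itertools import product

# Coarse categories per word form, following the ambiguities listed above.
LEXICON = {
    "la": {"DET", "PRON", "NOUN"},
    "petite": {"ADJ", "NOUN"},
    "souris": {"VERB", "NOUN"},
    "mousse": {"VERB", "NOUN"},
}


def tag_sequences(words):
    """Return every combination of one tag per word allowed by the lexicon."""
    return list(product(*(sorted(LEXICON[w.lower()]) for w in words)))


sequences = tag_sequences(["La", "petite", "souris"])
print(len(sequences))                       # 3 * 2 * 2 combinations
print(("DET", "ADJ", "NOUN") in sequences)  # the intended reading is one of them
```

The number of candidate sequences is the product of the per-word ambiguities, so refining the lexicon (for instance by distinguishing the senses of monter) makes the search space grow further before any syntactic filtering.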


Ambiguity: due to syntax

      La petite brise la glace (The little girl breaks the ice / The light breeze chills her);
      La troupe monte Molière (The company stages Molière) vs Le jockey monte Belino (The jockey rides Belino);
      She eats an apple with a knife vs She eats an apple with the skin;
      She sees the beach with a telescope vs She sees the beach with
      seagulls;
      It’s the daughter of the cousin who drinks;
      He talked about having lunch with Paul;
Disambiguation is not possible at the syntactic level: we need
semantics or pragmatics to decide. The richer the grammar, the more
ambiguities you obtain...
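
The prepositional-phrase attachment ambiguity in the apple examples can be checked mechanically. The sketch below counts parses with a CKY-style chart over a toy grammar in Chomsky normal form. The grammar, the lexicon and the function name count_parses are assumptions made for this illustration, not a grammar used in the course.

```python
from collections import defaultdict

# Toy binary rules (parent, left child, right child); illustrative only.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the verb phrase (instrument reading)
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP modifies the noun phrase
    ("PP", "P", "NP"),
]
LEXICAL = {
    "she": ["NP"], "eats": ["V"], "with": ["P"],
    "an": ["Det"], "a": ["Det"], "the": ["Det"],
    "apple": ["N"], "knife": ["N"], "skin": ["N"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees of words rooted in root (CKY chart)."""
    n = len(words)
    # chart[i, j][label] = number of ways label derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for label in LEXICAL[word]:
            chart[i, i + 1][label] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    chart[i, j][parent] += chart[i, k][left] * chart[k, j][right]
    return chart[0, n][root]


print(count_parses("she eats an apple".split()))
print(count_parses("she eats an apple with a knife".split()))
print(count_parses("she eats an apple with the skin".split()))
```

With this grammar the two "with" sentences receive the same number of analyses: syntax alone cannot tell that a knife is an instrument while the skin belongs to the apple, which is why semantics or pragmatics is needed to choose.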


Semantics

      Goal: resolve reference problems; build a conceptual representation
      see the first course...


Pragmatic level

      Goal: integrate the statement into the current text or interaction,
      adding what is implicit; understand the argumentative function of the
      sentence: what is new, what is it about, is it new information, a
      question?
      How: knowledge about human activities, about human interaction
      (speech acts, relevance, ...); about rhetorical and discursive
      structures ...
      Tools: world knowledge (scripts), interaction "grammar", ...
      Difficulties: the amount of knowledge to represent, the specification of the
      interaction "grammar"
      Result: formal representation, new knowledge, ... but ... complex!

A short history of NLP

A short history: the beginnings...

      1952: first conference on machine translation at MIT, organised by
      Bar-Hillel
            The spirit is willing but the flesh is weak
            the box is in the pen
            ↔ the pen is in the box.
      1954: first machine translation system (Russian → English ...)
      Bar-Hillel: “translation is not possible” ...
      ALPAC report (1966): translation too expensive, no results ...: no more
      funding!


A short history: the beginnings, elsewhere...

      Harris: distributional linguistics (1951 to 1954)
      Chomsky: syntax, natural-language grammars / formal-language grammars
      (1957)
      and then AI arrives in 1956 (Dartmouth summer school): McCarthy,
      Minsky, Newell, Simon → computers with language abilities
      first systems: BASEBALL (1961), SIR (1964), STUDENT (1964),
      ELIZA (1966)
      see: emacs M-x doctor
      knowledge representation: Quillian (semantic networks)
      → 1972: SHRDLU (Winograd), the first system that “understands”
      http://hci.stanford.edu/~winograd/shrdlu/


A short history: the help of semantics, the progress of syntax...

      1970s: systems developed by Schank, Wilks; semantics is the most
      important ...
      BUT how can ALL the necessary knowledge be provided? ...
      1980s: progress in syntax: unification grammars
      BUT how can all the rules be provided? ...
      2000-: deep learning, transformers, and ...
      what are the limits? how to learn? manually annotated data?


State of the art...
