Text Mining and Chatbots: where are we? - Anne Vilnat, January 2022 - LIMSI
Text Mining and Chatbots: where are we?
Anne Vilnat (LIMSI, UPSaclay)
January 2022
Plan of the course: what we plan... and hope to do!
1. Jan 7: What’s in a text: an introduction, and (some words on) syntax, Anne Vilnat
2. Jan 14: A corpus, first approach: different tools to be used..., Anne Vilnat
3. Jan 21: Semantics, Sahar Ghannay
4. Jan 28: Text mining in open domain, Pierre Zweigenbaum
5. Feb 4: Text mining in medical domain, Aurélie Névéol
6. Feb 11: Chatbot, Sophie Rosset
7. Feb 18: Projects and papers presentation
The problem: how to define what we do...
Natural Language Processing, or NLP
  Part of Computer Science and Artificial Intelligence; it studies the links between natural languages (used by humans) and computers (which use programming languages).
  It sits at the frontier between computer science and linguistics.
NLP began at the same time as... computer science itself!
  The first computers were... "big dictionaries" used to translate or decode messages!
Machine translation and... its pitfalls!
  The spirit is willing but the flesh is weak (L’esprit est fort mais la chair est faible)
  The vodka is strong but the meat is rotten (La vodka est forte mais la viande est pourrie)
The problem: some questions to answer...
When is NLP useful?
  language modelling and any use of natural language, with a lot at stake for industry
What are the difficult problems?
  ambiguity of linguistic units
  what is left implicit in natural-language text
What techniques are useful for NLP?
  "old school" but classical ones: rules and other symbolic resources (grammars, knowledge ontologies, etc.)
  and... deep learning
What is NLP for?
NLP is useful for:
  machine translation
  spelling/grammar correction
  information retrieval (text mining)
  text simplification
  conversational agents (chatbots)
  ...
An example to introduce the problem
The president of the United States was eating an apple with a knife.
Which processing steps?
  text segmentation → lexical units;
  recognition of the lexical components and their properties → lexical processing;
  recognition of the higher-level components and the relations between them → syntactic processing;
  building the meaning representation of the statement → semantic analysis;
  relating the statement to the context in which it is analyzed (text, dialog, ...) → pragmatics.
But it is not a sequential process!
Segmentation/normalization: what is a word?
Writing systems are not always clearly segmented: Chinese, Thai, ...
Typographic marks do not have a fixed role:
  . : etc. or limsi.fr or 20.3 or ...
  ’ : jusqu’à or aujourd’hui or 3’4 or Floc’h or Sotheby’s, ...
  - : Jean-Paul or donne-t-il or 06-05-04-03-02 or 1914-1918 or -10.5%, ...
  the space...
Detect and normalize typographical variants:
  France-Inter, France-inter, France Inter
  United States, United-States, US
Find numbers, dates, durations, amounts, special numbers (phone, Visa card, ...), scores (sport).
"Deal with" unknown words (neologisms, ...), words from another language (anglicisms in French, ...), typos, ...
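The cases listed above can be sketched with a small regex-based tokenizer. This is only an illustration under stated assumptions: the pattern order, the ELISION_PREFIXES set, the ALIASES table and the function names are invented for this example and are not taken from the course. A real tokenizer needs many more cases (abbreviations such as "etc.", clitics such as "donne-t-il", and so on).

```python
import re

# Alternatives are tried in order, so the most specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d{2}(?:-\d{2}){4}        # phone-like number: 06-05-04-03-02
  | \d{4}-\d{4}               # year range: 1914-1918
  | -?\d+(?:[.,]\d+)?%?       # plain numbers: 20.3, -10.5%
  | \w+(?:\.\w+)+             # dotted names: limsi.fr
  | \w+(?:[’'-]\w+)*          # words with inner apostrophes or hyphens
  | [^\w\s]                   # any other single punctuation mark
    """,
    re.VERBOSE,
)

# Hypothetical, incomplete list of French elided forms that should be split off.
ELISION_PREFIXES = {"l", "d", "j", "n", "s", "c", "m", "t", "qu", "jusqu"}

# Hypothetical alias table for name normalization.
ALIASES = {"us": "united states"}


def split_elision(token):
    """Split "jusqu’à" into "jusqu’" + "à", but keep "aujourd’hui" whole."""
    match = re.match(r"^(\w+)([’'])(\w.*)$", token)
    if match and match.group(1).lower() in ELISION_PREFIXES:
        return [match.group(1) + match.group(2), match.group(3)]
    return [token]


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(text):
        tokens.extend(split_elision(raw))
    return tokens


def normalize_name(name):
    """Map typographical variants of a name to one canonical key."""
    key = re.sub(r"[-\s]+", " ", name.strip()).lower()
    return ALIASES.get(key, key)


if __name__ == "__main__":
    print(tokenize("Jean-Paul est né le 06-05-04-03-02, jusqu’à aujourd’hui"))
    print({normalize_name(n) for n in ["France-Inter", "France-inter", "France Inter"]})
```

The same apostrophe is treated differently depending on the word to its left, which is why a lexicon (here a tiny prefix set) is needed on top of the regular expression.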
Lexical processing
Goal: identify lexical elements, their structure and their characteristics; group together the forms that share the same origin.
Lemmatization: find the canonical form of a word, or lemma
  ate → to eat
  journaux → journal
  viendras → venir
Stemming: reduce a word to its stem
  lived → liv-
  journaux, journal → journ-
  chantais, chanteras → chant-
  venons, venaient → ven-
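The difference between the two operations can be sketched as follows. This is a toy illustration, not an actual lemmatizer or stemmer: the LEMMAS dictionary and the SUFFIXES list are hypothetical and only cover the examples above. Lemmatization is modelled as a dictionary lookup, stemming as blind suffix stripping.

```python
# Hypothetical mini-dictionary: inflected form -> lemma.
LEMMAS = {"ate": "eat", "journaux": "journal", "viendras": "venir"}

# Hypothetical suffix list, longest first so that "aient" wins over shorter endings.
SUFFIXES = sorted(["aient", "eras", "aux", "ais", "ons", "al", "ed"],
                  key=len, reverse=True)


def lemmatize(word):
    """Look the form up in the dictionary; unknown forms are returned unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)


def stem(word, min_stem=3):
    """Strip the longest known suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for form in ["lived", "journaux", "journal", "chantais", "chanteras"]:
        print(form, "->", stem(form) + "-")
    print("viendras", "->", lemmatize("viendras"))
```

The stemmer needs no dictionary but produces strings that are not words ("journ-"), whereas the lemmatizer returns real words but only for forms it knows: "viendras" and "venir" share no usable suffix rule.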
Lexical processing: the result
Le président des antialcooliques mangeait une pomme avec un couteau (The president of the anti-alcohol campaigners was eating an apple with a knife)
  le: det. masc. sing., /lə/; pers. pron. masc. sing., /lə/
  président: verb, 3rd pers. plur., pres. ind./subj. [présid+ent], /pʁezid+ət/; noun masc. sing. ← présider: action of X, /pʁezidɑ̃/
  des: det. masc./fem. plur., /dɛ+z/; contracted prep. (de + les)...
  antialcooliques: adj. masc./fem. plur. [anti+alcool+ique+s], ← antialcoolique (adj): to be X, antialcoolique(X), /ɑ̃tialkɔlikə+z/
  mangeait: verb, (1st, 3rd) pers. sing., imperfect ind. [mang+e+ait], /mɑ̃ʒɛ+t/
  pomme: noun fem. sing. [pomme], /pɔmə/
  ...
Syntactic level: "simple" syntax, tagging and chunking
Goal: disambiguate ambiguous morphological labels (POS tagging); identify the boundaries of groups (not their internal structure, not the dependency relations): chunking
How: disambiguation rules/patterns; statistical models (HMM, CRF); learned disambiguation rules
Tools: rules, patterns, manually annotated corpus
Difficulties: unknown words...
Result: tags and chunks.
  [Le/Admp président/Vpi3p] [des/Prep antialcooliques/Ncmp] [mange/Vpi3s] [une/Aifs pomme/Ncfs]...
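A minimal sketch of the rule-based variant of tagging followed by chunking. Everything here is an assumption made for illustration: the coarse tagset (DET, NOUN, ...), the LEXICON entries, the two context rules and the CHUNK_STARTERS set are hypothetical and much simpler than the tags shown in the result above. Statistical models (HMM, CRF) would replace the hand-written rules with probabilities estimated from an annotated corpus.

```python
# Hypothetical lexicon: each form lists its candidate tags, worst guess first
# so that the context rules visibly do some work.
LEXICON = {
    "le": ["PRON", "DET"],
    "président": ["VERB", "NOUN"],
    "des": ["PREP"],
    "antialcooliques": ["ADJ", "NOUN"],
    "mange": ["VERB"],
    "une": ["DET"],
    "pomme": ["NOUN"],
}

# Tags that open a new chunk in this toy chunker.
CHUNK_STARTERS = {"DET", "PREP", "VERB", "PRON"}


def disambiguate(words):
    """Pick one tag per word using two hand-written context rules."""
    tags = []
    for i, word in enumerate(words):
        candidates = LEXICON.get(word.lower(), ["UNK"])
        following = LEXICON.get(words[i + 1].lower(), ["UNK"]) if i + 1 < len(words) else []
        previous = tags[-1] if tags else None
        if previous in ("DET", "PREP") and "NOUN" in candidates:
            tag = "NOUN"          # rule 1: a noun reading wins after DET or PREP
        elif "DET" in candidates and "NOUN" in following:
            tag = "DET"           # rule 2: a determiner reading wins before a possible noun
        else:
            tag = candidates[0]   # fallback: first listed candidate
        tags.append(tag)
    return list(zip(words, tags))


def chunk(tagged):
    """Group tagged words into flat chunks; no internal structure is built."""
    chunks = []
    for word, tag in tagged:
        if tag in CHUNK_STARTERS or not chunks:
            chunks.append([])
        chunks[-1].append((word, tag))
    return chunks


if __name__ == "__main__":
    sentence = "Le président des antialcooliques mange une pomme".split()
    for group in chunk(disambiguate(sentence)):
        print("[" + " ".join(f"{w}/{t}" for w, t in group) + "]")
```

The chunker only marks group boundaries, which is exactly the limitation stated in the goal: it says nothing about which chunk depends on which.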
Syntactic level
Goal: identify syntactic components (phrases), their function, and the relations between them.
How: syntactic parsing, producing a tree or a dependency structure
Tools: a syntactic parser, which defines the representation and the way to obtain it
Difficulties: compromise between a rich description, processing time and proliferation of ambiguities; complexity of linguistic phenomena; robustness against "noise" (typos, grammatical errors...).
Result: one (or many) syntactic representations of the sentence.
Syntactic level: ambiguity due to lexical entries
One of the most important problems for syntactic parsing is ambiguity.
Lexical ambiguity:
  souris: verbal forms of sourire (to smile); feminine noun, singular and plural (mouse);
  petit: adjective or masculine singular noun;
  la: determiner or personal pronoun, feminine singular; masculine noun;
  mousse: verbal forms of mousser (to foam); masculine noun (cabin boy, in the navy); feminine noun (foam);
If the description is more precise, the ambiguity increases: monter (monter un escalier, monter un cheval, monter une pièce, ...).
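The growth of ambiguity with the precision of the description can be made concrete by counting candidate tag sequences. This is a small sketch under stated assumptions: the two lexicons (COARSE and FINE), their tag names and the phrase "la petite mousse" are invented for illustration. The number of sequences a parser has to consider before any disambiguation is the product of the number of candidate tags per word.

```python
from itertools import product
from math import prod

# Hypothetical coarse lexicon: part of speech only.
COARSE = {
    "la": ["DET", "PRON", "NOUN"],
    "petite": ["ADJ", "NOUN"],
    "mousse": ["VERB", "NOUN"],
}

# Hypothetical finer lexicon: gender is added, which splits the noun reading of "mousse".
FINE = {
    "la": ["DET-fs", "PRON-fs", "NOUN-ms"],
    "petite": ["ADJ-fs", "NOUN-fs"],
    "mousse": ["VERB", "NOUN-ms", "NOUN-fs"],
}


def count_readings(words, lexicon):
    """Number of tag sequences: product of the candidate counts per word."""
    return prod(len(lexicon.get(w.lower(), ["UNK"])) for w in words)


def tag_sequences(words, lexicon):
    """Enumerate every candidate tag sequence explicitly."""
    options = [lexicon.get(w.lower(), ["UNK"]) for w in words]
    return list(product(*options))


if __name__ == "__main__":
    phrase = "la petite mousse".split()
    print("coarse:", count_readings(phrase, COARSE))
    print("fine:  ", count_readings(phrase, FINE))
```

With these toy lexicons the three-word phrase goes from 12 to 18 candidate sequences just by adding gender to one entry.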
Syntactic level: ambiguity due to syntax
  La petite brise la glace (the little girl breaks the ice, or the light breeze chills her);
  La troupe monte Molière (stages) vs Le jockey monte Belino (rides);
  She eats an apple with a knife vs She eats an apple with the skin;
  She sees the beach with a telescope vs She sees the beach with seagulls;
  it’s the daughter of the cousin who drinks;
  he talked about having lunch with Paul;
Disambiguation is not possible at the syntactic level: we need semantics or pragmatics to decide.
The richer the grammar, the more ambiguities you obtain...
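The prepositional-phrase attachment ambiguity in "She eats an apple with a knife" can be sketched by counting parse trees with a CKY chart. The grammar below is a hypothetical toy grammar in Chomsky normal form written for this example (it is not the course's grammar), and the extra words "in the kitchen" are added only to show how the count grows.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP attaches to the noun phrase
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> categories.
LEXICAL_RULES = {
    "she": ["NP"],
    "eats": ["V"],
    "an": ["Det"], "a": ["Det"], "the": ["Det"],
    "apple": ["N"], "knife": ["N"], "kitchen": ["N"],
    "with": ["P"], "in": ["P"],
}


def count_parses(words, start="S"):
    """Count the parse trees of words rooted in start, using a CKY chart."""
    n = len(words)
    # chart[(i, j)][A] = number of ways category A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICAL_RULES.get(word.lower(), []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    ways = chart[(i, k)].get(left, 0) * chart[(k, j)].get(right, 0)
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)].get(start, 0)


if __name__ == "__main__":
    print(count_parses("she eats an apple with a knife".split()))
    print(count_parses("she eats an apple with a knife in the kitchen".split()))
```

The grammar alone cannot choose between the two attachments of "with a knife"; both trees are syntactically valid, and each additional prepositional phrase multiplies the possibilities, which is why semantic or pragmatic knowledge is needed to decide.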
Semantic level
Goal: solve referential problems; build a conceptual representation
See the first course...
Pragmatic level
Goal: integrate the statement into the current text or interaction, adding what is implicit; understand the argumentative function of the sentence: what is new, what is it about, is it new information, a question?
How: knowledge about human activities and about human interaction (speech acts, relevance, ...); about rhetorical and discursive structures...
Tools: world knowledge (scripts), interaction "grammar", ...
Difficulties: the amount of knowledge to represent, the specification of the "grammar" of interactions
Result: formal representation, new knowledge, ... but... complex!
A short history of NLP: the beginnings...
1952: first conference on machine translation at MIT, organised by Bar-Hillel
1954: first machine translation system (Russian → English...)
  The spirit is willing but the flesh is weak
  the box is in the pen ↔ the pen is in the box.
Bar-Hillel: "translation is not possible"...
ALPAC report (1966): translation too expensive, no results...: no more funding!
A short history of NLP: the beginnings, elsewhere...
Harris: distributional linguistics (1951 to 1954)
Chomsky: syntax, natural language grammars / formal language grammars (1957)
and then AI arrives in 1956 (Dartmouth summer school): McCarthy, Minsky, Newell, Simon → computers with language abilities
first systems: BASEBALL (1961), SIR (1964), STUDENT (1964), ELIZA (1966); see: Emacs, Meta-X doctor
knowledge representation: Quillian (semantic networks)
→ 1972: SHRDLU (Winograd), the first system that "understands": http://hci.stanford.edu/winograd/shrdlu/
A short history of NLP: the help of semantics, the progress of syntax...
1970s: systems developed by Schank, Wilks; semantics is the most important... BUT how will it be possible to provide ALL the necessary knowledge?
1980s: progress in syntax: unification grammars... BUT how will it be possible to provide all the rules?
2000-: deep learning, transformers, and... what are the limits? how to learn? with manually annotated data?
A short history of NLP: state of the art...