Informatics 1: Data & Analysis
Lecture 14: Example Corpora Applications

Ian Stark
School of Informatics
The University of Edinburgh

Thursday 7 March 2019
Semester 2 Week 7
https://course.inf.ed.ac.uk/inf1-da
Lecture Plan

XML — The Extensible Markup Language
We start with technologies for modelling and querying semistructured data.
- Semistructured Data: Trees and XML
- Schemas for structuring XML
- Navigating and querying XML with XPath

Corpora
One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora.
- Corpora: What they are and how to build them
- Applications: corpus analysis and data extraction
Applications of Corpora

Answering empirical questions in linguistics and cognitive science:
- Corpora can be analyzed using statistical tools;
- Hypotheses about language processing and acquisition can be tested;
- New facts about language structure can be discovered.

Engineering natural-language systems in AI and computer science:
- Corpora represent the data that these systems have to handle;
- Algorithms can find and extract regularities from corpus data;
- Text-based or speech-based applications can learn automatically from corpus data.
Outline
1. Finding Things and Counting Them
2. Small Application
3. Large Application
4. Closing
Extracting Information from Corpora

Once we have an annotated corpus, we can begin to use it to find out information and answer questions. For now, we start with the following:
- The basic notion of a concordance in a text.
- Statistics of word frequency and relative frequency, useful for linguistic questions and natural language processing.
- Word groups: unigrams, bigrams and n-grams.
- Words that mean something together: collocations.
Concordances

Concordance: all occurrences of a given word, shown in context.

That’s the simplest form. More generally, a concordance might mean all occurrences of a certain part of speech, a particular combination of words, or all matches for a query expression.

Specialist concordance programs will generate these from a given query. The query might specify a single word, some annotation (POS, etc.) or more complex information (e.g., using regular expressions).

Results are typically displayed as keyword in context (kwic): a matched keyword in the middle of a line, with a fixed amount of context to left and right.
Example Concordance

These are the opening kwic lines of a concordance for all forms of the word “remember” in a collection of novels by Charles Dickens. This was generated with the Corpus Query Processor: the same cqp tool that you will use for the current tutorial exercises.

[The kwic lines themselves have not survived transcription: the keyword column has been lost, leaving only fragments of left and right context. Each original line showed a form of “remember” centred between a few words of context on either side.]
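Even without access to cqp, the kwic display itself is easy to reproduce. Below is a minimal Python sketch (not the cqp tool, and assuming plain tokenised text rather than an annotated corpus) that prints every match of a keyword centred in a fixed window of context.

```python
# Minimal kwic sketch: find tokens matching a keyword and show each in
# a fixed window of context. A toy stand-in for a concordancer like cqp.
import re

def kwic(text: str, keyword: str, window: int = 4) -> list:
    """Return one context line per token starting with `keyword`."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # split into words and punctuation
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().startswith(keyword.lower()):  # "remember", "remembered", ...
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25}  {tok}  {right}")
    return lines

for line in kwic("I remember it well; she remembered nothing at all.", "remember"):
    print(line)
```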
Frequencies

Frequency information obtained from corpora can be used to investigate characteristics of the language represented.
- Token count N: the number of tokens (words, punctuation marks, etc.) in a corpus; i.e., the size of the corpus.
- Absolute frequency f(t) of type t: the number of tokens of type t in a corpus.
- Relative frequency of type t: the absolute frequency of t scaled by the overall token count; i.e., f(t)/N.
- Type count: the number of different types of token in a corpus.

Here “tokens of type t” might mean a single word, or all its variants, or every use of a certain part of speech.
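Each of these measures is a one-liner over a token list. A minimal sketch using Python’s standard library, with a toy token list standing in for a real corpus:

```python
# Token count, type count, absolute and relative frequency over a toy corpus.
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

N = len(tokens)          # token count N: the size of the corpus
freq = Counter(tokens)   # absolute frequency f(t) for every type t
types = len(freq)        # type count: number of distinct types

print(N, types)                      # 7 6
print(freq["the"], freq["the"] / N)  # f("the") = 2; relative frequency 2/7
```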
Frequency Example

Here is a comparison of frequency information between two sources: the British National Corpus (BNC) and the Sherlock Holmes story A Case of Identity by Sir Arthur Conan Doyle.

                    BNC           A Case of Identity
Token count N       100,000,000   7,006
Type count          636,397       1,621
f(“Holmes”)         890           46
f(“Sherlock”)       209           7
f(“Holmes”)/N       0.0000089     0.0066
f(“Sherlock”)/N     0.00000209    0.000999
Unigrams

We can now ask questions such as: what are the most frequent words in a corpus?
- Count absolute frequencies of all word types in the corpus.
- Tabulate them in an ordered list.

Result: list of unigram frequencies — frequencies of individual words.
Unigram example

BNC                    A Case of Identity
6,184,914   the        350   the
3,997,762   be         212   and
2,941,372   of         189   to
2,125,397   a          167   of
1,812,161   in         163   a
1,372,253   have       158   I
1,088,577   it         132   that
  917,292   to         117   it

The unigram rankings are different, but we can see similarities. For example, the definite article “the” is the most frequent word in both corpora; and prepositions like “of” and “to” appear in both lists.
n-grams

The notion of unigram generalises:
- Bigrams — pairs of adjacent words;
- Trigrams — triples of adjacent words;
- n-grams — n-tuples of adjacent words.

These larger clusters of words carry more linguistic significance than individual words; and, again, we can make use of these even before finding out anything about their semantic content.
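Extracting n-grams from a token stream is a simple sliding window, and counting them reuses the same machinery as unigram frequencies. A short sketch, with a made-up sentence for illustration:

```python
# Count n-grams (n-tuples of adjacent tokens) with a sliding window.
from collections import Counter

def ngrams(tokens: list, n: int) -> Counter:
    """Count all n-tuples of adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "the use of the money and the use of the time".split()
print(ngrams(tokens, 2).most_common(2))
# e.g. [(('the', 'use'), 2), (('use', 'of'), 2)]
```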
n-grams example

The most frequent n-grams in A Case of Identity, for n = 2, 3, 4.

bigrams         trigrams               4-grams
40  of the      5  there was no        2  very morning of the
23  in the      5  Mr. Hosmer Angel    2  use of the money
21  to the      4  to say that         2  the very morning of
21  that I      4  that it was         2  the use of the
20  at the      4  that it is          2  the King of Bohemia

Note that frequencies of even the most common n-grams naturally get smaller with increasing n. As more word combinations become possible, there is an increase in data sparseness.
Bigram and POS Example Concordance

Here is a concordance for all occurrences of bigrams in the Dickens corpus in which the second word is “tea” and the first is an adjective. This query uses the POS tagging of the corpus to search for adjectives.

[pos="J.*"][word="tea"]

[The kwic lines themselves have not survived transcription: the matched adjective + “tea” bigrams have been lost, leaving only corpus positions (87773, 281162, ...) and fragments of surrounding context.]
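A query like this can be mimicked in plain Python over a list of (word, POS) pairs. A sketch with made-up Penn-Treebank-style tags, where adjective tags begin with “J” (JJ, JJR, JJS):

```python
# Find bigrams where the first word is tagged as an adjective (J.*) and
# the second word is "tea"; the tagged tokens here are invented examples.
import re

tagged = [("a", "DT"), ("little", "JJ"), ("tea", "NN"),
          ("some", "DT"), ("dry", "JJ"), ("toast", "NN"),
          ("the", "DT"), ("tea", "NN")]

hits = [(w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if re.fullmatch(r"J.*", t1) and w2 == "tea"]
print(hits)  # [('little', 'tea')]
```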
Outline
1. Finding Things and Counting Them
2. Small Application
3. Large Application
4. Closing
Sample Linguistic Application: Collocations

A collocation is a sequence of words that occur close together ‘atypically often’ in language usage. For example:
- To “run amok”: the verb “run” can occur on its own, but “amok” does not.
- To say “strong tea” is much more natural English than “powerful tea” although the literal meanings are much the same.
- Phrasal verbs such as “settle up” or “make do”.
- “heartily sick”, “heated argument”, “commit a crime”, ...

Both Macmillan and Oxford University Press have specialist dictionaries that provide extensive lists of collocations specifically for those learning English. You can also buy collocation lists for linguistic research at http://www.collocates.info/.

The inverted commas around ‘atypically often’ are because we need statistical ideas to make this precise.
Identifying Collocations

We would like to automatically identify collocations in a large corpus. For example, collocations in the Dickens corpus involving the word “tea”.
- The bigram “strong tea” occurs in the corpus. This is a collocation.
- The bigram “powerful tea” does not, in fact, appear in the corpus.
- However, “more tea” and “little tea” do occur in the corpus. These are not collocations. These word sequences do not occur with any frequency above what would be suggested by their component words.

The challenge is: how do we detect when a bigram (or n-gram) is a collocation?
Looking at the Data

Here are the most common bigrams from the Dickens corpus where the first word is “strong” or “powerful”.

“strong” + ...          “powerful” + ...
and           31        effect        3
enough        16        sight         3
in            15        enough        3
man           14        mind          3
emphasis      11        for           3
desire        10        and           3
upon          10        with          3
interest       8        enchanter     2
a              8        displeasure   2
as             8        motives       2
inclination    7        impulse       2
tide           7        struggle      2
beer           7        grasp         2
Filtering Collocations

We observe the following from the bigram tables.
- Neither “strong tea” nor “powerful tea” is frequent enough to make it into the top 13.
- Some potential collocations for “strong”: “strong desire”, “strong inclination”, and “strong beer”.
- Some potential collocations for “powerful”: “powerful effect”, “powerful motives”, and “powerful struggle”.
- A possible problem: bigrams like “strong and”, “strong enough” and “powerful for” have high frequency, yet these do not seem like collocations.

To distinguish collocations from non-collocations, we need some way to filter out noise.
What We Need is More Maths

Problem: Words like “for” and “and” are very common anyway: they occur with “strong” by chance.

Solution: Use statistical tests to identify when the frequency of a bigram is atypically high given the frequencies of its constituent words.

            “beer”    ¬“beer”      Total
“strong”         7        618        625
¬“strong”      127  2,310,422  2,310,549
Total          134  2,311,040  2,311,174

In general, statistical tools offer powerful methods for the analysis of all types of data. In particular, they provide the principal approach to the quantitative (and qualitative) analysis of unstructured data.

We shall return to the problem of finding collocations later in the course, when we have some appropriate statistical tools.
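Although the proper treatment comes later in the course, one standard way to make ‘atypically high’ precise is pointwise mutual information (PMI): compare the observed bigram count with the count that the word frequencies alone would predict under independence. A sketch using the counts from the contingency table above; PMI is an illustrative choice here, not necessarily the test used later.

```python
# PMI for "strong beer" from the contingency-table counts above.
from math import log2

N = 2_311_174   # total tokens in the Dickens corpus
f_strong = 625  # f("strong")
f_beer = 134    # f("beer")
f_both = 7      # f("strong beer")

# expected count of the bigram if "strong" and "beer" were independent
expected = f_strong * f_beer / N   # about 0.036
pmi = log2(f_both / expected)      # about 7.6 bits above chance

print(f"{expected:.3f} {pmi:.2f}")  # 0.036 7.59
```

On this measure a bigram like “strong and”, despite its raw frequency of 31, scores far lower, because “and” is itself so common that a high co-occurrence count is expected by chance.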
Coursework: Written Assignment

The Inf1-DA assignment will go online by the end of the week. This runs alongside your usual tutorial exercises for two weeks; ask your tutor for help with any problems.

The assignment is based on past examination questions. Your tutor will give you marks and feedback on your work in the last tutorial of the semester, and I shall distribute a solution guide.

These marks will not be part of your final grade for Inf1-DA — this formative assessment is entirely for your feedback and learning. You are free to look things up, discuss with others, share advice, discuss on Piazza, and do whatever helps you learn. Please do.
Outline
1. Finding Things and Counting Them
2. Small Application
3. Large Application
4. Closing
Engineering Natural-Language Systems

Two Informatics system-building examples which use corpora extensively:

Natural Language Processing (NLP): Computer systems that accept or produce readable text. For example:
- Summarization: Take a text, or multiple texts, and automatically produce an abstract or summary.
- Machine Translation (MT): Take a text in a source language and turn it into a text in the target language. For example Google Translate or Microsoft Translator.

Speech Processing: Systems that accept or produce spoken language.

Building these draws on probability theory, information theory and machine learning to extract and use the language information in large text corpora.
Example: Machine Translation

The aim of machine translation is to automatically map sentences in one source language to corresponding sentences in a different target language, while preserving the meaning of the text.

Historically, there have been two major approaches:
- Rule-based Translation: Long history including Systran and Babel Fish (Alta Vista, then Yahoo, now disappeared).
- Statistical Translation: Much recent growth, leading to Google Translate and Microsoft Translator.

Both approaches make use of multilingual corpora.

“The Babel fish,” said The Hitchhiker’s Guide to the Galaxy quietly, “is small, yellow and leech-like, and probably the oddest thing in the Universe.”
Rule-Based Machine Translation

A typical rule-based machine translation (RBMT) scheme might include:
1. Automatically assign part-of-speech information to a source sentence.
2. Build up a syntax tree for the sentence using grammatical rules.
3. Map this parse tree in the source language into the target language, using a dictionary to translate individual words, and rules to find correct inflections and word ordering for the translated sentence.

Some systems use an interlingua between the source and target language.

In any real implementation each of these steps will be much refined; even so, the central point remains to have the system translate a sentence by identifying its structure and, to some extent, its meaning.

These systems use corpora to train algorithms that identify part-of-speech information and grammatical structures across different languages.
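To make step 3 concrete, here is a deliberately toy sketch of the lexical-transfer part alone: word-for-word dictionary lookup, with parsing, inflection and reordering rules all omitted. The dictionary entries are made up for illustration.

```python
# Toy lexical transfer for rule-based MT: per-word dictionary lookup.
# A real RBMT system parses the sentence first and applies inflection
# and reordering rules; none of that is modelled here.
DICT_EN_DE = {"the": "die", "capital": "Hauptstadt", "of": "von",
              "Scotland": "Schottland", "is": "ist",
              "Edinburgh": "Edinburgh"}

def transfer(sentence: str) -> str:
    # unknown words pass through untranslated
    return " ".join(DICT_EN_DE.get(w, w) for w in sentence.split())

print(transfer("the capital of Scotland is Edinburgh"))
# die Hauptstadt von Schottland ist Edinburgh
```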
Examples of Rule-Based Translation

From http://www.systranet.com/translate

The capital city of Scotland is Edinburgh

English → German:
Die Hauptstadt von Schottland ist Edinburgh

German → English:
The capital of Scotland is Edinburgh
Examples of Rule-Based Translation

From http://www.systranet.com/translate

Sales of processed food collapsed across Europe when the news broke.

English → French:
Les ventes de la nourriture traitée se sont effondrées à travers l’Europe quand les actualités se sont cassées.

French → English:
The sales of treated food crumbled through Europe when the news broke.
Examples of Rule-Based Translation

From http://www.systranet.com/translate and Robert Burns

My love is like a red, red rose
That’s newly sprung in June

English → Italian:
Il mio amore è come un rosso, rosa rossa
Quello recentemente è balzato a giugno

Italian → English:
My love is like red, pink a red one
That recently is jumped to june
Issues with Rule-Based Translation

A major difficulty with rule-based translation is gathering enough rules to cover the very many special cases and nuances in natural language. As a result, rule-based translations often have a very unnatural feel. This issue is a serious one, and rule-based translation systems have not yet overcome the challenge.

However, even though the translations seem a little rough to read, they may well be enough to successfully communicate meaning.

(The problem with the example translation on the last slide is of a different nature. The source text is poetry, which routinely takes huge liberties with grammar and use of vocabulary. It’s not a surprise that this puts it far outside the scope of rule-based translation.)
Statistical Machine Translation

This uses a corpus of parallel texts, where the same text is given in both source and target languages. Translation might go like this:
1. For each word and phrase from the source sentence, find all occurrences of that word or phrase in the corpus.
2. Match these words and phrases with the parallel corpus text, and use statistical methods to select preferred translations.
3. Do some smoothing to find appropriate sizes for phrases and to glue translated phrases together to produce the translated sentence.

Again, real implementations will refine these stages: for example, both source and target language corpora can be used to train neural networks that do the actual translation.

To be effective, statistical translation requires a large and representative corpus of parallel texts. This corpus does not need to be heavily annotated.
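The statistical idea can be caricatured in a few lines: estimate translation candidates by counting co-occurrences in a parallel corpus, then pick the most frequent counterpart for each source word. This toy sketch uses a tiny invented corpus and a deliberately crude alignment; real systems align and translate phrases, weight counts probabilistically, and add smoothing and a language model.

```python
# Toy statistical translation: co-occurrence counts over a parallel corpus.
from collections import Counter, defaultdict

parallel = [("the capital", "die Hauptstadt"),
            ("the city", "die Stadt"),
            ("a capital", "eine Hauptstadt")]

counts = defaultdict(Counter)
for src, tgt in parallel:
    for s in src.split():        # crude alignment: every source word is
        for t in tgt.split():    # paired with every co-occurring target word
            counts[s][t] += 1

def best(word: str) -> str:
    # most frequent counterpart; unknown words pass through
    return counts[word].most_common(1)[0][0] if counts[word] else word

print(" ".join(best(w) for w in "the capital".split()))  # die Hauptstadt
```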
Examples of Statistical Machine Translation

From http://translate.google.com

The capital city of Scotland is Edinburgh

English → German:
Die Hauptstadt von Schottland ist Edinburgh

German → English:
The capital of Scotland is Edinburgh
Examples of Statistical Machine Translation

From http://translate.google.com

Sales of processed food collapsed across Europe when the news broke.

English → French:
Les ventes d’aliments transformés se sont effondrées en Europe lorsque la nouvelle a été annoncée.

French → English:
Processed food sales collapsed in Europe when the news was announced.
Examples of Statistical Machine Translation

From http://translate.google.com and Robert Burns

My love is like a red, red rose
That’s newly sprung in June

English → Italian:
Il mio amore è come un rosso, rosa rossa
Questo è appena spuntato a giugno

Italian → English:
My love is like a red, red rose
That just popped up in June
Features of Statistical Machine Translation

Statistical machine translation has challenges: it requires a very large corpus of parallel texts, and is computationally expensive to carry out.

In recent years, these problems have diminished, at least for widely-used languages: large corpora have become available, and there have been improvements to algorithms and hardware.

Given a large enough corpus, statistical translation can produce more natural output than rule-based translation. Because it is not tied to grammar, statistical translation may work better with less rigid uses of language, such as poetry.
Features of Statistical Machine Translation

At the moment, statistical translation is dominant: machine learning over large corpora is used to train neural networks that perform the actual translation. However, it has its limitations.
- If statistical translation is applied to a sentence that uses uncommon phrases not found in the corpus, it can produce nonsense, where rule-based translation may survive.
- Large parallel corpora have often been compiled for reasons of political union: EU, UN, Canada. Quality can drop off sharply once we step outside the languages covered by these very large historical corpora.
- Some traditional generators of human-translated parallel corpora are now looking to save money by using machine translation ...

The future of machine translation looks interesting.
Outline
1. Finding Things and Counting Them
2. Small Application
3. Large Application
4. Closing
Relevant Courses for Future Years

Year 2
- Inf2A: Processing Formal and Natural Languages

Year 3
- Foundations of Natural Language Processing (FNLP)
- Introductory Applied Machine Learning (IAML)

Year 4/5
- Natural Language Understanding, Generation and Machine Translation (NLU+)
- Topics in Natural Language Processing (TNLP)
Homework

Read This
Schuster, Johnson and Thorat. Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System. Google Research blog, November 2016. https://is.gd/zeroshot

Do This
Try out the Google Books Ngram Viewer at https://books.google.com/ngrams. Compare the relative frequencies over time of the words “computer”, “software” and “hardware”; and also the city names “Edinburgh”, “London”, “Paris” and “New York”. To find out about the more complex queries available, take a look at https://books.google.com/ngrams/info
Automatic Topic Identification

David Mimno. 1000 topics automatically extracted from 20 years of the New York Times. October 2012. https://is.gd/topicsnyt

Ben Schmidt. Typical TV episodes: visualizing topics in screen time. Sapping Attention blog post, December 2014. https://is.gd/tvtopics