Lecture 9: Corpus Linguistics
Ling 1330/2330 Intro to Computational Linguistics
Na-Rae Han, 9/28/2021
Objectives
- Corpus linguistics
- Corpus types
- Key concepts
- Making assumptions: pitfalls
- Linguistic analysis and interpretation: how to draw conclusions
Key concepts in corpus linguistics
- Type, token
- Type-token ratio (TTR)
- Frequency
- Zipf's law
- Concordance
- Collocation
- n-grams ("chunks")
- Sparseness problem
Counting words: token, type, TTR
- Word token: each word occurring in a text/corpus
  - Corpus sizes are measured as the total number of words (= tokens).
- Word type: a unique word
  - Q: Are 'sleep' and 'sleeps' different types or the same type?
  - A: It depends. Sometimes types are meant as lemma types. Sometimes inflected and derived words count as different types. Sometimes even capitalized vs. lowercase words count as two types.
  - Pay attention to how types are handled in your resource!
- Hapax legomena ("hapaxes"): words that occur only once.
  - In natural language corpora, a huge portion of types will typically be hapaxes.
- Type-Token Ratio (TTR) ➔ next slide
Type-token ratio
- Type-Token Ratio (TTR): the number of types divided by the number of tokens
- Often used as an indicator of lexical density / vocabulary diversity (with a caveat!)
- 'Rose is a rose is a rose is a rose': 3 types / 10 tokens = 0.3 TTR
- 'A rose is a woody perennial flowering plant of the genus Rosa': 11 types / 12 tokens ≈ 0.917 TTR
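In Python, TTR is a one-liner once you settle on tokenization and type handling. A minimal sketch, assuming whitespace tokenization and case folding (both are choices, as the previous slide warns):

def ttr(tokens):
    """Type-token ratio: unique word types divided by total tokens."""
    types = set(t.lower() for t in tokens)   # fold case: 'Rose' == 'rose'
    return len(types) / len(tokens)

print(ttr('Rose is a rose is a rose is a rose'.split()))   # 3/10 = 0.3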
Type-token ratio: the caveat
- Alice in Wonderland: 2,585 types / 35,656 tokens ≈ 0.073 TTR
- Moby Dick: 17,172 types / 265,010 tokens ≈ 0.065 TTR
- Does this mean Moby Dick has a less diverse vocabulary? Not necessarily -- the text sizes are different.
- Type count does not grow linearly with text size. As your text grows larger, fewer and fewer new word types will be encountered.
- TTR comparison is only meaningful for comparably sized texts.
[Figure: type count vs. corpus size -- the growth curve flattens as the corpus grows]
Word frequency
- The words in a corpus can be arranged in order of their frequency in that corpus.
- Comparing frequency lists across corpora can highlight differences in register and subject field.
- Frequency distributions in natural language texts follow common patterns:
  - Word frequencies are not distributed evenly.
  - A small number of words occur with very high frequency.
  - Long tails: a large number of words occur with very low frequency (tons of hapaxes!)
Word Freq Example: Tom Sawyer
- Word tokens: 71,370
- Word types: 8,018
- TTR: 0.11

Top word frequencies:
  the   3332      was   1161      I     783
  and   2972      it    1027      his   772
  a     1775      in     906      you   686
  to    1725      that   877      Tom   679
  of    1440      he     877

Frequencies of frequencies:
  frequency    # of word types with the frequency
  1            3993
  2            1292
  3             664
  4             410
  5             243
  …              …
  51-100         99
  > 100         102

Over 90% of word types occur 10 times or less.
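The "frequencies of frequencies" table is easy to reproduce with NLTK. A minimal sketch, assuming tokens holds the word-token list for the text:

from nltk import FreqDist

fd = FreqDist(tokens)          # word -> frequency
fof = FreqDist(fd.values())    # frequency -> # of word types with that frequency
print(fof[1])                  # count of hapaxes
print(len(fd.hapaxes()))       # same number, via the built-in method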
Zipf's Law
- Published in Human Behavior and the Principle of Least Effort (1949).
- Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
  - The most common word has twice the count of the 2nd most common word.
  - The 50th most common word has 3 times the count of the 150th most common word.
- Holds true in natural language corpora!

Tom Sawyer:
  word      frequency   rank
  one         172         50
  two         104        100
  turned       51        200
  name         21        400
  friends      10        800
(Note that rank × frequency stays roughly constant, as the law predicts.)
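One quick way to eyeball Zipf's law is to check that rank × frequency stays roughly constant down the frequency table. A sketch, again assuming a tokens list for the text:

from nltk import FreqDist

fd = FreqDist(tokens)
for rank, (word, freq) in enumerate(fd.most_common(800), start=1):
    if rank in (50, 100, 200, 400, 800):
        print(rank, word, freq, rank * freq)   # products cluster in a narrow band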
Concordances
- Concordance lines show instances of query words/phrases found in a corpus; produced by concordance programs.
- KWIC: "Key Word in Context"
- Lextutor concordancer: http://lextutor.ca/conc/eng/
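NLTK can also produce KWIC lines. A minimal sketch using the Gutenberg corpus bundled with NLTK (requires nltk.download('gutenberg'); the query word is an arbitrary choice):

import nltk

alice = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
alice.concordance('curious', width=79, lines=10)   # prints keyword-in-context lines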
Collocation
- Collocation: the statistical tendency of words to co-occur.
- Concordance lines are mostly for humans to observe; collocation data is processed and compiled by computerized statistical operations.
  - e.g., collocates of shed: light, tear/s, garden, jobs, blood, cents, image, pounds, staff, skin, clothes
- Strong collocations indicate:
  - finer-grained semantic compatibility, polysemy
  - association between a lexical item and its frequent grammatical environment
- Lexical collocates of head: SHAKE, injuries, SHOOT vs. state, office, former, department (reflecting two senses)
- Grammatical collocates of head: of, over, on, back, off
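Collocation extraction is exactly the kind of statistical operation NLTK packages up. A sketch using pointwise mutual information (PMI), assuming tokens is a word list; the frequency cutoff of 5 is an arbitrary choice:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)              # ignore pairs seen fewer than 5 times
print(finder.nbest(measures.pmi, 20))    # top 20 bigrams by PMI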
n-grams, chunks, n-gram frequency
- Units of word sequences are called:
  - n-grams in computational linguistics
  - chunks in corpus linguistics circles
- Certain chunks (a couple of, at the moment, all the time) are as frequent as ordinary, everyday single words such as possible, alone, fun, expensive.
- n-grams are of interest to computational linguists because they are the backbone of statistical language modeling.
- Chunks are of interest in:
  - corpus linguistics, because they highlight stylistic and register variation
  - applied linguistics, because they are markers of successful second-language acquisition
Top 3-word chunks: n-gram frequencies, general vs. written register

  rank   North American English   Written English
  1      I don’t know             one of the
  2      a lot of                 out of the
  3      you know what            it was a
  4      what do you              there was a
  5      you have to              the end of
  6      I don’t think            a lot of
  7      I was like               there was no
  8      you want to              as well as
  9      do you have              end of the
  10     I have to                to be a
  11     I want to                it would be
  12     I mean I                 in front of
  13     a little bit             it was the
  14     you know I               some of the
  15     one of the               I don’t know
  16     and I was                on to the
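Chunk tables like the one above fall straight out of an n-gram frequency count. A sketch with nltk.ngrams, assuming tokens is a word list from some corpus:

from nltk import FreqDist, ngrams

tri_fd = FreqDist(ngrams(tokens, 3))        # count all 3-word sequences
for chunk, count in tri_fd.most_common(10):
    print(' '.join(chunk), count)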
Data sparseness problem
- In natural language data, frequent phenomena are very frequent, while the majority of data points remain relatively rare. (Zipf's law)
- As you consider larger linguistic contexts, the data sparseness/sparsity problem compounds.
- For 1-grams, Norvig's 333K 1-gram data was plenty.
- Assuming 100K English word types, for bigrams we need 100K**2 = 10,000,000,000 possible data points.
  - Norvig's bigram list was 250K in size: nowhere near adequate. Even COCA's 1 million does not provide good enough coverage. (cf. Google's original was 315 million.)
- The numbers grow exponentially as we consider 3-grams, 4-grams, and 5-grams!
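The arithmetic behind the slide, as a quick sanity check (the 100K vocabulary size is the slide's assumption):

vocab = 100_000             # assumed number of English word types
for n in (1, 2, 3):
    print(n, vocab ** n)    # possible n-grams: 1e5, 1e10, 1e15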
2-grams: Norvig vs. COCA

count_2w.txt (total # of entries: ¼ million*):
  you get       25183570
  you getting     430987
  you give       3512233
  you go         8889243
  you going      2100506
  you gone        210111
  you gonna       416217
  you good        441878
  you got        4699128
  you gotta       668275
  you graduate    117698
  you grant       103633
  you great       450637
  you grep        120367
  you grew        102321
  you grow        398329
  you guess       186565
  you guessed     295086
  you guys       5968988
  you had        7305583
  you hand        120379

w2_.txt (total # of entries: 1 million):
  39509  you get
     30  you gets
     31  you gettin
    861  you getting
    263  you girls
     24  you git
   5690  you give
    138  you given
    169  you giving
    182  you glad
     46  you glance
  23594  you go
     70  you god
     54  you goddamn
    115  you goin
   9911  you going
   1530  you gon
    262  you gone
    444  you good
     25  you google
  19843  you got

*NOT Google's fault! Norvig only took the top 0.1% of 315 million.
Usefulness?
Corpus-based linguistic research
- Create a corpus of Tweets, analyze linguistic variation (based on geography, demographics, etc.): http://languagelog.ldc.upenn.edu/nll/?p=3536
- Process the inaugural speeches of all US presidents, analyze trends in sentence and word length: http://languagelog.ldc.upenn.edu/nll/?p=3534
- A corpus of rap music lyrics: word/bigram/trigram frequencies? http://poly-graph.co/vocabulary.html
- Corpora of female and male authors. Any stylistic differences?
- Corpora of Japanese/Bulgarian English learners. How do their Englishes compare?
HW 3: Two EFL Corpora

Bulgarian students:
  "It is time, that our society is dominated by industrialization. The prosperity of a country is based on its enormous industrial corporations that are gradually replacing men with machines. Science is highly developed and controls the economy. From the beginning of school life students are expected to master a huge amount of scientific data. Technology is part of our everyday life. Children nowadays prefer to play with computers rather than with our parents' wooden toys. But I think that in our modern world which worships science and technology there is still a place for dreams and imagination. There has always been a place for them in man's life. Even in the darkness of the …"

Japanese students:
  "I agree greatly this topic mainly because I think that English becomes an official language in the not too distant. Now, many people can speak English or study it all over the world, and so more people will be able to speak English. Before the Japanese fall behind other people, we should be able to speak English, therefore, we must study English not only junior high school students or over but also pupils. Japanese education system is changing such a program. In this way, Japan tries to internationalize rapidly. However, I think this way won't suffice for becoming international humans. To becoming international humans, we should study English not only school but also daily life. If we can do it, we are able to master English conversation. It is important for us to master English honorific words. …"
Assessing writing quality
Measurable indicators of writing quality:
1. Syntactic complexity
   - Long, complex sentences vs. short, simple sentences
   - Average sentence length, types of syntactic clauses used
2. Lexical diversity
   - Diverse vocabulary vs. a small set of words used repeatedly
   - Type-token ratio (with the caveat!) or other measures
3. Vocabulary level
   - Common, everyday words vs. sophisticated & technical words
   - Average word length (common words tend to be shorter)
   - % of word tokens in the top 1K, 2K, 3K most common English words (Google Web 1T n-grams!)
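A rough sketch of all three indicators with NLTK, assuming text holds one essay as a string (and that the punkt tokenizer models are installed); these are crude proxies, not validated measures:

import nltk

sents = nltk.sent_tokenize(text)
words = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]

avg_sent_len = len(words) / len(sents)                   # 1. syntactic complexity (proxy)
ttr = len(set(words)) / len(words)                       # 2. lexical diversity (caveat applies!)
avg_word_len = sum(len(w) for w in words) / len(words)   # 3. vocabulary level (proxy)
print(avg_sent_len, ttr, avg_word_len)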
Corpus analysis: beyond numbers
- We have become adept at processing corpora to produce these metrics:
  - How big is a corpus? How many unique words? → # of tokens, # of types
  - Which words and n-grams are frequent → unigram, bigram, n-gram frequency counts
- But: what you get is a whole lot of NUMBERS. INTERPRETATION of these numbers is what really matters.
Corpus analysis: pitfalls
1. It's too easy to get hyper-focused on the coding part and lose sight of the actual linguistic data behind it all.
   - Make sure to maintain your linguistic motivation.
2. As your corpus gets large, you run the risk of operating blind: it is difficult to keep tabs on what linguistic data you are handling.
   - Poke at your data in the shell. Take time to understand the data.
   - Make sure your data object is correct. Do NOT just assume it is.
3. Attaching a linguistically valid interpretation to numbers is not at all trivial.
   - Be careful when drawing conclusions. Make sure to explore all factors that might be affecting the numbers.
Homework 3
- This homework assignment is equally about Python coding AND corpus analysis.
- That means calculating the correct numbers is no longer enough. You should take care to understand your corpus data and offer up a well-considered and valid analysis of the data points.
Making assumptions
- About datasets we downloaded from the web
- About text processing output we just built
- … these assumptions just might bite you.
1-grams/word list: Norvig vs. ENABLE

count_1w.txt (total # of entries: 333K):
  the    23135851162
  of     13151942776
  and    12997637966
  to     12136980858
  a       9081174698
  in      8469404971
  for     5933321709
  is      4705743816
  on      3750423199
  that    3400031103
  by      3350048871
  this    3228469771
  with    3183110675
  i       3086225277
  ...
  goofel    12711
  gooek     12711
  gooddg    12711
  gooblle   12711
  gollgo    12711
  golgw     12711

enable1.txt (total # of entries: 173K):
  aa, aah, aahed, aahing, aahs, aal, aalii, aaliis, aals, aardvark, aardvarks, aardwolf, aardwolves, aargh, aarrgh, ... zymotic, zymurgies, zymurgy, zyzzyva, zyzzyvas

Assumption: the Norvig/Google word list is a superset of the ENABLE list. WRONG!
1-grams/word list: Norvig vs. ENABLE (continued; same listings as above)
- Only 78,835 words are shared between the two lists (45.6% of ENABLE).
- ENABLE: no single-char words ("I", "a"), lots of arcane words.
- Norvig: TONS of proper nouns, many non-words.
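Checking the superset assumption takes a few lines of set arithmetic. A sketch, assuming both files sit in the working directory and that count_1w.txt is tab-separated (word, count):

norvig = {line.split('\t')[0] for line in open('count_1w.txt', encoding='utf-8')}
enable = {line.strip() for line in open('enable1.txt', encoding='utf-8')}

shared = norvig & enable
print(len(shared), len(shared) / len(enable))   # ~78,835 shared; ~45.6% of ENABLE
print(sorted(enable - norvig)[:20])             # ENABLE words missing from Norvig's list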
Know your data
- When using publicly available resources, you must evaluate and understand the data.
  - Origin? Purpose? Domain & genre? Size? Traits?
  - Merits and limitations? Fit with your project?
NLTK's corpus reader

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = "C:/Users/narae/Documents/ling1330/MLK"
>>> mlkcor = PlaintextCorpusReader(corpus_root, '.*txt')
>>> type(mlkcor)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> mlkcor.fileids()
['1963-I Have a Dream.txt', '1964-Nobel Peace Prize Acceptance Speech.txt', '1967-Beyond Vietnam.txt', "1968-I've been to the Mountain Top.txt"]
>>> len(mlkcor.fileids())
4
>>> mlkcor.fileids()[0]
'1963-I Have a Dream.txt'
>>> mlkcor.words()[:50]
['I', 'Have', 'A', 'Dream', 'by', 'Dr', '.', 'Martin', 'Luther', 'King', 'Jr', '.', 'Delivered', 'on', 'the', 'steps', 'at', 'the', 'Lincoln', 'Memorial', 'in', 'Washington', 'D', '.', 'C', '.', 'on', 'August', '28', ',', '1963', 'I', 'am', 'happy', 'to', 'join', 'with', 'you', 'today', 'in', 'what', 'will', 'go', 'down', 'in', 'history', 'as', 'the', 'greatest', 'demonstration']

Assumption: .words() tokens will be tokenized the same way as nltk.word_tokenize(). WRONG!
NLTK's corpus reader (continued)

>>> mlkcor.words()[-50:]
['that', 'we', ',', 'as', 'a', 'people', 'will', 'get', 'to', 'the', 'promised', 'land', '.', 'And', 'I', "'", 'm', 'happy', ',', 'tonight', '.', 'I', "'", 'm', 'not', 'worried', 'about', 'anything', '.', 'I', "'", 'm', 'not', 'fearing', 'any', 'man', '.', 'Mine', 'eyes', 'have', 'seen', 'the', 'glory', 'of', 'the', 'coming', 'of', 'the', 'Lord', '.']
>>> mlkcor.words()[-100:-50]
['God', "'", 's', 'will', '.', 'And', 'He', "'", 's', 'allowed', 'me', 'to', 'go', 'up', 'to', 'the', 'mountain', '.', 'And', 'I', "'", 've', 'looked', 'over', '.', 'And', 'I', "'", 've', 'seen', 'the', 'promised', 'land', '.', 'I', 'may', 'not', 'get', 'there', 'with', 'you', '.', 'But', 'I', 'want', 'you', 'to', 'know', 'tonight', ',']

By default, PlaintextCorpusReader uses a regular-expression-based word tokenizer, which splits all symbols off of alphabetic words.
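The mismatch is easy to demonstrate side by side; WordPunctTokenizer is the regular-expression tokenizer PlaintextCorpusReader uses by default:

from nltk import word_tokenize
from nltk.tokenize import WordPunctTokenizer

s = "I'm not worried about anything."
print(WordPunctTokenizer().tokenize(s))   # ['I', "'", 'm', 'not', 'worried', 'about', 'anything', '.']
print(word_tokenize(s))                   # ['I', "'m", 'not', 'worried', 'about', 'anything', '.']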
HW#2: Basic corpus stats

                     The Bible      Jane Austen novels
  Word token count:  946,812        431,079
  Word type count:   17,188         11,641
  TTR:               0.018          0.027

Assumption: these are all legitimate word types. WRONG!
HW#2: Basic corpus stats (continued; same numbers as above)

>>> b_type_nonalnum = [t for t in b_tokfd if not t.isalnum()]
>>> len(b_type_nonalnum)
4628
>>> b_type_nonalnum[:30]
['[', ']', ':', '1:1', '.', '1:2', ',', ';', '1:3', '1:4', '1:5', '1:6', '1:7', '1:8', '1:9', '1:10', '1:11', '1:12', '1:13', '1:14', '1:15', '1:16', '1:17', '1:18', '1:19', '1:20', '1:21', '1:22', '1:23', '1:24']

Over ¼ (!!!) of the Bible's word types are verse numberings, vastly inflating the type count & TTR.
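One way to see the damage is to recompute the type count with the verse numberings filtered out. A sketch, assuming b_tokfd is the Bible token FreqDist from HW#2 (note that isalpha() also drops hyphenated and numeric tokens, so this is only a rough correction):

b_types_clean = [t for t in b_tokfd if t.isalpha()]   # drops '1:1', '[', etc.
print(len(b_types_clean))                # at most 17,188 - 4,628 = 12,560 types
print(len(b_types_clean) / b_tokfd.N())  # TTR with a cleaner type count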
Always validate, verify
- About text processing output we just built: make sure to probe and verify. Watch out for oddities.
- Pre-built text processing functions are NOT perfect!
  - Sentence tokenization, word tokenization ➔ might include errors.
  - They also might operate under different rules than you expect.
- Especially important when attaching linguistic interpretation to your numbers: hidden factors might be affecting the numbers.
Wrap-up
- Attendance & participation policy changed:
  - -7 attendance points for each missed class… BUT!
  - CHANGE: joining the Zoom streaming (real-time) or viewing the recorded Panopto video (post-class) will count as attendance.
  - Recommendation: watch the Panopto video post-class along with the posted PPT lecture slides.
  - Participation points (Forum posting + office hour drop-ins) can also be used to make up.
- Homework 3 is due on THU. It is larger, at 60 points.
- Next class: HW3 review; classifying documents.