Lecture 9: Corpus Linguistics
Ling 1330/2330 Intro to Computational Linguistics
Na-Rae Han, 9/28/2021

Objectives
 Corpus linguistics
 Corpus types
 Key concepts
 Making assumptions: pitfalls
 Linguistic analysis and interpretation: how to draw conclusions
Key concepts in corpus linguistics
Type, token
Type-token ratio (TTR)
Frequency
Zipf's law
Concordance
Collocation
n-grams ("chunks")
Sparseness problem
Counting words: token, type, TTR
Word token: each word occurring in a text/corpus
Corpus sizes are measured as the total number of words (= tokens)
Word type: unique words
Q: Are 'sleep' and 'sleeps' different types or the same type?
A: Depends.
Sometimes, types are meant as lemma types.
Sometimes, inflected and derived words count as different types.
Sometimes, even capitalized vs. lowercase words count as 2 types.
(Pay attention to how types are handled in your resource!)
Hapax legomena ("hapaxes")
 Words that occur only once.
 In natural language corpora, a huge portion of word types will
 typically be hapaxes.
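For illustration, a quick REPL check (with a made-up token list, not from the slides) of how the normalization choice changes the type count:

>>> tokens = ['Sleep', 'sleeps', 'sleep', 'Sleep']
>>> len(set(tokens))                       # case-sensitive surface types
3
>>> len(set(t.lower() for t in tokens))    # case-folded types
2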
Type-Token Ratio (TTR)
➔ Next slide
Type-token ratio
Type-Token Ratio (TTR)
The number of types divided by the number of tokens
Often used as an indicator of lexical density / vocabulary
diversity. (with caveat!)
'Rose is a rose is a rose is a rose'
3/10 = 0.3 TTR
'A rose is a woody perennial flowering plant of the genus Rosa'
11/12 ≈ 0.917 TTR
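A minimal sketch of the computation, assuming crude whitespace tokenization and case-folding:

>>> def ttr(text):
...     tokens = text.lower().split()
...     return len(set(tokens)) / len(tokens)
...
>>> ttr('Rose is a rose is a rose is a rose')
0.3
>>> ttr('A rose is a woody perennial flowering plant of the genus Rosa')
0.9166666666666666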
Type-token ratio: the caveat
Alice in Wonderland
2,585 types / 35,656 tokens = 0.097 TTR
Moby Dick
17,172 types / 265,010 tokens = 0.080 TTR
Does this mean Moby Dick has a less diverse vocabulary?
 Not necessarily -- the text sizes are different.
 Type count does not grow linearly with text size. As your text
 grows larger, fewer and fewer new word types will be encountered.
 TTR comparison is only meaningful for comparably sized texts.
[Plot: type count vs. corpus size -- the curve flattens out as the corpus grows]
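One simple workaround, sketched below under the assumption that alice_tokens and moby_tokens are pre-built token lists (hypothetical names), is to compare TTR over equal-sized slices of each text:

>>> n = min(len(alice_tokens), len(moby_tokens))   # common sample size
>>> len(set(alice_tokens[:n])) / n                 # TTR over the first n tokens
>>> len(set(moby_tokens[:n])) / n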
Word frequency
The words in a corpus can be arranged in order of their
frequency in that corpus
Comparing frequency lists across corpora can highlight
differences in register and subject field
Frequency distributions in natural language texts follow common
patterns:
 Word frequencies are not distributed evenly.
 A small number of words occur with very high frequency.
 Long tail: a large number of words occur with low frequency
 (tons of hapaxes!).
Example: Tom Sawyer
 Word tokens: 71,370
 Word types: 8,018
 TTR: 0.11

Top word frequencies:
 Word   Freq
 the    3332
 and    2972
 a      1775
 to     1725
 of     1440
 was    1161
 it     1027
 in      906
 that    877
 he      877
 I       783
 his     772
 you     686
 Tom     679

Frequencies of frequencies:
 Word frequency   # of word types with the frequency
 1                3993
 2                1292
 3                 664
 4                 410
 5                 243
 ...               ...
 51-100             99
 > 100             102

 Over 90% of word types occur 10 times or less.
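Figures like these are straightforward to reproduce with NLTK's FreqDist. A sketch, where tom_tokens stands in for the novel's token list and the outputs echo the table above:

>>> import nltk
>>> fd = nltk.FreqDist(tom_tokens)
>>> fd.most_common(3)                      # top word frequencies
[('the', 3332), ('and', 2972), ('a', 1775)]
>>> len(fd.hapaxes())                      # word types with frequency 1
3993
>>> sum(1 for w in fd if fd[w] <= 10) / len(fd)    # over 0.9, per the slide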
Zipf's Law
Published by George Kingsley Zipf in Human Behavior and the
Principle of Least Effort (1949)
Given some corpus of natural language utterances, the
frequency of any word is inversely proportional to its rank in
the frequency table.
 The most common word occurs about twice as often as the 2nd
 most common word.
 The 50th most common word occurs about 3 times as often as
 the 150th most common word.
Holds true in natural language corpora!
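A quick sanity check against the Tom Sawyer figures in the table below: if frequency is proportional to 1/rank, then rank * frequency should stay roughly constant.

>>> for word, freq, rank in [('one', 172, 50), ('two', 104, 100),
...         ('turned', 51, 200), ('name', 21, 400), ('friends', 10, 800)]:
...     print(word, rank * freq)
...
one 8600
two 10400
turned 10200
name 8400
friends 8000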
Tom Sawyer:
 Word      Frequency   Rank
 one             172     50
 two             104    100
 turned           51    200
 name             21    400
 friends          10    800

Concordances
 Concordance lines: show instances of query words/phrases found
 in a corpus; produced by concordance programs
 KWIC: "Key Word in Context"
 Lextutor concordancer: http://lextutor.ca/conc/eng/
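NLTK can also produce concordance lines. A minimal sketch, assuming the nltk 'book' data has been downloaded; any token list wrapped in nltk.Text() gets the same .concordance() method:

>>> from nltk.book import text1                        # Moby Dick sample text
>>> text1.concordance('monstrous', width=60, lines=3)  # KWIC-style display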
Collocation
Collocation: the statistical tendency of words to co-occur
 Concordance lines are mostly for humans to observe; collocation
 data is processed and compiled by computerized statistical
 operations.
ex. collocates of shed:
light, tear/s, garden, jobs, blood, cents, image, pounds, staff, skin, clothes
Strong collocations indicate:
finer-grained semantic compatibility, polysemy
association between a lexical item and its frequent grammatical
environment
lexical collocates of head:
SHAKE, injuries, SHOOT
state, office, former, department
grammatical collocates of head:
of, over, on, back, off
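A sketch of how such collocates can be extracted with NLTK; tokens is a hypothetical word list, and the frequency filter and PMI measure are common choices rather than the method behind the lists above:

>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
>>> measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(tokens)
>>> finder.apply_freq_filter(5)         # ignore pairs seen fewer than 5 times
>>> finder.nbest(measures.pmi, 10)      # 10 strongest collocations by PMI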
n-grams, chunks, n-gram frequency
Units of word sequences are called
n-grams in computational linguistics
chunks in corpus linguistics circles
Certain chunks (a couple of, at the moment, all the time) are
as frequent as ordinary, everyday single words such as
possible, alone, fun, expensive.
n-grams are of interest to computational linguists because
they are the backbone of statistical language modeling (a quick
sketch follows below)
Chunks are of interest in:
corpus linguistics because they highlight stylistic and register
variation
applied linguistics because they are markers of successful
second-language acquisition
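Extracting and counting n-grams in NLTK is a one-liner; a sketch, with tokens standing in for a corpus token list:

>>> import nltk
>>> list(nltk.ngrams(['a', 'couple', 'of', 'days', 'ago'], 3))
[('a', 'couple', 'of'), ('couple', 'of', 'days'), ('of', 'days', 'ago')]
>>> nltk.FreqDist(nltk.ngrams(tokens, 3)).most_common(5)   # top 3-word chunks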
n-gram frequencies, general vs. written register:

Top 3-word chunks, North American English
 1 I don't know
 2 a lot of
 3 you know what
 4 what do you
 5 you have to
 6 I don't think
 7 I was like
 8 you want to
 9 do you have
 10 I have to
 11 I want to
 12 I mean I
 13 a little bit
 14 you know I
 15 one of the
 16 and I was

Top 3-word chunks, Written English
 1 one of the
 2 out of the
 3 it was a
 4 there was a
 5 the end of
 6 a lot of
 7 there was no
 8 as well as
 9 end of the
 10 to be a
 11 it would be
 12 in front of
 13 it was the
 14 some of the
 15 I don't know
 16 on to the
Data sparseness problem
In natural language data, frequent phenomena are very
frequent, while the majority of data points remain
relatively rare (cf. Zipf's law).
As you consider larger linguistic contexts, the data
sparseness/sparsity problem compounds.
For 1-grams, Norvig's 333K 1-gram data was plenty.
Assuming 100K English word types, for bigrams we need
100K**2 = 10,000,000,000 data points.
Norvig's bigram list was 250K in size: nowhere near adequate.
Even COCA's 1 million does not provide good enough coverage.
(cf. Google's original was 315 mil.)
Numbers grow exponentially as we consider 3-grams, 4- and 5-
grams!
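The arithmetic behind these figures, spelled out:

>>> V = 100000           # assumed number of English word types
>>> V ** 2               # possible bigrams
10000000000
>>> 250000 / V ** 2      # share of that space a 250K bigram list could cover
2.5e-05
>>> V ** 3               # possible trigrams: a quadrillion
1000000000000000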
2-grams: Norvig vs. COCA

count_2w.txt (Norvig; total # of entries: 1/4 million*)
 you get       25183570
 you getting     430987
 you give       3512233
 you go         8889243
 you going      2100506
 you gone        210111
 you gonna       416217
 you good        441878
 you got        4699128
 you gotta       668275
 you graduate    117698
 you grant       103633
 you great       450637
 you grep        120367
 you grew        102321
 you grow        398329
 you guess       186565
 you guessed     295086
 you guys       5968988
 you had        7305583
 you hand        120379

 *NOT Google's fault! Norvig only took the top 0.1% of 315 million entries.

w2_.txt (COCA; total # of entries: 1 million)
 39509  you get
    30  you gets
    31  you gettin
   861  you getting
   263  you girls
    24  you git
  5690  you give
   138  you given
   169  you giving
   182  you glad
    46  you glance
 23594  you go
    70  you god
    54  you goddamn
   115  you goin
  9911  you going
  1530  you gon
   262  you gone
   444  you good
    25  you google
 19843  you got

 Usefulness?

Corpus-based linguistic research
 Create a corpus of Tweets, analyze linguistic variation (based on
 geography, demographics, etc.)
  http://languagelog.ldc.upenn.edu/nll/?p=3536
 Process the inaugural speeches of all US presidents, analyze trends
 in sentence and word length
  http://languagelog.ldc.upenn.edu/nll/?p=3534
 A corpus of rap music lyrics: word/bigram/trigram frequencies?
  http://poly-graph.co/vocabulary.html
 Corpora of female and male authors. Any stylistic differences?
 Corpora of Japanese/Bulgarian English learners. How do their
 Englishes compare?
HW 3: Two EFL Corpora
Bulgarian Students:
 It is time, that our society is dominated by industrialization. The
 prosperity of a country is based on its enormous industrial
 corporations that are gradually replacing men with machines.
 Science is highly developed and controls the economy. From the
 beginning of school life students are expected to master a huge
 amount of scientific data. Technology is part of our everyday life.
 Children nowadays prefer to play with computers rather than with
 our parents' wooden toys. But I think that in our modern world
 which worships science and technology there is still a place for
 dreams and imagination.
 There has always been a place for them in man's life. Even in the
 darkness of the …

Japanese Students:
 I agree greatly this topic mainly because I think that English
 becomes an official language in the not too distant. Now, many
 people can speak English or study it all over the world, and so
 more people will be able to speak English. Before the Japanese
 fall behind other people, we should be able to speak English,
 therefore, we must study English not only junior high school
 students or over but also pupils. Japanese education system is
 changing such a program. In this way, Japan tries to
 internationalize rapidly. However, I think this way won't suffice
 for becoming international humans. To becoming international
 humans, we should study English not only school but also daily
 life. If we can do it, we are able to master English conversation.
 It is important for us to master English honorific words. …
Assessing writing quality
Measurable indicators of writing quality
1. Syntactic complexity
Long, complex sentences vs. short, simple sentences
Average sentence length, types of syntactic clauses used
2. Lexical diversity
Diverse vocabulary used vs. small set of words repeatedly used
Type-token ratio (with caveat!) or other measures
3. Vocabulary level
Common, everyday words vs. sophisticated & technical words
Average word length (common words tend to be shorter)
% of word tokens in top 1K, 2K, 3K most common English
words (Google Web 1T n-grams!)
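A sketch pulling these three indicators together; common_words is a hypothetical set (e.g., the top 1K words from the Google list), and the tokenizer choices are assumptions:

>>> import nltk
>>> def writing_stats(text, common_words):
...     sents = nltk.sent_tokenize(text)
...     words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
...     return {'avg_sent_len': len(words) / len(sents),           # 1. complexity
...             'ttr': len(set(words)) / len(words),               # 2. diversity
...             'avg_word_len': sum(map(len, words)) / len(words), # 3. level
...             'common_pct': sum(w in common_words for w in words) / len(words)}
...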
Corpus analysis: beyond numbers
We have become adept at processing corpora to produce
these metrics:
How big is a corpus? How many unique words?
# of tokens, # of types
Unigram, bigram, n-gram frequency counts
Which words and n-grams are frequent
But:
 What you get is a whole lot of NUMBERS.
 INTERPRETATION of these numbers is what really matters.
Corpus analysis: pitfalls
1. It's too easy to get hyper-focused on the coding part and
lose sight of the actual linguistic data behind it all.
Make sure to maintain your linguistic motivation.
2. As your corpus gets large, you run the risk of operating
blind: it is difficult to keep tabs on what linguistic data you
are handling.
Poke at your data in the shell. Take time to understand the data.
Make sure your data object is correct. Do NOT just assume it is.
3. Attaching a linguistically valid interpretation to numbers
is not at all trivial.
Be careful when drawing conclusions.
Make sure to explore all factors that might be affecting the numbers.
Homework 3
 This homework assignment is equally about Python coding AND
 corpus analysis. That means calculating the correct numbers is no
 longer enough. You should take care to understand your corpus data
 and offer up a well-considered and valid analysis of the data points.
Making assumptions
 About datasets we downloaded from the web
 About text processing output we just built
 … these assumptions just might bite you.
1-grams/word list: Norvig vs. ENABLE

count_1w.txt (total # of entries: 333K)
 the     23135851162
 of      13151942776
 and     12997637966
 to      12136980858
 a        9081174698
 in       8469404971
 for      5933321709
 is       4705743816
 on       3750423199
 that     3400031103
 by       3350048871
 this     3228469771
 with     3183110675
 i        3086225277
 ...
 goofel         12711
 gooek          12711
 gooddg         12711
 gooblle        12711
 gollgo         12711
 golgw          12711

enable1.txt (total # of entries: 173K)
 aa
 aah
 aahed
 aahing
 aahs
 aal
 aalii
 aaliis
 aals
 aardvark
 aardvarks
 aardwolf
 aardwolves
 aargh
 aarrgh
 ...
 zymotic
 zymurgies
 zymurgy
 zyzzyva
 zyzzyvas

Assumption: the Norvig/Google word list is a superset of the ENABLE
list -- WRONG!
1-grams/word list: Norvig vs. ENABLE
 (same two lists as above)
 Only 78,835 entries are shared between the two lists
 (45.6% of ENABLE).
 ENABLE has no single-character words ("I", "a") and lots of
 arcane words.
 Norvig's list has TONS of proper nouns and many non-words.
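A quick way to verify this for yourself, using the file names from the slides (count_1w.txt holds tab-separated word/count pairs):

>>> norvig = set(line.split('\t')[0] for line in open('count_1w.txt'))
>>> enable = set(line.strip() for line in open('enable1.txt'))
>>> len(norvig & enable)                   # shared word types
78835
>>> len(norvig & enable) / len(enable)     # ~0.456 -- not remotely a superset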
Know your data
 When using publicly available resources, you must evaluate and
 understand the data.
 Origin? Purpose? Domain & genre? Size? Traits?
 Merits and limitations? Fit with your project?
NLTK's corpus reader

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = "C:/Users/narae/Documents/ling1330/MLK"
>>> mlkcor = PlaintextCorpusReader(corpus_root, '.*txt')
>>> type(mlkcor)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> mlkcor.fileids()
['1963-I Have a Dream.txt', '1964-Nobel Peace Prize Acceptance
Speech.txt', '1967-Beyond Vietnam.txt', "1968-I've been to the
Mountain Top.txt"]
>>> len(mlkcor.fileids())
4
>>> mlkcor.fileids()[0]
'1963-I Have a Dream.txt'
>>> mlkcor.words()[:50]
['I', 'Have', 'A', 'Dream', 'by', 'Dr', '.', 'Martin', 'Luther',
'King', 'Jr', '.', 'Delivered', 'on', 'the', 'steps', 'at', 'the',
'Lincoln', 'Memorial', 'in', 'Washington', 'D', '.', 'C', '.', 'on',
'August', '28', ',', '1963', 'I', 'am', 'happy', 'to', 'join',
'with', 'you', 'today', 'in', 'what', 'will', 'go', 'down', 'in',
'history', 'as', 'the', 'greatest', 'demonstration']

Assumption: .words() tokens will be tokenized the same way as
nltk.word_tokenize() -- WRONG!
NLTK's corpus reader
>>> mlkcor.words()[-50:]
['that', 'we', ',', 'as', 'a', 'people', 'will', 'get', 'to', 'the',
'promised', 'land', '.', 'And', 'I', "'", 'm', 'happy', ',',
'tonight', '.', 'I', "'", 'm', 'not', 'worried', 'about',
'anything', '.', 'I', "'", 'm', 'not', 'fearing', 'any', 'man', '.',
'Mine', 'eyes', 'have', 'seen', 'the', 'glory', 'of', 'the',
'coming', 'of', 'the', 'Lord', '.']
>>> mlkcor.words()[-100:-50]
['God', "'", 's', 'will', '.', 'And', 'He', "'", 's', 'allowed',
'me', 'to', 'go', 'up', 'to', 'the', 'mountain', '.', 'And', 'I',
"'", 've', 'looked', 'over', '.', 'And', 'I', "'", 've', 'seen',
'the', 'promised', 'land', '.', 'I', 'may', 'not', 'get', 'there',
'with', 'you', '.', 'But', 'I', 'want', 'you', 'to', 'know',
'tonight', ',']
By default, PlaintextCorpusReader
uses a regular-expression based
tokenizer, which splits out all
symbols from alphabetic words.
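That default is WordPunctTokenizer, and the difference from nltk.word_tokenize() is easy to see on a contraction:

>>> from nltk.tokenize import WordPunctTokenizer
>>> WordPunctTokenizer().tokenize("I'm not worried.")
['I', "'", 'm', 'not', 'worried', '.']
>>> import nltk
>>> nltk.word_tokenize("I'm not worried.")
['I', "'m", 'not', 'worried', '.']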
HW#2: Basic corpus stats
The Bible
 Word token count: 946,812
 Word type count: 17,188
 TTR: 0.018

Jane Austen novels
 Word token count: 431,079
 Word type count: 11,641
 TTR: 0.027

Assumption: these are all legitimate word types -- WRONG!
HW#2: Basic corpus stats
>>> b_type_nonalnum = [t for t in b_tokfd if not t.isalnum()]
>>> len(b_type_nonalnum)
4628
>>> b_type_nonalnum[:30]
['[', ']', ':', '1:1', '.', '1:2', ',', ';', '1:3', '1:4', '1:5',
'1:6', '1:7', '1:8', '1:9', '1:10', '1:11', '1:12', '1:13',
'1:14', '1:15', '1:16', '1:17', '1:18', '1:19', '1:20', '1:21',
'1:22', '1:23', '1:24']
Over ¼ (!!!) of Bible word types
are verse numberings,
vastly inflating type count & TTR.
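One way to screen these out before recounting; the token list would need the same filtering before recomputing TTR:

>>> b_types_clean = [t for t in b_tokfd if t.isalnum()]
>>> len(b_types_clean)                     # 17,188 - 4,628
12560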
Always validate, verify
About text processing output we just built
Make sure to probe and verify.
Watch out for oddities.
Pre-built text processing functions are NOT perfect!
Sentence tokenization, word tokenization ➔ might include errors
They also might operate under different rules than you expect.
Especially important when attaching linguistic
interpretation to your numbers.
Hidden factors might be affecting the numbers.
Wrap-up
Attendance & participation policy changed
-7 attendance points for each missed class… BUT!
CHANGE: Joining Zoom streaming (real-time) or viewing
recorded Panopto video (post-class) will count as attendance
Recommendation: watch Panopto video post-class along with
posted PPT lecture slides
Participation points (Forum posting + office hour drop-ins) can
also be used to make up for missed attendance
Homework 3 is due on THU
 It is larger, at 60 points
Next class:
HW3 review
Classifying documents