What Every Computational Linguist Should Know About Type-Token Distributions and Zipf’s Law
Tutorial 1, 7 May 2018
Stefan Evert, FAU Erlangen-Nürnberg
http://zipfr.r-forge.r-project.org/lrec2018.html
Licensed under CC-by-sa version 3.0
Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 1 / 99
Outline

Part 1
- Motivation
- Descriptive statistics & notation
- Some examples (zipfR)
- LNRE models: intuition
- LNRE models: mathematics

Part 2
- Applications & examples (zipfR)
- Limitations
- Non-randomness
- Conclusion & outlook
Part 1 Motivation
Type-token statistics
- Type-token statistics differ from most statistical inference:
  - not about the probability of a specific event
  - but about the diversity of events and their probability distribution
- Relatively little work in statistical science
- Not a major research topic in computational linguistics either:
  - very specialized, usually plays an ancillary role in NLP
- But type-token statistics appear in a wide range of applications
  - often crucial for a sound analysis
⇒ the NLP community needs better awareness of statistical techniques, their limitations, and available software
Some research questions
- How many words did Shakespeare know?
- What is the coverage of my treebank grammar on big data?
- How many typos are there on the Internet?
- Is -ness more productive than -ity in English?
- Are there differences in the productivity of nominal compounds between academic writing and novels?
- Does Dickens use a more complex vocabulary than Rowling?
- Can a decline in lexical complexity predict Alzheimer’s disease?
- How frequent is a hapax legomenon from the Brown corpus?
- What is appropriate smoothing for my n-gram model?
- Who wrote the Bixby letter, Lincoln or Hay?
- How many different species of ... are there? (Brainerd 1982)
Some research questions (categories)
- coverage estimates
- productivity
- lexical complexity & stylometry
- prior & posterior distributions
- unexpected applications
Zipf’s law (Zipf 1949)
A) Frequency distributions in natural language are highly skewed
B) Curious relationship between rank & frequency

   word   r    f        r · f
   the    1    142,776  142,776
   and    2    100,637  201,274
   be     3     94,181  282,543
   of     4     74,054  296,216
   (frequency counts from Dickens)
C) Various explanations of Zipf’s law
- principle of least effort (Zipf 1949)
- optimal coding system, MDL (Mandelbrot 1953, 1962)
- random sequences (Miller 1957; Li 1992; Cao et al. 2017)
- Markov processes → n-gram models (Rouault 1978)
D) Language evolution: birth-death process (Simon 1955)
⇒ not the main topic today!
Part 1 Descriptive statistics & notation
Tokens & types
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
- N = 15: number of tokens = sample size
- V = 7: number of distinct types = vocabulary size
  (recently, very, not, otherwise, much, merely, now)

type-frequency list:
   w          f_w
   recently   1
   very       5
   not        3
   otherwise  1
   much       2
   merely     2
   now        1
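These descriptive statistics are easy to verify outside of zipfR as well; a minimal plain-Python sketch (the token list is copied from the slide):

```python
from collections import Counter

# the 15-token adverb sample from the slide
tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

freqs = Counter(tokens)    # type-frequency list: f_w for every type w
N = sum(freqs.values())    # sample size = number of tokens
V = len(freqs)             # vocabulary size = number of distinct types

print(N, V)                # 15 7
```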
Zipf ranking
Zipf ranking:
   w          r   f_r
   very       1   5
   not        2   3
   merely     3   2
   much       4   2
   now        5   1
   otherwise  6   1
   recently   7   1

[Figure: Zipf ranking of the adverb sample (frequency f_r against rank r)]
A realistic Zipf ranking: the Brown corpus
top frequencies:
   r    f      word
   1    69836  the
   2    36365  of
   3    28826  and
   4    26126  to
   5    23157  a
   6    21314  in
   7    10777  that
   8    10182  is
   9     9968  was
   10    9801  he

bottom frequencies:
   rank range      f   randomly selected examples
    7731 –  8271   10  schedules, polynomials, bleak
    8272 –  8922    9  tolerance, shaved, hymn
    8923 –  9703    8  decreased, abolish, irresistible
    9704 – 10783    7  immunity, cruising, titan
   10784 – 11985    6  geographic, lauro, portrayed
   11986 – 13690    5  grigori, slashing, developer
   13691 – 15991    4  sheath, gaulle, ellipsoids
   15992 – 19627    3  mc, initials, abstracted
   19628 – 26085    2  thar, slackening, deluxe
   26086 – 45215    1  beck, encompasses, second-place
[Figure: Zipf ranking of the full Brown corpus on doubly logarithmic axes]
Frequency spectrum
- pool types with f = 1 (hapax legomena), types with f = 2 (dis legomena), ..., f = m, ...
- V_1 = 3: number of hapax legomena (now, otherwise, recently)
- V_2 = 2: number of dis legomena (merely, much)
- general definition: V_m = |{w | f_w = m}|
Zipf ranking:
   w          r   f_r
   very       1   5
   not        2   3
   merely     3   2
   much       4   2
   now        5   1
   otherwise  6   1
   recently   7   1

frequency spectrum:
   m   V_m
   1   3
   2   2
   3   1
   5   1
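The spectrum can be computed the same way; a plain-Python sketch (same adverb sample as above, independent of zipfR):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()
freqs = Counter(tokens)

# frequency spectrum: V_m = |{w | f_w = m}|
spectrum = Counter(freqs.values())

print(sorted(spectrum.items()))   # [(1, 3), (2, 2), (3, 1), (5, 1)]
```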
[Figure: frequency spectrum of the adverb sample (barplot of V_m against m)]
A realistic frequency spectrum: the Brown corpus
[Figure: frequency spectrum of the Brown corpus (barplot of V_m for m = 1, ..., 15)]
Vocabulary growth curve
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
- N = 1: V(N) = 1, V_1(N) = 1
- N = 3: V(N) = 3, V_1(N) = 3
- N = 7: V(N) = 5, V_1(N) = 4
- N = 12: V(N) = 7, V_1(N) = 4
- N = 15: V(N) = 7, V_1(N) = 3
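The growth curve for this sample can be traced with a simple loop; a plain-Python sketch (zipfR computes the same thing with vec2vgc):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

counts = Counter()
V_of_N, V1_of_N = [], []          # V(N) and V_1(N) after each token
for w in tokens:
    counts[w] += 1
    V_of_N.append(len(counts))
    V1_of_N.append(sum(1 for c in counts.values() if c == 1))

# the data points listed on the slide (N = 1, 3, 7, 12, 15)
print([V_of_N[i] for i in (0, 2, 6, 11, 14)])    # [1, 3, 5, 7, 7]
print([V1_of_N[i] for i in (0, 2, 6, 11, 14)])   # [1, 3, 4, 4, 3]
```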
[Figure: vocabulary growth curve of the adverb sample, showing V(N) and V_1(N)]
A realistic vocabulary growth curve: the Brown corpus
[Figure: vocabulary growth curve of the Brown corpus, showing V(N) and V_1(N) up to N = 1 million]
Vocabulary growth in authorship attribution
- Authorship attribution by n-gram tracing applied to the case of the Bixby letter (Grieve et al. submitted)
- Word or character n-grams in the disputed text are compared against large “training” corpora from the candidate authors
Figure 1: One Gettysburg Address 2-word n-gram traces

“In addition to analysing 2-word n-grams, we also analysed 1-, 3- and 4-word n-grams, based on the average percentage of n-grams seen in 50 random 260,000-word samples of texts. The analysis was only run up to 4-word n-grams because from that point onward the Hay corpus contains none of the n-grams in the Gettysburg Address. The 3- and 4-word n-gram analyses also correctly attributed the Gettysburg Address to Lincoln: 18% of 3-grams for Lincoln vs. 14% of 3-grams for Hay and 2% of 4-grams for Lincoln vs. 0% of 4-grams for Hay. The 1-word n-gram analysis, however, incorrectly attributed the Gettysburg Address to Hay. Figure 3 presents the aggregated n-gram traces for all analyses. Notably, the 2-, 3- and 4-word n-gram analyses, which correctly attributed the document to Lincoln, appear to be far more definitive than the incorrect 1-word n-gram analysis.” (Grieve et al., manuscript)
Observing Zipf’s law
[Figure: Zipf rankings across languages and for different linguistic units]
Observing Zipf’s law
[Figure: Zipf ranking of the Italian prefix ri- in the la Repubblica corpus]
Observing Zipf’s law
- A straight line in double-logarithmic space corresponds to a power law for the original variables
- This leads to Zipf’s (1949; 1965) famous law:

  f_r = C / r^a
- Taking the logarithm on both sides, we obtain a linear relationship in log-space:

  log f_r = log C − a · log r
  (y = log f_r,  x = log r)

- Intuitive interpretation of a and C:
  - a is the slope, determining how fast log frequency decreases
  - log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 → log r = 0)
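The linear form suggests estimating a and C by least squares in log-space. A plain-Python sketch on noise-free synthetic data; the values C = 60,000, a = 1 and the 1000 ranks are illustration choices:

```python
import math

# Zipf's law f_r = C / r^a  =>  log f_r = log C - a * log r
C_true, a_true = 60000.0, 1.0
xs = [math.log(r) for r in range(1, 1001)]
ys = [math.log(C_true / r**a_true) for r in range(1, 1001)]

# ordinary least squares: slope = -a, intercept = log C
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

a_hat, C_hat = -slope, math.exp(intercept)
print(round(a_hat, 6), round(C_hat, 1))   # 1.0 60000.0
```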
Observing Zipf’s law
[Figure: least-squares fit = linear regression in log-space (Brown corpus)]
Zipf-Mandelbrot law (Mandelbrot 1953, 1962)
- Mandelbrot’s extra parameter:

  f_r = C / (r + b)^a
- Zipf’s law is the special case b = 0
- Assuming a = 1, C = 60,000, b = 1:
  - For the word with rank 1, Zipf’s law predicts a frequency of 60,000; Mandelbrot’s variation predicts a frequency of 30,000
  - For the word with rank 1,000, Zipf’s law predicts a frequency of 60; Mandelbrot’s variation predicts a frequency of 59.94
- The Zipf-Mandelbrot law forms the basis of statistical LNRE models
- The ZM law can be derived mathematically as the limiting distribution of the vocabulary generated by a character-level Markov process
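The rank-1 vs. rank-1000 comparison above can be verified in a few lines of Python:

```python
def zipf(r, C=60000.0, a=1.0):
    return C / r**a                # Zipf's law

def zipf_mandelbrot(r, C=60000.0, a=1.0, b=1.0):
    return C / (r + b)**a          # Mandelbrot's variation

print(zipf(1), zipf_mandelbrot(1))                   # 60000.0 30000.0
print(zipf(1000), round(zipf_mandelbrot(1000), 2))   # 60.0 59.94
```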
Zipf-Mandelbrot law
[Figure: non-linear least-squares fit of the ZM law (Brown corpus)]
Part 1 Some examples (zipfR)
zipfR (Evert and Baroni 2007)
- http://zipfR.R-Forge.R-Project.org/
- Conveniently available from the CRAN repository
- Package vignette = gentle tutorial introduction
First steps with zipfR
- Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
- Install the most recent version of the zipfR package
- Package, handouts, code samples & data sets are available from http://zipfr.r-forge.r-project.org/lrec2018.html
> library(zipfR)
> ?zipfR # documentation entry point
> vignette("zipfr-tutorial") # read the zipfR tutorial
Loading type-token data
- Most convenient input: sequence of tokens as a text file in vertical format (“one token per line”)
  - tokens mapped to appropriate types: normalized word forms, word pairs, lemmas, semantic classes, n-grams of POS tags, ...
  - language data should always be in UTF-8 encoding!
  - large files can be compressed (.gz, .bz2, .xz)
- Sample data: brown_adverbs.txt on the tutorial homepage
  - lowercased adverb tokens from the Brown corpus (in original order)
  ⇒ download and save to your working directory
> adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
> head(adv, 30)  # mathematically, a "vector" of tokens
> length(adv)    # sample size = 52,037 tokens
Descriptive statistics: type-frequency list
> adv.tfl <- vec2tfl(adv)
> adv.tfl
     k    f type
1    1 4859  not
2    2 2084  n’t
3    3 1464   so
4    4 1381 only
5    5 1374 then
6    6 1309  now
7    7 1134 even
8    8 1089   as
...
        N    V
    52037 1907
> N(adv.tfl)  # sample size
> V(adv.tfl)  # type count
Descriptive statistics: frequency spectrum
> adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
> adv.spc
    m  Vm
1   1 762
2   2 260
3   3 144
4   4  99
5   5  69
6   6  50
7   7  40
8   8  34
...
        N    V
    52037 1907
> N(adv.spc)  # sample size
> V(adv.spc)  # type count
Descriptive statistics: vocabulary growth
- A VGC lists the vocabulary size V(N) at different sample sizes N
- Optionally also spectrum elements V_m(N) up to m.max
> adv.vgc <- vec2vgc(adv, m.max=2)
- Visualize descriptive statistics with the plot method

> plot(adv.tfl)             # Zipf ranking
> plot(adv.tfl, log="xy")   # logarithmic scale recommended
> plot(adv.spc) # barplot of frequency spectrum
> plot(adv.vgc, add.m = 1:2) # vocabulary growth curve
Further example data sets
?Brown: words from the Brown corpus
?BrownSubsets: various subsets
?Dickens: words from novels by Charles Dickens
?ItaPref: Italian word-formation prefixes
?TigerNP: NP and PP patterns from the German Tiger treebank
?Baayen2001: frequency spectra from Baayen (2001)
?EvertLuedeling2001: German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)

Practice:
- Explore these data sets with descriptive statistics
- Try different plot options (see help pages ?plot.tfl, ?plot.spc, ?plot.vgc)
Part 1 LNRE models: intuition
Motivation
- Interested in the productivity of an affix, the vocabulary of an author, ...; not in a particular text or sample
  ⇒ statistical inference from sample to population
- Discrete frequency counts are difficult to capture with generalizations such as Zipf’s law
- Zipf’s law predicts many impossible types with 1 < f_r < 2
  ⇒ the population does not suffer from such quantization effects
LNRE models
- This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001)
- LNRE = Large Number of Rare Events
- LNRE uses various approximations and simplifications to obtain a tractable and elegant model
- Of course, we could also estimate the precise discrete distributions using MCMC simulations, but ...
  1. the LNRE model is usually a minor component of a complex procedure
  2. it is often applied to very large samples (N > 1 M tokens)
The LNRE population
- Population: set of S types w_i with occurrence probabilities π_i
- S = population diversity, which can be finite or infinite (S = ∞)
- Not interested in specific types → arrange by decreasing probability: π_1 ≥ π_2 ≥ π_3 ≥ ...
  ⇒ impossible to determine the probabilities of all individual types
- Normalization: π_1 + π_2 + ... + π_S = 1
- Need a parametric statistical model to describe the full population (esp. for S = ∞), i.e. a function i ↦ π_i
  - type probabilities π_i cannot be estimated reliably from a sample, but the parameters of this function can
- NB: population index i ≠ Zipf rank r
Examples of population models
[Figure: four example population models, plotting type probabilities π_k against k for k = 1, ..., 50]
The Zipf-Mandelbrot law as a population model
What is the right family of models for lexical frequency distributions?
- We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well
- Re-phrase the law for type probabilities:

  π_i := C / (i + b)^a

- Two free parameters: a > 1 and b ≥ 0
- C is not a parameter but a normalization constant, needed to ensure that Σ_i π_i = 1
- This is the Zipf-Mandelbrot population model
The parameters of the Zipf-Mandelbrot model
[Figure: π_k against k for (a = 1.2, b = 1.5), (a = 2, b = 10), (a = 2, b = 15), (a = 5, b = 40)]
[Figure: the same four Zipf-Mandelbrot models on doubly logarithmic axes]
The finite Zipf-Mandelbrot model (Evert 2004)
- The Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities π_i can become arbitrarily small
- π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)
- The finite Zipf-Mandelbrot model stops after the first S types
- Population diversity S becomes a parameter of the model
  → the finite Zipf-Mandelbrot model has 3 parameters

Abbreviations:
- ZM for the Zipf-Mandelbrot model
- fZM for the finite Zipf-Mandelbrot model
Sampling from a population model
Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:
[Figure: ZM population with a = 3, b = 50, shown on linear and doubly logarithmic axes]
Use computer simulation to generate random samples:
- Draw N tokens from the population such that in each step, type w_i has probability π_i of being picked
- This allows us to make predictions for samples (= corpora) of arbitrary size N
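This sampling scheme is easy to sketch in plain Python (zipfR generates such samples directly from a fitted LNRE model). The population below uses the slide's a = 3, b = 50; truncating at S = 1000 types and drawing N = 1000 tokens are added assumptions for the sketch:

```python
import random

random.seed(42)

# Zipf-Mandelbrot population: pi_i proportional to 1 / (i + b)^a
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]

# draw N = 1000 tokens; each draw picks type w_i with probability pi_i
sample = random.choices(range(1, S + 1), weights=probs, k=1000)

V = len(set(sample))    # vocabulary size of the simulated corpus
print(len(sample), V)
```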
Sampling from a population model
#1: 1 42 34 23 108 18 48 18 1 ...
    (time order room school town course area course time ...)
#2: 286 28 23 36 3 4 7 4 8 . . .
#3: 2 11 105 21 11 17 17 1 16 . . .
#4: 44 3 110 34 223 2 25 20 28 . . .
#5: 24 81 54 11 8 61 1 31 35 . . .
#6: 3 65 9 165 5 42 16 20 7 . . .
#7: 10 21 11 60 164 54 18 16 203 . . .
#8: 11 7 147 5 24 19 15 85 37 . . .
...
Samples: type frequency list & spectrum
sample #1

Zipf ranking:
   rank r   f_r   type i
   1        37    6
   2        36    1
   3        33    3
   4        31    7
   5        31    10
   6        30    5
   7        28    12
   8        27    2
   9        24    4
   10       24    16
   11       23    8
   12       22    14
   ...

frequency spectrum:
   m    V_m
   1    83
   2    22
   3    20
   4    12
   5    10
   6    5
   7    5
   8    3
   9    3
   10   3
   ...
sample #2

Zipf ranking:
   rank r   f_r   type i
   1        39    2
   2        34    3
   3        30    5
   4        29    10
   5        28    8
   6        26    1
   7        25    13
   8        24    7
   9        23    6
   10       23    11
   11       20    4
   12       19    17
   ...

frequency spectrum:
   m    V_m
   1    76
   2    27
   3    17
   4    10
   5    6
   6    5
   7    7
   8    3
   10   4
   11   2
   ...
Random variation in type-frequency lists
[Figure: type frequencies of samples #1 and #2, plotted against Zipf rank r (f_r) and against population index k (f_k)]
Random variation: frequency spectrum
[Figure: frequency spectra (V_m against m) of samples #1 to #4]
Random variation: vocabulary growth curve
[Figure: V(N) and V_1(N) for samples #1 to #4, N up to 1000]
Expected values
- There is no reason why we should choose a particular sample to compare to the real data or make a prediction – each one is equally likely or unlikely
- Take the average over a large number of samples, called the expected value or expectation in statistics
- Notation: E[V(N)] and E[V_m(N)]
  - indicates that we are referring to expected values for a sample of size N
  - rather than to the specific values V and V_m observed in a particular sample or a real-world data set
- Expected values can be calculated efficiently without generating thousands of random samples
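The last point can be illustrated numerically: averaging V over many simulated samples converges to the direct computation E[V] = Σ_i (1 − e^(−Nπ_i)) from the Poisson model derived later in this part. A plain-Python sketch; the population (S = 1000, a = 3, b = 50) and the settings N = 500, 200 runs are illustration choices:

```python
import math, random

random.seed(1)

# illustrative Zipf-Mandelbrot population
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]

N, runs = 500, 200

# Monte-Carlo estimate of E[V(N)]: average V over many random samples
mc = sum(len(set(random.choices(range(S), weights=probs, k=N)))
         for _ in range(runs)) / runs

# direct computation, no sampling needed: E[V] = sum_i (1 - exp(-N * pi_i))
direct = sum(1 - math.exp(-N * p) for p in probs)

print(round(mc, 1), round(direct, 1))   # the two values agree closely
```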
The expected frequency spectrum
[Figure: observed V_m vs. expected E[V_m] for samples #1 to #4]
The expected vocabulary growth curve
[Figure: observed V(N) and V_1(N) vs. expected E[V(N)] and E[V_1(N)] for sample #1]
Prediction intervals for the expected VGC
[Figure: expected VGCs for sample #1 with prediction intervals]

“Confidence intervals” indicate the predicted sampling distribution:
⇒ for 95% of samples generated by the LNRE model, the VGC will fall within the range delimited by the thin red lines
Parameter estimation by trial & error
[Figure: observed frequency spectrum and VGC vs. ZM model with a = 1.5, b = 7.5]
[Figure: observed vs. ZM model with a = 1.3, b = 7.5]
[Figure: observed vs. ZM model with a = 1.3, b = 0.2]
[Figure: observed vs. ZM model with a = 2, b = 550]
Automatic parameter estimation
[Figure: observed vs. fitted ZM model with a = 2.39, b = 1968.49]
- By trial & error we found a = 2.0 and b = 550
- The automatic estimation procedure yields a = 2.39 and b = 1968
Part 1 LNRE models: mathematics
The sampling model
- Draw a random sample of N tokens from the LNRE population
- Sufficient statistic: set of type frequencies {f_i}
  - because the tokens of a random sample have no ordering
- Joint multinomial distribution of {f_i}:

  Pr({f_i = k_i} | N) = (N! / (k_1! · · · k_S!)) · π_1^k_1 · · · π_S^k_S

- Approximation: do not condition on the fixed sample size N
  - N is now the average (expected) sample size
- The random variables f_i then have independent Poisson distributions:

  Pr(f_i = k_i) = e^(−Nπ_i) · (Nπ_i)^k_i / k_i!
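The quality of the Poisson approximation for a single type can be checked against the exact binomial marginal Binom(N, π); the values N = 10,000 and π = 0.0005 (so Nπ = 5) are illustration choices:

```python
import math

N, pi = 10000, 0.0005    # expected frequency N * pi = 5

def poisson(k):
    # Pr(f_i = k) = exp(-N*pi) * (N*pi)^k / k!
    lam = N * pi
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial(k):
    # exact marginal distribution of one type's frequency
    return math.comb(N, k) * pi**k * (1 - pi)**(N - k)

for k in (0, 5, 10):
    print(k, round(poisson(k), 6), round(binomial(k), 6))
```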
Frequency spectrum
- Key problem: we cannot determine f_i in the observed sample
  - because we don’t know which type w_i is
  - recall that the population ranking f_i ≠ the Zipf ranking f_r
- Use the spectrum {V_m} and the vocabulary size V as statistics
  - they contain all the information we have about the observed sample
- These can be expressed in terms of indicator variables:

  I_[f_i = m] = 1 if f_i = m, 0 otherwise

  V_m = Σ_{i=1}^{S} I_[f_i = m]

  V = Σ_{i=1}^{S} I_[f_i > 0] = Σ_{i=1}^{S} (1 − I_[f_i = 0])
The expected spectrum
- It is easy to compute expected values for the frequency spectrum (and variances, because the f_i are independent):

  E[I_[f_i = m]] = Pr(f_i = m) = e^(−Nπ_i) · (Nπ_i)^m / m!

  E[V_m] = Σ_{i=1}^{S} E[I_[f_i = m]] = Σ_{i=1}^{S} e^(−Nπ_i) · (Nπ_i)^m / m!

  E[V] = Σ_{i=1}^{S} E[1 − I_[f_i = 0]] = Σ_{i=1}^{S} (1 − e^(−Nπ_i))

- NB: V_m and V are not independent, because they are derived from the same random variables f_i
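A quick numerical sanity check of these formulas: since every sampled type falls into exactly one spectrum class, the expected spectrum must sum to the expected vocabulary, Σ_{m≥1} E[V_m] = E[V]. A plain-Python sketch with an illustrative ZM-style population:

```python
import math

# illustrative Zipf-Mandelbrot population
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]
N = 500

def poisson_pmf(lam, m):
    # Pr(f_i = m), computed in log-space to avoid overflow for large m
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))

E_V = sum(1 - math.exp(-N * p) for p in probs)
spectrum_sum = sum(sum(poisson_pmf(N * p, m) for p in probs)
                   for m in range(1, 200))

print(round(E_V, 4), round(spectrum_sum, 4))   # identical up to truncation
```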
Sampling distribution of V_m and V
- The joint sampling distribution of {V_m} and V is complicated
- Approximation: V and {V_m} asymptotically follow a multivariate normal distribution
  - motivated by the multivariate central limit theorem: sum of many independent variables I_[f_i = m]
- Usually limited to the first spectrum elements, e.g. V_1, ..., V_15
  - the approximation of the discrete V_m by a continuous distribution is suitable only if E[V_m] is sufficiently large
- Parameters of the multivariate normal: µ = (E[V], E[V_1], E[V_2], ...) and Σ = covariance matrix

  Pr((V, V_1, ..., V_k) = v) ∼ exp(−(1/2) (v − µ)ᵀ Σ^(−1) (v − µ)) / √((2π)^(k+1) · det Σ)
Type density function
- The discrete sums of probabilities in E[V], E[V_m], ... are inconvenient and computationally expensive
- Approximation: continuous type density function g(π)

  |{w_i | a ≤ π_i ≤ b}| = ∫_a^b g(π) dπ

  Σ {π_i | a ≤ π_i ≤ b} = ∫_a^b π · g(π) dπ

- Normalization constraint: ∫_0^∞ π · g(π) dπ = 1
- Good approximation for low-probability types, but the probability mass of w_1, w_2, ... is “smeared out” over a range
[Figure: type density g(π) as a continuous approximation, contrasted with discrete type probabilities π_1 = .376, π_2 = .215, π_3 = .130, π_4 = .082]
ZM and fZM as LNRE models
- Discrete Zipf-Mandelbrot population:

  π_i := C / (i + b)^a   for i = 1, ..., S

- Corresponding type density function (Evert 2004):

  g(π) = C · π^(−α−1) for A ≤ π ≤ B, and g(π) = 0 otherwise

  with parameters
  - α = 1/a (0 < α < 1)
  - B = (1 − α) / (b · α)
  - 0 ≤ A < B, which determines S (ZM with S = ∞ for A = 0)
  ⇒ C is a normalization factor, not a parameter
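The mapping between the ZM parameters (a, b) and the LNRE parameters (α, B) can be wrapped in a tiny helper. Caveat: the relation B = (1 − α)/(b · α) used here follows the zipfR documentation of the ZM model; the numerical values a = 2, b = 550 are taken from the trial-and-error slide earlier:

```python
def zm_to_lnre(a, b):
    # alpha = 1/a; B = (1 - alpha) / (b * alpha)  [relation per the zipfR docs]
    alpha = 1.0 / a
    B = (1.0 - alpha) / (b * alpha)
    return alpha, B

alpha, B = zm_to_lnre(2.0, 550.0)
print(alpha, B)   # alpha = 0.5, B = 1/550
```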
[Figure: type density g(π) of a fitted fZM model, with π on a logarithmic scale]
Expectations as integrals
- Expected values can now be expressed as integrals over g(π):

  E[V_m] = ∫_0^∞ ((Nπ)^m / m!) · e^(−Nπ) · g(π) dπ

  E[V] = ∫_0^∞ (1 − e^(−Nπ)) · g(π) dπ

- These reduce to a simple closed form for ZM (approximation):

  E[V_m] = (C / m!) · N^α · Γ(m − α)

  E[V] = C · N^α · Γ(1 − α) / α

- For fZM, and for the exact ZM solution, use the incomplete Gamma function
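The ZM closed form for E[V] can be checked against direct numerical integration of the type density; the parameter values α = 0.5, B = 1/550 and N = 10^6 are illustration choices, and C = (1 − α)/B^(1−α) is the normalization constant of the density:

```python
import math

alpha, B = 0.5, 1.0 / 550.0
C = (1 - alpha) / B ** (1 - alpha)   # normalization of g(pi) = C * pi^(-alpha-1)
N = 1_000_000

# closed-form approximation: E[V] = C * N^alpha * Gamma(1 - alpha) / alpha
closed_form = C * N**alpha * math.gamma(1 - alpha) / alpha

# midpoint rule on a log-spaced grid for E[V] = int_0^B (1 - e^(-N pi)) g(pi) dpi
steps = 100_000
lo, hi = math.log(1e-14), math.log(B)
h = (hi - lo) / steps
numeric = 0.0
for j in range(steps):
    p = math.exp(lo + (j + 0.5) * h)
    numeric += (1 - math.exp(-N * p)) * C * p ** (-alpha - 1) * p * h

print(round(closed_form), round(numeric))   # agree to within a few percent
```

The small discrepancy comes from the closed form extending the upper integration limit from B to infinity.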
Parameter estimation from training corpus
▶ For ZM, α = E[V1]/E[V] ≈ V1/V can be estimated directly, but this estimator is prone to overfitting
▶ General parameter fitting by MLE: maximize the likelihood of the observed spectrum v

max_{α,A,B} Pr((V, V1, . . . , Vk) = v | α, A, B)

▶ Multivariate normal approximation:

min_{α,A,B} (v − µ)ᵀ Σ⁻¹ (v − µ)

▶ Minimization by gradient descent (BFGS, CG) or simplex search (Nelder-Mead)
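The direct estimator α ≈ V1/V takes only a few lines. Below is an illustrative Python sketch, not the zipfR fitting code: tokens are drawn from a hypothetical Zipf-Mandelbrot population whose parameters a, b, S are made up.

```python
import random
from collections import Counter

random.seed(42)

# hypothetical Zipf-Mandelbrot population: pi_i proportional to (i + b)^(-a)
a, b, S = 1.5, 5.0, 100_000
weights = [(i + b) ** -a for i in range(1, S + 1)]

N = 20_000
tokens = random.choices(range(1, S + 1), weights=weights, k=N)
freq = Counter(tokens)

V = len(freq)                                   # observed vocabulary size
V1 = sum(1 for f in freq.values() if f == 1)    # number of hapax legomena
alpha_hat = V1 / V                              # direct estimate of alpha
print(alpha_hat)                                # compare with 1/a ≈ 0.67
```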
Parameter estimation from training corpus
[Figure: two contour plots of the fitting criterion over the parameter space (α vs. log10(B)) for bare singular PPs in the BNC; the right panel shows the goodness-of-fit statistic X² (m = 10) with the optimum at (α, log10(B)) = (0.65, −2.11).]
Goodness-of-fit (Baayen 2001, Sec. 3.3)

▶ How well does the fitted model explain the observed data?
▶ For the multivariate normal distribution:

X² = (V − µ)ᵀ Σ⁻¹ (V − µ) ∼ χ²_{k+1}

where V = (V, V1, . . . , Vk)
⇒ multivariate chi-squared test of goodness-of-fit
  ▶ replace V by the observed v ⇒ test statistic x²
  ▶ must reduce df = k + 1 by the number of estimated parameters
▶ NB: significant rejection of the LNRE model for p < .05
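The quadratic form itself is straightforward to compute. The sketch below uses invented numbers throughout (v, µ, and a diagonal Σ are hypothetical; real spectrum covariances are not diagonal), purely to fix ideas:

```python
# hypothetical spectrum vector v = (V, V1, V2), model expectations mu,
# and a diagonal covariance (real spectrum covariances are not diagonal)
v      = [2039.0, 1421.0, 265.0]
mu     = [2010.0, 1395.0, 278.0]
sigma2 = [900.0, 750.0, 260.0]        # diagonal of Sigma (variances)

# X^2 = (v - mu)^T Sigma^{-1} (v - mu); with diagonal Sigma this is a plain sum
x2 = sum((vi - mi) ** 2 / s2 for vi, mi, s2 in zip(v, mu, sigma2))
# x2 would then be compared against a chi-squared distribution with
# df = k + 1 minus the number of estimated parameters
print(round(x2, 2))
```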
Coffee break!
Part 2 Applications & examples (zipfR)
Measuring morphological productivity (example from Evert and Lüdeling 2001)
[Figure: vocabulary growth curves V(N) for the German suffixes -bar, -sam, and -ös, for N up to 35,000 tokens.]
[Figure: observed vs. expected vocabulary growth, V(N) vs. E[V(N)] and V1(N) vs. E[V1(N)], for a fitted model with a = 1.45, b = 34.59, S = 20,587; N up to 350,000 tokens.]
Quantitative measures of productivity (Tweedie and Baayen 1998; Baayen 2001)

▶ Baayen's (1991) productivity index P (slope of the vocabulary growth curve):  P = V1 / N
▶ TTR = type-token ratio:  TTR = V / N
▶ Zipf-Mandelbrot slope:  a
▶ Herdan's law (1964):  C = log V / log N
▶ Yule (1944) / Simpson (1949):  K = 10,000 · (∑_m m² Vm − N) / N²
▶ Guiraud (1954):  R = V / √N
▶ Sichel (1975):  S = V2 / V
▶ Honoré (1979):  H = log N / (1 − V1/V)
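All of these measures are simple functions of N, V, and the frequency spectrum Vm. The helper below is an illustrative Python sketch (the function name is my own; zipfR provides such measures in R) applied to a toy spectrum:

```python
import math

def productivity_measures(spectrum):
    """Compute the measures above from a frequency spectrum {m: Vm}
    (Vm = number of types occurring exactly m times)."""
    V = sum(spectrum.values())                       # vocabulary size
    N = sum(m * Vm for m, Vm in spectrum.items())    # sample size
    V1 = spectrum.get(1, 0)
    V2 = spectrum.get(2, 0)
    return {
        "TTR": V / N,
        "P":   V1 / N,                               # Baayen's productivity index
        "C":   math.log(V) / math.log(N),            # Herdan
        "K":   10_000 * (sum(m * m * Vm for m, Vm in spectrum.items()) - N) / N ** 2,
        "R":   V / math.sqrt(N),                     # Guiraud
        "S":   V2 / V,                               # Sichel
        "H":   math.log(N) / (1 - V1 / V),           # Honoré
    }

# toy spectrum: 3 hapaxes and 1 type with frequency 2  =>  V = 4, N = 5
meas = productivity_measures({1: 3, 2: 1})
print(meas["TTR"], meas["K"])   # 0.8 800.0
```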
Productivity measures for bare singulars in the BNC
            spoken    written
V            2,039     12,876
N            6,766     85,750
K            86.84      28.57
R            24.79      43.97
S             0.13       0.15
C             0.86       0.83
P             0.21       0.08
TTR          0.301      0.150
a             1.18       1.27
pop. S      15,958     36,874

[Figure: vocabulary growth curves V(N) for bare singulars in the written and spoken BNC, N up to 85,750 tokens.]
Are these “lexical constants” really constant?
[Figure: eight panels plotting Yule's K, Guiraud's R, Sichel's S, Herdan's law C, Baayen's P, TTR, the Zipf slope a, and the estimated population size against sample size (N up to 5,000).]
Simulation experiments based on LNRE models
▶ Systematic study of size dependence and other properties of productivity measures, based on samples from an LNRE model
  ▶ LNRE model ⇒ well-defined population
  ▶ random sampling helps to assess the variability of the measures
  ▶ expected values E[P] etc. can often be computed directly (or approximated) ⇒ computationally efficient
⇒ LNRE models as tools for understanding productivity measures
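The size dependence of a "constant" such as TTR is easy to reproduce by such a simulation. An illustrative Python sketch with a hypothetical Zipfian population (the exponent and vocabulary size are made up, and plain random sampling stands in for the LNRE machinery):

```python
import random

random.seed(1)

# hypothetical Zipfian population over S types with weights ~ rank^(-1.2)
S = 20_000
weights = [r ** -1.2 for r in range(1, S + 1)]
types = list(range(1, S + 1))

def ttr_at(N):
    """Type-token ratio V/N of one random sample of size N."""
    sample = random.choices(types, weights=weights, k=N)
    return len(set(sample)) / N

for N in (100, 1000, 10_000):
    print(N, round(ttr_at(N), 3))
# the "constant" TTR drops steadily as the sample grows
```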
Simulation: sample size
[Figure: simulated behaviour of Yule-Simpson K, Guiraud R, Herdan's C, TTR, Baayen P, the Zipf slope a, Sichel S, and Honoré H as a function of sample size (N up to 25,000).]
Simulation: frequent lexicalized types
[Figure: the same eight measures (Yule-Simpson K, Guiraud R, Herdan C, TTR, Baayen P, Zipf slope a, Sichel S, Honoré H) plotted against the probability mass of frequent lexicalized types (0 to 0.4).]
interactive demo
Posterior distribution
[Figure: posterior distribution Pr(π | m = 1) for a ZM model with α = 0.4, plotted against expected frequency Nπ on a log scale; the MLE, 95% and 99.9% confidence intervals, and the Good-Turing estimate are marked; posterior and log-adjusted curves shown.]
[Figure: the corresponding posterior distribution Pr(π | m = 1) for a ZM model with α = 0.9.]
Part 2 Limitations
How reliable are the fitted models?
Three potential issues:
1. Model assumptions ≠ population (e.g. the distribution does not follow a Zipf-Mandelbrot law)
   ☞ model cannot be adequate, regardless of parameter settings
2. Parameter estimation unsuccessful (i.e. suboptimal goodness-of-fit to the training data)
   ☞ optimization algorithm trapped in a local minimum
   ☞ can result in a highly inaccurate model
3. Uncertainty due to sampling variation (i.e. training data differ from the population distribution)
   ☞ model is fitted to the training data and may not reflect the true population
   ☞ another training sample would have led to different parameters
   ☞ especially critical for small samples (N < 10,000)
Bootstrapping
▶ An empirical approach to sampling variation:
  ▶ take many random samples from the same population
  ▶ estimate an LNRE model from each sample
  ▶ analyse the distribution of model parameters, goodness-of-fit, etc. (mean, median, s.d., boxplot, histogram, . . . )
  ▶ problem: how to obtain the additional samples?
▶ Bootstrapping (Efron 1979)
  ▶ resample from the observed data with replacement
  ▶ this approach is not suitable for type-token distributions (resamples underestimate the vocabulary size V!)
▶ Parametric bootstrapping
  ▶ use the fitted model to generate samples, i.e. sample from the population described by the model
  ▶ advantage: the "correct" parameter values are known
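A parametric bootstrap is then a few lines of code: draw replicate samples from the model population and recompute the statistics of interest. Illustrative Python sketch (the ZM parameters are hypothetical, and a real study would refit the LNRE model on every replicate rather than just recompute V and V1/V):

```python
import random
from collections import Counter

random.seed(7)

# population described by a fitted ZM model (hypothetical parameters)
a, b, S = 1.4, 20.0, 20_000
types = list(range(1, S + 1))
weights = [(i + b) ** -a for i in types]

def replicate(N=5000):
    """One parametric bootstrap replicate: sample N tokens from the model
    population and return the statistics of interest (here V and V1/V)."""
    freq = Counter(random.choices(types, weights=weights, k=N))
    V = len(freq)
    V1 = sum(1 for f in freq.values() if f == 1)
    return V, V1 / V

stats = [replicate() for _ in range(100)]        # 100 replicates
Vs = sorted(v for v, _ in stats)
print("median V:", Vs[50], "IQR:", Vs[25], "-", Vs[75])
```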
Bootstrapping: parametric bootstrapping with 100 replicates

Zipfian slope a = 1/α

[Figure: histogram of the estimated α across the 100 bootstrap replicates.]
Goodness-of-fit statistic X² (model not plausible for X² > 11)

[Figure: histogram of X² across the 100 replicates, with the p < 0.05 rejection region marked.]
Population diversity S

[Figure: histogram of the estimated population size S across the 100 replicates (0 to 100,000).]
Sample size matters! Brown corpus is too small for reliable LNRE parameter estimation (bare singulars)

[Figure: bootstrap densities of the Zipf slope a and of the estimated population size for BNC spoken (N = 6766) vs. Brown (N = 1005); the Brown-based estimates are far more dispersed.]
How well does Zipf’s law hold?
▶ The Z-M law seems to fit the first few thousand ranks very well, but then the slope of the empirical ranking becomes much steeper
  ▶ similar patterns have been found in many different data sets
▶ Various modifications and extensions have been suggested (Sichel 1971; Kornai 1999; Montemurro 2001)
  ▶ the mathematics of the corresponding LNRE models is often much more complex and numerically challenging
  ▶ may not have a closed form for E[V], E[Vm], or for the cumulative type distribution G(ρ) = ∫_ρ^∞ g(π) dπ
▶ E.g. the Generalized Inverse Gauss-Poisson (GIGP; Sichel 1971):

g(π) = ((2/bc)^(γ+1) / K_{γ+1}(b)) · π^(γ−1) · e^(−π/c − b²c/(4π))
The GIGP model (Sichel 1971)
[Figure: type densities g(π) of the fZM and GIGP models plotted against occurrence probability π on a logarithmic scale.]
Part 2 Non-randomness
How accurate is LNRE-based extrapolation? (Baroni and Evert 2005)

[Figure: extrapolation of V(N) for the suffix -bar from 25% and 50% of the data; observed corpus curve vs. ZM, fZM, and GIGP predictions.]
[Figure: the same extrapolation experiment for the LOB corpus (25% and 50% of the data, N in k tokens, V in k types), with a logN-based curve added for comparison.]
[Figure: extrapolation for the BNC from 1% and 10% of the data (N in M tokens, V in k types): observed corpus curve vs. ZM, fZM, GIGP, and logN.]
Reasons for poor extrapolation quality
▶ Major problem: non-randomness of corpus data
  ▶ LNRE modelling assumes that the corpus is a random sample
▶ Cause 1: repetition within texts
  ▶ most corpora use entire texts as the unit of sampling
  ▶ also referred to as "term clustering" or "burstiness"
  ▶ well-known in computational linguistics (Church 2000)
▶ Cause 2: non-homogeneous corpus
  ▶ cannot extrapolate from the spoken BNC to the written BNC
  ▶ similar for different genres and domains
  ▶ also within a single text, e.g. beginning vs. end of a novel
The ECHO correction (Baroni and Evert 2007)

▶ Empirical study: quality of extrapolation N0 → 4N0, starting from random samples of corpus texts
[Figure: relative error of E[V] vs. observed V on DEWAC for the standard ZM, fZM, and GIGP models at sample sizes N0, 2N0, 3N0 (errors up to about ±30%), and a scatter plot of goodness-of-fit X² against relative MSE for V(3N0).]
▶ ECHO correction: replace every repetition within the same text by the special type echo (⇒ type frequencies become document frequencies)
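The transformation itself is simple: within each text, only the first occurrence of a type keeps its identity, and every later repetition becomes the special type echo, so the remaining type frequencies are document frequencies. A minimal Python sketch (illustrative, not the original implementation):

```python
def echo_transform(documents):
    """Replace every within-document repetition of a type by 'echo'.
    documents: list of token lists (one list per text)."""
    out = []
    for doc in documents:
        seen = set()
        for tok in doc:
            if tok in seen:
                out.append("echo")
            else:
                seen.add(tok)
                out.append(tok)
    return out

docs = [["the", "cat", "saw", "the", "dog"],
        ["the", "dog", "barked", "dog"]]
print(echo_transform(docs))
# ['the', 'cat', 'saw', 'echo', 'dog', 'the', 'dog', 'barked', 'echo']
```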
[Figure: the same evaluation including the ECHO-corrected models fZM-echo and GIGP-echo and a partition-adjusted GIGP: relative error of E[V] vs. V on DEWAC at N0, 2N0, 3N0, and goodness-of-fit X² vs. relative MSE for V(3N0).]
Part 2 Conclusion & outlook
Future plans for zipfR
▶ More efficient LNRE sampling & parametric bootstrapping
▶ Improved parameter estimation (minimization algorithm)
▶ Better computational accuracy through numerical integration
▶ Extended Zipf-Mandelbrot LNRE model: piecewise power law
▶ Development of robust and interpretable productivity measures, using LNRE simulations
▶ Computationally expensive modelling (MCMC) for accurate inference from small samples
Thank you!
References I
Baayen, Harald (1991). A stochastic process for word frequency distributions. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 271–278.

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.

Baroni, Marco and Evert, Stefan (2005). Testing the extrapolation quality of word frequency models. In P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005, volume 1, no. 1 of Proceedings from the Corpus Linguistics Conference Series, Birmingham, UK. ISSN 1747-9398.

Baroni, Marco and Evert, Stefan (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 904–911, Prague, Czech Republic.

Brainerd, Barron (1982). On the relation between the type-token and species-area problems. Journal of Applied Probability, 19(4), 785–793.
Cao, Yong; Xiong, Fei; Zhao, Youjie; Sun, Yongke; Yue, Xiaoguang; He, Xin; Wang, Lichao (2017). Power law in random symbolic sequences. Digital Scholarship in the Humanities, 32(4), 733–738.
References II

Church, Kenneth W. (2000). Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². In Proceedings of COLING 2000, pages 173–179, Saarbrücken, Germany.
Efron, Bradley (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.

Evert, Stefan (2004). A simple LNRE model for random character sequences. In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2004), pages 411–422, Louvain-la-Neuve, Belgium.

Evert, Stefan and Baroni, Marco (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages 29–32, Prague, Czech Republic.

Evert, Stefan and Lüdeling, Anke (2001). Measuring morphological productivity: Is automatic preprocessing sufficient? In P. Rayson, A. Wilson, T. McEnery, A. Hardie, and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference, pages 167–175, Lancaster. UCREL.

Grieve, Jack; Carmody, Emily; Clarke, Isobelle; Gideon, Hannah; Heini, Annina; Nini, Andrea; Waibel, Emily (submitted). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. Submitted on May 26, 2017.
References III

Herdan, Gustav (1964). Quantitative Linguistics. Butterworths, London.

Kornai, András (1999). Zipf's law outside the middle range. In Proceedings of the Sixth Meeting on Mathematics of Language, pages 347–356, University of Central Florida.
Li, Wentian (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.
Mandelbrot, Benoît (1953). An informational theory of the statistical structure of languages. In W. Jackson (ed.), Communication Theory, pages 486–502. Butterworth, London.

Mandelbrot, Benoît (1962). On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, pages 190–219. American Mathematical Society, Providence, RI.

Miller, George A. (1957). Some effects of intermittent silence. The American Journal of Psychology, 52, 311–314.

Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300, 567–578.

Rouault, Alain (1978). Lois de Zipf et sources markoviennes. Annales de l'Institut H. Poincaré (B), 14, 169–188.
References IV

Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In N. F. Laubscher (ed.), Proceedings of the Third Symposium on Mathematical Statistics, pages 51–97, Pretoria, South Africa. C.S.I.R.
Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association, 70, 542–547.

Simon, Herbert A. (1955). On a class of skew distribution functions. Biometrika, 47(3/4), 425–440.

Tweedie, Fiona J. and Baayen, R. Harald (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323–352.

Yule, G. Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.

Zipf, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.

Zipf, George Kingsley (1965). The Psycho-biology of Language. MIT Press, Cambridge, MA.