What Every Computational Linguist Should Know About Type-Token Distributions and Zipf’s Law
Tutorial 1, 7 May 2018
Stefan Evert, FAU Erlangen-Nürnberg
http://zipfr.r-forge.r-project.org/lrec2018.html
Licensed under CC-by-sa version 3.0
Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 1 / 99
Outline

Part 1
- Motivation
- Descriptive statistics & notation
- Some examples (zipfR)
- LNRE models: intuition
- LNRE models: mathematics

Part 2
- Applications & examples (zipfR)
- Limitations
- Non-randomness
- Conclusion & outlook
Part 1 Motivation
Type-token statistics
- Type-token statistics differ from most statistical inference:
  - not about the probability of a specific event
  - but about the diversity of events and their probability distribution
- Relatively little work in statistical science
- Not a major research topic in computational linguistics either:
  - very specialized, usually plays an ancillary role in NLP
- But type-token statistics appear in a wide range of applications
  - often crucial for a sound analysis
⇒ the NLP community needs better awareness of statistical techniques, their limitations, and available software
Some research questions
- How many words did Shakespeare know?
- What is the coverage of my treebank grammar on big data?
- How many typos are there on the Internet?
- Is -ness more productive than -ity in English?
- Are there differences in the productivity of nominal compounds between academic writing and novels?
- Does Dickens use a more complex vocabulary than Rowling?
- Can a decline in lexical complexity predict Alzheimer’s disease?
- How frequent is a hapax legomenon from the Brown corpus?
- What is appropriate smoothing for my n-gram model?
- Who wrote the Bixby letter, Lincoln or Hay?
- How many different species of ... are there? (Brainerd 1982)
Some research questions (categories)
- coverage estimates
- productivity
- lexical complexity & stylometry
- prior & posterior distributions
- unexpected applications
Zipf’s law (Zipf 1949)
A) Frequency distributions in natural language are highly skewed
B) Curious relationship between rank & frequency

   word   r    f        r · f
   the    1    142,776  142,776
   and    2    100,637  201,274
   be     3     94,181  282,543
   of     4     74,054  296,216
   (frequency counts from Dickens)
C) Various explanations of Zipf’s law
- principle of least effort (Zipf 1949)
- optimal coding system, MDL (Mandelbrot 1953, 1962)
- random sequences (Miller 1957; Li 1992; Cao et al. 2017)
- Markov processes → n-gram models (Rouault 1978)
D) Language evolution: birth-death process (Simon 1955)
⇒ not the main topic today!
Part 1 Descriptive statistics & notation
Tokens & types
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
- N = 15: number of tokens = sample size
- V = 7: number of distinct types = vocabulary size
  (recently, very, not, otherwise, much, merely, now)

type-frequency list:
   w          f_w
   recently   1
   very       5
   not        3
   otherwise  1
   much       2
   merely     2
   now        1
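These descriptive statistics are easy to verify outside of zipfR as well; a minimal plain-Python sketch (the token list is copied from the slide):

```python
from collections import Counter

# the 15-token adverb sample from the slide
tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

freqs = Counter(tokens)    # type-frequency list: f_w for every type w
N = sum(freqs.values())    # sample size = number of tokens
V = len(freqs)             # vocabulary size = number of distinct types

print(N, V)                # 15 7
```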
Zipf ranking
Zipf ranking:
   w          r   f_r
   very       1   5
   not        2   3
   merely     3   2
   much       4   2
   now        5   1
   otherwise  6   1
   recently   7   1

[Figure: Zipf ranking of the adverb sample (frequency f_r against rank r)]
A realistic Zipf ranking: the Brown corpus
top frequencies:
   r    f      word
   1    69836  the
   2    36365  of
   3    28826  and
   4    26126  to
   5    23157  a
   6    21314  in
   7    10777  that
   8    10182  is
   9     9968  was
   10    9801  he

bottom frequencies:
   rank range      f   randomly selected examples
    7731 –  8271   10  schedules, polynomials, bleak
    8272 –  8922    9  tolerance, shaved, hymn
    8923 –  9703    8  decreased, abolish, irresistible
    9704 – 10783    7  immunity, cruising, titan
   10784 – 11985    6  geographic, lauro, portrayed
   11986 – 13690    5  grigori, slashing, developer
   13691 – 15991    4  sheath, gaulle, ellipsoids
   15992 – 19627    3  mc, initials, abstracted
   19628 – 26085    2  thar, slackening, deluxe
   26086 – 45215    1  beck, encompasses, second-place
[Figure: Zipf ranking of the full Brown corpus on doubly logarithmic axes]
Frequency spectrum
- pool types with f = 1 (hapax legomena), types with f = 2 (dis legomena), ..., f = m, ...
- V_1 = 3: number of hapax legomena (now, otherwise, recently)
- V_2 = 2: number of dis legomena (merely, much)
- general definition: V_m = |{w | f_w = m}|
Zipf ranking:
   w          r   f_r
   very       1   5
   not        2   3
   merely     3   2
   much       4   2
   now        5   1
   otherwise  6   1
   recently   7   1

frequency spectrum:
   m   V_m
   1   3
   2   2
   3   1
   5   1
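The spectrum can be computed the same way; a plain-Python sketch (same adverb sample as above, independent of zipfR):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()
freqs = Counter(tokens)

# frequency spectrum: V_m = |{w | f_w = m}|
spectrum = Counter(freqs.values())

print(sorted(spectrum.items()))   # [(1, 3), (2, 2), (3, 1), (5, 1)]
```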
[Figure: frequency spectrum of the adverb sample (barplot of V_m against m)]
A realistic frequency spectrum: the Brown corpus
[Figure: frequency spectrum of the Brown corpus (barplot of V_m for m = 1, ..., 15)]
Vocabulary growth curve
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
- N = 1: V(N) = 1, V_1(N) = 1
- N = 3: V(N) = 3, V_1(N) = 3
- N = 7: V(N) = 5, V_1(N) = 4
- N = 12: V(N) = 7, V_1(N) = 4
- N = 15: V(N) = 7, V_1(N) = 3
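The growth curve for this sample can be traced with a simple loop; a plain-Python sketch (zipfR computes the same thing with vec2vgc):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

counts = Counter()
V_of_N, V1_of_N = [], []          # V(N) and V_1(N) after each token
for w in tokens:
    counts[w] += 1
    V_of_N.append(len(counts))
    V1_of_N.append(sum(1 for c in counts.values() if c == 1))

# the data points listed on the slide (N = 1, 3, 7, 12, 15)
print([V_of_N[i] for i in (0, 2, 6, 11, 14)])    # [1, 3, 5, 7, 7]
print([V1_of_N[i] for i in (0, 2, 6, 11, 14)])   # [1, 3, 4, 4, 3]
```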
[Figure: vocabulary growth curve of the adverb sample, showing V(N) and V_1(N)]
A realistic vocabulary growth curve: the Brown corpus
[Figure: vocabulary growth curve of the Brown corpus, showing V(N) and V_1(N) up to N = 1 million]
Vocabulary growth in authorship attribution
- Authorship attribution by n-gram tracing applied to the case of the Bixby letter (Grieve et al. submitted)
- Word or character n-grams in the disputed text are compared against large “training” corpora from the candidate authors
Figure 1: One Gettysburg Address 2-word n-gram traces

“In addition to analysing 2-word n-grams, we also analysed 1-, 3- and 4-word n-grams, based on the average percentage of n-grams seen in 50 random 260,000-word samples of texts. The analysis was only run up to 4-word n-grams because from that point onward the Hay corpus contains none of the n-grams in the Gettysburg Address. The 3- and 4-word n-gram analyses also correctly attributed the Gettysburg Address to Lincoln: 18% of 3-grams for Lincoln vs. 14% of 3-grams for Hay and 2% of 4-grams for Lincoln vs. 0% of 4-grams for Hay. The 1-word n-gram analysis, however, incorrectly attributed the Gettysburg Address to Hay. Figure 3 presents the aggregated n-gram traces for all analyses. Notably, the 2-, 3- and 4-word n-gram analyses, which correctly attributed the document to Lincoln, appear to be far more definitive than the incorrect 1-word n-gram analysis.” (Grieve et al., manuscript)
Observing Zipf’s law
[Figure: Zipf rankings across languages and for different linguistic units]
Observing Zipf’s law
[Figure: Zipf ranking of the Italian prefix ri- in the la Repubblica corpus]
Observing Zipf’s law
- A straight line in double-logarithmic space corresponds to a power law for the original variables
- This leads to Zipf’s (1949; 1965) famous law:

  f_r = C / r^a
- Taking the logarithm on both sides, we obtain a linear relationship in log-space:

  log f_r = log C − a · log r
  (y = log f_r,  x = log r)

- Intuitive interpretation of a and C:
  - a is the slope, determining how fast log frequency decreases
  - log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 → log r = 0)
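The linear form suggests estimating a and C by least squares in log-space. A plain-Python sketch on noise-free synthetic data; the values C = 60,000, a = 1 and the 1000 ranks are illustration choices:

```python
import math

# Zipf's law f_r = C / r^a  =>  log f_r = log C - a * log r
C_true, a_true = 60000.0, 1.0
xs = [math.log(r) for r in range(1, 1001)]
ys = [math.log(C_true / r**a_true) for r in range(1, 1001)]

# ordinary least squares: slope = -a, intercept = log C
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

a_hat, C_hat = -slope, math.exp(intercept)
print(round(a_hat, 6), round(C_hat, 1))   # 1.0 60000.0
```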
Observing Zipf’s law
[Figure: least-squares fit = linear regression in log-space (Brown corpus)]
Zipf-Mandelbrot law (Mandelbrot 1953, 1962)
- Mandelbrot’s extra parameter:

  f_r = C / (r + b)^a
- Zipf’s law is the special case b = 0
- Assuming a = 1, C = 60,000, b = 1:
  - For the word with rank 1, Zipf’s law predicts a frequency of 60,000; Mandelbrot’s variation predicts a frequency of 30,000
  - For the word with rank 1,000, Zipf’s law predicts a frequency of 60; Mandelbrot’s variation predicts a frequency of 59.94
- The Zipf-Mandelbrot law forms the basis of statistical LNRE models
- The ZM law can be derived mathematically as the limiting distribution of the vocabulary generated by a character-level Markov process
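The rank-1 vs. rank-1000 comparison above can be verified in a few lines of Python:

```python
def zipf(r, C=60000.0, a=1.0):
    return C / r**a                # Zipf's law

def zipf_mandelbrot(r, C=60000.0, a=1.0, b=1.0):
    return C / (r + b)**a          # Mandelbrot's variation

print(zipf(1), zipf_mandelbrot(1))                   # 60000.0 30000.0
print(zipf(1000), round(zipf_mandelbrot(1000), 2))   # 60.0 59.94
```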
Zipf-Mandelbrot law
[Figure: non-linear least-squares fit of the ZM law (Brown corpus)]
Part 1 Some examples (zipfR)
zipfR (Evert and Baroni 2007)
- http://zipfR.R-Forge.R-Project.org/
- Conveniently available from the CRAN repository
- Package vignette = gentle tutorial introduction
First steps with zipfR
- Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
- Install the most recent version of the zipfR package
- Package, handouts, code samples & data sets are available from http://zipfr.r-forge.r-project.org/lrec2018.html
> library(zipfR)
> ?zipfR # documentation entry point
> vignette("zipfr-tutorial") # read the zipfR tutorial
Loading type-token data
- Most convenient input: sequence of tokens as a text file in vertical format (“one token per line”)
  - tokens mapped to appropriate types: normalized word forms, word pairs, lemmas, semantic classes, n-grams of POS tags, ...
  - language data should always be in UTF-8 encoding!
  - large files can be compressed (.gz, .bz2, .xz)
- Sample data: brown_adverbs.txt on the tutorial homepage
  - lowercased adverb tokens from the Brown corpus (in original order)
  ⇒ download and save to your working directory
> adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
> head(adv, 30)  # mathematically, a "vector" of tokens
> length(adv)    # sample size = 52,037 tokens
Descriptive statistics: type-frequency list
> adv.tfl <- vec2tfl(adv)
> adv.tfl
     k    f type
1    1 4859  not
2    2 2084  n’t
3    3 1464   so
4    4 1381 only
5    5 1374 then
6    6 1309  now
7    7 1134 even
8    8 1089   as
...
        N    V
    52037 1907
> N(adv.tfl)  # sample size
> V(adv.tfl)  # type count
Descriptive statistics: frequency spectrum
> adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
> adv.spc
    m  Vm
1   1 762
2   2 260
3   3 144
4   4  99
5   5  69
6   6  50
7   7  40
8   8  34
...
        N    V
    52037 1907
> N(adv.spc)  # sample size
> V(adv.spc)  # type count
Descriptive statistics: vocabulary growth
- A VGC lists the vocabulary size V(N) at different sample sizes N
- Optionally also spectrum elements V_m(N) up to m.max
> adv.vgc <- vec2vgc(adv, m.max=2)
- Visualize descriptive statistics with the plot method

> plot(adv.tfl)             # Zipf ranking
> plot(adv.tfl, log="xy")   # logarithmic scale recommended
> plot(adv.spc) # barplot of frequency spectrum
> plot(adv.vgc, add.m = 1:2) # vocabulary growth curve
Further example data sets
?Brown: words from the Brown corpus
?BrownSubsets: various subsets
?Dickens: words from novels by Charles Dickens
?ItaPref: Italian word-formation prefixes
?TigerNP: NP and PP patterns from the German Tiger treebank
?Baayen2001: frequency spectra from Baayen (2001)
?EvertLuedeling2001: German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)

Practice:
- Explore these data sets with descriptive statistics
- Try different plot options (see help pages ?plot.tfl, ?plot.spc, ?plot.vgc)
Part 1 LNRE models: intuition
Motivation
- Interested in the productivity of an affix, the vocabulary of an author, ...; not in a particular text or sample
  ⇒ statistical inference from sample to population
- Discrete frequency counts are difficult to capture with generalizations such as Zipf’s law
- Zipf’s law predicts many impossible types with 1 < f_r < 2
  ⇒ the population does not suffer from such quantization effects
LNRE models
- This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001)
- LNRE = Large Number of Rare Events
- LNRE uses various approximations and simplifications to obtain a tractable and elegant model
- Of course, we could also estimate the precise discrete distributions using MCMC simulations, but ...
  1. the LNRE model is usually a minor component of a complex procedure
  2. it is often applied to very large samples (N > 1 M tokens)
The LNRE population
- Population: set of S types w_i with occurrence probabilities π_i
- S = population diversity, which can be finite or infinite (S = ∞)
- Not interested in specific types → arrange by decreasing probability: π_1 ≥ π_2 ≥ π_3 ≥ ...
  ⇒ impossible to determine the probabilities of all individual types
- Normalization: π_1 + π_2 + ... + π_S = 1
- Need a parametric statistical model to describe the full population (esp. for S = ∞), i.e. a function i ↦ π_i
  - type probabilities π_i cannot be estimated reliably from a sample, but the parameters of this function can
- NB: population index i ≠ Zipf rank r
Examples of population models
[Figure: four example population models, plotting type probabilities π_k against k for k = 1, ..., 50]
The Zipf-Mandelbrot law as a population model
What is the right family of models for lexical frequency distributions?
- We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well
- Re-phrase the law for type probabilities:

  π_i := C / (i + b)^a

- Two free parameters: a > 1 and b ≥ 0
- C is not a parameter but a normalization constant, needed to ensure that Σ_i π_i = 1
- This is the Zipf-Mandelbrot population model
The parameters of the Zipf-Mandelbrot model
[Figure: π_k against k for (a = 1.2, b = 1.5), (a = 2, b = 10), (a = 2, b = 15), (a = 5, b = 40)]
[Figure: the same four Zipf-Mandelbrot models on doubly logarithmic axes]
The finite Zipf-Mandelbrot model (Evert 2004)
- The Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities π_i can become arbitrarily small
- π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)
- The finite Zipf-Mandelbrot model stops after the first S types
- Population diversity S becomes a parameter of the model
  → the finite Zipf-Mandelbrot model has 3 parameters

Abbreviations:
- ZM for the Zipf-Mandelbrot model
- fZM for the finite Zipf-Mandelbrot model
Sampling from a population model
Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:
[Figure: ZM population with a = 3, b = 50, shown on linear and doubly logarithmic axes]
Use computer simulation to generate random samples:
- Draw N tokens from the population such that in each step, type w_i has probability π_i of being picked
- This allows us to make predictions for samples (= corpora) of arbitrary size N
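This sampling scheme is easy to sketch in plain Python (zipfR generates such samples directly from a fitted LNRE model). The population below uses the slide's a = 3, b = 50; truncating at S = 1000 types and drawing N = 1000 tokens are added assumptions for the sketch:

```python
import random

random.seed(42)

# Zipf-Mandelbrot population: pi_i proportional to 1 / (i + b)^a
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]

# draw N = 1000 tokens; each draw picks type w_i with probability pi_i
sample = random.choices(range(1, S + 1), weights=probs, k=1000)

V = len(set(sample))    # vocabulary size of the simulated corpus
print(len(sample), V)
```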
Sampling from a population model
#1: 1 42 34 23 108 18 48 18 1 ...
    (time order room school town course area course time ...)
#2: 286 28 23 36 3 4 7 4 8 . . .
#3: 2 11 105 21 11 17 17 1 16 . . .
#4: 44 3 110 34 223 2 25 20 28 . . .
#5: 24 81 54 11 8 61 1 31 35 . . .
#6: 3 65 9 165 5 42 16 20 7 . . .
#7: 10 21 11 60 164 54 18 16 203 . . .
#8: 11 7 147 5 24 19 15 85 37 . . .
...
Samples: type frequency list & spectrum
sample #1

Zipf ranking:
   rank r   f_r   type i
   1        37    6
   2        36    1
   3        33    3
   4        31    7
   5        31    10
   6        30    5
   7        28    12
   8        27    2
   9        24    4
   10       24    16
   11       23    8
   12       22    14
   ...

frequency spectrum:
   m    V_m
   1    83
   2    22
   3    20
   4    12
   5    10
   6    5
   7    5
   8    3
   9    3
   10   3
   ...
sample #2

Zipf ranking:
   rank r   f_r   type i
   1        39    2
   2        34    3
   3        30    5
   4        29    10
   5        28    8
   6        26    1
   7        25    13
   8        24    7
   9        23    6
   10       23    11
   11       20    4
   12       19    17
   ...

frequency spectrum:
   m    V_m
   1    76
   2    27
   3    17
   4    10
   5    6
   6    5
   7    7
   8    3
   10   4
   11   2
   ...
Random variation in type-frequency lists
[Figure: type frequencies of samples #1 and #2, plotted against Zipf rank r (f_r) and against population index k (f_k)]
Random variation: frequency spectrum
[Figure: frequency spectra (V_m against m) of samples #1 to #4]
Random variation: vocabulary growth curve
[Figure: V(N) and V_1(N) for samples #1 to #4, N up to 1000]
Expected values
- There is no reason why we should choose a particular sample to compare to the real data or make a prediction – each one is equally likely or unlikely
- Take the average over a large number of samples, called the expected value or expectation in statistics
- Notation: E[V(N)] and E[V_m(N)]
  - indicates that we are referring to expected values for a sample of size N
  - rather than to the specific values V and V_m observed in a particular sample or a real-world data set
- Expected values can be calculated efficiently without generating thousands of random samples
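The last point can be illustrated numerically: averaging V over many simulated samples converges to the direct computation E[V] = Σ_i (1 − e^(−Nπ_i)) from the Poisson model derived later in this part. A plain-Python sketch; the population (S = 1000, a = 3, b = 50) and the settings N = 500, 200 runs are illustration choices:

```python
import math, random

random.seed(1)

# illustrative Zipf-Mandelbrot population
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]

N, runs = 500, 200

# Monte-Carlo estimate of E[V(N)]: average V over many random samples
mc = sum(len(set(random.choices(range(S), weights=probs, k=N)))
         for _ in range(runs)) / runs

# direct computation, no sampling needed: E[V] = sum_i (1 - exp(-N * pi_i))
direct = sum(1 - math.exp(-N * p) for p in probs)

print(round(mc, 1), round(direct, 1))   # the two values agree closely
```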
The expected frequency spectrum
[Figure: observed V_m vs. expected E[V_m] for samples #1 to #4]
The expected vocabulary growth curve
[Figure: observed V(N) and V_1(N) vs. expected E[V(N)] and E[V_1(N)] for sample #1]
Prediction intervals for the expected VGC
[Figure: expected VGCs for sample #1 with prediction intervals]

“Confidence intervals” indicate the predicted sampling distribution:
⇒ for 95% of samples generated by the LNRE model, the VGC will fall within the range delimited by the thin red lines
Parameter estimation by trial & error
[Figure: observed frequency spectrum and VGC vs. ZM model with a = 1.5, b = 7.5]
[Figure: observed vs. ZM model with a = 1.3, b = 7.5]
[Figure: observed vs. ZM model with a = 1.3, b = 0.2]
[Figure: observed vs. ZM model with a = 2, b = 550]
Automatic parameter estimation
[Figure: observed vs. fitted ZM model with a = 2.39, b = 1968.49]
- By trial & error we found a = 2.0 and b = 550
- The automatic estimation procedure yields a = 2.39 and b = 1968
Part 1 LNRE models: mathematics
The sampling model
- Draw a random sample of N tokens from the LNRE population
- Sufficient statistic: set of type frequencies {f_i}
  - because the tokens of a random sample have no ordering
- Joint multinomial distribution of {f_i}:

  Pr({f_i = k_i} | N) = (N! / (k_1! · · · k_S!)) · π_1^k_1 · · · π_S^k_S

- Approximation: do not condition on the fixed sample size N
  - N is now the average (expected) sample size
- The random variables f_i then have independent Poisson distributions:

  Pr(f_i = k_i) = e^(−Nπ_i) · (Nπ_i)^k_i / k_i!
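The quality of the Poisson approximation for a single type can be checked against the exact binomial marginal Binom(N, π); the values N = 10,000 and π = 0.0005 (so Nπ = 5) are illustration choices:

```python
import math

N, pi = 10000, 0.0005    # expected frequency N * pi = 5

def poisson(k):
    # Pr(f_i = k) = exp(-N*pi) * (N*pi)^k / k!
    lam = N * pi
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial(k):
    # exact marginal distribution of one type's frequency
    return math.comb(N, k) * pi**k * (1 - pi)**(N - k)

for k in (0, 5, 10):
    print(k, round(poisson(k), 6), round(binomial(k), 6))
```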
Frequency spectrum
- Key problem: we cannot determine f_i in the observed sample
  - because we don’t know which type w_i is
  - recall that the population ranking f_i ≠ the Zipf ranking f_r
- Use the spectrum {V_m} and the vocabulary size V as statistics
  - they contain all the information we have about the observed sample
- These can be expressed in terms of indicator variables:

  I_[f_i = m] = 1 if f_i = m, 0 otherwise

  V_m = Σ_{i=1}^{S} I_[f_i = m]

  V = Σ_{i=1}^{S} I_[f_i > 0] = Σ_{i=1}^{S} (1 − I_[f_i = 0])
The expected spectrum
- It is easy to compute expected values for the frequency spectrum (and variances, because the f_i are independent):

  E[I_[f_i = m]] = Pr(f_i = m) = e^(−Nπ_i) · (Nπ_i)^m / m!

  E[V_m] = Σ_{i=1}^{S} E[I_[f_i = m]] = Σ_{i=1}^{S} e^(−Nπ_i) · (Nπ_i)^m / m!

  E[V] = Σ_{i=1}^{S} E[1 − I_[f_i = 0]] = Σ_{i=1}^{S} (1 − e^(−Nπ_i))

- NB: V_m and V are not independent, because they are derived from the same random variables f_i
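A quick numerical sanity check of these formulas: since every sampled type falls into exactly one spectrum class, the expected spectrum must sum to the expected vocabulary, Σ_{m≥1} E[V_m] = E[V]. A plain-Python sketch with an illustrative ZM-style population:

```python
import math

# illustrative Zipf-Mandelbrot population
S, a, b = 1000, 3.0, 50.0
weights = [(i + b) ** -a for i in range(1, S + 1)]
Z = sum(weights)
probs = [w / Z for w in weights]
N = 500

def poisson_pmf(lam, m):
    # Pr(f_i = m), computed in log-space to avoid overflow for large m
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))

E_V = sum(1 - math.exp(-N * p) for p in probs)
spectrum_sum = sum(sum(poisson_pmf(N * p, m) for p in probs)
                   for m in range(1, 200))

print(round(E_V, 4), round(spectrum_sum, 4))   # identical up to truncation
```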
Sampling distribution of V_m and V
- The joint sampling distribution of {V_m} and V is complicated
- Approximation: V and {V_m} asymptotically follow a multivariate normal distribution
  - motivated by the multivariate central limit theorem: sum of many independent variables I_[f_i = m]
- Usually limited to the first spectrum elements, e.g. V_1, ..., V_15
  - the approximation of the discrete V_m by a continuous distribution is suitable only if E[V_m] is sufficiently large
- Parameters of the multivariate normal: µ = (E[V], E[V_1], E[V_2], ...) and Σ = covariance matrix

  Pr((V, V_1, ..., V_k) = v) ∼ exp(−(1/2) (v − µ)ᵀ Σ^(−1) (v − µ)) / √((2π)^(k+1) · det Σ)
Type density function
- The discrete sums of probabilities in E[V], E[V_m], ... are inconvenient and computationally expensive
- Approximation: continuous type density function g(π)

  |{w_i | a ≤ π_i ≤ b}| = ∫_a^b g(π) dπ

  Σ {π_i | a ≤ π_i ≤ b} = ∫_a^b π · g(π) dπ

- Normalization constraint: ∫_0^∞ π · g(π) dπ = 1
- Good approximation for low-probability types, but the probability mass of w_1, w_2, ... is “smeared out” over a range
[Figure: type density g(π) as a continuous approximation, contrasted with discrete type probabilities π_1 = .376, π_2 = .215, π_3 = .130, π_4 = .082]
ZM and fZM as LNRE models
- Discrete Zipf-Mandelbrot population:

  π_i := C / (i + b)^a   for i = 1, ..., S

- Corresponding type density function (Evert 2004):

  g(π) = C · π^(−α−1) for A ≤ π ≤ B, and g(π) = 0 otherwise

  with parameters
  - α = 1/a (0 < α < 1)
  - B = (1 − α) / (b · α)
  - 0 ≤ A < B, which determines S (ZM with S = ∞ for A = 0)
  ⇒ C is a normalization factor, not a parameter
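The mapping between the ZM parameters (a, b) and the LNRE parameters (α, B) can be wrapped in a tiny helper. Caveat: the relation B = (1 − α)/(b · α) used here follows the zipfR documentation of the ZM model; the numerical values a = 2, b = 550 are taken from the trial-and-error slide earlier:

```python
def zm_to_lnre(a, b):
    # alpha = 1/a; B = (1 - alpha) / (b * alpha)  [relation per the zipfR docs]
    alpha = 1.0 / a
    B = (1.0 - alpha) / (b * alpha)
    return alpha, B

alpha, B = zm_to_lnre(2.0, 550.0)
print(alpha, B)   # alpha = 0.5, B = 1/550
```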
[Figure: type density g(π) of a fitted fZM model, with π on a logarithmic scale]
Expectations as integrals
- Expected values can now be expressed as integrals over g(π):

  E[V_m] = ∫_0^∞ ((Nπ)^m / m!) · e^(−Nπ) · g(π) dπ

  E[V] = ∫_0^∞ (1 − e^(−Nπ)) · g(π) dπ

- These reduce to a simple closed form for ZM (approximation):

  E[V_m] = (C / m!) · N^α · Γ(m − α)

  E[V] = C · N^α · Γ(1 − α) / α

- For fZM, and for the exact ZM solution, use the incomplete Gamma function
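The ZM closed form for E[V] can be checked against direct numerical integration of the type density; the parameter values α = 0.5, B = 1/550 and N = 10^6 are illustration choices, and C = (1 − α)/B^(1−α) is the normalization constant of the density:

```python
import math

alpha, B = 0.5, 1.0 / 550.0
C = (1 - alpha) / B ** (1 - alpha)   # normalization of g(pi) = C * pi^(-alpha-1)
N = 1_000_000

# closed-form approximation: E[V] = C * N^alpha * Gamma(1 - alpha) / alpha
closed_form = C * N**alpha * math.gamma(1 - alpha) / alpha

# midpoint rule on a log-spaced grid for E[V] = int_0^B (1 - e^(-N pi)) g(pi) dpi
steps = 100_000
lo, hi = math.log(1e-14), math.log(B)
h = (hi - lo) / steps
numeric = 0.0
for j in range(steps):
    p = math.exp(lo + (j + 0.5) * h)
    numeric += (1 - math.exp(-N * p)) * C * p ** (-alpha - 1) * p * h

print(round(closed_form), round(numeric))   # agree to within a few percent
```

The small discrepancy comes from the closed form extending the upper integration limit from B to infinity.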
Parameter estimation from training corpus
▶ For ZM, α = E[V1]/E[V] ≈ V1/V can be estimated directly, but this estimator is prone to overfitting
▶ General parameter fitting by MLE: maximize the likelihood of the observed spectrum v

max_{α,A,B} Pr((V, V1, . . . , Vk) = v | α, A, B)

▶ Multivariate normal approximation:

min_{α,A,B} (v − µ)ᵀ Σ⁻¹ (v − µ)

▶ Minimization by gradient descent (BFGS, CG) or simplex search (Nelder-Mead)
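The direct estimator α ≈ V1/V takes only a few lines. Below is an illustrative Python sketch, not the zipfR fitting code: tokens are drawn from a hypothetical Zipf-Mandelbrot population whose parameters a, b, S are made up.

```python
import random
from collections import Counter

random.seed(42)

# hypothetical Zipf-Mandelbrot population: pi_i proportional to (i + b)^(-a)
a, b, S = 1.5, 5.0, 100_000
weights = [(i + b) ** -a for i in range(1, S + 1)]

N = 20_000
tokens = random.choices(range(1, S + 1), weights=weights, k=N)
freq = Counter(tokens)

V = len(freq)                                   # observed vocabulary size
V1 = sum(1 for f in freq.values() if f == 1)    # number of hapax legomena
alpha_hat = V1 / V                              # direct estimate of alpha
print(alpha_hat)                                # compare with 1/a ≈ 0.67
```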
Parameter estimation from training corpus
[Figure: two contour plots of the fitting criterion over the parameter space (α vs. log10(B)) for bare singular PPs in the BNC; the right panel shows the goodness-of-fit statistic X² (m = 10) with the optimum at (α, log10(B)) = (0.65, −2.11).]
Goodness-of-fit (Baayen 2001, Sec. 3.3)

▶ How well does the fitted model explain the observed data?
▶ For the multivariate normal distribution:

X² = (V − µ)ᵀ Σ⁻¹ (V − µ) ∼ χ²_{k+1}

where V = (V, V1, . . . , Vk)
⇒ multivariate chi-squared test of goodness-of-fit
  ▶ replace V by the observed v ⇒ test statistic x²
  ▶ must reduce df = k + 1 by the number of estimated parameters
▶ NB: significant rejection of the LNRE model for p < .05
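The quadratic form itself is straightforward to compute. The sketch below uses invented numbers throughout (v, µ, and a diagonal Σ are hypothetical; real spectrum covariances are not diagonal), purely to fix ideas:

```python
# hypothetical spectrum vector v = (V, V1, V2), model expectations mu,
# and a diagonal covariance (real spectrum covariances are not diagonal)
v      = [2039.0, 1421.0, 265.0]
mu     = [2010.0, 1395.0, 278.0]
sigma2 = [900.0, 750.0, 260.0]        # diagonal of Sigma (variances)

# X^2 = (v - mu)^T Sigma^{-1} (v - mu); with diagonal Sigma this is a plain sum
x2 = sum((vi - mi) ** 2 / s2 for vi, mi, s2 in zip(v, mu, sigma2))
# x2 would then be compared against a chi-squared distribution with
# df = k + 1 minus the number of estimated parameters
print(round(x2, 2))
```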
Coffee break!
Part 2 Applications & examples (zipfR)
Measuring morphological productivity (example from Evert and Lüdeling 2001)
[Figure: vocabulary growth curves V(N) for the German suffixes -bar, -sam, and -ös, for N up to 35,000 tokens.]
[Figure: observed vs. expected vocabulary growth, V(N) vs. E[V(N)] and V1(N) vs. E[V1(N)], for a fitted model with a = 1.45, b = 34.59, S = 20,587; N up to 350,000 tokens.]
Quantitative measures of productivity (Tweedie and Baayen 1998; Baayen 2001)

▶ Baayen's (1991) productivity index P (slope of the vocabulary growth curve):  P = V1 / N
▶ TTR = type-token ratio:  TTR = V / N
▶ Zipf-Mandelbrot slope:  a
▶ Herdan's law (1964):  C = log V / log N
▶ Yule (1944) / Simpson (1949):  K = 10,000 · (∑_m m² Vm − N) / N²
▶ Guiraud (1954):  R = V / √N
▶ Sichel (1975):  S = V2 / V
▶ Honoré (1979):  H = log N / (1 − V1/V)
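All of these measures are simple functions of N, V, and the frequency spectrum Vm. The helper below is an illustrative Python sketch (the function name is my own; zipfR provides such measures in R) applied to a toy spectrum:

```python
import math

def productivity_measures(spectrum):
    """Compute the measures above from a frequency spectrum {m: Vm}
    (Vm = number of types occurring exactly m times)."""
    V = sum(spectrum.values())                       # vocabulary size
    N = sum(m * Vm for m, Vm in spectrum.items())    # sample size
    V1 = spectrum.get(1, 0)
    V2 = spectrum.get(2, 0)
    return {
        "TTR": V / N,
        "P":   V1 / N,                               # Baayen's productivity index
        "C":   math.log(V) / math.log(N),            # Herdan
        "K":   10_000 * (sum(m * m * Vm for m, Vm in spectrum.items()) - N) / N ** 2,
        "R":   V / math.sqrt(N),                     # Guiraud
        "S":   V2 / V,                               # Sichel
        "H":   math.log(N) / (1 - V1 / V),           # Honoré
    }

# toy spectrum: 3 hapaxes and 1 type with frequency 2  =>  V = 4, N = 5
meas = productivity_measures({1: 3, 2: 1})
print(meas["TTR"], meas["K"])   # 0.8 800.0
```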
Productivity measures for bare singulars in the BNC
            spoken    written
V            2,039     12,876
N            6,766     85,750
K            86.84      28.57
R            24.79      43.97
S             0.13       0.15
C             0.86       0.83
P             0.21       0.08
TTR          0.301      0.150
a             1.18       1.27
pop. S      15,958     36,874

[Figure: vocabulary growth curves V(N) for bare singulars in the written and spoken BNC, N up to 85,750 tokens.]
Are these “lexical constants” really constant?
[Figure: eight panels plotting Yule's K, Guiraud's R, Sichel's S, Herdan's law C, Baayen's P, TTR, the Zipf slope a, and the estimated population size against sample size (N up to 5,000).]
Simulation experiments based on LNRE models
▶ Systematic study of size dependence and other properties of productivity measures, based on samples from an LNRE model
  ▶ LNRE model ⇒ well-defined population
  ▶ random sampling helps to assess the variability of the measures
  ▶ expected values E[P] etc. can often be computed directly (or approximated) ⇒ computationally efficient
⇒ LNRE models as tools for understanding productivity measures
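The size dependence of a "constant" such as TTR is easy to reproduce by such a simulation. An illustrative Python sketch with a hypothetical Zipfian population (the exponent and vocabulary size are made up, and plain random sampling stands in for the LNRE machinery):

```python
import random

random.seed(1)

# hypothetical Zipfian population over S types with weights ~ rank^(-1.2)
S = 20_000
weights = [r ** -1.2 for r in range(1, S + 1)]
types = list(range(1, S + 1))

def ttr_at(N):
    """Type-token ratio V/N of one random sample of size N."""
    sample = random.choices(types, weights=weights, k=N)
    return len(set(sample)) / N

for N in (100, 1000, 10_000):
    print(N, round(ttr_at(N), 3))
# the "constant" TTR drops steadily as the sample grows
```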
Simulation: sample size
[Figure: simulated behaviour of Yule-Simpson K, Guiraud R, Herdan's C, TTR, Baayen P, the Zipf slope a, Sichel S, and Honoré H as a function of sample size (N up to 25,000).]
Simulation: frequent lexicalized types
[Figure: the same eight measures (Yule-Simpson K, Guiraud R, Herdan C, TTR, Baayen P, Zipf slope a, Sichel S, Honoré H) plotted against the probability mass of frequent lexicalized types (0 to 0.4).]
interactive demo
Posterior distribution
[Figure: posterior distribution Pr(π | m = 1) for a ZM model with α = 0.4, plotted against expected frequency Nπ on a log scale; the MLE, 95% and 99.9% confidence intervals, and the Good-Turing estimate are marked; posterior and log-adjusted curves shown.]
[Figure: the corresponding posterior distribution Pr(π | m = 1) for a ZM model with α = 0.9.]
Part 2 Limitations
How reliable are the fitted models?
Three potential issues:
1. Model assumptions ≠ population (e.g. the distribution does not follow a Zipf-Mandelbrot law)
   ☞ model cannot be adequate, regardless of parameter settings
2. Parameter estimation unsuccessful (i.e. suboptimal goodness-of-fit to the training data)
   ☞ optimization algorithm trapped in a local minimum
   ☞ can result in a highly inaccurate model
3. Uncertainty due to sampling variation (i.e. training data differ from the population distribution)
   ☞ model is fitted to the training data and may not reflect the true population
   ☞ another training sample would have led to different parameters
   ☞ especially critical for small samples (N < 10,000)
Bootstrapping
▶ An empirical approach to sampling variation:
  ▶ take many random samples from the same population
  ▶ estimate an LNRE model from each sample
  ▶ analyse the distribution of model parameters, goodness-of-fit, etc. (mean, median, s.d., boxplot, histogram, . . . )
  ▶ problem: how to obtain the additional samples?
▶ Bootstrapping (Efron 1979)
  ▶ resample from the observed data with replacement
  ▶ this approach is not suitable for type-token distributions (resamples underestimate the vocabulary size V!)
▶ Parametric bootstrapping
  ▶ use the fitted model to generate samples, i.e. sample from the population described by the model
  ▶ advantage: the "correct" parameter values are known
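A parametric bootstrap is then a few lines of code: draw replicate samples from the model population and recompute the statistics of interest. Illustrative Python sketch (the ZM parameters are hypothetical, and a real study would refit the LNRE model on every replicate rather than just recompute V and V1/V):

```python
import random
from collections import Counter

random.seed(7)

# population described by a fitted ZM model (hypothetical parameters)
a, b, S = 1.4, 20.0, 20_000
types = list(range(1, S + 1))
weights = [(i + b) ** -a for i in types]

def replicate(N=5000):
    """One parametric bootstrap replicate: sample N tokens from the model
    population and return the statistics of interest (here V and V1/V)."""
    freq = Counter(random.choices(types, weights=weights, k=N))
    V = len(freq)
    V1 = sum(1 for f in freq.values() if f == 1)
    return V, V1 / V

stats = [replicate() for _ in range(100)]        # 100 replicates
Vs = sorted(v for v, _ in stats)
print("median V:", Vs[50], "IQR:", Vs[25], "-", Vs[75])
```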
Bootstrapping: parametric bootstrapping with 100 replicates

Zipfian slope a = 1/α

[Figure: histogram of the estimated α across the 100 bootstrap replicates.]
Goodness-of-fit statistic X² (model not plausible for X² > 11)

[Figure: histogram of X² across the 100 replicates, with the p < 0.05 rejection region marked.]
Population diversity S

[Figure: histogram of the estimated population size S across the 100 replicates (0 to 100,000).]
Sample size matters! Brown corpus is too small for reliable LNRE parameter estimation (bare singulars)

[Figure: bootstrap densities of the Zipf slope a and of the estimated population size for BNC spoken (N = 6766) vs. Brown (N = 1005); the Brown-based estimates are far more dispersed.]
How well does Zipf’s law hold?
▶ The Z-M law seems to fit the first few thousand ranks very well, but then the slope of the empirical ranking becomes much steeper
  ▶ similar patterns have been found in many different data sets
▶ Various modifications and extensions have been suggested (Sichel 1971; Kornai 1999; Montemurro 2001)
  ▶ the mathematics of the corresponding LNRE models is often much more complex and numerically challenging
  ▶ may not have a closed form for E[V], E[Vm], or for the cumulative type distribution G(ρ) = ∫_ρ^∞ g(π) dπ
▶ E.g. the Generalized Inverse Gauss-Poisson (GIGP; Sichel 1971):

g(π) = ((2/bc)^(γ+1) / K_{γ+1}(b)) · π^(γ−1) · e^(−π/c − b²c/(4π))
The GIGP model (Sichel 1971)
[Figure: type densities g(π) of the fZM and GIGP models plotted against occurrence probability π on a logarithmic scale.]
Part 2 Non-randomness
How accurate is LNRE-based extrapolation? (Baroni and Evert 2005)

[Figure: extrapolation of V(N) for the suffix -bar from 25% and 50% of the data; observed corpus curve vs. ZM, fZM, and GIGP predictions.]
[Figure: the same extrapolation experiment for the LOB corpus (25% and 50% of the data, N in k tokens, V in k types), with a logN-based curve added for comparison.]
[Figure: extrapolation for the BNC from 1% and 10% of the data (N in M tokens, V in k types): observed corpus curve vs. ZM, fZM, GIGP, and logN.]
Reasons for poor extrapolation quality
▶ Major problem: non-randomness of corpus data
  ▶ LNRE modelling assumes that the corpus is a random sample
▶ Cause 1: repetition within texts
  ▶ most corpora use entire texts as the unit of sampling
  ▶ also referred to as "term clustering" or "burstiness"
  ▶ well-known in computational linguistics (Church 2000)
▶ Cause 2: non-homogeneous corpus
  ▶ cannot extrapolate from the spoken BNC to the written BNC
  ▶ similar for different genres and domains
  ▶ also within a single text, e.g. beginning vs. end of a novel
The ECHO correction (Baroni and Evert 2007)

▶ Empirical study: quality of extrapolation N0 → 4N0, starting from random samples of corpus texts
[Figure: relative error of E[V] vs. observed V on DEWAC for the standard ZM, fZM, and GIGP models at sample sizes N0, 2N0, 3N0 (errors up to about ±30%), and a scatter plot of goodness-of-fit X² against relative MSE for V(3N0).]
▶ ECHO correction: replace every repetition within the same text by the special type echo (⇒ type frequencies become document frequencies)
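The transformation itself is simple: within each text, only the first occurrence of a type keeps its identity, and every later repetition becomes the special type echo, so the remaining type frequencies are document frequencies. A minimal Python sketch (illustrative, not the original implementation):

```python
def echo_transform(documents):
    """Replace every within-document repetition of a type by 'echo'.
    documents: list of token lists (one list per text)."""
    out = []
    for doc in documents:
        seen = set()
        for tok in doc:
            if tok in seen:
                out.append("echo")
            else:
                seen.add(tok)
                out.append(tok)
    return out

docs = [["the", "cat", "saw", "the", "dog"],
        ["the", "dog", "barked", "dog"]]
print(echo_transform(docs))
# ['the', 'cat', 'saw', 'echo', 'dog', 'the', 'dog', 'barked', 'echo']
```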
[Figure: the same evaluation including the ECHO-corrected models fZM-echo and GIGP-echo and a partition-adjusted GIGP: relative error of E[V] vs. V on DEWAC at N0, 2N0, 3N0, and goodness-of-fit X² vs. relative MSE for V(3N0).]
Part 2 Conclusion & outlook
Future plans for zipfR
▶ More efficient LNRE sampling & parametric bootstrapping
▶ Improved parameter estimation (minimization algorithm)
▶ Better computational accuracy through numerical integration
▶ Extended Zipf-Mandelbrot LNRE model: piecewise power law
▶ Development of robust and interpretable productivity measures, using LNRE simulations
▶ Computationally expensive modelling (MCMC) for accurate inference from small samples
Thank you!
References I
Baayen, Harald (1991). A stochastic process for word frequency distributions. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 271–278.

Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.

Baroni, Marco and Evert, Stefan (2005). Testing the extrapolation quality of word frequency models. In P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005, volume 1, no. 1 of Proceedings from the Corpus Linguistics Conference Series, Birmingham, UK. ISSN 1747-9398.

Baroni, Marco and Evert, Stefan (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 904–911, Prague, Czech Republic.

Brainerd, Barron (1982). On the relation between the type-token and species-area problems. Journal of Applied Probability, 19(4), 785–793.
Cao, Yong; Xiong, Fei; Zhao, Youjie; Sun, Yongke; Yue, Xiaoguang; He, Xin; Wang, Lichao (2017). Power law in random symbolic sequences. Digital Scholarship in the Humanities, 32(4), 733–738.
References II

Church, Kenneth W. (2000). Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². In Proceedings of COLING 2000, pages 173–179, Saarbrücken, Germany.
Efron, Bradley (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.

Evert, Stefan (2004). A simple LNRE model for random character sequences. In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2004), pages 411–422, Louvain-la-Neuve, Belgium.

Evert, Stefan and Baroni, Marco (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages 29–32, Prague, Czech Republic.

Evert, Stefan and Lüdeling, Anke (2001). Measuring morphological productivity: Is automatic preprocessing sufficient? In P. Rayson, A. Wilson, T. McEnery, A. Hardie, and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference, pages 167–175, Lancaster. UCREL.

Grieve, Jack; Carmody, Emily; Clarke, Isobelle; Gideon, Hannah; Heini, Annina; Nini, Andrea; Waibel, Emily (submitted). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. Submitted on May 26, 2017.
References III

Herdan, Gustav (1964). Quantitative Linguistics. Butterworths, London.

Kornai, András (1999). Zipf's law outside the middle range. In Proceedings of the Sixth Meeting on Mathematics of Language, pages 347–356, University of Central Florida.
Li, Wentian (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.
Mandelbrot, Benoît (1953). An informational theory of the statistical structure of languages. In W. Jackson (ed.), Communication Theory, pages 486–502. Butterworth, London.

Mandelbrot, Benoît (1962). On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, pages 190–219. American Mathematical Society, Providence, RI.

Miller, George A. (1957). Some effects of intermittent silence. The American Journal of Psychology, 52, 311–314.

Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300, 567–578.

Rouault, Alain (1978). Lois de Zipf et sources markoviennes. Annales de l'Institut H. Poincaré (B), 14, 169–188.
References IV

Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In N. F. Laubscher (ed.), Proceedings of the Third Symposium on Mathematical Statistics, pages 51–97, Pretoria, South Africa. C.S.I.R.
Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association, 70, 542–547.

Simon, Herbert A. (1955). On a class of skew distribution functions. Biometrika, 47(3/4), 425–440.

Tweedie, Fiona J. and Baayen, R. Harald (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323–352.

Yule, G. Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.

Zipf, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.

Zipf, George Kingsley (1965). The Psycho-biology of Language. MIT Press, Cambridge, MA.