
Statistical Measures for Corpus Profiling

Michael P. Oakes

University of Sunderland

Corpus Profiling Workshop, 2008.

Contents

• Why study differences between corpora? (Kilgarriff, 2001)
• Case study in parsing (Sekine, 1997)
• Words and “countable linguistic features”
• Overall differences between corpora and contributions of individual features:
  – Information theory
  – Chi-squared test
  – Factor Analysis
• “Gold standard” comparison of measures (Kilgarriff, 2001)

Why study differences between corpora?

• Kilgarriff (2001), “Comparing Corpora”, Int. J. Corpus Linguistics 6(1), pp. 97-133.

• Taxonomise the field: how does a new corpus stand in relation to existing ones?

• If an interesting result is found for one corpus, for which other corpora does it also hold?

• Is a new corpus sufficiently different from ones you have already got to be worth acquiring?

• Difficulty in porting a new corpus to an existing NLP system: time and cost are measurable.

Different Text Types

• Englishes of the world, e.g. US vs. UK (Hofland and Johansson, 1982)

• Social differentiation e.g. gender, age, social class (Rayson, Leech and Hodges 1997), diachronic, geographical location.

• Stylometry, e.g. disputed authorship
• Genre analysis, e.g. science fiction, e-shop (Santini, 2006)
• Sentiment analysis (Westerveld, 2008)
• Relevant vs. non-relevant documents? Probabilistic IR.
• Statistical techniques exist to discriminate between these text types. Here the interest is in the types of language per se, rather than their amenability to NLP tools.

Words and countable linguistic features

• Bits of words, e.g. 2-grams (Kjell, 1994)
• Words (many studies)
• Linguistic features for Factor Analysis (Biber, 1995), e.g. questions, past participles
• Phrase rewrite rules (Sekine, 1997; Baayen, van Halteren and Tweedie, 1996)
• Any countable feature characteristic of one corpus as opposed to another
• Not hapax legomena, Semitisms in the New Testament

Domain independence of parsing (Sekine, 1997)

• Used 8 genres from the Brown Corpus, chosen to give equal amounts of fiction (K, L, N, P) and non-fiction (A, B, E, J).
• Characterised each domain by the production rules which fire.
• From this data produced a matrix of the cross-entropy of grammar across domains.
• Average linking of the domains based on the cross-entropy matrix gave intuitively reasonable results.
• Evaluated the effect of (training / test) corpus difference on parser performance.
• Discussed the size of the training corpus.

Cross-entropy of grammar: H(p, q) = − Σ_x p(x) · log2 q(x)
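A minimal Python sketch of this cross-entropy calculation over production-rule counts follows; the rule names and counts are invented for illustration, and the smoothing of unseen rules is an assumption rather than Sekine's actual procedure.

import math
from collections import Counter

def cross_entropy(p_counts, q_counts, smoothing=1e-6):
    # H(p, q) = -sum_x p(x) * log2 q(x), where p comes from one domain's
    # rule counts and q from another's. Rules unseen in q get a small
    # smoothed probability so that log2 never sees zero.
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    rules = set(p_counts) | set(q_counts)
    h = 0.0
    for rule in rules:
        p = p_counts[rule] / p_total
        if p == 0.0:
            continue
        q = (q_counts[rule] + smoothing) / (q_total + smoothing * len(rules))
        h -= p * math.log2(q)
    return h

# Toy rule counts (invented, not taken from the Brown Corpus):
domain_a = Counter({"PP -> IN NP": 840, "NP -> DT NNX": 381, "S -> NP VP": 428})
domain_b = Counter({"NP -> PRP": 952, "PP -> IN NP": 579, "S -> NP VP": 577})
print(cross_entropy(domain_a, domain_b))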

Broad Text Category   Genre                                     Texts in Brown   Texts in LOB
Press                 A  Reportage                                          44             44
                      B  Editorial                                          27             27
                      C  Reviews                                            17             17
General Prose         D  Religion                                           17             17
                      E  Skills, Trades, Hobbies                            36             38
                      F  Popular Lore                                       48             44
                      G  Belles Lettres, Biographies, Essays                75             77
                      H  Miscellaneous                                      30             30
                      J  Academic Prose                                     80             80
Fiction               K  General Fiction                                    29             29
                      L  Mystery and Detective                              24             24
                      M  Science Fiction                                     6              6
                      N  Adventure and Western                              29             29
                      P  Romance and Love Story                             29             29
                      R  Humour                                              9              9

Sekine characterised domains by production rules which fire

Domain A                     Domain B
PP → IN NP     (8.40%)       NP → PRP       (9.52%)
NP → NN PX     (5.42%)       PP → IN NP     (5.79%)
S  → S         (5.06%)       S  → NP VP     (5.77%)
S  → NP VP     (4.28%)       S  → S         (5.37%)
NP → DT NNX    (3.81%)       NP → DT NNX    (3.90%)

Sekine: Cross-Entropy of Grammar Across Domains

T/M     A      B      E      J      K      L      N      P
A     5.13   5.35   5.41   5.45   5.51   5.52   5.53   5.55
B     5.47   5.19   5.50   5.51   5.55   5.58   5.60   5.60
E     5.50   5.48   5.20   5.48   5.58   5.59   5.58   5.61
J     5.39   5.37   5.35   5.15   5.52   5.57   5.58   5.61
K     5.32   5.25   5.31   5.41   4.95   5.14   5.15   5.17
L     5.32   5.25   5.31   5.41   4.95   5.14   5.15   5.17
N     5.29   5.25   5.28   5.43   5.10   5.06   4.89   5.12
P     5.43   5.36   5.40   5.55   5.23   5.21   5.21   5.00
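As a rough illustration of the average-linking step, the sketch below feeds the cross-entropy matrix above into SciPy's average-link hierarchical clustering; symmetrising the matrix and zeroing its diagonal are assumptions made here so it can be treated as a distance matrix, not details taken from Sekine's paper.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

domains = ["A", "B", "E", "J", "K", "L", "N", "P"]
# Cross-entropy of grammar across domains, copied from the table above
# (T/M orientation as given there).
ce = np.array([
    [5.13, 5.35, 5.41, 5.45, 5.51, 5.52, 5.53, 5.55],
    [5.47, 5.19, 5.50, 5.51, 5.55, 5.58, 5.60, 5.60],
    [5.50, 5.48, 5.20, 5.48, 5.58, 5.59, 5.58, 5.61],
    [5.39, 5.37, 5.35, 5.15, 5.52, 5.57, 5.58, 5.61],
    [5.32, 5.25, 5.31, 5.41, 4.95, 5.14, 5.15, 5.17],
    [5.32, 5.25, 5.31, 5.41, 4.95, 5.14, 5.15, 5.17],
    [5.29, 5.25, 5.28, 5.43, 5.10, 5.06, 4.89, 5.12],
    [5.43, 5.36, 5.40, 5.55, 5.23, 5.21, 5.21, 5.00],
])

# Symmetrise and zero the diagonal so the matrix behaves like a distance matrix.
dist = (ce + ce.T) / 2
np.fill_diagonal(dist, 0.0)

# Average-link (UPGMA) clustering of the eight domains.
tree = linkage(squareform(dist, checks=False), method="average")
print(tree)  # merge history; scipy.cluster.hierarchy.dendrogram(tree, labels=domains) would draw it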

Overall differences between corpora and contributions of individual features.

• Vocabulary richness (e.g. type/token ratio, Yule’s K Characteristic, V2/N) is a characteristic of the entire corpus. Puts all corpora on a linear scale.

• The techniques we will look at (chi-squared, information theoretic and factor analysis) can both give a value for the overall difference between two corpora, and quantify the contributions made by individual features.

Measures of Vocabulary Richness

• Yule’s K characteristic: K = 10000 * (M2 - M1) / (M1 * M1), where M1 = number of tokens and M2 = (V1 * 1²) + (V2 * 2²) + (V3 * 3²) + …, with Vr the number of types occurring exactly r times.

• Gerson 35.9, Kempis 59.7, De Imitatione Christi 84.2

• Heaps’ Law: vocabulary size as a function of text size, M = kT^b. The parameters k and b could discriminate texts, and allow them to be plotted in two dimensions.

• Entropy is a form of vocabulary richness (but high individual contributions from both common and rare words).

Entropy = − Σ_i p(i) · log2 p(i)
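A small Python sketch of these measures (type/token ratio, Yule's K and entropy) computed from a tokenised text; the toy token list is just for illustration.

import math
from collections import Counter

def vocabulary_richness(tokens):
    # Type/token ratio, Yule's K and entropy for a list of word tokens.
    freqs = Counter(tokens)
    m1 = len(tokens)                         # M1: number of tokens
    # M2 = sum over frequency classes r of Vr * r^2, which equals the sum
    # of squared frequencies over the word types.
    m2 = sum(count ** 2 for count in freqs.values())
    yules_k = 10000 * (m2 - m1) / (m1 * m1)
    entropy = -sum((c / m1) * math.log2(c / m1) for c in freqs.values())
    return {"type_token_ratio": len(freqs) / m1,
            "yules_k": yules_k,
            "entropy": entropy}

print(vocabulary_richness("the cat sat on the mat and the cat slept".split()))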

The chi-squared test (Oakes and Farrow, 2006): (O − E)² / E values for three words in five balanced corpora (Σ (O − E)²/E = 414916.8)

Word            Australian   British      US   Indian     NZ
A                    12.68      1.36    2.55    76.65   8.33
Commonwealth        399.63     31.20   32.95    19.84   2.16
zzzzooop                 -         -       -        -      -
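A hedged sketch of the per-cell (O − E)² / E computation: expected counts are derived from row and column totals in the usual contingency-table way, which is assumed here rather than quoted from Oakes and Farrow (2006), and the word counts are invented.

import numpy as np

def chi_squared_contributions(observed):
    # (O - E)^2 / E for each cell of a word-by-corpus frequency table.
    # E is row total * column total / grand total, i.e. each word is
    # expected to be spread across the corpora in proportion to corpus size.
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    return (observed - expected) ** 2 / expected

# Toy observed counts (invented): rows = words, columns = corpora.
counts = [[120, 150, 140, 90, 130],
          [40, 10, 12, 8, 15]]
cells = chi_squared_contributions(counts)
print(cells)        # per-word, per-corpus contributions
print(cells.sum())  # overall chi-squared value for the whole table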

Measures from Information Theory (Dagan et al., 1997)

• Kullback Leibler (KL) divergence (also called relative entropy) used as a measure of semantic similarity by Dagan et al., 1997.

• Meaning in coding theory.
• Problems: the divergence is infinite if a word has frequency 0 in corpus B but > 0 in corpus A, and the measure is not symmetrical.

• Dagan (1997), Information Radius.

KL divergence:        D(p || q) = Σ_i p_i · log2 (p_i / q_i)

Information Radius:   IRad(p, q) = D(p || (p + q)/2) + D(q || (p + q)/2)
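A minimal sketch of the two measures applied to word-probability vectors over a shared vocabulary; the example vectors are invented, and zero probabilities are handled exactly as the definitions above imply (the KL divergence can be infinite, the Information Radius cannot).

import numpy as np

def kl_divergence(p, q):
    # D(p || q) = sum_i p_i * log2(p_i / q_i); terms with p_i = 0 contribute
    # nothing, and the result is infinite if some q_i = 0 where p_i > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def information_radius(p, q):
    # IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2): symmetric and always finite.
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return kl_divergence(p, m) + kl_divergence(q, m)

# Toy word-probability vectors (invented) over a four-word vocabulary.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.3, 0.2, 0.1]
print(information_radius(p, q))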

Information Radius

• L (Fiction: detective) and P (Fiction: romance): 0.180

• A (Press reportage) and B (Press editorial): 0.257

• J (Academic prose) and P (Fiction: romance): 0.572

Detective versus Romantic Fiction

Word     Detective   Romance      Word     Detective   Romance
The         .00821   -.00732      Her         .00819   -.00522
Of          .00308   -.00277      She         .00784   -.00535
A           .00280   -.00257      You         .00453   -.00345
Was         .00180   -.00172      To          .00235   -.00229
It          .00161   -.00148      Be          .00128   -.00110
He          .00157   -.00148      They        .00126   -.00097
On          .00110   -.00099      Would       .00121   -.00097
Been        .00106   -.00089      Are         .00087   -.00056
Man         .00089   -.00061      Your        .00084   -.00062
Money       .00065   -.00034      Love        .00081   -.00039

Factor Analysis

• Decathlon analogy: running, jumping and throwing.
• Biber (1988): groups of countable features which consistently co-occur in texts are said to define a “linguistic dimension”.
• Such features are said to have positive loadings with respect to that dimension, but dimensions can also be defined by features which are in “complementary distribution”, i.e. negatively loaded.
• Example: at one pole is “many pronouns and contractions”, near which lie conversational texts and panel discussions. At the other pole, “few pronouns and contractions”, lie scientific texts and fiction.
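As an illustrative sketch only (not Biber's procedure), scikit-learn's FactorAnalysis can extract such dimensions and feature loadings from a texts-by-features count matrix; the matrix below is randomly generated stand-in data.

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Stand-in feature counts: rows = texts, columns = countable features
# (e.g. pronouns, contractions, questions, past participles).
rng = np.random.default_rng(0)
X = rng.poisson(lam=[20.0, 15.0, 5.0, 8.0], size=(50, 4)).astype(float)

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)   # where each text lies on each dimension
loadings = fa.components_      # positive / negative feature loadings per dimension
print(loadings)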

Evaluation of Measures (Kilgarriff 2001)

• Reference corpus made up of known proportions of two corpora: 100% A, 0% B; 90% A, 10% B; 80% A, 20% B …

• This gives a set of “gold standard” judgements: subcorpus 1 is more like subcorpus 2 than subcorpus 3, etc.

• Compare machine ranking of corpora with the gold standard ranking using Spearman’s rank correlation coefficient.
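A small sketch of this final comparison step, using Spearman's rank correlation from SciPy; both rankings below are invented purely for illustration.

from scipy.stats import spearmanr

# Gold-standard ordering of the mixed subcorpora by their known proportion
# of corpus B, versus the ordering produced by some similarity measure.
gold_ranking = [1, 2, 3, 4, 5, 6]
measure_ranking = [1, 3, 2, 4, 6, 5]

rho, p_value = spearmanr(gold_ranking, measure_ranking)
print(rho)  # 1.0 would mean the measure reproduces the gold standard exactly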

Conclusions

• Some measures allow comparisons of entire corpora, others enable the identification of typical features.

• Different measures allow different kinds of maps: vocabulary richness ranks corpora on a linear scale; Heaps’ Law gives a 2D map of two parameters; information-theoretic measures give the (dis)similarity between two corpora, best viewed using clustering. With Factor Analysis, you don’t know what the dimensions are until you’ve done it.

• Such maps enable contours of application success to be drawn.