Statistical Measures for Corpus Profiling. Michael P. Oakes University of Sunderland Corpus Profiling Workshop, 2008. Contents. Why study differences between corpora? (Kilgarriff, 2001) Case Study in parsing (Sekine, 1997). Words and “countable linguistic features”. - PowerPoint PPT Presentation
Statistical Measures for Corpus Profiling
Michael P. Oakes
University of Sunderland
Corpus Profiling Workshop, 2008.
Contents
• Why study differences between corpora? (Kilgarriff, 2001)
• Case Study in parsing (Sekine, 1997).
• Words and “countable linguistic features”.
• Overall differences between corpora and contributions of individual features:
– Information theory
– Chi-squared test
– Factor Analysis
• “Gold standard” comparison of measures (Kilgarriff, 2001).
Why study differences between corpora?
• Kilgarriff (2001), “Comparing Corpora”, Int. J. Corpus Linguistics 6(1), pp. 97-133.
• Taxonomise the field: how does a new corpus stand in relation to existing ones?
• If an interesting finding is found for one corpus, for what other corpora does it hold?
• Is a new corpus sufficiently different from ones you have already got to be worth acquiring?
• Difficulty in porting a new corpus to an existing NLP system: time and cost are measurable.
Different Text Types
• Englishes of the world, e.g. US vs. UK (Hofland and Johansson, 1982)
• Social differentiation e.g. gender, age, social class (Rayson, Leech and Hodges 1997), diachronic, geographical location.
• Stylometry, e.g. disputed authorship
• Genre analysis, e.g. science fiction, e-shop (Santini, 2006)
• Sentiment analysis (Westerveld, 2008).
• Relevant vs. non-relevant documents? Probabilistic IR.
• Statistical techniques exist to discriminate between these text types. Here the interest is in the types of language per se, rather than their amenability to NLP tools.
Words and countable linguistic features
• Bits of words, e.g. 2-grams (Kjell, 1994)
• Words (many studies)
• Linguistic features for Factor Analysis (Biber, 1995), e.g. questions, past participles.
• Phrase rewrite rules (Sekine, 1997; Baayen, van Halteren and Tweedie, 1996).
• Any countable feature characteristic of one corpus as opposed to another.
• Not hapax legomena, Semitisms in the New Testament.
Domain independence of parsing (Sekine, 1997)
• Used 8 genres from the Brown Corpus, chosen to give equal amounts of fiction (K, L, N, P) and non-fiction (A, B, E, J).
• Characterised domains by the production rules which fire.
• From this data, produced a matrix of the cross entropy of grammar across domains.
• Average linking of the domains based on the cross-entropy matrix then gave intuitively reasonable results.
• Evaluated the effect of (training / test) corpus difference on parser performance.
• Discussed the size of the training corpus.
$$H(X, q) = -\sum_{x} p(x) \log_2 q(x)$$
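As a rough illustration of the cross-entropy measure, H(X, q) = -Σ p(x) log2 q(x), the sketch below compares two rule distributions. It assumes p is the rule distribution of the domain being described and q that of the domain supplying the model. The rule counts are hypothetical toy values, not Sekine's data, and the `floor` probability for unseen rules is an assumption made only to avoid log2(0).

```python
import math
from collections import Counter

def rule_distribution(rule_counts):
    """Turn raw production-rule counts into relative frequencies."""
    total = sum(rule_counts.values())
    return {rule: n / total for rule, n in rule_counts.items()}

def cross_entropy(p, q, floor=1e-6):
    """H(X, q) = -sum_x p(x) * log2 q(x).

    `floor` is an assumed stand-in probability for rules unseen in q;
    it avoids log2(0) but is not a proper smoothing scheme.
    """
    return -sum(px * math.log2(q.get(x, floor)) for x, px in p.items() if px > 0)

# Hypothetical toy counts for two domains (not taken from the Brown Corpus).
domain_a = rule_distribution(Counter({"PP -> IN NP": 8, "NP -> DT NNX": 4, "S -> NP VP": 4}))
domain_b = rule_distribution(Counter({"PP -> IN NP": 4, "NP -> DT NNX": 4, "S -> NP VP": 8}))

print(cross_entropy(domain_a, domain_a))  # 1.5: the entropy of domain A itself
print(cross_entropy(domain_a, domain_b))  # 1.75: B's distribution fits A less well
```

Computing this value for every ordered pair of domains would give a matrix of the kind used for the clustering step; the matrix is not symmetric, since H(X, q) generally changes when p and q are swapped.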
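| Broad Text Category | Code | Genre | Texts in Brown | Texts in LOB |
| --- | --- | --- | --- | --- |
| Press | A | Reportage | 44 | 44 |
| Press | B | Editorial | 27 | 27 |
| Press | C | Reviews | 17 | 17 |
| General Prose | D | Religion | 17 | 17 |
| General Prose | E | Skills, Trades, Hobbies | 36 | 38 |
| General Prose | F | Popular Lore | 48 | 44 |
| General Prose | G | Belles Lettres, Biographies, Essays | 75 | 77 |
| General Prose | H | Miscellaneous | 30 | 30 |
| General Prose | J | Academic Prose | 80 | 80 |
| Fiction | K | General Fiction | 29 | 29 |
| Fiction | L | Mystery and Detective | 24 | 24 |
| Fiction | M | Science Fiction | 6 | 6 |
| Fiction | N | Adventure and Western | 29 | 29 |
| Fiction | P | Romance and Love Story | 29 | 29 |
| Fiction | R | Humour | 9 | 9 |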
Sekine characterised domains by production rules which fire
| Domain A | Domain B |
| --- | --- |
| PP → IN NP (8.40%) | NP → PRP (9.52%) |
| NP → NN PX (5.42%) | PP → IN NP (5.79%) |
| S → S (5.06%) | S → NP VP (5.77%) |
| S → NP VP (4.28%) | S → S (5.37%) |
| NP → DT NNX (3.81%) | NP → DT NNX (3.90%) |
Sekine: Cross-Entropy of Grammar Across Domains
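| T/M | A | B | E | J | K | L | N | P |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 5.13 | 5.35 | 5.41 | 5.45 | 5.51 | 5.52 | 5.53 | 5.55 |
| B | 5.47 | 5.19 | 5.50 | 5.51 | 5.55 | 5.58 | 5.60 | 5.60 |
| E | 5.50 | 5.48 | 5.20 | 5.48 | 5.58 | 5.59 | 5.58 | 5.61 |
| J | 5.39 | 5.37 | 5.35 | 5.15 | 5.52 | 5.57 | 5.58 | 5.61 |
| K | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| L | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| N | 5.29 | 5.25 | 5.28 | 5.43 | 5.10 | 5.06 | 4.89 | 5.12 |
| P | 5.43 | 5.36 | 5.40 | 5.55 | 5.23 | 5.21 | 5.21 | 5.00 |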
Overall differences between corpora and contributions of individual features.
• Vocabulary richness (e.g. type/token ratio, Yule’s K Characteristic, V2/N) is a characteristic of the entire corpus. It places all corpora on a single linear scale.
• The techniques we will look at (chi-squared, information theoretic and factor analysis) can both give a value for the overall difference between two corpora, and quantify the contributions made by individual features.
Measures of Vocabulary Richness
• Yule’s K characteristic: K = 10000 * (M2 -M1) / (M1 * M1); M1 = tokens; M2 = (V1 * 1²) + (V2 * 2²) + (V3 * 3²) …
• Gerson 35.9, Kempis 59.7, De Imitatione Christi 84.2
• Heaps’ Law: vocabulary size M as a function of text size T, M = kT^b. The parameters k and b could discriminate between texts, and allow them to be plotted in two dimensions.
• Entropy is a form of vocabulary richness (but high individual contributions from both common and rare words).
$$\text{Entropy} = -\sum_{i} p(i) \log_2 p(i)$$
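The richness measures above can be sketched in a few lines. The code follows the slide's definition of Yule's K (M1 = number of tokens, M2 = Σ V_i · i², where V_i is the number of word types occurring exactly i times) and the entropy formula -Σ p(i) log2 p(i). The whitespace tokenisation and the sample sentence are assumptions made purely for illustration.

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10000 * (M2 - M1) / (M1 * M1)."""
    freqs = Counter(tokens)             # word -> frequency
    spectrum = Counter(freqs.values())  # i -> V_i, number of types occurring i times
    m1 = len(tokens)
    m2 = sum(v_i * i * i for i, v_i in spectrum.items())
    return 10000 * (m2 - m1) / (m1 * m1)

def entropy(tokens):
    """Entropy in bits of the word frequency distribution."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

# Hypothetical sample text, tokenised naively on whitespace.
text = "the cat sat on the mat and the dog sat too".split()

print(yules_k(text))           # 10000 * (19 - 11) / 121
print(type_token_ratio(text))  # 8 types / 11 tokens
print(entropy(text))
```

Each function returns a single number per text, which is why these measures place corpora on one linear scale. Note that the type/token ratio in particular depends strongly on text length, so it is only comparable across samples of similar size.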
The chi-squared test (Oakes and Farrow, 2006): (O - E)² / E values for three words in five balanced corpora (Σ (O - E)² / E = 414916.8)
| Word | Australian | British | US | Indian | NZ |
| --- | --- | --- | --- | --- | --- |
| A | 12.68 | 1.36 | 2.55 | 76.65 | 8.33 |
| Commonwealth | 399.63 | 31.20 | 32.95 | 19.84 | 2.16 |
| zzzzooop | - | - | - | - | - |
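A minimal sketch of how per-cell (O - E)² / E values and their overall sum can be computed from a word-by-corpus frequency table. It assumes the usual contingency-table expectation, E = row total × column total / grand total. The counts below are hypothetical, not the Oakes and Farrow data, and leaving the cells of a word absent from every corpus undefined is this sketch's own convention.

```python
def chi_squared_contributions(counts):
    """counts: {word: [frequency in corpus 1, frequency in corpus 2, ...]}.

    Returns ({word: [(O - E)^2 / E per corpus]}, sum over all defined cells).
    """
    n_corpora = len(next(iter(counts.values())))
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n_corpora)]
    grand_total = sum(col_totals)
    contributions = {}
    total = 0.0
    for word, row in counts.items():
        row_total = sum(row)
        cells = []
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            # A word absent from every corpus has E = 0, so its cell is left undefined.
            cell = (observed - expected) ** 2 / expected if expected > 0 else None
            cells.append(cell)
            if cell is not None:
                total += cell
        contributions[word] = cells
    return contributions, total

# Hypothetical counts for three words in two corpora.
counts = {"a": [10, 20], "commonwealth": [20, 10], "zzzzooop": [0, 0]}
cells, total = chi_squared_contributions(counts)
print(cells["commonwealth"])  # contribution of one word in each corpus
print(total)                  # overall difference between the corpora
```

The per-cell values show which words are most characteristic of which corpus, while the sum gives a single figure for the overall difference.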
Measures from Information Theory (Dagan et al., 1997)
• Kullback Leibler (KL) divergence (also called relative entropy) used as a measure of semantic similarity by Dagan et al., 1997.
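• Meaning in coding theory: the expected number of extra bits needed to code events from corpus A using a code optimised for corpus B.
• Problems: the value is infinite if any word has frequency 0 in corpus B and frequency > 0 in corpus A, and the measure is not symmetrical.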
• Dagan et al. (1997) instead use the Information Radius, which is symmetrical and always finite.
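$$D(p \,\|\, q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$$
$$\mathrm{IRad}(p, q) = D\left(p \,\Big\|\, \frac{p + q}{2}\right) + D\left(q \,\Big\|\, \frac{p + q}{2}\right)$$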
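The sketch below illustrates both measures on word probability distributions stored as dictionaries: KL divergence, D(p || q) = Σ p_i log2(p_i / q_i), and the Information Radius, taken here as the sum of the divergences of p and q from their average. The two toy distributions are hypothetical and chosen so that the infinity problem shows up directly.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; infinite if q gives zero probability to a word seen in p."""
    total = 0.0
    for word, p_w in p.items():
        if p_w == 0:
            continue  # terms with p = 0 contribute nothing
        q_w = q.get(word, 0.0)
        if q_w == 0:
            return math.inf
        total += p_w * math.log2(p_w / q_w)
    return total

def information_radius(p, q):
    """D(p || m) + D(q || m), where m is the average of p and q."""
    vocabulary = set(p) | set(q)
    m = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in vocabulary}
    return kl_divergence(p, m) + kl_divergence(q, m)

# Hypothetical distributions for two tiny corpora.
p = {"the": 0.5, "murder": 0.5}
q = {"the": 0.5, "love": 0.5}

print(kl_divergence(p, q))       # inf: "murder" never occurs in q
print(information_radius(p, q))  # 1.0
print(information_radius(q, p))  # 1.0: the measure is symmetrical
```

Because the average distribution m is non-zero wherever p or q is non-zero, neither divergence inside the Information Radius can become infinite.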
Information Radius
• L (Fiction: detective) and P (Fiction: romance): 0.180
• A (Press reportage) and B (Press editorial): 0.257
• J (Academic prose) and P (Fiction: romance): 0.572
Detective versus Romantic Fiction
| Word | Detective | Romance | Word | Detective | Romance |
| --- | --- | --- | --- | --- | --- |
| The | .00821 | -.00732 | Her | .00819 | -.00522 |
| Of | .00308 | -.00277 | She | .00784 | -.00535 |
| A | .00280 | -.00257 | You | .00453 | -.00345 |
| Was | .00180 | -.00172 | To | .00235 | -.00229 |
| It | .00161 | -.00148 | Be | .00128 | -.00110 |
| He | .00157 | -.00148 | They | .00126 | -.00097 |
| On | .00110 | -.00099 | Would | .00121 | -.00097 |
| Been | .00106 | -.00089 | Are | .00087 | -.00056 |
| Man | .00089 | -.00061 | Your | .00084 | -.00062 |
| Money | .00065 | -.00034 | Love | .00081 | -.00039 |
Factor Analysis
• Decathlon analogy: running, jumping and throwing.
• Biber (1988): groups of countable features which consistently co-occur in texts are said to define a “linguistic dimension”.
• Such features are said to have positive loadings with respect to that dimension, but dimensions can also be defined by features which are in “complementary distributions”, i.e. negatively loaded.
• Example: at one pole is “many pronouns and contractions”, near which lie conversational texts and panel discussions. At the other pole, “few pronouns and contractions”, lie scientific texts and fiction.
Evaluation of Measures (Kilgarriff 2001)
• Reference corpus made up of known proportions of two corpora: 100% A, 0% B; 90% A, 10% B; 80% A, 20% B …
• This gives a set of “gold standard” judgements: subcorpus 1 is more like subcorpus 2 than subcorpus 3, etc.
• Compare machine ranking of corpora with the gold standard ranking using Spearman’s rank correlation coefficient.
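The final comparison step can be sketched as follows, using the textbook formula for Spearman's coefficient, ρ = 1 - 6 Σ d² / (n (n² - 1)), where d is the difference between the two ranks of each item. This form assumes there are no tied values. The gold-standard proportions mirror the mixing scheme above, while the measured distances are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Gold standard: proportion of corpus B mixed into each subcorpus.
gold = [0.0, 0.1, 0.2, 0.3, 0.4]
# Hypothetical distances from the pure-A subcorpus under some measure.
measured = [0.00, 0.03, 0.09, 0.07, 0.15]

print(spearman_rho(gold, measured))  # 0.9: one adjacent pair is ranked the wrong way round
print(spearman_rho(gold, gold))      # 1.0: perfect agreement
```

A measure whose ranking tracks the known mixing proportions more closely scores nearer 1, which gives a simple basis for comparing candidate measures.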
Conclusions
• Some measures allow comparisons of entire corpora, others enable the identification of typical features.
• Different measures allow different kinds of maps: vocabulary richness allows ranking of corpora on a linear scale, Heaps’ Law a 2D map of two parameters. Information theoretic measures give the (dis)similarity between two corpora – best viewed using clustering. With Factor Analysis, you don’t know what the dimensions are until you’ve done it.
• Such maps make it possible to draw contours of application success.