Measures from Information Retrieval to Find the Words which are Characteristic of a Corpus. Michael Oakes University of Sunderland, England.


Contents

• Background and the ICAME disk

• Two traditional measures: chi-squared and G-squared (Log-likelihood)

• Information Retrieval

Looking for discriminating vocabulary

• Two classic papers: Kilgarriff (1996), Which words are particularly characteristic of a text? A survey of statistical approaches.

• Yang and Pedersen (1997), A comparative study on feature selection in text categorization.

• Identify discriminants, linguistic features more typical of one form of English than another.

• Automatic categorisation of text types akin to automatic topic, genre and author identification (Souter, 1994).

• Vocabulary differences reveal cultural differences (Leech and Fallon, 1992).

Leech and Fallon (1992) compared the vocabulary in Brown and LOB

• Linguistic contrasts:

    • Spelling differences: color / colour

    • Lexical choice: gasoline / petrol

    • Proper nouns (Chicago more common in Brown)

• Non-linguistic contrasts: indicators of socio-cultural differences between the two countries.

Samples of written English on the ICAME CD

Corpus      Country     Words      Status
ACE         Australia     746,372  First
FLOB        Britain     1,009,765  First
Kolhapur    India       1,006,315  Second
ICE (EA)    Kenya         299,792  Education, High Court, Govt, LF
Wellington  N. Zealand  1,016,623  First
ICE (EA)    Tanzania      292,012  Education, High Court, LF
FROWN       USA         1,009,598  First

Number of sections of approx. 2000 words in 5 comparable corpora (1)

Category                              Aust.   UK   US  India   NZ
A  Press: reportage                      44   44   44     44   44
B  Press: editorial                      27   27   27     27   27
C  Press: reviews                        17   17   17     17   17
D  Religion                              17   17   17     17   17
E  Skills, trades, hobbies               38   38   36     36   38
F  Popular lore                          44   44   48     48   44
G  Belles lettres, biography, essays     77   77   75     75   77
H  Misc. e.g. Government docs.           30   30   30     30   30
J  Learned and scientific                80   80   80     80   80

Number of sections of approx. 2000 words in 5 comparable corpora (2)

Category                              Aust.   UK   US  India   NZ
K  General fiction                       29   29   29     29  126
L  Mystery and detective fiction         15   24   24     24    -
M  Science fiction                        7    6    6      6    -
N  Adventure fiction, westerns            8   29   29     29    -
P  Romance and love story                15   29   29     29    -
R  Humour                                15    9    9      9    -
S  Historical fiction                    22    0    0      0    -
W  Women’s fiction                       15    0    0      0    -
TOTAL                                   500  500  500    500  500

The chi-squared (X²) test

• See Rayson, Leech & Hodges (1997).

• Case study: Is the word “lovely” used more often in speech by men or women?

• Experiment: In the BNC conversational corpus, men say “lovely” 414 times while women say “lovely” 1214 times.

• Statistics: Is this due to chance, or does the use of this word genuinely vary with the gender of the speaker? Use the chi-squared test.

• Contingency table of observed values O: see next slide

Contingency table of observed values O

                 Men       Women     Row total
Lovely               414      1214        1628
Any other word   1714029   2592238     4306267
Col total        1714443   2593452     4307895 (grand total)

Contingency table of expected values E

Men Women

Lovely 647.9 980.1

Any Other Word 1713795.1 2592471.9

The chi-squared test (2)

• Expected frequencies E

• E = row total x column total / grand total

• e.g. E (lovely, men) = 1628 x 1714443 / 4307895 = 647.9

• See previous table

• X² = Σ (O – E)² / E

• Find (O – E)² / E for every box in the table,

• e.g. (O – E)² / E for (lovely, men) = (414 – 647.9)² / 647.9 = 84.4.

• X² = sum (Σ) for all four boxes = 84.4 + 55.8 + 0.0 + 0.0 = 140.2
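The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's own code; the function name `chi_squared` is an assumption, and the "any other word" cells are derived from the column totals given on the slide. Because nothing is rounded along the way, the total comes out slightly above the slide's 140.2, which sums rounded terms.

```python
def chi_squared(observed):
    """Pearson's X² for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count: row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            x2 += (o - e) ** 2 / e
    return x2


# Counts from the slides: "lovely" by men and women, plus column totals.
men_total, women_total = 1714443, 2593452
lovely = [414, 1214]
other = [men_total - lovely[0], women_total - lovely[1]]

x2 = chi_squared([lovely, other])
print(f"X² = {x2:.1f}")
```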

G² or Log-Likelihood

$$ G^2 = 2 \sum_i O_i \ln \left( \frac{O_i}{E_i} \right) $$
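As a hedged sketch, G² = 2 Σ O ln(O / E) can be computed from the same kind of contingency table as chi-squared, with expected values obtained as row total x column total / grand total. The function name `g_squared` is an assumption for illustration, and the example reuses the "lovely" counts from the chi-squared slides.

```python
import math


def g_squared(observed):
    """Log-likelihood G² = 2 * sum(O * ln(O / E)) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            if o == 0:
                continue  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            e = row_totals[i] * col_totals[j] / grand_total
            total += o * math.log(o / e)
    return 2 * total


# "lovely" by men and women; other cells derived from the column totals.
men_total, women_total = 1714443, 2593452
lovely = [414, 1214]
other = [men_total - lovely[0], women_total - lovely[1]]

g2 = g_squared([lovely, other])
print(f"G² = {g2:.1f}")
```

For this table G² and X² are of similar size, as expected when the two tests approximate one another.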

G² vs. Chi-squared

• The chi-squared test is an approximation to the G² test, easier to calculate in the days before PCs and pocket calculators (Wikipedia)

• Both can be used to compare corpora of different sizes

• The only restriction is that the expected values must be >= 5 (Moore 2004, Rayson et al., 2004)

The 20 Words Most Typical of New Zealand English

Word Chi-Squared G-Squared

Zealand 6896.2 5456.4

Maori 3481.7 2895.9

Auckland 2200.4 1809.2

New 1856.2 1580.0

Wellington 1805.5 1453.7

Te 1056.9 832.9

Christchurch 899.4 724.7

Pakeha 781.9 647.2

Canterbury 525.4 403.3

Zealanders 495.7 392.4

Otago 412.8 340.2

Pacific 332.8 257.4

Dunedin 312.1 256.6

Waikato 308.4 253.5

Rugby 296.0 226.1

Maoris 291.6 235.8

Bay 283.9 224.6

Island 280.1 228.9

NZ 262.1 204.4

Waitangi 245.0 200.9

Bonferroni Correction

• Controls the family-wise error rate (the probability of at least one false positive across all the tests).

• For a single test, X² or G² > 10.83 is significant at the 0.1% level.

• In comparing the vocabulary across the five corpora, we effectively perform 101,984 tests because there are 101,984 unique word types across the 5 corpora.

• To find the appropriate critical value we divided 0.001 by 101,984 to give an adjusted significance level of 9.805 × 10⁻⁹.

• We then identify words with chi-squared contributions > 32.9.

• The probability that even one word selected in this way has been incorrectly identified is at most 0.1%; the Bonferroni correction is conservative.

• We are more interested in ranking than absolute values.
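The thresholds quoted above can be checked with a short sketch. It assumes one degree of freedom (which is what the 10.83 figure corresponds to) and uses the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable; the helper name `chi2_critical_1df` is hypothetical.

```python
from statistics import NormalDist


def chi2_critical_1df(alpha):
    """Critical value of chi-squared with 1 degree of freedom at level alpha."""
    # Two-sided normal quantile; the lower tail is used for numerical accuracy.
    z = NormalDist().inv_cdf(alpha / 2)
    return z * z


n_tests = 101_984          # unique word types across the 5 corpora
alpha = 0.001              # 0.1% significance level for a single test
adjusted_alpha = alpha / n_tests

single = chi2_critical_1df(alpha)
adjusted = chi2_critical_1df(adjusted_alpha)
print(f"adjusted alpha = {adjusted_alpha:.3e}")
print(f"critical values: {single:.2f} (single test), {adjusted:.1f} (Bonferroni)")
```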

Dispersion

• Dispersion measures show how evenly or otherwise a word is distributed throughout a corpus (Lyne 1985, 1986).

• In this study, we should only consider words which are relatively evenly spread throughout the corpus.

• E.g. thalidomide, ranked 15th most typical of UK, occurs all 55 times in a single medical article.

Juilland’s D (1)

• Divide the corpus into n contiguous subsections (we used 5).

• Commonwealth was found 31, 8, 32, 88, 5 times respectively in the Australian corpus.

• The standard deviation of the number of times the word is found in each subsection = 29.79, and the mean frequency is 32.8.

Juilland’s D (2)

• To account for the fact that the standard deviation tends to be higher for more frequent words, it is divided by the mean frequency to give the coefficient of variation V = 29.79 / 32.8 = 0.908

• The coefficient of dispersion falls in the range 0 to 1.

• D = 1 - V / sqrt (n-1) = 0.546 for commonwealth

• Empirical finding: keep if D >= 0.3, range >= 3.
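The Commonwealth example can be reproduced with the sketch below. The quoted standard deviation of 29.79 matches the population standard deviation, so `pstdev` is used. The function names are assumptions, and `word_range` reads "range" as the number of subsections in which the word occurs, which the slides do not spell out.

```python
import math
from statistics import mean, pstdev


def juilland_d(counts):
    """Juilland's D = 1 - V / sqrt(n - 1), where V = standard deviation / mean."""
    n = len(counts)
    v = pstdev(counts) / mean(counts)  # coefficient of variation
    return 1 - v / math.sqrt(n - 1)


def word_range(counts):
    """Assumed meaning of 'range': number of subsections containing the word."""
    return sum(1 for c in counts if c > 0)


# Frequencies of "Commonwealth" in 5 subsections of the Australian corpus.
commonwealth = [31, 8, 32, 88, 5]
d = juilland_d(commonwealth)
keep = d >= 0.3 and word_range(commonwealth) >= 3
print(f"D = {d:.3f}, keep = {keep}")
```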

The Australian list

• 18 of top 19 people and places

• Exception is Commonwealth (of Australia)

• Politics: Premier, Senator, Hawke, Whitlam, ALP, Labor, BHP

• Employment rights: unions, unemployed, superannuation

The British list

• People and places

• Institutions: NHS, BBC

• Politics: Tory, Labour

• EC (European Community)

• Historical epochs: century, eighteenth

• Aristocratic titles: Duke, Lord(s), Prince, Royal

The Indian List

• People and places

• Currency: Rs (rupees)

• Numbers: mn (million), crores (ten million), lakhs (hundred thousand).

• Function words: the, of, in, upto (single word)

• Religion: Buddha (86.0), Buddhism (45.4), divine (150.6), Gita (119.3), God (37.8), Gods (78.6), Goddess (44.4), Hindu (299.5), Hindus (148.1), Karma (61.4), Muslim (151.8), Muslims (42.2), mystic (53.1), Mystics (100.7), pandit (104.4), Saints (35.6), Sikh (80.0), Swami (131.2), temple (248.8), temples (104.2), Vedas (101.4), Vedic (102.9), yoga (97.7).

The New Zealand list

• Place names

• Pakeha (person of European descent)

• The natural world: bay, forest, harbour, island(s), landscape.

• Rugby

The U.S. list

• Few people and places

• Spelling variants: toward, percent, programs, defense, program, color, behavior, labor, fiber, gray, theater, favorite, favor, colors, organization

• Inclusiveness: black, gender, white

Measures from Information Retrieval

• The main difference from corpus linguistics is that we are interested in the information itself rather than its linguistic style.

• Raw frequency with stoplisting

• TF.IDF

• Deviation from Randomness

• Kullback-Leibler Divergence

Raw Frequency

• Most frequent words in the New Zealand corpus:

• the (67355), of (32182), and (28678), to (26552), a (23558), in (20519), is (10284), was (10081), it (9814), that (9743), for (9341), I (7844), on (7629), ‘s (7585), with (7185), as (7027), he (6716), be (6297), at (5530), by (5207)…

The Glasgow Stoplist

• a, about, above, across, adj, after, again, against, all, almost, alone, along, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, apart, are, around, as, aside, at, away, be … yourself.

Raw Frequency with Stoplisting

• ‘s (7585), he (6716), you (3838), New (3319), we (3292), one (3267), my (2078), Zealand (1985), time (1920), like (1607), me (1602), two (1589), people (1583), first (1393), now (1285), back (1208), years (1145), way (1079), work (1041), and made (1019).

• Only New and Zealand appeared typical of the corpus of New Zealand English.

• This shows the need for more sophisticated measures.
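Raw frequency with stoplisting amounts to counting tokens after discarding any word on the stoplist. The sketch below is illustrative only: `STOPLIST` is a tiny hypothetical subset rather than the actual Glasgow list, and the token sequence is invented.

```python
from collections import Counter

# Tiny illustrative subset; the real Glasgow stoplist is much longer.
STOPLIST = {"a", "about", "above", "the", "of", "and", "to", "in"}


def top_words(tokens, stoplist, k):
    """Return the k most frequent tokens that are not on the stoplist."""
    counts = Counter(t for t in tokens if t.lower() not in stoplist)
    return counts.most_common(k)


tokens = "the Maori of New Zealand and the people of New Zealand".split()
top = top_words(tokens, STOPLIST, 2)
print(top)
```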

TF.IDF

• Takes into account both the frequency of a word in a corpus (TF, term frequency) and the inverse of the number of corpora the word appears in (IDF, inverse document frequency).

• The highest scores are given to words which are common in the corpus we are looking at, but do not occur in many other corpora.

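$$ w_{kd} = f_{kd} \cdot \log_2 \left( \frac{NDoc}{D_k} \right) $$

where f_kd is the frequency of word k in corpus d, NDoc is the total number of corpora, and D_k is the number of corpora in which word k appears.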
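A minimal sketch of this weighting, treating each corpus as one "document", is given below. It assumes the weight is the frequency of the word in the corpus multiplied by log base 2 of (number of corpora / number of corpora containing the word). The function name and the toy frequency tables are hypothetical, not figures from the study.

```python
import math


def tf_idf(word, corpus, corpora):
    """TF.IDF weight of `word` in `corpus`, with each corpus as one document."""
    f_kd = corpora[corpus].get(word, 0)                      # term frequency
    d_k = sum(1 for counts in corpora.values() if counts.get(word, 0) > 0)
    if d_k == 0:
        return 0.0                                           # word never occurs
    return f_kd * math.log2(len(corpora) / d_k)


# Invented word counts for four toy corpora.
corpora = {
    "NZ": {"the": 100, "pakeha": 4},
    "UK": {"the": 120},
    "US": {"the": 110},
    "India": {"the": 130},
}

print(tf_idf("pakeha", "NZ", corpora))  # occurs in 1 of 4 corpora
print(tf_idf("the", "NZ", corpora))     # occurs in every corpus, so IDF is 0
```

A word found in every corpus scores zero however frequent it is, which is why function words drop out without a stoplist.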

20 Words in the NZ Corpus with Highest TF.IDF

• Maori (1504.8), pakeha (339.5), Auckland (304.4), Otago (180.2), Dunedin (136.8), Waikato (135.1), Christchurch (127.7), Wellington (112.0), Waitangi (107.8), Aotearoa (91.7), Hutt (91.7), Ngati (83.6), Rotorua (75.6), Maoris (74.2), moa (72.4), Te (68.7), NZPA (67.5), marae (65.9), ANZUS (62.7), TVNZ (62.7), Waitaki (59.5) and Invercargill (57.9)

• This suggests that TF.IDF is a good measure for finding words typical of a corpus.

Deviation from Randomness

• One component is Bose-Einstein probability

• If λ is the mean frequency of term t across all the corpora, the Bose-Einstein probability is the probability that a term occurs exactly f times in one of the corpora

• Words which occur much more often in one corpus than they do on average across the corpora are typical of that corpus, and have low Bose-Einstein probability.

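Inf1 is the negative of log base 2 of the Bose-Einstein probability, so words typical of a corpus will have high Inf1:

$$ \mathrm{Inf}_1 = -\log_2 \left( \frac{1}{1+\lambda} \right) - f \cdot \log_2 \left( \frac{\lambda}{1+\lambda} \right) $$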
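A hedged sketch of Inf1 follows. It assumes the geometric form of the Bose-Einstein probability, P(f) = (1 / (1 + λ)) · (λ / (1 + λ))^f, so that Inf1 = -log2 P(f). The function names are illustrative and the inputs are toy values, not figures from the study.

```python
import math


def bose_einstein(f, lam):
    """Assumed geometric form of the Bose-Einstein probability of f occurrences."""
    return (1 / (1 + lam)) * (lam / (1 + lam)) ** f


def inf1(f, lam):
    """Inf1 = -log2(1 / (1 + lam)) - f * log2(lam / (1 + lam))."""
    return -math.log2(1 / (1 + lam)) - f * math.log2(lam / (1 + lam))


# Toy values: mean frequency 1 across the corpora, 3 occurrences in this corpus.
score = inf1(3, 1.0)
print(score)
```

With λ held fixed, Inf1 grows with f, so a word that is much more frequent in one corpus than on average receives a high score.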

The 20 words with highest Inf1 for the corpus of NZ English were:

• Maori (28.66), Auckland (28.52), Pakeha (28.47), Otago (28.46), Wellington (28.16), Dunedin (28.12), Waikato (28.11), Christchurch (28.10), Waitangi (28.11), Maoris (27.85), Aotearoa (27.84), Hutt (27.74), Ngati (27.76), Zealand (27.76), Rotorua (27.67), moa (27.62), NZPA (27.55), Zealanders (27.53), marae (27.52), Te (27.52).

• On its own, Inf1 appears to be a good indicator of which words are typical of a corpus.

Kullback-Leibler Divergence and Relevance Feedback (“more like this”)

$$ KLD(t) = \mu \cdot p_R(t) \cdot \log_2 \left( \frac{\mu \cdot p_R(t)}{p_C(t)} \right) $$

KLD(t)

• pR(t) is the number of times that word is found in relevant documents, divided by the total number of words in relevant documents

• pC(t) is the number of times that word is found in the entire document collection, divided by the total number of words in the entire document collection

• μ is a tuning parameter, which worked best when set to 0.5

• Instead of relevant documents we discuss the corpus of interest, and instead of non-relevant documents we have the other comparison corpora.

The 20 highest scoring words for NZ English were:

• Zealand (1141), Maori (567), Auckland (359), Wellington (297), Te (175), Christchurch (148), Pakeha (128), Canterbury (89), Zealanders (82), Otago (67), Pacific (57), Rugby (52), Dunedin (51), Waikato (50), Maoris (48), NZ (44), Bay (44), Waitangi (40), Aotearoa (34), Hutt (34). Values in millionths.

• All these words appear typical of NZ English

• KLD(t) is a value for a single word. We can add together the KLD(t) values for every word, to derive a single value KLD(Dr, Dc) showing the divergence between relevant documents and non-relevant documents. It thus gives a measure of corpus similarity.
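The per-word score and its sum over the vocabulary can be sketched as follows. The exact role of μ is an assumption here: the code discounts pR(t) by μ both outside and inside the logarithm, giving KLD(t) = μ · pR(t) · log2(μ · pR(t) / pC(t)), and treats words absent from the corpus of interest as contributing nothing. All names and counts are hypothetical.

```python
import math


def kld_term(count_r, total_r, count_c, total_c, mu=0.5):
    """Per-word score mu * pR * log2(mu * pR / pC); placement of mu is assumed."""
    p_r = count_r / total_r   # relative frequency in the corpus of interest
    p_c = count_c / total_c   # relative frequency in the whole collection
    if p_r == 0:
        return 0.0            # word absent from the corpus of interest
    return mu * p_r * math.log2(mu * p_r / p_c)


def kld_corpus(corpus_counts, collection_counts, mu=0.5):
    """Sum of the per-word scores over every word in the corpus of interest."""
    total_r = sum(corpus_counts.values())
    total_c = sum(collection_counts.values())
    return sum(
        kld_term(n, total_r, collection_counts[w], total_c, mu)
        for w, n in corpus_counts.items()
    )


# Toy example: a word seen 8 times in 100 words locally, 8 times in 400 overall.
score = kld_term(8, 100, 8, 400, mu=0.5)
print(score)
```

Under this reading a word scores above zero only when its discounted relative frequency in the corpus of interest exceeds its relative frequency in the whole collection.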

Information Gain

• Whereas the other measures tell us something about the strength of the association between a word and a corpus, IG is a single value for the power of a word to discriminate between corpora.

• As an exercise in judging the usefulness of this measure, look at the 20 words in all five corpora with highest IG, and try to guess the corpora they are most typical of:

• Zealand (332), Maori (213), India (153), Auckland (130), Australian (104), Wellington (98), Rs (Rupees) (75), Gandhi (73), Pounds (68), Clinton (67), Janata (65), Australia (64), Delhi (54), Singh (54), Queensland (50), Bombay (50), Aboriginal (50), Christchurch (49), pakeha (48), NSW (40). These IG values are in millionths.
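The slides do not give the formula used for IG. As a rough sketch only, the version below follows the standard definition from text categorisation (as in Yang and Pedersen, 1997): the reduction in entropy over corpora from knowing whether a text section contains the word. Treating each 2000-word section as a document is an assumption, as are the function names and the toy counts.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(sections_with_word, sections_total):
    """IG of a word; both arguments are lists with one entry per corpus."""
    n = sum(sections_total)
    n_with = sum(sections_with_word)
    without = [t - w for w, t in zip(sections_with_word, sections_total)]
    conditional = (n_with / n) * entropy(sections_with_word) \
        + ((n - n_with) / n) * entropy(without)
    return entropy(sections_total) - conditional


# Two toy corpora of 500 sections each.
perfect = information_gain([500, 0], [500, 500])   # word only in corpus 1
useless = information_gain([250, 250], [500, 500]) # word spread evenly
print(perfect, useless)
```

Unlike the per-corpus measures, the result does not say which corpus a word is typical of, only how well it separates them.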

Conclusions (1)

• In corpus linguistics, interest is mainly in the language used in corpora, while in information retrieval we are mainly interested in the information conveyed by a document.

• In IR, function words on a “stoplist” are routinely discarded, since these are not related to the topic of a document, but in CL, such words tell us a great deal about the grammatical structures used in a corpus.

• The question of “which words are characteristic of a text” is common to both IR and CL. A number of statistical measures are thus relevant to both fields of study.

Conclusions (2)

• Our initial results suggest that the IR measures of TF.IDF, Bose-Einstein probability and Kullback-Leibler Divergence when μ = 0.5 are all good measures for finding the words most typical of New Zealand English.

• A variant of KLD measures the divergence between two corpora.

• Information Gain provides a single score for a word, reflecting its ability to discriminate between corpora.