Corpora and Statistical Methods

Albert Gatt

In this lecture

We have considered distributions of words and lexical variation in corpora.

Today we consider collocations:
- definition and characteristics
- measures of collocational strength
- experiments on corpora
- hypothesis testing

Part 1. Collocations: Definition and characteristics

A motivating example

Consider phrases such as:
- strong tea vs. ?powerful tea
- strong support vs. ?powerful support
- powerful drug vs. ?strong drug

Traditional semantic theories have difficulty accounting for these patterns.
- strong and powerful seem near-synonyms
- do we claim they have different senses?
- what is the crucial difference?

The empiricist view of meaning

Firth's view (1957): "You shall know a word by the company it keeps."

This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953).

In the Firthian tradition, attention is paid to patterns that crop up with regularity in language.
- Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc.

Statistical work on collocations tends to follow this tradition.

Defining collocations

"Collocations are statements of the habitual or customary places of [a] word." (Firth 1957)

Characteristics/Expectations:
- regular/frequently attested;
- occur within a narrow window (span of few words);
- not fully compositional;
- non-substitutable;
- non-modifiable;
- display category restrictions.

Frequency and regularity

We know that language is regular (non-random) and rule-based.
- this aspect is emphasised by rationalist approaches to grammar

We also need to acknowledge that frequency of usage is an important factor in language development.
- why do big and large collocate differently with different nouns?

Regularity/frequency

f(strong tea) > f(powerful tea)

f(credit card) > f(credit bankruptcy)

f(white wine) > f(yellow wine)
(even though white wine is actually yellowish)
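The frequency comparisons above can be made concrete with a small counting sketch. The helper name `ngram_counts` and the toy text are assumptions made for illustration; real counts would come from a large corpus.

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count every contiguous n-gram, stored as a tuple of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy text invented for illustration only (not corpus data).
text = ("she likes strong tea and he likes strong tea too "
        "but nobody orders powerful tea")
tokens = text.split()

bigrams = ngram_counts(tokens, 2)
f_strong = bigrams[("strong", "tea")]      # f(strong tea)
f_powerful = bigrams[("powerful", "tea")]  # f(powerful tea)
print(f_strong, f_powerful, f_strong > f_powerful)
```

The same helper with `n=3` counts trigrams, so one function covers any fixed n-gram size.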

Narrow window (textual proximity)

Usually, we specify an n-gram window within which to analyse collocations:
- bigram: credit card, credit crunch
- trigram: credit card fraud, credit card expiry

The idea is to look at co-occurrence of words within a specific n-gram window

We can also count n-grams with intervening words:
- federal (.*) subsidy
- matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy

Textual proximity (continued)

Usually collocates of a word occur close to that word.
- may still occur across a span

Examples:
- bigram: white wine, powerful tea
- > bigram: knock on the door; knock on X's door
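A pattern with intervening words, such as federal (.*) subsidy, can be sketched with a regular expression. The version below is one possible reading: it bounds the gap to a fixed number of whole words rather than using an unrestricted `.*`, which would otherwise match across arbitrarily long stretches of text. The function name `gapped_matches`, the `max_gap` parameter and the sample sentence are assumptions.

```python
import re

def gapped_matches(text, w1, w2, max_gap=1):
    """Return every match of w1 followed by w2 with at most
    max_gap whole words in between."""
    pattern = re.compile(
        r"\b%s(?:\s+\w+){0,%d}\s+%s\b" % (re.escape(w1), max_gap, re.escape(w2))
    )
    return [m.group(0) for m in pattern.finditer(text)]

# Sample sentence invented for illustration.
text = ("the federal subsidy was cut while the federal farm subsidy grew "
        "and the federal manufacturing subsidy stayed flat")
print(gapped_matches(text, "federal", "subsidy"))
```

Setting `max_gap=0` reduces the pattern to a plain bigram search, which makes the effect of the window size easy to compare.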

Non-compositionality

white wine
- not really white; meaning not fully predictable from component words + syntax

signal interpretation
- a term used in Intelligent Signal Processing: connotations go beyond compositional meaning

Similarly:
- regression coefficient
- good practice guidelines

Extreme cases:
- idioms such as kick the bucket
- meaning is completely frozen

Non-substitutability

If a phrase is a collocation, we can't replace a word in the phrase with a near-synonym and still have the same overall meaning.

E.g.:
- white wine vs. yellow wine
- powerful tea vs. strong tea

Non-modifiability

Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms.

Example:
- kick the bucket vs. ?kick the large bucket

NB:
- this is a matter of degree!
- non-idiomatic collocations are more flexible

Category restrictions

Frequency alone doesn't indicate collocational strength:
- by the is a very frequent phrase in English
- not a collocation

Collocations tend to be formed from content words:
- A+N: powerful tea
- N+N: regression coefficient, mass demonstration
- N+PREP+N: degrees of freedom

Collocations in a broad sense

In many statistical NLP applications, the term collocation is quite broadly understood:
- any phrase which is frequent/regular enough
- proper names (New York)
- compound nouns (elevator operator)
- set phrases (part of speech)
- idioms (kick the bucket)

Why are collocations interesting?

Several applications need to know about collocations:

terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices)

document classification: specialist phrases are good indicators of the topic of a text

named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don't

Example application: Parsing

She spotted the man with a pair of binoculars
(1) [VP spotted [NP the man [PP with a pair of binoculars]]]
(2) [VP spotted [NP the man] [PP with a pair of binoculars]]

Parser might prefer (2) if spot/binoculars are frequent co-occurrences in a window of a certain width.

Example application: Generation

NLG systems often need to map a semantic representation to a lexical/syntactic one.
- Shouldn't use the wrong adjective-noun combinations: clean face vs. ?immaculate face

Lapata et al. (1999):
- experiment asking people to rate different adjective-noun combinations
- frequency of the combination a strong predictor of people's preferences
- argue that NLG systems need to be able to make contextually-informed decisions in lexical choice

Finding collocations in corpora: basic methods

Frequency-based approach

Motivation: if two (or three, or ...) words occur together a lot within some window, they're a collocation

Problems:
- frequent collocations under this definition include with the, onto a, etc.
- not very interesting

Improving the frequency-based approach

Justeson & Katz (1995): part of speech filter
- only look at word combinations of the right category:
  - N + N: regression coefficient
  - N + PRP + N: jack in (the) box
- dramatically improves the results
- content-word combinations more likely to be phrases

Case study: strong vs. powerful

See: Manning & Schütze '99, Sec. 5.2

Motivation:
- try to distinguish the meanings of two quasi-synonyms
- data from New York Times corpus

Basic strategy:
- find all bigrams where w1 = strong or powerful
- apply POS filter to remove strong on [crime], powerful in [industry] etc.

Case study (cont/d)

Sample results from Manning & Schütze '99:
- f(strong support) = 50
- f(strong supporter) = 10
- f(powerful force) = 13
- f(powerful computers) = 10

Teaser:
- would you also expect powerful supporter?
- what's the difference between strong supporter and powerful supporter?
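The case-study strategy (collect bigrams whose first word is strong or powerful, then apply a part-of-speech filter) might be sketched as follows. This is a simplified illustration, not the procedure used in the cited work: the tag set (A, N, P, D), the two allowed patterns, the function name `filtered_bigrams` and the hand-tagged tokens are all assumptions, and a real pipeline would take its tags from a POS tagger.

```python
from collections import Counter

# Tag sequences kept by the filter: A = adjective, N = noun.
# Only two-word patterns are used here, a simplified subset.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}

def filtered_bigrams(tagged_tokens, first_words):
    """Count bigrams whose first word is in first_words and whose
    tag sequence is one of ALLOWED_PATTERNS."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if w1 in first_words and (t1, t2) in ALLOWED_PATTERNS:
            counts[(w1, w2)] += 1
    return counts

# Hand-tagged toy tokens, invented for illustration (P = preposition, D = determiner).
tagged = [
    ("strong", "A"), ("support", "N"), ("for", "P"), ("a", "D"),
    ("powerful", "A"), ("force", "N"), ("strong", "A"), ("on", "P"),
    ("crime", "N"), ("powerful", "A"), ("in", "P"), ("industry", "N"),
    ("strong", "A"), ("support", "N"),
]
counts = filtered_bigrams(tagged, {"strong", "powerful"})
print(counts.most_common())
```

The filter drops strong on and powerful in because their tag sequences (A followed by P) are not among the allowed patterns, while the adjective-noun pairs survive.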

Limitations of frequency-based search

Only works for fixed phrases.

But collocations can be looser, allowing interpolation of other words.
- knock on [the, X's, a] door
- pull [a] punch

Simple frequency won't do for these: different interpolated words dilute the frequency.

Using mean and variance

General idea: include bigrams even at a distance:

w1      X    w2
pull    a    punch

Strategy:
- find co-occurrences of the two words in windows of varying length
- compute mean offset between w1 and w2
- compute variance of offset between w1 and w2
- if offsets are randomly distributed, then we have high variance and conclude that the pair is not a collocation
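The strategy above can be sketched in a few lines. The sign convention (offset = position of w2 minus position of w1, so a positive value means w2 follows w1), the window size, the function name `offsets` and the toy sentences are assumptions made for illustration. The sample standard deviation (n - 1 in the denominator) is used.

```python
from statistics import mean, stdev

def offsets(sentences, w1, w2, window=5):
    """Collect signed distances pos(w2) - pos(w1) for every
    co-occurrence of w1 and w2 within `window` words of each other."""
    found = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != w1:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] == w2:
                    found.append(j - i)
    return found

# Toy sentences invented for illustration (not corpus data).
sentences = [
    "she heard a knock on the door".split(),
    "they knock on his door".split(),
    "knock on the front door".split(),
]
d = offsets(sentences, "knock", "door")
print(d, mean(d), stdev(d))
```

A low standard deviation means the two words keep a nearly fixed distance from one another; offsets scattered on both sides of zero would give a high one.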

Example outcomes (M&S '99)

position of strong wrt opposition
- mean = -1.15, standard dev = 0.67
- i.e. most occurrences are strong [...] opposition

position of strong wrt for
- mean = -1.12, standard dev = 2.15
- i.e. for occurs anywhere around strong; the SD is larger than the absolute value of the mean.
- can get strong support for, for the strong support, etc.

More limitations of frequency

If we use simple frequency or mean & variance, we have a good way of ranking likely collocations.

But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance?

We need to think in terms of hypothesis testing.

Given two words w1 and w2, we want to compare:
- The hypothesis that they are non-independent.
- The hypothesis that they are independent.
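As a first step towards that comparison, the observed bigram count can be set against the count expected if the two words were independent, i.e. if P(w1 w2) = P(w1) P(w2). The sketch below only computes the two quantities; it is not a significance test. The function name `observed_vs_expected`, the maximum-likelihood probability estimates and the toy text are assumptions.

```python
from collections import Counter

def observed_vs_expected(tokens, w1, w2):
    """Return (observed, expected) counts of the bigram w1 w2, where
    expected assumes the two words occur independently."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    observed = bigrams[(w1, w2)]
    # Under independence P(w1 w2) = P(w1) * P(w2); multiply by the
    # number of bigram positions (n - 1) to get an expected count.
    expected = (unigrams[w1] / n) * (unigrams[w2] / n) * (n - 1)
    return observed, expected

# Toy text invented for illustration only.
tokens = "strong tea is strong and tea is hot but strong tea wins".split()
observed, expected = observed_vs_expected(tokens, "strong", "tea")
print(observed, expected)
```

An observed count well above the expected one is evidence against independence; deciding whether the gap is large enough is the job of a proper statistical test.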
