Corpora and Statistical Methods
Albert Gatt

In this lecture

We have considered distributions of words and lexical variation in corpora. Today we consider collocations:
- definition and characteristics
- measures of collocational strength
- experiments on corpora
- hypothesis testing

Part 1: Collocations: definition and characteristics

A motivating example

Consider phrases such as:
- strong tea vs. ?powerful tea
- strong support vs. ?powerful support
- powerful drug vs. ?strong drug
Traditional semantic theories have difficulty accounting for these patterns:
- strong and powerful seem to be near-synonyms
- do we claim they have different senses?
- what is the crucial difference?
The empiricist view of meaning

Firth's view (1957): "You shall know a word by the company it keeps."
This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953).
In the Firthian tradition, attention is paid to patterns that crop up with regularity in language. Contrast this with symbolic/rationalist approaches, which emphasise polysemy, componential analysis, etc.
Statistical work on collocations tends to follow this tradition.

Defining collocations

Collocations are "statements of the habitual or customary places of [a] word" (Firth 1957).
Characteristics/expectations of collocations:
- regular/frequently attested;
- occur within a narrow window (a span of a few words);
- not fully compositional;
- non-substitutable;
- non-modifiable;
- display category restrictions.

Frequency and regularity

We know that language is regular (non-random) and rule-based; this aspect is emphasised by rationalist approaches to grammar.
We also need to acknowledge that frequency of usage is an important factor in language development: why do big and large collocate differently with different nouns?

Regularity/frequency

- f(strong tea) > f(powerful tea)
- f(credit card) > f(credit bankruptcy)
- f(white wine) > f(yellow wine) (even though white wine is actually yellowish)
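Frequency comparisons like these can be computed directly from bigram counts. A minimal sketch in Python, using a toy corpus (real counts would of course come from a large corpus):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus, for illustration only; real counts come from a large corpus
tokens = ("we drank strong tea and more strong tea "
          "but never powerful tea").split()

counts = bigram_counts(tokens)
print(counts[("strong", "tea")])    # 2
print(counts[("powerful", "tea")])  # 1
```

Ranking the resulting counts is the simplest frequency-based collocation search, whose limitations are discussed below.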
Narrow window (textual proximity)

Usually, we specify an n-gram window within which to analyse collocations:
- bigram: credit card, credit crunch
- trigram: credit card fraud, credit card expiry
The idea is to look at the co-occurrence of words within a specific n-gram window.
We can also count n-grams with intervening words, e.g. the pattern federal (.*) subsidy matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy.

Textual proximity (continued)

Usually, collocates of a word occur close to that word, though they may still occur across a span.
Examples:
- within a bigram: white wine, powerful tea
- beyond the bigram: knock on the door; knock on X's door
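Counting patterns with intervening words can be sketched with a regular expression. Here is a minimal Python variant of the federal (.*) subsidy idea; as an assumption for this sketch, it allows at most one intervening word (the slide's greedy .* would over-match across long stretches of text):

```python
import re

# Variant of the slide's federal (.*) subsidy pattern: the optional
# non-capturing group (?: \w+)? allows at most one intervening word
pattern = re.compile(r"\bfederal(?: \w+)? subsidy\b")

text = ("The federal subsidy was cut, the federal farm subsidy grew, "
        "and the federal manufacturing subsidy stayed flat.")

print(pattern.findall(text))
# ['federal subsidy', 'federal farm subsidy', 'federal manufacturing subsidy']
```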
Non-compositionality

- white wine: not really white; the meaning is not fully predictable from the component words + syntax
- signal interpretation: a term used in Intelligent Signal Processing, whose connotations go beyond the compositional meaning
Similarly:
- regression coefficient
- good practice guidelines
Extreme cases: idioms such as kick the bucket, whose meaning is completely frozen.

Non-substitutability

If a phrase is a collocation, we can't substitute a word in the phrase for a near-synonym and still have the same overall meaning.
E.g.:
- white wine vs. yellow wine
- powerful tea vs. strong tea
Non-modifiability

Often, there are restrictions on inserting additional lexical items into a collocation, especially in the case of idioms.
Example: kick the bucket vs. ?kick the large bucket
NB: this is a matter of degree! Non-idiomatic collocations are more flexible.
Category restrictions

Frequency alone doesn't indicate collocational strength: by the is a very frequent phrase in English, but it is not a collocation.
Collocations tend to be formed from content words:
- A+N: powerful tea
- N+N: regression coefficient, mass demonstration
- N+PREP+N: degrees of freedom

Collocations in a broad sense

In many statistical NLP applications, the term collocation is understood quite broadly: any phrase which is frequent/regular enough, including:
- proper names (New York)
- compound nouns (elevator operator)
- set phrases (part of speech)
- idioms (kick the bucket)
Why are collocations interesting?

Several applications need to know about collocations:
- terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices)
- document classification: specialist phrases are good indicators of the topic of a text
- named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don't
Example application: parsing

She spotted the man with a pair of binoculars.
1. [VP spotted [NP the man [PP with a pair of binoculars]]]
2. [VP spotted [NP the man] [PP with a pair of binoculars]]
A parser might prefer (2) if spot/binoculars are frequent co-occurrences within a window of a certain width.

Example application: generation

NLG systems often need to map a semantic representation to a lexical/syntactic one. They shouldn't use the wrong adjective-noun combinations: clean face vs. ?immaculate face.
Lapata et al. (1999):
- ran an experiment asking people to rate different adjective-noun combinations;
- found the frequency of the combination to be a strong predictor of people's preferences;
- argue that NLG systems need to be able to make contextually-informed decisions in lexical choice.

Finding collocations in corpora: basic methods

Frequency-based approach

Motivation: if two (or three, or more) words occur together a lot within some window, they're a collocation.
Problem: frequent "collocations" under this definition include with the, onto a, etc., which are not very interesting.

Improving the frequency-based approach

Justeson & Katz (1995) propose a part-of-speech filter: only look at word combinations of the right category:
- N + N: regression coefficient
- N + PREP + N: jack in (the) box

This dramatically improves the results: content-word combinations are more likely to be phrases.

Case study: strong vs. powerful

See Manning & Schütze '99, Sec. 5.2. Motivation: try to distinguish the meanings of two quasi-synonyms, using data from the New York Times corpus.
Basic strategy:
- find all bigrams where w1 = strong or powerful;
- apply a POS filter to remove strong on [crime], powerful in [industry], etc.

Case study (cont'd)

Sample results from Manning & Schütze '99:
- f(strong support) = 50
- f(strong supporter) = 10
- f(powerful force) = 13
- f(powerful computers) = 10
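The POS-filter step can be sketched as follows, assuming the text has already been tagged as (word, tag) pairs; the tag names and the toy data are illustrative:

```python
# Minimal sketch of a Justeson & Katz-style POS filter; assumes the text
# is already tagged as (word, tag) pairs, with illustrative tag names
GOOD_PATTERNS = {("ADJ", "N"), ("N", "N")}

def filtered_bigrams(tagged):
    """Keep only bigrams whose POS-tag sequence matches an allowed pattern."""
    pairs = zip(tagged, tagged[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs
            if (t1, t2) in GOOD_PATTERNS]

tagged = [("strong", "ADJ"), ("support", "N"), ("on", "PREP"),
          ("powerful", "ADJ"), ("computers", "N")]

print(filtered_bigrams(tagged))
# [('strong', 'support'), ('powerful', 'computers')]
```

Justeson & Katz's actual pattern set also covers longer sequences such as N + PREP + N; the two bigram patterns here are just the simplest cases.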
Teaser: would you also expect powerful supporter? What's the difference between strong supporter and powerful supporter?
Limitations of frequency-based search

Frequency-based searches only work for fixed phrases. But collocations can be looser, allowing the interpolation of other words:
- knock on [the, X's, a] door
- pull [a] punch
Simple frequency won't do for these: different interpolated words dilute the frequency.

Using mean and variance

General idea: include bigrams even at a distance, i.e. patterns of the form w1 X w2, e.g. pull a punch.
Strategy:
- find co-occurrences of the two words in windows of varying length;
- compute the mean offset between w1 and w2;
- compute the variance of the offset between w1 and w2;
- if the offsets are randomly distributed, we observe high variance and conclude that (w1, w2) is not a collocation.
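The offset-collection step above can be sketched as follows; the toy data, the window size, and the sign convention (position of w1 minus position of w2, so that strong directly before opposition gives -1) are assumptions for this sketch:

```python
from statistics import mean, stdev

def offsets(tokens, w1, w2, max_dist=2):
    """Collect signed offsets pos(w1) - pos(w2) for co-occurrences
    of w1 and w2 within max_dist words of each other."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - max_dist), min(len(tokens), i + max_dist + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == w2:
                found.append(i - j)
    return found

# Toy data: strong (almost) always directly precedes opposition
tokens = ("strong opposition grew and strong opposition spread "
          "despite strong local opposition").split()

d = offsets(tokens, "strong", "opposition")
print(d)                  # [-1, -1, -2]
print(mean(d), stdev(d))  # low variance: offsets cluster around -1
```

A tight cluster of offsets (low standard deviation) suggests a collocation even when other words intervene; a flat, spread-out distribution suggests the two words merely co-occur by chance.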
Example outcomes (M&S '99)

Position of strong w.r.t. opposition: mean = -1.15, standard deviation = 0.67; i.e. most occurrences are strong [...] opposition.
Position of strong w.r.t. for: mean = -1.12, standard deviation = 2.15; i.e. for occurs anywhere around strong (the SD is higher than the mean): we can get strong support for, for the strong support, etc.

More limitations of frequency

If we use simple frequency or mean & variance, we have a good way of ranking likely collocations.
But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance?
We need to think in terms of hypothesis testing. Given a bigram (w1, w2), we want to compare:
1. the hypothesis that the two words are non-independent;
2. the hypothesis that they are independent.
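As a preview, here is a minimal sketch of one standard such comparison, the t-test discussed in Manning & Schütze '99, ch. 5: under the independence hypothesis, P(w1 w2) = P(w1)P(w2), and the t statistic measures how far the observed bigram frequency lies above that expectation. The counts below are invented for illustration:

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t statistic for a bigram (w1, w2), after Manning & Schuetze '99, ch. 5.
    c1, c2: unigram counts of w1 and w2; c12: bigram count; N: corpus size."""
    mu = (c1 / N) * (c2 / N)   # expected bigram probability under independence
    x = c12 / N                # observed bigram probability (sample mean)
    s2 = x * (1 - x)           # Bernoulli variance (approximately x for rare bigrams)
    return (x - mu) / sqrt(s2 / N)

# Invented illustrative counts, not drawn from a real corpus
t = t_score(c1=1000, c2=500, c12=30, N=100_000)
print(t)   # roughly 4.57; above 2.576, so independence would be rejected at alpha = 0.005
```

Hypothesis tests of this kind answer the "frequent enough?" question directly: they quantify how surprising the observed count is if the two words were independent.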