Eurolan 2005 Pedersen

Page 1: Eurolan 2005 Pedersen

Language Independent Methods of Clustering Similar Contexts (with applications)

Ted Pedersen
University of Minnesota, Duluth

http://www.d.umn.edu/~tpederse
[email protected]

Page 2: Eurolan 2005 Pedersen

The Problem

A context is a short unit of text, often a phrase to a paragraph in length, although it can be longer.

Input: N contexts
Output: K clusters

Where each member of a cluster is a context that is more similar to the other members of its cluster than to the contexts found in other clusters.

Page 3: Eurolan 2005 Pedersen

Language Independent Methods

Do not utilize syntactic information
  No parsers, part of speech taggers, etc. required
Do not utilize dictionaries or other manually created lexical resources
Based on lexical features selected from corpora
No manually annotated data of any kind, methods are completely unsupervised in the strictest sense
Assumption: word segmentation can be done by looking for white spaces between strings

Page 4: Eurolan 2005 Pedersen

Outline (Tutorial)

Background and motivations
Identifying lexical features
  Measures of association & tests of significance
Context representations
  First & second order
Dimensionality reduction
  Singular Value Decomposition
Clustering methods
  Agglomerative & partitional techniques
Cluster labeling
Evaluation techniques
  Gold standard comparisons

Page 5: Eurolan 2005 Pedersen

Outline (Practical Session)

Headed contexts
  Name Discrimination
  Word Sense Discrimination
  Abbreviations
Headless contexts
  Email/Newsgroup Organization
  Newspaper text
Identifying Sets of Related Words

Page 6: Eurolan 2005 Pedersen

SenseClusters

A package designed to cluster contexts
Integrates with various other tools
  Ngram Statistics Package
  Cluto
  SVDPACKC

http://senseclusters.sourceforge.net

Page 7: Eurolan 2005 Pedersen

Many thanks…

Satanjeev (“Bano”) Banerjee (M.S., 2002)
  Founding developer of the Ngram Statistics Package (2000-2001)
  Now PhD student in the Language Technology Institute at Carnegie Mellon University
  http://www-2.cs.cmu.edu/~banerjee/
Amruta Purandare (M.S., 2004)
  Founding developer of SenseClusters (2002-2004)
  Now PhD student in Intelligent Systems at the University of Pittsburgh
  http://www.cs.pitt.edu/~amruta/
Anagha Kulkarni (M.S., 2006, expected)
  Enhancing SenseClusters since Fall 2004!
  http://www.d.umn.edu/~kulka020/
National Science Foundation (USA) for supporting Bano, Amruta, Anagha and me (!) via CAREER award #0092784

Page 8: Eurolan 2005 Pedersen

Practical Session

Experiment with SenseClusters
  http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi
  Has both a command line and web interface (above)
Can be installed on a Linux/Unix machine without too much work
  http://senseclusters.sourceforge.net
  Has some dependencies that must be installed, so having supervisor access and/or sysadmin experience helps
  Complete system (SenseClusters plus dependencies) is available on CD

Page 9: Eurolan 2005 Pedersen

Background and Motivations

Page 10: Eurolan 2005 Pedersen

Headed and Headless Contexts

A headed context includes a target word
  Our goal is to collect multiple contexts that mention a particular target word, in order to try to identify different senses of that word
A headless context has no target word
  Our goal is to identify the contexts that are similar to each other

Page 11: Eurolan 2005 Pedersen

Headed Contexts (input)

I can hear the ocean in that shell.
My operating system shell is bash.
The shells on the shore are lovely.
The shell command line is flexible.
The oyster shell is very hard and black.

Page 12: Eurolan 2005 Pedersen

Headed Contexts (output)

Cluster 1:
  My operating system shell is bash.
  The shell command line is flexible.

Cluster 2:
  The shells on the shore are lovely.
  The oyster shell is very hard and black.
  I can hear the ocean in that shell.

Page 13: Eurolan 2005 Pedersen

Headless Contexts (input)

The new version of Linux is more stable and has better support for cameras.
My Chevy Malibu has had some front end troubles.
Osborne made one of the first personal computers.
The brakes went out, and the car flew into the house.
With the price of gasoline, I think I’ll be taking the bus more often!

Page 14: Eurolan 2005 Pedersen

Headless Contexts (output)

Cluster 1:
  The new version of Linux is more stable and has better support for cameras.
  Osborne made one of the first personal computers.

Cluster 2:
  My Chevy Malibu has had some front end troubles.
  The brakes went out, and the car flew into the house.
  With the price of gasoline, I think I’ll be taking the bus more often!

Page 15: Eurolan 2005 Pedersen

Applications

Web search results are headed contexts
  The term you search for is included in the snippet
Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”
  Organizing web search results is an important problem.
If you click on search results or follow links in pages found, you will encounter headless contexts too…

Page 16: Eurolan 2005 Pedersen

Page 17: Eurolan 2005 Pedersen

Page 18: Eurolan 2005 Pedersen

Page 19: Eurolan 2005 Pedersen

Page 20: Eurolan 2005 Pedersen

Page 21: Eurolan 2005 Pedersen

Applications

Email (public or private) is made up of headless contexts
  Short, usually focused…
Cluster similar email messages together
  Automatic email foldering
  Take all messages from sent-mail file or inbox and organize into categories

Page 22: Eurolan 2005 Pedersen

Page 23: Eurolan 2005 Pedersen

Page 24: Eurolan 2005 Pedersen

Applications

News articles are another example of headless contexts
  Entire article or first paragraph
  Short, usually focused
Cluster similar articles together

Page 25: Eurolan 2005 Pedersen

Page 26: Eurolan 2005 Pedersen

Page 27: Eurolan 2005 Pedersen

Page 28: Eurolan 2005 Pedersen

Underlying Premise…

You shall know a word by the company it keeps
  Firth, 1957 (Studies in Linguistic Analysis)
Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
  Harris, 1968 (Mathematical Structures of Language)
Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
  Miller and Charles, 1991 (Language and Cognitive Processes)
Various extensions…
  Similar contexts will have similar meanings, etc.
  Names that occur in similar contexts will refer to the same underlying person, etc.

Page 29: Eurolan 2005 Pedersen

Identifying Lexical Features

Measures of Association and Tests of Significance

Page 30: Eurolan 2005 Pedersen

What are features?

Features represent the (hopefully) salient characteristics of the contexts to be clustered
Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features
Vectors/contexts that include many of the same features will be similar to each other

Page 31: Eurolan 2005 Pedersen

Where do features come from?

In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
  This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection
  It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
    Email or news articles

Page 32: Eurolan 2005 Pedersen

Feature Selection

“Test” data – the contexts to be clustered
  Assume that the feature selection data is the same as the test data, unless otherwise indicated
“Training” data – a separate corpus of held out feature selection data (that will not be clustered)
  May need to use if you have a small number of contexts to cluster (e.g., web search results)
  This sense of “training” due to Schütze (1998)

Page 33: Eurolan 2005 Pedersen

Lexical Features

Unigram – a single word that occurs more than a given number of times
Bigram – an ordered pair of words that occur together more often than expected by chance
  Consecutive or may have intervening words
Co-occurrence – an unordered bigram
Target Co-occurrence – a co-occurrence where one of the words is the target word

Page 34: Eurolan 2005 Pedersen

Bigrams

fine wine (window size of 2)
baseball bat
house of representatives (window size of 3)
president of the republic (window size of 4)
apple orchard

Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)
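A minimal sketch of this windowing in Python (illustrative only, not the actual interface of NSP's count.pl):

```python
from collections import Counter

def bigrams(tokens, window=2):
    """Ordered pairs (w1, w2) where w2 occurs within `window` words of w1."""
    pairs = Counter()
    for i, w1 in enumerate(tokens):
        # w2 may be consecutive (window=2) or have intervening words
        for w2 in tokens[i + 1 : i + window]:
            pairs[(w1, w2)] += 1
    return pairs

print(bigrams("president of the republic".split(), window=4))
# the pair ('president', 'republic') is captured with window size 4
```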

Page 35: Eurolan 2005 Pedersen

Co-occurrences

tropics water
boat fish
law president
train travel

Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations

Page 36: Eurolan 2005 Pedersen

Bigrams and Co-occurrences

Pairs of words tend to be much less ambiguous than unigrams
  “bank” versus “river bank” and “bank card”
  “dot” versus “dot com” and “dot product”
Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)
Unigrams are noisy, but bountiful

Page 37: Eurolan 2005 Pedersen

“Occur together more often than expected by chance…”

Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
  Throw out bigrams that include one or two stop words
Expected values are calculated, based on the model of independence and the observed values
  How often would you expect these words to occur together, if they only occurred together by chance?
If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.

Page 38: Eurolan 2005 Pedersen

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial        100              400
!Artificial       300          100,000

(only the joint frequency of “Artificial Intelligence” and the marginal totals are counted directly; the next slide fills in the remaining cells)

Page 39: Eurolan 2005 Pedersen

2x2 Contingency Table

              Intelligence   !Intelligence
Artificial        100             300          400
!Artificial       200          99,400       99,600
                  300          99,700      100,000

Page 40: Eurolan 2005 Pedersen

2x2 Contingency Table (observed / expected in each cell)

              Intelligence     !Intelligence
Artificial    100.0 / 1.2      300.0 / 398.8          400
!Artificial   200.0 / 298.8    99,400.0 / 99,301.2    99,600
              300              99,700                 100,000

Page 41: Eurolan 2005 Pedersen

Measures of Association

G² = 2 × Σ_{i,j=1..2} observed(w_i, w_j) × log( observed(w_i, w_j) / expected(w_i, w_j) )

X² = Σ_{i,j=1..2} [ observed(w_i, w_j) − expected(w_i, w_j) ]² / expected(w_i, w_j)
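These formulas are straightforward to compute directly; a minimal Python sketch, using the observed counts from the Artificial/Intelligence table above:

```python
from math import log

# observed counts: rows = Artificial / !Artificial,
# columns = Intelligence / !Intelligence
observed = [[100.0, 300.0],
            [200.0, 99400.0]]

n = sum(map(sum, observed))
row = [sum(r) for r in observed]
col = [observed[0][j] + observed[1][j] for j in range(2)]

# expected counts under the model of independence
expected = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

g2 = 2 * sum(observed[i][j] * log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2) if observed[i][j] > 0)
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

print(g2, x2)   # x2 comes out near 8191.8 for this table
```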

Page 42: Eurolan 2005 Pedersen

Measures of Association

X² = 8191.78
G² = 7502.88

(scores for the Artificial/Intelligence table above)

Page 43: Eurolan 2005 Pedersen

Interpreting the Scores…

G² and X² are asymptotically approximated by the chi-squared distribution…

This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G² or X² scores for each resulting table, and plot the distribution of the scores, you *should* get …

Page 44: Eurolan 2005 Pedersen

(figure: the chi-squared distribution)

Page 45: Eurolan 2005 Pedersen

Interpreting the Scores…

Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
  H0: the words in the bigram are independent
  3.841 is associated with 95% confidence that the null hypothesis should be rejected

Page 46: Eurolan 2005 Pedersen

Measures of Association

There are numerous measures of association that can be used to identify bigram and co-occurrence features
Many of these are supported in the Ngram Statistics Package (NSP)
  http://www.d.umn.edu/~tpederse/nsp.html

Page 47: Eurolan 2005 Pedersen

Measures Supported in NSP

Log-likelihood Ratio (ll)
True Mutual Information (tmi)
Pearson’s Chi-squared Test (x2)
Pointwise Mutual Information (pmi)
Phi coefficient (phi)
T-test (tscore)
Fisher’s Exact Test (leftFisher, rightFisher)
Dice Coefficient (dice)
Odds Ratio (odds)

Page 48: Eurolan 2005 Pedersen

NSP

Will explore NSP during practical session
  Integrated into SenseClusters, may also be used in stand-alone mode
  Can be installed easily on a Linux/Unix system from CD or download from http://www.d.umn.edu/~tpederse/nsp.html
  I’m told it can also be installed on Windows (via cygwin or ActivePerl), but I have no personal experience of this…

Page 49: Eurolan 2005 Pedersen

Summary

Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data
  Language independent
Unigrams usually only selected by frequency
  Remember, no labeled data from which to learn, so somewhat less effective as features than in the supervised case
Bigrams and co-occurrences can also be selected by frequency, or better yet measures of association
  Bigrams and co-occurrences need not be consecutive
  Stop words should be eliminated
  Frequency thresholds are helpful (e.g., a unigram/bigram that occurs once may be too rare to be useful)

Page 50: Eurolan 2005 Pedersen

Related Work

Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests
  http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf
Pedersen, 1996 (SCSUG) explanation of exact tests, and comparison to log-likelihood
  http://arxiv.org/abs/cmp-lg/9608010
  (also see Pedersen, Kayaalp, and Bruce, AAAI-1996)
Dunning, 1993 (Computational Linguistics) introduces the log-likelihood ratio for collocation identification
  http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf

Page 51: Eurolan 2005 Pedersen

Context Representations

First and Second Order Methods

Page 52: Eurolan 2005 Pedersen

Once features selected…

We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful
  We also have any frequency and measure of association scores that were used in their selection
Convert the contexts to be clustered into a vector representation based on these features

Page 53: Eurolan 2005 Pedersen

First Order Representation

Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context
  Value may be binary, a frequency count, or an association score
Context by Feature representation

Page 54: Eurolan 2005 Pedersen

Contexts

C1: There was an island curse of black magic cast by that voodoo child.
C2: Harold, a known voodoo child, was gifted in the arts of black magic.
C3: Despite their military might, it was a serious error to attack.
C4: Military might is no defense against a voodoo child or an island curse.

Page 55: Eurolan 2005 Pedersen

Unigram Feature Set

island 1000
black 700
curse 500
magic 400
child 200

(assume these are frequency counts obtained from some corpus…)

Page 56: Eurolan 2005 Pedersen

First Order Vectors of Unigrams

      island  black  curse  magic  child
C1       1      1      1      1      1
C2       0      1      0      1      1
C3       0      0      0      0      0
C4       1      0      1      0      1
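A minimal Python sketch of building these binary vectors from the example contexts (the simple tokenization here is an assumption for illustration; SenseClusters derives its features via NSP):

```python
import re

contexts = [
    "There was an island curse of black magic cast by that voodoo child.",
    "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Despite their military might, it was a serious error to attack.",
    "Military might is no defense against a voodoo child or an island curse.",
]
features = ["island", "black", "curse", "magic", "child"]

def first_order(context):
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    return [1 if f in tokens else 0 for f in features]

for c in contexts:
    print(first_order(c))   # reproduces the C1-C4 rows above
```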

Page 57: Eurolan 2005 Pedersen

Bigram Feature Set

island curse 189.2
black magic 123.5
voodoo child 120.0
military might 100.3
serious error 89.2
island child 73.2
voodoo might 69.4
military error 54.9
black child 43.2
serious curse 21.2

(assume these are log-likelihood scores based on frequency counts from some corpus)

Page 58: Eurolan 2005 Pedersen

First Order Vectors of Bigrams

      black magic  island curse  military might  serious error  voodoo child
C1         1            1              0               0              1
C2         1            0              0               0              1
C3         0            0              1               1              0
C4         0            1              1               0              1

Page 59: Eurolan 2005 Pedersen

First Order Vectors

Can have binary values or weights associated with frequency, etc.
May optionally be smoothed/reduced with Singular Value Decomposition
  More on that later…
The contexts are ready for clustering…
  More on that later…

Page 60: Eurolan 2005 Pedersen

Second Order Representation

Build a word by word matrix from the features
  Must be bigrams or co-occurrences
  (optionally) reduce dimensionality w/SVD
  Each row represents first order co-occurrences
Represent a context by replacing each word that has an entry in the word by word matrix with its associated vector
Average the word vectors found for the context

Due to Schütze (1998)

Page 61: Eurolan 2005 Pedersen

Word by Word Matrix

          magic  curse  might  error  child
black     123.5    0      0      0     43.2
island      0    189.2    0      0     73.2
military    0      0    100.3   54.9    0
serious     0     21.2    0     89.2    0
voodoo      0      0     69.4    0    120.0

Page 62: Eurolan 2005 Pedersen

Word by Word Matrix

…can also be used to identify sets of related words
In the case of bigrams, rows represent the first word in a bigram and columns represent the second word
  Matrix is asymmetric
In the case of co-occurrences, rows and columns are equivalent
  Matrix is symmetric
The vector (row) for each word represents a set of first order features for that word
Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context

Page 63: Eurolan 2005 Pedersen

There was an island curse of black magic cast by that voodoo child.

          magic  curse  might  error  child
black     123.5    0      0      0     43.2
island      0    189.2    0      0     73.2
voodoo      0      0     69.4    0    120.0

Page 64: Eurolan 2005 Pedersen

Second Order Representation

There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child.

[curse, child] + [magic, child] + [might, child]

Page 65: Eurolan 2005 Pedersen

There was an island curse of black magic cast by that voodoo child.

      magic  curse  might  error  child
C1     41.2   63.1   24.4    0     78.8
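A minimal Python sketch of that averaging step, with the three matrix rows retyped from the slide above:

```python
word_vectors = {
    #          magic  curse  might  error  child
    "black":  [123.5,   0.0,   0.0,   0.0,  43.2],
    "island": [  0.0, 189.2,   0.0,   0.0,  73.2],
    "voodoo": [  0.0,   0.0,  69.4,   0.0, 120.0],
}

context = "There was an island curse of black magic cast by that voodoo child."
rows = [word_vectors[w] for w in context.lower().split() if w in word_vectors]

# element-wise average of the word vectors = the second order context vector
second_order = [sum(col) / len(rows) for col in zip(*rows)]
print([round(v, 1) for v in second_order])   # ≈ [41.2, 63.1, 23.1, 0.0, 78.8]
```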

Page 66: Eurolan 2005 Pedersen

First versus Second Order

First Order represents a context by showing which features occurred in that context
  This is what feature vectors normally do…
Second Order allows for additional information about a word to be incorporated into the representation
  Feature values based on information found outside of the immediate context

Page 67: Eurolan 2005 Pedersen

Second Order Co-Occurrences

“black” and “island” show similarity because both words have occurred with “child”
“black” and “island” are second order co-occurrences with each other, since both occur with “child” but not with each other (i.e., “black island” is not observed)

Page 68: Eurolan 2005 Pedersen

Second Order Co-occurrences

Imagine a co-occurrence graph
  Word network
First order co-occurrences are directly connected
Second order co-occurrences are connected to each other via one other word
The kocos.pl program in the Ngram Statistics Package finds kth order co-occurrences
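A minimal Python sketch of the graph view (a toy adjacency structure, not kocos.pl's actual interface):

```python
# first order co-occurrence "graph": word -> set of co-occurring words
graph = {
    "black":  {"magic", "child"},
    "island": {"curse", "child"},
    "magic":  {"black"}, "curse": {"island"},
    "child":  {"black", "island"},
}

def second_order(w1, w2):
    """True if w1 and w2 share a neighbor but are not directly connected."""
    return bool(graph[w1] & graph[w2]) and w2 not in graph[w1]

print(second_order("black", "island"))   # True: connected via "child"
```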

Page 69: Eurolan 2005 Pedersen

Summary

First order representations are intuitive, but…
  Can suffer from sparsity
  Contexts represented based on the features that occur in those contexts
Second order representations are harder to visualize, but…
  Allow a word to be represented by the words it co-occurs with (i.e., the company it keeps)
  Allow a context to be represented by the words that occur with the words in the context
  Helps combat sparsity…

Page 70: Eurolan 2005 Pedersen

Related Work

Pedersen and Bruce 1997 (EMNLP) presented first order method of discrimination
  http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf
Schütze 1998 (Computational Linguistics) introduced second order method
  http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf
Purandare and Pedersen 2004 (CoNLL) compared first and second order methods
  http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf
  First order better if you have lots of data
  Second order better with smaller amounts of data

Page 71: Eurolan 2005 Pedersen

Dimensionality Reduction

Singular Value Decomposition

Page 72: Eurolan 2005 Pedersen

Motivation

First order matrices are very sparse
  Word by word
  Context by feature
NLP data is noisy
  No stemming performed
  Synonyms

Page 73: Eurolan 2005 Pedersen

Many Methods

Singular Value Decomposition (SVD)
  SVDPACKC http://www.netlib.org/svdpack/
Multi-Dimensional Scaling (MDS)
Principal Components Analysis (PCA)
Independent Components Analysis (ICA)
Linear Discriminant Analysis (LDA)
etc…

Page 74: Eurolan 2005 Pedersen

Effect of SVD

SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space
  If “dog” and “collie” and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be a single dimension that represents “canines”

Page 75: Eurolan 2005 Pedersen

Effect of SVD

The dimensions of the matrix after SVD are principal components that represent the meaning of concepts
  Similar columns are grouped together
SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD

Page 76: Eurolan 2005 Pedersen

How can SVD be used?

SVD on first order contexts will reduce a context by feature representation down to a smaller number of features
  Latent Semantic Analysis typically performs SVD on a word by context representation, where the contexts are reduced
SVD used in creating second order context representations
  Reduce word by word matrix
SVD could also be used on resultant second order context representations (although not supported)

Page 77: Eurolan 2005 Pedersen

Word by Word Matrix

        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc        2      0      0     1    3     1     0        0        0       0      0
body      0      3      0     0    0     0     2        0        0       2      1
disk      1      0      0     2    0     3     0        1        2       0      0
petri     0      2      1     0    0     0     2        0        1       0      1
lab       0      0      3     0    2     0     2        0        2       1      3
sales     0      0      0     2    3     0     0        1        2       0      0
linux     2      0      0     1    3     2     0        1        1       0      0
debt      0      0      0     2    3     4     0        2        0       0      0

Page 78: Eurolan 2005 Pedersen

Singular Value Decomposition

A = UDV'

Page 79: Eurolan 2005 Pedersen

U

(the matrix of left singular vectors; its numeric entries are garbled in this transcription and are omitted)

Page 80: Eurolan 2005 Pedersen

D

(the diagonal matrix of singular values)

9.19
6.36
3.99
3.25
2.52
2.30
1.26
0.66
0.00
0.00
0.00

Page 81: Eurolan 2005 Pedersen

V

(the matrix of right singular vectors; its numeric entries are garbled in this transcription and are omitted)

Page 82: Eurolan 2005 Pedersen

Word by Word Matrix After SVD

        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
pc       .73    .00    .11   1.3  2.0    .01      .86       .77    .00    .09
body     .00    1.2    1.3   .00  .33    1.6      .00       .85    .84    1.5
disk     .76    .00    .01   1.3  2.1    .00      .91       .72    .00    .00
germ     .00    1.1    1.2   .00  .49    1.5      .00       .86    .77    1.4
lab      .21    1.7    2.0   .35  1.7    2.5      .18       1.7    1.2    2.3
sales    .73    .15    .39   1.3  2.2    .35      .85       .98    .17    .41
linux    .96    .00    .16   1.7  2.7    .03      1.1       1.0    .00    .13
debt     1.2    .00    .00   2.1  3.2    .00      1.5       1.1    .00    .00

Page 83: Eurolan 2005 Pedersen

Second Order Representation

• I got a new disk today!
• What do you think of linux?

        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
disk     .76    .00    .01   1.3  2.1    .00      .91       .72    .00    .00
linux    .96    .00    .16   1.7  2.7    .03      1.1       1.0    .00    .13

These two contexts share no words in common, yet they are similar! disk and linux both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”
The two contexts are similar because they share many second order co-occurrences

Page 84: Eurolan 2005 Pedersen

Clustering Methods

Agglomerative and Partitional

Page 85: Eurolan 2005 Pedersen

Many many methods…

Cluto supports a wide range of different clustering methods
  Agglomerative
    Average, single, complete link…
  Partitional
    K-means
  Hybrid
    Repeated bisections
SenseClusters integrates with Cluto
  http://www-users.cs.umn.edu/~karypis/cluto/

Page 86: Eurolan 2005 Pedersen

General Methodology

Represent contexts to be clustered in first or second order vectors
Cluster the vectors directly, or convert to a similarity matrix and then cluster
  vcluster
  scluster

Page 87: Eurolan 2005 Pedersen

Agglomerative Clustering

Create a similarity matrix of instances to be discriminated
  Results in a symmetric “instance by instance” matrix, where each cell contains the similarity score between a pair of instances
  Typically a first order representation, where similarity is based on the features observed in the pair of instances

Page 88: Eurolan 2005 Pedersen

Measuring Similarity

Integer Values
  Matching Coefficient: |X ∩ Y|
  Jaccard Coefficient: |X ∩ Y| / |X ∪ Y|
  Dice Coefficient: 2 |X ∩ Y| / (|X| + |Y|)

Real Values
  Cosine: (X · Y) / (|X| |Y|)
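A minimal Python sketch of these measures, treating X and Y as feature sets for the integer-valued measures and as real-valued vectors for the cosine:

```python
from math import sqrt

def matching(X, Y): return len(X & Y)
def jaccard(X, Y):  return len(X & Y) / len(X | Y)
def dice(X, Y):     return 2 * len(X & Y) / (len(X) + len(Y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

X, Y = {"island", "curse", "black", "magic", "child"}, {"black", "magic", "child"}
print(matching(X, Y), jaccard(X, Y), dice(X, Y))          # 3 0.6 0.75
print(round(cosine([1, 1, 1, 1, 1], [0, 1, 0, 1, 1]), 2))  # 0.77
```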

Page 89: Eurolan 2005 Pedersen

Agglomerative Clustering

Apply Agglomerative Clustering algorithm to the similarity matrix
  To start, each instance is its own cluster
  Form a cluster from the most similar pair of instances
  Repeat until the desired number of clusters is obtained
Advantages: high quality clustering
Disadvantages: computationally expensive, must carry out exhaustive pairwise comparisons

Page 90: Eurolan 2005 Pedersen

Average Link Clustering

      S1  S2  S3  S4
S1         3   4   2
S2     3       2   0
S3     4   2       1
S4     2   0   1

Merge the most similar pair, S1 and S3 (similarity 4), and average their similarities to the remaining instances:

        S1S3  S2   S4
S1S3          2.5  1.5
S2      2.5        0
S4      1.5   0

where 2.5 = (3+2)/2 and 1.5 = (2+1)/2. Merging S1S3 with S2 (similarity 2.5) then leaves two clusters, S1S3S2 and S4.
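A minimal average link sketch in Python over the S1-S4 matrix above (illustrative only; Cluto's scluster does this at scale):

```python
sim = {frozenset(p): s for p, s in {
    ("S1", "S2"): 3, ("S1", "S3"): 4, ("S1", "S4"): 2,
    ("S2", "S3"): 2, ("S2", "S4"): 0, ("S3", "S4"): 1}.items()}

def avg_link(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
    return sum(pairs) / len(pairs)

clusters = [{"S1"}, {"S2"}, {"S3"}, {"S4"}]
while len(clusters) > 2:                 # stop at the desired number of clusters
    i, j = max(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] |= clusters.pop(j)
    print(clusters)   # merges S1+S3 first, then S1S3+S2 (average 2.5)
```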

Page 91: Eurolan 2005 Pedersen

Partitional Methods

Select some number of contexts in feature space to act as centroids
Assign each context to the nearest centroid, forming clusters
After all contexts are assigned, recompute the centroids
Repeat until stable clusters are found
  Centroids don’t shift from iteration to iteration
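A minimal k-means sketch in Python (illustrative only; Cluto's vcluster provides efficient partitional clustering):

```python
import random

def kmeans(vectors, k, iters=100):
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        # assign each context to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        # recompute each centroid as the mean of its cluster
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:             # stable: centroids stopped shifting
            return clusters
        centroids = new
    return clusters
```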

Page 92: Eurolan 2005 Pedersen

Partitional Methods

Advantages: fast
Disadvantages: very dependent on the initial placement of the centroids

Page 93: Eurolan 2005 Pedersen

Cluster Labeling

Page 94: Eurolan 2005 Pedersen

Results of Clustering

Each cluster consists of some number of contexts
Each context is a short unit of text
Apply measures of association to the contents of each cluster to determine the N most significant bigrams
Use those bigrams as a label for the cluster

Page 95: Eurolan 2005 Pedersen

Label Types

The N most significant bigrams for each cluster will act as a descriptive label
The M most significant bigrams that are unique to each cluster will act as a discriminating label
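A minimal Python sketch of both label types, using hypothetical per-cluster bigram scores:

```python
scores = {   # hypothetical association scores for each cluster's bigrams
    "C1": {("operating", "system"): 42.1, ("command", "line"): 30.5},
    "C2": {("oyster", "shell"): 51.3, ("command", "line"): 12.0},
}

N = 2
descriptive = {c: sorted(b, key=b.get, reverse=True)[:N] for c, b in scores.items()}
discriminating = {c: [bg for bg in descriptive[c]
                      if all(bg not in descriptive[o] for o in descriptive if o != c)]
                  for c in descriptive}
print(descriptive)      # top-N bigrams per cluster
print(discriminating)   # ("command", "line") is dropped: it is not unique
```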

Page 96: Eurolan 2005 Pedersen

Evaluation Techniques

Comparison to gold standard data

Page 97: Eurolan 2005 Pedersen

Evaluation

If sense tagged text is available, it can be used for evaluation
  But don’t use sense tags for clustering or feature selection!
Assume that sense tags represent “true” clusters, and compare these to discovered clusters
  Find the mapping of clusters to senses that attains maximum accuracy

Page 98: Eurolan 2005 Pedersen

Evaluation

Pseudo words are especially useful, since it is hard to find data that is discriminated
  Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  http://www.d.umn.edu/~tpederse/tools.html
Baseline Algorithm – group all instances into one cluster; this will reach “accuracy” equal to the majority classifier

Page 99: Eurolan 2005 Pedersen

Evaluation

Pseudo words are especially useful, since it is hard to find data that is discriminated
  Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  http://www.d.umn.edu/~kulka020/kanaghaName.html
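A minimal Python sketch of the conflation step with a toy pseudo word (the NameConflate program linked above does this properly over large corpora):

```python
import re

def conflate(text, w1, w2, pseudo):
    """Replace every occurrence of w1 or w2 with the pseudo word."""
    return re.sub(r"\b(?:{}|{})\b".format(w1, w2), pseudo, text)

text = "I ate a banana. The door was open."
print(conflate(text, "banana", "door", "banana_door"))
# "I ate a banana_door. The banana_door was open."
```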

Page 100: Eurolan 2005 Pedersen

Baseline Algorithm

Baseline Algorithm – group all instances into one cluster; this will reach “accuracy” equal to the majority classifier
What if the clustering said everything should be in the same cluster?

Page 101: Eurolan 2005 Pedersen

Baseline Performance

With columns ordered S1, S2, S3:

        S1   S2   S3   Totals
C1       0    0    0       0
C2       0    0    0       0
C3      80   35   55     170
Totals  80   35   55     170

With columns ordered S3, S2, S1:

        S3   S2   S1   Totals
C1       0    0    0       0
C2       0    0    0       0
C3      55   35   80     170
Totals  55   35   80     170

(0+0+55)/170 = .32 if C3 is S1
(0+0+80)/170 = .47 if C3 is S3

Page 102: Eurolan 2005 Pedersen

EuroLAN-2005 Summer SchoolEuroLAN-2005 Summer School 102102

Evaluation

Suppose that C1 is labeled S1, C2 as S2, and C3 as S3
Accuracy = (10 + 0 + 10) / 170 = 12%
The diagonal shows how many members of each cluster actually belong to the sense given on the column
Can the "columns" be rearranged to improve the overall accuracy?
Optimally assign clusters to senses

        S1    S2    S3    Totals
C1      10    30     5        45
C2      20     0    40        60
C3      50     5    10        65
Totals  80    35    55       170

Page 103: Eurolan 2005 Pedersen


Evaluation

The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%
Find the ordering of the columns in the matrix that maximizes the sum of the diagonal
This is an instance of the Assignment Problem from Operations Research, or of finding a Maximal Matching of a Bipartite Graph in Graph Theory (a small worked example follows the matrix below)

        S2    S3    S1    Totals
C1      30     5    10        45
C2       0    40    20        60
C3       5    10    50        65
Totals  35    55    80       170
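Here is a small worked example of that mapping step, using SciPy's linear_sum_assignment solver on the confusion matrix from the previous slide; this illustrates the assignment problem itself and is not the code SenseClusters uses internally:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Rows are clusters C1..C3, columns are senses S1..S3.
    confusion = np.array([
        [10, 30,  5],   # C1
        [20,  0, 40],   # C2
        [50,  5, 10],   # C3
    ])

    # Choose the cluster-to-sense assignment that maximizes the diagonal sum.
    rows, cols = linear_sum_assignment(confusion, maximize=True)
    for c, s in zip(rows, cols):
        print("C%d -> S%d" % (c + 1, s + 1))   # C1->S2, C2->S3, C3->S1

    accuracy = confusion[rows, cols].sum() / confusion.sum()
    print(round(float(accuracy), 2))           # 120/170 = 0.71

A brute-force search over the 3! = 6 column orderings would give the same answer here; the Hungarian-style solver simply scales to many clusters and senses.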

Page 104: Eurolan 2005 Pedersen


Analysis

Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning
Evaluation based on assuming that sense tags represent the "true" clusters is likely a bit harsh. Alternatives?
  Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
  Use the contents of the cluster to generate a descriptive label that could be inspected by a human

Page 105: Eurolan 2005 Pedersen


Practical Session

Experiments with SenseClusters

Page 106: Eurolan 2005 Pedersen


Experimental Data

Available on the web site
  http://senseclusters.sourceforge.net
Available on CD
  Data/SenseClusters-Data
SenseClusters requires data to be in the Senseval-2 lexical sample format
  Plenty of such data is available on the CD and from the web site

Page 107: Eurolan 2005 Pedersen


Creating Experimental Data

NameConflate program
  Creates name-conflated data from the English GigaWord corpus
Text2Headless program
  Converts plain text into headless contexts
http://www.d.umn.edu/~tpederse/tools.html

Page 108: Eurolan 2005 Pedersen


Name Conflation Data

Smaller Data Set (also on web as SC-Web…)
  Country - Noun
  Name - Name
  Noun - Noun
Larger Data Sets (also on web as Split-Smaller…)
  Adidas - Puma
  Emile Lahoud - Askar Akayev
CICLING data (CD only)
  David Beckham - Ronaldo
  Microsoft - IBM
ACL 2005 demo data (CD only)
  Name - Name

Page 109: Eurolan 2005 Pedersen


Clustering Contexts

ACL 2005 Demo (also on web as Email…)
  Various partitions of the 20 newsgroups data sets
Spanish Data (web only)
  News articles, each of which mentions the abbreviations PP or PSOE

Page 110: Eurolan 2005 Pedersen


Name Discrimination

Page 111: Eurolan 2005 Pedersen


George Millers!

Page 112: Eurolan 2005 Pedersen


Headed Clustering

Name Discrimination
  Tom Hanks
  Russell Crowe

[Slides 113-134 contained only screenshots, presumably of the SenseClusters name discrimination demos; no text was captured in this transcript.]

Page 135: Eurolan 2005 Pedersen


Headless Contexts

Email / 20 newsgroups data
Spanish Text

[Slides 136-156 contained only screenshots, presumably of the headless-context (email and Spanish text) clustering demos; no text was captured in this transcript.]

Page 157: Eurolan 2005 Pedersen


If, after all these matrices, you crave knowledge-based resources…

Read on…

Page 158: Eurolan 2005 Pedersen


WordNet-Similarity

Not language independent
  Based on English WordNet
But it can be combined with distributional methods to good effect
  McCarthy et al., ACL-2004
Perl module (a rough Python analogue is sketched below)
  http://search.cpan.org/dist/WordNet-Similarity
Web interface
  http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

Page 159: Eurolan 2005 Pedersen


Many thanks!

Satanjeev "Bano" Banerjee (M.S., 2002)
  Inventor of the Adapted Lesk Algorithm (IJCAI-2003), which is the earliest origin and motivation for WordNet-Similarity…
  Now a PhD student at LTI/CMU…
Siddharth Patwardhan (M.S., 2003)
  Founding developer of WordNet-Similarity (2001-2003)
  Now a PhD student at the University of Utah
  http://www.cs.utah.edu/~sidd/
Jason Michelizzi (M.S., 2005)
  Enhanced WordNet-Similarity in many ways and applied it to all-words sense disambiguation (2003-2005)
  http://www.d.umn.edu/~mich0212
Thanks to NSF for supporting Bano, and to the University of Minnesota for supporting Bano, Sid, and Jason via various internal sources

Page 160: Eurolan 2005 Pedersen


Vector measure

Build a word-by-word matrix from the WordNet Gloss Corpus
  1.4 million words
Treat glosses as contexts, and use a second-order representation where words are replaced with their vectors from the matrix
  Average together all the vectors to represent a concept/definition (a toy sketch follows this slide)
High correlation with human relatedness judgments
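A toy sketch of this second-order idea, assuming a tiny hand-made gloss "corpus" (the real measure is built from the full 1.4-million-word WordNet gloss corpus):

    from collections import Counter
    import math

    # Toy stand-in for the WordNet Gloss Corpus.
    glosses = {
        "car": "a motor vehicle with four wheels",
        "bus": "a vehicle carrying many passengers",
    }

    # Word-by-word co-occurrence matrix: words co-occur if they share a gloss.
    cooc = {}
    for gloss in glosses.values():
        words = gloss.split()
        for w in words:
            row = cooc.setdefault(w, Counter())
            for v in words:
                if v != w:
                    row[v] += 1

    def gloss_vector(gloss):
        """Second-order representation: average the co-occurrence
        vectors of the words in the gloss."""
        vec = Counter()
        words = gloss.split()
        for w in words:
            for v, n in cooc.get(w, Counter()).items():
                vec[v] += n / len(words)
        return vec

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a if k in b)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    relatedness = cosine(gloss_vector(glosses["car"]),
                         gloss_vector(glosses["bus"]))
    print(round(relatedness, 2))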

Page 161: Eurolan 2005 Pedersen


Many other measures

Path Based
  Path
  Leacock & Chodorow
  Wu & Palmer
Information Content Based
  Resnik
  Lin
  Jiang & Conrath
Relatedness
  Hirst & St-Onge
  Adapted Lesk
  Vector

[Slides 162-163 contained only images; no text was captured in this transcript.]

Page 164: Eurolan 2005 Pedersen


Thank you!

Questions are welcome at any time. Feel free to contact me in person or via email ([email protected]) at any time!

All of our software is free and open source; you are welcome to download, modify, redistribute, etc.
  http://www.d.umn.edu/~tpederse/code.html