
Page 1: Cicling2005

Name Discrimination by Clustering Similar Contexts

Ted Pedersen & Anagha Kulkarni
University of Minnesota, Duluth

Amruta Purandare
Now at University of Pittsburgh

Research Supported by National Science Foundation
Faculty Early Career Development Award (#0092784)

Page 2: Cicling2005

Name Discrimination

Different people have the same name: George (H.W.) Bush and George (W.) Bush

Different places have the same name: Duluth (MN) and Duluth (GA)

Different things have the same abbreviation: UMD (Duluth) and UMD (College Park)

Page 3: Cicling2005

Page 4: Cicling2005

Page 5: Cicling2005

Page 6: Cicling2005

Page 7: Cicling2005

Our goals?

Given 1000 contexts with “John Smith”, identify those that are similar to each other

Group similar contexts together, assuming they are associated with a single individual

Generate an identifying label from the content of the different clusters

Page 8: Cicling2005

Measuring Similarity of Words and Contexts with Large Corpora?

Second-order co-occurrences

Jim drives his car fast / Jim speeds in his auto

Car → motor, garage, gasoline, insurance
Auto → motor, insurance, gasoline, accident

Car and Auto occur with many of the same words. They are therefore similar!

Less direct relationship, more resistant to sparsity!

Page 9: Cicling2005

Word sense discrimination

Given 1000 contexts that include a particular target word (e.g., shell)

Cluster those contexts such that similar contexts come together

Similar contexts have similar meanings

Label each cluster with something that describes its content, maybe even provides a definition

Page 10: Cicling2005

Methodology

Feature Selection

Context Representation

Measuring Similarities

Clustering

Evaluation

Page 11: Cicling2005

Feature Selection

Identify features in large (separate) training corpora, or in the data to be clustered

Rely on lexical features: unigrams, bigrams, and co-occurrences

Page 12: Cicling2005

Lexical features

Unigrams: words that occur more than X times

Bigrams: ordered pairs of words, separated by at most 2-3 intervening words, that score above a cutoff on a measure of association

Co-occurrences: same as bigrams, but unordered
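The bigram selection described above can be sketched as follows. This is an illustrative Python sketch, not the actual toolkit (the Ngram Statistics Package is Perl and offers several association measures); pointwise mutual information stands in for those measures, and the `window`, `min_count`, and `pmi_cutoff` parameters are hypothetical names.

```python
import math
from collections import Counter

def bigram_features(tokens, window=3, min_count=2, pmi_cutoff=1.0):
    """Collect ordered word pairs with up to window-1 intervening
    words, keeping those whose pointwise mutual information (a
    stand-in for the association measures mentioned above) exceeds
    a cutoff."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter()
    for i, w1 in enumerate(tokens):
        # pair w1 with each of the next `window` tokens
        for j in range(i + 1, min(i + window + 1, n)):
            bigrams[(w1, tokens[j])] += 1
    features = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log(count * n / (unigrams[w1] * unigrams[w2]))
        if pmi > pmi_cutoff:
            features[(w1, w2)] = pmi
    return features
```

On a toy token stream where "new york" recurs, the pair scores well above the cutoff and is kept as a feature.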

Page 13: Cicling2005

Context representation

First order: unigrams, bigrams, and co-occurrences that occur in the training corpus and also occur in the context to be clustered

The context is represented as a vector that shows if (or how often) these features occur in the context to be clustered

Page 14: Cicling2005

Context Representation

Second order: bigrams or co-occurrences are used to create a matrix whose cells hold counts or a measure of association for each word pair

Rows serve as co-occurrence vectors for individual words

A context is represented by averaging the vectors of the words in that context

Page 15: Cicling2005

2nd Order Context Vectors

"The largest shell store by the sea shore"

            Sells     Water     North-West  Sandy     Bombs   Sales     Artillery
Sea         18.5533   3324.98   30.520      51.7812   8.7399  0         0
Shore       0         0         29.576      136.0441  0       0         0
Store       134.5102  205.5469  0           0         0       18818.55  0
O2 context  51.021    1176.84   20.032      62.6084   2.9133  6272.85   0
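The averaging step can be sketched in Python with the rows from this slide (values and column order copied from the table; NumPy stands in for the actual Perl implementation):

```python
import numpy as np

# Word-by-feature co-occurrence rows, copied from the slide
# (columns: sells, water, north-west, sandy, bombs, sales, artillery)
word_vectors = {
    "sea":   np.array([18.5533, 3324.98, 30.520, 51.7812, 8.7399, 0.0, 0.0]),
    "shore": np.array([0.0, 0.0, 29.576, 136.0441, 0.0, 0.0, 0.0]),
    "store": np.array([134.5102, 205.5469, 0.0, 0.0, 0.0, 18818.55, 0.0]),
}

def second_order_vector(context, word_vectors):
    """Represent a context as the average of the co-occurrence
    vectors of those of its words that appear in the matrix."""
    rows = [word_vectors[w] for w in context if w in word_vectors]
    return np.mean(rows, axis=0)

context = "the largest shell store by the sea shore".split()
o2 = second_order_vector(context, word_vectors)
```

Averaging the sea, shore, and store rows reproduces the "O2 context" row of the table (e.g. (18.5533 + 0 + 134.5102) / 3 ≈ 51.021 in the first column).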

Page 16: Cicling2005


2nd Order Context Vectors

Page 17: Cicling2005

Measuring Similarities

c1: {file, unix, commands, system, store}
c2: {machine, os, unix, system, computer, dos, store}

Matching = |X ∩ Y|
{unix, system, store} → 3

Cosine = |X ∩ Y| / (√|X| * √|Y|)
3/(√5*√7) = 3/(2.2361*2.6458) = 0.5071
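The two measures on this slide can be written directly over set representations of the contexts; a minimal sketch, using the c1/c2 example above:

```python
import math

def matching(x, y):
    """Matching coefficient: size of the overlap between the
    feature sets of two contexts."""
    return len(x & y)

def cosine(x, y):
    """Cosine between binary set representations:
    |X intersect Y| / (sqrt(|X|) * sqrt(|Y|))."""
    return len(x & y) / (math.sqrt(len(x)) * math.sqrt(len(y)))

c1 = {"file", "unix", "commands", "system", "store"}
c2 = {"machine", "os", "unix", "system", "computer", "dos", "store"}
```

Here the overlap is {unix, system, store}, so matching gives 3 and cosine gives 3/√35 ≈ 0.5071.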

Page 18: Cicling2005

Limitations of 1st or 2nd order

      Kill   Murder  Destroy  Fire   Shoot  Missile  Weapon
      2.53   0       1.28     0      3.24   0        28.72
      0      4.21    0        0.92   0      52.27    0

      Burn   CD      Fire     Pipe   Bomb   Command  Execute
      2.56   1.28    0        72.7   0      2.36     19.23
      34.2   0       22.1     46.2   14.6   0        17.77

Page 19: Cicling2005


Latent Semantic Analysis

Singular Value Decomposition

Captures Polysemy and Synonymy(?)

Conceptual Fuzzy Feature Matching

Word Space to Semantic Space
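The word-space-to-semantic-space step above can be sketched as a truncated SVD. This is a minimal illustration with `numpy.linalg.svd` standing in for SVDPack (which the actual system uses); the matrix and function name are made up for the example.

```python
import numpy as np

def to_semantic_space(matrix, k):
    """Truncated SVD: factor the co-occurrence matrix and keep the
    top-k singular values, mapping each row from word space into a
    denser k-dimensional semantic space."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]

# Rows 0-1 and rows 2-3 have identical co-occurrence patterns,
# so they land on the same point in the reduced space.
M = np.array([[1., 0., 1.],
              [1., 0., 1.],
              [0., 1., 0.],
              [0., 1., 0.]])
reduced = to_semantic_space(M, 2)
```

Words with similar (not just identical) co-occurrence patterns end up close together in the reduced space, which is the fuzzy conceptual matching the slide alludes to.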

Page 20: Cicling2005

After context representation…

Each context is represented by a vector of some sort

A first-order vector shows direct occurrence of features in the context

A second-order vector is an average of the word vectors that make up the context, and captures indirect relationships

Now, cluster the vectors!

Page 21: Cicling2005

Clustering

UPGMA: hierarchical, agglomerative

Repeated Bisections: hybrid, divisive + partitional
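The UPGMA side of this slide can be sketched in plain Python. The experiments actually use CLUTO; this naive version (quadratic scans over cluster pairs, cosine similarity assumed as the measure) is only illustrative.

```python
import numpy as np

def upgma(vectors, n_clusters):
    """Naive average-link (UPGMA) agglomerative clustering: keep
    merging the two clusters with the highest average pairwise
    cosine similarity until n_clusters remain."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best_sim, best_pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average similarity over all cross-cluster pairs
                sim = np.mean([cos(vectors[a], vectors[b])
                               for a in clusters[i] for b in clusters[j]])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters

# two contexts near [1, 0] and two near [0, 1]
contexts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
groups = upgma(contexts, 2)
```

On the toy vectors the two near-[1, 0] contexts merge first, then the two near-[0, 1] contexts, yielding the expected two groups.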

Page 22: Cicling2005

Evaluation (before mapping)

      S1   S2   S3   S4
C1    10    0    3    2
C2     1    1    7    1
C3     2    1    1    6
C4     2   15    1    2

(rows are discovered clusters; columns are actual senses)

Page 23: Cicling2005

Evaluation (after mapping)

        S1   S2   S3   S4   Total
C1      10    3    2    0     15
C2       1    7    1    1     10
C3       2    1    6    1     10
C4       2    1    2   15     20
Total   15   12   11   17     55
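The mapping from clusters to senses can be sketched as a search over one-to-one assignments; a minimal Python sketch using the before-mapping matrix above (whether this exact search is what the system does is an assumption; the function name is made up):

```python
from itertools import permutations

def best_mapping_accuracy(confusion):
    """Try every one-to-one assignment of clusters to senses and
    return the accuracy under the best one.  Brute force is fine
    for a 4x4 matrix; the Hungarian algorithm scales better."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(sum(confusion[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))
    return best / total

# the unmapped cluster-by-sense confusion matrix from the earlier slide
before = [[10, 0, 3, 2],
          [1, 1, 7, 1],
          [2, 1, 1, 6],
          [2, 15, 1, 2]]
```

The best assignment here is C1→S1, C2→S3, C3→S4, C4→S2, placing 10 + 7 + 6 + 15 = 38 of the 55 contexts correctly, matching the diagonal of the after-mapping table.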

Page 24: Cicling2005


Majority Sense Classifier

Page 25: Cicling2005

Data

Line, Hard, Serve: 4000+ instances / word, 60:40 split, 3-5 senses / word

SENSEVAL-2: 73 words = 28 verbs + 29 nouns + 15 adjectives; approx. 50-100 test and 100-200 training instances; 8-12 senses / word

Page 26: Cicling2005

Experimental comparison of 1st and 2nd order representations:

Pedersen & Bruce (1st Order Contexts):
• PB1: Co-occurrences, UPGMA, Similarity Space
• PB2: PB1 except RB, Vector Space
• PB3: PB1 with Bigram Features

Schütze (2nd Order Contexts):
• SC1: Co-occurrence Matrix, SVD, RB, Vector Space
• SC2: SC1 except UPGMA, Similarity Space
• SC3: SC1 with Bigram Matrix

Page 27: Cicling2005

Experimental Conclusions

Nature of Data                               Recommendation
Smaller data (like SENSEVAL-2)               2nd order, RB
Large, homogeneous (like Line, Hard, Serve)  1st order, UPGMA

Page 28: Cicling2005

Software

SenseClusters – http://senseclusters.sourceforge.net/

N-gram Statistics Package – http://www.d.umn.edu/~tpederse/nsp.html

Cluto – http://www-users.cs.umn.edu/~karypis/cluto/

SVDPack – http://netlib.org/svdpack/

Page 29: Cicling2005

Making Free Software
Mostly Perl, all CopyLeft

SenseClusters: identify similar contexts

Ngram Statistics Package: identify interesting sequences of words

WordNet::Similarity: measure similarity among concepts

Google-Hack: find sets of related words

WordNet::SenseRelate: all-words sense disambiguation

SyntaLex and Duluth systems: supervised WSD

http://www.d.umn.edu/~tpederse/code.html