
Page 1: Cicling2005

Name Discrimination by Clustering Similar Contexts

Ted Pedersen & Anagha Kulkarni
University of Minnesota, Duluth

Amruta Purandare
Now at University of Pittsburgh

Research Supported by National Science Foundation
Faculty Early Career Development Award (#0092784)

Page 2: Cicling2005

Name Discrimination

Different people have the same name: George (H.W.) Bush and George (W.) Bush

Different places have the same name: Duluth (MN) and Duluth (GA)

Different things have the same abbreviation: UMD (Duluth) and UMD (College Park)

Page 3: Cicling2005

Page 4: Cicling2005

Page 5: Cicling2005

Page 6: Cicling2005

Page 7: Cicling2005

Our goals?

Given 1000 contexts with “John Smith”, identify those that are similar to each other

Group similar contexts together, assuming they are associated with a single individual

Generate an identifying label from the content of the different clusters

Page 8: Cicling2005

Measuring Similarity of Words and Contexts with Large Corpora?

Second-order co-occurrences

Jim drives his car fast / Jim speeds in his auto

Car → motor, garage, gasoline, insurance
Auto → motor, insurance, gasoline, accident

Car and Auto occur with many of the same words. They are therefore similar!

Less direct relationship, more resistant to sparsity!

Page 9: Cicling2005

Word sense discrimination

Given 1000 contexts that include a particular target word (e.g., shell)

Cluster those contexts such that similar contexts come together

Similar contexts have similar meanings

Label each cluster with something that describes its content, maybe even provides a definition

Page 10: Cicling2005

Methodology

Feature Selection

Context Representation

Measuring Similarities

Clustering

Evaluation

Page 11: Cicling2005

Feature Selection

Identify features in large (separate) training corpora, or in the data to be clustered

Rely on lexical features: unigrams, bigrams, and co-occurrences

Page 12: Cicling2005

Lexical features

Unigrams: words that occur more than X times

Bigrams: ordered pairs of words, separated by at most 2-3 intervening words, that score above a cutoff on a measure of association

Co-occurrences: same as bigrams, but unordered
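The bigram selection described above can be sketched as follows. This is an illustrative Python sketch, not the actual toolkit (the Ngram Statistics Package is Perl and offers several association measures); pointwise mutual information stands in for those measures, and the `window`, `min_count`, and `pmi_cutoff` parameters are hypothetical names.

```python
import math
from collections import Counter

def bigram_features(tokens, window=3, min_count=2, pmi_cutoff=1.0):
    """Collect ordered word pairs with up to window-1 intervening
    words, keeping those whose pointwise mutual information (a
    stand-in for the association measures mentioned above) exceeds
    a cutoff."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter()
    for i, w1 in enumerate(tokens):
        # pair w1 with each of the next `window` tokens
        for j in range(i + 1, min(i + window + 1, n)):
            bigrams[(w1, tokens[j])] += 1
    features = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log(count * n / (unigrams[w1] * unigrams[w2]))
        if pmi > pmi_cutoff:
            features[(w1, w2)] = pmi
    return features
```

On a toy token stream where "new york" recurs, the pair scores well above the cutoff and is kept as a feature.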

Page 13: Cicling2005

Context representation

First order: unigrams, bigrams, and co-occurrences that occur in the training corpus and also occur in the context to be clustered

The context is represented as a vector that shows if (or how often) these features occur in the context to be clustered

Page 14: Cicling2005

Context Representation

Second order: bigrams or co-occurrences are used to create a matrix whose cells hold counts or a measure of association for each word pair

Rows serve as co-occurrence vectors for individual words

A context is represented by averaging the vectors of the words in that context

Page 15: Cicling2005

2nd Order Context Vectors

"The largest shell store by the sea shore"

            Sells     Water     North-West  Sandy     Bombs   Sales     Artillery
Sea         18.5533   3324.98   30.520      51.7812   8.7399  0         0
Shore       0         0         29.576      136.0441  0       0         0
Store       134.5102  205.5469  0           0         0       18818.55  0
O2 context  51.021    1176.84   20.032      62.6084   2.9133  6272.85   0
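The averaging step can be sketched in Python with the rows from this slide (values and column order copied from the table; NumPy stands in for the actual Perl implementation):

```python
import numpy as np

# Word-by-feature co-occurrence rows, copied from the slide
# (columns: sells, water, north-west, sandy, bombs, sales, artillery)
word_vectors = {
    "sea":   np.array([18.5533, 3324.98, 30.520, 51.7812, 8.7399, 0.0, 0.0]),
    "shore": np.array([0.0, 0.0, 29.576, 136.0441, 0.0, 0.0, 0.0]),
    "store": np.array([134.5102, 205.5469, 0.0, 0.0, 0.0, 18818.55, 0.0]),
}

def second_order_vector(context, word_vectors):
    """Represent a context as the average of the co-occurrence
    vectors of those of its words that appear in the matrix."""
    rows = [word_vectors[w] for w in context if w in word_vectors]
    return np.mean(rows, axis=0)

context = "the largest shell store by the sea shore".split()
o2 = second_order_vector(context, word_vectors)
```

Averaging the sea, shore, and store rows reproduces the "O2 context" row of the table (e.g. (18.5533 + 0 + 134.5102) / 3 ≈ 51.021 in the first column).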

Page 16: Cicling2005


2nd Order Context Vectors

Page 17: Cicling2005

Measuring Similarities

c1: {file, unix, commands, system, store}
c2: {machine, os, unix, system, computer, dos, store}

Matching = |X ∩ Y|
{unix, system, store} → 3

Cosine = |X ∩ Y| / (√|X| * √|Y|)
3/(√5*√7) = 3/(2.2361*2.6458) = 0.5071
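The two measures on this slide can be written directly over set representations of the contexts; a minimal sketch, using the c1/c2 example above:

```python
import math

def matching(x, y):
    """Matching coefficient: size of the overlap between the
    feature sets of two contexts."""
    return len(x & y)

def cosine(x, y):
    """Cosine between binary set representations:
    |X intersect Y| / (sqrt(|X|) * sqrt(|Y|))."""
    return len(x & y) / (math.sqrt(len(x)) * math.sqrt(len(y)))

c1 = {"file", "unix", "commands", "system", "store"}
c2 = {"machine", "os", "unix", "system", "computer", "dos", "store"}
```

Here the overlap is {unix, system, store}, so matching gives 3 and cosine gives 3/√35 ≈ 0.5071.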

Page 18: Cicling2005

Limitations of 1st or 2nd order

      Kill   Murder  Destroy  Fire   Shoot  Missile  Weapon
      2.53   0       1.28     0      3.24   0        28.72
      0      4.21    0        0.92   0      52.27    0

      Burn   CD      Fire     Pipe   Bomb   Command  Execute
      2.56   1.28    0        72.7   0      2.36     19.23
      34.2   0       22.1     46.2   14.6   0        17.77

Page 19: Cicling2005


Latent Semantic Analysis

Singular Value Decomposition

Captures Polysemy and Synonymy(?)

Conceptual Fuzzy Feature Matching

Word Space to Semantic Space
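The word-space-to-semantic-space step above can be sketched as a truncated SVD. This is a minimal illustration with `numpy.linalg.svd` standing in for SVDPack (which the actual system uses); the matrix and function name are made up for the example.

```python
import numpy as np

def to_semantic_space(matrix, k):
    """Truncated SVD: factor the co-occurrence matrix and keep the
    top-k singular values, mapping each row from word space into a
    denser k-dimensional semantic space."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]

# Rows 0-1 and rows 2-3 have identical co-occurrence patterns,
# so they land on the same point in the reduced space.
M = np.array([[1., 0., 1.],
              [1., 0., 1.],
              [0., 1., 0.],
              [0., 1., 0.]])
reduced = to_semantic_space(M, 2)
```

Words with similar (not just identical) co-occurrence patterns end up close together in the reduced space, which is the fuzzy conceptual matching the slide alludes to.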

Page 20: Cicling2005

After context representation…

Each context is represented by a vector of some sort

A first-order vector shows direct occurrence of features in the context

A second-order vector is an average of the word vectors that make up the context, and captures indirect relationships

Now, cluster the vectors!

Page 21: Cicling2005

Clustering

UPGMA: hierarchical, agglomerative

Repeated Bisections: hybrid, divisive + partitional
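The UPGMA side of this slide can be sketched in plain Python. The experiments actually use CLUTO; this naive version (quadratic scans over cluster pairs, cosine similarity assumed as the measure) is only illustrative.

```python
import numpy as np

def upgma(vectors, n_clusters):
    """Naive average-link (UPGMA) agglomerative clustering: keep
    merging the two clusters with the highest average pairwise
    cosine similarity until n_clusters remain."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best_sim, best_pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average similarity over all cross-cluster pairs
                sim = np.mean([cos(vectors[a], vectors[b])
                               for a in clusters[i] for b in clusters[j]])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters

# two contexts near [1, 0] and two near [0, 1]
contexts = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
groups = upgma(contexts, 2)
```

On the toy vectors the two near-[1, 0] contexts merge first, then the two near-[0, 1] contexts, yielding the expected two groups.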

Page 22: Cicling2005

Evaluation (before mapping)

      S1   S2   S3   S4
C1    10    0    3    2
C2     1    1    7    1
C3     2    1    1    6
C4     2   15    1    2

(rows are discovered clusters; columns are actual senses)

Page 23: Cicling2005

Evaluation (after mapping)

        S1   S2   S3   S4   Total
C1      10    3    2    0     15
C2       1    7    1    1     10
C3       2    1    6    1     10
C4       2    1    2   15     20
Total   15   12   11   17     55
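The mapping from clusters to senses can be sketched as a search over one-to-one assignments; a minimal Python sketch using the before-mapping matrix above (whether this exact search is what the system does is an assumption; the function name is made up):

```python
from itertools import permutations

def best_mapping_accuracy(confusion):
    """Try every one-to-one assignment of clusters to senses and
    return the accuracy under the best one.  Brute force is fine
    for a 4x4 matrix; the Hungarian algorithm scales better."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(sum(confusion[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))
    return best / total

# the unmapped cluster-by-sense confusion matrix from the earlier slide
before = [[10, 0, 3, 2],
          [1, 1, 7, 1],
          [2, 1, 1, 6],
          [2, 15, 1, 2]]
```

The best assignment here is C1→S1, C2→S3, C3→S4, C4→S2, placing 10 + 7 + 6 + 15 = 38 of the 55 contexts correctly, matching the diagonal of the after-mapping table.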

Page 24: Cicling2005


Majority Sense Classifier

Page 25: Cicling2005

Data

Line, Hard, Serve: 4000+ instances / word, 60:40 split, 3-5 senses / word

SENSEVAL-2: 73 words = 28 verbs + 29 nouns + 15 adjectives; approx. 50-100 test and 100-200 training instances; 8-12 senses / word

Page 26: Cicling2005

Experimental comparison of 1st and 2nd order representations:

Pedersen & Bruce (1st Order Contexts):
• PB1: Co-occurrences, UPGMA, Similarity Space
• PB2: PB1 except RB, Vector Space
• PB3: PB1 with Bigram Features

Schütze (2nd Order Contexts):
• SC1: Co-occurrence Matrix, SVD, RB, Vector Space
• SC2: SC1 except UPGMA, Similarity Space
• SC3: SC1 with Bigram Matrix

Page 27: Cicling2005

Experimental Conclusions

Nature of Data                               Recommendation
Smaller data (like SENSEVAL-2)               2nd order, RB
Large, homogeneous (like Line, Hard, Serve)  1st order, UPGMA

Page 28: Cicling2005

Software

SenseClusters – http://senseclusters.sourceforge.net/

N-gram Statistics Package – http://www.d.umn.edu/~tpederse/nsp.html

Cluto – http://www-users.cs.umn.edu/~karypis/cluto/

SVDPack – http://netlib.org/svdpack/

Page 29: Cicling2005

Making Free Software
Mostly Perl, all CopyLeft

SenseClusters: identify similar contexts

Ngram Statistics Package: identify interesting sequences of words

WordNet::Similarity: measure similarity among concepts

Google-Hack: find sets of related words

WordNet::SenseRelate: all-words sense disambiguation

SyntaLex and Duluth systems: supervised WSD

http://www.d.umn.edu/~tpederse/code.html