
INFO 4300 / CS4300 Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 20/25: Linear Classifiers and Flat clustering

Paul Ginsparg

Cornell University, Ithaca, NY

10 Nov 2011

1 / 121


Administrativa

Assignment 4 to be posted tomorrow, due Fri 2 Dec (last day of classes), permitted until Sun 4 Dec (no extensions)

2 / 121


Overview

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

3 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

4 / 121


Digression: “naive” Bayes

Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄).

180 of the S messages contain the word “offer”. 20 of the S̄ messages contain the word “offer”.

Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

180 / (180 + 20) = 9/10.

(Formally, assuming a “flat prior” p(S) = p(S̄):

p(S|offer) = p(offer|S) p(S) / [ p(offer|S) p(S) + p(offer|S̄) p(S̄) ] = (180/1000) / (180/1000 + 20/1000) = 9/10.)

5 / 121


Basics of probability theory

A = event

0 ≤ p(A) ≤ 1

joint probability p(A,B) = p(A ∩ B)

conditional probability p(A|B) = p(A,B)/p(B)

Note p(A,B) = p(A|B) p(B) = p(B|A) p(A); this gives the posterior probability of A after seeing the evidence B

Bayes’ Theorem: p(A|B) = p(B|A) p(A) / p(B)

In the denominator, use p(B) = p(B,A) + p(B,Ā) = p(B|A) p(A) + p(B|Ā) p(Ā)

Odds: O(A) = p(A) / p(Ā) = p(A) / (1 − p(A))

6 / 121


“naive” Bayes, cont’d

Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄).

words wi = {“offer”, “FF0000”, “click”, “unix”, “job”, “enlarge”, . . .}
ni of the S messages contain the word wi.
mi of the S̄ messages contain the word wi.

Suppose you receive a message containing the words w1, w4, w5, . . .. What are the odds it is S? Estimate:

p(S|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S) p(S)

p(S̄|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S̄) p(S̄)

Odds are

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = [p(w1,w4,w5, . . . |S) p(S)] / [p(w1,w4,w5, . . . |S̄) p(S̄)]

7 / 121


“naive” Bayes odds

Odds

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = [p(w1,w4,w5, . . . |S) p(S)] / [p(w1,w4,w5, . . . |S̄) p(S̄)]

are approximated by

≈ [p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S) p(S)] / [p(w1|S̄) p(w4|S̄) p(w5|S̄) · · · p(wℓ|S̄) p(S̄)]

≈ [(n1/1000)(n4/1000)(n5/1000) · · · (nℓ/1000)] / [(m1/1000)(m4/1000)(m5/1000) · · · (mℓ/1000)] = (n1 n4 n5 · · · nℓ) / (m1 m4 m5 · · · mℓ)

where we’ve assumed words are independent events, p(w1,w4,w5, . . . |S) ≈ p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S), and p(wi|S) ≈ ni/|S|, p(wi|S̄) ≈ mi/|S̄| (recall that ni and mi, respectively, counted the number of spam S and non-spam S̄ training messages containing the word wi)
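As a concrete illustration of this estimate, here is a minimal Python sketch; the word counts are invented for illustration (only “offer” matches the slides), and there is no smoothing, so every word is assumed to occur in both classes:

# Naive Bayes spam odds from per-word counts (flat prior p(S) = p(S-bar)).
spam_counts = {"offer": 180, "click": 300, "unix": 10}   # n_i: spam messages containing w_i
ham_counts  = {"offer": 20,  "click": 50,  "unix": 200}  # m_i: non-spam messages containing w_i
N_SPAM = N_HAM = 1000

def spam_odds(words):
    """Estimated p(S|words) / p(S-bar|words) under the word-independence assumption."""
    odds = 1.0  # prior ratio p(S)/p(S-bar) = 1 for the flat prior
    for w in words:
        odds *= (spam_counts[w] / N_SPAM) / (ham_counts[w] / N_HAM)
    return odds

print(spam_odds(["offer"]))           # 9.0, i.e. probability 9/10 as on the previous slides
print(spam_odds(["offer", "click"]))  # 9 * 6 = 54.0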

8 / 121


Classification

Naive Bayes is simple and a good baseline.

Use it if you want to get a text classifier up and running in ahurry.

But other classification methods are more accurate.

Perhaps the simplest well-performing alternative: kNN

kNN is a vector space classifier.

Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN

Next time: general properties of classifiers

9 / 121


Recall vector space representation

Each document is a vector, one component for each term.

Terms are axes.

High dimensionality: 100,000s of dimensions

Normalize vectors (documents) to unit length

How can we do classification in this space?

10 / 121


Vector space classification

As before, the training set is a set of documents, each labeledwith its class.

In vector space classification, this set corresponds to a labeledset of points or vectors in the vector space.

Premise 1: Documents in the same class form a contiguous region.

Premise 2: Documents from different classes don’t overlap.

We define lines, surfaces, hypersurfaces to divide regions.

11 / 121


Classes in the vector space

[Figure: labeled training documents for the classes China, Kenya, and UK in the vector space, plus an unlabeled document ⋆]

Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?

12 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

13 / 121


Recall Rocchio algorithm (lecture 12)

The optimal query vector is:

~qopt = µ(Dr) + [µ(Dr) − µ(Dnr)]
      = (1/|Dr|) Σ_{~dj ∈ Dr} ~dj + [ (1/|Dr|) Σ_{~dj ∈ Dr} ~dj − (1/|Dnr|) Σ_{~dj ∈ Dnr} ~dj ]

We move the centroid of the relevant documents by the difference between the two centroids.

14 / 121


Exercise: Compute Rocchio vector (lecture 12)

[Figure: relevant documents (circles) and nonrelevant documents (x’s) scattered in the plane]

circles: relevant documents, X’s: nonrelevant documents

15 / 121


Rocchio illustrated (lecture 12)

[Figure: relevant (circles) and nonrelevant (x) documents, the centroids ~µR and ~µNR, the difference vector ~µR − ~µNR, and the resulting ~qopt]

~µR: centroid of relevant documents
~µNR: centroid of nonrelevant documents
~µR − ~µNR: difference vector
Add the difference vector to ~µR to get ~qopt

~qopt separates relevant/nonrelevant perfectly.

16 / 121


Rocchio 1971 algorithm (SMART) (lecture 12)

Used in practice:

~qm = α ~q0 + β µ(Dr) − γ µ(Dnr)
    = α ~q0 + β (1/|Dr|) Σ_{~dj ∈ Dr} ~dj − γ (1/|Dnr|) Σ_{~dj ∈ Dnr} ~dj

qm: modified query vector; q0: original query vector; Dr and Dnr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term

New query moves towards relevant documents and away from nonrelevant documents.

Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.

Set negative term weights to 0.

“Negative weight” for a term doesn’t make sense in the vector space model.
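As a sketch of how this update looks in code (NumPy; the α, β, γ values and the toy vectors are illustrative, not prescribed by the slides):

import numpy as np

def rocchio_update(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.15):
    """SMART-style Rocchio: move the query toward the relevant centroid, away from the nonrelevant one."""
    qm = alpha * q0
    if len(Dr):
        qm = qm + beta * np.mean(Dr, axis=0)    # + beta * centroid of Dr
    if len(Dnr):
        qm = qm - gamma * np.mean(Dnr, axis=0)  # - gamma * centroid of Dnr
    return np.maximum(qm, 0.0)                  # set negative term weights to 0

q0  = np.array([1.0, 0.0, 0.5])
Dr  = np.array([[1.0, 1.0, 0.0], [0.8, 0.6, 0.0]])   # judged relevant
Dnr = np.array([[0.0, 0.0, 1.0]])                    # judged nonrelevant
print(rocchio_update(q0, Dr, Dnr))                   # approx. [1.675 0.6 0.35]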

17 / 121


Using Rocchio for vector space classification

We can view relevance feedback as two-class classification.

The two classes: the relevant documents and the nonrelevantdocuments.

The training set is the set of documents the user has labeledso far.

The principal difference between relevance feedback and textclassification:

The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.

18 / 121


Rocchio classification: Basic idea

Compute a centroid for each class

The centroid is the average of all documents in the class.

Assign each test document to the class of its closest centroid.

19 / 121


Recall definition of centroid

~µ(c) = (1/|Dc|) Σ_{d ∈ Dc} ~v(d)

where Dc is the set of all documents that belong to class c and ~v(d) is the vector space representation of d.

20 / 121


Rocchio algorithm

TrainRocchio(C, D)
1 for each cj ∈ C
2 do Dj ← {d : ⟨d, cj⟩ ∈ D}
3    ~µj ← (1/|Dj|) Σ_{d ∈ Dj} ~v(d)
4 return {~µ1, . . . , ~µJ}

ApplyRocchio({~µ1, . . . , ~µJ}, d)
1 return arg minj |~µj − ~v(d)|
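A minimal runnable counterpart of this pseudocode (NumPy sketch; the two-dimensional toy documents and class names are made up):

import numpy as np

def train_rocchio(docs, labels):
    """docs: (N, M) array of document vectors; labels: list of N class names. Returns class -> centroid."""
    return {c: docs[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
            for c in set(labels)}

def apply_rocchio(centroids, d):
    """Assign d to the class of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["UK", "UK", "China", "China"]
centroids = train_rocchio(docs, labels)
print(apply_rocchio(centroids, np.array([0.2, 0.8])))  # -> China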

21 / 121


Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2

[Figure: the classes China, Kenya, and UK with their centroids and the Rocchio decision boundaries between them; the labeled segments satisfy a1 = a2, b1 = b2, c1 = c2 (each boundary point is equidistant from the two nearest centroids), and the document ⋆ falls in the China region]

22 / 121


Rocchio properties

Rocchio forms a simple representation for each class: thecentroid

We can interpret the centroid as the prototype of the class.

Classification is based on similarity to / distance from centroid/prototype.

Does not guarantee that classifications are consistent with the training data!

23 / 121


Time complexity of Rocchio

mode       time complexity
training   Θ(|D| Lave + |C| |V|) ≈ Θ(|D| Lave)
testing    Θ(La + |C| Ma) ≈ Θ(|C| Ma)

24 / 121


Rocchio vs. Naive Bayes

In many cases, Rocchio performs worse than Naive Bayes.

One reason: Rocchio does not handle nonconvex, multimodalclasses correctly.

25 / 121


Rocchio cannot handle nonconvex, multimodal classes

[Figure: class a forms two separate clumps, with the class b points lying between them; A and B (marked X) are the centroids of the a’s and the b’s, and o is a point between the clumps]

Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?

A is the centroid of the a’s, B is the centroid of the b’s.

The point o is closer to A than to B.

But it is a better fit for the b class.

a is a multimodal class with two prototypes.

But in Rocchio we only have one.

26 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

27 / 121


kNN classification

kNN classification is another vector space classification method.

It also is very simple and easy to implement.

kNN is more accurate (in most cases) than Naive Bayes andRocchio.

If you need to get a pretty accurate classifier up and runningin a short time . . .

. . . and you don’t care about efficiency that much . . .

. . . use kNN.

28 / 121


kNN classification

kNN = k nearest neighbors

kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.

1NN is not very robust – one document can be mislabeled or atypical.

kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.

Rationale of kNN: contiguity hypothesis

We expect a test document d to have the same label as the training documents located in the local region surrounding d.

29 / 121


Probabilistic kNN

Probabilistic version of kNN: P(c|d) = fraction of the k neighbors of d that are in c

kNN classification rule for probabilistic kNN: Assign d to the class c with highest P(c|d)

30 / 121


kNN is based on Voronoi tessellation

[Figure: training documents from two classes (x and ⋄) with the Voronoi tessellation they induce, and a test document ⋆]

1NN, 3NN classification decision for star?

31 / 121


kNN algorithm

Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k

Apply-kNN(D′, k, d)
1 Sk ← ComputeNearestNeighbors(D′, k, d)
2 for each cj ∈ C(D′)
3 do pj ← |Sk ∩ cj |/k
4 return arg maxj pj
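A compact runnable sketch of Apply-kNN (NumPy; the toy training set is invented, and plain Euclidean distance stands in for whatever representation Preprocess produces):

import numpy as np
from collections import Counter

def apply_knn(train_vecs, train_labels, k, d):
    """Classify d by majority vote among its k nearest training documents."""
    dists = np.linalg.norm(train_vecs - d, axis=1)      # distance to every training doc
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    votes = Counter(train_labels[i] for i in nearest)   # p_j is votes[c_j] / k
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.8, 0.1], [0.0, 1.0], [0.1, 0.8]])
y = ["UK", "UK", "UK", "China", "China"]
print(apply_knn(X, y, k=3, d=np.array([0.7, 0.3])))     # -> UK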

32 / 121


Exercise

[Figure: training documents from classes x and o, and a test document ⋆]

How is star classified by:

(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?

33 / 121



Time complexity of kNN

kNN with preprocessing of training set

training   Θ(|D| Lave)
testing    Θ(La + |D| Mave Ma) = Θ(|D| Mave Ma)

kNN test time proportional to the size of the training set!

The larger the training set, the longer it takes to classify a test document.

kNN is inefficient for very large training sets.

35 / 121


kNN: Discussion

No training necessary

But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality training time of kNN is linear.

kNN is very accurate if training set is large.

Optimality result: asymptotically zero error if Bayes rate is zero.

But kNN can be very inaccurate if training set is small.

36 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

37 / 121


Linear classifiers

Linear classifiers compute a linear combination or weighted sum Σi wi xi of the feature values.

Classification decision: Σi wi xi > θ?

. . . where θ (the threshold) is a parameter.

(First, we only consider binary classifiers.)

Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities)

Assumption: The classes are linearly separable.

Can find hyperplane (=separator) based on training set

Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides

38 / 121


A linear classifier in 1D

[Figure: the x1 axis with the decision point θ/w1 marked]

A linear classifier in 1D is a point described by the equation w1 x1 = θ

The point at θ/w1

Points (x1) with w1 x1 ≥ θ are in the class c.

Points (x1) with w1 x1 < θ are in the complement class c̄.

39 / 121


A linear classifier in 2D

A linear classifier in 2D is a line described by the equation w1 x1 + w2 x2 = θ

Example for a 2D linear classifier

Points (x1, x2) with w1 x1 + w2 x2 ≥ θ are in the class c.

Points (x1, x2) with w1 x1 + w2 x2 < θ are in the complement class c̄.

40 / 121


A linear classifier in 3D

A linear classifier in 3D is a plane described by the equation w1 x1 + w2 x2 + w3 x3 = θ

Example for a 3D linear classifier

Points (x1, x2, x3) with w1 x1 + w2 x2 + w3 x3 ≥ θ are in the class c.

Points (x1, x2, x3) with w1 x1 + w2 x2 + w3 x3 < θ are in the complement class c̄.

41 / 121


Rocchio as a linear classifier

Rocchio is a linear classifier defined by:

Σ_{i=1}^{M} wi xi = ~w · ~x = θ

where the normal vector ~w = ~µ(c1) − ~µ(c2) and θ = 0.5 · (|~µ(c1)|² − |~µ(c2)|²).

(follows from the decision boundary |~µ(c1) − ~x| = |~µ(c2) − ~x|)
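A quick numeric check of this equivalence (the centroids and test point are arbitrary): the linear decision ~w · ~x > θ and the “closer centroid” decision agree.

import numpy as np

mu1, mu2 = np.array([1.0, 0.2]), np.array([0.1, 0.9])   # class centroids (made up)
w = mu1 - mu2                                           # normal vector
theta = 0.5 * (mu1 @ mu1 - mu2 @ mu2)                   # threshold from the slide

x = np.array([0.4, 0.6])
linear_decision = (w @ x) > theta                                      # linear classifier
rocchio_decision = np.linalg.norm(mu1 - x) < np.linalg.norm(mu2 - x)   # closer to mu1?
print(linear_decision, rocchio_decision)                # False False -- the two rules agree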

42 / 121


Naive Bayes classifier

~x represents a document; what is p(c|~x), the probability that the document is in class c?

p(c|~x) = p(~x|c) p(c) / p(~x)        p(c̄|~x) = p(~x|c̄) p(c̄) / p(~x)

odds:  p(c|~x) / p(c̄|~x) = [p(~x|c) p(c)] / [p(~x|c̄) p(c̄)] ≈ [p(c) / p(c̄)] · Π_{1≤k≤nd} p(tk|c) / p(tk|c̄)

log odds:  log [p(c|~x) / p(c̄|~x)] = log [p(c) / p(c̄)] + Σ_{1≤k≤nd} log [p(tk|c) / p(tk|c̄)]

43 / 121


Naive Bayes as a linear classifier

Naive Bayes is a linear classifier defined by:

Σ_{i=1}^{M} wi xi = θ

where wi = log( p(ti|c) / p(ti|c̄) ), xi = number of occurrences of ti in d, and θ = − log( p(c) / p(c̄) ).

(the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary)

Linear in log space
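A small sketch of the decision written in this linear form (the conditional probabilities are invented for illustration):

import math

p_c, p_not_c = 0.5, 0.5                                    # class priors
p_t_c     = {"offer": 0.18, "meeting": 0.05}               # p(t_i | c)
p_t_not_c = {"offer": 0.02, "meeting": 0.20}               # p(t_i | c-bar)

w = {t: math.log(p_t_c[t] / p_t_not_c[t]) for t in p_t_c}  # per-term weights w_i
theta = -math.log(p_c / p_not_c)                           # threshold (0 for equal priors)

doc = ["offer", "offer", "meeting"]                        # x_i = occurrence counts of t_i in d
score = sum(w[t] for t in doc)                             # equals sum_i w_i * x_i
print("c" if score > theta else "c-bar", round(score, 3))  # c 3.008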

44 / 121


kNN is not a linear classifier

[Figure: training documents from two classes (x and ⋄); the kNN decision boundary between them is piecewise linear]

Classification decision based on majority of k nearest neighbors.

The decision boundaries between classes are piecewise linear . . .

. . . but they are not linear classifiers that can be described as Σ_{i=1}^{M} wi xi = θ.

45 / 121


Example of a linear two-class classifier

ti           wi      x1i  x2i    ti      wi      x1i  x2i
prime        0.70    0    1      dlrs    −0.71   1    1
rate         0.67    1    0      world   −0.35   1    0
interest     0.63    0    0      sees    −0.33   0    0
rates        0.60    0    0      year    −0.25   0    0
discount     0.46    1    0      group   −0.24   0    0
bundesbank   0.43    0    0      dlr     −0.24   0    0

This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation.
x1: “rate discount dlrs world”
x2: “prime dlrs”
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign document ~d1 “rate discount dlrs world” to interest since
~wT · ~d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = b.
We assign ~d2 “prime dlrs” to the complement class (not in interest) since
~wT · ~d2 = −0.01 ≤ b.

(dlr and world have negative weights because they are indicators for the competing class currency)
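The same decision rule in a few lines of Python, using the weights from the table and the threshold b = 0:

weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def classify_interest(doc_terms, b=0.0):
    """Linear decision: sum the weights of the terms present (0/1 features) and compare to b."""
    score = sum(weights.get(t, 0.0) for t in doc_terms)
    return ("interest" if score > b else "not interest", round(score, 2))

print(classify_interest("rate discount dlrs world".split()))  # ('interest', 0.07)
print(classify_interest("prime dlrs".split()))                # ('not interest', -0.01)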

46 / 121


Which hyperplane?

47 / 121


Which hyperplane?

For linearly separable training sets: there are infinitely many separating hyperplanes.

They all separate the training set perfectly . . .

. . . but they behave differently on test data.

Error rates on new data are low for some, high for others.

How do we find a low-error separator?

Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good

48 / 121


Linear classifiers: Discussion

Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.

Each method has a different way of selecting the separating hyperplane

Huge differences in performance on test documents

Can we get better performance with more powerful nonlinear classifiers?

Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.

49 / 121


A nonlinear problem

[Figure: a two-class data set on the unit square whose class boundary is not linear]

Linear classifier like Rocchio does badly on this task.

kNN will do well (assuming enough training data)

50 / 121


A linear problem with noise

Figure 14.10: hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). Linear class boundary, except for three noise docs.

51 / 121


Which classifier do I use for a given TC problem?

Is there a learning method that is optimal for all text classification problems?

No, because there is a tradeoff between bias and variance.

Factors to take into account:

How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?

For an unstable problem, it’s better to use a simple and robust classifier.

52 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

53 / 121


How to combine hyperplanes for > 2 classes?


(e.g.: rank and select top-ranked classes)

54 / 121


One-of problems

One-of or multiclass classification

Classes are mutually exclusive.
Each document belongs to exactly one class.
Example: language of a document (assumption: no document contains multiple languages)

55 / 121


One-of classification with linear classifiers

Combine two-class linear classifiers as follows for one-of classification:

Run each classifier separately
Rank classifiers (e.g., according to score)
Pick the class with the highest score
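A tiny sketch of this scheme (the per-class weight vectors and thresholds are hypothetical); for any-of classification, described two slides ahead, one would instead keep every class whose score clears its own threshold:

import numpy as np

def one_of_classify(W, thetas, x, classes):
    """Pick the single best class: argmax over the per-class linear scores w_c · x − θ_c."""
    scores = W @ x - thetas
    return classes[int(np.argmax(scores))]

W = np.array([[0.7, -0.7],    # weight vector for "interest" (hypothetical)
              [-0.3, 0.9]])   # weight vector for "currency" (hypothetical)
thetas = np.array([0.0, 0.1])
print(one_of_classify(W, thetas, np.array([1.0, 0.0]), ["interest", "currency"]))  # interest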

56 / 121


Any-of problems

Any-of or multilabel classification

A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of “independence” (but not statistical independence)
Example: topic classification
Usually: make decisions on the region, on the subject area, on the industry and so on “independently”

57 / 121


Any-of classification with linear classifiers

Combine two-class linear classifiers as follows for any-of classification:

Simply run each two-class classifier separately on the test document and assign the document accordingly

58 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

59 / 121


What is clustering?

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.

60 / 121


Data set with clear cluster structure

[Figure: a 2D scatter plot with three well-separated clusters of points]

61 / 121


Classification vs. Clustering

Classification: supervised learning

Clustering: unsupervised learning

Classification: Classes are human-defined and part of the input to the learning algorithm.

Clustering: Clusters are inferred from the data without human input.

However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

62 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

63 / 121


The cluster hypothesis

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications in IR are based (directly or indirectly) on thecluster hypothesis.

64 / 121


Applications of clustering in IR

Application               What is clustered?        Benefit                                                        Example
Search result clustering  search results            more effective information presentation to user               next slide
Scatter-Gather            (subsets of) collection   alternative user interface: “search without typing”           two slides ahead
Collection clustering     collection                effective information presentation for exploratory browsing   McKeown et al. 2002, news.google.com
Cluster-based retrieval   collection                higher efficiency: faster search                               Salton 1971

65 / 121


Search result clustering for better navigation

Jaguar the cat not among top results, but available via menu at left

66 / 121


Scatter-Gather

A collection of news stories is clustered (“scattered”) into eight clusters (top row); the user manually gathers three into a smaller collection ‘International Stories’ and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).

67 / 121


Global navigation: Yahoo

68 / 121


Global navigation: MeSH (upper level)

69 / 121


Global navigation: MeSH (lower level)

70 / 121


Note: Yahoo/MeSH are not examples of clustering.

But they are well-known examples of using a global hierarchy for navigation.

Some examples of global navigation/exploration based on clustering:

Cartia
Themescapes
Google News

71 / 121


Global navigation combined with visualization (1)

72 / 121


Global navigation combined with visualization (2)

73 / 121


Global clustering for navigation: Google News

http://news.google.com

74 / 121


Clustering for improving recall

To improve search recall:

Cluster docs in collection a priori
When a query matches a doc d, also return other docs in the cluster containing d

Hope: if we do this, the query “car” will also return docs containing “automobile”

Because clustering groups together docs containing “car” with those containing “automobile”.
Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.

75 / 121


Data set with clear cluster structure

[Figure: the same 2D scatter plot with three well-separated clusters]

Exercise: Come up with an algorithm for finding the three clusters in this case

76 / 121


Document representations in clustering

Vector space model

As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .

. . . which is almost equivalent to cosine similarity.

Almost: centroids are not length-normalized.

For centroids, distance and cosine give different results.

77 / 121


Issues in clustering

General goal: put related docs in the same cluster, put unrelated docs in different clusters.

But how do we formalize this?

How many clusters?

Initially, we will assume the number of clusters K is given.

Often: secondary goals in clustering

Example: avoid very small and very large clusters

Flat vs. hierarchical clustering

Hard vs. soft clustering

78 / 121


Flat vs. Hierarchical clustering

Flat algorithms

Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means

Hierarchical algorithms

Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive

79 / 121


Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.

More common and easier to do

Soft clustering: A document can belong to more than one cluster.

Makes more sense for applications like creating browsable hierarchies
You may want to put a pair of sneakers in two clusters:

sports apparel
shoes

You can only do that with a soft clustering approach.

For soft clustering, see course text: 16.5, 18

Today: Flat, hard clustering
Next time: Hierarchical, hard clustering

80 / 121


Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.

Given: a set of documents and the number K

Find: a partition in K clusters that optimizes the chosen partitioning criterion

Global optimization: exhaustively enumerate partitions, pick optimal one

Not tractable

Effective heuristic method: K -means algorithm

81 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

82 / 121


K -means

Perhaps the best known clustering algorithm

Simple, works well in many cases

Use as default / baseline for clustering documents

83 / 121


K -means

Each cluster in K -means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared difference from the centroid

Recall definition of centroid:

~µ(ω) = (1/|ω|) Σ_{~x ∈ ω} ~x

where we use ω to denote a cluster.

We try to find the minimum average squared difference by iterating two steps:

reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

84 / 121


K -means algorithm

K-means({~x1, . . . , ~xN}, K)
 1 (~s1,~s2, . . . ,~sK) ← SelectRandomSeeds({~x1, . . . , ~xN}, K)
 2 for k ← 1 to K
 3 do ~µk ← ~sk
 4 while stopping criterion has not been met
 5 do for k ← 1 to K
 6    do ωk ← {}
 7    for n ← 1 to N
 8    do j ← arg minj′ |~µj′ − ~xn|
 9       ωj ← ωj ∪ {~xn} (reassignment of vectors)
10    for k ← 1 to K
11    do ~µk ← (1/|ωk|) Σ_{~x ∈ ωk} ~x (recomputation of centroids)
12 return {~µ1, . . . , ~µK}
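A compact runnable version of this loop (NumPy sketch; a fixed iteration count stands in for the unspecified stopping criterion, and the toy data are two Gaussian blobs):

import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain K-means: random seeds, then alternate reassignment and recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # SelectRandomSeeds
    for _ in range(iters):
        # reassignment: index of the closest centroid for every point
        assign = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # recomputation: each centroid becomes the mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in ([0.0, 0.0], [2.0, 2.0])])
centroids, assign = kmeans(X, K=2)
print(np.round(centroids, 2))   # two centroids, one near (0, 0) and one near (2, 2)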

85 / 121

[Figures, slides 86–111: K-means with K = 2 illustrated step by step on a small 2D point set]

86: Set of points to be clustered
87: Random selection of initial cluster centers (k = 2 means) — Centroids after convergence?
88–108: repeated iterations, shown as “Assign points to closest centroid” / “Assignment” (points labeled 1 or 2) / “Recompute cluster centroids” (× marks the centroids)
109: Centroids and assignments after convergence
110: Set of points clustered
111: Set of points to be clustered

111 / 121


K -means is guaranteed to converge

Proof:

The sum of squared distances (RSS) decreases during reassignment, because each vector is moved to a closer centroid.
(RSS = sum of all squared distances between document vectors and closest centroids)

RSS decreases during recomputation (see next slide)

There is only a finite number of clusterings.

Thus: We must reach a fixed point.
(assume that ties are broken consistently)

112 / 121


Recomputation decreases average distance

RSS = Σ_{k=1}^{K} RSSk – the residual sum of squares (the “goodness” measure)

RSSk(~v) = Σ_{~x ∈ ωk} ‖~v − ~x‖² = Σ_{~x ∈ ωk} Σ_{m=1}^{M} (vm − xm)²

∂RSSk(~v)/∂vm = Σ_{~x ∈ ωk} 2(vm − xm) = 0

vm = (1/|ωk|) Σ_{~x ∈ ωk} xm

The last line is the componentwise definition of the centroid! We minimize RSSk when the old centroid is replaced with the new centroid. RSS, the sum of the RSSk, must then also decrease during recomputation.
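A quick numeric check on a toy one-dimensional cluster (values arbitrary): replacing the old centroid by the mean of the assigned points lowers RSSk.

# RSS_k of one 1D cluster, for the old centroid vs. the recomputed centroid (the mean).
points = [1.0, 2.0, 6.0]
rss_k = lambda v: sum((v - x) ** 2 for x in points)
old_centroid = 1.0                          # e.g. the seed this cluster started from
new_centroid = sum(points) / len(points)    # recomputed centroid = 3.0
print(rss_k(old_centroid), rss_k(new_centroid))   # 26.0 14.0 -- recomputation decreased RSS_k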

113 / 121


K -means is guaranteed to converge

But we don’t know how long convergence will take!

If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).

However, complete convergence can take many more iterations.

114 / 121


Optimality of K -means

Convergence does not mean that we converge to the optimal clustering!

This is the great weakness of K-means.

If we start with a bad set of seeds, the resulting clustering can be horrible.

115 / 121


Exercise: Suboptimal clustering

[Figure: six documents d1, d2, d3 (upper row) and d4, d5, d6 (lower row) plotted on a small grid]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds di1 , di2?

116 / 121


Exercise: Suboptimal clustering

[Figure: the same six documents d1–d6]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds di1 , di2?

For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}} (suboptimal clustering).

For seeds d2 and d3, it instead converges to {{d1, d2, d4, d5}, {d3, d6}} (the global optimum for K = 2).

117 / 121


Initialization of K -means

Random seed selection is just one of many ways K-means can be initialized.

Random seed selection is not very robust: It’s easy to get a suboptimal clustering.

Better heuristics:

Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
Use hierarchical clustering to find good seeds (next class)
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS

118 / 121


Time complexity of K -means

Computing one distance of two vectors is O(M).

Reassignment step: O(KNM) (we need to compute KN document-centroid distances)

Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions

However: This is not a real worst-case analysis.

In pathological cases, the number of iterations can be much higher than linear in the number of documents.

119 / 121