
INFO 4300 / CS4300 Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 20/25: Linear Classifiers and Flat clustering

Paul Ginsparg

Cornell University, Ithaca, NY

10 Nov 2011

1 / 121


Administrativa

Assignment 4 to be posted tomorrow, due Fri 2 Dec (last day of classes), permitted until Sun 4 Dec (no extensions)

2 / 121


Overview

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

3 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

4 / 121


Digression: “naive” Bayes

Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄).

180 of the S messages contain the word “offer”. 20 of the S̄ messages contain the word “offer”.

Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

180 / (180 + 20) = 9/10.

(Formally, assuming a “flat prior” p(S) = p(S̄):

p(S|offer) = p(offer|S) p(S) / [ p(offer|S) p(S) + p(offer|S̄) p(S̄) ] = (180/1000) / (180/1000 + 20/1000) = 9/10.)

5 / 121


Basics of probability theory

A = event

0 ≤ p(A) ≤ 1

joint probability p(A,B) = p(A ∩ B)

conditional probability p(A|B) = p(A,B)/p(B)

Note p(A,B) = p(A|B) p(B) = p(B|A) p(A); this gives the posterior probability of A after seeing the evidence B

Bayes’ Theorem: p(A|B) = p(B|A) p(A) / p(B)

In the denominator, use p(B) = p(B,A) + p(B,Ā) = p(B|A) p(A) + p(B|Ā) p(Ā)

Odds: O(A) = p(A) / p(Ā) = p(A) / (1 − p(A))

6 / 121


“naive” Bayes, cont’d

Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄).

words wi = {“offer”, “FF0000”, “click”, “unix”, “job”, “enlarge”, . . .}
ni of the S messages contain the word wi.
mi of the S̄ messages contain the word wi.

Suppose you receive a message containing the words w1, w4, w5, . . .. What are the odds it is S? Estimate:

p(S|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S) p(S)

p(S̄|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S̄) p(S̄)

Odds are

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = [p(w1,w4,w5, . . . |S) p(S)] / [p(w1,w4,w5, . . . |S̄) p(S̄)]

7 / 121


“naive” Bayes odds

Odds

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = [p(w1,w4,w5, . . . |S) p(S)] / [p(w1,w4,w5, . . . |S̄) p(S̄)]

are approximated by

≈ [p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S) p(S)] / [p(w1|S̄) p(w4|S̄) p(w5|S̄) · · · p(wℓ|S̄) p(S̄)]

≈ [(n1/1000)(n4/1000)(n5/1000) · · · (nℓ/1000)] / [(m1/1000)(m4/1000)(m5/1000) · · · (mℓ/1000)] = (n1 n4 n5 · · · nℓ) / (m1 m4 m5 · · · mℓ)

where we’ve assumed words are independent events, p(w1,w4,w5, . . . |S) ≈ p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S), and p(wi|S) ≈ ni/|S|, p(wi|S̄) ≈ mi/|S̄| (recall that ni and mi, respectively, counted the number of spam S and non-spam S̄ training messages containing the word wi)
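As a concrete illustration of this estimate, here is a minimal Python sketch; the word counts are invented for illustration (only “offer” matches the slides), and there is no smoothing, so every word is assumed to occur in both classes:

# Naive Bayes spam odds from per-word counts (flat prior p(S) = p(S-bar)).
spam_counts = {"offer": 180, "click": 300, "unix": 10}   # n_i: spam messages containing w_i
ham_counts  = {"offer": 20,  "click": 50,  "unix": 200}  # m_i: non-spam messages containing w_i
N_SPAM = N_HAM = 1000

def spam_odds(words):
    """Estimated p(S|words) / p(S-bar|words) under the word-independence assumption."""
    odds = 1.0  # prior ratio p(S)/p(S-bar) = 1 for the flat prior
    for w in words:
        odds *= (spam_counts[w] / N_SPAM) / (ham_counts[w] / N_HAM)
    return odds

print(spam_odds(["offer"]))           # 9.0, i.e. probability 9/10 as on the previous slides
print(spam_odds(["offer", "click"]))  # 9 * 6 = 54.0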

8 / 121


Classification

Naive Bayes is simple and a good baseline.

Use it if you want to get a text classifier up and running in ahurry.

But other classification methods are more accurate.

Perhaps the simplest well-performing alternative: kNN

kNN is a vector space classifier.

Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN

Next time: general properties of classifiers

9 / 121


Recall vector space representation

Each document is a vector, one component for each term.

Terms are axes.

High dimensionality: 100,000s of dimensions

Normalize vectors (documents) to unit length

How can we do classification in this space?

10 / 121


Vector space classification

As before, the training set is a set of documents, each labeledwith its class.

In vector space classification, this set corresponds to a labeledset of points or vectors in the vector space.

Premise 1: Documents in the same class form a contiguous region.

Premise 2: Documents from different classes don’t overlap.

We define lines, surfaces, hypersurfaces to divide regions.

11 / 121


Classes in the vector space

[Figure: labeled training documents for the classes China, Kenya, and UK in the vector space, plus an unlabeled document ⋆]

Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?

12 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

13 / 121


Recall Rocchio algorithm (lecture 12)

The optimal query vector is:

~qopt = µ(Dr) + [µ(Dr) − µ(Dnr)]
      = (1/|Dr|) Σ_{~dj ∈ Dr} ~dj + [ (1/|Dr|) Σ_{~dj ∈ Dr} ~dj − (1/|Dnr|) Σ_{~dj ∈ Dnr} ~dj ]

We move the centroid of the relevant documents by the difference between the two centroids.

14 / 121


Exercise: Compute Rocchio vector (lecture 12)

[Figure: relevant documents (circles) and nonrelevant documents (x’s) scattered in the plane]

circles: relevant documents, X’s: nonrelevant documents

15 / 121


Rocchio illustrated (lecture 12)

[Figure: relevant (circles) and nonrelevant (x) documents, the centroids ~µR and ~µNR, the difference vector ~µR − ~µNR, and the resulting ~qopt]

~µR: centroid of relevant documents
~µNR: centroid of nonrelevant documents
~µR − ~µNR: difference vector
Add the difference vector to ~µR to get ~qopt

~qopt separates relevant/nonrelevant perfectly.

16 / 121


Rocchio 1971 algorithm (SMART) (lecture 12)

Used in practice:

~qm = α ~q0 + β µ(Dr) − γ µ(Dnr)
    = α ~q0 + β (1/|Dr|) Σ_{~dj ∈ Dr} ~dj − γ (1/|Dnr|) Σ_{~dj ∈ Dnr} ~dj

qm: modified query vector; q0: original query vector; Dr and Dnr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term

New query moves towards relevant documents and away from nonrelevant documents.

Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.

Set negative term weights to 0.

“Negative weight” for a term doesn’t make sense in the vector space model.
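As a sketch of how this update looks in code (NumPy; the α, β, γ values and the toy vectors are illustrative, not prescribed by the slides):

import numpy as np

def rocchio_update(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.15):
    """SMART-style Rocchio: move the query toward the relevant centroid, away from the nonrelevant one."""
    qm = alpha * q0
    if len(Dr):
        qm = qm + beta * np.mean(Dr, axis=0)    # + beta * centroid of Dr
    if len(Dnr):
        qm = qm - gamma * np.mean(Dnr, axis=0)  # - gamma * centroid of Dnr
    return np.maximum(qm, 0.0)                  # set negative term weights to 0

q0  = np.array([1.0, 0.0, 0.5])
Dr  = np.array([[1.0, 1.0, 0.0], [0.8, 0.6, 0.0]])   # judged relevant
Dnr = np.array([[0.0, 0.0, 1.0]])                    # judged nonrelevant
print(rocchio_update(q0, Dr, Dnr))                   # approx. [1.675 0.6 0.35]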

17 / 121


Using Rocchio for vector space classification

We can view relevance feedback as two-class classification.

The two classes: the relevant documents and the nonrelevantdocuments.

The training set is the set of documents the user has labeledso far.

The principal difference between relevance feedback and textclassification:

The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.

18 / 121


Rocchio classification: Basic idea

Compute a centroid for each class

The centroid is the average of all documents in the class.

Assign each test document to the class of its closest centroid.

19 / 121


Recall definition of centroid

~µ(c) = (1/|Dc|) Σ_{d ∈ Dc} ~v(d)

where Dc is the set of all documents that belong to class c and ~v(d) is the vector space representation of d.

20 / 121


Rocchio algorithm

TrainRocchio(C, D)
1 for each cj ∈ C
2 do Dj ← {d : ⟨d, cj⟩ ∈ D}
3    ~µj ← (1/|Dj|) Σ_{d ∈ Dj} ~v(d)
4 return {~µ1, . . . , ~µJ}

ApplyRocchio({~µ1, . . . , ~µJ}, d)
1 return arg minj |~µj − ~v(d)|
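A minimal runnable counterpart of this pseudocode (NumPy sketch; the two-dimensional toy documents and class names are made up):

import numpy as np

def train_rocchio(docs, labels):
    """docs: (N, M) array of document vectors; labels: list of N class names. Returns class -> centroid."""
    return {c: docs[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
            for c in set(labels)}

def apply_rocchio(centroids, d):
    """Assign d to the class of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["UK", "UK", "China", "China"]
centroids = train_rocchio(docs, labels)
print(apply_rocchio(centroids, np.array([0.2, 0.8])))  # -> China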

21 / 121


Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2

[Figure: the classes China, Kenya, and UK with their centroids and the Rocchio decision boundaries between them; the labeled segments satisfy a1 = a2, b1 = b2, c1 = c2 (each boundary point is equidistant from the two nearest centroids), and the document ⋆ falls in the China region]

22 / 121


Rocchio properties

Rocchio forms a simple representation for each class: thecentroid

We can interpret the centroid as the prototype of the class.

Classification is based on similarity to / distance from centroid/prototype.

Does not guarantee that classifications are consistent with the training data!

23 / 121


Time complexity of Rocchio

mode       time complexity
training   Θ(|D| Lave + |C| |V|) ≈ Θ(|D| Lave)
testing    Θ(La + |C| Ma) ≈ Θ(|C| Ma)

24 / 121


Rocchio vs. Naive Bayes

In many cases, Rocchio performs worse than Naive Bayes.

One reason: Rocchio does not handle nonconvex, multimodalclasses correctly.

25 / 121


Rocchio cannot handle nonconvex, multimodal classes

[Figure: class a forms two separate clumps, with the class b points lying between them; A and B (marked X) are the centroids of the a’s and the b’s, and o is a point between the clumps]

Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?

A is the centroid of the a’s, B is the centroid of the b’s.

The point o is closer to A than to B.

But it is a better fit for the b class.

a is a multimodal class with two prototypes.

But in Rocchio we only have one.

26 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

27 / 121


kNN classification

kNN classification is another vector space classification method.

It also is very simple and easy to implement.

kNN is more accurate (in most cases) than Naive Bayes andRocchio.

If you need to get a pretty accurate classifier up and runningin a short time . . .

. . . and you don’t care about efficiency that much . . .

. . . use kNN.

28 / 121


kNN classification

kNN = k nearest neighbors

kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.

1NN is not very robust – one document can be mislabeled or atypical.

kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.

Rationale of kNN: contiguity hypothesis

We expect a test document d to have the same label as the training documents located in the local region surrounding d.

29 / 121


Probabilistic kNN

Probabilistic version of kNN: P(c|d) = fraction of the k neighbors of d that are in c

kNN classification rule for probabilistic kNN: Assign d to the class c with highest P(c|d)

30 / 121


kNN is based on Voronoi tessellation

[Figure: training documents from two classes (x and ⋄) with the Voronoi tessellation they induce, and a test document ⋆]

1NN, 3NN classification decision for star?

31 / 121


kNN algorithm

Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k

Apply-kNN(D′, k, d)
1 Sk ← ComputeNearestNeighbors(D′, k, d)
2 for each cj ∈ C(D′)
3 do pj ← |Sk ∩ cj |/k
4 return arg maxj pj
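A compact runnable sketch of Apply-kNN (NumPy; the toy training set is invented, and plain Euclidean distance stands in for whatever representation Preprocess produces):

import numpy as np
from collections import Counter

def apply_knn(train_vecs, train_labels, k, d):
    """Classify d by majority vote among its k nearest training documents."""
    dists = np.linalg.norm(train_vecs - d, axis=1)      # distance to every training doc
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    votes = Counter(train_labels[i] for i in nearest)   # p_j is votes[c_j] / k
    return votes.most_common(1)[0][0]

X = np.array([[1.0, 0.0], [0.9, 0.2], [0.8, 0.1], [0.0, 1.0], [0.1, 0.8]])
y = ["UK", "UK", "UK", "China", "China"]
print(apply_knn(X, y, k=3, d=np.array([0.7, 0.3])))     # -> UK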

32 / 121


Exercise

[Figure: training documents from classes x and o, and a test document ⋆]

How is star classified by:

(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?

33 / 121



Time complexity of kNN

kNN with preprocessing of training set

training   Θ(|D| Lave)
testing    Θ(La + |D| Mave Ma) = Θ(|D| Mave Ma)

kNN test time proportional to the size of the training set!

The larger the training set, the longer it takes to classify a test document.

kNN is inefficient for very large training sets.

35 / 121


kNN: Discussion

No training necessary

But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality training time of kNN is linear.

kNN is very accurate if training set is large.

Optimality result: asymptotically zero error if Bayes rate is zero.

But kNN can be very inaccurate if training set is small.

36 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

37 / 121


Linear classifiers

Linear classifiers compute a linear combination or weighted sum Σi wi xi of the feature values.

Classification decision: Σi wi xi > θ?

. . . where θ (the threshold) is a parameter.

(First, we only consider binary classifiers.)

Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities)

Assumption: The classes are linearly separable.

Can find hyperplane (=separator) based on training set

Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides

38 / 121


A linear classifier in 1D

[Figure: the x1 axis with the decision point θ/w1 marked]

A linear classifier in 1D is a point described by the equation w1 x1 = θ

The point at θ/w1

Points (x1) with w1 x1 ≥ θ are in the class c.

Points (x1) with w1 x1 < θ are in the complement class c̄.

39 / 121


A linear classifier in 2D

A linear classifier in 2D is a line described by the equation w1 x1 + w2 x2 = θ

Example for a 2D linear classifier

Points (x1, x2) with w1 x1 + w2 x2 ≥ θ are in the class c.

Points (x1, x2) with w1 x1 + w2 x2 < θ are in the complement class c̄.

40 / 121


A linear classifier in 3D

A linear classifier in 3D is a plane described by the equation w1 x1 + w2 x2 + w3 x3 = θ

Example for a 3D linear classifier

Points (x1, x2, x3) with w1 x1 + w2 x2 + w3 x3 ≥ θ are in the class c.

Points (x1, x2, x3) with w1 x1 + w2 x2 + w3 x3 < θ are in the complement class c̄.

41 / 121


Rocchio as a linear classifier

Rocchio is a linear classifier defined by:

Σ_{i=1}^{M} wi xi = ~w · ~x = θ

where the normal vector ~w = ~µ(c1) − ~µ(c2) and θ = 0.5 · (|~µ(c1)|² − |~µ(c2)|²).

(follows from the decision boundary |~µ(c1) − ~x| = |~µ(c2) − ~x|)
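A quick numeric check of this equivalence (the centroids and test point are arbitrary): the linear decision ~w · ~x > θ and the “closer centroid” decision agree.

import numpy as np

mu1, mu2 = np.array([1.0, 0.2]), np.array([0.1, 0.9])   # class centroids (made up)
w = mu1 - mu2                                           # normal vector
theta = 0.5 * (mu1 @ mu1 - mu2 @ mu2)                   # threshold from the slide

x = np.array([0.4, 0.6])
linear_decision = (w @ x) > theta                                      # linear classifier
rocchio_decision = np.linalg.norm(mu1 - x) < np.linalg.norm(mu2 - x)   # closer to mu1?
print(linear_decision, rocchio_decision)                # False False -- the two rules agree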

42 / 121


Naive Bayes classifier

~x represents a document; what is p(c|~x), the probability that the document is in class c?

p(c|~x) = p(~x|c) p(c) / p(~x)        p(c̄|~x) = p(~x|c̄) p(c̄) / p(~x)

odds:  p(c|~x) / p(c̄|~x) = [p(~x|c) p(c)] / [p(~x|c̄) p(c̄)] ≈ [p(c) / p(c̄)] · Π_{1≤k≤nd} p(tk|c) / p(tk|c̄)

log odds:  log [p(c|~x) / p(c̄|~x)] = log [p(c) / p(c̄)] + Σ_{1≤k≤nd} log [p(tk|c) / p(tk|c̄)]

43 / 121


Naive Bayes as a linear classifier

Naive Bayes is a linear classifier defined by:

Σ_{i=1}^{M} wi xi = θ

where wi = log( p(ti|c) / p(ti|c̄) ), xi = number of occurrences of ti in d, and θ = − log( p(c) / p(c̄) ).

(the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary)

Linear in log space
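A small sketch of the decision written in this linear form (the conditional probabilities are invented for illustration):

import math

p_c, p_not_c = 0.5, 0.5                                    # class priors
p_t_c     = {"offer": 0.18, "meeting": 0.05}               # p(t_i | c)
p_t_not_c = {"offer": 0.02, "meeting": 0.20}               # p(t_i | c-bar)

w = {t: math.log(p_t_c[t] / p_t_not_c[t]) for t in p_t_c}  # per-term weights w_i
theta = -math.log(p_c / p_not_c)                           # threshold (0 for equal priors)

doc = ["offer", "offer", "meeting"]                        # x_i = occurrence counts of t_i in d
score = sum(w[t] for t in doc)                             # equals sum_i w_i * x_i
print("c" if score > theta else "c-bar", round(score, 3))  # c 3.008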

44 / 121


kNN is not a linear classifier

[Figure: training documents from two classes (x and ⋄); the kNN decision boundary between them is piecewise linear]

Classification decision based on majority of k nearest neighbors.

The decision boundaries between classes are piecewise linear . . .

. . . but they are not linear classifiers that can be described as Σ_{i=1}^{M} wi xi = θ.

45 / 121


Example of a linear two-class classifier

ti           wi      x1i  x2i    ti      wi      x1i  x2i
prime        0.70    0    1      dlrs    −0.71   1    1
rate         0.67    1    0      world   −0.35   1    0
interest     0.63    0    0      sees    −0.33   0    0
rates        0.60    0    0      year    −0.25   0    0
discount     0.46    1    0      group   −0.24   0    0
bundesbank   0.43    0    0      dlr     −0.24   0    0

This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation.
x1: “rate discount dlrs world”
x2: “prime dlrs”
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign document ~d1 “rate discount dlrs world” to interest since
~wT · ~d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = b.
We assign ~d2 “prime dlrs” to the complement class (not in interest) since
~wT · ~d2 = −0.01 ≤ b.

(dlr and world have negative weights because they are indicators for the competing class currency)
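The same decision rule in a few lines of Python, using the weights from the table and the threshold b = 0:

weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def classify_interest(doc_terms, b=0.0):
    """Linear decision: sum the weights of the terms present (0/1 features) and compare to b."""
    score = sum(weights.get(t, 0.0) for t in doc_terms)
    return ("interest" if score > b else "not interest", round(score, 2))

print(classify_interest("rate discount dlrs world".split()))  # ('interest', 0.07)
print(classify_interest("prime dlrs".split()))                # ('not interest', -0.01)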

46 / 121


Which hyperplane?

47 / 121


Which hyperplane?

For linearly separable training sets: there are infinitely many separating hyperplanes.

They all separate the training set perfectly . . .

. . . but they behave differently on test data.

Error rates on new data are low for some, high for others.

How do we find a low-error separator?

Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good

48 / 121


Linear classifiers: Discussion

Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.

Each method has a different way of selecting the separating hyperplane

Huge differences in performance on test documents

Can we get better performance with more powerful nonlinear classifiers?

Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.

49 / 121


A nonlinear problem

[Figure: a two-class data set on the unit square whose class boundary is not linear]

Linear classifier like Rocchio does badly on this task.

kNN will do well (assuming enough training data)

50 / 121


A linear problem with noise

Figure 14.10: hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). Linear class boundary, except for three noise docs.

51 / 121


Which classifier do I use for a given TC problem?

Is there a learning method that is optimal for all text classification problems?

No, because there is a tradeoff between bias and variance.

Factors to take into account:

How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?

For an unstable problem, it’s better to use a simple and robust classifier.

52 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

53 / 121


How to combine hyperplanes for > 2 classes?


(e.g.: rank and select top-ranked classes)

54 / 121


One-of problems

One-of or multiclass classification

Classes are mutually exclusive.
Each document belongs to exactly one class.
Example: language of a document (assumption: no document contains multiple languages)

55 / 121


One-of classification with linear classifiers

Combine two-class linear classifiers as follows for one-of classification:

Run each classifier separately
Rank classifiers (e.g., according to score)
Pick the class with the highest score
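A tiny sketch of this scheme (the per-class weight vectors and thresholds are hypothetical); for any-of classification, described two slides ahead, one would instead keep every class whose score clears its own threshold:

import numpy as np

def one_of_classify(W, thetas, x, classes):
    """Pick the single best class: argmax over the per-class linear scores w_c · x − θ_c."""
    scores = W @ x - thetas
    return classes[int(np.argmax(scores))]

W = np.array([[0.7, -0.7],    # weight vector for "interest" (hypothetical)
              [-0.3, 0.9]])   # weight vector for "currency" (hypothetical)
thetas = np.array([0.0, 0.1])
print(one_of_classify(W, thetas, np.array([1.0, 0.0]), ["interest", "currency"]))  # interest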

56 / 121


Any-of problems

Any-of or multilabel classification

A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of “independence” (but not statistical independence)
Example: topic classification
Usually: make decisions on the region, on the subject area, on the industry and so on “independently”

57 / 121


Any-of classification with linear classifiers

Combine two-class linear classifiers as follows for any-of classification:

Simply run each two-class classifier separately on the test document and assign the document accordingly

58 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

59 / 121


What is clustering?

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.

60 / 121


Data set with clear cluster structure

[Figure: a 2D scatter plot with three well-separated clusters of points]

61 / 121


Classification vs. Clustering

Classification: supervised learning

Clustering: unsupervised learning

Classification: Classes are human-defined and part of the input to the learning algorithm.

Clustering: Clusters are inferred from the data without human input.

However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .

62 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

63 / 121


The cluster hypothesis

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications in IR are based (directly or indirectly) on thecluster hypothesis.

64 / 121


Applications of clustering in IR

Application               What is clustered?        Benefit                                                        Example
Search result clustering  search results            more effective information presentation to user               next slide
Scatter-Gather            (subsets of) collection   alternative user interface: “search without typing”           two slides ahead
Collection clustering     collection                effective information presentation for exploratory browsing   McKeown et al. 2002, news.google.com
Cluster-based retrieval   collection                higher efficiency: faster search                               Salton 1971

65 / 121


Search result clustering for better navigation

Jaguar the cat not among top results, but available via menu at left

66 / 121


Scatter-Gather

A collection of news stories is clustered (“scattered”) into eight clusters (top row); the user manually gathers three into a smaller collection ‘International Stories’ and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).

67 / 121


Global navigation: Yahoo

68 / 121


Global navigation: MeSH (upper level)

69 / 121


Global navigation: MeSH (lower level)

70 / 121


Note: Yahoo/MeSH are not examples of clustering.

But they are well-known examples of using a global hierarchy for navigation.

Some examples of global navigation/exploration based on clustering:

Cartia
Themescapes
Google News

71 / 121


Global navigation combined with visualization (1)

72 / 121


Global navigation combined with visualization (2)

73 / 121


Global clustering for navigation: Google News

http://news.google.com

74 / 121


Clustering for improving recall

To improve search recall:

Cluster docs in collection a priori
When a query matches a doc d, also return other docs in the cluster containing d

Hope: if we do this, the query “car” will also return docs containing “automobile”

Because clustering groups together docs containing “car” with those containing “automobile”.
Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.

75 / 121


Data set with clear cluster structure

[Figure: the same 2D scatter plot with three well-separated clusters]

Exercise: Come up with an algorithm for finding the three clusters in this case

76 / 121


Document representations in clustering

Vector space model

As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .

. . . which is almost equivalent to cosine similarity.

Almost: centroids are not length-normalized.

For centroids, distance and cosine give different results.

77 / 121


Issues in clustering

General goal: put related docs in the same cluster, put unrelated docs in different clusters.

But how do we formalize this?

How many clusters?

Initially, we will assume the number of clusters K is given.

Often: secondary goals in clustering

Example: avoid very small and very large clusters

Flat vs. hierarchical clustering

Hard vs. soft clustering

78 / 121


Flat vs. Hierarchical clustering

Flat algorithms

Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means

Hierarchical algorithms

Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive

79 / 121


Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.

More common and easier to do

Soft clustering: A document can belong to more than one cluster.

Makes more sense for applications like creating browsable hierarchies
You may want to put a pair of sneakers in two clusters:

sports apparel
shoes

You can only do that with a soft clustering approach.

For soft clustering, see course text: 16.5, 18

Today: Flat, hard clustering
Next time: Hierarchical, hard clustering

80 / 121


Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.

Given: a set of documents and the number K

Find: a partition in K clusters that optimizes the chosen partitioning criterion

Global optimization: exhaustively enumerate partitions, pick optimal one

Not tractable

Effective heuristic method: K -means algorithm

81 / 121


Outline

1 Recap

2 Rocchio

3 kNN

4 Linear classifiers

5 > two classes

6 Clustering: Introduction

7 Clustering in IR

8 K -means

82 / 121


K -means

Perhaps the best known clustering algorithm

Simple, works well in many cases

Use as default / baseline for clustering documents

83 / 121


K -means

Each cluster in K -means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared difference from the centroid

Recall definition of centroid:

~µ(ω) = (1/|ω|) Σ_{~x ∈ ω} ~x

where we use ω to denote a cluster.

We try to find the minimum average squared difference by iterating two steps:

reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

84 / 121


K -means algorithm

K-means({~x1, . . . , ~xN}, K)
 1 (~s1,~s2, . . . ,~sK) ← SelectRandomSeeds({~x1, . . . , ~xN}, K)
 2 for k ← 1 to K
 3 do ~µk ← ~sk
 4 while stopping criterion has not been met
 5 do for k ← 1 to K
 6    do ωk ← {}
 7    for n ← 1 to N
 8    do j ← arg minj′ |~µj′ − ~xn|
 9       ωj ← ωj ∪ {~xn} (reassignment of vectors)
10    for k ← 1 to K
11    do ~µk ← (1/|ωk|) Σ_{~x ∈ ωk} ~x (recomputation of centroids)
12 return {~µ1, . . . , ~µK}
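A compact runnable version of this loop (NumPy sketch; a fixed iteration count stands in for the unspecified stopping criterion, and the toy data are two Gaussian blobs):

import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain K-means: random seeds, then alternate reassignment and recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # SelectRandomSeeds
    for _ in range(iters):
        # reassignment: index of the closest centroid for every point
        assign = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # recomputation: each centroid becomes the mean of the points assigned to it
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in ([0.0, 0.0], [2.0, 2.0])])
centroids, assign = kmeans(X, K=2)
print(np.round(centroids, 2))   # two centroids, one near (0, 0) and one near (2, 2)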

85 / 121

[Figures, slides 86–111: K-means with K = 2 illustrated step by step on a small 2D point set]

86: Set of points to be clustered
87: Random selection of initial cluster centers (k = 2 means) — Centroids after convergence?
88–108: repeated iterations, shown as “Assign points to closest centroid” / “Assignment” (points labeled 1 or 2) / “Recompute cluster centroids” (× marks the centroids)
109: Centroids and assignments after convergence
110: Set of points clustered
111: Set of points to be clustered

111 / 121


K -means is guaranteed to converge

Proof:

The sum of squared distances (RSS) decreases during reassignment, because each vector is moved to a closer centroid.
(RSS = sum of all squared distances between document vectors and closest centroids)

RSS decreases during recomputation (see next slide)

There is only a finite number of clusterings.

Thus: We must reach a fixed point.
(assume that ties are broken consistently)

112 / 121


Recomputation decreases average distance

RSS = Σ_{k=1}^{K} RSSk – the residual sum of squares (the “goodness” measure)

RSSk(~v) = Σ_{~x ∈ ωk} ‖~v − ~x‖² = Σ_{~x ∈ ωk} Σ_{m=1}^{M} (vm − xm)²

∂RSSk(~v)/∂vm = Σ_{~x ∈ ωk} 2(vm − xm) = 0

vm = (1/|ωk|) Σ_{~x ∈ ωk} xm

The last line is the componentwise definition of the centroid! We minimize RSSk when the old centroid is replaced with the new centroid. RSS, the sum of the RSSk, must then also decrease during recomputation.
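A quick numeric check on a toy one-dimensional cluster (values arbitrary): replacing the old centroid by the mean of the assigned points lowers RSSk.

# RSS_k of one 1D cluster, for the old centroid vs. the recomputed centroid (the mean).
points = [1.0, 2.0, 6.0]
rss_k = lambda v: sum((v - x) ** 2 for x in points)
old_centroid = 1.0                          # e.g. the seed this cluster started from
new_centroid = sum(points) / len(points)    # recomputed centroid = 3.0
print(rss_k(old_centroid), rss_k(new_centroid))   # 26.0 14.0 -- recomputation decreased RSS_k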

113 / 121


K -means is guaranteed to converge

But we don’t know how long convergence will take!

If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).

However, complete convergence can take many more iterations.

114 / 121


Optimality of K -means

Convergence does not mean that we converge to the optimal clustering!

This is the great weakness of K-means.

If we start with a bad set of seeds, the resulting clustering can be horrible.

115 / 121


Exercise: Suboptimal clustering

[Figure: six documents d1, d2, d3 (upper row) and d4, d5, d6 (lower row) plotted on a small grid]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds di1 , di2?

116 / 121


Exercise: Suboptimal clustering

[Figure: the same six documents d1–d6]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds di1 , di2?

For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}} (suboptimal clustering).

For seeds d2 and d3, it instead converges to {{d1, d2, d4, d5}, {d3, d6}} (the global optimum for K = 2).

117 / 121


Initialization of K -means

Random seed selection is just one of many ways K-means can be initialized.

Random seed selection is not very robust: It’s easy to get a suboptimal clustering.

Better heuristics:

Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
Use hierarchical clustering to find good seeds (next class)
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS

118 / 121


Time complexity of K -means

Computing one distance of two vectors is O(M).

Reassignment step: O(KNM) (we need to compute KN document-centroid distances)

Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions

However: This is not a real worst-case analysis.

In pathological cases, the number of iterations can be much higher than linear in the number of documents.

119 / 121