INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 20/25: Linear Classifiers and Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
10 Nov 2011
1 / 121
Administrativa
Assignment 4 to be posted tomorrow, due Fri 2 Dec (last day of classes), permitted until Sun 4 Dec (no extensions)
2 / 121
Overview
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
3 / 121
Digression: “naive” Bayes
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam ($S$) and 1000 classified as non-spam ($\bar S$).
180 of the $S$ messages contain the word "offer". 20 of the $\bar S$ messages contain the word "offer".
Suppose you receive a message containing the word "offer". What is the probability it is $S$? Estimate:
$$\frac{180}{180+20} = \frac{9}{10}.$$
(Formally, assuming a "flat prior" $p(S) = p(\bar S)$:
$$p(S \mid \text{offer}) = \frac{p(\text{offer}\mid S)\,p(S)}{p(\text{offer}\mid S)\,p(S) + p(\text{offer}\mid \bar S)\,p(\bar S)} = \frac{180/1000}{180/1000 + 20/1000} = \frac{9}{10}.)$$
5 / 121
Basics of probability theory
A = event
0 ≤ p(A) ≤ 1
joint probability p(A,B) = p(A ∩ B)
conditional probability p(A|B) = p(A,B)/p(B)
Note $p(A,B) = p(A\mid B)\,p(B) = p(B\mid A)\,p(A)$, which gives the posterior probability of $A$ after seeing the evidence $B$
Bayes' theorem: $p(A\mid B) = \dfrac{p(B\mid A)\,p(A)}{p(B)}$
In the denominator, use $p(B) = p(B,A) + p(B,\bar A) = p(B\mid A)\,p(A) + p(B\mid \bar A)\,p(\bar A)$
Odds: $O(A) = \dfrac{p(A)}{p(\bar A)} = \dfrac{p(A)}{1 - p(A)}$
6 / 121
“naive” Bayes, cont’d
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam ($S$) and 1000 classified as non-spam ($\bar S$).
Words $w_i$ = {"offer", "FF0000", "click", "unix", "job", "enlarge", . . .}; $n_i$ of the $S$ messages contain the word $w_i$; $m_i$ of the $\bar S$ messages contain the word $w_i$.
Suppose you receive a message containing the words $w_1, w_4, w_5, \ldots$. What are the odds it is $S$? Estimate:
$$p(S \mid w_1,w_4,w_5,\ldots) \propto p(w_1,w_4,w_5,\ldots \mid S)\,p(S)$$
$$p(\bar S \mid w_1,w_4,w_5,\ldots) \propto p(w_1,w_4,w_5,\ldots \mid \bar S)\,p(\bar S)$$
The odds are
$$\frac{p(S \mid w_1,w_4,w_5,\ldots)}{p(\bar S \mid w_1,w_4,w_5,\ldots)} = \frac{p(w_1,w_4,w_5,\ldots \mid S)\,p(S)}{p(w_1,w_4,w_5,\ldots \mid \bar S)\,p(\bar S)}$$
7 / 121
“naive” Bayes odds
The odds
$$\frac{p(S \mid w_1,w_4,w_5,\ldots)}{p(\bar S \mid w_1,w_4,w_5,\ldots)} = \frac{p(w_1,w_4,w_5,\ldots \mid S)\,p(S)}{p(w_1,w_4,w_5,\ldots \mid \bar S)\,p(\bar S)}$$
are approximated by
$$\approx \frac{p(w_1\mid S)\,p(w_4\mid S)\,p(w_5\mid S)\cdots p(w_\ell\mid S)\,p(S)}{p(w_1\mid \bar S)\,p(w_4\mid \bar S)\,p(w_5\mid \bar S)\cdots p(w_\ell\mid \bar S)\,p(\bar S)} \approx \frac{(n_1/1000)(n_4/1000)(n_5/1000)\cdots(n_\ell/1000)}{(m_1/1000)(m_4/1000)(m_5/1000)\cdots(m_\ell/1000)} = \frac{n_1 n_4 n_5 \cdots n_\ell}{m_1 m_4 m_5 \cdots m_\ell}$$
where we've assumed words are independent events, $p(w_1,w_4,w_5,\ldots \mid S) \approx p(w_1\mid S)\,p(w_4\mid S)\,p(w_5\mid S)\cdots p(w_\ell\mid S)$, and $p(w_i\mid S) \approx n_i/|S|$, $p(w_i\mid \bar S) \approx m_i/|\bar S|$
(recall $n_i$ and $m_i$, respectively, counted the number of spam $S$ and non-spam $\bar S$ training messages containing the word $w_i$)
8 / 121
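To make this concrete, here is a minimal Python sketch of the odds computation (an editorial illustration: the helper name spam_odds and all counts except those for "offer" are hypothetical; only the 180/20 counts for "offer", the 1000-message class sizes, and the flat prior come from the slides).

```python
# Minimal sketch of the naive Bayes odds estimate from the slides.
# n[w] = number of spam (S) training messages containing word w,
# m[w] = number of non-spam (S-bar) messages containing w.
# With 1000 spam and 1000 non-spam training messages, the flat prior
# p(S) = p(S-bar) cancels out of the odds.

n = {"offer": 180, "click": 300, "unix": 5, "job": 50}      # spam counts (only "offer" is from the slide)
m = {"offer": 20,  "click": 40,  "unix": 200, "job": 150}   # non-spam counts (only "offer" is from the slide)

def spam_odds(words):
    """Odds p(S|words) / p(S-bar|words) under the naive independence assumption."""
    odds = 1.0  # flat prior: p(S)/p(S-bar) = 1
    for w in words:
        if w in n and m.get(w, 0) > 0:           # unseen words are simply skipped in this sketch
            odds *= n[w] / m[w]                  # (n_i/1000) / (m_i/1000) = n_i / m_i
    return odds

print(spam_odds(["offer"]))            # 9.0, i.e. probability 9/10
print(spam_odds(["offer", "unix"]))    # 9.0 * 5/200 = 0.225
```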
Classification
Naive Bayes is simple and a good baseline.
Use it if you want to get a text classifier up and running in a hurry.
But other classification methods are more accurate.
Perhaps the simplest well performing alternative: kNN
kNN is a vector space classifier.
Today
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN
Next time: general properties of classifiers
9 / 121
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classification in this space?
10 / 121
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don't overlap.
We define lines, surfaces, hypersurfaces to divide regions.
11 / 121
Classes in the vector space
[Figure: documents from the classes China, Kenya, and UK plotted as points in the vector space, together with a test document ⋆]
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?
12 / 121
Recall Rocchio algorithm (lecture 12)
The optimal query vector is:
$$\vec q_{opt} = \vec\mu(D_r) + [\vec\mu(D_r) - \vec\mu(D_{nr})] = \frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j + \Bigl[\frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j\Bigr]$$
We move the centroid of the relevant documents by the difference between the two centroids.
14 / 121
Exercise: Compute Rocchio vector (lecture 12)
circles: relevant documents, X’s: nonrelevant documents
15 / 121
Rocchio illustrated (lecture 12)
[Figure: relevant documents (circles), nonrelevant documents (x's), the two centroids, and the resulting Rocchio vector]
$\vec\mu_R$: centroid of relevant documents
$\vec\mu_{NR}$: centroid of nonrelevant documents
$\vec\mu_R - \vec\mu_{NR}$: difference vector
Add difference vector to $\vec\mu_R$ to get $\vec q_{opt}$
$\vec q_{opt}$ separates relevant/nonrelevant perfectly.
16 / 121
Rocchio 1971 algorithm (SMART) (lecture 12)
Used in practice:
$$\vec q_m = \alpha \vec q_0 + \beta\,\vec\mu(D_r) - \gamma\,\vec\mu(D_{nr}) = \alpha \vec q_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j$$
$\vec q_m$: modified query vector; $\vec q_0$: original query vector; $D_r$ and $D_{nr}$: sets of known relevant and nonrelevant documents respectively; $\alpha$, $\beta$, and $\gamma$: weights attached to each term
New query moves towards relevant documents and away from nonrelevant documents.
Tradeoff $\alpha$ vs. $\beta/\gamma$: If we have a lot of judged documents, we want a higher $\beta/\gamma$.
Set negative term weights to 0.
"Negative weight" for a term doesn't make sense in the vector space model.
17 / 121
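A minimal sketch of this SMART-style update, assuming the query and the documents are already term-weight vectors of the same dimension; the function name rocchio_update and the default weights alpha=1.0, beta=0.75, gamma=0.15 are illustrative choices, not values prescribed by the slide.

```python
import numpy as np

def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """SMART Rocchio: q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    q = alpha * q0
    if len(rel_docs) > 0:
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs) > 0:
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)   # set negative term weights to 0, as on the slide

# hypothetical 4-term vectors
q0  = np.array([1.0, 0.0, 0.0, 0.0])
Dr  = np.array([[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0]])
Dnr = np.array([[0.0, 0.0, 0.9, 0.5]])
print(rocchio_update(q0, Dr, Dnr))   # moves toward Dr's centroid, away from Dnr's
```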
Using Rocchio for vector space classification
We can view relevance feedback as two-class classification.
The two classes: the relevant documents and the nonrelevant documents.
The training set is the set of documents the user has labeled so far.
The principal difference between relevance feedback and text classification:
The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.
18 / 121
Rocchio classification: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
19 / 121
Recall definition of centroid
$$\vec\mu(c) = \frac{1}{|D_c|}\sum_{d \in D_c} \vec v(d)$$
where $D_c$ is the set of all documents that belong to class $c$ and $\vec v(d)$ is the vector space representation of $d$.
20 / 121
Rocchio algorithm
TrainRocchio(C, D)
1 for each cj ∈ C
2 do Dj ← {d : 〈d, cj〉 ∈ D}
3    ~µj ← (1/|Dj|) ∑_{d∈Dj} ~v(d)
4 return {~µ1, . . . , ~µJ}

ApplyRocchio({~µ1, . . . , ~µJ}, d)
1 return arg minj |~µj − ~v(d)|
21 / 121
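The pseudocode above translates almost line for line into Python. A hedged sketch, assuming documents are already available as NumPy vectors (the function names and the toy 2D data are editorial, not from the slides).

```python
import numpy as np

def train_rocchio(labeled_docs):
    """labeled_docs: list of (vector, class_label) pairs. Returns {label: centroid}."""
    by_class = {}
    for vec, label in labeled_docs:
        by_class.setdefault(label, []).append(vec)
    return {label: np.mean(vecs, axis=0) for label, vecs in by_class.items()}

def apply_rocchio(centroids, d):
    """Assign d to the class of its closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: np.linalg.norm(centroids[label] - d))

# hypothetical 2D documents
docs = [(np.array([1.0, 0.1]), "China"), (np.array([0.9, 0.2]), "China"),
        (np.array([0.1, 1.0]), "UK"),    (np.array([0.2, 0.8]), "UK")]
centroids = train_rocchio(docs)
print(apply_rocchio(centroids, np.array([0.8, 0.3])))   # -> "China"
```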
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
[Figure: documents from the classes China, Kenya, and UK with their Rocchio decision boundaries, defined by equal distances a1 = a2, b1 = b2, c1 = c2 to the two nearest centroids; the test document ⋆ falls in the China region]
22 / 121
Rocchio properties
Rocchio forms a simple representation for each class: the centroid.
We can interpret the centroid as the prototype of the class.
Classification is based on similarity to / distance from the centroid/prototype.
Does not guarantee that classifications are consistent with the training data!
23 / 121
Time complexity of Rocchio
mode      time complexity
training  Θ(|D| L_ave + |C| |V|) ≈ Θ(|D| L_ave)
testing   Θ(L_a + |C| M_a) ≈ Θ(|C| M_a)
24 / 121
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal classes correctly.
25 / 121
Rocchio cannot handle nonconvex, multimodal classes
[Figure: the a class forms two separate clumps with the b class lying between them; A is the centroid of the a's, B the centroid of the b's, and o is a test point]
Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?
A is the centroid of the a's, B is the centroid of the b's.
The point o is closer to A than to B.
But it is a better fit for the b class.
a is a multimodal class with two prototypes.
But in Rocchio we only have one.
26 / 121
kNN classification
kNN classification is another vector space classification method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes and Rocchio.
If you need to get a pretty accurate classifier up and running in a short time . . .
. . . and you don’t care about efficiency that much . . .
. . . use kNN.
28 / 121
kNN classification
kNN = k nearest neighbors
kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.
1NN is not very robust – one document can be mislabeled or atypical.
kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
29 / 121
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c
kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d)
30 / 121
kNN is based on Voronoi tessellation
[Figure: training documents from two classes (x's and ⋄'s), a test document ⋆, and the Voronoi tessellation induced by the training points]
1NN, 3NN classification decision for the star?
31 / 121
kNN algorithm
Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k

Apply-kNN(D′, k, d)
1 Sk ← ComputeNearestNeighbors(D′, k, d)
2 for each cj ∈ C(D′)
3 do pj ← |Sk ∩ cj|/k
4 return arg maxj pj
32 / 121
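A minimal sketch of Apply-kNN, assuming the "preprocessing" is simply storing the labeled training vectors; a real system would use an index or approximate nearest-neighbor structure, and the toy data here are hypothetical.

```python
import numpy as np
from collections import Counter

def apply_knn(train_vecs, train_labels, k, d):
    """Assign d to the majority class among its k nearest training vectors.
    The class fractions votes[c]/k are the probabilistic-kNN estimates P(c|d)."""
    dists = np.linalg.norm(train_vecs - d, axis=1)      # distances to all training docs
    nearest = np.argsort(dists)[:k]                      # indices of the k closest
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                    # majority class (ties broken arbitrarily)

# hypothetical data: two classes in 2D
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = ["a", "a", "b", "b", "b"]
print(apply_knn(X, y, k=3, d=np.array([0.8, 0.8])))      # -> "b"
```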
Exercise
[Figure: a test document ⋆ surrounded by training documents from two classes, x's and o's]
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
33 / 121
Time complexity of kNN
kNN with preprocessing of training set
training  Θ(|D| L_ave)
testing   Θ(L_a + |D| M_ave M_a) = Θ(|D| M_ave M_a)
kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a test document.
kNN is inefficient for very large training sets.
35 / 121
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality training time of kNN is linear.
kNN is very accurate if training set is large.
Optimality result: asymptotically zero error if the Bayes error rate is zero.
But kNN can be very inaccurate if training set is small.
36 / 121
Linear classifiers
Linear classifiers compute a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values.
Classification decision: $\sum_i w_i x_i > \theta$?
. . . where $\theta$ (the threshold) is a parameter.
(First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities)
Assumption: The classes are linearly separable.
Can find hyperplane (=separator) based on training set
Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides
38 / 121
A linear classifier in 1D
x1
A linear classifier in 1D is a point described by the equation $w_1 x_1 = \theta$
The point at $\theta/w_1$
Points ($x_1$) with $w_1 x_1 \geq \theta$ are in the class $c$.
Points ($x_1$) with $w_1 x_1 < \theta$ are in the complement class $\bar c$.
39 / 121
A linear classifier in 2D
A linear classifier in 2D is a line described by the equation $w_1 x_1 + w_2 x_2 = \theta$
Example for a 2D linear classifier
Points ($x_1$, $x_2$) with $w_1 x_1 + w_2 x_2 \geq \theta$ are in the class $c$.
Points ($x_1$, $x_2$) with $w_1 x_1 + w_2 x_2 < \theta$ are in the complement class $\bar c$.
40 / 121
A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation $w_1 x_1 + w_2 x_2 + w_3 x_3 = \theta$
Example for a 3D linear classifier
Points ($x_1$, $x_2$, $x_3$) with $w_1 x_1 + w_2 x_2 + w_3 x_3 \geq \theta$ are in the class $c$.
Points ($x_1$, $x_2$, $x_3$) with $w_1 x_1 + w_2 x_2 + w_3 x_3 < \theta$ are in the complement class $\bar c$.
41 / 121
Rocchio as a linear classifier
Rocchio is a linear classifier defined by:
$$\sum_{i=1}^{M} w_i x_i = \vec w \cdot \vec x = \theta$$
where the normal vector $\vec w = \vec\mu(c_1) - \vec\mu(c_2)$ and $\theta = 0.5 \cdot (|\vec\mu(c_1)|^2 - |\vec\mu(c_2)|^2)$.
(follows from the decision boundary $|\vec\mu(c_1) - \vec x| = |\vec\mu(c_2) - \vec x|$)
42 / 121
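A small numeric check of this reduction, with hypothetical 2D centroids; the only point is that the linear rule $\vec w \cdot \vec x \geq \theta$ and the nearest-centroid rule give the same decision.

```python
import numpy as np

# Rocchio's boundary |mu1 - x| = |mu2 - x| rewritten as a linear classifier:
# w = mu1 - mu2, theta = 0.5 * (|mu1|^2 - |mu2|^2); assign to c1 iff w . x >= theta.
mu1 = np.array([0.95, 0.15])   # hypothetical centroid of class c1
mu2 = np.array([0.15, 0.90])   # hypothetical centroid of class c2

w = mu1 - mu2
theta = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))

x = np.array([0.8, 0.3])
print(np.dot(w, x) >= theta)                                  # linear rule
print(np.linalg.norm(mu1 - x) <= np.linalg.norm(mu2 - x))     # distance rule: same answer
```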
Naive Bayes classifier
$\vec x$ represents a document; what is the probability $p(c\mid\vec x)$ that the document is in class $c$?
$$p(c\mid\vec x) = \frac{p(\vec x\mid c)\,p(c)}{p(\vec x)} \qquad\qquad p(\bar c\mid\vec x) = \frac{p(\vec x\mid \bar c)\,p(\bar c)}{p(\vec x)}$$
$$\text{odds:}\quad \frac{p(c\mid\vec x)}{p(\bar c\mid\vec x)} = \frac{p(\vec x\mid c)\,p(c)}{p(\vec x\mid \bar c)\,p(\bar c)} \approx \frac{p(c)}{p(\bar c)}\,\frac{\prod_{1\le k\le n_d} p(t_k\mid c)}{\prod_{1\le k\le n_d} p(t_k\mid \bar c)}$$
$$\text{log odds:}\quad \log\frac{p(c\mid\vec x)}{p(\bar c\mid\vec x)} = \log\frac{p(c)}{p(\bar c)} + \sum_{1\le k\le n_d} \log\frac{p(t_k\mid c)}{p(t_k\mid \bar c)}$$
43 / 121
Naive Bayes as a linear classifier
Naive Bayes is a linear classifier defined by:
$$\sum_{i=1}^{M} w_i x_i = \theta$$
where $w_i = \log\bigl(p(t_i\mid c)/p(t_i\mid \bar c)\bigr)$, $x_i$ = number of occurrences of $t_i$ in $d$, and $\theta = -\log\bigl(p(c)/p(\bar c)\bigr)$.
(the index $i$, $1 \le i \le M$, refers to terms of the vocabulary)
Linear in log space
44 / 121
kNN is not a linear classifier
[Figure: training documents from two classes (x's and ⋄'s) and a test document ⋆; the kNN decision boundary between the classes follows the Voronoi tessellation]
Classification decision based on majority of k nearest neighbors.
The decision boundaries between classes are piecewise linear . . .
. . . but they are not linear classifiers that can be described as $\sum_{i=1}^{M} w_i x_i = \theta$.
45 / 121
Example of a linear two-class classifier
ti          wi     x1i  x2i  |  ti      wi     x1i  x2i
prime       0.70   0    1    |  dlrs    -0.71  1    1
rate        0.67   1    0    |  world   -0.35  1    0
interest    0.63   0    0    |  sees    -0.33  0    0
rates       0.60   0    0    |  year    -0.25  0    0
discount    0.46   1    0    |  group   -0.24  0    0
bundesbank  0.43   0    0    |  dlr     -0.24  0    0
This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation
x1: "rate discount dlrs world"
x2: "prime dlrs"
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign document $\vec d_1$ "rate discount dlrs world" to interest since $\vec w^T \cdot \vec d_1 = 0.67\cdot 1 + 0.46\cdot 1 + (-0.71)\cdot 1 + (-0.35)\cdot 1 = 0.07 > 0 = b$.
We assign $\vec d_2$ "prime dlrs" to the complement class (not in interest) since $\vec w^T \cdot \vec d_2 = -0.01 \le b$.
(dlr and world have negative weights because they are indicators for the competing class currency)
46 / 121
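The worked example can be checked mechanically. A small sketch using only the weights from the table, 0/1 document vectors, and threshold b = 0; the helper name score is an editorial convenience.

```python
# Per-term weights for the Reuters-21578 class "interest", as given in the table above.
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43,
     "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
     "group": -0.24, "dlr": -0.24}

def score(words):
    """w . d for a 0/1 document vector over the given words."""
    return sum(w.get(t, 0.0) for t in set(words))

b = 0.0
d1 = "rate discount dlrs world".split()
d2 = "prime dlrs".split()
print(round(score(d1), 2), score(d1) > b)   #  0.07 True  -> assigned to "interest"
print(round(score(d2), 2), score(d2) > b)   # -0.01 False -> complement class
```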
Which hyperplane?
47 / 121
Which hyperplane?
For linearly separable training sets: there are infinitely many separating hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave differently on test data.
Error rates on new data are low for some, high for others.
How do we find a low-error separator?
Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good
48 / 121
Linear classifiers: Discussion
Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.
Each method has a different way of selecting the separating hyperplane
Huge differences in performance on test documents
Can we get better performance with more powerful nonlinear classifiers?
Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.
49 / 121
A nonlinear problem
[Figure: a 2D data set whose class structure is not linearly separable]
Linear classifier like Rocchio does badly on this task.
kNN will do well (assuming enough training data)
50 / 121
A linear problem with noise
Figure 14.10: hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). Linear class boundary, except for three noise docs.
51 / 121
Which classifier do I use for a given TC problem?
Is there a learning method that is optimal for all text classification problems?
No, because there is a tradeoff between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it's better to use a simple and robust classifier.
52 / 121
How to combine hyperplanes for > 2 classes?
(e.g.: rank and select top-ranked classes)
54 / 121
One-of problems
One-of or multiclass classification
Classes are mutually exclusive.
Each document belongs to exactly one class.
Example: language of a document (assumption: no document contains multiple languages)
55 / 121
One-of classification with linear classifiers
Combine two-class linear classifiers as follows for one-of classification:
Run each classifier separately
Rank classifiers (e.g., according to score)
Pick the class with the highest score
56 / 121
Any-of problems
Any-of or multilabel classification
A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of "independence" (but not statistical independence)
Example: topic classification
Usually: make decisions on the region, on the subject area, on the industry and so on "independently"
57 / 121
Any-of classification with linear classifiers
Combine two-class linear classifiers as follows for any-of classification:
Simply run each two-class classifier separately on the test document and assign the document accordingly
58 / 121
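A hedged sketch of both combination schemes, assuming each class already has a trained two-class linear classifier (w_c, θ_c); ranking one-of candidates by the margin w·x − θ is a reasonable choice here, not something the slides prescribe, and the 2D classifiers below are hypothetical.

```python
import numpy as np

def one_of_classify(classifiers, x):
    """One-of: run every classifier and pick the single class with the highest score."""
    return max(classifiers, key=lambda c: np.dot(classifiers[c][0], x) - classifiers[c][1])

def any_of_classify(classifiers, x):
    """Any-of: run every classifier independently and keep each class that accepts x."""
    return [c for c, (w, theta) in classifiers.items() if np.dot(w, x) >= theta]

clf = {"UK":    (np.array([1.0, -0.2]), 0.3),
       "China": (np.array([-0.1, 1.0]), 0.3)}
x = np.array([0.2, 0.9])
print(one_of_classify(clf, x))   # -> "China"
print(any_of_classify(clf, x))   # -> ["China"]
```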
What is clustering?
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.
60 / 121
Data set with clear cluster structure
[Figure: a 2D scatter plot with three well-separated groups of points]
61 / 121
Classification vs. Clustering
Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
62 / 121
The cluster hypothesis
Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
All applications in IR are based (directly or indirectly) on the cluster hypothesis.
64 / 121
Applications of clustering in IR
Application              | What is clustered?      | Benefit                                                      | Example
Search result clustering | search results          | more effective information presentation to user             | next slide
Scatter-Gather           | (subsets of) collection | alternative user interface: "search without typing"         | two slides ahead
Collection clustering    | collection              | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com
Cluster-based retrieval  | collection              | higher efficiency: faster search                             | Salton 1971
65 / 121
Search result clustering for better navigation
Jaguar the cat not among top results, but available via menu at left
66 / 121
Scatter-Gather
A collection of news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of them into a smaller collection 'International Stories' and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).
67 / 121
Global navigation: Yahoo
68 / 121
Global navigation: MESH (upper level)
69 / 121
Global navigation: MESH (lower level)
70 / 121
Note: Yahoo/MESH are not examples of clustering.
But they are well-known examples of using a global hierarchy for navigation.
Some examples of global navigation/exploration based on clustering:
Cartia
Themescapes
Google News
71 / 121
Global navigation combined with visualization (1)
72 / 121
Global navigation combined with visualization (2)
73 / 121
Global clustering for navigation: Google News
http://news.google.com
74 / 121
Clustering for improving recall
To improve search recall:
Cluster docs in collection a priori
When a query matches a doc d, also return other docs in the cluster containing d
Hope: if we do this, the query "car" will also return docs containing "automobile"
Because clustering groups together docs containing "car" with those containing "automobile".
Both types of documents contain words like "parts", "dealer", "mercedes", "road trip".
75 / 121
Data set with clear cluster structure
[Figure: the same 2D scatter plot with three well-separated groups of points]
Exercise: Come up with an algorithm for finding the three clusters in this case
76 / 121
Document representations in clustering
Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .
. . . which is almost equivalent to cosine similarity.
Almost: centroids are not length-normalized.
For centroids, distance and cosine give different results.
77 / 121
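The "almost equivalent" claim can be made precise for length-normalized vectors; the following standard identity is not spelled out on the slide. For unit vectors $|\vec x| = |\vec y| = 1$:
$$\|\vec x - \vec y\|^2 = \|\vec x\|^2 + \|\vec y\|^2 - 2\,\vec x \cdot \vec y = 2\,\bigl(1 - \cos(\vec x, \vec y)\bigr),$$
so ranking neighbors by Euclidean distance or by cosine similarity gives the same order. Centroids, however, are generally not unit length, so the identity (and with it the equivalence) breaks down for document-centroid comparisons.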
Issues in clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
But how do we formalize this?
How many clusters?
Initially, we will assume the number of clusters K is given.
Often: secondary goals in clustering
Example: avoid very small and very large clusters
Flat vs. hierarchical clustering
Hard vs. soft clustering
78 / 121
Flat vs. Hierarchical clustering
Flat algorithms
Usually start with a random (partial) partitioning of docs into groups
Refine iteratively
Main algorithm: K-means
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
79 / 121
Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one cluster.
More common and easier to do
Soft clustering: A document can belong to more than one cluster.
Makes more sense for applications like creating browsable hierarchies
You may want to put a pair of sneakers in two clusters:
sports apparel
shoes
You can only do that with a soft clustering approach.
For soft clustering, see course text: 16.5, 18
Today: Flat, hard clustering
Next time: Hierarchical, hard clustering
80 / 121
Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
Not tractable
Effective heuristic method: K-means algorithm
81 / 121
K-means
Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default / baseline for clustering documents
83 / 121
K-means
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall definition of centroid:
$$\vec\mu(\omega) = \frac{1}{|\omega|}\sum_{\vec x \in \omega} \vec x$$
where we use $\omega$ to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
84 / 121
K -means algorithm
K-means({~x1, . . . , ~xN}, K)
 1 (~s1, ~s2, . . . , ~sK) ← SelectRandomSeeds({~x1, . . . , ~xN}, K)
 2 for k ← 1 to K
 3 do ~µk ← ~sk
 4 while stopping criterion has not been met
 5 do for k ← 1 to K
 6    do ωk ← {}
 7    for n ← 1 to N
 8    do j ← arg minj′ |~µj′ − ~xn|
 9       ωj ← ωj ∪ {~xn}  (reassignment of vectors)
10    for k ← 1 to K
11    do ~µk ← (1/|ωk|) ∑_{~x∈ωk} ~x  (recomputation of centroids)
12 return {~µ1, . . . , ~µK}
85 / 121
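A compact Python sketch of this algorithm, using random seed selection and a fixed iteration count as the stopping criterion; both are simplifications, and empty clusters are simply left unchanged.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Lloyd-style K-means sketch: returns (centroids, assignments) for an (N, M) array X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]            # SelectRandomSeeds
    for _ in range(iters):                                              # stopping criterion: fixed #iterations
        # reassignment: each vector goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation: each centroid becomes the mean of its assigned vectors
        for k in range(K):
            if np.any(assign == k):                                     # empty clusters left as-is
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

# hypothetical 2D data with two obvious clusters
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
              [2.0, 2.1], [2.1, 1.9], [1.9, 2.0]])
print(kmeans(X, K=2)[1])   # e.g. [0 0 0 1 1 1] (cluster ids may be permuted)
```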
Worked example (slides 86–111, figures only): a small set of 2D points is clustered with k = 2.
[The original slides show the set of points to be clustered, a random selection of two initial cluster centers (with the question: where will the centroids be after convergence?), and then repeated iterations of assigning each point to its closest centroid and recomputing the cluster centroids, ending with the centroids and assignments after convergence and the final clustered point set.]
111 / 121
K-means is guaranteed to converge
Proof:
The sum of squared distances (RSS) decreases during reassignment, because each vector is moved to a closer centroid
(RSS = sum of all squared distances between document vectors and closest centroids)
RSS decreases during recomputation (see next slide)
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
(assume that ties are broken consistently)
112 / 121
Recomputation decreases average distance
$RSS = \sum_{k=1}^{K} RSS_k$ – the residual sum of squares (the "goodness" measure)
$$RSS_k(\vec v) = \sum_{\vec x \in \omega_k} \|\vec v - \vec x\|^2 = \sum_{\vec x \in \omega_k}\sum_{m=1}^{M} (v_m - x_m)^2$$
$$\frac{\partial RSS_k(\vec v)}{\partial v_m} = \sum_{\vec x \in \omega_k} 2(v_m - x_m) = 0 \quad\Rightarrow\quad v_m = \frac{1}{|\omega_k|}\sum_{\vec x \in \omega_k} x_m$$
The last line is the componentwise definition of the centroid!
We minimize $RSS_k$ when the old centroid is replaced with the new centroid.
RSS, the sum of the $RSS_k$, must then also decrease during recomputation.
113 / 121
K-means is guaranteed to converge
But we don't know how long convergence will take!
If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
However, complete convergence can take many more iterations.
114 / 121
Optimality of K-means
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
115 / 121
Exercise: Suboptimal clustering
[Figure: six documents d1, d2, d3 (top row) and d4, d5, d6 (bottom row) arranged in two rows of three on a small grid]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds di1 , di2?
116 / 121
Exercise: Suboptimal clustering
For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}} (suboptimal clustering).
For seeds d2 and d3, it instead converges to {{d1, d2, d4, d5}, {d3, d6}} (the global optimum for K = 2).
117 / 121
Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It's easy to get a suboptimal clustering.
Better heuristics:
Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)
Use hierarchical clustering to find good seeds (next class)
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
118 / 121
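The last heuristic is easy to sketch by reusing the kmeans function from the earlier sketch together with the RSS definition from the convergence proof; restarts=10 mirrors the i = 10 example on the slide.

```python
import numpy as np

def rss(X, centroids, assign):
    """Residual sum of squares: total squared distance of each document to its centroid."""
    X = np.asarray(X, dtype=float)
    return sum(np.sum((X[assign == k] - c) ** 2) for k, c in enumerate(centroids))

def kmeans_restarts(X, K, restarts=10):
    """Run K-means with several random seed sets and keep the clustering with lowest RSS."""
    best = None
    for i in range(restarts):
        centroids, assign = kmeans(X, K, seed=i)    # kmeans() from the earlier sketch
        score = rss(X, centroids, assign)
        if best is None or score < best[0]:
            best = (score, centroids, assign)
    return best[1], best[2]
```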
Time complexity of K-means
Computing one distance of two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids)
Assume number of iterations bounded by I
Overall complexity: O(IKNM) – linear in all important dimensions
However: This is not a real worst-case analysis.
In pathological cases, the number of iterations can be much higher than linear in the number of documents.
119 / 121
Recommended