Introduction to Machine Learning
Clustering
Varun Chandola
Computer Science & Engineering
State University of New York at Buffalo
Buffalo, NY, USA
Chandola@UB CSE 474/574 1 / 19
Outline
Clustering
- Clustering Definition
- K-Means Clustering
- Instantiations and Variants of K-Means
- Choosing Parameters
- Initialization Issues
- K-Means Limitations

Spectral Clustering
- Graph Laplacian
- Spectral Clustering Algorithm
Publishing a Magazine
- Imagine you are a magazine editor
- You need to produce the next issue
- What do you do?
- Call your four assistant editors:
  1. Politics
  2. Health
  3. Technology
  4. Sports
- Ask each to send in k articles
- Join them all to create an issue
Treating a Magazine Issue as a Data Set
- Each article is a data point consisting of words, etc.
- Each article has a (hidden) type: sports, health, politics, or technology
- Now imagine you are the reader: can you assign the type to each article?
- Simpler problem: can you group the articles by type?
- This is clustering
What is Clustering?
- Grouping similar things together
- Requires a notion of a similarity or distance metric
- A type of unsupervised learning: learning without any labels or targets
Expected Outcome of Clustering
[Figure: scatter plots in feature space (x1 vs. x2) showing the articles before and after being grouped into Technology, Health, Politics, and Sports clusters]
K-Means Clustering
- Objective: group a set of N points (∈ ℝ^D) into K clusters.

1. Start with K randomly initialized points in D-dimensional space
   - Denoted as {c_k}, k = 1, …, K
   - Also called cluster centers
2. Assign each input point x_n (∀n ∈ [1, N]) to the cluster k that minimizes the distance to its center, i.e., k = \arg\min_k dist(x_n, c_k)
3. Revise each cluster center c_k using all points assigned to cluster k
4. Repeat steps 2 and 3 until the assignments stop changing
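The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation of Lloyd's algorithm (the function and variable names are my own), not a production version:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Illustrative sketch of the K-means steps above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K cluster centers as randomly chosen input points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each x_n to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: revise each center to the mean of its assigned points
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        # Step 4: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```

On well-separated data this typically converges in a handful of iterations; the initialization issues discussed later explain why multiple restarts are still advisable.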
Variants of K-Means
- Finding distance
  - Euclidean distance is popular
- Finding cluster centers
  - Mean for K-means
  - The most centrally located data point (medoid) for K-medoids
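The effect of the center update is easy to see with an outlier. The coordinate-wise median below is the K-medians-style update, shown purely to illustrate robustness (K-medoids additionally restricts the center to be an actual data point); the data are made up:

```python
import numpy as np

# Points assigned to one cluster, with a single outlier.
pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [10.0, 10.0]])

mean_center = pts.mean(axis=0)           # K-means update: dragged toward the outlier
median_center = np.median(pts, axis=0)   # median update: robust to the outlier

# The median center stays near the bulk of the cluster; the mean does not.
assert np.linalg.norm(median_center - [1.0, 1.0]) < 0.5
assert np.linalg.norm(mean_center - [1.0, 1.0]) > 2.0
```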
Choosing Parameters
1. Similarity/distance metric
   - Can use non-linear transformations
   - K-means with Euclidean distance produces "circular" clusters
2. How to set K?
   - Trial and error
   - How do we evaluate a clustering? Use the K-means objective function:

     J(c, R) = \sum_{n=1}^{N} \sum_{k=1}^{K} R_{nk} \|x_n - c_k\|^2

   - R is the N × K cluster assignment matrix:

     R_{nk} = 1 if x_n ∈ cluster k, and 0 otherwise
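The objective J(c, R) is straightforward to compute, and one common version of "trial and error" is to plot it against K and look for a sharp bend (the "elbow" heuristic, not covered on the slide). A minimal sketch with made-up data:

```python
import numpy as np

def kmeans_objective(X, centers, assign):
    """J(c, R): sum of squared distances from each point to its assigned center."""
    return float(((X - centers[assign]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
assign = np.array([0, 0, 1, 1])

centers_a = np.array([[0.0, 0.0], [5.0, 5.0]])   # arbitrary centers
centers_b = np.array([[0.0, 0.5], [5.0, 5.5]])   # the per-cluster means

# Moving each center to the mean of its cluster can only decrease J,
# which is why the center-update step of K-means makes progress.
assert kmeans_objective(X, centers_b, assign) < kmeans_objective(X, centers_a, assign)
```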
Initialization Issues
- A bad initialization can lead to a poor clustering
- Better strategies:
  1. Choose the first centroid randomly, the second farthest away from the first, the third farthest away from the first and second, and so on.
  2. Make multiple runs and choose the best
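Strategy 1 (greedy farthest-point initialization) can be sketched as follows; the function name is my own:

```python
import numpy as np

def farthest_first_init(X, K, seed=0):
    """Pick the first centroid at random, then each next centroid as the
    point farthest from all centroids chosen so far (strategy 1 above)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance from every point to its nearest already-chosen centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    return np.array(centers)
```

The related k-means++ scheme samples the next centroid with probability proportional to this squared distance rather than taking the maximum, which makes it less sensitive to outliers.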
Strengths and Limitations of K-Means
Strengths
- Simple
- Can be extended to other types of data
- Easy to parallelize

Weaknesses
- Produces "circular" clusters (though not with kernelized versions)
- Choosing K is always an issue
- Not guaranteed to find the optimal solution
- Works well only if the natural clusters are round and of equal density
- Hard clustering: each point belongs to exactly one cluster
Issues with K-Means
- "Hard clustering": every data point is assigned to exactly one cluster
- Probabilistic clustering: each data point can belong to multiple clusters with varying probabilities
- In general, P(x_i ∈ C_j) > 0 for all j = 1, …, K
- For hard clustering, the probability is 1 for one cluster and 0 for all others
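As a toy illustration of soft assignment (not from the slides), probabilities can be obtained by passing negative squared distances through a softmax; a full probabilistic treatment would instead fit a mixture model such as a Gaussian mixture with EM:

```python
import numpy as np

def soft_assign(X, centers, beta=1.0):
    """Toy soft clustering: P(x_i in C_j) from a softmax over negative
    squared distances to the centers (beta controls how 'hard' it is)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-beta * d2)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.5, 2.5]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
P = soft_assign(X, centers)

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
assert (P > 0).all()                     # every point belongs a little to every cluster
```

As beta → ∞, the softmax sharpens and this recovers hard assignment: probability 1 for the nearest cluster and 0 elsewhere.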
Spectral Clustering
- An alternative approach to clustering
- Let the data be a set of N points: X = {x_1, x_2, …, x_N}
- Let S be an N × N similarity matrix: S_{ij} = sim(x_i, x_j), where sim(·, ·) is a similarity function
- Construct a weighted undirected graph from S with adjacency matrix W:

  W_{ij} = sim(x_i, x_j) if x_i is a nearest neighbor of x_j, and 0 otherwise

- More than one nearest neighbor can be used to construct the graph
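A common concrete choice (an assumption here, since the slides leave sim(·, ·) abstract) is the Gaussian similarity sim(x_i, x_j) = exp(−γ‖x_i − x_j‖²), kept only for nearest-neighbor pairs:

```python
import numpy as np

def knn_similarity_graph(X, n_neighbors=2, gamma=1.0):
    """Build a symmetric adjacency matrix W: Gaussian similarities,
    kept only between k-nearest-neighbor pairs (illustrative sketch)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-gamma * d2)
    np.fill_diagonal(S, 0.0)                   # no self-loops
    W = np.zeros_like(S)
    for i in range(len(X)):
        nn = np.argsort(-S[i])[:n_neighbors]   # the most similar points to x_i
        W[i, nn] = S[i, nn]
    return np.maximum(W, W.T)                  # symmetrize: undirected graph
```

Symmetrizing with the elementwise maximum keeps an edge whenever either endpoint considers the other a nearest neighbor, which is one standard convention for kNN graphs.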
Spectral Clustering as a Graph Min-cut Problem
- Clustering X into K clusters is equivalent to finding K cuts in the graph W: A_1, A_2, …, A_K
- A possible objective function:

  cut(A_1, A_2, \ldots, A_K) \triangleq \frac{1}{2} \sum_{k=1}^{K} W(A_k, \bar{A}_k)

- where \bar{A}_k denotes the nodes of the graph that are not in A_k, and

  W(A, B) \triangleq \sum_{i \in A, j \in B} W_{ij}
Straight min-cut results in trivial solution
- For K = 2, an optimal solution would place only one node in A_1 and the rest in A_2 (or vice versa)

Normalized Min-cut Problem

  normcut(A_1, A_2, \ldots, A_K) \triangleq \frac{1}{2} \sum_{k=1}^{K} \frac{W(A_k, \bar{A}_k)}{vol(A_k)}

where vol(A) \triangleq \sum_{i \in A} d_i and d_i is the weighted degree of node i
NP Hard Problem
- Equivalent to solving a 0-1 knapsack problem
- Find N binary vectors c_i of length K such that c_{ik} = 1 only if point i belongs to cluster k
- If we relax the constraints to allow c_{ik} to be real-valued, the problem becomes an eigenvector problem
  - Hence the name: spectral clustering
The Graph Laplacian
  L \triangleq D - W

- D is a diagonal matrix whose diagonal values are the degrees of the corresponding nodes

Properties of the Laplacian Matrix
1. Each row sums to 0
2. The all-ones vector 1 is an eigenvector, with eigenvalue 0
3. Symmetric and positive semi-definite
4. Has N non-negative real-valued eigenvalues
5. If the graph (W) has K connected components, then L has K eigenvectors with eigenvalue 0, spanned by the indicator vectors 1_{A_1}, …, 1_{A_K}
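These properties are easy to verify numerically on a small example: a four-node graph with two connected components ({0, 1} and {2, 3}), so K = 2 and the eigenvalue 0 appears with multiplicity 2:

```python
import numpy as np

# Adjacency matrix of a graph with two connected components.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))   # diagonal degree matrix
L = D - W                    # the graph Laplacian

assert np.allclose(L.sum(axis=1), 0)         # property 1: rows sum to 0
assert np.allclose(L, L.T)                   # property 3: symmetric
eigvals = np.linalg.eigvalsh(L)              # real, sorted (L is symmetric)
assert np.all(eigvals >= -1e-10)             # properties 3-4: PSD, non-negative
assert np.sum(np.abs(eigvals) < 1e-10) == 2  # property 5: K = 2 zero eigenvalues
```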
Spectral Clustering Algorithm
Observation
- In practice, W might not have exactly K isolated connected components
- By perturbation theory, the eigenvectors of L with the smallest eigenvalues will still be close to the ideal indicator vectors

Algorithm
- Compute the first (smallest) K eigenvectors of L
- Let U be the N × K matrix with these eigenvectors as its columns
- Perform K-means clustering on the rows of U
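Putting the pieces together, a minimal sketch of the algorithm above (assuming a precomputed adjacency matrix W, and using a deterministic farthest-first initialization for the final K-means step):

```python
import numpy as np

def spectral_clustering(W, K):
    """Embed the nodes via the K eigenvectors of L = D - W with the smallest
    eigenvalues, then run a basic K-means on the rows of the embedding."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)      # columns sorted by ascending eigenvalue
    U = eigvecs[:, :K]                  # N x K matrix of the smallest eigenvectors
    # Basic K-means on the rows of U, with farthest-first initialization.
    centers = [U[0]]
    for _ in range(K - 1):
        d = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(100):
        assign = np.linalg.norm(U[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([U[assign == k].mean(axis=0) if np.any(assign == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return assign
```

On a graph with exactly K connected components, the rows of U are identical within each component, so K-means recovers the components directly; on a perturbed graph the rows are merely close, which is where the perturbation argument above comes in.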