Clustering, K-means variants clustering techniques and applications Jagdeep Matharu Brock University March 18th 2013 Jagdeep Matharu (Brock University) Clustering - k-means March 18th 2013 1 / 54

Clustering


DESCRIPTION

Data clustering and clustering techniques, with a focus on k-means algorithms.


Clustering, K-means variants clustering techniques and applications

Jagdeep Matharu

Brock University

March 18th 2013


Clustering

1 Grouping together data objects that are similar in some way, according to user-defined criteria.

2 Cluster: a collection of data objects that are similar to each other.

3 A form of unsupervised learning.

4 Data exploration: looking for new patterns or structures in the data.

5 An optimization problem.


Clustering Task

1 Pattern representation.

2 Pattern proximity measure (most important).

How (dis)similar two objects are.

3 Grouping.


Clustering Techniques

1 Hierarchical algorithms: create a hierarchical decomposition of the data set.

Agglomerative: bottom-up approach.
Divisive: top-down approach.

2 Partition algorithms: create a partition and then evaluate it by some criterion.

E.g.: k-means, k-medoids.

Figure 1 : Examples of segmentation based on colour or intensity.


Hierarchical Clustering Algorithms

1 Sequential clustering algorithm.

2 Algorithm:

Assign every data point to a separate cluster.
Keep merging the most similar pair of data points/clusters until we have one cluster.
After each merge, compute the distances between the new cluster and the old clusters.

3 Uses a distance matrix as the clustering criterion.

4 Constructs nested partitions layer by layer into a tree-like structure.

5 The resulting tree can be cut to get the desired number of clusters.
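
The merge loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the slides: it assumes Euclidean distance and single linkage (cluster distance = closest pair of points), and the names `single_link_distance` and `agglomerative` are made up for this example. For brevity it recomputes distances on every pass instead of maintaining a distance matrix.

```python
import math

def single_link_distance(a, b):
    """Smallest Euclidean distance between a point of cluster a and a point of cluster b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters=1):
    """Start with one cluster per point, then merge the closest pair until num_clusters remain."""
    clusters = [[tuple(p)] for p in points]
    merges = []  # history of (cluster_a, cluster_b, distance), i.e. the dendrogram levels
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
# Stopping at two clusters corresponds to cutting the tree at that level.
clusters, merges = agglomerative(points, num_clusters=2)
```

The `merges` list records the distance at which each merge happened, which is the information a dendrogram draws as bar heights.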


Cont’d

1 The result is a binary tree, or dendrogram.

2 The height of the bars shows how close two objects are.

Example (slide figures not reproduced)

Strengths and Weaknesses

1 Pros:

No need to assume the number of clusters in advance.
Easy to implement.

2 Cons:

Time and space complexity of at least O(n^2).

Computing and storing the proximity matrix alone takes O(n^2).

No objective function is directly minimized.
Merging decisions are final and cannot be undone.


Partition Clustering algorithms

1 Overview:

Construct a partition of a data set D of n objects into a set of k clusters.
The value of k is specified by the user.

Different values of k result in different cluster output.

Find the partition into k clusters that optimizes the chosen partition criterion (error function).

E.g.: error sum of squares (SSE).

2 Exhaustive combinatorial search over all partitions can be computationally expensive.


Partition Clustering algorithms

1 k-medoids

Uses a medoid (an actual data point) to represent the cluster.

2 k-means

Uses a centroid to represent the cluster.

3 Variations

Bisecting k-means
ISODATA
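
To make the difference between the two representatives concrete, here is a small sketch. It assumes Euclidean distance and takes the medoid to be the member with the smallest total distance to the other members; the function names are illustrative only.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the cluster; usually not one of the data points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def medoid(cluster):
    """The member of the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

# Three nearby points and one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (10.0, 10.0)]
c = centroid(cluster)  # pulled toward the outlier
m = medoid(cluster)    # stays on an actual data point
```

In this toy data the centroid is dragged toward the outlier while the medoid remains one of the three nearby points.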


k-means algorithms

1 Choose k initial centroids (center points).

2 Each cluster is associated with a centroid.

3 Each data object is assigned to the closest centroid.

4 The centroid of each cluster is then updated based on the data objects assigned to the cluster.

5 Repeat the assignment and update steps until convergence.

Figure 2 : Algorithm
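
The five steps can be written down as a minimal Python sketch. The assumptions here are not from the slides: Euclidean distance, initial centroids drawn at random from the data, convergence defined as "no assignment changed", and an empty cluster simply keeping its previous centroid. The names `closest`, `mean` and `kmeans` are illustrative.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: choose k initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [closest(p, centroids) for p in points]  # step 3: assignment
        if new_labels == labels:           # step 5: stop when nothing changes
            break
        labels = new_labels
        for i in range(k):                 # step 4: update each centroid
            members = [p for p, label in zip(points, labels) if label == i]
            if members:                    # an empty cluster keeps its old centroid
                centroids[i] = mean(members)
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; which group gets label 0 depends on the random initial centroids.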

K-means Example (slide figures not reproduced)

K-means

1 What should the value of k be?

2 How to choose the initial centroids?

3 How to assign points to the closest centroid?

4 How to evaluate the clusters?

5 Other issues.


Choosing value of k

1 k represents the number of clusters required in the partition.

2 It must be specified beforehand.

3 There is no fixed rule for choosing k: it is found by trial and error.

4 Different values of k may give different results.


Choosing initial centroids.

1 Key step of k-means method.

2 Different initial centroids can produce different results.

Example - Optimal Initial Centroid.

Example - Sub-Optimal Initial Centroid.

Choosing initial centroids.

1 Choose the initial centroids randomly.

Can lead to poor clustering.

2 Perform multiple runs, each with randomly chosen initial centroids.

Select the set of clusters with the best result (e.g. the lowest SSE).

3 Take a sample of points and cluster them using a hierarchical clustering technique. k clusters are extracted from the hierarchy, and the centroids of those clusters are used as the initial centroids.
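
The multiple-runs strategy (option 2) might look like the following sketch. It assumes that the "best" run is the one with the lowest sum of squared errors, that distance is Euclidean, and that initial centroids are sampled from the data; `best_of_runs` and its parameters are hypothetical names for this example.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Basic k-means with initial centroids sampled at random from the data."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def best_of_runs(points, k, runs=10, seed=0):
    """Run k-means several times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        centroids, labels = kmeans(points, k, rng)
        error = sse(points, centroids, labels)
        if best is None or error < best[0]:
            best = (error, centroids, labels)
    return best

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
error, centroids, labels = best_of_runs(points, k=3, runs=10)
```

More runs cost more time but can only keep or improve the best SSE found so far.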


Assigning points to centroid.

1 The goal is to find the closest centroid for each data point.

2 Assign each data point to its closest centroid.

3 This requires a proximity measure to calculate distances.

Euclidean distance, Manhattan distance.

4 A point is assigned to the centroid with the smallest distance.
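
As a small illustration of the assignment step, the sketch below defines the two proximity measures named above and assigns each point to the index of its nearest centroid. The `assign` helper and its `distance` parameter are names chosen for this example.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, centroids, distance=euclidean):
    """For each point, return the index of the centroid with the smallest distance."""
    return [min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            for p in points]

centroids = [(0.0, 0.0), (4.0, 4.0)]
points = [(1.0, 1.0), (3.0, 5.0), (0.0, 3.5)]
labels_euclidean = assign(points, centroids)
labels_manhattan = assign(points, centroids, distance=manhattan)
```

Passing the measure as a parameter keeps the assignment logic unchanged when the proximity measure is swapped.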


Cluster Evaluation

1 The most common measure is the sum of squared errors (SSE).

2 The goal is to reduce the error.

3 The error of a data point is its distance to the representative point of the nearest cluster.

4 Mathematically:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

5 Here dist is the distance from a data point to the cluster representative, x is a data point in cluster C_i, and m_i is the representative point of cluster C_i.

6 Given two clusterings, we choose the one with the smaller error.

7 One easy way to reduce SSE is to increase k, so a lower SSE at a larger k does not by itself mean a better clustering.
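
The SSE formula translates directly into code. The sketch below assumes Euclidean distance for dist and uses the cluster means as the representative points m_i; the data is made up to show the effect of going from one cluster to two.

```python
import math

def sse(clusters, representatives):
    """Sum over clusters C_i and points x in C_i of dist(m_i, x) squared."""
    return sum(math.dist(m, x) ** 2
               for cluster, m in zip(clusters, representatives)
               for x in cluster)

# k = 1: all four points in one cluster, represented by their mean (3, 3).
one_cluster = [[(0.0, 0.0), (2.0, 0.0), (5.0, 5.0), (5.0, 7.0)]]
sse_k1 = sse(one_cluster, [(3.0, 3.0)])

# k = 2: the same points split into two clusters with their own means.
two_clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
sse_k2 = sse(two_clusters, [(1.0, 0.0), (5.0, 6.0)])
```

Here the SSE drops from 56 with one cluster to 4 with two.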


k-means

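1 Pros

Easy to implement.
Guaranteed to converge (to a local optimum).

Most of the convergence happens in the first few iterations.

Complexity linear in the number of points, O(n), for a fixed k and a fixed number of iterations.

2 Cons

Need to specify k in advance.
Sensitive to outliers.
May yield empty clusters.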


Bisecting k-means

1 Variation of basic k-means method.

2 Can produce a partitional or hierarchical clustering.

3 To obtain K clusters, first split the set of all points into two clusters.

4 Choose one of the existing clusters to split again.

Can choose the largest cluster.
Can choose the one with the highest SSE.
Can choose based on both.

5 Continue until K clusters have been produced.
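
A possible sketch of the bisecting procedure follows. It makes several assumptions beyond the slides: the cluster to split is always the one with the highest SSE, each split is a single run of basic 2-means with Euclidean distance, and the data points are distinct (duplicate points could make a split produce an empty half). All function names are illustrative.

```python
import math
import random

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def cluster_sse(points):
    """SSE of one cluster measured against its own mean."""
    m = mean(points)
    return sum(math.dist(m, p) ** 2 for p in points)

def two_means(points, rng, max_iter=50):
    """Split points into two clusters with basic k-means (k = 2)."""
    centroids = rng.sample(points, 2)
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in (0, 1):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = mean(members)
    left = [p for p, label in zip(points, labels) if label == 0]
    right = [p for p, label in zip(points, labels) if label == 1]
    return left, right

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]              # start with everything in one cluster
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:                 # nothing left that can be split
            break
        target = max(candidates, key=cluster_sse)  # split the highest-SSE cluster
        clusters.remove(target)
        clusters.extend(two_means(target, rng))
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
clusters = bisecting_kmeans(points, k=3)
```

Recording the sequence of splits instead of only the final list would give the hierarchical view of the same result.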


ISODATA

1 Iterative Self-Organizing Data Analysis Technique A.

2 The number of clusters does not need to be known in advance.

3 Cluster centers are randomly placed and points are assigned to the closest centroid.

4 The standard deviation within each cluster and the distance between cluster centers are calculated.

Clusters are split if the standard deviation is greater than the user-defined threshold.
Clusters are merged if the distance between them is less than the user-defined threshold.
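
The split and merge tests can be illustrated with a short sketch. It covers only the decision step, not a full ISODATA implementation, and it assumes that "standard deviation within a cluster" means the largest per-dimension population standard deviation and that distances are Euclidean. The names `split_threshold` and `merge_threshold` are hypothetical stand-ins for the user-defined values.

```python
import math
import statistics

def center(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def max_std(cluster):
    """Largest per-dimension (population) standard deviation in the cluster."""
    return max(statistics.pstdev(values) for values in zip(*cluster))

def split_merge_decisions(clusters, split_threshold, merge_threshold):
    """Return indices of clusters to split and index pairs of clusters to merge."""
    to_split = [i for i, c in enumerate(clusters) if max_std(c) > split_threshold]
    centers = [center(c) for c in clusters]
    to_merge = [(i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
                if math.dist(centers[i], centers[j]) < merge_threshold]
    return to_split, to_merge

clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],  # compact
    [(2.0, 0.5), (2.5, 0.5)],                          # compact, close to the first
    [(10.0, 0.0), (20.0, 0.0), (30.0, 0.0)],           # widely spread
]
to_split, to_merge = split_merge_decisions(clusters, split_threshold=5.0, merge_threshold=2.0)
```

With these thresholds the widely spread third cluster is flagged for splitting and the first two clusters are flagged for merging.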


Practical Example of k-means

1 Image segmentation using k-means clustering.

Figure 3 : Examples of segmentation based on colour or intensity.


