Clustering Algorithms
Sunida Ratanothayanon
What is Clustering?
Clustering
Clustering is a classification pattern that divides data into groups in a meaningful and useful way. It is an unsupervised classification pattern.
Outline
K-Means Algorithm
Hierarchical Clustering Algorithm
K-Means Algorithm
A partitional clustering algorithm that produces k clusters (the number k is specified by the user). Each cluster has a cluster center called the centroid. The algorithm iteratively groups data into k clusters based on a distance function.
K-Means Algorithm
The centroid is obtained as the mean of all data points in the cluster. Stop when there is no change of centers (see the sketch below).
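As a minimal sketch of the procedure just described (Python and NumPy are my assumptions; the slides give no code, and empty clusters are not handled):

```python
# Minimal K-Means sketch (illustrative; assumes NumPy and non-empty clusters).
import numpy as np

def kmeans(points, centers, max_iter=100):
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        centers_new = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(centers_new, centers):  # stop: no change of centers
            return centers_new, labels
        centers = centers_new
    return centers, labels
```

On the worked example below (five points, k = 2, initial centers (18, 22) and (4, 2)), this sketch converges to centers of about (19.7, 21) and (2.5, 2.5), matching the iterations shown next.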
A numerical example
K-Means example
We have five data points with 2 attributes and want to group the data into 2 clusters (k = 2).

Data Point   x1   x2
1            22   21
2            19   20
3            18   22
4             1    3
5             4    2
K-Means example
We can plot a graph of the five data points as follows.
[Figure: plot of the 5 data points over x1 and x2, showing cluster C1 (points 1-3) and cluster C2 (points 4-5)]
K-Means example (1st iteration)
Step 1: Choosing the centers and defining k. With k = 2, we pick two of the data points as the initial centers:
C1 = (18, 22), C2 = (4, 2)

Step 2: Computing cluster centers. We have already defined C1 and C2 in this first iteration.

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster. The distance is
$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
K-Means example (1st iteration)
Step 3 (cont.): Distance table for all data points:

Data Point   C1 (18, 22)   C2 (4, 2)
(22, 21)     4.12          26.17
(19, 20)     2.24          23.43
(18, 22)     0             24.41
(1, 3)       25.50         3.16
(4, 2)       24.41         0
Then we assign each data point to a cluster by comparing its distances to the two centers; each data point goes to its closest cluster. Here points 1-3 are assigned to C1 and points 4-5 to C2.
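A minimal sketch of this distance-and-assignment step (Python/NumPy are my assumptions; the data and centers come from the example):

```python
import numpy as np

points = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
centers = np.array([[18, 22], [4, 2]], dtype=float)  # C1, C2

# Euclidean distance from every point to every center.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)   # index of the closest center
print(np.round(dists, 2))       # matches the table above, up to rounding
print(labels)                   # [0 0 0 1 1]: points 1-3 -> C1, points 4-5 -> C2
```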
K-Means example (2nd iteration)
Step 2: Computing cluster centers. We now compute new cluster centers. The members of cluster 1 are (22, 21), (19, 20), and (18, 22); we take the average of these data points:
22 + 19 + 18 = 59 and 21 + 20 + 22 = 63; 59/3 = 19.7 and 63/3 = 21.
The new C1 is (19.7, 21).

The members of cluster 2 are (1, 3) and (4, 2): 1 + 4 = 5 and 3 + 2 = 5; 5/2 = 2.5 and 5/2 = 2.5.
The new C2 is (2.5, 2.5).
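The same centroid update as a sketch (NumPy is an assumption; the labels are the assignments from the 1st iteration):

```python
import numpy as np

points = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
labels = np.array([0, 0, 0, 1, 1])     # cluster memberships from iteration 1

# New center = mean of the points assigned to each cluster.
c1 = points[labels == 0].mean(axis=0)  # [19.67, 21.0] -> (19.7, 21) after rounding
c2 = points[labels == 1].mean(axis=0)  # [2.5, 2.5]
```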
K-Means example (2nd iteration)
Step 3: Finding the Euclidean distance of each data point from each new center and assigning each data point to a cluster. Distance table for all data points with the new centers:

Data Point   C1' (19.7, 21)   C2' (2.5, 2.5)
(22, 21)     2.3              26.88
(19, 20)     1.22             24.05
(18, 22)     1.97             24.91
(1, 3)       25.96            1.58
(4, 2)       24.65            1.58

Assign each data point to its closest cluster by comparing its distances to the centers. Repeat steps 2 and 3 for the next iteration, because the centers changed in this iteration.
K-Means example (3rd iteration)
Step 2: Computing cluster centers. We compute new cluster centers again. The members of cluster 1 are still (22, 21), (19, 20), and (18, 22), so 59/3 = 19.7 and 63/3 = 21, and C1 remains (19.7, 21). The members of cluster 2 are still (1, 3) and (4, 2), so 5/2 = 2.5 and 5/2 = 2.5, and C2 remains (2.5, 2.5).
K-Means example (3rd iteration)
Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster. Distance table for all data points with the recomputed centers:

Data Point   C1'' (19.7, 21)   C2'' (2.5, 2.5)
(22, 21)     2.3               26.88
(19, 20)     1.22              24.05
(18, 22)     1.97              24.91
(1, 3)       25.96             1.58
(4, 2)       24.65             1.58

Assign each data point to its closest cluster by comparing its distances to the centers. Stop the algorithm, because the centers remain the same.
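The stopping test amounts to comparing consecutive centers; a one-line sketch (NumPy assumed):

```python
import numpy as np

old_centers = np.array([[19.7, 21.0], [2.5, 2.5]])  # from the 2nd iteration
new_centers = np.array([[19.7, 21.0], [2.5, 2.5]])  # recomputed in the 3rd iteration
converged = np.allclose(new_centers, old_centers)   # True -> stop K-Means
```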
Hierarchical Clustering Algorithm
Produces a nested sequence of clusters, like a tree, and allows subclusters. The individual data points at the bottom of the tree are called "singleton clusters".
[Figure: example dendrogram over the data points C, E, A, B, D]
Hierarchical Clustering Algorithm
Agglomerative method: the tree is built up from the bottom level, merging the nearest pair of clusters at each level to go one level up. Continue until all the data points are merged into a single cluster (see the sketch below).
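A minimal sketch of this agglomerative loop (plain Python; `math.dist` needs Python 3.8+, and the pair-averaging merge rule follows the worked example below):

```python
import math

def agglomerative(points, names):
    """Repeatedly merge the two closest clusters until one remains.
    The distance from a merged cluster to any other cluster is the
    average of the two old distances (the rule used in the example)."""
    clusters = [[n] for n in names]
    dist = [[math.dist(p, q) for q in points] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist[ab[0]][ab[1]])
        merges.append((clusters[i], clusters[j], dist[i][j]))
        keep = [k for k in range(len(clusters)) if k not in (i, j)]
        # Distances from the merged cluster: average of the two old rows.
        new_row = [(dist[i][k] + dist[j][k]) / 2 for k in keep]
        dist = [[dist[a][b] for b in keep] + [new_row[r]]
                for r, a in enumerate(keep)] + [new_row + [0.0]]
        clusters = [clusters[k] for k in keep] + [clusters[i] + clusters[j]]
    return merges

pts = [(9, 3, 7), (10, 2, 9), (1, 9, 4), (6, 5, 5), (1, 10, 3)]
for left, right, d in agglomerative(pts, ["A", "B", "C", "D", "E"]):
    print(left, "+", right, "at", round(d, 2))
# Merge order: C+E (1.41), A+B (2.45), then D with A&B, then everything.
```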
A numerical example
Hierarchical Clustering example
We have five data points with 3 attributes.

Data Point   x1   x2   x3
A             9    3    7
B            10    2    9
C             1    9    4
D             6    5    5
E             1   10    3
Hierarchical Clustering example (1st iteration)
Step 1: Calculating the Euclidean distance between each pair of data points. We obtain the following distance table:

Data Point    A (9,3,7)   B (10,2,9)   C (1,9,4)   D (6,5,5)   E (1,10,3)
A (9,3,7)     0           2.45         10.44       4.12        11.36
B (10,2,9)    -           0            12.45       6.4         13.45
C (1,9,4)     -           -            0           6.48        1.41
D (6,5,5)     -           -            -           0           7.35
E (1,10,3)    -           -            -           -           0
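This table can be reproduced in a few lines (a plain Python sketch; `math.dist` needs Python 3.8+):

```python
import math

pts = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4),
       "D": (6, 5, 5), "E": (1, 10, 3)}

# Upper triangle of the Euclidean distance table above.
names = list(pts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(math.dist(pts[a], pts[b]), 2))
# e.g. A B 2.45 and C E 1.41, matching the table up to rounding.
```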
Hierarchical Clustering example (1st iteration)
Step 2: Forming a tree. Consider the most similar pair of data points in the distance table above. C and E are the most similar (distance 1.41), so we obtain the first cluster, C&E.

Repeat steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (2nd iteration)
Step 1: Calculating the distances between the remaining entities. We redraw the distance table, now including the merged entity C&E:
Data Point A B D C&E
(9, 3, 7) (10, 2, 9) (6, 5, 5)
A ( 9, 3, 7) 0 2.45 4.12 10.9
B (10, 2, 9) - 0 6.4 12.95
D (6, 5, 5) - - 0 6.90
C&E (1, 9.5, 3.5) - - - 0
A distance for C&E to A can be obtained from
We can use a previous table to get the distance from C to A and E to A.
( , ), ,( , ),d avg d dC A E AC E A
avg (10.44, 11.36) = 10.9
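In code, this update is just an average of two entries from the previous table (a sketch; the values are copied from the tables above):

```python
# Distance from the merged cluster C&E to each remaining entity:
d_CE_A = (10.44 + 11.36) / 2   # 10.90
d_CE_B = (12.45 + 13.45) / 2   # 12.95
d_CE_D = (6.48 + 7.35) / 2     # 6.915, shown as 6.90 in the table
```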
Hierarchical Clustering example (2nd iteration)
Step 2: Forming a tree. Consider the most similar pair of entities in the distance table above. A and B are the most similar (distance 2.45), so we obtain the second cluster, A&B.
[Figure: tree so far, with C&E and A&B each joined]

Repeat steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (3rd iteration)
From the previous table, we can obtain the following distances for the new distance table.
Step 1: Calculating the distances between the remaining entities. We redraw the distance table, now including the merged entities C&E and A&B:
Data Point A&B D C&E
(6, 5, 5)
A&B 0 5.26 11.93
D (6, 5, 5) - 0 6.9
C&E - - 0
$d_{(A,B),D} = \operatorname{avg}(d_{A,D}, d_{B,D}) = \operatorname{avg}(4.12, 6.40) = 5.26$
$d_{(C,E),(A,B)} = \operatorname{avg}(d_{(C,E),A}, d_{(C,E),B}) = \operatorname{avg}(10.9, 12.95) = 11.93$
$d_{(C,E),D} = 6.90$
Hierarchical Clustering example (3rd iteration)
Step 2: Forming a tree. Consider the most similar pair of entities in the distance table above. A&B and D are the most similar (distance 5.26), so we obtain the new cluster A&B&D.

Repeat steps 1 and 2 until all data points are merged into a single cluster.
[Figure: tree so far, with D joined onto the A&B cluster alongside C&E]
From the previous table, we can obtain the distance from cluster A&B&D to C&E as follows.
Hierarchical Clustering example (4th iteration)
Step 1: Calculating the distances between the remaining entities. We redraw the distance table, now including the merged entities C&E and A&B&D:
Data Point A&B&D C&E
A&B&D 0 9.4
C&E - 0
$d_{(A,B,D),(C,E)} = \operatorname{avg}(d_{(A,B),(C,E)}, d_{D,(C,E)}) = \operatorname{avg}(11.93, 6.9) = 9.4$
Hierarchical Clustering example (4th iteration)
Step 2: Forming a tree. We can form the final tree because no more recalculation has to be made: we merge all data points into a single cluster, A&B&D&C&E, and stop the algorithm.
[Figure: final dendrogram, with the C&E cluster joined to A&B&D at the top]
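If SciPy is available, the whole dendrogram can be checked in a few lines. The pair-averaging rule used in these slides corresponds, as far as I can tell, to SciPy's `weighted` (WPGMA) linkage; treat that mapping as an assumption rather than something stated in the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4],
              [6, 5, 5], [1, 10, 3]])   # rows A, B, C, D, E
Z = linkage(X, method="weighted")       # WPGMA: average of the two merged clusters' distances
print(np.round(Z, 2))                   # merge heights near 1.41, 2.45, 5.26, 9.42
```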
Conclusion
Two major clustering algorithms:
K-Means algorithm: iteratively groups data into k clusters based on a distance function; the number of clusters k is specified by the user.
Hierarchical Clustering algorithm: produces a nested sequence of clusters, like a tree. The tree is built up from the bottom level until all the data points are merged into a single cluster.
Thank you