
COMPARISON OF CLUSTERING ALGORITHMS:
PARTITIONAL AND HIERARCHICAL

Principal Investigator: Dr. Sanjay Ranka, Professor
Department of Computer Science, University of Florida

Teaching Assistant: Manas Somaiya

Authors: Joyesh Mishra, Gnana Sundar Rajendiran, Vasanth Prabhu Sundararaj
Department of Computer Science, University of Florida, Gainesville
www.cise.ufl.edu

Final Report, December 2007


    TABLE OF CONTENTS

I. ABSTRACT
II. DETAILED REPORT
    1. K-Means Partitional Clustering
        1.1 Characteristics of K-means
        1.2 Algorithm
        1.3 Observations
    2. Agglomerative Hierarchical Clustering
        2.1 Definition
        2.2 Algorithms Implemented in this Project
        2.3 Datasets and Experiments
    3. DBSCAN (Using KD Trees)
        3.1 DBSCAN Algorithm
        3.2 DBSCAN Performance Enhancements Using KD Trees
        3.3 Observations regarding DBSCAN Issues
    4. CURE Hierarchical Clustering (Using KD Trees)
        4.1 CURE Hierarchical Clustering Algorithm
        4.2 CURE Overview
        4.3 CURE - Data Structures Used
        4.4 Benefits of CURE against Other Algorithms
        4.5 Observations towards Sensitivity to Parameters
III. CONCLUSION
IV. REFERENCES


    LIST OF FIGURES

FIGURE 1   K means initial K clusters
FIGURE 2   K means clusters getting rearranged by computing new centroids
FIGURE 3   K means converged clusters
FIGURE 4   Union by rank
FIGURE 5   SPAETH dataset
FIGURE 6   Agglomerative clusters after 28000 iterations
FIGURE 7   Agglomerative clusters after 64000 iterations
FIGURE 8   Agglomerative clusters after 65388 iterations
FIGURE 9   Agglomerative non-globular clusters
FIGURE 10  CURE non-globular clusters
FIGURE 11  Complete link
FIGURE 12  Complete link clusters after 2000 iterations
FIGURE 13  Complete link clusters after 2012 iterations
FIGURE 14  CURE clusters
FIGURE 15  DBSCAN performance measurements
FIGURE 16  Partitioning results


    I. ABSTRACT

Clustering is one of the important streams in data mining, useful for discovering groups and identifying interesting distributions in the underlying data.

This project aims at analyzing and comparing partitional and hierarchical clustering algorithms, namely DBSCAN and k-means (partitional) with Agglomerative and CURE (hierarchical). The comparison is based on the extent to which each algorithm identifies the clusters, its pros and cons, and the time each algorithm takes to identify the clusters present in the dataset. For each clustering algorithm, computation time was measured as the size of the data set increased. This was used to test the scalability of the algorithm and whether it could be decomposed and executed concurrently on several machines.

k-means is a partitional clustering technique that helps to identify k clusters from a given set of n data points in d-dimensional space. It starts with k random centers and a single cluster, and refines the assignment at each step, arriving at k clusters. Currently, the time complexity of our k-means implementation is O(I * k * d * n), where I is the number of iterations. If we used the KD-Tree data structure in the implementation, the complexity could be further reduced to O(I * k * d * log n).

DBSCAN discovers clusters of arbitrary shape, relying on a density-based notion of clusters. Given Eps as the input parameter, and unlike k-means clustering, it tries to find all possible clusters by classifying each point as core, border or noise. DBSCAN can be expensive, as the computation of nearest neighbors requires computing all pairwise proximities. Our additional implementation uses KD-Trees to store the data, which allows efficient retrieval and brings the time complexity down from O(m^2) to O(m log m).

Agglomerative Hierarchical Clustering is one of the non-parametric approaches to clustering; it is based on measures of dissimilarity among the current set of clusters in each iteration. In general, we start with the points as individual clusters and at each step merge the closest pair of clusters by defining a notion of cluster proximity. We implement two algorithms, namely Single-Linkage Clustering and Complete-Linkage Clustering. We analyze the advantages and drawbacks of Agglomerative Hierarchical Clustering by comparing it with the other algorithms: CURE, DBSCAN and K-Means.

The CURE clustering algorithm helps attain scalability for clustering in large databases without sacrificing the quality of the generated clusters. The algorithm uses KD-Trees and Min Heaps for efficient data analysis and repetitive clustering. Random sampling, partitioning of clusters and two-pass merging help in scaling the algorithm to large datasets. Our implementation provides a comparative study of CURE against other partitional and hierarchical algorithms.



    II. DETAILED REPORT

    1. K-Means Partitional clustering

Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. There are no efficient exact solutions known to any of these problems, and some formulations are NP-hard; the large constant factors of known approximation algorithms make them poor candidates for practical implementation.

One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution. This algorithm is often called the k-means algorithm.

    1.1 Characteristics of K - means

a. It is a prototype-based clustering method. It can only be applied to clusters that have the notion of a centre.

b. The algorithm has a time complexity of O(I * K * m * n), where I is the number of iterations, K is the number of clusters, m is the number of dimensions and n is the number of points.

c. Using KD Trees, the overall time complexity reduces to O(n log n). A KD Tree is a data structure that helps group the points most likely to belong to the same cluster at each decision point when isolating clusters.

    1.2 Algorithm

a. Select K initial centroids
b. Repeat
      For each point, find its closest centroid and assign the point to that centroid. This results in the formation of K clusters.
      Recompute the centroid of each cluster.
   Until the centroids do not change

In the first step, points are assigned to the initial centroids, which are all in the larger group of points. After points are assigned to a centroid, the centroid is then updated. In the second step, points are assigned to the updated centroids, and the centroids are updated again.

When the k-means algorithm terminates, the centroids have identified the natural groupings of points. For some combinations of proximity functions and types of centroids, k-means always converges to a solution, i.e., k-means reaches a state in which no points are shifting from one cluster to another and hence the centroids do not change.
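The report does not include source code for this step (the project's own k-means timings in section 1.3 were obtained with LabVIEW); the following is a minimal Java sketch of the iterative scheme described above, for 2-D points with Euclidean distance. The class and method names are illustrative, not taken from the project code.

    import java.util.Random;

    public class KMeansSketch {

        // Each row of points is a 2-D point {x, y}; returns the cluster index of each point.
        static int[] kMeans(double[][] points, int k, int maxIterations) {
            Random rnd = new Random(42);
            double[][] centroids = new double[k][2];
            for (int i = 0; i < k; i++) {                       // a. pick K initial centroids (random data points)
                centroids[i] = points[rnd.nextInt(points.length)].clone();
            }
            int[] assignment = new int[points.length];
            for (int iter = 0; iter < maxIterations; iter++) {
                boolean changed = false;
                for (int p = 0; p < points.length; p++) {       // assign each point to its closest centroid
                    int best = closestCentroid(points[p], centroids);
                    if (best != assignment[p]) { assignment[p] = best; changed = true; }
                }
                double[][] sums = new double[k][2];
                int[] counts = new int[k];
                for (int p = 0; p < points.length; p++) {       // recompute each centroid as the mean of its cluster
                    sums[assignment[p]][0] += points[p][0];
                    sums[assignment[p]][1] += points[p][1];
                    counts[assignment[p]]++;
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] > 0) {
                        centroids[c][0] = sums[c][0] / counts[c];
                        centroids[c][1] = sums[c][1] / counts[c];
                    }
                }
                if (!changed) break;                            // until the assignments (and hence centroids) stop changing
            }
            return assignment;
        }

        static int closestCentroid(double[] p, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
                double d = dx * dx + dy * dy;                   // squared Euclidean distance suffices for comparisons
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }
    }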


    1.3 Observations

The dataset used for running the k-means algorithm is a 2-D array of (x, y) points obtained from SPAETH (http://people.scs.fsu.edu/~burkardt/datasets/spaeth/spaeth.html). The figures given below show how the k-means algorithm converges for this set of data points.

    Figure 1 K means Initial K clusters


    Figure 2 K means Clusters getting rearranged by computing new centroids

    Figure 3 K means Converged clusters

With the LabVIEW 8.2.1 compiler, and with 3360 points, the k-means algorithm took 355 ms to converge. The hardware used was an Intel Core 2 1.73 GHz machine with 1 GB RAM.


The pros of the k-means algorithm are:

a. It is very simple to implement
b. The algorithm is very fast for low-dimensional data
c. It can find pure sub-clusters if a large number of clusters is specified

The cons of the k-means algorithm are:

a. K-Means cannot handle non-globular data of different sizes and densities
b. K-Means will not identify outliers
c. K-Means is restricted to data which has the notion of a centre (centroid)

    2. Agglomerative Hierarchical Clustering

    2.1 Definition

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring the data at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved. In this project, we will be dealing with Agglomerative Hierarchical Clustering.

    Advantages of hierarchical clustering:

Embedded flexibility regarding the level of granularity

Ease of handling any form of similarity or distance

Applicability to any attribute types

    Disadvantages of hierarchical clustering:

Vagueness of termination criteria

Most hierarchical algorithms do not revisit already constructed (intermediate) clusters with the purpose of improving them

    2.2 Algorithms implemented in this Project

In this project, we have implemented two linkage metric algorithms, the Single-Link (MIN) and Complete-Link (MAX) algorithms. The time complexity is O(n^2 log n).

    Single Link Algorithm

In this algorithm, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters. In this project, a new method is used to implement the single link algorithm: a minimum spanning tree is built using Kruskal's algorithm, with union-by-rank and path compression used for optimization.

Minimum Spanning Tree - Given a connected, undirected graph, a spanning tree of that graph is a sub-graph which is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components.

Kruskal's algorithm - Kruskal's algorithm is an algorithm in graph theory that finds a minimum spanning tree for a connected weighted graph. This means it finds a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. If the graph is not connected, then it finds a minimum spanning forest (a minimum spanning tree for each connected component). Kruskal's algorithm is an example of a greedy algorithm.

    It works as follows:

    create a forest F (a set of trees), where each vertex in the graph is a separate tree

    create a set S containing all the edges in the graph

    while S is nonempty

    o remove an edge with minimum weight from S

    o if that edge connects two different trees, then add it to the forest, combining two

    trees into a single tree

    o otherwise discard that edge

At the termination of the algorithm, the forest has only one component and forms a minimum spanning tree of the graph.

Union by Rank - Here the parent of the shallower tree is made to point to the root of the other tree. We maintain rank(x) as an upper bound on the depth of the tree rooted at x. Consider the following example:


    Figure 4 Union By Rank

Suppose rank(x) = 3 and rank(y) = 2; then Union(x, y) results in a tree whose rank is the greater of the two ranks. If the two trees are of the same rank, then the rank of the resultant tree increases by one.

    Path Compression

1st walk: Find the name of the set. Take a walk until we reach the root.

2nd walk: Retrace the path and join all the elements along the path to the root using another pointer.

    This enables future finds to take shorter paths.
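As an illustration of these two optimizations, here is a minimal Java sketch of a disjoint-set (union-find) structure with union by rank and path compression; it shows the standard technique described above and is not the project's actual code.

    public class DisjointSet {
        private final int[] parent;
        private final int[] rank;   // rank is an upper bound on the depth of the tree rooted at each element

        public DisjointSet(int n) {
            parent = new int[n];
            rank = new int[n];
            for (int i = 0; i < n; i++) parent[i] = i;   // every element starts as its own singleton tree
        }

        // 1st walk finds the root; the recursion then retraces the path and points every node directly at the root.
        public int find(int x) {
            if (parent[x] != x) parent[x] = find(parent[x]);   // path compression
            return parent[x];
        }

        // Attach the shallower tree under the root of the deeper one (union by rank).
        public boolean union(int x, int y) {
            int rx = find(x), ry = find(y);
            if (rx == ry) return false;                  // already in the same tree (cluster)
            if (rank[rx] < rank[ry]) { int t = rx; rx = ry; ry = t; }
            parent[ry] = rx;
            if (rank[rx] == rank[ry]) rank[rx]++;        // equal ranks: the resulting rank grows by one
            return true;
        }
    }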

In the implementation of the Single Link Algorithm, each point is initially considered as a singleton cluster. When the Euclidean distance between two clusters (trees) is the minimum compared with all other cluster pairs, the two clusters are merged into a single cluster (tree) and the root node is updated.
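A compact sketch of the single-link procedure just described, reusing the DisjointSet class above, could look as follows. Building the full O(n^2) edge list and stopping when k clusters remain are simplifying assumptions made for illustration only.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SingleLinkSketch {

        static class Edge implements Comparable<Edge> {
            final int a, b;
            final double dist;
            Edge(int a, int b, double dist) { this.a = a; this.b = b; this.dist = dist; }
            public int compareTo(Edge o) { return Double.compare(dist, o.dist); }
        }

        // points[i] = {x, y}; merges closest pairs first (Kruskal style) until k clusters remain.
        static DisjointSet singleLink(double[][] points, int k) {
            int n = points.length;
            List<Edge> edges = new ArrayList<Edge>();
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++) {
                    double dx = points[i][0] - points[j][0], dy = points[i][1] - points[j][1];
                    edges.add(new Edge(i, j, Math.sqrt(dx * dx + dy * dy)));
                }
            Collections.sort(edges);                 // shortest links first
            DisjointSet ds = new DisjointSet(n);
            int clusters = n;
            for (Edge e : edges) {
                if (clusters == k) break;
                if (ds.union(e.a, e.b)) clusters--;  // merging two trees = merging two clusters
            }
            return ds;                               // ds.find(i) gives the cluster label of point i
        }
    }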

    Complete Link Algorithm

In this algorithm, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e. form a clique.

Single Link is susceptible to noise/outliers. Complete Link may not work well with non-globular clusters.
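The difference between the two linkage metrics can be summarized in code: single link takes the minimum pairwise distance between two clusters, complete link the maximum. A small illustrative sketch (not the project's code):

    public class LinkageSketch {

        static double euclidean(double[] p, double[] q) {
            double dx = p[0] - q[0], dy = p[1] - q[1];
            return Math.sqrt(dx * dx + dy * dy);
        }

        // Single link (MIN): proximity = smallest distance between any two points of the two clusters.
        static double singleLink(double[][] c1, double[][] c2) {
            double best = Double.MAX_VALUE;
            for (double[] p : c1)
                for (double[] q : c2) best = Math.min(best, euclidean(p, q));
            return best;
        }

        // Complete link (MAX): proximity = largest distance between any two points of the two clusters.
        static double completeLink(double[][] c1, double[][] c2) {
            double worst = 0.0;
            for (double[] p : c1)
                for (double[] q : c2) worst = Math.max(worst, euclidean(p, q));
            return worst;
        }
    }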

    2.3 Datasets and Experiments

    Single Link Algorithm Testing

Dataset: SPAETH2 dataset (2-D voice modulation data) from the Florida State University's website (around 900 data points)


    Figure 5 SPAETH dataset

    Output Cluster Plot

    Globular Clusters

    After 28000 iterations (3 clusters remain)

    Figure 6 Agglomerative Clusters After 28000 iterations


    After 64000 iterations (2 Clusters remain)

    Figure 7 Agglomerative Clusters After 64000 iterations

    Final Cluster (After 65388 iterations)

    Figure 8 Agglomerative Clusters After 65388 iterations


    Non-Globular Clusters (Run on CheckBoard data)

    Single Link

    Figure 9 Agglomerative Non globular clusters

    CURE

    Figure 10 CURE Non globular clusters


    Complete Link

It was executed on a part of the Census data obtained from the UCI Repository.

    Figure 11 Complete Link

    Output Cluster Plot (Compared with CURE algorithm)

    After 2000 iterations (13 clusters remain)

    Figure 12 Complete Link clusters After 2000 iterations


    Final Cluster (after 2012 iterations)

    Figure 13 Complete Link clusters After 2012 iterations

    CURE

    Figure 14 CURE clusters


    3. DBSCAN (Using KD Trees)

The main reason why natural clusters are recognizable is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters. With this understanding, we can describe core, border and noise points in a given data set next.

Core points: A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain threshold, MinPts, which is also a user-specified parameter.

Border points: A border point is not a core point, but falls within the neighborhood of a core point.

    Noise points: A noise point is any point that is neither a core point nor a border point.

    3.1 DBSCAN Algorithm

1. Label all points as core, border or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points
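The report does not list code for these steps; the following is a minimal Java sketch of them, using a brute-force epsilon-neighborhood query for clarity (the project used a KD Tree instead, as described in the next section). The parameter names eps and minPts and the helper regionQuery are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    public class DbscanSketch {
        static final int NOISE = -1, UNVISITED = 0;

        // points[i] = {x, y}; returns cluster labels (>= 1) or NOISE for noise points.
        static int[] dbscan(double[][] points, double eps, int minPts) {
            int n = points.length;
            int[] label = new int[n];          // 0 = unvisited
            int cluster = 0;
            for (int i = 0; i < n; i++) {
                if (label[i] != UNVISITED) continue;
                List<Integer> neighbors = regionQuery(points, i, eps);
                if (neighbors.size() < minPts) { label[i] = NOISE; continue; }    // step 1: not a core point
                cluster++;                                                         // step 4: new cluster of connected cores
                label[i] = cluster;
                List<Integer> frontier = new ArrayList<Integer>(neighbors);
                for (int k = 0; k < frontier.size(); k++) {
                    int j = frontier.get(k);
                    if (label[j] == NOISE) label[j] = cluster;                     // step 5: border point joins a core's cluster
                    if (label[j] != UNVISITED) continue;
                    label[j] = cluster;
                    List<Integer> jn = regionQuery(points, j, eps);
                    if (jn.size() >= minPts) frontier.addAll(jn);                  // step 3: expand through core points within Eps
                }
            }
            return label;
        }

        // Brute-force epsilon-neighborhood; O(n) per query, which is exactly what the KD Tree improves.
        static List<Integer> regionQuery(double[][] pts, int i, double eps) {
            List<Integer> result = new ArrayList<Integer>();
            for (int j = 0; j < pts.length; j++) {
                double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
                if (j != i && Math.sqrt(dx * dx + dy * dy) <= eps) result.add(j);
            }
            return result;
        }
    }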

    3.2 DBSCAN Performance Enhancements Using KD Trees

We used KD Trees to improve the efficiency of DBSCAN clustering. The worst case time complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low-dimensional data this time complexity can be reduced to O(m log m) using KD Trees.

The initialization of the KD Tree is a one-time cost which the algorithm incurs while reading the data points from file. Once the KD Tree has been initialized, it can be used throughout the algorithm to classify core points, border points and noise points based on the number of nearest neighbors found, as well as to find the nearest core point for a border point. The KD Tree helps decrease the search time for the nearest neighbors of a point from O(n) to O(log n), where n is the size of the data set.
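As an illustration of the kind of structure involved (not the project's actual implementation), here is a compact 2-D KD Tree with an epsilon-range query of the form DBSCAN's neighborhood searches need:

    import java.util.ArrayList;
    import java.util.List;

    public class KdTree2D {
        private static class Node {
            double[] point;      // {x, y}
            Node left, right;
            Node(double[] p) { point = p; }
        }

        private Node root;

        // Insert a point, splitting on x at even depths and on y at odd depths.
        public void insert(double[] p) { root = insert(root, p, 0); }

        private Node insert(Node node, double[] p, int depth) {
            if (node == null) return new Node(p);
            int axis = depth % 2;
            if (p[axis] < node.point[axis]) node.left = insert(node.left, p, depth + 1);
            else node.right = insert(node.right, p, depth + 1);
            return node;
        }

        // Return all stored points within eps of the query point, pruning subtrees
        // whose half-space cannot intersect the query ball.
        public List<double[]> rangeQuery(double[] q, double eps) {
            List<double[]> result = new ArrayList<double[]>();
            rangeQuery(root, q, eps, 0, result);
            return result;
        }

        private void rangeQuery(Node node, double[] q, double eps, int depth, List<double[]> out) {
            if (node == null) return;
            double dx = q[0] - node.point[0], dy = q[1] - node.point[1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(node.point);
            int axis = depth % 2;
            double diff = q[axis] - node.point[axis];
            if (diff - eps <= 0) rangeQuery(node.left, q, eps, depth + 1, out);   // query ball overlaps the left half-space
            if (diff + eps >= 0) rangeQuery(node.right, q, eps, depth + 1, out);  // query ball overlaps the right half-space
        }
    }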

We saw performance improvements by using KD Trees. The algorithm was run on an Intel Pentium IV 1.8 GHz (dual core) system with 1 GB RAM. The program was compiled using the Java 1.6 compiler.


    No. of Points Clustering Time (sec)

    1572 3.5

    3568 10.9

    7502 39.5

    10256 78.4

    Figure 15 DBSCAN performance measurements

    3.3 Observations regarding DBSCAN Issues

The following are our observations:
1. The DBSCAN algorithm performs efficiently for low-dimensional data.
2. The algorithm is robust towards outliers and noise points.
3. Using a KD Tree improves the efficiency over the traditional DBSCAN algorithm.
4. DBSCAN is highly sensitive to the user parameters MinPts and Eps. A slight change in their values may produce different clustering results, and prior knowledge about these values cannot be inferred easily.
5. The dataset cannot be sampled, as sampling would affect the density measures.
6. The algorithm is not partitionable for multi-processor systems.
7. DBSCAN fails to identify clusters if density varies and if the dataset is too sparse.

    4. CURE Hierarchical Clustering (Using KD Trees)

Partitional clustering algorithms attempt to determine k partitions that optimize a certain criterion function. The square error criterion, defined below, is the most commonly used (mi is the mean of cluster Ci):

    E = Σ (i = 1..k) Σ (p ∈ Ci) ||p − mi||²

The square error is a good measure of the within-cluster variation across all the partitions. This objective tries to make the k clusters as compact and separated as possible. However, when there are large differences in the sizes or geometries of clusters, the square error method could split large clusters to minimize the square error.

Next we considered DBSCAN, which has been explained above. Apart from problems with clusters of varying density, DBSCAN could not be made concurrent. In this age, when computation power is getting cheaper day by day and dedicated grids have been set up for data-intensive computations, an algorithm is required which can be parallelized and takes advantage of all the resources available. DBSCAN, as we saw, had scaling problems as the data sets increased.

In comparison to Agglomerative Clustering, CURE certainly performs well. Agglomerative Clustering provides the option of choosing Single Link vs Complete Link to cluster points; while the former identifies only globular clusters efficiently, the latter is computationally intensive. Hence, for data sets of more than 5000 points, agglomerative clustering was highly inefficient, though quality clustering could be achieved by using one of the above options. Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters, and once they are tuned for a given data set pertaining to a domain, the algorithm can scale well by adding more resources and partitioning the data.

    4.1 CURE Hierarchical Clustering Algorithm

The CURE clustering algorithm is a hierarchical algorithm which merges two clusters at every step, and the clustering process is carried out in two passes. The overall hierarchical algorithm is as follows:
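The pseudocode figure from the original report is not reproduced here. Below is a minimal Java sketch of the basic merge loop this section describes: repeatedly merge the two closest clusters, recompute up to c well-scattered representative points for the merged cluster, and shrink them according to the shrink factor a (interpreted here, as in the observations of section 4.5, as the fraction of each representative's offset from the centroid that is retained). For brevity, the sketch scans all cluster pairs instead of using the Min Heap and KD Tree of section 4.3, and the sampling, partitioning and second merge pass are omitted; all class and method names are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    public class CureSketch {

        static class Cluster {
            List<double[]> points = new ArrayList<double[]>();
            List<double[]> reps = new ArrayList<double[]>();   // shrunk representative points
        }

        // Merge clusters until k remain; c = representatives per cluster, a = shrink factor.
        static List<Cluster> cure(double[][] data, int k, int c, double a) {
            List<Cluster> clusters = new ArrayList<Cluster>();
            for (double[] p : data) {                       // start with singleton clusters
                Cluster cl = new Cluster();
                cl.points.add(p);
                cl.reps.add(p.clone());
                clusters.add(cl);
            }
            while (clusters.size() > k) {
                int bi = 0, bj = 1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++)   // closest pair by distance between representative points
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = repDistance(clusters.get(i), clusters.get(j));
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                Cluster merged = merge(clusters.get(bi), clusters.get(bj), c, a);
                clusters.remove(bj);                        // remove the larger index first (bj > bi)
                clusters.remove(bi);
                clusters.add(merged);
            }
            return clusters;
        }

        static double repDistance(Cluster x, Cluster y) {
            double best = Double.MAX_VALUE;
            for (double[] p : x.reps)
                for (double[] q : y.reps) best = Math.min(best, dist(p, q));
            return best;
        }

        static Cluster merge(Cluster x, Cluster y, int c, double a) {
            Cluster m = new Cluster();
            m.points.addAll(x.points);
            m.points.addAll(y.points);
            double[] centroid = mean(m.points);
            // Pick up to c well-scattered points: greedily choose the point farthest from those chosen so far.
            List<double[]> scattered = new ArrayList<double[]>();
            for (int i = 0; i < Math.min(c, m.points.size()); i++) {
                double[] bestP = null; double bestD = -1;
                for (double[] p : m.points) {
                    double d = scattered.isEmpty() ? dist(p, centroid) : minDist(p, scattered);
                    if (d > bestD) { bestD = d; bestP = p; }
                }
                scattered.add(bestP);
            }
            for (double[] p : scattered) {                  // each representative keeps a fraction 'a' of its offset from the centroid
                m.reps.add(new double[] { centroid[0] + a * (p[0] - centroid[0]),
                                          centroid[1] + a * (p[1] - centroid[1]) });
            }
            return m;
        }

        static double minDist(double[] p, List<double[]> set) {
            double best = Double.MAX_VALUE;
            for (double[] q : set) best = Math.min(best, dist(p, q));
            return best;
        }

        static double[] mean(List<double[]> pts) {
            double[] s = new double[2];
            for (double[] p : pts) { s[0] += p[0]; s[1] += p[1]; }
            s[0] /= pts.size(); s[1] /= pts.size();
            return s;
        }

        static double dist(double[] p, double[] q) {
            double dx = p[0] - q[0], dy = p[1] - q[1];
            return Math.sqrt(dx * dx + dy * dy);
        }
    }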

To enhance performance, scalability and the quality of clustering, CURE takes into account a few more pre-clustering and post-clustering steps.

    4.2 CURE Overview


While drawing the random sample, due importance was given to the fact that all clusters were represented and none of them were missed out, by estimating a minimum probability.

    4.3 CURE - Data Structures Used

We used two data structures, namely the KD Tree and the Min Heap. Brief descriptions of both follow.

    4.3.1 KD Tree

A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. KD-Trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbour searches). KD-Trees are a special case of BSP trees.

A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition every node of a KD-Tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the KD-Tree. KD-Tries are a variant that store data only in leaf nodes. It is worth noting that in an alternative definition of the KD-Tree the points are stored in its leaf nodes only, although each splitting plane still goes through one of the points.

In CURE, the KD Tree is initialized during the initial phase of clustering to hold all the points. Later on in the algorithm, we use this tree for nearest neighbor search and for finding the closest clusters based on the representative points of a cluster. When a new cluster is formed, its new representative points are added to the KD Tree. The representative points of the older clusters are deleted from the tree.

The KD Tree improves the search for points in k-dimensional space from O(n) to O(log n), as it uses binary partitioning across the coordinate axes.

    4.3.2 Min Heap

A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary tree with two additional constraints:

1. The shape property: all levels of the tree, except possibly the last one (deepest), are fully filled, and, if the last level of the tree is not complete, the nodes of that level are filled from left to right.


2. The heap property: each node is less than or equal to each of its children.

The Min Heap stores the minimum element at the root of the heap. In CURE, we always merge two clusters at every step. The cluster at the root of the heap is necessarily the one with the smallest distance to its closest cluster, since the heap is built using inter-cluster distance comparisons. Hence we can always retrieve this cluster in O(1) time.

    We used java.util.PriorityQueue which supports all the Min Heap operations.
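Since the report states that java.util.PriorityQueue was used, a minimal illustration of how such a heap can be keyed on inter-cluster distance follows; the Cluster fields shown here are illustrative, not the project's actual classes.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class ClusterHeapSketch {

        static class Cluster {
            double distToClosest;   // distance to this cluster's current closest cluster
            Cluster closest;        // reference to that closest cluster
        }

        public static void main(String[] args) {
            // Min Heap ordered by the distance to each cluster's closest cluster:
            // peek() exposes the next pair to merge in O(1); add()/poll() are O(log n).
            PriorityQueue<Cluster> heap = new PriorityQueue<Cluster>(16, new Comparator<Cluster>() {
                public int compare(Cluster a, Cluster b) {
                    return Double.compare(a.distToClosest, b.distToClosest);
                }
            });

            Cluster a = new Cluster(); a.distToClosest = 2.5;
            Cluster b = new Cluster(); b.distToClosest = 0.8;
            heap.add(a);
            heap.add(b);

            Cluster next = heap.poll();   // b: the cluster with the smallest distance to its closest cluster
            System.out.println(next.distToClosest);
        }
    }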

    4.4 Benefits of CURE against Other Algorithms

    K-Means (& Centroid based Algorithms): Unsuitable for non-spherical and size differing clusters.

CLARANS: Needs multiple data scans (R* Trees were proposed later on). CURE uses KD Trees inherently to store the dataset and uses it across passes.

    BIRCH: Suffers from identifying only convex or spherical clusters of uniform size

    DBSCAN: No parallelism, High Sensitivity, Sampling of data may affect density measures.

    4.5 Observations towards Sensitivity to Parameters

We observed that the random sample size was an important criterion while pre-clustering the data set. Hence we used the Chernoff bounds as given in [1] to calculate the minimum size of the sample to be selected. Random sampling often missed out some of the smaller clusters. The next important parameter was the shrink factor of the representative points (a). If we increased a to 1, the algorithm would degenerate to MST-based algorithms. If the parameter a is reduced to 0.1, CURE starts behaving as a centroid-based algorithm. Thus, for a range of 0.3 to 0.7, CURE identified the right clusters.
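A one-method sketch of the shrinking operation that this parameter controls, written to match the behaviour described above (the class name and exact formula are our own illustration, not the project's code):

    public class ShrinkSketch {
        // Keep a fraction 'a' of a representative point's offset from the cluster centroid:
        // 'a' close to 1 keeps the scattered points (MST/single-link-like behaviour),
        // 'a' close to 0 collapses them onto the centroid (centroid-based behaviour).
        static double[] shrink(double[] rep, double[] centroid, double a) {
            double[] r = new double[rep.length];
            for (int d = 0; d < rep.length; d++) {
                r[d] = centroid[d] + a * (rep[d] - centroid[d]);
            }
            return r;
        }
    }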


The number of representative points present in a cluster is an important parameter. If the cluster is too sparse, it may need more representative points than a compact smaller cluster. We observed that if the number of representative points is increased to 8 or 10, sparse clusters with variable size and density were identified properly. But with an increase in representative points, the computation time for clustering increased, as for every new cluster formed, new representative points have to be calculated and shrunk.

One of the most important observations of our experiments was with respect to the partitioning of data sets, as CURE supports concurrent execution of the first pass of the algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped significantly. Though the number of clusters to be merged increased in the second step, the advantage of concurrent execution was far greater. But we noticed that if we increased the number of partitions to higher numbers such as 50, the clustering would not give proper results, as some of the partitions would not have any data to cluster. Hence, though the time consumed would be lower, the quality of the clusters gets affected and CURE could not identify all the clusters correctly; some of them got merged to form bigger clusters. Hence, a partitioning of 10 to 20 results in an efficient speed-up of the algorithm while maintaining the quality of the clusters.

    Partitioning Results

No. of Points         1572    3568    7502    10256
Time (sec), P = 2      6.4     7.8    29.4     75.7
Time (sec), P = 2      6.5     7.6    21.6     43.6
Time (sec), P = 5      6.1     7.3    12.2     21.2

    Figure 16 Partitioning results

If a chart is plotted for these results, we can see that as the partitioning is increased, the time taken to cluster increases very slowly even though the data set size increases more than four-fold.

    III. CONCLUSION

From the clusters obtained through the various algorithms and the time taken by each algorithm on the datasets, we can say that K-means is not the best of the clustering methods, given its high space complexity. For high-dimensional data, K-means takes a lot of time and memory. It also does not always converge to the natural clusters.

Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the density of the clusters did not vary too much, DBSCAN fairly identified all the clusters. But if the size of the data increases and the shapes and density of clusters vary too much, DBSCAN ends up combining or splitting those clusters.

CURE could identify all the clusters properly. But CURE depends on some user parameters which have to be data specific. The range of such parameters does not vary too much, many of them lying between 0 and 1. CURE could identify several clusters with high purity which K-means and DBSCAN failed to identify.

With respect to agglomerative clustering, clusters with high purity could be obtained, but the computation time for clustering was high. Application of Kruskal's algorithm and the union-by-rank optimization helped to improve the efficiency, but the computation time still increased significantly as the size of the data set increased.

    IV. REFERENCES

1. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu.

2. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. KDD '96.

3. CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi and K. Shim, 1998.

4. Introduction to Clustering Techniques. Leo Wanner.

5. A Comprehensive Overview of Basic Clustering Algorithms. Glenn Fung.

6. Introduction to Data Mining. Tan, Steinbach, Kumar.

7. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.

8. Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, p. 103-114, June 4-6, 1996, Montreal, Quebec, Canada.

9. An Efficient K-Means Clustering Algorithm. K. Alsabti, S. Ranka, V. Singh. 1998.

10. Density Based Indexing for Approximate Nearest Neighbor Queries. K. Bennett, U. Fayyad, D. Geiger. Microsoft Research. 1998.

11. The Analysis of a Simple K-Means Algorithm. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman and A. Y. Wu. 2000.

12. Accelerating Exact K-Means Algorithms with Geometric Reasoning. D. Pelleg and A. Moore. 1999.