7/27/2019 Unit-03 (Part 2)
Unit-03: Cluster Analysis 1
GAURAV JAISWAL, Dept. of CS, AITM
CLUSTER ANALYSIS
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are far away from any cluster) may be more interesting than common cases. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.
The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster
interval-based (numerical) data. However, applications may require clustering other types of
data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing,
unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may
lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway
networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Users expect clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied to specific semantic
interpretations and applications.
DATA TYPES IN CLUSTER ANALYSIS
Main memory-based clustering algorithms typically operate on either of the following two data
structures.
Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables):

    x_11 ... x_1f ... x_1p
    ...
    x_i1 ... x_if ... x_ip
    ...
    x_n1 ... x_nf ... x_np

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    ...
    d(n,1)   d(n,2)   ...   0

where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle needs to be stored, as shown in the second matrix.
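As a small illustration (not part of the original notes), the lower-triangular dissimilarity matrix can be built with a short Python sketch; the function name and the pluggable dissimilarity function d are our own choices:

```python
def dissimilarity_matrix(objects, d):
    """Build the n-by-n dissimilarity matrix for a list of objects and a
    dissimilarity function d; only the lower triangle (plus the zero
    diagonal) is computed, since d(i, j) = d(j, i) and d(i, i) = 0."""
    n = len(objects)
    return [[d(objects[i], objects[j]) for j in range(i)] + [0.0]
            for i in range(n)]
```

For example, with three one-dimensional objects and absolute difference as d, row i holds d(i, 1), ..., d(i, i-1), 0.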
1. Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis. To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be performed as follows.
Calculate the mean absolute deviation, s_f:

    s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
where x_1f, ..., x_nf are n measurements of f, and m_f is the mean value of f, that is, m_f = (1/n)(x_1f + x_2f + ... + x_nf).
Calculate the standardized measurement, or z-score:

    z_if = (x_if - m_f) / s_f
The mean absolute deviation, s_f, is more robust to outliers than the standard deviation. When computing the mean absolute deviation, the deviations from the mean (i.e., |x_if - m_f|) are not squared; hence, the effect of outliers is somewhat reduced.
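The two standardization steps above can be sketched in Python (an illustrative function, not part of the original notes):

```python
def standardize(values):
    """Standardize the n measurements of one variable f: compute the
    mean m_f, the mean absolute deviation s_f (deviations are not
    squared, so outliers have less effect), and return the z-scores."""
    n = len(values)
    m_f = sum(values) / n                        # mean value m_f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]     # z_if = (x_if - m_f) / s_f
```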
After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. The most popular distance measure is Euclidean distance, which is defined as

    d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

For example, let x1 = (1, 2) and x2 = (3, 5) represent two objects. The Euclidean distance between the two is sqrt(2^2 + 3^2) = sqrt(13), which is approximately 3.61. The Manhattan distance between the two is 2 + 3 = 5.
2. Binary Variables
We now consider the dissimilarity between objects described by either symmetric or asymmetric binary variables. A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having the states male and female. Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. By convention, we shall code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative). The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation:

    d(i, j) = (r + s) / (q + r + s)

where, for objects i and j, q is the number of variables that equal 1 for both objects, r is the number that equal 1 for i but 0 for j, and s is the number that equal 0 for i but 1 for j.
Asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as

    sim(i, j) = q / (q + r + s) = 1 - d(i, j)

The coefficient sim(i, j) is called the Jaccard coefficient.
For example, suppose that a patient record table contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary. For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric variables. According to the equation above, the distance between each pair of the three patients, Jack, Mary, and Jim, can then be computed from the counts q, r, and s for each pair.
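A minimal sketch of the asymmetric binary dissimilarity, using hypothetical 0/1 patient vectors (the original table's values are not reproduced in these notes, so the vectors below are only an assumption for illustration):

```python
def asym_binary_dissim(i, j):
    """d(i, j) = (r + s) / (q + r + s): q counts 1/1 matches, r counts
    1/0 mismatches, s counts 0/1 mismatches; negative matches (0/0, the
    count t) are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Hypothetical vectors over (fever, cough, test-1, ..., test-4), coded 1/0:
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
# here q = 2, r = 0, s = 1, so d(jack, mary) = (0 + 1) / (2 + 0 + 1)
```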
3. Categorical, Ordinal, and Ratio-Scaled Variables
3.1 Categorical Variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue. Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

    d(i, j) = (p - m) / p

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.
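The mismatch ratio is a one-liner in Python (an illustrative sketch; the function name is ours):

```python
def categorical_dissim(i, j):
    """d(i, j) = (p - m) / p, where m is the number of variables on
    which objects i and j are in the same state and p is the total
    number of variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)  # number of matches
    return (p - m) / p
```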
3.2 Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not.
Ordinal variables may also be obtained from the discretization of interval-scaled quantities by splitting the value range into a finite number of classes. The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has M_f states. These ordered states define the ranking 1, ..., M_f.
Suppose that f is a variable from a set of ordinal variables describing n objects. The dissimilarity computation with respect to f involves the following steps:
1. The value of f for the ith object is x_if, and f has M_f ordered states, representing the ranking 1, ..., M_f. Replace each x_if by its corresponding rank, r_if in {1, ..., M_f}.
2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This can be achieved by replacing the rank r_if of the ith object in the fth variable by

    z_if = (r_if - 1) / (M_f - 1)
3.3 Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula

    Ae^(Bt) or Ae^(-Bt)

where A and B are positive constants, and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables for computing the dissimilarity between objects.
Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice, since it is likely that the scale may be distorted.
Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i by using the formula y_if = log(x_if).
Treat x_if as continuous ordinal data and treat their ranks as interval-valued.
3.4 Variables of Mixed Types
In many real databases, objects are described by a mixture of variable types. One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p variables of mixed type. The dissimilarity d(i, j) between objects i and j is defined as

    d(i, j) = ( sum over f of delta_ij^(f) * d_ij^(f) ) / ( sum over f of delta_ij^(f) )

where the indicator delta_ij^(f) = 0 if either (1) x_if or x_jf is missing (i.e., there is no measurement of variable f for object i or object j), or (2) x_if = x_jf = 0 and variable f is asymmetric binary; otherwise, delta_ij^(f) = 1. The contribution of variable f to the dissimilarity between i and j, that is, d_ij^(f), is computed dependent on its type.
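The combining formula can be sketched as follows, assuming the per-variable contributions d_ij^(f) have already been computed and scaled to [0.0, 1.0] (names are illustrative):

```python
def mixed_dissim(contribs, indicators):
    """d(i, j) = sum_f delta_ij^(f) * d_ij^(f) / sum_f delta_ij^(f):
    average the per-variable contributions d_ij^(f) over the variables
    whose indicator delta_ij^(f) is 1 (i.e., not missing and not an
    ignored 0/0 match of an asymmetric binary variable)."""
    num = sum(d * ind for d, ind in zip(contribs, indicators))
    den = sum(indicators)
    return num / den
```

For instance, with contributions (0.5, 1.0, 0.2) but the third variable missing, the dissimilarity averages only the first two values.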
3.5 Vector Objects
In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.
There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure:

    s(x, y) = (x^t . y) / (||x|| * ||y||)

where x^t is the transpose of vector x, ||x|| is the Euclidean norm of vector x, ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y.
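The cosine measure can be sketched directly from its definition (an illustrative function, not part of the original notes):

```python
from math import sqrt

def cosine_similarity(x, y):
    """s(x, y) = (x^t . y) / (||x|| * ||y||): the cosine of the angle
    between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))       # x^t . y
    norm_x = sqrt(sum(a * a for a in x))         # ||x||
    norm_y = sqrt(sum(b * b for b in y))         # ||y||
    return dot / (norm_x * norm_y)
```

Orthogonal vectors score 0, while parallel vectors score 1, regardless of their lengths.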
CATEGORIES OF CLUSTERING METHODS
In general, the major clustering methods can be classified into the following categories.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the
same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.
To achieve global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object linkages at each hierarchical partitioning, such as in Chameleon, or (2) integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH.
Density-based methods: These clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis.
Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space.
Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. This also leads to a way of automatically determining the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods.
NOTE: - The choice of clustering algorithm depends both on the type of data available and on
the particular purpose of the application.
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar, whereas the objects of different clusters are dissimilar in terms of the data set attributes.
The most well-known and commonly used partitioning methods are k-means, k-medoids, and their
variations.
Centroid-Based Technique: The k-Means Method
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-means method, however, can be applied only when the mean of a cluster is defined.
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

    E = sum over i = 1..k of ( sum over p in C_i of |p - m_i|^2 )

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
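Steps 1-5 can be sketched in a short pure-Python implementation (for reproducibility this sketch takes the first k objects as initial centers instead of an arbitrary choice; function names are ours):

```python
from math import dist

def k_means(D, k, max_iter=100):
    """Minimal k-means over a list of coordinate tuples D."""
    centers = [tuple(p) for p in D[:k]]          # step 1: initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in D:                              # step 3: assign to nearest mean
            idx = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[idx].append(p)
        new_centers = [                          # step 4: recompute each mean
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:               # step 5: stop when no change
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points, the centers converge to the two group means after a few iterations.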
The algorithm attempts to determine k partitions that minimize the square-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.
Representative Object-Based Technique: The k-Medoids Method
The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated by the use of the square-error function. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

    E = sum over j = 1..k of ( sum over p in C_j of |p - o_j| )

where E is the sum of the absolute error for all objects in the data set; p is the point in space representing a given object in cluster C_j; and o_j is the representative object of C_j.
The initial representative objects (or seeds) are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster. To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, o_j, the following four cases are examined for each of the nonrepresentative objects, p:
Case 1: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to one of the other representative objects, o_i, i != j, then p is reassigned to o_i.
Case 2: p currently belongs to representative object o_j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Case 3: p currently belongs to representative object o_i, i != j. If o_j is replaced by o_random as a representative object and p is still closest to o_i, then the assignment does not change.
Case 4: p currently belongs to representative object o_i, i != j. If o_j is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
If the total cost is negative, then o_j is replaced or swapped with o_random, since the actual absolute error E would be reduced. If the total cost is positive, the current representative object, o_j, is considered acceptable, and nothing is changed in the iteration.
Algorithm: k-medoids. PAM (Partitioning Around Medoids), a k-medoids algorithm for partitioning based on medoid or central objects.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects in D as the initial representative objects or seeds;
2. repeat
3. assign each remaining object to the cluster with the nearest representative object;
4. randomly select a nonrepresentative object, o_random;
5. compute the total cost, S, of swapping representative object, o_j, with o_random;
6. if S < 0 then swap o_j with o_random to form the new set of k representative objects;
7. until no change;
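The swap-cost test in steps 5-6 can be sketched as follows (an illustrative fragment with Euclidean distances; the total cost S is simply the change in the absolute error E caused by the swap):

```python
from math import dist

def absolute_error(D, medoids):
    """E: sum of distances from each object to its nearest representative."""
    return sum(min(dist(p, m) for m in medoids) for p in D)

def swap_cost(D, medoids, o_j, o_random):
    """Total cost S of replacing medoid o_j with o_random.
    S < 0 means the swap reduces E and should be kept."""
    swapped = [o_random if m == o_j else m for m in medoids]
    return absolute_error(D, swapped) - absolute_error(D, medoids)
```

For example, if the current medoids sit inside one dense group and the swap moves one of them next to a distant group, S comes out negative and PAM accepts the swap.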
In general, the algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical
clustering methods can be further classified as either agglomerative or divisive.
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each
object in its own cluster and then merges these atomic clusters into larger and larger clusters,
until all of the objects are in a single cluster or until certain termination conditions are
satisfied.
Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions.
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed based on how well connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged.
Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex
of the graph represents a data object, and there exists an edge between two vertices (objects) if one
object is among the k-most-similar objects of the other. The edges are weighted to reflect the
similarity between objects. Chameleon uses a graph partitioning algorithm to partition the k-nearest-
neighbor graph into a large number of relatively small subclusters. It then uses an agglomerative
hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters.
The graph-partitioning algorithm partitions the k-nearest-neighbor graph such that it minimizes the edge cut. That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight of the edges that would be cut.
Chameleon determines the similarity between each pair of clusters Ci and Cj according to their
relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj):
The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters. That is,

    RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC_Ci| + |EC_Cj|) / 2 )

where EC{Ci,Cj} is the edge cut, defined as above, for a cluster containing both Ci and Cj. Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters. It is defined as

    RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) * S_EC_Ci + (|Cj| / (|Ci| + |Cj|)) * S_EC_Cj )

where S_EC{Ci,Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and S_EC_Ci (or S_EC_Cj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
CURE (Clustering Using REpresentatives)
CURE is an efficient data clustering algorithm for large databases that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. To avoid the problems with non-uniformly sized or shaped clusters, CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between the centroid-based and all-points extremes.
In CURE, a constant number c of well-scattered points of a cluster are chosen, and they are shrunk toward the centroid of the cluster by a fraction alpha. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.
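The shrinking step can be sketched as follows (an illustrative function; alpha is the shrink fraction and the representatives are coordinate tuples):

```python
def shrink_representatives(reps, alpha):
    """Shrink each well-scattered representative point toward the
    cluster centroid by fraction alpha, as CURE does."""
    n = len(reps)
    centroid = tuple(sum(coords) / n for coords in zip(*reps))
    return [tuple(r + alpha * (c - r) for r, c in zip(p, centroid))
            for p in reps]
```

Shrinking dampens the effect of outliers among the chosen points: an outlying representative moves proportionally farther toward the centroid than points already near it.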
To handle large databases, CURE employs a combination of random sampling and partitioning: a
random sample is first partitioned, and each partition is partially clustered. The partial clusters are
then clustered in a second pass to yield the desired clusters.
The algorithm is given below.
Draw a random sample s. Partition the sample into p partitions, each of size s/p.
Partially cluster each partition into s/(pq) clusters.
Eliminate outliers:
o by random sampling;
o if a cluster grows too slowly, eliminate it.
Cluster the partial clusters.
Label the data on disk.
Strength:
Produces high-quality clusters in the presence of outliers.
Allows clusters of complex shapes and different sizes.
Requires only one scan of the entire database.
Weakness:
CURE does not handle categorical attributes.
Density-Based Methods
To discover clusters with arbitrary shape, density-based clustering methods have been developed.
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
The basic ideas of density-based clustering involve a number of new definitions.
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 <= i < n, pi in D.
An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinPts.
Consider the figure for a given ε, represented by the radius of the circles, and, say, let MinPts = 3. Based on the above definitions:
Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points.
q is directly density-reachable from m. m is directly density-reachable from p and vice versa.
q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because
q is not a core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.
o, r, and s are all density-connected.
DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merge of a few density-reachable clusters. The process terminates when no new point can be added to any cluster.
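The cluster-growing process described above can be sketched in Python. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, recomputes neighborhoods naively in O(n²), and the function name and point representation are chosen for the example.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: map each point index to a cluster id, or None for noise."""
    def neighbors(i):
        # All points within the eps-neighborhood of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = {}          # point index -> cluster id (None = noise)
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = None        # provisionally noise; may later join a cluster as a border point
            continue
        cluster_id += 1             # i is a core object: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:                # collect density-reachable objects
            j = queue.pop()
            if labels.get(j) is None:       # unvisited or previously marked noise
                labels[j] = cluster_id
                jn = neighbors(j)
                if len(jn) >= min_pts:      # j is also a core object: expand from it
                    queue.extend(k for k in jn
                                 if k not in labels or labels[k] is None)
    return labels
```

For example, two tight groups of points with one far-away point yield two clusters and one noise label.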
OPTICS: Ordering Points to Identify the Clustering Structure
Although DBSCAN can cluster objects given input parameters such as ε and MinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. To help overcome this difficulty, a cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. The cluster ordering can be used to extract basic clustering information (such as cluster centers or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.
Two values need to be stored for each object: the core-distance and the reachability-distance.
The core-distance of an object p is the smallest ε′ value that makes p a core object. If p is not a core object, the core-distance of p is undefined.
The reachability-distance of an object q with respect to another object p is the greater value of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.
The OPTICS algorithm creates an ordering of the objects in a database, additionally storing the core-distance and a suitable reachability-distance for each object. An algorithm was proposed to extract clusters based on the ordering information produced by OPTICS. Such information is sufficient for the extraction of all density-based clusterings with respect to any distance that is smaller than the distance used in generating the order.
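The two per-object values can be computed directly from their definitions. The sketch below is a minimal illustration assuming Euclidean distance; `core_distance` and `reachability_distance` are illustrative names, and the data set is assumed to contain p itself (so p counts toward its own neighborhood).

```python
import math

def core_distance(p, data, eps, min_pts):
    """Smallest radius that makes p a core object; None (undefined) if p is not core within eps."""
    dists = sorted(math.dist(p, q) for q in data)   # includes distance 0 from p to itself
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None                  # p's eps-neighborhood has fewer than MinPts objects
    return dists[min_pts - 1]        # distance to the MinPts-th nearest object

def reachability_distance(q, p, data, eps, min_pts):
    """max(core-distance of p, dist(p, q)); None (undefined) if p is not a core object."""
    cd = core_distance(p, data, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, q))
```

For points on a line at 0, 1, 2, and 10 with ε = 5 and MinPts = 3, the core-distance of the point at 0 is 2 (its third-nearest object), while the point at 10 is not a core object.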
The figure is the reachability plot for a simple two-dimensional data set, which presents a general overview of how the data are structured and clustered. The data objects are plotted in cluster order (horizontal axis) together with their respective reachability-distances (vertical axis). The three Gaussian bumps in the plot reflect three clusters in the data set. Methods have also been developed for viewing clustering structures of high-dimensional data at various levels of detail.
Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space.
STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into
rectangular cells. Because STING uses a multiresolution approach to cluster analysis, the quality of
STING clustering depends on the granularity of the lowest level of the grid structure. There are
usually several levels of such rectangular cells corresponding to different levels of resolution, and
these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of
cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as
the mean, maximum, and minimum values) is precomputed and stored.
Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-
level cells. These parameters include the following: the attribute-independent parameter, count; the
attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max
(maximum); and the type of distribution that the attribute value in the cell follows, such as normal,
uniform, exponential, or none (if the distribution is unknown).
The statistical parameters can be used in a top-down, grid-based method as follows. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.
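The top-down descent can be sketched schematically. This is not the STING algorithm itself: the `Cell` class and the `relevant` predicate are hypothetical, and the confidence-interval computation is abstracted into a single relevance test on each cell's precomputed statistics.

```python
class Cell:
    """Hypothetical grid cell holding precomputed statistics and, except at the
    bottom layer, its children at the next lower level of resolution."""
    def __init__(self, count, mean, children=None):
        self.count = count
        self.mean = mean
        self.children = children or []

def answer_query(top_cells, relevant):
    """Top-down STING-style descent: keep only cells judged relevant at each
    layer, then examine only their children, until the bottom layer is reached."""
    layer = [c for c in top_cells if relevant(c)]
    while layer and layer[0].children:          # not yet at the bottom layer
        layer = [child for cell in layer
                 for child in cell.children if relevant(child)]
    return layer                                # relevant bottom-level cells
```

Because irrelevant cells are pruned at every layer, only a small fraction of the grid is ever examined, which is what makes query processing proportional to the number of (relevant) bottom-level cells rather than to the number of data objects.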
STING offers several advantages:
1. The grid-based computation is query-independent, because the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query;
2. The grid structure facilitates parallel processing and incremental updating; and
3. The method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, and hence the time complexity of generating clusters is O(n), where n is the total number of objects. After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
CLIQUE: A Dimension-Growth Subspace Clustering Method
CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains; in this sense, it can be viewed as an integration of density-based and grid-based clustering methods.
The ideas of the CLIQUE clustering algorithm are outlined as follows:
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space (or units), thereby discovering the overall distribution patterns of the data set.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.
CLIQUE performs multidimensional clustering in two steps:
1. In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among these. This is done (in 1-D) for each dimension.
2. In the second step, CLIQUE generates a minimal description for each cluster as follows. For each cluster, it determines the maximal region that covers the cluster of connected dense units. It then determines a minimal cover (logic description) for each cluster.
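The first step can be illustrated in one dimension: partition the value range into equal-width intervals and keep those whose fraction of the total points exceeds the density threshold. This is a minimal sketch; the function name and parameters are chosen for the example, not taken from any CLIQUE implementation.

```python
from collections import Counter

def dense_units_1d(points, dim, n_intervals, lo, hi, threshold):
    """CLIQUE step 1 in one dimension: partition [lo, hi) into n_intervals
    equal-width units and return the indices of the dense ones."""
    width = (hi - lo) / n_intervals
    # Map each point to its interval index; clamp the upper boundary into the last unit
    counts = Counter(min(int((p[dim] - lo) / width), n_intervals - 1)
                     for p in points)
    # A unit is dense if its fraction of the total points exceeds the threshold
    return {u for u, c in counts.items() if c / len(points) > threshold}
```

Repeating this per dimension, then intersecting candidates dimension by dimension, is what lets CLIQUE grow dense units from 1-D subspaces upward.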
CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces. It is insensitive to the order of input objects and does not presume any canonical data distribution. It scales linearly with the size of the input and has good scalability as the number of dimensions in the data is increased. However, obtaining meaningful clustering results is dependent on proper tuning of the grid size (which is a stable structure here) and the density threshold.
Model-Based Clustering Methods
Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions.
Neural Network Approach
The neural network approach is motivated by biological neural networks. Roughly speaking, a neural
network is a set of connected input/output units, where each connection has a weight associated
with it.
Neural networks have several properties: First, neural networks are inherently parallel and
distributed processing architectures. Second, neural networks learn by adjusting their
interconnection weights so as to best fit the data. This allows them to normalize or prototype the patterns and act as feature (or attribute) extractors for the various clusters. Third, neural networks
process numerical vectors and require object patterns to be represented by quantitative features
only.
The neural network approach to clustering tends to represent each cluster as an exemplar. An
exemplar acts as a prototype of the cluster and does not necessarily have to correspond to a
particular data example or object. New objects can be distributed to the cluster whose exemplar is
the most similar, based on some distance measure.
Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, or as topologically ordered maps. The SOM's goal is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships (hence the topology) are preserved as much as possible. With SOMs, clustering is performed by having several units competing for the current object. The unit whose weight vector is closest to the current object becomes the winning or active unit. The organization of units is said to form a feature map. The SOM approach has been used successfully for Web document clustering.
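The competitive step can be sketched as follows. This is a minimal illustration of winner selection and the weight update only; a real SOM also updates the winner's topological neighbors and decays the learning rate over time, both omitted here for brevity.

```python
import math

def train_som_step(units, x, lr):
    """One competitive-learning step: the unit whose weight vector is closest
    to input x wins and its weights are moved a fraction lr toward x."""
    winner = min(range(len(units)), key=lambda i: math.dist(units[i], x))
    units[winner] = [w + lr * (xi - w) for w, xi in zip(units[winner], x)]
    return winner
```

With two units at (0, 0) and (10, 10) and input (9, 9), the second unit wins and is pulled toward the input.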
Outlier Analysis
Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. Outliers can be caused by measurement or execution error. For example, the display of a person's age as −999 could be caused by a program default setting of an unrecorded age. Alternatively, outliers may be the result of inherent data variability. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information.
In other words, the outliers may be of particular interest, such as in the case of fraud detection, where
outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data
mining task, referred to as outlier mining. Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are
considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
The outlier mining problem can be viewed as two subproblems: (1) define what data can be considered as inconsistent in a given data set, and (2) find an efficient method to mine the outliers so defined. The problem of defining outliers is nontrivial. Outlier detection methods can be categorized into four approaches: the statistical approach, the distance-based approach, the density-based local outlier approach, and the deviation-based approach.
1. Statistical Distribution-Based Outlier Detection
The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
There are two basic types of procedures for detecting outliers:
1) Block procedures: In this case, either all of the suspect objects are treated as outliers or all of themare accepted as consistent.
2) Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
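Under a normal-distribution model, a simple discordancy-style check flags values that lie far from the mean in standard-deviation units. This z-score sketch is an illustrative stand-in for a formal discordancy test, not a specific published procedure; the threshold is a free parameter.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Assume a normal model and flag values more than `threshold`
    standard deviations away from the sample mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is one reason block and sequential procedures, and the distance-based and density-based approaches, were developed.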