UNIT V
CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
Cluster Analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster Analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering
Clustering is an example of unsupervised learning
Clustering is a form of learning by observation, rather than learning by examples
The following are typical requirements of clustering in data mining
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects. However, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based data. However, applications may require clustering other types of data, such as binary, categorical, and ordinal data, or mixtures of these data types
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis. The clustering results can be quite sensitive to input parameters.
Ability to deal with noisy data: Most real world databases contain outliers or missing unknown or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data into existing clustering structures and instead must determine a new clustering from scratch.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low dimensional data, involving only two to three dimensions.
Constraint based clustering: Real world applications may need to perform clustering under various kinds of constraints. For example, to choose locations for new automatic banking machines in a city, you may cluster households while considering constraints such as the city's rivers and highway networks.
Interpretability and usability: Users expect clustering results to be interpretable comprehensible and usable. That is, clustering may need to be tied to specific semantic interpretations and applications.
Types of Data in cluster analysis
Main memory-based clustering algorithms typically operate on the following two data structures
Data matrix: This represents n objects, such as persons, with p variables, such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n by p matrix
Dissimilarity matrix: This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table
where d(i,j) is the measured difference or dissimilarity between objects i and j
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two mode matrix, whereas the dissimilarity matrix is called a one mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms
(i) Interval Scaled variable
Dissimilarity between objects described by interval-scaled variables is commonly computed using distance measures. These measures include the Euclidean, Manhattan, and Minkowski distances
Standardize the data:
1. Calculate the mean absolute deviation:
sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
where mf = (x1f + x2f + ... + xnf)/n is the mean of variable f
2. Calculate the standardized measurement (z-score):
zif = (xif - mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation: because the deviations are not squared, the effect of outliers is somewhat reduced, so outliers remain detectable. The median absolute deviation is even more robust, but with it the z-scores of outliers become so small that the outliers effectively disappear.
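As a sketch, the standardization above can be coded as follows (function and variable names are our own, not from the text):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute
    deviation, following the formulas above."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

z = standardize([150.0, 160.0, 170.0, 180.0, 190.0])   # e.g. heights in cm
```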
Similarity and Dissimilarity
Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular distances are based on the Minkowski distance:
d(i,k) = [|xi1 - xk1|^p + |xi2 - xk2|^p + ... + |xin - xkn|^p]^(1/p)
p = 1: Manhattan distance
d(i,k) = |xi1 - xk1| + |xi2 - xk2| + ... + |xin - xkn|
p = 2: Euclidean distance
d(i,k) = [|xi1 - xk1|^2 + |xi2 - xk2|^2 + ... + |xin - xkn|^2]^(1/2)
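A minimal implementation of these distances (the function name is our own choice):

```python
def minkowski(x, y, p):
    """Minkowski distance between two n-dimensional objects x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(x, y, 1)   # p = 1
euclidean = minkowski(x, y, 2)   # p = 2
```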
(ii) Binary variables
Contingency table for binary data
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent and 1 means that it is present. A contingency table for two objects i and j summarizes their binary variables by four counts: q variables equal 1 for both objects, r equal 1 for i but 0 for j, s equal 0 for i but 1 for j, and t equal 0 for both; p = q + r + s + t is the total number of variables.
A binary variable is symmetric if its two states are equally important, and asymmetric if they are not, such as the positive and negative outcomes of a disease test.
Distance measure for symmetric binary variables (symmetric binary dissimilarity):
d(i,j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables (asymmetric binary dissimilarity, in which the negative matches t are dropped):
d(i,j) = (r + s) / (q + r + s)
Asymmetric binary similarity:
sim(i,j) = q / (q + r + s) = 1 - d(i,j)
This is also called the Jaccard coefficient
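The contingency-table counts and both dissimilarity measures can be sketched as follows (names and the example objects are our own):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Dissimilarity of two objects described by 0/1 variables, from the
    contingency-table counts q, r, s, t defined above."""
    q = sum(1 for a, b in zip(i, j) if (a, b) == (1, 1))
    r = sum(1 for a, b in zip(i, j) if (a, b) == (1, 0))
    s = sum(1 for a, b in zip(i, j) if (a, b) == (0, 1))
    t = sum(1 for a, b in zip(i, j) if (a, b) == (0, 0))
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)      # asymmetric: negative matches dropped

jack = [1, 0, 1, 0, 0, 0]            # hypothetical test results
mary = [1, 0, 1, 0, 1, 0]
d_asym = binary_dissimilarity(jack, mary, symmetric=False)
jaccard = 1 - d_asym                 # Jaccard coefficient
```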
(iii) Categorical ordinal and Ratio scaled variables
Categorical variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states.
Let the number of states of a categorical variable be M. The states can be denoted by letter symbols or a set of integers.
Ordinal variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal variable are ordered in a meaningful sequence.
Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively.
1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, ..., Mf. Replace each xif by its corresponding rank rif in {1, ..., Mf}.
2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This is done by replacing the rank rif by zif = (rif - 1)/(Mf - 1).
3. Dissimilarity can then be computed using any of the distance measures described for interval scaled variables
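The rank-and-scale mapping above can be sketched as (function name and example states are our own):

```python
def ordinal_to_interval(values, states):
    """Map ordinal values onto [0.0, 1.0]: replace each value by its
    rank r in the ordered state list, then scale by (r - 1)/(M - 1)."""
    M = len(states)
    rank = {state: i + 1 for i, state in enumerate(states)}
    return [(rank[v] - 1) / (M - 1) for v in values]

z = ordinal_to_interval(["fair", "good", "excellent"],
                        states=["fair", "good", "excellent"])
```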
Ratio scaled variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an approximately exponential scale. There are three methods to handle such variables:
1. Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice, since it is likely that the scale will be distorted.
2. Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif). The yif values can be treated as interval-valued.
3. Treat xif as continuous ordinal data and treat their ranks as interval-valued.
(iv) Variables of Mixed types
A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0]
Vector Objects
In some applications such as information retrieval text document clustering, and biological taxonomy, we need to compare and cluster complex objects containing a large number of symbolic entities. To measure the distance between complex objects it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.
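One widely used similarity function for such vector objects (our own illustration; the text does not prescribe a specific function) is the cosine measure between term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

s = cosine_similarity([3, 0, 1], [1, 0, 1])   # two toy document vectors
```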
Categorization of Major Clustering Methods
(i) Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k<=n
It classifies the data into k groups, which together satisfy the following requirements
1. Each group must contain at least one object
2. Each object must belong to exactly one group
The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different
(ii) Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects, i.e., the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.
There are two main types of hierarchical clustering:
(a) Agglomerative (bottom-up approach)
Start with the points as individual clusters. At each step, merge the closest pair of clusters until only one cluster is left.
(b) Divisive (top-down approach)
Start with one all-inclusive cluster. At each step, split a cluster until each cluster contains a single point.
To improve the quality of hierarchical clustering:
Perform careful analysis of object linkages at each hierarchical partitioning
Integrate hierarchical agglomeration and other approaches, by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method
(iii) Density based methods
Other clustering methods have been developed based on the notion of density. Such methods can be used to filter out noise and discover clusters of arbitrary shape.
(iv) Grid based Methods
Grid based methods quantize the object space into a finite number of cells that form a grid structure
All of the clustering operations are performed on the grid structure
(v) Model based Methods
Model based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model
A model based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points
Partitioning Methods
Partitioning algorithms construct partitions of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function.
These methods are nonhierarchical: each instance is placed in exactly one of k nonoverlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k
The partitioning techniques usually produce clusters by optimizing a criterion function defined either locally or globally. A local criterion such as the minimal mutual neighbor distance forms clusters by utilizing the local structure or context in the data
The most commonly used partitional clustering strategy is based on the square error criterion. The general objective is to obtain the partition that for a fixed number of clusters, minimizes the total square error.
(i) Classical Partitioning methods: k-means and K-medoids
The most well known and commonly used partitioning methods are k-means, k-medoids, and their variations
Centroid based Technique: The k-means method
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when there are no more new assignments
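The four steps can be sketched as follows (a common variant that seeds the centroids with randomly chosen objects; all names are our own):

```python
import random

def kmeans(points, k, iters=100):
    """Minimal k-means sketch for 2-D points, following the steps above."""
    centroids = random.sample(points, k)        # initial seed points
    for _ in range(iters):
        # step 3: assign each object to the cluster with the nearest seed
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # step 2: recompute centroids of the current partition
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                    # step 4: no change -> stop
            break
        centroids = new
    return centroids, clusters

random.seed(0)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```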
Advantages
With a large number of variables, k-means may be computationally faster than hierarchical clustering
K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages
Applicable only when a mean is defined
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to improve the clustering by replacing a representative object with a nonrepresentative object, as long as the quality of the resulting clustering is improved.
Partitioning Methods in Large Databases
CLARANS works as follows:
CLARANS draws a sample with some randomness in each step of the search
If a better neighbor is found, CLARANS moves to the neighbor's node and the process starts again
Once a user-specified number of local minima has been found, the algorithm outputs, as a solution, the best local minimum, that is, the local minimum having the lowest cost
Advantages
More efficient than k-means and k-medoids
Used to detect outliers
Disadvantages
May not find a real local minimum due to the trimming of its search
It assumes that all objects fit into the main memory, and the result is sensitive to input order
Hierarchical Methods
Basically, hierarchical methods group data into a tree of clusters. There are two basic varieties of hierarchical algorithms: agglomerative and divisive. A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering
Agglomerative and Divisive Hierarchical clustering
Agglomerative Hierarchical methods
Begin with as many clusters as objects. Clusters are successively merged until only one cluster remains
Divisive Hierarchical Methods
Begin with all objects in one cluster. Groups are continually divided until there are as many clusters as objects
Steps in Agglomerative Hierarchical Clustering
Compute the proximity matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the proximity matrix
Until only a single cluster remains
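The steps above can be sketched as a single-link agglomerative procedure on one-dimensional points (names and the list-based record of merges are our own choices):

```python
def agglomerative(points):
    """Single-link agglomerative sketch on 1-D points: each point starts
    as its own cluster; repeatedly merge the closest pair, recording
    each merge (a dendrogram in list form)."""
    clusters = [[p] for p in points]       # let each data point be a cluster
    merges = []
    while len(clusters) > 1:
        best = None                        # find the two closest clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)            # update the proximity structure
    return merges

merges = agglomerative([1.0, 2.0, 9.0, 10.0, 30.0])
```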
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic, good clustering, and one or more additional scans can be used to further improve the quality.
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data
Phase 2: BIRCH applies a clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones
CURE: Clustering Using Representatives
Steps:
Draw a random sample S of the original objects
Partition sample S into a set of partitions and form a cluster for each partition
Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster
Cluster similarity is the similarity of the closest pair of representative points from different clusters
Shrinking representative points toward the center helps avoid problems with noise and outliers
CURE is better able to handle clusters of arbitrary shapes and sizes
ROCK (Robust clustering using links)
Steps:
Obtain a sample of points from the data set
Compute the link value for each set of points i.e., transform the original similarities into similarities that reflect the number of shared neighbors between points
Assign the remaining points to the clusters that have been found
Density Based Methods
DBSCAN: A density based clustering method based on connected regions with sufficiently high density
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object
Given a set of objects D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi
An object p is density-connected to object q with respect to ε and MinPts in a set of objects D, if there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinPts
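Putting these definitions together, a minimal DBSCAN-style sketch (names and structure are our own):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN-style sketch on 2-D points: find core objects and
    grow clusters by chaining density-reachability. Noise is labeled -1."""
    def neighbors(p):
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]

    labels = {p: None for p in points}
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        if len(neighbors(p)) < min_pts:   # p is not a core object
            labels[p] = -1                # tentatively noise
            continue
        labels[p] = cluster               # start a new cluster at core p
        frontier = neighbors(p)
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:           # border point: density-reachable
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbors(q)) >= min_pts:   # q is core: extend the chain
                frontier.extend(neighbors(q))
        cluster += 1
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], eps=1.5, min_pts=3)
```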
OPTICS: Ordering Points to identify the clustering structure
The core-distance of an object p is the smallest ε value that makes p a core object. If p is not a core object, the core-distance of p is undefined
The reachability-distance of an object q with respect to another object p is the greater of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined
GRID BASED CLUSTERING
Using a multi-resolution grid data structure
Several interesting methods
STING( a Statistical Information Grid approach)
Wave cluster
CLIQUE
STING
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
The STING clustering method
Each cell at a high level is partitioned into a number of smaller cells at the next lower level
Statistical information about each cell is calculated and stored beforehand and is used to answer queries
Use a top-down approach to answer spatial data queries
Wave cluster
A multi-resolution clustering approach which applies wavelet transforms to the feature space
Wavelet transforms
Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-bands
Data are transformed to preserve relative distance between objects at different levels of resolution
Allows natural clusters to become more distinguishable
The wavecluster algorithm
Input parameters
The number of grid cells for each dimension
The wavelet, and the number of applications of the wavelet transform
Major features
Complexity
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Quantization and Transformation
scale 1: high resolution
scale 2: medium resolution
scale 3: low resolution
Model Based Clustering methods
Model based clustering methods attempt to optimize the fit between the given data and some mathematical model.
Expectation Maximization
Make an initial guess for the parameter vector: This involves randomly selecting k objects to represent the cluster means or centers as well as making guesses for the additional parameters
The EM algorithm is simple and easy to implement. In practice it converges fast but may not reach the global optima. Convergence is guaranteed for certain forms of optimization functions.
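As an illustration of the two EM steps, here is a deliberately simplified sketch for a two-component one-dimensional Gaussian mixture with equal weights and fixed unit variance, so that only the two means are estimated (all names are our own):

```python
import math

def em_gmm_1d(data, iters=50):
    """EM sketch for a two-component 1-D Gaussian mixture with equal
    weights and fixed unit variance; only the means are estimated."""
    m1, m2 = min(data), max(data)        # initial guess for the parameters
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            p2 = math.exp(-0.5 * (x - m2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate each mean from the weighted points
        m1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return m1, m2

m1, m2 = em_gmm_1d([0.0, 0.2, -0.2, 5.0, 5.2, 4.8])
```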
Conceptual Clustering
Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the proportion of class members that share this attribute-value pair and the more predictable the pair is of class members
Interclass similarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair and the more predictive the pair is of the class
Neural network approach
The neural network approach is motivated by biological neural networks. Neural networks have several properties that make them popular for clustering
Self organizing feature maps are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self organizing feature maps, after their creator.
Clustering high dimensional Data
Most clustering methods are designed for clustering low dimensional data and encounter challenges when the dimensionality of the data grows really high. This is because when the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered.
Feature transformation methods such as principal component analysis and singular value decomposition transform the data onto a smaller space while generally preserving the original relative distance between objects
CLIQUE: A dimension growth subspace clustering method
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE a cluster is defined as a maximal set of connected dense units
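The dense-unit test can be sketched for two-dimensional data as follows (grid parameters and names are our own):

```python
from collections import Counter

def dense_units(points, grid_size, threshold):
    """CLIQUE-style density test sketch: quantize 2-D points into grid
    units and keep units whose fraction of all points exceeds the
    density threshold."""
    counts = Counter((int(x // grid_size), int(y // grid_size))
                     for x, y in points)
    n = len(points)
    return {unit for unit, c in counts.items() if c / n > threshold}

units = dense_units([(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 5.0)],
                    grid_size=1.0, threshold=0.5)
```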
Constraint based Cluster analysis
Constraints on individual objects: We can specify constraints on the objects to be clustered, such as a selection condition in a real estate application. This constraint can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering
Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm
Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects
User specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
Semi supervised clustering based on partial supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints.
Outlier Analysis
Data objects that show significantly different characteristics from the remaining data are declared outliers. The detection and analysis of outliers is called outlier mining
Application
Detects unusual usage of telecommunication services
Detects credit card fraud and criminal activities in E-commerce
Tracking customer behaviors
Medical analysis
Steps for detecting outliers
Define inconsistent data in a given data set
Find an efficient method to mine the outliers
Methods for detecting outliers are
Statistical approach
Distance based approach
Density based local outlier approach
Deviation based approach
Statistical Distribution based outlier detection
The statistical distribution based approach to outlier detection assumes a distribution or probability model for the given data set
Knowledge of the data set parameters
Knowledge of distribution parameters
The expected number of outliers
Distance based outlier Detection
The distance-based outlier mining concept assigns numeric distances to data objects and computes outliers as data objects with relatively larger distances.
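A naive nested-loop sketch of this idea (the parameter names dmin and pct are our own; an object is flagged when more than a fraction pct of the other objects lie farther than dmin from it):

```python
def distance_outliers(points, dmin, pct):
    """Naive distance-based outlier sketch: flag an object when more
    than a fraction pct of the other objects lie farther than dmin."""
    out = []
    for p in points:
        far = sum(1 for q in points if q is not p and abs(p - q) > dmin)
        if far / (len(points) - 1) > pct:
            out.append(p)
    return out

outliers = distance_outliers([1.0, 1.1, 0.9, 1.2, 50.0], dmin=5.0, pct=0.9)
```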
Index based algorithms: Given a data set the index based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around the object.
Cell based Algorithm: The data space is partitioned into cells. The first layer around a cell is one cell thick, while the second is roughly 2√k cells thick (k being the dimensionality), rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.
cell count- Number of objects in the cell
Cell+1 layer count- Number of objects in the cell + Number of objects in first layer
Cell+2 layers count- Number of objects in the cell+Number of objects in both layers
Density based local outlier detection
Depth-based techniques represent every data object in a k-d space and assign a depth to each object. In density-based local outlier detection, the degree of outlierness is computed as the local outlier factor of an object. It is local in the sense that the degree depends on how isolated the object is with respect to the surrounding neighborhood.
Deviation based outlier detection
Deviation based outlier detection identifies outliers by examining the main characteristics of objects in a group. Objects that deviate from this description are considered outliers.
Techniques for deviation based outlier detection are
Sequential approach
The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects
The key terms this technique uses to assess the dissimilarity between subsets are:
Exception set: This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set
Dissimilarity function: The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence.
Cardinality function: This is typically the count of the number of objects in a given set
Smoothing factor: The smoothing factor assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects. This value is scaled by the cardinality of the set.
Data Mining Applications
(i)Data mining for financial data analysis
Design and construction of data warehouses for multidimensional data analysis and data mining: Like many other application, data warehouses need to be constructed for banking and financial data.
Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance
Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.
Detection of money laundering and other financial crimes: To detect money laundering and other financial crimes, it is important to integrate information from multiple databases
(ii)Data mining for the retail industry
Design and construction of data warehouses based on the benefits of data mining: Because retail data cover a wide spectrum there can be many ways to design a data warehouse for this industry.
Multidimensional analysis of sales, customers, products, time, and region: The retail industry requires timely information regarding customer needs, product sales, trends, and fashions, as well as the quality and cost of products.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using advertisements, coupons, and various kinds of discounts and bonuses to promote products and attract customers.
Customer retention-analysis of customer loyalty: With customer loyalty card information one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed systematically.
Product recommendation and cross referencing of items: By mining associations from sales records, one may discover that a customer who buys a digital camera is likely to buy another set of items
(iii)Data mining for the Telecommunication Industry
Multidimensional analysis of telecommunication data: Telecommunication data are intrinsically multidimensional, with dimensions such as calling time, duration, location of caller, location of callee, and type of call.
Pattern analysis and the identification of unusual patterns: Fraudulent activity costs the telecommunication industry millions of dollars per year.
Multidimensional association and sequential pattern analysis: The discovery of association and sequential patterns in multidimensional analysis can be used to promote telecommunication services.
Mobile telecommunication services: Mobile telecommunications web and information services and mobile computing are becoming increasingly integrated and common in our work and life
Use of visualization tools in telecommunication data analysis: Tools for OLAP visualization, linkage visualization, association visualization, clustering, and outlier visualization have been shown to be very useful for telecommunication data analysis.
(iv)Data mining for biological data analysis
Semantic integration of heterogeneous distributed genomic and proteomic databases: Genomic and proteomic data sets are often generated at different labs and by different methods.
Alignment indexing similarity search and comparative analysis of multiple nucleotide/protein sequences: Various biological sequence alignment methods have been developed in the past two decades.
Discovery of structural patterns and analysis of genetic networks and protein pathways: In biology, protein sequences are folded into three dimensional structures, and such structures interact with each other based on their relative positions and the distance between them.
Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development.
Visualization tools in genetic data analysis: Alignments among genomic or proteomic sequences and the interactions among complex biological structures are most effectively presented in graphic forms, transformed into various kinds of easy-to-understand visual displays
(v)Data mining in other scientific application
Data warehouses and data preprocessing: Data warehouses are critical for information exchange and data mining. In the area of geospatial data, however, no true geospatial data warehouses exist today.
Mining complex data types: Scientific data sets are heterogeneous in nature, typically involving semi-structured and unstructured data, such as multimedia data and georeferenced stream data. Robust methods are needed for handling spatiotemporal data.
Graph based mining: It is often difficult or impossible to model several physical phenomena and processes due to limitations of existing modeling approaches. Alternatively, labeled graphs may be used to capture many of the spatial topological geometric and other relational characteristics.
Visualization tools and domain specific knowledge: High-level graphical user interfaces and visualization tools are required for scientific data mining systems.
(vi)Data mining for Intrusion Detection
Development of data mining algorithms for intrusion detection: Data mining algorithms can be used for misuse detection and anomaly detection. In misuse detection training data are labeled either normal or intrusion.
Association and correlation analysis, and aggregation to help select and build discriminating attributes: Association and correlation mining can be applied to find relationships between system attributes describing the network data.
Analysis of stream data: Due to the transient and dynamic nature of intrusions and malicious attacks, it is crucial to perform intrusion detection in the data stream environment.
Distributed data mining: Intrusions can be launched from several different locations and targeted to many different destinations. Distributed data mining methods may be used to analyze network data from several network locations in order to detect these distributed attacks
Visualization and querying tools: Visualization tools should be available for viewing any anomalous patterns detected. Such tools may include features for viewing associations clusters and outliers