UNIT V
CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
Cluster Analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster Analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering
Clustering is an example of unsupervised learning
Clustering is a form of learning by observation, rather than learning by examples
The following are typical requirements of clustering in data mining
Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects. However, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based data. However, applications may require clustering other types of data, such as binary, categorical, and ordinal data, or mixtures of these data types
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis. The clustering results can be quite sensitive to input parameters.
Ability to deal with noisy data: Most real world databases contain outliers or missing unknown or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data into existing clustering structures and instead must determine a new clustering from scratch.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low dimensional data, involving only two to three dimensions.
Constraint based clustering: Real world applications may need to perform clustering under various kinds of constraints. For example, to choose locations for new automatic banking machines in a city, you may cluster households while considering constraints such as the city's rivers and highway networks.
Interpretability and usability: Users expect clustering results to be interpretable comprehensible and usable. That is, clustering may need to be tied to specific semantic interpretations and applications.
Types of Data in cluster analysis
Main memory-based clustering algorithms typically operate on the following two data structures
Data matrix: This represents n objects, such as persons, with p variables, such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n by p matrix
Dissimilarity matrix: This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table
where d(i,j) is the measured difference or dissimilarity between objects i and j
The rows and columns of the data matrix represent different entities, while those of the dissimilarity matrix represent the same entity. Thus, the data matrix is often called a two mode matrix, whereas the dissimilarity matrix is called a one mode matrix. Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, it can first be transformed into a dissimilarity matrix before applying such clustering algorithms
(i) Interval Scaled variable
Dissimilarity between objects described by interval-scaled variables is commonly computed using distance measures. These measures include the Euclidean, Manhattan, and Minkowski distances
Standardize the data:
1. Calculate the mean absolute deviation:
sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
where mf = (x1f + x2f + ... + xnf)/n is the mean of variable f
2. Calculate the standardized measurement (z-score):
zif = (xif - mf) / sf
Using the mean absolute deviation is more robust than using the standard deviation: because the deviations are not squared, the effect of outliers is somewhat reduced, so outliers remain detectable. The median absolute deviation is even more robust, but with it the z-scores of outliers become so small that the outliers effectively disappear.
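As a sketch, the standardization above can be coded as follows (function and variable names are our own, not from the text):

```python
def standardize(values):
    """Standardize one interval-scaled variable using the mean absolute
    deviation, following the formulas above."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

z = standardize([150.0, 160.0, 170.0, 180.0, 190.0])   # e.g. heights in cm
```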
Similarity and Dissimilarity
Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular distances are based on the Minkowski distance:
d(i,k) = [|xi1 - xk1|^p + |xi2 - xk2|^p + ... + |xin - xkn|^p]^(1/p)
p = 1: Manhattan distance
d(i,k) = |xi1 - xk1| + |xi2 - xk2| + ... + |xin - xkn|
p = 2: Euclidean distance
d(i,k) = [|xi1 - xk1|^2 + |xi2 - xk2|^2 + ... + |xin - xkn|^2]^(1/2)
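A minimal implementation of these distances (the function name is our own choice):

```python
def minkowski(x, y, p):
    """Minkowski distance between two n-dimensional objects x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(x, y, 1)   # p = 1
euclidean = minkowski(x, y, 2)   # p = 2
```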
(ii) Binary variables
Contingency table for binary data
A binary variable has only two states: 0 or 1, where 0 means that the variable is absent and 1 means that it is present. A contingency table for two objects i and j summarizes their binary variables by four counts: q variables equal 1 for both objects, r equal 1 for i but 0 for j, s equal 0 for i but 1 for j, and t equal 0 for both; p = q + r + s + t is the total number of variables.
A binary variable is symmetric if its two states are equally important, and asymmetric if they are not, such as the positive and negative outcomes of a disease test.
Distance measure for symmetric binary variables (symmetric binary dissimilarity):
d(i,j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables (asymmetric binary dissimilarity, in which the negative matches t are dropped):
d(i,j) = (r + s) / (q + r + s)
Asymmetric binary similarity:
sim(i,j) = q / (q + r + s) = 1 - d(i,j)
This is also called the Jaccard coefficient
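The contingency-table counts and both dissimilarity measures can be sketched as follows (names and the example objects are our own):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Dissimilarity of two objects described by 0/1 variables, from the
    contingency-table counts q, r, s, t defined above."""
    q = sum(1 for a, b in zip(i, j) if (a, b) == (1, 1))
    r = sum(1 for a, b in zip(i, j) if (a, b) == (1, 0))
    s = sum(1 for a, b in zip(i, j) if (a, b) == (0, 1))
    t = sum(1 for a, b in zip(i, j) if (a, b) == (0, 0))
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)      # asymmetric: negative matches dropped

jack = [1, 0, 1, 0, 0, 0]            # hypothetical test results
mary = [1, 0, 1, 0, 1, 0]
d_asym = binary_dissimilarity(jack, mary, symmetric=False)
jaccard = 1 - d_asym                 # Jaccard coefficient
```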
(iii) Categorical ordinal and Ratio scaled variables
Categorical variables
A categorical variable is a generalization of the binary variable in that it can take on more than two states.
Let the number of states of a categorical variable be M. The states can be denoted by letter symbols or a set of integers.
Ordinal variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal variable are ordered in a meaningful sequence.
Ordinal variables are very useful for registering subjective assessments of qualities that cannot be measured objectively.
1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, ..., Mf. Replace each xif by its corresponding rank rif in {1, ..., Mf}.
2. Since each ordinal variable can have a different number of states, it is often necessary to map the range of each variable onto [0.0, 1.0] so that each variable has equal weight. This is done by replacing the rank rif by zif = (rif - 1)/(Mf - 1).
3. Dissimilarity can then be computed using any of the distance measures described for interval scaled variables
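The rank-and-scale mapping above can be sketched as (function name and example states are our own):

```python
def ordinal_to_interval(values, states):
    """Map ordinal values onto [0.0, 1.0]: replace each value by its
    rank r in the ordered state list, then scale by (r - 1)/(M - 1)."""
    M = len(states)
    rank = {state: i + 1 for i, state in enumerate(states)}
    return [(rank[v] - 1) / (M - 1) for v in values]

z = ordinal_to_interval(["fair", "good", "excellent"],
                        states=["fair", "good", "excellent"])
```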
Ratio scaled variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an approximately exponential scale. There are three methods to handle such variables:
1. Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice, since it is likely that the scale will be distorted.
2. Apply a logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the formula yif = log(xif). The yif values can be treated as interval-valued.
3. Treat xif as continuous ordinal data and treat their ranks as interval-valued.
(iv) Variables of Mixed types
A more preferable approach is to process all variable types together, performing a single cluster analysis. One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0, 1.0]
Vector Objects
In some applications such as information retrieval text document clustering, and biological taxonomy, we need to compare and cluster complex objects containing a large number of symbolic entities. To measure the distance between complex objects it is often desirable to abandon traditional metric distance computation and introduce a nonmetric similarity function.
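One widely used similarity function for such vector objects (our own illustration; the text does not prescribe a specific function) is the cosine measure between term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

s = cosine_similarity([3, 0, 1], [1, 0, 1])   # two toy document vectors
```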
Categorization of Major Clustering Methods
(i) Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k<=n
It classifies the data into k groups, which together satisfy the following requirements
1. Each group must contain at least one object
2. Each object must belong to exactly one group
The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different
(ii) Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects, i.e., the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.
There are two main types of hierarchical clustering:
(a) Agglomerative (bottom-up approach)
Start with the points as individual clusters. At each step, merge the closest pair of clusters until only one cluster is left.
(b) Divisive (top-down approach)
Start with one all-inclusive cluster. At each step, split a cluster until each cluster contains a single point.
To improve the quality of hierarchical clustering:
Perform careful analysis of object linkages at each hierarchical partitioning
Integrate hierarchical agglomeration and other approaches, by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method
(iii) Density based methods
Other clustering methods have been developed based on the notion of density. Such methods can be used to filter out noise and discover clusters of arbitrary shape.
(iv) Grid based Methods
Grid based methods quantize the object space into a finite number of cells that form a grid structure
All of the clustering operations are performed on the grid structure
(v) Model based Methods
Model based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model
A model based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points
Partitioning Methods
Partitioning algorithms construct partitions of a database of n objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function.
These methods are nonhierarchical: each instance is placed in exactly one of k nonoverlapping clusters. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k
The partitioning techniques usually produce clusters by optimizing a criterion function defined either locally or globally. A local criterion such as the minimal mutual neighbor distance forms clusters by utilizing the local structure or context in the data
The most commonly used partitional clustering strategy is based on the square error criterion. The general objective is to obtain the partition that for a fixed number of clusters, minimizes the total square error.
(i) Classical Partitioning methods: k-means and K-medoids
The most well known and commonly used partitioning methods are k-means, k-medoids, and their variations
Centroid based Technique: The k-means method
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when there are no more new assignments
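The four steps can be sketched as follows (a common variant that seeds the centroids with randomly chosen objects; all names are our own):

```python
import random

def kmeans(points, k, iters=100):
    """Minimal k-means sketch for 2-D points, following the steps above."""
    centroids = random.sample(points, k)        # initial seed points
    for _ in range(iters):
        # step 3: assign each object to the cluster with the nearest seed
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # step 2: recompute centroids of the current partition
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                    # step 4: no change -> stop
            break
        centroids = new
    return centroids, clusters

random.seed(0)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```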
Advantages
With a large number of variables, k-means may be computationally faster than hierarchical clustering
K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages
Applicable only when a mean is defined
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k representative objects, the algorithm repeatedly tries to improve the clustering by replacing a representative object with a nonrepresentative object, as long as the quality of the resulting clustering is improved.
Partitioning Methods in Large Databases
CLARANS works as follows:
CLARANS draws a sample with some randomness in each step of the search
If a better neighbor is found, CLARANS moves to the neighbor's node and the process starts again
Once a user-specified number of local minima has been found, the algorithm outputs, as a solution, the best local minimum, that is, the local minimum having the lowest cost
Advantages
More efficient than k-means and k-medoids
Used to detect outliers
Disadvantages
May not find a real local minimum due to the trimming of its search
It assumes that all objects fit into the main memory, and the result is sensitive to input order
Hierarchical Methods
Basically, hierarchical methods group data into a tree of clusters. There are two basic varieties of hierarchical algorithms: agglomerative and divisive. A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering
Agglomerative and Divisive Hierarchical clustering
Agglomerative Hierarchical methods
Begin with as many clusters as objects. Clusters are successively merged until only one cluster remains
Divisive Hierarchical Methods
Begin with all objects in one cluster. Groups are continually divided until there are as many clusters as objects
Steps in Agglomerative Hierarchical Clustering
Compute the proximity matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the proximity matrix
Until only a single cluster remains
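The steps above can be sketched as a single-link agglomerative procedure on one-dimensional points (names and the list-based record of merges are our own choices):

```python
def agglomerative(points):
    """Single-link agglomerative sketch on 1-D points: each point starts
    as its own cluster; repeatedly merge the closest pair, recording
    each merge (a dendrogram in list form)."""
    clusters = [[p] for p in points]       # let each data point be a cluster
    merges = []
    while len(clusters) > 1:
        best = None                        # find the two closest clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)            # update the proximity structure
    return merges

merges = agglomerative([1.0, 2.0, 9.0, 10.0, 30.0])
```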
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic, good clustering, and one or more additional scans can be used to further improve the quality.
Phase 1: BIRCH scans the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data
Phase 2: BIRCH applies a clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones
CURE: Clustering Using Representatives
Steps:
Draw a random sample S of the original objects
Partition sample S into a set of partitions and form a cluster for each partition
Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster
Cluster similarity is the similarity of the closest pair of representative points from different clusters
Shrinking representative points toward the center helps avoid problems with noise and outliers
CURE is better able to handle clusters of arbitrary shapes and sizes
ROCK (Robust clustering using links)
Steps:
Obtain a sample of points from the data set
Compute the link value for each set of points i.e., transform the original similarities into similarities that reflect the number of shared neighbors between points
Assign the remaining points to the clusters that have been found
Density Based Methods
DBSCAN: A density based clustering method based on connected regions with sufficiently high density
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object
Given a set of objects D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
An object p is density-reachable from object q with respect to ε and MinPts in a set of objects D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi
An object p is density-connected to object q with respect to ε and MinPts in a set of objects D, if there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinPts
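Putting these definitions together, a minimal DBSCAN-style sketch (names and structure are our own):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN-style sketch on 2-D points: find core objects and
    grow clusters by chaining density-reachability. Noise is labeled -1."""
    def neighbors(p):
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]

    labels = {p: None for p in points}
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        if len(neighbors(p)) < min_pts:   # p is not a core object
            labels[p] = -1                # tentatively noise
            continue
        labels[p] = cluster               # start a new cluster at core p
        frontier = neighbors(p)
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:           # border point: density-reachable
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbors(q)) >= min_pts:   # q is core: extend the chain
                frontier.extend(neighbors(q))
        cluster += 1
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], eps=1.5, min_pts=3)
```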
OPTICS: Ordering Points to identify the clustering structure
The core-distance of an object p is the smallest ε value that makes p a core object. If p is not a core object, the core-distance of p is undefined
The reachability-distance of an object q with respect to another object p is the greater of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined
GRID BASED CLUSTERING
Using a multi-resolution grid data structure
Several interesting methods
STING( a Statistical Information Grid approach)
Wave cluster
CLIQUE
STING
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
The STING clustering method
Each cell at a high level is partitioned into a number of smaller cells at the next lower level
Statistical information about each cell is calculated and stored beforehand and is used to answer queries
Use a top-down approach to answer spatial data queries
Wave cluster
A multi-resolution clustering approach which applies wavelet transforms to the feature space
Wavelet transforms
Wavelet transform: A signal processing technique that decomposes a signal into different frequency sub-bands
Data are transformed to preserve relative distance between objects at different levels of resolution
Allows natural clusters to become more distinguishable
The wavecluster algorithm
Input parameters
The number of grid cells for each dimension
The wavelet, and the number of applications of the wavelet transform
Major features
Complexity
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Quantization and Transformation
scale 1: high resolution
scale 2: medium resolution
scale 3: low resolution
Model Based Clustering methods
Model based clustering methods attempt to optimize the fit between the given data and some mathematical model.
Expectation Maximization
Make an initial guess for the parameter vector: This involves randomly selecting k objects to represent the cluster means or centers as well as making guesses for the additional parameters
The EM algorithm is simple and easy to implement. In practice it converges fast but may not reach the global optima. Convergence is guaranteed for certain forms of optimization functions.
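As an illustration of the two EM steps, here is a deliberately simplified sketch for a two-component one-dimensional Gaussian mixture with equal weights and fixed unit variance, so that only the two means are estimated (all names are our own):

```python
import math

def em_gmm_1d(data, iters=50):
    """EM sketch for a two-component 1-D Gaussian mixture with equal
    weights and fixed unit variance; only the means are estimated."""
    m1, m2 = min(data), max(data)        # initial guess for the parameters
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            p2 = math.exp(-0.5 * (x - m2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: re-estimate each mean from the weighted points
        m1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return m1, m2

m1, m2 = em_gmm_1d([0.0, 0.2, -0.2, 5.0, 5.2, 4.8])
```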
Conceptual Clustering
Intraclass similarity is the probability P(Ai = vij | Ck). The larger this value is, the greater the proportion of class members that share this attribute-value pair and the more predictable the pair is of class members
Interclass similarity is the probability P(Ck | Ai = vij). The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair and the more predictive the pair is of the class
Neural network approach
The neural network approach is motivated by biological neural networks. Neural networks have several properties that make them popular for clustering
Self organizing feature maps are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self organizing feature maps, after their creator.
Clustering high dimensional Data
Most clustering methods are designed for clustering low dimensional data and encounter challenges when the dimensionality of the data grows really high. This is because when the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered.
Feature transformation methods such as principal component analysis and singular value decomposition transform the data onto a smaller space while generally preserving the original relative distance between objects
CLIQUE: A dimension growth subspace clustering method
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the crowded areas in space
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE a cluster is defined as a maximal set of connected dense units
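The dense-unit test can be sketched for two-dimensional data as follows (grid parameters and names are our own):

```python
from collections import Counter

def dense_units(points, grid_size, threshold):
    """CLIQUE-style density test sketch: quantize 2-D points into grid
    units and keep units whose fraction of all points exceeds the
    density threshold."""
    counts = Counter((int(x // grid_size), int(y // grid_size))
                     for x, y in points)
    n = len(points)
    return {unit for unit, c in counts.items() if c / n > threshold}

units = dense_units([(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 5.0)],
                    grid_size=1.0, threshold=0.5)
```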
Constraint based Cluster analysis
Constraints on individual objects: We can specify constraints on the objects to be clustered, such as a selection condition in a real estate application. This constraint can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering
Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm
Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects
User specified constraints on the properties of individual clusters: A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
Semi supervised clustering based on partial supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints.
Outlier Analysis
Data objects that show significantly different characteristics from the remaining data are declared outliers. The detection and analysis of outliers is called outlier mining
Application
Detects unusual usage of telecommunication services
Detects credit card fraud and criminal activities in E-commerce
Tracking customer behaviors
Medical analysis
Steps for detecting outliers
Define inconsistent data in a given data set
Find an efficient method to mine the outliers
Methods for detecting outliers are
Statistical approach
Distance based approach
Density based local outlier approach
Deviation based approach
Statistical Distribution based outlier detection
The statistical distribution based approach to outlier detection assumes a distribution or probability model for the given data set
Knowledge of the data set parameters
Knowledge of distribution parameters
The expected number of outliers
Distance based outlier Detection
The distance-based outlier mining concept assigns numeric distances to data objects and computes outliers as data objects with relatively larger distances.
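A naive nested-loop sketch of this idea (the parameter names dmin and pct are our own; an object is flagged when more than a fraction pct of the other objects lie farther than dmin from it):

```python
def distance_outliers(points, dmin, pct):
    """Naive distance-based outlier sketch: flag an object when more
    than a fraction pct of the other objects lie farther than dmin."""
    out = []
    for p in points:
        far = sum(1 for q in points if q is not p and abs(p - q) > dmin)
        if far / (len(points) - 1) > pct:
            out.append(p)
    return out

outliers = distance_outliers([1.0, 1.1, 0.9, 1.2, 50.0], dmin=5.0, pct=0.9)
```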
Index based algorithms: Given a data set the index based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around the object.
Cell based Algorithm: The data space is partitioned into cells. The first layer around a cell is one cell thick, while the second is roughly 2√k cells thick (k being the dimensionality), rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis.
cell count- Number of objects in the cell
Cell+1 layer count- Number of objects in the cell + Number of objects in first layer
Cell+2 layers count- Number of objects in the cell+Number of objects in both layers
Density based local outlier detection
Depth-based techniques represent every data object in a k-d space and assign a depth to each object. In density-based local outlier detection, the degree of outlierness is computed as the local outlier factor of an object. It is local in the sense that the degree depends on how isolated the object is with respect to the surrounding neighborhood.
Deviation based outlier detection
Deviation based outlier detection identifies outliers by examining the main characteristics of objects in a group. Objects that deviate from this description are considered outliers.
Techniques for deviation based outlier detection are
Sequential approach
The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects
The key terms this technique uses to assess the dissimilarity between subsets are:
Exception set: This is the set of deviations or outliers. It is defined as the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the residual set
Dissimilarity function: The dissimilarity of a subset is incrementally computed based on the subset prior to it in the sequence.
Cardinality function: This is typically the count of the number of objects in a given set
Smoothing factor: The smoothing factor assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects. This value is scaled by the cardinality of the set.
Data Mining Applications
(i)Data mining for financial data analysis
Design and construction of data warehouses for multidimensional data analysis and data mining: Like many other application, data warehouses need to be constructed for banking and financial data.
Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance
Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.
Detection of money laundering and other financial crimes: To detect money laundering and other financial crimes, it is important to integrate information from multiple databases
(ii)Data mining for the retail industry
Design and construction of data warehouses based on the benefits of data mining: Because retail data cover a wide spectrum there can be many ways to design a data warehouse for this industry.
Multidimensional analysis of sales, customers, products, time, and region: The retail industry requires timely information regarding customer needs, product sales, trends, and fashions, as well as the quality and cost of products.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using advertisements, coupons, and various kinds of discounts and bonuses to promote products and attract customers.
Customer retention-analysis of customer loyalty: With customer loyalty card information one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed systematically.
Product recommendation and cross referencing of items: By mining associations from sales records, one may discover that a customer who buys a digital camera is likely to buy another set of items
(iii)Data mining for the Telecommunication Industry
Multidimensional analysis of telecommunication data: Telecommunication data are intrinsically multidimensional, with dimensions such as calling time, duration, location of caller, location of callee, and type of call.
Pattern analysis and the identification of unusual patterns: Fraudulent activity costs the telecommunication industry millions of dollars per year.
Multidimensional association and sequential pattern analysis: The discovery of association and sequential patterns in multidimensional analysis can be used to promote telecommunication services.
Mobile telecommunication services: Mobile telecommunications web and information services and mobile computing are becoming increasingly integrated and common in our work and life
Use of visualization tools in telecommunication data analysis: Tools for OLAP visualization, linkage visualization, association visualization, clustering, and outlier visualization have been shown to be very useful for telecommunication data analysis.
(iv)Data mining for biological data analysis
Semantic integration of heterogeneous distributed genomic and proteomic databases: Genomic and proteomic data sets are often generated at different labs and by different methods.
Alignment indexing similarity search and comparative analysis of multiple nucleotide/protein sequences: Various biological sequence alignment methods have been developed in the past two decades.
Discovery of structural patterns and analysis of genetic networks and protein pathways: In biology, protein sequences are folded into three dimensional structures, and such structures interact with each other based on their relative positions and the distance between them.
Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development.
Visualization tools in genetic data analysis: Alignments among genomic or proteomic sequences and the interactions among complex biological structures are most effectively presented in graphic forms, transformed into various kinds of easy-to-understand visual displays
(v)Data mining in other scientific application
Data warehouses and data preprocessing: Data warehouses are critical for information exchange and data mining. In the area of geospatial data, however, no true geospatial data warehouses exist today.
Mining complex data types: Scientific data sets are heterogeneous in nature, typically involving semi-structured and unstructured data, such as multimedia data and georeferenced stream data. Robust methods are needed for handling spatiotemporal data.
Graph based mining: It is often difficult or impossible to model several physical phenomena and processes due to limitations of existing modeling approaches. Alternatively, labeled graphs may be used to capture many of the spatial topological geometric and other relational characteristics.
Visualization tools and domain specific knowledge: High-level graphical user interfaces and visualization tools are required for scientific data mining systems.
(vi)Data mining for Intrusion Detection
Development of data mining algorithms for intrusion detection: Data mining algorithms can be used for misuse detection and anomaly detection. In misuse detection training data are labeled either normal or intrusion.
Association and correlation analysis, and aggregation to help select and build discriminating attributes: Association and correlation mining can be applied to find relationships between system attributes describing the network data.
Analysis of stream data: Due to the transient and dynamic nature of intrusions and malicious attacks, it is crucial to perform intrusion detection in the data stream environment.
Distributed data mining: Intrusions can be launched from several different locations and targeted to many different destinations. Distributed data mining methods may be used to analyze network data from several network locations in order to detect these distributed attacks
Visualization and querying tools: Visualization tools should be available for viewing any anomalous patterns detected. Such tools may include features for viewing associations clusters and outliers