51
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 2012 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 1 / 51

Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Unsupervised Data Mining(Clustering)

Javier Béjar

KEMLG

December 2012

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 1 / 51

Page 2: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Introduction

Clustering in KDD

One of the main tasks in the KDD process is the analysis of datawhen we do not know its structureThis task is very different from the task of prediction in which weknow the true answer and we try to approximate itA large number of KDD projects involve unsupervised problems(KDNuggets poll, 2-3 most frequent task)Problems: Scalability, Arbitrary cluster shapes, New types of data,Parameters fitting, ...

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 2 / 51

Page 3: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Introduction

Clustering in DM

Clustering in DM

Dimensio-nality

Reduction

FeatureSelection Feature

Transforma-tion

ClusteringAlgorithms

ModelBased

DensityBased

Grid Based

Agglomera-tive

Other

Strategies

Sampling

One pass

Approxima-tion

DataCom-

pression

Paralelization

Data

UnstructuredData

StructuredData

DataStreams

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 3 / 51

Page 4: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining The data Zoo

All kinds of data

DataUnstructuredData

StructuredData

DataStreams

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 4 / 51

Page 5: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining The data Zoo

Unstructured data

Only one table of observationsEach example represents an instanceof the problemEach instance is represented by a setof attributes (Discrete, continuous)

A B C · · ·1 3.1 a · · ·1 5.7 b · · ·0 -2.2 b · · ·1 -9.0 c · · ·0 0.3 d · · ·1 2.1 a · · ·...

...... . . .

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 5 / 51

Page 6: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining The data Zoo

Structured data

One sequential relation amonginstances (Time, Strings)

Several instances with internalstructure (eg: sequences ofevents)Subsequences of unstructuredinstances (eg: sequences ofcomplex transactions)One big instance (eg: timeseries)

Several relations amonginstances (graphs, trees)

Several instances with internalstructure (eg: XMLdocuments)One big instance (eg: socialnetwork)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 6 / 51

Page 7: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining The data Zoo

Data Streams

Endless sequence of data (eg:sensor data)

Several streams synchronizedUnstructured instancesStructured instances

Static/Dynamic model

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 7 / 51

Page 8: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

I am in Hyperspace

DimensionalityReduction

FeatureTransforma-

tion

FeatureSelection

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 8 / 51

Page 9: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

The Curse of dimensionality

Two problems arise from the dimensionality of a datasetThe computational cost of processing the data (scalability of thealgorithms)The quality of the data (more probability of bad data)

Two elements define the dimensionality of a datasetThe number of examplesThe number of attributes

Having too many examples can sometimes be solved by samplingAttribute reduction is a more complex problem

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 9 / 51

Page 10: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

Reducing attributes

Usually the number of attributes of the dataset has an impact on theperformance of the algorithms:

Because their poor scalability (cost is a function of the number ofattributes)Because the inability to cope with irrelevant/noisy/redundant attributesBecause the low ratio instances/dimensions (sparsity of space)

Two main methodologies to reduce attributes from a datasetTransforming the data to a space of less dimensions preserving thestructure of the data (dimensionality reduction - feature extraction)Eliminating the attributes that are not relevant for the goal task(feature subset selection)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 10 / 51

Page 11: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

Dimensionality reduction

Many techniques have been developed for this purposeProjection to a space that preserve the statistical model of the data(Linear: PCA, ICA; Non linear: Kernel-PCA)Projection to a space that preserves distances among the data (Linear:Multidimensional scaling, random projection, SVD, NMF; Non linear:ISOMAP, LLE, t-SNE)

Not all this techniques are scalable per se (O(N2) to O(N3) timecomplexity), but some can be approximated

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51

Page 12: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

Original (60 Dim) PCA

Kernel PCA ISOMAP

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 12 / 51

Page 13: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Dimensionality Reduction

Unsupervised Attribute Selection

The goal is to eliminate from the dataset all the redundant orirrelevant attributesThe original attributes are preservedThe methods for Unsupervised Attribute Selection are less developedthan in Supervised Attribute SelectionThe problem is that an attribute can be relevant or not depending onthe goal of the discovery processThere are mainly two techniques for attribute selection: Wrapping andFiltering

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 13 / 51

Page 14: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

All kinds of algorithms

ClusteringAlgorithms

Model Based

DensityBased

Grid Based

Agglomera-tive

Other

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 14 / 51

Page 15: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Model/Center Based Algorithms

The algorithm is usually biased by a preferred model (hyperspheres,hyper elipsoids, ...)Clustering defined as an optimization problemAn objective function is optimized (Distorsion, loglikelihood, ...)Based on gradient descent-like algorithms

k-means and variantsProbabilistic mixture models/EM algorithms (eg: Gaussian Mixture)

Scalability depends on the number of parameters, size of the datasetand iterations until convergence

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 15 / 51

Page 16: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Model Based Algorithms

1

11 1

1

22

2

2

22

22

1

1

11

2

1

12

1

11 1

1

2

2 2

2

22

2

2

22

22

1

11 1

1

2

22

2

2

22

22

2

1

1

1

1

11 1

1

22

2

2

22

22

1

1

11

2

1

1

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 16 / 51

Page 17: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Agglomerative Algorithms

Based on the hierarchical aggregation of examplesFrom spherical to arbitrary shaped clustersDefined over the affinity/distance matrixAlgorithms based on graphs/matrix algebraSeveral criteria for the aggregationScalability is cubic on the number of examples (O(n2 log(n)) usingheaps, O(n2) in special cases)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 17 / 51

Page 18: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Agglomerative Algorithms

−1 0 1 2 3 4 5

−2

02

46

8

x1

x2

●●

● ●

12 10 8 6 4 2 0

151723192224182116201014118126351314279

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 18 / 51

Page 19: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Density Based Algorithms

Based on the discovery of areas of larger densityArbitrary shaped clustersNoise/outliers tolerantAlgorithms based on:

Topology (DBSCAN-like: core points, reachability)Kernel density estimation (DENCLUE, mean shift)

Scalability depends on the size of the dataset and the computation ofthe density (KDE, k-nearest neighbours) (best complexityO(n log(n)))

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 19 / 51

Page 20: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Density Based Algorithms - DBSCAN

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 20 / 51

Page 21: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Density Based Algorithms - DENCLUE

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 21 / 51

Page 22: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Grid Based Algorithms

Based on the discovery of areas of larger density by space partitioningArbitrary shaped clustersNoise/outliers tolerantAlgorithms based on:

Fixed/adaptative grid partitioning of featuresHistogram mode discovery and dimension merging

Scalability depends on the number of features, grid granularity andnumber of examples

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 22 / 51

Page 23: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Grid Based Algorithms

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 23 / 51

Page 24: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Other Algorithms

Based on graph theoryK-way graph partitioningMax-flow/min-flow algorithmsSpectral clustering

Message passing/belief propagation (affinity clustering)Evolutionaty algorithmsSVM clustering/Maximum margin clustering· · ·

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 24 / 51

Page 25: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Clustering Algorithms

Other Clustering Related Areas

Consensus clusteringClustering by consensuating several partitions

Semisupervised clustering/clustering with constraintsClustering with additional information (must and cannot links)

Subspace clusteringClusters with reduced dimensionality

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 25 / 51

Page 26: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Size matters :-)

Strategies

Sampling

One pass

Approxima-tion

Datacompression

Paralelization

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 26 / 51

Page 27: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Strategies for cluster scalability

One-passProcess data as a stream

Summarization/data compressionCompress examples to fit more data in memory

Sampling/Batch algorithmsProcess a subset of the data and maintain/compute a global model

ApproximationAvoid expensive computations by approximate estimation

Paralelization/DistributionDivide the task in several parts and merge models

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 27 / 51

Page 28: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

One pass

This strategy is based on incremental clustering algorithmsThey are cheap but order of processing affects greatly their qualityAlthough can be used as a preprocessing stepTwo steps algorithms

1 A large number of clusters is generated using the one-pass algorithm2 A more accurate algorithm clusters the preprocessed data

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 28 / 51

Page 29: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Data Compression/Summarization

Discard sets of examples and summarize by:Sufficient statisticsDensity approximations

Discard data irrelevant for the model (do not affect the result)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 29 / 51

Page 30: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Approximation

Not using all the information available to make decisionsUsing K-neighbours (data structures for computing k-neighbours)

Preprocessing the data using a cheaper algorithmGenerate batches using approximate distances (eg: canopy clustering)

Use approximate data structuresUse of hashing or approximate counts for distances and frequencycomputation

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 30 / 51

Page 31: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Batches/Sampling

Process only data that fits in memoryObtain from the data set:

Samples (process only a subset of the dataset)Determine the size of the sample so all the clusters are represented

Batches (process all the dataset)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 31 / 51

Page 32: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Paralelization/Distribution/Divide&Conquer

Paralelization of clustering usually depends on the specific algorithmSome are difficult to parallelize (eg: hierarchical clustering)Some have specific parts that can be solved in parallel or byDivide&Conquer

Distance computations in k-meansParameter estimation in EM algorithmsGrid density estimationsSpace partitioning

Batches and sampling are more general approachesThe problem is how to merge all the different partitions

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 32 / 51

Page 33: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - One pass + Summarization

One pass + hierarchical single linkOne pass summarization using Leader algorithm (many clusters)Single link hierarchical clustering of the summariesTheoretical results about equivalence to SL at top levels (from O(N2)to O(c2))

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 33 / 51

Page 34: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

One pass + Single Link

1st Phase

2nd Phase

Leader Algorithm

Hierarchical Clustering

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 34 / 51

Page 35: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - One pass + Summarization

One pass + model based algorithmsBIRCH algorithmOne-pass Leader algorithm using an adaptive hierarchical datastructure (CF-tree)Postprocess for noise filteringClustering of cluster centers from CF-tree

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 35 / 51

Page 36: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

One pass + CFTREE (BIRCH)1st Phase - CFTree 2nd Phase - Kmeans

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 36 / 51

Page 37: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Batch + (Divide&Conquer / One-pass)

Batch + K-means-likeSaveSpace algorithm (hierarchy of centers)Clustering using a randomized algorithm based on k-facilities problem(LSEARCH algorithm)Data is divided on M batchesEach batch is clustered in k weighted centersRecursivelly M · k centers are divided on new batches for the next levelLast level is clustered in k centers

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 37 / 51

Page 38: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Batch + Divide&Conquer

M Batches - K Clusters each

M*K new examples

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 38 / 51

Page 39: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Sampling + Summarization

Random Samples + Data CompressionScalable K-meansRetrieve a sample from the database that fits in memoryCluster with the current modelDiscard data close to the centersCompress high density areas using sufficient statistics

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 39 / 51

Page 40: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Sampling + Summarization

Original Data Sampling Updating model

Compression More data is added Updating model

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 40 / 51

Page 41: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Sampling + Distribution

CUREDraws a random samples from the datasetPartitions the sample in p groupsExecutes the clustering algorithm on each partitionDeletes outliersRuns the clustering algorithm on the union of all groups until it obtainsk groups

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 41 / 51

Page 42: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Sampling + DistributionSampling+Partition Clustering partition 1

Clustering partition 2 Join partitions Labelling data

DATA

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 42 / 51

Page 43: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Approximation + Distribution

Canopy ClusteringApproximate dense areas by probing the space of examplesBatches are formed by extracting spherical volumes (canopies) using acheap distance

Inner radius/Outer radiusEach canopy is clustered separately (disjoint clusters)Only distances inside the canopy are computedSeveral strategies for classical algorithms

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 43 / 51

Page 44: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Approximation + Distribution

1st Canopy 2nd Canopy 3rd Canopy

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 44 / 51

Page 45: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Divide&Conquer

K-means + indexingBuild a kd-tree structure for the examplesDetermine the example-center assignment in parallelEach branch of the kd-tree only keeps a subset of centersIf only one candidate in a branch → no distances computedDistances only are computed at leavesEffective for low dimensionality

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 45 / 51

Page 46: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Divide&Conquer - K-means + KD-tree

KD-TREE

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 46 / 51

Page 47: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Divide&Conquer

OptigridFind projections of the data in lower dimensionalityDetermine the best cuts of the projectionsBuild a grid with the cutsRecurse for dense cells (paralellization)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 47 / 51

Page 48: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Divide&Conquer - Grid Clustering

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 48 / 51

Page 49: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Examples - Approximation

Quantization Cluster (k-means)Quantize the space of examples and keep dense cellsApproximate cluster assignments using cells as examplesRecompute centers with examples in the cells

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 49 / 51

Page 50: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

Divide&Conquer + approximation

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 50 / 51

Page 51: Unsupervised Data Mining (Clustering)mmartin/DMClustering.pdfJavier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 11 / 51. Clustering in Data Mining Dimensionality

Clustering in Data Mining Scalability strategies

The END

THANKS(Questions, Comments, ...)

Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 2012 51 / 51