Clustering and Indexing in High-dimensional spaces

Page 1: Clustering and Indexing in  High-dimensional spaces

Clustering and Indexing in High-dimensional spaces

Page 2: Clustering and Indexing in  High-dimensional spaces

Outline

• CLIQUE

• GDR and LDR

Page 3: Clustering and Indexing in  High-dimensional spaces

CLIQUE (Clustering In QUEst)

• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space

• CLIQUE can be considered both density-based and grid-based:

– It partitions each dimension into the same number of equal-length intervals

– It thereby partitions an m-dimensional data space into non-overlapping rectangular units

– A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter

– A cluster is a maximal set of connected dense units within a subspace
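The grid-and-density idea above can be sketched in a few lines. This is a minimal illustration, not CLIQUE itself; the function and parameter names (`dense_units`, `n_intervals`, `tau`) are ours:

```python
from collections import Counter

def dense_units(points, n_intervals, tau):
    """Partition each dimension into n_intervals equal-length intervals,
    count points per grid cell, and return the cells whose fraction of
    all points exceeds the density threshold tau."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]

    def cell(p):
        # Interval id per dimension; clamp the max value into the last interval.
        return tuple(
            min(int((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) * n_intervals),
                n_intervals - 1)
            for d in dims)

    counts = Counter(cell(p) for p in points)
    return {c for c, n in counts.items() if n / len(points) > tau}
```

Connected dense cells (cells adjacent along some dimension) would then be merged into clusters.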

Page 4: Clustering and Indexing in  High-dimensional spaces

CLIQUE: The Major Steps

• Partition the data space and find the number of points that lie inside each cell of the partition.

• Identify the subspaces that contain clusters, using the Apriori principle.

• Identify clusters:

– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest

• Generate minimal descriptions for the clusters:

– Determine maximal regions that cover each cluster of connected dense units
– Determine a minimal cover for each cluster
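The Apriori step can be illustrated as follows: a k-dimensional subspace is a candidate only if every one of its (k-1)-dimensional sub-subspaces already contains dense units. This is a sketch; the frozenset representation is ours:

```python
from itertools import combinations

def candidate_subspaces(dense_subspaces, k):
    """Apriori pruning: generate k-dim candidate subspaces (frozensets
    of dimension ids) from the (k-1)-dim subspaces known to contain
    dense units. A candidate survives only if every (k-1)-dim subset
    is itself dense."""
    dims = sorted({d for s in dense_subspaces for d in s})
    return [frozenset(c) for c in combinations(dims, k)
            if all(frozenset(c) - {d} in dense_subspaces for d in c)]
```

Only the surviving candidates need their grid cells scanned for dense units, which is what makes the bottom-up search tractable.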

Page 5: Clustering and Indexing in  High-dimensional spaces

[Figure: grid partitions of Salary (×10,000) vs. age and Vacation (weeks) vs. age, for ages 20–60; dense units in each 2-d subspace intersect around ages 30–50, with density threshold τ = 3.]

Page 6: Clustering and Indexing in  High-dimensional spaces

Strength and Weakness of CLIQUE

• Strength

– It automatically finds the subspaces of highest dimensionality such that high-density clusters exist in those subspaces

– It is insensitive to the order of records in the input and does not presume any canonical data distribution

– It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases

• Weakness

– The accuracy of the clustering result may be degraded for the sake of the method's simplicity

Page 7: Clustering and Indexing in  High-dimensional spaces

High Dimensional Indexing Techniques

• Index trees (e.g., X-tree, TV-tree, SS-tree, SR-tree, M-tree, Hybrid Tree)

– Sequential scan performs better at high dimensionality (the "dimensionality curse")

• Dimensionality reduction (e.g., Principal Component Analysis (PCA)), then build index on reduced space
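A minimal sketch of the reduction route: run PCA over the whole dataset and keep the top-k components (numpy-based; names like `gdr_reduce` are illustrative, not from the slides):

```python
import numpy as np

def gdr_reduce(X, k):
    """Global dimensionality reduction: project all points onto the
    top-k principal components of the full dataset. The reduced points
    would then be inserted into a k-dimensional index."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # eigh returns eigenvalues in ascending order; take the last k vectors.
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pcs = vecs[:, -k:][:, ::-1]   # top-k PCs, strongest first, shape (D, k)
    return Xc @ pcs, mean, pcs
```

When the data is perfectly correlated along one direction, the 1-d projection preserves pairwise distances exactly; the next slide shows why this breaks down when correlations are only local.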

Page 8: Clustering and Indexing in  High-dimensional spaces

Global Dimensionality Reduction (GDR)

[Figure: two datasets, each with its First Principal Component (PC); in the second, the points are not globally correlated.]

• Works well only when the data is globally correlated

• Otherwise too many false positives result in high query cost

• Solution: find local correlations instead of a global correlation

Page 9: Clustering and Indexing in  High-dimensional spaces

Local Dimensionality Reduction (LDR)

[Figure: GDR vs. LDR on two clusters; GDR uses a single first PC for all points, while LDR uses the first PC of Cluster1 and the first PC of Cluster2 separately.]

Page 10: Clustering and Indexing in  High-dimensional spaces

Correlated Cluster

[Figure: a correlated cluster with its first PC (retained dim.), second PC (eliminated dim.), the mean of all points in the cluster, and the centroid of the cluster (the projection of the mean on the eliminated dim.).]

A set of locally correlated points = <PCs, subspace dim, centroid, points>

Page 11: Clustering and Indexing in  High-dimensional spaces

Reconstruction Distance

[Figure: cluster centroid with first PC (retained dim) and second PC (eliminated dim); a point Q, its projection on the eliminated dim, and ReconstructionDistance(Q,S).]

Page 12: Clustering and Indexing in  High-dimensional spaces

Reconstruction Distance Bound

[Figure: centroid, first PC (retained dim), second PC (eliminated dim), and a band of width MaxReconDist on either side of the retained subspace.]

ReconDist(P, S) ≤ MaxReconDist, for all P in S
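In code, the reconstruction distance is simply the norm of a point's component in the eliminated dimensions. This sketch assumes `retained_pcs` has orthonormal columns; the names are ours:

```python
import numpy as np

def recon_dist(q, centroid, retained_pcs):
    """ReconDist(Q, S): distance between Q and its projection onto the
    retained subspace of cluster S, i.e. the part of Q that falls in
    the eliminated dimensions. retained_pcs is a (D, d) matrix whose
    columns are the retained principal components (orthonormal)."""
    qc = q - centroid
    in_subspace = retained_pcs @ (retained_pcs.T @ qc)  # retained component
    return float(np.linalg.norm(qc - in_subspace))
```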

Page 13: Clustering and Indexing in  High-dimensional spaces

Other constraints

• Dimensionality bound: a cluster must not retain any more dimensions than necessary, and the subspace dimensionality must be ≤ MaxDim

• Size bound: the number of points in the cluster must be ≥ MinSize

Page 14: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 1: Construct Spatial Clusters

• Choose a set of well-scattered points as centroids (piercing set) from random sample

• Group each point P in the dataset with its closest centroid C if Dist(P,C) does not exceed a distance threshold

Page 15: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 2: Choose PCs for each cluster

• Compute PCs

Page 16: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 3: Compute Subspace Dimensionality

[Chart: fraction of points obeying the reconstruction-distance bound (0–1) vs. #dims retained (0–16).]

• Assign each point to cluster that needs min dim. to accommodate it

• Subspace dim. for each cluster is the min # dims to retain to keep most points
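The choice sketched by the curve above can be coded as: keep adding dimensions until the fraction of points obeying the reconstruction-distance bound reaches a target. The threshold value here is illustrative:

```python
def subspace_dim(fracs, threshold=0.9):
    """fracs[i] = fraction of cluster points obeying the reconstruction-
    distance bound when retaining i+1 dimensions (a non-decreasing
    curve). Return the smallest number of dimensions whose fraction
    reaches the threshold."""
    for i, frac in enumerate(fracs):
        if frac >= threshold:
            return i + 1
    return len(fracs)  # fall back to retaining everything
```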

Page 17: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 4: Recluster points

• Assign each point P to the cluster S such that ReconDist(P,S) ≤ MaxReconDist

• If multiple such clusters, assign to first cluster (overcomes “splitting” problem)


Page 18: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 5: Map points

• Eliminate small clusters

• Map each point to subspace (also store reconstruction dist.)


Page 19: Clustering and Indexing in  High-dimensional spaces

Clustering Algorithm Step 6: Iterate

• Iterate for more clusters as long as new clusters are being found among outliers

• Overall complexity: 3 passes over the data, O(ND²K) for N points, D dimensions, and K clusters

Page 20: Clustering and Indexing in  High-dimensional spaces

Experiments (Part 1)

• Precision Experiments:

– Compare information loss in GDR and LDR for the same reduced dimensionality

– Precision = |Orig. Space Result| / |Reduced Space Result| (for range queries)

– Note: precision measures efficiency, not answer quality
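The precision measure reduces to one line (sets stand in for query result id-lists; since the reduced-space result is a superset of the true result, precision is at most 1):

```python
def precision(orig_result, reduced_result):
    """Precision of a range query answered via the reduced space:
    |original-space result| / |reduced-space result|. Lower precision
    means more false positives to discard in post-processing."""
    return len(orig_result) / len(reduced_result)
```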

Page 21: Clustering and Indexing in  High-dimensional spaces

Datasets

• Synthetic dataset:

– 64-d data, 100,000 points; the generator creates clusters in different subspaces (cluster sizes and subspace dimensionalities follow a Zipf distribution) and adds noise

• Real dataset:

– 64-d data (8×8 color histograms extracted from 70,000 images in the Corel collection), available at http://kdd.ics.uci.edu/databases/CorelFeatures

Page 22: Clustering and Indexing in  High-dimensional spaces

Precision Experiments (1)

[Charts: sensitivity of precision (0–1) to skew in cluster size (0–2) and to the number of clusters (1–10); series: LDR, GDR.]

Page 23: Clustering and Indexing in  High-dimensional spaces

Precision Experiments (2)

[Charts: sensitivity of precision (0–1) to the degree of correlation (0–0.2) and to the reduced dimensionality (7–42); series: LDR, GDR.]

Page 24: Clustering and Indexing in  High-dimensional spaces

Index structure

Root containing pointers to the root of each cluster index (also stores PCs and subspace dim.)

[Figure: the root points to an index on Cluster 1 through an index on Cluster K, plus a set of outliers (no index: sequential scan).]

Properties: (1) disk-based; (2) height = 1 + height(original space index); (3) almost balanced

Page 25: Clustering and Indexing in  High-dimensional spaces

Cluster Indices

• For each cluster S, build a multidimensional index on a (d+1)-dimensional space instead of the d-dimensional space:

– NewImage(P,S)[j] = projection of P along the jth PC, for 1 ≤ j ≤ d
– NewImage(P,S)[d+1] = ReconDist(P,S)

• Better estimate: D(NewImage(P,S), NewImage(Q,S)) ≥ D(Image(P,S), Image(Q,S)), i.e. a tighter lower bound on the true distance

• Correctness (Lower Bounding Lemma): D(NewImage(P,S), NewImage(Q,S)) ≤ D(P,Q)
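NewImage and the lower-bounding property can be sketched directly from the definition (assuming orthonormal retained PCs; the helper names are ours):

```python
import numpy as np

def new_image(p, centroid, retained_pcs):
    """(d+1)-dim image of point p for cluster S: its d projections on
    the retained PCs, plus ReconDist(p, S) as an extra coordinate.
    retained_pcs is a (D, d) matrix with orthonormal columns."""
    pc_coords = retained_pcs.T @ (p - centroid)
    residual = (p - centroid) - retained_pcs @ pc_coords
    return np.append(pc_coords, np.linalg.norm(residual))
```

Because the new-image distance never exceeds the true distance between two points of the same cluster, searching the reduced-space index produces no false dismissals; only false positives remain, to be filtered in post-processing.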

Page 26: Clustering and Indexing in  High-dimensional spaces

Effect of Extra dimension

[Chart: I/O cost (#random disk accesses, 0–1000) vs. reduced dimensionality (12–34), comparing the d-dim index with the (d+1)-dim index.]

Page 27: Clustering and Indexing in  High-dimensional spaces

Outlier Index

• Retain all dimensions

• May build an index, else use sequential scan (we use sequential scan for our experiments)

Page 28: Clustering and Indexing in  High-dimensional spaces

Query Support

• Correctness:– Query result same as original space index

• Point query, Range Query, k-NN query– similar to algorithms in multidimensional index structures

– see paper for details

• Dynamic insertions and deletions– see paper for details

Page 29: Clustering and Indexing in  High-dimensional spaces

Experiments (Part 2)

• Cost Experiments:

– Compare linear scan, Original Space Index (OSI), GDR, and LDR in terms of I/O and CPU costs. We used the hybrid tree index structure for OSI, GDR, and LDR.

• Cost Formulae:

– Linear Scan: I/O cost (#rand accesses) = file_size/10, CPU cost
– OSI: I/O cost = num index nodes visited, CPU cost
– GDR: I/O cost = index cost + post-processing cost (to eliminate false positives), CPU cost
– LDR: I/O cost = index cost + post-processing cost + outlier_file_size/10, CPU cost

Page 30: Clustering and Indexing in  High-dimensional spaces

I/O Cost (#random disk accesses)

[Chart: I/O cost comparison, #random disk accesses (0–3000) vs. reduced dim (7–60); series: LDR, GDR, OSI, Lin Scan.]

Page 31: Clustering and Indexing in  High-dimensional spaces

CPU Cost (only computation time)

[Chart: CPU cost comparison, CPU cost in seconds (0–80) vs. reduced dim (7–42); series: LDR, GDR, OSI, Lin Scan.]

Page 32: Clustering and Indexing in  High-dimensional spaces

Conclusion

• LDR is a powerful dimensionality reduction technique for high-dimensional data

– reduces dimensionality with lower loss in distance information compared to GDR
– achieves significantly lower query cost compared to linear scan, the original space index, and GDR

• LDR has applications beyond high-dimensional indexing