
Page 1

Performance guarantees for hierarchical clustering

Sanjoy Dasgupta University of California, San Diego

Philip Long Genomics Institute of Singapore

Page 2

Hierarchical clustering

Recursive partitioning of a data set

[Figure: a tree of nested partitions; each level of the tree is a clustering.]

Page 3

Popular form of data analysis

• No need to specify number of clusters

• Can view data at many levels of granularity, all at the same time

• Simple heuristics for constructing hierarchical clusterings

Page 4

Applications

• Has long been used by biologists and social scientists

• A standard part of the statistician’s toolbox since the 60s or 70s

• Recently: common tool for analyzing gene expression data

Page 5

Performance guarantees

There are many simple greedy schemes for constructing hierarchical clusterings.

But are these resulting clusterings any good?

Or are they pretty arbitrary?

Page 6

One basic problem

In fact, the whole enterprise of hierarchical clustering could use some more justification.


Page 7

An existence question

Must there always exist a hierarchical clustering which is close to optimal at every level of granularity, simultaneously?

[I.e., such that for all k, the induced k-clustering is close to the best k-clustering?]

Page 8

What is the best k-clustering?

The k-clustering problem.

Input: data points in a metric space; k

Output: a partition of the points into k clusters C1,…, Ck, with centers μ1,…, μk

Goal: minimize cost of the clustering

Page 9

Cost functions for clustering

Two cost functions which are commonly used:

Maximum radius (k-center)

max {d(x, μi) : i = 1…k, x in Ci}

Average radius (k-median)

avg {d(x, μi) : i = 1…k, x in Ci}

Both yield NP-hard optimization problems, but have constant-factor approximation algorithms.
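As an illustration, a minimal Python sketch of how these two costs could be evaluated for a fixed assignment of points to centers (the function names and the dictionary-based assignment are my own choices, not from the talk):

```python
import math

def dist(a, b):
    """Euclidean distance; any metric could be substituted."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def k_center_cost(points, centers, assign):
    """Maximum radius: largest distance from any point to its center.
    assign maps each point index to a center index."""
    return max(dist(points[x], centers[i]) for x, i in assign.items())

def k_median_cost(points, centers, assign):
    """Average radius: mean distance from a point to its center."""
    return sum(dist(points[x], centers[i]) for x, i in assign.items()) / len(points)
```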

Page 10

Maximum-radius cost function

Page 11

Our main result

Adopt the maximum-radius cost function.

Our algorithm returns a hierarchical clustering such that for every k, the induced k-clustering is guaranteed to be within a factor of eight of optimal.

Page 12

Standard heuristics

• The standard heuristics for hierarchical clustering are greedy and work bottom-up:

single-linkage, average-linkage, complete-linkage.

• Their k-clusterings can be off by a factor of:

-- at least log2 k (average-, complete-linkage);

-- at least k (single-linkage).

• Our algorithm is similar in efficiency and simplicity, but works top-down.

Page 13

A heuristic for k-clustering

[Hochbaum and Shmoys, 1985]

E.g., k = 4.

[Figure: four greedily chosen centers covering the data at radius R.]

This 4-clustering has cost R ≤ 2 OPT4.
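A minimal Python sketch of the heuristic (illustrative names of mine; dist can be any metric): start from an arbitrary point, then repeatedly add the point farthest from the centers chosen so far.

```python
def greedy_k_centers(points, k, dist):
    """Hochbaum-Shmoys heuristic: each new center is the point
    farthest from all centers chosen so far."""
    centers = [0]  # index of an arbitrary starting point
    # near[i] = distance from point i to its nearest chosen center
    near = [dist(points[0], p) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=near.__getitem__)
        centers.append(nxt)
        near = [min(d, dist(points[nxt], p)) for d, p in zip(near, points)]
    return centers, max(near)  # max(near) is the radius R <= 2 OPT_k
```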

Page 14

Algorithm: step one

Number all points by farthest-first traversal.

[Figure: points numbered by farthest-first traversal, with radii R2, R3, R4, R5, R6 marked.]

For all k, the k-clustering defined by centers {1,2,…,k} has radius Rk+1 ≤ 2 OPTk. (Note: R2 ≥ R3 ≥ … ≥ Rn.)
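Running the same traversal through all n points, rather than stopping at k, yields the numbering and the radii of step one (again a sketch with names of my own):

```python
def farthest_first_numbering(points, dist):
    """Visit all points in farthest-first order.
    order[j] is the (j+1)-st point visited; R[j] is its distance to the
    previously visited points, so R is non-increasing and the first k
    points of the order are centers of a k-clustering for every k."""
    n = len(points)
    order = [0]
    near = [dist(points[0], p) for p in points]  # distance to visited set
    R = [float('inf')]  # R_1 is conventionally infinite
    for _ in range(n - 1):
        nxt = max(range(n), key=near.__getitem__)
        order.append(nxt)
        R.append(near[nxt])
        near = [min(d, dist(points[nxt], p)) for d, p in zip(near, points)]
    return order, R
```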

Page 15

A possible hierarchical clustering

[Figure: a hierarchical clustering of points numbered by farthest-first traversal, with radii R2,…,R10 marked.]

Hierarchical clustering specified by a parent function: π(j) = closest point to j in {1,2,…,j−1}. Note: Rk = d(k, π(k)).
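Given the numbering, the parent function is a direct translation (sketch; indices refer to positions in the traversal order, and names are mine):

```python
def parents(points, order, dist):
    """pi[j] = position (in traversal order) of the point among the
    first j visited that is closest to the (j+1)-st visited point.
    Its distance to that parent is exactly R[j] from the traversal."""
    pi = [None]  # the first point has no parent
    for j in range(1, len(order)):
        pi.append(min(range(j),
                      key=lambda i: dist(points[order[j]], points[order[i]])))
    return pi
```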

Page 16

Algorithm: step two

Divide points into levels of granularity.

Set R = R2, and fix some β > 1.

The jth level has points {i : R/β^j ≥ Ri > R/β^(j+1)}.


Page 17

Algorithm: step two, cont’d


A different parent function: π*(j) = the closest point to j at a lower level of granularity.

Page 18

Algorithm: summary

1. Number the points by farthest-first traversal; note the values Ri = d(i, {1,2,…, i-1}).

2. Choose R = R2.

3. L(0) = {1}; for j > 0, L(j) = {i : R/β^(j−1) ≥ Ri > R/β^j}.

4. If point i is in L(j),

π*(i) = closest point to i in L(0), …, L(j−1).

Theorem: Fix α = 1, β = 2. If the data points lie in a metric space, then for all k simultaneously, the induced k-clustering is within a factor of eight of optimal.
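Putting the steps together, a sketch of the full construction under the summary's conventions (my own code and names; order and R are the outputs of the farthest-first traversal sketched earlier, β defaults to 2, and points are assumed distinct so every Ri > 0):

```python
import math

def hierarchical_parents(points, order, R, dist, beta=2.0):
    """Assign each point a level of granularity and a parent pi*(i):
    the closest point at a strictly coarser level.
    Level 0 = {first point}; level j > 0 holds points i with
    R2/beta**(j-1) >= R_i > R2/beta**j."""
    n = len(order)
    R2 = R[1]

    def level(j):
        # level of the j-th visited point (0-indexed traversal position);
        # assumes distinct points, so R[j] > 0 for j >= 1
        if j == 0:
            return 0
        return 1 + int(math.floor(math.log(R2 / R[j], beta)))

    lv = [level(j) for j in range(n)]
    pi_star = [None]  # the root has no parent
    for j in range(1, n):
        coarser = [i for i in range(j) if lv[i] < lv[j]]
        pi_star.append(min(coarser,
                           key=lambda i: dist(points[order[j]], points[order[i]])))
    return lv, pi_star
```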

Page 19

Randomization trick

Pick γ from the distribution U[0,1]. Set β = e^γ.

Then for all k, the induced k-clustering has expected cost at most 2e ≈ 5.44 times optimal.

Thanks to Rajeev Motwani for suggesting this.
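In code, the trick is a single draw made up front, before the levels are built (a sketch; the function name is mine, and the parametrization follows the slide's statement):

```python
import math
import random

def random_beta():
    """Draw gamma uniformly from [0, 1] and set beta = e**gamma,
    so beta falls in [1, e]; the slide's claim is that the induced
    k-clusterings then cost at most 2e (about 5.44) times optimal
    in expectation."""
    gamma = random.random()
    return math.exp(gamma)
```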

Page 20

What does a constant-factor approximation mean?

Prevent the worst.

Page 21

Standard agglomerative heuristics

1. Initially each point is its own cluster.

2. Repeatedly merge the two “closest” clusters.

Need to define distance between clusters…

Single-linkage: distance between closest pair of points

Average-linkage: distance between centroids

Complete-linkage: distance between farthest pair
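For comparison, a naive sketch of the generic agglomerative loop with the three linkage rules as defined above (my own illustrative code; real implementations maintain distance matrices instead of rescanning every pair):

```python
import math
from itertools import combinations

def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def single_link(A, B):    # distance between closest pair of points
    return min(euclid(a, b) for a in A for b in B)

def complete_link(A, B):  # distance between farthest pair of points
    return max(euclid(a, b) for a in A for b in B)

def average_link(A, B):   # distance between centroids
    cA = [sum(c) / len(A) for c in zip(*A)]
    cB = [sum(c) / len(B) for c in zip(*B)]
    return euclid(cA, cB)

def agglomerate(points, linkage):
    """Start with singletons; repeatedly merge the two closest clusters
    under the given linkage, recording each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return merges
```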

Page 22

Single-linkage clustering

Chaining effect.

[Figure: points 1, …, j, j+1, …, n on a line; the gap between points j and j+1 is 1 − εj.]

The k-clustering will have diameter about n − k, instead of n/k. Therefore: off by a factor of k.
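The chaining effect is easy to reproduce numerically; here is a small sketch using SciPy (the choices of n, k, and ε are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# n points on a line, where the gap between points j and j+1 is 1 - j*eps:
# later gaps are slightly smaller, so single-linkage merges right to left
# and the k-clustering keeps only the k-1 largest (leftmost) gaps as cuts.
n, k, eps = 100, 5, 1e-4
gaps = [1 - j * eps for j in range(1, n)]
xs = np.cumsum([0.0] + gaps).reshape(-1, 1)

labels = fcluster(linkage(xs, method='single'), t=k, criterion='maxclust')
for c in range(1, k + 1):
    pts = xs[labels == c]
    print(c, len(pts), float(pts.max() - pts.min()))
# One cluster of n-k+1 points with diameter about n-k; the rest singletons.
```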

Page 23

Average-linkage clustering

Points in d-dimensional space, d = log2 k, under an l1 metric.

The final radius should be 1, but is instead d. Therefore: off by a factor of log2 k.

Page 24

Complete-linkage clustering

Can similarly construct a bad case…

Off by a factor of at least log2 k.

Page 25

Summary

There is a basic existence question about hierarchical clustering which needs to be addressed: must there always exist a hierarchical clustering in which, for each k, the induced k-clustering is close to optimal?

It turns out the answer is yes.

Page 26

Summary, cont’d

In fact, there is a simple, fast algorithm to construct such hierarchical clusterings.

Meanwhile, the standard agglomerative heuristics do not always produce close-to-optimal clusterings.

Page 27

Where next?

1. Reduce the approximation factor.

2. Other cost functions for clustering.

3. For average- and complete-linkage, is the log k lower bound also an upper bound?

4. Local improvement procedures for hierarchical clustering?