Generalization Bounds for Clustering - Some Thoughts and Many Questions
Shai Ben-David, University of Waterloo, Canada, Dec 2004


Page 1

Shai Ben-David

University of Waterloo Canada, Dec 2004

Generalization Bounds for Clustering - Some Thoughts and Many Questions

Page 2

Provide rigorous generalization bounds for clustering.

The Goal

Why ?

It would be useful to have assurances that the clusterings we produce are meaningful, rather than just an artifact of data randomness.

Page 3

There is some large, possibly infinite, domain set X.

An unknown probability distribution over X generates an i.i.d. sample.

Upon viewing such a sample, a learner wishes to deduce a clustering, as a simple, yet meaningful, description of the distribution.

1st Step: A formal model for Sample Based Clustering

Page 4

Roughly, we wish to be able to say:

If sufficiently many sample points have been drawn, then the clustering we come up with is “stable”.

2nd Step:What should a bound look like?

Page 5

If S1, S2 are sufficiently large i.i.d. samples

from the same distribution, then, w.h.p.,

C(S1) is ‘similar’ to C(S2)

where C(S) is the clustering we get by applying our clustering alg. to S

What should a bound look like? More formally
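As an illustration, the statement above can be checked empirically in a toy setting. The sketch below assumes a one-dimensional “largest gap” rule as the clustering algorithm C, and the distance between the two resulting cluster boundaries as the (dis)similarity between C(S1) and C(S2); both choices, and all function names, are illustrative assumptions, not part of the slides.

```python
import random

def largest_gap_split(sample):
    """A toy clustering rule C for 1-D data: sort the sample and cut it
    into two clusters at the largest gap, returning the midpoint of that
    gap as the cluster boundary."""
    s = sorted(sample)
    _, i = max((s[j + 1] - s[j], j) for j in range(len(s) - 1))
    return (s[i] + s[i + 1]) / 2

def worst_boundary_gap(sampler, n, trials=20, seed=0):
    """Empirical stand-in for the bound: repeatedly draw two i.i.d.
    samples S1, S2 of size n from the same distribution and record how
    far apart the boundaries of C(S1) and C(S2) can get."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        s1 = [sampler(rng) for _ in range(n)]
        s2 = [sampler(rng) for _ in range(n)]
        worst = max(worst, abs(largest_gap_split(s1) - largest_gap_split(s2)))
    return worst

# Two well-separated modes, on [0, 0.2] and [0.8, 1.0].
bimodal = lambda rng: rng.random() * 0.2 + (0.8 if rng.random() < 0.5 else 0.0)
```

For the bimodal sampler the two boundaries land near 0.5 on every draw, so the gap stays small; for the uniform distribution on [0, 1] the largest gap falls at an arbitrary place, and the same quantity does not shrink as n grows.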

Page 6

Classification generalization bounds guarantee the convergence of the loss of the hypothesis: “For any distribution P and large enough samples S, L(A(S)) is close to L(A(P)).” Since for clustering there is no natural analogue of the true distribution cost, L(A(P)), we consider its ‘stability’ implication:

“If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., L(A(S1)) is close to L(A(S2)).”

How is it Different than Classification bounds?

Here, for clustering, we seek a stronger statement, namely: “If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p.,

C(S1) is ‘similar’ to C(S2)”

Page 7

From a more traditional scientific-methodology point of view, stability can be viewed as the fundamental issue of replication --

to what extent are the results of an experiment reproducible?

A Different Perspective – Replication

Replication has been investigated in many applications of clustering, but mostly by visual inspection of the results of cluster analysis on two samples.

Page 8

If S1, S2 are sufficiently large i.i.d. samples

from the same distribution, then, w.h.p.,

C(S1) is ‘similar’ to C(S2)

where C(S) is the clustering we get by applying our clustering alg. to S

What should a bound look like? More formally

Page 9

How should similarity between clusterings be defined?

Some Issues need Clarification:

There are two notions to be defined: similarity between clusterings of the same set, and similarity between clusterings of different sets.

Similarity between two clusterings of the same set has been extensively discussed in the literature (see, e.g., Meila in COLT’03).
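One concrete instance of such a same-set measure is pair-counting agreement, in the spirit of the Rand index: two clusterings are scored by the fraction of point pairs they treat the same way. A minimal sketch (the dict-of-labels representation and the function name are my own conventions, not from the slides):

```python
from itertools import combinations

def rand_index(c1, c2):
    """Pair-counting agreement between two clusterings of the same set.

    c1, c2: dicts mapping each point to its cluster label.
    Returns the fraction of point pairs on which the two clusterings
    agree (both put the pair together, or both keep it apart)."""
    points = list(c1)
    agree, total = 0, 0
    for x, y in combinations(points, 2):
        same1 = c1[x] == c1[y]
        same2 = c2[x] == c2[y]
        agree += (same1 == same2)
        total += 1
    return agree / total
```

A score of 1.0 means the two clusterings induce the same partition, regardless of how the labels themselves are named.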

Page 10

A common approach to defining similarity between clusterings of different sets is to reduce it to a definition of similarity between clusterings of the same set.

Reducing the Second Notion to the First:

This is done via an extension operator - a method for extending a clustering of a domain subset to a clustering of the full domain (Breckenridge ’89, Roth et al. COMPSTAT’02, and BD in COLT’04).

Examples of such extensions are Nearest Neighbor, or Center-Based clustering.

Page 11

For a clustering C1 of S1 (or C2 of S2), use the extension operator to extend C1 to a clustering C1,2 of S2 (or C2 to a clustering C2,1 of S1, respectively).

Reducing the Similarity over Two Sets to Similarity over the Same Set:

Given a similarity measure d for same-set clusterings, define a similarity measure

D(C1, C2) = ½(d(C1, C2,1) + d(C2, C1,2))
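A sketch of this reduction, assuming a nearest-neighbor extension operator and a pair-counting choice for the same-set distance d; the function names and the dict-of-labels representation are my own, one possible instantiation rather than the slides’ prescription:

```python
from itertools import combinations

def pair_disagreement(c1, c2):
    """A same-set distance d: the fraction of point pairs on which the
    two clusterings disagree (together in one, apart in the other)."""
    pairs = list(combinations(list(c1), 2))
    bad = sum((c1[x] == c1[y]) != (c2[x] == c2[y]) for x, y in pairs)
    return bad / len(pairs) if pairs else 0.0

def nn_extend(clustering, targets, dist):
    """Nearest-neighbor extension operator: each target point inherits
    the cluster label of its nearest clustered point."""
    return {t: clustering[min(clustering, key=lambda p: dist(p, t))]
            for t in targets}

def D(c1, c2, dist, d=pair_disagreement):
    """The symmetrized distance D(C1, C2) = 1/2 (d(C1, C2,1) + d(C2, C1,2))."""
    c21 = nn_extend(c2, list(c1), dist)  # C2 extended to S1
    c12 = nn_extend(c1, list(c2), dist)  # C1 extended to S2
    return 0.5 * (d(c1, c21) + d(c2, c12))
```

Each of the two d-terms compares clusterings of a single set (S1 and S2, respectively), so any same-set measure can be plugged in for d.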

Page 12

If the number of clusters, k, is fixed, there is no hope of getting distribution-free stability results.

Types of Potential Bounds: 1. Fixed # of Clusters

Example 1: The uniform distribution over a circle.

Example 2: A square with four equal-mass bumps at its corners -- bad for k ≠ 4.

Example 3: Concentric rings -- bad for center-based clustering algorithms.

Page 13

von Luxburg, Bousquet and Belkin (this NIPS) analyze when Spectral Clustering converges to a global clustering of the domain space.

Koltchinskii (2002) proved that if the underlying distribution is generated by a certain tree structure of Gaussians, then a clustering algorithm can recover this structure from random samples.

BD (COLT 2004) showed distribution-free convergence rates for the limited issue of the clustering loss function.

What Can We Currently Prove? (Not too much …)

Page 14

What is the “Intrinsic Instability” of a given sample-generating distribution? (Buhmann et al.)

Fixed # of Clusters – Natural questions

Can one characterize (useful) families of probability distributions for which cluster stability holds (i.e., the intrinsic instability is zero)?

What levels of intrinsic instability render a clustering meaningless?

Page 15

Now there may be hope for distribution-free bounds (the algorithm may choose to have just one cluster for a uniform distribution).

Types of Potential Bounds: 2. Let the Algorithm Choose k

Major issue:

A tradeoff between the stability and the “information content” of a clustering.

Page 16

To assure that the outcome of a clustering algorithm is meaningful.

Potential Uses of Bounds:

Help detect changes in the sample-generating distribution (“the two-sample problem”).

Model selection – choose the number of clusters that maximizes a stability-based criterion (Lange, Braun, Roth & Buhmann, NIPS’02).
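A rough sketch of such stability-based selection, using a toy one-dimensional k-means as the clustering algorithm; this only illustrates the idea and is not the actual criterion of Lange et al.:

```python
import random
import statistics

def kmeans_1d(sample, k, iters=25, seed=0):
    """Toy 1-D k-means; returns the sorted list of cluster centers."""
    rng = random.Random(seed)
    centers = sorted(rng.sample(sample, k))
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in sample:
            buckets[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # Keep the old center when a bucket empties out.
        centers = sorted(statistics.mean(b) if b else c
                         for b, c in zip(buckets, centers))
    return centers

def instability(sampler, n, k, trials=10, seed=0):
    """Average disagreement between the centers found on independent
    sample pairs -- a crude stability score for a candidate k."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s1 = [sampler(rng) for _ in range(n)]
        s2 = [sampler(rng) for _ in range(n)]
        c1, c2 = kmeans_1d(s1, k), kmeans_1d(s2, k)
        total += max(abs(a - b) for a, b in zip(c1, c2))
    return total / trials

def choose_k(sampler, n, candidates=(2, 3, 4)):
    """Pick the candidate k with the lowest instability score."""
    scores = {k: instability(sampler, n, k) for k in candidates}
    return min(scores, key=scores.get), scores
```

Note the trade-off flagged on the previous slide: a criterion built this way must be normalized for k, since small k (in the extreme, k = 1) is trivially stable.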