Generalization Bounds for Clustering - Some Thoughts and Many Questions
Shai Ben-David
University of Waterloo, Canada, Dec 2004
The Goal
Provide rigorous generalization bounds for clustering.
Why?
It would be useful to have assurances that the clusterings we produce are meaningful, rather than just an artifact of data randomness.
There is some large, possibly infinite, domain set X.
An unknown probability distribution over X generates an i.i.d. sample.
Upon viewing such a sample, a learner wishes to deduce a clustering, as a simple, yet meaningful, description of the distribution.
1st Step: A formal model for Sample Based Clustering
Roughly, we wish to be able to say:
If sufficiently many sample points have been drawn, then the clustering we come up with is “stable”.
2nd Step: What should a bound look like?
If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p.,
C(S1) is ‘similar’ to C(S2),
where C(S) is the clustering we get by applying our clustering alg. to S.
What should a bound look like? More formally
Classification generalization bounds guarantee the convergence of the loss of the hypothesis: “For any distribution P and large enough sample S, L(A(S)) is close to L(A(P)).” Since for clustering there is no natural analogue of the true distribution cost L(A(P)), we consider its ‘stability’ implication:
“If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., L(A(S1)) is close to L(A(S2)).”
How is it Different from Classification Bounds?
Here, for clustering, we seek a stronger statement, namely: “If S1, S2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p.,
C(S1) is ‘similar’ to C(S2).”
A Different Perspective – Replication
From a more traditional scientific-methodology point of view, stability can be viewed as the fundamental issue of replication: to what extent are the results of an experiment reproducible?
Replication has been investigated in many applications of clustering, but mostly by visual inspection of the results of cluster analysis on two samples.
Some Issues need Clarification:
How should similarity between clusterings be defined?
There are two notions to be defined: similarity between clusterings of the same set, and similarity between clusterings of different sets.
Similarity between two clusterings of the same set has been extensively discussed in the literature (see, e.g., Meila in COLT’03).
A common approach to defining similarity between clusterings of different sets is to reduce it to a definition of similarity between clusterings of the same set.
Reducing the Second Notion to the First:
This is done via an extension operator: a method for extending a clustering of a domain subset to a clustering of the full domain (Breckenridge ’89, Roth et al COMPSTAT’02, and BD in COLT’04).
Examples of such extensions are Nearest Neighbor and Center-Based clustering.
For a clustering C1 of S1 (or C2 of S2), use the extension operator to extend C1 to a clustering C1,2 of S2 (or C2 to a clustering C2,1 of S1, respectively).
Reducing Similarity over Two Sets to Similarity over the Same Set:
Given a similarity measure d for same-set clusterings, define a similarity measure
D(C1, C2) = ½(d(C1, C2,1) + d(C2, C1,2))
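The reduction above can be sketched in code. This is a minimal NumPy sketch, assuming clusterings are represented as integer label arrays, a nearest-neighbor extension operator, and a pair-counting (Rand-style) disagreement as the same-set distance d; the function names are illustrative, not from the talk.

```python
import numpy as np

def nn_extend(S_src, labels_src, S_dst):
    """Extension operator: extend a clustering of S_src to S_dst by
    giving each point of S_dst the label of its nearest point in S_src."""
    d2 = ((S_dst[:, None, :] - S_src[None, :, :]) ** 2).sum(-1)
    return labels_src[d2.argmin(axis=1)]

def pair_disagreement(a, b):
    """Same-set distance d: fraction of point pairs on which two
    clusterings of the same set disagree about co-membership."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)  # distinct pairs only
    return (same_a[iu] != same_b[iu]).mean()

def D(S1, C1, S2, C2):
    """D(C1, C2) = 1/2 ( d(C1, C2,1) + d(C2, C1,2) )."""
    C21 = nn_extend(S2, C2, S1)  # C2 extended to S1
    C12 = nn_extend(S1, C1, S2)  # C1 extended to S2
    return 0.5 * (pair_disagreement(C1, C21) + pair_disagreement(C2, C12))
```

For two samples from two well-separated blobs, clustered by the obvious split, D comes out 0: the nearest-neighbor extension reproduces each clustering exactly on the other sample.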
Types of Potential Bounds: 1. Fixed # of Clusters
If the number of clusters, k, is fixed, there is no hope of getting distribution-free stability results.
Example 1: The uniform distribution over a circle.
Example 2: A square with 4 equal-mass heaps on its corners -- bad for k ≠ 4.
Example 3: Concentric rings -- bad for center-based clustering algorithms.
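Example 1 can be checked numerically. Below is a small sketch, assuming plain Lloyd's algorithm (k-means) stands in for a generic center-based clusterer; the sample sizes and seeds are arbitrary choices. With k = 2 fixed, each sample gets split into two half-circles, but the orientation of that split is an artifact of the sample and the initialization, so clusterings of independent samples need not agree.

```python
import numpy as np

def lloyd(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm (k-means); returns the final centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

def circle_sample(n, rng):
    """Uniform sample on the unit circle."""
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[np.cos(t), np.sin(t)]

# Two i.i.d. samples from the same distribution:
rng = np.random.default_rng(1)
S1, S2 = circle_sample(500, rng), circle_sample(500, rng)
c1 = lloyd(S1, k=2, seed=2)
c2 = lloyd(S2, k=2, seed=3)
# Each run splits the circle into two (near) half-circles, so the two
# centers are roughly antipodal -- but the axis of the split is arbitrary,
# and generally differs between the two samples.
axis1 = np.arctan2(c1[0, 1] - c1[1, 1], c1[0, 0] - c1[1, 0])
axis2 = np.arctan2(c2[0, 1] - c2[1, 1], c2[0, 0] - c2[1, 0])
```

The converged centers sit at distance about 2/π from the origin (the mean of a uniform half-circle), while the split direction carries no information about the distribution.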
What Can We Currently Prove? (Not too much …)
Von Luxburg, Bousquet and Belkin (this NIPS) analyze when Spectral Clustering converges to a global clustering of the domain space.
Koltchinskii (2002) proved that if the underlying distribution is generated by a certain tree structure of Gaussians, then a clustering algorithm can recover this structure from random samples.
BD (COLT 2004) showed distribution-free convergence rates for the limited issue of the clustering loss function.
Fixed # of Clusters – Natural questions
What is the “intrinsic instability” of a given sample distribution? (Buhmann et al)
Can one characterize (useful) families of probability distributions for which cluster stability holds (i.e., the intrinsic instability is zero)?
What level of intrinsic instability renders a clustering meaningless?
Types of Potential Bounds: 2. Let the Algorithm Choose k
Now there may be hope for distribution-free bounds (the algorithm may choose to have just one cluster for a uniform distribution).
Major issue:
A tradeoff between the stability and the “information content” of a clustering.
Potential Uses of Bounds:
To assure that the outcome of a clustering algorithm is meaningful.
Help detect changes in the sample-generating distribution (“the two-sample problem”).
Model selection – choose the number of clusters that maximizes a stability-based criterion (Lange – Braun – Roth – Buhmann NIPS’02)
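The model-selection idea can be sketched as a loop over k that scores each candidate by an empirical instability estimate: repeatedly split the sample, cluster both halves, extend one clustering to the other half by nearest neighbor, and measure pair-counting disagreement. This is an illustrative sketch assuming k-means as the clusterer; the scoring details (number of splits, the disagreement measure) are our choices, not the exact criterion of Lange et al.

```python
import numpy as np

def kmeans_labels(X, k, rng, iters=50):
    """Lloyd's algorithm; returns a label array over X."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def disagreement(a, b):
    """Fraction of point pairs the two labelings treat differently
    (invariant to permuting cluster names)."""
    iu = np.triu_indices(len(a), 1)
    return ((a[:, None] == a[None, :])[iu]
            != (b[:, None] == b[None, :])[iu]).mean()

def instability(X, k, splits=5, seed=0):
    """Average disagreement between clusterings of random half-samples,
    compared via nearest-neighbor extension."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(splits):
        perm = rng.permutation(len(X))
        A, B = X[perm[: len(X) // 2]], X[perm[len(X) // 2:]]
        la, lb = kmeans_labels(A, k, rng), kmeans_labels(B, k, rng)
        # Extend B's clustering to A by nearest neighbor, compare on A.
        nn = ((A[:, None] - B[None]) ** 2).sum(-1).argmin(1)
        scores.append(disagreement(la, lb[nn]))
    return float(np.mean(scores))

def choose_k(X, ks=(2, 3, 4, 5)):
    """Pick the k whose clusterings replicate best across half-samples."""
    return min(ks, key=lambda k: instability(X, k))
```

On data with two well-separated blobs, k = 2 replicates almost perfectly across splits, while larger k splits a blob in an initialization-dependent way, so the stability criterion selects k = 2.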