Clustering with Bregman Divergences

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh

Presented by

Rohit Gupta

CSci 8980: Machine Learning

Outline

• Bregman Divergences – Basics and Examples

• Bregman Information

• Bregman Hard Clustering

• The Exponential Family and connection to Bregman Divergence

• Bregman Soft Clustering

• Experiments and Results

• Conclusions

Bregman Hard and Soft Clustering

• Most existing parametric clustering methods partition the data into a pre-specified number of partitions, with a cluster representative for each partition/cluster

Hard Clustering – disjoint partitioning of the data such that each data point belongs to exactly one of the partitions

Soft Clustering – each data point has a certain probability of belonging to each of the partitions

Hard Clustering can be seen as Soft Clustering when probabilities are either 0 or 1

Distortion or Loss Functions

• Squared Euclidean distance is the most commonly used loss function

Extensive literature

Easy to use – leads to simple calculations

Not appropriate for some domains

Difficult to compute for sparse data (missing dimensions)

Example: Iterative K-means algorithm

• Question: How to choose a distortion/loss function for a given problem?

Bregman Divergences

• Ref: Definition 1 in the paper:

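Let $\phi : S \to \mathbb{R}$ be a strictly convex, differentiable function on a convex set $S \subseteq \mathbb{R}^d$

The Bregman divergence $d_\phi$ is defined as:

$d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$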

• Examples:

Squared distance

Relative Entropy (KL divergence)

Itakura Saito distance
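As a minimal sketch of the three examples above, the following Python computes each divergence from its generating function, using the standard definition d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>. The helper name `bregman`, the generator tuples, and the sample vectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Each generator is a pair (phi, gradient of phi).
squared_norm = (lambda v: np.dot(v, v), lambda v: 2.0 * v)
neg_entropy = (lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1.0)
burg_entropy = (lambda v: -np.sum(np.log(v)), lambda v: -1.0 / v)

# Two points on the probability simplex (all entries positive).
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_sq = bregman(*squared_norm, x, y)   # squared Euclidean distance
d_kl = bregman(*neg_entropy, x, y)    # relative entropy (KL) on the simplex
d_is = bregman(*burg_entropy, x, y)   # Itakura-Saito distance
```

Each value should agree with the familiar closed form, for example `np.sum(x * np.log(x / y))` for the relative entropy of two points on the simplex.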

Few Take Home Points on Bregman Divergence

1. $d_\phi(x, y) \ne d_\phi(y, x)$ in general (not symmetric, and therefore the triangle property does not hold)

2. $d_\phi(x, y) \ge 0$ if $x \ne y$

$d_\phi(x, y) = 0$ if $x = y$

3. Three Point Property

$d_\phi(x, y) = d_\phi(x, z) + d_\phi(z, y) - \langle x - z, \nabla\phi(y) - \nabla\phi(z) \rangle$

4. Strictly convex in the first argument but not necessarily so in the second argument
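The properties above can be checked numerically. The sketch below does so for one particular divergence, the generalized relative entropy generated by phi(v) = sum v log v; the function names and the three test vectors are assumptions chosen only for illustration.

```python
import numpy as np

def d_kl(x, y):
    # Bregman divergence generated by phi(v) = sum(v * log v)
    return np.sum(x * np.log(x / y) - x + y)

def grad_phi(v):
    # Gradient of phi(v) = sum(v * log v)
    return np.log(v) + 1.0

x = np.array([1.0, 1.0])
y = np.array([2.0, 3.0])
z = np.array([0.5, 4.0])

# Asymmetry: the two orderings generally give different values.
forward, backward = d_kl(x, y), d_kl(y, x)

# Three point property:
# d(x, y) = d(x, z) + d(z, y) - <x - z, grad phi(y) - grad phi(z)>
lhs = d_kl(x, y)
rhs = d_kl(x, z) + d_kl(z, y) - np.dot(x - z, grad_phi(y) - grad_phi(z))
```

For these vectors `forward` and `backward` differ, while `lhs` and `rhs` agree up to floating-point error.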

Bregman Information

• Bregman Information of a random variable X is given by

$I_\phi(X) = \min_{s \in S} E[d_\phi(X, s)]$

• The optimal vector that achieves the minimal value will be called Bregman representative of X

• For squared loss, the minimum loss is the variance, $E[\|X - \mu\|^2]$, where $\mu = E[X]$

• Best predictor of the random variable is the mean

Bregman Information

• Bregman Information is the minimum loss, attained at the Bregman representative

$\arg\min_{s} E[d_\phi(X, s)]$

• Points to note:

The representative defined above always exists

It is uniquely determined

It does not depend on the choice of Bregman divergence

The expectation of the random variable, $E[X]$, is the minimizer
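The points above can be illustrated with a small numerical experiment. This is a sketch under assumptions: the gamma-distributed sample, the search grid, and the helper `best_representative` are all hypothetical choices, and the minimizer is found by brute-force grid search rather than analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.5, size=500)  # positive 1-D samples

# Three Bregman divergences for positive scalars.
def d_sq(x, s):
    return (x - s) ** 2

def d_kl(x, s):
    return x * np.log(x / s) - x + s

def d_is(x, s):
    return x / s - np.log(x / s) - 1.0

grid = np.linspace(0.5, 8.0, 1501)  # candidate representatives, step 0.005

def best_representative(d):
    # Empirical expected divergence for every candidate s on the grid.
    losses = d(X[:, None], grid[None, :]).mean(axis=0)
    return grid[np.argmin(losses)], losses.min()

rep_sq, info_sq = best_representative(d_sq)
rep_kl, info_kl = best_representative(d_kl)
rep_is, info_is = best_representative(d_is)
```

All three representatives should land within one grid step of `X.mean()`, and `info_sq` should be close to `X.var()`, which is the Bregman information under squared loss.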

Bregman Hard Clustering

• This problem is posed as a quantization problem that involves minimizing the loss in Bregman information

• Very similar to squared-distance-based iterative K-means, except that the distortion function is drawn from the general class of Bregman divergences

• Expected Bregman Divergence of the data points from their Bregman representatives is minimized

• Procedure:

Initialize the representatives

Assign points to them

Re-estimate the representatives

Bregman Hard Clustering

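• Algorithm:

Initialize $\{s_h\}_{h=1}^{k}$

Repeat until convergence:

Step 1: Assign each data point $x$ to the nearest cluster $X_h$ such that

$h = \arg\min_{h'} d_\phi(x, s_{h'})$

Step 2: Re-estimate the representatives

$s_h = \frac{1}{n_h} \sum_{x \in X_h} x$, where $n_h$ is the number of points in $X_h$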
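The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bregman_hard_clustering`, its parameters, the two example divergences, and the synthetic data are all assumptions. Initial representatives are passed in explicitly so the run is deterministic.

```python
import numpy as np

def bregman_hard_clustering(X, init_reps, divergence, max_iter=100):
    """X: (n, d) data; divergence(X, s) returns d_phi(x, s) for every row x."""
    reps = np.array(init_reps, dtype=float)
    k = len(reps)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each point to the representative with least divergence.
        dists = np.stack([divergence(X, s) for s in reps], axis=1)  # (n, k)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Step 2: each representative becomes the arithmetic mean of its cluster.
        for h in range(k):
            members = X[labels == h]
            if len(members):  # an empty cluster keeps its old representative
                reps[h] = members.mean(axis=0)
    return labels, reps

def squared_euclidean(X, s):
    return np.sum((X - s) ** 2, axis=1)

def generalized_kl(X, s):
    # Requires strictly positive data and representatives.
    return np.sum(X * np.log(X / s) - X + s, axis=1)

# Two well-separated groups of positive points.
rng = np.random.default_rng(1)
X = np.vstack([rng.uniform(1, 2, size=(50, 2)),
               rng.uniform(8, 9, size=(50, 2))])
labels, reps = bregman_hard_clustering(X, X[[0, 50]], generalized_kl)
```

Swapping `generalized_kl` for `squared_euclidean` gives ordinary K-means. Only the assignment step depends on the divergence, while the re-estimation step is always the arithmetic mean.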

Take home points

• Exhaustiveness: Bregman hard clustering algorithm works for all Bregman divergences and in fact only for Bregman Divergences

Arithmetic mean is the best predictor for Bregman Divergences only

Possible to design clustering algorithms based on distortion functions that are not Bregman divergences, but in that case, cluster representative would not be the arithmetic mean or the expectation

• Linear Separators: Clusters obtained are separated by hyperplanes
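The linear-separator claim follows directly from the definition of the divergence. As a short worked derivation (writing $\phi$ for the convex function that generates the divergence and $s_1$, $s_2$ for two cluster representatives, symbols assumed here for illustration), the boundary between two clusters is the set of points equally far from both representatives:

```latex
d_\phi(x, s_1) = d_\phi(x, s_2)
\;\Longleftrightarrow\;
\phi(x) - \phi(s_1) - \langle x - s_1, \nabla\phi(s_1) \rangle
  = \phi(x) - \phi(s_2) - \langle x - s_2, \nabla\phi(s_2) \rangle
\;\Longleftrightarrow\;
\langle x, \nabla\phi(s_2) - \nabla\phi(s_1) \rangle
  = \phi(s_1) - \phi(s_2)
    - \langle s_1, \nabla\phi(s_1) \rangle
    + \langle s_2, \nabla\phi(s_2) \rangle
```

The $\phi(x)$ terms cancel, so the condition is linear in $x$ and the boundary is a hyperplane.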

Take home points

• Scalability: Each iteration of Bregman hard clustering algorithm is linear in the number of data points and the number of desired clusters

• Applicability to mixed data types: Allows choosing different Bregman divergences that are meaningful and appropriate for different subsets of features

• Also guarantees that the objective function will monotonically decrease till convergence

Exponential families and Bregman Divergences

• [Forster & Warmuth] remarked that the log-likelihood of the density of an exponential family distribution can be written as follows:

$\log p_{(\psi,\theta)}(x) = -d_\phi(x, \mu(\theta)) + \log b_\phi(x)$

Here $b_\phi$ is a uniquely determined function, $\mu$ is the expectation parameter and $\theta$ is the corresponding natural parameter

• Points to note:

$\psi$ is the cumulant function and it determines the exponential family

$\theta$ fixes the distribution in the family
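The decomposition of the log-likelihood into a Bregman divergence plus a term that depends only on x can be checked on a concrete family. The sketch below uses the Poisson distribution and assumes the generator phi(t) = t log t - t with log b_phi(x) = phi(x) - log(x!). The function names and the value of `mu` are illustrative.

```python
import math

def poisson_log_pmf(x, mu):
    # log of mu^x * exp(-mu) / x!
    return x * math.log(mu) - mu - math.lgamma(x + 1)

def d_phi(x, mu):
    # Bregman divergence generated by phi(t) = t log t - t
    return x * math.log(x / mu) - x + mu

def log_b_phi(x):
    # Term that depends on x only: phi(x) - log(x!)
    return x * math.log(x) - x - math.lgamma(x + 1)

mu = 3.5
# Start at x = 1 to avoid evaluating log(0) in this simple sketch.
gaps = [poisson_log_pmf(x, mu) - (-d_phi(x, mu) + log_b_phi(x))
        for x in range(1, 10)]
```

Every entry of `gaps` should be zero up to floating-point error, so maximizing the Poisson log-likelihood over `mu` is the same as minimizing `d_phi(x, mu)`.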

Bregman Soft Clustering

• Problem is posed as a parameter estimation problem for mixture models based on exponential family distributions

• EM algorithm is used to design Bregman Soft Clustering algorithm

• Maximizing log likelihood of data in the EM algorithm would be equivalent to minimizing the Bregman Divergence in the Bregman Soft Clustering algorithm (refer to the previous slide)

• Each regular exponential family corresponds to a unique regular Bregman divergence

Bregman Soft Clustering

• Algorithm:

Initialize $\{\pi_h, \mu_h\}_{h=1}^{k}$

Repeat until convergence:

Step 1: Expectation step

Compute the posterior probability for all $x$, $h$

$p(h \mid x) \propto \pi_h \exp(-d_\phi(x, \mu_h)) \, b_\phi(x)$ (normalized over $h$)

Step 2: Maximization step

Recompute the parameters for all $h$, such that the Bregman divergence is minimized

$\pi_h = \frac{1}{n} \sum_{x} p(h \mid x)$

$\mu_h = \frac{\sum_{x} p(h \mid x) \, x}{\sum_{x} p(h \mid x)}$
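The two EM steps above can be sketched as follows. This is a minimal illustration under assumptions: the function name `bregman_soft_clustering`, its parameters, the fixed iteration count, and the synthetic data are hypothetical. The example uses half the squared Euclidean distance, which corresponds to a unit-variance spherical Gaussian.

```python
import numpy as np

def bregman_soft_clustering(X, init_means, divergence, n_iter=50):
    """EM for a mixture model; divergence(X, m) returns d_phi(x, m) per row."""
    mus = np.array(init_means, dtype=float)
    k = len(mus)
    pis = np.full(k, 1.0 / k)  # start from uniform mixing weights
    for _ in range(n_iter):
        # E-step: p(h|x) is proportional to pi_h * exp(-d_phi(x, mu_h));
        # the factor b_phi(x) is the same for every h and cancels.
        log_post = np.log(pis) - np.stack(
            [divergence(X, m) for m in mus], axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: mixing weights and posterior-weighted means.
        weights = post.sum(axis=0)
        pis = weights / len(X)
        mus = (post.T @ X) / weights[:, None]
    return pis, mus, post

def half_squared_euclidean(X, m):
    # Bregman divergence of phi(x) = ||x||^2 / 2
    return 0.5 * np.sum((X - m) ** 2, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(100, 2))])
pis, mus, post = bregman_soft_clustering(
    X, [[1.0, 1.0], [5.0, 5.0]], half_squared_euclidean)
```

Unlike the hard version, every point contributes to every mean, weighted by its posterior. Rounding the posteriors to 0 or 1 recovers the hard assignment step.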

Experiments and Results

• Question: How does the quality of clustering depend on the appropriateness of the Bregman divergence?

• Experiments performed on synthetic data showed that cluster quality is better when the matching Bregman divergence is used than when a non-matching one is used

• Experiment 1:

Three 1-dimensional datasets of 100 samples each are generated based on mixture models of Gaussian, Poisson, and Binomial distributions respectively

datasets were clustered using three versions of Bregman hard clustering corresponding to different Bregman divergences

Experiments and Results

Mutual information is used to compare the results

Table 3 in the paper shows large numbers along the diagonals, which shows the importance of using appropriate Bregman divergence

• Experiment 2:

Similar to Experiment 1, except that the data are multi-dimensional

Table 4 in the paper shows the results, which again support the same observation as above
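For readers who want to reproduce this kind of comparison, the sketch below computes the mutual information between a clustering and the true labels. It is an illustration only: the function name `mutual_information` and the toy labelings are assumptions, and it computes plain mutual information in nats, which may differ from the exact normalization used in the paper's tables.

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same points."""
    a = np.unique(np.ravel(labels_a), return_inverse=True)[1]
    b = np.unique(np.ravel(labels_b), return_inverse=True)[1]
    # Joint distribution over (label in a, label in b).
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)  # marginal of a, shape (ka, 1)
    pb = joint.sum(axis=0, keepdims=True)  # marginal of b, shape (1, kb)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

truth = [0, 0, 1, 1]
mi_same = mutual_information(truth, [0, 0, 1, 1])       # identical labeling
mi_permuted = mutual_information(truth, [1, 1, 0, 0])   # cluster ids swapped
mi_unrelated = mutual_information(truth, [0, 1, 0, 1])  # independent labeling
```

The measure is invariant to renaming clusters, so `mi_same` and `mi_permuted` coincide, while a labeling independent of the truth scores zero.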

Conclusions

• Hard and Soft clustering algorithms are presented that minimize the loss function based on Bregman Divergences

• It was shown that there is a one-to-one mapping between regular exponential families and regular Bregman divergences, which helped in formulating the soft clustering algorithm

• A connection between Bregman divergences and Shannon's rate distortion theory is also established

• Experiments on synthetic data showed the importance of choosing the right Bregman divergence for the corresponding family of exponential distributions
