INFORMATION-THEORETIC CO-CLUSTERING
Authors / Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
Conference / ACM SIGKDD ’03, August 24-27, 2003, Washington
Presenter / Meng-Lun, Wu


DESCRIPTION

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.


Page 1: Information Theoretic Co Clustering

INFORMATION-THEORETIC CO-CLUSTERING

Authors / Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
Conference / ACM SIGKDD ’03, August 24-27, 2003, Washington
Presenter / Meng-Lun, Wu

Page 2: Information Theoretic Co Clustering

OUTLINE
- Introduction
- Problem Formulation
- Co-Clustering Algorithm
- Experimental Results
- Conclusions and Future Work

Page 3: Information Theoretic Co Clustering

INTRODUCTION

Clustering is a fundamental tool in unsupervised learning. Most clustering algorithms focus on one-way clustering.

doc | Word1 | … | Wordn
----|-------|---|------
 50 |  12   | … |  10
 52 |  13   | … |   0
 53 |  10   | … |  20

        ↓ Clustering

doc | Word1 | … | Wordn | Cluster
----|-------|---|-------|---------
 50 |  12   | … |  10   | Cluster0
 52 |  13   | … |   0   | Cluster1
 53 |  10   | … |  20   | Cluster0

Page 4: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

It is often desirable to co-cluster, or simultaneously cluster, both dimensions. We view the normalized non-negative contingency table as a joint probability distribution between two discrete random variables.

The optimal co-clustering is one that leads to the largest mutual information between the clustered random variables.

Page 5: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

The optimal co-clustering is one that minimizes the loss in mutual information.

The mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. Formally, the mutual information is defined as:

$$
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
$$
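To make the quantity concrete, here is a minimal NumPy sketch (not from the slides or the paper) that evaluates this sum for a joint distribution stored as a matrix; the function name, the natural-log (nats) convention, and the toy example are assumptions of the sketch.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) ), in nats."""
    p_x = p_xy.sum(axis=1, keepdims=True)      # row marginals p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)      # column marginals p(y)
    nz = p_xy > 0                              # 0 * log 0 is taken to be 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz])))

# Toy joint distribution (hypothetical, just to exercise the formula):
p = np.array([[0.25, 0.25],
              [0.25, 0.25]])
print(mutual_information(p))   # 0.0, since the two variables are independent
```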

Page 6: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

The Kullback-Leibler (K-L) divergence measures the difference between two probability distributions. Given the true distribution p(x,y) and another distribution q(x,y), it is defined as:

$$
D_{KL}\big(p(x,y)\,\|\,q(x,y)\big) = \sum_{x} \sum_{y} p(x,y)\,\log\frac{p(x,y)}{q(x,y)}
$$
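As with mutual information, a small sketch (my own, not the authors') shows how this divergence can be evaluated for two joint distributions stored as matrices; it assumes q(x,y) is positive wherever p(x,y) is.

```python
import numpy as np

def kl_divergence(p_xy, q_xy):
    """D( p(x,y) || q(x,y) ) = sum_{x,y} p(x,y) * log( p(x,y) / q(x,y) )."""
    nz = p_xy > 0                  # terms with p(x,y) = 0 contribute nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / q_xy[nz])))
```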

Page 7: Information Theoretic Co Clustering

PROBLEM FORMULATION

Let X and Y be discrete random variables, with X taking values in {x1, …, xm} and Y in {y1, …, yn}, and let p(X, Y) denote their joint probability distribution.

Let the k clusters of X be {x̂1, x̂2, …, x̂k} and the l clusters of Y be {ŷ1, ŷ2, …, ŷl}. A co-clustering is a pair of maps:

$$
C_X : \{x_1,\dots,x_m\} \to \{\hat x_1,\dots,\hat x_k\}, \qquad \hat X = C_X(X)
$$
$$
C_Y : \{y_1,\dots,y_n\} \to \{\hat y_1,\dots,\hat y_l\}, \qquad \hat Y = C_Y(Y)
$$

Page 8: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

Definition: an optimal co-clustering minimizes

$$
I(X;Y) - I(\hat X;\hat Y)
$$

subject to constraints on the number of row and column clusters.

For a fixed co-clustering (C_X, C_Y), the loss in mutual information can be written as

$$
I(X;Y) - I(\hat X;\hat Y) = D_{KL}\big(p(x,y)\,\|\,q(x,y)\big)
$$
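The term I(X̂;Ŷ) only needs the co-clustered joint p(X̂,Ŷ), which is obtained by summing p(x,y) over each (row cluster, column cluster) block. Below is a sketch under my own conventions (NumPy integer label arrays standing in for C_X and C_Y, and mutual_information from the earlier sketch); it is an illustration, not the paper's notation.

```python
import numpy as np

def compress(p_xy, row_labels, col_labels, k, l):
    """Aggregate p(X,Y) into the co-clustered joint p(X_hat, Y_hat) by
    summing each (row cluster, column cluster) block of the table."""
    p_hat = np.zeros((k, l))
    np.add.at(p_hat, (row_labels[:, None], col_labels[None, :]), p_xy)
    return p_hat

# Loss in mutual information for a fixed co-clustering (C_X, C_Y):
# loss = mutual_information(p_xy) - mutual_information(compress(p_xy, rows, cols, k, l))
```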

Page 9: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

The loss in mutual information can be expressed as a K-L divergence:

$$
\begin{aligned}
I(X;Y) - I(\hat X;\hat Y)
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
 \;-\; \sum_{\hat x}\sum_{\hat y} p(\hat x,\hat y)\log\frac{p(\hat x,\hat y)}{p(\hat x)\,p(\hat y)} \\
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)\,p(\hat x)\,p(\hat y)}{p(x)\,p(y)\,p(\hat x,\hat y)} \\
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)}{p(\hat x,\hat y)\,p(x\mid\hat x)\,p(y\mid\hat y)} \\
&= \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{q(x,y)}
 \;=\; D_{KL}\big(p(x,y)\,\|\,q(x,y)\big)
\end{aligned}
$$

Page 10: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

q(X,Y) is a distribution of the form

$$
q(x,y) = p(\hat x,\hat y)\,p(x\mid\hat x)\,p(y\mid\hat y)
       = p(\hat x,\hat y)\,\frac{p(x)}{p(\hat x)}\,\frac{p(y)}{p(\hat y)},
\qquad x \in \hat x,\ y \in \hat y.
$$

In the running example the marginals are
p(x) = (0.15, 0.15, 0.15, 0.15, 0.2, 0.2) with row-cluster marginals p(x̂) = (0.3, 0.3, 0.4), and
p(y) = (0.18, 0.18, 0.14, 0.14, 0.18, 0.18) with column-cluster marginals p(ŷ) = (0.5, 0.5).

Suppose x1 ∈ x̂1, y1 ∈ ŷ1 and p(x̂1, ŷ1) = 0.3. Then

$$
q(x_1,y_1) = 0.3 \cdot \frac{0.15}{0.3} \cdot \frac{0.18}{0.5} = 0.054.
$$
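A short sketch (my own naming and label-array representation, not the paper's pseudocode) of how q(x,y) can be built from p(x,y) and the two cluster maps; it assumes every cluster keeps positive probability mass.

```python
import numpy as np

def build_q(p_xy, row_labels, col_labels, k, l):
    """q(x,y) = p(x_hat, y_hat) * (p(x)/p(x_hat)) * (p(y)/p(y_hat))."""
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y = p_xy.sum(axis=0)                      # p(y)
    p_hat = np.zeros((k, l))                    # p(x_hat, y_hat)
    np.add.at(p_hat, (row_labels[:, None], col_labels[None, :]), p_xy)
    p_xhat = p_hat.sum(axis=1)                  # p(x_hat)
    p_yhat = p_hat.sum(axis=0)                  # p(y_hat)
    # Broadcast the three factors back onto the original (x, y) grid.
    return (p_hat[row_labels[:, None], col_labels[None, :]]
            * (p_x / p_xhat[row_labels])[:, None]
            * (p_y / p_yhat[col_labels])[None, :])
```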

Page 11: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM

Input: the joint probability distribution p(X,Y), the desired number of row clusters k, and the desired number of column clusters l.

Output: the partition functions C†_X and C†_Y.
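The next slides walk through the algorithm on the running example. As a rough sketch of the alternating scheme (my own simplified code, not the authors' exact pseudocode): each row is reassigned to the row cluster whose profile q(Y | x̂) is closest in K-L divergence to p(Y | x), the columns are then updated symmetrically, and the process repeats. The sketch assumes every row and column has positive mass and that no cluster becomes empty.

```python
import numpy as np

def kl(p, q):
    """D(p || q) for two probability vectors, treating 0*log(0) as 0."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def cocluster(p_xy, k, l, n_iters=20, seed=0):
    """Simplified sketch of the alternating row/column reassignment scheme."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    rows = rng.integers(0, k, size=m)            # C_X: row index -> row cluster
    cols = rng.integers(0, l, size=n)            # C_Y: col index -> column cluster
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def cluster_joint():
        # p(x_hat, y_hat) under the current co-clustering
        p_hat = np.zeros((k, l))
        np.add.at(p_hat, (rows[:, None], cols[None, :]), p_xy)
        return p_hat

    for _ in range(n_iters):
        # Row step: move row x to argmin_xhat D( p(Y|x) || q(Y|xhat) ),
        # where q(y|xhat) = p(yhat|xhat) * p(y|yhat).
        p_hat = cluster_joint()
        q_y = ((p_hat / p_hat.sum(axis=1, keepdims=True))[:, cols]
               * (p_y / p_hat.sum(axis=0)[cols]))
        rows = np.array([min(range(k), key=lambda a: kl(p_xy[i] / p_x[i], q_y[a]))
                         for i in range(m)])

        # Column step, symmetrically: q(x|yhat) = p(xhat|yhat) * p(x|xhat).
        p_hat = cluster_joint()
        q_x = ((p_hat / p_hat.sum(axis=0, keepdims=True)).T[:, rows]
               * (p_x / p_hat.sum(axis=1)[rows]))
        cols = np.array([min(range(l), key=lambda b: kl(p_xy[:, j] / p_y[j], q_x[b]))
                         for j in range(n)])
    return rows, cols
```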

Page 12: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM (CONT.)

[Worked example, row step: D(p||q) values for candidate assignments to the row clusters x̂1, x̂2, x̂3:
0.041909, 0.041909, 0.05696, 0.05696, 0.0376, 0.049641 and
0.05696, 0.05696, 0.04191, 0.04191, 0.049641, 0.0376.]

Page 13: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM (CONT.)

[Worked example, column step: D(p||q) values for candidate assignments to the column clusters ŷ1, ŷ2:
0.02118, 0.02118, 0.02243, 0.040765, 0.04893, 0.04893 and
0.048138, 0.048138, 0.041942, 0.02295, 0.02052, 0.02052.]

Page 14: Information Theoretic Co Clustering


CO-CLUSTERING ALGORITHM (CONT.)

D(p||q)=0.02881

Page 15: Information Theoretic Co Clustering

EXPERIMENTAL RESULTS

For our experiments we use various subsets of the 20-Newsgroups dataset (NG20).

We use "1D-clustering" to denote document clustering without any word clustering.

Evaluation measures:
- Micro-averaged precision
- Micro-averaged recall
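For reference, one common way to compute such cluster-quality numbers (my own sketch; the paper's exact bookkeeping may differ) is to map each cluster to its dominant true class and count how many documents land in a cluster of their own class; when every document belongs to exactly one cluster, micro-averaged precision and recall coincide.

```python
import numpy as np

def micro_averaged_precision(cluster_labels, class_labels):
    """Fraction of documents whose true class is the dominant class of the
    cluster they were assigned to (both label arrays are integer-coded)."""
    correct = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        correct += np.bincount(members).max()   # size of the dominant class
    return correct / len(class_labels)
```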

Page 16: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 17: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 18: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 19: Information Theoretic Co Clustering

CONCLUSIONS AND FUTURE WORK

- The co-clustering algorithm derived from the information-theoretic formulation is guaranteed to reach a local minimum of the loss in a finite number of steps.
- The method co-clusters the joint distribution of two discrete random variables.
- In this paper, the numbers of row and column clusters are pre-specified.
- We hope that an information-theoretic regularization procedure may allow the number of clusters to be selected automatically.