INFORMATION-THEORETIC CO-CLUSTERING
Authors / Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
Conference / ACM SIGKDD ’03, August 24-27, 2003, Washington
Presenter / Meng-Lun, Wu


DESCRIPTION

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.


Page 1: Information Theoretic Co Clustering

INFORMATION-THEORETIC CO-CLUSTERING

Authors / Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
Conference / ACM SIGKDD ’03, August 24-27, 2003, Washington
Presenter / Meng-Lun, Wu

Page 2: Information Theoretic Co Clustering

OUTLINE
- Introduction
- Problem Formulation
- Co-Clustering Algorithm
- Experimental Results
- Conclusions and Future Work

Page 3: Information Theoretic Co Clustering

INTRODUCTION

Clustering is a fundamental tool in unsupervised learning. Most clustering algorithms focus on one-way clustering.

doc | Word1 | … | Wordn
----|-------|---|------
 50 |  12   | … |  10
 52 |  13   | … |   0
 53 |  10   | … |  20

        ↓ Clustering

doc | Word1 | … | Wordn | Cluster
----|-------|---|-------|---------
 50 |  12   | … |  10   | Cluster0
 52 |  13   | … |   0   | Cluster1
 53 |  10   | … |  20   | Cluster0

Page 4: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

It is often desirable to co-cluster, or simultaneously cluster, both dimensions. We view the normalized non-negative contingency table as a joint probability distribution between two discrete random variables.

The optimal co-clustering is one that leads to the largest mutual information between the clustered random variables.

Page 5: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

The optimal co-clustering is one that minimizes the loss in mutual information.

The mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. Formally, the mutual information is defined as:

$$
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
$$
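To make the quantity concrete, here is a minimal NumPy sketch (not from the slides or the paper) that evaluates this sum for a joint distribution stored as a matrix; the function name, the natural-log (nats) convention, and the toy example are assumptions of the sketch.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) ), in nats."""
    p_x = p_xy.sum(axis=1, keepdims=True)      # row marginals p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)      # column marginals p(y)
    nz = p_xy > 0                              # 0 * log 0 is taken to be 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz])))

# Toy joint distribution (hypothetical, just to exercise the formula):
p = np.array([[0.25, 0.25],
              [0.25, 0.25]])
print(mutual_information(p))   # 0.0, since the two variables are independent
```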

Page 6: Information Theoretic Co Clustering

INTRODUCTION (CONT.)

The Kullback-Leibler (K-L) divergence measures the difference between two probability distributions. Given the true distribution p(x,y) and another distribution q(x,y), it is defined as:

$$
D_{KL}\big(p(x,y)\,\|\,q(x,y)\big) = \sum_{x} \sum_{y} p(x,y)\,\log\frac{p(x,y)}{q(x,y)}
$$
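As with mutual information, a small sketch (my own, not the authors') shows how this divergence can be evaluated for two joint distributions stored as matrices; it assumes q(x,y) is positive wherever p(x,y) is.

```python
import numpy as np

def kl_divergence(p_xy, q_xy):
    """D( p(x,y) || q(x,y) ) = sum_{x,y} p(x,y) * log( p(x,y) / q(x,y) )."""
    nz = p_xy > 0                  # terms with p(x,y) = 0 contribute nothing
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / q_xy[nz])))
```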

Page 7: Information Theoretic Co Clustering

PROBLEM FORMULATION

Let X and Y be discrete random variables, with X taking values in {x1, …, xm} and Y in {y1, …, yn}, and let p(X, Y) denote their joint probability distribution.

Let the k clusters of X be {x̂1, x̂2, …, x̂k} and the l clusters of Y be {ŷ1, ŷ2, …, ŷl}. A co-clustering is a pair of maps:

$$
C_X : \{x_1,\dots,x_m\} \to \{\hat x_1,\dots,\hat x_k\}, \qquad \hat X = C_X(X)
$$
$$
C_Y : \{y_1,\dots,y_n\} \to \{\hat y_1,\dots,\hat y_l\}, \qquad \hat Y = C_Y(Y)
$$

Page 8: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

Definition: an optimal co-clustering minimizes

$$
I(X;Y) - I(\hat X;\hat Y)
$$

subject to constraints on the number of row and column clusters.

For a fixed co-clustering (C_X, C_Y), the loss in mutual information can be written as

$$
I(X;Y) - I(\hat X;\hat Y) = D_{KL}\big(p(x,y)\,\|\,q(x,y)\big)
$$
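The term I(X̂;Ŷ) only needs the co-clustered joint p(X̂,Ŷ), which is obtained by summing p(x,y) over each (row cluster, column cluster) block. Below is a sketch under my own conventions (NumPy integer label arrays standing in for C_X and C_Y, and mutual_information from the earlier sketch); it is an illustration, not the paper's notation.

```python
import numpy as np

def compress(p_xy, row_labels, col_labels, k, l):
    """Aggregate p(X,Y) into the co-clustered joint p(X_hat, Y_hat) by
    summing each (row cluster, column cluster) block of the table."""
    p_hat = np.zeros((k, l))
    np.add.at(p_hat, (row_labels[:, None], col_labels[None, :]), p_xy)
    return p_hat

# Loss in mutual information for a fixed co-clustering (C_X, C_Y):
# loss = mutual_information(p_xy) - mutual_information(compress(p_xy, rows, cols, k, l))
```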

Page 9: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

The loss in mutual information can be expressed as a K-L divergence:

$$
\begin{aligned}
I(X;Y) - I(\hat X;\hat Y)
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
 \;-\; \sum_{\hat x}\sum_{\hat y} p(\hat x,\hat y)\log\frac{p(\hat x,\hat y)}{p(\hat x)\,p(\hat y)} \\
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)\,p(\hat x)\,p(\hat y)}{p(x)\,p(y)\,p(\hat x,\hat y)} \\
&= \sum_{\hat x}\sum_{\hat y}\sum_{x \in \hat x}\sum_{y \in \hat y} p(x,y)\log\frac{p(x,y)}{p(\hat x,\hat y)\,p(x\mid\hat x)\,p(y\mid\hat y)} \\
&= \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{q(x,y)}
 \;=\; D_{KL}\big(p(x,y)\,\|\,q(x,y)\big)
\end{aligned}
$$

Page 10: Information Theoretic Co Clustering

PROBLEM FORMULATION (CONT.)

q(X,Y) is a distribution of the form

$$
q(x,y) = p(\hat x,\hat y)\,p(x\mid\hat x)\,p(y\mid\hat y)
       = p(\hat x,\hat y)\,\frac{p(x)}{p(\hat x)}\,\frac{p(y)}{p(\hat y)},
\qquad x \in \hat x,\ y \in \hat y.
$$

In the running example the marginals are
p(x) = (0.15, 0.15, 0.15, 0.15, 0.2, 0.2) with row-cluster marginals p(x̂) = (0.3, 0.3, 0.4), and
p(y) = (0.18, 0.18, 0.14, 0.14, 0.18, 0.18) with column-cluster marginals p(ŷ) = (0.5, 0.5).

Suppose x1 ∈ x̂1, y1 ∈ ŷ1 and p(x̂1, ŷ1) = 0.3. Then

$$
q(x_1,y_1) = 0.3 \cdot \frac{0.15}{0.3} \cdot \frac{0.18}{0.5} = 0.054.
$$
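A short sketch (my own naming and label-array representation, not the paper's pseudocode) of how q(x,y) can be built from p(x,y) and the two cluster maps; it assumes every cluster keeps positive probability mass.

```python
import numpy as np

def build_q(p_xy, row_labels, col_labels, k, l):
    """q(x,y) = p(x_hat, y_hat) * (p(x)/p(x_hat)) * (p(y)/p(y_hat))."""
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y = p_xy.sum(axis=0)                      # p(y)
    p_hat = np.zeros((k, l))                    # p(x_hat, y_hat)
    np.add.at(p_hat, (row_labels[:, None], col_labels[None, :]), p_xy)
    p_xhat = p_hat.sum(axis=1)                  # p(x_hat)
    p_yhat = p_hat.sum(axis=0)                  # p(y_hat)
    # Broadcast the three factors back onto the original (x, y) grid.
    return (p_hat[row_labels[:, None], col_labels[None, :]]
            * (p_x / p_xhat[row_labels])[:, None]
            * (p_y / p_yhat[col_labels])[None, :])
```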

Page 11: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM

Input: the joint probability distribution p(X,Y), the desired number of row clusters k, and the desired number of column clusters l.

Output: the partition functions C†_X and C†_Y.
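The next slides walk through the algorithm on the running example. As a rough sketch of the alternating scheme (my own simplified code, not the authors' exact pseudocode): each row is reassigned to the row cluster whose profile q(Y | x̂) is closest in K-L divergence to p(Y | x), the columns are then updated symmetrically, and the process repeats. The sketch assumes every row and column has positive mass and that no cluster becomes empty.

```python
import numpy as np

def kl(p, q):
    """D(p || q) for two probability vectors, treating 0*log(0) as 0."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def cocluster(p_xy, k, l, n_iters=20, seed=0):
    """Simplified sketch of the alternating row/column reassignment scheme."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    rows = rng.integers(0, k, size=m)            # C_X: row index -> row cluster
    cols = rng.integers(0, l, size=n)            # C_Y: col index -> column cluster
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def cluster_joint():
        # p(x_hat, y_hat) under the current co-clustering
        p_hat = np.zeros((k, l))
        np.add.at(p_hat, (rows[:, None], cols[None, :]), p_xy)
        return p_hat

    for _ in range(n_iters):
        # Row step: move row x to argmin_xhat D( p(Y|x) || q(Y|xhat) ),
        # where q(y|xhat) = p(yhat|xhat) * p(y|yhat).
        p_hat = cluster_joint()
        q_y = ((p_hat / p_hat.sum(axis=1, keepdims=True))[:, cols]
               * (p_y / p_hat.sum(axis=0)[cols]))
        rows = np.array([min(range(k), key=lambda a: kl(p_xy[i] / p_x[i], q_y[a]))
                         for i in range(m)])

        # Column step, symmetrically: q(x|yhat) = p(xhat|yhat) * p(x|xhat).
        p_hat = cluster_joint()
        q_x = ((p_hat / p_hat.sum(axis=0, keepdims=True)).T[:, rows]
               * (p_x / p_hat.sum(axis=1)[rows]))
        cols = np.array([min(range(l), key=lambda b: kl(p_xy[:, j] / p_y[j], q_x[b]))
                         for j in range(n)])
    return rows, cols
```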

Page 12: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM (CONT.)

[Worked example, row step: D(p||q) values for candidate assignments to the row clusters x̂1, x̂2, x̂3:
0.041909, 0.041909, 0.05696, 0.05696, 0.0376, 0.049641 and
0.05696, 0.05696, 0.04191, 0.04191, 0.049641, 0.0376.]

Page 13: Information Theoretic Co Clustering

CO-CLUSTERING ALGORITHM (CONT.)

[Worked example, column step: D(p||q) values for candidate assignments to the column clusters ŷ1, ŷ2:
0.02118, 0.02118, 0.02243, 0.040765, 0.04893, 0.04893 and
0.048138, 0.048138, 0.041942, 0.02295, 0.02052, 0.02052.]

Page 14: Information Theoretic Co Clustering


CO-CLUSTERING ALGORITHM (CONT.)

D(p||q)=0.02881

Page 15: Information Theoretic Co Clustering

EXPERIMENTAL RESULTS

For our experiments we use various subsets of the 20-Newsgroups dataset (NG20).

We use "1D-clustering" to denote document clustering without any word clustering.

Evaluation measures:
- Micro-averaged precision
- Micro-averaged recall
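For reference, one common way to compute such cluster-quality numbers (my own sketch; the paper's exact bookkeeping may differ) is to map each cluster to its dominant true class and count how many documents land in a cluster of their own class; when every document belongs to exactly one cluster, micro-averaged precision and recall coincide.

```python
import numpy as np

def micro_averaged_precision(cluster_labels, class_labels):
    """Fraction of documents whose true class is the dominant class of the
    cluster they were assigned to (both label arrays are integer-coded)."""
    correct = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        correct += np.bincount(members).max()   # size of the dominant class
    return correct / len(class_labels)
```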

Page 16: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 17: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 18: Information Theoretic Co Clustering


EXPERIMENTAL RESULTS (CONT.)

Page 19: Information Theoretic Co Clustering

CONCLUSIONS AND FUTURE WORK

- The co-clustering algorithm derived from the information-theoretic formulation is guaranteed to reach a local minimum of the loss in a finite number of steps.
- The method co-clusters the joint distribution of two discrete random variables.
- In this paper, the numbers of row and column clusters are pre-specified.
- We hope that an information-theoretic regularization procedure may allow the number of clusters to be selected automatically.