Page 1: IMA Workshop on Data Analysis & Optimization May 7, 2003

Information Theoretic Clustering, Co-clustering and Matrix Approximations

Inderjit S. Dhillon University of Texas, Austin

Joint work with Y. Guan, S. Mallela & Dharmendra Modha

IMA Workshop on Data Analysis & Optimization

May 7, 2003

Page 2: IMA Workshop on Data Analysis & Optimization May 7, 2003

Introduction: Important Tasks in Data Mining

Clustering: grouping together of “similar” objects

Classification: labelling new objects given an existing grouping

Matrix Approximations: reducing dimensionality (SVD, PCA, NNMF, ...)

Obstacles: high dimensionality, sparsity, noise

Need for robust and scalable algorithms

Page 3: IMA Workshop on Data Analysis & Optimization May 7, 2003

Clustering

Clustering along “one dimension”: grouping together of “similar” objects

Hard Clustering -- Each object belongs to a single cluster

Soft Clustering -- Each object is probabilistically assigned to clusters

Page 4: IMA Workshop on Data Analysis & Optimization May 7, 2003

Co-clustering

Given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions

In the two-dimensional case, it is the simultaneous clustering of rows and columns

Most traditional clustering algorithms cluster along a single dimension

Co-clustering is more robust to sparsity

Page 5: IMA Workshop on Data Analysis & Optimization May 7, 2003

Matrix Approximations

Co-occurrence matrices frequently arise in applications, for example, word-document matrices

Matrix characteristics: large, sparse, non-negative

Traditional matrix approximations, such as the SVD (PCA), do not preserve non-negativity or sparsity

Page 6: IMA Workshop on Data Analysis & Optimization May 7, 2003

An example word-document co-occurrence matrix:

    1 1 0 0 0 0 0 0 0
    1 1 1 0 0 0 0 0 0
    0 1 1 1 0 0 0 0 0
    1 0 0 0 0 0 0 1 0
    0 0 0 0 0 1 1 0 0
    0 0 0 0 1 0 0 1 0
    0 0 0 0 1 0 0 1 0
    0 0 0 0 0 2 1 1 0
    0 0 0 0 1 0 1 1 0
    0 0 0 0 0 0 0 1 1
    0 0 0 0 0 0 1 0 1
    0 0 0 0 0 1 0 0 1

[A dense approximation of the matrix above (for example, from a truncated SVD) in which every entry is nonzero, so sparsity is not preserved.]

A sparse, non-negative approximation of the same matrix:

    .67  .67  .44  .22   0    0    0    0    0
    1.0  1.0  .67  .33   0    0    0    0    0
    1.0  1.0  .67  .33   0    0    0    0    0
    .33  .33  .22  .11  .15  .20  .20  .30  .15
     0    0    0    0   .30  .40  .40  .60  .30
     0    0    0    0   .30  .40  .40  .60  .30
     0    0    0    0   .30  .40  .40  .60  .30
     0    0    0    0   .60  .80  .80 1.20  .60
     0    0    0    0   .45  .60  .60  .90  .45
     0    0    0    0   .30  .40  .40  .60  .30
     0    0    0    0   .30  .40  .40  .60  .30
     0    0    0    0   .30  .40  .40  .60  .30

Page 7: IMA Workshop on Data Analysis & Optimization May 7, 2003

The example joint distribution p(x, y):

    .04 .04  0  .04 .04 .04
    .04 .04 .04  0  .04 .04
    .05 .05 .05  0   0   0
    .05 .05 .05  0   0   0
     0   0   0  .05 .05 .05
     0   0   0  .05 .05 .05

A dense approximation with every entry nonzero:

    .030 .030 .020 .035 .044 .044
    .044 .044 .035 .020 .030 .030
    .051 .051 .047 .007 .003 .003
    .051 .051 .047 .007 .003 .003
    .003 .003 .007 .047 .051 .051
    .003 .003 .007 .047 .051 .051

The co-clustering based approximation q(x, y):

    .036 .036 .028 .028 .036 .036
    .036 .036 .028 .028 .036 .036
    .054 .054 .042   0    0    0
    .054 .054 .042   0    0    0
      0    0    0  .042 .054 .054
      0    0    0  .042 .054 .054

Page 8: IMA Workshop on Data Analysis & Optimization May 7, 2003

Co-clustering and Information Theory

View the (scaled) co-occurrence matrix as a joint probability distribution p(x, y) between the row and column random variables X and Y.

We seek a hard clustering of both dimensions such that the loss in “mutual information”

    I(X; Y) − I(X̂; Ŷ)

is minimized, given a fixed number of row and column clusters (similar framework as in Tishby, Pereira & Bialek (1999), Berkhin & Becher (2002)).

Page 9: IMA Workshop on Data Analysis & Optimization May 7, 2003

Information Theory Concepts

Entropy of a random variable X with probability distribution p(x):

    H(p) = − Σ_x p(x) log p(x)

The Kullback-Leibler (KL) divergence or “relative entropy” between two probability distributions p and q:

    KL(p || q) = Σ_x p(x) log( p(x) / q(x) )

Mutual information between random variables X and Y:

    I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )
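
For concreteness, a minimal NumPy sketch of these three quantities; the function names and the use of natural logarithms (rather than bits) are my own choices, not the slides'.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), skipping zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_information(pxy):
    """I(X;Y) = KL(p(x,y) || p(x)p(y)) for a joint distribution given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    return kl_divergence(pxy.ravel(), (px * py).ravel())

# Tiny example joint distribution.
pxy = np.array([[0.25, 0.25],
                [0.00, 0.50]])
print(entropy(pxy.ravel()), mutual_information(pxy))
```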

Page 10: IMA Workshop on Data Analysis & Optimization May 7, 2003

Jensen-Shannon Divergence

Jensen-Shannon (JS) divergence between two probability distributions:

    JS_π(p_1, p_2) = π_1 KL(p_1 || π_1 p_1 + π_2 p_2) + π_2 KL(p_2 || π_1 p_1 + π_2 p_2)
                   = H(π_1 p_1 + π_2 p_2) − π_1 H(p_1) − π_2 H(p_2)

where π_1, π_2 ≥ 0 and π_1 + π_2 = 1.

Jensen-Shannon (JS) divergence between a finite number of probability distributions:

    JS_π({p_i : 1 ≤ i ≤ n}) = Σ_i π_i KL(p_i || Σ_j π_j p_j) = H(Σ_i π_i p_i) − Σ_i π_i H(p_i)
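
A sketch of the weighted JS divergence of n distributions, implementing the second form above, JS_π = H(Σ_i π_i p_i) − Σ_i π_i H(p_i); the function and variable names are mine.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def js_divergence(dists, weights):
    """dists: n distributions as rows; weights: n non-negative priors summing to 1."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists                               # sum_i pi_i p_i
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, dists))

p1, p2 = [0.5, 0.5, 0.0], [0.1, 0.3, 0.6]
print(js_divergence([p1, p1], [0.5, 0.5]))   # 0.0: identical distributions
print(js_divergence([p1, p2], [0.5, 0.5]))   # > 0: JS divergence is always non-negative
```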

Page 11: IMA Workshop on Data Analysis & Optimization May 7, 2003

Information-Theoretic Clustering (preserving mutual information)

(Lemma) The loss in mutual information equals:

    I(X; Y) − I(X; Ŷ) = Σ_{j=1}^{k} p(ŷ_j) · JS*({ p(x|y_t) : y_t ∈ ŷ_j })

(here JS* denotes the JS divergence weighted by the within-cluster priors p(y_t)/p(ŷ_j)).

Interpretation: the quality of each cluster is measured by the Jensen-Shannon divergence between the individual distributions in the cluster.

Can rewrite the above as:

    I(X; Y) − I(X; Ŷ) = Σ_{j=1}^{k} Σ_{y_t ∈ ŷ_j} p(y_t) KL( p(x|y_t) || p(x|ŷ_j) )

Goal: find a clustering that minimizes the above loss.
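
A small numerical check of the rewritten loss, using the 6×6 example distribution that appears elsewhere in the deck; the array layout and names are mine.

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def mutual_info(pxy):
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

pxy = np.array([[.04, .04, .00, .04, .04, .04],
                [.04, .04, .04, .00, .04, .04],
                [.05, .05, .05, .00, .00, .00],
                [.05, .05, .05, .00, .00, .00],
                [.00, .00, .00, .05, .05, .05],
                [.00, .00, .00, .05, .05, .05]])
y_cluster = np.array([0, 0, 0, 1, 1, 1])       # assign the six y's (columns) to two clusters

# Loss computed from the rewritten formula.
py = pxy.sum(axis=0)
loss = 0.0
for j in np.unique(y_cluster):
    cols = np.flatnonzero(y_cluster == j)
    p_x_given_yhat = pxy[:, cols].sum(axis=1) / py[cols].sum()
    for t in cols:
        loss += py[t] * kl(pxy[:, t] / py[t], p_x_given_yhat)

# Loss computed directly as I(X;Y) - I(X;Yhat).
p_x_yhat = np.column_stack([pxy[:, y_cluster == j].sum(axis=1) for j in np.unique(y_cluster)])
print(loss, mutual_info(pxy) - mutual_info(p_x_yhat))   # the two values agree
```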

Page 12: IMA Workshop on Data Analysis & Optimization May 7, 2003

Information Theoretic Co-clustering (preserving mutual information)

(Lemma) The loss in mutual information equals

    I(X; Y) − I(X̂; Ŷ) = KL( p(x, y) || q(x, y) )
                      = H(X̂, Ŷ) + H(X | X̂) + H(Y | Ŷ) − H(X, Y)

where

    q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ),   with x ∈ x̂, y ∈ ŷ

Can be shown that q(x, y) is a “maximum entropy” approximation to p(x, y).

q(x, y) preserves the marginals: q(x) = p(x) and q(y) = p(y).

Page 13: IMA Workshop on Data Analysis & Optimization May 7, 2003

p(x, y):

    .04  .04   0   .04  .04  .04
    .04  .04  .04   0   .04  .04
    .05  .05  .05   0    0    0
    .05  .05  .05   0    0    0
     0    0    0   .05  .05  .05
     0    0    0   .05  .05  .05

q(x, y):

    .036 .036 .028 .028 .036 .036
    .036 .036 .028 .028 .036 .036
    .054 .054 .042   0    0    0
    .054 .054 .042   0    0    0
      0    0    0  .042 .054 .054
      0    0    0  .042 .054 .054

p(x | x̂):

    .5   0   0
    .5   0   0
     0  .5   0
     0  .5   0
     0   0  .5
     0   0  .5

p(x̂, ŷ):

    .2  .2
    .3   0
     0  .3

p(y | ŷ):

    .36 .36 .28   0   0   0
      0   0   0 .28 .36 .36

Number of parameters that determine q: (m − k) + (kl − 1) + (n − l)
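
A short sketch that reproduces the q(x, y) above from p(x, y) and the clusterings used in this example (rows {1,2}, {3,4}, {5,6}; columns {1,2,3}, {4,5,6}); the indicator-matrix construction and the names are mine.

```python
import numpy as np

p = np.array([[.04, .04, .00, .04, .04, .04],
              [.04, .04, .04, .00, .04, .04],
              [.05, .05, .05, .00, .00, .00],
              [.05, .05, .05, .00, .00, .00],
              [.00, .00, .00, .05, .05, .05],
              [.00, .00, .00, .05, .05, .05]])
row_cluster = np.array([0, 0, 1, 1, 2, 2])
col_cluster = np.array([0, 0, 0, 1, 1, 1])

R = np.eye(3)[row_cluster]            # 6x3 row-cluster indicator matrix
C = np.eye(2)[col_cluster]            # 6x2 column-cluster indicator matrix
p_hat = R.T @ p @ C                   # p(xhat, yhat) = [[.2, .2], [.3, 0], [0, .3]]
p_x_given_xhat = p.sum(axis=1) / (R @ p_hat.sum(axis=1))   # p(x|xhat): 0.5 for every row
p_y_given_yhat = p.sum(axis=0) / (C @ p_hat.sum(axis=0))   # p(y|yhat): .36 .36 .28 .28 .36 .36

# q(x, y) = p(xhat, yhat) p(x|xhat) p(y|yhat)
q = (R @ p_hat @ C.T) * np.outer(p_x_given_xhat, p_y_given_yhat)

print(np.round(q, 3))                 # reproduces the q(x, y) matrix shown above
print(np.allclose(q.sum(axis=1), p.sum(axis=1)),
      np.allclose(q.sum(axis=0), p.sum(axis=0)))            # marginals are preserved
```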

Page 14: IMA Workshop on Data Analysis & Optimization May 7, 2003

Preserving Mutual Information

Lemma:

    KL( p(x, y) || q(x, y) ) = Σ_{x̂} Σ_{x ∈ x̂} p(x) KL( p(y|x) || q(y|x̂) )

where

    q(y|x̂) = p(y|ŷ) p(ŷ|x̂),   y ∈ ŷ

Note that q(y|x̂) may be thought of as the “prototype” of row cluster x̂ (the usual “centroid” of the cluster is Σ_{x ∈ x̂} p(y|x) p(x|x̂)).

Similarly,

    KL( p(x, y) || q(x, y) ) = Σ_{ŷ} Σ_{y ∈ ŷ} p(y) KL( p(x|y) || q(x|ŷ) )
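
A sketch that checks the first decomposition numerically, using the example p(x, y) and clusterings from the previous slide; the names are mine, and the row-cluster prototypes q(y|x̂) computed here reappear on the next slide.

```python
import numpy as np

def kl(a, b):
    nz = a > 0
    return np.sum(a[nz] * np.log(a[nz] / b[nz]))

p = np.array([[.04, .04, .00, .04, .04, .04],
              [.04, .04, .04, .00, .04, .04],
              [.05, .05, .05, .00, .00, .00],
              [.05, .05, .05, .00, .00, .00],
              [.00, .00, .00, .05, .05, .05],
              [.00, .00, .00, .05, .05, .05]])
row_cluster = np.array([0, 0, 1, 1, 2, 2])
col_cluster = np.array([0, 0, 0, 1, 1, 1])
R, C = np.eye(3)[row_cluster], np.eye(2)[col_cluster]
p_hat = R.T @ p @ C
px, py = p.sum(axis=1), p.sum(axis=0)

# Row-cluster prototypes q(y|xhat) = p(y|yhat) p(yhat|xhat), one row per row cluster.
p_y_given_yhat = py / (C @ p_hat.sum(axis=0))
p_yhat_given_xhat = p_hat / p_hat.sum(axis=1, keepdims=True)
prototypes = (p_yhat_given_xhat @ C.T) * p_y_given_yhat      # 3 x 6

# Left side: KL(p || q) with q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat).
q = (R @ p_hat @ C.T) * np.outer(px / (R @ p_hat.sum(axis=1)), p_y_given_yhat)
lhs = kl(p.ravel(), q.ravel())
# Right side: sum_x p(x) KL(p(y|x) || q(y|xhat(x))).
rhs = sum(px[i] * kl(p[i] / px[i], prototypes[row_cluster[i]]) for i in range(6))
print(np.round(prototypes, 2))       # the .18/.14 row, then the .36/.28 rows
print(np.isclose(lhs, rhs))          # True
```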

Page 15: IMA Workshop on Data Analysis & Optimization May 7, 2003

Example – Continued

p(x̂, ŷ):

    .20  .20
    .30   0
     0   .30

q(y|x̂):

    .18 .18 .14 .14 .18 .18
    .36 .36 .28   0   0   0
      0   0   0 .28 .36 .36

q(x|ŷ):

    .16  .24
    .24  .16
    .30   0
    .30   0
     0   .30
     0   .30

Page 16: IMA Workshop on Data Analysis & Optimization May 7, 2003

Co-Clustering Algorithm
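
The slide presents the algorithm as a figure. Below is a minimal sketch of an alternating scheme consistent with the preceding lemmas: reassign each row to the row cluster whose prototype q(y|x̂) is closest in KL divergence to p(y|x), then do the same for columns, and repeat. It is an illustration under those assumptions, not necessarily the slide's exact pseudocode, and all function and variable names are mine.

```python
import numpy as np

def kl(a, b):
    nz = a > 0
    if np.any(b[nz] == 0):              # treat KL as infinite if a has mass where b has none
        return np.inf
    return np.sum(a[nz] * np.log(a[nz] / b[nz]))

def prototypes(p, row_cluster, col_cluster, k, l):
    """Row prototypes q(y|xhat) (k x n) and column prototypes q(x|yhat) (l x m)."""
    R, C = np.eye(k)[row_cluster], np.eye(l)[col_cluster]
    p_hat = R.T @ p @ C
    eps = 1e-12                                              # guards against empty clusters
    p_x_given_xhat = p.sum(axis=1) / (R @ p_hat.sum(axis=1)).clip(min=eps)
    p_y_given_yhat = p.sum(axis=0) / (C @ p_hat.sum(axis=0)).clip(min=eps)
    p_yhat_given_xhat = p_hat / p_hat.sum(axis=1, keepdims=True).clip(min=eps)
    p_xhat_given_yhat = (p_hat / p_hat.sum(axis=0, keepdims=True).clip(min=eps)).T
    q_y_given_xhat = (p_yhat_given_xhat @ C.T) * p_y_given_yhat
    q_x_given_yhat = (p_xhat_given_yhat @ R.T) * p_x_given_xhat
    return q_y_given_xhat, q_x_given_yhat

def cocluster(p, k, l, n_iter=20, seed=0):
    """Alternately reassign rows and columns to their KL-closest prototype."""
    rng = np.random.default_rng(seed)
    m, n = p.shape
    row_cluster = rng.integers(k, size=m)
    col_cluster = rng.integers(l, size=n)
    px, py = p.sum(axis=1), p.sum(axis=0)                    # assumed strictly positive
    for _ in range(n_iter):
        row_proto, _ = prototypes(p, row_cluster, col_cluster, k, l)
        row_cluster = np.array([np.argmin([kl(p[i] / px[i], proto) for proto in row_proto])
                                for i in range(m)])
        _, col_proto = prototypes(p, row_cluster, col_cluster, k, l)
        col_cluster = np.array([np.argmin([kl(p[:, j] / py[j], proto) for proto in col_proto])
                                for j in range(n)])
    return row_cluster, col_cluster

p = np.array([[.04, .04, .00, .04, .04, .04],
              [.04, .04, .04, .00, .04, .04],
              [.05, .05, .05, .00, .00, .00],
              [.05, .05, .05, .00, .00, .00],
              [.00, .00, .00, .05, .05, .05],
              [.00, .00, .00, .05, .05, .05]])
print(cocluster(p, k=3, l=2))
```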

Page 17: IMA Workshop on Data Analysis & Optimization May 7, 2003

Properties of the Co-clustering Algorithm

Theorem: the co-clustering algorithm monotonically decreases the loss in mutual information (the objective function value)

Marginals p(x) and p(y) are preserved at every step (q(x)=p(x) and q(y)=p(y) )

Can be generalized to higher dimensions

Page 18: IMA Workshop on Data Analysis & Optimization May 7, 2003
Page 19: IMA Workshop on Data Analysis & Optimization May 7, 2003

Applications -- Text Classification

Assigning class labels to text documents: training and testing phases

[Diagram: a document collection grouped into classes (Class-1, ..., Class-m) forms the training data; a classifier learns from the training data and assigns a class to each new document.]

Page 20: IMA Workshop on Data Analysis & Optimization May 7, 2003

Dimensionality Reduction

Feature Selection

[Diagram: a document's bag-of-words is mapped to a vector of m word features]

• Select the “best” words
• Throw away the rest
• Frequency based pruning
• Information criterion based pruning

Feature Clustering

[Diagram: a document's bag-of-words is mapped to a vector of k word-cluster features]

• Do not throw away words
• Cluster words instead
• Use word clusters as features (see the sketch below)
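
A tiny sketch of the feature-clustering route, assuming a word-to-cluster assignment is already available (for example, produced by a word-clustering algorithm); the toy counts and the names are mine.

```python
import numpy as np

# Toy bag-of-words counts for 3 documents over a vocabulary of 6 words.
doc_word = np.array([[2, 1, 0, 0, 0, 1],
                     [0, 0, 3, 1, 0, 0],
                     [1, 0, 0, 2, 2, 0]])
word_cluster = np.array([0, 0, 1, 1, 2, 2])   # each word assigned to one of k = 3 clusters

# Cluster-level features: add up the counts of all words that fall in the same cluster.
indicator = np.eye(3)[word_cluster]           # 6 x 3 word-to-cluster indicator
doc_cluster = doc_word @ indicator            # 3 x 3: one feature per word cluster
print(doc_cluster)
```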

Page 21: IMA Workshop on Data Analysis & Optimization May 7, 2003

Experiments

Data sets:
• 20 Newsgroups data: 20 classes, 20000 documents
• Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
• Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt

Implementation details: Bow – for indexing, co-clustering, clustering and classifying

Page 22: IMA Workshop on Data Analysis & Optimization May 7, 2003

Naïve Bayes with word clusters

Naïve Bayes classifier: assign document d to the class with the highest score

    c*(d) = argmax_i [ log p(c_i) + Σ_{t=1}^{v} p(w_t|d) log p(w_t|c_i) ]

Relation to KL divergence:

    c*(d) = argmin_i [ KL( p(W|d) || p(W|c_i) ) − log p(c_i) ]

Using word clusters instead of words:

    c*(d) = argmax_i [ log p(c_i) + Σ_{s=1}^{k} p(ŵ_s|d) log p(ŵ_s|c_i) ]

where the parameters for word clusters are estimated according to joint statistics.
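
A sketch of the cluster-based scoring rule above; the toy parameters are mine, and a real implementation would estimate them from training data (with smoothing).

```python
import numpy as np

def predict(cluster_counts, class_prior, p_cluster_given_class):
    """argmax_i [ log p(c_i) + sum_s p(what_s|d) log p(what_s|c_i) ]."""
    d = cluster_counts / cluster_counts.sum()                # p(what_s | d)
    scores = np.log(class_prior) + d @ np.log(p_cluster_given_class).T
    return int(np.argmax(scores))

# Toy model: 2 classes, 3 word clusters.
class_prior = np.array([0.5, 0.5])
p_cluster_given_class = np.array([[0.7, 0.2, 0.1],          # p(what | c_1)
                                  [0.1, 0.2, 0.7]])         # p(what | c_2)
print(predict(np.array([5.0, 1.0, 0.0]), class_prior, p_cluster_given_class))  # -> 0
print(predict(np.array([0.0, 2.0, 6.0]), class_prior, p_cluster_given_class))  # -> 1
```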

Page 23: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results (20Ng)

Classification Accuracy on 20 Newsgroups data with 1/3-2/3 test-train split

Clustering beats feature selection algorithms by a large margin

The effect is more significant at a smaller number of features

Page 24: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results (Dmoz)

Classification accuracy on Dmoz data with a 1/3-2/3 test-train split

Divisive Clustering is better at a smaller number of features

Note contrasting behavior of Naïve Bayes and SVMs

Page 25: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results (Dmoz): Naïve Bayes on Dmoz data with only 2% training data

Note that Divisive Clustering achieves a higher maximum than IG, a significant 13% increase

Divisive Clustering performs better than IG when less training data is available

Page 26: IMA Workshop on Data Analysis & Optimization May 7, 2003

Hierarchical Classification

[Class hierarchy: Science splits into Math, Physics and Social Science; Math into Number Theory and Logic; Physics into Mechanics and Quantum Theory; Social Science into Economics and Archeology.]

• Flat classifier builds a classifier over the leaf classes in the above hierarchy
• Hierarchical classifier builds a classifier at each internal node of the hierarchy

Page 27: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results (Dmoz)

[Chart: classification accuracy (%) vs. number of features (5 to 10000) on Dmoz data for Hierarchical, Flat (DC) and Flat (IG) classifiers.]

• Hierarchical classifier (Naïve Bayes at each node)
• Hierarchical classifier: 64.54% accuracy at just 10 features (flat achieves 64.04% accuracy at 1000 features)
• Hierarchical classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by the flat classifiers

Page 28: IMA Workshop on Data Analysis & Optimization May 7, 2003

Example

Cluster 10, Divisive Clustering (rec.sport.hockey): team, game, play, hockey, Season, boston, chicago, pit, van, nhl

Cluster 9, Divisive Clustering (rec.sport.baseball): hit, runs, Baseball, base, Ball, greg, morris, Ted, Pitcher, Hitting

Cluster 12, Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball): team, detroit, hockey, pitching, Games, hitter, Players, rangers, baseball, nyi, league, morris, player, blues, nhl, shots, Pit, Vancouver, buffalo, ens

Top few words, sorted, in clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data

Page 29: IMA Workshop on Data Analysis & Optimization May 7, 2003

Co-clustering Example for Text Data

[Diagram: a word-by-document co-occurrence matrix with rows grouped into word clusters and columns grouped into document clusters.]

Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix

Page 30: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results– CLASSIC3

[Confusion matrices on the CLASSIC3 data set: 1-D clustering attains accuracy 0.821, while co-clustering attains accuracy 0.9835.]

Page 31: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results – Sparsity

Page 32: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results – continued

Page 33: IMA Workshop on Data Analysis & Optimization May 7, 2003

Results (Monotonicity)

Page 34: IMA Workshop on Data Analysis & Optimization May 7, 2003

Related Work

Distributional Clustering: Pereira, Tishby & Lee (1993); Baker & McCallum (1998)

Information Bottleneck: Tishby, Pereira & Bialek (1999); Berkhin & Becher (2002)

Probabilistic Latent Semantic Indexing: Hofmann (1999)

Non-Negative Matrix Approximation: Lee & Seung (2000)

Page 35: IMA Workshop on Data Analysis & Optimization May 7, 2003

Conclusions

Information-theoretic approaches to clustering and co-clustering

The co-clustering problem is tied to a non-negative matrix approximation and requires the estimation of fewer parameters

Can be extended to the more general class of Bregman divergences; KL-divergence and squared Euclidean distance are special cases

The theoretical approach has the potential to extend to other problems: incorporating unlabelled data, multi-dimensional co-clustering, MDL to choose the number of clusters

Page 36: IMA Workshop on Data Analysis & Optimization May 7, 2003

Contact Information

Email: [email protected]

Papers are available at: http://www.cs.utexas.edu/users/inderjit

“Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also see KDD 2002).

“Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, to appear in KDD 2003 (also UTCS Technical Report).

“Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, UTCS Technical Report, 2003.