Multimode clustering
Maurizio Vichi, Department of Statistics
Sapienza University of Rome
Firenze, 9 May 2012
Keywords:
- Data dimensionality reduction
- Multimode clustering
- Bi-clustering, co-clustering
- Block clustering
Fields of application
- Medicine and bioinformatics (microarray data);
- Marketing (preference data);
- Chemometrics (non-negative matrix factorization).
Outline of the presentation
- Three-way and two-way data
- Single partitioning
- Multi-partitioning
- Clustering models: Least Squares Estimation (LSE)
- Maximum Likelihood Estimation (MLE)
- Examples
Three-way Data Set X

A set X of n × J × H values related to:
J variables measured (observed, estimated) on
n objects (individuals, products) at
H occasions (assessors, times, locations, etc.).

The set X is organized as a 3-way array with generic entry x_ijh (unit i, variable j, occasion h).

Example: a set of sensory variables (preference data) on products such as steamed potatoes, grape/raspberry beverages, Riesling wines, evaluated by a set of trained assessors at several occasions (times).

[Figure: the n × J × H data array, sliced by occasion h.]
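The slicing-and-matricizing step described above can be sketched in a few lines of NumPy; the sizes are illustrative stand-ins, not the sensory data of the slides:

```python
import numpy as np

# A small three-way array: n = 4 products, J = 3 sensory variables,
# H = 2 assessors (occasions), with generic entry X3[i, j, h].
rng = np.random.default_rng(0)
n, J, H = 4, 3, 2
X3 = rng.normal(size=(n, J, H))

# Matricize by placing the H occasion slices side by side: X is n x JH.
X = np.hstack([X3[:, :, h] for h in range(H)])
print(X.shape)  # (4, 6)
```

This n × JH arrangement is the one used for the model formulas later in the slides.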
Single partitioning: symmetrical vs. asymmetrical mode reduction

Two approaches:

1. Symmetric treatment of units (rows) and variables (columns): clustering for units and clustering for variables. Result: reduced sets of mean profiles for units and variables;

OR

2. Asymmetric treatment of units (rows) and variables (columns): clustering for units and factorial methods for variables. Result: reduced sets of mean profiles and factors.
Symmetrical single partitioning for each mode: from partitioning objects and variables to two-mode partitioning (co-clustering, biclustering).

[Figure: a 6 × 3 data matrix X with a binary membership matrix U (6 objects × 3 clusters) for the partitioning of units, a binary membership matrix V (3 variables × 2 clusters) for the partitioning of variables, and the resulting two-mode partitioning with the centroid matrix X̄ of block means.]
[Figure: the n × J data matrix X of entries x_ij partitioned into clusters 1, …, G of OBJECTS (rows) and clusters 1, …, Q of VARIABLES (columns); each block is summarized by its mean, giving the centroid matrix X̄.]
Asymmetrical single partitioning for each mode: partitioning of objects and factorial reduction of variables.

- Objects are generally synthesized by mean vectors.
- Variables are synthesized by factors (latent variables, components).

[Example: a binary membership matrix U (6 objects × 3 clusters) for the objects, and a loading matrix for the variables (3 variables × 2 components, e.g., loadings 0.77, 0.63 on the first component and 1 on the second).]
Multi-partitioning
Single partitioning for one mode and multi-partitioning for the other mode: from two-mode partitioning to two-mode multipartitioning (objects / variables).

[Figure: two-mode multipartitioning of variables vs. two-mode multipartitioning of objects, illustrated on a microarray.]

It can be seen that the genes (here on the columns) are partitioned into 5 blocks, and that each block is divided into a different number of groups of slides (here on the rows). For instance, the second group of genes is divided into two groups of slides, one of which is composed of only a single outlier. Red = up-regulated, green = down-regulated.
Models for three- and two-way multimode clustering
Multimode clustering for three-way data

X = AUX̄(CW ⊗ BV)′ + E

subject to
U, V, W binary and row stochastic;
A, B, C diagonal and such that
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q;
w′r C′C wr = 1, for r = 1,…,R;

where
X = Xn,JH (n × JH) = [X1, X2, …, XH] is the three-way data matrix obtained by placing the occasion slices Xh side by side;
E = En,JH = [eijh] (n × JH) = [E1, …, EH] is the three-way error matrix;
X̄ = X̄K,QR (K × QR) = [X̄1, …, X̄R] is the three-way centroid matrix;
U, V, W are classification matrices for objects, variables and occasions, respectively;
A, B, C are diagonal weighting matrices for objects, variables and occasions.
Multimode clustering

where U, V, W are the classification matrices for objects, variables and occasions, respectively:

U = [uik] (n × K), binary and row stochastic, defines a partition of the objects into K clusters, with uik = 1 if the i-th object belongs to cluster k, uik = 0 otherwise;
V = [vjq] (J × Q), binary and row stochastic, defines a partition of the variables into Q clusters, with vjq = 1 if the j-th variable belongs to the q-th cluster, vjq = 0 otherwise;
W = [whr] (H × R), binary, defines a partition of the occasions into R clusters, with whr = 1 if the h-th occasion belongs to the r-th cluster, whr = 0 otherwise;

and A, B, C are the diagonal weighting matrices:

A = dg(a1,…,an) (n × n), weighting objects, with ∑i uik ai² = 1 for each k, hence ∑k ∑i uik ai² = K;
B = dg(b1,…,bJ) (J × J), weighting variables, with ∑j vjq bj² = 1 for each q, hence ∑q ∑j vjq bj² = Q;
C = dg(c1,…,cH) (H × H), weighting occasions, with ∑h whr ch² = 1 for each r, hence ∑r ∑h whr ch² = R.
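As a minimal sketch (with made-up sizes and data), membership matrices such as U and V above are binary with exactly one 1 per row, and together they induce a K × Q matrix of block means:

```python
import numpy as np

# Binary, row-stochastic membership matrices for a toy two-mode partition:
# 6 objects in K = 3 clusters, 3 variables in Q = 2 clusters.
U = np.eye(3)[[0, 0, 1, 2, 2, 2]]    # each row has exactly one 1
V = np.eye(2)[[0, 0, 1]]
assert np.all(U.sum(axis=1) == 1) and np.all(V.sum(axis=1) == 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
# Centroid matrix of block means: Xbar = (U'U)^-1 U'X V (V'V)^-1, K x Q.
Xbar = np.linalg.inv(U.T @ U) @ U.T @ X @ V @ np.linalg.inv(V.T @ V)
print(Xbar.shape)  # (3, 2)
```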
Properties

Let us reparameterize Ũ = AU, Ṽ = BV, W̃ = CW; the model can then be written as

X = Ũ X̄ (W̃ ⊗ Ṽ)′ + E

subject to
Ũ′Ũ = IK, Ṽ′Ṽ = IQ, W̃′W̃ = IR.

This is a special Tucker three-mode factor analysis model (Tucker, 1966), where X̄ is the core matrix and Ũ, Ṽ and W̃ are particular columnwise orthonormal matrices.
Two-mode clustering (H = 1): bi-clustering

Symmetric reduction of rows and columns of the data matrix X. Classes of objects and variables are summarized by components (latent objects and latent variables).

With H = 1, X = AUX̄(1 ⊗ V′B) + E, i.e., the Generalized Double K-means (GDKM) model for asymmetrical single partitioning

X = AUX̄V′B + E

subject to
U, V binary and row stochastic;
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q.
Non-negative matrix factorization

Let us now factorize the centroid matrix X̄ into the product of two non-negative matrices H (K × L) and M (Q × L):

X̄ = HM′ + Ē.

Including this in the GDKM model gives a non-negative matrix factorization algorithm for two-mode single partitioning, i.e.

X = AUHM′V′B + E

subject to
H ≥ 0, M ≥ 0;
U, V binary and row stochastic;
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q;

where E also includes the error part related to X̄, that is, AUĒV′B.

Uniqueness. The factorization is not unique: an arbitrary non-negative monomial matrix D (a permutation matrix with positive non-null elements) and its inverse can be used to transform the two factorization matrices:

HM′ = HDD⁻¹M′ = H̃M̃′.

Thus H̃ = HD and M̃′ = D⁻¹M′ form another non-negative matrix factorization.
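The non-uniqueness argument above is easy to verify numerically; the sketch below builds a random non-negative factorization and a monomial matrix D (assumed 2 × 2 here) and checks that HD and D⁻¹M′ reproduce the same product:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q, L = 4, 3, 2
H = rng.random((K, L))          # non-negative factor (K x L)
M = rng.random((Q, L))          # non-negative factor (Q x L)
Xbar = H @ M.T                  # rank-L non-negative "centroid" matrix

# A non-negative monomial matrix D: a permutation with positive entries.
P = np.eye(L)[[1, 0]]           # swap the two columns
D = P @ np.diag([2.0, 0.5])     # scale, then permute

H2 = H @ D                      # transformed factors stay non-negative
M2T = np.linalg.inv(D) @ M.T    # inverse of a monomial matrix is monomial
assert np.all(H2 >= 0) and np.all(M2T >= 0)
print(np.allclose(H2 @ M2T, Xbar))  # True: the same factorization value
```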
Clustering and Disjoint Principal Component Analysis

Asymmetric reduction of rows and columns of the data matrix X. Classes of objects are summarized by mean profiles; classes of variables are summarized by components.

Fixing H = 1 and A = In gives CDPCA (Vichi, Saporta, 2009, CSDA):

X = UX̄V′B + E

subject to
U, V binary and row stochastic;
v′q B′B vq = 1, for q = 1,…,Q.

One major drawback of PCA is that each variable may contribute to the definition of more than a single component. This is not the case for CDPCA: each variable is summarized by a single factor, so each factor summarizes only a disjoint class of variables, and all classes form a partition of the variables.
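A quick numerical illustration of the disjointness property, using the CDPCA loadings reported in the OECD example later in the slides (treated here just as a worked check):

```python
import numpy as np

# CDPCA loadings (rows = components, columns = variables in the order
# GDP, IR, LI, UR, NNS, TB).  The zeros make the loadings disjoint.
B = np.array([
    [-0.383,  0.0,    0.0,   -0.498, 0.778, 0.0  ],   # Dim 1
    [ 0.0,   -0.697, -0.229,  0.0,   0.0,   0.679],   # Dim 2
])

# Disjointness: every variable (column) loads on exactly one component.
print((np.count_nonzero(B, axis=0) == 1).all())   # True

# Each component's loading vector has (approximately) unit norm,
# matching the normalization constraint v'B'Bv = 1.
print(np.round(np.linalg.norm(B, axis=1), 2))     # [1. 1.]
```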
Model-free: Least-Squares Estimation

SSRdk = ||X − UX̄V′B||² → min over U, V, X̄, B

subject to
U and V binary and row stochastic;
B diagonal and such that V′BBV = IQ.

Coordinate descent algorithm

Step 1: X̄ = (U′U)⁻¹U′XBV(V′BBV)⁻¹ = (U′U)⁻¹U′XBV.
Step 2: uip = 1 if ||x′iB̂V̂ − x̄̂′p||² = min{ ||x′iB̂V̂ − x̄̂′s||² : s = 1,…,P; s ≠ p }; uip = 0 otherwise.
Step 3: vjq = 1 if F(cq, Û, X̄̂, [vjq = 1]) = max{ F(cr, Û, X̄̂, [vjr = 1]) : r = 1,…,Q; r ≠ q }; vjq = 0 otherwise.
Step 4: B = ∑q=1..Q dg(vq) dg(cq).

Stopping rule: if SSRdk decreases by less than a constant ε > 0, the algorithm has converged.
Short Term Indicators and Economic Performance Indicators (OECD, 1999)

20 countries: Australia (A-lia), Canada (Can), Finland (Fin), France (Fra), Spain (Spa), Sweden (Swe), United States (USA), Netherlands (Net), Greece (Gre), Mexico (Mex), Portugal (Por), Austria (A-tria), Belgium (Bel), Denmark (Den), Germany (Ger), Italy (Ita), Japan (Jap), Norway (Nor), Switzerland (Swi), United Kingdom (UK).

6 macroeconomic variables: Gross Domestic Product (GDP), Leading Indicator (LI), Unemployment Rate (UR), Interest Rate (IR), Trade Balance (TB), Net National Savings (NNS).

Loadings of Clustering and Disjoint PCA:

        GDP     IR      LI      UR      NNS     TB
Dim 1  -0.383   0       0      -0.498   0.778   0
Dim 2   0      -0.697  -0.229   0       0       0.679

Var(Dim1) = 1.5601, Var(Dim2) = 1.2553.

Component loadings of PCA:

        GDP     IR      LI      UR      NNS     TB
Dim 1  -0.567  -0.175  -0.192  -0.489   0.607   0.059
Dim 2  -0.065  -0.696  -0.229   0.367  -0.092   0.563

Var(Dim1) = 1.6531, Var(Dim2) = 1.3680.

[Scatter plot of the countries on Dim 1 (26%): GDP (15%), UR (25%), NNS (60%); and Dim 2 (21%): IR (49%), LI (5%), TB (46%).]

Mex, Por and Gre are in the same class also under k-means on the original variables; Ita, Ger and Den have almost equal NNS and GDP.
[Slide repeated with the CDPCA factorial plane: same tables as above, plotted on Dim 1 (28%) and Dim 2 (23%).]
Maximum likelihood approach
Dimension reduction for units: Gaussian Mixture Model

The population is composed of P subpopulations in proportions π1, π2, …, πP, with ∑p=1..P πp = 1.

The data (x′i, u′i)′ include the J variables and the P-dimensional binary, row-stochastic vector ui:

xi = X̄′ui + ei   (i = 1,…,I)

where xi is column centered, X̄ = [µ1, µ2, …, µP]′, and ei is the random error with
(i) E(ei) = 0,
(ii) Cov(ei) = Σp.

Distributional assumptions:
xi | ui ~ fp(x; θp) = NJ(µp, Σp).

In general, the variable ui can be:
1. a fixed label, i.e., ui cannot be considered a random variable;
2. a random label with distribution ui ~ ∏p=1..P πp^uip, a multinomial of one draw on P categories.
Dimension reduction for variables: factorial analyses

In the asymmetric framework:

• The Factor Analysis model

xi = Λfi + ei   (i = 1,…,I)

with
(i) E(ei) = E(fi) = 0,
(ii) Cov(ei) = Ψ (diagonal, ψj > 0),
(iii) Cov(fi) = I.

It implies Cov(xi) = ΛΛ′ + Ψ, with distributional assumption xi ~ NJ(0, ΛΛ′ + Ψ).

• Probabilistic PCA and PCA

Cov(xi) = ΛΛ′ + σ²I,  xi ~ NJ(0, ΛΛ′ + σ²I)   (isotropic component)
Cov(xi) = ΛΛ′ΣΛΛ′,  xi ~ NJ(0, ΛΛ′ΣΛΛ′)   (homoscedastic component)
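The implied covariance Cov(xi) = ΛΛ′ + Ψ of the factor model can be checked by simulation; a sketch with arbitrary small dimensions:

```python
import numpy as np

# Simulate x_i = Lambda f_i + e_i with Cov(f) = I and diagonal Cov(e) = Psi,
# then check that the sample covariance approaches Lambda Lambda' + Psi.
rng = np.random.default_rng(0)
J, L, n = 5, 2, 200_000
Lam = rng.normal(size=(J, L))
Psi = np.diag(rng.uniform(0.5, 1.0, size=J))
F = rng.normal(size=(n, L))
E = rng.normal(size=(n, J)) @ np.sqrt(Psi)   # errors with covariance Psi
X = F @ Lam.T + E
print(np.allclose(np.cov(X.T), Lam @ Lam.T + Psi, atol=0.05))  # True
```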
Simultaneous GMM and PCA (Rocci & Vichi, 2001, 2006)

The GMM is modified as

xi = ΛΛ′X̄′ui + ei

and estimated by a coordinate ascent formulation of the EM algorithm via a penalized complete log-likelihood (Hathaway, 1986):

max ∑i ∑p [ uip ln NJ(xi; ΛΛ′µp, Σ) + uip ln πp ]

subject to
πp ≥ 0 (p = 1,…,P), ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
Λ′Λ = I.

The estimators of µp, Σ, uip and πp are those of the GMM. For Λ, the relevant part of the complete log-likelihood is

cl(θ) = −(1/2) ∑ip uip (xi − ΛΛ′µp)′ Σ⁻¹ (xi − ΛΛ′µp),

whose maximization over Λ, subject to Λ′Λ = I, amounts (up to constants) to maximizing the between-cluster variance term

(1/2) ∑p πp µ′p ΛΛ′ Σ⁻¹ ΛΛ′ µp,

where X̄ = [µ1, µ2, …, µP]′ and D = dg(π1, π2, …, πP).

This is equivalent to the minimization problem

min ||D^(1/2) X̄ Σ⁻¹ (I − ΛΛ′)||²   subject to Λ′Λ = I,

which in turn is equivalent to the maximization problem

max tr(Λ′ ΣB Λ)   subject to Λ′ΣΛ = I,

where ΣB = X̄′DX̄ is the between-cluster covariance matrix. This corresponds to Linear Discriminant Analysis; therefore we have a simultaneous GMM and LDA.
Application: wine recognition data (Blake, C.L. & Merz, C.J., 1998)

UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html

178 wines × 13 constituents. These data are the results of a chemical analysis of wines derived from three different cultivars (Barolo, Grignolino, Barbera). The analysis determined the quantities of 13 chemical properties found in each of the three types of wine:

1) Alcohol; 2) Malic acid; 3) Ash; 4) Alcalinity of ash; 5) Magnesium; 6) Total phenols; 7) Flavanoids; 8) Nonflavanoid phenols; 9) Proanthocyanins; 10) Color intensity; 11) Hue; 12) OD280/OD315 of diluted wines; 13) Proline.
Model selection: BIC (Bayesian Information Criterion), 2L(θ̂) − d ln n.

Selected model: 7 groups, 4 components.
Component variances: 17.59, 5.25, 1.78, 1.02.

Component-variable correlations:

Comp      1      2      3      4      5      6      7      8      9     10     11     12     13
1       0.01   0.55   0.07   0.39  -0.12  -0.68  -0.85   0.50  -0.51   0.61  -0.76  -0.86  -0.36
2       0.85   0.13   0.37  -0.38   0.41   0.34   0.30  -0.18   0.18   0.66  -0.13   0.10   0.84
3      -0.04   0.09   0.28   0.25  -0.54   0.11   0.25   0.14  -0.06   0.19  -0.06   0.21  -0.19
4      -0.11   0.06   0.01   0.28   0.43   0.19   0.19  -0.12   0.61   0.26  -0.03   0.17  -0.01
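A hedged sketch of model selection with the slide's BIC form, 2L(θ̂) − d ln n (larger is better); the log-likelihoods and parameter counts below are illustrative placeholders, not the fitted wine values:

```python
import numpy as np

# BIC in the slide's form, 2*L(theta_hat) - d*ln(n): larger is better.
def bic(loglik, n_params, n_obs):
    return 2.0 * loglik - n_params * np.log(n_obs)

# Hypothetical log-likelihoods for candidate (groups, components) models
# on n = 178 wines; the numbers are made up for illustration.
candidates = {
    (5, 3): bic(-3220.0, n_params=95, n_obs=178),
    (7, 4): bic(-3080.0, n_params=140, n_obs=178),
    (9, 5): bic(-3060.0, n_params=210, n_obs=178),
}
best = max(candidates, key=candidates.get)
print(best)  # (7, 4): the extra parameters of (9, 5) are not worth the gain
```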
Plot on the first two components

[Figure: the 178 wines plotted on the first two components, each point labelled by its estimated cluster (1-7).]
Classification

                   Estimated cluster
True           3    1    5    6    7    2    4
Barolo        59    0    0    0    0    0    0
Grignolino     0    5    1    5   57    0    3
Barbera        0    0    0    1    0   26   21

It is interesting to note that the wines were produced over a period of 10 years (1970-1979). Therefore, there is almost a nested structure (perhaps reflecting wines produced in different years).
Double K-Means (Vichi, 2001, CLADAG)

Symmetric reduction of rows and columns of the data matrix X. Classes of objects and variables are summarized by mean profiles:

X = UX̄V′ + E

subject to

uik ∈ {0, 1}, i = 1,…,n; k = 1,…,K  (hard object classification)  (2)
or uik ∈ [0, 1], i = 1,…,n; k = 1,…,K  (fuzzy object classification)  (2′)

∑k=1..K uik = 1, i = 1,…,n  (object partitioning)  (3)
or ∑k=1..K uik ≥ 1, i = 1,…,n  (object covering)  (3′)
or ∑k=1..K uik ≤ 1, i = 1,…,n  (object packing)  (3″)

vjq ∈ {0, 1}, j = 1,…,J; q = 1,…,Q  (hard variable classification)  (4)
or vjq ∈ [0, 1], j = 1,…,J; q = 1,…,Q  (fuzzy variable classification)  (4′)

∑q=1..Q vjq = 1, j = 1,…,J  (variable partitioning)  (5)
or ∑q=1..Q vjq ≥ 1, j = 1,…,J  (variable covering)  (5′)
or ∑q=1..Q vjq ≤ 1, j = 1,…,J  (variable packing)  (5″)
LS estimation of the model

F²(U, V, X̄) = ||X − UX̄V′||² → min over U, V, X̄

subject to
uik ∈ {0, 1} (i = 1,…,n; k = 1,…,K), ∑k=1..K uik = 1 (i = 1,…,n);
vjq ∈ {0, 1} (j = 1,…,J; q = 1,…,Q), ∑q=1..Q vjq = 1 (j = 1,…,J).

Coordinate Descent Algorithm

Step 1: X̄ = (U′U)⁻¹U′XV(V′V)⁻¹.
Step 2: ∀ i: uik = 1 if ||x′i − x̄′kV′||² = min{ ||x′i − x̄′lV′||² : l = 1,…,K }; uik = 0 otherwise.
Step 3: X̄ = (U′U)⁻¹U′XV(V′V)⁻¹ (updated with the new U).
Step 4: ∀ j: vjq = 1 if ||xj − UX̄·q||² = min{ ||xj − UX̄·l||² : l = 1,…,Q }; vjq = 0 otherwise, where xj is the j-th column of X and X̄·q the q-th column of X̄.

Stopping rule: if F² decreases by less than a constant ε > 0, the algorithm has converged.
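The coordinate descent above can be sketched as a small NumPy routine; this is a didactic sketch (random initialization, pseudoinverse for the centroid step, no safeguards for empty clusters), not a reference implementation:

```python
import numpy as np

def double_kmeans(X, K, Q, n_iter=50, tol=1e-8, seed=0):
    """Coordinate-descent sketch of double k-means: X ~ U Xbar V'."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    U = np.eye(K)[rng.integers(K, size=n)]      # random object memberships
    V = np.eye(Q)[rng.integers(Q, size=J)]      # random variable memberships
    prev = np.inf
    for _ in range(n_iter):
        # Step 1: centroid matrix Xbar = (U'U)^-1 U'X V (V'V)^-1
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        # Step 2: reassign each object to the closest row profile of Xbar V'
        R = Xbar @ V.T                          # K x J cluster row profiles
        U = np.eye(K)[np.argmin(((X[:, None, :] - R[None]) ** 2).sum(2), 1)]
        # Step 3: recompute the centroid matrix with the new U
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        # Step 4: reassign each variable to the closest column profile of U Xbar
        C = U @ Xbar                            # n x Q cluster column profiles
        V = np.eye(Q)[np.argmin(((X.T[:, None, :] - C.T[None]) ** 2).sum(2), 1)]
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        ssr = ((X - U @ Xbar @ V.T) ** 2).sum()
        if prev - ssr < tol:                    # stopping rule on F^2
            break
        prev = ssr
    return U, V, Xbar, ssr

# Toy check: a 20 x 10 matrix with a planted 2 x 2 block-mean structure.
rng = np.random.default_rng(1)
U0 = np.eye(2)[np.repeat([0, 1], 10)]
V0 = np.eye(2)[np.repeat([0, 1], 5)]
X = U0 @ np.array([[0.0, 5.0], [5.0, 0.0]]) @ V0.T + rng.normal(0, 0.1, (20, 10))
U, V, Xbar, ssr = double_kmeans(X, K=2, Q=2)
print(ssr < 5.0)  # the planted blocks are recovered almost exactly
```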
Simulation study for double k-means: local minima

The performance of the method has been evaluated using the following measures:
(1) Mrand(U, Ut): Modified Rand index (Hubert and Arabie, 1985) between the true matrix U and the one fitted for the t-th generated matrix Ut;
(2) Mrand(V, Vt): Modified Rand index between the true matrix V and the one fitted for the t-th generated matrix Vt;
(3) the number of times the fitted object partition equals the true partition, i.e., U = Ut (equality up to a column permutation matrix P, U = UtP);
(4) the number of times the fitted variable partition equals the true partition, i.e., V = Vt (equality up to a column permutation matrix P, V = VtP).

# runs to retain   Average     Average     % object       % variable     Average     Error
the best solution  Mrand (U)   Mrand (V)   partitions     partitions     number of   level
                                           = true         = true         iterations

Low error level, permuted matrix:
 1                 0.9853      0.9938       94             98            2.32        Low
 5                 1.0000      1.0000      100            100            2.10        Low

Medium error level, permuted matrix:
 1                 0.9592      0.9261       84             76            6.16        Medium
 5                 0.9902      0.9631       96             88            4.30        Medium
10                 0.9975      0.9969       99             99            3.15        Medium
15                 1.0000      1.0000      100            100            2.99        Medium

High error level, permuted matrix:
 1                 0.8894      0.9259       35             72            9.70        High
 5                 0.9788      0.9907       55             97            6.60        High
10                 0.9884      0.9999       55             99            6.25        High
20                 0.9901      0.9998       59             98            6.26        High
30                 0.9943      0.9994       61             99            5.45        High
40                 0.9956      1.0000       66            100            5.99        High
50                 0.9921      0.9998       60             97            6.21        High
80                 0.9940      0.9998       57             97            6.00        High
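The Modified Rand index of Hubert and Arabie (1985), used as measures (1)-(2) above, can be computed directly; a compact NumPy version with a sanity check (labels assumed to be 0-based integers):

```python
import numpy as np

def adjusted_rand(a, b):
    """Modified/Adjusted Rand index (Hubert & Arabie, 1985) between two
    labelings; 1 means identical partitions up to label permutation."""
    a, b = np.asarray(a), np.asarray(b)
    # Contingency table of the two partitions.
    ct = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(ct, (a, b), 1)
    comb = lambda x: x * (x - 1) / 2.0          # pair counts
    sum_ij = comb(ct).sum()
    sum_a, sum_b = comb(ct.sum(1)).sum(), comb(ct.sum(0)).sum()
    expected = sum_a * sum_b / comb(a.size)
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

# A partition identical up to label switching scores exactly 1.
print(adjusted_rand([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]))  # 1.0
```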
Model-Based Double Clustering (Vichi, Martella, 2012, JCS)

xi = HV X̄′ ui + ei,  where HV = V(V′V)⁻¹V′.

Maximum likelihood clustering:

max ∑i=1..I ∑p=1..P uip ln NJ(xi; HV x̄p, Σp)

subject to
uip ∈ {0, 1}, ∑p=1..P uip = 1 (i = 1,…,I);
vjq ∈ {0, 1}, ∑q=1..Q vjq = 1 (j = 1,…,J).

Gaussian mixture model:

max ∑i=1..I ∑p=1..P [ uip ln NJ(xi; HV x̄p, Σp) + uip ln πp ]

subject to
πp ≥ 0, ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
vjq ∈ {0, 1}, ∑q=1..Q vjq = 1 (j = 1,…,J).
Application: cutaneous melanoma (Bittner et al., 2000)

31 samples of cutaneous melanoma × 3,613 genes. Bittner et al. obtained two clusters of 12 and 19 samples.

Best solution for BIC and AIC: P = 4, Q = 2. Two blocks of 83 genes contain down-regulated genes, classified into two clusters of 21 and 10 tissue samples.

[Figure: the 21-sample and 10-sample tissue clusters plotted against the gene-block means.]
Mixture of Factor Analyzers (Martella, Alfò, Vichi, 2010)

Data matrix X → GMM + MAP → centroid matrix → FA or PCA:
model-based clustering (mixture models and MAP assignment) followed by an FA model with a binary constraint on the loadings.

The model maximizes

max ∑i=1..I ∑p=1..P [ uip ln πp + uip ln NJ(xi; µp, ΛpΛ′p + Ψp) ]

subject to
πp ≥ 0, ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
cov(xi | uip = 1) = ΛpΛ′p + Ψp, with fi ~ N(0, I);
λjqp ∈ {0, 1}, ∑q=1..Qp λjqp = 1,

so that, within each cluster p, every variable loads on exactly one of the Qp factors (binary constraint on the loadings, giving cluster-specific centroid structures µp1, …, µpQp for p = 1,…,P).
Remark: DkM specifies, for each object class, the same partition of the variables and, for each variable class, the same partition of the objects (one matrix V, one matrix U).

PROBLEM: single-partitioning vs. multi-partitioning.

If we relax this symmetry we could have that:

(i) conditionally on each class of the variable partition, a different partition of the objects is allowed: one matrix V, different U1, U2, …, UQ;

(ii) conditionally on each class of the object partition, a different partition of the variables is allowed: one matrix U, different V1, V2, …, VP.
Multimode Multipartitioning

X = UX̃V′ + E

where X̃ is the block-diagonal matrix

X̃ = dg(x̄′1, x̄′2, …, x̄′K),

with the centroid row vector x̄′k of the k-th object class on the k-th diagonal block, and

V = [V1, …, Vk, …, VK], where Vk = [vjqk] is the (J × Qk) binary matrix for the variable partition of the k-th object class. Matrix V specifies a particular covering of the set of variables.

X̃ is (K × Q) with Q = Q1 + Q2 + … + QK.

cp = [x̄1p, …, x̄Qp p]′ is the (Qp × 1) centroid vector of the p-th object cluster, and Qp is the number of clusters of the variable partition conditional on the p-th object cluster.

When V1 = V2 = … = VP = V, double k-means can be rewritten as

X = U dg(x̄′1, …, x̄′K) [I, I, …, I]′ V′ + E = U X̄ V′ + E,

where X̄ is the centroid matrix of double k-means. Thus double k-means can be seen as the model identifying the consensus partition of the variable partitions; the consensus is found by an optimization approach (Barthélemy & Leclerc, 1995), providing a central classification which gives an estimate of the true common classification of the set of variables.
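The identity used above (the block-diagonal X̃ times the stacked identities collapsing to the DkM centroid matrix X̄) can be verified numerically; sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q = 3, 2
xbar = rng.normal(size=(K, Q))              # double k-means centroid matrix
# Block-diagonal arrangement of the centroid rows, as in the GDkM notation.
Xtilde = np.zeros((K, K * Q))
for k in range(K):
    Xtilde[k, k*Q:(k+1)*Q] = xbar[k]
stackI = np.vstack([np.eye(Q)] * K)         # [I, I, ..., I]'
print(np.allclose(Xtilde @ stackI, xbar))   # True: GDkM collapses to DkM
```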
Least Squares Estimation of GDkM

Rewrite the model as

X = ∑k=1..K uk x̄′k V′k + E,

where uk is the k-th column of U. The LS estimation is

SSRGDkM = ||X − ∑k=1..K uk x̄′k V′k||² → min over U, Vk, x̄k

subject to U and Vk binary and row stochastic.

Coordinate Descent Algorithm

Step 1: x̄′k = (u′k uk)⁻¹ u′k X Vk (V′k Vk)⁻¹   (k = 1,…,K).
Step 2: ∀ i: uip = 1 if ||x′i − c′p V′p||² = min{ ||x′i − c′l V′l||² : l = 1,…,K }; uip = 0 otherwise.
Step 3: ∀ j, for each p = 1,…,P: vjqp = 1 if ∑i uip (xij − x̄pq)² = min{ ∑i uip (xij − x̄pl)² : l = 1,…,Qp }; vjqp = 0 otherwise.

Stopping rule: if SSR decreases by less than a constant ε > 0, the algorithm has converged.
Simulation study for generalized double k-means: local minima

The performance has been evaluated with the same measures as in the previous simulation study: Mrand(U, Ut), Mrand(V, Vt), and the percentages of object and variable partitions equal to the true ones (up to a column permutation matrix P).

        # runs to   Average    Average    % object   % variable  Average     cputime   Tsr      Error
        retain the  Mrand (U)  Mrand (V)  partitions partitions  number of                       level
        best                              = true     = true      iterations

ALG2     1          1.0000     0.9738     100         82         2.82        0.0045    1.1715   low
ALG2     5          1.0000     0.9979     100         98         2.81        0.0237    1.1002   low
ALG2    10          1.0000     1.0000     100        100         2.79        0.0516    1.0951   low
ALG2     1          1.0000     0.9232     100         57         3.40        0.0070   20.3373   medium
ALG2     5          1.0000     0.9490     100         69         3.33        0.0289   20.2196   medium
ALG2    10          1.0000     0.9555     100         72         3.29        0.0600   20.2134   medium
ALG2    20          1.0000     0.9563     100         72         3.30        0.1182   20.2113   medium
ALG2    30          1.0000     0.9586     100         75         3.31        0.1848   20.2094   medium
Model-Based Multi-Partitioning (Farcomeni, Vichi, 2007)

xi = [HV1µ1, …, HVPµP] ui + ei

Maximum likelihood clustering:

max ∑i=1..I ∑p=1..P uip ln NJ(xi; HVpµp, Σp)

subject to
uip ∈ {0, 1}, ∑p=1..P uip = 1 (i = 1,…,I);
vjqp ∈ {0, 1}, ∑q=1..Qp vjqp = 1 (j = 1,…,J);
Σp = dg(Σp1, …, ΣpQp).

The estimators of uip, µp, Σp and Vp. Writing the variable partition as xi = [x′i1p, …, x′iqp, …, x′iQp p]′, the solutions are

µ̂pq = ∑i ûip xiq / ∑i ûip,

Σ̂pq = ∑i ûip (xiq − µ̂pq)(xiq − µ̂pq)′ / ∑i ûip,

obtained from the classification log-likelihood

cl(θ) = −(1/2) ∑ip uip ∑q=1..Qp [ ln|Σpq| + (xiq − µpq)′ Σ⁻¹pq (xiq − µpq) ];

uip = 1 if ∑q=1..Qp (xiq − µ̂pq)′ Σ̂⁻¹pq (xiq − µ̂pq) = min{ ∑q=1..Qv (xiq − µ̂vq)′ Σ̂⁻¹vq (xiq − µ̂vq) : v = 1,…,P (v ≠ p) }; uip = 0 otherwise;

vjqp = 1 if ∑i=1..I ∑p=1..P ûip (xiq − µ̂pq)′ Σ̂⁻¹pq (xiq − µ̂pq) = min{ the analogous quantity over v = 1,…,Qp (v ≠ q) }; vjqp = 0 otherwise.
Model-Based Multi-Partitioning (Farcomeni, Vichi, 2007)

The parameters of the normal distributions are estimated on a single truncated distribution plus the tails of the other P − 1 (mainly the units closest to the distribution of interest according to the Mahalanobis distance). This induces inconsistent estimators.

Consistent estimators are given by

µ̂pq = ∑i ẑip xiq / ∑i ẑip,

Σ̂pq = ∑i ẑip (xiq − µ̂pq)(xiq − µ̂pq)′ / ∑i ẑip,

where

zip = πp f(xi; θp) / ∑j=1..P πj f(xi; θj)

is the expectation of uip given the observed data. Updating the parameters of the multivariate normal distributions, the likelihood still increases, or at least never decreases.
APPLICATION: colon cancer data (Alon et al., 1999)

Unsupervised problem (22 colon tissues × 2,000 genes). BIC selects 5 clusters of genes and 2, 2, 3, 3, 3 clusters of tissues. C1: 79 over-expressed genes; C2: 832 genes; C3 and C4: 183 and 53 under-expressed genes; C5: 853 genes.

Recognition of glass and ceramic glass, 2007 (161 = 109 + 52 glass/ceramic-glass spectra × 64 wavelengths)

Supervised classification. The observed data are randomly split into a test set (25 units) and a training set. After parameter estimation, the classification error is evaluated on the test set. The procedure is repeated 1000 times.

                          Glass   Glass-ceramic
Predicted glass           0.739   0.005
Predicted glass-ceramic   0.076   0.180

Estimated classification error: 8%.
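The evaluation protocol of the glass example (random test/training splits repeated many times, averaging the test error) can be sketched as follows; the simulated data and the nearest-centroid classifier are stand-ins, not the spectra or the model of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 161, 4                                   # 161 spectra in the real data
y = np.array([0] * 109 + [1] * 52)              # glass vs glass-ceramic
X = rng.normal(0, 1, (n, J)) + 2.0 * y[:, None] # separated toy classes

errors = []
for _ in range(200):                            # the slides repeat 1000 times
    idx = rng.permutation(n)
    test, train = idx[:25], idx[25:]            # 25-unit test set, rest training
    centroids = np.array([X[train][y[train] == c].mean(0) for c in (0, 1)])
    d = ((X[test][:, None, :] - centroids[None]) ** 2).sum(2)
    errors.append((d.argmin(1) != y[test]).mean())
print(np.mean(errors) < 0.2)                    # small average test error
```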