Multimode clustering
Maurizio Vichi, Department of Statistics
Sapienza University of Rome
Firenze, 9 May 2012
Keywords:
- Data dimensionality reduction
- Multimode clustering
- Bi-clustering, co-clustering
- Block clustering
Fields of application
- Medicine and bioinformatics (microarray data);
- Marketing (preference data);
- Chemometrics (non-negative matrix factorization).
Outline of the presentation
- Three-way and two-way data
- Single partitioning
- Multi-partitioning
- Clustering models: Least Squares Estimation (LSE)
- Maximum Likelihood Estimation (MLE)
- Examples
Three-way Data Set X

A set X of n × J × H values related to:
J variables measured (observed, estimated) on
n objects (individuals, products) at
H occasions (assessors, times, locations, etc.).

The set X is organized as a 3-way array with generic entry x_ijh (unit i, variable j, occasion h).

Example: a set of sensory variables (preference data) on products such as steamed potatoes, grape/raspberry beverages, Riesling wines, evaluated by a set of trained assessors at several occasions (times).

[Figure: the n × J × H data array, sliced by occasion h.]
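The slicing-and-matricizing step described above can be sketched in a few lines of NumPy; the sizes are illustrative stand-ins, not the sensory data of the slides:

```python
import numpy as np

# A small three-way array: n = 4 products, J = 3 sensory variables,
# H = 2 assessors (occasions), with generic entry X3[i, j, h].
rng = np.random.default_rng(0)
n, J, H = 4, 3, 2
X3 = rng.normal(size=(n, J, H))

# Matricize by placing the H occasion slices side by side: X is n x JH.
X = np.hstack([X3[:, :, h] for h in range(H)])
print(X.shape)  # (4, 6)
```

This n × JH arrangement is the one used for the model formulas later in the slides.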
Single partitioning: symmetrical vs. asymmetrical mode reduction

Two approaches:

1. Symmetric treatment of units (rows) and variables (columns): clustering for units and clustering for variables. Result: reduced sets of mean profiles for units and variables;

OR

2. Asymmetric treatment of units (rows) and variables (columns): clustering for units and factorial methods for variables. Result: reduced sets of mean profiles and factors.
Symmetrical single partitioning for each mode: from partitioning objects and variables to two-mode partitioning (co-clustering, biclustering).

[Figure: a 6 × 3 data matrix X with a binary membership matrix U (6 objects × 3 clusters) for the partitioning of units, a binary membership matrix V (3 variables × 2 clusters) for the partitioning of variables, and the resulting two-mode partitioning with the centroid matrix X̄ of block means.]
[Figure: the n × J data matrix X of entries x_ij partitioned into clusters 1, …, G of OBJECTS (rows) and clusters 1, …, Q of VARIABLES (columns); each block is summarized by its mean, giving the centroid matrix X̄.]
Asymmetrical single partitioning for each mode: partitioning of objects and factorial reduction of variables.

- Objects are generally synthesized by mean vectors.
- Variables are synthesized by factors (latent variables, components).

[Example: a binary membership matrix U (6 objects × 3 clusters) for the objects, and a loading matrix for the variables (3 variables × 2 components, e.g., loadings 0.77, 0.63 on the first component and 1 on the second).]
Multi-partitioning
Single partitioning for one mode and multi-partitioning for the other mode: from two-mode partitioning to two-mode multipartitioning (objects / variables).

[Figure: two-mode multipartitioning of variables vs. two-mode multipartitioning of objects, illustrated on a microarray.]

It can be seen that the genes (here on the columns) are partitioned into 5 blocks, and that each block is divided into a different number of groups of slides (here on the rows). For instance, the second group of genes is divided into two groups of slides, one of which is composed of only a single outlier. Red = up-regulated, green = down-regulated.
Models for three- and two-way multimode clustering
Multimode clustering for three-way data

X = AUX̄(CW ⊗ BV)′ + E

subject to
U, V, W binary and row stochastic;
A, B, C diagonal and such that
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q;
w′r C′C wr = 1, for r = 1,…,R;

where
X = Xn,JH (n × JH) = [X1, X2, …, XH] is the three-way data matrix obtained by placing the occasion slices Xh side by side;
E = En,JH = [eijh] (n × JH) = [E1, …, EH] is the three-way error matrix;
X̄ = X̄K,QR (K × QR) = [X̄1, …, X̄R] is the three-way centroid matrix;
U, V, W are classification matrices for objects, variables and occasions, respectively;
A, B, C are diagonal weighting matrices for objects, variables and occasions.
Multimode clustering

where U, V, W are the classification matrices for objects, variables and occasions, respectively:

U = [uik] (n × K), binary and row stochastic, defines a partition of the objects into K clusters, with uik = 1 if the i-th object belongs to cluster k, uik = 0 otherwise;
V = [vjq] (J × Q), binary and row stochastic, defines a partition of the variables into Q clusters, with vjq = 1 if the j-th variable belongs to the q-th cluster, vjq = 0 otherwise;
W = [whr] (H × R), binary, defines a partition of the occasions into R clusters, with whr = 1 if the h-th occasion belongs to the r-th cluster, whr = 0 otherwise;

and A, B, C are the diagonal weighting matrices:

A = dg(a1,…,an) (n × n), weighting objects, with ∑i uik ai² = 1 for each k, hence ∑k ∑i uik ai² = K;
B = dg(b1,…,bJ) (J × J), weighting variables, with ∑j vjq bj² = 1 for each q, hence ∑q ∑j vjq bj² = Q;
C = dg(c1,…,cH) (H × H), weighting occasions, with ∑h whr ch² = 1 for each r, hence ∑r ∑h whr ch² = R.
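As a minimal sketch (with made-up sizes and data), membership matrices such as U and V above are binary with exactly one 1 per row, and together they induce a K × Q matrix of block means:

```python
import numpy as np

# Binary, row-stochastic membership matrices for a toy two-mode partition:
# 6 objects in K = 3 clusters, 3 variables in Q = 2 clusters.
U = np.eye(3)[[0, 0, 1, 2, 2, 2]]    # each row has exactly one 1
V = np.eye(2)[[0, 0, 1]]
assert np.all(U.sum(axis=1) == 1) and np.all(V.sum(axis=1) == 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
# Centroid matrix of block means: Xbar = (U'U)^-1 U'X V (V'V)^-1, K x Q.
Xbar = np.linalg.inv(U.T @ U) @ U.T @ X @ V @ np.linalg.inv(V.T @ V)
print(Xbar.shape)  # (3, 2)
```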
Properties

Let us reparameterize Ũ = AU, Ṽ = BV, W̃ = CW; the model can then be written as

X = Ũ X̄ (W̃ ⊗ Ṽ)′ + E

subject to
Ũ′Ũ = IK, Ṽ′Ṽ = IQ, W̃′W̃ = IR.

This is a special Tucker three-mode factor analysis model (Tucker, 1966), where X̄ is the core matrix and Ũ, Ṽ and W̃ are particular columnwise orthonormal matrices.
Two-mode clustering (H = 1): bi-clustering

Symmetric reduction of rows and columns of the data matrix X. Classes of objects and variables are summarized by components (latent objects and latent variables).

With H = 1, X = AUX̄(1 ⊗ V′B) + E, i.e., the Generalized Double K-means (GDKM) model for asymmetrical single partitioning

X = AUX̄V′B + E

subject to
U, V binary and row stochastic;
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q.
Non-negative matrix factorization

Let us now factorize the centroid matrix X̄ into the product of two non-negative matrices H (K × L) and M (Q × L):

X̄ = HM′ + Ē.

Including this in the GDKM model gives a non-negative matrix factorization algorithm for two-mode single partitioning, i.e.

X = AUHM′V′B + E

subject to
H ≥ 0, M ≥ 0;
U, V binary and row stochastic;
u′k A′A uk = 1, for k = 1,…,K;
v′q B′B vq = 1, for q = 1,…,Q;

where E also includes the error part related to X̄, that is, AUĒV′B.

Uniqueness. The factorization is not unique: an arbitrary non-negative monomial matrix D (a permutation matrix with positive non-null elements) and its inverse can be used to transform the two factorization matrices:

HM′ = HDD⁻¹M′ = H̃M̃′.

Thus H̃ = HD and M̃′ = D⁻¹M′ form another non-negative matrix factorization.
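The non-uniqueness argument above is easy to verify numerically; the sketch below builds a random non-negative factorization and a monomial matrix D (assumed 2 × 2 here) and checks that HD and D⁻¹M′ reproduce the same product:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q, L = 4, 3, 2
H = rng.random((K, L))          # non-negative factor (K x L)
M = rng.random((Q, L))          # non-negative factor (Q x L)
Xbar = H @ M.T                  # rank-L non-negative "centroid" matrix

# A non-negative monomial matrix D: a permutation with positive entries.
P = np.eye(L)[[1, 0]]           # swap the two columns
D = P @ np.diag([2.0, 0.5])     # scale, then permute

H2 = H @ D                      # transformed factors stay non-negative
M2T = np.linalg.inv(D) @ M.T    # inverse of a monomial matrix is monomial
assert np.all(H2 >= 0) and np.all(M2T >= 0)
print(np.allclose(H2 @ M2T, Xbar))  # True: the same factorization value
```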
Clustering and Disjoint Principal Component Analysis

Asymmetric reduction of rows and columns of the data matrix X. Classes of objects are summarized by mean profiles; classes of variables are summarized by components.

Fixing H = 1 and A = In gives CDPCA (Vichi, Saporta, 2009, CSDA):

X = UX̄V′B + E

subject to
U, V binary and row stochastic;
v′q B′B vq = 1, for q = 1,…,Q.

One major drawback of PCA is that each variable may contribute to the definition of more than a single component. This is not the case for CDPCA: each variable is summarized by a single factor, so each factor summarizes only a disjoint class of variables, and all classes form a partition of the variables.
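A quick numerical illustration of the disjointness property, using the CDPCA loadings reported in the OECD example later in the slides (treated here just as a worked check):

```python
import numpy as np

# CDPCA loadings (rows = components, columns = variables in the order
# GDP, IR, LI, UR, NNS, TB).  The zeros make the loadings disjoint.
B = np.array([
    [-0.383,  0.0,    0.0,   -0.498, 0.778, 0.0  ],   # Dim 1
    [ 0.0,   -0.697, -0.229,  0.0,   0.0,   0.679],   # Dim 2
])

# Disjointness: every variable (column) loads on exactly one component.
print((np.count_nonzero(B, axis=0) == 1).all())   # True

# Each component's loading vector has (approximately) unit norm,
# matching the normalization constraint v'B'Bv = 1.
print(np.round(np.linalg.norm(B, axis=1), 2))     # [1. 1.]
```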
Model-free: Least-Squares Estimation

SSRdk = ||X − UX̄V′B||² → min over U, V, X̄, B

subject to
U and V binary and row stochastic;
B diagonal and such that V′BBV = IQ.

Coordinate descent algorithm

Step 1: X̄ = (U′U)⁻¹U′XBV(V′BBV)⁻¹ = (U′U)⁻¹U′XBV.
Step 2: uip = 1 if ||x′iB̂V̂ − x̄̂′p||² = min{ ||x′iB̂V̂ − x̄̂′s||² : s = 1,…,P; s ≠ p }; uip = 0 otherwise.
Step 3: vjq = 1 if F(cq, Û, X̄̂, [vjq = 1]) = max{ F(cr, Û, X̄̂, [vjr = 1]) : r = 1,…,Q; r ≠ q }; vjq = 0 otherwise.
Step 4: B = ∑q=1..Q dg(vq) dg(cq).

Stopping rule: if SSRdk decreases by less than a constant ε > 0, the algorithm has converged.
Short Term Indicators and Economic Performance Indicators (OECD, 1999)

20 countries: Australia (A-lia), Canada (Can), Finland (Fin), France (Fra), Spain (Spa), Sweden (Swe), United States (USA), Netherlands (Net), Greece (Gre), Mexico (Mex), Portugal (Por), Austria (A-tria), Belgium (Bel), Denmark (Den), Germany (Ger), Italy (Ita), Japan (Jap), Norway (Nor), Switzerland (Swi), United Kingdom (UK).

6 macroeconomic variables: Gross Domestic Product (GDP), Leading Indicator (LI), Unemployment Rate (UR), Interest Rate (IR), Trade Balance (TB), Net National Savings (NNS).

Loadings of Clustering and Disjoint PCA:

        GDP     IR      LI      UR      NNS     TB
Dim 1  -0.383   0       0      -0.498   0.778   0
Dim 2   0      -0.697  -0.229   0       0       0.679

Var(Dim1) = 1.5601, Var(Dim2) = 1.2553.

Component loadings of PCA:

        GDP     IR      LI      UR      NNS     TB
Dim 1  -0.567  -0.175  -0.192  -0.489   0.607   0.059
Dim 2  -0.065  -0.696  -0.229   0.367  -0.092   0.563

Var(Dim1) = 1.6531, Var(Dim2) = 1.3680.

[Scatter plot of the countries on Dim 1 (26%): GDP (15%), UR (25%), NNS (60%); and Dim 2 (21%): IR (49%), LI (5%), TB (46%).]

Mex, Por and Gre are in the same class also under k-means on the original variables; Ita, Ger and Den have almost equal NNS and GDP.
[Slide repeated with the CDPCA factorial plane: same tables as above, plotted on Dim 1 (28%) and Dim 2 (23%).]
Maximum likelihood approach
Dimension reduction for units: Gaussian Mixture Model

The population is composed of P subpopulations in proportions π1, π2, …, πP, with ∑p=1..P πp = 1.

The data (x′i, u′i)′ include the J variables and the P-dimensional binary, row-stochastic vector ui:

xi = X̄′ui + ei   (i = 1,…,I)

where xi is column centered, X̄ = [µ1, µ2, …, µP]′, and ei is the random error with
(i) E(ei) = 0,
(ii) Cov(ei) = Σp.

Distributional assumptions:
xi | ui ~ fp(x; θp) = NJ(µp, Σp).

In general, the variable ui can be:
1. a fixed label, i.e., ui cannot be considered a random variable;
2. a random label with distribution ui ~ ∏p=1..P πp^uip, a multinomial of one draw on P categories.
Dimension reduction for variables: factorial analyses

In the asymmetric framework:

• The Factor Analysis model

xi = Λfi + ei   (i = 1,…,I)

with
(i) E(ei) = E(fi) = 0,
(ii) Cov(ei) = Ψ (diagonal, ψj > 0),
(iii) Cov(fi) = I.

It implies Cov(xi) = ΛΛ′ + Ψ, with distributional assumption xi ~ NJ(0, ΛΛ′ + Ψ).

• Probabilistic PCA and PCA

Cov(xi) = ΛΛ′ + σ²I,  xi ~ NJ(0, ΛΛ′ + σ²I)   (isotropic component)
Cov(xi) = ΛΛ′ΣΛΛ′,  xi ~ NJ(0, ΛΛ′ΣΛΛ′)   (homoscedastic component)
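The implied covariance Cov(xi) = ΛΛ′ + Ψ of the factor model can be checked by simulation; a sketch with arbitrary small dimensions:

```python
import numpy as np

# Simulate x_i = Lambda f_i + e_i with Cov(f) = I and diagonal Cov(e) = Psi,
# then check that the sample covariance approaches Lambda Lambda' + Psi.
rng = np.random.default_rng(0)
J, L, n = 5, 2, 200_000
Lam = rng.normal(size=(J, L))
Psi = np.diag(rng.uniform(0.5, 1.0, size=J))
F = rng.normal(size=(n, L))
E = rng.normal(size=(n, J)) @ np.sqrt(Psi)   # errors with covariance Psi
X = F @ Lam.T + E
print(np.allclose(np.cov(X.T), Lam @ Lam.T + Psi, atol=0.05))  # True
```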
Simultaneous GMM and PCA (Rocci & Vichi, 2001, 2006)

The GMM is modified as

xi = ΛΛ′X̄′ui + ei

and estimated by a coordinate ascent formulation of the EM algorithm via a penalized complete log-likelihood (Hathaway, 1986):

max ∑i ∑p [ uip ln NJ(xi; ΛΛ′µp, Σ) + uip ln πp ]

subject to
πp ≥ 0 (p = 1,…,P), ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
Λ′Λ = I.

The estimators of µp, Σ, uip and πp are those of the GMM. For Λ, the relevant part of the complete log-likelihood is

cl(θ) = −(1/2) ∑ip uip (xi − ΛΛ′µp)′ Σ⁻¹ (xi − ΛΛ′µp),

whose maximization over Λ, subject to Λ′Λ = I, amounts (up to constants) to maximizing the between-cluster variance term

(1/2) ∑p πp µ′p ΛΛ′ Σ⁻¹ ΛΛ′ µp,

where X̄ = [µ1, µ2, …, µP]′ and D = dg(π1, π2, …, πP).

This is equivalent to the minimization problem

min ||D^(1/2) X̄ Σ⁻¹ (I − ΛΛ′)||²   subject to Λ′Λ = I,

which in turn is equivalent to the maximization problem

max tr(Λ′ ΣB Λ)   subject to Λ′ΣΛ = I,

where ΣB = X̄′DX̄ is the between-cluster covariance matrix. This corresponds to Linear Discriminant Analysis; therefore we have a simultaneous GMM and LDA.
Application: wine recognition data (Blake, C.L. & Merz, C.J., 1998)

UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html

178 wines × 13 constituents. These data are the results of a chemical analysis of wines derived from three different cultivars (Barolo, Grignolino, Barbera). The analysis determined the quantities of 13 chemical properties found in each of the three types of wine:

1) Alcohol; 2) Malic acid; 3) Ash; 4) Alcalinity of ash; 5) Magnesium; 6) Total phenols; 7) Flavanoids; 8) Nonflavanoid phenols; 9) Proanthocyanins; 10) Color intensity; 11) Hue; 12) OD280/OD315 of diluted wines; 13) Proline.
Model selection: BIC (Bayesian Information Criterion), 2L(θ̂) − d ln n.

Selected model: 7 groups, 4 components.
Component variances: 17.59, 5.25, 1.78, 1.02.

Component-variable correlations:

Comp      1      2      3      4      5      6      7      8      9     10     11     12     13
1       0.01   0.55   0.07   0.39  -0.12  -0.68  -0.85   0.50  -0.51   0.61  -0.76  -0.86  -0.36
2       0.85   0.13   0.37  -0.38   0.41   0.34   0.30  -0.18   0.18   0.66  -0.13   0.10   0.84
3      -0.04   0.09   0.28   0.25  -0.54   0.11   0.25   0.14  -0.06   0.19  -0.06   0.21  -0.19
4      -0.11   0.06   0.01   0.28   0.43   0.19   0.19  -0.12   0.61   0.26  -0.03   0.17  -0.01
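A hedged sketch of model selection with the slide's BIC form, 2L(θ̂) − d ln n (larger is better); the log-likelihoods and parameter counts below are illustrative placeholders, not the fitted wine values:

```python
import numpy as np

# BIC in the slide's form, 2*L(theta_hat) - d*ln(n): larger is better.
def bic(loglik, n_params, n_obs):
    return 2.0 * loglik - n_params * np.log(n_obs)

# Hypothetical log-likelihoods for candidate (groups, components) models
# on n = 178 wines; the numbers are made up for illustration.
candidates = {
    (5, 3): bic(-3220.0, n_params=95, n_obs=178),
    (7, 4): bic(-3080.0, n_params=140, n_obs=178),
    (9, 5): bic(-3060.0, n_params=210, n_obs=178),
}
best = max(candidates, key=candidates.get)
print(best)  # (7, 4): the extra parameters of (9, 5) are not worth the gain
```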
Plot on the first two components

[Figure: the 178 wines plotted on the first two components, each point labelled by its estimated cluster (1-7).]
Classification

                   Estimated cluster
True           3    1    5    6    7    2    4
Barolo        59    0    0    0    0    0    0
Grignolino     0    5    1    5   57    0    3
Barbera        0    0    0    1    0   26   21

It is interesting to note that the wines were produced over a period of 10 years (1970-1979). Therefore, there is almost a nested structure (perhaps reflecting wines produced in different years).
Double K-Means (Vichi, 2001, CLADAG)

Symmetric reduction of rows and columns of the data matrix X. Classes of objects and variables are summarized by mean profiles:

X = UX̄V′ + E

subject to

uik ∈ {0, 1}, i = 1,…,n; k = 1,…,K  (hard object classification)  (2)
or uik ∈ [0, 1], i = 1,…,n; k = 1,…,K  (fuzzy object classification)  (2′)

∑k=1..K uik = 1, i = 1,…,n  (object partitioning)  (3)
or ∑k=1..K uik ≥ 1, i = 1,…,n  (object covering)  (3′)
or ∑k=1..K uik ≤ 1, i = 1,…,n  (object packing)  (3″)

vjq ∈ {0, 1}, j = 1,…,J; q = 1,…,Q  (hard variable classification)  (4)
or vjq ∈ [0, 1], j = 1,…,J; q = 1,…,Q  (fuzzy variable classification)  (4′)

∑q=1..Q vjq = 1, j = 1,…,J  (variable partitioning)  (5)
or ∑q=1..Q vjq ≥ 1, j = 1,…,J  (variable covering)  (5′)
or ∑q=1..Q vjq ≤ 1, j = 1,…,J  (variable packing)  (5″)
LS estimation of the model

F²(U, V, X̄) = ||X − UX̄V′||² → min over U, V, X̄

subject to
uik ∈ {0, 1} (i = 1,…,n; k = 1,…,K), ∑k=1..K uik = 1 (i = 1,…,n);
vjq ∈ {0, 1} (j = 1,…,J; q = 1,…,Q), ∑q=1..Q vjq = 1 (j = 1,…,J).

Coordinate Descent Algorithm

Step 1: X̄ = (U′U)⁻¹U′XV(V′V)⁻¹.
Step 2: ∀ i: uik = 1 if ||x′i − x̄′kV′||² = min{ ||x′i − x̄′lV′||² : l = 1,…,K }; uik = 0 otherwise.
Step 3: X̄ = (U′U)⁻¹U′XV(V′V)⁻¹ (updated with the new U).
Step 4: ∀ j: vjq = 1 if ||xj − UX̄·q||² = min{ ||xj − UX̄·l||² : l = 1,…,Q }; vjq = 0 otherwise, where xj is the j-th column of X and X̄·q the q-th column of X̄.

Stopping rule: if F² decreases by less than a constant ε > 0, the algorithm has converged.
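The coordinate descent above can be sketched as a small NumPy routine; this is a didactic sketch (random initialization, pseudoinverse for the centroid step, no safeguards for empty clusters), not a reference implementation:

```python
import numpy as np

def double_kmeans(X, K, Q, n_iter=50, tol=1e-8, seed=0):
    """Coordinate-descent sketch of double k-means: X ~ U Xbar V'."""
    rng = np.random.default_rng(seed)
    n, J = X.shape
    U = np.eye(K)[rng.integers(K, size=n)]      # random object memberships
    V = np.eye(Q)[rng.integers(Q, size=J)]      # random variable memberships
    prev = np.inf
    for _ in range(n_iter):
        # Step 1: centroid matrix Xbar = (U'U)^-1 U'X V (V'V)^-1
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        # Step 2: reassign each object to the closest row profile of Xbar V'
        R = Xbar @ V.T                          # K x J cluster row profiles
        U = np.eye(K)[np.argmin(((X[:, None, :] - R[None]) ** 2).sum(2), 1)]
        # Step 3: recompute the centroid matrix with the new U
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        # Step 4: reassign each variable to the closest column profile of U Xbar
        C = U @ Xbar                            # n x Q cluster column profiles
        V = np.eye(Q)[np.argmin(((X.T[:, None, :] - C.T[None]) ** 2).sum(2), 1)]
        Xbar = np.linalg.pinv(U) @ X @ np.linalg.pinv(V).T
        ssr = ((X - U @ Xbar @ V.T) ** 2).sum()
        if prev - ssr < tol:                    # stopping rule on F^2
            break
        prev = ssr
    return U, V, Xbar, ssr

# Toy check: a 20 x 10 matrix with a planted 2 x 2 block-mean structure.
rng = np.random.default_rng(1)
U0 = np.eye(2)[np.repeat([0, 1], 10)]
V0 = np.eye(2)[np.repeat([0, 1], 5)]
X = U0 @ np.array([[0.0, 5.0], [5.0, 0.0]]) @ V0.T + rng.normal(0, 0.1, (20, 10))
U, V, Xbar, ssr = double_kmeans(X, K=2, Q=2)
print(ssr < 5.0)  # the planted blocks are recovered almost exactly
```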
Simulation study for double k-means: local minima

The performance of the method has been evaluated using the following measures:
(1) Mrand(U, Ut): Modified Rand index (Hubert and Arabie, 1985) between the true matrix U and the one fitted for the t-th generated matrix Ut;
(2) Mrand(V, Vt): Modified Rand index between the true matrix V and the one fitted for the t-th generated matrix Vt;
(3) the number of times the fitted object partition equals the true partition, i.e., U = Ut (equality up to a column permutation matrix P, U = UtP);
(4) the number of times the fitted variable partition equals the true partition, i.e., V = Vt (equality up to a column permutation matrix P, V = VtP).

# runs to retain   Average     Average     % object       % variable     Average     Error
the best solution  Mrand (U)   Mrand (V)   partitions     partitions     number of   level
                                           = true         = true         iterations

Low error level, permuted matrix:
 1                 0.9853      0.9938       94             98            2.32        Low
 5                 1.0000      1.0000      100            100            2.10        Low

Medium error level, permuted matrix:
 1                 0.9592      0.9261       84             76            6.16        Medium
 5                 0.9902      0.9631       96             88            4.30        Medium
10                 0.9975      0.9969       99             99            3.15        Medium
15                 1.0000      1.0000      100            100            2.99        Medium

High error level, permuted matrix:
 1                 0.8894      0.9259       35             72            9.70        High
 5                 0.9788      0.9907       55             97            6.60        High
10                 0.9884      0.9999       55             99            6.25        High
20                 0.9901      0.9998       59             98            6.26        High
30                 0.9943      0.9994       61             99            5.45        High
40                 0.9956      1.0000       66            100            5.99        High
50                 0.9921      0.9998       60             97            6.21        High
80                 0.9940      0.9998       57             97            6.00        High
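The Modified Rand index of Hubert and Arabie (1985), used as measures (1)-(2) above, can be computed directly; a compact NumPy version with a sanity check (labels assumed to be 0-based integers):

```python
import numpy as np

def adjusted_rand(a, b):
    """Modified/Adjusted Rand index (Hubert & Arabie, 1985) between two
    labelings; 1 means identical partitions up to label permutation."""
    a, b = np.asarray(a), np.asarray(b)
    # Contingency table of the two partitions.
    ct = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(ct, (a, b), 1)
    comb = lambda x: x * (x - 1) / 2.0          # pair counts
    sum_ij = comb(ct).sum()
    sum_a, sum_b = comb(ct.sum(1)).sum(), comb(ct.sum(0)).sum()
    expected = sum_a * sum_b / comb(a.size)
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

# A partition identical up to label switching scores exactly 1.
print(adjusted_rand([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]))  # 1.0
```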
Model-Based Double Clustering (Vichi, Martella, 2012, JCS)

xi = HV X̄′ ui + ei,  where HV = V(V′V)⁻¹V′.

Maximum likelihood clustering:

max ∑i=1..I ∑p=1..P uip ln NJ(xi; HV x̄p, Σp)

subject to
uip ∈ {0, 1}, ∑p=1..P uip = 1 (i = 1,…,I);
vjq ∈ {0, 1}, ∑q=1..Q vjq = 1 (j = 1,…,J).

Gaussian mixture model:

max ∑i=1..I ∑p=1..P [ uip ln NJ(xi; HV x̄p, Σp) + uip ln πp ]

subject to
πp ≥ 0, ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
vjq ∈ {0, 1}, ∑q=1..Q vjq = 1 (j = 1,…,J).
Application: cutaneous melanoma (Bittner et al., 2000)

31 samples of cutaneous melanoma × 3,613 genes. Bittner et al. obtained two clusters of 12 and 19 samples.

Best solution for BIC and AIC: P = 4, Q = 2. Two blocks of 83 genes contain down-regulated genes, classified into two clusters of 21 and 10 tissue samples.

[Figure: the 21-sample and 10-sample tissue clusters plotted against the gene-block means.]
Mixture of Factor Analyzers (Martella, Alfò, Vichi, 2010)

Data matrix X → GMM + MAP → centroid matrix → FA or PCA:
model-based clustering (mixture models and MAP assignment) followed by an FA model with a binary constraint on the loadings.

The model maximizes

max ∑i=1..I ∑p=1..P [ uip ln πp + uip ln NJ(xi; µp, ΛpΛ′p + Ψp) ]

subject to
πp ≥ 0, ∑p=1..P πp = 1;
uip ≥ 0, ∑p=1..P uip = 1 (i = 1,…,I);
cov(xi | uip = 1) = ΛpΛ′p + Ψp, with fi ~ N(0, I);
λjqp ∈ {0, 1}, ∑q=1..Qp λjqp = 1,

so that, within each cluster p, every variable loads on exactly one of the Qp factors (binary constraint on the loadings, giving cluster-specific centroid structures µp1, …, µpQp for p = 1,…,P).
Remark: DkM specifies, for each object class, the same partition of the variables and, for each variable class, the same partition of the objects (one matrix V, one matrix U).

PROBLEM: single-partitioning vs. multi-partitioning.

If we relax this symmetry we could have that:

(i) conditionally on each class of the variable partition, a different partition of the objects is allowed: one matrix V, different U1, U2, …, UQ;

(ii) conditionally on each class of the object partition, a different partition of the variables is allowed: one matrix U, different V1, V2, …, VP.
Multimode Multipartitioning

X = UX̃V′ + E

where X̃ is the block-diagonal matrix

X̃ = dg(x̄′1, x̄′2, …, x̄′K),

with the centroid row vector x̄′k of the k-th object class on the k-th diagonal block, and

V = [V1, …, Vk, …, VK], where Vk = [vjqk] is the (J × Qk) binary matrix for the variable partition of the k-th object class. Matrix V specifies a particular covering of the set of variables.

X̃ is (K × Q) with Q = Q1 + Q2 + … + QK.

cp = [x̄1p, …, x̄Qp p]′ is the (Qp × 1) centroid vector of the p-th object cluster, and Qp is the number of clusters of the variable partition conditional on the p-th object cluster.

When V1 = V2 = … = VP = V, double k-means can be rewritten as

X = U dg(x̄′1, …, x̄′K) [I, I, …, I]′ V′ + E = U X̄ V′ + E,

where X̄ is the centroid matrix of double k-means. Thus double k-means can be seen as the model identifying the consensus partition of the variable partitions; the consensus is found by an optimization approach (Barthélemy & Leclerc, 1995), providing a central classification which gives an estimate of the true common classification of the set of variables.
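The identity used above (the block-diagonal X̃ times the stacked identities collapsing to the DkM centroid matrix X̄) can be verified numerically; sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q = 3, 2
xbar = rng.normal(size=(K, Q))              # double k-means centroid matrix
# Block-diagonal arrangement of the centroid rows, as in the GDkM notation.
Xtilde = np.zeros((K, K * Q))
for k in range(K):
    Xtilde[k, k*Q:(k+1)*Q] = xbar[k]
stackI = np.vstack([np.eye(Q)] * K)         # [I, I, ..., I]'
print(np.allclose(Xtilde @ stackI, xbar))   # True: GDkM collapses to DkM
```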
Least Squares Estimation of GDkM

Rewrite the model as

X = ∑k=1..K uk x̄′k V′k + E,

where uk is the k-th column of U. The LS estimation is

SSRGDkM = ||X − ∑k=1..K uk x̄′k V′k||² → min over U, Vk, x̄k

subject to U and Vk binary and row stochastic.

Coordinate Descent Algorithm

Step 1: x̄′k = (u′k uk)⁻¹ u′k X Vk (V′k Vk)⁻¹   (k = 1,…,K).
Step 2: ∀ i: uip = 1 if ||x′i − c′p V′p||² = min{ ||x′i − c′l V′l||² : l = 1,…,K }; uip = 0 otherwise.
Step 3: ∀ j, for each p = 1,…,P: vjqp = 1 if ∑i uip (xij − x̄pq)² = min{ ∑i uip (xij − x̄pl)² : l = 1,…,Qp }; vjqp = 0 otherwise.

Stopping rule: if SSR decreases by less than a constant ε > 0, the algorithm has converged.
Simulation study for generalized double k-means: local minima

The performance has been evaluated with the same measures as in the previous simulation study: Mrand(U, Ut), Mrand(V, Vt), and the percentages of object and variable partitions equal to the true ones (up to a column permutation matrix P).

        # runs to   Average    Average    % object   % variable  Average     cputime   Tsr      Error
        retain the  Mrand (U)  Mrand (V)  partitions partitions  number of                       level
        best                              = true     = true      iterations

ALG2     1          1.0000     0.9738     100         82         2.82        0.0045    1.1715   low
ALG2     5          1.0000     0.9979     100         98         2.81        0.0237    1.1002   low
ALG2    10          1.0000     1.0000     100        100         2.79        0.0516    1.0951   low
ALG2     1          1.0000     0.9232     100         57         3.40        0.0070   20.3373   medium
ALG2     5          1.0000     0.9490     100         69         3.33        0.0289   20.2196   medium
ALG2    10          1.0000     0.9555     100         72         3.29        0.0600   20.2134   medium
ALG2    20          1.0000     0.9563     100         72         3.30        0.1182   20.2113   medium
ALG2    30          1.0000     0.9586     100         75         3.31        0.1848   20.2094   medium
Model-Based Multi-Partitioning (Farcomeni, Vichi, 2007)

xi = [HV1µ1, …, HVPµP] ui + ei

Maximum likelihood clustering:

max ∑i=1..I ∑p=1..P uip ln NJ(xi; HVpµp, Σp)

subject to
uip ∈ {0, 1}, ∑p=1..P uip = 1 (i = 1,…,I);
vjqp ∈ {0, 1}, ∑q=1..Qp vjqp = 1 (j = 1,…,J);
Σp = dg(Σp1, …, ΣpQp).

The estimators of uip, µp, Σp and Vp. Writing the variable partition as xi = [x′i1p, …, x′iqp, …, x′iQp p]′, the solutions are

µ̂pq = ∑i ûip xiq / ∑i ûip,

Σ̂pq = ∑i ûip (xiq − µ̂pq)(xiq − µ̂pq)′ / ∑i ûip,

obtained from the classification log-likelihood

cl(θ) = −(1/2) ∑ip uip ∑q=1..Qp [ ln|Σpq| + (xiq − µpq)′ Σ⁻¹pq (xiq − µpq) ];

uip = 1 if ∑q=1..Qp (xiq − µ̂pq)′ Σ̂⁻¹pq (xiq − µ̂pq) = min{ ∑q=1..Qv (xiq − µ̂vq)′ Σ̂⁻¹vq (xiq − µ̂vq) : v = 1,…,P (v ≠ p) }; uip = 0 otherwise;

vjqp = 1 if ∑i=1..I ∑p=1..P ûip (xiq − µ̂pq)′ Σ̂⁻¹pq (xiq − µ̂pq) = min{ the analogous quantity over v = 1,…,Qp (v ≠ q) }; vjqp = 0 otherwise.
Model-Based Multi-Partitioning (Farcomeni, Vichi, 2007)

The parameters of the normal distributions are estimated on a single truncated distribution plus the tails of the other P − 1 (mainly the units closest to the distribution of interest according to the Mahalanobis distance). This induces inconsistent estimators.

Consistent estimators are given by

µ̂pq = ∑i ẑip xiq / ∑i ẑip,

Σ̂pq = ∑i ẑip (xiq − µ̂pq)(xiq − µ̂pq)′ / ∑i ẑip,

where

zip = πp f(xi; θp) / ∑j=1..P πj f(xi; θj)

is the expectation of uip given the observed data. Updating the parameters of the multivariate normal distributions, the likelihood still increases, or at least never decreases.
APPLICATION: colon cancer data (Alon et al., 1999)

Unsupervised problem (22 colon tissues × 2,000 genes). BIC selects 5 clusters of genes and 2, 2, 3, 3, 3 clusters of tissues. C1: 79 over-expressed genes; C2: 832 genes; C3 and C4: 183 and 53 under-expressed genes; C5: 853 genes.

Recognition of glass and ceramic glass, 2007 (161 = 109 + 52 glass/ceramic-glass spectra × 64 wavelengths)

Supervised classification. The observed data are randomly split into a test set (25 units) and a training set. After parameter estimation, the classification error is evaluated on the test set. The procedure is repeated 1000 times.

                          Glass   Glass-ceramic
Predicted glass           0.739   0.005
Predicted glass-ceramic   0.076   0.180

Estimated classification error: 8%.
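The evaluation protocol of the glass example (random test/training splits repeated many times, averaging the test error) can be sketched as follows; the simulated data and the nearest-centroid classifier are stand-ins, not the spectra or the model of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 161, 4                                   # 161 spectra in the real data
y = np.array([0] * 109 + [1] * 52)              # glass vs glass-ceramic
X = rng.normal(0, 1, (n, J)) + 2.0 * y[:, None] # separated toy classes

errors = []
for _ in range(200):                            # the slides repeat 1000 times
    idx = rng.permutation(n)
    test, train = idx[:25], idx[25:]            # 25-unit test set, rest training
    centroids = np.array([X[train][y[train] == c].mean(0) for c in (0, 1)])
    d = ((X[test][:, None, :] - centroids[None]) ** 2).sum(2)
    errors.append((d.argmin(1) != y[test]).mean())
print(np.mean(errors) < 0.2)                    # small average test error
```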