# StatLearn 2018: tutorial on Model Based Learning

Julien Jacques (Université de Lyon, Lyon 2 & ERIC EA3083)
# Discover mixture models through simulations
## One-dimensional data
Let's simulate the heights of 30 women and 70 men:

```r
taille = c(rnorm(n=30, mean=162, sd=5), rnorm(n=70, mean=175, sd=8))
sexe = c(rep(1,30), rep(2,70))
```
```r
hist1 = hist(taille[sexe==1], breaks=seq(140,200,5), plot=FALSE)
hist2 = hist(taille[sexe==2], breaks=seq(140,200,5), plot=FALSE)
barplot(hist1$counts, beside=TRUE, col="pink",
        ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
barplot(hist2$counts, beside=TRUE, col="blue",
        ylim=c(0,max(hist1$counts,hist2$counts)))
```
*(Figure: overlaid histograms of the heights of women (pink) and men (blue))*
Let's try clustering with kmeans:

```r
res = kmeans(taille, centers = 2)
table(res$cluster, sexe)
```
##    sexe
##      1  2
##   1  0 46
##   2 30 24
We can use the Rand index to compare two partitions $Z_1$ and $Z_2$:

$$R = \frac{a+d}{a+b+c+d} = \frac{a+d}{\binom{n}{2}} \in [0,1]$$

where, among the $\binom{n}{2}$ pairs of objects:
• $a$: the number of pairs in the same class in both $Z_1$ and $Z_2$
• $b$: the number of pairs in the same class in $Z_1$ but separated in $Z_2$
• $c$: the number of pairs separated in $Z_1$ but in the same class in $Z_2$
• $d$: the number of pairs separated in both $Z_1$ and $Z_2$
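To make the definition concrete, here is a minimal sketch computing the raw Rand index by enumerating pairs (rand_index() is a hypothetical helper, not part of mclust):

```r
# Hypothetical helper: raw (unadjusted) Rand index of two partitions
rand_index <- function(z1, z2) {
  n <- length(z1)
  agree <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      same1 <- z1[i] == z1[j]                 # pair together in Z1?
      same2 <- z2[i] == z2[j]                 # pair together in Z2?
      if (same1 == same2) agree <- agree + 1  # contributes to a or d
    }
  }
  agree / choose(n, 2)                        # (a + d) / (n choose 2)
}
rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same partition up to labels: 1
```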
The adjustedRandIndex() function from the mclust package computes an adjusted version (ARI) of this index (the more similar the partitions, the closer the ARI is to 1).

```r
library(mclust)
```
## Package 'mclust' version 5.3
## Type 'citation("mclust")' for citing this R package in publications.

```r
cat('ARI kmeans = ', adjustedRandIndex(res$cluster,sexe), '\n')
```
## ARI kmeans = 0.2634361
Before looking at the theory, let's have a look at clustering with a Gaussian mixture:

```r
res = Mclust(taille, G = 2, verbose = FALSE)
table(res$classification, sexe)
```
##    sexe
##      1  2
##   1 30 26
##   2  0 44

```r
cat('ARI GMM = ', adjustedRandIndex(res$classification,sexe), '\n')
```
## ARI GMM = 0.2221024

```r
barplot(hist1$counts, beside=TRUE, col="pink",
        ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
barplot(hist2$counts, beside=TRUE, col="blue",
        ylim=c(0,max(hist1$counts,hist2$counts)))
par(new=TRUE)
plot(res, what="density", yaxt="n")
```
*(Figure: the two histograms with the fitted mixture density overlaid; x-axis: taille)*
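The density curve in this figure is simply the weighted sum of the two fitted Gaussians; here is a sketch rebuilding it by hand from the estimated parameters (assuming the res object returned by Mclust above):

```r
# Rebuild the fitted 1D mixture density from the Mclust parameters;
# rep() covers both the equal- and unequal-variance univariate models
p   <- res$parameters$pro    # mixing proportions
mu  <- res$parameters$mean   # component means
sig <- sqrt(rep(res$parameters$variance$sigmasq, length.out = 2))
x   <- seq(140, 200, length.out = 200)
dens <- p[1]*dnorm(x, mu[1], sig[1]) + p[2]*dnorm(x, mu[2], sig[2])
plot(x, dens, type = "l", xlab = "taille", ylab = "density")
```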
## Two-dimensional data
Let's now add the weights of women and men:

```r
poids = c(rnorm(n=30, mean=62, sd=6), rnorm(n=70, mean=77, sd=9))
data = data.frame(taille, poids)
sexe = c(rep(1,30), rep(2,70))
couleur = c(rep("pink",30), rep("blue",70))
plot(taille, poids, col=couleur, pch=16)
```
*(Figure: scatterplot of taille vs poids, women in pink, men in blue)*
Let's run kmeans and GMM:

```r
res = kmeans(data, centers = 2)
table(res$cluster, sexe)
```
##    sexe
##      1  2
##   1 30 12
##   2  0 58

```r
cat('ARI kmeans = ', adjustedRandIndex(res$cluster,sexe), '\n')
```
## ARI kmeans = 0.5723122

```r
res = Mclust(data, G = 2, verbose = FALSE)
table(res$classification, sexe)
```
##    sexe
##      1  2
##   1 29  8
##   2  1 62

```r
cat('ARI GMM = ', adjustedRandIndex(res$classification,sexe), '\n')
```
## ARI GMM = 0.6661479
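One reason the GMM does better here is that it fits a covariance structure per cluster, instead of the spherical clusters implicitly assumed by kmeans; a quick check of what was selected (assuming the res fit above):

```r
# Covariance model selected by BIC for the 2-cluster fit,
# e.g. "VVV": ellipsoidal, varying volume, shape and orientation
res$modelName
```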
# Classification with GMM
Download the wine dataset from https://archive.ics.uci.edu/ml/datasets/wine:

```r
data = read.table('data/wine.txt', sep=',')
cls = data$V1
data$V1 = NULL
```
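A quick sanity check on what was loaded may help (the wine data has 178 observations of 13 measurements over 3 cultivars):

```r
dim(data)   # expected: 178 x 13 once the class column is removed
table(cls)  # sizes of the 3 classes
```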
Let's start with a representation of the data as a scatterplot matrix:

```r
plot(data, col=cls, pch=19)
```
*(Figure: scatterplot matrix of variables V2 to V14, colored by wine class)*
We don't see anything... Let's go for a PCA:

```r
pc = princomp(data, cor=TRUE)
plot(pc)
```
*(Figure: screeplot of the variances of the principal components)*
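The screeplot can also be read quantitatively; a short sketch computing the cumulative share of variance carried by the components (using the pc object above):

```r
# Proportion of total variance per component, and its cumulative sum
pvar <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(pvar), 3)
```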
```r
biplot(pc)
```
*(Figure: PCA biplot of Comp.1 vs Comp.2, showing the 178 observations and the variable loadings V2 to V14)*
```r
pairs(predict(pc)[,1:5], col=cls, pch=19)
```
*(Figure: pairs plot of the first five principal components, colored by wine class)*
In order to evaluate the classification, we randomly select 1/3 of the data as a test set:

```r
itest <- sample(1:nrow(data), nrow(data)/3)
data.train <- data[-itest,]
cls.train <- cls[-itest]
data.test <- data[itest,]
cls.test <- cls[itest]
```
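Note that the split is random, so the exact figures below will vary between runs; for a reproducible experiment one would fix the seed before sampling (a sketch, with an arbitrary seed):

```r
set.seed(2018)  # arbitrary seed, only for reproducibility
itest <- sample(1:nrow(data), nrow(data)/3)
```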
We train a GMM with one component per class (EDDA model; the covariance structure, e.g. EEE: ellipsoidal with equal volume, shape and orientation, is selected by BIC):

```r
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "EDDA", verbose = FALSE)
summary(wineMclustDA)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1979.751 119 156 -4705.045
##
## Classes   n Model G
##       1  41   VVE 1
##       2  45   VVE 1
##       3  33   VVE 1
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 41  0  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0

```r
# summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1979.751 119 156 -4705.045
##
## Classes   n Model G
##       1  41   VVE 1
##       2  45   VVE 1
##       3  33   VVE 1
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 41  0  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0
##
## Test classification summary:
##
##        Predicted
## Class    1  2  3
##       1 18  0  0
##       2  0 26  0
##       3  0  0 15
##
## Test error = 0

```r
plot(wineMclustDA, dimens = 1:2, what="scatterplot")
```
*(Figure: classification scatterplot in the (V2, V3) plane)*
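The test error reported by summary() can also be obtained directly from the predictions (a minimal sketch, assuming the objects above):

```r
# Predict the classes of the test set and compute the error rate
pred <- predict(wineMclustDA, newdata = data.test)
mean(pred$classification != cls.test)
```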
Let's compare with a Mixture Discriminant Analysis (several Gaussian components per class), now fixing the EEE model:

```r
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "MclustDA",
                         modelNames = "EEE", verbose = FALSE)
summary(wineMclustDA)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1794.149 119 312 -5079.385
##
## Classes   n Model G
##       1  41   XXX 1
##       2  45   XXX 1
##       3  33   XXX 1
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 41  0  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0

```r
# summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1794.149 119 312 -5079.385
##
## Classes   n Model G
##       1  41   XXX 1
##       2  45   XXX 1
##       3  33   XXX 1
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 41  0  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0
##
## Test classification summary:
##
##        Predicted
## Class    1  2  3
##       1 18  0  0
##       2  0 26  0
##       3  0  0 15
##
## Test error = 0

```r
plot(wineMclustDA, dimens = 1:2, what="scatterplot")
```
*(Figure: classification scatterplot in the (V2, V3) plane)*
Let's finally try all parsimonious models:

```r
wineMclustDA <- MclustDA(data.train, cls.train, modelType = "MclustDA", verbose = FALSE)
summary(wineMclustDA)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1975.559 119 164 -4734.895
##
## Classes   n Model G
##       1  41   EEI 2
##       2  45   VEI 3
##       3  33   EEI 4
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 40  1  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0.008403361

```r
# summary(wineMclustDA, parameters = TRUE)
summary(wineMclustDA, newdata = data.test, newclass = cls.test)
```
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
##  log.likelihood   n  df       BIC
##       -1975.559 119 164 -4734.895
##
## Classes   n Model G
##       1  41   EEI 2
##       2  45   VEI 3
##       3  33   EEI 4
##
## Training classification summary:
##
##        Predicted
## Class    1  2  3
##       1 40  1  0
##       2  0 45  0
##       3  0  0 33
##
## Training error = 0.008403361
##
## Test classification summary:
##
##        Predicted
## Class    1  2  3
##       1 18  0  0
##       2  0 26  0
##       3  0  0 15
##
## Test error = 0

```r
plot(wineMclustDA, dimens = 1:2, what="scatterplot")
```
*(Figure: classification scatterplot in the (V2, V3) plane)*
Some questions:

• how is the number of components per class chosen?
• is it the best way to do that?
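Regarding the first question: MclustDA selects, for each class, the number of components and the covariance model with the best BIC; the search range can be controlled explicitly through the G argument (a sketch, restricting it to 1:3):

```r
# Restrict the per-class number of components to 1:3;
# for each class, the (model, G) pair with the best BIC is kept
wineMclustDA3 <- MclustDA(data.train, cls.train, modelType = "MclustDA",
                          G = 1:3, verbose = FALSE)
summary(wineMclustDA3)
```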
# Clustering with GMM
Let's continue with the wine dataset:

```r
data = read.table('data/wine.txt', sep=',')
cls = data$V1
data$V1 = NULL
```
Let's select, using BIC, the best model as well as the best number of clusters:

```r
BIC = mclustBIC(data, verbose = FALSE)
plot(BIC)
```
*(Figure: BIC of the 14 covariance models (EII to VVV) for 1 to 9 components)*
```r
summary(BIC)
```
## Best BIC values:
##              EVE,3       VVE,5      VVE,3
## BIC      -6873.246 -6884.37905 -6896.8868
## BIC diff     0.000   -11.13286   -23.6406
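BIC is not the only possible criterion; mclust also provides the ICL, which penalizes solutions with overlapping clusters (a sketch):

```r
# ICL for all models and numbers of components, analogous to mclustBIC()
ICL = mclustICL(data)
summary(ICL)
```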
We can have a look at the selected model:

```r
mod1 = Mclust(data, x = BIC)
summary(mod1)
```
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EVE (ellipsoidal, equal volume and orientation) model with 3 components:
##
##  log.likelihood   n  df       BIC       ICL
##       -3032.444 178 156 -6873.246 -6873.537
##
## Clustering table:
##  1  2  3
## 63 51 64
We can compare the best partition with the type of wine:

```r
table(mod1$classification, cls)
```
##    cls
##      1  2  3
##   1 59  4  0
##   2  0  3 48
##   3  0 64  0
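As in the simulated examples, this agreement can be summarized with the ARI:

```r
adjustedRandIndex(mod1$classification, cls)
```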