
Mixture Models for Classification

Gilles Celeux

Inria Futurs, Orsay, France; [email protected]

Abstract. Finite mixture distributions provide efficient approaches to model-based clustering and classification. The advantages of mixture models for unsupervised classification are reviewed. Then the article focuses on the model selection problem. The usefulness of taking the modeling purpose into account when selecting a model is advocated in the unsupervised and supervised classification contexts. This point of view has led to the definition of two penalized likelihood criteria, ICL and BEC, which are presented and discussed. Criterion ICL is an approximation of the integrated completed likelihood and is concerned with model-based cluster analysis. Criterion BEC is an approximation of the integrated conditional likelihood and is concerned with generative models of classification. The behavior of ICL for choosing the number of components in a mixture model, and of BEC for choosing a model minimizing the expected error rate, is analyzed in contrast with standard model selection criteria.

1 Introduction

Finite mixture models have been extensively studied for decades and provide a fruitful framework for classification (McLachlan and Peel (2000)). In this article some of the main features and advantages of finite mixture analysis for model-based clustering are reviewed in Section 2. An important benefit of finite mixture models is to provide a rigorous setting to assess the number of clusters in an unsupervised classification context or to assess the stability of a classification function. Section 3 focuses on these two questions.

Model-based clustering (MBC) consists of assuming that the data come from a source with several subpopulations. Each subpopulation is modeled separately and the overall population is a mixture of these subpopulations. The resulting model is a finite mixture model. Observations x = (x_1, . . . , x_n) in R^{nd} are assumed to be a sample from a probability distribution with density

p(x_i | K, θ_K) = ∑_{k=1}^{K} p_k φ(x_i | a_k)    (1)


where the p_k's are the mixing proportions (0 < p_k < 1 for all k = 1, . . . , K and ∑_k p_k = 1), φ(· | a_k) denotes a parameterized density and θ_K = (p_1, . . . , p_{K−1}, a_1, . . . , a_K). When the data are multivariate continuous observations, the component parameterized density is usually the d-dimensional Gaussian density with parameter a_k = (µ_k, Σ_k), µ_k being the mean and Σ_k the variance matrix of component k. When the data are discrete, the component parameterized density is usually the multivariate multinomial density, which assumes conditional independence of the variables knowing the mixture component, and the a_k = (a_k^j, j = 1, . . . , d) are the multinomial probabilities for variable j and mixture component k. The resulting model is the so-called Latent Class Model (see for instance Goodman (1974)).
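To make the mixture density (1) concrete, the following minimal Python sketch evaluates a two-component Gaussian mixture density at a point; the function name and the numerical values are illustrative assumptions, not taken from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, proportions, means, covariances):
    """Evaluate the finite mixture density (1) at a single point x."""
    return sum(p_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for p_k, mu_k, Sigma_k in zip(proportions, means, covariances))

# Hypothetical two-component bivariate Gaussian mixture.
proportions = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covariances = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), proportions, means, covariances))
```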

The mixture model is an incomplete data structure model: the complete data are

y = (y_1, . . . , y_n) = ((x_1, z_1), . . . , (x_n, z_n)),

where the missing data are z = (z_1, . . . , z_n), the z_i = (z_i1, . . . , z_iK) being binary vectors such that z_ik = 1 iff x_i arises from group k. The z's define a partition P = (P_1, . . . , P_K) of the observed data x with P_k = {x_i | z_ik = 1}.

In this article, it is considered that the mixture models at hand are estimated through maximum likelihood (ml) or related methods. Although it has received a lot of attention since the seminal article of Diebolt and Robert (1994), Bayesian inference is not considered here. Bayesian analysis of univariate mixtures has become the standard Bayesian tool for density estimation. But, especially in the multivariate setting, a lot of problems (possible slow convergence of MCMC algorithms, definition of subjective weakly informative priors, identifiability, . . . ) remain, and it cannot be regarded as a standard tool for Bayesian clustering of multivariate data (see Aitkin (2001)). The reader is referred to the survey article of Marin et al. (2005) for a readable state of the art of Bayesian inference for mixture models.

2 Some advantages of model-based clustering

In this section, some important and nice features of finite mixture analysis are sketched. The advantages of finite mixture analysis in a clustering context, highlighted here, are the following: many versatile or parsimonious models are available, many algorithms to estimate the mixture parameters are available, special questions can be tackled in a proper way in the MBC context, and, last but not least, finite mixture models can be compared and assessed in an objective way. In particular, this allows the number of clusters to be assessed properly. The discussion of this important point is postponed to Section 3.

Many versatile or parsimonious models are available.

In the multivariate Gaussian mixture context, the variance matrix eigenvalue decomposition


Σ_k = V_k D_k A_k D_k^t,

where V_k = |Σ_k|^{1/d} defines the component volume, D_k, the matrix of eigenvectors of Σ_k, defines the component orientation, and A_k, the diagonal matrix of normalized eigenvalues, defines the component shape, leads to different and easily interpreted models obtained by allowing some of these quantities to vary between components. Following Banfield and Raftery (1993) or Celeux and Govaert (1995), a large range of fourteen versatile models (from the most complex to the simplest one) derived from this eigenvalue decomposition can be considered. Assuming equal or free volumes, orientations and shapes leads to eight different models. Assuming in addition that the component variance matrices are diagonal leads to four models. And, finally, assuming in addition that the component variance matrices are proportional to the identity matrix leads to two other models.
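As an illustration of this decomposition, the following sketch (a minimal example assuming numpy; the function name and the example matrix are hypothetical) extracts the volume V_k, the orientation D_k and the normalized shape A_k from a given variance matrix.

```python
import numpy as np

def volume_orientation_shape(Sigma):
    """Decompose Sigma into volume V = |Sigma|^(1/d), orientation D and shape A (det(A) = 1)."""
    d = Sigma.shape[0]
    eigenvalues, D = np.linalg.eigh(Sigma)        # Sigma = D diag(eigenvalues) D^t
    order = np.argsort(eigenvalues)[::-1]         # decreasing eigenvalue order
    eigenvalues, D = eigenvalues[order], D[:, order]
    V = np.linalg.det(Sigma) ** (1.0 / d)
    A = np.diag(eigenvalues / V)                  # normalized eigenvalues
    return V, D, A

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
V, D, A = volume_orientation_shape(Sigma)
print(V, np.allclose(V * D @ A @ D.T, Sigma))     # reconstruction check: Sigma = V D A D^t
```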

In the Latent Class Model, a re-parameterization is possible that leads to various models taking account of the scattering around the centers of the clusters in different ways (Celeux and Govaert (1991)). This re-parameterization is as follows. The multinomial probabilities a_k are decomposed into (m_k, ε_k), where the binary vector m_k = (m_k^1, . . . , m_k^d) provides the mode levels in cluster k for variable j,

m_k^{jh} = 1 if h = argmax_{h'} a_k^{jh'}, and 0 otherwise,

and the ε_k^j can be regarded as scattering values,

ε_k^{jh} = 1 − a_k^{jh} if m_k^{jh} = 1, and a_k^{jh} if m_k^{jh} = 0.

For instance, if a_k^j = (0.7, 0.2, 0.1), the new parameters are m_k^j = (1, 0, 0) and ε_k^j = (0.3, 0.2, 0.1). This parameterization can lead to five latent class models. Denoting h(jk) the mode level for variable j and cluster k, and h(ij) the level of object i for variable j, the model can be written

f(x_i; θ) = ∑_k p_k ∏_j (1 − ε_k^{jh(jk)})^{x_i^{jh(jk)}} (ε_k^{jh(ij)})^{x_i^{jh(ij)} − x_i^{jh(jk)}}.

Using this form, it is possible to impose various constraints on the scattering parameters ε_k^{jh} (the re-parameterization itself is sketched in code after the following list). The models of interest are the following:

• The standard latent class model [ε_k^{jh}]: the scattering depends upon clusters, variables and levels.
• [ε_k^j]: the scattering depends upon clusters and variables but not upon levels.
• [ε_k]: the scattering depends upon clusters, but not upon variables.
• [ε^j]: the scattering depends upon variables, but not upon clusters.
• [ε]: the scattering is constant over variables and clusters.
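As announced above, here is a minimal sketch of the (m_k^j, ε_k^j) re-parameterization for one variable in one cluster; the function name is an illustrative assumption, and the numerical example is the one given in the text.

```python
import numpy as np

def mode_and_scatter(a_kj):
    """Decompose multinomial probabilities a_k^j into mode indicator m_k^j and scattering eps_k^j."""
    a_kj = np.asarray(a_kj, dtype=float)
    m = np.zeros_like(a_kj)
    m[np.argmax(a_kj)] = 1.0                      # m_k^{jh} = 1 at the mode level
    eps = np.where(m == 1.0, 1.0 - a_kj, a_kj)    # eps = 1 - a at the mode, a elsewhere
    return m, eps

# Example from the text: a_k^j = (0.7, 0.2, 0.1) gives m_k^j = (1, 0, 0) and eps_k^j = (0.3, 0.2, 0.1).
print(mode_and_scatter([0.7, 0.2, 0.1]))
```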


Many algorithms available from different points of view

The EM algorithm of Dempster et al. (1977) is the reference tool to derive the ml estimates in a mixture model. An iteration of EM is as follows (a schematic code sketch is given after the list):

• E step: Compute the conditional probabilities t_ik, i = 1, . . . , n, k = 1, . . . , K, that x_i arises from the kth component for the current value of the mixture parameters.

• M step: Update the mixture parameter estimates maximizing the expected value of the completed likelihood. This amounts to using the standard formulas where observation i is weighted for group k with the conditional probability t_ik.
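As mentioned above, a schematic implementation of one EM iteration for a multivariate Gaussian mixture may help fix ideas. This is only a minimal sketch under simplifying assumptions (no safeguards against degenerate covariance matrices, no stopping rule); all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, proportions, means, covariances):
    """E step: conditional probabilities t_ik that x_i arises from component k."""
    weighted = np.column_stack([
        p_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sig_k)
        for p_k, mu_k, Sig_k in zip(proportions, means, covariances)])
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(X, t):
    """M step: update proportions, means and covariances with the weights t_ik."""
    n, K = t.shape
    proportions = t.sum(axis=0) / n
    means, covariances = [], []
    for k in range(K):
        w_k = t[:, k]
        mu_k = (w_k[:, None] * X).sum(axis=0) / w_k.sum()
        diff = X - mu_k
        Sig_k = (w_k[:, None, None] * np.einsum('ip,iq->ipq', diff, diff)).sum(axis=0) / w_k.sum()
        means.append(mu_k)
        covariances.append(Sig_k)
    return proportions, means, covariances

# Usage sketch: alternate E and M steps from some initial parameter values.
# for _ in range(100):
#     t = e_step(X, proportions, means, covariances)
#     proportions, means, covariances = m_step(X, t)
```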

Other algorithms take advantage of the missing data structure of the mixture model. For instance, the classification EM (CEM) algorithm (see Celeux and Govaert (1992)) is directly concerned with the estimation of the missing labels z. An iteration of CEM is as follows:

• E step: As in EM.

• C step: Assign each point x_i to the component maximizing the conditional probability t_ik, using a maximum a posteriori (MAP) principle.

• M step: Update the mixture parameter estimates maximizing the completed likelihood.

CEM aims to maximize the completed likelihood where the component label of each sample point is included in the data set. CEM is a K-means-like algorithm and, contrary to EM, it converges in a finite number of iterations. But CEM provides biased estimates of the mixture parameters. This algorithm is interesting in a clustering context when the clusters are well separated (see Celeux and Govaert (1993)).
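The C step can be sketched as a hard MAP assignment inserted between the E and M steps of the previous (hypothetical) sketch; the only change is that the M step then uses 0/1 labels instead of the conditional probabilities t_ik.

```python
import numpy as np

def c_step(t):
    """C step: replace each row of conditional probabilities t_ik by a 0/1 MAP indicator vector."""
    z = np.zeros_like(t)
    z[np.arange(t.shape[0]), t.argmax(axis=1)] = 1.0
    return z

# A CEM iteration then reads: t = e_step(X, ...); z = c_step(t); ... = m_step(X, z).
```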

From another point of view, the Stochastic EM (SEM) algorithm can be useful. It is as follows:

• E step: As in EM.

• S step: Assign each point x_i at random to one of the components according to the distribution defined by (t_ik, k = 1, . . . , K).

• M step: Update the mixture parameter estimates maximizing the completed likelihood.

SEM generates a Markov chain whose stationary distribution is (more or less) concentrated around the ml parameter estimator. Thus a natural parameter estimate from a SEM sequence is the mean of the iterate values obtained after a burn-in period. An alternative estimate is the parameter value leading to the largest likelihood in a SEM sequence. In any case, SEM is expected to avoid insensible maxima of the likelihood that EM cannot avoid, but SEM can be jeopardized by spurious maxima (see Celeux et al. (1996) or McLachlan and Peel (2000) for details). Note that different variants (Monte Carlo EM, Simulated Annealing EM) are possible (see, for instance, Celeux et al. (1996)). Note also that Biernacki et al. (2003) proposed simple strategies for getting sensible ml estimates. Those strategies act in two ways to deal with this problem: they choose particular starting values from CEM or SEM, and they run EM several times or run algorithms combining CEM and EM.
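Similarly, the S step of SEM only replaces the MAP assignment by a random draw from the conditional distribution (t_i1, . . . , t_iK); a minimal sketch (reusing the hypothetical e_step and m_step above) is the following.

```python
import numpy as np

def s_step(t, rng=np.random.default_rng(0)):
    """S step: draw one component label per observation from the distribution (t_i1, ..., t_iK)."""
    n, K = t.shape
    labels = np.array([rng.choice(K, p=t[i]) for i in range(n)])
    z = np.zeros_like(t)
    z[np.arange(n), labels] = 1.0
    return z

# An SEM iteration then reads: t = e_step(X, ...); z = s_step(t); ... = m_step(X, z).
```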

Special questions can be tackled in a proper way in the MBC context

Robust cluster analysis can be obtained by making use of multivariate Student distributions instead of multivariate Gaussian distributions. This attenuates the influence of outliers (McLachlan and Peel (2000)). On the other hand, including in the mixture a group from a uniform distribution makes it possible to take noisy data into account (DasGupta and Raftery (1998)).

To avoid spurious maxima of the likelihood, shrinking the group variance matrices toward a matrix proportional to the identity matrix can be quite efficient. One of the most accomplished works in this domain is Ciuperca et al. (2003).

Taking advantage of the probabilistic framework, it is possible to deal in a proper way with data missing at random with mixture models (Hunt and Basford (2001)). Also, simple, natural and efficient methods of semi-supervised classification can be derived in the mixture framework (a pioneering article on this subject, recently followed by many others, is Ganesalingam and McLachlan (1978)). Finally, it can be noted that promising variable selection procedures for model-based clustering are beginning to appear (Raftery and Dean (2006)).

3 Choosing a model for a classification purpose

In statistical inference from data, selecting a parsimonious model among a collection of models is an important but difficult task. This general problem has received much attention since the seminal articles of Akaike (1974) and Schwarz (1978). A model selection problem consists essentially of solving the bias-variance dilemma. A classical approach to the model assessment problem consists of penalizing the fit of a model by a measure of its complexity. The AIC criterion of Akaike (1974) is an asymptotic approximation of the expectation of the deviance. It is

AIC(m) = 2 log p(x | m, θ̂_m) − 2 ν_m,    (2)

where θ̂_m is the ml estimate of parameter θ_m and ν_m is the number of free parameters of model m.

Another point of view consists of basing the model selection on the integrated likelihood of the data in a Bayesian perspective (see Kass and Raftery (1995)). This integrated likelihood is

p(x | m) = ∫ p(x | m, θ_m) π(θ_m) dθ_m,    (3)


π(θ_m) being a prior distribution for parameter θ_m. The essential technical problem is to approximate this integrated likelihood in a proper way. A classical asymptotic approximation of the logarithm of the integrated likelihood is the BIC criterion of Schwarz (1978). It is

BIC(m) = log p(x | m, θ̂_m) − (ν_m / 2) log n.    (4)
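In practice, criteria (2) and (4) are simple functions of the maximized loglikelihood; a minimal sketch follows (the loglikelihood is assumed to have been computed beforehand, e.g. with EM, and the names are illustrative).

```python
import numpy as np

def aic(loglik, n_free_params):
    """AIC criterion (2): 2 log p(x | m, theta_hat) - 2 nu_m."""
    return 2.0 * loglik - 2.0 * n_free_params

def bic(loglik, n_free_params, n_obs):
    """BIC criterion (4): log p(x | m, theta_hat) - (nu_m / 2) log n."""
    return loglik - 0.5 * n_free_params * np.log(n_obs)

# For a d-dimensional Gaussian mixture with K components and unconstrained covariance
# matrices, the number of free parameters is (K - 1) + K * d + K * d * (d + 1) / 2.
```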

Beyond technical difficulties, the scope of this section is to show how it can be fruitful to take the purpose of the model user into account to get reliable and useful models for statistical description or decision tasks. Two situations are considered to support this idea: choosing the number of components in a mixture model from a cluster analysis perspective, and choosing a generative probabilistic model in a supervised classification context.

3.1 Choosing the number of clusters

Assessing the number K of components in a mixture model is a difficult question, from both theoretical and practical points of view, which has received much attention in the past two decades. This section does not propose a state of the art of this problem, which has not been completely solved. The reader is referred to Chapter 6 of the book of McLachlan and Peel (2000) for an excellent overview of this subject. This section essentially aims to discuss elements of practical interest regarding the problem of choosing the number of mixture components when concerned with cluster analysis.

From the theoretical point of view, even when the right number of components K∗ is assumed to exist, if K∗ < K_0 then K∗ is not identifiable in the parameter space Θ_{K_0} (see for instance McLachlan and Peel (2000), Chapter 6).

But, here, we want to stress the importance of taking the modeling context into account to select a reasonable number of mixture components. Our opinion is that, beyond the theoretical difficulties, assessing the number of components in a mixture model from data is a weakly identifiable statistical problem. Mixture densities with different numbers of components can lead to quite similar resulting probability distributions. For instance, the galaxy velocities data of Roeder (1990) has become a benchmark data set and is used by many authors to illustrate procedures for choosing the number of mixture components. Now, according to those authors the answer lies between K = 2 and K = 10, and it is not exaggerating a lot to say that every answer between 2 and 10 has been proposed as a good answer, at least once, in the articles considering this particular data set. (An interesting and illuminating comparative study on this data set can be found in Aitkin (2001).) Thus, we consider that it is highly desirable to choose K by keeping in mind what is expected from the mixture modeling in order to get a relevant answer to this question. Actually, mixture modeling can be used for quite different purposes. It can be regarded as a semi-parametric tool for density estimation or as a tool for cluster analysis.

In the first perspective, much considered by Bayesian statisticians, numerical experiments (see Fraley and Raftery (1998)) show that the BIC approximation of the integrated likelihood works well at a practical level. And, under regularity conditions including the fact that the component densities are finite, Keribin (2000) proved that BIC provides a consistent estimator of K.

But, in the second perspective, the integrated likelihood does not take the clustering purpose into account when selecting a mixture model in a model-based clustering setting. As a consequence, in the most common situations where the distribution from which the data arose does not belong to the collection of considered mixture models, the BIC criterion will tend to overestimate the correct size regardless of the separation of the clusters (see Biernacki et al. (2000)).

To overcome this limitation, it can be advantageous to choose K in order to get the mixture giving rise to the partition of the data with the greatest evidence. With that purpose in mind, Biernacki et al. (2000) considered the integrated likelihood of the complete data (x, z), or integrated completed likelihood. (Recall that z = (z_1, . . . , z_n) denotes the missing data, the z_i = (z_i1, . . . , z_iK) being binary K-dimensional vectors with z_ik = 1 if and only if x_i arises from component k.) Then, the integrated completed likelihood is

p(x, z | K) = ∫_{Θ_K} p(x, z | K, θ) π(θ | K) dθ,    (5)

where

p(x, z | K, θ) = ∏_{i=1}^{n} p(x_i, z_i | K, θ)

with

p(x_i, z_i | K, θ) = ∏_{k=1}^{K} p_k^{z_ik} [φ(x_i | a_k)]^{z_ik}.

To approximate this integrated completed likelihood, those authors propose to use a BIC-like approximation leading to the criterion

ICL(K) = log p(x, ẑ | K, θ̂) − (ν_K / 2) log n,    (6)

where the missing data z have been replaced by their most probable values ẑ for the parameter estimate θ̂ (details can be found in Biernacki et al. (2000)). Roughly speaking, criterion ICL is the criterion BIC penalized by the mean entropy

E(K) = − ∑_{k=1}^{K} ∑_{i=1}^{n} t_ik log t_ik ≥ 0,

t_ik denoting the conditional probability that x_i arises from the kth mixture component (1 ≤ i ≤ n and 1 ≤ k ≤ K).
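Following the "roughly speaking" relation above, the ICL approximation (6) can be sketched as the BIC value penalized by the mean entropy of the conditional probabilities t_ik; the function names are illustrative assumptions.

```python
import numpy as np

def mean_entropy(t, eps=1e-300):
    """Entropy term E(K) = - sum_k sum_i t_ik log t_ik (>= 0)."""
    return -np.sum(t * np.log(t + eps))

def icl(loglik, n_free_params, n_obs, t):
    """ICL approximation (6), computed as BIC minus the mean entropy of the fuzzy classification."""
    bic_value = loglik - 0.5 * n_free_params * np.log(n_obs)
    return bic_value - mean_entropy(t)
```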


As a consequence, because of this additional entropy term, ICL favors values of K giving rise to partitions of the data with the greatest evidence, as highlighted in the numerical experiments of Biernacki et al. (2000). More generally, ICL appears to provide a stable and reliable estimate of K for real data sets and also for simulated data sets from mixtures whose components do not overlap too much (see McLachlan and Peel (2000)). But ICL, which does not aim to discover the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with poorly separated components, as illustrated in Figueiredo and Jain (2002).

On the contrary, BIC performs remarkably well at assessing the true number of components from simulated data (see Biernacki et al. (2000) or Fraley and Raftery (1998) for instance). But, for real-world data sets, BIC has a marked tendency to overestimate the number of components. The reason is that real data sets do not arise from the mixture densities at hand, and the penalty term of BIC is not strong enough to balance the tendency of the loglikelihood to increase with K in order to improve the fit of the mixture model.

3.2 Model selection in classification

Supervised classification is about guessing the unknown group, among K groups, of a unit i from the knowledge of d variables entering in a vector x_i. This group for unit i is defined by z_i = (z_i1, . . . , z_iK), a binary K-dimensional vector with z_ik = 1 if and only if x_i arises from group k. For that purpose, a decision function, called a classifier, δ(x) : R^d → {1, . . . , K}, is designed from a learning sample (x_i, z_i), i = 1, . . . , n. A classical approach to designing a classifier is to represent the group conditional densities with a parametric model p(x | m, z_k = 1, θ_m) for k = 1, . . . , K. Then the classifier assigns an observation x to the group k maximizing the conditional probability p(z_k = 1 | m, x, θ̂_m). Using the Bayes rule, this leads to setting δ(x) = j if and only if j = argmax_k p_k p(x | m, z_k = 1, θ̂_m), θ̂_m being the ml estimate of the group conditional parameters θ_m and p_k being the prior probability of group k. This approach is known as generative discriminant analysis in the Machine Learning community.
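For Gaussian group-conditional densities, the resulting plug-in MAP rule can be sketched as follows (a minimal illustration; the function name is an assumption and the parameters stand for the group-wise ml estimates computed from the learning sample).

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_classifier(x, priors, means, covariances):
    """Assign x to the group k maximizing p_k * p(x | m, z_k = 1, theta) (plug-in MAP rule)."""
    scores = [p_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sig_k)
              for p_k, mu_k, Sig_k in zip(priors, means, covariances)]
    return int(np.argmax(scores))
```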

In this context, it can be expected that the actual error rate will be improved by selecting a generative model m among a large collection of models M (see for instance Friedman (1989) or Bensmail and Celeux (1996)). For instance, Hastie and Tibshirani (1996) proposed to model each group density with a mixture of Gaussian distributions. In this approach the numbers of mixture components per group are sensitive tuning parameters. They can be supplied by the user, as in Hastie and Tibshirani (1996), but this is clearly a sub-optimal solution. They can be chosen to minimize the v-fold cross-validated error rate, as done in Friedman (1989) or Bensmail and Celeux (1996) for other tuning parameters. Although the choice of v can be sensitive, this can be regarded as a nearly optimal solution. But it is highly CPU time consuming, and choosing tuning parameters with a penalized loglikelihood criterion, such as BIC, can be expected to be much more efficient in many situations. However, BIC measures the fit of the model m to the data (x, z) rather than its ability to produce a reliable classifier. Thus, in many situations, BIC can have a tendency to overestimate the complexity of the generative classification model to be chosen. In order to counter this tendency, a penalized likelihood criterion taking the classification task into account when evaluating the performance of a model has been proposed by Bouchard and Celeux (2006). It is the so-called Bayesian Entropy Criterion (BEC), which is now presented.

As stated above, a classifier deduced from model m assigns an observation x to the group k maximizing p(z_k = 1 | m, x, θ̂_m). Thus, from the classification point of view, the conditional likelihood p(z | m, x, θ_m) plays a central role. For this very reason, Bouchard and Celeux (2006) proposed to make use of the integrated conditional likelihood

p(z | m, x) = ∫ p(z | m, x, θ_m) π(θ_m | x) dθ_m,    (7)

where

π(θ_m | x) ∝ π(θ_m) p(x | m, θ_m)

is the posterior distribution of θ_m knowing x, to select a relevant model m. As for the integrated likelihood, this integral is generally difficult to calculate and has to be approximated. We have

p(z | m, x) = p(x, z | m) / p(x | m)    (8)

with

p(x, z | m) = ∫ p(x, z | m, θ_m) π(θ_m) dθ_m,    (9)

and

p(x | m) = ∫ p(x | m, θ_m) π(θ_m) dθ_m.    (10)

Denoting

θ̂_m = argmax_{θ_m} p(x, z | m, θ_m),

θ̃_m = argmax_{θ_m} p(x | m, θ_m)

and

θ∗_m = argmax_{θ_m} p(z | m, x, θ_m),

BIC-like approximations of the numerator and denominator of equation (8) lead to

log p(z | m, x) = log p(x, z | m, θ̂_m) − log p(x | m, θ̃_m) + O(1).    (11)

Thus the approximation of log p(z | m, x) that Bouchard and Celeux (2006) proposed is


BEC = log p(x, z | m, θ̂_m) − log p(x | m, θ̃_m).    (12)

The BEC criterion needs the computation of θ̃_m = argmax_{θ_m} p(x | m, θ_m). Since, for i = 1, . . . , n,

p(x_i | m, θ_m) = ∑_{k=1}^{K} p(z_ik = 1 | m, θ_m) p(x_i | z_ik = 1, m, θ_m),

θ̃_m is the ml estimate of a finite mixture distribution. It can be derived from the EM algorithm, and the EM algorithm can be initialized in a quite natural and unique way with θ̂_m. Thus the calculation of θ̃_m avoids the possible difficulties which can be encountered with the EM algorithm. Despite the need to use the EM algorithm to estimate this parameter, it can be expected to be estimated in a stable and reliable way. It can also be noted that when the learning data set has been obtained through the diagnosis paradigm, the proportions in the mixture distribution are fixed: p_k = card{i such that z_ik = 1}/n for k = 1, . . . , K.
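A minimal sketch of the BEC computation (12) is given below; it assumes that θ̂ (from the labeled learning sample) and θ̃ (from EM on the observations alone, initialized at θ̂) have already been obtained, and all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_loglik(X, z, proportions, means, covariances):
    """log p(x, z | m, theta): loglikelihood of the labeled learning sample."""
    ll = 0.0
    for p_k, mu_k, Sig_k, z_k in zip(proportions, means, covariances, z.T):
        in_k = z_k == 1
        ll += in_k.sum() * np.log(p_k)
        ll += multivariate_normal.logpdf(X[in_k], mean=mu_k, cov=Sig_k).sum()
    return ll

def mixture_loglik(X, proportions, means, covariances):
    """log p(x | m, theta): loglikelihood of the observations under the mixture."""
    dens = np.column_stack([p_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sig_k)
                            for p_k, mu_k, Sig_k in zip(proportions, means, covariances)])
    return np.log(dens.sum(axis=1)).sum()

def bec(X, z, theta_hat, theta_tilde):
    """BEC criterion (12): log p(x, z | m, theta_hat) - log p(x | m, theta_tilde)."""
    return complete_loglik(X, z, *theta_hat) - mixture_loglik(X, *theta_tilde)
```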

Numerical experiments reported in Bouchard and Celeux (2006) show that BEC and the cross-validated error rate criterion select the same models most of the time, contrary to BIC, which often selects suboptimal models.

4 Discussion

As sketched in Section 2 of this article, finite mixture analysis is definitely a powerful framework for model-based cluster analysis. Many free and valuable software packages for mixture analysis are available: C.A.Man, Emmix, Flemix, MClust, mixmod, Multimix, Sob . . . We want to highlight the mixmod software, on which we have been working for years (Biernacki et al. (2006)). It is mixture software for cluster analysis and classification which contains most of the features described here and whose latest version is quite fast. It is available at the url http://www-math.univ-fcomte.fr/mixmod.

In the second part of this article, we highlighted how it can be useful to take the model purpose into account to select a relevant and useful model. This point of view can lead to selection criteria different from the classical BIC criterion. It has been illustrated in two situations: modeling for a clustering purpose and modeling for a supervised classification purpose. This leads to two penalized likelihood criteria, ICL and BEC, for which the penalty is data driven and which are expected to choose a useful, if not true, model.

Now, it can be noticed that we did not consider the modeling purpose when estimating the model parameters. In both situations, we simply considered the maximum likelihood estimator. Taking the modeling purpose into account in the estimation process could be regarded as an interesting point of view. However, we do not think that this point of view is fruitful and, moreover, we think it can jeopardize the statistical analysis. For instance, in the cluster analysis context, it could be thought more natural to compute the parameter value maximizing the complete loglikelihood log p(x, z | θ) rather than the observed loglikelihood log p(x | θ). But, as proved in Bryant and Williamson (1978), this strategy leads to asymptotically biased estimates of the mixture parameters. In the same manner, in the supervised classification context, considering the parameter value θ∗ maximizing directly the conditional likelihood log p(z | x, θ) could be regarded as an alternative to classical maximum likelihood estimation. But this would lead to difficult optimization problems and would provide unstable estimated values. Finally, we do not recommend taking the modeling purpose into account when estimating the model parameters because it could lead to cumbersome algorithms or provoke undesirable biases in the estimation. On the contrary, we think that taking the model purpose into account when assessing a model can lead to the choice of reliable and stable models, especially in the unsupervised and supervised classification contexts.

References

AITKIN, M. (2001): Likelihood and Bayesian Analysis of Mixtures. Statistical Modelling, 1, 287–304.

AKAIKE, H. (1974): A New Look at Statistical Model Identification. IEEE Transactions on Automatic Control, 19, 716–723.

BANFIELD, J.D. and RAFTERY, A.E. (1993): Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821.

BENSMAIL, H. and CELEUX, G. (1996): Regularized Gaussian Discriminant Analysis Through Eigenvalue Decomposition. Journal of the American Statistical Association, 91, 1743–1748.

BIERNACKI, C., CELEUX, G. and GOVAERT, G. (2000): Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Trans. on PAMI, 22, 719–725.

BIERNACKI, C., CELEUX, G. and GOVAERT, G. (2003): Choosing Starting Values for the EM Algorithm for Getting the Highest Likelihood in Multivariate Gaussian Mixture Models. Computational Statistics and Data Analysis, 41, 561–575.

BIERNACKI, C., CELEUX, G., GOVAERT, G. and LANGROGNET, F. (2006): Model-based Cluster Analysis and Discriminant Analysis With the MIXMOD Software. Computational Statistics and Data Analysis (to appear).

BOUCHARD, G. and CELEUX, G. (2006): Selection of Generative Models in Classification. IEEE Trans. on PAMI, 28, 544–554.

BRYANT, P. and WILLIAMSON, J. (1978): Asymptotic Behavior of Classification Maximum Likelihood Estimates. Biometrika, 65, 273–281.

CELEUX, G., CHAUVEAU, D. and DIEBOLT, J. (1996): Some Stochastic Versions of the EM Algorithm. Journal of Statistical Computation and Simulation, 55, 287–314.

CELEUX, G. and GOVAERT, G. (1991): Clustering Criteria for Discrete Data and Latent Class Model. Journal of Classification, 8, 157–176.

CELEUX, G. and GOVAERT, G. (1992): A Classification EM Algorithm for Clustering and Two Stochastic Versions. Computational Statistics and Data Analysis, 14, 315–332.


CELEUX, G. and GOVAERT, G. (1993): Comparison of the Mixture and the Classification Maximum Likelihood in Cluster Analysis. Journal of Statistical Computation and Simulation, 47, 127–146.

CIUPERCA, G., IDIER, J. and RIDOLFI, A. (2003): Penalized Maximum Likelihood Estimator for Normal Mixtures. Scandinavian Journal of Statistics, 30, 45–59.

DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood From Incomplete Data Via the EM Algorithm (With Discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

DIEBOLT, J. and ROBERT, C.P. (1994): Estimation of Finite Mixture Distributions by Bayesian Sampling. Journal of the Royal Statistical Society, Series B, 56, 363–375.

FIGUEIREDO, M. and JAIN, A.K. (2002): Unsupervised Learning of Finite Mixture Models. IEEE Trans. on PAMI, 24, 381–396.

FRALEY, C. and RAFTERY, A.E. (1998): How Many Clusters? Answers via Model-based Cluster Analysis. The Computer Journal, 41, 578–588.

FRIEDMAN, J. (1989): Regularized Discriminant Analysis. Journal of the American Statistical Association, 84, 165–175.

GANESALINGAM, S. and MCLACHLAN, G.J. (1978): The Efficiency of a Linear Discriminant Function Based on Unclassified Initial Samples. Biometrika, 65, 658–662.

GOODMAN, L.A. (1974): Exploratory Latent Structure Analysis Using Both Identifiable and Unidentifiable Models. Biometrika, 61, 215–231.

HASTIE, T. and TIBSHIRANI, R. (1996): Discriminant Analysis By Gaussian Mixtures. Journal of the Royal Statistical Society, Series B, 58, 158–176.

HUNT, L.A. and BASFORD, K.E. (2001): Fitting a Mixture Model to Three-mode Three-way Data With Missing Information. Journal of Classification, 18, 209–226.

KASS, R.E. and RAFTERY, A.E. (1995): Bayes Factors. Journal of the American Statistical Association, 90, 773–795.

KERIBIN, C. (2000): Consistent Estimation of the Order of Mixture Models. Sankhya, 62, 49–66.

MARIN, J.-M., MENGERSEN, K. and ROBERT, C.P. (2005): Bayesian Analysis of Finite Mixtures. Handbook of Statistics, Vol. 25, Chapter 16. Elsevier B.V.

MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.

RAFTERY, A.E. (1995): Bayesian Model Selection in Social Research (With Discussion). In: P.V. Marsden (Ed.): Sociological Methodology 1995. Blackwells, Oxford, U.K., 111–196.

RAFTERY, A.E. and DEAN, N. (2006): Variable Selection for Model-Based Clustering. Journal of the American Statistical Association, 101, 168–178.

ROEDER, K. (1990): Density Estimation with Confidence Sets Exemplified by Superclusters and Voids in Galaxies. Journal of the American Statistical Association, 85, 617–624.

SCHWARZ, G. (1978): Estimating the Dimension of a Model. The Annals of Statistics, 6, 461–464.