
Advanced Review

Challenges in model-based clustering

Volodymyr Melnykov*

*Correspondence to: [email protected]
Department of Information Systems, Statistics, and Management Science, The University of Alabama, Tuscaloosa, AL, USA

Model-based clustering is an increasingly popular area of cluster analysis that relies on a probabilistic description of data by means of finite mixture models. Mixture distributions prove to be a powerful technique for modeling heterogeneity in data. In model-based clustering, each data group is seen as a sample from one or several mixture components. Despite this attractive interpretation, model-based clustering poses many challenges. This paper discusses some of the most important problems a researcher might encounter while applying model-based cluster analysis. © 2013 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2013, 5:135–148. doi: 10.1002/wics.1248

Keywords: model-based clustering; finite mixture models; EM algorithm; initialization; dimensionality reduction

INTRODUCTION

Model-based clustering has experienced extensive development since its first appearance in the paper by Pearson [1] more than 100 years ago. Numerous applications of model-based clustering can be seen in various fields of human activity, e.g., in the analysis of video scenes [2], mass spectrometry data [3,4], studies of spelling ability [5], and many others. A detailed overview of basic model-based clustering principles is provided in Ref 6. This paper aims at extending the discussion started by Stahl and Sallis [6] in the direction of common practical pitfalls a researcher can encounter in this popular area.

Model-based clustering is closely related to finite mixture modeling, which is very popular due to its flexibility in fitting heterogeneous data.

A finite mixture distribution is given by the following expression:

$$f(x \,|\, \vartheta) = \sum_{k=1}^{K} \tau_k f_k(x \,|\, \vartheta_k), \qquad (1)$$

where fk(·|ϑk) denotes the kth mixture component of a known functional form with parameter vector ϑk, and K represents the total number of components; τk is the kth mixing proportion, such that τk > 0 and $\sum_{k=1}^{K} \tau_k = 1$; and ϑ represents the entire parameter vector, $\vartheta = (\tau_1, \ldots, \tau_K, \vartheta_1', \ldots, \vartheta_K')'$. Various parameter estimation procedures have been studied in this setting. Possible approaches include the method of moments [1,7] and distance-based procedures [8], but maximum likelihood estimation carried out by means of the expectation-maximization (EM) algorithm [9] is by far the most popular method for estimating the parameter vector ϑ. To learn more about the application of the EM algorithm and finite mixture models, we refer the reader to the two comprehensive resources by McLachlan and Krishnan [10] and McLachlan and Peel [11].
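To make Eq. (1) concrete, here is a small R sketch (our own illustration, not code from the paper) that evaluates the density of a univariate Gaussian mixture; the parameter values below are arbitrary choices for demonstration.

```r
# Density of a K-component univariate Gaussian mixture, Eq. (1):
# f(x | theta) = sum_k tau_k * N(x | mu_k, sigma_k^2).
dmix <- function(x, tau, mu, sigma) {
  K <- length(tau)
  dens <- numeric(length(x))
  for (k in 1:K) dens <- dens + tau[k] * dnorm(x, mu[k], sigma[k])
  dens
}

# Illustrative two-component mixture (parameter values are arbitrary).
tau   <- c(0.4, 0.6)          # mixing proportions, sum to 1
mu    <- c(-2, 3)             # component means
sigma <- c(1, 1.5)            # component standard deviations

x <- seq(-6, 8, length.out = 200)
plot(x, dmix(x, tau, mu, sigma), type = "l", ylab = "f(x)")
```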

In the last two decades, with the development of high-performance computing, Bayesian approaches to the analysis of mixture models via Markov chain Monte Carlo (MCMC) procedures have become increasingly popular [12,13,14]. Lau and Green [15] and McLachlan and Peel [11] discussed the application of Bayesian methods in the model-based clustering setting. The most challenging and well-known problem in this framework is label switching. Due to the lack of identifiability of the model, mixture component labels can permute, making the analysis problematic. Some remedies are considered in Refs 16 and 17. One possible approach is to conduct unrestricted simulations followed by a label reassignment step [18]. More information about Bayesian approaches to mixture modeling can be found in Ref 6. The primary focus of this paper, however, is the traditional frequentist parameter estimation via the EM algorithm.

EM ALGORITHM

Let X1, ..., Xn be a random sample consisting of n independent realizations from the mixture distribution (1). Direct maximum likelihood estimation based on the likelihood function $L(\vartheta \,|\, x_1,\ldots,x_n) = \prod_{i=1}^{n} f(x_i \,|\, \vartheta)$ is generally not straightforward because of the complicated form of the log-likelihood function $\ell(\vartheta \,|\, x_1,\ldots,x_n) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \tau_k f_k(x_i \,|\, \vartheta_k)$. The EM algorithm provides a convenient approach to maximum likelihood estimation through the inclusion of unknown membership labels z1, ..., zn. Then, the complete-data likelihood function can be written as

$$L_c(\vartheta \,|\, z_1,\ldots,z_n, x_1,\ldots,x_n) = \prod_{i=1}^{n} \prod_{k=1}^{K} \bigl( \tau_k f_k(x_i \,|\, \vartheta_k) \bigr)^{I(z_i = k)},$$

where I(·) is the indicator function, i.e., I(zi = k) = 1 when xi comes from the kth component and I(zi = k) = 0 otherwise. The EM algorithm involves expectation (E) and maximization (M) steps. The E-step aims at finding the conditional expectation of the complete-data log-likelihood function given the observed data. Traditionally, this expectation is denoted as the Q-function and is given by

$$Q(\vartheta \,|\, \vartheta^{(b-1)}, x_1,\ldots,x_n) = \sum_{i=1}^{n} \sum_{k=1}^{K} \pi_{ik}^{(b)} \bigl( \log \tau_k + \log f_k(x_i \,|\, \vartheta_k) \bigr), \qquad (2)$$

where b = 1, 2, ... represents the iteration number and $\pi_{ik}^{(b)}$ is the posterior probability that xi originates from the kth distribution, calculated by

$$\pi_{ik}^{(b)} = \frac{\tau_k^{(b-1)} f_k(x_i \,|\, \vartheta_k^{(b-1)})}{\sum_{k'=1}^{K} \tau_{k'}^{(b-1)} f_{k'}(x_i \,|\, \vartheta_{k'}^{(b-1)})}.$$

The M-step involves the maximization of the Q-function (2) with respect to ϑ. The E- and M-steps are iterated until convergence is reached at some iteration B. Convergence is evaluated by means of a stopping criterion, such as the relative change in two consecutive log-likelihood values being smaller than a pre-specified tolerance level. Then, the maximum likelihood estimate is given by $\hat{\vartheta} = \vartheta^{(B)}$ and the final posterior probabilities are $\hat{\pi}_{ik} = \pi_{ik}^{(B)}$.
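To make the E- and M-steps concrete, the following is a minimal R sketch of the EM algorithm for a univariate Gaussian mixture with unrestricted component variances. It is an illustration rather than production code: the function name em_gauss and all variable names are ours, the stopping rule is the relative change in consecutive log-likelihood values described above, and the final element applies the Bayes decision rule discussed below.

```r
# Minimal EM algorithm for a K-component univariate Gaussian mixture
# (illustrative sketch; all names are ours, not from the paper).
em_gauss <- function(x, K, tau, mu, sigma, tol = 1e-6, max_iter = 1000) {
  n <- length(x)
  loglik_old <- -Inf
  for (b in 1:max_iter) {
    # E-step: posterior probabilities pi_ik evaluated at the current parameters
    dens <- sapply(1:K, function(k) tau[k] * dnorm(x, mu[k], sigma[k]))
    pik  <- dens / rowSums(dens)
    # M-step: maximize the Q-function (2) with respect to tau, mu, and sigma
    nk    <- colSums(pik)
    tau   <- nk / n
    mu    <- colSums(pik * x) / nk
    sigma <- sqrt(colSums(pik * (x - rep(mu, each = n))^2) / nk)
    # Stopping rule: relative change in two consecutive log-likelihood values
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol * abs(loglik)) break
    loglik_old <- loglik
  }
  list(tau = tau, mu = mu, sigma = sigma, loglik = loglik, posterior = pik,
       # Bayes decision rule: assign x_i to the component with the largest
       # posterior probability
       classification = apply(pik, 1, which.max))
}

# Example run on simulated data with two well-separated components.
set.seed(1)
x   <- c(rnorm(100, -2, 1), rnorm(150, 3, 1.5))
fit <- em_gauss(x, K = 2, tau = c(0.5, 0.5), mu = c(-1, 1), sigma = c(1, 1))
table(fit$classification)
```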

One of the most well-known pitfalls of maximum likelihood estimation in finite mixture models is the potential nonexistence of the global maximizer, which happens when the likelihood function is unbounded [19]. The relevance of this problem depends entirely on the particular form of the considered mixture. For example, the most frequently used multivariate Gaussian mixture model with unrestricted covariance matrices does not have a global maximizer. At the same time, a mixture of Gaussians with equal covariance matrices is free of this problem. The difference between these two examples lies in the control over the size of clusters modeled by components. If covariance matrices are not restricted, unwanted singular solutions are possible when some components are built on single observations. This suggests a natural remedy: providing control over the parameters. Hathaway [20] noted that a consistent global maximizer can be found in a constrained parameter space. The restrictions can be imposed either on scale parameters or on mixing proportions. An alternative approach suggests finding the best (in terms of the highest likelihood) local maximizer in the interior of the parameter space. Such a maximizer is consistent, asymptotically normal, and efficient [11,21]. One more approach to dealing with an unbounded likelihood is to consider a penalized likelihood, as discussed in Refs 22 and 23. Closely related to singular solutions are spurious local maximizers, another kind of undesirable solution, discussed later in the paper.

Assuming that each group of points can be modeled by a single mixture component, we obtain an attractive one-to-one correspondence between clusters and components that makes model-based clustering so intuitive and popular. In other words, it is typically assumed that, for a correctly specified model, each data group can be seen as a sample from a specific mixture component. To obtain a classification vector, the Bayes decision rule is employed, i.e., each observation xi is assigned to the mixture component with the highest posterior probability: $\hat{z}_i = \arg\max_k \{\hat{\pi}_{ik}\}$. For more details, we refer the reader to a recent review of model-based clustering concepts in Ref 6. Another useful resource is the review provided in Ref 24.

Although the realization of model-based clustering seems relatively straightforward, a researcher needs to be familiar with numerous existing pitfalls. The solution obtained by the EM algorithm is very sensitive to the choice of starting parameter values. The performance of model-based clustering can be severely affected by the presence of scatter, outliers, or noninformative variables. Often, more than one component is needed to adequately model a single group of data [25,26]; hence, one has to decide on the optimal number of clusters. These and some other challenges in modern model-based clustering will be discussed in this paper.

INITIALIZATION STRATEGIES

The final partition obtained by model-based clustering depends severely on the initialization of the EM algorithm, by means of which the likelihood function for a finite mixture model is typically maximized. Various starting points in the parameter space very often lead to different local maxima and, as a result, to different clustering solutions. Although the convergence of the EM algorithm is crucial for the entire concept of mixture modeling and model-based clustering, the literature still does not provide a universal initialization recommendation even for the most studied and popular case of Gaussian mixture models. Several approaches have been considered in the literature. Karlis and Xekalaki [27] conducted a simulation study comparing several initialization strategies for simple two- and three-component mixture models in a univariate framework. Real-life problems, in the meantime, are usually much more challenging. In this paper, we focus on the most popular, useful, and promising initialization strategies.

One of the most popular schemes for finding reasonable starting values for the EM algorithm is to obtain an initial partition by means of some other clustering algorithm, e.g., k-means [28] or hierarchical clustering [29]. The obtained partitioning can then be used immediately to start the EM algorithm from the M-step. Unfortunately, this approach has several drawbacks. One of them is that many algorithms (e.g., k-means or k-medoids [30]) themselves suffer from the necessity to be properly initialized. In addition, clustering algorithms by construction impose specific tendencies to prefer clusters of particular shapes and patterns. For example, k-means uses the Euclidean metric for calculating distances between observations and cluster centers and, therefore, tends to organize data points into homogeneous clusters of approximately spherical shape. Many linkages used in hierarchical clustering, such as Ward's [31] and average [32] linkages, have a somewhat similar effect. Single [33] and complete [34] linkages tend to create some clusters consisting of very few data points, a feature that is very undesirable in the finite mixture modeling framework because of the difficulties associated with estimating parameters from few observations. Hence, the common practice of initializing the EM algorithm with a classification vector obtained by another clustering algorithm is a simple technique that often leads to a poor solution.
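The common practice just described can be sketched in a few lines of R, reusing the illustrative em_gauss() function defined in the EM ALGORITHM section: the k-means partition is converted into starting parameter estimates (effectively one M-step on the hard partition) and the EM algorithm is then launched from these values. As noted above, this simple recipe often yields a poor solution.

```r
# Initialize EM from a k-means partition (sketch building on em_gauss() above).
km_init_em <- function(x, K) {
  cl    <- kmeans(x, centers = K, nstart = 10)$cluster
  # One M-step on the hard partition: proportions, means, and standard
  # deviations computed within each k-means cluster.
  tau   <- as.numeric(table(cl)) / length(x)
  mu    <- as.numeric(tapply(x, cl, mean))
  sigma <- as.numeric(tapply(x, cl, sd))
  em_gauss(x, K, tau, mu, sigma)
}
```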

A different initialization approach, called emEM, was proposed by Biernacki et al. [35]. This stochastic procedure starts by choosing K data points at random. These points are used as centroids, and the remaining observations are assigned to the cluster with the closest centroid in terms of Euclidean distance. The obtained partitioning is then used to estimate starting parameter values for the EM algorithm, which is run for several iterations or until some lax convergence criterion (e.g., the relative change in log-likelihood values being less than 0.01) is met [36]. The described procedure, called short EM, should be repeated multiple times from various starting points chosen at random. The set of parameter values from the short EM stage yielding the highest likelihood is then recommended as a starter for the final run of the EM algorithm, called long EM. This approach has proved to be an efficient instrument in many situations; however, it has its own limitations. The most serious concern related to emEM is that short EM stages are initialized with parameter estimates coming from a partition that relies on the calculation of Euclidean distances. In other words, even though an investigator can repeat short EM runs multiple times and hope that at least one of them produces starting values that will eventually lead to the correct solution, it is very likely that all proposed starting points will be far away from the true solution in the parameter space. It should be noted that in many cases, especially when K is very low, the EM algorithm is able to find the optimal solution even when it starts from a point not necessarily close to the maximizer. On the other hand, when the number of mixture components is large or even moderate (e.g., K = 10) [37], the EM algorithm has serious issues with finding the best local maximizer if not initialized properly. A modification of the emEM procedure called Rnd-EM was proposed by Maitra [38]. The author suggested replacing all short EM stages with immediate likelihood estimation. The underlying idea is that it can sometimes be more beneficial to examine more starting parameter combinations instead of running short EM stages for several iterations. Melnykov and Melnykov [37] remarked that, based on the conducted simulation studies, emEM should be the method of choice when components overlap substantially, whereas Rnd-EM should be preferred when there is considerable separation between mixture components. The performance of both methods is reported to degrade considerably as K increases.
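A simplified sketch of the emEM idea in the same univariate setting (again building on the illustrative em_gauss()): many short EM runs are launched from random starts, the candidate with the highest likelihood is kept, and a single long EM run is started from it. In this toy version the short stage is capped by a small iteration limit rather than a lax tolerance, and the random starts are formed more crudely than in the original proposal; replacing the short runs with a single likelihood evaluation at each candidate start gives the spirit of Rnd-EM.

```r
# emEM-style initialization (simplified sketch): short EM runs from random
# starts, then one long EM run from the best short-run estimates.
em_em <- function(x, K, n_starts = 100, short_iter = 10) {
  best <- NULL
  for (s in 1:n_starts) {
    # Random start: K observations act as candidate component means; mixing
    # proportions are uniform and the pooled standard deviation is reused.
    start <- em_gauss(x, K, tau = rep(1 / K, K), mu = sample(x, K),
                      sigma = rep(sd(x), K), max_iter = short_iter)
    if (is.null(best) || start$loglik > best$loglik) best <- start
  }
  # Long EM from the best short-EM solution.
  em_gauss(x, K, best$tau, best$mu, best$sigma)
}
```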

Melnykov and Melnykov [37] proposed their own initialization method for Gaussian mixture models, called Σ-EM. The procedure relies on detecting dense groups of observations belonging to the same cluster. On the basis of the observations in such a group, the covariance matrix of a truncated normal distribution can be estimated. Then, using the relationship between truncated and untruncated normal distributions, the authors develop an iterative procedure capable of assessing the covariance matrix of a Gaussian mixture component. Upon convergence of this procedure, a hyperellipsoid for a pre-specified confidence level 1 − α is constructed. All points contained in this ellipsoid are excluded from further consideration to avoid repetitive selections of the same cluster. Then, the next dense group is chosen, and the procedure continues until no more groups can be proposed. The authors mention that several mixture components can potentially be combined together, especially if they overlap considerably. Therefore, each mixture component proposed by this technique has to be further investigated by the emEM algorithm for possible splitting. The number of subcomponents needed can be conveniently detected by the Bayesian information criterion (BIC) [39]. The authors illustrated the superior performance of the proposed method in low dimensions. Even for a large number of clusters, when all other methods are not capable of finding the correct solution, Σ-EM shows excellent results. Unfortunately, the performance of the procedure degrades noticeably even for a moderate number of dimensions. This happens because the iterative procedure estimating covariance matrices has a tendency to include multiple components. Hence, in the extreme case of just one proposed component, Σ-EM reduces to the emEM algorithm.

A popular R package, mclust [40], uses a deterministic initialization relying on model-based agglomerative hierarchical clustering, following the ideas described by Fraley [41] and Fraley and Raftery [42]. This approach aims at finding an approximate maximum of the classification likelihood given by

$$L_{\mathrm{class}}(\vartheta_1,\ldots,\vartheta_K; z_1,\ldots,z_n \,|\, x_1,\ldots,x_n) = \prod_{i=1}^{n} f_{z_i}(x_i \,|\, \vartheta_{z_i}),$$

where each zi represents the label corresponding to the observation xi. Maitra [38] proposed a deterministic multi-stage approach based on detecting local modes. Maitra and Melnykov [43] and Melnykov and Melnykov [37] applied this initialization procedure in simulation studies. According to their findings, emEM, Rnd-EM, and Σ-EM generally outperform the deterministic approach of Maitra [38], even though the latter sometimes demonstrates impressive performance.

Overall, it is recommended to find several local maximizers based on various initialization strategies and to choose the nonsingular, nonspurious solution that yields the highest likelihood value.

Illustrative Example
Here, we provide a small example illustrating the topic just considered. A bivariate dataset consisting of 1500 observations was simulated from a 15-component Gaussian mixture by means of the R package MixSim [44]. As we can see from Figure 1(a), all clusters are well separated and easy to detect, at least by visual inspection. In Figure 1, ellipsoids represent 99% confidence regions corresponding to each mixture component. The initialization method Σ-EM identifies all clusters correctly (Figure 1(a)). The correct number of clusters was assumed to be known. The log-likelihood value ℓ associated with this solution is −10,643. Meanwhile, the emEM method (plot (b)), the hierarchical model-based clustering of mclust (plot (c)), and the EM algorithm initialized by k-means (plot (d)) perform worse: ℓ = −10,665, ℓ = −10,667, and ℓ = −10,837, respectively. As we can see from plots (b)–(d), the most populated cluster, located close to the center, was fit with two mixture components in all three cases, while two other clusters (right top corner in plots (b) and (c), and left bottom corner in plot (d)) were fit with just one component. This happened because the most populated cluster was chosen several times during the initialization stage while some other clusters were not selected at all. The overall number of short EM iterations and the number of k-means restarts were both chosen to be 1500. This example also illustrates the fact that unequal cluster sizes can contribute to the degree of clustering complexity.
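The dataset for this example can be generated along the following lines; this is a hedged sketch that assumes the MixSim()/simdataset() interface of the MixSim package, and the overlap level and random seed below are arbitrary choices rather than the exact settings behind Figure 1.

```r
# Sketch: simulate a bivariate dataset from a well-separated 15-component
# Gaussian mixture with the MixSim package (settings are arbitrary).
library(MixSim)
set.seed(1)
Q <- MixSim(BarOmega = 0.001, K = 15, p = 2)    # low average pairwise overlap
A <- simdataset(n = 1500, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
plot(A$X, col = A$id, pch = 19, xlab = "", ylab = "")
```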

FIGURE 1 | Performance of the EM algorithm based on different initialization procedures; ℓ represents the log-likelihood value reached by the EM algorithm. (a) Σ-EM, ℓ = −10,643; (b) emEM, ℓ = −10,665; (c) mclust, ℓ = −10,667; (d) k-means & EM, ℓ = −10,837.

Initialization in Massive Datasets
The direct application of model-based clustering is often restrictive or even impossible due to time and memory constraints [45,46]. Therefore, the matter of finding a good initialization strategy becomes even more important when datasets are large. Some approaches, e.g., initialization by hierarchical clustering, can be very time consuming and therefore prohibitive in cases with large n [42,47]. Clearly, stochastic seed-based algorithms such as Rnd-EM and emEM also suffer in this framework, as they assume multiple restarts. Σ-EM can be preferred over these methods, as it does not require multiple restarts. The most common practice, however, is to initialize the EM algorithm based on a small percentage of data points randomly selected from the data [36,42,45,46,48].

In other words, the initial parameters of the model are obtained from the sample, but the EM algorithm runs on the entire dataset. Of course, clusters that are not heavily represented can easily be omitted in these cases [49]. Fraley et al. [45] proposed a so-called incremental model-based clustering approach. After finding an initial mixture model based on a sample, the authors recommend focusing on those observations whose density values are among the lowest, since low densities may indicate a poor fit. Such data points are then extracted into an additional component and the EM algorithm starts over. This process should be continued as long as BIC keeps improving. Another technique was proposed by Fayyad and Smyth [50]. The authors find the initial model based on a random sample and identify all well-classified data points. Then, these observations are retained and the procedure is repeated until all observations are classified with certainty.
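In its simplest form, this sample-based strategy can be sketched as follows, reusing the illustrative em_gauss() and em_em() functions from above; the subsample fraction is an arbitrary choice.

```r
# Sample-based initialization for large n (sketch): estimate the mixture on a
# small random subsample, then run EM on the full dataset from those values.
subsample_init_em <- function(x, K, frac = 0.05) {
  xs   <- sample(x, size = max(10 * K, ceiling(frac * length(x))))
  init <- em_em(xs, K)    # any initialization strategy applied to the subsample
  em_gauss(x, K, init$tau, init$mu, init$sigma)
}
```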

Overall, the initialization problem does not have a satisfactory universal solution even for datasets with moderate sample sizes. In the case of massive datasets, it becomes especially challenging because of prohibitively expensive time and memory requirements.

CLUSTER DETECTION

An attractive one-to-one relationship between mixture components and data groups is one of the reasons why model-based clustering is so popular. In this setting, clustering and classification are intuitive and have a nice interpretation. Unfortunately, quite often the assumption that one mixture component distribution can model a particular data group is violated. More than one component may be needed to describe a cluster structure properly. This situation arises, for example, when modeling a skewed data group with a Gaussian mixture. However, even when a model is specified correctly but some components severely overlap, one can argue that perhaps there is a single cluster that just has several modes. Therefore, the problems of finding the best-fitting mixture model and of identifying the optimal partition are not always equivalent. The main approaches for detecting the optimal number of components K are considered in Ref 6. Here, we focus on finding the number of clusters and the corresponding partitioning. There are two main approaches to this issue in the literature. The first focuses on nonparametric methods that identify all local modes and associate them with clusters [51,52]. Then, the appealing one-to-one relation is retained, but now it is between modes and clusters. The other approach relies on merging mixture components into meaningful clusters. Under this approach, clusters are allowed to be multimodal. In this paper, we focus on the latter class of methods since they deal with parametric distributions in the form of traditional finite mixtures. A great source of information with many ideas on merging Gaussian mixture components is provided by Hennig [26].

Li [53] introduced a multilayer Gaussian mixture model which can be seen as a mixture of mixtures. The proposed method employs k-means and requires that the number of clusters is pre-specified. Recently, Baudry et al. [25] introduced a novel criterion designed for merging clusters in model-based clustering. Their criterion relies on the concept of fuzzy classification entropy defined by $E = -\sum_{i=1}^{n} \sum_{k=1}^{K} \pi_{ik} \log \pi_{ik}$. For two clusters Ck and Ck', the criterion is defined as follows:

$$\nu_{\pi_{C_k}, \pi_{C_{k'}}} = \sum_{i=1}^{n} \Bigl( (\pi_{iC_k} + \pi_{iC_{k'}}) \log(\pi_{iC_k} + \pi_{iC_{k'}}) - \pi_{iC_k} \log \pi_{iC_k} - \pi_{iC_{k'}} \log \pi_{iC_{k'}} \Bigr).$$

Here, $\pi_{C_k}$ and $\pi_{C_{k'}}$ represent the vectors of posterior probabilities associated with the kth and k'th clusters, respectively. When clusters are well separated, their corresponding posterior probabilities are close to 0 or 1. Then the criterion takes values close to 0, indicating that the clusters should not be merged. On the contrary, considerable uncertainty in the classification of observations is directly reflected in many posterior probabilities lying away from the boundaries of the [0, 1] interval. Therefore, high values of the proposed criterion imply the presence of considerable overlap between clusters Ck and Ck'. The authors propose starting with the best finite mixture model obtained with the use of BIC. Initially, it is assumed that each component adequately models a particular data group. Then, all mixture components are examined pairwise using the suggested criterion. The pair of mixture components producing the highest ν-value is merged; from then on, it is treated as a single cluster, and the posterior probabilities associated with the newly formed cluster are updated. The entire procedure can be repeated to yield a hierarchical structure illustrating the merging process. If the required number of clusters is not provided, the authors suggest using the integrated classification likelihood criterion (ICL) [54], defined as BIC + 2E. ICL aims at detecting the correct number of clusters rather than the number of mixture components targeted by BIC. The entropy-related term penalizes BIC for overlapping components. Thus, if all components are well separated, BIC and ICL generally agree on the same number of components and clusters. If the mixture model indicates the presence of considerable interaction among components, one can expect the proposed number of clusters to be lower than the number of mixture components. Baudry et al. [25] also recommended another approach to detecting the number of clusters, based on fitting a piecewise linear regression to the entropy plot. This approach, however, is more restrictive. It cannot be applied in cases with few mixture components. On the other hand, when the number of components is large, it is not clear how many change-points in a piecewise linear regression are needed.
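The criterion can be computed directly from a fitted matrix of posterior probabilities. The sketch below is our own illustration of the quantity defined above (not the authors' code); the matrix post is assumed to have one row per observation and one column per current cluster, for example the $z component returned by mclust's Mclust().

```r
# Entropy-based merging criterion (sketch): compute nu for every pair of
# clusters from an n x K matrix of posterior probabilities and report the
# pair whose merging removes the most classification uncertainty.
merge_criterion <- function(post) {
  xlogx <- function(p) ifelse(p > 0, p * log(p), 0)   # convention 0 * log 0 = 0
  K  <- ncol(post)
  nu <- matrix(NA, K, K)
  for (k in 1:(K - 1)) {
    for (kp in (k + 1):K) {
      nu[k, kp] <- sum(xlogx(post[, k] + post[, kp]) -
                       xlogx(post[, k]) - xlogx(post[, kp]))
    }
  }
  best <- which(nu == max(nu, na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(nu = nu, merge = as.numeric(best))   # pair of clusters to combine
}
```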

Another interesting idea, based on directly estimated misclassification probabilities, was proposed by Hennig [26]. The author employs the calculated posterior probabilities and merges those clusters that yield the highest misclassification probability. This procedure can be repeated, providing a tree for merging mixture components. Hennig [26] remarks that this method is asymptotically correct but somewhat optimistic for finite samples. Unfortunately, the asymptotic properties of the approach are not discussed in great detail.

VARIABLE SELECTION

The problem of variable selection is of special interest in multidimensional frameworks. Often, information important for clustering is contained in a limited set of variables. Then, the rest of the variables can be seen as uninformative and irrelevant for clustering. The exclusion of such irrelevant information can remarkably improve the performance of clustering algorithms. Figure 2 illustrates this idea. In plot (a), one variable is clearly informative and helpful for discrimination. The other variable is irrelevant, as it does not carry any clustering information; this variable should be excluded from the data. In plot (b), both variables are informative, as the correct detection of all three clusters is impossible if either of the two variables is excluded.

FIGURE 2 | Bivariate datasets with three distinct clusters: (a) one variable is informative and the other is irrelevant for clustering; (b) both variables carry clustering information.

We illustrate the practical importance of variable selection on two well-known classification datasets. Iris [55,56] contains 150 four-dimensional observations representing three different species of iris. Without variable selection, model-based clustering based on a Gaussian mixture model finds a solution with five misclassifications. However, if the variable Sepal Width is eliminated, 147 observations are labeled correctly, with just three misclassifications. Another popular dataset, Wine [57], includes 178 13-dimensional observations summarizing various characteristics of wines produced from three different cultivars. Without variable selection, the obtained solution contains 34 misclassified observations. At the same time, the number of misclassifications can be reduced to just seven if ten variables are excluded and only three variables (Flavanoids, Color Intensity, and Proline) are kept for the analysis. The results obtained for these two datasets are summarized in Table 1. The importance of variable selection can be seen even from these relatively low-dimensional examples. At the same time, there are numerous applications where one needs to handle hundreds or even thousands of variables, and success in clustering is not realistic without dimensionality reduction.
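For the Iris comparison, the computation can be sketched with the mclust package as follows; the exact misclassification counts depend on the covariance structure selected by mclust and may differ slightly across package versions.

```r
# Hedged sketch: Gaussian model-based clustering of Iris with and without the
# Sepal.Width variable, using mclust.
library(mclust)
full    <- Mclust(iris[, 1:4], G = 3)
reduced <- Mclust(iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")], G = 3)
table(iris$Species, full$classification)      # all four variables
table(iris$Species, reduced$classification)   # Sepal.Width removed
```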

Raftery and Dean [58] proposed splitting the entire dataset Y into three parts, Y(1), Y(2), and Y(3), such that Y(1) is built on the variables already included in the model, Y(2) contains the variables currently under consideration for inclusion, and Y(3) includes the rest of the variables. Under this formulation, two competing models, M1 and M2, are proposed as follows:

$$M_1: \quad p(Y^{(1)}, Y^{(2)}, Y^{(3)} \,|\, Z) = p(Y^{(3)} \,|\, Y^{(1)}, Y^{(2)})\, p(Y^{(2)} \,|\, Y^{(1)})\, p(Y^{(1)} \,|\, Z),$$
$$M_2: \quad p(Y^{(1)}, Y^{(2)}, Y^{(3)} \,|\, Z) = p(Y^{(3)} \,|\, Y^{(1)}, Y^{(2)})\, p(Y^{(1)}, Y^{(2)} \,|\, Z),$$

where Z stands for the unobserved membership vector. Thus, model M1 postulates that the second set of variables, Y(2), does not bring any new clustering information in addition to that already incorporated into the model through the variables in Y(1). Model M2 is the appropriate model when Y(2) still provides some new information about clustering after Y(1) has been observed. The authors compare models M1 and M2 using the Bayes factor

$$B_{12} = \frac{p(Y^{(2)} \,|\, Y^{(1)}, M_1)\, p(Y^{(1)} \,|\, M_1)}{p(Y^{(1)}, Y^{(2)} \,|\, M_2)},$$

which can be conveniently approximated by BIC. The term p(Y(3) | Y(1), Y(2)) is not involved in the expression for B12 as it cancels out. Recently, Maugis et al. [59,60] generalized the approach of Raftery and Dean [58] by introducing the additional assumption that some irrelevant variables can be independent of the variables relevant for clustering. The authors also propose partitioning variables into subsets that cannot be split any further. They claim that their generalized approach provides improved performance and interpretability.

TABLE 1 | Classification Tables for the Iris and Wine Datasets

                             Full Data        Reduced Data
Dataset   Group Name        1    2    3      1    2    3
Iris      Setosa           50    0    0     50    0    0
          Versicolor        0   45    5      0   47    3
          Virginica         0    0   50      0    0   50
Wine      Cultivar 1       55    0    4     59    0    0
          Cultivar 2        4   50   17      5   65    1
          Cultivar 3        0    9   39      0    1   47

Rows and columns correspond to the true and estimated partitions, respectively; the diagonal entries are the correct classification counts.

A conceptually different methodology was proposed by Pan and Shen [61], who considered a penalized version of the log-likelihood function. The underlying idea is to parameterize the kth cluster mean in the jth dimension as μkj = μj + δkj, where j = 1, ..., p and p is the number of variables. The penalty allows shrinking the cluster means toward the global mean. If the jth variable is not informative, δkj = 0 for all k = 1, ..., K, and such a variable can be eliminated. The penalized complete-data log-likelihood function is defined by

$$\log L_{c,p}(\vartheta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \bigl( \log \tau_k + \log f_k(x_i \,|\, \vartheta_k) \bigr) - h_\lambda(\vartheta),$$

where zik is equal to 1 if the ith observation belongs to the kth component and 0 otherwise, and hλ(ϑ) is the penalty function with tuning parameter λ. The authors recommend using the L1 penalty defined as $h_\lambda(\vartheta) = \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|$. Then, the EM algorithm based on the conditional expectation $E(\log L_{c,p} \,|\, X_1,\ldots,X_n)$, also called the penalized Q-function, can be employed. This approach was proposed specifically for situations where the number of observations is low but the number of dimensions is high, a framework typical, for instance, of DNA microarray analysis. Pan and Shen [61] consider a mixture of K Gaussian components. One limitation of the proposed approach is the assumption that all covariance matrices Σk are equal and diagonal, i.e., $\Sigma_k = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_p^2\}$ for all k = 1, ..., K. Of course, in the small n, large p problem, the number of parameters in the model cannot be high. Nevertheless, the model proposed by the authors is very restrictive and requires further validation. Wang and Zhu [62] used the same approach with more advanced penalty functions. In particular, the authors proposed the penalty $h_\lambda(\vartheta) = \lambda \sum_{j=1}^{p} \max(|\mu_{1j}|, \ldots, |\mu_{Kj}|)$. This penalty provides an improvement, as it uses the information that μkj and μk'j correspond to the same jth variable. The model considered by Wang and Zhu [62] also assumes homogeneous covariance matrices with diagonal structure. Guo et al. [63] proposed another penalty function, given by $h_\lambda(\vartheta) = \lambda \sum_{j=1}^{p} \sum_{1 \le k < k' \le K} |\mu_{kj} - \mu_{k'j}|$. Using the same model assumptions, the authors focus on pairwise variable selection.
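Written as functions of the K × p matrix of component means (with the data assumed centered, so that μkj plays the role of the deviation δkj) and the tuning parameter λ, the three penalties look as follows; this is our own compact restatement, not code from the cited papers.

```r
# The three penalty functions h_lambda discussed above, applied to a K x p
# matrix 'mu' of component means (data assumed centered at the global mean).
pen_L1 <- function(mu, lambda) {             # Pan and Shen: lasso on all means
  lambda * sum(abs(mu))
}
pen_max <- function(mu, lambda) {            # Wang and Zhu: variable-wise maximum
  lambda * sum(apply(abs(mu), 2, max))
}
pen_pairwise <- function(mu, lambda) {       # Guo et al.: pairwise differences
  K <- nrow(mu)
  s <- 0
  for (k in 1:(K - 1))
    for (kp in (k + 1):K) s <- s + sum(abs(mu[k, ] - mu[kp, ]))
  lambda * s
}
```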

NOISE, OUTLIERS, AND INFLUENTIAL OBSERVATIONS

Noise and Outliers
The presence of scattered observations can impose remarkable complexity on model-based clustering. Outliers or noise observations have a great impact on the estimation of mixture model parameters and, as a result, on the obtained partition. Therefore, it is often desirable to diagnose such problematic data points and treat them correspondingly. Often, they can be detected by means of graphical exploratory tools such as parallel coordinate plots [64] or low-dimensional projection displays such as biplots [65] and projection pursuit plots [66]. It is expected that problematic data points will be visually distinct and relatively easy to identify. Several methodological approaches have been proposed in the literature.

The most popular as well as the most appealing method, originally proposed by Banfield and Raftery [48], is to introduce a so-called noise mixture component which can naturally take care of noise observations by assigning them to this additional component. For this purpose, Banfield and Raftery [48] employed a uniform distribution defined over the dataset's domain. As pointed out by Hennig and Coretto [67], this method is not robust, and a mixture model defined this way becomes data-dependent. As a result, its maximum likelihood estimator is improper and therefore does not enjoy the usual asymptotic properties. The authors provide alternatives for modeling the noise part of the mixture and claim that they are more robust than the original approach.

FIGURE 3 | Model-based clustering with and without influential observations denoted as 1 and 2. Symbols represent the true assignment of data points simulated from a four-component mixture model; colors represent the obtained classification.

McLachlan and Peel [11] remarked that some heavy-tailed distributions, e.g., the t distribution, can help in modeling noise and outliers. Of course, a mixture of t [68] or skew-normal [69,70] distributions can provide a substantially better fit than a Gaussian mixture in this case. However, noise observations will be assigned to mixture components, while it is usually important to exclude them from the clustering solution. McLachlan and Peel [11] also provided a brief summary of some other methods related to outlier detection. Two alternative approaches here are the atypicality measure [71] and a modified likelihood ratio test [72].

Influential Observations
Jolliffe et al. [73] defined an influential observation in clustering as a data point whose removal leads to a different partitioning. Figure 3 illustrates the influence that such points can have on the obtained clustering solution. Different symbols represent the origins of data points simulated from a four-component Gaussian mixture model, and colors illustrate the proposed model-based clustering solution. All solutions were obtained with the model-based hierarchical initialization approach incorporated in the package mclust. In the first plot, two influential data points are labeled as 1 and 2, and the partitioning is obtained in the presence of these observations. The other two plots provide classifications obtained after one of the two points is removed. As we can see, the assignments in all three plots differ substantially. While the effect of both influential points is similar, the observations themselves are rather different. Point 1 is located between two clusters; there is high uncertainty associated with the classification of this point. Observation 2, however, is rather critical for defining the structure of its cluster; the posterior probabilities related to this point are not split among several components. Overall, influential observations in the model-based clustering framework are an open topic that requires more attention.

Spurious Local Maximizers
Another practical problem closely related to our discussion of influential observations is the existence of spurious likelihood function maximizers [11,74]. It is often the case that a solution proposed by the EM algorithm lacks real-life interpretability but yields a high likelihood value compared with those of other solutions. Such a spurious local maximizer can often be identified by studying the obtained parameter estimates: spurious solutions lie in a lower-dimensional subspace close to the boundary of the parameter space, and some clusters can include very few observations and have a low generalized variance. Figure 4 illustrates four different solutions obtained with the Rnd-EM initialization strategy for a dataset of size 250 simulated from a bivariate Gaussian mixture with five components. Symbols and colors represent the original and estimated classifications, respectively. Plots (a) and (b) present two nonspurious solutions with corresponding log-likelihood values ℓ = 25.69 and ℓ = 26.61. As we can see, both solutions identified all five cluster cores correctly. Plots (c) and (d), with ℓ = 27.14 and ℓ = 27.58, respectively, illustrate typical spurious solutions. Observations in both blue clusters are almost collinear and describe a random data pattern rather than a systematic cluster structure. Also, the blue cluster in plot (d) has a small size, which is typical for spurious maximizers. The log-likelihood values associated with the solutions in (c) and (d) are higher than those of the two nonspurious local maximizers in plots (a) and (b).

FIGURE 4 | Four solutions obtained by Rnd-EM: (a) and (b) are nonspurious, (c) and (d) are spurious. (a) ℓ = 25.69, (b) ℓ = 26.61, (c) ℓ = 27.14, (d) ℓ = 27.58.

As per the discussion of initialization strategies, a researcher needs to find a local maximizer that produces the highest likelihood value. In the meantime, potential spurious solutions have to be detected and excluded from consideration. This can be accomplished with close supervision of the obtained parameter values. In particular, one needs to pay attention to the presence of clusters with low representation and relatively small eigenvalues of covariance matrices. Recently, Seo and Kim [75] researched this topic in the framework of Gaussian mixtures and proposed a novel approach for finding nonspurious roots of the likelihood equation. The authors focused on the k-deleted log-likelihood. The main idea is to consider log-likelihood values adjusted by the exclusion of the k individual terms $\ell_i = \log f(x_i \,|\, \hat{\vartheta})$ with the highest contributions to the total log-likelihood $\ell = \sum_{i=1}^{n} \ell_i$. Such terms can be associated with spurious, singular, and other problematic roots; therefore, their exclusion allows assessing the model fit for the rest of the data. Alternatively, the authors suggest employing the score function to decide which ℓi terms should be excluded. The conducted simulation studies suggest that the proposed technique reduces the risk of choosing troublesome solutions.
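The k-deleted log-likelihood itself is easy to compute once the per-observation contributions ℓi are available; the sketch below is our own minimal illustration of the quantity described above, not the authors' root-selection procedure.

```r
# k-deleted log-likelihood (sketch): drop the k largest per-observation
# contributions l_i = log f(x_i | theta-hat) and sum the remaining ones.
k_deleted_loglik <- function(li, k = 1) {
  sum(sort(li, decreasing = TRUE)[-(1:k)])
}

# For the univariate em_gauss() sketch above, the contributions could be
# computed as
# li <- log(rowSums(sapply(1:K, function(j)
#          fit$tau[j] * dnorm(x, fit$mu[j], fit$sigma[j]))))
```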

SOME RELEVANT PROBLEMS AND APPLICATIONS

In this section, we briefly discuss several interesting problems and applications that highlight some other challenges in model-based clustering.

One popular application of finite mixtures and model-based clustering that has experienced fast development recently is the cluster analysis of time series data. This problem is of special interest since it provides the necessary framework for modeling dependent data such as observations made over time. Xiong and Yeung [76] considered model-based clustering by the EM algorithm, assuming that components and clusters are connected through a one-to-one relationship. Frühwirth-Schnatter and Kaufmann [77] investigated pooling time series into groups based on mixture models; the method of estimation employed by the authors was fully Bayesian, as it relied on MCMC. Chen and Maitra [78] considered the more general problem of model-based clustering of Gaussian regression time series. The authors assumed that the mean of the process for each group can be expressed as a linear function of a set of explanatory variables, and they considered the problem of grouping funds in order to form a portfolio. Unfortunately, they considered time series of relatively short lengths, as direct numerical optimization was used in the M-step to estimate the K covariance matrices. As was shown in Ref 79, this shortcoming can be fixed with the use of conditional maximum likelihood estimators, which have the same asymptotic distribution as the unconditional ones [80].

Melnykov et al. [81] employed MCMC methods in the study of two-dimensional gel electrophoresis. The data contained several gel replications with trivariate observations including protein molecular weight, isoelectric point, and color intensity. The authors aimed at accounting for spot-matching uncertainty in the observed gels. One problem in the analysis of such data is related to the fact that observations within gels are not independent, since only one spot within each gel can be associated with a specific protein. Then, the convenience of the E-step, usual for independent data, is lost, and some other technique has to be employed for calculating posterior probabilities. The authors constructed a random walk process to approximate the E-step. Thus, MCMC-based methods can be effectively used as a remedy in the case of dependent data.

One more interesting problem worth mentioning in the context of this paper is clustering data when, in addition to unknown labels, feature values are also unavailable for some observations. Handling such a situation requires applying a modified E-step, since the expectation needs to be calculated not only over the unknown labels but also over the missing features. This important situation, which often occurs in the analysis of real-life data, is thoroughly discussed and illustrated in Ref 82.

Of course, there are other interesting topics and applications that are not covered or only briefly mentioned in this paper. Among them are the small n, large p problem [83], semi-supervised model-based clustering [84,85], identifiability issues [86,87], and many others.

CONCLUSION

This paper has summarized the most important challenges in model-based clustering. Despite the fast development of methodology and tremendous interest in the topic, there are numerous pitfalls that require the close attention of a researcher. Proper initialization is absolutely necessary and can be extremely difficult in many problems. Unfortunately, this subject is often omitted in the literature as something of little importance. Another crucial issue is the detection of the number of clusters. Recent publications have emphasized the importance of distinguishing between the problem of finding the best mixture model and that of detecting the optimal cluster partitioning. The other topics discussed in this paper relate to the detection and elimination of problematic parts of the dataset, be they scatter and outliers or variables irrelevant for clustering. Several applications mentioned in the paper highlight some other specific challenges existing in model-based clustering.

REFERENCES

1. Pearson K. Contribution to the mathematical theory of evolution. Phil Trans Roy Soc 1894, 185:71–110.
2. Tan Y-P, Lu H. Model-based clustering and analysis of video scenes. In International Conference on Image Processing, Volume 1, 2002, 617–620.
3. Luksza M, Kluge B, Ostrowski J, Karczmarski J, Gambin A. Two-stage model-based clustering for liquid chromatography mass spectrometry data analysis. Stat Appl Genet Mol Biol 2009, 8: Article 15.
4. Melnykov V. Finite mixture modeling in mass spectrometry analysis. J Roy Stat Soc: Ser C. In press.
5. Hoijtink H, Notenboom A. Model based clustering of large data sets: tracing the development of spelling ability. Psychometrika 2004, 69:481–498.
6. Stahl D, Sallis H. Model-based cluster analysis. Wiley Interdiscip Rev: Comput Stat 2012, 4:341–358.
7. Farrell PJ, Saleh AKMdE, Zhang Z. Methods of moments estimation in finite mixtures. Sankhya 2011, 73-A:218–230.
8. Titterington D, Smith A, Makov U. Statistical Analysis of Finite Mixture Distributions. Chichester, UK: John Wiley & Sons; 1985.

9. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc: Ser B 1977, 39:1–38.
10. McLachlan G, Krishnan T. The EM Algorithm and Extensions. 2nd ed. New York: John Wiley & Sons; 2008.
11. McLachlan G, Peel D. Finite Mixture Models. New York: John Wiley & Sons; 2000.
12. Diebolt J, Robert C. Estimation of finite mixture distributions by Bayesian sampling. J Roy Stat Soc: Ser B 1994, 56:363–375.
13. Escobar MD, West M. Bayesian density estimation and inference using mixtures. J Am Stat Assoc 1995, 90:577–588.
14. Marin J-M, Mengersen K, Robert C. Bayesian modelling and inference on mixtures of distributions. In Rao C, Dey D, eds. Handbook of Statistics. Volume 25. New York: Springer-Verlag; 2005, 127–138.
15. Lau JW, Green PJ. Bayesian model-based clustering procedures. J Comput Graph Stat 2007, 16:526–558.
16. Jasra A, Holmes CC, Stephens DA. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Stat Sci 2005, 20:50–67.
17. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J Roy Stat Soc: Ser B 1997, 59:731–792.
18. Celeux G. Bayesian inference for mixture: the label switching problem. In Proceedings of Compstat 98, 1998, 227–232.
19. Kiefer J, Wolfowitz J. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann Math Stat 1956, 27:886–906.
20. Hathaway RJ. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Stat Probab Lett 1985, 4:53–56.
21. Kiefer NM. Discrete parameter variation: efficient estimation of a switching regression model. Econometrica 1978, 46:427–434.
22. Chen J, Tan X. Inference for multivariate normal mixtures. J Multivariate Anal 2009, 100:1367–1383.
23. Chen J, Tan X, Zhang R. Consistency of penalized MLE for normal mixtures in mean and variance. Stat Sin 2008, 18:443–465.
24. Melnykov V, Maitra R. Finite mixture models and model-based clustering. Stat Surv 2010, 4:80–116.
25. Baudry J-P, Raftery A, Celeux G, Lo K, Gottardo R. Combining mixture components for clustering. J Comput Graph Stat 2010, 19:332–353.
26. Hennig C. Methods for merging Gaussian mixture components. Adv Data Anal Classif 2010, 4:3–34.
27. Karlis D, Xekalaki E. Choosing initial values for the EM algorithm for finite mixtures. Comput Stat Data Anal 2003, 41:577–590.
28. MacQueen J. Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp 1967, 1:281–297.
29. Xu R, Wunsch DC. Clustering. Hoboken, NJ: John Wiley & Sons; 2009.
30. Kaufman L, Rousseeuw PJ. Finding Groups in Data. New York: John Wiley & Sons; 1990.
31. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 1963, 58:236–244.
32. Sokal R, Michener C. A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 1958, 38:1409–1438.
33. Sneath P. The application of computers to taxonomy. J Gen Microbiol 1957, 17:201–226.
34. Sorensen T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter 1948, 5:1–34.
35. Biernacki C, Celeux G, Govaert G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 2003, 41:561–575.
36. Scharl T, Grün B, Leisch F. Mixtures of regression models for time course gene expression data: evaluation of initialization and random effects. Bioinformatics 2010, 26:370–377.
37. Melnykov V, Melnykov I. Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 2012, 56:1381–1395.
38. Maitra R. Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf 2009, 6:144–157.
39. Schwarz G. Estimating the dimension of a model. Ann Stat 1978, 6:461–464.
40. Fraley C, Raftery AE. MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical Report 504, Department of Statistics, University of Washington, Seattle, WA, 2006.
41. Fraley C. Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 1998, 20:270–281.
42. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 2002, 97:611–631.
43. Maitra R, Melnykov V. Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 2010, 19:354–376.
44. Melnykov V, Chen W-C, Maitra R. MixSim: an R package for simulating data to study performance of clustering algorithms. J Stat Softw 2012, 51:1–25.
45. Fraley C, Raftery AE, Wehrens R. Incremental model-based clustering for large datasets with small clusters. J Comput Graph Stat 2005, 14:529–546.
46. Wehrens R, Buydens C, Fraley C, Raftery A. Model-based clustering for image segmentation and large datasets via sampling. J Classif 2004, 21:231–253.

47. Posse C. Hierarchical model-based clustering for large datasets. J Comput Graph Stat 2001, 10:464–486.
48. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49:803–821.
49. Maitra R. Clustering massive datasets with applications to software metrics and tomography. Technometrics 2001, 43:336–346.
50. Fayyad U, Smyth P. Cataloging and mining massive datasets for science data analysis. J Comput Graph Stat 1999, 8:589–610.
51. Li J, Ray S, Lindsay B. A nonparametric statistical approach to clustering via mode identification. J Mach Learn Res 2007, 8:1687–1723.
52. Stuetzle W, Nugent R. A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 2010, 19:397–418.
53. Li J. Clustering based on a multi-layer mixture model. J Comput Graph Stat 2005, 14:547–568.
54. Biernacki C, Celeux G, Govaert G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 2000, 22:719–725.
55. Anderson E. The irises of the Gaspé Peninsula. Bull Am Iris Soc 1935, 59:2–5.
56. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen 1936, 7:179–188.
57. Forina M, Leardi R, Armanino C, Lanteri S. PARVUS: an extendible package for data exploration, classification and correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 1991.
58. Raftery AE, Dean N. Variable selection for model-based clustering. J Am Stat Assoc 2006, 101:168–178.
59. Maugis C, Celeux G, Martin-Magniette M-L. Variable selection for clustering with Gaussian mixture models. Biometrics 2009, 65:701–709.
60. Maugis C, Celeux G, Martin-Magniette M-L. Variable selection in model-based clustering: a general variable role modeling. Comput Stat Data Anal 2009, 53:3872–3882.
61. Pan W, Shen X. Penalized model-based clustering with application to variable selection. J Mach Learn Res 2007, 8:1145–1164.
62. Wang S, Zhu J. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 2008, 64:440–448.
63. Guo J, Levina E, Michailidis G, Zhu J. Pairwise variable selection for high-dimensional model-based clustering. Biometrics 2010, 66:793–804.
64. Wegman E. Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc 1990, 85:664–675.
65. Gabriel KR. The biplot graphical display of matrices with application to principal component analysis. Biometrika 1971, 58:453–467.
66. Faith J, Mintram R, Angelova M. Targeted projection pursuit for visualising gene expression data classifications. Bioinformatics 2006, 22:2667–2673.
67. Hennig C, Coretto P. The noise component in model-based cluster analysis. In Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, eds. Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization. Berlin, Heidelberg: Springer; 2008, 127–138.
68. Peel D, McLachlan G. Robust mixture modeling using the t distribution. Stat Comput 2000, 10:339–348.
69. Basso R, Lachos V, Cabral C, Ghosh P. Robust mixture modeling based on scale mixtures of skew-normal distributions. Comput Stat Data Anal 2010, 54:2926–2941.
70. Cabral C, Lachos V, Prates M. Multivariate mixture modeling using skew-normal independent distributions. Comput Stat Data Anal 2012, 56:126–142.
71. McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker; 1988.
72. Wang SJ, Woodward WA, Gray HL, Wiechecki S, Satin SR. A new test for outlier detection from a multivariate mixture distribution. J Comput Graph Stat 1997, 6:285–299.
73. Jolliffe IT, Jones B, Morgan BJT. Identifying influential observations in hierarchical cluster analysis. J Appl Stat 1995, 22:61–80.
74. Day NE. Estimating the components of a mixture of two normal distributions. Biometrika 1969, 56:463–474.
75. Seo B, Kim D. Root selection in normal mixture models. Comput Stat Data Anal 2012, 56:2454–2470.
76. Xiong Y, Yeung D-Y. Time series clustering with ARMA mixtures. Pattern Recogn 2004, 37:1675–1689.
77. Frühwirth-Schnatter S, Kaufmann S. Model-based clustering of multiple time series. J Bus Econ Stat 2008, 26:78–89.
78. Chen W-C, Maitra R. Model-based clustering of regression time series data via APECM—an AECM algorithm sung to an even faster beat. Stat Anal Data Min 2011, 4:567–578.
79. Melnykov V. Efficient estimation in model-based clustering of Gaussian regression time series. Stat Anal Data Min 2012, 5:95–99.
80. Hamilton JD. Time Series Analysis. Princeton, NJ: Princeton University Press; 1994.
81. Melnykov V, Maitra R, Nettleton D. Accounting for spot matching uncertainty in the analysis of proteomics data from two-dimensional gel electrophoresis. Sankhya, Ser B 2011, 73:123–143.
82. Hunt L, Jorgensen M. Mixture model clustering for mixed data with missing information. Comput Stat Data Anal 2003, 41:429–440.

83. Städler N, Bühlmann P, van de Geer S. ℓ1-penalization for mixture regression models. TEST 2010, 19:209–285.
84. Basu S, Bilenko M, Mooney RJ. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, 59–68.
85. Zhong S. Semi-supervised model-based document clustering: a comparative study. Mach Learn 2006, 65:3–29.
86. Aitkin M, Rubin D. Estimation and hypothesis testing in finite mixture models. J Roy Stat Soc B 1985, 47:67–75.
87. Hennig C. Identifiability of models for clusterwise linear regression. J Classif 2000, 17:273–296.