IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 11, NOVEMBER 2007 5209

Music Analysis Using Hidden Markov Mixture Models

Yuting Qi, Student Member, IEEE, John William Paisley, and Lawrence Carin, Fellow, IEEE

Abstract—We develop a hidden Markov mixture model based on a Dirichlet process (DP) prior, for representation of the statistics of sequential data for which a single hidden Markov model (HMM) may not be sufficient. The DP prior has an intrinsic clustering property that encourages parameter sharing, and this naturally reveals the proper number of mixture components. The evaluation of posterior distributions for all model parameters is achieved in two ways: 1) via a rigorous Markov chain Monte Carlo method; and 2) approximately and efficiently via a variational Bayes formulation. Using DP HMM mixture models in a Bayesian setting, we propose a novel scheme for music analysis, highlighting the effectiveness of the DP HMM mixture model. Music is treated as a time-series data sequence and each music piece is represented as a mixture of HMMs. We approximate the similarity of two music pieces by computing the distance between the associated HMM mixtures. Experimental results are presented for synthesized sequential data and for classical music clips. Music similarities computed using DP HMM mixture modeling are compared to those computed from Gaussian mixture modeling, for which the mixture modeling is also performed using DP. The results show that the performance of DP HMM mixture modeling exceeds that of the DP Gaussian mixture modeling.

Index Terms—Dirichlet process, hidden Markov model (HMM) mixture, Markov chain Monte Carlo (MCMC), music, variational Bayes.

I. INTRODUCTION

WITH the advent of online music, a large quantity and variety of music is now highly accessible, spanning many eras and genres. However, this wealth of available music can also pose a problem for music listeners and researchers alike: how can the listener efficiently find new music he or she would like from a vast library, and how can this database be organized? This challenge has motivated researchers working in the field of music recognition to find ways of building robust and efficient classification, retrieval, browsing, and recommendation systems for listeners.

The use of statistical models has great potential for analyzing music, and for recognition of similarities and relationships between different musical pieces. Motivated by these goals, ideas from statistical machine learning have attracted growing interest in the music-analysis community. For example, in the work of Logan and Salomon [19], the sampled music signal is divided into overlapping frames and Mel-frequency cepstral coefficients (MFCCs) are computed as a feature vector for each frame. A K-means method is then applied to cluster frames in the MFCC feature space.

Manuscript received October 20, 2006; revised March 19, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Peter Handel.

The authors are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291 USA (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2007.898782

Aucouturier and Pachet [3] model the distribution of the MFCCs over all frames of an individual song with a Gaussian mixture model, and the distance between two pieces is evaluated based on their corresponding Gaussian mixture models (GMMs). Similar work can be found in [7] as well. Recently, support vector machines (SVMs) have been used in the context of genre classification [32], [33], where individual frames are classified based on short-time features, which then vote for the classification of the entire piece. In [21], SVMs are utilized for genre classification, with song-level features and a Kullback–Leibler (KL) divergence-based kernel employed to measure distances between songs.

A common disadvantage of the aforementioned methods is that either frame-level or song-level feature vectors are treated as i.i.d. samples from a distribution assumed to characterize the song (for frame-level feature vectors) or the genre (for song-level feature vectors), with no dynamic behavior of the music accounted for. However, "in order to understand and segment acoustic phenomena, the brain dynamically links a multitude of short events which cannot always be separated" [4]. For the brain to recognize and appreciate music, temporal cues are critical and contain information that should not be ignored. Therefore, there have been many attempts at modeling frame-to-frame dynamics in music [notably with hidden Markov models (HMMs)], and complex and multifaceted music analysis may benefit from considering dynamical information (see [1] for a review).

An HMM can accurately represent the statistics of sequential data, and such models have been exploited extensively in speech recognition [5], [22]; one may utilize HMMs to learn short-term transitions across a piece of music. For a piece of monophonic music, Raphael [23] modeled the overall music as an HMM and segmented the sampled music data into a sequence of contiguous regions that correspond to notes. In [4], a song is considered as a collection of distinct regions that present steady statistical properties, which are uncovered through HMM modeling. More recently, Sheh and Ellis [29] transcribe audio recordings into chord sequences for music indexing and retrieval via HMMs. Shao [28] and Scaringella [25] use HMMs for music genre classification. Building a single HMM for a song performs well when the music's "movement pattern," modeled as the Markov chain transition pattern, is relatively simple and thus the structure is of modest complexity (e.g., the number of states is small). However, most real music is a complicated signal, which has a quasi-uniform feature set and may have more than one "movement pattern" across the entire piece. One may in principle model the entire piece by a single HMM, with distinct segments of music characterized by an associated set of HMM states. In such a model there would be limited state transitions between states


associated with distinct segments of music. Semi-parametric techniques such as the iHMM [30] could be used to infer the appropriate number of HMM states. We alternatively consider an HMM mixture model, because the distinct mixture components allow analysis of the characteristics of different segments of a given piece. An important question centers around the proper number of mixture components, and this issue is addressed using ideas from modern Bayesian statistics.

Mixture models provide an effective means of density estimation. For example, GMMs have been widely used in modeling uncertain data distributions. However, few researchers have worked on mixtures of HMMs designed using Bayesian principles. The most closely related work can be found in [17] and [26], where an HMM mixture is learned not in a Bayesian setting, but via the EM algorithm with the number of mixture components preset. The work reported here develops an HMM mixture model in a Bayesian setting, using a nonparametric Dirichlet process (DP) as a common prior distribution on the parameters of the individual HMMs. Through Bayesian inference, we learn the posterior of the model parameters, which yields an ensemble of HMM mixture models rather than a point estimate of the model parameters. In contrast, the maximum-likelihood (ML) expectation-maximization (EM) algorithm [10] provides a point estimate of the model parameters, i.e., a single HMM mixture model.

Research on DP models can be traced to Ferguson [12]. A DP is characterized by a base distribution G_0 and a positive scalar "innovation" parameter α. The DP is a distribution on a distribution G; specifically, the distribution G is drawn from the DP, denoted G ~ DP(α, G_0). Ferguson proved that there is positive probability that a sample G drawn from a DP will be as close as desired to any probability function defined on the support of G_0. Therefore, the DP is rich enough to model parameters of individual components with arbitrarily high complexity, and flexible enough to fit them well without any assumptions about the functional form of the prior distribution.

In a DP formulation, we let {x_1, ..., x_N} represent the segments of data associated with a given piece, where each x_n represents a sequence of time-evolving features. Each x_n is assumed to be drawn from an HMM characterized by parameters Θ_n, n = 1, ..., N. All Θ_n are assumed to be drawn i.i.d. from the distribution G, i.e., Θ_n ~ G, where G is itself drawn from a DP, i.e., G ~ DP(α, G_0). As discussed further below, this hierarchical model naturally yields a clustering of the data, from which an HMM mixture model is constituted.

Importantly, the DP HMM mixture is essentially an ensemble of HMM mixtures, where each mixture may have a different number of components and different component parameters; therefore, the number of mixture components need not be set a priori. The same basic framework may be applied to virtually any data model, as shown by West [31], and for comparison we also consider a Gaussian mixture model.

Two schemes are considered to perform DP-based mixture modeling: Gibbs sampling [15] and variational Bayes [9]. The variational Bayesian inference algorithm avoids the expensive computation of Gibbs sampling while retaining much of the rigor of the Bayesian formulation. In this paper we focus on HMM mixture models based on discrete observations; we have also considered continuous-observation HMMs, which yielded very similar results to the discrete case, and therefore the case of continuous observations is omitted for brevity. We also note that the discrete HMMs are computationally much faster to learn than their continuous counterparts. Although music modeling is the motivation and specific application in this paper, our method is applicable to any discrete sequential data set containing multiple underlying patterns, for instance behavior modeling in video.

The remainder of the paper is organized as follows. The proposed HMM mixture model is described in Section II. Section III provides an introduction to the DP and its application to HMM mixture models. An MCMC-based sampling scheme and variational Bayes inference are developed in Section IV. Section V describes the application of HMM mixture models in music analysis. In Section VI, we present experimental results on synthetic data as well as for real music data. Section VII concludes the work and outlines future directions.

II. HIDDEN MARKOV MIXTURE MODEL

A. Hidden Markov Model

For a sequence of observations o = {o_1, ..., o_T}, an HMM assumes that the observation o_t at time t is generated by an underlying, unobservable discrete state s_t, and that the state sequence s = {s_1, ..., s_T} follows a first-order Markov process. In the discrete case considered here, o_t ∈ {1, ..., M} and s_t ∈ {1, ..., I}, where M is the alphabet size and I the number of states. Therefore, an HMM can be modeled as Θ = {A, B, π}, where A, B, and π are defined as follows:

• A = {a_ij}, a_ij = P(s_{t+1} = j | s_t = i): state transition probabilities;

• B = {b_i(m)}, b_i(m) = P(o_t = m | s_t = i): emission probabilities;

• π = {π_i}, π_i = P(s_1 = i): initial state distribution.

For given model parameters Θ, the joint probability of the observation sequence o and the underlying state sequence s is expressed as

p(o, s | Θ) = π_{s_1} b_{s_1}(o_1) ∏_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t).    (1)

The likelihood of the data o given the model parameters Θ results from a summation over all possible hidden state sequences:

p(o | Θ) = Σ_s p(o, s | Θ).    (2)

B. Hidden Markov Mixture Model

The hidden Markov mixture model with K mixture components may be written as

p(o | {w_k, Θ_k}) = Σ_{k=1}^{K} w_k p(o | Θ_k)    (3)

where p(o | Θ_k) represents the kth HMM component with associated parameters Θ_k, and w_k represents the mixing weight for the kth HMM, with Σ_{k=1}^{K} w_k = 1.
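To make (2) and (3) concrete, the following is a minimal illustrative sketch (not the authors' implementation) of the forward algorithm for a discrete HMM and the resulting mixture likelihood; the variable names and the toy parameter values are assumptions introduced only for this example. In practice one would work with scaled or log-domain quantities to avoid underflow on long sequences.

```python
import numpy as np

def hmm_likelihood(obs, A, B, pi):
    """Forward algorithm: p(o | Theta) for a discrete HMM, as in (2).

    obs : sequence of integer symbols in {0, ..., M-1}
    A   : (I, I) state transition matrix, rows sum to 1
    B   : (I, M) emission matrix, rows sum to 1
    pi  : (I,) initial state distribution
    """
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i b_i(o_1)
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]       # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return alpha.sum()                        # sum over final states

def mixture_likelihood(obs, weights, hmms):
    """K-component HMM mixture likelihood, as in (3)."""
    return sum(w * hmm_likelihood(obs, *theta) for w, theta in zip(weights, hmms))

# Toy example with assumed parameters: two 2-state HMMs over a 3-symbol alphabet.
hmm1 = (np.array([[0.9, 0.1], [0.1, 0.9]]),
        np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
        np.array([0.5, 0.5]))
hmm2 = (np.array([[0.2, 0.8], [0.8, 0.2]]),
        np.array([[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
        np.array([0.5, 0.5]))
obs = [0, 0, 1, 1, 0, 1]
print(mixture_likelihood(obs, [0.6, 0.4], [hmm1, hmm2]))
```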


Most HMM mixture models have been designed using an ML solution via the EM algorithm [10], for which a single mixture model is learned. An important question concerns the proper number of mixture components K, this constituting a problem of model selection. For this purpose researchers have used such measures as the minimum description length (MDL) [24]. In the work reported here, rather than a point estimate of an HMM mixture, we learn an ensemble of HMM mixtures. We assume a set of N sequences of data, {x_1, ..., x_N}. Each data sequence x_n is assumed to be drawn from an associated HMM with parameters Θ_n, i.e., x_n ~ HMM(Θ_n), where HMM(Θ_n) represents the HMM. The set of associated parameters {Θ_n} are drawn i.i.d. from a shared prior G, i.e., Θ_n ~ G. The distribution G is itself drawn from a distribution, in particular a DP. We discuss below that proper selection of the DP "innovation parameter" yields a framework by which the parameters {Θ_1, ..., Θ_N} are encouraged to cluster, and each such cluster corresponds to an HMM mixture component in (3). The algorithm automatically balances the DP-generated desire to cluster with the likelihood's desire to choose parameters that match the data well. The likelihood and the DP prior are balanced in the posterior density function for the parameters {Θ_n}, and the posterior distribution for {Θ_n} is learned to constitute an ensemble of HMM mixture models.

III. DIRICHLET PROCESS PRIOR

A. Dirichlet Process

The DP, denoted DP(α, G_0), is a random measure on measures and is parameterized by the "innovation parameter" α and a base distribution G_0. Assume we have N random variables θ_1, ..., θ_N distributed according to G, and G is itself a random measure drawn from a DP, where G_0 is the expectation of G:

θ_n | G ~ G,  n = 1, ..., N,    G ~ DP(α, G_0).    (4)

Defining θ_{-n} = {θ_1, ..., θ_{n-1}, θ_{n+1}, ..., θ_N} and integrating out G, the conditional distribution of θ_n given θ_{-n} follows a Pólya urn scheme and has the following form [8]:

p(θ_n | θ_{-n}) = (α / (N - 1 + α)) G_0 + (1 / (N - 1 + α)) Σ_{j ≠ n} δ_{θ_j}    (5)

where δ_θ denotes the distribution concentrated at the single point θ. Let {θ*_1, ..., θ*_J} be the distinct values taken by θ_{-n}, and let n_j be the number of values in θ_{-n} that equal θ*_j. We can rewrite (5) as

p(θ_n | θ_{-n}) = (α / (N - 1 + α)) G_0 + Σ_{j=1}^{J} (n_j / (N - 1 + α)) δ_{θ*_j}.    (6)

Equation (6) shows that when considering θ_n given all other observations θ_{-n}, this new sample is either drawn from the base distribution G_0 with probability α / (N - 1 + α), or is selected from the existing draws according to a multinomial allocation, with probabilities proportional to the existing group sizes n_j. This highlights the valuable sharing property of the DP: a new sample prefers to join a group with a large population, i.e., the more often a parameter is shared, the more likely it will be shared in the future.

The parameter α plays a balancing role between sampling a new parameter from the base distribution G_0 ("innovating") and sharing previously sampled parameters. A larger α yields more clusters; in the limit α → ∞, G → G_0. As α → 0, all θ_n are aggregated into a single cluster and take on the same value.
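The clustering behavior implied by (6) can be simulated directly. The sketch below (an illustration of the Pólya urn under the notation above, not code from the paper) draws cluster assignments sequentially, so the effect of the innovation parameter α on the number of clusters can be observed.

```python
import numpy as np

def polya_urn_clusters(N, alpha, rng=None):
    """Sequentially assign N samples to clusters following the Polya urn in (6).

    Each new sample joins an existing cluster with probability proportional to its
    size n_j, or starts a new cluster with probability proportional to alpha.
    Returns the list of cluster labels.
    """
    rng = np.random.default_rng(rng)
    labels, counts = [], []
    for n in range(N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= n + alpha                      # n existing draws plus alpha in the normalizer
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                    # "innovate": open a new cluster
            counts.append(1)
        else:                                   # join existing cluster k
            counts[k] += 1
        labels.append(k)
    return labels

for alpha in (0.1, 1.0, 10.0):
    n_clusters = [len(set(polya_urn_clusters(150, alpha, seed))) for seed in range(20)]
    print(f"alpha={alpha:5.1f}: average number of clusters = {np.mean(n_clusters):.1f}")
```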

Prediction of a future sample θ_{N+1} can be directly extended from (6):

p(θ_{N+1} | θ_1, ..., θ_N) = (α / (N + α)) G_0 + Σ_{j=1}^{J} (n_j / (N + α)) δ_{θ*_j}    (7)

where n_j is the number of θ_n that take the value θ*_j. It is proven in [31] that the posterior of G is still a DP, with the scaling parameter and the base distribution updated as follows:

G | θ_1, ..., θ_N ~ DP( α + N,  (α G_0 + Σ_{n=1}^{N} δ_{θ_n}) / (α + N) ).    (8)

The above DP representation highlights its sharing property, but without an explicit form for G. Sethuraman [27] provides an explicit characterization of G in terms of a stick-breaking construction. Consider two infinite collections of independent random variables v_k and θ*_k, k = 1, 2, ..., where v_k is drawn from a Beta distribution, denoted v_k ~ Beta(1, α), and θ*_k is drawn independently from the base distribution G_0. The stick-breaking representation of G is then defined as

G = Σ_{k=1}^{∞} w_k δ_{θ*_k}    (9)

with

w_k = v_k ∏_{l=1}^{k-1} (1 - v_l)    (10)

where ∏_{l=1}^{0} (1 - v_l) ≜ 1. This representation makes explicit that the random measure G is discrete with probability one, and that the support of G consists of an infinite set of atoms located at θ*_k, drawn independently from G_0. The mixing weight w_k for atom θ*_k is given by successively breaking a unit-length "stick" into an infinite number of pieces [27], with w_k ≥ 0 and Σ_{k=1}^{∞} w_k = 1.


The relationship between the stick-breaking representation and the Pólya urn scheme is interpreted as follows: if α is large, each v_k drawn from Beta(1, α) will be very small, which means we will tend to have many sticks of very short length. Consequently, G will consist of an infinite number of θ*_k with very small weights, and therefore G will approach G_0, the base distribution. For a small α, each v_k drawn from Beta(1, α) will be large, which will result in several large sticks with the remaining sticks very small. This leads to a clustering effect on the parameters θ_n, as G will only have a large mass on a small subset of {θ*_1, θ*_2, ...} (those corresponding to the large sticks w_k).
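A truncated version of the construction in (9)-(10), i.e., a finite number of sticks K as used later in the paper, can be sketched as follows. This is an illustration only: the function and variable names are mine, and G_0 is replaced here by a standard-normal base distribution purely so the sketch is self-contained.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng=None):
    """Draw a K-stick truncation of G ~ DP(alpha, G_0), as in (9)-(10).

    Returns (weights, atoms): weights w_k from the stick-breaking construction
    (the last stick absorbs the remaining mass so the weights sum to one) and
    atoms theta*_k drawn i.i.d. from an illustrative base distribution N(0, 1).
    """
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                   # truncation: use up the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * remaining                       # w_k = v_k * prod_{l<k} (1 - v_l)
    atoms = rng.standard_normal(K)                # theta*_k ~ G_0 (assumed N(0,1) here)
    return weights, atoms

for alpha in (0.5, 5.0, 50.0):
    w, _ = truncated_stick_breaking(alpha, K=50, rng=0)
    print(f"alpha={alpha:5.1f}: largest weight={w.max():.2f}, "
          f"sticks holding 95% of the mass={int((np.sort(w)[::-1].cumsum() < 0.95).sum()) + 1}")
```

Small α concentrates almost all mass on a few sticks (the clustering regime), while large α spreads the mass over many small sticks, matching the interpretation above.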

B. DP HMM Mixture Models

Given the observed data X = {x_1, ..., x_N}, each x_n is assumed to be drawn from its own HMM parameterized by Θ_n, with underlying state sequence s_n. The common prior G on all Θ_n is given by (9). Since G is discrete, different Θ_n may share the same value θ*_k, and take the value θ*_k with probability w_k. Introducing an indicator variable c_n and letting c_n = k indicate that Θ_n takes the value θ*_k, the hidden Markov mixture model with DP prior can be expressed as

x_n | c_n, {θ*_k} ~ HMM(θ*_{c_n}),   c_n | w ~ Mult(w),   θ*_k ~ G_0,   v_k ~ Beta(1, α)    (11)

where w = {w_k} is given by (10) and Mult(w) is the multinomial distribution with parameter w.

Assuming A, B, and π are independent of each other, the base distribution is represented as G_0 = G_0(A) G_0(B) G_0(π). For computational convenience (use of appropriate conjugate priors), G_0(A) is specified as a product of Dirichlet distributions

G_0(A) = ∏_{i=1}^{I} Dir(a_{i1}, ..., a_{iI} | u^A)    (12)

where u^A are parameters of the Dirichlet distribution. Similarly, for B and π, we have

G_0(B) = ∏_{i=1}^{I} Dir(b_i(1), ..., b_i(M) | u^B)    (13)

and

G_0(π) = Dir(π_1, ..., π_I | u^π)    (14)

where u^B and u^π are the corresponding Dirichlet parameters. As discussed in Section IV-A, the use of conjugate distributions yields analytic update equations for the MCMC sampler as well as for the variational Bayes algorithm.

It has been addressed in [31] that the innovation parameter α plays a critical role and will define the number of clusters inferred, with the appropriate α depending on the number of data points. Therefore, a prior distribution is placed on α and a posterior on α is learned from the data. We choose

α ~ Ga(γ_1, γ_2)    (15)

where Ga(γ_1, γ_2) is the Gamma distribution with selected parameters γ_1 and γ_2.

Fig. 1. Graphical representation of a DP mixture model in terms of the stick-breaking construction, where x_n are the observed sequences; the Gamma hyperparameters γ_1, γ_2 and the base distribution G_0 are preset, and the other parameters are hidden variables to be learned.

The corresponding graphical representation of the model is shown in Fig. 1. We note that the stick-breaking representation in (11) in principle uses an infinite number of sticks. It has been demonstrated [15] that in practice a finite set of sticks may be employed with minimal error and reduced computational complexity; in the work reported here an appropriate truncation level K, i.e., a finite number of sticks, is employed.

IV. INFERENCE FOR HMM MIXTURE MODELS WITH DP PRIOR

A. MCMC Inference

The posterior distribution of the model parameters is expressed as p(Φ | X), where Φ collects the component parameters {A_k, B_k, π_k}_{k=1}^{K}, the indicators {c_n}, the state sequences, the stick variables v, and the innovation parameter α; s_{n,k} is the hidden state sequence corresponding to x_n when assuming x_n is generated from the kth HMM, and c_n is the indicator variable defined in Section III-B. The truncation level K corresponds to the number of sticks used in the stick-breaking construction. The posterior can be approximated by MCMC methods based on Gibbs sampling, by iteratively drawing samples for each random variable from its full conditional posterior distribution given the most recent values of all other random variables. We follow Bayes' rule to derive the full conditional distributions for each random variable in Φ. These distributions are prerequisites for performing the Gibbs sampling of the posterior. In each iteration of the Gibbs sampling scheme, we sample a typical state sequence s_{n,k} for observation x_n. Given the HMM parameters and a particular state sequence s_{n,k}, the joint likelihood for x_n and s_{n,k} is obtained from (1).

Using A_{-k} to represent all the transition matrices except A_k, the conditional posterior for A_k is obtained as

(16)


where the Dirichlet parameters are updated by the transition counts from the sequences currently assigned to the kth component. Equation (16) indicates that the conditional posterior for A_k is still a product of Dirichlet distributions

(17)

Similarly, the conditional posterior for B_k is given as

(18)

where the Dirichlet parameters are updated by the corresponding emission counts, and the conditional posterior for π_k is

(19)

where the Dirichlet parameters are updated by the initial-state counts. The conditional posterior for a state sequence s_{n,k} is given as

(20)

Equation (20) shows that a state sequence will be sampled for each x_n under each candidate HMM component.

West proved in [11] that the conditional posterior of α is

(21)

When given the prior α ~ Ga(γ_1, γ_2), drawing a sample from the above conditional posterior can be realized in two steps: 1) sample an intermediate variable η from a Beta distribution given the most recent value of α; 2) sample a new value of α from

(22)

where the mixing proportion in (22) is defined in terms of η and N. The detailed derivation of this sampling scheme is found in [11].

The prior for c_n in (11) can be rewritten in terms of the stick-breaking variables v. Thus, the conditional posterior for c_n is

(23)

where the normalization is over the K components.

The conditional posterior for the stick-breaking variables v is

(24)

where N_k is the number of sequences x_n associated with the kth component, i.e., the number of c_n = k.

A Gibbs sampling scheme called the blocked Gibbs sampler [15] is adopted to implement the MCMC method. In the blocked Gibbs sampler, the last stick variable v_K is set to 1. Let K* denote the set of current unique values of {c_n}, i.e., the set of indices of those components that contain at least one sequence. In each iteration of the Gibbs sampling scheme, the values are drawn in the following order.

1) Draw samples for the HMM mixture component parameters A_k, B_k, π_k, and state sequences, given a particular HMM component k. For k ∉ K*, draw A_k, B_k, and π_k from (12)–(14), respectively; i.e., for those mixture components that are currently not occupied by any data, new samples are generated from the base distribution. A new state sequence for x_n is generated by the current parameters according to a Markov process, which follows the same procedure as in (20), but without the emission probabilities. For k ∈ K*, draw A_k, B_k, and π_k from (17)–(19), respectively; i.e., for those mixture components that have at least one sequence, component parameters are drawn from their conditional posteriors. A state sequence s_{n,k} is generated from (20) given the current component parameters.

2) Draw the membership c_n for each data sequence according to the density given by (23).

3) Draw the mixing weights (via the stick variables v) from the posterior in (24).
4) Draw the innovation parameter α from (21).

The above steps proceed iteratively, with each variable drawn either from the base distribution or from its conditional posterior given the most recent values of all other samples. According to MCMC theory, after the Markov chain reaches its stationary distribution, these samples can be viewed as random draws from the full posterior p(Φ | X), and thus, in practice, the posteriors can be constructed by collecting a sufficient number of samples after the above iteration stabilizes (the change of each variable becomes stable) [13].
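As a concrete illustration of steps 2) and 3) above, the following minimal sketch (mine, written for the truncated stick-breaking model described earlier, not the authors' code) resamples the memberships c_n given per-component log-likelihoods and then resamples the stick variables from their conjugate Beta posteriors, which is presumably what (24) expresses. The log-likelihood matrix is assumed to have been computed beforehand with the forward algorithm for each sequence under each component.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, alpha = 12, 5, 1.0

# Assumed input: loglik[n, k] = log p(x_n | theta*_k), e.g., from the forward algorithm.
loglik = rng.normal(size=(N, K))
v = rng.beta(1.0, alpha, size=K); v[-1] = 1.0          # current stick variables, v_K = 1
log_w = np.log(v) + np.concatenate(([0.0], np.cumsum(np.log1p(-v[:-1]))))  # log w_k, eq. (10)

# Step 2): draw memberships c_n with probability proportional to w_k * p(x_n | theta*_k).
logits = log_w + loglik                                 # (N, K)
logits -= logits.max(axis=1, keepdims=True)             # stabilize before exponentiating
probs = np.exp(logits); probs /= probs.sum(axis=1, keepdims=True)
c = np.array([rng.choice(K, p=p) for p in probs])

# Step 3): conjugate Beta update of the stick variables given the counts N_k.
counts = np.bincount(c, minlength=K)
tail = counts[::-1].cumsum()[::-1]                      # number of sequences assigned beyond k
v_new = rng.beta(1.0 + counts[:-1], alpha + tail[1:])   # v_k | c ~ Beta(1 + N_k, alpha + sum_{l>k} N_l)
v_new = np.append(v_new, 1.0)
print("memberships:", c, "\noccupied components:", np.nonzero(counts)[0])
```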

An approximate predictive distribution can be obtained by averaging the predictive distributions across these collected samples:

p(o | X) ≈ (1/L) Σ_{l=1}^{L} ( Σ_{k=1}^{K} w_k^{(l)} p(o | θ*_k^{(l)}) )    (25)

where L is the number of collected samples and w_k^{(l)} and θ*_k^{(l)} are the lth collected sample values of w_k and θ*_k.


It is interesting to compare the traditional HMM mixture model in (3) to the DP result in (25). In (3) the data are characterized by an HMM mixture model in which each mixture component has a fixed set of parameters. In this model we must set the number of mixture components K and must learn the associated parameters. Equation (25) makes it explicit that the DP HMM mixture model is essentially an ensemble of the HMM mixtures of (3). Each mixture, the term in the parentheses in (25), is sampled from the Gibbs sampler. Because of the clustering properties of the DP, most components of each such mixture will have near-zero probability of being used, but the number of utilized mixture components for each mixture is different (and less than K).

B. Variational Inference

As stated above, in MCMC the posterior is estimated from collected samples. Although this yields a theoretically accurate result (with sufficient samples) [13], it often requires vast computational resources, and the convergence of the algorithm is often difficult to diagnose. Variational Bayes inference is introduced as an alternative method for approximating likelihoods and posteriors.

From Bayes’ rule, we have

(26)

where are hidden variables of in-terest and are hyper-parameterswhich determine the distribution of the model parameters. Since

is a function of [see (10)], estimating the posterior of isequivalent to estimating the posterior of . The integration inthe denominator of (26), called the marginal likelihood, or “ev-idence”, is generally intractable analytically except for simplecases and thus estimating the posterior cannot beachieved analytically. Instead of directly estimating ,variational methods seek a distribution to approximate thetrue posterior distribution . Consider the log marginallikelihood

(27)

where

(28)

and

(29)

is the KL divergence between the ap-proximate and true posterior. The approximation of the true pos-terior using can be achieved by minimizing

. Since the KL divergence is nonnega-tive, from (27) this minimization is equivalent to maximizationof , which forms a strict lower bound on

(30)
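For completeness, the bound in (30) also follows in one step from Jensen's inequality (written here in the generic notation Φ, Φ_0 used above):

```latex
\log p(X \mid \Phi_0)
  \;=\; \log \int q(\Phi)\,\frac{p(X,\Phi \mid \Phi_0)}{q(\Phi)}\, d\Phi
  \;\ge\; \int q(\Phi)\,\log \frac{p(X,\Phi \mid \Phi_0)}{q(\Phi)}\, d\Phi
  \;=\; \mathcal{L}(q),
```

with equality if and only if q(Φ) = p(Φ | X, Φ_0).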

For computational convenience, q(Φ) is expressed in a factorized form, with the same functional form as the priors and with each parameter represented by its own conjugate distribution. For the HMM mixture model proposed in this paper, we assume

(31)

where q(A_k), q(B_k), and q(π_k) have the same form as in (12), (13), and (14), respectively, but with different parameters. Once we learn the parameters of these variational distributions from the data, we obtain the approximation of p(Φ | X, Φ_0) by q(Φ). The joint distribution of Φ and the observations X is given as

(32)

where the priors on A_k, B_k, π_k, and α are given in (12), (13), (14), and (15), respectively, and the prior on v follows the Beta(1, α) stick-breaking construction of (9) and (10). All parameters in these prior distributions are assumed to be set.

We substitute (31) and (32) into (28) to yield

(33)


The optimization of the lower bound L(q) in (33) is realized by taking functional derivatives with respect to each of the q(·) distributions while fixing the others, and setting the derivative to zero to find the distribution that increases L(q) [6]. The resulting update equations for the variational posteriors retain the conjugate functional forms of the corresponding priors (Dirichlet distributions for the HMM parameters, Beta distributions for the stick variables, a multinomial distribution for the indicators, and a Gamma distribution for α); the explicit parameter updates, including the quantities given in (56)–(58) and (61), are derived in the Appendix.

A local maximum of the lower bound L(q) is achieved by iteratively updating the parameters of the variational distributions according to these equations. Each iteration is guaranteed to either increase the lower bound or leave it unchanged. We terminate the algorithm when the change in L(q) is negligibly small. L(q) can be computed by substituting the updated q(Φ) and the prior distributions into (33).

The prediction for a new observation sequence is

(34)

Since the true posteriors are unknown, we use the variational posterior q(Φ) from the VB optimization to approximate the posterior in (34). However, the above quantity is still intractable, because the states and the model parameters are coupled. There are several possible methods for approximating the predictive probability. One such method is to sample parameters from the posterior distribution and construct a Monte Carlo estimate, but this approach is not efficient. An alternative is to construct a lower bound on the approximation of the predictive quantity in (34) [6]. Another way, suggested in [20], assumes that the states and the model parameters are independent, so that the model can be evaluated at the mean (or mode) of the variational posterior. This approach makes the prediction tractable and so is used in the following experiments. Equation (34) can thus be approximated as

(35)

where the HMM parameters in (35) are set to the means of the corresponding variational posteriors. Equation (35) can be evaluated efficiently using the forward–backward algorithm.

C. Truncation Level

In the DP prior without truncation, G is a probability mass function with an infinite number of atoms. In practice, a mixture model estimated from N observed data sequences usually cannot have more than N mixture components (in the limit, each data sequence is drawn from its own mixture component). So we truncate G at a finite level K, which saves computational resources.

Here we need to make clear the relationship between the truncation level K and the utilized number of mixture components. A given prior G consists of K distinct values θ*_k with associated probabilities w_k. However, the parameters {Θ_n} may take only a subset of {θ*_1, ..., θ*_K}, which means the true utilized number of mixture components may be less than K (and the clustering properties of the DP almost always yield fewer than K mixture components, unless α is very large). Furthermore, since G is itself random, drawn from a DP, the atoms {θ*_k} will have different values rather than a set of fixed values. Consequently, the utilized number of mixture components varies with different draws of G but will typically be less than K. However, in the DP formulation, the unoccupied HMMs are still included as part of the mixture with very small mixing weights w_k, and continue to draw from the base distribution G_0, with the potential of being occupied at some point. Therefore, from this point forward we will refer to a K-component DP HMM mixture, essentially an ensemble of K-component mixture models, though the actual number of occupied HMMs will likely be less than K.

V. MUSIC ANALYSIS VIA HMM MIXTURE MODELS

Considering music to be a set of concurrently played notes (each note defining a location in feature space) and note transitions (time-evolving features), music can be represented as a time series, and thus modeled by an HMM. As music often follows a deliberate structure, the underlying hidden mechanism of that music should not be viewed as homogeneous, but rather as originating from a mixture of HMMs. We are interested in modeling music in this manner to ultimately develop a similarity metric to aid in music browsing and query. Our experiments are confined


to classical music with a highly overlapping observation space,but multiple “motion patterns.”

We sample the music clips at 22 kHz and divide each clip into 25-ms nonoverlapping frames. We extract 10-D MFCC features for each frame using downloaded software,1 with each vector an observation in feature space. We then quantize the features into discrete symbols using vector quantization (VQ), in which the codebook is learned on the whole set of tested music [18]. For our experiments, we use sequences of 1 s, or 40 observations, without overlap; this transforms the music into a collection of sequences, with each sequence assumed to originate from an HMM.
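The feature pipeline just described can be sketched as follows. This is an illustration using librosa and scikit-learn rather than the MATLAB rastamat code the authors reference; the frame length, number of MFCCs, and sequence length follow the text, while the file name and codebook size are placeholders/assumptions.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

sr = 22050
frame = int(0.025 * sr)                      # 25-ms nonoverlapping frames (~551 samples)

y, _ = librosa.load("clip.wav", sr=sr)       # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10,
                            n_fft=frame, hop_length=frame)   # 10-D MFCC per frame
frames = mfcc.T                              # (num_frames, 10)

# Vector quantization: learn a discrete codebook on the frames (here a single clip;
# the paper learns the codebook over the whole set of tested music).
M = 16                                       # assumed alphabet size, for illustration only
codebook = KMeans(n_clusters=M, n_init=10, random_state=0).fit(frames)
symbols = codebook.predict(frames)           # one discrete symbol per frame

# Group into 1-s sequences of 40 symbols, each treated as one HMM observation sequence.
T = 40
num_seq = len(symbols) // T
sequences = symbols[:num_seq * T].reshape(num_seq, T)
print(sequences.shape)
```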

A. Music Similarity Measure

The proposed method for measuring music similarity computes the distance between the respective HMM mixture models. Similar to the work in [2], we use Monte Carlo sampling to compare two HMM mixture models. Let M_A be the learned HMM mixture model for music A, and M_B for music B. We draw a sample set S_A of size N_A from M_A and a sample set S_B of size N_B from M_B. The distance between any two HMM mixture models is defined as

D(M_A, M_B) = log p(S_A | M_A) − log p(S_A | M_B) + log p(S_B | M_B) − log p(S_B | M_A)    (36)

where the first two terms on the right-hand side of (36) are a measure of how well model M_B matches observations generated by model M_A, relative to how well M_A matches the observations generated by itself. The last two terms are in the same spirit and make the distance D(M_A, M_B) symmetric. Equation (36) can be rewritten in terms of the individual samples as

D(M_A, M_B) = Σ_{i=1}^{N_A} [ log p(x_i^A | M_A) − log p(x_i^A | M_B) ] + Σ_{j=1}^{N_B} [ log p(x_j^B | M_B) − log p(x_j^B | M_A) ]    (37)

where x_i^A is the ith sample in S_A, x_j^B is the jth sample in S_B, and the log-likelihood for each sample given an HMM mixture model can be obtained from (25) in the MCMC implementation or from (35) in the VB approach. The similarity of music A and music B is defined by a kernel function as

Sim(A, B) = exp( − D(M_A, M_B) / λ )    (38)

where λ is a fixed parameter; we note that the choice of λ will not change the ordering of the similarities. This approach is well suited

1http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/

to large music databases, as it only requires the storage of theHMM mixture parameters for each piece rather than the originalmusic data itself.
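A minimal sketch of the distance and similarity computation in (36)–(38), under the reconstruction given above (so the exact scaling may differ from the paper's): it assumes a log_likelihood(x, model) function, for example the mixture forward-algorithm likelihood sketched in Section II, and sample sets already drawn from the two learned mixtures.

```python
import numpy as np

def mixture_distance(samples_a, samples_b, model_a, model_b, log_likelihood):
    """Symmetric Monte Carlo distance between two HMM mixture models, as in (36)-(37).

    samples_a, samples_b : lists of observation sequences drawn from model_a / model_b
    log_likelihood(x, m) : returns log p(x | m) for a sequence under a mixture model
    """
    d = 0.0
    for x in samples_a:                      # how much worse model_b explains model_a's samples
        d += log_likelihood(x, model_a) - log_likelihood(x, model_b)
    for x in samples_b:                      # symmetric term
        d += log_likelihood(x, model_b) - log_likelihood(x, model_a)
    return d

def similarity(distance, lam=1.0):
    """Kernel in the spirit of (38); lam rescales but does not reorder similarities."""
    return np.exp(-distance / lam)

# Toy usage with a dummy log-likelihood standing in for the forward-algorithm version.
dummy_ll = lambda x, m: -0.5 * len(x) if m == "A" else -0.7 * len(x)
print(similarity(mixture_distance([[0] * 40] * 3, [[1] * 40] * 3, "A", "B", dummy_ll)))
```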

B. DP GMM Modeling

For comparison, we also model each piece of music as a DP GMM, where the 10-D MFCC feature vector of a frame corresponds to one data point in the feature space. In the same spirit as the DP HMM mixture model, each datum in the DP GMM is assumed drawn from a Gaussian, and a DP prior is placed on the mean and precision (inverse of variance) parameters of all Gaussians, encouraging sharing of these parameters. An MCMC solution and a variational approach for the DP GMM can be found in [31] and [9], respectively. The posterior on the number of mixtures used in the DP GMM is learned automatically from the algorithm (as for the DP-based HMM mixture model discussed above); however, the dynamic (time-evolving) information between observations is discarded. The music similarity under DP GMM modeling is defined similarly to (36), with the log-likelihood given by the DP GMM. In our experiments the DP GMM is trained via the VB method for computational efficiency.

VI. EXPERIMENTS

The effectiveness of the proposed method is demonstrated with both synthetic data and real music data. Our synthetic problem, for which ground truth is known, exhibits how a data set containing data with different hidden mechanisms can be characterized by an HMM mixture. We then explore music similarity within the classical genre with three experiments.

For the DP HMM mixture modeling in each experiment, a Gamma prior is placed on the scaling parameter α, which slightly favors the data over the base distribution when enough data are collected to learn the mixture model. To avoid overfitting problems in the VB approach, we choose a reasonably large truncation level K, but smaller than the number of sequences. In the MCMC implementation, we set uniform, or noninformative, priors for A, B, and π, i.e., the Dirichlet parameters u^A, u^B, and u^π are set to unit vectors. In the VB algorithm, the corresponding hyperparameters are set so as to ensure good convergence [6].

For the DP GMM analysis in our music experiments, we also employ a Gamma prior on the scaling parameter α. The base distribution for the model parameters (mean μ and precision matrix Λ) is a Normal–Wishart distribution, whose mean and scale matrix are set to the mean and inverse covariance matrix of all observed data, respectively, and whose degrees-of-freedom parameter equals the feature dimension.

A. Synthetic Data

The synthetic data are generated from three distinct HMMs, each of which generates 50 sequences of length 30 (i.e., the mixing weights for the three HMM components are each 1/3). The alphabet size of the discrete observations is 3 for each HMM,


and the number of states is 2, 2, and 3, respectively. The parameters for these three HMMs are set as follows.

In this configuration, HMM 1 tends to generate the first two symbols with a high probability of maintaining the previous symbol, HMM 2 primarily generates the last two symbols and has a high tendency to switch between them, and HMM 3 produces all three observations with equal probability. We apply both the MCMC and VB implementations to model the synthetic data as an HMM mixture. In our algorithm, we assume that each HMM mixture component has the same number of states, which we set to I. Note that if a given mixture component has fewer than I states, the I-state model may be used with a very small probability of transitioning to un-needed states, i.e., for a superfluous state, the corresponding row and column in the state transition matrix will be close to zero. This is achieved through the Dirichlet priors placed on each row of A, which promote sparseness. Therefore, I is set to a relatively large value, which can be estimated in principle by choosing the upper bound of the model evidence: the log-likelihood in the MCMC implementation and the log-marginal likelihood (the lower bound) in the VB algorithm. For brevity, we neglect this step and set I empirically. We have observed that, in general, setting different values of I does not influence the final results in our experiments, as long as I is sufficiently large.
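Because the original parameter matrices were lost in this copy, the sketch below generates comparable synthetic data from three assumed discrete HMMs that match the qualitative description above (HMM 1 persists on the first two symbols, HMM 2 switches between the last two, HMM 3 is uniform); the specific matrix values are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hmm(A, B, pi, T):
    """Draw one length-T observation sequence from a discrete HMM (A, B, pi)."""
    s = rng.choice(len(pi), p=pi)
    obs = []
    for _ in range(T):
        obs.append(rng.choice(B.shape[1], p=B[s]))
        s = rng.choice(A.shape[1], p=A[s])
    return obs

# Assumed parameters consistent with the description in the text (symbols are 0, 1, 2).
hmm1 = (np.array([[0.9, 0.1], [0.1, 0.9]]),                 # 2 states, persistent
        np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]),       # emits mostly symbols 0-1
        np.array([0.5, 0.5]))
hmm2 = (np.array([[0.1, 0.9], [0.9, 0.1]]),                 # 2 states, switching
        np.array([[0.0, 0.9, 0.1], [0.0, 0.1, 0.9]]),       # emits mostly symbols 1-2
        np.array([0.5, 0.5]))
hmm3 = (np.full((3, 3), 1 / 3),                             # 3 states, uniform
        np.full((3, 3), 1 / 3),
        np.full(3, 1 / 3))

# 50 sequences of length 30 from each HMM (equal 1/3 mixing weights).
data = [sample_hmm(*h, T=30) for h in (hmm1, hmm2, hmm3) for _ in range(50)]
print(len(data), len(data[0]))
```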

Fig. 2 shows the clustering results of modeling the synthetic data as an HMM mixture, where (a)–(c) employ MCMC and (d)–(f) employ VB. We set the truncation level in the DP prior to K = 50 and fix the number of states for each component. Fig. 2(a) and (d) represent the estimated distribution of the indicators, i.e., the probability that each sequence belongs to a given HMM component, which is computed by averaging the 200 collected

Fig. 2. DP HMM mixture modeling for synthetic data. (a) The averaged indicator matrix obtained from MCMC, where the element at (i, j) is the probability that the jth sequence is generated from the ith mixture component. (b) The membership of each sequence from MCMC, which is the maximum value along the columns of the averaged indicator matrix in (a). (c) Mixing weights of the DP HMM mixture with K = 50 computed by MCMC, with the error bars representing the standard deviation of each weight. (d) The variational posteriors of the indicator matrix in VB. (e) The sequence membership from VB, i.e., the maximum along the columns of (d). (f) Mixing weights of the DP HMM mixture with K = 50 computed by VB.

samples (spaced evenly, 25 apart, in the burn-out iterations) in MCMC, and by the variational posteriors of the indicator matrix in VB. The memberships in Fig. 2(b) and (e) are obtained by setting the membership of a sequence to the component for which it has the highest probability. Fig. 2(c) and (f) show the mixing weights of the HMM mixture model computed from the MCMC and VB algorithms, respectively, with the mean and standard deviation estimated from 500 collected samples in MCMC, and in VB derived readily from (10) given the variational posteriors of the stick variables.

The results show that although the true number of HMM mixture components (here 3) is unknown a priori, both algorithms automatically reduce the superfluous components and give an accurate approximation of the true component number (3 dominant mixture components in MCMC and 4 in VB). Fig. 2(a) and (d) show that sequences generated from the same HMM have a high probability of belonging to the same HMM mixture component. Fig. 2(b) and (e) clearly indicate that the synthetic data can be clustered into three major groups, which matches the ground truth. In comparison, MCMC slightly outperforms VB; MCMC yields a mixture of 3 HMMs to VB's 4, and fewer data sequences are indexed incorrectly. However, the computation of MCMC


Fig. 3. Hinton diagram for the similarity matrix for four violin music clips. (a) Similarity matrix learned by DP HMM mixture models. (b) Similarity matrix learned by DP GMMs.

is expensive; it requires roughly 4 h of CPU time in Matlab on a Pentium IV PC with a 1.73-GHz CPU to compute the results (5000 burn-in and 5000 burn-out iterations), while VB requires less than 2 min. Considering the high efficiency and acceptable performance of the VB algorithm, we adopt it for the following application.

It is important to note that the synthetic data used have overlapping observation spaces, {1,2}, {2,3}, and {1,2,3}, and are generated by HMMs with more than one unique state number. Despite this, the algorithm correctly clusters the data into three major HMMs, and setting one comprehensive state number worked well in this problem.

B. Music Data

For our first experiment, we choose four 3-min violin concerto clips from two different composers. We chose to model the first few minutes rather than the whole of a particular piece in our experiments because we felt that in this way we could better control the ground truth for our initial application of these methods. Clips 1 and 2 are from Bach and are considered similar; clips 3 and 4 are from Stravinsky and are also considered similar. The clips of one composer are considered different from those of the other. All four music clips are played using essentially the same instruments, but their styles vary, which would indicate a high overlap in feature space but significantly different movement. We divided each clip into 180 sequences of length 40 and quantized the MFCC feature vectors into discrete symbols with the VQ algorithm. An HMM mixture model is then built for each clip, with the mixture component number, or truncation level, set to 50 and a fixed number of states per component; the truncation level of the DP GMM was set to 50 as well. Fig. 3 shows the computed similarity between each pair of clips for both HMM mixture and GMM modeling using a Hinton diagram [14], in which the size of a block is inversely proportional to the value of the corresponding matrix element. To compute the distances, we use 100 sequences of length 50 drawn from each HMM mixture and 200 samples drawn from each GMM. It is difficult to compare the similarity matrices directly, since the distance metrics are not on the same scale, but as our goal is to suggest music by similarity, we may focus on the relative similarities within each matrix. Our modeling by HMM mixtures produces results

Fig. 4. 2-D PCA features for four violin music clips considered in Fig. 3.

that fit with our intuition.2 However, our GMM results do not catch the connection between clips 3 and 4, and, proportionally, do not contrast clips 1 and 2 from 3 and 4 as well. This is because of their high overlap in feature space, which we show by reducing the 10-D MFCC features to a 2-D space through principal component analysis (PCA) [16] in Fig. 4 to give a sense of their distribution. We observe that the features for all four clips almost share the same range in the feature space and have a similar distribution. If we model each piece of music without taking into consideration the motion in feature space, it is clear that the results will be very similar. The improved similarity recognition can be attributed to the temporal consideration given by the HMM mixture model. Fig. 5 shows the mixing weights of the VB-learned HMM mixture models for the four violin clips. Although the number of significant weights is initially high, the algorithm automatically reduces this number by suppressing the superfluous components to that necessary to model each clip: the expected mixing weights for these unused HMMs are near zero with high confidence, indicated by the small variance of the

2Prof. S. Lindroth, Chair of the Duke University Music Department, providedguidance in assessing similarities between the pieces considered.


Fig. 5. Mixing weights of HMM mixtures for the four violin clips when K = 50, computed by VB; the error bars represent the standard deviation of each weight.

Fig. 6. Memberships for violin clip 4 considered in Figs. 3–5.

mixing weights. We notice that each clip is represented by a different number of dominant HMMs. For example, clips 1 and 2 require fewer HMMs, which is understandable, as the music for these particular clips is more homogeneous. We give an example of a posterior membership for clip 4 in Fig. 6, where those parts having similar styles should be drawn from the same HMM. The fact that the first 20 s of this clip are repeated during the last 20 s can be seen in their similar membership patterns.

For our second experiment, we compute the similarities between ten 3-min clips, this time of works of different formats (instrumentation). These clips were chosen deliberately with the following intended clustering: 1) clip 1 is unique in style and instrumentation; 2) clips 2 and 3, 4 and 5, 6 and 7, and 9 and 10 are intended to be paired together; and 3) clip 8 is also unique, but is of the same format (instrumentation) as clips 6 and 7. The similarities of these clips are computed in the same manner as in the first experiment via HMM mixture and GMM modeling. The Hinton diagrams of the corresponding similarity matrices are shown in Fig. 7. Again, our intuition is matched in this experiment by the HMM mixture modeling, but less accurately by the GMM modeling. Though the GMM does not contradict our intuition, the similarities are not as stark as for the HMM mixture, especially in the case of clip 1, which was selected to be unique.

Table I lists the number of dominant and utilized HMMs in the mixture model for each clip. Dominant HMMs are those for which the corresponding expected mixing weights exceed a small value ε. Utilized HMMs are those that contain at least one sequence after the membership is calculated. The results show that each clip needs fewer HMMs than the preset truncation level.

Our third experiment posed our most challenging problem. In this experiment, we look at 15 two-min clips from three different classical formats: solo piano, string quartet, and orchestral, with five pieces from each. The feature space covered by these works is significantly larger than that spanned in the other two experiments. With the same settings as before, we built an HMM mixture for each of the 15 clips, of which we considered 1 and 2, 3 and 4, 6 and 7, 9 and 10, and 11 and 12 to be the most similar pairs. Clips 8 and 13 were considered to be very different from all other clips and can therefore be considered anomalies. Clips 5 and 14 were also considered more unique than similar to any other piece. The results of the HMM mixture modeling as well as the GMM modeling are shown in Fig. 8. The HMM mixture better catches the connection between clips 3 and 4 and deemphasizes the connections


Fig. 7. Hinton diagram for the similarity matrix for 10 music clips. (a) Similarity matrix learned by DP HMM mixture models. (b) Similarity matrix learned by DP GMMs.

TABLE I
NUMBER OF HMM MIXTURE COMPONENTS UTILIZED FOR EACH CLIP CONSIDERED IN FIG. 7

between clip 6 and clips 9 and 10. Also, though still not considered unique, the similarities of clip 5 are reduced in proportion to other similarities within the solo piano format, indicating the beneficial effects of the temporal information. The results indicate that clip 14 is closer to the other orchestral works than expected from human listening. In general, the GMM and HMM mixture modeling approaches are more comparable in our third experiment, with a slight edge given to the HMM mixture resulting from the temporal information.

It is important to note that, theoretically, the GMM should always perform well when comparing two similar pieces, as their similarity will be manifested in part by an overlapping feature space. However, in cases where the overlapping features contain significantly different motions, be it in rhythm or note transitions, the GMM inherently will not perform well and will tend to yield a larger similarity than the "truth," whereas our HMM mixture, in theory, will detect and account for these variations.

VII. CONCLUSION AND FUTURE WORK

We have developed a discrete HMM mixture model for situations in which a sequential data set may have several different underlying mechanisms, unable to be adequately modeled by a single HMM. This model is built in a Bayesian setting using DP priors, which has the advantage of automatically determining the component number and associated memberships through the encouragement of parameter sharing. Two inference

tools are provided for the HMM mixture model: an MCMC solution based on a Gibbs sampler, and a VB approach via maximizing a variational lower bound.

The performance of HMM mixture modeling was demonstrated on both synthetic and music data sets. We compared the MCMC and VB implementations on the synthetic data, where we showed that VB provides an acceptable and highly efficient alternative to MCMC, allowing consideration of large data sets, such as music. For our music application, we presented three experiments within the classical music genre, where the MFCC feature spaces for the pieces of music were highly overlapping. We compared our HMM mixture model to the GMM, computing similarities between music pieces as a measure of performance. The results showed that HMM mixture modeling was able to better distinguish between the content of the given music, generally providing sharper contrasts in similarities than the GMM. The results from this and the synthetic data are both promising. Where GMMs have been shown to succeed at genre classification, our HMM mixture model matches this performance and exceeds it within a music format by taking the temporal character of the given music into account. However, we have to break up the input sequence into a number of short segments, which may create a problem of trading off the accuracy of boundary placement (which would require short pieces) against the accuracy of HMM parameter estimation/recognition (which would require longer pieces). In addition, pieces that span the boundary between two segments may show characteristics of more than one HMM. These issues constitute important subjects for future work.


Fig. 8. Hinton diagram for the similarity matrix for 15 music clips. (a) Similarity matrix learned by DP HMM mixture models. (b) Similarity matrix learned by DP GMMs.

APPENDIX

DERIVATION OF UPDATING EQUATIONS IN VB APPROACH

1) Update q(A_k), k = 1, ..., K: Only those terms related to A_k in (33) are kept for the derivation, and then we have

(39)

where

(40)

The quantity in (40) can be rewritten as

(41)


where can be computed from the forward-backward algorithm [22]. Then we have

(42)

where

(43)

According to Gibbs’ inequality, that is,

(44)

where is a normalizing constant to ensure , is maximized with respect to by choosing

(45)

which is a product of Dirichlet distributions with the hyperparameters .
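For reference, a conjugate update of this type has the standard form (written here with assumed notation for the transition-matrix factor; it is not necessarily identical to (45), and the emission-probability and initial-state factors in 2) and 3) are analogous, with the corresponding expected counts):

\[
\hat{u}^{(k)}_{ij} \;=\; u^{0}_{ij} \;+\; \sum_{n} q(c_n = k) \sum_{t} \xi^{(n)}_{t}(i,j),
\qquad
\xi^{(n)}_{t}(i,j) \;\triangleq\; q\bigl(s^{(n)}_{t} = i,\; s^{(n)}_{t+1} = j\bigr),
\]

i.e., the prior hyperparameters are augmented by the expected transition counts of each sequence, weighted by the variational responsibility of mixture component k.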

2) Update , : is maximized with respect to in a similar procedure, and the optimal is obtained as

(46)

where

(47)

3) Update , : Similarly, the optimal to maximize is derived as

(48)

where

(49)

in (47) and in (49) can also be computed from the forward-backward algorithm.
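For completeness, the scaled forward-backward recursions of [22] for a discrete HMM are sketched below; they return the single-state and pairwise state posteriors that the above updates require. The variable names (pi, A, B, x) are assumed here, not taken from the text.

import numpy as np

def forward_backward(pi, A, B, x):
    # pi: (I,) initial-state probabilities, A: (I, I) transition matrix,
    # B: (I, M) emission matrix, x: length-T array of symbol indices.
    # Returns gamma[t, i] = p(s_t = i | x) and
    #         xi[t, i, j] = p(s_t = i, s_{t+1} = j | x).
    x = np.asarray(x)
    T, I = len(x), len(pi)
    alpha = np.zeros((T, I)); beta = np.zeros((T, I)); c = np.zeros(T)

    # Scaled forward pass: alpha[t] = p(s_t | x_1..x_t), c[t] = p(x_t | x_1..x_{t-1}).
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Scaled backward pass.
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]
    return gamma, xi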

4) Update : The lower bound related to is given as

(50)

Using the property that , where , and noting that the Beta distribution is a two-parameter Dirichlet distribution, (50) is then maximized at

(51)

with and.
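The property invoked above is, in the notation assumed here, the standard digamma identity for a Beta-distributed stick-breaking variable V_k ~ Beta(γ_{k,1}, γ_{k,2}):

\[
\mathbb{E}_q[\log V_k] = \psi(\gamma_{k,1}) - \psi(\gamma_{k,1} + \gamma_{k,2}),
\qquad
\mathbb{E}_q[\log(1 - V_k)] = \psi(\gamma_{k,2}) - \psi(\gamma_{k,1} + \gamma_{k,2}),
\]

where \psi(\cdot) denotes the digamma function.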

5) Update : Collecting all the quantities related to , we have

(52)

is rewritten in [9] as

(53)

where is the indicator function. Substituting (53) into (52), and recalling that and , the optimized becomes

(54)

where and.
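Written in the truncated stick-breaking notation of [9] (with symbols assumed here, not necessarily matching (54)), an indicator update of this type takes the familiar form

\[
q(c_n = k) \;\propto\; \exp\!\Bigl(\mathbb{E}_q[\log V_k] + \sum_{j<k} \mathbb{E}_q[\log(1 - V_j)] + \mathbb{E}_q\bigl[\log p(\mathbf{x}_n \mid \theta_k)\bigr]\Bigr),
\qquad k = 1, \ldots, K,
\]

where the expected data log-likelihood is evaluated under the variational posteriors of the HMM parameters obtained in 1)-3).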

6) Update , : The is expressed as

(55)


Using the same property again, we define

(56)

(57)

(58)

The maximum of (55) is achieved at

(59)

7) Update , : is given as

(60)

Define

(61)

then the solution that optimizes (60) is expressed as

(62)
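To show how the individual updates fit together as a coordinate-ascent loop, the following simplified sketch runs truncated stick-breaking VB for a DP mixture in which categorical observation models stand in for the HMM components. It illustrates only the iteration pattern (responsibility update, conjugate parameter update, Beta update of the stick-breaking weights); it is not our implementation, and all names (vb_dp_mixture, counts, and so on) are illustrative.

import numpy as np
from scipy.special import digamma

def vb_dp_mixture(counts, K=10, alpha=1.0, u0=1.0, n_iter=50, seed=0):
    # counts: (N, M) matrix of per-item symbol counts.
    # Truncated stick-breaking VB for a DP mixture of categorical distributions.
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    r = rng.dirichlet(np.ones(K), size=N)          # responsibilities q(c_n = k)
    for _ in range(n_iter):
        Nk = r.sum(axis=0)                         # expected component sizes
        # Conjugate update: Dirichlet hyperparameters = prior + expected counts.
        u_hat = u0 + r.T @ counts                  # (K, M)
        E_log_theta = digamma(u_hat) - digamma(u_hat.sum(axis=1, keepdims=True))
        # Beta updates of the stick-breaking weights V_k.
        g1 = 1.0 + Nk
        g2 = alpha + np.concatenate([np.cumsum(Nk[::-1])[-2::-1], [0.0]])
        E_log_V = digamma(g1) - digamma(g1 + g2)
        E_log_1mV = digamma(g2) - digamma(g1 + g2)
        E_log_w = E_log_V + np.concatenate([[0.0], np.cumsum(E_log_1mV)[:-1]])
        # Indicator update, normalized with the log-sum-exp trick.
        log_r = counts @ E_log_theta.T + E_log_w
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
    return r, u_hat

# Purely illustrative usage with random count data.
r, u_hat = vb_dp_mixture(np.random.default_rng(1).poisson(2.0, size=(200, 8)))
print(r.shape, u_hat.shape)

In the full model, the categorical update is replaced by the HMM parameter updates 1)-3), the expected data log-likelihood inside the indicator update is obtained from the forward-backward algorithm, and convergence is monitored through the variational lower bound.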

ACKNOWLEDGMENT

The authors would like to thank Prof. S. Lindroth, Chair of the Duke University Music Department, for reviewing their results and providing guidance with regard to defining “truth” for the music data.

REFERENCES

[1] J.-J. Aucouturier and F. Pachet, “The influence of polyphony on the dynamical modelling of musical timbre,” Pattern Recognit. Lett., vol. 28, no. 5, pp. 654–661, 2007.

[2] J.-J. Aucouturier and F. Pachet, “Music similarity measures: What’s the use?,” presented at the Int. Symp. Music Information Retrieval (ISMIR), 2002.

[3] J.-J. Aucouturier and F. Pachet, “Improving timbre similarity: How high’s the sky?,” J. Negative Results Speech Audio Sci., vol. 1, no. 1, 2004.

[4] J.-J. Aucouturier and M. Sandler, “Segmentation of musical signals using hidden Markov models,” presented at the 110th Conv. Audio Eng. Soc., May 2001.

[5] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 5, no. 2, pp. 179–190, 1983.

[6] M. J. Beal, “Variational algorithms for approximate Bayesian inference,” Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College London, London, U.K., 2003.

[7] A. Berenzweig, B. Logan, D. P. Ellis, and B. Whitman, “A large-scale evaluation of acoustic and subjective music similarity measures,” Comp. Music J., vol. 28, no. 2, pp. 63–76, 2004.

[8] D. Blackwell and J. MacQueen, “Ferguson distributions via Pólya urn schemes,” Annal. Statist., vol. 1, pp. 353–355, 1973.

[9] D. Blei and M. Jordan, “Variational methods for the Dirichlet process,” presented at the 21st Int. Conf. Machine Learning, 2004.

[10] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.

[11] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” J. Amer. Statist. Assoc., vol. 90, no. 430, pp. 577–588, 1995.

[12] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” Annal. Statist., vol. 1, pp. 209–230, 1973.

[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, “Introducing Markov chain Monte Carlo,” in Markov Chain Monte Carlo in Practice. London, U.K.: Chapman & Hall, 1996.

[14] G. Hinton, T. Sejnowski, and D. Ackley, “Boltzmann machines: Constraint satisfaction networks that learn,” Carnegie Mellon Univ., Tech. Rep. CMU-CS-84-119, May 1984.

[15] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” J. Amer. Statist. Assoc., Theory Meth., vol. 96, no. 453, pp. 161–173, 2001.

[16] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York: Springer, 2002.

[17] Y. Lin, “Learning phonetic features from waveforms,” UCLA Working Papers in Phonetics (WPP), no. 103, pp. 64–70, Sep. 2004.

[18] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84–95, Jan. 1980.

[19] B. Logan and A. Salomon, “A music similarity function based on signal analysis,” presented at the ICME 2001, 2001.

[20] D. J. C. MacKay, “Ensemble learning for hidden Markov models,” Cavendish Lab., Univ. Cambridge, 1997 [Online]. Available: http://wol.ra.phy.cam.ac.uk/mackay

[21] M. Mandel, G. Poliner, and D. Ellis, “Support vector machine active learning for music retrieval,” Multimedia Syst., Special Issue on Machine Learning Approaches to Multimedia Information Retrieval, vol. 12, no. 1, pp. 3–13, 2006.

[22] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[23] C. Raphael, “Automatic segmentation of acoustic musical signals using hidden Markov models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 360–370, Apr. 1999.

[24] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, pp. 465–471, 1978.

[25] N. Scaringella and G. Zoia, “On the modeling of time information for automatic genre recognition systems in audio signals,” in Proc. 6th Int. Symp. Music Inf. Retrieval (ISMIR), 2005, pp. 666–671.

[26] A. Schliep, B. Georgi, W. Rungsarityotin, I. G. Costa, and A. Schonhuth, “The general hidden Markov model library: Analyzing systems with unobservable states,” 2004 [Online]. Available: www.billingpreis.mpg.de/hbp04/ghmm.pdf

[27] J. Sethuraman, “A constructive definition of the Dirichlet prior,” Statistica Sinica, vol. 2, pp. 639–650, 1994.

[28] X. Shao, C. Xu, and M. Kankanhalli, “Unsupervised classification of musical genre using hidden Markov model,” in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), 2004, pp. 2023–2026.


[29] A. Sheh and D. P. W. Ellis, “Chord segmentation and recognition using EM-trained hidden Markov models,” in Proc. 4th ISMIR, Oct. 2003, pp. 183–189.

[30] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, pp. 1566–1581, 2006.

[31] M. West, P. Muller, and M. Escobar, “Hierarchical priors and mixture models with applications in regression and density estimation,” in Aspects of Uncertainty, P. R. Freeman and A. F. Smith, Eds. New York: Wiley, 1994, pp. 363–386.

[32] B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with Minnowmatch,” in Proc. 2001 IEEE Workshop Neural Netw. Signal Process., 2001, pp. 559–568.

[33] C. Xu, N. C. Maddage, X. Shao, F. Cao, and Q. Tian, “Musical genre classification using support vector machines,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2003, vol. V, pp. 429–432.

Yuting Qi (S’06) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 1999, and the M.S. degree in electrical and computer engineering from Duke University, Durham, NC, in 2004, where she is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering.

Her research interests include machine learning, statistical signal processing, and computer vision.

John William Paisley received the B.S. and M.S. degrees in electrical and computer engineering from Duke University, Durham, NC, in 2004 and 2007, respectively, where he is currently working toward the Ph.D. degree.

Lawrence Carin (SM’96–F’01) was born March 25, 1963 in Washington, DC. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1985, 1986, and 1989, respectively.

In 1989, he joined the Electrical Engineering Department at Polytechnic University, Brooklyn, NY, as an Assistant Professor, and became an Associate Professor there in 1994. In September 1995, he joined the Electrical Engineering Department, Duke University, Durham, NC, where he is now the William H. Younger Professor of Engineering. He was the principal investigator (PI) on a Multidisciplinary University Research Initiative (MURI) on demining (1996–2001), and he is currently the PI of a MURI dedicated to multimodal inversion. He was the co-founder of Signal Innovations Group, Inc. (SIG), which is now a subsidiary of Integrian, Inc.; he serves as the Director of Technology at SIG. His current research interests include signal processing, sensing, and machine learning.

Prof. Carin was an Associate Editor of the IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION from 1995–2004. He is a member of the Tau Beta Pi and Eta Kappa Nu honor societies.
