mStruct: Inference of Population Structure in Light of Both ... - Genetics · population genetics problems (Queller et al. 1993; Hammer et al. 1998; Templeton 2002). Various methods

Copyright � 2009 by the Genetics Society of AmericaDOI: 10.1534/genetics.108.100222

mStruct: Inference of Population Structure in Light of Both GeneticAdmixing and Allele Mutations

Suyash Shringarpure* and Eric P. Xing†,1

*Machine Learning Department and †School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15215

Manuscript received December 29, 2008Accepted for publication April 2, 2009

ABSTRACT

Traditional methods for analyzing population structure, such as the Structure program, ignore theinfluence of the effect of allele mutations between the ancestral and current alleles of genetic markers, whichcan dramatically influence the accuracy of the structural estimation of current populations. Studying theseeffects can also reveal additional information about population evolution such as the divergence time andmigration history of admixed populations. We propose mStruct, an admixture of population-specificmixtures of inheritance models that addresses the task of structure inference and mutation estimation jointlythrough a hierarchical Bayesian framework, and a variational algorithm for inference. We validated ourmethod on synthetic data and used it to analyze the Human Genome Diversity Project–Centre d’Etude duPolymorphisme Humain (HGDP–CEPH) cell line panel of microsatellites and HGDP single-nucleotidepolymorphism (SNP) data. A comparison of the structural maps of world populations estimated by mStructand Structure is presented, and we also report potentially interesting mutation patterns in world populationsestimated by mStruct.

THE deluge of genomic polymorphism data, such asthe genomewide multilocus genotype profiles of

variablenumbersof tandem repeats (i.e.,microsatellites)and single-nucleotide polymorphisms (SNPs), has fu-eled the long-standing interest in analyzing patterns ofgenetic variations to reconstruct the ancestral structuresof modern human populations. Genetic ancestral in-formation can shed light on the evolutionary history andmigrations of modern populations (Bowcock et al. 1994;Rosenberg et al. 2002; Conrad et al. 2006). It alsoprovidesguidelines formoreaccurateassociationstudies(Roeder et al. 1998) and is useful for many otherpopulation genetics problems (Queller et al. 1993;Hammer et al. 1998; Templeton 2002).

Various methods have been proposed for stratifyingpopulation structures on the basis of multilocus geno-type information from a set of individuals. For example,Pritchard et al. (2000) proposed a model-based ap-proach implemented in the program Structure, whichuses a statistical methodology known as the allele-frequency admixture model to stratify population struc-tures. This model, and admixture models in generalarising in genetic and other contexts (Blei et al. 2003),belongs to a more general class of hierarchical Bayesianmodels known as the mixed membership models (Erosheva

et al. 2004). Such a model postulates that an empiricalmultiple-instance sample, such as the ensemble of

genetic markers of an individual, is made up of eitherindependently and identically distributed (iid) instan-tiations (Pritchard et al. 2000) or spatially coupled(Falush et al. 2003) instantiations, from multiplepopulation-specific fixed-dimensional multinomial dis-tributions of marker alleles [known as allele-frequencyprofiles, AP (Falush et al. 2003)]. Under this assumption,the admixture model identifies each ancestral popula-tion by a specific AP (that defines a unique vector ofallele frequencies of each marker in each ancestralpopulation) and displays the fraction of contributionsfrom each AP in a modern individual genome as anadmixing vector (also known as an ancestral proportion vectoror structure vector) in a structural map over the populationsample in question. Figure 1 shows an example of astructural map of four modern populations inferredfrom a portion of the HapMap multipopulation data setby Structure. In this population structural map, theadmixing vector underlying each individual is repre-sented as a thin vertical line of unit length and multiplecolors, with the height of each color reflecting thefraction of the individual’s genome originated from acertain ancestral population denoted by that color andformally represented by a unique AP. This method hasbeen applied to the Human Genome Diversity Project–Centre d’Etude du Polymorphisme Humain (HGDP–CEPH) Human Genome Diversity Cell Line Panel inRosenberg et al. (2002) and many other studies, and hasunraveled interesting patterns in the genetic structuresof the world population. However, even though Struc-

1Corresponding author: 5000 Forbes Ave., School of Computer Science,Pittsburgh, PA 15215. Email: [email protected]

Genetics 182: 575–593 ( June 2009)

ture was originally built on a genetic admixture model,in reality the structural patterns derived by Structure invarious studies often turn out to be distinct clustersamong the study populations (e.g., Figure 1), which hasled many to think of it as a clustering program rather thana tool for uncovering genetic admixing as it was supposedto do. The design limitation of the Structure modelbehind this issue motivated us to develop a new approachin this article to analyze admixed genetic samples.

A recent extension of Structure, known as Structur-ama (Pella and Masuda 2006; Huelsenbeck andAndolfatto 2007), relaxes the finite dimensionalassumption on ancestral populations in the admixturemodel by employing a Dirichlet process prior over theancestral allele-frequency profiles. This allows auto-matic estimation of the maximum a posteriori probablenumber of ancestral populations. This extension is auseful improvement since it eliminates the need formanual selection of the number of ancestral popula-tions. Anderson and Thompson (2002) address theproblem of classifying species hybrids into categories,using a model-based Bayesian clustering approachimplemented in the NewHybrid program. While thisproblem is not exactly identical to the problem ofstratifying the structure of highly admixed populations,it is useful for structural analysis of populations that wererecently admixed. The BAPS program (Corander et al.2003) also uses a Bayesian approach to find the bestpartition of a set of individuals into subpopulations onthe basis of genotypes. Parallel to the aforementionedmodel-based approaches for genomic structural analy-sis, direct algebraic eigen-decomposition and dimen-sionality reduction methods, such as the Eigensoftprogram (Patterson et al. 2006) based on principalcomponents analysis (PCA), offer an alternative ap-proach to explore and visualize the ancestral composi-tion of modern populations and facilitate formalstatistical tests for significance of population differenti-ation. However, unlike the model-based methods such asStructure, where each inferred ancestral populationbears a concrete genetic meaning as a population-specific allele-frequency profile, the eigenvectors com-puted by Eigensoft represent the mutually orthogonaldirections in an abstract low-dimensional ancestralspace, in which population samples can be embeddedand visualized; these eigenvectors can be understood asmathematical surrogates of independent genetic sour-

ces underlying a population sample, but lack a concreteinterpretation under a generative genetic inheritancemodel (from here on, we use the term ‘‘inheritancemodel’’ to describe the process by which a descendantallele is derived from an ancestral allele). Analyses basedon Eigensoft are usually limited to two-dimensionalancestral spaces, offering limited power in stratifyinghighly admixed populations.

This progress notwithstanding, an important aspectof population admixing that is largely missing in theexisting methods is the effect of allele mutationsbetween the ancestral and current alleles of geneticmarkers, which can dramatically influence the accuracyof the structural estimation of current populations. Itcan also reveal additional information about popula-tion evolution, such as the relative divergence time andmigration history of admixed populations.

Consider, for example, the Structure model. Since anAP merely represents the frequency of alleles in anancestral population rather than the actual alleliccontent or haplotypes of the alleles themselves, theadmixture models developed so far on the basis of APsdo not model genetic changes due to mutations fromthe ancestral alleles. Indeed, a serious pitfall of themodel underlying Structure, as pointed out in Excoffier

and Hamilton (2003), is that there is no mutationmodel for modern individual alleles with respect tohypothetical common prototypes in the ancestral pop-ulations. That means every unique allele in the modernpopulation is assumed to have a distinct ancestralproportion, rather than allowing the possibility of itjust being a descendant of some common ancestralallele that can also give rise to other closely related allelesat the same locus of other individuals in the modernpopulation. Thus, while Structure aims to provideancestry information for each individual and each locus,there is no explicit representation of the ‘‘ancestors’’ as aphysical set of ‘‘founding alleles.’’ Therefore, the in-ferred population structural map emphasizes revealingthe contributions of abstract population-specific ances-tral proportion profiles, which does not necessarilyreflect individual diversity or the extent of geneticchanges with respect to the founders. Due to thislimitation, Structure does not enable inference of thefounding genetic patterns, the age of the foundingalleles, or the population divergence time (Excoffier

and Hamilton 2003).The lack of an appropriate allele mutation model in a

structural inference program can also compromise ourability to reliably assess the amount or level of geneticadmixing in different populations. The Structuremodel, like several other related models (Blei et al.2003), is based on the fundamental assumption of thepresence of genetic admixing among multiple foundingpopulations. However, as we shall see later, on realpopulation data such as the HGDP–CEPH panel, itproduces results that favor clustering individuals into

Figure 1.—Population structural map inferred by Struc-ture on HapMap data consisting of four populations.

576 S. Shringarpure and E. P. Xing

predominantly one allele-frequency profile or another,thus leading us to conclude that there was little or noadmixing between the ancestral human populations. Webelieve that this occurs due to the absence of a mutationmodel in Structure. While a partitioning of individualswould be desirable for clustering them into groups, itdoes not offer enough biological insight into the inter-mixing of the populations.

In this article, we present mStruct (which stands forStructure under mutations), based on a new model: anadmixture of population-specific mixtures of inheri-tance models (AdMim). Statistically, AdMim is anadmixture of mixture models, which represents each ances-tral population as a mixture of ancestral alleles each withits own inheritance process and each modern individualas an ‘‘ancestry vector’’ (or structure vector) that reflectsmembership proportions of the ancestral populations.As we explain shortly, mStruct facilitates estimation ofboth the structural map of populations and the muta-tion parameters of either SNP or microsatellite allelesunder various contexts. A new variational inferencealgorithm, which is much faster than the MCMCalgorithm used for Structure, was developed for estimat-ing the structure vectors and other genetic parametersof interest. We compare our method with Structure onsimulated genotype data and on the microsatellite andSNP genotype data of world populations (Rosenberg

et al. 2002; Conrad et al. 2006). Our results usingmicrosatellite data reveal the presence of significantlevels of genetic admixing among the founding popula-tions underlying the HGDP–CEPH cell line panel, aswell as consequences of expansion of humans out ofAfrica. Our results suggest that the inability of Structureto model mutations during genetic admixing could havecaused it to detect correct clustering but very low levelsof genetic admixing in each modern population in theHGDP–CEPH data. We also report interesting visual-izations of genetic divergence in world populationsrevealed by the mutation patterns estimated by mStruct.The mStruct software has been implemented in C11

and is available for download at http://www.sailing.cs.cmu.edu/mstruct.html.

THE STATISTICAL MODEL

The mStruct model differs from the Structure modelin two main aspects: the representation of ancestralpopulations and the generative process for sampling amodern individual from the ancestral populations. Inthis section we describe in detail the statistical un-derpinning of these two aspects.

Representation of Populations

To reveal the genetic composition of each modernindividual in terms of contributions from hypotheticalancestral populations via statistical inference on multi-

locus genotype data, one must first choose an appropri-ate representation of ancestral populations. We beginwith a brief description of the commonly used represen-tation adopted by Structure, followed by a new represen-tation we propose that allows mutations to be captured.

Population-specific allele-frequency profiles: Sinceall markers that are used for population structurestratification are polymorphic in nature, it is notsurprising that the most intuitive representation of anancestral population is a set of frequency vectors for allalleles observed at all the loci. Specifically, we canrepresent an ancestral population k by a unique set ofpopulation-specific multinomial distributions bk [ f~bk

i ;i ¼ 1 : Ig, where ~b

k

i ¼ ½bki;1; . . . ;bk

i;L9i� is the vector of

multinomial parameters, also known as an AP (Falush

et al. 2003), of the allele distribution at locus i inancestral population k; L9i denotes the total number ofobserved marker alleles at locus i; and I denotes the totalnumber of marker loci. This representation, known aspopulation-specific allele-frequency profiles, is used by theprogram Structure.

Population-specific mixtures of ancestral alleles: AnAP does not enable us to model the possibility ofmutations; i.e., there is no way of representing a situationwhere two observed alleles might have been derivedfrom a single ancestral allele by two different mutations.This possibility can be represented by a genetically morerealistic statistical model known as the population-specificmixture of ancestral alleles (MAA). For each locus i, an MAAfor ancestral population k is a set Qk

i [ fmki ; d

ki ;~b

k

i gconsisting of three components: (1) a set of ancestral(or founder) alleles mk

i [ fmki;1; . . . ;mk

i;Lig, which can

differ from their descendant alleles in the modernpopulation; (2) a mutation parameter di

k associated withthe locus, which can be further generalized to be allele-specific if necessary; and (3) an AP ~b

k

i , which nowrepresents the frequencies of the ancestral alleles. HereLi denotes the total number of ancestral alleles at loci i,which is different from L9i in the previous section, whichdenotes the total number of observed alleles at loci i. Byexplicitly associating a mutation model with an ancestralpopulation, we can now capture mutation events as de-scribed above. It is important to note that the mutationparameter d is not the mutation rate commonly referredto in the literature. As we shall see later, it is a measure ofthe variability of a locus that can be described approx-imately as the combined effect of the per-generationmutation rate and the age of the population.

An MAA is strictly more expressive than an AP,because the incorporation of a mutation model helpsto capture details about the population structure that anAP cannot; and the MAA reduces to the AP when themutation rates (and hence the mutation parameters)become zero and the founders are identical to theirdescendants. MAA is also arguably more realistic be-cause it allows mutation rates (and mutation parame-ters) to be different for different founder alleles even

mStruct: Structure Under Mutations 577

within the same ancestral population, as is commonlythe case with many genetic markers. For example, themutation rates for microsatellite alleles are believed tobe dependent on their length (number of repeats). Aswe shall show shortly, with an MAA, one can examine themutation parameters corresponding to each ancestralpopulation via Bayesian inference from genotype data;this might enable us to infer the age of alleles and alsoestimate population divergence times subject to acalibration constant.

Let i 2 {1, . . . , I} index the position of a locus in thestudy genome, n 2 {1, . . . , N} index an individual in thestudy population, and e 2 {0, 1} index the two possibleparental origins of an allele (in this study we do notrequire strict phase information of the two alleles, so theindex e is used merely to indicate ploidy of the data).Under an MAA specific to an ancestral population k, thecorrespondence between a marker allele Xi;ne

and afounder mk

i;l 2 mki is not directly observable. For each

allele founder mki;l, we associate with it an inheritance

model p(� j mki;l , dk

i;l ) from which descendants can besampled. Then, given specifications of the ancestralpopulation from which Xi;ne

is derived, which is denotedby hidden indicator variable Zi;ne

, the conditionaldistribution of Xi;ne

under an MAA follows a mixture ofpopulation-specific inheritance models:

Pðxi;ne¼ l9 j zi;ne

¼ kÞ ¼XL

l¼1

bki;l Pðxi;ne

jmki;l ; d

ki;lÞ: ð1Þ

Comparing to the counterpart of this function underAP, Pðxi;ne

¼ l9 j zi;ne¼ kÞ ¼ bk

i;l9, we can see that thelatter cannot explicitly model allele diversities in termsof molecular evolution from the founders.

A New Admixture Model for Population Structure

Admixtures are useful for modeling objects (e.g.,human beings), each comprising multiple instances ofsome attributes (e.g., marker alleles), each of whichcomes from a (possibly different) source distributionPk(� j Qk), according to an individual-specific admixingvector (a.k.a. structure vector) ~u. The structure vectorrepresents the normalized contribution from each ofthe source distributions {Pk; k ¼ 1:K} to the object inquestion. For a single data set, all the structure vectorsare assumed to be samples from an underlying structureprior with parameter a. For example, for every individ-ual, the alleles at all loci may be inherited from foundersin different ancestral populations, each represented by aunique distribution of founding alleles and the way theycan be inherited. Formally, this scenario can be capturedin the following generative process:

1. For each individual n, draw the admixing vector~un �Pð� j aÞ, where P(� j a) is a prechosen structure prior.

2. For each marker allele xi;ne2 xn :

2.1, draw the latent ancestral-population-origin indicatorzi;ne� Multinomialð� j ~unÞ;

2.2, draw the allele xi;nej zi;ne

¼ k � Pkð� j Qki Þ.

As discussed in the previous section, an ancestralpopulation can be represented either as an AP or as anMAA. These two different representations lead to twodifferent probability distributions for Pk(� j Qk) in thelast sampling step above and thereby to two differentadmixtures of very different characteristics.

The Structure model by PRITCHARD et al. (2000): InStructure, the ancestral populations are represented bya set of population-specific APs. Thus the distributionPk(� jQk) from which an observed allele can be sampled isa multinomial distribution defined by the frequencies ofall observed alleles in the ancestral population; i.e.,xi;nej zi;ne

¼ k � Multinomialð� j ~bk

i Þ. Using this proba-bility distribution in the general admixture schemeoutlined above, we can see that Structure essentiallyimplements an admixture of population-specific allele-frequency profiles (Adaf) model. But a serious pitfall ofusing such a model, as pointed out in Excoffier andHamilton (2003), is that there is no mutation modelfor individual alleles with respect to the commonprototypes; i.e., every unique allele measurement at aparticular locus is assumed to correspond to a uniqueancestral allele, rather than allowing the possibility of itjust being derived from some common ancestral allele atthat locus as a result of a mutation.

Our model: We propose to represent each ancestralpopulation by a set of population-specific MAAs. Recallthat in an MAA for each locus we define a finite set offounders with prototypical alleles mk

i [ fmki;1; . . . ;mk

i;Lig

that can be different from the alleles observed in amodern population; each founder is associated with aunique frequency bk

i;l and a unique (if desired) muta-tion model from the prototype allele parameterized byrate dk

i;l . Under this representation, now the distributionPk(� jQi

k) from which an observed allele can be sampledbecomes a mixture of inheritance models, each definedon a specific founder; and the ensuing sampling modulethat can be plugged into the general admixture schemeoutlined above (to replace step 2.2) becomes a two-stepgenerative process: (step 2.2a) draw the latent founderindicator ci;ne

j zi;ne¼ k � Multinomialð� j ~bk

i Þ; and(step 2.2b) draw the allele xi;ne

j ci;ne¼ l ; zi;ne

¼k � Pmð� j mk

i;l ; dki;lÞ, where Pm() is a mutation model

that can be flexibly defined on the basis of whether thegenetic markers are microsatellites or single-nucleotidepolymorphisms. We call this model AdMim. Figure 2Ashows a graphical model representation of the overallgenerative scheme for AdMim, in comparison with theAdaf model underlying Structure. From Figure 2, we canclearly see that mStruct is an extended Structure modelthat allows copying errors.

For simplicity of presentation, in the model describedabove, we assume that for a particular individual, the


genetic markers at each locus are conditionally iidsamples from a set of population-specific fixed-dimen-sional mixture of inheritance models and that the set offounder alleles (but not their frequencies) at a partic-ular locus is the same for all ancestral populations (i.e.,mk

i [ mi). We also assume that the mutation parametersfor each population at any locus are independent of thealleles at that locus (i.e., dk

i;l [ dki ). Also, our model

assumes Hardy–Weinberg equilibrium within popula-tions. The simplifying assumptions of unlinked loci and nolinkage disequilibrium between loci within populations can beeasily removed by incorporating Markovian dependen-cies over ancestral indicators Zi;ne

and Zi11;neof adjacent

loci and over other parameters such as the allelefrequencies ~b

k

i in exactly the same way as in Structure.We can also introduce Markovian dependencies overmutation parameters at adjacent loci, which might bedesirable to better reflect the dynamics of molecular

evolution in the genome. We defer such extensions to afuture article.

Mutation model

As described above, our model is applicable to almostall kinds of genetic markers by plugging in an appro-priate allele mutation model (i.e., inheritance model)Pm(). We now discuss mutation models for microsatel-lites and SNPs.

Microsatellite mutation model: Microsatellites are aclass of tandem-repeat loci that involve a DNA unit thatis 1–4 bp in length. Microsatellite DNA has significantlyhigh mutation rates as compared to other DNA, withmutation rates as high as 10�3 or 10�4 (Kelly et al. 1991;Henderson and Petes 1992). The large amount ofvariations present in microsatellite DNA makes it idealfor differentiating founder patterns between closelyrelated populations. Microsatellite loci have been usedbefore DNA fingerprinting (Queller et al. 1993),before linkage analysis (Dietrich et al. 1992), and inthe reconstruction of human phylogeny (Bowcock et al.1994). By applying theoretical models of microsatelliteevolution to data, questions such as time of divergenceof two populations can be attempted to be addressed(Pisani et al. 2004; Zhivotovsky et al. 2004).

The choice of a suitable microsatellite mutationmodel is important, for both computational and in-terpretation purposes. Below we discuss the mutationmodel that we use and the biological interpretation ofthe parameters of the mutation model. We begin with astepwise mutation model for microsatellites widely usedin forensic analysis (Valdes et al. 1993; Lin et al. 2006).

This model defines a conditional distribution of aprogeny allele b given its progenitor allele a, both ofwhich take continuous values

pðb j aÞ ¼ 1

2jð1� dÞdjb�aj�1; ð2Þ

where j is the mutation rate (probability of anymutation), and d is the factor by which mutationdecreases as distance between the two alleles increases.Although this mutation distribution is not stationary(i.e., it does not ensure allele frequencies to be constantover the generations), it is commonly used in forensicinference due to its simplicity. To some degree d can beregarded as a parameter that controls the probability ofunit-distance mutation, as can be seen from the follow-ing identity: p(b 1 1 j a)/p(b j a) ¼ d.

In practice, the alleles for almost all microsatellitesare represented by discrete counts. The two-parameterstepwise mutation model described above complicatesthe inference procedure. We propose a discrete micro-satellite mutation model that is a simplification ofEquation 2, but captures its main idea. We posit thatP(b j a) } djb�aj. Since b 2 [1, ‘), the normalizationconstant of this distribution is

Figure 2.—Graphical models: the circles represent ran-dom variables and diamonds represent hyperparameters.


X‘

b¼1

Pðb j aÞ ¼Xa

b¼1

da�b 1X‘

b¼a11

db�a

¼ 1� da

1� d1

d

1� d

¼ 1 1 d� da

1� d;

which gives the mutation model as

Pðb j aÞ ¼ 1� d

1� da 1 ddjb�aj : ð3Þ

We can interpret d as a variance parameter, the factorby which probability drops as a function of the distancebetween the mutated version b of the allele a. Figure 3shows the discrete probability density function (pdf) forvarious values of d.

Determination of founder set at each locus: According toour model assumptions, there can be a differentnumber of founder alleles at each locus. This numberis typically smaller than the number of alleles observedat each marker since the founder alleles are ‘‘ancestral.’’To estimate the appropriate number and allele states offounders, we fit finite mixtures (of fixed size, corre-sponding to the desired number of ancestral alleles) ofmicrosatellite mutation models over all the measure-ments at a particular marker for all individuals. We usethe Bayesian information criterion (BIC) (Schwarz

1978) to determine the best number and states offounder alleles to use at each locus, since informationcriteria tend to favor a smaller number of founderalleles that fit the observed data well.

For each locus, we fit many different finite-sizedmixtures of mutation distributions, with the size varyingfrom 1 to the number of observed alleles at the locus. Foreach mixture size, the likelihood is optimized and a BICvalue is computed. The number of founder alleles ischosen to be the size of the mixture that has the best(minimum) BIC value. We can do this as a pre-process-ing step before the actual inference or estimationprocedures. This is possible since we assumed that theset of founder alleles at each locus was the same for allpopulations.

Choice of mutation prior: In our model, the d parameter,as explained above, is a population-specific parameter

that controls the probability of stepwise mutations.Being a parameter that controls the variance of themutation distribution, there is a possibility that infer-ence on the model will encourage higher values of d toimprove the log-likelihood, in the absence of any priordistribution on d. To avoid this situation, and to allowmore meaningful and realistic results to emerge fromthe inference process, we impose on d a beta priorthat is biased toward smaller values of d. The beta prioris a fixed one and is not among the parameters weestimate.

SNP mutation model: SNPs represent the largest classof individual differences in DNA. In general, there is awell-defined correlation between the age of the muta-tion producing a SNP allele and the frequency of theallele. For SNPs, we use a simple pointwise mutationmodel, rather than more complex block models. Thus,the observations in SNP data are only binary (0/1) innature. So, given the observed allele b, we say that theprobability of it being derived from the founder allele ais given by

Pðb j aÞ ¼ dI½b¼a�3 ð1� dÞI½b 6¼a�; a; b 2 f0; 1g: ð4Þ

In this case, the mutation parameter d is the probabilitythat the observed allele is not identical to the founderallele, but derived from it due to a mutation.

INFERENCE AND PARAMETER ESTIMATION

For notational convenience, we ignore the diploidnature of observations in the analysis that follows. Withthe understanding that the analysis is carried out foran arbitrary nth individual, we drop the subscript n.Also, we overload the indicator variables zi and ci to beboth arrays with only one element equal to 1 and the restequal to 0, as well as scalars with a value equal to theindex at which the array forms have 1’s. In other words,zi 2 {1, . . . , K} or zi [ [zi,1, . . . , zi,K], where zi;k ¼ I½zi ¼ k�,and I½�� denotes an indicator function that equals to1 when the predicate argument is true and 0 otherwise.A similar overloading is also assumed for the ci variables.For generalization across different types of markers,we use f ðxi j mi;ci

; di;ziÞ to denote P(xi j ci, zi, mi, di).

Different mutation models can be used in AdMim byvarying the form of the function f ().

The joint probability distribution of the data and therelevant variables under the AdMim model can then bewritten as

P ðx; z; c; ~u ja; b; m; dÞ

¼ pð~u jaÞYI

i¼1

Pðzi j~uÞPðci j zi ;~bk¼1:Ki Þ

3 Pðxi j ci ; zi ; mi ; dk¼1:Ki Þ:

The marginal likelihood of the data can be computedby summing/integrating out the latent variables:

Figure 3.—Discretepdffortwovaluesofmutationparameter.


P ðx ja;b;m; dÞ

¼ GðP

Kk¼1 akÞQ

Kk¼1 GðakÞ

ð YKk¼1

uak�1k

!. . .

3YI

i¼1

XK

k¼1

YKk¼1

uzi;k

k

!XI

i¼1

YKk¼1

YLi

l¼1

ðbki;lÞci;l zi;k . . .

3 Pðxi jmi;l ; dki Þci;l zi;k d~u:

However, a closed-form solution to this summation/integration is not possible, and indeed exact inferenceon hidden variables such as the structure vector ~u andestimation of model parameters such as the mutationrates d under AdMim is intractable. Pritchard et al.(2000) presented an MCMC algorithm for approximateinference for their admixture model underlying Struc-ture. While it is straightforward to implement a similarMCMC scheme for AdMim, we choose to apply acomputationally more efficient approximate inferencemethod known as variational inference ( Jordan et al.1999).

Variational inference: We use a mean-field approxi-mation for performing inference on the model. Thisapproximation method approximates an intractablejoint posterior p() of all the hidden variables inthe model by a product of marginal distributionsqðÞ ¼

QqiðÞ, each over only a single hidden variable.

The optimal parameterization of qi() for each variable isobtained by minimizing the Kullback–Leibler diver-gence between the variational approximation q andthe true joint posterior p. Using results from thegeneralized mean field theory (Xing et al. 2003), wecan write the variational distributions of the latentvariables in AdMim as follows:

qð~uÞ}YKk¼1

uak�11

PI

i¼1hzi;ki

k

qðciÞ}YLl¼1

YKk¼1

ðbki;l f ðxi jmi;l ; d

ki ÞÞhzi;ki

!ci;l

qðziÞ}YKk¼1

ehlogðukÞiYLl¼1

bki;l f ðxi jmi;l ; d

ki Þhci;l i

! !zi;k

:

In the distributions above, the ‘‘h�i’’ are used to indicatethe expected values of the enclosed random variables. Aclose inspection of the above formulas reveals thatthese variational distributions have the form qð~uÞ �Dirichletðg1; . . . ; gK Þ, q(zi) � Multinomial(ri,1, . . . ,ri,K), and q(ci)�Multinomial(ji,1, . . . , ji,L), respectively,of which the parameters gk, ri,k and ji,l are given by theequations

gk ¼ ak 1XI

i¼1

hzi;ki

ri;k ¼ehlogðukÞið

QLl¼1 bk

i;l f ðxi jmi;l ; dki Þhci;l iÞP

Kk¼1ðehlogðukÞið

QLl¼1 bk

i;l f ðxi jmi;l ; dki Þhci;l iÞÞ

ji;l ¼Q

Kk¼1ðbk

i;l f ðxi jmi;l ; dki ÞÞhzi;kiP

Kk¼1ð

QKk¼1ðbk

i;l f ðxi jmi;l ; dki ÞÞhzi;kiÞ

and they have the properties hlogðukÞi ¼ cðgkÞ �cðP

k gkÞ, hzi;ki ¼ ri,k, and hci;l i ¼ ji,l, which suggest thatthey can be computed via fixed point iterations. [Thedigamma function c() used above is the first derivativeof the logarithm of the gamma function G().] It can beshown that this iteration will converge to a localoptimum, similar to what happens in an EM algorithm.Empirically, a near global optimum can be obtained bymultiple random restarts of the fixed point iteration.Typically, such a mean-field variational inference con-verges much faster than sampling (Xing et al. 2003).Upon convergence, we can easily compute an estimateof the structure vector~u for each individual from qð~uÞ.

Parameter estimation: The parameters of our modelare the centroids m, the mutation parameters d, theancestral allele frequency distributions b, and theDirichlet hyperparameter that is the prior on ancestralpopulations, a. For the hyperparameter estimation, weperform empirical Bayes estimation using the varia-tional expectation maximization algorithm described inBlei et al. (2003). The variational inference described inthe previous section provides us with a tractable lowerbound on the log-likelihood as a function of the currentvalues of the hyperparameters. We can thus maximize itwith respect to the hyperparameters. If we alternatelycarry out variational inference with fixed hyperpara-meters, followed by a maximization of the lower boundwith respect to the hyperparameters for fixed values ofthe variational parameters, we can get an empiricalBayes estimate of the hyperparameters. The derivation,details of which we do not show here, leads to thefollowing iterative algorithm:

1. E-step: For each individual, find the optimizing valuesof the variational parameters (gn, rn, jn; n 2 1, . . . , N)using the variational updates described above.

2. M-step: Maximize the resulting variational lowerbound on the likelihood with respect to the modelparameters, namely a, b, m, d.

The two steps are repeated until the lower bound onthe log-likelihood converges. The details of estimationof each hyperparameter are included in the appendix.

EXPERIMENTS AND RESULTS

We validated our model on synthetic microsatellitedata sets simulated using a coalescent model to assess


the performance of mStruct in terms of the accuracyand consistency of the estimated structure vectors andto test the correctness of the inference and estimationalgorithms we developed. We also conducted empiricalanalysis using mStruct of two real data sets: the HGDP–CEPH cell line panel of microsatellite loci and theHGDP SNP data, in comparison with the Structureprogram (version 2.2).

Validations on coalescent simulations: To verify thecorrectness of the empirical admixture estimations basedon mStruct when the truth is known, we first simulated amultitude of admixture population data sets, usingcoalescent techniques described in Hudson (1990),under various user-specified admixing scenarios. Specif-ically, following Hudson (R. Hudson, personal commu-nications), without loss of generality we simulatedgenealogy trees for two discrete populations of effectivesize 2N, which were assumed to have split from a singleancestral population, also of size 2N, at a time Ngenerations in the past. We assumed that there was nomigration between the populations after the split. Thesetwo discrete populations were joined together to form asingle random-mating population. (A simulation ofmultiple-population admixing is possible, but tedious,and thus omitted here for simplicity.) After a singlegeneration of random mating, samples were collectedfrom the resulting population. Individuals, therefore,have i parents from population 1 and 2 � i parents frompopulation 2 with probability

�2i

�=4. Every locus was

simulated independently. Microsatellite mutation wasmodeled by a simple stepwise mutation process. Themutation parameter 4Nm was varied over data points, withthree discrete values, {8, 16, 32}, being used. Since theexpected number of mutations within the populations isgiven by 2Nm, the values chosen are representative of thediversity observed in real data (Pritchard et al. 2000).

For each individual, we stored the fractional contri-bution of population 1 to its genome. For each data set,we also stored the fractional contribution of population1 to the entire population. To ensure that eachpopulation was well represented in the admixed pop-ulation, only data sets that had roughly equal contribu-tion from both populations were accepted (thecontribution of population 1 to the resulting popula-tion was required to be in [50� 0.01, 50 1 0.01]%). Foreach data point in the graph, 10 data sets were simulatedusing the same parameter settings for the mutationparameter. Each data set had 60 individuals from theadmixed population measured at 100 loci. For each dataset, 10 runs of each software (i.e., mStruct and Struc-ture) were used to determine the run with best likeli-hood. The statistics used in the result were computedonly on the run with the best likelihood.

We used the simulated data sets to carry out threeanalyses. First, we study the ability of both softwares torecover the contribution of population 1 (denoted as h)to the resulting admixed population. Next, we study

how well each software is able to recover the proportionof ancestry in population 1 for each individual. Finally,we consider the problem of model selection—i.e.,choosing the number of ancestral populations to pro-vide an appropriate representation of the data.

Recovering the contribution of population 1 to the resultingpopulation: We evaluated the accuracy of the estimated h

under three different conditions, one for each value ofthe magnitude of the mutation parameter describedabove. The greater the magnitude is, the more difficultthe estimation of admixing coefficient h, because morediscrepancy would exist between the ancestral allelesand the simulated population alleles. As a measure oferror, we used the absolute difference between thetrue value htrue and the inferred value hinfer. The resultsshown in Figure 4A denote the means and quartiles ofthe result statistics. From Figure 4A, we can see thatas the magnitude of the mutation parameter increases,the error for Structure increases. However, for mStruct,there is no significant effect of the mutation parameteron the error. mStruct also performs better than Struc-ture over all the data points.

Recovering the contribution of population 1 to the ancestryof an individual: We used the same data from the earlierexperiment for this analysis. In this case, we used themean of the absolute difference between the true andinferred values of the proportion of ancestry of individ-uals in population 1 as the measure of error. Figure 4Bshows the results of this analysis. The results follow asimilar trend as in the earlier experiment. For Structure,an increase in the mutation parameter causes as in-crease in the error, but there is no significant effect ofthe mutation parameter on the error for mStruct. Weshow the results for a particular data set with mutationparameter 4Nm ¼ 32 in Figure 5. Figure 5A shows thetrue ancestry proportion map for the sample. It showsthat around half the individuals are admixed. Figure 5,B and C, shows the ancestry proportion maps inferredby Structure and mStruct, respectively. We can see thatthe ancestry structure recovered by mStruct is very closeto the true ancestry proportions. The recovery ofancestry proportions by Structure is not very close tothe truth in this case.

Model selection—choice of K: As in Structure, our modelis defined for a particular value of K, the number ofancestral populations. In general, it is not always clearwhat value of K must be chosen to interpret the dataappropriately. We performed an experiment on thesimulated data to determine the most appropriatenumber of ancestral populations for the data. In thiscase, only a single data set was used with the mutationparameter 4Nm set to 16. For each value of K from 1 to 5,we performed 10 runs of mStruct on the data and chosethe run with the best likelihood for model selection. Tochoose the best value of K, we used the BIC (Schwarz

1978) (that we previously used to decide the optimalnumber of ancestral alleles at each locus). The preferred


model is the one that has the minimum value of the BIC.Table 1 shows the BIC values for the values of K. FromTable 1, we can see that the model with K ¼ 2 ancestralpopulations is correctly chosen as the optimal model.

Empirical analysis of real data sets: The HGDP–CEPH cell line panel (Cann et al. 2002; Cavalli-Sforza

2005) used in Rosenberg et al. (2002) contains geno-type information from 1056 indviduals from 52 pop-ulations at 377 autosomal microsatellite loci, along withgeographical and population labels. The HGDP SNPdata (Conrad et al. 2006) contain the SNP genotypes at2834 loci of 927 unrelated individuals that overlap withthe HGDP–CEPH data. To make results for both types ofdata comparable, we chose the set of only thoseindividuals present in both data sets. As in Rosenberg

et al. (2002), the choice of the total number of ancestralpopulations can be left to the user; we tried K rangingfrom 2 to 5, and we applied the BIC to decide the Bayesoptimal number of ancestral populations within thisrange to be K ¼ 4. Below, we present the structuralanalysis under four ancestral populations.

Structural map from the HGDP–CEPH data: We com-pare the structural maps inferred from the micro-satellite data using mStruct and Structure in Figure 6.The most obvious difference between the maps pro-duced by both programs is the degree of admixing thatthe individuals in the program are assigned. Structureassigns each geographical population to a distinctancestral allele-frequency profile. This assignment isvery useful for partitioning individuals into separateclusters. However, in doing so, it is unable to capture thegenetic structural relationships between individuals. Itoffers no insights into the admixture history of pop-ulations, as mStruct does. In contrast, the structure mapproduced by mStruct from microsatellite data suggeststhat all populations share a common ancestral popula-tion as a unique extra component (represented by themagenta color in Figure 6) that characterizes theirparticular regional genotypes. A structure map, charac-terized thus by an underlying commonality in a part ofthe genetic ancestry, together with regional differences,clearly reveals the expansion of humans out of Africa

Figure 4.—Recovery of individual- and population-level admixture parameters.

Figure 5.—A comparison of the true and inferred ancestry proportions for a single example. (A) The true ancestry proportionsfor the sample. (B) The ancestry proportions inferred by Structure. (C) The ancestry proportions inferred by mStruct.


(Hammer et al. 1998; Templeton 2002). It is in thisregard that Structure and mStruct are significantlydifferent.

Both structure maps show that individuals having asimilar population label (at regional, national, orcontinental levels) have similar admixture proportions.The similarity is least if two individuals come fromdifferent continents and most if two individuals arefrom the same region. We can therefore represent eachregional population by the average of the admixtureproportions of all individuals from the region. Wecomputed the Euclidean distance between all pairs ofthe 52 regional populations and constructed a neigh-bor-joining tree from the distance matrices. Figure 7, Aand B, shows the neighbor-joining trees constructed forStructure and mStruct. It is important to note that thedistance measure used is not known to be a truemeasure of evolutionary distance. These trees havebeen constructed from a single instance of the distancematrix and have not been bootstrapped. Despite this, wecan see that the mStruct tree agrees quite well withpreviously constructed phylogenetic trees for humanpopulations (Bowcock et al. 1994). The phylogenyfrom mStruct appears to be more interpretable thanthat from Structure. In Figure 7B, we can see a tightercluster for the African populations and that Americanpopulations diverged after Asian and European popu-lations diverged, rather than before.

Analysis of the mutation spectra: Now we report apreliminary analysis of the evolutionary dynamics re-flected by the estimated mutation spectra of differentancestral populations (denoted ‘‘am-spectrum’’) and of

different modern geographical populations (denoted‘‘gm-spectrum’’), which is not possible by Structure. Forthe am-spectrum (Figure 8A), we compute the meanmutation rates over all loci and founding alleles for eachancestral population as estimated by mStruct. Weestimate the gm-spectrum (Figure 8B) as follows: forevery individual, a mutation parameter is computed asthe per-locus number of observed alleles that areattributed to mutations, weighted by the mutationparameters corresponding to the ancestral allele chosenfor that locus. This can be computed by observing thepopulation indicator (Z) and the allele indicator (C) foreach locus of the individual. We then compute thepopulation mutation parameters by averaging mutationparameters of all individuals having the same geo-graphical label.

As shown in the gm-spectrum in Figure 8B, themutation parameters for African populations are in-deed higher than those of other modern populations.Since the mutation parameter reflects effects of muta-tion rate and population age, this indicates that theydiverged earlier, a common hypothesis of humanmigration. Other trends in the gm-spectra also revealinteresting insights. We computed the empirical muta-tion parameters for each of the 52 subpopulationspresent in the data as we did for each continent. Sinceeach population has an associated latitude and longi-tude, this allows us to set up a function that maps ageographical latitude/longitude coordinate to an em-pirical mutation parameter. Figure 9 shows the contourplot of this function. The mutation parameter d in ourmodel is a measure of variability (a combination of pergeneration mutation rate and age of the population).Thus, the contour plots shows us how the amount ofvariability changes across the world. We can see that themaximum variation is in Africa. There is a decrease invariation as we move away from central Africa. We canalso see that the South American tribes have the leastamount of accumulated variation. This is in qualitativeagreement with the ages of different populations aspredicted by the ‘‘Out of Africa’’ hypothesis of humanmigration.

Structural map from the HGDP SNP data: Figure 10shows the structural maps produced by mStruct andStructure for the HGDP SNP data. We can see that thetwo population maps are nearly identical, which signals

TABLE 1

Model selection for simulated data: BIC valuesfor K from 1 to 5

K BIC

1 6.91 3 104

2 6.87 3 104

3 6.99 3 104

4 7.12 3 104

5 7.26 3 104

The model having a smaller BIC value (K¼ 2 in this case) ispreferred (numbers in boldface type).

Figure 6.—Ancestry structuremaps inferred from the microsa-tellite portion of the HGDP dataset, using mStruct and Structurewith four ancestral populations.The colors represent different an-cestral populations.


Figure 7.—Neighbor-joining trees constructed using mStruct and Structure for the 52 regional populations in the HGDP mi-crosatellite data. (A) Tree constructed using Structure. (B) Tree constructed using mStruct.


an inconsistency between the microsatellite and SNPmStruct results for the human data. However, there aresome important caveats that must be taken into consid-eration. In our analysis, we consider a simplistic Ber-noulli-like model of SNP mutation. While richermutation models could potentially reduce this difficulty,there is a more significant difficulty with the analysis ofSNP data. The biallelic nature of SNP markers makes itdifficult to draw any inferences about the correctnumber of ancestral alleles at a locus. For microsatel-lites, this problem is considerably easier due to theirmultiallelic nature. As a result, mStruct is unable toobtain more information about evolutionary historyfrom SNP markers than Structure does. As we explainedearlier, mStruct is an extension of Structure that findssignals about mutations present in the data. So in theevent that mStruct is unable to find any extra mutationinformation from the data, it is quite reasonable to expectits output to be nearly the same as that of Structure.

Model selection: As with all probabilistic models, weface a trade-off between model complexity and the log-likelihood value that the model achieves. In our case,complexity is controlled by the number of ancestralpopulations we pick, K. Unlike nonparametric or

infinite-dimensional models (Dirichlet processes,etc.), for models of fixed dimension, it is not clear ingeneral as to what value of K gives us the best balancebetween model complexity and log-likelihood. In suchcases, different information criteria are often used todetermine the optimal model complexity. To determinewhat number of ancestral populations fit the HGDPSNP and microsatellite data best, we computed BICscores for K¼ 2–5 for both kinds of data separately. Theresults are shown in Figure 11. From the BIC curves forboth SNP and microsatellite data, we can see that thecurves suggest K ¼ 4 as the best fit for the data.

DISCUSSION

The task of estimating the genetic contributions ofancestral populations, i.e., structural map estimation, ineach modern individual, is an important problem inpopulation genetics. Due to the relatively high rates ofmutation in markers such as microsatellites and SNPs,multilocus genotype data usually harbor a large amountof variations, which allows differentiation even betweenpopulations that have close evolutionary relationships.

Figure 8.—Am-spectrum andGm-spectrum inferred from themicrosatellite portion of theHGDP data set, using mStructwith four ancestral populations.The colors represent different an-cestral populations.

Figure 9.—Contourmap of the empirical muta-tion parameters over theworld map.


However, to our knowledge, none of the existing methodsis able to take advantage of this property to compare howmarker mutation rates vary with population and locus,while at the same time exploiting such information forpopulation structural estimation. Traditionally, popula-tion structure estimation and mutation spectrum estima-tion have been performed as separate tasks.

We have developed mStruct, which allows estima-tion of genetic contributions of ancestral populationsin each modern individual in light of both popula-tion admixture and allele mutation. The variationalinference algorithm that we developed allows tractableapproximate inference on the model. The ancestralproportions of each individual enable representingpopulation structure in a way that is visually easy tointerpret, as well as amenable to further computationalanalysis.

The statistical modeling differences between mStructand Structure provide an interesting insight into thepossible reasons that lead to mStruct inferring higherlevels of admixture than Structure. In Structure’srepresentation of population, every microsatellite alleleis considered to be a separate element of the popula-tion, even though they might be very similar. In theinheritance model representation, such alleles areconsidered to be possibly derived from a single ancestralallele. This can lead to detection of extra similarityamong individuals possessing these alleles. This isprobably the main reason that the inferred levels ofadmixture are higher in mStruct than in Structure.

Another parameter that would also affect inferredlevels of admixture is the d-parameter that determinesthe variance of the mutation distributions. Highervalues of d (tending to 1) lead to significantly higherlevels of inferred admixture. If a strong prior is not used,the d-values tend toward 1 in the initial few steps of thevariational EM algorithm. This seems to happen due tothe initial imprecise assignments for the z and c in-dicator variables. However, the region of high d-values isa region of low log-likelihood in the parameter spaceand the EM quickly finds a local optimum that isundesirable due to the low log-likelihood of that regionof the parameter space.

In conjunction with geographical location, the in-ferred ancestry proportions could be used to detectmigrations, subpopulations, etc. Moreover, the ability to

estimate population- and locus-specific mutation pa-rameters also allows us to substantiate evolutionarydynamics claims on the basis of high/low mutationparameters in certain geographical populations or onthe basis of high/low mutation parameters at certainloci in the genome. While the estimates of mutationparameters that mStruct provides are not on an absolutescale, the comparison of their relative magnitudes iscertainly informative.

The mutation model we currently use is a computa-tionally simple one. However, it lacks the ability todistinguish between the effects of per generation muta-tion rate and the age of the population. Under thestepwise mutation model, we can model inheritance byusing a more complex but powerful model, using Besselfunctions (Felsenstein 2004). This form would allowseparate inference of the per generation mutation rateas well as the age of the population.

As of now, a number of possible extensions remain tothe methodology we presented so far. It would beinstructive to see the impact of allowing linked loci asin Falush et al. (2003). We have not yet addressed theissue of the most suitable choice of mutation process,but instead have chosen one that is reasonable andcomputationally tractable. It would also be interestingto combine mStruct with the nonparametric Bayesianmodels based on the Dirichlet processes as in programssuch as Spectrum (Sohn and Xing 2007) and Structur-ama (Huelsenbeck and Andolfatto 2007).

Figure 10.—Ancestry structuremaps inferred from the SNP por-tion of the HGDP data set, usingmStruct and Structure with fourancestral populations.

Figure 11.—Model selection with BIC score for the HGDPdata with mStruct on SNP and microsatellite data.


In summary, current population stratification meth-ods such as Structure ignore the effects of allelemutations, which are a significant factor in shapingallele diversity in microsatellites in human populations.In doing so, they are restricted to clustering humangenetic data rather than being able to identify admixingof populations. Clustering is useful for populationstratification, but a more accurate representation ofevents such as genome variations might cast morelight on population evolutionary history. By incorpo-rating the effect of allele mutations, the mStructapproach developed in this article represents such anattempt to gain more insight into the fine structuresof genetic admixing of populations and their diver-gence times.

This material is based upon work supported by a National ScienceFoundation Career Award to E.P.X. under grant DBI-0546594 and NSFgrant CCF-0523757. E.P.X. is also supported by an Alfred P. SloanResearch Fellowship.

LITERATURE CITED

Anderson, E., and E. Thompson, 2002 A model-based method foridentifying species hybrids using multilocus genetic data. Genet-ics 160: 1217–1229.

Blei, D., A. Ng and M. Jordan, 2003 Latent Dirichlet allocation. J.Mach. Learn. Res. 3: 993–1022.

Bowcock, A., A. Ruiz-Linares, J. Tomfohrde, E. Minch, J. Kidd

et al., 1994 High resolution of human evolutionary trees withpolymorphic microsatellites. Nature 368: 455–457.

Cann, H., C. de Toma, L. Cazes, M. Legrand, V. Morel et al.,2002 A human genome diversity cell line panel. Science 296:261–262.

Cavalli-Sforza, L., 2005 The Human Genome Diversity Project:past, present and future. Nat. Rev. Genet. 6: 333–340.

Conrad, D., M. Jakobsson, G. Coop, X. Wen, J. Wall et al., 2006 Aworldwide survey of haplotype variation and linkage disequilib-rium in the human genome. Nat. Genet. 38: 1251–1260.

Corander, J., P. Waldmann and M. Sillanpaa, 2003 Bayesian anal-ysis of genetic differentiation between populations. Genetics 163:367–374.

Dietrich, W., H. Katz, S. Lincoln, H. Shin, J. Friedman et al.,1992 A genetic map of the mouse suitable for typing intraspe-cific crosses. Genetics 131: 423–447.

Erosheva, E., S. Fienberg and J. Lafferty, 2004 Mixed-member-ship models of scientific publications. Proc. Natl. Acad. Sci.USA 101: 5220–5227.

Excoffier, L., and G. Hamilton, 2003 Comment on genetic struc-ture of human populations. Science 300: 1877.

Falush, D., M. Stephens and J. Pritchard, 2003 Inference of pop-ulation structure using multilocus genotype data linked loci andcorrelated allele frequencies. Genetics 164: 1567–1587.

Felsenstein, J., 2004 Inferring Phylogenies. Sinauer Associates, Sun-derland, MA.

Hammer, M. F., T. Karafet, A. Rasanayagam, E. T. Wood, T. K.Altheide et al., 1998 Out of Africa and back again: nested

cladistic analysis of human y chromosome variation. Mol. Biol.Evol. 15: 427–441.

Henderson, S., and T. Petes, 1992 Instability of simple sequenceDNA in Saccharomyces cerevisiae. Mol. Cell. Biol. 12: 2749–2757.

Hudson, R., 1990 Gene genealogies and the coalescent process.Oxf. Surv. Evol. Biol. 7: 1–44.

Huelsenbeck, J., and P. Andolfatto, 2007 Inference of popula-tion structure under a Dirichlet process model. Genetics 175:1787–1802.

Jordan, M., Z. Ghahramani, T. Jaakkola and L. Saul, 1999 An in-troduction to variational methods for graphical models. Mach.Learn. 37: 183–233.

Kelly, R., M. Gibbs, A. Collick and A. Jeffreys, 1991 Spontaneousmutation at the hypervariable mouse minisatellite locus Ms6-hm:flanking DNA sequence and analysis of germline and early so-matic mutation events. Proc. Biol. Sci. 245: 235–245.

Lin, T., E. Myers and E. Xing, 2006 Interpreting anonymous DNAsamples from mass disasters–probabilistic forensic inference us-ing genetic markers. Bioinformatics 22: e298.

Minka, T., 2003 Estimating a Dirichlet distribution. Technical Re-port. Carnegie Mellon University, Pittsburgh.

Patterson, N., A. Price and D. Reich, 2006 Population structureand eigenanalysis. PLoS Genet. 2: e190.

Pella, J., and M. Masuda, 2006 The Gibbs and split-merge samplerfor population mixture analysis from genetic data with incom-plete baselines. Can. J. Fish. Aquat. Sci. 63: 576–596.

Pisani, D., L. Poling, M. Lyons-Weiler and S. Hedges, 2004 Thecolonization of land by animals: molecular phylogeny and diver-gence times among arthropods. BMC Biol. 2: 1.

Pritchard, J., M. Stephens and P. Donnelly, 2000 Inference ofpopulation structure using multilocus genotype data. Genetics155: 945–959.

Queller, D., J. Strassmann and C. Hughes, 1993 Microsatellitesand kinship. Trends Ecol. Evol. 8: 285–288.

Roeder, K., M. Escoar, J. Kadane and I. Balazs, 1998 Measuringheterogeneity in forensic databases using hierarchical Bayesmodels. Biometrika 85: 269.

Rosenberg, N., J. Pritchard, J. Weber, H. Cann, K. Kidd et al.,2002 Genetic structure of human populations. Science 298:2381–2385.

Schwarz, G., 1978 Estimating the dimension of a model. Ann. Stat.6: 461–464.

Sohn, K., and E. Xing, 2007 Spectrum: joint Bayesian inference ofpopulation structure and recombination events. Bioinformatics23: i479–i489.

Templeton, A., 2002 Out of Africa again and again. Nature 416:45–51.

Valdes, A., M. Slatkin and N. Freimer, 1993 Allele frequencies atmicrosatellite loci: the stepwise mutation model revisited. Genet-ics 133: 737–749.

Xing, E., M. Jordan and S. Russell, 2003 A generalized mean fieldalgorithm for variational inference in exponential families, pp.583–591 in Uncertainty in Artificial Intelligence (UAI2003). MorganKaufmann Publishers.

Zhivotovsky, L., P. Underhill, C. Cinnioglu, M. Kayser, B.Morar et al., 2004 The effective mutation rate at Y chromosomeshort tandem repeats, with application to human population-divergence time. Am. J. Hum. Genet. 74: 50–61.

Communicating editor: M. K. Uyenoyama


APPENDIX: DETAILS OF HYPERPARAMETER ESTIMATION

Bayes estimates of hyperparameters: Denote the original set of hyperparameters by

H ¼ fa;b;m; dg ðA1Þ

and the variational parameters for the nth individual by

Vn ¼ fgn; rn; jng: ðA2Þ

The variational lower bound to the log-likelihood for the nth individual is given by

LnðH; VnÞ ¼ Eq ½log pðxn;~un; z:;n; c:;n; HÞ� � Eq ½log qð~un; z:;n; c:;n; H; VnÞ�: ðA3Þ

The subscripts indicate the nth individual. In the analysis below, we use z.,n to denote {z1,n, . . . , zI,n} and c.,n to represent{c1,n, . . . , cI,n}. As described earlier, we partition the variational approximation as

qð~un; z:;n; c:;n; H; VÞ ¼ qð~unÞYI

i¼1

qðzi;nÞqðci;nÞ: ðA4Þ

So we can expand Equation 7 as

LnðH; ViÞ ¼ Eq ½log pð~un; aÞ�1 Eq ½log pðz:;n j~unÞ�1 Eq ½log pðc:;n j z:;nÞ�1 Eq ½log pðxn j c:;n; z:;n; bÞ� � Eq ½log qð~unÞ� � Eq ½log qðz:; nÞ� � Eq ½log qðc:; nÞ�: ðA5Þ

The lower bound to the total data log-likelihood is

LðH; VÞ ¼XNn¼1

LnðH; VnÞ;

which, on substituting from Equation A5, becomes

LðH; VÞ ¼XNn¼1

Eq ½log pð~un; aÞ�1XNn¼1

Eq ½log pðz:;n j~unÞ�

1XNn¼1

Eq ½log pðc:;n j z:;nÞ�1XNn¼1

Eq ½log pðxn j c:;n; z:;n; bÞ�

�XNn¼1

Eq ½log qð~unÞ� �XNn¼1

Eq ½log qðz:;nÞ�

�XNn¼1

Eq ½log qðc:;nÞ�: ðA6Þ

To compute Eq ½log pð~un; aÞ� and Eq ½log qð~unÞ�, we use the properties of a Dirichlet distribution, which is an exponentialfamily distribution. If u � Dir(a), then the exponential family representation of p(u; a) is given by

pðu; aÞ ¼ expXK

k¼1

ðak � 1Þ log uk

!1 log G

XK

k¼1

ak

!�XK

k¼1

log GðakÞ" #

: ðA7Þ

So the natural parameter of the Dirichlet is hk ¼ ak � 1 and the sufficient statistic is T(uk) ¼ log uk. The lognormalization factor is

PKk¼1 log G(ak)� log G(

PKk¼1 ak). For an exponential distribution, the derivative of the log

normalization factor with respect to the natural parameter is equal to the expected value of the sufficient statistic.Using this fact, we get


E½log uk ; a� ¼ cðakÞ � cX

k

ak

!; ðA8Þ

where c is the digamma function, the first derivative of the log gamma function. The remaining expectation terms inEquation A6 are expectations of multinomial parameters and hence are easy to calculate.

Simplifying each term in Equation A6, we get

LðH; VÞ ¼ N log GXK

k¼1

ak

!� N

XK

k¼1

log GðakÞ1XNn¼1

XK

k¼1

ðak � 1Þ cðgn;kÞ � cXK

k¼1

gn;k

!" #

1XNn¼1

XI

i¼1

XK

k¼1

rn;i;k cðgn;kÞ � cXK

k¼1

gn;k

!" #

1XNn¼1

XI

i¼1

XK

k¼1

XLi

l¼1

jn;i;l rn;i;k log bkil

1XNn¼1

XI

i¼1

XK

k¼1

XLi

l¼1

jn;i;l rn;i;k log ð1� dki Þ1 jxi;n � mi;l jlog dk

i � log ð1 1 dki � ðdk

i Þmi;l Þh i

�XNn¼1

log GXK

k¼1

gn;k

!�XK

k¼1

log Gðgn;kÞ1XK

k¼1

ðgn;k � 1Þ cðgn;kÞ � cXK

k¼1

gn;k

!" #" #

�XNn¼1

XI

i¼1

XLi

l¼1

jn;i;l log jn;i;l

�XNn¼1

XI

i¼1

XK

k¼1

rn;i;k log rn;i;k : ðA9Þ

Each line in Equation A9 corresponds to an expectation term in Equation A6. In the following sections, we brieflydescribe how the maximum-likelihood estimates of the hyperparameters were obtained from the variational lowerbound.

Estimating ancestral allele frequency profiles b: Since b is a table of probability distributions, the values of itselements are constrained by the equality

PLi

l¼1 bki;l ¼ 1 for all combinations of {i, k}. So to find the optimal values of b

satisfying this constraint while maximizing the variational lower bound, we introduce Lagrange multipliers ni,k. Thenew objective function to maximize is then given by

LnewðH; VÞ ¼ LðH; VÞ1XI

i¼1

XK

k¼1

ni;k

XLi

l¼1

bki;l � 1

!: ðA10Þ

Maximizing this objective function gives

bki;l ¼

PNn¼1 jn;i;l rn;i;kPLi

l¼1

PNn¼1 jn

n;i;l rn;i;k

: ðA11Þ

We use a uniform Dirichlet prior l on each multinomial~bk

i . Under this prior, it is not difficult to show that the estimateof bk

i;l changes to

bki;l ¼

l 1P

Nn¼1 jn

i;l rni;k

l 3 Li 1PLi

l¼1

PNn¼1 jn

i;l rni;k

: ðA12Þ

Estimating the Dirichlet prior on populations a: For estimating a we use the method described in Minka (2003).This gives a Newton–Raphson iteration for a that does not involve inversion of the Hessian and hence is reasonablyfast. The log-likelihood terms involving a are


LðH; VÞ ¼ N log GXK

k¼1

ak

!� N

XK

k¼1

log GðakÞ1XNn¼1

XK

k¼1

ðak � 1Þ cðgn;kÞ � cXK

k¼1

gn;k

!" #: ðA13Þ

The gradient of the log-likelihood with respect to ak is given by

gk ¼dLðH; VÞ

dak¼ N c

XK

k¼1

ak

!� N cðakÞ1

XNn¼1

cðgn;kÞ � cXK

k¼1

gn;k

!" #; ðA14Þ

where the digamma function used above is the first derivative of the logarithm of the gamma function.The second derivatives, which form the Hessian, can be computed as

dLðH; VÞd2ak

¼ N c9XK

k¼1

ak

!� N c9ðakÞ ðA15Þ

dLðH; VÞdakaj

¼ N c9XK

k¼1

ak

!ðk 6¼ jÞ; ðA16Þ

where c9, the trigamma function, is the derivative of the digamma function. The Hessian can then be written as

H ¼ Q 1 11Tz ðA17Þ

qj ;k ¼ �N c9ðakÞdðj � kÞ ðA18Þ

z ¼ N c9XK

k¼1

ak

!; ðA19Þ

where Q is a K 3 K matrix with elements qj,k. As we can see from the definition, Q is a diagonal matrix. The Newtonupdate equation we have is

anew ¼ aold � ðH�1gÞ: ðA20Þ

The inverse of the Hessian can be computed using the Sherman–Morris formula to be

H�1 ¼ Q�1 � Q�111TQ�1

1=z 1 1TQ�11: ðA21Þ

Therefore, we have that the update term is

ðH�1gÞk ¼gk � b

qk;k; ðA22Þ

where

b ¼P

Kk¼1 gk=qk;k

1=z 1P

Kk¼1 1=qk;k

:

So the update equation for ak is

anewk ¼ aold

k �gk � b

qk;k: ðA23Þ

Estimating the ancestral alleles m and the mutation parameters d: For finding the optimal values of m and d, we usesimple gradient ascent with line search. m-values are actually discrete variables; however, as an approximation, we


assume them to be continuous in the optimization and round off the result to the nearest integer. The gradient of thevariational lower bound with respect to mi,l is given by

@L

@mi;l

¼XNn¼1

XK

k¼1

jn;i;l rn;i;klog ðdki Þ signðxn;i � mi;lÞ1

ðdki Þmi;l

1 1 dki � ðdk

i Þmi;l

� �: ðA24Þ

The gradient with respect to dik is given by

@L

@dki

¼XNn¼1

XLi

l¼1

jn;i;l rn;i;k

jxn;i � mi;l jdk

i

� 1

1� dki

�1� mi;lðdk

i Þmi;l�1

1 1 dki � ðdk

i Þmi;l

� �: ðA25Þ

Since the values of d are constrained to be in [0, 1], we use the logit transformation to create a mapping from [0, 1] toR. This gives us the equations

si;k ¼ logdk

i

1� dki

� �dk

i ¼ sigmoidðsi;kÞ

@L

@si;k¼ @L

@dki

@dki

@si;k

¼ @L

@dki

3 dki ð1� dk

i Þ:

We can then perform gradient ascent on each m and d separately and repeat this a number of times, to obtain valuesthat increase the lower bound. To constrain values of the mutation parameter d to allow meaningful interpretation, weuse a b prior on it with a small expected value (�0.1). We denote the prior as b(z1, z2).

While the gradient methods developed are useful for small data sets, they are inefficient on larger data sets andincrease the time required for estimation. Hence we look at a couple of small approximations that help speed up thehyperparameter estimation. A careful look at the results that have been produced indicates that once the founderalleles have been picked initially by fitting a mixture of mutation distributions individually at each locus, the latergradient descent on m makes only very minor changes in their values, if any at all. So, to improve the speed of thealgorithm, we do not perform gradient descent on the founder alleles m but fix them after initialization. We showbelow an approximation for estimating the mutation parameter d.

For the estimation of the mutation parameter d, the only relevant term in the likelihood lower bound isthe term

Lðdki Þ ¼

XNn¼1

XLi

l¼1

jn;i;l rn;i;k 3 log f ðxn;i ; mi;l ; dki Þ

1ðdk

i Þz1�1ð1� dki Þz2�1

Bðz1; z2Þ1 ðterms not involving dk

i Þ:

ðA26Þ

And for the mutation distribution, we use the discrete distribution whose pdf is

f ðx jm; dÞ ¼ ð1� dÞdjx�mj

1 1 d� dm : ðA27Þ

Approximation: We assume d to be small in Equation A27. So we can ignore the term exponential in m in thedenominator, reducing it to only (1 1 d). The expansion of (1 1 d)�1 is given by

1

1 1 d¼ 1� d 1 d2 � d3 1 . . . ðA28Þ

$ 1� d: ðA29Þ

This gives us a lower bound to the mutation distribution of


flbðx jm; dÞ ¼ ð1� dÞ2djx�mj: ðA30Þ

It is not hard to show that using this form for the mutation distribution allows a closed-form maximum-likelihoodestimate for d. This approximation gives us a lower bound to the likelihood that is not as tight as the variational lowerbound. However, it offers a significant improvement in time complexity due to the existence of a closed-form solution,thus avoiding the need for slow gradient-based methods. Under this approximation, the maximum-likelihoodestimate of dk

i for the microsatellite mutation model is given by

dki ¼

z1 1P

Nn¼1

PLil¼1 j n; i; lrn;i;k jxn;i � mi;l j

z2 1P

Nn¼1

PLil¼1 jn;i;l rn;i;kð2 1 jxn;i � mi;l j Þ

: ðA31Þ


Documents

mStruct: Inference of Population Structure in Light of Both ... - Genetics · population genetics problems (Queller et al. 1993; Hammer et al. 1998; Templeton 2002). Various methods