

Theoretical Population Biology 53, 143–151 (1998)

Maximum Likelihood Estimation of Population Divergence Times and Population Phylogenies under the Infinite Sites Model

Rasmus Nielsen*
Department of Integrative Biology, University of California, Berkeley, California 94720-3140

E-mail: rasmus@mws4.biol.berkeley.edu

Received January 24, 1997

In this paper, a maximum likelihood estimator of population divergence time based on the infinite sites model is developed. It is demonstrated how this estimator may be applied to obtain maximum likelihood estimates of the topology of population phylogenies. This approach addresses several classical problems occurring in the inference of the phylogenetic relationship of populations, most notably the problem of shared ancestral polymorphisms. The method is applied to previously published data sets of human African populations and of Caribbean hawksbill turtles. © 1998 Academic Press

INTRODUCTION

In recent years, the genetic analysis of population subdivision has undergone dramatic development. Phylogenetic approaches have been championed by numerous authors (Vigilant et al., 1989; Slatkin and Maddison, 1990; Mountain and Cavalli-Sforza, 1994; Patton et al., 1994; Templeton et al., 1995). Perhaps the most notable application of phylogenetic approaches is the analysis of the divergence of human ethnic groups out of Africa (Vigilant et al., 1989; Watson et al., 1996). In this type of analysis a gene genealogy is estimated and conclusions regarding migration and divergence between populations are inferred from the estimated gene genealogy. One of the most serious challenges in this type of analysis is the potential lack of concordance between population phylogenies and gene genealogies even in the absence of migration between the populations. The branching pattern in a gene genealogy consisting of genes sampled from several populations may be caused both by divergence between the genes after the time of population separation and by divergence before the time of population separation. This problem of shared ancestral polymorphisms has, in certain parts of the literature, been termed lineage sorting (see, for example, Avise, 1994). The problem of shared ancestral polymorphisms may be relevant not only within species but also between species. For example, the difficulty in determining the branching pattern of humans, chimpanzees, and gorillas (see, for example, Horai et al., 1992) may very well reflect that a large proportion of the divergence between individuals occurred in the time before speciation.

Several authors have attempted to account for the effect of shared ancestral polymorphisms. Simple probabilistic arguments based on coalescent theory have been used to assess the probability of discordance between population phylogeny and gene genealogy (Takahata, 1989; Wu, 1991; Hudson, 1992). However, a more appropriate approach would be to estimate the parameters of the underlying demographic process directly. For example, if one is interested in the divergence time between populations, this time should be estimated directly instead of estimating the coalescence time of the gene genealogy and then subsequently relating these estimates to the divergence times of the populations. Obtaining a direct estimate of population divergence times requires integration over all possible coalescence times and topologies of the gene genealogies. One of the reasons why such an analysis has not been performed is that no analytical tools have previously been available for performing such integration. However, recent computational advances in population genetics have made this type of analysis tractable (Kuhner et al., 1995; Griffiths and Tavaré, 1994a, b).

In this paper, a maximum likelihood method for directly estimating the divergence time of populations under the infinite sites model is presented. It is shown how this approach leads to a method for estimating population phylogenies. In the applications section, the population divergence times of two human populations from Africa are estimated using a previously published data set of human mitochondrial DNA. In addition, the population phylogeny of three populations of hawksbill turtles is estimated using previously published data.

* Current address: Museum of Comparative Zoology, Harvard University, 26 Oxford St., Cambridge, Massachusetts 02138.

Article No. TP971348. 0040-5809/98 $25.00. Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.

THE METHOD OF GRIFFITHS AND TAVARÉ IN THE ONE-POPULATION CASE

Before embarking on the analysis of multiple populations it is useful to first review some of the results and terminology applied in the method of estimating $\theta = 4N\mu$ (where $\mu$ is the mutation rate in the entire haplotype and $N$ is the population size) described by Griffiths and Tavaré (1994a, 1995). Subsequently, it will be shown how this methodology can be applied in the estimation of parameters of the demographic process for multiple populations.

In the following, we will assume that the divergence within a population follows Kingman's coalescent process (Kingman, 1982; see Tavaré, 1984, or Hudson, 1991, for a review). This implies, among other assumptions, that we assume random mating, selective neutrality between genes and a constant population size. However, as demonstrated by Griffiths and Tavaré (1994), changes in population size can easily be incorporated into the model. We will further assume that the mutational process follows an infinite sites model (Kimura, 1969). This implies that multiple mutations in the same nucleotide site are not allowed. The sample of haplotypes, under the infinite sites model, can be represented by a matrix of binary characters since only one mutation can occur in each site. For example, had we obtained a sample from a single population consisting of 5 haplotypes containing 4 variable sites:

atgc
accc
acgc
accc
gcct

this data set could be coded in the binary matrix ($S$), with arbitrary labeling of zeros and ones, as

$$S = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{pmatrix} \quad \text{and} \quad n = \begin{pmatrix} 1 \\ 2 \\ 1 \\ 1 \end{pmatrix},$$

where each row of $S$ is a distinct haplotype, each column is a segregating site, and $n$ is a vector containing the counts or multiplicities of each haplotype (type of haplotype). The quantity we are interested in obtaining is the likelihood of $\theta$, that is, the probability of observing our particular sample of ordered haplotypes given $\theta$. Denote this probability $p(S, n)$. To obtain an expression for this probability we will consider the genealogical history of the haplotypes. We will sum over all possible ancestral samples at a previous time which by one mutation or one coalescence event could be transformed into the present sample. In order to do this some notation must be introduced.
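The binary coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_haplotypes` is hypothetical, and the labeling rule (the state carried by the first haplotype is coded 0) is one arbitrary choice among many, which the text explicitly allows.

```python
from collections import Counter


def encode_haplotypes(sequences):
    """Return (S, n) for a list of equal-length aligned sequences.

    S has one row per distinct haplotype and one column per variable site;
    n holds the multiplicity of each distinct haplotype.
    Assumption: the state of the first haplotype at each site is labeled 0.
    """
    counts = Counter(sequences)          # keeps order of first appearance
    distinct = list(counts)
    n = [counts[h] for h in distinct]

    length = len(sequences[0])
    # A site is variable if more than one state is observed there.
    variable = [l for l in range(length)
                if len({h[l] for h in distinct}) > 1]

    S = []
    for h in distinct:
        row = []
        for l in variable:
            if len({g[l] for g in distinct}) != 2:
                # More than two states would violate the infinite sites model.
                raise ValueError("site %d is not binary" % l)
            row.append(0 if h[l] == distinct[0][l] else 1)
        S.append(row)
    return S, n


S, n = encode_haplotypes(["atgc", "accc", "acgc", "accc", "gcct"])
for row, count in zip(S, n):
    print(row, count)
```

With this labeling the five example haplotypes give the matrix and count vector shown in the text; a different labeling of zeros and ones would give an equivalent data set.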

A sample in which a coalescence event just occurred between haplotypes of the $k$th type is given by $(S, n - e_k)$, where $e_k$ is a unit vector that subtracts 1 from entry $k$ of $n$. If the last event was a mutation, there are two possibilities. If the mutation originally happened in a haplotype with only one copy in the sample, then the sample before the mutation is given by $(S^l, n)$, where $S^l$ denotes a matrix identical to $S$ but with the $l$th column removed, corresponding to the elimination of one segregating site. If the mutation occurred from a haplotype ($j$) with multiple copies in the sample, then the sample before mutation is given by $(S^l_k, n^k + e_j)$, where $S^l_k$ denotes a matrix identical to $S$ but with column $l$ and row $k$ removed and $n^k$ denotes a vector identical to $n$ but with entry $k$ removed. This corresponds to eliminating the one segregating site that distinguished haplotype $k$ from haplotype $j$.

Now, note that conditional on either a mutation or coalescence event occurring, the probability that it was a coalescence event is

$$\frac{n-1}{n-1+\theta}$$

and the probability that it was a mutation is

$$\frac{\theta}{n-1+\theta},$$

where $n$ is the sample size.

Under the infinite sites model, mutations to previously existing haplotypes (back mutations) are not allowed. Thus, mutations can have occurred as the most recent event only in haplotypes of multiplicity 1. The probability of a mutation in haplotype $k$ ($n_k = 1$) is $1/n$, where $n_k$ is the multiplicity of haplotype $k$. Likewise, the probability of a coalescence event in haplotype $k$ ($n_k > 1$), given that a coalescence occurred, is $n_k(n_k-1)/[n(n-1)]$.
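The two conditional probabilities above follow from comparing competing exponential rates. As a short worked step, assuming time is scaled so that the $n$ lineages coalesce at total rate $n(n-1)/2$ and mutate at total rate $n\theta/2$ (the same form of rates the paper quotes later for the two-population case):

```latex
P(\text{coalescence}) = \frac{n(n-1)/2}{n(n-1)/2 + n\theta/2}
                      = \frac{n-1}{n-1+\theta},
\qquad
P(\text{mutation}) = \frac{n\theta/2}{n(n-1)/2 + n\theta/2}
                   = \frac{\theta}{n-1+\theta}.
```

The two probabilities sum to one, and the common factor $n/2$ cancels, which is why only $n-1$ and $\theta$ appear.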

Now, by summing over all possible previous states in the genealogy we obtain the following recursion

$$p(S, n) = \frac{1}{n(n-1+\theta)} \left[ \sum_{k \in Z_1} n_k(n_k-1)\, p(S, n-e_k) + \theta \sum_{k \in Z_2} p(S^l, n) + \theta \sum_{k \in Z_3} p(S^l_k, n^k+e_j) \right] \tag{1}$$

(Griffiths and Tavaré, 1995). The first sum (the coalescence case) is over all haplotypes with multiplicities larger than one, i.e.,

$$Z_1 = \{k : n_k \ge 2\}.$$

The second sum (mutation from a single copy haplotype) is over all haplotypes with multiplicity 1 that differ from all other haplotypes by at least two mutations and that contain a site with a unique mutation, i.e.,

$$Z_2 = \{k : n_k = 1,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c,\ s^l_{k\cdot} \ne s^l_{j\cdot},\ k \ne j\},$$

where $e_k^c$ is the complement of $e_k$, $s_{i\cdot}$ is the $i$th row of $S$ and $s_{\cdot l}$ is the $l$th column of $S$. The third sum (mutation from a multiple copy haplotype) is over haplotypes with multiplicity 1 that contain a site that differs from all other haplotypes and which are indistinguishable from another haplotype except for this site, i.e.,

$$Z_3 = \{k : n_k = 1,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c,\ s^l_{k\cdot} = s^l_{j\cdot},\ k \ne j\}.$$

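As an example, assume that the following sample was observed:

$$S = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad n = \begin{pmatrix} 2 \\ 1 \end{pmatrix},$$

then $Z_1 = \{1\}$, $Z_2 = \{2\}$ and $Z_3 = \{\ \}$. Therefore

$$p(S, n) = \frac{1}{3(2+\theta)} \left( 2\, p\!\left( \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right) + \theta\, p\!\left( \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \end{pmatrix} \right) \right).$$

For a more rigorous derivation of Eq. (1) consult Ethier and Griffiths (1987), who provide a treatment of sample probabilities under the infinite sites model using measure theory. Hudson and Kaplan (1986) derived a similar recursive equation for the infinite alleles model. Eq. (1) first appeared in the exact form presented here in Griffiths and Tavaré (1995).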

Note that in theory, $p(S, n)$ can be calculated directly from this recursion by iteration and by specifying the boundary condition

$$p(2, m) = \frac{1}{1+\theta} \left( \frac{\theta}{1+\theta} \right)^{m} \tag{2}$$

(Watterson, 1975), where $p(2, m)$ is the probability of obtaining a particular sample of two haplotypes with $m$ segregating sites. However, for samples of sizes larger than 15–20, the number of possible states in the recursion is so large that a direct evaluation is not computationally possible. Instead, the Markov chain Monte Carlo approach by Griffiths and Tavaré (1994a, 1994b, 1995) can be applied to provide estimates of $p(S, n)$. Griffiths and Tavaré (1995) developed a method for evaluating a recursion similar to Eq. (1) and applied the method to evaluate the probabilities of unrooted trees. The derivation of a Markov chain Monte Carlo approach for Eq. (1) follows trivially from the derivation for the tree probabilities given by Griffiths and Tavaré (1995) since it only involves a slight change in state space. Sequences under the infinite sites model can simply be interpreted as unrooted genealogical trees, with possible multifurcations in cases where the data do not allow distinction between different alleles (lineages).
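The boundary condition in Eq. (2) is a geometric distribution in the number of segregating sites $m$. A minimal sketch follows; the function name `p_two` is hypothetical and the value $\theta = 0.9$ is used only as an illustration.

```python
def p_two(m, theta):
    """Boundary condition p(2, m) of Eq. (2): two haplotypes, m segregating sites."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** m


theta = 0.9  # illustrative value only
probabilities = [p_two(m, theta) for m in range(200)]
# The terms form a geometric series, so they should sum to (almost exactly) one.
print(sum(probabilities))
```

Because the terms decay geometrically with ratio $\theta/(1+\theta)$, truncating the sum at a moderate $m$ already captures essentially all of the probability mass.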

By defining a Markov chain with the same state space as the recursion (Eq. (1)), one may evaluate $p(S, n)$ by evaluating only a subset of all possible paths in the recursion. Paths are chosen by simulating along the Markov chain. In particular, define a function

$$f(S, n) = \frac{\sum_{k=1}^{d} n_k(n_k-1) + \theta v}{n(n+\theta-1)},$$

where $v$ is given by

$$v = |\{k : n_k = 1,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c \text{ for some } l\}|$$

and $|\{\ldots\}|$ indicates the size of a set, and define a Markov chain with the following transition probabilities:

$$\begin{aligned}
(S, n) &\to (S, n-e_k) && \text{with probability } \frac{n_k(n_k-1)}{f(S, n)\, n(n+\theta-1)}, \\
(S, n) &\to (S^l, n) && \text{with probability } \frac{\theta}{f(S, n)\, n(n+\theta-1)}, \\
(S, n) &\to (S^l_k, n^k+e_j) && \text{with probability } \frac{\theta}{f(S, n)\, n(n+\theta-1)},
\end{aligned} \tag{3}$$

where the conditions under which transitions of the three types are allowed are given by $k \in Z_1$, $k \in Z_2$, and $k \in Z_3$, respectively. Repeated simulation of the Markov chain until it hits the absorbing state ($n = 2$) provides an estimate of $p(S, n)$:

$$\hat{p}(S, n) = \frac{1}{v} \sum_{j=1}^{v} p(2, m(\eta)) \prod_{i=0}^{\eta-1} f(S(i), n(i)), \tag{4}$$

where $v$ is here the number of simulations performed, $\eta$ is the random number of states the chain visits until the absorbing state ($n = 2$) is hit, $m(\eta)$ is the number of segregating sites between the two remaining haplotypes at time $\eta$, and $f(S(i), n(i))$ is the value of $f(S, n)$ at the $i$th state the Markov chain visits. A proof of this result will not be provided here. Readers interested in the derivation should consult Griffiths and Tavaré (1994a, b).
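The function $f(S, n)$ is the sum of the coefficients in the recursion, which is what makes the transition probabilities sum to one. A minimal sketch for computing it is given below, assuming $S$ is stored as a list of rows (one per distinct haplotype) and $n$ as a list of counts; the function name `f_value` is hypothetical.

```python
def f_value(S, n, theta):
    """Evaluate f(S, n) for a one-population sample.

    S: list of rows of 0/1 (distinct haplotypes), n: list of multiplicities.
    """
    total = sum(n)                       # sample size
    d = len(n)                           # number of distinct haplotypes
    num_sites = len(S[0]) if S else 0

    coalescence_weight = sum(nk * (nk - 1) for nk in n)

    # v counts singleton haplotypes that carry a mutation found in no other
    # haplotype, i.e. some column equals e_k or its complement.
    v = 0
    for k in range(d):
        if n[k] != 1:
            continue
        e_k = [1 if i == k else 0 for i in range(d)]
        e_k_complement = [1 - x for x in e_k]
        for l in range(num_sites):
            column = [S[i][l] for i in range(d)]
            if column == e_k or column == e_k_complement:
                v += 1
                break

    return (coalescence_weight + theta * v) / (total * (total + theta - 1))


# Two-haplotype-type example from the text: S = [[0, 1], [1, 0]], n = (2, 1).
print(f_value([[0, 1], [1, 0]], [2, 1], theta=1.0))
```

For the small example in the text the numerator is $2 + \theta$ and the denominator is $3(2 + \theta)$, so the value does not depend on $\theta$ in that particular case.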

TWO POPULATIONS

The aim of this section is to derive a method for evaluating the likelihood function for two populations that diverged some time ($T$) in the past. $T$ refers here to the scaled time, that is, the divergence time measured in generations divided by the effective population size. In the two-population case we also denote the matrix of haplotypes by $S$, but there are now two vectors $n_1$ and $n_2$, containing the multiplicities of the haplotypes in population 1 and in population 2, respectively. Also let $S(T^-)$, $n_1(T^-)$, and $n_2(T^-)$ denote the particular values of $S$, $n_1$ and $n_2$, respectively, in the ancestry at the time right before $T$, and $S(T^+)$, $n_1(T^+)$ and $n_2(T^+)$ the values right after $T$, looking forward in time (Fig. 1). Then, the probability of obtaining a particular sample of haplotypes from two populations is given by

$$\begin{aligned}
p_2(S, n_1, n_2 \mid T)
&= \sum_{S(T^+),\, n_1(T^+),\, n_2(T^+)} p_2(S, n_1, n_2 \mid S(T^+), n_1(T^+), n_2(T^+))\; p(S(T^+), n_1(T^+), n_2(T^+)) \\
&= \sum_{S(T^+),\, n_1(T^+),\, n_2(T^+)} p_2(S, n_1, n_2 \mid S(T^+), n_1(T^+), n_2(T^+))\; p(S(T^-), n_1(T^-)+n_2(T^-)) \\
&\qquad \times\, p(S(T^+), n_1(T^+), n_2(T^+) \mid S(T^-), n_1(T^-)+n_2(T^-)),
\end{aligned}$$

FIG. 1. The divergence of two populations from a common ancestral population.

where $p_2$ denotes the probability of observing a two-population sample. Notice that the sum is over all possible values of $(S(T^+), n_1(T^+), n_2(T^+))$ that could lead to a sample of $(S, n_1, n_2)$. If both the haplotypes and the two populations are ordered, no combinatorial factor is involved when considering the events at the time of population divergence. Therefore

$$p_2(S, n_1, n_2 \mid T) = \sum_{S(T^+),\, n_1(T^+),\, n_2(T^+)} p_2(S, n_1, n_2 \mid S(T^+), n_1(T^+), n_2(T^+))\; p(S(T^-), n_1(T^-)+n_2(T^-)). \tag{5}$$

This probability provides the likelihood function of the population divergence time ($T$).

It was previously shown how to estimate $p(S, n)$. It remains to be demonstrated how to calculate $p_2(S, n_1, n_2 \mid S(T^+), n_1(T^+), n_2(T^+))$ and perform the summation over all possible values of $(S(T^+), n_1(T^+), n_2(T^+))$. In order to develop such a method we need to provide a recursive equation for the probability of obtaining two particular samples from two different populations given that the two populations diverged from a common population some time $T$ ago and given that the last mutation or coalescence in the genealogy of the genes happened at a time $\tau$ in the past, where $\tau < T$. Notice that since both the time to coalescence and the time to a mutation are exponentially distributed, the relative probability of observing a mutation or coalescence is independent of $T$. In fact, the relative rate argument normally applied in coalescence models still holds when conditioning on the time of the next event. The reason for this is that $P(x < y \mid \min(x, y) = T) = a/(a+b)$, where $x$ and $y$ are exponential random variables with rates $a$ and $b$. Now for $t < T$, coalescences occur with rates $n_1(n_1-1)/2$ and $n_2(n_2-1)/2$ and mutations occur with rate $(n_1+n_2)\theta/2$, assuming the population sizes are constant. Let $f(\tau)$ be the density function of an exponential random variable with parameter $[n_1(n_1-1)+n_2(n_2-1)+(n_1+n_2)\theta]/2$ and let $F(\tau)$ be the corresponding CDF. For $\tau < T$, $\tau$ is the time to the next coalescence event or mutation looking back in time. If $\tau \ge T$, no coalescence events or mutations happened before $T$. This event will occur with probability $1 - F(T)$. In that case,

$$p_2(S, n_1, n_2 \mid \tau, T)_{\tau \ge T} = p(S, n_1+n_2). \tag{6}$$
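The total event rate and the probability $1 - F(T)$ that no event occurs before the populations merge can be sketched as follows. The function names are hypothetical, and the sample sizes and parameter values in the usage lines are illustrative only.

```python
import math
import random


def total_event_rate(n1, n2, theta):
    """Rate of the next coalescence or mutation while the populations are separate."""
    return (n1 * (n1 - 1) + n2 * (n2 - 1) + theta * (n1 + n2)) / 2.0


def prob_no_event_before(T, n1, n2, theta):
    """1 - F(T): probability that the waiting time tau is at least T."""
    return math.exp(-total_event_rate(n1, n2, theta) * T)


def sample_waiting_time(n1, n2, theta, rng):
    """Draw tau from the exponential distribution with the total event rate."""
    return rng.expovariate(total_event_rate(n1, n2, theta))


rng = random.Random(1)
print(total_event_rate(3, 2, 1.0))           # 6.5
print(prob_no_event_before(0.5, 3, 2, 1.0))  # exp(-6.5 * 0.5)
print(sample_waiting_time(3, 2, 1.0, rng))
```

Which kind of event occurred is then decided separately from the relative rates, which is the content of the relative rate argument above.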

Conditional on $\tau < T$, we arrive at the following equation

$$\begin{aligned}
p_2(S, n_1, n_2 \mid \tau, T)_{\tau < T} = \frac{1}{n_1(n_1-1+\theta) + n_2(n_2-1+\theta)} \Bigg[
&\sum_{k \in Z_{11}} n_{1k}(n_{1k}-1)\, p_2(S, n_1-e_k, n_2 \mid T-\tau) \\
&+ \theta \sum_{k \in Z_{21}} p_2(S^l, n_1, n_2 \mid T-\tau) \\
&+ \theta \sum_{k \in Z_{31}} p_2(S^l_k, n_1^k+e_j, n_2 \mid T-\tau) \\
&+ \sum_{k \in Z_{12}} n_{2k}(n_{2k}-1)\, p_2(S, n_1, n_2-e_k \mid T-\tau) \\
&+ \theta \sum_{k \in Z_{22}} p_2(S^l, n_1, n_2 \mid T-\tau) \\
&+ \theta \sum_{k \in Z_{32}} p_2(S^l_k, n_1, n_2^k+e_j \mid T-\tau) \Bigg],
\end{aligned} \tag{7}$$

where $n_{ik}$ is the multiplicity of haplotype $k$ in population $i$ and $n_i$ is the size of the sample from population $i$. This equation follows from precisely the same arguments provided when establishing the one-population recursion. However, now four events are possible: a coalescence in population 1, a coalescence in population 2, a mutation in population 1, or a mutation in population 2. Again, haplotypes can only be newly mutated if they have multiplicity 1 in the entire sample (population 1 and population 2). Therefore the set $Z_{1i}$ is given by

$$Z_{1i} = \{k : n_{ik} \ge 2\}.$$

Likewise, $Z_{2i}$ and $Z_{3i}$ are given by

$$Z_{2i} = \{k : n_{ik} = 1,\ n_{(3-i)k} = 0,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c,\ s^l_{k\cdot} \ne s^l_{j\cdot},\ k \ne j\}$$

and

$$Z_{3i} = \{k : n_{ik} = 1,\ n_{(3-i)k} = 0,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c,\ s^l_{k\cdot} = s^l_{j\cdot},\ k \ne j\}.$$

Notice that we can use Eq. (6) and Eq. (7) to write

$$p_2(S, n_1, n_2 \mid T) = \int_0^T p_2(S, n_1, n_2 \mid \tau, T)_{\tau < T}\, f(\tau)\, d\tau + (1 - F(T))\, p(S, n_1+n_2). \tag{8}$$

This expression suggests that $p_2(S, n_1, n_2 \mid T)$ can be evaluated by Monte Carlo integration in a simulation scheme similar to the one devised in the one-population case by Griffiths and Tavaré (1995). Analogous to the one-population case, Markov chain Monte Carlo simulations can be performed along Eq. (7) by specifying the following transition probabilities:

$$\begin{aligned}
(S, n_1, n_2) &\to (S, n_i-e_k, n_{3-i}) && \text{with probability } \frac{n_{ik}(n_{ik}-1)}{f_2(S, n_1, n_2)\,[n_1(n_1+\theta-1)+n_2(n_2+\theta-1)]}, \\
(S, n_1, n_2) &\to (S^l, n_1, n_2) && \text{with probability } \frac{\theta}{f_2(S, n_1, n_2)\,[n_1(n_1+\theta-1)+n_2(n_2+\theta-1)]}, \\
(S, n_1, n_2) &\to (S^l_k, n_i^k+e_j, n_{3-i}) && \text{with probability } \frac{\theta}{f_2(S, n_1, n_2)\,[n_1(n_1+\theta-1)+n_2(n_2+\theta-1)]},
\end{aligned} \tag{9}$$

where the conditions under which transitions of type 1, 2 and 3 are allowed are given by $Z_{1i}$, $Z_{2i}$ and $Z_{3i}$, respectively, and $f_2(S, n_1, n_2)$ is defined as

$$f_2(S, n_1, n_2) = \frac{\sum_{k=1}^{d} \left[ (n_{1k}-1)\, n_{1k} + (n_{2k}-1)\, n_{2k} \right] + \theta (v_1+v_2)}{(n_1+\theta-1)\, n_1 + (n_2+\theta-1)\, n_2}, \tag{10}$$

$$v_i = |\{k : n_{ik} = 1,\ n_{(3-i)k} = 0,\ s_{\cdot l} = e_k \text{ or } s_{\cdot l} = e_k^c \text{ for some } l\}|.$$

While simulating along this Markov chain, time is kept track of by summing up deviates from the exponential distribution, i.e., the time to the $k$th mutation or coalescence event is given by $t = \sum_{i=1}^{k} t_i$, where $t_i$ is obtained by simulating an exponential distribution with parameter $[n_1(n_1-1)+n_2(n_2-1)+\theta(n_1+n_2)]/2$ if there were $n_i$ copies of the gene in population $i$ at time $t_{(i-1)}$. Now, simulations can be performed along this Markov chain until $t \ge T$. Denote the configuration of the sample at time $t$ by $(S(t), n_1(t), n_2(t))$. Then

$$p_2(S, n_1, n_2 \mid T) = E\left[ p(S(T^+), n_1(T^+)+n_2(T^+)) \prod_{j=1}^{N_T} f_2(S(j), n_1(j), n_2(j)) \right],$$

which suggests the following estimator of $p_2$:

$$\hat{p}_2(S, n_1, n_2 \mid T) = \frac{1}{v} \sum_{i=1}^{v} \prod_{j=1}^{N_T} f_2(S(j), n_1(j), n_2(j)) \prod_{j=1}^{\eta} f(S(j), n(j)). \tag{11}$$

In other words, the likelihood function of $T$ can be evaluated by performing simulations of the Markov chain given by Eq. (9) while $t \le T$ and along the Markov chain given by Eq. (3) when $t > T$, while evaluating $f_2(S, n_1, n_2)$ and $f(S, n)$ before and after $T$, respectively.

Boundary conditions are given by $p(2, m)$ if $t > T$. While $t \le T$, boundary conditions are obtained by taking the convolution of the distribution of the number of segregating sites which arose while the populations were isolated and the number of segregating sites which arose in the time before isolation. In particular

$$\begin{aligned}
p_2(2, m \mid T) &= \sum_{i=0}^{m} \frac{(\theta(T-t))^{m-i}\, e^{-\theta(T-t)}}{(m-i)!} \left( \frac{1}{1+\theta} \right) \left( \frac{\theta}{1+\theta} \right)^{i} \\
&= \frac{(\theta(T-t))^{m}\, e^{-\theta(T-t)}\, F(1, -m; 1; -[(T-t)+\theta(T-t)]^{-1})}{(1+\theta)\, m!}
\end{aligned} \tag{12}$$

(Mathematica, 1988), where $F$ is the generalized hypergeometric function.
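The first line of Eq. (12) can be evaluated directly as a finite sum, which avoids the hypergeometric function altogether. The sketch below assumes only the summation form given above; the function name `p_two_pop_boundary` and the argument `elapsed` (standing for $T - t$) are hypothetical.

```python
import math


def p_two_pop_boundary(m, theta, elapsed):
    """Summation form of Eq. (12); elapsed is the remaining isolation time T - t."""
    a = theta * elapsed
    total = 0.0
    for i in range(m + 1):
        # m - i sites arose during isolation (Poisson with mean theta * (T - t)).
        isolated = a ** (m - i) * math.exp(-a) / math.factorial(m - i)
        # i sites arose in the ancestral population (geometric, as in Eq. (2)).
        ancestral = (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** i
        total += isolated * ancestral
    return total


theta = 0.9  # illustrative values only
print(p_two_pop_boundary(3, theta, 0.0))   # reduces to p(2, 3) of Eq. (2)
print(sum(p_two_pop_boundary(m, theta, 0.7) for m in range(60)))
```

When the remaining isolation time is zero the Poisson part contributes only its $m - i = 0$ term, so the expression collapses to the one-population boundary condition, which is a convenient consistency check.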

MULTIPLE POPULATIONS

Analogous to the two-population case, a method for estimating the likelihood of an entire population tree with specified divergence times can be established. For $r$ populations, a vector of population divergence times $\mathbf{T} = (T_1, T_2, \ldots, T_{r-1})$ is constructed such that the first entry specifies the time from the present at which population 1 and population 2 merge, the second entry specifies the divergence time for population (1, 2) and 3, etc. The quantity of interest is the conditional probability of obtaining a set of samples from multiple populations given a specified topology of the population tree and the population divergence times ($\mathbf{T}$). Equations equivalent to Eqs. (3) and (7) can easily be established. For example, the conditional probability of observing a set of samples from $r$ populations, given that the last mutation or coalescence happened at time $\tau$ before the last population divergence time (looking backward in time), is given by

$$\begin{aligned}
p_r(S, n_1, \ldots, n_r \mid \tau, \mathbf{T}) = \frac{1}{\sum_{j=1}^{r} n_j(n_j-1+\theta)} \sum_{i=1}^{r} \Bigg(
&\sum_{k \in Z_{1i}} n_{ik}(n_{ik}-1)\, p_r(S, n_1, \ldots, n_i-e_k, \ldots, n_r \mid \mathbf{T}-\tau) \\
&+ \sum_{k \in Z_{2i}} \theta\, p_r(S^l, n_1, \ldots, n_r \mid \mathbf{T}-\tau) \\
&+ \sum_{k \in Z_{3i}} \theta\, p_r(S^l_k, n_1, \ldots, n_i^k+e_j, \ldots, n_r \mid \mathbf{T}-\tau) \Bigg).
\end{aligned} \tag{13}$$

The Markov chain Monte Carlo method for estimating the likelihood follows trivially from the derivation provided for two populations and will not be derived here. Note, however, that the likelihood is evaluated by simulations through a Markov chain backwards in time until all populations have merged into one ancestral population. By evaluating the likelihood for different population trees, the maximum likelihood tree can be found just as in the case of single sequence phylogenetic inference.

APPLICATIONS

Two applications of the methods developed here will be presented. First, a maximum likelihood estimate of the divergence time between two African human populations will be obtained. Second, the population phylogeny of three Caribbean turtle populations will be estimated.

There has been considerable interest in the divergence of human populations in Africa (Vigilant et al., 1989; Watson et al., 1996; Tishkoff et al., 1996). One of the reasons for this interest is that the ancestral human population can probably be traced to Africa. As an illustration of the methods for estimating population divergence times, the divergence time between the Mbuti population (formerly known as the eastern pygmies) and the Biaka population (formerly known as the western pygmies) will here be estimated. For this purpose previously published DNA sequences of the mitochondrial D-loop region from the two populations were obtained from GenBank (Mbuti: HUMMTDL073, HUMMTDL072, HUMMTDL071, HUMMTDL070, HUMMTDL069, HUMMTDL068, HUMMTDL067, HUMMTDL066, HUMMTDL065, HUMMTDL032, HUMMTDL031, HUMMTDL030, HUMMTDL006, HUMMTDL005, HUMMTDL004 and Biaka: HUMMTDL047, HUMMTDL046, HUMMTDL045, HUMMTDL044, HUMMTDL043, HUMMTDL042, HUMMTDL041, HUMMTDL040, HUMMTDL039, HUMMTDL038, HUMMTDL037, HUMMTDL002, HUMMTDL001). The sequences were aligned by ClustalV. Sequences with missing data were subsequently deleted. Likewise, sites with ambiguous alignment were also removed. The methods discussed above assume an infinite sites model. This implies that only one substitution is allowed to occur in each nucleotide site. However, the set of haplotypes discussed above is not consistent with the infinite sites model. In fact, certain parts of the D-loop region are rather saturated with substitutions. A binary data set, consistent with the infinite sites model, was therefore obtained from the aligned sequences by only considering transversional changes. The resulting data set is fully compatible with the infinite sites model (Table I).

TABLE I

The Sequences of Sites Containing Transversions in the Mbuti and Biaka mtDNA Control Region. The Last Two Columns Provide the Counts of Each Haplotype in Each of the Populations

| Haplotype | Mbuti | Biaka |
|---|---|---|
| 1 0 1 1 0 1 1 0 0 | 3 | 0 |
| 0 1 1 1 0 1 1 0 0 | 1 | 0 |
| 1 0 1 1 0 1 1 0 1 | 1 | 0 |
| 1 0 0 1 0 0 1 0 0 | 1 | 0 |
| 1 0 0 1 0 1 1 0 0 | 2 | 0 |
| 1 0 1 1 1 1 0 1 0 | 0 | 2 |
| 1 0 1 0 0 1 0 1 0 | 0 | 1 |
| 1 0 1 1 0 1 0 1 0 | 0 | 8 |
| 1 0 0 1 0 1 1 0 0 | 0 | 2 |

The population sizes of the Mbuti and the Biaka appear to be almost the same. Both populations today consist of approximately 30,000 individuals. Furthermore, there is no evidence of changes in the effective population size from the demographic data or from the mismatch distribution (Watson et al., 1996). Likewise, comparable levels of heterozygosity in the two populations also suggest that the effective population size is approximately the same (see, for example, Vigilant et al., 1989). Therefore, in the following it will be assumed that $\theta$ (four times the effective population size times the mutation rate) for the Mbuti population is the same as $\theta$ for the Biaka population. There are therefore only two unknown parameters to estimate: $T$ and $\theta$ (where $T$ is the divergence time of the two populations divided by the effective population size). The likelihood for different values of $T$ and $\theta$ was evaluated by performing 1,000,000 runs through the Markov chain (see above). The maximum likelihood values for $\theta$ and $T$ were estimated to be 0.9 and 1.8 ($l = -43.3$), respectively. Assuming an effective diploid population size of 1,000 would then imply that the total rate of transversions in the D-loop region is approximately 0.00045. More importantly, the divergence time of the two populations would be approximately 1,800 generations. Assuming a generation time of 20 years, this translates into 36,000 years.
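The conversion from the scaled divergence time to calendar years quoted above is simple arithmetic, sketched here with the values given in the text ($T = 1.8$, an assumed effective size of 1,000, and a generation time of 20 years). The function name is hypothetical.

```python
def divergence_time(T_scaled, effective_size, generation_time_years):
    """Convert a scaled divergence time into generations and years."""
    generations = T_scaled * effective_size
    years = generations * generation_time_years
    return generations, years


generations, years = divergence_time(1.8, 1000, 20)
print(round(generations), round(years))
```

Both the effective population size and the generation time are assumptions, so the estimate in years scales linearly with whatever values are chosen for them.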

Another application of the method is the estimation ofpopulation phylogenies. This type of application will bedemonstrated on a previously published data set on theCaribbean hawksbill turtle (Table II). Bass et al. (1996)collected data of the mitochondrial DNA control regionof the Caribbean hawksbill turtle in order to testhypotheses on female nest site choice. They observed ahigh degree of isolation between different reproductivepopulations. Data from 3 populations (Table II) isapplied here to provide an estimate of the populationphylogeny of the three populations. As in the two-pop-ulation case, it is assumed that all three populations, aswell as the ancestral populations, have the same valueof %. The method described above can now be applied to

149Population Phylogenies

Page 8: Maximum Likelihood Estimation of Population Divergence Times and Population Phylogenies under the Infinite Sites Model

File: 653J 134808 . By:XX . Date:20:04:98 . Time:15:42 LOP8M. V8.B. Page 01:01Codes: 4273 Signs: 3255 . Length: 54 pic 0 pts, 227 mm

TABLE II

The Sequences of Variable Sites for Three Populations of the CaribbeanHawksbill Turtle. The Last Three Columns Provide the Counts of EachHaplotype in Each of the Three Populations.

Haplotype USVI Barbados Mexico

0 0 0 0 0 0 0 0 0 1 11 01 0 0 0 0 1 0 1 1 0 1 00 1 0 0 0 0 0 0 0 0 3 01 0 1 0 1 1 1 1 1 14 0 01 0 1 1 0 1 1 1 1 0 0 21 0 1 0 0 1 1 1 1 0 0 13

estimate the likelihood for different values of % and thetwo population divergence times (T1 and T2) for each ofthe three rooted population phylogenies. As in normalphylogenetic inference, the phylogeny with the highestlikelihood is said to be most supported by the data.Notice that it is not obvious which population phylogenyis most supported by the data (Fig. 2a). Only the USVI(U.S. Virgin Islands) and the Barbados populationsshare a haplotype. Based on haplotype sharing measuresthese two populations should therefore be groupedtogether. However, the average number of nucleotide dif-ferences between the Mexican and the USVI populationis much less than in any other comparison (Table II).Measures based on this type of information would there-fore group the USVI and the Mexican populationtogether. These two approaches differ in that the firstapproach assumes that the effect of mutation is negligiblewhereas the second type of approach assumes that

FIG. 2. (a) The maximum parsimony tree of the haplotypes shown in Table II and (b) the estimated population phylogeny using the coalescent likelihood approach. Notice that the parsimony tree provides no resolution of the relationship between populations. The number in front of the population name indicates how many of the particular haplotype are found in the population.

genetic drift is of little importance. Which approach is the most reasonable depends on assumptions regarding the relative effect of mutation and drift. In contrast, the likelihood method introduced here applies all of the information in the data and accounts for both drift and mutation.
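The contrast between the two classical summaries can be checked directly from the counts in Table II. The following is a minimal sketch, not part of the original analysis: the haplotype strings and counts are the Table II values as read here, and the names TABLE_II, shared_haplotypes, and mean_differences are hypothetical helpers introduced only for illustration.

```python
from itertools import combinations

# Table II as read here: haplotype (9 variable sites) -> counts per population.
# Column order (USVI, Barbados, Mexico) follows the table header.
POPULATIONS = ("USVI", "Barbados", "Mexico")
TABLE_II = {
    "000000000": (1, 11, 0),
    "100001011": (0, 1, 0),
    "010000000": (0, 3, 0),
    "101011111": (14, 0, 0),
    "101101111": (0, 0, 2),
    "101001111": (0, 0, 13),
}


def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def shared_haplotypes(i, j):
    """Haplotypes observed in both population i and population j."""
    return [h for h, c in TABLE_II.items() if c[i] > 0 and c[j] > 0]


def mean_differences(i, j):
    """Average number of differences over all between-population sequence pairs."""
    total = 0
    pairs = 0
    for ha, ca in TABLE_II.items():
        for hb, cb in TABLE_II.items():
            weight = ca[i] * cb[j]  # number of sequence pairs with these haplotypes
            total += hamming(ha, hb) * weight
            pairs += weight
    return total / pairs


summary = {}
for i, j in combinations(range(len(POPULATIONS)), 2):
    pair = (POPULATIONS[i], POPULATIONS[j])
    summary[pair] = (shared_haplotypes(i, j), mean_differences(i, j))
    print(pair, summary[pair])

# Pair of populations with the smallest average number of differences.
closest_pair = min(summary, key=lambda p: summary[p][1])
```

On this reading of the table, only USVI and Barbados share a haplotype, while the USVI and Mexico pair has an average of about 1.5 differences against more than 6 for the other two comparisons, which is the conflict described above.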

In order to estimate the population phylogenies, 1,000,000 runs through the Markov chain were performed for each value of T1, T2, and θ. A full search was completed for the three parameters for each of the population phylogenies. For the three population phylogenies (Mexico, (USVI, Barbados)), (USVI, (Mexico, Barbados)), and (Barbados, (USVI, Mexico)), maximum log-likelihood values of −44.5, −44.8, and −43.9, respectively, were estimated. The maximum likelihood values of T1 and T2 were approximately 0.75 and 1.0 (Fig. 2b). In other words, this method will group the USVI and the Mexico populations together. In this case, the results obtained by methods based on haplotype sharing would differ from the maximum likelihood solution. Notice that the maximum likelihood method in this case provides a unique estimate of the population phylogeny whereas methods based on haplotype trees provide no resolution and methods based on allele sharing would reach the wrong conclusion. However, also notice that the likelihood values differ only slightly. Evidently, this data set contains little information regarding the population phylogeny.
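The selection step itself is simple once the maximized value for each topology is available. As a minimal sketch, the reported values (read here as log likelihoods) can be compared as follows; the variable names are hypothetical, and the expensive part, evaluating the likelihood over T1, T2, and θ, is not reproduced.

```python
# Maximized values reported in the text for the three rooted topologies,
# interpreted here as log likelihoods (an assumption of this sketch).
log_likelihoods = {
    "(Mexico, (USVI, Barbados))": -44.5,
    "(USVI, (Mexico, Barbados))": -44.8,
    "(Barbados, (USVI, Mexico))": -43.9,
}

# The topology with the highest value is the maximum likelihood estimate.
best = max(log_likelihoods, key=log_likelihoods.get)

# Difference between the best topology and each alternative.
support_gap = {
    topology: round(log_likelihoods[best] - value, 1)
    for topology, value in log_likelihoods.items()
}

print("best topology:", best)
for topology, gap in support_gap.items():
    print(topology, "is lower than the best by", gap)
```

The gaps of 0.6 and 0.9 units make concrete the remark that the data carry little information about the topology.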

DISCUSSION

There are several advantages to the presented methodology. It allows maximum likelihood estimates of population divergence times with corresponding confidence intervals to be obtained. This in turn solves the problem of ancestral polymorphisms while applying all of the information in the sample regarding the population phylogeny. The maximum likelihood method provides a unique estimate of the population phylogeny, taking account of both mutation and drift. Other methods rely on assumptions regarding the relative effect of mutation and drift.

However, there are also several problems with this methodology. At the present stage the method is only implemented for the infinite sites model. Most data sets do not immediately conform to this model. This problem is clearly illustrated by the analysis of the human mitochondrial DNA discussed above, in which only transversional differences were considered. The method could, in principle, be implemented for finite sites models of DNA evolution (see, for example, Griffiths and Tavaré, 1994a, and the corresponding program SEQUENCE). Unfortunately, such models would most likely increase the computational time drastically. Since the method, even at the present stage, is computationally intensive, it does not seem to be immediately feasible to perform this type of analysis for a finite sites model. However, for closely related populations where the data do conform to the infinite sites model, the presented method provides a strong alternative to classical approaches in the analysis of population divergence in models without migration.

ACKNOWLEDGMENTS

This study is supported by NIH grant GM40282 to Montgomery Slatkin and a personal grant to R.N. from the Danish Research Council. I thank M. Slatkin and two anonymous referees for comments and R. Hudson for important suggestions in the initial stages of this project.

REFERENCES

Avise, J. C. 1994. ``Molecular Markers, Natural History and Evolution,'' Chapman & Hall, London/New York.

Bass, A. L., Good, D. A., Bjorndal, K. A., Richardson, J. I., Hillis, Z.-M., Horrocks, J. A., and Bowen, B. W. 1996. Testing models of female reproductive migratory behavior and population structure in the Caribbean hawksbill turtle, Eretmochelys imbricata, with mtDNA sequences, Mol. Ecol. 5, 321–328.

Ethier, S. N., and Griffiths, R. C. 1987. The infinitely-many-sites model as a measure-valued diffusion, Ann. Prob. 15, 515–545.

Griffiths, R. C. 1989. Genealogical-tree probabilities in the infinitely-many-sites model, J. Math. Biol. 27, 667–680.

Griffiths, R. C., and Tavaré, S. 1994a. Simulating probability distributions in the coalescent, Theor. Popul. Biol. 46, 131–159.

Griffiths, R. C., and Tavaré, S. 1994b. Ancestral inference in population genetics, Stat. Sci. 9, 307–319.

Griffiths, R. C., and Tavaré, S. 1995. Unrooted genealogical tree probabilities in the infinitely-many-sites model, Math. Biosci. 127, 77–98.

Horai, S., Satta, Y., Hayasaka, K., Kondo, R., and others. 1992. Man's place in Hominoidea revealed by mitochondrial DNA genealogy, J. Mol. Evol. 35, 32–43.

Hudson, R. R. 1991. Gene genealogies and the coalescent process, in ``Oxford Surveys in Evolutionary Biology'' (D. Futuyma and J. Antonovics, Eds.), Vol. 7, pp. 1–44. Oxford Univ. Press, London.

Hudson, R. R. 1992. Gene trees, species trees and the segregation of ancestral alleles, Genetics 131, 509–512.

Hudson, R. R., and Kaplan, N. L. 1986. On the divergence of alleles in nested subsamples from finite populations, Genetics 113, 1057–1076.

Kimura, M. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations, Genetics 61, 893–903.

Kingman, J. F. C. 1982. The coalescent, Stochast. Proc. Appl. 13, 235–248.

Kuhner, M. K., Yamato, J., and Felsenstein, J. 1995. Estimating effective population size and mutation rate from sequence data using Metropolis–Hastings sampling, Genetics 140, 1421–1430.

Mathematica. 1988. Wolfram Research, Inc. Illinois.

Mountain, J. L., and Cavalli-Sforza, L. L. 1994. Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms, Proc. Natl. Acad. Sci. 91, 6515–6519.

Patton, J. L., Dasilva, M. N. F., and Malcolm, J. R. 1994. Gene genealogy and differentiation among arboreal spiny rats (Rodentia, Echimyidae) of the Amazon basin: A test of the riverine barrier hypothesis, Evolution 48, 1314–1323.

Slatkin, M., and Maddison, W. P. 1990. Detecting isolation by distance using phylogenies of genes, Genetics 126, 249–260.

Takahata, N. 1989. Gene genealogy in three related populations: Consistency probability between gene and population trees, Genetics 122, 957–966.

Tavaré, S. 1984. Lines-of-descent and genealogical processes, and their application in population genetics models, Theor. Popul. Biol. 26, 119–164.

Tishkoff, S. A., Dietzsch, E., Speed, W., Pakstis, A. J., Kidd, J. R., Cheung, K., Bonné-Tamir, B., Santachiara-Benerecetti, A. S., Moral, P., Krings, M., Pääbo, S., Watson, E., Risch, N., Jenkins, T., and Kidd, K. K. 1996. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins, Science 271, 1380–1387.

Templeton, A. R., Routman, E., and Phillips, C. A. 1995. Separating population structure from population history: A cladistic analysis of the geographical distribution of mitochondrial DNA haplotypes in the tiger salamander, Ambystoma tigrinum, Genetics 140, 767–782.

Vigilant, L., Pennington, R., Harpending, H., and Kocher, T. D. 1989. Mitochondrial DNA sequences in single hairs from a southern African population, Proc. Natl. Acad. Sci. 86, 9350–9354.

Watson, E., Bauer, K., Aman, R., Weiss, G., von Haeseler, A., and Pääbo, S. 1996. mtDNA sequence diversity in Africa, Am. J. Hum. Genet. 59, 437–444.

Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination, Theor. Pop. Biol. 7, 256–276.

Wu, C. I. 1991. Inferences of species phylogeny in relation to segregation of ancient polymorphism, Genetics 127, 429–435.

