
Phylogenetic Inference under the Pure Drift Model

Shizhong Xu,* William R. Atchley,* and Walter M. Fitch† *Center for Quantitative Genetics, Department of Genetics, North Carolina State University; and †Department of Ecology and Evolutionary Biology, University of California, Irvine

When pairwise genetic distances are used for phylogenetic reconstruction, it is usually assumed that the genetic distance between two taxa contains information about the time since the two taxa diverged. As a result, after an appropriate transformation if necessary, the distance can usually be fitted to a linear model in which it is expressed as the sum of the lengths of all branches that connect the two taxa in a given phylogeny. This kind of distance is referred to as “additive distance.” For a phylogenetic tree driven exclusively by random genetic drift, genetic distances related to coancestry coefficients ($\theta_{XY}$) between any two taxa are more suitable. However, these distances are fundamentally different from the additive distance in that coancestry does not contain any information about the time after two taxa split from a common ancestral population; instead, it reflects the time before the two taxa diverged. In other words, the magnitude of $\theta_{XY}$ provides information about how long the two taxa shared the same evolutionary pathway. This fundamental difference between the two kinds of distances calls for a different algorithm for evaluating phylogenetic trees when $\theta_{XY}$ and related distance measures are used. Here we present the new algorithm, which uses the ordinary-least-squares approach but fits a different linear model. This treatment allows genetic variation within a taxon to be included in the model. Monte Carlo simulation for a rooted phylogeny of four taxa verified the efficacy and consistency of the new method. An application of the method to human populations is demonstrated.

Introduction

Random genetic drift is an important evolutionary force. It has been argued that, in natural populations, population size is sufficiently large that drift could be ignored compared with other evolutionary forces such as selection and mutation (Fisher 1958, pp. 22-51). In inbred strains of mice, rats, guinea pigs, and some plants, for example, the population size is so small and the evolutionary history so short that variation in allelic frequencies among inbred strains must have been predominantly driven by random drift or allelic fixation (Atchley and Fitch 1991, 1993). Thus, patterns of genetic divergence observed among inbred strains result from random segregation of the original heterozygosity of the founding stocks.

For captive populations, genetic drift is the overriding factor controlling the loss of heterozygosity. Mutation has no noticeable effect on populations of the size typically managed in zoos and nature preserves. Unless selection is stronger than commonly observed in natural populations, it is inefficient in countering drift when population sizes are on the order of 100 or fewer (Lacy 1987). Random genetic drift is also considered to be important in determining the variation in gene frequencies in man (Cavalli-Sforza et al. 1964; Edwards and Cavalli-Sforza 1964; Cavalli-Sforza 1966; Cavalli-Sforza and Edwards 1967).

Key words: phylogeny, genetic drift, coancestry coefficient, genetic distance, reduction of heterozygosity.

Address for correspondence and reprints: Shizhong Xu, Center for Quantitative Genetics, Department of Genetics, North Carolina State University, Raleigh, North Carolina 27695-7614.

Mol. Biol. Evol. 11(6):949-960. 1994. © 1994 by The University of Chicago. All rights reserved. 0737-4038/94/1106-0013$02.00

A commonly used measurement of divergence in gene frequencies caused by random genetic drift is Wright’s $F_{ST}$ (Wright 1943, 1951, 1965). $F$-statistics were originally derived from a population-genetics perspective, and it was assumed that an infinite number of populations diverged at the same time from a common ancestral population. From a phylogenetic perspective, the coancestry coefficient ($\theta$), which is another $F_{ST}$-related measurement of population divergence, seems more appropriate. Within a population, it is defined as the probability that a random gene from one individual is identical by descent to a random gene from another individual (Kempthorne 1969, pp. 72-80; Falconer 1980, pp. 80-83). Between two populations, $\theta_{XY}$ is defined as the probability that a random pair of genes, one from each population, are identical by descent. Appropriate transformations of the coancestry coefficients can be treated as genetic distances for use in phylogenetic inference.

However, this measure of genetic distance may not be additive, which is assumed by the Fitch-Margoliash method (Fitch and Margoliash 1967) and other phylogeny-inferring algorithms. Further, there are no phylogeny-inferring algorithms available that incorporate inbreeding coefficients like $\theta_{XY}$. To circumvent this problem, a phylogeny-inferring algorithm using character data, such as the parsimony method, may be used. Recently, Atchley and Fitch (1993) introduced the concept of loss parsimony to describe the segregation and random fixation of alleles under systematic brother-sister mating. These authors used an inverted Camin-Sokal algorithm to find trees that minimize allele loss. However, irreversibility of allele loss is only a qualitative prediction of random genetic drift, and the loss parsimony model fails to incorporate appropriate quantitative predictions from population-genetics theory. A maximum-likelihood (ML) method under the pure drift model could, in principle, incorporate all the pertinent quantitative predictions inherent to genetic drift (Felsenstein 1973, 1981). The drawback of an ML method is that it involves extensive computing if explicit solutions are not possible. In addition, lack of knowledge on the exact joint probability distribution of the data will decrease the credibility of ML.

In this research, we first introduce a class of $\theta_{XY}$-related measurements of genetic distances. These genetic distances, after appropriate transformation, are then used to infer phylogenetic relationships among taxa.

The Pure Drift Model

Consider a finite population with effective population size $N_e$ isolated from an infinite, random-mating population in Hardy-Weinberg equilibrium (denoted by A as in fig. 1). Assume effective population size did not change through time and at generation $t_{AB}$ the population (denoted by B) was split into two lineages, X and Y, each of which had the same effective population size $N_e$. Populations X and Y have independent histories of genetic drift for $t_{BX}$ and $t_{BY}$ generations, respectively. Suppose that the heterozygosity of a neutral locus in the infinite ancestral population was $H_A$. When the finite population (population B) was split, the expected heterozygosity was expressed by $H_B$. If both $H_A$ and $H_B$ are known, one can infer the time $t_{AB}$, in generations, by the following formula:

$$H_B = H_A(1 - F_{t_{AB}}) = H_A[1 - 1/(2N_e)]^{t_{AB}}, \tag{1}$$

where

$$F_{t_{AB}} = 1 - [1 - 1/(2N_e)]^{t_{AB}}$$

is the mean inbreeding coefficient of population B. We use $H$ throughout to represent the expected heterozygosity. Estimated heterozygosity will be discussed later. Equation (1) allows the time before divergence to be inferred from the existing heterozygosity of node B as

$$t_{AB} = [\log(H_B) - \log(H_A)]/\log[1 - 1/(2N_e)]. \tag{2}$$

FIG. 1.-The rooted tree for two taxa used as an example in the text.

Let $H_X$ and $H_Y$ be the expected heterozygosities of the terminal populations X and Y, respectively, at the time when data are sampled. Because of the relationships

$$H_X = H_B[1 - 1/(2N_e)]^{t_{BX}}$$

and

$$H_Y = H_B[1 - 1/(2N_e)]^{t_{BY}},$$

$t_{BX}$ and $t_{BY}$ can be inferred by

$$t_{BX} = [\log(H_X) - \log(H_B)]/\log[1 - 1/(2N_e)] \tag{3}$$

and

$$t_{BY} = [\log(H_Y) - \log(H_B)]/\log[1 - 1/(2N_e)], \tag{4}$$

respectively. If generation intervals for the two lineages were the same and data were sampled at the same time, $t_{BX}$ should equal $t_{BY}$. However, this is not a requirement of the phylogenetic methods being described here.

The number of heterozygotes in populations X and Y can be obtained by counting individual genotypes. Observed frequencies of heterozygotes are then used to estimate $H_X$ and $H_Y$, denoted by $\hat{H}_X$ and $\hat{H}_Y$, respectively. Unfortunately, $H_B$ is an unobservable historical event that cannot be simply counted. What we want is to obtain an estimate of $H_B$ from the observed data sampled from X and Y.
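The decay of heterozygosity under drift and its inversion into elapsed generations (equations 1 through 4) can be checked numerically. The following is a minimal sketch, not the authors' code; the function names and the values $H_A = 0.5$, $N_e = 50$, and $t_{AB} = 20$ are hypothetical and chosen only for illustration.

```python
import math

def expected_heterozygosity(h_start, ne, t):
    """Expected heterozygosity after t generations of drift, as in equation (1)."""
    return h_start * (1.0 - 1.0 / (2.0 * ne)) ** t

def generations_elapsed(h_start, h_end, ne):
    """Invert the decay to recover elapsed generations, as in equations (2)-(4)."""
    return (math.log(h_end) - math.log(h_start)) / math.log(1.0 - 1.0 / (2.0 * ne))

# Hypothetical values, for illustration only.
h_a, ne, t_ab = 0.5, 50.0, 20
h_b = expected_heterozygosity(h_a, ne, t_ab)   # heterozygosity at node B
t_back = generations_elapsed(h_a, h_b, ne)     # should recover t_ab
print(round(h_b, 4), round(t_back, 6))
```

The round trip recovers the number of generations because both logarithms in the numerator are scaled by the same per-generation retention factor $1 - 1/(2N_e)$.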

Estimation of Heterozygosity of an Internal Node

The number of heterozygotes in populations X and Y (the terminal nodes) can be obtained from the actual counts by examining individual genotypes. However, individuals were assumed to mate randomly within each population so that the heterozygosity of each terminal node can be estimated by the so-called gene diversity (Nei 1976, pp. 723-765). Let $x_i$ and $y_i$ denote the observed frequencies of the $i$th allele for taxa X and Y, respectively, where $i = 1, 2, \ldots, n$ for a locus with $n$ allelic states. The heterozygosities of X and Y are then estimated by

$$\hat{H}_X = 1 - \sum_{i=1}^{n} x_i^2$$

and

$$\hat{H}_Y = 1 - \sum_{i=1}^{n} y_i^2,$$

respectively.

The allelic frequencies of the two terminal taxa provide all information about the heterozygosity of their ancestral population at the time when they diverged. Consider node B as a parental population with X and Y being two sets of offspring of B. Note that no matter how many generations have passed since X and Y diverged, X carries genes from one set of offspring and Y carries genes from another set of offspring of B. Recall that the coancestry coefficient between X and Y is defined as the probability that a gene from X is identical by descent with a gene from Y. If X and Y form a pair of mates, their offspring would have an inbreeding coefficient equal to the coancestry coefficient between X and Y. Because X and Y have been treated as offspring of B, the progeny of X and Y would be the grandchildren of population B. As population B is designated as generation $t_{AB}$, their potential grandchildren should be designated as generation $t_{AB} + 2$. Therefore, $\theta_{XY} = F_{t_{AB}+2}$.

The coancestry coefficient, $\theta_{XY}$, is probably the most appropriate measurement of genetic distance between two taxa when dealing with random drift because it is independent of the initial gene frequency (allele frequency of the common ancestral population, A) and reflects the time elapsed before X and Y diverged. However, estimation of $\theta_{XY}$ still needs an estimate of the initial frequency. Fortunately, there is a simple linear relationship between the expected heterozygosity and the inbreeding coefficient, $H = H_0(1 - F)$, where $H_0$ is the heterozygosity of the panmictic base population. Therefore, instead of estimating $F$ or $\theta_{XY}$, we may estimate the heterozygosity.

We now propose an unbiased estimator of $H_B$, the heterozygosity of the internal node (see fig. 1), using observed allele frequencies for a locus of interest from taxa X and Y. Under the drift model, the expected values of these allele frequencies are the same as those in the initial base population. An unbiased estimate of the heterozygosity two generations after node B is proposed as

$$D_{XY} = 1 - \sum_{i=1}^{n} x_i y_i. \tag{5}$$

This heterozygosity estimator, denoted by $D_{XY}$, is a genetic distance. The unbiased property of equation (5) is proved easily by showing

$$E[D_{XY}] = H_A(1 - \theta_{XY}). \tag{6a}$$

As indicated before, $\theta_{XY} = F_{t_{AB}+2}$; hence,

$$E[D_{XY}] = H_A(1 - F_{t_{AB}+2}). \tag{6b}$$

If the constant 2 in the subscript of equation (6b) is dropped, the right-hand side becomes $H_B$. Therefore, the expected genetic distance between two lineages equals the heterozygosity of the potential grandchildren (generation $t_{AB}+2$) of their common ancestral population at the time when the two lineages diverged. If the effective population size is not too small ($N_e > 50$), heterozygosity reduction over two generations is insignificant, so that the expected $D_{XY}$ is a good approximation for $H_B$, that is,

$$E[D_{XY}] \approx H_B. \tag{7}$$

Unless otherwise specified, we assume that $N_e$ is large enough so that the expected genetic distance between two lineages equals the heterozygosity of their immediate common ancestral population at the time when the two lineages diverged.

Although equation (5) is unbiased, it is not practical to use just a single locus to infer the heterozygosity of an internal node because a large variance is anticipated. In particular, if there is no heterozygosity in the terminal taxa, there are only two possible outcomes: that the two lineages fixed the same allele or that they fixed different alleles. The latter may be called alternative fixation. In either case, we will not be able to infer $H_B$ just on the basis of the two possible observed outcomes of fixation. It can be shown, however, that if $m$ independent neutral loci are used, we can compare the two lineages locus by locus, find the number of alternatively fixed loci, and use the proportion over the total number of loci to measure the genetic distance (Atchley and Fitch 1991, 1993). A general strategy is to average $D_{XY}$ over loci, which has the operational form of

$$\bar{D}_{XY} = \frac{1}{m}\sum_{j=1}^{m}\left(1 - \sum_{i=1}^{n_j} x_{ji} y_{ji}\right), \tag{8}$$

where $x_{ji}$ and $y_{ji}$ are the frequencies of the $i$th allele at the $j$th locus for populations X and Y, respectively, and $n_j$ is the number of allelic states of the $j$th locus. Equation (8) is still an unbiased estimator of the average heterozygosity of the internal node over loci because drift has an equal effect on all loci. In this case, $E(\bar{D}_{XY}) = \bar{H}_A(1 - \theta_{XY})$, where $\bar{H}_A$ is the average heterozygosity over loci of the common ancestral population (node A).
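The within-taxon gene diversity and the between-taxon distance averaged over loci can be computed directly from allele-frequency lists. The sketch below is illustrative only: the function names and the three-locus frequencies are hypothetical, and it assumes each inner list holds the frequencies of all alleles at one locus.

```python
def gene_diversity(freqs):
    """Within-population estimate: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def locus_distance(x, y):
    """Single-locus D_XY of equation (5): 1 minus the sum of x_i * y_i."""
    return 1.0 - sum(xi * yi for xi, yi in zip(x, y))

def mean_distance(x_loci, y_loci):
    """Average of the single-locus distances over loci, as in equation (8)."""
    per_locus = [locus_distance(x, y) for x, y in zip(x_loci, y_loci)]
    return sum(per_locus) / len(per_locus)

# Hypothetical allele frequencies at three two-allele loci.
x_loci = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
y_loci = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

d_xy = mean_distance(x_loci, y_loci)
h_x = sum(gene_diversity(x) for x in x_loci) / len(x_loci)
print(d_xy, h_x)
```

In this toy data set the first locus is fixed for the same allele in both taxa (distance 0), the second shows alternative fixation (distance 1), and the third is still segregating in both.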

The estimated length (number of generations) of the root branch (fig. 1) is

$$\hat{t}_{AB} = [\log(\bar{D}_{XY}) - \log(\bar{H}_A)]/\log[1 - 1/(2N_e)]. \tag{9}$$

Let $\hat{H}_X$ and $\hat{H}_Y$, respectively, be the observed average heterozygosities (over loci) of populations X and Y at the time when data are sampled; then

$$\hat{t}_{BX} = [\log(\hat{H}_X) - \log(\bar{D}_{XY})]/\log[1 - 1/(2N_e)] \tag{10}$$

and

$$\hat{t}_{BY} = [\log(\hat{H}_Y) - \log(\bar{D}_{XY})]/\log[1 - 1/(2N_e)]. \tag{11}$$

Phylogenetic Inference under the Pure Drift Model

We will describe a new algorithm of phylogeny reconstruction under the pure drift model using both the genetic distances between populations ($\bar{D}_{XY}$) and variation within populations ($\hat{H}_X$). Figure 2 gives a hypothetical drift-produced phylogeny with four taxa. The branch lengths represent time in generations. The Fitch-Margoliash method assumes that the true genetic distance between taxa $i$ and $j$, or a transformation of the distance, equals the sum of all the branch lengths that connect $i$ and $j$. For instance, genetic distance between A and B in figure 2 is the sum of $t_A$ and $t_B$. Likewise, distance between A and D is the sum of $t_A$, $t_3$, $t_2$, and $t_D$. This is clearly not the case under the pure drift model because $\bar{D}_{AB}$ estimates the heterozygosity at node d, which is determined by the sum of $t_1$, $t_2$, and $t_3$ and does not depend on $t_A$ and $t_B$. Similarly, $\bar{D}_{AD}$ reflects heterozygosity at node b, a function of $t_1$, and provides no information about the lengths of branches that connect A and D. Therefore, branch lengths of a phylogenetic tree under the pure drift model cannot be estimated with the conventional Fitch-Margoliash method.

FIG. 2.-The model tree for four taxa used in the simulation studies.

Suppose that figure 2 represents the true phylogenetic relationship of the four taxa and we have collected $4(4-1)/2 = 6$ pairwise distances and four observed heterozygosities within taxa. The total number of observed data points is 10. Let us define variable $y$ as the appropriate transformation of either the heterozygosity within taxon or the genetic distance between taxa. Thus,

$$
\begin{aligned}
y_A &= [\log(\hat{H}_A) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{AB} &= [\log(\bar{D}_{AB}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{AC} &= [\log(\bar{D}_{AC}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{AD} &= [\log(\bar{D}_{AD}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_B &= [\log(\hat{H}_B) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{BC} &= [\log(\bar{D}_{BC}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{BD} &= [\log(\bar{D}_{BD}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_C &= [\log(\hat{H}_C) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_{CD} &= [\log(\bar{D}_{CD}) - \log(H_a)]/\log[1 - 1/(2N_e)] \\
y_D &= [\log(\hat{H}_D) - \log(H_a)]/\log[1 - 1/(2N_e)].
\end{aligned} \tag{12}
$$

If $H_a$ and $N_e$ are known, these $y$'s are the observed data, which can be fitted to the following additive model:

$$
\begin{aligned}
y_A &= t_1 + t_2 + t_3 + t_A + e_A \\
y_{AB} &= t_1 + t_2 + t_3 + e_{AB} \\
y_{AC} &= t_1 + t_2 + e_{AC} \\
y_{AD} &= t_1 + e_{AD} \\
y_B &= t_1 + t_2 + t_3 + t_B + e_B \\
y_{BC} &= t_1 + t_2 + e_{BC} \\
y_{BD} &= t_1 + e_{BD} \\
y_C &= t_1 + t_2 + t_C + e_C \\
y_{CD} &= t_1 + e_{CD} \\
y_D &= t_1 + t_D + e_D
\end{aligned} \tag{13}
$$

where the $e$'s are error terms representing departures of observed from expected values. Let $\mathbf{y}$ and $\mathbf{e}$ be $10 \times 1$ vectors containing the $y$'s and the $e$'s, respectively, and $\mathbf{t} = [t_1\ t_2\ t_3\ t_A\ t_B\ t_C\ t_D]^T$ be a $7 \times 1$ vector where the superscript $T$ represents matrix transposition. The condensed matrix notation for the above additive model is

$$\mathbf{y} = \mathbf{Z}\mathbf{t} + \mathbf{e}, \tag{14}$$

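where $\mathbf{Z}$ is a design matrix representing the tree topology. In this particular example,

$$
\mathbf{Z} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.
$$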
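One way to see where the entries of the design matrix come from is to build it from the topology: a branch contributes to an observation when every taxon in that observation descends from it. The sketch below is an illustration under assumptions, not the authors' implementation; the nested-tuple tree encoding and the function names are hypothetical.

```python
def leaf_set(node):
    """Return the set of taxa below a node (a leaf is a string, a clade a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def internal_nodes(node):
    """List internal nodes in preorder, so the root (trunk t1) comes first."""
    if isinstance(node, str):
        return []
    found = [node]
    for child in node:
        found.extend(internal_nodes(child))
    return found

def design_matrix(tree):
    """Build Z with one row per observation and one column per branch."""
    taxa = sorted(leaf_set(tree))
    # Columns: the branch above each internal node, then the branch above each tip.
    clades = [leaf_set(n) for n in internal_nodes(tree)]
    clades += [frozenset([t]) for t in taxa]
    # Rows: each taxon alone, followed by its pairs with later taxa.
    observations = []
    for i, a in enumerate(taxa):
        observations.append(frozenset([a]))
        observations.extend(frozenset([a, b]) for b in taxa[i + 1:])
    # Entry is 1 when all observed taxa descend from the branch's clade.
    return [[1 if obs <= clade else 0 for clade in clades] for obs in observations]

# Model tree of figure 2 in an assumed nested-tuple encoding.
tree = ((("A", "B"), "C"), "D")
for row in design_matrix(tree):
    print("".join(str(v) for v in row))
```

For this topology the rows printed match the matrix given in the text, with columns ordered $t_1, t_2, t_3, t_A, t_B, t_C, t_D$ and rows ordered A, AB, AC, AD, B, BC, BD, C, CD, D.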

The ordinary least squares solution for $\mathbf{t}$ is

$$\hat{\mathbf{t}} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{y}, \tag{15}$$

with a mean squared error (MSE) of

$$\mathrm{MSE} = (\mathbf{y} - \mathbf{Z}\hat{\mathbf{t}})^T(\mathbf{y} - \mathbf{Z}\hat{\mathbf{t}})/df, \tag{16}$$

where $df$ is the degrees of freedom and equals the number of data points minus the number of branches. For $T$ taxa, $df = T(T-3)/2 + 1$. The ordinary-least-squares solutions given in equation (15) assume that the error terms are independent with a constant variance. However, these $e$'s are certainly correlated and the variance may also vary. Let $\mathbf{V}$ denote the variance-covariance matrix of vector $\mathbf{e}$; the generalized-least-squares solution for $\mathbf{t}$ would be

$$\hat{\mathbf{t}} = (\mathbf{Z}^T\mathbf{V}^{-1}\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{V}^{-1}\mathbf{y}. \tag{17}$$

Unfortunately, V is usually unknown and its estimation is difficult. Therefore, the ordinary-least-squares solutions are used hereafter. The V matrix will be further discussed in a later section.

There are 15 possible rooted trees for four taxa. In principle, one needs to evaluate all 15 possible trees and choose the one with the smallest MSE.
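The ordinary-least-squares fit and its MSE for a single candidate tree can be sketched in a few lines. This is a minimal illustration, not the authors' program: the solver is a plain Gauss-Jordan elimination written here for self-containment, and the error values added to the data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols_fit(z, y):
    """Return (estimates, MSE): t = (Z'Z)^-1 Z'y and MSE = residual SS / df."""
    rows, cols = len(z), len(z[0])
    ztz = [[sum(z[k][i] * z[k][j] for k in range(rows)) for j in range(cols)]
           for i in range(cols)]
    zty = [sum(z[k][i] * y[k] for k in range(rows)) for i in range(cols)]
    t_hat = solve_linear(ztz, zty)
    fitted = [sum(zi * ti for zi, ti in zip(row, t_hat)) for row in z]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    df = rows - cols  # data points minus branches
    return t_hat, rss / df

# Design matrix of the model tree; columns are t1 t2 t3 tA tB tC tD.
Z = [
    [1, 1, 1, 1, 0, 0, 0],  # y_A
    [1, 1, 1, 0, 0, 0, 0],  # y_AB
    [1, 1, 0, 0, 0, 0, 0],  # y_AC
    [1, 0, 0, 0, 0, 0, 0],  # y_AD
    [1, 1, 1, 0, 1, 0, 0],  # y_B
    [1, 1, 0, 0, 0, 0, 0],  # y_BC
    [1, 0, 0, 0, 0, 0, 0],  # y_BD
    [1, 1, 0, 0, 0, 1, 0],  # y_C
    [1, 0, 0, 0, 0, 0, 0],  # y_CD
    [1, 0, 0, 0, 0, 0, 1],  # y_D
]
t_true = [3, 3, 3, 3, 3, 6, 9]  # branch lengths of the model tree
y_exact = [sum(zi * ti for zi, ti in zip(row, t_true)) for row in Z]
errors = [0, 0, 0.1, -0.1, 0, -0.1, 0.2, 0, -0.1, 0]  # hypothetical departures
y_noisy = [yi + ei for yi, ei in zip(y_exact, errors)]

t_hat, mse = ols_fit(Z, y_noisy)
print([round(t, 3) for t in t_hat], round(mse, 4))
```

Repeating the fit for each candidate topology, each with its own design matrix, and keeping the tree with the smallest MSE gives the search described above.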

Several key points concerning the $y$'s need to be made. First, for large $N_e$, $\log[1 - 1/(2N_e)]$ can be replaced by $-1/(2N_e)$. Second, $N_e$ may be unknown in natural populations, or it may have been estimated. Fortunately, $2N_e$ occurs in all the $y$'s, thus permitting $2N_e$ to be dropped. In that case, the branch length is not the number of generations but $t/(2N_e)$. Third, $H_a = 2p_0(1 - p_0)$, requiring the initial allelic frequency in the common ancestral population (node a in fig. 2). One may replace $p_0$ by the average allelic frequency of all lineages, but that value is anticipated to have a large variance because lineages are not independent random samples of node a. However, $-\log(H_a)$ is a constant added to all the $y$'s. Linear regression theory indicates that adding a constant to the $y$'s does not affect the estimates of regression coefficients and the MSE, but it does change the estimate of the intercept. Examining the above additive model and the $\mathbf{Z}$ matrix, we see that the tree trunk (branch $t_1$ in fig. 2) is actually the intercept and all other $t$'s are regression coefficients. Therefore, if one is not interested in the length of the root, the constant $-\log(H_a)$ can be ignored, leaving MSE and the other branch-length estimates unaffected. Ignoring $-\log(H_a)$, we have $y_A = -\log(\hat{H}_A)$ and $y_{AB} = -\log(\bar{D}_{AB})$, a transformation similar to Nei’s (1987, pp. 208-253) standard genetic distance but with a quite different meaning. Nei’s standard genetic distance takes the negative of the natural log of the genetic similarity, whereas our $y$ variable takes the negative of the natural log of the genetic distance. Therefore, the $y$ variable should not be called a genetic distance but rather a kind of genetic “similarity.”
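The claim that the choice of $H_a$ only moves the intercept can be illustrated with a little arithmetic: changing $H_a$ shifts every $y$ by the same amount. The sketch below assumes the full-sib retention factor $(1+\sqrt{5})/4$ used in the simulation section and takes three distances and the trunk estimate 3.17 from tables 1 and 2; the function name is hypothetical.

```python
import math

# log[1 - 1/(2 Ne)] for full-sib mating, using the factor quoted in the text.
LOG_RATE = math.log((1 + math.sqrt(5)) / 4)

def y_value(h_observed, h_root):
    """Equation (12): transform a heterozygosity or distance into generations."""
    return (math.log(h_observed) - math.log(h_root)) / LOG_RATE

observed = [0.222, 0.198, 0.200]  # distances for AD, BD, CD as listed in table 1

# Shift in each y when the assumed root heterozygosity moves away from 0.5.
shifts_low = [y_value(d, 0.1) - y_value(d, 0.5) for d in observed]
shifts_high = [y_value(d, 0.9) - y_value(d, 0.5) for d in observed]

# The shift is identical for every observation, so only the intercept t1 moves.
t1_reported = 3.17  # trunk estimate reported for H_a = 0.5
print(round(t1_reported + shifts_low[0], 2), round(t1_reported + shifts_high[0], 2))
```

Under these assumptions the shifted trunk values land close to the estimates reported in the simulation section for $H_a = 0.1$ and $H_a = 0.9$, up to rounding of the inputs.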

Branch lengths of a phylogenetic tree cannot be negative. However, the least squares solution does not guarantee nonnegativity. There are two situations in which negative estimates may occur. First, a wrong topology may be chosen, which would lead to systematic bias in the estimate of a branch length. Such bias cannot be removed by increasing the number of loci. Second, if a small number of loci are used, sampling error may cause a negative estimate of a branch length, even when a correct tree topology is used. In the latter case, negativity can be overcome by increasing the number of loci. Algebraically, an ad hoc way of solving the problem of negativity is to set any negative branches to zero and then recalculate the MSE for a given tree topology (Swofford and Olsen 1991). The optimal approach, however, is to utilize quadratic programming to disallow negative estimates of regression coefficients (Hildreth 1957). The tree trunk, $t_1$, is the intercept in the linear model; thus, it should not be constrained.
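The ad hoc treatment of negative branches can be sketched as follows. This is a literal reading of "set to zero and recalculate the MSE" and is an assumption on our part: it does not refit the remaining branches, and it is not the quadratic-programming approach. The three-taxon tree, data, and function name are hypothetical.

```python
def clamp_and_rescore(z, y, t_hat, df):
    """Zero out negative branch estimates (except the trunk) and recompute the MSE."""
    # The first entry is the trunk t1 (the intercept), which is left unconstrained.
    clamped = [t_hat[0]] + [max(t, 0.0) for t in t_hat[1:]]
    fitted = [sum(zi * ti for zi, ti in zip(row, clamped)) for row in z]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return clamped, rss / df

# Hypothetical three-taxon tree ((A,B),C); columns are t1 t2 tA tB tC.
Z = [
    [1, 1, 1, 0, 0],  # y_A
    [1, 1, 0, 0, 0],  # y_AB
    [1, 0, 0, 0, 0],  # y_AC
    [1, 1, 0, 1, 0],  # y_B
    [1, 0, 0, 0, 0],  # y_BC
    [1, 0, 0, 0, 1],  # y_C
]
y = [2.5, 3.0, 2.1, 3.8, 1.9, 5.0]
t_hat = [2.0, 1.0, -0.5, 0.8, 3.0]  # least-squares estimates; tA is negative
df = len(Z) - len(t_hat)            # 6 data points minus 5 branches

clamped, mse = clamp_and_rescore(Z, y, t_hat, df)
print(clamped, round(mse, 4))
```

Clamping raises the MSE relative to the unconstrained fit, which is what penalizes topologies that need negative branches to fit the data.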

Monte Carlo Simulation

An example was generated via Monte Carlo simulation to demonstrate the usage of the new method. We simulated the model tree given in figure 2 under a special breeding procedure, namely, brother × sister (b×s) mating. Most inbred populations of laboratory mice and rats were developed by b×s mating, which strictly fits this pure drift model (Atchley and Fitch 1991, 1993). With b×s mating, equation (1) can still be used to approximate $H_B$ such that the heterozygosity is expressed as a function of generations. Kempthorne (1969) has shown that, with systematic full-sib mating, $1 - 1/(2N_e) = (1 + \sqrt{5})/4 = 0.809$, leading to $N_e = 2.6178$. Hence, $\log[1 - 1/(2N_e)]$ can be replaced by $\log(0.809)$.
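The factor 0.809 can be recovered numerically from the standard recurrence for heterozygosity under full-sib mating, $H_t = H_{t-1}/2 + H_{t-2}/4$, whose per-generation ratio settles at $(1+\sqrt{5})/4$. The sketch below is illustrative; the starting values assume unrelated, non-inbred founders, and the function name is hypothetical.

```python
import math

def full_sib_heterozygosity(generations):
    """Relative heterozygosity under full-sib mating: H[t] = H[t-1]/2 + H[t-2]/4."""
    h = [1.0, 1.0]  # assumed start: unrelated, non-inbred founders
    for _ in range(generations):
        h.append(h[-1] / 2.0 + h[-2] / 4.0)
    return h

h = full_sib_heterozygosity(40)
ratio = h[-1] / h[-2]              # per-generation retention of heterozygosity
ne = 1.0 / (2.0 * (1.0 - ratio))   # solve 1 - 1/(2 Ne) = ratio for Ne
print(round(ratio, 4), round(ne, 4))
```

The effective size obtained from the unrounded ratio differs slightly from the 2.6178 quoted in the text, which corresponds to using the rounded value 0.809.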

We first simulated an initial hypothetical random population with 200 independent neutral loci, all with two allelic states (0 and 1). The frequency of allelic state 1 was 0.5 across all loci. This population is designated as generation 0 and denoted as node a; hence, $H_a = 2(0.5)(1 - 0.5) = 0.5$. We then randomly sampled two individuals (one male and one female) from this hypothetical population, which were then b×s mated for three generations (designated as generation 3 and denoted by node b). From node b, a series of inbred lines were produced as described in figure 2. Four full-sib progeny were produced from node b; one pair of full-sibs was used to initiate a lineage that produced line D after nine generations of b×s mating, as shown in figure 2. The progeny from the other pair of full-sibs from node b were b×s mated for three generations to produce node c. At node c, one pair of progeny was b×s mated for six generations to produce the lineage leading to line C, and another pair led to node d after three generations of b×s mating, which subsequently led to lines A and B. Both $t_A$ and $t_B$ are three generations. The genotypes of the simulated organisms were examined and allelic frequencies were calculated. In the absence of selection and mutation, systematic b×s mating reduces heterozygosity. Thus, this is a pure random drift model of evolutionary change of allelic frequencies.

The estimated heterozygosity of each terminal taxon was obtained by evaluating the genotype of each locus of each individual. The genetic distance was estimated using equation (8). These values are listed in table 1. We now try to estimate the branch lengths and calculate the MSE of the data. Data in table 1 were appropriately transformed into $y$ variables using a modified version of equation (12). The reason to modify equation (12) for the special mating system is that the effective population size is too small ($N_e = 2.6178$) and selfing is not allowed. The modified version of equation (12) is as follows: all $y$'s with one subscript are increased by 1; otherwise, they are decreased by 1. The MSE for this data set was 0.0604 (generation²), and the estimated branch lengths are listed in table 2. To show that the choice of $H_a$ does not affect the MSE and the estimates of $t_2, \ldots, t_D$, we also chose $H_a = 0.1$ and $H_a = 0.9$ to compare with $H_a = 0.5$. Remember that the true value of $t_1$ is 3. When $H_a = 0.5$, the estimated value was 3.17 (close to 3), but it became approximately -4.4 and 5.95, respectively, when $H_a = 0.1$ and 0.9 were used.

Table 1 Estimated Heterozygosities ($\hat{H}$) of Terminal Taxa (diagonals) and Heterozygosities ($\bar{D}_{XY}$) of Internal Nodes (off diagonals) for Four Taxa from a Simulated Data Set under the Model Tree Given in Figure 2

       A      B      C      D
A    .055
B    .059   .045
C    .114   .111   .055
D    .222   .198   .200   .04:

To evaluate the sensitivity of MSE and generalize the results, more simulations were conducted. The numbers of independent loci examined were 5, 10, 15, 20, 25, 50, 75, and 100. For any given number of loci, 100 replicated samples were simulated. For each replicate, all 15 possible trees (fig. 3) were evaluated. The tree with the smallest MSE was chosen as the inferred phylogeny.

Table 2 Comparisons of the Estimated with the True Branch Lengths for the Simulated Data Set

Branch             True Length (no. of generations)   Estimated Length
t1 (intercept)     3                                   3.17
t2                 3                                   2.86
t3                 3                                   3.22
tA                 3                                   2.16
tB                 3                                   3.11
tC                 6                                   5.38
tD                 9                                   9.46

NOTE.-The expected heterozygosity in the common ancestral population ($H_a$) is 0.5. The MSE is 0.0604 (generation²).

FIG. 3.-The 15 possible rooted trees for four taxa.

The 15 different rooted trees can be classified into two categories, asymmetric and symmetric. The asymmetric trees include tree numbers 1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 14, and 15. Tree numbers 3, 9, and 13 belong to the symmetrical class.
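The count of 15 rooted trees and the split into asymmetric and symmetric shapes can be verified by enumeration. The sketch below is illustrative only; the nested-tuple encoding and function names are hypothetical, and the order in which trees are generated does not correspond to the tree numbers of figure 3.

```python
from itertools import combinations

def rooted_trees(taxa):
    """Enumerate rooted bifurcating topologies as nested 2-tuples of taxon names."""
    taxa = tuple(sorted(taxa))
    if len(taxa) == 1:
        return [taxa[0]]
    first, rest = taxa[0], taxa[1:]
    trees = []
    # Keep the first taxon in the left clade so mirror images are not counted twice.
    for size in range(len(rest)):
        for companions in combinations(rest, size):
            left = (first,) + companions
            right = tuple(t for t in rest if t not in companions)
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    trees.append((left_tree, right_tree))
    return trees

def is_symmetric(tree):
    """For four taxa: the root splits the taxa into two pairs."""
    return all(isinstance(child, tuple) for child in tree)

trees = rooted_trees("ABCD")
symmetric = [t for t in trees if is_symmetric(t)]
print(len(trees), len(symmetric), len(trees) - len(symmetric))
```

Each enumerated topology could then be turned into its own design matrix and scored by MSE, which is the exhaustive evaluation used in the simulations.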

First, we chose tree number 7 as the true tree to simulate the data. The average MSE of the 100 replicates for each tree is reported in table 3. The mean MSE of the true tree (tree number 7) was the smallest among the 15 possible trees. Therefore, MSE is sensitive to the choice of tree topology. The frequency of being chosen as the inferred phylogeny is given in table 4. When the number of loci was five, the percentage of choosing the right phylogeny (tree number 7) was 26%, which ranked the second highest (the highest value was 28% for tree number 2). When the number of loci increased to 10, the frequency of choosing the right tree increased to 44%, which dominated all other trees. As expected, this frequency increased as the number of loci increased. The patterns of increase in frequency of choosing the true tree and decrease in MSE for the true tree (tree number 7) with increasing number of loci are shown in figure 4a. Tree numbers 2, 8, 9, and 14 generally had smaller MSEs (table 3) and higher frequencies (table 4) than other trees (except for the true one). Looking into the 15 trees given in figure 3 again, we found that each of these four trees (tree numbers 2, 8, 9, and 14) retains a true clade, but no other trees retain any true clade.

Second, we chose tree number 9 as the true tree to simulate the data. The average MSE of the 100 replicates for each tree is given in table 5. The mean MSE of the true tree (tree number 9) was again the smallest among the 15 possible trees. The frequency of being chosen as the inferred phylogeny is given in table 6. When the number of loci was five, the percentage of choosing the right phylogeny (tree number 9) was 26%. When the number of loci increased to 50, this frequency increased to 94%. For 100 loci, it reached 100%. These frequencies are generally greater than those found when tree number 7 was assigned as the true tree. Similar plots are provided in figure 4b.

An Application to Human Evolution

The data consist of gene frequencies for five blood-group systems, A1A2BO, RH, MNSs, Fy, and Di, sampled from four human populations: Eskimo, Bantu, English, and Korean (see table 1 of Cavalli-Sforza and Edwards 1967). The total number of alleles at the five loci is 19. Cavalli-Sforza and Edwards analyzed the data under a similar drift model and provided an exhaustive treatment (15 possible rooted trees). Therefore, their results are directly comparable with the results presented here. These gene-frequency data were used to calculate heterozygosities and pairwise genetic distances (table 7).

The best (MSE = 0.00698) and second best (MSE = 0.00813) trees among the 15 rooted trees are given in figure 5. These two trees are also the best trees of Cavalli-Sforza and Edwards (1967). However, our best tree turns out to be their second best, in which the initial split places Bantu on one branch and Eskimo, English, and Korean on the other. The second split occurs between English and Eskimo-Korean. Internal branches of negative lengths are generated with the remaining 13 trees, which, on average, have an MSE several times larger than those of the two trees. The internal node that separates Eskimo and Korean has an estimated heterozygosity of 0.3954, but Korean has an estimate of 0.5348, which generated a negative external branch length for the Korean lineage after divergence from Eskimo. The negative external branch length has been set to zero (fig. 5). In general, our results are comparable with those of Cavalli-Sforza and Edwards (1967). Disregarding the position of the root, the two trees have the same topology. However, under the drift model, the two trees will produce quite different predictions. With the best tree, we would predict that English and Bantu are as alike as Eskimo (or Korean) and Bantu, while with the second best tree, English and Bantu are more alike than Eskimo (or Korean) and Bantu.

Nei and Roychoudhury (1993) inferred the phylogenetic tree for 26 human populations using the neighbor-joining method (Saitou and Nei 1987) from 29 polymorphic loci. Their tree divides the 26 populations into four major groups. Nei and Roychoudhury (1993) claimed that their tree was consistent with data on morphological differences, archaeological records, and geographic distributions of the populations. It turns out that the four human populations analyzed in this study represent the four major groups. The inferred phylogeny of the four major groups (Nei and Roychoudhury 1993, fig. 2) has the same topology as our best tree (fig. 5a), assuming that their tree was rooted on the longest branch. In addition, the branch lengths of our best tree are roughly proportional to those of Nei and Roychoudhury’s tree.

Table 3 Averages of the MSE of 100 Replicated Simulations

                              NUMBER OF LOCI
TREE        5       10       15       20      25     50     75    100
1      407.33   184.83   133.67    95.84   45.35  12.92  11.96  11.48
2      185.24    93.53    70.99    61.90   26.72   3.15   3.07   2.74
3      403.39   177.55   133.62    96.42   45.34  12.92  11.96  11.48
4      387.25   180.43   147.46    95.26   43.19  13.68  12.55  12.13
5      417.48   206.35   156.99   103.21   47.08  14.00  12.77  12.20
6      430.12   207.61   156.28    99.49   46.99  14.00  12.77  12.20
7      104.91    54.33    16.84    12.80    1.26   0.36   0.25   0.17
8      218.32   119.01    60.85    29.06    8.28   6.21   5.01   4.63
9      234.65   129.68    60.86    24.38    8.20   6.21   5.01   4.62
10     420.23   208.73   157.00    99.60   47.04  14.00  12.77  12.20
11     430.13   211.13   156.29   103.46   47.03  14.00  12.77  12.20
12     388.47   180.51   147.46    95.27   43.18  13.68  12.54  12.13
13     392.10   192.09   143.68    97.01   45.84  12.85  12.04  11.65
14     175.77   104.94    77.37    62.01   26.94   3.16   3.08   2.75
15     385.99   198.88   143.70    96.06   45.84  12.85  12.04  11.65

NOTE.-There are 15 possible rooted trees for four taxa, and tree number 7 is the true tree.

There is no doubt that genetic drift is an important evolutionary force, but it is not the only reason for population differentiation in human. The initial split of human population might have occurred 200,000 years ago (see Nei and Roychoudhury 1993), which is equivalent to 6,000-7,000 generations. For such a large time scale, selection and mutation may have played an important

Table 4 Frequency (of 100 replicates) of Being Chosen as the Inferred Phylogeny for Each Tree

NUMBER OF LOCI

TREE 5 10 15 20 25 50 75 100

8 9 10 11 12 13 14 15

15 2 28 20

0 1 3 3 1 2 1 2

26 44 11 5 5 8 2 0 0 0 0 2 2 2 6 7 0 2

0 20

2 0 0 0

54 6 3

0 0 0

14 0

0 12 0 1 0 0

66 3

10 0 0 0 0 7

0 13

1 0 0 0

65 5 7 0 0

0 8 0

0 0 8 2 0 0 0 0 0 0 0 0

82 93 4 2

0 0 0 0 0 0

100 0 0 0 0 0 0 0 0

NOTE.-Tree number 7 is the true tree.

[Figure 4: two panels, (a) and (b), each plotting values on a 0-100 scale against the number of loci (0-100).]

FIG. 4.-Changes of MSE and frequency that the inferred phylogeny is the true tree as the number of loci increases. All MSEs were expressed as a percentage of the MSE when the number of loci was five; (a) tree number 7 is the true tree; (b) tree number 9 is the true tree.


role in population divergence. Gene admixture may also have occurred after divergence of these populations. Estimates of heterozygosities of both internal and external nodes reported in table 7 are indeed much higher than we would normally anticipate under drift. From Cavalli-Sforza and Bodmer’s (1971, p. 733) table 11.9, we found that the sampled average heterozygosity over the five loci in the English population is 0.4788, which is similar to that described here (0.4693). Cavalli-Sforza and Bodmer (1971, pp. 732-735) explain the high level of heterozygosity as possibly due to selection for heterozygotes and to mutation. Disregarding all possible nondrift forces of evolution, the algorithm introduced here, which is valid under the pure drift model, seems relatively robust, because it used five loci with 19 alleles but produced a tree identical to Nei and Roychoudhury’s (1993) tree, which was based on 29 loci. In particular, the neighbor-joining method used by Nei and Roychoudhury (1993) does not depend on the pure drift model.

Table 5 Averages of MSE of 100 Replicated Simulations

NUMBER OF LOCI

TREE        5       10       15       20       25       50       75      100

1      256.90    79.29    52.53    22.96    11.18     5.16     4.66     4.95
2      258.99    77.03    50.29    22.38    12.20     5.23     4.65     4.99
3      290.02    88.73    55.58    23.96    12.38     5.33     4.75     5.04
4      264.16    88.92    51.50    21.92    12.21     5.16     4.69     4.97
5      282.89    82.03    54.04    23.00    11.35     5.16     4.65     4.99
6      205.42    31.51    38.47    16.34     3.78     3.13     2.78     2.99
7      146.32    66.21    20.68    10.06     9.97     3.24     2.96     3.02
8      154.72    75.11    20.86    10.05     9.98     3.24     2.96     3.02
9       97.38    14.63     1.49     1.40     0.54     0.28     0.20     0.10
10     233.25    31.24    38.61    16.35     3.79     3.13     2.78     2.99
11     260.09    81.49    52.49    22.96    11.20     5.17     4.66     4.95
12     262.14    88.91    51.48    22.14    12.21     5.16     4.69     4.97
13     290.73    84.05    55.52    24.33    12.39     5.34     4.74     5.04
14     261.75    73.46    50.27    22.43    12.20     5.23     4.65     4.99
15     287.34    75.65    54.02    23.25    11.34     5.17     4.65     4.99

NOTE.-There are 15 possible rooted trees for four taxa, and tree number 9 is the true tree.

Table 6 Frequency (of 100 replicates) of Being Chosen as the Inferred Phylogeny for Each of the 15 Trees

NUMBER OF LOCI

TREE     5    10    15    20    25    50    75   100

1       20     2     2     1     0     0     0     0
2        3     3     1     3     2     0     0     0
3        2     0     1     0     1     0     0     0
4        5     5     0     3     0     0     0     0
5        4     3     1     1     0     0     0     0
6        5     7     8     5     2     1     1     0
7       12     8     6     6     3     1     0     0
8       11     2     5     4     3     2     0     0
9       26    55    67    71    84    94    99   100
10       4     8     6     5     4     2     0     0
11       0     1     1     1     0     0     0     0
12       3     2     0     0     0     0     0     0
13       2     0     0     0     0     0     0     0
14       1     3     0     0     0     0     0     0
15       2     1     2     0     1     0     0     0

NOTE.-Tree number 9 is the true tree.

Discussion

Our purpose is to introduce a new phylogeny-inferring method under the pure drift model, using θ_XY-related genetic distance data. There are several fundamental differences between this method and other methods based on pairwise distance data. First, our method uses not only the pairwise distance (D_XY) but also the variation within a population (H_X), whereas only the former is used in conventional distance-based methods. H_X is the heterozygosity within a taxon and can be treated as the distance of a taxon with itself if it is denoted by D_XX, which is usually greater than zero. Second, the genetic distances presented here measure the heterozygosities of internal nodes. Therefore, the reduction in heterozygosity from one node to its successive node reflects the branch length. Third, upon the appropriate transformation of D_XY, the distance becomes a “similarity,” which is not the sum of the branches that connect X and Y. Instead, it is the sum of the lengths of all segments from the root to the fork where X and Y split. In other words, the magnitude of D_XY is determined by how long taxa X and Y shared the same evolutionary pathway, and it contains no information about the time since they split. Finally, this method directly fits rooted trees. If two rooted trees are identical in topology with regard to the taxa of interest but are rooted differently, the two trees will have different MSEs, in contrast to the Fitch-Margoliash-related methods.
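As an illustration of the third and fourth points, the fitted model can be written out for one rooted four-taxon topology, ((A,B),(C,D)). This is a sketch only: the symbol a (the intercept, or tree trunk), the branch lengths t (in units of 2N generations), the use of the negative natural logarithm as the transformation, and the error terms e are notation assumed here for illustration and are not copied from the model equation in the text.

```latex
\begin{aligned}
-\ln D_{AB} &= a + t_{AB} + e_{AB}, \\
-\ln D_{CD} &= a + t_{CD} + e_{CD}, \\
-\ln D_{XY} &= a + e_{XY}, \qquad X \in \{A, B\},\; Y \in \{C, D\}, \\
-\ln H_{A}  &= a + t_{AB} + t_{A} + e_{AA}, \\
-\ln H_{C}  &= a + t_{CD} + t_{C} + e_{CC}.
\end{aligned}
```

Under these assumptions, pairs that split at the root share no segment and inform only the intercept. Moving the root changes which pairs share segments, which is why two trees that differ only in their rooting yield different MSEs.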

Under the pure drift model, if there is no heterozygosity for taxon X, then there is no information about the length of the terminal branch. Taxon X may have been fixed just before the data were sampled or many generations ago. H_X will be zero in either case. In this

Table 7 Estimated Heterozygosities (H_X) of Terminal Taxa (Diagonals) and Heterozygosities (D_XY) of Internal Nodes (off Diagonals) for Four Human Populations Calculated from Five Blood-Group Loci (A1A2BO, RH, MNSs, Fy, and Di)

            Bantu    English    Eskimo    Korean

Bantu       .3558
English     .4907      .4693
Eskimo      .5478      .4888     .3675
Korean      .5997      .5082     .3954     .5348

case, we need to delete H_X from the data and delete its corresponding branch length from the unknown vector, t, when we fit the linear model. If all the taxa are fixed, like inbred laboratory strains of mice or rats, then we cannot estimate the length of any terminal branch. An internal branch is estimable only if the two lineages that diverged from this internal branch show a nonzero distance.
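The deletion step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the topology ((A,B),(C,D)), the parameter names, the negative-log transformation, the definition of MSE as the residual sum of squares divided by the residual degrees of freedom, and the noise-free data are all hypothetical.

```python
import math

# Hypothetical rooted topology ((A,B),(C,D)). Each observation lists the
# parameters on the path from the root to the fork (pairs) or to the tip (taxa).
# "a" is the intercept (tree trunk); the t_* names are illustrative only.
PARAMS = ["a", "t_AB", "t_CD", "t_A", "t_B", "t_C", "t_D"]
PATHS = {
    ("A", "B"): ["a", "t_AB"],
    ("C", "D"): ["a", "t_CD"],
    ("A", "C"): ["a"], ("A", "D"): ["a"],
    ("B", "C"): ["a"], ("B", "D"): ["a"],
    ("A", "A"): ["a", "t_AB", "t_A"],
    ("B", "B"): ["a", "t_AB", "t_B"],
    ("C", "C"): ["a", "t_CD", "t_C"],
    ("D", "D"): ["a", "t_CD", "t_D"],
}


def build_model(stats):
    """Return (params, X, y); stats maps (X, Y) to D_XY and (X, X) to H_X.

    A taxon with H_X == 0 loses both its observation and its terminal branch.
    """
    fixed = {x for (x, y), v in stats.items() if x == y and v == 0.0}
    dropped = {"t_" + x for x in fixed}
    params = [p for p in PARAMS if p not in dropped]
    X, y = [], []
    for (x1, x2), path in PATHS.items():
        if x1 == x2 and x1 in fixed:
            continue
        X.append([1.0 if p in path else 0.0 for p in params])
        y.append(-math.log(stats[(x1, x2)]))  # assumed transformation
    return params, X, y


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Ordinary least squares via the normal equations; returns (beta, MSE)."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [v - sum(c * x for c, x in zip(beta, r)) for r, v in zip(X, y)]
    dof = len(y) - k
    mse = sum(e * e for e in resid) / dof if dof > 0 else 0.0
    return beta, mse


# Noise-free hypothetical data generated from known parameter values.
TRUE = {"a": 0.5, "t_AB": 0.10, "t_CD": 0.20,
        "t_A": 0.30, "t_B": 0.05, "t_C": 0.15, "t_D": 0.25}
stats = {k: math.exp(-sum(TRUE[p] for p in path)) for k, path in PATHS.items()}
stats[("A", "A")] = 0.0  # taxon A is fixed, so H_A carries no branch information

params, X, y = build_model(stats)
beta, mse = ols(X, y)
estimates = dict(zip(params, beta))
```

In this sketch the fixed taxon A contributes no row and no terminal-branch column, so nine observations are fitted with six parameters, and the remaining branch lengths are still recovered from the noise-free data.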

The genetic distance between taxa, D_XY, and the heterozygosity within a taxon, H_X, are linearly related to θ_XY and F_X, respectively. The purpose in using D_XY is to estimate θ_XY, because only θ_XY relates to the time

[Figure 5: panel (a), best tree, and panel (b), second-best tree, each labeled with its mean squared error.]

FIG. 5.-The best (a) and the second best (b) trees for four human populations found using the new phylogenetic reconstruction method. The estimated branch lengths are numbers of generations expressed in units of 2N_e.

elapsed between taxa X and Y. If one already has good estimates of all pairwise coancestry coefficients and fixation indices obtained by other methods (e.g., Cockerham 1973; Reynolds et al. 1983; Weir 1990, pp. 135-172), it is not necessary to invoke the D_XY and H_X statistics. However, D_XY and H_X are easy to calculate and thus may turn out to be very useful.
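To show how little computation these statistics need, here is a minimal Python sketch. It assumes that the per-locus statistic is one minus the sum, over alleles, of the product of the allele frequencies in the two taxa, which reduces to the heterozygosity when a taxon is paired with itself (consistent with treating H_X as D_XX), and that loci are averaged with equal weight. The function names and the allele frequencies are hypothetical, and no sampling correction is applied.

```python
def locus_distance(p_x, p_y):
    """One minus the sum of p_Xi * p_Yi at one locus (assumed definition).

    When p_x and p_y are the same taxon, this is the heterozygosity H_X.
    """
    return 1.0 - sum(a * b for a, b in zip(p_x, p_y))


def mean_distance(freqs_x, freqs_y):
    """Unweighted mean of the per-locus statistic over loci."""
    values = [locus_distance(px, py) for px, py in zip(freqs_x, freqs_y)]
    return sum(values) / len(values)


# Hypothetical allele frequencies at two loci for taxa X and Y.
freqs_X = [[0.5, 0.5], [0.2, 0.3, 0.5]]
freqs_Y = [[0.9, 0.1], [0.6, 0.3, 0.1]]

H_X = mean_distance(freqs_X, freqs_X)   # heterozygosity within X
D_XY = mean_distance(freqs_X, freqs_Y)  # distance between X and Y
```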

Equation (8) for the average D_XY over loci assumes that all loci are equally informative. If they are not, a weighted average is more appropriate. However, the relative information provided by each locus depends on the variance of D_XY for that locus, which is rarely known. In addition, the m loci may be selectively chosen by researchers so that only polymorphic loci are included for a set of taxa. This will cause D_XY to be a biased estimate of the heterozygosity of their recent common ancestor. However, this bias is eventually absorbed into the estimate of the intercept (the tree trunk), which is usually not of interest.
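One way to write the weighted average discussed in this paragraph is with inverse-variance weights. This is only a sketch: the superscript (l) for the statistic at locus l, the weights w_l, and the per-locus variance σ_l² are notation assumed here, and the inverse-variance choice is one reading of "relative information," not a formula given in the text.

```latex
\bar{D}_{XY} \;=\; \frac{\sum_{l=1}^{m} w_l \, D_{XY}^{(l)}}{\sum_{l=1}^{m} w_l},
\qquad
w_l = \frac{1}{\sigma_l^{2}}
```

With equal weights this reduces to the simple average over the m loci.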

As mentioned earlier, the error terms in the linear model (eq. [15]) are correlated, with a variance-covariance matrix V. First, to derive V we have to derive the variance of D_XY, the covariance between D_XY and D_XZ, and so on. However, those variances and covariances also involve three- or four-gene identity by descent, which is certainly more complicated than two-gene identity by descent. Weir and Basten (1990) have developed explicit formulas for the variances and covariances of similar statistics from DNA-sequence data. Extension of the Weir-Basten formulas to allele-frequency data has not been obvious to us, so further investigation is needed. Second, the y variables in the linear model are log transformations of the D_XY statistics; even if we had explicit formulas for the variances and covariances of the D_XY statistics, the variances of their log transformations would still have to be approximated unless one knows the distributional properties of D_XY. Alternatively, one can invoke the bootstrap or jackknife resampling technique (Efron 1979) and substitute V by its bootstrap estimate. This may increase the chance of picking the right tree and improve the estimates of branch lengths. On the other hand, the errors associated with the estimation of V are also introduced into the estimation of t, which may, in turn, cause more errors in the estimation of branch lengths. Further investigation of this issue is necessary.
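The bootstrap alternative can be sketched as follows. The sketch assumes that loci are the resampling unit, that each statistic is averaged over the resampled loci with equal weight, and that the y variables are negative logarithms of those averages; none of these choices is specified in the text. The function name and the per-locus values are hypothetical.

```python
import math
import random


def bootstrap_cov(per_locus, n_boot=500, seed=0):
    """Bootstrap covariance matrix of y = -log(mean statistic over loci).

    per_locus maps a label (e.g. "D_AB") to a list of per-locus values, with
    loci in the same order for every label. The same resampled locus indices
    are used for all labels, so correlations between statistics are kept.
    """
    rng = random.Random(seed)
    labels = sorted(per_locus)
    m = len(per_locus[labels[0]])
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(m) for _ in range(m)]  # loci, with replacement
        draws.append([-math.log(sum(per_locus[k][i] for i in idx) / m)
                      for k in labels])
    k = len(labels)
    means = [sum(d[j] for d in draws) / n_boot for j in range(k)]
    cov = [[sum((d[i] - means[i]) * (d[j] - means[j]) for d in draws)
            / (n_boot - 1) for j in range(k)] for i in range(k)]
    return labels, cov


# Hypothetical per-locus statistics at five loci.
data = {
    "D_AB": [0.52, 0.47, 0.61, 0.40, 0.55],
    "D_AC": [0.58, 0.50, 0.66, 0.45, 0.60],
    "H_A":  [0.35, 0.30, 0.48, 0.22, 0.41],
}
labels, V = bootstrap_cov(data)
```

The resulting matrix could stand in for V in a generalized-least-squares fit, with the caveat raised above that its own estimation error carries over into the branch-length estimates.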

Inferring phylogeny under the pure drift model was originally suggested by Edwards and Cavalli-Sforza (1964) and Cavalli-Sforza and Edwards (1967), who turned the allelic frequencies into coordinates in a Euclidean space by using a generalization of the arcsine transformation, so that the process of random genetic drift could be approximated by a process of Brownian motion in these Euclidean coordinates. A maximum-likelihood approach was then suggested using the transformed frequency data. However, the authors ran into singularities in the “likelihood surface,” which forced them to use an ad hoc approach, the method of minimum evolution. Formal maximum-likelihood solutions were provided, via a restricted maximum-likelihood (REML) approach, by Felsenstein (1973, 1981, 1985), who subsequently made the computer program available (i.e., the CONTML program in the PHYLIP package of Felsenstein 1989). Rohlf and Wooten (1988) evaluated the efficacy of Felsenstein’s REML method relative to Wagner parsimony and UPGMA and obtained some results opposed to those of Kim and Burgman’s (1988) simulations.

Acknowledgments

We thank B. S. Weir, Z.-B. Zeng, and R. R. Hudson for many helpful suggestions on the earlier version of the manuscript. We also thank two anonymous reviewers for useful comments and suggestions. This work was supported by National Institutes of Health grant GM-45344 and National Science Foundation grants BSR-9 107 18 to W.R.A. and BSR-9096052 to W.M.F.

LITERATURE CITED

ATCHLEY, W. R., and W. M. FITCH. 1991. Gene trees and the origins of inbred strains of mice. Science 254:554-558.

-. 1993. Genetic affinities among inbred strains of laboratory mice. Mol. Biol. Evol. 10:1150-1169.

CAVALLI-SFORZA, L. L. 1966. Population structure and human evolution. Proc. R. Soc. Lond. [B] 164:362-379.

CAVALLI-SFORZA, L. L., I. BARRAI, and A. W. F. EDWARDS. 1964. Analysis of human evolution under random genetic drift. Cold Spring Harb. Symp. Quant. Biol. 29:9-20.

CAVALLI-SFORZA, L. L., and W. F. BODMER. 1971. The genetics of human populations. W. H. Freeman, San Francisco.

CAVALLI-SFORZA, L. L., and A. W. F. EDWARDS. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 21:550-570.

COCKERHAM, C. C. 1973. Analysis of gene frequencies. Genetics 74:679-700.

EDWARDS, A. W. F., and L. L. CAVALLI-SFORZA. 1964. Reconstruction of evolutionary trees. Pp. 67-76 in V. H. HEYWOOD and J. MCNEILL, eds. Phenetic and phylogenetic classification. Systematics Association, London.

EFRON, B. 1979. Bootstrap methods: another look at the jackknife. Ann. Stat. 7:1-26.

FALCONER, D. S. 1980. Introduction to quantitative genetics. 2d ed. Longman, London.

FELSENSTEIN, J. 1973. Maximum-likelihood estimation of evolutionary trees from continuous characters. Am. J. Hum. Genet. 25:471-492.

-. 1981. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 35:1229-1252.

-. 1985. Phylogenies from gene frequencies: a statistical problem. Syst. Zool. 34:300-311.

-. 1989. PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164-166.

FISHER, R. A. 1958. The genetical theory of natural selection. 2d ed. Dover, New York.

FITCH, W. M., and M. MARGOLIASH. 1967. Construction of phylogenetic trees. Science 155:279-284.

HILDRETH, C. 1957. A quadratic programming procedure. Naval Res. Logistics Q. 4:79-85.

KEMPTHORNE, O. 1969. An introduction to genetic statistics. Iowa State University Press, Ames.

KIM, J., and M. A. BURGMAN. 1988. Accuracy of phylogenetic-estimation methods under unequal evolutionary rates. Evolution 42:596-602.

LACY, R. C. 1987. Loss of genetic diversity from managed populations: interacting effects of drift, mutation, immigration, selection, and population subdivision. Conserv. Biol. 1:143-158.

NEI, M. 1976. Mathematical models of speciation and genetic distance. Pp. 723-765 in S. KARLIN and E. NEVO, eds. Population genetics and ecology. Academic Press, New York.

-. 1987. Molecular evolutionary genetics. Columbia University Press, New York.

NEI, M., and A. K. ROYCHOUDHURY. 1993. Evolutionary relationships of human populations on a global scale. Mol. Biol. Evol. 10:927-943.

REYNOLDS, J., B. S. WEIR, and C. C. COCKERHAM. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767-779.

ROHLF, F. J., and M. C. WOOTEN. 1988. Evaluation of the restricted maximum-likelihood method for estimating phylogenetic trees using simulated allele-frequency data. Evolution 42:581-595.

SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.

SWOFFORD, D. L., and G. J. OLSEN. 1991. Phylogeny reconstruction. Pp. 411-501 in D. M. HILLIS and C. MORITZ, eds. Molecular systematics. Sinauer, Sunderland, Mass.

WEIR, B. S. 1990. Genetic data analysis. Sinauer, Sunderland, Mass.

WEIR, B. S., and C. J. BASTEN. 1990. Sampling strategies for distances between DNA sequences. Biometrics 46:551-582.

WRIGHT, S. 1943. Isolation by distance. Genetics 28:114-138.

-. 1951. The genetical structure of populations. Ann. Eugenics 15:323-354.

-. 1965. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19:395-420.

NAOYUKI TAKAHATA, reviewing editor

Received February 8, 1994

Accepted June 10, 1994
