Phylogenetic Inference and Hypothesis Testing
Catherine Lai (92720)
BSc(Hons) Department of Mathematics and Statistics
University of Melbourne
November 13, 2003
Contents
1 Introduction 4
2 Molecular Phylogenetics 5
2.1 The Use of Phylogenetic Trees . . . 5
2.2 Traditional Approaches . . . 6
2.3 Phylogenetic Trees From Genomic Data . . . 6
2.4 What about the root? . . . 7
2.5 How Treelike is Evolution? . . . 8
3 Models of Evolution 9
3.0.1 A Simple Approach . . . 9
3.0.2 Evolution as a stochastic process . . . 10
3.1 Markov Models of Evolution . . . 10
3.1.1 Markov Theory . . . 10
3.1.2 Markov Models of Site Substitution . . . 11
3.2 Parameterized Models of Nucleotide Evolution . . . 12
3.2.1 Jukes-Cantor Model . . . 13
3.2.2 Jukes-Cantor Variance . . . 15
3.2.3 Generalisations of the Jukes-Cantor Model . . . 16
3.3 Problems with Markov Models of Evolution . . . 17
3.4 Modelling Rate Heterogeneity . . . 18
3.5 Modelling Non-Stationarity . . . 18
3.5.1 Summary of Nucleotide Markov Models . . . 20
3.6 Empirical Models of amino acid evolution . . . 21
3.6.1 PAM/Dayhoff Substitution Matrices . . . 21
3.6.2 BLOSUM . . . 23
3.7 Differences in PAM and BLOSUM . . . 24
4 Phylogenetic Tree Reconstruction Methods 25
4.1 Evaluating Reconstruction Methods . . . 25
4.1.1 Complexity . . . 25
4.1.2 Accuracy . . . 26
4.1.3 Consistency . . . 26
4.1.4 Efficiency . . . 26
4.1.5 Robustness . . . 27
4.1.6 Usability in tests . . . 27
4.2 Parsimony . . . 27
4.3 Maximum Likelihood . . . 28
4.4 Is MP the same as ML? . . . 30
4.5 Distance Based Methods . . . 31
4.5.1 Unweighted pair group method using arithmetic averages (UPGMA) . . . 31
4.5.2 The Molecular Clock Hypothesis . . . 32
4.5.3 Long Branch Attraction . . . 32
4.5.4 Neighbour Joining . . . 33
4.5.5 BIONJ . . . 34
4.5.6 Weighbor . . . 35
4.5.7 NJ and the minimum evolution method . . . 36
4.6 Least Squares . . . 36
4.6.1 Estimating Branch lengths . . . 36
4.6.2 Minimum Evolution Method with Least Squares . . . 37
4.7 Bayesian Tree Reconstruction . . . 38
4.8 Trees from Alignments, Alignments from Trees . . . 39
5 Phylogenetic Hypothesis Tests 40
5.1 Confidence Regions of Phylogenetic Trees . . . 40
5.2 The Bootstrap . . . 41
5.3 The Non-parametric Bootstrap . . . 42
5.3.1 Testing Phylogenies using the Non-parametric Bootstrap . . . 43
5.3.2 How well does it work? . . . 44
5.4 The Parametric Bootstrap . . . 45
5.4.1 Problems with the Parametric Bootstrap . . . 46
5.5 Bootstrap Based Tests . . . 46
5.5.1 Centering . . . 47
5.5.2 The Kishino Hasegawa Test . . . 47
5.5.3 The Shimodaira Hasegawa Test . . . 48
5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH) . . . 50
5.6 Bayesian methods . . . 51
5.7 Bootstraps and Posterior Probabilities . . . 51
5.8 Which Test? . . . 52
6 Generalized Least Squares in Phylogenetic Hypothesis Testing 54
6.1 Sample Average Variance and Covariance . . . 55
6.2 Motivation for Simulation of GLS test statistic . . . 58
6.3 GLS Test Statistic Simulation Method . . . 58
6.4 Results . . . 59
6.5 Discussion . . . 60
6.6 Contribution . . . 62
6.7 Further Work . . . 62
7 Conclusion 63
A GLS Results: Sample Average Covariance 64
A.1 Four Leaf Trees . . . 64
A.2 Five Leaf Trees . . . 69
B GLS Results: JC Covariance 73
B.1 Four Leaf Trees . . . 73
B.2 Five Leaf Trees . . . 77
C Covariance Estimation 81
C.1 Sample Covariance Results . . . 81
C.1.1 Sample Covariance - 100bp . . . 81
C.1.2 Sample Covariance - 1000bp . . . 82
C.1.3 Sample Covariance - 10000bp . . . 82
C.2 Jukes-Cantor Covariance . . . 83
C.2.1 Jukes-Cantor Covariance - 100bp . . . 83
C.2.2 Jukes-Cantor Covariance - 1000bp . . . 83
C.2.3 Jukes-Cantor Covariance - 10000bp . . . 83
C.3 Sample Average Covariance (Susko) . . . 84
C.3.1 Sample Average Covariance (Susko) - 100bp . . . 84
C.3.2 Sample Average Covariance (Susko) - 1000bp . . . 84
C.3.3 Sample Average Covariance (Susko) - 10000bp . . . 84
Chapter 1
Introduction
Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on earth.
The aim is to understand relationships between species and through this the process of evolution
itself. These relationships can be represented with a graph structure - traditionally simplified to
evolutionary trees. The current approach is to try to reconstruct these trees from the blueprint
of life: DNA sequences.
Reconstruction methods are difficult to design and evaluate because the biological evidence
is often ambiguous. Many approaches have been introduced to deal with the problems of estima-
tion and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary
knowledge we have of evolution while non-parametric approaches have been developed to avoid
the possibility of inaccurate preconceptions.
Recently, Susko[40] presented an approach that applies the theory of generalized least squares
to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical
foundations in the theory of linear models. While the theory appears to be sound it is based on
asymptotic results with regard to sequence length. It is not clear how well the test will perform
in practice where the length of sequences is often only a few hundred nucleotides. I investigate
the effect of sequence length on this approach. I also consider how Susko's approach differs
from traditional parametric techniques with respect to variance-covariance estimation.
In Chapter 2 I will give a general background to the problems involved in molecular
phylogenetics. In Chapters 3 and 4 I review commonly used probabilistic models and tree
reconstruction methods, respectively. In Chapter 5 I consider methods of evaluating
confidence in results from such reconstruction methods when there may be conflict. This leads
to an examination of current hypothesis testing methods and consideration of the validity of
the generalized least squares approach in Chapter 6.
Chapter 2
Molecular Phylogenetics
Phylogenetic trees represent relationships between species. They tell the story of life on earth.
A phylogenetic tree is a tree in the graph sense. External vertices (nodes) represent extant
species while internal nodes represent speciation events. The tree topology determines the
lines of evolution - which species descended from which common ancestors. The branch (edge)
lengths represent time since speciation of adjacent nodes. In reality, absolute time scales cannot
be used and relative time scales are employed. These time scales depend on the data used to
infer the tree. For example, if we use genome sequence data, a scale is the expected number of
substitutions that have taken place at a site.
2.1 The Use of Phylogenetic Trees
The role of phylogenetics is to help us understand the process of evolution from the patterns in
nature we can observe in the present. As Huelsenbeck describes [25]:
[F]or any question in which history may be a confounding factor, phylogenies
have a central role.
The most obvious use of phylogenetics is inferring common ancestors. This has implications
for our understanding of evolution on a large scale. It can also help with more immediate
problems, for example in epidemiology and understanding the spread of viruses. Viruses such
as hepatitis C have a long dormant period, meaning we can only detect the spread of the virus
as it happened in the past.
Understanding these processes and relationships can help in the development of biotechnology,
for example by improving the design of drugs to account for host-pathogen mutual genetic
variation. Phylogenies also shed light on structural biology, helping us to
infer the function and functional constraints of genes [30].
2.2 Traditional Approaches
Historically phylogenetic trees have been constructed using two principles. The phenetic ap-
proach uses similarity scores derived from measures of physical characteristics. The most similar
species are clustered together. While this is an intuitive approach, results may not represent
genetic or evolutionary similarity.
The cladistic approach assumes that related species will share unique features that were not
present in distant ancestors. All species in a group must share a common ancestor. This means
that species with many similar physical traits may not be grouped together.
Both of the above approaches rely heavily on morphological and geographical data. However,
this has changed with our understanding of the role of DNA in evolution. DNA sequencing
has massively increased information about the evolutionary process. Most tree reconstruction
methods now focus on examining the way DNA (or amino acid) sequences have evolved.
2.3 Phylogenetic Trees From Genomic Data
In molecular phylogenetics we search for patterns in genomic data. What we find is that
evolution is a stochastic process. Mutations arise from changes to a species' genome: that is,
site substitutions, deletions, insertions and inversions. If we understand the process of mutation,
we can make reasonable inferences about our past from the genomic material we have now.
Sequences that are very different are likely to be less closely related than sequences showing
high similarity.
Determining similarity is not an easy problem. Substitutions may be hidden from view by a
number of factors: a site may change and then change back again (a reversal); more than one
mutation may occur at the same site; or parallel changes may occur on different branches of the
tree (convergence or parallelism).
With that in mind, there are three major components to phylogenetic inference that need
to be considered.
Probabilistic Models in Phylogenetics
We need to consider what role probabilistic models can play to help our understanding of the
problem. Typically, it is assumed that mutations occurring at time t depend on the sequence at
that time but not on its previous history. This suggests a Markov model of sequence evolution.
However, whether or not the traditional assumptions such as homogeneity and reversibility are
valid is less clear. The role of probabilistic models in phylogenetics is discussed in Chapter 3.
Reconstructing Phylogenetic trees from Sequence Data
We need to understand what methods can be employed to actually reconstruct a phylogenetic
tree. Within this there are three distinct problems to consider: choosing a criterion, estimating
the tree topology, and estimating branch lengths. Besides this, there has been a long-standing
feud among biologists (and, more recently, mathematicians) over whether parametric or non-
parametric methods, or something in between, should be used. A review of commonly used tree
reconstruction methods is contained in Chapter 4.
Hypothesis Tests of Phylogenetic Trees
Once a phylogenetic tree is decided upon, the next step is usually to try to identify which parts
are well supported by the data. Hypothesis tests must be used with an understanding of what
they actually test and what information they can provide the user. To complicate matters, the
hypothesis being tested is often a tree topology, and it is still unclear how the usual statistical
measures, such as variance, can be applied to such a structure.
The debate between parametric and non-parametric testing continues. The non-parametric
bootstrap has been used extensively in phylogenetics to provide a measure of confidence. However,
the parametric bootstrap and Bayesian methods are also becoming popular. All
have their advantages and disadvantages.
In any case, inferred trees need to be compared to traditional biological data (e.g. morphological).
No matter how well a method works for simulated data, the aim of the game is to
understand the process in reality. These issues are considered in greater detail in Chapter 5
and (with respect to a generalized least squares test) Chapter 6.
2.4 What about the root?
In theory phylogenetic trees should be rooted to represent descent from the common ancestor.
However, as we attempt to reconstruct phylogenetic trees, we have to consider all possible
positions of a root with respect to the other nodes of the tree. Unfortunately it is unlikely that
sequences will contain enough information to accurately place a root.
However, there are some methods for rooting a tree. These include adding data from very
distantly related species (outgroups), or the use of the molecular clock hypothesis. The use of
outgroups brings its own bundle of problems, as error in the distantly related species affects the
more closely related species. This, and the validity of the molecular clock hypothesis, is discussed
in section 4.5.2.
2.5 How Treelike is Evolution?
It is worth asking the question of whether trees are really the right structure to describe evolu-
tion. Often there is not enough information in the data to resolve the issue of parallel mutations,
while gene transfer from species to species means that a gene can have more than one ancestor.
There is a criterion to determine whether data are treelike: the four point condition. For any taxa
i, j, k, l in a phylogenetic tree with four or more taxa we have:

d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl}, d_{jk} + d_{il})    (2.1)

Otherwise, other structures may be more valid. For example, see Strimmer and Moulton's
work on phylogenetic networks [39].
Chapter 3
Models of Evolution
Most widely used tree reconstruction methods use probabilistic models of evolution. These can
be formulated parametrically, using known (or assumed) properties of sequence evolution. They
can also be derived empirically from information in the observed sequences.
It makes sense to use whatever knowledge we have about the process of evolution rather
than ignore it. On the other hand, evolution is very complex and biological evidence is often
ambiguous. An example of a factor that needs to be taken into account, but is very hard to
model, is variation in rates of evolution between and within lineages.
How well a model fits reality can affect how a testing method works. Simpler models have
greater power to discriminate but may be biased. So understanding models is necessary to
understanding both tree reconstruction and confidence testing [19].
3.0.1 A Simple Approach
Evolutionary distances represent the divergence between species. That is, branch lengths on a
phylogenetic tree. The following naive approach to determining distance shows why a proba-
bilistic model is desirable.
When determining distances between sequences it is intuitive to use a measure of
dissimilarity: take the distance between two sequences to be the Hamming distance.
That is, for two species x and y, with sequence length S, we have:
D_{xy} = \sum_{i=1}^{S} \mathbf{1}(x_i \neq y_i)    (3.1)
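As a minimal illustration of (3.1), the count of mismatched sites can be computed directly (the function name is my own):

```python
def hamming_distance(x, y):
    """Number of sites at which two aligned, equal-length sequences differ (3.1)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(xi != yi for xi, yi in zip(x, y))
```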
However, this approach does not take into account the phenomena described in Section 2.3, such as
reversals and parallelism. This means the observed number of substitutions in a given sequence
is a lower bound for the actual number of substitutions that have occurred. This basically
means we cannot accurately look very far back in time.
We need models that estimate substitution rates that correct for unseen events. An obvious
first step is to try to define evolution as some type of stochastic process.
3.0.2 Evolution as a stochastic process
Definition 3.0.1 A stochastic process is a collection of random variables \{X(t)\}_{t \in T} with a
common probability space.

We can think of the process of evolution as the stochastic process of substitutions in a
sequence. The set of states is the nucleotides (or amino acids). That is, for a nucleotide
model, X(t) \in \{A, C, G, T\} is the nucleotide at that site at time t. 1
To develop a tractable model for evolution we need to make further simplifying assumptions.
This leads us to the well studied world of Markov processes.
3.1 Markov Models of Evolution
3.1.1 Markov Theory
Definition 3.1.1 A stochastic process has the Markov property if the probability of observing
a new state at time s + t only depends on the state at time s. That is,
P(X(s + t) = j \mid X(s) = i_s, \ldots, X(0) = i_0) = P(X(s + t) = j \mid X(s) = i_s)    (3.2)
A Markov process is a stochastic process with the Markov property. The Markov property
is also referred to as the memoryless property of Markov processes. If t, s ∈ Z then we have a
discrete time Markov process. Similarly if t, s ∈ R we have a continuous time Markov process.
To model evolution we want the latter since evolution is happening in continuous time.
Definition 3.1.2 The transition probability, P_{ij}(s, t), is the probability of changing from state
i at time s to state j at time s + t.

For a Markov process we have:

P_{ij}(s, t) = P(X(s + t) = j \mid X(s) = i)    (3.3)
1 When deriving species trees this means the nucleotide at a site at time t expressed in the majority of the population.
If P_{ij}(s, t) above is independent of s then the process is homogeneous. That is,

P_{ij}(s, t) = P_{ij}(t)    (3.4)
We can collect these probabilities in a transition matrix \mathbf{P}(t). The transition probabilities obey
the Chapman-Kolmogorov equation:

P_{ik}(t) = \sum_j P_{ij}(v) P_{jk}(t - v)    (3.5)

In matrix form:

\mathbf{P}(t) = \mathbf{P}(v)\mathbf{P}(t - v)    (3.6)

This is equivalent to

\mathbf{P}(t + v) = \mathbf{P}(t)\mathbf{P}(v)    (3.7)

with initial condition

\mathbf{P}(0) = \mathbf{I}    (3.8)

From this, we can extrapolate the transition probabilities at time t as

\mathbf{P}(t) = [\mathbf{P}(1)]^t    (3.9)
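Equations (3.6)-(3.9) are easy to verify numerically. The sketch below uses a hypothetical two-state transition matrix (assuming numpy):

```python
import numpy as np

# A hypothetical two-state transition matrix P(1); each row sums to one.
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])

P2 = P1 @ P1                        # P(2) = P(1)P(1), as in (3.7)
P3 = np.linalg.matrix_power(P1, 3)  # P(3) = [P(1)]^3, as in (3.9)

assert np.allclose(P3, P2 @ P1)                               # (3.6) with t = 3, v = 1
assert np.allclose(np.linalg.matrix_power(P1, 0), np.eye(2))  # P(0) = I, (3.8)
assert np.allclose(P3.sum(axis=1), 1.0)                       # rows still sum to one
```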
3.1.2 Markov Models of Site Substitution
A discrete state continuous time Markov process of site substitution can be formulated as
follows. We define the transition probabilities as the probability of substitution of a nucleotide
or amino acid. We also assume that the process is time-homogeneous. That is, the rate of
substitution is independent of time and the process is the same throughout the whole tree.
Markov chain models of site substitution are usually further constrained by other properties
such as stationarity and reversibility. The assumption of stationarity is that the process is in
equilibrium. This effectively assumes that nucleotide frequencies are (approximately) the
same from species to species throughout time. Reversibility means that we do not distinguish the
process from the process in reverse: we treat the process running from an ancestor species to a
descendant and vice versa as the same. That is, evolution does not have a 'direction'.
This is in fact a model of the evolution of a sequence along a branch of a phylogenetic tree. Usually
the process of substitution is assumed to be Poisson.
The validity of these assumptions is discussed in Section 3.3. Before this, I examine how transition
matrices for Markov models of evolution can be derived parametrically.
3.2 Parameterized Models of Nucleotide Evolution
We can derive the transition matrix \mathbf{P}(t) by estimating a rate matrix \mathbf{Q}, where Q_{ij} is the rate
at which i changes to j in a very small time step \delta t:

P_{ij}(\delta t) = Q_{ij}\,\delta t + o(\delta t)    (3.10)

In fact, if \sum_j P_{ij} = 1 for all states i (that is, the process is honest), then \mathbf{Q} defines the process
(hence the transition matrix) uniquely [9].
Let \mathbf{P}(1) = e^{\mathbf{Q}}. Then using the Chapman-Kolmogorov equation,

\mathbf{P}(t) = \mathbf{P}(1)^t = e^{t\mathbf{Q}}    (3.11)
= \sum_{j=0}^{\infty} \frac{(t\mathbf{Q})^j}{j!}    (3.12)

Now,

\frac{d}{dt}\mathbf{P}(t)\Big|_{t=0} = \mathbf{Q}e^{t\mathbf{Q}}\Big|_{t=0}    (3.13)
= \mathbf{Q}    (3.14)
Now let v = (1, 1, 1, 1)^T. It is easy to see that v is an eigenvector of \mathbf{P}(t), since \sum_j P_{ij}(t) = 1
as noted earlier, so \mathbf{P}(t)v = v. Hence

v = e^{t\mathbf{Q}}v = \sum_{j=0}^{\infty} \frac{(t\mathbf{Q})^j v}{j!}    (3.15)
= v + \sum_{j=1}^{\infty} \frac{t^j (\mathbf{Q}^j v)}{j!}    (3.16)
= v    (3.17)

The remainder is a power series in t that vanishes for every t, so all of its coefficients \mathbf{Q}^j v
must be zero. In particular \mathbf{Q}v = 0. Rate
matrices satisfying this condition define valid processes.
It is clear at this point that Markov models suffer from confounding: e^{t\mathbf{Q}} = e^{(t/\gamma)(\gamma\mathbf{Q})}
for any \gamma > 0, so a rescaling of the rate matrix can be absorbed into the time scale. This means
that absolute time scales cannot be used. Hence,
expected substitutions per site is the usual time unit stated.
3.2.1 Jukes-Cantor Model
Of all the Markov models of evolution, the Jukes-Cantor model makes the strongest assumptions
about the process. Besides the properties of Markov chains this model assumes:
• The process acts on sites independently and identically (iid).
• All substitutions occur with equal probability.
With these assumptions, we need only define an appropriate rate matrix to derive transition
probabilities. If we define the rate of substitution as α, the Jukes-Cantor rate matrix:
\mathbf{Q}_{JC} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}    (3.18)
The rate at which a site stays in its current state must be −3α as row sums must equal zero.
We can find the spectral decomposition of e^{t\mathbf{Q}_{JC}} by determining the eigenvectors and
eigenvalues of \mathbf{Q}:

\mathbf{Q} = \mathbf{S}\,\mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)\,\mathbf{S}^{-1}    (3.19)

where \mathbf{S} is the matrix with the eigenvectors of \mathbf{Q} as columns and the \lambda_i are the corresponding
eigenvalues. From further linear algebra we can write:

e^{t\mathbf{Q}} = \mathbf{S}\,\mathrm{diag}(e^{t\lambda_1}, e^{t\lambda_2}, e^{t\lambda_3}, e^{t\lambda_4})\,\mathbf{S}^{-1}    (3.20)
In this case it is easy to see that the matrix \mathbf{S} can be defined as follows:

\mathbf{S} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}    (3.21)
The corresponding eigenvalues are \lambda_1 = 0 and \lambda_2 = \lambda_3 = \lambda_4 = -4\alpha. Also, we can verify that
\mathbf{S}^{-1} = \frac{1}{4}\mathbf{S}^T. We can now derive the transition probabilities for the Jukes-Cantor model:

P_{ij}(t) = \begin{cases} \frac{1}{4}(1 + 3e^{-4\alpha t}) & i = j \\ \frac{1}{4}(1 - e^{-4\alpha t}) & i \neq j \end{cases}    (3.22)
The probability of seeing a change after time t does not depend on the current state:

\Pr(X(t) \neq X(0)) = \sum_{b \neq a} \Pr(X(t) = b \mid X(0) = a)    (3.23)
= \sum_{b \neq a} P_{ab}(t)    (3.24)
= 3 \times \tfrac{1}{4}(1 - e^{-4\alpha t})    (3.25)
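The closed form (3.22), and the change probability (3.25) that follows from it, can be checked against the spectral decomposition (3.20) numerically. This is a sketch assuming numpy; the parameter values are illustrative.

```python
import numpy as np

alpha, t = 0.1, 2.0

# Jukes-Cantor rate matrix (3.18): alpha off the diagonal, -3*alpha on it.
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))

# e^{tQ} via the spectral decomposition (3.20).
lam, S = np.linalg.eig(Q)
P = (S @ np.diag(np.exp(t * lam)) @ np.linalg.inv(S)).real

# Closed form (3.22).
off = 0.25 * (1 - np.exp(-4 * alpha * t))
P_closed = np.full((4, 4), off) + (1 - 4 * off) * np.eye(4)

assert np.allclose(P, P_closed)
assert np.allclose(P.sum(axis=1), 1.0)       # rows sum to one
assert np.allclose(1 - np.diag(P), 3 * off)  # change probability (3.25)
```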
We can use the observed proportion of changed sites to estimate the time of
divergence by solving the above for t. The number of changed sites is binomially distributed,
since each site either has or has not changed: it follows \mathrm{Bin}(N, \frac{3}{4}(1 - e^{-4\alpha t})),
where N is the length of the sequence. This means \hat{P}_c = (\text{no. of changed sites})/N is a maximum
likelihood estimate of the probability of change. From the invariance property of maximum likelihood
estimates, the following is a maximum likelihood estimate of the time of divergence:

\hat{t} = \frac{-1}{4\alpha}\log\left(1 - \frac{4}{3}\hat{P}_c\right)    (3.26)
This is the Jukes-Cantor distance estimate. It is usually written as d_{ij}, where i and j
represent two sequences/taxa. Since two independent (unrelated) sequences are expected to agree
at 1/4 of sites, sequences are considered to be unrelated as \hat{P}_c \to 3/4. At this point distances
tend to infinity.
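With the rate parameter \alpha = 1/3 (derived in the next subsection), (3.26) becomes the familiar d = -(3/4) ln(1 - (4/3) p̂). A sketch (the function name is my own):

```python
import math

def jukes_cantor_distance(x, y):
    """JC distance (3.26) with alpha = 1/3: d = -(3/4) ln(1 - (4/3) p)."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    if p >= 0.75:
        return math.inf  # sequences look unrelated; the distance diverges
    return -0.75 * math.log(1 - 4.0 * p / 3.0)
```

Note that the corrected distance always exceeds the raw proportion of differences, reflecting the hidden substitutions discussed in Section 2.3.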
Selection of Rate Parameter
For very short time spans the total number of changes inferred by the Jukes-Cantor estimate
is equal to the number of observed changes. More precisely:

\lim_{t \to 0} \frac{P_c(t)}{P_{obs}(t)} = 1    (3.27)

As both numerator and denominator tend to zero, this can be seen using l'Hopital's rule:

\frac{dP_c}{dt}\Big|_{t=0} = \frac{dP_{obs}}{dt}\Big|_{t=0} = 1    (3.28)

Now,

\frac{dP_c}{dt} = 3\alpha e^{-4\alpha t}    (3.29)
= 3\alpha \text{ at } t = 0    (3.30)

Applying our boundary condition,

3\alpha = 1    (3.31)
\alpha = \frac{1}{3}    (3.32)
3.2.2 Jukes-Cantor Variance
The variance of the Jukes-Cantor estimate can be derived using the delta method:

\hat{t} - E\hat{t} \approx \frac{dt}{dp} \times (\hat{p} - E\hat{p})    (3.33)

\sigma^2(\hat{t}) = \frac{1}{(1 - \frac{4}{3}p)^2} \cdot \frac{p(1 - p)}{n}    (3.34)

That is,

\sigma^2(\hat{t}) = e^{8t/3}\,T(1 - T)/S    (3.35)

where T = \frac{3}{4}(1 - e^{-4t/3}) and S is the sequence length. That is, the variance grows exponentially
with t. The covariance of two distance estimates is derived using the tree structure. Distance
estimates are represented on a phylogenetic tree by the sum of branch lengths on the unique
path between the taxa in question. To calculate the covariance of two pairwise distances we
simply calculate the variance of the branches common to both paths.
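A direct transcription of (3.35) makes the exponential growth and the 1/S scaling easy to inspect (the function name is my own):

```python
import math

def jc_variance(t, S):
    """Variance of the JC distance estimate, eq. (3.35).

    T is the expected proportion of differing sites after time t and
    S the sequence length; the variance grows exponentially with t.
    """
    T = 0.75 * (1 - math.exp(-4 * t / 3))
    return math.exp(8 * t / 3) * T * (1 - T) / S
```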
3.2.3 Generalisations of the Jukes-Cantor Model
It is very clear from biological evidence that the assumptions made in the Jukes-Cantor model do
not generally hold. Analysis of DNA sequences shows that substitutions are not equiprobable.
An example of this is transition/transversion bias. Nucleotides are grouped according to
their molecular structure as purines (A, G) or pyrimidines (C, T). Purine to purine or pyrimidine
to pyrimidine substitutions are called transitions. The rest are called transversions. Because of
this molecular structure it is much more likely that a transition than a transversion will happen.
The Kimura-2-parameter model (K2P) attempts to correct this assumption by introducing
parameters to model the difference in transition and transversion rates. This approach produces
a new rate matrix, where \alpha is the rate of transitions and \beta the rate of transversions:

\mathbf{Q}_{K2P} = \begin{pmatrix} -(2\beta + \alpha) & \beta & \beta & \alpha \\ \beta & -(2\beta + \alpha) & \alpha & \beta \\ \beta & \alpha & -(2\beta + \alpha) & \beta \\ \alpha & \beta & \beta & -(2\beta + \alpha) \end{pmatrix}    (3.36)
In 1981 Felsenstein [15] presented a model (F81) where the substitution rate depends only on
the equilibrium frequency of a nucleotide. These equilibrium frequencies are usually determined
from the observed frequencies in the sequences at hand. Here \mu represents a rate parameter and \pi_i
represents the frequency of nucleotide i ('.' indicates the value necessary to make the row sum
equal zero):

\mathbf{Q}_{F81} = \begin{pmatrix} \cdot & \mu\pi_T & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \cdot & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \cdot & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \mu\pi_C & \cdot \end{pmatrix}    (3.37)
Hasegawa et al [20] further refined Felsenstein's model by considering transition/transversion
rates \beta and \alpha:

\mathbf{Q}_{HKY} = \begin{pmatrix} \cdot & \beta\pi_T & \beta\pi_C & \alpha\pi_G \\ \beta\pi_A & \cdot & \alpha\pi_C & \beta\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \beta\pi_G \\ \alpha\pi_A & \beta\pi_T & \beta\pi_C & \cdot \end{pmatrix}    (3.38)
Finally, the most general time reversible model (GTR) has nine free parameters:

\mathbf{Q}_{G} = \begin{pmatrix} \cdot & \rho\pi_T & \beta\pi_C & \gamma\pi_G \\ \rho\pi_A & \cdot & \alpha\pi_C & \sigma\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \tau\pi_G \\ \gamma\pi_A & \sigma\pi_T & \tau\pi_C & \cdot \end{pmatrix}    (3.39)
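The parameterized matrices above all share one recipe: each off-diagonal entry is an exchangeability for the pair of states multiplied by the equilibrium frequency of the target state, with the diagonal set so that rows sum to zero. A sketch assuming numpy; the state order (A, T, C, G) matches the columns of (3.37)-(3.39), and the function name is my own.

```python
import numpy as np

def gtr_rate_matrix(pi, rates):
    """General time reversible rate matrix, eq. (3.39).

    pi    -- equilibrium frequencies in the order (A, T, C, G)
    rates -- exchangeabilities (rho, beta, gamma, alpha, sigma, tau)
             for the pairs (A,T), (A,C), (A,G), (T,C), (T,G), (C,G)
    """
    rho, beta, gamma, alpha, sigma, tau = rates
    R = np.array([[0.0,   rho,   beta,  gamma],
                  [rho,   0.0,   alpha, sigma],
                  [beta,  alpha, 0.0,   tau],
                  [gamma, sigma, tau,   0.0]])
    Q = R * np.asarray(pi)                # Q_ij = R_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows must sum to zero
    return Q

# Equal frequencies and equal exchangeabilities recover Jukes-Cantor (3.18)
# with alpha = 0.4 * 0.25 = 0.1.
Q = gtr_rate_matrix([0.25] * 4, [0.4] * 6)
```

Restricting the parameters recovers the earlier models: equal exchangeabilities give F81, a single transition and a single transversion rate give HKY, and equal frequencies on top of that give K2P.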
3.3 Problems with Markov Models of Evolution
Biological evidence suggests that all the models considered so far simplify the situation too much.
For example, they cannot deal with long-distance correlations between sites due to RNA folding.
A key problem appears to be the iid assumption between sites. The assumption of rate ho-
mogeneity is contradicted by evidence that mutations are dependent on local sequence context.
Protein coding genes are an example of how this assumption can be violated for very basic reasons.
Because of the redundancy in the nucleotide to amino acid code, different codon positions are
subject to different selectional pressures. Mutation rates appear to be dependent on structural
and functional constraints as well as chromosomal positions. These are all local properties of a
sequence. However, it is assumed that substitution rates are constant throughout a phylogenetic
tree.
Markov models of evolution assume stationarity of base frequencies. That is, expected
nucleotide frequencies remain the same with time. This is contradicted by observations that
nucleotide frequencies are very different in sequences from different species. For example, GC content
in mammals is much higher than in flies [30]. Lockhart et al [31] have shown that if a model
that assumes stationarity is used, then breaking the assumption can lead to inaccurate distance
estimates. The main problem is a tendency to group sequences with similar nucleotide
frequencies, irrespective of evolutionary relatedness.
3.4 Modelling Rate Heterogeneity
A number of methods have been suggested to add some level of rate heterogeneity into Markov
models.
One approach is to set some sites invariable while others change. This is useful when one
can determine conserved regions in sequences. However, it doesn't allow for more than one rate.
Another approach allows sites to evolve at different rates, where the rate for a site is drawn from a
Gamma distribution with shape parameter \alpha. A discrete Gamma model that allows much easier
computation has also been developed by Yang [30]. This is, perhaps, the most popular approach
at present. However, it still does not make use of information available about local behaviour.
Recently, Steel et al [42] have presented a covariotide model of site substitution where sites
are affected by different selection pressures. This model allows some sites to be invariant while
others change. However, sites do not have to remain invariant. This represents the fact that
constraints on sites can change over time. The activation of sites is governed by a Markov
process where sites are still iid to keep the model tractable.
Other techniques have been based on defining multiple categories of rates. This can be implemented
using hidden Markov models, with algorithms inferring the most probable rate category for each site.
These are discussed in [11].
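The Gamma approach above can be sketched in a few lines: per-site rates are drawn from a Gamma distribution with shape \alpha and mean 1, and each site's branch lengths are then scaled by its rate. This assumes numpy; the parameter value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.5  # shape parameter alpha; small alpha means strong rate variation

# Rates with mean 1: Gamma(shape=a, scale=1/a).
site_rates = rng.gamma(shape=a, scale=1.0 / a, size=100_000)

# A site with rate r effectively evolves for time r * t along a branch of length t.
t = 0.3
effective_times = site_rates * t

assert abs(site_rates.mean() - 1.0) < 0.05
assert site_rates.min() >= 0.0
```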
3.5 Modelling Non-Stationarity
As mentioned previously, Markov models of evolution assume stationarity of nucleotide frequen-
cies. However, there is strong evidence suggesting that this is not the case.
The paralinear and logdet corrections have been developed to make distance estimates more
reliable when base frequencies differ from species to species. Both rely on the following lemma.
Lemma 3.5.1 Let t be a measure of evolutionary time. Then

t ∝ log[det(P(t))] (3.40)

Proof

P(t) = e^{tQ} (3.41)

From linear algebra,

=⇒ det(P(t)) = e^{t trace(Q)} (3.42)

=⇒ log[det(P(t))] = t trace(Q) (3.43)

Since Q remains fixed,

=⇒ t ∝ log[det(P(t))] (3.44)
Paralinear distance
Barry and Hartigan [2] suggest an asynchronous distance estimate. This is still based on a
Markov process where sites are iid. However, it makes no assumption of homogeneity, reversibility or stationarity. It need not assume base frequencies are in equilibrium, nor that the
rate of substitution is constant throughout the tree. The distance estimate is taken to be:
dij = −(1/4) log[det(Pxy)] (3.45)
where Pxy is the transition matrix at a particular site from species x to species y, assumed to be the same across all sites. The (i, j)th element of the transition matrix is estimated as Pr(Y = j|X = i), where X and Y are the bases at the same position in the sequences of species x and y respectively, and i, j ∈ {A, C, G, T} for nucleotide sequences.
The distance measure is additive but asymmetric. The latter property means that in general dij ≠ dji, which is not a particularly desirable property. In fact, this measure can only be used to estimate the total number of substitutions along a branch when substitution rates are held constant and the model is reversible.
LogDet Transformation
This transformation method involves recording a divergence matrix, Fxy, for each pair of taxa
x and y. The ijth entry of Fxy is the proportion of sites in which taxa x and y have states i
and j respectively. The dissimilarity value, dxy is calculated as:
dxy = −log[det(Fxy)] (3.46)
Variance can be calculated using the paralinear method. Where S is the sequence length and r is 4 or 20:

σ²xy = Σi=1..r Σj=1..r [(Fxy⁻¹)²ji (Fxy)ij − 1]/S (3.47)
For models with equal nucleotide frequencies, Lockhart et al [31] show how to calculate branch lengths:

d′xy = (dxy + [log(det Fxx Fyy)]/2)/r (3.48)
Distances become treelike as sequence lengths increase, provided we reinstate our independence assumptions across sites and across the tree. This means that reconstruction methods that require treelike distances will work with corrected distances (and sufficiently long sequences). The LogDet transform has been shown to provide more realistic results where similar nucleotide frequencies might otherwise indicate false evolutionary relationships.
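A minimal sketch of the LogDet computation, with the correction of equation (3.48) applied assuming Fxx and Fyy denote the diagonal matrices of base frequencies in each sequence (that reading of the notation is an assumption here); the determinant is expanded directly since the matrices are only 4 × 4:

```python
import math

BASES = "ACGT"

def det(m):
    # Determinant by Laplace expansion -- fine for 4x4 matrices.
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def logdet_distance(x, y):
    # d_xy = -log det(F_xy): F_xy[i][j] is the proportion of sites
    # with base i in sequence x and base j in sequence y.
    n = len(x)
    F = [[0.0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        F[BASES.index(a)][BASES.index(b)] += 1.0 / n
    return -math.log(det(F))

def corrected_logdet(x, y, r=4):
    # Equation (3.48): d'_xy = (d_xy + log(det F_xx F_yy)/2) / r,
    # with F_xx, F_yy taken as diagonal base-frequency matrices.
    fx = math.prod(x.count(b) / len(x) for b in BASES)
    fy = math.prod(y.count(b) / len(y) for b in BASES)
    return (logdet_distance(x, y) + math.log(fx * fy) / 2) / r
```

For two identical sequences containing all four bases the corrected distance is zero, as expected; the raw logdet value is not, which is exactly why the correction exists.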
3.5.1 Summary of Nucleotide Markov Models
We can consider the relationships between these models via the following parameters [44]:
• κ: The rate of transitions relative to rate of transversions. In practice, κ > 1 reflects the
evidence that transitions are more prevalent than transversions.
• α: A measure of between-site variation in the rate of nucleotide substitution. Site rates are often drawn from a gamma distribution with mean 1 and variance 1/α [44]. High values of α mean low amounts of rate variation.
• Base frequencies π = (πA, πC , πG, πT ), i.e. three independent parameters.

• πMLE , πobs: The maximum likelihood and observed base frequencies.

The maximum likelihood estimates of α and κ are usually used.
Model α κ π
Jukes Cantor ∞ 1 0.25 each
Kimura 2-P ∞ variable 0.25 each
Felsenstein ∞ 1 variable
HKY ∞ variable variable
JC+Γ variable 1 0.25 each
K2P+Γ variable variable 0.25 each
Fel+Γ variable 1 variable
HKY+Γ variable variable variable
3.6 Empirical Models of amino acid evolution
When modelling amino acid evolution, empirical models have been the preferred solution. These models specify explicit transition probabilities derived from empirical evidence. The preference for empirical models when dealing with amino acids is partly due to the increase in complexity involved in having twenty character states. The following section provides an overview of the two most common empirical methods: the PAM and the BLOSUM matrices.
3.6.1 PAM/Dayhoff Substitution Matrices
The PAM/Dayhoff matrices empirically estimate amino acid substitution rates within a Markov process framework. These rates were derived from alignments of protein sequences that are at least 85% identical.
Deriving the Mutation Matrix
Let A be the matrix of observed proportions of changes between two amino acids i, j. That is:

Aij = Nij/N (3.49)
In fact, Aij has the same description as Fxy described for the LogDet transform.

Let πk be the vector of amino acid frequencies of sequence k:

(πk)j = Nkj/N (3.50)
We want to derive substitution (transition) probabilities for the time it takes 1% of all amino acids to mutate - this is the point accepted mutation (PAM) unit.

Pij = Pr(i mutates) Pr(i mutates to j|i mutates) (3.51)
Now we can empirically derive the relative mutability mi of amino acid i:

mi = P(i mutates) (3.52)

= Σj Aij / Σk,j Akj (3.53)
Now,

Pr(i mutates to j|i mutates) = Aij / Σj Aij (3.54)

and we now have an estimate of P:

Pij = mi × Aij / Σj Aij (3.55)
To calibrate our matrix to the PAM measure we simply solve:
Σi πi(1 − Pii) = 0.01 (3.56)
The matrix of Pij ’s is the PAM matrix.
If π is a vector of amino acid frequencies, Pπ is the probability vector after one such time period (1 PAM). To consider more distant relationships we can derive the k-PAM matrix. Because this is based on a Markov process we can theoretically achieve this by raising the 1-PAM matrix to the kth power:

P(k) = P^k (3.57)
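The extrapolation in (3.57) is plain matrix powering. A sketch on a toy 3-state matrix (the real 1-PAM matrix is 20 × 20 and calibrated via (3.56); the numbers below are invented for illustration):

```python
def mat_mul(a, b):
    # Naive matrix multiplication, adequate for small matrices.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, k):
    # P(k) = P^k: extrapolate a 1-PAM-style matrix to k PAM units.
    n = len(p)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result

# Toy "1-PAM-like" transition matrix: rows sum to 1, ~1% mutate per unit.
P1 = [[0.990, 0.006, 0.004],
      [0.005, 0.990, 0.005],
      [0.004, 0.006, 0.990]]
P250 = mat_pow(P1, 250)
```

Rows of P250 still sum to 1, and its off-diagonal probabilities are far larger than those of P1, reflecting the substitutions that accumulate over larger evolutionary distances.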
The log-odds form of the PAM matrices is often used for scoring sequence alignment reliability. This can be thought of as a log-likelihood ratio test with the null hypothesis being that sites have aligned by chance.

Sij = log(Pij/πi) (3.58)

The rarer the amino acids in an aligned pair, the lower the probability of a chance alignment and so the greater the significance.
Problems with the PAM model
Besides the problems inherent in Markov models of evolution, the PAM matrices suffer from other problems. Firstly, they assume that proteins have average amino acid composition (many don't). Secondly, rare replacements are not observed often enough to resolve relative frequencies properly. Thirdly, errors in PAM(1) are amplified when extrapolating to, say, PAM(250). Finally, Markov processes don't accurately model evolution.
There is no theoretical justification for applying this extrapolation to divergent alignments. In fact this approach implies a large loss of information. As evolutionary distance increases, information content decreases. This means a longer region of similarity is needed to achieve a score high enough to distinguish from chance. However, regions of similarity shrink to narrow blocks as evolutionary distance increases, so it is difficult to find the necessary data.
Attempts to update the PAM matrices to make them more accurate have been made. A
particular example is the Jones Taylor Thornton model.
3.6.2 BLOSUM
The Block Sum (BLOSUM) substitution matrices were introduced in 1992 by Henikoff and Henikoff [21]. They take a completely different approach to the PAM matrices. The key point is that the derivation of transition probabilities uses alignments of distantly related sequences. Blocks are conserved regions of local alignments with no gaps.
The aim is to obtain a set of scores for matches and mismatches that best favours a correct alignment with each of the other segments in the block relative to an incorrect alignment. This is done by creating a table where each column contains amino acid pair frequencies for the corresponding column in the alignment. This is a 20(20 − 1)/2 × N matrix where the first term is the number of possible pairs of amino acids and N is the length of the alignment.
A score matrix is defined from a log-odds matrix derived from the frequency table. Let Fij be the ijth entry of the frequency matrix, and let qij be the observed probability of an ij pair:

qij = Fij / Σj Fij (3.59)

We can estimate the expected probability of an ij pair occurring as eij. Let pi be the probability of i occurring in an ij pair, and set eij = pipj. Our odds ratio matrix takes the form:

Sij = qij/eij (3.60)
That is, the observed probability over the expected probability that i and j appear together at random. The log (base 2) of each ratio is usually multiplied by a scaling factor of 2 and then rounded to the nearest integer. This gives the BLOSUM (block substitution matrix) in half-bit units.
Unlike the PAM matrices, separate matrices have been derived for different time scales. BLOSUM matrices are referred to by the minimum percentage identity between sequences. That is, in BLOSUM 60, sequences that are at least 60% similar are treated as identical. As distances become large we expect to use a BLOSUM matrix with a lower BLOSUM parameter.
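The log-odds construction can be sketched as follows. This follows the standard BLOSUM recipe, including the conventional factor of 2 in the expected probability of a mismatched pair (slightly more detail than eij = pipj above); the toy two-letter counts are invented for illustration:

```python
import math

def blosum_scores(pair_counts):
    # Half-bit log-odds scores from observed residue pair counts.
    # pair_counts maps unordered pairs such as ("A", "R") to counts.
    total = sum(pair_counts.values())
    q = {pair: c / total for pair, c in pair_counts.items()}
    # p[i]: probability that residue i appears in a randomly drawn pair.
    p = {}
    for (i, j), qij in q.items():
        p[i] = p.get(i, 0.0) + (qij if i == j else qij / 2)
        if i != j:
            p[j] = p.get(j, 0.0) + qij / 2
    scores = {}
    for (i, j), qij in q.items():
        eij = p[i] * p[j] if i == j else 2 * p[i] * p[j]
        scores[(i, j)] = round(2 * math.log2(qij / eij))
    return scores

# Toy alphabet {A, B}: 6 AA pairs, 2 AB pairs, 2 BB pairs observed.
scores = blosum_scores({("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2})
```

Matches between rare residues score higher than matches between common ones, mirroring the significance argument made for the PAM log-odds scores.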
Problems with the BLOSUM matrices
The main problem with the BLOSUM matrices is that they can be overtrained. That is, if most of the conserved blocks are taken from just a few species then the resulting matrix isn't going to look much like reality. This is a real problem as most genomic data available is from very few species.

To reduce the contribution of the most closely related members of a family (that is, to reduce multiple contributions of the same amino acid pairs), sequences are clustered within blocks and each cluster is weighted as a single sequence. The resulting matrices are analogous to transition matrices but are estimated without any reference to a rate matrix Q.
3.7 Differences in PAM and BLOSUM
The differences between the PAM and BLOSUM substitution matrices are a consequence of their different approaches to the problem. PAM matrices are derived from a tree based model that uses matrix multiplication to extrapolate to larger time scales. It is based on mutations in both conserved and variable regions.

BLOSUM is derived from pair frequencies in highly conserved blocks. Different weights can be given to different sequence groups. BLOSUM has an advantage in that it was derived from a more representative data set. Hardly any transitions were observed in deriving PAM, whereas this was not the case for BLOSUM. This problem has been addressed by re-deriving the models with more data; this is the Jones Taylor Thornton model.

The fact that BLOSUM is not tree derived does not seem to be a major disadvantage. BLOSUM generally gives better results when used to score database searches, as highly conserved regions usually serve as anchor points. However, PAM-style matrices are still more widely used in phylogenetics.
Chapter 4
Phylogenetic Tree Reconstruction
Methods
This chapter surveys tree reconstruction methods. Phylogenetic reconstruction methods come in many forms. Parametric methods such as maximum likelihood rely on the specification of a model. At the other end of the scale, non-parametric methods such as maximum parsimony claim to make no assumptions about the underlying model, on the grounds that any assumptions we do make are likely to be inadequate. In the middle there are semi-parametric methods - the distance based methods that require a model to generate distances but then go on to reconstruct the tree by a non-parametric clustering method. Bayesian approaches have also been proposed.
This chapter first outlines how the usual statistical indicators are redefined for the prob-
lem of phylogenetic inference. The rest of the chapter examines commonly used methods for
phylogenetic tree reconstruction.
4.1 Evaluating Reconstruction Methods
Before examining the methods available we need to have an idea of what we want from them. Several often used criteria are discussed below. When evaluating tree reconstruction methods we have available the usual bag of statistical measures. However, it will be seen that defining these with respect to phylogenetics is not straightforward.
4.1.1 Complexity
An important issue to consider in the design of reconstruction methods is the size of the space of trees. There are (2N − 3)!! topologically unique rooted trees with N leaves, and (2N − 5)!! unrooted trees. Clearly, algorithms that involve evaluating the entire space of N leaf trees are not going to be computationally feasible.
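These double factorials grow very fast, as a short computation shows (the function names are illustrative):

```python
def double_factorial(n):
    # n!! = n * (n - 2) * (n - 4) * ... down to 1 (or 2).
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def num_rooted_trees(n_leaves):
    # (2N - 3)!! topologically distinct rooted binary trees on N leaves.
    return double_factorial(2 * n_leaves - 3)

def num_unrooted_trees(n_leaves):
    # (2N - 5)!! topologically distinct unrooted binary trees on N leaves.
    return double_factorial(2 * n_leaves - 5)
```

Already for 10 leaves there are 2,027,025 unrooted topologies, and for 20 leaves over 10^20, which is why exhaustive evaluation is infeasible.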
4.1.2 Accuracy
The evaluation criterion that has been given the most attention is accuracy. When we build phylogenies we need some measure of how well the method used estimates the true tree. This is usually evaluated by examining how the method performs on simulated data and on biologically well supported phylogenies [25].
4.1.3 Consistency
The behaviour of reconstruction methods as sequences get longer is usually discussed in terms of consistency.

Definition 4.1.1 An estimator T̂n of T is consistent if

limn→∞ PT(|T̂n − T| > ε) = 0 (4.1)
In phylogenetics this is interpreted as whether a reconstruction method will return the true tree if its inputs are based on infinitely long sequences. This criterion has been given much consideration as the amount of genome data available increases. In effect, it means that the only barrier to success is the researcher having enough sequence to hand.
Steel [13] showed that if the frequencies of residues are known and sites are iid, then a consistent estimator can be found. If not (site rates are variable and/or frequencies are not known), then it can be impossible to estimate a phylogenetic tree consistently. However, Chang [8] has shown that if sites are iid, the correct model is being used and some other restrictions hold, then a consistent estimator can be found.
There has been much debate about the usefulness of such a measure given that real sequences will always be finite. Holmes makes the point that the longer sequences become in reality, the less valid the site independence assumption is [23]. 1
4.1.4 Efficiency
Definition 4.1.2 An estimator T̂ is efficient if it is unbiased and

limn→∞ σ²(T̂n)/I(T)⁻¹ = 1 (4.2)

where I(T) is the Fisher information of T.

1Holmes provides a useful phylogenetics-to-statistics term conversion table in [23].
In phylogenetics, we would like this to mean that as longer sequences are used the variance of our trees is as low as it can be. The problem with this is that the variance of a tree is not well defined. In fact, the literature generally uses efficiency to describe how quickly a reconstruction method converges to the correct solution as it is given more data. This is usually measured via simulation.

Ideally, we would use an analogue of the mean squared error for trees, E[d(T̂, T)²], as this gives an indication of both variance and bias. However, the problem remains of how to define distances between trees, and this has not been solved.
4.1.5 Robustness
We know current assumptions made about sequence evolution are inadequate. With this in mind, it is very desirable to know how well a method is likely to behave when wrong assumptions are made - for example, the effect of model misspecification on parametric methods. This is usually assessed by simulating data under a fully specified model and then reconstructing the tree under a misspecified one.
4.1.6 Usability in tests
We also need to consider whether a reconstruction method can reject false assumptions in our model of evolution. For example, we want to be able to determine if additional complexity is worthwhile. Since understanding the process of evolution is our primary goal, this should always be kept in mind.
4.2 Parsimony
The maximum parsimony method chooses as the best tree the one on which the fewest base changes have occurred in sequences from the root to the leaves. An example of this is seen in fig 4.1. Combinatorially this is the same as finding the minimum Steiner tree under the Hamming distance between sequences [23].
In theory this means that all possible assignments of sequences to internal nodes over all possible tree topologies (with the necessary number of leaves) must be evaluated. In practice, heuristics are employed to cut down the search space. Recursive algorithms and branch and bound have also been used to avoid repeating computation (see [11]).
Figure 4.1: An example of how changes are counted using the principle of parsimony. The paths from sequences at the leaves of the tree to the root involve 4 base changes.
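The counting itself can be done without enumerating ancestral sequences using Fitch's small-parsimony algorithm, sketched below (the nested-tuple tree encoding and example sequences are illustrative choices; note the algorithm returns the minimum over all internal assignments, which can be smaller than the count for one particular assignment such as the one drawn in fig 4.1):

```python
def fitch_cost(tree, site):
    """Minimum number of changes at a single site.  `tree` is a nested
    tuple of leaf indices; `site[i]` is the base of leaf i."""
    def walk(node):
        if isinstance(node, int):                 # leaf
            return {site[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        inter = ls & rs
        # Non-empty intersection: no change needed here; otherwise add one.
        return (inter, lc + rc) if inter else (ls | rs, lc + rc + 1)
    return walk(tree)[1]

def parsimony_score(tree, sequences):
    # Sites are costed independently, as in the text.
    return sum(fitch_cost(tree, [s[i] for s in sequences])
               for i in range(len(sequences[0])))

# Leaves AAA, GGA, AGA on the topology ((0, 1), 2):
score = parsimony_score(((0, 1), 2), ["AAA", "GGA", "AGA"])
```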
Parsimony is based on the concept of Occam's razor: solutions that make the fewest assumptions are likely to be the best. By looking only at base changes, parsimony claims to require no knowledge of the evolutionary model. Parameterized models are known to be flawed, so this non-parametric approach may seem quite reasonable.
However, it seems that an underlying model for parsimony exists implicitly. The assumptions
are that sites are independent (we cost each substitution separately) and the probability of
substitution is equal for all bases.
It has long been established that parsimony can be inconsistent. The situation where this happens has been dubbed the 'Felsenstein zone'. However, as mentioned previously, consistency is not always a necessary property for a reconstruction method. Another problem is that different trees can be equally parsimonious for a set of sequences.
4.3 Maximum Likelihood
Maximum likelihood reconstruction is, unsurprisingly, based on the likelihood principle. Given data D and a model, we calculate the likelihood of a hypothesis H as P(D|H), the probability of observing D if H is correct [38]. With respect to phylogenetics, the data D is sequence data and the model is a process of site substitution (see Chapter 3). H is a phylogenetic tree, which is defined by its topology and branch lengths.

The aim is to find the tree that maximizes the likelihood. We choose this tree as our 'best' guess.
Figure 4.2: Example ML tree T. The xi are nodes representing sequences, the ti are branch lengths.
Example Likelihood Calculation
The likelihood of the rooted tree in fig 4.2 can be calculated as follows.

P(x1, ..., x5|T, t1, ..., t4) = P(x1|x4, t1)P(x2|x4, t2)P(x4|x5, t1, t2, t4) (4.3)

where P(xi|xj, t) = L(xi, xj, t), the right hand side being the likelihood of (xi, xj) forming a branch of length t in tree T. This can also be transformed into a recursive form.
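The recursive form is Felsenstein's pruning algorithm: each node passes up a vector of conditional likelihoods, one per possible state. A minimal single-site sketch using the Jukes-Cantor transition probabilities of Chapter 3 (the tree encoding and branch lengths here are illustrative assumptions):

```python
import math

def jc_prob(a, b, t):
    # Jukes-Cantor transition probability P(b | a, t).
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(node):
    """A node is either a base character (leaf) or a list of
    (child, branch length) pairs.  Returns the conditional likelihood
    of the subtree below the node for each possible state at the node."""
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in "ACGT"}
    like = {a: 1.0 for a in "ACGT"}
    for child, t in node:
        cl = site_likelihood(child)
        for a in "ACGT":
            like[a] *= sum(jc_prob(a, b, t) * cl[b] for b in "ACGT")
    return like

def tree_likelihood(root):
    # Sum over root states weighted by equilibrium frequencies (1/4 each).
    like = site_likelihood(root)
    return sum(0.25 * like[a] for a in "ACGT")

# An internal node joining leaves "A" and "A", plus a leaf "G", off the root:
L = tree_likelihood([([("A", 0.1), ("A", 0.2)], 0.3), ("G", 0.4)])
```

Summed over all possible leaf patterns these likelihoods total 1, which is a useful sanity check on any implementation.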
Characteristics of maximum likelihood

Maximum likelihood estimation is consistent. It borrows its efficiency rating from the more general theory of maximum likelihood estimation. Unlike distance based methods it has been found to be robust to the presence of distant taxa [4].

The maximum likelihood tree is not necessarily unique [38], so this method may not be able to resolve completely which is the best tree.
It is also extremely expensive computationally. The three taxa tree shown in fig 4.2 is a trivial example because there is only one possible unrooted topology for a three leaf tree. If we are dealing with models that are not reversible we have to consider every possible rooted tree. For n sequences this potentially involves evaluating the likelihood for all (2n − 3)!! rooted tree topologies and all possible assignments of sequences to the hidden internal nodes of the tree.

This is a huge computational problem!2
Simplifications
The problem can be simplified by making our usual assumptions. If we assume that sites are iid, we need only consider the evolution of individual sites with respect to the tree. The probability of the tree with respect to the sequences is then just the product of the probabilities of the sites. This provides opportunities for parallelising computation.

If we assume that the model of site substitution is reversible then we can determine the probabilities of substitutions from the leaves up - a postorder traversal. In fact, we only need to consider the unrooted tree. This is the 'pulley' principle described by Felsenstein [15].
Search heuristics
This still leaves the problem of calculating the likelihood of every unrooted tree. To cut down
the search space heuristics need to be employed. Felsenstein proposed a branch and bound
method where taxa are added incrementally to maximize the likelihood at each stage. The big
disadvantage with this approach is that it may not find the optimal tree.
4.4 Is MP the same as ML?
The use of maximum parsimony over maximum likelihood (and vice versa) has been the source of much division in phylogenetics. However, as Holmes aptly puts it:
The statistical perspective sees the differences between maximum likelihood, maxi-
mum parsimony...as much more a matter of degrees of freedom allowed in a model
than a matter for religious wars
The non-parametric nature of MP means that no parameters are pinned down. In effect it
needs to optimize over infinite dimensional criteria. A parametric model such as Jukes-Cantor
is at the other end of the scale. Variable rate models lie somewhere in the middle. This view
is well supported by the work of Steel et al [38] who have found conditions where the MP tree
is the ML tree. This happens when there is ‘no common mechanism’ assumed between sites or
lineages.
However, the general evidence from simulations is that MP does not perform as well as ML. This is likely to be due to the implicitly restrained model involved in most parsimony implementations. In its usual form parsimony will not take account of sequence evolution behaviour such as reversal. Parsimony as a 'no common mechanism' method can in theory account for unseen substitutions, as any possible assignment of sequences to internal nodes is allowed. However, such a candidate tree is unlikely to be selected as the best one, as it will involve more substitutions - violating the most-parsimonious criterion.

2In fact it has been shown that maximum likelihood for phylogeny is NP-complete.
This often makes the choice between ML and MP a choice between a flawed model that takes into account some of our knowledge of evolution, and a model constrained to be as simple as possible that does not take into account any biological evidence.
4.5 Distance Based Methods
Distance based methods reconstruct phylogenetic trees from estimated distances between species. These distances are usually estimated according to some parametric model. However, the actual reconstruction method is usually non-parametric. These methods are dominated by agglomerative algorithms.
Agglomerative or cluster methods follow the same general algorithm.
• Select two nodes
• Merge them to form a new node (or cluster).
• Update the distances to reflect removal of two and addition of one node according to some
rule.
The differences between methods lie in how (most importantly) the select and update steps are implemented.
The concept of additivity is essential to the agglomerative algorithms [11].
Definition 4.5.1 Given a tree, its edge lengths are additive if the distance between two leaves is equal to the sum of the lengths of the edges on the (unique) path between them.
It is important to note that additivity is a property of the distance measure used. Real data
is only ever approximately additive.
4.5.1 Unweighted pair group method using arithmetic averages (UPGMA)
UPGMA is the classic naive clustering algorithm. We start with each sequence being a cluster on its own and proceed with our generic clustering algorithm. The distance between two clusters Ci and Cj is defined to be the average distance between pairs of sequences from each cluster:

dij = (1/(|Ci||Cj|)) Σp∈Ci,q∈Cj dpq (4.4)

where |Ci| is the number of sequences in Ci.

When clusters Ci and Cj are combined we have Ck = Ci ∪ Cj. The distance between the new cluster Ck and any other cluster Cl is:

dkl = (dil|Ci| + djl|Cj|) / (|Ci| + |Cj|) (4.5)
The algorithm terminates when only two clusters Ci, Cj remain. At this stage the tree root
is added at height dij/2.
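The whole procedure fits in a short sketch (the dictionary-based bookkeeping is an illustrative choice; the input distances are assumed ultrametric):

```python
def upgma(names, d):
    """Minimal UPGMA sketch.  `d[(a, b)]` is the distance between taxa
    a and b; returns the rooted tree as nested tuples plus its height."""
    dist = {frozenset(p): v for p, v in d.items()}
    size = {n: 1 for n in names}
    height = 0.0
    while len(size) > 1:
        # Select: the closest pair of active clusters.
        pair = min(dist, key=dist.get)
        i, j = tuple(pair)
        k = (i, j)
        height = dist.pop(pair) / 2      # root height on the final merge
        # Update: size-weighted average distance to every other cluster.
        for l in list(size):
            if l in (i, j):
                continue
            dil = dist.pop(frozenset((i, l)))
            djl = dist.pop(frozenset((j, l)))
            dist[frozenset((k, l))] = (dil * size[i] + djl * size[j]) \
                / (size[i] + size[j])
        size[k] = size.pop(i) + size.pop(j)
    return next(iter(size)), height
```

On ultrametric data, e.g. d(a, b) = 2 and d(a, c) = d(b, c) = 6, this recovers the clock tree ((a, b), c) with root height 3.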
4.5.2 The Molecular Clock Hypothesis
The molecular clock hypothesis assumes that the rate of evolution is approximately constant on
the molecular level. This is equivalent to having the same Markov model rate matrix Q apply
to every part of the tree. If we believe this hypothesis, we can estimate the time of divergence
simply by looking at the number of changes between sequences.
There is much evidence contradicting this assumption: rates appear to vary both between and within lineages [30].
UPGMA produces rooted trees with edge lengths that obey the molecular clock hypothesis.
If our distance data conforms to this then UPGMA will reconstruct the correct tree.
4.5.3 Long Branch Attraction
If distance data does not conform to the molecular clock then UPGMA will select the wrong tree, even in very simple cases. Consider the tree in fig 4.3. When trying to reconstruct this tree our input is the distances between the leaves. Since UPGMA merges the closest clusters at each stage, the UPGMA tree will not be the true tree.

This is the phenomenon of long branch attraction. The error associated with long distances causes a large amount of noise in the phylogenetic signal. In general, long branches are hard to deal with accurately.
Figure 4.3: An example of a tree that does not conform to the molecular clock hypothesis (left). UPGMA incorrectly reconstructs the tree.
4.5.4 Neighbour Joining
Saitou and Nei's neighbour joining algorithm [36] keeps the notion of additivity (4.5.1) but dispenses with the molecular clock. For two leaves/clusters i and j we define:

Dij = dij − (ri + rj) (4.6)

where

ri = (1/(|L| − 2)) Σk∈L dik (4.7)

and L is the set of leaves in the tree.
This criterion attempts to deal with the problem of long branch attraction: subtracting off the averaged distances to all other leaves/clusters compensates for long-branch/short-branch neighbour pairs. Nodes i and j are neighbours in the tree if there is another node k such that branches ik and jk are both in the tree (the shortest path between them has length two). That is, they are separated by only one ancestor.
Theorem 4.5.1 If additivity holds and if Dij is minimal for leaves i, j ∈ L then i, j are neigh-
bours in the tree.
The proof of this is available in Chapter 7 of [11].
At each iteration Dij is used to select the neighbours to join. In fact, NJ works to a minimum evolution criterion: the best tree is the shortest one, where the length of a tree is usually calculated by summing all branch lengths.
If at some iteration we select i and j as neighbours and join them into a new node k, then for m ∈ L we update our distances as:

dkm = (1/2)(dim + djm − dij) (4.9)

and set

dik = (1/2)(dij + ri − rj), djk = dij − dik (4.10)
Neighbour Joining has been shown to reconstruct the correct tree when distances are additive. This also holds when distances are only approximately additive [17]. It is consistent: it has been shown to return the correct tree when the correct distances are used. However, this does not give assurances about sequences of finite length. We need to remember that errors in distance based methods grow exponentially as distance increases. This is also important to remember because the algorithm produces unrooted trees, which means that if a distant outgroup is used to root the tree it is likely to damage the accuracy of the result.
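One iteration of the select and update steps can be sketched as follows; the four-taxon distances below are constructed to be additive on a hypothetical tree (a and b neighbours with limbs 0.1 and 0.2, internal branch 0.3, limbs 0.4 and 0.5 to c and d), and taxon names are assumed orderable:

```python
def nj_step(names, d):
    """One neighbour-joining iteration.  `d` maps frozenset pairs of
    names to distances.  Selects the pair minimising
    D_ij = d_ij - (r_i + r_j), joins it, and returns the new node,
    its two limb lengths, and the updated distances."""
    L = len(names)
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) / (L - 2)
         for i in names}
    i, j = min(((a, b) for a in names for b in names if a < b),
               key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
    dij = d[frozenset((i, j))]
    limb_i = 0.5 * (dij + r[i] - r[j])   # branch length from i to new node
    limb_j = dij - limb_i
    new = (i, j)
    # Copy distances not involving i or j, then add the reduced ones.
    nd = {frozenset((a, b)): d[frozenset((a, b))]
          for a in names for b in names
          if a < b and i not in (a, b) and j not in (a, b)}
    for m in names:
        if m not in (i, j):
            nd[frozenset((new, m))] = 0.5 * (d[frozenset((i, m))]
                                             + d[frozenset((j, m))] - dij)
    return new, limb_i, limb_j, nd

# Additive distances on the hypothetical tree described above.
dists = {frozenset(p): v for p, v in {
    ("a", "b"): 0.3, ("a", "c"): 0.8, ("a", "d"): 0.9,
    ("b", "c"): 0.9, ("b", "d"): 1.0, ("c", "d"): 0.9}.items()}
joined, la, lb, nd = nj_step(["a", "b", "c", "d"], dists)
```

Because the input is additive, the true neighbours a and b are joined and their limb lengths 0.1 and 0.2 are recovered exactly.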
The next two sections describe some improvements to the NJ algorithm.
4.5.5 BIONJ
The BIONJ algorithm of Gascuel [17] improves on NJ by taking into account more biological features of the data. Gascuel shows that the formula used to update the distance matrix is just one of a large class. BIONJ selects the minimum variance reduction from this class.
The class of equations is described by:
δui = λδ1i + (1 − λ)δ2i − λδ1u − (1 − λ)δ2u (4.11)
for taxa 1 and 2, where u is the root of 1, 2. For NJ, we have λ = 1/2. Gascuel has shown
that sampling noise influences the structure of the tree. BIONJ takes advantage of the fact that
λ does not have to be fixed. Instead it can be calculated at each step to minimize sampling
variance of the new reduced matrix.
A first order model is used to estimate sampling variances and covariances. The simplicity of this means that BIONJ retains NJ's O(N³) complexity.
4.5.6 Weighbor
When determining the distance between taxa, random error increases exponentially the further
apart the taxa are. This essentially means that the distance based methods considered so far
are not robust to distant taxa. This means adding distant outgroups can lead to very dubious
results.
The maximum likelihood approach is well known to be robust to the presence of distant taxa. However, it is also very computationally expensive. Weighted neighbour joining [4] adds a likelihood based criterion to the neighbour joining approach in an attempt to deal with this problem. It replaces NJ's minimum evolution selection criterion with a likelihood based one, while the distance update step is much the same as that for BIONJ.
Weighbor's selection criterion is based on evaluating the additivity and positivity of possible neighbours. These two properties are used to evaluate the likelihood of the observed distance between two taxa given that they are neighbours. The taxa joined at each step are the two with the highest likelihood. A cost function is derived from the negative log-likelihood, which turns the problem into one of cost minimization.
Definition 4.5.2 Distances have the additivity property if for taxa i and j, dik−djk is constant
for all other taxa k.
At each iteration, the additivity property is evaluated with respect to possible neighbours i and j. The likelihood is determined as the likelihood that for each k ≠ i, j, dik − djk is an estimate of an optimally weighted average (constant). Taking the negative log-likelihood gives us an additivity cost Add(i, j).
Bruno et al assume that distance errors are normally distributed. This formulation also assumes the correlations are those of a star phylogeny, so the cost is multiplied by a constant g to account for the fact that this assumption is usually incorrect.
Definition 4.5.3 Distances have the positivity property if for taxa i and j, dik + djl − dij − dkl ≥ 0 for all other taxa k and l. That is, the internal branch in the tree ((i, j), (k, l)) has non-negative length.
Consider dPQ, the length of the internal branch on the (i, k) and (j, l) paths in the tree. Assuming that dPQ is a normal random variable with a positive mean, we can calculate the likelihood that dPQ ≥ 0 by integrating over this part of the probability space. The positivity cost can then be computed as a negative log-likelihood:

Pos(i, j) = −ln[(1/2) erfc(−dPQ/(√2 σPQ))] (4.12)
Bruno et al suggest a heuristic for evaluating the positivity constraint. This is done to avoid
measuring positivity for dPQ for every quartet that i and j are involved in.
The complete cost function has the form:

S(i, j) = g Add(i, j) + Pos(i, j) (4.13)
Weighbor updates the distance matrix in a similar manner to BIONJ, so this step will not be discussed here.
4.5.7 NJ and the minimum evolution method
Neighbour joining's select operation is a minimum evolution criterion. That is, we choose as the best tree the one with the smallest sum of branch lengths.

Neighbour joining is also employed in Rzhetsky and Nei's [35] minimum evolution method. This involves the construction of a neighbour joining tree from the data. Topologies close to the NJ tree topology are also examined. Finally, the shortest tree is selected as the best tree. This provides a way of reducing the search space to a neighbourhood of trees.

In practice, the true tree is usually close to the ME tree. However, this does not apply to all data sets, and proofs have only described expected behaviour. Testing has indicated that searching the tree space around the ME tree does not add much value for money.
4.6 Least Squares
The method of least squares is a well studied tool for parameter estimation. It is also (theoretically) very applicable to the problem of branch length estimation for phylogenetic trees [7]. This, in turn, makes it very useful in minimum evolution based phylogenetic inference.
4.6.1 Estimating Branch lengths
The ordinary least squares method minimizes the following criterion:

Σi<j (dij − δij)² (4.14)

where dij is the pairwise distance between sequences i and j, and δij is the sum of the branch lengths between i and j on a tree of a certain topology. That is, it minimizes the unweighted sum of squared residuals.
Weighted least squares is similar but weights observations by their reliability. That is:

∑_{i<j} w_ij (d_ij − δ_ij)^2   (4.15)
where w_ij is usually taken to be the reciprocal of the variance of the observed distance. If
the weights are correct and observations are independent then WLS is statistically optimal [7].
This is implemented as the program FITCH in the PHYLIP package.
The independence assumption often does not hold, as distances are correlated whenever the paths
between two pairs of species have branches in common. In this case the optimal method is
generalized least squares (GLS). This minimizes the weighted sum of cross products of residuals:
∑_{i<j, k<l} w_{ij,kl} (d_ij − δ_ij)(d_kl − δ_kl)   (4.16)
where w_{ij,kl} is the appropriate element of the inverse of the variance-covariance matrix of the
distances.
This method is especially appealing because (theoretically) the GLS test statistic should
have a χ2 distribution under the true topology. This suggests methods to construct confidence
sets.
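The GLS estimate itself follows the standard linear-model formula ν̂ = (AᵀV⁻¹A)⁻¹AᵀV⁻¹d, where A is the path matrix mapping branch lengths to path lengths and V is the variance-covariance matrix of the distances. A sketch with an invented diagonal V (so this particular instance reduces to WLS):

```python
import numpy as np

# Path matrix for the four-taxon topology ((A,B),(C,D)):
# rows are pairs (AB, AC, AD, BC, BD, CD), columns are branches (a, b, c, d, e).
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

true_v = np.array([0.10, 0.20, 0.30, 0.40, 0.05])   # invented branch lengths
d = A @ true_v                                      # exact additive distances

V = np.diag([0.01, 0.02, 0.02, 0.02, 0.02, 0.01])   # invented covariance matrix
W = np.linalg.inv(V)

# GLS normal equations: (A^T W A) v = A^T W d.
v_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ d)
```

A full (non-diagonal) V, as estimated by Susko's method, slots into the same two lines; only the inversion cost grows.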
Bulmer [7] gives a method for estimating branch lengths using GLS. This follows from known
general theoretical results for GLS. Susko [40] provides a general method to calculate the variance-
covariance matrix. This is considered along with his proposed GLS hypothesis testing technique in
Chapter 6.
If we are considering an N taxa tree, the inversion of the covariance matrix has O(N^6) complexity.
This dominates the time complexity of the generalized least squares approach. However, it
only needs to be computed once. If it is precomputed, the naive algorithm is O(N^5).
In fact, Bryant and Waddell have devised an O(N^4) algorithm to deal with this problem, and
have also shown that this is a lower bound on its time complexity [5].
4.6.2 Minimum Evolution Method with Least Squares
Rzhetsky and Nei [35] have discussed the use of least squares in a minimum evolution method
of phylogenetic inference. This method involves the construction of a neighbour joining tree
from the data. Branch lengths of the NJ tree are fitted using the OLS estimator. Topologies close
to the NJ tree topology are also fitted. For each tree, the length of the tree is calculated as the
sum of its branch lengths. Finally the shortest tree is selected as the best one.
A major disadvantage of all three least squares approaches is that negative branch lengths
can be generated. These lengths have no biological meaning, and they make the application of
minimum evolution unclear when tree length is used as a criterion. This means that some positivity
constraints must be incorporated into either the branch length estimation or the calculation of
tree length.
Gascuel et al [18] provide an excellent survey of positivity constraints. They also investigate
the consistency of minimum evolution methods with least squares methods, including WLS and
GLS, given different positivity constraints. They found that OLS is generally consistent while
WLS and GLS are inconsistent. However, this does not eliminate the possibility of using GLS
for tree reconstruction. Consistency is an asymptotic property, while in reality sequences are finite,
and other characteristics of these methods need further investigation.
4.7 Bayesian Tree Reconstruction
We may also take a Bayesian approach to the problem of tree reconstruction. As might be expected,
this involves finding the tree with the highest posterior probability. For tree T_i, this means
evaluating:
f(T_i | D) = f(D | T_i) f(T_i) / ∑_{j=1}^{B(S)} f(D | T_j) f(T_j)   (4.17)

f(T_i) is the prior distribution over the space of trees and is usually set to be uniform, i.e.
f(T_i) = 1/B(S), where B(S) is the number of trees with S taxa. We also need to evaluate the
following integral:

f(D | T_i) = ∫_ν ∫_θ f(D | T_i, ν, θ) f(ν, θ) dν dθ   (4.18)
Where ν and θ are the vector of branch lengths and other model parameters respectively.
This integral cannot be evaluated analytically. Instead it is approximated using the Markov
Chain Monte Carlo (MCMC) method. In this method, chains roam around, taking samples from the space
of S-taxa trees. After many (millions of) iterations, the proportion of samples taken at (time
spent at) each tree approximates the posterior probability of that tree.
In practice, the Metropolis-coupled MCMC algorithm is often used. This involves running
n heated chains and one unheated chain. Only samples from the cold chain count, but a swap
between two randomly selected chains is proposed at each time step. Getting trapped at local
maxima is a big problem for hill-climbing algorithms; in effect, the swaps allow the cold chain
to escape from this situation.
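The mechanics can be illustrated on a toy discrete state space: ten states stand in for tree topologies, with invented target weights. Chain 0 is the cold chain; heated chains target the distribution raised to the power 1/T, and swaps between chains are accepted with the usual Metropolis ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target over ten discrete states (a stand-in for tree space).
w = np.array([1, 2, 1, 8, 1, 1, 3, 1, 1, 1], dtype=float)
temps = [1.0, 2.0, 4.0]          # chain 0 is the cold (unheated) chain
state = [0, 0, 0]                # current state of each chain
counts = np.zeros(len(w))

def log_target(s, T):
    # Heated chains sample the target raised to 1/T, flattening the landscape.
    return np.log(w[s]) / T

for _ in range(20000):
    # Standard Metropolis update within each chain (propose a neighbour state).
    for c, T in enumerate(temps):
        prop = (state[c] + rng.choice([-1, 1])) % len(w)
        if np.log(rng.random()) < log_target(prop, T) - log_target(state[c], T):
            state[c] = prop
    # Propose swapping the states of two randomly chosen chains.
    i, j = rng.choice(len(temps), size=2, replace=False)
    log_ratio = (log_target(state[i], temps[j]) + log_target(state[j], temps[i])
                 - log_target(state[i], temps[i]) - log_target(state[j], temps[j]))
    if np.log(rng.random()) < log_ratio:
        state[i], state[j] = state[j], state[i]
    counts[state[0]] += 1        # only the cold chain contributes samples

posterior = counts / counts.sum()   # should approximate w / w.sum()
```

In a real phylogenetic implementation the states are trees and the proposal is a topology or branch length perturbation, but the bookkeeping is the same.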
This method is further discussed in Chapter 5.6 in view of its hypothesis testing capabilities.
4.8 Trees from Alignments, Alignments from Trees
Phylogenetic tree reconstructions from sequence data require a set of aligned sequences. The
columns of these alignments represent the evolution of a particular site in a sequence.
For a tree to be reconstructed we need the sequences to have descended from a common
ancestor, that is, to be homologous. Ideally reconstruction methods should be able to
deal with insertions and deletions as well as substitutions, but this is rarely incorporated into
reconstruction methods. Instead, conserved blocks (with no indels) are used.
It is clear that good tree reconstructions require good alignments. Phylogenetic trees have
a somewhat circular relationship with the multiple alignment problem. On one hand, they can
be used to guide the order of alignment. This is certainly the case in the popular ClustalW
alignment program which uses neighbour joining. On the other hand a good alignment is
essential for accurate inference of phylogenetic trees. Algorithms exist to do these two things
simultaneously. For an overview see Durbin et al, chapter 7[11].
What about sequences that are only distantly related? ‘Don’t use them’ is usually the answer:
the error involved is just too high.
Chapter 5
Phylogenetic Hypothesis Tests
The main problem of phylogenetic inference is that the process of evolution is stochastic. We
cannot be sure of reconstructing the true tree even if we have the correct model. There is no
way of determining absolutely if we have the correct answer.
It can be difficult to place any faith in a tree reconstructed from a sequence alignment when
it conflicts with evidence of other phylogenies. This evidence can include taxonomy, geography
and even different partitions of the same genome.
The range of reconstruction methods available may not help clarify the issue either. Although
the methods described in Chapter 4 can be seen on a spectrum of parameterization, they
do not often agree on what the best tree is. This is often because the underlying criteria
for best trees are specified differently. The maximum likelihood tree may be the most parsimonious
tree in some cases, but not always. For model based methods, small differences between models of
evolution may also cause conflicting results.
This motivates the need for some sort of confidence testing method for trees. However,
different types of tests (of topology) can give rise to different conclusions. This may represent a
clash between Bayesian and frequentist approaches, and between non-parametric and parametric tests [6].

In this chapter, I examine the commonly used phylogenetic hypothesis testing techniques.
These include tests based on parametric and non-parametric bootstrapping and Bayesian-
MCMC techniques. However, first I consider the need for confidence region estimation in these
tests.
5.1 Confidence Regions of Phylogenetic Trees
Specifying confidence regions for phylogenetic trees, rather than making point estimates, makes
sense.
Fitting one tree to all the data at hand involves a loss of phylogenetic information at each
stage of the estimation process. The problem arises from estimating a discrete object from
continuous measurements. Holmes discusses the ‘rounding’ problem in phylogenetics as the
‘replacement of one functional stretch of DNA (usually a gene) by the gene tree’1 [24]. We need
to look at how the conversion from continuous to discrete, the rounding, at different stages in
the reconstruction process affects the outcome.
When we convert a matrix of characters into distances we lose information about how
mutation differs in different parts of the sequence. Whether it is better to use whole genomes,
or to create many trees from different partitions and try to reach a consensus, is an unanswered
question. We also need to consider what it means to output a tree as an answer if the true structure
is not treelike. Conflicting signals in the data may not always be noise, so they should not simply
be discarded (as is the case if the data is forced to fit one tree).
The space of trees is enormous, so tests that declare a particular tree wrong given the data
are not particularly helpful in the search for the correct one, especially when there is a lot of
conflicting signal around. However, the fact that there is conflicting signal should be able to
tell us something. This could be expressed via a probability distribution over trees or confidence
regions. However, it is not clear how we might go about constructing either without a notion
of distance between trees, which is needed to calculate, for example, the mean squared error.
Tree topologies are discrete combinatorial objects that represent a branching order. However,
this discrete space becomes more complex as we consider variable branch lengths. In fact
it becomes a union of manifolds, where each manifold represents trees with a fixed topology but
varying branch lengths. However, the union is not a manifold itself, as discontinuities appear where
the manifolds join (it is not smooth) [3]. When attempting to make phylogenetic inferences we need
to consider what effect this sort of landscape has on the problem.
On the other hand, classical hypothesis testing methods have a natural duality with confidence
regions, and the testing methods described in the rest of the chapter use this. While the other
issues of distances and geometry remain unclear, these techniques need to be well understood.
5.2 The Bootstrap
The bootstrap was introduced into phylogenetics as a way to measure confidence in reconstructed
trees.
[In the] case of phylogeny estimation, we do not know how the estimated tree
1A gene tree differs from a species tree in that it traces the evolution of a particular gene. This involves a different set of problems, such as gene duplication within a species.
will be expected to vary around the unknown true tree, and it is the function of the
bootstrap to give us an estimate of that distribution. (Felsenstein and Kishino [16]).
The bootstrap is used to determine how accurate an estimate θ̂ is of the true parameter
value θ. This is done by estimating the parameter using data replicated from the original data.
Each bootstrap data set x^(i) produces an estimate θ̂^(i). The theory suggests that we should be
able to infer the distribution of θ̂ − θ from that of θ̂^(i) − θ̂ [12].
The above formulation was created for the non-parametric bootstrap. However the motivation
applies (with some slight differences) to the newly popular parametric bootstrap, the
difference between the two being that the parametric bootstrap relies on a model of evolution.
The two approaches are discussed below.
5.3 The Non-parametric Bootstrap
The non-parametric bootstrap for phylogenetic trees is based on the more general bootstrap
technique in statistics. The algorithm is simple: from an alignment of m species draw n columns
with replacement. These n columns are then used to reconstruct a tree. This is done R times
and a consensus tree can be drawn from the R bootstrap trees.
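The resampling step is straightforward to sketch. Here the alignment is a toy example and, instead of a full tree reconstruction per replicate, the 'grouping' statistic simply checks whether sequences 0 and 1 remain closer to each other than to the other sequences under Hamming distance; the bootstrap proportion is the fraction of replicates where they do:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy alignment: 4 sequences (rows) by 12 sites (columns).
aln = np.array([list(s) for s in [
    "ACGTACGTACGT",
    "ACGTACGAACGT",   # nearly identical to sequence 0
    "TTGTACCTACCA",
    "TTGAACCTACCA",   # nearly identical to sequence 2
]])

def groups_together(cols):
    """Are sequences 0 and 1 closer to each other than to sequences 2 and 3?"""
    d = lambda i, j: int((cols[i] != cols[j]).sum())   # Hamming distance
    return d(0, 1) < min(d(0, 2), d(0, 3), d(1, 2), d(1, 3))

R = 200
hits = 0
for _ in range(R):
    idx = rng.integers(0, aln.shape[1], size=aln.shape[1])  # resample columns
    if groups_together(aln[:, idx]):
        hits += 1
bp = hits / R    # bootstrap proportion for the (0, 1) grouping
```

In real use each replicate would be fed to a full reconstruction method (NJ, ML, parsimony) and the proportions read off the consensus tree.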
The statistic that is usually used to interpret bootstrap data is the bootstrap proportion
[12] of a group of descendants of a particular common ancestor (a monophyletic group).
Definition 5.3.1 The bootstrap proportion (BP), for a particular monophyletic group, is the
proportion of bootstrap replicates that agree with the group in the original data set.
Within this definition the monophyletic group doesn’t depend on the topology of the subtree
induced by it. The bootstrap proportion is also known as a bootstrap confidence level. It is also
sometimes misleadingly called a bootstrap P-value [12]. However, it should be emphasized that it
does not have the usual P value interpretation.
A Multinomial Model
The bootstrap can be described via a multinomial model.
Let N be the number of sequences we have, with S the length of each of these sequences.
Assume that these sequences are aligned. If we are dealing with nucleotide sequences then
K = 4^N is the total number of possible site columns (vectors of length N). These vectors can be
enumerated as X_1, X_2, . . . , X_K with associated probability vector π = (π_1, . . . , π_K). So, our
observed sequence alignment can be seen as S independent samples (columnwise) from this set. We can
take the set of distinct vectors observed as Y_1, . . . , Y_k. This is a multinomial model. We also
have:

π_i = (1/|K|) ∑_{j=1}^{S} 1(Y_i = X_j)   (5.1)
The space of trees can be partitioned according to tree topology. Now, the partition that
π (the vector of true probabilities) falls into determines the correct tree. This relies on the
assumption that the true value π results in correct distance estimates. It is well known
that reconstruction methods such as neighbour joining guarantee the correct tree given correct
distances as input. So, we get the correct tree. The problem is whether or not the estimated π
falls in the same partition. The bootstrap can be used to shed light on how often we can expect
this to happen.
5.3.1 Testing Phylogenies using the Non-parametric Boostrap
Efron et al [12] claim that Felsenstein’s application of the bootstrap is non-standard in that the
statistic, the tree, does not change in a continuous way. The discontinuities
in tree space occur as the tree topology changes. This means that a little variability near a
boundary (discontinuity) may result in the selection of a different tree, while in another area it may
have negligible effect.
Holmes also points out that there does not exist a theory for the bootstrap with respect to
discrete combinatorial objects such as trees. We must also consider the sparsity of the data:
many possible columns will have zero counts. However, the development of confidence testing
methods using the bootstrap has persisted.
Posterior probabilities using the Non-Parametric Bootstrap
The following is an example of how the bootstrap can be used to calculate a confidence estimate
for θ̂. This is basically a probabilistic measure of the distance between the estimate θ̂ and a
boundary of a region of interest R. Using bootstrap replicates i, estimate α̂ = Pr(θ̂^(i) ∈ R).
This is an a posteriori probability that θ ∈ R assuming a uniform prior. This construction
means that if the boundary curves away from θ̂ then the confidence estimate increases.
This suggests that bootstrap proportions can be used to estimate Bayesian posterior prob-
abilities. The validity of this is discussed in section 5.7.
5.3.2 How well does it work?
The non-parametric bootstrap claims to be free of the shackles of evolutionary models and the
problems entailed by them. However, the technique implicitly assumes that sites evolve iid.
This contradicts, for example, evidence that some regions of proteins are highly conserved for
functional reasons. A block bootstrap has been developed that treats a whole conserved
block as an independent ‘site’. The block bootstrap does not, however, avoid the other problems
associated with the technique.
The main purpose of the bootstrap is to lend some confidence to monophyletic groupings. On
one hand, it is reasonable to suggest that rare groupings are not to be relied upon. However, it
is not particularly clear what the frequent occurrence (high bootstrap proportion) of a group means.
The most intuitive interpretation of the bootstrap proportions is as a measure of repeatability.
This is calculated as the average of bootstrap proportions computed from samples drawn
from the true distribution [16]. That is, it gives us an idea of how sensitive trees (and subtrees)
are to noise in the data. If a group occurs frequently then it is less likely to be prone to sampling
artefacts. There has been a great deal of theory justifying the bootstrap proportion as a
measure of repeatability [1][12].
Bootstrap precision indicates how well the bootstrap proportions for a finite set of data replicates
match what would be obtained from an infinite set. The non-parametric bootstrap
has been found to have high precision when the number of replicate data sets is reasonably
large (say 1000). That is, when the number of replicates is sufficiently high, we can assume
that the behaviour is not too different from the asymptotic behaviour.
However, it is often pointed out that a very precise estimate of the wrong thing does not
really help anyone[22]. A similar argument is made against the using repeatability to validate
a result.
The attribute of the bootstrap that should resolve this issue is bootstrap accuracy. Accuracy,
for a bootstrap, is the fraction of times that repeated sampling recovers the true topology. This use
of the bootstrap as a measure of accuracy has been attacked many times for being biased and
highly variable.
The Bias of the Bootstrap
Hillis and Bull [22] have empirically shown that the probability that the bootstrap tree and
the true tree fall in the same region is less than that of the original estimate and the true tree.
This suggests that the bootstrap is biased. However, many have taken up the defense of the
bootstrap [16], arguing that this is simply a matter of incorrect usage.
If θ is the parameter we are estimating, the argument goes like this. The distribution of
θ̂^(i) − θ is not the same as that of θ̂ − θ. It is well agreed that the former has around twice
the variance of the latter. To make inferences we need to examine the differences between
the bootstrap results and the tree generated from the original data. If used properly, bootstrap
estimates reasonably reflect the kinds and degree of variability around the true tree.
Variability of Bootstrap Proportions
Felsenstein and Kishino [16] claim that whether large bootstrap proportion values overestimate
or underestimate the probability that a group is correct depends on how informative the data is,
that is, whether or not the data is powerful enough to resolve the phylogeny. If it is, the
bootstrap proportion is usually an underestimate. This makes the bootstrap a conservative
test.

Felsenstein and Kishino suggest that a better use of the bootstrap proportion, P_B, is to look
at (1 − P_B) as a conservative assessment of the probability of finding this much evidence favouring
a group if it is not really present. This is like the Type I error in classical hypothesis testing.
5.4 The Parametric Bootstrap
The parametric bootstrap uses the parameters of the null hypothesis and the original data set
to simulate replicate data sets. Unlike the non-parametric bootstrap, this approach requires a
fully specified model of sequence evolution to generate replicates. The parameters involved are
some or all of a tree topology, branch lengths and substitution model parameters. These replicate
data sets are used to estimate the distribution of the test statistic under the null hypothesis.
P-values can be determined by ranking the values obtained from the simulation and determining
quantile values from there.
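The recipe can be sketched with a deliberately simple null model, here a Poisson distribution standing in for a full model of sequence evolution; the statistic is the dispersion index and the P-value comes from the rank of the observed value among the simulated ones:

```python
import numpy as np

rng = np.random.default_rng(2)

def statistic(x):
    return x.var() / x.mean()    # dispersion index; close to 1 under a Poisson

def simulate_null(theta, n):
    # Replicate data set simulated under the (fully specified) null model.
    return rng.poisson(theta, size=n)

data = rng.poisson(2.0, size=100)   # observed data
theta_hat = data.mean()             # ML estimate of the null-model parameter
obs = statistic(data)

reps = np.array([statistic(simulate_null(theta_hat, data.size))
                 for _ in range(999)])
# One-sided P-value from the rank of the observed statistic.
p_value = (1 + (reps >= obs).sum()) / (1 + len(reps))
```

In the phylogenetic setting `simulate_null` becomes sequence simulation down the null tree, and the statistic is typically a (log) likelihood ratio, as in the SOWH test below.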
The big advantage of this technique is that it guarantees that statistics are drawn from the
null hypothesis. This means there is overall less confusion about what the values generated
from bootstrap replicates mean.
The use of the parametric bootstrap has also been motivated by tests of monophyly,
that is, testing whether a group of species has a common ancestor. Traditionally, this has
been seen as a test of nested models, with the null hypothesis constraining certain species to form
a monophyletic group. Huelsenbeck et al give an example of how the space of trees can be
partitioned to reflect this constraint in [27].
The common approach has been to perform a likelihood ratio test. The assumption has
been that, since the hypotheses are nested, the test statistic −2 log λ has a χ²(k) distribution,
where λ is the likelihood ratio and k is the difference in the number of parameters between
hypotheses.
The problems start to appear when we try to determine the degrees of freedom k. The topology
of the tree is clearly a parameter of the models we are considering, yet a tree topology is also clearly
not an element of ℝ. This renders the use of the χ²(k) distribution invalid.
However, if we do know the distribution of −2 log λ under the null hypothesis we can proceed
with the hypothesis testing. This motivates the use of the parametric bootstrap.
5.4.1 Problems with the Parametric Bootstrap
Certainly, the debate over the use of the parametric bootstrap has been less inflamed than that
of the non-parametric bootstrap. The area of contention appears to be the pervasive problem
of whether results based on parameterized models can be trusted.
This knowledge of the underlying model makes parametric tests generally more powerful
than non-parametric tests2. But, as usual, this also makes them more susceptible to errors due to
model misspecification.
The parametric bootstrap is highly dependent on the model of evolution used and, of course,
on parameter estimation. For example, if the estimation of parameters is not very good, the
alternative hypothesis (with more parameters) might perform worse than the null hypothesis when
it shouldn’t. This often causes tests based on this method to be too liberal.
Huelsenbeck et al [27] found that with rates of change varying over the tree, the tests
became too liberal. A change in only one parameter could cause a noticeable bias. This means a
higher rate of Type I error.
5.5 Bootstrap Based Tests
In this section, I examine two related hypothesis testing methods based on the bootstrap.
The Kishino-Hasegawa test and the Shimodaira-Hasegawa test both use the non-parametric
bootstrap while the SOWH test is parametric. Each of these tests attempts to simulate the
distribution of a test statistic under a null hypothesis.
However, first we need to examine the need for centering bootstrap data to make it conform
to the null hypothesis.
2Although this is not always the case. See [24].
5.5.1 Centering
If bootstrap estimates are not drawn from the null hypothesis distribution we lose the ability to
detect the alternative hypothesis. That is, the test has less power [43]. This is related to the
incorrect use of the bootstrap discussed in Section 5.3.2.
For example, let µ, the true mean, be the value we are trying to estimate, and y a set
of samples. Consider calculating a bootstrap proportion by counting how many times the
following occurs in the bootstrap replicate sets:

ȳ^(i) − µ ≥ ȳ − µ   (5.2)

where the bar indicates the sample mean. But E(ȳ^(i)) = ȳ, so the proportion of times that this
is true will always be around one half. This certainly demonstrates a lack of power.
A better method employs centering [43]. That is, calculate bootstrap proportions from:

ȳ^(i) − ȳ ≥ ȳ − µ   (5.3)

This ensures that the statistic in question is drawn from the null hypothesis and that the power
of the test is retained. Both of the following techniques use centering.
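A small numerical illustration of the difference, with an invented sample whose true mean is 1 while the null hypothesises 0: the uncentered proportion (5.2) hovers around one half regardless, while the centered proportion (5.3) is near zero, correctly signalling evidence against the null:

```python
import numpy as np

rng = np.random.default_rng(3)

mu0 = 0.0                        # mean claimed by the null hypothesis
y = rng.normal(1.0, 1.0, 50)     # sample actually drawn with mean 1
ybar = y.mean()

R = 2000
boot = np.array([rng.choice(y, y.size).mean() for _ in range(R)])

# Uncentered proportion (5.2): compares replicates to the null directly.
uncentered = (boot - mu0 >= ybar - mu0).mean()
# Centered proportion (5.3): replicate deviations from ybar represent H0.
centered = (boot - ybar >= ybar - mu0).mean()
```

The uncentered version would give roughly 0.5 even if the true mean were 10; only the centered version responds to how far the data sit from the null.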
5.5.2 The Kishino Hasegawa Test
The Kishino and Hasegawa (KH) test determines whether two trees, selected a priori, are equally
well supported by the data. This is determined by examining the difference in likelihood of the
two trees. Let the two trees under consideration be T_1 and T_2, and let L_i be the log-likelihood of
tree T_i. If they are equally well supported then δ = L_1 − L_2 ≈ 0.
In terms of hypothesis testing we compare:

H_0 : E(δ) = 0   (5.4)
H_A : E(δ) ≠ 0   (5.5)
Clearly the distribution of δ under H_0 is needed to test these hypotheses. The KH test derives
this using the non-parametric bootstrap. To do this, n bootstrap replicates of the original data
are generated. For each replicate i, parameters are re-estimated to obtain maximum likelihood
estimates for T_1 and T_2. Hence, δ^(i) = L_1^(i) − L_2^(i) is calculated for each replicate. Centering
is then applied to the δ^(i) to ensure conformity to the null hypothesis. The centered statistics
then estimate the distribution of δ under the null hypothesis.
The plausibility of δ is determined by checking whether it lies in the appropriate confidence
interval, determined from a ranked list of the bootstrap statistics. For example, for a 5% test it
should fall between the 2.5% and 97.5% points.
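A sketch of the KH machinery using RELL-style resampling of per-site log-likelihoods; the per-site values here are randomly generated stand-ins for values computed from two real trees, and the replicates are centered so that they conform to E(δ) = 0:

```python
import numpy as np

rng = np.random.default_rng(4)

n_sites = 300
# Hypothetical per-site log-likelihoods for trees T1 and T2; in practice these
# come from a likelihood calculation, here they are invented for illustration.
l1 = rng.normal(-1.0, 0.3, n_sites)
l2 = l1 + rng.normal(0.0, 0.1, n_sites)   # T2 fits about equally well

delta = l1.sum() - l2.sum()               # observed test statistic

R = 1000
reps = np.empty(R)
for r in range(R):
    idx = rng.integers(0, n_sites, n_sites)   # resample site log-likelihoods
    reps[r] = l1[idx].sum() - l2[idx].sum()
reps -= reps.mean()                           # centre onto H0: E(delta) = 0

lo, hi = np.quantile(reps, [0.025, 0.975])    # two-sided 5% test
reject = not (lo <= delta <= hi)
```

Resampling the fixed per-site values rather than re-optimizing each replicate is exactly the time saving the RELL shortcut (discussed under the SH test) provides.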
Problems with the KH test
The problems with the KH test have more to do with its usage than its construction. The test is
two sided on the assumption that T_1 and T_2 are selected a priori. That is, we have no knowledge
about which one is likely to be a better fit to the data.
If this does not hold then analyses based on this test become invalid. In [19], Goldman et al
show that when the maximum likelihood tree is used in such a comparison, we inevitably have
E(δ) > 0. This means that tests should not be based on the null hypothesis as stated above.
Unfortunately, the main use of likelihood based tests (like KH) is to test the maximum likelihood
tree against another tree (perhaps the second most likely). This severely restricts the use of the
KH test in real phylogenetic studies.
5.5.3 The Shimodaira Hasegawa Test
The Shimodaira Hasegawa (SH) test [37] is a correction to the KH test. It simultaneously
compares every tree in the set of plausible trees, Ω say. Null and alternative hypotheses are
formulated as follows.
H0 : All Tx ∈ Ω are equally good explanations of data (5.6)
HA : Some Tx ∈ Ω are better than others (5.7)
In the SH test, δ_x = L_ML − L_x is calculated for all T_x ∈ Ω, where L_ML is the likelihood of the
maximum likelihood tree for the data at hand. The plausibility of each δ_x is once again determined
by comparing it to the distribution of δ_x under H_0. Like the KH test, this distribution
is determined by the non-parametric bootstrap.
For each of the N bootstrap replicates i, the likelihood L_x^(i) of each T_x ∈ Ω is calculated with
parameters θ_x. These likelihood estimates are then centered for conformity to H_0:

L̃_x^(i) = L_x^(i) − L̄_x   (5.8)

where L̄_x is the mean of the L_x^(i) over the N replicates. This devalues the significance accorded
to the selection of T_ML a posteriori. The maximum likelihood value for each replicate i is then
determined as:

L̃_ML^(i) = max_{T_x ∈ Ω} L̃_x^(i)   (5.9)

We can then generate the required distribution by determining:

δ_x^(i) = L̃_ML^(i) − L̃_x^(i)   (5.10)
These δ_x^(i) approximate the distribution of the δ_x under the null hypothesis. We test whether
each T_x is plausible by checking whether δ_x lies in the 0 - 95% confidence interval determined by
ranking the δ_x^(i) values. Note that this is an appropriate one-sided test. A P value for T_x can
be calculated as:

P_x = (number of replicates with δ_x^(i) > δ_x) / N   (5.11)
where N is the number of bootstrap replicates generated. Note that it is extremely important
that all plausible trees are used. This set must be selected a priori for the test to remain valid.
We can then determine a confidence set as topologies with Px ≥ P∗ where P∗ is a prespecified
significance level. Let TML∗ be the topology that maximizes the expected likelihood. Shimodaira
and Hasegawa [37] note that the coverage probability of TML being included in the confidence
interval is greater than 1 − P∗.

Both the KH and SH tests use the RELL (resampling estimated log-likelihood) technique to
estimate likelihood values for bootstrap sets. Instead of finding the ML estimates of parameters
for each replicate, the RELL method uses the estimates from the original data set. Avoiding these
reoptimizations provides a significant time saving.
Relationship of SH and KH
Goldman et al [19] show that dividing the KH P value, P_KH, in half converts it to a one-sided
test. They also explore the relationship between the halved P value and the SH one sided test.
Now, the SH test P value exceeds P_KH/2 by an unknown amount. So, if P_KH/2 does not
reject the null hypothesis then neither would the SH test. However, if P_KH/2 does reject the
null hypothesis we cannot predict what the SH test might have done.
5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH)
The SOWH [41] test employs the parametric bootstrap to determine whether a tree T_0 is the
true topology for a given data set. This is formulated as:

H_0 : T_0 is the true topology   (5.12)
H_A : Another tree is the true topology   (5.13)
To determine whether or not we reject the null hypothesis, we consider a likelihood ratio
test of T_0 and T_ML. This can be expressed in terms of log-likelihoods as δ = L_ML − L_0. The
parametric bootstrap is then used to determine the distribution of δ under the null hypothesis.
The distribution is generated by simulating replicate data sets based on T_0 and θ_0, the ML
estimates of the free parameters for T_0 with respect to the original data set.
For each data set i, L_0^(i) is calculated from T_0 and the re-estimated ML parameters θ_0^(i). Similarly,
L_ML^(i) is determined from the T_x, θ_x^(i) that maximize the log-likelihood for data set i. Now
we can simulate the distribution of δ under the null hypothesis using:

δ^(i) = L_ML^(i) − L_0^(i)   (5.14)
We can now determine a 95% confidence interval by ranking these δ^(i). Our original δ is plausible
if it falls into this confidence interval. This is a one-sided test, as δ must be greater than zero [19].
Problems with SOWH
The SOWH test needs to find the ML tree for every replicate data set generated. This is clearly
a very time consuming process, so heuristic ML searches are used. A problem with this approach
is that the heuristic might miss the optimal tree, which may create artifacts if it happens on
many data sets. Selecting a suboptimal tree leads to a lowered log likelihood value, which
in turn reduces the value of δ^(i). On a large scale this serves to pull the mass of the distribution
closer to zero, making the test more liberal. Hence the SOWH test is prone to Type I errors.
To get around this, it is suggested that the most closely fitting model, meaning the most
highly parameterized, should be used. However, even the GTR model has been shown to be
inadequate [26].
While, theoretically, negative δ values shouldn’t occur in the SOWH test, they have been known
to arise in simulation. This can be attributed to poor parameter estimation, which again points
to the need for a realistic model of evolution.
5.6 Bayesian methods
Bayesian methods of testing tree topologies follow from the Bayesian reconstruction method. The
‘best’ tree is the one with maximum posterior probability. As in Section 4.7, this is determined
by approximating the posterior probability distribution over all trees with the appropriate
number of leaves. This immediately gives us an idea of the confidence we should assign to a
given tree and of how we can determine confidence regions.
This method also allows us to determine the posterior probability of a monophyletic group.
The algorithm is the same as that of whole tree testing. We merely determine the proportion
of samples that have the grouping of interest.
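Counting such a posterior probability from MCMC output is simple bookkeeping. In this sketch each sampled tree is (artificially) reduced to the set of its clades, each clade a frozenset of leaf labels:

```python
# Four hypothetical MCMC tree samples, each reduced to its set of clades.
samples = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
    {frozenset("AB"), frozenset("CD")},
]

clade = frozenset("AB")
posterior = sum(clade in tree for tree in samples) / len(samples)
print(posterior)   # 0.75: the (A,B) grouping appears in 3 of 4 samples
```

Whole-tree posterior probabilities are obtained the same way, counting complete clade sets instead of a single clade.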
The approach has many attractive features. It allows us to incorporate multiple sources
of uncertainty via the prior probabilities [6]. Many argue that posterior probabilities are more
intuitive than P-values. Simulations also indicate that the Bayesian method (as implemented
in MrBayes) is able to assign large amounts of confidence even when very few site changes are
observed.
On the other hand, this method is highly dependent on the substitution model. This is exacerbated
if uninformative priors are used. If informative priors are used, there is still the risk that the
priors are wrong, although Rannala and Yang’s work on MAP trees suggests that posterior
probabilities are insensitive to priors [33].
Another issue of concern is that the convergence of sampling to the desired distribution is
not guaranteed. Still more worrying is the evidence that posterior probabilities may support
contradictory hypotheses equally well.

Buckley’s investigations into the Bayesian approach show that it can definitely mislead
[6]. This is attributed to model misspecification. In fact, when the results were misleading, the
posterior probabilities were very low. These simulations also indicate that the results of the
SOWH test and Bayesian posterior probabilities are generally correlated, although the BPP shows
greater levels of uncertainty (because it can!).
5.7 Bootstraps and Posterior Probabilities
As mentioned in Section 5.3, Efron et al. [12] have suggested that bootstrap P-values can be used
as indicators of the posterior probability of a tree. However, simulation studies by Douady et al.
[10] provide evidence to the contrary, when posterior probabilities are calculated as described
in Section 5.6. This is not particularly surprising in light of the different information
that these values take as input [1].
Consider that Bayesian MCMC (BMCMC) methods are highly dependent on the model
of substitution. On the other hand, the measurement of bootstrap proportions is based on a
multinomial model of a matrix of characters.
Another difference that needs considering is the number of substitutions needed to obtain
significance from the two values. Posterior probabilities and bootstrap proportions showed the
greatest difference on short branches. The Bayesian MCMC methods placed high confidence
(exceeding 95%) in situations where very few substitutions were expected. Maximum likelihood
bootstrap and maximum parsimony bootstrap did not reach these confidence levels. This indicates
that BMCMC methods are much more sensitive to phylogenetic signal. On the other
hand, BMCMC methods are more likely to lend high support to incorrect groupings.
Alfaro’s recent investigations [1] show that Bayesian MCMC posterior probabilities and boot-
strap proportions ‘are not equivalent measure of confidence’. BMCMC values tend to have a
lower Type I error rate than do bootstrap methods. The conclusion is that a node with high
posterior probability is very likely to be correct, but only if the underlying model is correct. If
a node has a moderate bootstrap value, it should be considered to be highly dependent on the
data and may not be present if the data is extended.
5.8 Which Test?
The answer to this question is: the test that best suits the needs of the research being done.
Non-parametric bootstrap based tests can provide good indications of how sensitive a recon-
struction method is to variability in the data. However, much care must be taken when using
the bootstrap to find distributions of test statistics. Ensuring that the simulated distributions
conform to the null hypothesis is particularly important. These tests are also generally conser-
vative and high confidence values cannot usually be obtained for trees with very short branches.
Buckley [6] has found that SH has much greater uncertainty than SOWH; that is, SH P-values
are generally higher than those for SOWH.
Parametric bootstrap based tests can be very powerful but are vulnerable to errors due to
model misspecification. Inaccurate estimation of parameters usually leads to the test becoming
too liberal and prone to Type I error. Current models of site substitution do not appear to be
adequate to overcome this problem.
Tests based on Bayesian inference via Markov Chain Monte-Carlo methods give a more
intuitive picture of the uncertainty. If we want to determine how well results fit the data with
respect to a fully specified model then Bayesian posterior probabilities are a good choice. This
is especially the case when there is not much phylogenetic signal. However, it too is prone to
model misspecification and can lead to high support for conflicting trees.
All of the tests described above can be used to provide some sort of confidence region on the
space of trees. They also require some sort of optimization (maximum likelihood estimation)
over a large number of trees, which is very computationally expensive. The heuristics introduced
to reduce this cost also reduce the effectiveness of these tests.
Clearly, there is still scope for the development of alternative tests for phylogenetic inference.
Chapter 6
Generalized Least Squares in
Phylogenetic Hypothesis Testing
Branch length estimation for a tree T with N taxa can be described by a linear model¹:

y = Xβ + ǫ (6.1)
The variables are defined as follows. y = (Y1, . . . , Yn)T is the n = N(N − 1)/2 vector of
observed pairwise distances between sequences; that is, Yj is the distance between the j-th
pair of sequences. β = (β1, . . . , βm)T is an m = 2N − 3 vector of estimated branch lengths.
ǫ = (ǫ1, . . . , ǫn)T is a vector of unobserved errors. The design matrix, X, is the n × m
incidence matrix of branches in the tree with respect to pairs of leaves: if j indexes the
pairwise distance between sequences r and s, then Xjl = 1 if branch l lies on the path between r and s, and Xjl = 0 otherwise.
Similarly, we can write the distance between leaves of the tree as δj = δ(r, s). This can be
calculated simply by adding up all the branches in the path between the pair:

δj = Σ_{l=1}^{m} Xjl βl (6.2)
In matrix notation:
δ = Xβ (6.3)
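As an illustration of equations (6.2) and (6.3), the following sketch builds the incidence matrix X for the four-leaf topology ((1,2),(3,4)) and recovers the pairwise distances from a made-up set of branch lengths (the branch numbering and lengths here are hypothetical):

```python
from itertools import combinations

# paths[(r, s)] lists the branches on the path between leaves r and s.
# Branches 1-4 are the leaf branches; branch 5 is the internal branch.
paths = {
    (1, 2): [1, 2],
    (1, 3): [1, 5, 3],
    (1, 4): [1, 5, 4],
    (2, 3): [2, 5, 3],
    (2, 4): [2, 5, 4],
    (3, 4): [3, 4],
}
pairs = list(combinations([1, 2, 3, 4], 2))
m = 5  # number of branches, 2N - 3 with N = 4

# Row j of X flags the branches on the path between the j-th pair of leaves.
X = [[1 if l + 1 in paths[p] else 0 for l in range(m)] for p in pairs]

beta = [0.1, 0.2, 0.3, 0.4, 0.05]  # made-up branch lengths
# delta_j = sum_l X_jl * beta_l, equation (6.2)
delta = [sum(X[j][l] * beta[l] for l in range(m)) for j in range(len(pairs))]
```

For instance, the path between leaves 1 and 3 crosses branches 1, 5, and 3, so δ for that pair is 0.1 + 0.05 + 0.3 = 0.45.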
Let V be the variance-covariance matrix for y. We know from the theory of linear models,
¹See, for example, [34], for a more detailed account of the theory of linear models
we can fit δ to y for a given tree topology T by minimizing the GLS test statistic, gT:
gT = (y − Xβ)T V−1(y − Xβ) (6.4)
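A minimal sketch of this minimization, in pure Python with a naive linear solver; the quartet design matrix, branch lengths, and identity covariance below are illustrative stand-ins, not values from the thesis simulations:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gls_fit(X, y, Vinv):
    """Return (beta_hat, g_T) minimizing (y - X b)^T Vinv (y - X b)."""
    n, m = len(X), len(X[0])
    # Normal equations: (X^T Vinv X) b = X^T Vinv y
    XtVi = [[sum(X[r][i] * Vinv[r][c] for r in range(n)) for c in range(n)] for i in range(m)]
    A = [[sum(XtVi[i][r] * X[r][j] for r in range(n)) for j in range(m)] for i in range(m)]
    b = [sum(XtVi[i][r] * y[r] for r in range(n)) for i in range(m)]
    beta = solve(A, b)
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(m)) for r in range(n)]
    g = sum(resid[r] * Vinv[r][c] * resid[c] for r in range(n) for c in range(n))
    return beta, g

# Four-leaf tree ((1,2),(3,4)): rows are pairs (1,2),(1,3),(1,4),(2,3),(2,4),(3,4);
# columns are the four leaf branches plus the internal branch.
X = [[1,1,0,0,0],[1,0,1,0,1],[1,0,0,1,1],[0,1,1,0,1],[0,1,0,1,1],[0,0,1,1,0]]
true_beta = [0.1, 0.2, 0.3, 0.4, 0.05]
y = [sum(X[r][j] * true_beta[j] for j in range(5)) for r in range(6)]  # noise-free
I6 = [[1.0 if r == c else 0.0 for c in range(6)] for r in range(6)]    # toy V^{-1}
beta_hat, g = gls_fit(X, y, I6)
```

With noise-free distances the fit recovers the generating branch lengths and gT is zero up to rounding; with real distance estimates V⁻¹ would come from one of the covariance estimates discussed below.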
The statistical theory behind generalized least squares is well developed. It indicates, theoretically,
that the GLS test statistic can be used to test the phylogenetic hypotheses:
H0 : the given tree T is the true tree (6.5)
HA : another tree is the true tree (6.6)
If y, the observed distances, are normally distributed, then it is well known that under H0 the
GLS test statistic has a χ2(k) distribution. The degrees of freedom, k, can be calculated as the
number of observed distances minus the number of branches in the tree; from the construction
above, k = n − m. A P-value can be calculated as the probability of exceeding the observed
statistic value under the χ2(k) distribution. This is used to test the null hypothesis for a given
significance level α in the usual fashion.
Confidence regions over sets of topologies can be constructed by calculating gT over all
topologies with N taxa. The (1 − α) × 100% set can be determined by finding all topologies
with P-values greater than or equal to α.
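Computing these P-values needs only the χ²(k) survival function, which for integer k follows a simple recurrence. The sketch below (standard library only; the dictionary of statistics is made up for illustration) shows the construction of the (1 − α) × 100% set:

```python
import math

def chi2_sf(x, k):
    """P(chi^2_k > x) for integer k >= 1, via the recurrence
    Q(x; k) = Q(x; k-2) + (x/2)^(k/2 - 1) e^(-x/2) / Gamma(k/2)."""
    if k == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if k == 2:
        return math.exp(-x / 2.0)
    return (chi2_sf(x, k - 2)
            + (x / 2.0) ** (k / 2.0 - 1.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0))

def confidence_region(g_by_topology, k, alpha=0.05):
    """Topologies whose GLS statistic g_T gives a P-value >= alpha."""
    return [t for t, g in g_by_topology.items() if chi2_sf(g, k) >= alpha]

# Hypothetical statistics for the three unrooted four-leaf topologies (k = 1).
g_values = {"(12)(34)": 0.4, "(13)(24)": 5.2, "(14)(23)": 9.8}
region = confidence_region(g_values, k=1)
```

Here only the first topology survives at α = 0.05, since the other two statistics exceed the 95th percentile of χ²(1) (about 3.84).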
6.1 Sample Average Variance and Covariance
The use of the χ2(k) distribution requires the observed distances to be multivariate normal.
Susko provides a proof that this holds when the number of sites used to generate the distances
is large, assuming the y are ML estimates.
The following provides a more detailed derivation of the sample average covariance matrix,
V, as proposed in [40].
Theorem 6.1.1 If the observed pairwise distances y are maximum likelihood distances
derived from sequences of length n, then as n → ∞, y is multivariate normal.
Proof This proof basically follows from well known results in the asymptotic theory of
maximum likelihood estimation [28][29]. Let x be the characters of the jth pair of aligned
sequences and d the distance between the pair. Let pj(x, d) denote the probability of
the data x at a site for the jth pair of taxa. Also, let
l(x; d) = Σ_{i=1}^{n} log pj(xi; d) (6.7)

l′(x; d) = Σ_{i=1}^{n} (∂/∂d) log pj(xi; d) (6.8)

l′′(x; d) = Σ_{i=1}^{n} (∂²/∂d²) log pj(xi; d) (6.9)
Let yj be the observed distance and dj the true distance of the jth pair. Now if
yj is a maximum likelihood estimate we can assume it was found by solving the condition

l′(x; yj) = 0 (6.10)

and note that

Edj(l′(x; dj)) = 0 (6.11)

We can approximate l′(x; yj) by expanding around dj:

0 = l′(x; yj) ≈ l′(x; dj) + l′′(x; dj)(yj − dj) (6.12)

i.e. (yj − dj) = −l′(x; dj) / l′′(x; dj) (6.13)

i.e. √n (yj − dj) = [(1/√n) l′(x; dj)] / [−(1/n) l′′(x; dj)] (6.14)

Note, as n → ∞,

−(1/n) l′′(x; dj) = I(dj)/n →p Ī(dj) (6.15)

where I(dj) is the observed Fisher information of the distance and Ī(dj) is the expected Fisher
information. Hence

√n (yj − dj) = (1/√n) (I(dj)/n)^{−1} l′(x; dj) (6.16)

i.e. √n (yj − dj) = √n I(dj)^{−1} l′(x; dj) (6.17)

Now let u be a vector such that

uj = √n I(dj)^{−1} l′(x; dj) (6.18)
Now each uj is a sum of independent random variables with expectation 0. So, by the Multi-
variate Central Limit Theorem,

(1/√n) u ∼ Np(0, V) (6.19)

where V is the limiting covariance matrix of (1/√n) u. This in turn means that

y = (1/√n) u + d (6.20)

y ∼ N(d, V) (6.21)
The sample average covariance matrix, for n large, is formulated as follows. For the diagonal
entries,

Vjj = var((1/√n) uj) = (1/n) E(uj − 0)² (6.22)

= (1/n) E(√n I(dj)^{−1} l′(x; dj))² (6.23)

= I(dj)^{−2} E(l′(x; dj))² (6.24)

= −I(dj)^{−2} E(l′′(x; dj)) (6.25)

= I(dj)^{−2} × I(dj) (6.26)

= 1 / I(dj) (6.27)

= 1 / (−l′′(x; dj)) (6.28)

where (6.25) uses the score identity E(l′)² = −E(l′′) and (6.26) approximates the expected
information by the observed information. That is,

Vjj = 1 / Σ_{i=1}^{n} −(∂²/∂dj²) log pj(xi; dj) (6.29)
The off-diagonal entries are, similarly,

Vjk = (1/n) cov(uj, uk) (6.30)

= (1/n) cov(√n I(dj)^{−1} l′(x; dj), √n I(dk)^{−1} l′(x; dk)) (6.31)

= (1/n) cov(√n Vjj l′(x; dj), √n Vkk l′(x; dk)) (6.32)

= Vjj Vkk cov(l′(x; dj), l′(x; dk)) (6.33)

= n Vjj Vkk E(l′(x; dj) l′(x; dk)) (6.34)

using Vjj = I(dj)^{−1} from above, and estimating the expectation by the sample average

E(l′(x; dj) l′(x; dk)) = (1/n) Σ_{i=1}^{n} [(∂/∂dj) log pj(xi; dj)] [(∂/∂dk) log pk(xi; dk)] (6.35)
So, with a large number of sites, y is multivariate normal with the variance-covariance matrix V
described above (the sample average covariance).
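Equation (6.29) is straightforward to check numerically. The sketch below evaluates a diagonal entry for a single hypothetical pair under the Jukes-Cantor model, with the second derivatives taken by central finite differences; this is an illustrative choice, not the thesis's implementation:

```python
import math

def log_site_prob(match, d):
    """Log-probability of a matching/mismatching site pair under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    return math.log(p_same) if match else math.log((1.0 - p_same) / 3.0)

def v_jj(matches, d, h=1e-4):
    """1 / sum_i -(d^2/dd^2) log p(x_i; d), derivatives by central differences."""
    info = 0.0
    for match in matches:
        f0 = log_site_prob(match, d)
        fp = log_site_prob(match, d + h)
        fm = log_site_prob(match, d - h)
        info += -(fp - 2.0 * f0 + fm) / (h * h)
    return 1.0 / info

# 1000 sites, 365 of them mismatches: roughly what JC distance 0.5 predicts.
matches = [True] * 635 + [False] * 365
v = v_jj(matches, 0.5)
```

For this toy data the summed negative second derivatives come to roughly 1100, giving a variance estimate near 0.0009, in line with the order of magnitude of the 1000bp matrices in appendix C.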
6.2 Motivation for Simulation of GLS test statistic
There are many reasons to be enthusiastic about this approach. The GLS test statistic avoids some
of the computational burden of the previously described tests. The estimation and inversion of the
variance-covariance matrix is costly, but it only needs to be done once. In theory we know the
distribution of the test statistic under the null hypothesis, which means we do not have to try
to simulate it as in the parametric bootstrap.
The use of the χ2 distribution requires the observed distances to be multivariate normal and
the observed distances be derived from a large number of sites. However, it is unclear how
large is large. The estimation of the variance-covariance matrix also poses a potential problem.
Susko claims that if the method of estimation is consistent then the results will still hold. Again,
consistency is an asymptotic property while sequences are finite.
Susko’s sample average variance-covariance matrix and the traditional Jukes-Cantor esti-
mates have very different derivations. The latter is derived from the shared branch length
between the taxa at the leaves; that is, it depends on the tree explicitly and is based on maximum
likelihood estimation. The sample average covariance does not have a basis in maximum
likelihood estimation, so we may expect the covariance matrices to differ. However, generalized
estimating equation theory suggests that the sample average covariance should have the
same asymptotic properties as maximum likelihood estimation. So, the covariance estimates
for both of these approaches should converge.
Susko mentions these issues but does not provide any information on how these factors affect
the test’s effectiveness. The following simulations try to shed some light on these issues.
6.3 GLS Test Statistic Simulation Method
The aim was to simulate the distribution of gT under the null hypothesis and see if it was indeed
the appropriate χ2 distribution. The following general method was used:
1. An unrooted tree was generated using the program evolver from the PAML package of
programs [45]. This program assigns branch lengths according to the birth-death process
described by Rannala and Yang [33].
2. Around 700² sequence data sets of a specified length were generated from the tree using
the program seq-gen [32]. The Jukes-Cantor model was used to generate the sequences.
The procedure was the same as a parametric bootstrap of sequence data.
3. A distance matrix for each sequence data set replicate was estimated using DNAML from
the PHYLIP [14] package of programs.
4. Branch lengths were fitted to the generating (‘True’) tree topology with the GLS approach,
using the program treedist.

5. The GLS test statistic for each replicate was calculated as described previously.
Simulations were carried out on four and five leaf trees. Sequence lengths were 100, 1000, and
10000 nucleotides. Branch length fitting and GLS statistic calculations were done using Susko’s
sample covariance formulas, and the tree-derived covariance for the Jukes-Cantor model (see
Section 3.2.2).
The statistic for the ‘best’ tree for each data set was also recorded. Because I only dealt
with small trees, it was possible to fit every topology and then take the best tree to be the one
resulting in the smallest GLS test statistic.
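A toy version of steps 2–3 can be sketched as follows. The thesis pipeline used seq-gen and PHYLIP; this pure-Python stand-in only mimics the idea of simulating a sequence pair under Jukes-Cantor and re-estimating the distance from it:

```python
import math
import random

def evolve_pair(d, n_sites, rng):
    """Simulate two sequences separated by JC distance d; return the p-distance."""
    # Under Jukes-Cantor, a site differs with probability (3/4)(1 - e^{-4d/3}).
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    diffs = sum(1 for _ in range(n_sites) if rng.random() < p_diff)
    return diffs / n_sites

def jc_distance(p):
    """Invert the JC formula: d = -(3/4) ln(1 - (4/3) p)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

rng = random.Random(1)          # fixed seed for repeatability
p = evolve_pair(0.5, 10000, rng)
d_hat = jc_distance(p)          # should land close to the true distance 0.5
```

Repeating this for every pair of leaves (with the full path length between them as d) gives one replicate distance matrix; the GLS fit is then applied to each replicate.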
6.4 Results
Simulated GLS data was compared to χ2 data using a quantile-quantile plot. Graphs can be
found in appendix A.
Sample Average Covariance
The test statistic for the true tree appeared to follow the appropriate χ2 distribution. As
expected, the fit was much better for the longest sequences (10000 base pairs) and worst for
the shortest (100bp). The four leaf trees appeared to fit better than the five leaf trees. The data
set that caused the most trouble is shown in fig A.7 where large negative and positive values
were present in the residual data. This did not recur when data was generated later with the
same parameters. Some curving on the plots is visible in fig A.2 and remains unexplained.

²The magic number 700 arose because of the (unexpected) behaviour of the Unix split program. However,
work on parametric bootstrap techniques has suggested that 100 replicates can suffice to determine the
underlying distribution of a statistic [19]
The gT values for the best trees from 100bp sequences appeared to be lower than expected for a
χ2 distribution. However, the trees generated by longer sequences appeared to fit a χ2 distribution
quite well. This indicates that, in these cases, the best fitting tree was the generating tree.
Jukes Cantor Covariance
Data generated using the Jukes-Cantor covariance often did not appear to fit the expected χ2
distributions. This seemed to be more of an issue for the five leaf trees than the four leaf trees.
Surprisingly, the best five leaf tree residuals appeared to fit χ2(3) better than the true tree,
although these did not fit particularly well (but see fig B.11).
Differences in covariance matrices
To investigate the difference between the sample average and Jukes-Cantor covariance matrices, I
performed a simulation to estimate the true sample covariance on a four leaf tree with equal length
branches. 1000 sequence sets were generated from the tree and Jukes-Cantor distance estimates
were made for each data set. The sample covariance matrix for the six pairwise distances was
calculated using R’s inbuilt cov() function. The resulting covariance matrices can be found in
appendix C. Both the Jukes-Cantor and sample average covariances appear to converge to the
true covariance for this simple tree.
6.5 Discussion
These results indicate that the GLS test statistic generated under the null hypothesis with the
sample average covariance estimates has a χ2(k) distribution, as proposed at the beginning of
this chapter. This provides a valid test of whether a given tree T lies in the (1 − α) × 100%
confidence region for the true topology.
However, there were a number of unexplained phenomena associated with this experiment.
They are discussed below.
Negative GLS values
Negative GLS values were calculated in some simulations using sample average covariance. This
is of some concern! This means that the covariance matrix is not positive definite and the GLS
theory cannot be used.
It is not clear at the moment why this happens. One reason may be that the errors in
distance estimates are ‘damaging’ the covariance estimation. Another highly likely reason is
that small branches have high variance, so the inverted variance-covariance matrix becomes
inaccurate. More simulations with long branch lengths might help verify this.
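One cheap diagnostic for this failure mode is to attempt a Cholesky factorization of the estimated matrix before inverting it, since the factorization succeeds exactly when a symmetric matrix is positive definite. A pure-Python sketch, with made-up matrices:

```python
def is_positive_definite(V):
    """Attempt the Cholesky factorization V = L L^T; failure means V is
    not positive definite (V is assumed symmetric)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = V[i][i] - s
                if d <= 0.0:
                    return False       # a non-positive pivot: not PD
                L[i][i] = d ** 0.5
            else:
                L[i][j] = (V[i][j] - s) / L[j][j]
    return True
```

For example, the symmetric matrix [[1, 2], [2, 1]] has a negative eigenvalue and fails the check, so a GLS statistic computed from it could come out negative.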
Covariance Matrices
The JC covariance matrix appears reasonably similar to the sample covariance. There also
does seem to be some convergence of the sample average covariance to the tree-derived sample
covariance and JC covariance when the number of sites is large (10000bp). Conversely, the
difference is greatest for short sequences (100bp). This is expected from the convergence
properties of generalized estimating equations and maximum likelihood estimation.
This seems to imply that once sequences reach a certain length we should be able to
use the estimated Jukes-Cantor covariance in place of the sample average covariance. This
appears plausible for the four leaf tree, but fig B.8 and fig B.9 certainly imply otherwise.
Best Tree, True Tree
We may expect the relationship of the best tree distributions to those of the true tree to
shed some light on the above. Residual values from the best tree will always be less than or
equal to those for the true tree. One example of why the best and true tree might differ is
that generalized least squares coupled with the minimum evolution criterion is not consistent [18]
(although the conditions for this have not been established). From this one might expect residual
values to be systematically less than those of the associated χ2 distribution. Some of the graphs
do show this (for example fig A.10), which indicates that in those particular simulations the best
tree selected was often not the true tree.
However, that does not explain the marked differences in distribution seen in B.8 and
B.9 but the reasonable fit in B.11 and B.12.
It can be expected that, like the tests described in Chapter 5, this test will be used to test
if the best tree returned by some method is the true tree. The data collected on four and five
leaf trees appears to imply that the best tree is often the true tree.
However, it is unlikely that this result will scale up well to larger trees. One reason being
that there are many more possible tree topologies as the number of taxa being examined
increases, so it becomes more likely that the best tree is not the true one. It is well known
that results derived from small trees do not necessarily apply to very large trees. The results
presented here need to be considered with this in mind.
6.6 Contribution
My contributions include the design and implementation of the GLS test statistic computer
simulation. I have also provided an expanded proof of the theorem proposed by Susko that the
observed distances between taxa are asymptotically normal, for a large number of sites, when
maximum likelihood estimates are used.
The development of the computer simulation involved extending the C program treedist
originally written by Graham Byrnes. I added support for calculation of the sample average
covariance for the Jukes-Cantor model and calculation of the GLS test statistic. This was done
for both the sample average covariance and the tree-derived Jukes-Cantor covariance.
I wrote several Unix shell scripts to interface treedist with the programs that generated
the input data (evolver and seq-gen). Source code for these and treedist are available on
request.
The expansion of the proof of Susko’s theorem provided more detail into how to calculate
the sample average covariance.
6.7 Further Work
Further work into the relationship between the tree-derived covariance and the sample average
covariance is needed, as are more conclusive results explaining in which cases the sample average
covariance matrix can fail to be positive definite.
Susko notes that this approach relies on model-derived distances. If the model is incorrect
then it is unclear how much confidence we can put in the testing procedure. Simulations that
reveal how the test reacts to model misspecification would be helpful.
Chapter 7
Conclusion
My results show that the GLS test statistic proposed by Susko can be applied to short sequences,
but the assumptions behind the test may be broken. The applicability to shorter sequences seems
vulnerable to errors resulting in large negative outliers. There is also the possibility that covariance
matrices that are not positive definite can result from the proposed sample average
estimation method.
The GLS test statistic seems like a reasonable choice for long sequences because of its
computational advantages over other testing methods such as bootstrap and Bayesian MCMC.
However, investigation still needs to be carried out to test how well this method responds to
model misspecification.
Simulations derived from real data, where the number of taxa is large, are also necessary.
Appendix A
GLS Results: Sample Average
Covariance
The following graphs are quantile-quantile plots of the GLS test statistic, calculated using Susko’s
sample average covariance, versus a χ2 distribution with the degrees of freedom described in Chapter 6.
In graphs captioned True tree, the GLS test statistic was calculated for the tree that generated
the sequence data only. In graphs captioned Best tree, the GLS test statistic was calculated
for all possible four or five leaf topologies (depending on the generating tree) but only the lowest
test statistic value was recorded.
A.1 Four Leaf Trees
Figure A.1: True tree, Susko Covariance, 4, 100 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)
Figure A.2: True tree, Susko Covariance, 4, 1000 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)

Figure A.3: True tree, Susko Covariance, 4, 10000 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)
Figure A.4: Best tree, Susko Covariance, 4, 100 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure A.5: Best tree, Susko Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure A.6: Best tree, Susko Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure A.7: True tree, Susko Covariance, 5, 100 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
A.2 Five Leaf Trees
Figure A.8: True tree, Susko Covariance, 5, 1000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)

Figure A.9: True tree, Susko Covariance, 5, 10000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
Figure A.10: Best tree, Susko Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure A.11: Best tree, Susko Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Figure A.12: Best tree, Susko Covariance, 5, 10000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Appendix B
GLS results: JC Covariance
B.1 Four Leaf Trees
Figure B.1: True tree, JC Covariance, 4, 100 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)

Figure B.2: True tree, JC Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.3: True tree, JC Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure B.4: Best tree, JC Covariance, 4, 100 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.5: Best tree, JC Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure B.6: Best tree, JC Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.7: True tree, JC Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
B.2 Five Leaf Trees
Figure B.8: True tree, JC Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure B.9: True tree, JC Covariance, 5, 10000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
Figure B.10: Best tree, JC Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure B.11: Best tree, JC Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Figure B.12: Best tree, JC Covariance, 5, 10000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Appendix C
Covariance Estimation
This appendix contains covariance matrices for a four leaf tree with five branches, each of length
0.5.
The sample covariance results refer to the sample covariance calculated by the statistical
program R after calculating pairwise distance estimates from 1000 sequence data sets using the
Jukes-Cantor model.
The Jukes-Cantor covariance refers to the theoretical covariance estimate derived from the
Jukes-Cantor model. The sample average covariance refers to the covariance matrix derived in [40].
C.1 Sample Covariance Results
C.1.1 Sample Covariance - 100bp
0.154685911 0.030742862 0.0023922773 0.0146437905 0.021297064 0.007273003
0.030742862 0.127398955 0.0123768091 0.0138471254 0.009112506 0.026076178
0.002392277 0.012376809 0.0569422201 0.0002700298 0.013977221 0.018039598
0.014643791 0.013847125 0.0002700298 0.0471452913 0.026704351 0.006550489
0.021297064 0.009112506 0.0139772209 0.0267043514 0.166695501 0.033352401
0.007273003 0.026076178 0.0180395981 0.0065504892 0.033352401 0.139944026
C.1.2 Sample Covariance - 1000bp
0.0138869819 0.0040845656 0.0011769553 0.0009608333 0.0037507856 0.0009980025
0.0040845656 0.0129057929 0.0005606785 0.0007581601 0.0012195869 0.0036283243
0.0011769553 0.0005606785 0.0037213281 -0.0001051145 0.0008251945 0.0005083123
0.0009608333 0.0007581601 -0.0001051145 0.0037063417 0.0008871719 0.0008775054
0.0037507856 0.0012195869 0.0008251945 0.0008871719 0.0125837613 0.0029320889
0.0009980025 0.0036283243 0.0005083123 0.0008775054 0.0029320889 0.0128444270
C.1.3 Sample Covariance - 10000bp
1.286153e-03 3.794904e-04 1.109185e-04 7.172576e-05 3.764319e-04 9.181898e-05
3.794904e-04 1.361460e-03 1.067608e-04 6.434031e-05 9.747156e-05 4.077580e-04
1.109185e-04 1.067608e-04 3.781283e-04 1.612486e-06 9.272264e-05 9.613416e-05
7.172576e-05 6.434031e-05 1.612486e-06 3.485316e-04 9.078303e-05 8.487179e-05
3.764319e-04 9.747156e-05 9.272264e-05 9.078303e-05 1.279326e-03 3.283484e-04
9.181898e-05 4.077580e-04 9.613416e-05 8.487179e-05 3.283484e-04 1.265110e-03
C.2 Jukes-Cantor Covariance
C.2.1 Jukes-Cantor Covariance - 100bp
0.471919 0.126338 0.013920 0.021597 0.087623 0.017458
0.126338 0.240716 0.004129 0.021597 0.017458 0.037730
0.013920 0.004129 0.030822 0.000000 0.013920 0.004129
0.021597 0.021597 0.000000 0.025934 0.000759 0.000759
0.087623 0.017458 0.013920 0.000759 0.103009 0.021089
0.017458 0.037730 0.004129 0.000759 0.021089 0.044828
C.2.2 Jukes-Cantor Covariance - 1000bp
0.014586 0.003595 0.001074 0.000790 0.004576 0.000989
0.003595 0.012295 0.000862 0.000790 0.000989 0.003842
0.001074 0.000862 0.004108 0.000000 0.001074 0.000862
0.000790 0.000790 0.000000 0.003131 0.000831 0.000831
0.004576 0.000989 0.001074 0.000831 0.015133 0.003734
0.000989 0.003842 0.000862 0.000831 0.003734 0.012756
C.2.3 Jukes-Cantor Covariance - 10000bp
0.001238 0.000353 0.000088 0.000088 0.000355 0.000087
0.000353 0.001195 0.000084 0.000088 0.000087 0.000342
0.000088 0.000084 0.000345 0.000000 0.000088 0.000084
0.000088 0.000088 0.000000 0.000346 0.000085 0.000085
0.000355 0.000087 0.000088 0.000085 0.001208 0.000344
0.000087 0.000342 0.000084 0.000085 0.000344 0.001167
C.3 Sample Average Covariance(Susko)
C.3.1 Sample Average Covariance(Susko) - 100bp
2.772195 0.284422 0.061343 0.037179 0.160589 0.044012
0.284422 0.194301 0.006093 0.012716 0.013550 0.025393
0.061343 0.006093 0.031681 -0.001258 0.015681 0.005798
0.037179 0.012716 -0.001258 0.026541 -0.002846 -0.000230
0.160589 0.013550 0.015681 -0.002846 0.107110 0.023703
0.044012 0.025393 0.005798 -0.000230 0.023703 0.047412
C.3.2 Sample Average Covariance(Susko) - 1000bp
0.012331 0.003691 0.000952 0.000620 0.004649 0.000714
0.003691 0.014367 0.001146 0.001033 0.001152 0.003982
0.000952 0.001146 0.004123 -0.000038 0.001208 0.000597
0.000620 0.001033 -0.000038 0.003140 0.001071 0.000740
0.004649 0.001152 0.001208 0.001071 0.018552 0.003630
0.000714 0.003982 0.000597 0.000740 0.003630 0.011086
C.3.3 Sample Average Covariance(Susko) - 10000bp
0.001224 0.000357 0.000089 0.000080 0.000342 0.000095
0.000357 0.001211 0.000087 0.000090 0.000100 0.000344
0.000089 0.000087 0.000345 -0.000003 0.000087 0.000080
0.000080 0.000090 -0.000003 0.000346 0.000080 0.000085
0.000342 0.000100 0.000087 0.000080 0.001224 0.000339
0.000095 0.000344 0.000080 0.000085 0.000339 0.001154
Bibliography
[1] Michael E. Alfaro, Stefan Zoller, and Francois Lutzoni. Bayes or Bootstrap? A Simulation
Study Comparing the Performance of Bayesian Markov Chain Monte Carlo Sampling and
Bootstrapping in Assessing Phylogenetic Confidence. Mol. Biol. Evol, 20(2):255–266, 2003.
[2] Daniel Barry and J.A. Hartigan. Asynchronous Distance between Homologous DNA
Sequences. Biometrics, 43:261–276, 1987.
[3] L. Billera, S. Holmes, and K. Vogtmann. The Geometry of Tree Space. Adv. Appl. Math,
27, 2001.
[4] W. Bruno, N. Socci, and A. Halpern. Weighted Neighbor Joining: A Likelihood-Based
Approach to Distance-Based Phylogeny Reconstruction. Mol. Biol. Evol, 17(1):189–197,
2000.
[5] David Bryant and Peter Waddell. Rapid Evaluation of Least-Squares and Minimum-
Evolution Criteria on Phylogenetic Trees. Mol. Biol. Evol, 15(10):1346–1359, 1998.
[6] T. J. Buckley. Model misspecification and probabilistic tests of topology: evidence from
empirical data sets. Syst. Biol, 51(3):509–523, 2002.
[7] Michael Bulmer. Use of the Method of Generalized Least Squares in Reconstructing Phy-
logenies from Sequence Data. Mol. Biol. Evol, 8(6):868–883, 1991.
[8] Joseph T. Chang. Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability
and Consistency. Mathematical Biosciences, 137:51–73, 1996.
[9] D.R. Cox and H.D. Miller. The Theory of Stochastic Processes, chapter 4. Methuen and
Co, London, 1965.
[10] Christophe J. Douady, F. Delsuc, Yan Boucher, W. Ford Doolittle, and Emmanuel J. P.
Douzery. Comparison of Bayesian and Maximum Likelihood Bootstrap Measures of Phy-
logenetic Reliability. Mol. Biol. Evol, 20(2):248–254, 2003.
[11] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence
analysis: Probabilistic models of proteins and nucleic acids, chapter 7. Cambridge Univer-
sity Press, Cambridge, UK, 1998.
[12] B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees.
Proceedings of the National Academy of Sciences of the USA, 93:13429–13434, 1996.
[13] Warren J. Ewens and Gregory R. Grant. Statistical Methods in Bioinformatics: An Intro-
duction, chapter 3. Springer-Verlag, New York, 2001.
[14] J. Felsenstein. PHYLIP (Phylogeny Inference Package), 2002.
[15] Joseph Felsenstein. Evolutionary Trees from DNA Sequences: A Maximum Likelihood
Approach. J Mol Evol, 17:368–376, 1981.
[16] Joseph Felsenstein and Hirohisa Kishino. Is There Something Wrong with the Bootstrap
on Phylogenies? A Reply to Hills and Bull. Systematic Biology, 42(2):193–200, 1993.
[17] Olivier Gascuel. BIONJ: An Improved Version of the NJ Algorithm Based on a Simple
Model of Sequence Data. Mol. Biol. Evol, 14(7):685–695, 1997.
[18] Olivier Gascuel, David Bryant, and Francois Denis. Strengths and Limitations of the
Minimum-Evolution Principle. Syst. Biol, 50(5):621–627, 2001.
[19] Nick Goldman, Jon P. Anderson, and Allen G. Rodrigo. Likelihood-Based tests of topolo-
gies in phylogenetics. Systematic Biology, 49(4):652–670, 2000.
[20] M. Hasegawa, H. Kishino, and T. Yano. Dating the human-ape splitting by a molecular
clock of mitochondrial DNA. J. Mol. Evol, 22:160–174, 1985.
[21] Steven Henikoff and Jorja G. Henikoff. Amino Acid Substitution Models from Protein
Blocks. Proc. Natl. Acad. Sci. USA, 89:10915–10919, 1992.
[22] D.M. Hillis and J. Bull. An Empirical Test of Bootstrapping as a Method for Assessing
Confidence in Phylogenetic Analysis. Syst. Biol, 42(2):182–192, 1993.
[23] Susan Holmes. Phylogenies: An Overview. In: Halloran, E., Geiser, S.(Eds), Statistics in
Genetics, IMA, Vol 81. Springer-Verlag, New York, 1999.
[24] Susan Holmes. Statistics for phylogenetic trees. Theoretical Population Biology, 63:17–32,
2003.
[25] John P. Huelsenbeck and Keith A. Crandall. Phylogeny Estimation and Hypothesis Testing
using Maximum Likelihood. Ann. Rev. Ecol. Syst., 28:437–466, 1997.
[26] J.P. Huelsenbeck and J.J. Bull. A Likelihood Ratio Test to Detect Conflicting Phylogenetic
Signal. Syst. Biol., 45:92–98, 1996.
[27] J.P. Huelsenbeck, D.M. Hillis, and R. Nielsen. A Likelihood Ratio Test of Monophyly.
Syst. Biol., 45:546–558, 1996.
[28] Keith Knight. Mathematical Statistics, chapter 5. Chapman & Hall/CRC Press, Chicago,
2000.
[29] E.L. Lehmann. Theory of Point Estimation, chapter 4. Wiley, New York, 1983.
[30] Pietro Liò and Nick Goldman. Models of Molecular Evolution and Phylogeny. Genome
Research, 8:1233–1244, 1998.
[31] P.J. Lockhart, M. Steel, M. Hendy, and D. Penny. Recovering Evolutionary Trees under a
More Realistic Model of Sequence Evolution. Mol. Biol. Evol., 11(4):605–612, 1994.
[32] A. Rambaut and N.C. Grassly. Seq-Gen: an application for the Monte Carlo simulation
of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci., 13:235–238,
1997.
[33] Bruce Rannala and Ziheng Yang. Probability Distribution of Molecular Evolutionary Trees:
A New Method of Phylogenetic Inference. J. Mol. Evol., 43:304–311, 1996.
[34] Alvin C. Rencher. Linear Models in Statistics, chapter 7. Wiley-Interscience, New York,
2000.
[35] Andre Rzhetsky and Masatoshi Nei. Theoretical Foundation of the Minimum-Evolution
Method of Phylogenetic Inference. Mol. Biol. Evol., 10(5):1073–1095, 1993.
[36] Naruya Saitou and Masatoshi Nei. The Neighbor-joining Method: A New Method for
Reconstructing Phylogenetic Trees. Mol. Biol. Evol., 4(4):406–425, 1987.
[37] H. Shimodaira and M. Hasegawa. Multiple comparisons of log-likelihoods with applications
to phylogenetic tree selection. Mol. Biol. Evol., 16:1114–1116, 1999.
[38] Mike Steel and David Penny. Parsimony, Likelihood, and the Role of Models in Molecular
Phylogenetics. Mol. Biol. Evol., 17(6):839–850, 2000.
[39] K. Strimmer and V. Moulton. Likelihood Analysis of Phylogenetic Networks Using Directed
Graph Methods. Mol. Biol. Evol., 17:875–881, 2000.
[40] Edward Susko. Confidence Regions and Hypothesis Tests for Topologies Using Generalized
Least Squares. Mol. Biol. Evol., 20(6):862–868, 2003.
[41] D. Swofford, G. Olsen, P.J. Waddell, and D.M. Hillis. Molecular Systematics, chapter
"Phylogenetic Inference". Sinauer Associates, Massachusetts, 1996.
[42] C. Tuffley and M.A. Steel. Modelling the covarion hypothesis of nucleotide substitution.
Mathematical Biosciences, 147:63–91, 1997.
[43] Peter H. Westfall and S. Stanley Young. Resampling-Based Multiple Testing: Examples
and Methods for p-Value Adjustment, chapter 2. Wiley Interscience, New York, USA, 1993.
[44] Simon Whelan and Nick Goldman. Distributions of Statistics Used for the Comparison of
Models of Sequence Evolution in Phylogenetics. Mol. Biol. Evol., 16(9):1292–1299, 1999.
[45] Ziheng Yang. PAML: a program for phylogenetic analysis by maximum likelihood.
CABIOS, 13:555–556, 1997.