Phylogenetic Inference and Hypothesis Testing
Catherine Lai (92720)
BSc(Hons) Department of Mathematics and Statistics
University of Melbourne
November 13, 2003
Contents
1 Introduction 4
2 Molecular Phylogenetics 5
2.1 The Use of Phylogenetic Trees . . . 5
2.2 Traditional Approaches . . . 6
2.3 Phylogenetic Trees From Genomic Data . . . 6
2.4 What about the root? . . . 7
2.5 How Treelike is Evolution? . . . 8
3 Models of Evolution 9
3.0.1 A Simple Approach . . . 9
3.0.2 Evolution as a stochastic process . . . 10
3.1 Markov Models of Evolution . . . 10
3.1.1 Markov Theory . . . 10
3.1.2 Markov Models of Site Substitution . . . 11
3.2 Parameterized Models of Nucleotide Evolution . . . 12
3.2.1 Jukes-Cantor Model . . . 13
3.2.2 Jukes-Cantor Variance . . . 15
3.2.3 Generalisations of the Jukes-Cantor Model . . . 16
3.3 Problems with Markov Models of Evolution . . . 17
3.4 Modelling Rate Heterogeneity . . . 18
3.5 Modelling Non-Stationarity . . . 18
3.5.1 Summary of Nucleotide Markov Models . . . 20
3.6 Empirical Models of amino acid evolution . . . 21
3.6.1 PAM/Dayhoff Substitution Matrices . . . 21
3.6.2 BLOSUM . . . 23
3.7 Differences in PAM and BLOSUM . . . 24
4 Phylogenetic Tree Reconstruction Methods 25
4.1 Evaluating Reconstruction Methods . . . 25
4.1.1 Complexity . . . 25
4.1.2 Accuracy . . . 26
4.1.3 Consistency . . . 26
4.1.4 Efficiency . . . 26
4.1.5 Robustness . . . 27
4.1.6 Usability in tests . . . 27
4.2 Parsimony . . . 27
4.3 Maximum Likelihood . . . 28
4.4 Is MP the same as ML? . . . 30
4.5 Distance Based Methods . . . 31
4.5.1 Unweighted pair group method using arithmetic averages (UPGMA) . . . 31
4.5.2 The Molecular Clock Hypothesis . . . 32
4.5.3 Long Branch Attraction . . . 32
4.5.4 Neighbour Joining . . . 33
4.5.5 BIONJ . . . 34
4.5.6 Weighbor . . . 35
4.5.7 NJ and the minimum evolution method . . . 36
4.6 Least Squares . . . 36
4.6.1 Estimating Branch lengths . . . 36
4.6.2 Minimum Evolution Method with Least Squares . . . 37
4.7 Bayesian Tree Reconstruction . . . 38
4.8 Trees from Alignments, Alignments from Trees . . . 39
5 Phylogenetic Hypothesis Tests 40
5.1 Confidence Regions of Phylogenetic Trees . . . 40
5.2 The Bootstrap . . . 41
5.3 The Non-parametric Bootstrap . . . 42
5.3.1 Testing Phylogenies using the Non-parametric Bootstrap . . . 43
5.3.2 How well does it work? . . . 44
5.4 The Parametric Bootstrap . . . 45
5.4.1 Problems with the Parametric Bootstrap . . . 46
5.5 Bootstrap Based Tests . . . 46
5.5.1 Centering . . . 47
5.5.2 The Kishino Hasegawa Test . . . 47
5.5.3 The Shimodaira Hasegawa Test . . . 48
5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH) . . . 50
5.6 Bayesian methods . . . 51
5.7 Bootstraps and Posterior Probabilities . . . 51
5.8 Which Test? . . . 52
6 Generalized Least Squares in Phylogenetic Hypothesis Testing 54
6.1 Sample Average Variance and Covariance . . . 55
6.2 Motivation for Simulation of GLS test statistic . . . 58
6.3 GLS Test Statistic Simulation Method . . . 58
6.4 Results . . . 59
6.5 Discussion . . . 60
6.6 Contribution . . . 62
6.7 Further Work . . . 62
7 Conclusion 63
A GLS Results: Sample Average Covariance 64
A.1 Four Leaf Trees . . . 64
A.2 Five Leaf Trees . . . 69
B GLS Results: JC Covariance 73
B.1 Four Leaf Trees . . . 73
B.2 Five Leaf Trees . . . 77
C Covariance Estimation 81
C.1 Sample Covariance Results . . . 81
C.1.1 Sample Covariance - 100bp . . . 81
C.1.2 Sample Covariance - 1000bp . . . 82
C.1.3 Sample Covariance - 10000bp . . . 82
C.2 Jukes-Cantor Covariance . . . 83
C.2.1 Jukes-Cantor Covariance - 100bp . . . 83
C.2.2 Jukes-Cantor Covariance - 1000bp . . . 83
C.2.3 Jukes-Cantor Covariance - 10000bp . . . 83
C.3 Sample Average Covariance (Susko) . . . 84
C.3.1 Sample Average Covariance (Susko) - 100bp . . . 84
C.3.2 Sample Average Covariance (Susko) - 1000bp . . . 84
C.3.3 Sample Average Covariance (Susko) - 10000bp . . . 84
Chapter 1
Introduction
Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on earth.
The aim is to understand relationships between species and through this the process of evolution
itself. These relationships can be represented with a graph structure - traditionally simplified to
evolutionary trees. The current approach is to try to reconstruct these trees from the blueprint
of life: DNA sequences.
Reconstruction methods are difficult to design and evaluate because the biological evidence
is often ambiguous. Many approaches have been introduced to deal with the problems of estima-
tion and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary
knowledge we have of evolution while non-parametric approaches have been developed to avoid
the possibility of inaccurate preconceptions.
Recently, Susko[40] presented an approach that applies the theory of generalized least squares
to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical
foundations in the theory of linear models. While the theory appears to be sound it is based on
asymptotic results with regard to sequence length. It is not clear how well the test will perform
in practice where the length of sequences is often only a few hundred nucleotides. I investigate
the effect of sequence length on this approach. I also consider how Susko's approach differs
from traditional parametric techniques with respect to variance-covariance estimation.
In Chapter 2 I will give a general background to the problems involved in molecular
phylogenetics. In Chapters 3 and 4 I review commonly used probabilistic models and tree
reconstruction methods, respectively. In Chapter 5 I consider methods of evaluating
confidence in results from such reconstruction methods when there may be conflict. This leads
to an examination of current hypothesis testing methods and consideration of the validity of
the generalized least squares approach in Chapter 6.
Chapter 2
Molecular Phylogenetics
Phylogenetic trees represent relationships between species. They tell the story of life on earth.
A phylogenetic tree is a tree in the graph sense. External vertices (nodes) represent extant
species while internal nodes represent speciation events. The tree topology determines the
lines of evolution - which species descended from which common ancestors. The branch (edge)
lengths represent time since speciation of adjacent nodes. In reality, absolute time scales cannot
be used and relative time scales are employed. These time scales depend on the data used to
infer the tree. For example, if we use genome sequence data, a scale is the expected number of
substitutions that have taken place at a site.
2.1 The Use of Phylogenetic Trees
The role of phylogenetics is to help us understand the process of evolution from the patterns in
nature we can observe in the present. As Huelsenbeck describes [25]:
[F]or any question in which history may be a confounding factor, phylogenies
have a central role.
The most obvious use of phylogenetics is inferring common ancestors. This has implications
for our understanding of evolution on a large scale. It can also help with more immediate
problems, for example in epidemiology and understanding the spread of viruses. Viruses such
as hepatitis C have a long dormant period, meaning we can only detect the spread of the virus
as it happened in the past.
Understanding these processes and relationships can help in the development of biotechnology,
for example by improving the design of drugs to account for host-pathogen mutual genetic
variation. Phylogenies also shed light on structural biology, helping us to
infer the function and functional constraints of genes [30].
2.2 Traditional Approaches
Historically phylogenetic trees have been constructed using two principles. The phenetic ap-
proach uses similarity scores derived from measures of physical characteristics. The most similar
species are clustered together. While this is an intuitive approach, results may not represent
genetic or evolutionary similarity.
The cladistic approach assumes that related species will share unique features that were not
present in distant ancestors. All species in a group must share a common ancestor. This means
that species with many similar physical traits may not be grouped together.
Both of the above approaches rely heavily on morphological and geographical data. However,
this has changed with our understanding of the role of DNA in evolution. DNA sequencing
has massively increased information about the evolutionary process. Most tree reconstruction
methods now focus on examining the way DNA (or amino acid) sequences have evolved.
2.3 Phylogenetic Trees From Genomic Data
In molecular phylogenetics we search for patterns in genomic data. What we find is that
evolution is a stochastic process. Mutations arise from changes to a species' genome: that is,
site substitutions, deletions, insertions and inversions. If we understand the process of mutation,
we can make reasonable inferences about our past from the genomic material we have now.
Sequences that are very different are likely to be less closely related than sequences showing
high similarity.
Determining similarity is not an easy problem. Substitutions may be hidden from view by a
number of factors: a site may change and then change back again (a reversal); more than one
mutation may occur at the same site; or parallel changes may occur on different branches of the
tree (convergence or parallelism).
With that in mind, there are three major components to phylogenetic inference that need
to be considered.
Probabilistic Models in Phylogenetics
We need to consider what role probabilistic models can play to help our understanding of the
problem. Typically, it is assumed that mutations occurring at time t depend on the sequence at
that time but not on its previous history. This suggests a Markov model of sequence evolution.
However, whether or not the traditional assumptions such as homogeneity and reversibility are
valid is less clear. The role of probabilistic models in phylogenetics is discussed in Chapter 3.
Reconstructing Phylogenetic trees from Sequence Data
We need to understand what methods can be employed to actually reconstruct a phylogenetic
tree. Within this there are three distinct problems to consider: choosing a criterion, estimating
the tree topology, and estimating branch lengths. Besides this, there has been a long-standing
feud among biologists (and, more recently, mathematicians) over whether parametric or non-
parametric methods, or something in between, should be used. A review of commonly used tree
reconstruction methods is contained in Chapter 4.
Hypothesis Tests of Phylogenetic Trees
Once a phylogenetic tree is decided upon, the next step is usually to try to identify which parts
are well supported by the data. Hypothesis tests must be used with an understanding of what
they actually test and what information they can provide the user. To complicate matters, the
hypothesis being tested is often a tree topology, and it is still unclear how the usual statistical
measures, such as variance, can be applied to such a structure.
The debate between parametric and non-parametric testing continues. The non-parametric
bootstrap has been used extensively in phylogenetics to provide a measure of confidence. However,
the parametric bootstrap and Bayesian methods are also becoming popular. All
have their advantages and disadvantages.
In any case, inferred trees need to be compared to traditional biological data (e.g. morphological).
No matter how well a method works for simulated data, the aim of the game is to
understand the process in reality. These issues are considered in greater detail in Chapter 5
and (with respect to a generalized least squares test) Chapter 6.
2.4 What about the root?
In theory phylogenetic trees should be rooted to represent descent from the common ancestor.
However, as we attempt to reconstruct phylogenetic trees, we have to consider all possible
positions of a root with respect to the other nodes of the tree. Unfortunately it is unlikely that
sequences will contain enough information to accurately place a root.
However, there are some methods for rooting a tree. These include adding data from very
distantly related species (outgroups), or the use of the molecular clock hypothesis. The use of
outgroups brings its own bundle of problems, as error in the distantly related species affects the
more closely related species. This, and the validity of the molecular clock hypothesis, is discussed
in section 4.5.2.
2.5 How Treelike is Evolution?
It is worth asking the question of whether trees are really the right structure to describe evolu-
tion. Often there is not enough information in the data to resolve the issue of parallel mutations,
while gene transfer from species to species means that a gene can have more than one ancestor.
There is a criterion to determine whether data are treelike: the four point condition. For any taxa
i, j, k, l in a phylogenetic tree with four or more taxa we have:

d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl}, d_{jk} + d_{il})    (2.1)

Otherwise, other structures may be more valid. For example, see Strimmer and Moulton's
work on phylogenetic networks [39].
Chapter 3
Models of Evolution
Most widely used tree reconstruction methods use probabilistic models of evolution. These can
be formulated parametrically, using known (or assumed) properties of sequence evolution. They
can also be derived empirically from information in the observed sequences.
It makes sense to use whatever knowledge we have about the process of evolution rather
than ignore it. On the other hand, evolution is very complex and biological evidence is often
ambiguous. An example of a factor that needs to be taken into account, but is very hard to
model, is variation in rates of evolution between and within lineages.
How well a model fits reality can affect how a testing method works. Simpler models have
greater power to discriminate but may be biased. So understanding models is necessary to
understanding both tree reconstruction and confidence testing [19].
3.0.1 A Simple Approach
Evolutionary distances represent the divergence between species. That is, branch lengths on a
phylogenetic tree. The following naive approach to determining distance shows why a proba-
bilistic model is desirable.
When determining distances between sequences it is intuitive to use a measure of
dissimilarity: take the distance between two sequences to be the Hamming distance.
That is, for two species x and y, with sequence length S, we have:
D_{xy} = \sum_{i=1}^{S} \mathbf{1}(x_i \neq y_i)    (3.1)
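As a minimal illustration of (3.1), the count of mismatched sites can be computed directly (the function name is my own):

```python
def hamming_distance(x, y):
    """Number of sites at which two aligned, equal-length sequences differ (3.1)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(xi != yi for xi, yi in zip(x, y))
```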
However, this approach does not take into account the phenomena described in Section 2.3, such as
reversals and parallelism. This means the observed number of substitutions in a given sequence
is a lower bound for the actual number of substitutions that have occurred. This basically
means we cannot accurately look very far back in time.
We need models that estimate substitution rates that correct for unseen events. An obvious
first step is to try to define evolution as some type of stochastic process.
3.0.2 Evolution as a stochastic process
Definition 3.0.1 A stochastic process is a collection of random variables \{X(t)\}_{t \in T} with a
common probability space.

We can think of the process of evolution as the stochastic process of substitutions in a
sequence. The set of states is the nucleotides (or amino acids). That is, for a nucleotide
model, X(t) \in \{A, C, G, T\} is the nucleotide at that site at time t. 1
To develop a tractable model for evolution we need to make further simplifying assumptions.
This leads us to the well studied world of Markov processes.
3.1 Markov Models of Evolution
3.1.1 Markov Theory
Definition 3.1.1 A stochastic process has the Markov property if the probability of observing
a new state at time s + t only depends on the state at time s. That is,
P(X(s + t) = j \mid X(s) = i_s, \ldots, X(0) = i_0) = P(X(s + t) = j \mid X(s) = i_s)    (3.2)
A Markov process is a stochastic process with the Markov property. The Markov property
is also referred to as the memoryless property of Markov processes. If t, s ∈ Z then we have a
discrete time Markov process. Similarly if t, s ∈ R we have a continuous time Markov process.
To model evolution we want the latter since evolution is happening in continuous time.
Definition 3.1.2 The transition probability, P_{ij}(s, t), is the probability of changing from state
i at time s to state j at time s + t.

For a Markov process we have:

P_{ij}(s, t) = P(X(s + t) = j \mid X(s) = i)    (3.3)
1 When deriving species trees this means the nucleotide at a site at time t expressed in the majority of the population.
If P_{ij}(s, t) above is independent of s then the process is homogeneous. That is,

P_{ij}(s, t) = P_{ij}(t)    (3.4)
We can collect these probabilities in a transition matrix \mathbf{P}(t). The transition probabilities obey
the Chapman-Kolmogorov equation:

P_{ik}(t) = \sum_j P_{ij}(v) P_{jk}(t - v)    (3.5)

In matrix form:

\mathbf{P}(t) = \mathbf{P}(v)\mathbf{P}(t - v)    (3.6)

This is equivalent to

\mathbf{P}(t + v) = \mathbf{P}(t)\mathbf{P}(v)    (3.7)

with initial condition

\mathbf{P}(0) = \mathbf{I}    (3.8)

From this, we can extrapolate the transition probabilities at time t as

\mathbf{P}(t) = [\mathbf{P}(1)]^t    (3.9)
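Equations (3.6)-(3.9) are easy to verify numerically. The sketch below uses a hypothetical two-state transition matrix (assuming numpy):

```python
import numpy as np

# A hypothetical two-state transition matrix P(1); each row sums to one.
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])

P2 = P1 @ P1                        # P(2) = P(1)P(1), as in (3.7)
P3 = np.linalg.matrix_power(P1, 3)  # P(3) = [P(1)]^3, as in (3.9)

assert np.allclose(P3, P2 @ P1)                               # (3.6) with t = 3, v = 1
assert np.allclose(np.linalg.matrix_power(P1, 0), np.eye(2))  # P(0) = I, (3.8)
assert np.allclose(P3.sum(axis=1), 1.0)                       # rows still sum to one
```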
3.1.2 Markov Models of Site Substitution
A discrete state continuous time Markov process of site substitution can be formulated as
follows. We define the transition probabilities as the probability of substitution of a nucleotide
or amino acid. We also assume that the process is time-homogeneous. That is, the rate of
substitution is independent of time and the process is the same throughout the whole tree.
Markov chain models of site substitution are usually further constrained by other properties
such as stationarity and reversibility. The assumption of stationarity is that the process is in
equilibrium. This effectively assumes that nucleotide frequencies are (approximately) the
same from species to species throughout time. Reversibility means that we do not distinguish the
process from the process in reverse: we treat the process running from an ancestor species to a
descendant and vice versa as the same. That is, evolution does not have a 'direction'.
This is in fact a model of the evolution of a sequence along a branch of a phylogenetic tree. Usually
the process of substitution is assumed to be Poisson.
The validity of these assumptions is discussed in Section 3.3. Before this, I examine how transition
matrices for Markov models of evolution can be derived parametrically.
3.2 Parameterized Models of Nucleotide Evolution
We can derive the transition matrix \mathbf{P}(t) by estimating a rate matrix \mathbf{Q}, where Q_{ij} is the rate
at which i changes to j in a very small time step \delta t:

P_{ij}(\delta t) = Q_{ij}\,\delta t + o(\delta t)    (3.10)

In fact, if \sum_j P_{ij} = 1 for all states i (that is, the process is honest), then \mathbf{Q} defines the process
(hence the transition matrix) uniquely [9].
Let \mathbf{P}(1) = e^{\mathbf{Q}}. Then using the Chapman-Kolmogorov equation,

\mathbf{P}(t) = \mathbf{P}(1)^t = e^{t\mathbf{Q}}    (3.11)
= \sum_{j=0}^{\infty} \frac{(t\mathbf{Q})^j}{j!}    (3.12)

Now,

\frac{d}{dt}\mathbf{P}(t)\Big|_{t=0} = \mathbf{Q}e^{t\mathbf{Q}}\Big|_{t=0}    (3.13)
= \mathbf{Q}    (3.14)
Now let v = (1, 1, 1, 1)^T. It is easy to see that v is an eigenvector of \mathbf{P}(t), since \sum_j P_{ij}(t) = 1
as noted earlier, so \mathbf{P}(t)v = v. Hence

v = e^{t\mathbf{Q}}v = \sum_{j=0}^{\infty} \frac{(t\mathbf{Q})^j v}{j!}    (3.15)
= v + \sum_{j=1}^{\infty} \frac{t^j (\mathbf{Q}^j v)}{j!}    (3.16)
= v    (3.17)

The remainder is a power series in t that vanishes for every t, so all of its coefficients \mathbf{Q}^j v
must be zero. In particular \mathbf{Q}v = 0. Rate
matrices satisfying this condition define valid processes.
It is clear at this point that Markov models suffer from confounding: e^{t\mathbf{Q}} = e^{(t/\gamma)(\gamma\mathbf{Q})}
for any \gamma > 0, so a rescaling of the rate matrix can be absorbed into the time scale. This means
that absolute time scales cannot be used. Hence,
expected substitutions per site is the usual time unit stated.
3.2.1 Jukes-Cantor Model
Of all the Markov models of evolution, the Jukes-Cantor model makes the strongest assumptions
about the process. Besides the properties of Markov chains this model assumes:
• The process acts on sites independently and identically (iid).
• All substitutions occur with equal probability.
With these assumptions, we need only define an appropriate rate matrix to derive transition
probabilities. If we define the rate of substitution as α, the Jukes-Cantor rate matrix:
\mathbf{Q}_{JC} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}    (3.18)
The rate at which a site stays in its current state must be −3α as row sums must equal zero.
We can find the spectral decomposition of e^{t\mathbf{Q}_{JC}} by determining the eigenvectors and
eigenvalues of \mathbf{Q}:

\mathbf{Q} = \mathbf{S}\,\mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)\,\mathbf{S}^{-1}    (3.19)

where \mathbf{S} is the matrix with the eigenvectors of \mathbf{Q} as columns and the \lambda_i are the corresponding
eigenvalues. From further linear algebra we can write:

e^{t\mathbf{Q}} = \mathbf{S}\,\mathrm{diag}(e^{t\lambda_1}, e^{t\lambda_2}, e^{t\lambda_3}, e^{t\lambda_4})\,\mathbf{S}^{-1}    (3.20)
In this case it is easy to see that the matrix \mathbf{S} can be defined as follows:

\mathbf{S} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}    (3.21)
The corresponding eigenvalues are \lambda_1 = 0 and \lambda_2 = \lambda_3 = \lambda_4 = -4\alpha. Also, we can verify that
\mathbf{S}^{-1} = \frac{1}{4}\mathbf{S}^T. We can now derive the transition probabilities for the Jukes-Cantor model:

P_{ij}(t) = \begin{cases} \frac{1}{4}(1 + 3e^{-4\alpha t}) & i = j \\ \frac{1}{4}(1 - e^{-4\alpha t}) & i \neq j \end{cases}    (3.22)
The probability of seeing a change after time t does not depend on the current state:

\Pr(X(t) \neq X(0)) = \sum_{b \neq a} \Pr(X(t) = b \mid X(0) = a)    (3.23)
= \sum_{b \neq a} P_{ab}(t)    (3.24)
= 3 \times \tfrac{1}{4}(1 - e^{-4\alpha t})    (3.25)
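The closed form (3.22), and the change probability (3.25) that follows from it, can be checked against the spectral decomposition (3.20) numerically. This is a sketch assuming numpy; the parameter values are illustrative.

```python
import numpy as np

alpha, t = 0.1, 2.0

# Jukes-Cantor rate matrix (3.18): alpha off the diagonal, -3*alpha on it.
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))

# e^{tQ} via the spectral decomposition (3.20).
lam, S = np.linalg.eig(Q)
P = (S @ np.diag(np.exp(t * lam)) @ np.linalg.inv(S)).real

# Closed form (3.22).
off = 0.25 * (1 - np.exp(-4 * alpha * t))
P_closed = np.full((4, 4), off) + (1 - 4 * off) * np.eye(4)

assert np.allclose(P, P_closed)
assert np.allclose(P.sum(axis=1), 1.0)       # rows sum to one
assert np.allclose(1 - np.diag(P), 3 * off)  # change probability (3.25)
```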
We can use the observed proportion of changed sites to estimate the time of
divergence by solving the above for t. The number of changed sites is binomially distributed,
since each site either has or has not changed: it follows \mathrm{Bin}(N, \frac{3}{4}(1 - e^{-4\alpha t})),
where N is the length of the sequence. This means \hat{P}_c = (\text{no. of changed sites})/N is a maximum
likelihood estimate of the probability of change. From the invariance property of maximum likelihood
estimates, the following is a maximum likelihood estimate of the time of divergence:

\hat{t} = \frac{-1}{4\alpha}\log\left(1 - \frac{4}{3}\hat{P}_c\right)    (3.26)
This is the Jukes-Cantor distance estimate. It is usually written as d_{ij}, where i and j
represent two sequences/taxa. Since two independent (unrelated) sequences are expected to agree
at 1/4 of sites, sequences are considered to be unrelated as \hat{P}_c \to 3/4. At this point distances
tend to infinity.
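With the rate parameter \alpha = 1/3 (derived in the next subsection), (3.26) becomes the familiar d = -(3/4) ln(1 - (4/3) p̂). A sketch (the function name is my own):

```python
import math

def jukes_cantor_distance(x, y):
    """JC distance (3.26) with alpha = 1/3: d = -(3/4) ln(1 - (4/3) p)."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    if p >= 0.75:
        return math.inf  # sequences look unrelated; the distance diverges
    return -0.75 * math.log(1 - 4.0 * p / 3.0)
```

Note that the corrected distance always exceeds the raw proportion of differences, reflecting the hidden substitutions discussed in Section 2.3.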
Selection of Rate Parameter
For very short time spans the total number of changes inferred by the Jukes-Cantor estimate
is equal to the number of observed changes. More precisely:

\lim_{t \to 0} \frac{P_c(t)}{P_{obs}(t)} = 1    (3.27)

As both numerator and denominator tend to zero, this can be seen using l'Hopital's rule:

\frac{dP_c}{dt}\Big|_{t=0} = \frac{dP_{obs}}{dt}\Big|_{t=0} = 1    (3.28)

Now,

\frac{dP_c}{dt} = 3\alpha e^{-4\alpha t}    (3.29)
= 3\alpha \text{ at } t = 0    (3.30)

Applying our boundary condition,

3\alpha = 1    (3.31)
\alpha = \frac{1}{3}    (3.32)
3.2.2 Jukes-Cantor Variance
The variance of the Jukes-Cantor estimate can be derived using the delta method:

\hat{t} - E\hat{t} \approx \frac{dt}{dp} \times (\hat{p} - E\hat{p})    (3.33)

\sigma^2(\hat{t}) = \frac{1}{(1 - \frac{4}{3}p)^2} \cdot \frac{p(1 - p)}{n}    (3.34)

That is,

\sigma^2(\hat{t}) = e^{8t/3}\,T(1 - T)/S    (3.35)

where T = \frac{3}{4}(1 - e^{-4t/3}) and S is the sequence length. That is, the variance grows exponentially
with t. The covariance of two distance estimates is derived using the tree structure. Distance
estimates are represented on a phylogenetic tree by the sum of branch lengths on the unique
path between the taxa in question. To calculate the covariance of two pairwise distances we
simply calculate the variance of the branches common to both paths.
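A direct transcription of (3.35) makes the exponential growth and the 1/S scaling easy to inspect (the function name is my own):

```python
import math

def jc_variance(t, S):
    """Variance of the JC distance estimate, eq. (3.35).

    T is the expected proportion of differing sites after time t and
    S the sequence length; the variance grows exponentially with t.
    """
    T = 0.75 * (1 - math.exp(-4 * t / 3))
    return math.exp(8 * t / 3) * T * (1 - T) / S
```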
3.2.3 Generalisations of the Jukes-Cantor Model
It is very clear from biological evidence that the assumptions made in the Jukes-Cantor model do
not generally hold. Analysis of DNA sequences shows that substitutions are not equiprobable.
An example of this is transition/transversion bias. Nucleotides are grouped according to
their molecular structure as purines (A, G) or pyrimidines (C, T). Purine to purine or pyrimidine
to pyrimidine substitutions are called transitions. The rest are called transversions. Because of
this molecular structure it is much more likely that a transition than a transversion will happen.
The Kimura-2-parameter model (K2P) attempts to correct this assumption by introducing
parameters to model the difference in transition and transversion rates. This approach produces
a new rate matrix, where \alpha is the rate of transitions and \beta the rate of transversions:

\mathbf{Q}_{K2P} = \begin{pmatrix} -(2\beta + \alpha) & \beta & \beta & \alpha \\ \beta & -(2\beta + \alpha) & \alpha & \beta \\ \beta & \alpha & -(2\beta + \alpha) & \beta \\ \alpha & \beta & \beta & -(2\beta + \alpha) \end{pmatrix}    (3.36)
In 1981 Felsenstein [15] presented a model (F81) where the substitution rate depends only on
the equilibrium frequency of a nucleotide. These equilibrium frequencies are usually determined
from the observed frequencies in the sequences at hand. Here \mu represents a rate parameter and \pi_i
represents the frequency of nucleotide i ('.' indicates the value necessary to make the row sum
equal zero):

\mathbf{Q}_{F81} = \begin{pmatrix} \cdot & \mu\pi_T & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \cdot & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \cdot & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \mu\pi_C & \cdot \end{pmatrix}    (3.37)
Hasegawa et al [20] further refined Felsenstein's model by considering transition/transversion
rates \beta and \alpha:

\mathbf{Q}_{HKY} = \begin{pmatrix} \cdot & \beta\pi_T & \beta\pi_C & \alpha\pi_G \\ \beta\pi_A & \cdot & \alpha\pi_C & \beta\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \beta\pi_G \\ \alpha\pi_A & \beta\pi_T & \beta\pi_C & \cdot \end{pmatrix}    (3.38)
Finally, the most general time reversible model (GTR) has nine free parameters:

\mathbf{Q}_{G} = \begin{pmatrix} \cdot & \rho\pi_T & \beta\pi_C & \gamma\pi_G \\ \rho\pi_A & \cdot & \alpha\pi_C & \sigma\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \tau\pi_G \\ \gamma\pi_A & \sigma\pi_T & \tau\pi_C & \cdot \end{pmatrix}    (3.39)
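The parameterized matrices above all share one recipe: each off-diagonal entry is an exchangeability for the pair of states multiplied by the equilibrium frequency of the target state, with the diagonal set so that rows sum to zero. A sketch assuming numpy; the state order (A, T, C, G) matches the columns of (3.37)-(3.39), and the function name is my own.

```python
import numpy as np

def gtr_rate_matrix(pi, rates):
    """General time reversible rate matrix, eq. (3.39).

    pi    -- equilibrium frequencies in the order (A, T, C, G)
    rates -- exchangeabilities (rho, beta, gamma, alpha, sigma, tau)
             for the pairs (A,T), (A,C), (A,G), (T,C), (T,G), (C,G)
    """
    rho, beta, gamma, alpha, sigma, tau = rates
    R = np.array([[0.0,   rho,   beta,  gamma],
                  [rho,   0.0,   alpha, sigma],
                  [beta,  alpha, 0.0,   tau],
                  [gamma, sigma, tau,   0.0]])
    Q = R * np.asarray(pi)                # Q_ij = R_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows must sum to zero
    return Q

# Equal frequencies and equal exchangeabilities recover Jukes-Cantor (3.18)
# with alpha = 0.4 * 0.25 = 0.1.
Q = gtr_rate_matrix([0.25] * 4, [0.4] * 6)
```

Restricting the parameters recovers the earlier models: equal exchangeabilities give F81, a single transition and a single transversion rate give HKY, and equal frequencies on top of that give K2P.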
3.3 Problems with Markov Models of Evolution
Biological evidence suggests that all the models considered so far simplify the situation too much.
For example, they cannot deal with long-distance correlations between sites due to RNA folding.
A key problem appears to be the iid assumption between sites. The assumption of rate ho-
mogeneity is contradicted by evidence that mutations are dependent on local sequence context.
Protein coding genes are an example of how this assumption can be violated for very basic reasons.
Because of the redundancy in the nucleotide to amino acid code, different codon positions are
subject to different selectional pressures. Mutation rates appear to be dependent on structural
and functional constraints as well as chromosomal positions. These are all local properties of a
sequence. However, it is assumed that substitution rates are constant throughout a phylogenetic
tree.
Markov models of evolution assume stationarity of base frequencies. That is, expected
nucleotide frequencies remain the same with time. This is contradicted by observations that
nucleotide frequencies are very different in sequences from different species. For example, GC content
in mammals is much higher than in flies [30]. Lockhart et al [31] have shown that if a model
that assumes stationarity is used, then breaking the assumption can lead to inaccurate distance
estimates. The main problem is a tendency to group sequences with similar nucleotide
frequencies, irrespective of evolutionary relatedness.
3.4 Modelling Rate Heterogeneity
A number of methods have been suggested to add some level of rate heterogeneity into Markov
models.
One approach is to set some sites invariable while others change. This is useful when one
can determine conserved regions in sequences. However, it doesn't allow for more than one rate.
Another approach allows sites to evolve at different rates, where the rate for a site is drawn from a
Gamma distribution with shape parameter \alpha. A discrete Gamma model that allows much easier
computation has also been developed by Yang [30]. This is, perhaps, the most popular approach
at present. However, it still does not make use of information available about local behaviour.
Recently, Steel et al [42] have presented a covariotide model of site substitution where sites
are affected by different selection pressures. This model allows some sites to be invariant while
others change. However, sites do not have to remain invariant. This represents the fact that
constraints on sites can change over time. The activation of sites is governed by a Markov
process where sites are still iid to keep the model tractable.
Other techniques have been based on defining multiple categories of rates. This can be implemented
using hidden Markov models, with algorithms inferring the most probable rate category for each site.
These are discussed in [11].
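The Gamma approach above can be sketched in a few lines: per-site rates are drawn from a Gamma distribution with shape \alpha and mean 1, and each site's branch lengths are then scaled by its rate. This assumes numpy; the parameter value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.5  # shape parameter alpha; small alpha means strong rate variation

# Rates with mean 1: Gamma(shape=a, scale=1/a).
site_rates = rng.gamma(shape=a, scale=1.0 / a, size=100_000)

# A site with rate r effectively evolves for time r * t along a branch of length t.
t = 0.3
effective_times = site_rates * t

assert abs(site_rates.mean() - 1.0) < 0.05
assert site_rates.min() >= 0.0
```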
3.5 Modelling Non-Stationarity
As mentioned previously, Markov models of evolution assume stationarity of nucleotide frequen-
cies. However, there is strong evidence suggesting that this is not the case.
The paralinear and logdet corrections have been developed to make distance estimates more
reliable when base frequencies differ from species to species. Both rely on the following lemma.
Lemma 3.5.1 Let t be a measure of evolutionary time. Then

t ∝ log[det(P(t))] (3.40)

Proof

P(t) = e^{tQ} (3.41)

From linear algebra,

=⇒ det(P(t)) = e^{t trace(Q)} (3.42)

=⇒ log[det(P(t))] = t trace(Q) (3.43)

Since Q remains fixed,

=⇒ t ∝ log[det(P(t))] (3.44)
Paralinear distance
Barry and Hartigan [2] suggest an asynchronous distance estimate. This is still based on a
Markov process where sites are iid. However, it makes no assumption of homogeneity, reversibility or stationarity. It need not assume base frequencies are in equilibrium, nor that the
rate of substitution is constant throughout the tree. The distance estimate is taken to be:
dij = −(1/4) log[det(Pxy)] (3.45)
where Pxy is the transition matrix at a particular site from species x to species y, assumed to be the same across all sites. The (i, j)th element of the transition matrix is estimated as Pr(Y = j|X = i), where X and Y are the bases at the same position in the sequences of species x and y respectively, and i, j ∈ {A, C, G, T} for nucleotide sequences.
The distance measure is additive but asymmetric. The latter property means that in general dij ≠ dji, which is not a particularly desirable property. In fact, this measure can only be used to estimate the total number of substitutions along a branch when substitution rates are held constant and the model is reversible.
LogDet Transformation
This transformation method involves recording a divergence matrix, Fxy, for each pair of taxa
x and y. The ijth entry of Fxy is the proportion of sites in which taxa x and y have states i
and j respectively. The dissimilarity value, dxy is calculated as:
dxy = −log[det(Fxy)] (3.46)
Variance can be calculated using the paralinear method. Where S is the sequence length and r is 4 or 20:

σ²xy = Σi=1..r Σj=1..r [(Fxy⁻¹)²ji (Fxy)ij − 1]/S (3.47)
For models with equal nucleotide frequencies, Lockhart et al [31] show how to calculate branch lengths:

d′xy = (dxy + [log(det Fxx Fyy)]/2)/r (3.48)
Distances become treelike as sequence lengths increase, provided we reinstate our independence assumptions across sites and across the tree. This means that reconstruction methods that require treelike distances will work with corrected distances (and sufficiently long sequences). The LogDet transform has been shown to provide more realistic results where similar nucleotide frequencies might otherwise indicate false evolutionary relationships.
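A minimal sketch of the LogDet computation, with the correction of equation (3.48) applied assuming Fxx and Fyy denote the diagonal matrices of base frequencies in each sequence (that reading of the notation is an assumption here); the determinant is expanded directly since the matrices are only 4 × 4:

```python
import math

BASES = "ACGT"

def det(m):
    # Determinant by Laplace expansion -- fine for 4x4 matrices.
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def logdet_distance(x, y):
    # d_xy = -log det(F_xy): F_xy[i][j] is the proportion of sites
    # with base i in sequence x and base j in sequence y.
    n = len(x)
    F = [[0.0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        F[BASES.index(a)][BASES.index(b)] += 1.0 / n
    return -math.log(det(F))

def corrected_logdet(x, y, r=4):
    # Equation (3.48): d'_xy = (d_xy + log(det F_xx F_yy)/2) / r,
    # with F_xx, F_yy taken as diagonal base-frequency matrices.
    fx = math.prod(x.count(b) / len(x) for b in BASES)
    fy = math.prod(y.count(b) / len(y) for b in BASES)
    return (logdet_distance(x, y) + math.log(fx * fy) / 2) / r
```

For two identical sequences containing all four bases the corrected distance is zero, as expected; the raw logdet value is not, which is exactly why the correction exists.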
3.5.1 Summary of Nucleotide Markov Models
We can consider the relationships between these models via the following parameters [44]:
• κ: The rate of transitions relative to rate of transversions. In practice, κ > 1 reflects the
evidence that transitions are more prevalent than transversions.
• α: A measure of between-site variation in the rate of nucleotide substitution. Site rates are often drawn from a gamma distribution with mean 1 and variance 1/α [44]. High values of α mean low amounts of rate variation.
• Base frequencies π = (πA, πC , πG, πT ), i.e. three independent parameters.

• πMLE , πobs: The maximum likelihood and observed base frequencies.

The maximum likelihood estimates of α and κ are usually used.
Model α κ π
Jukes Cantor ∞ 1 0.25 each
Kimura 2-P ∞ variable 0.25 each
Felsenstein ∞ 1 variable
HKY ∞ variable variable
JC+Γ variable 1 0.25 each
K2P+Γ variable variable 0.25 each
Fel+Γ variable 1 variable
HKY+Γ variable variable variable
3.6 Empirical Models of amino acid evolution
When modelling amino acid evolution, empirical models have been the preferred solution. These models specify explicit transition probabilities derived from empirical evidence. The preference for empirical models when dealing with amino acids is partly due to the increase in complexity involved in having twenty character states. The following section provides an overview of the two most common empirical methods: the PAM and the BLOSUM matrices.
3.6.1 PAM/Dayhoff Substitution Matrices
The PAM/Dayhoff matrices empirically estimate amino acid substitution rates within a Markov process framework. These rates were derived from alignments of protein sequences that are at least 85% identical.
Deriving the Mutation Matrix
Let A be the matrix of observed proportions of changes between two amino acids i, j. That is:

Aij = Nij/N (3.49)
In fact, Aij has the same description as Fxy described for the LogDet transform.

Let πk be the vector of amino acid frequencies of sequence k:

(πk)j = Nkj/N (3.50)
We want to derive substitution (transition) probabilities for the time it takes 1% of all amino acids to mutate - this is the point accepted mutation (PAM) unit.

Pij = Pr(i mutates) Pr(i mutates to j|i mutates) (3.51)
Now we can empirically derive the relative mutability mi of amino acid i:

mi = P(i mutates) (3.52)

= Σj Aij / Σk,j Akj (3.53)
Now,

Pr(i mutates to j|i mutates) = Aij / Σj Aij (3.54)

and we now have an estimate of P:

Pij = mi × Aij / Σj Aij (3.55)
To calibrate our matrix to the PAM measure we simply solve:
Σi πi(1 − Pii) = 0.01 (3.56)
The matrix of Pij ’s is the PAM matrix.
If π is a vector of amino acid frequencies, Pπ is the probability vector after one such time period (1 PAM). To consider more distant relationships we can derive the k-PAM matrix. Because this is based on a Markov process we can theoretically achieve this by raising the 1-PAM matrix to the kth power:

P(k) = P^k (3.57)
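The extrapolation in (3.57) is plain matrix powering. A sketch on a toy 3-state matrix (the real 1-PAM matrix is 20 × 20 and calibrated via (3.56); the numbers below are invented for illustration):

```python
def mat_mul(a, b):
    # Naive matrix multiplication, adequate for small matrices.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, k):
    # P(k) = P^k: extrapolate a 1-PAM-style matrix to k PAM units.
    n = len(p)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result

# Toy "1-PAM-like" transition matrix: rows sum to 1, ~1% mutate per unit.
P1 = [[0.990, 0.006, 0.004],
      [0.005, 0.990, 0.005],
      [0.004, 0.006, 0.990]]
P250 = mat_pow(P1, 250)
```

Rows of P250 still sum to 1, and its off-diagonal probabilities are far larger than those of P1, reflecting the substitutions that accumulate over larger evolutionary distances.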
The log-odds form of the PAM matrices is often used for scoring sequence alignment reliability. This can be thought of as a log-likelihood ratio test with the null hypothesis being that sites have aligned by chance.

Sij = log(Pij/πi) (3.58)

The rarer the amino acids in an aligned pair, the lower the probability of a chance alignment and so the greater the significance.
Problems with the PAM model
Besides the problems inherent in Markov models of evolution, the PAM matrices suffer from other problems. Firstly, they assume that proteins have average amino acid composition (many don't). Secondly, rare replacements are not observed often enough to resolve relative frequencies properly. Thirdly, errors in PAM(1) are amplified when extrapolating to, say, PAM(250). Finally, Markov processes don't accurately model evolution.
There is no theoretical justification for applying this extrapolation to divergent alignments. In fact this approach implies a large loss of information. As evolutionary distance increases, information content decreases. This means a longer region of similarity is needed to achieve a score high enough to distinguish from chance. However, regions of similarity shrink to narrow blocks as evolutionary distance increases, so it is difficult to find the necessary data.
Attempts to update the PAM matrices to make them more accurate have been made. A
particular example is the Jones Taylor Thornton model.
3.6.2 BLOSUM
The Block Sum (BLOSUM) substitution matrices were introduced in 1992 by Henikoff and Henikoff [21]. They take a completely different approach to the PAM matrices. The key point is that the derivation of transition probabilities uses alignments of distantly related sequences. Blocks are conserved regions of local alignments with no gaps.
The aim is to obtain a set of scores for matches and mismatches that best favours a correct alignment with each of the other segments in the block relative to an incorrect alignment. This is done by creating a table where each column contains amino acid pair frequencies for the corresponding column in the alignment. This is a 20(20 − 1)/2 × N matrix where the first term is the number of possible pairs of amino acids and N is the length of the alignment.
A score matrix is defined from a log-odds matrix derived from the frequency table. Let Fij be the ijth entry of the frequency matrix, and let qij be the observed probability of an ij pair:

qij = Fij / Σj Fij (3.59)

We can estimate the expected probability of an ij pair occurring as eij. Let pi be the probability of i occurring in an ij pair, and set eij = pipj. Our odds ratio matrix takes the form:

Sij = qij/eij (3.60)
That is, the observed probability over the expected probability that i and j appear together at random. The log (base 2) of each ratio is usually multiplied by a scaling factor of 2 and then rounded to the nearest integer. This gives the BLOSUM (block substitution matrix) in half-bit units.
Unlike the PAM matrices, separate matrices have been derived for different time scales. BLOSUM matrices are referred to by the minimum percentage identity between sequences. That is, in BLOSUM 60, sequences that are at least 60% similar are treated as identical. As distances become large we expect to use a BLOSUM matrix with a lower BLOSUM parameter.
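The log-odds construction can be sketched as follows. This follows the standard BLOSUM recipe, including the conventional factor of 2 in the expected probability of a mismatched pair (slightly more detail than eij = pipj above); the toy two-letter counts are invented for illustration:

```python
import math

def blosum_scores(pair_counts):
    # Half-bit log-odds scores from observed residue pair counts.
    # pair_counts maps unordered pairs such as ("A", "R") to counts.
    total = sum(pair_counts.values())
    q = {pair: c / total for pair, c in pair_counts.items()}
    # p[i]: probability that residue i appears in a randomly drawn pair.
    p = {}
    for (i, j), qij in q.items():
        p[i] = p.get(i, 0.0) + (qij if i == j else qij / 2)
        if i != j:
            p[j] = p.get(j, 0.0) + qij / 2
    scores = {}
    for (i, j), qij in q.items():
        eij = p[i] * p[j] if i == j else 2 * p[i] * p[j]
        scores[(i, j)] = round(2 * math.log2(qij / eij))
    return scores

# Toy alphabet {A, B}: 6 AA pairs, 2 AB pairs, 2 BB pairs observed.
scores = blosum_scores({("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2})
```

Matches between rare residues score higher than matches between common ones, mirroring the significance argument made for the PAM log-odds scores.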
Problems with the BLOSUM matrices
The main problem with the BLOSUM matrices is that they can be overtrained. That is, if most of the conserved blocks are taken from just a few species then the resulting matrix isn't going to look much like reality. This is a real problem as most genomic data available is from very few species.

To reduce the contribution of the most closely related members of a family (that is, to reduce multiple contributions of the same amino acid pairs), sequences are clustered within blocks and each cluster is weighted as a single sequence. The resulting matrices are analogous to transition matrices but are estimated without any reference to a rate matrix Q.
3.7 Differences in PAM and BLOSUM
The differences between the PAM and BLOSUM substitution matrices are a consequence of their different approaches to the problem. PAM matrices are derived from a tree based model that uses matrix multiplication to extrapolate to larger time scales. It is based on mutations in both conserved and variable regions.

BLOSUM is derived from pair frequencies in highly conserved blocks. Different weights can be given to different sequence groups. BLOSUM has an advantage in that it was derived from a more representative data set. Hardly any transitions were observed in deriving PAM, whereas this was not the case for BLOSUM. This problem has been addressed by re-deriving the models with more data; this is the Jones Taylor Thornton model.

The fact that BLOSUM is not tree derived does not seem to be a major disadvantage. BLOSUM generally gives better results when used to score database searches, as highly conserved regions usually serve as anchor points. However, PAM-style matrices are still more widely used in phylogenetics.
Chapter 4
Phylogenetic Tree Reconstruction
Methods
This chapter surveys tree reconstruction methods. Phylogenetic reconstruction methods come in many forms. Parametric methods such as maximum likelihood rely on the specification of a model. At the other end of the scale, non-parametric methods such as maximum parsimony claim to make no assumptions about the underlying model, on the grounds that any assumptions we do make are likely to be inadequate. In the middle there are semi-parametric methods - the distance based methods that require a model to generate distances but then go on to reconstruct the tree by a non-parametric clustering method. Bayesian approaches have also been proposed.
This chapter first outlines how the usual statistical indicators are redefined for the prob-
lem of phylogenetic inference. The rest of the chapter examines commonly used methods for
phylogenetic tree reconstruction.
4.1 Evaluating Reconstruction Methods
Before examining the methods available we need to have an idea of what we want from them. Several often used criteria are discussed below. When evaluating tree reconstruction methods we have available the usual bag of statistical measures. However, it will be seen that defining these with respect to phylogenetics is not straightforward.
4.1.1 Complexity
An important issue to consider in the design of reconstruction methods is the size of the space of trees. There are (2N − 3)!! topologically unique rooted trees with N leaves, and (2N − 5)!! unrooted trees. Clearly, algorithms that involve evaluating the entire space of N leaf trees are not going to be computationally feasible.
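These double factorials grow very fast, as a short computation shows (the function names are illustrative):

```python
def double_factorial(n):
    # n!! = n * (n - 2) * (n - 4) * ... down to 1 (or 2).
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def num_rooted_trees(n_leaves):
    # (2N - 3)!! topologically distinct rooted binary trees on N leaves.
    return double_factorial(2 * n_leaves - 3)

def num_unrooted_trees(n_leaves):
    # (2N - 5)!! topologically distinct unrooted binary trees on N leaves.
    return double_factorial(2 * n_leaves - 5)
```

Already for 10 leaves there are 2,027,025 unrooted topologies, and for 20 leaves over 10^20, which is why exhaustive evaluation is infeasible.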
4.1.2 Accuracy
The evaluation criterion that has been given the most attention is accuracy. When we build phylogenies we need some measure of how well the method used estimates the true tree. This is usually evaluated by examining how the method performs on simulated data and on biologically well supported phylogenies [25].
4.1.3 Consistency
The behaviour of reconstruction methods as sequences get longer is usually discussed in terms of consistency.

Definition 4.1.1 An estimator T̂n of T is consistent if

limn→∞ PT(|T̂n − T| > ε) = 0 (4.1)
In phylogenetics this is interpreted as whether a reconstruction method will return the true tree if its inputs are based on infinitely long sequences. This criterion has been given much consideration as the amount of genome data available increases. In effect, it means that the only barrier to success is the researcher having enough sequence to hand.
Steel [13] showed that if the frequencies of residues are known and sites are iid, then a consistent estimator can be found. If not (site rates are variable and/or frequencies are not known), then it can be impossible to estimate a phylogenetic tree consistently. However, Chang [8] has shown that if sites are iid, the correct model is being used and some other restrictions hold, then a consistent estimator can be found.
There has been much debate about the usefulness of such a measure given that real sequences will always be finite. Holmes makes the point that the longer sequences become in reality, the less valid the site independence assumption is [23]. 1
4.1.4 Efficiency
Definition 4.1.2 An estimator T̂ is efficient if it is unbiased and

limn→∞ σ²(T̂n)/I(T)⁻¹ = 1 (4.2)

where I(T) is the Fisher information of T.

1Holmes provides a useful phylogenetics-to-statistics term conversion table in [23].
In phylogenetics, we would like this to mean that as longer sequences are used the variance of our trees is as low as it can be. The problem with this is that the variance of a tree is not well defined. In fact, the literature generally uses efficiency to describe how quickly a reconstruction method converges to the correct solution as it is given more data. This is usually measured via simulation.

Ideally, we would use an analogue of the mean squared error for trees, E[d(T̂, T)²], as this gives an indication of both variance and bias. However, the problem remains of how to define distances between trees, and this has not been solved.
4.1.5 Robustness
We know current assumptions made about sequence evolution are inadequate. With this in mind, it is very desirable to know how well a method is likely to behave when wrong assumptions are made - for example, the effect of model misspecification on parametric methods. This is usually assessed by simulating data under a fully specified model and then reconstructing the tree under a misspecified one.
4.1.6 Usability in tests
We also need to consider whether a reconstruction method can reject false assumptions in our model of evolution. For example, we want to be able to determine if additional complexity is worthwhile. Since understanding the process of evolution is our primary goal, this should always be kept in mind.
4.2 Parsimony
The maximum parsimony method chooses as the best tree the one on which the fewest base changes have occurred in sequences from the root to the leaves. An example of this is seen in fig 4.1. Combinatorially this is the same as finding the minimum Steiner tree under the Hamming distance between sequences [23].
In theory this means that all possible assignments of sequences to internal nodes over all possible tree topologies (with the necessary number of leaves) must be evaluated. In practice, heuristics are employed to cut down the search space. Recursive algorithms and branch and bound have also been used to avoid repeating computation (see [11]).
Figure 4.1: An example of how changes are counted using the principle of parsimony. The paths from sequences at the leaves of the tree to the root involve 4 base changes.
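The counting itself can be done without enumerating ancestral sequences using Fitch's small-parsimony algorithm, sketched below (the nested-tuple tree encoding and example sequences are illustrative choices; note the algorithm returns the minimum over all internal assignments, which can be smaller than the count for one particular assignment such as the one drawn in fig 4.1):

```python
def fitch_cost(tree, site):
    """Minimum number of changes at a single site.  `tree` is a nested
    tuple of leaf indices; `site[i]` is the base of leaf i."""
    def walk(node):
        if isinstance(node, int):                 # leaf
            return {site[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        inter = ls & rs
        # Non-empty intersection: no change needed here; otherwise add one.
        return (inter, lc + rc) if inter else (ls | rs, lc + rc + 1)
    return walk(tree)[1]

def parsimony_score(tree, sequences):
    # Sites are costed independently, as in the text.
    return sum(fitch_cost(tree, [s[i] for s in sequences])
               for i in range(len(sequences[0])))

# Leaves AAA, GGA, AGA on the topology ((0, 1), 2):
score = parsimony_score(((0, 1), 2), ["AAA", "GGA", "AGA"])
```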
Parsimony is based on the concept of Occam's razor: solutions that make the fewest assumptions are likely to be the best. By looking only at base changes, parsimony claims to require no knowledge of the evolutionary model. Parameterized models are known to be flawed, so this non-parametric approach may seem quite reasonable.
However, it seems that an underlying model for parsimony exists implicitly. The assumptions
are that sites are independent (we cost each substitution separately) and the probability of
substitution is equal for all bases.
It has long been established that parsimony can be inconsistent. The situation where this happens has been dubbed the 'Felsenstein zone'. However, as mentioned previously, consistency is not always a necessary property for a reconstruction method. Another problem is that different trees can be equally parsimonious for a set of sequences.
4.3 Maximum Likelihood
Maximum likelihood reconstruction is, unsurprisingly, based on the likelihood principle. Given data D and a model, we calculate the likelihood of a hypothesis H as P(D|H), the probability of observing D if H is correct [38]. With respect to phylogenetics, the data D is sequence data and the model is a process of site substitution (see Chapter 3). H is a phylogenetic tree, which is defined by its topology and branch lengths.

The aim is to find the tree that maximizes the likelihood. We choose this tree as our 'best' guess.
Figure 4.2: Example ML tree T. The xi are nodes representing sequences, the ti are branch lengths.
Example Likelihood Calculation
The likelihood of the rooted tree in fig 4.2 can be calculated as follows.

P(x1, ..., x5|T, t1, ..., t4) = P(x1|x4, t1)P(x2|x4, t2)P(x4|x5, t1, t2, t4) (4.3)

where P(xi|xj, t) = L(xi, xj, t), the right hand side being the likelihood of (xi, xj) forming a branch of length t in tree T. This can also be transformed into a recursive form.
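The recursive form is Felsenstein's pruning algorithm: each node passes up a vector of conditional likelihoods, one per possible state. A minimal single-site sketch using the Jukes-Cantor transition probabilities of Chapter 3 (the tree encoding and branch lengths here are illustrative assumptions):

```python
import math

def jc_prob(a, b, t):
    # Jukes-Cantor transition probability P(b | a, t).
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(node):
    """A node is either a base character (leaf) or a list of
    (child, branch length) pairs.  Returns the conditional likelihood
    of the subtree below the node for each possible state at the node."""
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in "ACGT"}
    like = {a: 1.0 for a in "ACGT"}
    for child, t in node:
        cl = site_likelihood(child)
        for a in "ACGT":
            like[a] *= sum(jc_prob(a, b, t) * cl[b] for b in "ACGT")
    return like

def tree_likelihood(root):
    # Sum over root states weighted by equilibrium frequencies (1/4 each).
    like = site_likelihood(root)
    return sum(0.25 * like[a] for a in "ACGT")

# An internal node joining leaves "A" and "A", plus a leaf "G", off the root:
L = tree_likelihood([([("A", 0.1), ("A", 0.2)], 0.3), ("G", 0.4)])
```

Summed over all possible leaf patterns these likelihoods total 1, which is a useful sanity check on any implementation.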
Characteristics of maximum likelihood

Maximum likelihood estimation is consistent. It borrows its efficiency rating from the more general theory of maximum likelihood estimation. Unlike distance based methods it has been found to be robust to the presence of distant taxa [4].

The maximum likelihood tree is not necessarily unique [38], so this method may not be able to resolve completely which is the best tree.
It is also extremely expensive computationally. The three taxa tree shown in fig 4.2 is a trivial example because there is only one possible unrooted topology for a three leaf tree. If we are dealing with models that are not reversible we have to consider every possible rooted tree. For n sequences this potentially involves evaluating the likelihood for all (2n − 3)!! rooted tree topologies and all possible assignments of sequences to the hidden internal nodes of the tree.

This is a huge computational problem!2
Simplifications
The problem can be simplified by making our usual assumptions. If we assume that sites are iid, we need only consider the evolution of individual sites with respect to the tree. The probability of the tree with respect to the sequences is then just the product of the probabilities of the sites. This provides opportunities for parallelising computation.

If we assume that the model of site substitution is reversible then we can determine the probabilities of substitutions from the leaves up - a postorder traversal. In fact, we only need to consider the unrooted tree. This is the 'pulley' principle described by Felsenstein [15].
Search heuristics
This still leaves the problem of calculating the likelihood of every unrooted tree. To cut down
the search space heuristics need to be employed. Felsenstein proposed a branch and bound
method where taxa are added incrementally to maximize the likelihood at each stage. The big
disadvantage with this approach is that it may not find the optimal tree.
4.4 Is MP the same as ML?
The use of maximum parsimony over maximum likelihood (and vice versa) has been the source of much division in phylogenetics. However, as Holmes aptly puts it:
The statistical perspective sees the differences between maximum likelihood, maxi-
mum parsimony...as much more a matter of degrees of freedom allowed in a model
than a matter for religious wars
The non-parametric nature of MP means that no parameters are pinned down. In effect it
needs to optimize over infinite dimensional criteria. A parametric model such as Jukes-Cantor
is at the other end of the scale. Variable rate models lie somewhere in the middle. This view
is well supported by the work of Steel et al [38] who have found conditions where the MP tree
is the ML tree. This happens when there is ‘no common mechanism’ assumed between sites or
lineages.
However, the general evidence from simulations is that MP does not perform as well as ML. This is likely to be due to the implicitly restrained model involved in most parsimony implementations. In its usual form parsimony will not take account of sequence evolution behaviour such as reversal. Parsimony as a 'no common mechanism' method can in theory account for unseen substitutions, as any possible assignment of sequences to internal nodes is allowed. However, such a candidate tree is unlikely to be selected as the best one, as it will involve more substitutions - violating the most-parsimonious criterion.

2In fact it has been shown that maximum likelihood for phylogeny is NP-complete.
This often makes the choice between ML and MP a choice between a flawed model that takes into account some of our knowledge of evolution, and a model constrained to be as simple as possible that does not take into account any biological evidence.
4.5 Distance Based Methods
Distance based methods reconstruct phylogenetic trees from estimated distances between species. These distances are usually estimated according to some parametric model. However, the actual reconstruction method is usually non-parametric. These methods are dominated by agglomerative algorithms.
Agglomerative or cluster methods follow the same general algorithm.
• Select two nodes
• Merge them to form a new node (or cluster).
• Update the distances to reflect removal of two and addition of one node according to some
rule.
The differences between methods lie in how (most importantly) the select and update steps are implemented.
The concept of additivity is essential to the agglomerative algorithms [11].
Definition 4.5.1 Given a tree, its edge lengths are additive if the distance between two leaves is equal to the sum of the lengths of the edges on the (unique) path between them.
It is important to note that additivity is a property of the distance measure used. Real data
is only ever approximately additive.
4.5.1 Unweighted pair group method using arithmetic averages (UPGMA)
UPGMA is the classic naive clustering algorithm. We start with each sequence being a cluster on its own and proceed with our generic clustering algorithm. The distance between two clusters Ci and Cj is defined to be the average distance between pairs of sequences from each cluster:

dij = (1/(|Ci||Cj|)) Σp∈Ci,q∈Cj dpq (4.4)

where |Ci| is the number of sequences in Ci.

When clusters Ci and Cj are combined we have Ck = Ci ∪ Cj. The distance between the new cluster Ck and any other cluster Cl is:

dkl = (dil|Ci| + djl|Cj|) / (|Ci| + |Cj|) (4.5)
The algorithm terminates when only two clusters Ci, Cj remain. At this stage the tree root
is added at height dij/2.
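The whole procedure fits in a short sketch (the dictionary-based bookkeeping is an illustrative choice; the input distances are assumed ultrametric):

```python
def upgma(names, d):
    """Minimal UPGMA sketch.  `d[(a, b)]` is the distance between taxa
    a and b; returns the rooted tree as nested tuples plus its height."""
    dist = {frozenset(p): v for p, v in d.items()}
    size = {n: 1 for n in names}
    height = 0.0
    while len(size) > 1:
        # Select: the closest pair of active clusters.
        pair = min(dist, key=dist.get)
        i, j = tuple(pair)
        k = (i, j)
        height = dist.pop(pair) / 2      # root height on the final merge
        # Update: size-weighted average distance to every other cluster.
        for l in list(size):
            if l in (i, j):
                continue
            dil = dist.pop(frozenset((i, l)))
            djl = dist.pop(frozenset((j, l)))
            dist[frozenset((k, l))] = (dil * size[i] + djl * size[j]) \
                / (size[i] + size[j])
        size[k] = size.pop(i) + size.pop(j)
    return next(iter(size)), height
```

On ultrametric data, e.g. d(a, b) = 2 and d(a, c) = d(b, c) = 6, this recovers the clock tree ((a, b), c) with root height 3.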
4.5.2 The Molecular Clock Hypothesis
The molecular clock hypothesis assumes that the rate of evolution is approximately constant on
the molecular level. This is equivalent to having the same Markov model rate matrix Q apply
to every part of the tree. If we believe this hypothesis, we can estimate the time of divergence
simply by looking at the number of changes between sequences.
There is much evidence contradicting this assumption: rates appear to vary both between and within lineages [30].
UPGMA produces rooted trees with edge lengths that obey the molecular clock hypothesis.
If our distance data conforms to this then UPGMA will reconstruct the correct tree.
4.5.3 Long Branch Attraction
If distance data does not conform to the molecular clock then UPGMA will select the wrong tree, even in very simple cases. Consider the tree in fig 4.3. When trying to reconstruct this tree our input is the distances between the leaves. Since UPGMA merges the closest clusters at each stage, the UPGMA tree will not be the true tree.

This is the phenomenon of long branch attraction. The error associated with long distances causes a large amount of noise in the phylogenetic signal. In general, long branches are hard to deal with accurately.
Figure 4.3: An example of a tree that does not conform to the molecular clock hypothesis (left). UPGMA incorrectly reconstructs the tree.
4.5.4 Neighbour Joining
Saitou and Nei's neighbour joining algorithm [36] keeps the notion of additivity (4.5.1) but dispenses with the molecular clock. For two leaves/clusters i and j we define:

Dij = dij − (ri + rj) (4.6)

where

ri = (1/(|L| − 2)) Σk∈L dik (4.7)

and L is the set of leaves in the tree.
This criterion attempts to deal with the problem of long branch attraction: subtracting off the averaged distances to all other leaves/clusters compensates for long-branch/short-branch neighbour pairs. Nodes i and j are neighbours in the tree if there is another node k such that branches ik and jk are both in the tree (the shortest path between them has length two). That is, they are separated by only one ancestor.
Theorem 4.5.1 If additivity holds and if Dij is minimal for leaves i, j ∈ L then i, j are neigh-
bours in the tree.
The proof of this is available in Chapter 7 of [11].
At each iteration Dij is used to select the neighbours to join. In fact, NJ works to a minimum evolution criterion: the best tree is the shortest one, where the length of a tree is usually calculated by summing all branch lengths.
If at some iteration we select i and j as neighbours and join them into a new node k, then for m ∈ L we update our distances as:

dkm = (1/2)(dim + djm − dij) (4.9)

and set

dik = (1/2)(dij + ri − rj), djk = dij − dik (4.10)
Neighbour Joining has been shown to reconstruct the correct tree when distances are additive. This also holds when distances are only approximately additive [17]. It is consistent: it has been shown to return the correct tree when the correct distances are used. However, this does not give assurances about sequences of finite length. We need to remember that errors in distance based methods grow exponentially as distance increases. This is also important to remember because the algorithm produces unrooted trees, which means that if a distant outgroup is used to root the tree it is likely to damage the accuracy of the result.
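One iteration of the select and update steps can be sketched as follows; the four-taxon distances below are constructed to be additive on a hypothetical tree (a and b neighbours with limbs 0.1 and 0.2, internal branch 0.3, limbs 0.4 and 0.5 to c and d), and taxon names are assumed orderable:

```python
def nj_step(names, d):
    """One neighbour-joining iteration.  `d` maps frozenset pairs of
    names to distances.  Selects the pair minimising
    D_ij = d_ij - (r_i + r_j), joins it, and returns the new node,
    its two limb lengths, and the updated distances."""
    L = len(names)
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) / (L - 2)
         for i in names}
    i, j = min(((a, b) for a in names for b in names if a < b),
               key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
    dij = d[frozenset((i, j))]
    limb_i = 0.5 * (dij + r[i] - r[j])   # branch length from i to new node
    limb_j = dij - limb_i
    new = (i, j)
    # Copy distances not involving i or j, then add the reduced ones.
    nd = {frozenset((a, b)): d[frozenset((a, b))]
          for a in names for b in names
          if a < b and i not in (a, b) and j not in (a, b)}
    for m in names:
        if m not in (i, j):
            nd[frozenset((new, m))] = 0.5 * (d[frozenset((i, m))]
                                             + d[frozenset((j, m))] - dij)
    return new, limb_i, limb_j, nd

# Additive distances on the hypothetical tree described above.
dists = {frozenset(p): v for p, v in {
    ("a", "b"): 0.3, ("a", "c"): 0.8, ("a", "d"): 0.9,
    ("b", "c"): 0.9, ("b", "d"): 1.0, ("c", "d"): 0.9}.items()}
joined, la, lb, nd = nj_step(["a", "b", "c", "d"], dists)
```

Because the input is additive, the true neighbours a and b are joined and their limb lengths 0.1 and 0.2 are recovered exactly.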
The next two sections describe some improvements to the NJ algorithm.
4.5.5 BIONJ
The BIONJ algorithm of Gascuel [17] improves on NJ by taking into account more biological features of the data. Gascuel shows that the formula used to update the distance matrix is just one of a large class. BIONJ selects the minimum variance reduction from this class.
The class of equations is described by:
δui = λδ1i + (1 − λ)δ2i − λδ1u − (1 − λ)δ2u (4.11)
for taxa 1 and 2, where u is the root of 1, 2. For NJ, we have λ = 1/2. Gascuel has shown
that sampling noise influences the structure of the tree. BIONJ takes advantage of the fact that
λ does not have to be fixed. Instead it can be calculated at each step to minimize sampling
variance of the new reduced matrix.
A first order model is used to estimate sampling variances and covariances. The simplicity of this means that BIONJ retains NJ's O(N³) complexity.
4.5.6 Weighbor
When determining the distance between taxa, random error increases exponentially the further
apart the taxa are. This essentially means that the distance based methods considered so far
are not robust to distant taxa. This means adding distant outgroups can lead to very dubious
results.
The maximum likelihood approach is well known to be robust to the presence of distant taxa. However, it is also very computationally expensive. Weighted neighbour joining [4] adds a likelihood based criterion to the neighbour joining approach in an attempt to deal with this problem. It replaces NJ's minimum evolution selection criterion with a likelihood based one, while the distance update step is much the same as that for BIONJ.
Weighbor's selection criterion is based on evaluating the additivity and positivity of possible neighbours. These two properties are used to evaluate the likelihood of the observed distance between two taxa given that they are neighbours. The taxa joined at each step are the two with the highest likelihood. A cost function is derived from the negative log-likelihood, which turns the problem into one of cost minimization.
Definition 4.5.2 Distances have the additivity property if for taxa i and j, dik−djk is constant
for all other taxa k.
At each iteration, the additivity property is evaluated with respect to possible neighbours i and j. The likelihood is determined as the likelihood that for each k ≠ i, j, dik − djk is an estimate of an optimally weighted average (constant). Taking the negative log-likelihood gives us an additivity cost Add(i, j).
Bruno et al assume that distance errors are normally distributed. This formulation also assumes the correlations are those of a star phylogeny, so the cost is multiplied by a constant g to account for the fact that this assumption is usually incorrect.
Definition 4.5.3 Distances have the positivity property if for taxa i and j, dik + djl − dij − dkl ≥ 0 for all other taxa k and l. That is, the internal branch in the tree ((i, j), (k, l)) has non-negative length.
Consider dPQ, the length of the internal branch on the (i, k) and (j, l) paths in the tree. Assuming that dPQ is a normal random variable with a positive mean, we can calculate the likelihood that dPQ ≥ 0 by integrating over this part of the probability space. The positivity cost can then be computed as a negative log-likelihood:

Pos(i, j) = −ln[(1/2) erfc(−dPQ/(√2 σPQ))] (4.12)
Bruno et al suggest a heuristic for evaluating the positivity constraint. This is done to avoid
measuring positivity for dPQ for every quartet that i and j are involved in.
The complete cost function has the form:

S(i, j) = g Add(i, j) + Pos(i, j) (4.13)
Weighbor updates the distance matrix in a similar manner to BIONJ, so this step will not be discussed here.
4.5.7 NJ and the minimum evolution method
Neighbour joining's select operation is a minimum evolution criterion. That is, we choose as the best tree the one with the smallest sum of branch lengths.

Neighbour joining is also employed in Rzhetsky and Nei's [35] minimum evolution method. This involves the construction of a neighbour joining tree from the data. Topologies close to the NJ tree topology are also examined. Finally, the shortest tree is selected as the best tree. This provides a way of reducing the search space to a neighbourhood of trees.

In practice, the true tree is usually close to the ME tree. However, this does not apply to all data sets, and proofs have only described expected behaviour. Testing has indicated that searching the tree space around the ME tree does not add much value for money.
4.6 Least Squares
The method of least squares is a well studied tool for parameter estimation. It is also (theoretically) very applicable to the problem of branch length estimation for phylogenetic trees [7]. This, in turn, makes it very useful in minimum evolution based phylogenetic inference.
4.6.1 Estimating Branch lengths
The ordinary least squares method minimizes the following criterion:

Σi<j (dij − δij)² (4.14)

where dij is the pairwise distance between sequences i and j, and δij is the sum of the branch lengths between i and j on a tree of a certain topology. That is, it minimizes the unweighted sum of squared residuals.
Weighted least squares is similar but weights observations by their reliability. That is:

∑_{i<j} w_ij (d_ij − δ_ij)^2   (4.15)
where w_ij is usually taken to be the reciprocal of the variance of the observed distance. If
the weights are correct and observations are independent then WLS is statistically optimal [7].
This is implemented as the program FITCH in the PHYLIP package.
The independence assumption often does not hold, as distances are correlated whenever the paths
between two pairs of species have branches in common. In this case the optimal method is
generalized least squares (GLS). This minimizes the weighted sum of cross products of residuals:
∑_{i<j, k<l} w_{ij,kl} (d_ij − δ_ij)(d_kl − δ_kl)   (4.16)
where w_{ij,kl} is the appropriate element of the inverse of the variance-covariance matrix of the
distances.
This method is especially appealing because (theoretically) the GLS test statistic should
have a χ2 distribution under the true topology. This suggests methods to construct confidence
sets.
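The GLS estimate itself follows the standard linear-model formula ν̂ = (AᵀV⁻¹A)⁻¹AᵀV⁻¹d, where A is the path matrix mapping branch lengths to path lengths and V is the variance-covariance matrix of the distances. A sketch with an invented diagonal V (so this particular instance reduces to WLS):

```python
import numpy as np

# Path matrix for the four-taxon topology ((A,B),(C,D)):
# rows are pairs (AB, AC, AD, BC, BD, CD), columns are branches (a, b, c, d, e).
A = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

true_v = np.array([0.10, 0.20, 0.30, 0.40, 0.05])   # invented branch lengths
d = A @ true_v                                      # exact additive distances

V = np.diag([0.01, 0.02, 0.02, 0.02, 0.02, 0.01])   # invented covariance matrix
W = np.linalg.inv(V)

# GLS normal equations: (A^T W A) v = A^T W d.
v_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ d)
```

A full (non-diagonal) V, as estimated by Susko's method, slots into the same two lines; only the inversion cost grows.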
Bulmer [7] gives a method for estimating branch lengths using GLS. This follows from known
general theoretical results for GLS. Susko [40] provides a general method to calculate the variance-
covariance matrix. This is considered along with his proposed GLS hypothesis testing technique in
Chapter 6.
If we are considering an N taxa tree, the inversion of the covariance matrix has O(N^6) complexity.
This dominates the time complexity of the generalized least squares approach. However, it
only needs to be computed once. If it is precomputed, the naive algorithm is O(N^5).
In fact, Bryant and Waddell have devised an O(N^4) algorithm to deal with this problem, and
have also shown that this is a lower bound on its time complexity [5].
4.6.2 Minimum Evolution Method with Least Squares
Rzhetsky and Nei [35] have discussed the use of least squares in a minimum evolution method
of phylogenetic inference. This method involves the construction of a neighbour joining tree
from the data. Branch lengths of the NJ tree are fitted using the OLS estimator. Topologies close
to the NJ tree topology are also fitted. For each tree, the length of the tree is calculated as the
sum of its branch lengths. Finally the shortest tree is selected as the best one.
A major disadvantage of all three least squares approaches is that negative branch lengths
can be generated. These lengths have no biological meaning, and they make the application of
minimum evolution unclear when tree length is used as a criterion. This means that some positivity
constraints must be incorporated into either the branch length estimation or the calculation of
tree length.
Gascuel et al [18] provide an excellent survey of positivity constraints. They also investigate
the consistency of minimum evolution methods with least squares methods, including WLS and
GLS, given different positivity constraints. They found that OLS is generally consistent while
WLS and GLS are inconsistent. However, this does not eliminate the possibility of using GLS
for tree reconstruction. Consistency is an asymptotic property, while in reality sequences are finite,
and other characteristics of these methods need further investigation.
4.7 Bayesian Tree Reconstruction
We may also take a Bayesian approach to the problem of tree reconstruction. As might be expected,
this involves finding the tree with the highest posterior probability. For tree T_i, this means
evaluating:
f(T_i | D) = f(D | T_i) f(T_i) / ∑_{j=1}^{B(S)} f(D | T_j) f(T_j)   (4.17)

f(T_i) is the prior distribution over the space of trees and is usually set to be uniform, i.e.
f(T_i) = 1/B(S), where B(S) is the number of trees with S taxa. We also need to evaluate the
following integral:

f(D | T_i) = ∫_ν ∫_θ f(D | T_i, ν, θ) f(ν, θ) dν dθ   (4.18)
Where ν and θ are the vector of branch lengths and other model parameters respectively.
This integral cannot be evaluated analytically. Instead it is approximated using the Markov
Chain Monte Carlo (MCMC) method. In this method, chains roam around, taking samples from the space
of S-taxa trees. After many (millions of) iterations, the proportion of samples taken at (time
spent at) each tree approximates the posterior probability of that tree.
In practice, the Metropolis-coupled MCMC algorithm is often used. This involves running
n heated chains and one unheated chain. Only samples from the cold chain count, but a swap
between two randomly selected chains is proposed at each time step. Getting trapped at local
maxima is a big problem for hill-climbing algorithms; in effect, the swaps allow the cold chain
to escape from this situation.
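The mechanics can be illustrated on a toy discrete state space: ten states stand in for tree topologies, with invented target weights. Chain 0 is the cold chain; heated chains target the distribution raised to the power 1/T, and swaps between chains are accepted with the usual Metropolis ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised target over ten discrete states (a stand-in for tree space).
w = np.array([1, 2, 1, 8, 1, 1, 3, 1, 1, 1], dtype=float)
temps = [1.0, 2.0, 4.0]          # chain 0 is the cold (unheated) chain
state = [0, 0, 0]                # current state of each chain
counts = np.zeros(len(w))

def log_target(s, T):
    # Heated chains sample the target raised to 1/T, flattening the landscape.
    return np.log(w[s]) / T

for _ in range(20000):
    # Standard Metropolis update within each chain (propose a neighbour state).
    for c, T in enumerate(temps):
        prop = (state[c] + rng.choice([-1, 1])) % len(w)
        if np.log(rng.random()) < log_target(prop, T) - log_target(state[c], T):
            state[c] = prop
    # Propose swapping the states of two randomly chosen chains.
    i, j = rng.choice(len(temps), size=2, replace=False)
    log_ratio = (log_target(state[i], temps[j]) + log_target(state[j], temps[i])
                 - log_target(state[i], temps[i]) - log_target(state[j], temps[j]))
    if np.log(rng.random()) < log_ratio:
        state[i], state[j] = state[j], state[i]
    counts[state[0]] += 1        # only the cold chain contributes samples

posterior = counts / counts.sum()   # should approximate w / w.sum()
```

In a real phylogenetic implementation the states are trees and the proposal is a topology or branch length perturbation, but the bookkeeping is the same.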
This method is further discussed in Chapter 5.6 in view of its hypothesis testing capabilities.
4.8 Trees from Alignments, Alignments from Trees
Phylogenetic tree reconstructions from sequence data require a set of aligned sequences. The
columns of these alignments represent the evolution of a particular site in a sequence.
For a tree to be reconstructed we need the sequences to have descended from a common
ancestor, that is, to be homologous. Ideally reconstruction methods should be able to
deal with insertions and deletions as well as substitutions, but this is rarely incorporated into
reconstruction methods. Instead, conserved blocks (with no indels) are used.
It is clear that good tree reconstructions require good alignments. Phylogenetic trees have
a somewhat circular relationship with the multiple alignment problem. On one hand, they can
be used to guide the order of alignment. This is certainly the case in the popular ClustalW
alignment program which uses neighbour joining. On the other hand a good alignment is
essential for accurate inference of phylogenetic trees. Algorithms exist to do these two things
simultaneously. For an overview see Durbin et al, chapter 7[11].
What about sequences that are only distantly related? ‘Don’t use them’ is usually the answer:
the error involved is just too high.
Chapter 5
Phylogenetic Hypothesis Tests
The main problem of phylogenetic inference is that the process of evolution is stochastic. We
cannot be sure of reconstructing the true tree even if we have the correct model. There is no
way of determining absolutely if we have the correct answer.
It can be difficult to place any faith in a tree reconstructed from a sequence alignment when
it conflicts with evidence of other phylogenies. This evidence can include taxonomy, geography
and even different partitions of the same genome.
The range of reconstruction methods available may not help clarify the issue either. Although
the methods described in Chapter 4 can be seen on a spectrum of parameterization, they
do not often agree on what the best tree is. This is often because the underlying criteria
for best trees are specified differently. The maximum likelihood tree may be the most parsimonious
tree in some cases, but not always. For model based methods, small differences between models of
evolution may also cause conflicting results.
This motivates the need for some sort of confidence testing method for trees. However,
different types of tests (of topology) can give rise to different conclusions. This may represent a
clash between Bayesian and frequentist approaches, and between non-parametric and parametric tests [6].

In this chapter, I examine the commonly used phylogenetic hypothesis testing techniques.
These include tests based on parametric and non-parametric bootstrapping and Bayesian-
MCMC techniques. However, first I consider the need for confidence region estimation in these
tests.
5.1 Confidence Regions of Phylogenetic Trees
Specifying confidence regions for phylogenetic trees, rather than making point estimates, makes
sense.
Fitting one tree to all the data at hand involves a loss of phylogenetic information at each
stage of the estimation process. The problem arises from estimating a discrete object from
continuous measurements. Holmes discusses the ‘rounding’ problem in phylogenetics as the
‘replacement of one functional stretch of DNA (usually a gene) by the gene tree’1 [24]. We need
to look at how the conversion from continuous to discrete, the rounding, at different stages in
the reconstruction process affects the outcome.
When we convert a matrix of characters into distances we lose information about how
mutation differs in different parts of the sequence. Whether it is better to use whole genomes,
or to create many trees from different partitions and try to reach a consensus, is an unanswered
question. We also need to consider what it means to output a tree as an answer if the true structure
is not treelike. Conflicting signals in the data may not always be noise, so they should not simply
be discarded (as is the case if the data is forced to fit one tree).
The space of trees is enormous, so tests that declare a particular tree wrong given the data
are not particularly helpful in the search for the correct one, especially when there is a lot of
conflicting signal around. However, the fact that there is conflicting signal should be able to
tell us something. This could be expressed via a probability distribution over trees or confidence
regions. However, it is not clear how we might go about constructing either without a notion
of distance between trees, which is needed to calculate, for example, the mean squared error.
Tree topologies are discrete combinatorial objects that represent a branching order. However,
this discrete space becomes more complex as we consider variable branch lengths. In fact
it becomes a union of manifolds, where each manifold represents trees with a fixed topology but
varying branch lengths. However, the union is not a manifold itself, as discontinuities appear where
the manifolds join (it is not smooth) [3]. When attempting to make phylogenetic inferences we need
to consider what effect this sort of landscape has on the problem.
On the other hand, classical hypothesis testing methods have a natural duality with confidence
regions, and the testing methods described in the rest of the chapter use this. While the other
issues of distances and geometry remain unclear, these techniques need to be well understood.
5.2 The Bootstrap
The bootstrap was introduced into phylogenetics as a way to measure confidence in reconstructed
trees.
[In the] case of phylogeny estimation, we do not know how the estimated tree
1A gene tree differs from a species tree in that it traces the evolution of a particular gene. This involves a different set of problems, such as gene duplication within a species.
will be expected to vary around the unknown true tree, and it is the function of the
bootstrap to give us an estimate of that distribution. (Felsenstein and Kishino [16]).
The bootstrap is used to determine how accurate an estimate θ̂ is of the true parameter
value θ. This is done by estimating the parameter using data replicated from the original data.
Each bootstrap data set x^(i) produces an estimate θ̂^(i). The theory suggests that we should be
able to infer the distribution of θ̂ − θ from that of θ̂^(i) − θ̂ [12].
The above formulation was created for the non-parametric bootstrap. However the motivation
applies (with some slight differences) to the newly popular parametric bootstrap, the
difference between the two being that the parametric bootstrap relies on a model of evolution.
The two approaches are discussed below.
5.3 The Non-parametric Bootstrap
The non-parametric bootstrap for phylogenetic trees is based on the more general bootstrap
technique in statistics. The algorithm is simple: from an alignment of m species draw n columns
with replacement. These n columns are then used to reconstruct a tree. This is done R times
and a consensus tree can be drawn from the R bootstrap trees.
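The resampling step is straightforward to sketch. Here the alignment is a toy example and, instead of a full tree reconstruction per replicate, the 'grouping' statistic simply checks whether sequences 0 and 1 remain closer to each other than to the other sequences under Hamming distance; the bootstrap proportion is the fraction of replicates where they do:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy alignment: 4 sequences (rows) by 12 sites (columns).
aln = np.array([list(s) for s in [
    "ACGTACGTACGT",
    "ACGTACGAACGT",   # nearly identical to sequence 0
    "TTGTACCTACCA",
    "TTGAACCTACCA",   # nearly identical to sequence 2
]])

def groups_together(cols):
    """Are sequences 0 and 1 closer to each other than to sequences 2 and 3?"""
    d = lambda i, j: int((cols[i] != cols[j]).sum())   # Hamming distance
    return d(0, 1) < min(d(0, 2), d(0, 3), d(1, 2), d(1, 3))

R = 200
hits = 0
for _ in range(R):
    idx = rng.integers(0, aln.shape[1], size=aln.shape[1])  # resample columns
    if groups_together(aln[:, idx]):
        hits += 1
bp = hits / R    # bootstrap proportion for the (0, 1) grouping
```

In real use each replicate would be fed to a full reconstruction method (NJ, ML, parsimony) and the proportions read off the consensus tree.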
The statistic that is usually used to interpret bootstrap data is the bootstrap proportion
[12] of a group of descendants of a particular common ancestor (a monophyletic group).
Definition 5.3.1 The bootstrap proportion (BP), for a particular monophyletic group, is the
proportion of bootstrap replicates that agree with the group in the original data set.
Within this definition the monophyletic group doesn’t depend on the topology of the subtree
induced by it. The bootstrap proportion is also known as a bootstrap confidence level. It is also
sometimes misleadingly called a bootstrap P-value [12]. However, it should be emphasized that it
does not have the usual P value interpretation.
A Multinomial Model
The bootstrap can be described via a multinomial model.
Let N be the number of sequences we have, with S the length of each of these sequences.
Assume that these sequences are aligned. If we are dealing with nucleotide sequences then
K = 4^N is the total number of possible site columns (vectors of length N). These vectors can be
enumerated as X_1, X_2, . . . , X_K with associated probability vector π = (π_1, . . . , π_K). So, our
observed sequence alignment can be seen as S independent samples (columnwise) from this set. We can
take the set of distinct vectors observed as Y_1, . . . , Y_k. This is a multinomial model. We also
have:

π_i = (1/|K|) ∑_{j=1}^{S} 1(Y_i = X_j)   (5.1)
The space of trees can be partitioned according to tree topology. Now, the partition that
π (the vector of true probabilities) falls into determines the correct tree. This relies on the
assumption that the true value π results in correct distance estimates. It is well known
that reconstruction methods such as neighbour joining guarantee the correct tree given correct
distances as input. So, we get the correct tree. The problem is whether or not the estimated π
falls in the same partition. The bootstrap can be used to shed light on how often we can expect
this to happen.
5.3.1 Testing Phylogenies using the Non-parametric Boostrap
Efron et al [12] claim that Felsenstein’s application of the bootstrap is non-standard in that the
statistic, the tree, does not change in a continuous way. The discontinuities
in tree space occur as the tree topology changes. This means that a little variability near a
boundary (discontinuity) may result in the selection of a different tree, while in another area it may
have negligible effect.
Holmes also points out that there does not exist a theory for the bootstrap with respect to
discrete combinatorial objects such as trees. We must also consider the sparsity of the data:
many possible columns will have zero counts. However, the development of confidence testing
methods using the bootstrap has persisted.
Posterior probabilities using the Non-Parametric Bootstrap
The following is an example of how the bootstrap can be used to calculate a confidence estimate
for θ̂. This is basically a probabilistic measure of the distance between the estimate θ̂ and a
boundary of a region of interest R. Using bootstrap replicates i, estimate α̂ = Pr(θ̂^(i) ∈ R).
This is an a posteriori probability that θ ∈ R assuming a uniform prior. This construction
means that if the boundary curves away from θ̂ then the confidence estimate increases.
This suggests that bootstrap proportions can be used to estimate Bayesian posterior prob-
abilities. The validity of this is discussed in section 5.7.
5.3.2 How well does it work?
The non-parametric bootstrap claims to be free of the shackles of evolutionary models and the
problems entailed by them. However, the technique implicitly assumes that sites evolve iid.
This contradicts, for example, evidence that some regions of proteins are highly conserved for
functional reasons. A block bootstrap has been developed that treats a whole conserved
block as an independent ‘site’. The block bootstrap does not, however, avoid the other problems
associated with the technique.
The main purpose of the bootstrap is to lend some confidence to monophyletic groupings. On
one hand, it is reasonable to suggest that rare groupings are not to be relied upon. However, it
is not particularly clear what the frequent occurrence (high bootstrap proportion) of a group means.
The most intuitive interpretation of the bootstrap proportions is as a measure of repeatability.
This is calculated as the average of bootstrap proportions computed from samples drawn
from the true distribution [16]. That is, it gives us an idea of how sensitive trees (and subtrees)
are to noise in the data. If a group occurs frequently then it is less likely to be prone to sampling
artefacts. There has been a great deal of theory justifying the bootstrap proportion as a
measure of repeatability [1][12].
Bootstrap precision indicates how well the bootstrap proportions for a finite set of data replicates
match what would be obtained from an infinite set. The non-parametric bootstrap
has been found to have high precision when the number of replicate data sets is reasonably
large (say 1000). That is, when the number of replicates is sufficiently high, we can assume
that the behaviour is not too different from the asymptotic behaviour.
However, it is often pointed out that a very precise estimate of the wrong thing does not
really help anyone[22]. A similar argument is made against the using repeatability to validate
a result.
The attribute of the bootstrap that should resolve this issue is bootstrap accuracy. Accuracy,
for a bootstrap, is the fraction of times that repeated sampling recovers the true topology. This use
of the bootstrap as a measure of accuracy has been attacked many times for being biased and
highly variable.
The Bias of the Bootstrap
Hillis and Bull [22] have empirically shown that the probability that the bootstrap tree and
the true tree fall in the same region is less than that of the original estimate and the true tree.
This suggests that the bootstrap is biased. However, many have taken up the defense of the
bootstrap [16], arguing that this is simply a matter of incorrect usage.
If θ is the parameter we are estimating, the argument goes like this. The distribution of
θ̂^(i) − θ is not the same as that of θ̂ − θ. It is well agreed that the former has around twice
the variance of the latter. To make inferences we need to examine the differences between
the bootstrap results and the tree generated from the original data. If used properly, bootstrap
estimates reasonably reflect the kinds and degree of variability around the true tree.
Variability of Bootstrap Proportions
Felsenstein and Kishino [16] claim that whether large bootstrap proportion values overestimate
or underestimate the probability that a group is correct depends on how informative the data is,
that is, whether or not the data is powerful enough to resolve the phylogeny. If it is, the
bootstrap proportion is usually an underestimate. This makes the bootstrap a conservative
test.

Felsenstein and Kishino suggest that a better use of the bootstrap proportion, P_B, is to look
at (1 − P_B) as a conservative assessment of the probability of finding this much evidence favouring
a group if it is not really present. This is like the Type I error in classical hypothesis testing.
5.4 The Parametric Bootstrap
The parametric bootstrap uses the parameters of the null hypothesis and the original data set
to simulate replicate data sets. Unlike the non-parametric bootstrap, this approach requires a
fully specified model of sequence evolution to generate replicates. The parameters involved are
some or all of a tree topology, branch lengths and substitution model parameters. These replicate
data sets are used to estimate the distribution of the test statistic under the null hypothesis.
P-values can be determined by ranking the values obtained from the simulation and determining
quantile values from there.
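The recipe can be sketched with a deliberately simple null model, here a Poisson distribution standing in for a full model of sequence evolution; the statistic is the dispersion index and the P-value comes from the rank of the observed value among the simulated ones:

```python
import numpy as np

rng = np.random.default_rng(2)

def statistic(x):
    return x.var() / x.mean()    # dispersion index; close to 1 under a Poisson

def simulate_null(theta, n):
    # Replicate data set simulated under the (fully specified) null model.
    return rng.poisson(theta, size=n)

data = rng.poisson(2.0, size=100)   # observed data
theta_hat = data.mean()             # ML estimate of the null-model parameter
obs = statistic(data)

reps = np.array([statistic(simulate_null(theta_hat, data.size))
                 for _ in range(999)])
# One-sided P-value from the rank of the observed statistic.
p_value = (1 + (reps >= obs).sum()) / (1 + len(reps))
```

In the phylogenetic setting `simulate_null` becomes sequence simulation down the null tree, and the statistic is typically a (log) likelihood ratio, as in the SOWH test below.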
The big advantage of this technique is that it guarantees that statistics are drawn from the
null hypothesis. This means there is overall less confusion about what the values generated
from bootstrap replicates mean.
The use of the parametric bootstrap has also been motivated by tests of monophyly,
that is, testing whether a group of species has a common ancestor. Traditionally, this has
been seen as a test of nested models, with the null hypothesis constraining certain species to form
a monophyletic group. Huelsenbeck et al give an example of how the space of trees can be
partitioned to reflect this constraint in [27].
The common approach has been to perform a likelihood ratio test. The assumption has
been that, since the hypotheses are nested, the test statistic −2 log λ has a χ²(k) distribution,
where λ is the likelihood ratio and k is the difference in the number of parameters between
hypotheses.
The problems start to appear when we try to determine the degrees of freedom k. The topology
of the tree is clearly a parameter of the models we are considering, yet a tree topology is also clearly
not an element of ℝ. This renders the use of the χ²(k) distribution invalid.
However, if we do know the distribution of −2 log λ under the null hypothesis we can proceed
with the hypothesis testing. This motivates the use of the parametric bootstrap.
5.4.1 Problems with the Parametric Bootstrap
Certainly, the debate over the use of the parametric bootstrap has been less inflamed than that
of the non-parametric bootstrap. The area of contention appears to be the pervasive problem
of whether results based on parameterized models can be trusted.
This knowledge of the underlying model makes parametric tests generally more powerful
than non-parametric tests2. But, as usual, this also makes them more susceptible to errors due to
model misspecification.
The parametric bootstrap is highly dependent on the model of evolution used and, of course,
on parameter estimation. For example, if the estimation of parameters is not very good, the
alternative hypothesis (with more parameters) might perform worse than the null hypothesis when
it shouldn’t. This often causes tests based on this method to be too liberal.
Huelsenbeck et al [27] found that with rates of change varying over the tree, the tests
became too liberal. A change in only one parameter could cause a noticeable bias. This means a
higher rate of Type I error.
5.5 Bootstrap Based Tests
In this section, I examine two related hypothesis testing methods based on the bootstrap.
The Kishino-Hasegawa test and the Shimodaira-Hasegawa test both use the non-parametric
bootstrap while the SOWH test is parametric. Each of these tests attempts to simulate the
distribution of a test statistic under a null hypothesis.
However, first we need to examine the need for centering bootstrap data to make it conform
to the null hypothesis.
2Although this is not always the case. See [24].
5.5.1 Centering
If bootstrap estimates are not drawn from the null hypothesis distribution we lose the ability to
detect the alternative hypothesis. That is, the test has less power [43]. This is related to the
incorrect use of the bootstrap discussed in Section 5.3.2.
For example, let µ, the true mean, be the value we are trying to estimate, and y a set
of samples. Consider calculating a bootstrap proportion by counting how many times the
following occurs in the bootstrap replicate sets:

ȳ^(i) − µ ≥ ȳ − µ   (5.2)

where the bar indicates the sample mean. But E(ȳ^(i)) = ȳ, so the proportion of times that this
is true will always be around one half. This certainly demonstrates a lack of power.
A better method employs centering [43]. That is, calculate bootstrap proportions from:

ȳ^(i) − ȳ ≥ ȳ − µ   (5.3)

This ensures that the statistic in question is drawn from the null hypothesis and that the power
of the test is retained. Both of the following techniques use centering.
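A small numerical illustration of the difference, with an invented sample whose true mean is 1 while the null hypothesises 0: the uncentered proportion (5.2) hovers around one half regardless, while the centered proportion (5.3) is near zero, correctly signalling evidence against the null:

```python
import numpy as np

rng = np.random.default_rng(3)

mu0 = 0.0                        # mean claimed by the null hypothesis
y = rng.normal(1.0, 1.0, 50)     # sample actually drawn with mean 1
ybar = y.mean()

R = 2000
boot = np.array([rng.choice(y, y.size).mean() for _ in range(R)])

# Uncentered proportion (5.2): compares replicates to the null directly.
uncentered = (boot - mu0 >= ybar - mu0).mean()
# Centered proportion (5.3): replicate deviations from ybar represent H0.
centered = (boot - ybar >= ybar - mu0).mean()
```

The uncentered version would give roughly 0.5 even if the true mean were 10; only the centered version responds to how far the data sit from the null.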
5.5.2 The Kishino Hasegawa Test
The Kishino and Hasegawa (KH) test determines whether two trees, selected a priori, are equally
well supported by the data. This is determined by examining the difference in likelihood of the
two trees. Let the two trees under consideration be T_1 and T_2, and let L_i be the log-likelihood of
tree T_i. If they are equally well supported then δ = L_1 − L_2 ≈ 0.
In terms of hypothesis testing we compare:

H_0 : E(δ) = 0   (5.4)
H_A : E(δ) ≠ 0   (5.5)
Clearly the distribution of δ under H_0 is needed to test these hypotheses. The KH test derives
this using the non-parametric bootstrap. To do this, n bootstrap replicates of the original data
are generated. For each replicate i, parameters are re-estimated to obtain maximum likelihood
estimates for T_1 and T_2. Hence, δ^(i) = L_1^(i) − L_2^(i) is calculated for each replicate. Centering
is then applied to the δ^(i) to ensure conformity to the null hypothesis. The centered statistics
then estimate the distribution of δ under the null hypothesis.
The plausibility of δ is determined by checking whether it lies in the appropriate confidence
interval, determined from a ranked list of the bootstrap statistics. For example, for a 5% test it
should fall between the 2.5% and 97.5% points.
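A sketch of the KH machinery using RELL-style resampling of per-site log-likelihoods; the per-site values here are randomly generated stand-ins for values computed from two real trees, and the replicates are centered so that they conform to E(δ) = 0:

```python
import numpy as np

rng = np.random.default_rng(4)

n_sites = 300
# Hypothetical per-site log-likelihoods for trees T1 and T2; in practice these
# come from a likelihood calculation, here they are invented for illustration.
l1 = rng.normal(-1.0, 0.3, n_sites)
l2 = l1 + rng.normal(0.0, 0.1, n_sites)   # T2 fits about equally well

delta = l1.sum() - l2.sum()               # observed test statistic

R = 1000
reps = np.empty(R)
for r in range(R):
    idx = rng.integers(0, n_sites, n_sites)   # resample site log-likelihoods
    reps[r] = l1[idx].sum() - l2[idx].sum()
reps -= reps.mean()                           # centre onto H0: E(delta) = 0

lo, hi = np.quantile(reps, [0.025, 0.975])    # two-sided 5% test
reject = not (lo <= delta <= hi)
```

Resampling the fixed per-site values rather than re-optimizing each replicate is exactly the time saving the RELL shortcut (discussed under the SH test) provides.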
Problems with the KH test
The problems with the KH test have more to do with its usage than its construction. The test is
two sided on the assumption that T_1 and T_2 are selected a priori. That is, we have no knowledge
about which one is likely to be a better fit to the data.
If this does not hold then analyses based on this test become invalid. In [19], Goldman et al
show that when the maximum likelihood tree is used in such a comparison, we inevitably have
E(δ) > 0. This means that tests should not be based on the null hypothesis as stated above.
Unfortunately, the main use of likelihood based tests (like KH) is to test the maximum likelihood
tree against another tree (perhaps the second most likely). This severely restricts the use of the
KH test in real phylogenetic studies.
5.5.3 The Shimodaira Hasegawa Test
The Shimodaira Hasegawa (SH) test [37] is a correction to the KH test. It simultaneously
compares every tree in the set of plausible trees, Ω say. Null and alternative hypotheses are
formulated as follows.
H0 : All Tx ∈ Ω are equally good explanations of data (5.6)
HA : Some Tx ∈ Ω are better than others (5.7)
In the SH test, δ_x = L_ML − L_x is calculated for all T_x ∈ Ω, where L_ML is the likelihood of the
maximum likelihood tree for the data at hand. The plausibility of each δ_x is once again determined
by comparing it to the distribution of δ_x under H_0. Like the KH test, this distribution
is determined by the non-parametric bootstrap.
For each of the N bootstrap replicates i, the likelihood L_x^(i) of each T_x ∈ Ω is calculated with
parameters θ_x. These likelihood estimates are then centered for conformity to H_0:

L̃_x^(i) = L_x^(i) − L̄_x   (5.8)

where L̄_x is the mean of the L_x^(i) over the N replicates. This devalues the significance accorded
to the selection of T_ML a posteriori. The maximum likelihood value for each replicate i is then
determined as:

L̃_ML^(i) = max_{T_x ∈ Ω} L̃_x^(i)   (5.9)

We can then generate the required distribution by determining:

δ_x^(i) = L̃_ML^(i) − L̃_x^(i)   (5.10)
These δ_x^(i) approximate the distribution of the δ_x under the null hypothesis. We test whether
each T_x is plausible by checking whether δ_x lies in the 0 - 95% confidence interval determined by
ranking the δ_x^(i) values. Note that this is an appropriate one-sided test. A P value for T_x can
be calculated as:

P_x = (number of replicates with δ_x^(i) > δ_x) / N   (5.11)
where N is the number of bootstrap replicates generated. Note that it is extremely important
that all plausible trees are used. This set must be selected a priori for the test to remain valid.
We can then determine a confidence set as topologies with Px ≥ P∗ where P∗ is a prespecified
significance level. Let TML∗ be the topology that maximizes the expected likelihood. Shimodaira
and Hasegawa [37] note that the coverage probability of TML being included in the confidence
interval is greater than 1 − P∗.

Both the KH and SH tests use the RELL (resampling estimated log-likelihood) technique to
estimate likelihood values for bootstrap sets. Instead of finding the ML estimates of parameters
for each replicate, the RELL method uses the estimates from the original data set. Avoiding these
reoptimizations provides a significant time saving.
Relationship of SH and KH
Goldman et al [19] show that dividing the KH P value, P_KH, in half converts it to a one-sided
test. They also explore the relationship between the halved P value and the SH one sided test.
Now, the SH test P value exceeds P_KH/2 by an unknown amount. So, if P_KH/2 does not
reject the null hypothesis then neither would the SH test. However, if P_KH/2 does reject the
null hypothesis we cannot predict what the SH test might have done.
5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH)
The SOWH [41] test employs the parametric bootstrap to determine whether a tree T_0 is the
true topology for a given data set. This is formulated as:

H_0 : T_0 is the true topology   (5.12)
H_A : Another tree is the true topology   (5.13)
To determine whether or not we reject the null hypothesis, we consider a likelihood ratio
test of T_0 and T_ML. This can be expressed in terms of log-likelihoods as δ = L_ML − L_0. The
parametric bootstrap is then used to determine the distribution of δ under the null hypothesis.
The distribution is generated by simulating replicate data sets based on T_0 and θ_0, the ML
estimates of the free parameters for T_0 with respect to the original data set.
For each data set i, L_0^(i) is calculated from T_0 and the re-estimated ML parameters θ_0^(i). Similarly,
L_ML^(i) is determined from the T_x, θ_x^(i) that maximize the log-likelihood for data set i. Now
we can simulate the distribution of δ under the null hypothesis using:

δ^(i) = L_ML^(i) − L_0^(i)   (5.14)
We can now determine a 95% confidence interval by ranking these δ^(i). Our original δ is plausible
if it falls into this confidence interval. This is a one-sided test, as δ must be greater than zero [19].
Problems with SOWH
The SOWH test needs to find the ML tree for every replicate data set generated. This is clearly
a very time consuming process, so heuristic ML searches are used. A problem with this approach
is that the heuristic might miss the optimal tree, which may create artifacts if it happens on
many data sets. Selecting a suboptimal tree leads to a lowered log likelihood value, which
in turn reduces the value of δ^(i). On a large scale this serves to pull the mass of the distribution
closer to zero, making the test more liberal. Hence the SOWH test is prone to Type I errors.
To get around this, it is suggested that the most closely fitting model, meaning the most
highly parameterized, should be used. However, even the GTR model has been shown to be
inadequate [26].
While, theoretically, negative δ values shouldn’t occur in the SOWH test, they have been known
to arise in simulation. This can be attributed to poor parameter estimation, which again points
to the need for a realistic model of evolution.
5.6 Bayesian methods
Bayesian methods of testing tree topologies follow from the Bayesian reconstruction method. The
‘best’ tree is the one with maximum posterior probability. As in Section 4.7, this is determined
by approximating the posterior probability distribution over all trees with the appropriate
number of leaves. This immediately gives us an idea of the confidence we should assign to a
given tree and of how we can determine confidence regions.
This method also allows us to determine the posterior probability of a monophyletic group.
The algorithm is the same as that of whole tree testing. We merely determine the proportion
of samples that have the grouping of interest.
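Counting such a posterior probability from MCMC output is simple bookkeeping. In this sketch each sampled tree is (artificially) reduced to the set of its clades, each clade a frozenset of leaf labels:

```python
# Four hypothetical MCMC tree samples, each reduced to its set of clades.
samples = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
    {frozenset("AB"), frozenset("CD")},
]

clade = frozenset("AB")
posterior = sum(clade in tree for tree in samples) / len(samples)
print(posterior)   # 0.75: the (A,B) grouping appears in 3 of 4 samples
```

Whole-tree posterior probabilities are obtained the same way, counting complete clade sets instead of a single clade.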
The approach has many attractive features. It allows us to incorporate multiple sources
of uncertainty via the prior probabilities [6]. Many argue that posterior probabilities are more
intuitive than P-values. Simulations also indicate that the Bayesian method (as implemented
in MrBayes) is able to assign large amounts of confidence even when very few site changes are
observed.
On the other hand, this method is highly dependent on the substitution model. This is exacerbated
if uninformative priors are used. If informative priors are used, there is still the risk that the
priors are wrong, although Rannala and Yang’s work on MAP trees suggests that posterior
probabilities are insensitive to priors [33].
Another issue of concern is that the convergence of sampling to the desired distribution is
not guaranteed. Still more worrying is the evidence that posterior probabilities may support
contradictory hypotheses equally well.

Buckley’s investigations into the Bayesian approach show that it can definitely mislead
[6]. This is attributed to model misspecification. In fact, when the results were misleading, the
posterior probabilities were very low. These simulations also indicate that the results of the
SOWH test and Bayesian posterior probabilities are generally correlated, although the BPP shows
greater levels of uncertainty (because it can!).
5.7 Bootstraps and Posterior Probabilities
As mentioned in Section 5.3, Efron et al. [12] have suggested that bootstrap P-values can be used
as indicators of the posterior probability of a tree. However, simulation studies by Douady et al.
[10] provide evidence to the contrary, when posterior probabilities are calculated as described
in Section 5.6. This is not particularly surprising in light of the different information
that these values take as input [1].
Consider that Bayesian MCMC (BMCMC) methods are highly dependent on the model
of substitution. On the other hand, the measurement of bootstrap proportions is based on a
multinomial model of a matrix of characters.
Another difference that needs considering is the number of substitutions needed to obtain
significance from the two values. Posterior probabilities and bootstrap proportions showed the
greatest difference on short branches. The Bayesian MCMC methods placed high confidence
(exceeding 95%) in situations where very few substitutions were expected. Maximum likelihood
bootstrap and maximum parsimony bootstrap did not reach these confidence levels. This indicates
that BMCMC methods are much more sensitive to phylogenetic signal. On the other
hand, BMCMC methods are more likely to lend high support to incorrect groupings.
Alfaro’s recent investigations [1] show that Bayesian MCMC posterior probabilities and boot-
strap proportions ‘are not equivalent measure of confidence’. BMCMC values tend to have a
lower Type I error rate than do bootstrap methods. The conclusion is that a node with high
posterior probability is very likely to be correct, but only if the underlying model is correct. If
a node has a moderate bootstrap value, it should be considered to be highly dependent on the
data and may not be present if the data is extended.
5.8 Which Test?
The answer to this question is: the test that best suits the needs of the research being done.
Non-parametric bootstrap based tests can provide good indications of how sensitive a recon-
struction method is to variability in the data. However, much care must be taken when using
the bootstrap to find distributions of test statistics. Ensuring that the simulated distributions
conform to the null hypothesis is particularly important. These tests are also generally conser-
vative and high confidence values cannot usually be obtained for trees with very short branches.
Buckley [6] has found that SH has much greater uncertainty than SOWH; that is, SH P-values
are generally higher than those for SOWH.
Parametric bootstrap based tests can be very powerful but are vulnerable to errors due to
model misspecification. Inaccurate estimation of parameters usually leads to the test becoming
too liberal and prone to Type I error. Current models of site substitution do not appear to be
adequate to overcome this problem.
Tests based on Bayesian inference via Markov Chain Monte-Carlo methods give a more
intuitive picture of the uncertainty. If we want to determine how well results fit the data with
respect to a fully specified model then Bayesian posterior probabilities are a good choice. This
is especially the case when there is not much phylogenetic signal. However, it too is prone to
model misspecification and can lead to high support for conflicting trees.
All of the tests described above can be used to provide some sort of confidence region on the
space of trees. They also require some sort of optimization (maximum likelihood estimation)
over a large number of trees, which is very computationally expensive. The heuristics introduced
to reduce this cost also reduce the effectiveness of these tests.
Clearly, there is still scope for the development of alternative tests for phylogenetic inference.
Chapter 6
Generalized Least Squares in
Phylogenetic Hypothesis Testing
Branch length estimation for a tree T with N taxa can be described by a linear model¹:

y = Xβ + ǫ (6.1)
The variables are defined as follows. y = (Y1, . . . , Yn)T is the n = N(N − 1)/2 vector of
observed pairwise distances between sequences; that is, Yj is the distance between the j-th
pair of sequences. β = (β1, . . . , βm)T is an m = 2N − 3 vector of estimated branch lengths.
ǫ = (ǫ1, . . . , ǫn)T is a vector of unobserved errors. The design matrix, X, is the n × m
incidence matrix of branches in the tree with respect to pairs of leaves: if j indexes the
pairwise distance between sequences r and s, then Xjl = 1 if branch l lies on the path between r and s, and Xjl = 0 otherwise.
Similarly, we can write the distance between leaves of the tree as δj = δ(r, s). This can be
calculated simply by adding up all the branches in the path between the pair:

δj = Σ_{l=1}^{m} Xjl βl (6.2)
In matrix notation:
δ = Xβ (6.3)
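As an illustration of equations (6.2) and (6.3), the following sketch builds the incidence matrix X for the four-leaf topology ((1,2),(3,4)) and recovers the pairwise distances from a made-up set of branch lengths (the branch numbering and lengths here are hypothetical):

```python
from itertools import combinations

# paths[(r, s)] lists the branches on the path between leaves r and s.
# Branches 1-4 are the leaf branches; branch 5 is the internal branch.
paths = {
    (1, 2): [1, 2],
    (1, 3): [1, 5, 3],
    (1, 4): [1, 5, 4],
    (2, 3): [2, 5, 3],
    (2, 4): [2, 5, 4],
    (3, 4): [3, 4],
}
pairs = list(combinations([1, 2, 3, 4], 2))
m = 5  # number of branches, 2N - 3 with N = 4

# Row j of X flags the branches on the path between the j-th pair of leaves.
X = [[1 if l + 1 in paths[p] else 0 for l in range(m)] for p in pairs]

beta = [0.1, 0.2, 0.3, 0.4, 0.05]  # made-up branch lengths
# delta_j = sum_l X_jl * beta_l, equation (6.2)
delta = [sum(X[j][l] * beta[l] for l in range(m)) for j in range(len(pairs))]
```

For instance, the path between leaves 1 and 3 crosses branches 1, 5, and 3, so δ for that pair is 0.1 + 0.05 + 0.3 = 0.45.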
Let V be the variance-covariance matrix for y. We know from the theory of linear models,
¹See, for example, [34], for a more detailed account of the theory of linear models
we can fit δ to y for a given tree topology T by minimizing the GLS test statistic, gT:
gT = (y − Xβ)T V−1(y − Xβ) (6.4)
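A minimal sketch of this minimization, in pure Python with a naive linear solver; the quartet design matrix, branch lengths, and identity covariance below are illustrative stand-ins, not values from the thesis simulations:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gls_fit(X, y, Vinv):
    """Return (beta_hat, g_T) minimizing (y - X b)^T Vinv (y - X b)."""
    n, m = len(X), len(X[0])
    # Normal equations: (X^T Vinv X) b = X^T Vinv y
    XtVi = [[sum(X[r][i] * Vinv[r][c] for r in range(n)) for c in range(n)] for i in range(m)]
    A = [[sum(XtVi[i][r] * X[r][j] for r in range(n)) for j in range(m)] for i in range(m)]
    b = [sum(XtVi[i][r] * y[r] for r in range(n)) for i in range(m)]
    beta = solve(A, b)
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(m)) for r in range(n)]
    g = sum(resid[r] * Vinv[r][c] * resid[c] for r in range(n) for c in range(n))
    return beta, g

# Four-leaf tree ((1,2),(3,4)): rows are pairs (1,2),(1,3),(1,4),(2,3),(2,4),(3,4);
# columns are the four leaf branches plus the internal branch.
X = [[1,1,0,0,0],[1,0,1,0,1],[1,0,0,1,1],[0,1,1,0,1],[0,1,0,1,1],[0,0,1,1,0]]
true_beta = [0.1, 0.2, 0.3, 0.4, 0.05]
y = [sum(X[r][j] * true_beta[j] for j in range(5)) for r in range(6)]  # noise-free
I6 = [[1.0 if r == c else 0.0 for c in range(6)] for r in range(6)]    # toy V^{-1}
beta_hat, g = gls_fit(X, y, I6)
```

With noise-free distances the fit recovers the generating branch lengths and gT is zero up to rounding; with real distance estimates V⁻¹ would come from one of the covariance estimates discussed below.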
The statistical theory behind generalized least squares is well developed. It indicates, theoretically,
that the GLS test statistic can be used to test the phylogenetic hypotheses:
H0 : the given tree T is the true tree (6.5)
HA : another tree is the true tree (6.6)
If y, the observed distances, are normally distributed, then it is well known that under H0 the
GLS test statistic has a χ2(k) distribution. The degrees of freedom, k, can be calculated as the
number of observed distances minus the number of branches in the tree; from the construction
above, k = n − m. A P-value can be calculated as the probability of exceeding the observed
statistic value under the χ2(k) distribution. This is used to test the null hypothesis for a given
significance level α in the usual fashion.
Confidence regions over sets of topologies can be constructed by calculating gT over all
topologies with N taxa. The (1 − α) × 100% set can be determined by finding all topologies
with P-values greater than or equal to α.
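Computing these P-values needs only the χ²(k) survival function, which for integer k follows a simple recurrence. The sketch below (standard library only; the dictionary of statistics is made up for illustration) shows the construction of the (1 − α) × 100% set:

```python
import math

def chi2_sf(x, k):
    """P(chi^2_k > x) for integer k >= 1, via the recurrence
    Q(x; k) = Q(x; k-2) + (x/2)^(k/2 - 1) e^(-x/2) / Gamma(k/2)."""
    if k == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if k == 2:
        return math.exp(-x / 2.0)
    return (chi2_sf(x, k - 2)
            + (x / 2.0) ** (k / 2.0 - 1.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0))

def confidence_region(g_by_topology, k, alpha=0.05):
    """Topologies whose GLS statistic g_T gives a P-value >= alpha."""
    return [t for t, g in g_by_topology.items() if chi2_sf(g, k) >= alpha]

# Hypothetical statistics for the three unrooted four-leaf topologies (k = 1).
g_values = {"(12)(34)": 0.4, "(13)(24)": 5.2, "(14)(23)": 9.8}
region = confidence_region(g_values, k=1)
```

Here only the first topology survives at α = 0.05, since the other two statistics exceed the 95th percentile of χ²(1) (about 3.84).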
6.1 Sample Average Variance and Covariance
The use of the χ2(k) distribution requires the observed distances to be multivariate normal.
Susko provides a proof that this holds when the number of sites used to generate the distances
is large, assuming the y are ML estimates.
The following provides a more detailed derivation of the sample average covariance matrix,
V, as proposed in [40].
Theorem 6.1.1 If the observed pairwise distances y are maximum likelihood distances
derived from sequences of length n, then as n → ∞, y is multivariate normal.
Proof This proof basically follows from well known results in the asymptotic theory of
maximum likelihood estimation [28][29]. Let x be the characters of the jth pair of aligned
sequences and d the distance between the pair. Let pj(x, d) denote the probability of
the data x at a site for the jth pair of taxa. Also, let
l(x; d) = Σ_{i=1}^{n} log pj(xi; d) (6.7)

l′(x; d) = Σ_{i=1}^{n} (∂/∂d) log pj(xi; d) (6.8)

l′′(x; d) = Σ_{i=1}^{n} (∂²/∂d²) log pj(xi; d) (6.9)
Let yj be the observed distance and dj the true distance of the jth pair. Now if
yj is a maximum likelihood estimate we can assume it was found by solving the condition

l′(x; yj) = 0 (6.10)

and note that

Edj(l′(x; dj)) = 0 (6.11)

We can approximate l′(x; yj) by expanding around dj:

0 = l′(x; yj) ≈ l′(x; dj) + l′′(x; dj)(yj − dj) (6.12)

i.e. (yj − dj) = −l′(x; dj) / l′′(x; dj) (6.13)

i.e. √n (yj − dj) = [(1/√n) l′(x; dj)] / [−(1/n) l′′(x; dj)] (6.14)

Note, as n → ∞,

−(1/n) l′′(x; dj) = I(dj)/n →p Ī(dj) (6.15)

where I(dj) is the observed Fisher information of the distance and Ī(dj) is the expected Fisher
information. Hence

√n (yj − dj) = (1/√n) (I(dj)/n)^{−1} l′(x; dj) (6.16)

i.e. √n (yj − dj) = √n I(dj)^{−1} l′(x; dj) (6.17)

Now let u be a vector such that

uj = √n I(dj)^{−1} l′(x; dj) (6.18)
Now each uj is a sum of independent random variables with expectation 0. So, by the Multi-
variate Central Limit Theorem,

(1/√n) u ∼ Np(0, V) (6.19)

where V is the limiting covariance matrix of (1/√n) u. This in turn means that

y = (1/√n) u + d (6.20)

y ∼ N(d, V) (6.21)
The sample average covariance matrix, for n large, is formulated as follows. For the diagonal
entries,

Vjj = var((1/√n) uj) = (1/n) E(uj − 0)² (6.22)

= (1/n) E(√n I(dj)^{−1} l′(x; dj))² (6.23)

= I(dj)^{−2} E(l′(x; dj))² (6.24)

= −I(dj)^{−2} E(l′′(x; dj)) (6.25)

= I(dj)^{−2} × I(dj) (6.26)

= 1 / I(dj) (6.27)

= 1 / (−l′′(x; dj)) (6.28)

where (6.25) uses the score identity E(l′)² = −E(l′′) and (6.26) approximates the expected
information by the observed information. That is,

Vjj = 1 / Σ_{i=1}^{n} −(∂²/∂dj²) log pj(xi; dj) (6.29)
The off-diagonal entries are, similarly,

Vjk = (1/n) cov(uj, uk) (6.30)

= (1/n) cov(√n I(dj)^{−1} l′(x; dj), √n I(dk)^{−1} l′(x; dk)) (6.31)

= (1/n) cov(√n Vjj l′(x; dj), √n Vkk l′(x; dk)) (6.32)

= Vjj Vkk cov(l′(x; dj), l′(x; dk)) (6.33)

= n Vjj Vkk E(l′(x; dj) l′(x; dk)) (6.34)

using Vjj = I(dj)^{−1} from above, and estimating the expectation by the sample average

E(l′(x; dj) l′(x; dk)) = (1/n) Σ_{i=1}^{n} [(∂/∂dj) log pj(xi; dj)] [(∂/∂dk) log pk(xi; dk)] (6.35)
So, with a large number of sites, y is multivariate normal with the variance-covariance matrix V
described above (the sample average covariance).
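Equation (6.29) is straightforward to check numerically. The sketch below evaluates a diagonal entry for a single hypothetical pair under the Jukes-Cantor model, with the second derivatives taken by central finite differences; this is an illustrative choice, not the thesis's implementation:

```python
import math

def log_site_prob(match, d):
    """Log-probability of a matching/mismatching site pair under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    return math.log(p_same) if match else math.log((1.0 - p_same) / 3.0)

def v_jj(matches, d, h=1e-4):
    """1 / sum_i -(d^2/dd^2) log p(x_i; d), derivatives by central differences."""
    info = 0.0
    for match in matches:
        f0 = log_site_prob(match, d)
        fp = log_site_prob(match, d + h)
        fm = log_site_prob(match, d - h)
        info += -(fp - 2.0 * f0 + fm) / (h * h)
    return 1.0 / info

# 1000 sites, 365 of them mismatches: roughly what JC distance 0.5 predicts.
matches = [True] * 635 + [False] * 365
v = v_jj(matches, 0.5)
```

For this toy data the summed negative second derivatives come to roughly 1100, giving a variance estimate near 0.0009, in line with the order of magnitude of the 1000bp matrices in appendix C.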
6.2 Motivation for Simulation of GLS test statistic
There are many reasons to be enthusiastic about this approach. The GLS test statistic avoids some
of the computational burden of the previously described tests. The estimation and inversion of the
variance-covariance matrix is costly, but it only needs to be done once. In theory we know the
distribution of the test statistic under the null hypothesis, which means we do not have to try
to simulate it as in the parametric bootstrap.
The use of the χ2 distribution requires the observed distances to be multivariate normal and
the observed distances be derived from a large number of sites. However, it is unclear how
large is large. The estimation of the variance-covariance matrix also poses a potential problem.
Susko claims that if the method of estimation is consistent then the results will still hold. Again,
consistency is an asymptotic property while sequences are finite.
Susko’s sample average variance-covariance matrix and the traditional Jukes-Cantor esti-
mates have very different derivations. The latter is derived from the shared branch length
between the taxa at the leaves; that is, it depends on the tree explicitly and is based on maximum
likelihood estimation. The sample average covariance does not have a basis in maximum
likelihood estimation, so we may expect the covariance matrices to differ. However, generalized
estimating equation theory suggests that the sample average covariance should have the
same asymptotic properties as maximum likelihood estimation. So, the covariance estimates
for both of these approaches should converge.
Susko mentions these issues but does not provide any information on how these factors affect
the test’s effectiveness. The following simulations try to shed some light on these issues.
6.3 GLS Test Statistic Simulation Method
The aim was to simulate the distribution of gT under the null hypothesis and see if it was indeed
the appropriate χ2 distribution. The following general method was used:
1. An unrooted tree was generated using the program evolver from the PAML package of
programs [45]. This program assigns branch lengths according to the birth-death process
described by Rannala and Yang [33].
2. Around 700² sequence data sets of a specified length were generated from the tree using
the program seq-gen [32]. The Jukes-Cantor model was used to generate the sequences.
The procedure was the same as a parametric bootstrap of sequence data.
3. A distance matrix for each sequence data set replicate was estimated using DNAML from
the PHYLIP [14] package of programs.
4. Branch lengths were fitted to the generating (‘True’) tree topology with the GLS approach,
using the program treedist.

5. The GLS test statistic for each replicate was calculated as described previously.
Simulations were carried out on four and five leaf trees. Sequence lengths were 100, 1000, and
10000 nucleotides. Branch length fitting and GLS statistic calculations were done using Susko’s
sample covariance formulas, and the tree-derived covariance for the Jukes-Cantor model (see
Section 3.2.2).
The statistic for the ‘best’ tree for each data set was also recorded. Because I only dealt
with small trees, it was possible to fit every topology and then take the best tree to be the one
resulting in the smallest GLS test statistic.
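A toy version of steps 2–3 can be sketched as follows. The thesis pipeline used seq-gen and PHYLIP; this pure-Python stand-in only mimics the idea of simulating a sequence pair under Jukes-Cantor and re-estimating the distance from it:

```python
import math
import random

def evolve_pair(d, n_sites, rng):
    """Simulate two sequences separated by JC distance d; return the p-distance."""
    # Under Jukes-Cantor, a site differs with probability (3/4)(1 - e^{-4d/3}).
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    diffs = sum(1 for _ in range(n_sites) if rng.random() < p_diff)
    return diffs / n_sites

def jc_distance(p):
    """Invert the JC formula: d = -(3/4) ln(1 - (4/3) p)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

rng = random.Random(1)          # fixed seed for repeatability
p = evolve_pair(0.5, 10000, rng)
d_hat = jc_distance(p)          # should land close to the true distance 0.5
```

Repeating this for every pair of leaves (with the full path length between them as d) gives one replicate distance matrix; the GLS fit is then applied to each replicate.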
6.4 Results
Simulated GLS data was compared to χ2 data using a quantile-quantile plot. Graphs can be
found in appendix A.
Sample Average Covariance
The test statistic for the true tree appeared to follow the appropriate χ2 distribution. As
expected, the fit was much better for the longest sequences (10000 base pairs) and worst for
the shortest (100bp). The four leaf trees appeared to fit better than the five leaf trees. The data
set that caused the most trouble is shown in fig A.7 where large negative and positive values
were present in the residual data. This did not recur when data was generated later with the
same parameters. Some curving on the plots is visible in fig A.2 and remains unexplained.

²The magic number 700 arose because of the (unexpected) behaviour of the Unix split program. However,
work on parametric bootstrap techniques has suggested that 100 replicates can suffice to determine the
underlying distribution of a statistic [19]
The gT values for the best trees from 100bp sequences appeared to be lower than expected for a
χ2 distribution. However, the trees generated by longer sequences appeared to fit a χ2 distribution
quite well. This indicates that, in these cases, the best fitting tree was the generating tree.
Jukes Cantor Covariance
Data generated using the Jukes-Cantor covariance often did not appear to fit the expected χ2
distributions. This seemed to be more of an issue for the five leaf trees than the four leaf trees.
Surprisingly, the best five leaf tree residuals appeared to fit χ2(3) better than the true tree,
although these did not fit particularly well (but see fig B.11).
Differences in covariance matrices
To investigate the difference between the sample average and Jukes-Cantor covariance matrices, I
performed a simulation to estimate the true sample covariance on a four leaf tree with equal length
branches. 1000 sequence sets were generated from the tree and Jukes-Cantor distance estimates
were made for each data set. The sample covariance matrix for the six pairwise distances was
calculated using R’s inbuilt cov() function. The resulting covariance matrices can be found in
appendix C. Both the Jukes-Cantor and sample average covariances appear to converge to the
true covariance for this simple tree.
6.5 Discussion
These results indicate that the GLS test statistic generated under the null hypothesis with the
sample average covariance estimates has a χ2(k) distribution, as proposed at the beginning of
this chapter. This provides a valid test of whether a given tree T lies in the (1 − α) × 100%
confidence region for the true topology.
However, there were a number of unexplained phenomena associated with this experiment.
They are discussed below.
Negative GLS values
Negative GLS values were calculated in some simulations using sample average covariance. This
is of some concern! This means that the covariance matrix is not positive definite and the GLS
theory cannot be used.
It is not clear at the moment why this happens. One reason may be that the errors in
distance estimates are ‘damaging’ the covariance estimation. Another highly likely reason is
that small branches have high variance, so the inverted variance-covariance matrix becomes
inaccurate. More simulations with long branch lengths might help verify this.
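One cheap diagnostic for this failure mode is to attempt a Cholesky factorization of the estimated matrix before inverting it, since the factorization succeeds exactly when a symmetric matrix is positive definite. A pure-Python sketch, with made-up matrices:

```python
def is_positive_definite(V):
    """Attempt the Cholesky factorization V = L L^T; failure means V is
    not positive definite (V is assumed symmetric)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = V[i][i] - s
                if d <= 0.0:
                    return False       # a non-positive pivot: not PD
                L[i][i] = d ** 0.5
            else:
                L[i][j] = (V[i][j] - s) / L[j][j]
    return True
```

For example, the symmetric matrix [[1, 2], [2, 1]] has a negative eigenvalue and fails the check, so a GLS statistic computed from it could come out negative.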
Covariance Matrices
The JC covariance matrix appears reasonably similar to the sample covariance. There also
does seem to be some convergence of the sample average covariance to the tree-derived sample
covariance and JC covariance when the number of sites is large (10000bp). Conversely, the
difference is greatest for short sequences (100bp). This is expected from the convergence
properties of generalized estimating equations and maximum likelihood estimation.
This seems to imply that once sequences reach a certain length we should be able to
use the estimated Jukes-Cantor covariance in place of the sample average covariance. This
appears plausible for the four leaf tree, but fig B.8 and fig B.9 certainly imply otherwise.
Best Tree, True Tree
We may expect the relationship of the best tree distributions to those of the true tree to
shed some light on the above. Residual values from the best tree will always be less than or
equal to those for the true tree. One example of why the best and true tree might differ is
that generalized least squares coupled with the minimum evolution criterion is not consistent [18]
(although the conditions for this have not been established). From this one might expect residual
values to be systematically less than those of the associated χ2 distribution. Some of the graphs
do show this (for example fig A.10), which indicates that in those particular simulations the best
tree selected was often not the true tree.
However, that does not explain the marked differences in distribution seen in B.8 and
B.9 but the reasonable fit in B.11 and B.12.
It can be expected that, like the tests described in Chapter 5, this test will be used to test
if the best tree returned by some method is the true tree. The data collected on four and five
leaf trees appears to imply that the best tree is often the true tree.
However, it is unlikely that this result will scale up well to larger trees. One reason being
that there are many more possible tree topologies as the number of taxa being examined
increases, so it becomes more likely that the best tree is not the true one. It is well known
that results derived from small trees do not necessarily apply to very large trees. The results
presented here need to be considered with this in mind.
6.6 Contribution
My contributions include the design and implementation of the GLS test statistic computer
simulation. I have also provided an expanded proof of the theorem proposed by Susko that the
observed distances between taxa are asymptotically normal, for a large number of sites, when
maximum likelihood estimates are used.
The development of the computer simulation involved extending the C program treedist
originally written by Graham Byrnes. I added support for calculation of the sample average
covariance for the Jukes-Cantor model and calculation of the GLS test statistic. This was done
for both the sample average covariance and the tree-derived Jukes-Cantor covariance.
I wrote several Unix shell scripts to interface treedist with the programs that generated
the input data (evolver and seq-gen). Source code for these and treedist are available on
request.
The expansion of the proof of Susko’s theorem provided more detail into how to calculate
the sample average covariance.
6.7 Further Work
Further work into the relationship between the tree-derived covariance and the sample average
covariance is needed, as are more conclusive results explaining in which cases the sample average
covariance matrix can fail to be positive definite.
Susko notes that this approach relies on model-derived distances. If the model is incorrect
then it is unclear how much confidence we can put in the testing procedure. Simulations that
reveal how the test reacts to model misspecification would be helpful.
Chapter 7
Conclusion
My results show that the GLS test statistic proposed by Susko can be applied to short sequences,
but the assumptions behind the test may be broken. The applicability to shorter sequences seems
vulnerable to errors resulting in large negative outliers. There is also the possibility that covariance
matrices that are not positive definite can result from the proposed sample average
estimation method.
The GLS test statistic seems like a reasonable choice for long sequences because of its
computational advantages over other testing methods such as bootstrap and Bayesian MCMC.
However, investigation still needs to be carried out to test how well this method responds to
model misspecification.
Simulations derived from real data, where the number of taxa is large, are also necessary.
Appendix A
GLS Results: Sample Average
Covariance
The following graphs are quantile-quantile plots of the GLS test statistic, calculated using Susko’s
sample average covariance, versus a χ2 distribution with the degrees of freedom described in Chapter 6.
In graphs captioned True tree, the GLS test statistic was calculated for the tree that generated
the sequence data only. In graphs captioned Best tree, the GLS test statistic was calculated
for all possible four or five leaf topologies (depending on the generating tree) but only the lowest
test statistic value was recorded.
A.1 Four Leaf Trees
Figure A.1: True tree, Susko Covariance, 4, 100 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)
Figure A.2: True tree, Susko Covariance, 4, 1000 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)

Figure A.3: True tree, Susko Covariance, 4, 10000 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)
Figure A.4: Best tree, Susko Covariance, 4, 100 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure A.5: Best tree, Susko Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure A.6: Best tree, Susko Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure A.7: True tree, Susko Covariance, 5, 100 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
A.2 Five Leaf Trees
Figure A.8: True tree, Susko Covariance, 5, 1000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)

Figure A.9: True tree, Susko Covariance, 5, 10000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
Figure A.10: Best tree, Susko Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure A.11: Best tree, Susko Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Figure A.12: Best tree, Susko Covariance, 5, 10000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Appendix B
GLS results: JC Covariance
B.1 Four Leaf Trees
Figure B.1: True tree, JC Covariance, 4, 100 (Q-Q plot: GLS residual vs. χ², 1 degree of freedom)

Figure B.2: True tree, JC Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.3: True tree, JC Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure B.4: Best tree, JC Covariance, 4, 100 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.5: Best tree, JC Covariance, 4, 1000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)

Figure B.6: Best tree, JC Covariance, 4, 10000 (Q-Q plot: GLS residues vs. χ², 1 degree of freedom)
Figure B.7: True tree, JC Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
B.2 Five Leaf Trees
Figure B.8: True tree, JC Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure B.9: True tree, JC Covariance, 5, 10000 (Q-Q plot: GLS residual vs. χ², 3 degrees of freedom)
Figure B.10: Best tree, JC Covariance, 5, 100 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)

Figure B.11: Best tree, JC Covariance, 5, 1000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Figure B.12: Best tree, JC Covariance, 5, 10000 (Q-Q plot: GLS residues vs. χ², 3 degrees of freedom)
Appendix C
Covariance Estimation
This appendix contains covariance matrices for a four leaf tree with five branches, each of length
0.5.
The sample covariance results refer to the sample covariance calculated by the statistical
program R after calculating pairwise distance estimates from 1000 sequence data sets using the
Jukes-Cantor model.
The Jukes-Cantor covariance refers to the theoretical covariance estimate derived from the
Jukes-Cantor model. The sample average covariance refers to the covariance matrix derived in [40].
C.1 Sample Covariance Results
C.1.1 Sample Covariance - 100bp
0.154685911 0.030742862 0.0023922773 0.0146437905 0.021297064 0.007273003
0.030742862 0.127398955 0.0123768091 0.0138471254 0.009112506 0.026076178
0.002392277 0.012376809 0.0569422201 0.0002700298 0.013977221 0.018039598
0.014643791 0.013847125 0.0002700298 0.0471452913 0.026704351 0.006550489
0.021297064 0.009112506 0.0139772209 0.0267043514 0.166695501 0.033352401
0.007273003 0.026076178 0.0180395981 0.0065504892 0.033352401 0.139944026
C.1.2 Sample Covariance - 1000bp
0.0138869819 0.0040845656 0.0011769553 0.0009608333 0.0037507856 0.0009980025
0.0040845656 0.0129057929 0.0005606785 0.0007581601 0.0012195869 0.0036283243
0.0011769553 0.0005606785 0.0037213281 -0.0001051145 0.0008251945 0.0005083123
0.0009608333 0.0007581601 -0.0001051145 0.0037063417 0.0008871719 0.0008775054
0.0037507856 0.0012195869 0.0008251945 0.0008871719 0.0125837613 0.0029320889
0.0009980025 0.0036283243 0.0005083123 0.0008775054 0.0029320889 0.0128444270
C.1.3 Sample Covariance - 10000bp
1.286153e-03 3.794904e-04 1.109185e-04 7.172576e-05 3.764319e-04 9.181898e-05
3.794904e-04 1.361460e-03 1.067608e-04 6.434031e-05 9.747156e-05 4.077580e-04
1.109185e-04 1.067608e-04 3.781283e-04 1.612486e-06 9.272264e-05 9.613416e-05
7.172576e-05 6.434031e-05 1.612486e-06 3.485316e-04 9.078303e-05 8.487179e-05
3.764319e-04 9.747156e-05 9.272264e-05 9.078303e-05 1.279326e-03 3.283484e-04
9.181898e-05 4.077580e-04 9.613416e-05 8.487179e-05 3.283484e-04 1.265110e-03
C.2 Jukes-Cantor Covariance
C.2.1 Jukes-Cantor Covariance - 100bp
0.471919 0.126338 0.013920 0.021597 0.087623 0.017458
0.126338 0.240716 0.004129 0.021597 0.017458 0.037730
0.013920 0.004129 0.030822 0.000000 0.013920 0.004129
0.021597 0.021597 0.000000 0.025934 0.000759 0.000759
0.087623 0.017458 0.013920 0.000759 0.103009 0.021089
0.017458 0.037730 0.004129 0.000759 0.021089 0.044828
C.2.2 Jukes-Cantor Covariance - 1000bp
0.014586 0.003595 0.001074 0.000790 0.004576 0.000989
0.003595 0.012295 0.000862 0.000790 0.000989 0.003842
0.001074 0.000862 0.004108 0.000000 0.001074 0.000862
0.000790 0.000790 0.000000 0.003131 0.000831 0.000831
0.004576 0.000989 0.001074 0.000831 0.015133 0.003734
0.000989 0.003842 0.000862 0.000831 0.003734 0.012756
C.2.3 Jukes-Cantor Covariance - 10000bp
0.001238 0.000353 0.000088 0.000088 0.000355 0.000087
0.000353 0.001195 0.000084 0.000088 0.000087 0.000342
0.000088 0.000084 0.000345 0.000000 0.000088 0.000084
0.000088 0.000088 0.000000 0.000346 0.000085 0.000085
0.000355 0.000087 0.000088 0.000085 0.001208 0.000344
0.000087 0.000342 0.000084 0.000085 0.000344 0.001167
C.3 Sample Average Covariance(Susko)
C.3.1 Sample Average Covariance(Susko) - 100bp
2.772195 0.284422 0.061343 0.037179 0.160589 0.044012
0.284422 0.194301 0.006093 0.012716 0.013550 0.025393
0.061343 0.006093 0.031681 -0.001258 0.015681 0.005798
0.037179 0.012716 -0.001258 0.026541 -0.002846 -0.000230
0.160589 0.013550 0.015681 -0.002846 0.107110 0.023703
0.044012 0.025393 0.005798 -0.000230 0.023703 0.047412
C.3.2 Sample Average Covariance(Susko) - 1000bp
0.012331 0.003691 0.000952 0.000620 0.004649 0.000714
0.003691 0.014367 0.001146 0.001033 0.001152 0.003982
0.000952 0.001146 0.004123 -0.000038 0.001208 0.000597
0.000620 0.001033 -0.000038 0.003140 0.001071 0.000740
0.004649 0.001152 0.001208 0.001071 0.018552 0.003630
0.000714 0.003982 0.000597 0.000740 0.003630 0.011086
C.3.3 Sample Average Covariance(Susko) - 10000bp
0.001224 0.000357 0.000089 0.000080 0.000342 0.000095
0.000357 0.001211 0.000087 0.000090 0.000100 0.000344
0.000089 0.000087 0.000345 -0.000003 0.000087 0.000080
0.000080 0.000090 -0.000003 0.000346 0.000080 0.000085
0.000342 0.000100 0.000087 0.000080 0.001224 0.000339
0.000095 0.000344 0.000080 0.000085 0.000339 0.001154
Bibliography
[1] Michael E. Alfaro, Stefan Zoller, and Francois Lutzoni. Bayes or Bootstrap? A Simulation
Study Comparing the Performance of Bayesian Markov Chain Monte Carlo Sampling and
Bootstrapping in Assessing Phylogenetic Confidence. Mol. Biol. Evol, 20(2):255–266, 2003.
[2] Daniel Barry and J.A. Hartigan. Asynchronous Distance between Homologous DNA
Sequences. Biometrics, 43:261–276, 1987.
[3] L. Billera, S. Holmes, and K. Vogtmann. The Geometry of Tree Space. Adv. Appl. Math,
27, 2001.
[4] W. Bruno, N. Socci, and A. Halpern. Weighted Neighbor Joining: A Likelihood-Based
Approach to Distance-Based Phylogeny Reconstruction. Mol. Biol. Evol, 17(1):189–197,
2000.
[5] David Bryant and Peter Waddell. Rapid Evaluation of Least-Squares and Minimum-
Evolution Criteria on Phylogenetic Trees. Mol. Biol. Evol, 15(10):1346–1359, 1998.
[6] T. J. Buckley. Model misspecification and probabilistic tests of topology: evidence from
empirical data sets. Syst. Biol, 51(3):509–523, 2002.
[7] Michael Bulmer. Use of the Method of Generalized Least Squares in Reconstructing Phy-
logenies from Sequence Data. Mol. Biol. Evol, 8(6):868–883, 1991.
[8] Joseph T. Chang. Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability
and Consistency. Mathematical Biosciences, 137:51–73, 1996.
[9] D.R. Cox and H.D. Miller. The Theory of Stochastic Processes, chapter 4. Methuen and
Co, London, 1965.
[10] Christophe J. Douady, F. Delsuc, Yan Boucher, W. Ford Doolittle, and Emmanuel J. P.
Douzery. Comparison of Bayesian and Maximum Likelihood Bootstrap Measures of Phy-
logenetic Reliability. Mol. Biol. Evol, 20(2):248–254, 2003.
[11] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence
analysis: Probabilistic models of proteins and nucleic acids, chapter 7. Cambridge Univer-
sity Press, Cambridge, UK, 1998.
[12] B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees.
Proceedings of the National Academy of Sciences of the USA, 93:13429–13434, 1996.
[13] Warren J. Ewens and Gregory R. Grant. Statistical Methods in Bioinformatics: An Intro-
duction, chapter 3. Springer-Verlag, New York, 2001.
[14] J. Felsenstein. PHYLIP (Phylogeny Inference Package), 2002.
[15] Joseph Felsenstein. Evolutionary Trees from DNA Sequences: A Maximum Likelihood
Approach. J Mol Evol, 17:368–376, 1981.
[16] Joseph Felsenstein and Hirohisa Kishino. Is There Something Wrong with the Bootstrap
on Phylogenies? A Reply to Hills and Bull. Systematic Biology, 42(2):193–200, 1993.
[17] Olivier Gascuel. BIONJ: An Improved Version of the NJ Algorithm Based on a Simple
Model of Sequence Data. Mol. Biol. Evol, 14(7):685–695, 1997.
[18] Olivier Gascuel, David Bryant, and Francois Denis. Strengths and Limitations of the
Minimum-Evolution Principle. Syst. Biol, 50(5):621–627, 2001.
[19] Nick Goldman, Jon P. Anderson, and Allen G. Rodrigo. Likelihood-Based tests of topolo-
gies in phylogenetics. Systematic Biology, 49(4):652–670, 2000.
[20] M. Hasegawa, H. Kishino, and T. Yano. Dating the human-ape splitting by a molecular
clock of mitochondrial DNA. J. Mol. Evol, 22:160–174, 1985.
[21] Steven Henikoff and Jorja G. Henikoff. Amino Acid Substitution Models from Protein
Blocks. Proc. Natl. Acad. Sci. USA, 89:10915–10919, 1992.
[22] D.M. Hillis and J. Bull. An Empirical Test of Bootstrapping as a Method for Assessing
Confidence in Phylogenetic Analysis. Syst. Biol, 42(2):182–192, 1993.
[23] Susan Holmes. Phylogenies: An Overview. In: Halloran, E., Geiser, S.(Eds), Statistics in
Genetics, IMA, Vol 81. Springer-Verlag, New York, 1999.
[24] Susan Holmes. Statistics for phylogenetic trees. Theoretical Population Biology, 63:17–32,
2003.
[25] John P. Huelsenbeck and Keith A. Crandall. Phylogeny Estimation and Hypothesis Testing
using Maximum Likelihood. Ann. Rev. Ecol. Syst., 28:437–466, 1997.
[26] J.P. Huelsenbeck and J.J. Bull. A Likelihood Ratio Test to Detect Conflicting Phylogenetic
Signal. Syst. Biol., 45:92–98, 1996.
[27] J.P. Huelsenbeck, D.M. Hillis, and R. Nielsen. A Likelihood Ratio Test of Monophyly.
Syst. Biol., 45:546–558, 1996.
[28] Keith Knight. Mathematical Statistics, chapter 5. Chapman & Hall/CRC Press, Chicago,
2000.
[29] E.L. Lehmann. Theory of Point Estimation, chapter 4. Wiley, New York, 1983.
[30] Pietro Liò and Nick Goldman. Models of Molecular Evolution and Phylogeny. Genome
Research, 8:1233–1244, 1998.
[31] P.J. Lockhart, M. Steel, M. Hendy, and D. Penny. Recovering Evolutionary Trees under a
More Realistic Model of Sequence Evolution. Mol. Biol. Evol., 11(4):605–612, 1994.
[32] A. Rambaut and N.C. Grassly. Seq-Gen: an application for the Monte Carlo simulation
of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci., 13:235–238,
1997.
[33] Bruce Rannala and Ziheng Yang. Probability Distribution of Molecular Evolutionary Trees:
A New Method of Phylogenetic Inference. J. Mol. Evol., 43:304–311, 1996.
[34] Alvin C. Rencher. Linear Models in Statistics, chapter 7. Wiley-Interscience, New York,
2000.
[35] Andre Rzhetsky and Masatoshi Nei. Theoretical Foundation of the Minimum-Evolution
Method of Phylogenetic Inference. Mol. Biol. Evol., 10(5):1073–1095, 1993.
[36] Naruya Saitou and Masatoshi Nei. The Neighbor-joining Method: A New Method for
Reconstructing Phylogenetic Trees. Mol. Biol. Evol., 4(4):406–425, 1987.
[37] H. Shimodaira and M. Hasegawa. Multiple comparisons of log-likelihoods with applications
to phylogenetic tree selection. Mol. Biol. Evol., 16:1114–1116, 1999.
[38] Mike Steel and David Penny. Parsimony, Likelihood, and the Role of Models in Molecular
Phylogenetics. Mol. Biol. Evol., 17(6):839–850, 2000.
[39] K. Strimmer and V. Moulton. Likelihood Analysis of Phylogenetic Networks Using Directed
Graph Methods. Mol. Biol. Evol., 17:875–881, 2000.
[40] Edward Susko. Confidence Regions and Hypothesis Tests for Topologies Using Generalized
Least Squares. Mol. Biol. Evol., 20(6):862–868, 2003.
[41] D. Swofford, G. Olsen, P.J. Waddell, and D.M. Hillis. Molecular Systematics, chapter
"Phylogenetic Inference". Sinauer Associates, Massachusetts, 1996.
[42] C. Tuffley and M.A. Steel. Modelling the covarion hypothesis of nucleotide substitution.
Mathematical Biosciences, 147:63–91, 1997.
[43] Peter H. Westfall and S. Stanley Young. Resampling-Based Multiple Testing: Examples
and Methods for p-Value Adjustment, chapter 2. Wiley Interscience, New York, USA, 1993.
[44] Simon Whelan and Nick Goldman. Distributions of Statistics Used for the Comparison of
Models of Sequence Evolution in Phylogenetics. Mol. Biol. Evol., 16(9):1292–1299, 1999.
[45] Ziheng Yang. PAML: a program for phylogenetic analysis by maximum likelihood.
CABIOS, 13:555–556, 1997.