Phylogenetic Inference and Hypothesis Testing
Catherine Lai (92720)
BSc(Hons) Department of Mathematics and Statistics
University of Melbourne
November 13, 2003
Contents
1 Introduction
2 Molecular Phylogenetics
  2.1 The Use of Phylogenetic Trees
  2.2 Traditional Approaches
  2.3 Phylogenetic Trees From Genomic Data
  2.4 What about the root?
  2.5 How Treelike is Evolution?
3 Models of Evolution
    3.0.1 A Simple Approach
    3.0.2 Evolution as a stochastic process
  3.1 Markov Models of Evolution
    3.1.1 Markov Theory
    3.1.2 Markov Models of Site Substitution
  3.2 Parameterized Models of Nucleotide Evolution
    3.2.1 Jukes-Cantor Model
    3.2.2 Jukes-Cantor Variance
    3.2.3 Generalisations of the Jukes-Cantor Model
  3.3 Problems with Markov Models of Evolution
  3.4 Modelling Rate Heterogeneity
  3.5 Modelling Non-Stationarity
    3.5.1 Summary of Nucleotide Markov Models
  3.6 Empirical Models of amino acid evolution
    3.6.1 PAM/Dayhoff Substitution Matrices
    3.6.2 BLOSUM
  3.7 Differences in PAM and BLOSUM
4 Phylogenetics Tree Reconstruction Methods
  4.1 Evaluating Reconstruction Methods
    4.1.1 Complexity
    4.1.2 Accuracy
    4.1.3 Consistency
    4.1.4 Efficiency
    4.1.5 Robustness
    4.1.6 Usability in tests
  4.2 Parsimony
  4.3 Maximum Likelihood
  4.4 Is MP the same as ML?
  4.5 Distance Based Methods
    4.5.1 Unweighted pair group method using arithmetic averages (UPGMA)
    4.5.2 The Molecular Clock Hypothesis
    4.5.3 Long Branch Attraction
    4.5.4 Neighbour Joining
    4.5.5 BIONJ
    4.5.6 Weighbor
    4.5.7 NJ and the minimum evolution method
  4.6 Least Squares
    4.6.1 Estimating Branch lengths
    4.6.2 Minimum Evolution Method with Least Squares
  4.7 Bayesian Tree Reconstruction
  4.8 Trees from Alignments, Alignments from Trees
5 Phylogenetic Hypothesis Tests
  5.1 Confidence Regions of Phylogenetic Trees
  5.2 The Bootstrap
  5.3 The Non-parametric Bootstrap
    5.3.1 Testing Phylogenies using the Non-parametric Bootstrap
    5.3.2 How well does it work?
  5.4 The Parametric Bootstrap
    5.4.1 Problems with the Parametric Bootstrap
  5.5 Bootstrap Based Tests
    5.5.1 Centering
    5.5.2 The Kishino Hasegawa Test
    5.5.3 The Shimodaira Hasegawa Test
    5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH)
  5.6 Bayesian methods
  5.7 Bootstraps and Posterior Probabilities
  5.8 Which Test?
6 Generalized Least Squares in Phylogenetic Hypothesis Testing
  6.1 Sample Average Variance and Covariance
  6.2 Motivation for Simulation of GLS test statistic
  6.3 GLS Test Statistic Simulation Method
  6.4 Results
  6.5 Discussion
  6.6 Contribution
  6.7 Further Work
7 Conclusion
A GLS Results: Sample Average Covariance
  A.1 Four Leaf Trees
  A.2 Five Leaf Trees
B GLS results: JC Covariance
  B.1 Four Leaf Trees
  B.2 Five Leaf Trees
C Covariance Estimation
  C.1 Sample Covariance Results
    C.1.1 Sample Covariance - 100bp
    C.1.2 Sample Covariance - 1000bp
    C.1.3 Sample Covariance - 10000bp
  C.2 Jukes-Cantor Covariance
    C.2.1 Jukes-Cantor Covariance - 100bp
    C.2.2 Jukes-Cantor Covariance - 1000bp
    C.2.3 Jukes-Cantor Covariance - 10000bp
  C.3 Sample Average Covariance (Susko)
    C.3.1 Sample Average Covariance (Susko) - 100bp
    C.3.2 Sample Average Covariance (Susko) - 1000bp
    C.3.3 Sample Average Covariance (Susko) - 10000bp
Chapter 1
Introduction
Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on Earth.
The aim is to understand the relationships between species and, through them, the process of evolution
itself. These relationships can be represented with a graph structure, traditionally simplified to
an evolutionary tree. The current approach is to try to reconstruct these trees from the blueprint
of life: DNA sequences.
Reconstruction methods are difficult to design and evaluate because the biological evidence
is often ambiguous. Many approaches have been introduced to deal with the problems of estimation
and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary
knowledge we have of evolution, while non-parametric approaches have been developed to avoid
the possibility of inaccurate preconceptions.
Recently, Susko [40] presented an approach that applies the theory of generalized least squares
to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical
foundations in the theory of linear models. While the theory appears to be sound, it is based on
results that are asymptotic in sequence length. It is not clear how well the test will perform
in practice, where sequences are often only a few hundred nucleotides long. I investigate
the effect of sequence length on this approach. I also consider how Susko's approach differs
from traditional parametric techniques with respect to variance-covariance estimation.
In Chapter 2 I will give a general background