Phylogenetic Inference and Hypothesis Testing - laic/  · 2005-09-27 · Phylogenetic Inference and

  • Phylogenetic Inference and Hypothesis Testing

    Catherine Lai (92720)

    BSc(Hons) Department of Mathematics and Statistics

    University of Melbourne

    November 13, 2003

  • Contents

    1 Introduction

    2 Molecular Phylogenetics
        2.1 The Use of Phylogenetic Trees
        2.2 Traditional Approaches
        2.3 Phylogenetic Trees From Genomic Data
        2.4 What about the root?
        2.5 How Treelike is Evolution?

    3 Models of Evolution
            3.0.1 A Simple Approach
            3.0.2 Evolution as a stochastic process
        3.1 Markov Models of Evolution
            3.1.1 Markov Theory
            3.1.2 Markov Models of Site Substitution
        3.2 Parameterized Models of Nucleotide Evolution
            3.2.1 Jukes-Cantor Model
            3.2.2 Jukes-Cantor Variance
            3.2.3 Generalisations of the Jukes-Cantor Model
        3.3 Problems with Markov Models of Evolution
        3.4 Modelling Rate Heterogeneity
        3.5 Modelling Non-Stationarity
            3.5.1 Summary of Nucleotide Markov Models
        3.6 Empirical Models of amino acid evolution
            3.6.1 PAM/Dayhoff Substitution Matrices
            3.6.2 BLOSUM
        3.7 Differences in PAM and BLOSUM

    4 Phylogenetics Tree Reconstruction Methods
        4.1 Evaluating Reconstruction Methods
            4.1.1 Complexity
            4.1.2 Accuracy
            4.1.3 Consistency
            4.1.4 Efficiency
            4.1.5 Robustness
            4.1.6 Usability in tests
        4.2 Parsimony
        4.3 Maximum Likelihood
        4.4 Is MP the same as ML?
        4.5 Distance Based Methods
            4.5.1 Unweighted pair group method using arithmetic averages (UPGMA)
            4.5.2 The Molecular Clock Hypothesis
            4.5.3 Long Branch Attraction
            4.5.4 Neighbour Joining
            4.5.5 BIONJ
            4.5.6 Weighbor
            4.5.7 NJ and the minimum evolution method
        4.6 Least Squares
            4.6.1 Estimating Branch lengths
            4.6.2 Minimum Evolution Method with Least Squares
        4.7 Bayesian Tree Reconstruction
        4.8 Trees from Alignments, Alignments from Trees

    5 Phylogenetic Hypothesis Tests
        5.1 Confidence Regions of Phylogenetic Trees
        5.2 The Bootstrap
        5.3 The Non-parametric Bootstrap
            5.3.1 Testing Phylogenies using the Non-parametric Bootstrap
            5.3.2 How well does it work?
        5.4 The Parametric Bootstrap
            5.4.1 Problems with the Parametric Bootstrap
        5.5 Bootstrap Based Tests
            5.5.1 Centering
            5.5.2 The Kishino Hasegawa Test
            5.5.3 The Shimodaira Hasegawa Test
            5.5.4 The Swofford Olsen Waddell Hillis Test (SOWH)
        5.6 Bayesian methods
        5.7 Bootstraps and Posterior Probabilities
        5.8 Which Test?

    6 Generalized Least Squares in Phylogenetic Hypothesis Testing
        6.1 Sample Average Variance and Covariance
        6.2 Motivation for Simulation of GLS test statistic
        6.3 GLS Test Statistic Simulation Method
        6.4 Results
        6.5 Discussion
        6.6 Contribution
        6.7 Further Work

    7 Conclusion

    A GLS Results: Sample Average Covariance
        A.1 Four Leaf Trees
        A.2 Five Leaf Trees

    B GLS results: JC Covariance
        B.1 Four Leaf Trees
        B.2 Five Leaf Trees

    C Covariance Estimation
        C.1 Sample Covariance Results
            C.1.1 Sample Covariance - 100bp
            C.1.2 Sample Covariance - 1000bp
            C.1.3 Sample Covariance - 10000bp
        C.2 Jukes-Cantor Covariance
            C.2.1 Jukes-Cantor Covariance - 100bp
            C.2.2 Jukes-Cantor Covariance - 1000bp
            C.2.3 Jukes-Cantor Covariance - 10000bp
        C.3 Sample Average Covariance (Susko)
            C.3.1 Sample Average Covariance (Susko) - 100bp
            C.3.2 Sample Average Covariance (Susko) - 1000bp
            C.3.3 Sample Average Covariance (Susko) - 10000bp

  • Chapter 1

    Introduction

    Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on Earth. The aim is to understand the relationships between species and, through them, the process of evolution itself. These relationships can be represented as a graph, traditionally simplified to an evolutionary tree. The current approach is to reconstruct these trees from the blueprint of life: DNA sequences.

    Reconstruction methods are difficult to design and evaluate because the biological evidence is often ambiguous. Many approaches have been introduced to deal with the problems of estimation and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary knowledge we have of evolution, while non-parametric approaches have been developed to avoid relying on possibly inaccurate preconceptions.

    Recently, Susko [40] presented an approach that applies the theory of generalized least squares to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical foundations in the theory of linear models. While the theory appears to be sound, it is based on results that are asymptotic in sequence length. It is not clear how well the test will perform in practice, where sequences are often only a few hundred nucleotides long. I investigate the effect of sequence length on this approach. I also consider how Susko's approach differs from traditional parametric techniques with respect to variance-covariance estimation.

    In Chapter 2 I will give a general background