Phylogenetic Inference Christian M Zmasek, PhD [email protected] 15 June 2010


Page 1: Phylogenetic  Inference

Phylogenetic Inference

Christian M Zmasek, PhD [email protected] 15 June 2010

Page 2: Phylogenetic  Inference

(C) 2010 Christian M. Zmasek 2

Overview

1. Why perform phylogenetic inference?

2. Theoretical background

3. Methods

4. Software & Examples

Page 3: Phylogenetic  Inference


1. Why perform phylogenetic inference?

‘Tree of life’: The relationships amongst different species

Infer the functions of proteins from family members in model organisms, or refine existing annotations, through phylogenetic analysis

A method to organize/cluster sequences with biological justification

Page 4: Phylogenetic  Inference


Over-annotation due to database bias or gene loss

[Figure: two trees over RAT, MOUSE, HUMAN, RICE, LIZARD and SHARK, with genes labeled X, Y and Z. Legend (symbols not recoverable): query sequence; orthologous to query; most similar to query; gene duplication.]

Page 5: Phylogenetic  Inference


Over-annotation due to unequal rates of evolution [phylogenetic tree ≠ clustering !!]

[Figure: tree over RAT, WHEAT, HUMAN and BARLEY, with genes labeled Y and Z. Legend (symbols not recoverable): query sequence; orthologous to query; most similar to query; gene duplication.]

Page 6: Phylogenetic  Inference


2. Theoretical Background

A phylogeny is the evolutionary history of a species or a group of species. Lately, the term is also being applied to the evolutionary history of individual DNA or protein sequences.

The evolutionary history of organisms or sequences can be illustrated using a tree-like diagram – a phylogenetic tree.

Page 7: Phylogenetic  Inference


Page 8: Phylogenetic  Inference


Gene Trees/Species Trees

Initially, phylogenetic trees were built based on the morphology of organisms.

Around 1960 molecular sequences were recognized as containing phylogenetic information and hence as valuable for tree building

A tree built based on sequence data is called a gene tree since it is a representation of the evolutionary history of genes

A tree illustrating the evolutionary history of organisms is called a species tree

Page 9: Phylogenetic  Inference


A gene tree which is also a species tree

Page 10: Phylogenetic  Inference


A gene tree of orthologs and paralogs based on Bcl-2 family protein sequences

Page 11: Phylogenetic  Inference


Homologs

Homologs are defined as sequences which share a common ancestor (Fitch, 1966)

This definition becomes unclear if mosaic proteins, which are composed of structural units originating from different genes, are considered

Phylogenetic trees make sense only if constructed based on homologous sequences (whole genes/proteins, or domains)

Page 12: Phylogenetic  Inference


Orthologs, Paralogs, Xenologs

Homologous sequences can be divided into orthologs, paralogs and xenologs:

Orthologs: diverged by a speciation event (their last common ancestor on a phylogenetic tree corresponds to a speciation event). IMPORTANT: Functional similarity does not imply orthology.

Paralogs: diverged by a duplication event (their last common ancestor corresponds to a duplication)

Xenologs: related to each other by horizontal gene transfer (via retroviruses, for example)

Page 13: Phylogenetic  Inference


Orthologs, Paralogs example

Page 14: Phylogenetic  Inference


Caveat emptor: Orthology vs. Function

Orthologous sequences tend to have more similar “functions” than paralogs

Yet: Orthologs are mathematically defined, whereas there is no definition of sequence “function” (i.e. it is a subjective term)

Page 15: Phylogenetic  Inference


Gene Duplication

New genes evolve if mutations accumulate while selective constraints are relaxed by gene duplication

First recognized by Haldane (“… it [mutation pressure] will favour polyploids, and particularly allopolyploids, which possess several pairs of sets of genes, so that one gene may be altered without disadvantage…”)

Page 16: Phylogenetic  Inference


Gene Trees Vs. Species Trees – How Gene Duplications Can Be Detected

[Figure: trees labeled G1, G2 and S (gene trees and species tree) over Human, Rat and Wheat.]

Page 17: Phylogenetic  Inference


3. Methods

Multiple sequence alignment of homologous sequences

Pairwise distance calculation

Algorithmic Methods Based on Pairwise Distances:
• UPGMA
• Neighbor Joining

Optimality Criteria Based on Pairwise Distances:
• Fitch-Margoliash
• Minimal Evolution

Optimality Criteria Based on Character Data:
• Maximum Parsimony
• Maximum Likelihood

Bayesian Methods (MCMC)

Annotations on the slide: “Fast”; “More accurate” (in general)

Page 18: Phylogenetic  Inference


Pairwise Distance Calculation

The simplest method to measure the distance between two amino acid sequences is by their fractional dissimilarity p (n_d is the number of aligned sequence positions containing non-identical amino acids and n_s is the number of aligned sequence positions containing identical amino acids):

$$p = \frac{n_d}{n_d + n_s}$$
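The fractional dissimilarity can be sketched in a few lines of Python. The function name `p_distance` is illustrative, and skipping columns that contain a gap character (`-`) is an assumption not stated on the slide.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fractional dissimilarity p = n_d / (n_d + n_s) of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_d = n_s = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        if a == b:
            n_s += 1  # identical residues
        else:
            n_d += 1  # non-identical residues
    if n_d + n_s == 0:
        raise ValueError("no comparable positions")
    return n_d / (n_d + n_s)


# Two six-residue sequences that differ at one position give p = 1/6.
example_p = p_distance("ARNDCQ", "VRNDCQ")
```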

Page 19: Phylogenetic  Inference


Pairwise Distance Calculation

Unfortunately, this is unrealistic. It does not take into account:

• superimposed changes: multiple mutations at the same sequence location

• different chemical properties of amino acids: for example, changing leucine into isoleucine is more likely and should be weighted less than changing leucine into proline

Page 20: Phylogenetic  Inference


Pairwise Distance Calculation

A more realistic approach for estimating evolutionary distances is to apply maximum likelihood to empirical amino acid replacement models, such as PAM transition probability matrices.

The likelihood L_H of a hypothesis H (an evolutionary distance, for example) given some data D (an alignment, for example) is the probability of D given H: $L_H = P(D \mid H)$

Page 21: Phylogenetic  Inference


UPGMA vs …

UPGMA stands for unweighted pair group method using arithmetic averages

This is clustering

This algorithm produces rooted trees under the assumption of a molecular clock.
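A minimal UPGMA sketch, assuming a symmetric distance matrix given as a list of lists. The nested-tuple tree representation and the function name `upgma` are illustrative. The closest pair of clusters is merged repeatedly, and distances to the new cluster are size-weighted averages, which equals the arithmetic mean over all leaf pairs.

```python
def upgma(names, matrix):
    """Return (tree, root_height); tree is a nested tuple of leaf names."""
    clusters = [(name, 1) for name in names]   # (subtree, number of leaves)
    d = [row[:] for row in matrix]             # working copy of the distances
    height = 0.0
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters.
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda pair: d[pair[0]][pair[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        height = d[i][j] / 2.0                 # molecular clock: equal depth
        keep = [k for k in range(n) if k not in (i, j)]
        # Average distance from the merged cluster to every remaining cluster.
        new_row = [(size_i * d[i][k] + size_j * d[j][k]) / (size_i + size_j)
                   for k in keep]
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0], height


names = ["A", "B", "C", "D"]
matrix = [[0, 2, 4, 8],
          [2, 0, 4, 8],
          [4, 4, 0, 8],
          [8, 8, 8, 0]]
tree, root_height = upgma(names, matrix)
# tree == ("D", ("C", ("A", "B"))), root_height == 4.0
```

The example matrix is ultrametric (clock-like), which is the case in which UPGMA recovers the correct topology.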

Page 22: Phylogenetic  Inference


… Neighbor Joining

As opposed to UPGMA, neighbor joining (NJ) is not misled by the absence of a molecular clock

NJ produces phylogenetic trees (not cluster diagrams)
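The difference from UPGMA can be seen in the first joining step. The sketch below computes the standard neighbor-joining criterion Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k) and picks the pair that minimizes it. The distance matrix is a made-up example from a tree ((A,B),(C,D)) in which B and D evolve five times faster than A and C. The function name is illustrative.

```python
def nj_pick_pair(d):
    """Return ((i, j), q) for the pair minimizing the NJ criterion Q."""
    n = len(d)
    totals = [sum(row) for row in d]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Additive distances for ((A:1, B:5):1, (C:1, D:5)), taxa ordered A, B, C, D.
d = [[0, 6, 3, 7],
     [6, 0, 7, 11],
     [3, 7, 0, 6],
     [7, 11, 6, 0]]

pair, q = nj_pick_pair(d)   # ((0, 1), -28): joins A with B, a true cherry

# The smallest raw distance is between A and C (3), which is the pair a
# clustering method such as UPGMA would join first.
closest = min(((i, j) for i in range(4) for j in range(i + 1, 4)),
              key=lambda p: d[p[0]][p[1]])   # (0, 2)
```

This shows only the pair selection. A full NJ implementation also assigns branch lengths and updates the matrix after each join.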

Page 23: Phylogenetic  Inference


Optimality Criteria

• Fitch-Margoliash (based on pairwise distances)
• Minimal evolution (ME) (based on pairwise distances)
• Maximum Parsimony (MP) (based on character data)
• Maximum Likelihood (ML) (based on character data)

Page 24: Phylogenetic  Inference


Minimal Evolution

Branch lengths are fitted to a tree according to an unweighted least squares criterion, but the optimality criterion to evaluate and compare trees is to minimize the sum of all branch lengths.

Page 25: Phylogenetic  Inference


Maximum Parsimony

Evaluate a given topology

Example:
Sequence1: TGC
Sequence2: TAC
Sequence3: AGG
Sequence4: AAG
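One common way to evaluate a topology under parsimony is Fitch's algorithm, which counts the minimum number of changes per alignment column. The slide does not name the algorithm it uses, so this is an assumption. The sketch below scores the three possible groupings of the four example sequences. Trees are written as nested tuples, and the function names are illustrative.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one column on a binary tree."""
    if isinstance(tree, str):                 # leaf: its observed residue
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children agree: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"Sequence1": "TGC", "Sequence2": "TAC",
             "Sequence3": "AGG", "Sequence4": "AAG"}

topologies = {
    "((1,2),(3,4))": (("Sequence1", "Sequence2"), ("Sequence3", "Sequence4")),
    "((1,3),(2,4))": (("Sequence1", "Sequence3"), ("Sequence2", "Sequence4")),
    "((1,4),(2,3))": (("Sequence1", "Sequence4"), ("Sequence2", "Sequence3")),
}
scores = {label: parsimony_score(t, alignment) for label, t in topologies.items()}
# scores == {"((1,2),(3,4))": 4, "((1,3),(2,4))": 5, "((1,4),(2,3))": 6}
```

Grouping Sequence1 with Sequence2 needs the fewest changes (4), so it is the most parsimonious of the three topologies.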

Page 26: Phylogenetic  Inference


Maximum Likelihood

Probabilistic methods can be used to assign a likelihood to a given tree and therefore allow the selection of the tree which is most likely given the observed sequences.

Probability for one residue a to change to b in time t along a branch of a tree: P(b|a,t)

Its actual calculation is dependent on what model for sequence evolution is used.

Poisson process:

$P(b \mid a, t) = \frac{1}{20} + \frac{19}{20} e^{-ut}$ for a = b

$P(b \mid a, t) = \frac{1}{20} - \frac{1}{20} e^{-ut}$ for a ≠ b
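A quick sanity check of the Poisson model, assuming the standard form in which the off-diagonal term is 1/20 - (1/20) e^{-ut}: the probabilities out of any residue must sum to 1, nothing has changed at t = 0, and every residue is equally likely for large t. The function name and the default rate u = 1 are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_probability(a, b, t, u=1.0):
    """P(b | a, t) under the equal-rates (Poisson) model for 20 amino acids."""
    decay = math.exp(-u * t)
    if a == b:
        return 1 / 20 + 19 / 20 * decay
    return 1 / 20 - 1 / 20 * decay


# The probabilities of all 20 outcomes starting from leucine sum to 1.
row_sum = sum(transition_probability("L", b, 0.5) for b in AMINO_ACIDS)
```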

Page 27: Phylogenetic  Inference


Bayesian Methods

Example: MrBayes

Uses a Markov Chain Monte Carlo (MCMC) approach to sample over tree space
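To give a feel for MCMC sampling, here is a toy Metropolis sampler over a single parameter, the distance t between two sequences, under an equal-rates 20-state model with a flat prior on (0, 10]. This is a deliberately small sketch and not how MrBayes is implemented. Real programs sample tree topologies, branch lengths and model parameters jointly. All names and tuning values are illustrative.

```python
import math
import random


def log_likelihood(t, n_same, n_diff):
    """log P(alignment | t), up to a constant, with substitution rate u = 1."""
    decay = math.exp(-t)
    return (n_same * math.log(1 / 20 + 19 / 20 * decay)
            + n_diff * math.log(1 / 20 - 1 / 20 * decay))


def metropolis(n_same, n_diff, steps=20000, burn_in=2000,
               width=0.05, t_max=10.0, seed=1):
    """Sample t from its posterior under a flat prior on (0, t_max]."""
    rng = random.Random(seed)
    t = 1.0
    current = log_likelihood(t, n_same, n_diff)
    samples = []
    for step in range(steps):
        proposal = t + rng.gauss(0.0, width)      # symmetric proposal
        if 0.0 < proposal <= t_max:               # prior is zero outside
            candidate = log_likelihood(proposal, n_same, n_diff)
            accept = math.exp(min(0.0, candidate - current))
            if rng.random() < accept:
                t, current = proposal, candidate
        if step >= burn_in:
            samples.append(t)                     # rejected moves repeat t
    return samples


samples = metropolis(n_same=80, n_diff=20)
posterior_mean = sum(samples) / len(samples)
```

The samples approximate the posterior distribution of t, so summaries such as the mean or a credible interval can be read directly from them.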

Page 28: Phylogenetic  Inference


Bootstrap resampling

To assess the reliability of trees

Resampling with replacement (see example on next slide)

What is “good enough”?? >60%?, >90%?

Page 29: Phylogenetic  Inference


Bootstrap resampling: example

Original sequence alignment:
Sequence 1: ARNDCQ
Sequence 2: VRNDCQ
            123456

Bootstrap resample 1:
Sequence 1: RRQCCA
Sequence 2: RRQCCV
            226551

Bootstrap resample 2:
Sequence 1: AQCDCQ
Sequence 2: VQCDCQ
            165456
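The resampling step can be sketched as follows. Whole alignment columns are drawn with replacement, so every sequence is resampled with the same column indices. The function names are illustrative, and column indices are 0-based in the code, whereas the slide numbers columns from 1.

```python
import random


def resample_columns(alignment, columns):
    """Build a new alignment from the given (0-based) column indices."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=None):
    """Yield alignments whose columns are drawn with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    for _ in range(n_replicates):
        columns = [rng.randrange(length) for _ in range(length)]
        yield resample_columns(alignment, columns)


alignment = {"Sequence 1": "ARNDCQ", "Sequence 2": "VRNDCQ"}

# Columns 2, 2, 6, 5, 5, 1 of the slide, written 0-based:
resample_1 = resample_columns(alignment, [1, 1, 5, 4, 4, 0])
# resample_1 == {"Sequence 1": "RRQCCA", "Sequence 2": "RRQCCV"}

replicates = list(bootstrap_replicates(alignment, n_replicates=100, seed=42))
```

Each replicate is then used to build a tree, and the support for a grouping is the percentage of replicate trees that contain it.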

Page 30: Phylogenetic  Inference


Summary

Multiple sequence alignment of homologous sequences

Pairwise distance calculation

Algorithmic Methods Based on Pairwise Distances:
• UPGMA
• Neighbor Joining

Optimality Criteria Based on Pairwise Distances:
• Fitch-Margoliash
• Minimal Evolution

Optimality Criteria Based on Character Data:
• Maximum Parsimony
• Maximum Likelihood

Bayesian Methods (MCMC)

Annotations on the slide: “Fast”; “More accurate” (in general)

Page 31: Phylogenetic  Inference


4. Software for multiple sequence alignments

Mafft: http://mafft.cbrc.jp/alignment/software/ Server: http://mafft.cbrc.jp/alignment/server/

T-Coffee: http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html Server: http://www.ch.embnet.org/software/TCoffee.html Server: http://www.ebi.ac.uk/t-coffee/

ClustalW: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/ Server: http://www.ebi.ac.uk/clustalw/

Probcons: http://probcons.stanford.edu/ Server: http://probcons.stanford.edu

Muscle: http://www.drive5.com/muscle/ Server: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

Page 32: Phylogenetic  Inference


Software for phylogeny reconstruction

List of programs: http://evolution.genetics.washington.edu/phylip/software.html

ML pairwise distance calculation (protein):
• TREE-PUZZLE: http://www.tree-puzzle.de/

Bootstrapping, pairwise distance calculation, UPGMA, NJ, Fitch-Margoliash, ME:
• PHYLIP: http://evolution.genetics.washington.edu/phylip.html

ME:
• FastME (server): http://atgc.lirmm.fr/fastme/
• MEGA: http://www.megasoftware.net/

ML:
• PhyML (server): http://www.atgc-montpellier.fr/phyml/
• RAxML (server): http://phylobench.vital-it.ch/raxml-bb/

Bayesian (MCMC):
• MrBayes: http://mrbayes.csit.fsu.edu/

Parsimony (esp. on Macintosh), display:
• PAUP: http://paup.csit.fsu.edu/

Tree display:
• Archaeopteryx: http://www.phylosoft.org/archaeopteryx/

Hypothesis testing:
• HyPhy: http://www.hyphy.org/

Page 33: Phylogenetic  Inference


Books

Richard Durbin et al.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids [http://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713/sr=1-1/qid=1170198997/ref=sr_1_1/102-4955297-1236120?ie=UTF8&s=books]

Joe Felsenstein: Inferring Phylogenies [http://www.amazon.com/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775/sr=8-1/qid=1170198215/ref=pd_bbs_sr_1/102-4955297-1236120?ie=UTF8&s=books]

Ziheng Yang: Computational Molecular Evolution [http://www.amazon.com/Computational-Molecular-Evolution-Oxford-Ecology/dp/0198567022/sr=1-1/qid=1170198731/ref=pd_bbs_sr_1/102-4955297-1236120?ie=UTF8&s=books]

Olivier Gascuel: Mathematics of Evolution & Phylogeny [http://www.amazon.com/Mathematics-Evolution-Phylogeny-Olivier-Gascuel/dp/0198566107/sr=1-1/qid=1170198842/ref=sr_1_1/102-4955297-1236120?ie=UTF8&s=books]

Page 34: Phylogenetic  Inference


5. “Homework”

• Download and install MrBayes: http://mrbayes.csit.fsu.edu/
• Read the tutorial: http://mrbayes.csit.fsu.edu/wiki/index.php/Tutorial
• Analyze the provided data set (“primates.nex”)
• Download and install PHYLIP: http://evolution.genetics.washington.edu/phylip.html
• Perform seqboot (100x) – dnadist – neighbor (NJ) – consense on “primates.nex” (you need to change the format accordingly)
• Compare the results (MrBayes vs. PHYLIP NJ)