Molecular phylogenetics

Molecular phylogenetics - NUI Galwaycathal/Teaching/MSc11/Phylogenetics.pdf · rate of evolution across alignment sites (typically modeled with discretized gamma distribution)

Genetic distance

Define genetic distance between a pair of ‘homologous’ sequences x and y as the number of substitutions that have occurred (per alignment site) since x and y diverged from their common ancestor

Genetic distance

Given the following sequence alignment, infer the genetic distance

A C G T T C A T T - - T G

A G - T C C C T G G G G G

Simplification: Ignore alignment positions with gaps
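As a first pass at this exercise, the sketch below counts the proportion of differing sites among the ungapped columns. The function name `p_distance` is hypothetical. Note that this observed proportion is only a lower bound on the genetic distance as defined above, because several substitutions at the same site are counted at most once.

```python
def p_distance(seq_x, seq_y, gap="-"):
    """Proportion of differing sites among ungapped alignment columns."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(seq_x, seq_y) if a != gap and b != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    mismatches = sum(1 for a, b in columns if a != b)
    return mismatches / len(columns)


x = "ACGTTCATT--TG"
y = "AG-TCCCTGGGGG"
print(p_distance(x, y))  # 5 mismatches over 10 ungapped columns
```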

Model of evolution

Continuous-time process over {A,C,G,T}

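$p_{ij}(t) = \left(e^{Gt}\right)_{ij}$

$L(A \mid t) = \prod_i p_{A_i^1 A_i^2}(t)$

$\hat{t} = \arg\max_t L(A \mid t)$

where G is the generator matrix of the process, A is the pairwise alignment, and $A_i^1$, $A_i^2$ are the nucleotides of the two sequences at site i.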

Some standard models of nucleotide evolution: Jukes and Cantor

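Rows and columns ordered A, T, C, G:

$$G = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$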
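Under the Jukes and Cantor model the transition probabilities have a closed form, so the maximum likelihood distance between two sequences can be found with a simple one-dimensional search. The sketch below assumes the generator is scaled so that one substitution is expected per unit time (off-diagonal rate 1/3); the function names and the golden-section search are illustrative choices, not something prescribed by the slides.

```python
import math


def jc_prob(same, t):
    """Jukes-Cantor transition probability after time t.

    Assumes off-diagonal rate 1/3, i.e. one expected substitution per unit time.
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def log_likelihood(t, n_same, n_diff):
    """ln L(A | t) for an ungapped pairwise alignment (constant terms dropped)."""
    return (n_same * math.log(jc_prob(True, t))
            + n_diff * math.log(jc_prob(False, t)))


def ml_distance(n_same, n_diff, lo=1e-9, hi=10.0, tol=1e-10):
    """Maximise ln L over t by golden-section search on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_likelihood(c, n_same, n_diff) > log_likelihood(d, n_same, n_diff):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0


# 5 matching and 5 differing ungapped sites, as in the earlier alignment exercise.
print(ml_distance(5, 5))
```

For this model the estimate agrees with the closed-form correction -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites.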

Kimura two-parameter model (1980)

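Rows and columns ordered A, T, C, G, with transition rate α and transversion rate β:

$$Q = \begin{pmatrix} q_{AA} & \beta & \beta & \alpha \\ \beta & q_{TT} & \alpha & \beta \\ \beta & \alpha & q_{CC} & \beta \\ \alpha & \beta & \beta & q_{GG} \end{pmatrix}$$

With $q_{ii} = -\sum_{j \neq i} q_{ij}$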

Models a difference in the rate of transitions and transversions

Recall:

$\sum_i \pi_i = 1$ and $\pi_i p_{ij} = \pi_j p_{ji}$ for all i, j

imply that the $\pi_i$ are the limiting probabilities of the chain and that the chain is reversible.

An analogous result applies to the $g_{ij}$ of a continuous-time chain

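Hasegawa-Kishino-Yano (HKY, 1985)

Rows and columns ordered A, T, C, G; α is the transition rate parameter, β the transversion rate parameter, and each diagonal entry (shown as a dot) is minus the sum of the other entries in its row:

$$G = \begin{pmatrix} \cdot & \beta\pi_T & \beta\pi_C & \alpha\pi_G \\ \beta\pi_A & \cdot & \alpha\pi_C & \beta\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \beta\pi_G \\ \alpha\pi_A & \beta\pi_T & \beta\pi_C & \cdot \end{pmatrix}$$

Includes parameters $\pi_k$ for the equilibrium nucleotide frequencies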

General Time-Reversible Model (Simon Tavaré 1986)

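Rows and columns ordered A, T, C, G; a to f are the six exchangeability parameters, and each diagonal entry (shown as a dot) is minus the sum of the other entries in its row:

$$G = \begin{pmatrix} \cdot & a\pi_T & b\pi_C & c\pi_G \\ a\pi_A & \cdot & d\pi_C & e\pi_G \\ b\pi_A & d\pi_T & \cdot & f\pi_G \\ c\pi_A & e\pi_T & f\pi_C & \cdot \end{pmatrix}$$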

Generator matrix (or equivalently time) scaled so that one substitution is expected in one unit of time

$\sum_i \pi_i \sum_{j \neq i} g_{ij} = 1$
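The reversibility condition and the scaling rule can both be checked numerically. The sketch below builds a generator in one common HKY-style parameterisation (transition entries multiplied by a ratio `kappa`), which is an assumption made here for illustration, as are the example frequencies and all function names. It then tests detailed balance and rescales the matrix so that one substitution is expected per unit time.

```python
NUCS = "ATCG"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_generator(pi, kappa):
    """g_ij = kappa * pi_j for transitions, pi_j for transversions;
    the diagonal is set so that each row sums to zero."""
    g = {i: {} for i in NUCS}
    for i in NUCS:
        for j in NUCS:
            if i != j:
                g[i][j] = (kappa if (i, j) in TRANSITIONS else 1.0) * pi[j]
        g[i][i] = -sum(g[i].values())
    return g


def expected_rate(g, pi):
    """Expected number of substitutions per unit time at equilibrium."""
    return -sum(pi[i] * g[i][i] for i in NUCS)


def rescale(g, pi):
    """Divide the generator by its expected rate so that the rate becomes 1."""
    rate = expected_rate(g, pi)
    return {i: {j: g[i][j] / rate for j in NUCS} for i in NUCS}


def is_reversible(g, pi, tol=1e-12):
    """Detailed balance: pi_i * g_ij == pi_j * g_ji for all i, j."""
    return all(abs(pi[i] * g[i][j] - pi[j] * g[j][i]) < tol
               for i in NUCS for j in NUCS)


pi = {"A": 0.1, "T": 0.2, "C": 0.3, "G": 0.4}  # illustrative frequencies
g = rescale(hky_generator(pi, kappa=2.0), pi)
print(expected_rate(g, pi), is_reversible(g, pi))
```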

Commonly used evolutionary models also allow heterogeneity in rate of evolution across alignment sites (typically modeled with discretized gamma distribution)

In general, simpler nucleotide substitution models are nested within successively more complex models – standard model comparison techniques can be used to select an appropriate model.

Phylogenetic tree: binary tree with edges representing genetic distance

An evolving sequence can bifurcate (e.g. speciation), giving rise to two daughter sequences

Branch length represents genetic distance between sequences (or hypothesized sequences) at the nodes

[Figure: tree with tips S1, S2, S3 and S4]

‘Rooted’ tree

‘Unrooted’ tree

Rooted versus unrooted tree

Most substitution models are reversible (see previous slides). Therefore the models cannot distinguish the time-direction of evolution. External information is usually incorporated to decide the position of the root (hypothetical ancestor of all of the sequences represented in the alignment)

Alternative tree representations…

[Figures: the same tree of ECP1 MOUSE, ECP2 MOUSE, ECP RAT, ECP HUMAN and ECP PONPY drawn in three different layouts, each with a 0.1 scale bar.]

Most conventional representation of a ‘rooted’ phylogenetic tree

[Figure: the same five-sequence tree with a 0.1 scale bar.]

How many trees?

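Number of rooted binary trees for n sequences: $(2n-3)! \,/\, \left(2^{n-2}\,(n-2)!\right)$

[Figure: log10(number of trees) plotted against number of sequences, from 10 to 50; vertical axis from 0 to 60.]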
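To get a feel for how quickly this count grows, the formula can be evaluated directly with exact integer arithmetic. The function name is hypothetical; the formula is the one given above for rooted binary trees.

```python
import math


def num_rooted_trees(n):
    """Number of rooted binary tree topologies on n labelled tips (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))


for n in (3, 4, 10, 50):
    count = num_rooted_trees(n)
    print(n, count, round(math.log10(count), 1))
```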

The phylogeny problem

Given a set of aligned DNA or amino acid sequences, infer the phylogenetic tree representing the evolution of the set of taxa

Requires:

- An optimality criterion (what constitutes the ‘best’ tree)

- Search algorithm

Commonly applied optimality criteria are

- Minimum evolution (tree with shortest sum of branch lengths)

- Maximum parsimony (tree requiring smallest number of steps to explain the data)

- Maximum Likelihood

- Maximum a posteriori Probability (MAP)

The likelihood of a tree:

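$P(D \mid T) = \sum_{a_1} \sum_{a_2} \sum_{a_3} \cdots \sum_{a_n} P(a_1)\, P(a_2 \mid a_1, b_2)\, P(a_3 \mid a_1, b_3) \cdots P(a_n \mid a_{n-1}, b_n)$

[Figure: tree with node states a1, a2, a3, …, an-1, an and branch lengths b2, b3, …, bn.]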

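A recursive algorithm is used to avoid doing all the summations (Felsenstein’s Pruning Algorithm)

Let $L^k_m$ be the likelihood of the subtree descended from node k, given that the nucleotide present at node k is m. Then

$L^k_m = \left[\sum_s L^i_s\, P(s \mid m, b_1)\right] \left[\sum_s L^j_s\, P(s \mid m, b_2)\right]$

where i and j are the daughter nodes of k, reached by branches of length b1 and b2.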

The L’s can be worked out easily for the leaf nodes:

Consider position i in sequence X

If b is a leaf node, then

$L^b_a = 1$ if $X_i = a$, and 0 otherwise

Missing information can be handled easily (using intermediate values at terminal nodes)

At the root node r:

$P(D \mid T) = \sum_s p(s)\, L^r_s$

Complexity: $O(n \cdot m \cdot k^2)$

(n = # sequences; m = sequence length; k = alphabet size)
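The recursion, the leaf initialisation and the sum at the root can be sketched in a few lines for a single alignment column. This is a minimal illustration under stated assumptions: Jukes and Cantor transition probabilities with the generator scaled to one substitution per unit time, a nested-tuple tree encoding invented here, uniform root probabilities by default, and an unknown base "N" treated as compatible with every nucleotide.

```python
import math

NUCS = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor P(b | a, t), one expected substitution per unit time."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial_likelihoods(node):
    """Return {m: L_m} for the subtree below `node`.

    A leaf is a one-character string; an internal node is a tuple
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    """
    if isinstance(node, str):
        # Leaf: 1 for the observed nucleotide, 0 otherwise ("N" matches all).
        return {m: 1.0 if node in (m, "N") else 0.0 for m in NUCS}
    (left, b1), (right, b2) = node
    l_left = partial_likelihoods(left)
    l_right = partial_likelihoods(right)
    return {
        m: sum(l_left[s] * jc_prob(m, s, b1) for s in NUCS)
        * sum(l_right[s] * jc_prob(m, s, b2) for s in NUCS)
        for m in NUCS
    }


def site_likelihood(root, root_probs=None):
    """P(D | T) for one column: sum over root states s of p(s) * L_s."""
    root_probs = root_probs or {m: 0.25 for m in NUCS}
    l_root = partial_likelihoods(root)
    return sum(root_probs[s] * l_root[s] for s in NUCS)


# Illustrative four-tip tree; topology and branch lengths are made up.
cherry_1 = (("A", 0.05), ("A", 0.05))
cherry_2 = (("T", 0.05), ("G", 0.05))
print(site_likelihood(((cherry_1, 0.01), (cherry_2, 0.01))))
```

Each internal node is visited once and does a k by k sum for each child, which is where the k² factor in the complexity comes from.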

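Exercise: Given the instantaneous transition rate matrix and tree shown, calculate the likelihood of the single alignment column shown at the tips of the tree.

Off-diagonal rates, rows and columns ordered A, T, C, G:

$$\begin{pmatrix} \cdot & 1 & 2 & 3 \\ 1 & \cdot & 5 & 4 \\ 2 & 5 & \cdot & 6 \\ 3 & 4 & 6 & \cdot \end{pmatrix}$$

[Figure: tree with tips A, A, T and G; four branches of length 0.05 and two of length 0.01.]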

Optimizing branch lengths

• If all branch lengths are known except one then the likelihood of the tree can be expressed as a function of the unknown branch length

• Standard problem of maximization in 1D for a single branch (e.g. Newton-Raphson)

• Although branches are not independent, branch maximizations tend not to interfere to a great extent

• A small number of successive maximizations normally succeeds in achieving the maximum likelihood set of branch lengths
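The one-dimensional step can be illustrated on the simplest possible case, a tree with a single branch joining two sequences. The sketch below applies Newton-Raphson to the derivative of a Jukes and Cantor log-likelihood, using finite differences for the derivatives. The function names, step size and starting value are assumptions made for illustration; a full implementation would recompute the tree likelihood at each trial branch length instead of using this closed form.

```python
import math


def log_lik(t, n_same, n_diff):
    """Pairwise Jukes-Cantor log-likelihood as a function of branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return (n_same * math.log(0.25 + 0.75 * decay)
            + n_diff * math.log(0.25 - 0.25 * decay))


def newton_raphson_max(f, t0, h=1e-4, tol=1e-8, max_iter=50, t_min=1e-3):
    """Solve f'(t) = 0 by Newton-Raphson with central finite differences."""
    t = t0
    for _ in range(max_iter):
        d1 = (f(t + h) - f(t - h)) / (2.0 * h)
        d2 = (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)
        if d2 >= 0.0:
            break  # not locally concave, so a Newton step is not trustworthy
        t_new = max(t - d1 / d2, t_min)  # keep the branch length positive
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t


# 5 matching and 5 differing sites; start from a rough guess of 0.5.
print(newton_raphson_max(lambda t: log_lik(t, 5, 5), t0=0.5))
```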

Searching for optimal trees

Branch & Bound

Heuristics – usually local perturbations with hill-climbing

Markov chain Monte Carlo (e.g. Metropolis-coupled MCMC, MC3)

Genetic algorithms

etc.

Common heuristic algorithm: Neighbor-Joining, an approximation to the minimum evolution tree

[Figure: two trees on taxa 1 to 8, before and after one pair of taxa is joined.]

Choose the pair that minimizes the length of the resulting tree
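The slide does not spell out how the pair is chosen. The sketch below assumes the standard neighbor-joining criterion, Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k), whose minimum identifies the pair to join. The taxon names and distances are made-up illustrative values.

```python
def nj_pick_pair(names, dist):
    """Return the pair minimising the neighbor-joining Q criterion,
    together with its Q value."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    totals = {a: sum(dist[a][b] for b in names if b != a) for a in names}
    best_pair, best_q = None, float("inf")
    for x in range(n):
        for y in range(x + 1, n):
            a, b = names[x], names[y]
            q = (n - 2) * dist[a][b] - totals[a] - totals[b]
            if q < best_q:
                best_pair, best_q = (a, b), q
    return best_pair, best_q


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}
print(nj_pick_pair(names, dist))
```

Note that the selected pair (a, b) is not the pair with the smallest raw distance (d, e), which is the point of correcting each distance by the row totals.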

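[Figure: unrooted tree with tips A, B, C and D; terminal branches r (to A), s (to B), u (to C) and v (to D); internal branch t.]

dAB ~ r + s

dCD ~ u + v

dAD ~ r + t + v

dBC ~ s + t + u

Tree length = u + v + t + r + s

(r, s, u, v, t are estimated using the least squares method)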

Branch & Bound

Exact

Can be used with several different optimality criteria

Algorithm:

Traverse the search tree in some order

Exclude a subtree from the search if the score at the root node of the subtree is worse than the best score achieved so far

Can improve speed by starting with a tree inferred using a different method

Works because the score only gets worse as you proceed towards the tips of the search tree

Complexity: At worst equal to the complexity of the exhaustive search

Branch and bound

http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html

Heuristic search algorithms

Greedy algorithms - Hill climbing approach

• NNI (Nearest Neighbour Interchange): break an interior branch and replace with one of the two alternative branches

• SPR (Subtree Pruning and Regrafting): remove a subtree from the tree and reinsert elsewhere

• TBR (Tree Bisection and Reconnection): break the tree to form two subtrees. Reconnect the two subtrees with a new branch between two existing branches in the two subtrees

Genetic Algorithms (e.g. MetaPIGA; GARLI)

Heuristic search algorithms

Can be sped up by starting with a reasonable tree (e.g. tree inferred with NJ algorithm).

Can also be sped up by estimating other parameters using an approximate tree prior to inferring the final tree topology (iterate if necessary).

Start tree can be from

- a tree inferred from another method (e.g. NJ)

- Stepwise addition

- Star decomposition

Star decomposition

http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html

Stepwise addition

http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html

Traversing a tree

Tree traversals:

Preorder: node; left subtree; right subtree

Inorder: left subtree; node; right subtree

Postorder: left subtree; right subtree; node

All of these can be implemented using recursive functions
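A minimal recursive sketch of the three traversals follows. The `Node` class and the small example tree are hypothetical and chosen only to make the visiting orders easy to compare.

```python
class Node:
    """Binary tree node; leaves have no children."""

    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right


def preorder(node):
    """Node, then left subtree, then right subtree."""
    if node is None:
        return []
    return [node.label] + preorder(node.left) + preorder(node.right)


def inorder(node):
    """Left subtree, then node, then right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.label] + inorder(node.right)


def postorder(node):
    """Left subtree, then right subtree, then node."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.label]


# Root R with an internal child X (leaves A and B) and a leaf child C.
tree = Node("R", Node("X", Node("A"), Node("B")), Node("C"))
print(preorder(tree))
print(inorder(tree))
print(postorder(tree))
```

Postorder is the order needed for the pruning algorithm, since a node's partial likelihoods can only be computed after those of both its daughters.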

Exercise: Sketch this tree and label its nodes in the order in which they would be visited on preorder, inorder and postorder traversal, starting the algorithm at the root node

Bayesian MCMC in phylogenetics

Prior over trees (often flat)

Starting tree (star decomposition, step-wise addition, NJ)

Proposals: tree perturbations

Acceptance depends on ratio of posterior probabilities (= ratio of likelihoods when the prior is flat)

Determine burn-in and convergence

In molecular phylogenetics the prior is usually ‘flat’

Why bother?

1. We get the answer as a probability

2. We get to use MCMC to sample over trees/search for ‘best’ tree

3. Allows us to integrate over nuisance parameters rather than using their optimal values

Running a phylogenetic MCMC

Generate long chain of trees/parameters sampled according to their joint posterior probability

The number of times the chain visits tree X is proportional to the probability of tree X

The number of times a specific branch is sampled can be used to estimate the posterior probability that the ‘clade’ of taxa specified by the branch is correct
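The claim that visit counts are proportional to posterior probability can be demonstrated on a toy problem. In the sketch below three states stand in for three candidate trees, with made-up unnormalised posterior weights; the proposal is symmetric, so the Metropolis acceptance probability reduces to the posterior ratio. All names and numbers are illustrative assumptions, and real tree proposals are far more elaborate.

```python
import random


def metropolis_visits(weights, n_steps, burn_in, seed=0):
    """Count visits to each state of a Metropolis chain whose stationary
    distribution is proportional to `weights`."""
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = [0] * n
    for step in range(n_steps):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.choice([s for s in range(n) if s != state])
        # Accept with probability min(1, posterior ratio).
        if rng.random() < weights[proposal] / weights[state]:
            state = proposal
        if step >= burn_in:
            visits[state] += 1
    return visits


weights = [1.0, 2.0, 7.0]  # unnormalised posteriors of three "trees"
visits = metropolis_visits(weights, n_steps=200_000, burn_in=1_000)
total = sum(visits)
print([round(v / total, 3) for v in visits])  # should be close to 0.1, 0.2, 0.7
```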

Multiple chains may be used (Metropolis coupled MCMC = MC3)

• Only one chain is sampled

• The other chains are heated (i.e. their posterior is flattened, so they can take bigger steps)

• Chains can swap states

• Allows crossing of valleys

Burn-in

• From an arbitrary starting point the chain can take some time to equilibrate

• Consequently, the chain takes some time before samples are obtained according to their posterior probabilities

• Initially the probability of the sampled trees increases with time

• Programs are run until the probabilities are fluctuating randomly about a constant mean

• Data generated before the chain equilibrates are discarded

[Figure: two trace plots of log-likelihood (lnL1 and lnL2) against sample index, 0 to 1200; log-likelihood axis from -25000 to -10000.]

Proposals

• Topology (e.g. NNI) or ‘coalescence time’ perturbations have been used

• Choice of proposal significant

– Too aggressive results in rejection of most proposals

– Too conservative takes too long to provide adequate sampling of parameter space

Advantages of Bayesian methods

- relatively fast

- easily interpretable

- often very accurate

Disadvantages of Bayesian methods

- can be difficult to be sure of convergence (this has improved with availability of better diagnostics)

- still controversial in molecular phylogenetics – choice of prior can be difficult to justify

- thought by some to exaggerate confidence

Software: e.g. MrBayes

Further reading (molecular phylogenetics)

Joseph Felsenstein (2004). Inferring Phylogenies. Sinauer.

Models of codon sequence evolution and inference of positive Darwinian selection

‘Universal’ genetic code is degenerate => natural classification of mutations as:

Nonsynonymous: amino acid changing (rate - dN)

Synonymous: no amino acid change (rate - dS)

ω: dN/dS

ω > 1 => adaptive evolution (actually ‘diversifying selection’)

Example: Analysis of selection using simple discrete ω distribution

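Neutral model:

[Figure: discrete ω distribution on an axis marked 0 and 1.]

Free parameters: ω- < 1; pω-

Selection model:

[Figure: discrete ω distribution on an axis marked 0 and 1.]

Free parameters: ω- < 1; pω-; ω+ > 1; pω+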

Questions of interest

Is there a subset of sites with ω > 1?

- model comparison techniques

Which sites are evolving adaptively? (empirical Bayes method)

- fix all parameter values to their ML estimates

- using ML estimates as priors, calculate posterior probabilities of belonging to selection site class

$P(\omega_i > 1 \mid D) \propto L(D \mid \omega_i > 1)\, P(\omega_i > 1)$

Analysis of selection

- Infer a phylogenetic tree

- Obtain ML estimates of all parameters

- Use LRT (or other model comparison method) to evaluate evidence for selection

- Use empirical Bayes method (or a variant) to estimate posterior probabilities of belonging to the selection site class

$P(\omega_i \mid D) = \dfrac{P(D \mid \omega_i)\, p_i}{\sum_k P(D \mid \omega_k)\, p_k}$
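The last two steps of this workflow can be sketched numerically. The first function below applies Bayes' rule to per-class site likelihoods, using the estimated class proportions as priors. The second computes a likelihood ratio test p-value assuming a chi-square reference distribution with 2 degrees of freedom, on the reasoning that the selection model adds two free parameters (ω+ and pω+). That reference distribution is an assumption of this sketch, and all function names and numbers are hypothetical.

```python
import math


def class_posteriors(site_likelihoods, proportions):
    """Posterior probability of each site class for one site.

    site_likelihoods[k] = P(D | omega_k) for this site
    proportions[k]      = p_k, the estimated proportion of sites in class k
    """
    joint = [lik * p for lik, p in zip(site_likelihoods, proportions)]
    total = sum(joint)
    return [j / total for j in joint]


def lrt_p_value_df2(lnl_neutral, lnl_selection):
    """LRT p-value against a chi-square with 2 degrees of freedom,
    whose survival function is exp(-x / 2)."""
    statistic = 2.0 * (lnl_selection - lnl_neutral)
    return math.exp(-max(statistic, 0.0) / 2.0)


# Made-up values for one site with classes (omega-, omega = 1, omega+).
print(class_posteriors([0.1, 0.2, 0.4], [0.5, 0.25, 0.25]))
# Made-up maximised log-likelihoods for the two models.
print(lrt_p_value_df2(-1000.0, -995.0))
```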