MHP 2 Phylogenetics - SFU.ca

Phylogenetic Methods

How to reconstruct phylogenies
•  Algorithms vs. Optimality
•  Parsimony
•  Models and Distances

[Figure: rooted mtDNA tree (Gagneux et al. 1999). Tips: Modern Humans, Neanderthal, Western Gorilla, Mountain Gorilla, Eastern Gorilla, Sumatran Orangutan, Bornean Orangutan, Bonobo, Western Chimpanzee, Central Chimpanzee, Eastern Chimpanzee.]

Humans are a recently-evolved species, and human genetic diversity is very low compared to other apes!

Due to drift, samples of mitochondrial DNA are related as a tree, and if we can reconstruct that tree, we should be able to infer many things about the populations

Background: how many different trees?

For four taxa:

[Figure: the three possible unrooted trees for taxa A, B, C and D.]

# of taxa   # of unrooted trees
2           1
3           1
4           3
5           15
6           105
7           945
8           10395
9           135135
10          2027025

4 taxa = 3 trees; in general (2n-5)!! unrooted trees for n taxa

10 taxa = ~2*10^6 trees

How many rooted trees?

For four taxa:

[Figure: unrooted four-taxon trees for A, B, C and D, on which a root is to be placed.]

The root can be placed on any branch or internode. The total number of branches on an unrooted tree is 2n-3, where n = number of taxa. Therefore, the number of rooted trees corresponding to one unrooted tree is 2n-3 (five for four taxa).

[Figure: one unrooted four-taxon tree and the five rooted trees obtained by placing the root on each of its branches.]

# of taxa   # of unrooted trees   # of rooted trees
2           1                     1
3           1                     3
4           3                     15
5           15                    105
6           105                   945
7           945                   10395
8           10395                 135135
9           135135                2027025
10          2027025               34459425

4 taxa = 15 rooted trees; in general (2n-3)(2n-5)!! = (2n-3)!! rooted trees for n taxa

10 taxa = ~3*10^7 trees

For four taxa:

[Figure: the three unrooted trees, each with five possible root positions (5 + 5 + 5), giving 15 rooted trees.]
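The counts in the table follow from the double-factorial formulas (2n-5)!! and (2n-3)!!. A minimal sketch that reproduces them is given here; the function names are illustrative and not part of the slides.

```python
def double_factorial(m: int) -> int:
    """Product m * (m-2) * (m-4) * ... down to 1; taken as 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 2 taxa: (2n-5)!!"""
    return double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n-3)!!"""
    return double_factorial(2 * n - 3)


# Print the same three columns as the table: taxa, unrooted, rooted.
for n in range(2, 11):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Each rooted count equals the unrooted count multiplied by the 2n-3 branches on which a root can be placed.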

Algorithms vs. Optimality

UPGMA, Neighbor-Joining and ‘Branch and Bound’ are algorithms (or recipes) that do not optimize anything.

Maximum Likelihood, Maximum Parsimony and Least-Squares are optimality criteria: they do not specify how candidate hypotheses are arrived at (e.g. how trees are sampled) but do offer a yardstick for assessing which hypotheses are preferred.

Algorithms are fast (N-J runs in low-order polynomial time in the number of taxa n) but guarantee neither a ‘right’ answer nor an evaluation of fit.

Maximum Parsimony

We do not believe that evolution is parsimonious, but we believe that the characters we choose evolve in such a way that maximum parsimony offers the best chance of recovering the correct relationships.

Tree length becomes our optimality criterion:

choose the shortest tree among all contenders...

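$$L(t) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \, \mathrm{diff}(x_{k'j}, x_{k''j})$$

minimize L(t), the length of the tree

We calculate the length of a tree, L(t), as the sum across all B branches, for all N characters, each given weight w_j and each having a cost of change diff(x, y), where x_{k'j} and x_{k''j} are the states of character j at the two ends of branch k.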

How to reconstruct phylogeny: Parsimony

A aat tcg ctt cta gga atc tgc cta atc ctg
B ... ..a ..g ..a .t. ... ... t.. ... ..a
C ... ..a ..c ..c ... ..t ... ... ... t.a
D ... ..a ..a ..g ..g ..t ... t.t ..t t..

Parsimony assumes discrete data that represent state changes along a tree, i.e. each column is a character whose variation is due to evolution.

Alignment: the art of producing such columns

How to reconstruct phylogeny: methods - parsimony

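[Figure: the three unrooted trees for taxa A-D, with the changes at characters 2, 3 and 4 mapped onto the branches. ((A,B),(C,D)): Length=3. ((A,C),(B,D)): Length=5. ((A,D),(B,C)): Length=5.]

     1 2 3 4
A    a c a t
B    a c a t
C    a g g t
D    a g g a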

•  Parsimony allows the use of all known evolutionary information in building a tree.

•  Parsimony involves assigning scores based on the number of evolutionary changes that are needed to explain the observed data to all possible trees.

•  The best tree is the one that requires the fewest (homoplasious) changes.

•  Only synapomorphies are parsimony-informative

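In the data matrix, character 1 is invariant, character 4 is unique (only one taxon differs), and characters 2 and 3 are parsimony informative.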
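The tree lengths 3, 5 and 5 for the four-taxon matrix (A = acat, B = acat, C = aggt, D = agga) can be checked with a short sketch. It uses Fitch's small-parsimony counting, which the slides do not name, and assumes unordered states, equal weights (w = 1) and a cost of 1 per change. Each unrooted tree is written as a nested tuple with an arbitrary root, which does not affect the count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character.

    tree   -- a taxon name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping taxon name to its state at this character
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_changes + right_changes
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over all characters (columns)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for j in range(n_sites):
        column = {taxon: seq[j] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


alignment = {"A": "acat", "B": "acat", "C": "aggt", "D": "agga"}
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
for name, tree in trees.items():
    print(name, tree_length(tree, alignment))
```

Only characters 2 and 3 distinguish the trees: character 1 costs nothing and character 4 costs one change on every tree.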




To distinguish ancestral from derived character states (and thus allow for a temporal dimension) an ‘outgroup’ (i.e. the sistergroup of the taxa of interest) is added. Using outgroup comparisons, the most parsimonious rooted tree can be found.

[Figure: the three unrooted trees again, with Length=3, Length=5 and Length=5.]

Shortest unrooted tree, but where’s the root?

     1 2 3 4
A    a c a t
B    a c a t
C    a g g t
D    a g g g
E    c g a t


How to reconstruct phylogeny: methods - cladistics - parsimony

[Figure: the five rooted trees obtained by attaching the outgroup E to each branch of the shortest unrooted tree, with the changes at characters 1-4 mapped onto the branches and homoplasious changes marked. Tree lengths: 5, 4, 5, 5, 5.]



•  MP uses the observed states of characters to infer the shortest set of paths (shortest tree). No probability.

•  For datasets where different branches have different rates of evolution (i.e. fast and slow branches), the MP tree may not be the most likely tree.

•  This is Long-Branch Attraction (LBA): the inconsistency caused by heterogeneous rates across the tree.

Maximum Parsimony and the LBA

MP has no model or way to correct for LBA

We can use a model to “correct for multiple hits”. This can be done in either a discrete or continuous framework. Let’s look at the continuous one first.

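[Figure: observed distance p (vertical axis, 0 to 0.8) plotted against the estimated number of substitutions µt (horizontal axis, 0 to 2.5). The curve is linear at low µt and flattens at saturation.]

$$p = \frac{3}{4}\left(1 - e^{-\frac{4}{3}\mu t}\right)$$

$$\mu t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$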

This true tree produces these data:

[Figure: the true four-taxon tree, with two long branches (1.0 each) and three short branches (0.1 each).]

     1      2      3      4
1    0.0    0.577  0.704  0.599
2           0.0    0.599  0.247
3                  0.0    0.577
4                         0.0

And the least-squares tree on the observed distances is incorrect, just like the MP tree would be:

[Figure: least-squares tree with taxa 1 and 3 on one side and taxa 2 and 4 on the other; branch lengths shown: 0.12, 0.35, 0.11.]

The corrected distances would be:

     1      2      3      4
1    0.0    1.1    2.1    1.2
2           0.0    1.2    0.3
3                  0.0    1.1
4                         0.0

And the least-squares tree* on the corrected distances is now correct:

[Figure: the recovered tree matches the true tree, with branch lengths 0.1, 0.1, 0.1, 1.0 and 1.0.]
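The step from observed to corrected distances can be reproduced with the correction µt = -3/4 ln(1 - 4/3 p). The sketch below applies it to the observed distances 0.577, 0.704, 0.599 and 0.247; the function names are illustrative only.

```python
import math


def jc_correct(p: float) -> float:
    """Corrected distance (expected substitutions per site) from observed p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_expected_p(mu_t: float) -> float:
    """Observed proportion of differing sites expected for a distance mu_t."""
    return 0.75 * (1 - math.exp(-4 * mu_t / 3))


# Observed pairwise distances for taxa 1-4, taken from the matrix above.
observed = {
    ("1", "2"): 0.577, ("1", "3"): 0.704, ("1", "4"): 0.599,
    ("2", "3"): 0.599, ("2", "4"): 0.247, ("3", "4"): 0.577,
}
for pair, p in observed.items():
    print(pair, round(jc_correct(p), 1))
```

The largest observed distance (0.704) sits close to the saturation value of 0.75, so it receives the largest correction.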

How to reconstruct phylogeny: distance methods

Aligned DNA sequences

A  a c c g a t c g t a a c g t
B  . . . . g . . . c . . . . .
C  . . t . t . . a g . g . a .
D  g . g a c c . c a . c . t c

Aligned DNA sequences

A  t g g c g t g a a g c g a c
B  c . a . t . . . g . . a . t
C  c . a . . . a . g . . a g t
D  c a a t t c a g t a g a g g

Phenograms (i.e. phenetic trees) are obtained using distance methods to build trees from comparative data.


Distance matrix (first alignment):

     A    B    C    D
A
B    2
C    6    6
D    10   10   10

Distance matrix (second alignment):

     A    B    C    D
A
B    6
C    7    3
D    14   10   9



A pairwise distance matrix contains the estimated number of different sites between all pairs of sequences
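As a small illustration, the raw (uncorrected) distances for the first alignment can be counted directly. The sketch assumes that a dot means "same as sequence A at this site"; the helper names are made up for this example.

```python
from itertools import combinations


def expand(reference: str, row: str) -> str:
    """Replace each '.' in row with the reference base at that site."""
    return "".join(r if c == "." else c for r, c in zip(reference, row))


def count_differences(s: str, t: str) -> int:
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for a, b in zip(s, t) if a != b)


reference = "accgatcgtaacgt"  # sequence A of the first alignment
rows = {
    "A": reference,
    "B": "....g...c.....",
    "C": "..t.t..ag.g.a.",
    "D": "g.gacc.ca.c.tc",
}
sequences = {name: expand(reference, row) for name, row in rows.items()}

distances = {
    (x, y): count_differences(sequences[x], sequences[y])
    for x, y in combinations(sorted(sequences), 2)
}
for pair, d in distances.items():
    print(pair, d)
```

Dividing each count by the alignment length (14 sites here) would give the proportion of differing sites instead of the raw count.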


[Figure: aligned DNA sequences -> distance matrix -> tree. The first distance matrix gives an ultrametric tree and the second an additive tree, each drawn with its branch lengths.]

Then, a tree is constructed: e.g. by linking the least distant pairs of taxa, followed by successively more distant taxa.
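One way to make "linking the least distant pairs" concrete is a UPGMA-style clustering. The sketch below is a minimal version under stated assumptions: distances are stored in a dict keyed by frozenset pairs, clusters are labelled with Newick-like strings, and ties are broken by dictionary order. It returns the topology only, without branch lengths.

```python
def upgma(distances):
    """Cluster taxa by repeatedly joining the closest pair (UPGMA averaging).

    distances -- dict mapping frozenset({x, y}) to the distance between x and y
    Returns a Newick-like string describing the rooted topology.
    """
    d = dict(distances)
    size = {name: 1 for pair in d for name in pair}  # taxa per cluster
    while len(size) > 1:
        a, b = sorted(min(d, key=d.get))  # closest pair of clusters
        merged = f"({a},{b})"
        # Keep distances that do not involve the two merged clusters.
        new_d = {pair: v for pair, v in d.items()
                 if a not in pair and b not in pair}
        for c in size:
            if c in (a, b):
                continue
            d_ac = d[frozenset((a, c))]
            d_bc = d[frozenset((b, c))]
            # Size-weighted average distance from the new cluster to c.
            new_d[frozenset((merged, c))] = (
                (size[a] * d_ac + size[b] * d_bc) / (size[a] + size[b])
            )
        size[merged] = size.pop(a) + size.pop(b)
        d = new_d
    return next(iter(size))


# Distances from the first (ultrametric) matrix shown earlier.
matrix = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10,
    frozenset(("C", "D")): 10,
}
print(upgma(matrix))
```

Because this matrix is ultrametric, the averaging step leaves the distances to C and D unchanged at 6 and 10 after A and B are joined.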



Distance methods can use clustering algorithms (N-J) or an optimality criterion (least squares, minimum evolution) to convert the distances to a tree.

Both can use raw or modelled distances:

Raw: percent different sites (if aligned)
Modelled: corrected with a model (e.g. Jukes-Cantor model)

Neighbor-Joining Algorithm

is similar in flavour to the more intuitive UPGMA, but doesn’t force everything to be equidistant from a root (indeed, it only produces unrooted trees: you have to root them by knowing the outgroup or by using, e.g., midpoint rooting)

Many fast tree-building programs build N-J trees.

1 RANDOM
2 RONDON
3 RONFON

pseudoreplicates (500+):

RRDAOM    AMRAON    NNNDON    ...
RRDOON    ONROON    NNNDON    ...
RRFOON    ONROON    NNNFON

[Figure: a tree of taxa 1, 2 and 3 is built from each pseudoreplicate; the bootstrap tree for 1, 2 and 3 carries 66% support.]
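The pseudoreplicates are made by resampling alignment columns with replacement. The sketch below only generates them for the RANDOM / RONDON / RONFON example; the tree-building step is left out, and the seed and function name are arbitrary choices for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; every taxon gets the same columns."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[j] for j in columns)
        for name, seq in alignment.items()
    }


alignment = {"1": "RANDOM", "2": "RONDON", "3": "RONFON"}
rng = random.Random(1)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(500)]

# Show the first few pseudoreplicates.
for rep in replicates[:3]:
    print(rep["1"], rep["2"], rep["3"])
```

A tree would then be built from each pseudoreplicate, and the support for a grouping is the percentage of those trees that contain it.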

Consensus neighbor-joining tree of 104 human mtDNA complete sequences.

Mishmar D et al. PNAS 2003;100:171-176 ©2003 by National Academy of Sciences

African

European

Asian/Native American

Likelihood

$$\mathrm{Lik}(h) \propto P(D \mid h, m)$$

The likelihood of a hypothesis (e.g. of a tree) is proportional to the probability of the data arising (the sequences) given the hypothesis and a model

(This says nothing about the probability of the model)

Maximum Likelihood

The ML is reached at the point that the hypothesis produces the highest probability of seeing the data

D: HHTTHTHHTTT (11 tosses of a coin)
m: independent tosses with some p = Prob(Heads)
Likelihood for different h would correspond to different values for p
Which h is most likely?

[Figure: likelihood of h (vertical axis, 0 to 0.0006) plotted against all possible h (= p, horizontal axis, 0 to 1).]

Data: HHTTHTHHTTT
P(data | hyp, m) = pp(1-p)(1-p)p(1-p)pp(1-p)(1-p)(1-p) = p^5 (1-p)^6

we are usually interested in the h (the tree) that returns the ML not in the ML itself (any one tree is not very likely)
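The coin example can be worked through numerically. The sketch below evaluates the likelihood of HHTTHTHHTTT on a grid of p values and picks the best one; the grid resolution is an arbitrary choice.

```python
def likelihood(p: float, data: str) -> float:
    """Probability of the exact toss sequence given Prob(Heads) = p."""
    lik = 1.0
    for toss in data:
        lik *= p if toss == "H" else 1 - p
    return lik


data = "HHTTHTHHTTT"  # 5 heads and 6 tails
grid = [i / 1000 for i in range(1001)]  # candidate hypotheses h (= p)
best_p = max(grid, key=lambda p: likelihood(p, data))

print("ML estimate of p:", best_p)
print("maximum likelihood:", likelihood(best_p, data))
```

The maximum falls at the observed proportion of heads, 5/11, and the likelihood there is small in absolute terms, which is why the h that maximizes it matters more than the value itself.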

•  The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.

•  Since each nucleotide site is assumed to evolve independently, the likelihood is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data - FOR ONE OF ALL POSSIBLE TREE SHAPES

•  Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.

How to reconstruct phylogeny: methods - maximum likelihood

Model of sequence evolution:

No change: p=0.7
Substitution: p=0.3

One of many possible ways the pattern of nucleotides at a given site could have evolved:

[Figure: a tree labelled with the states x=a and y=g and the nucleotides a, g and c.]

-ln(((0.7*0.3*0.3*0.3)+(all other combinations))*(all other sites))

Models

There are many models of evolution, which have different numbers of parameters to estimate when calculating the Lik. (J-C has 1; Kimura,2; HKY,5; HKY+Inv,6; etc.)

There are two types of parameters:
•  variation in possible substitutions (ti, tv, A->C vs A->T)
•  variation across the sequence (gamma, invariant sites)

For discrete characters, the Mk1 model has one parameter; we will (I hope) derive the model fully when we have some data

How to choose a model?

1.  One can first build a N-J tree on raw distances,
2.  then calculate Max. Lik. of data on each tree under different models of evolution.
3.  Compare the relationship between the number of parameters and the actual ML fit to decide on the model, e.g. the one with the lowest value for the Akaike Information Criterion (AIC)

$$\mathrm{AIC} = -2\ln(\mathrm{Lik}) + 2k$$

where k is the number of parameters. The 2k term gets bigger with more complex models; the -2 ln(Lik) term gets smaller with more complex models (better fit, higher Lik).
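A minimal sketch of the AIC comparison follows. The log-likelihood values are made up purely for illustration and do not come from any dataset; the parameter counts are the ones quoted in the slides.

```python
def aic(log_lik: float, k: int) -> float:
    """Akaike Information Criterion: -2 ln(Lik) + 2k."""
    return -2 * log_lik + 2 * k


# (hypothetical ln(Lik), number of parameters k) for each model
models = {
    "J-C": (-1250.0, 1),
    "Kimura": (-1238.0, 2),
    "HKY": (-1236.5, 5),
}

scores = {name: aic(log_lik, k) for name, (log_lik, k) in models.items()}
best_model = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, score)
print("lowest AIC:", best_model)
```

With these invented numbers the most complex model fits best, but its small gain in likelihood does not pay for the three extra parameters, so the intermediate model has the lowest AIC.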

How to reconstruct phylogeny: search methods

Before we can assess the ‘goodness’ of competing phylogenetic hypotheses (i.e. trees) using an optimality criterion, we have to build tree shapes. Assessing all possible tree shapes (i.e. exhaustive searches) takes a mighty long time for large numbers of taxa. A shortcut is provided by ‘hill-climbing’ algorithms (i.e. heuristic searches), of which many different flavors exist. They all follow this philosophy:

•  Similar tree shapes have a similar ‘goodness’ (e.g. likelihoods).

•  So, by starting with a tree (any old tree) and changing the shape in small steps, while constantly keeping track of which changes are improvements and which are not, the best tree will eventually be found.

How to reconstruct phylogeny: search methods: exhaustive searching

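[Figure: “Optimality” (vertical axis) across all possible tree shapes (horizontal axis). The exhaustively evaluated trees cover the whole axis, including the optimal tree ‘A’.]

An exhaustive search will return the optimal tree shape ‘A’ after evaluating all possible trees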

How to reconstruct phylogeny: search methods: heuristic searching

[Figure: “Optimality” across all possible tree shapes. Starting from tree ‘B’, ‘hill climbing’ moves through neighbouring tree shapes toward the optimal tree ‘A’; only the heuristically evaluated trees along the way are scored.]

The molecular clock for haemoglobins was assumed in 1962 by Linus Pauling and Emile Zuckerkandl, shown later by Margoliash (1964), and by P&Z in 1965...

"the discovery of the molecular clock stands out as the most significant result of research in molecular evolution." (R. Lewin)

"a very important idea that has turned out to be much truer than people thought at the time." (F. Crick)

"one of the most elegantly simple concepts in biology, but it is also one of the most contentious." (S. Easteal et al.)

pauling.library.oregonstate.edu

Can't find pic of Margoliash...

Molecular Evolution and the Neutral Theory

Margoliash, PNAS 1964

Motoo Kimura (1966, 1983)

Motoo described how substitutions might occur


orange allele changes in frequency... perhaps due to selection

A substitution is the replacement of one allele for another as the predominant allele in a population

[Figure: a population of N=10 individuals followed through time.]

Substitutions occur at some background rate due to drift in both large and small populations

orange allele changes in frequency... due to drift

The Neutral Theory simply states that most genetic variation is not due to, nor acted on by, Natural Selection

This drift is not working against selection (as in small pops) but is simply 'not seen' by selection...

Neutral Theory

The assumption is that most gene products are already at their optimum. Selection weeds out the (very) harmful mutations, and all the variation we actually see is the leftover, neutral variation created by mutation and drift. (Very few new mutations are subject to positive selection.)

http://online.itp.ucsb.edu/online/infobio01/ohta/oh/01.html

The Neutral theory was proposed to explain the clock-like substitution pattern among species

Number of neutral mutations created per generation: 2Nν (where ν is a subset of µ, the overall mutation rate)

e.g.: if the neutral mutation rate is 10^-6 per position per generation, and if each position is represented 10^6 times (2N = 10^6), then expect 1 new mutation per base per generation (i.e. someone is carrying that new mutation)

6 different copies of an allele; here we focus on the ‘black’ one

one lineage eventually drifts to fixation, and the chance it is the one we were looking at is 1/6

drift...

For a new neutral mutation, the probability that it ‘becomes’ the most common one is the same as its initial frequency = (2N)^-1 (all have equal chance). You can think of ‘becomes the most common one’ as ‘being the ancestor of the most common one’.

If we focus on the black allele, it has a 1/6 chance of replacing the others

$$k = 2N\nu \times \frac{1}{2N} = \nu$$

where k is the rate of substitution (in substitutions/gen), 2Nν is the number of candidate mutations, and 1/(2N) is the chance for each mutation to ‘fix’.

The substitution rate of neutral mutations k is equal to the neutral mutation rate ν under the Neutral Model

How can this be??

Remember, we are looking only at those mutations that are neutral.

In small populations, the number of new mutations is low (not a lot of DNA to mutate): 2N is small. But drift is fast, i.e. the chance that any allele increases in frequency is high (1/2N is relatively large).

In large populations, there are lots of new mutations (2N is large). But the chance that any one of them substitutes is low, because drift is slow (1/2N is small).

So the effects of population size on the number of mutations and on the drift of those mutations cancel each other out.
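The cancellation can be checked with exact arithmetic. In the sketch below the neutral mutation rate is written nu and set to 10^-6 as in the earlier example; the population sizes are arbitrary illustrative values.

```python
from fractions import Fraction


def substitution_rate(pop_size: int, nu: Fraction) -> Fraction:
    """Neutral substitution rate k = (new mutations) x (fixation chance)."""
    new_mutations = 2 * pop_size * nu            # candidates per generation
    fixation_chance = Fraction(1, 2 * pop_size)  # initial frequency 1/(2N)
    return new_mutations * fixation_chance


nu = Fraction(1, 10**6)  # neutral mutation rate per position per generation
for pop_size in (10, 1000, 500000):
    print(pop_size, substitution_rate(pop_size, nu))
```

Every population size returns the same rate, because the factor 2N appears once in the numerator and once in the denominator.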

random substitution

daughter species inherit random substitutions

all this occurs at rate ν and so we have a clock

[Figure: tree with tips shark, quoll and human.]

And so, for the same locus, if ν is similar for all species, you get a ‘clock’ in generations

ν must differ between genes (Table 7.1)

The relevant rate is the neutral mutation rate, not the overall mutation rate. (The overall rate could also be lineage specific.) But for highly constrained proteins, most mutations aren’t neutral - they are selected out, and don’t ‘count’

1.  Loci with fewer constraints evolve faster (and vice versa: e.g. histones don’t seem to evolve amino acid changes at all)

2. Synonymous changes (e.g. 3rd base) evolve faster still

3. Pseudogenes and (some) introns may evolve at true µ rate (and they do evolve most quickly across lineages)


To recap:

1.  The rate of substitution k = neutral mutation rate ν

2.  Different genes have different constraints, so k differs between them (different proportions of mutations are neutral)

3.  Mutations occur at meiosis, so accumulate generation by generation, not year by year.

4.  Many data are consistent with a constant rate of substitution per year!??

Molecular Evolution. The clock

Posterior probabilities

In theory, we can use Bayes’ theorem to convert likelihoods into actual probabilities (‘posterior probabilities’).

e.g. we want to know how probable it is that a particular coin has p(heads)=0.8 (biased) versus having p(heads)=0.5 (unbiased) (this is NOT the same as estimating p from data)

Box has 10% of the coins biased. Choose a coin at random: p(biased)=0.1 [“prior”]

Posterior Probability for “Biased”

Now toss it 10 times, get HHTHHTTHHH (ML estimate of p = 0.7, so neither 0.5 nor 0.8)
p(data | biased) = 0.8^7 * 0.2^3 = 0.00167
p(data | unbiased) = 0.5^10 = 0.000976

likelihood ratio LR = 0.00167/0.000976 = 1.71
prior odds ratio = 0.1/0.9 = 0.11
posterior odds ratio = LR * prior odds = 1.71 * 0.11 = 0.19

Posterior Probabilities

$$p(\text{biased} \mid \text{data}) = \frac{p(\text{data} \mid \text{biased}) \, p(\text{biased})}{p(\text{data})}$$

$$= \frac{p(\text{data} \mid \text{biased}) \, p(\text{biased})}{p(\text{data} \mid \text{biased}) \, p(\text{biased}) + p(\text{data} \mid \text{unbiased}) \, p(\text{unbiased})}$$

$$= \frac{0.00167(0.1)}{0.00167(0.1) + 0.000976(0.9)} = 0.16$$

Here p(data | biased) is the likelihood, p(biased) is the prior for the hypothesis, and p(data) is the unconditional prob(data).

Or P(biased | data) = odds/(1+odds) = 0.19/1.19 = 0.16 (remember, p(biased) before data was 0.1)
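The coin calculation can be reproduced in a few lines. The sketch below computes the posterior for "biased" both directly from Bayes' theorem and through the odds route; variable names are illustrative.

```python
def sequence_likelihood(p_heads: float, tosses: str) -> float:
    """Probability of the exact toss sequence given Prob(Heads)."""
    lik = 1.0
    for toss in tosses:
        lik *= p_heads if toss == "H" else 1 - p_heads
    return lik


tosses = "HHTHHTTHHH"  # 7 heads, 3 tails
prior_biased = 0.1     # 10% of the coins in the box are biased

lik_biased = sequence_likelihood(0.8, tosses)    # p(data | biased)
lik_unbiased = sequence_likelihood(0.5, tosses)  # p(data | unbiased)

# Bayes' theorem with the two-hypothesis normalizing constant.
p_data = lik_biased * prior_biased + lik_unbiased * (1 - prior_biased)
posterior_biased = lik_biased * prior_biased / p_data

# Same answer through odds: posterior odds = LR * prior odds.
likelihood_ratio = lik_biased / lik_unbiased
prior_odds = prior_biased / (1 - prior_biased)
posterior_odds = likelihood_ratio * prior_odds

print("posterior p(biased | data):", round(posterior_biased, 2))
print("posterior odds:", round(posterior_odds, 2))
```

The data favour the biased coin, yet the posterior only rises from 0.1 to about 0.16, because the prior against a biased coin is strong.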

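$$p(\text{hyp} \mid \text{data}) = \frac{p(\text{data} \mid \text{hyp}) \, p(\text{hyp})}{\sum_{\text{hyp}} p(\text{hyp}) \, p(\text{data} \mid \text{hyp})}$$

Posterior = Likelihood * Prior / normalizing constant. The normalizing constant is impossible to get (since there are ‘infinite’ ways to get data)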

MCMC (Markov chain Monte Carlo, e.g. the Metropolis-Hastings algorithm) gets rid of the denominator!

Markov chain Monte Carlo techniques sample trees in proportion to their likelihoods*priors (so, uncorrected probabilities, just the numerators), and so allow for estimates of the posterior probabilities.

How does it do that?

By keeping a random sample of ‘hypotheses’ in storage in proportion to their likelihood*priors. So, if a hypothesis is found 80% of the time, it has an 80% posterior probability of being true.

Represented by the consensus of your 1000000 MCMC trees

MCMC gets around needing to know P(data)
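A toy version of the idea, using the biased-coin example rather than trees, is sketched below. It is a minimal Metropolis sampler over just two hypotheses, with a proposal that always suggests the other hypothesis; the step count and seed are arbitrary. Only the unnormalized numerators likelihood*prior are used, never P(data).

```python
import random


def metropolis_two_hypotheses(weights, n_steps, seed=0):
    """Sample two hypotheses in proportion to their unnormalized weights.

    weights -- dict of exactly two hypotheses mapped to likelihood * prior
    Returns the fraction of steps spent on each hypothesis.
    """
    rng = random.Random(seed)
    names = list(weights)
    current = names[0]
    counts = {name: 0 for name in names}
    for _ in range(n_steps):
        # Symmetric proposal: always suggest the other hypothesis.
        proposal = names[1] if current == names[0] else names[0]
        # Accept with probability min(1, ratio of numerators).
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        counts[current] += 1
    return {name: counts[name] / n_steps for name in names}


# Numerators from the coin example: p(data | hyp) * p(hyp).
weights = {
    "biased": 0.8**7 * 0.2**3 * 0.1,
    "unbiased": 0.5**10 * 0.9,
}
frequencies = metropolis_two_hypotheses(weights, n_steps=200_000)
print(frequencies)
```

The fraction of steps spent on "biased" approximates its posterior probability of about 0.16, even though the sampler only ever compares the two numerators.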
