Phylogenetic Methods
How to reconstruct phylogenies
• Algorithms vs. Optimality
• Parsimony
• Models and Distances
[Figure: rooted mtDNA tree with tips Modern Humans, Neanderthal, Western Gorilla, Mountain Gorilla, Eastern Gorilla, Sumatran Orangutan, Bornean Orangutan, Bonobo, Western Chimpanzee, Central Chimpanzee and Eastern Chimpanzee. Source: Gagneux et al. (1999).]
Humans are a recently-evolved species, and human genetic diversity is very low compared to other apes!
Due to drift, samples of mitochondrial DNA are related as a tree; if we can reconstruct that tree, we should be able to infer many things about the populations.
Background: how many different trees?
For four taxa:

[Figure: the three possible unrooted trees for taxa A, B, C and D.]

# of taxa    # of unrooted trees
 2           1
 3           1
 4           3
 5           15
 6           105
 7           945
 8           10395
 9           135135
10           2027025

4 taxa = 3 trees: $(2n-5)!!$
10 taxa = ~$2 \times 10^6$ trees

How many rooted trees?
For four taxa: the root can be placed on any branch or internode. The total number of branches on an unrooted tree is 2n-3, where n = number of taxa. Therefore, the number of rooted trees corresponding with one unrooted tree is 2n-3.

[Figure: the three unrooted trees for taxa A, B, C and D, and a first rooted tree with tips in the order C, D, B, A.]
Terms and concepts: how many different trees?
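For four taxa:

[Figure: the three unrooted trees for taxa A, B, C and D. Each can be rooted on any of its 5 branches (+5 rooted trees per unrooted tree); the five rooted versions of one unrooted tree are shown with tip orders C D B A, A B D C, C D A B, A B C D and A B C D.]

# of taxa    # of unrooted trees    # of rooted trees
 2           1                      1
 3           1                      3
 4           3                      15
 5           15                     105
 6           105                    945
 7           945                    10395
 8           10395                  135135
 9           135135                 2027025
10           2027025                34459425

4 taxa = 15 trees: $(2n-3)(2n-5)!!$
10 taxa = ~$3 \times 10^7$ trees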
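The counts in the table follow from the double-factorial formulas on this slide. A minimal Python sketch that recomputes them (the function names are illustrative and not part of the slides):

```python
def double_factorial(m):
    """Product m * (m-2) * (m-4) * ... down to 1; taken as 1 when m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 taxa: (2n-5)!!"""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n-3)(2n-5)!!"""
    return double_factorial(2 * n - 3)


# Reproduce the table row by row: taxa, unrooted trees, rooted trees.
for n in range(2, 11):
    print(n, unrooted_trees(n), rooted_trees(n))
```

The rooted count uses (2n-3)!!, which equals (2n-3) times (2n-5)!! because each unrooted tree can be rooted on any of its 2n-3 branches.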
Algorithms vs. Optimality
UPGMA, Neighbor-Joining and ‘Branch and Bound’ are algorithms (or recipes) that do not optimize anything. Maximum Likelihood, Maximum Parsimony and Least-Squares are optimality criteria: they do not specify how candidate hypotheses are arrived at (e.g. how trees are sampled) but do offer a yardstick for assessing which hypotheses are preferred.
Algorithms are fast (N-J works in low-polynomial n time) but guarantee neither a ‘right’ answer nor an evaluation of fit.
Maximum Parsimony
We do not believe that evolution is parsimonious, but we believe that the characters we choose evolve in such a way that maximum parsimony offers the best chance of recovering the correct relationships.
Tree length becomes our optimality criterion:
choose the shortest tree among all contenders...
$$L(t) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \,\mathrm{diff}(x_{k'j}, x_{k''j})$$

Minimize L(t), the length of the tree.

We calculate the length of a tree, L(t), as the sum across all branches B, for all N characters, each given weight $w_j$ and each having a cost of change diff(x, y) between the states $x_{k'j}$ and $x_{k''j}$ at the two ends of branch k.
How to reconstruct phylogeny: Parsimony
A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..

Parsimony assumes discrete data that represent state changes along a tree, i.e. each column is a character whose variation is due to evolution.
Alignment: the art of producing such columns
How to reconstruct phylogeny: methods - parsimony
[Figure: the three possible unrooted trees for taxa A, B, C and D, with the changes at sites 2, 3 and 4 marked on the branches. The tree grouping A with B and C with D has length = 3; the trees grouping A with C and A with D each have length = 5.]

Site   1  2  3  4
A      a  c  a  t
B      a  c  a  t
C      a  g  g  t
D      a  g  g  a

Site 1 is invariant, site 4 is unique (it changes in one taxon only), and sites 2 and 3 are parsimony informative.

• Parsimony allows the use of all known evolutionary information in building a tree.
• Parsimony scores every possible tree by the number of evolutionary changes needed to explain the observed data.
• The best tree is the one that requires the fewest (homoplasious) changes.
• Only synapomorphies are parsimony-informative.
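The three tree lengths on this slide can be checked with Fitch's small-parsimony algorithm. The sketch below assumes equal weights and unordered states (any change costs 1), and writes each unrooted tree as a nested tuple rooted at an arbitrary point, which does not change the length under these assumptions. The function names are illustrative.

```python
ALIGNMENT = {"A": "acat", "B": "acat", "C": "aggt", "D": "agga"}


def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character.

    `tree` is a taxon name or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed here.
        return shared, left_changes + right_changes
    # No shared state: one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the Fitch changes over all columns (characters) of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for j in range(n_sites):
        column = {taxon: seq[j] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


for tree in ((("A", "B"), ("C", "D")),
             (("A", "C"), ("B", "D")),
             (("A", "D"), ("B", "C"))):
    print(tree, tree_length(tree, ALIGNMENT))
```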
How to reconstruct phylogeny: methods - parsimony
To distinguish ancestral from derived character states (and thus allow for a temporal dimension) an ‘outgroup’ (i.e. the sistergroup of the taxa of interest) is added. Using outgroup comparisons, the most parsimonious rooted tree can be found.
[Figure: the three unrooted trees for taxa A, B, C and D, with lengths 3, 5 and 5.]

Shortest unrooted tree, but where’s the root?

Site   1  2  3  4
A      a  c  a  t
B      a  c  a  t
C      a  g  g  t
D      a  g  g  g
E      c  g  a  t   (outgroup)
How to reconstruct phylogeny: methods - cladistics - parsimony
[Figure: the five rooted versions of the shortest unrooted tree (tip orders C D B A, A B D C, C D A B, A B C D and A B C D), each with outgroup E attached at the root and the changes at sites 1 to 4 marked on the branches. The tree lengths shown are 5, 4, 5, 5 and 5: the rooting that keeps both A+B and C+D as groups needs 4 changes, while every other rooting needs a homoplasious change at site 2 or site 3.]
Maximum Parsimony and the LBA
• MP uses the observed states of characters to infer the shortest set of paths (shortest tree). No probability is involved.
• For datasets where different branches have different rates of evolution (i.e. fast and slow branches), the MP tree may not be the most likely tree.
• This is Long-Branch Attraction (LBA): the inconsistency caused by heterogeneous rates across the tree.
MP has no model or way to correct for LBA.
We can use a model to “correct for multiple hits”. This can be done in either a discrete or a continuous framework. Let’s look at the continuous one first.
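[Figure: observed distance p (p_ij, vertical axis, 0 to 0.8) against the estimated number of substitutions µt (horizontal axis, 0 to 2.5). The curve is linear at low µt and flattens out at high µt (saturation).]

$$p = \frac{3}{4}\left(1 - e^{-\frac{4}{3}\mu t}\right)$$

$$\mu t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$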
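The two formulas on this slide are inverses of each other. A minimal Python sketch of both directions (the function names are illustrative, and the p values in the loop are arbitrary examples):

```python
import math


def jc_observed(mu_t):
    """Expected proportion of differing sites p after mu_t substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu_t / 3.0))


def jc_corrected(p):
    """Estimated substitutions per site mu_t from an observed proportion p."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the logarithm is undefined: the sequences are saturated.
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.05, 0.247, 0.577, 0.704):
    print(p, round(jc_corrected(p), 3))
```

At small p the corrected value stays close to p (the linear part of the curve); as p approaches 0.75 the correction grows without bound, which is the saturation region.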
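[Figure: the true tree, grouping taxa 1 with 2 and taxa 3 with 4. The terminal branches to taxa 1 and 3 have length 1.0; the terminal branches to taxa 2 and 4 and the internal branch have length 0.1.]

This true tree produces these data (observed distances):

     1      2      3      4
1    0.0    0.577  0.704  0.599
2           0.0    0.599  0.247
3                  0.0    0.577
4                         0.0

And the least-squares tree on the observed distances is incorrect, just like the MP tree would be:

[Figure: least-squares tree grouping taxa 1 with 3 and taxa 2 with 4; the branch lengths shown are 0.12, 0.35 and 0.11.]

The corrected distances would be:

     1      2      3      4
1    0.0    1.1    2.1    1.2
2           0.0    1.2    0.3
3                  0.0    1.1
4                         0.0

And the least-squares tree* on the corrected distances is now correct:

[Figure: the true tree again, with branch lengths 0.1, 0.1, 0.1, 1.0 and 1.0.]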
How to reconstruct phylogeny: distance methods
Aligned DNA sequences (first data set):

A  a c c g a t c g t a a c g t
B  . . . . g . . . c . . . . .
C  . . t . t . . a g . g . a .
D  g . g a c c . c a . c . t c

Distance matrix (first data set):

     A    B    C
B    2
C    6    6
D   10   10   10

[Figure: ultrametric tree of A, B, C and D for the first data set, with branch lengths.]

Aligned DNA sequences (second data set):

A  t g g c g t g a a g c g a c
B  c . a . t . . . g . . a . t
C  c . a . . . a . g . . a g t
D  c a a t t c a g t a g a g g

Distance matrix (second data set):

     A    B    C
B    6
C    7    3
D   14   10    9

[Figure: additive tree of A, B, C and D for the second data set, with branch lengths.]

Phenograms (i.e. phenetic trees) are obtained using distance methods to build trees from comparative data.
A pairwise distance matrix contains the estimated number of different sites between all pairs of sequences.
Then, a tree is constructed, e.g. by linking the least distant pairs of taxa, followed by successively more distant taxa.
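The first distance matrix on this slide can be recomputed directly from its alignment. The sketch below assumes that '.' means "same base as sequence A", which is how the slide's dot notation reads; the function names are illustrative.

```python
from itertools import combinations

# First alignment on the slide; '.' stands for the base found in sequence A.
ALIGNMENT = {
    "A": "accgatcgtaacgt",
    "B": "....g...c.....",
    "C": "..t.t..ag.g.a.",
    "D": "g.gacc.ca.c.tc",
}


def expand_dots(alignment, reference="A"):
    """Replace every '.' by the reference base in the same column."""
    ref = alignment[reference]
    return {
        name: "".join(r if s == "." else s for s, r in zip(seq, ref))
        for name, seq in alignment.items()
    }


def raw_distances(alignment):
    """Number of differing sites for every pair of aligned sequences."""
    full = expand_dots(alignment)
    return {
        (x, y): sum(a != b for a, b in zip(full[x], full[y]))
        for x, y in combinations(sorted(full), 2)
    }


print(raw_distances(ALIGNMENT))
```

These are raw counts of differing sites; dividing by the alignment length would give the proportion p used by model-based corrections.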
How to reconstruct phylogeny: distance methods
Distance methods can use clustering algorithms (e.g. N-J) or an optimality criterion (least squares, minimum evolution) to convert the distances to a tree.
Both can use raw or modelled distances:
• Raw: percent different sites (if aligned)
• Modelled: corrected with a model (e.g. the Jukes-Cantor model)
Neighbor-Joining Algorithm
The Neighbor-Joining algorithm is similar in flavour to the more intuitive UPGMA, but doesn’t force everything to be equidistant from a root. Indeed, it only produces unrooted trees: you have to root them by knowing the outgroup or by using, e.g., midpoint rooting.
Many fast tree-building programs build N-J trees.
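Neighbor-joining itself needs more bookkeeping, so the sketch below shows only UPGMA, the more intuitive clustering method it is compared with here. It is a minimal illustration, not an efficient implementation: it assumes the distances are given for every pair of leaves, recomputes average distances from scratch, and breaks ties by taking the first minimum. The data are the first raw distance matrix from the distance-methods slide, and the function names are illustrative.

```python
from itertools import combinations


def upgma(leaf_dist):
    """Cluster leaves by UPGMA.

    `leaf_dist` maps frozenset({x, y}) to the distance between leaves x and y.
    Returns a nested tuple (left, right, height); leaves are plain strings.
    """
    leaves = sorted({leaf for pair in leaf_dist for leaf in pair})
    # Each cluster is (tree so far, tuple of the leaves it contains).
    clusters = [(leaf, (leaf,)) for leaf in leaves]

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(leaf_dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two least distant clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        height = average_distance(c1, c2) / 2  # node sits halfway between them
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0], height), c1[1] + c2[1]))
    return clusters[0][0]


RAW = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
leaf_dist = {frozenset(pair): d for pair, d in RAW.items()}
tree = upgma(leaf_dist)
print(tree)
```

Because every tip ends up at the same distance from the root, the result is an ultrametric, rooted tree, which is exactly the constraint that neighbor-joining relaxes.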
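[Figure: bootstrapping. The columns of the original alignment are resampled with replacement to build pseudoreplicate alignments (500+); a tree of taxa 1, 2 and 3 is built from each, and the bootstrap tree is labelled with the share of pseudoreplicate trees that support a grouping (66%).]

Original alignment:
1 RANDOM
2 RONDON
3 RONFON

Pseudoreplicates:
RRDAOM    AMRAON    NNNDON    ...
RRDOON    ONROON    NNNDON    ...
RRFOON    ONROON    NNNFON    ...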
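The resampling step in this figure can be sketched in a few lines of Python. The sketch only builds the pseudoreplicate alignments; building a tree from each one and counting how often a grouping appears is left out. The function name and the fixed seed are illustrative choices.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, using the same columns for every taxon."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


alignment = ["RANDOM", "RONDON", "RONFON"]
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(500)]
print(replicates[0])
```

Each pseudoreplicate has the same number of taxa and sites as the original, but some columns appear more than once and others not at all.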
[Figure: consensus neighbor-joining tree of 104 human mtDNA complete sequences, with African, European and Asian/Native American clades labelled. Mishmar D et al. PNAS 2003;100:171-176 ©2003 by National Academy of Sciences]
Likelihood
$$\mathrm{Lik}(h) \propto P(D \mid h, m)$$
The likelihood of a hypothesis (e.g. of a tree) is proportional to the probability of the data (the sequences) arising, given the hypothesis and a model.
(This says nothing about the probability of the model.)
Maximum Likelihood
The ML is reached at the hypothesis that gives the highest probability of seeing the data.

D: HHTTHTHHTTT (11 tosses of a coin)
m: independent tosses with some p = Prob(Heads)
The likelihoods for different h correspond to different values of p. Which h is most likely?

$$P(\text{data} \mid h, m) = p\,p\,(1-p)(1-p)\,p\,(1-p)\,p\,p\,(1-p)(1-p)(1-p) = p^5 (1-p)^6$$

[Figure: likelihood of h (vertical axis, 0 to 0.0006) against all possible h (= p, horizontal axis, 0 to 1).]

We are usually interested in the h (the tree) that returns the ML, not in the ML itself (any one tree is not very likely).
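The coin example can be evaluated numerically. The sketch below computes the likelihood on a grid of candidate values of p and picks the best one; the grid spacing of 0.001 is an arbitrary choice and the function name is illustrative.

```python
def coin_likelihood(p, tosses="HHTTHTHHTTT"):
    """P(data | p) for independent tosses with Prob(Heads) = p."""
    likelihood = 1.0
    for toss in tosses:
        likelihood *= p if toss == "H" else 1.0 - p
    return likelihood


# Evaluate the likelihood for hypotheses p = 0.000, 0.001, ..., 1.000.
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=coin_likelihood)
print(best_p, coin_likelihood(best_p))
```

The best grid value sits next to 5/11, the observed proportion of heads, and the likelihood there is only about 0.0005: the maximum identifies the preferred hypothesis even though the data themselves remain improbable.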
• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.
• Each nucleotide site is assumed to evolve independently, so the likelihood of the tree is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data - FOR ONE OF ALL POSSIBLE TREE SHAPES.
• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.
How to reconstruct phylogeny: methods - maximum likelihood
[Figure: a tree with observed nucleotides (a, g, c) at the tips and internal nodes x = a and y = g: one of many possible ways the pattern of nucleotides at a given site could have evolved.]

Model of sequence evolution:
No change: p = 0.7
Substitution: p = 0.3

-ln(((0.7*0.3*0.3*0.3) + (all other combinations)) * (all other sites))
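The "sum over all combinations, product over all sites" structure can be written out for a four-taxon tree. This sketch swaps in the Jukes-Cantor model for the slide's illustrative 0.7/0.3 probabilities, and it assumes a tree grouping A with B and C with D in which all five branches have the same length; both are assumptions made for the example, and the function names are illustrative.

```python
import math
from itertools import product

BASES = "acgt"


def jc_prob(i, j, branch):
    """Jukes-Cantor probability of state j at one end of a branch given i at the other."""
    e = math.exp(-4.0 * branch / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(tips, branch):
    """Likelihood of one site; tips = states of (A, B, C, D), internal nodes x and y."""
    a, b, c, d = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):  # all combinations of internal states
        total += (0.25                      # equal base frequencies at node x
                  * jc_prob(x, a, branch) * jc_prob(x, b, branch)
                  * jc_prob(x, y, branch)
                  * jc_prob(y, c, branch) * jc_prob(y, d, branch))
    return total


def neg_log_likelihood(sequences, branch):
    """-ln of the product of site likelihoods for sequences given in the order A, B, C, D."""
    return -sum(math.log(site_likelihood(col, branch)) for col in zip(*sequences))


print(neg_log_likelihood(["acat", "acat", "aggt", "agga"], 0.1))
```

A full analysis would also optimize each branch length and repeat the calculation for every candidate tree shape, which is where the computational cost comes from.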
Models
There are many models of evolution, which have different numbers of parameters to estimate when calculating the Lik. (J-C has 1; Kimura, 2; HKY, 5; HKY+Inv, 6; etc.)

There are two types of parameters:
• variation in possible substitutions (ti, tv, A->C vs A->T)
• variation across the sequence (gamma, invariant sites)
For discrete characters, the Mk1 model has one parameter. We will (I hope) derive the model fully when we have some data.

How to choose a model?
1. One can first build a N-J tree on raw distances,
2. then calculate the Max. Lik. of the data on that tree under different models of evolution.
3. Compare the relationship between the number of parameters and the actual ML fit to decide on the model, e.g. the one with the lowest value for the Akaike Information Criterion (AIC).

$$\mathrm{AIC} = -2\ln(\mathrm{Lik}) + 2k$$
• $2k$ gets bigger with more complex models (k = number of parameters)
• $-2\ln(\mathrm{Lik})$ gets smaller with more complex models (better fit, higher Lik)
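The trade-off in the AIC can be seen with a small calculation. In the sketch below the ln(Lik) values are invented purely for illustration; only the parameter counts come from the slide, and the function name is illustrative.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: -2 ln(Lik) + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


# Hypothetical ln(Lik) values, made up for this example; the parameter
# counts are the ones quoted on the slide.
candidates = {
    "J-C": (-2510.0, 1),
    "Kimura": (-2480.0, 2),
    "HKY": (-2476.5, 5),
    "HKY+Inv": (-2476.0, 6),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

With these made-up numbers the extra parameter of HKY+Inv improves the fit by less than its penalty, so the simpler HKY model has the lowest AIC.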
How to reconstruct phylogeny: search methods
Before we can assess the ‘goodness’ of competing phylogenetic hypotheses (i.e. trees) using an optimality criterion, we have to build tree shapes. Assessing all possible tree shapes (i.e. exhaustive searches) takes a mighty long time for large numbers of taxa. A shortcut is provided by ‘hill-climbing’ algorithms (i.e. heuristic searches), of which many different flavors exist. They all follow this philosophy:
• Similar tree shapes have a similar ‘goodness’ (e.g. likelihoods).
• So, by starting with a tree (any old tree) and changing the shape in small steps, while constantly keeping track of which changes are improvements and which are not, the best tree will eventually be found.
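The hill-climbing idea can be sketched independently of trees. The example below uses a toy one-dimensional landscape as a stand-in for tree space: the positions, scores and neighbour rule are all invented for illustration, whereas a real search would rearrange tree shapes and score them with an optimality criterion.

```python
def hill_climb(start, neighbours, score):
    """Move to the best-scoring neighbour until no neighbour improves the score."""
    current = start
    while True:
        best = max(neighbours(current), key=score)
        if score(best) <= score(current):
            return current
        current = best


# Toy stand-in for tree space: positions 0..7 along a line, each with a
# made-up 'goodness'.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 7]


def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def score(i):
    return LANDSCAPE[i]


print(hill_climb(4, neighbours, score))  # climbs to the highest peak, position 6
print(hill_climb(0, neighbours, score))  # stops on a lower peak, position 2
```

The second call shows why the starting tree matters in this toy landscape: a climb that only accepts improvements can stop on a lower peak, so heuristic searches are often repeated from several starting trees.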
How to reconstruct phylogeny: search methods: exhaustive searching
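[Figure: “optimality” (vertical axis) plotted over all possible tree shapes (horizontal axis); the exhaustively evaluated trees cover the whole landscape, with the optimal tree ‘A’ at the peak.]
An exhaustive search will return the optimal tree shape ‘A’ after evaluating all possible trees.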
How to reconstruct phylogeny: search methods: heuristic searching
[Figure: the same “optimality” landscape over all possible tree shapes. A heuristic search begins at starting tree ‘B’ and moves by ‘hill climbing’ to the optimal tree ‘A’; only the heuristically evaluated trees along that path are scored, a small part of the exhaustively evaluated trees.]
The molecular clock for haemoglobins assumed in 1962 by Linus Pauling and Emile Zuckerkandl, shown later by Margoliash (1964), and by P&Z in 1965...
"the discovery of the molecular clock stands out as the most significant result of research in molecular evolution." (R. Lewin)
"a very important idea that has turned out to be much truer than people thought at the time." (F. Crick)
"one of the most elegantly simple concepts in biology, but it is also one of the most contentious." (S. Easteal et al.)
pauling.library.oregonstate.edu
Can't find pic of Margoliash...
Molecular Evolution and the Neutral Theory
Margoliash, PNAS 1964
Motoo Kimura (1966, 1983)
Motoo described how substitutions might occur
orange allele changes in frequency... perhaps due to selection
A substitution is the replacement of one allele for another as the predominant allele in a population
N = 10 individuals
Substitutions occur at some background rate due to drift in both large and small populations
orange allele changes in frequency... due to drift
The Neutral Theory simply states that most genetic variation is not due to, nor acted on by, Natural Selection.
This drift is not working against selection (as in small pops) but is simply 'not seen' by selection...
The assumption is that most gene products are already at their optimum. Selection weeds out the (very) harmful mutations, and all the variation we actually see is the leftover, neutral variation created by mutation and drift. (Very few new mutations are subject to positive selection.)
Neutral Theory
http://online.itp.ucsb.edu/online/infobio01/ohta/oh/01.html
The Neutral theory was proposed to explain the clock-like substitution pattern among species
Number of neutral mutations created per generation: 2Nν (where ν is a subset of µ, the overall mutation rate)
e.g.: if the neutral mutation rate is $10^{-6}$ per position per generation, and if each position is represented $10^6$ times ($2N = 10^6$), then expect 1 new mutation per base per generation (i.e. someone is carrying that new mutation)
6 different copies of an allele: here we focus on the ‘black’ one
one lineage eventually drifts to fixation, and the chance it is the one we were looking at is 1/6
drift...
For a new neutral mutation, the probability that it ‘becomes’ the most common one is the same as its initial frequency, $(2N)^{-1}$ (all copies have an equal chance). You can think of ‘becomes the most common one’ as ‘being the ancestor of the most common one’.
If we focus on the black allele, it has a 1/6 chance of replacing others
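The 1/6 chance can be checked by simulating drift. The sketch below assumes a simple Wright-Fisher scheme (every copy in the next generation picks its parent copy at random), which is an assumption of the example rather than something specified on the slide; the function names, the number of replicates and the seed are illustrative choices.

```python
import random


def fixes(copies, rng):
    """Follow one new neutral allele (1 copy among `copies`) until loss or fixation."""
    count = 1
    while 0 < count < copies:
        freq = count / copies
        # Each copy in the next generation picks its parent copy at random.
        count = sum(rng.random() < freq for _ in range(copies))
    return count == copies


def fixation_probability(copies=6, replicates=20_000, seed=1):
    """Share of replicate populations in which the focal allele replaces all others."""
    rng = random.Random(seed)
    return sum(fixes(copies, rng) for _ in range(replicates)) / replicates


print(fixation_probability())  # expected to be close to 1/6
```

Changing `copies` shows the general pattern: the estimated fixation probability tracks the initial frequency, one over the number of copies.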
rate of substitution (in substitutions/gen): $k = 2N\nu \times (2N)^{-1} = \nu$
• $2N\nu$: number of candidate mutations
• $(2N)^{-1}$: chance for each mutation to ‘fix’
The substitution rate of neutral mutations, k, is equal to the neutral mutation rate ν under the Neutral Model.
How can this be??
Remember, we are looking only at those mutations that are neutral. In small populations, the number of new mutations is low (not a lot of DNA to mutate), i.e. 2N is small. But drift is fast, i.e. the chance that any allele increases in frequency is high (1/2N is relatively large).
In large populations, there are lots of new mutations (2N is large). But the chance that any one of them substitutes is low, because drift is slow (1/2N is small).
So the effects of population size on the number of mutations and on the drift of those mutations cancel each other out.
[Figure: a tree of shark, quoll and human. Random substitutions arise along the branches, and daughter species inherit them.]
All this occurs at rate ν, and so we have a clock.
And so, for the same locus, if ν is similar for all species, you get a ‘clock’ in generations.
ν must differ between genes (Table 7.1)
The relevant rate is the neutral mutation rate, not the overall mutation rate. (The overall rate could also be lineage specific.) But for highly constrained proteins, most mutations aren’t neutral - they are selected out, and don’t ‘count’
1. Loci with fewer constraints evolve faster (and vice versa: e.g. histones don’t seem to evolve amino acid changes at all)
2. Synonymous changes (e.g. 3rd base) evolve faster still
3. Pseudogenes and (some) introns may evolve at true µ rate (and they do evolve most quickly across lineages)
To recap:
1. The rate of substitution k = neutral mutation rate ν
2. Different genes have different constraints, so k differs between them (different proportions of mutations are neutral)
3. Mutations occur at meiosis, so accumulate generation by generation, not year by year.
4. Many data are consistent with a constant rate of substitution per year!??
Molecular Evolution. The clock
Posterior probabilities
In theory, we can use Bayes’ theorem to convert likelihoods into actual probabilities (‘posterior probabilities’).
e.g. we want to know how probable it is that a particular coin has p(heads)=0.8 (biased) versus having p(heads)=0.5 (unbiased) (this is NOT the same as estimating p from data)
The box has 10% of its coins biased. Choose a coin at random: p(biased) = 0.1 [“prior”]
Posterior Probability for “Biased”
Now toss it 10 times and get HHTHHTTHHH (ML estimate of p = 0.7, so neither 0.5 nor 0.8)
p(data | biased) = $0.8^7 \times 0.2^3$ = 0.00168
p(data | unbiased) = $0.5^{10}$ = 0.000976
likelihood ratio LR = 0.00168/0.000976 = 1.72
prior odds ratio = 0.1/0.9 = 0.11
posterior odds ratio = LR × prior odds = 1.72 × 0.11 = 0.19
Posterior Probabilities
$$p(\text{biased} \mid \text{data}) = \frac{p(\text{data} \mid \text{biased}) \times p(\text{biased})}{p(\text{data})}$$

$$= \frac{p(\text{data} \mid \text{biased}) \times p(\text{biased})}{p(\text{data} \mid \text{biased}) \times p(\text{biased}) + p(\text{data} \mid \text{unbiased}) \times p(\text{unbiased})}$$

$$= \frac{0.00168(0.1)}{0.00168(0.1) + 0.000976(0.9)} = 0.16$$

Here p(data | biased) is the likelihood, p(biased) is the prior for the hypothesis, and p(data) is the unconditional probability of the data.

Or P(biased | data) = odds/(1 + odds) = 0.19/1.19 = 0.16 (remember, p(biased) before the data was 0.1).
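The same posterior can be computed step by step. The sketch below follows the coin example as stated (7 heads and 3 tails, a 10% prior for the biased coin); the variable and function names are illustrative.

```python
def likelihood(p_heads, heads, tails):
    """P(data | hypothesis) for one particular sequence of tosses."""
    return p_heads ** heads * (1.0 - p_heads) ** tails


heads, tails = 7, 3                      # HHTHHTTHHH
prior = {"biased": 0.1, "unbiased": 0.9}
p_heads = {"biased": 0.8, "unbiased": 0.5}

# Numerators of Bayes' theorem: likelihood * prior for each hypothesis.
joint = {h: likelihood(p_heads[h], heads, tails) * prior[h] for h in prior}
p_data = sum(joint.values())             # the normalizing constant
posterior = {h: joint[h] / p_data for h in joint}
posterior_odds = joint["biased"] / joint["unbiased"]
print(posterior, posterior_odds)
```

With only two hypotheses the normalizing constant is a two-term sum, which is why the calculation is easy here.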
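$$p(\text{hyp} \mid \text{data}) = \frac{p(\text{data} \mid \text{hyp}) \times p(\text{hyp})}{\sum_{\text{hyp}} p(\text{hyp})\, p(\text{data} \mid \text{hyp})}$$

Posterior = Likelihood × Prior / normalizing constant. The normalizing constant is impossible to get (since there are ‘infinite’ ways to get the data).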
MCMC (Markov chain Monte Carlo, e.g. the Metropolis-Hastings algorithm) gets rid of the denominator!
Metropolis-coupled Markov chain Monte Carlo techniques sample trees in proportion to their likelihoods*priors (so, uncorrected probabilities, just the numerators), and so allow for estimates of the posterior probabilities.
How does it do that?
By keeping a random sample of ‘hypotheses’ in storage in proportion to their likelihood*priors. So, if a hypothesis is found 80% of the time, it has an 80% posterior probability of being true.
The result is represented by the consensus of your 1000000 MCMC trees.
MCMC gets around needing to know P(data)
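The way the denominator drops out can be seen on the biased-coin example, using a plain Metropolis sampler over its two hypotheses. This is a toy sketch under stated assumptions, not a tree sampler: the proposal always suggests the other hypothesis, the chain length and seed are arbitrary, and the function names are illustrative. Only the ratio of likelihood*prior values is ever used, so P(data) is never computed.

```python
import random


def unnormalised_posterior(hypothesis, heads, tails):
    """Likelihood * prior for a hypothesis given as (p_heads, prior)."""
    p_heads, prior = hypothesis
    return (p_heads ** heads) * ((1.0 - p_heads) ** tails) * prior


def metropolis_posterior(heads=7, tails=3, steps=100_000, seed=0):
    """Estimate p(biased | data) as the share of steps spent on 'biased'."""
    rng = random.Random(seed)
    hypotheses = {"biased": (0.8, 0.1), "unbiased": (0.5, 0.9)}
    current = "unbiased"
    visits = {"biased": 0, "unbiased": 0}
    for _ in range(steps):
        proposal = "biased" if current == "unbiased" else "unbiased"
        ratio = (unnormalised_posterior(hypotheses[proposal], heads, tails)
                 / unnormalised_posterior(hypotheses[current], heads, tails))
        # Accept uphill moves always, downhill moves with probability `ratio`.
        if rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return visits["biased"] / steps


print(metropolis_posterior())  # expected to be close to 0.16
```

The share of steps spent on ‘biased’ approximates the 0.16 obtained from the full formula, which is the same logic as reading posterior probabilities off the frequency of groupings in a sample of MCMC trees.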