Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

Page 2: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Gene copies in a population of 10 individuals

Time

A random−mating population

Week 9: Coalescents – p.2/60

Page 3: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Going back one generation

Time

A random−mating population

Week 9: Coalescents – p.3/60

Page 4: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

... and one more

Time

A random−mating population

Week 9: Coalescents – p.4/60

Page 14: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The genealogy of gene copies is a tree

Time

Genealogy of gene copies, after reordering the copies

Week 9: Coalescents – p.14/60

Page 15: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Ancestry of a sample of 3 copies

Time

Genealogy of a small sample of genes from the population

Week 9: Coalescents – p.15/60

Page 16: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Here is that tree of 3 copies in the pedigree

Time

Week 9: Coalescents – p.16/60

Page 17: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Kingman’s coalescent

[Figure: a coalescent tree with labels u2 through u9]

Random collision of lineages as we go back in time (sans recombination)

Collision is faster the smaller the effective population size

In a diploid population of effective population size N:

Average time for k copies to coalesce to k−1 = $\frac{4N}{k(k-1)}$ generations

Average time for two copies to coalesce = $2N$ generations

Average time for n copies to coalesce = $4N\left(1 - \frac{1}{n}\right)$ generations
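The n-copy average follows from adding up the per-interval averages as k falls from n to 2. A short worked derivation, using only the 4N/(k(k−1)) formula on this slide and a partial-fraction step:

```latex
\begin{align*}
\sum_{k=2}^{n} \frac{4N}{k(k-1)}
  &= 4N \sum_{k=2}^{n} \left( \frac{1}{k-1} - \frac{1}{k} \right) \\
  &= 4N \left( 1 - \frac{1}{n} \right)
\end{align*}
```

Setting n = 2 recovers the 2N generations quoted for two copies.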

Week 9: Coalescents – p.17/60

Page 18: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The Wright-Fisher model

This is the canonical model of genetic drift in populations. It was invented in 1930 and 1932 by Sewall Wright and R. A. Fisher. In this model the next generation is produced by doing this:

Choose two individuals with replacement (including the possibility that they are the same individual) to be parents,

Each produces one gamete, these become a diploid individual,

Repeat these steps until N diploid individuals have been produced.

The effect of this is to have each locus in an individual in the next generation consist of two genes sampled from the parents’ generation at random, with replacement.
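A minimal sketch of that sampling effect, assuming a population of 10 diploid individuals (20 gene copies) and tracking gene copies directly rather than individuals. The function name and the seed are illustrative choices, not part of the original slides.

```python
import random

def next_generation(gene_copies, rng):
    """Return (children, parent_index) for one Wright-Fisher generation.

    gene_copies is a list of 2N labels (two per diploid individual).
    Each of the 2N copies in the next generation picks its parent copy
    uniformly at random, with replacement.
    """
    two_n = len(gene_copies)
    parent_index = [rng.randrange(two_n) for _ in range(two_n)]
    children = [gene_copies[i] for i in parent_index]
    return children, parent_index

rng = random.Random(1)
population = list(range(20))     # 10 diploid individuals, 20 gene copies
for _ in range(5):
    population, _parents = next_generation(population, rng)
surviving = sorted(set(population))  # original copies that still have descendants
```

Running more generations shrinks `surviving`, which is the forward-time view of lineages coalescing when traced backward.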

Week 9: Coalescents – p.18/60

Page 19: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The coalescent – a derivation

The probability that k lineages become k−1 one generation earlier is (as each lineage “chooses” its ancestor independently):

$$\frac{k(k-1)}{2} \times \text{Prob(first two have same parent, rest are different)}$$

(since there are $\binom{k}{2} = k(k-1)/2$ different pairs of copies)

We add up terms, all the same, for the k(k−1)/2 pairs that could coalesce:

$$\frac{k(k-1)}{2} \times 1 \times \frac{1}{2N} \times \left(1 - \frac{1}{2N}\right) \times \left(1 - \frac{2}{2N}\right) \times \cdots \times \left(1 - \frac{k-2}{2N}\right)$$

so that the total probability that a pair coalesces is

$$= \frac{k(k-1)}{4N} + O(1/N^2)$$

Week 9: Coalescents – p.19/60

Page 20: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Can probabilities of two or more lineages coalescing

Note that the total probability that some combination of lineages coalesces is

$$1 - \text{Prob(all genes have separate ancestors)}$$

$$= 1 - \left[ 1 \times \left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right) \cdots \left(1 - \frac{k-1}{2N}\right) \right]$$

$$= 1 - \left[ 1 - \frac{1 + 2 + 3 + \cdots + (k-1)}{2N} + O(1/N^2) \right]$$

and since $1 + 2 + 3 + \cdots + (n-1) = n(n-1)/2$, the quantity

$$= 1 - \left[ 1 - \frac{k(k-1)}{4N} + O(1/N^2) \right] \simeq \frac{k(k-1)}{4N} + O(1/N^2)$$

Week 9: Coalescents – p.20/60

Page 21: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Can calculate how many coalescences are of pairs

This shows, since the terms of order 1/N are the same, that the events involving 3 or more lineages simultaneously coalescing are in the terms of order 1/N² and thus become unimportant if N is large.

Here are the probabilities of 0, 1, or more coalescences with 10 lineages in populations of different sizes:

    N        0            1            > 1
    100      0.79560747   0.18744678   0.01694575
    1000     0.97771632   0.02209806   0.00018562
    10000    0.99775217   0.00224595   0.00000187

Note that increasing the population size by a factor of 10 reduces the coalescent rate for pairs by about 10-fold, but reduces the rate for triples (or more) by about 100-fold.
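The table entries can be reproduced from the exact one-generation products in the derivation. The sketch below assumes the Wright-Fisher setting of the earlier slides (each of k lineages picks one of 2N parent copies independently); the function name is illustrative.

```python
def coalescence_probabilities(k, n_diploid):
    """Probabilities of 0, exactly 1, and more than 1 coalescence in one
    generation, for k lineages among 2N gene copies (Wright-Fisher)."""
    two_n = 2 * n_diploid
    # All k lineages pick distinct parents.
    p_none = 1.0
    for i in range(1, k):
        p_none *= 1.0 - i / two_n
    # One specific pair shares a parent, the other k-2 lineages pick
    # parents distinct from it and from each other; times k(k-1)/2 pairs.
    p_one = (k * (k - 1) / 2) * (1.0 / two_n)
    for i in range(1, k - 1):
        p_one *= 1.0 - i / two_n
    return p_none, p_one, 1.0 - p_none - p_one

for n in (100, 1000, 10000):
    p0, p1, p_more = coalescence_probabilities(10, n)
    print(f"{n:>6d}  {p0:.8f}  {p1:.8f}  {p_more:.8f}")
```

The last column is everything that is neither "no coalescence" nor "exactly one pair", so it includes two separate pairs coalescing in the same generation as well as triples.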

Week 9: Coalescents – p.21/60

Page 22: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The coalescent

To simulate a random genealogy, do the following:

1. Start with k lineages

2. Draw an exponential time interval with mean 4N/(k(k−1)) generations.

3. Combine two randomly chosen lineages.

4. Decrease k by 1.

5. If k = 1, then stop

6. Otherwise go back to step 2.
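The six steps above can be sketched directly in code. This is one possible rendering, with an illustrative event format (tips numbered 0..k−1, ancestors numbered upward from k) and an arbitrary seed; neither comes from the slides.

```python
import random

def simulate_coalescent(k, n_diploid, rng):
    """Simulate one genealogy for k sampled copies.

    Returns a list of (time, child_a, child_b, parent) events, with time
    in generations before the present. Tips are numbered 0..k-1 and each
    new ancestral lineage gets the next unused integer.
    """
    lineages = list(range(k))
    next_label = k
    time = 0.0
    events = []
    while len(lineages) > 1:
        j = len(lineages)
        mean_wait = 4.0 * n_diploid / (j * (j - 1))
        time += rng.expovariate(1.0 / mean_wait)   # step 2
        a, b = rng.sample(lineages, 2)             # step 3
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)                # step 4: one fewer lineage
        events.append((time, a, b, next_label))
        next_label += 1
    return events

rng = random.Random(42)
events = simulate_coalescent(5, 1000, rng)
heights = [simulate_coalescent(5, 1000, rng)[-1][0] for _ in range(4000)]
mean_height = sum(heights) / len(heights)   # should be near 4N(1 - 1/5) = 3200
```

Averaging the tree height over many replicates gives a quick check against the 4N(1 − 1/n) result from the Kingman slide.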

Week 9: Coalescents – p.22/60

Page 23: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Random coalescent trees with 16 lineages

[Figure: eight randomly generated coalescent trees, with tips labelled by letters]

Week 9: Coalescents – p.23/60

Page 24: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Coalescence is faster in small populations

Change of population size and coalescents

[Figure: three panels sharing a time axis: Ne against time, the coalescence events, and the tree]

The changes in population size will produce waves of coalescence.

The parameters of the growth curve for Ne can be inferred by likelihood methods as they affect the prior probabilities of those trees that fit the data.

Week 9: Coalescents – p.24/60

Page 25: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Migration can be taken into account

Time

population #1 population #2

Week 9: Coalescents – p.25/60

Page 26: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Recombination creates loops

Recomb.

Different markers have slightly different coalescent trees

Week 9: Coalescents – p.26/60

Page 27: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

If we have a sample of 50 copies

50−gene sample in a coalescent tree

Week 9: Coalescents – p.27/60

Page 28: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The first 10 account for most of the branch length

10 genes sampled randomly out of a

50−gene sample in a coalescent tree

Week 9: Coalescents – p.28/60

Page 29: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

... and when we add the other 40 they add less length

10 genes sampled randomly out of a

50−gene sample in a coalescent tree

(orange lines are the 10−gene tree)
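A rough way to see why the extra 40 copies add relatively little: under the average interval lengths from the Kingman slide, the expected total branch length grows only like a harmonic sum. The sketch below relies on the standard fact that a random subsample of a coalescent sample is itself a coalescent sample; the value of N is a hypothetical placeholder and cancels in the ratio.

```python
def expected_total_branch_length(n, n_diploid):
    """Expected sum of all branch lengths, in generations, for n copies.

    While k lineages remain there are k branches growing, for an average
    of 4N/(k(k-1)) generations, so each interval contributes 4N/(k-1).
    """
    return sum(4.0 * n_diploid / (k - 1) for k in range(2, n + 1))

N = 1000   # hypothetical effective population size
length_10 = expected_total_branch_length(10, N)
length_50 = expected_total_branch_length(50, N)
fraction = length_10 / length_50
print(f"10 copies: {length_10:.0f}, 50 copies: {length_50:.0f}, fraction: {fraction:.2f}")
```

On these expectations the 10-copy tree carries about 0.63 of the 50-copy tree's total length.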

Week 9: Coalescents – p.29/60

Page 30: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

We want to be able to analyze human evolution

Africa

Europe Asia

"Out of Africa" hypothesis

(vertical scale is not time or evolutionary change)

Week 9: Coalescents – p.30/60

Page 31: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.31/60

Page 36: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

coalescence time

Week 9: Coalescents – p.36/60

Page 37: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

If the branch is more than Ne generations long ...

[Figure: gene tree and species tree, with labels t1, t2 and N1 to N5]

Week 9: Coalescents – p.37/60

Page 40: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Labelled histories

Labelled Histories (Edwards, 1970; Harding, 1971)

Trees that differ in the time−ordering of their nodes

[Figure: two pairs of trees on the tips A, B, C, D. One pair is captioned "These two are the same"; the other pair is captioned "These two are different".]

Week 9: Coalescents – p.46/60

Page 41: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Inconsistency of estimation from concatenated gene sequences

Degnan and Rosenberg (2006) show that the most likely topology for a gene tree is not necessarily the tree that agrees with the phylogenetic tree.

For some phylogenetic shapes (e.g. imbalanced trees with short internal branches) there exists (at least) one other tree shape that has a higher probability of agreeing with a gene tree.

This argues for explicitly considering the coalescent process in phylogenetic inference.

Page 42: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

How do we compute a likelihood for a population sample?

[Figure: a population sample of aligned sequences; the haplotypes CAGTTTTAGCGTCC, CAGTTTCAGCGTCC, CAGTTTTGGCGTCC and CAGTTTCAGCGTAC each appear several times]

L = Prob (CAGTTTCAGCGTCC, CAGTTTCAGCGTCC, ...) = ??

Week 9: Coalescents – p.40/60

Page 43: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

If we have a tree for the sample sequences, we can

[Figure: the sample sequences (haplotypes CAGTTTTAGCGTCC, CAGTTTCAGCGTCC, CAGTTTTGGCGTCC, CAGTTTCAGCGTAC) arranged on a tree]

so we can compute

Prob (CAGTTTCAGCGTCC, CAGTTTCAGCGTCC, ... | Genealogy)

but how to compute the overall likelihood from this?

Week 9: Coalescents – p.41/60

Page 44: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The basic equation for coalescent likelihoods

In the case of a single population with parameters

    Ne   effective population size
    µ    mutation rate per site

and assuming G′ stands for a coalescent genealogy and D for the sequences,

$$L = \text{Prob}(D \mid N_e, \mu) = \sum_{G'} \underbrace{\text{Prob}(G' \mid N_e)}_{\text{Kingman's prior}} \; \underbrace{\text{Prob}(D \mid G', \mu)}_{\text{likelihood of tree}}$$

Week 9: Coalescents – p.42/60

Page 45: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Rescaling the branch lengths

Rescaling branch lengths of G′ so that branches are given in expected mutations per site, $G = \mu G'$, we get (if we let $\Theta = 4N_e\mu$)

$$L = \sum_{G} \text{Prob}(G \mid \Theta)\, \text{Prob}(D \mid G)$$

as the fundamental equation. For more complex population scenarios one simply replaces Θ with a vector of parameters.

Week 9: Coalescents – p.43/60

Page 46: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

The variability comes from two sources

(1) Randomness of mutation

affected by the mutation rate u

can reduce variance of number of mutations per site per branch by examining more sites

(2) Randomness of coalescence of lineages

affected by effective population size Ne

coalescence times allow estimation of Ne

can reduce variability by looking at

(i) more gene copies, or

(ii) more loci

Week 9: Coalescents – p.44/60

Page 47: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

We can compute the likelihood by averaging over coalescents

The likelihood calculation in a sample of two gene copies

[Figure: for each of several values of Θ, two curves against the coalescence time t: the prior probability of t and the likelihood of t]

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter
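As a rough numerical illustration of this integral for two gene copies, the sketch below makes several assumptions that are not in the slides: a Jukes-Cantor mutation model, t measured in expected mutations per site (so the prior on t is exponential with mean Θ/2, matching the 2N-generation average for two copies), and a hypothetical data set of 500 sites with 5 differences. It integrates prior times likelihood over t with a trapezoid rule on a log-spaced grid.

```python
import math

def likelihood_of_t(t, n_sites, n_diffs):
    """Prob(data | t) under Jukes-Cantor, up to a constant factor.
    The two copies are separated by a total branch length of 2t."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * (2.0 * t) / 3.0))
    return p_diff ** n_diffs * (1.0 - p_diff) ** (n_sites - n_diffs)

def prior_of_t(t, theta):
    """Coalescent prior density of t for two copies: exponential, mean theta/2."""
    rate = 2.0 / theta
    return rate * math.exp(-rate * t)

def likelihood_of_theta(theta, n_sites=500, n_diffs=5, n_grid=4000):
    """Trapezoid-rule integral of prior(t) * likelihood(t) over t."""
    lo, hi = math.log(1e-8), math.log(10.0)
    ts = [math.exp(lo + (hi - lo) * i / (n_grid - 1)) for i in range(n_grid)]
    ys = [prior_of_t(t, theta) * likelihood_of_t(t, n_sites, n_diffs) for t in ts]
    return sum(0.5 * (ys[i] + ys[i + 1]) * (ts[i + 1] - ts[i])
               for i in range(n_grid - 1))

thetas = [0.001, 0.003, 0.01, 0.03, 0.1]
curve = {th: likelihood_of_theta(th) for th in thetas}
best = max(curve, key=curve.get)   # grid value of theta with the highest likelihood
```

With more than two copies the integral over a single t becomes a sum over whole genealogies, which is why the later slides turn to MCMC sampling of trees.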

Week 9: Coalescents – p.45/60

Page 48: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Rearrangement to sample points in tree space

A conditional coalescent rearrangement strategy

Week 9: Coalescents – p.51/60

Page 49: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Dissolving a branch and regrowing it backwards

First pick a random node (interior or tip) and remove its subtree

Week 9: Coalescents – p.52/60

Page 50: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

We allow it to coalesce with the other branches

Then allow this node to re−coalesce with the tree

Week 9: Coalescents – p.53/60

Page 51: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

and this gives another coalescent

The resulting tree proposed by this process

Week 9: Coalescents – p.54/60

Page 52: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

An example of an MCMC likelihood curve

[Figure: ln L (vertical axis, 0 down to −80) plotted against Θ (horizontal axis, log scale, 0.001 to 0.1); the value 0.00650776 is marked]

Results of analysing a data set with 50 sequences of 500 bases which was simulated with a true value of Θ = 0.01

Week 9: Coalescents – p.56/60

Page 53: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Major MCMC likelihood or Bayesian programs

LAMARC by Mary Kuhner and Jon Yamato and others. Likelihood inference with multiple populations, recombination, migration, population growth. No historical branching events, yet.

BEAST by Andrew Rambaut, Alexei Drummond and others. Bayesian inference with multiple populations related by a tree. Support for serial sampling (no migration or recombination yet).

genetree by Bob Griffiths and Melanie Bahlo. Likelihood inference of migration rates and changes in population size.

migrate by Peter Beerli. Likelihood inference with multiple populations and migration rates.

IM and IMa by Rasmus Nielsen and Jody Hey. Two populations allowing both historical splitting and migration after that.

Week 9: Coalescents – p.57/60

Page 54: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

“Skyline” and “Skyride” plots in BEAST

[Figure: six panels (Classical Skyline Plot, ORMCP Model, Bayesian Skyline Plot, Uniform Bayesian Skyride, Time−Aware Bayesian Skyride, BEAST Bayesian Skyride), each plotting Effective Population Size against Time (Past to Present)]

Figure from Minin, Bloomquist, and Suchard 2008

Page 55: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

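BEST Liu and Pearl (2007); Edwards et al. (2007)

• X – sequence data
• G – a genealogy (gene tree – with branch lengths)
• S – a species tree
• θ – demographic parameters
• Λ – parameters of molecular sequence evolution

$$\Pr(S, \theta \mid X) = \frac{\Pr(S, \theta)\, \Pr(X \mid S, \theta)}{\Pr(X)}$$

$$\propto \Pr(S)\, \Pr(\theta) \int \Pr(X \mid G)\, \Pr(G \mid S, \theta)\, dG$$

$$\propto \Pr(S)\, \Pr(\theta) \int \left[ \int \Pr(X \mid G, \Lambda)\, \Pr(\Lambda)\, d\Lambda \right] \Pr(G \mid S, \theta)\, dG$$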

Page 56: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the coalescent prior

2. Sample from the distribution of the species trees conditional on the gene trees, G.

3. Use “importance weights” to correct the sample for the fact that an approximate prior was used

Page 57: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the coalescent prior

(a) Use a tweaked version of MrBayes to sample N sets of gene trees, G, from

$$\Pr{}^{\dagger}(G \mid X) = \frac{\Pr{}^{\dagger}(G)\, \Pr(X \mid G)}{\Pr{}^{\dagger}(X)}$$

(b) Pr†(G) is an approximate prior on gene trees from using a “maximal” species tree.

2. Sample from the distribution of the species trees conditional on the gene trees, G.

3. Use “importance weights” to correct the sample for the fact that an approximate prior was used

Page 58: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the coalescent prior

2. Sample from the distribution of the species trees conditional on the gene trees, G.

(a) From each set of gene trees ($G_j$ for $1 \le j \le N$) generate k species trees using coalescent theory:

$$\Pr(S_i \mid G_j) = \frac{\Pr(S_i)\, \Pr(G_j \mid S_i)}{\Pr(G_j)}$$

3. Use “importance weights” to correct the sample for the fact that an approximate prior was used

Page 59: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the coalescent prior

2. Sample from the distribution of the species trees conditional on the gene trees, G.

3. Use “importance weights” to correct the sample for the fact that an approximate prior was used

(a) Estimate $\widehat{\Pr}(G_j)$ by using the harmonic mean estimator from the MCMC in step 2.

(b) Compute a normalization factor

$$\beta = \sum_{j=1}^{N} \frac{\widehat{\Pr}(G_j)}{\Pr(G_j)}$$

(c) Reweight all sampled species trees by

$$\frac{\widehat{\Pr}(G_j) / \Pr(G_j)}{\beta}$$
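A minimal sketch of the arithmetic in steps (b) and (c), not the BEST program's own code. It assumes the denominator is the approximate prior used in step 1, that both quantities are available as log-probabilities, and that the input values shown are purely hypothetical. Working in log space avoids underflow, since tree probabilities are typically far too small to exponentiate directly.

```python
import math

def importance_weights(log_target, log_proposal):
    """Normalised weights w_j = r_j / beta with r_j = exp(log_target[j] -
    log_proposal[j]) and beta = sum_j r_j, computed stably in log space."""
    log_ratios = [t - p for t, p in zip(log_target, log_proposal)]
    shift = max(log_ratios)                      # avoids overflow in exp
    scaled = [math.exp(r - shift) for r in log_ratios]
    beta_scaled = sum(scaled)
    return [s / beta_scaled for s in scaled]

# Hypothetical log-probabilities for N = 4 sampled sets of gene trees.
log_target = [-1050.2, -1048.7, -1052.9, -1049.5]
log_proposal = [-1049.0, -1049.0, -1051.0, -1050.0]
weights = importance_weights(log_target, log_proposal)
```

The common shift cancels between numerator and β, so the returned weights equal the ratio in step (c) and sum to one.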

Page 60: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

BEST – conclusions

1. very expensive computationally (long MrBayes runs are needed)

2. should correctly deal with the variability in gene trees caused by the coalescent process.

Page 61: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

*BEAST

Similar model to BEST, but much more efficient implementation.

Both will be very sensitive to migration, but they represent the state-of-the-art for estimating species trees from gene trees.

Page 62: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Gene tree in a species tree w/ variable population size

Figure from Heled and Drummond 2010

Page 63: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

Multiple gene trees in a species tree w/ variable population size

Figure from Heled and Drummond 2010

Page 64: Some of these slides have been borrowed from Dr. Paul Lewis, Dr

References

Degnan, J. and Rosenberg, N. (2006). Discordance of species trees with their most likely gene trees. PLoS Genet, 2(5).

Edwards, S. V., Liu, L., and Pearl, D. K. (2007). High-resolution species trees without concatenation. Proceedings of the National Academy of Sciences, 104(14):5936–5941.

Liu, L. and Pearl, D. K. (2007). Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology, 56(3):504–514.