Lecture 2 Coalescent theory I - Université de Montréalhussinju/l2010/coalescence4_theory.pdf · 2010-09-22 · Derivation of the coalescent • From the Wright-Fisher model we can

Coalescent theory

In this lecture we will…

•  Look closely at the basic derivation of the coalescent

•  Calculate expectations and variances of key genealogical properties

•  Derive properties of the frequency and age spectrum of mutations

•  Look at the relationship between coalescent theory and other mathematical approaches to population genetics

Derivation of the coalescent

•  From the Wright-Fisher model we can calculate the probability that a sample of n individuals has a ancestors in the previous generation from a population of size N

•  The key point is that the probability of more than one coalescent event in that generation is of order $N^{-2}$

–  As N tends to infinity, the probability tends to zero

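[Figure: example with n = 4, a = 2]

$$\Pr\{a; n, N\} = \Pr\{n - a \text{ coalescent events}\} = \frac{S_2(n,a)}{N^n} \prod_{i=0}^{a-1} (N - i)$$

$$\Pr\{a = n-1; n, N\} = \frac{n(n-1)}{2N} + O(N^{-2})$$

$$\Pr\{a < n-1; n, N\} = O(N^{-2})$$

$S_2(n,a)$ = Stirling number of the second kind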
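The Wright-Fisher probability above can be evaluated exactly for small cases. The following is a minimal sketch, assuming the form Pr{a; n, N} = S2(n,a) N(N-1)...(N-a+1) / N^n; the function names `stirling2` and `prob_ancestors` are illustrative and do not come from the lecture.

```python
from fractions import Fraction
from math import comb, factorial


def stirling2(n, a):
    """Stirling number of the second kind: ways to split n items into a non-empty groups."""
    # Explicit formula: S2(n, a) = (1/a!) * sum_j (-1)^(a-j) * C(a, j) * j^n
    total = sum((-1) ** (a - j) * comb(a, j) * j ** n for j in range(a + 1))
    return total // factorial(a)


def prob_ancestors(a, n, N):
    """Exact probability that n sampled individuals have a distinct parents (population size N)."""
    falling = 1
    for i in range(a):
        falling *= N - i  # N (N-1) ... (N-a+1): ordered choice of a distinct parents
    return Fraction(stirling2(n, a) * falling, N ** n)


if __name__ == "__main__":
    n, N = 4, 1000
    exact = prob_ancestors(n - 1, n, N)
    approx = Fraction(n * (n - 1), 2 * N)
    # The difference between the two should be of order N^-2.
    print(float(exact), float(approx), float(abs(exact - approx)))
```

Exact rational arithmetic is used here so that the probabilities over all values of a sum to exactly one, which makes the small correction term easy to inspect.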

How accurate is the coalescent?

•  For small populations we expect the coalescent to become less accurate, because multiple coalescent events can occur in the same generation

•  We can look at the probability that this occurs as the proportion of the population sampled increases

[Figure: probability of >1 coalescent event against proportion of population sampled (N = 10,000)]
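A calculation of this kind can be sketched directly from the Wright-Fisher probabilities. The sketch below assumes that "more than one coalescent event" means the sample has fewer than n-1 parents, and uses S2(n,n) = 1 and S2(n,n-1) = n(n-1)/2; the function name `prob_multiple_events` is illustrative.

```python
def prob_multiple_events(n, N):
    """Pr{a < n-1}: more than one coalescent event among n samples in one generation."""
    # Pr{a = n}: all n samples choose distinct parents.
    p_none = 1.0
    for i in range(n):
        p_none *= (N - i) / N
    # Pr{a = n-1}: exactly one pair shares a parent, S2(n, n-1) = n(n-1)/2.
    p_one = n * (n - 1) / 2 / N
    for i in range(n - 1):
        p_one *= (N - i) / N
    return 1.0 - p_none - p_one


if __name__ == "__main__":
    N = 10_000
    for proportion in (0.001, 0.005, 0.01, 0.05):
        n = int(proportion * N)
        print(proportion, prob_multiple_events(n, N))
```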

The Moran model

•  There is an alternative way of deriving the coalescent, which comes from the Moran model

•  The Moran model is a continuous time process

–  Each individual survives an exponentially distributed length of time

–  At death, an individual is replaced by a copy of a randomly chosen individual from the remaining population

–  Selection can be incorporated by differential mortality (or replacement)

•  There is a natural scaling of time in this model – the time taken to (on average) replace the whole population

•  The number of coalescent events that occur during that unit of time has the same probability structure as under the WF model

–  i.e. the probability of more than one coalescent event is of order $N^{-2}$

Generalisations of the coalescent

•  The derivation presented assumes a binomial number of offspring under the WF model (or a Poisson number under the Moran model)

•  What about other distributions of offspring number?

–  E.g. suppose there is over-dispersion of reproductive success

•  As long as the probability of more than one coalescent is of order $N^{-2}$, the coalescent will still work (i.e. take the limit as N→∞), but we have to introduce the concept of the effective population size

•  However, for extremely skewed distributions (e.g. suppose there is a small probability that a proportion lambda of the population is replaced by the offspring of a single individual) this no longer holds

–  Generalisations of the coalescent to include ‘multiple mergers’ are currently much discussed by mathematical population geneticists

Reminder about the coalescent

•  Exponential waiting time till coalescence

•  Poisson number of mutations arising

•  Expected number of differences between two sequences equal to the parameter θ

•  Expected number of mutations in a sample of n chromosomes equal to

$$E[S] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

•  What are the distributions of these quantities?
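As a small numerical illustration of the expectation E[S] = θ(1 + 1/2 + ... + 1/(n-1)), here is a minimal sketch; the function name `expected_segregating_sites` is illustrative.

```python
def expected_segregating_sites(n, theta):
    """E[S] for a sample of n chromosomes: theta times the harmonic sum up to n-1."""
    return theta * sum(1.0 / i for i in range(1, n))


if __name__ == "__main__":
    # The expectation grows only logarithmically with the sample size.
    for n in (2, 10, 100, 1000):
        print(n, expected_segregating_sites(n, theta=5.0))
```

For n = 2 the sum has a single term, so the function returns θ, matching the expected number of differences between two sequences.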

A geometric distribution for pairwise differences

•  If we combine the exponential distribution of coalescence times with the Poisson number of mutations we get the distribution of pairwise differences

•  This is the geometric distribution

•  Remember that this only works for NO recombination

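$$\Pr\{k; \theta\} = \int_{t=0}^{\infty} \Pr\{k; \theta, t\}\, \phi(t)\, dt = \int_{t=0}^{\infty} \frac{e^{-\theta t} (\theta t)^k}{k!}\, e^{-t}\, dt = \frac{1}{1+\theta} \left( \frac{\theta}{1+\theta} \right)^k$$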
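The step from the exponential-Poisson mixture to the geometric distribution can be checked numerically. The sketch below assumes the coalescence time t is exponential with mean 1 and that the number of differences given t is Poisson with mean θt; the function names and the midpoint-rule settings (`t_max`, `steps`) are illustrative choices.

```python
from math import exp, factorial


def pairwise_differences_pmf(k, theta):
    """Geometric form: Pr{k} = (1/(1+theta)) * (theta/(1+theta))**k."""
    p = 1.0 / (1.0 + theta)
    return p * (1.0 - p) ** k


def pmf_by_integration(k, theta, t_max=20.0, steps=100_000):
    """Midpoint-rule approximation of the integral of Poisson(k; theta*t) * exp(-t) dt."""
    h = t_max / steps
    total = 0.0
    for s in range(steps):
        t = (s + 0.5) * h
        total += exp(-theta * t) * (theta * t) ** k / factorial(k) * exp(-t)
    return total * h


if __name__ == "__main__":
    theta = 5.0
    for k in range(5):
        print(k, pairwise_differences_pmf(k, theta), pmf_by_integration(k, theta))
```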

The effect of recombination

•  θ = 5 (about 5kb in humans)

[Figure: two panels, NO recombination and NO linkage, each showing probability against number of differences, with the mean marked]

What about n > 2?

•  There is no closed formula for the distribution of the number of mutations

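•  But there is a recursion (Tavaré 1984) that calculates the probability

$$\Pr\{S; n, \theta\} = \sum_{i=0}^{S} \Pr\{i; n-1, \theta\} \times \Pr\{S - i \text{ mutations in adding nth sequence}\}$$

$$\Pr\{S - i \text{ mutations in adding nth sequence}\} = \int_0^{\infty} \binom{n}{2} e^{-\binom{n}{2} t}\, \frac{e^{-n\theta t/2}\, (n\theta t/2)^{S-i}}{(S-i)!}\, dt = \left( \frac{n-1}{n-1+\theta} \right) \left( \frac{\theta}{n-1+\theta} \right)^{S-i}$$

Geometric distribution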

NB remember that this assumes no recombination
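The recursion can be sketched as a repeated convolution with a geometric distribution. The code below assumes the per-epoch term Pr{k mutations while there are m lineages} = ((m-1)/(m-1+θ)) (θ/(m-1+θ))^k and starts from a single sequence with no segregating sites; the function name `segregating_sites_pmf` and the truncation point `s_max` are illustrative.

```python
def segregating_sites_pmf(n, theta, s_max):
    """Return [Pr{S = s; n, theta} for s in 0..s_max], assuming no recombination."""
    # Base case: one sequence has zero segregating sites with probability 1.
    probs = [1.0] + [0.0] * s_max
    for m in range(2, n + 1):
        p = (m - 1) / (m - 1 + theta)
        # Geometric number of mutations added while there are m lineages.
        added = [p * (1.0 - p) ** k for k in range(s_max + 1)]
        # Convolve the distribution for m-1 sequences with the added mutations.
        probs = [
            sum(probs[i] * added[s - i] for i in range(s + 1))
            for s in range(s_max + 1)
        ]
    return probs


if __name__ == "__main__":
    pmf = segregating_sites_pmf(n=5, theta=1.0, s_max=50)
    print(pmf[:5])
    print(sum(s * p for s, p in enumerate(pmf)))  # compare with theta * (1 + 1/2 + 1/3 + 1/4)
```

Truncating at `s_max` does not change the values for s up to `s_max`, because each convolution term only uses smaller counts.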

Variances

•  Although we cannot obtain a closed formula for the distribution of segregating sites, we can calculate the variance

•  The total number of segregating sites is the sum of the numbers that occur during each epoch of the coalescent

$$S = S_2 + S_3 + \dots + S_n$$

•  Because these epochs are independent, the variance can be written as a sum of variances across epochs

$$\mathrm{Var}(S) = \mathrm{Var}(S_2) + \mathrm{Var}(S_3) + \dots + \mathrm{Var}(S_n)$$

•  The variance for epoch i is obtained from the geometric distribution

$$\mathrm{Var}(S_i) = \frac{\theta}{i-1} + \left( \frac{\theta}{i-1} \right)^2$$

•  This gives a total variance of

$$\mathrm{Var}(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$$
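The sum over epochs can be written out directly. This is a minimal sketch, assuming the per-epoch variance θ/(i-1) + (θ/(i-1))^2 that follows from the geometric distribution with success probability (i-1)/(i-1+θ); the function names are illustrative.

```python
def epoch_variance(i, theta):
    """Variance of the number of mutations arising while there are i lineages."""
    x = theta / (i - 1)
    return x + x ** 2


def variance_segregating_sites(n, theta):
    """Var(S): sum of the independent epoch variances for i = 2..n."""
    return sum(epoch_variance(i, theta) for i in range(2, n + 1))


def variance_closed_form(n, theta):
    """theta * sum(1/i) + theta^2 * sum(1/i^2), with both sums over i = 1..n-1."""
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    return theta * a_n + theta ** 2 * b_n


if __name__ == "__main__":
    print(variance_segregating_sites(10, 5.0), variance_closed_form(10, 5.0))
```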

Variances of moment estimators

•  We looked at two estimators of θ – the average pairwise differences and Watterson’s estimator from the number of segregating sites

•  We want to know about the variance of these estimators (how good are they)

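•  Watterson’s

$$\hat{\theta}_W = S / a_n, \qquad \mathrm{Var}(\hat{\theta}_W) = \mathrm{Var}(S) / a_n^2, \qquad \text{where } a_n = \sum_{i=1}^{n-1} \frac{1}{i}$$

•  Average pairwise differences

$$\hat{\theta}_\pi = \binom{n}{2}^{-1} \sum_{i<j} k_{ij}$$

$$\mathrm{Var}(\hat{\theta}_\pi) = \binom{n}{2}^{-2} \left[ \sum_{i<j} \mathrm{Var}(k_{ij}) + \sum_{i<j} \sum_{\substack{l<m \\ (l,m) \neq (i,j)}} \mathrm{Cov}(k_{ij}, k_{lm}) \right]$$

where $k_{ij}$ is the number of differences between sequences i and j

As n gets large, the variance of Watterson’s estimator tends to zero, so it converges on the truth; the variance of the average pairwise differences does not. Convergence on the truth as n grows is the property called consistency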
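Both moment estimators are simple to compute from data. The sketch below assumes sequences are given as equal-length strings and that Var(S) = θ a_n + θ² b_n with b_n the sum of 1/i² for i = 1..n-1; all function names are illustrative.

```python
from itertools import combinations


def watterson_estimate(S, n):
    """Watterson's estimator: segregating sites divided by a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n))
    return S / a_n


def watterson_variance(n, theta):
    """Var(S) / a_n^2 for a given true theta."""
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    return (theta * a_n + theta ** 2 * b_n) / a_n ** 2


def average_pairwise_differences(seqs):
    """Mean number of differing positions over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)


if __name__ == "__main__":
    print(average_pairwise_differences(["AAT", "AAA", "TAA"]))
    # The variance of Watterson's estimator keeps shrinking as n grows, but slowly.
    for n in (10, 100, 1000, 10000):
        print(n, watterson_variance(n, theta=5.0))
```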

Expectations of other quantities

•  We have calculated properties for the number of segregating sites, but we might want to look in detail at the number of mutations segregating at different frequencies

•  What is the expected number of segregating sites at which the derived mutation is at frequency i in a sample of n chromosomes?

•  To answer this question we have to look at the combinatorics of the coalescent process

•  We also need to look at a different formulation for the coalescent – FORWARD in time!!

•  This model is called the Hoppe Urn model

Hoppe Urn model

•  Suppose I have a box with i white balls and m-i black balls

•  I draw a ball at random and put it back in the box, along with an additional ball of the same colour

•  What is the probability that after adding n – m balls I am left with j white balls?

•  The answer is the hypergeometric distribution

$$\Pr\{j; m, n, i\} = \frac{\binom{j-1}{i-1} \binom{n-j-1}{m-i-1}}{\binom{n-1}{m-1}}$$
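The urn probability can be checked against a direct ball-by-ball calculation. The sketch below assumes 1 ≤ i < m (both colours present at the start) and the form C(j-1, i-1) C(n-j-1, m-i-1) / C(n-1, m-1); the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def urn_pmf(j, i, m, n):
    """Pr{j white balls among n}, starting from i white among m (assumes 1 <= i < m)."""
    if not (i <= j <= n - (m - i)):
        return Fraction(0)
    return Fraction(comb(j - 1, i - 1) * comb(n - j - 1, m - i - 1), comb(n - 1, m - 1))


def urn_pmf_by_recursion(i, m, n):
    """Same distribution built up by adding one ball at a time."""
    dist = {i: Fraction(1)}
    for total in range(m, n):
        nxt = {}
        for white, p in dist.items():
            # Draw a white ball with probability white/total and add another white one.
            nxt[white + 1] = nxt.get(white + 1, 0) + p * Fraction(white, total)
            # Otherwise a black ball is drawn and another black one is added.
            nxt[white] = nxt.get(white, 0) + p * Fraction(total - white, total)
        dist = nxt
    return dist


if __name__ == "__main__":
    dist = urn_pmf_by_recursion(2, 5, 9)
    for j in sorted(dist):
        print(j, dist[j], urn_pmf(j, 2, 5, 9))
```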

What is the link to genealogies?

•  Coalescent genealogies are usually considered backwards in time, however if there is no recombination I can also simulate forwards in time

•  To do this I select a lineage at random to ‘split’ and continue the tips forward for an exponentially distributed length of time with rate k(k-1)/2, where k is the current number of lineages

•  The number of descendant lineages (for each lineage) when there are k lineages is described by the Urn model

[Figure: tree for 3 sequences; tree for 4 sequences]
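The forward-in-time construction can be sketched as a short simulation. This is an illustration under stated assumptions: one lineage is marked when there are m lineages, a uniformly chosen lineage splits at each step, and an epoch with k lineages lasts an exponential time with rate k(k-1)/2. The function name `simulate_forward` and its return values are hypothetical.

```python
import random


def simulate_forward(n, m, rng):
    """Grow a genealogy forward from m to n lineages (m >= 2).

    Returns the number of descendants of one marked lineage among the n tips,
    and the time elapsed between the start of epoch m and the start of epoch n.
    """
    marked, k, elapsed = 1, m, 0.0
    while k < n:
        # Duration of the epoch with k lineages.
        elapsed += rng.expovariate(k * (k - 1) / 2)
        # A uniformly chosen lineage splits; it is a marked one with probability marked/k.
        if rng.random() < marked / k:
            marked += 1
        k += 1
    return marked, elapsed


if __name__ == "__main__":
    rng = random.Random(1)
    runs = 20_000
    counts = {}
    for _ in range(runs):
        j, _t = simulate_forward(5, 2, rng)
        counts[j] = counts.get(j, 0) + 1
    # With m = 2 the urn formula gives probability 1/(n-1) for each j in 1..n-1.
    print({j: c / runs for j, c in sorted(counts.items())})
```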

The link to allele frequency

•  The required probability is

$$\Pr\{j; m, n, i = 1\} = \frac{(m-1)\,(n-j-1)!\,(n-m)!}{(n-m-j+1)!\,(n-1)!}$$

•  The expected number of mutations that occur during this epoch (when there are m lineages) is

$$E[S^m] = \frac{\theta}{m-1}$$

•  So the expected number of mutations with sample frequency j occurring when there are m lineages is

$$E[S_j^m] = \theta\, \frac{(n-j-1)!\,(n-m)!}{(n-m-j+1)!\,(n-1)!}$$

Summing over epochs

•  We can write down the number of mutations in the sample with frequency j as the sum over epochs

$$S_j = \sum_{m=2}^{n-j+1} S_j^m$$

•  All we need to do is to sum the expectations

$$E[S_j] = \sum_{m=2}^{n-j+1} E[S_j^m] = \theta\, \frac{(n-j-1)!}{(n-1)!} \sum_{m=2}^{n-j+1} \frac{(n-m)!}{(n-m-j+1)!} = \frac{\theta}{j}$$
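The collapse of the sum over epochs to θ/j can be verified with exact arithmetic. The sketch below assumes the per-epoch expectation E[S_j^m] = θ (n-j-1)! (n-m)! / ((n-m-j+1)! (n-1)!) for m = 2..n-j+1; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def expected_sites_by_epoch(j, m, n, theta):
    """E[S_j^m]: mutations arising with m lineages that end up at sample frequency j."""
    if m > n - j + 1:
        return Fraction(0)
    num = factorial(n - j - 1) * factorial(n - m)
    den = factorial(n - m - j + 1) * factorial(n - 1)
    return theta * Fraction(num, den)


def expected_sites(j, n, theta):
    """E[S_j]: sum of the epoch contributions for m = 2..n-j+1."""
    return sum(expected_sites_by_epoch(j, m, n, theta) for m in range(2, n - j + 2))


if __name__ == "__main__":
    theta = Fraction(3)
    for j in range(1, 10):
        print(j, expected_sites(j, 10, theta))  # compare with theta / j
```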

Consequences

•  This remarkable result gives us a whole new set of estimators for θ (Fu 1996)

•  For example, Fay and Wu proposed the estimator

$$\hat{\theta}_H = \binom{n}{2}^{-1} \sum_{i=1}^{n-1} i^2 S_i$$

•  It is not easy (possible?) to derive expressions for variances and covariances

•  However, it does give us insights into the age distribution of mutations
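As an illustration of how such an estimator is computed from a frequency spectrum, here is a minimal sketch. It assumes the spectrum is supplied as a list whose entry i-1 is the number of sites where the derived allele is carried by i of the n chromosomes; the function name `theta_h` is illustrative.

```python
from math import comb


def theta_h(sfs):
    """Estimator of theta weighting each site by i^2: sum(i^2 * S_i) / C(n, 2).

    sfs[i - 1] is the number of sites with derived-allele count i, for i = 1..n-1.
    """
    n = len(sfs) + 1
    return sum(i * i * s for i, s in enumerate(sfs, start=1)) / comb(n, 2)


if __name__ == "__main__":
    theta, n = 4.0, 6
    # Plugging in the expected spectrum E[S_i] = theta / i should return theta.
    expected_sfs = [theta / i for i in range(1, n)]
    print(theta_h(expected_sfs))
```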

Age distributions of mutations

•  The probability that a mutation where the derived allele is at frequency j in the sample occurred during epoch m is proportional to

$$\Pr\{m; j, n\} \propto \frac{(n-m)!}{(n-m-j+1)!}$$

•  For j = 1

$$\Pr\{m; 1, n\} \propto 1$$

•  In other words, such mutations are equally likely to have occurred during any epoch!

•  Their age distribution is therefore described by the age distribution of epochs

More generally…

•  The age distribution of mutations can be derived by considering the relationship between the epoch distribution of mutations and the age distribution of epochs

$$E[\text{Age}; j, n] = \sum_{m=2}^{n-j+1} \Pr\{m; j, n\}\, E[\text{Age} \mid m]$$

$$E[\text{Age} \mid m] = \frac{1}{m} + \frac{1}{m-1} - \frac{2}{n} \qquad \text{(Coalescent)}$$

•  Note that this approach can generalise to more complex scenarios

–  The combinatorics apply to any binary tree
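The expected age of a mutation at sample frequency j can be computed by combining the two ingredients. The sketch below assumes Pr{m; j, n} ∝ (n-m)!/(n-m-j+1)! for m = 2..n-j+1 and E[Age | m] = 1/m + 1/(m-1) - 2/n, in the same time units as the coalescent waiting times; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial


def epoch_probabilities(j, n):
    """Normalised Pr{m; j, n} for m = 2..n-j+1."""
    weights = {
        m: Fraction(factorial(n - m), factorial(n - m - j + 1))
        for m in range(2, n - j + 2)
    }
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def expected_age_given_epoch(m, n):
    """E[Age | m] under the standard coalescent."""
    return Fraction(1, m) + Fraction(1, m - 1) - Fraction(2, n)


def expected_age(j, n):
    """E[Age; j, n]: epoch probabilities weighted by the expected age of each epoch."""
    probs = epoch_probabilities(j, n)
    return sum(p * expected_age_given_epoch(m, n) for m, p in probs.items())


if __name__ == "__main__":
    for j in (1, 10, 50):
        print(j, float(expected_age(j, 100)))
```

Only `expected_age_given_epoch` depends on the coalescent; the epoch probabilities come from the tree combinatorics alone.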

Examples

•  n=100

[Figure: two panels, j = 1 and j = 10, each showing Prob(m) and E[Age|m]]

Classical mathematical population genetics

•  Although different in key ways, there are strong links between coalescent theory and older approaches in mathematical population genetics

•  Many of the results from coalescent theory have analogues from diffusion theory approaches

–  For example, the frequency of the derived mutation has the transient stationary distribution (Wright, Kimura)

$$\phi(x) \propto 1/x$$

•  However, the coalescent does generally provide a more intuitive approach to describing probability distributions for quantities of interest

–  Though it does have limitations, e.g. selection
