Coalescent theory
In this lecture we will…
• Look closely at the basic derivation of the coalescent
• Calculate expectations and variances of key genealogical properties
• Derive properties of the frequency and age spectrum of mutations
• Look at the relationship between coalescent theory and other mathematical approaches to population genetics
Derivation of the coalescent
• From the Wright-Fisher model we can calculate the probability that a sample of n individuals has exactly $a$ distinct ancestors in the previous generation, in a population of size N
• The key point is that the probability of more than one coalescent event in that generation is of order $N^{-2}$
– As N tends to infinity, the probability tends to zero
[Figure: example with n = 4, a = 2]

$$\Pr\{a; n, N\} = \Pr\{n - a \text{ coalescent events}\} = \frac{S_2(n,a)}{N^n} \prod_{i=0}^{a-1} (N - i)$$

$$\Pr\{n-1; n, N\} = \frac{n(n-1)}{2N} + O(N^{-2})$$

$$\Pr\{a < n-1; n, N\} = O(N^{-2})$$

$S_2(n,a)$ = Stirling number of the second kind
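The Wright-Fisher probability can be evaluated exactly for small samples. The following is a minimal sketch, not part of the original slides; the function names (`stirling2`, `prob_ancestors`, `prob_multiple_events`) are hypothetical and it uses exact rational arithmetic so that the scaling with N is visible.

```python
from fractions import Fraction
from math import comb, factorial


def stirling2(n: int, a: int) -> int:
    """Stirling number of the second kind via the inclusion-exclusion sum."""
    total = sum((-1) ** j * comb(a, j) * (a - j) ** n for j in range(a + 1))
    return total // factorial(a)


def prob_ancestors(a: int, n: int, N: int) -> Fraction:
    """Pr{a; n, N}: n sampled individuals have exactly a distinct parents."""
    falling = 1
    for i in range(a):
        falling *= N - i  # N (N - 1) ... (N - a + 1)
    return Fraction(stirling2(n, a) * falling, N ** n)


def prob_multiple_events(n: int, N: int) -> Fraction:
    """Probability of more than one coalescent event, i.e. a < n - 1."""
    return 1 - prob_ancestors(n, n, N) - prob_ancestors(n - 1, n, N)


if __name__ == "__main__":
    n = 4
    for N in (100, 1_000, 10_000):
        one = prob_ancestors(n - 1, n, N)
        many = prob_multiple_events(n, N)
        # many * N**2 should settle to a constant if the term is of order N^-2.
        print(N, float(one), float(many), float(many * N ** 2))
```

Printing the probability of multiple events multiplied by N squared is one way to see that it behaves like a constant over N squared, while the single-event probability behaves like n(n-1)/(2N).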
How accurate is the coalescent?
• For small populations you will expect the coalescent to become less accurate as multiple coalescent events will occur each generation
• We can look at the probability that this occurs as the proportion of the population sampled increases
[Figure: probability of >1 coalescent event plotted against the proportion of the population sampled (N = 10,000)]
The Moran model
• There is an alternative way of deriving the coalescent, which comes from the Moran model
• The Moran model is a continuous time process
– Each individual survives an exponentially distributed length of time
– At death, an individual is replaced by a copy of a randomly chosen individual from the remaining population
– Selection can be incorporated by differential mortality (or replacement)
• There is a natural scaling of time in this model – the time taken to (on average) replace the whole population
• The number of coalescent events that occur during that unit of time has the same probability structure as under the WF model
– i.e. the probability of more than one coalescent event is $O(N^{-2})$
Generalisations of the coalescent
• The derivation presented assumes a binomial number of offspring under the WF model (or a Poisson number under the Moran model)
• What about other distributions of offspring number?
– E.g. suppose there is over-dispersion of reproductive success
• As long as the probability of more than one coalescent is of order $N^{-2}$, the coalescent will still work (i.e. take the limit as N→∞), but we have to introduce the concept of the effective population size
• However, for extremely skewed distributions (e.g. suppose there is a small probability that a proportion lambda of the population is replaced by the offspring of a single individual) this no longer holds
– Generalisations of the coalescent to include ‘multiple mergers’ are currently much discussed by mathematical population geneticists
Reminder about the coalescent
• Exponential waiting time till coalescence
• Poisson number of mutations arising
• Expected number of differences between two sequences equal to the parameter θ
• Expected number of mutations in a sample of n chromosomes equal to
$$E[S] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$
• What are the distributions of these quantities?
A geometric distribution for pairwise differences
• If we combine the exponential distribution of coalescence times with the Poisson number of mutations we get the distribution of pairwise differences
• This is the geometric distribution
• Remember that this only works for NO recombination
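$$\Pr\{k; \theta\} = \int_0^\infty \Pr\{k; t\}\,\phi(t)\,dt = \int_0^\infty \frac{(\theta t)^k e^{-\theta t}}{k!}\, e^{-t}\,dt = \frac{1}{1+\theta}\left(\frac{\theta}{1+\theta}\right)^k$$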
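The geometric form can be checked by simulating the two ingredients directly. This is a minimal sketch under the assumption that time is measured so that the pairwise coalescence time is exponential with rate 1 and mutations then arise as a Poisson process with mean θt; the function names are hypothetical.

```python
import random


def pairwise_pmf(k: int, theta: float) -> float:
    """Pr{k; theta} = (1 / (1 + theta)) * (theta / (1 + theta)) ** k."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** k


def simulate_pairwise(theta: float, reps: int, seed: int = 1) -> list[int]:
    """Draw T ~ Exp(1), then a Poisson(theta * T) number of differences."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        t = rng.expovariate(1.0)
        # Poisson draw: count unit-rate arrivals that land before theta * t.
        k, clock = 0, rng.expovariate(1.0)
        while clock < theta * t:
            k += 1
            clock += rng.expovariate(1.0)
        draws.append(k)
    return draws


if __name__ == "__main__":
    theta = 5.0
    draws = simulate_pairwise(theta, 20_000)
    for k in range(4):
        observed = draws.count(k) / len(draws)
        print(k, round(observed, 4), round(pairwise_pmf(k, theta), 4))
```

The simulated frequencies should sit close to the geometric probabilities, and the sample mean close to θ.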
The effect of recombination
• θ = 5 (about 5kb in humans)
[Figure: two panels, "NO recombination" and "NO linkage", each showing the probability of each number of pairwise differences, with the mean marked]
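The two extremes can be compared numerically. This sketch assumes (it is not stated on the slide) that with no linkage every site has an independent genealogy, so the number of pairwise differences is approximately Poisson with mean θ, while with no recombination it is geometric as derived earlier. The function names are hypothetical.

```python
from math import exp, lgamma, log


def geometric_pmf(k: int, theta: float) -> float:
    """No recombination: geometric distribution of pairwise differences."""
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** k


def poisson_pmf(k: int, theta: float) -> float:
    """No linkage (assumed Poisson limit), computed on the log scale."""
    return exp(k * log(theta) - theta - lgamma(k + 1))


def mean_and_variance(pmf, theta: float, k_max: int = 400):
    """Mean and variance of a pmf truncated at k_max."""
    probs = [pmf(k, theta) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    theta = 5.0
    print("no recombination:", mean_and_variance(geometric_pmf, theta))
    print("no linkage:      ", mean_and_variance(poisson_pmf, theta))
```

Both cases share the mean θ, but the variance is θ + θ² without recombination and θ under the Poisson assumption, which is why the unlinked distribution is much more tightly concentrated around the mean.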
What about n > 2?
• There is no closed formula for the distribution of the number of mutations
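• But there is a recursion (Tavaré 1984) that calculates the probability
$$\Pr\{S; n, \theta\} = \sum_{i=0}^{S} \Pr\{i; n-1, \theta\} \times \Pr\{S-i \text{ mutations in adding } n\text{th sequence}\}$$
$$\Pr\{S-i \text{ mutations in adding } n\text{th sequence}\} = \int_0^\infty \binom{n}{2} e^{-\binom{n}{2} t}\, e^{-n\theta t/2}\, \frac{(n\theta t/2)^{S-i}}{(S-i)!}\,dt = \frac{n-1}{n-1+\theta}\left(\frac{\theta}{n-1+\theta}\right)^{S-i}$$
(Geometric distribution)
NB remember that this assumes no recombination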
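The recursion is a repeated convolution with a geometric distribution, one step per added sequence. A minimal sketch, assuming the per-epoch geometric probability (n-1)/(n-1+θ) × (θ/(n-1+θ))^k for k mutations while there are n lineages; the function names are hypothetical.

```python
def added_mutations_pmf(k: int, n: int, theta: float) -> float:
    """Probability of k mutations during the epoch with n lineages."""
    p = (n - 1) / (n - 1 + theta)
    return p * (1.0 - p) ** k


def segregating_sites_pmf(n: int, theta: float, s_max: int) -> list[float]:
    """Return [Pr{S = 0}, ..., Pr{S = s_max}] for a sample of n sequences."""
    # Base case n = 2: the geometric distribution of pairwise differences.
    dist = [added_mutations_pmf(s, 2, theta) for s in range(s_max + 1)]
    for m in range(3, n + 1):
        step = [added_mutations_pmf(k, m, theta) for k in range(s_max + 1)]
        # Pr{S; m} = sum_i Pr{i; m - 1} * Pr{S - i mutations in adding mth sequence}
        dist = [
            sum(dist[i] * step[s - i] for i in range(s + 1))
            for s in range(s_max + 1)
        ]
    return dist


if __name__ == "__main__":
    n, theta = 10, 5.0
    dist = segregating_sites_pmf(n, theta, 300)
    mean = sum(s * p for s, p in enumerate(dist))
    a_n = sum(1 / i for i in range(1, n))
    print("mean from recursion:", mean, " theta * a_n:", theta * a_n)
```

Truncating at `s_max` does not affect the entries that are kept, because each probability only depends on smaller values of S.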
Variances
• Although we cannot obtain a closed formula for the distribution of segregating sites, we can calculate the variance
• The total number of segregating sites is the sum of the numbers of mutations that occur during each epoch of the coalescent
$$S = S_2 + S_3 + \dots + S_n$$
• Because these epochs are independent, the variance can be written as a sum of variances across epochs
$$\mathrm{Var}(S) = \mathrm{Var}(S_2) + \mathrm{Var}(S_3) + \dots + \mathrm{Var}(S_n)$$
• The variance for epoch i (when there are i lineages) is obtained from the geometric distribution
$$\mathrm{Var}(S_i) = \frac{\theta}{i-1} + \left(\frac{\theta}{i-1}\right)^2$$
• This gives a total variance of
$$\mathrm{Var}(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$$
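The sum over epochs can be checked with exact arithmetic. A small sketch, assuming the per-epoch variance θ/(i-1) + (θ/(i-1))² for the epoch with i lineages; the function names are hypothetical.

```python
from fractions import Fraction


def harmonic(n: int, power: int = 1) -> Fraction:
    """Sum of 1 / i**power for i = 1 .. n - 1."""
    return sum(Fraction(1, i ** power) for i in range(1, n))


def expected_segregating_sites(n: int, theta) -> Fraction:
    """E[S] = theta * sum_{i=1}^{n-1} 1/i."""
    return Fraction(theta) * harmonic(n)


def variance_segregating_sites(n: int, theta) -> Fraction:
    """Var(S) = theta * sum 1/i + theta^2 * sum 1/i^2."""
    theta = Fraction(theta)
    return theta * harmonic(n) + theta ** 2 * harmonic(n, 2)


def epoch_variance(i: int, theta) -> Fraction:
    """Variance of the number of mutations while there are i lineages."""
    x = Fraction(theta) / (i - 1)
    return x + x ** 2


if __name__ == "__main__":
    n, theta = 10, 5
    total = sum(epoch_variance(i, theta) for i in range(2, n + 1))
    print(float(expected_segregating_sites(n, theta)))
    print(float(variance_segregating_sites(n, theta)), float(total))
```

For n = 2 the variance reduces to θ + θ², the variance of the geometric distribution of pairwise differences.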
Variances of moment estimators
• We looked at two estimators of θ – the average pairwise differences and Watterson’s estimator from the number of segregating sites
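• We want to know about the variance of these estimators (how good are they?)
• Watterson’s
$$\hat{\theta}_W = S / a_n, \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i}$$
$$\mathrm{Var}(\hat{\theta}_W) = \mathrm{Var}(S) / a_n^2$$
• Average pairwise differences
$$\hat{\theta}_\pi = \binom{n}{2}^{-1} \sum_{i<j} k_{ij}$$
$$\mathrm{Var}(\hat{\theta}_\pi) = \binom{n}{2}^{-2} \left[ \sum_{i<j} \mathrm{Var}(k_{ij}) + \sum_{(i,j) \neq (l,m)} \mathrm{Cov}(k_{ij}, k_{lm}) \right]$$
where $k_{ij}$ is the number of differences between sequences i and j, and the covariance sum runs over distinct pairs of sequence pairs
As n gets large, Watterson’s estimator converges on the true value of θ but the average pairwise difference does not; this property, which only Watterson’s estimator has, is called consistency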
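The contrast between the two estimators can be seen numerically. This sketch uses the variance of S from the previous slide for Watterson's estimator and, as an assumption not shown on the slides, the closed form for the variance of the average pairwise difference usually attributed to Tajima (1983). The function names are hypothetical.

```python
def a_n(n: int) -> float:
    """Sum of 1/i for i = 1 .. n - 1."""
    return sum(1.0 / i for i in range(1, n))


def b_n(n: int) -> float:
    """Sum of 1/i^2 for i = 1 .. n - 1."""
    return sum(1.0 / (i * i) for i in range(1, n))


def var_watterson(n: int, theta: float) -> float:
    """Var(S) / a_n^2 with Var(S) = theta * a_n + theta^2 * b_n."""
    a = a_n(n)
    return (theta * a + theta ** 2 * b_n(n)) / a ** 2


def var_pairwise(n: int, theta: float) -> float:
    """Assumed closed form (Tajima 1983) for the variance of the average
    pairwise difference, under no recombination."""
    linear = (n + 1) / (3 * (n - 1)) * theta
    quadratic = 2 * (n * n + n + 3) / (9 * n * (n - 1)) * theta ** 2
    return linear + quadratic


if __name__ == "__main__":
    theta = 5.0
    for n in (2, 10, 100, 1_000, 10_000):
        print(n, round(var_watterson(n, theta), 3), round(var_pairwise(n, theta), 3))
```

Under these formulas the variance of Watterson's estimator keeps shrinking (slowly, since a_n grows like log n), while the variance of the pairwise estimator levels off near θ/3 + 2θ²/9.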
Expectations of other quantities
• We have calculated properties for the number of segregating sites, but we might want to look in detail at the number of mutations segregating at different frequencies
• What is the expected number of segregating sites at which the derived mutation is at frequency i in a sample of n chromosomes?
• To answer this question we have to look at the combinatorics of the coalescent process
• We also need to look at a different formulation for the coalescent – FORWARD in time!!
• This model is called the Hoppe Urn model
Hoppe Urn model
• Suppose I have a box with i white balls and m-i black balls
• I draw a ball at random and put it back in the box, along with an additional ball of the same colour
• What is the probability that after adding n – m balls I am left with j white balls?
• The answer is the hypergeometric distribution
$$\Pr\{j; m, n, i\} = \frac{\binom{j-1}{i-1} \binom{n-j-1}{m-i-1}}{\binom{n-1}{m-1}}$$
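The urn formula can be checked against a draw-by-draw calculation. A minimal sketch, assuming the urn starts with i white and m - i black balls (with at least one of each colour) and that Pr{j; m, n, i} = C(j-1, i-1) C(n-j-1, m-i-1) / C(n-1, m-1); the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def urn_formula(j: int, m: int, n: int, i: int) -> Fraction:
    """Closed-form Pr{j; m, n, i}; assumes 1 <= i <= m - 1 and m <= n."""
    if j < i or n - j < m - i:
        return Fraction(0)
    return Fraction(comb(j - 1, i - 1) * comb(n - j - 1, m - i - 1),
                    comb(n - 1, m - 1))


def urn_by_steps(i: int, m: int, n: int) -> dict[int, Fraction]:
    """Exact distribution of the white count after adding n - m balls,
    obtained by iterating the urn one draw at a time."""
    dist = {i: Fraction(1)}
    for total in range(m, n):
        nxt: dict[int, Fraction] = {}
        for white, p in dist.items():
            # A white ball is drawn with probability white / total.
            nxt[white + 1] = nxt.get(white + 1, 0) + p * Fraction(white, total)
            nxt[white] = nxt.get(white, 0) + p * Fraction(total - white, total)
        dist = nxt
    return dist


if __name__ == "__main__":
    i, m, n = 2, 5, 12
    for j, p in sorted(urn_by_steps(i, m, n).items()):
        print(j, p, urn_formula(j, m, n, i))
```

The two columns printed for each j should agree exactly, since both are computed with rational arithmetic.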
What is the link to genealogies?
• Coalescent genealogies are usually considered backwards in time, however if there is no recombination I can also simulate forwards in time
• To do this I select a lineage at random to ‘split’ and continue the tips forward for an exponentially distributed length of time with rate k(k-1)/2, where k is the current number of lineages
• The number of descendant lineages (for each lineage) when there are k lineages is described by the Urn model
[Figure: two panels, "Tree for 3 sequences" and "Tree for 4 sequences"]
The link to allele frequency
• The required probability is
$$\Pr\{j; m, n, i=1\} = \frac{(m-1)\,(n-j-1)!\,(n-m)!}{(n-m-j+1)!\,(n-1)!}$$
• The expected number of mutations that occur during this epoch (when there are m lineages) is
$$E[S^m] = \frac{\theta}{m-1}$$
• So the expected number of mutations with sample frequency j occurring when there are m lineages is
$$E[S_j^m] = \theta\, \frac{(n-m)!\,(n-j-1)!}{(n-1)!\,(n-m-j+1)!}$$
Summing over epochs
• We can write down the number of mutations in the sample with frequency j as the sum over epochs
$$S_j = \sum_{m=2}^{n-j+1} S_j^m$$
• All we need to do is to sum the expectations
$$E[S_j] = \sum_{m=2}^{n-j+1} E[S_j^m] = \theta\, \frac{(n-j-1)!}{(n-1)!} \sum_{m=2}^{n-j+1} \frac{(n-m)!}{(n-m-j+1)!} = \frac{\theta}{j}$$
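The collapse of the sum to θ/j is easy to confirm for small n. A minimal sketch, assuming the per-epoch expectation E[S_j^m] = θ (n-m)! (n-j-1)! / ((n-1)! (n-m-j+1)!) for m between 2 and n-j+1; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def expected_sites_epoch(j: int, m: int, n: int, theta=Fraction(1)) -> Fraction:
    """E[S_j^m]: expected mutations at frequency j arising with m lineages."""
    if m < 2 or m > n - j + 1:
        return Fraction(0)
    return theta * Fraction(factorial(n - m) * factorial(n - j - 1),
                            factorial(n - 1) * factorial(n - m - j + 1))


def expected_sites(j: int, n: int, theta=Fraction(1)) -> Fraction:
    """E[S_j]: sum of E[S_j^m] over epochs m = 2 .. n - j + 1."""
    return sum(expected_sites_epoch(j, m, n, theta) for m in range(2, n - j + 2))


if __name__ == "__main__":
    n, theta = 12, Fraction(5)
    for j in range(1, n):
        print(j, expected_sites(j, n, theta), theta / j)
```

Because the arithmetic is exact, the two printed values for each j should be identical fractions.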
Consequences
• This remarkable result gives us a whole new set of estimators for θ (Fu 1996)
• For example, Fay and Wu proposed the estimator
$$\hat{\theta}_H = \binom{n}{2}^{-1} \sum_{i=1}^{n-1} i^2 S_i$$
• It is not easy (possible?) to derive expressions for variances and covariances
• However, it does give us insights into the age distribution of mutations
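Since E[S_i] = θ/i, any weighted sum of the S_i whose weights are chosen appropriately gives an unbiased estimator of θ. The sketch below computes three such estimators from a site frequency spectrum. The list layout (`sfs[i - 1]` holds S_i) and the function names are assumptions, and the expression for the average pairwise difference in terms of the spectrum is the standard one rather than something stated on the slides.

```python
from fractions import Fraction
from math import comb


def watterson(sfs) -> Fraction:
    """S / a_n, where S is the total number of segregating sites."""
    n = len(sfs) + 1
    a_n = sum(Fraction(1, i) for i in range(1, n))
    return sum(sfs) / a_n


def pairwise(sfs) -> Fraction:
    """Average pairwise differences: sum_i i (n - i) S_i / C(n, 2)."""
    n = len(sfs) + 1
    total = sum(i * (n - i) * s for i, s in enumerate(sfs, start=1))
    return Fraction(total, comb(n, 2))


def fay_wu_h(sfs) -> Fraction:
    """Fay and Wu's estimator: sum_i i^2 S_i / C(n, 2)."""
    n = len(sfs) + 1
    total = sum(i * i * s for i, s in enumerate(sfs, start=1))
    return Fraction(total, comb(n, 2))


if __name__ == "__main__":
    n, theta = 10, Fraction(5)
    # Plug in the expected spectrum E[S_i] = theta / i.
    expected_sfs = [theta / i for i in range(1, n)]
    print(watterson(expected_sfs), pairwise(expected_sfs), fay_wu_h(expected_sfs))
```

Feeding in the expected spectrum returns θ for all three estimators, which is the sense in which each is unbiased; on real data they weight rare and common variants differently and so can disagree.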
Age distributions of mutations
• The probability that a mutation where the derived allele is at frequency j in the sample occurred during epoch m is proportional to
$$\Pr\{m; j, n\} \propto \frac{(n-m)!}{(n-m-j+1)!}$$
• For j = 1
$$\Pr\{m; 1, n\} \propto 1$$
• In other words, such mutations are equally likely to have occurred during any epoch!
• Their age distribution is therefore described by the age distribution of epochs
More generally…
• The age distribution of mutations can be derived by considering the relationship between the epoch distribution of mutations and the age distribution of epochs
$$E[\text{Age}; j, n] = \sum_{m=2}^{n-j+1} \Pr\{m; j, n\}\, E[\text{Age} \mid m]$$
$$E[\text{Age} \mid m] = 2\left(\frac{1}{m} - \frac{1}{n}\right) + \frac{1}{m(m-1)}$$
with age measured in coalescent time units
• Note this approach can generalise to more complex scenarios
– The combinatorics apply to any binary tree
Examples
• n = 100
[Figure: two panels, j = 1 and j = 10, each plotting Prob(m) and E[Age|m] on a horizontal axis running from 0 to 100]
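The quantities in these examples can be recomputed directly. This is a sketch under two assumptions: that Pr{m; j, n} is proportional to (n-m)!/(n-m-j+1)! for m between 2 and n-j+1, and that the expected age given the epoch is 2(1/m - 1/n) + 1/(m(m-1)) in coalescent time units, which is my reading of the slide formula. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def epoch_probabilities(j: int, n: int) -> dict[int, Fraction]:
    """Normalised Pr{m; j, n} for m = 2 .. n - j + 1."""
    weights = {
        m: Fraction(factorial(n - m), factorial(n - m - j + 1))
        for m in range(2, n - j + 2)
    }
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def expected_age_given_epoch(m: int, n: int) -> Fraction:
    """Assumed E[Age | m]: time back to epoch m plus half its expected length."""
    return 2 * (Fraction(1, m) - Fraction(1, n)) + Fraction(1, m * (m - 1))


def expected_age(j: int, n: int) -> Fraction:
    """E[Age; j, n] = sum_m Pr{m; j, n} * E[Age | m]."""
    return sum(p * expected_age_given_epoch(m, n)
               for m, p in epoch_probabilities(j, n).items())


if __name__ == "__main__":
    n = 100
    for j in (1, 10):
        print(j, float(expected_age(j, n)))
```

For j = 1 every epoch gets the same probability, whereas for j = 10 the weight shifts towards epochs with few lineages, so the expected age is larger.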
Classical mathematical population genetics
• Although different in key ways, there are strong links between coalescent theory and older approaches in mathematical population genetics
• Many of the results from coalescent theory have analogues from diffusion theory approaches
– For example, the frequency of the derived mutation has the transient stationary distribution (Wright, Kimura)
$$\phi(x) \propto 1/x$$
• However, the coalescent does generally provide a more intuitive approach to describing probability distributions for quantities of interest
– Though it does have limitations, e.g. selection