Genome Evolution © Amos Tanay, The Weizmann Institute Genome evolution Lecture 2: population genetics I: drift and mutation



Studying Populations

Models:

• A set of individuals, genomes
• Ancestry relations or hierarchies

Experiments:

• Field studies, diversity/genotyping
• Experimental evolution

Åland Islands, Glanville fritillary population

mtDNA human migration patterns


Population genetics

Drift: The process by which allele frequencies are changing through generations

Mutation: The process by which new alleles are being introduced

Recombination: the process by which multi-allelic genomes are mixed

Selection: the effect of fitness on the dynamics of allele drift

Epistasis: the drift effects of fitness dependencies among different alleles

“Organismal” effects: Ecology, Geography, Behavior


The Hardy-Weinberg Model

• Diploid organisms: two copies of each allele/gene/base; homozygous / heterozygous

• Sexual reproduction: mating haplotypes

• Large population, no migration: fixed size, closed system

• Non-overlapping generations: synchronous process; not as bad as it may look

• Random mating: the new generation is selected from the existing haplotypes with replacement

• No mutations, no selection (will add these later)


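The Hardy-Weinberg Model

Hardy-Weinberg equilibrium:

$$P(A) = p, \qquad P(a) = q$$

$$P(AA) = p^2, \qquad P(Aa) = 2pq, \qquad P(aa) = q^2$$

[Figure: genotypes AA, Aa, aa produce gametes A and a, which pair at random to form the AA, Aa, aa genotypes of the next generation]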

Random mating

Non overlapping generations

Under the model assumptions, equilibrium is reached within one generation
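The one-generation claim can be checked numerically. The sketch below is only an illustration: the function name `next_generation` and the starting population are hypothetical, and it assumes random mating with no mutation or selection, as in the model.

```python
def next_generation(genotype_freqs):
    """Genotype frequencies (AA, Aa, aa) after one round of random mating."""
    f_AA, f_Aa, f_aa = genotype_freqs
    p = f_AA + 0.5 * f_Aa        # frequency of allele A among the gametes
    q = 1.0 - p                  # frequency of allele a
    return (p * p, 2 * p * q, q * q)

# Hypothetical starting population, far from equilibrium (no heterozygotes).
start = (0.5, 0.0, 0.5)
gen1 = next_generation(start)    # Hardy-Weinberg proportions
gen2 = next_generation(gen1)     # same proportions again
print(gen1, gen2)
```

The allele frequency p does not change under random mating, so the second application returns the same genotype proportions as the first.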



Frequency estimates

We will be dealing with estimation of allele frequencies.

To remind you, when sampling n times from a population with an allele of frequency p, the number of successes V is distributed as a binomial variable:

$$B(i; n, p) = \binom{n}{i} p^i (1-p)^{n-i}$$

This can be further approximated using a normal distribution:

$$V \sim B(n; p) \approx N(np,\ np(1-p))$$

When estimating the frequency from the number of successes, $\hat{p} = V/n$, we therefore have an error that looks like:

$$s = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
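A minimal sketch of the estimate and its standard error follows. The function name and the sample numbers (30 successes in 100 draws) are hypothetical and are not data from the lecture.

```python
import math

def estimate_frequency(successes, n):
    """Return the estimated allele frequency and its standard error."""
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err

# Hypothetical sample: the allele was seen 30 times in 100 draws.
p_hat, se = estimate_frequency(30, 100)
print(p_hat, se)
```

The error shrinks as the square root of the sample size, so halving it requires four times as many samples.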


Testing Hardy-Weinberg using chi-square statistics

HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele

A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so the genotypes were quantifiable before the genomic era:

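Genotype   Observed   HW expected
MM         298        294.3
MN         489        496
NN         213        209.3

$$P(AA) = p^2, \qquad P(Aa) = 2pq, \qquad P(aa) = q^2 \qquad (A = M,\ a = N)$$

$$\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} = 0.22$$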

Chi-square significance can be computed from the chi-square distribution with df degrees of freedom.

Here: df = #classes - #parameters – 1 = 3(MN/NN/MM) – 1 (p) – 1 = 1
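The test can be reproduced from the observed M/N counts. This is a sketch, not a general-purpose routine: the function name is hypothetical, and the p-value uses the identity that for one degree of freedom P(chi-square > x) = erfc(sqrt(x/2)).

```python
import math

def hw_chi_square(n_MM, n_MN, n_NN):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus."""
    n = n_MM + n_MN + n_NN
    p = (2 * n_MM + n_MN) / (2 * n)          # estimated frequency of allele M
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_MM, n_MN, n_NN]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for df = 1 only
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hw_chi_square(298, 489, 213)
print(round(p, 4), [round(e, 1) for e in expected], round(chi2, 2), round(p_value, 2))
```

A large p-value here means the M/N data give no reason to reject Hardy-Weinberg proportions.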


Wright-Fisher model for genetic drift

[Figure: N individuals → ∞ gametes → N individuals → ∞ gametes]

We follow the frequency of an allele in the population, until fixation (f=2N) or loss (f=0)

We can model the frequency as a Markov process on a variable X (the number of A alleles) with transition probabilities:

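$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$$

This is the probability of sampling j A alleles for the next generation from a population of 2N alleles, i of which are A.

In a larger population the frequency changes more slowly: the variance of the sampled frequency is pq/2N, so a single round of sampling does not change it much.

[Figure: Markov chain on the states 0 (loss), 1, ..., 2N-1, 2N (fixation)]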
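The process is easy to simulate by drawing 2N alleles with replacement in each generation. The sketch below assumes a hypothetical population of 2N = 20 alleles starting with 10 copies of A; the function name and the seed are illustrative only.

```python
import random

def wright_fisher(i, two_n, rng):
    """Follow the number of A alleles until loss (0) or fixation (2N)."""
    trajectory = [i]
    while 0 < i < two_n:
        p = i / two_n
        # Binomial sampling: each of the 2N new alleles is A with probability p.
        i = sum(1 for _ in range(two_n) if rng.random() < p)
        trajectory.append(i)
    return trajectory

rng = random.Random(0)            # fixed seed so the run is repeatable
traj = wright_fisher(10, 20, rng)
print(len(traj) - 1, "generations, final state", traj[-1])
```

Repeating the run with larger `two_n` shows the slower frequency changes expected from the pq/2N variance.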


Drift and fixation probability

Theorem (fixation in drift): In the Wright-Fisher model, the probability of fixation of the A allele, given a population of 2N alleles of which i are A, is:

$$P_i(X_\tau = 2N) = \frac{i}{2N}$$

where $\tau$ is the absorption time defined in the proof.

Proof: The mean of the binomial sample in the n'th step is np:

$$E(X_{n+1} \mid X_n = i) = 2N \cdot \frac{i}{2N} = i = X_n$$

which means that the expected number of A's is constant in time.

Since 0 and 2N are absorbing states, given sufficient time the Wright-Fisher process will converge to either 0 or 2N. Define:

$$\tau = \min\{n : X_n = 0 \text{ or } X_n = 2N\}$$

Intuitively:

$$i = E_i(X_\tau) = 2N \cdot P_i(X_\tau = 2N)$$

More formally:

$$i = E_i(X_n) = E_i(X_\tau;\ \tau \le n) + E_i(X_n;\ \tau > n) = E_i(X_\tau) + o(1) \quad (n \to \infty)$$
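The theorem can be verified numerically for a small population by iterating the binomial transition matrix. This is a sketch with hypothetical function names, using 2N = 10 and 500 generations, which is far more than needed for convergence at this size.

```python
from math import comb

def transition_matrix(two_n):
    """T[i][j] = probability of moving from i to j copies of A in one generation."""
    T = []
    for i in range(two_n + 1):
        p = i / two_n
        T.append([comb(two_n, j) * p**j * (1 - p)**(two_n - j)
                  for j in range(two_n + 1)])
    return T

def fixation_probabilities(two_n, generations=500):
    """h[i] approximates P(X reaches 2N | X_0 = i) after many generations."""
    T = transition_matrix(two_n)
    h = [0.0] * two_n + [1.0]     # indicator of the fixation state 2N
    for _ in range(generations):
        h = [sum(T[i][j] * h[j] for j in range(two_n + 1))
             for i in range(two_n + 1)]
    return h

h = fixation_probabilities(10)
print([round(x, 3) for x in h])
```

The printed values should match i/2N for every starting count i, as the theorem states.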


Figure 7.4

Drift

Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each consisted originally of 16 brown eyes (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations.


The coalescent

When looking at k individuals, we can trace their coalescent backwards and ask when they had k-1, k-2, ..., or one common ancestor.

When sampling k new individuals, the chance of picking the same parent twice is roughly:

$$\frac{k(k-1)}{2} \cdot \frac{1}{2N} + O\left(\frac{1}{N^2}\right)$$

[Figure: coalescent tree of 5 sampled lineages (1 to 5), from Present to Past, with E(T_5) = 2N/10, E(T_4) = 2N/6, E(T_3) = 2N/3, E(T_2) = 2N]

Theorem: The amount of time during which there are k lineages, t_k, has approximately an exponential distribution with mean 2N * (2/(k(k-1)))

Proof: the probability of not merging k lineages in n generations is:

$$\left(1 - \frac{k(k-1)}{2} \cdot \frac{1}{2N}\right)^{n} \approx \exp\left(-\frac{k(k-1)}{2} \cdot \frac{n}{2N}\right)$$

which is like an exponential distribution $e^{-\lambda t}$ with rate $\lambda = k(k-1)/(4N)$. The expected value is:

$$E(t_k) = \frac{1}{\lambda} = \frac{4N}{k(k-1)}$$

This is correct for any k, so going backward from the present time, we can estimate the time to coalescence at each step
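The quality of the exponential approximation in the proof can be checked directly. The sketch below uses hypothetical values (N = 10,000 diploid individuals, k = 5 lineages, n = 2000 generations); the function names are illustrative only.

```python
import math

def p_no_coalescence(k, N, n):
    """Exact and approximate probability that k lineages stay distinct for n generations."""
    per_generation = k * (k - 1) / 2 / (2 * N)   # chance that some pair merges
    exact = (1 - per_generation) ** n
    approx = math.exp(-per_generation * n)
    return exact, approx

def expected_wait(k, N):
    """Mean number of generations spent with exactly k lineages."""
    return 4 * N / (k * (k - 1))

exact, approx = p_no_coalescence(5, 10_000, 2_000)
print(exact, approx, expected_wait(5, 10_000), expected_wait(2, 10_000))
```

With these values the two probabilities agree closely, and the waiting times reproduce 2N/10 for five lineages and 2N for two.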


The coalescent

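The expected time to the common ancestor of n individuals:

[Figure: coalescent tree of 5 sampled lineages (1 to 5), from Present to Past, with E(T_5) = 2N/10, E(T_4) = 2N/6, E(T_3) = 2N/3, E(T_2) = 2N]

$$E(T) = \sum_{k=2}^{n} \frac{4N}{k(k-1)} = 4N \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 4N\left(1 - \frac{1}{n}\right)$$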

Theorem: The probability that the most recent common ancestor of a sample of size n is the same as that of the population converges to (n-1)/(n+1) as the population size increases.


4N is the magic number
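Summing the expected waiting times 4N/(k(k-1)) over k = 2..n shows why 4N is the limit. A small sketch, with a hypothetical function name and a hypothetical N = 1000:

```python
def expected_tmrca(n, N):
    """Expected generations back to the common ancestor of a sample of n lineages."""
    return sum(4 * N / (k * (k - 1)) for k in range(2, n + 1))

N = 1000
for n in (2, 5, 50, 500):
    # The sum telescopes to 4N(1 - 1/n), which approaches 4N as n grows.
    print(n, round(expected_tmrca(n, N), 1), round(4 * N * (1 - 1 / n), 1))
```

Half of the total expected time (2N generations) is spent waiting for the last two lineages to merge.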


Diffusion approximation and Kimura’s solution

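Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation):

$\phi(x,t)$: the density of populations with allele frequency in x..x+dx at time t

$J(x,t)$: the flux of probability at time t and frequency x

The change in the density equals the difference between the fluxes J(x,t) and J(x+dx,t); taking dx to the limit we have:

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial J(x,t)}{\partial x}$$

If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals:

$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right]$$

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$

(Heat diffusion / Fokker-Planck / Kolmogorov forward equation)

For pure drift:

$$M = 0, \qquad V(x) = \frac{x(1-x)}{2N}$$

$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$$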



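Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation). We start by working with a time step dt and a frequency step dx

$\phi(x,t)$: the probability that the population has allele frequency x at time t

$M(x)$: the probability that the frequency increased from x by dx, due to mutation/selection

$V(x)/2$: the probability of a dx increase, or of a dx decrease, due to drift

We limit changes to steps from t to t+dt and from x to x±dx. The population can be at x at time t+dt if:

It was at x and stayed there: $\phi(x,t)\,(1 - M(x) - V(x))$

It was at x-dx and moved to x: $\phi(x-dx,t)\,(M(x-dx) + V(x-dx)/2)$

It was at x+dx and moved to x: $\phi(x+dx,t)\,V(x+dx)/2$

Summing the three cases and subtracting $\phi(x,t)$:

$$\phi(x,t+dt) - \phi(x,t) = -\left[M(x)\phi(x,t) - M(x-dx)\phi(x-dx,t)\right]$$
$$\qquad + \frac{1}{2}\left[V(x-dx)\phi(x-dx,t) - V(x)\phi(x,t)\right]$$
$$\qquad + \frac{1}{2}\left[V(x+dx)\phi(x+dx,t) - V(x)\phi(x,t)\right]$$

Taking dt and dx to the limit:

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$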



For drift the variance is binomial, and we assume no selection:

$$V(x) = \frac{x(1-x)}{2N}, \qquad M(x) = 0$$

so the forward equation

$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$

reduces to

$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$$

Still not easy to solve analytically…


Changes in allele-frequencies, Fisher-Wright model

After about 4N generations, just 10% of the cases are not fixed and the distribution becomes flat.


Absorption time and Time to fixation

According to Kimura's solution, the mean time to fixation of an allele with initial frequency p, given that it is not lost, is:

$$\bar{t}_1(p) = -\frac{4N(1-p)}{p}\log(1-p)$$

The mean time to allele loss is (the fixation time of the complementary event):

$$\bar{t}_0(p) = -\frac{4Np}{1-p}\log(p)$$
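Plugging numbers into the two formulas makes the scales concrete. The sketch below assumes the standard forms of Kimura's conditional mean times, t1(p) = -4N(1-p)/p * log(1-p) and t0(p) = -4Np/(1-p) * log(p), with hypothetical function names and N = 1000.

```python
import math

def mean_fixation_time(p, N):
    """Mean generations to fixation, given that the allele eventually fixes."""
    return -4 * N * (1 - p) / p * math.log(1 - p)

def mean_loss_time(p, N):
    """Mean generations to loss, given that the allele is eventually lost."""
    return -4 * N * p / (1 - p) * math.log(p)

N = 1000
p_new = 1 / (2 * N)               # a single new mutation among 2N alleles
print(mean_fixation_time(p_new, N), mean_loss_time(p_new, N))
```

For a new mutation the fixation time is close to 4N generations, while loss, when it happens, is far quicker.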


Effective population size

4N generations looks like a huge number (in a population of billions!)

But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating: any two individuals can mate

The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones

For each measurable statistic of population dynamics, a different effective population size can be computed

For example, the expected variance in allele frequency is expressed as:

$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N}$$

But we can use the same formula to define the effective population size given the variance:

$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N_e}$$


Effective population size: changing populations

If the population size changes over time, the dynamics are governed by the harmonic mean of the sizes:

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \dots + \frac{1}{N_{t-1}}\right)$$

So the effective population size is dominated by the size of the smallest bottleneck

Bottlenecks can occur during migration, environmental stress, isolation

Such effects greatly decrease heterozygosity (founder effect, for example Tay-Sachs in "Ashkenazim")

Bottlenecks can accelerate fixation of neutral or even deleterious mutations, as we shall see later.

Human effective population size over the last 2 My is estimated at around 10,000 (due to bottlenecks). (So when was our T1?)
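The effect of a single bottleneck on the harmonic mean is easy to see numerically. The population sizes below are hypothetical, chosen only to illustrate the formula, and the function name is illustrative.

```python
def effective_size(sizes):
    """Harmonic mean of per-generation population sizes N_0 .. N_{t-1}."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: three generations of 1000 and one bottleneck of 10.
history = [1000, 1000, 10, 1000]
ne = effective_size(history)
print(round(ne, 1), sum(history) / len(history))
```

The harmonic mean stays close to the bottleneck size, while the arithmetic mean is barely affected by it.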

Effective population size: unequal sex ratio, and sex chromosomes

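If there are more females than males, or fewer males participate in reproduction, then the effective population size will be smaller than the actual size $N_a = N_m + N_f$:

$$N_e = \frac{4 N_m N_f}{N_m + N_f}$$

(the product $N_m N_f$ counts any combination of alleles from a male and a female)

So if there are 10 times more females than males in the population ($N_m = x$, $N_f = 10x$), the effective population size is 4*x*10x/(11x) ≈ 3.6x, much less than the size of the population (11x).

Another example is the X chromosome, which is present in only one copy in males:

$$N_e = \frac{9 N_m N_f}{4 N_m + 2 N_f}$$

This follows from the variance of the X-linked allele frequency:

$$p = \frac{1}{3}p_m + \frac{2}{3}p_f, \qquad Var(p) = \frac{1}{9} \cdot \frac{p_m q_m}{N_m} + \frac{4}{9} \cdot \frac{p_f q_f}{2N_f}$$

$$p_m = p_f = p: \qquad Var(p) = pq\left(\frac{1}{9N_m} + \frac{4}{18N_f}\right) = pq\,\frac{2N_f + 4N_m}{18 N_m N_f} = \frac{pq}{2} \cdot \frac{4N_m + 2N_f}{9 N_m N_f}$$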
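Both formulas, N_e = 4 N_m N_f / (N_m + N_f) for autosomes and N_e = 9 N_m N_f / (4 N_m + 2 N_f) for the X chromosome, can be evaluated directly. The function names and the breeding population sizes below are hypothetical.

```python
def ne_autosomal(n_m, n_f):
    """Effective size for autosomes with n_m breeding males and n_f breeding females."""
    return 4 * n_m * n_f / (n_m + n_f)

def ne_x_chromosome(n_m, n_f):
    """Effective size for X-linked loci (one copy in males, two in females)."""
    return 9 * n_m * n_f / (4 * n_m + 2 * n_f)

# Hypothetical populations: skewed (100 males, 1000 females) and balanced (500 each).
print(ne_autosomal(100, 1000), ne_autosomal(500, 500), ne_x_chromosome(500, 500))
```

With a balanced sex ratio the autosomal value equals the census size, and the X-linked value is three quarters of it.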


Recombination and linkage

Assume two loci have alleles A1,A2 and B1,B2, with frequencies p1,p2 and q1,q2

Linkage equilibrium:

$$P(A_1B_1) = p_1q_1, \quad P(A_1B_2) = p_1q_2, \quad P(A_2B_1) = p_2q_1, \quad P(A_2B_2) = p_2q_2$$

Only double heterozygotes allow recombination to change haplotype frequencies:

A1B1/A2B2

A1B2/A2B1

[Figure: chromosome pairs A1 B1 / A2 B2 and A1 B2 / A2 B1]

The recombination fraction r: the proportion of recombinant gametes generated from a double heterozygote

For loci on different chromosomes: r = 0.5
For loci on the same chromosome: r is a function of the distance and possibly other factors


Linkage disequilibrium (LD)

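Define the linkage disequilibrium parameter D as:

$$P_{11} = P(A_1B_1), \quad P_{12} = P(A_1B_2), \quad P_{21} = P(A_2B_1), \quad P_{22} = P(A_2B_2)$$

$$D = P_{11} - p_1q_1$$

Equivalently:

$$D = P_{11}P_{22} - P_{12}P_{21}$$

Next generation:

$$P'_{11} = (1-r)P_{11} + r\,p_1q_1$$

(first term: no recombination; second term: recombination on any A1- / -B1)

$$P'_{11} - p_1q_1 = (1-r)(P_{11} - p_1q_1)$$

$$D_n = (1-r)D_{n-1} = (1-r)^n D_0$$

[Figure: D as a function of generation for r = 0.05, r = 0.2 and r = 0.5. A double heterozygote A1 B1 / A2 B2 yields parental gametes A1 B1 and A2 B2 with probability 1-r, and recombinant gametes A1 B2 and A2 B1 with probability r]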
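The geometric decay of D can be checked by iterating the haplotype frequency update P11' = (1-r) P11 + r p1 q1 and comparing with the closed form D_n = (1-r)^n D_0. A minimal sketch, with hypothetical function names and starting values:

```python
def next_haplotype_freq(P11, p1, q1, r):
    """Frequency of the A1B1 haplotype after one generation of random mating."""
    return (1 - r) * P11 + r * p1 * q1

def ld_after(D0, r, n):
    """Closed form: linkage disequilibrium after n generations."""
    return D0 * (1 - r) ** n

# Hypothetical start: complete association, P11 = 0.5 with p1 = q1 = 0.5 (D0 = 0.25).
P11, p1, q1, r = 0.5, 0.5, 0.5, 0.2
for _ in range(10):
    P11 = next_haplotype_freq(P11, p1, q1, r)
print(P11 - p1 * q1, ld_after(0.25, r, 10))
```

Even for unlinked loci (r = 0.5) D only halves each generation, so equilibrium is approached gradually rather than in one step.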


Linkage disequilibrium (LD) - example

Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.

For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920

Haplotype   Observed   Expected if unlinked
MS          474        334.2
Ms          611        750.8
NS          142        281.8
Ns          773        633.2

$$\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}} = 184.7$$

Linkage equilibrium highly unlikely!

$$D = P_{11}P_{22} - P_{12}P_{21} = 0.07$$
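The example can be recomputed from haplotype counts. The sketch below uses the counts 474, 611, 142 and 773, which sum to 2000 chromosomes and reproduce the stated frequencies p1 = 0.5425 and q1 = 0.3080; the function name is hypothetical.

```python
def ld_from_counts(n_MS, n_Ms, n_NS, n_Ns):
    """Chi-square against linkage equilibrium, and D, from four haplotype counts."""
    n = n_MS + n_Ms + n_NS + n_Ns
    p1 = (n_MS + n_Ms) / n                    # frequency of M
    q1 = (n_MS + n_NS) / n                    # frequency of S
    observed = [n_MS, n_Ms, n_NS, n_Ns]
    expected = [n * p1 * q1, n * p1 * (1 - q1),
                n * (1 - p1) * q1, n * (1 - p1) * (1 - q1)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    P11, P12, P21, P22 = (c / n for c in observed)
    D = P11 * P22 - P12 * P21
    return p1, q1, expected, chi2, D

p1, q1, expected, chi2, D = ld_from_counts(474, 611, 142, 773)
print(p1, q1, [round(e, 1) for e in expected], round(chi2, 1), round(D, 3))
```

Small differences from the value on the slide can arise from rounding the expected counts before summing.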


Sources of Linkage disequilibrium

LD in original population that was not stabilized due to low r

Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)

Admixture of populations with different allele frequencies:

       Population 1   Population 2   Mixed (1:1)
P11    0.0025         0.9025         0.4525
P12    0.0475         0.0475         0.0475
P21    0.0475         0.0475         0.0475
P22    0.9025         0.0025         0.4525
D      0              0              0.2025

The HapMap project

1 million SNPs (single nucleotide polymorphisms)

4 populations:
• 30 trios (parents/child) from Nigeria (Yoruba - YRI)
• 30 trios (parents/child) from Utah (CEU)
• 45 Han Chinese (Beijing)
• 44 Japanese (Tokyo)

Haplotyping: each SNP/individual. Not just determining heterozygosity/homozygosity; haplotyping completely resolves the genotypes (phasing)

Because of linkage, the partial SNP map largely determines all other SNPs!

The idea is that a group of "tag SNPs" can be used to represent all genetic variation in the human population.

This is extremely important in association studies that look for the genetic cause of disease.


Correlation on SNPs between populations


Recombination rates in the human population: LD blocks


Recombination rates in the human population

Recombination rates are highly non uniform – with major effects on genome structure!


Mutations

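Simplest model: assume two alleles, and mutation probabilities:

$$\mu = \Pr(A \to a), \qquad \nu = \Pr(a \to A)$$

If the process runs long enough, we will converge to a stationary distribution:

$$\Pr(A) = \frac{\nu}{\mu + \nu}$$

As we saw earlier, since the population is finite and undergoes random genetic drift, any mutation will ultimately be lost or fixed. Elimination has a significant chance of happening immediately: a new mutation is present in 1 of the 2N alleles, and the probability that it is missed when sampling the next generation is

$$\left(1 - \frac{1}{2N}\right)^{2N} \approx \frac{1}{e}$$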
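Both statements can be checked numerically. The sketch below assumes the notation mu = Pr(A to a) and nu = Pr(a to A) and an infinite population for the stationary frequency; the function names and rate values are hypothetical.

```python
import math

def stationary_frequency(mu, nu, p0=1.0, generations=20_000):
    """Iterate p' = p(1 - mu) + (1 - p) nu, the frequency of A under two-way mutation."""
    p = p0
    for _ in range(generations):
        p = p * (1 - mu) + (1 - p) * nu
    return p

def immediate_loss_probability(N):
    """Chance that a single new copy among 2N alleles is not sampled at all."""
    return (1 - 1 / (2 * N)) ** (2 * N)

p_inf = stationary_frequency(mu=1e-3, nu=2e-3)
loss = immediate_loss_probability(10_000)
print(p_inf, 2e-3 / (1e-3 + 2e-3), loss, 1 / math.e)
```

The iteration settles at nu/(mu + nu) regardless of the starting frequency, and more than a third of new mutations vanish in their first generation.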


Infinite alleles model

Adding mutations with probability μ, the coalescent process is extended by killing lineages (time is sped up by a factor of 2N):

Coalescent: $\frac{k(k-1)}{2} \cdot \frac{1}{2N}$

Mutation: $2Nk\mu = \frac{k\theta}{2}, \quad (\theta = 4N\mu)$

(Going back in time)

Probability model (Hoppe's Urn):

Selecting from an urn with one black ball of mass θ and more balls with other colors and mass 1. Each time the black ball is selected, a new ball with a new color is added to the urn. If another color is selected, the selected ball and another ball of the same color are returned to the urn.

Theorem: Hoppe's Urn and the Coalescent with killing are equivalent

(The Chinese restaurant process)
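The urn scheme can be simulated in a few lines. This is a sketch under the assumption that the black ball has mass theta and every colored ball has mass 1; the function name, theta = 2.0 and the seed are hypothetical.

```python
import random

def hoppe_urn(n, theta, rng):
    """Draw until the urn holds n colored balls; return the count of each color."""
    colors = []                       # colors[c] = number of balls of color c
    for i in range(n):                # i colored balls are currently in the urn
        if rng.random() < theta / (theta + i):
            colors.append(1)          # black ball drawn: add a brand new color
        else:
            r = rng.randrange(i)      # pick one of the i colored balls uniformly
            for c, count in enumerate(colors):
                if r < count:
                    colors[c] += 1    # return it with another ball of its color
                    break
                r -= count
    return colors

colors = hoppe_urn(50, 2.0, random.Random(1))
print(len(colors), "alleles with counts", sorted(colors, reverse=True))
```

Each color plays the role of an allele, so the color counts are one random allelic partition of a sample of n genes.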


Testing the infinite alleles model

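A simple statistic is the number of distinct alleles $K_n$ in a sample of size n. It should have the expected value:

$$E(K_n) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \dots + \frac{\theta}{\theta+n-1}$$

Proof: At step i of Hoppe's process, we draw the black ball (and so add a new allele) with probability:

$$\frac{\theta}{\theta + i - 1}$$

Theorem (Ewens sampling formula): Let $a_i$ be the number of alleles present i times in a sample of size n. When the scaled mutation rate is θ = 4Nμ,

$$P(a_1, \dots, a_n) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \frac{(\theta/j)^{a_j}}{a_j!}$$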
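The expected number of distinct alleles is the sum of the black-ball probabilities theta/(theta + i - 1) over the n steps. A minimal sketch with a hypothetical function name and hypothetical theta values:

```python
def expected_num_alleles(theta, n):
    """Sum of theta / (theta + i - 1) for steps i = 1..n of Hoppe's urn."""
    return sum(theta / (theta + i) for i in range(n))   # i here runs over 0..n-1

for theta in (0.1, 1.0, 10.0):
    print(theta, round(expected_num_alleles(theta, 100), 2))
```

The sum grows only logarithmically with the sample size, so large samples add few new alleles unless theta is large.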


Figure 7.16,7.17

Testing the infinite alleles model

Not quite neutral Highly non neutral

F computed from the number of Xdh alleles in 89 D. pseudoobscura lines: 52 had a common allele, 8 were singletons.

Compared to a simulation assuming the infinite allele model.

VNTR locus in humans: observed (open columns) and Ewens predicted allele counts.
