The Coalescent & Human Sequence Variation (11.6.02) I. The Human Population & its Genome. The Existing Data: SNPs & Haplotypes. Reconstructing Haplotypes. II. The Coalescent with Mutations & Ancestral Analysis. III. ”The Story” of Human Evolution. IV. ”The Story” of Coalescent.


The Coalescent & Human Sequence Variation (11.6.02)

I. The Human Population & its Genome.

The Existing Data: SNPs & Haplotypes.

Reconstructing Haplotypes.

II. The Coalescent with Mutations & Ancestral Analysis.

III. "The Story" of Human Evolution.

IV. "The Story" of the Coalescent.


The Human Genome http://www.sanger.ac.uk/HGP/

[Figure: the human chromosomes 1-22, X and Y]

3 billion base pairs per haploid genome

30,000-40,000 genes


Recent SNP/Haplotype Analysis

Inter. SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409.928-33

For 2 complete haploid genomes, about 3 million SNP differences would be expected. The expected number of SNPs for more genomes should then grow like the expected number of segregating sites in an ideal population, i.e. approximately logarithmically in the number of genomes.
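The logarithmic growth can be made concrete with a small sketch. It assumes the standard neutral infinite-sites result E(S) = theta * (1 + 1/2 + ... + 1/(n-1)) for a sample of n sequences, which the slide does not spell out; the function name and the calibration theta = 3 million (so that n = 2 gives 3 million differences) are illustrative choices.

```python
def expected_segregating_sites(n, theta):
    """Expected number of segregating sites in a sample of n sequences
    under the neutral infinite-sites model (assumed here):
    E(S) = theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1.0 / i for i in range(1, n))


if __name__ == "__main__":
    theta = 3e6  # calibrated so that two genomes differ at about 3 million sites
    for n in (2, 10, 100, 1000):
        print(n, round(expected_segregating_sites(n, theta)))
```

Because the harmonic sum grows like log(n), each tenfold increase in the number of sampled genomes adds roughly the same number of new SNPs.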


Linkage disequilibrium in the human genome

Reich, DE et al. (2001) Linkage disequilibrium in the human genome. Nature 411.199-204

$LD := f_{i,j} - f_i f_j$

$E(LD) = 1/(1 + 4 N_e r)$

LD in Europeans stretches over 60 kb, while in Yorubans it stretches much less.
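The two formulas above can be sketched in a few lines. The sketch assumes haplotypes are given as pairs of 0/1 alleles at two sites, that f_i and f_j are the frequencies of the allele coded 1, and that r is the recombination rate between the sites; the function names are illustrative only.

```python
def linkage_disequilibrium(haplotypes):
    """LD = f_ij - f_i * f_j for two biallelic sites.
    haplotypes: list of (allele_at_site_i, allele_at_site_j), alleles coded 0/1."""
    n = len(haplotypes)
    f_i = sum(a for a, _ in haplotypes) / n
    f_j = sum(b for _, b in haplotypes) / n
    f_ij = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return f_ij - f_i * f_j


def expected_ld(n_e, r):
    """Expected LD as given on the slide: 1 / (1 + 4 * N_e * r)."""
    return 1.0 / (1.0 + 4.0 * n_e * r)


if __name__ == "__main__":
    linked = [(1, 1)] * 5 + [(0, 0)] * 5             # complete association
    shuffled = [(1, 1), (1, 0), (0, 1), (0, 0)]      # no association
    print(linkage_disequilibrium(linked), linkage_disequilibrium(shuffled))
    print(expected_ld(10_000, 1e-4))
```

With N_e = 10,000 the expected value already drops to about 0.2 at r = 10^-4, which illustrates why LD is confined to short stretches.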


Daly, MJ et al. (2001) High-resolution haplotype structure in the human genome. Nat.Gen. 29.229-32.

Conclusions (258 chromosomes):

Haplotypes are more informative than single SNPs.

The genome is split up into blocks without recombination.


SNPs & haplotypes: Getting Haplotypes

Egg & Sperm Sequencing

Cell Lines with Lost Chromosomes

Sequencing Clones Spanning SNPs

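These methods are very expensive, so reconstructing haplotypes from SNPs is favoured.

Haplotypes: A G C and T C A

SNPs: {A,T} {C,G} {A,C}

An individual that is heterozygous at m sites has $2^{m-1}$ possible haplotype pairs.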
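The count of haplotype pairs can be checked by direct enumeration. This is only a sketch: the function name and the representation of a genotype as a list of allele pairs are assumptions made for illustration.

```python
from itertools import product


def haplotype_pairs(genotype):
    """List all haplotype pairs compatible with an unphased genotype.
    genotype: list of (allele1, allele2) tuples, one per site."""
    het = [j for j, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    # The phase of the first heterozygous site is fixed, so that each
    # unordered pair of haplotypes is listed exactly once.
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for j, (a, b) in enumerate(genotype):
            if flip_at.get(j, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


if __name__ == "__main__":
    for pair in haplotype_pairs([("A", "T"), ("C", "G"), ("A", "C")]):
        print(pair)
```

For the three heterozygous sites of the example this lists 4 = 2^(3-1) candidate pairs, which is why phase cannot be read off a single genotype.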


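SNPs --> haplotypes: Computational Problem

Input: the genotypes of n individuals at m SNP sites. Entry (i,j), for i = 1..n and j = 1..m, is an unordered pair $\{N_1, N_2\}_{i,j}$.

Output: two haplotypes for each individual, i.e. a choice of $N_1$ or $N_2$ at every position (i,j).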


SNPs ---> haplotypes: Clark (1990)

Algorithm:

Find homozygotes or single heterozygotes & deduce existing haplotypes.

Run through remaining SNPs and assign to expanding set of determined haplotypes.

Check if unresolved haplotypes can be explained as recombination of resolved haplotypes.
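The first two steps can be sketched as follows. This is a simplified illustration, not Clark's published procedure in full: it assumes genotypes are coded per site as 0 or 1 (homozygous) and 2 (heterozygous), resolves individuals greedily in input order, and omits the recombination check of the last step. All names are hypothetical.

```python
def is_compatible(hap, geno):
    """A known haplotype fits a genotype if it agrees at every homozygous site."""
    return all(g == 2 or g == h for h, g in zip(hap, geno))


def complement(hap, geno):
    """The second haplotype implied by a genotype and one of its haplotypes."""
    return tuple(1 - h if g == 2 else g for h, g in zip(hap, geno))


def clark_phase(genotypes):
    known, resolved = [], {}
    # Step 1: homozygotes and single heterozygotes are unambiguous.
    for idx, g in enumerate(genotypes):
        if sum(x == 2 for x in g) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = complement(h1, g)
            resolved[idx] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: explain remaining genotypes with known haplotypes,
    # adding each newly implied haplotype to the known set.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in resolved:
                continue
            for h in known:
                if is_compatible(h, g):
                    h2 = complement(h, g)
                    resolved[idx] = (h, h2)
                    if h2 not in known:
                        known.append(h2)
                    progress = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved


if __name__ == "__main__":
    print(clark_phase([(0, 0, 1), (2, 2, 1), (2, 0, 2)]))
```

The result depends on the order in which genotypes are processed, and the sketch returns an empty resolution when no unambiguous individual exists.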


SNPs ---> haplotypes: Clark (1990)

Three Problems:

No homozygotes or single heterozygotes are available, so the algorithm cannot start.

The process leaves unresolved haplotypes.

A haplotype is declared a recombinant of two existing haplotypes, although it exists in the sample.

Spanning tree instead of phylogenetic tree assignment of haplotypes.

[Figure: two panels, "Spanning Tree" and "Phylogenetic Tree", drawn on haplotypes labelled H1-H6]


SNPs ---> haplotypes: Gusfield (2002)

Haplotype Inference: Make Phylogeny with 2n leaves exhausting the SNPs.

Perfect Phylogeny: only 0 or 1 event at each site.

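A position in an individual is labelled 0 or 1 if it is homozygous for one of the two variants, and 2 if it is heterozygous.

S (genotypes):

     1  2
a    2  2
b    0  2
c    1  0

Q (rows doubled):

     1  2
a    2  2
a'   2  2
b    0  2
b'   0  2
c    1  0
c'   1  0

B (heterozygous positions resolved):

     1  2
a    1  0
a'   0  1
b    0  1
b'   0  0
c    1  0
c'   1  0

[Figure: the perfect phylogeny T(S) with leaves a, a', b, b', c, c' and edges labelled by sites 1 and 2]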
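For an instance as small as S, the perfect phylogeny haplotyping problem can be solved by brute force. The sketch below is not Gusfield's algorithm: it simply tries every phasing and keeps those whose 2n haplotypes pass the four-gamete test, which is used here as the assumed criterion for a binary perfect phylogeny. The function names are illustrative.

```python
from itertools import combinations, product


def phasings(genotype):
    """Yield every haplotype pair compatible with a 0/1/2-coded genotype."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    if not het:
        yield tuple(genotype), tuple(genotype)
        return
    # The first heterozygous site is fixed to 0 in h1 to avoid mirrored pairs.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        yield tuple(h1), tuple(h2)


def passes_four_gamete_test(haplotypes):
    """True if no pair of sites shows all four combinations 00, 01, 10, 11."""
    m = len(haplotypes[0])
    for i, j in combinations(range(m), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True


def pph_solutions(genotypes):
    """Exhaustive search; exponential in the number of heterozygous sites."""
    solutions = []
    for choice in product(*(list(phasings(g)) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        if passes_four_gamete_test(haps):
            solutions.append(choice)
    return solutions


if __name__ == "__main__":
    S = [(2, 2), (0, 2), (1, 0)]  # individuals a, b, c
    for solution in pph_solutions(S):
        print(solution)
```

For S the search leaves a single resolution, which agrees with matrix B up to the order of the two haplotypes within each individual.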


SNPs ---> haplotypes: Gusfield (2002)

The perfect phylogeny haplotyping (PPH) problem can be reduced to a graph realization problem:

Recognizing graphic binary matroids.

This problem has an almost linear-time algorithm (that has never been implemented).

This also allows efficient enumeration of the possible solutions.

Question: Are there SNP data that do not allow a perfect phylogeny?


SNPs ---> haplotypes: Stephens (2001)

G=(G1,..,Gn) SNP-types. H=(H1,..,Hn) haplotypes

F=(F1,..,Fm) population haplotype frequencies.

f=(f1,..,fm) sample haplotype frequencies.

i. Find F that maximizes the probability of the observed sample.

ii. The same for population parameters.

iii. Simulation is very easy.



The Exponential Distribution

[Figure: density of the exponential distribution, horizontal axis from 0 to 3]

The exponential distribution Exp(a) on R+

Density: $f(t) = a e^{-at}$, $P(X > t) = e^{-at}$

Properties: X ~ Exp(a) Y ~ Exp(b) independent

i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1)

ii. E(X) = 1/a.

iii. P(X < Y) = a/(a + b).

iv. min(X,Y) ~ Exp (a + b).

v. The sum of k iid $X_i$ is $\Gamma(k, a)$ distributed, with density $\dfrac{a^k x^{k-1} e^{-ax}}{\Gamma(k)}$.
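Properties iii and iv are the ones the coalescent relies on, and they can be checked by simulation. The sketch below uses only the standard library; the function name, the number of trials and the seed are arbitrary choices.

```python
import random


def check_exponential_properties(a, b, trials=200_000, seed=1):
    """Estimate P(X < Y) and E(min(X, Y)) for independent X ~ Exp(a), Y ~ Exp(b)."""
    rng = random.Random(seed)
    x_wins = 0
    total_min = 0.0
    for _ in range(trials):
        x = rng.expovariate(a)
        y = rng.expovariate(b)
        x_wins += x < y
        total_min += min(x, y)
    return x_wins / trials, total_min / trials


if __name__ == "__main__":
    p, mean_min = check_exponential_properties(1.0, 3.0)
    print(p, "should be close to a/(a+b) =", 1.0 / 4.0)
    print(mean_min, "should be close to 1/(a+b) =", 1.0 / 4.0)
```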


The Standard Coalescent

Two independent Processes

Continuous: Exponential Waiting Times

Discrete: Choosing Pairs to Coalesce.

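[Figure: a coalescent tree on sequences 1 2 3 4 5 with topology ((1,2),(3,(4,5)))]

Waiting: with k lineages left the waiting time is $\mathrm{Exp}\binom{k}{2}$, i.e. $\mathrm{Exp}\binom{5}{2}$, $\mathrm{Exp}\binom{4}{2}$, $\mathrm{Exp}\binom{3}{2}$, $\mathrm{Exp}\binom{2}{2}$.

Coalescing: 4--5, 3--(4,5), 1--2 and finally (1,2)--(3,(4,5)).

Partitions shown: {1}{2}{3}{4}{5}, {1,2}{3}{4,5}, {1}{2}{3,4,5}, {1,2}{3,4,5}, {1,2,3,4,5}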
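The two independent processes translate directly into a simulation: draw an exponential waiting time with rate k(k-1)/2, then merge a uniformly chosen pair. The following is a minimal sketch with illustrative names, using time units in which a pair coalesces at rate 1.

```python
import random


def simulate_coalescent(n, seed=None):
    """Simulate the standard coalescent for n sequences.
    Returns a list of (time, merged_block) events, most recent first."""
    rng = random.Random(seed)
    blocks = [frozenset([i]) for i in range(1, n + 1)]
    t = 0.0
    history = []
    while len(blocks) > 1:
        k = len(blocks)
        # Continuous part: waiting time with rate k choose 2.
        t += rng.expovariate(k * (k - 1) / 2)
        # Discrete part: every pair is equally likely to coalesce.
        i, j = rng.sample(range(k), 2)
        merged = blocks[i] | blocks[j]
        blocks = [b for idx, b in enumerate(blocks) if idx not in (i, j)] + [merged]
        history.append((t, merged))
    return history


if __name__ == "__main__":
    for time, block in simulate_coalescent(5, seed=0):
        print(round(time, 3), sorted(block))
```

The time of the last event is the time to the most recent common ancestor; its expectation is 2(1 - 1/n), which is 1.6 for n = 5.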


Additional Evolutionary Factors

Geographical Structure.

Admixture can create longer LD islands

Population Growth.

LD in the present, large population can still have the characteristics of a small population.

Recombination/Gene Conversion.

Gene conversion (GC) can lower LD between close sites relative to LD between distant sites.

Selection.

Selective sweeps can create strong LDs locally.


Two sequences, infinite sites & k differences

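The probability that there are k differences between two sequences: going back in time, 2 kinds of events can occur, a mutation (rate $\theta$) or a coalescence (rate 1). This gives a geometric distribution:

$P(k) = \left(\dfrac{\theta}{1+\theta}\right)^{k} \dfrac{1}{1+\theta}$

[Figure: two lineages traced back in time with mutations marked *; competing waiting times Exp(1) and Exp($\theta$)]

$E_k(\mathrm{MRCA}) = \dfrac{1+k}{1+\theta}$

The distribution of the waiting time to the j'th newest mutation is $\Gamma(j, 1+\theta)$.

The time to the MRCA is $\Gamma(k+1, 1+\theta)$ distributed.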
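The geometric distribution and the conditional expectation can be compared with a simulation of the two competing event types. This sketch assumes total rate 1 + theta for events on the pair of lineages, with each event being a coalescence with probability 1/(1 + theta); the names are illustrative.

```python
import random


def prob_k_differences(k, theta):
    """Geometric probability of k mutations before the two lineages coalesce."""
    return (theta / (1 + theta)) ** k / (1 + theta)


def expected_tmrca_given_k(k, theta):
    """Mean of the Gamma(k + 1, 1 + theta) distribution."""
    return (k + 1) / (1 + theta)


def simulate_pair(theta, rng):
    """Trace two lineages back in time; return (number of mutations, TMRCA)."""
    k, t = 0, 0.0
    while True:
        t += rng.expovariate(1 + theta)       # time to the next event of either kind
        if rng.random() < 1 / (1 + theta):    # the event is the coalescence
            return k, t
        k += 1                                # otherwise it is a mutation


if __name__ == "__main__":
    rng = random.Random(7)
    sample = [simulate_pair(1.0, rng) for _ in range(100_000)]
    times_k2 = [t for k, t in sample if k == 2]
    print(len(times_k2) / len(sample), prob_k_differences(2, 1.0))
    print(sum(times_k2) / len(times_k2), expected_tmrca_given_k(2, 1.0))
```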


n sequences, infinite sites & k differences (Russell Thompson 98)

[Figure: n aligned sequences 1, 2, ..., n with segregating sites marked s and the oldest mutation indicated; with k lineages the waiting times are Exp(k(k-1)/2) to coalescence and Exp($k\theta/2$) to mutation]

Only the number of segregating sites is observed.

Explicit expressions or simple recursions exist for distributions analogous to the 2 sequence case.


Classical Polya Urns (Feller I)

[Figure: an urn with balls of colours 1, 2 and 3]

Let X0 be the initial configuration of the urn.

A step: take a random ball from the urn and put it back together with an extra ball of the same colour.

Let Xk be the content after the k'th step. Let Yk be the colour of the k'th picked ball.

i. P{Yk = j} = P{Y1 = j}.

ii. Sequences Y1 ... Yk resulting in the same Xk have the same probability.
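Property i is easy to check by simulation. The sketch below uses an arbitrary initial urn with one red and two blue balls; the function name and colours are illustrative.

```python
import random


def polya_draws(initial, steps, rng):
    """Run a classical Polya urn and return the colours Y1..Yk of the picked balls."""
    urn = list(initial)
    draws = []
    for _ in range(steps):
        ball = rng.choice(urn)   # every ball is equally likely to be picked
        draws.append(ball)
        urn.append(ball)         # the ball goes back with one extra of its colour
    return draws


if __name__ == "__main__":
    rng = random.Random(3)
    runs = 60_000
    fifth_is_red = sum(polya_draws(["red", "blue", "blue"], 5, rng)[4] == "red"
                       for _ in range(runs))
    print(fifth_is_red / runs, "should be close to P{Y1 = red} =", 1 / 3)
```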


Labelling, Polya Urns & Age of Alleles (Donnelly, 1986 + Hoppe, 1984+87)

An urn:

[Figure: an urn holding the initial ball and ordinary balls labelled 1, 2, 1, 1]

A ball is picked with probability proportional to its weight. Ordinary balls have weight 1.

If the initial $\theta$-size ball is picked, it is replaced together with a ball of a completely new type.

If an ordinary ball is picked, it is replaced together with a copy of itself.

There is a simple relationship between the distributions of "the alleles labelled by age ranking" and "the alleles labelled by size ranking".

[Figure: three labellings of the alleles: as they come, by size, by age]
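The urn scheme can be simulated directly. In this sketch the special ball has weight theta, types are numbered 1, 2, ... in order of appearance ("as they come"), and the check against the expected number of types, the sum of theta/(theta + i) for i = 0..n-1, relies on the Ewens sampling formula rather than on anything stated on the slide. The names are illustrative.

```python
import random


def hoppe_urn(n, theta, rng):
    """Return the types of the first n ordinary balls added to the urn."""
    labels = []
    n_types = 0
    for i in range(n):
        # With i ordinary balls in the urn, the theta-ball is picked
        # with probability theta / (theta + i).
        if rng.random() < theta / (theta + i):
            n_types += 1
            labels.append(n_types)             # a completely new type
        else:
            labels.append(rng.choice(labels))  # a copy of an ordinary ball
    return labels


if __name__ == "__main__":
    rng = random.Random(11)
    n, theta, runs = 10, 1.0, 20_000
    mean_types = sum(len(set(hoppe_urn(n, theta, rng))) for _ in range(runs)) / runs
    expected = sum(theta / (theta + i) for i in range(n))
    print(mean_types, "should be close to", expected)
```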


n sequences, infinite sites & 1 segregating site (d,n-d)

M. Stephens 2000, Griffiths & Tavare, 1998

---------------0------------- 1

---------------0------------- 2

---------------0------------- d

---------------+------------- d+1

---------------+------------- n

1 2 d d+1 … n

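Distribution of the age of the mutation:

$f_{A|d}(a) = \sum_{k=2}^{n-d+1} p(k,d) \sum_{i=k}^{n} \rho_i^{(n,k)} \exp(-\lambda_i a)$

where

$p(k,d) = \dfrac{\binom{n-d-1}{k-2}\binom{n-1}{k-1}^{-1}\frac{1}{k-1}}{\sum_{j=2}^{n-d+1}\binom{n-d-1}{j-2}\binom{n-1}{j-1}^{-1}\frac{1}{j-1}}$

and

$\rho_i^{(n,k)} = \dfrac{\lambda_k \cdots \lambda_n}{\prod_{j=k,\, j \ne i}^{n} (\lambda_j - \lambda_i)}$, with $\lambda_i = \binom{i}{2}$ $(2 \le k \le i \le n)$

$E(A \mid d) = \sum_{k=2}^{n-d+1} p(k,d) \sum_{i=k}^{n} \dfrac{1}{\lambda_i}$

Lastly, population analogues can be obtained by letting n → 2N.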
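The expected age can be evaluated numerically. The sketch below assumes that the mutation arose while there were k lineages with probability proportional to C(n-d-1, k-2) / C(n-1, k-1) / (k-1), and that the coalescence rate with i lineages is i(i-1)/2, so time is measured in the corresponding coalescent units. The function names are illustrative.

```python
from math import comb, log


def level_probabilities(n, d):
    """p(k, d): probability that a mutation carried by d of n sequences
    arose while there were k ancestral lineages, k = 2 .. n-d+1."""
    weights = {
        k: comb(n - d - 1, k - 2) / comb(n - 1, k - 1) / (k - 1)
        for k in range(2, n - d + 2)
    }
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def expected_age(n, d):
    """E(A | d) = sum_k p(k, d) * sum_{i=k}^{n} 1 / lambda_i, lambda_i = i(i-1)/2."""
    return sum(
        p * sum(2.0 / (i * (i - 1)) for i in range(k, n + 1))
        for k, p in level_probabilities(n, d).items()
    )


if __name__ == "__main__":
    for d in (1, 10, 50, 90):
        print(d, round(expected_age(100, d), 3))
    # For large n and frequency x = d/n the value approaches -2 x ln(x) / (1 - x).
    print(expected_age(400, 200), -2 * 0.5 * log(0.5) / 0.5)
```

Rare mutations come out young and common ones old, as the loop over d shows.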


Shape of Tree Hanging below a Mutation

Griffiths & Tavare, 1998 + 2002

---------------0------------- 1

---------------0------------- 2

---------------0------------- d

---------------+------------- d+1

---------------+------------- n

1 2 d d+1 … n

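Probability that a specified edge, present when there were k lineages, has b descendants:

$p_{nk}(b) = \binom{n-b-1}{k-2} \Big/ \binom{n-1}{k-1}$, $1 \le b \le n-k+1$

It is possible to describe the shape of the hanging tree.

[Figure: the subtree hanging below the mutation, marked at the level with k lineages]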
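The formula can be checked against a forward-in-time simulation. The sketch assumes that, going from k to n lineages, every existing lineage is equally likely to be the next one to split, which turns the descendant count of one marked lineage into a Polya urn. The names are illustrative.

```python
import random
from math import comb


def p_nk(n, k, b):
    """Probability that a specified edge among k lineages has b of the n leaves."""
    if not (2 <= k <= n and 1 <= b <= n - k + 1):
        return 0.0
    return comb(n - b - 1, k - 2) / comb(n - 1, k - 1)


def simulate_descendants(n, k, rng):
    """Grow the tree from k to n lineages and count descendants of one marked edge."""
    marked, total = 1, k
    while total < n:
        if rng.random() < marked / total:   # the splitting lineage is a marked one
            marked += 1
        total += 1
    return marked


if __name__ == "__main__":
    n, k, runs = 10, 3, 50_000
    rng = random.Random(5)
    counts = [simulate_descendants(n, k, rng) for _ in range(runs)]
    for b in range(1, n - k + 2):
        print(b, round(counts.count(b) / runs, 4), round(p_nk(n, k, b), 4))
```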


Mutations & their Branch

Wiuf & Donnelly (1999), Wiuf (2000 + 2001a,b)

Exact expressions can be obtained for the start and end of the mutation branch and for the position of the mutation.

Approximations exist for a small (< 10%) mutation subtree; they also allow the mutation to have a selection coefficient.

f – frequency of mutant, n=1000


Mutations & their Branch

Wiuf (2001a,b)

The Effect of Selection & Growth.


Cystic Fibrosis (Wiuf 2001)

ΔF508: possibly maintained by heterosis (1.023), through higher resistance to Salmonella infections.

Data: Frequency of the ΔF508 allele: .022. Inter variability in 1,705 individuals, 46 variable positions. Model of human demography.

Model parameters: mutation rate, heterosis advantage and an exponential growth model of human population expansion.

[Figure: genealogy with the mutation marked *]

The age of ΔF508 is estimated to be:


Human History-Two Levels: Physical & Genealogical

The physical population size, N(t), and the effective population size, Ne(t), are separate concepts.

i. N(t) can mainly be studied by historical/archaeological means,

ii. Ne(t) can be studied genealogically, for instance by tracing the ancestries of DNA sequences.

Main departures from simplest Population Genetical Models:

A. Long epochs of exponential growth at increasing rates

B. Bottlenecks.

C. Migrations & Geographical subdivisions


Out of Africa ------ Multiregional Model

A 1st origin of humans in Africa 3-5 Myr ago is relatively well accepted.

A 2nd origin from Africa 150-300 Kyr ago is controversial.

1. Was there a population expansion from Africa that replaced the populations in Asia/Europe that left fossils, as asserted by the Out of Africa Model?

2. Or did this expansion hybridize with the local populations, as asserted by the Multiregional Model?


[Figure from Templeton, 2002, March, Nature]


From Cavalli-Sforza, 2001

Human Migrations


Cavalli-Sforza: Language Trees

Cavalli-Sforza (1997) Genes, Peoples and Languages. PNAS 94.7719-24

Principle of Comparison.

Loss of cognates (“homologous” words)

Syntax Comparison.

Sound use.

Reconstruction depends on interpretation and stretches back 2,000-6,000 years, depending on the criteria.


Cavalli-Sforza: Principal Components - Agriculture, …

Agriculture

6-10 Kyr

Greek Colonisation

3 Kyr

Retraction of the Basques.

Uralic People

Horse domestication


Homo Sapiens & the Neanderthal (Nordborg)

Two Scenarios:

Constant female population size of 3,400.

Growing for 50,000 years to $5 \times 10^{8}$.

Problem: Can the observed data be explained by one common H. sapiens - Neanderthal population?

[Figure: genealogy joining one Neanderthal sequence to 986 H. sapiens sequences, with the times Te, Tt and ts marked]

                          Constant pop. size       Recent growth
                          30,000     100,000       30,000          100,000
E(A())                    4.86       1.75          782             2.86
P(topology)               .085       .56           3.3 x 10^-6     .24
P(topology & Tt > 4Te)    .0063      .035          3.7 x 10^-8     .002


Summary

The Existing Data: SNPs & Haplotypes.

Reconstructing Haplotypes.

The Coalescent with Mutations.

The Human Population, its history & its Genome.

A serious gap between capabilities of theory and the demand of existing data.


History of Coalescent Theory

1930-40s: Genealogical arguments well known to Wright & Fisher.

1964: Crow & Kimura: Infinite Allele Model

1966: (Hubby & Lewontin) & (Harris) make the first surveys of population allele variation by protein electrophoresis.

1968: Motoo Kimura proposes a neutral explanation of molecular evolution & population variation. So do King & Jukes.

1971: Kimura & Ohta propose the infinite sites model.

1975: Watterson makes explicit use of "The Coalescent"

1982: Kingman introduces "The Coalescent".

1983: Hudson introduces "The Coalescent with Recombination"

1983: Kreitman publishes the first major population sequences.

1987: Cann et al. try to trace human origin and migrations with mitochondrial DNA.


1989-90: Kaplan, Hudson, Takahata and others: Selection regimes with coalescent structure (MHC, Incompatibility alleles).

1988: Hughes & Nei: Genes with positive Darwinian Selection.

1987-95: Griffiths, Ethier & Tavaré calculate the probability of infinite sites data.

1991: McDonald & Kreitman: Data with a surplus of replacement interspecific substitutions.

1991: Aquadro & Begun: Positive recombination-nucleotide variation correlation.

1994-: Griffiths-Tavaré + Kuhner-Yamato-Felsenstein introduce highly computer-intensive simulation techniques to estimate parameters in population models.

1996-: Krone-Neuhauser introduce selection in the Coalescent

1998-: Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis.

1999: Wiuf & Donnelly use Coalescent Theory to estimate the age of a disease allele

2000: Wiuf et al. introduce gene conversion into the coalescent.

2000-: Several groups combine Coalescent Theory & Gene Mapping. A flood of SNP data & haplotypes is on its way.


Recommended Literature & www-sites

Cavalli-Sforza (2001) Genes, People and Language. Penguin.

Clark, A. (1990) "Inference of Haplotypes from PCR-amplified Samples of Diploid Populations" Mol.Biol.Evol. 7.2.111-122

Daly, MJ et al. (2001) High-resolution haplotype structure in the human genome. Nat.Gen. 29.229-32.

Donnelly, P. and R. Foley (eds) (2001) Genes, Fossils and Behaviour. IOS Press.

Goldstein, DB & Chikhi (2002) "Human Migrations and Population Structure" Annu.Rev.Genomics Hum.Genetics (forthcoming)

Griffiths, RC "Ancestral Inference from Gene Trees" in Donnelly, P. and R. Foley (eds) (2001) Genes, Fossils and Behaviour. IOS Press.

Gusfield (2002) Haplotypes as perfect phylogeny. To appear in Recomb2002

Hoppe (1984) "Polya-like urns and the Ewens' sampling formula" J.Math.Biol. 20.91-94

Harpending & Rogers (2000) Genetic Perspectives on Human Origins and Differentiation. Annu.Rev.Genom.Hum.Genet. 1.361-85.

Inter. SNP Consortium (2001) A map of human genome sequence variation containing 1.42 million SNPs. Nature 409.928-33

Nichols, J. (1997) Modelling Ancient Population Structures and Movement in Linguistics. Annu.Rev.Anthrop. 26.359-84.

Reich, DE et al. (2001) Linkage disequilibrium in the human genome. Nature 411.199-204

Relethford (2001) Genetics and the Search for Modern Human Origins. Wiley

Slatkin & Rannala (2000) "Estimating Allele Age" Annu.Rev.Genomics Hum.Genet. 1.225-49

Stephens, M. (1999) "Times on Trees, and the Age of an Allele" Theor.Pop.Biol. 58.61-75.

Stephens, M et al. (2001) "A New Statistical Method for Haplotype Reconstruction from Population Data" Am.J.Hum.Gen. 68.978-989

Templeton, A. (2002) "Out of Africa again and again" Nature vol 416.45-51.

Thompson, R. (1998) "Ages of mutations on a coalescent tree" Math.Bios. 153.41-61.

Wiuf & Hein (1997) "The Number of Ancestors to DNA Sequence" Genetics

Wiuf (2000) "On the Genealogy of a Sample of Neutral Rare Alleles" Theor.Pop.Biol. 58.61-75.

Wiuf (2001) "Rare Alleles and Selection" Theor.Pop.Biol. 59.287-96.

Wiuf (2001) Do ΔF508 heterozygotes have a selective advantage? Genet.Res.Cam. 78.41-47.

Wiuf & Donnelly (1999) Conditional Genealogies and the Age of a Mutant. Theor.Pop.Biol. 56.183-201.

http://www.sanger.ac.uk/HGP/

http://snp.cshl.org/data/

Mikkel Schierup's program package: www.daimi.au.dk/~compbio/coalescent/

Gil McVean's course in population genetics: http://www.stats.ox.ac.uk/~mcvean/pgindex.html