The utility of the HapMap reference samples for clinical populations Gabor T. Marth Department of Biology, Boston College [email protected] VanBug Seminar Vancouver, BC, Canada September 9, 2004 (the informatics of sequence variations and haplotypes)


Page 1: The utility of the HapMap reference samples for clinical populations

The utility of the HapMap reference samples for clinical populations

Gabor T. Marth

Department of Biology, Boston College

[email protected]

VanBug Seminar, Vancouver, BC, Canada, September 9, 2004

(the informatics of sequence variations and haplotypes)

Page 2: The utility of the HapMap reference samples for clinical populations

Why do we care about variations?

underlie phenotypic differences

cause inherited diseases

allow tracking ancestral human history

Page 3: The utility of the HapMap reference samples for clinical populations

How do we find sequence variations?

• look at multiple sequences from the same genome region

• use base quality values to decide if mismatches are true polymorphisms or sequencing errors

Page 4: The utility of the HapMap reference samples for clinical populations

Automated polymorphism discovery

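$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1|R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N|R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1,\ldots,S_N)}
{\displaystyle \sum_{S_{i_1}\in[A,C,G,T]} \cdots \sum_{S_{i_N}\in[A,C,G,T]} \frac{P(S_{i_1}|R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \frac{P(S_{i_N}|R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
$$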

Marth et al. Nature Genetics 1999
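A minimal sketch of how a Bayesian SNP probability of this kind could be evaluated by brute-force enumeration over one alignment column. Several pieces are assumptions for illustration and are not taken from the slides: the conversion of a Phred quality into P(S_i | R_i), the flat per-read prior of 1/4, the form of the joint prior (controlled by the hypothetical parameter p_poly), and all function names.

```python
from itertools import product

BASES = "ACGT"

def base_posteriors(called_base, phred_quality):
    """P(S_i | R_i): probability of each true base given one read's base call."""
    error = 10 ** (-phred_quality / 10)
    return {b: (1 - error) if b == called_base else error / 3 for b in BASES}

def joint_prior(config, p_poly):
    """Assumed prior over a column of true bases (illustrative, not from the slides)."""
    n = len(config)
    if len(set(config)) == 1:
        return (1 - p_poly) / 4        # the four monomorphic columns share 1 - p_poly
    return p_poly / (4 ** n - 4)       # variable columns share p_poly equally

def snp_probability(reads, p_poly=0.001):
    """reads: list of (called_base, phred_quality) for one alignment column."""
    posteriors = [base_posteriors(b, q) for b, q in reads]
    variable = total = 0.0
    for config in product(BASES, repeat=len(reads)):
        term = joint_prior(config, p_poly)
        for s, post in zip(config, posteriors):
            term *= post[s] / 0.25     # P(S_i | R_i) / P_Prior(S_i), flat prior 1/4
        total += term                  # denominator: all base combinations
        if len(set(config)) > 1:
            variable += term           # numerator: variable combinations only
    return variable / total
```

Enumeration costs 4^N terms, so this form is only practical for shallow columns; it is meant to show the structure of the calculation, not an efficient implementation.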

Page 5: The utility of the HapMap reference samples for clinical populations

Large SNP mining projects

Sachidanandam et al. Nature 2001

~ 8 million

EST

WGS

BAC

genome reference

Page 6: The utility of the HapMap reference samples for clinical populations

How to use markers to find disease?

• problem: genotyping cost precludes using millions of markers simultaneously for an association study

genome-wide, dense SNP marker map

• depends on the patterns of allelic association in the human genome

• question: how to select from all available markers a subset that captures most mapping information (marker selection, marker prioritization)

Page 7: The utility of the HapMap reference samples for clinical populations

Allelic association

• allelic association is the non-random assortment of alleles, i.e. it measures how well knowledge of the allele state at one site (e.g. a marker site) permits prediction of the allele state at another (e.g. a functional site)

• by necessity, the strength of allelic association is measured between markers

• significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection

• there are pair-wise and multi-locus measures of association

Page 8: The utility of the HapMap reference samples for clinical populations

Linkage disequilibrium

• LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites

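D = f(AB) – f(A) × f(B), where f(AB) is the frequency of the haplotype carrying allele A at the first site and allele B at the second, and f(A), f(B) are the frequencies of the two alleles

• other measures of LD are derived from D, e.g. by normalizing according to allele frequencies (r²)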
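As a small worked example, D and r² can be computed directly from a list of phased two-site haplotypes. The function name and the input layout are assumptions chosen for illustration; r² here is the standard normalization D² / (f(A)(1 – f(A)) f(B)(1 – f(B))).

```python
def ld_from_haplotypes(haplotypes, allele_a, allele_b):
    """haplotypes: list of (site1_allele, site2_allele) pairs, one per chromosome."""
    n = len(haplotypes)
    f_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    f_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    f_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n
    d = f_ab - f_a * f_b                       # deviation from random assortment
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined if a site is monomorphic
    return d, r2

# Complete association: only two haplotypes are present.
d_full, r2_full = ld_from_haplotypes([("A", "C")] * 5 + [("G", "T")] * 5, "A", "C")
# Random assortment: all four haplotypes at equal frequency.
d_none, r2_none = ld_from_haplotypes(
    [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")], "A", "C")
```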

Page 9: The utility of the HapMap reference samples for clinical populations

Haplotype diversity

• the most useful multi-marker measures of association are related to haplotype diversity

• random assortment of alleles at different sites: n markers give 2^n possible haplotypes

• strong association: most chromosomes carry one of a few common haplotypes (reduced haplotype diversity)
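To make the contrast concrete, the sketch below compares the number of haplotypes observed in a sample with the 2^n possible for n biallelic markers. The heterozygosity-style summary (one minus the sum of squared haplotype frequencies) is one common diversity measure and is an assumption here, as is the function name.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """haplotypes: list of equal-length strings, one per chromosome (biallelic markers)."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    n_markers = len(haplotypes[0])
    heterozygosity = 1.0 - sum((c / n) ** 2 for c in counts.values())
    return {
        "observed": len(counts),          # distinct haplotypes in the sample
        "possible": 2 ** n_markers,       # upper bound under random assortment
        "heterozygosity": heterozygosity,
    }

summary = haplotype_diversity(["000", "000", "111", "111"])
```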

Page 10: The utility of the HapMap reference samples for clinical populations

The determinants of allelic association

• recombination: breaks down allelic association by “randomizing” allele combinations

• demographic history of effective population size: bottlenecks increase allelic association by non-uniform re-sampling of allele combinations (haplotypes)

bottleneck

Page 11: The utility of the HapMap reference samples for clinical populations

Strength of LD in the human genome

[Figure: D² (0 to 0.8) as a function of recombination fraction (1.0E-06 to 1.0E-02), with curves labeled Expansion and Bottleneck]

• LD is stronger, extends longer than previously thought

Page 12: The utility of the HapMap reference samples for clinical populations

Haplotype blocks

Daly et al. Nature Genetics 2001

• experimental evidence for reduced haplotype diversity (mainly in European samples)

Page 13: The utility of the HapMap reference samples for clinical populations

The promise for medical genetics

CACTACCGACACGACTATTTGGCGTAT

• within blocks a small number of SNPs are sufficient to distinguish the few common haplotypes, so significant marker reduction is possible

• if the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost

• this motivated the HapMap project

Gibbs et al. Nature 2003

Page 14: The utility of the HapMap reference samples for clinical populations

The HapMap initiative

• goal: to map out human allele and association structure at the kilobase scale

• deliverables: a set of physical and informational reagents

Page 15: The utility of the HapMap reference samples for clinical populations

HapMap physical reagents

• reference samples: 4 world populations, ~100 independent chromosomes from each

• SNPs: computational candidates where both alleles were seen in multiple chromosomes

• genotypes: high-accuracy assays from various platforms; fast public data release

Page 16: The utility of the HapMap reference samples for clinical populations

Informational reagents: haplotypes

• the problem: the substrate for genotyping is diploid, genomic DNA; phasing of alleles at multiple loci is in general not possible with certainty

• experimental methods of haplotype determination (single-chromosome isolation followed by whole-genome PCR amplification, radiation hybrids, somatic cell hybrids) are expensive and laborious

[Figure: alleles at multiple loci of a diploid sample]

Page 17: The utility of the HapMap reference samples for clinical populations

Computational haplotype inference

• Parsimony approach: minimize the number of different haplotypes that explain all diploid genotypes in the sample (Clark, Mol Biol Evol 1990)

• Maximum likelihood approach: estimate haplotype frequencies that are most likely to produce the observed diploid genotypes (Excoffier & Slatkin, Mol Biol Evol 1995)

• Bayesian methods: estimate haplotypes based on the observed diploid genotypes and the a priori expectation of haplotype patterns informed by population genetics (Stephens et al., AJHG 2001)
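The maximum likelihood idea can be illustrated with an expectation-maximization (EM) loop for the smallest interesting case, two biallelic loci, where only double heterozygotes have ambiguous phase. This is a simplified sketch in the spirit of that approach, not the published implementation; the genotype encoding (count of the "1" allele at each locus) and the function name are assumptions.

```python
def em_two_locus(genotypes, n_iter=100):
    """genotypes: list of (g1, g2), each 0, 1 or 2 copies of allele '1' at that locus.
    Returns estimated frequencies of the four haplotypes (a, b), a and b in {0, 1}."""
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: phase is ambiguous, split by current frequency estimates
                p_cis = freqs[(0, 0)] * freqs[(1, 1)]
                p_trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = p_cis + p_trans
                w = p_cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # At most one heterozygous locus: the two haplotypes are determined
                h1 = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                h2 = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                counts[h1] += 1
                counts[h2] += 1
        # M-step: re-estimate frequencies from expected haplotype counts
        n_chrom = 2 * len(genotypes)
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs

# Homozygotes suggest only haplotypes 00 and 11 exist, so the two double
# heterozygotes are resolved as 00 + 11.
estimated = em_two_locus([(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2)
```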

Page 18: The utility of the HapMap reference samples for clinical populations

Haplotype inference

http://pga.gs.washington.edu/

Page 19: The utility of the HapMap reference samples for clinical populations

Haplotype annotations – LD based

• Pair-wise LD-plots

Wall & Pritchard Nature Rev Gen 2003

• LD-based multi-marker block definitions requiring strong pair-wise LD between all pairs in block

Page 20: The utility of the HapMap reference samples for clinical populations

Annotations – haplotype blocks

• Dynamic programming approach (Zhang et al., AJHG 2001)

1. meet block definition based on common haplotype requirements

2. within each block, determine the number of SNPs that distinguish common haplotypes (htSNPs)

3. minimize the total number of htSNPs over complete region including all blocks
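Step 2 can be sketched for a single block as a search for the smallest set of SNP positions that still tells the common haplotypes apart. The version below is a brute-force illustration over all subsets, not the dynamic programming algorithm cited on this slide, and the function name is hypothetical.

```python
from itertools import combinations

def min_tag_snps(haplotypes):
    """haplotypes: list of equal-length strings (the common haplotypes of one block).
    Returns the smallest tuple of SNP positions that distinguishes all of them."""
    n_sites = len(haplotypes[0])
    distinct = set(haplotypes)
    for k in range(n_sites + 1):                       # try smaller subsets first
        for subset in combinations(range(n_sites), k):
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):        # every haplotype still unique
                return subset
    return tuple(range(n_sites))

tags = min_tag_snps(["ACGT", "ACGA", "TCGA"])
```

Here positions 1 and 2 are monomorphic across the three haplotypes, so two tagging positions suffice in place of four.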

Page 21: The utility of the HapMap reference samples for clinical populations

Questions about the HapMap

• completion, sufficient density?

• haplotype structure across populations?

• is structure constant with sample size?

• Explore human allele structure with a population genetic modeling and data fitting technique

Page 22: The utility of the HapMap reference samples for clinical populations

Data: polymorphism distributions

1. marker density (MD): distribution of number of SNPs in pairs of sequences

Clone 1    Clone 2    # SNPs
AL00675    AL00982    8
AS81034    AK43001    0
CB00341    AL43234    2

[Figure: marker density histogram, 0 to 10 SNPs per sequence pair]

2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples

SNP    Minor allele    Allele count
A/G    A               1
C/T    T               9
A/G    G               3

[Figure: allele frequency spectrum, allele counts 1 to 10, from “rare” to “common”]
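Both distributions are normalized histograms over per-locus counts. The sketch below builds them from the three example rows shown on this slide; the helper name is hypothetical, and three observations are of course far too few for a meaningful distribution.

```python
from collections import Counter

def normalized_histogram(values):
    """Return {value: fraction of observations with that value}."""
    counts = Counter(values)
    total = len(values)
    return {v: counts[v] / total for v in sorted(counts)}

snps_per_clone_pair = [8, 0, 2]   # "# SNPs" column of the clone-pair table
minor_allele_counts = [1, 9, 3]   # "Allele count" column of the SNP table

marker_density = normalized_histogram(snps_per_clone_pair)
allele_frequency_spectrum = normalized_histogram(minor_allele_counts)
```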

Page 23: The utility of the HapMap reference samples for clinical populations

Model: processes that generate SNPs

[Equation: P(k), the model probability of observing k SNPs, with terms indexed over three epochs of population size history]

computable formulations

simulation procedures
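The simulation route can be sketched for the simplest case only: a stationary (constant-size) population and a pair of sequences. Under the standard coalescent the pairwise coalescence time is exponentially distributed, mutations fall on the genealogy as a Poisson process, and the resulting marker density is geometric, which gives a closed form to check against. The parameter theta (the scaled mutation rate for the whole sequence) and all function names are assumptions; the multi-epoch models on the following slides are not implemented here.

```python
import random

def poisson_draw(rng, mean):
    """Poisson variate obtained by counting unit-rate arrivals before time `mean`."""
    k, t = 0, rng.expovariate(1.0)
    while t < mean:
        k += 1
        t += rng.expovariate(1.0)
    return k

def simulate_marker_density(theta, replicates, seed=0):
    """Simulated distribution of the number of SNPs between two sequences."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        t = rng.expovariate(1.0)            # coalescence time, units of 2N generations
        k = poisson_draw(rng, theta * t)    # mutations on both branches
        counts[k] = counts.get(k, 0) + 1
    return {k: c / replicates for k, c in sorted(counts.items())}

def stationary_marker_density(theta, k):
    """Closed form for constant population size: a geometric distribution."""
    return (1 / (1 + theta)) * (theta / (1 + theta)) ** k

simulated = simulate_marker_density(theta=1.0, replicates=20000, seed=1)
```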

Page 24: The utility of the HapMap reference samples for clinical populations

Models of demographic history

[Figure: population size histories from past to present (stationary, expansion, collapse, bottleneck), each shown with its marker density distribution (MD, by simulation; 0 to 10 SNPs) and its allele frequency spectrum (AFS, direct form; allele counts 1 to 10)]

Page 25: The utility of the HapMap reference samples for clinical populations

Data fitting: marker density

[Figure: marker density distributions for sequence overlaps of 4 kb, 8 kb, 12 kb and 16 kb]

• best model is a bottleneck-shaped population size history: N1=6,000 (present), T1=1,200 gen.; N2=5,000, T2=400 gen.; N3=11,000

Marth et al. PNAS 2003

• our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence, so we looked at allele frequency data from ethnically defined samples

Page 26: The utility of the HapMap reference samples for clinical populations

Data fitting: allele frequency

[Figure: three allele frequency spectra, allele counts 1 to 10]

• model consensus: bottleneck, with N1=20,000 (present), T1=3,000 gen.; N2=2,000, T2=400 gen.; N3=10,000

• Data from other populations?
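As a baseline for reading such fits, the expected spectrum under a stationary history has a simple direct form: in the standard neutral model the expected share of SNPs with derived allele count i is proportional to 1/i. The sketch below folds that expectation into minor allele counts. The sample size n = 20 chromosomes and the function name are assumptions for illustration; the bottleneck and expansion models are not computed here.

```python
def expected_folded_afs(n):
    """Expected proportion of SNPs at each minor allele count in n chromosomes,
    for a stationary (constant-size) neutral population."""
    harmonic = sum(1.0 / i for i in range(1, n))
    afs = {}
    for j in range(1, n // 2 + 1):
        weight = 1.0 / j
        if j != n - j:                 # counts j and n - j fold into the same class
            weight += 1.0 / (n - j)
        afs[j] = weight / harmonic
    return afs

stationary_afs = expected_folded_afs(20)   # minor allele counts 1..10
```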

Page 27: The utility of the HapMap reference samples for clinical populations

Population specific demographic history

[Figure: allele frequency spectra by minor allele count (1 to 10)]

• European data: bottleneck

• African data: modest but uninterrupted expansion

Marth et al. Genetics 2004

Page 28: The utility of the HapMap reference samples for clinical populations

Model-based prediction

computational model encapsulating what we know about the process

genealogy + mutations

allele structure

arbitrary number of additional replicates

Page 29: The utility of the HapMap reference samples for clinical populations

Prediction – allele frequency and age

[Figure, African and European data: contribution of the past to alleles in various frequency classes (proportion of AFS, 0.00 to 0.07, by mutational size i, 1 to 20)]

[Figure, African and European data: average age of polymorphism (mutational age in generations, 0 to 80,000, by mutational size i, 1 to 20)]

Page 30: The utility of the HapMap reference samples for clinical populations

Prediction – extent of LD

[Figure: r² (0.0 to 0.8) against recombination fraction (1E-6 to 1E-3) and distance (0.01 to 1000 kb) for European, Asian and African American samples]

Page 31: The utility of the HapMap reference samples for clinical populations

Prediction – haplotype structure

[Figure: count of blocks (0 to 1800) by block length (0 to 25 kb)]

• our models predict shorter blocks in African samples than in Europeans

• what is the spatial relationship between blocks?

• we must connect the polymorphism structure of different human populations

Page 32: The utility of the HapMap reference samples for clinical populations

Modeling joint allele structure

• We study these relationships with models of population subdivision

[Figure: “African history” and “European history” connected by “migration”]

• The genealogies of samples from different populations are connected through the shared part of our past

• Polymorphic markers (some shared, some population-specific) and haplotypes are placed into a common frame of reference

• The “true” history of all human populations is interconnected

Page 33: The utility of the HapMap reference samples for clinical populations

Joint allele frequencies

European vs. African frequency class (two values per cell)

               monomorphic        rare               common
monomorphic    0.0 % / 0.0 %      19.9 % / 13.2 %    2.3 % / 1.0 %
rare           43.4 % / 43.7 %    11.5 % / 11.0 %    4.6 % / 7.4 %
common         10.2 % / 4.2 %     4.4 % / 6.0 %      6.6 % / 13.4 %

Figure labels: observation in UW PGA data; shared SNPs; SNPs private to African samples; SNPs private to European samples; SNPs common in both populations

• our simple model of subdivision captures the qualitative dynamics

• we now have the tools to analyze joint allele structure
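A joint table of this kind can be tallied from per-SNP allele frequencies in two populations. In the sketch below the 10% cutoff between “rare” and “common”, the input layout and the function names are all assumptions for illustration; the slides do not state the thresholds that were used.

```python
from collections import Counter

def frequency_class(allele_freq, common_threshold=0.1):
    """Classify one allele frequency (hypothetical 10% rare/common cutoff)."""
    if allele_freq == 0:
        return "monomorphic"
    return "common" if allele_freq >= common_threshold else "rare"

def joint_class_table(freq_pairs, common_threshold=0.1):
    """freq_pairs: list of (freq_in_population_1, freq_in_population_2) for the
    same allele of each SNP. Returns percent of SNPs in each joint class."""
    counts = Counter(
        (frequency_class(a, common_threshold), frequency_class(b, common_threshold))
        for a, b in freq_pairs
    )
    total = len(freq_pairs)
    return {classes: 100.0 * c / total for classes, c in counts.items()}

table = joint_class_table([(0.05, 0.0), (0.3, 0.2), (0.0, 0.02), (0.05, 0.0)])
```

SNPs that fall in a ("rare", "monomorphic") or ("common", "monomorphic") cell are private to the first population; cells with both classes polymorphic hold the shared SNPs.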

Page 34: The utility of the HapMap reference samples for clinical populations

Generality for future samples?

• The haplotype map resource is a collection of reagents:

1. reference samples

2. common markers

3. blocks

4. list of haplotypes

5. frequent haplotypes

• How relevant are the reference reagents to future clinical samples (drawn from the same or a different population)?

Page 35: The utility of the HapMap reference samples for clinical populations

Reference haplotypes

reference haplotypes:

• same population: 99.4%

• different population: 87.5% (74.9%, 65.0% at lower minimum marker allele frequency)

• these computational studies inform us about the global, genome-average properties of the HapMap reagents

• what can we say about linkage in specific local regions?

Page 36: The utility of the HapMap reference samples for clinical populations

Utility for association studies?

• No matter how good the resource is, its success in finding disease-causing variants greatly depends on the allelic structure of common diseases, a question under debate

• Regardless of how we describe human association structure, many questions remain about the relative merits of single-marker vs. haplotype-based strategies for medical association studies

Page 37: The utility of the HapMap reference samples for clinical populations

Acknowledgements

Steve Sherry, Eva Czabarka, Janos Murvai, Alexey Vinokurov, Greg Schuler, Richa Agarwala, Stephen Altschul

Aravinda Chakravarti (Hopkins), Andy Clark (Cornell), Pui-Yan Kwok (UCSF), Henry Harpending (Utah), Jim Weber (Marshfield)

Eric Tsung

http://clavius.bc.edu/~marthlab/MarthLab

[email protected]