
Statistical Human Genetics: Linkage and Association, Haplotyping Algorithms

EECS 458 CWRU

Fall 2004

Readings: Chapters 2 and 3 of An Introduction to Genetic Analysis, Griffiths et al., 2000, Seventh Edition

Some slides from the Lecture notes of Dr. Dan Geiger at http://webcourse.technion.ac.il/236608/

Dr. Terry Speed's Class Homepages at Berkeley: http://www.stat.berkeley.edu/users/terry/Classes/index.html

Roadmap

• Mendel’s laws
• Linkage and the likelihood
• Log-likelihood ratio
• Marker map*
• Interval mapping, multipoint linkage analysis*
• Association
• Haplotyping algorithms

*: will not be covered in this class

Human Genome

Most human cells contain 46 chromosomes:

• 2 sex chromosomes (X, Y): XY in males, XX in females.

• 22 pairs of chromosomes, named autosomes.

Genetic Information

• Gene – basic unit of genetic information. They determine the inherited characters.

• Genome – the collection of genetic information.

• Chromosomes – storage units of genes.

Chromosome Logical Structure

Marker – genes, SNPs, tandem repeats.

Locus – location of a marker.

Allele – one variant form of a marker.

Locus 1, possible alleles: A1, A2

Locus 2, possible alleles: B1, B2, B3

Genotypes and Phenotypes

• At each locus (except on the sex chromosomes) there are 2 genes, one inherited from each parent. These constitute the individual’s genotype at the locus.

• The expression of a genotype is termed a phenotype. For example, hair color, weight, or the presence or absence of a disease.

• Genetics: find the genes underlying phenotypes/diseases

Mendel’s Work

Modern genetics began with Mendel’s experiments on garden peas (although the ramifications of his work were not realized during his lifetime). He studied seven contrasting pairs of characters, including:

• The form of ripe seeds: round, wrinkled
• The color of the seed albumen: yellow, green
• The length of the stem: long, short

Mendel, Gregor. 1866. Experiments on Plant Hybridization. Transactions of the Brünn Natural History Society.

Mendel’s first law

Characters are controlled by pairs of genes which separate during the formation of the reproductive cells (meiosis)

[Figure: sexual reproduction. Meiosis in each Aa parent produces gametes (sperm and egg) carrying either A or a; two gametes fuse to form the zygote.]

P: AA × aa

F1: Aa

F1 × F1 (Aa × Aa):

Gametes   A     a
A         AA    Aa
a         Aa    aa

F2 genotypes: 1 AA : 2 Aa : 1 aa
F2 phenotypes (A dominant): 3 A : 1 a

Test cross (Aa × aa):

Gametes   A     a
a         Aa    aa

Genotypes: 1 Aa : 1 aa
Phenotypes: 1 A : 1 a

Dominant vs. Recessive

• A dominant allele is expressed even if it is paired with a recessive allele.

• A recessive allele is only visible when paired with another recessive allele.

Mendel’s second law

When two or more pairs of genes segregate simultaneously, they do so independently.

A a; B b

Gametes: AB, Ab, aB, ab

P(AB) = P(A) × P(B), P(Ab) = P(A) × P(b), P(aB) = P(a) × P(B), P(ab) = P(a) × P(b)

“Exceptions” to Mendel’s Second Law

Morgan’s fruit fly data (1909): 2,839 flies

Eye color: A = red, a = purple
Wing length: B = normal, b = vestigial

AABB × aabb

AaBb × aabb

            AaBb     Aabb    aaBb    aabb
Expected    710      710     710     710
Observed    1,339    151     154     1,195

The pair AB sticks together more often than expected from Mendel’s second law.

Morgan’s explanation

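[Figure: the cross AB/AB × ab/ab gives F1 AB/ab. Backcrossing F1 to ab/ab gives F2 offspring of the parental types AB/ab and ab/ab, plus the types Ab/ab and aB/ab, in which a crossover has taken place.]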

Recombination Phenomenon (happens during meiosis)

[Figure: recombination during meiosis, in a male or a female, producing a recombinant haplotype.]

Parental types: AaBb, aabb
Recombinants: Aabb, aaBb

The proportion of recombinants between the two genes (or characters) is called the recombination fraction between these two genes.

It is usually denoted by r or θ. For Morgan’s traits: r = (151 + 154)/2839 = 0.107

If r < 1/2: two genes are said to be linked.

If r = 1/2: independent segregation (Mendel’s second law).
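The arithmetic for Morgan's data can be checked with a few lines of Python. This is only a minimal sketch; the function name and the dictionary layout are illustrative assumptions, not part of the original slides.

```python
def recombination_fraction(counts, recombinant_types):
    """Proportion of offspring whose genotype is one of the recombinant types."""
    total = sum(counts.values())
    recombinants = sum(counts[g] for g in recombinant_types)
    return recombinants / total

# Morgan's observed test-cross counts (AaBb x aabb), as given in the slides.
morgan = {"AaBb": 1339, "Aabb": 151, "aaBb": 154, "aabb": 1195}

r = recombination_fraction(morgan, ["Aabb", "aaBb"])
linked = r < 0.5  # r < 1/2 means the two genes are linked

print(f"r = {r:.3f}, linked: {linked}")
```

With these counts r is well below 1/2, which is the sense in which the two genes are said to be linked.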

Purpose of human linkage analysis

To obtain a crude chromosomal location of the gene or genes associated with a phenotype of interest, e.g. a genetic disease or an important quantitative trait.

Examples: cystic fibrosis (found), diabetes, Alzheimer's disease, and blood pressure.

Linkage Strategies I

Traditional (from the 1980s or earlier)
– Linkage analysis on pedigrees
– Association studies: candidate genes
– Allele-sharing methods: affected siblings
– Animal models: identifying candidate genes

Newer (from the 1990s)
– Focus on special populations (Finland)
– Haplotype-sharing (many variants)

Pedigree

[Figure: example pedigrees showing father, mother, and children, each labeled with an ID number and marker genotypes such as {1 2} or {2 2}. The figure marks founders, a nuclear family, a family trio, and a loop.]

Fictitious Example for Finding Disease Genes

We use a marker with codominant alleles A1/A2.

We speculate a locus with alleles H (healthy) / D (affected). If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively physically close.

[Figure: pedigree of five individuals (1 to 5), each labeled with a disease-locus allele (H or D) and a marker genotype (A1/A1, A2/A2, or A1/A2). Phase is inferred for the offspring, and one offspring haplotype is marked as recombinant.]

Linkage Strategies II

On the horizon (here)
– Single-nucleotide polymorphisms (SNPs)
– Functional analyses: finding candidate genes

Needed (starting to happen)
– New multilocus analysis techniques, especially ways of dealing with large pedigrees
– Better phenotypes: ones closer to gene products
– Large collaborations

Horses for courses

• Each of these strategies has its domain of applicability

• Each of them has a different theoretical basis and method of analysis

• Which is appropriate for mapping genes for a disease of interest depends on a number of matters, most importantly the disease, and the population from which the sample comes.

The disease matters

Definition (phenotype), prevalence, features such as age at onset

Genetics: nature of genes (penetrance), number of genes, nature of their contributions (additive, interacting), size of effect

Other relevant variables: sex, obesity, etc.

Genotype-by-environment interactions: exposure to sun.

The population matters

History: pattern of growth, immigration

Composition: homogeneous or melting pot, or in between

Mating patterns: family sizes, mate choice

Frequencies of disease-related alleles, and of marker alleles

Ages of disease-related alleles

Complex traits

Definition vague, but usually thought of as having multiple, possibly interacting loci, with unknown penetrances; and phenocopies.

Affected-only methods are widely used. The jury is still out on which, if any, will succeed.

Few success stories so far.

Important: heart disease, cancer susceptibility, diabetes, … are all “complex” traits.

We focus more on simple traits, where success has been demonstrated very often. About 6-8 percent of human diseases are thought to be simple Mendelian diseases.

Design of gene mapping studies

How good are your data implying a genetic component to your trait? Can you estimate the size of the genetic component?

Have you got, or will you eventually have enough of the right sort of data to have a good chance of getting a definitive result?

Power studies.

Simulations.

Analysis

A very large range of methods/programs is available.

Effort to understand their theory will pay off in leading to the right choice of analysis tools.

Trying everything is not recommended, but not uncommon.

Many opportunities for innovation.

Interpretation of results of analysis

An important issue here is whether you have established linkage. The standards seem to be getting increasingly stringent.

What p-value or LOD should you use?

Dealing with multiple testing, especially in the context of genome scans and the use of multiple models and multiple phenotypes, is one of the big issues.

Replication of results

This has recently become a big issue with complex diseases, especially in psychiatry.

Nature Genetics suggested in May 1998 that they will require replication before publishing results mapping complex traits.

Simulations by Suarez et al (1994) show that sample sizes necessary for replication may be substantially greater than that needed for first detection.


Topics not mentioned

Exclusion mapping, interference, variance component methods, twin studies, non parametric linkage (sib-pair, ibd-based) and much more.

Some of these topics plus others are covered in three books:

Handbook of Human Genetic Linkage by J.D. Terwilliger & J. Ott (1994) Johns Hopkins University Press. Ordered, not available at the library.

Analysis of Human Genetic Linkage by J. Ott, 3rd Edition (1999), Johns Hopkins University Press.

Handbook of Statistical Genetics by Balding et al. (eds.), 2nd Edition (2003), Wiley.

Gene Mapping

image credit: U.S. Department of Energy Human Genome Program

Probability of a pedigree

Input data: marker genotypes M, phenotypes T, and pedigree relationships, (always) with some data missing

Objective: calculate the joint probability P(M, T)

Components: founder probabilities, transmission probabilities, and penetrance probabilities

Method: candidate genes, 2-point analysis, interval mapping, multipoint mapping.

One locus: founder probabilities

Founders are individuals whose parents are not in the pedigree. They may or may not be typed. Either way, we need to assign probabilities to their actual or possible genotypes. This is usually done by assuming Hardy-Weinberg equilibrium. (There is a good story here.) If the frequency of D is .01, H-W says

pr(Dd) = 2 × .01 × .99

Genotypes of founder couples are (usually) treated as independent.

pr(pop Dd, mom dd) = (2 × .01 × .99) × (.99)^2

[Figure: founder couple; father 1 with genotype Dd and mother 2 with genotype dd.]
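The founder probabilities above can be reproduced numerically. The following is a minimal Python sketch under the stated Hardy-Weinberg and independence assumptions; the helper name `hardy_weinberg` is hypothetical.

```python
def hardy_weinberg(q):
    """Genotype probabilities at a biallelic locus where allele D has frequency q."""
    p = 1.0 - q
    return {"DD": q * q, "Dd": 2 * q * p, "dd": p * p}

founder = hardy_weinberg(0.01)

pr_Dd = founder["Dd"]  # 2 x .01 x .99

# Founder couples are treated as independent, so the joint probability
# of (pop Dd, mom dd) is the product of the two genotype probabilities.
pr_couple = founder["Dd"] * founder["dd"]  # (2 x .01 x .99) x (.99)^2

print(pr_Dd, pr_couple)
```

The three genotype probabilities sum to one, which is a quick sanity check on any allele frequency passed in.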

One locus: transmission probabilities

Children get their genes from their parents’ genes, independently, according to Mendel’s laws; also independently for different children.

[Figure: father 1 (Dd) and mother 2 (Dd) with child 3 (dd).]

pr(kid 3 dd | pop 1 Dd & mom 2 Dd) = 1/2 × 1/2

One locus: transmission probabilities - II

[Figure: father 1 (Dd) and mother 2 (Dd) with children 3 (dd), 4 (Dd), and 5 (DD).]

pr(3 dd & 4 Dd & 5 DD | 1 Dd & 2 Dd) = (1/2 × 1/2) × (2 × 1/2 × 1/2) × (1/2 × 1/2).

The factor 2 comes from summing over the two mutually exclusive and equiprobable ways 4 can get a D and a d.

One locus: penetrance probabilities

Pedigree analyses usually suppose that, given the genotype at all loci, and in some cases age and sex, the chance of having a particular phenotype depends only on genotype at one locus, and is independent of all other factors: genotypes at other loci, environment, genotypes and phenotypes of relatives, etc.

Complete penetrance:

pr(affected | DD ) = 1

Incomplete penetrance:

pr(affected | DD ) = .8


One locus: penetrance - II

Age and sex-dependent penetrance (see liability classes)

pr( affected | DD , male, 45 y.o. ) = .6


One locus: putting it all together

Assume penetrances pr(affected | dd) = .1, pr(affected | Dd) = .3, pr(affected | DD) = .8, and that allele D has frequency .01.

The probability of this pedigree is the product:

(2 × .01 × .99 × .7) × (2 × .01 × .99 × .3) × (1/2 × 1/2 × .9) × (2 × 1/2 × 1/2 × .7) × (1/2 × 1/2 × .8)

[Figure: parents 1 (Dd) and 2 (Dd) with children 3 (dd), 4 (Dd), and 5 (DD). In general, shaded means affected and blank means unaffected.]
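The product for this pedigree can be assembled from its three ingredients (founder, transmission, and penetrance probabilities). The Python sketch below is illustrative only: the helper names are hypothetical, and the affection status of each individual is read off the penetrance factors in the product above (.7 and .9 for unaffected, .3 and .8 for affected).

```python
def hw(q):
    """Hardy-Weinberg genotype probabilities when allele D has frequency q."""
    p = 1.0 - q
    return {"DD": q * q, "Dd": 2 * q * p, "dd": p * p}

def transmit(child, father, mother):
    """Mendelian probability of an unordered child genotype given both parents."""
    prob = 0.0
    for a in father:          # each parental allele is passed on with probability 1/2
        for b in mother:
            if sorted(a + b) == sorted(child):
                prob += 0.25
    return prob

PEN = {"dd": 0.1, "Dd": 0.3, "DD": 0.8}  # pr(affected | genotype)

def pen(genotype, affected):
    return PEN[genotype] if affected else 1.0 - PEN[genotype]

founders = hw(0.01)
p1 = founders["Dd"] * pen("Dd", False)              # founder 1: Dd, unaffected
p2 = founders["Dd"] * pen("Dd", True)               # founder 2: Dd, affected
p3 = transmit("dd", "Dd", "Dd") * pen("dd", False)  # child 3: dd, unaffected
p4 = transmit("Dd", "Dd", "Dd") * pen("Dd", False)  # child 4: Dd, unaffected
p5 = transmit("DD", "Dd", "Dd") * pen("DD", True)   # child 5: DD, affected

pedigree_prob = p1 * p2 * p3 * p4 * p5
print(pedigree_prob)
```

Because every genotype is observed here, no summation is needed; with missing genotypes the same factors would be summed over all compatible assignments.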


One locus: putting it all together - II

Note that we begin by multiplying founder gene frequencies, followed by founder penetrances. Next we multiply transmission probabilities, followed by penetrance probabilities of offspring, using their independence given parental genotypes.

If there are missing or incomplete data, we must sum over all mutually exclusive possibilities compatible with the observed data.

The general strategy of beginning with founders, then non-founders, and multiplying and summing as appropriate, has been codified in what is known as the Elston-Stewart algorithm for calculating probabilities over pedigrees. It is one of the two widely used approaches. The other is termed the Lander-Green algorithm and takes a quite different approach.

Both are hidden Markov models, both have compute time/space limitations with multiple individuals/loci (see next), and extending them beyond their current limits is the ongoing outstanding problem.

Two loci: linkage and recombination

[Figure: father 1 (DD TT) and mother 2 (dd tt) with son 3 (Dd Tt).]

Son 3 produces sperm with D-T, D-t, d-T or d-t in proportions:

       T            t
D      (1-θ)/2      θ/2          1/2
d      θ/2          (1-θ)/2      1/2
       1/2          1/2

Two loci: linkage and recombination - II

Son produces sperm with DT, Dt, dT or dt in proportions:

       T            t
D      (1-θ)/2      θ/2          1/2
d      θ/2          (1-θ)/2      1/2
       1/2          1/2

θ = 1/2 : independent assortment (cf Mendel) unlinked loci

θ < 1/2 : linked loci

θ ≈ 0 : tightly linked loci

Note: θ > 1/2 is never observed

If the loci are linked, then D-T and d-t are parental, and

D-t and d-T are recombinant haplotypes

Two loci: estimation of recombination fractions

[Figure: three-generation pedigree. Grandparents DD TT and dd tt; father Dd Tt and mother dd tt; four children with genotypes Dd tt, Dd Tt, Dd Tt, and dd tt.]

Recombination is only discernible in the father. Here θ̂ = 1/4 (why?)

This is called the phase-known double backcross pedigree.

Two loci: phase

Suppose we have data on two linked loci as follows:

[Figure: father Dd Tt, mother dd tt, and daughter Dd Tt.]

Was the daughter’s D-T from her father a parental or recombinant combination? This is the problem of phase: did father get D-T from one parent and d-t from the other? If so, then the daughter's paternally derived haplotype is parental.

If father got D-t from one parent and d-T from the other, these would be parental, and daughter's paternally derived haplotype would be recombinant.

Two loci: dealing with phase

Phase is incompleteness in genetic information, specifically, in parental origin of alleles at heterozygous loci.

Often it can be inferred with certainty from genotype data on parents.

Often it can be inferred with high probability from genotype data on several children.

In general genotype data on relatives helps, but does not necessarily determine phase.

In practice, probabilities must be calculated under all phases compatible with the observed data, and added together. The need to do so is the main reason linkage analysis is computationally intensive, especially with multilocus analyses.


Two loci: founder probabilities

Two-locus founder probabilities are typically calculated assuming linkage equilibrium, i.e. independence of genotypes across loci.

If D and d have frequencies .01 and .99 at one locus, and T and t have frequencies .25 and .75 at a second, linked locus, this assumption means that DT, Dt, dT and dt have frequencies .01 x .25, .01 x .75, .99 x .25 and .99 x .75 respectively. Together with Hardy-Weinberg, this implies that

pr(DdTt) = (2 × .01 × .99) × (2 × .25 × .75)
= 2 × (.01 × .25) × (.99 × .75) + 2 × (.01 × .75) × (.99 × .25).

This last expression adds haplotype pair probabilities.
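The equality of the genotype-based and haplotype-based expressions can be verified numerically. This is a small Python sketch using the allele frequencies from the slide; the variable names are illustrative assumptions.

```python
freq_D, freq_T = 0.01, 0.25  # allele frequencies of D and T from the slide
allele_D = {"D": freq_D, "d": 1 - freq_D}
allele_T = {"T": freq_T, "t": 1 - freq_T}

# Linkage equilibrium: each haplotype frequency is a product of allele frequencies.
hap = {a + b: allele_D[a] * allele_T[b] for a in allele_D for b in allele_T}

# Genotype route: Hardy-Weinberg at each locus, with the loci independent.
via_genotypes = (2 * freq_D * (1 - freq_D)) * (2 * freq_T * (1 - freq_T))

# Haplotype route: sum over the two unordered haplotype pairs that give DdTt,
# namely DT/dt and Dt/dT.
via_haplotypes = 2 * hap["DT"] * hap["dt"] + 2 * hap["Dt"] * hap["dT"]

print(via_genotypes, via_haplotypes)
```

Under linkage disequilibrium the haplotype frequencies would no longer factor, and only the haplotype route would remain valid.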

[Figure: the unphased genotype Dd Tt and its four ordered phase assignments: D|d T|t, D|d t|T, d|D T|t, and d|D t|T.]

Two loci: transmission probabilities

[Figure: father Dd Tt, mother dd tt, and child Dd Tt.]

Initially, this must be done with haplotypes, so that account can be taken of recombination. Then terms like that below are summed over possible phases. Here only the father can exhibit recombination: mother is uninformative.

pr(kid DT/dt | pop DT/dt & mom dt/dt)
= pr(kid DT | pop DT/dt) × pr(kid dt | mom dt/dt)
= (1-θ)/2 × 1.

Two Loci: Penetrance

• In all standard linkage programs, different parts of the phenotype are conditionally independent given all genotypes, and two-locus penetrances split into products of one-locus penetrances. For example, assume the penetrances for DD, Dd and dd given earlier, and that T and t are two alleles at a co-dominant marker locus. Then:

pr(affected & Tt | DD, Tt)
= pr(affected | DD, Tt) × pr(Tt | DD, Tt)
= 0.8 × 1

Two loci: phase unknown double backcross

• We assume below that pop is as likely to be DT/dt as Dt/dT.

[Figure: father Dd Tt, mother dd tt, and four children with genotypes Dd Tt, Dd tt, dd tt, and Dd Tt.]

pr(all data | θ) = pr(parents' data | θ) × pr(kids' data | parents' data, θ)
= pr(parents' data) × {[((1-θ)/2)^3 × θ/2]/2 + [(θ/2)^3 × (1-θ)/2]/2}

This is then maximised in θ, in this case numerically. Here θ̂ = 0.5.
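The numerical maximisation can be illustrated with a simple grid search. The Python sketch below is an assumption-laden illustration, not the method used by any linkage package: it drops the constant factor pr(parents' data), restricts θ to [0, 1/2], and uses a grid step of 0.001.

```python
def kids_likelihood(theta, n_parental=3, n_recombinant=1):
    """Likelihood of the four children, averaged over the father's two phases.

    Under phase DT/dt there are 3 non-recombinants and 1 recombinant;
    under phase Dt/dT the roles are swapped.
    """
    phase1 = ((1 - theta) / 2) ** n_parental * (theta / 2) ** n_recombinant
    phase2 = (theta / 2) ** n_parental * ((1 - theta) / 2) ** n_recombinant
    return 0.5 * phase1 + 0.5 * phase2

# Grid search over theta in [0, 1/2].
grid = [i / 1000 for i in range(0, 501)]
theta_hat = max(grid, key=kids_likelihood)

print(theta_hat)
```

With only four children and unknown phase, the likelihood keeps increasing up to θ = 1/2, so this family on its own provides no evidence for linkage.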

Log (base 10) odds or LOD scores

• Suppose pr(data | θ) is the likelihood function of a recombination fraction θ generated by some 'data', and pr(data | 1/2) is the same likelihood when θ = 1/2.

• Statistical theory tells us that the ratio

L = pr(data | θ*) / pr(data | 1/2)

provides a basis for deciding whether θ = θ* rather than θ = 1/2.

• This can equally well be done with log10 L, i.e.

LOD(θ*) = log10{pr(data | θ*) / pr(data | 1/2)}

measures the relative strength of the data for θ = θ* rather than θ = 1/2. Usually we write θ, not θ*, and calculate the function LOD(θ).

Facts about/interpretation of LOD scores

1. Positive LOD scores suggest stronger support for θ* than for 1/2, negative LOD scores the reverse.

2. Higher LOD scores mean stronger support, lower scores the reverse.

3. LODs are additive across independent pedigrees, and under certain circumstances can be calculated sequentially.

4. For a single two-point linkage analysis, the threshold LOD ≈ 3 has become the de facto standard for "establishing linkage", i.e. rejecting the null hypothesis of no linkage.

5. When more than one locus or model is examined, the remark in 4 must be modified, sometimes dramatically.
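A LOD score is easy to compute when every meiosis is phase-known and fully informative, because the likelihood is then proportional to θ^k (1-θ)^(n-k) for k recombinants in n meioses. The Python sketch below assumes exactly that simplified setting; the family counts and the number of pooled families are hypothetical.

```python
import math

def lod(theta, n_meioses, n_recombinants):
    """LOD(theta) = log10[L(theta) / L(1/2)] for phase-known, informative meioses.

    Requires 0 < theta < 1 so that both logarithms are defined.
    """
    k, n = n_recombinants, n_meioses
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1 - theta)
    log_l_half = n * math.log10(0.5)
    return log_l_theta - log_l_half

# One recombinant in four informative meioses (hypothetical family), at theta = 1/4.
family = lod(0.25, n_meioses=4, n_recombinants=1)

# LODs add across independent pedigrees: pool 14 such families (hypothetical).
pooled = 14 * family

print(round(family, 3), round(pooled, 2), pooled >= 3)
```

A single small family gives a LOD far below 3; it is the additivity across independent pedigrees that makes the threshold reachable in practice.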


Assumptions underpinning most 2-point human linkage analyses

• Founder frequencies: Hardy-Weinberg at each locus. Random mating, linkage equilibrium across loci, known allele frequencies; founders independent.

• Transmission: Mendelian segregation, no mutation.

• Penetrance: single locus, no room for dependence on relatives' phenotypes or environment. Known disease model (including phenocopy rate).

• Implicit: phenotype and genotype data correct, marker order and location correct

• Comment: Some analyses are robust, others can be very sensitive to violations of some of these assumptions. Non-standard linkage analyses can be developed.

Beyond two-point human linkage analysis

The real challenges and more interesting strategies are interval mapping and multipoint linkage analysis, but going there would take more time than we have today.

References

• www.netspace.org/MendelWeb

• HLK Whitehouse: Towards an Understanding of the Mechanism of Heredity, 3rd ed., Arnold 1973

• Kenneth Lange: Mathematical and statistical methods for genetic analysis, Springer 1997

• Elizabeth A Thompson: Statistical inference from genetic data on pedigrees, CBMS, IMS, 2000.

• Jurg Ott: Analysis of human genetic linkage, 3rd edn, Johns Hopkins University Press 1999

• JD Terwilliger & J Ott: Handbook of human genetic linkage, Johns Hopkins University Press 1994.

• Handbook of Statistical Genetics by Balding et al. (eds.), 2nd Edition (2003), Wiley.

Project topic: efficient calculation of the linkage probability

• Elston-Stewart algorithm
• Lander-Green algorithm
• Genehunter (AJHG’96 58:1347-1363)
• Allegro (Nat Genet’00 25(1):12-13)
• Merlin (Nat Genet’02 30(1):97-101)
• Superlink (Bayesian network): Bioinformatics v18 S1: S189-S198 (ISMB02), RECOMB03
• SAGE

Association analysis

• Population based association
• Allelic association and χ2-test
• Linkage disequilibrium
• Limitations of the LD mapping
• Haplotype based (data mining) approaches: HPM, HapMiner
• Haplotype based (statistical) approaches*
• Family based association (TDT)*

*: not covered in this class

Some slides from Päivi Onkamo, Biomedicum & Department of Computer Science, Helsinki

Association Studies

Are they really independent? Coalescent theory!

Genetic association analysis

• Search for significant correlations between gene variants and phenotype
• For example: locus A, genotyped for 100 cases and 100 controls

            Affected    Unaffected
Allele 1    79          46
Allele 2    21          54

Allelic association = an allele is associated with a trait

• Allele 1 seems to be associated with the disease based on the table, but how sure can one be about it?

Affected Healthy Σ

Allele 1 79 46 125

Allele 2 21 54 75

Σ 100 100 200


• The idea is to compare the observed frequencies to frequencies expected under hypothesis of no association between alleles and the occurrence of the disease (independency between variables)

• Test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

where
• o_i is the observed class frequency for class i, e_i the expected frequency (under H0 of no association)
• k is the number of classes in the table
• Degrees of freedom for the test: df = (r-1)(s-1), for a table with r rows and s columns

Expected counts (observed counts in parentheses):

            Affected      Healthy       Σ
Allele 1    62.5 (79)     62.5 (46)     125
Allele 2    37.5 (21)     37.5 (54)     75
Σ           100           100           200

$$\chi^2 = \sum_{i,j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(79-62.5)^2}{62.5} + \frac{(46-62.5)^2}{62.5} + \frac{(21-37.5)^2}{37.5} + \frac{(54-37.5)^2}{37.5} = 23.23$$

df = 1, p << 0.001

Interpretation of the test results

• The p-value is low enough that H0 can be rejected: the probability that the observed frequencies would differ this much (or even more) from the expected ones by coincidence alone is < 0.001

• Multiple testing problem
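The allelic χ2 test on the 2 × 2 table of allele counts can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name is hypothetical, and the p-value line relies on the identity that, for df = 1, the upper tail of the χ2 distribution equals erfc(sqrt(x/2)).

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

observed = [[79, 46],   # allele 1: affected, healthy
            [21, 54]]   # allele 2: affected, healthy

chi2 = chi_square(observed)

# Upper-tail probability for df = 1 only.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value)
```

When many markers are tested this way, each p-value has to be judged against a threshold corrected for the number of tests, which is the multiple testing problem noted above.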


• Genetic association is a population-level correlation between some known genetic variant and a trait: an allele is over-represented in affected individuals

• From a genetic point of view, an association does not imply a causal relationship. Two-step strategy: do linkage analysis first to find a candidate region, then do association fine mapping.

• Often, a gene is not a direct cause of the disease, but is in LD with a causative gene

Haplotype Frequencies

                     Locus B
                 B       b       Total
Locus A    A     πAB     πAb     πA
           a     πaB     πab     πa
       Total     πB      πb      1.0

Linkage disequilibrium (LD)

• Linkage equilibrium: πAB = πAπB. If this holds, then πAb = πA - πAB = πA - πAπB = πAπb; similarly, πaB = πaπB and πab = πaπb.

• Any deviation from these values implies LD. There are many reasons that cause LD.

• Under the random mating assumption, LD will decay generation by generation, mainly due to recombination

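[Figure: how new haplotypes arise. Mutations: haplotypes A-G and C-G before mutations; A-G, C-G, and C-C after mutations. Recombination: after recombinations, the haplotypes A-G, C-G, C-C, and A-C are all present.]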

Measures of Linkage Disequilibrium

• D := πAB - πAπB. Thus πAB = πAπB + D, which implies πAb = πA - πAB = πA - πAπB - D = πAπb - D; similarly, πaB = πaπB - D and πab = πaπb + D.

• So D ≥ -πAπB, D ≥ -πaπb, D ≤ πAπb, D ≤ πaπB (because haplotype frequencies cannot be negative).

• D is hard to interpret:
– Sign is arbitrary (one could set A, B to be the common alleles and a, b to be the minor alleles)
– The range of D depends on allele frequencies, so it is hard to compare between markers

• Alternative measures: D’, r².

• Devlin B., Risch N. (1995) A Comparison of Linkage Disequilibrium Measures for Fine-Scale Mapping. Genomics 29:311-322

Alternative measures: D’

$$D' = \begin{cases} \dfrac{D}{\min(\pi_A\pi_B,\ \pi_a\pi_b)} & \text{if } D < 0 \\[2ex] \dfrac{D}{\min(\pi_A\pi_b,\ \pi_a\pi_B)} & \text{if } D \ge 0 \end{cases}$$

• Ranges between –1 and +1
– More likely to take extreme values when allele frequencies are small
– ±1 implies at least one of the haplotypes was not observed

Alternative measures: r²

$$r^2 = \frac{D^2}{\pi_A \pi_a \pi_B \pi_b}$$

• Ranges between 0 and 1
– 1 when D achieves its maximum/minimum and the two markers provide identical information
– 0 when they are in perfect equilibrium
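The three LD measures can be computed together from the four haplotype frequencies. The Python sketch below is illustrative: it assumes the normalisation D' = D / D_max, where D_max is the largest magnitude D can take given the allele frequencies and the sign of D (so that D' lies between -1 and +1), and it assumes both loci are polymorphic so that r² is defined. The function name and the example frequencies are hypothetical.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies."""
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    D = p_AB - p_A * p_B
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)   # from D <= pi_A pi_b and D <= pi_a pi_B
    else:
        d_max = min(p_A * p_B, p_a * p_b)   # from D >= -pi_A pi_B and D >= -pi_a pi_b
    d_prime = D / d_max if d_max > 0 else 0.0
    r2 = D * D / (p_A * p_a * p_B * p_b)
    return D, d_prime, r2

# Only haplotypes AB and ab exist: the markers carry identical information.
perfect = ld_measures(0.5, 0.0, 0.0, 0.5)

# Haplotype aB is missing, but the allele frequencies differ between loci.
one_missing = ld_measures(0.4, 0.1, 0.0, 0.5)

# All four haplotypes at their equilibrium frequencies.
equilibrium = ld_measures(0.25, 0.25, 0.25, 0.25)

print(perfect, one_missing, equilibrium)
```

The second example shows the difference between the two measures: a single unobserved haplotype is enough to push D' to 1, whereas r² stays below 1 unless the two markers are fully redundant.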

Linkage disequilibrium (LD)

• Closely located genes often exhibit linkage disequilibrium with each other, e.g. locus 1 with alleles A and a, and locus 2 with alleles B and b, at a distance of a few centiMorgans from each other. LD may also be due to many other reasons.

• LD follows from the fact that closely located genes are transmitted as a “block” which only rarely breaks up in meioses

• An example:
– Locus 1: marker gene
– Locus 2: disease locus, with allele b as dominant susceptibility allele with 100% penetrance

An example

• Association evaluated → Locus 1 also seems associated, even though it has nothing to do with the disease: the association is observed just due to LD

LD mapping – utilizing founder effect

• A new disease mutation born n generations ago in a relatively small, isolated population
• The original ancestral haplotype slowly decays as a function of generations
• In the last generation, only small stretches of founder haplotype can be observed in the disease-associated chromosomes

Linkage Disequilibrium Mapping

[Figure: ancestral haplotypes compared with present-day haplotypes of affected and normal individuals.]

Data: Searching for a needle in a haystack

[Figure: table of marker data with one row per chromosome. Columns give the disease status (a or c) and the alleles (1 or 2) at markers SNP1, S2, ...; "?" marks missing values. The position of the disease gene among the markers is unknown.]

• Task is to find either an allele or an allele string (haplotype) which is overrepresented in disease-associated chromosomes
– markers may vary: SNPs, microsatellites
– populations vary: the strength of marker-to-marker LD

• Many approaches:
– “old-fashioned” allele association with some simple test (problem: multiple testing)
– TDT; modelling of LD process: Bayesian, EM algorithm, integrated linkage & LD

Haplotype Based Methods for Case-Control Data

100 normal and 100 affected, 2 loci

L1    L2    # Cases    # Controls
1     2     50         10
2     1     50         10
1     1     0          40
2     2     0          40

Allele frequencies

L1 allele    # Cases    # Controls
1            50         50
2            50         50

L2 allele    # Cases    # Controls
1            50         50
2            50         50

Limitations: LD is a random process

• LD is a continuous process, which is created and decreased by several factors:
– genetic drift
– population structure
– natural selection
– new mutations
– founder effect
– recombination

→ limits the accuracy of association mapping

Research challenges …

• Haplotyping methods needed as prerequisite for association/LD methods

• …or, searching association directly from genotype data (without the haplotyping stage)

• Better methods for measurement of the association (and/or the effects of the genes)

• Taking disease models into consideration

Haplotype Pattern Mining (HPM)

AJHG 67:133-145, 2000

• Search the haplotype data for recurrent patterns with no pre-specified sequence

• Patterns may contain gaps, taking into consideration missing and erroneous data

• The patterns are evaluated for their strength of association

• Markerwise ‘score’ of association is calculated

Algorithm

1. Find a set of associated haplotype patterns
– number of gaps allowed (2)
– maximum gap length (1 marker)
– maximum pattern length (7 markers)
– association threshold (χ2 = 9)

2. Score loci based on the patterns

Evaluate significance by permutation tests

Extendable to quantitative traits

Extendable to multiple genes

Example: a set of associated patterns

Marker   01  02  03  04  05  06  07  08   χ2
P1       2   1   2   2   2   *   *   *    9.6
P2       2   1   2   2   2   1   *   *    9.2
P3       2   1   2   2   *   1   1   *    8.9
P4       2   1   *   2   1   *   *   *    8.1
P5       1   *   1   2   2   *   *   *    7.4
P6       *   *   1   2   2   1   2   *    7.1
P7       *   2   1   2   *   *   *   *    7.1
P8       2   1   1   2   *   *   *   *    6.9
P9       2   1   1   *   *   *   *   *    6.8
Score    5   6   7   7   6   3   2   0

Pattern selection

• The set of potential patterns is large.
• Depth-first search for all potential patterns
• Search parameters limit search space:
– number of gaps
– maximum gap length
– maximum pattern length
– association threshold

Score and localization: an example

Permutation tests

• random permutation of the status fields of the chromosomes

• 10,000 permutations
• HPM and marker scores recalculated for each permuted data set
• proportion of permuted data sets in which score > true score → empirical p-value.
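The permutation procedure can be sketched generically in Python. Note the substitution: instead of HPM marker scores, this illustration uses a plain 2 × 2 χ2 statistic as the score, and it runs 1,000 rather than 10,000 permutations to keep the example fast. The data are chromosome-level labels built to match the allele table from the association example (79/21 alleles among affected, 46/54 among healthy); function names and the seed are assumptions.

```python
import random

def chi2_stat(alleles, status):
    """Pearson chi-square for a 2x2 allele-by-status table (both coded 0/1)."""
    n = len(alleles)
    counts = [[0, 0], [0, 0]]
    for a, s in zip(alleles, status):
        counts[a][s] += 1
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = counts[i][0] + counts[i][1]
            col = counts[0][j] + counts[1][j]
            expected = row * col / n
            stat += (counts[i][j] - expected) ** 2 / expected
    return stat

def permutation_p_value(alleles, status, n_perm=1000, seed=0):
    """Proportion of permuted data sets whose score exceeds the true score."""
    rng = random.Random(seed)
    observed = chi2_stat(alleles, status)
    shuffled = list(status)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the status field only
        if chi2_stat(alleles, shuffled) > observed:
            exceed += 1
    return exceed / n_perm

# allele 1 coded as 1, allele 2 as 0; status 1 = affected, 0 = healthy
alleles = [1] * 79 + [0] * 21 + [1] * 46 + [0] * 54
status = [1] * 100 + [0] * 100

p_emp = permutation_p_value(alleles, status)
print(p_emp)
```

Permuting only the status labels keeps the marker data, and hence the LD structure between markers, intact, which is why the empirical p-value accounts for correlated tests.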

HapMiner

• Li and Jiang 2004


A New Haplotype Similarity Measure

• Combination of the length of the longest common sub-string and the number of matched base pairs;

• Both measures have been used, and they correspond to recombination events and point mutations, respectively;

• Weighted according to the distance from the central point.

A New Haplotype Similarity Measure: example

Unweighted:

h1: 1 1 2 1 2
h2: 1 2 2 2 2
Score: 3

h3: 1 1 2 2 1
h4: 2 1 2 2 2
Score: 5

Weighted:

Weight1 (markers): 0.8 0.9 1 0.9 0.8
Weight2 (intervals between adjacent markers): 0.9 1 1 0.9

h1, h2: score 2.6
h3, h4: score 4.8
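The four example scores are consistent with one reading of the measure: the (weighted) number of matching markers plus the (weighted) number of marker intervals inside the longest run of consecutive matches. The Python sketch below implements that reading only; it is an inference from the example values, not the published HapMiner definition, and the function name and tie-breaking rule (first longest run) are assumptions.

```python
def similarity(h1, h2, marker_w=None, interval_w=None):
    """Matching-marker term plus longest-common-run term, both optionally weighted."""
    n = len(h1)
    marker_w = marker_w or [1.0] * n
    interval_w = interval_w or [1.0] * (n - 1)
    match = [a == b for a, b in zip(h1, h2)]

    # Term 1: weighted count of matching markers.
    score = sum(w for w, m in zip(marker_w, match) if m)

    # Term 2: weighted intervals inside the longest run of consecutive matches.
    best_start, best_len, start = 0, 0, None
    for i, m in enumerate(match + [False]):   # sentinel closes a trailing run
        if m and start is None:
            start = i
        elif not m and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_len > 1:
        score += sum(interval_w[best_start:best_start + best_len - 1])
    return score

h1, h2 = [1, 1, 2, 1, 2], [1, 2, 2, 2, 2]
h3, h4 = [1, 1, 2, 2, 1], [2, 1, 2, 2, 2]
W1 = [0.8, 0.9, 1.0, 0.9, 0.8]   # marker weights from the slide
W2 = [0.9, 1.0, 1.0, 0.9]        # interval weights from the slide

print(similarity(h1, h2), similarity(h3, h4))
print(similarity(h1, h2, W1, W2), similarity(h3, h4, W1, W2))
```

Both pairs share three alleles, but only h3 and h4 share them contiguously, so the run term is what separates a likely shared segment from scattered chance matches.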

Whole Genome Scan


The World Is Not Perfect!

• Penetrance (Not everyone with disease-mutant alleles is affected.)

• Phenocopies (Affected individuals may not have disease-mutant alleles; more than 90%.)

• Data with noise

A Density-Based Clustering Algorithm

Haplotype Association Mapping

A contingency table for each cluster:

              #case     #control
Cluster C     m’        n’
Remaining     m-m’      n-n’

Algorithm

1. for each marker i
2. consider the haplotype segment surrounding it
3. apply the density-based clustering algorithm
4. calculate z-score for each cluster
5. output the max z-score and the associated cluster

Properties

• Data mining approach (clustering)
• Nonparametric/model-free
• Ideal for fine mapping; scalable to whole-genome scans
• Population based haplotype association (individual haplotypes)
• No assumptions on haplotype structure
• Reports DS positions as well as haplotype patterns

Experimental Data

• Public data sets (Toivonen et al. 00)
• Isolated population, size from 300 to 100,000 in 500 years
• 100 cM, microsatellite/SNP marker
• Dominant disease
• Proportion of mutation-carrying chromosomes: 2.5%, 5%, 7.5%, 10%
• Sample sizes: 200/400

Results

Simulated data from Toivonen et al. 00

Results (cont’d)

Results (cont’d)


HLA Data Set

• Public data set (Herr et al. 00)
• 25 markers, 14 Mb on chromosome 6
• Known type 1 diabetes-susceptibility locus
• 89 from 385 families (normal parents with 2 affected children)
• Haplotyping by our PedPhase program
• Haplotypes in children are cases (213), untransmitted are controls (143)

Results on a Real Data Set

Herr et al. 2000

Our results: highest score of 3.72 at D6S2444.

Recent developments

• Permutation tests
• Multiple genes and gene-gene interactions
• Genotype vectors

Benefits & drawbacks

• Non-parametric, yet efficient approach; no disease model specification is needed (+)
• Powerful even with weak genetic effects and small data sets (+)
• Robust to genotyping errors, mutations, missing data (+)
• Optimal pattern search parameters may need to be specified case-wise (-)
• No rigid statistical theory background (-)

• For HPM, the significance of the patterns cannot be assessed; the frequencies of patterns depend on the sample size

• HapMiner shows more consistent results with increasing marker density. Can be extended to multiple genes, genotype vectors.

• New haplotype-based statistical methods
– McPeek and Strahs (1999)
– Liu et al. (2001)
– Molitor & Thomas (2003)
– Tzeng et al. (2003)
