
1

Cladistic Clustering of Haplotypes in Association Analysis

Jung-Ying Tzeng

Aug 27, 2004

Department of Statistics & Bioinformatics Research Center

North Carolina State University

2

Simple Disorder vs. Complex Disorder

Peltonen and McKusick (2001). Science

3

Complex Disorders

Liability genes = genes containing variants increasing disease liability

Goal: look for such genes

• Rely more on epidemiological evidence

Association analysis

• Case-control studies

• Detect liability genes by searching for association between disease status and genetic variants

4

Genetic Markers

Instead of studying whole DNA sequences, we look at a subset of them: genetic markers

SNP: Single Nucleotide Polymorphism

• Pro: dense; roughly one SNP every 100-300 bp

• Con: binary variants; resolved by considering adjacent SNPs jointly

5

Haplotype-based Association Analysis

Haplotype = marker sequence

Haplotype-based association analysis

[Figure: two example haplotypes, T C T C and C A C A, and a table with rows Hap 1, Hap 2, Hap 3, ..., Hap k and columns Case and Control]
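A minimal sketch of a haplotype-based association test, assuming the comparison is a Pearson chi-square test of homogeneity on a k × 2 table of haplotype counts (rows Hap 1 to Hap k, columns Case and Control). The function name and the counts are hypothetical and not from the slides.

```python
def chi_square_homogeneity(case_counts, control_counts):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table."""
    if len(case_counts) != len(control_counts):
        raise ValueError("need one case and one control count per haplotype")
    n_case = sum(case_counts)
    n_control = sum(control_counts)
    total = n_case + n_control
    stat = 0.0
    rows_used = 0
    for a, b in zip(case_counts, control_counts):
        row = a + b
        if row == 0:
            continue  # haplotype not observed in either group
        rows_used += 1
        expected_case = row * n_case / total
        expected_control = row * n_control / total
        stat += (a - expected_case) ** 2 / expected_case
        stat += (b - expected_control) ** 2 / expected_control
    return stat, rows_used - 1


# Hypothetical counts for k = 4 haplotypes (400 chromosomes per group).
cases = [120, 150, 80, 50]
controls = [150, 140, 70, 40]
stat, df = chi_square_homogeneity(cases, controls)
print(f"chi-square = {stat:.3f} on {df} degrees of freedom")
```

The degrees of freedom grow with k, which is the dimensionality problem the following slides address.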

6

Haplotype-based Association Analysis

Problem: findings are not replicable

• Under-powered (Lohmueller et al. 2003; Neale and Sham 2004)

Solution:

1. Use large samples (Lohmueller et al. 2003)

2. Reduce the dimension of the parameter space

7

Dimensionality

Haplotype distribution within a block

Daly et al. (2001) Nature Genetics

Method I: Truncating

[Figure legend: the marked SNPs are tag SNPs]

8

Method II: Clustering (Molitor et al. 2003; Durrant et al. 2004)

Evolutionary tree of haplotypes

Minimize the haplotype distance within clusters

[Figure: evolutionary tree of the haplotypes 000000, 100000, 010000, 000100, 100001, 110000, 011000, 100011, 100101, 101001, 011001, 111000]
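A small sketch of the distance notion behind clustering, assuming "haplotype distance" means the number of differing marker positions (Hamming distance). The haplotype list is taken from the tree on this slide; the function names and the example clusters are hypothetical.

```python
from itertools import combinations

HAPLOTYPES = [
    "000000", "100000", "010000", "000100", "100001", "110000",
    "011000", "100011", "100101", "101001", "011001", "111000",
]


def hamming(a, b):
    """Number of marker positions at which two haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of markers")
    return sum(x != y for x, y in zip(a, b))


def one_step_neighbors(haplotypes):
    """Map each haplotype to the haplotypes that differ from it at one marker."""
    neighbors = {h: set() for h in haplotypes}
    for a, b in combinations(haplotypes, 2):
        if hamming(a, b) == 1:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def within_cluster_distance(clusters):
    """Sum of pairwise Hamming distances inside each cluster."""
    return sum(
        hamming(a, b) for cluster in clusters for a, b in combinations(cluster, 2)
    )


neighbors = one_step_neighbors(HAPLOTYPES)
print(sorted(neighbors["100000"]))
```

A clustering method would search for the partition that keeps `within_cluster_distance` small; the search itself is not shown here.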

9

Method II: Clustering

[Figure: the evolutionary tree of the 12 haplotypes, repeated from the previous slide]

10

Method II: Clustering

[Figure: the evolutionary tree of the 12 haplotypes, repeated from the previous slide]

11

Method III: Cladistic Grouping (Templeton 1995; Seltman et al. 2003)

Observed Hap = { 000, 001, 010, 100, 110, 101, 011, 111 }

[Figure: cladogram connecting the eight observed haplotypes]

12

Desired Features

Include all samples

Incorporate both haplotype distance and age

• High frequency: ancient (Crandall & Templeton 1995)

• Low frequency: young

Allow uncertainty in inferring the underlying evolutionary relationship

13

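Proposed Approach: Cladistic Clustering

Possible Hap = { 000, 001, 010, 100, 110, 101, 011, 111 }

• Level (2): { 110 }

• Level (1): { 000, 010, 111, 100 }

• Level (0): { 001, 011, 101 }

[Figure: haplotypes arranged by level, with $B_{(2)}$ mapping level (2) to level (1) and $B_{(1)}$ mapping level (1) to level (0); edges carry allocation probabilities $p$, $1 - p$ and $q_1$, $q_2$, $1 - q_1 - q_2$]

One-step regrouping:

$\pi^{*T}_{(i)} = \pi^{T}_{(i)} + \pi^{*T}_{(i+1)} B_{(i+1)}$, with $\pi^{*}_{(2)} = \pi_{(2)}$ at the top level

Overall:

$\pi^{*T} = \pi^{T} B$, where $\pi^{T} = (\pi^{T}_{(0)}, \pi^{T}_{(1)}, \pi^{T}_{(2)})$ and $B = \begin{pmatrix} I \\ B_{(1)} \\ B_{(2)} B_{(1)} \end{pmatrix}$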

14

Issues

1. Determine the major nodes $\pi_{(0)}$

2. Construct the conditional allocating matrices $B_{(i)}$

15

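Conditional Allocating Matrix $B_{(i)}$

• Level (2): { 110 }

• Level (1): { 000, 010, 100, 111 }

• Level (0): { 001, 011, 101 }

$\pi^{*T}_{(1)} = \pi^{T}_{(2)} B_{(2)} + \pi^{T}_{(1)}$

$B_{(2)}$ is defined from $C = (c \;\, c \;\, c \;\, c)$, with row 110 and columns 000, 010, 100, 111

• Entries $c \in [0, 1]$: likelihood of a one-step movement

[Figure: $B_{(2)}$ links 110 to the level (1) haplotypes 111, 010, 100 and 000]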
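A hedged sketch of how one row of a conditional allocating matrix could be built. The slides only state that the entries lie in [0, 1] and reflect the likelihood of a one-step movement; the rule used here (allocate only to one-step neighbors, in proportion to their frequencies) is an assumption for illustration, and the frequencies are hypothetical.

```python
def hamming(a, b):
    """Number of marker positions at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def allocation_matrix(sources, targets, target_freq):
    """Row-stochastic matrix with one row per source and one column per target.

    Assumption (not from the slides): a source sends mass only to targets one
    step away, with weights proportional to the target frequencies.
    """
    matrix = []
    for s in sources:
        weights = [
            target_freq[t] if hamming(s, t) == 1 else 0.0 for t in targets
        ]
        total = sum(weights)
        if total == 0:
            raise ValueError(f"{s} has no one-step neighbor among the targets")
        matrix.append([w / total for w in weights])
    return matrix


# Level (1) haplotypes in the column order used on the slide.
level1 = ["000", "010", "100", "111"]
# Hypothetical frequencies, for illustration only.
level1_freq = {"000": 0.10, "010": 0.05, "100": 0.10, "111": 0.05}

B2 = allocation_matrix(["110"], level1, level1_freq)
print(B2)
```

Under this rule the column for 000 is zero, because 000 is two steps away from 110.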

16

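Conditional Allocating Matrix $B_{(i)}$

$\pi^{*T} = \pi^{T}_{(0)} + \pi^{T}_{(1)} B_{(1)} + \pi^{T}_{(2)} B_{(2)} B_{(1)}$

$B_{(1)}$: rows 000, 010, 100, 111 (level (1)); columns 001, 011, 101 (level (0))

[Figure: haplotypes arranged by level, with the entries of $B_{(1)}$]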
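A worked numerical sketch of the regrouping formula, checking that the one-step recursion and the expanded sum give the same clustered frequencies. All numbers (p, q1, q2 and the frequency vectors) are hypothetical, and the zero pattern of the matrices assumes that mass moves only between haplotypes one step apart.

```python
def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mat_mat(a, b):
    """Matrix product, computed row by row."""
    return [vec_mat(row, b) for row in a]


def add(u, v):
    """Elementwise sum of two vectors."""
    return [x + y for x, y in zip(u, v)]


# Level (0): 001, 011, 101; level (1): 000, 010, 100, 111; level (2): 110.
p, q1, q2 = 0.5, 0.3, 0.3  # hypothetical allocation probabilities

B1 = [
    [1.0, 0.0, 0.0],    # 000 -> 001
    [0.0, 1.0, 0.0],    # 010 -> 011
    [0.0, 0.0, 1.0],    # 100 -> 101
    [0.0, p, 1.0 - p],  # 111 -> 011 or 101
]
B2 = [[0.0, q1, q2, 1.0 - q1 - q2]]  # 110 -> 010, 100 or 111

# Hypothetical haplotype frequencies by level (they sum to 1).
pi0 = [0.30, 0.25, 0.20]
pi1 = [0.08, 0.06, 0.05, 0.04]
pi2 = [0.02]

# Expanded form: pi(0) + pi(1) B(1) + pi(2) B(2) B(1).
direct = add(add(pi0, vec_mat(pi1, B1)), vec_mat(pi2, mat_mat(B2, B1)))

# Recursive form: regroup level (2) into level (1), then level (1) into level (0).
pi1_star = add(pi1, vec_mat(pi2, B2))
pi_star = add(pi0, vec_mat(pi1_star, B1))

print([round(x, 4) for x in pi_star])
```

Because every row of the allocation matrices sums to 1, the clustered frequencies still sum to 1.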

17

Determine $\pi_{(0)}$

Information criteria

• Net Information (Shannon's Information content)

$\sum_{i=1}^{k} \pi_i \log_2 (1/\pi_i) - \log_2(k)/n$
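A minimal sketch of the information criterion, assuming net information is Shannon's information content of the k retained frequencies minus a penalty of log2(k)/n, where n is the sample size. The penalty form is an assumption, and the function names are hypothetical.

```python
import math


def shannon_information(freqs):
    """Sum of pi_i * log2(1 / pi_i) over the nonzero frequencies, in bits."""
    return sum(p * math.log2(1.0 / p) for p in freqs if p > 0)


def net_information(freqs, n):
    """Shannon information minus an assumed penalty log2(k) / n."""
    k = len(freqs)
    return shannon_information(freqs) - math.log2(k) / n


# Hypothetical example: four equally frequent haplotypes, 400 chromosomes.
freqs = [0.25, 0.25, 0.25, 0.25]
print(shannon_information(freqs), net_information(freqs, 400))
```

How the retained frequencies are formed for each candidate k (for example, whether the remainder is pooled or dropped) is not specified on the slide and is left out of this sketch.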

18

Net Information and $\pi_{(0)}$

19

Association Analysis Based on $\pi^{*}$

Coalescent simulation (Hudson 2002):

• Prevalence = 0.01

• Relative risk = 2

• Frequencies of liability allele = (0.1, 0.3, 0.5)

• Location of liability allele = (hot spot, blocky, very blocky)

• Draw 200 cases and 200 controls

Test of homogeneity based on $\pi^{*}_{cs}$ and $\pi^{*}_{cn}$ (cases and controls)
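A worked sketch of what these settings imply for the liability allele in cases and controls. The slides do not state the genetic model, so this assumes a simple per-chromosome model in which a chromosome carrying the liability allele has relative risk times the baseline risk; the function name is hypothetical.

```python
def case_control_allele_freqs(prevalence, relative_risk, allele_freq):
    """Liability-allele frequency among cases and among controls.

    Assumed model (not from the slides): risk is `baseline` without the
    allele and `relative_risk * baseline` with it, per chromosome.
    """
    baseline = prevalence / (1.0 - allele_freq + relative_risk * allele_freq)
    risk_with_allele = relative_risk * baseline
    freq_in_cases = risk_with_allele * allele_freq / prevalence
    freq_in_controls = (1.0 - risk_with_allele) * allele_freq / (1.0 - prevalence)
    return freq_in_cases, freq_in_controls


for q in (0.1, 0.3, 0.5):
    in_cases, in_controls = case_control_allele_freqs(0.01, 2.0, q)
    print(f"q = {q}: cases {in_cases:.4f}, controls {in_controls:.4f}")
```

With a prevalence of 0.01 the control frequency stays close to the population frequency, so the case-control contrast is driven almost entirely by the enrichment among cases.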

20

Power and Type I error

[Figure: power and type I error, with panels for Gene Pelc and Gene IL01RB]

21

Summary

Provide a mechanism of cladistic clustering by $\pi^{*T} = \pi^{T} B$

• Combine the ideas of Truncating and Clustering

• Based on evolutionary relationships without reconstructing the cladogram

• Incorporate haplotype frequencies and distance in cluster assignment

• One-step conditional regrouping can accommodate multiple-step regrouping: self-repeating, algebraically multiplicative

• Reserve $\pi_{(0)}$ based on information criteria

$\pi^{*}$ increases test efficiency

• Increased power even for large samples and haplotypes in block regions

22

End of Slides

23

Approach

Two stages:

• Stage I (Where): identify the susceptible regions across the genome (multiple testing problem)

Approaches based on haplotype similarity

• Stage II (Which): determine and pinpoint the specific liability variants

Study individual effects of groups of haplotypes

24

Strategies of Reducing Degrees of Freedom

I. Haplotype Similarity

• Van Der Meulen and te Meerman 1997; Bourgain et al. 2000-2002; Tzeng et al. 2003ab

• Search for extra haplotype sharing among cases

• Pro: 1 degree of freedom

• Con: does not study individual haplotype effects

• Usage: good for genome screening

25


II. Haplotype Tagging (Johnson et al. 2001)

(. = same allele as haplotype 1)

Hap   SNPs                           Freq (%)   Tag SNPs
1     A C A C C C C C G G G C C G    45         1    A C G
2     . . . . . . . . . . . A . .    20         2    . A .
3     C T T G . T A T T A . . . .    13.25      3    T . .
4     . . . . . . . . . . . . . A    11.25      4    . . A
5     C . T . T . A . . . A A . .    3.75       5    T A .
6     . . . . . . . . . . . . T .    3.50       (1)  . . .
7     C . . . . . . . . . . . . .    1.50       (1)  . . .
8     C . T . T . A . . . . . . .    0.50       6    T . .
9     . T T G . T A T T A . . . .    0.50       (6)  T . .

• Pro: efficiently capture the major diversity

• Con: discard rare haplotypes
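A small sketch of what tagging does to the haplotype table on this slide: each haplotype is reduced to its alleles at the tag SNPs, and haplotypes that become identical are pooled. The tag positions (SNPs 3, 12 and 14) are inferred by matching the three-column tag patterns against the full table, so treat them as an assumption; the function names are hypothetical.

```python
REFERENCE = "A C A C C C C C G G G C C G".split()

# (label, pattern relative to haplotype 1, frequency in %), from the slide.
ROWS = [
    ("1", "A C A C C C C C G G G C C G", 45.0),
    ("2", ". . . . . . . . . . . A . .", 20.0),
    ("3", "C T T G . T A T T A . . . .", 13.25),
    ("4", ". . . . . . . . . . . . . A", 11.25),
    ("5", "C . T . T . A . . . A A . .", 3.75),
    ("6", ". . . . . . . . . . . . T .", 3.5),
    ("7", "C . . . . . . . . . . . . .", 1.5),
    ("8", "C . T . T . A . . . . . . .", 0.5),
    ("9", ". T T G . T A T T A . . . .", 0.5),
]

TAG_COLUMNS = [2, 11, 13]  # 0-based indices of SNPs 3, 12, 14 (inferred)


def expand(pattern):
    """Replace each '.' with the allele of haplotype 1 at that position."""
    return [ref if s == "." else s for s, ref in zip(pattern.split(), REFERENCE)]


def collapse_on_tags(rows, tag_columns):
    """Pool frequencies of haplotypes that look identical at the tag SNPs."""
    pooled = {}
    for _label, pattern, freq in rows:
        full = expand(pattern)
        key = "".join(full[c] for c in tag_columns)
        pooled[key] = pooled.get(key, 0.0) + freq
    return pooled


pooled = collapse_on_tags(ROWS, TAG_COLUMNS)
for tag_haplotype, freq in pooled.items():
    print(tag_haplotype, freq)
```

Nine haplotypes reduce to five tag patterns, which is the reduction in dimension; the price is that the rare haplotypes 6 to 9 are no longer distinguishable from the common ones.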

26

III. Haplotype Clustering

• Molitor et al. 2003; Seltman et al. 2001, 2003; Durrant et al. 2004

• Similar haplotypes induce similar liability effects

• Cluster haplotypes and perform analysis based on clusters of haplotypes

• Pro: incorporates all data

• Con: may cluster two major haplotypes in the same group


27

Approach

Two stages:

• Stage I (Where): identify the susceptible regions across the genome (multiple testing problem)

Approaches based on haplotype similarity

• Stage II (Which): determine and pinpoint the specific liability variants

Study individual effects of groups of haplotypes

28

Haplotype Grouping

Focus on Stage II

Combine the pros of haplotype tagging and clustering
