1
Cladistic Clustering of Haplotypes
in Association Analysis
Jung-Ying Tzeng
Aug 27, 2004
Department of Statistics & Bioinformatics Research Center
North Carolina State University
2
Simple Disorder vs. Complex Disorder
Peltonen and McKusick (2001). Science
3
Complex Disorders
Liability genes = genes containing variants that increase disease liability
Goal: look for such genes
• Rely more on epidemiological evidence
Association analysis
• Case-control studies
• Detect liability genes by searching for association between disease status and genetic variants
4
Genetic Markers
Instead of studying whole DNA sequences, we look at a subset of them: genetic markers
SNP: Single Nucleotide Polymorphism
• Pro: dense (one every 100-300 bp)
• Con: binary variants; resolved by considering adjacent SNPs jointly
5
Haplotype-based Association Analysis
Haplotype = marker sequence
Haplotype-based association analysis
[Figure: example haplotypes TCTC and CACA; table with rows Hap 1, Hap 2, Hap 3, ..., Hap k and columns Case and Control]
6
Haplotype-based Association Analysis
Problem: findings are not replicable
• Under-powered (Lohmueller et al. 2003; Neale and Sham 2004)
Solution:
1. Use large samples (Lohmueller et al. 2003)
2. Reduce the dimension of the parameter space
7
Dimensionality
Haplotype distribution within a block
Daly et al. (2001) Nature Genetics
Method I: Truncating (tag SNPs)
8
Method II: Clustering (Molitor et al. 2003; Durrant et al. 2004)
Evolutionary tree of haplotypes
Minimize the haplotype distance within clusters
[Figure: evolutionary tree of the haplotypes 000000, 100000, 100001, 100011, 100101, 101001, 110000, 010000, 011001, 000100, 011000, 111000]
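The idea of keeping haplotype distance small within clusters can be illustrated with a minimal sketch. It assumes distance means the number of differing marker positions (Hamming distance) and uses two hand-picked cluster centers; both the centers and the nearest-center rule are hypothetical choices for illustration, not the procedures of Molitor et al. or Durrant et al.

```python
def hamming(a: str, b: str) -> int:
    """Number of marker positions at which two haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of markers")
    return sum(x != y for x, y in zip(a, b))


def assign_to_nearest(haplotypes, centers):
    """Assign each haplotype to its closest center (ties go to the first center)."""
    clusters = {c: [] for c in centers}
    for h in haplotypes:
        nearest = min(centers, key=lambda c: hamming(h, c))
        clusters[nearest].append(h)
    return clusters


# Haplotypes shown on the tree in this slide.
haplotypes = [
    "000000", "100000", "100001", "100011", "100101", "101001",
    "110000", "010000", "011001", "000100", "011000", "111000",
]
# Hypothetical cluster centers, chosen only for illustration.
centers = ["100000", "011000"]

clusters = assign_to_nearest(haplotypes, centers)
for center, members in clusters.items():
    print(center, members)
```

With these centers, every haplotype lands in the cluster whose center differs from it at the fewest markers.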
11
Method III: Cladistic Grouping (Templeton 1995; Seltman et al. 2003)
Observed Hap = { 000, 001, 010, 100, 110, 101, 011, 111 }
[Figure: cladogram of the eight observed haplotypes]
12
Desired Features
Include all samples
Incorporate both haplotype distance and age
• High frequency: ancient (Crandall & Templeton 1995)
• Low frequency: young
Allow uncertainty in inferring the underlying evolutionary relationship
13
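Proposed Approach: Cladistic Clustering
Possible Hap = { 000, 001, 010, 100, 110, 101, 011, 111 }
[Figure: cladogram of the eight haplotypes in three levels; edges carry allocation probabilities p, 1-p and q1, q2, 1-q1-q2]
Level (2): { 110 }, frequencies $\pi_{(2)}$, allocated to level (1) by $B_{(2)}$
Level (1): { 000, 010, 111, 100 }, frequencies $\pi_{(1)}$, allocated to level (0) by $B_{(1)}$
Level (0): { 001, 011, 101 }, frequencies $\pi_{(0)}$
One-step regrouping: $\pi^{*T}_{(i)} = \pi^{T}_{(i)} + \pi^{*T}_{(i+1)} B_{(i+1)}$, with $\pi^{*}_{(2)} = \pi_{(2)}$
Overall: $\pi^{*T} = \pi^{T} B = \begin{bmatrix} \pi^{T}_{(0)} & \pi^{T}_{(1)} & \pi^{T}_{(2)} \end{bmatrix} \begin{bmatrix} I \\ B_{(1)} \\ B_{(2)} B_{(1)} \end{bmatrix}$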
14
Issues
1. Determine the major nodes $\pi_{(0)}$
2. Construct the conditional allocating matrix $B_{(i)}$
15
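Conditional Allocating Matrix $B_{(i)}$
[Figure: cladogram with level (2) = { 110 }, level (1) = { 000, 010, 100, 111 }, level (0) = { 001, 011, 101 }; inset showing 110 linked through $B_{(2)}$ to 111, 010, 100, with 000 below]
$\pi^{*T}_{(1)} = \pi^{T}_{(2)} B_{(2)} + \pi^{T}_{(1)}$
$B_{(2)} = C \, (c \;\; c \;\; c \;\; c)$: one row for haplotype 110, with columns 000, 010, 100, 111
C = ( ) [expression lost in extraction]
$c \in [0, 1]$: likelihood of one-step movement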
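One way to picture a row of the allocating matrix is the sketch below. It assumes that a haplotype may only move to targets exactly one marker change away, and that the weight of each such move is proportional to the target's frequency, normalized so the row sums to 1. That weighting rule and the frequencies are hypothetical; the slide only states that each entry lies in [0, 1] and reflects the likelihood of a one-step movement.

```python
def hamming(a: str, b: str) -> int:
    """Number of marker positions at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def allocation_row(source, targets, freq):
    """Weights for moving `source` onto each target haplotype.

    Assumption (not from the slides): a target gets weight proportional to
    its frequency if it is exactly one step away, and zero otherwise.
    """
    raw = [freq[t] if hamming(source, t) == 1 else 0.0 for t in targets]
    total = sum(raw)
    if total == 0:
        raise ValueError("no one-step neighbour among the targets")
    return [w / total for w in raw]


# Level (1) haplotypes in the column order used on the slide.
level1 = ["000", "010", "100", "111"]
# Hypothetical frequencies, for illustration only.
freq = {"000": 0.10, "010": 0.20, "100": 0.10, "111": 0.10}

b2_row = allocation_row("110", level1, freq)
print(dict(zip(level1, b2_row)))
```

Under these assumptions 000 receives zero weight, because it is two steps from 110, and the remaining weight is split among 010, 100, and 111.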
16
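Conditional Allocating Matrix $B_{(i)}$
$\pi^{*T} = \pi^{T}_{(0)} + \pi^{T}_{(1)} B_{(1)} + \pi^{T}_{(2)} B_{(2)} B_{(1)}$
$B_{(1)}$ = [matrix with rows 100, 111, 010, 000 and columns 101, 011, 001; entries not recoverable]
[Figure: cladogram of the eight haplotypes]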
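The regrouping formula can be checked numerically with a small sketch. The frequencies and the entries of both allocating matrices below are made-up values (each row sums to 1 and only one-step moves get weight); the lost frequency symbol is written as `pi` here. The code applies the one-step regrouping level by level and confirms that it matches the expanded sum of the three terms.

```python
def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mat_mat(a, b):
    """Matrix product of two lists of rows."""
    return [vec_mat(row, b) for row in a]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(u, v)]


# Hypothetical frequencies. Level (0) order: 001, 011, 101;
# level (1) order: 000, 010, 100, 111; level (2): 110.
pi0 = [0.30, 0.20, 0.10]
pi1 = [0.10, 0.10, 0.05, 0.10]
pi2 = [0.05]

# Hypothetical allocating matrices (rows sum to 1).
B2 = [[0.0, 0.5, 0.25, 0.25]]   # 110 -> 010, 100, 111
B1 = [
    [1.0, 0.0, 0.0],            # 000 -> 001
    [0.0, 1.0, 0.0],            # 010 -> 011
    [0.0, 0.0, 1.0],            # 100 -> 101
    [0.0, 0.5, 0.5],            # 111 -> 011 or 101
]

# Level-by-level regrouping: fold level (2) into (1), then (1) into (0).
pi1_star = add(pi1, vec_mat(pi2, B2))
pi_star_stepwise = add(pi0, vec_mat(pi1_star, B1))

# Expanded form: pi0 + pi1 B1 + pi2 B2 B1.
pi_star_direct = add(add(pi0, vec_mat(pi1, B1)), vec_mat(pi2, mat_mat(B2, B1)))

print(pi_star_stepwise)
print(pi_star_direct)
```

Because every row of the allocating matrices sums to 1, the regrouped frequencies still sum to 1: no sample is discarded, it is only reassigned to a major node.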
17
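Determine $\pi_{(0)}$
Information criteria
• Net Information (Shannon's information content)
[Formula garbled in extraction: a Shannon term $\sum_{i=1}^{k} \pi_i \log_2(1/\pi_i)$ combined with a term involving $\log_2$, $k$, and $n$]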
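Only the Shannon part of the net information formula is legible in this transcript, so the sketch below computes just that part: the sum of p log2(1/p) over the k most frequent haplotypes, for each k. The second term of the criterion (involving k and n) is not reproduced because its exact form is not recoverable here, and the frequencies are hypothetical.

```python
import math


def shannon_information(freqs):
    """Sum of p * log2(1/p) over the nonzero frequencies."""
    return sum(p * math.log2(1.0 / p) for p in freqs if p > 0)


def cumulative_information(freqs):
    """Shannon term for the k most frequent haplotypes, for k = 1..len(freqs)."""
    ordered = sorted(freqs, reverse=True)
    return [shannon_information(ordered[:k]) for k in range(1, len(ordered) + 1)]


# Hypothetical haplotype frequencies, for illustration only.
freqs = [0.45, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02]

for k, info in enumerate(cumulative_information(freqs), start=1):
    print(k, round(info, 4))
```

The gain from adding one more haplotype shrinks as the frequencies get smaller, which is the behavior an information criterion can use to decide how many major haplotypes to keep.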
18
Net Information and $\pi_{(0)}$
19
Association Analysis Based on $\pi^{*}$
Coalescent simulation (Hudson 2002):
• Prevalence = 0.01
• Relative risk = 2
• Frequencies of liability allele = (0.1, 0.3, 0.5)
• Location of liability allele = (hot spot, blocky, very blocky)
• Draw 200 cases and 200 controls
Test of homogeneity based on $\pi^{*}_{cs}$ and $\pi^{*}_{cn}$
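A test of homogeneity between two groups can be sketched with a Pearson chi-square statistic on a 2 x k table of grouped haplotype counts. The slides do not say which homogeneity statistic was used, so this is one standard choice rather than the author's method; the counts are invented, and reading the subscripts cs and cn as cases and controls is an assumption.

```python
def chi_square_homogeneity(case_counts, control_counts):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    if len(case_counts) != len(control_counts):
        raise ValueError("both groups need the same haplotype categories")
    n_case, n_control = sum(case_counts), sum(control_counts)
    if n_case == 0 or n_control == 0:
        raise ValueError("each group needs at least one observation")
    n = n_case + n_control

    statistic = 0.0
    used_columns = 0
    for a, b in zip(case_counts, control_counts):
        column_total = a + b
        if column_total == 0:
            continue  # a category seen in neither group carries no information
        used_columns += 1
        for observed, row_total in ((a, n_case), (b, n_control)):
            expected = row_total * column_total / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_columns - 1


# Hypothetical grouped haplotype counts: 200 cases and 200 controls
# contribute 400 haplotypes each, spread over three groups.
cases = [160, 150, 90]
controls = [180, 140, 80]

stat, df = chi_square_homogeneity(cases, controls)
print(round(stat, 4), df)
```

Fewer haplotype groups mean fewer degrees of freedom, which is the motivation for regrouping before testing. In practice a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.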
20
Power and Type I error
[Figure: two panels, Gene Pelc and Gene IL01RB]
21
Summary
Provide a mechanism of cladistic clustering via $\pi^{*T} = \pi^{T} B$
• Combines the ideas of truncating and clustering
• Based on evolutionary relationships, without reconstructing the cladogram
• Incorporates haplotype frequencies and distance in cluster assignment
• One-step conditional regrouping can accommodate multiple-step regrouping: self-repeating, algebraically multiplicative
• Reserves $\pi_{(0)}$ based on information criteria
$\pi^{*}$ increases test efficiency
• Increased power even for large samples and for haplotypes in block regions
22
End of Slides
23
Approach
Two stages:
• Stage I: (Where)
Identify the susceptibility regions across the genome (multiple testing problem)
Approaches based on haplotype similarity
• Stage II: (Which)
Determine and pinpoint the specific liability variants
Study individual effects of groups of haplotypes
24
Strategies of Reducing Degrees of Freedom
I. Haplotype Similarity
• Van Der Meulen and te Meerman 1997; Bourgain et al. 2000-2002; Tzeng et al. 2003ab
• Search for extra haplotype sharing among cases
• Pro: 1 degree of freedom
• Con: does not study individual haplotype effects
• Usage: good for genome screening
25
Strategies of Reducing Degrees of Freedom
II. Haplotype Tagging (Johnson et al. 2001)
All SNPs (. = same allele as haplotype 1):
Hap   SNP 1-14                      Freq(%)
1     A C A C C C C C G G G C C G   45
2     . . . . . . . . . . . A . .   20
3     C T T G . T A T T A . . . .   13.25
4     . . . . . . . . . . . . . A   11.25
5     C . T . T . A . . . A A . .   3.75
6     . . . . . . . . . . . . T .   3.50
7     C . . . . . . . . . . . . .   1.50
8     C . T . T . A . . . . . . .   0.50
9     . T T G . T A T T A . . . .   0.50
Tag SNPs only:
Hap   tag SNPs
1     A C G
2     . A .
3     T . .
4     . . A
5     T A .
(1)   . . .
(1)   . . .
6     T . .
(6)   T . .
• Pro: efficiently captures the major diversity
• Con: discards rare haplotypes
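The effect of tagging on the haplotype table can be sketched by projecting each full haplotype onto the tag SNPs and pooling frequencies. The tag positions used below (SNPs 3, 12, and 14) are inferred by matching the three-column table against the full table; they are not stated on the slide, so treat them as an assumption.

```python
from collections import defaultdict

REFERENCE = "ACACCCCCGGGCCG"  # haplotype 1; "." in other rows means "same as here"

# (pattern, frequency in %) for haplotypes 1 to 9, copied from the table.
ROWS = [
    ("ACACCCCCGGGCCG", 45.0),
    ("...........A..", 20.0),
    ("CTTG.TATTA....", 13.25),
    (".............A", 11.25),
    ("C.T.T.A...AA..", 3.75),
    ("............T.", 3.50),
    ("C.............", 1.50),
    ("C.T.T.A.......", 0.50),
    (".TTG.TATTA....", 0.50),
]

# Zero-based indices of SNPs 3, 12, 14 (inferred, see the note above).
TAG_POSITIONS = (2, 11, 13)


def expand(pattern: str) -> str:
    """Replace each '.' with the reference allele at that position."""
    return "".join(r if c == "." else c for c, r in zip(pattern, REFERENCE))


def project(haplotype: str) -> str:
    """Keep only the alleles at the tag SNP positions."""
    return "".join(haplotype[i] for i in TAG_POSITIONS)


tagged = defaultdict(float)
for pattern, freq in ROWS:
    tagged[project(expand(pattern))] += freq

for tag, freq in sorted(tagged.items(), key=lambda kv: -kv[1]):
    print(tag, freq)
```

Nine haplotypes collapse to five tag patterns, and the rare haplotypes 6 to 9 become indistinguishable from common ones, which is the loss noted in the Con bullet.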
26
Strategies of Reducing Degrees of Freedom
III. Haplotype Clustering
• Molitor et al. 2003; Seltman et al. 2001, 2003; Durrant et al. 2004
• Similar haplotypes induce similar liability effects
• Cluster haplotypes and perform the analysis on clusters of haplotypes
• Pro: incorporates all data
• Con: may cluster two major haplotypes in the same group
28
Haplotype Grouping
Focus on Stage II
Combine the pros of haplotype tagging and clustering