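Haplotype inference and haplotype-based transmission disequilibrium test (Hap-TDT)
Xiangdong Ding, College of Animal Science and Technology, China Agricultural University
Information Technology Branch of the Chinese Association of Animal Science and Veterinary Medicine, 2009 Annual Meeting, Harbin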
Outline
Haplotype inference
Introduction of important methods
• Parsimony (Clark, 1990)
• EM (Excoffier and Slatkin, 1995)
• Bayesian (Stephens and Donnelly, 2001)
Haplotype inference using family information
Transmission Disequilibrium Test (TDT)
Haplotype-based TDT (Hap-TDT)
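Genotype and Haplotype
Haplotype: a collection of alleles from the same chromosome.
Haplotype reconstruction converts genotypes whose chromosome phase is unknown into haplotypes whose chromosome phase is known:

Locus   Genotype (phase unknown)   Haplotype 1   Haplotype 2
Loc1    2, 13                      2             13
Loc2    1, 6                       6             1
…       9, 15                      9             15
…       4, 17                      17            4
…       1, 9                       1             9
…       2, 6                       2             6
Loc7    9, 17                      17            9

Allele and genotype (single locus) correspond to haplotype and diplotype (multiple loci): a diplotype is the pair of haplotypes an individual carries.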
Algorithms for haplotype inference
Statistical methods
• Parsimony (Clark, 1990)
• EM: Excoffier and Slatkin (1995); Qin et al. (2002)
• Bayesian: Stephens and Donnelly (2001); Niu et al. (2002)
Rule-based methods
• Minimum recombination principle: Qian and Beckmann (2002); Baruch et al. (2006)
Parsimony (Clark, 1990)
Example (individuals written as haplotype pairs): A1A1, A1A2, A2A3, A3A4, A3A5, A6A7
1. Start from a homozygote, whose haplotype is unambiguous.
2. Resolve any other ambiguous genotype that can be explained by a haplotype already identified in step 1.
3. Continue this procedure until all haplotypes are resolved or until no more new haplotypes can be found.
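The three steps above can be sketched in Python. This is a minimal illustration, not Clark's original program: it assumes biallelic markers, encodes each genotype as the count (0, 1 or 2) of one allele per site, and resolves genotypes greedily in input order. The function names and the encoding are assumptions made for this sketch.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(x in (0, 1) for x in other) else None


def clark_phase(genotypes):
    """Greedy parsimony phasing; returns {index: (hap1, hap2)} for resolved genotypes."""
    known = []      # haplotypes identified so far, in order of discovery
    resolved = {}
    # Step 1: homozygotes are unambiguous and seed the list of known haplotypes.
    for i, g in enumerate(genotypes):
        if all(x in (0, 2) for x in g):
            hap = tuple(x // 2 for x in g)
            resolved[i] = (hap, hap)
            if hap not in known:
                known.append(hap)
    # Steps 2-3: explain ambiguous genotypes with a known haplotype plus its
    # complement, and repeat until a full pass resolves nothing new.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for hap in known:
                other = complement(g, hap)
                if other is not None:
                    resolved[i] = (hap, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved


# Hypothetical data: the last genotype shares no haplotype with the others
# and therefore stays unresolved.
genotypes = [(0, 0, 0), (1, 1, 0), (2, 1, 1), (0, 2, 1)]
print(clark_phase(genotypes))
```

The result depends on the order in which genotypes and known haplotypes are visited, which is a known property of this kind of greedy procedure.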
EM algorithm
Numerical method for finding maximum likelihood estimates (MLE) of parameters given incomplete data.
• Initial parameter values: haplotype frequencies
• Expectation step: compute the expected values of the missing data (the unobserved haplotype pairs) from the observed genotypes and the current parameter values
• Maximization step: compute the MLE of the parameters from the completed data
• Repeat with the updated set of parameters until changes in the parameter estimates are negligible
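A minimal Python sketch of these steps for haplotype frequencies follows. It assumes biallelic markers coded 0/1/2 per site, Hardy-Weinberg proportions, uniform starting frequencies and no missing genotypes; the function names are illustrative and it enumerates all compatible pairs, so it is only practical for a handful of loci.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]   # homozygous sites; het sites are set below
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-8):
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}          # initial parameter values
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair under HWE.
            weights = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the 2n chromosomes.
        new = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:      # stop when the estimates no longer change
            break
    return freq


# Hypothetical sample: two 00/00 homozygotes, one 11/11 homozygote, one double heterozygote.
print(em_haplotype_freqs([(0, 0), (0, 0), (2, 2), (1, 1)]))
```

In this toy sample the double heterozygote is attributed almost entirely to the pair 00/11, because both haplotypes are already supported by the homozygotes.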
EM algorithm efficiency
Heavy computational burden with a large number of loci
• Partition-ligation algorithm (Niu et al., 2002)
• PL-EM (Qin et al., 2002)
Accuracy and departures from HWE
• Most EM-based methods assume Hardy-Weinberg equilibrium (HWE)
• Robust to departures from HWE (Fallin and Schork, 2000)
Bayesian method
PHASE (Stephens and Donnelly, 2001)
• Based on the coalescent model
• Uses Gibbs sampling
• Very accurate so far, but also complicated
fastPHASE (Scheet and Stephens, 2006)
• Revised version of PHASE
• Much faster; handles missing genotypes
Comparison of Parsimony, EM and PHASE
[Figure: comparison of EM, PHASE and Parsimony]
• PHASE performs better than parsimony and EM (Stephens and Donnelly, 2001)
• PHASE and EM-based methods exhibited similar performance (Zhang et al., 2001; Xu et al., 2002)
Haplotype inference using family data (1)
Haplotype inference based on close relatives reduces haplotype ambiguity and improves efficiency.
Rohde and Fuerst (2001): EM algorithm
• Families with both parents and their children
• The genotyped offspring reduce the number of potential haplotype pairs for both parents
Ding et al. (2006): EM algorithm
• Families with only one parent available
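How a genotyped child narrows down the parents' haplotype pairs can be illustrated with a small enumeration. The sketch below is not the Rohde and Fuerst or Ding et al. implementation; it assumes biallelic markers coded 0/1/2 per site, a single child, and no recombination within the region, and all names are illustrative.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def trio_consistent(father, mother, child):
    """Parental haplotype-pair combinations that can produce the child's genotype
    when each parent transmits one intact haplotype (no recombination assumed)."""
    keep = []
    for f_pair, m_pair in product(compatible_pairs(father), compatible_pairs(mother)):
        if any(tuple(a + b for a, b in zip(fh, mh)) == tuple(child)
               for fh in f_pair for mh in m_pair):
            keep.append((f_pair, m_pair))
    return keep


# Hypothetical trio: the father is a double heterozygote with two possible
# haplotype pairs; the child's genotype leaves only one of them.
father, mother, child = (1, 1), (0, 0), (1, 1)
print(len(compatible_pairs(father)))           # pairs before using the child
print(trio_consistent(father, mother, child))  # pairs after using the child
```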
Haplotype inference using family data (2)
Ding et al.: EM algorithm
• Families with only sibs
Mixed family data
• Complete families
• Incomplete families: one parent, or only sibs
[Diagram: genotypes of sibs → parental information → haplotype pairs of sibs]
Comparison of four different strategies (Ding et al., 2006)

Method name                                    Using family information?   Using LD?   Handling incomplete families?
Complete-family-EM (Rohde and Fuerst, 2001)    YES                         YES         NO
Incomplete-family-EM (Ding et al., 2006)       YES                         YES         YES
GENEHUNTER (Kruglyak et al., 1996)             YES                         NO          YES
PHASE (Stephens et al., 2003)                  YES                         YES         YES
Result for incomplete families
[Figure: discrepancy, running time (min) and error rate for Incomplete-family-EM, PHASE and GENEHUNTER]
Running time of PHASE: 3.5 h for all 100 datasets of 30 trios with 187 SNPs (Marchini et al., 2006). Running time becomes prohibitive for large numbers of SNPs.
Rule-based method
Minimum recombination principle: Qian and Beckmann (2002); Baruch et al. (2006)
• Genetic recombination is rare
• A haplotype configuration with fewer recombinants should be preferred in haplotype reconstruction
Rule 1, Rule 2, … (Wang et al., 2006)
Joint EM and rule-based algorithm for (grand-)daughter design
Assuming no recombination
• EM algorithm to construct diplotypes
Taking recombination into account
• Minimum recombination principle: find the sire diplotype that minimizes the number of recombinations in the sire family

Example: a sire family with 50 offspring; phase-unknown genotype of sire: (12;12;12;12;12)

    Possible diplotype   Recombination events
1.  22222 11111           1
2.  11222 22111          51
3.  11122 22211          49
4.  22221 11112          51
5.  12222 21111          51
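The search for the sire diplotype with the fewest recombinations can be sketched as follows. This is an illustration under stated assumptions, not the HSHAP implementation: the paternal haplotype of every offspring is taken as already known, all candidate diplotypes are enumerated exhaustively, and the function names and the offspring data are hypothetical. The 50 offspring haplotypes are chosen so that diplotype 11111/22222 needs exactly one recombination, as in the example on this slide.

```python
from itertools import product


def min_recombinations(sire_h1, sire_h2, offspring_hap):
    """Fewest crossovers needed to copy `offspring_hap` from the sire's two haplotypes."""
    inf = float("inf")
    cost = [0, 0]    # cheapest path that is currently copying from h1 / from h2
    for a1, a2, o in zip(sire_h1, sire_h2, offspring_hap):
        stay_or_switch = [min(cost[0], cost[1] + 1), min(cost[1], cost[0] + 1)]
        cost = [stay_or_switch[0] if a1 == o else inf,
                stay_or_switch[1] if a2 == o else inf]
    return min(cost)


def best_sire_diplotype(sire_genotype, paternal_haps):
    """sire_genotype: one (allele, allele) tuple per locus.
    Returns ((h1, h2), total recombinations) for the best phase assignment."""
    het = [i for i, (a, b) in enumerate(sire_genotype) if a != b]
    best = None
    # The first heterozygous locus is kept fixed so mirror-image diplotypes
    # are not counted twice.
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        h1 = [a for a, _ in sire_genotype]
        h2 = [b for _, b in sire_genotype]
        for site, flip in zip(het[1:], flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        total = sum(min_recombinations(h1, h2, o) for o in paternal_haps)
        if best is None or total < best[1]:
            best = ((tuple(h1), tuple(h2)), total)
    return best


sire = [(1, 2)] * 5
# Hypothetical paternal haplotypes: 49 non-recombinant, 1 recombinant.
offspring = [(1,) * 5] * 25 + [(2,) * 5] * 24 + [(1, 1, 1, 2, 2)]
print(best_sire_diplotype(sire, offspring))
```

Exhaustive enumeration grows as 2^(k-1) for k heterozygous loci, so this sketch only suits short marker windows.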
Results
10 sire families with 30 offspring each; 10 SNPs, 0.5 centiMorgan

                          HSHAP    PedPhase   SimWalk2
Error rate in sires       0.007c   0.682a     0.126b
Error rate in offspring   0.021c   0.0330b    0.0705a
Running time (m)          8.211    0.006      60.17
Results - incomplete data
10 sire families with 20 offspring each; 5 SNPs, 0.5 centiMorgan; 30% of individuals with one random missing locus

                          HSHAP     PedPhase   SimWalk2
Error rate in sires       0.0310c   0.8910a    0.0930b
Error rate in offspring   0.0159c   0.7607a    0.2164b
Running time (m)          0.01      0.002      10.65
Linkage vs Association
Linkage
• Family-based
• Matching/ethnicity generally unimportant
• Few markers for genome coverage (300-400 STRs)
• Can be weak design
• Good for initial detection; poor for fine-mapping
• Powerful for rare variants
Association
• Families or unrelated individuals
• Matching/ethnicity crucial
• Many markers required for genome coverage (10^5 to 10^6 SNPs)
• Powerful design
• Poor for initial detection; good for fine-mapping
• Powerful for common variants; rare variants generally impossible
TDT (Transmission Disequilibrium Test)
Compares the distribution of transmitted and non-transmitted alleles by parents of affected offspring (Spielman et al. 1993)
                      Non-transmitted allele
Transmitted allele    M1      M2      Total
M1                    a       b       a+b
M2                    c       d       c+d
Total                 a+c     b+d     2n
If the marker is unlinked to the causative locus, we expect b = c; otherwise one of the alleles will tend to be transmitted more often.
$$X^2_{\mathrm{TDT}} = \frac{(b-c)^2}{b+c} \sim \chi^2_{df=1}$$
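As a worked illustration of the statistic, the following Python sketch computes it from the two discordant cells b and c of the transmission table. The function name and the example counts are hypothetical; the p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def tdt(b, c):
    """TDT statistic and p-value from the discordant transmission counts.

    b: parents transmitting M1 and not M2; c: parents transmitting M2 and not M1.
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 60 transmissions of M1 versus 40 of M2.
chi2, p = tdt(60, 40)
print(chi2, p)
```

Only the off-diagonal cells enter the statistic, because homozygous parents (cells a and d) carry no information about preferential transmission.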
TDT (Transmission Disequilibrium Test)
Good for fine-mapping, poor for initial detection
Robust to population stratification/admixture
Widely used in genome-wide association studies
Extended to many kinds of data structure
• Multi-allelic markers (Sham and Curtis, 1995)
• Multiple siblings (Spielman, 1998; Boehnke, 1998)
• Missing parental data (Sun, 1999)
• Extended pedigree (Martin et al., 2000)
• Quantitative traits (Allison, 1997; Ding et al., 2004)
Comparison of PDT, QTDT and PTDT (quantitative trait)

                                  Power                 Type I error
# ped   # alleles   QTL effect   PDT   QTDT   PTDT     PDT   QTDT   PTDT
10      2           0.1          59    18     57       4     5      1
        5           0.3          74    61     82       7     4      3
        8           0.5          85    81     92       8     3      4
Hap-TDT (Haplotype-based TDT)
A haplotype is more informative than a single marker
• The original TDT and most of its extensions consider one marker at a time
Two categories of haplotype-based TDT
• Haplotype reconstruction first: Sethuraman (1997); Wilson (1997); Clayton and Jones (1999); Zhao et al. (2000); Zhang et al. (2003)
• Implicit haplotype reconstruction: Dudbridge (2003)
Hap-TDT vs TDT
[Figure: two panels, quantitative trait (h² = 0.06) and qualitative trait; Zhang et al. (2003)]
Hap-TDT vs TDT
[Figure: power of Hap-TDT, TDT and FBAT versus marker density (1, 0.1, 0.01 cM); sample size: 300 trios]
[Figure: power versus sample size (200, 300, 400 trios); marker density: 0.1 cM]
• 5 SNPs, excluding the causal variant SNP
Hap-TDT
Problem of multiple comparisons
Increase in the degrees of freedom
• More markers give a large number of haplotypes, hence a large number of degrees of freedom, which limits the power of the test
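The growth in degrees of freedom can be made concrete with a short calculation. The sketch assumes biallelic SNPs, so that L markers allow at most 2^L distinct haplotypes, and it assumes a global test over k haplotypes with k − 1 degrees of freedom, as is typical for multi-allelic extensions of the TDT. Both function names are illustrative.

```python
def max_haplotypes(n_snps):
    """Upper bound on the number of distinct haplotypes for biallelic SNPs."""
    return 2 ** n_snps


def global_test_df(n_haplotypes):
    """Degrees of freedom of a global test over k haplotypes (assumed k - 1)."""
    return n_haplotypes - 1


for n_snps in (1, 3, 5, 10):
    k = max_haplotypes(n_snps)
    print(n_snps, k, global_test_df(k))
```

In real data linkage disequilibrium keeps the number of observed haplotypes well below this bound, but the count still rises quickly with the number of markers.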
Acknowledgement
Prof. Qin Zhang (China Agricultural University)
Prof. Henner Simianer (University of Goettingen)