Haplotype inference and haplotype-based transmission disequilibrium test (Hap-TDT)

丁向东中国农业大学动物科技学院

Haplotype inference and haplotype-based transmission disequilibrium test

(Hap-TDT)

中国畜牧兽医学会信息技术分会

2009年会 . 哈尔滨

Outline

Haplotype inference Introduction of important methods

• Parsimony (Clark,1990)

• EM (Excoffier and Slatkin, 1995)

• Bayesian (Stephens and Donnelly,2001)

Haplotype inference using family information

Transmission Disequilibrium Test (TDT)Haplotype-based TDT (Hap-TDT)

Genotype and Haplotype

Loc1 2 13

Loc2 1 6

… 9 15

… 4 17

… 1 9

… 2 6

Loc7 9 17

2 13

6 1

9 15

17 4

1 9

2 6

17 9

Haplotype

reconstruction

Chromosome phase is unkown Chromosome phase is kown

Haplotype: A collection of alleles from same chromosome

Allele & Genotype Haplotype & diplotype

Algorithms for haplotype inference

Statistical methods Parsimony (Clark,1990) EM

• Excoffier and Slatkin (1995); Qin et al. (2002)

Bayesian

• Stephens and Donnelly (2001); Niu et al. (2002)

Rule-based methods Minimum recombination principle

• Qian and Beckmann (2002); Baruch, et al. (2006)

Parsimony (Clark,1990)

A1A1

A1A2

A2A3

A3A4

A3A5A6A7

1. Start from a homozygote

2. Determine any other ambiguous sequence using the definitive haplotype from 1

3. Continue this procedure until all haplotypes are resolved or until no more new haplotypes can be found

EM algorithm

Numerical method of finding maximum likelihood estimates for parameters given incomplete data.

Initial parameter values: haplotype frequencies Expectation step: compute expected values of

missing data based on initial data Maximization step: compute MLE for parameters

from the complete data Repeat with updated set of parameters until changes

in the parameter estimates are negligible.

EM algorithm efficiency

Heavy computational burden with large number of loci Partition-ligation algorithm

(Niu et al., 2002) PL-EM (Qin et al., 2002)

Accuracy and departures from HWE Assumption of HWE in most EM-based methods Robust to departure from HWE (Fallin and Schork, 2000)

Bayesian method

PHASE (Stephens and Donnelly, 2001) Based on coalescent model Use Gibbs sampling So far, very accurate, but also complicated.

fastPHASE (Scheet and Stephens,2006) Revised version of PHASE Much faster, missing genotype

Comparison of Parsimony,EM and PHASE

EM

PHASE

Parsimony

PHASE performs better than parsimony and EM (Stephen, 2001)

PHASE and EM-based methods exhibited similar performances (Zhang et al. 2001; Xu et al. 2002)

Haplotype inference using family data (1)

Haplotype inference based on close relatives Reduces haplotype ambiguity and improves the

efficiency

Rohde and Fuerst (2001) – EM algorithm Families with both parents and their children The genotyped offspring reduce the number of

potential haplotype pairs for both parents.

Ding et al. (2006) - EM algorithm Families with only one parent available

Haplotype inference using family data (2)

Ding et al. – EM algorithm Families with only sibs

Mixed family data Complete families Incomplete families

• One parent

• Only sibs

genotypes of sibs

Parental information

Haplotype pairs of sibs

Method nameUsing family information?

Using

LD?

Handling incomplete families?

Complete-family-EM

(Rhode and Fuerst, 2001)YES YES NO

Incomplete-family-EM

(Ding et al., 2006)YES YES YES

GENEHUNTER

(Kruglyak et al., 1996)YES NO YES

PHASE

(Stephens et al., 2003) YES YES YES

Comparison of four different strategies

(Ding et al., 2006)

Result for incomplete families

0.80

0.85

0.90

0.95

1.00

Dis

crep

ancy

0

20

40

60

80

100

Ru

nn

ing

tim

e (m

)

0.0

0.2

0.4

0.6

0.8

erro

r ra

te

Incomplete-family-EM PHASE GENEHUNTER

Running time of PHASE:3.5 hs for the whole 100 datasets of 30 trios,187 SNPs (Marchini et al.,2006) Running time will become prohibitive for large SNPs

Rule-based method

Minimum recombination principle Qian and Beckmann (2002); Baruch, et al. (2006) Genetic recombination is rare Haplotype with fewer recombinants should be

preferred in a haplotype reconstruction

Rule 1

Rule 2

…

Wang et al., 2006

Joint EM and rule-based algorithm for (grand-) daughter design

Assumption of no recombination EM algorithm to construct diplotype

Taking into account recombination Minimum recombination principle

• Find the diplotype of sire that minimizes the number of recombinations in the sire family

Possible diplotypes recom. events------------------------------------------------1. 22222 11111 12. 11222 22111 513. 11122 22211 494. 22221 11112 515. 12222 21111 51-------------------------------------------------

Example:

A sire family with 50 offspring

phase-unknown genotype of sire:(12;12;12;12;12)

Results

HSHAP PedPhase SimWalk2

Error rate in sires 0.007c 0.682a 0.126b

Error rate in offspring 0.021c 0.0330b 0.0705a

Running time (m) 8.211 0.006 60.17

10 sire families with 30 offspring each 10 SNPs , 0.5 centiMorgan

Results - incomplete data

HSHAP PedPhase SimWalk2

Error rate in sires 0.0310c 0.8910a 0.0930b

Error rate in offspring 0.0159c 0.7607a 0.2164b

Running time (m) 0.01 0.002 10.65

10 sire families with 20 offspring each 5SNPs, 0.5 centiMorgan 30% individuals with one random missing locus.

Linkage vs Association

Linkage• Family-based

• Matching/ethnicity generally unimportant

• Few markers for genome coverage (300-400 STRs)

• Can be weak design

• Good for initial detection; poor for fine-mapping

• Powerful for rare variants

Association• Families or unrelateds• Matching/ethnicity crucial• Many markers req for

genome coverage (105 – 106 SNPs)

• Powerful design• Poor for initial detection;

good for fine-mapping• Powerful for common

variants; rare variants generally impossible

TDT (Transmission Disequilibrium Test)

Compares the distribution of transmitted and non-transmitted alleles by parents of affected offspring (Spielman et al. 1993)

Non-transmitted allele

total

transmitted allele

M1 M2

M1 a b a+b

M2 c d c+d

total a+c b+d 2n

If the marker is unlinked to the causative locus then we expect b = c, else, one of the alleles will tend to be transmitted more often

21

22 ~ dfcb

cbX

TDT

TDT (Transmission Disequilibrium Test) Good for fine-mapping, poor for initial detection Robust for population stratification/admixture Widely used in genome-wide association study Extended to all kinds of data structure

• Multi allelic markers (Sham and Curtis, 1995)

• Multiple siblings (Spielman,1998; Boehnke, 1998)

• Missing parental data (Sun, 1999)

• Extended pedigree (Martin et al., 2000)

• Quantitative traits (Allison,1997; Ding et al, 2004)

Comparison of PDT,QTDT and PTDT

# ped # alleles

QTL effect

Power Type I error

PDT QTDT PTDT PDT QTDT PTDT

10

2 0.1 59 18 57 4 5 1

5 0.3 74 61 82 7 4 3

8 0.5 85 81 92 8 3 4

Quantitative trait

Hap-TDT

Hap-TDT (Haplotype-based TDT) Haplotype is more informative than single marker

The original TDT and most of its extensions consider one marker at a time.

Two categories of haplotype-based TDT Haplotype reconstruction first

• Sethuraman (1997); Wilson (1997); Clayton and Jones (1999); Zhao et al. (2000); Zhang et al.(2003)

Implicit haplotype reconstruction

• Dudbridge (2003)

Hap-TDT vs TDT

Quantitative trait (h2 = 0.06)Qualitative trait

Zhang et al. (2003)

Hap-TDT vs TDT

Sample size: 300 trios

0

20

40

60

80

100

1 0.1 0.01Density (cM)

Po

wer

Hap-TDT

TDT

FBAT

Marker density: 0.1cM

0

20

40

60

80

100

200 300 400

Sample size (trios)

Po

wer

• 5SNPs, Excluding causal variant SNP

Hap- TDT

Problem of multiple comparisons Increase in the degree of freedom

More markers

A large number of haplotypes

A large number of degrees of freedom

Limit the power of test

Prof. Qin Zhang (China Agricultural University) Prof. Henner Simianer (University of Goettingen)

Acknowledgement

Documents

Haplotype inference and haplotype-based transmission disequilibrium test (Hap-TDT)