Imputation 2





Presenter: Ka-Kit Lam

Outline
  • Big Picture and Motivation
  • IMPUTE
  • IMPUTE2
  • Experiments
  • Conclusion and Discussion
  • Supplementary: GWAS, estimate on mutation rate

Big Picture and Motivation

Genome-wide association study (GWAS): identify common genetic factors that influence health and disease.


Background

It is important to know the SNPs. However, not all SNPs are genotyped for all individuals in a GWAS case-control study.

How can we guess the missing parts?


Information known

Luckily, we now have references for human DNA.

But, how can we use the reference genomes?

Main Question

Objective: design algorithms to impute the missing genotypes of the individuals being studied.

Criteria for algorithms:
  • Scalable
  • Accurate

Big Picture on Algorithm Design
  • Algorithms take the SNPs in the study and the reference haplotype/genotype, and return the imputed genotype with an associated confidence
  • Scalability
  • Accuracy: 1. Experimental validation 2. Application
  • In theory, it makes sense
  • In practice, it works

IMPUTE

Notations and Setting

Reference haplotypes (N haplotypes, L SNPs):
  011100110001010
  011101011100010
  111100000001111

Genotypes in the study sample (K individuals, L SNPs; ? = not genotyped):
  0?????2?????0??
  1?????1????????
  2?????0?????1??
  1?????0?????1??

(Remark: genotype 0 = 00, 1 = 01, 2 = 11)
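The notation above can be made concrete with a small sketch. The row boundaries (15 SNPs per row) are a reconstruction from the flattened slide text, and the helper names `genotype_from_haplotypes` and `typed_sites` are hypothetical, introduced only for illustration.

```python
# Reference haplotypes: N = 3 rows, L = 15 SNPs.
# Assumption: rows are split every 15 characters of the slide's flattened text.
reference_haplotypes = [
    "011100110001010",
    "011101011100010",
    "111100000001111",
]

# Study genotypes: K = 4 rows, L = 15 SNPs; '?' marks a SNP that was not genotyped.
study_genotypes = [
    "0?????2?????0??",
    "1?????1????????",
    "2?????0?????1??",
    "1?????0?????1??",
]


def genotype_from_haplotypes(h1, h2):
    """Genotype coding from the remark: 00 -> 0, 01 or 10 -> 1, 11 -> 2."""
    return [int(a) + int(b) for a, b in zip(h1, h2)]


def typed_sites(genotype):
    """Indices of the SNPs that were actually genotyped in one individual."""
    return [j for j, g in enumerate(genotype) if g != "?"]


if __name__ == "__main__":
    print(typed_sites(study_genotypes[0]))
    print(genotype_from_haplotypes(reference_haplotypes[0], reference_haplotypes[2]))
```

Imputation then amounts to filling in the `?` positions of each study genotype, together with a confidence for each filled value.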

Formulation

Observed genotype and missing genotype

Classical inference problem:

A reasonable estimate:


Modeling (HMM model): relationship between (H, G)

Assumptions:
  • Study individuals are independent.
  • The copying process, in which each haplotype is a mosaic of the reference haplotypes, is captured by a Hidden Markov Model.
  • Mutations at different sites are conditionally independent given the copied haplotype.

Modeling (HMM model): relationship between (H, G)

Observed genotype of the study individual:
  0?????2?????0??

Reference haplotypes (N haplotypes, L SNPs):
  011100110001010
  011101011100010
  111100000001111

Study individual:
  022200220001021

Modeling (Transition Probability)
  • States
  • Transition

What is the intuition?

Modeling: relationship between transition probability and recombination

Recombination process:
  • The more reference haplotypes, the longer the copy length.
  • Copy length in our model depends on the genetic distance between SNPs.

[Figure: reference panel 1, reference panel 2 and a study individual, with the label "More likely to have longer copy length here".]
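The transition formula itself is not preserved in this transcript. As a hedged illustration of how copy length can depend on genetic distance and on the number of reference haplotypes, the sketch below uses a Li and Stephens style copying model, which is an assumption here and may differ from the slide. The function names are hypothetical, the genetic distance is assumed to be in Morgans, and the default effective population size is simply the `-Ne 11400` value from the demo command.

```python
import math


def switch_probability(genetic_distance, n_ref, n_e=11400):
    """Probability that the copying process restarts between two adjacent SNPs.

    Assumed form: 1 - exp(-4 * n_e * genetic_distance / n_ref).
    """
    rho = 4.0 * n_e * genetic_distance
    return 1.0 - math.exp(-rho / n_ref)


def haploid_transition(i, i_next, genetic_distance, n_ref, n_e=11400):
    """Probability of copying haplotype i_next at the next SNP, given haplotype i now.

    After a restart, the new haplotype is drawn uniformly from all n_ref
    reference haplotypes (possibly the same one again).
    """
    p_switch = switch_probability(genetic_distance, n_ref, n_e)
    jump = p_switch / n_ref
    if i == i_next:
        return (1.0 - p_switch) + jump
    return jump


if __name__ == "__main__":
    for n_ref in (10, 100, 1000):
        print(n_ref, switch_probability(0.001, n_ref))
```

Under this assumed form, a larger reference panel or a shorter genetic distance lowers the switch probability, so copied segments tend to be longer.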

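Modeling (Emission Probability)

Emission probability: define the mutation rate λ. Since mutation is assumed independent across sites, the emission probability is a product over sites of the per-site probabilities below.

Per-site probability of the observed genotype (columns) given the pair of copied alleles (rows):

  copied \ observed   0 (00)        1 (01)            2 (11)
  00                  (1-λ)^2       2λ(1-λ)           λ^2
  01                  λ(1-λ)        λ^2 + (1-λ)^2     λ(1-λ)
  11                  λ^2           2λ(1-λ)           (1-λ)^2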
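The emission table can be written as a small lookup. This is a minimal sketch: `lam` stands for the mutation rate, `copied_sum` is the number of 1 alleles in the copied pair (0 for 00, 1 for 01, 2 for 11), and the function name is hypothetical.

```python
def emission_probability(observed, copied_sum, lam):
    """Per-site P(observed genotype | copied allele pair) for mutation rate lam.

    observed and copied_sum are both coded 0, 1 or 2.
    """
    keep = 1.0 - lam  # probability that one copied allele is not mutated
    table = {
        0: (keep ** 2, 2 * lam * keep, lam ** 2),
        1: (lam * keep, lam ** 2 + keep ** 2, lam * keep),
        2: (lam ** 2, 2 * lam * keep, keep ** 2),
    }
    return table[copied_sum][observed]


if __name__ == "__main__":
    lam = 0.01
    for copied_sum in (0, 1, 2):
        row = [emission_probability(g, copied_sum, lam) for g in (0, 1, 2)]
        print(copied_sum, row, sum(row))
```

Each row of the table sums to 1, which is a quick sanity check on the reconstruction.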

Extension (completely missing)

Problem: a genotype is missing across all references and study samples. How do we impute it?

What can we expect? Can we generate information from no information?
  • We cannot expect to know the genotype.
  • But we can guess the relationship between them.

Our friend, population genetics, may help!

[Illustration: one SNP is missing in every row]
  0010110?1100000
  0010110?1100001
  0010110?1100011

Imputation on Reference

Illustration:
  H(1)  111001?0001010
  H(2)  111010?1100010
  H(3)  111000?0001111
  H(4)  111100?0000000
  H(N)  111011?0011100
  Values shown beside the haplotypes: 0, 0, 1, 0, 1

Algorithm:
  1. Randomly select an ordering.
  2. Sample the first mutation according to [formula missing].
  3. Treat the previous haplotypes as references and impute the next one.
  4. Repeat several times to get a stable output.
  5. Use the imputed reference to impute the study sample.

Computational Complexity: Imputation

O(N^2 L) for each individual

Computational Complexity: Forward-Backward Algorithm

Forward equations:

Naive application takes O(N^4) per SNP, since each of the N^2 states (pairs of reference haplotypes) sums over N^2 predecessor states.

Q: How can the following be computed in O(N^2)?

A: (suggested in fastPHASE)

Finally, we have

Similarly for the backward part.

Cost of the terms:
  • O(N^2)
  • O(N) for each j, O(N^2) in total
  • O(N) for each i, O(N^2) in total
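The forward equations are not preserved in this transcript, so the following is a sketch under stated assumptions rather than the slide's derivation. It assumes that the hidden state is a pair (i, j) of reference haplotypes and that each haplotype transitions independently with probability `a + b` to itself and `b` to any other haplotype. Under that assumption, the sum over all N^2 predecessor states collapses into row sums, column sums and one grand total. The function names are hypothetical.

```python
def forward_step_naive(alpha, a, b, emission):
    """One forward update by direct summation over all predecessor pairs: O(N^4)."""
    n = len(alpha)
    new = [[0.0] * n for _ in range(n)]
    for i2 in range(n):
        for j2 in range(n):
            total = 0.0
            for i in range(n):
                for j in range(n):
                    t_i = (a if i == i2 else 0.0) + b
                    t_j = (a if j == j2 else 0.0) + b
                    total += alpha[i][j] * t_i * t_j
            new[i2][j2] = emission[i2][j2] * total
    return new


def forward_step_fast(alpha, a, b, emission):
    """Same update using precomputed marginal sums: O(N^2)."""
    n = len(alpha)
    # Sum over j for each i: O(N) each, O(N^2) in total.
    row_sum = [sum(alpha[i]) for i in range(n)]
    # Sum over i for each j: O(N) each, O(N^2) in total.
    col_sum = [sum(alpha[i][j] for i in range(n)) for j in range(n)]
    grand_total = sum(row_sum)
    new = [[0.0] * n for _ in range(n)]
    for i2 in range(n):
        for j2 in range(n):
            s = (a * a * alpha[i2][j2]
                 + a * b * (row_sum[i2] + col_sum[j2])
                 + b * b * grand_total)
            new[i2][j2] = emission[i2][j2] * s
    return new


if __name__ == "__main__":
    n = 3
    alpha = [[(i + 2 * j + 1) / 36.0 for j in range(n)] for i in range(n)]
    emission = [[0.1 * (i + 1) + 0.05 * j for j in range(n)] for i in range(n)]
    a = 0.9
    b = (1.0 - a) / n
    print(forward_step_naive(alpha, a, b, emission))
    print(forward_step_fast(alpha, a, b, emission))
```

Both functions return the same values up to rounding; only the fast version stays at O(N^2) per SNP, which gives O(N^2 L) per individual over L SNPs.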

O(N^2) in total

Demo

./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

IMPUTE2

Motivation
  • Accuracy: not all information is used during imputation (e.g. other study individuals).
  • Complexity: the algorithm needs to scale well if we incorporate all information (e.g. previously it was O(L N^2)).
  • New data type: diploid reference (1000 Genomes Project).
  • Q: How do we design algorithms to handle this?

Description of Setting (Scenario A)

Reference haplotypes (Nhap haplotypes, L SNPs):
  011100110001010
  011101011100010
  111100000001111

Genotypes in the inference panel (Ninf individuals, L SNPs):
  0?????2?????0??
  1?????1????????
  2?????0?????1??
  1?????0?????1??

(Remark: genotype 0 = 00, 1 = 01, 2 = 11)

T, U (remark: sets of SNP indices)

Description of Setting (Scenario B)

[Figure: reference haplotypes (Nhap), a diploid reference panel (Ndip) and an inference panel (Ninf); SNP index sets T, U1 and U2.]

(Remark: genotype 0 = 00, 1 = 01, 2 = 11)

Algorithm for Scenario A

[Figures: illustration of the burn-in, phasing and imputing steps on the Scenario A data; individual i is updated in the phasing and imputing steps.]

Phasing Step: Path Sampling

How to sample the path?

Imputation Step: Extract Posterior Probability

After many rounds, we can get a posterior probability for each individual and for each missing site.

Assume independence in sampling the haploid pair (Hap 1, Hap 2), then average.

Algorithm for Scenario A: Complexity Analysis

A) Burn-in phase
B) MCMC iterations, m times. For each individual i:
     i)   phase(i, T, hap+inf)             O((Nhap + Ninf)^2 L_T)
     ii)  impute(i, T+U, hap)              O(Nhap L_{T+U})
     iii) record(posterior probability)    O(L_{T+U})
C) Average over different runs of MCMC to get the genotype and confidence

Note: missing genotypes in T also need to be imputed.

Benefits of the Algorithm
  • Faster: reduces the load in the imputation step
  • More accurate: utilizes the available information to guess

Algorithm for Scenario B

[Figures: illustration over the panels Nhap, Ndip and Ninf with SNP sets T, U1 and U2, in this order: burn-in; phase T and U2 in the diploid reference; impute U1 in the diploid reference; phase T in the inference panel; impute U2 in the inference panel; impute U1 in the inference panel. Individual i is updated at each step. Note: need not match blue.]

Algorithm for Scenario B: Complexity Analysis

A) Burn-in phase
B) MCMC iterations, m times.
   For each individual i in the diploid reference panel:
     i)   phase(i, T+U2, hap+dip)          O((Nhap + Ndip)^2 L_{T+U2})
     ii)  impute(i, T+U1, hap)             O(Nhap L_{T+U1})
     iii) record(posterior probability)    O(L_{T+U1})
   For each individual i in the inference panel:
     i)   phase(i, T, hap+dip+inf)         O((Nhap + Ndip + Ninf)^2 L_T)
     ii)  impute(i, T+U2, hap+dip)         O(N_{hap+dip} L_{T+U2})
     iii) impute(i, U1, hap)               O(Nhap L_{U1})
     iv)  record(posterior probability)    O(L_{T+U1+U2})
C) Average over different runs of MCMC to get the genotype and confidence

Note: missing genotypes in T also need to be imputed.

Benefits of the Algorithm
  • Able to handle the new data type
  • Faster and more accurate
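The record and averaging steps (B and C in the complexity analyses) can be sketched as follows. This is a minimal illustration under assumptions, not the program's implementation. For one individual and one missing site, each MCMC iteration is assumed to yield the probability of allele 1 on each of the two haplotypes; the two haplotypes are treated as independent, and the resulting genotype probabilities are averaged over iterations. All function names are hypothetical.

```python
def genotype_probabilities(p1, p2):
    """Genotype probabilities (P(G=0), P(G=1), P(G=2)) from per-haplotype
    probabilities of allele 1, assuming the two haplotypes are independent."""
    return (
        (1.0 - p1) * (1.0 - p2),
        p1 * (1.0 - p2) + (1.0 - p1) * p2,
        p1 * p2,
    )


def average_over_iterations(samples):
    """Average genotype probabilities over MCMC iterations.

    samples: list of (p1, p2) pairs, one pair per iteration.
    """
    totals = [0.0, 0.0, 0.0]
    for p1, p2 in samples:
        for g, prob in enumerate(genotype_probabilities(p1, p2)):
            totals[g] += prob
    return [t / len(samples) for t in totals]


def call_genotype(samples):
    """Most probable genotype and its averaged probability (the confidence)."""
    probs = average_over_iterations(samples)
    best = max(range(3), key=lambda g: probs[g])
    return best, probs[best]


if __name__ == "__main__":
    # Made-up per-iteration values, for illustration only.
    samples = [(0.9, 0.1), (0.8, 0.2), (0.95, 0.05)]
    print(average_over_iterations(samples))
    print(call_genotype(samples))
```

The averaged probability of the called genotype plays the role of the confidence reported alongside the imputed genotype.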

Further Speeding Up
  • Choose the k closest neighbours in phasing. This requires computing Hamming distances: O(k^2 L) for the HMM but O(N L) for the Hamming distance computation (better than O(N^2 L) in the previous HMM calculation).
  • Choose the khap closest neighbours in imputation. khap >> k is also good (because the cost is O(k^2) in phasing but O(k) in imputation).
  • Note: mark down the numbers 50 vs 250.

Comparison with Beagle

Weakness of BEAGLE: full joint modeling of all individuals
Ac