Genotype Imputation Dan Evans [email protected] California Pacific Medical Center Research Institute


Page 1: Genotype Imputation

Genotype Imputation

Dan Evans [email protected]

California Pacific Medical Center Research Institute

Page 2: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 3: Genotype Imputation

Impute missing genotypes

Li et al., Annu Rev Genomics Hum Genet 2009

Page 4: Genotype Imputation

Benefits of imputation

• Expanded set of SNPs tested for association

• Facilitate meta-analysis among studies using different genotyping arrays

Marchini et al., Nature Genetics 2007

Page 5: Genotype Imputation

Imputation steps

• Phase study genotypes

• Impute missing genotypes from phased haplotypes

Phase 1   M1    M2    M3    M4
ID1       G --- G --- T --- A
ID1       G --- A --- C --- A

Phase 2   M1    M2    M3    M4
ID1       G --- A --- T --- A
ID1       G --- G --- C --- A
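The two phase configurations above can be enumerated mechanically. The following is a minimal sketch, not part of the original slides: the function name `haplotype_pairs` and the two-character genotype encoding are illustrative assumptions. It also shows why exhaustive enumeration becomes expensive: each additional heterozygous marker doubles the number of candidate haplotype pairs.

```python
from itertools import product

def haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs compatible with unphased genotypes.

    genotypes: list of 2-character strings, one per marker, e.g. "GA".
    (Hypothetical encoding chosen for this illustration.)
    """
    het_sites = [i for i, g in enumerate(genotypes) if g[0] != g[1]]
    pairs = set()
    # Try every way of assigning the two alleles at each heterozygous site.
    for flips in product([0, 1], repeat=len(het_sites)):
        hap1 = [g[0] for g in genotypes]
        hap2 = [g[1] for g in genotypes]
        for site, flip in zip(het_sites, flips):
            if flip:
                hap1[site], hap2[site] = hap2[site], hap1[site]
        # frozenset makes (hap1, hap2) and (hap2, hap1) the same pair.
        pairs.add(frozenset(["".join(hap1), "".join(hap2)]))
    return pairs

# Genotypes of ID1 at M1..M4 from the slide: G/G, G/A, T/C, A/A
example = haplotype_pairs(["GG", "GA", "TC", "AA"])
for pair in sorted(sorted(p) for p in example):
    print(pair)
```

With two heterozygous markers there are two configurations, matching Phase 1 and Phase 2 on the slide.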

Page 6: Genotype Imputation

Phasing genome-wide

• EM algorithm
  – Treats all possible haplotype configurations as equally likely a priori
  – Computational constraints when markers > 10

• Hidden Markov Models
  – New haplotypes derived from older haplotypes by mutation and recombination
  – Limits the possible haplotype configurations

Page 7: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 8: Genotype Imputation

Elements of a Hidden Markov Model (HMM)

Eddy, Nature Biotechnology 2004

• Probabilistic model for sequence annotation – identify the 5’ splice site
• Exons, splice sites, and introns have different base composition
• 3 states
• Each state has emission probabilities
• Each state has transition probabilities

path = π
Path is a Markov chain

Page 9: Genotype Imputation

Probability model of sequence

• Sequence x1 … xL, ith symbol xi

• transition between different states in the path

• emission – probability that symbol b is seen at position i when the ith state in the path is k

• Joint probability of sequence and path

Terms: start transition, emission, transition

Durbin et al., Biological sequence analysis, 1998

Page 10: Genotype Imputation

P(x, π) = … (0.25 × 0.9) × (0.25 × 0.9) × (0.95 × 0.1) × (0.4 × 1.0) × (0.4 × 0.9) …

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \prod_{i=2}^{L} a_{\pi_i \pi_{i-1}}$$
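The joint probability can be checked numerically. The sketch below uses the toy 5' splice-site model with states E (exon), 5 (splice site), and I (intron). The parameter values are recalled from Eddy (2004) and should be treated as assumptions; the end state of that model is left out for brevity, and the sequence and path are made up for illustration.

```python
# Toy splice-site HMM (assumed parameters, end state omitted).
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def joint_prob(seq, path):
    """P(x, pi): start transition, then one emission per position and one
    transition per step between consecutive states of the path."""
    prob = START.get(path[0], 0.0)
    for i, (symbol, state) in enumerate(zip(seq, path)):
        prob *= EMIT[state].get(symbol, 0.0)
        if i > 0:
            prob *= TRANS[path[i - 1]].get(state, 0.0)
    return prob

seq = "CAGTA"    # hypothetical short sequence
path = "EE5II"   # exon, exon, splice site, intron, intron
print(joint_prob(seq, path))
```

A path that uses a transition the model does not allow, such as E directly to I, gets probability zero.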

Page 11: Genotype Imputation

What if state path, emission probabilities and transition probabilities are unknown?

• Dynamic programming algorithms to determine path
  – Viterbi algorithm
  – Forward-backward algorithm

• Baum-Welch algorithm to estimate transition and emission probabilities
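As an illustration of the first dynamic programming option, here is a minimal Viterbi sketch that returns the single most probable state path. It reuses the toy splice-site parameters, which are recalled from Eddy (2004) and should be read as assumptions; probabilities are multiplied directly, which is only safe for short sequences (real implementations work in log space).

```python
STATES = ["E", "5", "I"]
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq, states, start, trans, emit):
    """Return (most probable path, its joint probability with seq)."""
    # v[k]: probability of the best path so far that ends in state k
    v = {k: start.get(k, 0.0) * emit[k].get(seq[0], 0.0) for k in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in seq[1:]:
        prev, v, ptr = v, {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] * trans[k].get(l, 0.0))
            v[l] = prev[best] * trans[best].get(l, 0.0) * emit[l].get(symbol, 0.0)
            ptr[l] = best
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]

best_path, best_prob = viterbi("CAGTA", STATES, START, TRANS, EMIT)
print(best_path, best_prob)
```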

Page 12: Genotype Imputation

Forward algorithm

• Probability of the observed sequence up to and including xi, requiring that state πi = k

At each position: emission × sum over all states (previous forward value × transition)
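A minimal sketch of the forward recursion follows. The toy splice-site parameters are recalled from Eddy (2004) and are assumptions, the end state is omitted, and no scaling is applied, so this version is only suitable for short sequences.

```python
STATES = ["E", "5", "I"]
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def forward(seq, states, start, trans, emit):
    """Return (f, P(x)) where f[i][k] is the probability of the first
    i+1 symbols with the path in state k at position i (0-based)."""
    f = [{k: start.get(k, 0.0) * emit[k].get(seq[0], 0.0) for k in states}]
    for symbol in seq[1:]:
        prev = f[-1]
        f.append({
            # emission at the new position times the sum over all previous
            # states of (previous forward value * transition)
            l: emit[l].get(symbol, 0.0)
               * sum(prev[k] * trans[k].get(l, 0.0) for k in states)
            for l in states
        })
    # Without an explicit end state, P(x) is the sum of the last column.
    return f, sum(f[-1].values())

f, px = forward("CAGTA", STATES, START, TRANS, EMIT)
print(px)
```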

Page 13: Genotype Imputation

Backward algorithm

• Probability of the observed sequence starting from the end and working backwards:

Start at the end; sum over all states at each position

Page 14: Genotype Imputation

Posterior state probabilities

• Want to know probability of state k at position i when the emitted sequence is known

• Posterior probability

General multiplication rule

Divide both sides by P(x)

From the posterior probabilities, one can take the most probable state at each position, or compute the posterior-weighted average of a function of the states
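The backward pass and the posterior calculation can be sketched together. The joint probability of the sequence and state k at position i is the forward value times the backward value; dividing by P(x) gives the posterior. The toy splice-site parameters are recalled from Eddy (2004) and are assumptions; the end state is omitted and no scaling is used.

```python
STATES = ["E", "5", "I"]
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "G": 0.95},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def forward(seq, states, start, trans, emit):
    f = [{k: start.get(k, 0.0) * emit[k].get(seq[0], 0.0) for k in states}]
    for symbol in seq[1:]:
        prev = f[-1]
        f.append({l: emit[l].get(symbol, 0.0)
                     * sum(prev[k] * trans[k].get(l, 0.0) for k in states)
                  for l in states})
    return f, sum(f[-1].values())

def backward(seq, states, trans, emit):
    """b[i][k]: probability of the symbols after position i, given state k at i."""
    b = [{k: 1.0 for k in states}]  # start at the end
    for symbol in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k].get(l, 0.0) * emit[l].get(symbol, 0.0) * nxt[l]
                            for l in states)
                     for k in states})
    return b

def posterior(seq, states, start, trans, emit):
    """P(state k at position i | x) = f[i][k] * b[i][k] / P(x)."""
    f, px = forward(seq, states, start, trans, emit)
    b = backward(seq, states, trans, emit)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(seq))]

post = posterior("CAGTA", STATES, START, TRANS, EMIT)
for i, column in enumerate(post):
    print(i, max(column, key=column.get), column)
```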

Page 15: Genotype Imputation

Baum-Welch algorithm

1. Initial guess at transition (a) and emission (e) probabilities

2. Forward-backward to find posterior probabilities of states in path

3. Use posterior probabilities at each state to estimate new a and e

4. Iterate steps 2 and 3 until a stopping criterion is met (small difference in log likelihood)

Version of EM algorithm

Page 16: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 17: Genotype Imputation

MACH Haplotyping with HMM

• Hidden – sequence of mosaic states S that emit the observed genotypes
• Transition probabilities – recombination events
• Emission probabilities – mutation, error

Li et al., Genet Epidemiol 2010

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \prod_{i=2}^{L} a_{\pi_i \pi_{i-1}}$$

Terms: start transition, emission, transition

Page 18: Genotype Imputation

MACH Path estimation

• Forward – backward algorithm to estimate path

• Update transition and emission probabilities with each estimated path (Baum-Welch algorithm)

• Rounds is the number of updates; 20 rounds are suggested for estimating the path and parameters

Page 19: Genotype Imputation

MACH genotype imputation

• HMM again, but this time include reference haplotypes
  – Count the frequency with which each genotype was sampled at each position across iterations
• Most probable genotype is the one sampled most often
• Expected number of allele counts (dosage) = (2 × hom counts + het counts) / number of sampling iterations
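The dosage formula can be illustrated with a small calculation. This is a sketch under stated assumptions: the function and argument names are hypothetical, the counts are made up, and "hom counts" is taken to mean the number of iterations in which the genotype was homozygous for the allele being counted.

```python
def summarize_genotype(hom_other, het, hom_counted):
    """Summarize how often each genotype was sampled for one individual
    at one marker (all three arguments are iteration counts)."""
    n_iterations = hom_other + het + hom_counted
    # Expected number of copies of the counted allele, between 0 and 2.
    dosage = (2 * hom_counted + het) / n_iterations
    counts = {"hom_other": hom_other, "het": het, "hom_counted": hom_counted}
    # Most probable genotype is the one sampled most often.
    best = max(counts, key=counts.get)
    # Genotype quality: share of iterations that agree with the best call.
    quality = counts[best] / n_iterations
    return dosage, best, quality

# Made-up counts over 50 sampling iterations
dosage, best, quality = summarize_genotype(hom_other=5, het=15, hom_counted=30)
print(dosage, best, quality)
```

Note that the division applies to the whole numerator; dividing only the heterozygote count by the number of iterations would give values outside the 0 to 2 range.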

Page 20: Genotype Imputation

MACH imputation quality measures

• Quality of genotype = proportion of iterations where the final imputed genotype was selected

• Quality of marker = genotype quality score averaged across all individuals

• r2 = observed/expected variance of genotype scores
  – p = mean(g)/2
  – r2 = Var(g) / [2*p*(1-p)]
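The r2 measure can be computed directly from the dosages at one marker. In this minimal sketch the function name and the example dosages are illustrative, and the use of the population variance (dividing by n) is an assumption; the slides do not say which variance estimator MACH uses.

```python
from statistics import mean, pvariance

def imputation_rsq(dosages):
    """Ratio of observed dosage variance to the variance expected under
    Hardy-Weinberg proportions, 2*p*(1-p), with p = mean(g)/2."""
    p = mean(dosages) / 2
    expected = 2 * p * (1 - p)
    if expected == 0:
        return float("nan")  # monomorphic marker: ratio is undefined
    return pvariance(dosages) / expected

well_imputed = [0.0, 1.0, 2.0, 1.0]    # dosages close to whole genotypes
poorly_imputed = [1.0, 1.0, 1.0, 1.0]  # every dosage shrunk to the mean
print(imputation_rsq(well_imputed), imputation_rsq(poorly_imputed))
```

Dosages that carry no individual-level information collapse toward the mean, so the observed variance, and with it r2, approaches zero.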

Page 21: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 22: Genotype Imputation

IMPUTEv2 vs MACH

• Transition and emission probabilities
  – IMPUTEv2 uses fixed values for these parameters. The emission probability is constant, assuming a uniform mutation rate. The transition probability comes from the fine-scale recombination map of the human genome.
  – MACH estimates these parameters using the Baum-Welch algorithm

Page 23: Genotype Imputation
Page 24: Genotype Imputation

IMPUTEv2 vs MACH

• Potential states
  – IMPUTEv2 considers study and reference haplotypes
    • Reduces complexity using Hamming distance to select genetically more similar haplotypes
    • Can accommodate large reference panels
  – MACH randomly selects 200 haplotypes, doesn’t leverage all haplotypes
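The Hamming-distance selection can be sketched in a few lines. This only illustrates the idea and is not IMPUTEv2's actual code: haplotypes are encoded as strings of 0/1 alleles, and the function names and panel are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_haplotypes(target, panel, k):
    """Keep the k panel haplotypes most similar to the target haplotype.
    sorted() is stable, so ties keep their original panel order."""
    return sorted(panel, key=lambda h: hamming(target, h))[:k]

panel = ["0101", "1111", "0100", "1010", "0001"]
print(closest_haplotypes("0101", panel, k=3))
```

Restricting the HMM to the k closest haplotypes keeps the number of states fixed as the panel grows, whereas a random subset of the same size may miss the most informative haplotypes.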

Page 25: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 26: Genotype Imputation

MACH

    # --rounds  number of iterations
    # --states  number of haplotypes to sample
    # --dosage  output dosage, not best genotypes
    ./mach1 \
      -d ../examples/sample.dat \
      -p ../examples/sample.ped \
      -h ../examples/hapmap.haplos \
      -s ../examples/hapmap.snps \
      --rounds 50 \
      --states 200 \
      --dosage \
      --prefix ../output/test \
      > ../output/dosage.log

Page 27: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 28: Genotype Imputation

IMPUTEv2

    # -m         recombination map
    # -h         reference haplotypes
    # -l         SNP annotation for reference haplotypes
    # -g         study genotypes
    # -strand_g  study SNP strand
    # -int       genomic interval
    # -Ne        effective population size, scales recombination rates
    ./impute2 \
      -m ./Example/example.chr22.map \
      -h ./Example/example.chr22.1kG.haps \
      -l ./Example/example.chr22.1kG.legend \
      -g ./Example/example.chr22.study.gens \
      -strand_g ./Example/example.chr22.study.strand \
      -int 20.4e6 20.5e6 \
      -Ne 20000 \
      -o ./Example/example.chr22.one.phased.impute2

Page 29: Genotype Imputation

---------------- Run parameters ----------------

reference haplotypes : 112 [Panel 0]
study individuals : 250 [Panel 2]
sequence interval : [20400000,20500000]
buffer : 250 kb
Ne : 20000
input call thresh : 0.900    # genotypes with P<0.9 are missing
burn-in MCMC iterations : 10    # forward-backward iterations that don’t contribute to imputation probabilities
total MCMC iterations : 30 (20 used for inference)
HMM states for phasing : 80 [Panel 2]
HMM states for imputation : 112 [Panel 0->2]    # make this large

Page 30: Genotype Imputation

Outline

• Overview
• Elements of a Hidden Markov Model (HMM)
• Methods used by MACH
• Method comparison with IMPUTEv2
• Implementation with MACH
• Implementation with IMPUTEv2
• Software evaluation

Page 31: Genotype Imputation

Howie et al., PLoS Genetics, 2009

Page 32: Genotype Imputation

Howie et al., PLoS Genetics, 2009

Page 33: Genotype Imputation

Pre-phasing

• Reference panels updated frequently
• Phase study haplotypes with SHAPEIT2
• Impute ungenotyped SNPs with IMPUTEv2

Page 34: Genotype Imputation

Hip OA GWAS