DESCRIPTION
This is a talk I gave for the 2012 GES series, which focuses on applications and methods for genetic epidemiology.
Fast methods and software for imputation of whole-genome sequencing data
Gary K. Chen
Sept 18, 2012
An outline
Background and Motivation
Implementation/Software
Simulations based on KGP data
Tutorial
Ongoing work
Introduction
◮ Imputation
  ◮ Probabilistic inference of unobserved genotypes (e.g. not available on chip, poor QC, low coverage)
  ◮ Exploits the fact that we are all distantly related, implying reduced haplotype diversity
◮ Can potentially improve power
  ◮ by refining an association signal
  ◮ by pooling resources for large-scale collaborations
◮ Existing software
  ◮ FastPHASE, MACH, IMPUTE2, BEAGLE, PLINK, MENDEL, and others
  ◮ All use some form of Expectation Maximization to “learn” parameters, e.g. in Hidden Markov Models
The mechanics behind imputation
◮ Imputation is an exercise in haplotype counting
◮ We don’t observe the true haplotypes
◮ We make a large number of “guesses”, and weight each possible pairing (a much larger number!)
◮ Weights are likelihoods: i.e. the probability of the observed multimarker genotypes, given the pair of haplotype guesses
◮ We add up the weights across all possible pairs, which gives us a posterior probability of each genotype at a site (see the sketch below)
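To make the counting concrete, here is a minimal Python sketch of the idea on this slide: enumerate all haplotype-pair guesses, weight each pair by the likelihood of the observed genotype data, and sum the weights into per-site genotype posteriors. The function and variable names are illustrative, not from the actual software.

```python
import itertools
import numpy as np

def genotype_posteriors(geno_liks, haplotypes):
    """Toy diploid imputation by haplotype counting.

    geno_liks:  (n_sites, 3) array of P(observed data | g) at each site,
                for g in {0, 1, 2} copies of the alternate allele.
    haplotypes: list of candidate haplotypes (0/1 arrays of length n_sites).
    Returns an (n_sites, 3) array of posterior genotype probabilities.
    """
    n_sites = geno_liks.shape[0]
    sites = np.arange(n_sites)
    post = np.zeros((n_sites, 3))
    total = 0.0
    # Enumerate every possible pairing of haplotype guesses.
    for h1, h2 in itertools.combinations_with_replacement(haplotypes, 2):
        g = h1 + h2  # implied genotype (0, 1, or 2) at every site
        # Weight = likelihood of the observed multimarker genotypes,
        # given this haplotype pair (sites treated as independent here).
        w = np.prod(geno_liks[sites, g])
        post[sites, g] += w
        total += w
    return post / total  # normalize weights into posteriors
```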
Imputation in GWAS
◮ Imputation is the bottleneck in modern high-dimensional genetic studies
  ◮ Memory: Because humans are diploid, we must integrate over phasing uncertainty: haps*(haps+1)/2 pairs
  ◮ Speed: Current methods (either MCMC or EM) must iterate a number of times until convergence
◮ Real-world challenges
  ◮ Requires massive computational resources (e.g. >3500 jobs for the HapMap 2 reference panel)
  ◮ Scaling to genome-wide sequence data (1-2 orders of magnitude more coverage)?
  ◮ Imputing very rare alleles from deep reference panels (thousands of haplotypes) will only exacerbate the computational burden, which grows quadratically
Why is imputation so computationally demanding?
◮ Simplified concrete example: 1000 individuals, 20-SNP window
◮ Suppose we make 2^20 = 1,048,576 = h guesses; then h(h+1)/2 = 549,756,338,176 pairs
◮ We calculate weights for the first individual, repeating for everyone: 549,756,338,176,000 weight calculations
◮ We loop again to get genotype posteriors, so total computation is 2 × 549,756,338,176,000 ≈ 1.1 quadrillion calculations (checked in the sketch below)
◮ In reality, of course, we expect LD to greatly reduce h
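The arithmetic on this slide is easy to reproduce; here is a quick back-of-the-envelope check in Python using the slide's own numbers.

```python
# Back-of-the-envelope check of the 1000-individual, 20-SNP example above.
n_individuals = 1000
h = 2 ** 20                       # 1,048,576 haplotype guesses
pairs = h * (h + 1) // 2          # 549,756,338,176 unordered pairs
weights = pairs * n_individuals   # 549,756,338,176,000 weight calculations
total = 2 * weights               # second pass for genotype posteriors
print(f"{pairs:,} pairs; {total:,} total (~{total / 1e15:.1f} quadrillion)")
```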
Current best practices
◮ Chunking
  ◮ Divide regions into subregions of several megabases each
  ◮ Run each subregion independently on a cluster node as an embarrassingly parallel problem
◮ Pre-phase
  ◮ Estimating haplotype frequencies is by far the most expensive procedure
  ◮ Phase study data across a much smaller subset of SNPs (e.g. 660k)
  ◮ Run fast haploid-based imputation, using a sequence-based reference panel (the sketch below shows why this helps)
◮ Nevertheless, imputation can take weeks if not months on large clusters!
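Why pre-phasing helps, in a minimal cost sketch: once haplotypes are fixed, imputation no longer integrates over all haplotype pairs, only over the reference haplotypes themselves. The numbers below are hypothetical, chosen only to show the scaling.

```python
# Hypothetical cost comparison: diploid imputation weights all h*(h+1)/2
# haplotype pairs, while haploid imputation on pre-phased data scans only
# the h reference haplotypes for each of a subject's two fixed haplotypes.
h = 1000                           # reference haplotypes (illustrative)
diploid_cost = h * (h + 1) // 2    # integrate over phasing uncertainty
haploid_cost = 2 * h               # two fixed haplotypes per subject
print(f"diploid: {diploid_cost:,} evaluations per subject")
print(f"haploid: {haploid_cost:,} evaluations per subject")
print(f"~{diploid_cost // haploid_cost:,}x reduction after pre-phasing")
```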
Breaking down computational barriers
◮ Traditional CPU clusters are not keeping up
  ◮ We are interested in rarer variation, so we need deeper reference panels
  ◮ Computational demand scales linearly with respect to SNPs
  ◮ However, computation increases as the square of the number of reference haplotypes
  ◮ Algorithms do not take advantage of innovations in processor technology
◮ Innovations in processor technology
  ◮ Just as sequencing has revolutionized data production, new microprocessor technologies and programming interfaces are revolutionizing software development
  ◮ Many-core processors: e.g. Graphics Processing Units
China’s GPU farm at BGI
Examples of GPU devices
GPUs in scientific computing
◮ Far more efficient than today’s CPUs (with respect to energy and hardware cost)
◮ A single device can contain over 500 computing cores in close proximity
◮ Each enterprise-grade device runs approx. $3000
◮ Ideal for large-scale optimization
  ◮ Common routines like matrix multiplication commonly yield over 300x speedups
  ◮ Other good candidates include HMMs, matrix inversion, and PCA
  ◮ Many algorithms can be rewritten to expose fine-grained calculations that can be done independently on separate cores
Parallel HMM
◮ Over 25x speedup on a single machine
◮ ASHG talk
  ◮ Kai Wang, Gary K Chen: GPU accelerated genotype imputation for low-coverage high-throughput whole-genome sequencing data. In: International Congress of Human Genetics, Montreal, Canada, 2011.
Table: Heterozygote Accuracy

MAF     MaCH    Our program
<0.01   0.684   0.821
<0.03   0.798   0.865
<0.05   0.865   0.882
An outline
Background and Motivation
Implementation/Software
Simulations based on KGP data
Tutorial
Ongoing work
The algorithm behind MaCH

[Figure: schematic of the MaCH HMM. Observed genotypes (with missing sites marked “?”) define a state space over phasing uncertainty, i.e. pairs of candidate haplotypes. Backward probabilities are computed across the state space, states are then sampled conditional on the backward probabilities and crossover rates, and the samples yield best-guess haplotypes.]
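Since the sampler conditions on backward probabilities, a standard backward pass is the natural building block here. Below is a generic haplotype-copying HMM backward recursion in Python; it is a minimal sketch of the textbook computation, not MaCH's actual implementation, and the array shapes are assumptions.

```python
import numpy as np

def backward_probs(emission, trans):
    """Backward pass of a simple haplotype-copying HMM.

    emission: (n_sites, n_states) P(data at site t | hidden state k)
    trans:    (n_states, n_states) transitions between adjacent sites
              (in practice a function of the crossover rate).
    Returns beta, where beta[t, k] is proportional to
    P(observations after site t | state k at site t).
    """
    n_sites, n_states = emission.shape
    beta = np.ones((n_sites, n_states))
    for t in range(n_sites - 2, -1, -1):
        # beta[t, i] = sum_j trans[i, j] * emission[t+1, j] * beta[t+1, j]
        beta[t] = trans @ (emission[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()  # rescale to prevent underflow
    return beta
```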
Our algorithm

[Figure: sliding-window schematic. Haplotype frequencies are first estimated on a compact haplotype set (e.g. 111010101 → .3, 101010101 → .3, 000001000 → .4); the middle third of the window’s genotypes is imputed; frequencies are then assigned on the full haplotype set, and the window advances by one third.]

◮ Key point: we can support *all* reference haplotypes, not just a random subset (the window schedule is sketched below)
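The window schedule in the figure can be written down in a few lines. This is a toy sketch of the advance-by-one-third scheme only; the window and step sizes here are hypothetical.

```python
def sliding_windows(n_sites, window=9, step=3):
    """Yield windows that advance by one third; only the middle third of
    each window is imputed, so every interior site is eventually covered."""
    start = 0
    while start + window <= n_sites:
        yield (start, start + window,            # full window: estimate freqs
               start + step, start + 2 * step)   # middle third: impute here
        start += step

# Each tuple is (window_start, window_end, impute_start, impute_end).
for win in sliding_windows(27):
    print(win)
```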
GPU implementation
◮ A massively parallel problem:
  ◮ Typical to deploy millions of computational “work-items” in a single function call
  ◮ Consideration of all possible pairs of haplotype guesses
  ◮ Each hap pair maps to a “work-item”
  ◮ Each subject maps to a “work-group”

[Figure: Workgroups 1 through N, each containing 256 work-items.]

◮ Computations are masked by memory latency
  ◮ One workgroup may fetch data while another is finishing computations
  ◮ Powerful mechanism for especially large problems (the index mapping is sketched below)
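To make the decomposition concrete, here is a toy Python mock-up of how a flat work-item index can be mapped to an unordered haplotype pair, mirroring the one-subject-per-work-group layout above. The 256-item group size follows the figure; everything else is illustrative, not the program's actual code.

```python
WORKGROUP_SIZE = 256  # work-items per work-group, as in the figure

def pair_from_index(p):
    """Map flat pair index p to an unordered pair (i, j) with j <= i,
    enumerating pairs as (0,0), (1,0), (1,1), (2,0), ..."""
    i = 0
    while (i + 1) * (i + 2) // 2 <= p:
        i += 1
    return i, p - i * (i + 1) // 2

def schedule(n_subjects, n_pairs):
    """CPU mock-up of the launch geometry: each subject is a work-group
    whose 256 work-items stride through all haplotype pairs."""
    for subject in range(n_subjects):           # one work-group per subject
        for item in range(WORKGROUP_SIZE):      # work-items within the group
            for p in range(item, n_pairs, WORKGROUP_SIZE):
                i, j = pair_from_index(p)
                # ... weight haplotype pair (i, j) for this subject ...
```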
Speedups are linear on a CPU cluster, but super-linear on a GPU
Manuscript and software
An outline
Background and Motivation
Implementation/Software
Simulations based on KGP data
Tutorial
Ongoing work
Evaluation on KGP-derived simulated data
◮ 1KGP Cosmopolitan panel
  ◮ includes AFR, AMR, ASN, and EUR
  ◮ Allocated 50% of each ethnic group into two datasets
  ◮ Dataset 1 is the phased reference haplotypes
  ◮ Dataset 2 is a hypothetical study
  ◮ Study data consists of genotype likelihoods reflecting mean subject-level coverage of 4x
◮ Evaluation
  ◮ Applied recommended settings (e.g. IMPUTE2: phasing states=80, MCMC rounds=30)
  ◮ Accuracy: applied post-imputation filtering criteria to recover approx. the same number of SNPs
  ◮ Benchmarked RAM usage and run times
Table: Accuracy (dosage correlation and heterozygote accuracy)

MAF      IMPUTE2          GPU-IMPUTE
range    Dose     Het     Dose     Het
0.01     0.725    0.952   0.792    0.922
0.02     0.796    0.928   0.829    0.929
0.03     0.826    0.943   0.857    0.939
0.04     0.872    0.961   0.887    0.955
0.05     0.886    0.963   0.905    0.955
0.06     0.906    0.979   0.904    0.966
0.07     0.932    0.976   0.929    0.967
0.08     0.933    0.977   0.929    0.969
0.09     0.950    0.978   0.941    0.969
0.10     0.953    0.983   0.944    0.974
0.20     0.956    0.985   0.951    0.976
0.30     0.961    0.981   0.960    0.971
0.40     0.963    0.979   0.966    0.968
0.50     0.963    0.975   0.968    0.965
Computational requirements
Table: Memory and run time

Program      Runtime     RAM
IMPUTE2      38:52:23    3.7GB
GPU-IMPUTE   00:16:38    576MB
Fold speedup: 140.2x
An outline
Background and Motivation
Implementation/Software
Simulations based on KGP data
Tutorial
Ongoing work
Recommended hardware/platform
◮ 1 or more GPU devices that are CUDA or ATI Stream compliant
◮ Linux OS: pipeline scripts
◮ MySQL database server: to store and sort study data
Configuration
Distribution of effort

[Figure: each region (Region 1 … Region R) is divided into sample chunks (1−1000, 1001−2000, 2001−3000, 3001−3850), with Chunks 1−4 distributed across GPU 1 and GPU 2.]
Step 1: Fetch study data
◮ Retrieves from database, sorts, and chunks data into regions
Step 2: Pre-process KGP data
◮ Removes extremely rare sites, chunks data into regions (a sketch of the chunking logic follows)
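Steps 1 and 2 both chunk data into genomic regions. A hypothetical sketch of that logic is below; the real pipeline's region size and file formats may differ.

```python
# Hypothetical chunking: split sorted base-pair positions into fixed-width
# genomic regions so each region can run as an independent job.
def chunk_by_position(positions, region_bp=5_000_000):
    """Group sorted positions into consecutive regions of region_bp each."""
    regions = {}
    for pos in positions:
        regions.setdefault(pos // region_bp, []).append(pos)
    return [regions[k] for k in sorted(regions)]
```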
Step 3: Pre-phase the study data
◮ Outputs posterior probabilities of the 4 ordered genotypes.
Step 4: Post-process the phased data
◮ Generates input files for haploid imputation.
Step 5: Impute haploid data into the KGP reference
◮ Imputation step. Very fast, but I/O intensive. Recommended on an HPCC.
Analysis of Multi-ethnic Cohort data
◮ Host
  ◮ epigraph.epigenome.usc.edu
  ◮ Two Tesla C2050 GPUs, each with 448 cores
◮ GWAS studies imputed:
  ◮ AABC (1M): 5761
  ◮ AAPC-A (1M): 6806
  ◮ AAPC-B (1M): 2835
  ◮ JABC (660k): 2211
  ◮ LABC (660k): 1070
  ◮ LAPC/JAPC (660k): 4175
  ◮ T2D-Lat (2.5M): 4673
  ◮ Hecht-smoking (1M): 2319
◮ Total samples = 29,850
◮ Total SNPs = 13,123,026
An outline
Background and Motivation
Implementation/Software
Simulations based on KGP data
Tutorial
Ongoing work
On the horizon: matrix completion
◮ Matrix completion
  ◮ Is the basis of the winning entry of the $1M Netflix challenge
  ◮ Customers rate about 1% of the movies; can we impute the other 99% and predict which movies they will like?
◮ Model-free imputation
  ◮ Makes no assumptions about inter-site and inter-person correlations (all other programs assume independence in the latter)
◮ Extremely fast (see the sketch below)
  ◮ 421 times faster than MaCH on pedigree data
  ◮ Parallelization of the SVD step may lead to another order of magnitude improvement
◮ Eric Chi
  ◮ If interested, please come to his noon talk Oct. 25 at the SSB first-floor classrooms
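For flavor, here is a generic iterative truncated-SVD matrix-completion sketch in Python, the basic idea behind Netflix-style imputation. It illustrates the technique only; it is not Eric Chi's algorithm or its 421x-faster implementation.

```python
import numpy as np

def svd_complete(X, mask, rank=5, n_iters=100):
    """Fill missing entries of X (mask == False) with a rank-`rank`
    approximation, alternating SVD with re-imposing observed entries."""
    Z = np.where(mask, X, 0.0)           # initialize missing entries at 0
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # truncated SVD
        Z = np.where(mask, X, low_rank)  # keep observed, update missing
    return Z
```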
Collaborators
◮ USC
  ◮ Kai Wang
  ◮ Alex Stram
  ◮ Chris Haiman
  ◮ Brian Henderson
  ◮ AABC consortium
  ◮ AAPC consortium
◮ UCLA
  ◮ Kenneth Lange
  ◮ Eric Sobel
  ◮ Eric Chi