bosc-2010
The Genome Analysis Toolkit A MapReduce framework for analyzing next-generation DNA sequencing data
Matt Hanna and Mark DePristo
Genome Sequencing and Analysis Group, Medical and Population Genetics Program
Broad Institute of Harvard and MIT
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
The Genome Analysis Toolkit Agenda
GATK: Overview and Concepts Motivation
Coverage in xMHC region of JPT individuals
• Dataset size greatly increases analysis complexity.
• Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.
GATK: Overview Simplifying the process of writing analysis tools for resequencing data
• The framework is designed to support most common paradigms of analysis algorithms
  – Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
• General-purpose
  – Optimized for ease of use and completeness of functionality within scope
• Efficient
  – Engineering investment on performance of critical data structures and manipulation routines
• Convenient
  – Structured plug-in model makes developing in Java against the framework relatively pain-free
GATK: Overview The MapReduce design philosophy
Map: function f(x) applied to each element of the list. Operations are independent of each other.
Reduce: function r(x, y, …, z) recursively reduced over each f(…). The result depends on all sites.
Data elements: a b c d e
Map: X = f(x) gives A B C D E
Result is: R = r(A, r(B, …, E))
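The map and reduce steps above can be sketched in plain Java. This is a minimal illustration, not GATK code: the class name MapReduceSketch and the method mapReduce are hypothetical, and the fold order is chosen to mirror R = r(A, r(B, …, E)).

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical illustration of the map/reduce idea; not part of the GATK.
final class MapReduceSketch {
    /** Applies f to every element, then folds the mapped values right to left with r. */
    static <T, M> M mapReduce(List<T> data, Function<T, M> f, BinaryOperator<M> r) {
        if (data.isEmpty()) {
            throw new IllegalArgumentException("need at least one element");
        }
        // Start from the last mapped element so the fold matches R = r(A, r(B, ..., E)).
        M result = f.apply(data.get(data.size() - 1));
        for (int i = data.size() - 2; i >= 0; i--) {
            result = r.apply(f.apply(data.get(i)), result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Character> bases = List.of('A', 'C', 'C', 'A', 'C');
        // f: 1 if the base is a C, else 0. r: addition. Together they count the C bases.
        Function<Character, Integer> isC = b -> b == 'C' ? 1 : 0;
        BinaryOperator<Integer> add = Integer::sum;
        System.out.println(mapReduce(bases, isC, add)); // prints 3
    }
}
```

Each call to f touches one element only, which is what allows the map step to be distributed; only r sees more than one site.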
GATK: Overview Rapid development of efficient and robust analysis tools
Genome Analysis Toolkit (GATK) infrastructure: provides the boilerplate code required to perform any NGS analysis
• Analysis tool: implemented by user
• Traversal engine: provided by framework
GATK: Workflow Introduction
• GATK Overview and Concepts
• GATK Workflow
  • An example of one of the GATK’s most common workflows
  • Data access pattern: by locus
  • Inputs: reads, reference, dbSNP
• Example: A Simple Bayesian Genotyper
GATK: Workflow The sharding system: dividing data into processor-sized pieces
Inputs: reads, reference, dbSNP
• Divides data into small chunks that can be processed independently
• Handles extraction of subsets of data
• Groups small intervals together to avoid repetitive decompression
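One way to picture the chunking and grouping described above is the sketch below. It is an assumption-laden toy, not the GATK sharding code: ShardingSketch, Interval, and the fixed shardSize budget are all hypothetical, and real shards are driven by the BAM index rather than by a simple base count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of sharding; names and strategy are assumptions.
final class ShardingSketch {
    /** Half-open interval [start, end) on a single contig. */
    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cuts long intervals and packs short ones together, up to shardSize bases per shard. */
    static List<List<Interval>> shard(List<Interval> sortedIntervals, int shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<List<Interval>> shards = new ArrayList<>();
        List<Interval> current = new ArrayList<>();
        int used = 0;
        for (Interval iv : sortedIntervals) {
            int pos = iv.start;
            while (pos < iv.end) {
                // Take as much of this interval as still fits in the current shard.
                int take = Math.min(iv.end - pos, shardSize - used);
                current.add(new Interval(pos, pos + take));
                pos += take;
                used += take;
                if (used == shardSize) { // shard is full: emit it and start a new one
                    shards.add(current);
                    current = new ArrayList<>();
                    used = 0;
                }
            }
        }
        if (!current.isEmpty()) {
            shards.add(current);
        }
        return shards;
    }
}
```

With a budget of 100 bases, the intervals [0, 30), [50, 80) and [90, 200) become two shards: the first groups the two short intervals with the first 40 bases of the long one, the second holds [130, 200).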
GATK: Workflow Traversal engines: preparing data for processing
Builds data structures easily consumed by the analysis
GATK: Workflow Interaction between sharding system and traversal engines
• Datasets are split into shards, which can be processed sequentially or in parallel.
• When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
• When processing in parallel, the result of each shard is computed independently and then “tree-reduced” together.
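The two processing modes can be contrasted with a small sketch. Everything here is hypothetical (class ShardReduceSketch, shards modelled as int arrays of map() outputs, addition as the reduce); it only illustrates the control flow, and the pairwise merge stands in for whatever the engine does when it tree-reduces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of sequential versus tree reduction over shards.
final class ShardReduceSketch {
    /** Reduces one shard of map() outputs, starting from the given seed. */
    static long reduceShard(int[] mapped, long seed) {
        long sum = seed;
        for (int value : mapped) {
            sum = value + sum; // same shape as reduce(value, sum)
        }
        return sum;
    }

    /** Sequential mode: the reduce value of each shard bootstraps the next shard. */
    static long sequential(List<int[]> shards) {
        long sum = 0L;
        for (int[] shard : shards) {
            sum = reduceShard(shard, sum);
        }
        return sum;
    }

    /** Parallel mode: shards are reduced independently, then partial results are merged pairwise. */
    static long treeReduce(List<int[]> shards) {
        List<Long> level = shards.parallelStream()
                .map(shard -> reduceShard(shard, 0L))
                .collect(Collectors.toList());
        // Merge neighbouring partial results until a single value remains.
        while (level.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                long left = level.get(i);
                long merged = i + 1 < level.size() ? left + level.get(i + 1) : left;
                next.add(merged);
            }
            level = next;
        }
        return level.isEmpty() ? 0L : level.get(0);
    }
}
```

Both modes agree here because addition is associative; a reduce that depends on the order in which shards are combined would need more care in the parallel case.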
GATK: Workflow Walkers: Analyses written by end-users
[Figure: reads and reference at a single locus, with exons and dbsnp metadata tracks, presented to the analysis tool]
• Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
• Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
• The GATK provides tools to filter the pileup automatically or on demand.
GATK: Workflow Other data access patterns
Traversal type      Description
Reads               Call map per read, along with the reference and reference-ordered metadata spanning that read.
Duplicates          Call map for each set of duplicate reads.
Read pair (naïve)   Call map for each read and its mate (naïve, requires the input BAM to be sorted in query name order).
Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.
GATK: Additional features Additional inputs and outputs
Reference metadata
• Support for additional input data that is sorted in reference order can easily be added to the GATK.
• Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
• New file formats are indexed automatically.
• New data types are autodiscovered via a classpath search.
• Joint initiative with IGV.
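The feature-plus-codec split can be illustrated with the sketch below. The type names (IntervalFeature, LineCodec, IntervalCodec) and the four-column tab-separated layout are assumptions made for this example; they are not the interfaces shared by the GATK and IGV.

```java
// Hypothetical illustration of the feature/codec pattern; not the GATK or IGV API.
final class FeatureCodecSketch {
    /** Feature: a data access object for one record positioned on the reference. */
    static final class IntervalFeature {
        final String contig;
        final int start;
        final int end;
        final String name;

        IntervalFeature(String contig, int start, int end, String name) {
            this.contig = contig;
            this.start = start;
            this.end = end;
            this.name = name;
        }
    }

    /** Codec: turns one line of text into a feature. */
    interface LineCodec<T> {
        T decode(String line);
    }

    /** Parses tab-separated lines of the form: contig, start, end, name. */
    static final class IntervalCodec implements LineCodec<IntervalFeature> {
        @Override
        public IntervalFeature decode(String line) {
            String[] fields = line.split("\t");
            if (fields.length < 4) {
                throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
            }
            return new IntervalFeature(fields[0], Integer.parseInt(fields[1]),
                    Integer.parseInt(fields[2]), fields[3]);
        }
    }
}
```

Keeping parsing in the codec means the feature stays a plain value object, so the same feature type could in principle be backed by more than one file format.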
Additional I/O
• Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
• Command-line argument types can become very sophisticated.
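How an annotated field might be filled from the command line can be sketched with plain reflection. The @Argument declared below is a stand-in written for this example that reuses the attribute names shown on the genotyper slide (fullName, shortName, doc, required); ToyWalker and bind are hypothetical, handle only double fields, and do not reproduce the GATK's own argument parser.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical illustration of annotation-driven argument binding.
final class ArgumentSketch {
    /** Stand-in annotation; only the attribute names are borrowed from the slides. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Argument {
        String fullName();
        String shortName();
        String doc();
        boolean required() default true;
    }

    /** Toy walker with one annotated parameter and a default value. */
    static final class ToyWalker {
        @Argument(fullName = "log_odds_score", shortName = "LOD", doc = "The LOD threshold", required = false)
        private double LODScore = 3.0;

        double lodScore() {
            return LODScore;
        }
    }

    /** Fills annotated double fields from "--fullName value" or "-shortName value" pairs. */
    static void bind(Object target, String[] args) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Argument arg = field.getAnnotation(Argument.class);
            if (arg == null || field.getType() != double.class) {
                continue; // this sketch only supports double-typed arguments
            }
            boolean found = false;
            for (int i = 0; i + 1 < args.length; i += 2) {
                if (args[i].equals("--" + arg.fullName()) || args[i].equals("-" + arg.shortName())) {
                    try {
                        field.setAccessible(true);
                        field.setDouble(target, Double.parseDouble(args[i + 1]));
                    } catch (IllegalAccessException e) {
                        throw new IllegalStateException("cannot set " + field.getName(), e);
                    }
                    found = true;
                }
            }
            if (arg.required() && !found) {
                throw new IllegalArgumentException("missing required argument --" + arg.fullName());
            }
        }
    }
}
```

Calling bind(walker, new String[] {"-LOD", "5.5"}) overrides the default of 3.0; with no matching argument the default is kept because the field is marked required = false.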
Walkers: Example A simple Bayesian genotyper
• GATK Overview and Concepts
• GATK Workflow
• Example: A Simple Bayesian Genotyper
  • A functional genotyper in under 150 lines of code
  • A minimal example: calls are much lower in quality than the UnifiedGenotyper
Walkers: Example A simple Bayesian genotyper: the model
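Bayesian model:
\[ L(G \mid D) = P(G)\, P(D \mid G) \]
where L(G | D) is the likelihood for the genotype, P(G) is the prior for the genotype, and P(D | G) is the likelihood of the data given the genotype.
Independent base model:
\[ P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G) \]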
• Likelihood of data computed using pileup of bases and associated quality scores at given locus
• Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
• L(G|D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
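The per-base term P(b | G) is not written out on this slide. One reading, consistent with the map() code shown in this deck, is sketched below; the symbols \epsilon_b (error probability of base b) and Q_b (its Phred quality) are notation introduced here for illustration.

```latex
\[
\epsilon_b = 10^{-Q_b/10}, \qquad
P(b \mid G) = \frac{1}{2} \sum_{a \in G}
  \begin{cases}
    1 - \epsilon_b & \text{if } a = b \\
    \epsilon_b / 3 & \text{otherwise}
  \end{cases}
\]
\[
\log_{10} L(G \mid D) = \log_{10} P(G) + \sum_{b \in \{\text{good bases}\}} \log_{10} P(b \mid G)
\]
% Worked example for an observed base A with Q_b = 20, so \epsilon_b = 0.01:
\[
P(A \mid AA) = 0.99, \qquad
P(A \mid AC) = \tfrac{1}{2}\left(0.99 + \tfrac{0.01}{3}\right) \approx 0.4967, \qquad
P(A \mid CC) = \tfrac{0.01}{3} \approx 0.0033
\]
```

The sum over a runs over the two alleles of the diploid genotype, so a heterozygous genotype that contains the observed base scores about half of the matching homozygote.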
Walkers: Example A simple Bayesian genotyper
• Walker specifies the data access pattern and declares command-line arguments.
• Inheritance defines traversal type.
• Annotation defines command-line argument.
public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {
    @Argument(fullName = "log_odds_score", shortName = "LOD", doc = "The LOD threshold", required = false)
    private double LODScore = 3.0;
Walkers: Example A simple Bayesian genotyper
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPrior(
                ref.getBase(), DiploidGenotypePriors.HUMAN_HETEROZYGOSITY, 0.01);

        // get the bases and qualities from the pileup
        ReadBackedPileup pileup = context.getBasePileup().getPileupWithoutMappingQualityZeroReads();
        byte bases[] = pileup.getBases();
        byte quals[] = pileup.getQuals();
        …
• Walker prepares the input dataset.
• ReadBackedPileup utility can be used to filter pileup on demand.
Walkers: Example A simple Bayesian genotyper
        for (GENOTYPE genotype : GENOTYPE.values())
            for (int index = 0; index < bases.length; index++) {
                // our epsilon is the de-Phred scored base quality
                double epsilon = Math.pow(10, quals[index] / -10.0);

                byte pileupBase = bases[index];
                double p = 0;
                for (char r : genotype.toString().toCharArray())
                    p += r == pileupBase ? 1 - epsilon : epsilon / 3;
                // average over the alleles named in the genotype
                likelihoods[genotype.ordinal()] += Math.log10(p / genotype.toString().length());
            }

        Integer sortedList[] = MathUtils.sortPermutation(likelihoods);
• Calculate the likelihood for each possible genotype.
• Determine the best of the calculated genotypes.
Walkers: Example A simple Bayesian genotyper
            …
            if (lod > LODScore)
                out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(), selectedGenotype, lod, (char) ref.getBase());
            return 1;
        }
    } // end of map() function

    public Long reduce(Integer value, Long sum) {
        return value + sum;
    }

    // the reduce type of this walker is Long, so the final result arrives as a Long
    public void onTraversalDone(Long result) {
        out.printf("Simple Genotyper genotyped %d loci.", result);
    }
• Conditionally output the results.
• Use reduce to calculate number of genotypes called.
• Writing to provided output stream is guaranteed to be thread-safe.
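The slides' walker depends on GATK classes, so it cannot be run on its own. The sketch below reproduces only the likelihood loop in plain Java so it can be tried in isolation. Three things are assumptions, not the authors' code: a flat prior replaces the reference-polarized prior, the LOD is taken as the gap between the best and second-best log10 likelihood (the slides elide that step), and the names SimpleGenotyperSketch, log10Likelihoods, and call are hypothetical.

```java
// Hypothetical stand-alone sketch of the genotype likelihood calculation.
final class SimpleGenotyperSketch {
    /** The 10 unordered diploid genotypes over A, C, G, T. */
    static final String[] GENOTYPES = {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    /** log10 likelihood of each genotype given pileup bases and Phred qualities (flat prior). */
    static double[] log10Likelihoods(byte[] bases, byte[] quals) {
        double[] likelihoods = new double[GENOTYPES.length]; // assumption: flat prior, all start at 0
        for (int g = 0; g < GENOTYPES.length; g++) {
            for (int i = 0; i < bases.length; i++) {
                double epsilon = Math.pow(10, quals[i] / -10.0); // de-Phred the base quality
                double p = 0;
                for (char allele : GENOTYPES[g].toCharArray()) {
                    p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
                }
                likelihoods[g] += Math.log10(p / 2); // average over the two alleles
            }
        }
        return likelihoods;
    }

    /** Returns the best genotype if it beats the runner-up by more than lodThreshold, else null. */
    static String call(byte[] bases, byte[] quals, double lodThreshold) {
        double[] l = log10Likelihoods(bases, quals);
        int best = 0;
        int second = -1;
        for (int g = 1; g < l.length; g++) {
            if (l[g] > l[best]) {
                second = best;
                best = g;
            } else if (second < 0 || l[g] > l[second]) {
                second = g;
            }
        }
        double lod = l[best] - l[second]; // assumption: LOD = best minus second best
        return lod > lodThreshold ? GENOTYPES[best] : null;
    }
}
```

With twenty Q30 A bases the call is AA; with ten A and ten C bases it is AC; with only two A bases the gap to the runner-up stays below a threshold of 3.0 and no call is made.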
Walkers: Threading performance A simple Bayesian genotyper
GATK performance improves nearly linearly as processors are added
Genome Analysis Toolkit 1000 Genomes Project
More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/
Initial alignment
MSA realignment
Q-score recalibration
Base error modeling
Genotyping
SNP filtering
• All of these tools have been developed in the GATK
• They are memory and CPU efficient, cluster friendly, and easily parallelized
• They are now publicly available and are being used at many sites around the world
• Supports any BAM-compatible aligner
Acknowledgments
Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
1000 Genomes project (in general but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
Genome Sequencing Platform (in general but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly