Hanna bosc2010

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Matt Hanna and Mark DePristo

Genome Sequencing and Analysis Group, Medical and Population Genetics Program

Broad Institute of Harvard and MIT

The Genome Analysis Toolkit Agenda

•  GATK Overview and Concepts
•  GATK Workflow
•  Example: A Simple Bayesian Genotyper


GATK: Overview and Concepts Motivation

[Figure: Coverage in xMHC region of JPT individuals]

•  Dataset size greatly increases analysis complexity.
•  Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.


GATK: Overview Simplifying the process of writing analysis tools for resequencing data

•  The framework is designed to support the most common paradigms of analysis algorithms
   –  Provides structured access to reads in BAM format, reference context, and reference-associated metadata

•  General-purpose
   –  Optimized for ease of use and completeness of functionality within scope

•  Efficient
   –  Engineering investment in the performance of critical data structures and manipulation routines

•  Convenient
   –  Structured plug-in model makes developing in Java against the framework relatively pain-free


GATK: Overview The MapReduce design philosophy

Map: function f applied to each element of the list. Result is f(x). Operations are independent of each other.

Reduce: function r recursively reduced over each f(…). Result is r(x, y, …, z). The result depends on all sites.

Data elements:  a  b  c  d  e
Map:            A  B  C  D  E, where X = f(x)
Reduce:         R = r(A, R(B, …, E))
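The map and reduce roles described on this slide can be sketched in a few lines of plain Java. This is a minimal illustration and not the GATK API: the class name `MapReduceSketch`, the helper `mapReduce`, and the toy `countMatches` task are all hypothetical names introduced here.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Illustrative only: a generic map/reduce over a list, not the GATK API. */
final class MapReduceSketch {

    /** Applies f to each element independently, then folds the mapped values with r. */
    static <T, M, R> R mapReduce(List<T> data, Function<T, M> f,
                                 BiFunction<M, R, R> r, R initial) {
        R sum = initial;
        for (T element : data) {
            M mapped = f.apply(element);  // map: depends only on this element
            sum = r.apply(mapped, sum);   // reduce: accumulates over all elements
        }
        return sum;
    }

    /** Toy use: count how many bases equal the target base. */
    static long countMatches(List<Character> bases, char target) {
        Function<Character, Integer> f = b -> b == target ? 1 : 0;
        BiFunction<Integer, Long, Long> r = (value, sum) -> value + sum;
        return mapReduce(bases, f, r, 0L);
    }

    public static void main(String[] args) {
        System.out.println(countMatches(List.of('A', 'C', 'C', 'A', 'C'), 'A'));
    }
}
```

Because each call to f sees only its own element, the map step can be distributed freely. Only the reduce step needs to see every mapped value.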


GATK: Overview Rapid development of efficient and robust analysis tools

Genome Analysis Toolkit (GATK) infrastructure: provides the boilerplate code required to perform any NGS analysis

•  Analysis tool: implemented by user
•  Traversal engine: provided by framework


GATK: Workflow Introduction


•  GATK Workflow
   •  An example of one of the GATK’s most common workflows
   •  Data access pattern: by locus
   •  Inputs: reads, reference, dbSNP

•  Example: A Simple Bayesian Genotyper


GATK: Workflow The sharding system: dividing data into processor-sized pieces

Reads

Reference

dbSNP

•  Divides data into small chunks that can be processed independently

•  Handles extraction of subsets of data
•  Groups small intervals together to avoid repetitive decompression


GATK: Workflow Traversal engines: preparing data for processing

Builds data structures that are easily consumed by the analysis


GATK: Workflow Interaction between sharding system and traversal engines

•  Datasets are split into shards, which can be processed sequentially or in parallel.
•  When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
•  When processing in parallel, the result of each shard is computed independently and then “tree-reduced” together.


GATK: Workflow Walkers: Analyses written by end-users

[Figure: ref and reads pileup at a single locus, with exons and dbsnp tracks, passed to the analysis tool]

•  Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.

•  Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.

•  The GATK provides tools to filter the pileup automatically or on demand.


GATK: Workflow Other data access patterns

Other data access patterns:

Traversal type      Description
Reads               Call map per read, along with the reference and reference-ordered metadata spanning that read.
Duplicates          Call map for each set of duplicate reads.
Read pair (naïve)   Call map for each read and its mate (naïve; requires the input BAM to be sorted in query name order).

Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.


GATK: Additional features Additional inputs and outputs

Reference metadata
•  Support for additional input data that is sorted in reference order can easily be added to the GATK.
•  Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
•  New file formats are indexed automatically.
•  New data types are autodiscovered via a classpath search.
•  Joint initiative with IGV.

Additional I/O
•  Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
•  Command-line argument types can become very sophisticated.
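The idea of binding command-line values to annotated fields can be illustrated with standard Java reflection. The sketch below is a stand-in and does not reproduce the GATK argument parser: the annotation `Arg`, the class `DemoWalker`, and the method `bind` are hypothetical names, and only `double` fields are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

/** Illustrative only: annotation-driven argument binding, not the GATK parser. */
final class ArgumentBindingSketch {

    /** Hypothetical annotation, named Arg so it is not mistaken for GATK's @Argument. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Arg {
        String fullName();
        boolean required() default true;
    }

    /** Hypothetical walker-like class with one optional numeric parameter. */
    static class DemoWalker {
        @Arg(fullName = "log_odds_score", required = false)
        double lodScore = 3.0;
    }

    /** Copies matching command-line values into annotated double fields. */
    static void bind(Object target, Map<String, String> commandLine) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Arg arg = field.getAnnotation(Arg.class);
            if (arg == null) continue;                 // not a declared argument
            String raw = commandLine.get(arg.fullName());
            if (raw == null) {
                if (arg.required())
                    throw new IllegalArgumentException("Missing argument: " + arg.fullName());
                continue;                              // keep the field's default value
            }
            try {
                field.setAccessible(true);
                field.setDouble(target, Double.parseDouble(raw));  // sketch handles doubles only
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}
```

Declaring the parameter next to the field that stores it keeps the default value, the name, and the documentation in one place, which is the convenience the slide points to.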


Walkers: Example A simple Bayesian genotyper


•  GATK Workflow

•  Example: A Simple Bayesian Genotyper
   •  A functional genotyper in under 150 lines of code
   •  A minimal example: calls are much lower in quality than the UnifiedGenotyper


Walkers: Example A simple Bayesian genotyper: the model


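$$L(G \mid D) = P(G)\,P(D \mid G), \qquad P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$

•  L(G | D): likelihood for the genotype
•  P(G): prior for the genotype
•  P(D | G): likelihood of the data given the genotype
•  First equation: Bayesian model
•  Second equation: independent base model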

•  Likelihood of data computed using pileup of bases and associated quality scores at given locus

•  Only “good bases” are included: those satisfying minimum base quality, read mapping quality, pair mapping quality, NQS

•  L(G|D) computed for all 10 genotypes

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
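The per-base term P(b | G) is not written out on this slide. Reading it off the map() code shown on the later slides, it can be stated as follows. The symbols Q_b and ε_b are notation introduced here, and the factor 1/2 assumes a two-allele (diploid) genotype.

```latex
% Per-base error probability from the Phred-scaled base quality Q_b
\epsilon_b = 10^{-Q_b / 10}

% Probability of observing base b given a diploid genotype G (two alleles a)
P(b \mid G) = \frac{1}{2} \sum_{a \in G}
  \begin{cases}
    1 - \epsilon_b & \text{if } a = b \\
    \epsilon_b / 3 & \text{otherwise}
  \end{cases}

% Log-scaled score accumulated for each genotype
\log_{10} L(G \mid D) = \log_{10} P(G) + \sum_{b \in \{\text{good bases}\}} \log_{10} P(b \mid G)
```

As a worked case, a base A observed with quality 20 has ε = 0.01, so P(A | AA) = 0.99, P(A | AC) = (0.99 + 0.01/3) / 2 ≈ 0.4967, and P(A | CC) = 0.01/3 ≈ 0.0033.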


•  Walker specifies the data access pattern and declares command-line arguments.

•  Inheritance defines traversal type.
•  Annotation defines command-line argument.

public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {

    @Argument(fullName = "log_odds_score", shortName = "LOD", doc = "The LOD threshold", required = false)
    private double LODScore = 3.0;



public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {

    double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPrior(
        ref.getBase(), DiploidGenotypePriors.HUMAN_HETEROZYGOSITY, 0.01);

    // get the bases and qualities from the pileup
    ReadBackedPileup pileup = context.getBasePileup().getPileupWithoutMappingQualityZeroReads();
    byte bases[] = pileup.getBases();
    byte quals[] = pileup.getQuals();
    …

•  Walker prepares the input dataset.
•  ReadBackedPileup utility can be used to filter pileup on demand.



    for (GENOTYPE genotype : GENOTYPE.values())
        for (int index = 0; index < bases.length; index++) {
            // our epsilon is the de-Phred scored base quality
            double epsilon = Math.pow(10, quals[index] / -10.0);

            byte pileupBase = bases[index];
            double p = 0;
            for (char r : genotype.toString().toCharArray())
                p += r == pileupBase ? 1 - epsilon : epsilon / 3;
            likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
        }

    Integer sortedList[] = MathUtils.sortPermutation(likelihoods);

•  Calculate the likelihood for each possible genotype.
•  Determine the best of the calculated genotypes.
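The likelihood loop on this slide depends on GATK classes, so here is a self-contained sketch of the same arithmetic that runs on its own. It is a simplification under stated assumptions: the prior is flat (every genotype starts at log10(1) = 0) instead of the reference-polarized prior used in the slides, the genotypes are a hard-coded string array, and the class and method names are hypothetical.

```java
/** Illustrative only: per-genotype log10 likelihoods for one pileup, flat prior assumed. */
final class GenotypeLikelihoodSketch {

    // The 10 unordered diploid genotypes over A, C, G, T.
    static final String[] GENOTYPES =
        {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    /** Sums log10 P(b | G) over the pileup for every genotype. */
    static double[] log10Likelihoods(byte[] bases, byte[] quals) {
        double[] likelihoods = new double[GENOTYPES.length];  // zeros: flat prior (assumption)
        for (int g = 0; g < GENOTYPES.length; g++) {
            for (int i = 0; i < bases.length; i++) {
                // epsilon is the de-Phred scaled base quality
                double epsilon = Math.pow(10, quals[i] / -10.0);
                double p = 0;
                for (char allele : GENOTYPES[g].toCharArray())
                    p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
                likelihoods[g] += Math.log10(p / GENOTYPES[g].length());
            }
        }
        return likelihoods;
    }

    /** Returns the genotype with the highest log10 likelihood. */
    static String bestGenotype(byte[] bases, byte[] quals) {
        double[] likelihoods = log10Likelihoods(bases, quals);
        int best = 0;
        for (int g = 1; g < likelihoods.length; g++)
            if (likelihoods[g] > likelihoods[best]) best = g;
        return GENOTYPES[best];
    }

    public static void main(String[] args) {
        byte[] quals = {30, 30, 30, 30};
        System.out.println(bestGenotype("AACC".getBytes(), quals));
    }
}
```

Working in log10 space turns the product over bases into a sum, which avoids numerical underflow on deep pileups.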



    …
    if (lod > LODScore)
        out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(), selectedGenotype, lod, (char)ref.getBase());
    return 1;
}} // end of map() function

public Long reduce(Integer value, Long sum) { return value + sum; }

public void onTraversalDone(Long result) { out.printf("Simple Genotyper genotyped %d loci.", result); }

•  Conditionally output the results.
•  Use reduce to calculate number of genotypes called.
•  Writing to provided output stream is guaranteed to be thread-safe.


Walkers: Threading performance A simple Bayesian genotyper

GATK performance improves nearly linearly as processors are added


Genome Analysis Toolkit 1000 Genomes Project

More info: http://www.broadinstitute.org/gsa/wiki/
Support: http://www.getsatisfaction.com/gsa/

Initial alignment

MSA realignment

Q-score recalibration

Base error modeling

Genotyping

SNP filtering

•  All of these tools have been developed in the GATK

•  They are memory and CPU efficient, cluster friendly, and easily parallelized

•  They are now publicly available and are being used at many sites around the world

•  Supports any BAM-compatible aligner


Acknowledgments

Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko

Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker

1000 Genomes project (in general, but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li

Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll

Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz

Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir

Genome Sequencing Platform (in general, but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom

MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly
