
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with Hadoop and Spark


© Cloudera, Inc. All rights reserved.

The Redemptive Power of Hadoop

Uri Laserson | @laserson | 14 November 2015

Scaling Up Genomics with Spark


We come in peace.

Pioneer plaque


What is genomics?

Organism → Cell → Genome


Reference chromosome

Location

“… decoding the Book of Life”


Ortelius, 1570

Google Maps, 2015


>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT


Bioinformatics!
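
The reads on this slide are written in FASTA style: a `>` line naming the read, followed by its bases. As a minimal sketch of what the first bioinformatics step looks like in code, the snippet below parses that text into a dictionary. The function name `parse_fasta` is made up for this illustration and is not taken from the talk.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-style text into {read_name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current record.
            records[name] += line
    return records


# The four reads shown on the slide.
FASTA_TEXT = """\
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
"""

reads = parse_fasta(FASTA_TEXT.splitlines())
```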


Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Pipelines!


##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressed text files (non-splittable)

Semi-structured

Poorly specified


Global sort order
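
To make "semi-structured" concrete, here is a rough sketch of what it takes to read a single VCF data line by hand: the INFO column mixes key=value pairs with bare flags, and the per-sample columns can only be interpreted through the FORMAT column. The helper names `parse_info` and `parse_record` are invented for this example, and the line is split on whitespace because that is how it appears on the slide (real VCF files are tab-delimited).

```python
from typing import Dict

HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003"
RECORD = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
          "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")


def parse_info(info: str) -> Dict[str, object]:
    """Split the INFO column; flags such as DB carry no value."""
    out: Dict[str, object] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_record(header: str, record: str) -> dict:
    """Parse one VCF data line into nested dictionaries (values stay strings)."""
    columns = header.lstrip("#").split()
    fields = record.split()  # assumption: whitespace-delimited, as on the slide
    fixed = dict(zip(columns[:8], fields[:8]))
    # The FORMAT column names the sub-fields of every sample column.
    format_keys = fields[8].split(":")
    genotypes = {
        sample: dict(zip(format_keys, call.split(":")))
        for sample, call in zip(columns[9:], fields[9:])
    }
    return {**fixed, "INFO": parse_info(fixed["INFO"]), "samples": genotypes}


variant = parse_record(HEADER, RECORD)
```

Every value comes back as a string; typing the fields correctly would mean parsing the `##INFO` and `##FORMAT` header lines as well, which is part of why the format is awkward to process at scale.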


C | HPC (scheduler) | POSIX filesystem

Java | HPC (Queue) | POSIX filesystem

C++ | Single-node | SQLite

It’s file formats all the way down!


Dedup


/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out = new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);
            }
        }
        recordInFileIndex++;
        if (!this.REMOVE_DUPLICATES || !rec.getDuplicateReadFlag()) {
            out.addAlignment(rec);
        }
    }
    return 0;
}

Method

Code
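
Stripped of file handling, the method described in the Javadoc comes down to grouping reads that share the same 5' end and flagging all but one per group. The sketch below shows only that core idea under simplifying assumptions: single-end reads, everything held in memory, and the read with the highest summed base quality kept as the representative. The `Read` class and `mark_duplicates` function are hypothetical names, not the tool's actual data structures.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    name: str
    contig: str
    five_prime_pos: int
    strand: str
    base_quality_sum: int
    duplicate: bool = False


def mark_duplicates(reads: List[Read]) -> None:
    """Flag every read except the best-scoring one at each 5' position."""
    groups: Dict[Tuple[str, int, str], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.contig, read.five_prime_pos, read.strand)].append(read)
    for group in groups.values():
        # Assumption: the highest summed base quality wins the group.
        best = max(group, key=lambda r: r.base_quality_sum)
        for read in group:
            read.duplicate = read is not best


reads = [
    Read("a", "20", 1000, "+", 900),
    Read("b", "20", 1000, "+", 950),  # same 5' end as "a", higher quality
    Read("c", "20", 1000, "-", 800),  # opposite strand, so a separate group
    Read("d", "20", 2000, "+", 700),
]
mark_duplicates(reads)
flagged = [r.name for r in reads if r.duplicate]
```

Written this way, the method is a group-by followed by a per-group reduction, which is independent of how the input file is laid out on disk.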


@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;


Dedup

Method

Code

Platform


It’s pipelines all the way down!

Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation


Manually running pipelines on HPC

$ bsub -q shared_12h python split_genotypes.py

$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv

$ bsub -q shared_12h python merge_maf.py
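
The three submission steps above are a hand-rolled scatter, map, and gather. As an illustration of that shape only, the sketch below does the same thing in one process: a thread pool stands in for the HPC queue, in-memory lists stand in for the `genotypes_N.vcf` and `aggN.csv` files, and the per-chunk work is a simple count. The function bodies are assumptions; the contents of the real scripts are not shown in the talk.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_genotypes(calls: List[str], n_chunks: int) -> List[List[str]]:
    """Scatter: deal the calls round-robin into n_chunks pieces."""
    return [calls[i::n_chunks] for i in range(n_chunks)]


def query_agg(chunk: List[str]) -> Counter:
    """Per-chunk work: count each genotype call in one piece."""
    return Counter(chunk)


def merge(partials: List[Counter]) -> Counter:
    """Gather: add the per-chunk counts together."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


calls = ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "0/0", "1/1"]
chunks = split_genotypes(calls, n_chunks=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(query_agg, chunks))
totals = merge(partials)
```

In the manual version, the operator has to track which jobs finished, rerun failures, and know when it is safe to start the merge; a framework takes over that bookkeeping.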

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Alignment → Dedup → Recalibrate → QC/Filter

Alignment → Dedup → Recalibrate → QC/Filter

[Diagram: the three pipelines above spread across Node 1, Node 2, Node 3 and Node 4]

[Diagram: the same layout across Node 1 through Node 4, with Recalibrate pulled out of each Alignment → Dedup → QC/Filter pipeline and drawn as a separate step]


How now, brown cow?


Why Are We Still Defining File Formats By Hand?

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defined data model

• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
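
One reason a columnar format can compress genomic records well is that values within a single column tend to repeat, for example the contig name across neighbouring variants. The snippet below illustrates that idea only; it is not Parquet's or ADAM's implementation, and the field names and the `to_columns` and `run_length_encode` helpers are made up for this example.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Row-oriented records, loosely modelled on the VCF lines shown earlier.
rows = [
    {"contig": "20", "pos": 14370, "ref": "G", "alt": "A"},
    {"contig": "20", "pos": 17330, "ref": "T", "alt": "A"},
    {"contig": "20", "pos": 1110696, "ref": "A", "alt": "G,T"},
    {"contig": "20", "pos": 1230237, "ref": "T", "alt": "."},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot a list of records into one list per field."""
    return {key: [r[key] for r in records] for key in records[0]}


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)
encoded_contig = run_length_encode(columns["contig"])
```

Here the four contig values collapse to a single `("20", 4)` pair, while a column such as `pos` would not shrink under this particular encoding.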


YARN-managed Hadoop cluster

Spark executors (partial sums):

$$\prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$$

Driver (application code):

$$\prod_{i=1}^{N} \prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$$

ContEst Algorithm
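
The likelihood on this slide is a product over sites i and reads j, so it splits naturally: each executor can reduce its own share of the data and the driver only has to combine a handful of numbers. The sketch below shows that decomposition in log space, where the products become the "partial sums" named on the slide. The `read_likelihood` function is a toy stand-in chosen so the example runs; it is not the ContEst probability model, and the partition layout is likewise an assumption.

```python
import math
from typing import List, Tuple

Observation = Tuple[int, float]         # (b_ij: 1 if alt base else 0, e_ij: error probability)
Site = Tuple[float, List[Observation]]  # (f_i: allele frequency, reads covering site i)


def read_likelihood(b: int, e: float, f: float) -> float:
    """Toy stand-in for P(b_ij | e_ij, f_i); not the real ContEst model."""
    p_alt = f * (1.0 - e) + (1.0 - f) * e
    return p_alt if b == 1 else 1.0 - p_alt


def partition_log_likelihood(sites: List[Site]) -> float:
    """Work done on one executor: sum of log-likelihoods over its sites and reads."""
    return sum(
        math.log(read_likelihood(b, e, f))
        for f, reads in sites
        for b, e in reads
    )


# Two hypothetical partitions of the sites.
partitions: List[List[Site]] = [
    [(0.5, [(1, 0.01), (0, 0.01)])],
    [(0.2, [(0, 0.02)]), (0.9, [(1, 0.05), (1, 0.01)])],
]

partial_sums = [partition_log_likelihood(p) for p in partitions]  # executors
total_log_likelihood = sum(partial_sums)                          # driver
```

Working in log space also avoids numerical underflow when many small probabilities are multiplied together.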


Hadoop provides layered abstractions for data processing

HDFS (scalable, distributed storage)

YARN (resource management)

MapReduce, Impala (SQL), Solr (search), Spark

ADAM, quince, guacamole, …

bdg-formats (Avro/Parquet)


Executing query in Hadoop: interactive Spark shell (ADAM)

// predicates (slide pseudocode: the bodies are placeholders)
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by and aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
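
The group-by and aggregate step is the part of this query that carries the genomics. As a rough single-machine illustration, the snippet below keys genotype calls by (variant, population) and computes a minor allele frequency per group. It assumes biallelic sites and genotypes given as tuples of allele indices (0 for the reference allele); `compute_maf` and the sample data are invented for this sketch and are not the `computeMAF` helper referenced on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]  # allele indices, 0 = reference allele

# (variant, population, genotype); hypothetical example data
calls: List[Tuple[str, str, Genotype]] = [
    ("20:14370:G:A", "EUR", (0, 0)),
    ("20:14370:G:A", "EUR", (1, 0)),
    ("20:14370:G:A", "EUR", (1, 1)),
    ("20:14370:G:A", "AFR", (0, 0)),
    ("20:14370:G:A", "AFR", (0, 1)),
]


def compute_maf(genotypes: List[Genotype]) -> float:
    """Minor allele frequency for a biallelic site."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    alt_freq = sum(1 for allele in alleles if allele != 0) / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


# keyBy + groupByKey
grouped: Dict[Tuple[str, str], List[Genotype]] = defaultdict(list)
for variant, population, genotype in calls:
    grouped[(variant, population)].append(genotype)

# map(computeMAF)
maf = {key: compute_maf(genotypes) for key, genotypes in grouped.items()}
```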


Executing query in Hadoop: distributed SQL

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)   -- aggregate (UDAF)
FROM genotypes g                                        -- "load" and join data
INNER JOIN samples s
    ON g.sample = s.sample
INNER JOIN dnase d
    ON g.chr = d.chr
    AND g.pos >= d.start
    AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
    ON g.chr = p.chr
    AND g.pos = p.pos
    AND g.ref = p.ref
    AND g.alt = p.alt
WHERE                                                   -- apply predicates
    s.study = "framingham"
    AND p.pos IS NULL
    AND g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop              -- group-by


Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab

• Apache 2 License

• Contributors from both research and commercial organizations

• Core spatial primitives, variant calling

• Avro and Parquet for data models and file formats


Core Genomics Primitives: Spatial Join
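
A spatial join pairs each genomic position with the regions that contain it, which is the same condition the earlier queries express as `pos >= start AND pos < end`. The sketch below is one simple way to do this on a single machine: index regions by fixed-width bins so that each position is only compared against regions in its own bin. It illustrates the primitive and is not ADAM's implementation; `BIN_SIZE`, `build_index`, and `region_join` are names and choices made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), half-open interval [start, end)
Point = Tuple[str, int]        # (contig, pos)

BIN_SIZE = 1000  # illustrative bin width, not a tuned value


def build_index(regions: List[Region]) -> Dict[Tuple[str, int], List[Region]]:
    """Register each region under every bin it touches."""
    index: Dict[Tuple[str, int], List[Region]] = defaultdict(list)
    for region in regions:
        contig, start, end = region
        for bin_id in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            index[(contig, bin_id)].append(region)
    return index


def region_join(points: List[Point], regions: List[Region]) -> List[Tuple[Point, Region]]:
    """Pair every point with each region that contains it."""
    index = build_index(regions)
    joined = []
    for contig, pos in points:
        # Only regions registered in the point's own bin can contain it.
        for region in index.get((contig, pos // BIN_SIZE), []):
            if region[1] <= pos < region[2]:
                joined.append(((contig, pos), region))
    return joined


regions = [("20", 14000, 15000), ("20", 1110000, 1111000), ("21", 14000, 15000)]
points = [("20", 14370), ("20", 17330), ("20", 1110696), ("20", 1230237)]
pairs = region_join(points, regions)
```

The same binning idea suggests how such a join could be partitioned: records that fall in the same bin can be sent to the same worker and joined locally.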


ADAM preliminary performance


Acknowledgements

UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer

Tamr: Timothy Danford

MSSM: Jeff Hammerbacher, Ryan Williams

Cloudera: Tom White, Sandy Ryza


Thank you

@laserson