BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data
Zhong Wang, Ph.D., Computational Biology Staff Scientist

BioPig for scalable analysis of big sequencing data


DESCRIPTION

This talk was adapted from my presentation at the Finishing in the Future 2011, Santa Fe, NM.


Page 1: BioPig for scalable analysis of big sequencing data

BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data

Zhong Wang, Ph.D.
Computational Biology Staff Scientist

Page 2: BioPig for scalable analysis of big sequencing data

Cellulase

The deep metagenome approach to discover cellulases for biofuel research

Page 3: BioPig for scalable analysis of big sequencing data

Large data, large reward

http://www.cazy.org/

Only 1% shared (>=95% identity)
50% validated activity

Science. 2011 Jan 28;331(6016):463-7.

Page 4: BioPig for scalable analysis of big sequencing data

Sequence data

More data would be even better

Page 5: BioPig for scalable analysis of big sequencing data

Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb

But, can analysis keep up with data growth?

Page 6: BioPig for scalable analysis of big sequencing data

Ideal solutions for the terabase problem

1. Scalable to 1 Tb?
2. Performance (within hours)?

Page 7: BioPig for scalable analysis of big sequencing data

High-Mem cluster

Input/Output (IO)
Memory

Page 8: BioPig for scalable analysis of big sequencing data

MP/MPI solution: k-mer counting

1. Raw data
2. Data slices
3. Each node/core has data and table slices
4. Count table
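The slicing idea on this slide can be sketched in a single Python process. This is only an illustration under stated assumptions: the slide does not say how k-mers are assigned to table slices, so hash partitioning is assumed here, and the names `owner_of`, `count_kmers_sliced`, and `n_workers` are hypothetical. A real MPI program would send each k-mer to the owning rank instead of looping over workers.

```python
from collections import Counter
import zlib


def kmers(seq, k):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def owner_of(kmer, n_workers):
    """Pick the worker whose table slice holds this k-mer (assumed hash partitioning)."""
    return zlib.crc32(kmer.encode("ascii")) % n_workers


def count_kmers_sliced(reads, k, n_workers):
    # Cut the raw data into one slice of reads per worker.
    data_slices = [reads[w::n_workers] for w in range(n_workers)]
    # Every worker also owns one slice of the count table.
    table_slices = [Counter() for _ in range(n_workers)]
    for data_slice in data_slices:
        for read in data_slice:
            for kmer in kmers(read, k):
                # In MPI this update would be a message to the owning rank.
                table_slices[owner_of(kmer, n_workers)][kmer] += 1
    # The table slices hold disjoint k-mers, so merging is a plain union.
    count_table = Counter()
    for table_slice in table_slices:
        count_table.update(table_slice)
    return count_table, table_slices


counts, slices = count_kmers_sliced(["attagc", "gttagg"], k=4, n_workers=3)
```

Because each k-mer has exactly one owner, no two workers ever count the same k-mer, which is what lets the table be split across nodes.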

Page 9: BioPig for scalable analysis of big sequencing data

MP/MPI performance

MPI version
• 412 Gb, 4.5B reads
• 2.7 hours on 128x24 cores
• NERSC Hopper II

MP threaded version
• 268 Gb, 3B reads
• 5 days on 32 cores
• High-Mem Cluster

Fast, scalable

Problems:
• Requires experienced software engineers
• Six months of development time
• If one node fails, all fail
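To put the two runs quoted on this slide on a common footing, the short calculation below derives throughput and core-hours from the slide's figures. It assumes that "128x24 cores" means 128 × 24 = 3072 cores and that "5 days" means 120 wall-clock hours. The two runs used different datasets and hardware, so the comparison is only indicative.

```python
# Figures quoted on the slide ("gb" is Gb as written there).
mpi = {"gb": 412, "hours": 2.7, "cores": 128 * 24}
threaded = {"gb": 268, "hours": 5 * 24, "cores": 32}


def throughput(run):
    """Gb processed per wall-clock hour."""
    return run["gb"] / run["hours"]


def core_hours(run):
    """Total compute consumed: cores multiplied by wall-clock hours."""
    return run["cores"] * run["hours"]


# How much faster the MPI run moves through data, in wall-clock terms.
speedup = throughput(mpi) / throughput(threaded)
# How much more total compute the MPI run consumed.
cost_ratio = core_hours(mpi) / core_hours(threaded)

print(f"MPI:      {throughput(mpi):.1f} Gb/h, {core_hours(mpi):.0f} core-hours")
print(f"Threaded: {throughput(threaded):.1f} Gb/h, {core_hours(threaded):.0f} core-hours")
print(f"Wall-clock throughput ratio: {speedup:.1f}x, core-hour ratio: {cost_ratio:.2f}x")
```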

Page 10: BioPig for scalable analysis of big sequencing data

Hadoop/Map Reduce framework

• Google MapReduce
  – Data-parallel programming model to process petabyte-scale data
  – Generally has a map and a reduce step
• Apache Hadoop
  – Distributed file system (HDFS) and job handling for scalability and robustness
  – Data locality to bring compute to data, avoiding the network transfer bottleneck
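The map and reduce steps can be illustrated with a small in-memory sketch. This is not Hadoop's API: `run_mapreduce`, `length_mapper`, and `sum_reducer` are hypothetical names, and everything runs in one process, whereas Hadoop distributes the same three phases across nodes and stores the data in HDFS.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a map step, group values by key, then run a reduce step."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values that share a key are collected together.
            groups[key].append(value)
    # Reduce: collapse each key's list of values into a single result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Example job: a histogram of read lengths.
def length_mapper(read):
    yield len(read), 1


def sum_reducer(key, values):
    return sum(values)


histogram = run_mapreduce(["attagc", "gttagg", "acgt"], length_mapper, sum_reducer)
```

The mapper sees one record at a time and the reducer sees one key at a time, which is why both phases can be spread over many machines without coordination between them.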

Page 11: BioPig for scalable analysis of big sequencing data

Programmability: Hadoop vs. Pig
Finding the top 5 websites young people visit
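The code compared on this slide did not survive extraction. As a rough illustration of the task itself, the Python sketch below follows the usual dataflow for it: filter users by age, join them to page visits, count visits per site, and keep the top five. The age range of 18 to 25, the record layouts, and the name `top_sites` are assumptions made for this example, not taken from the slide.

```python
from collections import Counter


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited URLs among users in the age range."""
    # Filter: keep only users in the assumed age range.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join, group, and count: tally visits whose user passed the filter.
    clicks = Counter(url for name, url in visits if name in young)
    # Order and limit: most_common sorts by count, highest first.
    return clicks.most_common(n)


# Hypothetical input: (name, age) and (name, url) records.
users = [("amy", 19), ("bob", 42), ("cat", 23)]
visits = [
    ("amy", "a.example"),
    ("cat", "a.example"),
    ("bob", "b.example"),
    ("cat", "c.example"),
]
result = top_sites(users, visits)
```

Each step here corresponds to a single relational operation, which is the level at which a dataflow language like Pig lets the analyst work.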

Page 12: BioPig for scalable analysis of big sequencing data

BioPig: design goals

• Flexible
  – every dataset is unique, and data analysts have domain knowledge that is essential to optimize the analysis
  – pluggable modules that analysts can use to build custom analytic pipelines
• High-Level
  – a domain-specific language enables data analysts to create custom pipelines
  – hides the details of parallelism (too complex for most people)
• Scalability
  – leverages data parallelism to speed up analytics
  – integrates external tools and applications where necessary
  – scales from 1 to hundreds of compute nodes with minimal effort and linear scalability
• Robustness
  – data and computation are replicated across nodes to combat failures


Page 13: BioPig for scalable analysis of big sequencing data

Runs on any hardware supporting Hadoop

• JGI Titanium (commodity Hadoop cluster)
  – Up to 20 nodes, each with 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
  – Up to 200 nodes, each with 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10 Gbit InfiniBand, GPFS
• Amazon AWS
  – Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core “Nehalem” architecture, 1690 GB of instance storage, 10G Ethernet)

Page 14: BioPig for scalable analysis of big sequencing data

BioPig Modules

Blast

Input/Output (FASTA, FASTQ)

K-mer Counter

Assembly

Page 15: BioPig for scalable analysis of big sequencing data

How k-mer count is implemented

Load: <id1, header, ‘attagc’>, <id2, header, ‘gttagg’>

Mapper: <id1, ‘atta’>, <id1, ‘ttag’>, <id2, ‘gtta’>, <id2, ‘ttag’>

Shuffle/sort: <‘atta’, id1>, <‘ttag’, id1, id2>, <‘gtta’, id2>, <‘tagg’, id2>

Reducer: <‘atta’, 1>, <‘ttag’, 2>, <‘gtta’, 1>, <‘tagg’, 1>

Merge: <‘atta’, 3>, <‘ttag’, 2>, <‘gtta’, 2>, <‘tagg’, 1>
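The five stages on this slide can be mimicked in plain Python, using the same two reads and k = 4. This is a single-process sketch rather than BioPig or Pig code, and the function names are hypothetical. Unlike the slide, which shows only some of the k-mers, the mapper here emits every overlapping k-mer, so ‘tagc’ also appears. The second partial count passed to `merge` is invented to stand in for another reducer's output.

```python
from collections import defaultdict


def load(fasta_records):
    """Load: one <id, header, sequence> tuple per read."""
    return [(rid, header, seq.lower()) for rid, header, seq in fasta_records]


def mapper(record, k):
    """Mapper: emit <id, k-mer> for every overlapping k-mer in the read."""
    rid, _header, seq = record
    return [(rid, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def shuffle(pairs):
    """Shuffle/sort: regroup as <k-mer, [ids]>, ordered by k-mer."""
    grouped = defaultdict(list)
    for rid, kmer in pairs:
        grouped[kmer].append(rid)
    return dict(sorted(grouped.items()))


def reducer(grouped):
    """Reducer: replace each id list by its length, giving <k-mer, count>."""
    return {kmer: len(ids) for kmer, ids in grouped.items()}


def merge(partial_counts):
    """Merge: add up the counts produced by separate reducers."""
    total = defaultdict(int)
    for counts in partial_counts:
        for kmer, n in counts.items():
            total[kmer] += n
    return dict(total)


reads = [("id1", "header", "attagc"), ("id2", "header", "gttagg")]
pairs = [pair for rec in load(reads) for pair in mapper(rec, k=4)]
grouped = shuffle(pairs)
counts = reducer(grouped)
# Hypothetical output of a second reducer, chosen to match the slide's totals.
merged = merge([counts, {"atta": 2, "gtta": 1}])
```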

Page 16: BioPig for scalable analysis of big sequencing data

A 7-liner BioPig script for k-mer counting

Page 17: BioPig for scalable analysis of big sequencing data

Rumen metagenome gene discovery pipeline

1. Read preprocess (remove artifacts)
2. pigBlast (blast reads against known cellulases)
3. pigAssembler (assemble reads into contigs)
4. pigExtender (extend contigs into full-length enzymes)

Page 18: BioPig for scalable analysis of big sequencing data

Cloud solution to large data

BioPig-Blaster

BioPig-Assembler

BioPig-Extender


BioPig: 61 lines of code
MPI-extender: ~12,000 lines (vs 31 in BioPig)

Flexibility

Programmability

Scalability


Page 19: BioPig for scalable analysis of big sequencing data

Conclusions

Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data; it is robust and easy to use.

Page 20: BioPig for scalable analysis of big sequencing data

Challenges in application

• IO optimization, e.g., reducing data copying
• Some problems do not fit easily into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy

Page 21: BioPig for scalable analysis of big sequencing data

Acknowledgement

• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC
