High Performance Computing Technology and Methodology Applied to Next-Generation Sequencing Workflows

COMPUTE | STORE | ANALYZE

Ted Slater

Bio-IT World Conference & Expo 2015

20 April 2015

About Cray

Copyright 2015 Cray Inc.

Seymour Cray founded Cray Research in 1972

• 1972-1996, Cray Research grew to leadership in Supercomputing

• 1996-2000, Cray was a subsidiary of SGI

• 2000-present, Cray Inc. growing to $525M in revenue in 2013

• Cray Inc. formed in April 2000

Cray Inc.

• NASDAQ: CRAY

• Over 1,000 employees across 30 countries

• Headquartered in Seattle, WA

Three Focus Areas

• Computation

• Storage

• Analytics

Seven Major Development Sites:

• Austin, TX

• Chippewa Falls, WI

• Pleasanton, CA

• St. Paul, MN

• San Jose, CA

• Seattle, WA

• Bristol, UK

Our Vision

Modeling the World: Fusing Supercomputing and Big & Fast Data

[Diagram: Compute, Store, Analyze; Data Models; Math Models; Data-Intensive Processing]

The Life Sciences/Healthcare Communities: Market and Technology Drivers

• Precision Medicine: The race to understand individual patients, diseases and treatments at the molecular level

• Pace of Technology: Organizations struggling to keep compute infrastructures up to date with rapidly changing life sciences technologies

• Cluster Sprawl: Ad-hoc cluster infrastructures exacerbating complexity, reliability and usability challenges

• Rise of High Performance Analytics: Convergence of analytics and supercomputing opening new opportunities to meet the pace of discovery

• Data Science: New data sources and emerging analytical approaches to enable predictive modeling and knowledge discovery

The Quest for In-Time Analytics

[Figure: Response time frames (<30ms, 30ms, 10min, >10min), ranging from low-latency to batch, for streaming data and stationary data. Annotations: "Few data scientists who wrangle data"; "Business analysts accustomed to interactive time frames".]

Low-latency applications require performance optimizations

• Memory-storage hierarchies

• Fast interconnects

Multi-step Analytics Pipelines

Analytics Pipeline

• Data Prep/ETL

• Stream Processing

• Data Mining

• Interactive Queries

• Actionable Insight

Performance, Productivity

Convergence of Analytics and Supercomputing

High Performance Computing

• Finance: portfolio optimization, pricing, risk

• Energy: seismic modeling

• Life sciences: genomics, drug discovery

• Scientific: simulation, weather forecasting

Traditional Big Data

• Batch analytics

• Undifferentiated systems

“Simulation is the original Big Data Market” – IDC

High Performance Big Data Analytics

• Low-latency analytics

• Next-generation architecture

Analytics Solutions

Powered By

Extreme Analytics Platform

• Turnkey Advanced Analytics Platform

• Next-Generation System Architecture

• Engineered for Performance

Graph Discovery Appliance

• Discover Unknown & Hidden Relationships in Big Data

• Real-time Data Discovery

• Realize Rapid Time-to-Value

Cray’s Next-Generation Sequencing Solution: Accelerated Time to Discovery

[Diagram labels: Genome Assembly; High-Throughput NGS Storage and Archive Environment; Bioinformatics Analytics; Personalized Medicine; Pathway Modeling; Hypothesis Generation; Alternative Indications; Biomarker Prediction; Patient Selection; Base Calling; Assembly; Variant Analysis; QC; Annotation; Next-Generation Sequencers]

Manage all aspects of NGS pipeline in one environment

• Address data transfer and compute bottlenecks

• Speed up whole-genome resequencing analysis

• Fast short-read alignment

• Calculate differential gene expression from large RNA-Seq datasets

• “Single pane of glass” management interface

Enterprise Benefits

• Open architecture

• Reduced footprint

• Eliminates cluster sprawl

• Out-of-the-box performance with flexibility to meet evolving needs

• Pay-as-you-grow storage/archival performance and capacity

• Minimal management burden and lower TCO

Three NGS Challenges: Sequence Assembly, Bioinformatics and Data Management

Sequence Assembly is a Bottleneck

Sequence costs are down, and sequence volumes are up. Huge volume makes assembly the challenge. The rate at which genotypic variation can be characterized is now limited by computational tools, not by sequencing technology.

NGS Bioinformatics is Complex

Complexity is High: Interpreting the meaning of NGS sequences involves annotation, integration, visualization and collaboration, requiring diverse expertise.

Performance: Post-sequencing analytics is computationally demanding, in both performance and scale.

NGS Data Management is Overwhelming

Production: Huge data volumes and excessive data movement are pushing the limits of many storage and networking infrastructures.

Archive: NGS workflows generate huge volumes of data, which are both tedious and costly to retain.

Next-Generation Sequencing: Urika-XA platform for all aspects of NGS bioinformatics

[Diagram: Next-generation sequencers; Urika-XA platform simplifies the NGS workflow]

Manage all aspects of NGS pipeline in one environment

• Address data transfer and compute bottlenecks

• Speed up whole-genome resequencing analysis

• Fast short-read alignment

• Calculate differential gene expression from large RNA-Seq datasets

• “Single pane of glass” management interface

Eliminate cluster sprawl

Reduce data movement

…add a Scalable Archive Strategy to NGS

● Cray Tiered Adaptive Storage (TAS) for active data use and archiving

● Policy-based data movement

● Performs at scale

● NGS generates enormous amounts of data

● Once data is processed, much of it is no longer needed but must be saved

● A proper archive strategy will eliminate bottlenecks, improve performance and reduce costs
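Policy-based data movement can be pictured as a small rule set that assigns each file to a storage tier. The following is only a minimal sketch of that idea: the tier names, the age thresholds and the `FileRecord` fields are hypothetical and are not taken from Cray TAS.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only; not Cray TAS settings.
ACTIVE_DAYS = 30
NEARLINE_DAYS = 180


@dataclass
class FileRecord:
    path: str
    days_since_access: int
    processed: bool


def choose_tier(record: FileRecord) -> str:
    """Pick a storage tier from simple status and age rules."""
    # Unprocessed or recently used data stays on the fast tier.
    if not record.processed or record.days_since_access <= ACTIVE_DAYS:
        return "fast"
    # Processed data that is cooling off moves to a cheaper tier.
    if record.days_since_access <= NEARLINE_DAYS:
        return "nearline"
    # Everything else is kept, but on the archive tier.
    return "archive"


files = [
    FileRecord("run42/sample1.fastq", 2, False),
    FileRecord("run17/sample9.bam", 90, True),
    FileRecord("run03/sample4.bam", 400, True),
]
plan = {f.path: choose_tier(f) for f in files}
```

In this sketch the data that is "no longer needed but must be saved" ends up on the archive tier, which keeps it off the storage used by active analysis.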

[Diagram: Next-generation sequencers; Urika-XA platform simplifies workflow]

Halvade – Intel® Wrapper for Hadoop®

● Key observations

● Read mapping is parallel by read; variant calling is parallel by chromosomal region

● Map phase: read mapping • Reduce phase: variant calling

● Leveraging Hadoop improved throughput ~40X

● BWA and GATK – single node = 5 days

● Hadoop single node = 2.5 days

● Hadoop 50 nodes < 3 hours

● Urika-XA ~2 hours

● An additional 20% in performance

● Follow-on analytics can be done on the same platform

Reference: Decap et al., Bioinformatics 2015 Mar 26. pii: btv179.
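The map/reduce split described on this slide can be illustrated with a toy, in-memory sketch. It is not Halvade's code: `map_read` and `reduce_region` are hypothetical stand-ins for the aligner (BWA) and the variant caller (GATK), the reads already carry their positions, and `REGION_SIZE` is an arbitrary illustrative value.

```python
from collections import defaultdict

REGION_SIZE = 1_000_000  # hypothetical region width, in bases


def map_read(read):
    """Map phase: handle one read and emit (region key, aligned read)."""
    # Stand-in for a real aligner: the position is already known here.
    chrom, pos, seq = read
    return (chrom, pos // REGION_SIZE), (pos, seq)


def reduce_region(key, aligned):
    """Reduce phase: process all reads that fall in one region."""
    # Stand-in for a real variant caller: only count the reads.
    return {"region": key, "n_reads": len(aligned)}


def run(reads):
    grouped = defaultdict(list)
    for read in reads:  # reads are independent, so this loop is parallel by read
        key, value = map_read(read)
        grouped[key].append(value)
    # Regions are independent, so this step is parallel by chromosomal region.
    return [reduce_region(k, sorted(v)) for k, v in sorted(grouped.items())]


reads = [
    ("chr1", 150, "ACGT"),
    ("chr1", 1_200_000, "GGCA"),
    ("chr2", 42, "TTAG"),
    ("chr1", 900, "CCAT"),
]
results = run(reads)
```

The grouping by region key plays the role of the shuffle step between the two phases; in a Hadoop job the framework performs that grouping across nodes.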

Genomic Analysis with Hadoop

[Diagram: Genetic Data, Clinical Trial Records, Patient Records and Social Media Data; Life sciences-specific data formats; Analysis on Urika-XA platform; Life sciences-specific results]

http://www.biodatomics.com

Lumenogix NGS

50x Whole Human Genome

http://www.lumenogix.com

Lumenogix and Cray Performance Details

● AWS Cluster using Halvade

● Urika-XA platform using Lumenogix

[Chart: Time to process 50x Whole Human Genome, in minutes (axis 0 to 180), AWS vs. Urika-XA]

Process                 Time
BWA                     17 minutes
Tag & Shuffle Reads     2 minutes
Sort and Compress       1 minute
Mark Duplicates         1 minute
Realignment             6 minutes
Genotyping              18 minutes
Total                   45 minutes

Genome split into 4MB sections
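As a quick arithmetic check on the stage times listed above, the following sketch sums them and reports which stages dominate. The numbers are copied from the slide; the variable names are illustrative.

```python
# Stage times in minutes, as listed on the slide.
stage_minutes = {
    "BWA": 17,
    "Tag & Shuffle Reads": 2,
    "Sort and Compress": 1,
    "Mark Duplicates": 1,
    "Realignment": 6,
    "Genotyping": 18,
}

# The stages should add up to the stated 45-minute total.
total = sum(stage_minutes.values())

# Fraction of the run spent in each stage.
shares = {name: minutes / total for name, minutes in stage_minutes.items()}

# The two longest stages.
dominant = sorted(stage_minutes, key=stage_minutes.get, reverse=True)[:2]
```

The stages do sum to 45 minutes, and alignment (BWA) plus genotyping account for 35 of them, so those two steps are where further tuning would matter most.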

GenomeNext

Churchill: Kelly et al., Genome Biology 2015, 16:6 doi:10.1186/s13059-014-0577-x

• Churchill uses a novel, deterministic parallelization approach to deliver a balanced, highly scalable regional parallelization strategy

• Enables computationally efficient whole genome sequencing data analysis in less than 2 hours
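The general idea behind regional parallelization can be sketched as cutting each chromosome into fixed-size regions that are then analyzed independently. This is a generic illustration, not Churchill's actual algorithm; the chromosome names, lengths and region size below are made-up toy values.

```python
def split_into_regions(chrom_lengths, region_size):
    """Yield (chrom, start, end) half-open regions covering every chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            # The last region of a chromosome may be shorter than region_size.
            yield chrom, start, min(start + region_size, length)


# Toy lengths, not real chromosome sizes.
toy_genome = {"chrA": 10_000_000, "chrB": 6_500_000}
regions = list(split_into_regions(toy_genome, region_size=4_000_000))
```

Because the split depends only on the chromosome lengths and the region size, repeated runs produce the same regions, which is one simple way to keep a parallel analysis deterministic.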

Stay tuned for Urika-XA system results!

http://www.genomenext.com

Urika-XA Extreme Analytics Platform for NGS

Pre-integrated, open platform for high performance Hadoop and Spark™ analytics

Save months standing up a Hadoop cluster

• Run a 48-node Hadoop cluster out of the box

• Cloudera Hadoop and Apache Spark factory installed

Replace 3 standard racks with a single Urika-XA system rack

• High-density compute powered by Intel® Xeon® processors

• Consolidate wide range of analytics onto single platform

Future-proof your big data environment

• Next-gen architecture leveraging SSDs and InfiniBand

• Designed for low-latency, in-memory processing

Thank You

Dave Anstey: [email protected]

http://www.cray.com

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION and URIKA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following are system family marks and trademarks of Cray Inc.: CS, XC, XE, XK and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the trademarks of their respective owners.
