Cancer Genomics Data Pipelines

Lynn & Samantha Langit

CSIRO Bioinformatics / Australia

June 2017 - Oslo

Bioinformatics Data Pipelines built by CSIRO on AWS

3 billion data points per patient DNA sample

Up to 25% of the population could be sequenced by 2025

Two Perspectives

Bioinformatics Research

• Insight

• Reproducibility

Cloud Architecture

• Speed

• Low Cost

• Simplicity

Cloud Data Pipeline Pattern

Problem

• Define business problem

Data

• Quality

• Quantity

Candidate Technologies

• Ingest

• ETL

• Biz Analytics

• ML

• Visualization

Build MVPs

• Iterate

• Learn

Assemble Pipeline

• Validate each section

• Test at scale

Genomic Sequencing Results

CRISPR-Cas9 molecular engineering technology enables researchers to edit genomes accurately.

It…

Pattern-matching unique sequences of DNA

Huge demand for large-scale computation

Time-critical dimension to compute

NIH-approved for human health

Could revolutionize cancer treatments
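
The "pattern-matching unique sequences of DNA" step can be pictured with a minimal Python sketch. It assumes the common Cas9 convention of a 20-nucleotide target followed by an NGG motif, searches the forward strand only, and treats "unique" as "appears once in the supplied sequence". The function names are hypothetical, and this is not GT-Scan2's actual algorithm.

```python
import re
from collections import Counter

# Assumption: a Cas9 target is a 20-nt protospacer followed by an NGG PAM.
# The lookahead lets overlapping candidate sites all be reported.
TARGET = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_targets(dna):
    """Return (position, protospacer) pairs found on the forward strand."""
    dna = dna.upper()
    return [(m.start(), m.group(1)) for m in TARGET.finditer(dna)]


def unique_targets(dna):
    """Keep only protospacers that occur exactly once among the candidates."""
    hits = find_targets(dna)
    counts = Counter(seq for _, seq in hits)
    return [(pos, seq) for pos, seq in hits if counts[seq] == 1]
```

A production tool would also have to scan the reverse strand and score near-matches across a whole genome, which is where the demand for large-scale computation comes from.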

Serverless Lambda Architecture Pattern

[Architecture diagram: Users, API Gateway, Lambda functions 1, 2 and 3, buckets with objects, DynamoDB]
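
One piece of this pattern, a Lambda function behind API Gateway that records a job in a table, might look like the following Python sketch. The event and response shapes follow the API Gateway proxy integration format. Everything else is an assumption for illustration: the `InMemoryTable` class stands in for a DynamoDB table so the code runs offline, and the `job_id`, `sequence`, and `status` fields are hypothetical.

```python
import json


class InMemoryTable:
    """Hypothetical stand-in for a DynamoDB table, so the sketch runs offline."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[Item["job_id"]] = Item


def make_handler(table):
    """Build a Lambda-style handler that records a scan job in `table`."""

    def handler(event, context):
        # API Gateway proxy integrations deliver the request body as a JSON string.
        body = json.loads(event.get("body") or "{}")
        sequence = body.get("sequence", "")
        if not sequence:
            return {"statusCode": 400, "body": json.dumps({"error": "sequence required"})}
        job_id = f"job-{len(table.items) + 1}"
        table.put_item(Item={"job_id": job_id, "sequence": sequence, "status": "queued"})
        return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}

    return handler
```

Passing the table in, rather than creating a client inside the handler, keeps the function testable without cloud credentials. A deployed version would hand it a real table resource instead.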

CSIRO: Commonwealth Scientific and Industrial Research Organisation

GT-Scan2 Demo

Scale Genomic Analysis

GWAS = genome-wide association studies

Analysis on large-cohort sequencing data or imputed SNP array data

Clustering on genomic profiles to stratify large-cohort genomic data

Viewing datasets with millions of features
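
For readers new to GWAS, the conventional building block is a per-variant association test repeated across the genome. The Python sketch below shows one such test, a chi-square test on a 2x2 table of allele counts for cases and controls. It is a generic textbook illustration rather than the method used in this deck, the function name is hypothetical, and it assumes every row and column total is non-zero.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Running a test like this once per variant is what makes datasets with millions of features expensive to analyze.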

Genomics (ML) Pipeline Pattern

What is CSIRO’s solution?

For scale at reasonable cost: use Apache Hadoop

For scale at speed: use Apache Spark for Hadoop

For usability in bioinformatics: create a domain-specific API (OSS library)

For global use: leverage Cloud Pipeline Patterns

GWAS Analysis with Variant-Spark

[Diagram: Genomics Analysts, on-premises Hadoop cluster with Apache Spark, corporate data center]

What is Apache Spark?

What is variant-spark?

Demo

80% faster than ADAM

90% faster than R

90% faster than Python

VariantSpark

Uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently

Analyzes 3,000 samples with 80 million features in < 30 minutes

Enables real-time diagnosis by finding similar patients

Contributes to motor neuron disease (ALS) research in Australia
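
To make the random forest idea concrete, the Python sketch below scores features the way a single tree node might: it draws a random subset of features and picks the one whose split most reduces Gini impurity. The assumptions are mine, not the deck's: genotypes are encoded as 0/1/2 alternate-allele counts, labels are 0/1, and the feature names are hypothetical. It runs serially, whereas VariantSpark distributes this kind of work with Apache Spark, and it is not VariantSpark's implementation.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def gini_decrease(genotypes, labels, threshold=0):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_feature(features, labels, n_candidates, rng):
    """Score a random subset of features, as one tree node in a forest would."""
    candidates = rng.sample(sorted(features), n_candidates)
    return max(candidates, key=lambda name: gini_decrease(features[name], labels))
```

For example, `best_feature({"snp_a": [0, 0, 2, 2], "snp_b": [0, 2, 0, 2]}, [0, 0, 1, 1], 2, random.Random(0))` selects `"snp_a"`, the feature that separates the two classes. With tens of millions of features, repeating this scoring across many nodes and trees is the part that benefits from parallel execution.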

Data Prep

Statistics

Probabilistic Algorithms

Data Viz

Machine Learning…

Spark ML Classification Algorithms

Wide Random Forest Ensemble of Decision Trees

Logistic Regression

variant-spark other libraries

OSS Library variant-spark for all

Is it…

usable?

performant?

extendable? (clean code)

using the best language (Scala)?

using the ‘best version’ of Spark?

using a version of wide random forests that is understandable?

How best to Deploy Cloud Hadoop?

• IaaS: EC2 instances with Apache Hadoop, Apache Spark, more…

• PaaS: Elastic MapReduce (EMR) Hadoop cluster

• SaaS: vendor-managed, e.g. Databricks with Jupyter Notebooks

What is Databricks?

DEMO: Jupyter Notebooks

Variant-Spark and Databricks Demo

Solving Important Questions… Cancer Genomics?

DEMO: Who is a Hipster?

AWS EC2 Spot Instances

GWAS Analysis with Variant-Spark

[Diagram: Genomics Analysts, EC2 Hadoop cluster with Apache Spark, Availability Zone, 1000 Genomes, GWAS input, EC2 Hadoop instances, Spot EC2 Hadoop worker instances]

Cloud Data Pipeline Pattern

Problem | Data | Candidate Technologies | Build MVPs | Assemble Pipeline

Analyze GWAS -> S3/Hadoop (SaaS)

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook SQL, R or Python

Cloud Data Pipeline Pattern

Problem | Data | Candidate Technologies | Build MVPs | Assemble Pipeline

1. Scan vcf -> S3/DynamoDB (Serverless)

• Ingest: S3

• ETL: Lambda

• Analyze: Lambda

• Viz: Lambda/API Gateway

2. Analyze GWAS -> S3/Hadoop (SaaS)

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook SQL, R or Python
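
The two pipelines on this slide can also be written down as data, which makes the stage-by-stage comparison easy to print or check. The Python sketch below uses only the stage and technology names shown on the slide. The dictionary layout and the helper names `describe` and `missing_stages` are assumptions made for illustration.

```python
STAGES = ("Ingest", "ETL", "Analyze", "Viz")

PIPELINES = {
    "Scan vcf": {
        "style": "Serverless",
        "Ingest": "S3",
        "ETL": "Lambda",
        "Analyze": "Lambda",
        "Viz": "Lambda/API Gateway",
    },
    "Analyze GWAS": {
        "style": "SaaS",
        "Ingest": "S3 -> Databricks DBFS",
        "ETL": "Apache Spark",
        "Analyze": "Variant-Spark ML",
        "Viz": "Notebook SQL, R or Python",
    },
}


def describe(name):
    """Render one pipeline as 'stage: technology' steps in stage order."""
    pipeline = PIPELINES[name]
    steps = [f"{stage}: {pipeline[stage]}" for stage in STAGES]
    return f"{name} ({pipeline['style']}): " + " | ".join(steps)


def missing_stages(pipeline):
    """List stages with no technology chosen yet (a simple completeness check)."""
    return [stage for stage in STAGES if not pipeline.get(stage)]
```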

Modern Big Data Pipelines

• Problem #1 - Scan

• Solution: Serverless Cloud Pipeline

• Problem #2 - Analyze

• Solution: SaaS Cloud ML Pipeline

Cancer Genomics Data Pipelines

Lynn & Samantha Langit

CSIRO Bioinformatics & variant-spark

June 2017 - Oslo