Bioinformatics Data Pipelines built by CSIRO on AWS

Cancer Genomics Data Pipelines

Lynn & Samantha Langit

CSIRO Bioinformatics / Australia

June 2017 - Oslo

3 billion data points per patient DNA sample

Up to 25% of the population could be sequenced by 2025

Two Perspectives

Bioinformatics Research

• Insight

• Reproducibility

Cloud Architecture

• Speed

• Low Cost

• Simplicity

Cloud Data Pipeline Pattern

Problem

• Define business problem

Data

• Quality

• Quantity

Candidate Technologies

• Ingest

• ETL

• Biz Analytics

• ML

• Visualization

Build MVPs

• Iterate

• Learn

Assemble Pipeline

• Validate each section

• Test at scale
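
The "validate each section" step can be sketched as a chain of stages, each paired with a check on its own output. This is a minimal illustration only: the stage names, toy records, and validators are assumptions, not part of the CSIRO pipelines.

```python
from typing import Callable, Iterable, Tuple

# Each stage is a (name, transform, validator) triple. All names are illustrative.
Stage = Tuple[str, Callable[[list], list], Callable[[list], bool]]

def run_pipeline(records: list, stages: Iterable[Stage]) -> list:
    """Run stages in order, validating each stage's output before moving on."""
    for name, transform, is_valid in stages:
        records = transform(records)
        if not is_valid(records):
            raise ValueError(f"stage {name!r} produced invalid output")
    return records

# Toy stages standing in for ingest, ETL and analytics.
def ingest(lines):
    return [line.strip() for line in lines if line.strip()]

def etl(lines):
    return [tuple(line.split(",")) for line in lines]

def analyze(rows):
    return [(sample, int(count)) for sample, count in rows if int(count) > 0]

STAGES = [
    ("ingest", ingest, lambda out: len(out) > 0),
    ("etl", etl, lambda out: all(len(row) == 2 for row in out)),
    ("analyze", analyze, lambda out: all(count > 0 for _, count in out)),
]

result = run_pipeline(["s1,3\n", "\n", "s2,0\n", "s3,7\n"], STAGES)
print(result)
```

Keeping each validator next to its stage means a single MVP can be swapped out and re-checked without touching the rest of the chain.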

Genomic Sequencing Results

CRISPR-Cas9 molecular engineering technology enables researchers to edit genomes accurately.

It…

Pattern-matching unique sequences of DNA

Huge demand for large-scale computation

Time-critical dimension to compute

NIH-approved for human health

Could revolutionize cancer treatments
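
The pattern-matching step listed above can be illustrated with a small sketch. It assumes the commonly described SpCas9 target shape (a 20-nucleotide protospacer followed by an NGG PAM) and an exact-match uniqueness check. It is not GT-Scan2's algorithm; the function names and example sequence are hypothetical.

```python
import re

# A 20-nucleotide protospacer followed by an NGG PAM, wrapped in a lookahead
# so that overlapping candidate sites are all reported.
TARGET = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")

def find_targets(sequence: str):
    """Return (start, protospacer, pam) for each candidate site on the given strand."""
    seq = sequence.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in TARGET.finditer(seq)]

def is_unique(protospacer: str, genome: str) -> bool:
    """Exact-match uniqueness check: the protospacer occurs exactly once."""
    return len(re.findall(f"(?={protospacer})", genome.upper())) == 1

# Hypothetical toy sequence: 2 bases, one 20-nt protospacer, an AGG PAM, 4 bases.
example = "TT" + "ACGT" * 5 + "AGG" + "CCCC"
sites = find_targets(example)
print(sites)
```

A realistic scan would also cover the reverse strand and tolerate mismatches when scoring off-target sites, which is where the demand for large-scale computation comes from.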

Serverless Lambda Architecture Pattern

[Diagram components: Users, API Gateway, Lambda functions 1, 2 and 3, buckets with objects, DynamoDB]
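
A minimal sketch of two Lambda-style handlers behind API Gateway, in the shape of the pattern above. The handler names, the `gene` parameter, and the job fields are assumptions for illustration, and an in-memory dictionary stands in for DynamoDB so the sketch runs locally.

```python
import json

# In-memory stand-in for a DynamoDB table so the sketch runs locally;
# a deployed function would use a real table instead (assumption).
JOBS = {}

def submit_job(event, context):
    """Lambda-style handler: accept a job request from API Gateway and record it."""
    params = event.get("queryStringParameters") or {}
    gene = params.get("gene")
    if not gene:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'gene'"})}
    job_id = f"job-{len(JOBS) + 1}"
    JOBS[job_id] = {"gene": gene, "status": "QUEUED"}
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}

def get_job(event, context):
    """Lambda-style handler: report the stored state of a job."""
    job_id = (event.get("pathParameters") or {}).get("jobId")
    job = JOBS.get(job_id)
    if job is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown job"})}
    return {"statusCode": 200, "body": json.dumps(job)}

# Local invocation with a hand-built API Gateway proxy-style event.
response = submit_job({"queryStringParameters": {"gene": "BRCA1"}}, None)
print(response)
```

Each handler is stateless apart from the shared store, which is what lets the platform run many copies in parallel when requests spike.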

CSIRO: Commonwealth Scientific and Industrial Research Organisation

GT-Scan2 Demo

GT-Scan2

Scale Genomic Analysis

GWAS = genome-wide association studies on sequencing data

Analysis on large cohort data or imputed SNP array data

Clustering on genomic profiles to stratify large-cohort genomic data

Viewing datasets with millions of features
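
For orientation, the simplest form of a GWAS association test looks at one variant at a time. The sketch below runs a chi-square test on allele counts in cases versus controls. It is a textbook single-SNP test, not the variant-spark method, and the SNP names and counts are made up.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref)
snps = {
    "snp_demo_1": (60, 40, 30, 70),
    "snp_demo_2": (50, 50, 50, 50),
}
results = {name: allelic_chi_square(*counts) for name, counts in snps.items()}
ranked = sorted(results, key=lambda name: results[name][1])  # smallest p-value first
print(ranked)
```

Repeating this over millions of variants is embarrassingly parallel, but it treats each variant in isolation, which is one motivation for the multivariate methods later in the deck.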


Genomics (ML) Pipeline Pattern

Problem

• Define business problem

Data

• Quality

• Quantity

Candidate Technologies

• Ingest

• ETL

• Biz Analytics

• ML

• Visualization

Build MVPs

• Iterate

• Learn

Assemble Pipeline

• Validate each section

• Test at scale

What is CSIRO’s solution?

For scale at reasonable cost: use Apache Hadoop

For scale at speed: use Apache Spark on Hadoop

For usability in bioinformatics: create a domain-specific API (OSS library)

For global use: leverage cloud pipeline patterns

GWAS Analysis with Variant-Spark

[Diagram components: Genomics Analysts, on-premises Hadoop cluster with Apache Spark, corporate data center]

What is Apache Spark?

What is variant-spark?

Demo

80% faster than ADAM

90% faster than R

90% faster than Python

VariantSpark

Uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently

Analyzes 3,000 samples with 80 million features in < 30 minutes

Enables real-time diagnosis by finding similar patients

Contributes to motor neuron disease (ALS) research in Australia
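
To make the random-forest idea concrete, here is a deliberately simplified, single-machine sketch: a forest of one-split trees (stumps), each built on a bootstrap sample with a random subset of features, with importance accumulated as Gini decrease. It is not VariantSpark's implementation, and the toy genotype matrix is invented. Each loop iteration is independent of the others, which is the property a Spark cluster can exploit.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(rows, labels, feature):
    """Gini decrease from splitting on whether a feature value is zero."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] != 0]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

def stump_forest_importance(rows, labels, n_trees=200, mtry=3, seed=0):
    """Accumulate per-feature Gini decrease over a forest of one-split trees."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    importance = [0.0] * n_features
    for _ in range(n_trees):  # trees are independent, so this loop parallelizes
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        candidates = rng.sample(range(n_features), mtry)    # random feature subset
        gains = {f: split_gain(sample_rows, sample_labels, f) for f in candidates}
        best = max(gains, key=gains.get)
        importance[best] += gains[best]
    return importance

# Toy data: 20 samples, 8 binary features; only feature 0 tracks the label.
labels = [1] * 10 + [0] * 10
rows = [[labels[i]] + [(i // j) % 2 for j in range(1, 8)] for i in range(20)]
importance = stump_forest_importance(rows, labels)
print(importance.index(max(importance)))
```

With tens of millions of features, only a small fraction of them is examined per tree, so many trees are needed before every feature has had a chance to be ranked.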

Data Prep

Statistics

Probabilistic Algorithms

Data Viz

Machine Learning…

Spark ML Classification Algorithms

Wide Random Forest Ensemble of Decision Trees

Logistic Regression

variant-spark other libraries

OSS Library variant-spark for all

Is it…

• usable?

• performant?

• extendable? (clean code)

• using the best language (Scala)?

• using the ‘best version’ of Spark?

• using a version of wide random forests that is understandable?

How best to Deploy Cloud Hadoop?

• IaaS: EC2 instances with Apache Hadoop, Apache Spark, more…

• PaaS: Elastic MapReduce (EMR) Hadoop cluster

• SaaS: vendor-managed, e.g. Databricks with Jupyter Notebooks

What is Databricks?

DEMO: Jupyter Notebooks

Variant-Spark and Databricks Demo

Solving Important Questions… Cancer Genomics?

DEMO: Who is a Hipster?

AWS EC2 Spot Instances

GWAS Analysis with Variant-Spark

[Diagram components: Genomics Analysts, EC2 Hadoop cluster with Apache Spark in an Availability Zone (EC2 Hadoop instances plus Spot EC2 Hadoop worker instances), 1000 Genomes GWAS input]


Problem / Data / Candidate Technologies / Build MVPs / Assemble Pipeline

Analyze GWAS -> S3/Hadoop (SaaS)

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook SQL, R or Python


Problem / Data / Candidate Technologies / Build MVPs / Assemble Pipeline

1. Scan vcf -> S3/DynamoDB (Serverless)

• Ingest: S3

• ETL: Lambda

• Analyze: Lambda

• Viz: Lambda/API Gateway

2. Analyze GWAS -> S3/Hadoop (SaaS)

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook SQL, R or Python

Modern Big Data Pipelines

• Problem #1 - Scan

• Solution: Serverless Cloud Pipeline

• Problem #2 - Analyze

• Solution: SaaS Cloud ML Pipeline

Cancer Genomics Data Pipelines

Lynn & Samantha Langit

CSIRO Bioinformatics & variant-spark

June 2017 - Oslo
