AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research on AWS (LFS301)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Stephen Terrell, Senior Cloud Engineer, Human Longevity Inc.

Ryan Ulaszek, Cloud Architect, Human Longevity Inc.

Lance Smith, IT Director, Celgene

Patrick Combes, Principal Solution Architect, AWS

November 28, 2016

Building a Platform for Collaborative Scientific Research on AWS (LFS301)

Collaborative Scientific Research

The past decade has seen tremendous growth in collaborative research.

“New collaboration patterns are changing the global balance of science. Established superpowers need to keep up or be left behind…”
Jonathan Adams, Nature 490, 335–336 (18 October 2012), doi:10.1038/490335a, published online 17 October 2012

“…in the last decade, thousands of researchers, pharmaceutical and biotechnology companies, government regulators, payers, clinicians and patients have come together in more than 100 multi-stakeholder collaborations to solve some specific shared problem…”
Carol Cruzan Morton, The Science of Collaboration, Center for Biomedical Innovation, August 29, 2013

AWS as a Home for Collaborative Research

• Onsite
• Meet in the middle
• Multi-site, distributed collaborations
• DevOps: coordinated development, deployment, and operation

What to Expect from the Session

• About Human Longevity Inc.

• Challenges we faced

• The solution

• Our journey

• Summary

The Next Frontier in Medicine

OUR MISSION: Changing the practice of medicine, making it predictive, preventative, and genomics-based. Ultimately, our goal is to extend the healthy, high-performance, and productive life span.

(Progression diagram)
• Descriptive medicine: traditional medical record (~3.5 GB), digital health record (~150 GB)
• Deep representation of systems: N-of-1
• N-of-thousands: 10,000 genomes (~1 PB); human virome (1 trillion similarity searches)
• Future: 10 million genomes (~1 EB)

Medicine as Data Science

Challenges

• Redundant infrastructure consuming time and resources
• Complexity and scale of data compound the problem
• Commercializing bioinformatics innovation is painful and slow
• Significant storage and compute costs

The Solution is a Common Platform

• Create a pluggable common platform for genomics pipelines using AWS managed services
• Leverage the platform to simplify pipelines and optimize for cost
• Ease and accelerate the transition to production by standardizing on a common platform
• Move to continuous delivery to go faster with quality

The Journey

Building a common genomics platform was an iterative process. There were five key steps along the way.

The Journey: Step 1, up and running in a sprint

Get a simple genomics pipeline up and running in two weeks so downstream teams can start iterating.

Up and Running in a Sprint

Poll a queue for a sample and run the bioinformatics tools in sequence on an instance, using the tools baked into the AMI.

Architecture: AWS OpsWorks stack, sample queue, Amazon S3.
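A minimal sketch of that loop, assuming a hypothetical SQS queue URL and a hypothetical tool order whose binaries are baked into the AMI:

import json
import subprocess
import boto3

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/111122223333/sample-queue"  # hypothetical
TOOLS = ["bwa", "samtools", "gatk"]  # hypothetical tool order; binaries baked into the AMI

sqs = boto3.client("sqs")

def poll_and_run():
    # Long-poll the sample queue for one message.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        sample = json.loads(msg["Body"])
        # Run the bioinformatics tools in sequence against the sample's S3 location.
        for tool in TOOLS:
            subprocess.run([tool, sample["s3_path"]], check=True)
        # Only delete the message once every step has completed.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])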


Benefits
✓ Easy to implement and worked well as a starting point
✓ Easy to auto-scale instances

Drawbacks
✗ Pain around manually building and updating AMI
✗ Writing our own workflow engine
✗ Can’t optimize for workload at each step
✗ No cost optimization since using on-demand instances


The Journey: Step 2, adapting to tool change

Difficult to accommodate constant bioinformatics tool changes in the pipeline.

Adapting to Tool Change

Migrate to tools running in Docker containers on Amazon EC2 instances in AWS OpsWorks.

Architecture: AWS OpsWorks stack, sample queue, Amazon S3, Amazon ECR.
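A rough sketch of the change, assuming a hypothetical ECR image URI: instead of invoking binaries from the AMI, each step pulls and runs a tool container.

import subprocess

def run_step(image_uri, sample_s3_path):
    # Pull the tool image from ECR (URI is hypothetical) and run it as a container,
    # instead of calling binaries baked into the AMI.
    subprocess.run(["docker", "pull", image_uri], check=True)
    subprocess.run(
        ["docker", "run", "--rm", "-e", "SAMPLE=" + sample_s3_path, image_uri],
        check=True,
    )

# e.g. run_step("111122223333.dkr.ecr.us-west-2.amazonaws.com/ancestry:1.0.5",
#               "s3://example-bucket/sample.bam")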


Benefits
✓ Easy to auto-scale instances
✓ Easy to accommodate tool changes

Drawbacks
✗ Pain around manually building and updating AMI
✗ Writing our own workflow engine
✗ Can’t optimize for workload at each step
✗ No cost optimization since using on-demand instances
✗ Painful to support Docker on EC2 instances ourselves
✗ Pain around version and deploy of images


The Journey: Step 3, report pipeline platform

Need to accommodate additional report pipelines and their increasing complexity.

Report Pipeline Platform

The AWS Flow Framework for Ruby is a collection of convenience libraries that make it faster and easier to build applications with Amazon SWF. SWF acts as a fully managed state tracker and task coordinator in the cloud.

Report Pipeline Platform

Migrate the workflow solution to Amazon SWF and the AWS Flow Framework for Ruby.

Architecture: AWS OpsWorks stack, Amazon S3, ECR, SWF, launcher topic.
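As an illustrative sketch only, and using boto3 rather than the Ruby Flow Framework the team actually used, kicking off a report workflow execution in SWF might look like this; the domain, workflow type, and task list names are hypothetical:

import boto3

swf = boto3.client("swf")

# Start a report pipeline execution (all identifiers below are hypothetical).
swf.start_workflow_execution(
    domain="report-pipelines",
    workflowId="sample-1234",
    workflowType={"name": "ReportPipeline", "version": "1.0"},
    taskList={"name": "report-decider"},
    input='{"sample_id": "1234"}',
)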


Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new pipelines and run steps in parallel
✓ Easy to handle failures at each step

Drawbacks
✗ Can’t optimize for workload at each step
✗ No cost optimization since using on-demand instances
✗ Painful to support Docker on EC2 instances ourselves
✗ Pain around version and deploy of images
✗ Complex workflows could not be supported


The Journey: Step 4, Docker pipeline platform

Many redundant workflow systems that are suboptimal for cost and performance; bioinformatics scientists are bogged down in infrastructure development.

Docker Pipeline

• Amazon SWF: track steps within a workflow
• Flow: orchestrate steps within a workflow
• Spot Fleet: provision Spot Instances
• Amazon ECS: run steps on a cluster
• Amazon DynamoDB: register pipelines and tasks
• AWS Lambda: glue everything together
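A minimal sketch of that Lambda "glue": when SWF schedules a pipeline step, a function like this could place the corresponding containerized task on the Spot-backed ECS cluster. The cluster name, task definition, and event shape are assumptions, not the talk's actual code.

import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Hypothetical glue: SWF schedules a pipeline step and this Lambda places the
    # corresponding containerized task on the Spot-backed ECS cluster.
    return ecs.run_task(
        cluster=event["cluster"],             # e.g. "dpl-spot-cluster" (hypothetical)
        taskDefinition=event["task_id"],      # e.g. "ancestry_1.0.5_container"
        overrides={
            "containerOverrides": [
                {"name": event["task_id"], "command": event.get("args", [])}
            ]
        },
    )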

Docker Pipeline

$ dpl task register

$ dpl pipeline register

$ dpl pipeline run

dpl task register: define resource requirements for a tool and register that tool, which makes it available for use in pipelines.
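For illustration only, a task definition handed to dpl task register might look something like the following; the actual schema was not shown in the talk:

# Hypothetical contents of task.json; the real schema was not shown in the talk.
task = {
    "name": "ancestry",
    "version": "1.0.5",
    "image": "ancestry:1.0.5",   # Docker image previously pushed to ECR
    "cpu": 2048,                 # CPU units to reserve on the cluster
    "memory": 8192,              # memory (MiB) to reserve for the container
}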


dpl pipeline register: define a pipeline that uses tasks already registered within the system.
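Again purely illustrative, a pipeline definition chaining already-registered tasks might look like this; step names and the wiring format are assumptions:

# Hypothetical pipeline definition referencing registered tasks.
pipeline = {
    "name": "demo",
    "version": "1.0.1",
    "steps": [
        {"task": "ancestry_1.0.5_container", "args": ["--sample", "${sample_id}"]},
        {"task": "report_1.0.0_container", "args": ["--input", "${previous_output}"]},
    ],
}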


dpl pipeline run: run a pipeline by name, with any arguments that are needed.


ancestry user$ dpl task register task.json
{ task_id: ancestry_1.0.5_container }



pipeline user$ dpl pipeline register
{ pipeline_id: demo_1.0.1 }



pipeline user$ dpl run demo
{
  "domain": "dpl-ind-a",
  "pipelineId": "demo_1.0.0",
  "runId": " 23BFVdVVBvOvGOebScSv7…
  "workflowId": " 5079851a-8802-4141-b9...
}


Docker Pipeline

(Architecture: pipeline definitions are registered into the pipeline registry with "$ dpl pipeline register", task definitions into the task registry with "$ dpl task register", and tool images are pushed to ECR with "$ docker push"; "$ dpl run" then starts a pipeline on the platform, which comprises an AWS OpsWorks stack, SWF, ECR, ECS, Spot Fleet, Lambda, SQS, and S3.)


Docker Pipeline

Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new workflows and run steps in parallel
✓ Easy to optimize for workload and handle failures at each step
✓ Easy to accommodate complex workflows
✓ Easy to share tools across pipelines, and greatly simplifies pipeline definition

Drawbacks
✗ Can’t optimize for workload at each step
✗ Painful to support Docker on EC2 instances ourselves
✗ No cost optimization since using on-demand instances
✗ Pain around version and deploy of images
✗ Pain around version and deploy of platform


The Journey: Step 5, go faster with continuous delivery

Deploying changes frequently was very time consuming and risky.

Go Faster with Continuous Delivery

Continuous delivery at HLI: automation, integration testing, and push-button to prod.


Go Faster with Continuous Delivery

AWS CodePipeline takes the latest build from continuous integration through three environments:

• Dev: 1. deploy (AWS CodeDeploy), 2. smoke test (AWS Lambda), then promote
• Int: 1. deploy (AWS CodeDeploy), 2. integration test (AWS Lambda), then promote
• Stage: 1. blue/green deploy (AWS CodeDeploy), 2. integration test (AWS Lambda), 3. approval to prod (Amazon SNS)
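A minimal sketch of the kind of test Lambda a CodePipeline Lambda action can invoke after CodeDeploy; the function name and the smoke-test body are hypothetical, but the job callback calls are standard CodePipeline API:

import boto3

codepipeline = boto3.client("codepipeline")

def handler(event, context):
    # Invoked by a CodePipeline Lambda action after CodeDeploy finishes.
    job_id = event["CodePipeline.job"]["id"]
    try:
        run_smoke_test()  # placeholder for the actual health check
        codepipeline.put_job_success_result(jobId=job_id)
    except Exception as exc:
        # Failing the job stops the promotion to the next environment.
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)},
        )

def run_smoke_test():
    # e.g. submit a tiny pipeline run and assert that it completes
    pass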


Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new workflows and run steps in parallel
✓ Easy to optimize for workload and handle failures at each step
✓ Easy to accommodate complex workflows
✓ Easy to deploy platform changes into production daily

Drawbacks
✗ Pain around version and deploy of images
✗ Pain around version and deploy of platform

Summary

• Dramatic simplification in pipeline complexity: from 2 KLOC to 20 LOC plus config
• Significant reduction in time to generate reports: from weeks to hours, fully automated
• Significant cost savings with Spot: from $32 to $6 per report
• Daily deployments of platform changes to production environments: from weeks or months to daily
• Dramatically easier handoff between bioinformatics and engineering: from code to configuration

Next Steps

• Create framework to simplify tool development

• Support running step workload on Spark cluster

Celgene Research Collaboration Environments

Agenda

• Company & Industry Trends

• Collaboration Models

• Configuration & Security

• Lessons Learned & Tips

About Celgene

Biotech focused on cancer and inflammatory diseases with 300+ clinical trials in progress. Major products include Otezla, Revlimid, Thalomid, and Pomalyst.

Scope

• Discovery research

• Clinical development

• Drug manufacturing

• Sales/Distribution

Scale

• ~7000 employees

• ~60 sites globally

Industry Trends

• Collaborations and Partnerships

• Even faster paced R&D

• Scale

• Cloud native solutions

Myeloma Genome Project on AWS

For more information, please contact ggeissman@celgene.com

Many Collaboration Systems, 2 Models

Managed Software / SaaS
• COTS platform for end users
• Web GUI
• Lab data

HPC Collaboration
• Raw IaaS for developer-users
• API / shell access
• Petabyte scale

Collaboration Structure

Many collaborations, with CROs and vendors
• Two collaboration models
• Multi-AWS account
• AWS + management is the same

Example MSaaS Collaboration Architecture

(Architecture diagram components: users, WAF, web servers, transaction processing, extraction processing, data processing, plugin pipeline, Amazon SQS, Amazon S3, Amazon RDS, logging, SSO, platform services.)

Example HPC Collaboration Architecture

(Architecture diagram components: VPC subnets, bastion host, Auto Scaling, CloudWatch, SQS, CloudTrail, EFS, S3 via VPC endpoint.)

Connectivity

• Multi-account model + VPC

• Connectivity options

• Big decision factors

Multi Account Model

• Isolation of workloads

• Ease of management

• Guardrails tool: TurbotHQ.com

Hardened AWS Environment

• Network Controls

• Object Storage Controls

• Credentials

• Auditing

Services Used

All collaborations use:
• EC2 / ECS
• S3 / Glacier
• EFS
• VPC + Direct Connect

Primary Reasons for AWS

• Speed of deployment

• Security / isolation

• Elastic nature supports unknown requirements

Regions used:

• us-east-1

• us-west-1

• eu-central-1

• us-west-2*

Access for MSaaS Collaborations
• COTS software
• Vendor API access
• User access via app/SSO
• Roles for app
• Account isolation

IAM Policy: SQS access

{
  "Effect": "Allow",
  "Action": [
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl",
    "sqs:PurgeQueue",
    "sqs:SendMessage",
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "arn:aws:sqs:us-west-1:111122223333:a-queue"
}
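Under a statement like this, the collaborating application is limited to that single queue; a sketch of how it might use it (queue name and message contents are hypothetical):

import boto3

sqs = boto3.client("sqs")

# The application's role only has access to the one collaboration queue.
queue_url = sqs.get_queue_url(QueueName="a-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "new-sample"}')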

IAM Policy: Not allowed!

{
  "Effect": "Allow",
  "Action": "ec2:*",
  "Resource": "*"
}

User access for HPC/IaaS Collaborations
• Software “type”
• User access
• IAM policies
• AWS Console

User access for S3 buckets
• Automated security
• Business rules
• Data sciences / management

Bucket Policy: Server-Side Encryption

For AWS-provided keys (SSE-S3):

{
  "Sid": "ObjectsMustBeEncryptedAtRest",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::example-bucket/*",
    "arn:aws:s3:::example-bucket"
  ],
  "Condition": {
    "StringNotEquals": {
      "s3:x-amz-server-side-encryption": "AES256"
    }
  }
}

For a KMS-provided key (SSE-KMS):
  "s3:x-amz-server-side-encryption": "aws:kms"

For a customer-provided key (SSE-C):
  "s3:x-amz-server-side-encryption-customer-algorithm": "AES256"
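With the deny statement above attached to the bucket, uploads must declare server-side encryption or they are rejected; a sketch with hypothetical bucket and key names:

import boto3

s3 = boto3.client("s3")

# Uploads must declare server-side encryption or the bucket policy denies them.
with open("sample.vcf.gz", "rb") as body:
    s3.put_object(
        Bucket="example-bucket",
        Key="alice/sample.vcf.gz",
        Body=body,
        ServerSideEncryption="AES256",   # or "aws:kms" when the bucket requires SSE-KMS
    )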

Bucket Policy: Require HTTPS

{
  "Sid": "ObjectsMustBeEncryptedInTransit",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::example-bucket/*",
    "arn:aws:s3:::example-bucket"
  ],
  "Condition": {
    "Bool": {
      "aws:SecureTransport": false
    }
  }
}

Only from select IPs:
  "Condition": {
    "NotIpAddress": {
      "aws:SourceIp": [

Or from a VPC endpoint:
  "Condition": {
    "StringNotEquals": {
      "aws:SourceVpce": "vpce-1a2b3c4d"

Or from a VPC (if multiple endpoints):
      "aws:SourceVpc": "vpc-111bbb22"

IAM Policy: Limit keys (aka “folder” location)

{
  "Sid": "LimitListBucketForUsers",
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::example-bucket",
  "Condition": {"StringLike": {"s3:prefix": "${aws:username}/*"}}
},
{
  "Sid": "ObjectsActionsForUsersInHome",
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:Abort*"
  ],
  "Resource": "arn:aws:s3:::example-bucket/${aws:username}/*"
}
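With this policy, a user can list and touch only keys under a prefix matching their IAM user name; for example (hypothetical bucket and user):

import boto3

s3 = boto3.client("s3")

# An IAM user named "alice" can list only keys under her own prefix.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="alice/")
for obj in resp.get("Contents", []):
    print(obj["Key"])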

Collaboration Data / IP / Code

Code
• Enterprise GitHub

Data
• Data retention policy
• Data Science team

Collaboration environment
• Long term
• Open source

Lessons Learned / Tips

Use cloud best practices
• Expect failure
• Automate
• Use services (as intended)
• Data transfer / errors

Soft lessons
• Past enterprise experience
• Vendors
• Users
• Get buy-in

AWS Summary

Similar advantages:
• Rapid infrastructure deployment
• Isolated work areas
• Common components drawn into a larger reusable framework
• Elastic resources: accommodate any size workload
• Accessible: reach the infrastructure from anywhere in the world

…drive toward reliable and *reproducible* collaborative science at a scale previously unachievable.

Thank you!

Remember to complete your evaluations!