AWS Life Sciences Days Boston, MA Mark Johnston, Director of Global Business Development, Healthcare and Life Sciences

2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Page 1: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

AWS Life Sciences Days
Boston, MA

Mark Johnston, Director of Global Business Development, Healthcare and Life Sciences

Page 2: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Agenda

1. 01:00 PM – 01:30 PM: Introduction and Opening Remarks
2. 01:30 PM – 02:30 PM: Best practices when building a validated system on AWS for the Life Sciences
   02:30 PM – 02:45 PM: Break
3. 02:45 PM – 03:30 PM: Repeatable Science at Scale: Using Common Workflow Language and Docker for science on AWS
4. 03:30 PM – 04:15 PM: Cognizant: Agile Cloud Native Development with AWS
5. 04:15 PM – 05:00 PM: Scalable Genomics Analysis in the Cloud with ADAM
6. 05:00 PM – 06:30 PM: Closing Remarks, Q&A and Networking

Page 3: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

COMPETING ON KNOWLEDGE AND INSIGHTS: AN ACADEMIC MEDICAL CENTER’S PERSPECTIVE

PARTNERS HEALTHCARE
MAY 17, 2016

Trung Do
Executive Director, Business Development – Innovation
Partners HealthCare

Page 4: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

PARTNERS HEALTHCARE

• Operating Revenue: $9.0 Billion
• Research Revenue: $1.5+ Billion
• Inpatient Discharges: 166,700
• Licensed Beds: 4,000
• Lives Under Management: 750,000
• Physicians: 6,500
• Employees (FTEs): 68,000
• Clinical Trials: 1,200
• Clinical & Research Fellows and Residents: 4,300

Faculty Appointed at Harvard Medical School

• Brigham and Women’s Hospital: Founded 1832; Ranked #6 in US
• Massachusetts General Hospital: Founded 1811; Ranked #1 in US

Copyright 2016 – Partners HealthCare Incorporated – All Rights Reserved

Page 7: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

THE INNOVATION CHALLENGE

INNOVATING AT SCALE

Page 8: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

INNOVATION AT PARTNERS HEALTHCARE

• Innovation is a core Partners activity that supports and sustains our ability to constantly improve the care we deliver to our patients to enrich their lives and well-being

• The value of an integrated health system, like Partners HealthCare, is its ability to innovate at scale


Page 9: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

CHALLENGES TO INNOVATING AT SCALE

• Many small advances occur continuously throughout leading academic medical centers

• Difficult to ensure these innovations are broadly adopted by clinicians and accessible to all patients

• There are many impediments
  • Time
  • Organizational structures
  • Physical distance

Communication is challenging


Page 10: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

ERA OF DATA SCIENCE

Levels of analytics, in the order listed on the slide:

• Standard reports: “What happened?”
• Ad hoc reports: “How many, how often, where?”
• Query/drill down: “What exactly is the problem?”
• Alerts: “What actions are needed?”
• Statistical analysis: “Why is this happening?”
• Forecasting / extrapolation: “What if these trends continue?”
• Predictive modeling: “What will happen next?”
• Optimization: “What is the best that can happen?”

Figure bands: Descriptive Analytics, Predictive Analytics
Axes: Competitive Advantage vs. Sophistication of Intelligence

Source: Competing on Analytics: The New Science of Winning (Davenport / Harris)

Page 11: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS

• Standard reports: EHR ecosystem
• Ad hoc reports: Quality Data Warehouse
• Query/drill down: Quality Data Warehouse
• Alerts: CDS built on EHR
• Statistical analysis: Observational Studies, Clinical Trials

Figure groupings: Clinical, Research
Axes: Competitive Advantage vs. Sophistication of Intelligence

Massive bifurcation causes significant inefficiencies

Page 12: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS

• Standard reports: EHR ecosystem
• Ad hoc reports: Quality Data Warehouse
• Query/drill down: Quality Data Warehouse
• Alerts: CDS built on EHR
• Statistical analysis: Observational Studies, Clinical Trials

Figure groupings: Clinical, Research
Axes: Competitive Advantage vs. Sophistication of Intelligence

Massive bifurcation causes significant inefficiencies

Opportunity! Establish continuous learning environment based on predictive and population based analytics

Page 13: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS

• Standard reports: EHR ecosystem
• Ad hoc reports: Quality Data Warehouse
• Query/drill down: Quality Data Warehouse
• Alerts: CDS built on EHR
• Statistical analysis: Observational Studies, Clinical Trials
• Forecasting / extrapolation: Trends – pharmacovigilance surveillance
• Predictive modeling: Machine Learning - CDS
• Optimization: Patient Specific Predictive CDS

Continuous Learning Healthcare System

Axes: Competitive Advantage vs. Sophistication of Intelligence

Page 14: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

USING BIG DATA TO IMPROVE HEALTHCARE

Data and methods shown: Public Data Sets; Database Schemas; ML Cluster Analysis; Image Data

• To Help Predict Outcomes and Support Medical Decisions
• To Learn About what Differences in People are Important for Predicting Disease
• To Understand the Disease Traits Caused by a Gene Variant
• To Help Interpret Features in Medical Images and Tissues

Page 15: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

BIG DATA SOURCES…

• EHR / Hospital Data

– Research Patient Data Registry (RPDR)

• 7 million patients

• 2 billion diagnoses, medications, laboratories and clinical findings

• 13 billion images

• Partners Biobank: 40,000 consented patient samples linked to EMR (target 100,000 by 2018)

Department of


Page 16: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

BIG DATA COMMONS: Creating an Enterprise Wide Query And Analysis System For Big Data

Big Data Commons: Integrates disparate islands of patient data (clinical and research data) onto a common Big Data Platform

• Enhanced Query Tool
• Partners Biobank Portal – Genomics Data, Samples
• Public Health Data
• Imaging Data
• Notes / Text Repository (e.g. physician notes)

Page 17: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

UNPREDICTABLE QUALITY USING RAW ICD9/10 CODES

| Phenotype | Count with ICD-9/ICD-10 code | Count (90% positive predictive value) | Count with genotype data |
|---|---|---|---|
| Asthma | 7618 | 3322 | 805 |
| Bipolar Disorder | 1754 | 219 | 84 |
| Breast Cancer | 2101 | 1711 | 378 |
| Congestive Heart Failure | 10160 | 4597 | 1859 |
| Coronary Artery Disease | 1435 | 803 | 236 |
| Crohn’s Disease | 5177 | 700 | 350 |
| Depression | 11154 | 4273 | 1074 |
| Epilepsy | 2351 | 1211 | 381 |
| Gout | 2464 | 1828 | 566 |
| Hypertension | 20788 | 16995 | 4553 |
| Multiple Sclerosis | 602 | 320 | 58 |
| Obesity | 10245 | 12179 | 3191 |
| Rheumatoid Arthritis | 3475 | 878 | 261 |
| Schizophrenia | 509 | 83 | 14 |
| Type 1 Diabetes | 2196 | 232 | 61 |
| Type 2 Diabetes | 7123 | 4385 | 1268 |
| Ulcerative Colitis | 1359 | 624 | 157 |

May 4, 2016, n ~ 40,000
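How far the raw ICD counts drift from the 90% PPV counts can be read off the slide with simple arithmetic. The sketch below copies a few rows from the slide and computes the ratio of the two counts; the names `COUNTS` and `ppv_to_icd_ratio` are illustrative only and do not come from the talk.

```python
# Counts copied from the slide:
# (patients with an ICD-9/ICD-10 code, patients at 90% positive predictive value)
COUNTS = {
    "Bipolar Disorder": (1754, 219),
    "Type 1 Diabetes": (2196, 232),
    "Asthma": (7618, 3322),
    "Breast Cancer": (2101, 1711),
    "Hypertension": (20788, 16995),
    "Obesity": (10245, 12179),
}


def ppv_to_icd_ratio(icd_count, ppv_count):
    """Ratio of the 90% PPV count to the raw ICD-coded count."""
    return ppv_count / icd_count


if __name__ == "__main__":
    # Print phenotypes from the smallest ratio to the largest.
    ordered = sorted(COUNTS.items(), key=lambda item: ppv_to_icd_ratio(*item[1]))
    for phenotype, (icd, ppv) in ordered:
        print(f"{phenotype:<18} {ppv_to_icd_ratio(icd, ppv):.2f}")
```

For these rows the ratio runs from roughly 0.1 (Type 1 Diabetes, Bipolar Disorder) to above 1 (Obesity), which is the spread the slide title points at.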

Page 18: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

CREATING QUALITY DATA WITH SUPERVISED MACHINE LEARNING

1. Create a gold standard training set.
2. Create a comprehensive list of features (concepts/variables) that describe the phenotype of interest.
3. Develop the classification algorithm. Using the data analysis file and the training set from step 1, assess the frequency of each variable. Remove variables with low prevalence. Apply adaptive LASSO penalized logistic regression to identify highly predictive variables for the algorithm.
4. Apply the algorithm to all subjects in the superset and assign each subject a probability of having the phenotype.

95% Specificity
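Two of the steps above are mechanical enough to sketch: the prevalence filter in step 3 and the probability assignment in step 4. The following is a minimal illustration, not the Partners implementation. The feature names, the 5% prevalence threshold, and the coefficients are hypothetical, and the adaptive LASSO fit itself is out of scope here (the coefficients simply stand in for its output).

```python
import math


def prevalence(rows, feature):
    """Share of patients in `rows` for whom `feature` is present (non-zero)."""
    return sum(1 for row in rows if row.get(feature, 0)) / len(rows)


def drop_rare_features(rows, features, min_prevalence=0.05):
    """Step 3, first half: keep only features seen often enough in the training set."""
    return [f for f in features if prevalence(rows, f) >= min_prevalence]


def phenotype_probability(row, intercept, coefficients):
    """Step 4: logistic-regression probability that a patient has the phenotype."""
    score = intercept + sum(w * row.get(f, 0) for f, w in coefficients.items())
    return 1.0 / (1.0 + math.exp(-score))


# Toy training rows with hypothetical feature names and values.
training = [
    {"icd_code_count": 3, "nlp_mention": 1, "rare_lab": 0},
    {"icd_code_count": 0, "nlp_mention": 0, "rare_lab": 0},
    {"icd_code_count": 1, "nlp_mention": 1, "rare_lab": 0},
    {"icd_code_count": 0, "nlp_mention": 0, "rare_lab": 0},
]
kept = drop_rare_features(training, ["icd_code_count", "nlp_mention", "rare_lab"])

# Hypothetical coefficients standing in for the output of the penalized fit.
coefficients = {"icd_code_count": 0.8, "nlp_mention": 1.5}
probability = phenotype_probability(training[0], intercept=-3.0, coefficients=coefficients)
```

A probability cut-off would then be chosen on the gold standard set to reach a target such as the 95% specificity shown on the slide; that selection is not shown here.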


Page 19: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

PREDICTABLE QUALITY - COMPUTED PHENOTYPES


Page 20: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DEFINITIVE DISEASE STATES ASSOCIATED WITH GENOMIC DATA


Page 21: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

CLINICAL APPLICATIONS
LEVERAGING MACHINE LEARNING IN HEALTHCARE

Page 22: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

CLARIFAI.COM

Page 24: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

CLINICAL DATA SCIENCE
SYMPTOMATIC DATA DISCOVERY

Can we find disease before symptoms appear clinically?

Stages shown: Asymptomatic, Data Symptomatic, Clinically Symptomatic

ASYMPTOMATIC SCREENING
- BREAST CANCER
- COLON CANCER
- LUNG CANCER

DATA SYMPTOMATIC
- CLINICAL DATA
- GENETIC DATA
- CONSUMER DATA

Other figure labels: CLINICAL DIAGNOSTIC SERVICES; DIAGNOSTIC CLINICAL DECISION SUPPORT; Symptomatic Data Discovery; Machine Learning

Page 25: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

MGH CLINICAL DATA SCIENCE CENTER

CLINICAL APPLICATIONS
• SPECIALTY: Oncology, Neuroscience, Cardiac
• DIAGNOSTICS: Radiology, Pathology, Cardiology
• THERAPEUTICS: Pharmacologic, Procedural, Surgery
• POPULATION HEALTH: Disease management, Wellness planning, Prevention

CLINICAL DATA
• IMAGING: MRI, CT, PET, US, X-ray, NUC
• LABORATORY: Biochemistry, Microbiology, Immunology, Hematology
• GENETICS: Genome, Arrays, Probes
• PERSONAL: Home monitor, Wearables, Direct to consumer
• EHR: Clinical, Other

Other figure labels: DATA, ALL

Page 26: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

MGH CLINICAL DATA SCIENCE
INITIAL FOCUS: DIAGNOSTIC RADIOLOGY APPLICATIONS

CLINICAL APPLICATIONS
• DIAGNOSTICS: Radiology

CLINICAL DATA
• IMAGING: MRI, CT, PET, US, X-ray, NUC
• LABORATORY: Biochemistry, Microbiology, Immunology, Hematology
• GENETICS: Genome, Arrays, Probes
• PERSONAL: Home monitor, Wearables, Direct to consumer
• EHR: Clinical, Other

Other figure labels: MGHCADS, DATA, ALL

Page 27: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

MGH CLINICAL DATA SCIENCE CENTER
APPLYING NEURAL NETWORKS

Figure labels: EHR; Interpretation; Decision Support Systems

Page 28: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

MGH CLINICAL DATA SCIENCE CENTER
MACHINE LEARNING RESOURCES

Figure labels: EHR; Interpretation; Decision Support Systems

• Formulate the correct clinical questions
• Acquire (retrieve) the necessary volume of clinical data
• Accurately label the clinical data such that it answers the clinical questions
• Apply the appropriate Data Science techniques (training and validation)
• Iterate until the desired (or best) accuracy results
• Clinically validate the resulting ‘machine learning appliance’
• Implement into clinical practice
• License/Commercialize

Page 31: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Best practices when building a validated system on AWS for the Life Sciences

Chris McCurdy
Healthcare and Life Sciences Specialist, AWS

Matt Szenher
Principal Architect at Medidata

Page 32: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Agenda

• DevOps to DevSecOps Primer

• Observed industry cloud techniques with AWS
• Tools, processes and frameworks to assist

• Medidata Audit Ingestion

Page 33: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DevOps Level Set

Development

Quality Assurance

Operations

DevOps

Page 34: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DevOps Toolchain

• Plan: Define and plan; business value, application requirements and metrics
• Create: Building, coding and configuration
• Verify: Ensuring quality; acceptance, regression testing
• Configure: Infrastructure and application
• Preprod: Approval/certification, triggered releases, release staging and holding
• Monitor: Process, application and infrastructure
• Release: Release coordination, promotion, scheduling, rollback and recovery

Page 35: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DevOps Principles

• Collaborate with all stakeholders

• Codify everything

• Test everything

• Automate everything

• Measure and monitor everything

• Deliver business value with continual feedback

Manual Hacking

Page 36: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Drivers for DevSecOps

Embedding Security into DevOps was not successful because…

• Compliance checklists didn’t take us far before we stopped scaling…

• We couldn’t keep up with deployments without automation…

• Standard Security Operations did not work…

• And we needed far more data than we expected to help the business make decisions…

Page 37: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DevSecOps: Security as Code

Establishing these principles…

• Customer focused mindset

• Scale, scale, scale

• Objective criteria

• Proactive hunting

• Continuous detection and response

Page 38: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

DevOps Toolchain

• Plan: Define and plan; business value, application requirements, security, compliance and metrics
• Create: Build, code and configuration
• Verify: Ensuring quality; acceptance, regression, security and compliance testing
• Configure: Infrastructure and application
• Preprod: Approval/certification, triggered releases, release staging and holding
• Monitor: Process, application, infrastructure, security and compliance
• Release: Release coordination, promotion, scheduling, rollback and recovery

Page 39: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Here’s some infrastructure as Code (AWS CloudFormation template):

"myVPC": {
  "Type": "AWS::EC2::VPC",
  "Properties": {
    "CidrBlock": {"Ref": "myVPCCIDRRange"},
    "EnableDnsSupport": false,
    "EnableDnsHostnames": false,
    "InstanceTenancy": "default"
  }
},
"myInstance": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": {
      "Fn::FindInMap": ["AWSRegionToAMI", {"Ref": "AWS::Region"}, "64"]
    },
    "SecurityGroupIds": [{"Fn::GetAtt": ["myVPC", "DefaultSecurityGroup"]}],
    "SubnetId": {"Ref": "mySubnet"}
  }
}
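Because the template is plain data, it can be checked automatically before anything is deployed, in the spirit of "test everything". The sketch below is not CloudFormation's own validation. It re-types the two resources from the slide as a Python dict and flags `Ref` targets that the fragment itself does not define; the helper names `find_refs` and `unresolved_refs` are hypothetical.

```python
# The two resources from the slide, written as a Python dict.
RESOURCES = {
    "myVPC": {
        "Type": "AWS::EC2::VPC",
        "Properties": {
            "CidrBlock": {"Ref": "myVPCCIDRRange"},
            "EnableDnsSupport": False,
            "EnableDnsHostnames": False,
            "InstanceTenancy": "default",
        },
    },
    "myInstance": {
        "Type": "AWS::EC2::Instance",
        "Properties": {
            "ImageId": {
                "Fn::FindInMap": ["AWSRegionToAMI", {"Ref": "AWS::Region"}, "64"]
            },
            "SecurityGroupIds": [{"Fn::GetAtt": ["myVPC", "DefaultSecurityGroup"]}],
            "SubnetId": {"Ref": "mySubnet"},
        },
    },
}


def find_refs(node):
    """Yield every logical name used in a {"Ref": name} expression."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def unresolved_refs(resources, declared=()):
    """Names referenced but neither declared nor an AWS pseudo parameter."""
    known = set(resources) | set(declared)
    return sorted(
        name
        for name in set(find_refs(resources))
        if name not in known and not name.startswith("AWS::")
    )


missing = unresolved_refs(RESOURCES)
```

Run against the fragment alone, the check reports `mySubnet` and `myVPCCIDRRange`, which the slide does not define; in a complete template they would presumably be declared as a resource and a parameter.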

Page 40: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Here’s some security as Code (AWS IAM):

{
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
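The intent of the two Deny statements can be spelled out as a small decision function. This is a deliberately simplified stand-in, not the IAM policy engine: it looks only at the `x-amz-server-side-encryption` request header and ignores principals, resources, and any other policy that might apply. The function name `put_object_denied` is hypothetical.

```python
def put_object_denied(request_headers):
    """Return the Sid of the statement that denies the upload, or None.

    Simplified reading of the two Deny statements in the policy above.
    """
    sse = request_headers.get("x-amz-server-side-encryption")
    if sse is None:
        # "Null": {"s3:x-amz-server-side-encryption": "true"} matches a missing header.
        return "DenyUnEncryptedObjectUploads"
    if sse != "AES256":
        # "StringNotEquals" matches any header value other than AES256.
        return "DenyIncorrectEncryptionHeader"
    return None


no_header = put_object_denied({})
wrong_value = put_object_denied({"x-amz-server-side-encryption": "aws:kms"})
correct = put_object_denied({"x-amz-server-side-encryption": "AES256"})
```

Only uploads that explicitly request AES256 server-side encryption pass both statements, so the encryption requirement is enforced by the bucket rather than by a checklist.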

Page 41: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Cloud Era

Page 42: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Observed industry cloud techniques with AWS

Page 43: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

AWS as components

AWS makes commercial cloud infrastructure software products and office productivity applications that are user-configurable, general purpose in nature, and delivered to commercial IT standards like ISO, NIST, SOC and others. This is similar to other general purpose IT products and services such as database engines, operating systems, programming languages, internet service providers, etc. Many organizations categorize AWS products as commercial-off-the-shelf (COTS) infrastructure software products, which is consistent with the US federal government’s use of AWS Products as a COTS item through a federal procurement program called FedRAMP.

“Using AWS in GxP Systems” AWS Whitepaper

http://icon-park.com/icon/light-orange-lego-brick-vector-data-for-free/

Page 44: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

A Word on Security

Security in the cloud (Customers):
• Customer content
• Platform, Applications, Identity & Access Management
• Operating System, Network & Firewall
• Client-side encryption implementation, Server-side encryption, Network Traffic Protection

Security of the cloud:
• AWS Foundation Services: Compute, Storage, Database, Networking
• AWS Global Infrastructure: Regions, Availability Zones, Edge Locations

Page 45: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Consult internally before implementing

The following slides are practices we have seen used in industry. Because security and industry compliance are determined by the customer, before implementing please:

• Consult your internal best practices
• Consult with your Cloud Center of Excellence
• Consult with your Information Security group
• Consult with your Compliance organization
• Do your due diligence

Page 46: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

General Strategies

• Decouple protected/sensitive data from the processing or orchestration
• Track where your protected/sensitive data flows
• Do not check the protected data into your source or artifact repository!
• Use indirection when orchestrating your protected/sensitive data flow
• Separate protected/sensitive and general workflow logical boundaries

Services shown: AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline

Consult with compliance and security organizations before implementing

Page 47: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Separate Virtual Private Cloud (VPC) Strategy

Diagram labels: Protected/Sensitive (P/S) Data VPC; General VPC; Amazon EC2; Amazon EMR; Amazon S3; AWS Directory Service; AWS Device Farm

Consult with compliance and security organizations before implementing

Page 48: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Indirection Strategy

Diagram labels: P/S Data; HTTPS Send; Inbound Data Store (S3); Claims; SNS; SQS; Data Processing System

Consult with compliance and security organizations before implementing
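One way to read the indirection strategy is that the queue carries only a claim (a pointer to the object) while the protected data itself stays in the inbound store. The sketch below illustrates that reading with an in-memory stand-in for S3. The bucket name, key, and the helpers `build_claim` and `resolve_claim` are hypothetical, and no AWS API is called.

```python
import json


def build_claim(bucket, key, version_id=None):
    """Build a queue message that points at protected data instead of carrying it."""
    claim = {"bucket": bucket, "key": key}
    if version_id is not None:
        claim["version_id"] = version_id
    return json.dumps(claim, sort_keys=True)


def resolve_claim(message_body, fetch_object):
    """Consumer side: parse the claim, then fetch the payload inside the protected boundary."""
    claim = json.loads(message_body)
    return fetch_object(claim["bucket"], claim["key"])


# In-memory stand-in for the inbound data store (hypothetical bucket and key).
fake_store = {
    ("inbound-ps-data", "uploads/record-001.json"): b'{"patient": "redacted"}',
}

message = build_claim("inbound-ps-data", "uploads/record-001.json")
payload = resolve_claim(message, lambda bucket, key: fake_store[(bucket, key)])
```

The message that travels through the orchestration layer contains a location only, so queues, logs, and dead-letter stores never hold the protected content.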

Page 49: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Example: Analytics Workflow

Diagram components: Medical Data; Inbound Data Store (S3); Inbound Archive (Glacier); New Object Message; SQS; AWS Lambda; Insight System (EMR); Data Lake (S3); Columnar Query Store (Redshift); Amazon SES; P/S Insights; General Insights. The diagram numbers the flow in steps 1 through 9.

Consult with compliance and security organizations before implementing

Page 50: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Compliance Example Workflow (using DevSecOps)

Diagram actors and services: Security / Compliance Admin; Developers; CloudFormation template; AWS CodeCommit; AWS Service Catalog; CloudFormation stack; AWS Config; AWS CloudTrail; Amazon S3; AWS CloudWatch alarm

Diagram actions (numbered 1 through 12 on the slide): Define; Git push; Publish; Browse and Launch; Provisions; Track changes; Logs all API calls; Monitors; Initiates; Notifies

Consult with compliance and security organizations before implementing

Page 51: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Audit Ingestion

Page 52: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

© 2016 Medidata Solutions, Inc. – Proprietary and Confidential

MAudit: Audit Ingestion and DevSecOps

Matthew Szenher
Principal Architect - Medidata Core Web Services

Page 53: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


About Medidata

SaaS Platform for clinical development, analytics and benchmarking in life sciences

Started in 1999

Over 9,000 trials in more than 130 countries

Serve CROs and contracting partners

We’re hiring: https://www.mdsol.com/en/careers

Page 54: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

What are Audits?

A record of actions that create, modify or delete clinically relevant data.

Crucial for asserting confidentiality, integrity and authenticity of this data.

I’ll talk about how auditing is difficult, and how AWS makes DevSecOps for auditing solutions a lot easier...

Page 55: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Audits are a MUST

MUST be captured transactionally with patient data points (as well as other clinically relevant data)

MUST be persisted

MUST be immutable

MUST be consistent

MUST be secure

SHOULD be cheap

Page 56: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Audits are VOLUMINOUS

Medidata persists eight billion clinical records from more than two million patients across more than 9,000 studies

More than half a million patient data points are added daily

Regulations require audits to be captured transactionally with these data points (as well as other clinically relevant data)

… ~600 audits per second

And growing!

Page 57: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


And Growing

GADGET trial with GlaxoSmithKline

Patients wore Vital Connect Health Patch (http://www.vitalconnect.com/)

ECG, skin temp., etc.

1 week

~350 GB of audit data

~300 million data points (and their audits)

More data than many years-long trials collect over their lifetimes
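The scale of this one trial is easier to judge as a rate. A back-of-the-envelope sketch using only the slide's figures, under two assumptions that are not in the talk: the data arrived evenly over the week, and "GB" means decimal gigabytes.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

data_points = 300_000_000   # ~300 million data points (from the slide)
audit_bytes = 350 * 10**9   # ~350 GB of audit data, assuming decimal gigabytes

# Average ingest rate if the week's data arrived evenly.
points_per_second = data_points / SECONDS_PER_WEEK

# Average audit payload per data point.
bytes_per_point = audit_bytes / data_points

print(f"{points_per_second:.0f} data points per second on average")
print(f"{bytes_per_point:.0f} bytes of audit data per data point")
```

Under those assumptions the single wearable trial averages close to 500 data points per second, each carrying a little over 1 KB of audit data.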

Page 58: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Solution: MAudit

Scalable

Centralized

Durable

Highly Available

Secure

Audit ingestion and validation service….

Built on AWS infrastructure

Page 59: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Audit Producers MAudit Servers

(EC2)

Page 60: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


MAudit and DevSecOps

S3: Programmatically defined persistence, with security and infinite scaling

Autoscaling Groups: Codified app server scaling

Kinesis: Codified, scalable streaming of data

IAM: Programmatically defined access controls

CloudFormation: Specifying all of the above in code

Page 61: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Thank You

Page 62: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

http://icon-park.com/icon/light-orange-lego-brick-vector-data-for-free/

Page 64: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Angel Pizarro, AWS Scientific Computing
Brad Chapman, Harvard Medical School

May 17, 2016

Containers for Science!?!
Reproducible science at scale on AWS using Common Workflow Language, Docker, and Amazon Elastic Container Service

Page 65: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Agenda

Review Common Workflow Language and bcbio

Intro to Docker and Amazon Elastic Container Service

CWL+Docker workflow on top of ECS

Page 66: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Common Workflow Language

Page 67: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

We need faster, better science

https://twitter.com/KMS_Meltzy/status/661206070308794368

Page 68: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Large Scale Infrastructure Development

Shared problems: Academia, Industry, Startups

● Workflow implementations

● Validation

● Scaling

● Support

Page 69: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Blue Collar Bioinformatics (bcbio)

https://github.com/chapmanb/bcbio-nextgen

Page 70: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Uses

Aligners: bwa, novoalign, bowtie2, HISAT2

Variation: FreeBayes, GATK, VarDict, MuTect, Scalpel, SnpEff, VEP, GEMINI, Lumpy, Manta, CNVkit, WHAM

RNA-seq: TopHat, STAR, Sailfish, Kallisto

Quality control: MultiQC, FastQC, Qualimap

Manipulation: bedtools, bcftools, biobambam, sambamba, samblaster, samtools, vcflib, vt

Page 71: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Provides

http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html

Page 72: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Validation: low frequency cancer variants

http://bcb.io/2016/04/04/vardict-filtering/

Page 73: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Why community developed workflows?

Problem: complexity of parallelization reduces tool reuse

● Ties biology to job running framework

● Results in reimplemented pipelines

● Each requires AWS integration and scaling

Page 74: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Why community developed workflows?

Solution: Common framework for describing tools and

workflows

● Shared language for common concepts

● Increased re-use of tool definitions

● Mix and match workflow components

Page 75: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

http://www.commonwl.org/

Page 76: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Common Workflow Language implementations

• Arvados: https://arvados.org/
• Toil from BD2K Center for Translational Genomics: http://toil.readthedocs.io
• Planemo (Galaxy Project): https://github.com/galaxyproject/planemo
• Seven Bridges: https://sbgenomics.com

Page 78: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

bcbio + Docker

http://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html

● Single container with many biological tools

● https://hub.docker.com/r/bcbio/bcbio/

● Same workflow management as non-Docker

● Run with any platform supporting CWL

Page 79: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

bcbio + Common Workflow Language

http://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html

● Generates CWL from sample description

● Data management integrates with platform

○ Arvados: Keep containers

○ Toil: AWS buckets

Page 81: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

arvados_testcwl-workflow/
├── main-arvados_testcwl.cwl
├── main-arvados_testcwl-samples.json
├── steps
│   ├── batch_for_variantcall.cwl
│   ├── combine_sample_regions.cwl
│   ├── compare_to_rm.cwl
│   ├── concat_batch_variantcalls.cwl
│   ├── coverage_report.cwl
│   ├── get_parallel_regions.cwl
│   ├── merge_split_alignments.cwl
│   ├── multiqc_summary.cwl
│   ├── pipeline_summary.cwl
│   ├── postprocess_alignment.cwl
│   ├── postprocess_variants.cwl
│   ├── prep_align_inputs.cwl
│   ├── prep_samples.cwl
│   ├── process_alignment.cwl
│   └── variantcall_batch_region.cwl
├── wf-alignment.cwl
└── wf-variantcall.cwl

Page 82: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

"files": [{
  "class": "File",
  "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam",
  "secondaryFiles": [{
    "class": "File",
    "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam.bai"
  }]
}]

"reference__fasta__base": [{
  "class": "File",
  "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa",
  "secondaryFiles": [{
    "class": "File",
    "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa.fai"
  }]
}]

Manage provenance for all workflow inputs
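Since every input is an explicit File object, a provenance record can be derived mechanically from the samples file. The sketch below walks a copy of the `files` entry from the slide and lists each path, including nested `secondaryFiles`. The helpers `iter_file_paths` and `collection_ids` are hypothetical, and treating the text between `keep:` and the first `/` as a collection identifier is an assumption about the path layout.

```python
import json

# The "files" input from the slide, wrapped in braces so it parses as JSON.
INPUTS = json.loads("""
{
  "files": [{
    "class": "File",
    "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam",
    "secondaryFiles": [{
      "class": "File",
      "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam.bai"
    }]
  }]
}
""")


def iter_file_paths(node):
    """Yield the path of every File object, including nested secondaryFiles."""
    if isinstance(node, dict):
        if node.get("class") == "File" and "path" in node:
            yield node["path"]
        for value in node.values():
            yield from iter_file_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_file_paths(item)


def collection_ids(paths):
    """Unique identifiers found between 'keep:' and the first '/' of each path."""
    prefix = "keep:"
    return sorted(
        {p[len(prefix):].split("/", 1)[0] for p in paths if p.startswith(prefix)}
    )


paths = list(iter_file_paths(INPUTS))
ids = collection_ids(paths)
```

Recording the resulting identifiers alongside the workflow outputs ties each result back to the exact inputs that produced it.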

Page 83: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [bcbio_nextgen.py, runfn, process_alignment, cwl]
hints:
- class: ResourceRequirement
  coresMin: 16
  ramMin: 65536
  tmpdirMin: 100000
inputs:
- id: '#files'
  type:
    items: File
    type: array
- id: '#reference__fasta__base'
  type: File
outputs:
- id: '#align_bam'
  type: File

Required job infrastructure. Runner handles resource allocation.

Explicitly define inputs/outputs. Runner handles file management.
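To illustrate what "runner handles resource allocation" can mean in practice, here is a minimal sketch of a runner choosing a machine from the `ResourceRequirement` hint. The node catalog and the function name `pick_node` are hypothetical, invented for this example; real runners such as Arvados or Toil use their own logic:

```python
# Resource hints copied from the CommandLineTool on the slide (CWL ramMin is in MiB).
HINTS = {"coresMin": 16, "ramMin": 65536, "tmpdirMin": 100000}

# Hypothetical node catalog: (name, cores, ram_mib), ordered from smallest to largest.
NODE_TYPES = [
    ("node-small", 4, 16384),
    ("node-medium", 16, 65536),
    ("node-large", 32, 131072),
]


def pick_node(hints, catalog=NODE_TYPES):
    """Return the first (smallest) node type that meets the core and RAM minimums."""
    for name, cores, ram_mib in catalog:
        if cores >= hints["coresMin"] and ram_mib >= hints["ramMin"]:
            return name
    raise ValueError("no node type satisfies the requested resources")


chosen = pick_node(HINTS)
print(chosen)
```

Because the tool only declares minimums, the same CWL file can run unchanged on any platform whose runner can map those minimums to its own machine types.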

Page 84: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Community infrastructure development

Decouple biology and infrastructure

Enables interoperability

Improves integration with AWS

Better infrastructure: provenance and reproducibility

Validation: better, faster science

Page 85: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon Elastic Container Service (ECS)

Page 86: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

AMIs vs. Containers

[Diagram: two stacks side by side.
AMIs: App A, App B, and App C, each on its own Bins/Libs and Guest OS, above a Hypervisor and Server (Host OS).
Containers: App A, App B, App B, and App C on shared Bins/Libs, above a single Guest OS + Container Manager, Hypervisor, and Server (Host OS).]

Page 87: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Containers are natural for discrete applications

Docker provides tools to manage and deploy your applications

Lightweight container virtualization platform

Simple to model

Any app, any language

Container image can be tied to app version

Test & deploy same artifact

Stateless servers decrease change risk

Page 88: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Applications evolve from monolithic stacks ...

* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png

Page 89: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

... to discrete applications (microservices)

* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png

Page 90: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Scheduling one resource is straightforward

[Diagram: one Server with a Guest OS hosting App1 and App2, each with its own Bins/Libs]

Page 91: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Scheduling a cluster is hard

[Diagram: a large grid of Servers, each running its own Guest OS]

Page 92: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon Elastic Container Service (ECS)

Easily manage clusters for any scale

No services for you to run

Complete state

Control and monitoring

Page 93: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Cluster Management

Page 94: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon ECS

[Diagram: three Container Instances across AZ 1 and AZ 2, each running Docker and the ECS Agent; Amazon ECS provides the Cluster Management Engine and a Key/Value Store]

Page 95: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon ECS

[Diagram: three Container Instances across AZ 1 and AZ 2, each running Docker and the ECS Agent; Amazon ECS provides the Cluster Management Engine, a Key/Value Store, and the Agent Communication Service]

Page 96: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon ECS

[Diagram: three Container Instances across AZ 1 and AZ 2, each running Docker, the ECS Agent, and one or more Tasks made up of Containers; Amazon ECS provides the Cluster Management Engine, a Key/Value Store, and the Agent Communication Service]

Page 97: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

ECS Tasks

Unit of work

Grouping of related Containers

Run on Container Instances

Page 98: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Tasks are defined via Task Definitions

class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [samtools, index]
hints:
- class: ResourceRequirement
  coresMin: 2
  ramMin: 1024
  tmpdirMin: 100000
inputs:
- id: '#files'
  type:
    items: File
    type: array
- id: '#reference__fasta__base'
  type: File
outputs:
- id: '#align_bam'
  type: File

{
  "family": "samtoolsIndexPath1",
  "containerDefinitions": [{
    "name": "samtools-index",
    "image": "delagoya/samtools-index",
    "cpu": 2,
    "memory": 1024,
    "essential": true,
    "entryPoint": ["sh", "-c"],
    "command": ["samtools", "index"],
    "workingDirectory": "/mnt/scratch",
    "mountPoints": [{
      "containerPath": "/mnt/scratch",
      "sourceVolume": "scratch",
      "readOnly": null
    }]
  }],
  "volumes": [{
    "host": {
      "sourcePath": "/mnt/scratch"
    },
    "name": "scratch"
  }]
}

* Not a complete example
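The slide pairs a CWL tool with an ECS task definition by hand. A minimal sketch of deriving one from the other, assuming only the fields shown on the slide; the function name `cwl_to_container_definition` is hypothetical. One deliberate difference: the slide copies `coresMin: 2` straight into `"cpu": 2`, whereas ECS documents `cpu` in CPU units (1024 per core), so this sketch scales the value. That scaling is my reading of the intent, not something the talk states:

```python
import json

# The CWL tool from the slide, reduced to the fields used here.
CWL_TOOL = {
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "index"],
    "hints": [
        {"class": "ResourceRequirement", "coresMin": 2, "ramMin": 1024, "tmpdirMin": 100000}
    ],
}


def cwl_to_container_definition(name, image, cwl_tool):
    """Map a CWL tool's ResourceRequirement hint to ECS container definition fields."""
    hint = next(h for h in cwl_tool["hints"] if h["class"] == "ResourceRequirement")
    return {
        "name": name,
        "image": image,
        # Assumption: convert whole cores to ECS CPU units (1024 units per core).
        "cpu": hint["coresMin"] * 1024,
        # CWL ramMin and ECS "memory" are both expressed in MiB.
        "memory": hint["ramMin"],
        "essential": True,
        "command": list(cwl_tool["baseCommand"]),
    }


task_definition = {
    "family": "samtoolsIndexPath1",
    "containerDefinitions": [
        cwl_to_container_definition("samtools-index", "delagoya/samtools-index", CWL_TOOL)
    ],
}
task_json = json.dumps(task_definition, indent=2)
print(task_json)
```

Like the slide's version, this is not a complete task definition: volumes, mount points, and the working directory are left out.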

Page 99: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon ECS

[Diagram: three Container Instances across AZ 1 and AZ 2, each running Docker, the ECS Agent, and one or more Tasks made up of Containers; Amazon ECS provides the Cluster Management Engine, a Key/Value Store, and the Agent Communication Service; a User / Scheduler drives the cluster through the API]

Page 100: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon ECS

[Diagram: the same architecture as the previous slide, with Toil in the User / Scheduler role calling the API]

Page 101: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Architecture

ECS Scheduler Driver

EC2 Autoscale Group

cwltoil

ECS Task Definitions

https://github.com/awslabs/ecs-mesos-scheduler-driver

Page 102: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Set up an ECS Cluster with AutoScaling

Create LaunchConfiguration

Pick instance type depending on resource requirements, e.g. memory or CPU

Use latest Amazon Linux ECS-optimized AMI, other distros available

Create AutoScaling group and set to cluster initial size
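The steps above can be sketched as the two parameter sets they produce. This is an illustration under stated assumptions: the cluster name, AMI ID, instance type, instance profile, and subnet IDs are placeholders, and the helper names are hypothetical. The dictionary keys follow the parameters of the Auto Scaling `CreateLaunchConfiguration` and `CreateAutoScalingGroup` API calls; no AWS call is made, so the sketch runs without credentials:

```python
def ecs_user_data(cluster_name):
    """Boot script that points the instance's ECS agent at the named cluster."""
    return "#!/bin/bash\necho ECS_CLUSTER={} >> /etc/ecs/ecs.config\n".format(cluster_name)


def launch_configuration_params(cluster_name, ami_id, instance_type, instance_profile):
    """Parameters for a launch configuration using an ECS-optimized AMI."""
    return {
        "LaunchConfigurationName": cluster_name + "-lc",
        "ImageId": ami_id,                  # placeholder: latest ECS-optimized AMI ID
        "InstanceType": instance_type,      # chosen from memory or CPU requirements
        "IamInstanceProfile": instance_profile,
        "UserData": ecs_user_data(cluster_name),
    }


def auto_scaling_group_params(cluster_name, subnet_ids, initial_size, max_size):
    """Parameters for an Auto Scaling group started at the cluster's initial size."""
    return {
        "AutoScalingGroupName": cluster_name + "-asg",
        "LaunchConfigurationName": cluster_name + "-lc",
        "MinSize": 0,
        "MaxSize": max_size,
        "DesiredCapacity": initial_size,
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


# All values below are placeholders for illustration only.
lc = launch_configuration_params("genomics", "ami-EXAMPLE", "r3.4xlarge", "ecsInstanceRole")
asg = auto_scaling_group_params("genomics", ["subnet-EXAMPLE1", "subnet-EXAMPLE2"], 2, 20)
```

With an AWS SDK these dictionaries would be passed as keyword arguments to the corresponding client calls.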

Page 103: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

AutoScaling your Amazon ECS Cluster

Create a CloudWatch alarm on a metric, e.g. MemoryReservation

Configure scaling policies to increase and decrease the size of your cluster
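A minimal sketch of the alarm and the scaling decision it drives. The thresholds, cluster name, policy ARN placeholder, and helper names are assumptions made for illustration; the dictionary keys follow the CloudWatch `PutMetricAlarm` parameters, and `MemoryReservation` is reported in the `AWS/ECS` namespace per cluster:

```python
def memory_reservation_alarm(cluster_name, threshold_percent, scaling_policy_arn):
    """Parameters for an alarm that fires when cluster memory reservation runs high."""
    return {
        "AlarmName": cluster_name + "-memory-reservation-high",
        "Namespace": "AWS/ECS",
        "MetricName": "MemoryReservation",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [scaling_policy_arn],
    }


def next_cluster_size(current_size, reservation_percent,
                      scale_out_above=75.0, scale_in_below=25.0,
                      min_size=1, max_size=10):
    """Simple step policy: add one instance when busy, remove one when idle."""
    if reservation_percent > scale_out_above:
        return min(current_size + 1, max_size)
    if reservation_percent < scale_in_below:
        return max(current_size - 1, min_size)
    return current_size


# Placeholder values for illustration only.
alarm = memory_reservation_alarm("genomics", 75, "arn:aws:autoscaling:EXAMPLE")
print(next_cluster_size(3, 90.0))
```

In a real deployment the scaling policy attached to the Auto Scaling group performs the step that `next_cluster_size` models here; the function only makes the behavior explicit.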

Page 104: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Demo

Page 105: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Page 106: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

© 2016 COGNIZANT

Agile Cloud Native Development with AWS

11th April 2016

Maran Marudhamuthu

Chief Architect – Cloud Services

Page 107: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


CHANGE

Page 108: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


WHAT CHANGE

‣ A secular change in the architecture of distributed systems is under way.
‣ How distributed systems are built is changing.
‣ Architectures are evolving toward cloud native.
‣ The thick line between infrastructure and application is getting thinner:
  ‣ Serverless
  ‣ Forget infrastructure, build applications: IaaS and PaaS are gaining maturity

Page 109: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


CHANGE EXPLAINED: DEVOPS

‣ CULTURE: No more separate development and operations teams
‣ PROCESS: Continuous integration and delivery
‣ AUTOMATE: Infrastructure as code; automate from developer desktop to production
‣ TOOLS: New tools to automate: Chef/Puppet, AWS CloudFormation, Jenkins

Page 110: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


CHANGE EXPLAINED: MICROSERVICES AND API

‣ CLOUD NATIVE: 12-factor. For example:
  ‣ Isolate failure
  ‣ Facilitate blue-green deployment
  ‣ Scale independently
‣ CI/CD: Continuous integration and delivery become easier
‣ STANDARDS: Defined protocol for service definition and consumption
‣ FRAMEWORK & TOOLS: Chef/Puppet, AWS CloudFormation, Jenkins, Spring Boot, Netflix OSS

Page 111: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


CHANGE EXPLAINED: CONTAINERS

BUILD, SHIP, RUN ANY APP ANYWHERE

▸ BUILD: Package your app in a container (image)
▸ SHIP: Move that container from one machine to another
▸ RUN: Execute that container (i.e. the app)
▸ ANY APP: Anything that runs on Linux or Windows
▸ ANYWHERE: Bare metal, VM, cloud instance

Page 112: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


HOW AWS SERVICES HELP: DEVOPS

AWS Deployment and Management Tools
• AWS CodeDeploy
• AWS CodePipeline
• AWS CodeCommit

Automation
• AWS OpsWorks
• AWS CloudFormation
• AWS SDK, API, and CLI

Page 113: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


HOW AWS SERVICES HELP? – DEVOPS

Page 114: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


HOW AWS SERVICES HELP: MICROSERVICES & API – CLOUD NATIVE

AWS AppDev PaaS Hosting
• AWS Elastic Beanstalk
• AWS CodeDeploy
• AWS CodeCommit
• Amazon EMR
• AWS Data Pipeline

Build Fast – Forget Infra – Serverless
• AWS Lambda
• Amazon Kinesis
• AWS Data Pipeline
• AWS API, SDK, CLI
• AWS IoT

Page 115: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


HOW AWS SERVICES HELP? – MICROSERVICES – CLOUD NATIVE

Page 116: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


HOW AWS SERVICES HELP: CONTAINERS

‣ Amazon EC2, AMI
  ‣ docker run
‣ AWS Elastic Beanstalk
  ‣ Deploy and scale Docker applications
‣ Amazon EC2 Container Service (ECS)
  ‣ Launch and manage Docker containers
‣ Amazon EC2 Container Registry (ECR)
  ‣ A fully managed Docker container registry
  ‣ Integrated with Amazon EC2 Container Service (ECS)
  ‣ Simplifies your development-to-production workflow
‣ DevOps

Page 117: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


CONCLUSION

‣ Think beyond Infra/Forget Infra

‣ Use rather than build

‣ IaaS and PaaS/Application Services

Page 118: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


?

Page 119: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Thank You

Page 120: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016


Page 121: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Scalable Genomic Analysis on AWS with ADAM

Ujjwal Ratan
Healthcare and Life Sciences Solutions Architect, Amazon Web Services

Timothy Danford
Field Engineer, Tamr, Inc. & Software Engineer, University of California, Berkeley

Page 122: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

This Talk Will Cover

An Overview of Amazon Elastic MapReduce (EMR) and Spark

Genomics and ADAM deep dive

Video Demonstration

Q&A

Page 123: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon EMR Overview

Page 124: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon Elastic MapReduce (EMR)

Hadoop 1.x & 2.x / HDFS clusters

Easy to use; fully managed

Support for EC2 Spot Instances

S3, DynamoDB, Redshift & Kinesis integration

Page 125: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Many storage layers to choose from

[Diagram: Amazon EMR at the center, connected to each storage layer]
Amazon S3: EMR File System (EMRFS)
Amazon DynamoDB: EMR-DynamoDB connector
Amazon RDS: JDBC Data Source w/ Spark SQL
Amazon Kinesis: Streaming data connectors
Elasticsearch: Elasticsearch connector
Amazon Redshift: Amazon Redshift Copy From HDFS

Page 126: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Create a fully configured cluster in minutes

AWS Management Console

AWS Command Line Interface (CLI)

Or use an AWS SDK directly with the Amazon EMR API

Page 127: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Amazon EMR – Managed Spark

Easy to install and configure Spark

Secured

Spark submit or use Zeppelin UI

Quickly add and remove capacity

Hourly, reserved, or EC2 Spot pricing

Use S3 to decouple compute and storage

Page 128: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Single Spark Cluster on Amazon EMR

[Diagram: an Amazon EMR cluster with a Master node (Spark master, YARN) and four Core/Task nodes, each running a Spark worker, reading from and writing to Amazon S3 through EMRFS]

S3 Standard is designed for 11 9’s of durability and is designed for your data sources

S3 Reduced Redundancy Storage is designed for 4 9’s (99.99%) of durability and can be used to reduce costs on reproducible datasets
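To make the durability figures concrete, here is a small worked calculation. It assumes the designed durability can be read as an annual per-object loss probability, uses 11 nines for S3 Standard and the 99.99% (4 nines) figure AWS documents for Reduced Redundancy Storage, and picks an object count of ten million purely for illustration:

```python
from fractions import Fraction


def expected_annual_losses(object_count, nines):
    """Expected objects lost per year when durability is `nines` nines."""
    annual_loss_probability = Fraction(1, 10 ** nines)
    return object_count * annual_loss_probability


OBJECTS = 10 ** 7  # illustrative object count

standard = expected_annual_losses(OBJECTS, 11)  # S3 Standard
reduced = expected_annual_losses(OBJECTS, 4)    # Reduced Redundancy Storage

# Average number of years between single-object losses on S3 Standard.
years_per_loss_standard = 1 / standard

print(standard, reduced, years_per_loss_standard)
```

Exact fractions are used instead of floats so that values such as 10^-11 are not distorted by rounding. The gap of seven orders of magnitude is why the cheaper class only suits data that can be regenerated, such as intermediate pipeline outputs.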

Page 129: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Next-Generation Genomics Using Spark and ADAM

Timothy Danford

Tamr Inc., AMPLab

Page 130: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

What’s In The Box?

Page 131: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

A Variant-Calling Pipeline

Stages are written separately

Hand-off between steps is through files

Everyone has their own “flavor” of pipeline

Page 132: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Parallelization in the Cloud

Page 133: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Lingua Franca: File Formats

.bam files define a custom .bai index format

User-defined attributes

Typically in coordinate-sorted order

Page 134: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Where is “The Platform”?

(This is taken from the Picard library.)

Why are we managing file handles and spilling reads to disk inside our bioinformatics methods?

Page 135: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Things Fall Apart When Our Computation Changes

Page 136: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Bioinformaticians ❤️ Probabilistic Models

Many Bioinformatics Methods Are Just Large Sums
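As an illustration of why "large sums" matter for parallelization, here is a toy example that is not taken from the talk: a log-likelihood over reads is a sum of independent per-read terms, so it can be computed on separate partitions and the partial sums added together. The error model (a matching base has probability 1 - e, a mismatching base e/3) and all names are assumptions chosen for simplicity:

```python
import math


def log_likelihood(error_probs, matches):
    """Sum per-read log-probabilities of the observed bases under one hypothesis."""
    return sum(
        math.log(1.0 - e) if m else math.log(e / 3.0)
        for e, m in zip(error_probs, matches)
    )


def partitioned_log_likelihood(error_probs, matches, partitions):
    """Compute the same sum as independent partial sums, then combine them."""
    size = max(1, -(-len(error_probs) // partitions))  # ceiling division
    partials = [
        log_likelihood(error_probs[i:i + size], matches[i:i + size])
        for i in range(0, len(error_probs), size)
    ]
    return sum(partials)


# Illustrative data: per-read base error probabilities and match flags.
errors = [0.01, 0.001, 0.05, 0.01, 0.02, 0.001, 0.1]
flags = [True, True, False, True, True, True, False]

whole = log_likelihood(errors, flags)
split = partitioned_log_likelihood(errors, flags, partitions=3)
print(whole, split)
```

Each partial sum depends only on its own slice of reads, which is the property a framework such as Spark relies on to distribute the work across workers.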

Page 137: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

The Challenge: Existing Code!

Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)

Page 138: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Can You Spot the File Format Assumption?

A single piece of a filtering stage for the MuTect somatic variant caller.

Can you spot the place where they assume they are working with BAM files?

Page 139: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats

Page 140: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

“The Platform” Defines Core Genomics Primitives

Page 141: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Stop Defining File Formats By Hand

• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats¹
  • 20-22% for BAM
  • ~95% for VCF

¹ Compression % quoted from 1K Genomes samples

Page 142: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Demo

Installation of ADAM on EMR

Page 143: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Analyze Parquet Files using Spark

Use Scala to query the genome data in the ADAM Parquet files

……

val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")

gnomeDF.printSchema()

gnomeDF.registerTempTable("gnome")

val gnome_data = sqlContext.sql("select count(*) from gnome")

gnome_data.show()

…….

Page 144: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

For more details, see our blog post

Title: Will Spark Power the Data behind Precision Medicine

Authors: Christopher Crosbie, Ujjwal Ratan

URL: http://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine

Page 145: 2016 AWS Life Sciences Days | Boston, MA – May 17, 2016

Thank You

Any Questions?