AWS Life Sciences Days, Boston, MA
Mark Johnston, Director of Global Business Development, Healthcare and Life Sciences
Agenda
1. 01:00 PM – 01:30 PM: Introduction and Opening Remarks
2. 01:30 PM – 02:30 PM: Best practices when building a validated system on AWS for the Life Sciences
   02:30 PM – 02:45 PM: Break
3. 02:45 PM – 03:30 PM: Repeatable Science at Scale: Using Common Workflow Language and Docker for science on AWS
4. 03:30 PM – 04:15 PM: Cognizant: Agile Cloud Native Development with AWS
5. 04:15 PM – 05:00 PM: Scalable Genomics Analysis in the Cloud with ADAM
6. 05:00 PM – 06:30 PM: Closing Remarks, Q&A and Networking
COMPETING ON KNOWLEDGE AND INSIGHTS: AN ACADEMIC MEDICAL CENTER’S PERSPECTIVE
PARTNERS HEALTHCARE, MAY 17, 2016
Trung Do, Executive Director, Business Development – Innovation, Partners HealthCare
PARTNERS HEALTHCARE
Operating Revenue: $9.0 Billion
Research Revenue: $1.5+ Billion
Inpatient Discharges: 166,700
Licensed Beds: 4,000
Lives Under Management: 750,000
Physicians: 6,500
Employees (FTEs): 68,000
Clinical Trials: 1,200
Clinical & Research Fellows and Residents: 4,300
Faculty Appointed at Harvard Medical School
Brigham and Women’s Hospital: Founded 1832; Ranked #6 in US
Massachusetts General Hospital: Founded 1811; Ranked #1 in US
Copyright 2016 – Partners HealthCare Incorporated – All Rights Reserved
THE INNOVATION CHALLENGE
INNOVATING AT SCALE
INNOVATION AT PARTNERS HEALTHCARE
• Innovation is a core Partners activity that supports and sustains our ability to constantly improve the care we deliver to our patients to enrich their lives and well-being
• The value of an integrated health system, like Partners HealthCare, is its ability to innovate at scale
CHALLENGES TO INNOVATING AT SCALE
• Many small advances occur continuously throughout leading academic medical centers
• Difficult to ensure these innovations are broadly adopted by clinicians and accessible to all patients
• There are many impediments
  • Time
  • Organizational structures
  • Physical distance
Communication is challenging
ERA OF DATA SCIENCE
[Figure: competitive advantage (vertical axis) versus sophistication of intelligence (horizontal axis)]
Descriptive Analytics
• Standard reports: “What happened?”
• Ad hoc reports: “How many, how often, where?”
• Query/drill down: “What exactly is the problem?”
• Alerts: “What actions are needed?”
Predictive Analytics
• Statistical analysis: “Why is this happening?”
• Forecasting / extrapolation: “What if these trends continue?”
• Predictive modeling: “What will happen next?”
• Optimization: “What is the best that can happen?”
Source: Competing on Analytics: The New Science of Winning (Davenport / Harris)
NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS
[Figure: competitive advantage (vertical axis) versus sophistication of intelligence (horizontal axis)]
Clinical
• Standard reports: EHR ecosystem
• Ad hoc reports: Quality Data Warehouse
• Query/drill down: Quality Data Warehouse
• Alerts: CDS built on EHR
Research
• Statistical analysis: Observational studies, clinical trials
Massive bifurcation causes significant inefficiencies
Opportunity! Establish continuous learning environment based on predictive and population-based analytics
NEW MODEL: SYSTEMS HEALTH CARE AS AN INFORMATION BUSINESS
[Figure: competitive advantage (vertical axis) versus sophistication of intelligence (horizontal axis)]
Continuous Learning Healthcare System
• Standard reports: EHR ecosystem
• Ad hoc reports: Quality Data Warehouse
• Query/drill down: Quality Data Warehouse
• Alerts: CDS built on EHR
• Statistical analysis: Observational studies, clinical trials
• Forecasting / extrapolation: Trends, pharmacovigilance surveillance
• Predictive modeling: Machine learning CDS
• Optimization: Patient-specific predictive CDS
USING BIG DATA TO IMPROVE HEALTHCARE
Public Data Sets, Database Schemas, ML, Cluster Analysis, Image Data
• To Help Predict Outcomes and Support Medical Decisions
• To Learn About What Differences in People Are Important for Predicting Disease
• To Understand the Disease Traits Caused by a Gene Variant
• To Help Interpret Features in Medical Images and Tissues
BIG DATA SOURCES…
• EHR / Hospital Data
– Research Patient Data Registry (RPDR)
• 7 million patients
• 2 billion diagnoses, medications, laboratories and clinical findings
• 13 billion images
• Partners Biobank: 40,000 consented patient samples linked to EMR (target 100,000 by 2018)
Department of
BIG DATA COMMONS: Creating an Enterprise-Wide Query and Analysis System for Big Data
Big Data Commons integrates disparate islands of patient data (clinical and research data) onto a common Big Data Platform
• Enhanced Query Tool
• Partners Biobank Portal: Genomics Data, Samples
• Public Health Data
• Imaging Data
• Notes / Text Repository (e.g. physician notes)
UNPREDICTABLE QUALITY USING RAW ICD9/10 CODES
Phenotype | Count with ICD-9/ICD-10 Code | Count (90% positive predictive value) | Count with Genotype Data
Asthma | 7618 | 3322 | 805
Bipolar Disorder | 1754 | 219 | 84
Breast Cancer | 2101 | 1711 | 378
Congestive Heart Failure | 10160 | 4597 | 1859
Coronary Artery Disease | 1435 | 803 | 236
Crohn’s Disease | 5177 | 700 | 350
Depression | 11154 | 4273 | 1074
Epilepsy | 2351 | 1211 | 381
Gout | 2464 | 1828 | 566
Hypertension | 20788 | 16995 | 4553
Multiple Sclerosis | 602 | 320 | 58
Obesity | 10245 | 12179 | 3191
Rheumatoid Arthritis | 3475 | 878 | 261
Schizophrenia | 509 | 83 | 14
Type 1 Diabetes | 2196 | 232 | 61
Type 2 Diabetes | 7123 | 4385 | 1268
Ulcerative Colitis | 1359 | 624 | 157
May 4, 2016, n ~ 40,000
CREATING QUALITY DATA WITH SUPERVISED MACHINE LEARNING
95% Specificity
1. Create a gold standard training set.
2. Create a comprehensive list of features (concepts/variables) that describe the phenotype of interest.
3. Develop the classification algorithm. Using the data analysis file and the training set from step 1, assess the frequency of each variable. Remove variables with low prevalence. Apply adaptive LASSO penalized logistic regression to identify highly predictive variables for the algorithm.
4. Apply the algorithm to all subjects in the superset and assign each subject a probability of having the phenotype.
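Steps 3 and 4 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used to produce the counts in the deck. It swaps in a plain L1 (LASSO) penalty fitted by proximal gradient descent in place of the adaptive LASSO named on the slide. The training data, the 5% prevalence cut-off and all function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prevalent_features(rows, min_prevalence):
    """Step 3a: keep features present in at least min_prevalence of subjects."""
    n = len(rows)
    return [j for j in range(len(rows[0]))
            if sum(1 for r in rows if r[j] != 0) / n >= min_prevalence]

def fit_l1_logistic(rows, labels, lam=0.05, lr=0.5, epochs=500):
    """Step 3b: L1-penalized logistic regression via proximal gradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        b -= lr * grad_b
        for j in range(d):
            step = w[j] - lr * grad_w[j]
            # Soft-thresholding shrinks weak coefficients to exactly zero
            w[j] = math.copysign(max(abs(step) - lr * lam, 0.0), step)
    return w, b

def probability(w, b, x):
    """Step 4: probability that a subject has the phenotype."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical gold-standard training set (step 1): 40 reviewed subjects.
# Feature 0: diagnosis code present (informative)
# Feature 1: unrelated indicator (noise)
# Feature 2: rare indicator, present in one subject only
rows, labels = [], []
for i in range(40):
    has_code = 1 if i < 20 else 0
    rows.append([has_code, i % 2, 1 if i == 0 else 0])
    # Labels mostly follow the code, with two disagreements in each group
    labels.append(1 if (has_code and i > 1) or i in (20, 21) else 0)

keep = prevalent_features(rows, min_prevalence=0.05)
train = [[r[j] for j in keep] for r in rows]
w, b = fit_l1_logistic(train, labels)

p_with_code = probability(w, b, [1, 0])
p_without_code = probability(w, b, [0, 0])
print(keep, [round(v, 2) for v in w], round(p_with_code, 2), round(p_without_code, 2))
```

In practice the probability from step 4 would be thresholded to reach a target such as the 95% specificity quoted on the slide; that calibration step is omitted here.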
PREDICTABLE QUALITY - COMPUTED PHENOTYPES
DEFINITIVE DISEASE STATES ASSOCIATED WITH GENOMIC DATA
CLINICAL APPLICATIONS: LEVERAGING MACHINE LEARNING IN HEALTHCARE
CLARIFAI.COM
CLINICAL DATA SCIENCE: SYMPTOMATIC DATA DISCOVERY
CLINICAL DIAGNOSTIC SERVICES
ASYMPTOMATIC SCREENING
- BREAST CANCER
- COLON CANCER
- LUNG CANCER
SYMPTOMATIC DATA
- CLINICAL DATA
- GENETIC DATA
- CONSUMER DATA
DIAGNOSTIC CLINICAL DECISION SUPPORT
Symptomatic Data Discovery: Machine Learning
Can we find disease before symptoms appear clinically?
MGH CLINICAL DATA SCIENCE CENTER
CLINICAL APPLICATIONS
• SPECIALTY: ONCOLOGY, NEUROSCIENCE, CARDIAC
• DIAGNOSTICS: RADIOLOGY, PATHOLOGY, CARDIOLOGY
• THERAPEUTICS: PHARMACOLOGIC, PROCEDURAL, SURGERY
• POPULATION HEALTH: DISEASE MANAGEMENT, WELLNESS PLANNING, PREVENTION
CLINICAL DATA
• IMAGING: MRI, CT, PET, US, XRAY, NUC
• LABORATORY: BIOCHEMISTRY, MICROBIOLOGY, IMMUNOLOGY, HEMATOLOGY
• GENETICS: GENOME, ARRAYS, PROBES
• EHR: CLINICAL, OTHER, ALL
• PERSONAL: HOME MONITOR, WEARABLES, DIRECT TO CONSUMER
MGH CLINICAL DATA SCIENCE
INITIAL FOCUS: DIAGNOSTIC RADIOLOGY APPLICATIONS
CLINICAL APPLICATIONS
• DIAGNOSTICS: RADIOLOGY
CLINICAL DATA (MGHCADS DATA)
• IMAGING: MRI, CT, PET, US, XRAY, NUC
• LABORATORY: BIOCHEMISTRY, MICROBIOLOGY, IMMUNOLOGY, HEMATOLOGY
• GENETICS: GENOME, ARRAYS, PROBES
• EHR: CLINICAL, OTHER, ALL
• PERSONAL: HOME MONITOR, WEARABLES, DIRECT TO CONSUMER
MGH CLINICAL DATA SCIENCE CENTER: APPLYING NEURAL NETWORKS
EHR, Interpretation, Decision Support Systems
MGH CLINICAL DATA SCIENCE CENTER: MACHINE LEARNING RESOURCES
EHR, Interpretation, Decision Support Systems
• Formulate the correct clinical questions
• Acquire (retrieve) the necessary volume of clinical data
• Accurately label the clinical data such that it answers the clinical questions
• Apply the appropriate Data Science techniques (training and validation)
• Iterate until the desired (or best) accuracy results
• Clinically validate the resulting ‘machine learning appliance’
• Implement into clinical practice
• License/Commercialize
Best practices when building a validated system
on AWS for the Life Sciences
Chris McCurdy, Healthcare and Life Sciences Specialist, AWS
Matt Szenher, Principal Architect at Medidata
Agenda
• DevOps to DevSecOps Primer
• Observed industry cloud techniques with AWS
• Tools, processes and frameworks to assist
• Medidata Audit Ingestion
DevOps Level Set
Development
Quality Assurance
Operations
DevOps
DevOps Toolchain
• Plan: Define and plan; business value, application requirements and metrics
• Create: Building, coding and configuration
• Verify: Ensuring quality; acceptance, regression testing
• Preprod: Approval/certification, triggered releases, release staging and holding
• Release: Release coordination, promotion, scheduling, rollback and recovery
• Configure: Infrastructure and application
• Monitor: Process, application and infrastructure
DevOps Principles
• Collaborate with all stakeholders
• Codify everything
• Test everything
• Automate everything
• Measure and monitor everything
• Deliver business value with continual feedback
Manual Hacking
Drivers for DevSecOps
Embedding Security into DevOps was not successful because…
• Compliance checklists didn’t take us far before we stopped scaling…
• We couldn’t keep up with deployments without automation…
• Standard Security Operations did not work…
• And we needed far more data than we expected to help the business make decisions…
DevSecOps: Security as Code
Establishing these principles…
• Customer focused mindset
• Scale, scale, scale
• Objective criteria
• Proactive hunting
• Continuous detection and response
DevOps Toolchain
• Plan: Define and plan; business value, application requirements, security, compliance and metrics
• Create: Build, code and configuration
• Verify: Ensuring quality; acceptance, regression, security and compliance testing
• Preprod: Approval/certification, triggered releases, release staging and holding
• Release: Release coordination, promotion, scheduling, rollback and recovery
• Configure: Infrastructure and application
• Monitor: Process, application, infrastructure, security and compliance
Here’s some Infrastructure as Code (AWS CloudFormation template)
"myVPC": {
  "Type": "AWS::EC2::VPC",
  "Properties": {
    "CidrBlock": {"Ref": "myVPCCIDRRange"},
    "EnableDnsSupport": false,
    "EnableDnsHostnames": false,
    "InstanceTenancy": "default"
  }
},
"myInstance": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": {
      "Fn::FindInMap": ["AWSRegionToAMI", {"Ref": "AWS::Region"}, "64"]
    },
    "SecurityGroupIds": [{"Fn::GetAtt": ["myVPC", "DefaultSecurityGroup"]}],
    "SubnetId": {"Ref": "mySubnet"}
  }
}
Here’s some Security as Code (AWS IAM)
{
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
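To make the effect of the two Deny statements concrete, here is a small Python sketch that rebuilds the policy and evaluates it against a few request contexts. It is a toy evaluator written for this example, not the IAM policy engine: it handles only the `StringNotEquals` and `Null` operators, and the helper names (`build_policy`, `denied_by`) are hypothetical.

```python
SSE_KEY = "s3:x-amz-server-side-encryption"

def build_policy(bucket):
    """Bucket policy with the two Deny statements from the slide."""
    base = {
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }
    return {
        "Statement": [
            {"Sid": "DenyIncorrectEncryptionHeader", **base,
             "Condition": {"StringNotEquals": {SSE_KEY: "AES256"}}},
            {"Sid": "DenyUnEncryptedObjectUploads", **base,
             "Condition": {"Null": {SSE_KEY: "true"}}},
        ]
    }

def condition_matches(operator, key, expected, context):
    """Toy evaluation of the two condition operators used above."""
    if operator == "StringNotEquals":
        # In this sketch a missing key also counts as "not equal"
        return context.get(key) != expected
    if operator == "Null":
        # "true" means: match when the key is absent from the request
        return (key not in context) == (expected == "true")
    raise ValueError(f"unsupported operator: {operator}")

def denied_by(policy, context):
    """Return the Sids of Deny statements whose conditions all match."""
    hits = []
    for stmt in policy["Statement"]:
        matched = all(
            condition_matches(op, key, expected, context)
            for op, pairs in stmt["Condition"].items()
            for key, expected in pairs.items()
        )
        if stmt["Effect"] == "Deny" and matched:
            hits.append(stmt["Sid"])
    return hits

policy = build_policy("YourBucket")
print(denied_by(policy, {}))                    # upload with no encryption header
print(denied_by(policy, {SSE_KEY: "aws:kms"}))  # header present, wrong value
print(denied_by(policy, {SSE_KEY: "AES256"}))   # header present, expected value
```

Only the last request passes both statements, which is the behaviour the policy is meant to enforce: every `PutObject` must ask for AES256 server-side encryption.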
Cloud Era
Observed industry cloud techniques with AWS
AWS as components
AWS makes commercial cloud infrastructure software products and office productivity applications that are user-configurable, general purpose in nature, and delivered to commercial IT standards like ISO, NIST, SOC and others. This is similar to other general purpose IT products and services such as database engines, operating systems, programming languages, internet service providers, etc. Many organizations categorize AWS products as commercial-off-the-shelf (COTS) infrastructure software products, which is consistent with the US federal government’s use of AWS Products as a COTS item through a federal procurement program called FedRAMP.
“Using AWS in GxP Systems”, AWS Whitepaper
http://icon-park.com/icon/light-orange-lego-brick-vector-data-for-free/
A Word on Security
Security in the cloud (Customers)
• Customer content
• Platform, Applications, Identity & Access Management
• Operating System, Network & Firewall
• Client-side encryption implementation, Server-side encryption, Network Traffic Protection
Security of the cloud (AWS)
• AWS Foundation Services: Compute, Storage, Database, Networking
• AWS Global Infrastructure: Regions, Availability Zones, Edge Locations
Consult internally before implementing
The following slides are practices we have seen used in industry. Because security and industry compliance are determined by the customer, before implementing please:
• Consult with your internal best practices
• Consult with your Cloud Center of Excellence
• Consult with your Information Security group
• Consult with your Compliance organization
• Do your due diligence
General Strategies
AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline
Consult with compliance and security organizations before implementing
• Decouple protected/sensitive data from the processing or orchestration
• Track where your protected/sensitive data flows
• Do not check the protected data into your source or artifact repository!
• Use indirection when orchestrating your protected/sensitive data flow
• Separate protected/sensitive and general workflow logical boundaries
Separate Virtual Private Cloud (VPC) Strategy
Protected/Sensitive (P/S) Data VPC: Amazon EC2, Amazon EMR, Amazon S3
General VPC: Amazon EC2, AWS Directory Service, AWS Device Farm
Indirection Strategy
[Diagram: P/S Data, HTTPS Send, Inbound Data Store (S3), SNS, SQS, Claims, Data Processing System]
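One way to read the indirection strategy is as a claim-check pattern: the protected payload goes into the data store, and only a reference to it travels through the messaging layer. The Python sketch below illustrates that reading with in-memory stand-ins. The dictionary and deque are hypothetical substitutes for S3 and SQS, and the key layout and message fields are assumptions made for this example.

```python
import uuid
from collections import deque

object_store = {}        # stand-in for the inbound data store (S3)
message_queue = deque()  # stand-in for the notification queue (SQS)

def receive_upload(payload: bytes) -> str:
    """Store the protected payload and enqueue only a pointer to it."""
    key = f"inbound/{uuid.uuid4()}"
    object_store[key] = payload
    # The message carries the claim (the key), never the payload itself
    message_queue.append({"event": "new-object", "key": key})
    return key

def process_next():
    """Consumer side: redeem the claim to fetch the payload from the store."""
    message = message_queue.popleft()
    payload = object_store[message["key"]]
    return message["key"], len(payload)

key = receive_upload(b"patient-id=123;hba1c=6.9")
queued = list(message_queue)  # snapshot of what travels through the queue
processed_key, size = process_next()
print(queued[0], processed_key == key, size)
```

Because the queue holds references only, access to the protected data is governed in one place, the store, rather than in every component that handles messages.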
Example: Analytics Workflow
[Diagram: numbered steps 1 to 9 connecting Medical Data, Inbound Data Store (S3), Inbound Archive (Glacier), New Object Message, SQS, AWS Lambda, Insight System (EMR), Data Lake (S3), Columnar Query Store (Redshift), Amazon SES, P/S Insights and General Insights]
Compliance Example Workflow (using DevSecOps)
[Diagram: numbered steps 1 to 12. Labelled steps: Security / Compliance Admin, (1) Define CloudFormation template, (2) Publish to AWS Service Catalog, (3) Git push, (4) Developers browse and launch, (5) Provisions CloudFormation stack, (8) Monitors, (10) Initiates, (11) Monitors, (12) Notifies. Services shown: AWS CodeCommit, AWS Config (track changes), AWS CloudTrail (logs all API calls), Amazon S3, AWS CloudWatch alarm]
Audit Ingestion
MAudit: Audit Ingestion and DevSecOps
Matthew Szenher | [email protected]
Principal Architect - Medidata Core Web Services
© 2016 Medidata Solutions, Inc. – Proprietary and Confidential
About Medidata
SaaS Platform for clinical development, analytics and benchmarking in life sciences
Started in 1999
Over 9,000 trials in more than 130 countries
Serve CROs and contracting partners
We’re hiring: https://www.mdsol.com/en/careers
What are Audits?
A record of actions that create, modify or delete clinically relevant data.
Crucial for asserting confidentiality, integrity and authenticity of this data.
I’ll talk about how auditing is difficult, and how AWS makes DevSecOps for auditing solutions a lot easier...
Audits are a MUST
MUST be captured transactionally with patient data points (as well as other clinically relevant data)
MUST be persisted
MUST be immutable
MUST be consistent
MUST be secure
SHOULD be cheap
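As a small illustration of the "immutable" and "consistent" requirements, the sketch below models an audit record as a frozen Python dataclass that is validated at construction time. It is not MAudit's schema: the field names, the allowed actions and the validation rules are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"create", "modify", "delete"}

@dataclass(frozen=True)
class AuditRecord:
    who: str        # actor that performed the action
    action: str     # one of ALLOWED_ACTIONS
    what: str       # identifier of the clinically relevant data
    when: datetime  # timezone-aware timestamp of the action

    def __post_init__(self):
        # Reject malformed audits before they are persisted
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action}")
        if not self.who or not self.what:
            raise ValueError("who and what are required")
        if self.when.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")

record = AuditRecord(
    who="user:42",
    action="modify",
    what="datapoint:7",
    when=datetime(2016, 5, 17, 13, 30, tzinfo=timezone.utc),
)

# Frozen dataclasses refuse attribute assignment after construction
try:
    record.action = "delete"
    mutated = True
except FrozenInstanceError:
    mutated = False
print(record, mutated)
```

In-process immutability is only one layer; durable, tamper-resistant persistence still has to come from the storage and access-control design.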
Audits are VOLUMINOUS
Medidata persists eight billion clinical records from more than two million patients across more than 9,000 studies
More than half a million patient data points are added daily
Regulations require capturing audits transactionally with these data points (as well as other clinically relevant data)
… ~600 audits per second
And growing!
And Growing
GADGET trial with GlaxoSmithKline
Patients wore Vital Connect Health Patch (http://www.vitalconnect.com/)
ECG, skin temp., etc.
1 week
~350 GB of audit data
~300 million data points (and their audits)
More data than many years-long trials collect over their lifetimes
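A quick back-of-the-envelope calculation shows what the GADGET figures imply. It assumes decimal gigabytes and a uniform ingest rate over the week, neither of which is stated on the slide.

```python
audit_bytes = 350e9          # ~350 GB of audit data (decimal GB assumed)
data_points = 300e6          # ~300 million data points
seconds = 7 * 24 * 60 * 60   # one week

bytes_per_point = audit_bytes / data_points
points_per_second = data_points / seconds

print(f"{bytes_per_point:.0f} bytes of audit data per data point")  # 1167
print(f"{points_per_second:.0f} data points per second")            # 496
```

Under those assumptions, this single one-week trial produced data points at a rate of the same order as the ~600 audits per second quoted on the previous slide.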
Solution: MAudit
Scalable
Centralized
Durable
Highly Available
Secure
Audit ingestion and validation service….
Built on AWS infrastructure
[Diagram: Audit Producers, MAudit Servers (EC2)]
MAudit and DevSecOps
S3: Programmatically defined persistence, with security and infinite scaling
Autoscaling Groups: Codified app server scaling
Kinesis: Codified, scalable streaming of data
IAM: Programmatically defined access controls
CloudFormation: Specifying all of the above in code
Thank You
Containers for Science!?!
Reproducible science at scale on AWS using Common Workflow Language, Docker, and Amazon Elastic Container Service
Angel Pizarro, AWS Scientific Computing
Brad Chapman, Harvard Medical School
May 17, 2016
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
Review Common Workflow Language and bcbio
Intro to Docker and Amazon Elastic Container Service
CWL+Docker workflow on top of ECS
Common Workflow Language
We need faster, better science
https://twitter.com/KMS_Meltzy/status/661206070308794368
Large Scale Infrastructure Development
Shared problems: Academia, Industry, Startups
● Workflow implementations
● Validation
● Scaling
● Support
Blue Collar Bioinformatics (bcbio)
https://github.com/chapmanb/bcbio-nextgen
Uses
Aligners: bwa, novoalign, bowtie2, HiSat2
Variation: FreeBayes, GATK, VarDict, MuTecT, Scalpel,
SnpEff, VEP, GEMINI, Lumpy, Manta, CNVkit, WHAM
RNA-seq: Tophat, STAR, Sailfish, Kallisto
Quality control: MultiQC, fastqc, Qualimap
Manipulation: bedtools, bcftools, biobambam, sambamba,
samblaster, samtools, vcflib, vt
Provides
http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html
Validation: low frequency cancer variants
http://bcb.io/2016/04/04/vardict-filtering/
Why community developed workflows?
Problem: complexity of parallelization reduces tool reuse
● Ties biology to job running framework
● Results in reimplemented pipelines
● Each requires AWS integration and scaling
Why community developed workflows?
Solution: Common framework for describing tools and
workflows
● Shared language for common concepts
● Increased re-use of tool definitions
● Mix and match workflow components
Common Workflow Language implementations
● Arvados: https://arvados.org/
● Toil from BD2K Center for Translational Genomics: http://toil.readthedocs.io
● Planemo: https://github.com/galaxyproject/planemo
● Seven Bridges: https://sbgenomics.com
bcbio + Docker
http://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html
● Single container with many biological tools
● https://hub.docker.com/r/bcbio/bcbio/
● Same workflow management as non-Docker
● Run with any platform supporting CWL
bcbio + Common Workflow Language
http://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html
● Generates CWL from sample description
● Data management integrates with platform
○ Arvados: Keep containers
○ Toil: AWS buckets
arvados_testcwl-workflow/
├── main-arvados_testcwl.cwl
├── main-arvados_testcwl-samples.json
├── steps
│   ├── batch_for_variantcall.cwl
│   ├── combine_sample_regions.cwl
│   ├── compare_to_rm.cwl
│   ├── concat_batch_variantcalls.cwl
│   ├── coverage_report.cwl
│   ├── get_parallel_regions.cwl
│   ├── merge_split_alignments.cwl
│   ├── multiqc_summary.cwl
│   ├── pipeline_summary.cwl
│   ├── postprocess_alignment.cwl
│   ├── postprocess_variants.cwl
│   ├── prep_align_inputs.cwl
│   ├── prep_samples.cwl
│   ├── process_alignment.cwl
│   └── variantcall_batch_region.cwl
├── wf-alignment.cwl
└── wf-variantcall.cwl
Manage provenance for all workflow inputs
"files": [
  {
    "class": "File",
    "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam",
    "secondaryFiles": [
      {
        "class": "File",
        "path": "keep:a1d976bc7bcba2b523713fa67695d715+464/7_100326_FC6107FAAXX.bam.bai"
      }
    ]
  }
]
"reference__fasta__base": [
  {
    "class": "File",
    "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa",
    "secondaryFiles": [
      {
        "class": "File",
        "path": "keep:a84e575534ef1aa756edf1bfb4cad8ae+1927/hg19/seq/hg19.fa.fai"
      }
    ]
  }
]
class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [bcbio_nextgen.py, runfn, process_alignment, cwl]
hints:
- class: ResourceRequirement
  coresMin: 16
  ramMin: 65536
  tmpdirMin: 100000
inputs:
- id: '#files'
  type:
    items: File
    type: array
- id: '#reference__fasta__base'
  type: File
outputs:
- id: '#align_bam'
  type: File
Required job infrastructure. Runner handles resource allocation.
Explicitly define inputs/outputs. Runner handles file management.
Community infrastructure development
Decouple biology and infrastructure
Enables interoperability
Improves integration with AWS
Better infrastructure: provenance and reproducibility
Validation: better, faster science
Amazon Elastic Container
Service (ECS)
AMIs vs. Containers
AMIs (top to bottom):
App A | App B | App C
Bins/Libs | Bins/Libs | Bins/Libs
Guest OS | Guest OS | Guest OS
Hypervisor
Server (Host OS)
Containers (top to bottom):
App A | App B | App B | App C
Bins/Libs | Bins/Libs
Guest OS + Container Manager
Hypervisor
Server (Host OS)
Containers are natural for discrete applications
Docker provides tools to manage and deploy your applications
Lightweight container virtualization platform
Simple to model
Any app, any language
Container image can be tied to app version
Test & deploy same artifact
Stateless servers decrease change risk
Applications evolve from monolithic stacks ...
* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png
... to discrete applications (microservices)
* Image courtesy of The Broad Institute - https://www.broadinstitute.org/gatk/img/BP_workflow.png
Scheduling one resource is straightforward
[Diagram: one server with a guest OS running App1 and App2, each with its own bins/libs]
Scheduling a cluster is hard
[Diagram: a grid of many servers, each with a guest OS]
Amazon Elastic Container Service (ECS)
Easily manage clusters for any scale
No services for you to run
Complete state
Control and monitoring
Cluster Management
[Diagram: Amazon ECS. Container instances in AZ 1 and AZ 2 each run Docker and the ECS Agent and host tasks made up of containers. Also shown: Agent Communication Service, Amazon ECS Cluster Management Engine, Key/Value Store]
ECS Tasks
Unit of work
Grouping of related Containers
Run on Container Instances
Tasks are defined via Task Definitions
class: CommandLineTool
cwlVersion: cwl:draft-3
baseCommand: [samtools, index]
hints:
- class: ResourceRequirement
  coresMin: 2
  ramMin: 1024
  tmpdirMin: 100000
inputs:
- id: '#files'
  type:
    items: File
    type: array
- id: '#reference__fasta__base'
  type: File
outputs:
- id: '#align_bam'
  type: File
{
  "family": "samtoolsIndexPath1",
  "containerDefinitions": [
    {
      "name": "samtools-index",
      "image": "delagoya/samtools-index",
      "cpu": 2,
      "memory": 1024,
      "essential": true,
      "entryPoint": ["sh", "-c"],
      "command": ["samtools", "index"],
      "workingDirectory": "/mnt/scratch",
      "mountPoints": [
        {
          "containerPath": "/mnt/scratch",
          "sourceVolume": "scratch",
          "readOnly": null
        }
      ]
    }
  ],
  "volumes": [
    {
      "host": {"sourcePath": "/mnt/scratch"},
      "name": "scratch"
    }
  ]
}
* Not a complete example
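The correspondence between a CWL tool description and an ECS task definition can be sketched as a small translation function. This is an illustration only, not how Toil or any CWL runner performs the mapping. The function name and default values are hypothetical. One deliberate difference from the slide: ECS expresses `cpu` in CPU units (1024 per vCPU), so this sketch scales `coresMin` instead of copying it; verify against the ECS documentation before relying on it.

```python
import json

CPU_UNITS_PER_CORE = 1024  # ECS CPU units per vCPU

def task_definition_from_cwl(tool, family, image):
    """Build a minimal ECS-style task definition from a CWL tool dict."""
    # Find the ResourceRequirement hint, if the tool declares one
    requirement = next(
        (h for h in tool.get("hints", []) if h.get("class") == "ResourceRequirement"),
        {},
    )
    container = {
        "name": family,
        "image": image,
        # Defaults of 1 core and 512 MiB are assumptions for this sketch
        "cpu": requirement.get("coresMin", 1) * CPU_UNITS_PER_CORE,
        "memory": requirement.get("ramMin", 512),  # both sides use MiB
        "essential": True,
        "command": list(tool["baseCommand"]),
    }
    return {"family": family, "containerDefinitions": [container]}

cwl_tool = {
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "index"],
    "hints": [
        {"class": "ResourceRequirement", "coresMin": 2, "ramMin": 1024, "tmpdirMin": 100000}
    ],
}

task_def = task_definition_from_cwl(cwl_tool, "samtools-index", "delagoya/samtools-index")
print(json.dumps(task_def, indent=2))
```

Volumes, mount points and `tmpdirMin` are left out, as in the slide's own incomplete example.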
[Diagram: the same Amazon ECS cluster, now with an API on the Cluster Management Engine, called by a User / Scheduler such as Toil]
Architecture
cwltoil
ECS Scheduler Driver
ECS Task Definitions
EC2 Autoscale Group
https://github.com/awslabs/ecs-mesos-scheduler-driver
Setup ECS Cluster with AutoScaling
Create LaunchConfiguration
Pick instance type depending on resource requirements, e.g. memory or CPU
Use latest Amazon Linux ECS-optimized AMI, other distros available
Create AutoScaling group and set to cluster initial size
AutoScaling your Amazon ECS Cluster
Create CloudWatch alarm on a metric, e.g. MemoryReservation
Configure scaling policies to increase and decrease the size of your cluster
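The scaling logic described above can be sketched as a pure function: given the current cluster size and a `MemoryReservation` reading, decide the next size. This is a local simulation only and calls no AWS API. The thresholds, step size and size limits are hypothetical values chosen for the example.

```python
def desired_cluster_size(current, memory_reservation,
                         scale_out_above=75.0, scale_in_below=25.0,
                         minimum=1, maximum=10):
    """Return the next cluster size for a MemoryReservation reading (percent)."""
    if memory_reservation > scale_out_above:
        target = current + 1   # alarm high: add one container instance
    elif memory_reservation < scale_in_below:
        target = current - 1   # alarm low: remove one container instance
    else:
        target = current       # within the band: no change
    # Never leave the bounds configured on the Auto Scaling group
    return max(minimum, min(maximum, target))

size = 2
history = []
for reading in [80.0, 90.0, 60.0, 20.0, 10.0, 5.0]:
    size = desired_cluster_size(size, reading)
    history.append(size)
print(history)  # [3, 4, 4, 3, 2, 1]
```

In a real deployment the two branches correspond to two CloudWatch alarms, each attached to its own scaling policy.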
Demo
Agile Cloud Native Development with AWS
11th April 2016
Maran Marudhamuthu
Chief Architect – Cloud Services
© 2016 COGNIZANT
CHANGE
WHAT CHANGE
‣ Secular change in architecture of distributed system is happening.
‣ How distributed systems are built is changing.
‣ Evolution towards cloud native architecture.
‣ Thick Line between Infra and App is getting thinner –
‣ Serverless
‣ Forget Infra …build applications - IaaS and PaaS gaining maturity
© 2016 COGNIZANT10
9
CHANGE EXPLAINED: DEVOPS
‣ CULTURE: No more separate development and operations teams
‣ PROCESS: Continuous integration and delivery
‣ AUTOMATE: Infrastructure as code; automate from developer desktop to production
‣ TOOLS: New tools to automate: Chef/Puppet, AWS CloudFormation, Jenkins
CHANGE EXPLAINED: MICROSERVICES AND API
‣ CLOUD NATIVE: 12-factor apps that, for example,
  ‣ isolate failure
  ‣ facilitate blue-green deployment
  ‣ scale independently
‣ CI/CD: Continuous integration and delivery become easier
‣ STANDARDS: Defined protocol for service definition and consumption
‣ FRAMEWORK & TOOLS: Chef/Puppet, AWS CloudFormation, Jenkins, Spring Boot, Netflix OSS
CHANGE EXPLAINED: CONTAINERS
BUILD, SHIP, RUN ANY APP ANYWHERE
▸ BUILD: Package your app in a container (image)
▸ SHIP: Move that container from one machine to another
▸ RUN: Execute that container (i.e. the app)
▸ ANY APP: Anything that runs on Linux or Windows
▸ ANYWHERE: Bare metal, VM, cloud instance
HOW AWS SERVICES HELP? DEVOPS
AWS Deployment and Management Tools
• AWS CodeDeploy
• AWS CodePipeline
• AWS CodeCommit
Automation
• AWS OpsWorks
• AWS CloudFormation
• AWS SDK, API and CLI
HOW AWS SERVICES HELP? - DEVOPS
HOW AWS SERVICES HELP? MICROSERVICES & API – CLOUD NATIVE
AWS AppDev PaaS Hosting
• AWS Elastic Beanstalk
• AWS CodeDeploy
• AWS CodeCommit
• Amazon EMR
• AWS Data Pipeline
Build Fast – Forget Infra – Serverless
• AWS Lambda
• Amazon Kinesis
• AWS Data Pipeline
• AWS API, SDK, CLI
• AWS IoT Framework
HOW AWS SERVICES HELP? – MICROSERVICES – CLOUD NATIVE
HOW AWS SERVICES HELP? CONTAINERS
‣ Amazon EC2, AMI
  ‣ Docker run
‣ AWS Elastic Beanstalk
  ‣ Deploy and scale Docker applications
‣ Amazon EC2 Container Service (ECS)
  ‣ Launch and manage Docker containers
‣ Amazon EC2 Container Registry (ECR)
  ‣ A fully managed Docker container registry
  ‣ Integrated with Amazon EC2 Container Service (ECS)
  ‣ Simplifies your development-to-production workflow
‣ DevOps
CONCLUSION
‣ Think beyond Infra/Forget Infra
‣ Use rather than build
‣ IaaS and PaaS/Application Services
Questions?
Thank You
Scalable Genomic Analysis on AWS with ADAM
Ujjwal Ratan
Healthcare and Life Sciences Solutions Architect
Amazon Web Services
Timothy Danford
Field Engineer, Tamr, Inc. & Software Engineer, University of California, Berkeley
This Talk Will Cover
An Overview of Amazon Elastic MapReduce (EMR) and Spark
Genomics and ADAM deep dive
Video Demonstration
Q&A
Amazon EMR Overview
Hadoop 1.x & 2.x / HDFS clusters
Easy to use; fully managed
Support for EC2 Spot Instances
S3, DynamoDB, Redshift & Kinesis integration
[Logo: Amazon Elastic MapReduce (EMR)]
Many storage layers to choose from
[Diagram: Amazon EMR at the center, connected to each storage layer through a connector]
Amazon S3: EMR File System (EMRFS)
Amazon DynamoDB: EMR-DynamoDB connector
Amazon RDS: JDBC data source w/ Spark SQL
Amazon Kinesis: streaming data connectors
Elasticsearch: Elasticsearch connector
Amazon Redshift: Amazon Redshift COPY from HDFS
Create a fully configured cluster in minutes
AWS Management Console
AWS Command Line Interface (CLI)
Or use an AWS SDK directly with the Amazon EMR API
Amazon EMR – Managed Spark
Easy to install and configure Spark
Secured
Spark submit or use Zeppelin UI
Quickly add and remove capacity
Hourly, reserved, or EC2 Spot pricing
Use S3 to decouple compute and storage
Single Spark Cluster on Amazon EMR
[Diagram: an Amazon EMR cluster with a master node (Spark master, YARN) and four core/task nodes, each running a Spark worker, connected to Amazon S3 through EMRFS.]
S3 Standard is designed for 11 9’s of durability (99.999999999%) and is intended for your data sources
S3 Reduced Redundancy Storage is designed for 99.99% durability and can be used to reduce costs on reproducible datasets
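To make the durability figures concrete, here is a small worked calculation, taking the published design targets of 99.999999999% for S3 Standard and 99.99% for Reduced Redundancy Storage. The ten-million-object bucket and the function name `expectedAnnualLoss` are assumptions for illustration, and the result is an average expectation, not a guarantee.

```scala
// Expected objects lost per year = objects stored * (1 - annual durability).
// BigDecimal avoids floating-point rounding with this many 9s.
def expectedAnnualLoss(objects: Long, durability: BigDecimal): BigDecimal =
  BigDecimal(objects) * (BigDecimal(1) - durability)

val standard          = BigDecimal("0.99999999999") // 11 nines: S3 Standard design target
val reducedRedundancy = BigDecimal("0.9999")        // 99.99%: Reduced Redundancy design target
val objects           = 10000000L                   // hypothetical bucket of ten million objects

println(expectedAnnualLoss(objects, standard))          // about one object every 10,000 years
println(expectedAnnualLoss(objects, reducedRedundancy)) // about a thousand objects per year
```

The gap is why Reduced Redundancy only makes sense for data that can be regenerated, such as intermediate pipeline outputs.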
Next-Generation Genomics
Using Spark and ADAM
Timothy Danford
Tamr Inc.
AMPLab
What’s In The Box?
A Variant-Calling Pipeline
Stages are written separately
Hand-off between steps is through files
Everyone has their own “flavor” of pipeline
Parallelization in the Cloud
Lingua Franca: File Formats
.bam files define a custom .bai index format
User-defined attributes
Typically in coordinate-sorted order
Where Is “The Platform”?
(This is taken from the Picard library.)
Why are we managing file handles and spilling reads to disk inside our bioinformatics methods?
Things Fall Apart When Our Computation Changes
Bioinformaticians ❤️ Probabilistic Models
Many Bioinformatics Methods Are Just Large Sums
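The slide title above can be made concrete with a small sketch. Under a probabilistic model with independent reads, the log-likelihood of a dataset is a sum of per-read terms, so it can be computed per partition and the partial sums combined. Everything below is an assumption chosen for illustration (the `Obs` type, the simple substitution-error model, the function names); it is not code from ADAM or from any variant caller.

```scala
// Hypothetical observation: the base one read shows at a site, plus its error probability.
final case class Obs(base: Char, errorProb: Double)

// Log-likelihood of one observation given a candidate allele
// (assumed model: an error is equally likely to produce any of the 3 other bases).
def logLik(o: Obs, allele: Char): Double =
  if (o.base == allele) math.log(1.0 - o.errorProb)
  else math.log(o.errorProb / 3.0)

// The whole-dataset log-likelihood is a sum of independent per-read terms...
def totalLogLik(obs: Seq[Obs], allele: Char): Double =
  obs.map(logLik(_, allele)).sum

// ...so it can be summed per chunk and the partial sums added up,
// which is the shape of computation a cluster distributes.
def partitionedLogLik(obs: Seq[Obs], allele: Char, partitions: Int): Double =
  obs.grouped(math.max(1, obs.size / partitions)).map(totalLogLik(_, allele)).sum

val pileup = Seq(Obs('A', 0.01), Obs('A', 0.01), Obs('C', 0.05), Obs('A', 0.001))

println(totalLogLik(pileup, 'A'))
println(partitionedLogLik(pileup, 'A', partitions = 2))
```

Because addition is associative, the partitioned and unpartitioned results agree up to floating-point rounding, regardless of how the reads are split.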
The Challenge: Existing Code!
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
Can You Spot the File Format Assumption?
A single piece of a filtering stage for the MuTect somatic variant caller.
Can you spot the place where they assume they are working with BAM files?
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
“The Platform” Defines Core Genomics Primitives
Stop Defining File Formats By Hand
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model.
• Improvement over existing formats [1]
  • 20-22% for BAM
  • ~95% for VCF
[1] Compression % quoted from 1K Genomes samples
Demo
Installation of ADAM on EMR
Analyze Parquet Files using Spark
Use Scala to query the genome data in the ADAM Parquet files
……
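// Run in spark-shell (Spark 1.x), where sqlContext is predefined.
// Load one ADAM Parquet part file into a DataFrame.
val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")
gnomeDF.printSchema()
// Register the DataFrame as a temporary table so it can be queried with SQL.
gnomeDF.registerTempTable("gnome")
// Count the records in the table.
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()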
…….
For more details, see our blog post
Title: Will Spark Power the Data behind Precision Medicine
Authors: Christopher Crosbie, Ujjwal Ratan
URL: http://blogs.aws.amazon.com/bigdata/post/Tx1GE3J0NATVJ39/Will-Spark-Power-the-Data-behind-Precision-Medicine
Thank You
Any Questions?