DESCRIPTION
Producing vaccines is a significant and complex effort that spans manufacturing, biological materials, streaming data, and difficult computational challenges. In this session, speakers from Merck and Booz Allen Hamilton discuss how they partnered to use AWS and data science techniques to pioneer new approaches for analyzing vaccine production yields. The solution they created combines a shared data lake service built on AWS services (such as Amazon EC2 and Amazon VPC) with Hadoop MapReduce, HDFS, Hive, and R to implement the data science infrastructure and analysis that created models of complex biological processes. As a result of this project, Merck has analyzed 12 years of vaccine manufacturing data from 16 data sources and conducted over 15 billion calculations. The project was recognized with the InformationWeek Elite Business Innovation Award for the innovative application of data science toward enhancing vaccine yield rates and saving lives.
11.018.14
Brian Keller, Data Science Lead, Booz Allen Hamilton
Jerry Megaro, Director, Advanced Analytics and Innovation, Merck Manufacturing
Nic Perez, Cloud Architecture Lead, Booz Allen Hamilton
Making a difference with data
- George W. Merck (1950)
1 Broor S, Ghosh D, Mathur P. Molecular epidemiology of rotaviruses in India. Indian J Med Res 2003; 118:59-67.
The Rotavirus Vaccine Disconnect
[Figure: world map. * = sales for RotaTeq®; • = 1,000 deaths]
5.6 billion people in the world do not have access to our products
90% of RotaTeq sales are in the USA
BUSINESS KNOWLEDGE
Parametric models | Let the data tell the story
Input/Output modeling | Data experiments to enable discovery
Avoid failure | Failure is powerful… learn fast and adjust
Narrow scope of analysis | Ask bigger questions using atypical data
Human Insight + Actions
Data Management
Infrastructure
Machine Learning, Free-Computation, Alerting
Geographic, Language Translation, Entity Relationship, Event Grab
Dense/Sparse
Structured, Unstructured, Streaming
Provisioning Deployment Monitoring Workflow
Streaming Analytics
Streaming indexes
Services (SOA)
Analytics and Discovery
Views and Indexes
HDFS/Data Lake
Metadata Tagging
Data Sources
Infrastructure/Management
Visualization, Reporting, Dashboards, and Query Interface
Resulted in…
Winner of the InformationWeek Elite Business Innovation Award
[Figure: Similarity Matrix. Rows and columns: Batch 1 to Batch 5. Axis label: Increasing yield. Color scale: Similarity Score, low to high. Annotations: (moderate similarity), (high similarity)]
Clustering in this region indicates parameter similarity is associated with high yield
Clustering in this region indicates parameter similarity is associated with low yield
Lots of Data Experiments (And Failures) That Lead to Final Predictive Model…
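A batch similarity matrix of the kind shown on this slide could be sketched as follows. This is a minimal illustration, not the speakers' method: the slides do not say which similarity measure was used, so cosine similarity is an assumption, and the batch names, yields, and parameter vectors below are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(batches):
    """batches: list of (name, yield, params) tuples.

    Returns the batch names ordered by increasing yield and the
    pairwise similarity matrix in that same order, so that batches
    with similar yield sit next to each other.
    """
    ordered = sorted(batches, key=lambda b: b[1])
    names = [b[0] for b in ordered]
    matrix = [[cosine_similarity(p[2], q[2]) for q in ordered] for p in ordered]
    return names, matrix

# Hypothetical data: (name, yield, process parameters). Not from the talk.
batches = [
    ("Batch 1", 0.91, [1.0, 0.2, 0.1]),
    ("Batch 2", 0.55, [0.1, 0.9, 0.8]),
    ("Batch 3", 0.88, [0.9, 0.3, 0.1]),
]
names, matrix = similarity_matrix(batches)
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

With the axes ordered by yield, a block of high similarity among the high-yield batches would suggest that shared parameter settings are associated with high yield. In practice the parameters would likely need to be standardized before comparison.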
Business Decision Makers
Researchers
External Partners
Redshift-Based Data Marts
Amazon EC2
Elastic MapReduce
Hadoop, Solr Search Solution
Legacy Enterprise RDS
AES-Encrypted S3 Data Lake
VPC
Enterprise Active Directory
JAX-RS/Tomcat-Based REST Services on Elastic Beanstalk
Angular, D3.js Web UI
Insights
Accelerated Reasoning
Security
Cell-Level Visibility, Life Science Informatics via Custom Solr Plug-ins
Flexible Data Processing Pipelines
Business Users
Data Scientists
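The idea behind cell-level visibility can be illustrated with a small sketch. The slides only state that it was implemented via custom Solr plug-ins and give no details, so this is not that plug-in code: the label model (each cell carries a set of labels, all of which the user must hold) and the field names are assumptions made for illustration.

```python
def visible_cells(record, user_auths):
    """Return only the cells the user is allowed to see.

    record: dict mapping field name -> (value, required_labels)
    user_auths: set of labels granted to the user
    A cell is visible when every required label is held by the user.
    """
    return {
        field: value
        for field, (value, required) in record.items()
        if required <= user_auths  # subset test
    }

# Hypothetical record; field names and labels are illustrative only.
record = {
    "batch_id": ("B-001", frozenset()),
    "yield_pct": (87.5, frozenset({"manufacturing"})),
    "supplier": ("ACME", frozenset({"manufacturing", "procurement"})),
}

researcher_view = visible_cells(record, {"manufacturing"})
print(researcher_view)
```

Filtering per cell rather than per record lets business users, data scientists, and external partners query the same data lake while each sees only the fields they are cleared for.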
Reference Architecture – Privileged Identity Management
Reference Architecture – Identity Analytics
– Monitor, identify, and alert on abnormal user activity
– Govern administrative rights; policy based enforcement
– Hardened virtual appliance; do not allow direct RDP/SSH access to management/security appliances
– IA has purview into every log (firewall/router logs, crypto logs, application logs, systemd logs, OS logs, SCCM, etc.)
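The "monitor, identify, and alert on abnormal user activity" item could be approached in many ways, and the slides do not describe the detection logic. As one hedged sketch, the example below flags users whose event count today is far above their own historical baseline using a z-score. The threshold, the data layout, and the user names are all assumptions.

```python
from statistics import mean, stdev

def abnormal_users(history, today, threshold=3.0):
    """Flag users whose activity today is far above their own baseline.

    history: dict of user -> list of past daily event counts
    today: dict of user -> today's event count
    threshold: z-score above which a user is flagged (assumed value)
    """
    alerts = []
    for user, count in today.items():
        past = history.get(user, [])
        if len(past) < 2:
            continue  # not enough data to form a baseline
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # Perfectly flat baseline: flag any increase at all.
            if count > mu:
                alerts.append(user)
        elif (count - mu) / sigma > threshold:
            alerts.append(user)
    return sorted(alerts)

# Hypothetical daily counts aggregated from logs; not from the talk.
history = {"alice": [10, 12, 11, 9, 10], "bob": [5, 6, 5, 4, 5]}
today = {"alice": 11, "bob": 40}
print(abnormal_users(history, today))
```

A per-user baseline is used here so that a normally busy administrator is not flagged just for being busier than an occasional user.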
Reference Architecture – Cryptography
exploredatascience.com
github.com/booz-allen-hamilton
http://bit.ly/awsevals