Presentation given at the Microsoft HPC User Group, Salt Lake City, 2012
Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure
Simon Woodman, Jacek Cala, Hugo Hiden, Paul Watson
VENUS-C
Developing technology to ease scientific adoption of Cloud Computing
• EU Funded Project
– 11 Partners
– 5 Technology/Infrastructure
– 6 Scenario Partners
– Open Call: 20 Pilot Studies
– May 2010 to May 2012
[Diagram: the VENUS-C platform stack. Users and the scenario/algorithm partners sit on top of two execution environments, the EMIC Generic Worker and BSC COMPSs, which cover workload types such as batch, HTC, data flow, parameter sweep, map/reduce and CEP, with .NET, Java and C++ as programming languages. Below these sit the cloud paradigms (PaaS, IaaS), cloud technologies (Windows Azure, OpenNebula, EMOTIVE), operating systems (Windows, Linux) and cloud providers (MSFT, ENG, KTH, BSC, ...), alongside a BSC supercomputer and on-premises customer resources that are not in the cloud.]
The Problem
What are the properties of this molecule?
• Toxicity
• Solubility
• Biological Activity
Perform experiments?
• Time consuming
• Expensive
• Ethical constraints
The Alternative to Experiments
Predict likely properties based on similar molecules.
• CHEMBL Database: data on 622,824 compounds, collected from 33,956 publications
• WOMBAT Database: data on 251,560 structures, for over 1,966 targets
• WOMBAT-PK Database: data on 1,230 compounds, for over 13,000 clinical measurements
All of these databases contain structure information and numerical activity data.
QSAR
Quantitative Structure Activity Relationship
Activity ≈ f(structure)
More accurately, Activity is related to quantifiable structural attributes:
Activity ≈ f(logP, number of atoms, shape, ...)
Currently > 3,000 recognised attributes (http://www.qsarworld.com/)
Method
Branching Workflows: each stage of the model-building pipeline has alternative implementations, and the combinations are explored (see the sketch after this list):
• Calculate descriptors: Java CDK descriptors / C++ CDL descriptors
• Partition training & test data: Random split / 80:20 split
• Select descriptors: Correlation analysis / Genetic algorithms / Random selection
• Build model: Linear regression / Neural Network / Partial Least Squares / Classification Trees
• Add to database
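A minimal sketch of what "branching" amounts to: the model space is the Cartesian product of the alternatives at each pipeline stage. The class below is illustrative, not part of e-Science Central; the option names follow the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: enumerates every combination of pipeline
// stage alternatives, i.e. the QSAR model space explored by the
// branching workflows.
public class ModelSpaceEnumerator {

    public static void main(String[] args) {
        String[] descriptors = {"Java CDK", "C++ CDL"};
        String[] partitions  = {"Random split", "80:20 split"};
        String[] selections  = {"Correlation analysis", "Genetic algorithms", "Random selection"};
        String[] learners    = {"Linear regression", "Neural network", "PLS", "Classification trees"};

        List<String> pipelines = new ArrayList<>();
        for (String d : descriptors)
            for (String p : partitions)
                for (String s : selections)
                    for (String l : learners)
                        pipelines.add(d + " -> " + p + " -> " + s + " -> " + l);

        // 2 * 2 * 3 * 4 = 48 candidate pipelines per dataset; repeated
        // over many datasets with cross-validation, this multiplies into
        // the large model counts reported later in the talk.
        System.out.println(pipelines.size() + " candidate pipelines");
    }
}
```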
e-Science Central
Platform for cloud-based data analysis
• Runs on: Azure / EC2 / On Premise
• Workflow blocks in: Java / R / Octave / Javascript
Architecture
[Diagram: one Azure VM hosts the e-Science Central main server (Web UI, REST API, JMS queue); another hosts the e-SC db backend. Web browsers and a rich client app talk to the main server, and a QSARExplorer web role uses the REST API. Workflow invocations are dispatched via the JMS queue to Generic Worker web roles, each running a workflow engine; workflow data flows through the e-SC blob store (backed by the Azure Blob store) and e-SC control data through the database.]
Workflow Architecture
• Single Message Queue
– Worker Failure Semantics
– Elasticity
• Runtime Environments
– R, Octave, Java
– Deployed only once
Worker Role
1. Get Job from Queue
2. Deploy Runtime? (first job only: Install JRE, Install wf engine, Execute the engine)
3. Get Data
4. Execute Job
5. Put Data
6. Put Next Jobs on Queue
(a sketch of this loop follows)
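A minimal sketch of the Generic Worker loop described above, assuming a simple queue/storage abstraction. Queue, Storage and Job are hypothetical interfaces for illustration, not the actual e-Science Central or Generic Worker API.

```java
// Illustrative worker loop: deploy the runtime once, then repeatedly
// pull jobs, stage data, execute, store results and enqueue successors.
public class WorkerLoop {

    interface Queue { Job take() throws InterruptedException; void put(Job j); }
    interface Storage { byte[] get(String key); void put(String key, byte[] data); }
    interface Job {
        String inputKey(); String outputKey();
        byte[] run(byte[] input); Iterable<Job> nextJobs();
    }

    private boolean runtimeDeployed = false;

    void run(Queue queue, Storage storage) throws InterruptedException {
        while (true) {
            Job job = queue.take();                          // 1. get job from queue
            if (!runtimeDeployed) {                          // 2. deploy runtime once:
                installJreAndWorkflowEngine();               //    JRE + wf engine
                runtimeDeployed = true;
            }
            byte[] input = storage.get(job.inputKey());      // 3. get data
            byte[] output = job.run(input);                  // 4. execute job
            storage.put(job.outputKey(), output);            // 5. put data
            for (Job next : job.nextJobs()) queue.put(next); // 6. put next jobs on queue
        }
    }

    private void installJreAndWorkflowEngine() { /* one-off per VM */ }
}
```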
Results
• 250k models
– Linear Regression
– PLS
– RPartitioning
– Neural Net
• 460K workflow executions
• 4.4M service calls
• QSAR Explorer
– Browse
– Search
– Get Predictions
Scalability: Large Scale QSAR
                 100 Nodes     200 Nodes
Response Time    3hr 19mins    1hr 50mins
Speedup          94x           156x
Efficiency       94%           78%
Cost             $55.68        $51.84
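As a worked check (using the standard definitions, which the slide does not state): efficiency is speedup divided by node count, which reproduces the table's percentages:

```latex
E(N) = \frac{S(N)}{N} \quad\Rightarrow\quad
E(100) = \frac{94}{100} = 94\%, \qquad
E(200) = \frac{156}{200} = 78\%
```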
[Chart: execution time [hh:mm] against number of processors (0 to 250) for GW and Azure.]
[Chart: relative processing speed-up against number of processors for Azure and GW, with the ideal linear speed-up for comparison.]
480 datasets sequential time: 11 days
Cloud Applicability
• Bursty
– ChEMBLdb updates (delta 10%)
– New Modelling Methods (???)
• Performance depends on how chatty the problem is
– Deploy (incl. download) dependencies once
– Avoid storage bottlenecks
Performance is great but …
Drug Development requires us to capture the data and the process
Provenance/Audit Requirements
• How was a model generated?
– What algorithm?
– What descriptors?
• Are these results reproducible?
• How have bugs manifested?
– Which models are affected?
– How do we regenerate affected models?
• Performance characteristics
• How do we deal with new data?
Storing Provenance
• Neo4j
– Open Source Graph Database
– Nodes/Relationships + properties
– Querying/traversing
• Access
– Java lib for OPM
– e-SC library built on top of OPM lib
– REST interface
• Options for HA and sharding for performance
(see the sketch after this list)
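A minimal sketch, using the 1.x-era embedded Neo4j Java API, of how one OPM-style triple (a Process that used one Artifact and generated another) might be stored. The relationship and property names are illustrative; this is not the e-SC provenance library.

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Illustrative only: records one "used"/"wasGeneratedBy" pattern
// in an embedded Neo4j database.
public class OpmSketch {

    enum OpmRel implements RelationshipType { USED, WAS_GENERATED_BY }

    public static void main(String[] args) {
        GraphDatabaseService db =
            new GraphDatabaseFactory().newEmbeddedDatabase("provenance-db");
        Transaction tx = db.beginTx();
        try {
            Node input   = db.createNode();   // OPM Artifact: input data
            Node process = db.createNode();   // OPM Process: a block execution
            Node output  = db.createNode();   // OPM Artifact: result

            input.setProperty("type", "Artifact");
            process.setProperty("type", "Process");
            process.setProperty("block", "BuildModel"); // hypothetical name
            output.setProperty("type", "Artifact");

            process.createRelationshipTo(input, OpmRel.USED);
            output.createRelationshipTo(process, OpmRel.WAS_GENERATED_BY);
            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}
```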
Provenance Model
• Based on OPM
– Processes, Artifacts, Agents
• Directed Graph
• Multiple views of provenance
– Dependent on security privileges
Adding new model builders
1. Add new block
2. Mine the provenance
3. Dynamically create virtual workflows
4. One invocation per data set
• Work in progress…
[Diagram: for each of the n data sets, the existing provenance records an "Enumerate descriptors" step feeding "Build and cross-validate RPart-m" and "Test RPart-m". Mining this graph lets the system build, cross-validate and test a new kind of model ("Test ?-m") with one invocation per data set.]
Future Work
• Scalability and reliability
– SQL Azure
– Application server replication
• Provenance visualization
• Meta-QSAR
– Provenance Mining
• Cloud4Science
– Applying lessons learned to new scenarios
MOVEeCloud Project
• Investigating the links between physical activity and common diseases: type 2 diabetes, cardiovascular disease, …
• Wrist accelerometers worn over a 1-week period
• Measure movement at 100Hz in three axes
• Processing is ideal for Azure
– Bursty data processing as new data is gathered
– Embarrassingly parallel
– Large datasets
MOVEeCloud Process
[Diagram: per-patient, per-visit accelerometer data flows into Analysis and Classification (R, Java, Octave), which labels periods as Sleep, Walking, Sedentary and Activity; the results feed a Clinician's Report, Patient Interventions, and the Methodology Section for Papers.]
Data Sizes
100 samples / second         100 rows
3,600 seconds / hour         360,000 rows
24 hours / day               8,640,000 rows
7 days / study               60,480,000 rows
Cohort size of 800 patients and multiple visits
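The table is a running product; scaling the per-study figure to the cohort (an extrapolation, not stated on the slide) gives the order of magnitude:

```latex
100 \,\tfrac{\text{rows}}{\text{s}} \times 3600 \,\tfrac{\text{s}}{\text{h}} \times 24 \,\tfrac{\text{h}}{\text{day}} \times 7 \,\text{days} = 60{,}480{,}000 \ \text{rows/study}
```
```latex
800 \ \text{patients} \times 6.048 \times 10^{7} \approx 4.8 \times 10^{10} \ \text{rows per round of visits}
```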
Working with larger data sets
• As we add more workflow engines, server load increases
– One server can cope with 200 engines if files are small
• This is not the case with movement data
– Only supports 4 engines
• Increase the bandwidth to the engines
– Clustering appserver/database?
HDFS
• Implemented prior to native HDFS on Azure
• Easy to integrate with e-SC: the Java system just requires the libraries to be included in e-SC (see the sketch after this list)
• Distributed store where bandwidth increases with the number of machines
– Bits of data spread around lots of machines
• Concept of data location
– Potential to route workflows to execute as close as possible to storage
• Other applications also built on top of HDFS
– OpenTSDB to store timeseries for movement data
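A minimal sketch of the kind of HDFS access involved, using the standard Hadoop FileSystem Java API. The namenode URI and file paths are placeholders; this is not the e-SC integration code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Stages a local file into HDFS and reads it back.
// Error handling omitted for brevity.
public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical namenode

        FileSystem fs = FileSystem.get(conf);

        // Upload a file of accelerometer data; HDFS spreads its blocks
        // across the datanodes running on the workflow engines.
        fs.copyFromLocalFile(new Path("/tmp/patient-visit.csv"),
                             new Path("/movement/patient-visit.csv"));

        // An engine on another node reads it back at the aggregate
        // bandwidth of the datanodes holding the blocks.
        fs.copyToLocalFile(new Path("/movement/patient-visit.csv"),
                           new Path("/tmp/staged.csv"));
        fs.close();
    }
}
```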
Drawbacks to HDFS
• Needs a NAMENODE to co-ordinate everything
– Single point of failure (in current HDFS)
• Metadata stored in RAM
– Doesn't scale beyond a million or so files (bad for drug discovery)
• Not particularly efficient for small files
– There is an overhead to connecting to the filesystem
• If instances terminate, data can be lost if not backed up
– Redundancy helps
– Back up in the cloud and stage into HDFS for experiments
– Might use it as a cache of "hot" files and use Blobstore/S3 to back it all up
e-SC Workflows
Chunking Workflow
Chunk Processing Workflow
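The slides show these workflows as screenshots; as a stand-in, here is a minimal sketch of the chunking step, assuming fixed-size chunks of one hour of 100Hz data (360,000 rows). The file names and chunk size are assumptions, not the actual e-SC blocks.

```java
import java.io.*;

// Splits a large accelerometer CSV into fixed-size chunks so that
// each chunk can be processed by an independent workflow invocation.
public class ChunkSketch {
    static final int ROWS_PER_CHUNK = 360_000; // 1 hour at 100 Hz (assumed)

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("patient-visit.csv"))) {
            String line;
            int row = 0, chunk = 0;
            PrintWriter out = newChunk(chunk);
            while ((line = in.readLine()) != null) {
                out.println(line);
                if (++row % ROWS_PER_CHUNK == 0) { // close and start a new chunk
                    out.close();
                    out = newChunk(++chunk);
                }
            }
            out.close();
        }
    }

    private static PrintWriter newChunk(int n) throws IOException {
        return new PrintWriter(new FileWriter("chunk-" + n + ".csv"));
    }
}
```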
System Setup
• One machine for the e-SC server: 4 CPUs, 7GB RAM, 1TB local storage
• One machine for the Namenode: 4 CPUs, 7GB RAM, 1TB local storage
• Four workflow engines: 2 CPUs, 3.5GB RAM, 500GB local storage each
• 2TB HDFS storage using the workflow engines
– Mounts up quickly
• Increase the priority of HDFS on the engines
– Competing for resources with the workflow engine
Initial Results
For a single data set, processing went from 60 to 16 minutes using 4 workflow engines running HDFS.
• 4 engines was the limit for one e-SC server
– Main server hit 100% CPU delivering data
– No further improvements with more engines
• Using HDFS, server CPU was consistently below 5%
– More like our earlier scalability results
• Once data had been chunked, processing time was the same for each chunk
– The improvement lay entirely in staging data and uploading results
Questions?
• Thank you to our generous funders
– Microsoft Cloud 4 Science
– EU FP7 VENUS-C (RI-261565)
– RCUK SiDE (EP/G066019/1)
• The Team
– Jacek Cala
– Paul Watson
– Hugo Hiden