Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure
Simon Woodman, Jacek Cala, Hugo Hiden, Paul Watson

Microsoft HPC User Group


Presentation given at the Microsoft HPC User Group, Salt Lake City, 2012


Page 1: Microsoft HPC User Group

Fast Exploration of the QSAR Model Space with e-Science Central and Windows Azure
Simon Woodman, Jacek Cala, Hugo Hiden, Paul Watson

Page 2: Microsoft HPC User Group

VENUS-C

Developing technology to ease scientific adoption of Cloud Computing

• EU Funded Project
  – 11 Partners
  – 5 Technology/Infrastructure
  – 6 Scenario Partners
  – Open Call – 20 Pilot Studies
  – May 2010 to May 2012

Page 3: Microsoft HPC User Group

[Diagram: the VENUS-C platform stack. Users and customers run scenario/algorithm codes (C++, Java and .NET, from partners such as CNR, AEGEAN, UPV, BioCoS and UNEW) on two execution environments – the EMIC Generic Worker and BSC COMPSs – across the VENUS-C infrastructure: Windows and Linux operating systems; IaaS and PaaS cloud paradigms; Windows Azure, EMOTIVE and OpenNebula cloud technologies; the MSFT, ENG, KTH and BSC cloud providers; plus on-premises resources and the BSC supercomputer (both not in the cloud). Supported workload types: batch, HTC, data flow, parameter sweep, map/reduce and CEP.]

Page 4: Microsoft HPC User Group

What are the properties of this molecule?

Toxicity

Solubility

Biological Activity

Perform experiments

Time consuming

Expensive

Ethical constraints

The Problem

Page 5: Microsoft HPC User Group

Predict likely properties based on similar molecules

CHEMBL Database: data on 622,824 compounds, collected from 33,956 publications

WOMBAT-PK Database: data on 1230 compounds, for over 13,000 clinical measurements

WOMBAT Database: data on 251,560 structures, for over 1,966 targets

All these databases contain structure information and numerical activity data

The alternative to Experiments

Page 6: Microsoft HPC User Group

Quantitative Structure Activity Relationship

Activity ≈ f( molecular structure )

More accurately, activity is related to quantifiable structural attributes:

Activity ≈ f( logP, number of atoms, shape, ... )

Currently > 3,000 recognised attributes (http://www.qsarworld.com/)

QSAR
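To make this concrete, here is a minimal, self-contained sketch (not the project's modelling code) that fits a one-descriptor linear QSAR model, Activity ≈ a·logP + b, by ordinary least squares over made-up values; the real pipeline computes thousands of descriptors and builds its models with R and Java libraries:

// Minimal QSAR illustration: fit Activity ≈ a*logP + b by least squares.
// The descriptor values and activities are invented purely for illustration.
public class SimpleQsarSketch {
    public static void main(String[] args) {
        double[] logP     = {1.2, 2.5, 0.8, 3.1, 1.9};   // hypothetical descriptor values
        double[] activity = {4.1, 6.0, 3.2, 7.4, 5.1};   // hypothetical measured activities

        int n = logP.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX  += logP[i];
            sumY  += activity[i];
            sumXY += logP[i] * activity[i];
            sumXX += logP[i] * logP[i];
        }
        // Closed-form least-squares estimates for slope a and intercept b.
        double a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double b = (sumY - a * sumX) / n;

        System.out.printf("Activity ≈ %.3f * logP + %.3f%n", a, b);
        System.out.printf("Predicted activity at logP = 2.0: %.3f%n", a * 2.0 + b);
    }
}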

Page 7: Microsoft HPC User Group

Method

Page 8: Microsoft HPC User Group

Branching Workflows

Pipeline: Partition training & test data → Calculate descriptors → Select descriptors → Build model → Add to database

Alternatives at each step:
• Partition training & test data: Random split, 80:20 split
• Calculate descriptors: Java CDK descriptors, C++ CDL descriptors
• Select descriptors: Correlation analysis, Genetic algorithms, Random selection
• Build model: Linear regression, Neural Network, Partial Least Squares, Classification Trees
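This branching is what makes the model space large: every combination of split strategy, descriptor package, selection method and model builder is a distinct pipeline to run for every data set. A small sketch of the enumeration (option names taken from the slide; the real system composes e-SC workflows, not strings):

import java.util.Arrays;
import java.util.List;

// Sketch of how the alternatives multiply into a model space: one candidate
// pipeline per combination of split, descriptor package, selection method and
// model builder. 2 * 2 * 3 * 4 = 48 pipelines per data set, before parameters.
public class ModelSpaceEnumerator {
    public static void main(String[] args) {
        List<String> splits      = Arrays.asList("Random split", "80:20 split");
        List<String> descriptors = Arrays.asList("Java CDK descriptors", "C++ CDL descriptors");
        List<String> selection   = Arrays.asList("Correlation analysis", "Genetic algorithms", "Random selection");
        List<String> builders    = Arrays.asList("Linear regression", "Neural network",
                                                 "Partial least squares", "Classification trees");
        int count = 0;
        for (String split : splits)
            for (String desc : descriptors)
                for (String sel : selection)
                    for (String model : builders) {
                        System.out.printf("%s -> %s -> %s -> %s%n", split, desc, sel, model);
                        count++;
                    }
        System.out.println(count + " candidate pipelines per data set");
    }
}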

Page 9: Microsoft HPC User Group

e-Science Central

Platform for cloud-based data analysis

Runs on: Azure, EC2, On Premise

Languages: Java, R, Octave, JavaScript

Page 10: Microsoft HPC User Group

[Architecture diagram: web browsers and a rich client app talk to the e-Science Central main server (Web UI + REST API) running in an Azure VM, with the e-SC database backend in a second Azure VM. Workflow invocations are dispatched through a JMS queue to multiple Generic Worker web roles, each running a workflow engine and exchanging e-SC control data with the main server. A QSARExplorer web role fronts the results, and workflow data is held in the e-SC blob store on the Azure Blob store.]

Architecture

Page 11: Microsoft HPC User Group
Page 12: Microsoft HPC User Group

Workflow Architecture

• Single Message Queue
  – Worker failure semantics
  – Elasticity
• Runtime Environments
  – R
  – Octave
  – Java
• Deployed only once

Worker Role loop (sketched in code below): Get job from queue → Deploy runtime if needed (install JRE, install workflow engine – done only once) → Get data → Execute job (run the engine) → Put data → Put next jobs on queue.
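A minimal sketch of that loop, with an in-memory queue and no-op stubs standing in for the JMS queue, blob transfers and the workflow engine (none of these names are the Generic Worker's actual API):

import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative shape of the worker-role loop described above. A local
// LinkedBlockingQueue stands in for the JMS queue; the deploy/get/execute/put
// steps are stubs, not the actual Generic Worker implementation.
public class WorkerLoopSketch {
    private static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private static boolean runtimeDeployed = false;   // JRE + workflow engine installed yet?

    public static void main(String[] args) throws InterruptedException {
        queue.put("invoke-workflow-42");               // seed one job for the demo
        while (!queue.isEmpty()) {
            String job = queue.take();                 // 1. get job from queue
            if (!runtimeDeployed) {                    // 2. deploy runtime only once
                installJreAndEngine();
                runtimeDeployed = true;
            }
            getInputData(job);                         // 3. download input data
            executeJob(job);                           // 4. run the workflow engine
            putOutputData(job);                        // 5. upload results
            for (String next : nextJobs(job)) {        // 6. put follow-on jobs on the queue
                queue.put(next);
            }
        }
    }

    private static void installJreAndEngine()        { System.out.println("deploying runtime (once)"); }
    private static void getInputData(String job)     { System.out.println("staging inputs for " + job); }
    private static void executeJob(String job)       { System.out.println("executing " + job); }
    private static void putOutputData(String job)    { System.out.println("storing outputs of " + job); }
    private static List<String> nextJobs(String job) { return Collections.emptyList(); }  // none in this demo
}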

Page 13: Microsoft HPC User Group

Results

• 250k models
  – Linear Regression
  – PLS
  – Recursive Partitioning (RPart)
  – Neural Net

• 460K workflow executions

• 4.4M service calls

• QSAR Explorer
  – Browse
  – Search
  – Get Predictions

Page 14: Microsoft HPC User Group

Scalability: Large Scale QSAR

                   100 Nodes     200 Nodes
Response Time      3 hr 19 min   1 hr 50 min
Speedup            94x           156x
Efficiency         94%           78%
Cost               $55.68        $51.84

[Charts: execution time (hh:mm) and relative processing speed-up vs. number of processors, comparing the Generic Worker (GW) and Azure deployments against the ideal speed-up.]

480 datasets sequential time: 11 days
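For reference, the efficiency row is simply the measured speed-up divided by the node count; a trivial check using the figures above:

// Parallel efficiency = speed-up / node count, using the table above.
public class EfficiencyCheck {
    public static void main(String[] args) {
        System.out.printf("100 nodes: %.0f%%%n", 94.0 / 100 * 100);   // 94%
        System.out.printf("200 nodes: %.0f%%%n", 156.0 / 200 * 100);  // 78%
    }
}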

Page 15: Microsoft HPC User Group

Cloud Applicability

• Bursty
  – ChEMBLdb updates (delta 10%)
  – New modelling methods (???)
• Performance depends on how chatty the problem is
  – Deploy (incl. download) dependencies once
  – Avoid storage bottlenecks

Page 16: Microsoft HPC User Group

Performance is great but …

Drug Development requires us to capture the data and the process

Page 17: Microsoft HPC User Group

Provenance/Audit Requirements

• How was a model generated?
  – What algorithm?
  – What descriptors?

• Are these results reproducible?

• How have bugs manifested?
  – Which models are affected?
  – How do we regenerate affected models?

• Performance characteristics
• How do we deal with new data?

Page 18: Microsoft HPC User Group

Storing Provenance

• Neo4j
  – Open-source graph database
  – Nodes/relationships + properties
  – Querying/traversing
• Access
  – Java lib for OPM
  – e-SC library built on top of the OPM lib
  – REST interface
• Options for HA and sharding for performance

Page 19: Microsoft HPC User Group

Provenance Model

• Based on OPM– Processes, Artifacts,

Agents

• Directed Graph• Multiple views of

provenance– Dependent on security

privileges
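A minimal, in-memory sketch of that graph model in plain Java (illustrative only; the real system persists it in Neo4j through the OPM library): processes, artifacts and agents as nodes, with directed OPM-style relations such as "used", "wasGeneratedBy" and "wasControlledBy" between them.

import java.util.ArrayList;
import java.util.List;

// Tiny in-memory illustration of the OPM-style provenance graph: Processes,
// Artifacts and Agents as nodes, directed edges between them. Node labels and
// edge names here are examples, not data from the actual system.
public class ProvenanceSketch {
    enum NodeType { PROCESS, ARTIFACT, AGENT }
    record Node(NodeType type, String label) {}
    record Edge(Node from, String relation, Node to) {}

    public static void main(String[] args) {
        Node dataset = new Node(NodeType.ARTIFACT, "training data (split 3)");
        Node build   = new Node(NodeType.PROCESS,  "Build model (RPart)");
        Node model   = new Node(NodeType.ARTIFACT, "QSAR model");
        Node engine  = new Node(NodeType.AGENT,    "workflow engine");

        List<Edge> graph = new ArrayList<>();
        graph.add(new Edge(build, "used", dataset));            // process used an artifact
        graph.add(new Edge(model, "wasGeneratedBy", build));    // artifact generated by a process
        graph.add(new Edge(build, "wasControlledBy", engine));  // process controlled by an agent

        // Traversal, e.g. answering "how was this model generated?"
        for (Edge e : graph) {
            if (e.from() == model || e.to() == model || e.from() == build) {
                System.out.println(e.from().label() + " --" + e.relation() + "--> " + e.to().label());
            }
        }
    }
}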

Page 20: Microsoft HPC User Group

Adding new model builders

1. Add new block
2. Mine the provenance
3. Dynamically create virtual workflows
4. One invocation per data set

• Work in progress…

[Diagram: provenance fragments from the existing pipeline – "Enumerate descriptors", "Build and cross-validate RPart-m" (with cross-validation), "Test RPart-m" – are mined and reused to compose "Build and cross-validate a new kind of model" and "Test ?-m", with 1:n cardinality across the data sets.]

Page 21: Microsoft HPC User Group

Future Work

• Scalability and reliability
  – SQL Azure
  – Application server replication
• Provenance visualization
• Meta-QSAR
  – Provenance mining
• Cloud4Science
  – Applying lessons learned to new scenarios

Page 22: Microsoft HPC User Group

MOVEeCloud Project

• Investigating the links between physical activity and common diseases – type 2 diabetes, cardiovascular disease,…

• Wrist accelerometers worn over a 1-week period

• Measures movement at 100Hz in three axes

• Processing ideal for Azure
  – Bursty data processing as new data is gathered
  – Embarrassingly parallel
  – Large datasets

Page 23: Microsoft HPC User Group

MOVEeCloud Process

[Diagram: raw accelerometer data is analysed and classified (in R, Java and Octave) into periods of sleep, walking, sedentary time and activity; the outputs feed a clinician's report, patient interventions and the methodology section for papers.]

Page 24: Microsoft HPC User Group

Data Sizes (per patient, per visit)

100 samples / second     100 rows
3600 seconds / hour      360,000 rows
24 hours / day           8,640,000 rows
7 days / study           60,480,000 rows

Cohort size of 800 patients and multiple visits

Page 25: Microsoft HPC User Group

Working with larger data sets

• As we add more workflow engines, server load increases
  – One server can cope with 200 engines if files are small
• This is not the case with movement data
  – Can only support 4 engines
• Increase the bandwidth to the engines
  – Clustering app server / database?

Page 26: Microsoft HPC User Group

HDFS

• Implemented prior to native HDFS on Azure
• Easy to integrate with e-SC (a minimal usage sketch follows below)
  – Java system: just requires the libraries to be included in e-SC
• Distributed store where bandwidth increases with the number of machines
  – Bits of data spread around lots of machines
• Concept of data location
  – Potential to route workflows to execute as close as possible to the storage
• Other applications are also built on top of HDFS
  – OpenTSDB to store time series for movement data
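A minimal sketch of that integration using the standard Hadoop FileSystem Java API (the namenode address and file paths are assumptions, not the actual deployment's):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stage a movement-data file into HDFS and read it back, via the
// standard Hadoop FileSystem API. Namenode URI and paths are hypothetical.
public class HdfsStagingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path local  = new Path("accel-patient-001-visit-01.csv");      // local accelerometer file
        Path remote = new Path("/movement/patient-001/visit-01/accel.csv");

        fs.copyFromLocalFile(local, remote);   // upload: aggregate bandwidth grows with the cluster
        System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");

        // A workflow engine co-located with the data can later stream it directly.
        try (InputStream in = fs.open(remote)) {
            byte[] buffer = new byte[8192];
            int n = in.read(buffer);
            System.out.println("Read first " + n + " bytes of the file");
        }
        fs.close();
    }
}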

Page 27: Microsoft HPC User Group

Drawbacks to HDFS

• Needs a NameNode to co-ordinate everything
  – Single point of failure (in current HDFS)
• Metadata stored in RAM
  – Doesn't scale beyond a million or so files
  – Bad for drug discovery
• Not particularly efficient for small files
  – There is an overhead to connecting to the filesystem
• If instances terminate, data can be lost if not backed up
  – Redundancy helps
  – Back data up in the cloud and stage to HDFS for experiments
  – Might use it as a cache of “hot” files and use Blob store/S3 to back it all up

Page 28: Microsoft HPC User Group

e-SC Workflows

Chunking Workflow

Chunk Processing Workflow
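Both workflows are composed graphically in e-SC. As a rough, hypothetical illustration of what the chunking step does (file names and the chunk size are assumptions, not the actual blocks), the sketch below splits a 100 Hz accelerometer CSV into one-hour chunks of 360,000 rows, so that each chunk can be handed to an independent chunk-processing invocation:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative stand-in for the chunking workflow: split a large 100 Hz
// accelerometer CSV into one-hour chunks (360,000 rows each) so every chunk
// can be processed by a separate workflow invocation in parallel.
public class ChunkingSketch {
    private static final int ROWS_PER_CHUNK = 360_000;   // 100 samples/s * 3600 s

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("accel.csv"))) {
            PrintWriter out = new PrintWriter(new FileWriter("chunk-0.csv"));
            String line;
            int row = 0, chunk = 0;
            while ((line = in.readLine()) != null) {
                if (row > 0 && row % ROWS_PER_CHUNK == 0) {
                    out.close();                          // finish the current chunk
                    chunk++;
                    out = new PrintWriter(new FileWriter("chunk-" + chunk + ".csv"));
                }
                out.println(line);
                row++;
            }
            out.close();
            System.out.println("Wrote " + (chunk + 1) + " chunks from " + row + " rows");
        }
    }
}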

Page 29: Microsoft HPC User Group

System Setup

• One machine for the e-SC server
  – 4 CPUs, 7 GB RAM, 1 TB local storage
• One machine for the NameNode
  – 4 CPUs, 7 GB RAM, 1 TB local storage
• Four workflow engines
  – 2 CPUs, 3.5 GB RAM, 500 GB local storage each
• 2 TB HDFS storage using the workflow engines
  – Mounts up quickly
• Increase priority of HDFS on the engines
  – Competing for resources with the workflow

Page 30: Microsoft HPC User Group

Initial Results

For a single data set, processing went from 60 to 16 minutes using 4 workflow engines running HDFS.

• 4 engines were the limit for one e-SC server
  – The main server hit 100% CPU delivering data
  – No further improvements with more engines
• Using HDFS, server CPU was consistently below 5%
  – More like our earlier scalability results
• Once data had been chunked, processing was the same for each chunk
• The improvement lay entirely in staging and uploading results

Page 31: Microsoft HPC User Group

Questions?

• Thank you to our generous funders
  – Microsoft Cloud 4 Science
  – EU FP7 – VENUS-C (RI-261565)
  – RCUK – SiDE (EP/G066019/1)

• The Team
  – Jacek Cala
  – Paul Watson
  – Hugo Hiden