Activities of the COST D37 GridChem Computational Chemistry Workflow Group

Thomas Steinke, Zuse Institute Berlin (ZIB), www.zib.de
EGEE'07 Conference, Budapest, 01.10.2007


Page 1: Activities of the COST D37 GridChem Computational Chemistry Workflow Group

Thomas Steinke
Zuse Institute Berlin (ZIB), www.zib.de

Activities of the COST D37 GridChem Computational Chemistry Workflow Group

EGEE'07 Conference
Budapest
01.10.2007

Page 2

Partners in the CCWF Working Group

[Map of partner sites: Berlin, Erlangen, Manno, Zürich, London, Cambridge, Sevilla, København]

• Thomas Steinke, Tim Clark (DE)
• Hans-Peter Lüthi, Martin Brändle (CH)
• Peter Murray-Rust, Henry Rzepa (UK)
• Antonio Márquez (ES)
• Kurt Mikkelsen (DK)

- CSCS (Manno, CH)
- ZIB (Berlin, DE)

Page 3

“Traditional” Workflow in Computational Chemistry

Workflows have a long tradition in the CC domain.

In the 80’s – 90’s:

• start
• knowledge base (DB search)
• automated/manually edited molecular structures
• molecular simulations: method / program A, method / program B, …
• properties
• primary visualization / quality control
• analysis / archival / DB storage
• new insights?

Page 4

Databases: Computational protocol (T. Clark, 1998)

Complete protocol runs automatically with less than 0.5% failure rate.

• Cleanup
• 2D → 3D conversion
• VAMP optimization
• Calculate properties

~3,000 compounds per processor day (3 GHz Xeon)

B. Beck, A. Horn, J. E. Carpenter, and T. Clark, "Enhanced 3D-Databases: A Fully Electrostatic Database of AM1-Optimized Structures", J. Chem. Inf. Comput. Sci. 1998, 38, 1214-1217.

source: Tim Clark, Uni Erlangen
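A minimal Python sketch of how a protocol of this kind might be chained so that one failing compound does not stop the run, and so that a failure rate can be reported afterwards. Only the stage names come from the slide; the stage functions, the record format and the sample inputs are hypothetical placeholders, not the actual cleanup, conversion or VAMP codes.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Stage = Callable[[str], str]


def run_protocol(
    compounds: Iterable[str], stages: List[Tuple[str, Stage]]
) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Push each compound through all stages; record where failures occur."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str, str]] = []
    for compound in compounds:
        data = compound
        for stage_name, stage in stages:
            try:
                data = stage(data)
            except Exception as exc:  # one bad compound must not stop the run
                failures.append((compound, stage_name, str(exc)))
                break
        else:
            results[compound] = data  # reached only if no stage failed
    return results, failures


# Hypothetical stand-ins for the real programs behind each stage.
def cleanup(record: str) -> str:
    record = record.strip()
    if not record:
        raise ValueError("empty record")
    return record


def convert_2d_to_3d(record: str) -> str:
    return record + "|3D"


def vamp_optimize(record: str) -> str:
    return record + "|optimized"


def calculate_properties(record: str) -> str:
    return record + "|properties"


STAGES = [
    ("cleanup", cleanup),
    ("2D->3D conversion", convert_2d_to_3d),
    ("VAMP optimization", vamp_optimize),
    ("calculate properties", calculate_properties),
]

inputs = ["CCO", "   ", "c1ccccc1"]  # made-up records; the blank one fails
results, failures = run_protocol(inputs, STAGES)
failure_rate = len(failures) / len(inputs)
```

Recording the failing stage together with the compound is a design choice that makes it possible to separate bad input data from program problems later on.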

Page 5

Distributed Computing Environment in the 90’s

[Figure; label: QM packages]

Page 6

Distributed Computing Environment in the 90’s

Example: UniChem
distributed environment for quantum-chemical simulations
Cray Research Inc. 1991-(2004)

Page 7

CCWF Chemical Illustrator Applications

• Molecular design of functionalised enzynes
    Hans-Peter Lüthi, Martin Brändle, Zürich
    Peter Murray-Rust, Cambridge; Henry Rzepa, London

• Quantum chemical based QSAR/QSPR
    Tim Clark, Erlangen; Jon Essex, Southampton

• High-order dynamic and static electrostatic molecular properties
    Kurt Mikkelsen, Copenhagen

• Computational heterogeneous catalysis
    Antonio M. Márquez Cruz, Javier Fdez. Sanz, Sevilla

Page 8

Molecular Design Workflow (Enzyne Design)

Steps:

• Generation and archiving of data
• Extraction (XPath queries)
• Statistical analysis

[Workflow diagram; components: DB, QC Input, QC Application, QC Output, Parser, XML, XSLT, XPath Query, Statistical Analysis, Input, Output]

source: Hans-Peter Lüthi, ETH Zürich
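The extraction step can be illustrated with a short Python sketch that runs XPath-style queries over parsed QC output. It uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath. The XML layout (`calculations`, `calculation`, `property`), the identifiers and the numbers are invented for illustration and are not the schema used by the working group.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical parser output; element names and values are made up.
QC_OUTPUT_XML = """
<calculations>
  <calculation id="mol-001" method="B3LYP">
    <property name="total_energy" units="hartree">-76.4</property>
    <property name="dipole" units="debye">1.85</property>
  </calculation>
  <calculation id="mol-002" method="B3LYP">
    <property name="total_energy" units="hartree">-40.5</property>
  </calculation>
</calculations>
"""


def extract_property(xml_text: str, name: str) -> Dict[str, float]:
    """Return {calculation id: value} for every calculation that has `name`."""
    root = ET.fromstring(xml_text.strip())
    values: Dict[str, float] = {}
    for calc in root.findall("./calculation"):
        # Attribute predicate: pick the <property> child with a matching name.
        node = calc.find(f"./property[@name='{name}']")
        if node is not None and node.text is not None:
            values[calc.get("id")] = float(node.text)
    return values


energies = extract_property(QC_OUTPUT_XML, "total_energy")
mean_energy = sum(energies.values()) / len(energies)  # input to the statistics step
```

Keeping the query separate from the parser means new properties can be extracted later without re-running the quantum-chemical calculations.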

Page 9

Quantum Chemical Based QSAR and QSPR

• 2D-Database
• 2D → 3D (Conformations, Tautomers): generate structures, conformations and protonation states
• VAMP: semiempirical MO geometry optimization and electron density
• ParaSurf: generate isodensity surfaces, spherical-harmonic fits and local properties
• QSPR: apply models

Application areas: Virtual Screening, ADME/Tox., Pharmacokinetics, Molecular Info, Materials Design, Multiscale Modeling, Property Optimization

source: Tim Clark, Uni Erlangen

Page 10

Properties: Free Energies of Hydration

[Scatter plot: calculated vs. experimental Gsolv(H2O) in kcal mol^-1, both axes from -14 to 4]

N = 362
MUE = 0.85 kcal mol^-1
RMSD = 1.09 kcal mol^-1
r^2 = 0.88
q^2 = 0.83

source: Tim Clark, Uni Erlangen
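For readers unfamiliar with the quality measures quoted above, the following Python sketch shows one common way to compute MUE, RMSD and r^2 from paired experimental and calculated values. It assumes r^2 is the squared Pearson correlation coefficient, which the slide does not state. q^2 is usually a cross-validated quantity and is not computed here. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def regression_stats(
    experimental: Sequence[float], calculated: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (MUE, RMSD, r^2) for paired experimental/calculated values."""
    n = len(experimental)
    if n == 0 or n != len(calculated):
        raise ValueError("need two non-empty sequences of equal length")

    errors = [c - e for e, c in zip(experimental, calculated)]
    mue = sum(abs(d) for d in errors) / n              # mean unsigned error
    rmsd = math.sqrt(sum(d * d for d in errors) / n)   # root-mean-square deviation

    # Squared Pearson correlation (assumed definition of r^2).
    mean_e = sum(experimental) / n
    mean_c = sum(calculated) / n
    cov = sum((e - mean_e) * (c - mean_c) for e, c in zip(experimental, calculated))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    r2 = cov * cov / (var_e * var_c)
    return mue, rmsd, r2
```

Note that a perfectly correlated but systematically shifted or scaled model still gives r^2 = 1, which is why MUE and RMSD are reported alongside it.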

Page 11

Computing the NCI database (P. Murray-Rust, ’05)

MOPAC PM5

Workflow built with Taverna

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Page 12

Times to run jobs

[Plot: time / s (0 to 120,000) against (n basis functions)^4 (0 to 1.E+09)]

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute
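The x-axis of the plot suggests that run time was examined against the fourth power of the number of basis functions. As a sketch of how such a relation could be used to estimate job times in advance, the Python snippet below fits t ≈ k·n^4 by least squares through the origin. The strictly proportional form and the function names are assumptions, and any data passed in would have to come from real timings.

```python
from typing import Sequence


def fit_quartic_scaling(n_basis: Sequence[int], times_s: Sequence[float]) -> float:
    """Least-squares estimate of k in t = k * n**4 (line through the origin)."""
    x = [float(n) ** 4 for n in n_basis]
    # For a one-parameter model t = k*x, minimising sum (t - k*x)^2 gives:
    return sum(xi * ti for xi, ti in zip(x, times_s)) / sum(xi * xi for xi in x)


def predict_time(k: float, n_basis: int) -> float:
    """Predicted run time in seconds for a job with n_basis basis functions."""
    return k * float(n_basis) ** 4
```

A prediction of this kind is only a rough guide; the Challenges slide later in the talk notes that simulation times are sometimes unpredictable.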

Page 13

Protocol

[Flow diagram; boxes: Log Files, Parse, System Crashes, Program Crashes, Science Errors, Unsuitable Data, Pathological Behaviour, Analysis, Statistics, Other Science, Disseminate Results, Inform Developer]

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Page 14

Conclusions from NCI “Experiment” (2005)

• Protocols can be automated
• Machines can highlight unusual behaviour, geometries and distribution of results for humans to consider
• Computational programs can provide high quality “experimental” molecular properties

source: Peter Murray-Rust et al., Uni Cambridge / Unilever Institute

Page 15

Motivation

The orchestration of complex workflow scenarios is on today’s agenda.

• complex scientific solution paths linking in-house and (commercial) legacy codes
• transformation of scientific ventures into a scientifically validated protocol that allows highly (semi-)automated data generation (pre-processing) and data processing steps

Page 16

Goals of the CCWF Working Group

• implementation of workflow environments for QC by adapting standard (Grid) technologies
• fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format, to ensure interoperability of application programs and to support efficient access to chemical information based on a CC ontology
• implementation of computational chemistry illustrator scenarios to demonstrate the applicability of our approach

Page 17

Generic Workflow

1. Automatic generation + validation of input data
2. Submission, monitoring, and gathering of output data of simulation jobs
3. Integration of results (primary data) into project database
4. Data mining and visualization techniques to reduce complexity
5. Knowledge generation by applying methods of statistical analysis and pattern recognition
6. On-line publication and archiving of valuable scientific data

Page 18

Challenges

Diversity: Molecular properties derived from state functions obtained with electronic-structure methods.
  • ab-initio, semi-empirical, DFT, approximate potentials
  • Gaussian, COLUMBUS, Dalton, Turbomole, MOPAC, Vamp, CPMD…

Data formats: How to implement seamless data export/import?
  • ~80 relevant formats known in CC: XYZ, MDL, SDF, PDB, …
  • OpenBABEL
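To make the data-format problem concrete, here is a small Python sketch that reads and writes the simplest of the formats listed, XYZ (an atom count, a comment line, then one "symbol x y z" line per atom). It is a hand-written illustration under that assumed layout and not a substitute for a conversion tool such as OpenBABEL. The function names and the approximate water geometry are illustrative only.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> Tuple[str, List[Atom]]:
    """Parse a single-molecule XYZ block into (comment, atoms)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1].strip()
    atoms: List[Atom] = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


def write_xyz(comment: str, atoms: List[Atom]) -> str:
    """Serialise atoms back to XYZ text with fixed six-decimal coordinates."""
    body = [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment] + body) + "\n"


# Approximate water geometry in Angstrom, for illustration only.
WATER_XYZ = """3
water (illustrative coordinates)
O  0.000  0.000  0.000
H  0.757  0.586  0.000
H -0.757  0.586  0.000
"""

comment, atoms = parse_xyz(WATER_XYZ)
```

Even this trivial format carries no charges, bonds or units, which hints at why lossless conversion between the many formats in use is hard.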

Page 19

Challenges (cont.)

Scaling, Robustness, Load Balancing: I can handle O(10) jobs by hand but… what about campaigns of O(1000) jobs?
  • workflow system
  • computational resources
  • distributed computing
  • persistence, automated failure recovery, …
  • long simulation times, sometimes unpredictable

Acceptance: ease of use, GUI + CLI
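The persistence and automated failure recovery requirements listed above can be sketched in a few lines of Python. The snippet keeps campaign state in a JSON file so that a restarted run skips finished jobs, and it retries failed jobs a bounded number of times. The function `run_campaign`, the state-file layout and the `run_job` callback are hypothetical; a real workflow system would add scheduling, monitoring and security on top.

```python
import json
import os
from typing import Any, Callable, Dict, Iterable


def run_campaign(
    job_ids: Iterable[str],
    run_job: Callable[[str], Any],
    state_path: str,
    max_attempts: int = 3,
) -> Dict[str, Dict[str, Any]]:
    """Run jobs with retries; persist state after each job for later restarts."""
    state: Dict[str, Dict[str, Any]] = {}
    if os.path.exists(state_path):  # resume from an earlier, interrupted run
        with open(state_path) as fh:
            state = json.load(fh)

    for job_id in job_ids:
        if state.get(job_id, {}).get("status") == "done":
            continue  # finished in a previous run, do not resubmit
        for attempt in range(1, max_attempts + 1):
            try:
                result = run_job(job_id)  # result must be JSON-serialisable
                state[job_id] = {"status": "done", "result": result,
                                 "attempts": attempt}
                break
            except Exception as exc:
                state[job_id] = {"status": "failed", "error": str(exc),
                                 "attempts": attempt}
        with open(state_path, "w") as fh:  # checkpoint after every job
            json.dump(state, fh)
    return state
```

Writing the checkpoint after every job trades some I/O for the guarantee that at most one job's work is lost if the driver itself crashes.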

Page 20

What I Want…

• ease of use: workflow orchestration, usage, installation / maintenance
• sharing of workflow descriptions with my colleagues: standard languages
• support in a heterogeneous environment: laptop – server – cluster – supercomputer – grid

Page 21

Which Workflow System?

… to be spoilt for choice?

Page 22

Some Assessment Criteria

• workflows in distributed systems
• supported batch systems: PBS (, LSF)
• support for managing large files
• recovery / backup
• quality of the documentation
• customizability
• PKI / security
• required installation effort
• Web interface
• WF language
• robustness, stability
• Grid environment
• open source
• restart/stop/debugging
• user/installation base
• status & exception handling
• legacy codes and Web services
• project development activity
• GUI

Page 23

TRIANA Experiences (2005/06)

+ workflow orchestration
+ integration of web services
+ semantic check of WSDL files
+ support for self-written Triana modules
+ negligible control logic overhead
+ pre-requisite for migration to Grid environments

- proprietary workflow description language in TRIANA (BPEL is announced)
- GUI robustness for very complex workflow definitions

Page 24

GWES Experiences (MediGRID, since 2006)

+ integration of web services and legacy codes
+ monitoring + debugging support
+ Grid environments
+ under active development (A. Hoheisel et al./FhG FIRST)

- workflow orchestration (WF GUI builder in preparation)
- proprietary workflow description language

Page 26

OMII Server: Attracting Features

Workflows
  • language: BPEL (Active BPEL)
  • WF editor (Eclipse)
  • Web Services customization

Jobs
  • submission & monitoring via WS job manager API
  • persistent (job recovery), in-memory (via Hibernate)

Distributed Resource Management (DRM)
  • Condor-G, Globus GRAM
  • SSH-exec
  • your own plug-ins, e.g. PBS

Data
  • GridSAM file staging support within job (JSDL): file stage in/out
  • Apache Virtual File System library (vfs)
      FTP, local files, http, https, sftp
      zip, jar, tar, bzip2, gzip
      ram - data in memory
  • GridFTP

Page 27

OMII/Active BPEL Experiences (3 months)

+ workflow orchestration (Eclipse plugin)
+ standardized WF language
+ monitoring support
+ Grid environments
+ security features: https + signed messages (X.509 cert.)
+ active development (UK eScience)

- deployment requires manual workarounds
- learning barrier (BPEL)
- BPEL editor not fully mature (validation of BPEL workflows)

Page 28

Summary

• there are a number of workflow systems available
• design/development of workflow systems is still ongoing research
• no system has been chosen yet for our working group
• barriers:
    • ease of use vs. robustness
    • middleware stack: more complicated Grid environments vs. script-based approaches on clusters
    • standards vs. proprietary but powerful/sufficient WF languages
• BPEL has a high chance to survive

Page 29

Acknowledgement

Core members of D37 CCWF working group
  • Hans-Peter Lüthi, ETH Zurich
  • Tim Clark, CCC Uni Erlangen
  • J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Uni Cambridge/Unilever Inst.

Developers of the workflow systems mentioned in this talk

Page 30

QUESTIONS?