Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Page 1: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Scientific Data & Workflow Engineering
Preliminary Notes from the Cyberinfrastructure Trenches

Bertram Ludäscher

Associate Professor
Dept. of Computer Science & Genome Center
University of California, Davis

Fellow
San Diego Supercomputer Center
University of California, San Diego

UC DAVIS Department of Computer Science

Page 2: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Scientific Data & WF Engineering, B. Ludäscher, Nov. 15th 2004

Outline

• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary

Page 3: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Science Environment for Ecological Knowledge (SEEK)

Overview
• Domain Science Driver
– Ecology (LTER), biodiversity, …
• Analysis & Modeling System
– Design & execution of ecological models & analysis (“scientific workflows”)
– {application,upper}-ware: Kepler system
• Semantic Mediation System
– Data integration of hard-to-relate sources and processes
– Semantic types and ontologies
– upper middleware: Sparrow Toolkit
• EcoGrid
– Access to ecology data and tools
– {middle,under}-ware: unified API to SRB/MCAT, MetaCat, DiGIR, … datasets

sample CS problem [DILS’04]

Page 4: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Common CI Infrastructure Pieces

• Other CI projects (e.g. GEON, …) have similar service-oriented architectures:
– Seamless and uniform data access (“Data-Grid”)
  • data & metadata registry
– Distributed and high-performance computing platform (“Compute-Grid”)
  • service registry
– Federated, integrated, mediated databases
  • often use of semantic extensions (e.g. ontologies)
– User-friendly workbench / problem-solving environment: scientific workflows
• Add to this sensors, observing systems, …

Page 5: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

… Example: Realtime Environment for Analytical Processing (REAP vision)

Page 6: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

The Great Unified System

• Many engineering and CS challenges! … we’ll see some …
• Our focus:
– Scientific data integration
  • How to associate, mediate, integrate complex scientific data?
– Scientific workflows
  • How to devise larger scientific workflows for process automation from individual components (e.g. web services)?
• Disclaimer: … often scratching the surface; see references & research literature for details …

Page 7: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Outline

• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary

Page 8: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

An Online Shopper’s Information Integration Problem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”

? Information Integration

addall.com

“One-World” Mediation

amazon.com, A1books.com, half.com, barnes&noble.com

Mediator (virtual DB) (vs. data warehouse)
NOTE: non-trivial data engineering challenges!

Page 9: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

A Home Buyer’s Information Integration Problem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?

? Information Integration

Realtor, Demographics, School Rankings, Crime Stats

“Multiple-Worlds” Mediation

Page 10: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

A Neuroscientist’s Information Integration Problem

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?

? Information Integration

protein localization (NCMIR)
neurotransmission (SENSELAB)
sequence info (CaPROT)
morphometry (SYNAPSE)

“Complex Multiple-Worlds” Mediation

Biomedical Informatics Research Network
http://nbirn.net

Inter-source links:
• unclear for the non-scientists
• hard for the scientist

Page 11: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Page 12: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Interoperability & Integration Challenges

• System aspects: “Grid” Middleware
  • distributed data & computing, SOA
  • web services, WSDL/SOAP, WSRF, OGSA, …
  • sources = functions, files, data sets …
• Syntax & Structure: (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description logics (RDF(S), OWL ...)
  • sources = knowledge bases (DB+CMs+ICs)
• Synthesis: Scientific Workflow Design & Execution
  • Composition of declarative and procedural components into larger workflows
  • (re)sources = services, processes, actors, …

reconciling S5 heterogeneities
“gluing” together resources
bridging information and knowledge gaps computationally

Page 13: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Information Integration Challenges: S4 Heterogeneities

• System aspects
– platforms, devices, data & service distribution, APIs, protocols, …
  Grid middleware technologies
  + e.g. single sign-on, platform independence, transparent use of remote resources, …
• Syntax & Structure
– heterogeneous data formats (one for each tool ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
– heterogeneous schemas (one for each DB ...)
  Database mediation technologies
  + XML-based data exchange, integrated views, transparent query rewriting
• Semantics
– descriptive metadata, different terminologies, “hidden” semantics (context), implicit assumptions, …
  Knowledge representation & semantic mediation technologies
  + “smart” data discovery & integration
  + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!

Page 14: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Information Integration Challenges: S5 Heterogeneities

• Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”
– How to make use of these wonderful things & put them together to solve a scientist’s problem?

Scientific Problem Solving Environments (PSEs)
  Portals, Workbench (“scientist’s view”)
  + ontology-enhanced data registration, discovery, manipulation
  + creation and registration of new data products from existing ones, …
  Scientific Workflow System (“engineer’s view”)
  + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools …
  + e.g., creation of new datasets from existing ones, dataset registration, …

Page 15: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Information Integration from a Database Perspective

• Information Integration Problem
– Given: data sources S1, ..., Sk (databases, web sites, ...) and user questions Q1, ..., Qn that can (in principle) be answered using the information in the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
  Si has a schema (relational, XML, OO, ...)
  Si can be queried
  define virtual (or materialized) integrated (or global) view G over local sources S1, ..., Sk using database query languages (SQL, XQuery, ...)
  questions become queries Qi against G(S1, ..., Sk)

Page 16: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Standard (XML-Based) Mediator Architecture

USER/Client
1. Query Q ( G (S1, ..., Sk) )

MEDIATOR
Integrated Global (XML) View G
Integrated View Definition: G(..) ← S1(..) … Sk(..)
2. Query rewriting
3. Q1 Q2 Q3
4. {answers(Q1)} {answers(Q2)} {answers(Q3)}
5. Post processing
6. {answers(Q)}

Sources S1, S2, …, Sk: each behind a Wrapper that exposes an (XML) View
web services as wrapper APIs
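The six numbered steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the architecture of any system named in these slides: the wrapper functions, the sample rows, and the name `mediate` are all hypothetical, and the "global view" is simply the union of the wrapped sources.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical wrappers: each exposes one source as a list of rows (its "view").
def wrap_s1() -> List[Row]:
    return [{"title": "Tractatus", "price": 12.0, "shipping": 4.0}]

def wrap_s2() -> List[Row]:
    return [{"title": "Tractatus", "price": 10.0, "shipping": 7.5},
            {"title": "Investigations", "price": 15.0, "shipping": 3.0}]

WRAPPERS: Dict[str, Callable[[], List[Row]]] = {"S1": wrap_s1, "S2": wrap_s2}

def mediate(predicate: Callable[[Row], bool]) -> List[Row]:
    """Step 1: accept a selection query against the global view G (union of sources)."""
    answers: List[Row] = []
    # Steps 2-3: rewrite Q into one sub-query per source (here: the same predicate).
    for name, wrapper in WRAPPERS.items():
        # Step 4: collect each source's answers, tagging rows with their origin.
        answers.extend({**row, "source": name} for row in wrapper() if predicate(row))
    # Step 5: post-processing, here ordering by total cost.
    answers.sort(key=lambda r: r["price"] + r["shipping"])
    # Step 6: return the combined answers to the client.
    return answers

cheapest = mediate(lambda r: r["title"] == "Tractatus")[0]
```

A real mediator would push different sub-queries to different sources and would not fetch whole views; the sketch only shows where each step sits.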

Page 17: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Query Planning in Data Integration

• Given:
– Declarative user query Q: answer(…) ← … G ...
– … & { G … ← … S … } global-as-view (GAV)
– … & { S … ← … G … } local-as-view (LAV)
– … & { ic(…) ← … S … G … } integrity constraints (ICs)
• Find:
– equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) ← … S …
  query rewriting (logical/calculus, algebraic, physical levels)
• Results:
– A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, … , undecidable
– hot research area in core CS (database community)
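For the GAV case, the simplest rewriting is view unfolding: each global atom in the query body is replaced by the body of its view definition. The sketch below assumes a toy representation (atoms as tuples, one rule per global relation, view bodies containing only variables); the relation names and the function `unfold` are hypothetical. LAV rewriting and integrity constraints need considerably more machinery and are not shown.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments)

# Hypothetical GAV definition: G(X, Y) <- S1(X, Z), S2(Z, Y)
GAV: Dict[str, Tuple[Tuple[str, ...], List[Atom]]] = {
    "G": (("X", "Y"), [("S1", ("X", "Z")), ("S2", ("Z", "Y"))]),
}

def unfold(body: List[Atom]) -> List[Atom]:
    """Replace every global atom in a query body by its view definition."""
    plan: List[Atom] = []
    for n, (pred, args) in enumerate(body):
        if pred not in GAV:
            plan.append((pred, args))  # already a source relation
            continue
        head_vars, view_body = GAV[pred]
        subst = dict(zip(head_vars, args))
        for vpred, vargs in view_body:
            # Head variables take the query's arguments; the remaining
            # (existential) variables get a fresh name per unfolded atom.
            plan.append((vpred, tuple(subst.get(v, f"{v}_{n}") for v in vargs)))
    return plan

# Query: answer(A) <- G(A, "b")
plan = unfold([("G", ("A", '"b"'))])
# plan is now: S1(A, Z_0), S2(Z_0, "b")
```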

Page 18: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Scientific Data Integration using Semantic Extensions

Page 19: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Page 20: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Example: Geologic Map Integration

• Given:
– Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
– Different ontologies:
  • Geologic age ontology (e.g. USGS)
  • Rock classification ontologies:
    – Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
    – Single hierarchy from British Geological Survey (BGS)
• Problem:
– Support uniform queries across all maps
– … using different ontologies
– Support registration w/ ontology A, querying w/ ontology B

Page 21: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Schema Integration (“registering” local schemas to the global schema)

[Figure: state geologic map sources (Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana E., Idaho, Montana West) registered to an integration schema with attributes Formation, Age, Composition, Fabric, Texture. Local attribute names include ABBREV, PERIOD, NAME, TYPE, TIME_UNIT, FMATN, FORMATION, LITHOLOGY, AGE. Example values: andesitic sandstone, Livingston formation, Tertiary-Cretaceous.]
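Registering a local schema amounts to recording how its attribute names map onto the integration schema. A minimal sketch follows; the source names `source_a` and `source_b`, the assignment of column names to sources, the extra `SHAPE_ID` column, and the function `to_global` are assumptions made for illustration, since the slide text does not say which state map uses which column.

```python
from typing import Dict

# Hypothetical registrations: local attribute name -> integration-schema attribute.
REGISTRATION: Dict[str, Dict[str, str]] = {
    "source_a": {"FMATN": "Formation", "PERIOD": "Age"},
    "source_b": {"NAME": "Formation", "TIME_UNIT": "Age", "LITHOLOGY": "Composition"},
}

def to_global(source: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rename a local record's attributes to the integration schema.

    Attributes that were never registered are dropped.
    """
    mapping = REGISTRATION[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

row = to_global("source_b", {"NAME": "Livingston formation",
                             "TIME_UNIT": "Tertiary-Cretaceous",
                             "LITHOLOGY": "andesitic sandstone",
                             "SHAPE_ID": "17"})
```

Once every source is registered this way, a query over Formation or Age can be posed once and answered from all maps.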

Page 22: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC)

Composition, Genesis, Fabric, Texture

Page 23: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Ontology-Enabled Application Example: Geologic Map Integration

Show formations where AGE = ‘Paleozoic’ (without age ontology)

Show formations where AGE = ‘Paleozoic’ (with age ontology)

+/- a few hundred million years

domain knowledge
Knowledge representation
Geologic Age ONTOLOGY

Nevada
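The difference between the two queries can be sketched as query expansion over an age hierarchy: without the ontology only the literal value matches, with it every sub-period matches too. The hierarchy fragment below follows the standard geologic time scale; the formation records and function names are hypothetical.

```python
from typing import Dict, List, Set

# Fragment of the geologic time scale as a "contains" hierarchy.
AGE_CHILDREN: Dict[str, List[str]] = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}

def expand(age: str) -> Set[str]:
    """Return the age term plus every term below it in the hierarchy."""
    terms = {age}
    for child in AGE_CHILDREN.get(age, []):
        terms |= expand(child)
    return terms

# Hypothetical map records.
FORMATIONS = [
    {"name": "F1", "age": "Paleozoic"},
    {"name": "F2", "age": "Devonian"},
    {"name": "F3", "age": "Pennsylvanian"},
    {"name": "F4", "age": "Jurassic"},
]

def formations_where_age(age: str, use_ontology: bool) -> List[str]:
    wanted = expand(age) if use_ontology else {age}
    return [f["name"] for f in FORMATIONS if f["age"] in wanted]

without_ontology = formations_where_age("Paleozoic", use_ontology=False)
with_ontology = formations_where_age("Paleozoic", use_ontology=True)
```

Here the plain query finds only the record labelled exactly ‘Paleozoic’, while the expanded query also finds the Devonian and Pennsylvanian records.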

Page 24: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Querying by Geologic Age …

Page 25: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Querying by Geologic Age: Results

Page 26: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Querying by Chemical Composition … (GSC)

Page 27: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Semantic Mediation (via “semantic registration” of schemas and ontology articulations)

• Schema elements and/or data values are associated with concept expressions from the target ontology
  conceptual queries “through” the ontology
• Articulation ontology
  source registration to A, querying through B
• Semantic mediation: query rewriting w/ ontologies

[Figure labels: Database1, Database2, Ontology A, Ontology B, semantic registration, ontology articulations, concept-based (“semantic”) queries]
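“Registration to A, querying through B” can be sketched as a two-step lookup: an articulation maps a concept of ontology B to the concepts of ontology A it covers, and the registered data is then selected by those A concepts. All concept names, record ids, and the function `query_through_b` below are hypothetical, and the articulation is reduced to a plain lookup table rather than description-logic reasoning.

```python
from typing import Dict, List, Set

# Semantic registration: each database's records are tagged with ontology-A concepts.
REGISTERED: Dict[str, List[Dict[str, str]]] = {
    "Database1": [{"id": "d1-1", "concept": "A:Sandstone"}],
    "Database2": [{"id": "d2-1", "concept": "A:Shale"},
                  {"id": "d2-2", "concept": "A:Granite"}],
}

# Articulation (assumed links): a B concept covers these A concepts.
ARTICULATION: Dict[str, Set[str]] = {
    "B:ClasticSedimentaryRock": {"A:Sandstone", "A:Shale"},
}

def query_through_b(concept_b: str) -> List[str]:
    """Rewrite a concept query posed in ontology B into lookups over A-registered data."""
    targets = ARTICULATION.get(concept_b, set())
    return sorted(rec["id"]
                  for rows in REGISTERED.values()
                  for rec in rows
                  if rec["concept"] in targets)

hits = query_through_b("B:ClasticSedimentaryRock")
```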

Page 28: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Different views on State Geological Maps

Page 29: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Sedimentary Rocks: BGS Ontology

Page 30: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Sedimentary Rocks: GSC Ontology

Page 31: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Implementation in OWL: Not only “for the machine” …

Page 32: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Source Contextualization through Ontology Refinement

In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...

sources can register new concepts at the mediator ...

Page 33: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Outline

• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary

Page 34: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

What is a Scientific Workflow (SWF)?

• Goals:
– automate a scientist’s repetitive data management and analysis tasks
– typical phases:
  • data access, scheduling, generation, transformation, aggregation, analysis, visualization
  design, test, share, deploy, execute, reuse, … SWFs

Page 35: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Promoter Identification Workflow

Source: Matt Coleman (LLNL)

Page 36: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Source: NIH BIRN (Jeffrey Grethe, UCSD)

Page 37: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: workflow with steps EcoGrid Query, Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, Generate Metadata, and Archive To Ecogrid, fed by the User and by Registered Ecogrid Databases. Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps. Annotations: +A1, +A2, +A3.]

Source: NSF SEEK (Deana Pennington et al., UNM)

Page 38: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Page 39: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Commercial & Open Source Scientific “Workflow” (well, Dataflow) Systems

Kensington Discovery Edition from InforSense
Taverna
Triana

Page 40: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming

Steve Parker (cs.utah.edu)

Page 41: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Ptolemy II

see! try! read!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Page 42: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Why Ptolemy II (and thus KEPLER)?

• Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
• User-Orientation
– Workflow design & exec console (Vergil GUI)
– “Application/Glue-Ware”
  • excellent modeling and design support
  • run-time support, monitoring, …
  • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
  • but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
– Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)
– open source system
– Ptolemy II folks actively participate in KEPLER

Page 43: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

KEPLER/CSP: Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)

Ilkay Altintas: SDM, Resurgence
Kim Baldridge: Resurgence, NMI
Chad Berkley: SEEK
Shawn Bowers: SEEK
Terence Critchlow: SDM
Tobin Fricke: ROADNet
Jeffrey Grethe: BIRN
Christopher H. Brooks: Ptolemy II
Zhengang Cheng: SDM
Dan Higgins: SEEK
Efrat Jaeger: GEON
Matt Jones: SEEK
Werner Krebs: EOL
Edward A. Lee: Ptolemy II
Kai Lin: GEON
Bertram Ludaescher: SEEK, GEON, SDM, ROADNet, BIRN
Mark Miller: EOL
Steve Mock: NMI
Steve Neuendorffer: Ptolemy II
Jing Tao: SEEK
Mladen Vouk: SDM
Xiaowen Xin: SDM
Yang Zhao: Ptolemy II
Bing Zhu: SEEK
•••

Ptolemy II

www.kepler-project.org

Page 44: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

KEPLER: An Open Collaboration

• Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)
• Open Source (BSD-style license)
• Intensive Communications:
– Web-archived mailing lists
– IRC (!)
• Co-development:
– via shared CVS repository
– joining as a new co-developer (currently):
  • get a CVS account (read-only)
  • local development + contribution via existing KEPLER member
  • be voted “in” as a member/co-developer
• Software & social engineering
– How to better accommodate new groups/communities?
– How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

Page 45: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Ptolemy II/KEPLER GUI (Vergil)

“Directors” define the component interaction & execution semantics

Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

Page 46: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

KEPLER/Ptolemy II GUI refined

Ontology-based actor (service) and dataset search

Result Display

Page 47: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Web Services Actors (WS Harvester)

[Figure: screenshots with numbered callouts 1–4]

• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components

Page 48: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Some Recent Actor Additions

Page 49: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

An “early” example: Promoter Identification
SSDBM, AD 2003

• Scientist models application as a “workflow” of connected components (“actors”)

• If all components exist, the workflow can be automated/executed

• Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
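
The role of the director can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kepler or Ptolemy II code: the actors `to_upper` and `add_suffix` and both director functions are hypothetical names, and the generator chain only mimics token-by-token streaming in a single thread, whereas a process-network style director would run actors concurrently.

```python
from typing import Callable, Iterable, Iterator, List

Actor = Callable[[str], str]

# Hypothetical actors: each turns one input token into one output token.
def to_upper(token: str) -> str:
    return token.upper()

def add_suffix(token: str) -> str:
    return token + "!"

def batch_director(actors: List[Actor], tokens: Iterable[str],
                   trace: List[str]) -> List[str]:
    """Run each actor over ALL tokens before the next actor starts."""
    data = list(tokens)
    for actor in actors:
        out = []
        for t in data:
            trace.append(f"{actor.__name__}({t})")
            out.append(actor(t))
        data = out
    return data

def pipelined_director(actors: List[Actor], tokens: Iterable[str],
                       trace: List[str]) -> Iterator[str]:
    """Stream tokens: a token moves to the next actor as soon as it is produced."""
    def stage(actor: Actor, upstream: Iterator[str]) -> Iterator[str]:
        for t in upstream:
            trace.append(f"{actor.__name__}({t})")
            yield actor(t)

    stream: Iterator[str] = iter(tokens)
    for actor in actors:
        stream = stage(actor, stream)
    return stream

workflow = [to_upper, add_suffix]
batch_trace: List[str] = []
pipe_trace: List[str] = []
batch_result = batch_director(workflow, ["a", "b"], batch_trace)
pipe_result = list(pipelined_director(workflow, ["a", "b"], pipe_trace))
```

The same workflow graph yields the same result under both directors; only the firing order recorded in the traces differs, which is the sense in which the director, not the actors, fixes the execution model.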

Page 50: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Reengineering a Geoscientist’s Mineral Classification Workflow

Page 51: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Job Management (here: NIMROD)

• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter ’04

Page 52: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

ORB

Page 53: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Rapid Web Service-based Prototyping
(Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)

Source: Ilkay Altintas, SDM, NLADR
ROADNet: Vernon, Orcutt et al.
Web services: Tony Fountain et al.

Page 54: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

in KEPLER (w/ editable script)

Source: Dan Higgins, Kepler/SEEK

Page 55: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

in KEPLER (interactive session)

Source: Dan Higgins, Kepler/SEEK

Page 56: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Blurring Design (ToDo) and Execution

Page 57: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Scientific Workflow Challenges

• Typical Features
  – data-intensive and/or compute-intensive
  – plumbing-intensive (consecutive web services won’t fit)
  – dataflow-oriented
  – distributed (remote data, remote processing)
  – user-interaction “in the middle”, …
  – … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
  – advanced programming constructs (map(f), zip, takewhile, …)
  – logging, provenance, “registering back” (intermediate) products…

Page 58: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

[Figure: workflow screenshot with the following callouts]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit (appears twice)
• hand-crafted Web-service actor
• Complex backward control-flow
• No data transformations available

[Altintas-et-al-PIW-SSDBM’03]

Page 59: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

A Scientific Workflow Problem: Solved (Computer Scientist’s view)

• Solution based on declarative, functional dataflow process network (= also a data streaming model!)
• Higher-order constructs: map(f)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Figure: redesigned workflow screenshot with the following callouts]
• map(f)-style iterators
• Powerful type checking
• Generic, declarative “programming” constructs
• Generic data transformation actors
• Forward-only, abstractable sub-workflow piw(GeneId)
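
The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be sketched as follows. This is an illustration only: `piw` here is a hypothetical stub standing in for the single-gene sub-workflow, `map_actor` is an assumed helper name, and a thread pool is just one way to exploit the fact that the items are independent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def piw(gene_id: str) -> str:
    """Hypothetical stand-in for the single-gene sub-workflow piw(GeneId)."""
    return f"promoter-report({gene_id})"

def map_actor(f: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Lift a per-item workflow f to a workflow over lists: map(f)."""
    def lifted(items: List[A]) -> List[B]:
        # Items do not depend on each other, so they may run concurrently;
        # Executor.map returns the results in input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(f, items))
    return lifted

# PIW := map(piw) over [GeneId]
PIW = map_actor(piw)
reports = PIW(["NM_001924", "NM020375"])
```

No loop counters or backward control-flow edges appear in the workflow itself; the iteration lives entirely inside the generic `map_actor` construct.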

Page 60: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Promoter Identification Workflow Redesigned

map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}

Page 61: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

A Research Problem: Optimization by Rewriting

• Example: PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell

[Figure: workflow screenshot with the following callouts]
• map(f o g) instead of map(f) o map(g)
• Combination of map and zip

http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
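
The rewrite rule map(f o g) = map(f) o map(g) can be checked on a small example. The slide's own specification is in Haskell; the Python below is only a sketch, and the per-item steps `f` and `g` are hypothetical. The rule is safe here because both steps are free of side effects.

```python
from typing import Callable, List

def compose(f: Callable, g: Callable) -> Callable:
    """compose(f, g)(x) == f(g(x)), i.e. 'f o g'."""
    return lambda x: f(g(x))

def map_actor(f: Callable) -> Callable[[List], List]:
    """map(f): apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]

# Hypothetical side-effect-free per-item steps.
def g(seq: str) -> str:
    return seq.strip().upper()

def f(seq: str) -> int:
    return len(seq)

# map(f) o map(g): two passes, with an intermediate list in between.
unfused = compose(map_actor(f), map_actor(g))
# map(f o g): one pass, no intermediate list.
fused = map_actor(compose(f, g))

data = [" cagt ", "gggga"]
unfused_result = unfused(data)
fused_result = fused(data)
```

Both forms compute the same list, so an optimizer is free to pick whichever suits the execution environment.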

Page 62: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

A KR+DI+Scientific Workflow Problem

• Services can be semantically compatible, but structurally incompatible

[Figure: SourceService (Ps) and TargetService (Pt) joined by a Desired Connection. Against Ontologies (OWL), SemanticType Ps and SemanticType Pt are Compatible (⊑); StructuralType Ps and StructuralType Pt are Incompatible (⋠). Further labels in the figure: (Ps), (≺).]

Source: [Bowers-Ludaescher, DILS’04]

Page 63: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Ontology-Informed Data Transformation (“Structure-Shim”)

[Figure: SourceService (Ps) and TargetService (Pt) joined by a Desired Connection. Against Ontologies (OWL), SemanticType Ps and SemanticType Pt are Compatible (⊑). Further labels in the figure: StructuralType Ps, StructuralType Pt, Registration Mapping (Output), Registration Mapping (Input), Correspondence, Generate (Ps), Transformation.]

Source: [Bowers-Ludaescher, DILS’04]
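
One possible reading of the structure-shim idea, reduced to flat records, is sketched below. Everything in it is an assumption made for illustration: the tiny is-a table stands in for an OWL ontology, the registration mappings are plain dictionaries from field names to concepts, and the field and concept names are invented. The framework cited on the slide works with richer structural types and mappings.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical mini-ontology: concept -> parent concept (is-a).
IS_A: Dict[str, str] = {"SpeciesName": "TaxonName"}

def subsumed_by(concept: Optional[str], ancestor: str) -> bool:
    """True if concept ⊑ ancestor, following is-a links upward."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

# Hypothetical registration mappings: structural field -> semantic concept.
source_output_reg = {"sp_name": "SpeciesName", "n": "Count"}
target_input_reg = {"taxon": "TaxonName", "abundance": "Count"}

def generate_shim(source_reg: Dict[str, str],
                  target_reg: Dict[str, str]) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Derive field correspondences via the ontology, then build the transformation."""
    correspondence: Dict[str, str] = {}
    for t_field, t_concept in target_reg.items():
        matches = [s for s, c in source_reg.items() if subsumed_by(c, t_concept)]
        if len(matches) != 1:
            raise ValueError(f"no unique source field for target field {t_field!r}")
        correspondence[t_field] = matches[0]

    def shim(record: Dict[str, Any]) -> Dict[str, Any]:
        # Restructure a source record into the layout the target expects.
        return {t: record[s] for t, s in correspondence.items()}
    return shim

shim = generate_shim(source_output_reg, target_input_reg)
converted = shim({"sp_name": "Quercus alba", "n": 3})
```

The two services never agree on field names; the connection is made only because both were registered against the same concepts.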

Page 64: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Outline

• Introduction: CI Sample Architectures

• Scientific Data Integration

• Scientific Workflow Management

• Links & Crystallization Points

• Lessons learnt & Summary

Page 65: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Link-Up & Crystallization Points

• Shared (Domain) Science Vision, Goals
  – NVO, SCEC, Human Genome Project, …
• Technology Waves
  – XML, web services, WSRF, Semantic Web (OWL), Portlets, …
• Standards for data exchange, metadata, data access protocols, …
  – GML, EML, netCDF, HDF, …, ADN, …, DODS/OpenDAP, …
  – Organizations: W3C, GGF, …
• Community ontologies
  – GO (Gene Ontology), ecoinformatics, seismology, geochemistry, …
  – … from Saulus to Paulus …
• Shared Community Tools and Tool Co-Development
  – SRB, Globus, …, Kepler, …

Page 66: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Shared Science Vision, Goals: SCEC/CME

Southern California Earthquake Center / Community Modeling Environment Project

Simulation of Seismic Wave Propagation of a Magnitude 7.7 Earthquake on San Andreas Fault
  – PIs: Thomas Jordan, Bernard Minster, Reagan Moore, Carl Kesselman
  – Simulation
    • 240 processors for 5 days
    • 47 terabytes of data generated
  – SDSC SAC project optimized code on DataStar parallel computer (both MPI I/O management and checkpointing)
  – Future simulation: increasing the resolution by a factor of 2 implies 1 PB of simulation results and 1000 processors for 20 days

Source: Reagan Moore, SDSC
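
The projected figures for the future simulation are roughly consistent with a simple scaling estimate. The factor 2**4 below is an assumption not stated on the slide: doubling the resolution in three spatial dimensions and also doubling the number of time steps. Under that assumption:

```python
# Figures taken from the slide.
base_processors = 240
base_days = 5
base_output_tb = 47

# Assumption (not on the slide): cost and output grow by 2 per spatial
# dimension (x3) and by 2 for the time steps, i.e. 2**4 = 16 overall.
scale = 2 ** 4

base_processor_days = base_processors * base_days           # 1200
projected_processor_days = base_processor_days * scale      # 19200
projected_output_tb = base_output_tb * scale                # 752

# Slide's projection: 1000 processors for 20 days, about 1 PB of results.
slide_processor_days = 1000 * 20                            # 20000
relative_gap = abs(slide_processor_days - projected_processor_days) / slide_processor_days
```

The estimate gives 19,200 processor-days against the slide's 20,000, and 752 TB against the slide's rounded 1 PB, so the slide's numbers are of the order this scaling would predict.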

Page 67: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Example: NVO Community Processes

- created standard data encoding format (FITS image format)
- made accessible common digital holdings (sky survey images)
- defined Uniform Content Descriptors (common metadata attributes)
- created standard services (standard access mechanisms to catalogs and surveys)
- created digital library (manage derived data products)
- created portals (for combining services interactively)
- created processing pipelines (for automated processing)
- created preservation environment

• Broader impact: found a new star!

Source: Reagan Moore, SDSC

Page 68: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Semantic Mediation “Waterfall” …

[Figure: waterfall diagram labelled “Iterative Development”, with the stages:]
• Ontologies
• Semantic Data, Service Annotation
• Resource Discovery
• Resource Integration
• Workflow Planning
• Workflow Analysis

Source: Shawn Bowers, SEEK AHM’04

Page 69: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

GEON Dataset Generation & Registration
(a co-development in KEPLER)

[Figure: workflow screenshot with the following callouts]
• Xiaowen (SDM)
• Edward et al. (Ptolemy)
• Yang (Ptolemy)
• Efrat (GEON)
• Ilkay (SDM)
• SQL database access (JDBC)
• Matt, Chad, Dan et al. (SEEK)

% Makefile
$> ant run

Page 70: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

KEPLER as a Melting Pot …

• A grass-roots project
  – Needed a coalition of the (really!) willing
• Inter-project links
  – SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming …), UK eScience myGrid, …
• Intra-project links
  – e.g. in SEEK: AMS SMS EcoGrid
• Inter-technology links
  – Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, …
• Interdisciplinary links
  – CS, IT, domain sciences, … (recently: usability engineer)

Page 71: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Outline

• Introduction: CI Sample Architectures

• Scientific Data Integration

• Scientific Workflow Management

• Links & Crystallization Points

• Lessons learnt & Summary

Page 72: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Some Lessons Learnt

• Eat your own dog-food (or at least try…)
  – start using your own (CI) tools early
• Collaboration tools
  – CVS repositories (+cvsview, webcvs)
  – Mailing lists (e.g. mailman googlified)
  – Bugzilla (detailed tracking of tech. issues & bugs)
  – Wiki (community authored web resource, e.g. high-level tech. issues)
• Where is the XYZ repository/registry?
  – EcoGrid (SEEK) registry, GEON registry, KEPLER actor & datasets repository, …
  – UDDI what?
• CI Melting Pots: SDSC, …
  – NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …

Page 73: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Q & A

Page 74: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Further Reading

Page 75: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Related Publications

• Semantic Data Registration and Integration
  – On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  – A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  – Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  – The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
  – Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
• Query Planning and Rewriting
  – Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04), Paris, France, June 2004.
  – Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher, 9th Intl. Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 2004, LNCS 2992.
  – Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04), Boston, IEEE Computer Society, April 2004.

Page 76: Scientific Data & Workflow Engineering Preliminary Notes from the Cyberinfrastructure Trenches

Related Publications

• Scientific Workflows
  – Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  – Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.
  – An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004, Leipzig, Germany, LNCS 2994.
  – A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludäscher, A. Memon, in the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.