Goals
Enhance scientific productivity through:
  - Discovery and application of datasets and programs at petabyte scale
  - Enabling use of a worldwide data grid as a scientific workstation
Slide 3
Goals of using grids through scripting
  - Provide an easy on-ramp to the grid
  - Utilize massive resources with simple scripts
  - Leverage multiple grids like a workstation
  - Empower script-writers to empower end users
  - Track and leverage provenance in the science process
Slide 4
Classes of Workflow Systems
  - Earlier generation business workflow systems
      - Document management, forms processing, etc.
  - Scientific laboratory management systems
      - LIMS, wet lab workflow
  - Application-oriented workflow
      - Kepler, DAGMan, P-Star, VisTrails, Karajan
      - VDS: First-generation Virtual Data System
          - Pegasus, Virtual Data Language
  - Service-oriented workflow systems
      - BPEL, BPDL, Taverna/SCUFL, Triana
  - Pegasus/Wings
      - Pegasus with OWL/RDF workflow specification
  - Swift workflow system
      - Karajan with typed and mapped VDL - SwiftScript
Slide 5
VDS: The Virtual Data System
  - Introduced Virtual Data Language - VDL
      - A location-independent parallel language
  - Several planners
      - Pegasus: main production planner
      - Euryale: experimental just-in-time planner
      - GADU/GNARE user application planner (D. Sulakhe, Argonne)
  - Provenance
      - Kickstart app launcher and tracker
      - VDC virtual data catalog
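The "launcher and tracker" role can be pictured with a short Python sketch. This is not Kickstart's actual interface or record format: the function `launch_and_track` and the fields of the record it returns are hypothetical names chosen for illustration.

```python
import subprocess
import sys
import time


def launch_and_track(argv):
    """Run a command and return a record of how the invocation went."""
    started = time.time()
    completed = subprocess.run(argv, capture_output=True, text=True)
    return {
        "argv": list(argv),                   # what was launched
        "exit_code": completed.returncode,    # how it ended
        "duration_s": time.time() - started,  # how long it took
        "stdout": completed.stdout,           # what it printed
    }


# The current Python interpreter is a portable stand-in for a real application.
record = launch_and_track([sys.executable, "-c", "print('done')"])
print(record["exit_code"], record["stdout"].strip())
```

A record like this, stored per invocation, is the kind of raw material a provenance catalog could later be queried against.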
Slide 6
Virtual Data and Workflows
  - Challenge is managing and organizing the vast computing and storage capabilities provided by Grids
  - Workflow expresses computations in a form that can be readily mapped to Grids
  - Virtual data keeps accurate track of data derivation methods and provenance
  - Grid tools virtualize location and caching of data, and recovery from failures
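Keeping track of data derivation methods can be sketched in Python as a small catalog that remembers which program produced each dataset. This is an illustration only, not the VDS catalog: `ProvenanceCatalog`, `record`, `lineage`, and the program and dataset names used below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Derivation:
    """One recorded step: which program turned which inputs into an output."""
    program: str
    inputs: tuple
    output: str


@dataclass
class ProvenanceCatalog:
    """Hypothetical in-memory catalog keyed by output dataset name."""
    derivations: dict = field(default_factory=dict)

    def record(self, program, inputs, output):
        self.derivations[output] = Derivation(program, tuple(inputs), output)

    def lineage(self, dataset):
        """Return the derivation steps leading to `dataset`, earliest first."""
        steps = []
        seen = set()

        def visit(name):
            step = self.derivations.get(name)
            if step is None or name in seen:
                return  # raw input with no recorded derivation, or already visited
            seen.add(name)
            for parent in step.inputs:
                visit(parent)
            steps.append(step)

        visit(dataset)
        return steps


catalog = ProvenanceCatalog()
catalog.record("simulate", ["config.txt"], "events.dat")
catalog.record("analyze", ["events.dat"], "plot.dat")
history = [(s.program, s.output) for s in catalog.lineage("plot.dat")]
print(history)
```

Because every output maps back to the step that made it, the same table answers both "how was this produced?" and "what must be rerun to recreate it?".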
Slide 7
Virtual Data Origins: The Grid Physics Network
Enhance scientific productivity through:
  - Discovery, application and management of data and processes at all scales
  - Using a worldwide data grid as a scientific workstation
The key to this approach is Virtual Data: creating and managing datasets through workflow recipes and provenance recording.
Slide 8
Virtual Data workflow abstracts Grid details
Slide 9
Example Application: High Energy Physics Data Analysis
[Figure: diagram whose nodes are labeled with parameter sets]
  mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW
  mass = 200, decay = ZZ
  mass = 200, decay = bb
  mass = 200, plot = 1
  mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Slide 10
The core essence: Basic data analysis programs
Raw Data (CMS.ECal.2006.0405):
  107: 24B707CC AF 01 37 01 00 01 00 01 24655A35 235011.603 061206 V 03 0 +0269
  108: 24B707CD 01 23 01 3F 00 01 00 01 24655A35 235011.603 061206 V 03 0 +0269
  109: 06194161 80 01 38 01 00 01 00 01 03E9DCA9 235142.597 061206 V 03 0 -0723
  110: 06194163 00 01 01 28 32 01 00 01
Parameters: bins = 60, xmin = 40.5, ymin = .003
Data Analysis Program (inputs: bins, xmin, ymin, infile)
Slide 11
Expressing Workflow in VDL

    TR grep (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR sort (in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV grep (a1=@{in:file1}, a2=@{out:file2});
    DV sort (a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 -> grep -> file2 -> sort -> file3]
  - Define a function wrapper for an application
  - Define formal arguments for the application
  - Define a call to invoke application
  - Provide actual argument values for the invocation
  - Connect applications via output-to-input dependencies
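The way the two DV calls chain through file2 can be illustrated with a small Python sketch. It does not show how any VDS planner is implemented: the `calls` table and the `execution_order` function are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) stands in for a planner.

```python
from graphlib import TopologicalSorter

# Each call lists the files it reads and writes, mirroring the two DV lines.
calls = {
    "grep": {"in": ["file1"], "out": ["file2"]},
    "sort": {"in": ["file2"], "out": ["file3"]},
}


def execution_order(calls):
    """Order calls so that the producer of a file runs before its consumers."""
    # Map each output file to the call that produces it.
    producer = {f: name for name, io in calls.items() for f in io["out"]}
    # A call depends on whichever calls produce its input files.
    graph = {
        name: {producer[f] for f in io["in"] if f in producer}
        for name, io in calls.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(calls)
print(order)  # ['grep', 'sort']
```

No ordering is stated anywhere in `calls`; it follows entirely from sort reading the file that grep writes, which is the output-to-input dependency the slide describes.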
ACTIVAL Workflow
Main Workflow Program

    // Declare datasets
    fullBrainData brainFile;
    fullBrainSpecs specFile;
    brainDatasets randBrain;
    brainClusters randCluster;
    brainDatasets dsetReturn;
    brainClusterTable clusterThresholdsTable;
    brainDataset brainResult;
    brainDataset origBrain;

    // Main program executes the entire workflow
    (randCluster, dsetReturn) = brain_cluster(brainFile, specFile);
    clusterThresholdsTable = bricCentralize(randCluster.c);
    brainResult = makebrain(origBrain, clusterThresholdsTable, brainFile, specFile);