www.ci.anl.gov www.ci.uchicago.edu Process automation for data-driven science Ian Foster Computation Institute Argonne National Laboratory & The University of Chicago

Process automation for data-driven science

DESCRIPTION

Talk given at the Materials Genome Initiative Workshop on Building the Materials Innovation Infrastructure: Data and Standards, held May 14-15, 2012 at the U.S. Department of Commerce (Herbert Hoover) building in Washington, DC. I made the case that to deal effectively with BIG DATA, you need BIG PROCESS. I described how Globus Online is addressing that need.

Page 1: Process automation for data-driven science

Process automation for data-driven science

Ian Foster
Computation Institute
Argonne National Laboratory & The University of Chicago

Talk at Materials Genome Initiative Workshop, May 14-15, DC

Page 2: Process automation for data-driven science

Where we want to get to

Imagine if, when tackling a problem, we could easily, both alone and within a distributed team:
• Assemble, integrate, and interpret all relevant data, organized within a knowledge network
• Be informed of anomalies, patterns, and gaps
• Formulate and evaluate computational models
• Launch automated processes to test hypotheses & expand the knowledge network

All within an environment in which productive strategies could be easily scaled and repeated

Page 3: Process automation for data-driven science

The attractive vs. the pragmatic

• Some attractive goals expressed yesterday
  – “Record the complete process used to generate data”
  – “Define standard formats and metadata”
  – “Make users rate data every time they use it”
  – “Eliminate incorrect data from databases”
• My pragmatic take on how best to proceed
  – “Identify, automate, and streamline key processes to make desirable behaviors easy”

Page 4: Process automation for data-driven science

Page 5: Process automation for data-driven science

Tripit exemplifies process automation

[Timeline diagram labels]
Me: Book flights; Book hotel
Record flights; Suggest hotel; Record hotel; Get weather; Prepare maps; Share info; Check prices; Monitor flight
Other services
Time

Page 6: Process automation for data-driven science

Process automation for science

[Timeline diagram labels]
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
Time

>25,000 registered users, >1 PB access

>5,000 registered users, >4 PB moved

>45,000 metagenomes, 12 Tbp

Page 7: Process automation for data-driven science

A simple take on “big process for science”

Globus Integrate

Globus Transfer

Globus Storage

Globus Collaborate

Globus Catalog

…SaaS

…PaaS

Research Data Management-as-a-Service

Page 8: Process automation for data-driven science

Globus Transfer: Data movement

Page 9: Process automation for data-driven science
Page 10: Process automation for data-driven science

Globus Transfer details

• Reliable file transfer.
  – Easy “fire-and-forget” transfers
  – Automatic fault recovery
  – High performance
  – Across multiple security domains
• No IT required.
  – Software as a Service (SaaS)
    o No client software installation
    o New features automatically available
  – Consolidated support & troubleshooting
  – Works with existing GridFTP servers; Globus Connect for “last mile”
• >5,000 users, >4 petabytes and 500,000,000 files moved
• >99.9% uptime in 2012

Adopted by Advanced Photon Source, NERSC, Blue Waters, campuses
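
As a rough illustration of what "fire-and-forget" with automatic fault recovery can mean, here is a minimal Python sketch. It is not Globus Transfer code and does not use any Globus API: `transfer_with_recovery`, `copy_once`, `TransientTransferError`, and `make_flaky_copy` are hypothetical names invented for this example, and the retry-with-backoff policy is an assumption about one reasonable way to recover from transient faults.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


class TransientTransferError(Exception):
    """Hypothetical error type: a failure that is worth retrying."""


def transfer_with_recovery(
    files: Iterable[str],
    copy_once: Callable[[str], None],
    max_attempts: int = 3,
    base_delay_s: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Try to copy every file, retrying transient failures.

    copy_once is a caller-supplied (hypothetical) function that copies one
    file and raises TransientTransferError on a recoverable fault.
    Returns (files that succeeded, files that still failed).
    """
    done: List[str] = []
    failed: List[str] = []
    for name in files:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_once(name)
                done.append(name)
                break
            except TransientTransferError:
                if attempt == max_attempts:
                    # Give up on this file but keep going with the rest.
                    failed.append(name)
                else:
                    # Exponential backoff before the next attempt.
                    time.sleep(base_delay_s * 2 ** (attempt - 1))
    return done, failed


def make_flaky_copy(fail_first_n: int) -> Callable[[str], None]:
    """Build a stand-in copy function that fails the first N tries per file."""
    tries: Dict[str, int] = {}

    def copy_once(name: str) -> None:
        tries[name] = tries.get(name, 0) + 1
        if tries[name] <= fail_first_n:
            raise TransientTransferError(name)

    return copy_once


if __name__ == "__main__":
    ok, bad = transfer_with_recovery(["a.dat", "b.dat"], make_flaky_copy(1))
    print("moved:", ok, "failed:", bad)
```

The user submits the whole batch once; retries happen without further attention, and only files that exhaust their attempts are reported back.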

Page 11: Process automation for data-driven science

Globus Storage and Globus Collaborate

Page 12: Process automation for data-driven science

Globus Storage: For when you want to …

• Place your data where you want
• Access it from anywhere via different protocols
• Update it, version it, and take snapshots
• Share versions with whom you want
• Synchronize among locations

[Diagram labels]
Commercial storage service provider; National research center; Campus computing center
Globus Storage volume
Globus Transfer, HTTP/REST, Desktop sync

Page 13: Process automation for data-driven science

Globus Collaborate: For when you want to

Join with a few or many people to:
• Share documents
• Track tasks
• Send email
• Share data
• Do whatever

With:
• Common groups
• Delegated management

Page 14: Process automation for data-driven science

Globus Storage & Collaborate in action

TBI = Traumatic Brain Injury
DTI = Diffusion Tensor Imaging
MRI = Magnetic Resonance Imaging

[Diagram labels]
• Resources: UChicago Object Store; Cornell Red Cloud; SDSC Cloud; Amazon S3; PADS Compute Cluster; “TBI” volume
• People: Kyle; Bryce
• DTI Group: Kyle
• DTI Group: Kyle, Bryce

Steps shown:
• Globus Storage: Create volume and share with TBI group
• Globus Transfer: Copy TBI data to compute cluster
• Globus Transfer: Move DTI results to shared volume
• Globus Nexus: Add Bryce to TBI collaboration
• Globus Collaborate: Publish DTI data to TBI web site
• Globus Connect: Move MRI files to TBI shared volume
• Globus Connect: Move DTI results to Bryce’s laptop
• Globus Storage: Create snapshot to share with group

Page 15: Process automation for data-driven science

Use case: Earth System Grid

• Outsource data transfer to Globus
  – Data download from search
  – Data transfer to another server
  – Replication between sites
• Next step is automated publication
• No ESGF client software needed

Page 16: Process automation for data-driven science

Data acquisition, management, analysis

Big Data (volume, velocity, variety, variability) … demands Big Process in order for discovery to scale

Experiments, Computations, Literature (don’t forget!)

Page 17: Process automation for data-driven science

How to proceed

• Top down:
  – Large-scale integration, standardized formats, common protocols, etc.
  – Good if achieved, but likely to be slow and painful
• Bottom up:
  – Consider opportunities to encourage useful behaviors via outsourcing and automation
  – Making data accessible is the first (and easiest?) 90%
  – Facilitate sharing, annotation, emergence of (localized) structure, bridging among structures

Page 18: Process automation for data-driven science

Acknowledgements

• Thanks for vital and much appreciated support:
  – DOE Office of Advanced Scientific Computing Research (ASCR)
  – NSF Office of Cyberinfrastructure (OCI)
  – National Institutes of Health
  – The University of Chicago
• Thanks to the Globus Online team at the University of Chicago and Argonne for their amazing work. See https://www.globusonline.org/about/goteam/

Page 19: Process automation for data-driven science

Thank you!

[email protected]@uchicago.edu