DESCRIPTION
Talk given at the Materials Genome Initiative Workshop on Building the Materials Innovation Infrastructure: Data and Standards, held May 14-15, 2012 at the U.S. Department of Commerce (Herbert Hoover) building in Washington, DC. I made the case that to deal effectively with BIG DATA, you need BIG PROCESS. I described how Globus Online is addressing that need.
www.ci.anl.gov  www.ci.uchicago.edu
Process automation for data-driven science
Ian Foster
Computation Institute
Argonne National Laboratory & The University of Chicago
Talk at Materials Genome Initiative Workshop, May 14-15, DC
Where we want to get to
Imagine if, when tackling a problem, we could easily, both alone and within a distributed team:
• Assemble, integrate, and interpret all relevant data, organized within a knowledge network
• Be informed of anomalies, patterns, and gaps
• Formulate and evaluate computational models
• Launch automated processes to test hypotheses & expand the knowledge network
All within an environment in which productive strategies could be easily scaled and repeated
The attractive vs. the pragmatic
• Some attractive goals expressed yesterday
  – “Record the complete process used to generate data”
  – “Define standard formats and metadata”
  – “Make users rate data every time they use it”
  – “Eliminate incorrect data from databases”
• My pragmatic take on how best to proceed
  – “Identify, automate, and streamline key processes to make desirable behaviors easy”
Tripit exemplifies process automation
[Figure: timeline along a Time axis]
• Me: Book flights; Book hotel
• Other services: Record flights; Suggest hotel; Record hotel; Get weather; Prepare maps; Share info; Check prices; Monitor flight
Process automation for science
[Figure: steps laid out along a Time axis]
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
>25,000 registered users, >1PB access
>5,000 registered users, >4 PB moved
>45,000 metagenomes, 12 Tbp
A simple take on “big process for science”
Globus Integrate
Globus Transfer
Globus Storage
Globus Collaborate
Globus Catalog
…SaaS
…PaaS
Research Data Management-as-a-Service
Globus Transfer: Data movement
Globus Transfer details
• Reliable file transfer.
  – Easy “fire-and-forget” transfers
  – Automatic fault recovery
  – High performance
  – Across multiple security domains
• No IT required.
  – Software as a Service (SaaS)
    o No client software installation
    o New features automatically available
  – Consolidated support & troubleshooting
  – Works with existing GridFTP servers; Globus Connect for “last mile”
• >5,000 users, >4 petabytes and 500,000,000 files moved
• >99.9% uptime in 2012
Adopted by Advanced Photon Source, NERSC, Blue Waters, campuses
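The “fire-and-forget” and “automatic fault recovery” bullets can be illustrated with a small sketch. This is not the Globus Transfer API: `reliable_copy`, `make_flaky_copy`, `demo`, and the retry parameters are hypothetical names chosen for illustration, and a local file copy stands in for a wide-area transfer. The idea shown is simply that the caller submits a request once, and the service retries with backoff and verifies a checksum before reporting success.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reliable_copy(src: Path, dst: Path, copy=shutil.copyfile,
                  max_attempts: int = 5, base_delay: float = 0.01) -> int:
    """Copy src to dst, retrying on failure; return the number of attempts used.

    Hypothetical helper: a local copy stands in for a wide-area transfer.
    """
    expected = sha256(src)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(src, dst)
            if sha256(dst) == expected:  # verify integrity before declaring success
                return attempt
        except OSError:
            pass  # treat as a transient fault and retry
        time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"transfer failed after {max_attempts} attempts")


def make_flaky_copy(failures: int):
    """Build a copy function that raises OSError for the first `failures` calls."""
    state = {"left": failures}

    def flaky(src, dst):
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("simulated network fault")
        shutil.copyfile(src, dst)

    return flaky


def demo() -> int:
    """Run one simulated transfer that fails twice and then succeeds."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "scan.dat", Path(tmp) / "scan.copy"
        src.write_bytes(b"detector frame" * 1000)
        attempts = reliable_copy(src, dst, copy=make_flaky_copy(2))
        assert dst.read_bytes() == src.read_bytes()
        return attempts


if __name__ == "__main__":
    print(demo())
```

In this sketch the user never sees the two simulated faults; only the final, verified result is reported, which is the behavior the slide describes.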
Globus Storage and Globus Collaborate
Globus Storage: For when you want to …
• Place your data where you want
• Access it from anywhere via different protocols
• Update it, version it, and take snapshots
• Share versions with whom you want
• Synchronize among locations
[Figure labels: Globus Storage volume; Commercial storage service provider; National research center; Campus computing center; access via Globus Transfer, HTTP/REST, Desktop sync]
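As a rough illustration of the “update it, version it, and take snapshots” bullets, here is a toy in-memory model. It is a sketch under stated assumptions, not the Globus Storage interface: the `Volume` class, its method names, and the file path used in `demo` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Volume:
    """Toy in-memory volume; hypothetical, not the Globus Storage interface."""
    # path -> list of stored versions, oldest first
    files: Dict[str, List[bytes]] = field(default_factory=dict)
    # snapshot name -> {path: version number current at snapshot time}
    snapshots: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def put(self, path: str, data: bytes) -> int:
        """Store a new version of path and return its version number (from 1)."""
        self.files.setdefault(path, []).append(data)
        return len(self.files[path])

    def get(self, path: str, version: Optional[int] = None) -> bytes:
        """Return the given version of path, or the latest if version is None."""
        versions = self.files[path]
        return versions[-1] if version is None else versions[version - 1]

    def snapshot(self, name: str) -> None:
        """Record which version of every file is current right now."""
        self.snapshots[name] = {p: len(v) for p, v in self.files.items()}

    def get_from_snapshot(self, name: str, path: str) -> bytes:
        """Read path as it was when the named snapshot was taken."""
        return self.get(path, self.snapshots[name][path])


def demo() -> Tuple[bytes, bytes]:
    """Write a file, snapshot it, update it, then read both views."""
    vol = Volume()
    vol.put("results/dti.dat", b"first pass")
    vol.snapshot("shared-with-group")
    vol.put("results/dti.dat", b"second pass")
    latest = vol.get("results/dti.dat")
    shared = vol.get_from_snapshot("shared-with-group", "results/dti.dat")
    return latest, shared
```

The point of the sketch is that a snapshot pins version numbers rather than copying data, so collaborators who were given the snapshot keep seeing a stable view while the owner continues to update the volume.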
Globus Collaborate: For when you want to
Join with a few or many people to:
• Share documents
• Track tasks
• Send email
• Share data
• Do whatever
With:
• Common groups
• Delegated management
Globus Storage & Collaborate in action
TBI = Traumatic Brain Injury; DTI = Diffusion Tensor Imaging; MRI = Magnetic Resonance Imaging
[Figure: workflow diagram; the order of the steps is not recoverable from the extracted text]
• Globus Storage: Create volume and share with TBI group
• Globus Transfer: Copy TBI data to compute cluster
• Globus Transfer: Move DTI results to shared volume
• Globus Nexus: Add Bryce to TBI collaboration
• Globus Collaborate: Publish DTI data to TBI web site
• Globus Connect: Move MRI files to TBI shared volume
• Globus Connect: Move DTI results to Bryce’s laptop
• Globus Storage: Create snapshot to share with group
[Figure labels: Kyle; Bryce; “TBI” volume; PADS Compute Cluster; UChicago Object Store; Cornell Red Cloud; SDSC Cloud; Amazon S3; DTI Group (Kyle); DTI Group (Kyle, Bryce)]
Use case: Earth System Grid
• Outsource data transfer to Globus
  – Data download from search
  – Data transfer to another server
  – Replication between sites
• Next step is automated publication
• No ESGF client software needed
Data acquisition, management, analysis
Big Data (volume, velocity, variety, variability) … demands Big Process in order for discovery to scale
[Figure labels: Experiments; Computations; Literature; callout: “don’t forget!”]
How to proceed
• Top down:
  – Large-scale integration, standardized formats, common protocols, etc.
  – Good if achieved, but likely to be slow and painful
• Bottom up:
  – Consider opportunities to encourage useful behaviors via outsourcing and automation
  – Making data accessible is the first (and easiest?) 90%
  – Facilitate sharing, annotation, emergence of (localized) structure, bridging among structures
Acknowledgements
• Thanks for vital and much appreciated support:
  – DOE Office of Advanced Scientific Computing Research (ASCR)
  – NSF Office of Cyberinfrastructure (OCI)
  – National Institutes of Health
  – The University of Chicago
• Thanks to the Globus Online team at the University of Chicago and Argonne for their amazing work. See https://www.globusonline.org/about/goteam/