www.ci.anl.gov www.ci.uchicago.edu Big process for big data: Process automation for data-driven science. Ian Foster, Computation Institute, Argonne National Laboratory & The University of Chicago. Talk at HPC 2012 Conference, Cetraro, Italy, June 25, 2012


Big process for big data

Process automation for data‐driven science        

Ian Foster Computation Institute

Argonne National Laboratory & The University of Chicago

Talk at HPC 2012 Conference, Cetraro, Italy, June 25, 2012


Big science is making it work

All build on NSF- & DOE-supported Globus Toolkit software

• LIGO: 1 PB data in last science run, distributed worldwide
• ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs
• OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010

• Robust production solutions
• Substantial teams and expense
• Sustained, multi-year effort
• Application-specific solutions, built on common technology


But small/medium science is struggling

• More data, more complex data
• Ad-hoc solutions
• Inadequate software, hardware
• Data plan mandates


Complexity is large and growing

• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data

[Figure: tasks arranged along a time axis]


Tripit exemplifies process automation

Me:
• Book flights
• Book hotel

Tripit:
• Record flights
• Suggest hotel
• Record hotel
• Get weather
• Prepare maps
• Share info
• Monitor prices
• Monitor flight

Other services

[Figure: tasks arranged along a time axis]


Can we extract this complexity?


Process automation for science

• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data

[Figure: tasks arranged along a time axis]

Research IT as a service?


A first take on “big process for science”

• Dark Energy Survey
• Metagenomics
• Climate science
• Genomics
• Land use change
• X-ray source data
• Biomedical imaging
• High energy physics
• Nielsen data


Software as a Service (Gartner)

1. The application is owned, delivered, and managed remotely by one or more providers

2. The application is based on a single code base that is consumed in a one-to-many model by all contracted customers at any time

3. The application is licensed on pay-per-use or subscription basis

4. The application behind the service is properly web architected, not an existing application web enabled [D. Terrar]


Globus Online: Data transfer as SaaS

• Reliable file transfer
  – Fire-and-forget
  – Automatic fault recovery
  – High performance
  – Across multiple security domains

• No IT required
  – No client software install
  – New features automatically available
  – Consolidated support and troubleshooting

Works with existing GridFTP servers; also Globus Connect
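To make "fire-and-forget" with "automatic fault recovery" concrete, here is a minimal Python sketch of one common way such behavior can be built: retrying a failed copy with exponential backoff. It is an illustration only, not Globus Online code; `transfer_with_retry`, `make_flaky_copy`, and the choice of `OSError` as the fault type are all hypothetical.

```python
import random
import time
from typing import Callable


def transfer_with_retry(
    copy_once: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run copy_once until it succeeds; return the number of attempts used.

    Hypothetical helper: retries on OSError with exponential backoff plus jitter.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            copy_once()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # out of retries: surface the fault for manual handling
            # Wait 1x, 2x, 4x ... the base delay, stretched by random jitter.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
    raise ValueError("max_attempts must be at least 1")


def make_flaky_copy(failures: int) -> Callable[[], None]:
    """Build a stand-in copy operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def copy_once() -> None:
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError("simulated network fault")

    return copy_once


if __name__ == "__main__":
    attempts = transfer_with_retry(make_flaky_copy(2), sleep=lambda s: None)
    print(f"succeeded after {attempts} attempts")
```

The caller submits the copy once and only hears back on success or after retries are exhausted, which is the sense of "fire-and-forget" assumed here.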


Globus Transfer to date

• In 18 months
  – 5,000 users
  – 5 PB moved
  – 500M files
  – 99.9% uptime

• Broad adoption
  – Experimental facilities
  – Supercomputers
  – Campuses
  – Individuals
  – Projects
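As a back-of-the-envelope reading of these totals, the short Python calculation below derives the mean file size and the mean sustained rate they imply. The decimal petabyte and the 30-day month are assumptions made for the arithmetic, and real traffic is unlikely to be this uniform.

```python
PB = 10**15  # assumed: decimal petabyte, in bytes
MB = 10**6   # assumed: decimal megabyte, in bytes

bytes_moved = 5 * PB                 # "5 PB moved"
files_moved = 500 * 10**6            # "500M files"
seconds = 18 * 30 * 24 * 3600        # "18 months", assuming 30-day months

mean_file_size_mb = bytes_moved / files_moved / MB
mean_rate_mb_per_s = bytes_moved / seconds / MB
mean_files_per_s = files_moved / seconds

if __name__ == "__main__":
    print(f"mean file size: {mean_file_size_mb:.0f} MB")
    print(f"mean rate:      {mean_rate_mb_per_s:.0f} MB/s")
    print(f"mean file rate: {mean_files_per_s:.1f} files/s")
```

Under these assumptions the averages come to about 10 MB per file and a little over 100 MB/s sustained across the whole service.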


Dark Energy Survey use of Globus Online

• Dark Energy Survey receives 100,000 files each night in Illinois
• They transmit files to Texas for analysis … then move results back to Illinois
• Process must be reliable, routine, and efficient
• They outsource this task to Globus Online

[Image: Blanco 4m on Cerro Tololo. Image credit: Roger Smith/NOAO/AURA/NSF]


Genome sequence analysis pipelines

[Figure: pipeline connecting a commercial sequencing center, Amazon S3 storage, and Amazon EC2 computing]


Globus Online under the covers

• Globus Nexus is used to manage
  – user identities
  – user profiles
  – groups and policies
  – resource definitions

• Monitoring and control
  – Auto-tuning of transfer parameters
  – Detection & attempted correction of errors
  – Manual intervention when required

• Reliable cloud-based infrastructure
  – EC2 for transfer management
  – S3 for system state
  – SimpleDB for lock management
  – Replication across availability zones
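The slide says only that SimpleDB is used for lock management, not how. One common pattern for building locks on a key-value store is a lease taken with a conditional write, sketched below in Python. Everything here is an assumption for illustration: `ConditionalStore` is an in-memory stand-in (not the SimpleDB API), and `LeaseLock`, its time-to-live, and the key names are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Lease = Tuple[str, float]  # (owner, expiry time)


class ConditionalStore:
    """In-memory stand-in for a key-value store with conditional writes."""

    def __init__(self) -> None:
        self._items: Dict[str, Lease] = {}

    def get(self, key: str) -> Optional[Lease]:
        return self._items.get(key)

    def put_if(self, key: str, expected: Optional[Lease], new: Lease) -> bool:
        """Write `new` only if the current value equals `expected` (None = absent)."""
        if self._items.get(key) != expected:
            return False
        self._items[key] = new
        return True

    def delete_if(self, key: str, expected: Lease) -> bool:
        """Delete the key only if its current value equals `expected`."""
        if self._items.get(key) != expected:
            return False
        del self._items[key]
        return True


class LeaseLock:
    """Hypothetical lease-based lock: a holder owns the key until its lease expires."""

    def __init__(self, store: ConditionalStore, key: str, owner: str,
                 ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.store, self.key, self.owner = store, key, owner
        self.ttl, self.clock = ttl, clock

    def acquire(self) -> bool:
        now = self.clock()
        current = self.store.get(self.key)
        if current is not None and current[0] != self.owner and current[1] > now:
            return False  # another owner holds an unexpired lease
        # The conditional write fails if someone changed the key since we read it.
        return self.store.put_if(self.key, current, (self.owner, now + self.ttl))

    def release(self) -> bool:
        current = self.store.get(self.key)
        if current is None or current[0] != self.owner:
            return False  # nothing to release, or the lease now belongs to someone else
        return self.store.delete_if(self.key, current)
```

The lease expiry matters for the "reliable" part: if a worker managing a transfer dies while holding the lock, another worker can take over once the lease runs out instead of waiting forever.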


A first take on “big process for science”

Research Data Management-as-a-Service

SaaS:
• Globus Transfer
• Globus Storage
• Globus Collaborate
• Globus Catalog
• …

PaaS:
• Globus Integrate


Globus Storage: For when you want to …

• Place your data where you want
• Access it from anywhere via different protocols
• Update it, version it, and take snapshots
• Share versions with who you want
• Synchronize among locations

[Figure: a Globus Storage volume backed by a commercial storage service provider, a national research center, and a campus computing center; accessed via Globus Transfer, HTTP/REST, and desktop sync]
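The "update it, version it, and take snapshots" idea can be illustrated with a toy in-memory model in Python. This is a sketch of the general concept only, not the Globus Storage design or API; the `Volume` class and its methods are hypothetical.

```python
from typing import Dict, List, Optional


class Volume:
    """Toy in-memory volume: every write adds a version; a snapshot freezes a view."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[bytes]] = {}      # path -> all versions
        self._snapshots: Dict[str, Dict[str, int]] = {}  # name -> path -> version

    def write(self, path: str, data: bytes) -> int:
        """Store a new version of `path` and return its 1-based version number."""
        history = self._versions.setdefault(path, [])
        history.append(data)
        return len(history)

    def read(self, path: str, version: Optional[int] = None) -> bytes:
        """Return the requested version of `path`, or the latest if none is given."""
        history = self._versions[path]
        return history[-1] if version is None else history[version - 1]

    def snapshot(self, name: str) -> None:
        """Record the current version number of every path under `name`."""
        self._snapshots[name] = {p: len(h) for p, h in self._versions.items()}

    def read_snapshot(self, name: str, path: str) -> bytes:
        """Read `path` as it was when snapshot `name` was taken."""
        return self.read(path, self._snapshots[name][path])


if __name__ == "__main__":
    vol = Volume()
    vol.write("scan.dat", b"first upload")
    vol.snapshot("shared-with-group")
    vol.write("scan.dat", b"corrected upload")
    print(vol.read("scan.dat"))                             # latest version
    print(vol.read_snapshot("shared-with-group", "scan.dat"))  # frozen view
```

Because a snapshot stores version numbers rather than copies, later updates do not disturb what collaborators see through it, which is one plausible reading of sharing "versions with who you want".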


Globus Storage under the covers

[Figure: Globus Storage architecture]

• Data: conventional or cloud storage system
• File system metadata: Cassandra database hosted on Amazon
• Front ends: GridFTP servers and HTTP servers


Globus Collaborate: For when you want to

Join with a few or many people to:
• Share docs
• Track tasks
• Send email
• Share data
• Do whatever

With:
• Common groups
• Delegated management


Globus Storage & Collaborate in action

TBI = Traumatic Brain Injury; DTI = Diffusion Tensor Imaging; MRI = Magnetic Resonance Imaging

[Figure: the DTI group (Kyle, Bryce) working with a "TBI" volume across the UChicago Object Store, Cornell Red Cloud, SDSC Cloud, Amazon S3, and the PADS compute cluster]

Actions shown in the figure:
• Globus Storage: Create volume and share with TBI group
• Globus Transfer: Copy TBI data to compute cluster
• Globus Transfer: Move DTI results to shared volume
• Globus Nexus: Add Bryce to TBI collaboration
• Globus Collaborate: Publish DTI data to TBI web site
• Globus Connect: Move MRI files to TBI shared volume
• Globus Connect: Move DTI results to Bryce’s laptop
• Globus Storage: Create snapshot to share with group


Data acquisition, management, analysis

Big Data (volume, velocity, variety, variability) … demands Big Process in order for discovery to scale

Experiments · Computations · Literature (don’t forget!)


Let’s rethink how we provide research IT

Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage the cloud to
• provide millions of researchers with unprecedented access to powerful tools;
• enable a massive shortening of cycle times in time-consuming research processes; and
• reduce research IT costs dramatically via economies of scale


Process automation for science

• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data

[Figure: tasks arranged along a time axis]

Research IT as a service?


Acknowledgements

• Thanks for vital and much appreciated support:
  – DOE Office of Advanced Scientific Computing Research (ASCR)
  – NSF Office of Cyberinfrastructure (OCI)
  – National Institutes of Health
  – The University of Chicago

• And thanks to the amazing Globus Online team. See www.globusonline.org/about/goteam/


Thank you!

globusonline.org

@globusonline

[email protected] [email protected]