Microsoft Research Faculty Summit 2008

Microsoft Research Faculty Summit 2008. Ian Foster Computation Institute University of Chicago & Argonne National Laboratory

Towards a Data Cauldron

Ian Foster, Computation Institute, University of Chicago & Argonne National Laboratory

If you want to build a ship, don’t drum up the men to gather wood, divide the work, and give orders. Instead, teach them to yearn for the vast and endless sea.

Antoine de Saint-Exupéry

Biomedical Research, circa 1600

Biomedical Research, circa 2000

Growth of Sequences & Annotations since 1982

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

An Open Analytics Environment

Results out

Data in

Programs & rules in

“No limits”: Storage, Computing, Format, Program

Allowing for: Versioning, Provenance, Collaboration, Annotation

o·pen [oh-puhn] adjective

having the interior immediately accessible

relatively free of obstructions to sight, movement, or internal arrangement

generous, liberal, or bounteous

in operation; live

readily admitting new members

not constipated

What Goes In (2)

Rules

Workflows

Dryad

MapReduce

Parallel programs

SQL

BPEL

Swift

SCFL

R

MatLab

Octave

How it Cooks

Virtualization: Run any program, store any data

Indexing: Automated maintenance

Provisioning: Policy-driven allocation of resources to competing demands

What Comes Out

Data

Analysis as (Collaborative) Process

Transform

Annotate

Search

Add to

Tag

Visualize

Discover

Extend

Group

Share

Data Cauldron @ U.Chicago: Applications

Astrophysics

Cognitive science

East Asian studies

Economics

Environmental science

Epidemiology

Genomic medicine

Neuroscience

Political science

Sociology

Solid state physics

Data Cauldron @ U.Chicago: Hardware

500 TB reliable storage (data, metadata)

180 TB, 180 GB/s, 17 Top/s analysis

Data ingest

Dynamic provisioning

Parallel analysis

Remote access

Offload to remote data centers

PADS

Diverse users

Diverse data sources

1000 TB tape backup

DOCK on BG/P: ~1M Tasks on 118,000 CPUs

CPU cores: 118784

Tasks: 934803

Elapsed time: 7257 sec

Compute time: 21.43 CPU yr

Average task time: 667 sec

Relative efficiency: 99.7% (from 16 to 32 racks)

Utilization: sustained 99.6%, overall 78.3%
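
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is one plausible reading, not something the slide itself confirms: it assumes that overall utilization means reported compute time divided by the core time the machine offered (cores × elapsed time), and that a CPU year is 365 days of one core. The function names are illustrative only.

```python
# Figures copied from the slide.
CPU_CORES = 118_784
TASKS = 934_803
ELAPSED_SEC = 7_257
COMPUTE_CPU_YEARS = 21.43

# Assumption: one CPU year = one core busy for 365 days.
SECONDS_PER_YEAR = 365 * 24 * 3600


def available_cpu_years(cores: int, elapsed_sec: float) -> float:
    """Core time offered by the machine over the run, in CPU years."""
    return cores * elapsed_sec / SECONDS_PER_YEAR


def overall_utilization(compute_cpu_years: float, cores: int,
                        elapsed_sec: float) -> float:
    """Assumed definition: useful compute time / offered core time."""
    return compute_cpu_years / available_cpu_years(cores, elapsed_sec)


if __name__ == "__main__":
    avail = available_cpu_years(CPU_CORES, ELAPSED_SEC)
    util = overall_utilization(COMPUTE_CPU_YEARS, CPU_CORES, ELAPSED_SEC)
    print(f"offered core time:   {avail:.2f} CPU yr")
    print(f"overall utilization: {util:.1%}")
    print(f"task throughput:     {TASKS / ELAPSED_SEC:.1f} tasks/sec")
    print(f"tasks per core:      {TASKS / CPU_CORES:.2f}")
```

Under these assumptions the ratio comes out at roughly 78%, in line with the overall utilization quoted on the slide; the small difference may come from rounding or from a different definition of a CPU year.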

Ioan Raicu, Zhao Zhang, Mike Wilde

Time (secs)

Data Cauldron @ U.Chicago: Methods

HPC systems software (MPICH, PVFS, ZeptOS)

Collaborative data tagging (GLOSS)

Data integration (XDTM)

HPC data analytics and visualization

Loosely coupled parallelism (Swift, Hadoop)

Dynamic provisioning (Falkon)

Service authoring (Introduce, caGrid, gRAVI)

Provenance recording and query (Swift)

Service composition and workflow (Taverna)

Virtualization management (Workspace Service)

Distributed data management (GridFTP, etc.)

High-Performance Data Analytics

Functional MRI

Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao

Social Informatics Data Grid (SIDgrid)

Collaborative, multi-modal analysis of cognitive science data

TeraGrid PADS …

SIDgrid

Diverse experimental data & metadata

Browse data

Search

Content preview

Transcode

Download

Analyze

Bennett Bertenthal, Mike Papka, Mike Wilde … and others

A Vast and Endless Sea …

Results out

Data in

Programs & rules in

“No limits”: Storage, Computing, Format, Program

Allowing for: Versioning, Provenance, Collaboration, Annotation
