
Page 1

Using Application Structure to Handle Failures and Improve Performance in a Migratory File Service

John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny

WiND and Condor Project

14 April 2003

Page 2

Disclaimer

We have a lot of stuff to describe, so hang in there until the end!

Page 3

Outline

• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing

Page 4

CPU Bound

• SETI@Home, Folding@Home, etc.
  – Excellent application of distributed computing.
  – KB of data, days of CPU time.
  – Efficient to do tiny I/O on demand.
• Supporting systems:
  – Condor
  – BOINC
  – Google Toolbar
  – Custom software.

Page 5

I/O Bound

• D-Zero data analysis:
  – Excellent application for cluster computing.
  – GB of data, seconds of CPU time.
  – Efficient to compute whenever data is ready.
• Supporting systems:
  – Fermi SAM
  – High-throughput document scanning
  – Custom software.

Page 6

Batch Pipelined Applications

[Figure: a batch-pipelined workload. Each vertical pipeline runs stages a → b → c, handing pipeline-shared data from stage to stage; batch-shared data (x, y, z) is read by every pipeline. The number of pipelines run side by side is the batch width.]

Page 7

Example: AMANDA

[Figure: the four-stage AMANDA pipeline. corsika reads corsika_input.txt (4 KB) plus batch-shared tables (NUCNUCCS, GLAUBTAR, EGSDATA3.3, QGSDATA4; 1 MB) and writes DAT (23 MB). corama converts DAT into corama.out (26 MB). mmc reads mmc_input.txt and corama.out and writes mmc_output.dat (126 MB). amasim reads amasim_input.dat and mmc_output.dat, along with ice tables (3 files, 3 MB) and experiment geometry (100s of files, 500 MB), and writes amasim_output.txt (5 MB).]

Page 8

Computing Environment

• Clusters dominate:
  – Similar configurations.
  – Fast interconnects.
  – Single administrative domain.
  – Underutilized commodity storage.
  – En masse, quite unreliable.
• Users wish to harness multiple clusters, but have jobs that are both I/O and CPU intensive.

Page 9

Ugly Solutions

• “FTP-Net”
  – User finds remote clusters.
  – Manually stages data in.
  – Submits jobs, deals with failures.
  – Pulls data out.
  – Lather, rinse, repeat.
• “Remote I/O”
  – Submit jobs to a remote batch system.
  – Let all I/O come back to the archive.
  – Return in several decades.

Page 10

What We Really Need

• Access resources outside my domain.
  – Assemble your own army.
• Automatic integration of CPU and I/O access.
  – Forget optimal: save administration costs.
  – Replacing remote with local always wins.
• Robustness to failures.
  – Can’t hire babysitters for New Year’s Eve.

Page 11

Hawk: A Migratory File Service

• Automatically deploys a “task force” across an existing distributed system.

• Manages applications from a high level, using knowledge of process interactions.

• Provides dependable performance through peer-to-peer techniques.

• Understands and reacts to failures using knowledge of the system and workloads.

Page 12

Philosophy of Hawk

“In allocating resources, strive to avoid disaster, rather than attempt to obtain an optimum.” - Butler Lampson

Page 13

Why not AFS+Make?

• Quick answer:
  – Distributed filesystems provide an unnecessarily strong abstraction that is unacceptably expensive to provide in the wide area.

• Better answer after we explain what Hawk is and how it works.

Page 14

Outline

• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing

Page 15

Workflow Language 1

job a a.sub

job b b.sub

job c c.sub

job d d.sub

parent a child c

parent b child d

[Figure: the resulting DAG: a → c and b → d.]
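The slide above is, in effect, a DAG description in the style of Condor’s DAGMan: a job line binds a name to a submit file, and parent/child lines order the jobs. As a rough illustration of how a workflow manager could consume such a description, here is a minimal Python sketch; the parse and run functions and the submit callback are hypothetical stand-ins, not Hawk code.

```python
# Illustrative sketch (not from the slides): parse the declarations
# above and run jobs in dependency order. submit() is a stand-in for
# handing a.sub etc. to a batch system.
from collections import defaultdict

def parse(text):
    jobs, parents = {}, defaultdict(set)
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "job":            # job <name> <submit file>
            jobs[words[1]] = words[2]
        elif words[0] == "parent":       # parent <p> child <c>
            parents[words[3]].add(words[1])
    return jobs, parents

def run(jobs, parents, submit):
    done = set()
    while len(done) < len(jobs):
        ready = [j for j in jobs if j not in done and parents[j] <= done]
        for j in ready:
            submit(j, jobs[j])           # may be retried after a failure
            done.add(j)

jobs, parents = parse("""\
job a a.sub
job b b.sub
job c c.sub
job d d.sub
parent a child c
parent b child d
""")
run(jobs, parents, lambda name, sub: print("submit", name, sub))
```

Note that because abstract jobs may run more than once (see slide 18), any real submit callback has to tolerate re-execution.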

Page 16

Workflow Language 2

volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp

[Figure: jobs a and b both mount read volume v1 (mydata on home storage); a and c share scratch volume v2; b and d share scratch volume v3.]

Page 17

Workflow Language 3

extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2

[Figure: file x, written into scratch volumes v2 and v3 by jobs c and d, is extracted back to home storage as out.1 and out.2.]
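Slides 16 and 17 extend the language with volume, mount, and extract declarations. To make the semantics concrete, here is a hypothetical sketch (not Hawk’s implementation) that parses those declarations into per-job mount tables, the structure the agents later use to resolve paths:

```python
# Illustrative sketch: build a per-job namespace from volume/mount/extract
# declarations. All variable names here are hypothetical.
from collections import defaultdict

volumes, mounts, extracts = {}, defaultdict(dict), []
for line in """\
volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2
""".splitlines():
    words = line.split()
    if not words:
        continue
    if words[0] == "volume":             # volume <name> <source|scratch>
        volumes[words[1]] = words[2]
    elif words[0] == "mount":            # mount <volume> <job> <path>
        mounts[words[2]][words[3]] = words[1]
    elif words[0] == "extract":          # extract <volume> <file> <dest>
        extracts.append((words[1], words[2], words[3]))

print(mounts["a"])    # {'/data': 'v1', '/tmp': 'v2'}
print(extracts[0])    # ('v2', 'x', 'ftp://home/out.1')
```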

Page 18

Mapping Logical to Physical

• Abstract Jobs
  – Physical jobs in a batch system.
  – May run more than once!
• Logical “scratch” volumes
  – Temporary containers on a scratch disk.
  – May be created, replicated, and destroyed.
• Logical “read” volumes
  – Striped across cooperative proxy caches.
  – May be created, cached, and evicted.
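One way to picture the two kinds of logical volumes is as small data structures with different lifecycles. The sketch below is purely illustrative; the class and method names are assumptions, not Hawk’s API.

```python
# Illustrative data model for the logical-to-physical mapping above.
# Hypothetical names, for exposition only.
from dataclasses import dataclass, field

@dataclass
class ScratchVolume:                 # logical "scratch" volume
    name: str
    replicas: list = field(default_factory=list)   # container URLs

    def create_on(self, host, cid):
        self.replicas.append(f"container://{host}/{cid}")

    def destroy(self):
        self.replicas.clear()

@dataclass
class ReadVolume:                    # logical "read" volume
    name: str
    source: str                      # archive URL, e.g. ftp://home/mydata
    cached_at: set = field(default_factory=set)    # proxies holding blocks

v2 = ScratchVolume("v2")
v2.create_on("host5", 120)           # may later be replicated or destroyed
v1 = ReadVolume("v1", "ftp://home/mydata")
v1.cached_at.add("proxyC")           # cached; may be evicted under pressure
print(v2.replicas, v1.cached_at)
```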

Page 19

Starting System

[Figure: the starting system. The home site runs a workflow manager, match maker, batch queue, and archive. The available resources are a PBS cluster (head node plus worker nodes) and a Condor pool of nodes.]

Page 20

Gliding In

[Figure: gliding in. A glide-in job deploys a Master, StartD, and Proxy onto each node of both the PBS cluster and the Condor pool, attaching them to the home match maker, batch queue, and archive.]

Page 21

Hawk Architecture

[Figure: Hawk architecture. The workflow manager, driven by an AppFlow description and a system model, works with the match maker, batch queue, and archive. On each node, a StartD runs a job with an attached agent; agents talk to their local proxies, proxies form cooperative caches, and wide-area caching connects the caches back to the archive.]

Page 22

I/O Interactions

[Figure: I/O interactions. A job running under a StartD issues POSIX calls such as creat(“/tmp/outfile”) and open(“/data/d15”). Its agent, behind a POSIX library interface, maps /tmp to container://host5/120 and /data to cache://host5/archive/data, forwarding requests over the local-area network to the proxy. The proxy keeps pipeline files (e.g. outfile, tmpfile, foo, bar, baz) in containers 119 and 120, and serves batch reads from a cooperative block cache shared with other proxies and backed by the archive.]
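The agent’s role in this figure boils down to a mount-table lookup: rewrite a logical POSIX path into a container or cache URL before any real I/O happens. Below is a minimal sketch of that translation, assuming only the URL scheme shown in the figure; everything else is hypothetical.

```python
# Illustrative sketch: longest-prefix mount-table lookup, as an agent
# might perform before forwarding I/O to its proxy. Hypothetical code.
MOUNTS = {
    "/tmp":  "container://host5/120",        # private pipeline data
    "/data": "cache://host5/archive/data",   # batch-shared read data
}

def resolve(path):
    # Pick the longest mount prefix that matches the path.
    best = max((m for m in MOUNTS if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise FileNotFoundError(path)
    return MOUNTS[best] + path[len(best):]

print(resolve("/tmp/outfile"))   # container://host5/120/outfile
print(resolve("/data/d15"))      # cache://host5/archive/data/d15
```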

Page 23

Cooperative Proxies

[Figure: cooperative proxies. Proxies A, B, and C discover one another through the match maker. A hash map assigns paths to proxies and is repartitioned as proxies come and go: at t1 proxy C holds all paths; at t2 they are split between C and B; at t3 among C, B, and A; at t4 between C and B again.]
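The slide does not say how the hash map is computed. One scheme with the property the timeline suggests (few paths move when a proxy joins or leaves) is rendezvous hashing; the sketch below is an illustrative assumption, not necessarily Hawk’s actual mechanism.

```python
# Illustrative sketch: assign each path to a proxy by rendezvous
# (highest-random-weight) hashing, so that when the proxy set changes
# only the affected paths move. Hypothetical, not Hawk's actual scheme.
import hashlib

def owner(path, proxies):
    def weight(proxy):
        return hashlib.sha1((proxy + path).encode()).hexdigest()
    return max(proxies, key=weight)

for proxies in [["C"], ["B", "C"], ["A", "B", "C"], ["B", "C"]]:  # t1..t4
    print(proxies, "->", owner("/archive/data/d15", proxies))
```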

Page 24

Summary

• Archive
  – Sources input data, chooses coordinator.
• Glide-In
  – Deploys a “task force” of components.
• Cooperative Proxies
  – Provide dependable batch read-only data.
• Data Containers
  – Fault-isolated pipeline data.
• Workflow Manager
  – Directs the operation.

Page 25

Outline

• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing

Page 26

Performance Testbed

• Controlled testbed:
  – 32 550-MHz dual-CPU cluster machines, 1 GB, SCSI disks, 100 Mb/s Ethernet.
  – Simulated WAN: archive storage restricted across a router to 800 KB/s.
• Also some preliminary tests on uncontrolled systems:
  – MFS over a PBS cluster at Los Alamos.
  – MFS over a Condor system at INFN Italy.

Page 27

Synthetic Apps

[Figure: three synthetic two-job pipelines, each with job a feeding job b: pipe intensive (10 MB of pipeline data), mixed (5 MB of batch data, 5 MB of pipeline data), and batch intensive (10 MB of batch data). System configurations compared: Local, Co-Locate Data, Don’t Co-Locate, and Remote.]

Page 28

Pipeline Optimization

Page 29

Everything Together

Page 30

Network Consumption

Page 31

Failure Handling

Page 32

Real Applications

• BLAST
  – Search tool for proteins and nucleotides in genomic databases.
• CMS
  – Simulation of a high-energy physics experiment to begin operation at CERN in 2006.
• H-F
  – Simulation of the non-relativistic interactions between nuclei and electrons.
• AMANDA
  – Simulation of a neutrino detector buried in the ice of the South Pole.

Page 33

Application Throughput

Name     Stages   Remote      Hawk
BLAST       1       4.67    747.40
CMS         2      33.78   1273.96
HF          3      40.96   3187.22
AMANDA      4

Page 34

Outline

• Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
• Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
• Evaluation
  – Performance
  – Failure
• Philosophizing

Page 35

Related Work

• Workflow management

• Dependency managers: TREC, make

• Private namespaces: UFO, db views

• Cooperative caching: no writes.

• P2P systems: wrong semantics.

• Filesystems: overly strong semantics.

Page 36

Why Not AFS+Make?

• Namespaces
  – Constructed per-process at submit time.
• Consistency
  – Enforced at the workflow level.
• Selective Commit
  – Everything is tossed unless explicitly saved.
• Fault Awareness
  – CPUs and data can be lost at any point.
• Practicality
  – No special permission required.

Page 37

Conclusions

• Traditional systems build from the bottom up: this disk must have five nines, or we’re in big trouble!

• MFS builds from the top down: application semantics drive system structure.

• By posing the right problem, we solve the traditional hard problems of file systems.

Page 38

For More Info...

• Paper in progress...
• Application study:
  – “Pipeline and Batch Sharing in Grid Workloads”, to appear in HPDC-2003.
  – www.cs.wisc.edu/condor/doc/profiling.ps
• Talk to us!
• Questions now?