Error Scope on a Computational Grid: Theory and Practice Douglas Thain and Miron Livny Computer Sciences Department University of Wisconsin HPDC-11, July 2002


Error Scope on a Computational Grid:

Theory and Practice

Douglas Thain

and Miron Livny

Computer Sciences Department

University of Wisconsin

HPDC-11, July 2002

Danger Ahead!

Outline

– An Exercise: Condor + Java
– Bad News: Error Explosion
– A Theory of Error Propagation (A Taste)
– Condor Revisited
– Parting Thoughts

An Exercise: Coupling Condor and Java

The Condor Project, est. 1985.
– Production high-throughput computing facility.
– Provides a stable execution environment on a Grid of unstable, autonomous resources.

The Java Language, est. 1991.
– Production language, compiler, and interpreter.
– Provides a standard instruction set and libraries on any processor and system.

The Grid
– Execute any code anywhere at any time.
– Dependable, consistent, pervasive, inexpensive...
– Are we there yet?

The Condor High Throughput Computing System

HTC != HPC
– Measured in sims/week, frames/month, cycles/year.

All participants are autonomous.
– Users give constraints on usable machines.
– Machines give constraints on jobs and users.
– ClassAds: a language for matchmaking.

If you are willing to re-link jobs...
– Remote system calls for transparent mobility.
– Binary checkpointing for migration and fault-tolerance.
– Can’t relink? All other features available.

Special “universes” support software environments.
– PVM, MPI, Master-Worker, Vanilla, Globus, Java
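The two-sided constraints above can be sketched in a few lines of Java. This is only an illustration of the idea that a match needs both parties to agree; it is not ClassAd syntax, and the `Machine`, `Job`, and `matches` names are hypothetical.

```java
import java.util.function.Predicate;

final class Matchmaking {
    /** Hypothetical machine advertisement: what it has, and which jobs it accepts. */
    static final class Machine {
        final String name;
        final int memoryMb;
        final Predicate<Job> accepts;

        Machine(String name, int memoryMb, Predicate<Job> accepts) {
            this.name = name;
            this.memoryMb = memoryMb;
            this.accepts = accepts;
        }
    }

    /** Hypothetical job advertisement: who owns it, and which machines it can use. */
    static final class Job {
        final String owner;
        final Predicate<Machine> requires;

        Job(String owner, Predicate<Machine> requires) {
            this.owner = owner;
            this.requires = requires;
        }
    }

    /** A match exists only if each side satisfies the other's constraints. */
    static boolean matches(Job job, Machine machine) {
        return job.requires.test(machine) && machine.accepts.test(job);
    }

    public static void main(String[] args) {
        Machine node = new Machine("node1", 512, j -> !j.owner.equals("guest"));
        Job fits = new Job("alice", m -> m.memoryMb >= 256);
        Job tooBig = new Job("alice", m -> m.memoryMb >= 1024);
        Job unwelcome = new Job("guest", m -> true);

        System.out.println(matches(fits, node));      // true
        System.out.println(matches(tooBig, node));    // false: job rejects the machine
        System.out.println(matches(unwelcome, node)); // false: machine rejects the job
    }
}
```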

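[Diagram: Condor architecture. Submission Site: User Agent (schedd) with Policy Control, which forks a Job Agent (shadow) next to the Home File System. Execution Site: Machine Agent (startd) with Policy Control, which forks a Job Agent (starter), which forks The Job. The agents send “I want...” and “I have...” to the Match-Maker, which notifies both. The schedd and startd speak the Claiming Protocol; the shadow and starter speak the Execution Protocol.]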

Java Universe

Execution:
– User specifies .class and .jar files.
– Machine provides the JVM details.

Input and Output:
– Know all of your files? Condor transfers whole files for you.
– Need online I/O? Link program with Chirp I/O Library. Execution site provides proxy to home site.

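[Diagram: Java Universe. Execution Site: the Job Agent (starter) forks the JVM, which holds the Wrapper, The Job, and the I/O Library; the I/O Library reaches the I/O Proxy by Local RPC (Chirp). Submission Site: the Job Agent (shadow) holds the I/O Server, which makes Local System Calls on the Home File System. The I/O Proxy and I/O Server are linked by Secure Remote I/O.]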

Initial Experience

Bad news! Any kind of error sent the job back to the user with an exception message:

– NullPointerException - Program is faulty.
– OutOfMemory - Program outgrew machine.
– ClassNotFoundError - Machine incorrectly installed.
– ConnectionRefused - Network temporarily unavailable.

Users were frustrated because they had to evaluate whether the job failed or the system failed.

These messages were correct in the sense that they were true. They were not bugs: we deliberately trapped all possible errors and passed them up the chain.

What’s the Problem?

To reason about this problem, we began to construct a theory of error propagation.

This theory offers some common definitions and four principles that outline a design discipline.

We re-examined the Java Universe according to this theory.

Our most serious mistake: We failed to propagate errors according to their scope.

We are NOT Talking About:

Fault Tolerance
– What algorithms are fault-resistant?
– How many disks can I lose without losing data?
– How many copies should I make for five nines?

Language Structures
– Should I use Objects or Strings to represent errors?
– Should I use Exceptions or Signals to communicate errors?

These are important and valuable questions, but we are asking something different!

We ARE Talking About:

– Where is the problem?
– How should a program respond to an error?
– Who should receive an error message?
– What information should an error carry?
– How can we even reason about this stuff?

Engineering Perspective

Fault
– A physical disruption of the machine.

Error
– An information state that reflects a fault.

Failure
– A violation of documented/guaranteed behavior.

Fault
– (A failure in one’s underlying components.)

Interface Perspective

Implicit Error
– A result presented as valid, but found to be false.
– Example: sqrt(3) -> 2.

Explicit Error
– A result describing an inability to carry out the request.
– Example: open(“file”) -> ENOENT.

Escaping Error
– A return to a higher level of abstraction.
– Example: read -> virt mem failure -> process abort.
– Example: server out of memory -> shutdown socket.
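The three definitions above can be mirrored in Java. The following is a minimal sketch under assumptions: the method names are hypothetical, and treating every `Exception` as explicit and every `Error` as escaping is a simplification chosen for the example, not a rule stated on the slides.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

final class ErrorKinds {
    // Implicit error: a wrong answer presented as valid (sqrt(3) -> 2).
    static double implicitSqrt(double x) {
        return 2.0;
    }

    // Explicit error: the interface itself says "I cannot do this" (like ENOENT).
    static String explicitOpen(String name) throws FileNotFoundException {
        throw new FileNotFoundException(name);
    }

    // Escaping error: the failure does not fit the interface, so control
    // leaves it and returns to a higher level of abstraction.
    static byte[] escapingRead() {
        throw new InternalError("virtual memory failure");
    }

    // What a caller can observe. An implicit error is indistinguishable
    // from a valid result, which is exactly why it is dangerous.
    static String classify(Callable<Object> call) {
        try {
            call.call();
            return "result";
        } catch (Exception e) {
            return "explicit";
        } catch (Error e) {
            return "escaping";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> implicitSqrt(3.0)));    // result
        System.out.println(classify(() -> explicitOpen("file"))); // explicit
        System.out.println(classify(() -> escapingRead()));       // escaping
    }
}
```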

Principles for Error Design

1 - A program must not generate an implicit error as a result of receiving an explicit error.

2 - An escaping error must be used to convert a potential implicit error into an explicit error at a higher level.

3 - An error must be propagated to the program that manages its scope.

4 - Error interfaces must be concise and finite.
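Principles 1 and 2 can be illustrated with a read call whose interface only allows "a byte" or "end of file". This is a sketch under assumptions: `RemoteIoLost`, `readBad`, `readGood`, and `brokenStream` are hypothetical names, and the broken stream merely simulates a lost connection.

```java
import java.io.IOException;
import java.io.InputStream;

final class ScopedRead {
    /** Hypothetical escaping error: the I/O channel is gone, which no read() result can express. */
    static final class RemoteIoLost extends Error {
        RemoteIoLost(Throwable cause) {
            super("remote I/O channel lost", cause);
        }
    }

    // Violates principle 1: an explicit error (IOException) becomes an
    // implicit one, because -1 tells the caller "end of file".
    static int readBad(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            return -1;
        }
    }

    // Follows principle 2: the would-be implicit error escapes instead,
    // so a higher level can turn it into an explicit error of its own.
    static int readGood(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            throw new RemoteIoLost(e);
        }
    }

    /** Test helper: a stream whose every read fails, standing in for a dropped connection. */
    static InputStream brokenStream() {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("connection refused");
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(readBad(brokenStream())); // -1, looks like a clean end of file
        try {
            readGood(brokenStream());
        } catch (RemoteIoLost e) {
            System.out.println("escaped: " + e.getMessage());
        }
    }
}
```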

Error Scope

Definition: The scope of an error is the portion of a system that it invalidates.

Principle 3: An error must be propagated to the program that manages its scope.

Error Scope Examples

“File not found” simply has file scope.

[Diagram: nested error scopes. Program Scope: program (Code, Data). Virtual Machine Scope: JVM. Remote Resource Scope: starter. Local Resource Scope: shadow. Job Scope: schedd. Resources shown across the scopes: Input Data, Prog Args, Prog Image, Output Space, I/O Server, User Policy, Owner Policy, Java Pkg, Mem & CPU.]

Scope in Condor

Detail                      Scope             Handler
Program exited normally.    Program           User
Null pointer exception.     Program           User
Out of memory.              Virtual Machine   JVM
Java misconfigured.         Remote Resource   Starter
Home file system offline.   Local Resource    Shadow
Program image corrupt.      Job               Schedd

Scope in Condor: JVM Exit Code

Detail                      Scope             Handler   Exit Code
Program exited normally.    Program           User      (x)
Null pointer exception.     Program           User      1
Out of memory.              Virtual Machine   JVM       1
Java misconfigured.         Remote Resource   Starter   1
Home file system offline.   Local Resource    Shadow    1
Program image corrupt.      Job               Schedd    1
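The scope-to-handler table can be written down as data, which makes the weakness of a bare exit code of 1 easier to see. This is a sketch only: the mapping from concrete Java throwables to scopes (for example `NoClassDefFoundError` for "Java misconfigured" and `ClassFormatError` for "program image corrupt") is an assumption made for illustration, not Condor's actual logic.

```java
import java.net.ConnectException;

final class ErrorScopes {
    /** Scopes and handlers as listed in the table. */
    enum Scope {
        PROGRAM("User"),
        VIRTUAL_MACHINE("JVM"),
        REMOTE_RESOURCE("Starter"),
        LOCAL_RESOURCE("Shadow"),
        JOB("Schedd");

        final String handler;

        Scope(String handler) {
            this.handler = handler;
        }
    }

    // Assumed mapping from throwable type to scope; anything unrecognized
    // is treated as the program's own problem.
    static Scope scopeOf(Throwable t) {
        if (t instanceof OutOfMemoryError) return Scope.VIRTUAL_MACHINE;
        if (t instanceof NoClassDefFoundError) return Scope.REMOTE_RESOURCE;
        if (t instanceof ConnectException) return Scope.LOCAL_RESOURCE;
        if (t instanceof ClassFormatError) return Scope.JOB;
        return Scope.PROGRAM;
    }

    public static void main(String[] args) {
        Throwable[] samples = {
            new NullPointerException(),
            new OutOfMemoryError(),
            new NoClassDefFoundError("java/lang/String"),
            new ConnectException("home file system offline"),
            new ClassFormatError("truncated class file"),
        };
        for (Throwable t : samples) {
            Scope s = scopeOf(t);
            System.out.println(t.getClass().getSimpleName() + " -> " + s + " -> " + s.handler);
        }
    }
}
```

With only the JVM exit code to go on, all five samples would collapse to the same value; the scope is the extra piece of information the handler needs.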

What To Do With An Error?

A program cannot possibly know what to do with an error outside its scope.
– Should sin(x) deal with “math library not available?”

Propagate an error to the manager of the scope as directly as possible.

Sometimes, a direct mechanism:
– Signal, exception, dropped connection, message.

Sometimes, an indirect mechanism:
– Touch a file, then exit by any means available.
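The indirect mechanism can be sketched as a small wrapper that records the outcome and its scope in a result file before the process exits. This is an assumption-laden illustration, not the actual Condor wrapper: the one-line file format and the names `describe` and `runAndRecord` are hypothetical, and only two scopes are distinguished.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ResultFileWrapper {
    // Run the job and describe the outcome together with its scope.
    static String describe(Runnable job) {
        try {
            job.run();
            return "scope=program result=normal";
        } catch (VirtualMachineError e) {
            // For example OutOfMemoryError: the JVM, not the program, is invalid.
            return "scope=virtual-machine error=" + e.getClass().getSimpleName();
        } catch (Throwable t) {
            // Anything else is treated here as the program's own failure.
            return "scope=program error=" + t.getClass().getSimpleName();
        }
    }

    // "Touch a file": leave the outcome where the manager of the scope will look.
    static void runAndRecord(Runnable job, Path resultFile) throws IOException {
        String line = describe(job) + System.lineSeparator();
        Files.write(resultFile, line.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path resultFile = Files.createTempFile("result", ".txt");
        runAndRecord(() -> { throw new NullPointerException(); }, resultFile);
        System.out.print(new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8));
        // A real wrapper would now exit by any means available; the exit
        // code no longer has to carry the meaning, because the file does.
    }
}
```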

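[Diagram: result propagation. Inside the JVM, the Wrapper runs The Job (with the I/O Library) and writes the Program Result, or Error and Scope, to a Result File. The Job Agent (starter) combines the JVM Result with the Result File and passes Starter Result + Program Result to the Job Agent (shadow), which sits next to the Home File System.]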

[Diagram: the same components (shadow, Home File System, starter, I/O Proxy, JVM, Wrapper, I/O Library, The Job, Result File, JVM Result), marked with two paths: Errors Inside Program Scope and Errors of Larger Scope.]

Error Theory

An outline:
– Definitions of error types.
– Error relationships discussion.
– Four principles for error discipline.
– Error scope.

Unpopular position:
– Generic (expandable) errors must be exterminated!

Please take a closer look, and feel free to come argue with me!

Open Problems

Related Work

Anh Nguyen-Tuong and Andrew Grimshaw, Legion Reflective Graph and Event Model.
– Distributed applications keep a model of themselves.
– Very powerful when the entire system is known to every component.

John B. Goodenough, et al., “Exceptions”
– Must exceptions be declared in the interface?
– If not, how do we deal with escaping errors?

Hoare, et al., “Design by Contract”
– Motivates the distinction between explicit and escaping errors.
– How should escaping errors be structured?

Conclusion

Small but powerful changes drastically improved the Java Universe.

Our mistake was to represent all possible errors explicitly in the closest interface.

Error scope is an analytic tool that helps the designer decide how to propagate an error.

An error discipline saves precious resources: time and aggravation!

A Parting Thought

Very few existing structures can be lifted into distributed computing without change.

Can these results be distinguished?
– sh fails to load (result 1)
– gzip fails to load (result 1)
– file does not exist (result 1)
– file exists (result 0)

#!/bin/sh

gzip file

exit $?

For more information...

Douglas Thain – [email protected]

Miron Livny – [email protected]

Condor Software, Manuals, Papers, and More – http://www.cs.wisc.edu/condor

Questions now?