Checkpoint & Restart for Distributed Components in XCAT3 Sriram Krishnan* Indiana University, San Diego Supercomputer Center & Dennis Gannon Indiana University *[email protected]


Page 1: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint & Restart for Distributed Components in XCAT3

Sriram Krishnan*

Indiana University, San Diego Supercomputer Center

&

Dennis Gannon

Indiana University

*[email protected]

Page 2: Checkpoint & Restart for Distributed Components in XCAT3

Long-running Distributed Applications on the Grid

The Problem:
1. Launch simulation at Y
2. Launch simulation at Z
3. Link both simulations
4. Execute both simulations
5. Store results at X

[Figure: the Grid, with sites X, Y, and Z]

Need an effective way to orchestrate such computations

Page 3: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint & Restart

Motivation
- Basic fault tolerance via periodic checkpointing
  - Rollback to saved checkpoint upon failure
- Dynamic rescheduling of jobs
  - Checkpoint and restart on another location

Checkpointing Goals
- Correctness
- Portability
- Minimal checkpoint size
- Scalability
- Interoperability
- Checkpoint Availability

Page 4: Checkpoint & Restart for Distributed Components in XCAT3

Outline

- Motivation
- Background
  - The XCAT3 framework
  - Checkpoint & Restart
- Checkpointing & Restart in XCAT3
  - Software Techniques
  - Algorithms
  - Experiments
- Conclusions & Future work

Page 5: Checkpoint & Restart for Distributed Components in XCAT3

Application Orchestration: Component Architectures

A Component Architecture consists of two parts:
- Components
  - Software objects that implement a set of required behaviors
- Frameworks
  - A runtime environment
  - A set of services used by components

Benefits
- Encapsulation, modular construction of programs (via composition), reuse

Component Architectures adopted in various domains
- Business: EJB, CCM, COM/DCOM
- Scientific Computing: CCA

Page 6: Checkpoint & Restart for Distributed Components in XCAT3

Common Component Architecture

- A ComponentID for identification & management purposes
- Ports: the public interfaces of a component
  - Defines the different ways we can interact with a component and the ways the component uses other services and components.
- Provides Ports - interface functions provided by component
- Uses Ports - interface of a service used by component

[Figure: Image Processing Component with setImage(Image I), Image getImage(), adjustColor(), and setFilter(Filter); the component calls doFFT(…)]
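The Provides/Uses distinction can be illustrated with a minimal Java sketch. All names here (FFTPort, AcmeFFT, connectUsesPort, applyFilter) are hypothetical and chosen to echo the slide; they are not the CCA or XCAT3 API, and the "FFT" body is only a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical port type; the name echoes doFFT(...) from the slide.
interface FFTPort {
    double[] doFFT(double[] samples);
}

// A component that PROVIDES FFTPort.
// Placeholder body: it doubles each sample and is NOT a real FFT.
class AcmeFFT implements FFTPort {
    public double[] doFFT(double[] samples) {
        double[] out = new double[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = 2.0 * samples[i];
        }
        return out;
    }
}

// A component that provides image methods and USES an FFTPort.
class ImageProcessingComponent {
    // Uses ports are looked up by name and wired at runtime.
    private final Map<String, Object> usesPorts = new HashMap<>();
    private double[] image = new double[0];

    // Assumed wiring hook: connect a uses port to some provides port.
    void connectUsesPort(String name, Object providesPort) {
        usesPorts.put(name, providesPort);
    }

    // Provides-port style methods.
    public void setImage(double[] pixels) { image = pixels.clone(); }
    public double[] getImage() { return image.clone(); }

    // Calls out through the uses port, as setFilter(...) calls doFFT(...) on the slide.
    public void applyFilter() {
        Object port = usesPorts.get("fft");
        if (!(port instanceof FFTPort)) {
            throw new IllegalStateException("uses port 'fft' is not connected");
        }
        image = ((FFTPort) port).doFFT(image);
    }
}
```

The component never names AcmeFFT directly; any object implementing FFTPort can be connected, which is what makes composition possible.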

Page 7: Checkpoint & Restart for Distributed Components in XCAT3

XCAT3: CCA Framework for the Grid

- Grid Service Extensions (GSX) Toolkit used for OGSI-compatible Grid services
  - Standard protocols used by Grid services: SOAP, HTTP
  - http://www.extreme.indiana.edu/xgws/GSX
- A Component is represented as a set of Grid services
  - Provides ports and ComponentIDs are Grid services
  - Uses ports are Grid service clients

Sriram Krishnan and Dennis Gannon. XCAT3: A Framework for CCA Components as OGSA Services. In HIPS 2004, 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments. April 2004.

Page 8: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Software Techniques

System-level Techniques
- Automatic transparent checkpointing for an application at the operating system or middleware level

User-defined Techniques
- Non-transparent checkpointing for an application that relies on the programmer to identify the minimal information needed for restart

Page 9: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Software Techniques

System-Level
- Transparent to the user: No expertise required
- Not very portable across platforms
- Larger checkpoint sizes: Typically complete process images stored
- Less flexible: Application is treated as a black box

User-defined
- Not transparent to the user: Considerable expertise required
- More portable across platforms
- Smaller checkpoint sizes: Only minimal state stored
- More flexible: Application information can be used

Page 10: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Examples

System-level Techniques
- Condor
- LAM-MPI
- Enterprise Java Beans
- CORBA Components

User-defined Techniques
- CUMULVS
- Enterprise Java Beans
- CORBA Components

Global Grid Forum: Grid Checkpoint/Recovery Group
- User-defined checkpointing APIs for Grid services
- Do not address consistent global checkpoints for distributed applications

Consistent global checkpoint: a set of individual checkpoints that constitute a state that occurs in a failure-free, correct execution

Page 11: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing Technique in XCAT3

User-defined & System-assisted
- User is responsible for identifying local component state
- Framework is responsible for:
  - Generating complete state of the component, viz. local component state, connection state, and environment state
  - Algorithms for generating global component states, and storing them into stable storage
- Component writer implements the following methods:
  - generateComponentState()
  - loadComponentState()
  - resumeExecution()
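A minimal Java sketch of what a component writer might supply. Only the three method names come from the slide; the signatures (a byte array in and out), the interface name, and the toy worker are assumptions for illustration and may differ from the real XCAT3 API.

```java
import java.nio.ByteBuffer;

// Hypothetical interface: method names from the slide, signatures assumed.
interface CheckpointableComponent {
    byte[] generateComponentState();        // minimal local state needed for restart
    void loadComponentState(byte[] state);  // restore that state in a fresh instance
    void resumeExecution();                 // continue from the restored point
}

// Toy worker: sums the iteration indices 0..totalIterations-1.
class IterativeWorker implements CheckpointableComponent {
    private final int totalIterations;
    private int nextIteration = 0;
    private double partialSum = 0.0;

    IterativeWorker(int totalIterations) {
        this.totalIterations = totalIterations;
    }

    // One unit of work.
    void step() {
        partialSum += nextIteration;
        nextIteration++;
    }

    // Only the two fields that matter are saved, not the whole process image.
    public byte[] generateComponentState() {
        return ByteBuffer.allocate(Integer.BYTES + Double.BYTES)
                .putInt(nextIteration)
                .putDouble(partialSum)
                .array();
    }

    public void loadComponentState(byte[] state) {
        ByteBuffer buf = ByteBuffer.wrap(state);
        nextIteration = buf.getInt();
        partialSum = buf.getDouble();
    }

    // Runs the remaining iterations from wherever the state left off.
    public void resumeExecution() {
        while (nextIteration < totalIterations) {
            step();
        }
    }

    double result() {
        return partialSum;
    }
}
```

A worker checkpointed after a few steps and restored into a new instance finishes with the same result as an uninterrupted run, which is the property the user-defined state must guarantee.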

Page 12: Checkpoint & Restart for Distributed Components in XCAT3

Distributed Checkpointing

Algorithm Overview: Coordinated blocking checkpoint algorithm
- Block all port communication between components
- Take individual checkpoints, and commit them atomically
- Resume port communication between components

Novelty: Application to RPC-based component framework
- Typically, such algorithms are applied to messaging frameworks
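The block, checkpoint, commit, resume sequence can be sketched in a single process as follows. This is a simplified illustration, not the XCAT3 implementation: the interface, class names, and in-memory "commit" are assumptions, and a blocked port call is refused here, whereas a real framework would hold the call until ports are unblocked.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical component-side hooks; not the XCAT3 API.
interface BlockableComponent {
    void blockPorts();
    byte[] checkpoint();
    void unblockPorts();
}

// Toy component whose whole state is one counter.
class ToyComponent implements BlockableComponent {
    private int counter;
    private boolean blocked;

    ToyComponent(int counter) { this.counter = counter; }

    // Stands in for a remote port call; refused while ports are blocked.
    void portCall() {
        if (blocked) throw new IllegalStateException("ports are blocked");
        counter++;
    }

    public void blockPorts() { blocked = true; }
    public byte[] checkpoint() {
        return Integer.toString(counter).getBytes(StandardCharsets.UTF_8);
    }
    public void unblockPorts() { blocked = false; }
    boolean isBlocked() { return blocked; }
}

class CheckpointCoordinator {
    private final Map<String, BlockableComponent> components;
    // Last committed global checkpoint: component name -> individual checkpoint.
    private Map<String, byte[]> committed = new LinkedHashMap<>();

    CheckpointCoordinator(Map<String, BlockableComponent> components) {
        this.components = components;
    }

    Map<String, byte[]> checkpointAll() {
        List<BlockableComponent> blocked = new ArrayList<>();
        try {
            // 1. Block all port communication between components.
            for (BlockableComponent c : components.values()) {
                c.blockPorts();
                blocked.add(c);
            }
            // 2. Take individual checkpoints into a pending set.
            Map<String, byte[]> pending = new LinkedHashMap<>();
            for (Map.Entry<String, BlockableComponent> e : components.entrySet()) {
                pending.put(e.getKey(), e.getValue().checkpoint());
            }
            // 3. Commit: the old global checkpoint is replaced only after
            //    every individual checkpoint has been taken.
            committed = pending;
        } finally {
            // 4. Resume port communication, even if a checkpoint failed.
            for (BlockableComponent c : blocked) {
                c.unblockPorts();
            }
        }
        return committed;
    }
}
```

Because no port call can change a component between steps 1 and 4, the individual checkpoints together describe one state of the whole application.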

Page 13: Checkpoint & Restart for Distributed Components in XCAT3

The Big Picture

[Figure: an Application Coordinator, Distributed Components on the Grid (X, Y, Z), and Persistent Storage: a Federation of Master (MS) & Individual Storage (IS) Services]

Page 14: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Checkpoint Components

Page 15: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Block all port communication between components

Page 16: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

All communication between components blocked

Page 17: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Find best available Storage service URLs

Page 18: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Store checkpoints into Storage services

Page 19: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Return storageIDs for stored state

Page 20: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Atomically update locators for individual checkpoints

Page 21: Checkpoint & Restart for Distributed Components in XCAT3

Checkpoint Algorithm

[Figure: Application Coordinator, components X, Y, and Z, and Persistent Storage (MS and IS services)]

Un-block communication between components

Page 22: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Correctness

Consistency of Global Checkpoint
- A flavor of coordinated blocking algorithms, which are well accepted to be correct

Atomicity of Checkpoints
- Locators for the global checkpoint are updated atomically after all components have been checkpointed
- Not possible to have a scenario where a global checkpoint consists of a combination of old and new individual checkpoints
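One way to get this all-or-nothing behavior is to publish a complete new locator set through a single reference swap. The Java sketch below is an assumption about how such an update could be structured; LocatorTable and its methods are hypothetical and do not describe the actual Master Storage service.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the table of locators (component name -> storageID).
class LocatorTable {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    // Publish a complete new set of locators in one step.
    // A set that does not cover every expected component is rejected,
    // so readers see either the old global checkpoint or the new one.
    void commit(Map<String, String> newLocators, Set<String> expectedComponents) {
        if (!newLocators.keySet().equals(expectedComponents)) {
            throw new IllegalArgumentException("incomplete global checkpoint");
        }
        Map<String, String> snapshot =
                Collections.unmodifiableMap(new HashMap<>(newLocators));
        current.set(snapshot); // single atomic reference swap
    }

    // Locators of the last committed global checkpoint.
    Map<String, String> globalCheckpoint() {
        return current.get();
    }
}
```

A restart that reads globalCheckpoint() therefore never sees a mix of old and new storageIDs, even if a later checkpoint attempt fails partway.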

Page 23: Checkpoint & Restart for Distributed Components in XCAT3

Restart Algorithm

Also implemented by the Application Coordinator

Details
- Destroy executing instances, if need be
- Restart all components (possibly on other resources)
- Load state of components from the Storage services
- Resume execution of all control threads, after the states of every component have been loaded from the Storage services
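The two-phase shape of the last two steps (load everything, then resume everything) can be sketched in Java as below. The interface, the in-memory maps standing in for locators and Storage services, and the recording component are all hypothetical; destroying and re-instantiating components on other resources is outside this sketch.

```java
import java.util.List;
import java.util.Map;

// Hypothetical restart hooks; method names follow the earlier slide.
interface Restartable {
    void loadComponentState(byte[] state);
    void resumeExecution();
}

// Demo component that only records the order in which it is called.
class RecordingComponent implements Restartable {
    private final String name;
    private final List<String> log;

    RecordingComponent(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    public void loadComponentState(byte[] state) { log.add("load:" + name); }
    public void resumeExecution() { log.add("resume:" + name); }
}

class RestartCoordinator {
    // fresh:    newly started component instances, by name
    // locators: component name -> storageID of its individual checkpoint
    // storage:  storageID -> stored state (stand-in for the Storage services)
    static void restart(Map<String, Restartable> fresh,
                        Map<String, String> locators,
                        Map<String, byte[]> storage) {
        // Phase 1: load the state of every component.
        for (Map.Entry<String, Restartable> e : fresh.entrySet()) {
            String storageId = locators.get(e.getKey());
            byte[] state = (storageId == null) ? null : storage.get(storageId);
            if (state == null) {
                throw new IllegalStateException("no checkpoint for " + e.getKey());
            }
            e.getValue().loadComponentState(state);
        }
        // Phase 2: resume only after all states have been loaded, so no
        // component calls a peer that is still in its pre-restart state.
        for (Restartable r : fresh.values()) {
            r.resumeExecution();
        }
    }
}
```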

Page 24: Checkpoint & Restart for Distributed Components in XCAT3

Test Application: Chem-Eng Simulation

- Based on the simulation of copper electro-deposition on resistive substrate (NCSA-UIUC)
  - Master-Worker model of execution
  - Variable number of workers, and data size per worker
- generateComponentState(), loadComponentState(), and resumeExecution() methods added to support checkpointing and restart
  - Required identification of the various execution states of the master and worker components

Page 25: Checkpoint & Restart for Distributed Components in XCAT3

Experiment Setup

Hardware setup
- 8 node Linux cluster
  - 2.8GHz dual processor Intel Xeon processors
  - Red Hat Linux 8.0
  - 2GB Memory
  - 1Gbps Ethernet
  - SUN’s JDK 1.4.2_04
- Federation of 1 Master & 8 Individual Storage services used
- Single GSX-based Handle Resolver

Page 26: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Master Processing

Page 27: Checkpoint & Restart for Distributed Components in XCAT3

Checkpointing: Workers Processing

Page 28: Checkpoint & Restart for Distributed Components in XCAT3

Future Work

Framework
- Integration with the Web Service Resource Framework (WSRF)

Fault Tolerance
- Fault Monitoring
- Reliable communication between components
- Checkpoint Optimizations
- Storage Service Optimizations

Applications
- Use of XCAT3 for LEAD (http://lead.ou.edu)

Page 29: Checkpoint & Restart for Distributed Components in XCAT3

Conclusions

A framework for checkpointing & restart of distributed applications on the Grid
- CCA-based component framework consistent with Grid standards
- User-defined, platform-independent checkpoints
- APIs for checkpointing, and algorithms for capturing global checkpoints and for restart provided by the framework

http://www.extreme.indiana.edu/xcat/

Page 30: Checkpoint & Restart for Distributed Components in XCAT3

Appendix

Page 31: Checkpoint & Restart for Distributed Components in XCAT3

OGSI Compatibility

Representation for Provides ports
- In traditional Grid/Web services, multiple ports of the same portType are semantically equivalent
- CCA allows multiple ports of the same type
- CCA ports cannot be mapped to Web service ports!
- Hence, every Provides port is mapped as a separate Grid service
- A single portType containing the Provides port interface

Representation for Uses ports
- Clients of Grid services (Provides ports)
- Connections to Provides ports made at runtime

Page 32: Checkpoint & Restart for Distributed Components in XCAT3

OGSI Compatibility

Representation for the ComponentID
- Also a Grid service
- Acts as a Manager for the other Provides ports
- Contains SDEs containing GSH/GSRs for the various Provides ports

The Provides ports and ComponentID services, and the Uses ports communicate via shared state

Page 33: Checkpoint & Restart for Distributed Components in XCAT3

Building Applications by Composition

Connect Uses Ports to Provides Ports.

[Figure: Image tool graphical interface component, Image database component, Image Processing Component (setImage(…), getImage(), adjustColor()), and AcmeFFT component (doFFT(…))]

Page 34: Checkpoint & Restart for Distributed Components in XCAT3

Restart Algorithm

Page 35: Checkpoint & Restart for Distributed Components in XCAT3

Test Application: Chem-Eng Simulation