
Runtime Fault-Handling for Job-Flow Management In Grid Environments


The Generic Proxy sits between the Job-Flow Manager and the Meta-Scheduler and transparently intercepts the calls between these two layers. When an invocation fails, it applies recovery policies.

Recovery Policies contain rules to detect failures and a sequence of recovery actions to follow on failure detection. These policies are defined in the form of patterns that comprise common reusable recovery actions.
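The interception-and-recovery mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TRAP/BPEL implementation: the class name, the `submit` method, and the `(detection rule, recovery actions)` policy tuples are all hypothetical.

```python
# Minimal sketch of a fault-handling proxy: it forwards calls from the
# job-flow layer to the meta-scheduler and, on a failed invocation,
# runs the recovery actions of the first matching policy.

class InvocationFailure(Exception):
    """Raised when a call to the meta-scheduler fails."""

class GenericProxy:
    """Transparently intercepts job-flow -> meta-scheduler calls."""

    def __init__(self, meta_scheduler, policies):
        self.meta_scheduler = meta_scheduler
        # Each policy pairs a failure-detection rule with a sequence of
        # recovery actions to run when the rule matches.
        self.policies = policies

    def invoke(self, job):
        try:
            return self.meta_scheduler.submit(job)
        except InvocationFailure as fault:
            for detect, actions in self.policies:
                if detect(job, fault):
                    for action in actions:
                        result = action(job, fault)
                        if result is not None:  # recovery produced a result
                            return result
            raise  # no policy matched: propagate the original fault
```

A caller sees only `invoke`, so the job-flow remains unaware that any recovery took place.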

Two sample fault-tolerance patterns are provided: the Re-submit Job pattern and the Re-stage Data pattern.
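These two patterns might be expressed as reusable sequences of recovery actions, as in the sketch below. The domain objects (`local`, `remote`, `staging`) and their methods are illustrative assumptions, not part of the original framework.

```python
# Hypothetical encodings of the two sample patterns as reusable
# sequences of recovery actions; each action takes the job and fault.

def resubmit_job_pattern(local, remote):
    """Re-submit Job: cancel locally, submit the job in the remote domain."""
    return [
        lambda job, fault: local.cancel(job),    # returns None: keep going
        lambda job, fault: remote.submit(job),   # returns the new submission
    ]

def restage_data_pattern(scheduler, staging):
    """Re-stage Data: transfer the input data again, then retry the job."""
    return [
        lambda job, fault: staging.stage_inputs(job),
        lambda job, fault: scheduler.submit(job),
    ]
```

Because each pattern is just an action list, the same pattern can be attached to many different recovery policies.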


Motivation
- Provide flexible job-flow orchestration and submission for scientists
- Integrate service-based flow orchestration with job scheduling
- Provide fault-tolerance handling that is transparent to the job-flow

Approach
- Isolate job-flow orchestration from fault handling using a layered design
  - Top layer: job-flow (service) orchestration, using BPEL for job-flows
  - Second layer: job trapping & fault handling, using JSDL for jobs
- Introduce a fault-handling proxy to implement job recovery policies
  - Use the Generic Proxy from the TRAP/BPEL framework
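The second layer describes individual jobs in JSDL. As a point of reference, a minimal JSDL job description might look like the fragment below; the job name, executable, and argument are illustrative, not taken from the poster.

```xml
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>sample-job</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Application>
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/bin/echo</jsdl-posix:Executable>
        <jsdl-posix:Argument>hello</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
```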

Fault-handling policies can be job-specific, depending on:
- Type of job
- Type of failure
- Level of fault tolerance specified by the user
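A minimal sketch of how such job-specific selection might work, keyed on the three criteria above. The table entries, names, and string keys are hypothetical assumptions for illustration, not the framework's actual policy format.

```python
# Hypothetical policy table keyed by (job type, failure type, FT level).
POLICY_TABLE = {
    ("compute",  "submission_failure", "high"): "re-submit-job",
    ("transfer", "staging_failure",    "high"): "re-stage-data",
}

def select_policy(job_type, failure_type, ft_level, default="abort"):
    """Return the recovery pattern name for a job, or a default fallback."""
    return POLICY_TABLE.get((job_type, failure_type, ft_level), default)
```

A lookup like this keeps recovery behavior declarative: changing a job's fault-tolerance level selects a different pattern without touching the job-flow itself.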

Team: Gargi Dasgupta³, Onyeka Ezenwoye², Liana Fong¹, Selim Kalayci², S. Masoud Sadjadi², Balaji Viswanathan³

1: IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
2: Florida International University, Miami, FL 33199
3: IBM India Research Labs, New Delhi

2. Two-Layered Architecture

[Figure: two-layered architecture spanning Domain 1 (IBM) and Domain 2 (FIU). In each domain, a user or an application/portal submits a job-flow to the Job-Flow Manager (top layer), which invokes the Meta-Scheduler (second layer); the Meta-Scheduler dispatches jobs to local schedulers and local resources, guided by resource policies. The two domains interoperate through peer-to-peer protocols, and numbered arrows (1-7) trace the job submission paths across them.]

4. Prototypical Architecture

[Figure: prototype deployment. In Domain 1 (IBM), a portal client drives IBM's WebSphere Process Server, whose Generic Proxy fronts the IBM TDWB Meta-Scheduler and its local schedulers; in Domain 2 (FIU), a portal client drives the ActiveBPEL Engine, whose Generic Proxy fronts the Globus GridWay Meta-Scheduler and its local schedulers. Annotated recovery actions: re-submit job to the remote domain, and re-poll job at the remote domain.]


3. Runtime Fault-Handling patterns:
- Re-submit Job Pattern
- Re-stage Data Pattern