
Runtime Fault-Handling for Job-Flow Management In Grid Environments


The Generic Proxy sits between the Job-Flow Manager and the Meta-Scheduler and transparently intercepts the calls between these two layers. When an invocation fails, it applies recovery policies.

Recovery Policies contain rules to detect failures and a sequence of recovery actions to follow on failure detection. These policies are defined in the form of patterns that comprise common reusable recovery actions.
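The interception-and-recovery mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TRAP/BPEL implementation: the class name, the `submit` method, and the `(detection rule, recovery actions)` policy tuples are all hypothetical.

```python
# Minimal sketch of a fault-handling proxy: it forwards calls from the
# job-flow layer to the meta-scheduler and, on a failed invocation,
# runs the recovery actions of the first matching policy.

class InvocationFailure(Exception):
    """Raised when a call to the meta-scheduler fails."""

class GenericProxy:
    """Transparently intercepts job-flow -> meta-scheduler calls."""

    def __init__(self, meta_scheduler, policies):
        self.meta_scheduler = meta_scheduler
        # Each policy pairs a failure-detection rule with a sequence of
        # recovery actions to run when the rule matches.
        self.policies = policies

    def invoke(self, job):
        try:
            return self.meta_scheduler.submit(job)
        except InvocationFailure as fault:
            for detect, actions in self.policies:
                if detect(job, fault):
                    for action in actions:
                        result = action(job, fault)
                        if result is not None:  # recovery produced a result
                            return result
            raise  # no policy matched: propagate the original fault
```

A caller sees only `invoke`, so the job-flow remains unaware that any recovery took place.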

Two sample fault-tolerance patterns are provided: the Re-submit Job pattern and the Re-stage Data pattern.
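These two patterns might be expressed as reusable sequences of recovery actions, as in the sketch below. The domain objects (`local`, `remote`, `staging`) and their methods are illustrative assumptions, not part of the original framework.

```python
# Hypothetical encodings of the two sample patterns as reusable
# sequences of recovery actions; each action takes the job and fault.

def resubmit_job_pattern(local, remote):
    """Re-submit Job: cancel locally, submit the job in the remote domain."""
    return [
        lambda job, fault: local.cancel(job),    # returns None: keep going
        lambda job, fault: remote.submit(job),   # returns the new submission
    ]

def restage_data_pattern(scheduler, staging):
    """Re-stage Data: transfer the input data again, then retry the job."""
    return [
        lambda job, fault: staging.stage_inputs(job),
        lambda job, fault: scheduler.submit(job),
    ]
```

Because each pattern is just an action list, the same pattern can be attached to many different recovery policies.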


Motivation
- Provide flexible job-flow orchestration and submission for scientists
- Integrate service-based flow orchestration with job scheduling
- Provide fault-tolerance handling that is transparent to the job-flow

Approach
- Isolate job-flow orchestration from fault handling using a layered design
  - Top layer: job-flow (service) orchestration, using BPEL for job-flows
  - Second layer: job trapping & fault handling, using JSDL for jobs
- Introduce a fault-handling proxy to implement job recovery policies
  - Use the Generic Proxy from the TRAP/BPEL framework
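The second layer describes individual jobs in JSDL. As a point of reference, a minimal JSDL job description might look like the fragment below; the job name, executable, and argument are illustrative, not taken from the poster.

```xml
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
                    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>sample-job</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Application>
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/bin/echo</jsdl-posix:Executable>
        <jsdl-posix:Argument>hello</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
```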

Fault-handling policies can be job-specific, depending on:
- Type of job
- Type of failure
- Level of fault tolerance specified by the user
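A minimal sketch of how such job-specific selection might work, keyed on the three criteria above. The table entries, names, and string keys are hypothetical assumptions for illustration, not the framework's actual policy format.

```python
# Hypothetical policy table keyed by (job type, failure type, FT level).
POLICY_TABLE = {
    ("compute",  "submission_failure", "high"): "re-submit-job",
    ("transfer", "staging_failure",    "high"): "re-stage-data",
}

def select_policy(job_type, failure_type, ft_level, default="abort"):
    """Return the recovery pattern name for a job, or a default fallback."""
    return POLICY_TABLE.get((job_type, failure_type, ft_level), default)
```

A lookup like this keeps recovery behavior declarative: changing a job's fault-tolerance level selects a different pattern without touching the job-flow itself.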

Team: Gargi Dasgupta³, Onyeka Ezenwoye², Liana Fong¹, Selim Kalayci², S. Masoud Sadjadi², Balaji Viswanathan³

1: IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
2: Florida International University, Miami, FL 33199
3: IBM India Research Labs, New Delhi

2. Two-Layered Architecture

[Figure: two-layered architecture spanning Domain 1 (IBM) and Domain 2 (FIU). In each domain, a user or an application/portal submits a job-flow to the Job-Flow Manager (top layer), which invokes the Meta-Scheduler (second layer); the Meta-Scheduler dispatches jobs to local schedulers and local resources, guided by resource policies. The two domains interoperate through peer-to-peer protocols, and numbered arrows (1-7) trace the job submission paths across them.]

4. Prototypical Architecture

[Figure: prototype deployment. In Domain 1 (IBM), a portal client drives IBM's WebSphere Process Server, whose Generic Proxy fronts the IBM TDWB Meta-Scheduler and its local schedulers; in Domain 2 (FIU), a portal client drives the ActiveBPEL Engine, whose Generic Proxy fronts the Globus GridWay Meta-Scheduler and its local schedulers. Annotated recovery actions: re-submit job to the remote domain, and re-poll job at the remote domain.]


3. Runtime Fault-Handling patterns:
- Re-submit Job Pattern
- Re-stage Data Pattern