Optimization of Google Cloud Task Processing with Checkpoint-Restart Mechanism

Speaker: Sheng Di
Coauthors: Yves Robert, Frédéric Vivien, Derrick Kondo, Franck Cappello

Outline
Background of Google Cloud Task Processing
System Overview
Research Formulation
Optimization of Fault-tolerance
  Optimization of the Number of Checkpoints
  Adaptive Optimization of Fault Tolerance
  Local disk vs. Shared disk
Performance Evaluation
Conclusion and Future Work

Background
Google trace (released in November 2011):
  670,000 jobs, 2,500,000 tasks, 12,000 nodes
  One-month period (29 days)
  Various events, resource request/allocation, job/task length, various attributes, etc.
There are two types of jobs in the Google trace: sequential-task jobs and Bag-of-Task jobs
4,000 application types, such as map-reduce
Failure events occur often for some tasks!
Most task lengths are short (from a few minutes to dozens of minutes), so task execution is sensitive to checkpointing cost.

[Figure: system architecture layers. Labels: User Interface (Task Parser), Job/Task Scheduling Layer, Resource Allocation Layer, Virtual Machine Layer, Physical Infrastructure Layer, Service Layer]

System Overview
User Interface: receives tasks
Task Scheduling: coordinates resource competition among hosts
Resource Allocation: coordinates resource usage within a particular host

System Overview (Cont’d)
Task Processing Procedure
[Figure: task processing procedure. Labels: Job Submission, Cloud server, Job, Task, Queue, Resource Pool, Task scheduling notification, Physical node, Running VM, Failed VM or Service, Process Restart or Migration. Stages: Job scheduling & Resource Isolation; Task Execution & Checkpointing; Process Restarting & Migration]

Research Formulation
Analysis of Google trace: task failure intervals, task length, job structure
Equidistant checkpointing model: the checkpointing interval for a particular task is fixed
Task execution model (suppose k failures):
  Tw(task) = Te(task) + C(x-1) + Σk{roll-back-loss} + Σk{restart-cost}
Objective: minimize E(Tw(task))
Random variable: K (# of task failure events)
Compute the optimal # of checkpoints for a Google task
[Figure: a task’s wall-clock time from task entry to task exit, divided into productive time, checkpoint cost, roll-back loss, and restart cost]
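The task execution model above can be explored numerically. The following is a minimal sketch, not the authors' implementation. It assumes that each failure loses on average half a checkpointing interval of work (Te/(2x)) and that every restart costs a fixed `restart_cost`. Both assumptions, and all function and parameter names, are hypothetical.

```python
def expected_wall_clock(te, c, x, expected_failures, restart_cost=0.0):
    """Expected wall-clock time E(Tw) of a task split into x equidistant intervals.

    te: productive length, c: cost of one checkpoint,
    expected_failures: expected number of failures during the task.
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    checkpoint_overhead = c * (x - 1)  # x intervals require x-1 checkpoints
    # Assumption: a failure strikes, on average, in the middle of an interval.
    rollback_loss = expected_failures * te / (2 * x)
    restart_overhead = expected_failures * restart_cost
    return te + checkpoint_overhead + rollback_loss + restart_overhead


def best_interval_count(te, c, expected_failures, max_x=1000):
    """Brute-force the integer x in [1, max_x] that minimizes E(Tw)."""
    return min(
        range(1, max_x + 1),
        key=lambda x: expected_wall_clock(te, c, x, expected_failures),
    )
```

With a productive length of 18 seconds, C = 2 seconds and 2 expected failures, the brute-force search settles on 3 intervals under these assumptions.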

Theorem 1:
  x*: the optimal number of checkpointing intervals
  Te: task execution length (productive length)
  E(Y): task’s expected # of failures (characterized by MNOF, the mean number of failures)
  C: checkpoint cost (time increment per checkpoint)
Formula (3): x* = sqrt(Te · E(Y) / (2C))
Example:
  A task’s productive length is 18 seconds, C = 2 sec, and the expected # of failures during its execution is 2
  Optimal # of checkpointing intervals = sqrt(18*2/(2*2)) = 3
  The optimal checkpointing interval = 18/3 = 6 seconds
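The worked example can be reproduced in a few lines. The closed form used here, x* = sqrt(Te · E(Y) / (2C)), is inferred from the arithmetic of the example above, so treat it as a reading of Formula (3) rather than a verbatim copy. The function names and the rounding to a whole number of intervals are assumptions made for illustration.

```python
import math


def optimal_intervals(te, expected_failures, c):
    """Real-valued optimal number of checkpointing intervals x*."""
    if c <= 0:
        raise ValueError("checkpoint cost must be positive")
    return math.sqrt(te * expected_failures / (2 * c))


def optimal_interval_length(te, expected_failures, c):
    """Length of one interval after rounding x* to a whole number (at least 1)."""
    # Assumption: round to the nearest integer; a task that is never expected
    # to fail gets a single interval, i.e. no checkpoint at all.
    x = max(1, round(optimal_intervals(te, expected_failures, c)))
    return te / x
```

For Te = 18, E(Y) = 2 and C = 2, `optimal_intervals` returns 3.0 and `optimal_interval_length` returns 6.0, matching the slide.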

Optimization of the Number of Checkpoints: New formula
Formula (3) does not depend on the probability distribution of failures, unlike Young’s formula
Young’s formula (proposed in 1974)
Optimal checkpoint interval: Tc = sqrt(2 · C · Tf)
  C: checkpointing cost
  Tf: mean time between failures (MTBF)
Conditions:
  (1) Task failure intervals follow an exponential distribution
  (2) Checkpoint cost C is far smaller than checkpoint interval Tc, because the derivation relies on a second-order Taylor series approximation

Optimization of the Number of Checkpoints: Discussion
The exponential-distribution assumption makes Young’s formula unsuitable for Google task processing
[Figure: distribution of Google task failure intervals based on priority]
Corollary 1: Young’s formula is a special case of Formula (3)
Two important conditions:
  Task failure intervals follow an exponential distribution
  Checkpointing cost is small
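A quick numeric check makes the corollary concrete. The sketch below assumes that, for exponentially distributed failure intervals with mean Tf and a small checkpointing cost, the expected number of failures during a task is approximately Te/Tf. Under that assumption the interval Te/x* obtained from x* = sqrt(Te · E(Y) / (2C)) coincides with Young's sqrt(2 · C · Tf). The function names and the sample values are illustrative only.

```python
import math


def young_interval(c, mtbf):
    """Young's optimal checkpoint interval, sqrt(2 * C * Tf)."""
    return math.sqrt(2 * c * mtbf)


def formula3_interval(te, expected_failures, c):
    """Interval length Te / x*, with x* = sqrt(Te * E(Y) / (2C))."""
    x_star = math.sqrt(te * expected_failures / (2 * c))
    return te / x_star


te, c, mtbf = 3600.0, 2.0, 400.0  # illustrative values, in seconds
expected_failures = te / mtbf     # assumption: exponential failures, small C

print(young_interval(c, mtbf))
print(formula3_interval(te, expected_failures, c))
```

Both calls give an interval of about 40 seconds. When the failure intervals are not exponential, E(Y) is no longer Te/Tf and the two formulas diverge.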

Optimization of the Number of Checkpoints: Discussion
Our formula (3) is easier to apply than Young’s formula in practice
- Young’s formula depends on MTBF, while MTBF may not be easy to predict precisely:
    Unsynchronized clocks across hosts
    Inevitable influence of checkpointing cost
    Significant delay of failure detection
- By contrast, MNOF is easy to record accurately
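Recording MNOF only requires counting failure events per task, with no timestamps involved. The sketch below groups tasks by priority, since the talk characterizes failures by priority. The record layout, a `(priority, failure_count)` pair per finished task, is a hypothetical format and not the schema of the Google trace.

```python
from collections import defaultdict


def mnof_by_priority(task_records):
    """Mean number of failures per task, grouped by priority.

    task_records: iterable of (priority, failure_count) pairs,
    one pair per finished task (hypothetical record format).
    """
    totals = defaultdict(lambda: [0, 0])  # priority -> [failure sum, task count]
    for priority, failures in task_records:
        totals[priority][0] += failures
        totals[priority][1] += 1
    return {p: total / count for p, (total, count) in totals.items()}
```

The resulting per-priority value can be used directly as E(Y) when choosing the number of checkpointing intervals for a new task of that priority.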

Adaptive Optimization of Chpt Positions
Problem: what if the probability distribution of failure intervals (or failure rates) changes over time? This is possible due to changeable priority ….
Objective: to design an adaptive algorithm that dynamically suits the changing failure rates.
Question: will the optimal checkpoint positions change as the remaining workload decreases over time?
Solution: we just need to monitor MNOF, regardless of the decreasing remaining workload to process, because of Theorem 2
[Figure: timeline marking the kth and (k+1)th checkpoints and the current time, asking what the optimal checkpoint intervals are later on]

Adaptive Optimization of Fault Tolerance (Cont’d)
Theorem 2: relates the following two quantities
  Optimal # of checkpointing intervals computed at the (k+1)th checkpoint position
  Optimal # of checkpointing intervals computed at the kth checkpoint position
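The equation of Theorem 2 is not preserved in this text, so the sketch below only illustrates the surrounding claim that the planned checkpoint positions stay put while MNOF is unchanged. It makes two assumptions: x* = sqrt(Te · E(Y) / (2C)), as inferred from the worked example earlier in the talk, and an expected number of remaining failures that shrinks in proportion to the remaining workload. All names are hypothetical.

```python
import math


def optimal_intervals(remaining_te, expected_failures, c):
    """Optimal number of intervals for the remaining workload."""
    return math.sqrt(remaining_te * expected_failures / (2 * c))


def replan_after_checkpoint(remaining_te, expected_failures, c):
    """Return (x_k, x_next): the plan now and the plan one interval later."""
    x_k = optimal_intervals(remaining_te, expected_failures, c)
    if x_k <= 1:
        return x_k, 0.0  # the remaining work fits in a single interval
    done_fraction = 1.0 / x_k  # one interval of the current plan completes
    next_te = remaining_te * (1 - done_fraction)
    # Assumption: failures are proportional to the workload still to process.
    next_failures = expected_failures * (1 - done_fraction)
    return x_k, optimal_intervals(next_te, next_failures, c)
```

Under these assumptions the recomputed plan has exactly one interval fewer, so the remaining checkpoints do not move. The plan only needs to be rebuilt when the monitored MNOF itself changes, for example after a priority change.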

Local disk vs. Shared disk checkpointing
Characterization based on BLCR
[Figure: operation time cost in setting a checkpoint]

Performance Evaluation
Experimental Setting
We build a testbed based on the Google trace, in a cluster with hundreds of VM instances running across 16 nodes (16*8 cores, 16*16 GB memory, Xen 4.0, BLCR)
We call it GloudSim (Google based cloud simulation system) [under review by HiPC’13]
We reproduce Google task execution as closely as possible to the Google trace, e.g.,
  Task arrivals are based on the trace or some distribution
  Task’s memory is reproduced via the Google trace
  Task’s failure events are reproduced via the Google trace
  Each job is chosen from among all sample jobs in the trace

Performance Evaluation (Cont’d)
Experimental Results
Job’s Workload-Processing Ratio (WPR)
[Figure: checkpointing effect with precise prediction (on MNOF and MTBF)]
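WPR is not defined in this text. The sketch below assumes the natural reading for a single task, namely the productive time divided by the wall-clock time, so that 1.0 means no time was lost to checkpoints, roll-backs, or restarts. The function name and this definition are assumptions, and the job-level metric used in the evaluation may be aggregated differently.

```python
def workload_processing_ratio(productive_time, wall_clock_time):
    """Assumed WPR: fraction of the wall-clock time spent on productive work."""
    if wall_clock_time <= 0:
        raise ValueError("wall-clock time must be positive")
    return productive_time / wall_clock_time
```

For instance, a task with 18 seconds of productive work that takes 28 seconds of wall-clock time would have a WPR of about 0.64 under this definition.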

Performance Evaluation (Cont’d)
[Figure: distribution of WPR with different C/R formulas]

Performance Evaluation (Cont’d)
MNOF & MTBF w.r.t. Priority in Google trace
MNOF is stable across task lengths, while MTBF is not (it varies from 179 to 4199 secs)

Performance Evaluation (Cont’d)
Min/Avg/Max WPR with respect to different priorities
Our formula outperforms Young’s formula by 3-10%

Performance Evaluation (Cont’d)
Wall-clock lengths of 10,000 job executions
Conclusion: job wall-clock lengths are often 50-100 seconds longer under Young’s formula than under ours.

Performance Evaluation (Cont’d)
Adaptive Algorithm vs. Static Algorithm

Conclusion and Future Work
Selected conclusions:
  Our formula (3) outperforms Young’s formula by 3-10 percent for Google task processing
  Job wall-clock lengths are 50-100 seconds longer under Young’s formula than under ours
  The worst WPR under the dynamic algorithm stays at about 0.8, compared to 0.5 under the static algorithm
Future work:
  Port our theorems to more cases, such as MPI over Cloud platforms

Thanks for your attention!!
Contact me at: disheng222@gmail.com
