
Intro Checkpointing Forward-recovery Silent Errors Conclusion

An overview of fault-tolerant techniques for HPC

Yves Robert

Laboratoire LIP, ENS Lyon
Institut Universitaire de France
University of Tennessee Knoxville

Yves.Robert@inria.fr

http://graal.ens-lyon.fr/~yrobert/europar16.pdf

http://www.netlib.org/lapack/lawnspdf/lawn289.pdf

EuroPar’2016 Tutorial

yves.robert@inria.fr Fault-tolerance for HPC 1/ 168


Thanks

INRIA & ENS Lyon

Anne Benoit & Frederic Vivien

Guillaume Aupy, Aurelien Cavelan, Hongyang Sun

Univ. Tennessee Knoxville

George Bosilca, Aurelien Bouteiller & Thomas Herault (joint tutorial at SC'14, SC'15, SC'16 and elsewhere)

Jack Dongarra

Elsewhere

Franck Cappello & Marc Snir, Argonne National Lab.

Henri Casanova, Univ. Hawai‘i


Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Outline

1 Introduction
  Large-scale computing platforms
  Faults and failures

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Exascale platforms (courtesy Jack Dongarra)

Potential system architecture with a cap of $200M and 20 MW

                              Systems 2011        2019                   Difference
                              (K computer)                               Today & 2019
System peak                   10.5 Pflop/s        1 Eflop/s              O(100)
Power                         12.7 MW             ~20 MW
System memory                 1.6 PB              32-64 PB               O(10)
Node performance              128 GF              1.2 or 15 TF           O(10)-O(100)
Node memory BW                64 GB/s             2-4 TB/s               O(100)
Node concurrency              8                   O(1k) or 10k           O(100)-O(1000)
Total node interconnect BW    20 GB/s             200-400 GB/s           O(10)
System size (nodes)           88,124              O(100,000) or O(1M)    O(10)-O(100)
Total concurrency             705,024             O(billion)             O(1,000)
MTTI                          days                O(1 day)               O(10)


Exascale platforms (courtesy C. Engelmann & S. Scott)


Exascale platforms

Hierarchical
• 10^5 or 10^6 nodes
• Each node equipped with 10^4 or 10^3 cores

Failure-prone

MTBF – one node:                 1 year    10 years    120 years
MTBF – platform of 10^6 nodes:   30 sec    5 min       1 h

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
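These orders of magnitude are easy to reproduce: divide the node MTBF by the number of nodes. A quick sketch (values from the table above; the helper name is our own):

```python
# Platform MTBF = node MTBF / number of nodes (theorem proven later: mu_p = mu/p)
YEAR = 365.25 * 24 * 3600  # seconds in a year

def platform_mtbf(node_mtbf_s, nodes):
    return node_mtbf_s / nodes

for node_years in (1, 10, 120):
    mtbf = platform_mtbf(node_years * YEAR, 10**6)
    print(f"node MTBF {node_years:>3} years -> platform MTBF {mtbf:8.0f} s")
# 1 year -> ~30 s, 10 years -> ~5 min, 120 years -> ~1 h, as in the table
```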


Exascale ≠ Petascale × 1000


Even for today’s platforms (courtesy F. Cappello)



Error sources (courtesy Franck Cappello)

• Analysis of error and failure logs

• In 2005 (Ph.D. thesis of Charng-da Lu): "Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve."

• In 2007 (Garth Gibson, ICPP keynote): [figure]

• In 2008 (Oliner and J. Stearley, DSN Conf.): 50% hardware

Conclusion: both hardware and software failures have to be considered

Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors, and others

Hardware errors: disks, processors, memory, network


A few definitions

Many types of faults: software error, hardware malfunction, memory corruption

Many possible behaviors: silent, transient, unrecoverable

Restrict to faults that lead to application failures

This includes all hardware faults, and some software ones

We will use the terms fault and failure interchangeably

Silent errors (SDC) addressed later in the tutorial


Failure distributions: (1) Exponential

[Figure: failure probability (Cdf) vs. time in years, for a sequential machine with Exp(1/100) failures]

Exp(λ): Exponential distribution law of parameter λ:

Pdf: f(t) = λ e^(−λt) for t ≥ 0

Cdf: F(t) = 1 − e^(−λt)

Mean = 1/λ


X: random variable for Exp(λ) failure inter-arrival times

P(X ≤ t) = 1 − e^(−λt) (by definition)

Memoryless property: P(X ≥ t + s | X ≥ s) = P(X ≥ t):
at any instant, the time to the next failure does not depend upon the time elapsed since the last failure

Mean Time Between Failures (MTBF): µ = E(X) = 1/λ
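The memoryless property can be checked empirically. A sketch with illustrative parameters (not from the slides): condition on surviving until s and compare with the unconditional survival probability.

```python
import random

random.seed(0)
lam = 1 / 100.0                      # MTBF mu = 1/lam = 100
samples = [random.expovariate(lam) for _ in range(300_000)]

t, s = 50.0, 80.0
survivors = [x for x in samples if x >= s]                  # condition on X >= s
p_cond = sum(x >= t + s for x in survivors) / len(survivors)
p_uncond = sum(x >= t for x in samples) / len(samples)
print(p_cond, p_uncond)   # both close to exp(-lam*t) ~ 0.6065
```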


Failure distributions: (2) Weibull

[Figure: failure probability vs. time in years, for a sequential machine, comparing Exp(1/100), Weibull(0.7, 1/100), and Weibull(0.5, 1/100)]

Weibull(k, λ): Weibull distribution law of shape parameter k and scale parameter λ:

Pdf: f(t) = kλ (λt)^(k−1) e^(−(λt)^k) for t ≥ 0

Cdf: F(t) = 1 − e^(−(λt)^k)

Mean = (1/λ) Γ(1 + 1/k)
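The mean formula can be cross-checked by sampling. A sketch with the slide's parameters k = 0.7 and λ = 1/100 (the sample size is an arbitrary choice); note that Python's `weibullvariate` takes the scale 1/λ, not λ:

```python
import math
import random

random.seed(1)
k, lam = 0.7, 1 / 100.0              # shape k, scale parameter lam, as on the slide
scale = 1 / lam                       # random.weibullvariate expects the scale 1/lam

samples = [random.weibullvariate(scale, k) for _ in range(400_000)]
empirical = sum(samples) / len(samples)
theoretical = (1 / lam) * math.gamma(1 + 1 / k)   # Mean = (1/lam) * Gamma(1 + 1/k)
print(empirical, theoretical)
```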


X: random variable for Weibull(k, λ) failure inter-arrival times

If k < 1: the failure rate decreases with time ("infant mortality": defective items fail early)

If k = 1: Weibull(1, λ) = Exp(λ), constant failure rate


Failure distributions: with several processors

Processor (or node): any entity subject to failures
⇒ approach agnostic to granularity

If the MTBF is µ with one processor,what is its value with p processors?


Intuition

[Figure: failure timelines of three processors p1, p2, p3 over an interval of length t]

If three processors have around 20 faults each during a time t (µ = t/20)...

[Figure: the superposed failure timeline of the whole platform over the same interval]

...during the same time, the platform has around 60 faults (µ_p = t/60)


Platform MTBF

Rebooting only the faulty processor

Platform failure distribution:
⇒ superposition of p IID processor distributions
⇒ IID only for Exponential

Define µ_p by lim_{F→+∞} n(F)/F = 1/µ_p,
where n(F) = number of platform failures until time F is exceeded

Theorem: µ_p = µ/p for arbitrary distributions


MTBF with p processors (1/2)

Theorem: µ_p = µ/p for arbitrary distributions

With one processor:

n(F) = number of failures until time F is exceeded

X_i: IID random variables for inter-arrival times, with E(X_i) = µ

Σ_{i=1}^{n(F)−1} X_i ≤ F ≤ Σ_{i=1}^{n(F)} X_i

Wald's equation: (E(n(F)) − 1) µ ≤ F ≤ E(n(F)) µ

lim_{F→+∞} E(n(F))/F = 1/µ


MTBF with p processors (2/2)

Theorem: µ_p = µ/p for arbitrary distributions

With p processors:

n(F) = number of platform failures until time F is exceeded

n_q(F) = number of those failures that strike processor q

n_q(F) + 1 = number of failures on processor q until time F is exceeded (except for the processor struck by the last failure)

lim_{F→+∞} n_q(F)/F = 1/µ as above

lim_{F→+∞} n(F)/F = 1/µ_p by definition

Hence µ_p = µ/p, because n(F) = Σ_{q=1}^{p} n_q(F)


A little digression for aficionados

X_i: IID random variables for processor inter-arrival times

Assume X_i continuous, with E(X_i) = µ

Y_i: random variables for platform inter-arrival times

Definition: µ_p := lim_{n→+∞} (Σ_{i=1}^{n} E(Y_i))/n

The limit always exists (superposition of renewal processes)

Theorem: µ_p = µ/p


Values from the literature

MTBF of one processor: between 1 and 125 years

Shape parameters for Weibull: k = 0.5 or k = 0.7

Failure trace archive from INRIA (http://fta.inria.fr)

Computer Failure Data Repository from LANL (http://institutes.lanl.gov/data/fdata)


Does it matter?

[Figure: failure probability vs. time over 24 hours, for a parallel machine with 10^6 nodes, comparing Exp(1/100), Weibull(0.7, 1/100), and Weibull(0.5, 1/100)]

After infant mortality and before aging, the instantaneous failure rate of computer platforms is almost constant


Summary for the road

MTBF is the key parameter, and µ_p = µ/p

Exponential distribution OK for most purposes

Assume failure independence, while not (completely) true


Outline

1 Introduction

2 Checkpointing
  Coordinated checkpointing
  Young/Daly's approximation
  Exponential distributions
  Assessing protocols at scale
  In-memory checkpointing
  Failure prediction
  Replication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Maintaining redundant information

Goal

General-purpose fault tolerance techniques: work despite the application behavior

Two adversaries: Failures & Application

Use automatically computed redundant information

At given instants: checkpoints
At any instant: replication
Or anything in between: checkpoint + message logging


Process checkpointing

Goal

Save the current state of the process

FT Protocols save a possible state of the parallel application

Techniques

User-level checkpointing

System-level checkpointing

Blocking call

Asynchronous call


System-level checkpointing

Blocking checkpointing

Relatively intuitive: checkpoint(filename)

Cost: no process activity during whole checkpoint operation

Different implementations: OS syscall; dynamic library; compiler-assisted

Create a serial file that can be loaded in a process image. Usually on the same architecture / OS / software environment

Entirely transparent

Preemptive (often needed for library-level checkpointing)

Lack of portability

Large size of checkpoint (≈ memory footprint)


Storage

Remote reliable storage

Intuitive. I/O intensive. Disk usage.

Memory hierarchy

local memory

local disk (SSD, HDD)

remote disk

Scalable Checkpoint/Restart Library: http://scalablecr.sourceforge.net

Checkpoint is valid when finished on reliable storage

Distributed memory storage

In-memory checkpointing

Disk-less checkpointing


Coordinated checkpointing

[Figure: a recovery line crossing the execution, with orphan and missing messages]

Definition (Missing Message)

A message is missing if, in the current configuration, the sender sent it while the receiver did not receive it


Definition (Orphan Message)

A message is orphan if, in the current configuration, the receiver received it while the sender did not send it


Coordinated checkpointing

Create a consistent view of the application (no orphan messages)

Messages belong to one checkpoint wave or another

All communication channels must be flushed (all2all)


Coordinated checkpointing

[Figure: checkpoint wave flushing the channels, with application messages and marker messages]

Silences the network during checkpoint

Missing messages recorded


Periodic checkpointing

[Figure: periodic checkpointing timeline: processing a chunk, checkpointing it, then processing the next chunk]

Blocking model: while a checkpoint is taken, no computation canbe performed


Framework

Periodic checkpointing policy of period T

Independent and identically distributed failures

Applies to a single processor with MTBF µ = µ_ind

Applies to a platform with p processors and MTBF µ = µ_ind/p:
• coordinated checkpointing
• tightly-coupled application
• progress ⇔ all processors available

⇒ platform = single (powerful, unreliable) processor

Waste: fraction of time not spent for useful computations


Waste in fault-free execution

[Figure: periodic checkpointing timeline in a fault-free execution]

Time_base: application base time

Time_FF: with periodic checkpoints but failure-free

Time_FF = Time_base + #checkpoints × C

#checkpoints = ⌈Time_base/(T − C)⌉ ≈ Time_base/(T − C) (valid for large jobs)

Waste[FF] = (Time_FF − Time_base)/Time_FF = C/T


Waste due to failures

Time_base: application base time

Time_FF: with periodic checkpoints but failure-free

Time_final: expectation of execution time, with failures

Time_final = Time_FF + N_faults × T_lost

N_faults: number of failures during execution

T_lost: average time lost per failure

N_faults = Time_final/µ

T_lost?


Computing Tlost

[Figure: a failure strikes within a period T; the work since the last checkpoint is lost, then downtime D and recovery R follow]

T_lost = D + R + T/2

Rationale:
⇒ instants when periods begin and failures strike are independent
⇒ approximation used for all distribution laws
⇒ exact for Exponential and uniform distributions


Waste due to failures

Time_final = Time_FF + N_faults × T_lost

Waste[fail] = (Time_final − Time_FF)/Time_final = (1/µ)(D + R + T/2)


Total waste

[Figure: Time_final splits into Time_FF = Time_final (1 − Waste[fail]) and Time_final × Waste[fail]; each period consists of T − C units of work followed by a checkpoint of length C]

Waste = (Time_final − Time_base)/Time_final

1 − Waste = (1 − Waste[FF])(1 − Waste[fail])

Waste = C/T + (1 − C/T)(1/µ)(D + R + T/2)


Waste minimization

Waste = C/T + (1 − C/T)(1/µ)(D + R + T/2)

Waste = u/T + v + wT with

u = C (1 − (D + R)/µ), v = (D + R − C/2)/µ, w = 1/(2µ)

Waste minimized for T = √(u/w):

T = √(2(µ − (D + R))C)


Comparison with Young/Daly

(1 − Waste[fail]) Time_final = Time_FF
⇒ T = √(2(µ − (D + R))C)

Daly: Time_final = (1 + Waste[fail]) Time_FF
⇒ T = √(2(µ + (D + R))C) + C

Young: Time_final = (1 + Waste[fail]) Time_FF and D = R = 0
⇒ T = √(2µC) + C
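For realistic parameters the three periods are close, since D + R ≪ µ. A sketch with hypothetical values (µ = 1 day, C = 600 s, D = 60 s, R = 600 s):

```python
import math

def t_tutorial(mu, C, D, R):    # period from the derivation above
    return math.sqrt(2 * (mu - (D + R)) * C)

def t_daly(mu, C, D, R):        # Daly's period
    return math.sqrt(2 * (mu + (D + R)) * C) + C

def t_young(mu, C):             # Young's period (D = R = 0)
    return math.sqrt(2 * mu * C) + C

mu, C, D, R = 86_400.0, 600.0, 60.0, 600.0
print(t_tutorial(mu, C, D, R), t_daly(mu, C, D, R), t_young(mu, C))
# the three periods differ by only a few percent
```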


Validity of the approach (1/3)

Technicalities

E(N_faults) = Time_final/µ and E(T_lost) = D + R + T/2,
but the expectation of a product is not the product of expectations (the random variables are not independent here)

Enforce C ≤ T to get Waste[FF] ≤ 1

Enforce D + R ≤ µ and bound T to get Waste[fail] ≤ 1,
but µ = µ_ind/p is too small for large p, regardless of µ_ind


Validity of the approach (2/3)

Several failures within same period?

Waste[fail] is accurate only when two or more faults do not take place within the same period

Cap the period: T ≤ γµ, where γ is some tuning parameter

Number of failures per period: Poisson process of parameter θ = T/µ

Probability of having k ≥ 0 failures: P(X = k) = (θ^k/k!) e^(−θ)

Probability of having two or more failures:
π = P(X ≥ 2) = 1 − (P(X = 0) + P(X = 1)) = 1 − (1 + θ) e^(−θ)

γ = 0.27 ⇒ π ≤ 0.03
⇒ overlapping faults for only 3% of checkpointing segments
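The numbers check out: with γ = 0.27, the probability of two or more failures per period is about 3%.

```python
import math

def p_two_or_more(theta):
    """P(X >= 2) for a Poisson(theta) number of failures per period."""
    return 1.0 - (1.0 + theta) * math.exp(-theta)

gamma = 0.27                       # cap T <= gamma * mu
pi = p_two_or_more(gamma)
print(pi)                           # ~0.03: faults overlap in about 3% of the periods
```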


Validity of the approach (3/3)

Enforce T ≤ γµ, C ≤ γµ, and D + R ≤ γµ

Optimal period √(2(µ − (D + R))C) may not belong to the admissible interval [C, γµ]

The waste is then minimized for one of the bounds of this admissible interval (by convexity)


Wrap up

Capping periods, and enforcing a lower bound on the MTBF:
⇒ mandatory for mathematical rigor

Not needed for practical purposes:
• actual job execution uses the optimal value
• account for multiple faults by re-executing work until success

Approach surprisingly robust


Lesson learnt for fail-stop failures

(Not so) secret data
• Tsubame 2: 962 failures during the last 18 months, so µ = 13 hrs
• Blue Waters: 2-3 node failures per day
• Titan: a few failures per day
• Tianhe-2: wouldn't say

T_opt = √(2µC) ⇒ Waste[opt] ≈ √(2C/µ)

Petascale:    C = 20 min, µ = 24 hrs   ⇒ Waste[opt] = 17%
Scale by 10:  C = 20 min, µ = 2.4 hrs  ⇒ Waste[opt] = 53%
Scale by 100: C = 20 min, µ = 0.24 hrs ⇒ Waste[opt] = 100%
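The three waste figures follow directly from Waste[opt] ≈ √(2C/µ), capped at 100%:

```python
import math

def waste_opt(C_hours, mu_hours):
    """First-order optimal waste sqrt(2C/mu), capped at 1 (100%)."""
    return min(1.0, math.sqrt(2 * C_hours / mu_hours))

C = 20 / 60.0                                   # 20-minute checkpoint, in hours
for label, mu in [("Petascale", 24.0), ("x10", 2.4), ("x100", 0.24)]:
    print(f"{label:>9}: waste = {waste_opt(C, mu):.0%}")
# Petascale: 17%, x10: 53%, x100: 100%
```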


Exascale ≠ Petascale × 1000
Need more reliable components
Need to checkpoint faster


Silent errors: detection latency ⇒ additional problems


Exponential failure distribution

1 Expected execution time for a single chunk

2 Expected execution time for a sequential job

3 Expected execution time for a parallel job


Expected execution time for a single chunk

Compute the expected time E(T(W, C, D, R, λ)) to execute a work of duration W followed by a checkpoint of duration C.

Recursive approach:

E(T(W)) = P_succ(W + C) (W + C)
        + (1 − P_succ(W + C)) (E(T_lost(W + C)) + E(T_rec) + E(T(W)))

where:
• P_succ(W + C): probability of success
• W + C: time needed to compute the work W and checkpoint it
• 1 − P_succ(W + C): probability of failure
• E(T_lost(W + C)): time elapsed before the failure struck
• E(T_rec): time needed to perform downtime and recovery
• E(T(W)): time needed to compute W anew

Computation of E(T(W, C, D, R, λ))

E(T(W)) = P_succ(W + C) (W + C)
        + (1 − P_succ(W + C)) (E(T_lost(W + C)) + E(T_rec) + E(T(W)))

P_succ(W + C) = e^(−λ(W+C))

E(T_lost(W + C)) = ∫_0^∞ x P(X = x | X < W + C) dx = 1/λ − (W + C)/(e^(λ(W+C)) − 1)

E(T_rec) = e^(−λR)(D + R) + (1 − e^(−λR))(D + E(T_lost(R)) + E(T_rec))

E(T(W, C, D, R, λ)) = e^(λR) (1/λ + D) (e^(λ(W+C)) − 1)


Best pattern for infinite jobs

Minimize Overhead(W) = E(T(W))/W − 1

Overhead(W) = C/W + λW/2 + o(λW) ⇒ W_opt = Θ(λ^(−1/2))

W_opt = √(2C/λ)

Overhead(W_opt) = √(2Cλ) + o(λ^(1/2))

Periodic pattern (memoryless property)
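The first-order approximation is accurate for small λ. A sketch with D = R = 0 and illustrative parameters, evaluating the exact overhead with the closed form:

```python
import math

def exact_overhead(W, C, lam):
    """Overhead E(T(W))/W - 1 with the closed form, taking D = R = 0."""
    return (1 / lam) * math.expm1(lam * (W + C)) / W - 1

C, lam = 1.0, 1e-3                      # hypothetical: 1 s checkpoint, rate 1e-3 /s
W_opt = math.sqrt(2 * C / lam)           # ~44.7 s of work between checkpoints
print(exact_overhead(W_opt, C, lam), math.sqrt(2 * C * lam))
# the prediction sqrt(2*C*lam) ~ 0.0447 is close to the exact overhead
```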


Checkpointing a sequential job

E(T(W)) = e^(λR) (1/λ + D) Σ_{i=1}^{K} (e^(λ(W_i+C)) − 1)

Optimal strategy uses same-size chunks (convexity)

K_0 = λW/(1 + L(−e^(−λC−1))), where L(z) e^(L(z)) = z (Lambert function)

Optimal number of chunks K* is max(1, ⌊K_0⌋) or ⌈K_0⌉

E_opt(T(W)) = K* e^(λR) (1/λ + D) (e^(λ(W/K* + C)) − 1)

Can also use Daly's second-order approximation
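K_0 can be evaluated without a special-function library: the relevant branch of L(z) lies in [−1, 0], where w e^w is increasing, so bisection suffices. A sketch with hypothetical parameters:

```python
import math

def lambert_w0(z):
    """Principal branch of the Lambert function on (-1/e, 0):
    w*e^w is increasing on [-1, 0], so bisection finds the unique root."""
    lo, hi = -1.0, 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cost(K, W, C, D, R, lam):
    """Expected time with K same-size chunks (formula above)."""
    return K * math.exp(lam * R) * (1 / lam + D) * math.expm1(lam * (W / K + C))

# hypothetical values: W = 1000 s of work, C = 10 s, D = 5 s, R = 10 s, MTBF 500 s
W, C, D, R, lam = 1000.0, 10.0, 5.0, 10.0, 1 / 500.0
K0 = lam * W / (1 + lambert_w0(-math.exp(-lam * C - 1)))
cands = [max(1, math.floor(K0)), math.ceil(K0)]
K_star = min(cands, key=lambda K: cost(K, W, C, D, R, lam))
print(K0, K_star)   # K0 ~ 10.7, so the job is split into K* = 11 chunks
```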


Checkpointing a parallel job

p processors ⇒ distribution Exp(λ_p), where λ_p = pλ

Use W(p), C(p), R(p) in E_opt(T(W)) for a distribution Exp(λ_p = pλ)

Job types:
• Perfectly parallel jobs: W(p) = W/p
• Generic parallel jobs: W(p) = W/p + δW
• Numerical kernels: W(p) = W/p + δW^(2/3)/√p

Checkpoint overhead:
• Proportional overhead: C(p) = R(p) = δV/p = C/p (bandwidth of processor network card/link is the I/O bottleneck)
• Constant overhead: C(p) = R(p) = δV = C (bandwidth to/from the resilient storage system is the I/O bottleneck)


Weibull failure distribution

No optimality result known

Heuristic: maximize expected work before next failure

Dynamic programming algorithms:
- Use a time quantum
- Trim history of previous failures


Hierarchical checkpointing

Clusters of processes

Coordinated checkpointing protocol within clusters

Message logging protocols between clusters

Only processors from the failed group need to roll back

[Figure: processes P0-P3 split into groups; inter-group messages m1-m5 must be logged]

Drawback: need to log inter-group messages
• Slows down failure-free execution
• Increases checkpoint size/time

Advantage: faster re-execution with logged messages


Which checkpointing protocol to use?

Coordinated checkpointing

+ No risk of cascading rollbacks
+ No need to log messages
− All processors need to roll back
− Rumor: may not scale to very large platforms

Hierarchical checkpointing

− Need to log inter-group messages (slows down failure-free execution; increases checkpoint size/time)
+ Only processors from the failed group need to roll back
+ Faster re-execution with logged messages
+ Rumor: should scale to very large platforms


Blocking vs. non-blocking

[Figure: blocking checkpointing timeline: computation stops while each chunk is checkpointed]

Blocking model: checkpointing blocks all computations


Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk)


General model: checkpointing slows computations down: duringa checkpoint of duration C , the same amount of computation isdone as during a time αC without checkpointing (0 ≤ α ≤ 1)


Waste in fault-free execution

[Figure: four processors P0-P3 execute a period of length T; the last C units overlap checkpointing, with computation slowed down]

Time elapsed since last checkpoint: T

Amount of computations executed: Work = (T − C) + αC

Waste[FF] = (T − Work)/T = (1 − α)C/T
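For instance (values are illustrative): α = 0 recovers the blocking-model waste C/T, and α = 1 gives a fully overlapped checkpoint with no fault-free waste.

```python
# Fault-free waste under the general (slowdown) model:
# Work = (T - C) + alpha*C, so Waste[FF] = (T - Work)/T = (1 - alpha)*C/T
def waste_ff(T, C, alpha):
    work = (T - C) + alpha * C
    return (T - work) / T

T, C = 3600.0, 600.0                  # hypothetical: 1 h period, 10 min checkpoint
print(waste_ff(T, C, 0.0))            # blocking model: C/T
print(waste_ff(T, C, 1.0))            # fully overlapped checkpoint: zero waste
print(waste_ff(T, C, 0.5))
```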


Waste due to failures

[Figure: four processors P0-P3 executing with periodic checkpoints]

A failure can happen:

1 During the computation phase

2 During the checkpointing phase


Waste due to failures

[Diagram: a failure strikes one processor; the time Tlost elapsed since the last checkpoint is lost on all processors]

Coordinated checkpointing protocol: when one processor is the victim of a failure, all processors lose their work and must roll back to the last checkpoint


Waste due to failures in computation phase

[Diagram: downtime D follows the failure]


Waste due to failures in computation phase

[Diagram: recovery time R follows the downtime]

Coordinated checkpointing protocol: all processors must recover from the last checkpoint


Waste due to failures in computation phase

[Diagram: re-executing, in time αC, the work that was done during the checkpoint of duration C]

Redo the work destroyed by the failure that was done during the checkpointing phase preceding the computation phase

But no checkpoint is taken in parallel, hence this re-execution is faster than the original computation


Waste due to failures in computation phase

[Diagram: re-executing the computation phase of duration T − C]

Re-execute the computation phase


Waste due to failures in computation phase

[Diagram: the checkpoint of duration C concludes the period]

Finally, the checkpointing phase is executed


Total waste

[Diagram: a full period of length T broken into Tlost, downtime D, recovery R, re-execution αC, work T − C, and checkpoint C, across processors P0–P3]

Waste[fail] = (1/µ) (D + R + αC + T/2)

Optimal period: Topt = √(2(1 − α)(µ − (D + R + αC))C)
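A quick numerical sanity check of the waste model above, with hypothetical values for C, D, R, α, and µ: the total waste 1 − (1 − Waste[FF])(1 − Waste[fail]) is minimized at the stated Topt.

```python
import math

# Hypothetical parameters (seconds): checkpoint, downtime, recovery,
# slowdown factor, platform MTBF.
C, D, R, alpha, mu = 600.0, 60.0, 300.0, 0.3, 86400.0

def waste_ff(T):
    return (1 - alpha) * C / T          # fault-free waste: (T - Work)/T

def waste_fail(T):
    return (D + R + alpha * C + T / 2) / mu

def total_waste(T):
    return 1 - (1 - waste_ff(T)) * (1 - waste_fail(T))

t_opt = math.sqrt(2 * (1 - alpha) * (mu - (D + R + alpha * C)) * C)

# Grid search around t_opt confirms it is (close to) the minimizer.
grid = [t_opt * f / 100 for f in range(50, 151)]
t_best = min(grid, key=total_waste)
print(t_opt, t_best, total_waste(t_opt))
```

Differentiating 1 − (1 − (1 − α)C/T)(1 − (D + R + αC + T/2)/µ) with respect to T and setting the derivative to zero yields exactly the Topt above.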


Hierarchical checkpointing

[Diagram: period of length T with checkpoints of the G groups G1, …, Gg taken in sequence (total time G·C); after a failure: downtime D, recovery R, re-execution of α(G − g + 1)C of slowed-down work, then T − G·C − Tlost of regular work]

Processors partitioned into G groups

Each group includes q processors

Inside each group: coordinated checkpointing in time C(q)

Inter-group messages are logged


Accounting for message logging: Impact on work

Drawback: logging messages slows down execution
⇒ Work becomes λWork, where 0 < λ < 1 (typical value: λ ≈ 0.98)

Advantage: re-execution after a failure is faster
⇒ Re-Exec becomes Re-Exec/ρ, where ρ ∈ [1..2] (typical value: ρ ≈ 1.5)

Waste[FF] = (T − λWork)/T

Waste[fail] = (1/µ) (D(q) + R(q) + Re-Exec/ρ)


Accounting for message logging: Impact on checkpoint size

Inter-group messages logged continuously

Drawback: checkpoint size increases with the amount of work executed before a checkpoint

C0(q): checkpoint size of a group without message logging

C(q) = C0(q)(1 + βWork) ⇔ β = (C(q) − C0(q)) / (C0(q)Work)

Work = λ(T − (1 − α)GC(q))

C(q) = C0(q)(1 + βλT) / (1 + GC0(q)βλ(1 − α))
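A small consistency check of the closed form for C(q), with made-up values for C0(q), β, λ, α, G, and T: plugging the closed form back into C(q) = C0(q)(1 + βWork) with Work = λ(T − (1 − α)GC(q)) must return the same value.

```python
# Hypothetical parameters: group checkpoint (s), logging growth rate,
# logging slowdown, checkpoint slowdown factor, #groups, period (s).
C0, beta, lam, alpha, G, T = 15.0, 1.098e-4, 0.98, 0.3, 136, 3600.0

# Closed form derived on the slide.
Cq = C0 * (1 + beta * lam * T) / (1 + G * C0 * beta * lam * (1 - alpha))

# Substitute back into the two defining equations.
work = lam * (T - (1 - alpha) * G * Cq)
residual = abs(Cq - C0 * (1 + beta * work))
print(Cq, residual)
```

The residual is zero up to rounding, confirming the algebra: the closed form is obtained by solving the linear fixed-point equation C = C0(1 + βλ(T − (1 − α)GC)) for C.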


Three case studies

Coord-IO
Coordinated approach: C = CMem = Mem/bio,
where Mem is the memory footprint of the application

Hierarch-IO
Several (large) groups, I/O-saturated ⇒ groups checkpoint sequentially
C0(q) = CMem/G = Mem/(G bio)

Hierarch-Port
Very large number of smaller groups, port-saturated ⇒ some groups checkpoint in parallel
Groups of qmin processors, where qmin bport ≥ bio


Three applications

1 2D-Stencil

2 Matrix product

3 3D-Stencil (with Plane and Line groupings)


Computing β for 2D-Stencil

C(q) = C0(q) + Logged_Msg = C0(q)(1 + βWork)

Real n × n matrix on a p × p grid: Work = 9b²/sp, with b = n/p

Each process sends a block to its 4 neighbors

Hierarch-IO:

1 group = 1 grid row

2 out of the 4 messages are logged

β = Logged_Msg / (C0(q)Work) = 2pb / (pb² × (9b²/sp)) = 2sp/(9b³)

Hierarch-Port:

β doubles
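The β computation above, spelled out with sample (hypothetical) values of the grid side p, block side b, and speed sp; the ratio indeed collapses to 2sp/(9b³), and doubling the logged traffic (the "β doubles" case for Hierarch-Port) doubles β.

```python
# Hypothetical 2D-stencil parameters: grid side, block side, speed.
p, b, sp = 128, 1000, 1e9

work = 9 * b**2 / sp          # work per process, as on the slide
logged_msg = 2 * p * b        # 2 of the 4 block messages, for one grid row
c0 = p * b**2                 # checkpoint size of one group (one grid row)

beta_io = logged_msg / (c0 * work)
beta_port = 2 * beta_io       # Hierarch-Port: twice as many logged messages
print(beta_io, beta_port, 2 * sp / (9 * b**3))
```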


Four platforms: basic characteristics

Name           Number of      Number of          Cores per   Memory per   I/O network bandwidth (bio)   I/O bandwidth (bport)
               cores          processors ptotal  processor   processor    Read       Write              Read/Write per processor
Titan          299,008        16,688             16          32GB         300GB/s    300GB/s            20GB/s
K-Computer     705,024        88,128             8           16GB         150GB/s    96GB/s             20GB/s
Exascale-Slim  1,000,000,000  1,000,000          1,000       64GB         1TB/s      1TB/s              200GB/s
Exascale-Fat   1,000,000,000  100,000            10,000      640GB        1TB/s      1TB/s              400GB/s

Name           Scenario       G (C(q))           β for 2D-Stencil   β for Matrix-Product
Titan          Coord-IO       1 (2,048s)         /                  /
               Hierarch-IO    136 (15s)          0.0001098          0.0004280
               Hierarch-Port  1,246 (1.6s)       0.0002196          0.0008561
K-Computer     Coord-IO       1 (14,688s)        /                  /
               Hierarch-IO    296 (50s)          0.0002858          0.001113
               Hierarch-Port  17,626 (0.83s)     0.0005716          0.002227
Exascale-Slim  Coord-IO       1 (64,000s)        /                  /
               Hierarch-IO    1,000 (64s)        0.0002599          0.001013
               Hierarch-Port  200,000 (0.32s)    0.0005199          0.002026
Exascale-Fat   Coord-IO       1 (64,000s)        /                  /
               Hierarch-IO    316 (217s)         0.00008220         0.0003203
               Hierarch-Port  33,333 (1.92s)     0.00016440         0.0006407


Checkpoint time

Name           C
K-Computer     14,688s
Exascale-Slim  64,000s
Exascale-Fat   64,000s

Large time to dump the memory

Also using 1% of C, and comparing with 0.1% of C for the exascale platforms

α = 0.3, λ = 0.98 and ρ = 1.5


Plotting formulas – Platform: Titan

[Plots for Stencil 2D, Matrix product, and Stencil 3D]

Waste as a function of processor MTBF µind


Platform: K-Computer

[Plots for Stencil 2D, Matrix product, and Stencil 3D]

Waste as a function of processor MTBF µind


Plotting formulas – Platform: Exascale

Waste = 1 for all scenarios!!!


Goodbye Exascale?!


Plotting formulas – Platform: Exascale with C = 1, 000

[Plots for Stencil 2D, Matrix product, and Stencil 3D; rows: Exascale-Slim and Exascale-Fat]

Waste as a function of processor MTBF µind, C = 1,000


Plotting formulas – Platform: Exascale with C = 100

[Plots for Stencil 2D, Matrix product, and Stencil 3D; rows: Exascale-Slim and Exascale-Fat]

Waste as a function of processor MTBF µind, C = 100


Simulations – Platform: Titan

[Plots for Stencil 2D, Matrix product, and Stencil 3D: curves for Coordinated and Hierarchical (Plane, Line, Port variants where applicable), each alongside its BestPer counterpart, for MTBFs from 1 to 100 years]

Makespan (in days) as a function of processor MTBF µind


Simulations – Platform: Exascale with C = 1, 000

[Plots for Stencil 2D and Matrix product; rows: Exascale-Slim and Exascale-Fat; curves for Coordinated and Hierarchical (Plane, Line, Port variants where applicable), each alongside its BestPer counterpart]

Makespan (in days) as a function of processor MTBF µind, C = 1,000


Simulations – Platform: Exascale with C = 100

[Plots for Stencil 2D and Matrix product; rows: Exascale-Slim and Exascale-Fat; curves for Coordinated (Daly) and Hierarchical (Plane, Line, Port variants where applicable), each alongside its BestPer counterpart]

Makespan (in days) as a function of processor MTBF µind, C = 100


Outline

1 Introduction

2 Checkpointing
   Coordinated checkpointing
   Young/Daly’s approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure Prediction
   Replication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Motivation

Checkpoint transfer and storage ⇒ critical issues of rollback/recovery protocols

Stable storage: high cost

Distributed in-memory storage:

Store checkpoints in local memory ⇒ no centralized storage
(+) Much better scalability

Replicate checkpoints ⇒ application survives single failure
(−) Still, risk of fatal failure in some (unlikely) scenarios


Double checkpoint algorithm (Kale et al., UIUC)

[Diagram: within each period P on buddy nodes p and p′: local checkpoint done, then remote checkpoint done, then period done]

Platform nodes partitioned into pairs

Each node in a pair exchanges its checkpoint with its buddy

Each node saves two checkpoints:
- one locally: storing its own data
- one remotely: receiving and storing its buddy’s data
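A toy sketch of the buddy-pair idea (not the actual Charm++ implementation; node names and data are hypothetical): each node keeps its own checkpoint plus a copy of its buddy's, so any single failure can be repaired from the surviving buddy.

```python
# Two buddy nodes with hypothetical in-memory states.
state = {"p": "data_p", "q": "data_q"}
buddy = {"p": "q", "q": "p"}

# Checkpoint phase: every node keeps a local copy of its own state
# and stores a remote copy of its buddy's state.
local = dict(state)
remote = {buddy[n]: s for n, s in state.items()}  # remote[x] = x's buddy data

def recover(failed):
    """Rebuild the failed node's state, and re-send the two checkpoint
    files it lost, from its surviving buddy."""
    b = buddy[failed]
    restored = remote[b]          # buddy holds the failed node's checkpoint
    local[failed] = restored      # re-create the failed node's local copy
    remote[failed] = local[b]     # re-create the buddy's remote copy
    return restored

print(recover("p"))
```

The window between the failure and the completion of the two re-sent checkpoint files is exactly the risk period discussed on the next slide: a second failure on the buddy during that window is fatal.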


Failures

[Diagram: node p fails; after downtime D, a replacement node recovers (R) from the buddy's checkpoint; both checkpoint files must be re-sent during a risk period]

After failure: downtime D and recovery from buddy node

Two checkpoint files lost, must be re-sent to faulty processor

Application at risk until complete reception of both messages

Best trade-off between performance and risk?


Outline

1 Introduction

2 Checkpointing
   Coordinated checkpointing
   Young/Daly’s approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure Prediction
   Replication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Framework

Predictor

Exact prediction dates (at least C seconds in advance)

Recall r : fraction of faults that are predicted

Precision p: fraction of fault predictions that are correct

Events

true positive: predicted faults

false positive: fault predictions that did not materialize as actual faults

false negative: unpredicted faults


Fault rates

µ: mean time between failures (MTBF)

µP: mean time between predicted events (both true positive and false positive)

µNP: mean time between unpredicted faults (false negative)

µe: mean time between events (including all three event types)

r = TrueP / (TrueP + FalseN)  and  p = TrueP / (TrueP + FalseP)

(1 − r)/µ = 1/µNP  and  r/µ = p/µP

1/µe = 1/µP + 1/µNP
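The relations above let us recover all the mean times from µ, r, and p alone. Using the numbers of the example slide (5 faults, 6 predictions, 4 true positives, over a duration t = 420 hypothetical time units):

```python
t = 420.0
mu, r, p = t / 5, 4 / 5, 4 / 6   # MTBF, recall, precision

mu_np = mu / (1 - r)                 # from (1 - r)/mu = 1/mu_NP
mu_p = p * mu / r                    # from r/mu = p/mu_P
mu_e = 1 / (1 / mu_p + 1 / mu_np)   # from 1/mu_e = 1/mu_P + 1/mu_NP
print(mu_np, mu_p, mu_e)             # t, t/6, t/7
```

These agree with the example: one unpredicted fault (µNP = t), six prediction events (µP = t/6), and seven events in total (µe = t/7).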


Example

[Timeline diagram over duration t: actual faults, predictions, and their overlaps (true positives)]

Predictor predicts six faults in time t

Five actual faults. One fault not predicted

µ = t/5, µP = t/6, and µNP = t

Recall r = 4/5 (predicted faults over actual faults)

Precision p = 4/6 (predicted faults over predictions)


Algorithm

1 While no fault prediction is available:
• checkpoints taken periodically with period T

2 When a fault is predicted at time t:
• take a checkpoint ALAP (completion right at time t)
• after the checkpoint, complete the execution of the period


Computing the waste

1 Fault-free execution: Waste[FF] = C/T

[Diagram: periodic pattern of work T − C followed by checkpoint C]

2 Unpredicted faults: (1/µNP) [D + R + T/2]

[Diagram: an unpredicted fault strikes Tlost after the last checkpoint; downtime D and recovery R follow]


Computing the waste

3 Predictions: (1/µP) [p(C + D + R) + (1 − p)C]

[Diagram: a proactive checkpoint Cp is taken right before the predicted fault, either with an actual fault (true positive, followed by D and R) or without one (false positive); the period then resumes]

Waste[fail] = (1/µ) [(1 − r)T/2 + D + R + (r/p)C]  ⇒  Topt ≈ √(2µC / (1 − r))
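The effect of the recall r on the optimal period, in a minimal sketch with hypothetical µ and C: Topt ≈ √(2µC/(1 − r)) stretches the Young/Daly period √(2µC) by a factor 1/√(1 − r), since only unpredicted faults remain to be absorbed by periodic checkpoints.

```python
import math

mu, C = 86400.0, 600.0   # hypothetical MTBF and checkpoint cost (s)

def t_opt(r):
    """Optimal period with a predictor of recall r (0 <= r < 1)."""
    return math.sqrt(2 * mu * C / (1 - r))

print(t_opt(0.0), t_opt(0.8))   # r = 0 recovers the Young/Daly period
```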


Refinements

Use a different value Cp for proactive checkpoints

Avoid checkpointing too frequently for false positives
⇒ Only trust predictions with some fixed probability q
⇒ Ignore predictions with probability 1 − q
Conclusion: trust the predictor always or never (q = 0 or q = 1)

Trust predictions depending upon the position in the current period
⇒ Increase q when progressing
⇒ Break-even point: Cp/p


With prediction windows

[Diagrams: regular mode uses period TR with checkpoints C; during a prediction window of length I, the algorithm switches to proactive mode with a shorter period TP and proactive checkpoints Cp, then returns to regular mode, with or without an actual failure in the window]

Gets too complicated!


Outline

1 Introduction

2 Checkpointing
   Coordinated checkpointing
   Young/Daly’s approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure Prediction
   Replication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Replication

Systematic replication: efficiency < 50%

Can replication+checkpointing be more efficient than checkpointing alone?

Study by Ferreira et al. [SC’2011]: yes


Model by Ferreira et al. [SC’2011]

Parallel application comprising N processes

Platform with ptotal = 2N processors

Each process replicated → N replica-groups

When a replica is hit by a failure, it is not restarted

Application fails when both replicas in one replica-group have been hit by failures


Example

[Diagram: four replica-pairs Pair1–Pair4, each with replicas p1 and p2; failures hit individual replicas over time, and the application survives until both replicas of some pair are down]


The birthday problem

Classical formulation
What is the probability, in a set of m people, that two of them have the same birthday?

Relevant formulation
What is the average number of people required to find a pair with the same birthday?

Birthday(m) = 1 + ∫0^{+∞} e^{−x}(1 + x/m)^{m−1} dx = 2/3 + √(πm/2) + √(π/(288m)) − 4/(135m) + …

The analogy

Two people with same birthday
≡
Two failures hitting same replica-group
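A Monte-Carlo check of the truncated Birthday(m) expansion above for m = 365: on average about 24.6 people are needed before a birthday repeats.

```python
import math, random

def draws_until_collision(m, rng):
    """Draw uniformly from m values until a repeat; return the draw count."""
    seen = set()
    while True:
        x = rng.randrange(m)
        if x in seen:
            return len(seen) + 1   # this draw caused the collision
        seen.add(x)

m = 365
series = (2 / 3 + math.sqrt(math.pi * m / 2)
          + math.sqrt(math.pi / (288 * m)) - 4 / (135 * m))

rng = random.Random(42)
trials = 20000
mc = sum(draws_until_collision(m, rng) for _ in range(trials)) / trials
print(series, mc)
```

The simulated mean and the series agree to within the Monte-Carlo noise, which is why the analogy gives a useful closed-form handle on the number of failures a replicated application can absorb.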


Differences with birthday problem

[Diagram: replica-groups 1, 2, …, i, …, N]

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure: can failed PE be hit?


Differences with birthday problem

[Diagram: replica-groups 1, 2, …, i, …, N]

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over the 2N − 1 remaining PEs
Probability that replica-group i is hit by the failure: 1/(2N − 1)
Probability that a replica-group ≠ i is hit by the failure: 2/(2N − 1)
Failure not uniformly distributed over replica-groups: this is not the birthday problem


Differences with birthday problem

[Diagram: replica-groups 1, 2, …, i, …, N]

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure can hit failed PE

Suppose the failure hits replica-group i
If the failure hits the failed PE: application survives
If the failure hits the running PE: application killed
Not all failures hitting the same replica-group are equal: this is not the birthday problem


Correct analogy

[Diagram: N = nrg bins into which red and blue balls are thrown]

MNFTI, Mean Number of Failures To Interruption (to bring down the application)
= expected number of balls to throw until one bin gets one ball of each color


Number of failures to bring down application

MNFTI^ah: count each failure hitting any of the initial processors, including those already hit by a failure

MNFTI^rp: count only failures that hit running processors, and thus effectively kill replicas

MNFTI^ah = 1 + MNFTI^rp


Exponential failures

Theorem: MNFTI^ah = E(NFTI^ah | 0), where

E(NFTI^ah | nf) = 2 if nf = nrg,
E(NFTI^ah | nf) = 2nrg/(2nrg − nf) + ((2nrg − 2nf)/(2nrg − nf)) E(NFTI^ah | nf + 1) otherwise.

E(NFTI^ah | nf): expectation of the number of failures needed to kill the application, knowing that
• the application is still running
• failures have already hit nf different replica-groups
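The recurrence can be evaluated directly by iterating from nf = nrg down to nf = 0; a sketch below checks the base case nrg = 1 by hand (first failure, then a fair coin on each later failure, so 3 expected hits) and reproduces the MNFTI ≈ 1284.4 quoted on the comparison slide for 2N = 2^20 processors (nrg = 2^19 replica-groups).

```python
def mnfti_ah(nrg):
    """Evaluate the MNFTI^ah recurrence bottom-up."""
    e = 2.0                                  # base case: E(NFTI | nrg) = 2
    for nf in range(nrg - 1, -1, -1):
        e = (2 * nrg / (2 * nrg - nf)
             + (2 * nrg - 2 * nf) / (2 * nrg - nf) * e)
    return e

print(mnfti_ah(1))        # one replica-pair: 3 expected hits
print(mnfti_ah(2 ** 19))  # 2^19 replica-groups: about 1.3e3
```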


Exponential failures (cont’d)

Proof:

E(NFTI^ah | nrg) = (1/2) × 1 + (1/2) × (1 + E(NFTI^ah | nrg))

E(NFTI^ah | nf) = ((2nrg − 2nf)/(2nrg)) × (1 + E(NFTI^ah | nf + 1)) + (2nf/(2nrg)) × ((1/2) × 1 + (1/2)(1 + E(NFTI^ah | nf)))

MTTI = systemMTBF(2nrg) × MNFTI^ah


Comparison

2N processors, no replication:
ThroughputStd = 2N(1 − Waste) = 2N(1 − √(2C/µ2N))

N replica-pairs:
ThroughputRep = N(1 − √(2C/µrep)),
where µrep = MNFTI × µ2N = MNFTI × µ/(2N)

Platform with 2N = 2^20 processors ⇒ MNFTI = 1284.4
µ = 10 years ⇒ checkpointing alone better if C shorter than 6 minutes
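A numerical sketch of the two throughputs above with the platform of this slide (2N = 2^20 processors, per-processor MTBF µ = 10 years, MNFTI = 1284.4; waste clamped at 1 when the first-order formula overshoots): small C favors plain checkpointing, large C favors replication.

```python
import math

N2 = 2 ** 20                          # 2N processors
mu_ind = 10 * 365 * 86400.0           # per-processor MTBF: ~10 years, in s
mu_2N = mu_ind / N2                   # platform MTBF without replication
mu_rep = 1284.4 * mu_2N               # MTTI with N replica-pairs

def throughput_std(C):
    return N2 * max(0.0, 1 - math.sqrt(2 * C / mu_2N))

def throughput_rep(C):
    return (N2 // 2) * max(0.0, 1 - math.sqrt(2 * C / mu_rep))

print(throughput_std(1.0) > throughput_rep(1.0))      # tiny C: no replication
print(throughput_rep(600.0) > throughput_std(600.0))  # C = 10 min: replicate
```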


Failure distribution

[Plots: average makespan (in days) vs. number of processors (2^15 to 2^21), with curves BestPeriod-g = 1, 2 and Daly-g = 1, 2; (a) Exponential, (b) Weibull, k = 0.7]

Crossover point for replication when µind = 125 years

yves.robert@inria.fr Fault-tolerance for HPC 102/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Weibull distribution with k = 0.7

Dashed line: Ferreira et al. Solid line: correct analysis

(Figure: number of processors, 0 to 1,000,000, vs. processor MTBF (in years), 10^0 to 10^1; one curve per checkpoint cost C = 150, 300, 600, 900, 1200, 2400)

Study by Ferreira et al. favors replication

Replication beneficial if small µ + large C + big p_total

yves.robert@inria.fr Fault-tolerance for HPC 103/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques
   ABFT for matrix product
   ABFT for (tiled) LU factorization
   Composite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

yves.robert@inria.fr Fault-tolerance for HPC 104/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Generic vs. Application specific approaches

Generic solutions

Universal

Very low prerequisite

One size fits all (pros and cons)

Application specific solutions

Requires (deep) study of the application/algorithm

Tailored solution: higher efficiency

yves.robert@inria.fr Fault-tolerance for HPC 105/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Backward Recovery vs. Forward Recovery

Backward Recovery

Rollback / Backward Recovery: goes back in the history to recover from failures.

Spends time to re-execute computations

Rebuilds states already reached

Typical: checkpointing techniques

yves.robert@inria.fr Fault-tolerance for HPC 106/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Backward Recovery vs. Forward Recovery

Forward Recovery

Forward Recovery: proceeds without returning

Pays additional costs during (failure-free) computation to maintain consistent redundancy

Or pays additional computations when failures happen

General technique: Replication

Application-specific techniques: iterative algorithms with fixed-point convergence, ABFT, ...

yves.robert@inria.fr Fault-tolerance for HPC 106/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

yves.robert@inria.fr Fault-tolerance for HPC 107/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerance (ABFT)

Principle

Limited to Linear Algebra computations

Matrices are extended with rows and/or columns of checksums

M =

  5 1 7 | 13
  4 3 5 | 12
  4 6 9 | 19

yves.robert@inria.fr Fault-tolerance for HPC 108/ 168
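The row-checksum extension on this slide can be sketched in a few lines of Python (an illustrative sketch, not part of any ABFT library; the helper name is made up):

```python
# Extend each row of a matrix with a trailing checksum entry,
# as in the ABFT principle above: data rows [5,1,7] -> [5,1,7,13], etc.
def add_row_checksums(matrix):
    """Return a copy of `matrix` where every row ends with the sum of its data."""
    return [row + [sum(row)] for row in matrix]

M = [[5, 1, 7],
     [4, 3, 5],
     [4, 6, 9]]

Mc = add_row_checksums(M)
# Mc == [[5, 1, 7, 13], [4, 3, 5, 12], [4, 6, 9, 19]]
```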

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and fail-stop errors

Missing checksum data

M =
  5 1 7 | 13
  4 3 5 |  _
  4 6 9 | 19

Simple recomputation: 4 + 3 + 5 = 12.

Missing original data

M =
  5 1 7 | 13
  4 _ 5 | 12
  4 6 9 | 19

Simple recomputation: 12 − (4 + 5) = 3.

yves.robert@inria.fr Fault-tolerance for HPC 109/ 168
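Both recoveries on this slide reduce to one subtraction or one sum per lost entry; a minimal sketch (illustrative helper names, not library code):

```python
def recover_checksum(row):
    """Lost checksum: recompute it as the sum of the surviving data entries."""
    return sum(row[:-1])

def recover_entry(row, lost):
    """Lost data entry: checksum minus the sum of the surviving data entries."""
    data, checksum = row[:-1], row[-1]
    return checksum - sum(v for i, v in enumerate(data) if i != lost)

assert recover_checksum([4, 3, 5, None]) == 12        # 4 + 3 + 5
assert recover_entry([4, None, 5, 12], lost=1) == 3   # 12 - (4 + 5)
```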

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =
  5 1 7 | 13
  4 3 5 | 13
  4 6 9 | 19

Error detection: 4 + 3 + 5 ≠ 13

Limitations

The following matrix would have successfully passed the sanity check:

M =
  5 1 7 | 13
  5 3 5 | 13
  4 6 9 | 19

Can detect one error and correct zero.

yves.robert@inria.fr Fault-tolerance for HPC 110/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

One row and one column of checksums

M =
   5  1  7 | 13
   4  3  5 | 11
   4  6  9 | 19
  ----------
  13  9 21 | 43

Checksum recomputation to look for silent data corruptions:

5 + 1 + 7 = 13
4 + 3 + 5 = 12
4 + 6 + 9 = 19

13 + 10 + 21 = 44

Checksums do not match!

yves.robert@inria.fr Fault-tolerance for HPC 111/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =
   5  1  7 | 13
   4  3  5 | 11
   4  6  9 | 19
  ----------
  13  9 21 | 43

Recomputed sums: 5 + 1 + 7 = 13, 4 + 3 + 5 = 12, 4 + 6 + 9 = 19; 13 + 10 + 21 = 44

Both checksums are affected, giving the location of the error. We solve:

4 + x + 5 = 11 and 1 + x + 6 = 9, hence x = 2

Recomputing the checksums we find that:
5 + 1 + 7 = 13
4 + 2 + 5 = 11
4 + 6 + 9 = 19
13 + 9 + 21 = 43

Checksums match ☺

Can detect two errors and correct one

yves.robert@inria.fr Fault-tolerance for HPC 112/ 168
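The locate-and-correct procedure above (mismatching row and column checksums intersect at the faulty entry) can be sketched as follows; the function name and structure are illustrative, not from any library:

```python
def locate_and_correct(Mf):
    """Mf: square matrix whose last row and last column hold checksums.
    Detect a single corrupted data entry and correct it in place."""
    n = len(Mf) - 1
    # first row whose data does not sum to its row checksum
    bad_row = next((i for i in range(n) if sum(Mf[i][:n]) != Mf[i][n]), None)
    # first column whose data does not sum to its column checksum
    bad_col = next((j for j in range(n)
                    if sum(Mf[i][j] for i in range(n)) != Mf[n][j]), None)
    if bad_row is None or bad_col is None:
        return None                      # no single data error located
    # the faulty entry sits at the intersection; solve data + x = checksum
    others = sum(Mf[bad_row][j] for j in range(n) if j != bad_col)
    Mf[bad_row][bad_col] = Mf[bad_row][n] - others
    return bad_row, bad_col

M = [[5, 1, 7, 13],
     [4, 3, 5, 11],    # entry (1, 1) was silently corrupted: should be 2
     [4, 6, 9, 19],
     [13, 9, 21, 43]]
assert locate_and_correct(M) == (1, 1) and M[1][1] == 2
```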

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT for Matrix-Matrix multiplication

Aim: Computation of C = A× B

Let e^T = [1, 1, ..., 1]. We define (rows of a block matrix separated by ";"):

A^c := [ A ; e^T A ],   B^r := [ B  Be ],   C^f := [ C  Ce ; e^T C  e^T Ce ]

where A^c is the column checksum matrix, B^r is the row checksum matrix and C^f is the full checksum matrix.

A^c × B^r = [ A ; e^T A ] × [ B  Be ] = [ AB  ABe ; e^T AB  e^T ABe ]
          = [ C  Ce ; e^T C  e^T Ce ] = C^f

yves.robert@inria.fr Fault-tolerance for HPC 113/ 168
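The identity A^c × B^r = C^f can be checked numerically; here is a pure-Python sketch (illustrative helper names, exact integer arithmetic so the checksums match bit for bit):

```python
def matmul(A, B):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def col_checksum(A):          # Ac: append the row e^T A (column sums)
    return A + [[sum(c) for c in zip(*A)]]

def row_checksum(B):          # Br: append the column Be (row sums)
    return [row + [sum(row)] for row in B]

def full_checksum(C):         # Cf: both, plus the total sum in the corner
    return row_checksum(col_checksum(C))

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
C = matmul(A, B)
# multiplying the checksum-extended factors directly yields the
# full checksum matrix of the product: the checksums "travel" with the data
assert matmul(col_checksum(A), row_checksum(B)) == full_checksum(C)
```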

Intro Checkpointing Forward-recovery Silent Errors Conclusion

In practice... things are more complicated!

When do errors strike? Are all data always protected?

Computations are approximate because of floating-point rounding

Error detection and error correction capabilities depend on the number of checksum rows and columns

yves.robert@inria.fr Fault-tolerance for HPC 114/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques
   ABFT for matrix product
   ABFT for (tiled) LU factorization
   Composite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

yves.robert@inria.fr Fault-tolerance for HPC 115/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

(Figure: A is progressively transformed into L and U factors; at each step, GETF2 factorizes a column block, TRSM updates the row block, GEMM updates the trailing matrix)

Solve A · x = b (hard)

Transform A into an LU factorization

Solve L · y = b, then U · x = y

yves.robert@inria.fr Fault-tolerance for HPC 116/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

(Figure: a matrix of blocks distributed over a 2 × 3 process grid; the blocks owned by rank 2 are scattered over the whole matrix)

Failure of rank 2

2D Block Cyclic Distribution (here 2 × 3)

A single failure ⇒ many data lost

yves.robert@inria.fr Fault-tolerance for HPC 116/ 168
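Why one failure scatters the losses can be seen by mapping blocks to ranks; a sketch assuming a row-major rank numbering (actual layouts, e.g. ScaLAPACK's, may number differently):

```python
def owner(i, j, P, Q):
    """Rank owning block (i, j) under a 2D block-cyclic distribution on a
    P x Q grid (row-major rank numbering assumed here for illustration)."""
    return (i % P) * Q + (j % Q)

P, Q, nb = 2, 3, 8  # 2 x 3 process grid, an 8 x 8 grid of matrix blocks
lost = [(i, j) for i in range(nb) for j in range(nb)
        if owner(i, j, P, Q) == 2]   # blocks lost when rank 2 fails

# rank 2 owns every block with i even and j = 2 (mod 3):
# the lost blocks are spread over the whole matrix, not one contiguous piece
assert len(lost) == 8
```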

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: an M/mb × N/nb block matrix on a P × Q process grid, extended on the right with duplicated checksum block columns, of total width < 2N/Q + nb)

Checksum: invertible operation on the data of the row / column

Checksum blocks are doubled, to allow recovery when data and checksum are lost together

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: same layout, with one checksum block column of width N/Q per group of Q block columns, held by dedicated processes)

Checksum: invertible operation on the data of the row / column

Checksum replication can be avoided by dedicating computing resources to checksum storage

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: GETF2, TRSM and GEMM operate on the checksum blocks together with the matrix blocks)

Idea of ABFT: applying the operation on data and checksum preserves the checksum properties

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: the freshly factorized panel is not covered by the trailing-matrix update of the checksums)

For the part of the data that is not updated this way, the checksum must be re-calculated

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: GETF2, TRSM and GEMM proceed on the matrix while checksum updates are batched)

To avoid slowing down all processors and the panel operation, group checksum updates every Q block columns

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: the delayed checksum update is applied to the completed group of Q block columns)

Then, update the missing coverage. Keep the checkpoint block column to cover failures during that time

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: the next group of Q panels is copied into "checkpoint" block columns before being factorized)

Checkpoint the next set of Q-panels to be able to return to it in case of failures

yves.robert@inria.fr Fault-tolerance for HPC 117/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: the blocks of the failed process are rebuilt by subtraction from the checksum blocks)

In case of failure, conclude the operation, then:

Missing Data = Checksum − Sum(Existing Data)

yves.robert@inria.fr Fault-tolerance for HPC 118/ 168
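The recovery rule "Missing Data = Checksum − Sum(Existing Data)" applies entrywise to whole lost blocks; a minimal sketch (blocks modeled as flat lists of floats, names illustrative):

```python
def recover_lost_block(surviving, checksum):
    """Rebuild a lost block entrywise: checksum block minus the sum of the
    surviving blocks of the same checksum group."""
    return [c - sum(b[k] for b in surviving)
            for k, c in enumerate(checksum)]

b0, b1, b2 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
checksum = [b0[k] + b1[k] + b2[k] for k in range(2)]   # sum checksum block

# the rank holding b1 fails; recover b1 from the others and the checksum
assert recover_lost_block([b0, b2], checksum) == [3.0, 4.0]
```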

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

(Figure: lost checksum blocks are rebuilt by summing the surviving data blocks)

In case of failure, conclude the operation, then:

Missing Checksum = Sum(Existing Data)

yves.robert@inria.fr Fault-tolerance for HPC 118/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

(Figure: a failure strikes in the middle of the factorization of the current group of Q panels)

Failures may happen while inside a Q-panel factorization

yves.robert@inria.fr Fault-tolerance for HPC 119/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

(Figure: everything but the current group of Q panels is recovered from the checksums)

Valid checksum information allows to recover most of the missing data, but not all: the checksums for the current Q-panels are not valid

yves.robert@inria.fr Fault-tolerance for HPC 119/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

(Figure: the "checkpoint" block columns overwrite the current group of Q panels)

We use the checkpoint to restore the Q-panel to its initial state

yves.robert@inria.fr Fault-tolerance for HPC 119/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

(Figure: the restored group of Q panels is factorized again)

and re-execute that part of the factorization, without applying updates outside of its scope

yves.robert@inria.fr Fault-tolerance for HPC 119/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: implementation

MPI Implementation

PBLAS-based: need to provide a "Fault-Aware" version of the library

Cannot enter the recovery state at any point in time: need to complete ongoing operations despite failures

Recovery starts by defining the position of each process in the factorization and bringing them all to a consistent state (checksum property holds)

Need to test the return code of each and every MPI-related call

yves.robert@inria.fr Fault-tolerance for HPC 120/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: performance

As supercomputers grow ever larger in scale, the Mean Time to Failure becomes shorter and shorter, making the complete and successful execution of complex applications more and more difficult. FT-LA delivers a new approach, utilizing Algorithm-Based Fault Tolerance (ABFT), to help factorization algorithms survive fail-stop failures. The FT-LA software package extends ScaLAPACK with ABFT routines, and in sharp contrast with legacy checkpoint-based approaches, ABFT does not incur I/O overhead, and promises a much more scalable protection scheme.

ABFT: THE IDEA

Cost of ABFT comes only from extra flops (to update checksums) and extra storage

Cost decreases with machine scale (divided by Q when using P×Q processes)

PROTECTION

Matrix protected by block row checksums

The algorithm updates both the trailing matrix AND the checksums

RECOVERY

Missing blocks reconstructed by inverting the checksum operation

FUNCTIONALITY COVERAGE

Linear Systems of Equations

Least Squares

Cholesky, LU

QR (with protection of the upper and lower factors)

FEATURES

Covering four precisions: double complex, single complex, double real, single real (ZCDS)

Deploys on the MPI FT draft (ULFM), or with "Checkpoint-on-Failure"

Allows toleration of permanent crashes

WORK IN PROGRESS

Hessenberg Reduction, Soft (silent) Errors

Process grid: p × q; F: simultaneous failures tolerated

Computation: flops for the checksum update

Memory: the matrix is extended with 2F columns every q columns

Usually F << q; overheads in F/q

Protection against 2 faults on 192×192 processes ⇒ 1% overhead

Protection cost is inversely proportional to machine scale!

FIND OUT MORE AT http://icl.cs.utk.edu/ft-la

PERFORMANCE ON KRAKEN

(Figure: weak scaling of LU on Kraken; process grids 6×6 to 192×192, matrix sizes 20k to 640k; performance (TFlop/s) of ScaLAPACK PDGETRF, FT-PDGETRF (no error) and FT-PDGETRF (w/ 1 recovery), with the relative overhead (%) of the two FT variants)

Open MPI with ULFM; Kraken supercomputer.

yves.robert@inria.fr Fault-tolerance for HPC 121/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT QR decomposition: performance

A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra

(Figure: weak scaling of FT-QR on Kraken; process grids and matrix sizes from 6×6; 20k to 48×48; 160k; performance (TFlop/s) and relative overhead (%) of ScaLAPACK PDGEQRF, FT-PDGEQRF (no errors) and FT-PDGEQRF (one error))

Fig. 11. Weak scalability of FT-QR: run time overhead on Kraken when failures strike

local snapshots have to be used along with re-factorization to recover the lost data and restore the matrix state. This is referred to as the "failure within Q panels."

Figure 10 shows the overhead from these two cases for the LU factorization, along with the no-error overhead as a reference. In the "border" case, the failure is simulated to strike when the 96th panel (which is a multiple of the grid columns, 6, 12, ..., 48) has just finished. In the "non-border" case, failure occurs during the (Q + 2)th panel factorization. For example, when Q = 12, the failure is injected when the trailing update for the step with panel (1301, 1301) finishes. From the result in Figure 10, the recovery procedure in both cases adds a small overhead that also decreases when scaled to larger problem sizes and process grids. For the largest setups, only 2-3 percent of the execution time is spent recovering from a failure.

7.4. Extension to Other Factorizations

The algorithm proposed in this work can be applied to a wide range of dense matrix factorizations other than LU. As a demonstration we have extended the fault tolerance functions to the ScaLAPACK QR factorization in double precision. Since QR uses a block algorithm similar to LU (and also similar to Cholesky), the integration of fault tolerance functions is mostly straightforward. Figure 11 shows the performance of QR with and without recovery. The overhead drops as the problem and grid size increase, although it remains higher than that of LU for the same problem size. This is expected: as the QR algorithm has a higher complexity than LU (4/3 N³ vs. 2/3 N³), the ABFT approach incurs more extra computation when updating checksums. Similar to the LU result, recovery adds an extra 2% overhead. At size 160,000 a failure incurs about a 5.7% penalty to be recovered. This overhead becomes lower, the larger the problem or processor grid size considered.

ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: January 2013.


Open MPI with ULFM; Kraken supercomputer.

yves.robert@inria.fr Fault-tolerance for HPC 122/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: implementation

(Figure: upon a failure, surviving processes store checkpoints, the application restarts and reloads them, then performs ABFT recovery)

Checkpoint on Failure - MPI Implementation

FT-MPI / MPI-Next FT: not easily available on large machines

Checkpoint on Failure = workaround

yves.robert@inria.fr Fault-tolerance for HPC 123/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT QR decomposition: performance

(Fig. 2: ABFT QR and one CoF recovery on Kraken (Lustre): performance (Tflop/s) vs. matrix size N = 20k to 100k; curves: ScaLAPACK QR, CoF-QR (w/o failure), CoF-QR (w/ 1 failure))

(Fig. 3: ABFT QR and one CoF recovery on Dancer (local SSD): performance (Gflop/s) vs. matrix size N = 10k to 50k; same curves)

(Fig. 4: Time breakdown of one CoF recovery on Dancer (local SSD): application time share (%) of Load Checkpoint, Dump Checkpoint and ABFT Recovery, N = 20k to 50k)

5.3 Checkpoint-on-Failure QR Performance

Supercomputer Performance: Figure 2 presents the performance on the Kraken supercomputer. The process grid is 24×24 and the block size is 100. The CoF-QR (no failure) curve presents the performance of the CoF QR implementation in a fault-free execution; it is noteworthy that, when there are no failures, the performance is exactly identical to the performance of the unmodified FT-QR implementation. The CoF-QR (with failure) curves present the performance when a failure is injected after the first step of the PDLARFB kernel. The performance of the non-fault-tolerant ScaLAPACK QR is also presented for reference.

Without failures, the performance overhead compared to the regular ScaLAPACK is caused by the extra computation to maintain the checksums inherent to the ABFT algorithm [12]; this extra computation is unchanged between CoF-QR and FT-QR. Only on runs where a failure happened do the CoF protocols undergo the supplementary overhead of storing and reloading checkpoints. However, the performance of the CoF-QR remains very close to the no-failure case. For instance, at matrix size N=100,000, CoF-QR still achieves 2.86 Tflop/s after recovering from a failure, which is 90% of the performance of the non-fault-tolerant ScaLAPACK QR. This demonstrates that the CoF protocol enables efficient, practical recovery schemes on supercomputers.

Impact of Local Checkpoint Storage: Figure 3 presents the performance of the CoF-QR implementation on the Dancer cluster with an 8×16 process grid. Although a smaller test platform, the Dancer cluster features local storage on nodes and a variety of performance analysis tools unavailable on Kraken. As expected (see [12]), the ABFT method has a higher relative cost on this smaller machine. Compared to the Kraken platform, the relative cost of CoF failure recovery is smaller on Dancer. The CoF protocol incurs disk accesses to store and load checkpoints when a failure hits, hence the recovery overhead depends on I/O performance. By breaking down the relative cost of each recovery step in CoF, Figure 4 shows that checkpoint saving and loading only take a small percentage of the total run-time, thanks to the availability of solid-state disks on every node. Since checkpoint reloading immediately follows checkpointing, the OS cache satisfies most disk accesses, resulting in high I/O performance. For matrices larger than

Checkpoint on Failure - MPI Performance

Open MPI; Kraken supercomputer;

yves.robert@inria.fr Fault-tolerance for HPC 124/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques
   ABFT for matrix product
   ABFT for (tiled) LU factorization
   Composite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

yves.robert@inria.fr Fault-tolerance for HPC 125/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Fault Tolerance Techniques

General Techniques

Replication

Rollback Recovery

Coordinated Checkpointing
Uncoordinated Checkpointing & Message Logging
Hierarchical Checkpointing
Multilevel Checkpointing

Application-Specific Techniques

Algorithm Based Fault Tolerance(ABFT)

Iterative Convergence

Approximated Computation

yves.robert@inria.fr Fault-tolerance for HPC 126/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Application

Typical Application

for (an insane number) {
    /* Extract data from simulation,
     * fill up matrix */
    sim2mat();

    /* Factorize matrix, solve */
    dgeqrf();
    dsolve();

    /* Update simulation
     * with result vector */
    vec2sim();
}

(Figure: timeline of Processes 0-2, alternating Application code (GENERAL phase) and Library calls (LIBRARY phase))

Characteristics

☺ Large part of (total) computation spent in factorization/solve

Between LA operations:

☹ use resulting vector / matrix with operations that do not preserve the checksums on the data

☹ modify data not covered by ABFT algorithms

yves.robert@inria.fr Fault-tolerance for HPC 127/ 168

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Application


Goodbye ABFT?!


Problem Statement

How to use fault-tolerant operations(*) within a non-fault-tolerant(**) application?(***)

(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT
(***) And keep the application globally fault tolerant...


ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: no failure

[Diagram: periodic checkpoints during the GENERAL phase; split, forced checkpoints taken at the transitions into and out of the LIBRARY phase]


ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during Library phase

[Diagram: failure during the LIBRARY phase; partial rollback and recovery of GENERAL-phase data, ABFT recovery of LIBRARY-phase data]


ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during General phase

[Diagram: failure during the GENERAL phase; full rollback to the last checkpoint, then recovery]



ABFT&PeriodicCkpt: Optimizations

[Diagram: ABFT&PeriodicCkpt execution; the checkpoint interval applies to the GENERAL phase only]

ABFT&PeriodicCkpt: Optimizations

If the duration of the General phase is too small: don't add checkpoints

If the duration of the Library phase is too small: don't do ABFT recovery, remain in General mode

This assumes a performance model for the library call


Conclusion

Algorithm-Based Fault Tolerance

Application-specific solution for linear algebra kernels

Low-overhead forward-recovery solution

Used alone or in conjunction with backward-recovery solutions

Limitation

Need 2f linearly independent checksum vectors to detect and correct f errors

Numerical instability ⇒ implementations limited to f = 1 or f = 2 ☹
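The checksum idea behind this limitation can be illustrated with a tiny sketch (not from the tutorial; all names are illustrative). For f = 1, two checksum vectors suffice: a plain sum detects a single corrupted entry and gives its magnitude, while a position-weighted sum locates it:

```python
def encode(v):
    # 2f = 2 checksums for f = 1: plain sum and position-weighted sum
    c1 = sum(v)
    c2 = sum((i + 1) * x for i, x in enumerate(v))
    return c1, c2

def correct_single_error(v, c1, c2):
    # return a corrected copy of v, assuming at most one corrupted entry
    delta = sum(v) - c1                      # error magnitude
    if delta == 0:
        return v[:]                          # no (detectable) error
    pos = round((sum((i + 1) * x for i, x in enumerate(v)) - c2) / delta) - 1
    fixed = v[:]
    fixed[pos] -= delta                      # undo the corruption
    return fixed

data = [3, 1, 4, 1, 5, 9, 2, 6]
c1, c2 = encode(data)
corrupted = data[:]
corrupted[5] += 42                           # inject a single silent error
recovered = correct_single_error(corrupted, c1, c2)
```

This works cleanly in exact (integer) arithmetic; in floating point the weighted checksums become ill-conditioned as f grows, which is exactly the numerical-instability limitation stated above.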


Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors
Coupling checkpointing and verification
Application-specific methods

5 Conclusion


Definitions

Instantaneous error detection ⇒ fail-stop failures, e.g. resource crash

Silent errors (data corruption) ⇒ detection latency

Silent error detected only when the corrupt data is activated

Includes some software faults, some hardware errors (soft errors in L1 cache), double bit flips

Cannot always be corrected by ECC memory


Quotes

Soft Error: An unintended change in the state of an electronic device that alters the information that it stores without destroying its functionality, e.g. a bit flip caused by a cosmic-ray-induced neutron. (Hengartner et al., 2008)

SDC occurs when incorrect data is delivered by a computing system to the user without any error being logged. (Cristian Constantinescu, AMD)

Silent errors are the black swan of errors (Marc Snir)


Should we be afraid? (courtesy Al Geist)


Probability distributions for silent errors

Theorem: µp = µind/p for arbitrary distributions


Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors
Coupling checkpointing and verification
Application-specific methods

5 Conclusion



General-purpose approach

[Timeline: error strikes, then is detected after a latency; Xe = time to error, Xd = detection latency]

Last checkpoint may have saved an already corrupted state

Saving k checkpoints (Lu, Zheng and Chien):

1. Critical failure when all live checkpoints are invalid (assume unlimited storage resources)
2. Which checkpoint to roll back to? (assume a verification mechanism)


Optimal period?

[Timeline: error and detection latency]

Xe: inter-arrival time between errors; mean µe

Xd: error detection time; mean µd

Assume Xd and Xe independent


Arbitrary distribution

Wasteff = C/T

Wastefail = (T/2 + R + µd)/µe

Only valid if T/2 + R + µd ≪ µe

Theorem

Best period is Topt ≈ √(2µeC)

Independent of Xd


Exponential distribution

Theorem

At the end of the day,

E(T(w)) = e^(λe·R) (µe + µd)(e^(λe(w+C)) − 1)

Optimal period independent of µd

Good approximation is T = √(2µeC) (Young's formula)
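A quick numerical sanity check (a sketch, with made-up parameter values): the first-order waste model above, Waste(T) = C/T + (T/2 + R + µd)/µe, is indeed minimized near Topt = √(2µeC), independently of the detection latency µd:

```python
import math

# illustrative values (seconds): checkpoint, recovery, platform MTBE, detection latency
C, R, mu_e, mu_d = 600.0, 300.0, 86400.0, 1800.0

def waste(T):
    # checkpoint overhead + expected lost/re-executed time per error
    return C / T + (T / 2 + R + mu_d) / mu_e

T_opt = math.sqrt(2 * mu_e * C)   # Young's formula
```

A brute-force scan over candidate periods lands within a fraction of a percent of Young's value; R and µd only shift the waste curve vertically, they do not move its minimum.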


The case with limited resources

Assume that we can only save the last k checkpoints

Definition (Critical failure)

Error detected when all checkpoints contain corrupted data. Happens with probability Prisk during the whole execution.

Prisk decreases when T increases (when Xd is fixed). Hence, Prisk ≤ ε leads to a lower bound Tmin on T

Can derive an analytical form for Prisk when Xd follows an Exponential law. Use it as a good(?) approximation for arbitrary laws


Limitation of the model

It is not clear how to detect when the error has occurred (hence to identify the last valid checkpoint) ☹

Need a verification mechanism to check the correctness of the checkpoints. This has an additional cost!


Coupling checkpointing and verification

Verification mechanism of cost V

Silent errors detected only when verification is executed

Approach agnostic of the nature of the verification mechanism (checksum, error correcting code, coherence tests, etc.)

Fully general-purpose (application-specific information, if available, can always be used to decrease V)


On-line ABFT scheme for PCG

Zizhong Chen, PPoPP’13

Iterate PCG. Cost: SpMV, preconditioner solve, 5 linear kernels

Detect soft errors by checking orthogonality and residual

Verification every d iterations. Cost: scalar product + SpMV

Checkpoint every c iterations. Cost: three vectors, or two vectors + SpMV at recovery

Experimental method to choose c and d


Base pattern (and revisiting Young/Daly)

[Timeline: periods of work W, each followed by verification V and checkpoint C]

              Fail-stop (classical)      Silent errors
Pattern       T = W + C                  S = W + V + C
Waste[FF]     C/T                        (V + C)/S
Waste[fail]   (1/µ)(D + R + W/2)         (1/µ)(R + W + V)
Optimal       Topt = √(2Cµ)              Sopt = √((C + V)µ)
Waste[opt]    √(2C/µ)                    2√((C + V)/µ)


With p = 1 checkpoint and q = 3 verifications

[Timeline: one checkpoint per pattern, a verification after each of the three chunks of size w]

Base Pattern p = 1, q = 1: Waste[opt] = 2√((C + V)/µ)

New Pattern p = 1, q = 3: Waste[opt] = 2√(4(C + 3V)/(6µ))
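Plugging in sample numbers (illustrative, not from the slides): with C = 60 s, V = 1 s and µ = 10^5 s, the q = 3 pattern wins. Algebraically it wins exactly when 4(C + 3V)/6 < C + V, i.e. when C > 3V, so extra verifications pay off when verification is cheap relative to checkpointing:

```python
import math

mu = 1e5          # platform MTBE (seconds), illustrative
C, V = 60.0, 1.0  # checkpoint and verification costs, illustrative

waste_base = 2 * math.sqrt((C + V) / mu)               # p = 1, q = 1
waste_q3 = 2 * math.sqrt(4 * (C + 3 * V) / (6 * mu))   # p = 1, q = 3
```

The 4/6 factor is fre(1, 3) = (q + 1)/(2q), the re-executed fraction of work for one checkpoint and three verifications.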


BalancedAlgorithm

[Timeline: chunks of sizes 2w, 2w, w, w, 2w, 2w; pattern V C, V, V C, V, V, V C]

p checkpoints and q verifications, p ≤ q

p = 2, q = 5, S = 2C + 5V + W

W = 10w , six chunks of size w or 2w

May store invalid checkpoint (error during third chunk)

After successful verification in fourth chunk, precedingcheckpoint is valid

Keep only two checkpoints in memory and avoid any fatalfailure


BalancedAlgorithm

[Timeline: chunks of sizes 2w, 2w, w, w, 2w, 2w; pattern V C, V, V C, V, V, V C]

1. (proba 2w/W) Tlost = R + 2w + V
2. (proba 2w/W) Tlost = R + 4w + 2V
3. (proba w/W) Tlost = 2R + 6w + C + 4V
4. (proba w/W) Tlost = R + w + 2V
5. (proba 2w/W) Tlost = R + 3w + 2V
6. (proba 2w/W) Tlost = R + 5w + 3V

Waste[opt] ≈ 2√(7(2C + 5V)/(20µ))
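The six cases above can be checked mechanically (a sketch; w, R, V, C kept symbolic as coefficients): the work coefficient of E[Tlost] divided by W gives exactly fre = 7/20 = (p + q)/(2pq) for p = 2, q = 5:

```python
from fractions import Fraction as F

# one row per chunk: (probability numerator over W = 10w,
#                     coefficients of R, w, V, C in Tlost for an error there)
cases = [
    (2, (1, 2, 1, 0)),   # chunk 1: Tlost = R + 2w + V
    (2, (1, 4, 2, 0)),   # chunk 2: Tlost = R + 4w + 2V
    (1, (2, 6, 4, 1)),   # chunk 3: invalid checkpoint stored -> 2R + 6w + 4V + C
    (1, (1, 1, 2, 0)),   # chunk 4: Tlost = R + w + 2V
    (2, (1, 3, 2, 0)),   # chunk 5: Tlost = R + 3w + 2V
    (2, (1, 5, 3, 0)),   # chunk 6: Tlost = R + 5w + 3V
]

assert sum(F(p, 10) for p, _ in cases) == 1   # the six cases cover the pattern

expected_w = sum(F(p, 10) * coeffs[1] for p, coeffs in cases)  # in units of w
f_re = expected_w / 10                                         # divide by W = 10w
```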


Analysis

Key parameters

off: failure-free overhead per pattern

fre: fraction of work that is re-executed

Wasteff = off/S, where off = pC + qV and S = off + pqw ≪ µ

Wastefail = Tlost/µ, where Tlost = fre·S + β

β: constant, linear combination of C, V and R

Waste ≈ off/S + fre·S/µ ⇒ Sopt ≈ √((off/fre)·µ)

Waste[opt] = 2√(off·fre/µ) + o(√(1/µ))


Computing fre when p = 1

[Timeline: one checkpoint, chunks α1W, α2W, α3W each followed by a verification]

Theorem

The minimal value of fre(1, q) is obtained for same-size chunks

fre(1, q) = Σ_{i=1}^{q} αi (Σ_{j=1}^{i} αj)

Minimal when αi = 1/q

In that case, fre(1, q) = (q + 1)/(2q)
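The theorem is easy to verify numerically (a sketch; chunk fractions chosen arbitrarily for the comparison): with exact rationals, equal chunks give (q + 1)/(2q) and any skewed split re-executes more work on average:

```python
from fractions import Fraction as F

def f_re(alphas):
    # re-executed fraction of work for one checkpoint and len(alphas)
    # verifications: an error in chunk i re-executes chunks 1..i
    return sum(a * sum(alphas[: i + 1]) for i, a in enumerate(alphas))

q = 4
equal = [F(1, q)] * q
skewed = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]  # same total work, uneven
```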


Computing fre when p ≥ 1

[Timeline: chunks α1W, α2W, α3W with intermediate checkpoints]

Theorem

fre(p, q) ≥ (p + q)/(2pq); the bound is matched by BalancedAlgorithm.

Assess gain due to the p − 1 intermediate checkpoints:

fre(1) − fre(p) = Σ_{i=1}^{p} αi (Σ_{j=1}^{i−1} αj)

Maximal when αi = 1/p for all i

In that case, fre(1) − fre(p) = (p − 1)/(2p)

Now best with equipartition of verifications too

In that case, fre(1) = (q + 1)/(2q) and fre(p) = (q + 1)/(2q) − (p − 1)/(2p) = (q + p)/(2pq)


Choosing optimal pattern

Let V = γC, where 0 < γ ≤ 1

off · fre = ((p + q)/(2pq))(pC + qV) = C × ((p + q)/2)(1/q + γ/p)

Given γ, minimize ((p + q)/2)(1/q + γ/p) with 1 ≤ p ≤ q, and p, q taking integer values

Let p = λ × q. Then λopt = √γ = √(V/C)
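A brute-force search over integer patterns illustrates the result (a sketch; γ and the search range are arbitrary). With γ = 0.25, i.e. V = C/4, the best integer pattern has p/q = 0.5 = √γ, matching λopt:

```python
import math

def objective(p, q, gamma):
    # (p+q)/2 * (1/q + gamma/p), proportional to off * f_re / C
    return (p + q) / 2 * (1 / q + gamma / p)

gamma = 0.25  # V = C/4, illustrative
best = min(((objective(p, q, gamma), p, q)
            for q in range(1, 21) for p in range(1, q + 1)),
           key=lambda t: t[0])
lam = best[1] / best[2]  # p/q ratio of the best integer pattern
```

Here the continuous optimum ((λ + 1)(λ + γ)/(2λ) = 1.125 at λ = 0.5) happens to be achieved exactly by the integer pattern (p, q) = (1, 2); in general the integer search only approximates λopt.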


Summary

[Timeline: BalancedAlgorithm pattern with chunks 2w, 2w, w, w, 2w, 2w]

BalancedAlgorithm optimal when C, R, V ≪ µ

Keep only 2 checkpoints in memory/storage

Closed-form formula for Waste[opt]

Given C and V , choose optimal pattern

Gain of up to 20% over base pattern


Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors
Coupling checkpointing and verification
Application-specific methods

5 Conclusion


Literature

ABFT: dense matrices / fail-stop, extended to sparse / silent. Limited to one error detection and/or correction in practice

Asynchronous (chaotic) iterative methods (old work)

Partial differential equations: use lower-order scheme as verification mechanism (detection only; Benson, Schmit and Schreiber)

FT-GMRES: inner-outer iterations (Hoemmen and Heroux)

PCG: orthogonalization check every k iterations, re-orthogonalization if problem detected (Sao and Vuduc)

. . . Many others


Dynamic programming for linear chains of tasks

{T1, T2, . . . , Tn}: linear chain of n tasks

Each task Ti fully parametrized:

wi: computational weight
Ci, Ri, Vi: checkpoint, recovery, verification costs

Error rates:

λF: rate of fail-stop errors
λS: rate of silent errors


VC-only

[Timeline: tasks 1 . . . i . . . j; TimeC^rec(i, k − 1) followed by TC(i + 1, j), each segment ending in a verified checkpoint (VC)]

Objective: min_{0≤k<n} TimeC^rec(n, k)

TimeC^rec(j, k) = min_{k≤i<j} { TimeC^rec(i, k − 1) + TC^SF(i + 1, j) }

TC^SF(i, j) = pF_{i,j} (Tlost_{i,j} + R_{i−1} + TC^SF(i, j))
  + (1 − pF_{i,j}) ( Σ_{ℓ=i}^{j} wℓ + Vj + pS_{i,j} (R_{i−1} + TC^SF(i, j)) + (1 − pS_{i,j}) Cj )
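The dynamic-programming idea can be sketched in a much simpler setting (fail-stop errors only, uniform C, R and exponential error rate λ, using the expected-time formula from the exponential-distribution slide above; this is an illustration, not the full VC-only recurrence):

```python
import math

def segment_time(w, lam, R, C):
    # expected time to complete work w and then checkpoint, under exponential
    # fail-stop errors of rate lam (no detection latency)
    return math.exp(lam * R) * (math.exp(lam * (w + C)) - 1) / lam

def best_expected_time(weights, lam, R, C):
    # E[j]: best expected time to finish tasks 0..j-1 with a checkpoint after
    # the last one; try every position i of the previous checkpoint
    n = len(weights)
    E = [0.0] * (n + 1)
    for j in range(1, n + 1):
        E[j] = min(E[i] + segment_time(sum(weights[i:j]), lam, R, C)
                   for i in range(j))
    return E[n]

tasks = [50.0] * 12            # illustrative task weights (seconds)
lam, R, C = 1e-3, 10.0, 10.0   # illustrative error rate and costs
opt = best_expected_time(tasks, lam, R, C)
ckpt_all = sum(segment_time(w, lam, R, C) for w in tasks)   # checkpoint everywhere
ckpt_none = segment_time(sum(tasks), lam, R, C)             # never checkpoint
```

The DP runs in O(n^2) segment evaluations; the full recurrence above additionally tracks verifications and both error types, but has the same shape.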


Young/Daly

[Diagram: TimeFinal = TimeFF + TimeFinal × WasteFail; repeating pattern of work T − V − C followed by V and C]

Waste = Wasteff + Wastefail

Waste = (V + C)/T + λF(s)(R + T/2) + λS(s)(R + T)

Topt = √( 2(V + C) / (λF(s) + 2λS(s)) )
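As a sanity check (a sketch, with illustrative numbers): the combined waste above is convex in T and minimized at the stated Topt. Note that a silent error loses the full period T rather than T/2 on average, hence the factor 2 on λS in the formula:

```python
import math

V, C, R = 30.0, 300.0, 300.0   # illustrative costs (seconds)
lam_F, lam_S = 1e-5, 4e-5      # fail-stop and silent error rates, illustrative

def waste(T):
    # overhead + fail-stop loss (half a period on average) + silent loss (full period)
    return (V + C) / T + lam_F * (R + T / 2) + lam_S * (R + T)

T_opt = math.sqrt(2 * (V + C) / (lam_F + 2 * lam_S))
```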


Extensions

VC-only and VC+V

Different speeds with DVFS, different error rates

Different execution modes

Optimize for time or for energy consumption

Current research

Use verification to correct some errors (ABFT) ⇒ similar analysis, smaller error rate but higher verification cost

Use partial detectors (recall < 1)

Address both fail-stop and silent errors simultaneously (multi-level patterns)


A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scale, without bias ☺


Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion


Conclusion

Multiple approaches to Fault Tolerance

Application-Specific Fault Tolerance will always provide morebenefits:

Checkpoint Size Reduction (when needed)
Portability (can run on different hardware, different deployment, etc.)
Diversity of use (can be used to restart the execution and change parameters in the middle)


Conclusion

Multiple approaches to Fault Tolerance

General Purpose Fault Tolerance is a required feature of the platforms

Not every computer scientist needs to learn how to write fault-tolerant applications
Not all parallel applications can be ported to a fault-tolerant version

Faults are a feature of the platform. Why should it be the role of the programmers to handle them?


Conclusion

Application-Specific Fault Tolerance

Fault tolerance means introducing redundancy in the application:

replication of computation
maintaining invariants in the data

Requirements of a more fault-friendly programming environment:

MPI-Next evolution
Other programming environments?


Conclusion

General Purpose Fault Tolerance

Software/hardware techniques to reduce checkpoint, recovery, and migration times, and to improve failure prediction

Multi-criteria scheduling problem:
execution time/energy/reliability
add replication
best resource usage (performance trade-offs)

Need to combine all these approaches!

Several challenging algorithmic/scheduling problems ☺


Bibliography

Exascale
• Toward Exascale Resilience, Cappello F. et al., IJHPCA 23, 4 (2009)
• The International Exascale Software Roadmap, Dongarra J., Beckman P. et al., IJHPCA 25, 1 (2011)

ABFT
• Algorithm-based fault tolerance applied to high performance computing, Bosilca G. et al., JPDC 69, 4 (2009)

Coordinated Checkpointing
• Distributed snapshots: determining global states of distributed systems, Chandy K.M., Lamport L., ACM Trans. Comput. Syst. 3, 1 (1985)

Message Logging
• A survey of rollback-recovery protocols in message-passing systems, Elnozahy E.N. et al., ACM Comput. Surveys 34, 3 (2002)

Replication
• Evaluating the viability of process replication reliability for exascale systems, Ferreira K. et al., SC'2011

Models
• Checkpointing strategies for parallel jobs, Bougeret M. et al., SC'2011
• Unified model for assessing checkpointing protocols at extreme-scale, Bosilca G. et al., CCPE 26(17), pp. 2772-2791 (2014)


Bibliography

New Monograph, Springer Verlag 2015
