244
Intro Checkpointing Forward-recovery Silent Errors Conclusion An overview of fault-tolerant techniques for HPC Yves Robert Laboratoire LIP, ENS Lyon Institut Universitaire de France University Tennessee Knoxville [email protected] http://graal.ens-lyon.fr/ ~ yrobert/europar16.pdf http://www.netlib.org/lapack/lawnspdf/lawn289.pdf EuroPar’2016 Tutorial [email protected] Fault-tolerance for HPC 1/ 168

An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

An overview of fault-tolerant techniques for HPC

Yves Robert

Laboratoire LIP, ENS LyonInstitut Universitaire de FranceUniversity Tennessee Knoxville

[email protected]

http://graal.ens-lyon.fr/~yrobert/europar16.pdf

http://www.netlib.org/lapack/lawnspdf/lawn289.pdf

EuroPar’2016 Tutorial

[email protected] Fault-tolerance for HPC 1/ 168

Page 2: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Thanks

INRIA & ENS Lyon

Anne Benoit & Frederic Vivien

Guillaume Aupy, Aurelien Cavelan, Hongyang Sun

Univ. Tennessee Knoxville

George Bosilca, Aurelien Bouteiller & Thomas Herault(joint tutorial at SC’14, SC’15, SC’16 and elsewhere)

Jack Dongarra

Elsewhere

Franck Cappello & Marc Snir, Argonne National Lab.

Henri Casanova, Univ. Hawai‘i

[email protected] Fault-tolerance for HPC 2/ 168

Page 3: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 3/ 168

Page 4: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 IntroductionLarge-scale computing platformsFaults and failures

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 4/ 168

Page 5: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 IntroductionLarge-scale computing platformsFaults and failures

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 5/ 168

Page 6: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exascale platforms (courtesy Jack Dongarra)

Potential System Architecture with a cap of $200M and 20MW Systems 2011

K computer 2019 Difference

Today & 2019

System peak 10.5 Pflop/s 1 Eflop/s O(100)

Power 12.7 MW ~20 MW

System memory 1.6 PB 32 - 64 PB O(10)

Node performance 128 GF 1,2 or 15TF O(10) – O(100)

Node memory BW 64 GB/s 2 - 4TB/s O(100)

Node concurrency 8 O(1k) or 10k O(100) – O(1000)

Total Node Interconnect BW 20 GB/s 200-400GB/s O(10)

System size (nodes) 88,124 O(100,000) or O(1M) O(10) – O(100)

Total concurrency 705,024 O(billion) O(1,000)

MTTI days O(1 day) - O(10)

[email protected] Fault-tolerance for HPC 6/ 168

Page 7: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exascale platforms (courtesy C. Engelmann & S. Scott)

[email protected] Fault-tolerance for HPC 7/ 168

Page 8: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exascale platforms

Hierarchical• 105 or 106 nodes• Each node equipped with 104 or 103 cores

Failure-prone

MTBF – one node 1 year 10 years 120 yearsMTBF – platform 30sec 5mn 1h

of 106 nodes

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

[email protected] Fault-tolerance for HPC 8/ 168

Page 9: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exascale platforms

Hierarchical• 105 or 106 nodes• Each node equipped with 104 or 103 cores

Failure-prone

MTBF – one node 1 year 10 years 120 yearsMTBF – platform 30sec 5mn 1h

of 106 nodes

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)

Exascale

6= Petascale ×1000

[email protected] Fault-tolerance for HPC 8/ 168

Page 10: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Even for today’s platforms (courtesy F. Cappello)

[email protected] Fault-tolerance for HPC 9/ 168

Page 11: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Even for today’s platforms (courtesy F. Cappello)

[email protected] Fault-tolerance for HPC 10/ 168

Page 12: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 IntroductionLarge-scale computing platformsFaults and failures

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 11/ 168

Page 13: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Error sources (courtesy Franck Cappello)

•  Analysis of error and failure logs

•  In 2005 (Ph. D. of CHARNG-DA LU) : “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”

•  In 2007 (Garth Gibson, ICPP Keynote):

•  In 2008 (Oliner and J. Stearley, DSN Conf.): 50%

Hardware

Conclusion: Both Hardware and Software failures have to be considered

Software errors: Applications, OS bug (kernel panic), communication libs, File system error and other.

Hardware errors, Disks, processors, memory, network

[email protected] Fault-tolerance for HPC 12/ 168

Page 14: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A few definitions

Many types of faults: software error, hardware malfunction,memory corruption

Many possible behaviors: silent, transient, unrecoverable

Restrict to faults that lead to application failures

This includes all hardware faults, and some software ones

Will use terms fault and failure interchangeably

Silent errors (SDC) addressed later in the tutorial

[email protected] Fault-tolerance for HPC 13/ 168

Page 15: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distributions: (1) Exponential

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 200 400 600 800 1000

Fa

ilure

Pro

ba

bili

ty

Time (years)

Sequential Machine

Exp(1/100)

Exp(λ): Exponential distribution law of parameter λ:

Pdf: f (t) = λe−λtdt for t ≥ 0

Cdf: F (t) = 1− e−λt

Mean = 1λ

[email protected] Fault-tolerance for HPC 14/ 168

Page 16: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distributions: (1) Exponential

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 200 400 600 800 1000

Fa

ilure

Pro

ba

bili

ty

Time (years)

Sequential Machine

Exp(1/100)

X random variable for Exp(λ) failure inter-arrival times:

P (X ≤ t) = 1− e−λtdt (by definition)

Memoryless property: P (X ≥ t + s |X ≥ s ) = P (X ≥ t)at any instant, time to next failure does not depend upontime elapsed since last failure

Mean Time Between Failures (MTBF) µ = E (X ) = 1λ

[email protected] Fault-tolerance for HPC 14/ 168

Page 17: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distributions: (2) Weibull

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 200 400 600 800 1000

Fa

ilure

Pro

ba

bili

ty

Time (years)

Sequential Machine

Exp(1/100)Weibull(0.7, 1/100)Weibull(0.5, 1/100)

Weibull(k, λ): Weibull distribution law of shape parameter k andscale parameter λ:

Pdf: f (t) = kλ(tλ)k−1e−(λt)kdt for t ≥ 0

Cdf: F (t) = 1− e−(λt)k

Mean = 1λΓ(1 + 1

k )

[email protected] Fault-tolerance for HPC 15/ 168

Page 18: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distributions: (2) Weibull

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 200 400 600 800 1000

Fa

ilure

Pro

ba

bili

ty

Time (years)

Sequential Machine

Exp(1/100)Weibull(0.7, 1/100)Weibull(0.5, 1/100)

X random variable for Weibull(k , λ) failure inter-arrival times:

If k < 1: failure rate decreases with time”infant mortality”: defective items fail early

If k = 1: Weibull(1, λ) = Exp(λ) constant failure time

[email protected] Fault-tolerance for HPC 15/ 168

Page 19: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distributions: with several processors

Processor (or node): any entity subject to failures⇒ approach agnostic to granularity

If the MTBF is µ with one processor,what is its value with p processors?

[email protected] Fault-tolerance for HPC 16/ 168

Page 20: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Intuition

Time

p1

p2

p3

t

If three processors have around 20 faults during a time t (µ = t20 )...

Time

p

t

...during the same time, the platform has around 60 faults (µp = t60 )

[email protected] Fault-tolerance for HPC 17/ 168

Page 21: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Platform MTBF

Rebooting only faulty processor

Platform failure distribution⇒ superposition of p IID processor distributions⇒ IID only for Exponential

Define µp by

limF→+∞

n(F )

F=

1

µp

n(F ) = number of platform failures until time F is exceeded

Theorem: µp =µ

pfor arbitrary distributions

[email protected] Fault-tolerance for HPC 18/ 168

Page 22: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

MTBF with p processors (1/2)

Theorem: µp = µp for arbitrary distributions

With one processor:

n(F ) = number of failures until time F is exceeded

Xi iid random variables for inter-arrival times, with E (Xi ) = µ∑n(F )−1i=1 Xi ≤ F ≤∑n(F )

i=1 Xi

Wald’s equation: (E (n(F ))− 1)µ ≤ F ≤ E (n(F ))µ

limF→+∞E(n(F ))

F = 1µ

[email protected] Fault-tolerance for HPC 19/ 168

Page 23: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

MTBF with p processors (2/2)

Theorem: µp = µp for arbitrary distributions

With p processors:

n(F ) = number of platform failures until time F is exceeded

nq(F ) = number of those failures that strike processor q

nq(F ) + 1 = number of failures on processor q until time F isexceeded (except for processor with last-failure)

limF→+∞nq(F )F = 1

µ as above

limF→+∞n(F )F = 1

µpby definition

Hence µp = µp because n(F ) =

∑pq=1 nq(F )

[email protected] Fault-tolerance for HPC 20/ 168

Page 24: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A little digression for afficionados

Xi IID random variables for processor inter-arrival times

Assume Xi continuous, with E (Xi ) = µ

Yi random variables for platform inter-arrival times

Definition: µpdef= limn→+∞

∑ni E(Yi )n

Limits always exists (superposition of renewal processes)

Theorem: µp = µp

[email protected] Fault-tolerance for HPC 21/ 168

Page 25: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Values from the literature

MTBF of one processor: between 1 and 125 years

Shape parameters for Weibull: k = 0.5 or k = 0.7

Failure trace archive from INRIA(http://fta.inria.fr)

Computer Failure Data Repository from LANL(http://institutes.lanl.gov/data/fdata)

[email protected] Fault-tolerance for HPC 22/ 168

Page 26: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Does it matter?

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0h 3h 6h 9h 12h 15h 18h 21h 24h

Fa

ilure

Pro

ba

bili

ty

Time (hours)

Parallel machine (106 nodes)

Exp(1/100)Weibull(0.7, 1/100)Weibull(0.5, 1/100)

After infant mortality and before aging,instantaneous failure rate of computer platforms is almost constant

[email protected] Fault-tolerance for HPC 23/ 168

Page 27: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Summary for the road

MTBF key parameter and µp = µp ,

Exponential distribution OK for most purposes ,Assume failure independence while not (completely) true /

[email protected] Fault-tolerance for HPC 24/ 168

Page 28: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 25/ 168

Page 29: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 26/ 168

Page 30: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Maintaining redundant information

Goal

General Purpose Fault Tolerance Techniques: work despite theapplication behavior

Two adversaries: Failures & Application

Use automatically computed redundant information

At given instants: checkpointsAt any instant: replicationOr anything in between: checkpoint + message logging

[email protected] Fault-tolerance for HPC 27/ 168

Page 31: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Process checkpointing

Goal

Save the current state of the process

FT Protocols save a possible state of the parallel application

Techniques

User-level checkpointing

System-level checkpointing

Blocking call

Asynchronous call

[email protected] Fault-tolerance for HPC 28/ 168

Page 32: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

System-level checkpointing

Blocking checkpointing

Relatively intuitive: checkpoint(filename)

Cost: no process activity during whole checkpoint operation

Different implementations: OS syscall; dynamic library;compiler assisted

Create a serial file that can be loaded in a process image.Usually on same architecture / OS / software environment

Entirely transparent

Preemptive (often needed for library-level checkpointing)

Lack of portability

Large size of checkpoint (≈ memory footprint)

[email protected] Fault-tolerance for HPC 29/ 168

Page 33: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Storage

Remote reliable storage

Intuitive. I/O intensive. Disk usage.

Memory hierarchy

local memory

local disk (SSD, HDD)

remote disk

Scalable Checkpoint Restart Libraryhttp://scalablecr.sourceforge.net

Checkpoint is valid when finished on reliable storage

Distributed memory storage

In-memory checkpointing

Disk-less checkpointing

[email protected] Fault-tolerance for HPC 30/ 168

Page 34: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Coordinated checkpointing

orphan

orphan

missing

Definition (Missing Message)

A message is missing if in the current configuration, the sendersent it, while the receiver did not receive it

[email protected] Fault-tolerance for HPC 31/ 168

Page 35: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Coordinated checkpointing

orphan

orphan

missing

Definition (Orphan Message)

A message is orphan if in the current configuration, the receiverreceived it, while the sender did not send it

[email protected] Fault-tolerance for HPC 32/ 168

Page 36: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Coordinated checkpointing

Create a consistent view of the application (no orphan messages)

Messages belong to a checkpoint wave or another

All communication channels must be flushed (all2all)

[email protected] Fault-tolerance for HPC 33/ 168

Page 37: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Coordinated checkpointing

App. Message Marker Message

Silences the network during checkpoint

Missing messages recorded

[email protected] Fault-tolerance for HPC 34/ 168

Page 38: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 35/ 168

Page 39: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Periodic checkpointing

Checkpointing

the first chunk

Computing the first chunk

Processing the second chunkProcessing the first chunk

Time

Time spent checkpointing

Time spent working

Blocking model: while a checkpoint is taken, no computation canbe performed

[email protected] Fault-tolerance for HPC 36/ 168

Page 40: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Framework

Periodic checkpointing policy of period T

Independent and identically distributed failures

Applies to a single processor with MTBF µ = µindApplies to a platform with p processors and MTBF µ = µind

p

coordinated checkpointingtightly-coupled applicationprogress ⇔ all processors available

⇒ platform = single (powerful, unreliable) processor ,

Waste: fraction of time not spent for useful computations

[email protected] Fault-tolerance for HPC 37/ 168

Page 41: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste in fault-free execution

Checkpointing

the first chunk

Computing the first chunk

Processing the second chunkProcessing the first chunk

Time

Time spent checkpointing

Time spent working Timebase: application base time

TimeFF: with periodic checkpointsbut failure-free

TimeFF = Timebase + #checkpoints × C

#checkpoints =

⌈Timebase

T − C

⌉≈ Timebase

T − C(valid for large jobs)

Waste[FF ] =TimeFF −Timebase

TimeFF=

C

T

[email protected] Fault-tolerance for HPC 38/ 168

Page 42: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

Timebase: application base time

TimeFF: with periodic checkpoints but failure-free

Timefinal: expectation of time with failures

Timefinal = TimeFF + Nfaults × Tlost

Nfaults number of failures during executionTlost: average time lost per failure

Nfaults =Timefinal

µ

Tlost?

[email protected] Fault-tolerance for HPC 39/ 168

Page 43: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

Timebase: application base time

TimeFF: with periodic checkpoints but failure-free

Timefinal: expectation of time with failures

Timefinal = TimeFF + Nfaults × Tlost

Nfaults number of failures during executionTlost: average time lost per failure

Nfaults =Timefinal

µ

Tlost?

[email protected] Fault-tolerance for HPC 39/ 168

Page 44: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing Tlost

T

CT − CRDTlost

P1

P0

P3

P2

Time spent working Time spent checkpointing

Recovery timeDowntime Time

Tlost = D + R +T

2

Rationale⇒ Instants when periods begin and failures strike are independent⇒ Approximation used for all distribution laws⇒ Exact for Exponential and uniform distributions

[email protected] Fault-tolerance for HPC 40/ 168

Page 45: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

Timefinal = TimeFF + Nfaults × Tlost

Waste[fail ] =Timefinal −TimeFF

Timefinal=

1

µ

(D + R +

T

2

)

[email protected] Fault-tolerance for HPC 41/ 168

Page 46: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Total waste

TimeFF =TimeFinal (1-Waste[Fail]) TimeFinal ×Waste[Fail ]

TimeFinal

T -C C T -C C T -C C T -C C T -C C

Waste =Timefinal −Timebase

Timefinal

1−Waste = (1−Waste[FF ])(1−Waste[fail ])

Waste =C

T+

(1− C

T

)1

µ

(D + R +

T

2

)

[email protected] Fault-tolerance for HPC 42/ 168

Page 47: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste minimization

Waste =C

T+

(1− C

T

)1

µ

(D + R +

T

2

)Waste =

u

T+ v + wT

u = C(1− D + R

µ

)v =

D + R − C/2

µw =

1

Waste minimized for T =√

uw

T =√

2(µ− (D + R))C

[email protected] Fault-tolerance for HPC 43/ 168

Page 48: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Comparison with Young/Daly

TimeFF =TimeFinal (1-Waste[Fail]) TimeFinal ×Waste[Fail ]

TimeFinal

T -C C T -C C T -C C T -C C T -C C

(1−Waste[fail ]

)Timefinal = TimeFF

⇒ T =√

2(µ− (D + R))C

Daly: Timefinal =(1 + Waste[fail ]

)TimeFF

⇒ T =√

2(µ+ (D + R))C + C

Young: Timefinal =(1 + Waste[fail ]

)TimeFF and D = R = 0

⇒ T =√

2µC + C

[email protected] Fault-tolerance for HPC 44/ 168

Page 49: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Validity of the approach (1/3)

Technicalities

E (Nfaults) = Timefinalµ and E (Tlost) = D + R + T

2but expectation of product is not product of expectations(not independent RVs here)

Enforce C ≤ T to get Waste[FF ] ≤ 1

Enforce D + R ≤ µ and bound T to get Waste[fail ] ≤ 1but µ = µind

p too small for large p, regardless of µind

[email protected] Fault-tolerance for HPC 45/ 168

Page 50: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Validity of the approach (2/3)

Several failures within same period?

Waste[fail] accurate only when two or more faults do nottake place within same period

Cap period: T ≤ γµ, where γ is some tuning parameter

Poisson process of parameter θ = Tµ

Probability of having k ≥ 0 failures : P(X = k) = θk

k! e−θ

Probability of having two or more failures:π = P(X ≥ 2) = 1− (P(X = 0) +P(X = 1)) = 1− (1 +θ)e−θ

γ = 0.27 ⇒ π ≤ 0.03⇒ overlapping faults for only 3% of checkpointing segments

[email protected] Fault-tolerance for HPC 46/ 168

Page 51: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Validity of the approach (3/3)

Enforce T ≤ γµ, C ≤ γµ, and D + R ≤ γµ

Optimal period√

2(µ− (D + R))C may not belong toadmissible interval [C , γµ]

Waste is then minimized for one of the bounds of thisadmissible interval (by convexity)

[email protected] Fault-tolerance for HPC 47/ 168

Page 52: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Wrap up

Capping periods, and enforcing a lower bound on MTBF⇒ mandatory for mathematical rigor /

Not needed for practical purposes ,• actual job execution uses optimal value• account for multiple faults by re-executing work until success

Approach surprisingly robust ,

[email protected] Fault-tolerance for HPC 48/ 168

Page 53: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Lesson learnt for fail-stop failures

(Not so) Secret data• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs• Blue Waters: 2-3 node failures per day• Titan: a few failures per day• Tianhe 2: wouldn’t say

Topt =√

2µC ⇒ Waste[opt] ≈√

2C

µ

Petascale: C = 20 min µ = 24 hrs ⇒ Waste[opt] = 17%Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Waste[opt] = 53%Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Waste[opt] = 100%

[email protected] Fault-tolerance for HPC 49/ 168

Page 54: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Lesson learnt for fail-stop failures

(Not so) Secret data• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs• Blue Waters: 2-3 node failures per day• Titan: a few failures per day• Tianhe 2: wouldn’t say

Topt =√

2µC ⇒ Waste[opt] ≈√

2C

µ

Petascale: C = 20 min µ = 24 hrs ⇒ Waste[opt] = 17%Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Waste[opt] = 53%Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Waste[opt] = 100%

Exascale 6= Petascale ×1000Need more reliable components

Need to checkpoint faster

[email protected] Fault-tolerance for HPC 49/ 168

Page 55: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Lesson learnt for fail-stop failures

(Not so) Secret data• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs• Blue Waters: 2-3 node failures per day• Titan: a few failures per day• Tianhe 2: wouldn’t say

Topt =√

2µC ⇒ Waste[opt] ≈√

2C

µ

Petascale: C = 20 min µ = 24 hrs ⇒ Waste[opt] = 17%Scale by 10: C = 20 min µ = 2.4 hrs ⇒ Waste[opt] = 53%Scale by 100: C = 20 min µ = 0.24 hrs ⇒ Waste[opt] = 100%

Silent errors:

detection latency ⇒ additional problems

[email protected] Fault-tolerance for HPC 49/ 168

Page 56: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 50/ 168

Page 57: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential failure distribution

1 Expected execution time for a single chunk

2 Expected execution time for a sequential job

3 Expected execution time for a parallel job

[email protected] Fault-tolerance for HPC 51/ 168

Page 58: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 59: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

of success

Probability

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 60: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive ApproachTime needed

the work W and

to compute

checkpoint it

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 61: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

Probability of failure

(1− Psucc(W + C )) (E(Tlost(W + C )) + E(Trec) + E(T (W )))

+

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 62: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

Time elapsed

before failure

stroke

+

(1− Psucc(W + C )) (E(Tlost(W + C )) + E(Trec) + E(T (W )))

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 63: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

Time needed

to perform

downtime

and recovery

+

(1− Psucc(W + C )) (E(Tlost(W + C )) + E(Trec) + E(T (W )))

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 64: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Expected execution time for a single chunk

Compute the expected time E(T (W ,C ,D,R, λ)) to execute awork of duration W followed by a checkpoint of duration C .

Recursive Approach

Time needed

to compute W

anew

+

(1− Psucc(W + C )) (E(Tlost(W + C )) + E(Trec) + E(T (W )))

Psucc(W + C ) (W + C )

E(T (W )) =

[email protected] Fault-tolerance for HPC 52/ 168

Page 65: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computation of E(T (W ,C ,D,R , λ))

+

(1− Psucc(W + C )) (E(Tlost(W + C )) + E(Trec) + E(T (W )))

Psucc(W + C ) (W + C )

E(T (W )) =

Psuc(W + C ) = e−λ(W+C)

E(Tlost(W + C )) =∫∞

0xP(X = x |X <W + C )dx = 1

λ − W+Ceλ(W+C)−1

E(Trec) = e−λR(D+R)+(1−e−λR)(D+E(Tlost(R))+E(Trec))

E(T (W ,C ,D,R, λ)) = eλR(

1λ + D

)(eλ(W+C) − 1)

[email protected] Fault-tolerance for HPC 53/ 168

Page 66: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Best pattern for infinite jobs

Minimize Overhead(W ) =E(T (W ))

W− 1

Overhead(W ) =C

W+ λ

W

2+ o(λW )⇒Wopt = Θ(λ−1/2)

Wopt =

√2C

λ

Overhead(Wopt) =√

2Cλ+ o(λ1/2)

Periodic pattern (memoryless property)

[email protected] Fault-tolerance for HPC 54/ 168

Page 67: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Checkpointing a sequential job

E(T (W )) = eλR(

1λ + D

) (∑Ki=1 e

λ(Wi+C) − 1)

Optimal strategy uses same-size chunks (convexity)

K0 = λW1+L(−e−λC−1)

where L(z)eL(z) = z (Lambert function)

Optimal number of chunks K ∗ is max(1, bK0c) or dK0e

Eopt(T (W )) = K ∗(eλR

(1

λ+ D

))(eλ( W

K∗ +C)−1)

Can also use Daly’s second-order approximation

[email protected] Fault-tolerance for HPC 55/ 168

Page 68: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Checkpointing a parallel job

p processors ⇒ distribution Exp(λp), where λp = pλ

Use W (p), C (p), R(p) in Eopt(T (W )) for a distributionExp(λp = pλ)

Job types

Perfectly parallel jobs: W (p) = W /p.Generic parallel jobs: W (p) = W /p + δWNumerical kernels: W (p) = W /p + δW 2/3/

√p

Checkpoint overhead

Proportional overhead: C (p) = R(p) = δV /p = C/p(bandwidth of processor network card/link is I/O bottleneck)Constant overhead: C (p) = R(p) = δV = C(bandwidth to/from resilient storage system is I/O bottleneck)

[email protected] Fault-tolerance for HPC 56/ 168

Page 69: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Weibull failure distribution

No optimality result known

Heuristic: maximize expected work before next failure

Dynamic programming algorithms- Use a time quantum- Trim history of previous failures

[email protected] Fault-tolerance for HPC 57/ 168

Page 70: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 58/ 168

Page 71: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Hierarchical checkpointing

Clusters of processes

Coordinated checkpointingprotocol within clusters

Message logging protocolsbetween clusters

Only processors from failed groupneed to roll back

P0

P1

P2

P3

m1

m2

m3

m4

m5

/ Need to log inter-groups messages• Slowdowns failure-free execution• Increases checkpoint size/time

, Faster re-execution with logged messages

[email protected] Fault-tolerance for HPC 59/ 168

Page 72: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Which checkpointing protocol to use?

Coordinated checkpointing

, No risk of cascading rollbacks

, No need to log messages

/ All processors need to roll back

/ Rumor: May not scale to very large platforms

Hierarchical checkpointing

/ Need to log inter-groups messages• Slowdowns failure-free execution• Increases checkpoint size/time

, Only processors from failed group need to roll back

, Faster re-execution with logged messages

, Rumor: Should scale to very large platforms

[email protected] Fault-tolerance for HPC 60/ 168

Page 73: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Blocking vs. non-blocking

Checkpointing

the first chunk

Computing the first chunk

Processing the second chunkProcessing the first chunk

Time

Time spent checkpointing

Time spent working

Blocking model: checkpointing blocks all computations

[email protected] Fault-tolerance for HPC 61/ 168

Page 74: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Blocking vs. non-blocking

Checkpointing

the first chunk

Computing the first chunk

Processing the second chunk

Processing the first chunk

Time

Time spent checkpointing

Time spent working

Non-blocking model: checkpointing has no impact oncomputations (e.g., first copy state to RAM, then copy RAM todisk)

[email protected] Fault-tolerance for HPC 61/ 168

Page 75: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Blocking vs. non-blocking

Checkpointing

the first chunk

Computing the first chunk

Processing the first chunk

Time

Time spent working

Time spent checkpointing

Time spent working with slowdown

General model: checkpointing slows computations down: duringa checkpoint of duration C , the same amount of computation isdone as during a time αC without checkpointing (0 ≤ α ≤ 1)

[email protected] Fault-tolerance for HPC 61/ 168

Page 76: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste in fault-free execution

T

CT − C

P1

P0

P3

P2

Time spent working Time spent checkpointing Time spent working with slowdown

Time

Time elapsed since last checkpoint: T

Amount of computations executed: Work = (T − C ) + αC

Waste[FF ] = T−WorkT

[email protected] Fault-tolerance for HPC 62/ 168

Page 77: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

P0

P3

P2

P1

Time spent checkpointingTime spent working Time spent working with slowdown

Time

Failure can happen

1 During computation phase

2 During checkpointing phase

[email protected] Fault-tolerance for HPC 62/ 168

Page 78: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

P2

P1

P3

P0

Time spent working Time spent checkpointing Time spent working with slowdown

Time

[email protected] Fault-tolerance for HPC 62/ 168

Page 79: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

P2

P1

P3

P0

Time spent working Time spent checkpointing Time spent working with slowdown

Time

[email protected] Fault-tolerance for HPC 62/ 168

Page 80: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures

Tlost

P1

P3

P0

P2

Time spent working Time spent checkpointing Time spent working with slowdown

Time

Coordinated checkpointing protocol: when one processor is victimof a failure, all processors lose their work and must roll back to lastcheckpoint

[email protected] Fault-tolerance for HPC 62/ 168

Page 81: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures in computation phase

D

P0

P2

P1

P3

Time spent working Time spent checkpointing Time spent working with slowdown

Downtime Time

[email protected] Fault-tolerance for HPC 62/ 168

Page 82: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures in computation phase

R

P2

P1

P3

P0

Time spent checkpointingTime spent working Time spent working with slowdown

Recovery timeDowntime Time

Coordinated checkpointing protocol: all processors must recoverfrom last checkpoint

[email protected] Fault-tolerance for HPC 62/ 168

Page 83: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures in computation phase

C αC

P3

P2

P1

P0

Time spent working Time spent checkpointing Time spent working with slowdown

Re-executing slowed-down workRecovery timeDowntime Time

Redo the work destroyed by the failure, that was done in thecheckpointing phase before the computation phase

But no checkpoint is taken in parallel, hence this re-execution isfaster than the original computation

[email protected] Fault-tolerance for HPC 62/ 168

Page 84: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures in computation phase

T − C

P1

P0

P3

P2

Time spent working Time spent checkpointing Time spent working with slowdown

Re-executing slowed-down workRecovery timeDowntime Time

Re-execute the computation phase

[email protected] Fault-tolerance for HPC 62/ 168

Page 85: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Waste due to failures in computation phase

C

P3

P2

P1

P0

Time spent checkpointingTime spent working Time spent working with slowdown

Re-executing slowed-down workRecovery timeDowntime Time

Finally, the checkpointing phase is executed

[email protected] Fault-tolerance for HPC 62/ 168

Page 86: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Total waste

αC CT − CRDTlost

P0

P2

P1

P3

Time spent working Time spent checkpointing Time spent working with slowdown

Re-executing slowed-down workRecovery timeDowntime

T

Time

Waste[fail ] =1

µ

(D + R + αC +

T

2

)Optimal period Topt =

√2(1− α)(µ− (D + R + αC ))C

[email protected] Fault-tolerance for HPC 62/ 168

Page 87: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Hierarchical checkpointing

T

α(G−g+1)C

RD G .C

T−G .C−Tlost

TlostTlost

G2

G4

Gg

G1

G5

Re-executing slowed-down workRecovery timeDowntime

Time spent working Time spent working with slowdownTime spent checkpointing

Time

Processors partitioned into G groups

Each group includes q processors

Inside each group: coordinated checkpointing in time C (q)

Inter-group messages are logged

[email protected] Fault-tolerance for HPC 63/ 168

Page 88: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Accounting for message logging: Impact on work

/ Logging messages slows down execution:⇒ Work becomes λWork, where 0 < λ < 1Typical value: λ ≈ 0.98

, Re-execution after a failure is faster:⇒ Re-Exec becomes Re-Exec

ρ , where ρ ∈ [1..2]Typical value: ρ ≈ 1.5

Waste[FF ] =T − λWork

T

Waste[fail ] =1

µ

(D(q) + R(q) +

Re-Exec

ρ

)

[email protected] Fault-tolerance for HPC 64/ 168

Page 89: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Accounting for message logging: Impact on checkpoint size

Inter-groups messages logged continuously

Checkpoint size increases with amount of work executedbefore a checkpoint /C0(q): Checkpoint size of a group without message logging

C (q) = C0(q)(1 + βWork)⇔ β =C (q)− C0(q)

C0(q)Work

Work = λ(T − (1− α)GC (q))

C (q) =C0(q)(1 + βλT )

1 + GC0(q)βλ(1− α)

[email protected] Fault-tolerance for HPC 65/ 168

Page 90: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Three case studies

Coord-IOCoordinated approach: C = CMem = Mem

bio

where Mem is the memory footprint of the application

Hierarch-IOSeveral (large) groups, I/O-saturated⇒ groups checkpoint sequentially

C0(q) =CMem

G=

Mem

Gbio

Hierarch-PortVery large number of smaller groups, port-saturated⇒ some groups checkpoint in parallelGroups of qmin processors, where qminbport ≥ bio

[email protected] Fault-tolerance for HPC 66/ 168

Page 91: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Three applications

1 2D-stencil

2 Matrix product3 3D-Stencil

PlaneLine

[email protected] Fault-tolerance for HPC 67/ 168

Page 92: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing β for 2D-Stencil

C (q) = C0(q) + Logged Msg = C0(q)(1 + βWork)

Real n × n matrix and p × p gridWork = 9b2

sp, b = n/p

Each process sends a block to its 4 neighbors

Hierarch-IO:

1 group = 1 grid row

2 out of the 4 messages are logged

β = Logged MsgC0(q)Work = 2pb

pb2(9b2/sp)=

2sp9b3

Hierarch-Port:

β doubles

[email protected] Fault-tolerance for HPC 68/ 168

Page 93: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Four platforms: basic characteristics

Name Number of Number of Number of cores Memory I/O Network Bandwidth (bio) I/O Bandwidth (bport)cores processors ptotal per processor per processor Read Write Read/Write per processor

Titan 299,008 16,688 16 32GB 300GB/s 300GB/s 20GB/sK-Computer 705,024 88,128 8 16GB 150GB/s 96GB/s 20GB/s

Exascale-Slim 1,000,000,000 1,000,000 1,000 64GB 1TB/s 1TB/s 200GB/sExascale-Fat 1,000,000,000 100,000 10,000 640GB 1TB/s 1TB/s 400GB/s

Name Scenario G (C (q)) β for β for2D-Stencil Matrix-Product

Coord-IO 1 (2,048s) / /Titan Hierarch-IO 136 (15s) 0.0001098 0.0004280

Hierarch-Port 1,246 (1.6s) 0.0002196 0.0008561

Coord-IO 1 (14,688s) / /K-Computer Hierarch-IO 296 (50s) 0.0002858 0.001113

Hierarch-Port 17,626 (0.83s) 0.0005716 0.002227

Coord-IO 1 (64,000s) / /Exascale-Slim Hierarch-IO 1,000 (64s) 0.0002599 0.001013

Hierarch-Port 200,0000 (0.32s) 0.0005199 0.002026

Coord-IO 1 (64,000s) / /Exascale-Fat Hierarch-IO 316 (217s) 0.00008220 0.0003203

Hierarch-Port 33,3333 (1.92s) 0.00016440 0.0006407

[email protected] Fault-tolerance for HPC 69/ 168

Page 94: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Checkpoint time

Name C

K-Computer 14,688s

Exascale-Slim 64,000

Exascale-Fat 64,000

Large time to dump the memory

Using 1%C

Comparing with 0.1%C for exascale platforms

α = 0.3, λ = 0.98 and ρ = 1.5

[email protected] Fault-tolerance for HPC 70/ 168

Page 95: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Plotting formulas – Platform: Titan

Stencil 2D Matrix product Stencil 3D

Waste as a function of processor MTBF µind

[email protected] Fault-tolerance for HPC 71/ 168

Page 96: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Platform: K-Computer

Stencil 2D Matrix product Stencil 3D

Waste as a function of processor MTBF µind

[email protected] Fault-tolerance for HPC 72/ 168

Page 97: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Plotting formulas – Platform: Exascale

Waste = 1 for all scenarios!!!

[email protected] Fault-tolerance for HPC 73/ 168

Page 98: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Plotting formulas – Platform: Exascale

Waste = 1 for all scenarios!!!

Goodbye Exascale?!

[email protected] Fault-tolerance for HPC 73/ 168

Page 99: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Plotting formulas – Platform: Exascale with C = 1, 000

Stencil 2D Matrix product Stencil 3DE

xasc

ale-

Slim

Exa

scal

e-F

at

Waste as a function of processor MTBF µind , C = 1, 000

[email protected] Fault-tolerance for HPC 74/ 168

Page 100: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Plotting formulas – Platform: Exascale with C = 100

Stencil 2D Matrix product Stencil 3DE

xasc

ale-

Slim

Exa

scal

e-F

at

Waste as a function of processor MTBF µind , C = 100

[email protected] Fault-tolerance for HPC 75/ 168

Page 101: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Simulations – Platform: Titan

Stencil 2D Matrix product

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

0

20

40

60

80

100

120

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100M

akes

pan

(day

s)MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

0

10

20

30

40

50

60

70

80

90

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Makespan (days)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

10

20

30

40

50

60

70

80

90

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Makespan (days)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

Makespan (in days) as a function of processor MTBF µind

[email protected] Fault-tolerance for HPC 76/ 168

Page 102: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Simulations – Platform: Exascale with C = 1, 000

Stencil 2D Matrix product

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

0

20

40

60

80

100

120

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(d

ays)

MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

Exa

scal

e-S

lim

0

50

100

150

200

250

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

50

100

150

200

250

300

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

Exa

scal

e-F

at

0

50

100

150

200

250

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

50

100

150

200

250

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

Makespan (in days) as a function of processor MTBF µind , C = 1, 000

[email protected] Fault-tolerance for HPC 77/ 168

Page 103: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Simulations – Platform: Exascale with C = 100

Stencil 2D Matrix product

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

0

20

40

60

80

100

120

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(day

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

50

100

150

200

250

300

350

3 4 5 7.5 10 15 20 35 50 75 100

Mak

espan

(d

ays)

MTBF (years)

CoordinatedCoordinated BestPer

Hierarchical PlaneHierarchical Plane BestPer

Hierarchical LineHierarchical Line BestPer

Hierarchical PortHierarchical Port BestPer

Exa

scal

e-S

lim

0

20

40

60

80

100

120

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

0

20

40

60

80

100

120

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

Exa

scal

e-F

at

4

5

6

7

8

9

10

11

12

13

14

15

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

CoordinatedCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

4

5

6

7

8

9

10

11

12

13

14

15

1 2 3 4 5 7.5 10 15 20 35 50 75 100

Mak

esp

an

(d

ay

s)

MTBF (years)

Coordinated DalyCoordinated BestPer

HierarchicalHierarchical BestPer

Hierarchical PortHierarchical Port BestPer

Makespan (in days) as a function of processor MTBF µind , C = 100

[email protected] Fault-tolerance for HPC 78/ 168

Page 104: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 79/ 168

Page 105: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Motivation

Checkpoint transfer and storage⇒ critical issues of rollback/recovery protocols

Stable storage: high cost

Distributed in-memory storage:

Store checkpoints in local memory ⇒ no centralized storage, Much better scalabilityReplicate checkpoints ⇒ application survives single failure/ Still, risk of fatal failure in some (unlikely) scenarios

[email protected] Fault-tolerance for HPC 80/ 168

Page 106: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Double checkpoint algorithm (Kale et al., UIUC)

1

1

d q s

f

f

P

Local checkpointdone

Remote checkpointdone

Perioddone

Node p

Node p'

Platform nodes partitioned into pairs

Each node in a pair exchanges its checkpoint with its buddy

Each node saves two checkpoints:- one locally: storing its own data- one remotely: receiving and storing its buddy’s data

[email protected] Fault-tolerance for HPC 81/ 168

Page 107: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failures

1

1

d q s

f

f

P

Node p

Node p'

1

1

d q

f

f

tlost

Checkpoint ofp

Checkpoint ofp'

Risk Period

Node to replace p

q

f 1

tlostD R

After failure: downtime D and recovery from buddy node

Two checkpoint files lost, must be re-sent to faulty processor

Best trade-off between performance and risk?

[email protected] Fault-tolerance for HPC 82/ 168

Page 108: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failures

1

1

d q s

f

f

P

Node p

Node p'

1

1

d q

f

f

tlost

Checkpoint ofp

Checkpoint ofp'

Risk Period

Node to replace p

q

f 1

tlostD R

After failure: downtime D and recovery from buddy node

Two checkpoint files lost, must be re-sent to faulty processor

Application at risk until complete reception of both messages

Best trade-off between performance and risk?

[email protected] Fault-tolerance for HPC 82/ 168

Page 109: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 83/ 168

Page 110: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Framework

Predictor

Exact prediction dates (at least C seconds in advance)

Recall r : fraction of faults that are predicted

Precision p: fraction of fault predictions that are correct

Events

true positive: predicted faults

false positive: fault predictions that did not materialize asactual faults

false negative: unpredicted faults

[email protected] Fault-tolerance for HPC 84/ 168

Page 111: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Fault rates

µ: mean time between failures (MTBF)

µP mean time between predicted events (both true positiveand false positive)

µNP mean time between unpredicted faults (false negative).

µe : mean time between events (including three event types)

r =TrueP

TrueP + FalseNand p =

TruePTrueP + FalseP

(1− r)

µ=

1

µNPand

r

µ=

p

µP

1

µe=

1

µP+

1

µNP

[email protected] Fault-tolerance for HPC 85/ 168

Page 112: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Example

fault fault fault fault fault

pred. pred. pred. pred. pred. pred.

Time

F+P F+Ppred.

F+Ppred.

F+Pfault

t

Actual faults:

Predictor:

Overlap:

Predictor predicts six faults in time t

Five actual faults. One fault not predicted

µ = t5 , µP = t

6 , and µNP = t

Recall r = 45 (green arrows over red arrows)

Precision p = 46 (green arrows over blue arrows)

[email protected] Fault-tolerance for HPC 86/ 168

Page 113: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm

1 While no fault prediction is available:• checkpoints taken periodically with period T

2 When a fault is predicted at time t:• take a checkpoint ALAP (completion right at time t)• after the checkpoint, complete the execution of the period

[email protected] Fault-tolerance for HPC 87/ 168

Page 114: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing the waste

1 Fault-free execution: Waste[FF ] = CT

Checkpointing

the first chunk

Computing the first chunk

Processing the second chunkProcessing the first chunk

Time

Time spent checkpointing

Time spent working

2 Unpredicted faults: 1µNP

[D + R + T

2

]TimeT -C T -C Tlost T -C

fault

C C C D R C

[email protected] Fault-tolerance for HPC 88/ 168

Page 115: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing the waste

3 Predictions: 1µP

[p(C + D + R) + (1− p)C ]

TimeT -C Wreg

fault Predicted fault

T -Wreg -C T -C

C C Cp D R C C

with actual fault (true positive)

TimeT -C Wreg

Predicted fault

T -Wreg -C T -C T -C

C C Cp C C C

no actual fault (false negative)

Waste[fail ] =1

µ

[(1− r)

T

2+ D + R +

r

pC

]⇒ Topt ≈

√2µC

1− r

[email protected] Fault-tolerance for HPC 88/ 168

Page 116: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Refinements

Use different value Cp for proactive checkpoints

Avoid checkpointing too frequently for false negatives⇒ Only trust predictions with some fixed probability q⇒ Ignore predictions with probability 1− qConclusion: trust predictor always or never (q = 0 or q = 1)

Trust prediction depending upon position in current period⇒ Increase q when progressing⇒ Break-even point

Cp

p

[email protected] Fault-tolerance for HPC 89/ 168

Page 117: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

With prediction windows

TimeTR-C TR-C Tlost TR-C

fault(Regular mode)

Time

Regular mode Proactive mode

TR-C Wreg

I

TP-Cp TP-Cp TP-Cp TR-C-Wreg

(Prediction without failure)

Time

Regular mode Proactive mode

TR-C Wreg

I

TP-Cp TP-Cp TR-C-Wreg

fault(Prediction with failure)

C C C D R C

C C Cp Cp Cp Cp C

C C Cp Cp Cp D R C

Gets too complicated! /

[email protected] Fault-tolerance for HPC 90/ 168

Page 118: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 CheckpointingCoordinated checkpointingYoung/Daly’s approximationExponential distributionsAssessing protocols at scaleIn-memory checkpointingFailure PredictionReplication

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 91/ 168

Page 119: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Replication

Systematic replication: efficiency < 50%

Can replication+checkpointing be more efficient thancheckpointing alone?

Study by Ferreira et al. [SC’2011]: yes

[email protected] Fault-tolerance for HPC 92/ 168

Page 120: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Model by Ferreira et al. [SC’ 2011]

Parallel application comprising N processes

Platform with ptotal = 2N processors

Each process replicated → N replica-groups

When a replica is hit by a failure, it is not restarted

Application fails when both replicas in one replica-group havebeen hit by failures

[email protected] Fault-tolerance for HPC 93/ 168

Page 121: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Example

p1

p2

p1

p2

p1

p2

p1

p2

Time

Pair1

Pair2

Pair3

Pair4

[email protected] Fault-tolerance for HPC 94/ 168

Page 122: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

The birthday problem

Classical formulationWhat is the probability, in a set of m people, that two of themhave same birthday ?

Relevant formulationWhat is the average number of people required to find a pair withsame birthday?

Birthday(m) = 1 +∫ +∞

0 e−x(1 + x/m)m−1dx = 23 +

√πm2 +

√π

288m − 4135m + . . .

The analogy

Two people with same birthday≡

Two failures hitting same replica-group

[email protected] Fault-tolerance for HPC 95/ 168

Page 123: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure

[email protected] Fault-tolerance for HPC 96/ 168

Page 124: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure

[email protected] Fault-tolerance for HPC 96/ 168

Page 125: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure: can failed PE be hit?

[email protected] Fault-tolerance for HPC 96/ 168

Page 126: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over 2N − 1 PEsProbability that replica-group i is hit by failure: 1/(2N − 1)Probability that replica-group 6= i is hit by failure: 2/(2N − 1)Failure not uniformly distributed over replica-groups:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 127: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over 2N − 1 PEsProbability that replica-group i is hit by failure: 1/(2N − 1)Probability that replica-group 6= i is hit by failure: 2/(2N − 1)Failure not uniformly distributed over replica-groups:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 128: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over 2N − 1 PEsProbability that replica-group i is hit by failure: 1/(2N − 1)Probability that replica-group 6= i is hit by failure: 2/(2N − 1)Failure not uniformly distributed over replica-groups:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 129: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over 2N − 1 PEsProbability that replica-group i is hit by failure: 1/(2N − 1)Probability that replica-group 6= i is hit by failure: 2/(2N − 1)Failure not uniformly distributed over replica-groups:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 130: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure cannot hit failed PE

Failure uniformly distributed over 2N − 1 PEsProbability that replica-group i is hit by failure: 1/(2N − 1)Probability that replica-group 6= i is hit by failure: 2/(2N − 1)Failure not uniformly distributed over replica-groups:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 131: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure can hit failed PE

[email protected] Fault-tolerance for HPC 96/ 168

Page 132: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure can hit failed PE

Suppose failure hits replica-group iIf failure hits failed PE: application survivesIf failure hits running PE: application killedNot all failures hitting the same replica-group are equal:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 133: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure can hit failed PE

Suppose failure hits replica-group iIf failure hits failed PE: application survivesIf failure hits running PE: application killedNot all failures hitting the same replica-group are equal:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 134: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Differences with birthday problem

1 2

. . .

i

. . .

N

2N processors but N processes, each replicated twice

Uniform distribution of failures

First failure: each replica-group has probability 1/N to be hit

Second failure can hit failed PE

Suppose failure hits replica-group iIf failure hits failed PE: application survivesIf failure hits running PE: application killedNot all failures hitting the same replica-group are equal:this is not the birthday problem

[email protected] Fault-tolerance for HPC 96/ 168

Page 135: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Correct analogy

� � � � . . . �1 2 3 4 . . . n

⇑• • • • • • • • • • • . . .

N = nrg bins, red and blue balls

Mean Number of Failures to Interruption (bring down application)MNFTI = expected number of balls to throw

until one bin gets one ball of each color

[email protected] Fault-tolerance for HPC 97/ 168

Page 136: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Number of failures to bring down application

MNFTI ah Count each failure hitting any of the initialprocessors, including those already hit by a failure

MNFTI rp Count failures that hit running processors, and thuseffectively kill replicas.

MNFTI ah = 1 + MNFTI rp

[email protected] Fault-tolerance for HPC 98/ 168

Page 137: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Number of failures to bring down application

MNFTI ah Count each failure hitting any of the initialprocessors, including those already hit by a failure

MNFTI rp Count failures that hit running processors, and thuseffectively kill replicas.

MNFTI ah = 1 + MNFTI rp

[email protected] Fault-tolerance for HPC 98/ 168

Page 138: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential failures

Theorem MNFTI ah = E(NFTI ah|0) where

E(NFTI ah|nf ) =

{2 if nf = nrg ,

2nrg2nrg−nf +

2nrg−2nf2nrg−nf E

(NFTI ah|nf + 1

)otherwise.

E(NFTI ah|nf ): expectation of number of failures to killapplication, knowing that• application is still running• failures have already hit nf different replica-groups

[email protected] Fault-tolerance for HPC 99/ 168

Page 139: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential failures (cont’d)

Proof

E(NFTI ah |nrg

)=

1

2× 1 +

1

2×(

1 + E(NFTI ah |nrg

)).

E(NFTI ah|nf

)=

2nrg − 2nf2nrg

×(

1 + E(NFTI ah|nf + 1

))+

2nf2nrg

×(

1

2× 1 +

1

2

(1 + E

(NFTI ah|nf

))).

MTTI = systemMTBF (2nrg)× MNFTI ah

[email protected] Fault-tolerance for HPC 100/ 168

Page 140: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential failures (cont’d)

Proof

E(NFTI ah |nrg

)=

1

2× 1 +

1

2×(

1 + E(NFTI ah |nrg

)).

E(NFTI ah|nf

)=

2nrg − 2nf2nrg

×(

1 + E(NFTI ah|nf + 1

))+

2nf2nrg

×(

1

2× 1 +

1

2

(1 + E

(NFTI ah|nf

))).

MTTI = systemMTBF (2nrg)× MNFTI ah

[email protected] Fault-tolerance for HPC 100/ 168

Page 141: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential failures (cont’d)

Proof

E(NFTI ah |nrg

)=

1

2× 1 +

1

2×(

1 + E(NFTI ah |nrg

)).

E(NFTI ah|nf

)=

2nrg − 2nf2nrg

×(

1 + E(NFTI ah|nf + 1

))+

2nf2nrg

×(

1

2× 1 +

1

2

(1 + E

(NFTI ah|nf

))).

MTTI = systemMTBF (2nrg)× MNFTI ah

[email protected] Fault-tolerance for HPC 100/ 168

Page 142: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Comparison

2N processors, no replication

ThroughputStd = 2N(1−Waste) = 2N(

1−√

2Cµ2N

)N replica-pairs

ThroughputRep = N(

1−√

2Cµrep

)µrep = MNFTI × µ2N = MNFTI × µ

2N

Platform with 2N = 220 processors ⇒ MNFTI = 1284.4µ = 10 years ⇒ better if C shorter than 6 minutes

[email protected] Fault-tolerance for HPC 101/ 168

Page 143: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure distribution

221218 219216 217 220215

number of processors

0

50

100

150

200

aver

age

mak

espa

n(i

nda

ys)

BestPeriod-g = 2BestPeriod-g = 1Daly-g = 2Daly-g = 1

(a) Exponential

221218 219216 217 220215

number of processors

0

50

100

150

200

aver

age

mak

espa

n(i

nda

ys)

BestPeriod-g = 2BestPeriod-g = 1Daly-g = 2Daly-g = 1

(b) Weibull, k = 0.7

Crossover point for replication when µind = 125 years

[email protected] Fault-tolerance for HPC 102/ 168

Page 144: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Weibull distribution with k = 0.7

Dashed line: Ferreira et al. Solid line: Correct analogy

101 100

Processor MTBF (in years)

0

200000

400000

600000

800000

1000000N

umb

erof

pro

cess

ors

C = 300

C = 2400C = 1200C = 900C = 600

C = 150

Study by Ferrreira et al. favors replication

Replication beneficial if small µ + large C + big ptotal

[email protected] Fault-tolerance for HPC 103/ 168

Page 145: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 104/ 168

Page 146: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Generic vs. Application specific approaches

Generic solutions

Universal

Very low prerequisite

One size fits all (pros and cons)

Application specific solutions

Requires (deep) study of the application/algorihtm

Tailored solution: higher efficiency

[email protected] Fault-tolerance for HPC 105/ 168

Page 147: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Backward Recovery vs. Forward Recovery

Backward Recovery

Rollback / Backward Recovery: returns in the history torecover from failures.

Spends time to re-execute computations

Rebuilds states already reached

Typical: checkpointing techniques

[email protected] Fault-tolerance for HPC 106/ 168

Page 148: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Backward Recovery vs. Forward Recovery

Forward Recovery

Forward Recovery: proceeds without returning

Pays additional costs during (failure-free) computation tomaintain consistent redundancy

Or pays additional computations when failures happen

General technique: Replication

Application-Specific techniques: Iterative algorithms withfixed point convergence, ABFT, ...

[email protected] Fault-tolerance for HPC 106/ 168

Page 149: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 107/ 168

Page 150: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerance (ABFT)

Principle

Limited to Linear Algebra computations

Matrices are extended with rows and/or columns of checksums

M =

5 1 7 134 3 5 124 6 9 19

[email protected] Fault-tolerance for HPC 108/ 168

Page 151: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerance (ABFT)

Principle

Limited to Linear Algebra computations

Matrices are extended with rows and/or columns of checksums

M =

5 1 7 134 3 5 124 6 9 19

[email protected] Fault-tolerance for HPC 108/ 168

Page 152: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerance (ABFT)

Principle

Limited to Linear Algebra computations

Matrices are extended with rows and/or columns of checksums

M =

5 1 7 134 3 5 124 6 9 19

[email protected] Fault-tolerance for HPC 108/ 168

Page 153: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and fail-stop errors

Missing checksum data

M =

5 1 7 134 3 54 6 9 19

Simple recomputation: 4+3+5 = 12.

Missing original data

M =

5 1 7 134 5 124 6 9 19

Simple recomputation: 12-(4+5) = 3.

[email protected] Fault-tolerance for HPC 109/ 168

Page 154: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and fail-stop errors

Missing checksum data

M =

5 1 7 134 3 54 6 9 19

Simple recomputation: 4+3+5 = 12.

Missing original data

M =

5 1 7 134 5 124 6 9 19

Simple recomputation: 12-(4+5) = 3.

[email protected] Fault-tolerance for HPC 109/ 168

Page 155: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and fail-stop errors

Missing checksum data

M =

5 1 7 134 3 54 6 9 19

Simple recomputation: 4+3+5 = 12.

Missing original data

M =

5 1 7 134 5 124 6 9 19

Simple recomputation: 12-(4+5) = 3.

[email protected] Fault-tolerance for HPC 109/ 168

Page 156: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and fail-stop errors

Missing checksum data

M =

5 1 7 134 3 54 6 9 19

Simple recomputation: 4+3+5 = 12.

Missing original data

M =

5 1 7 134 5 124 6 9 19

Simple recomputation: 12-(4+5) = 3.

[email protected] Fault-tolerance for HPC 109/ 168

Page 157: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =

5 1 7 134 3 5 134 6 9 19

Error detection: 4 + 3 + 5 6= 13Limitations

The following matrix would have successfully passed thesanity check:

M =

5 1 7 135 3 5 134 6 9 19

Can detect one error and correct zero.

[email protected] Fault-tolerance for HPC 110/ 168

Page 158: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =

5 1 7 134 3 5 134 6 9 19

Error detection: 4 + 3 + 5 6= 13Limitations

The following matrix would have successfully passed thesanity check:

M =

5 1 7 135 3 5 134 6 9 19

Can detect one error and correct zero.

[email protected] Fault-tolerance for HPC 110/ 168

Page 159: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

One row and one column of checksums

M =

5 1 7 134 3 5 114 6 9 19

13 9 21 43

Checksum recomputation to look for silent data corruptions:

5 + 1 + 7 = 134 + 3 + 5 = 124 + 6 + 9 = 19

13 + 10 + 21 = 44

Checksums do not match !

[email protected] Fault-tolerance for HPC 111/ 168

Page 160: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

One row and one column of checksums

M =

5 1 7 134 3 5 114 6 9 19

13 9 21 43

Checksum recomputation to look for silent data corruptions:

5 + 1 + 7 = 134 + 3 + 5 = 124 + 6 + 9 = 19

13 + 10 + 21 = 44

Checksums do not match !

[email protected] Fault-tolerance for HPC 111/ 168

Page 161: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =

5 1 7 134 3 5 114 6 9 19

13 9 21 43

5 + 1 + 7 = 134 + 3 + 5 = 124 + 6 + 9 = 19

13 + 10 + 21 = 44

Both checksums are affected, giving out location of errorWe solve:

4 + x + 5 = 11 1 + x + 6 = 9

Recomputing the checksums we find that:5 + 1 + 7 = 134 + 2 + 5 = 114 + 6 + 9 = 19

13 + 9 + 21 = 43

Checksums match ,

Can detect two errors and correct one

[email protected] Fault-tolerance for HPC 112/ 168

Page 162: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =

5 1 7 134 3 5 114 6 9 19

13 9 21 43

5 + 1 + 7 = 134 + 3 + 5 = 124 + 6 + 9 = 19

13 + 10 + 21 = 44

Both checksums are affected, giving out location of errorWe solve:

4 + x + 5 = 11 1 + x + 6 = 9

Recomputing the checksums we find that:5 + 1 + 7 = 134 + 2 + 5 = 114 + 6 + 9 = 19

13 + 9 + 21 = 43

Checksums match ,

Can detect two errors and correct one

[email protected] Fault-tolerance for HPC 112/ 168

Page 163: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT and silent data corruption

M =

5 1 7 134 3 5 114 6 9 19

13 9 21 43

5 + 1 + 7 = 134 + 3 + 5 = 124 + 6 + 9 = 19

13 + 10 + 21 = 44

Both checksums are affected, giving out location of errorWe solve:

4 + x + 5 = 11 1 + x + 6 = 9

Recomputing the checksums we find that:5 + 1 + 7 = 134 + 2 + 5 = 114 + 6 + 9 = 19

13 + 9 + 21 = 43

Checksums match ,

Can detect two errors and correct one

[email protected] Fault-tolerance for HPC 112/ 168

Page 164: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT for Matrix-Matrix multiplication

Aim: Computation of C = A× B

Let eT = [1, 1, · · · , 1], we define

Ac :=

(A

eTA

), B r :=

(B Be

), C f :=

(C Ce

eTC eTCe

).

Where Ac is the column checksum matrix, B r is the row checksummatrix and C f is the full checksum matrix.

Ac × B r =

(A

eTA

)×(B Be

)=

(AB ABe

eTAB eTABe

)=

(C Ce

eTC eTCe

)= C f

[email protected] Fault-tolerance for HPC 113/ 168

Page 165: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT for Matrix-Matrix multiplication

Aim: Computation of C = A× B

Let eT = [1, 1, · · · , 1], we define

Ac :=

(A

eTA

), B r :=

(B Be

), C f :=

(C Ce

eTC eTCe

).

Where Ac is the column checksum matrix, B r is the row checksummatrix and C f is the full checksum matrix.

Ac × B r =

(A

eTA

)×(B Be

)=

(AB ABe

eTAB eTABe

)=

(C Ce

eTC eTCe

)= C f

[email protected] Fault-tolerance for HPC 113/ 168

Page 166: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

In practice... things are more complicated!

When do errors strike? Are all data always protected?

Computations are approximate because of floating-pointrounding

Error detection and error correction capabilities depend on thenumber of checksum rows and columns

[email protected] Fault-tolerance for HPC 114/ 168

Page 167: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 115/ 168

Page 168: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

A A'

U

L

U

Solve A · x = b (hard)

Transform A into a LU factorization

Solve L · y = b, then U · x = y

[email protected] Fault-tolerance for HPC 116/ 168

Page 169: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

A A'

U

L

U

GETF2: factorize acolumn block

TRSM - Update row block

GEMM: Updatethe trailing

matrix

Solve A · x = b (hard)

Transform A into a LU factorization

Solve L · y = b, then U · x = y

[email protected] Fault-tolerance for HPC 116/ 168

Page 170: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

L

U U

L

U

GETF2: factorize acolumn block

TRSM - Update row block

GEMM: Updatethe trailing

matrix

L

U

Solve A · x = b (hard)

Transform A into a LU factorization

Solve L · y = b, then U · x = y

[email protected] Fault-tolerance for HPC 116/ 168

Page 171: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Block LU factorization

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 21 30 21 30 21 30 21 3

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 21 30 21 30 21 30 21 3

Failure of rank 2

2D Block Cyclic Distribution (here 2× 3)

A single failure ⇒ many data lost

[email protected] Fault-tolerance for HPC 116/ 168

Page 172: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

M

Pmb

nbQ

N< 2N/Q + nb

+++

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

45

0 21 3

Checksum:invertible operation on the data of the row / column

Checksum blocks are doubled, to allow recovery when dataand checksum are lost together

[email protected] Fault-tolerance for HPC 117/ 168

Page 173: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

M

P mb

nbQ

NN/Q

+++

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

0 2 41 3 5

0 2 41 3 5

0 21 3

Checksum:invertible operation on the data of the row / column

Checksum replication can be avoided by dedicating computingresources to checksum storage

[email protected] Fault-tolerance for HPC 117/ 168

Page 174: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

GETF2 GEMM

TRSM

Idea of ABFT: applying the operation on data and checksumpreserves the checksum properties

[email protected] Fault-tolerance for HPC 117/ 168

Page 175: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

+

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

For the part of the data that is not updated this way, thechecksum must be re-calculated

[email protected] Fault-tolerance for HPC 117/ 168

Page 176: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

GETF2 GEMM

TRSM

To avoid slowing down all processors and panel operation,group checksum updates every Q block columns

[email protected] Fault-tolerance for HPC 117/ 168

Page 177: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

GETF2 GEMM

TRSM

To avoid slowing down all processors and panel operation,group checksum updates every Q block columns

[email protected] Fault-tolerance for HPC 117/ 168

Page 178: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

GETF2 GEMM

TRSM

To avoid slowing down all processors and panel operation,group checksum updates every Q block columns

[email protected] Fault-tolerance for HPC 117/ 168

Page 179: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

+

Then, update the missing coverage.Keep checkpoint block column to cover failures during thattime

[email protected] Fault-tolerance for HPC 117/ 168

Page 180: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

M

P mb

nbQ

NN/Q

+++

“Checkpoint”

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

1 3 50 2 41 3 50 2 41 3 5

BABAB

Checkpoint the next set of Q-Panels to be able to return to itin case of failures

[email protected] Fault-tolerance for HPC 117/ 168

Page 181: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

---

In case of failure, conclude the operation, then

Missing Data = Checksum - Sum(Existing Data) s

[email protected] Fault-tolerance for HPC 118/ 168

Page 182: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Algorithm Based Fault Tolerant LU decomposition

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 A1 3 B

A AB B

+++

In case of failure, conclude the operation, then

Missing Checksum = Sum(Existing Data)s

[email protected] Fault-tolerance for HPC 118/ 168

Page 183: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

M

P mb

nbQ

NN/Q

+++

“Checkpoint”

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

1 3 50 2 41 3 50 2 41 3 5

BABAB

Failures may happen while inside a Q−panel factorization

[email protected] Fault-tolerance for HPC 119/ 168

Page 184: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

M

P mb

nbQ

N

--

-

-

“Checkpoint”

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

1 3 50 2 41 3 50 2 41 3 5

BABAB

Valid Checksum Information allows to recover most of themissing data, but not all: the checksum for the currentQ−panels are not valid

[email protected] Fault-tolerance for HPC 119/ 168

Page 185: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

M

P mb

nbQ

N

“Checkpoint”

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

1 3 50 2 41 3 50 2 41 3 5

BABAB

We use the checkpoint to restore the Q−panel in its initialstate

[email protected] Fault-tolerance for HPC 119/ 168

Page 186: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Failure inside a Q−panel factorization

M

P mb

nbQ

N

“Checkpoint”

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

AB

A AB B

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

0 2 41 3 50 2 41 3 50 2 41 3 50 2 41 3 5

1 3 50 2 41 3 50 2 41 3 5

BABAB

and re-execute that part of the factorization, without applyingoutside of the scope

[email protected] Fault-tolerance for HPC 119/ 168

Page 187: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: implementation

MPI Implementation

PBLAS-based: need to provide “Fault-Aware” version of thelibrary

Cannot enter recovery state at any point in time: need tocomplete ongoing operations despite failures

Recovery starts by defining the position of each process in thefactorization and bring them all in a consistent state(checksum property holds)

Need to test the return code of each and every MPI-relatedcall

[email protected] Fault-tolerance for HPC 120/ 168

Page 188: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: performance

As supercomputers grow ever larger in scale, the Mean Time to Failure becomes shorter and shorter, making the complete and successful execution of complex applications more and more difficult. FT-LA delivers a new approach, utilizing Algorithm-Based Fault Tolerance (ABFT), to help factorization algorithms survive fail-stop failures. The FT-LA software package extends ScaLAPACK with ABFT routines, and in sharp contrast with legacy checkpoint-based approaches, ABFT does not incur I/O overhead, and promises a much more scalable protection scheme.

ABFT THE IDEA

Cost of ABFT comes only from extra flops (to update checksums) and extra storage

Cost decreases with machine scale (divided by Q when using PxQ processes)

PROTECTION

Matrix protected by block row checksum

The algorithm updates both the trailing matrix AND the checksums

RECOVERY

Missing blocks reconstructed by inverting the checksum operation

FUNCTIONALITY COVERAGE

Linear Systems of Equations

Least Squares

Cholesky, LU

QR (with protection of the upper and lower factors)

FEATURES

WORK IN PROGRESS

Covering four precisions: double complex, single complex, double real, single real (ZCDS)

Deploys on MPI FT draft (ULFM), or with “Checkpoint-on-failure”

Allows toleration of permanent crashes

Hessenber Reduction, Soft (silent) Errors

Process grid: p x qF: simultaneous failures tolerated

Protection against 2 faults on 192x192 processes => 1% overhead

Usually F << q; Overheads in F/q

Protection cost is inversely proportional to machine scale!

Computation

Memory

Flops for the checksum update

Matrix is extended with 2F columns every q columns

FIND OUT MORE AT http://icl.cs.utk.edu/ft-la

0

7

14

21

28

35

6x6; 20k12x12; 40k

24x24; 80k48x48; 160k

96x96; 320k192x192; 640k 0

10

20

30

40

50

Rela

tive

Over

head

(%)

Perfo

rman

ce (T

Flop

/s)

#Processors (PxQ grid); Matrix size (N)

ScaLAPACK PDGETRFFT-PDGETRF (no error)

FT-PDGETRF (w/1 recovery)Overhead: FT-PDGETRF (no error)

Overhead: FT-PDGETRF (w/1 recovery)

U

L

C’

GETF2 GEMM

TRSM

A’

L

0 4 6 0 4 6

1 3 5 7 1 3 5 7

0 4 6 0 4 6

1 3 5 7 1 3 5 7

0 4 6 0 4 6

1 3 5 7 1 3 5 7

0 4 6 0 4 6

1 3 5 7 1 3 5 7

0 4 6 0 4 6

1 3 5 7 1 3 5 7

C

PERFORMANCE ON KRAKEN

Open MPI with ULFM; Kraken supercomputer.

[email protected] Fault-tolerance for HPC 121/ 168

Page 189: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT QR decomposition: performance0:22 A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra

0

2

4

6

8

10

12

6x6; 20k 12x12; 40k 24x24; 80k 48x48; 160k 0

10

20

30

40

50

60

Perf

orm

an

ce (

TF

lop

/s)

Rela

tive O

verh

ead

(%

)

#Processors (PxQ grid); Matrix size (N)

ScaLAPACK PDGEQRFFT-PDGEQRF (no errors)FT-PDGEQRF (one error)

Overhead: FT-PDGEQRF (no errors)Overhead: FT-PDGEQRF (one error)

Fig. 11. Weak scalability of FT-QR: run time overhead on Kraken when failures strike

local snapshots have to be used along with re-factorization to recover the lost data andrestore the matrix state. This is referred to as the ”failure within Q panels.”

Figure 10 shows the overhead from these two cases for the LU factorization, alongwith the no-error overhead as a reference. In the “border” case, the failure is simulatedto strike when the 96th panel (which is a multiple of grid columns, 6, 12, · · · , 48) has justfinished. In the “non-border” case, failure occurs during the (Q + 2)th panel factoriza-tion. For example, when Q = 12, the failure is injected when the trailing update for thestep with panel (1301,1301) finishes. From the result in Figure 10, the recovery pro-cedure in both cases adds a small overhead that also decreases when scaled to largeproblem size and process grid. For largest setups, only 2-3 percent of the executiontime is spent recovering from a failure.

7.4. Extension to Other factorizationThe algorithm proposed in this work can be applied to a wide range of dense matrixfactorizations other than LU. As a demonstration we have extended the fault toler-ance functions to the ScaLAPACK QR factorization in double precision. Since QR usesa block algorithm similar to LU (and also similar to Cholesky), the integration of faulttolerance functions is mostly straightforward. Figure 11 shows the performance of QRwith and without recovery. The overhead drops as the problem and grid size increase,although it remains higher than that of LU for the same problem size. This is expected:as the QR algorithm has a higher complexity than LU ( 4

3N3 v.s. 23N3), the ABFT ap-

proach incurs more extra computation when updating checksums. Similar to the LUresult, recovery adds an extra 2% overhead. At size 160,000 a failure incurs about5.7% penalty to be recovered. This overhead becomes lower, the larger the problem orprocessor grid size considered.

ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: January 2013.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

Open MPI with ULFM; Kraken [email protected] Fault-tolerance for HPC 122/ 168

Page 190: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT LU decomposition: implementation

?

ABFTRecovery

Checkpoint on Failure - MPI Implementation

FT-MPI / MPI-Next FT: not easily available on largemachines

Checkpoint on Failure = workaround

[email protected] Fault-tolerance for HPC 123/ 168

Page 191: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT QR decomposition: performance

0

0.5

1

1.5

2

2.5

3

3.5

20k 40k 60k 80k 100k

Pe

rfo

rma

nc

e (

Tfl

op

s/s

)

Matrix Size (N)

ScaLAPACK QRCoF-QR (w/o failure)CoF-QR (w/1 failure)

Fig. 2. ABFT QR andone CoF recovery onKraken (Lustre).

0

100

200

300

400

500

600

700

800

10k 20k 30k 40k 50k

Pe

rfo

rma

nc

e (

Gfl

op

s/s

)

Matrix Size (N)

ScaLAPACK QRCoF-QR (w/o failure)CoF-QR (w/1 failure)

Fig. 3. ABFT QR andone CoF recovery onDancer (local SSD).

0

1

2

3

4

5

6

7

20k 25k 30k 35k 40k 45k 50k

Ap

pli

ca

tio

n T

ime

Sh

are

(%

)

Matrix Size (N)

Load CheckpointDump Checkpoint

ABFT Recovery

Fig. 4. Time breakdownof one CoF recovery onDancer (local SSD).

5.3 Checkpoint-on-Failure QR Performance

Supercomputer Performance: Figure 2 presents the performance on the Krakensupercomputer. The process grid is 24⇥24 and the block size is 100. The CoF-QR(no failure) presents the performance of the CoF QR implementation, in a fault-free execution; it is noteworthy, that when there are no failures, the performanceis exactly identical to the performance of the unmodified FT-QR implementa-tion. The CoF-QR (with failure) curves present the performance when a failureis injected after the first step of the PDLARFB kernel. The performance of thenon-fault tolerant ScaLAPACK QR is also presented for reference.

Without failures, the performance overhead compared to the regular ScaLA-PACK is caused by the extra computation to maintain the checksums inherentto the ABFT algorithm [12]; this extra computation is unchanged between CoF-QR and FT-QR. Only on runs where a failure happened do the CoF protocolsundergoe the supplementary overhead of storing and reloading checkpoints. How-ever, the performance of the CoF-QR remains very close to the no-failure case.For instance, at matrix size N=100,000, CoF-QR still achieves 2.86 Tflop/s afterrecovering from a failure, which is 90% of the performance of the non-fault toler-ant ScaLAPACK QR. This demonstrates that the CoF protocol enables e�cient,practical recovery schemes on supercomputers.

Impact of Local Checkpoint Storage: Figure 3 presents the performance of theCoF-QR implementation on the Dancer cluster with a 8 ⇥ 16 process grid. Al-though a smaller test platform, the Dancer cluster features local storage on nodesand a variety of performance analysis tools unavailable on Kraken. As expected(see [12]), the ABFT method has a higher relative cost on this smaller machine.Compared to the Kraken platform, the relative cost of CoF failure recovery issmaller on Dancer. The CoF protocol incurs disk accesses to store and loadcheckpoints when a failure hits, hence the recovery overhead depends on I/Operformance. By breaking down the relative cost of each recovery step in CoF,Figure 4 shows that checkpoint saving and loading only take a small percentageof the total run-time, thanks to the availability of solid state disks on every node.Since checkpoint reloading immediately follows checkpointing, the OS cache sat-isfy most disk access, resulting in high I/O performance. For matrices larger than

Checkpoint on Failure - MPI Performance

Open MPI; Kraken supercomputer;

[email protected] Fault-tolerance for HPC 124/ 168

Page 192: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 125/ 168

Page 193: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Fault Tolerance Techniques

General Techniques

Replication

Rollback Recovery

Coordinated CheckpointingUncoordinated Checkpointing &Message LoggingHierarchical CheckpointingMultilevel Checkpointing

Application-Specific Techniques

Algorithm Based Fault Tolerance(ABFT)

Iterative Convergence

Approximated Computation

[email protected] Fault-tolerance for HPC 126/ 168

Page 194: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Application

Typical Application

f o r ( an insanenumber ) {/∗ E x t r a c t data from∗ s i m u l a t i o n , f i l l up∗ m a t r i x ∗/

sim2mat ( ) ;

/∗ F a c t o r i z e matr ix ,∗ S o l v e ∗/

d g e q r f ( ) ;d s o l v e ( ) ;

/∗ Update s i m u l a t i o n∗ w i t h r e s u l t v e c t o r ∗/

vec2s im ( ) ;}

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

LIBRARY Phase GENERAL Phase

Characteristics

, Large part of (total)computation spent infactorization/solve

Between LA operations:

/ use resulting vector / matrixwith operations that do notpreserve the checksums onthe data

/ modify data not covered byABFT algorithms

[email protected] Fault-tolerance for HPC 127/ 168

Page 195: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Application

Typical Application

f o r ( an insanenumber ) {/∗ E x t r a c t data from∗ s i m u l a t i o n , f i l l up∗ m a t r i x ∗/

sim2mat ( ) ;

/∗ F a c t o r i z e matr ix ,∗ S o l v e ∗/

d g e q r f ( ) ;d s o l v e ( ) ;

/∗ Update s i m u l a t i o n∗ w i t h r e s u l t v e c t o r ∗/

vec2s im ( ) ;}

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

LIBRARY Phase GENERAL Phase

Characteristics

, Large part of (total)computation spent infactorization/solve

Between LA operations:

/ use resulting vector / matrixwith operations that do notpreserve the checksums onthe data

/ modify data not covered byABFT algorithms

Goodbye ABFT?!

[email protected] Fault-tolerance for HPC 127/ 168

Page 196: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Application

Typical Application

f o r ( an insanenumber ) {/∗ E x t r a c t data from∗ s i m u l a t i o n , f i l l up∗ m a t r i x ∗/

sim2mat ( ) ;

/∗ F a c t o r i z e matr ix ,∗ S o l v e ∗/

d g e q r f ( ) ;d s o l v e ( ) ;

/∗ Update s i m u l a t i o n∗ w i t h r e s u l t v e c t o r ∗/

vec2s im ( ) ;}

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

LIBRARY Phase GENERAL Phase

Characteristics

, Large part of (total)computation spent infactorization/solve

Between LA operations:

/ use resulting vector / matrixwith operations that do notpreserve the checksums onthe data

/ modify data not covered byABFT algorithms

Problem Statement

How to use fault tolerant operations(∗) within anon-fault tolerant(∗∗) application?(∗∗∗)

(*) ABFT, or other application-specific FT(**) Or within an application that does not have the same kind of FT

(***) And keep the application globally fault tolerant...

[email protected] Fault-tolerance for HPC 127/ 168

Page 197: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: no failure

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

PeriodicCheckpoint

SplitForced

Checkpoints

[email protected] Fault-tolerance for HPC 128/ 168

Page 198: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during Library phase

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

Failure(during LIBRARY)

Rollback(partial)

Recovery

ABFTRecovery

[email protected] Fault-tolerance for HPC 129/ 168

Page 199: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT&PeriodicCkpt

ABFT&PeriodicCkpt: failure during General phase

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

Failure(during GENERAL)

Rollback(fulll)

Recovery

[email protected] Fault-tolerance for HPC 130/ 168

Page 200: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT&PeriodicCkpt: Optimizations

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

ABFT&PERIO

DICCKPT

ABFT&PeriodicCkpt: Optimizations

If the duration of the General phase is too small: don’t addcheckpoints

If the duration of the Library phase is too small: don’t doABFT recovery, remain in General mode

this assumes a performance model for the library call

[email protected] Fault-tolerance for HPC 131/ 168

Page 201: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

ABFT&PeriodicCkpt: Optimizations

Process 0

Process 1

Process 2

Application

Application

Application

Library

Library

Library

ABFT&PERIO

DICCKPT

GENERALCheckpoint Interval

ABFT&PeriodicCkpt: Optimizations

If the duration of the General phase is too small: don’t addcheckpoints

If the duration of the Library phase is too small: don’t doABFT recovery, remain in General mode

this assumes a performance model for the library call

[email protected] Fault-tolerance for HPC 131/ 168

Page 202: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Conclusion

Algorithm-Based Fault Tolerance

Application-specific solution for linear algebra kernels

Low-overhead forward-recovery solution

Used alone or in conjunction with backward-recovery solutions

Limitation

Need 2f linearly independent checksum vectors to detect andcorrect f errors

Numerical instability ⇒ implementations limited to f = 1 orf = 2 /

[email protected] Fault-tolerance for HPC 132/ 168

Page 203: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errorsCoupling checkpointing and verificationApplication-specific methods

5 Conclusion

[email protected] Fault-tolerance for HPC 133/ 168

Page 204: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Definitions

Instantaneous error detection ⇒ fail-stop failures,e.g. resource crash

Silent errors (data corruption) ⇒ detection latency

Silent error detected only when the corrupt data is activated

Includes some software faults, some hardware errors (softerrors in L1 cache), double bit flip

Cannot always be corrected by ECC memory

[email protected] Fault-tolerance for HPC 134/ 168

Page 205: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Quotes

Soft Error: An unintended change in the state of an electronicdevice that alters the information that it stores withoutdestroying its functionality, e.g. a bit flip caused by acosmic-ray-induced neutron. (Hengartner et al., 2008)

SDC occurs when incorrect data is delivered by a computingsystem to the user without any error being logged (CristianConstantinescu, AMD)

Silent errors are the black swan of errors (Marc Snir)

[email protected] Fault-tolerance for HPC 135/ 168

Page 206: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Should we be afraid? (courtesy Al Geist)

[email protected] Fault-tolerance for HPC 136/ 168

Page 207: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Probability distributions for silent errors

?Theorem: µp =

µind

pfor arbitrary distributions

[email protected] Fault-tolerance for HPC 137/ 168

Page 208: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Probability distributions for silent errors

?Theorem: µp =

µind

pfor arbitrary distributions

[email protected] Fault-tolerance for HPC 137/ 168

Page 209: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errorsCoupling checkpointing and verificationApplication-specific methods

5 Conclusion

[email protected] Fault-tolerance for HPC 138/ 168

Page 210: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

General-purpose approach

TimeXe Xd

fault Detection

Error and detection latency

Last checkpoint may have saved an already corrupted state

Saving k checkpoints (Lu, Zheng and Chien):

¬ Critical failure when all live checkpoints are invalid­ Which checkpoint to roll back to?

[email protected] Fault-tolerance for HPC 139/ 168

Page 211: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

General-purpose approach

TimeXe Xd

fault Detection

Error and detection latency

Last checkpoint may have saved an already corrupted state

Saving k checkpoints (Lu, Zheng and Chien):

¬ Critical failure when all live checkpoints are invalidAssume unlimited storage resources

­ Which checkpoint to roll back to?Assume verification mechanism

[email protected] Fault-tolerance for HPC 139/ 168

Page 212: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Optimal period?

TimeXe Xd

fault Detection

Error and detection latency

Xe inter arrival time between errors; mean time µe

Xd error detection time; mean time µd

Assume Xd and Xe independent

[email protected] Fault-tolerance for HPC 140/ 168

Page 213: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Arbitrary distribution

Wasteff =C

T

Wastefail =T2 + R + µd

µe

Only valid if T2 + R + µd � µe

Theorem

Best period is Topt ≈√

2µeC

Independent of Xd

[email protected] Fault-tolerance for HPC 141/ 168

Page 214: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Exponential distribution

Theorem

At the end of the day,

E(T (w)) = eλeR (µe + µd) (eλe(w+C) − 1)

Optimal period independent of µd

Good approximation is T =√

2µeC (Young’s formula)

[email protected] Fault-tolerance for HPC 142/ 168

Page 215: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

The case with limited resources

Assume that we can only save the last k checkpoints

Definition (Critical failure)

Error detected when all checkpoints contain corrupted data.Happens with probability Prisk during whole execution.

Prisk decreases when T increases (when Xd is fixed).Hence, Prisk ≤ ε leads to a lower bound Tmin on T

Can derive an analytical form for Prisk when Xd follows anExponential law. Use it as a good(?) approximation for arbitrarylaws

[email protected] Fault-tolerance for HPC 143/ 168

Page 216: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Limitation of the model

It is not clear how to detect when the error has occurred(hence to identify the last valid checkpoint) / / /

Need a verification mechanism to check the correctness of thecheckpoints. This has an additional cost!

[email protected] Fault-tolerance for HPC 144/ 168

Page 217: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Coupling checkpointing and verification

Verification mechanism of cost V

Silent errors detected only when verification is executed

Approach agnostic of the nature of verification mechanism(checksum, error correcting code, coherence tests, etc)

Fully general-purpose(application-specific information, if available, can always beused to decrease V )

[email protected] Fault-tolerance for HPC 145/ 168

Page 218: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

On-line ABFT scheme for PCG

Zizhong Chen, PPoPP’13

Iterate PCGCost: SpMV, preconditionersolve, 5 linear kernels

Detect soft errors by checkingorthogonality and residual

Verification every d iterationsCost: scalar product+SpMV

Checkpoint every c iterationsCost: three vectors, or twovectors + SpMV at recovery

Experimental method tochoose c and d

[email protected] Fault-tolerance for HPC 146/ 168

Page 219: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Base pattern (and revisiting Young/Daly)

TimeW W

faultDetection

V C V C V C

Fail-stop (classical) Silent errors

Pattern T = W + C S = W + V + C

Waste[FF ] CT

V+CS

Waste[fail ] 1µ(D + R + W

2 ) 1µ(R + W + V )

Optimal Topt =√

2Cµ Sopt =√

(C + V )µ

Waste[opt]√

2Cµ 2

√C+Vµ

[email protected] Fault-tolerance for HPC 147/ 168

Page 220: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

With p = 1 checkpoint and q = 3 verifications

Timew w w w w w

faultDetection

V C V V V C V V V C

Base Pattern p = 1, q = 1 Waste[opt] = 2√

C+Vµ

New Pattern p = 1, q = 3 Waste[opt] = 2√

4(C+3V )6µ

[email protected] Fault-tolerance for HPC 148/ 168

Page 221: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

BalancedAlgorithm

Time2w 2w w w 2w 2w

V C V V C V V V C

p checkpoints and q verifications, p ≤ q

p = 2, q = 5, S = 2C + 5V + W

W = 10w , six chunks of size w or 2w

May store invalid checkpoint (error during third chunk)

After successful verification in fourth chunk, precedingcheckpoint is valid

Keep only two checkpoints in memory and avoid any fatalfailure

[email protected] Fault-tolerance for HPC 149/ 168

Page 222: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

BalancedAlgorithm

Time2w 2w w w 2w 2w

V C V V C V V V C

¬ ( proba 2w/W ) Tlost = R + 2w + V

­ ( proba 2w/W ) Tlost = R + 4w + 2V

® ( proba w/W ) Tlost = 2R + 6w + C + 4V

¯ ( proba w/W ) Tlost = R + w + 2V

° ( proba 2w/W ) Tlost = R + 3w + 2V

± ( proba 2w/W ) Tlost = R + 5w + 3V

Waste[opt] ≈ 2

√7(2C + 5V )

20µ

[email protected] Fault-tolerance for HPC 150/ 168

Page 223: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Analysis

Key parameters

off failure-free overhead per pattern

fre fraction of work that is re-executed

Wasteff = offS , where off = pC + qV and S = off + pqw � µ

Wastefail = Tlostµ , where Tlost = freS + β

β: constant, linear combination of C , V and R

Waste ≈ offS + freS

µ ⇒ Sopt ≈√

offfre· µ

Waste[opt] = 2

√offfreµ

+ o(

√1

µ)

[email protected] Fault-tolerance for HPC 151/ 168

Page 224: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing fre when p = 1

Timeα1W α2W α3W

V C V V V C

Theorem

The minimal value of fre(1, q) is obtained for same-size chunks

fre(1, q) =∑q

i=1

(αi∑i

j=1 αj

)Minimal when αi = 1/q

In that case, fre(1, q) = q+12q

[email protected] Fault-tolerance for HPC 152/ 168

Page 225: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Computing fre when p ≥ 1

Timeα1W α2W α3W

V C C C V C

Theorem

fre(p, q) ≥ p+q2pq , bound is matched by BalancedAlgorithm.

Assess gain due to the p − 1 intermediate checkpoints

f(1)

re − f(p)

re =∑p

i=1

(αi∑i−1

j=1 αj

)Maximal when αi = 1/p for all i

In that case, f(1)

re − f(p)

re = (p − 1)/p2

Now best with equipartition of verifications too

In that case, f(1)

re = q+12q and f

(p)re = q+1

2q −p−12p = q+p

2pq

[email protected] Fault-tolerance for HPC 153/ 168

Page 226: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Choosing optimal pattern

Let V = γC , where 0 < γ ≤ 1

offfre = p+q2pq (pC + qV ) = C × p+q

2

(1q + γ

p

)Given γ, minimize p+q

2

(1q + γ

p

)with 1 ≤ p ≤ q, and p, q

taking integer values

Let p = λ× q. Then λopt =√γ =

√VC

[email protected] Fault-tolerance for HPC 154/ 168

Page 227: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Summary

Time2w 2w w w 2w 2w

V C V V C V V V C

BalancedAlgorithm optimal when C ,R,V � µ

Keep only 2 checkpoints in memory/storage

Closed-form formula for Waste[opt]

Given C and V , choose optimal pattern

Gain of up to 20% over base pattern

[email protected] Fault-tolerance for HPC 155/ 168

Page 228: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errorsCoupling checkpointing and verificationApplication-specific methods

5 Conclusion

[email protected] Fault-tolerance for HPC 156/ 168

Page 229: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Literature

ABFT: dense matrices / fail-stop, extended to sparse / silent.Limited to one error detection and/or correction in practice

Asynchronous (chaotic) iterative methods (old work)

Partial differential equations: use lower-order scheme asverification mechanism (detection only, Benson, Schmit andSchreiber)

FT-GMRES: inner-outer iterations (Hoemmen and Heroux)

PCG: orthogonalization check every k iterations,re-orthogonalization if problem detected (Sao and Vuduc)

. . . Many others

[email protected] Fault-tolerance for HPC 157/ 168

Page 230: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Dynamic programming for linear chains of tasks

{T1,T2, . . . ,Tn} : linear chain of n tasks

Each task Ti fully parametrized:

wi computational weightCi ,Ri ,Vi : checkpoint, recovery, verification

Error rates:

λF rate of fail-stop errorsλS rate of silent errors

[email protected] Fault-tolerance for HPC 158/ 168

Page 231: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

VC-only

1 i j

TimerecC (i , k − 1) TC (i + 1, j)

VC VC

min0≤k<n

TimerecC (n, k)

TimerecC (j , k) = mink≤i<j

{TimerecC (i , k − 1) + T SFC (i + 1, j)}

T SFC (i , j) = pFi ,j

(Tlost i,j + Ri−1 + T SF

C (i , j))

+(

1− pFi ,j

)(∑j`=i w` + Vj + pSi ,j

(Ri−1 + T SF

C (i , j))

+(

1− pSi ,j

)Cj

)

[email protected] Fault-tolerance for HPC 159/ 168

Page 232: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Young/Daly

TimeFF = TimeFinal (1−WasteFail ) TimeFinal ×WasteFail

TimeFinal

T − VC VC T − VC VC T − VC VC T − VC VC T − VC VC

T − VC VC T − VC VC T − VC VC T − VC VC T − VC VC

Waste = Wasteef + Wastefail

Waste =V + C

T+ λF (s)(R +

T

2) + λS(s)(R + T )

Topt =

√2(V + C )

λF (s) + 2λS(s)

[email protected] Fault-tolerance for HPC 160/ 168

Page 233: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Extensions

VC-only and VC+V

Different speeds with DVFS, different error rates

Different execution modes

Optimize for time or for energy consumption

Current research

Use verification to correct some errors (ABFT)⇒ Similar analysis, smaller error rate but higher verificationcost

Use partial detectors (recall < 1)

Address both fail-stop and silent errors simultaneously(multi-level patterns)

[email protected] Fault-tolerance for HPC 161/ 168

Page 234: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scalewithout bias ,

[email protected] Fault-tolerance for HPC 162/ 168

Page 235: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scalewithout bias ,

[email protected] Fault-tolerance for HPC 162/ 168

Page 236: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scalewithout bias ,

[email protected] Fault-tolerance for HPC 162/ 168

Page 237: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scalewithout bias ,

[email protected] Fault-tolerance for HPC 162/ 168

Page 238: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Outline

1 Introduction

2 Checkpointing

3 Forward-recovery techniques

4 Silent errors

5 Conclusion

[email protected] Fault-tolerance for HPC 163/ 168

Page 239: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Conclusion

Multiple approaches to Fault Tolerance

Application-Specific Fault Tolerance will always provide morebenefits:

Checkpoint Size Reduction (when needed)Portability (can run on different hardware, differentdeployment, etc..)Diversity of use (can be used to restart the execution andchange parameters in the middle)

[email protected] Fault-tolerance for HPC 164/ 168

Page 240: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Conclusion

Multiple approaches to Fault Tolerance

General Purpose Fault Tolerance is a required feature of theplatforms

Not every computer scientist needs to learn how to writefault-tolerant applicationsNot all parallel applications can be ported to a fault-tolerantversion

Faults are a feature of the platform. Why should it be the roleof the programmers to handle them?

[email protected] Fault-tolerance for HPC 164/ 168

Page 241: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Conclusion

Application-Specific Fault Tolerance

Fault Tolerance is introducing redundancy in the application

replication of computationmaintaining invariant in the data

Requirements of a more Fault-friendly programmingenvironment

MPI-Next evolutionOther programming environments?

[email protected] Fault-tolerance for HPC 165/ 168

Page 242: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Conclusion

General Purpose Fault Tolerance

Software/hardware techniques to reduce checkpoint, recovery,migration times and to improve failure prediction

Multi-criteria scheduling problemexecution time/energy/reliabilityadd replicationbest resource usage (performance trade-offs)

Need combine all these approaches!

Several challenging algorithmic/scheduling problems ,

[email protected] Fault-tolerance for HPC 166/ 168

Page 243: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Bibliography

Exascale• Toward Exascale Resilience, Cappello F. et al., IJHPCA 23, 4 (2009)• The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al.,IJHPCA 25, 1 (2011)

ABFT Algorithm-based fault tolerance applied to high performance computing,Bosilca G. et al., JPDC 69, 4 (2009)

Coordinated Checkpointing Distributed snapshots: determining global states ofdistributed systems, Chandy K.M., Lamport L., ACM Trans. Comput. Syst. 3, 1(1985)

Message Logging A survey of rollback-recovery protocols in message-passing systems,Elnozahy E.N. et al., ACM Comput. Surveys 34, 3 (2002)

Replication Evaluating the viability of process replication reliability for exascalesystems, Ferreira K. et al, SC’2011

Models• Checkpointing strategies for parallel jobs, Bougeret M. et al., SC’2011

• Unified model for assessing checkpointing protocols at extreme-scale, Bosilca G et

al., CCPE 26(17), pp 2772-2791 (2014)

[email protected] Fault-tolerance for HPC 167/ 168

Page 244: An overview of fault-tolerant techniques for HPCgraal.ens-lyon.fr/~yrobert/europar16.pdf · Systems 2011 K computer 2019 Difference Today & 2019 System peak 10.5 Pflop/s 1 Eflop/s

Intro Checkpointing Forward-recovery Silent Errors Conclusion

Bibliography

New Monograph, Springer Verlag 2015

[email protected] Fault-tolerance for HPC 168/ 168