Intro Checkpointing Forward-recovery Silent Errors Conclusion
An overview of fault-tolerant techniques for HPC
Yves Robert
Laboratoire LIP, ENS Lyon
Institut Universitaire de France
University of Tennessee Knoxville
http://graal.ens-lyon.fr/~yrobert/europar16.pdf
http://www.netlib.org/lapack/lawnspdf/lawn289.pdf
EuroPar’2016 Tutorial
[email protected] Fault-tolerance for HPC 1/ 168
Thanks
INRIA & ENS Lyon
Anne Benoit & Frederic Vivien
Guillaume Aupy, Aurelien Cavelan, Hongyang Sun
Univ. Tennessee Knoxville
George Bosilca, Aurelien Bouteiller & Thomas Herault (joint tutorial at SC’14, SC’15, SC’16 and elsewhere)
Jack Dongarra
Elsewhere
Franck Cappello & Marc Snir, Argonne National Lab.
Henri Casanova, Univ. Hawai‘i
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Outline
1 Introduction
   Large-scale computing platforms
   Faults and failures
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Exascale platforms (courtesy Jack Dongarra)
Potential system architecture with a cap of $200M and 20 MW

                            Systems 2011        2019                   Difference
                            (K computer)                               today & 2019
System peak                 10.5 Pflop/s        1 Eflop/s              O(100)
Power                       12.7 MW             ~20 MW
System memory               1.6 PB              32-64 PB               O(10)
Node performance            128 GF              1.2 or 15 TF           O(10)-O(100)
Node memory BW              64 GB/s             2-4 TB/s               O(100)
Node concurrency            8                   O(1k) or 10k           O(100)-O(1000)
Total node interconnect BW  20 GB/s             200-400 GB/s           O(10)
System size (nodes)         88,124              O(100,000) or O(1M)    O(10)-O(100)
Total concurrency           705,024             O(billion)             O(1,000)
MTTI                        days                O(1 day)               O(10)
Exascale platforms (courtesy C. Engelmann & S. Scott)
Exascale platforms
Hierarchical
• 10^5 or 10^6 nodes
• Each node equipped with 10^4 or 10^3 cores

Failure-prone

MTBF of one node:                  1 year    10 years    120 years
MTBF of a platform of 10^6 nodes:  30 sec    5 min       1 h
More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
Exascale ≠ Petascale × 1000
Even for today’s platforms (courtesy F. Cappello)
Outline
1 Introduction
   Large-scale computing platforms
   Faults and failures
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Error sources (courtesy Franck Cappello)
• Analysis of error and failure logs
• In 2005 (Ph.D. of Charng-da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”
• In 2007 (Garth Gibson, ICPP Keynote):
• In 2008 (Oliner and J. Stearley, DSN Conf.): 50% hardware
Conclusion: Both Hardware and Software failures have to be considered
Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors, and others
Hardware errors: disks, processors, memory, network
A few definitions
Many types of faults: software error, hardware malfunction, memory corruption
Many possible behaviors: silent, transient, unrecoverable
Restrict to faults that lead to application failures
This includes all hardware faults, and some software ones
Will use terms fault and failure interchangeably
Silent errors (SDC, Silent Data Corruption) addressed later in the tutorial
Failure distributions: (1) Exponential
[Plot: failure probability vs. time (years) for a sequential machine with Exp(1/100) failures]
Exp(λ): Exponential distribution law of parameter λ:

Pdf: f(t) = λe^(−λt) for t ≥ 0

Cdf: F(t) = 1 − e^(−λt)

Mean = 1/λ
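As a sanity check, the mean and cdf above can be verified by sampling (a minimal sketch, not part of the slides; λ = 1/100 as in the plot):

```python
import math
import random

lam = 1 / 100                  # failure rate, as in Exp(1/100)
rng = random.Random(42)

# Draw inter-arrival times from Exp(lam); expovariate takes the rate
samples = [rng.expovariate(lam) for _ in range(200_000)]

# Mean should approach 1/lam = 100 years
mean = sum(samples) / len(samples)
print(f"empirical mean = {mean:.1f}, 1/lambda = {1 / lam:.1f}")

# Empirical cdf at t = 100 should approach F(100) = 1 - e^(-1)
t = 100
cdf_emp = sum(s <= t for s in samples) / len(samples)
print(f"empirical F({t}) = {cdf_emp:.3f}, exact = {1 - math.exp(-lam * t):.3f}")
```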
X: random variable for Exp(λ) failure inter-arrival times

P(X ≤ t) = 1 − e^(−λt) (by definition)

Memoryless property: P(X ≥ t + s | X ≥ s) = P(X ≥ t):
at any instant, the time to the next failure does not depend upon the time elapsed since the last failure

Mean Time Between Failures (MTBF): µ = E(X) = 1/λ
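The memoryless property can likewise be observed empirically (a sketch; the values of s and t are illustrative):

```python
import random

lam = 1 / 100
rng = random.Random(0)
samples = [rng.expovariate(lam) for _ in range(500_000)]

s, t = 50.0, 80.0

# P(X >= t + s | X >= s): among the times that exceed s,
# the fraction that also exceed t + s
past_s = [x for x in samples if x >= s]
cond = sum(x >= t + s for x in past_s) / len(past_s)

# P(X >= t): unconditional survival probability
uncond = sum(x >= t for x in samples) / len(samples)

print(f"P(X >= t+s | X >= s) = {cond:.3f}, P(X >= t) = {uncond:.3f}")
```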
Failure distributions: (2) Weibull
[Plot: failure probability vs. time (years) for a sequential machine, for Exp(1/100), Weibull(0.7, 1/100), and Weibull(0.5, 1/100)]
Weibull(k, λ): Weibull distribution law of shape parameter k and scale parameter λ:

Pdf: f(t) = kλ(λt)^(k−1) e^(−(λt)^k) for t ≥ 0

Cdf: F(t) = 1 − e^(−(λt)^k)

Mean = (1/λ) Γ(1 + 1/k)
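The Γ-function formula for the mean can be checked by sampling (a sketch; note that Python's `random.weibullvariate` takes the scale 1/λ first, then the shape k):

```python
import math
import random

k, lam = 0.7, 1 / 100          # shape 0.7, scale 1/lam = 100
rng = random.Random(1)

# weibullvariate(alpha, beta): cdf is 1 - exp(-(t/alpha)^beta)
samples = [rng.weibullvariate(1 / lam, k) for _ in range(400_000)]

mean = sum(samples) / len(samples)
exact = (1 / lam) * math.gamma(1 + 1 / k)
print(f"empirical mean = {mean:.1f}, (1/lambda) Gamma(1 + 1/k) = {exact:.1f}")
```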
X: random variable for Weibull(k, λ) failure inter-arrival times

If k < 1: failure rate decreases with time
("infant mortality": defective items fail early)

If k = 1: Weibull(1, λ) = Exp(λ): constant failure rate
Failure distributions: with several processors
Processor (or node): any entity subject to failures
⇒ approach agnostic to granularity
If the MTBF is µ with one processor, what is its value with p processors?
Intuition
[Timeline: failures of three processors p1, p2, p3 over a duration t]

If three processors have around 20 faults each during a time t (µ = t/20)...

[Timeline: the same failures aggregated on a single platform timeline]

...during the same time, the platform has around 60 faults (µp = t/60)
Platform MTBF
Rebooting only the faulty processor

Platform failure distribution
⇒ superposition of p IID processor distributions
⇒ IID only for Exponential

Define µp by lim_{F→+∞} n(F)/F = 1/µp,
where n(F) = number of platform failures until time F is exceeded

Theorem: µp = µ/p for arbitrary distributions
MTBF with p processors (1/2)
Theorem: µp = µ/p for arbitrary distributions

With one processor:

n(F) = number of failures until time F is exceeded

Xi: IID random variables for inter-arrival times, with E(Xi) = µ

∑_{i=1}^{n(F)−1} Xi ≤ F ≤ ∑_{i=1}^{n(F)} Xi

Wald's equation: (E(n(F)) − 1)µ ≤ F ≤ E(n(F))µ

lim_{F→+∞} E(n(F))/F = 1/µ
MTBF with p processors (2/2)
Theorem: µp = µ/p for arbitrary distributions

With p processors:

n(F) = number of platform failures until time F is exceeded

nq(F) = number of those failures that strike processor q

nq(F) + 1 = number of failures on processor q until time F is exceeded (except for the processor struck by the last failure)

lim_{F→+∞} nq(F)/F = 1/µ as above

lim_{F→+∞} n(F)/F = 1/µp by definition

Hence µp = µ/p, because n(F) = ∑_{q=1}^{p} nq(F)
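The theorem is easy to illustrate by simulation; using Weibull inter-arrival times stresses that it holds beyond the exponential case (a sketch with illustrative parameters):

```python
import math
import random

def platform_failures(p, horizon, draw):
    """Count all failures across p processors up to time `horizon`,
    each processor having IID inter-arrival times given by draw()."""
    count = 0
    for _ in range(p):
        t = draw()
        while t <= horizon:
            count += 1
            t += draw()
    return count

rng = random.Random(7)
k, scale = 0.7, 10.0                        # Weibull shape and scale
mu = scale * math.gamma(1 + 1 / k)          # per-processor MTBF

p, horizon = 64, 50_000.0
n = platform_failures(p, horizon, lambda: rng.weibullvariate(scale, k))

mu_p = horizon / n                          # empirical platform MTBF
print(f"empirical platform MTBF = {mu_p:.4f}, mu/p = {mu / p:.4f}")
```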
A little digression for aficionados

Xi: IID random variables for processor inter-arrival times

Assume Xi continuous, with E(Xi) = µ

Yi: random variables for platform inter-arrival times

Definition: µp := lim_{n→+∞} (∑_{i=1}^{n} E(Yi)) / n

The limit always exists (superposition of renewal processes)

Theorem: µp = µ/p
Values from the literature
MTBF of one processor: between 1 and 125 years
Shape parameters for Weibull: k = 0.5 or k = 0.7
Failure trace archive from INRIA: http://fta.inria.fr

Computer Failure Data Repository from LANL: http://institutes.lanl.gov/data/fdata
Does it matter?
[Plot: failure probability vs. time over 24 hours for a parallel machine with 10^6 nodes, for Exp(1/100), Weibull(0.7, 1/100), and Weibull(0.5, 1/100) node failures]
After infant mortality and before aging,instantaneous failure rate of computer platforms is almost constant
Summary for the road
MTBF is the key parameter, and µp = µ/p

Exponential distribution OK for most purposes

Assume failure independence, while not (completely) true
Outline
1 Introduction
2 Checkpointing
   Coordinated checkpointing
   Young/Daly's approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure prediction
   Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Maintaining redundant information
Goal
General Purpose Fault Tolerance Techniques: work despite the application behavior
Two adversaries: Failures & Application
Use automatically computed redundant information
At given instants: checkpoints
At any instant: replication
Or anything in between: checkpoint + message logging
Process checkpointing
Goal
Save the current state of the process
FT Protocols save a possible state of the parallel application
Techniques
User-level checkpointing
System-level checkpointing
Blocking call
Asynchronous call
System-level checkpointing
Blocking checkpointing
Relatively intuitive: checkpoint(filename)
Cost: no process activity during whole checkpoint operation
Different implementations: OS syscall; dynamic library; compiler-assisted

Creates a serial file that can be loaded into a process image.
Usually on the same architecture / OS / software environment
Entirely transparent
Preemptive (often needed for library-level checkpointing)
Lack of portability
Large size of checkpoint (≈ memory footprint)
Storage
Remote reliable storage
Intuitive. I/O intensive. Disk usage.
Memory hierarchy
local memory
local disk (SSD, HDD)
remote disk
Scalable Checkpoint / Restart Library: http://scalablecr.sourceforge.net
Checkpoint is valid when finished on reliable storage
Distributed memory storage
In-memory checkpointing
Disk-less checkpointing
Coordinated checkpointing
Definition (Missing Message)
A message is missing if, in the current configuration, the sender sent it, while the receiver did not receive it
Coordinated checkpointing
Definition (Orphan Message)
A message is orphan if, in the current configuration, the receiver received it, while the sender did not send it
Coordinated checkpointing
Create a consistent view of the application (no orphan messages)
Messages belong to one checkpoint wave or another
All communication channels must be flushed (all2all)
Coordinated checkpointing
[Diagram: application messages and marker messages during a checkpoint wave]
Silences the network during checkpoint
Missing messages recorded
Outline
1 Introduction
2 Checkpointing
   Coordinated checkpointing
   Young/Daly's approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure prediction
   Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Periodic checkpointing
[Timeline: computing then checkpointing the first chunk, then processing the second chunk; time spent working vs. time spent checkpointing]
Blocking model: while a checkpoint is taken, no computation canbe performed
Framework
Periodic checkpointing policy of period T
Independent and identically distributed failures
Applies to a single processor with MTBF µ = µind

Applies to a platform with p processors and MTBF µ = µind/p
• coordinated checkpointing
• tightly-coupled application
• progress ⇔ all processors available
⇒ platform = single (powerful, unreliable) processor
Waste: fraction of time not spent for useful computations
Waste in fault-free execution
[Timeline: periodic pattern of T − C work followed by a checkpoint of duration C]

Timebase: application base time
TimeFF: with periodic checkpoints but failure-free

TimeFF = Timebase + #checkpoints × C

#checkpoints = ⌈Timebase / (T − C)⌉ ≈ Timebase / (T − C)   (valid for large jobs)

Waste[FF] = (TimeFF − Timebase) / TimeFF = C / T
Waste due to failures
Timebase: application base time
TimeFF: with periodic checkpoints but failure-free
Timefinal: expectation of time with failures
Timefinal = TimeFF + Nfaults × Tlost

Nfaults: number of failures during execution
Tlost: average time lost per failure

Nfaults = Timefinal / µ

Tlost?
Computing Tlost
[Timeline: a failure strikes after Tlost in a period of length T; downtime D and recovery R follow]

Tlost = D + R + T/2

Rationale:
⇒ instants when periods begin and failures strike are independent
⇒ approximation used for all distribution laws
⇒ exact for Exponential and uniform distributions
Waste due to failures
Timefinal = TimeFF + Nfaults × Tlost

Waste[fail] = (Timefinal − TimeFF) / Timefinal = (1/µ)(D + R + T/2)
Total waste
[Timeline: TimeFinal = TimeFF + TimeFinal × Waste[Fail], over periods of T − C work plus C checkpoint]

Waste = (Timefinal − Timebase) / Timefinal

1 − Waste = (1 − Waste[FF])(1 − Waste[fail])

Waste = C/T + (1 − C/T)(1/µ)(D + R + T/2)
Waste minimization
Waste = C/T + (1 − C/T)(1/µ)(D + R + T/2)

Waste = u/T + v + wT, where

u = C (1 − (D + R)/µ)    v = (D + R − C/2)/µ    w = 1/(2µ)

Waste minimized for T = √(u/w):

T = √(2(µ − (D + R))C)
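The closed form can be checked against a brute-force grid search over T (a sketch; the values of C, D, R, µ are illustrative, in hours):

```python
import math

C, D, R, mu = 1 / 3, 1 / 60, 1 / 12, 24.0   # 20 min, 1 min, 5 min, 24 h

def waste(T):
    # Waste = C/T + (1 - C/T) (1/mu) (D + R + T/2)
    return C / T + (1 - C / T) * (1 / mu) * (D + R + T / 2)

T_opt = math.sqrt(2 * (mu - (D + R)) * C)

# Grid search over admissible periods T >= C
grid = [C + i * 0.001 for i in range(20_000)]
T_best = min(grid, key=waste)

print(f"T_opt (formula) = {T_opt:.3f} h, grid minimum at T = {T_best:.3f} h")
```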
Comparison with Young/Daly
[Timeline: TimeFinal = TimeFF + TimeFinal × Waste[Fail]]

(1 − Waste[fail]) Timefinal = TimeFF
⇒ T = √(2(µ − (D + R))C)

Daly: Timefinal = (1 + Waste[fail]) TimeFF
⇒ T = √(2(µ + (D + R))C) + C

Young: Timefinal = (1 + Waste[fail]) TimeFF and D = R = 0
⇒ T = √(2µC) + C
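For realistic parameters the three periods are close to one another (a sketch with illustrative values, in hours):

```python
import math

C, D, R, mu = 1 / 3, 1 / 60, 1 / 12, 24.0   # 20 min, 1 min, 5 min, 24 h

T_first = math.sqrt(2 * (mu - (D + R)) * C)       # first-order formula above
T_daly = math.sqrt(2 * (mu + (D + R)) * C) + C    # Daly
T_young = math.sqrt(2 * mu * C) + C               # Young (D = R = 0)

for name, T in (("first-order", T_first), ("Daly", T_daly), ("Young", T_young)):
    print(f"{name:12s} T = {T:.2f} h")
```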
Validity of the approach (1/3)
Technicalities

E(Nfaults) = Timefinal/µ and E(Tlost) = D + R + T/2,
but the expectation of a product is not the product of expectations
(not independent RVs here)

Enforce C ≤ T to get Waste[FF] ≤ 1

Enforce D + R ≤ µ and bound T to get Waste[fail] ≤ 1,
but µ = µind/p is too small for large p, regardless of µind
Validity of the approach (2/3)
Several failures within the same period?

Waste[fail] is accurate only when two or more faults do not take place within the same period

Cap the period: T ≤ γµ, where γ is some tuning parameter

Failures follow a Poisson process of parameter θ = T/µ

Probability of having k ≥ 0 failures: P(X = k) = (θ^k / k!) e^(−θ)

Probability of having two or more failures:
π = P(X ≥ 2) = 1 − (P(X = 0) + P(X = 1)) = 1 − (1 + θ)e^(−θ)

γ = 0.27 ⇒ π ≤ 0.03
⇒ overlapping faults for only 3% of checkpointing segments
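The numbers check out directly:

```python
import math

theta = 0.27                                # theta = T/mu with the cap T = 0.27 mu
pi = 1 - (1 + theta) * math.exp(-theta)     # P(X >= 2) for Poisson(theta)
print(f"pi = {pi:.4f}")                     # about 0.03
```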
Validity of the approach (3/3)
Enforce T ≤ γµ, C ≤ γµ, and D + R ≤ γµ
The optimal period √(2(µ − (D + R))C) may not belong to the admissible interval [C, γµ]

The waste is then minimized for one of the bounds of this admissible interval (by convexity)
Wrap up
Capping periods, and enforcing a lower bound on the MTBF
⇒ mandatory for mathematical rigor

Not needed for practical purposes:
• actual job execution uses the optimal value
• account for multiple faults by re-executing work until success

Approach surprisingly robust
Lesson learnt for fail-stop failures
(Not so) Secret data• Tsubame 2: 962 failures during last 18 months so µ = 13 hrs• Blue Waters: 2-3 node failures per day• Titan: a few failures per day• Tianhe 2: wouldn’t say
Topt = √(2µC) ⇒ Waste[opt] ≈ √(2C/µ)

Petascale:    C = 20 min, µ = 24 hrs   ⇒ Waste[opt] = 17%
Scale by 10:  C = 20 min, µ = 2.4 hrs  ⇒ Waste[opt] = 53%
Scale by 100: C = 20 min, µ = 0.24 hrs ⇒ Waste[opt] = 100%
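These percentages follow directly from Waste[opt] ≈ √(2C/µ), capped at 100% (a quick check):

```python
import math

C = 20 / 60                                  # checkpoint time: 20 min, in hours
for label, mu in (("petascale", 24.0), ("x10", 2.4), ("x100", 0.24)):
    w = min(1.0, math.sqrt(2 * C / mu))
    print(f"{label:10s} mu = {mu:5.2f} h -> Waste[opt] ~ {100 * w:.0f}%")
```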
Exascale ≠ Petascale × 1000
Need more reliable components
Need to checkpoint faster
Silent errors:
detection latency ⇒ additional problems
Outline
1 Introduction
2 Checkpointing
   Coordinated checkpointing
   Young/Daly's approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure prediction
   Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Exponential failure distribution
1 Expected execution time for a single chunk
2 Expected execution time for a sequential job
3 Expected execution time for a parallel job
Expected execution time for a single chunk
Compute the expected time E(T(W, C, D, R, λ)) to execute a work of duration W followed by a checkpoint of duration C.
Recursive Approach

E(T(W)) = Psucc(W + C) (W + C)
        + (1 − Psucc(W + C)) (E(Tlost(W + C)) + E(Trec) + E(T(W)))

where:
• Psucc(W + C) (W + C): probability of success × time needed to compute the work W and checkpoint it
• 1 − Psucc(W + C): probability of failure
• E(Tlost(W + C)): time elapsed before the failure struck
• E(Trec): time needed to perform downtime and recovery
• E(T(W)): time needed to compute W anew
Computation of E(T(W, C, D, R, λ))
E(T(W)) = Psucc(W + C) (W + C)
        + (1 − Psucc(W + C)) (E(Tlost(W + C)) + E(Trec) + E(T(W)))

Psucc(W + C) = e^(−λ(W+C))

E(Tlost(W + C)) = ∫₀^∞ x P(X = x | X < W + C) dx = 1/λ − (W + C)/(e^(λ(W+C)) − 1)

E(Trec) = e^(−λR)(D + R) + (1 − e^(−λR))(D + E(Tlost(R)) + E(Trec))

E(T(W, C, D, R, λ)) = e^(λR) (1/λ + D) (e^(λ(W+C)) − 1)
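The closed form can be cross-checked with a Monte Carlo simulation of the process it models: attempt W + C; on failure, pay the time lost, a downtime D, and a recovery R that may itself fail; repeat until success (a sketch with illustrative parameters):

```python
import math
import random

W, C, D, R, lam = 10.0, 1.0, 0.2, 0.5, 0.05
rng = random.Random(3)

def run_once():
    """One execution of work W + checkpoint C, under Exp(lam) failures."""
    t = 0.0
    while True:
        x = rng.expovariate(lam)
        if x >= W + C:           # no failure: finish work and checkpoint
            return t + W + C
        t += x + D               # time lost, then downtime
        while True:              # recovery, which may itself fail
            x = rng.expovariate(lam)
            if x >= R:
                t += R
                break
            t += x + D

closed = math.exp(lam * R) * (1 / lam + D) * (math.exp(lam * (W + C)) - 1)
sim = sum(run_once() for _ in range(100_000)) / 100_000
print(f"simulation = {sim:.2f}, closed form = {closed:.2f}")
```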
Best pattern for infinite jobs
Minimize Overhead(W) = E(T(W))/W − 1

Overhead(W) = C/W + λW/2 + o(λW) ⇒ Wopt = Θ(λ^(−1/2))

Wopt = √(2C/λ)

Overhead(Wopt) = √(2Cλ) + o(λ^(1/2))

Periodic pattern (memoryless property)
Checkpointing a sequential job
E(T(W)) = e^(λR) (1/λ + D) ∑_{i=1}^{K} (e^(λ(Wi+C)) − 1)

Optimal strategy uses same-size chunks (convexity)

K0 = λW / (1 + L(−e^(−λC−1))), where L(z)e^(L(z)) = z (Lambert function)

Optimal number of chunks K* is max(1, ⌊K0⌋) or ⌈K0⌉

Eopt(T(W)) = K* (e^(λR) (1/λ + D)) (e^(λ(W/K* + C)) − 1)

Can also use Daly's second-order approximation
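A sketch checking the Lambert-function formula against brute force over integer K (parameters are illustrative; W0, the principal Lambert branch on (−1/e, 0), is computed by bisection):

```python
import math

lam, W, C, D, R = 0.01, 100.0, 1.0, 0.0, 0.0

def lambert_w0(z):
    """Solve w * e^w = z for z in (-1/e, 0) by bisection on (-1, 0),
    where w -> w*e^w is increasing."""
    lo, hi = -1.0, 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.exp(mid) < z:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exp_time(K):
    """Expected execution time with K same-size chunks."""
    return (K * math.exp(lam * R) * (1 / lam + D)
            * (math.exp(lam * (W / K + C)) - 1))

K0 = lam * W / (1 + lambert_w0(-math.exp(-lam * C - 1)))
K_star = min((max(1, math.floor(K0)), math.ceil(K0)), key=exp_time)

K_brute = min(range(1, 201), key=exp_time)
print(f"K0 = {K0:.2f}, K* = {K_star}, brute-force minimizer = {K_brute}")
```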
Checkpointing a parallel job
p processors ⇒ distribution Exp(λp), where λp = pλ

Use W(p), C(p), R(p) in Eopt(T(W)) for a distribution Exp(λp = pλ)

Job types
• Perfectly parallel jobs: W(p) = W/p
• Generic parallel jobs: W(p) = W/p + δW
• Numerical kernels: W(p) = W/p + δW^(2/3)/√p

Checkpoint overhead
• Proportional overhead: C(p) = R(p) = δV/p = C/p
  (bandwidth of processor network card/link is the I/O bottleneck)
• Constant overhead: C(p) = R(p) = δV = C
  (bandwidth to/from the resilient storage system is the I/O bottleneck)
Weibull failure distribution
No optimality result known
Heuristic: maximize expected work before next failure
Dynamic programming algorithms:
• use a time quantum
• trim the history of previous failures
Outline
1 Introduction
2 Checkpointing
   Coordinated checkpointing
   Young/Daly's approximation
   Exponential distributions
   Assessing protocols at scale
   In-memory checkpointing
   Failure prediction
   Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Hierarchical checkpointing
Clusters of processes
Coordinated checkpointing protocol within clusters

Message logging protocols between clusters

Only processors from the failed group need to roll back
[Diagram: processes P0-P3 split into two groups, exchanging messages m1-m5; inter-group messages are logged]
− Need to log inter-group messages
  • slows down failure-free execution
  • increases checkpoint size/time
+ Faster re-execution with logged messages
Which checkpointing protocol to use?
Coordinated checkpointing
+ No risk of cascading rollbacks
+ No need to log messages
− All processors need to roll back
− Rumor: may not scale to very large platforms

Hierarchical checkpointing
− Need to log inter-group messages
  • slows down failure-free execution
  • increases checkpoint size/time
+ Only processors from the failed group need to roll back
+ Faster re-execution with logged messages
+ Rumor: should scale to very large platforms
Blocking vs. non-blocking
[Timeline: computation halts while each chunk is checkpointed]
Blocking model: checkpointing blocks all computations
Blocking vs. non-blocking
[Timeline: checkpointing of the first chunk overlaps with computation of the second chunk]
Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk)
Blocking vs. non-blocking
[Timeline: computation proceeds at reduced speed while the first chunk is checkpointed]
General model: checkpointing slows computations down: during a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing (0 ≤ α ≤ 1)
Waste in fault-free execution
[Timeline: processors P0-P3, period of T − C work at full speed followed by a checkpoint C with slowed-down computation]

Time elapsed since last checkpoint: T

Amount of computations executed: Work = (T − C) + αC

Waste[FF] = (T − Work) / T
Waste due to failures
Failure can happen
1 During computation phase
2 During checkpointing phase
Waste due to failures
[Timeline: the work Tlost since the last checkpoint is lost]

Coordinated checkpointing protocol: when one processor is the victim of a failure, all processors lose their work and must roll back to the last checkpoint
Waste due to failures in computation phase
[Timeline: downtime D follows the failure]
Waste due to failures in computation phase
Coordinated checkpointing protocol: all processors must recover from the last checkpoint (downtime D, then recovery R)
Waste due to failures in computation phase
Redo the work destroyed by the failure that was done during the checkpointing phase preceding the computation phase
But no checkpoint is taken in parallel, hence this re-execution is faster than the original computation (duration αC)
Waste due to failures in computation phase
Re-execute the computation phase (duration T − C)
Waste due to failures in computation phase
Finally, the checkpointing phase is executed
Total waste
[Timeline: failure strikes after Tlost; downtime D, recovery R, re-execution of slowed-down work αC, then T − C of work and checkpoint C; total span ∆]
Waste[fail] = (1/µ) (D + R + αC + T/2)

Optimal period: Topt = √(2(1 − α)(µ − (D + R + αC))C)
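These formulas are easy to evaluate numerically. A minimal sketch (function names are mine; the total waste combines the two terms as Waste = Waste[FF] + (1 − Waste[FF])·Waste[fail], which is the first-order combination behind the Topt expression):

```python
from math import sqrt

def total_waste(T, C, D, R, mu, alpha):
    # Waste[FF] = (1 - alpha)*C/T, since Work = (T - C) + alpha*C
    waste_ff = (1 - alpha) * C / T
    # Waste[fail] = (1/mu)(D + R + alpha*C + T/2)
    waste_fail = (D + R + alpha * C + T / 2) / mu
    return waste_ff + (1 - waste_ff) * waste_fail

def t_opt(C, D, R, mu, alpha):
    # Topt = sqrt(2*(1 - alpha)*(mu - (D + R + alpha*C))*C)
    return sqrt(2 * (1 - alpha) * (mu - (D + R + alpha * C)) * C)

# Illustrative values (mine): C = 600s, D = 60s, R = 600s,
# platform MTBF mu = 1 day, checkpoint/work overlap alpha = 0.3
T = t_opt(600, 60, 600, 86400, 0.3)
```

Topt is a first-order optimum: the waste at Topt should be lower than at periods that are noticeably shorter or longer.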
Hierarchical checkpointing
[Timeline: groups G1 to Gg checkpoint in sequence (G·C per period); after a failure at Tlost: downtime D, recovery R, re-execution of slowed-down work α(G − g + 1)C, then T − G·C − Tlost of work]
Processors partitioned into G groups
Each group includes q processors
Inside each group: coordinated checkpointing in time C (q)
Inter-group messages are logged
Accounting for message logging: Impact on work
Logging messages slows down execution:
⇒ Work becomes λWork, where 0 < λ < 1. Typical value: λ ≈ 0.98

Re-execution after a failure is faster:
⇒ Re-Exec becomes Re-Exec/ρ, where ρ ∈ [1..2]. Typical value: ρ ≈ 1.5
Waste[FF] = (T − λWork)/T

Waste[fail] = (1/µ) (D(q) + R(q) + Re-Exec/ρ)
Accounting for message logging: Impact on checkpoint size
Inter-group messages are logged continuously
Checkpoint size increases with the amount of work executed before a checkpoint
C0(q): checkpoint size of a group without message logging
C(q) = C0(q)(1 + βWork) ⇔ β = (C(q) − C0(q))/(C0(q)·Work)

Work = λ(T − (1 − α)GC(q))

C(q) = C0(q)(1 + βλT) / (1 + GC0(q)βλ(1 − α))
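The closed form for C(q) can be checked against the fixed-point definition it comes from. A small sketch (variable names and numerical values are mine, chosen only for illustration):

```python
def checkpoint_size(C0, beta, lam, T, G, alpha):
    """C(q) = C0(q)*(1 + beta*lambda*T) / (1 + G*C0(q)*beta*lambda*(1 - alpha))."""
    return C0 * (1 + beta * lam * T) / (1 + G * C0 * beta * lam * (1 - alpha))

# The result solves C = C0*(1 + beta*Work) with Work = lambda*(T - (1 - alpha)*G*C):
C0, beta, lam, T, G, alpha = 15.0, 1.1e-4, 0.98, 3600.0, 136, 0.3
C = checkpoint_size(C0, beta, lam, T, G, alpha)
work = lam * (T - (1 - alpha) * G * C)
assert abs(C - C0 * (1 + beta * work)) < 1e-9   # fixed point holds
```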
Three case studies
Coord-IO
Coordinated approach: C = CMem = Mem/bio, where Mem is the memory footprint of the application

Hierarch-IO
Several (large) groups, I/O-saturated ⇒ groups checkpoint sequentially: C0(q) = CMem/G = Mem/(G·bio)

Hierarch-Port
Very large number of smaller groups, port-saturated ⇒ some groups checkpoint in parallel. Groups of qmin processors, where qmin·bport ≥ bio
Three applications
1 2D-Stencil
2 Matrix product
3 3D-Stencil (Plane and Line variants)
Computing β for 2D-Stencil
C(q) = C0(q) + Logged_Msg = C0(q)(1 + βWork)

Real n × n matrix on a p × p grid: Work = 9b²/sp, with b = n/p

Each process sends a block to its 4 neighbors

Hierarch-IO:
1 group = 1 grid row
2 out of the 4 messages are logged
β = Logged_Msg/(C0(q)·Work) = 2pb/(pb²·(9b²/sp)) = 2sp/(9b³)

Hierarch-Port:
β doubles
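The β derivation can be reproduced from first principles and checked against the closed form. A sketch (all names are mine; b = n/p is the block size per process, s_p the processor speed):

```python
def beta_2d_stencil(n, p, s_p):
    """beta for the 2D stencil under Hierarch-IO (one group = one grid row)."""
    b = n / p                         # block size per process
    logged_msg = 2 * p * b            # 2 of the 4 neighbor messages logged, p processes per row
    c0 = p * b * b                    # group checkpoint size without logging
    work = 9 * b * b / s_p            # work per process between checkpoints
    return logged_msg / (c0 * work)   # simplifies to 2*s_p / (9*b**3)
```

With the closed form 2sp/(9b³), the two expressions agree for any n, p, sp.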
Four platforms: basic characteristics
Name           Cores          Processors  Cores/proc  Mem/proc  bio Read  bio Write  bport/proc (R/W)
Titan          299,008        18,688      16          32GB      300GB/s   300GB/s    20GB/s
K-Computer     705,024        88,128      8           16GB      150GB/s   96GB/s     20GB/s
Exascale-Slim  1,000,000,000  1,000,000   1,000       64GB      1TB/s     1TB/s      200GB/s
Exascale-Fat   1,000,000,000  100,000     10,000      640GB     1TB/s     1TB/s      400GB/s

Name           Scenario       G (C(q))         β 2D-Stencil  β Matrix-Product
Titan          Coord-IO       1 (2,048s)       -             -
               Hierarch-IO    136 (15s)        0.0001098     0.0004280
               Hierarch-Port  1,246 (1.6s)     0.0002196     0.0008561
K-Computer     Coord-IO       1 (14,688s)      -             -
               Hierarch-IO    296 (50s)        0.0002858     0.001113
               Hierarch-Port  17,626 (0.83s)   0.0005716     0.002227
Exascale-Slim  Coord-IO       1 (64,000s)      -             -
               Hierarch-IO    1,000 (64s)      0.0002599     0.001013
               Hierarch-Port  200,000 (0.32s)  0.0005199     0.002026
Exascale-Fat   Coord-IO       1 (64,000s)      -             -
               Hierarch-IO    316 (217s)       0.00008220    0.0003203
               Hierarch-Port  33,333 (1.92s)   0.00016440    0.0006407
Checkpoint time
Name           C
K-Computer     14,688s
Exascale-Slim  64,000s
Exascale-Fat   64,000s
Large time to dump the memory
Also using a reduced cost of 1% of C
Comparing with 0.1% of C for the exascale platforms
α = 0.3, λ = 0.98 and ρ = 1.5
Plotting formulas – Platform: Titan
[Plots for Stencil 2D, Matrix product, and Stencil 3D: Waste as a function of processor MTBF µind]
Platform: K-Computer
[Plots for Stencil 2D, Matrix product, and Stencil 3D: Waste as a function of processor MTBF µind]
Plotting formulas – Platform: Exascale
Waste = 1 for all scenarios!!!
Goodbye Exascale?!
Plotting formulas – Platform: Exascale with C = 1,000

[Plots for Stencil 2D, Matrix product, and Stencil 3D, on Exascale-Slim and Exascale-Fat: Waste as a function of processor MTBF µind, C = 1,000]
Plotting formulas – Platform: Exascale with C = 100
[Plots for Stencil 2D, Matrix product, and Stencil 3D, on Exascale-Slim and Exascale-Fat: Waste as a function of processor MTBF µind, C = 100]
Simulations – Platform: Titan
[Plots for Stencil 2D and Matrix product: Makespan (in days) as a function of processor MTBF µind, comparing Coordinated and Hierarchical (Plane, Line, Port) protocols with their BestPeriod variants]
Simulations – Platform: Exascale with C = 1, 000
[Plots for Stencil 2D and Matrix product, on Exascale-Slim and Exascale-Fat: Makespan (in days) as a function of processor MTBF µind, C = 1,000, comparing Coordinated and Hierarchical (Plane, Line, Port) protocols with their BestPeriod variants]
Simulations – Platform: Exascale with C = 100
[Plots for Stencil 2D and Matrix product, on Exascale-Slim and Exascale-Fat: Makespan (in days) as a function of processor MTBF µind, C = 100, comparing Coordinated (Daly) and Hierarchical (Plane, Line, Port) protocols with their BestPeriod variants]
Outline
1 Introduction
2 Checkpointing
Coordinated checkpointing
Young/Daly's approximation
Exponential distributions
Assessing protocols at scale
In-memory checkpointing
Failure Prediction
Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Motivation
Checkpoint transfer and storage ⇒ critical issues of rollback/recovery protocols
Stable storage: high cost
Distributed in-memory storage:
Store checkpoints in local memory ⇒ no centralized storage, much better scalability
Replicate checkpoints ⇒ application survives a single failure
Still, a risk of fatal failure remains in some (unlikely) scenarios
Double checkpoint algorithm (Kale et al., UIUC)
[Timeline for buddy nodes p and p′ over one period P: local checkpoint done, remote checkpoint done, period done]
Platform nodes partitioned into pairs
Each node in a pair exchanges its checkpoint with its buddy
Each node saves two checkpoints:
- one locally, storing its own data
- one remotely, receiving and storing its buddy's data
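The buddy scheme can be sketched as a toy simulation (class and function names are mine, not taken from the actual Charm++ implementation by Kale et al.):

```python
class Node:
    """One platform node: its application state plus two checkpoint files."""
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.local_ckpt = None    # checkpoint of its own data
        self.buddy_ckpt = None    # checkpoint received from its buddy
        self.buddy = None

def pair(a, b):
    """Partition step: nodes a and b become buddies."""
    a.buddy, b.buddy = b, a

def checkpoint(node):
    """Save locally, then send a copy to the buddy."""
    node.local_ckpt = node.state
    node.buddy.buddy_ckpt = node.state

def replace(failed, fresh):
    """Downtime/recovery: a fresh node takes over from the buddy's copy;
    both lost checkpoint files are re-sent to it (the risk period)."""
    buddy = failed.buddy
    fresh.state = buddy.buddy_ckpt        # roll back to the last checkpoint
    fresh.local_ckpt = buddy.buddy_ckpt   # re-create its own file
    fresh.buddy_ckpt = buddy.local_ckpt   # re-receive the buddy's file
    pair(fresh, buddy)
    return fresh
```

Until `replace` completes both transfers, a failure of the buddy would be fatal, which is exactly the risk window discussed on the next slides.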
Failures
[Timeline: failure on node p; downtime D, then recovery R from buddy node p′; risk period until both checkpoint files are restored on the replacement node]
After failure: downtime D and recovery from buddy node
Two checkpoint files lost, must be re-sent to faulty processor
Application at risk until complete reception of both messages
Best trade-off between performance and risk?
Outline
1 Introduction
2 Checkpointing
Coordinated checkpointing
Young/Daly's approximation
Exponential distributions
Assessing protocols at scale
In-memory checkpointing
Failure Prediction
Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Framework
Predictor
Exact prediction dates (at least C seconds in advance)
Recall r : fraction of faults that are predicted
Precision p: fraction of fault predictions that are correct
Events
true positive: predicted faults
false positive: fault predictions that did not materialize asactual faults
false negative: unpredicted faults
Fault rates
µ: mean time between failures (MTBF)
µP: mean time between predicted events (both true positives and false positives)
µNP: mean time between unpredicted faults (false negatives)
µe: mean time between events (all three event types)
r = TrueP/(TrueP + FalseN) and p = TrueP/(TrueP + FalseP)

(1 − r)/µ = 1/µNP and r/µ = p/µP

1/µe = 1/µP + 1/µNP
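These definitions and relations can be checked directly on event counts, for instance with the numbers of the Example slide (5 faults, 6 predictions, 4 overlaps, over a window t). A sketch (function and variable names are mine):

```python
def predictor_rates(true_pos, false_pos, false_neg, total_time):
    """Recall, precision, and the three mean inter-event times,
    derived from raw event counts over one observation window."""
    faults = true_pos + false_neg          # actual faults
    predictions = true_pos + false_pos     # predicted events
    r = true_pos / faults                  # recall
    p = true_pos / predictions             # precision
    mu = total_time / faults               # MTBF
    mu_p = total_time / predictions        # mean time between predicted events
    mu_np = total_time / false_neg         # mean time between unpredicted faults
    return r, p, mu, mu_p, mu_np

t = 1000.0
r, p, mu, mu_p, mu_np = predictor_rates(4, 2, 1, t)
# r = 4/5, p = 4/6, mu = t/5, mu_p = t/6, mu_np = t
```

The slide's identities (1 − r)/µ = 1/µNP and r/µ = p/µP then hold by construction.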
Example
[Timeline over duration t: five actual faults, six predictions, four of which overlap actual faults]
Predictor predicts six faults in time t
Five actual faults. One fault not predicted
µ = t/5, µP = t/6, and µNP = t

Recall r = 4/5 (predicted faults over actual faults)

Precision p = 4/6 (correct predictions over all predictions)
Algorithm
1 While no fault prediction is available:
• checkpoints taken periodically with period T
2 When a fault is predicted at time t:
• take a checkpoint ALAP (completion right at time t)
• after the checkpoint, complete the execution of the period
Computing the waste
1 Fault-free execution: Waste[FF] = C/T

2 Unpredicted faults: (1/µNP) [D + R + T/2]
Computing the waste
3 Predictions: (1/µP) [p(C + D + R) + (1 − p)C]

with an actual fault (true positive): proactive checkpoint, then downtime D and recovery R

with no actual fault (false positive): only the proactive checkpoint is paid
Waste[fail] = (1/µ) [(1 − r)T/2 + D + R + (r/p)C] ⇒ Topt ≈ √(2µC/(1 − r))
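A numerical check of the prediction-aware period (function names and sample values are mine; with r = 0, Topt reduces to Young/Daly's √(2µC)):

```python
from math import sqrt

def waste_pred(T, C, D, R, mu, r, p):
    # Waste[FF] + Waste[fail] from the slide formulas
    return C / T + ((1 - r) * T / 2 + D + R + (r / p) * C) / mu

def t_opt_pred(C, mu, r):
    # Topt ~ sqrt(2*mu*C / (1 - r)): a good predictor (large r) allows longer periods
    return sqrt(2 * mu * C / (1 - r))
```

Differentiating waste_pred in T gives exactly t_opt_pred, so the waste at Topt is below the waste at nearby periods.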
Refinements
Use a different value Cp for proactive checkpoints

Avoid checkpointing too frequently on false positives:
⇒ only trust predictions with some fixed probability q
⇒ ignore predictions with probability 1 − q
Conclusion: trust the predictor always or never (q = 0 or q = 1)

Trust predictions depending upon the position in the current period:
⇒ increase q while progressing
⇒ break-even point Cp/p
With prediction windows
[Timelines: regular mode with period TR and checkpoint C; upon a prediction window I, switch to proactive mode with period TP and checkpoint Cp, shown with and without an actual failure]
Gets too complicated!
Outline
1 Introduction
2 Checkpointing
Coordinated checkpointing
Young/Daly's approximation
Exponential distributions
Assessing protocols at scale
In-memory checkpointing
Failure Prediction
Replication
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Replication
Systematic replication: efficiency < 50%
Can replication+checkpointing be more efficient thancheckpointing alone?
Study by Ferreira et al. [SC’2011]: yes
Model by Ferreira et al. [SC'2011]
Parallel application comprising N processes
Platform with ptotal = 2N processors
Each process replicated → N replica-groups
When a replica is hit by a failure, it is not restarted
Application fails when both replicas in one replica-group havebeen hit by failures
Example
[Diagram: four replica-pairs Pair1-Pair4, each with processors p1 and p2; failures strike replicas over time]
The birthday problem
Classical formulation
What is the probability, in a set of m people, that two of them have the same birthday?

Relevant formulation
What is the average number of people required to find a pair with the same birthday?
Birthday(m) = 1 + ∫0^{+∞} e^{−x}(1 + x/m)^{m−1} dx = 2/3 + √(πm/2) + √(π/(288m)) − 4/(135m) + …
The analogy
Two people with the same birthday ≡ two failures hitting the same replica-group
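Both formulations are easy to evaluate; a quick check of the asymptotic expansion against a Monte-Carlo estimate (function names are mine):

```python
import random
from math import pi, sqrt

def birthday_series(m):
    """Asymptotic expansion from the slide for the expected number of draws
    until two of them collide among m equally likely days."""
    return 2 / 3 + sqrt(pi * m / 2) + sqrt(pi / (288 * m)) - 4 / (135 * m)

def birthday_mc(m, trials, seed=0):
    """Monte-Carlo estimate of the same expectation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while True:
            draws += 1
            if (d := rng.randrange(m)) in seen:
                break
            seen.add(d)
        total += draws
    return total / trials

# For m = 365 the expansion gives about 24.62 people on average
```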
Differences with birthday problem
2N processors but N processes, each replicated twice
Uniform distribution of failures
First failure: each replica-group has probability 1/N to be hit
Second failure cannot hit failed PE
Failure uniformly distributed over the 2N − 1 remaining PEs
Probability that replica-group i is hit: 1/(2N − 1)
Probability that a replica-group ≠ i is hit: 2/(2N − 1)
Failures are not uniformly distributed over replica-groups: this is not the birthday problem
Second failure can hit failed PE
Suppose the failure hits replica-group i
If it hits the already-failed PE: the application survives
If it hits the running PE: the application is killed
Not all failures hitting the same replica-group are equal: this is not the birthday problem
Correct analogy
N = nrg bins, red and blue balls
Mean Number of Failures To Interruption (to bring down the application):
MNFTI = expected number of balls to throw until one bin gets one ball of each color
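The balls-into-bins analogy can be simulated directly. A Monte-Carlo sketch (names are mine) estimating MNFTI^ah, where failures strike uniformly among the 2N processors, repeats allowed, and the application dies when both members of some pair have been hit:

```python
import random

def mnfti_ah_mc(n_pairs, trials, seed=0):
    """Monte-Carlo estimate of MNFTI^ah for n_pairs replica-pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hit = set()
        throws = 0
        while True:
            throws += 1
            pe = rng.randrange(2 * n_pairs)       # failure hits any PE, even a dead one
            hit.add(pe)
            group = pe // 2                       # PEs 2g and 2g+1 form pair g
            if 2 * group in hit and 2 * group + 1 in hit:
                break                             # both replicas of one pair are dead
        total += throws
    return total / trials
```

For a single pair the expectation is exactly 3 (one throw to hit a first PE, then on average two more to hit its partner); for two pairs it is 11/3, matching the recursion on the next slides.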
Number of failures to bring down application
MNFTI^ah: count each failure hitting any of the initial processors, including those already hit by a failure

MNFTI^rp: count only failures that hit running processors, and thus effectively kill replicas

MNFTI^ah = 1 + MNFTI^rp
Exponential failures
Theorem: MNFTI^ah = E(NFTI^ah | 0), where

E(NFTI^ah | nf) = 2 if nf = nrg,
E(NFTI^ah | nf) = 2nrg/(2nrg − nf) + ((2nrg − 2nf)/(2nrg − nf)) · E(NFTI^ah | nf + 1) otherwise.
E(NFTI^ah | nf): expectation of the number of failures needed to kill the application, knowing that
• the application is still running
• failures have already hit nf different replica-groups
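The recurrence evaluates in linear time when unrolled bottom-up, from nf = nrg down to 0 (function name mine):

```python
def mnfti_ah(n_rg):
    """MNFTI^ah = E(NFTI^ah | 0), computed iteratively from the base case
    E(NFTI^ah | n_rg) = 2 to avoid deep recursion for large n_rg."""
    e = 2.0
    for nf in range(n_rg - 1, -1, -1):
        e = 2 * n_rg / (2 * n_rg - nf) + (2 * n_rg - 2 * nf) / (2 * n_rg - nf) * e
    return e
```

Small cases can be checked by hand (mnfti_ah(1) = 3, mnfti_ah(2) = 11/3), and for nrg = 2^19 replica-pairs (2N = 2^20 processors) the recursion yields the value of about 1284.4 quoted on the Comparison slide.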
Exponential failures (cont’d)
Proof

E(NFTI^ah | nrg) = (1/2) · 1 + (1/2) · (1 + E(NFTI^ah | nrg))

E(NFTI^ah | nf) = ((2nrg − 2nf)/(2nrg)) · (1 + E(NFTI^ah | nf + 1))
               + (2nf/(2nrg)) · ((1/2) · 1 + (1/2) · (1 + E(NFTI^ah | nf)))

MTTI = systemMTBF(2nrg) × MNFTI^ah
Comparison
2N processors, no replication:
ThroughputStd = 2N(1 − Waste) = 2N(1 − √(2C/µ2N))

N replica-pairs:
ThroughputRep = N(1 − √(2C/µrep))
µrep = MNFTI × µ2N = MNFTI × µ/(2N)

Platform with 2N = 2^20 processors ⇒ MNFTI = 1284.4
µ = 10 years ⇒ checkpointing alone is better if C is shorter than 6 minutes
Failure distribution
[Plots: average makespan (in days) vs. number of processors (2^15 to 2^21), for BestPeriod and Daly periods with g = 1 (no replication) and g = 2 (replication); panels: (a) Exponential, (b) Weibull, k = 0.7]
Crossover point for replication when µind = 125 years
Weibull distribution with k = 0.7
Dashed line: Ferreira et al. Solid line: Correct analogy
[Plot: number of processors vs. processor MTBF (in years); crossover curves for C = 150, 300, 600, 900, 1200, 2400]
Study by Ferreira et al. favors replication

Replication is beneficial for small µ, large C, and big ptotal
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
ABFT for matrix product
ABFT for (tiled) LU factorization
Composite approach: ABFT & Checkpointing
4 Silent errors
5 Conclusion
Generic vs. Application specific approaches
Generic solutions
Universal
Very low prerequisite
One size fits all (pros and cons)
Application specific solutions
Requires (deep) study of the application/algorithm
Tailored solution: higher efficiency
Backward Recovery vs. Forward Recovery
Backward Recovery
Rollback / Backward Recovery: returns in the history to recover from failures.
Spends time to re-execute computations
Rebuilds states already reached
Typical: checkpointing techniques
Backward Recovery vs. Forward Recovery
Forward Recovery
Forward Recovery: proceeds without returning
Pays additional costs during (failure-free) computation tomaintain consistent redundancy
Or pays additional computations when failures happen
General technique: Replication
Application-Specific techniques: iterative algorithms with fixed-point convergence, ABFT, ...
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
ABFT for matrix product
ABFT for (tiled) LU factorization
Composite approach: ABFT & Checkpointing
4 Silent errors
5 Conclusion
Algorithm Based Fault Tolerance (ABFT)
Principle
Limited to Linear Algebra computations
Matrices are extended with rows and/or columns of checksums
M =
  5 1 7 | 13
  4 3 5 | 12
  4 6 9 | 19
ABFT and fail-stop errors
Missing checksum data
M =
   5  1  7 | 13
   4  3  5 |  ?
   4  6  9 | 19

Simple recomputation: 4 + 3 + 5 = 12.

Missing original data

M =
   5  1  7 | 13
   4  ?  5 | 12
   4  6  9 | 19

Simple recomputation: 12 − (4 + 5) = 3.
[email protected] Fault-tolerance for HPC 109/ 168
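The two recoveries above can be replayed in a few lines (a Python sketch of the slide's 3×3 example, not code from the tutorial):

```python
# Row-checksum recovery: each row is [data..., checksum],
# with checksum = sum(data).

row = [4, 3, 5, None]               # checksum element lost
row[3] = sum(row[:3])               # recompute: 4 + 3 + 5 = 12
assert row == [4, 3, 5, 12]

row = [4, None, 5, 12]              # one data element lost
missing = row[3] - (row[0] + row[2])  # 12 - (4 + 5) = 3
row[1] = missing
assert row == [4, 3, 5, 12]
```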
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT and silent data corruption
M =
   5  1  7 | 13
   4  3  5 | 13
   4  6  9 | 19

Error detection: 4 + 3 + 5 ≠ 13

Limitations

The following matrix would have successfully passed the sanity check:

M =
   5  1  7 | 13
   5  3  5 | 13
   4  6  9 | 19

Can detect one error and correct none.
[email protected] Fault-tolerance for HPC 110/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT and silent data corruption
One row and one column of checksums
M =
    5  1  7 | 13
    4  3  5 | 11
    4  6  9 | 19
   -------------
   13  9 21 | 43

Checksum recomputation to look for silent data corruptions:

5 + 1 + 7 = 13
4 + 3 + 5 = 12
4 + 6 + 9 = 19
13 + 10 + 21 = 44

Checksums do not match!
[email protected] Fault-tolerance for HPC 111/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT and silent data corruption
M =
    5  1  7 | 13
    4  3  5 | 11
    4  6  9 | 19
   -------------
   13  9 21 | 43

Recomputed sums:
5 + 1 + 7 = 13
4 + 3 + 5 = 12   (stored: 11)
4 + 6 + 9 = 19
13 + 10 + 21 = 44   (stored: 43)

Both a row checksum and a column checksum are affected, which gives the location of the error. We solve:

4 + x + 5 = 11   and   1 + x + 6 = 9

so x = 2. Recomputing the checksums, we find:
5 + 1 + 7 = 13
4 + 2 + 5 = 11
4 + 6 + 9 = 19
13 + 9 + 21 = 43

Checksums match!

Can detect two errors and correct one
[email protected] Fault-tolerance for HPC 112/ 168
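The detect-and-correct procedure can be sketched as follows (illustrative Python, hardcoding the slide's corrupted matrix; `locate_error` is a hypothetical helper name):

```python
# Full-checksum matrix: 3 data rows plus a column-checksum row, each
# row carrying a trailing row checksum. Entry (1,1) was silently
# corrupted from 2 to 3 after the checksums were computed.
M = [
    [5, 1, 7, 13],
    [4, 3, 5, 11],    # row checksum 11 no longer matches 4+3+5
    [4, 6, 9, 19],
    [13, 9, 21, 43],  # column checksums
]

def locate_error(M):
    """Return (i, j) of a single corrupted entry, or None if clean."""
    n = len(M) - 1
    bad_row = [i for i in range(n) if sum(M[i][:n]) != M[i][n]]
    bad_col = [j for j in range(n) if sum(M[i][j] for i in range(n)) != M[n][j]]
    if bad_row and bad_col:
        return bad_row[0], bad_col[0]
    return None

i, j = locate_error(M)
# Correct by solving the row-checksum equation: 4 + x + 5 = 11 -> x = 2.
M[i][j] = M[i][-1] - (sum(M[i][:-1]) - M[i][j])
assert M[1][1] == 2 and locate_error(M) is None
```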
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT for Matrix-Matrix multiplication
Aim: computation of C = A × B

Let e^T = [1, 1, ..., 1]. We define

A^c := [ A ; e^T A ],   B^r := [ B  Be ],   C^f := [ C  Ce ; e^T C  e^T Ce ]

where A^c is the column-checksum matrix, B^r is the row-checksum matrix, and C^f is the full-checksum matrix. Then

A^c × B^r = [ A ; e^T A ] × [ B  Be ] = [ AB  ABe ; e^T AB  e^T ABe ] = [ C  Ce ; e^T C  e^T Ce ] = C^f
[email protected] Fault-tolerance for HPC 113/ 168
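The identity A^c × B^r = C^f can be checked on a small example (a pure-Python sketch, no numerical library assumed):

```python
# Verify that multiplying the checksum-extended factors yields the
# full-checksum matrix of the product, on a 2x2 example.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[5, 1], [4, 3]]
B = [[2, 0], [1, 7]]

# Column-checksum A^c: append the row e^T A (column sums of A).
Ac = A + [[sum(col) for col in zip(*A)]]
# Row-checksum B^r: append the column B e (row sums of B).
Br = [row + [sum(row)] for row in B]

Cf = matmul(Ac, Br)   # full-checksum product
C = matmul(A, B)

# Top-left block of C^f is C; last column holds the row sums of C;
# last row holds the column sums of C.
assert all(Cf[i][j] == C[i][j] for i in range(2) for j in range(2))
assert all(Cf[i][2] == sum(C[i]) for i in range(2))
assert all(Cf[2][j] == sum(C[i][j] for i in range(2)) for j in range(2))
```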
Intro Checkpointing Forward-recovery Silent Errors Conclusion
In practice... things are more complicated!
When do errors strike? Are all data always protected?
Computations are approximate because of floating-point rounding

Error detection and error correction capabilities depend on the number of checksum rows and columns
[email protected] Fault-tolerance for HPC 114/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing
4 Silent errors
5 Conclusion
[email protected] Fault-tolerance for HPC 115/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Block LU factorization
[Figure: A is transformed into its LU factorization]
Solve A · x = b (hard)
Transform A into an LU factorization
Solve L · y = b, then U · x = y
[email protected] Fault-tolerance for HPC 116/ 168
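The three steps can be sketched as follows (an illustrative, unpivoted Doolittle LU in Python; real libraries use blocked algorithms with partial pivoting):

```python
def lu(A):
    """Doolittle LU factorization, no pivoting (illustration only)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def solve(A, b):
    """Solve A x = b via L y = b (forward), then U x = y (backward)."""
    L, U = lu(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

x = solve([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0])
assert all(abs(v - e) < 1e-12 for v, e in zip(x, [1.75, 1.5]))
```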
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Block LU factorization
[Figure: one step of block LU]
GETF2: factorize a column block
TRSM: update the row block
GEMM: update the trailing matrix
Solve A · x = b (hard)
Transform A into a LU factorization
Solve L · y = b, then U · x = y
[email protected] Fault-tolerance for HPC 116/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Block LU factorization
[Figure: the factorization proceeds panel by panel until A = LU]
Solve A · x = b (hard)
Transform A into a LU factorization
Solve L · y = b, then U · x = y
[email protected] Fault-tolerance for HPC 116/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Block LU factorization
[Figure: 2D block-cyclic layout of the tiles over a 2×3 process grid]
Failure of rank 2
2D Block Cyclic Distribution (here 2× 3)
A single failure ⇒ many data lost
[email protected] Fault-tolerance for HPC 116/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
[Figure: M×N matrix in 2D block-cyclic layout on a P×Q grid (tiles of size mb×nb), extended with duplicated checksum block columns; each process stores fewer than 2N/Q + nb local columns]
Checksum: invertible operation on the data of the row / column

Checksum blocks are doubled, to allow recovery when data and checksum are lost together
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
[Figure: checksum block columns held by dedicated processes A and B to the right of the matrix]
Checksum: invertible operation on the data of the row / column

Checksum replication can be avoided by dedicating computing resources to checksum storage
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
[Figure: GETF2, TRSM and GEMM steps applied to the data tiles and to the checksum tiles alike]
Idea of ABFT: applying the operation on data and checksum preserves the checksum properties
[email protected] Fault-tolerance for HPC 117/ 168
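This preservation property is easy to verify directly: one elimination step applied to a matrix extended with a row-checksum column leaves that column equal to the row sums (a small sketch, not the tutorial's code):

```python
# Extend a 3x3 matrix with a row-checksum column, perform one
# Gaussian-elimination step on the *extended* matrix, and verify
# that the checksum column still equals the row sums.
A = [[4.0, 2.0, 1.0],
     [2.0, 5.0, 3.0],
     [6.0, 1.0, 2.0]]
E = [row + [sum(row)] for row in A]   # extended matrix [A | A e]

# Eliminate column 0 below the diagonal (one step of LU).
for i in range(1, 3):
    m = E[i][0] / E[0][0]
    for j in range(4):                # update data AND checksum column
        E[i][j] -= m * E[0][j]

for row in E:
    assert abs(row[3] - sum(row[:3])) < 1e-12
```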
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
For the part of the data that is not updated this way, the checksum must be recalculated
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
To avoid slowing down all processors and the panel operation, group checksum updates every Q block columns
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
Then, update the missing coverage. Keep the checkpointed block column to cover failures during that time
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
Checkpoint the next set of Q panels to be able to return to it in case of failures
[email protected] Fault-tolerance for HPC 117/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
In case of failure, conclude the operation, then
Missing Data = Checksum − Sum(Existing Data)
[email protected] Fault-tolerance for HPC 118/ 168
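The reduction above can be sketched per tile (illustrative Python; here the checksum block is simply the element-wise sum of the data blocks):

```python
# A block-row is distributed over processes; one extra process holds
# the checksum block, the element-wise sum of all data blocks. If one
# process fails, its block is rebuilt as checksum - sum(survivors).
blocks = {
    0: [[1, 2], [3, 4]],
    1: [[5, 6], [7, 8]],
    2: [[9, 0], [1, 2]],
}
checksum = [[sum(blocks[p][i][j] for p in blocks) for j in range(2)]
            for i in range(2)]

lost = blocks.pop(1)  # process 1 fails

rebuilt = [[checksum[i][j] - sum(b[i][j] for b in blocks.values())
            for j in range(2)] for i in range(2)]
assert rebuilt == lost
```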
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Algorithm Based Fault Tolerant LU decomposition
In case of failure, conclude the operation, then
Missing Checksum = Sum(Existing Data)
[email protected] Fault-tolerance for HPC 118/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Failure inside a Q−panel factorization
Failures may happen while inside a Q−panel factorization
[email protected] Fault-tolerance for HPC 119/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Failure inside a Q−panel factorization
Valid checksum information allows us to recover most of the missing data, but not all: the checksums for the current Q panels are not valid
[email protected] Fault-tolerance for HPC 119/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Failure inside a Q−panel factorization
We use the checkpoint to restore the Q panel to its initial state
[email protected] Fault-tolerance for HPC 119/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Failure inside a Q−panel factorization
and re-execute that part of the factorization, without applying updates outside of its scope
[email protected] Fault-tolerance for HPC 119/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT LU decomposition: implementation
MPI Implementation
PBLAS-based: need to provide a “fault-aware” version of the library

Cannot enter the recovery state at any point in time: need to complete ongoing operations despite failures

Recovery starts by defining the position of each process in the factorization and bringing them all into a consistent state (the checksum property holds)

Need to test the return code of each and every MPI-related call
[email protected] Fault-tolerance for HPC 120/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT LU decomposition: performance
As supercomputers grow ever larger in scale, the Mean Time to Failure becomes shorter and shorter, making the complete and successful execution of complex applications more and more difficult. FT-LA delivers a new approach, utilizing Algorithm-Based Fault Tolerance (ABFT), to help factorization algorithms survive fail-stop failures. The FT-LA software package extends ScaLAPACK with ABFT routines, and in sharp contrast with legacy checkpoint-based approaches, ABFT does not incur I/O overhead, and promises a much more scalable protection scheme.
ABFT THE IDEA
Cost of ABFT comes only from extra flops (to update checksums) and extra storage
Cost decreases with machine scale (divided by Q when using PxQ processes)
PROTECTION
Matrix protected by block row checksum
The algorithm updates both the trailing matrix AND the checksums
RECOVERY
Missing blocks reconstructed by inverting the checksum operation
FUNCTIONALITY COVERAGE
Linear Systems of Equations
Least Squares
Cholesky, LU
QR (with protection of the upper and lower factors)
FEATURES
WORK IN PROGRESS
Covering four precisions: double complex, single complex, double real, single real (ZCDS)
Deploys on MPI FT draft (ULFM), or with “Checkpoint-on-failure”
Tolerates permanent crashes

Hessenberg Reduction, Soft (silent) Errors
Process grid: p × q; F: number of simultaneous failures tolerated
Protection against 2 faults on 192x192 processes => 1% overhead
Usually F << q; Overheads in F/q
Protection cost is inversely proportional to machine scale!
Computation
Memory
Flops for the checksum update
Matrix is extended with 2F columns every q columns
FIND OUT MORE AT http://icl.cs.utk.edu/ft-la
[Figure: weak scalability of FT-PDGETRF vs. ScaLAPACK PDGETRF on Kraken — performance (TFlop/s) and relative overhead (%) for process grids from 6×6 (N=20k) up to 192×192 (N=640k), with and without one recovery]
PERFORMANCE ON KRAKEN
Open MPI with ULFM; Kraken supercomputer.
[email protected] Fault-tolerance for HPC 121/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT QR decomposition: performance0:22 A. Bouteiller, T. Herault, G. Bosilca, P. Du, and J. Dongarra
[Figure: weak scalability of FT-PDGEQRF vs. ScaLAPACK PDGEQRF on Kraken — performance (TFlop/s) and relative overhead (%) for process grids from 6×6 (N=20k) up to 48×48 (N=160k), with and without one error]
Fig. 11. Weak scalability of FT-QR: run time overhead on Kraken when failures strike
local snapshots have to be used along with re-factorization to recover the lost data and restore the matrix state. This is referred to as the “failure within Q panels.”

Figure 10 shows the overhead from these two cases for the LU factorization, along with the no-error overhead as a reference. In the “border” case, the failure is simulated to strike when the 96th panel (which is a multiple of the grid columns, 6, 12, ..., 48) has just finished. In the “non-border” case, failure occurs during the (Q + 2)th panel factorization. For example, when Q = 12, the failure is injected when the trailing update for the step with panel (1301, 1301) finishes. From the result in Figure 10, the recovery procedure in both cases adds a small overhead that also decreases when scaled to larger problem sizes and process grids. For the largest setups, only 2-3 percent of the execution time is spent recovering from a failure.

7.4. Extension to Other Factorizations
The algorithm proposed in this work can be applied to a wide range of dense matrix factorizations other than LU. As a demonstration we have extended the fault tolerance functions to the ScaLAPACK QR factorization in double precision. Since QR uses a block algorithm similar to LU (and also similar to Cholesky), the integration of the fault tolerance functions is mostly straightforward. Figure 11 shows the performance of QR with and without recovery. The overhead drops as the problem and grid size increase, although it remains higher than that of LU for the same problem size. This is expected: as the QR algorithm has a higher complexity than LU (4/3 N³ vs. 2/3 N³), the ABFT approach incurs more extra computation when updating checksums. Similar to the LU result, recovery adds an extra 2% overhead. At size 160,000 a failure incurs about a 5.7% penalty to be recovered. This overhead becomes lower the larger the problem or processor grid size considered.
ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: January 2013.
Open MPI with ULFM; Kraken supercomputer.
[email protected] Fault-tolerance for HPC 122/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT LU decomposition: implementation
?
ABFTRecovery
Checkpoint on Failure - MPI Implementation
FT-MPI / MPI-Next FT: not easily available on large machines
Checkpoint on Failure = workaround
[email protected] Fault-tolerance for HPC 123/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT QR decomposition: performance
[Fig. 2: ABFT QR and one CoF recovery on Kraken (Lustre) — performance (TFlop/s) vs. matrix size (20k-100k) for ScaLAPACK QR, CoF-QR without failure, and CoF-QR with one failure]

[Fig. 3: ABFT QR and one CoF recovery on Dancer (local SSD) — performance (GFlop/s) vs. matrix size (10k-50k) for the same three configurations]

[Fig. 4: time breakdown of one CoF recovery on Dancer (local SSD) — load checkpoint, dump checkpoint, and ABFT recovery as a share of application time (%) vs. matrix size (20k-50k)]
5.3 Checkpoint-on-Failure QR Performance
Supercomputer Performance: Figure 2 presents the performance on the Kraken supercomputer. The process grid is 24×24 and the block size is 100. The CoF-QR (no failure) curve presents the performance of the CoF QR implementation in a fault-free execution; it is noteworthy that, when there are no failures, the performance is exactly identical to the performance of the unmodified FT-QR implementation. The CoF-QR (with failure) curves present the performance when a failure is injected after the first step of the PDLARFB kernel. The performance of the non-fault-tolerant ScaLAPACK QR is also presented for reference.

Without failures, the performance overhead compared to the regular ScaLAPACK is caused by the extra computation to maintain the checksums inherent to the ABFT algorithm [12]; this extra computation is unchanged between CoF-QR and FT-QR. Only on runs where a failure happened do the CoF protocols undergo the supplementary overhead of storing and reloading checkpoints. However, the performance of CoF-QR remains very close to the no-failure case. For instance, at matrix size N=100,000, CoF-QR still achieves 2.86 TFlop/s after recovering from a failure, which is 90% of the performance of the non-fault-tolerant ScaLAPACK QR. This demonstrates that the CoF protocol enables efficient, practical recovery schemes on supercomputers.

Impact of Local Checkpoint Storage: Figure 3 presents the performance of the CoF-QR implementation on the Dancer cluster with an 8×16 process grid. Although a smaller test platform, the Dancer cluster features local storage on nodes and a variety of performance analysis tools unavailable on Kraken. As expected (see [12]), the ABFT method has a higher relative cost on this smaller machine. Compared to the Kraken platform, the relative cost of CoF failure recovery is smaller on Dancer. The CoF protocol incurs disk accesses to store and load checkpoints when a failure hits, hence the recovery overhead depends on I/O performance. By breaking down the relative cost of each recovery step in CoF, Figure 4 shows that checkpoint saving and loading only take a small percentage of the total run time, thanks to the availability of solid-state disks on every node. Since checkpoint reloading immediately follows checkpointing, the OS cache satisfies most disk accesses, resulting in high I/O performance. For matrices larger than
Checkpoint on Failure - MPI Performance
Open MPI; Kraken supercomputer;
[email protected] Fault-tolerance for HPC 124/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniquesABFT for matrix productABFT for (tiled) LU factorizationComposite approach: ABFT & Checkpointing
4 Silent errors
5 Conclusion
[email protected] Fault-tolerance for HPC 125/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Fault Tolerance Techniques
General Techniques
Replication
Rollback Recovery
Coordinated Checkpointing
Uncoordinated Checkpointing & Message Logging
Hierarchical Checkpointing
Multilevel Checkpointing
Application-Specific Techniques
Algorithm-Based Fault Tolerance (ABFT)
Iterative Convergence
Approximated Computation
[email protected] Fault-tolerance for HPC 126/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Application
Typical Application
for (an insane number) {
    /* Extract data from simulation,
       fill up matrix */
    sim2mat();

    /* Factorize matrix, solve */
    dgeqrf();
    dsolve();

    /* Update simulation with
       result vector */
    vec2sim();
}
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
LIBRARY Phase GENERAL Phase
Characteristics
(+) Large part of the (total) computation spent in factorization/solve

Between LA operations:

(−) use the resulting vector / matrix with operations that do not preserve the checksums on the data

(−) modify data not covered by ABFT algorithms
[email protected] Fault-tolerance for HPC 127/ 168
Goodbye ABFT?!
[email protected] Fault-tolerance for HPC 127/ 168
Problem Statement
How to use fault-tolerant operations(*) within a non-fault-tolerant(**) application?(***)

(*) ABFT, or other application-specific FT
(**) Or within an application that does not have the same kind of FT
(***) And keep the application globally fault tolerant...
[email protected] Fault-tolerance for HPC 127/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: no failure
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
Periodic Checkpoint

Split Forced Checkpoints
[email protected] Fault-tolerance for HPC 128/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: failure during Library phase
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
Failure (during LIBRARY)

Rollback (partial)

Recovery

ABFT Recovery
[email protected] Fault-tolerance for HPC 129/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: failure during General phase
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
Failure (during GENERAL)

Rollback (full)

Recovery
[email protected] Fault-tolerance for HPC 130/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT&PeriodicCkpt: Optimizations
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
ABFT&PeriodicCkpt
ABFT&PeriodicCkpt: Optimizations
If the duration of the General phase is too small: don't add checkpoints

If the duration of the Library phase is too small: don't do ABFT recovery, remain in General mode
this assumes a performance model for the library call
[email protected] Fault-tolerance for HPC 131/ 168
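One possible instantiation of such a model (a rough sketch under simplifying assumptions: Young's first-order checkpoint period, a linear ABFT slowdown; all parameter and function names are hypothetical, not from the tutorial):

```python
import math

def plan_general_phase(t_general, ckpt_cost, mtbf):
    """Number of periodic checkpoints to place in a GENERAL phase of
    length t_general, using Young's period sqrt(2 * C * mu).
    Skip checkpointing entirely if the phase is shorter than one period."""
    period = math.sqrt(2.0 * ckpt_cost * mtbf)
    if t_general <= period:
        return 0
    return int(t_general // period)

def library_mode(t_library, abft_entry_cost, abft_slowdown, mtbf):
    """Stay in GENERAL mode if the Library phase is too short for the
    fixed ABFT entry/exit cost (forced checkpoints) to pay off.
    Crude rule: the expected re-execution avoided by ABFT recovery
    (failure probability times mean rollback) must exceed the ABFT cost."""
    expected_rollback = (t_library / mtbf) * (t_library / 2.0)
    abft_cost = abft_entry_cost + abft_slowdown * t_library
    return "ABFT" if expected_rollback > abft_cost else "GENERAL"
```

For instance, with a checkpoint cost of 2 and an MTBF of 100, a GENERAL phase of length 100 gets 5 periodic checkpoints, while a short Library phase stays in GENERAL mode.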
Intro Checkpointing Forward-recovery Silent Errors Conclusion
ABFT&PeriodicCkpt: Optimizations
Process 0
Process 1
Process 2
Application
Application
Application
Library
Library
Library
ABFT&PeriodicCkpt

GENERAL Checkpoint Interval
ABFT&PeriodicCkpt: Optimizations
If the duration of the General phase is too small: don't add checkpoints

If the duration of the Library phase is too small: don't do ABFT recovery, remain in General mode
this assumes a performance model for the library call
[email protected] Fault-tolerance for HPC 131/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Conclusion
Algorithm-Based Fault Tolerance
Application-specific solution for linear algebra kernels
Low-overhead forward-recovery solution
Used alone or in conjunction with backward-recovery solutions
Limitation
Need 2f linearly independent checksum vectors to detect and correct f errors

Numerical instability ⇒ implementations limited to f = 1 or f = 2
[email protected] Fault-tolerance for HPC 132/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
4 Silent errorsCoupling checkpointing and verificationApplication-specific methods
5 Conclusion
[email protected] Fault-tolerance for HPC 133/ 168
Intro Checkpointing Forward-recovery Silent Errors Conclusion
Definitions
Instantaneous error detection ⇒ fail-stop failures, e.g. resource crash
Silent errors (data corruption) ⇒ detection latency
Silent error detected only when the corrupt data is activated
Includes some software faults, some hardware errors (soft errors in L1 cache), double bit flip
Cannot always be corrected by ECC memory
Quotes
Soft Error: An unintended change in the state of an electronic device that alters the information that it stores without destroying its functionality, e.g. a bit flip caused by a cosmic-ray-induced neutron. (Hengartner et al., 2008)

SDC occurs when incorrect data is delivered by a computing system to the user without any error being logged (Cristian Constantinescu, AMD)
Silent errors are the black swan of errors (Marc Snir)
Should we be afraid? (courtesy Al Geist)
Probability distributions for silent errors

Theorem: µp = µind/p for arbitrary distributions
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
Coupling checkpointing and verification
Application-specific methods
5 Conclusion
General-purpose approach

[Timeline: silent error strikes after time Xe, detected after latency Xd]

Error and detection latency

Last checkpoint may have saved an already corrupted state

Saving k checkpoints (Lu, Zheng and Chien):

① Critical failure when all live checkpoints are invalid
Assume unlimited storage resources

② Which checkpoint to roll back to?
Assume verification mechanism
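The risk that all k stored checkpoints are corrupted can be illustrated with a small Monte Carlo sketch (not from the tutorial: the exponential detection latency, the periodic checkpoint dates, and the function name are all modeling assumptions of this example):

```python
import random

def critical_failure_prob(k, period, mean_latency, trials=20000, seed=42):
    """Estimate the probability that a silent error outlives the last k
    checkpoints, i.e. every stored checkpoint is already corrupted.
    Assumes exponentially distributed detection latency and checkpoints
    taken every `period` time units."""
    rng = random.Random(seed)
    critical = 0
    for _ in range(trials):
        latency = rng.expovariate(1.0 / mean_latency)
        # time from the error to the next checkpoint is uniform in [0, period);
        # all k checkpoints are corrupted if the latency spans k checkpoint dates
        to_next = rng.uniform(0.0, period)
        if latency > to_next + (k - 1) * period:
            critical += 1
    return critical / trials
```

Increasing k or the checkpoint period lowers the risk, matching the Prisk discussion a few slides later.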
Optimal period?

[Timeline: error and detection latency]

Xe: inter-arrival time between errors; mean time µe

Xd: error detection time; mean time µd

Assume Xd and Xe independent
Arbitrary distribution

Waste[ff] = C/T

Waste[fail] = (T/2 + R + µd)/µe

Only valid if T/2 + R + µd ≪ µe

Theorem

Best period is Topt ≈ √(2µeC)

Independent of Xd
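The theorem translates directly into a few lines (a sketch; the waste expression is the first-order one from this slide, and the function names are mine):

```python
from math import sqrt

def optimal_period(mu_e, C):
    """Best checkpointing period Topt ~ sqrt(2 * mu_e * C);
    note it does not involve the detection time Xd at all."""
    return sqrt(2.0 * mu_e * C)

def waste(T, C, R, mu_d, mu_e):
    """Waste[ff] + Waste[fail] = C/T + (T/2 + R + mu_d)/mu_e,
    valid only when T/2 + R + mu_d << mu_e."""
    return C / T + (T / 2.0 + R + mu_d) / mu_e
```

For example, with µe = 24 h and C = 60 s the optimal period is about 54 minutes, regardless of µd.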
Exponential distribution

Theorem

At the end of the day,

E(T(w)) = e^(λeR) (µe + µd) (e^(λe(w+C)) − 1)

Optimal period independent of µd

Good approximation is T = √(2µeC) (Young's formula)
The case with limited resources

Assume that we can only save the last k checkpoints

Definition (Critical failure)

Error detected when all checkpoints contain corrupted data. Happens with probability Prisk during the whole execution.

Prisk decreases when T increases (when Xd is fixed). Hence, Prisk ≤ ε leads to a lower bound Tmin on T

Can derive an analytical form for Prisk when Xd follows an Exponential law. Use it as a good(?) approximation for arbitrary laws
Limitation of the model

It is not clear how to detect when the error has occurred (hence to identify the last valid checkpoint) ☹

Need a verification mechanism to check the correctness of the checkpoints. This has an additional cost!
Coupling checkpointing and verification

Verification mechanism of cost V

Silent errors detected only when verification is executed

Approach agnostic of the nature of the verification mechanism (checksum, error correcting code, coherence tests, etc.)

Fully general-purpose (application-specific information, if available, can always be used to decrease V)
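As a deliberately simple instance of such a mechanism, a checksum can be attached to each checkpoint (a sketch only; a real detector would typically also target the live application data, and the function names here are mine):

```python
import pickle
import zlib

def take_checkpoint(state):
    """Serialize the state and attach a CRC32 checksum; computing the
    checksum plays the role of the verification cost V."""
    blob = pickle.dumps(state)
    return blob, zlib.crc32(blob)

def verify_checkpoint(blob, checksum):
    """True when the stored bytes still match their checksum."""
    return zlib.crc32(blob) == checksum
```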
On-line ABFT scheme for PCG

Zizhong Chen, PPoPP'13

Iterate PCG. Cost: SpMV, preconditioner solve, 5 linear kernels

Detect soft errors by checking orthogonality and residual

Verification every d iterations. Cost: scalar product + SpMV

Checkpoint every c iterations. Cost: three vectors, or two vectors + SpMV at recovery

Experimental method to choose c and d
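The residual part of Chen's check can be sketched on plain CG (all simplifications here are mine: no preconditioner, small dense matrix, no checkpoint/rollback, no orthogonality check): every d iterations, recompute the true residual b − Ax and compare it with the recurrence-maintained one.

```python
def cg_with_residual_check(A, b, d=2, iters=100, tol=1e-12):
    """Conjugate Gradient on a small dense SPD system with periodic
    verification: every d iterations recompute b - A x (one extra
    matrix-vector product) and compare it with the residual maintained
    by the CG recurrence; a silent error makes the two diverge."""
    n = len(b)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = list(b)          # residual maintained by the recurrence
    p = list(r)
    for k in range(1, iters + 1):
        Ap = mv(A, p)
        alpha = dot(r, r) / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        if k % d == 0:   # verification step: one SpMV + a comparison
            true_r = [bi - axi for bi, axi in zip(b, mv(A, x))]
            if max(abs(t - s) for t, s in zip(true_r, r_new)) > 1e-6:
                raise RuntimeError("silent error suspected at iteration %d" % k)
        if dot(r_new, r_new) < tol:
            return x
        beta = dot(r_new, r_new) / dot(r, r)
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
    return x
```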
Base pattern (and revisiting Young/Daly)

[Timeline: periods of work W, each followed by V C; fault, then later detection]

              Fail-stop (classical)     Silent errors
Pattern       T = W + C                 S = W + V + C
Waste[ff]     C/T                       (V+C)/S
Waste[fail]   (1/µ)(D + R + W/2)        (1/µ)(R + W + V)
Optimal       Topt = √(2Cµ)             Sopt = √((C+V)µ)
Waste[opt]    √(2C/µ)                   2√((C+V)/µ)
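The two columns of the table can be compared numerically (a sketch; the sample values and function names are mine):

```python
from math import sqrt

def opt_fail_stop(C, mu):
    """Classical column: Topt = sqrt(2*C*mu), Waste[opt] ~ sqrt(2*C/mu)."""
    return sqrt(2.0 * C * mu), sqrt(2.0 * C / mu)

def opt_silent(C, V, mu):
    """Silent-error column: every checkpoint preceded by a verification,
    Sopt = sqrt((C+V)*mu), Waste[opt] ~ 2*sqrt((C+V)/mu)."""
    return sqrt((C + V) * mu), 2.0 * sqrt((C + V) / mu)
```

Even with V = 0, guarding against silent errors costs a factor √2 more waste than the fail-stop optimum, because a rollback loses a full pattern on average rather than half of one.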
With p = 1 checkpoint and q = 3 verifications

[Timeline: pattern V C V V V C, with equal chunks of work w]

Base Pattern p = 1, q = 1: Waste[opt] = 2√((C+V)/µ)

New Pattern p = 1, q = 3: Waste[opt] = 2√(4(C+3V)/(6µ))
BalancedAlgorithm

[Timeline: pattern V C V V C V V V C, with chunks 2w 2w w w 2w 2w]

p checkpoints and q verifications, p ≤ q

p = 2, q = 5, S = 2C + 5V + W

W = 10w, six chunks of size w or 2w

May store invalid checkpoint (error during third chunk)

After successful verification in fourth chunk, preceding checkpoint is valid

Keep only two checkpoints in memory and avoid any fatal failure
BalancedAlgorithm

[Timeline: pattern V C V V C V V V C, with chunks 2w 2w w w 2w 2w]

① (proba 2w/W) Tlost = R + 2w + V
② (proba 2w/W) Tlost = R + 4w + 2V
③ (proba w/W) Tlost = 2R + 6w + C + 4V
④ (proba w/W) Tlost = R + w + 2V
⑤ (proba 2w/W) Tlost = R + 3w + 2V
⑥ (proba 2w/W) Tlost = R + 5w + 3V

Waste[opt] ≈ 2√(7(2C + 5V)/(20µ))
Analysis

Key parameters

o_ff: failure-free overhead per pattern

f_re: fraction of work that is re-executed

Waste[ff] = o_ff/S, where o_ff = pC + qV and S = o_ff + pqw ≪ µ

Waste[fail] = Tlost/µ, where Tlost = f_re·S + β

β: constant, linear combination of C, V and R

Waste ≈ o_ff/S + f_re·S/µ ⇒ Sopt ≈ √((o_ff/f_re)·µ)

Waste[opt] = 2√(o_ff·f_re/µ) + o(√(1/µ))
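The generic first-order optimum can be coded once and instantiated for any pattern (a sketch; names are mine):

```python
from math import sqrt

def pattern_optimum(o_ff, f_re, mu):
    """First-order optimum of Waste ~ o_ff/S + f_re*S/mu:
    returns (Sopt, Waste[opt]) = (sqrt(o_ff*mu/f_re), 2*sqrt(o_ff*f_re/mu))."""
    return sqrt(o_ff * mu / f_re), 2.0 * sqrt(o_ff * f_re / mu)
```

With o_ff = C + V and f_re = 1 this recovers the base-pattern waste 2√((C+V)/µ); with o_ff = 2C + 5V and f_re = 7/20 it recovers the BalancedAlgorithm waste of the previous slide.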
Computing f_re when p = 1

[Timeline: chunks α1W, α2W, α3W between verifications, pattern V C V V V C]

Theorem

The minimal value of f_re(1, q) is obtained for same-size chunks

f_re(1, q) = Σ_{i=1..q} αi (Σ_{j=1..i} αj)

Minimal when αi = 1/q

In that case, f_re(1, q) = (q+1)/(2q)
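The formula for f_re(1, q) is easy to evaluate for arbitrary chunk sizes (a sketch; the function name is mine):

```python
def f_re_single_ckpt(alphas):
    """f_re(1, q) = sum_i alpha_i * (sum_{j<=i} alpha_j) for chunk
    fractions alphas summing to 1: an error striking chunk i
    re-executes everything since the pattern's checkpoint."""
    total, prefix = 0.0, 0.0
    for a in alphas:
        prefix += a
        total += a * prefix
    return total
```

Equal chunks give (q+1)/(2q); any unequal split re-executes more on average.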
Computing f_re when p ≥ 1

[Timeline: chunks α1W, α2W, α3W, pattern V C C C V C]

Theorem

f_re(p, q) ≥ (p+q)/(2pq); the bound is matched by BalancedAlgorithm.

Assess gain due to the p − 1 intermediate checkpoints:

f_re^(1) − f_re^(p) = Σ_{i=1..p} αi (Σ_{j=1..i−1} αj)

Maximal when αi = 1/p for all i

In that case, f_re^(1) − f_re^(p) = (p−1)/(2p)

Now best with equipartition of verifications too

In that case, f_re^(1) = (q+1)/(2q) and f_re^(p) = (q+1)/(2q) − (p−1)/(2p) = (p+q)/(2pq)
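The closed forms above chain together; a quick consistency check (a sketch; the function names are mine):

```python
def f_re_balanced(p, q):
    """Minimal re-executed fraction (p+q)/(2*p*q), achieved by
    BalancedAlgorithm with equipartitioned checkpoints and verifications."""
    return (p + q) / (2.0 * p * q)

def f_re_gain(p):
    """Gain of the p-1 intermediate checkpoints over a single one:
    (p-1)/(2*p) with equal-size segments."""
    return (p - 1) / (2.0 * p)
```

For every (p, q), (q+1)/(2q) minus the gain equals (p+q)/(2pq), as on the slide.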
Choosing optimal pattern

Let V = γC, where 0 < γ ≤ 1

o_ff·f_re = ((p+q)/(2pq))(pC + qV) = C × ((p+q)/2)(1/q + γ/p)

Given γ, minimize ((p+q)/2)(1/q + γ/p) with 1 ≤ p ≤ q, and p, q taking integer values

Let p = λ × q. Then λopt = √γ = √(V/C)
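The integer minimization is small enough to brute-force, and the search confirms the λopt = √γ rule (a sketch; the search bound and function name are mine):

```python
def best_pattern(gamma, q_max=20):
    """Minimize (p+q)/2 * (1/q + gamma/p) over integers 1 <= p <= q.
    The objective depends only on the ratio p/q, and the continuous
    optimum is p/q = sqrt(gamma)."""
    best = None
    for q in range(1, q_max + 1):
        for p in range(1, q + 1):
            obj = (p + q) / 2.0 * (1.0 / q + gamma / p)
            if best is None or obj < best[0] - 1e-12:
                best = (obj, p, q)
    return best
```

For γ = 1/4, the search returns p = 1, q = 2, matching √γ = 1/2.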
Summary

[Timeline: BalancedAlgorithm pattern V C V V C V V V C]

BalancedAlgorithm optimal when C, R, V ≪ µ

Keep only 2 checkpoints in memory/storage

Closed-form formula for Waste[opt]

Given C and V, choose optimal pattern

Gain of up to 20% over base pattern
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
Coupling checkpointing and verification
Application-specific methods
5 Conclusion
Literature
ABFT: dense matrices / fail-stop, extended to sparse / silent. Limited to one error detection and/or correction in practice

Asynchronous (chaotic) iterative methods (old work)

Partial differential equations: use lower-order scheme as verification mechanism (detection only; Benson, Schmit and Schreiber)

FT-GMRES: inner-outer iterations (Hoemmen and Heroux)

PCG: orthogonalization check every k iterations, re-orthogonalization if problem detected (Sao and Vuduc)
. . . Many others
Dynamic programming for linear chains of tasks
{T1, T2, . . . , Tn}: linear chain of n tasks

Each task Ti fully parametrized:

wi: computational weight
Ci, Ri, Vi: checkpoint, recovery, verification

Error rates:

λF: rate of fail-stop errors
λS: rate of silent errors
VC-only

[Timeline: checkpoints after tasks i and j; Timerec_C(i, k − 1) followed by TC(i + 1, j)]

min_{0≤k<n} Timerec_C(n, k)

Timerec_C(j, k) = min_{k≤i<j} { Timerec_C(i, k − 1) + T^SF_C(i + 1, j) }

T^SF_C(i, j) = p^F_{i,j} (Tlost_{i,j} + R_{i−1} + T^SF_C(i, j))
             + (1 − p^F_{i,j}) ( Σ_{ℓ=i..j} wℓ + Vj + p^S_{i,j} (R_{i−1} + T^SF_C(i, j)) + (1 − p^S_{i,j}) Cj )
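A stripped-down cousin of this dynamic program — fail-stop errors only, uniform C and R, no verifications — already shows the structure (a sketch under those simplifying assumptions, using the exponential closed form of the earlier slide with µd = 0; all names are mine):

```python
from math import exp

def segment_time(w, lam, R, C):
    """Expected time to run work w then checkpoint C under fail-stop
    errors of rate lam with restart cost R (exponential closed form)."""
    return exp(lam * R) * (exp(lam * (w + C)) - 1.0) / lam

def place_checkpoints(w, lam, R, C):
    """E[j] = min over the last checkpoint position i of
    E[i] + segment_time(w_{i+1} + ... + w_j); returns the optimal
    expected makespan and the chosen segment starts."""
    n = len(w)
    E = [0.0] * (n + 1)
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        E[j], cut[j] = min(
            (E[i] + segment_time(sum(w[i:j]), lam, R, C), i)
            for i in range(j))
    return E[n], cut
```

With a negligible error rate, the DP places no intermediate checkpoint; with a high rate, it checkpoints aggressively.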
Young/Daly

[Timeline: TimeFinal = TimeFF + TimeFinal × WasteFail; each period holds T − V − C units of work, then V, then C]

Waste = Waste[ff] + Waste[fail]

Waste = (V + C)/T + λF(s)(R + T/2) + λS(s)(R + T)

Topt = √( 2(V + C) / (λF(s) + 2λS(s)) )
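The combined optimum is again one line (a sketch; the waste expression is the one above, and the function names are mine):

```python
from math import sqrt

def waste_combined(T, V, C, R, lam_f, lam_s):
    """Waste = (V+C)/T + lam_F*(R + T/2) + lam_S*(R + T)."""
    return (V + C) / T + lam_f * (R + T / 2.0) + lam_s * (R + T)

def topt_combined(V, C, lam_f, lam_s):
    """Topt = sqrt(2*(V+C) / (lam_F + 2*lam_S)); with lam_S = 0 this
    degenerates to the Young/Daly period sqrt(2*(V+C)/lam_F)."""
    return sqrt(2.0 * (V + C) / (lam_f + 2.0 * lam_s))
```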
Extensions

VC-only and VC+V

Different speeds with DVFS, different error rates

Different execution modes

Optimize for time or for energy consumption

Current research

Use verification to correct some errors (ABFT) ⇒ similar analysis, smaller error rate but higher verification cost

Use partial detectors (recall < 1)

Address both fail-stop and silent errors simultaneously (multi-level patterns)
A few questions

Silent errors

Error rate? MTBE?

Selective reliability?

New algorithms beyond iterative? matrix-product, FFT, ...

Multi-level patterns for both fail-stop and silent errors

Resilient research on resilience

Models needed to assess techniques at scale without bias ☺
Outline
1 Introduction
2 Checkpointing
3 Forward-recovery techniques
4 Silent errors
5 Conclusion
Conclusion
Multiple approaches to Fault Tolerance

Application-Specific Fault Tolerance will always provide more benefits:

Checkpoint size reduction (when needed)
Portability (can run on different hardware, different deployment, etc.)
Diversity of use (can be used to restart the execution and change parameters in the middle)
Conclusion
Multiple approaches to Fault Tolerance
General Purpose Fault Tolerance is a required feature of the platforms

Not every computer scientist needs to learn how to write fault-tolerant applications
Not all parallel applications can be ported to a fault-tolerant version

Faults are a feature of the platform. Why should it be the role of the programmers to handle them?
Conclusion
Application-Specific Fault Tolerance
Fault Tolerance is introducing redundancy in the application:

replication of computation
maintaining invariants in the data

Requirements for a more fault-friendly programming environment:

MPI-Next evolution
Other programming environments?
Conclusion
General Purpose Fault Tolerance
Software/hardware techniques to reduce checkpoint, recovery, and migration times and to improve failure prediction

Multi-criteria scheduling problem:
execution time/energy/reliability
add replication
best resource usage (performance trade-offs)

Need to combine all these approaches!

Several challenging algorithmic/scheduling problems ☺
Bibliography
Exascale
• Toward Exascale Resilience, Cappello F. et al., IJHPCA 23, 4 (2009)
• The International Exascale Software Roadmap, Dongarra J., Beckman P. et al., IJHPCA 25, 1 (2011)

ABFT
• Algorithm-based fault tolerance applied to high performance computing, Bosilca G. et al., JPDC 69, 4 (2009)

Coordinated Checkpointing
• Distributed snapshots: determining global states of distributed systems, Chandy K.M., Lamport L., ACM Trans. Comput. Syst. 3, 1 (1985)

Message Logging
• A survey of rollback-recovery protocols in message-passing systems, Elnozahy E.N. et al., ACM Comput. Surveys 34, 3 (2002)

Replication
• Evaluating the viability of process replication reliability for exascale systems, Ferreira K. et al., SC'2011

Models
• Checkpointing strategies for parallel jobs, Bougeret M. et al., SC'2011
• Unified model for assessing checkpointing protocols at extreme-scale, Bosilca G. et al., CCPE 26(17), pp. 2772-2791 (2014)
Bibliography
New Monograph, Springer Verlag 2015