Resilient Solvers for Partial Differential Equations
Miroslav Stoyanov
Eirik Endeve
Clayton Webster
Oak Ridge National Lab
July 6, 2014
Extreme-scale Algorithms & Software Resilience (EASIR): ERKJ245
Al Geist (ORNL), Mike Heroux (SNL)
Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 1/32
Extreme Scale Equations
Consider the generalized large-scale linear differential equation, either steady state,

    −AX = F,

or time dependent,

    (d/dt) X = AX + F.
The differential operator A can represent the complex dynamics of multiple coupled phenomena (e.g., multi-physics).
Finding the solution X requires a powerful supercomputer.
Supercomputing Speed vs Reliability
Hardware Failure
- Operation at near-critical voltage, ever-shrinking fabrication size, and an increasing number of components
- Increased computational throughput leads to decreased reliability
- Soft faults are already causing problems in the largest scale computations (e.g., climate)
- In the near future we expect the mean failure time to be on the order of an application runtime
Hardware Error Control
- DRAM has built-in error detection and correction
- Network communication has CRC checksums
- CPU cache and logical circuits have virtually no protection
Software Abstraction
Results of Faults
- The code crashes and no result is given
- The code completes execution, but returns an invalid result (e.g., inf or nan)
- The code completes execution and returns a result that "appears" valid:
2 + 3 = 7
Software Error Control
- Check-pointing: how do we detect silent faults?
- Check-summing: in a heterogeneous computing environment, we cannot check-sum floating point operations.
- Post-processing: detection comes too late.
- Redundant computations: doable only at limited scale.
The Cutting Edge Approach
Hoemmen, M. and Heroux, M. A., "Fault-tolerant iterative methods via selective reliability," Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE Computer Society, 2011.

The code is split into two modes with different reliability:
- Fast Mode: the bulk of the computations should be performed in this mode; any "fast" computation is susceptible to hardware faults.
- Safe Mode: only the bare minimum of work is done in this mode; computations are assumed to be completely reliable.

- A sanity check is performed in Safe Mode to guard against the introduction of large hardware errors
- Natural self-correcting properties are exploited to correct hardware errors
Challenges of Selective Reliability
- Rigorously quantify the impact of soft faults
- Analyze the convergence rate
- Devise optimal accept/reject criteria
- Determine what type and rate of faults can be corrected
Analytic Fault Model
- Algorithms are comprised of steps
- Steps compute intermediate answers → final solution
- Each step is performed in either fast or safe mode
- Safe mode always returns the correct answer
- Fast mode is susceptible to faults: there is probability p of computing something incorrect (i.e., we should get x, but instead we get x + x̃)
- If the intermediate result is corrupted, we make no assumption on the distribution of the perturbation
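As a sketch, the fault model above can be expressed in code; `fast_mode`, `safe_mode`, and `perturb` are illustrative names, not part of the original work:

```python
import random

def fast_mode(step, x, p, perturb):
    """Run one algorithm step in 'fast' mode: with probability p the
    intermediate result is corrupted by an arbitrary perturbation."""
    result = step(x)
    if random.random() < p:
        result = result + perturb()  # x + x_tilde, distribution unspecified
    return result

def safe_mode(step, x):
    """Run one step in 'safe' mode: always returns the correct answer."""
    return step(x)
```

With p = 0 fast mode coincides with safe mode; with p = 1 every fast-mode step is corrupted.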
Convergence
    ‖X_exact − X_computed‖ ≤ E_rounding + E_discrete + E_iterative + E_hardware

- E_rounding → 0 with more precision, i.e., single, double, quad
- E_discrete → 0 with more nodes, i.e., a denser mesh
- E_iterative → 0 with more solver iterations
- In all cases, more work means a better approximation
- We want E_hardware → 0 as we do more work

Definition (Convergence with Respect to Hardware Error): A method is convergent on specific hardware if the statistical moments of the hardware error converge to zero as the computational cost increases, regardless of the distribution of the perturbation.
Steady State Method: GMRES
Consider the steady state problem

    AX = F,

which we discretize into a system of linear equations

    A|N x|N = b|N,

where

    A|N → A,   b|N → F,   x|N → X.
We need a resilient linear solver!
GMRES
Given the linear system of equations
Ax = b,
general iterative solvers generate a sequence
xk → x.
We perform K iterations so that tolerance ε is reached
‖b−AxK‖ < ε.
The Krylov basis is expensive to store and manipulate, so we use a restarted scheme:

    {{x_k^m}_{k=0}^K}_{m=1}^M
Fault-Tolerant GMRES
Due to M. Hoemmen and M. Heroux:

    Given A, b, x^0, and ε > 0
    e^0 = ‖b − A x^0‖
    m = 0
    while e^m > ε do
        Compute x_K^{m+1}
        e^{m+1} = ‖b − A x_K^{m+1}‖
        if e^{m+1} > e^m then
            discard x_K^{m+1} and h^{m+1}, and redo the last inner loop
        else
            accept x_K^{m+1} and advance to the next m = m + 1
        end if
    end while
Fault Free Convergence
The residual is e_k^m = ‖b − A x_k^m‖. Let

    Λ+ = {λ : Av = λv, Re(λ) > 0},   Λ+ ⊂ disk(D, d/2),
    Λ− = {λ : Av = λv, Re(λ) ≤ 0},   ν = |Λ−|.

Define the magnification and reduction factors

    R = cond(A) ( max_{λ+∈Λ+, λ−∈Λ−} |λ+ − λ−| / min_{λ−∈Λ−} |λ−| )^ν (D/d)^ν,   r = d/D.

Then,

    e_K^M ≤ (R r^K)^M e^0 = R^M r^{M·K} e^0.
Resilient Convergence
Let M* be such that

    e_K^{M*} ≤ ε,   M* = ⌈ (log ε − log e^0) / (log R + K log r) ⌉.

- Hardware faults do not increase the residual of FT-GMRES.
- The method will reach the desired tolerance ε after M* fault-free outer iterations.
- Analyzing FT-GMRES convergence is equivalent to analyzing the probability of computing M* fault-free iterations in M tries.
Resilient Convergence
- Let p be the probability of encountering a failure in the supercomputer over a unit of time equal to a single inner loop iteration.
- Assume the fault rate is uncorrelated in time.
- The probability that K inner loop iterations complete without a failure is

    s(K) ≈ (1 − p)^K.

Then the probability of encountering the M*-th fault-free iteration on attempt M ≥ M* is

    P(M) = C(M−1, M*−1) s(K)^{M*} (1 − s(K))^{M−M*}.
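This negative-binomial expression can be checked numerically; the helper names `s` and `P` below simply mirror the notation above:

```python
from math import comb

def s(K, p):
    """Probability that K inner iterations complete without a fault."""
    return (1.0 - p) ** K

def P(M, M_star, sK):
    """Negative-binomial probability that the M*-th fault-free outer
    iteration occurs on attempt M (M >= M*)."""
    return comb(M - 1, M_star - 1) * sK**M_star * (1.0 - sK) ** (M - M_star)
```

Summing P(M) over M recovers total probability 1 and mean M*/s(K), matching the next slide.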
Statistical Convergence
The expected number of iterations needed for convergence is

    E[M] = Σ_{M=M*}^∞ M C(M−1, M*−1) s(K)^{M*} (1 − s(K))^{M−M*} = M* / s(K).

The variance is bounded by

    V[M] = Σ_{M=M*}^∞ (M − E[M])² P(M) ≤ M* / s(K)².

The probability of FT-GMRES converging approaches 1 as M → ∞.
Computational Cost
Let W be the cost associated with computing the residual in safe mode (e.g., for triple redundancy, W = 3). Then, the total cost is

    C = M(K + W),

and the expected cost is

    E[C] = (M* / s(K)) (K + W) = ⌈ (log ε − log e^0) / (log R + K log r) ⌉ (K + W) / (1 − p)^K.

There is an optimal restart rate, the global minimizer of E[C], that is independent of ε and e^0.
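A small numerical sketch of this minimization; the names `expected_cost_shape` and `optimal_restart` are hypothetical, and the constant factor involving ε and e^0 is dropped since it does not affect the minimizer:

```python
import math

def expected_cost_shape(K, W, p, R, r):
    """K-dependent part of E[C]: (K + W) / ((1-p)^K * |log R + K log r|).
    Only meaningful once K is large enough that R * r**K < 1, i.e. the
    restarted solver actually contracts."""
    decay = math.log(R) + K * math.log(r)
    if decay >= 0:
        return math.inf  # restart too short: no contraction
    return (K + W) / ((1.0 - p) ** K * (-decay))

def optimal_restart(W, p, R, r, K_max=200):
    """Scan integer restart lengths for the minimizer of E[C]."""
    return min(range(1, K_max + 1),
               key=lambda K: expected_cost_shape(K, W, p, R, r))
```

Larger K amortizes the safe-mode cost W but raises the chance (1 − (1−p)^K) that the restart cycle is wasted, hence an interior optimum.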
Example:
Consider the boundary value problem

    0.2 X_ξξ + X_ξ = 0,   X(0) = 0,   X(1) = 1.

We discretize the problem using a finite difference scheme, resulting in a tridiagonal linear system for (x_1, x_2, ..., x_{N−1})^T with

    sub-diagonal entries   −0.2,
    diagonal entries        0.4 + 1/N,
    super-diagonal entries −0.2 − 1/N,

and right-hand side (0, 0, ..., 0, 0.2)^T.
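A sketch of assembling and solving this system with NumPy; `build_system` is an illustrative helper name:

```python
import numpy as np

def build_system(N):
    """Assemble the tridiagonal finite-difference system from the slide:
    sub-diagonal -0.2, diagonal 0.4 + 1/N, super-diagonal -(0.2 + 1/N),
    right-hand side (0, ..., 0, 0.2)^T, for unknowns x_1 .. x_{N-1}."""
    n = N - 1
    A = np.zeros((n, n))
    np.fill_diagonal(A, 0.4 + 1.0 / N)
    np.fill_diagonal(A[1:], -0.2)                # sub-diagonal
    np.fill_diagonal(A[:, 1:], -0.2 - 1.0 / N)   # super-diagonal
    return A, np.concatenate([np.zeros(n - 1), [0.2]])
```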
Faults w/o Resilience
[Two semilog plots: Residual Norm vs. Outer Iteration]

Figure: Left: fault-free GMRES convergence. Right: faulty GMRES convergence (or the lack thereof). N = 101, p = 0.05, K = 20, s(K) ≈ 0.36.
The injected faults are sampled as

    x̃ = (g / ‖g‖) · 10^u,   g ∼ N(0, 1)^N,   u ∼ U(−9, 19).
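A sketch of sampling this perturbation with NumPy (`sample_fault` is an illustrative name):

```python
import numpy as np

def sample_fault(N, rng):
    """Sample a perturbation x_tilde = (g/||g||) * 10**u with
    g ~ N(0,1)^N and u ~ U(-9, 19), as in the experiments: a random
    direction with magnitude spanning thirty orders of magnitude."""
    g = rng.standard_normal(N)
    u = rng.uniform(-9.0, 19.0)
    return g / np.linalg.norm(g) * 10.0**u
```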
FT-GMRES: Convergence
[Two semilog plots: Residual Norm vs. Outer Iteration]

Figure: Two realizations of FT-GMRES; despite the high fault rate, the algorithm converges reliably.
FT-GMRES: Cost
[Plot: expected cost vs. number of inner iterations K, for K from 0 to 40]

Figure: Expected cost of the FT-GMRES algorithm for different values of K. The expectation is computed using Monte Carlo sampling.
FT-GMRES Convergence
- We have an analytic proof of convergence for the FT-GMRES method.
- The method converges for virtually any fault rate p.
- The computational cost and convergence rate are strongly influenced by the fault rate p and the restart rate K.
- When choosing the restart rate K, we need to consider the fault rate p in addition to the eigenstructure of the matrix A and storage limitations.
Time Integration: First Order
Consider the time dependent PDE

    (d/dt) X = AX + F,

which we discretize into a system of ordinary differential equations

    (d/dt) x|N = A|N x|N + b|N,

where

    A|N → A,   b|N → F,   x|N → X.

Given X(0), we are looking for X(T).
We need a resilient time integrator!
Time Stepping
Consider a first order integration scheme for the system of ODEs

    (d/dt) x = Ax.

Select a time step ∆t and approximate x_n ≈ x(n∆t).

First order Euler time stepping schemes (explicit and implicit):

    x_{n+1} = x_n + ∆t A x_n,   x_{n+1} = x_n + ∆t A x_{n+1}.

Asymptotic convergence rate, for n_f = T/∆t:

    ‖x_{n_f} − x(n_f ∆t)‖ ≤ C_T ∆t.
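The explicit scheme and its first order convergence can be illustrated directly (assuming NumPy; `forward_euler` is a hypothetical helper):

```python
import numpy as np

def forward_euler(A, x0, dt, T):
    """Integrate dx/dt = A x with explicit first order Euler steps."""
    x = x0.copy()
    for _ in range(int(round(T / dt))):
        x = x + dt * (A @ x)
    return x
```

For dx/dt = −x, halving ∆t roughly halves the error at T, consistent with the O(∆t) bound above.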
Single Fault Scenario
Assume the computations encounter a silent fault at step j: instead of x_j we compute a corrupted

    x̃_j = x_j + x̃.

For n > j, all subsequent time steps x̃_n are corrupted.

Then at the final time step we have

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ ‖x̃_{n_f} − x_{n_f}‖ + ‖x_{n_f} − x(n_f ∆t)‖
                         ≤ ‖(I + ∆tA)^{n_f − j} x̃‖ + C_T ∆t
                         ≤ ‖(I + ∆tA)^{n_f − j}‖ ‖x̃‖ + C_T ∆t
                         ≤ B_T ‖x̃‖ + C_T ∆t,

where

    B_T = max(1, ‖(I + ∆tA)^{n_f}‖).
Resilience Enhancement
According to the Mean Value Theorem,

    x_{n+1} − 2x_n + x_{n−1} = (∆t²/2) ( x''(ξ_l) + x''(ξ_r) ),

where ξ_l ∈ [(n−1)∆t, n∆t] and ξ_r ∈ [n∆t, (n+1)∆t].

Therefore,

    ‖x_{n+1} − 2x_n + x_{n−1}‖ ≤ ∆t² α,

for any α ≥ ‖x''‖_∞.
Resilience Enhancement
At each step of the integration we accept x_{n+1} only if

    ‖x_{n+1} − 2x_n + x_{n−1}‖ ≤ ∆t² α.

Thus, if we encounter a fault x̃, it will be rejected unless

    ‖x_{n+1} + x̃ − 2x_n + x_{n−1}‖ ≤ ∆t² α  ⇒  ‖x̃‖ ≤ 2α∆t².

If we encounter a single fault,

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2αB_T ∆t² + C_T ∆t.

We expect to encounter p·n_f faults in n_f = T/∆t iterations; then

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2αpB_T T∆t + C_T ∆t = (2αpB_T T + C_T) ∆t.
Resilient Time Stepping
    Given A, x^0, ∆t, T
    Given α
    Set n = 0
    while n∆t < T do
        Compute x_{n+1}, e.g., x_{n+1} = x_n + ∆tA x_n
        if ‖x_{n+1} − 2x_n + x_{n−1}‖ > α∆t² then
            discard x_{n+1} and redo the last step
        else
            accept x_{n+1} and advance to the next n = n + 1
        end if
    end while
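A minimal NumPy sketch of this resilient time stepper; `resilient_euler` and the optional `fault` callable (modeling a silent fault) are illustrative names, and the first step is taken unchecked to bootstrap the second-difference test:

```python
import numpy as np

def resilient_euler(A, x0, dt, T, alpha, fault=None):
    """Forward Euler where a step is accepted only if the second
    difference ||x_{n+1} - 2 x_n + x_{n-1}|| stays below alpha*dt^2;
    rejected (corrupted) steps are simply recomputed."""
    x_prev, x = x0, x0 + dt * (A @ x0)   # bootstrap: first step unchecked
    n = 1
    steps = int(round(T / dt))
    while n < steps:
        x_new = x + dt * (A @ x)
        if fault is not None:
            x_new = fault(x_new)         # possible silent corruption
        if np.linalg.norm(x_new - 2 * x + x_prev) > alpha * dt**2:
            continue  # reject the corrupted step and redo it
        x_prev, x = x, x_new
        n += 1
    return x
```

A fault larger than 2α∆t² trips the second-difference test, so the run with one injected fault reproduces the fault-free trajectory.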
Example:
Consider the heat transfer equation

    (d/dt) X = 10^−3 X_ξξ,   X(t, 0) = X(t, 1) = 0,   X(0, ξ) = ξ(1 − ξ).

We discretize the problem using a finite difference scheme, resulting in the system of linear ODEs

    (d/dt) x = 10^−3 (N + 1)² tridiag(1, −2, 1) x,

where tridiag(1, −2, 1) is the (N−1)×(N−1) tridiagonal matrix with 1 on the sub- and super-diagonals and −2 on the diagonal, and x = (x_1, ..., x_{N−1})^T.

For the numerical experiments we use N = 1024.
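A sketch of assembling this ODE system's matrix (assuming NumPy; `heat_matrix` is an illustrative name):

```python
import numpy as np

def heat_matrix(N):
    """Finite-difference operator for the heat example:
    10^-3 * (N+1)^2 * tridiag(1, -2, 1), of size (N-1) x (N-1)."""
    n = N - 1
    L = np.zeros((n, n))
    np.fill_diagonal(L, -2.0)
    np.fill_diagonal(L[1:], 1.0)      # sub-diagonal
    np.fill_diagonal(L[:, 1:], 1.0)   # super-diagonal
    return 1e-3 * (N + 1) ** 2 * L
```

The matrix is symmetric negative definite, as expected for a diffusion operator.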
Classical Backward Euler
[Semilog plot: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", with curves for a single perturbation and for multiple perturbations]

Figure: Lack of convergence for the classical backward Euler scheme for one and multiple faults when p = 0.1.
Resilient Time Stepping
[Semilog plot: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", with curves for p = 0.0, 0.1, 0.2, 0.4, 0.8]

Figure: Convergence for the resilient time stepping method for different values of p.
Time Stepping Conclusions
- We have a resilient first order time-stepping algorithm.
- The method extends to non-linear problems as well as variable time steps ∆t (so long as we have continuity w.r.t. the initial data).
- The resilient method preserves the linear convergence rate even in the presence of faults.
Thank You!
Questions?
Stoyanov, M. and Webster, C., "Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults," Oak Ridge National Laboratory, Tech. Report ORNL/TM-2013/283.