Resilient Solvers for Partial Differential Equations
Miroslav Stoyanov
Eirik Endeve
Clayton Webster
Oak Ridge National Lab
July 6, 2014
Extreme-scale Algorithms & Software Resilience (EASIR): ERKJ245
Al Geist (ORNL), Mike Heroux (SNL)
Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 1/32
Extreme Scale Equations
Consider the generalized large-scale linear differential equation, either steady state,

    −AX = F,

or time dependent,

    (d/dt) X = AX + F.
The differential operator A can represent the complex dynamics of multiple coupled phenomena (e.g., multi-physics).
Finding the solution X requires a powerful supercomputer.
Supercomputing Speed vs Reliability
Hardware Failure
- Operation at near-critical voltage, ever-shrinking fabrication size, and an increasing number of components
- Increased computational throughput leads to decreased reliability
- Soft faults are already causing problems in the largest scale computations (e.g., climate)
- In the near future we expect the mean failure time to be on the order of an application runtime
Hardware Error Control
- DRAM has built-in error detection and correction
- Network communication has CRC checksums
- CPU cache and logical circuits have virtually no protection
Software Abstraction
Results of Faults
- The code crashes and no result is given
- The code completes execution, but returns an invalid result (e.g., inf or nan)
- The code completes execution and returns a result that "appears" valid:
2 + 3 = 7
Software Error Control
- Check-pointing: how do we detect silent faults?
- Check-summing: in a heterogeneous computing environment, we cannot check-sum floating point operations.
- Post-processing: detection comes too late.
- Redundant computations: doable only at limited scale.
The Cutting Edge Approach
Hoemmen, M. and Heroux, M. A., "Fault-tolerant iterative methods via selective reliability," Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE Computer Society, 2011.

The code is split into two modes with different reliability:
- Fast Mode: the bulk of the computations should be performed in this mode; any "fast" computation is susceptible to hardware faults.
- Safe Mode: only the bare minimum of work is done in this mode; computations are assumed to be completely reliable.

- A sanity check is performed in Safe Mode to guard against the introduction of large hardware errors
- Natural self-correcting properties are exploited to correct hardware errors
Challenges of Selective Reliability
- Rigorously quantify the impact of soft faults
- Analyze the convergence rate
- Devise optimal accept/reject criteria
- Determine what type and rate of faults can be corrected
Analytic Fault Model
- Algorithms are comprised of steps
- Steps compute intermediate answers → final solution
- Each step is performed in either fast or safe mode
- Safe mode always returns the correct answer
- Fast mode is susceptible to faults: there is probability p of computing something incorrect (i.e., we should get x, but instead we get x + x̃)
- If the intermediate result is corrupted, we make no assumption on the distribution of the perturbation
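As a sketch, the fault model above can be expressed in code; `fast_mode`, `safe_mode`, and `perturb` are illustrative names, not part of the original work:

```python
import random

def fast_mode(step, x, p, perturb):
    """Run one algorithm step in 'fast' mode: with probability p the
    intermediate result is corrupted by an arbitrary perturbation."""
    result = step(x)
    if random.random() < p:
        result = result + perturb()  # x + x_tilde, distribution unspecified
    return result

def safe_mode(step, x):
    """Run one step in 'safe' mode: always returns the correct answer."""
    return step(x)
```

With p = 0 fast mode coincides with safe mode; with p = 1 every fast-mode step is corrupted.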
Convergence
    ‖X_exact − X_computed‖ ≤ E_rounding + E_discrete + E_iterative + E_hardware

- E_rounding → 0 with more precision, i.e., single, double, quad
- E_discrete → 0 with more nodes, i.e., a denser mesh
- E_iterative → 0 with more solver iterations
- In all cases, more work means a better approximation
- We want E_hardware → 0 as we do more work

Definition (Convergence with Respect to Hardware Error): A method is convergent on specific hardware if the statistical moments of the hardware error converge to zero as the computational cost increases, regardless of the distribution of the perturbation.
Steady State Method: GMRES
Consider the steady state problem

    AX = F,

which we discretize into a system of linear equations

    A|N x|N = b|N,

where

    A|N → A,   b|N → F,   x|N → X.
We need a resilient linear solver!
GMRES
Given the linear system of equations
Ax = b,
general iterative solvers generate a sequence
xk → x.
We perform K iterations so that tolerance ε is reached
‖b−AxK‖ < ε.
The Krylov basis is expensive to store and manipulate, so we use a restarted scheme:

    {{x_k^m}_{k=0}^K}_{m=1}^M
Fault-Tolerant GMRES
Due to M. Hoemmen and M. Heroux:

    Given A, b, x^0, and ε > 0
    e^0 = ‖b − A x^0‖
    m = 0
    while e^m > ε do
        Compute x_K^{m+1}
        e^{m+1} = ‖b − A x_K^{m+1}‖
        if e^{m+1} > e^m then
            discard x_K^{m+1} and h^{m+1}, and redo the last inner loop
        else
            accept x_K^{m+1} and advance to the next m = m + 1
        end if
    end while
Fault Free Convergence
The residual is e_k^m = ‖b − A x_k^m‖. Let

    Λ+ = {λ : Av = λv, Re(λ) > 0},   Λ+ ⊂ disk(D, d/2),
    Λ− = {λ : Av = λv, Re(λ) ≤ 0},   ν = |Λ−|.

Define the magnification and reduction factors

    R = cond(A) ( max_{λ+∈Λ+, λ−∈Λ−} |λ+ − λ−| / min_{λ−∈Λ−} |λ−| )^ν (D/d)^ν,   r = d/D.

Then,

    e_K^M ≤ (R r^K)^M e^0 = R^M r^{M·K} e^0.
Resilient Convergence
Let M* be such that

    e_K^{M*} ≤ ε,   M* = ⌈ (log ε − log e^0) / (log R + K log r) ⌉.

- Hardware faults do not increase the residual of FT-GMRES.
- The method will reach the desired tolerance ε after M* fault-free outer iterations.
- Analyzing FT-GMRES convergence is equivalent to analyzing the probability of computing M* fault-free iterations in M tries.
Resilient Convergence
- Let p be the probability of encountering a failure in the supercomputer over a unit of time equal to a single inner loop iteration.
- Assume the fault rate is uncorrelated in time.
- The probability that K inner loop iterations complete without a failure is

    s(K) ≈ (1 − p)^K.

Then the probability of encountering the M*-th fault-free iteration on attempt M ≥ M* is

    P(M) = C(M−1, M*−1) s(K)^{M*} (1 − s(K))^{M−M*}.
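This negative-binomial expression can be checked numerically; the helper names `s` and `P` below simply mirror the notation above:

```python
from math import comb

def s(K, p):
    """Probability that K inner iterations complete without a fault."""
    return (1.0 - p) ** K

def P(M, M_star, sK):
    """Negative-binomial probability that the M*-th fault-free outer
    iteration occurs on attempt M (M >= M*)."""
    return comb(M - 1, M_star - 1) * sK**M_star * (1.0 - sK) ** (M - M_star)
```

Summing P(M) over M recovers total probability 1 and mean M*/s(K), matching the next slide.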
Statistical Convergence
The expected number of iterations needed for convergence is

    E[M] = Σ_{M=M*}^∞ M C(M−1, M*−1) s(K)^{M*} (1 − s(K))^{M−M*} = M* / s(K).

The variance is bounded by

    V[M] = Σ_{M=M*}^∞ (M − E[M])² P(M) ≤ M* / s(K)².

The probability of FT-GMRES converging approaches 1 as M → ∞.
Computational Cost
Let W be the cost associated with computing the residual in safe mode (e.g., for triple redundancy, W = 3). Then, the total cost is

    C = M(K + W),

and the expected cost is

    E[C] = (M* / s(K)) (K + W) = ⌈ (log ε − log e^0) / (log R + K log r) ⌉ (K + W) / (1 − p)^K.

There is an optimal restart rate, the global minimizer of E[C], that is independent of ε and e^0.
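A small numerical sketch of this minimization; the names `expected_cost_shape` and `optimal_restart` are hypothetical, and the constant factor involving ε and e^0 is dropped since it does not affect the minimizer:

```python
import math

def expected_cost_shape(K, W, p, R, r):
    """K-dependent part of E[C]: (K + W) / ((1-p)^K * |log R + K log r|).
    Only meaningful once K is large enough that R * r**K < 1, i.e. the
    restarted solver actually contracts."""
    decay = math.log(R) + K * math.log(r)
    if decay >= 0:
        return math.inf  # restart too short: no contraction
    return (K + W) / ((1.0 - p) ** K * (-decay))

def optimal_restart(W, p, R, r, K_max=200):
    """Scan integer restart lengths for the minimizer of E[C]."""
    return min(range(1, K_max + 1),
               key=lambda K: expected_cost_shape(K, W, p, R, r))
```

Larger K amortizes the safe-mode cost W but raises the chance (1 − (1−p)^K) that the restart cycle is wasted, hence an interior optimum.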
Example:
Consider the boundary value problem

    0.2 X_ξξ + X_ξ = 0,   X(0) = 0,   X(1) = 1.

We discretize the problem using a finite difference scheme, resulting in a tridiagonal linear system for (x_1, x_2, ..., x_{N−1})^T with

    sub-diagonal entries   −0.2,
    diagonal entries        0.4 + 1/N,
    super-diagonal entries −0.2 − 1/N,

and right-hand side (0, 0, ..., 0, 0.2)^T.
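A sketch of assembling and solving this system with NumPy; `build_system` is an illustrative helper name:

```python
import numpy as np

def build_system(N):
    """Assemble the tridiagonal finite-difference system from the slide:
    sub-diagonal -0.2, diagonal 0.4 + 1/N, super-diagonal -(0.2 + 1/N),
    right-hand side (0, ..., 0, 0.2)^T, for unknowns x_1 .. x_{N-1}."""
    n = N - 1
    A = np.zeros((n, n))
    np.fill_diagonal(A, 0.4 + 1.0 / N)
    np.fill_diagonal(A[1:], -0.2)                # sub-diagonal
    np.fill_diagonal(A[:, 1:], -0.2 - 1.0 / N)   # super-diagonal
    return A, np.concatenate([np.zeros(n - 1), [0.2]])
```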
Faults w/o Resilience
[Two semilog plots: Residual Norm vs. Outer Iteration]

Figure: Left: fault-free GMRES convergence. Right: faulty GMRES convergence (or the lack thereof). N = 101, p = 0.05, K = 20, s(K) ≈ 0.36.
The injected faults are sampled as

    x̃ = (g / ‖g‖) · 10^u,   g ∼ N(0, 1)^N,   u ∼ U(−9, 19).
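A sketch of sampling this perturbation with NumPy (`sample_fault` is an illustrative name):

```python
import numpy as np

def sample_fault(N, rng):
    """Sample a perturbation x_tilde = (g/||g||) * 10**u with
    g ~ N(0,1)^N and u ~ U(-9, 19), as in the experiments: a random
    direction with magnitude spanning thirty orders of magnitude."""
    g = rng.standard_normal(N)
    u = rng.uniform(-9.0, 19.0)
    return g / np.linalg.norm(g) * 10.0**u
```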
FT-GMRES: Convergence
[Two semilog plots: Residual Norm vs. Outer Iteration]

Figure: Two realizations of FT-GMRES; despite the high fault rate, the algorithm converges reliably.
FT-GMRES: Cost
[Plot: expected cost vs. number of inner iterations K, for K from 0 to 40]

Figure: Expected cost of the FT-GMRES algorithm for different values of K. The expectation is computed using Monte Carlo sampling.
FT-GMRES Convergence
- We have an analytic proof of convergence for the FT-GMRES method.
- The method converges for virtually any fault rate p.
- The computational cost and convergence rate are strongly influenced by the fault rate p and the restart rate K.
- When choosing the restart rate K, we need to consider the fault rate p in addition to the eigenstructure of the matrix A and storage limitations.
Time Integration: First Order
Consider the time dependent PDE

    (d/dt) X = AX + F,

which we discretize into a system of ordinary differential equations

    (d/dt) x|N = A|N x|N + b|N,

where

    A|N → A,   b|N → F,   x|N → X.

Given X(0), we are looking for X(T).
We need a resilient time integrator!
Time Stepping
Consider a first order integration scheme for the system of ODEs

    (d/dt) x = Ax.

Select a time step ∆t and approximate x_n ≈ x(n∆t).

First order Euler time stepping schemes (explicit and implicit):

    x_{n+1} = x_n + ∆t A x_n,   x_{n+1} = x_n + ∆t A x_{n+1}.

Asymptotic convergence rate, for n_f = T/∆t:

    ‖x_{n_f} − x(n_f ∆t)‖ ≤ C_T ∆t.
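The explicit scheme and its first order convergence can be illustrated directly (assuming NumPy; `forward_euler` is a hypothetical helper):

```python
import numpy as np

def forward_euler(A, x0, dt, T):
    """Integrate dx/dt = A x with explicit first order Euler steps."""
    x = x0.copy()
    for _ in range(int(round(T / dt))):
        x = x + dt * (A @ x)
    return x
```

For dx/dt = −x, halving ∆t roughly halves the error at T, consistent with the O(∆t) bound above.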
Single Fault Scenario
Assume the computations encounter a silent fault at step j: instead of x_j we compute a corrupted

    x̃_j = x_j + x̃.

For n > j, all subsequent time steps x̃_n are corrupted.

Then at the final time step we have

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ ‖x̃_{n_f} − x_{n_f}‖ + ‖x_{n_f} − x(n_f ∆t)‖
                         ≤ ‖(I + ∆tA)^{n_f − j} x̃‖ + C_T ∆t
                         ≤ ‖(I + ∆tA)^{n_f − j}‖ ‖x̃‖ + C_T ∆t
                         ≤ B_T ‖x̃‖ + C_T ∆t,

where

    B_T = max(1, ‖(I + ∆tA)^{n_f}‖).
Resilience Enhancement
According to the Mean Value Theorem,

    x_{n+1} − 2x_n + x_{n−1} = (∆t²/2) ( x''(ξ_l) + x''(ξ_r) ),

where ξ_l ∈ [(n−1)∆t, n∆t] and ξ_r ∈ [n∆t, (n+1)∆t].

Therefore,

    ‖x_{n+1} − 2x_n + x_{n−1}‖ ≤ ∆t² α,

for any α ≥ ‖x''‖_∞.
Resilience Enhancement
At each step of the integration we accept x_{n+1} only if

    ‖x_{n+1} − 2x_n + x_{n−1}‖ ≤ ∆t² α.

Thus, if we encounter a fault x̃, it will be rejected unless

    ‖x_{n+1} + x̃ − 2x_n + x_{n−1}‖ ≤ ∆t² α  ⇒  ‖x̃‖ ≤ 2α∆t².

If we encounter a single fault,

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2αB_T ∆t² + C_T ∆t.

We expect to encounter p·n_f faults in n_f = T/∆t iterations; then

    ‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2αpB_T T∆t + C_T ∆t = (2αpB_T T + C_T) ∆t.
Resilient Time Stepping
    Given A, x^0, ∆t, T
    Given α
    Set n = 0
    while n∆t < T do
        Compute x_{n+1}, e.g., x_{n+1} = x_n + ∆tA x_n
        if ‖x_{n+1} − 2x_n + x_{n−1}‖ > α∆t² then
            discard x_{n+1} and redo the last step
        else
            accept x_{n+1} and advance to the next n = n + 1
        end if
    end while
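A minimal NumPy sketch of this resilient time stepper; `resilient_euler` and the optional `fault` callable (modeling a silent fault) are illustrative names, and the first step is taken unchecked to bootstrap the second-difference test:

```python
import numpy as np

def resilient_euler(A, x0, dt, T, alpha, fault=None):
    """Forward Euler where a step is accepted only if the second
    difference ||x_{n+1} - 2 x_n + x_{n-1}|| stays below alpha*dt^2;
    rejected (corrupted) steps are simply recomputed."""
    x_prev, x = x0, x0 + dt * (A @ x0)   # bootstrap: first step unchecked
    n = 1
    steps = int(round(T / dt))
    while n < steps:
        x_new = x + dt * (A @ x)
        if fault is not None:
            x_new = fault(x_new)         # possible silent corruption
        if np.linalg.norm(x_new - 2 * x + x_prev) > alpha * dt**2:
            continue  # reject the corrupted step and redo it
        x_prev, x = x, x_new
        n += 1
    return x
```

A fault larger than 2α∆t² trips the second-difference test, so the run with one injected fault reproduces the fault-free trajectory.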
Example:
Consider the heat transfer equation

    (d/dt) X = 10^−3 X_ξξ,   X(t, 0) = X(t, 1) = 0,   X(0, ξ) = ξ(1 − ξ).

We discretize the problem using a finite difference scheme, resulting in the system of linear ODEs

    (d/dt) x = 10^−3 (N + 1)² tridiag(1, −2, 1) x,

where tridiag(1, −2, 1) is the (N−1)×(N−1) tridiagonal matrix with 1 on the sub- and super-diagonals and −2 on the diagonal, and x = (x_1, ..., x_{N−1})^T.

For the numerical experiments we use N = 1024.
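A sketch of assembling this ODE system's matrix (assuming NumPy; `heat_matrix` is an illustrative name):

```python
import numpy as np

def heat_matrix(N):
    """Finite-difference operator for the heat example:
    10^-3 * (N+1)^2 * tridiag(1, -2, 1), of size (N-1) x (N-1)."""
    n = N - 1
    L = np.zeros((n, n))
    np.fill_diagonal(L, -2.0)
    np.fill_diagonal(L[1:], 1.0)      # sub-diagonal
    np.fill_diagonal(L[:, 1:], 1.0)   # super-diagonal
    return 1e-3 * (N + 1) ** 2 * L
```

The matrix is symmetric negative definite, as expected for a diffusion operator.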
Classical Backward Euler
[Semilog plot: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", with curves for a single perturbation and for multiple perturbations]

Figure: Lack of convergence for the classical backward Euler scheme for one and multiple faults when p = 0.1.
Resilient Time Stepping
[Semilog plot: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", with curves for p = 0.0, 0.1, 0.2, 0.4, 0.8]

Figure: Convergence for the resilient time stepping method for different values of p.
Time Stepping Conclusions
- We have a resilient first order time-stepping algorithm.
- The method extends to non-linear problems as well as variable time steps ∆t (so long as we have continuity w.r.t. the initial data).
- The resilient method preserves the linear convergence rate even in the presence of faults.
Thank You!
Questions?
Stoyanov, M. and Webster, C., "Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults," Oak Ridge National Laboratory, Tech. Report ORNL/TM-2013/283.