32
Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab July 6, 2014 Extreme-scale Algorithms & Software Resilience (EASIR): ERKJ245 Al Geist (ORNL), Mike Heroux (SNL) Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 1/32

Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilient Solvers for Partial Differential Equations

Miroslav StoyanovEirik Endeve

Clayton Webster

Oak Ridge National Lab

July 6, 2014Extreme-scale Algorithms & Software Resilience (EASIR): ERKJ245

Al Geist (ORNL), Mike Heroux (SNL)

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 1/32

Page 2: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Extreme Scale Equations

Consider the generalized large scale linear differential equation thatis steady state

−AX = F ,

or time dependentd

dtX = AX + F .

The differential operator A can represent complex dynamics ofmultiple couple phenomena (e.g., multi-physics).

Finding the solution X requires a powerful supercomputer.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 2/32

Page 3: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Supercomputing Speed vs Reliability

Hardware Failure

Operation at near critical voltage, ever shrinking fabricationsize and increasing number of componentsIncreased computational throughput leads to decrease inreliabilitySoft faults are already causing problems in largest scalecomputations (e.g., climate)In near future we expect the mean failure time to be of theorder of an application runtime

Hardware Error Control

DRAM has build-in error detection and correctionNetwork communication has CRC checksumCPU cache and logical circuits have virtually no protection

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 3/32

Page 4: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Software Abstraction

Results of FaultsThe code crashes and no result is givenThe code completes execution, but returns an invalid result(e.g., inf or nan)The code completes execution and returns a result that“appears” valid

2 + 3 = 7

Software Error ControlCheck-pointing: how to detect silent faults?Check-summing: in heterogeneous computing environment,we cannot check-sum floating point operations.Post-processing: detection is too late.Redundant computations: doable on at limited scale.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 4/32

Page 5: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

The Cutting Edge Approach

Hoemmen, Mark and Heroux, Michael A, Fault-tolerant iterativemethods via selective reliability, Proceedings of the 2011International Conference for High Performance Computing,Networking, Storage and Analysis (SC). IEEE ComputerSociety, 2011.Code is split into two modes with different reliability

Fast Mode: The bulk of the computations should be performedin this mode and any “fast” computation is susceptible tohardware faults.Safe Mode: Only the bare minimum of work will be done in thismode, computations are assumed to be completely reliable.

A sanity-check is performed in Safe Mode to guard againstintroduction of large hardware errorNatural self-correcting properties are exploited to correcthardware error

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 5/32

Page 6: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Challenges of Selective Reliability

Rigorously quantify the impact of soft faults

Analyze the convergence rate

Devise optimal accept/reject criteria

What type and rate of faults can be corrected

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 6/32

Page 7: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Analytic Fault Model

Algorithms are comprised of steps

Steps compute intermediate answer→ final solution

Each step is performed in either fast or safe mode

Safe mode always returns correct answer

Fast mode is susceptible to faults and there is probability p ofcomputing something that is not correct (i.e., we should get x,but instead we get x+ x̃)

If the intermediate result is corrupted, then we make noassumption on the distribution of the perturbation

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 7/32

Page 8: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Convergence

‖Xexact −Xcomputed‖ ≤ Erounding + Ediscrete + Eiterative + Ehardware

Erounding → 0 with more precision, i.e. single, double, quad.Ediscrete → 0 with more the nodes, i.e. denser meshEiterative → 0 with more solver iterationsIn all cases, more work means better approximationEhardware → 0, as we do more work

Definition (Convergence with Respect to Hardware Error)A method is convergent on a specific hardware, if the statisticalmoments of the Hardware Error converge to zero as thecomputational cost increases, regardless of the distribution of theperturbation.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 8/32

Page 9: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Steady State Method: GMRES

Consider the steady state problem

AX = F ,

which we disretize into a system of linear equations

A|Nx|N = b|N ,

where

A|N → A, b|N → F , x|N → X .

We need a resilient linear solver!

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 9/32

Page 10: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

GMRES

Given the linear system of equations

Ax = b,

iterative solvers general iterative solvers generate a sequence

xk → x.

We perform K iterations so that tolerance ε is reached

‖b−AxK‖ < ε.

Krylov basis is expensive to store and manipulate, use restatedscheme

{{xmk }Kk=0}Mm=1

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 10/32

Page 11: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Fault-Tolerant GMRES

Due to M. Hoemmen and M. Heroux:Given A, b, x0 and ε > 0e0 = ‖b−Ax0‖m = 0while em > ε do

Compute xm+1K

em+1 = ‖b−Axm+1K ‖

if em > em−1 thendiscard xm+1

K and hm+1 and redo the last inner loopelse

accept xm+1K and advance to the next m = m+ 1

end ifend while

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 11/32

Page 12: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Fault Free Convergence

The residual is emk = ‖b−Axmk ‖ and let

Λ+ = {λ : Av = λv,Re(λ) > 0}Λ+ ⊂ disk(D, d/2)

Λ− = {λ : Av = λv,Re(λ) ≤ 0}ν = |Λ−|

Define magnification and reduction factors

R = cond(A)

(maxλ+∈Λ+,λ−∈Λ− |λ+ − λ−|

minλ−∈Λ− |λ−|

)ν (D

d

)ν, r =

d

D.

Then,eMK ≤

(RrK

)Me0 = RMrM×Ke0.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 12/32

Page 13: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilient Convergence

Let M∗ be so that

eM∗

K ≤ ε, M∗ =

⌈log ε− log e0

logR+K log r

Hardware faults do not increase the residual of FT-GMRES.The method will reach desired tolerance ε after M∗ fault freeouter iterations.

FT-GMRES convergence is equivalent to analyzing theprobability of computing M∗ fault free iterations in M tries.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 13/32

Page 14: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilient Convergence

Let p indicate the probability of encountering failure in thesupercomputer over a unit of time equal to a single inner loopiteration.Assume fault rate is uncorrelated in time.The probability that K inner loop iterations encounter a failure

s(K) ≈ (1− p)K

Then the probability of encountering M∗ fault free iterations inM ≥M∗ attempts is

P (M) = s(K)

(M − 1M∗ − 1

)s(K)M

∗−1 (1− s(K))M−M∗.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 14/32

Page 15: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Statistical Convergence

The expected number of iterations needed for convergence is

E[M ] =

∞∑M=M∗

M

(M − 1M∗ − 1

)s(K)M

∗(1− s(K))M−M

∗=

M∗

s(K)

The variance is bounded by

V[M ] =

∞∑M=M∗

(M − E[M ])2P (M) ≤ M∗

s(K)2

The probability of FT-GMRES converging, approaches 1, asM →∞.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 15/32

Page 16: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Computational Cost

Let W be the cost associated with computing the residual in safemode (e.g., for triple redundancy, W = 3).Then, the total cost is

C = M(K +W )

and the expected cost is

E[C] =M∗

s(K)(K +W ) =

⌈log ε− log e0

logR+K log r

⌉K +W

(1− p)K.

There is an optimal restart rate that is the global minimizer of E[C]that is independent from ε and e0.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 16/32

Page 17: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Example:

Consider boundary value problem

0.2Xξξ + Xξ = 0, X (0) = 0, X (1) = 1.

We discretize the problem using finite difference scheme resultingin the linear system of equations

(0.4 + 1

N) −0.2− 1

N0 · · · 0

−0.2 (0.4 + 1N

) −0.2− 1N

· · · 0...

. . .. . .

. . ....

0 · · · −0.2 (0.4 + 1N

) −0.2− 1N

0 0 · · · −0.2 (0.4 + 1N

)

x1x2...

xN−2

xN−1

=

00...0

0.2

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 17/32

Page 18: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Faults w/o Resilience

100

101

102

103

104

10−5

100

105

1010

Outer Iteration

Re

sid

ua

l N

orm

100

101

102

103

104

10−5

100

105

1010

Outer Iteration

Re

sid

ua

l N

orm

Figure : Left: Fault free GMRES convergence. Right: Faulty GMRESconvergence (or the lack thereof). N = 101, p = 0.05, K = 20, s(K) ≈ 0.36

x̃ =g

‖g‖10u, g ∼ N (0, 1)N , u ∼ U(−9, 19).

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 18/32

Page 19: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

FT-GMRES: Convergence

100

101

102

103

10−5

100

105

1010

Outer Iteration

Re

sid

ua

l N

orm

100

101

102

103

10−5

100

105

1010

Outer Iteration

Re

sid

ua

l N

orm

Figure : Two realization of FT-GMRES, despite the high fault rate, thealgorithm converges reliably.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 19/32

Page 20: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

FT-GMRES: Cost

0 5 10 15 20 25 30 35 401000

1500

2000

2500

3000

3500

4000

4500

5000

5500

Number of Inner Iterations K

Cost

Figure : Expected cost of the FT-GMRES algorithm for different values ofK. The expectation is computed using Monte Carlo sampling.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 20/32

Page 21: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

FT-GMRES Convergence

We have analytic proof of convergence for the FT-GMRESmethod.

The method converges for virtually any fault rate p.

The computational cost and convergence rate are stronglyinfluenced by the fault rate p and restart rate K.

When making a decision on the restart rate K, we need toconsider the rate of faults p in addition to eigenstructure of thematrix A and storage limitations.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 21/32

Page 22: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Time Integration: First Order

Time dependent PDE

d

dtX = AX + F .

which we disretize into a system of ordinary differential equations

d

dtx|N = A|Nx|N + b|N ,

where

A|N → A, b|N → F , x|N → X .

Given X (0) we are looking for X (T ).

We need a resilient time integrator!

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 22/32

Page 23: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Time Stepping

Consider first order integration scheme for the system of ODEs

d

dtx = Ax.

Select time step ∆t and approximate xn ≈ x(n∆t)

First order Euler time stepping schemes

xn+1 = xn + ∆tAxn, xn+1 = xn + ∆tAxn+1.

Asymptotic convergence rate for nf = T∆t

‖xnf − x(nf∆t)‖ ≤ CT∆t.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 23/32

Page 24: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Single Fault Scenario

Assume that computations encounter a silent fault and at step j,instead of xj we compute a corrupted

x̃j = xj + x̃.

For n > j all time steps will be corrupted x̃n.

Then at the final time step we have that

‖x̃nf − x(nf∆t)‖ ≤ ‖x̃nf − xnf ‖+ ‖xnf − x(nf∆t)‖≤ ‖(I + ∆tA)nf−j x̃‖+ CT∆t

≤ ‖(I + ∆tA)nf−j‖‖x̃‖+ CT∆t

≤ BT ‖x̃‖+ CT∆t,

whereBT = max(1, (I + ∆tA)nf ).

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 24/32

Page 25: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilience Enhancement

According to the Mean Value Theorem

xn+1 − 2xn + xn−1 = ∆t21

2

(d2

dt2x(ξl) +

d2

dt2x(ξr)

),

where ξl ∈ [(n− 1)∆t, n∆t] and ξr ∈ [n∆t, (n+ 1)∆t].

Therefore,‖xn+1 − 2xn + xn−1‖ ≤ ∆t2α,

for any α ≥ ‖ d2dt2x‖∞.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 25/32

Page 26: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilience Enhancement

At each step of the integration we accept xn+1 only if

‖xn+1 − 2xn + xn−1‖ ≤ ∆t2α.

Thus, if we encounter a fault x̃ it will be rejected unless

‖xn+1 + x̃− 2xn + xn−1‖ ≤ ∆t2α ⇒ ‖x̃‖ ≤ 2α∆t2.

If we encounter a single fault

‖x̃nf − x(nf∆t)‖ ≤ 2αBT∆t2 + CT∆t.

We expect to encounter pnf faults in nf = T∆t iterations, then

‖x̃nf − x(nf∆t)‖ ≤ 2αpBTT∆t+ CT∆t = (2αpBTT + CT ) ∆t.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 26/32

Page 27: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Resilient Time Stepping

Given A, x0, ∆t, TGiven αSet n = 0while n∆t < T do

Compute xn+1, e.g., xn+1 = xn + ∆tAxn

if ‖xn+1 − 2xn + xn−1‖ > α∆t2 thendiscard xn+1 and redo the last step

elseaccept xn+1 and advance to the next n = n+ 1

end ifend while

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 27/32

Page 28: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Example:

Consider the heat transfer equation

d

dtX = 10−3Xξξ, X (t, 0) = X (t, 1) = 0, X (0, ξ) = ξ(1− ξ).

We discretize the problem using finite difference scheme resultingin the system of linear ODEs

d

dt

x1x2...

xN−2

xN−1

= 10−3(N + 1)2

−2 1 0 · · · 01 −2 1 · · · 0...

. . .. . .

. . ....

0 · · · 1 −2 10 0 · · · 1 −2

x1x2...

xN−2

xN−1

.

For the numerical experiments we use N = 1024.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 28/32

Page 29: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Classical Backward Euler

10−3

10−2

106

107

108

109

1010

Error Convergence (Heat 1D)

∆t

L2 E

rror

No

rm

Single Perturbation

Multiple Perturbations

Figure : Lack of convergence for the classical backward Euler scheme forone and multiple faults when p = 0.1.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 29/32

Page 30: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Classical Backward Euler

10−3

10−2

10−8

10−6

10−4

10−2

Error Convergence (Heat 1D)

∆t

L2 E

rror

No

rm

P=0.0P=0.1P=0.2P=0.4P=0.8

Figure : Convergence for the resilient time stepping method for differentvalues of p.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 30/32

Page 31: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Time Stepping Conclusions

We have a resilient first order time-stepping algorithm.

The method extends to non-linear problems as well as variabletime step ∆t (so long as we have continuity w.r.t. initial data)

The resilient method preserves the linear convergence rateeven in the presence of faults.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 31/32

Page 32: Resilient Solvers for Partial Differential Equations · Resilient Solvers for Partial Differential Equations Miroslav Stoyanov Eirik Endeve Clayton Webster Oak Ridge National Lab

Thank You!

Questions?

Stoyanov, M. and Webster, C. Numerical Analysis of Fixed PointAlgorithms in the Presence of Hardware Faults, Oak RidgeNational Laboratory, Tech Report, ORNL/TM-2013/283.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 32/32