
Received: 4 February 2016    Revised: 26 November 2016    Accepted: 9 December 2016

DOI: 10.1002/nla.2088

RESEARCH ARTICLE

Analysis of Monte Carlo accelerated iterative methods for sparse linear systems

Michele Benzi1, Thomas M. Evans2, Steven P. Hamilton2, Massimiliano Lupo Pasini1, Stuart R. Slattery2

1 Department of Mathematics and Computer Science, Emory University, Atlanta, 30322, GA, USA
2 Oak Ridge National Laboratory, 1 Bethel Valley Rd., Oak Ridge, 37831, TN, USA

Correspondence
Michele Benzi, Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA. Email: [email protected]

Funding Information
UT-Battelle, LLC, Grant/Award Number: DE-AC05-00OR22725; Department of Energy (Office of Science), Grant/Award Number: ERKJ247

Summary
We consider hybrid deterministic-stochastic iterative algorithms for the solution of large, sparse linear systems. Starting from a convergent splitting of the coefficient matrix, we analyze various types of Monte Carlo acceleration schemes applied to the original preconditioned Richardson (stationary) iteration. These methods are expected to have considerable potential for resiliency to faults when implemented on massively parallel machines. We establish sufficient conditions for the convergence of the hybrid schemes, and we investigate different types of preconditioners including sparse approximate inverses. Numerical experiments on linear systems arising from the discretization of partial differential equations are presented.

KEYWORDS

iterative methods, Monte Carlo methods, preconditioning, resilience, Richardson iteration, sparse approximate inverses, sparse linear systems

1 INTRODUCTION

The next generation of computational science applications will require numerical solvers that are both reliable and capable of high performance on projected exascale platforms. In order to meet these goals, solvers must be resilient to soft and hard system failures, provide high concurrency on heterogeneous hardware configurations, and retain numerical accuracy and efficiency. In this paper, we focus on the solution of large sparse systems of linear equations, for example, of the kind arising from the discretization of partial differential equations (PDEs). A possible approach is to try to adapt existing solvers (such as preconditioned Krylov subspace or multigrid methods) to the new computational environments, and indeed, several efforts are under way in this direction; see, for example, the previous studies1-5 and references therein. An alternative approach is to investigate new algorithms that can address issues of resiliency, particularly fault tolerance and hard processor failures, naturally. An example is provided by the recently proposed Monte Carlo synthetic acceleration (MCSA) methods; see the previous studies.6,7

In these methods, an underlying (deterministic) stationary Richardson iterative method is combined with a stochastic, Monte Carlo (MC)-based "acceleration" scheme. Ideally, the accelerated scheme will converge to the solution of the linear system in far fewer (outer) iterations than the basic scheme without MC acceleration, with the added advantage that most of the computational effort is now relegated to the MC portion of the algorithm, which is highly parallel and offers a more straightforward path to resiliency than standard, deterministic solvers. In addition, a careful combination of the Richardson and MC parts of the algorithm makes it possible to circumvent the well-known problem of slow MC error reduction; see one study.6

Numerical evidence presented in one study6 suggests that MCSA can be competitive, for certain classes of problems, with established deterministic solvers such as preconditioned conjugate gradients and Generalized Minimum Residual (GMRES). So far, however, no theoretical analysis of the convergence properties of these solvers has been carried out. In particular, it is not clear a priori whether the method, applied to a particular linear system, will converge. Indeed, the convergence of the underlying preconditioned Richardson iteration is not sufficient, in general, to guarantee the convergence of the MCSA-accelerated iteration. In other words, it is quite possible that the stochastic acceleration part of the algorithm may actually cause the hybrid method to diverge or stagnate.

In this paper, we address this fundamental issue, discussing both necessary and sufficient conditions for convergence. We also discuss the choice of splitting, or preconditioner, and illustrate our findings by means of numerical experiments. We do not specifically consider in this paper the resiliency issue, which will be addressed elsewhere.

The paper is organized as follows. In Section 2, we provide an overview of existing MC linear solver algorithms. In Section 3, we discuss the convergence behavior of stochastic solvers, including a discussion of classes of matrices for which convergence can be guaranteed. Section 4 provides some numerical results illustrating properties of the various approaches, and in Section 5, we give our conclusions.

2 STOCHASTIC LINEAR SOLVERS

Linear solvers based on stochastic techniques have a long history, going back to the famed 1928 paper by Courant, Friedrichs, and Lewy on finite difference schemes for PDEs.8 Many authors have considered linear solvers based on MC techniques, with important early contributions by Curtiss9 and by Forsythe and Leibler.10 More recent works include the previous studies,11-13 among others. Until recently, these methods have had mixed success at best, due to their generally inferior performance when compared to state-of-the-art deterministic solvers like multigrid or preconditioned Krylov methods. Current interest in resilient solvers, where some performance may be traded off for increased robustness in the presence of faults, has prompted a fresh look at methods incorporating MC ideas.6,7,14

As mentioned in the work of Dimov and Alexandrov,13 MC methods may be divided into two broad classes: direct methods, such as those described in the works of Dimov and Alexandrov,13,15 and iterative methods, which refer to techniques such as those presented in Halton12,16; see also the previous studies.6,14 The first type consists of purely stochastic schemes; therefore, the resulting error with respect to the exact solution is made of just a stochastic component. In contrast, the iterative MC methods utilize more traditional iterative algorithms alongside the stochastic approach, generating two types of error: a stochastic one and a systematic one. In practice, it may be difficult to separate the two components; nevertheless, awareness of this intrinsic structure is useful, as it allows algorithm designers some flexibility in the choice of what part of the algorithm to target for refinement (e.g., trading off convergence speed for resilience by balancing the number of "deterministic" outer iterations against the number of random walks to be used within each iteration).

Consider a system of linear equations of the form

Ax = b, (1)

where A ∈ Rn×n and x, b ∈ Rn. Equation 1 can be recast as a fixed point problem:

x = Hx + f, (2)

where H = I − A and f = b. Assuming that the spectral radius 𝜌(H) < 1, the solution to Equation 2 can be written in terms of a power series in H (Neumann series):

$$x = \sum_{\ell=0}^{\infty} H^{\ell} f.$$

Denoting the kth partial sum by $x^{(k)}$, the sequence of approximate solutions $\{x^{(k)}\}_{k=0}^{\infty}$ converges to the exact solution regardless of the initial guess $x_0$.

By restricting the attention to a single component of x, we obtain

$$x_i = f_i + \sum_{\ell=1}^{\infty} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} H_{i,k_1} H_{k_1,k_2} \cdots H_{k_{\ell-1},k_\ell} \, f_{k_\ell}. \qquad (3)$$

The last equation can be reinterpreted as the realization of an estimator defined on a random walk. Let us start considering a random walk whose state space S is labeled by the set of indices of the forcing term f:

$$S = \{1, 2, \ldots, n\} \subset \mathbb{N}.$$

Each ith step of the random walk has a random variable ki associated with it. The realization of ki represents the index of the component of f which is visited in the current step of the random walk. The construction of random walks is accomplished considering the directed graph associated with the matrix H. The nodes of this graph are labeled 1 through n, and there is a directed edge from node i to node j if and only if Hi,j ≠ 0. Starting from a given node, the random walk consists of a sequence of nodes obtained by jumping from one node to the next along directed edges, choosing the next node at random according to a transition probability distribution matrix constructed from H or from HT, see below. Note that it may happen that a row of H (or of HT) is all zero; this happens when there are no out-going (respectively, in-coming) edges from (respectively, to) the corresponding node. In this case, that particular random walk is terminated and another node is selected as the starting point for the next walk. The transition probabilities and the selection of the initial state of the random walk can be accomplished according to different modalities, leading to two distinct methods: the forward and adjoint methods. These methods are described next.

2.1 Forward method

Given a functional J, the goal is to compute its action on a vector x by constructing a statistical estimator whose expected value equals J(x). Each statistical sampling is represented by a random walk, and it contributes to build an estimate of the expected value. Towards this goal, it is necessary to introduce an initial probability distribution and a transition matrix so that random walks are well defined. Recalling Riesz's representation theorem, one can write

$$J(x) = \langle h, x \rangle = \sum_{i=1}^{n} h_i x_i,$$

where h ∈ Rn is the Riesz representative in Rn of the functional J. Such a representative can be used to build the initial probability p : S → [0, 1] of the random walk as

$$p(k_0 = i) = p_{k_0} = \frac{|h_i|}{\sum_{i=1}^{n} |h_i|}.$$

It is important to highlight that the role of vector h is confined to the construction of the initial probability, and that h is not used afterwards in the stochastic process. A possible choice for the transition probability matrix P can be

$$p(k_\ell = j \mid k_{\ell-1} = i) = P_{ij} = \frac{|H_{ij}|}{\sum_{k=1}^{n} |H_{ik}|},$$

where

$$p(\cdot, i) : S \to [0, 1], \quad \forall i \in S,$$

and k𝓁 ∈ S represents the state reached at a generic 𝓁th step of the random walk. A related sequence of random variables wij can be defined as

$$w_{ij} = \frac{H_{ij}}{P_{ij}}.$$

The probability distribution of the random variables wij is represented by the transition matrix that governs the stochastic process. The wij quantities just introduced can be used to build one more sequence of random variables. At first, we introduce the quantities W𝓁:

$$W_0 = \frac{h_{k_0}}{p_{k_0}}, \qquad W_\ell = W_{\ell-1} \, w_{k_{\ell-1},k_\ell}.$$

In probability theory, a random walk is itself envisioned as a random variable that can assume multiple values consisting of different realizations. Indeed, given a starting point, there are in general many choices (one for each nonzero in the corresponding row of P) to select the second state and from there on recursively. The actual feasibility of a path and the frequency with which it is selected depend on the initial probability and on the transition matrix. By introducing 𝜈 as a realization of a random walk, we define

$$X(\nu) = \sum_{\ell=0}^{\infty} W_\ell \, f_{k_\ell}$$

as the random variable associated with a specific realization 𝜈. We can thus define the estimator 𝜃 as

$$\theta = E[X] = \sum_{\nu} P_\nu \, X(\nu),$$

where 𝜈 ranges over all possible realizations. P𝜈 is the probability associated with a specific realization of the random walk. It can be proved (see the works of Halton12,16) that

$$E[W_\ell f_{k_\ell}] = \langle h, H^{\ell} f \rangle, \quad \ell = 0, 1, 2, \ldots$$

and

$$\theta = E\left[ \sum_{\ell=0}^{\infty} W_\ell f_{k_\ell} \right] = \langle h, x \rangle.$$

A possible choice for h is a vector of the standard basis, h = ei. This would correspond to setting the related initial probability to a Kronecker delta:

$$p(k_0 = j) = \delta_{ij}.$$

By doing so, we have k0 = i and

$$\theta = x_i = f_i + \sum_{\ell=1}^{\infty} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{\ell-1},k_\ell} \, w_{i,k_1} w_{k_1,k_2} \cdots w_{k_{\ell-1},k_\ell} \, f_{k_\ell}. \qquad (4)$$
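To make the construction above concrete, the following minimal Python sketch (not from the paper; the function name forward_mc_component and its parameters are illustrative) estimates a single component x_i via the forward method with MAO transition probabilities, a fixed maximum history length, and a dense iteration matrix for simplicity.

```python
import numpy as np

def forward_mc_component(H, f, i, n_hist=10_000, max_steps=50, rng=None):
    """Forward MC estimate of x_i for x = Hx + f, with h = e_i (so W_0 = 1).

    Transition probabilities follow the MAO choice P_ij = |H_ij| / sum_k |H_ik|,
    and the weights are w_ij = H_ij / P_ij.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = H.shape[0]
    H_abs = np.abs(H)
    row_sums = H_abs.sum(axis=1)                    # sum_k |H_ik| for each row i
    samples = np.empty(n_hist)
    for m in range(n_hist):
        state, weight, tally = i, 1.0, f[i]         # zeroth term: W_0 f_{k_0}
        for _ in range(max_steps):
            if row_sums[state] == 0.0:              # no outgoing edges: terminate the walk
                break
            probs = H_abs[state] / row_sums[state]
            nxt = rng.choice(n, p=probs)
            weight *= H[state, nxt] / probs[nxt]    # multiply by w_{state, nxt}
            tally += weight * f[nxt]
            state = nxt
        samples[m] = tally
    return samples.mean(), samples.std(ddof=1) / np.sqrt(n_hist)
```

Returning the sample mean together with its standard error makes the accuracy considerations discussed next directly observable.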

As regards the variance, we recall the following relation:

$$\mathrm{Var}\left[ \sum_{\ell=0}^{\infty} W_\ell f_{k_\ell} \right] = E\left[ \sum_{\ell=0}^{\infty} W_\ell^2 f_{k_\ell}^2 \right] - \left( E\left[ \sum_{\ell=0}^{\infty} W_\ell f_{k_\ell} \right] \right)^2. \qquad (5)$$

Hence, the variance can be computed as the difference between the second moment of the random variable and the square of its first moment.

In order to apply the central limit theorem (CLT) to the estimators defined above, we must require that the estimators have both finite expected value and finite variance. This is equivalent to checking the finiteness of the expected value and second moment. Therefore, we have to impose the following conditions:

$$E\left[ \sum_{\ell=0}^{\infty} W_\ell f_{k_\ell} \right] < \infty, \qquad (6)$$

$$E\left[ \sum_{\ell=0}^{\infty} W_\ell^2 f_{k_\ell}^2 \right] < \infty. \qquad (7)$$

The forward method presented above, however, has the limitation of employing an entire set of realizations to estimate just a single entry of the solution at a time. Hence, in order to estimate the entire solution vector for Equation 1, we have to employ a separate set of realizations for each entry in the solution vector. This limitation can be circumvented by the adjoint method, which we describe below.

Remark 1. It is important to note that in order to construct the random walks, access to the individual entries of H is required. Hence, H needs to be formed explicitly and, therefore, must be sparse in order to have a practical algorithm.


2.2 Adjoint method

A second MC method can be derived by considering the linear system adjoint to Equation 1:

$$A^T y = d, \qquad (8)$$

where y and d are the adjoint solution and source term. Equation 8 can be recast in a manner similar to Equation 2:

$$y = H^T y + d.$$

Note that 𝜌(HT) = 𝜌(H) < 1; hence, convergence of the Neumann series (fixed point iteration) for Equation 1 guarantees convergence for the adjoint system 8.

Exploiting the following inner product equivalence:

$$\langle A^T y, x \rangle = \langle y, A x \rangle, \qquad (9)$$

it follows that

$$\langle x, d \rangle = \langle y, f \rangle.$$

By writing the Neumann series for the solution to Equation 8, we have

$$y = \sum_{\ell=0}^{\infty} (H^T)^{\ell} d, \qquad (10)$$

and focusing on a single entry of the solution vector:

$$y_i = d_i + \sum_{\ell=1}^{\infty} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} H^T_{i,k_1} H^T_{k_1,k_2} \cdots H^T_{k_{\ell-1},k_\ell} \, d_{k_\ell}.$$

The undetermined quantities in the dual problem 8 are y and d. Therefore, two constraints are required: the first constraint is Equation 9, and as a second constraint we select d to be one of the standard basis vectors. Applying this choice of d to 9, we get

$$\langle y, f \rangle = \langle x, d \rangle = x_i.$$

In order to give a stochastic interpretation of the adjoint method similar to the one obtained for the forward method, we introduce the initial probability

$$p(k_0 = i) = \frac{|f_i|}{\|f\|_1}$$

and the initial weight

$$W_0 = \|f\|_1 \, \frac{f_i}{|f_i|}.$$

The transition probability is defined as

$$p(k_\ell = j \mid k_{\ell-1} = i) = P_{ij} = \frac{|H^T_{ij}|}{\sum_{k=1}^{n} |H^T_{ik}|} = \frac{|H_{ji}|}{\sum_{k=1}^{n} |H_{ki}|},$$

and the sequence of weights as follows:

$$w_{ij} = \frac{H_{ji}}{P_{ij}}.$$

By reformulating the fixed point scheme in its statistical interpretation, the following formula holds for the estimator of the solution vector associated with the adjoint method: it is the vector 𝜽 ∈ Rn such that

$$\theta_i = E\left[ \sum_{\ell=0}^{\infty} W_\ell \, \delta_{k_\ell, i} \right] = \sum_{\ell=0}^{\infty} \sum_{k_0=1}^{n} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} f_{k_0} \, P_{k_0,k_1} P_{k_1,k_2} \cdots P_{k_{\ell-1},k_\ell} \, w_{k_0,k_1} \cdots w_{k_{\ell-1},k_\ell} \, \delta_{k_\ell, i}.$$

This estimator is known in the literature as the collision estimator. The forward method adds a contribution to the component of the solution vector where the random walk began, based on the value of the source vector in the state in which the walk currently resides. The adjoint method, on the other hand, adds a contribution to the component of the solution vector where the random walk currently resides, based on the value of the source vector in the state in which the walk began. The Kronecker delta at the end of the series represents a filter, indicating that only a subset of realizations contribute to the ith component of the solution vector.

The variance is given by

$$\mathrm{Var}\left[ \sum_{\ell=0}^{\infty} W_\ell \, \delta_{k_\ell, i} \right] = E\left[ \sum_{\ell=0}^{\infty} W_\ell^2 \, \delta_{k_\ell, i} \right] - \left( E\left[ \sum_{\ell=0}^{\infty} W_\ell \, \delta_{k_\ell, i} \right] \right)^2, \quad i = 1, \ldots, n. \qquad (11)$$

Along the same lines as the development for the forward method, we must impose finiteness of the expected value and second moment. Therefore, the following conditions must be verified:

$$E\left[ \sum_{\ell=0}^{\infty} W_\ell \, \delta_{k_\ell, i} \right] < \infty, \quad i = 1, \ldots, n, \qquad (12)$$

and

$$E\left[ \sum_{\ell=0}^{\infty} W_\ell^2 \, \delta_{k_\ell, i} \right] < \infty, \quad i = 1, \ldots, n. \qquad (13)$$

The main advantage of this method, compared to the forward one, consists in the fact that a single set of realizations is used to estimate the entire solution vector. Unless only a small portion of the problem is of interest, this property often leads to the adjoint method being favored over the forward method. In other terms, the adjoint method should be preferred when approximating the solution globally over the entire computational domain, and the forward method is especially useful when approximating the solution locally.
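For comparison, here is a minimal sketch of the adjoint method with the collision estimator (again illustrative, not the authors' implementation; dense arrays and a fixed history length are assumed, and f is assumed to have at least one nonzero entry).

```python
import numpy as np

def adjoint_mc_collision(H, f, n_hist=10_000, max_steps=50, rng=None):
    """Adjoint MC solve of x = Hx + f; one set of histories tallies all components."""
    if rng is None:
        rng = np.random.default_rng()
    n = H.shape[0]
    Ht = H.T                                        # transitions are governed by H^T
    Ht_abs = np.abs(Ht)
    row_sums = Ht_abs.sum(axis=1)                   # sum_k |H_{k,i}| for each state i
    f_norm = np.abs(f).sum()                        # ||f||_1
    p0 = np.abs(f) / f_norm                         # initial probability p(k_0 = i)
    x = np.zeros(n)
    for _ in range(n_hist):
        state = rng.choice(n, p=p0)
        weight = f_norm * np.sign(f[state])         # W_0 = ||f||_1 sign(f_{k_0})
        x[state] += weight                          # collision tally: W_0 delta_{k_0, i}
        for _ in range(max_steps):
            if row_sums[state] == 0.0:
                break
            probs = Ht_abs[state] / row_sums[state]
            nxt = rng.choice(n, p=probs)
            weight *= Ht[state, nxt] / probs[nxt]   # w_{state, nxt} = H_{nxt, state} / P
            x[nxt] += weight
            state = nxt
    return x / n_hist
```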


In the literature, another estimator is employed along with the adjoint MC method, the so-called expected value estimator. Its formulation is as follows: it is the vector 𝜽 ∈ Rn such that

$$\theta_i = E\left[ f_i + \sum_{\ell=0}^{\infty} W_\ell \, H^T_{k_\ell, i} \right] = f_i + \sum_{\ell=0}^{\infty} \sum_{k_0=1}^{n} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} f_{k_0} \, P_{k_0,k_1} P_{k_1,k_2} \cdots P_{k_{\ell-1},k_\ell} \, w_{k_0,k_1} \cdots w_{k_{\ell-1},k_\ell} \, H^T_{k_\ell, i}. \qquad (14)$$

Hence, the expected value estimator averages the deterministic contribution of the iteration matrix over all the potential states j that might be reached from the current state 𝓁. The variance in this case becomes

$$\mathrm{Var}\left[ \sum_{\ell=0}^{\infty} W_\ell \, H^T_{k_\ell, i} \right] = E\left[ \sum_{\ell=0}^{\infty} W_\ell^2 \, H^T_{k_\ell, i} \right] - \left( E\left[ \sum_{\ell=0}^{\infty} W_\ell \, H^T_{k_\ell, i} \right] \right)^2, \quad i = 1, \ldots, n. \qquad (15)$$

2.3 Hybrid stochastic/deterministic methods

The direct methods described in Sections 2.1 and 2.2 suffer from a slow rate of convergence due to the $1/\sqrt{N}$ behavior dictated by the CLT (N here is the number of random walks used to estimate the solution). Furthermore, when the spectral radius of the iteration matrix is close to unity, each individual random walk may require a large number of transitions to approximate the corresponding components in the Neumann series. To offset the slow convergence of the CLT, schemes have been proposed which combine traditional fixed point iterative methods with the stochastic solvers. The first such method, due to Halton, was termed the sequential Monte Carlo (SMC) method and can be written as follows.
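The SMC algorithm listing itself is not reproduced in this extraction; the sketch below is an illustrative outline consistent with the description that follows, where mc_correction stands for any of the MC solvers of Sections 2.1 and 2.2 applied to the residual system (all names are hypothetical).

```python
import numpy as np

def smc(H, f, mc_correction, x0=None, tol=1e-8, max_iter=200):
    """Sequential Monte Carlo (Halton): Richardson iteration whose correction
    delta_x is an MC estimate of (I - H)^{-1} r at every step."""
    x = np.zeros_like(f) if x0 is None else x0.copy()
    for _ in range(max_iter):
        r = f - (x - H @ x)                 # residual of the fixed point form x = Hx + f
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
        x = x + mc_correction(H, r)         # stochastic solve of (I - H) delta_x = r
    return x
```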

The MC linear solver method is used to compute the update 𝛿x^l. This algorithm is equivalent to a Richardson iteration accelerated by a correction obtained by approximately solving the error-residual equation

$$(I - H)\,\delta x^{l+1} = r^{l}. \qquad (16)$$

If this equation were to be solved exactly, the corresponding approximation $x^{l+1} = x^{l} + \delta x^{l+1}$ would be the exact solution to the linear system. This is of course impractical, because solving 16 is equivalent to solving the original linear system 1. Instead, the correction is obtained by solving Equation 16 only approximately, using an MC method. Because MC is only applied within a single iteration, the CLT is only applicable within that iteration rather than to the overall convergence behavior of the algorithm. This allows a trade-off between the amount of time and effort spent on the inner (stochastic) and outer (deterministic) iterations, which can take into account the competing goals of reliability and rapid convergence.

A further extension of Halton's method, termed Monte Carlo Synthetic Acceleration (MCSA), has been recently introduced in the previous studies.6,14 The MCSA algorithm can be written as follows.
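As with SMC, the MCSA listing is not reproduced here. The following illustrative sketch (reusing the hypothetical mc_correction helper from the SMC sketch above) follows the description in the next paragraph: an MC correction step followed by an extra Richardson sweep that smooths the stochastic noise.

```python
import numpy as np

def mcsa(H, f, mc_correction, x0=None, tol=1e-8, max_iter=200):
    """Monte Carlo synthetic acceleration: MC correction plus Richardson smoothing."""
    x = np.zeros_like(f) if x0 is None else x0.copy()
    for _ in range(max_iter):
        r = f - (x - H @ x)                 # r^l = f - (I - H) x^l
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
        x_half = x + mc_correction(H, r)    # x^{l+1/2} = x^l + delta_x^{l+1/2}
        x = H @ x_half + f                  # extra Richardson step smooths the MC noise
    return x
```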

As with SMC, an MC linear solver is used to compute the updating contribution $\delta x^{l+\frac{1}{2}}$. In this approach, an extra step of Richardson iteration is added to smooth out some of the high-frequency noise introduced by the MC process. This way, the deterministic and stochastic components of the algorithm act in a complementary fashion.

Obviously, a minimum requirement is that the linear system can be written in the form 2 with 𝜌(H) < 1. This is typically achieved by preconditioning. That is, we find an invertible matrix P such that H = I − P^{-1}A satisfies 𝜌(H) < 1, and we apply the method to the fixed point problem 2, where H = I − P^{-1}A and f = P^{-1}b. In other words, the underlying deterministic iteration is a preconditioned Richardson iteration. Various choices of the preconditioner are possible; a detailed discussion of this issue is deferred until Section 3.5. Here, we note only that because we need explicit knowledge of the entries of H, not all preconditioning choices are viable; in particular, P needs to be such that H = I − P^{-1}A retains a high degree of sparsity. Unless otherwise specified, below, we assume that the transformation of the original linear system 1 to the fixed point form 2 with 𝜌(H) < 1 has already been carried out.


3 CONVERGENCE BEHAVIOR OF STOCHASTIC METHODS

Interestingly, the convergence requirements imposed by the MC estimator and the corresponding variance can be reformulated in a purely deterministic setting. For instance, the condition of finiteness of the expected value turns out to be equivalent to requiring

𝜌(H) < 1, (17)

where H is the iteration matrix of the fixed point scheme. Indeed, we can see from Equations 4 and 10 that the expected value is expressed in terms of power series of H, and the condition 𝜌(H) < 1 is a necessary and sufficient condition for the Neumann series to converge.

Next, we address the finiteness requirement for the second moment. Equations 5 and 11 for the forward and the adjoint method, respectively, show that the second moment can be reinterpreted as a power series with respect to the matrices Ĥ defined as follows:

$$\hat{H}_{ij} = \frac{H_{ij}^2}{P_{ij}} \quad \text{(forward method)}$$

and

$$\hat{H}_{ij} = \frac{H_{ji}^2}{P_{ij}} \quad \text{(adjoint method)}.$$

In order for the corresponding power series to converge, we must require

𝜌(Ĥ) < 1. (18)

Hence, condition 17 is required for a generic fixed point scheme to reach convergence, whereas the extra condition 18 is typical of the stochastic schemes studied in this work. Moreover, because the finiteness of the variance automatically entails the finiteness of the expected value, we can state that Equation 18 implicitly entails Equation 17, whereas the converse is not true in general.
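For moderate problem sizes, both conditions can be checked directly once H and the transition matrix P are available. The sketch below is illustrative (dense NumPy, names hypothetical); it builds Ĥ for the forward and adjoint methods and evaluates the two spectral radii.

```python
import numpy as np

def hat_matrices(H, P):
    """H_hat for the forward method (H_ij^2 / P_ij) and the adjoint method (H_ji^2 / P_ij);
    entries with P_ij = 0 are set to zero (no transition is possible there)."""
    safe_P = np.where(P > 0, P, 1.0)
    H_hat_fwd = np.where(P > 0, H**2 / safe_P, 0.0)
    H_hat_adj = np.where(P > 0, (H.T)**2 / safe_P, 0.0)
    return H_hat_fwd, H_hat_adj

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

# Condition 17: spectral_radius(H) < 1 (the Richardson iteration converges).
# Condition 18: spectral_radius(H_hat) < 1 (the second moment of the estimator is finite).
```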

3.1 Necessary and sufficient conditions

Here, we report some results presented in the previous studies,17,18 concerning necessary conditions and sufficient conditions for convergence. In particular, these papers discuss suitable choices for constructing the transition probability matrix, P.

The construction of the transition probability must obviously satisfy the following constraints (called transition conditions):

$$P_{ij} \geq 0, \qquad \sum_{j=1}^{N} P_{ij} = 1.$$

One additional requirement relates the sparsity pattern of H to that of the transition probabilities:

Forward Method: Hij ≠ 0 ⇒ Pij ≠ 0,

Adjoint Method: Hji ≠ 0 ⇒ Pij ≠ 0.

The following auxiliary result can be found in one study.18

Lemma 1. Consider a generic vector g = (g1, g2, …, gN)T where at least one element is nonzero, gk ≠ 0 for some k ∈ {1, …, N}. Then, the following statements hold:

a. for any probability distribution vector 𝜷 = (𝛽1, 𝛽2, …, 𝛽N)T satisfying the transition conditions, $\sum_{k=1}^{N} \frac{g_k^2}{\beta_k} \geq \left( \sum_{k=1}^{N} |g_k| \right)^2$; moreover, the lower bound is attained for the probability vector defined by $\beta_k = \frac{|g_k|}{\sum_{k=1}^{N} |g_k|}$;

b. there always exists a probability vector 𝜷 such that $\sum_{k=1}^{N} \frac{g_k^2}{\beta_k} \geq c$, for all c > 1.

Consider now a generic realization of a random walk, truncated at a certain kth step:

$$\nu_k : r_0 \to r_1 \to r_2 \to \cdots \to r_k,$$

and the corresponding statistical estimator associated with the forward MC method:

$$X(\nu_k) = \frac{H_{r_0,r_1} H_{r_1,r_2} \cdots H_{r_{k-1},r_k}}{P_{r_0,r_1} P_{r_1,r_2} \cdots P_{r_{k-1},r_k}} \, f_{r_k}.$$

Then, the following result holds (see one study18).

Theorem 1. (Forward method version) Let H ∈ Rn×n be such that ‖H‖∞ < 1. Consider 𝜈k as the realization of a random walk 𝜈 truncated at the kth step. Then, there always exists a transition matrix P such that $\mathrm{Var}\left(X(\nu_k)\right) \to 0$ and $\mathrm{Var}\left(\sum_{\nu} X(\nu_k)\right)$ is bounded as $k \to \infty$.

If we introduce the estimator associated with the adjoint MC:

$$X(\nu_k) = \frac{H^T_{r_0,r_1} H^T_{r_1,r_2} \cdots H^T_{r_{k-1},r_k}}{P_{r_0,r_1} P_{r_1,r_2} \cdots P_{r_{k-1},r_k}} \, \mathrm{sign}(f_{r_0}) \, \|f\|_1,$$

then we can state a theorem analogous to Theorem 1 (see one study18).

Theorem 2. (Adjoint method version) Let H ∈ Rn×n with ‖H‖1 < 1. Consider 𝜈k as the realization of a random walk 𝜈 truncated at the kth step. Then, there always exists a transition matrix P such that $\mathrm{Var}\left(X(\nu_k)\right) \to 0$ and $\mathrm{Var}\left(\sum_{\nu} X(\nu_k)\right)$ is bounded as $k \to \infty$.

These results represent sufficient (but not necessary) conditions for the convergence of the forward and adjoint MC and can be easily verified if H is explicitly available. However, in many cases the conditions ‖H‖∞ < 1 or ‖H‖1 < 1 may be too restrictive.

The connection between Lemma 1 and Theorems 1 and 2 will be explained in the next section, dedicated to the definition of transition probabilities.

3.2 Construction of transition probabilities

The way the transition probability is defined has a significant impact on the properties of the resulting algorithm, and in many circumstances, the choice can make the difference between convergence and divergence of the stochastic scheme. Two main approaches have been considered in the literature: uniform probabilities and weighted probabilities. We discuss these next.

3.2.1 Uniform probabilities

With this approach, the transition matrix P is such that all the transitions corresponding to each row have equal probability of occurring:

Forward: $P_{ij} = \begin{cases} 0 & \text{if } H_{ij} = 0, \\ \dfrac{1}{\#(\text{nonzeros in row } i \text{ of } H)} & \text{if } H_{ij} \neq 0; \end{cases}$

Adjoint: $P_{ij} = \begin{cases} 0 & \text{if } H_{ji} = 0, \\ \dfrac{1}{\#(\text{nonzeros in column } i \text{ of } H)} & \text{if } H_{ji} \neq 0. \end{cases}$

The MC approach resorting to this definition of the transition matrix, in accordance with one study,11 is called uniform MC.

3.2.2 Weighted probabilities

An alternative definition of transition matrices aims to associate nonzero probability to the nonzero entries of H according to their magnitude. For instance, we may employ the following definition:

Forward: $p(k_\ell = j \mid k_{\ell-1} = i) = P_{ij} = \dfrac{|H_{ij}|^p}{\sum_{k=1}^{n} |H_{ik}|^p}$,

and

Adjoint: $p(k_\ell = j \mid k_{\ell-1} = i) = P_{ij} = \dfrac{|H_{ji}|^p}{\sum_{k=1}^{n} |H_{ki}|^p}$,

where p ∈ N. The case p = 1 is called MC almost optimal (MAO). The reason for the "almost optimal" designation can be understood looking at Lemma 1, as the quantity $\sum_{k=1}^{N} \frac{g_k^2}{\beta_k}$ is minimized when the probability vector is defined as $\beta_k = \frac{|g_k|}{\sum_{k=1}^{N} |g_k|}$. Indeed, Lemma 1 implies that the almost optimal probability minimizes the ∞-norm of Ĥ for the forward method and the 1-norm of Ĥ for the adjoint method, because the role of g in Lemma 1 is played by the rows of H in the former case and by the columns of H in the latter one. This observation provides us with easily computable upper bounds for 𝜌(Ĥ).
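A sketch of the two constructions for the forward method (illustrative names; dense NumPy for readability). The adjoint variants are obtained by applying the same functions to H^T.

```python
import numpy as np

def uniform_probabilities(H):
    """Uniform MC: equal probability for every nonzero entry in each row of H."""
    mask = (H != 0).astype(float)
    counts = mask.sum(axis=1, keepdims=True)
    return np.divide(mask, counts, out=np.zeros_like(mask), where=counts > 0)

def weighted_probabilities(H, p=1):
    """Weighted probabilities (MAO for p = 1): P_ij = |H_ij|^p / sum_k |H_ik|^p."""
    W = np.abs(H) ** p
    sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, sums, out=np.zeros_like(W), where=sums > 0)
```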

3.3 Classes of matrices with guaranteed convergence

On the one hand, sufficient conditions for convergence of MC linear solvers are very restrictive; see, for example, the previous studies.17,18 On the other hand, the necessary and sufficient condition in one study18 requires knowledge of 𝜌(Ĥ), which is not readily available. Note that explicit computation of 𝜌(Ĥ) is quite expensive, comparable to the cost of solving the original linear system. Although ensuring that 𝜌(H) < 1 (by means of appropriate preconditioning) is in many cases possible, guaranteeing that 𝜌(Ĥ) < 1 is generally much more problematic.

Here, we identify matrix types for which both conditions can be satisfied by an appropriate choice of splitting, so that convergence of the MC scheme is guaranteed.

3.3.1 Strictly diagonally dominant (SDD) matrices

One of these categories is represented by SDD matrices. We investigate under which conditions diagonal preconditioning is enough to ensure convergence. We recall the following definitions.

Definition 1. A matrix A ∈ Rn×n is SDD by rows if

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|, \quad i = 1, \ldots, n. \qquad (19)$$

Definition 2. A matrix A ∈ Rn×n is SDD by columns if AT is SDD by rows, that is,

$$|a_{jj}| > \sum_{\substack{i=1 \\ i \neq j}}^{n} |a_{ij}|, \quad j = 1, \ldots, n. \qquad (20)$$


Suppose A is SDD by rows. Then we can apply left diagonal (Jacobi) preconditioning, obtaining an iteration matrix H = I − P^{-1}A such that ‖H‖∞ < 1. Introducing a MAO transition probability for the forward method:

$$P_{ij} = \frac{|H_{ij}|}{\sum_{k=1}^{n} |H_{ik}|},$$

we have that the entries of Ĥ are defined as follows:

$$\hat{H}_{ij} = \frac{H_{ij}^2}{P_{ij}} = |H_{ij}| \left( \sum_{k=1}^{n} |H_{ik}| \right).$$

Consequently,

$$\sum_{j=1}^{n} |\hat{H}_{ij}| = \sum_{j=1}^{n} \hat{H}_{ij} = \left( \sum_{j=1}^{n} |H_{ij}| \right) \left( \sum_{k=1}^{n} |H_{ik}| \right) = \left( \sum_{j=1}^{n} |H_{ij}| \right)^2 < 1, \quad \forall i = 1, \ldots, n.$$

This implies that 𝜌(Ĥ) ⩽ ‖Ĥ‖∞ < 1, guaranteeing that the forward MC converges. However, nothing can be said a priori about the convergence of the adjoint method.

On the other hand, if (20) holds, we can apply right diagonal (Jacobi) preconditioning, which results in an iteration matrix H = I − AP^{-1} such that ‖H‖1 < 1. In this case, by a similar reasoning we conclude that the adjoint method converges, owing to ‖Ĥ‖1 < 1; however, nothing can be said a priori about the forward method.

Finally, it is clear that if A is SDD by rows and by columns, then a (left or right) diagonal preconditioning will result in the convergence of both the forward and the adjoint MC schemes.
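The argument above can be checked numerically on a small example; the sketch below (illustrative; the test matrix is a simple SDD stencil chosen for the purpose) applies left Jacobi preconditioning and prints the bounds ‖H‖∞ and ‖Ĥ‖∞.

```python
import numpy as np

# A small matrix that is SDD by rows: 2.5 on the diagonal, -1 on the off-diagonals.
n = 50
A = 2.5 * np.eye(n) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)

H = np.eye(n) - A / np.diag(A)[:, None]     # left Jacobi preconditioning: H = I - D^{-1} A
row_sums = np.abs(H).sum(axis=1)            # each row sum is < 1 because A is SDD by rows
print("||H||_inf     =", row_sums.max())
print("||H_hat||_inf =", (row_sums ** 2).max())   # row sums of H_hat are the squares above
```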

3.3.2 Generalized diagonally dominant (GDD) matrices

Another class of matrices for which the convergence of MC solvers is ensured is that of GDD matrices. We recall the following definition.

Definition 3. A square matrix A ∈ Cn×n is said to be GDD if

$$|a_{ii}| d_i \geq \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}| d_j, \quad i = 1, \ldots, n,$$

for some positive vector d = (d1, …, dn)T.

A proper subclass of the class of GDD matrices is represented by the nonsingular M-matrices. Recall that A is a nonsingular M-matrix if it is of the form A = rI − B, where B is nonnegative and r > 𝜌(B). It can be shown (see, e.g., one study19) that a matrix A ∈ Rn×n is a nonsingular M-matrix if and only if there exists a positive diagonal matrix Δ such that AΔ is SDD by rows. Clearly, every nonsingular M-matrix is GDD.

It is well known (see, e.g., the work of Axelsson20) that the classical Jacobi, block Jacobi, and Gauss-Seidel splittings are convergent if A is a nonsingular M-matrix. However, this is not enough to ensure the convergence of MC schemes based on the corresponding fixed point (preconditioned Richardson) iteration, because in general, we cannot expect that 𝜌(Ĥ) < 1.

Nevertheless, if A is an M-matrix, there exist efficient methods to determine a diagonal scaling of A so that the resulting matrix is SDD by rows. Note that the scaled matrix is still an M-matrix; therefore, applying left Jacobi preconditioning to this matrix will guarantee that both 𝜌(H) < 1 and 𝜌(Ĥ) < 1.

In one study,21 the author presents a procedure to determine whether a given matrix A ∈ Cn×n is GDD (in which case the diagonal scaling that makes A SDD is produced) or not. The algorithm can be described as follows.

This procedure, which in practice converges very fast, turns a generalized diagonally dominant matrix (in particular, a nonsingular M-matrix) into an SDD matrix by rows. By replacing $S_i = \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|$ at step 1 with $S_j = \sum_{\substack{i=1 \\ i \neq j}}^{n} |a_{ij}|$, and by replacing $a_{ji} = a_{ji} \cdot d_i$ with $a_{ji} = a_{ji} \cdot d_j$, we obtain the algorithm that turns a GDD matrix into a matrix that is SDD by columns.

Once we have applied this transformation to the original matrix A, the MC scheme combined with diagonal preconditioning is ensured to converge.
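As a concrete (though more expensive) illustration of such a scaling for nonsingular M-matrices specifically, one can take d = A^{-1}e > 0, since the scaled matrix A diag(d) then has unit row sums and is therefore SDD by rows. This is not the procedure of one study,21 which avoids solving a system with A; the sketch below is only meant to make the discussion tangible.

```python
import numpy as np

n = 100
# Nonsingular M-matrix that is not SDD: the 1D Laplacian stencil tridiag(-1, 2, -1).
A = 2.0 * np.eye(n) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)

d = np.linalg.solve(A, np.ones(n))          # d = A^{-1} e > 0 because A^{-1} >= 0 has no zero rows
A_scaled = A * d                            # column scaling: A @ diag(d)
margin = 2 * np.abs(np.diag(A_scaled)) - np.abs(A_scaled).sum(axis=1)
print("min row dominance margin:", margin.min())   # positive, so A_scaled is SDD by rows
```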

3.3.3 Block diagonally dominant matrices

In this section, we analyze situations in which block diagonal preconditioning can produce a convergent MC linear solver. Assume that A has been partitioned into p × p block form, and that each diagonal block has size ni with n1 + · · · + np = n. Assume further that each diagonal block Aii is nonsingular. The iteration matrix H ∈ Rn×n resulting from a block diagonal left preconditioning is

$$H = \begin{bmatrix}
0_{n_1 \times n_1} & -A_{11}^{-1} A_{12} & \cdots & \cdots & -A_{11}^{-1} A_{1p} \\
-A_{22}^{-1} A_{21} & 0_{n_2 \times n_2} & -A_{22}^{-1} A_{23} & \cdots & -A_{22}^{-1} A_{2p} \\
\vdots & \vdots & \ddots & & \vdots \\
\vdots & \vdots & & \ddots & \vdots \\
-A_{pp}^{-1} A_{p1} & \cdots & \cdots & -A_{pp}^{-1} A_{p,p-1} & 0_{n_p \times n_p}
\end{bmatrix}.$$

Below, we denote with "i || m" the modulo operation applied to the integers i and m. The symbol "⌊·⌋" stands for the floor function, as usual.

Consider first the forward method. Assuming (for ease of notation) that all the blocks have the same size m = n/p, the entries of the MAO transition probability matrix become

$$P_{ij} = \frac{|H_{ij}|}{\sum_{k=1}^{n} |H_{ik}|} = \frac{\left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{j}{m} \rfloor} \right)_{i \| m,\, j \| m} \right|}{\displaystyle\sum_{\substack{k=1 \\ k \neq i}}^{n} \left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{k}{m} \rfloor} \right)_{i \| m,\, k \| m} \right|}.$$

Consequently, the entries of Ĥ are given by

$$\hat{H}_{ij} = |H_{ij}| \left( \sum_{k=1}^{n} |H_{ik}| \right) = \left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{j}{m} \rfloor} \right)_{i \| m,\, j \| m} \right| \times \sum_{\substack{k=1 \\ k \neq i}}^{n} \left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{k}{m} \rfloor} \right)_{i \| m,\, k \| m} \right|.$$

Computing the sum over a generic row of Ĥ, we obtain

$$\sum_{j=1}^{n} |\hat{H}_{ij}| = \sum_{j=1}^{n} \hat{H}_{ij} = \left( \sum_{\substack{j=1 \\ j \neq i}}^{n} \left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{j}{m} \rfloor} \right)_{i \| m,\, j \| m} \right| \right)^2.$$

Consider now the quantity ‖Ĥ‖∞. Clearly,

$$\|\hat{H}\|_\infty < 1 \iff \sum_{\substack{j=1 \\ j \neq i}}^{n} \left| \left( [A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{i}{m} \rfloor}]^{-1} A_{\lfloor \frac{i}{m} \rfloor \lfloor \frac{j}{m} \rfloor} \right)_{i \| m,\, j \| m} \right| < 1, \quad \forall i = 1, \ldots, n.$$

A sufficient condition for this to happen is that

$$\sum_{\substack{k=1 \\ k \neq h}}^{p} \|A_{hh}^{-1} A_{hk}\|_\infty < 1, \quad \forall h = 1, \ldots, p. \qquad (21)$$

Introducing the matrix H̃ ∈ Rp×p defined as

$$\widetilde{H} = \begin{bmatrix}
0 & \|A_{11}^{-1} A_{12}\|_\infty & \cdots & \cdots & \|A_{11}^{-1} A_{1p}\|_\infty \\
\|A_{22}^{-1} A_{21}\|_\infty & 0 & \|A_{22}^{-1} A_{23}\|_\infty & \cdots & \|A_{22}^{-1} A_{2p}\|_\infty \\
\vdots & \vdots & \ddots & & \vdots \\
\vdots & \vdots & & \ddots & \vdots \\
\|A_{pp}^{-1} A_{p1}\|_\infty & \cdots & \cdots & \|A_{pp}^{-1} A_{p,p-1}\|_\infty & 0
\end{bmatrix},$$

we can formulate a sufficient condition for the convergence of the forward MC scheme:

$$\|\widetilde{H}\|_\infty < 1 \implies \|\hat{H}\|_\infty < 1. \qquad (22)$$

Note that Equation 21 also implies that ‖H‖∞ < 1.

We now turn our attention to the adjoint method. Analogously to the forward method, we can define

$$\hat{H}_{ij} = |H^T_{ij}| \left( \sum_{k=1}^{n} |H^T_{ik}| \right) = |H_{ji}| \left( \sum_{k=1}^{n} |H_{ki}| \right).$$

This allows us to formulate a sufficient condition for the convergence of the adjoint MC method with block diagonal preconditioning. Letting

$$\widetilde{H} = \begin{bmatrix}
0 & \|A_{11}^{-1} A_{12}\|_1 & \cdots & \cdots & \|A_{11}^{-1} A_{1p}\|_1 \\
\|A_{22}^{-1} A_{21}\|_1 & 0 & \|A_{22}^{-1} A_{23}\|_1 & \cdots & \|A_{22}^{-1} A_{2p}\|_1 \\
\vdots & \vdots & \ddots & & \vdots \\
\vdots & \vdots & & \ddots & \vdots \\
\|A_{pp}^{-1} A_{p1}\|_1 & \cdots & \cdots & \|A_{pp}^{-1} A_{p,p-1}\|_1 & 0
\end{bmatrix},$$

a sufficient condition is that ‖H̃‖1 < 1, that is,

$$\sum_{\substack{h=1 \\ h \neq k}}^{p} \|A_{hh}^{-1} A_{hk}\|_1 < 1, \quad \forall k = 1, \ldots, p. \qquad (23)$$

Again, this condition also implies that ‖H‖1 < 1.

We say that A is strictly block diagonally dominant by rows (columns) with respect to a given block partition if condition 21 (respectively, condition 23) is satisfied relative to that particular block partition. Note that a matrix may be strictly block diagonally dominant with respect to one partition and not to another. We note that these definitions of block diagonal dominance are different from those found in, for example, the work of Feingold and Varga,22 and they are easier to check in practice because they do not require computing the 2-norm of the blocks of H.
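Conditions 21 and 23 are easy to evaluate for a given partition; a sketch (illustrative, equal block sizes m, dense blocks for clarity):

```python
import numpy as np

def block_dominance_sums(A, m):
    """For a p x p partition with blocks of size m, return the quantities of
    conditions 21 (rows) and 23 (columns): their maxima must be < 1."""
    n = A.shape[0]
    p = n // m
    norm_inf = np.zeros((p, p))
    norm_one = np.zeros((p, p))
    for h in range(p):
        Ahh = A[h*m:(h+1)*m, h*m:(h+1)*m]
        for k in range(p):
            if k == h:
                continue
            M = np.linalg.solve(Ahh, A[h*m:(h+1)*m, k*m:(k+1)*m])   # A_hh^{-1} A_hk
            norm_inf[h, k] = np.abs(M).sum(axis=1).max()            # infinity-norm of the block
            norm_one[h, k] = np.abs(M).sum(axis=0).max()            # 1-norm of the block
    return norm_inf.sum(axis=1), norm_one.sum(axis=0)

# row_sums, col_sums = block_dominance_sums(A, m)
# forward MC is guaranteed if row_sums.max() < 1; adjoint MC if col_sums.max() < 1.
```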

3.4 Adaptive methods

In formulas 4 and 10, the estimation of the solution to the linear system 1 involves infinite sums, which in actual computation have to be truncated. In the following, we discuss criteria to decide the number of steps to be taken in a single random walk and the number of random walks that need to be performed at each Richardson iteration.

3.4.1 History length

We first consider criteria to terminate an individual random walk, effectively deciding how many terms of the Neumann series will be considered. One possibility is to set a predetermined history length, at which point all histories are terminated. This approach, however, presents two difficulties. First, it is difficult to determine a priori how many steps on average will be necessary to achieve a specified tolerance. Second, due to the stochastic nature of the random walks, some histories will retain important information longer than others. Truncating histories at a predetermined step runs the risk of either prematurely truncating important histories, leading to larger errors, or continuing unimportant histories longer than necessary, leading to computational inefficiency.

Our goal is to apply a cutoff via an automatic procedure, without requiring any user intervention. We would like to determine an integer m such that

$$\theta = E\left[ \sum_{\ell=0}^{m} W_\ell f_{k_\ell} \right] = f_i + \sum_{\ell=1}^{m} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{\ell-1},k_\ell} \, w_{k_0,k_1} w_{k_1,k_2} \cdots w_{k_{\ell-1},k_\ell} \, f_{k_\ell}$$

and

$$\theta_i = E\left[ \sum_{\ell=0}^{m} W_\ell \, \delta_{k_\ell, i} \right] = \sum_{\ell=0}^{m} \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} \cdots \sum_{k_\ell=1}^{n} f_{k_0} \, P_{k_0,k_1} P_{k_1,k_2} \cdots P_{k_{\ell-1},k_\ell} \, w_{k_0,k_1} \cdots w_{k_{\ell-1},k_\ell} \, \delta_{k_\ell, i}$$

are good approximations of Equations 4 and 10, respectively.

In the work of Slattery,7 a criterion was given which is applicable to both forward and adjoint methods. It requires setting up a relative weight cutoff threshold Wc and looking for a step m such that

Wm ⩽ Wc W0. (24)

In Equation 24, W0 is the value of the weight at the initial step of the random walk and Wm is the value of the weight after m steps. We will adopt a similar strategy in this work.
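A sketch of the cutoff 24 applied to a single adjoint-method history (illustrative; a hard cap on the number of steps is kept as a safeguard, and the same check applies verbatim to the forward method):

```python
import numpy as np

def run_history(Ht, Ht_abs, row_sums, start, w0, tally, W_c=1e-6, max_steps=10_000, rng=None):
    """One adjoint-MC history; terminate when |W_m| <= W_c * |W_0| (Equation 24)."""
    if rng is None:
        rng = np.random.default_rng()
    n = Ht.shape[0]
    state, weight = start, w0
    tally[state] += weight
    for _ in range(max_steps):
        if abs(weight) <= W_c * abs(w0) or row_sums[state] == 0.0:
            break                                   # relative weight cutoff or absorbing state
        probs = Ht_abs[state] / row_sums[state]
        nxt = rng.choice(n, p=probs)
        weight *= Ht[state, nxt] / probs[nxt]
        tally[nxt] += weight
        state = nxt
```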

3.4.2 Number of random walks

We now consider the selection of the number of random walks that should be performed to achieve a given accuracy. Unlike the termination of histories, this is a subject that has not been discussed in the MC linear solver literature, as all previous studies have considered the simulation of a prescribed number of histories.

The expression for the variance of the forward method is given by formula 5. In this context, a reasonable criterion to determine the number Ñi of random walks to be run is to set a threshold 𝜖1 and determine Ñi such that

$$\frac{\sqrt{\mathrm{Var}[\theta_i]}}{|E[\theta_i]|} < \epsilon_1, \quad i = 1, \ldots, n. \qquad (25)$$

The dependence of Var[𝜃i] and E[𝜃i] on Ñi is due to the fact that 𝜃i is estimated by performing a fixed number of histories. Therefore, we are controlling the relative standard deviation, requiring it not to be too large. In other words, we are aiming for an approximate solution in which the uncertainty factor is small relative to the expected value. This simple adaptive approach can be applied for the estimation of each component xi. Hence, a different number of histories may be employed to compute different entries of the solution vector.

As concerns the adjoint method, the estimation of the variance is given in formula 11. A possible criterion for the adaptive selection of the number Ñ of random walks, in this situation, is that it satisfies the condition

$$\frac{\|\boldsymbol{\sigma}^{\tilde{N}}\|_1}{\|x\|_1} < \epsilon_1, \qquad (26)$$

where 𝝈 is a vector whose entries are $\sigma^{\tilde{N}}_i = \mathrm{Var}[\theta_i]$ and x is a vector whose entries are xi = E[𝜃i].

The criteria introduced above can be exploited to build an a posteriori adaptive algorithm, capable of identifying the minimal value of Ñ that verifies Equations 25 or 26, respectively. Algorithms 4 and 5 describe the MC approaches with the adaptive criteria.
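Algorithms 4 and 5 themselves are not reproduced in this extraction. The sketch below (illustrative; it reuses the hypothetical run_history helper from Section 3.4.1 and reads 𝝈 in Equation 26 as the standard deviation of the estimated mean) conveys the a posteriori adaptive loop for the adjoint method.

```python
import numpy as np

def adaptive_adjoint_mc(H, f, eps1=0.01, batch=10, max_hist=10**7, rng=None):
    """Add histories in batches until ||sigma||_1 / ||x||_1 < eps1 (Equation 26)."""
    if rng is None:
        rng = np.random.default_rng()
    n = H.shape[0]
    Ht, Ht_abs = np.array(H.T), np.abs(H.T)
    row_sums = Ht_abs.sum(axis=1)
    f_norm = np.abs(f).sum()
    p0 = np.abs(f) / f_norm
    sum_x, sum_x2, N = np.zeros(n), np.zeros(n), 0
    while N < max_hist:
        for _ in range(batch):                       # one batch of additional histories
            tally = np.zeros(n)
            start = rng.choice(n, p=p0)
            run_history(Ht, Ht_abs, row_sums, start, f_norm * np.sign(f[start]), tally, rng=rng)
            sum_x, sum_x2, N = sum_x + tally, sum_x2 + tally**2, N + 1
        mean = sum_x / N
        var = np.maximum(sum_x2 / N - mean**2, 0.0)  # per-component sample variance of the tallies
        sigma = np.sqrt(var / N)                     # standard deviation of the mean
        if sigma.sum() < eps1 * np.abs(mean).sum():
            break
    return mean, N
```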


The use of the adaptive approach for the selection of the number of histories has a dual purpose. First, it guarantees that the update computed with the MC step is accurate enough to preserve convergence. Second, it provides the user with a tuning parameter to distribute the computation between the deterministic and the stochastic part of the algorithm. Lowering the value of the threshold for the relative standard deviation increases the number of histories per iteration. This results in a more accurate stochastic updating and reduces the iterations necessary to converge. Although guessing an a priori fixed number of histories may lead to a smaller number of MC histories overall, it might either hinder the convergence or distribute too much computation on the deterministic side of the scheme (or both). Generally speaking, the adaptive approach is more robust and more useful, especially when 𝜌(H) is close to 1, because in this case the Richardson step is less effective in dampening the uncertainty coming from the previous iterations.

3.5 Preconditioning

As noted at the end of Section 2, left preconditioning can be incorporated into any MC linear solver algorithm by simply substituting A with P^{-1}A and b with P^{-1}b in Equation 1; that is, we set H = I − P^{-1}A and f = P^{-1}b in Equation 2. Right preconditioning can also be incorporated by rewriting Equation 1 as AP^{-1}y = b, with y = Px; that is, we set H = I − AP^{-1} and replace x by y in Equation 2. The solution x to the original system 1 is then given by x = P^{-1}y. Likewise, split (left and right) preconditioning can also be used. The MC process, however, imposes some constraints on the choice of preconditioner. Most significantly, because the transition probabilities are built based on the values of the iteration matrix H, it is necessary to have access to the entries of the preconditioned matrix P^{-1}A (or AP^{-1}). Therefore, we are limited to preconditioners that enable explicitly forming the preconditioned matrix while retaining some of the sparsity of the original matrix.

One possible preconditioning approach involves either diagonal or block diagonal preconditioning (with blocks of small or moderate size). Diagonal preconditioning does not alter the sparsity of the original coefficient matrix, whereas block diagonal preconditioning will incur a moderate amount of fill-in in the preconditioned matrix if the blocks are not too large. From the discussions in Sections 3.3.1 and 3.3.3, selecting a diagonal or block diagonal preconditioner guarantees convergence of the MC schemes for matrices that are SDD or block diagonally dominant, respectively. In addition, M-matrices that are not strictly or block diagonally dominant can also be dealt with by first rescaling A so that it becomes SDD, as discussed in Section 3.3.1.

In principle, other standard preconditioning approaches can also be used in an attempt to achieve both 𝜌(H) < 1 and 𝜌(Ĥ) < 1 while still retaining sparsity in the preconditioned matrix. One possibility is the use of incomplete LU factorizations.23,24 If P = LU is the preconditioner with sparse triangular factors L and U, then P^{-1}A can in principle be formed explicitly provided that the sparsity in L, U, and A is carefully exploited in the forward and back substitutions needed to form (LU)^{-1}A = U^{-1}(L^{-1}A). In general, however, this results in a rather full preconditioned matrix; sparsity needs to be preserved by dropping small entries in the resulting matrix.

Another class of preconditioners that is potentially of interest for use with MC linear solvers is that of approximate inverse preconditioners.23 In these algorithms, an approximation to the inverse of A is generated directly, and the computation of the preconditioned matrix reduces to one or more sparse matrix-matrix products, a relatively straightforward task. As with ILU factorizations, multiple versions of approximate inverse preconditioning exist, which may have different behavior in terms of effectiveness of the preconditioner versus the resulting reduction in sparsity.

A downside of the use of ILU or approximate inverse preconditioning is that the quality of preconditioner needed to achieve both 𝜌(H) < 1 and 𝜌(Ĥ) < 1 is difficult to determine. Indeed, in some situations, it may happen that modifying the preconditioner so as to reduce 𝜌(H) may actually lead to an increase in 𝜌(Ĥ), decreasing the effectiveness of the MC process on the system or even causing it to diverge. In other words, for both ILU and sparse approximate inverse preconditioning, it seems to be very difficult to guarantee convergence of the MC linear solvers a priori.
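For moderate problem sizes this trade-off can be explored directly: form the preconditioned matrix explicitly, drop small entries to limit fill-in, and monitor both spectral radii. The sketch below is illustrative (SciPy sparse; M stands for whatever sparse approximate inverse is being tested and is not computed here).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def drop_small(M, tol):
    """Drop entries whose magnitude is below tol times the largest magnitude."""
    C = M.tocoo()
    keep = np.abs(C.data) >= tol * np.abs(C.data).max()
    return sp.coo_matrix((C.data[keep], (C.row[keep], C.col[keep])), shape=C.shape).tocsr()

def preconditioned_radii(A, M, drop_tol=1e-2):
    """Explicitly form H = I - M A, sparsify it, and return rho(H) and rho(H_hat)
    for the MAO forward probabilities (H_hat_ij = |H_ij| * sum_k |H_ik|)."""
    n = A.shape[0]
    H = drop_small(sp.eye(n, format="csr") - M @ A, drop_tol)
    H_abs = abs(H)
    row_sums = np.asarray(H_abs.sum(axis=1)).ravel()
    H_hat = sp.diags(row_sums) @ H_abs
    rho = lambda B: float(np.abs(spla.eigs(B, k=1, which="LM", return_eigenvectors=False)[0]))
    return rho(H), rho(H_hat)
```

With M = sp.diags(1.0 / A.diagonal()) this reduces to the Jacobi case discussed above.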

3.6 Considerations about computational complexity

Providing an analysis of the computational complexity for the aforementioned algorithms is not entirely straightforward because of their stochastic nature. Indeed, different statistical samplings can produce estimates with different uncertainty levels, requiring a proper tuning of the number of samplings computed to reach a prescribed accuracy. Moreover, we already mentioned that the asymptotic analysis of MC convergence assumes random walks with infinitely many steps and N → ∞, where N is the number of random walks. However, in practice each history must be truncated to a finite number of steps, and the number of statistical samplings must be finite as well. The actual value of N and the length of the histories affect the accuracy of the statistical estimation, thus influencing the number of iterations in a hybrid algorithm, because the uncertainty propagates to subsequent iterations. Therefore, here, we can only provide a tentative analysis of the complexity of the forward and adjoint MC methods, assuming a specific history length and a fixed number N of statistical samplings.

Recalling formula 4 for the entry-wise estimate for the solution with the forward method, the cost of reconstructing the entire solution vector is

Forward: O(N · k𝓁 · n), (27)

where N is the number of histories, k𝓁 is the length of each random walk, and n is the number of unknowns. As concerns the adjoint method, formula 10 leads to

Adjoint: O(N · k𝓁). (28)

The operation count for the hybrid schemes can thus be obtained by combining the cost of the Richardson scheme with the complexity of standard MC techniques.

Regarding the SMC algorithm, the standard MC scheme is combined with the computation of the residual at each iteration. The cost of the residual computation is essentially that of a sparse matrix-vector product, which is O(n) for a sparse system. Therefore, the complexity of a single SMC iteration is

SMC with forward MC update: O((n + 1) · N · k𝓁) (29)

using the forward MC as an inner solver, and

SMC with adjoint MC update: O(n + N · k𝓁) (30)

using the adjoint MC as an inner solver.

A single iteration of MCSA requires computing the residual, applying a matrix-vector product using H, and applying an MC update. The complexity of an MCSA iteration using the forward MC is therefore

MCSA with forward MC update: O((2n+1) ·N ·k𝓁), (31)

whereas using the adjoint MC we find

MCSA with adjoint MC update: O(2n + N · k𝓁). (32)

Note that these estimates require nnz(H) ≈ nnz(A), in the sense that both H and A contain O(n) nonzeros. The actual values attained by N and k𝓁 depend on the thresholds employed to truncate a single history and to determine the number of random walks to use. In general, the higher N and k𝓁, the lower the number of outer iterations to achieve a prescribed accuracy. Finally, the total cost will depend also on the number of iterations, which is difficult to predict in practice.

4 NUMERICAL RESULTS

In this section, we discuss the results of numerical experiments. The main goal of these experiments is to gain some insight into the behavior of adaptive techniques, and to test different preconditioning options within the MC approach.

4.1 Numerical tests with adaptive methods

In this section, we study experimentally the adaptive approaches discussed in Section 3.4. For this purpose, we restrict our attention to standard MC linear solvers.

For these tests, we limit ourselves to small matrices, primarily because the numerical experiments are being computed on a standard laptop and the computational cost of MC methods rapidly becomes prohibitive on such machines. The smallest of these matrices represents a finite difference discretization of a 1D reaction-diffusion equation, the second one is a discrete 2D Laplacian with zero Dirichlet boundary conditions, the third a steady 2D advection-diffusion operator discretized by quadrilateral linear finite elements using the IFISS package,25 and the fourth one results from a finite volume discretization of a thermal radiation diffusion equation (Marshak problem). The first and the last of these matrices are SDD by both rows and columns. For all the problems, left diagonal preconditioning is applied. Details about these matrices are given in Table 1.

We present results for both the forward and the adjoint MC methods. As concerns the forward method, we set the maximum number of histories per entry to 10^7. In Tables 2 and 3, we show results for values of the adaptive threshold 25 equal to 𝜖1 = 0.1 and 𝜖1 = 0.01, respectively.

TABLE 1 Properties of the matrices employed for the checks on adaptivity

Type of problem        | n    | 𝜌(H)   | Forward 𝜌(Ĥ) | Adjoint 𝜌(Ĥ)
1D reaction-diffusion  | 50   | 0.4991 | 0.9827       | 0.9827
2D Laplacian           | 900  | 0.9949 | 0.994        | 0.9945
2D advection-diffusion | 1089 | 0.9836 | 0.983        | 0.983
Marshak problem        | 1600 | 0.6009 | 0.3758       | 0.3815

TABLE 2 Forward Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.1

Type of problem        | Relative error | Nb. histories
1D reaction-diffusion  | 0.003          | 5,220
2D Laplacian           | 0.1051         | 262,350
2D advection-diffusion | 0.022          | 1,060,770
Marshak problem        | 0.0012         | 1,456,000

TABLE 3 Forward Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.01

Type of problem        | Relative error | Nb. histories
1D reaction-diffusion  | 4·10^−4        | 512,094
2D Laplacian           | 0.0032         | 108,551,850
2D advection-diffusion | 0.0023         | 105,476,650
Marshak problem        | 5.8·10^−4      | 144,018,700


A batch size of two is used at each adaptive check to verify the magnitude of the apparent relative standard deviation. As expected, results are aligned with the convergence rate predicted by the CLT. Indeed, decreasing the tolerance 𝜖1 by a factor of 10, we see that the relative error undergoes a decrease of the same order, requiring roughly 100 times more histories. In the case of the 2D Laplacian, we actually have an increase of a factor close to 400 in the number of histories, but the relative error is decreased by more than thirty times. For this particular example, the forward method overestimates the number of histories needed to satisfy a prescribed reduction on the standard deviation.

As regards the adjoint MC, at each adaptive check the number of random walks employed is increased by 10. A maximum number of histories equal to 10^10 is set. Tables 4, 5, and 6 show results for the different test cases using adaptive threshold 26 with values 𝜖1 = 0.1, 𝜖1 = 0.01, and 𝜖1 = 0.001, respectively. By comparing the reported errors, it is clear that a decrease in the value of the threshold induces a reduction of the relative error of the same order of magnitude. This confirms the effectiveness of the adaptive selection of histories with an error reduction goal. Each decrease in the error by an order of magnitude requires an increase in the total number of histories employed by two orders of magnitude, as expected.

The same simulations can be run resorting to the expected value estimator. Results are shown in Tables 7, 8, and 9 for the threshold values of 𝜖1 = 0.1, 𝜖1 = 0.01, and 𝜖1 = 0.001, respectively. As can be noticed, in terms of error scaling the results are quite similar to the ones obtained with the collision estimator.

TABLE 4 Collision estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.1

Type of problem Relative error Nb. histories

1D reaction-diffusion 0.08 1820

2D Laplacian 0.136 570

2D advection-diffusion 0.08 2400

Marshak problem 0.288 880

TABLE 5 Collision estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.01

Type of problem Relative error Nb. histories

1D reaction-diffusion 0.0082 185,400

2D Laplacian 0.0122 126,800

2D advection-diffusion 0.0093 219,293

Marshak problem 0.0105 650,040

TABLE 6 Collision estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.001

Type of problem Relative error Nb. histories

1D reaction-diffusion 9.56·10⁻⁴ 15,268,560

2D Laplacian 0.001 12,600,000

2D advection-diffusion 0.0011 23,952,000

Marshak problem 0.0011 80,236,000

TABLE 7 Expected value estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.1

Type of problem Relative error Nb. histories

1D reaction-diffusion 0.0463 1000

2D Laplacian 0.1004 900

2D advection-diffusion 0.0661 1300

Marshak problem 0.0526 2000

TABLE 8 Expected value estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.01

Type of problem Relative error Nb. histories

1D reaction-diffusion 0.004 100,600

2D Laplacian 0.0094 83,700

2D advection-diffusion 0.0088 124,400

Marshak problem 0.0056 166,000

TABLE 9 Expected value estimator—Adjoint Monte Carlo. Adaptive selection of histories, 𝜖1 = 0.001

Type of problem Relative error Nb. histories

1D reaction-diffusion 0.004 10,063,300

2D Laplacian 9.31·10⁻⁴ 8,377,500

2D advection-diffusion 0.0013 12,435,000

Marshak problem 7.79·10⁻⁴ 16,537,000

reach a prescribed accuracy, the orders of magnitude are the same for both the collision and the expected value estimators. However, the expected value estimator requires in most cases a smaller number of realizations. This behavior becomes more pronounced as the value of the threshold decreases, making the computation increasingly cost-effective.

4.2 Preconditioning approaches

In this section, we examine the effect of different preconditioners on the values attained by the spectral radii 𝜌(H) and 𝜌(Ĥ). For this purpose, we focus on the 2D discrete Laplacian and the 2D discrete advection-diffusion operator from the previous section.

The values of the two spectral radii with diagonal preconditioning have already been shown in Table 1. Here, we consider the effect of block diagonal preconditioning for different block sizes, and the use of the factorized sparse approximate inverse preconditioner AINV26,27 for different values of the drop tolerance (which controls the sparsity of the approximate inverse factors). Intuitively, with these two types of preconditioners both 𝜌(H) and 𝜌(Ĥ) should approach zero for increasing block size and decreasing drop tolerance, respectively; however, the convergence need not be monotonic in general, particularly for 𝜌(Ĥ). This somewhat counterintuitive behavior is shown in Tables 10 and 11, where an increase in the size of the blocks used for the block diagonal preconditioner results in an increase of 𝜌(Ĥ) for both test problems. Note also the very slow rate of decrease in 𝜌(H) for increasing block size, which is more than offset by the rapid increase



TABLE 10 Behavior of 𝜌(H) and 𝜌(Ĥ) for the 2D Laplacian with block diagonal preconditioning

Block size nnz(H)/nnz(A) 𝜌(H) 𝜌(Ĥ)

5 2.6164 0.9915 0.9837

10 5.6027 0.9907 0.9861

30 16.3356 0.9898 0.9886

TABLE 11 Behavior of 𝜌(H) and 𝜌(Ĥ) for the 2D advection-diffusion problem with block diagonal preconditioning

Block size nnz(H)/nnz(A) 𝜌(H) 𝜌(Ĥ)

3 1.4352 0.9804 0.9531

9 3.0220 0.9790 0.9674

33 9.8164 0.9783 0.9774

TABLE 12 Behavior of 𝜌(H) and 𝜌(Ĥ) for the 2D Laplacian with AINV preconditioning

Drop tolerance nnz(H)/nnz(A) 𝜌(H) 𝜌(Ĥ)

0.05 8.2797 0.9610 1.0231

0.01 33.1390 0.8668 0.8279

TABLE 13 Behavior of 𝜌(H) and 𝜌(Ĥ) for the 2D advection-diffusion problem with AINV preconditioning

Drop tolerance nnz(H)/nnz(A) 𝜌(H) 𝜌(Ĥ)

0.05 3.8392 0.9396 0.9069

0.01 15.1798 0.7964 0.6361

in the density of H, which of course implies much higher costs. We mention that for a block size of 30 the 2D discrete Laplacian is block diagonally dominant, but not for smaller block sizes. The 2D discrete advection-diffusion operator is not block diagonally dominant for any of the three reported block sizes.
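The kind of quantities reported in Tables 10 and 11 can be computed directly. The sketch below builds H = I − P⁻¹A for a block Jacobi preconditioner with contiguous diagonal blocks and estimates 𝜌(H) with ARPACK; the block partitioning and helper names are our own choices, so the numbers need not coincide exactly with those in the tables.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_jacobi_iteration_matrix(A, block_size):
    """H = I - P^{-1} A for a block diagonal preconditioner P formed from
    contiguous diagonal blocks of A (illustrative partitioning)."""
    n = A.shape[0]
    A = A.tocsc()
    inv_blocks = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        inv_blocks.append(np.linalg.inv(A[start:stop, start:stop].toarray()))
    P_inv = sp.block_diag(inv_blocks, format="csr")
    return sp.identity(n, format="csr") - P_inv @ A

def spectral_radius(H):
    # Largest-magnitude eigenvalue via ARPACK; adequate at these sizes.
    return abs(spla.eigs(H, k=1, which="LM", return_eigenvectors=False)[0])

# Example: 2D Laplacian with zero Dirichlet conditions on a 30 x 30 interior grid.
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))   # n = 900

for bs in (5, 10, 30):
    H = block_jacobi_iteration_matrix(A, bs)
    print(f"block size {bs:2d}: nnz(H)/nnz(A) = {H.nnz / A.nnz:.2f}, "
          f"rho(H) = {spectral_radius(H):.4f}")
```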

In Tables 12 and 13, the values of the spectral radii are shown for the AINV preconditioner with two different values of the drop tolerance.26,27 It is interesting to point out that, for the two-dimensional Laplacian, a drop tolerance 𝜏 = 0.05 entails 𝜌(H) < 1, but the same does not hold for Ĥ. Both convergence conditions are satisfied by reducing the drop tolerance, but at the price of very high fill-in in the preconditioned matrix.

In summary, we conclude that it is generally very challenging to guarantee the convergence of MC linear solvers a priori. Simple (block) diagonal preconditioners may work even if A is not strictly (block) diagonally dominant, but it is hard to know beforehand if a method will converge, especially due to the lack of a priori bounds on 𝜌(Ĥ). Moreover, the choice of the block sizes in the block diagonal case is not an easy matter. Sparse approximate inverses are a possibility, but the amount of fill-in required to satisfy the convergence conditions could be unacceptably high. These observations suggest that it is difficult to use MC linear solvers in the case of linear systems arising from the discretization of steady-state PDEs, which

typically are not SDD. As we shall see later, the situation is more favorable in the case of time-dependent problems.

4.3 Hybrid methods

Next, we present results for hybrid methods, combining a deterministic Richardson iteration with MC acceleration. We recall that the convergence conditions are the same as for the direct stochastic approaches; therefore, the concluding observations from the previous section still apply.
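For reference, the skeleton below spells out one MCSA outer iteration in the form described above: a preconditioned Richardson relaxation step followed by a stochastic estimate of the remaining error from the current residual. Dense arrays, the argument names, the stopping test, and the deterministic stand-in for the MC correction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mcsa(A, b, M_inv, mc_correction, tol=1e-8, max_iter=200):
    """Skeleton of the MCSA outer iteration.  mc_correction(H, rhs) should
    return an (approximate, possibly noisy) solution of (I - H) delta = rhs,
    for instance by adjoint Monte Carlo."""
    n = len(b)
    H = np.eye(n) - M_inv @ A                 # iteration matrix of the splitting
    x = np.zeros(n)
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        x = x + M_inv @ (b - A @ x)           # Richardson half-step (smoothing)
        r = b - A @ x                         # residual after the relaxation
        x = x + mc_correction(H, M_inv @ r)   # Monte Carlo half-step (correction)
        if np.linalg.norm(b - A @ x) <= tol * b_norm:
            return x, it
    return x, max_iter

def neumann_correction(H, rhs, terms=25):
    """Deterministic stand-in for the MC correction: a truncated Neumann
    series for (I - H)^{-1} rhs, used only to make the sketch runnable."""
    delta, term = np.zeros_like(rhs), rhs.copy()
    for _ in range(terms):
        delta += term
        term = H @ term
    return delta
```

Replacing neumann_correction with an adjoint MC estimate recovers the hybrid scheme compared against Richardson and SMC in the tables that follow.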

4.3.1 Poisson problem
Consider the standard 2D Poisson model problem:

−Δu = f   in Ω,
u = 0     on 𝜕Ω,        (33)

where Ω = (0, 1) × (0, 1). For the numerical experiments we use as the right-hand side a sinusoidal distribution in both the x and y directions:

f (x, y) = sin(𝜋x) sin(𝜋y).

We discretize problem (33) using standard five-point finite differences. For N = 32 nodes in each direction, we obtain the 900 × 900 linear system already used in the previous section.

Consider the iteration matrix H corresponding to (left) diagonal preconditioning. It is well known that 𝜌(H) < 1, and indeed, we can see from Table 1 that 𝜌(H) ≈ 0.9949. It is also easy to see that ‖H‖1 = 1. In order for the adjoint MC method to converge, it is necessary to have 𝜌(Ĥ) < 1, too. If an almost optimal probability is used, the adjoint MC method leads to a matrix Ĥ such that

‖Ĥ‖1 ⩽ ‖H‖1² = 1.

This condition by itself is not enough to guarantee that 𝜌(Ĥ) < 1. However, the iteration matrix H has zero-valued entries on the main diagonal and it has

• four entries equal to 1/4 on the rows associated with a node which is not adjacent to the boundary;
• two entries equal to 1/4 on the rows associated with nodes adjacent to a corner of the square domain;
• three entries equal to 1/4 on the rows associated with nodes adjacent to the boundary on an edge of the square domain.

Recalling the definition of Ĥ in terms of the entries of the iteration matrix H and of the transition probability P, we see that

Ĥ = DH,

where D is a diagonal matrix with diagonal entries equal to 1, 1/2, or 3/4. In particular, di = diag(D)i = 1 if the ith row of the discretized Laplacian is associated with a node which is not adjacent to the boundary, di = 1/2 if the row is associated with a node of the grid adjacent to a corner of the boundary, and di = 3/4 if the associated node of the grid is adjacent to the boundary of the domain far from a corner.


TABLE 14 Numerical results for the Poisson problem

Algorithm Relative error No. of iterations Average no. of histories per iteration

Richardson 9.8081·10⁻⁸ 3133 -

Sequential MC 7.9037·10⁻⁸ 9 8,264,900

Adjoint MCSA 8.0872·10⁻⁸ 8 1,738,250

Note. MC = Monte Carlo; MCSA = Monte Carlo synthetic acceleration.

Because H is an irreducible nonnegative matrix and D is a positive definite diagonal matrix, we can invoke a result in one study28 to conclude that

𝜌(Ĥ) = 𝜌(DH) ⩽ 𝜌(D)𝜌(H).

Because 𝜌(D) = 1, it follows that 𝜌(Ĥ) ⩽ 𝜌(H) < 1. Therefore, diagonal preconditioning always guarantees convergence for the MC linear solver applied to any 2D Laplace operator discretized with five-point finite differences. Similar arguments apply to the d-dimensional Laplacian, for any d.
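The argument is easy to check numerically. The sketch below assembles the diagonally preconditioned five-point Laplacian, forms Ĥ = DH from the row sums of H (the MAO construction used above; the notation Ĥ for the adjoint matrix follows our reading of the text), and verifies that ‖H‖1 = 1 and 𝜌(Ĥ) ⩽ 𝜌(H) < 1.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Diagonally preconditioned 5-point Laplacian on a 30 x 30 interior grid
# (the 900 x 900 system of Table 1).
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
H = (sp.identity(A.shape[0]) - sp.diags(1.0 / A.diagonal()) @ A).tocsr()

row_sums = np.asarray(abs(H).sum(axis=1)).ravel()   # values 1, 3/4, or 1/2
H_hat = sp.diags(row_sums) @ H                      # H_hat = D H (H is nonnegative here)

rho = lambda M: abs(spla.eigs(M, k=1, which="LM", return_eigenvectors=False)[0])
print("||H||_1    =", np.asarray(abs(H).sum(axis=0)).max())   # equals 1
print("rho(H)     =", rho(H))                                 # about 0.9949
print("rho(H_hat) =", rho(H_hat))                             # bounded by rho(H)
```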

Results of numerical experiments are reported in Table 14. We compare the purely deterministic Richardson iteration (Jacobi's method in this case) with two hybrid methods, Halton's SMC and MCSA using the adjoint method. The threshold for the adaptive selection of the random walks is set to 𝜖1 = 0.1. Convergence is attained when the initial residual has been reduced by at least eight orders of magnitude, starting from a zero initial guess. Note that, as expected, the diagonally preconditioned Richardson iteration converges very slowly, because the spectral radius of H is very close to 1. Halton's SMC performs quite well on this problem, but even better results are obtained by the adjoint MCSA method, which converges in one iteration less than SMC while requiring far fewer MC histories per iteration. This might be explained by the fact that, within each outer iteration, the first relaxation step in the MCSA algorithm refines the accuracy of the solution coming from the previous iteration, before using it as the input for the MC solver at the current iterate. In particular, its effect is to dampen the statistical noise produced by the MC linear solver in the estimation of the update 𝛿x^{l+1/2} performed at the previous iteration. Therefore, the refinement, or smoothing, accomplished by the Richardson relaxation decreases the number of random walks needed for a prescribed accuracy. This hypothesis is validated by the fact that at the first iteration both SMC and MCSA use the same number of histories. Their behaviors start differing from the second iteration on, when the statistical noise is introduced in the estimation of the correction to the current iterate; see Figures 1 and 2. It can be seen that in all the numerical tests presented in this section there is a sharp increase (from O(10³)–O(10⁵) to O(10⁶)–O(10⁷)) in the number of histories when going from the first to subsequent iterations. This is due to the introduction of statistical noise coming from the MC updating of the solution. Indeed, the stochastic noise, introduced from the second iteration on, increases the uncertainty associated with

FIGURE 1 Sequential Monte Carlo (MC)—Poisson problem. Number of random walks employed at each iteration. The number for the first iteration is 2000

FIGURE 2 Monte Carlo synthetic acceleration (MCSA)—Poisson problem. Number of random walks employed at each iteration. The number for the first iteration is 2000

the estimate of the solution. Therefore, the adaptive criterion forces the algorithm to perform a higher number of histories to achieve a prescribed accuracy.

4.3.2 Reaction-diffusion problem
Here, we consider the simple reaction-diffusion problem

−Δu + 𝜎u = f   in Ω,
u = 0          on 𝜕Ω,        (34)

where Ω = (0,1) × (0,1), 𝜎 = 0.1, and f ≡ 1. A five-point finite difference scheme is applied to discretize the problem. The number of nodes in each direction of the domain is 100, so that h ≈ 0.01. The discretized problem is n × n with n = 9604. A left diagonal preconditioning is again applied to the coefficient matrix obtained from the discretization, which is SDD. The 1-norm of the iteration matrix is ‖H‖1 ≈ 0.9756. This automatically guarantees the convergence of the adjoint MC linear solver. In Table 15, a comparison between the deterministic Richardson iteration, SMC, and MCSA is provided. The results are similar to those for the Poisson equation. The number of histories per iteration for SMC and MCSA is shown in Figures 3 and 4.



TABLE 15 Numerical results for the reaction-diffusion problem

Algorithm Relative error No. of iterations Average no. of histories per iteration

Richardson 9.073·10⁻⁸ 634 -

Sequential MC 8.415·10⁻⁸ 8 12,391,375

Adjoint MCSA 6.633·10⁻⁸ 7 3,163,700

Note. MC = Monte Carlo; MCSA = Monte Carlo synthetic acceleration.

FIGURE 3 Sequential Monte Carlo (MC)—Reaction-diffusion problem. Number of random walks employed at each iteration. The number for the first iteration is 36,000

FIGURE 4 Monte Carlo synthetic acceleration (MCSA)—Reaction-diffusion problem. Number of random walks employed at each iteration. The number for the first iteration is 36,000

4.3.3 Parabolic problem
Here, we consider the following time-dependent problem:

𝜕u/𝜕t − 𝜇Δu + 𝜷(x) · ∇u = 0,   x ∈ Ω, t ∈ (0,T],
u(x, 0) = u0,                  x ∈ Ω,
u(x, t) = uD(x),               x ∈ 𝜕Ω, t ∈ (0,T],        (35)

where Ω = (0,1) × (0,1), T > 0, 𝜇 = 3/200, 𝜷(x) = [2y(1 − x²), −2x(1 − y²)]ᵀ, and uD = 0 on {x = 0} × (0,1), (0,1) × {y = 0}, and (0,1) × {y = 1}.

Implicit discretization in time (backward Euler scheme) with time step Δt and a spatial discretization with quadrilateral linear finite elements using the IFISS toolbox25 leads to a sequence of linear systems of the form

(1/Δt · M + A) u^{k+1} = f^k,   k = 0, 1, …

Here, we restrict our attention to a single generic time step. The right-hand side is chosen so that the exact solution of the linear system for the specific time step chosen is the vector of all ones. For the experiments, we use a uniform discretization with mesh size h = 2⁻⁸ and we let Δt = 10h. The resulting linear system has n = 66,049 unknowns.
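In code, the time-stepping structure looks as follows: each backward Euler step requires the solution of a system with the matrix M/Δt + A, and it is this inner solve that the hybrid methods replace. The sketch assumes M and A are given (assembled, for example, by IFISS and exported), and uses a reusable sparse direct factorization as a placeholder for the inner solver; all names are illustrative.

```python
import scipy.sparse.linalg as spla

def backward_euler_march(M, A, u0, dt, num_steps):
    """March the homogeneous problem (35) with backward Euler: each step
    requires solving (M/dt + A) u_{k+1} = (M/dt) u_k.  A reusable sparse LU
    factorization stands in here for the hybrid Richardson/MC inner solver."""
    C = (M / dt + A).tocsc()
    solve = spla.factorized(C)            # factor once, reuse at every step
    u = u0.copy()
    for _ in range(num_steps):
        u = solve(M @ u / dt)             # inner linear solve at time step k
    return u
```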

We use the factorized sparse approximate inverse AINV27 as a right preconditioner, with drop tolerance 𝜏 = 0.05 for both inverse factors. With this choice, the spectral radius of the iteration matrix H = I − AP⁻¹ is 𝜌(H) ≈ 0.9218, and the spectral radius of Ĥ for the adjoint MC is 𝜌(Ĥ) ≈ 0.9148. The MAO transition probability is employed. Resorting to a uniform probability in this case would have impeded convergence, because in that case 𝜌(Ĥ) ≈ 1.8401. This is an example demonstrating how the MAO probability can improve the

TABLE 16 Numerical results for the parabolic problem

Algorithm Relative error No. of iterations Average no. of histories per iteration

Richardson 6.277·10⁻⁷ 178 -

Sequential MC 1.918·10⁻⁹ 8 51,535,375

Adjoint MCSA 1.341·10⁻⁹ 6 16,244,000

Note. MC = Monte Carlo; MCSA = Monte Carlo synthetic acceleration.

FIGURE 5 Sequential Monte Carlo (MC)—Parabolic problem. Number of random walks employed at each iteration. The number for the first iteration is 505,000

FIGURE 6 Monte Carlo synthetic acceleration (MCSA)—Parabolic problem. Number of random walks employed at each iteration. The number for the first iteration is 505,000



behavior of the stochastic algorithm, outperforming the uniform one.

The fill-in ratio is nnz(H)/nnz(A) = 4.26; therefore, the relative number of nonzero elements in H is still acceptable in terms of storage and computational costs.
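The difference between the two probabilities is easy to see at the level of the second-moment matrix. The sketch below forms ĥᵢⱼ = hᵢⱼ²/pᵢⱼ for a row-based probability, either MAO or uniform over the nonzeros of each row; the adjoint method uses the column-based analogue, but the mechanism is identical. The function name and conventions are ours.

```python
import numpy as np

def second_moment_matrix(H, probability="mao"):
    """H_hat with entries H[i, j]**2 / p[i, j] for a row-based transition
    probability p: "mao" takes p[i, j] = |H[i, j]| / sum_k |H[i, k]|, while
    "uniform" spreads the probability evenly over the nonzeros of row i."""
    H = np.asarray(H, dtype=float)
    H_abs = np.abs(H)
    H_hat = np.zeros_like(H)
    for i in range(H.shape[0]):
        nz = H_abs[i] > 0
        if not nz.any():
            continue                           # empty row: nothing to normalize
        if probability == "mao":
            p = H_abs[i, nz] / H_abs[i].sum()
        else:                                  # uniform over the row's nonzeros
            p = np.full(nz.sum(), 1.0 / nz.sum())
        H_hat[i, nz] = H[i, nz] ** 2 / p
    return H_hat

# For the parabolic test problem, the text reports a spectral radius of about
# 0.9148 with MAO probabilities but about 1.8401 with uniform probabilities,
# which is the kind of gap this construction exposes.
```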

As before, the threshold for the check on the relative residual is set to 𝜖 = 10⁻⁸, and the threshold for the adaptive selection of the random walks is set to 𝜖1 = 0.1. The results for all three methods are shown in Table 16. As one can see, both SMC and MCSA dramatically reduce the number of iterations with respect to the purely deterministic preconditioned Richardson iteration, with MCSA outperforming SMC. Of course, each iteration is now more expensive due to the MC calculations required at each Richardson iteration, but we stress that MC is an embarrassingly parallel method. MC calculations are also expected to be more robust in the presence of faults, which is one of the main motivations for the present work.

Finally, Figures 5 and 6 show the number of MC histories per iteration for SMC and for MCSA, respectively.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we have reviewed known convergence conditions for MC linear solvers and established a few new sufficient conditions. In particular, we have determined classes of matrices for which the method is guaranteed to converge. The main focus has been on the recently proposed MCSA algorithm, which clearly outperforms previous approaches. This method combines a deterministic fixed point iteration (preconditioned Richardson method) with an MC acceleration scheme, typically an adjoint MC estimator. Various types of preconditioners have been tested, including diagonal, block diagonal, and sparse approximate inverse preconditioning.

Generally speaking, it is difficult to ensure a priori that the hybrid solver will converge. In particular, convergence of the underlying preconditioned Richardson iteration is necessary, but not sufficient. The application of hybrid solvers to nondiagonally dominant, steady-state problems presents a challenge and may require some trial and error in the choice of tuning parameters, such as the block size or drop tolerances; convergence can be guaranteed for some standard model problems, but in general, it is difficult to enforce. This is an inherent limitation of hybrid deterministic-stochastic approaches of the kind considered in this paper. It is of course possible in many cases to obtain convergence by combining the MCSA algorithm with a flexible Krylov subspace method like flexible GMRES. However, this entails additional costs, decreased parallelism, and increased storage requirements, as well as a possible reduction in the resilience of the algorithm due to the need for orthogonalization and the additional communication it requires.

On a positive note, numerical experiments show that these methods are quite promising for solving SDD linear systems

arising from time-dependent simulations, such as unsteady diffusion and advection-diffusion type equations. Problems of this type are quite important in practice, as they are often the most time-consuming part of many large-scale computational fluid dynamics and radiation transport simulations. Linear systems with such properties also arise in other application areas, such as network science and data mining.

In this paper, we have not attempted to analyze the algorithmic scalability of the hybrid solvers. A difficulty is the fact that these methods contain a number of tunable parameters, each of which can have a great impact on performance and convergence behavior: the choice of preconditioner, the stopping criteria used for the Richardson iteration, the criteria for the number and length of MC histories to be run at each iteration, the particular estimator used, and possibly others. Although the cost per iteration is linear in the number n of unknowns, it is not clear how to predict the rate of convergence of the outer iterations, because it depends strongly on the amount of work done in the MC acceleration phase, which is also not known a priori except for some rather conservative upper bounds. Clearly, the scaling behavior of hybrid methods with respect to problem size needs to be further investigated.

Future work should also focus on testing hybrid methods on large parallel architectures and on evaluating their resiliency in the presence of simulated faults. Efforts in this direction are currently under way.

ACKNOWLEDGEMENTS

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

We would also like to thank Miroslav Tůma for providing the AINV code used in some of the numerical experiments.

This study was supported by the Department of Energy (Office of Science), grant ERKJ247.

REFERENCES

1. Agullo E, Giraud L, Guermouche A, Roman J, Zounon M. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. Research Report No. 8324, Project-Teams HiePACS, INRIA; June 2013.

2. Fasi M, Langou J, Robert Y, Uçar B. A backward/forward recovery approach for the preconditioned conjugate gradient method. J Comput Sci. 2016;17:522–534.

3. Hoemmen M, Heroux M. Fault-tolerant iterative methods via selective reliability. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), vol. 3; 2011; Seattle, WA. IEEE Computer Society; 2011.

4. Sargsyan K, Rizzi F, Mycek P, et al. Fault resilient domain decomposition preconditioner for PDEs. SIAM J Sci Comput. 2015;37:A2317–A2345.

5. Stoyanov M, Webster C. Numerical analysis of fixed point algorithms in the presence of hardware faults. SIAM J Sci Comput. 2015;37:C532–C553.



6. Evans TM, Mosher SW, Slattery SR, Hamilton SP. A Monte Carlo synthetic-acceleration method for solving the thermal radiation diffusion equation. J Comput Phys. 2014;258:338–358.

7. Slattery SR. Parallel Monte Carlo synthetic acceleration methods for discrete transport problems. PhD Thesis, University of Wisconsin-Madison; 2013. Available from: http://search.proquest.com/ebrary/docview/1449206072

8. Courant R, Friedrichs K, Lewy H. Über die partiellen Differenzengleichungen der mathematischen Physik. Math Ann. 1928;100:32–74.

9. Curtiss JH. Sampling methods applied to differential and difference equations. Proc. Seminar on Scientific Computation, Nov. 1949; IBM, New York; 1950. p. 87–109.

10. Forsythe GE, Leibler RA. Matrix inversion by a Monte Carlo method. Math Tables Other Aids Comput. 1952;6:78–81.

11. Alexandrov V, Atanassov E, Dimov IT, Branford S, Thandavan A, Weihrauch C. Parallel hybrid Monte Carlo algorithms for matrix computations problems. Lect Notes Comput Sc. 2005;3516:752–759.

12. Halton JH. Sequential Monte Carlo techniques for the solution of linear systems. J Sci Comput. 1994;9:213–257.

13. Dimov IT, Alexandrov VN. A new highly convergent Monte Carlo method for matrix computations. Math Comput Simulat. 1998;47:165–181.

14. Slattery SR, Evans TM, Wilson PPH. A spectral analysis of the domain decomposed Monte Carlo method for linear systems. International Conference on Mathematics and Computational Methods Applied to Nuclear Science & Engineering; 2013; Sun Valley Resort, ID.

15. Dimov IT, Alexandrov V, Karaivanova A. Parallel resolvent Monte Carlo algorithms for linear algebra problems. Math Comput Simulat. 2001;55:25–35.

16. Halton JH. Sequential Monte Carlo. Math Proc Cambridge. 1962;58:57–58.

17. Srinivasan A. Monte Carlo linear solvers with non-diagonal splitting. Math Comput Simulat. 2010;80:1133–1143.

18. Hao J, Mascagni M, Li Y. Convergence analysis of Markov chain Monte Carlo linear solvers using Ulam-von Neumann algorithm. SIAM J Numer Anal. 2013;51:2107–2122.

19. Berman A, Plemmons RJ. Nonnegative matrices in the mathematical sciences. New York: Academic Press; 1979.

20. Axelsson O. Iterative solution methods. Cambridge, UK: Cambridge University Press; 1994.

21. Li L. On the iterative criterion for generalized diagonally dominant matrices. SIAM J Matrix Anal A. 2002;24:17–24.

22. Feingold DF, Varga RS. Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem. Pac J Math. 1962;4:1241–1250.

23. Benzi M. Preconditioning techniques for large linear systems: A survey. J Comput Phys. 2002;182:417–477.

24. Saad Y. Iterative methods for sparse linear systems. Philadelphia: Society for Industrial and Applied Mathematics; 2003.

25. Elman HC, Ramage A, Silvester DJ. IFISS: A computational laboratory for investigating incompressible flow problems. SIAM Rev. 2014;56:261–273.

26. Benzi M, Meyer CD, Tůma M. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM J Sci Comput. 1996;17:1135–1149.

27. Benzi M, Tůma M. A sparse approximate inverse preconditioner for nonsymmetric linear systems. SIAM J Sci Comput. 1998;19:968–994.

28. Friedland S, Karlin S. Some inequalities for the spectral radius of non-negative matrices and applications. Duke Math J. 1975;42:459–490.

How to cite this article: Benzi M, Evans TM, Hamilton SP, Lupo Pasini M, Slattery SR. Analysis of Monte Carlo accelerated iterative methods for sparse linear systems. Numer Linear Algebra Appl. 2017;24:e2088. https://doi.org/10.1002/nla.2088