Waveform iterative methods for parallel solution of initial value problems

Andrew Lumsdaine, Jeffrey M. Squyres, Mark W. Reichelt

Abstract

The traditional approach for computing the solution to large systems of ordinary differential or differential-algebraic equations typically includes discretization in time with an implicit integration formula. The primary opportunity for parallelization is therefore limited to the linear system solution that is performed at each timestep. Waveform techniques, on the other hand, decompose the problem at the equation level and solve for different components of the system independently, using previous iterates from other processors as inputs. This approach is particularly well-suited for message-passing computing environments, especially those with high communication latency, because synchronization and communication take place infrequently, and communication consists of large packets of information. In this paper, we present an MPI-based implementation of a waveform relaxation-based semiconductor device simulation program and provide experimental results using this program to solve the time dependent semiconductor drift-diffusion equations on a cluster of workstations.

1 Introduction

The traditional approach for computing the solution to large systems of ordinary differential or differential-algebraic equations typically includes discretization in time with an implicit integration formula. The primary opportunity for parallelization is therefore in the linear system solution that is performed at each timestep. However, this may result in poor parallel performance on machines with high synchronization and communication costs because the processors must communicate and synchronize (possibly several times) at each timestep.

Waveform methods provide an attractive alternative to standard methods. With the waveform approach, the system of equations is decomposed and distributed among the processors of the parallel machine before being discretized in time. Each processor then solves its own subsystem over the temporal interval of interest using previous iterates from other processors as inputs. Synchronization and communication take place infrequently, and communication consists of large packets of information: entire waveforms. Because of this computation and communication structure, waveform methods are especially well suited to implementation on loosely-coupled parallel machines (such as workstation clusters) using message-passing techniques.

Presented at the Scalable Parallel Libraries Conference, Mississippi State, MS, October 1994.
Andrew Lumsdaine: Dept. of Comp. Sci. and Eng., University of Notre Dame, Notre Dame, IN 46556 ([email protected]).
Jeffrey M. Squyres: Dept. of Comp. Sci. and Eng., University of Notre Dame, Notre Dame, IN 46556 ([email protected]).
Mark W. Reichelt: The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760 ([email protected]).

One criticism of classical waveform relaxation is that it is not practical because of its typically slow rate of convergence. However, acceleration techniques such as convolution SOR and waveform GMRES can dramatically improve the convergence rate of waveform relaxation for large classes of problems. In fact, accelerated waveform methods can be competitive with standard methods in serial implementations, yet they possess the potential for much better scalability in parallel implementations.

In this paper, we briefly review accelerated waveform methods and discuss relevant issues in our MPI implementation of the pWORDS parallel semiconductor device simulation program. Experimental results are presented to show the performance of several different numerical techniques for solving the time dependent semiconductor drift-diffusion equations. These results demonstrate the robustness of waveform techniques in the “communication-hostile” parallel environment found in a typical workstation cluster.

2 Waveform Methods

In this section, we develop a brief description of waveform methods using a model problem. For the model problem, we seek to compute the transient (temporal) solution to a system of $N$ simultaneous ordinary differential equations (ODEs), subject to an initial condition. This type of problem, usually called an initial value problem (IVP), is expressed as:

$$\frac{d}{dt}x(t) + A x(t) = f(t), \qquad x(0) = x_0, \tag{1}$$

where $x(t) \in \mathbb{R}^N$, $f(t) \in \mathbb{R}^N$, and $A \in \mathbb{R}^{N \times N}$. The vector $f(t)$ is understood to be the input and $x(t)$ is understood to be the unknown which is to be computed over a time interval $t \in [0, T]$. The traditional approach for numerically solving the IVP begins by discretizing (1) in time with an implicit integration rule (since large dynamical systems are typically stiff) and then solving the resulting matrix problem at each time step [1, 2]. This pointwise approach can be disadvantageous for a parallel implementation, especially for distributed memory parallel computers having a high communication latency, since the processors will have to synchronize repeatedly for each timestep.

A more suitable approach to solving the IVP with a parallel computer is to decompose the problem at the differential equation level. That is, the large system is decomposed into smaller subsystems, each of which is assigned to a single processor. The IVP is solved iteratively by solving the smaller IVPs for each subsystem, using fixed values from previous iterations for the variables from other subsystems. This dynamic iteration process is variously known as waveform relaxation (WR), dynamic iteration, or as the Picard-Lindelöf iteration [3, 4].

Example 1. Consider an equation-by-equation decomposition of (1). A waveform relaxation algorithm for this process is described by:

Algorithm 1 (Jacobi Waveform Relaxation).

1. Start: Select initial value $x^0(t)$ for $t \in [0, T]$.

2. Iterate: For each waveform iteration $k = 0, 1, \ldots$ until satisfied do:

   For each equation $i = 1, 2, \ldots, N$ solve the IVP:

   $$\frac{d}{dt}x_i^{k+1}(t) + a_{ii}\,x_i^{k+1}(t) = f_i(t) - \sum_{j \neq i} a_{ij}\,x_j^k(t), \qquad x_i^{k+1}(0) = x_{0i}.$$

Here, at every waveform iteration $k$, each equation in the system is solved for its corresponding component of $x$, using previous values of the other components as input. Note that the left hand side uses only the diagonal element of $A$ (i.e., $a_{ii}$ for the computation of $x_i(t)$). This computational structure is analogous to that of Gauss-Jacobi relaxation for solving linear systems of equations [5]; hence, this particular decomposition and solution process is usually called Jacobi waveform relaxation. Alternative decompositions can be constructed (e.g., Gauss-Seidel), based on alternative splittings of the matrix $A$.
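As an illustration only (this is not the pWORDS code), the Jacobi waveform relaxation of Algorithm 1 can be sketched in Python for a small linear system. The sketch assumes backward Euler on a uniform grid for each scalar IVP and a flat initial waveform; the function names, the test matrix, and the step size are hypothetical choices made for the example.

```python
import numpy as np

def backward_euler_scalar(a, rhs, x0, h):
    """Integrate dx/dt + a*x = rhs(t) with backward Euler.

    rhs holds the right-hand side sampled at t_0, ..., t_M (uniform step h).
    """
    x = np.empty_like(rhs)
    x[0] = x0
    for m in range(1, len(rhs)):
        # (x[m] - x[m-1]) / h + a * x[m] = rhs[m]
        x[m] = (x[m - 1] + h * rhs[m]) / (1.0 + h * a)
    return x

def jacobi_wr(A, f, x0, h, sweeps):
    """Jacobi waveform relaxation; f and the result have shape (N, M+1)."""
    n, steps = f.shape
    # Initial waveform x^0(t): the initial condition extended flat over [0, T].
    x = np.tile(x0[:, None], (1, steps)).astype(float)
    d = np.diag(A)
    off = A - np.diag(d)
    for _ in range(sweeps):
        coupling = off @ x  # row i holds sum_{j != i} a_ij x_j^k(t)
        # Each row is an independent scalar IVP, so rows could be
        # distributed across processors.
        x = np.vstack([
            backward_euler_scalar(d[i], f[i] - coupling[i], x0[i], h)
            for i in range(n)
        ])
    return x

def backward_euler_full(A, f, x0, h):
    """Pointwise reference: backward Euler applied to the coupled system."""
    n, steps = f.shape
    x = np.empty((n, steps))
    x[:, 0] = x0
    lhs = np.eye(n) + h * A
    for m in range(1, steps):
        x[:, m] = np.linalg.solve(lhs, x[:, m - 1] + h * f[:, m])
    return x
```

The fixed point of the discretized iteration is the pointwise backward Euler solution, so comparing `jacobi_wr` against `backward_euler_full` after a modest number of sweeps is a simple sanity check on a short interval.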

Since the time that the WR algorithm was first introduced as an efficient technique for solving the large sparsely-coupled differential equation systems generated by simulation of integrated circuits [6], its properties have been under substantial theoretical and practical investigation. The precise nature of the loose coupling in integrated circuits, which was responsible for the rapid convergence of WR for those examples, was first made clear in [7]. The more formal theory for WR applied to linear time-invariant systems in normal form is described in [3], and theoretical aspects which arise when WR is applied to the more general form $(C\frac{d}{dt} + A)x(t) = f(t)$ are examined in [8] and, for the differential-algebraic case in particular, in [9]. Since the WR method decomposes the problem before time discretization, it has been used as a tool for examining the stability properties of multirate integration methods [10]. Though the major practical success of WR has been in accelerating the simulation of integrated circuits [11, 12, 13, 14], it has been examined for other applications. For example, the method has been successfully applied to semiconductor device simulation [15] and to chemical engineering problems [16].

As the above body of work makes clear, for WR to be a computational competitor to pointwise methods, its convergence must be accelerated. Approaches to accelerating the convergence of WR include multigrid [17, 18], SOR [3], convolution SOR [19], Krylov-subspace methods [20], adaptive window size selection [21, 22], and the use of shifted iterations [23].

In the following sections, we describe the Krylov-subspace and convolution SOR acceleration techniques, concentrating primarily on their practical aspects.

2.1 Operator Equation Formulation

Many of the techniques used to accelerate WR are waveform extensions of well-known acceleration methods from linear algebra. Hence, to describe these methods, it is useful to put WR into a form that is analogous to the canonical $Ax = b$ formulation of linear algebra problems.

In (1), let $A = M - N$ be a splitting of $A$. The waveform relaxation algorithm based on this splitting is expressed in matrix form as

Algorithm 2 (WR for Linear Systems).

1. Initialize: Pick $x^0$.

2. Iterate: For waveform iteration $k = 0, 1, \ldots$ solve

   $$\frac{d}{dt}x^{k+1}(t) + M x^{k+1}(t) = N x^k(t) + f(t), \qquad x^{k+1}(0) = x_0$$

   for $x^{k+1}(t)$ on $[0, T]$.

We can solve for $x^{k+1}(t)$ explicitly [24], that is,

$$x^{k+1}(t) = e^{-Mt}x(0) + \int_0^t e^{-M(t-s)}\left[N x^k(s) + f(s)\right] ds. \tag{2}$$

Instead of using this formulation, it is useful to abstract (2) and consider $x$ as an element of a function space (of $N$-dimensional functions) and the integral as an operator on $N$-dimensional functions. Using operator notation, we can write (2) as

$$x^{k+1} = \mathcal{K}x^k + \psi. \tag{3}$$

Here the variables are defined on the space of $N$-dimensional square integrable functions, which we will denote as $\mathbb{H} = L^2([0, T]; \mathbb{R}^N)$. The operator $\mathcal{K} : \mathbb{H} \to \mathbb{H}$ is defined by

$$(\mathcal{K}x)(t) = \int_0^t e^{-M(t-s)} N x(s)\, ds,$$

and $\psi \in \mathbb{H}$ is given by

$$\psi(t) = e^{-Mt}x(0) + \int_0^t e^{-M(t-s)} f(s)\, ds.$$

More intuitively, application of the operator $\mathcal{K}$ can be roughly interpreted as: “take one step of waveform relaxation.”

Now, we also know (based on the splitting) that the solution $x$ to (1) will satisfy

$$\frac{d}{dt}x(t) + M x(t) = N x(t) + f(t), \qquad x(0) = x_0.$$

Or, using operator notation, we can see that $x$ will satisfy

$$(I - \mathcal{K})x = \psi, \tag{4}$$

where the operator $I$ is the identity operator.

Example 2. Let $M(t)$ be the diagonal part of $A(t)$. Then Algorithm 2 becomes the Jacobi WR algorithm (Algorithm 1).

It can be shown that on finite time intervals, $\mathcal{K}$ has zero spectral radius [25, 26], so that the method defined in (3) converges. A more detailed analysis of convergence can be derived by considering cases for which $\mathcal{K}$ is defined as $T \to \infty$, in which case $\mathcal{K}$ has nonzero spectral radius [3].

2.2 Krylov Subspace Algorithms

In general, the operator $\mathcal{K}$ is not self-adjoint, i.e., $\mathcal{K} \neq \mathcal{K}^*$, so of the various Krylov-subspace methods, only those designed for non-self-adjoint operators are appropriate for accelerating WR [20]. The waveform GMRES algorithm (WGMRES), the extension of the generalized minimum residual algorithm (GMRES) [27] to the function space $\mathbb{H}$, is given below. Some theoretical convergence results for WGMRES are given in [20].

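Algorithm 3 (Waveform GMRES).

1. Start: Set $r^0 = \psi - (I - \mathcal{K})x^0$, $v^1 = r^0/\|r^0\|$, $\beta = \|r^0\|$.

2. Iterate: For $k = 1, 2, \ldots$ until satisfied do:
   • $h_{j,k} = \langle (I - \mathcal{K})v^k, v^j \rangle$, $j = 1, 2, \ldots, k$
   • $\hat{v}^{k+1} = (I - \mathcal{K})v^k - \sum_{j=1}^{k} h_{j,k}\,v^j$
   • $h_{k+1,k} = \|\hat{v}^{k+1}\|$
   • $v^{k+1} = \hat{v}^{k+1}/h_{k+1,k}$

3. Form approximate solution:
   • $x^k = x^0 + V^k y^k$, where $y^k$ minimizes $\|\beta e_1 - H^k y^k\|$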

The two fundamental operations in Algorithm 3 are the operator-function product, $(I - \mathcal{K})p$, and the inner product, $\langle \cdot, \cdot \rangle$. When solving (4) in the function space $\mathbb{H}$, these operations are as follows:

Operator-Function Product: To calculate the waveform $w \equiv (I - \mathcal{K})p$:

1. Solve the IVP

   $$\left(\frac{d}{dt} + M(t)\right)y(t) = N(t)p(t), \qquad y(0) = p_0 = 0$$

   for $y(t)$, $t \in [0, T]$; this gives us $y = \mathcal{K}p$.

2. Set $w = p - y$.

Inner Product: The inner product $\langle x, y \rangle$ is given by

$$\langle x, y \rangle = \sum_{i=1}^{N} \int_0^T x_i(t)\,y_i(t)\, dt.$$

Recall that the application of $\mathcal{K}$ is basically equivalent to the application of one step of WR. The first part of Step 1 of the operator-function product is therefore equivalent to one step of the standard WR iteration (with zero input), hence WGMRES can be considered as a scheme for accelerating the convergence of WR. This also implies that computing the operator-function product in the Krylov-subspace based methods is as amenable to parallel implementation as WR. Moreover, the inner products required by the WGMRES algorithm can be computed by $N$ separate integrations of the pointwise product $x_i(t)y_i(t)$, which can be performed in parallel, followed by a global sum of the results.

Finally, it should be noted that although the initial residual is given by $r^0 = \psi - (I - \mathcal{K})x^0$, it is computed in practice according to

$$r^0 = (\mathcal{K}x^0 + \psi) - x^0.$$

The latter formulation avoids explicit computation of $\psi$, since the expression in parentheses is merely the result obtained by performing a single step of WR.
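The operations above can be combined into a compact serial sketch of waveform GMRES for the linear model problem. It is an illustration under stated assumptions rather than the pWORDS implementation: waveforms are stored as arrays of shape (N, M+1) on a uniform grid, each IVP is solved with backward Euler, the inner product is approximated by the trapezoidal rule, and modified Gram-Schmidt replaces the classical orthogonalization written in Algorithm 3. All function names are hypothetical.

```python
import numpy as np

def backward_euler(M, rhs, y0, h):
    """Solve dy/dt + M y = rhs(t), y(0) = y0; rhs has shape (N, steps)."""
    n, steps = rhs.shape
    y = np.empty((n, steps))
    y[:, 0] = y0
    lhs = np.eye(n) + h * M
    for m in range(1, steps):
        y[:, m] = np.linalg.solve(lhs, y[:, m - 1] + h * rhs[:, m])
    return y

def inner(x, y, h):
    """Trapezoidal approximation of sum_i int_0^T x_i(t) y_i(t) dt."""
    prod = x * y
    return h * (prod.sum() - 0.5 * (prod[:, 0].sum() + prod[:, -1].sum()))

def wgmres(M, Nmat, f, x_init, x0, h, iters):
    """Waveform GMRES for (I - K) x = psi with the splitting A = M - Nmat."""
    def apply_op(p):
        # (I - K) p: one WR step with zero input and zero initial value.
        return p - backward_euler(M, Nmat @ p, np.zeros_like(x0), h)

    # r0 = (K x_init + psi) - x_init: one ordinary WR step minus the guess.
    r0 = backward_euler(M, Nmat @ x_init + f, x0, h) - x_init
    beta = np.sqrt(inner(r0, r0, h))
    if beta == 0.0:
        return x_init
    V = [r0 / beta]
    H = np.zeros((iters + 1, iters))
    k_used = iters
    for k in range(iters):
        w = apply_op(V[k])
        for j in range(k + 1):  # modified Gram-Schmidt
            H[j, k] = inner(w, V[j], h)
            w = w - H[j, k] * V[j]
        H[k + 1, k] = np.sqrt(inner(w, w, h))
        if H[k + 1, k] <= 1e-14 * beta:  # Krylov space exhausted
            k_used = k + 1
            break
        V.append(w / H[k + 1, k])
    e1 = np.zeros(k_used + 1)
    e1[0] = beta
    # Small least-squares problem: minimize || beta e1 - H y ||.
    y, *_ = np.linalg.lstsq(H[:k_used + 1, :k_used], e1, rcond=None)
    return x_init + sum(c * v for c, v in zip(y, V))
```

In a parallel setting, the call inside `apply_op` is where the distributed WR step would go, and each call to `inner` would end in a global sum; both are plain serial NumPy here.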

2.3 Hybrid Methods for Nonlinear Systems

Many interesting applications are nonlinear and cannot be described as linear time invariant systems (like our model problem). We will use the following as a model nonlinear problem:

$$\frac{d}{dt}x(t) + F(x(t), t) = 0, \qquad x(0) = x_0. \tag{5}$$

In order to use the previously developed methods, which only apply to linear systems, we must first linearize (5). To linearize (5), we apply Newton’s method directly to the nonlinear ODE system (in a process sometimes referred to as the waveform Newton method (WN) [28]) to obtain the following iteration:

$$\left(\frac{d}{dt} + J_F(x^m)\right)x^{m+1} = J_F(x^m)\,x^m - F(x^m), \qquad x^{m+1}(0) = x_0. \tag{6}$$

Here, $J_F$ is the Jacobian of $F$. We note that (6) is a linear time-varying IVP to be solved for $x^{m+1}$, which can be accomplished with a waveform Krylov-subspace method. Note that the previous development of waveform Krylov-subspace methods extends trivially to the linear time-varying case [20]. The resulting operator Newton/Krylov-subspace algorithm, a member of the class of hybrid Krylov methods [29], is shown below.

Algorithm 4 (Waveform Newton/WGMRES).

1. Initialize: Pick $x^0$.

2. Iterate: For $m = 0, 1, \ldots$ until converged
   • Linearize (5) to form (6)
   • Solve (6) with WGMRES
   • Update $x^{m+1}$

For the WGMRES algorithm applied to solving (6), the required operator-function product can be computed using the formulas in Section 2.2, with the splitting

$$J_F(x^m(t)) = M(t) - N(t).$$

It is also possible to use a “Jacobian-free” approach [30], but the nature of the linearization in the operator-Newton algorithm makes that approach somewhat unreliable [20].

Within each nonlinear (operator-Newton) iteration, the initial residual for the WGMRES algorithm must be computed. Denote the initial guess for $x^{m+1}$ in the WGMRES part of the hybrid algorithm as $x^{m+1,0}$ and the initial residual by $r^{m+1,0}$. If $x^{m+1,0} = x^m$, then the initial residual for the WGMRES algorithm can be computed using a two-step approach as follows:

1. Solve the IVP

   $$\frac{d}{dt}y(t) + M(t)y(t) = M(t)x^m(t) - F(x^m(t)), \qquad y(0) = x_0$$

   for $y(t)$, $t \in [0, T]$.

2. Set $r^{m+1,0} = y - x^m$.

This approach is similar to that used by WGMRES for linear systems. Methods for approximating $r^{m+1,0}$ so that $M(t)x^m(t)$ does not need to be explicitly calculated can be found in [20].
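The outer waveform Newton iteration (6) can be illustrated on a toy problem. The sketch below assumes a hypothetical componentwise nonlinearity F(x) = x + x^3 (so the Jacobian is diagonal) and, for brevity, solves each linearized IVP directly with backward Euler instead of WGMRES; it is meant to show the structure of the iteration, not the method used in pWORDS.

```python
import numpy as np

def F(x):
    """Hypothetical nonlinearity, applied componentwise."""
    return x + x**3

def JF_diag(x):
    """Diagonal of the Jacobian of F (the components are uncoupled here)."""
    return 1.0 + 3.0 * x**2

def waveform_newton(x0, h, steps, iters):
    """Iterate (d/dt + J(x^m)) x^{m+1} = J(x^m) x^m - F(x^m), x^{m+1}(0) = x0."""
    # Initial waveform: the initial condition extended flat over the interval.
    x = np.tile(x0[:, None], (1, steps + 1)).astype(float)
    for _ in range(iters):
        J = JF_diag(x)        # linearize about the whole waveform x^m
        rhs = J * x - F(x)
        new = np.empty_like(x)
        new[:, 0] = x0
        for n in range(1, steps + 1):
            # Backward Euler on the linear time-varying IVP.
            new[:, n] = (new[:, n - 1] + h * rhs[:, n]) / (1.0 + h * J[:, n])
        x = new
    return x

def pointwise_newton(x0, h, steps, newton_iters=20):
    """Reference: backward Euler with Newton's method at every timestep."""
    x = np.empty((len(x0), steps + 1))
    x[:, 0] = x0
    for n in range(1, steps + 1):
        z = x[:, n - 1].copy()
        for _ in range(newton_iters):
            g = z - x[:, n - 1] + h * F(z)
            z = z - g / (1.0 + h * JF_diag(z))
        x[:, n] = z
    return x
```

Both routines share the same discrete fixed point (the fully implicit backward Euler solution), so they can be compared directly; the difference lies in whether the Newton linearization is applied to entire waveforms or to one timestep at a time.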

2.4 Convolution SOR

Successive overrelaxation (SOR) type acceleration of WR was studied in great detail in [3], with somewhat discouraging results. However, convolution SOR (CSOR), a novel type of SOR acceleration, was developed in [19] and circumvents the limitations of waveform SOR as described in [3].

To abbreviate the description of the CSOR algorithm, we will consider the problem of numerically solving the linear initial-value problem (1). A waveform relaxation algorithm using CSOR for solving (1) is shown in Algorithm 5. We take an ordinary Gauss-Seidel WR step to obtain a value for the intermediate variable $\hat{x}_i^{k+1}$. The iterate $x_i^{k+1}$ is obtained by moving $\hat{x}_i^{k+1}$ slightly farther in the iteration direction by convolution with a CSOR parameter function $\omega(t)$. With the convolution, the CSOR method correctly accounts for the temporal frequency-dependence of the spectrum of the Gauss-Jacobi WR operator (e.g., Gauss-Jacobi WR smooths high frequency components of the error waveform more rapidly than low frequency components) by, in effect, using a different SOR parameter for each frequency.

Algorithm 5 (Gauss-Seidel WR with CSOR Acceleration).

1. Initialize: Pick vector waveform $x^0(t) \in ([0, T]; \mathbb{R}^n)$ with $x^0(0) = x_0$.

2. Iterate: For $k = 0, 1, \ldots$ until converged
   • Solve for scalar waveform $\hat{x}_i^{k+1}(t) \in ([0, T]; \mathbb{R})$ with $\hat{x}_i^{k+1}(0) = x_{0i}$,

     $$\left(\frac{d}{dt} + a_{ii}\right)\hat{x}_i^{k+1}(t) = f_i(t) - \sum_{j=1}^{i-1} a_{ij}\,x_j^{k+1}(t) - \sum_{j=i+1}^{n} a_{ij}\,x_j^k(t).$$

   • Overrelax to generate $x_i^{k+1}(t) \in ([0, T]; \mathbb{R})$,

     $$x_i^{k+1}(t) = x_i^k(t) + \int_0^t \omega(\tau)\cdot\left[\hat{x}_i^{k+1}(t - \tau) - x_i^k(t - \tau)\right] d\tau.$$

In a practical implementation, the CSOR method is used to solve a problem that has been discretized in time with a multistep integration method. The overrelaxation convolution integral of Algorithm 5 is replaced with a convolution sum,

$$x_i^{k+1}[m] = x_i^k[m] + \sum_{\ell=0}^{m} \omega[\ell]\cdot\left(\hat{x}_i^{k+1}[m - \ell] - x_i^k[m - \ell]\right).$$

Here, $m$ denotes the timestep, $k$ denotes the waveform iteration, and $i$ denotes the component of $x$. As with the standard algebraic SOR method [5, 31], the practical difficulty is in determining an appropriate overrelaxation parameter, in this case the sequence $\omega[m]$. One successful approach for estimating the optimal SOR parameter has been to consider the spectrum of the SOR operator as a function of frequency and to use a power method to estimate an optimal $\omega_{opt}[m]$ [19].
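The discrete overrelaxation step is a truncated convolution and can be sketched in a few lines. This is a minimal illustration assuming NumPy arrays for a single component's discretized waveforms; the function name `csor_overrelax` is hypothetical, and choosing the sequence omega (the hard part, see [19]) is outside the scope of the sketch.

```python
import numpy as np

def csor_overrelax(x_old, x_hat, omega):
    """Return x_new[m] = x_old[m] + sum_{l=0}^{m} omega[l] * (x_hat[m-l] - x_old[m-l]).

    x_old : previous iterate x_i^k sampled at the timesteps
    x_hat : Gauss-Seidel intermediate waveform at the same timesteps
    omega : CSOR parameter sequence; if shorter than the waveform it is
            treated as zero beyond its last entry
    """
    delta = x_hat - x_old
    steps = len(delta)
    # Full discrete convolution, truncated to the causal part on the grid.
    return x_old + np.convolve(omega, delta)[:steps]
```

With a one-term sequence such as `omega = [w]`, the update reduces to ordinary SOR applied independently at each timestep, which is a convenient way to check an implementation.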

There are a variety of alternative approaches to extending the CSOR algorithm to problems with nonlinearities. We used a waveform extension (WRN) of the relaxation-Newton methods for solving nonlinear algebraic problems [28, 32]. For the nonlinear problem of the form of (5), the iteration update equation for the $i$th component of $x$ in a CSOR-Newton algorithm is given by

$$\frac{d}{dt}\hat{x}_i^{k+1}(t) + \frac{\partial F_i(x^k(t))}{\partial x_i}\left(\hat{x}_i^{k+1}(t) - x_i^k(t)\right) + F_i(x^k(t), t) = 0,$$

followed by

$$x_i^{k+1}(t) = x_i^k(t) + \int_0^t \omega(\tau)\cdot\left[\hat{x}_i^{k+1}(t - \tau) - x_i^k(t - \tau)\right] d\tau.$$

3 Semiconductor Device Simulation

3.1 The Drift-Diffusion Equations

Charge transport within a semiconductor device is assumed to be governed by the Poisson equation and by the electron and hole continuity equations:

$$\frac{kT}{q}\nabla\cdot(\epsilon\nabla u) + q\,(p - n + N_D - N_A) = 0 \tag{7}$$

$$\nabla\cdot J_n - q\left(\frac{\partial n}{\partial t} + R\right) = 0 \tag{8}$$

$$\nabla\cdot J_p + q\left(\frac{\partial p}{\partial t} + R\right) = 0 \tag{9}$$

Here, $u$ is the normalized electrostatic potential in thermal volts, $n$ and $p$ are the electron and hole concentrations, $J_n$ and $J_p$ are the electron and hole current densities, $N_D$ and $N_A$ are the donor and acceptor concentrations, $R$ is the net generation and recombination rate, $q$ is the magnitude of electronic charge, $k$ is Boltzmann’s constant, $T$ is temperature, and $\epsilon$ is the spatially-dependent dielectric permittivity [33, 34].

The current densities $J_n$ and $J_p$ are given by the drift-diffusion approximations:

$$J_n = -q\mu_n n\,\nabla\!\left(\frac{kT}{q}u\right) + qD_n\nabla n = -kT\mu_n n\,\nabla u + qD_n\nabla n \tag{10}$$

$$J_p = -q\mu_p p\,\nabla\!\left(\frac{kT}{q}u\right) - qD_p\nabla p = -kT\mu_p p\,\nabla u - qD_p\nabla p \tag{11}$$

where $\mu_n$ and $\mu_p$ are the electron and hole mobilities, and $D_n$ and $D_p$ are the diffusion coefficients. The mobilities $\mu_n$ and $\mu_p$ may be computed as nonlinear functions of the electric field $E$, i.e.,

$$\mu_n = \mu_{n0}\left[1 + \left(\frac{\mu_{n0}E}{v_{sat}}\right)^{\beta}\right]^{-1/\beta},$$

where $v_{sat}$ and $\beta$ are constants and $\mu_{n0}$ is a doping-dependent mobility [35]. The diffusion constants $D_n$ and $D_p$ are related to the mobilities by $kT/q$ (the thermal voltage) in a pair of equations known as the Einstein relations [36]

$$D_n = \frac{kT}{q}\mu_n \quad\text{and}\quad D_p = \frac{kT}{q}\mu_p.$$

The drift-diffusion approximations (10) and (11) are typically used to eliminate the current densities $J_n$ and $J_p$ from the continuity equations (8) and (9), leaving a differential-algebraic system of three equations in three unknowns, $u$, $n$, and $p$.
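The field-dependent mobility and the Einstein relation translate directly into code. The following sketch is illustrative only: the function names are hypothetical, the field enters through its magnitude (an assumption), and any parameter values passed in are the caller's choice rather than values taken from this paper. The physical constants are the exact SI values.

```python
K_BOLTZMANN = 1.380649e-23      # Boltzmann's constant k, in J/K
Q_ELECTRON = 1.602176634e-19    # magnitude of electronic charge q, in C

def field_dependent_mobility(mu0, E, v_sat, beta):
    """mu = mu0 * [1 + (mu0*|E|/v_sat)^beta]^(-1/beta).

    mu0   : low-field (doping-dependent) mobility
    E     : electric field (its magnitude is used)
    v_sat : saturation velocity
    beta  : fitting exponent
    """
    return mu0 * (1.0 + (mu0 * abs(E) / v_sat) ** beta) ** (-1.0 / beta)

def einstein_diffusivity(mu, temperature):
    """Einstein relation D = (kT/q) * mu."""
    return K_BOLTZMANN * temperature / Q_ELECTRON * mu
```

Two limits make the formula easy to check: at zero field the mobility equals `mu0`, and at very large fields the drift velocity `mu * E` approaches `v_sat`.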

3.2 MOSFET Simulation

A key component of modern VLSI circuits is a semiconductor device known as a MOSFET (Metal-Oxide Semiconductor Field Effect Transistor). Although a MOSFET is a three-dimensional structure consisting of several different regions of silicon, oxide and metal, a MOSFET may be modeled by a two-dimensional slice of the device, as shown in Fig. 1. In the figure, thick lines represent metal contacts to the drain, source, substrate and gate oxide regions, to which external voltage boundary conditions are applied.

Given a rectangular mesh covering a two-dimensional slice of a MOSFET, a common approach to spatially discretizing the device equation system is to use a finite-difference formula to discretize the Poisson equation, and an exponentially-fit finite-difference formula to discretize the continuity equations (this process is known as the Scharfetter-Gummel method [37]). On an $N$-node mesh, this spatial discretization yields a sparsely-coupled differential-algebraic initial value problem (IVP) consisting of $3N$ equations in $3N$ unknowns, denoted by

$$F_1(u(t), n(t), p(t)) = 0$$

$$\frac{d}{dt}n(t) + F_2(u(t), n(t), p(t)) = 0$$

$$\frac{d}{dt}p(t) + F_3(u(t), n(t), p(t)) = 0$$

subject to initial conditions

$$u(0) = u_0, \qquad n(0) = n_0, \qquad p(0) = p_0,$$

where $t \in [0, T]$, and $u(t), n(t), p(t) \in \mathbb{R}^N$ are vectors of normalized potential, electron concentration, and hole concentration. The initial conditions are assumed to be consistent [38]. Here, $F_1, F_2, F_3 : \mathbb{R}^{3N} \to \mathbb{R}^N$ are specified component-wise as

$$F_{1i}(u_i, n_i, p_i, u_j) = \frac{kT}{q}\sum_j \frac{d_{ij}\,\epsilon_{ij}}{L_{ij}}(u_i - u_j) - qA_i\left(p_i - n_i + N_{Di} - N_{Ai}\right)$$

$$F_{2i}(u_i, n_i, u_j, n_j) = \frac{kT}{qA_i}\sum_j \frac{d_{ij}\,\mu_{nij}}{L_{ij}}\left[n_i B(u_i - u_j) - n_j B(u_j - u_i)\right] + R_i$$

$$F_{3i}(u_i, p_i, u_j, p_j) = \frac{kT}{qA_i}\sum_j \frac{d_{ij}\,\mu_{pij}}{L_{ij}}\left[p_i B(u_j - u_i) - p_j B(u_i - u_j)\right] + R_i$$

The summations are taken over the silicon nodes $j$ adjacent to node $i$. As shown in Fig. 2, for each node $j$ adjacent to node $i$, $L_{ij}$ is the distance from node $i$ to node $j$, $d_{ij}$ is the length of the side of the Voronoi box that encloses node $i$ and bisects the edge between nodes $i$ and $j$, and $A_i$ is the area of the Voronoi box. Similarly, the quantities $\epsilon_{ij}$, $\mu_{nij}$ and $\mu_{pij}$ are the dielectric permittivity, electron and hole mobility, respectively, on the edge between nodes $i$ and $j$. The Bernoulli function, $B(x) = x/(e^x - 1)$, is used to exponentially fit potential variation to electron and hole concentration variations, and effectively upwinds the current equations.

Figure 1: A two-dimensional slice of a MOSFET device (labeled regions: gate, oxide, source, drain, silicon substrate).
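Evaluating the Bernoulli function naively gives 0/0 at x = 0 and can overflow for large arguments, so a short sketch of a numerically careful evaluation may be useful. The thresholds and the function name below are assumptions made for illustration, not part of pWORDS.

```python
import math

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), with B(0) = 1 by continuity."""
    if abs(x) < 1e-8:
        # Leading terms of the Taylor series B(x) = 1 - x/2 + x^2/12 - ...
        return 1.0 - 0.5 * x
    if x > 0.0:
        # Equivalent form x*exp(-x) / (1 - exp(-x)); avoids overflow of exp(x).
        return (x * math.exp(-x)) / (-math.expm1(-x))
    # For x < 0, expm1(x) lies in (-1, 0) and cannot overflow.
    return x / math.expm1(x)
```

The identity B(-x) = B(x) + x, which follows directly from the definition, provides a simple consistency check for the two branches.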

4 Implementation

As reported in [39], the node-by-node Gauss-Jacobi WR algorithm will typically require many hundreds (or even thousands) of iterations to converge, severely limiting the efficiency of WR-based device simulation. Moreover, assigning each node to a separate processor in a parallel implementation would require on the order of a thousand processors or more (the number of nodes typically necessary for accurate device simulation). Since the WR algorithm is better suited to a coarse-grained MIMD type of architecture, such a fine-grained division of the problem is not necessary or desirable.

Figure 2: Illustration of a mesh node $i$, the area $A_i$ of its Voronoi box, and the lengths $d_{ij}$ and $L_{ij}$.

Instead of the node-by-node approach, pWORDS collects groups of mesh nodes into blocks and solves the nodes in each block simultaneously. In particular, the nodes in each vertical line of the discretization mesh are grouped together into blocks; this has been shown to be a particularly effective blocking strategy for MOSFET simulation [39].

The main WR routine in the pWORDS program uses a red/black block Gauss-Seidel scheme in which the system of equations governing the nodes in each vertical mesh line is solved using a backward-difference integration formula. The implicit algebraic systems generated by the backward-difference formula are solved with Newton’s method and the linear equation systems generated by Newton’s method are solved with sparse Gaussian elimination.

4.1 The pWORDS Program

The pWORDS program uses a Manager/Worker approach in which the supervisory Manager program initiates the actual parallel simulation process by invoking $W$ copies of a Worker program on $W$ client machines.

The Manager program reads in the device input file that specifies the geometry and the voltage boundary conditions imposed upon the device, as well as the 2D spatial discretization mesh. Given a rectangular mesh with $C$ vertical lines, the Manager splits the mesh into $W$ non-overlapping blocks consisting of $C/W$ adjacent vertical lines and sends each block to a Worker for solution.

In addition to the block of lines that each Worker solves, each Worker also contains storage for the two vertical lines on either side of the contiguous block. These “pseudo-lines” are used to store the solutions generated by the Workers controlling those adjacent vertical lines.
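The Manager's partitioning step can be sketched as follows. This is a hypothetical illustration, not code from pWORDS: it reads the description above as one pseudo-line on each side of a block, numbers the vertical lines from 0, and assumes that when C is not a multiple of W the leftover lines go to the first blocks.

```python
def partition_lines(num_lines, num_workers):
    """Split vertical mesh lines 0..num_lines-1 into contiguous blocks.

    Returns one dict per Worker with the owned lines and the indices of
    the left/right pseudo-lines (None at the edges of the mesh).
    """
    base, extra = divmod(num_lines, num_workers)
    blocks = []
    start = 0
    for w in range(num_workers):
        # Assumption: the first `extra` Workers take one additional line.
        stop = start + base + (1 if w < extra else 0)
        blocks.append({
            "lines": range(start, stop),
            "pseudo_left": start - 1 if start > 0 else None,
            "pseudo_right": stop if stop < num_lines else None,
        })
        start = stop
    return blocks
```

For example, `partition_lines(10, 4)` assigns blocks of 3, 3, 2, and 2 lines; the second Worker owns lines 3 to 5 and keeps pseudo-line storage for lines 2 and 6.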

4.2 Parallel Waveform Relaxation

As mentioned earlier, computing the operator-waveform product used by the waveform Krylov-subspace methods requires performing one step of traditional waveform relaxation. Therefore, the operator-waveform product can be accomplished with a function call to the WR routine already implemented within pWORDS; the vertical mesh line preconditioning scheme inherent to the WR routine will automatically be used by the Krylov-subspace method as well.

To describe the algorithm running on each Worker machine, we alternately assign a “color,” either red or black, to the vertical mesh lines. The Worker WR iteration, shown in Figure 3, is as follows:

1. Send the black line solutions needed for pseudo-lines on other Workers and compute the solution for those red lines that do not depend on black pseudo-line solutions.

2. Receive black pseudo-line solutions.

3. Compute solutions for the remaining red lines.

4. Send the red line solutions needed for pseudo-lines on other Workers and compute the solution for those black lines that do not depend on red pseudo-line solutions.

5. Receive red pseudo-line solutions.

6. Compute solutions for the remaining black lines.

Communication and computation are overlapped on machines that support asynchronous (non-blocking) communication.

Figure 3: The parallelized waveform relaxation step. Processors i-1, i, and i+1 each hold alternating black (B) and red (R) lines; the labeled operations are (1) receive black from left, (2) solve red, (3) send red to left, (4) receive red from right, (5) solve black, (6) send black to right.

4.3 Parallelizing the Pointwise Approach

In our experience, the most efficient serial algorithm for device transient simulation was the pointwise Newton-GMRES algorithm. In this algorithm, block-Jacobi preconditioned GMRES [27] is used to solve the linear systems arising at each Newton iteration of each timestep of an implicit integration formula applied to (??). The pointwise Newton-GMRES method in pWORDS uses the same vertical-line blocks as the waveform methods. The Jacobian matrix is stored in a block-row wise fashion, as shown in Figure 4. Although many of these communication operations can potentially be overlapped with local computation, the communication latency may be so large that it cannot be hidden by the (relatively) small amount of computation done at a single timestep. Furthermore, a significant amount of synchronization results from the inner-product computations within each GMRES iteration. An approach such as that given in [40] might prove to be helpful, however.

Figure 4: Partitioning of the system Jacobian matrix for the pointwise approach (block rows assigned to processors i-1, i, and i+1).

5 Experimental Results

Fig. 5 illustrates the MOSFET simulation used for the numerical experiments in this section. The figure shows a two-dimensional slice of the silicon device along with the external boundary conditions on the potentials $u$ at the terminals. Here, the potentials at the source and substrate terminals are held at 0, the potential $u$ at the gate terminal is held at 5 V, and there is a short pulse at the drain terminal. The concentrations $n$ and $p$ at each terminal are held constant at an equilibrium value determined by the background doping concentration at that terminal.

Figure 5: Illustration of the example problem with Dirichlet boundary conditions on the terminal potentials $u$ (labels in the figure: 0 psec, 512 psec, 2.2 microns, 5 v, 0 v).

The experiments compared parallelized pointwise Newton/GMRES, WRN [28], WN/WGMRES, and WRN with CSOR acceleration. To obtain the CSOR results, the “optimal” CSOR parameter was determined by linearizing the device problem about the solution at time $t = 0$, and fitting $\omega_{opt}(z)$ (as a function of frequency) with a rational function as described in [19]. Also, to diminish the effect of the nonlinearity, the overrelaxation convolution was applied only to the potential variables $u$.

Method               # Procs    Time (s)
Pointwise (GMRES)          1      912.04
Pointwise (GMRES)          2     1220.13
WRN                        1     2730.96
WRN                        2     1503.71
WRN                        4     1230.66
WRN                        8      861.48
WN/WGMRES                  1      934.16
WN/WGMRES                  2      504.21
WN/WGMRES                  4      349.67
WN/WGMRES                  8      303.14
WRN with CSOR              1      566.71
WRN with CSOR              2      308.83
WRN with CSOR              4      222.65
WRN with CSOR              8      158.47

Table 1: Execution times (in wall clock seconds) for transient simulation of the example problem on an MPI workstation cluster.
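Parallel speedup and efficiency follow directly from the entries of Table 1. The short script below is only a convenience for reading the table: the timings are copied from Table 1, while the helper name `speedup_table` and the definitions used (speedup T(1)/T(p) relative to the same method on one processor, efficiency as speedup divided by p) are the usual conventions rather than quantities reported in the paper.

```python
# Wall clock times in seconds, copied from Table 1: {method: {procs: time}}.
TIMES = {
    "Pointwise (GMRES)": {1: 912.04, 2: 1220.13},
    "WRN": {1: 2730.96, 2: 1503.71, 4: 1230.66, 8: 861.48},
    "WN/WGMRES": {1: 934.16, 2: 504.21, 4: 349.67, 8: 303.14},
    "WRN with CSOR": {1: 566.71, 2: 308.83, 4: 222.65, 8: 158.47},
}

def speedup_table(times):
    """Map each method to {procs: (speedup, efficiency)}."""
    result = {}
    for method, by_procs in times.items():
        t1 = by_procs[1]  # single-processor time of the same method
        result[method] = {
            p: (t1 / t, t1 / (t * p)) for p, t in sorted(by_procs.items())
        }
    return result

if __name__ == "__main__":
    for method, rows in speedup_table(TIMES).items():
        for p, (s, e) in rows.items():
            print(f"{method:20s} p={p}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

Computed this way, the two-processor pointwise run has a speedup below one, while every waveform method has a speedup above one on two, four, and eight processors.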

The backward Euler method with 256 fixed timesteps was used for all experiments, on a simulation interval of 512 picoseconds. Although the use of global uniform timesteps precludes multirate integration (one of the primary computational advantages of WR on a sequential machine), it also simplifies the problem of load-balancing. The convergence criterion for all experiments was that the maximum relative error of any terminal current over the simulation interval be less than $10^{-4}$. The initial guess for WRN and for the accelerated waveform methods was produced by performing 8 WR iterations beginning with flat waveforms extended from the initial conditions.

Table 1 shows a comparison of the execution times (in wall clock seconds) required to complete a transient simulation of the example problem using pointwise Newton-GMRES, WRN, WN/WGMRES, and CSOR on an Ethernet-connected MPI [41] workstation cluster consisting of Sun SPARC 5 and SPARC 10 workstations. Despite some small differences in compute node processing power, the mesh was divided as evenly as possible among the nodes; no load balancing was attempted. Note that the execution time for pointwise Newton-GMRES increased when parallelized on two processors.

The qualitative communication behavior of the different types of methods can be illustrated very clearly through the use of plots obtained by using MPE libraries. Figure 6 shows an Upshot plot of the communication operations required for only a fraction of a timestep of pointwise Newton-GMRES. Note the large number of individual communication operations, as well as the enormous number of synchronizations, that are required. In contrast, Figures 7 and 8 show four iterations of the waveform relaxation and WGMRES methods, respectively (the CSOR method will exhibit behavior similar to that of WRN). Note that the communication operations take place infrequently and that, although WGMRES does require synchronization, the number of synchronizations is small in comparison to the pointwise method. This is why the waveform methods are able to exhibit a speedup in a parallel implementation on a workstation cluster and the pointwise method is not.

6 Conclusion

The experimental results presented here showed that waveform methods are relatively insensitive to the underlying communication network of the parallel environment where they are running. As a result, waveform methods are well suited to environments having very high communication latency, such as workstation clusters. As MIMD computers continue to become more popular, and as workstation clusters continue to become a legitimate parallel computing resource, waveform methods will grow in importance.

Acknowledgments

The authors would like to thank Jacob White, Ken Jackson, and Stefan Vandewalle for many helpful discussions.

References

[1] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential Equations. Automatic Computation, Englewood Cliffs, New Jersey: Prentice-Hall, 1971.

[2] E. Hairer, S. P. Norsett, and G. Wanner, Solving Ordinary Differential Equations, vol. 1 and 2. New York: Springer-Verlag, 1987.

[3] U. Miekkala and O. Nevanlinna, “Convergence of dynamic iteration methods for initial value problems,” SIAM J. Sci. Stat. Comp., vol. 8, pp. 459–467, 1987.

[4] J. K. White and A. Sangiovanni-Vincentelli, Relaxation Techniques for the Simulation of VLSI Circuits. Engineering and Computer Science Series, Norwell, Massachusetts: Kluwer Academic Publishers, 1986.

[5] R. S. Varga, Matrix Iterative Analysis. Automatic Computation Series, Englewood Cliffs, New Jersey: Prentice-Hall Inc, 1962.

[6] E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni-Vincentelli, “The waveform relaxation method for time domain analysis of large scale integrated circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 1, pp. 131–145, July 1982.

[7] F. Odeh, A. Ruehli, and C. Carlin, “Robustness aspects of an adaptive wave-form relaxation scheme,” in Proceedings of the IEEE Int. Conf. on Circuits and Comp. Design, (Rye, N.Y.), pp. 396–440, October 1983.

Figure 6: Upshot plot of a fraction of one timestep of pointwise Newton-GMRES. [Plot not reproduced. Rows labeled 1 to 4; horizontal axis ticks from 7.913 to 8.330; legend: Barr, Bcast, Recv, Reduce, Send.]

Figure 7: Upshot plot of four iterations of pointwise waveform relaxation. [Plot not reproduced. Rows labeled 1 to 4; horizontal axis ticks from 4.5 to 14.4; legend: Barr, Bcast, Recv, Send.]

Figure 8: Upshot plot of four iterations of WGMRES. [Plot not reproduced. Rows labeled 1 to 4; horizontal axis ticks from 13.3 to 26.9; legend: Barr, Bcast, Recv, Reduce, Send.]

[8] O. Nevanlinna and F. Odeh, “Remarks on the convergenceof the waveform relaxation method,” Numerical FunctionalAnal. Optimization, vol. 9, pp. 435–445, 1987.

[9] U. Miekkala, “Dynamic iteration methods applied to linearDAE systems,” J. Comput. Appl. Math., vol. 25, pp. 133–151, 1989.

[10] J. White and F. Odeh, “A connection between the conver-gence properties of waveform relaxation and the A-stabilityof multirate integration methods,” in Proceedings of theNASECODE VII Conference, (Copper Mountain, Colorado),1991.

[11] D. Dumlugol, The Segmented Waveform Relaxation Methodfor Mixed-Mode Simulation of Digital MOS Circuits. PhDthesis, Katholieke Universiteit Leuven, October 1986.

[12] S. Mattison, CONCISE: A concurrentcircuit simulation pro-gram. PhD thesis, Lund Institute of Technology,Lund, Swe-den, 1986.

[13] F. Odeh, A. Ruehli, and P. Debefve, “Waveform techniques,”in Circuit Analysis,Simulation and Design,Part 2 (A.Ruehli,ed.), pp. 41–127, North-Holland, 1987.

[14] J. White, F. Odeh, A. Vincentelli, and A. Ruehli, “Waveformrelaxation: Theory and practice,” Trans. of the Society forComputer Simulation, vol. 2, pp. 95–133, June 1985.

[15] M. Reichelt, J. White, J. Allen, and F. Odeh, “Waveformrelaxation applied to transient device simulation,” in Pro-ceedings of the IEEE Int. Conf. on Circuits and Systems,(Espoo, Finland), pp. 396–440, October 88.

[16] A. Skjellum, Concurrent dynamic simulation: Multicom-puters algorithms research applied to ordinary differential-algebraic process systems in chemical engineering. PhDthesis, California Institute of Technology, May 1990.

[17] C.Lubich and A. Osterman, “Multigrid dynamic iteration forparabolic problems,” BIT, vol. 27, pp. 216–234, 1987.

[18] S. Vandewalle and R. Piessens, “Efficient parallel algo-rithms for solving initial-boundary value and time-periodicparabolic partial differential equations,” SIAM J. Sci. Statist.Comput., vol. 13, pp. 1330–1346, November 1992.

[19] M. Reichelt, Accelerated Waveform Relaxation Techniquesfor the Parallel Transient Simulation of Semiconductor De-vices. PhD thesis, Massachusetts Institute of Technology,Cambridge, MA, 1993.

[20] A. Lumsdaine, Theoretical and Practical Aspects of Paral-lel Numerical Algorithms for Initial Value Problems, withApplications. PhD thesis, Massachusetts Institute of Tech-nology, Cambridge, MA, 1992.

[21] B. Leimkuhler, “Estimating waveform relaxation conver-gence,” SIAM J. Sci. Comput., vol. 14, no. 4, pp. 872–889,1993.

[22] B. Leimkuhler and A. Ruehli, “Rapid convergence of wave-form relaxation,” Applied Numerical Mathematics, vol. 11,pp. 221–224, 1993.

[23] R. D. Skeel, “Waveform iteration and the shifted Picard split-ting,” SIAM J. Sci. Statist. Comput., vol. 10, no. 4, pp. 756–776, 1989.

[24] T. Kailath, Linear Systems.Englewood Cliffs: Prentice-Hall,1980.

[25] R. Kress, Linear Integral Equations. New York: Springer-Verlag, 1989.

[26] J. B. Conway, A Course in Functional Analysis, SecondEdition. New York: Springer-Verlag, 1990.

[27] Y. Saad and M. Schultz, “GMRES: A generalized mini-mum residual algorithm for solving nonsymmetric linearsystems,” SIAM J. Sci. Statist. Comput., vol. 7, pp. 856–869,July 1986.

[28] R. Saleh and J. White, “Accelerating relaxation algorithmsfor circuit simulation using waveform-Newton and step-sizerefinement,” IEEE Trans. CAD, vol. 9, no. 9, pp. 951–958,1990.

[29] P. Brown and Y. Saad, “Hybrid Krylov methods for nonlinearsystems of equations,” SIAM J. Sci. Statist. Comput., vol. 11,pp. 450–481, May 1990.

[30] P. N. Brown and A. C. Hindmarsh, “Matrix-free methodsfor stiff systems of ODE’s,” SIAM J. Numer. Anal., vol. 23,pp. 610–638, June 1986.

[31] D. M. Young, Iterative Solution of Large Linear Systems.Orlando, FL: Academic Press, 1971.

[32] J. M. Ortega and W. C. Rheinbolt, Iterative Solution of Non-linear Equations in Several Variables. Computer Scienceand Applied Mathematics, New York: Academic Press,1970.

[33] R. Bank, W. Coughran, Jr., W. Fichtner, E. Grosse, D. Rose,and R. Smith, “Transient simulation of silicon devices andcircuits,” IEEE Trans. CAD, vol. 4, pp. 436–451, October1985.

[34] S. Selberherr, Analysis and Simulation of SemiconductorDevices. New York: Springer-Verlag, 1984.

[35] R. S. Muller and T. I. Kamins, Device Electronics for Inte-grated Circuits. New York: John Wiley and Sons, 1986.

[36] P. E. Gray and C. L. Searle, Electronic Principles: Physics,models and circuits. New York: Wiley, 1969.

[37] D. Scharfetter and H. Gummel, “Large-signal analysis of asilicon read diode oscillator,” IEEE Transactions on ElectronDevices, vol. ED-16, pp. 64–77, January 1969.

[38] K. E. Brenan, S. L. Campbell, and L. R. Petzold, NumericalSolution of Initial-Value Problems in Differential-AlgebraicEquations. New York: North Holland, 1989.

[39] M. Reichelt, J. White, and J. Allen, “Waveform relaxationfor transient two-dimensional simulation of MOS devices,”in International Conference on Computer Aided-Design,(Santa Clara, California), pp. 412–415, November 1989.

[40] E. D. Sturler, “A parallel restructured version of GM-RES(m),” in Proceedings of the Copper Mountain Confer-ence on Iterative Methods, (Copper Mountain, Colorado),1992.

[41] M. P. I. Forum, “MPI: A Message Passing Interface,” inProc. of Supercomputing ’93, pp. 878–883, IEEE ComputerSociety Press, November 1993.
