15
Parallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations in science and engineering give rise to sparse linear systems of equations. It is a well known fact that the cost of the simulation process is almost always governed by the solution of the linear systems especially for large- scale problems. The emergence of extreme-scale parallel platforms, along with the increasing number of processing cores available on a single chip pose significant challenges for algorithm development. Machines with tens of thousands of mul- ticore processors place tremendous constraints on the communication as well as memory access requirements of algorithms. The increase in number of cores in a processing unit without an increase in memory bandwidth aggravates an already significant memory bottleneck. Sparse linear algebra kernels are well-known for their poor processor utilization. This is a result of limited memory reuse, which ren- ders data caching less effective. In view of emerging hardware trends, it is necessary to develop algorithms that strike a more meaningful balance between memory ac- cesses, communication, and computation. Specifically, an algorithm that performs more floating point operations at the expense of reduced memory accesses and com- munication is likely to yield better performance. We presented two alternative vari- ations of DS factorization based methods for solution of sparse linear systems on parallel computing platforms. Performance comparisons to traditional LU factor- ization based parallel solvers is also discussed. We show that combining iterative methods with direct solvers and using DS factorization, one can achieve better scal- ability and shorter time to solution. Murat Manguoglu Department of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey e-mail: [email protected] 1

Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems

Murat Manguoglu

Abstract Many simulations in science and engineering give rise to sparse linearsystems of equations. It is a well known fact that the cost of the simulation processis almost always governed by the solution of the linear systems especially for large-scale problems. The emergence of extreme-scale parallel platforms, along with theincreasing number of processing cores available on a single chip pose significantchallenges for algorithm development. Machines with tens of thousands of mul-ticore processors place tremendous constraints on the communication as well asmemory access requirements of algorithms. The increase in number of cores in aprocessing unit without an increase in memory bandwidth aggravates an alreadysignificant memory bottleneck. Sparse linear algebra kernels are well-known fortheir poor processor utilization. This is a result of limited memory reuse, which ren-ders data caching less effective. In view of emerging hardware trends, it is necessaryto develop algorithms that strike a more meaningful balance between memory ac-cesses, communication, and computation. Specifically, an algorithm that performsmore floating point operations at the expense of reduced memory accesses and com-munication is likely to yield better performance. We presented two alternative vari-ations of DS factorization based methods for solution of sparse linear systems onparallel computing platforms. Performance comparisons to traditional LU factor-ization based parallel solvers is also discussed. We show that combining iterativemethods with direct solvers and using DS factorization, one can achieve better scal-ability and shorter time to solution.

Murat ManguogluDepartment of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkeye-mail: [email protected]

1

Page 2: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

2 Murat Manguoglu

1 Introduction

Many simulations in science and engineering give rise to sparse linear systems ofequations. It is a well known fact that the cost of the solution process is almost al-ways governed by the solution of the linear systems especially for large-scale prob-lems. The emergence of extreme-scale parallel platforms, along with the increasingnumber of processing cores available on a single chip pose significant challengesfor algorithm development. Machines with tens of thousands of multicore proces-sors place tremendous constraints on the communication as well as memory accessrequirements of algorithms. The increase in number of cores in a processing unitwithout an increase in memory bandwidth aggravates an already significant mem-ory bottleneck. Sparse linear algebra kernels are well-known for their poor processorutilization. This is a result of limited memory reuse, which renders data caching lesseffective. In view of emerging hardware trends, it is necessary to develop algorithmsthat strike a more meaningful balance between memory accesses, communication,and computation. Specifically, an algorithm that performs more floating point oper-ations at the expense of reduced memory accesses and communication is likely toyield better performance.

Significant amount of effort has been devoted to design and implementation ofparallel sparse linear systems solvers. Existing parallel sparse direct solvers, suchas MUMPS [3, 2, 1], Pardiso [39, 40], SuperLU [22], and WSMP [14, 13] , arebased on LU factorization. Therefore, the speed improvements realized by suchsolvers are often limited due to the inherited limitations of sparse LU factorizationsand sparse triangular forward-backward sweeps. Iterative solvers, such as precon-ditioned Krylov subspace methods with sparse approximate inverse or incompleteLU factorization based preconditioners, on the other hand, are often more scalablebut not as robust as direct solvers.

We present two robust hybrid algorithms based on DS factorization for parallelsolution of general sparse linear systems. At the cost of increased computation, DSfactorization for solving the system allows us to minimize the interprocess commu-nications and, hence, enhances concurrency. The remainder of this chapter is orga-nized as follows. In Section 2, we present banded and sparse variations of DS fac-torization. In Section 3, we develop two hybrid general sparse linear system solversthat use DS factorization to solve the preconditioned system.

2 Banded and sparse parallel DS factorizations

A number of banded solvers have been proposed and implemented in softwarepackages such as, LAPACK [4] for uniprocessors, ScaLAPACK [7], and Spike[36, 34, 21, 6, 9, 30, 31, 37] for parallel architectures. The central idea of Spikeis to partition the matrix so that each process (or processing element) can work onits own part of the matrix, with the processes communicating only during the solu-

Page 3: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 3

tion of the common reduced system. The size of the reduced system is determinedby the bandwidth of the matrix and the number of partitions.

Unlike classical sequential LU factorization of the coefficient matrix A, for solv-ing a banded linear system Ax = f , the Spike scheme employs the factorization:

A = DS, (1)

where D is the block diagonal of A for a given number of partitions. The factorS, given by D−1A (assuming D is nonsingular) , called the spike matrix, consistsof the block diagonal identity matrix modified by “spikes” to the right and left ofeach partition. The process of solving Ax = f , then, reduces to a sequence of thefollowing steps

• g← D−1 f (modification of the right hand side)• S← D−1A (forming the spike system coefficient matrix)• x← S−1g (solving a smaller independent reduced system)• x← S−1g (retrieving the full solution).

All the steps of the solution process can be executed in perfect parallelism withthe exception of the solution of the small reduced system. The size of the reducedsystem will increase as we increase the number of partitions (or processors). Fur-thermore, each step can be accomplished using one of several available methods,depending on the specific parallel architecture and the linear system at hand. Thisgives rise to a family of optimized variants of the basic Spike algorithm. For a smallbanded system we illustrate the partitioning of the system among three processorsin Figure 1. Figure 2 depicts the spike system Sx = g. Finally, Figure 3 shows thesmaller reduced system. Dashed boxes in figures show those elements which couldbe near machine precision if A is diagonally dominant system. Ignoring those smallelements will give rise to ‘truncated” variation of the algorithm allowing more par-allelism in the solution of the reduced system.

For the solution of general sparse linear systems (where A is sparse) a new al-gorithm has been proposed in [24, 28]. Inspired by the banded Spike solver, thisalgorithm also uses the DS factorization and partitioning of the system (see Figures4 and 5). The resulting S matrix, however, is not banded and consists of “sparse”spikes. Nevertheless, a smaller reduced system can still be obtained (see Figure 6).Solution stages follow exactly the banded case.

Both banded and sparse DS factorization based algorithms can be adapted forefficient and scalable solution of general sparse linear systems as will be discussedin Section 3.

3 Solution of general sparse linear systems

In this section, we present two hybrid methods for the solution of general sparselinear systems which use an iterative method as the outer layer with preconditioning.

Page 4: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

4 Murat Manguoglu

=*

A x f

A22

A11

A33

Fig. 1 Partitioning of the system (Ax = f ) into three parts, boxes represent nonzeros

=*

D-1A D-1f

I

I

I

I

I

II

x

Fig. 2 Spike system: Sx = g where g = D−1 f and S = D−1A for the system in Figure 1

In both methods, the solution of preconditioned linear systems are handled by oneof the variations of the DS factorization based banded or sparse solvers describedin Section 2. The first method relies on reordering systems so that the large entriesin the coefficient matrix are moved closer to the main diagonal. After reorderingan effective banded preconditioner, M, can be extracted and used for solving thesystem with an outer iterative layer. M could be treated as dense or sparse withinthe band and hence the diagonal blocks can be handled by a variety of algorithms.The second method, on the other hand, eliminates the need for the reordering withweights by dropping small elements when forming the preconditioner. An outer

Page 5: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 5

=*I

I

II

S x g

Fig. 3 Reduced system: Sx = g obtained from the spike system in Figure 2

=*

A x f

A11

A22

A33

Fig. 4 Partitioning of the system (Ax = f ) into three parts, boxes represent nonzeros

iterative method is also used for the second method and preconditioned systems arehandled by the sparse variation of the DS factorization.

3.1 Weighted reordering and banded preconditioning

Given a linear system of equations, Ax = f , we first apply a nonsymmetric rowpermutation as follows:

QAx = Q f . (2)

Page 6: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

6 Murat Manguoglu

=*

D-1A D-1f

I

I

I

I

I

I

I

x

I

I

I

Fig. 5 Spike system: Sx = g where g = D−1 f and S = D−1A for the system in Figure 4

II * =I

I

x gS

Fig. 6 Reduced system: Sx = g obtained from the spike system in Figure 5

Here, Q is the row permutation matrix that either maximizes the number of nonzeroson the diagonal of A, or the permutation that maximizes the product of the absolutevalues of the diagonal entries [11]. The first algorithm is known as maximum traver-sal search, while the second algorithm provides scaling factors so that the absolutevalues of the diagonal entries are equal to one and all other elements are less thanor equal to one. This scaling can be applied as follows:

(QD2AD1)(D−11 x) = (QD2 f ). (3)

Both algorithms are implemented in subroutine MC64 [10] of the HSL [17] library.Following the above nonsymmetric reordering and optional scaling, we apply the

symmetric permutation P as follows:

Page 7: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 7

(PQD2AD1PT )(PD−11 x) = (PQD2 f ). (4)

The permutation, P, can be chosen such that the magnitude of the nonzeros closerto the main diagonal are larger than the ones that are far away. Such a reorderingcan be obtained by solving the following eigenvalue problem. The second small-est eigenvalue and the corresponding eigenvector of the Laplacian of a graph havebeen used in a number of application areas including matrix reordering [26, 23, 5],graph partitioning [32, 33], machine learning [29], protein analysis and data mining[16, 42, 20], and web search [15]. The second smallest eigenvalue of the Laplacianof a graph is sometimes called the algebraic connectivity of the graph, and the cor-responding eigenvector is known as the Fiedler vector, due to the work of Fiedler[12].

For a given n× n sparse symmetric matrix A, or an undirected weighted graphwith positive weights, one can form the weighted-Laplacian matrix, Lw, as follows:

Lw(i, j) ={

∑ j 6=i |A(i, j)| if i = j,−|A(i, j)| if i 6= j.

(5)

Since the Fiedler vector can be computed independently for disconnected graphs,we assume that the graph is connected. The eigenvalues of Lw are different than zeroexcept λ1. The eigenvector x2 corresponding to smallest nontrivial eigenvalue λ2 iscalled the Fiedler vector. Since we assume a connected graph, the trivial eigenvector,x1, is a vector of all ones. In case the matrix, A, is nonsymmetric one can use (|A|+|AT |)/2, instead.

A state of the art multilevel solver [18] called MC73 Fiedler for computing theFiedler vector is implemented in the Harwell Subroutine Library(HSL) [17]. It usesa series of levels of coarser graphs where the eigenvalue problem correspondingto the coarsest level is solved via the Lanczos method for estimating the Fiedlervector. The results are then prolongated to the finer graphs and Rayleigh QuotientIterations (RQI) with shift and invert are used for refining the eigenvector. Linearsystems encountered in RQI are solved via the SYMMLQ algorithm. We considerMC73 Fiedler as one of the best uniprocessor implementation for determining theFiedler vector. A new parallel algorithm TraceMin-Fiedler is developed based on theTrace Minimization algorithm (TraceMin) [38, 35], and parallel results comparingit to MC73 Fiedler is presented in [25].

We consider solving the standard symmetric eigenvalue problem

Lx = λx (6)

where L denotes the weighted Laplacian, using the TraceMin scheme for obtainingthe Fiedler vector. The basic TraceMin algorithm can be summarized as follows. LetXk be an approximation of the eigenvectors corresponding to the p smallest eigen-values such that XT

k LXk = Σk and XTk Xk = I, where Σk = DIAG(ρ

(k)1 ,ρ

(k)2 , ...,ρ

(k)p ).

The updated approximation is obtained by solving the minimization problem

min tr(Xk−∆k)T L(Xk−∆k), subject to ∆

Tk Xk = 0. (7)

Page 8: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

8 Murat Manguoglu

This in turn leads to the need for solving a saddle point problem, in each iterationof the TraceMin algorithm, of the form[

L XkXT

k 0

][∆kNk

]=

[LXk

0

]. (8)

We solve first the Schur complement system (XTk L−1Xk)Nk = XT

k Xk for obtainingNk. After ∆k is retrieved, (Xk −∆k) is then used to obtain Xk+1 which forms thesection

XTk+1LXk+1 = Σk+1,XT

k+1Xk+1 = I. (9)

The TraceMin-Fiedler algorithm, which is based on the basic TraceMin algorithm,is given in Figure 7.

Algorithm 1:Data: L is the n×n Laplacian matrix defined in Eqn.(5) , εout is the stopping criterion for

the ||.||∞ of the eigenvalue problem residual, p is the number of eigenpairs to becomputed, and q is the dimension of the search space

Result: x2 is the eigenvector corresponding to the second smallest eigenvalue of Lp←− 2; q←− p+2 ;nconv←− 0; Xconv←− [ ];L←− L+ ||L||∞10−12× I ;D←− the diagonal of L ;D←− the diagonal of L ;X1←− rand(n,q);for k = 1,2, . . . max it do

1. Orthonormalize Xk into Vk;2. Compute the interaction matrix Hk←− VT

k LVk;3. Compute the eigendecomposition HkYk = YkΣk of Hk. The eigenvalues Σk arearranged in ascending order and the eigenvectors are chosen to be orthogonal;4. Compute the corresponding Ritz vectors Xk←− VkYk;Note that Xk is a section, i.e. XT

k LXk = Σk,XTk Xk = I;

5. Compute the relative residual ||LXk−XkΣk||∞/||L||∞;6. Test for convergence: If the relative residual of an approximate eigenvector is lessthan εout , move that vector from Xk to Xconv and replace nconv by nconv +1 increment. Ifnconv ≥ p, stop;7. Deflate: If nconv > 0,Xk←− Xk−Xconv(XT

convXk);8. if nconv = 0 then

Solve the linear system LWk = Xk approximately with relative residual εin via thePCG scheme using the diagonal preconditioner D;

elseSolve the linear system LWk = Xk approximately with relative residual εin via thePCG scheme using the diagonal preconditioner D;

9. Form the Schur complement Sk←− XTk Wk;

10. Solve the linear system SkNk = XTk Xk for Nk ;

11. Update Xk+1←− Xk−∆k = WkNk ;

Fig. 7 TraceMin-Fiedler algorithm.

Page 9: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 9

Using the above algorithm, speed improvements over the uniprocessor MC73 Fiedlerusing TraceMin-Fiedler on 1, 8, 16, 32, and 64 cores are shown in Figure 8 for ma-trices obtained from the University of Florida Sparse Matrix Collection [8] . Theplatform we use is a cluster with Infiniband interconnection where each node con-sists of two six-core Intel Xeon CPUs (Westmere X5670) running at 2.93GHz (12cores per node)

nlpkkt160 circuit5M cage15 rajat31 circuit5M_dc nlpkkt120 freescale1 memchip kktpow er0.1

1.0

10.0

100.0

1000.01248163264

Spe

ed Im

prov

emen

t

Fig. 8 Speed improvement : Time (MC73 Fiedler) / Time (TraceMin-Fiedler) for 9 test problems.

Reordering using the Fiedler vector provides matrices in which the large ele-ments are clustered aroung the main diagonal as shown in Figure 9 for a matrixobtained from the University of Florida Sparse Matrix Collection.

Once the reordered system is obtained one can extract a banded preconditionerand solve the general system using a preconditioned iterative method. Systems in-volving the preconditioner are solved at each iteration using the DS factorization. Inwhich systems involving the diagonal blocks in D can be solved by using (i) densesequential or multithreaded banded solvers [26] or (ii) a sparse sequential or multi-threaded direct solver. Second variation has been implemented in [41, 27, 23] and iscalled PSPIKE.

We obtained the g3 circuit (1,585,478 unknowns and 7,660,826 nonzeros)matrix from the University of Florida Sparse Matrix Collection. In Figure 10,speed improvements of PSPIKE, MUMPS and Pardiso are presented for g3 circuitsystem is presented. The platform we use is an Intel Xeon ([email protected])cluster with Inband interconnection and 16GB memory per node. In PSPIKE,BiCGStab [43] is used as the outer iterative solver and the iterations are stopped

Page 10: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

10 Murat Manguoglu

(a) original matrix (b) TraceMin-Fiedler

Fig. 9 Sparsity plots of eurqsa; red and blue indicates the largest and the smallest elements, re-spectively.

when || f −Ax||∞/|| f ||∞ ≤ 10−6. The reduced system is truncated to enhance paral-lelism and solved directly.

PSPIKE

PARDISO

MUMPS

0

1

2

3

4

5

6

7

8

9

32168421

Sp

eed

Im

pro

vem

ent

Number of Cores

Fig. 10 The speed improvement for g3 circuit compared to Pardiso using one core (73.5 seconds)

3.2 Domain Decomposing Parallel Sparse Solver

Given a general sparse linear system Ax = f , we partition A ∈ Rn×n into p blockrows A = [A1,A2, ...,Ap]

T . Let

A = D+R, (10)

where D consists of the p block diagonals of A,

Page 11: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 11

D =

A11

A22. . .

App

, (11)

and R consists of the remaining elements (i.e. R = A−D). We note that the pro-cess can be viewed as an algebraic domain decomposition. Therefore, we will callthe method described in this section Domain Decomposing Parallel Sparse Solver(DDPS). The DS factorization in DDPS is for solving the systems involving thepreconditioner.

Let Li and Ui be the incomplete or complete LU factorizations of Aii, wherei = 1,2, ..., p. We define

D =

A11

A22. . .

App

(12)

in which Aii = LiUi. The DDPS algorithm is shown in Figure 11.

Data: Ax = f and pResult: x1. D+R←− A for a given p;2. LiUi←− Aii (approximate or exact) for i = 1,2, ..., p;3. R←− R (by dropping some elements);4. S←− D−1R;5. identify nonzero columns of S and store their indices in array c;6. solve Ax = f via a Krylov subspace method with a preconditioner P = D+ R andstopping tolerance εout

solve Pz = y(D−1Pz = D−1y⇒ (I+S)z = g);6.1. g←− D−1y;6.2. S←− (I(c,c)+S(c,c)); z←− z(c); g←− g(c);6.3. solve the smaller independent system: Sz = g (directly or iteratively withstopping tolerance εin);6.4. z(c)←− z;6.5. z←− g−Sz;

endend

Fig. 11 DDPS algorithm.

Stages 1–5 are preprocessing stages where the right-hand-side is not required.After preprocessing we solve the system via a Krylov subspace method and using apreconditioner. The major operations in a Krylov subspace method are: (i) matrix–vector multiplications, (ii) inner products, and (iii) preconditioning operations in

Page 12: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

12 Murat Manguoglu

the form of Pz = y. Details of the preconditioning operations for DDPS are given inFigure 11.

Each stage, with the exception of Stage 6.3, can be executed with perfect par-allelism, requiring no interprocessor communication. In Stage 6.3, the solution ofthe smaller system Sz = g is the only part of the algorithm that requires commu-nication. We solve this smaller reduced system iteratively via BiCGStab withoutpreconditioning. The size of S, which is determined by the nonzero columns it has,is problem dependent and is expected to have an influence on the overall scalabil-ity of the algorithm. We employ several techniques to reduce the dimension of S.First, we use METIS [19] reordering to minimize the total communication volume,hence reducing the size of S. We also use the following dropping strategy: if forany column j in Ri ||R(:, j)i||∞ ≤ δ ×maxl ||R(:, l)i||∞ (i = 1,2, .., p) we do not con-sider that column when forming S. Here Ri is the block row partition of R (i.e.R = [R1,R2, ...,Rp]

T ). We call this dropping strategy apriori dropping and use itfor obtaining the results in this section. Another possibility is to drop elements aftercomputing S posteriori dropping.

We note that dropping elements from R in Stage 3 to reduce the size of S resultsin an approximation of the solution. Furthermore, we can use approximate LU fac-torization of the diagonal blocks in Stage 2 and solve Sz = g iteratively in Stage 6.3.Therefore, we place an outer iterative layer (e.g. BiCGStab) where we use the abovealgorithm as a solver for systems involving the preconditioner P = D+ R, where Rconsists only of the columns that are not dropped. We stop the outer iterations whenthe relative residual at the kth iteration ||rk||∞/||r0||∞ ≤ εout .

DDPS is a direct solver if (i) nothing is dropped from R, (ii) exact LU factoriza-tion of Aii is computed, and (iii) Sz = g is solved exactly. In the case of using DDPSas a direct solver, an outer iterative scheme is not required. The choices we make inStages 2, 3 and 6.3, result in a solver that can be as robust as a direct solver or asscalable as an iterative solver, or anything in between. We note that the outer itera-tive layer also benefits from our partitioning strategy as METIS minimizes the totalcommunication volume in parallel sparse matrix vector multiplications. We furthernote that S consists of dense columns, which we store as a two-dimensional array inmemory, and as a result matrix–vector multiplications can be done via BLAS2 (orBLAS3 in case of multiple right hand sides).

We obtained the torso3 matrix (259,156 unknowns and 4,429,042 nonzeros)from the University of Florida sparse matrix collection. In Figure 12 we presentfor this matrix the speed improvements of DDPS and MUMPS solvers comparedto Pardiso on a single core of an Intel Xeon ([email protected]) cluster with Inbandinterconnection and 16GB memory per node, and with εout = 10−6. In this example,DDPS uses Pardiso for solving systems involving the diagonal blocks of D.

Page 13: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 13

1 2 4 8 160

5

10

15

20

25MUMPS DDPS

number of cores

spee

d im

prov

emen

t

Fig. 12 The speed improvement for torso3 compared to Pardiso using one core (49.4 seconds)

4 Conclusions

We have presented two alternative formulations of DS factorization based methodsfor solution of sparse linear systems on parallel computing platforms. Performancecomparisons to traditional LU factorization based parallel solvers show that com-bining iterative methods with direct solvers and using DS factorization, one canachieve better scalability and shorter time to solution.

Acknowledgements I would like to thank Ahmed Sameh, Ananth Grama, Faisal Saied, Eric Cox,Kenji Takizawa, Madan Sathe, Mehmet Koyuturk, Olaf Schenk, and Tayfun Tezduyar for their helpand many useful discussions.

References

1. P. R. AMESTOY AND I. S. DUFF, Multifrontal parallel distributed symmetric and unsymmet-ric solvers, Comput. Methods Appl. Mech. Eng, 184 (2000), pp. 501–520.

2. P. R. AMESTOY, I. S. DUFF, J.-Y. L’EXCELLENT, AND J. KOSTER, A fully asynchronousmultifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl., 23(2001), pp. 15–41.

3. P. R. AMESTOY, A. GUERMOUCHE, J.-Y. L’EXCELLENT, AND S. PRALET, Hybrid schedul-ing for the parallel solution of linear systems, Parallel Comput., 32 (2006), pp. 136–156.

4. E. ANDERSON, Z. BAI, C. BISCHOF, J. DEMMEL, J. DONGARRA, J. DUCROZ, A. GREEN-BAUM, S. HAMMARLING, A. MCKENNEY, AND D. SORENSEN., LAPACK: A portable lin-ear algebra library for high-performance computers, tech. report, Knoxville, 1990.

5. S. T. BARNARD, A. POTHEN, AND H. SIMON, A spectral algorithm for envelope reductionof sparse matrices, Numerical Linear Algebra with Applications, 2 (1995), pp. 317–334.

Page 14: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

14 Murat Manguoglu

6. M. W. BERRY AND A. SAMEH, Multiprocessor schemes for solving block tridiagonal linearsystems, The International Journal of Supercomputer Applications, 1 (1988), pp. 37–57.

7. L. S. BLACKFORD, J. CHOI, A. CLEARY, E. D’AZEVEDO, J. DEMMEL, I. DHILLON,J. DONGARRA, S. HAMMARLING, G. HENRY, A. PETITET, K. STANLEY, D. WALKER,AND R. C. WHALEY, ScaLAPACK: a linear algebra library for message-passing computers,in Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Com-puting (Minneapolis, MN, 1997), Philadelphia, PA, USA, 1997, Society for Industrial andApplied Mathematics, p. 15 (electronic).

8. T. A. DAVIS, University of Florida sparse matrix collection. NA Digest, 1997.9. J. J. DONGARRA AND A. H. SAMEH, On some parallel banded system solvers, Parallel

Computing, 1 (1984), pp. 223–235.10. I. DUFF AND J. KOSTER, On algorithms for permuting large entries to the diagonal of a

sparse matrix, 1999.11. I. S. DUFF AND J. KOSTER, The design and use of algorithms for permuting large entries

to the diagonal of sparse matrices, SIAM Journal on Matrix Analysis and Applications, 20(1999), pp. 889–901.

12. M. FIEDLER, Algebraic connectivity of graphs, Czechoslovak Mathematical Journal, 23(1973), pp. 298–305.

13. A. GUPTA, Recent advances in direct methods for solving unsymmetric sparse systems oflinear equations, ACM Transactions on Mathematical Software, 28 (2002), pp. 301–324.

14. A. GUPTA, S. KORIC, AND T. GEORGE, Sparse matrix factorization on massively parallelcomputers, in SC’09 USB Key, ACM/IEEE, Portland, OR, USA, Nov. 2009. a1,.

15. X. HE, H. ZHA, C. H. DING, AND H. D. SIMON, Web document clustering using hyperlinkstructures, Computational Statistics & Data Analysis, 41 (2002), pp. 19 – 45.

16. D. J. HIGHAM, G. KALNA, AND M. KIBBLE, Spectral clustering and its use in bioinformat-ics, Journal of Computational and Applied Mathematics, 204 (2007), pp. 25 – 37. Specialissue dedicated to Professor Shinnosuke Oharu on the occasion of his 65th birthday.

17. HSL, A collection of Fortran codes for large-scale scientific computation, 2004. Seehttp://www.cse.scitech.ac.uk/nag/hsl/.

18. Y. HU AND J. SCOTT, HSL MC73: a fast multilevel Fiedler and profile reduction code, Tech-nical Report RAL-TR-2003-036, 2003.

19. G. KARYPIS AND V. KUMAR, METIS: Unstructured graph partitioning and sparse matrixordering system, The University of Minnesota, 2 (1995).

20. S. KUNDU, D. SORENSEN, AND J. G.N. PHILIPHSI, Automatic domain decomposition ofproteins by a gaussian network model, Proteins: Structure, Function, and Bioinformatics, 57(2004), pp. 725–733.

21. D. H. LAWRIE AND A. H. SAMEH, The computation and communication complexity of aparallel banded system solver, ACM Trans. Math. Softw., 10 (1984), pp. 185–195.

22. X. LI AND J. W. DEMMEL, Superlu dist: A scalable distributed-memory sparse direct solverfor unsymmetric linear systems, ACM Trans. Mathematical Software, 29 (2003), pp. 110–140.

23. M. MANGUOGLU, A parallel hybrid sparse linear system solver, in Computational Electro-magnetics International Workshop, 2009. CEM 2009, July 2009, pp. 38–43.

24. M. MANGUOGLU, A domain decomposing parallel sparse linear system solver, journal ofComputational and Applied Mathematics, (In review).

25. M. MANGUOGLU, E. COX, F. SAIED, AND A. SAMEH, TRACEMIN-Fiedler: A ParallelAlgorithm for Computing the Fiedler Vector, High Performance Computing for ComputationalScience–VECPAR 2010, (2011), pp. 449–455.

26. M. MANGUOGLU, M. KOYUTURK, A. SAMEH, AND A. GRAMA, Weighted matrix orderingand parallel banded preconditioners for iterative linear system solvers, SIAM Journal onScientific Computing, 32 (2010), pp. 1201–1216.

27. M. MANGUOGLU, A. SAMEH, AND O. SCHENK, Pspike: A parallel hybrid sparse linearsystem solver, Euro-Par 2009 Parallel Processing, (2009), pp. 797–808.

28. M. MANGUOGLU, K. TAKIZAWA, A. SAMEH, AND T. TEZDUYAR, Nested and parallelsparse algorithms for arterial fluid mechanics computations with boundary layer mesh refine-ment, International Journal for Numerical Methods in Fluids, 65 (2011), pp. 135–149.

Page 15: Parallel Solution of Sparse Linear Systemsuser.ceng.metu.edu.tr/~manguoglu/PDFs/sparse_manguoglu.pdfParallel Solution of Sparse Linear Systems Murat Manguoglu Abstract Many simulations

Parallel Solution of Sparse Linear Systems 15

29. A. Y. NG, M. I. JORDAN, AND Y. WEISS, On spectral clustering: Analysis and an algorithm,in Advances in Neural Information Processing Systems 14, MIT Press, 2001, pp. 849–856.

30. E. POLIZZI AND A. H. SAMEH, A parallel hybrid banded system solver: the spike algorithm,Parallel Comput., 32 (2006), pp. 177–194.

31. , Spike: A parallel environment for solving banded linear systems, Computers & Fluids,36 (2007), pp. 113–120.

32. A. POTHEN, H. D. SIMON, AND K.-P. LIOU, Partitioning sparse matrices with eigenvectorsof graphs, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 430–452.

33. H. QIU AND E. R. HANCOCK, Graph matching and clustering using spectral partitions,Pattern Recognition, 39 (2006), pp. 22 – 34.

34. D. J. K. S. C. CHEN AND A. H. SAMEH, Practical parallel band triangular system solvers,ACM Transactions on Mathematical Software, 4 (1978), pp. 270–277.

35. A. SAMEH AND Z. TONG, The trace minimization method for the symmetric generalizedeigenvalue problem, J. Comput. Appl. Math., 123 (2000), pp. 155–175.

36. A. H. SAMEH AND D. J. KUCK, On stable parallel linear system solvers, J. ACM, 25 (1978),pp. 81–91.

37. A. H. SAMEH AND V. SARIN, Hybrid parallel linear system solvers, International Journal ofComputational Fluid Dynamics, 12 (1999), pp. 213–223.

38. A. H. SAMEH AND J. A. WISNIEWSKI, A trace minimization algorithm for the generalizedeigenvalue problem, SIAM Journal on Numerical Analysis, 19 (1982), pp. 1243–1259.

39. O. SCHENK AND K. GARTNER, Solving unsymmetric sparse systems of linear equations withpardiso, Future Generation Computer Systems, 20 (2004), pp. 475 – 487.

40. , On fast factorization pivoting methods for sparse symmetric indefinite systems, Elec-tronic Transactions on Numerical Analysis, 23 (2006), pp. 158 – 179.

41. O. SCHENK, M. MANGUOGLU, A. SAMEH, M. CHRISTIAN, AND M. SATHE, Parallel scal-able PDE-constrained optimization: antenna identification in hyperthermia cancer treatmentplanning, Computer Science Research and Development, 23 (2009).

42. S. J. SHEPHERD, C. B. BEGGS, AND S. JONES, Amino acid partitioning using a fiedlervector model, Journal European Biophysics Journal, 37 (2007), pp. 105–109.

43. H. A. VAN DER VORST, Bi-cgstab: a fast and smoothly converging variant of bi-cg for thesolution of nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 631–644.