GPU-Accelerated Preconditioned Iterative Linear Solvers


J Supercomput (2013) 63:443–466. DOI 10.1007/s11227-012-0825-3

    GPU-accelerated preconditioned iterative linear solvers

Ruipeng Li · Yousef Saad

Published online: 5 October 2012. © Springer Science+Business Media New York 2012

Abstract  This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 processor. The overall performance of the GPU-accelerated Incomplete Cholesky (IC) factorization preconditioned CG method outperforms its CPU counterpart by a smaller factor, up to 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with preconditioning techniques better suited to GPUs, this performance can be further improved.

Keywords  GPU computing · Preconditioned iterative methods · Sparse matrix computations

    1 Introduction

Emerging many-core platforms yield enormous raw processing power in the form of massive SIMD parallelism. Chip designers are increasing processing capabilities by building multiple processing cores into a single chip.

R. Li · Y. Saad
Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN 55455, USA
e-mail: [email protected]

Y. Saad
e-mail: [email protected]


The number of cores that can fit into a chip is increasing at a fast pace, and as a result Moore's law has been given a new interpretation: it is the number of cores that doubles every 18 months [1]. The NVIDIA GPU (Graphics Processing Unit) is one of several available coprocessors that feature a large number of cores. Traditionally, GPUs were designed to handle computations for computer graphics in real time. Today, they are increasingly being exploited as general-purpose attached processors to speed up computations in image processing, physical simulations, data mining, linear algebra, etc. GPUs are specialized for compute-intensive and massively parallel computations. For this type of computation, they are often vastly superior to architectures like multicore CPUs, since more transistors are devoted to data processing rather than data caching and flow control [21].

CUDA (Compute Unified Device Architecture) is a parallel programming language for NVIDIA GPUs, which enables developers to program GPUs in C/C++ with NVIDIA extensions. Wrappers for Python, Fortran, Java, and MATLAB are also available. This paper explores various aspects of sparse linear algebra computations on GPUs. Specifically, we investigate appropriate preconditioning methods for high-performance iterative linear solvers accelerated by GPUs.

The remainder of this report is organized as follows: Sect. 2 gives a brief introduction to GPU architectures, the CUDA programming model, and related work. Section 3 describes the sparse matrix-vector product (SpMV) kernel on GPUs; Sect. 4 describes sparse triangular solves; the preconditioned iterative methods, CG and GMRES, are discussed in Sect. 5; Sect. 6 reports on experimental results and analyses. Finally, concluding remarks are stated in Sect. 7.

    2 GPU architecture and CUDA

    2.1 GPU architecture

A GPU is built as a scalable array of multithreaded Streaming Multiprocessors (SMs), each of which consists of multiple Scalar Processor (SP) cores. To manage hundreds of threads, the multiprocessors employ a Single Instruction Multiple Threads (SIMT) model. Each thread is mapped onto one SP core and executes independently with its own instruction address and register state [21].

The NVIDIA GPU platform has various memory hierarchies. The types of memory can be classified as follows: (1) off-chip global memory, (2) off-chip local memory, (3) on-chip shared memory, (4) read-only cached off-chip constant memory and texture memory, and (5) registers. The effective bandwidth of each type of memory depends significantly on the access pattern. Global memory is relatively large, but has a much higher latency (about 400 to 800 clock cycles for the NVIDIA Tesla) compared with the on-chip shared memory (1 clock cycle if there are no bank conflicts). Global memory is not cached, so it is important to follow the right access pattern to achieve good memory bandwidth.

Threads are organized in warps. A warp is defined as a group of 32 threads with consecutive thread IDs. A half-warp is either the first or the second half of a warp. The most efficient way to use the global memory bandwidth is to coalesce the simultaneous memory accesses by the threads of a half-warp into a single memory transaction.


For NVIDIA GPU devices with compute capability 1.2 and higher, memory access by a half-warp is coalesced as soon as the words accessed by all threads lie in the same segment of size 128 bytes, provided all threads access 32-bit or 64-bit words. Local memory accesses are always coalesced. Alignment and coalescing techniques are explained in detail in [21]. Since it is on chip, the shared memory is much faster than the global memory, but attention must also be paid to bank conflicts. The size of the shared memory is 48 KB per SM. Cached constant memory and texture memory are also beneficial for accessing reused data and data with 2D spatial locality.

    2.2 CUDA programming

Programming NVIDIA GPUs for general-purpose computing is supported by the NVIDIA CUDA (Compute Unified Device Architecture) environment. CUDA programs on the host (CPU) invoke a kernel grid, which runs on the device (GPU). The same parallel kernel is executed by many threads. These threads are organized into thread blocks. Thread blocks are distributed to SMs and split into warps scheduled by SIMT units. All threads in the same thread block share the same shared memory of size 48 KB and can synchronize via a barrier. Threads in a warp execute one common instruction at a time; this is referred to as warp-level synchronization [21]. Full efficiency is achieved when all 32 threads of a warp follow the same execution path. Branch divergence causes serial execution.
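As a minimal illustration of this execution model (not code from the paper), the following kernel computes one partial sum per thread block using shared memory and the __syncthreads() barrier; the block size of 256 threads matches the configuration used later in the experiments and is an assumption of the sketch.

    // Minimal illustration of the CUDA model described above: every thread of the grid
    // runs the same kernel; threads of a block cooperate through shared memory and a
    // barrier. Purely illustrative sketch.
    __global__ void block_sum(const double *in, double *out)
    {
        __shared__ double buf[256];                    // shared memory of this thread block
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];  // each thread loads one element
        __syncthreads();                               // barrier for the whole block

        for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];        // one partial sum per block
    }

    // launched, e.g., as: block_sum<<<ngrid, 256>>>(d_in, d_out);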

    2.3 Related work

The potential of GPUs for parallel sparse matrix computations was recognized early on, when GPU programming still required shader languages [7]. After the advent of NVIDIA CUDA, GPUs have drawn much more attention for sparse linear algebra and the solution of sparse linear systems. Existing implementations of the GPU sparse matrix-vector product (SpMV) include the CUDPP (CUDA Data Parallel Primitives) library [27], the NVIDIA Cusp library [5], and the IBM SpMV library [4]. In [4, 5], different sparse matrix formats are studied for producing high-performance SpMV kernels on GPUs. These include the compressed sparse row (CSR) format, the coordinate (COO) format, the diagonal (DIA) format, the ELLPACK (ELL) format, and a hybrid (ELL/COO) format. Recent work considers a new row-grouped CSR format published in [22] and a sliced ELLPACK format in [18]. A blocked hybrid format is studied in [17]. SpMV in blocked CSR/ELL formats and model-driven autotuning are presented in [8]. A good summary of the development of high-performance multicore and accelerator-based implementations of SpMV on recent hardware platforms can be found in [32].

With regard to iterative linear solvers, a GPU-accelerated Jacobi preconditioned CG method is studied in [13]. In [3], the CG method with incomplete Poisson preconditioning is proposed for the Poisson problem on a multi-GPU platform. The block-ILU preconditioned GMRES method is studied for solving sparse linear systems on NVIDIA Tesla GPUs in [31]. SpMV kernels on GPUs and the block-ILU preconditioned GMRES method are used in [28] for solving large sparse linear systems in black-oil and compositional flow simulations. A GPU implementation of a deflated PCG method for bubbly flow computations is discussed in [14].

For dense linear algebra computations, the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project [2], which uses hybrid multicore/multi-GPU systems, aims to develop a dense linear algebra library similar to LAPACK. They reported up to 375/75 GFLOPS in single/double precision for the GEMM matrix-matrix operation, and 280 GFLOPS in single precision for solving a linear system by LU factorization using multiple GPUs. Volkov and Demmel [30] reported that they were able to approach the peak performance of the G80 series GPUs for dense matrix-matrix multiplications.

    3 Sparse matrix-vector product

The sparse matrix-vector product (SpMV) is one of the major components of any sparse matrix computation. However, the SpMV kernel, which accounts for a big part of the cost of sparse iterative linear solvers, has difficulty reaching a significant percentage of peak floating-point throughput. It yields only a small fraction of the machine peak performance due to its indirect and irregular memory accesses [4]. Recently, several research groups have reported implementations of high-performance sparse matrix-vector multiplication kernels on GPUs [4, 5, 29]. NVIDIA researchers have demonstrated that their SpMV kernels can reach 15 GFLOPS and 10 GFLOPS in single and double precision, respectively, for unstructured mesh matrices. They provide a thorough description and performance evaluation of SpMV kernels using different storage formats in [5]. The hybrid ELL/COO kernel (HYB) they propose achieves the best performance among all formats. For details of the HYB format, see [6].

This section discusses the implementation of SpMV GPU kernels in several different nonblocked formats, which are used in our GPU-accelerated iterative methods. The CSR SpMV kernels provide important insights into the algorithmic and implementation principles needed to achieve high performance in GPU computing. The same ideas will also be used in the sparse triangular solve kernel in the CSR format, which is discussed in Sect. 4. In this paper, we suggest the JAD format for SpMV on GPUs, which provides a better solution than the CSR format in both parallelism granularity and memory access patterns. The DIA format and the ELL format are much more efficient for structured matrices. The performance comparison of these SpMV kernels will be shown in Sect. 6.

For a matrix-vector product y = Ax, data reuse applies to the vector x, which is read-only. Therefore, a good strategy is to place the vector x in the cached texture memory and access it by texture fetching [21]. As reported in [5], caching improves the throughput by an average of 30% and 25% in single and double precision, respectively.

    3.1 SpMV in CSR format

The compressed sparse row (CSR) format is probably the most widely used general-purpose sparse matrix storage format [24]. In the CSR format, the sparse matrix is specified by three arrays: a real/complex data array A which contains the nonzero elements row by row; an index array JA which contains the column index of each nonzero element of A; and an index array IA which contains the beginning position of each row. To parallelize SpMV for a matrix in the CSR format, a simple scheme called the scalar CSR kernel [5] assigns each row to one thread. The computations of all threads are independent. The main drawback of this scheme is that threads within a half-warp access noncontiguous memory addresses, pointed to by the row pointers. As mentioned in Sect. 2.1, this significantly reduces the chance that the memory transactions of the threads of a half-warp can be coalesced.

A better approach, termed the vector CSR kernel, is to assign a half-warp (16 threads) instead of only one thread to each row. In this approach, the first half-warp works on row one, the second works on row two, and so on. Since all threads within a half-warp access nonzero elements of the same row, these accesses are more likely to belong to the same memory segment, so the chance of coalescing should be higher. But this approach incurs another problem related to computing vector dot products in parallel (the dot product of a matrix row vector with the vector x). To solve this problem, each thread saves its partial result into shared memory and a parallel reduction is used to sum up all partial results [4, 5]. Moreover, since a warp executes one common instruction at a time, threads within a warp are implicitly synchronized, so the synchronization in the half-warp parallel reduction can be omitted for better performance [21]. In addition, full efficiency of the vector CSR kernel requires at least 16 nonzeros per row in A. Our vector CSR kernel is slightly different from the implementation in the Cusp library [5], in which a whole warp of threads is assigned to each row. Based on our experiments, the performance of these two implementations is very close. For matrices which have few nonzeros per row, the half-warp kernel is slightly better than the whole-warp kernel.
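The following CUDA sketch illustrates such a half-warp vector CSR kernel. It assumes 0-based CSR arrays (ia, ja, a), a one-dimensional thread block of 256 threads, and double precision; it is an illustration of the scheme described above, not the authors' implementation.

    #define HALF_WARP 16

    // Vector CSR SpMV sketch: one 16-thread half-warp per matrix row.
    __global__ void spmv_csr_vector(int n, const int *ia, const int *ja,
                                    const double *a, const double *x, double *y)
    {
        __shared__ volatile double partial[256];          // one slot per thread of the block
        int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
        int row  = tid / HALF_WARP;                       // global half-warp id == row id
        int lane = threadIdx.x & (HALF_WARP - 1);         // position within the half-warp

        if (row < n) {
            double sum = 0.0;
            // lanes stride jointly over the nonzeros of this row (coalesced reads of a, ja)
            for (int j = ia[row] + lane; j < ia[row + 1]; j += HALF_WARP)
                sum += a[j] * x[ja[j]];
            partial[threadIdx.x] = sum;

            // parallel reduction within the half-warp; no barrier needed because the
            // threads of a warp are implicitly synchronized
            if (lane < 8) partial[threadIdx.x] += partial[threadIdx.x + 8];
            if (lane < 4) partial[threadIdx.x] += partial[threadIdx.x + 4];
            if (lane < 2) partial[threadIdx.x] += partial[threadIdx.x + 2];
            if (lane < 1) partial[threadIdx.x] += partial[threadIdx.x + 1];

            if (lane == 0) y[row] = partial[threadIdx.x]; // lane 0 writes the row result
        }
    }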

    3.2 SpMV in JAD format

The JAD (JAgged Diagonal) format can be viewed as a generalization of the Ellpack-Itpack (ELL) format which removes the assumption of fixed-length rows [26]. To build the JAD structure, we first sort the rows by nonincreasing number of nonzeros per row. Then the first JAD consists of the first element of each row; the second JAD consists of the second elements, etc. The number of JADs is the largest number of nonzeros per row. An example is shown below.

    A  = [  1   0   2   0   0 ]        PA = [  3   4   0   5   0 ]
         [  3   4   0   5   0 ]             [  0   6   7   0   8 ]
         [  0   6   7   0   8 ]             [  1   0   2   0   0 ]
         [  0   0   9  10   0 ]             [  0   0   9  10   0 ]
         [  0   0   0  11  12 ]             [  0   0   0  11  12 ]

Here, A is the original matrix and PA is the reordered matrix. The JAD format of A is then shown in Fig. 1. The JAD data structure also consists of three arrays: a real/complex data array DJ which contains the nonzero elements; an index array JDIAG which contains the column index of each nonzero; and an index array IDIAG which contains the beginning position of each JAD.


    DJ     3  6  1  9 11  4  7  2 10 12  5  8
    JDIAG  1  2  1  3  4  2  3  3  4  5  4  5
    IDIAG  1  6 11 13

    Fig. 1 The JAD format of matrix A

Fig. 2 JAD kernel memory access pattern

As in the scalar CSR kernel, only one thread works on each matrix row, to exploit fine-grained parallelism. Note that since the matrix data and indices are stored in the JAD format, the JAD kernel does not suffer from the performance drawback of the scalar CSR kernel: consecutive threads access contiguous memory addresses, which follows the suggested memory access pattern and improves the memory bandwidth by coalescing memory transactions. To ensure coalesced memory access, artificial zeros are added to make each JAD aligned in the device memory. The memory access pattern is shown in Fig. 2.
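A sketch of such a one-thread-per-row JAD kernel is given below. The arrays DJ, JDIAG, IDIAG follow the description above, with 0-based indexing and with each jagged diagonal padded to an aligned length; padded entries are assumed to hold a zero value and a valid dummy column index, and the output is produced in the permuted row order. This is an illustration, not the authors' code.

    // JAD SpMV sketch: one thread per row of the permuted matrix.
    __global__ void spmv_jad(int n, int njad, const int *idiag, const int *jdiag,
                             const double *dj, const double *x, double *y_perm)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // index in the permuted ordering
        if (row >= n) return;

        double sum = 0.0;
        for (int d = 0; d < njad; d++) {
            int start = idiag[d];                  // beginning of the d-th jagged diagonal
            int len   = idiag[d + 1] - start;      // its (padded) length
            if (row < len)
                // consecutive threads read consecutive entries of dj/jdiag: coalesced
                sum += dj[start + row] * x[jdiag[start + row]];
        }
        y_perm[row] = sum;   // in permuted order; scatter back with the row permutation
    }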

    3.3 SpMV in DIA/ELL format

The DIA (DIAgonal) format is a special format for diagonally structured matrices in which the nonzero values are restricted to lie in a small number of diagonals. This format is efficient in memory bandwidth because there is no double indexing as was the case in the CSR or JAD formats. Specifically, the column index of an element can be calculated as its row index plus the offset value of its diagonal, which is stored in IOFF. But this format can also potentially waste storage, since zeros are padded for nonfull diagonals. The Ellpack-Itpack format is a more general scheme. The assumption in this format is that there are at most Nd nonzero elements per row, where Nd is small. Then two arrays of dimension n × Nd are required to store the nonzero elements and the column indices, respectively, where n is the dimension of the matrix.

The DIA format and the ELL format of the matrix A shown above are illustrated in Figs. 3 and 4. In the DIA SpMV kernel, one thread works on each matrix row, and consecutive threads access contiguous memory addresses if the 2D data array DIAG is stored in column-major form. The ELL SpMV kernel works similarly, except that the column indices are loaded from the array JCOEF.


Fig. 3  DIA format of matrix A

    DIAG = [  *   1   2 ]      IOFF = [ -1  0  2 ]
           [  3   4   5 ]
           [  6   7   8 ]
           [  9  10   * ]
           [ 11  12   * ]

Fig. 4  ELL format of matrix A

    COEF = [  1   2   0 ]      JCOEF = [ 1  3  1 ]
           [  3   4   5 ]              [ 1  2  4 ]
           [  6   7   8 ]              [ 2  3  5 ]
           [  9  10   0 ]              [ 3  4  4 ]
           [ 11  12   0 ]              [ 4  5  5 ]

It is important to note that the DIA format and the ELL format are not general-purpose formats, and the patterns of many sparse matrices are not suitable for them.
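For completeness, a sketch of the one-thread-per-row DIA kernel is shown below; DIAG is assumed to be stored in column-major order so that consecutive threads read consecutive elements of each diagonal, padded entries hold zeros, and indexing is 0-based. This is an illustration, not the authors' code.

    // DIA SpMV sketch: one thread per row, ndiag diagonals with offsets in ioff.
    __global__ void spmv_dia(int n, int ndiag, const int *ioff,
                             const double *diag, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;

        double sum = 0.0;
        for (int d = 0; d < ndiag; d++) {
            int col = row + ioff[d];                   // column index = row index + offset
            if (col >= 0 && col < n)
                sum += diag[d * n + row] * x[col];     // column-major: diagonal d is contiguous
        }
        y[row] = sum;
    }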

    4 Sparse triangular solve

In the row version of the standard forward and backward substitution algorithms for solving triangular systems, the outer loop of the substitution for each unknown is sequential. The inner loop is a vector dot product of a sparse row vector of the triangular matrix and the dense right-hand side vector b. In the column-oriented versions, the inner loop is a saxpy operation on a sparse column vector and the vector b. These dot product and saxpy operations can be split and parallelized, but this approach is not efficient in general, since the length of the vectors involved is usually short.

    4.1 Level scheduling

Better parallelism can be achieved by level scheduling, which exploits topological sorting [26]. The idea is to group the unknowns x(i) into different levels so that all unknowns within the same level can be computed simultaneously. For forward substitution, the level of unknown x(i), denoted by l(i), can be computed as follows after initializing l(j) = 0 for all j:

    l(i) = 1 + max_j { l(j) },   for all j such that L_ij ≠ 0.

Suppose the lower triangular matrix L is stored row-wise in the CSR format (in arrays al, jal, and ial) without its unit diagonal. If nlev is the number of levels, the permutation array q records the ordering of all unknowns, level by level, and level(i) points to the beginning of the ith level in array q, then forward substitution with level scheduling is shown in Algorithm 1. The inequality 1 ≤ nlev ≤ n is satisfied, where n is the size of the system. The best case, nlev = 1, corresponds to the situation where all unknowns can be computed simultaneously (e.g., L is a diagonal matrix). In the worst case, nlev = n, the forward substitution is completely sequential. The analogous back substitution is handled similarly.


Algorithm 1 Forward substitution with level scheduling
 1: for lev = 1, ..., nlev do
 2:   j1 = level(lev)
 3:   j2 = level(lev + 1) - 1
 4:   for k = j1, ..., j2 do
 5:     i = q(k)
 6:     for j = ial(i), ..., ial(i + 1) - 1 do
 7:       x(i) = x(i) - al(j) * x(jal(j))
 8:     end for
 9:   end for
10: end for
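For completeness, the following host-side sketch shows how the level array used by Algorithm 1 can be built from the CSR structure of the strictly lower triangular part; it is an illustrative helper with 0-based indexing, not the authors' implementation.

    /* Sketch: compute the level of each unknown for forward substitution.
       ial/jal describe the strictly lower triangular part of L in CSR (0-based).
       Returns nlev and fills lev[i] with the 1-based level of unknown i. */
    int compute_levels(int n, const int *ial, const int *jal, int *lev)
    {
        int nlev = 0;
        for (int i = 0; i < n; i++) {
            int l = 0;                              /* max level among unknowns row i uses */
            for (int j = ial[i]; j < ial[i + 1]; j++)
                if (lev[jal[j]] > l)                /* jal[j] < i, so lev[jal[j]] is final */
                    l = lev[jal[j]];
            lev[i] = l + 1;                         /* l(i) = 1 + max_j l(j) over L_ij != 0 */
            if (lev[i] > nlev) nlev = lev[i];
        }
        return nlev;
    }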

    4.2 GPU implementation

The implementation of the row version of the forward/backward substitution kernel is similar to the vector CSR SpMV kernel. With level scheduling, the equations are partitioned into levels. Starting from the first level, each half-warp picks one unknown of the current level to solve, and the vector dot product is split and summed up by a parallel reduction of the partial results, as was done in the vector CSR SpMV kernel. Synchronization among half-warps is needed after each level is finished. In CUDA, thread synchronization within a block is supported by calling the __syncthreads() function, so one could use a single-block kernel to implement the triangular solve with level scheduling. However, this may not be efficient due to an insufficient number of threads. Therefore, we use multiple thread blocks to perform the forward/backward substitution in parallel. The only way to synchronize all thread blocks is to launch another kernel for the next level after the current one terminates.
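A minimal sketch of this structure is shown below: the level loop stays on the host, and one kernel launch per level provides the synchronization between thread blocks (for brevity, one thread rather than a half-warp is used per unknown). Array names mirror Algorithm 1 with 0-based indexing; the code is illustrative only.

    // Per-level kernel: each thread solves one unknown of the current level.
    __global__ void lev_solve_kernel(int nrows, const int *rows,
                                     const int *ial, const int *jal,
                                     const double *al, double *x)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k >= nrows) return;
        int i = rows[k];                              // unknown assigned to this thread
        double xi = x[i];                             // x holds the rhs on input
        for (int j = ial[i]; j < ial[i + 1]; j++)     // strictly lower part of row i
            xi -= al[j] * x[jal[j]];                  // these x entries were finalized earlier
        x[i] = xi;                                    // unit diagonal assumed
    }

    // Host driver: one launch per level; launches on the same stream run in order,
    // so the next level starts only after the current one has completed.
    void lower_solve_levels(int nlev, const int *level, const int *q,
                            const int *ial, const int *jal, const double *al, double *x)
    {
        const int BLOCK = 256;
        for (int lev = 0; lev < nlev; lev++) {
            int nrows = level[lev + 1] - level[lev];
            int nblocks = (nrows + BLOCK - 1) / BLOCK;
            lev_solve_kernel<<<nblocks, BLOCK>>>(nrows, q + level[lev], ial, jal, al, x);
        }
    }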

The performance of the parallel triangular solve on GPUs depends significantly on the number of levels of the triangular matrices. In some cases, a reordering technique like the Multiple Minimum Degree (MMD) ordering [12] may help reduce the number of levels and thus increase parallelism. However, in general, the performance of solving a sparse triangular system on GPUs is lower than on CPUs, especially for triangular systems obtained from high-fill-in factorizations, which usually have a large number of levels.

    5 Preconditioned iterative methods

Linear systems are often derived from the discretization of partial differential equations by the finite difference, finite element, or finite volume method. Matrices from these applications are usually large and sparse, which means that a large number of the entries of A are zeros. The Conjugate Gradient (CG) and the Generalized Minimal RESidual (GMRES) methods (see, e.g., [26]) are extensively used Krylov subspace methods for iteratively solving symmetric positive definite (SPD) and nonsymmetric linear systems, respectively. These are termed accelerators.

A preconditioner is any form of implicit or explicit modification of an original system that makes it easier to solve by a given accelerator [26]. The right-preconditioned linear system takes the form

    AM^{-1}u = b,   u = Mx.

The matrix M is the preconditioning matrix. It should be close to A, i.e., M ≈ A, and the preconditioning operation M^{-1}v should be easy to apply for an arbitrary vector v.

The main computational components of the preconditioned CG or GMRES methods are: (1) the matrix-vector product; (2) the preconditioning operation; (3) level-1 BLAS vector operations. Preconditioned iterative methods can therefore be accelerated by the GPU SpMV kernels and by the CUBLAS library, an implementation of the BLAS on top of the NVIDIA CUDA driver [19].
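To make the role of these three components concrete, the following host-side skeleton of a preconditioned CG iteration is given as a sketch. Here spmv(), prec_apply(), and the level-1 wrappers (dot, axpy, scal, copy_vec) are assumed thin wrappers around the GPU SpMV kernels, the preconditioning operation, and cuBLAS; all names and the structure are assumptions of the sketch, not the authors' code.

    #include <math.h>

    // Assumed GPU wrappers (declarations only; implementations not shown):
    double dot(int n, const double *x, const double *y);          // cublasDdot
    void   axpy(int n, double a, const double *x, double *y);     // y += a*x (cublasDaxpy)
    void   scal(int n, double a, double *x);                      // x *= a   (cublasDscal)
    void   copy_vec(int n, const double *x, double *y);           // y  = x   (cublasDcopy)
    void   spmv(const void *A, const double *x, double *y);       // y  = A*x (Sect. 3 kernels)
    void   prec_apply(const void *M, const double *r, double *z); // z  = M^{-1} r

    // Sketch of preconditioned CG; r, z, p, Ap are preallocated device vectors.
    void pcg(int n, double tol, int maxit, const void *A, const void *M,
             const double *b, double *x, double *r, double *z, double *p, double *Ap)
    {
        spmv(A, x, Ap);                                  // (1) SpMV: r = b - A*x
        copy_vec(n, b, r);  axpy(n, -1.0, Ap, r);
        prec_apply(M, r, z);                             // (2) preconditioning: z = M^{-1} r
        copy_vec(n, z, p);
        double rho = dot(n, r, z), bnrm = sqrt(dot(n, b, b));

        for (int it = 0; it < maxit && sqrt(dot(n, r, r)) > tol * bnrm; it++) {
            spmv(A, p, Ap);                              // (1) SpMV
            double alpha = rho / dot(n, p, Ap);          // (3) level-1 BLAS
            axpy(n,  alpha, p,  x);                      //     x += alpha*p
            axpy(n, -alpha, Ap, r);                      //     r -= alpha*A*p
            prec_apply(M, r, z);                         // (2) preconditioning
            double rho_new = dot(n, r, z);
            scal(n, rho_new / rho, p);  axpy(n, 1.0, z, p);   // p = z + beta*p
            rho = rho_new;
        }
    }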

The convergence rate of iterative methods is highly dependent on the quality of the preconditioner used. The computational complexities of different types of preconditioning operations differ, and their performance varies significantly across computing platforms. In the following sections, several types of preconditioners along with their GPU implementations are discussed in detail.

    5.1 ILU/IC preconditioners

The Incomplete LU (ILU) factorization process for a sparse matrix A computes a sparse lower triangular matrix L and a sparse upper triangular matrix U such that the residual matrix R = LU - A satisfies certain conditions [26]. Since the preconditioning matrix is M = LU with LU ≈ A, the preconditioning operation consists of a forward and a backward substitution, i.e., u = M^{-1}v = U^{-1}L^{-1}v. If no fill-in is allowed in the ILU process, we obtain the so-called ILU(0) preconditioner, in which the zero pattern of L and U is exactly the same as that of A. The accuracy of this ILU(0) factorization may not be sufficient in many cases. The more accurate ILU(k) factorization allows more fill-in by using the notion of level of fill. The ILUT approach uses an alternative method to drop fill-in: instead of levels, it drops entries based on the numerical values of the fill-in elements; if a value is smaller than some threshold, it is zeroed out. Details are discussed in [26]. When A is SPD, the incomplete Cholesky (IC) factorization is used as a preconditioner for the CG method. However, the incomplete Cholesky factorization of an SPD matrix does not always exist. In that case the modified incomplete Cholesky (MIC) factorization discussed in [23], which exists for any SPD matrix, is used.

As mentioned in Sect. 4.2, the performance of a triangular solve on GPUs deteriorates severely as the number of levels increases. Triangular systems obtained from ILU or IC factorizations which allow many fill-ins usually have a large number of levels. Although MMD reordering can help increase parallelism in some cases, in general triangular solves on GPUs are still slower than on the CPU. Due to the weakness of GPUs in processing tasks with small degrees of parallelism, applying ILU/IC preconditioners on GPUs may not be a good choice. Therefore, if a triangular solve cannot be performed efficiently using level scheduling on the GPU, it is executed on the CPU instead for better performance. This implies transferring data (the preconditioned vector) between the CPU and the GPU in each iteration step. In CUDA, page-locked/mapped host memory can be used for this portion of data to achieve higher memory bandwidth [21]. Although this violates one of the performance guidelines in [21], "minimize data transfer between the host and the device, ... even if that means running kernels with low parallelism computations," according to the experimental results this hybrid CPU/GPU computation still exhibits higher performance than performing the triangular solves on the GPU.

Hence, for iterative methods, traditional ILU/IC-type preconditioners, especially high-fill-in ILU/IC preconditioners, are not well suited for GPU acceleration. To obtain a higher speed-up from the GPU, parallel preconditioners that are better suited to GPU parallelism must be developed and implemented. The block Jacobi preconditioners, multicolor SSOR/ILU(0) preconditioners, and polynomial preconditioners discussed in the following sections aim to achieve high performance in this way.

    5.2 Block Jacobi preconditioners

The simplest parallel preconditioner might be the block Jacobi technique, which breaks up the system into independent local subsystems. These subsystems correspond to the diagonal blocks of the reordered matrix. The global solution is obtained by combining the solutions of all local systems, and these local systems can be solved approximately by ILU factorizations. The setup and application phases of the preconditioner can be performed locally without data dependencies. Therefore, one CUDA thread block is assigned to each local block and no global synchronization or communication is needed. Threads within each thread block cooperate to solve the local system and save the local result to the global output vector.

Although a block Jacobi preconditioner with a large number of blocks can provide high parallelism, a well-known drawback is that it usually results in a much larger number of iterations to converge. The benefits of increased parallelism may be outweighed by the increased number of iterations.

    5.3 Multicolor SSOR/ILU(0) preconditioners

A graph G is p-colorable if its vertices can be colored with p colors such that no two adjacent vertices have the same color. A simple greedy algorithm, known as the sequential coloring algorithm, can provide an adequate coloring with a number of colors bounded by Δ(G) + 1, where Δ(G) is the maximum degree of the vertices of G.

Suppose a graph is colored with p colors and its vertices are reordered into p blocks by color. Then its reordered adjacency matrix A has the following structure:

    A = [ D_1                      F    ]
        [       D_2                     ]
        [             . . .             ]
        [ E                 D_{p-1}     ]
        [                          D_p  ]        (1)

where the p diagonal blocks are diagonal matrices, since no two vertices of the same color are adjacent. If a pointer array ptr is defined so that ptr(i) points to the first row of the ith color block, then the MC-SOR (multicolor SOR) sweep takes the form shown in Algorithm 2.


Algorithm 2 Multicolor SOR sweep
1: for col = 1, ..., p do
2:   n1 = ptr(col)
3:   n2 = ptr(col + 1) - 1
4:   r(n1:n2) = ω (rhs(n1:n2) - A(n1:n2, :) x)
5:   x(n1:n2) = x(n1:n2) + D_col^{-1} r(n1:n2)
6: end for

In Algorithm 2, the loop in line 1 runs from the first color to the last one. For each color, a scaling by the diagonal matrix D_col and a matrix-by-vector product A(n1:n2, :) x are performed. This loop is sequential; however, the scaling and the matrix-by-vector product are easy to parallelize. A multicolor SOR sweep can be executed efficiently on GPUs when the number of colors is small, since the number of required matrix-by-vector products is small and the size of the submatrix for each color is large. The symmetric SOR (SSOR) sweep consists of the SOR sweep followed by a backward SOR sweep, i.e., from the last color to the first one. The MC-SSOR(k) multisweep, in which k SSOR sweeps are performed instead of just one, can be used as a preconditioner.
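The sketch below illustrates how one such sweep can be organized on the GPU: the loop over colors remains on the host, while each color block is updated by one kernel (a row-block matrix-vector product plus a diagonal scaling). The CSR arrays (ia, ja, a), the inverted diagonal invdiag, the relaxation parameter omega, and 0-based indexing are assumptions of the illustration; this is not the authors' implementation.

    // One color block of an MC-SOR sweep: rows of the same color have no coupling among
    // themselves, so they can be updated simultaneously.
    __global__ void mc_sor_color(int n1, int n2, const int *ia, const int *ja,
                                 const double *a, const double *invdiag,
                                 const double *rhs, double *x, double omega)
    {
        int i = n1 + blockIdx.x * blockDim.x + threadIdx.x;   // row within the color block
        if (i >= n2) return;

        double r = rhs[i];
        for (int j = ia[i]; j < ia[i + 1]; j++)   // full row of A times the current x
            r -= a[j] * x[ja[j]];
        x[i] += omega * invdiag[i] * r;           // scaling by the inverse diagonal block
    }

    // Host loop over the colors (sequential, as in Algorithm 2).
    void mc_sor_sweep(int p, const int *ptr, const int *ia, const int *ja, const double *a,
                      const double *invdiag, const double *rhs, double *x, double omega)
    {
        const int BLOCK = 256;
        for (int col = 0; col < p; col++) {
            int n1 = ptr[col], n2 = ptr[col + 1];
            int nblocks = (n2 - n1 + BLOCK - 1) / BLOCK;
            mc_sor_color<<<nblocks, BLOCK>>>(n1, n2, ia, ja, a, invdiag, rhs, x, omega);
        }
    }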

The MC-ILU(0) (multicolor ILU(0)) preconditioner is another preconditioner which achieves good parallelism in the preconditioning operation through multicolor reordering. The L/U factors obtained from the ILU(0) factorization of a multicolor-reordered matrix have the structure of (1). It is not hard to see that triangular systems of this form can be solved by a sequence of matrix-by-vector products and diagonal scalings. The parallelism achieved is identical to that of the MC-SSOR preconditioning. However, as is well known [26], since many more elements are dropped in the multicolor version of ILU(0) than in the version with the original ordering, the number of iterations to achieve convergence is usually larger.

For both MC-SSOR(k) and MC-ILU(0), the parallelism is of the order of n/p, where p is the number of colors. When p is large, a possible way to achieve good performance in the SOR sweep and the triangular solve is to sparsify the graph. This sparsification can be carried out as follows. First, execute the greedy algorithm to generate a coloring with p colors. Since p is bounded by Δ(G) + 1, take all nodes whose degree is larger than or equal to p - 1 and remove their edges with relatively small weight. This process can be repeated a small number of times until a small enough p is obtained. This sparsification is usually effective in reducing the number of colors.

By applying the multicolor ordering of the sparsified matrix to the original matrix, we obtain an approximate multicolor ordering of the original matrix. With this ordering, the diagonal blocks D_i are no longer diagonal matrices, but their off-diagonal entries are small. If these small entries can be ignored, the same algorithms can be used to perform approximate MC-SSOR/ILU(0) preconditioning.

    5.4 Least-squares polynomial preconditioners

The polynomial preconditioning matrix M is defined by M^{-1} = s(A), where s is some polynomial. Thus, the right-preconditioned system can be written as As(A)x = b. There is no need to construct the matrices s(A) or As(A) explicitly, since for an arbitrary vector v, As(A)v can be computed by a sequence of sparse matrix-vector products.

Consider the inner product on the space P_k, the space of polynomials of degree not exceeding k:

    ⟨p, q⟩_ω = ∫_l^u p(λ) q(λ) ω(λ) dλ,

where p(λ) and q(λ) are polynomials and ω(λ) is some nonnegative weight on the interval [l, u]. The corresponding norm ⟨p, p⟩_ω^{1/2} is called the ω-norm, denoted by ‖p‖_ω. Our aim is to find the polynomial s_{k-1} that minimizes ‖1 - λ s(λ)‖_ω over all polynomials s of degree less than k. This polynomial s_{k-1} is called the least-squares polynomial [26].

The explicit formula of this polynomial using the Jacobi weights, with its coefficients tabulated up to degree 8, is given in [26]. As discussed in [11], the function φ(t) can be approximated by its least-squares approximation, the polynomial φ_k(t):

    φ(t) ≈ φ_k(t) = Σ_{j=1}^{k} γ_j P_j(t),    γ_j = ⟨φ(t), P_j(t)⟩_ω,

where {P_j} is a basis of polynomials which is orthonormal for some L2 inner product. The orthonormal basis of polynomials is traditionally computed by the well-known Stieltjes procedure, which uses the 3-term recurrence [9]

    β_{j+1} P_{j+1}(t) = t P_j(t) - α_j P_j(t) - β_j P_{j-1}(t).    (2)

Starting with a certain polynomial S_0, the procedure is shown in Algorithm 3. The inner product and polynomial basis should be carefully chosen in order to avoid numerical integration. The basis of Chebyshev polynomials was used in [11]. The inner products and the computation of the Stieltjes coefficients are also discussed in detail in [11].

For solving linear systems, we want to approximate φ(A)b, in which φ(A) = A^{-1}. Assume that A is symmetric and that Λ(A), the spectrum of A, is included in the interval [l, u].

Algorithm 3 Stieltjes procedure
 1: P_0 ≡ 0
 2: β_1 = ‖S_0‖_ω,  P_1(t) = S_0(t)/β_1
 3: γ_1 = ⟨φ(t), P_1(t)⟩_ω
 4: for j = 1, ..., k do
 5:   α_j = ⟨t P_j, P_j⟩_ω
 6:   S_j(t) = t P_j(t) - α_j P_j(t) - β_j P_{j-1}(t)
 7:   β_{j+1} = ‖S_j‖_ω
 8:   P_{j+1}(t) = S_j(t)/β_{j+1}
 9:   γ_{j+1} = ⟨φ(t), P_{j+1}(t)⟩_ω
10: end for


Algorithm 4 Approximation of x_0 + A^{-1} r_0
1: v_0 = 0,  v_1 = r_0/β_1
2: x_1 = x_0 + γ_1 v_1
3: for j = 1, ..., k do
4:   v_{j+1} = (A v_j - α_j v_j - β_j v_{j-1}) / β_{j+1}
5:   x_{j+1} = x_j + γ_{j+1} v_{j+1}
6: end for

Given an initial guess x_0 of the solution, the initial residual is r_0 = b - A x_0. Then the solution can be approximated as

    x = x_0 + A^{-1} r_0 ≈ x_0 + Σ_{j=1}^{k} γ_j P_j(A) r_0.

Define v_j = P_j(A) r_0; then x ≈ x_0 + Σ_{j=1}^{k} γ_j v_j. With the 3-term recurrence in (2), it is easy to derive the following recurrence:

    β_{j+1} v_{j+1} = A v_j - α_j v_j - β_j v_{j-1}.

With the coefficients α_i, β_i, and γ_i obtained from the Stieltjes procedure by setting S_0 = t, the vector x which approximates x_0 + A^{-1} r_0 can be computed by the iteration shown in Algorithm 4.

This method can produce a good approximation of A^{-1} provided that tight bounds of the smallest eigenvalue λ_n and the largest eigenvalue λ_1 can be found. As mentioned in [33], the upper bound must be larger than λ_1, but it should not be too large, as this would otherwise result in slower convergence. The simplest technique is to use the Gershgorin circle theorem, but bounds obtained in this way are usually big overestimates of λ_1. A much better estimate can be obtained inexpensively by using a small number of steps of the Lanczos method [16]. A safeguard term is added to quasi-guarantee that the upper bound is larger than λ_1(A). It equals the absolute value of the last element of the normalized eigenvector associated with the largest eigenvalue of the tridiagonal matrix. This safeguard is not theoretical, but is observed to be generally effective [33]. The computations in the preconditioning operation consist of SpMV and level-1 BLAS vector operations, both of which can be performed efficiently on GPUs. Typically a small number of Lanczos iterations is enough to provide a good estimate of the extreme eigenvalues. The Lanczos algorithm can be accelerated by GPUs as well, since the computations it requires are also SpMVs and level-1 BLAS vector computations.
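As an illustration of this last point, the sketch below applies Algorithm 4 on the GPU using cuBLAS for the vector updates and an assumed spmv() wrapper for the matrix-vector products. The coefficient arrays alpha, beta, gamma (holding α_1..α_k, β_1..β_{k+1}, γ_1..γ_{k+1}) and the device workspace vectors are assumptions of the sketch, not the authors' code.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void spmv(const void *A, const double *x, double *y);   // assumed: y = A*x on the GPU

    // Sketch of Algorithm 4: x holds x_0 on input and the approximation on output.
    void ls_poly_apply(cublasHandle_t h, int n, int k, const void *A,
                       const double *alpha, const double *beta, const double *gamma,
                       const double *r0, double *x,
                       double *v_prev, double *v_cur, double *v_next)
    {
        double c;
        cudaMemset(v_prev, 0, n * sizeof(double));           // v_0 = 0
        cublasDcopy(h, n, r0, 1, v_cur, 1);                   // v_1 = r_0 / beta_1
        c = 1.0 / beta[0];   cublasDscal(h, n, &c, v_cur, 1);
        c = gamma[0];        cublasDaxpy(h, n, &c, v_cur, 1, x, 1);   // x_1 = x_0 + gamma_1 v_1

        for (int j = 1; j <= k; j++) {
            spmv(A, v_cur, v_next);                           // v_next = A v_j
            c = -alpha[j - 1];  cublasDaxpy(h, n, &c, v_cur, 1, v_next, 1);
            c = -beta[j - 1];   cublasDaxpy(h, n, &c, v_prev, 1, v_next, 1);
            c = 1.0 / beta[j];  cublasDscal(h, n, &c, v_next, 1);     // divide by beta_{j+1}
            c = gamma[j];       cublasDaxpy(h, n, &c, v_next, 1, x, 1);
            double *t = v_prev; v_prev = v_cur; v_cur = v_next; v_next = t;  // shift vectors
        }
    }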

6 Experimental results

The experiments were conducted on Cascade, a GPU cluster with 8 compute nodes at the Minnesota Supercomputing Institute (MSI). Each node consists of dual Intel Xeon X5675 processors (12 MB cache, 3.06 GHz, 6 cores), 96 GiB of main memory, and 4 NVIDIA TESLA M2070 GPUs (448 cores, 1.15 GHz, 3 GB memory).


    Table 1 Matrices used for experiments

Matrix       N           NNZ          NNZ/row

sherman3     5,005       20,033        4.0
memplus      17,758      126,150       7.1
msc23052     23,052      1,154,814    50.1
bcsstk36     23,052      1,143,140    49.6
rma10        46,835      2,374,001    50.7
dubcova2     65,025      1,030,225    15.8
cfd1         70,656      1,828,364    25.9
poisson3Db   85,623      2,374,949    27.7
cfd2         123,440     3,087,898    25.0
boneS01      127,224     6,715,152    52.8
majorbasis   160,000     1,750,416    10.9
copcase1     208,340     1,382,902     6.6
pwtk         217,918     11,634,424   53.4
af_shell8    504,855     17,588,875   34.8
parafem      525,825     3,674,625     7.0
ecology2     999,999     4,995,991     5.0
lap7pt       1,000,000   6,940,000     6.9
spe10        1,094,421   7,598,799     6.9
thermal2     1,228,045   8,580,313     7.0
atmosmodd    1,270,432   8,814,880     6.9
atmosmodl    1,489,752   10,319,760    6.9

In the following tests, we compare the performance of kernel functions and iterative solvers between a single CPU and a single GPU. The CUDA programs were compiled by the NVIDIA CUDA compiler (nvcc) with the flag -arch sm_20, and the CPU programs were compiled by g++ at the -O3 optimization level. CUDA Toolkit 4.1 and the Intel Math Kernel Library (MKL) 10.2 were used for programming, where Intel MKL is threaded with OpenMP to use multiple cores. For the CUDA kernel configuration, the thread blocks are one-dimensional with a fixed size of 256 threads, and the maximum number of threads is hard-coded to the maximum number of resident threads, which is 14 × 1536.

For the iterative linear solvers, the stopping criterion for convergence is that the relative residual be at most 10^{-6}. The maximum number of iterations allowed is 1,000. The linear systems used are taken from the University of Florida sparse matrix collection [10], and a few cases from reservoir simulations were also used. The size, number of nonzeros, and average number of nonzeros per row of each matrix are tabulated in Table 1.


    Table 2 CPU/GPU SpMV kernels performance

Matrix       MKL-CSR   GPU-CSR-scalar   GPU-CSR-vector   GPU-JAD   GPU-HYB   DIA

sherman3     3.47      3.67             1.76             4.29      1.03      5.28
memplus      2.55      0.95             2.89             0.43      4.19      -
msc23052     8.02      1.63             9.32             8.05      9.73      -
bcsstk36     9.82      1.63             9.36             8.03      8.67      -
rma10        3.80      1.72             10.19            12.61     8.48      -
dubcova2     4.88      1.97             5.97             11.36     8.61      -
cfd1         4.70      1.73             8.61             12.91     12.33     -
poisson3Db   1.97      1.28             5.70             6.57      5.88      -
cfd2         2.88      1.76             8.52             11.95     12.18     -
boneS01      3.16      1.82             10.77            8.69      10.58     -
majorbasis   2.92      2.67             4.81             11.70     11.54     13.06
copcase1     4.12      3.48             5.12             5.63      7.11      -
pwtk         3.08      1.69             10.33            13.80     13.24     -
af_shell8    3.13      1.54             10.34            14.56     14.27     -
parafem      2.30      3.15             4.42             8.01      7.78      -
ecology2     2.06      8.37             6.72             12.04     12.11     13.33
lap7pt       2.59      3.92             4.66             11.58     12.44     18.70
spe10        2.49      4.06             4.63             8.66      10.96     -
thermal2     1.76      3.05             3.55             5.47      7.33      -
atmosmodd    2.09      3.78             4.69             10.89     10.97     16.03
atmosmodl    2.25      3.74             4.71             10.77     10.61     14.65

    6.1 SpMV kernels

We first compare the performance of CPU and GPU SpMV kernels for double precision floating point arithmetic. On the CPU, the CSR format SpMV kernel is obtained from the routine mkl_dcsrgemv. The GPU SpMV kernels use two CSR format variants, CSR-scalar and CSR-vector, a JAD format implementation, and the HYB format kernel from the NVIDIA CUSPARSE library [20]. Each SpMV kernel is executed 100 times for each matrix, and the execution speed is measured in GFLOPS, as shown in Table 2. According to the experimental results, the CSR-vector kernel generally performs better than the CSR-scalar kernel, especially for matrices which have larger average numbers of nonzeros per row (cf., e.g., msc23052, boneS01). Due to a better memory access pattern, the JAD format kernel yields a higher throughput than the two CSR format kernels. The HYB (ELL/COO) format achieves the best performance when the ELL part contains almost all the nonzero elements. For most cases, the JAD format kernel is competitive with or even outperforms the HYB format kernel. Compared to the MKL CSR SpMV kernel, the speed-up of the GPU SpMV kernels can reach a factor of 8.

For diagonally structured matrices, in which all nonzeros lie in a small number of diagonals, the SpMV kernel in DIA format can be more efficient than general-purpose sparse matrix formats like the CSR or JAD formats, since indirect addressing is avoided.


    Table 3 Triangular solve with level scheduling

Matrix       MKL       GPU-Row   GPU-Col   GPU-Lev             GPU-MMD-Lev
             MFLOPS    MFLOPS    MFLOPS    #lev      MFLOPS    #lev    MFLOPS

sherman3     439.2     3.5       3.6       50        79.3      9       338.7
memplus      699.1     5.4       5.9       174       77.7      147     84.2
msc23052     953.9     21.3      25.2      192       649.8     186     704.8
bcsstk36     1001.9    19.9      23.1      4,457     41.6      274     461.3
rma10        838.3     21.6      29.2      7,785     47.0      765     324.6
dubcova2     755.7     10.1      14.2      1,634     103.0     44      1519.3
cfd1         819.4     15.4      18.5      4,340     80.0      637     365.2
poisson3Db   455.9     13.1      16.6      136       945.1     191     992.5
cfd2         841.2     15.2      21.1      4,357     128.4     494     703.8
boneS01      887.0     21.9      28.2      813       997.8     480     1207.2
majorbasis   718.0     7.78      6.8       600       463.2     436     573.6
copcase1     458.9     4.7       6.9       272       526.3     34      936.8
pwtk         946.5     20.4      26.1      142,168   14.6      559     1615.7
af_shell8    919.6     17.1      21.8      3,725     635.4     550     1699.6
parafem      562.1     5.2       6.1       7         1333.2    20      1166.9
ecology2     450.8     3.6       5.3       1,999     354.8     5       1086.3
lap7pt       556.9     4.9       3.9       298       659.4     6       1407.3
spe10        574.0     4.9       3.8       622       484.6     349     675.1
thermal2     361.1     5.1       6.7       1,239     587.1     32      939.3
atmosmodd    564.2     4.9       5.5       352       668.0     6       1403.0
atmosmodl    548.8     4.9       5.8       432       661.8     6       1425.9

The performance is shown in Table 2, from which it is clear that the SpMV kernel in DIA format is faster than the CSR and JAD formats for these matrices.

The time of transferring data between the CPU and the GPU is not included in calculating the speed. This is reasonable, since in the iterative solvers the sparse matrices and vectors do not need to be transferred back and forth in each iteration. This transfer takes place only twice: at the beginning, for the initial data, and at the end, when the solution is copied back.

    6.2 Sparse triangular solve on GPUs

Table 3 shows the performance on the CPU and the GPU (measured in MFLOPS) of solving sparse triangular systems obtained from the ILU(0) factorization, together with the numbers of levels of the triangular systems when using level scheduling. The triangular solver on the CPU is obtained from the routine mkl_dcsrtrsv. The performance of the row/column versions of the triangular solves on the GPU, in which only the dot products or the saxpy operations are parallelized, is extremely low, reaching only up to 30 MFLOPS, whereas it can reach hundreds of MFLOPS on the CPU.


    Table 4 Triangular solve of MC-ILU(0) factors

Matrix       GPU-GFLOPS   CPU-GFLOPS

poisson3Db   3.2          0.57
majorbasis   2.8          1.13
atmosmodd    3.1          1.00
atmosmodl    3.0          0.98

For matrices which have small numbers of levels, the GPU performance obtained with level scheduling can be significantly improved, to be comparable to the CPU or even higher in some cases. The last two columns of Table 3 show that MMD reordering can be used to reduce the number of levels, and thus improve the performance of sparse triangular solves on the GPU.

Much higher parallelism can be achieved for solving the special triangular systems obtained from the ILU(0) factorization of multicolor-reordered systems. For matrices which can be colored with a small number of colors, this particular triangular solve reaches much higher FLOP rates on GPUs. Table 4 shows the performance of these triangular solves for four examples on the GPU and the CPU.

    6.3 IC/MIC-CG method

In our implementation of the GPU-accelerated IC/MIC preconditioned CG method, the SpMV and level-1 BLAS operations are performed in parallel on the GPU, but the triangular solves of the preconditioning operations are executed on the CPU if they cannot be accelerated using level scheduling on the GPU.

Figures 5(a) and 5(b) show the time breakdown of the IC/MIC PCG method and the GPU-accelerated IC/MIC PCG method for the cases thermal2 and ecology2. As shown, the running time of the SpMVs and the BLAS-1 operations is reduced by factors of 4.5 and 2.9 on average, respectively. In the case thermal2, the triangular solves can be accelerated by the GPU using level scheduling. As shown in Fig. 5(a), the running time of the triangular solves is reduced by a factor of 3.4, so the overall speed-up of the iteration time is a factor of 3.3. However, in the case ecology2, the triangular solves are performed on the CPU. Therefore, as shown in Fig. 5(b), the extra cost of transferring the preconditioned vectors between the host and the device is now required. On the host, page-locked/mapped host memory, which has higher memory bandwidth, is used to store the preconditioned vectors. This reduced the data transfer time by 45%. Finally, the overall speed-up is a factor of 1.3, which reflects Amdahl's law, describing the performance improvement when only a fraction of the computation is parallelized.

Table 5 shows the performance for all SPD cases using the GPU-accelerated IC/MIC preconditioned CG method. The table lists where the triangular solves are computed, the preconditioner construction time, the fill-factor (the ratio of the number of nonzeros in the preconditioner to the number of nonzeros in the matrix), the number of iterations to converge, the iteration time, and the speed-up of the total solve time compared to the CPU counterpart.


Fig. 5 Time breakdown of the IC/MIC-CG method

    6.4 L-S polynomial-CG method

Table 6 shows the preconditioner construction time, the number of iterations, the iteration time, and the optimal degree of the least-squares polynomial for all cases. High-fill-in IC/MIC preconditioners can be expensive to build, whereas building the L-S polynomial preconditioner reduces to a few steps of the Lanczos algorithm, whose time is often small. By comparing Tables 5 and 6, we see that the construction of the least-squares polynomial preconditioner is less expensive than that of the IC/MIC preconditioner. The last column of Table 6 shows the speed-up compared to the GPU-accelerated IC/MIC preconditioned CG method in terms of total solve time.


    Table 5 GPU-IC-CG method

Matrix      Trsv on   Prec-time (s)   Fill-fact   #iters   Iter-time (s)   Speed-up over CPU-IC

msc23052    CPU       0.35            2.1         277      1.38            1.2
bcsstk36    CPU       0.44            2.7         130      0.79            1.2
dubcova2    CPU       0.34            2.6         29       0.19            1.2
cfd1        CPU       0.44            2.2         245      2.12            1.3
cfd2        CPU       0.75            2.4         172      2.69            1.4
boneS01     CPU       1.93            2.2         252      7.89            1.4
af_shell8   CPU       3.73            2.4         105      9.03            1.5
parafem     GPU       1.35            2.7         188      5.28            1.6
ecology2    CPU       0.48            2.3         561      20.34           1.3
thermal2    GPU       3.80            2.3         329      14.19           2.9

    Table 6 L-S polynomial preconditioner

Matrix      Prec-time (s)   #iters   Iter-time (s)   Polyn-deg   Speed-up over GPU-IC

msc23052    0.02            98       1.39            40          1.2
bcsstk36    0.02            66       1.08            60          1.1
dubcova2    0.07            17       0.06            10          4.1
cfd1        0.05            58       0.77            30          3.1
cfd2        0.09            20       1.25            80          2.6
boneS01     0.11            235      4.70            10          2.1
af_shell8   0.34            68       4.78            20          2.5
parafem     0.31            41       2.19            30          2.7
ecology2    0.57            323      16.40           20          1.3
thermal2    0.77            147      15.41           20          1.1

As shown, the L-S polynomial preconditioned CG method can reduce the solve time by up to a factor of 4.

Table 7 shows the performance of the PCG methods for solving a 2D/3D Poisson equation. The first row of Table 7 corresponds to the IC/MIC preconditioned CG method running on the CPU; the remaining rows show the GPU-accelerated PCG methods with different preconditioners. By comparing the rows of Table 7, we can see the GPU acceleration of the PCG methods and the performance differences between the preconditioners.

    6.5 ILUT-GMRES method

We now discuss the performance of the GPU-accelerated ILUT preconditioned GMRES method. The ILUT preconditioner is implemented using the dual threshold method suggested in [25]. As in the IC-PCG method, SpMV and level-1 BLAS operations are performed on the GPU, whereas the triangular solves are executed on the CPU if they cannot be accelerated by the GPU using level scheduling.


Table 7 Performance comparison of all preconditioners

Precon (Lap2d)     #iters   Total time (s)     Precon (Lap3d)     #iters   Total time (s)

CPU-IC             120      9.12               CPU-IC             47       3.40
GPU-IC             120      5.85               GPU-IC             47       2.12
GPU-LS-POLY(20)    91       3.23               GPU-LS-POLY(15)    16       1.08
GPU-MC-IC(0)       738      5.56               GPU-MC-IC(0)       101      0.94
GPU-MC-SSOR(3)     407      9.01               GPU-MC-SSOR(5)     44       1.81
GPU-BJ(8)          326      10.33              GPU-BJ(8)          96       3.39

    Table 8 GPU-ILUT-GMRES method

Matrix       Trsv on   Prec-time (s)   Fill-fact   #iters   Iter-time (s)   Speed-up over CPU-ILUT

sherman3     CPU       0.01            2.1         62       0.19            0.3
memplus      CPU       0.02            1.1         65       0.22            1.3
poisson3Db   GPU       1.31            2.3         44       0.98            1.5
majorbasis   CPU       0.22            1.9         4        0.04            1.3
copcase1     GPU       0.16            1.6         20       0.17            2.1
spe10        CPU       0.69            1.8         44       2.16            1.6
atmosmodd    GPU       0.95            1.6         119      5.05            2.5
atmosmodl    GPU       1.15            1.7         55       2.64            3.5

The Krylov subspace dimension is set to 40. Table 8 shows, for all nonsymmetric cases solved with the GPU-accelerated ILUT preconditioned GMRES method, where the triangular solves are computed, the time of building the preconditioner, the fill-factor, the number of iterations to converge, the iteration time, and the speed-up of the total solve time compared to the CPU counterpart.

    6.6 MC-SSOR/ILU(0)-GMRES method

For matrices which are colorable with a small number of colors, MC-SSOR preconditioners can be more efficient, since the preconditioning operation is highly parallel and its convergence rate is comparable to that of ILU preconditioners. It is also much less expensive to build: the time of building the multicolor SSOR preconditioner is essentially the time of the multicolor reordering, which is typically small. Table 9 shows the number of colors, the optimal k value of the MC-SSOR(k) preconditioner, the number of iterations required, the preconditioner construction time, and the iteration time for solving these linear systems using the GPU-accelerated MC-SSOR preconditioned GMRES method.

Table 10 compares the number of iterations and the iteration time required by the MC-ILU(0) preconditioned GMRES method and by the standard ILU(0) preconditioned GMRES method on the GPU.


Table 9 MC-SSOR-GMRES method

Matrix       Prec-time (s)   #iters   Iter-time (s)   #colors   k

sherman3     0.00            40       0.12            4         15
memplus      0.00            54       0.18            4         5
poisson3Db   0.01            39       0.66            17        5
majorbasis   0.00            7        0.04            6         2
copcase1     0.00            56       0.50            11        2
spe10        0.02            93       5.37            7         5
atmosmodd    0.02            69       3.89            2         5
atmosmodl    0.02            33       2.01            2         5

    Table 10 MC-ILU(0)/ILU(0)-GMRES method

             MC-ILU(0)                                ILU(0)
Matrix       Prec-time (s)   #iters   Iter-time (s)   Prec-time (s)   #iters   Iter-time (s)

sherman3     0.00            312      0.99            0.00            92       0.31
memplus      0.00            146      0.48            0.02            138      0.87
poisson3Db   0.26            107      0.60            0.39            106      0.97
majorbasis   0.05            12       0.04            0.07            9        0.10
copcase1     0.05            62       0.36            0.07            38       0.42
spe10        0.19            247      5.09            0.31            157      7.03
atmosmodd    0.17            210      4.38            0.19            137      5.52
atmosmodl    0.20            105      2.46            0.31            63       3.04

    Table 11 MC-SSOR/ILU(0) with sparsification

Matrix    Precon      Sparsify   #colors   #iters   Total-time (s)

memplus   MC-SSOR     No         97        60       0.94
memplus   MC-SSOR     Yes        3         47       0.19
memplus   MC-ILU(0)   No         97        146      0.77
memplus   MC-ILU(0)   Yes        3         202      0.58
spe10     MC-SSOR     No         340       93       11.03
spe10     MC-SSOR     Yes        7         93       5.37
spe10     MC-ILU(0)   No         340       274      7.50
spe10     MC-ILU(0)   Yes        7         247      5.09

Compared to the standard ILU(0) preconditioner, the MC-ILU(0) preconditioned GMRES method needs more iterations to achieve convergence. However, the overall iteration time is reduced, since the triangular solves of the multicolored systems are more efficient.


    Table 12 SPE10: block Jacobi ILU-GMRES method

#Blocks   BLU MFLOPS   Prec-time (s)   #iters   Iter-time (s)

4         334          0.86            81       7.71
8         546          0.86            102      6.34
16        623          0.84            121      6.33
32        727          0.81            119      5.88
64        750          0.80            122      6.18
128       755          0.79            155      7.97

For matrices which cannot be colored with a small number of colors, the sparsification method discussed in Sect. 5.3 can be invoked to reduce the number of colors. This sparsification method was applied to two cases to improve the efficiency of the MC-SSOR/ILU(0) preconditioners. The numbers of colors before and after sparsification, the numbers of iterations, and the iteration times are tabulated in Table 11. As shown, the numbers of colors are effectively reduced and the performance is improved.

    6.7 Block Jacobi-GMRES method

When using block Jacobi preconditioning, the matrix is first partitioned by some graph partitioning algorithm like METIS [15] and reordered. Then an ILUT factorization is performed on each diagonal block. The preconditioning operation, the block L/U solves, is executed on the GPU. Table 12 shows the speed of the block L/U solve (measured in MFLOPS), the time of building the ILU preconditioners for the diagonal blocks, the number of iterations, and the iteration time.

From the results shown in the table, we can see that the speed of the block L/U solve increases as the number of blocks grows, because more parallelism is available in this computation. However, the number of iterations required also increases, since more entries are dropped outside the diagonal blocks; thus, the preconditioner becomes weaker as the number of blocks grows. Finally, for this case, the best number of blocks in terms of iteration time is 32.

    7 Conclusion

In this work, we have explored high-performance sparse matrix-vector product (SpMV) kernels in different formats on current many-core platforms, and used them to construct effective iterative linear solvers with several preconditioner options. If a triangular solve cannot be accelerated using level scheduling on the GPU, this computation is performed by the CPU. With this hybrid CPU/GPU computation, the IC preconditioned CG method and the ILU preconditioned GMRES method are adapted to a GPU environment and achieve performance gains compared to their CPU counterparts. In order to achieve better GPU acceleration, a few preconditioners with enhanced parallelism were considered. Among these, the block Jacobi preconditioner is the simplest, but it is usually not efficient, since it requires many iterations to converge.


For matrices which can be colored with a small number of colors, multicolor SSOR/ILU(0) preconditioners can provide better performance, since their preconditioning operations can be executed with high parallelism and they yield convergence rates comparable to the standard ILU preconditioners. For symmetric linear systems, the least-squares polynomial preconditioner can be much more efficient for the PCG method, since its preconditioning operation relies on the SpMV kernel.

Based on the experimental results reported here and elsewhere, we observe that, when used as general-purpose many-core processors, current GPUs provide a much lower performance advantage for irregular (sparse) computations than for more regular (dense) computations. However, when used carefully, GPUs can still be beneficial as coprocessors to CPUs to speed up complex computations. We highlighted a few alternatives to standard approaches in the arena of iterative sparse linear system solvers. Future work should take a careful look at the GPU kernels using the CUDA Visual Profiler to find potential throughput bottlenecks. We should point out that all the test cases fit in the memory of a single GPU. Therefore, a natural next step is to adapt the current GPU-accelerated iterative linear solvers to a multi-GPU environment to solve larger-scale problems and test the scalability. It also remains to explore other preconditioning techniques that have good potential on GPUs, such as approximate inverse preconditioners; see, e.g., [26].

Acknowledgements  This work is supported by DOE under grant DE-FG 08ER 25841 and by the Minnesota Supercomputer Institute.

    References

1. Agarwal A, Levy M (2007) The kill rule for multicore. In: DAC '07: Proceedings of the 44th annual design automation conference, New York, NY, USA. ACM, New York, pp 750-753
2. Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P, Tomov S (2009) Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J Phys Conf Ser 180(1):012037
3. Ament M, Knittel G, Weiskopf D, Strasser W (2010) A parallel preconditioned conjugate gradient solver for the Poisson problem on a multi-GPU platform. In: PDP '10: Proceedings of the 2010 18th Euromicro conference on parallel, distributed and network-based processing, Washington, DC, USA. IEEE Computer Society, Los Alamitos, pp 583-592
4. Baskaran MM, Bordawekar R (2008) Optimizing sparse matrix-vector multiplication on GPUs. Tech report, IBM Research
5. Bell N, Garland M (2009) Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: SC '09: Proceedings of the conference on high performance computing networking, storage and analysis, New York, NY, USA. ACM, New York, pp 1-11
6. Bell N, Garland M (2010) Cusp: generic parallel algorithms for sparse matrix and graph computations. Version 0.1.0
7. Bolz J, Farmer I, Grinspun E, Schröder P (2003) Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans Graph 22(3):917-924
8. Choi JW, Singh A, Vuduc RW (2010) Model-driven autotuning of sparse matrix-vector multiply on GPUs. ACM SIGPLAN Not 45:115-126
9. Davis PJ (1963) Interpolation and approximation. Blaisdell, Waltham
10. Davis TA (1994) University of Florida sparse matrix collection. NA Digest
11. Erhel J, Guyomarc'h F, Saad Y (2001) Least-squares polynomial filters for ill-conditioned linear systems. Tech report umsi-2001-32, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN
12. George A, Liu JWH (1989) The evolution of the minimum degree ordering algorithm. SIAM Rev 31(1):1-19
13. Georgescu S, Okuda H (2007) Conjugate gradients on graphic hardware: performance & feasibility
14. Gupta R (2009) A GPU implementation of a bubbly flow solver. Master's thesis, Delft Institute of Applied Mathematics, Delft University of Technology, 2628 BL, Delft, The Netherlands
15. Karypis G, Kumar V (1998) METIS: a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0. Tech report, University of Minnesota, Department of Computer Science/Army HPC Research Center
16. Lanczos C (1950) An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Natl Bur Stand 45:255-282
17. Monakov A, Avetisyan A (2009) Implementing blocked sparse matrix-vector multiplication on NVIDIA GPUs. In: Bertels K, Dimopoulos N, Silvano C, Wong S (eds) Embedded computer systems: architectures, modeling, and simulation. Lecture notes in computer science, vol 5657. Springer, Berlin, pp 289-297
18. Monakov A, Lokhmotov A, Avetisyan A (2010) Automatically tuning sparse matrix-vector multiplication for GPU architectures. In: Patt Y, Foglia P, Duesterwald E, Faraboschi P, Martorell X (eds) High performance embedded architectures and compilers. Lecture notes in computer science, vol 5952. Springer, Berlin, pp 111-125
19. NVIDIA (2012) CUBLAS library user guide 4.2
20. NVIDIA (2012) CUDA CUSPARSE library
21. NVIDIA (2012) NVIDIA CUDA C programming guide 4.2
22. Oberhuber T, Suzuki A, Vacata J (2010) New row-grouped CSR format for storing the sparse matrices on GPU with implementation in CUDA. CoRR abs/1012.2270
23. Robert Y (1982) Regular incomplete factorizations of real positive definite matrices. Linear Algebra Appl 48:105-117
24. Saad Y (1990) SPARSKIT: a basic tool kit for sparse matrix computations. Tech report RIACS-90-20, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA
25. Saad Y (1994) ILUT: a dual threshold incomplete LU factorization. Numer Linear Algebra Appl 1:387-402
26. Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
27. Sengupta S, Harris M, Zhang Y, Owens JD (2007) Scan primitives for GPU computing. In: Graphics hardware 2007. ACM, New York, pp 97-106
28. Sudan H, Klie H, Li R, Saad Y (2010) High performance manycore solvers for reservoir simulation. In: 12th European conference on the mathematics of oil recovery
29. Vázquez F, Garzón EM, Martínez JA, Fernández JJ (2009) The sparse matrix vector product on GPUs. Tech report, Department of Computer Architecture and Electronics, University of Almería
30. Volkov V, Demmel J (2008) LU, QR and Cholesky factorizations using vector capabilities of GPUs. Tech report, Computer Science Division, University of California at Berkeley
31. Wang M, Klie H, Parashar M, Sudan H (2009) Solving sparse linear systems on NVIDIA Tesla GPUs. In: ICCS '09: Proceedings of the 9th international conference on computational science. Springer, Berlin, pp 864-873
32. Williams S, Bell N, Choi JW, Garland M, Oliker L, Vuduc R (2010) Scientific computing with multicore and accelerators. CRC Press, Boca Raton, pp 83-109, Chap 5
33. Zhou Y, Saad Y, Tiago ML, Chelikowsky JR (2006) Parallel self-consistent-field calculations via Chebyshev-filtered subspace acceleration. Phys Rev E 74:066704
