
A Parallel Geometric Multifrontal Solver Using Hierarchically Semiseparable Structure

SHEN WANG, Department of Mathematics, Purdue University

XIAOYE S. LI, Lawrence Berkeley National Laboratory

FRANÇOIS-HENRY ROUET, Lawrence Berkeley National Laboratory

JIANLIN XIA, Department of Mathematics, Purdue University

MAARTEN V. DE HOOP, Department of Mathematics, Purdue University

We present a structured parallel geometry-based multifrontal sparse solver using hierarchically semiseparable (HSS) representations and exploiting the inherent low-rank structures. Parallel strategies for nested dissection ordering (taking low-rankness into account), symbolic factorization, and structured numerical factorization are shown. In particular, we demonstrate how to manage two layers of tree parallelism to integrate parallel HSS operations within the parallel multifrontal sparse factorization. Such a structured multifrontal factorization algorithm can be shown to have asymptotically lower complexities in both operation counts and memory than the conventional factorization algorithms for certain partial differential equations. We present numerical results from the solution of the anisotropic Helmholtz equations for seismic imaging, and demonstrate that our new solver was able to solve 3D problems up to 600^3 mesh size, with 216M degrees of freedom in the linear system. For this specific model problem, our solver is both faster and more memory efficient than a geometry-based multifrontal solver (which is further faster than general-purpose algebraic solvers such as MUMPS and SuperLU_DIST). For the 600^3 mesh size, the structured factors from our solver need about 5.9 times less memory.

Categories and Subject Descriptors: Mathematics of computing [Mathematical software]: Solvers

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Sparse Gaussian elimination, multifrontal method, HSS matrices, parallel algorithm

ACM Reference Format:

Shen Wang, Xiaoye S. Li, François-Henry Rouet, Jianlin Xia, and Maarten V. de Hoop, 2013. A Parallel Geometric Multifrontal Solver Using Hierarchically Semiseparable Structure. ACM Trans. Math. Softw., Article (May 2013), 21 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

1. INTRODUCTION

In many computational and engineering problems it is critical to solve large sparse linear systems of equations. Direct methods are attractive due to their reliability and generality, and are especially suitable for systems with different right-hand sides. However, it is prohibitively expensive to use direct methods for large-scale 3D problems, due to their superlinear complexity of memory requirement and operation count. A potential avenue to develop fast and memory-efficient direct solvers is to exploit certain structures in the problems.

Authors' addresses: S. Wang, J. Xia, and M. V. de Hoop: Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA; X. S. Li and F.-H. Rouet: Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
© 2013 ACM 0098-3500/2013/05-ART $15.00. DOI: http://dx.doi.org/10.1145/0000000.0000000


Different structures come from different natures of the underlying problems or different discretization and computation techniques. In recent years, rank structured matrices have been investigated extensively due to their potential in accelerating the solutions of various partial differential equations and integral equations. Several useful rank structured matrix representations have been developed, such as H-matrices [Hackbusch 1999; Hackbusch and Khoromskij 2000; Hackbusch et al. 2002], H2-matrices [Borm et al. 2003; Borm and Hackbusch 2001; Hackbusch et al. 2000], quasiseparable matrices [Bella et al. 2008; Eidelman and Gohberg 1999], semiseparable matrices [Chandrasekaran et al. 2005; Vandebril et al. 2005], and multilevel low-rank structures [Li and Saad 2013].

In this paper, we develop a new class of parallel structured sparse factorization methods

exploiting numerically low-rank structures using hierarchically semiseparable (HSS) matrices [Chandrasekaran et al. 2006; Xia et al. 2010]. The novelty of the HSS-structured solver is to apply the parallel HSS techniques in [Wang et al. 2013] to the intermediate dense submatrices that appear in the parallel sparse factorization methods. Specifically, we consider the multifrontal factorization method [Duff and Reid 1983] in this paper. We investigate several important aspects, such as parallel nested dissection that preserves the geometry and benefits the low-rankness, the symbolic factorization, the integration of the HSS tree parallelism within the outer multifrontal tree parallelism, and how the rank properties behave and benefit the complexities.

The resulting HSS-structured factorization can be used as a direct solver or preconditioner,

depending on the application's accuracy requirement and the characteristics of the PDEs. The implementation that we present in the experimental section is restricted to the solution of problems on regular grids. For some 3D model problems and broader classes of PDEs, it was shown that the HSS-structured multifrontal method costs O(n^{4/3} log n) flops [Xia 2013]. This complexity is much lower than the O(n^2) cost of the traditional exact multifrontal method. The theoretical memory count is O(n log n). Numerical tests indicate that our structured parallel solver is faster and needs less memory than a standard geometric multifrontal solver (which is further faster than general-purpose algebraic solvers such as MUMPS and SuperLU_DIST). This new class of HSS-structured factorizations can be applied to much broader classes of discretized PDEs (including non-self-adjoint and indefinite ones), aiming towards optimal complexity preconditioners.

The rest of the paper is organized as follows. In Section 2 we review the multifrontal

factorization algorithm and the HSS structure. Section 3 analyzes the rank properties and the resulting complexities. Section 4 presents our new parallel HSS-structured multifrontal algorithm in detail. In Section 5, we demonstrate the parallel performance of our geometric, sparse multifrontal solver. Section 6 is devoted to the conclusions.

2. REVIEW OF THE MULTIFRONTAL AND HSS-STRUCTURED MULTIFRONTAL METHODS

In this section, we briefly review the multifrontal method, HSS representations, and HSS-structured multifrontal methods.

2.1. Multifrontal method with nested dissection ordering

The central idea of the multifrontal method is to reorganize the factorization of a large sparse matrix into the factorizations of many smaller dense matrices and partial updates to the Schur complements [Duff and Reid 1983; Liu 1992]. We now briefly recall the main ideas. We are to compute a factorization of a given matrix A, A = LU if the matrix is unsymmetric, or A = LDL^T if the matrix is symmetric. Without loss of generality, we assume that A is non-reducible. For a matrix A with an unsymmetric pattern, we assume that the factorization takes place with the nonzero structure of A + A^T, where the summation is structural. In this case, the multifrontal method relies on a structure called the elimination tree. A few


equivalent definitions are possible (we recommend the survey by Liu [Liu 1990]), and we use the following.

Definition 2.1. Assume A = LU, where A is a sparse, structurally symmetric, N × N matrix. Then the elimination tree of A is a tree of N nodes, with the ith node corresponding to the ith column of L and with the parent relation defined by

    parent(j) = min{ i : i > j and l_{ij} ≠ 0 },  for j = 1, ..., N − 1.
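For concreteness, the parent relation can be read directly off the nonzero pattern of L. The following minimal Python sketch (our own illustration, with a hypothetical name etree_from_pattern and a dense boolean pattern for simplicity) applies Definition 2.1 literally:

    import numpy as np

    def etree_from_pattern(L_pattern):
        """Elimination tree per Definition 2.1: parent(j) is the smallest i > j
        with l_ij nonzero; the root gets parent -1. L_pattern is an N x N
        boolean array holding the nonzero pattern of L."""
        N = L_pattern.shape[0]
        parent = np.full(N, -1)
        for j in range(N - 1):
            rows = np.nonzero(L_pattern[j + 1:, j])[0]   # rows i > j with l_ij != 0
            if rows.size > 0:
                parent[j] = j + 1 + rows[0]
        return parent

    # Toy example: an arrow-shaped pattern, where every column is connected
    # to the last row, so node N-1 is the root of the tree.
    pattern = np.eye(4, dtype=bool)
    pattern[3, :] = True
    print(etree_from_pattern(pattern))   # [ 3  3  3 -1]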

In practice, nodes are amalgamated: nodes that represent columns and rows of the factors with similar structures are grouped together in a single node. In the end, each node corresponds to a square dense matrix (referred to as a frontal matrix) with the following 2 × 2 block structure:

    F_j = [ F_11  F_12 ]    (1)
          [ F_21  F_22 ].

Factoring the matrix consists of a bottom-up traversal of the tree, following a topological order (a node is processed before its parent). Processing a node consists of:

— forming (or assembling) the frontal matrix: this is achieved by summing the rows and columns of A corresponding to the variables in the (1,1), (2,1) and (1,2) blocks, with the temporary data (update matrices) that have been produced by the child nodes; that is,

    F_j = A_j + Σ_{k: child of j} U_k ;    (2)

— eliminating the fully-summed variables in the (1,1) block F_11: this is done through a partial factorization of the frontal matrix, which produces the corresponding rows and columns of the factors stored in F_11, F_21 and F_12. At this step, the so-called Schur complement or contribution block is computed as U_j = F_22 − F_21 F_11^{-1} F_12 and stored in temporary memory; it will be used to form the frontal matrix associated with the parent node. Therefore, when a node is activated, it "consumes" the contribution blocks of its children.

In this process, the elimination step involves straightforward dense matrix operations, but the assembling step involves index manipulation and indirect addressing while summing up the U_k. For example, if the two children's update matrices

    U_{c_i} = [ a_i  b_i ]
              [ c_i  d_i ],   i = 1, 2,

have subscript sets {1, 2} and {1, 3}, respectively, then

    Σ_{i=1,2} U_{c_i} =
        [ a_1  b_1  0 ]   [ a_2  0  b_2 ]   [ a_1+a_2  b_1  b_2 ]
        [ c_1  d_1  0 ] + [ 0    0  0   ] = [ c_1      d_1  0   ].    (3)
        [ 0    0    0 ]   [ c_2  0  d_2 ]   [ c_2      0    d_2 ]

This summation operation is called extend-add; it aligns the subscript sets of two matrices by padding zero entries, and then adds the matrices. Denoting it by ↔, the relationship between frontal matrices and update matrices can be revealed by F_j = A_j ↔ U_{c_1} ↔ U_{c_2} ↔ ··· ↔ U_{c_q}, where nodes c_1, c_2, ..., c_q are the children of j.

consists of the frontal matrix being processed and a set of contribution blocks that are tem-porarily stored and will be consumed at a later step. The multifrontal method lends itselfvery naturally to parallelism since multiple processes can be employed to treat one, largeenough, frontal matrix or to process concurrently frontal matrices belonging to separatesubtrees. These two sources of parallelism are commonly referred to as node and tree paral-lelism, respectively, and their correct exploitation is the key to achieving high performanceon parallel supercomputers.


In the following section, we exploit additional rank structures in the frontal matrices to achieve better cost and memory efficiency.

2.2. Low-rank property and HSS structures

We briefly summarize the key concepts of HSS structures following the definitions and notation in [Xia 2012; Xia et al. 2010]. Let F be a general N × N real or complex matrix and I = {1, 2, ..., N} be the set of all row and column indices. Suppose T is a full binary tree with 2k − 1 nodes labeled i = 1, 2, ..., 2k − 1, such that the root node is 2k − 1 and the number of leaf nodes is k. Let T also be a postordered tree. That is, for each non-leaf node i of T, its left child c_1 and right child c_2 satisfy c_1 < c_2 < i. Let t_i ⊂ I be an index subset associated with each node i of T. We use F|_{t_i × t_j} to denote the submatrix of F with row index subset t_i and column index subset t_j.

HSS matrices are designed to take advantage of the low-rank property. In particular, when

the off-diagonal blocks of a matrix (with hierarchical partitioning) have small (numerical) ranks, they are represented or approximated hierarchically by compact forms. These compact forms at different hierarchical levels are also related through nested basis forms. This can be seen from the definition of an HSS form:

Definition 2.2. We say that F is in a postordered HSS form and T is the corresponding HSS tree if the following conditions are satisfied:

— t_{c_1} ∩ t_{c_2} = ∅ and t_{c_1} ∪ t_{c_2} = t_i for each non-leaf node i of T, where i has child nodes c_1 and c_2, and t_{2k−1} = I.
— There exist matrices D_i, U_i, R_i, B_i, W_i, V_i (or HSS generators) associated with each node i, such that, if i is a non-leaf node,

    D_{2k−1} = F,

    D_i = F|_{t_i × t_i} = [ D_{c_1}                    U_{c_1} B_{c_1} V_{c_2}^H ]    (4)
                           [ U_{c_2} B_{c_2} V_{c_1}^H  D_{c_2}                   ],

    U_i = [ U_{c_1} R_{c_1} ],   V_i = [ V_{c_1} W_{c_1} ],
          [ U_{c_2} R_{c_2} ]          [ V_{c_2} W_{c_2} ]

where the superscript H denotes the Hermitian transpose.

The HSS generators define the HSS form of F. The use of a postordered HSS tree enables us to use a single subscript (corresponding to the label of a node of T) for each HSS generator [Xia et al. 2010], instead of up to three subscripts as in [Chandrasekaran et al. 2006]. Figure 1 illustrates a block 8 × 8 HSS representation F. As a special example, its leading block 4 × 4 part looks like:

    F|_{t_7 × t_7} =

        [ [ D_1            U_1 B_1 V_2^H ]    [ U_1 R_1 ] B_3 [ W_4^H V_4^H   W_5^H V_5^H ] ]
        [ [ U_2 B_2 V_1^H  D_2           ]    [ U_2 R_2 ]                                   ]
        [                                                                                   ]
        [ [ U_4 R_4 ] B_6 [ W_1^H V_1^H   W_2^H V_2^H ]    [ D_4            U_4 B_4 V_5^H ] ]
        [ [ U_5 R_5 ]                                      [ U_5 B_5 V_4^H  D_5           ] ].
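To make Definition 2.2 concrete, the short numpy sketch below (our own illustration, with real-valued generators so the Hermitian transpose reduces to a plain transpose) assembles the leading block F|_{t_7 × t_7} above from randomly chosen generators; the nested bases U_3 = [U_1 R_1; U_2 R_2] and V_6 = [V_4 W_4; V_5 W_5] are formed explicitly, so the off-diagonal blocks have rank at most r by construction:

    import numpy as np

    rng = np.random.default_rng(0)
    m, r = 8, 2                                     # leaf block size and generator rank

    # Generators of the leaves 1, 2, 4, 5 (D, U, V) and their transfer (R, W)
    # and coupling (B) matrices; see Definition 2.2.
    D = {i: rng.standard_normal((m, m)) for i in (1, 2, 4, 5)}
    U = {i: rng.standard_normal((m, r)) for i in (1, 2, 4, 5)}
    V = {i: rng.standard_normal((m, r)) for i in (1, 2, 4, 5)}
    R = {i: rng.standard_normal((r, r)) for i in (1, 2, 4, 5)}
    W = {i: rng.standard_normal((r, r)) for i in (1, 2, 4, 5)}
    B = {i: rng.standard_normal((r, r)) for i in (1, 2, 3, 4, 5, 6)}

    def diag_block(c1, c2):
        # D_i for a non-leaf node i with children c1, c2, per Eqn. (4).
        return np.block([[D[c1], U[c1] @ B[c1] @ V[c2].T],
                         [U[c2] @ B[c2] @ V[c1].T, D[c2]]])

    D3, D6 = diag_block(1, 2), diag_block(4, 5)
    U3 = np.vstack([U[1] @ R[1], U[2] @ R[2]])      # nested row basis at node 3
    V3 = np.vstack([V[1] @ W[1], V[2] @ W[2]])
    U6 = np.vstack([U[4] @ R[4], U[5] @ R[5]])
    V6 = np.vstack([V[4] @ W[4], V[5] @ W[5]])      # nested column basis at node 6

    F_t7 = np.block([[D3, U3 @ B[3] @ V6.T],
                     [U6 @ B[6] @ V3.T, D6]])
    print(F_t7.shape, np.linalg.matrix_rank(U3 @ B[3] @ V6.T))   # (32, 32) 2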

For each diagonal block D_i = F|_{t_i × t_i} associated with each node i of T, we define F_i^− = F|_{t_i × (I∖t_i)} to be the HSS block row, and F_i^| = F|_{(I∖t_i) × t_i} to be the HSS block column. They are both called HSS blocks. The maximum rank r (or numerical rank r for a given tolerance) of all the HSS blocks is called the HSS rank of F. If r is small as compared with the matrix size, we say that F has a low-rank property.

Once the matrix is put into the HSS format, the transformed linear system can be efficiently solved with the ULV-type factorization and solution algorithms [Chandrasekaran et al. 2006; Xia et al. 2010].


[Figure 1 appears here: (a) a block 8 × 8 HSS matrix, with diagonal blocks D_1, D_2, D_4, D_5, D_8, D_9, D_11, D_12 and off-diagonal blocks such as U_3 B_3 V_6^H and U_6 B_6 V_3^H, where U_7 = [U_3 R_3; U_6 R_6]; (b) the corresponding HSS tree with nodes 1-15.]

Fig. 1. Pictorial illustrations of a block 8 × 8 HSS form and the corresponding HSS tree T.

The floating point operations associated with the HSS construction, ULV factorization, and solution are O(rN^2), O(r^2 N) and O(rN), respectively. All these are much smaller than the O(N^3) operations required by the traditional dense LU factorization algorithms for solving such linear systems. We described efficient parallel algorithms for these HSS operations on dense matrices in a previous work [Wang et al. 2013].

2.3. HSS-structured multifrontal methods

Recall that with the nested dissection ordering, there is a one-to-one correspondence between a separator and a frontal matrix F_j. The frontal matrices are dense, and their computations contribute to the dominant terms in the costs for computing the factors and the storage for the factors. Employing data-sparse compression for these frontal matrices can drastically lower the overall factorization cost and memory. (Section 3 contains the theoretical justification.)

Thus, we follow the algorithms in [Xia 2013; Xia et al. 2009] to use HSS matrices to

approximate the frontal matrices F_j. This generally involves the four steps shown in Algorithm 1. Steps (b) and (c) can be performed conveniently via the HSS algorithms in [Xia 2013; Xia et al. 2010]. For step (d), ideally, we would like to have a fast procedure which assembles the HSS forms of the contribution blocks directly into the HSS form of the parent frontal matrix in step (a), as shown in Figure 2(a). Unfortunately, this involves complicated HSS structured operations, because the matrix indices (hence the HSS trees) of the contribution blocks are different between the siblings, and also different from the parent. Directly performing HSS-based extend-add requires tree rotations, splitting and merging operations. This was implemented in a 2D code [Xia et al. 2009], but it may not be easily generalized to a general algebraic solver or a parallel one. Therefore, we resort to a simpler alternative method in [Xia 2013], in which we keep the contribution block as a normal dense matrix and the extend-add operation is done as in the classical (full-rank) multifrontal factorization. This is depicted in Figure 2(b), and we call this algorithm partially structured. Compared to a fully structured algorithm, the added cost is the decompression (HSS-to-dense) needed to compute the contribution block and the recompression (dense-to-HSS) for the parent frontal matrix.

The extend-add procedure with HSS and HSS tree manipulations (merge, rotation, etc.) is

very sophisticated. The parallel implementation is even more difficult. This current partially structured version, sequential [Xia 2013] or parallel (this paper), makes the implementation much more convenient. Moreover, we would like to emphasize that this simplified version only incurs extra memory for the intermediate frontal/update matrices during the factorization. Such memory is temporary. The final factors are still structured, just like when we use structured extend-add in a fully HSS-structured algorithm. See Figure 2.


ALGORITHM 1: The HSS-structured multifrontal algorithm.

(a) Construct an HSS approximation to F_j.
(b) Perform a partial ULV factorization of F_11.
(c) Compute the contribution block U_j = F_22 − F_21 F_11^{-1} F_12 via a low-rank update.
(d) Assemble U_j into the parent frontal matrix (a.k.a. extend-add).

[Figure 2 appears here: (a) fully structured frontal factorization and extend-add; (b) partially structured frontal factorization that yields fully structured factors.]

Fig. 2. Partially structured multifrontal method in [Xia 2013] that still yields fully structured factors, since the frontal and update matrices are only used temporarily.

The overall factorization complexity is roughly the same as that of the fully structured version (increased by just a factor of O(log n)) [Xia 2013]. In addition, the complicated structured extend-add operation may not necessarily be faster than the dense one in practical parallel computations. In practice, this compromised design can lead to an efficient parallel algorithm and implementation.

2.4. Other rank-structured multifrontal approaches

Other forms of low-rank structures have been embedded in multifrontal approaches. One of them is the Hierarchically Off-Diagonal Low-Rank (HODLR) format [Aminfar et al. 2014]. In this approach, the partitioning of the matrix and the underlying tree is the same as what we use in the HSS format. As in the HSS format, the off-diagonal blocks of the input matrix are compressed, and, recursively, the diagonal blocks are partitioned and their off-diagonal blocks are compressed, i.e.,

    D_{2k−1} = F,  and  D_i = F|_{t_i × t_i} = [ D_{c_1}                    U_{c_1} B_{c_1} V_{c_2}^H ]
                                               [ U_{c_2} B_{c_2} V_{c_1}^H  D_{c_2}                   ].

However, contrary to the HSS format, the HODLR format does not rely on a nested basis property; i.e., it computes U_i and V_i from scratch at each node i, without using the following relation:

    U_i = [ U_{c_1} R_{c_1} ],   V_i = [ V_{c_1} W_{c_1} ].
          [ U_{c_2} R_{c_2} ]          [ V_{c_2} W_{c_2} ]

The HODLR approach has been embedded in a multifrontal solver for solving elliptic PDEs [Aminfar and Darve 2014]. However, at the time of writing, no parallel implementation is publicly available.

Another low-rank representation that has been implemented in a multifrontal solver is the Block Low-Rank (BLR) format [Weisbecker 2013; Amestoy et al. 2015]. In this approach, a


dense matrix is partitioned using a simple 2D blocking, and every block is compressed independently, as shown in Figure 3. The BLR format has been implemented in the MUMPS multifrontal solver [Amestoy et al. 2015]. We have been communicating with the group working on Block Low-Rank (BLR) techniques to compare their approach with ours. Experiments carried out for the 3D Helmholtz problem described in this paper show that the HSS-based approach exhibits a better asymptotic complexity but also has a larger prefactor and a slightly lower flop rate than BLR [Weisbecker 2013]. Therefore, it is advantageous to use the HSS technique presented here for very large problems and larger machines, while using BLR might pay off for small or medium-size grids. Extending this comparison to a larger set of problems and applications is work in progress.

[Figure 3 appears here: (a) HSS matrix; (b) HODLR matrix; (c) BLR matrix.]

Fig. 3. Pictorial illustrations of an HSS matrix, HODLR matrix, and BLR matrix.

3. RANK PROPERTIES AND OPERATION AND MEMORY COMPLEXITIES

The complexity of HSS-structured multifrontal algorithms relies on the off-diagonal numerical ranks of the frontal matrices. The theoretical analysis of the off-diagonal rank bounds is still limited. Two recent studies on the intermediate Schur complements in the factorizations of certain discretized PDEs are closely related to our method. In the first study [Chandrasekaran et al. 2010], it is shown that, for the finite-difference matrices discretized from constant coefficient elliptic PDEs, the off-diagonal numerical ranks of the intermediate Schur complements are bounded by a constant for 2D problems, but grow with the number of mesh points along one side of the domain for 3D. In another study [Engquist and Ying 2011], Engquist and Ying study the Helmholtz equation with constant velocity field c(x) = 1. They prove that the rank grows with log k for 2D, where k is the size of one side of the mesh. They indicate that for 3D Helmholtz, the rank grows with k, although they do not provide a rigorous proof. Additional results can be found in [Bebendorf and Hackbusch 2003].

Note that both the studies in [Chandrasekaran et al. 2010] and [Engquist and Ying 2011] use the lexicographical ordering of the mesh points (i.e., layer-by-layer, or sweeping order). That is, the global matrix is block banded. Engquist and Ying note that both the Sommerfeld boundary condition and the layer-by-layer order are essential for this low-rank property. With a nested dissection ordering, the ranks may be higher [Engquist and Ying 2011; Martinsson 2011]. However, the ranks are generally not much higher, and may be of the same orders for some situations.

In fact, assume that the matrix after one level of nested dissection of a 2D regular mesh looks like

    A = [ A_11   0     A_13 ]
        [ 0      A_22  A_23 ]    (5)
        [ A_31   A_32  A_33 ],


where the layer-by-layer ordering is applied to both A_11 and A_22. The Schur complement of the two leading blocks is

    S = A_33 − A_31 A_11^{-1} A_13 − A_32 A_22^{-1} A_23.    (6)

Since −A_31 A_11^{-1} A_13 and −A_32 A_22^{-1} A_23 are the contributions to the Schur complement due to the elimination of A_11 and A_22, respectively, they satisfy the same off-diagonal rank properties as in [Engquist and Ying 2011]. Thus, the off-diagonal numerical rank bound of S is at most twice that with the layer-by-layer ordering for the entire A in [Chandrasekaran et al. 2010] and [Engquist and Ying 2011]. Note that this holds regardless of the ordering of A_11 and A_22. Since our goal is to factorize the entire matrix for direct solutions, a nested dissection ordering is better at preserving the sparsity and more suitable for parallel computing.

When there are multiple levels in nested dissection, the above claim similarly holds for some problems. For example, for elliptic problems with a Dirichlet boundary condition, the procedure in (5)-(6) applies to any node with two children in the assembly tree. For our numerical tests below, we fix the frequency in the Helmholtz equation, which makes the rank behaviors similar to the elliptic case for large matrix sizes.

Therefore, with the nested dissection ordering, we trade some low-rankness for better sparsity. In practice, we observed that, with our partially structured algorithm, the ranks grow as O(k) in 3D (see Table III), which corroborates Engquist's theory very well.

In [Xia 2013], Xia further considers certain rank patterns of the off-diagonal blocks on top of the rank bounds. That is, when the numerical ranks of the off-diagonal blocks at the hierarchical levels of the frontal matrices satisfy certain patterns, some more optimistic complexity estimates can be obtained. See Table I. As can be seen, in all these cases, the HSS-structured multifrontal factorization is provably faster and uses less memory than the classical algorithm.

Table I. Theoretical off-diagonal rank bounds based on [Chandrasekaran et al. 2010; Engquist and Ying 2011] and the complexities of the standard multifrontal method (MF) based on [Duff and Reid 1983; George 1973] and the HSS-structured multifrontal method (HSSMF) based on [Xia 2013], where k is the mesh size in one dimension, and n = k^2 in 2D and n = k^3 in 3D.

    Problem                         r           MF                                HSSMF
                                                Factorization flops  Memory       Factorization flops  Memory
    2D (k × k)      Elliptic        O(1)        O(n^{3/2})           O(n log n)   O(n log n)           O(n log log n)
                    Helmholtz       O(log k)
    3D (k × k × k)  Elliptic        O(k)        O(n^2)               O(n^{4/3})   O(n^{4/3} log n)     O(n log n)
                    Helmholtz       O(k)

Notice that all the above complexity analysis is based on the off-diagonal rank analysis of the Schur complements, which are assumed to be computed exactly. In practice, this is usually not the case, since all the intermediate Schur complements are approximate. Thus, the actual costs of the HSS-structured factorization may be higher than the results in Table I. When appropriate accuracy is enforced in the HSS construction, we expect the off-diagonal numerical ranks of the approximate Schur complements to be close to those of the exact ones.

4. PARALLEL ALGORITHMS: NESTED DISSECTION, SYMBOLIC FACTORIZATION, AND HSS-STRUCTURED MULTIFRONTAL METHOD

In this work, we target our parallel algorithms at the discretized PDEs from regular 2D and 3D meshes. This is mainly motivated by the need for a fast and memory-efficient solver in one of the applications we are working with: seismic imaging problems involving the


Helmholtz equations. Therefore, our parallelization strategies exploit the spatial geometry, and we will demonstrate that this geometry-tailored code is faster than general sparse solvers. It is our work in progress to extend this to a general algebraic approach for arbitrary sparse matrices. In the following subsections, we present our parallel algorithms for all the phases of the solution process. We assume that the total number of processes is P, and is a power of two for ease of exposition.

4.1. Parallel nested dissection and symbolic factorization for regular meshes

For any sparse factorization method, the order of elimination affects the amount of fill produced in the LU factors and the amount of parallelism exposed. For the regular 2D and 3D meshes considered in this paper, the best ordering method is nested dissection [George 1973], which recursively partitions the (sub)mesh via separators. A separator is a small set of nodes whose removal divides the rest of the (sub)mesh into two disjoint pieces. At the end of the recursion, the final ordering is performed such that the lower level separator nodes are ordered before the upper level ones; see Figure 4(a) for an illustration. The recursive bisection procedure results in a complete binary (amalgamated) elimination tree, a.k.a. separator tree; see Figure 4(b). The total number of levels in nested dissection is l = O(log_2 n). Following this ordering, the LU factorization eliminates the mesh nodes (unknowns) from the lower levels to the upper levels. Nodes may get connected during elimination; see Figure 4(c).

[Figure 4 appears here: (a) partition/ordering with separators on a regular mesh (separators numbered 1-15); (b) the corresponding separator tree with levels 0-3; (c) node connections after the elimination of some lower level mesh points.]

Fig. 4. Separators in nested dissection of a regular mesh, the corresponding separator tree, and the node connections after the elimination of some lower level mesh points.

Algorithm 2 sketches our parallel nested dissection ordering for a cuboid of dimension N_x × N_y × N_z. The 2D case is similar.

In this algorithm, the sets of variables associated with the separators are ordered in a postorder of the separator tree. But the algorithm does not specify the variables' ordering within a separator; see line [*]. For an exact sparse factorization, the internal separator ordering has little effect on the fill-in, because the submatrix F_11 associated with the separator is dense. A common practice is to use a lexicographical order for the mesh points within each 2D separator. However, when HSS is employed, the internal separator ordering affects the amount of compression that can be done for F_j and the rank distribution of the HSS blocks. Therefore, we apply another nested dissection ordering strategy for the variables within a separator. Figure 5 illustrates this for the 3D case, where a separator is a 2D plane submesh (left figure). We use a partitioning/ordering that respects the geometry and retains locality of the variables within the HSS blocks. The simple example in Figure 5 can be extended to a larger number of parts, using an edge-based nested dissection or a Morton ordering [Morton 1966]. In an algebraic context (i.e., for a general sparse matrix), Amestoy et al. use graph partitioning tools to reorder a halo graph associated with each separator [Amestoy et al. 2015].


ALGORITHM 2: Parallel nested dissection ordering for a 3D mesh with P processes.

Let L (≈ log_2 n) be the number of bisection levels. (Root is at level 0.)
    L_par = log_2 P ≤ L is the number of upper, parallel levels.
    FS(j) = list of fully-summed variables (F_11 in Eqn. (1)) at node j of the tree.
    CB(j) = list of variables in the contribution block (F_22 in Eqn. (1)) at node j of the tree.
    w(j) = number of vertices to be ordered before node j.
    order(x, y, z) = i means the mesh point (x, y, z) is the ith variable to be eliminated.

(a) Tree generation:
    (a.1) All P processes generate a binary tree globTree with L_par levels and P leaves (globTree is duplicated).
    (a.2) Each process p is assigned one leaf and generates its local subtree locTree with L − L_par levels. There are P different locTrees, corresponding to different subdomains, and the root of each locTree is a leaf of globTree.

(b) Compute the weight w in postorder of both trees:
    Each node j of globTree and locTree is associated with a cuboid [x1:x2, y1:y2, z1:z2] of 6 coordinates, corresponding to frontal matrix F_j.
    w(j + 1) = w(j) + (x2 − x1 + 1)(y2 − y1 + 1)(z2 − z1 + 1).

(c) Nested dissection ordering:
    (c.1) Redundant work: all P processes perform the top-down L_par levels of recursive bisection down globTree:
        Root node: FS(0) ← [1:N_x, 1:N_y, 1:N_z].
        For each interior node j with cuboid [x1:x2, y1:y2, z1:z2], perform cuts along the Z, Y and X axes in sequence (the following shows only the Z-cut):
            FS(left child of j)  ← [x1:x2, y1:y2, z1:(z1+z2)/2 − 1];
            FS(right child of j) ← [x1:x2, y1:y2, (z1+z2)/2 + 1:z2];
            FS(j) ← [x1:x2, y1:y2, (z1+z2)/2];   (2D plane separator)
        End For;
        [*] 2D plane separator ordering: for each vertex (x, y, z) in separator node j, assign
            1 ≤ order(x, y, z) ≤ (x2 − x1 + 1)(y2 − y1 + 1)(z2 − z1 + 1);
            order(x, y, z) = order(x, y, z) + w(j).
    (c.2) Independent work: each process p independently performs the recursive bisection for its local subtree locTree (similar to (c.1)).
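For illustration, the sequential core of steps (b)-(c) can be mimicked in a few lines of Python. The sketch below is our own simplification (a single recursion handles all levels, the longest dimension is cut, and the ordering within each plane separator is simply lexicographic rather than the nested ordering recommended above); it orders a small k × k × k mesh with lower-level separators first:

    import numpy as np

    def nested_dissection(box, order, counter, min_size=2):
        """Recursively order the mesh points of the cuboid
        box = (x1, x2, y1, y2, z1, z2): both halves first, then the separator."""
        x1, x2, y1, y2, z1, z2 = box
        sizes = (x2 - x1 + 1, y2 - y1 + 1, z2 - z1 + 1)
        if min(sizes) <= min_size:                 # small box (or a plane): order it directly
            for x in range(x1, x2 + 1):
                for y in range(y1, y2 + 1):
                    for z in range(z1, z2 + 1):
                        order[x, y, z] = counter[0]
                        counter[0] += 1
            return
        axis = int(np.argmax(sizes))               # cut the longest dimension
        lo, hi = (x1, y1, z1)[axis], (x2, y2, z2)[axis]
        mid = (lo + hi) // 2
        left, right, sep = list(box), list(box), list(box)
        left[2 * axis + 1] = mid - 1               # [lo, mid - 1]
        right[2 * axis] = mid + 1                  # [mid + 1, hi]
        sep[2 * axis] = sep[2 * axis + 1] = mid    # the plane separator
        nested_dissection(tuple(left), order, counter, min_size)
        nested_dissection(tuple(right), order, counter, min_size)
        nested_dissection(tuple(sep), order, counter, min_size)   # separator ordered last

    k = 8
    order = -np.ones((k, k, k), dtype=int)
    nested_dissection((0, k - 1, 0, k - 1, 0, k - 1), order, [0])
    assert order.min() == 0 and order.max() == k**3 - 1           # a full permutation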


[Figure 5 appears here: a plane separator split into two leaves (leaf 1, leaf 2) and the associated frontal matrix, whose fully-summed part is ordered leaf by leaf, followed by the CB part.]

Fig. 5. Correspondence between the ordering of the variables of a plane separator (left) and the variables within the associated frontal matrix (right).

The purpose of the symbolic factorization is to identify the variables associated with each frontal matrix F_j. Since the variables of F_11 corresponding to the separator are already determined from nested dissection, the remaining task is to compute the variables in the F_22 block. The parallel symbolic algorithm proceeds in the same postorder of the separator tree. Each process p first traverses its own locTree, then traverses globTree. For each node j of the tree, i.e., a separator, the process finds the "span" of j, i.e., the smallest cuboid which contains j and whose faces touch neighboring separators or boundaries of the domain. There are at most 6 neighboring separators that touch the span of j, and they are ancestors of j in the elimination tree. The variables in the F_22 block for j are the intersection of each touching separator with the span of j. This is illustrated in Figure 6, where the span of the separator we consider touches 4 other separators of the domain.

Fig. 6. Span of a separator. We consider the smallest horizontal (dotted green) separator. The (grey) cuboid is its span and touches 4 other separators that are its ancestors.

4.2. Parallel multifrontal method

After the nested dissection ordering, we can design the distributed data layout and the parallel algorithm to fully exploit the well-structured binary separator/elimination tree. For the regular mesh geometry and the stencil defined on it, the two subparts resulting from each bisection have a similar number of mesh points (equations and variables), and hence a similar amount of computation associated with the elimination process. Therefore, we can assign half of the processes to each subpart of the bisection, which is expected to achieve nearly perfect load balance for the pure multifrontal method. However, for the HSS-structured multifrontal method, this cannot easily accommodate the load imbalance due to rank imbalance after HSS compression, which we comment on more in Section 5. Figure 7(a) illustrates this matrix distribution scheme for the example with three levels of bisections. As is shown, the processes are divided into groups, which are organized as a binary tree structure with each tree node being associated with a group of processes. The process numbers of each group are displayed in each tree node.^1 There is a one-to-one correspondence between a group of processes, a separator, and the associated frontal matrix F_j.

Once we decide the above matrix distribution scheme, our parallel algorithm can be

stated as in Algorithm 3. Algorithm 3 maps readily onto the standard concepts of MPI, BLACS [Dongarra and Whaley 1997] and ScaLAPACK [Blackford et al. 1997] that are used to implement the algorithm, since the majority of the operations with the frontal and update matrices at each node are dense matrix operations.

^1 In parallel computing, the 0-based convention is used in the naming of processes, processors, etc.


[Figure 7 appears here: (a) the process tree with the process labels of each group at each node: single processes 0-7 at the leaf frontal matrices F_1, F_2, F_4, F_5, F_8, F_9, F_11, F_12, pairs {0:1}, {2:3}, {4:5}, {6:7} at F_3, F_6, F_10, F_13, quadruples {0:3} and {4:7} at F_7 and F_14, and {0:7} at the root F_15; (b) the same process tree with each group arranged as a 2D process grid.]

Fig. 7. Illustration of the eight processes assigned to the frontal matrices F_1 to F_15 along the separator/elimination tree in Figure 4.

ALGORITHM 3: Parallel multifrontal factorization with P processes.

Let L = log P be the number of bisection levels. Proceed bottom-up by levels:
(a) At the bottom level, L, with P nodes,
    (a.1) In parallel, each process factorizes its local subtree sequentially;
    (a.2) For i = 0 ... P/2 − 1, a pair of processes {2i, 2i+1} performs the extend-add in Eqn. (3), and distributes the parent frontal matrix to the pair {2i, 2i+1}.
(b) For levels l = (L − 1) ... 0 in the separator tree,
    (b.1) In parallel, a group of 2^{L−l} processes factorizes a frontal matrix F_j in Eqn. (1);
    (b.2) In parallel (except at the root), a group of 2^{L−l+1} processes performs the extend-add in Eqn. (3), and redistributes the parent frontal matrix to this process group.

The key is to use MPI's process grouping capability. Each process group/communicator corresponds to one node of the process tree, and the groups have a nested structure. In the example of Figure 7, we need to form seven communication groups: {0,1}, {2,3}, {4,5}, {6,7}, {0:3}, {4:7} and {0:7}.

The governing distribution scheme in BLACS is a 2D block cyclic matrix layout, in which the user specifies the block size of a submatrix and the shape of the 2D process grid. The blocks of the matrices are then cyclically mapped to the process grid in both row and column dimensions. Furthermore, the processes can be divided into groups (contexts) to work on independent parts of the calculations. Figure 7(b) shows the mapping of the separator tree nodes to 8 processes arranged as 2D grids. We always arrange the process grid as square as possible, i.e., P ≈ √P × √P. Since the context of any non-leaf node doubles each child's context, the former can be arranged to combine the two child contexts either side by side or one on top of the other.
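For reference, the 2D block-cyclic layout maps the block in block-row I and block-column J to the process at grid position (I mod Pr, J mod Pc). A tiny standalone sketch of this indexing rule (our own helper, not a BLACS call):

    import numpy as np

    def block_cyclic_owner(i, j, nb, Pr, Pc):
        """Process grid coordinates owning global entry (i, j) under a 2D
        block-cyclic layout with nb x nb blocks on a Pr x Pc process grid."""
        return (i // nb) % Pr, (j // nb) % Pc

    # A 16 x 16 matrix with 2 x 2 blocks on a 2 x 2 process grid; print the
    # flattened owner (row * Pc + col) of each entry in the top-left corner.
    owners = np.array([[block_cyclic_owner(i, j, 2, 2, 2)[0] * 2 +
                        block_cyclic_owner(i, j, 2, 2, 2)[1]
                        for j in range(16)] for i in range(16)])
    print(owners[:4, :4])
    # [[0 0 1 1]
    #  [0 0 1 1]
    #  [2 2 3 3]
    #  [2 2 3 3]]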

During the extend-add operation, we need to assemble the update matrices from the two children's nodes to the parent node. Since the parent's context is the union of the two child contexts, the two update matrices should be redistributed before the extend-add is performed. We use the ScaLAPACK routine PxGEMR2D to carry out such a redistribution.

In Table II, we compare the performance of our geometric multifrontal code (denoted MF) to the two widely used parallel sparse direct solvers MUMPS [Amestoy et al. 2006] and SuperLU_DIST [Li and Demmel 2003]. The former uses the multifrontal method while the latter uses a supernodal fan-out (right-looking) approach. For both solvers, we use a


geometric-based nested dissection fill-reducing ordering, and we disable MC64. This way, all the solvers factor the same matrix. As shown in the table, the size of the LU factors and the number of operations are very close for the three solvers; the slight variations are due to different ways of amalgamating the elimination tree (i.e., setting the size of the frontal matrices or the supernodes). In MUMPS, we disable delayed pivoting; pivoting is done only within fronts, as in our MF code. These settings enable us to have a fair comparison between the three codes. The problem that we use for benchmarking corresponds to a 100 × 100 × 100 3D grid. From the table, we observe that MF is faster than both MUMPS and SuperLU_DIST. This is because this code is tailored for this particular problem, while MUMPS and SuperLU_DIST are algebraic codes that can solve problems with very different sparsity patterns and numerical properties. MF also shows better strong scaling than MUMPS and similar scaling to SuperLU_DIST. These results demonstrate that our MF code has good performance, and can serve as a nice framework to include HSS-structured techniques, as well as a fair candidate to be compared with our HSS-structured version (so as to demonstrate the benefit of including HSS structures). In the following, we assess the performance of the HSS-structured MF as compared with the non-structured MF code.

Table II. Comparison of the parallel performance of the geometric multifrontal code to MUMPS and SuperLU_DIST for a 100 × 100 × 100 grid with different numbers of MPI processes. "Peak Mem." is the maximum local memory high-water mark among all the processes.

    # procs  Solver        Factor size  Flop count  Peak Mem.  Analysis  Facto.  Solve
                           (GB)         (x10^12)    (GB)       (s)       (s)     (s)
    32       MUMPS         16.9         54.0        0.96       2.9       69.9    0.5
             SuperLU_DIST  16.5         53.3        1.00       5.4       57.4    0.3
             MF            16.6         53.1        1.00       0.2       53.5    0.4
    64       MUMPS         16.9         54.0        0.52       3.1       40.5    0.4
             SuperLU_DIST  16.5         53.3        0.67       5.8       35.0    0.3
             MF            16.6         53.1        0.50       0.2       31.6    0.2
    128      MUMPS         16.9         58.1        0.36       3.2       28.7    0.4
             SuperLU_DIST  16.5         53.3        0.50       6.3       19.4    0.2
             MF            16.6         53.1        0.26       0.2       18.9    0.2
    256      MUMPS         16.9         58.1        0.23       3.4       23.2    0.6
             SuperLU_DIST  16.5         53.3        0.41       6.2       14.0    0.2
             MF            16.6         53.1        0.13       0.1       10.7    0.1

4.3. Parallel HSS algorithms

Parallelizing the HSS algorithms is a challenging task in its own right. In previous work, we have developed efficient parallel algorithms for HSS matrix operations, including construction, ULV factorization and solution; see [Wang et al. 2013] for details. Recall that the HSS computations can also be modeled by a binary tree, the HSS tree (see Figure 1(b)). Our parallel HSS algorithms were designed around this tree structure. In particular, we used the same node-to-process mapping principle as is used for the multifrontal separator tree in Figure 7(b). For a dense structured submatrix of size 250,000 from a 3D problem, our parallel HSS solver is over 2x faster than the LU factorization in ScaLAPACK. The factor of the compressed matrix is 70x smaller than that of the standard, full-rank LU factorization. The asymptotic communication complexity (volume and number of messages) is also reduced [Wang et al. 2013].

4.4. Parallel HSS-structured multifrontal method

Now, employing the HSS technique to each separator amounts to applying the parallel HSS algorithms to the frontal matrix that is residing in the process context associated with that separator. Thus, we can organize the parallel algorithm following an outer-inner tree


structure. The outer tree is the multifrontal elimination tree, for which each tree node is embedded with an inner tree, the HSS tree. Figure 8 illustrates such a nested tree structure and the parallel HSS-structured multifrontal method. At several lower levels of the outer tree, the frontal matrices are small and not much compression can be done; therefore, a switching level [Xia et al. 2009] in the outer tree is used such that dense frontal matrix operations are employed below the switching level whereas HSS techniques are applied above this level.

The implementation of the outer-inner tree computations can make use of the MPI subcommunicator or the BLACS context mechanism. In the following discussion, we will use the example in Figure 9 to illustrate the parallel operations. For example, in Figure 9(a), the four processes {12,13,14,15} form a process group/context for the parallel factorization of their assigned separator node. Using this group of four processes, we apply the parallel HSS construction, ULV factorization and solution algorithms, given as input the logical view of the process numbers {0,1,2,3}. The matrix structure of the corresponding separator node is illustrated in Figure 9(b). There is no need for any algorithm changes internal to the parallel HSS algorithms. Similarly, at the parent node, eight processes {8, ..., 15} are employed for the parallel HSS calculations, with the logical view of the process numbers {0, ..., 7}.

Fig. 8. Outer tree: elimination tree. Inner tree: HSS tree.

[Figure 9 appears here: (a) 16 processes forming various groups/contexts: processes 0-7 and 8-15 are grouped at successive levels into singletons, pairs, quadruples such as {12,13,14,15}, and octets; (b) an HSS-structured frontal matrix F_j mapped to the process context {12,13,14,15}, with the HSS blocks of F_11, F_12 and F_21 distributed over these four processes.]

Fig. 9. Illustration of the parallel HSS-embedded partial factorization and Schur update of a frontal matrix.


The outer tree multifrontal parallelism is the same as depicted in Algorithm 3. With HSS embedding, a major new design of the parallel algorithm is the computation of the contribution block U_j for the inner HSS tree (Algorithm 1, step (c)) and its interaction with the outer tree parallelism. Recall that in our partially structured multifrontal method, we apply the HSS compression technique only to the blocks with fully summed variables in the frontal matrix, i.e., F_11, F_12 and F_21; see Figure 9(b). After HSS compression, the frontal matrix is approximated as

    F_j ≈ [ H              U_1 B_1 V_2^T ]    (7)
          [ U_2 B_2 V_1^T  F_22          ],

where H is an HSS representation of F_11, and U_1, B_1, V_2, U_2, B_2 and V_1 are the HSS generators at the root of the HSS tree associated with H. Next, we perform a parallel ULV factorization of H, yielding a reduced matrix for the HSS form of F_j:

    [ D_k            U_1 B_1 V_2^T ]    (8)
    [ U_2 B_2 V_1^T  F_22          ],

where D_k, U_1 and V_1 are all of order r, the HSS rank. Here, we employ our parallel HSS construction and partial ULV factorization algorithms developed in [Wang et al. 2013], using four processes in this example, which accomplishes steps (a) and (b) in Algorithm 1.

The next step is decompression to form the dense update matrix U_j in parallel (Algorithm 1, step (c)): U_j = F_22 − (U_2 B_2 V_1^T) H^{-1} (U_1 B_1 V_2^T). As was shown in [Xia 2013], this can be computed cheaply by a low-rank update as follows:

    U_j = F_22 − (U_2 B_2 V_1^T U_k^{-1}) (L_k^{-1} U_1 B_1 V_2^T),    (9)

with the help of a (small) LU factorization of D_k = L_k U_k. In our implementation, we use ScaLAPACK and PBLAS routines to perform the parallel LU factorization, the parallel triangular solve (i.e., L_k^{-1} U_1), and the parallel matrix-matrix multiplication.
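The algebra behind (9) can be checked with dense matrices. In the sketch below (a standalone numpy illustration with random generators; a plain LU factorization H = P L_h U_h stands in for the paper's ULV-reduced D_k = L_k U_k, so this verifies the analogous identity rather than the actual structured kernel), the correction to F_22 is formed from an r × r middle factor and triangular solves with only r right-hand sides:

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    rng = np.random.default_rng(2)
    n1, n2, r = 50, 30, 4                 # fully-summed size, CB size, HSS rank

    H = rng.standard_normal((n1, n1)) + n1 * np.eye(n1)    # stands for the HSS form of F_11
    U1, V1 = rng.standard_normal((n1, r)), rng.standard_normal((n1, r))
    U2, V2 = rng.standard_normal((n2, r)), rng.standard_normal((n2, r))
    B1, B2 = rng.standard_normal((r, r)), rng.standard_normal((r, r))
    F22 = rng.standard_normal((n2, n2))

    # Reference: dense Schur complement of the compressed front (7).
    Uj_ref = F22 - (U2 @ B2 @ V1.T) @ np.linalg.solve(H, U1 @ B1 @ V2.T)

    # Low-rank update in the spirit of (9), with H = P Lh Uh from a plain LU.
    P, Lh, Uh = lu(H)
    X = solve_triangular(Uh, V1, trans='T', lower=False)       # Uh^{-T} V1        (n1 x r)
    Y = solve_triangular(Lh, P.T @ (U1 @ B1), lower=True)      # Lh^{-1} P^T U1 B1 (n1 x r)
    Uj = F22 - (U2 @ B2) @ (X.T @ Y) @ V2.T                    # rank-r correction
    print(np.allclose(Uj, Uj_ref))                             # True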

The last step is to assemble the contribution block U_j into the parent frontal matrix as part of the extend-add. This is the same as what is needed in the pure parallel multifrontal algorithm described in Section 4.2; that is, we can use the ScaLAPACK routine PxGEMR2D to first redistribute U_j from the four processes {12,13,14,15} onto the eight processes {8, ..., 15}, before the addition operation.

Our code includes pivoting within the dense frontal matrices. No pivoting is needed to factorize the HSS frontal matrices, due to the ULV factorization procedure which computes a sequence of orthogonal local factorizations. The current code does not use delayed pivoting across the children/parent fronts. In our applications, factorizations without delayed pivoting preserve the practical background of the problems and are enough to observe stability in practice [Wang et al. 2010; Wang et al. 2011; Wang et al. 2012]. In our future parallel algebraic structured solver for more general sparse matrices, we plan to include static pivoting as in SuperLU_DIST [Li and Demmel 2003] or randomized pivoting as in [Xin et al. 2013].

5. RESULTS

In this section, we present the performance results of our parallel geometric HSS-structured multifrontal solver when it is used to solve the Helmholtz equation of the following form:

    ( −Δ − ω^2 / v(x)^2 ) u(x, ω) = s(x, ω),    (10)

where Δ is the Laplacian, ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is called the time-harmonic wavefield solution to the forcing term s(x, ω). Helmholtz


equations arise frequently in real applications such as seismic imaging, where the simplest case of the acoustic wave equation is of the form (10). Here, we focus on the 3D geometry and a 27-point discretization of the Helmholtz operator on 3D regular domains. This discretization often leads to very large sparse matrices A which are highly indefinite and ill-conditioned. We use a fixed frequency f = 10 Hz (ω = 2πf), a sampling rate of about 15 points per wavelength, wavelength L = 150 meters, step size h = 10 meters, and the PML (Perfectly Matched Layer) boundary condition. In order to study the scaling of the parallel algorithm, we increase the number of discretization points k in each dimension, so that the domain size in each dimension is k × h meters.

It has been observed that, in the direct factorization of A, the dense intermediate matrices may be compressible [Engquist and Ying 2011; Wang et al. 2010]. In this application, the discretized matrix has complex values, but we need to work only in single precision arithmetic, as this is what is used most of the time in full waveform inversion.

We carry out the experiments on the Cray XC30 (edison.nersc.gov) at the National Energy Research Scientific Computing Center (NERSC). Each node has two sockets, each of which is populated with a 12-core Intel "Ivy Bridge" processor at 2.4 GHz. There are 24 cores per node. Each node has 64 GB of memory. The peak performance of each core is 19.2 Gflops/core.

We report the detailed parallel performance results in Table III. We also show in Figure 10 how the size of the factors and the number of operations of the HSS-structured factorization grow as a function of the problem size. The grid sizes range from 100 × 100 × 100 to 600 × 600 × 600 (i.e., n = 216 million). The number of cores ranges from 16 to 16,384. For each problem, we compare the runs of the same code using a regular multifrontal mode (by setting the switching level to 0) and the HSS-structured mode (where the switching level is chosen according to the rules in [Xia 2013]). In each case, we perform one factorization and one solution step with a single right-hand side, followed by at most 5 steps of iterative refinement of the solution. The convergence criterion is that the componentwise backward error max_i |Ax − b|_i / (|A||x| + |b|)_i [Oettli and Prager 1964] is less than or equal to 5 × 10^-7. In case the error does not go down to the desired accuracy, we keep the best residual.

We report and compare the following performance metrics:

— Time: the runtime and the flop rate of the factorization, the solution, and the iterative refinement phases. For the HSS-structured factorization, we also report the time spent in the rank-revealing QR, which is expected to be a dominant operation.
— Memory: the total size of the factors and the total peak of memory, including the factors and the active memory. The latter is the dominant part of the memory footprint of the solver.
— Communication characteristics: the volume, the number of messages, and the time spent in communication. These are collected using the IPM performance profiling tool [Fuerlinger et al. 2010]. We did not collect data for the biggest problem, as the log files become too large for very long runs with many cores.
— Accuracy: we measure the normwise relative residual ||Ax − b|| / ||b|| and the componentwise backward error max_i |Ax − b|_i / (|A||x| + |b|)_i before and after iterative refinement (a small sketch of these two measures follows this list).
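For clarity, these two measures can be computed as follows (a standalone numpy sketch, not the solver's code; the function name error_metrics is ours):

    import numpy as np

    def error_metrics(A, x, b):
        """Normwise relative residual and the Oettli-Prager componentwise
        backward error used as the stopping criterion above."""
        r = A @ x - b
        normwise = np.linalg.norm(r) / np.linalg.norm(b)
        componentwise = np.max(np.abs(r) / (np.abs(A) @ np.abs(x) + np.abs(b)))
        return normwise, componentwise

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = np.linalg.solve(A, b)
    print(error_metrics(A, x, b))   # both values are at roundoff level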

One can notice that, except for the smallest 3D problem, the HSS-structured factorization is always faster than the regular, uncompressed multifrontal factorization. The new code is up to 2x faster than the pure multifrontal code (e.g., for the mesh size 300^3). The structured factor size is always smaller than that of the multifrontal code, by a factor of up to almost 6x (e.g., for the mesh size 600^3). The flop rates of the HSS-structured code are lower because the HSS kernels operate on smaller blocks, while the multifrontal kernels mostly perform large BLAS-3 operations. The size of the factors is significantly reduced, but the gains on the maximum peak of memory are not as large, because the current code is not fully structured yet: we do not apply the HSS compression on the contribution blocks, as mentioned in Section 2.3.



Table III. Experimental results on 3D regular domains. We compare a regular (full-rank) multifrontal solution process ("MF") with our HSS-structured solver ("HSSMF"). Communication statistics are collected with IPM, except for the last problem (too large).

k (mesh: k × k × k; n = k³)                 100        200        300        400        600
Number of processors P                       64        256      1,024      4,096     16,384
Levels of Nested Dissection                  15         18         20         21         23
Switching level                               5          8         10         11         14

MF
  Factorization (s)                        53.2      842.0     2436.9     4217.6     6564.3
    Gflops/s                              998.7     4138.0    16454.8    53587.2   393694.1
  Solution (s)                              0.1        0.6        1.0        1.6        6.7
    Gflops/s                               17.8      300.6     1556.9     4977.7    25541.1
  Iterative refinement (s)                  0.2        0.6        1.1        2.5       44.8
    Steps                                     1          1          1          1          2
  Factors size (GB)                        16.6      280.0     1450.1     4636.1    23788.9
  Total peak (GB)                         262.0      434.7     2234.9     7119.5    36373.4
  Communication volume (GB)                68.0     2149.3    22790.5   135888.0          -
  Number of messages (millions)             4.1       71.9      576.2     2883.5          -
  Communication time (s)                    5.1       42.1      136.0      518.8          -
  After solution:
    ||Ax-b|| / ||b||                     1.0e-4     3.6e-4     1.2e-2     1.7e-3     2.8e-2
    max_i |Ax-b|_i / (|A||x|+|b|)_i      2.9e-3     2.1e-3     3.5e-3     1.5e-3     6.5e-3
  After refinement:
    ||Ax-b|| / ||b||                     8.3e-6     1.2e-5     1.5e-5     1.8e-5     2.2e-5
    max_i |Ax-b|_i / (|A||x|+|b|)_i      1.5e-7     1.5e-7     3.6e-7     1.8e-7     1.6e-7

HSSMF
  Compression and factorization (s)        56.9      598.6     1214.2     2477.1     5136.8
    Gflops/s                              467.2     1408.5     5116.9    11084.0    43399.2
  Min RRQR time (s)                        21.7      227.0      469.1      742.8     1603.3
  Max RRQR time (s)                        32.6      415.8      860.3     1375.7     3430.7
  Solution (s)                              0.2        0.7        2.8       10.9       63.2
    Gflops/s                               58.0      178.0      166.4      115.7       68.3
  Iterative refinement (s)                  1.4       13.2       13.2       68.6      350.0
    Steps                                     5          5          5          5          5
  Factors size (GB)                        10.8      115.8      433.4     1189.0     4053.5
  Total peak (GB)                         250.3      366.7     1738.4     5352.5    25287.2
  Communication volume (GB)                81.8     1794.9    14129.2    71261.2          -
  Number of messages (millions)            13.2      160.6     1175.0     6147.7          -
  Communication time (s)                    7.8       49.1      157.8      583.8          -
  Largest rank found in compression         518       1036       1822       2502       5214
  After solution:
    ||Ax-b|| / ||b||                     1.5e-2     3.0e-2     6.3e-2     6.5e-2     1.6e-1
    max_i |Ax-b|_i / (|A||x|+|b|)_i      8.4e-1     8.9e-1     9.2e-1     8.3e-1     8.9e-1
  After refinement:
    ||Ax-b|| / ||b||                     8.3e-6     1.2e-5     1.5e-5     1.8e-5     2.2e-5
    max_i |Ax-b|_i / (|A||x|+|b|)_i      3.8e-5     8.5e-5     6.0e-4     1.3e-3     1.3e-3

The size of the factors is significantly reduced, but the gains on the maximum peak of memory are not as large, because the current code is not fully structured yet: we do not apply HSS compression to the contribution blocks, as mentioned in Section 2.3. Studying and implementing a parallel fully-structured solver is work in progress.

We observe that the flop rates are disappointing for the largest problem, in both the multifrontal and the HSS-structured modes. We are currently investigating this issue with Cray. On an older system (Cray XE6), the HSS-structured code achieved 23.2 TFlops/s with 16,384 processes, instead of the 6.7 TFlops/s obtained here.

The communication volume is also reduced when the HSS kernels are used, with an almost 2x reduction in the case of 400³. The number of messages is larger for the HSS-structured factorization; this is because the HSS kernels perform many redistributions of the intermediate HSS blocks during the compression phase.



[Figure 10 appears here. Panel (a) plots the factor size (GB) against the mesh size (100³ to 600³), showing the measured sizes and the HSS theoretical growth n log n. Panel (b) plots the operation count (Gflops) against the mesh size, showing the measured counts, the HSS theoretical growth n^{4/3} log n, the full-rank theoretical growth n², and a tentative fit of n^{5/3}.]

Fig. 10. Size of the ULV factors and operation count of the HSS-structured factorization for 3D problems.

It is important to bear in mind that although there are more messages, these messages are not on the critical path of the parallel execution, meaning that most of them are transferred simultaneously by different pairs of processes. Therefore, the actual communication time is comparable between the two solvers, and the time-to-solution is still much reduced for the new solver.

In terms of accuracy, the HSS-structured code is as good as the pure multifrontal code in the normwise relative residual measure. With iterative refinement, the componentwise backward error is further reduced. One can notice that the pure multifrontal factorization does not yield residuals close to the machine precision without iterative refinement. This may be because we do not perform any scaling (equilibration) of the system before the factorization. Also, our code does not use delayed pivots across the children/parent fronts. But again, only modest accuracy is required for this class of problems, and delayed pivoting is generally not necessary.
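For context, equilibration here refers to a row/column scaling D_r A D_c applied before factorization; a minimal sketch (Python with NumPy on a dense matrix, using simple infinity-norm scalings rather than any specific library routine) would be:

    import numpy as np

    def equilibrate(A, b):
        # Row and column scalings so that each scaled row/column has unit infinity norm
        # (assumes no zero rows or columns). One then factors As = Dr A Dc, solves
        # As y = Dr b, and recovers x = Dc y.
        dr = 1.0 / np.max(np.abs(A), axis=1)
        dc = 1.0 / np.max(np.abs(A) * dr[:, None], axis=0)
        As = dr[:, None] * A * dc[None, :]
        return As, dr * b, dc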

According to Figure 10, the size of the factors and the number of operations of the HSS-structured factorization are rank-dependent, as discussed in the previous section. In practice, the rank patterns may not exactly follow the prediction, due to the approximation of the frontal and update matrices. We observed from our partially-structured code that the maximum rank found in compression is proportional to the mesh size k; see the corresponding row in Table III. With this rank pattern, the factor size follows the predicted O(n log n) complexity. The number of operations seems to follow an O(n^{5/3}) behavior; this is larger than the O(n^{4/3} log n) asymptotic behavior of a fully-structured factorization, but smaller than the O(n²) complexity of a classical non-structured factorization.
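As a rough empirical check (a sketch only: it reconstructs total operation counts as flop rate times factorization time from Table III and fits the exponent by least squares in log-log scale), the growth rate can be estimated as follows:

    import numpy as np

    # Mesh sizes, HSS-structured factorization times (s), and flop rates (Gflops/s) from Table III.
    k = np.array([100, 200, 300, 400, 600], dtype=float)
    n = k**3
    time_s = np.array([56.9, 598.6, 1214.2, 2477.1, 5136.8])
    gflops = np.array([467.2, 1408.5, 5116.9, 11084.0, 43399.2])
    ops = gflops * time_s                     # total Gflops of each factorization

    # Fit log(ops) = alpha * log(n) + c; alpha is the empirical complexity exponent.
    alpha, c = np.polyfit(np.log(n), np.log(ops), 1)
    print(f"empirical exponent: {alpha:.2f}")  # close to 5/3, between 4/3 (fully structured) and 2 (full rank)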

We carefully analyzed the performance of the HSS-structured factorization and found that the rank-revealing QR time exhibits serious load imbalance, which dominates the HSS construction time. For example, we observed up to a 3x imbalance in RRQR time for some problems. This is due to the imbalance in the ranks identified during the HSS compression of a frontal matrix. We illustrate this in Figure 11, where we show the ranks of the top two levels of the HSS tree corresponding to the root node (topmost separator) of the 200³ problem.

There are four processes working at each leaf node. The minimum rank is 413 and the maximum is 607. Table IV shows the HSS construction time and the RRQR time imbalance. Our future work is to explore other separator ordering methods and data distribution schemes to mitigate this load imbalance.




[Figure 11 appears here, showing the ranks at the top two levels of the HSS tree after RRQR compression: 450, 413, 540, 451, 607, and 467.]

Fig. 11. Ranks after RRQR compression of the root node of the 200 × 200 × 200 problem.

Table IV. Parallel HSS construction time in seconds for the topmost separator of the 200³ problem.

                     HSS time    HSS rank    Min time in RRQR    Max time in RRQR
  edge-based ND        32.3         646            11.0                30.7

In Table V, we report experimental results obtained when solving different variants of the Helmholtz equation: the Laplacian (Helmholtz equation with zero frequency) and Optical Diffusion (Helmholtz equation with imaginary frequency). One can observe that these equations have different compressibility properties. The Laplace equation and the Optical Diffusion problem yield lower HSS ranks, smaller factors, and fewer operations. For these two problems, our HSS-based factorization yields a larger speedup over the non-structured factorization than what we obtain for the general Helmholtz problem.

Table V. Solution of three variants of the Helmholtz equation, using a 300 × 300 × 300 grid and 1,024 MPI tasks.

                                 Laplace          Optical Diffusion     Helmholtz
                                 ω = 0            ω = 10 · π · i        ω = 10 · 2π
  MF      Operations                              4.0 × 10^16
          Factors size (GB)                       1,450.1
  HSS     Max rank               1317             1310                  1822
          Operations             5.3 × 10^15      4.9 × 10^15           7.0 × 10^15
          Factors size (GB)      431.1            423.5                 433.4
  Speedup HSS vs. MF             2.3              2.4                   2.0

6. CONCLUSIONS

We presented a parallel structured, geometry-based multifrontal sparse solver using HSS representations for low-rank approximation. The novelty of our method is to use two layers of tree parallelism to exploit the HSS structure in each dense frontal matrix, and to employ the HSS factorization and solution algorithms in place of the usual LU factorization of the front. We developed a hierarchically parallel algorithm to weave together parallel multifrontal elimination (outer-tree parallelism) and parallel HSS factorization (inner-tree parallelism). Parallel nested dissection and symbolic factorization are also studied, to preserve the geometry and to benefit the later low-rank compression. We tested our new parallel HSS-structured multifrontal code using up to 16,384 cores, and demonstrated clear advantages over the standard parallel multifrontal method. The advantage is especially significant in terms of memory for large 3D problems.

This is the first parallel algorithm design and implementation that incorporates the HSS structure in a sparse multifrontal solver, with actual performance results on a real parallel machine.

ACKNOWLEDGMENTS

We thank the members of the Geo-Mathematical Imaging Group (GMIG) at Purdue University, BGP, ConocoPhillips, ExxonMobil, PGS, Statoil, and Total for their partial financial support.



Partial support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics). The research of Jianlin Xia was supported in part by NSF CAREER Award DMS-1255416 and NSF grants DMS-1115572 and CHE-0957024. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

REFERENCES

Patrick Amestoy, Cleve Ashcraft, Olivier Boiteau, Alfredo Buttari, Jean-Yves L'Excellent, and Clement Weisbecker. 2015. Improving Multifrontal Methods by Means of Block Low-Rank Representations. SIAM J. Scientific Computing 37, 3 (2015). http://dx.doi.org/10.1137/120903476
P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet. 2006. Hybrid scheduling for the parallel solution of linear systems. Parallel Comput. 32, 2 (2006), 136–156.
Amirhossein Aminfar, Sivaram Ambikasaran, and Eric Darve. 2014. A Fast Block Low-Rank Dense Solver with Applications to Finite-Element Matrices. CoRR abs/1403.5337 (2014).
Amirhossein Aminfar and Eric Darve. 2014. A Fast Sparse Solver for Finite-Element Matrices. CoRR abs/1410.2697 (2014). http://arxiv.org/abs/1410.2697
M. Bebendorf and W. Hackbusch. 2003. Existence of H-matrix approximants to the inverse FE-matrix of elliptic operators with L∞ coefficients. Numer. Math. 95 (2003), 1–28.
T. Bella, Y. Eidelman, I. Gohberg, and V. Olshevsky. 2008. Computations with quasiseparable polynomials and matrices. Theoret. Comput. Sci. 409 (2008), 158–179.
L. S. Blackford, J. Choi, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia. 325 pages. http://www.netlib.org/scalapack.
S. Borm, L. Grasedyck, and W. Hackbusch. 2003. Introduction to hierarchical matrices with applications. Eng. Anal. Bound. Elem. 27 (2003), 405–422.
S. Borm and W. Hackbusch. 2001. Data-sparse approximation by adaptive H2-matrices. Technical report, Leipzig, Germany: Max Planck Institute for Mathematics 86 (2001).
S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and D. White. 2005. Some fast algorithms for sequentially semiseparable representations. SIAM J. Matrix Anal. Appl. 27 (2005), 341–364.
S. Chandrasekaran, P. Dewilde, M. Gu, and N. Somasunderam. 2010. On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs. SIAM J. Matrix Anal. Appl. 31 (2010), 2261–2290.
S. Chandrasekaran, M. Gu, and T. Pals. 2006. A fast ULV decomposition solver for hierarchically semiseparable representations. SIAM J. Matrix Anal. Appl. 28 (2006), 603–622.
J. Dongarra and R. C. Whaley. 1997. A User's Guide to the BLACS v1.1. LAPACK Working Note #94. http://www.netlib.org/blacs.
I. S. Duff and J. K. Reid. 1983. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans. Math. Software 9 (1983), 302–325.
Y. Eidelman and I. Gohberg. 1999. On a new class of structured matrices. Integr. Equat. Operat. Theor. 34 (1999), 293–324.
B. Engquist and L. Ying. 2011. Sweeping preconditioner for the Helmholtz equation: hierarchical matrix representation. Commun. Pure Appl. Math. 64 (2011), 697–735.
Karl Fuerlinger, Nicholas J. Wright, and David Skinner. 2010. Effective performance measurement at petascale using IPM. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on. IEEE, 373–380.
J. A. George. 1973. Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10 (1973), 345–363.
W. Hackbusch. 1999. A sparse matrix arithmetic based on H-matrices. Part I: introduction to H-matrices. Computing 62 (1999), 89–108.
W. Hackbusch, L. Grasedyck, and S. Borm. 2002. An introduction to hierarchical matrices. Math. Bohem. 127 (2002), 229–241.
W. Hackbusch, B. Khoromskij, and S. A. Sauter. 2000. On H2-matrices. Lectures on applied mathematics (Munich, 1999), Springer, Berlin (2000), 9–29.
W. Hackbusch and B. N. Khoromskij. 2000. A sparse H-matrix arithmetic. Part II: Application to multi-dimensional problems. Computing 64 (2000), 21–47.
R. Li and Y. Saad. 2013. Divide and conquer low-rank preconditioners for symmetric matrices. SIAM J. Scientific Computing 35, 4 (2013), A2069–A2095.
X. S. Li and J. W. Demmel. 2003. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Trans. Math. Software 29, 2 (June 2003), 110–140.
J. W. H. Liu. 1990. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl. 18 (1990), 134–172.
J. W. H. Liu. 1992. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Rev. 34 (1992), 82–109.
P. G. Martinsson. 2011. A Fast Direct Solver for a Class of Elliptic Partial Differential Equations. J. Scientific Computing 38, 3 (2011), 316–330.
G. M. Morton. 1966. A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Technical Report, Ottawa, Canada: IBM Ltd (1966).
W. Oettli and W. Prager. 1964. Compatibility of approximate solution of linear equations with given error bounds for coefficients and right hand sides. Num. Math. 6 (1964), 405–409.
R. Vandebril, M. Van Barel, G. Golub, and N. Mastronardi. 2005. A bibliography on semiseparable matrices. Calcolo 42 (2005), 249–270.
S. Wang, M. V. de Hoop, and J. Xia. 2010. Seismic inverse scattering via Helmholtz operator factorization and optimization. J. Comput. Phys. 229 (2010), 8445–8462.
S. Wang, M. V. de Hoop, and J. Xia. 2011. On 3D modeling of seismic wave propagation via a structured massively parallel multifrontal direct Helmholtz solver. Geophys. Prospect. 59 (2011), 857–873.
S. Wang, M. V. de Hoop, J. Xia, and X. S. Li. 2012. Massively parallel structured multifrontal solver for time-harmonic elastic waves in 3D anisotropic media. Geophys. J. Int. 91 (2012), 346–366.
S. Wang, X. S. Li, J. Xia, Y. Situ, and M. V. de Hoop. 2013. Efficient Scalable Algorithms for Solving Linear Systems with Hierarchically Semiseparable Structures. SIAM J. Scientific Computing 35, 6 (2013), C519–C544.
C. Weisbecker. 2013. Improving multifrontal methods by means of algebraic block low-rank representations. Ph.D. Dissertation. Institut National Polytechnique de Toulouse.
J. Xia. 2012. On the complexity of some hierarchical structured matrix algorithms. SIAM J. Matrix Anal. Appl. 33 (2012), 388–410.
J. Xia. 2013. Efficient structured multifrontal factorization for general large sparse matrices. SIAM J. Sci. Comput. 35, 2 (2013), A832–A860.
J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. 2009. Superfast multifrontal method for large structured linear systems of equations. SIAM J. Matrix Anal. Appl. 31 (2009), 1382–1411.
J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. 2010. Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebra Appl. 17 (2010), 953–976.
Z. Xin, J. Xia, M. V. de Hoop, S. Cauley, and V. Balakrishnan. 2013. A structured multifrontal method for nonsymmetric sparse matrices and its applications. Purdue GMIG Report (April 2013).
