
ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

PATRICK R. AMESTOY†, IAIN S. DUFF‡§, YVES ROBERT¶, FRANÇOIS-HENRY ROUET†, AND BORA UÇAR¶‖

Abstract. The inverse of an irreducible sparse matrix is structurally full, so that it is impractical to think of computing or storing it. However, there are several applications where a subset of the entries of the inverse is required. Given a factorization of the sparse matrix held in out-of-core storage, we show how to compute such a subset efficiently, by accessing only parts of the factors. When there are many inverse entries to compute, we need to guarantee that the overall computation scheme has reasonable memory requirements, while minimizing the cost of loading the factors. This leads to a partitioning problem that we prove is NP-complete. We also show that we cannot get a close approximation to the optimal solution in polynomial time. We thus need to develop heuristic algorithms, and we propose: (i) a lower bound on the cost of an optimum solution; (ii) an exact algorithm for a particular case; (iii) two other heuristics for a more general case; and (iv) hypergraph partitioning models for the most general setting. We illustrate the performance of our algorithms in practice using the MUMPS software package on a set of real-life problems as well as some standard test matrices. We show that our techniques can improve the execution time by a factor of 50.

Key words. Sparse matrices, direct methods for linear systems and matrix inversion, multifrontal method, graphs and hypergraphs

AMS subject classifications. 05C50, 05C65, 65F05, 65F50

1. Introduction. We are interested in efficiently computing entries of the inverse of a large sparse nonsingular matrix. It was proved [12, 16] that the inverse of an irreducible sparse matrix is full in a structural sense, and hence it is impractical to compute all the entries of the inverse. However, there are applications where only a set of entries of the inverse is required. For example, after solving a linear least-squares problem, the variance of each component provides a measure of the quality of the fit, and is given by the diagonal entries of the inverse of a large symmetric positive semi-definite matrix [6]; the off-diagonal entries of the inverse of the same matrix give the covariance of the components. Other applications arise in quantum-scale device simulation, such as the atomistic level simulation of nanowires [9, 22], and in electronic structure calculations [20]. In these applications, the diagonal entries of the inverse of a large sparse matrix need to be computed. Some other classical applications which need the entries of the inverse include the computation of short-circuit currents [25] and approximations of condition numbers [6]. In all these computational applications, the aim is to compute a large set of entries (often diagonal entries) of the inverse of a large sparse matrix.

We have been particularly motivated by an application in astrophysics which is used by the CESR (Centre for the Study of Radiation in Space, Toulouse). In the context of the INTEGRAL (INTErnational Gamma-Ray Astrophysics Laboratory) mission of ESA (the European Space Agency), a spatial observatory with high resolution was launched. SPI (SPectrometer on INTEGRAL), a spectrometer with high energy

†Université de Toulouse, INPT(ENSEEIHT)-IRIT, France ({amestoy,frouet}@enseeiht.fr).
‡CERFACS, 42 Avenue Gaspard Coriolis, 31057 Toulouse, France (duff@cerfacs.fr).
§Atlas Centre, RAL, Oxon, OX11 0QX, England (isduff@rl.ac.uk).
¶Laboratoire de l'Informatique du Parallélisme (UMR CNRS-ENS Lyon-INRIA-UCBL), Université de Lyon, 46 allée d'Italie, ENS Lyon, F-69364 Lyon Cedex 7, France (yves.robert@ens-lyon.fr, bora.ucar@ens-lyon.fr).
‖Centre National de la Recherche Scientifique.


resolution, is one of the main instruments on board this satellite. To obtain a complete sky survey, a very large amount of data acquired by the SPI must be processed. For example, to estimate the total point-source emission contributions (that is, the contribution of a set of sources to the observed field), a linear least-squares problem of about a million equations and a hundred thousand unknowns has to be solved [7]. Once the least-squares problem is solved, the variances of the components of the solution are computed to get access to its standard deviation. The variances of the components of the solution are given by the diagonal elements of the inverse of the variance-covariance matrix A = B^T B, where B is the matrix associated with the least-squares problem.

The approach we use to compute the entries of the inverse relies on a traditional solution method and makes use of the equation AA^{-1} = I. More specifically, we compute a particular entry a^{-1}_{ij} using (A^{-1} e_j)_i; here e_j is the jth column of the identity matrix, and (v)_i denotes the ith component of the vector v (we use v_i to refer to the ith component of a vector when v is not defined by an operation). Using an LU factorization of A, a^{-1}_{ij} is obtained by solving successively two triangular systems:

    y = L^{-1} e_j ,
    a^{-1}_{ij} = (U^{-1} y)_i .                                    (1.1)
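To make the two-triangular-solve scheme of (1.1) concrete, here is a minimal dense sketch in Python (NumPy/SciPy). It ignores sparsity and the out-of-core setting, so it is only an illustration of the formula, not of the method of this paper; the function name inverse_entry is ours, not part of MUMPS.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    def inverse_entry(A, i, j):
        """Compute a^{-1}_{ij} = (U^{-1} (L^{-1} e_j))_i as in (1.1).

        scipy.linalg.lu returns A = P L U, so A^{-1} = U^{-1} L^{-1} P^T."""
        P, L, U = lu(A)
        e_j = np.zeros(A.shape[0])
        e_j[j] = 1.0
        y = solve_triangular(L, P.T @ e_j, lower=True)    # forward solve with L
        return solve_triangular(U, y, lower=False)[i]     # backward solve with U

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    assert np.isclose(inverse_entry(A, 2, 0), np.linalg.inv(A)[2, 0])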

The computational framework in which we consider this problem assumes that the matrix A, whose inverse entries will be computed, has been factorized using a multifrontal or supernodal approach, and that the factors have been stored on disks (out-of-core setting). While solving (1.1), the computational time is therefore dominated by the time required to load the factors from the disk. We see from the above equations that in the forward substitution phase the right-hand side (e_j) contains only one nonzero entry, and that in the backward step only one entry of the solution vector is required. For efficient computation, we have to take advantage of both these observations, along with the sparsity of A. We note that even though the vector y will normally be sparse, it will conventionally be stored as a full dense vector. Therefore, when the number of requested entries is high, one cannot hold all the solution vectors in memory at the same time. In this case, the computations are carried out in epochs, where at each epoch a predefined number of requested entries are computed. Table 1.1 contains results for three medium-sized matrices from our motivating application and summarizes what can be gained using our approach (the details of the experimental setting are described later in Section 5.2). In this table, we show the factors loaded and the execution time of the MUMPS (MUltifrontal Massively Parallel Solver) software package [3, 4]; we compare the memory requirement and solution time when sparsity in both the right-hand side and the solution is not exploited (in column No ES) and when it is exploited (in column ES). The solution time is shown for both the out-of-core and the in-core case in order to put these execution times in perspective. All diagonal entries of the inverse are computed, where 16 entries are computed at each epoch. Both solution schemes use the natural partition that orders the requested entries according to their indices and puts them in blocks of size 16 in a natural order, that is, the first 16 in the first block, the second 16 in the second block, and so on. We see that on this class of problems the number of factors to be loaded is reduced by a factor of between 36 and 96, with an impact of the same order on the computation time. Our aim in this paper is to further reduce the factors loaded and the execution time by carefully partitioning the requested entries.


Table 1.1
All diagonal entries of the inverse are computed in blocks of size 16. The columns No ES correspond to solving the linear systems as if the right-hand side vector were dense. The columns ES correspond to solving the linear system while exploiting the sparsity of the right-hand side vectors. The computations were performed on the Intel system defined in Section 5.2.

                   Factors loaded (in MB)          Time (in secs)
                                                In-core        Out-of-core
  Matrix name        No ES        ES         No ES     ES    No ES     ES
  CESR46799         114051      3158           380     21     3962     77
  CESR72358         375737      6056          1090     48    10801    264
  CESR148286       1595645     16595          4708    188    43397    721

In Section 2.1, we formulate which parts of the factors need to be loaded for a requested entry. At this stage, our contribution is to summarize what exists in the literature, although we believe that our observations on the solution of (Ux = y)_i using the tree are original. Later on, we investigate the case where a large number of entries of the inverse have been requested. This would be the typical case in applications, when we have to proceed in epochs. This entails accessing (loading) some parts of L and U many times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors, summed over the epochs, is minimized. As we shall see in Section 2.2, the problem can be cast as a partitioning problem on trees. In Section 3, we prove that the partitioning problem is NP-complete, and that a simple post-order based heuristic gives a solution whose cost is at most twice the cost of an optimal solution. We propose an exact algorithm for a particular case of the partitioning problem, and use this algorithm to design a heuristic for a more general case. We also propose a hypergraph-partitioning formulation which provides a solution to the most general problem of computing off-diagonal entries (see Section 4). We illustrate the performance of our techniques by implementing them within MUMPS, both on standard test matrix problems and on specific examples coming from data fitting problems in astrophysics. We present our results in Section 5. Some problems that are related to those covered in this paper are discussed in Section 6, along with directions for further research. We provide some concluding remarks in Section 7.

2. Background and problem definition. Throughout this study, we use a direct factorization method for obtaining the inverse entries. We assume that the factorization has taken place and that the factors are stored on the disks. The factorization code that we use is the general-purpose linear solver MUMPS, which computes the factorization A = LU if the matrix is unsymmetric, or the factorization A = LDL^T if the matrix is symmetric. MUMPS provides a large range of functionalities, including the ability to run out-of-core, that is, it can store the factors on hard disk when the main memory is not large enough [1, 24]. We consider an out-of-core environment and the associated metrics. We assume that the factorization has taken place using the structure of A + A^T for a pattern unsymmetric A, where the summation is structural. In this case, the structures of the factors L and U are the transposes of each other. All computation schemes, algorithms, and formulations in this paper assume this last property to hold.

We need the following standard definitions for later. A topological ordering of a



Figure 2.1. A pattern symmetric matrix A, its factors, and the associated elimination tree. (a) The matrix A, whose nonzeros are shown with blue circles. (b) The pattern of L + U, where the filled-in entries are shown with red squares. (c) The corresponding elimination tree, where the children of a node are drawn below the node itself.

tree is an ordering of its nodes such that any node is numbered before its parent. A post-order of a tree is a topological ordering where the nodes in each subtree are numbered consecutively. The least common ancestor of two nodes i and j in a rooted tree, lca(i, j), is the lowest numbered node that lies at the intersection of the unique paths from node i and node j to the root. The ceiling function ⌈x⌉ gives the smallest integer greater than or equal to x. For a set S, |S| denotes the cardinality of S.

2.1. Elimination tree and sparse triangular solves. When the matrix A is structurally symmetric with a zero-free diagonal, the elimination tree represents the storage and computational requirements of its sparse factorization. There are a few equivalent definitions of the elimination tree [21]. We prefer the following one for the purposes of this paper.

Definition 2.1. Assume A = LU, where A is a sparse, structurally symmetric N×N matrix. Then the elimination tree T(A) of A is a tree of N nodes, with the ith node corresponding to the ith column of L, and where the parent relations are defined as follows:

    parent(j) = min{ i : i > j and ℓ_{ij} ≠ 0 },  for j = 1, ..., N − 1 .

For the sake of completeness, we note that if A is reducible, this structure is a forest with one tree for each irreducible block; otherwise it is a tree. We assume without loss of generality that the matrix is irreducible. As an example, consider the pattern symmetric matrix A shown in Fig. 2.1(a). The factors and the corresponding elimination tree are shown in Figs. 2.1(b) and 2.1(c).
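For illustration, the parent array of Definition 2.1 can be computed directly from the lower-triangular nonzero pattern of A, without forming L, using a well-known path-compression technique. The sketch below (in Python, 0-based indices) assumes a pattern-symmetric input and is only meant to show the mechanics; the function name is ours.

    def elimination_tree(lower_pattern):
        """parent[j] of the elimination tree, as in Definition 2.1.

        lower_pattern[i] lists the columns j < i with a_ij != 0 (0-based,
        pattern-symmetric matrix). Uses path-compressed 'ancestor' pointers."""
        n = len(lower_pattern)
        parent = [None] * n
        ancestor = [None] * n
        for i in range(n):
            for j in lower_pattern[i]:
                r = j
                while ancestor[r] is not None and ancestor[r] != i:
                    nxt = ancestor[r]
                    ancestor[r] = i          # compress the path towards i
                    r = nxt
                if ancestor[r] is None:      # r had no parent yet: i becomes it
                    ancestor[r] = i
                    parent[r] = i
        return parent

    # A small arrow-shaped example: row 4 is coupled to all previous rows.
    pattern = [[], [0], [], [2], [0, 1, 2, 3]]
    print(elimination_tree(pattern))         # [1, 4, 3, 4, None]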

Our algorithms for efficiently computing a given set of entries in A^{-1} rely on the elimination tree structure. We take advantage of the following result, which is rewritten from [16, Theorem 2.1].

Corollary 2.2. Assume b is a sparse vector and L is a lower triangular matrix. Then the indices of the nonzero elements of the solution vector x of Lx = b are equal to the indices of the nodes of the elimination tree that are in the paths from the nodes corresponding to nonzero entries of b to the root.

We will use the corollary in the following way: when b is sparse, we need to solve the equations corresponding to the predicted nonzero entries, while setting the other entries of the solution vector x to zero. We note that when b contains a single nonzero


entry, say in its ith component, the equations to be solved are those that correspond to the nodes in the unique path from node i to the root. In any case, assuming the matrix is irreducible, the last entry x_N, corresponding to the root of the tree, is nonzero. Consider, for example, Lx = e_3 for the lower triangular factor L shown in Fig. 2.1(b). Clearly x_1 = x_2 = 0 and x_3 = 1/ℓ_{33}. These in turn imply that x_4 = 0 and x_5, x_6 ≠ 0. The nonzero entries correspond to the nodes that are in the unique path from node 3 to the root node 6, as seen in Fig. 2.1(c).
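In code, Corollary 2.2 with b = e_i amounts to climbing the parent pointers; a small sketch, using the parent array produced by the elimination_tree sketch above:

    def path_to_root(parent, i):
        """P(i): the nodes whose equations are solved for Lx = e_i
        (Corollary 2.2), i.e., the path from node i up to the root."""
        path = [i]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    # With the arrow example above: solving Lx = e_2 touches nodes 2, 3, 4.
    print(path_to_root([1, 4, 3, 4, None], 2))   # [2, 3, 4]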

With Corollary 2.2, we are halfway through solving (1.1) efficiently: for a particular entry of the inverse, we solve only the relevant equations involving the L factor (and hence only access the necessary parts of it). Next, we show which equations involving U need to be solved when only a particular entry of the solution is requested, thereby specifying the whole solution process for an entry of the inverse.

Lemma 2.3. In order to obtain the ith component of the solution to Uz = y, that is, z_i = (U^{-1} y)_i, one has to solve the equations corresponding to the nodes that are in the unique path from the highest node in struct(y) ∩ ancestors(i) to i, where struct(y) denotes the nodes associated with the nonzero entries of y, and ancestors(i) denotes the set of ancestors of node i in the elimination tree.

Proof. We first show (1) that the set of components of z involved in the computation of z_i is the set of ancestors of i in the elimination tree. We then (2) reduce this set using the structure of y.

(1) We prove by top-down induction on the tree that the only components involved in the computation of any component z_l of z are the ancestors of l in the elimination tree. The root component z_N is computed as z_N = y_N / u_{NN} (thus it requires no other component of z), and the root has no ancestor in the tree. For any node l, following a left-looking scheme, z_l is computed as

    z_l = ( y_l − Σ_{k=l+1}^{N} u_{lk} z_k ) / u_{ll} = ( y_l − Σ_{k : u_{lk} ≠ 0} u_{lk} z_k ) / u_{ll} .    (2.1)

All the nodes in the set K_l = {k : u_{lk} ≠ 0} are ancestors of l, by definition of the elimination tree (since struct(U^T) = struct(L)). Thus, by applying the induction hypothesis to all the nodes in K_l, all the required nodes are ancestors of l.

(2) The pattern of y can be exploited to show that some components of z which are involved in the computation of z_i are zero. Denoting by k_i the highest node in struct(y) ∩ ancestors(i), that is, the highest ancestor of i such that y_{k_i} ≠ 0, we have

    z_k = 0 if k ∈ ancestors(k_i) ;
    z_k ≠ 0 if k ∈ ancestors(i) \ ancestors(k_i) .

Both statements are proved by induction, using the same left-looking scheme. Therefore, the only components of z required lie on the path between i and k_i, the highest node in struct(y) ∩ ancestors(i).

In particular, when y_N is nonzero, as would be the case when y is the vector obtained after forward elimination for the factorization of an irreducible matrix, Lemma 2.3 states that we need to solve the equations that correspond to the nodes that lie in the unique path from the root node to node i. Consider the U given in Fig. 2.1(b),


and suppose we want to compute (Uz = e_6)_2. As we are interested in z_2, we have to compute z_3 and z_5 to be able to solve the second equation; in order to compute z_3, we have to compute z_6 as well. Therefore, we have to solve equation (2.1) for nodes 2, 3, 5, and 6. As seen in Fig. 2.1(c), these correspond to the nodes that are in the unique path from node 6 (the highest ancestor) to node 2. We also note that variable z_4 would be nonzero; however, we will not compute it, because it does not play a role in determining z_2.
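Lemma 2.3 translates into a similarly short routine: among i and its ancestors, find the highest one at which y is nonzero, and return the path from that node down to i. This sketch reuses path_to_root from above, and assumes that struct(y) intersects {i} ∪ ancestors(i), which holds in our setting since y_N ≠ 0.

    def backward_solve_nodes(parent, i, struct_y):
        """Nodes whose equations (2.1) are solved to get z_i = (U^{-1} y)_i
        (Lemma 2.3): the path from the highest node of
        struct(y) ∩ ancestors(i) down to node i."""
        anc = path_to_root(parent, i)        # i and its ancestors, bottom-up
        in_y = set(struct_y)
        k = max(d for d, node in enumerate(anc) if node in in_y)
        return anc[k::-1]                    # from that highest node down to i

    # The example of the text: (Uz = e_6)_2 on the chain 2 -> 3 -> 5 -> 6
    # requires solving for nodes 6, 5, 3, and 2, in that order.
    parent = {2: 3, 3: 5, 5: 6, 6: None}
    print(backward_solve_nodes(parent, 2, [6]))   # [6, 5, 3, 2]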

One can formulate a^{-1}_{ij} in three different ways, each involving linear system solutions with L and U. Consider the general formula

    a^{-1}_{ij} = e_i^T A^{-1} e_j = e_i^T U^{-1} L^{-1} e_j ,    (2.2)

and the three possible ways of parenthesizing these equations. Our method, as shown in (1.1), corresponds to the parenthesization

    ( U^{-1} ( L^{-1} e_j ) )_i .    (2.3)

The other two parenthesizations are

    a^{-1}_{ij} = ( ( e_i^T U^{-1} ) L^{-1} )_j    (2.4)
                = ( e_i^T U^{-1} ) ( L^{-1} e_j ) .    (2.5)

We have the following theorem regarding the equivalence of these three parenthesizations when L and U are computed and stored in such a way that their sparsity patterns are the transposes of each other. In a more general case, the three parenthesizing schemes may differ when L and U^T have different patterns.

Theorem 2.4. The three parenthesizations for computing a^{-1}_{ij} given in equations (2.3)-(2.5) access the same parts of the factor L and the same parts of the factor U.

Proof. Consider the following four computations, on which the different parenthesizations are based:

    v^T = e_i^T U^{-1} ,    (2.6)
    w_i = (U^{-1} y)_i ,    (2.7)
    y = L^{-1} e_j ,    (2.8)
    z_j = (v L^{-1})_j .    (2.9)

As v = U^{-T} e_i, and the pattern of U^T is equal to the pattern of L, by Corollary 2.2 computing v requires accessing the U factors associated with the nodes in the unique path between node i and the root node. Consider z^T = L^{-T} v^T. As the pattern of L^T is equal to the pattern of U, by Lemma 2.3, computing z_j requires accessing the L factors associated with the nodes in the unique path from node j to the highest node in struct(v) ∩ ancestors(j). As v_N = (U^{-T} e_i)_N ≠ 0 (since A is irreducible), this requires accessing the L factors from node j to the root (the Nth node). With similar arguments, y_N ≠ 0, and computing w_i = (U^{-1} y)_i requires accessing the U factors associated with the nodes in the unique path between node i and the root. These observations, combined with another application of Corollary 2.2, this time for L^{-1} e_j, complete the proof.

We now combine Corollary 2.2 and Lemma 2.3 to obtain the theorem.


Figure 2.2. Traversal of the elimination tree for the computation of a^{-1}_{23}. In the first step, Ly = e_3 is solved, yielding y_3, y_5, and y_6 ≠ 0. Then (Uz = y)_2 is found by computing z_6, z_5, z_3, and finally z_2.

Theorem 2.5 (factors to load for computing a particular entry of the inverse; Property 8.9 in [24]). To compute a particular entry a^{-1}_{ij} in A^{-1}, the only factors which have to be loaded are the L factors on the path from node j up to the root node, and the U factors on the path going back from the root to node i.

This theorem establishes the efficiency of the proposed computation scheme: we solve only the equations required for a requested entry of the inverse, both in the forward and backward solve phases. We illustrate the above theorem on the previous example in Fig. 2.2. As discussed before, Ly = e_3 yields nonzero vector entries y_3, y_5, and y_6; then z_2 = (Uz = y)_2 is found after computing z_6, z_5, z_3.

We also note that the third parenthesization (2.5) can be advantageous while computing only the diagonal entries with the factorizations LL^T or LDL^T (with a diagonal D matrix), because in these cases we need to compute a single vector and compute the square of its norm. This formulation can also be useful in a parallel setting, where the solves with L and U can be computed in parallel, whereas in the other two formulations the solves with one of the factors have to wait for the solves with the other one to be completed.

We also note that if the triangular solution procedures for e_i^T U^{-1} and L^{-1} e_j are available, then one can benefit from the third parenthesization in certain cases. If the number of row and column indices concerned by the requested entries is smaller than the number of these requested entries, many of the calculations can be reused if one computes a set of vectors of the form e_i^T U^{-1} and L^{-1} e_j for different i and j, and obtains the requested entries of the inverse by computing the inner products of these vectors. We do not consider this computational scheme in this paper, because such separate solves with L and U are not available within MUMPS.

2.2. Problem definition. We now address the computation of multiple entries and introduce the partitioning problem. We first discuss the diagonal case (that is, computing a set of diagonal entries), and comment on the general case later, in Section 4.

As seen in the previous section, in order to compute a^{-1}_{ii} using the formulation

    y = L^{-1} e_i ,
    a^{-1}_{ii} = (U^{-1} y)_i ,

we have to access the parts of L that correspond to the nodes in the unique path from node i to the root, and then access the parts of U that correspond to the nodes in the same path. As discussed above, these are the necessary and sufficient parts of L


and U that are needed. In other words, we know how to solve efficiently for a single requested diagonal entry of the inverse. Now suppose that we are to compute a set R of diagonal entries of the inverse. As said in Section 1, using the equations above entails storing a dense vector for each requested entry. If |R| is small, then we could again identify all the parts of L and U that need to be loaded for at least one requested entry in R, and then solve for all of R at once, accessing the necessary and sufficient parts of L and U only once. However, |R| is usually large; in the application areas mentioned in Section 1, one often wants to compute a large set of entries, such as the whole diagonal of the inverse (in that case |R| = N). Storing that many dense vectors is not feasible; therefore, the computations proceed in epochs, where at each epoch a limited number of diagonal entries are computed. This entails accessing (loading) some parts of L and U multiple times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors is minimized.

We now formally introduce the problem. Let T be the elimination tree on N nodes, where the factors associated with each node are stored on disks (out-of-core). Let P(i) be the set of nodes in the unique path from node i to the root r, including both nodes i and r. Let w(i) denote the cost of loading the parts of the factors L or U associated with node i of the elimination tree. Similarly, let w(i, j) denote the sum of the costs of the nodes in the path from node i to node j. The cost of solving for a^{-1}_{ii} is therefore

    cost(i) = Σ_{k ∈ P(i)} 2 × w(k) = 2 × w(i, r) .    (2.10)

If we solve for a set R of diagonal entries at once, then the overall cost is therefore

    cost(R) = Σ_{i ∈ P(R)} 2 × w(i) ,  where P(R) = ∪_{i ∈ R} P(i) .
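Both (2.10) and the cost of a set R are straightforward to evaluate from the parent pointers and the node weights. A sketch, reusing path_to_root from the sketch in Section 2.1 (the function names are ours; the second function computes the objective that is formalized as (2.11) below):

    def cost_of_set(parent, w, R):
        """cost(R) = 2 * sum of w over P(R), where P(R) is the union of
        the paths P(i) from each requested node i in R to the root."""
        nodes = set()
        for i in R:
            nodes.update(path_to_root(parent, i))
        return 2 * sum(w[n] for n in nodes)

    def total_cost(parent, w, parts):
        """Total cost over all epochs: the sum of cost(R_k) over the parts."""
        return sum(cost_of_set(parent, w, R_k) for R_k in parts)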

We use B to denote the maximum number of diagonal entries that can be computed at an epoch. This is the number of dense vectors that we must hold, and so is limited by the available storage.

The TreePartitioning problem is formally defined as follows: given a tree T with N nodes, a set R = {i_1, ..., i_m} of nodes in the tree, and an integer B ≤ m, partition R into a number of subsets R_1, R_2, ..., R_K so that |R_k| ≤ B for all k, and the total cost

    cost(R) = Σ_{k=1}^{K} cost(R_k)    (2.11)

is minimum. The number of subsets K is not specified, but obviously K ≥ ⌈m/B⌉. Without loss of generality, we can assume that there is a one-to-one correspondence between R and leaf nodes in T. Indeed, if there is a leaf node i with i ∉ R, then we can delete node i from T. Similarly, if there is an internal node i with i ∈ R, then we create a leaf node i′ of zero weight and make it an additional child of i. For ease of discussion and formulation, for each requested node (leaf or not) of the elimination tree, we add a leaf node with zero weight. To clarify the execution scheme, we now specify the algorithm that computes the diagonal entries of the inverse specified by a given R_k. We first


find P(R_k); we then post-order the nodes in P(R_k), and start loading the associated L factors from the disk and perform the forward solves with L. When we reach the root node, we have |R_k| dense vectors, and we start loading the associated U factors from the disk and perform backward substitutions along the paths that we traversed (in reverse order) during the forward substitutions.

3. Partitioning methods and models. As discussed above, partitioning the requested entries into blocks to minimize the cost of loading factors corresponds to the TreePartitioning problem. In this section, we will focus on the case where all of the requested entries are on the diagonal of the inverse. As noted before, in this case the partial forward and backward solves correspond to visiting the same path in the elimination tree. We analyse the TreePartitioning problem in detail for this case and show that it is NP-complete; we also show that the case where the block size B = 2 is solvable in polynomial time. We provide two heuristics, one with an approximation guarantee (in the sense that we can prove that it is at worst twice as bad as optimal), and the other being somewhat better in practice; we also introduce a hypergraph partitioning-based formulation which is more general than the other heuristics.

Before introducing the algorithms and models, we present a lower bound for the cost of an optimal partition. Let nl(i) denote the number of leaves of the subtree rooted at node i, which can be computed as follows:

    nl(i) = 1,  if i is a leaf node ;
    nl(i) = Σ_{j ∈ children(i)} nl(j),  otherwise .    (3.1)

We note that, as all the leaf nodes correspond to the requested diagonal entries of the inverse, nl(i) corresponds to the number of forward and backward solves that have to be performed at node i.

Given the number of forward and backward solves that pass through a node i, it is easy to define the following lower bound on the amount of the factors loaded.

Theorem 3.1 (lower bound on the amount of factors to load). Let T be a node-weighted tree, w(i) be the weight of node i, B be the maximum allowed size of a part, and nl(i) be the number of leaf nodes in the subtree rooted at i. Then we have the following lower bound, denoted by η, on the cost c of an optimal solution of the TreePartitioning problem:

    η = 2 × Σ_{i ∈ T} w(i) × ⌈nl(i)/B⌉ ≤ c .

Proof. Follows easily by noting that each node i has to be loaded at least ⌈nl(i)/B⌉ times.

As the formula includes nl(·), the lower bounds for wide and shallow trees will usually be smaller than the lower bounds for tall and skinny trees. Each internal node is on a path from (at least) one leaf node; therefore ⌈nl(i)/B⌉ is at least 1, and 2 × Σ_i w(i) ≤ c.
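The lower bound η is computable in a single sweep: push a unit count from each requested leaf up to the root to obtain nl(i), then apply the formula of Theorem 3.1. A sketch (the function name is ours; leaves carry zero weight, so their terms vanish):

    from math import ceil

    def eta_lower_bound(parent, w, leaves, B):
        """eta = 2 * sum_i w(i) * ceil(nl(i)/B), as in Theorem 3.1."""
        nl = {}
        for leaf in leaves:              # each leaf contributes 1 to every
            v = leaf                     # node on its path to the root
            while v is not None:
                nl[v] = nl.get(v, 0) + 1
                v = parent[v]
        return 2 * sum(w[i] * ceil(c / B) for i, c in nl.items())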

Figure 3.1 illustrates the notion of the number of leaves of a subtree and the computation of the lower bound. Entries a^{-1}_{11}, a^{-1}_{33}, and a^{-1}_{44} are requested, and the elimination tree of Figure 2.1(c) is modified accordingly to have leaves (with zero weights) corresponding to these entries. The numbers nl(i) are shown next to the



Figure 3.1. Number of leaves of the subtrees rooted at each node of a transformed elimination tree. The nodes corresponding to the requested diagonal entries of the inverse are shaded, and a leaf node is added for each such entry. Each node is annotated with the number of leaves in the corresponding subtree, resulting in a lower bound of η = 14 with B = 2.

nodes. Suppose that each internal node has unit weight and that the block size is 2. Then the lower bound is

    η = 2 × ( ⌈1/2⌉ + ⌈1/2⌉ + ⌈2/2⌉ + ⌈3/2⌉ + ⌈3/2⌉ ) = 14 .

Recall that we have transformed the elimination tree in such a way that the requested entries now correspond to the leaves, and each leaf corresponds to a requested entry. We have the following computational complexity result.

Theorem 3.2. The TreePartitioning problem is NP-complete.

Proof. We consider the associated decision problem: given a tree T with m leaves, a value of B, and a cost bound c, does there exist a partitioning S of the m leaves into subsets whose size does not exceed B and such that cost(S) ≤ c? It is clear that this problem belongs to NP: if we are given the partition S, it is easy to check in polynomial time that it is valid and that its cost meets the bound c. We now have to prove that the problem is in the NP-complete subset.

To establish the completeness, we use a reduction from 3-PARTITION [14], which is NP-complete in the strong sense. Consider an instance I1 of 3-PARTITION: given a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1≤j≤3p} a_j = pZ, does there exist a partition of {1, ..., 3p} into p disjoint subsets K_1, ..., K_p, each with three elements, such that, for all 1 ≤ i ≤ p, Σ_{j∈K_i} a_j = Z?

We build the following instance I2 of our problem: the tree is a three-level tree composed of N = 1 + 3p + pZ nodes; the root v_r, of cost w_r, has 3p children v_i, of same cost w_v, for 1 ≤ i ≤ 3p. In turn, each v_i has a_i children, each being a leaf node of zero cost. This instance I2 of the TreePartitioning problem is shown in Fig. 3.2. We let B = Z and ask whether there exists a partition of leaf nodes of cost c = p w_r + 3p w_v. Here, w_r and w_v are arbitrary values (we can take w_r = w_v = 1). We note that the cost c corresponds to the lower bound shown in Theorem 3.1: in this lower bound, each internal node v_i is loaded only once, and the root is loaded p times, since it has pZ = pB leaves below it. Note that the size of I2 is polynomial in the size of I1: indeed, because 3-PARTITION is NP-complete in the strong sense, we can encode I1 in unary, and the size of the instance is O(pZ).


Figure 3.2. The instance of the TreePartitioning problem corresponding to a given 3-PARTITION problem. The root r has weight w_r; each of its children v_1, ..., v_{3p} has weight w_v, and v_i has a_i leaf children of zero weight. The minimum cost of a solution for B = Z to the TreePartitioning problem is p × w_r + 3p × w_v, which is only possible when the children of each v_i are all in the same part, and when the children of three different internal nodes, say v_i, v_j, v_k, are put in the same part. This corresponds to putting the numbers a_i, a_j, a_k into a set for the 3-PARTITION problem, which sums up to Z.

Now we show that I1 has a solution if and only if I2 has a solution. Suppose first that I1 has a solution K_1, ..., K_p. The partition of leaf nodes corresponds exactly to the subsets K_i: we build p subsets S_i whose leaves are the children of the vertices v_j with j ∈ K_i. Suppose now that I2 has a solution. To meet the cost bound, each internal node has to be loaded only once, and the root at most p times. This means that the partition involves at most p subsets to cover all leaves. Because there are pZ leaves, each subset is of size exactly Z. Because each internal node is loaded only once, all its leaves belong to the same subset. Altogether, we have found a solution to I1, which concludes the proof.

We can further show that we cannot get a close approximation to the optimal solution in polynomial time.

Theorem 3.3. Unless P=NP, there is no 1 + o(1/N) polynomial approximation algorithm for trees with N nodes for the TreePartitioning problem.

Proof. Assume that there exists a polynomial 1 + ε(N)/N approximation algorithm for trees with N nodes, where lim_{N→∞} ε(N) = 0. Let ε(N) < 1 for N ≥ N_0. Consider an arbitrary instance I0 of 3-PARTITION, with a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1≤j≤3p} a_j = pZ. Without loss of generality, assume that a_i ≥ 2 for all i (hence Z ≥ 6). We ask if we can partition the 3p integers of I0 into p triples of the same sum Z. Now we build an instance I1 of 3-PARTITION by adding X times the integer Z − 2 and 2X times the integer 1 to I0, where X = max( ⌈(N_0 − 1)/(Z + 3)⌉ − p, 1 ). Hence I1 has 3p + 3X integers, and we ask whether these can be partitioned into p + X triples of the same sum Z. Clearly, I0 has a solution if and only if I1 does (the integer Z − 2 can only be in a set with two 1s).

We build an instance I2 of TreePartitioning from I1 exactly as we did in the proof of Theorem 3.2, with w_r = w_v = 1 and B = Z. The only difference is that the value p in the proof has been replaced by p + X here; therefore, the three-level tree now has N = 1 + 3(p + X) + (p + X)Z nodes. Note that X has been chosen so that N ≥ N_0. Just as in the proof of Theorem 3.2, I1 has a solution if and only if the optimal cost for the tree is c = 4(p + X); otherwise the optimal cost is at least 4(p + X) + 1.

If I1 has a solution, and because N ≥ N_0, the approximation algorithm will


return a cost at most

    ( 1 + ε(N)/N ) c ≤ ( 1 + 1/N ) × 4(p + X) = 4(p + X) + 4(p + X)/N .

But 4(p + X)/N = 4(N − 1)/((Z + 3)N) ≤ 4/9 < 1, so that the approximation algorithm can be used to determine whether I1, and hence I0, has a solution. This is a contradiction unless P=NP.

3.1. A partitioning based on post-order. Consider again the case where some entries in the diagonal of the inverse are requested. As said before, the problem of minimizing the size of the factors to be loaded corresponds to the TreePartitioning problem. Consider the heuristic PoPart, shown in Algorithm 1, for this problem.

Algorithm 1 PoPart: a post-order based partitioning

Input: T = (V, E, r) with F leaves; each requested entry corresponds to a leaf node with a zero weight.
Input: B, the maximum allowable size of a part.
Output: Π_PO = {R_1, ..., R_K}, where K = ⌈F/B⌉, a partition of the leaf nodes.
1: compute a post-order
2: L ← sort the leaf nodes according to their rank in post-order
3: R_k = { L(i) : (k − 1) × B + 1 ≤ i ≤ min(k × B, F) }, for k = 1, ..., ⌈F/B⌉

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof. Consider node i. Because the leaves of the subtree rooted at i are sorted consecutively in L, the factors of node i will be loaded at most ⌈nl(i)/B⌉ + 1 times. Therefore, the overall cost is at most

    cost(Π_PO) ≤ 2 × Σ_i w(i) × ( ⌈nl(i)/B⌉ + 1 )
              ≤ η + 2 × Σ_i w(i)
              ≤ 2 × c .

We note that the factor two in the approximation guarantee would be rather loose in practical settings, as Σ_i w(i) would be much smaller than the lower bound η with a practical B and a large number of nodes.


3.2. A special case: two items per part. In this section, we propose algorithms to solve the partitioning problem exactly when B = 2, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for B = 2^k in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph G = (V, V × V) of the leaves of a given tree, and assume that the edge (i, j) represents the decision to put the leaf nodes i and j together in a part. Given this definition of the vertices and edges, we associate the value m(i, j) = cost({i, j}) with the edge (i, j) if i ≠ j, and m(i, i) = Σ_{n∈V} w(n) (or any sufficiently large number). Then a minimum weighted matching in G defines a partitioning of the vertices in V with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of O(|V|^{5/2}) and O(|V|^2) memory requirements. Therefore, we propose yet another exact algorithm for B = 2.

The proposed algorithm, Match, proceeds from the parents of the leaf nodes to the root. At each internal node n, those leaf nodes that are in the subtree rooted at n and which are not yet put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node n, one of them (arbitrarily) is passed to parent(n). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and run time requirements are O(|V|). We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: an exact algorithm for B = 2

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_2 = {R_1, ..., R_K}, where K = ⌈F/2⌉.
1: for each leaf node ℓ do
2:   add ℓ to list(parent(ℓ))
3: compute a post-order
4: k ← 1
5: for each non-leaf node n in post-order do
6:   if n ≠ r and list(n) contains an odd number of vertices then
7:     ℓ ← the node with least weight in list(n)
8:     move ℓ to list(parent(n)); add w(n) to the weight of ℓ    ▷ relay it to the father
9:   else if n = r and list(r) contains an odd number of vertices then
10:    ℓ ← a node with the least weight in list(n)
11:    make ℓ a singleton
12:  for i = 1 to |list(n)| by 2 do
13:    put the ith and (i+1)st vertices in list(n) into R_k; increment k    ▷ match the ith and (i+1)st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n. The sum of the sizes of the list(·) arrays is |V|. The modification is that, when there is an odd number of leaf nodes to partition


at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i, when Match processes node n, is defined as w(i, n) − w(n): the sum of the weights of the nodes in the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
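For concreteness, a sketch of Algorithm 2 with this modification, reusing the postorder helper from the PoPart sketch above; pending plays the role of list(·), and cw[ℓ] holds the cumulative weight w(ℓ, n) − w(n) of a leaf ℓ currently waiting at node n. The function name is ours.

    from collections import defaultdict

    def match(children, parent, root, w, leaves):
        """Match (Algorithm 2): an exact partitioning for B = 2."""
        leaf_set = set(leaves)
        pending = defaultdict(list)          # list(n) of Algorithm 2
        cw = {l: 0 for l in leaves}          # cumulative weights (leaves weigh 0)
        for l in leaves:
            pending[parent[l]].append(l)
        parts = []
        for n in postorder(children, root):
            if n in leaf_set:
                continue
            lst = pending[n]
            if len(lst) % 2 == 1:            # odd number of waiting leaves
                l = min(lst, key=cw.get)     # pick the least cumulative weight
                lst.remove(l)
                if n != root:
                    cw[l] += w[n]            # line 8: relay l to the father
                    pending[parent[n]].append(l)
                else:
                    parts.append([l])        # line 11: singleton at the root
            parts += [lst[i:i + 2] for i in range(0, len(lst), 2)]
        return parts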

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case B = 2^k for some k; the BiseMatch algorithm is shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after log B = k steps, BiseMatch obtains nodes that represent at most B − 1 other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: a heuristic algorithm for B = 2^k

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1: for level = 1 to k do
2:   M ← Match(T)
3:   for each pair (i, j) ∈ M, remove the leaf node j from T and mark the leaf node i as representative
4: clean up the tree T so that all leaf nodes correspond to some requested entry
5: each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if w(i, lca(i, j)) ≥ w(j, lca(i, j)), then we designate i to be the representative of the two by adding (i, j) to M; otherwise, we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model again for diagonal entries, and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph H = (V, N) is defined as a set of vertices V and a set of nets N. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote the weight of the vertex v_j. Costs can be associated with nets; we use c(h_i) to denote the cost associated with the net h_i.

Π = {V_1, ..., V_K} is a K-way vertex partition of H = (V, N) if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net h_i is the set of parts connected by h_i. The connectivity λ(i) = |Λ(i)| of a net h_i is the number of parts connected by h_i. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

    cutsize(Π) = Σ_{h_i ∈ N} (λ(i) − 1) × c(h_i) .    (3.2)

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

    (W_max − W_avg) / W_avg ≤ ε .

Here W_max is the largest part weight, W_avg is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].
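For reference, evaluating the connectivity-1 metric (3.2) for a given partition takes only a few lines. The usage example reproduces the cut of Figure 3.3, discussed in the next subsection; the net costs there are illustrative placeholders, not values from the paper.

    def cutsize(nets, cost, parts):
        """cutsize(Pi) = sum over nets h of (lambda(h) - 1) * c(h), i.e.,
        the connectivity-1 metric (3.2)."""
        part_of = {v: k for k, part in enumerate(parts) for v in part}
        total = 0
        for h, pins in nets.items():
            lam = len({part_of[v] for v in pins})   # lambda(h): parts touched
            total += (lam - 1) * cost[h]
        return total

    # The nets of Figure 3.3 with Pi = {{2}, {1, 5}}: only h4 and h5 are cut.
    nets = {'h1': [1], 'h2': [2], 'h4': [1, 2], 'h5': [1, 2, 5]}
    cost = {'h1': 1, 'h2': 2, 'h4': 1, 'h5': 1}     # illustrative weights
    print(cutsize(nets, cost, [[2], [1, 5]]))       # c(h4) + c(h5) = 2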

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, i.e., c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed, and its cost can be added to


Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

the cost of the net h_n. This way, the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (net h_3 is removed according to the rule described above, and the cost of h_2 includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

    cutsize(V_1, V_2) = c(h_1) × (λ(h_1) − 1) + c(h_2) × (λ(h_2) − 1)
                        + c(h_4) × (λ(h_4) − 1) + c(h_5) × (λ(h_5) − 1)
                      = c(h_4) × (2 − 1) + c(h_5) × (2 − 1)
                      = c(h_4) + c(h_5) .

Consider the first part V_1 = {2}. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part V_2 = {1, 5}. For this part, we have to load the factors associated with the nodes 1, 4, 5. Hence, the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

    y = L^{-1} e_j ,
    a^{-1}_{ij} = (U^{-1} y)_i ,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1} y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.

If we were to exclude the case when more than one entry in the same column of A^{-1} is requested, then we can immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem will remain NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries, and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices i and j where the entry a^{-1}_{ij} is requested.

Figure 4.1 shows an example where the requested entries are a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95}. The transformed elimination tree and the nets of the hypergraphs associated with the forward (h^{fwd}) and backward (h^{bwd}) solves are shown. Note that the nets h^{fwd}_3, as well as h^{bwd}_3, h^{bwd}_4, and h^{bwd}_8, are removed. The nodes of the tree which correspond to


Figure 4.1. Example of the hypergraph model for the general case: a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95} are requested.

the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = { {a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}} }. The cutsize is c(h^{bwd}_5) + c(h^{fwd}_4) + c(h^{fwd}_5). Consider the computation of a^{-1}_{62}. We need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from


the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and, in the UFL index, the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17], and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range 1-200 to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances, and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound. PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for some other five triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3].


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

   B     P           PoPart                   BiseMatch
               min     max     avg       min     max     avg
   4   0.05  1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
       0.10  1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
       0.20  1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
       0.40  1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
       0.60  1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
       0.80  1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
       1.00  1.0000  1.0491  1.0038    1.0000  1.0101  1.0002

  16   0.05  1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
       0.10  1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
       0.20  1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
       0.40  1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
       0.60  1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
       0.80  1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
       1.00  1.0003  1.1052  1.0133    1.0000  1.0096  1.0008

  64   0.05  1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
       0.10  1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
       0.20  1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
       0.40  1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
       0.60  1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
       0.80  1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
       1.00  1.0014  1.2087  1.0315    1.0000  1.0141  1.0024

 256   0.05  1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
       0.10  1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
       0.20  1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
       0.40  1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
       0.60  1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
       0.80  1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
       1.00  1.0043  1.1203  1.0495    1.0001  1.0677  1.0118

Over all triplets   1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor with a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1; the names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16 and 64. In this table,


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                      Total size of the loaded         Running time of the
                        factors (MBytes)               solution phase (s)
matrix       B   Lower bound    NoES   ES-Nat  ES-PoP     NoES   ES-Nat  ES-PoP
CESR21532   16      5403       63313    7855    5422     1113.8    69.4    39.3
            64      1371       15828    2596    1389      359.6    38.1    16.6
CESR46799   16      2399      114051    3158    2417     3962.3    76.7    47.7
            64       620       28512    1176     635      866.3    51.3    28.5
CESR72358   16      1967      375737    6056    2008    10800.9   263.7    71.8
            64       528       93934    4796     571     3174.0   274.1    52.0
CESR148286  16      8068     1595645   16595    8156    43396.7   720.7   268.5
            64      2092      398911   11004    2179    14049.3   726.7   199.8
af23560     16     16720      114672   17864   16745     2080.6   241.1   197.6
            64      4215       28668    5245    4245      668.5   121.0    59.5
ecl32       16     95478      618606  141533   95566    12184.7  2726.3  1760.6
            64     23943      154651   43429   24046     3525.5   974.1   482.9
stokes64    16       721        8503    1026     726      131.2    14.2     8.5
            64       185        2125     425     189       48.8    10.2     4.1
boyd1       16      2028       75521    4232    2031    16551.2   389.8   214.9
            64       515       18880    1406     518     5492.7   230.5   121.2

the values in the column "Lower bound" are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in their natural order into blocks of size B) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total size of the loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour: for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries is computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                      Total size of the loaded    Running time of the
                        factors (MBytes)          solution phase (s)
matrix       B   Lower bound   ES-PoP   ES-HP      ES-PoP   ES-HP
CESR21532   16       563         1782     999        56.6    16.9
            64       164          703     464        28.3    13.6
CESR46799   16       264          549     416        32.2    25.1
            64        93          232     195        35.5    25.3
CESR72358   16       242         1124     868       122.9    73.8
            64       116          794     598        99.5    72.8
CESR148286  16       905         3175    2693       426.0   321.7
            64       345         2080    1669       281.0   235.8
af23560     16      1703         3579    2463       109.2    66.4
            64       458         1219    1003        47.1    34.3
ecl32       16      9617        22514   12615       507.7   265.2
            64      2483         7309    4664       199.2   119.9
stokes64    16        77          188     149         2.9     2.3
            64        26           75      74         1.9     1.7
boyd1       16       205          481     258        39.0    34.4
            64        55          198      93        25.9    24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with this model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) is computed with B = 16 and 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model using PaToH [11], with default options except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We therefore think that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. First, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides


Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1, 2}, {3} is better than the partition {1, 3}, {2}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 x 2 + 1 x 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size in the in-core context: the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N x N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L + U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L + U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least N linear systems and require considerably more memory. On the other hand, if only a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order to compute all the entries that are needed to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this partitioning problem is NP-complete, so that it is necessary to develop heuristics for it.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch being better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case where off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps in showing how to compute off-diagonal inverse entries efficiently, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, Ph.D. thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, 1993.
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, Ph.D. thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.



resolution is one of the main instruments on board this satellite. To obtain a complete sky survey, a very large amount of data acquired by the SPI must be processed. For example, to estimate the total point-source emission contributions (that is, the contribution of a set of sources to the observed field), a linear least-squares problem of about a million equations and a hundred thousand unknowns has to be solved [7]. Once the least-squares problem is solved, the variances of the components of the solution are computed to get access to their standard deviations. The variances of the components of the solution are given by the diagonal elements of the inverse of the variance-covariance matrix A = B^T B, where B is the matrix associated with the least-squares problem.

The approach we use to compute the entries of the inverse relies on a traditional solution method and makes use of the equation AA^{-1} = I. More specifically, we compute a particular entry a^{-1}_{ij} as (A^{-1} e_j)_i; here e_j is the jth column of the identity matrix, and (v)_i denotes the ith component of the vector v (we use v_i to refer to the ith component of a vector when v is not defined by an operation). Using an LU factorization of A, a^{-1}_{ij} is obtained by solving successively two triangular systems:

    y = L^{-1} e_j ,
    a^{-1}_{ij} = (U^{-1} y)_i .        (1.1)
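To make the two-solve scheme concrete, here is a minimal Python sketch using SciPy; it assumes the factors L and U are available as sparse matrices from a factorization without pivoting, and it uses a dense right-hand side (so it does not yet exploit the sparsity discussed next):

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def inverse_entry(L, U, i, j):
    """Compute (A^{-1})_{ij} via y = L^{-1} e_j and then (U^{-1} y)_i,
    as in equation (1.1)."""
    e_j = np.zeros(L.shape[0])
    e_j[j] = 1.0
    y = spsolve_triangular(L.tocsr(), e_j, lower=True)   # forward solve
    z = spsolve_triangular(U.tocsr(), y, lower=False)    # backward solve
    return z[i]
```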

The computational framework in which we consider this problem assumes that the matrix A, whose inverse entries will be computed, has been factorized using a multifrontal or supernodal approach, and that the factors have been stored on disks (out-of-core setting). While solving (1.1), the computational time is therefore dominated by the time required to load the factors from the disk. We see from the above equations that, in the forward substitution phase, the right-hand side (e_j) contains only one nonzero entry, and that, in the backward step, only one entry of the solution vector is required. For efficient computation, we have to take advantage of both these observations, along with the sparsity of A. We note that, even though the vector y will normally be sparse, it will conventionally be stored as a full dense vector. Therefore, when the number of requested entries is high, one cannot hold all the solution vectors in memory at the same time. In this case, the computations are carried out in epochs, where at each epoch a predefined number of requested entries are computed. Table 1.1 contains results for three medium-sized matrices from our motivating application and summarizes what can be gained using our approach (the details of the experimental setting are described later, in Section 5.2). In this table we show the factors loaded and the execution time of the MUMPS (MUltifrontal Massively Parallel Solver) software package [3, 4]; we compare the memory requirement and solution time when sparsity in both the right-hand side and the solution is not exploited (in the column "No ES") and when it is exploited (in the column "ES"). The solution time is shown for both the out-of-core and the in-core case in order to put these execution times in perspective. All diagonal entries of the inverse are computed, where 16 entries are computed at each epoch. Both solution schemes use the natural partition that orders the requested entries according to their indices and puts them in blocks of size 16 in a natural order; that is, the first 16 in the first block, the second 16 in the second block, and so on. We see that, on this class of problems, the volume of factors to be loaded is reduced by a factor of between 36 and 96, with an impact of the same order on the computation time. Our aim in this paper is to further reduce the factors loaded and the execution time by carefully partitioning the requested entries.


Table 1.1
All diagonal entries of the inverse are computed in blocks of size 16. The columns "No ES" correspond to solving the linear systems as if the right-hand side vector were dense. The columns "ES" correspond to solving the linear system while exploiting the sparsity of the right-hand side vectors. The computations were performed on the Intel system defined in Section 5.2.

               Factors loaded (in MB)          Time (in secs)
                                           In-core       Out-of-core
Matrix name      No ES        ES          No ES   ES     No ES   ES
CESR46799       114051      3158           380    21      3962    77
CESR72358       375737      6056          1090    48     10801   264
CESR148286     1595645     16595          4708   188     43397   721

In Section 2.1 we formulate which parts of the factors need to be loaded for a requested entry. At this stage, our contribution is to summarize what exists in the literature, although we believe that our observations on the solution of (Ux = y)_i using the tree are original. Later on, we investigate the case where a large number of entries of the inverse have been requested. This would be the typical case in applications, where we have to proceed in epochs. This entails accessing (loading) some parts of L and U many times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors, summed over the epochs, is minimized. As we shall see in Section 2.2, the problem can be cast as a partitioning problem on trees. In Section 3 we prove that the partitioning problem is NP-complete and that a simple post-order based heuristic gives a solution whose cost is at most twice the cost of an optimal solution. We propose an exact algorithm for a particular case of the partitioning problem and use this algorithm to design a heuristic for a more general case. We also propose a hypergraph-partitioning formulation which provides a solution to the most general problem of computing off-diagonal entries (see Section 4). We illustrate the performance of our techniques by implementing them within MUMPS, both on standard test matrix problems and on specific examples coming from data fitting problems in astrophysics. We present our results in Section 5. Some problems that are related to those covered in this paper are discussed in Section 6, along with directions for further research. We provide some concluding remarks in Section 7.

2. Background and problem definition. Throughout this study we use a direct factorization method for obtaining the inverse entries. We assume that the factorization has taken place and that the factors are stored on disks. The factorization code that we use is the general-purpose linear solver MUMPS, which computes the factorization A = LU if the matrix is unsymmetric, or the factorization A = LDL^T if the matrix is symmetric. MUMPS provides a large range of functionalities, including the ability to run out-of-core; that is, it can store the factors on hard disk when the main memory is not large enough [1, 24]. We consider an out-of-core environment and the associated metrics. We assume that the factorization has taken place using the structure of A + A^T for a pattern-unsymmetric A, where the summation is structural. In this case, the structures of the factors L and U are the transposes of each other. All computation schemes, algorithms, and formulations in this paper assume this last property to hold.

We need the following standard definitions for later.


Figure 2.1. A pattern symmetric matrix A, its factors, and the associated elimination tree: (a) the matrix A, whose nonzeros are shown with blue circles; (b) the pattern of L + U, where the filled-in entries are shown with red squares; (c) the corresponding elimination tree, where the children of a node are drawn below the node itself.

A topological ordering of a tree is an ordering of its nodes such that any node is numbered before its parent. A post-order of a tree is a topological ordering where the nodes in each subtree are numbered consecutively. The least common ancestor of two nodes i and j in a rooted tree, lca(i, j), is the lowest numbered node that lies at the intersection of the unique paths from node i and node j to the root. The ceiling function ⌈x⌉ gives the smallest integer greater than or equal to x. For a set S, |S| denotes the cardinality of S.

2.1. Elimination tree and sparse triangular solves. When the matrix A is structurally symmetric with a zero-free diagonal, the elimination tree represents the storage and computational requirements of its sparse factorization. There are a few equivalent definitions of the elimination tree [21]; we prefer the following one for the purposes of this paper.

Definition 2.1. Assume A = LU, where A is a sparse, structurally symmetric N x N matrix. Then the elimination tree T(A) of A is a tree of N nodes, with the ith node corresponding to the ith column of L, and where the parent relations are defined as follows:

    parent(j) = min{ i : i > j and ℓ_{ij} ≠ 0 },  for j = 1, ..., N - 1 .

For the sake of completeness, we note that if A is reducible, this structure is a forest with one tree for each irreducible block; otherwise it is a tree. We assume, without loss of generality, that the matrix is irreducible. As an example, consider the pattern symmetric matrix A shown in Fig. 2.1(a); the factors and the corresponding elimination tree are shown in Figs. 2.1(b) and 2.1(c).
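Definition 2.1 translates directly into code; the sketch below (illustrative only; real solvers build the tree from A without forming L first) computes the parent array from the pattern of a SciPy CSC factor:

```python
def elimination_tree(L):
    """parent(j) = min{ i > j : l_ij != 0 }, per Definition 2.1.
    L is a scipy.sparse CSC matrix; the root is marked with parent -1."""
    n = L.shape[1]
    parent = [-1] * n
    for j in range(n):
        rows = L.indices[L.indptr[j]:L.indptr[j + 1]]   # row indices of column j
        below = rows[rows > j]
        if below.size > 0:
            parent[j] = int(below.min())
    return parent
```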

Our algorithms for efficiently computing a given set of entries of A^{-1} rely on the elimination tree structure. We take advantage of the following result, which is rewritten from [16, Theorem 2.1].

Corollary 2.2. Assume b is a sparse vector and L is a lower triangular matrix. Then the indices of the nonzero elements of the solution vector x of Lx = b are equal to the indices of the nodes of the elimination tree that are in the paths from the nodes corresponding to nonzero entries of b to the root.

We will use the corollary in the following way: when b is sparse, we need to solve the equations corresponding to the predicted nonzero entries, while setting the other entries of the solution vector x to zero. We note that when b contains a single nonzero


entry, say in its ith component, the equations to be solved are those that correspond to the nodes in the unique path from node i to the root. In any case, assuming the matrix is irreducible, the last entry x_N, corresponding to the root of the tree, is nonzero. Consider, for example, Lx = e_3 for the lower triangular factor L shown in Fig. 2.1(b). Clearly x_1, x_2 = 0 and x_3 = 1/ℓ_{33}. These in turn imply that x_4 = 0 and x_5, x_6 ≠ 0. The nonzero entries correspond to the nodes that are in the unique path from node 3 to the root node 6, as seen in Fig. 2.1(c).
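In code, the equations to solve for a right-hand side e_i are exactly the root path of node i in the elimination tree; a small sketch using the parent-array convention of the previous sketch:

```python
def path_to_root(parent, i):
    """Nodes whose equations must be solved in Lx = e_i (Corollary 2.2)."""
    path = []
    while i != -1:            # -1 marks the (nonexistent) parent of the root
        path.append(i)
        i = parent[i]
    return path
```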

With Corollary 2.2 we are halfway through solving (1.1) efficiently: for a particular entry of the inverse, we solve only the relevant equations involving the L factor (and hence access only the necessary parts of it). Next we show which equations involving U need to be solved when only a particular entry of the solution is requested, thereby specifying the whole solution process for an entry of the inverse.

Lemma 2.3. In order to obtain the ith component of the solution to Uz = y, that is, z_i = (U^{-1} y)_i, one has to solve the equations corresponding to the nodes that are in the unique path from the highest node in struct(y) ∩ ancestors(i) to i, where struct(y) denotes the nodes associated with the nonzero entries of y, and ancestors(i) denotes the set of ancestors of node i in the elimination tree.

Proof. We first show (1) that the set of components of z involved in the computation of z_i is the set of ancestors of i in the elimination tree. We then (2) reduce this set using the structure of y.

(1) We prove by top-down induction on the tree that the only components involved in the computation of any component z_l of z are the ancestors of l in the elimination tree.
  - Root node: z_N is computed as z_N = y_N / u_{NN} (thus requiring no other component of z), and it has no ancestor in the tree.
  - For any node l, following a left-looking scheme, z_l is computed as

        z_l = ( y_l - Σ_{k=l+1}^{N} u_{lk} z_k ) / u_{ll}
            = ( y_l - Σ_{k : u_{lk} ≠ 0} u_{lk} z_k ) / u_{ll} .        (2.1)

    All the nodes in the set K_l = { k : u_{lk} ≠ 0 } are ancestors of l, by definition of the elimination tree (since struct(U^T) = struct(L)). Thus, by applying the induction hypothesis to all the nodes in K_l, all the required nodes are ancestors of l.

(2) The pattern of y can be exploited to show that some components of z which are involved in the computation of z_i are zero. Noting k_i as the highest node in struct(y) ∩ ancestors(i), that is, the highest ancestor of i such that y_{k_i} ≠ 0, we have

        z_k = 0  if k ∈ ancestors(k_i),
        z_k ≠ 0  if k ∈ ancestors(i) \ ancestors(k_i).

Both statements are proved by induction using the same left-looking scheme. Therefore the only components of z required lie on the path between i and k_i, the highest node in struct(y) ∩ ancestors(i).

In particular, when y_N is nonzero, as would be the case when y is the vector obtained after the forward elimination for the factorization of an irreducible matrix, Lemma 2.3 states that we need to solve the equations that correspond to the nodes lying in the unique path from the root node to node i. Consider the U given in Fig. 2.1(b),


and suppose we want to compute (Uz = e_6)_2. As we are interested in z_2, we have to compute z_3 and z_5 to be able to solve the second equation; in order to compute z_3, we have to compute z_6 as well. Therefore we have to solve equation (2.1) for nodes 2, 3, 5, and 6. As seen in Fig. 2.1(c), these correspond to the nodes that are in the unique path from node 6 (the highest ancestor) to node 2. We also note that variable z_4 would be nonzero; however, we do not compute it, because it does not play a role in determining z_2.

One can formulate a^{-1}_{ij} in three different ways, each involving linear system solutions with L and U. Consider the general formula

    a^{-1}_{ij} = e_i^T A^{-1} e_j = e_i^T U^{-1} L^{-1} e_j        (2.2)

and the three possible ways of parenthesizing these equations. Our method, as shown in (1.1), corresponds to the parenthesization

    ( U^{-1} ( L^{-1} e_j ) )_i .        (2.3)

The other two parenthesizations are

    a^{-1}_{ij} = ( ( e_i^T U^{-1} ) L^{-1} )_j        (2.4)
                = ( e_i^T U^{-1} ) ( L^{-1} e_j ) .    (2.5)

We have the following theorem regarding the equivalence of these three parenthesizations when L and U are computed and stored in such a way that their sparsity patterns are the transposes of each other. In a more general case, the three parenthesizing schemes may differ when L and U^T have different patterns.

Theorem 2.4. The three parenthesizations for computing a^{-1}_{ij} given in equations (2.3)–(2.5) access the same parts of the factor L and the same parts of the factor U.

Proof. Consider the following four computations, on which the different parenthesizations are based:

    v^T = e_i^T U^{-1} ,        (2.6)
    w_i = (U^{-1} y)_i ,        (2.7)
    y = L^{-1} e_j ,            (2.8)
    z_j = (v L^{-1})_j .        (2.9)

As v = U^{-T} e_i and the pattern of U^T is equal to the pattern of L, by Corollary 2.2, computing v requires accessing the U factors associated with the nodes in the unique path between node i and the root node. Consider z^T = L^{-T} v^T. As the pattern of L^T is equal to the pattern of U, by Lemma 2.3, z_j requires accessing the L factors associated with the nodes in the unique path from node j to the highest node in struct(v) ∩ ancestors(j). As v_N = (U^{-T} e_i)_N ≠ 0 (since A is irreducible), this requires accessing the L factors from node j to the root (the Nth node). With similar arguments, y_N ≠ 0, and computing w_i = (U^{-1} y)_i requires accessing the U factors associated with the nodes in the unique path between node i and the root. These observations, combined with another application of Corollary 2.2, this time for L^{-1} e_j, complete the proof.

We now combine Corollary 2.2 and Lemma 2.3 to obtain the following theorem.


Figure 2.2. Traversal of the elimination tree for the computation of a^{-1}_{23}. In the first step, Ly = e_3 is solved, yielding y_3, y_5, and y_6 ≠ 0. Then (Uz = y)_2 is found by computing z_6, z_5, z_3, and finally z_2.

Theorem 2.5 (factors to load for computing a particular entry of the inverse; Property 8.9 in [24]). To compute a particular entry a^{-1}_{ij} of A^{-1}, the only factors which have to be loaded are the L factors on the path from node j up to the root node, and the U factors on the path going back from the root to node i.

This theorem establishes the efficiency of the proposed computation scheme: we solve only the equations required for a requested entry of the inverse, both in the forward and the backward solve phases. We illustrate the above theorem on the previous example in Fig. 2.2. As discussed before, Ly = e_3 yields nonzero vector entries y_3, y_5, and y_6, and then z_2 = (Uz = y)_2 is found after computing z_6, z_5, and z_3.

We note that the third parenthesization (2.5) can be advantageous when computing only the diagonal entries with the factorizations LL^T or LDL^T (with a diagonal matrix D), because in these cases we need to compute only a single vector and the square of its norm. This formulation can also be useful in a parallel setting, where the solves with L and U can be computed in parallel, whereas in the other two formulations the solves with one of the factors have to wait for the solves with the other one to be completed.
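For instance, in the SPD case A = LL^T, parenthesization (2.5) gives a^{-1}_{ii} as the squared norm of L^{-1} e_i; a minimal sketch of this, under the same SciPy assumptions as the earlier sketch:

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def diagonal_inverse_entry_chol(L, i):
    """(A^{-1})_{ii} = ||L^{-1} e_i||_2^2 when A = L L^T."""
    e_i = np.zeros(L.shape[0])
    e_i[i] = 1.0
    v = spsolve_triangular(L.tocsr(), e_i, lower=True)  # single solve with L
    return float(v @ v)                                 # squared 2-norm
```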

We also note that if triangular solution procedures for e_i^T U^{-1} and L^{-1} e_j are available, then one can benefit from the third parenthesization in certain cases. If the number of row and column indices concerned by the requested entries is smaller than the number of these requested entries, many of the calculations can be reused if one computes a set of vectors of the form e_i^T U^{-1} and L^{-1} e_j for different i and j, and obtains the requested entries of the inverse by computing the inner products of these vectors. We do not consider this computational scheme in this paper, because such separate solves with L and U are not available within MUMPS.

2.2. Problem definition. We now address the computation of multiple entries and introduce the partitioning problem. We first discuss the diagonal case (that is, computing a set of diagonal entries) and comment on the general case later, in Section 4.

As seen in the previous section, in order to compute a^{-1}_{ii} using the formulation

    y = L^{-1} e_i ,
    a^{-1}_{ii} = (U^{-1} y)_i ,

we have to access the parts of L that correspond to the nodes in the unique path from node i to the root, and then access the parts of U that correspond to the nodes in the same path. As discussed above, these are the necessary and sufficient parts of L


and U that are needed. In other words, we know how to solve efficiently for a single requested diagonal entry of the inverse. Now suppose that we are to compute a set R of diagonal entries of the inverse. As said in Section 1, using the equations above entails storing a dense vector for each requested entry. If |R| is small, then we could again identify all the parts of L and U that need to be loaded for at least one requested entry in R and then solve for all of R at once, accessing the necessary and sufficient parts of L and U only once. However, |R| is usually large in the application areas mentioned in Section 1: one often wants to compute a large set of entries, such as the whole diagonal of the inverse (in which case |R| = N). Storing that many dense vectors is not feasible; therefore the computations proceed in epochs, where at each epoch a limited number of diagonal entries are computed. This entails accessing (loading) some parts of L and U multiple times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors is minimized.

We now formally introduce the problem. Let T be the elimination tree on N nodes, where the factors associated with each node are stored on disks (out-of-core). Let P(i) be the set of nodes in the unique path from node i to the root r, including both nodes i and r. Let w(i) denote the cost of loading the parts of the factors L or U associated with node i of the elimination tree. Similarly, let w(i, j) denote the sum of the costs of the nodes in the path from node i to node j. The cost of solving for a^{-1}_{ii} is therefore

    cost(i) = Σ_{k ∈ P(i)} 2 x w(k) = 2 x w(i, r) .        (2.10)

If we solve for a set R of diagonal entries at once, then the overall cost is

    cost(R) = Σ_{i ∈ P(R)} 2 x w(i) ,  where  P(R) = ∪_{i ∈ R} P(i) .

We use B to denote the maximum number of diagonal entries that can be computed in an epoch. This is the number of dense vectors that we must hold, and so it is limited by the available storage.

The TreePartitioning problem is formally defined as follows: given a tree T with N nodes, a set R = {i_1, ..., i_m} of nodes in the tree, and an integer B ≤ m, partition R into a number of subsets R_1, R_2, ..., R_K so that |R_k| ≤ B for all k, and the total cost

    cost(R) = Σ_{k=1}^{K} cost(R_k)        (2.11)

is minimum. The number of subsets, K, is not specified, but obviously K ≥ ⌈m/B⌉. Without loss of generality, we can assume that there is a one-to-one correspondence between R and the leaf nodes of T. Indeed, if there is a leaf node i with i ∉ R, then we can delete node i from T. Similarly, if there is an internal node i with i ∈ R, then we create a leaf node i' of zero weight and make it an additional child of i. For ease of discussion and formulation, for each requested node (leaf or not) of the elimination tree, we add a leaf node with zero weight. To clarify the execution scheme, we now specify the algorithm that computes the diagonal entries of the inverse specified by a given R_k. We first


find P(R_k); we then post-order the nodes in P(R_k) and start loading the associated L factors from the disk, performing the forward solves with L. When we reach the root node, we have |R_k| dense vectors, and we start loading the associated U factors from the disk and perform the backward substitutions along the paths that we traversed (in reverse order) during the forward substitutions.
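Evaluating the objective (2.11) for a candidate partition only requires the parent array and the node weights; a sketch (same conventions as the earlier sketches):

```python
def partition_cost(parent, w, parts):
    """Total factor-loading cost of a partition: each part R_k costs
    2 * sum of w over P(R_k), the union of the root paths of its nodes."""
    total = 0
    for part in parts:
        visited = set()
        for i in part:
            while i != -1 and i not in visited:
                visited.add(i)
                i = parent[i]
        total += 2 * sum(w[n] for n in visited)
    return total
```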

3. Partitioning methods and models. As discussed above, partitioning the requested entries into blocks to minimize the cost of loading factors corresponds to the TreePartitioning problem. In this section we focus on the case where all of the requested entries are on the diagonal of the inverse. As noted before, in this case the partial forward and backward solves correspond to visiting the same path in the elimination tree. We analyse the TreePartitioning problem in detail for this case and show that it is NP-complete; we also show that the case where the block size B = 2 is solvable in polynomial time. We provide two heuristics, one with an approximation guarantee (in the sense that we can prove that it is at worst twice as bad as optimal) and the other somewhat better in practice; we also introduce a hypergraph partitioning-based formulation which is more general than the other heuristics.

Before introducing the algorithms and models, we present a lower bound for the cost of an optimal partition. Let nl(i) denote the number of leaves of the subtree rooted at node i, which can be computed as follows:

    nl(i) = 1                                 if i is a leaf node,
    nl(i) = Σ_{j ∈ children(i)} nl(j)         otherwise.        (3.1)

We note that, as all the leaf nodes correspond to the requested diagonal entries of the inverse, nl(i) corresponds to the number of forward and backward solves that have to be performed at node i.

Given the number of forward and backward solves that pass through a node i, it is easy to define the following lower bound on the amount of the factors loaded.

Theorem 3.1 (lower bound on the amount of factors to load). Let T be a node-weighted tree, w(i) be the weight of node i, B be the maximum allowed size of a part, and nl(i) be the number of leaf nodes in the subtree rooted at i. Then we have the following lower bound, denoted by η, on the optimal solution c of the TreePartitioning problem:

    η = 2 x Σ_{i ∈ T} w(i) x ⌈nl(i)/B⌉ ≤ c .

Proof. The bound follows easily by noting that each node i has to be loaded at least ⌈nl(i)/B⌉ times.

As the formula includes nl(·), the lower bounds for wide and shallow trees will usually be smaller than the lower bounds for tall and skinny trees. Each internal node is on a path from (at least) one leaf node; therefore ⌈nl(i)/B⌉ is at least 1, and 2 x Σ_i w(i) ≤ c.

Figure 3.1 illustrates the notion of the number of leaves of a subtree and the computation of the lower bound. Entries a^{-1}_{11}, a^{-1}_{33}, and a^{-1}_{44} are requested, and the elimination tree of Figure 2.1(c) is modified accordingly to have leaves (with zero weights) corresponding to these entries. The numbers nl(i) are shown next to the


Figure 3.1. Number of leaves of the subtrees rooted at each node of a transformed elimination tree. The nodes corresponding to the requested diagonal entries of the inverse are shaded, and a leaf node is added for each such entry. Each node is annotated with the number of leaves in the corresponding subtree, resulting in a lower bound of η = 14 with B = 2.

nodes. Suppose that each internal node has unit weight and that the block size is 2. Then the lower bound is

    η = 2 x ( ⌈1/2⌉ + ⌈1/2⌉ + ⌈2/2⌉ + ⌈3/2⌉ + ⌈3/2⌉ ) = 14 .
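The bound is cheap to compute by pushing leaf counts up the tree; a sketch consistent with (3.1) and Theorem 3.1 (parent-array convention as in the earlier sketches):

```python
from math import ceil

def lower_bound(parent, w, leaves, B):
    """eta = 2 * sum over nodes i of w(i) * ceil(nl(i)/B) (Theorem 3.1)."""
    nl = [0] * len(parent)
    for leaf in leaves:     # each leaf adds one to every node on its root path
        i = leaf
        while i != -1:
            nl[i] += 1
            i = parent[i]
    return 2 * sum(w[i] * ceil(nl[i] / B) for i in range(len(parent)) if nl[i])
```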

Recall that we have transformed the elimination tree in such a way that the requested entries now correspond to the leaves, and each leaf corresponds to a requested entry. We have the following computational complexity result.

Theorem 3.2. The TreePartitioning problem is NP-complete.

Proof. We consider the associated decision problem: given a tree T with m leaves, a value of B, and a cost bound c, does there exist a partitioning S of the m leaves into subsets whose size does not exceed B and such that cost(S) ≤ c? It is clear that this problem belongs to NP, since, if we are given the partition S, it is easy to check in polynomial time that it is valid and that its cost meets the bound c. We now have to prove that the problem is in the NP-complete subset.

To establish the completeness, we use a reduction from 3-PARTITION [14], which is NP-complete in the strong sense. Consider an instance I_1 of 3-PARTITION: given a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1 ≤ j ≤ 3p} a_j = pZ, does there exist a partition of {1, ..., 3p} into p disjoint subsets K_1, ..., K_p, each with three elements, such that Σ_{j ∈ K_i} a_j = Z for all 1 ≤ i ≤ p?

We build the following instance I_2 of our problem: the tree is a three-level tree composed of N = 1 + 3p + pZ nodes; the root r, of cost w_r, has 3p children v_i, of identical cost w_v, for 1 ≤ i ≤ 3p; in turn, each v_i has a_i children, each being a leaf node of zero cost. This instance I_2 of the TreePartitioning problem is shown in Fig. 3.2. We let B = Z and ask whether there exists a partition of the leaf nodes of cost c = p x w_r + 3p x w_v. Here w_r and w_v are arbitrary values (we can take w_r = w_v = 1). We note that the cost c corresponds to the lower bound shown in Theorem 3.1: in this lower bound, each internal node v_i is loaded only once, and the root is loaded p times, since it has pZ = pB leaves below it. Note that the size of I_2 is polynomial in the size of I_1; indeed, because 3-PARTITION is NP-complete in the strong sense, we can encode I_1 in unary, and the size of the instance is then O(pZ).


Figure 3.2. The instance of the TreePartitioning problem corresponding to a given 3-PARTITION problem. The weight of each node is shown next to the node. The minimum cost of a solution with B = Z is p x w_r + 3p x w_v, which is achievable only when the children of each v_i are all in the same part and when the children of three different internal nodes, say v_i, v_j, v_k, are put in the same part. This corresponds to putting the numbers a_i, a_j, a_k into a set for the 3-PARTITION problem, which sums up to Z.

Now we show that I_1 has a solution if and only if I_2 has a solution. Suppose first that I_1 has a solution K_1, ..., K_p. The partition of the leaf nodes corresponds exactly to the subsets K_i: we build p subsets S_i whose leaves are the children of the vertices v_j with j ∈ K_i. Suppose now that I_2 has a solution. To meet the cost bound, each internal node has to be loaded only once, and the root at most p times. This means that the partition involves at most p subsets to cover all the leaves. Because there are pZ leaves, each subset is of size exactly Z. Because each internal node is loaded only once, all its leaves belong to the same subset. Altogether, we have found a solution to I_1, which concludes the proof.

We can further show that we cannot get a close approximation to the optimal solution in polynomial time.

Theorem 3.3. Unless P = NP, there is no 1 + o(1/N) polynomial approximation for trees with N nodes in the TreePartitioning problem.

Proof. Assume that there exists a polynomial (1 + ε(N)/N)-approximation algorithm for trees with N nodes, where lim_{N→∞} ε(N) = 0. Let ε(N) < 1 for N ≥ N_0. Consider an arbitrary instance I_0 of 3-PARTITION, with a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1 ≤ j ≤ 3p} a_j = pZ. Without loss of generality, assume that a_i ≥ 2 for all i (hence Z ≥ 6). We ask if we can partition the 3p integers of I_0 into p triples of the same sum Z. Now we build an instance I_1 of 3-PARTITION by adding X times the integer Z - 2 and 2X times the integer 1 to I_0, where

    X = max( ⌈(N_0 - 1)/(Z + 3)⌉ - p , 1 ) .

Hence I_1 has 3p + 3X integers, and we ask whether these can be partitioned into p + X triples of the same sum Z. Clearly I_0 has a solution if and only if I_1 does (the integer Z - 2 can only be in a set with two 1s).

We build an instance I_2 of TreePartitioning from I_1 exactly as we did in the proof of Theorem 3.2, with w_r = w_v = 1 and B = Z. The only difference is that the value p of that proof has been replaced by p + X here; therefore the three-level tree now has N = 1 + 3(p + X) + (p + X)Z nodes. Note that X has been chosen so that N ≥ N_0. Just as in the proof of Theorem 3.2, I_1 has a solution if and only if the optimal cost for the tree is c = 4(p + X); otherwise the optimal cost is at least 4(p + X) + 1.

If I_1 has a solution, then, because N ≥ N_0, the approximation algorithm will


return a cost of at most

    ( 1 + ε(N)/N ) x c ≤ ( 1 + 1/N ) x 4(p + X) = 4(p + X) + 4(p + X)/N .

But 4(p + X)/N = 4(N - 1)/((Z + 3)N) ≤ 4/9 < 1, so the approximation algorithm can be used to determine whether I_1, and hence I_0, has a solution. This is a contradiction unless P = NP.

3.1. A partitioning based on post-order. Consider again the case where some entries of the diagonal of the inverse are requested. As said before, the problem of minimizing the size of the factors to be loaded corresponds to the TreePartitioning problem. Consider the heuristic PoPart, shown in Algorithm 1, for this problem.

Algorithm 1 PoPart: a post-order based partitioning
Input: T = (V, E, r) with F leaves; each requested entry corresponds to a leaf node with a zero weight.
Input: B, the maximum allowable size of a part.
Output: Π_PO = {R_1, ..., R_K}, where K = ⌈F/B⌉, a partition of the leaf nodes.
1: compute a post-order of T
2: L ← the leaf nodes, sorted according to their rank in the post-order
3: R_k ← { L(i) : (k - 1) x B + 1 ≤ i ≤ min{k x B, F} }, for k = 1, ..., ⌈F/B⌉

As seen in Algorithm 1, the PoPart heuristic first orders the leaf nodes according to their post-order. It then puts the first B leaves in the first part, the next B leaves in the second part, and so on. This simple partitioning approach results in ⌈F/B⌉ parts for a tree with F leaf nodes, and puts B nodes in each part, except possibly in the last one. We have the following theorem, which states that this simple heuristic obtains results that are at most twice the cost of an optimal solution.
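A compact Python rendering of Algorithm 1, under our own conventions (the tree given as a children map, leaves identified as childless nodes):

```python
def po_part(children, root, B):
    """PoPart: collect the leaves in post-order, then cut the sorted
    list into consecutive chunks of size at most B."""
    post, stack = [], [(root, False)]
    while stack:                        # iterative post-order traversal
        node, expanded = stack.pop()
        if expanded:
            post.append(node)
        else:
            stack.append((node, True))
            for child in children.get(node, []):
                stack.append((child, False))
    leaves = [n for n in post if not children.get(n)]
    return [leaves[k:k + B] for k in range(0, len(leaves), B)]
```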

Theorem 3.4. Let Π_PO be the partition obtained by the algorithm PoPart, and let c be the cost of an optimal solution; then

    cost(Π_PO) ≤ 2 x c .

Proof. Consider node i. Because the leaves of the subtree rooted at i are sorted consecutively in L, the factors of node i will be loaded at most ⌈nl(i)/B⌉ + 1 times. Therefore the overall cost is at most

    cost(Π_PO) ≤ 2 x Σ_i w(i) x ( ⌈nl(i)/B⌉ + 1 )
               ≤ η + 2 x Σ_i w(i)
               ≤ 2 x c .

We note that the factor of two in the approximation guarantee would be rather loose in practical settings, as Σ_i w(i) would be much smaller than the lower bound η with a practical B and a large number of nodes.


3.2. A special case: two items per part. In this section we propose algorithms that solve the partitioning problem exactly when B = 2, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for B = 2^k in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph G = (V, V × V) of the leaves of a given tree, and assume that the edge (i, j) represents the decision to put the leaf nodes i and j together in a part. Given this definition of the vertices and edges, we associate the value m(i, j) = cost({i, j}) with the edge (i, j) if i ≠ j, and m(i, i) = Σ_{n∈V} w(n) (or any sufficiently large number). Then a minimum weighted matching in G defines a partitioning of the vertices in V with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of O(|V|^{5/2}) and O(|V|^2) memory requirements. Therefore, we propose another exact algorithm for B = 2.

The proposed algorithm, Match, proceeds from the parents of the leaf nodes to the root. At each internal node n, those leaf nodes that are in the subtree rooted at n and which have not yet been put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node n, one of them (arbitrarily) is passed to parent(n). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and run time requirements are O(|V|). We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: An exact algorithm for B = 2
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_2 = {R_1, ..., R_K}, where K = ⌈F/2⌉.
1: for each leaf node ℓ do
2:   add ℓ to list(parent(ℓ))
3: compute a post-order
4: k ← 1
5: for each non-leaf node n in post-order do
6:   if n ≠ r and list(n) contains an odd number of vertices then
7:     ℓ ← the node with least weight in list(n)
8:     move ℓ to list(parent(n)), add w(n) to the weight of ℓ    ▷ relay it to the father
9:   else if n = r and list(r) contains an odd number of vertices then
10:    ℓ ← a node with the least weight in list(n)
11:    make ℓ a singleton
12:  for i = 1 to |list(n)| by 2 do
13:    put the ith and (i + 1)st vertices in list(n) into R_k; increment k    ▷ match the ith and (i + 1)st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n; the sum of the sizes of the list(·) arrays is |V|. The modification is that, when there is an odd number of leaf nodes to partition at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i when Match processes node n is defined as w(i, n) − w(n), the sum of the weights of the nodes in the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes that have traversed longer paths are put into parts earlier, while those with smaller cumulative weights are relayed further up.
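A Python sketch of Algorithm 2 follows; the tree is given as a parent map with node weights w, and all names are our own. The cumulative-weight bookkeeping of line 8 is the `light[0] += w[n]` update.

```python
def match_partition(parent, root, leaves, w):
    """Match: exact partitioning for B = 2. Leaves climb towards the
    root; at each internal node the waiting leaves are paired off, and
    on an odd count the one of least cumulative weight is relayed up."""
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    postorder = []                        # post-order over all nodes
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            postorder.append(node)
        else:
            stack.append((node, True))
            for child in children.get(node, []):
                stack.append((child, False))

    waiting = {n: [] for n in postorder}  # list(n) of Algorithm 2
    for leaf in leaves:
        waiting[parent[leaf]].append([0, leaf])   # [cumulative weight, leaf]

    parts = []
    for n in postorder:
        pending = sorted(waiting[n])
        if len(pending) % 2 == 1:
            light = pending.pop(0)                 # least cumulative weight
            if n != root:
                light[0] += w[n]                   # line 8: relay to the father
                waiting[parent[n]].append(light)
            else:
                parts.append([light[1]])           # line 11: singleton
        for i in range(0, len(pending), 2):
            parts.append([pending[i][1], pending[i + 1][1]])
    return parts
```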

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case where B = 2^k for some k: the BiseMatch algorithm, shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after log B = k steps BiseMatch obtains nodes that each represent at most B − 1 other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: A heuristic algorithm for B = 2^k
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1: for level = 1 to k do
2:   M ← Match(T)
3:   for each pair (i, j) ∈ M: remove the leaf node j from T and mark the leaf node i as representative
4:   clean up the tree T so that all leaf nodes correspond to some requested entry
5: each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if w(i, lca(i, j)) ≥ w(j, lca(i, j)), we designate i to be the representative of the two by adding (i, j) to M; otherwise, we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.
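On top of match_partition above, BiseMatch can be sketched as follows. For brevity, this simplified version keeps the first leaf of each pair as the representative and does not prune relieved chains from the tree; Algorithm 3 additionally picks the representative by comparing the cumulative weights up to the least common ancestor.

```python
def bise_match(parent, root, leaves, w, k):
    """BiseMatch sketch for B = 2**k: k rounds of Match, halving the
    set of principal leaves each time (simplified representative rule)."""
    represents = {leaf: [leaf] for leaf in leaves}
    current = list(leaves)
    for _ in range(k):
        survivors = []
        for pair in match_partition(parent, root, current, w):
            keep = pair[0]                  # representative (simplified choice)
            for other in pair[1:]:
                represents[keep].extend(represents.pop(other))
            survivors.append(keep)
        current = survivors
    return [represents[leaf] for leaf in current]
```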

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we again give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph H = (V, N) is defined as a set of vertices V and a set of nets N. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote the weight of the vertex v_j. Costs can be associated with nets; we use c(h_i) to denote the cost associated with the net h_i.

Π = {V_1, ..., V_K} is a K-way vertex partition of H = (V, N) if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net h_i is the set of parts connected by h_i. The connectivity λ(i) = |Λ(i)| of a net h_i is the number of parts connected by h_i. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

    cutsize(Π) = Σ_{h_i ∈ N} (λ(i) − 1) × c(h_i).    (3.2)

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on the part weights:

    (W_max − W_avg) / W_avg ≤ ε.

Here W_max is the largest part weight, W_avg is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, i.e., c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed and its cost can be added to the cost of the net h_n. This way, the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.

Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right. (Only the caption of the figure is retained here.)

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore, V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (the net h_3 is removed according to the rule described above, and the cost of h_2 includes the weight of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, and h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

    cutsize(V_1, V_2) = c(h_1)(λ(h_1) − 1) + c(h_2)(λ(h_2) − 1) + c(h_4)(λ(h_4) − 1) + c(h_5)(λ(h_5) − 1)
                      = c(h_4)(2 − 1) + c(h_5)(2 − 1)
                      = c(h_4) + c(h_5).

Consider the first part, V_1 = {2}: we have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, V_2 = {1, 5}: for this part, we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).
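The cutsize computation in this example can be checked mechanically; below is a small Python sketch with the nets of Figure 3.3, where the net costs (the node weights w(4), w(5), etc.) are made-up values for illustration.

```python
def cutsize(nets, cost, part_of):
    """Connectivity-1 metric: sum over nets of (lambda - 1) * cost."""
    total = 0
    for h, vertices in nets.items():
        lam = len({part_of[v] for v in vertices})   # connectivity lambda(h)
        total += (lam - 1) * cost[h]
    return total

nets = {"h1": [1], "h2": [2], "h4": [1, 2], "h5": [1, 2, 5]}
cost = {"h1": 3, "h2": 5, "h4": 2, "h5": 4}   # hypothetical path weights
part_of = {2: 1, 1: 2, 5: 2}                   # V1 = {2}, V2 = {1, 5}
print(cutsize(nets, cost, part_of))            # c(h4) + c(h5) = 6
```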

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

    y = L^{-1} e_j,
    a^{-1}_{ij} = (U^{-1} y)_i,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1}y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.

If we were to exclude the case where more than one entry in the same column of A^{-1} is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries, and once for the row indices. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that, when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that repeated indices are distinguished by associating them with different leaf nodes. We then have to amalgamate any two vertices i and j where the entry a^{-1}_{ij} is requested.

Figure 4.1 shows an example where the requested entries are a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95}. The transformed elimination tree and the nets of the hypergraphs associated with the forward (h^{fwd}) and backward (h^{bwd}) solves are shown. Note that the nets h^{fwd}_3, as well as h^{bwd}_3, h^{bwd}_4, and h^{bwd}_8, are removed. The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded in light grey; those which correspond to the vertices of the hypergraph for the backward solves are shaded in dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = {{a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}}}; its cutsize is c(h^{bwd}_5) + c(h^{fwd}_4) + c(h^{fwd}_5). Consider the computation of a^{-1}_{62}: we need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

Figure 4.1. Example of the hypergraph model for the general case: a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95} are requested. (Only the caption of the figure is retained here.)

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with each ordered matrix using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range [1, 200] to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that, for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1, giving results for B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for each triplet of a matrix, B, and P, we took the average of the ratios over the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases; a similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that, when a high percentage of the diagonal entries are requested, the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                      PoPart                    BiseMatch
B     P       min     max     avg       min     max     avg
4     0.05    1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
      0.10    1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
      0.20    1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
      0.40    1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
      0.60    1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
      0.80    1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
      1.00    1.0000  1.0491  1.0038    1.0000  1.0101  1.0002
16    0.05    1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
      0.10    1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
      0.20    1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
      0.40    1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
      0.60    1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
      0.80    1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
      1.00    1.0003  1.1052  1.0133    1.0000  1.0096  1.0008
64    0.05    1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
      0.10    1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
      0.20    1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
      0.40    1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
      0.60    1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
      0.80    1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
      1.00    1.0014  1.2087  1.0315    1.0000  1.0141  1.0024
256   0.05    1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
      0.10    1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
      0.20    1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
      0.40    1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
      0.60    1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
      0.80    1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
      1.00    1.0043  1.1203  1.0495    1.0001  1.0677  1.0118
Over all triplets   1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor with a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1; the names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16, 64. In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process, under, respectively, the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with an increasing epoch size, because it will combine nodes that are far from each other.

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                        Total size of the loaded factors (MB)      Running time of the solution phase (s)
matrix         B    Lower bound   NoES      ES-Nat   ES-PoP        NoES       ES-Nat    ES-PoP
CESR21532      16   5403          63313     7855     5422          1113.8     69.4      39.3
               64   1371          15828     2596     1389          359.6      38.1      16.6
CESR46799      16   2399          114051    3158     2417          3962.3     76.7      47.7
               64   620           28512     1176     635           866.3      51.3      28.5
CESR72358      16   1967          375737    6056     2008          10800.9    263.7     71.8
               64   528           93934     4796     571           3174.0     274.1     52.0
CESR148286     16   8068          1595645   16595    8156          43396.7    720.7     268.5
               64   2092          398911    11004    2179          14049.3    726.7     199.8
af23560        16   16720         114672    17864    16745         2080.6     241.1     197.6
               64   4215          28668     5245     4245          668.5      121.0     59.5
ecl32          16   95478         618606    141533   95566         12184.7    2726.3    1760.6
               64   23943         154651    43429    24046         3525.5     974.1     482.9
stokes64       16   721           8503      1026     726           131.2      14.2      8.5
               64   185           2125      425      189           48.8       10.2      4.1
boyd1          16   2028          75521     4232     2031          16551.2    389.8     214.9
               64   515           18880     1406     518           5492.7     230.5     121.2

We see that the execution times are proportional to the total volume of loaded factors. On the majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour: for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence the slightly different results (e.g., on CESR72358, the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.

Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. The out-of-core executions use direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                        Total size of the loaded factors (MB)   Running time of the solution phase (s)
matrix         B    Lower bound   ES-PoP   ES-HP                ES-PoP    ES-HP
CESR21532      16   563           1782     999                  56.6      16.9
               64   164           703      464                  28.3      13.6
CESR46799      16   264           549      416                  32.2      25.1
               64   93            232      195                  35.5      25.3
CESR72358      16   242           1124     868                  122.9     73.8
               64   116           794      598                  99.5      72.8
CESR148286     16   905           3175     2693                 426.0     321.7
               64   345           2080     1669                 281.0     235.8
af23560        16   1703          3579     2463                 109.2     66.4
               64   458           1219     1003                 47.1      34.3
ecl32          16   9617          22514    12615                507.7     265.2
               64   2483          7309     4664                 199.2     119.9
stokes64       16   77            188      149                  2.9       2.3
               64   26            75       74                   1.9       1.7
boyd1          16   205           481      258                  39.0      34.4
               64   55            198      93                   25.9      24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4), and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We therefore think that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4). (Only the caption of the figure is retained here.)

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, differ from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L + U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L + U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. must set up the necessary data structures and determine the computational order needed to obtain all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree, followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests of the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch being the better of the two in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve performance in parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work is needed on that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15. doi:10.1016/j.parco.2009.06.001

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications (1993).

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.


Table 1.1
All diagonal entries of the inverse are computed in blocks of size 16. The columns No ES correspond to solving the linear systems as if the right-hand side vectors were dense. The columns ES correspond to solving the linear systems while exploiting the sparsity of the right-hand side vectors. The computations were performed on the Intel system defined in Section 5.2.

                   Factors loaded (MB)     Time, in-core (s)    Time, out-of-core (s)
Matrix name        No ES      ES           No ES    ES          No ES    ES
CESR46799          114051     3158         380      21          3962     77
CESR72358          375737     6056         1090     48          10801    264
CESR148286         1595645    16595        4708     188         43397    721

In Section 2.1, we formulate which parts of the factors need to be loaded for a requested entry. At this stage, our contribution is to summarize what exists in the literature, although we believe that our observations on the solution of (Ux = y)_i using the tree are original. Later on, we investigate the case where a large number of entries of the inverse have been requested. This would be the typical case in applications, when we have to proceed in epochs. This entails accessing (loading) some parts of L and U many times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors, summed over the epochs, is minimized. As we shall see in Section 2.2, the problem can be cast as a partitioning problem on trees. In Section 3, we prove that the partitioning problem is NP-complete, and that a simple post-order based heuristic gives a solution whose cost is at most twice the cost of an optimal solution. We propose an exact algorithm for a particular case of the partitioning problem, and use this algorithm to design a heuristic for a more general case. We also propose a hypergraph-partitioning formulation which provides a solution to the most general problem of computing off-diagonal entries (see Section 4). We illustrate the performance of our techniques by implementing them within MUMPS, both on standard test matrix problems and on specific examples coming from data fitting problems in astrophysics. We present our results in Section 5. Some problems that are related to those covered in this paper are discussed in Section 6, along with directions for further research. We provide some concluding remarks in Section 7.

2. Background and problem definition. Throughout this study, we use a direct factorization method for obtaining the inverse entries. We assume that the factorization has taken place and that the factors are stored on disk. The factorization code that we use is the general-purpose linear solver MUMPS, which computes the factorization A = LU if the matrix is unsymmetric, or the factorization A = LDL^T if the matrix is symmetric. MUMPS provides a large range of functionalities, including the ability to run out-of-core, that is, it can store the factors on hard disk when the main memory is not large enough [1, 24]. We consider an out-of-core environment and the associated metrics. We assume that the factorization has taken place using the structure of A + A^T for a pattern-unsymmetric A, where the summation is structural. In this case, the structures of the factors L and U are the transposes of each other. All computation schemes, algorithms, and formulations in this paper assume this last property to hold.

We need the following standard definitions for later. A topological ordering of a tree is an ordering of its nodes such that any node is numbered before its parent. A post-order of a tree is a topological ordering where the nodes in each subtree are numbered consecutively. The least common ancestor of two nodes i and j in a rooted tree, lca(i, j), is the lowest numbered node that lies at the intersection of the unique paths from node i and node j to the root. The ceiling function ⌈x⌉ gives the smallest integer greater than or equal to x. For a set S, |S| denotes the cardinality of S.

Figure 2.1. A pattern symmetric matrix A, its factors, and the associated elimination tree: (a) the matrix A, whose nonzeros are shown with blue circles; (b) the pattern of L + U, where the filled-in entries are shown with red squares; (c) the corresponding elimination tree, where the children of a node are drawn below the node itself. (Only the caption of the figure is retained here.)

2.1. Elimination tree and sparse triangular solves. When the matrix A is structurally symmetric with a zero-free diagonal, the elimination tree represents the storage and computational requirements of its sparse factorization. There are a few equivalent definitions of the elimination tree [21]; we prefer the following one for the purposes of this paper.

Definition 2.1. Assume A = LU, where A is a sparse, structurally symmetric N × N matrix. Then the elimination tree T(A) of A is a tree of N nodes, with the ith node corresponding to the ith column of L, and where the parent relations are defined as follows:

    parent(j) = min{i : i > j and ℓ_{ij} ≠ 0}, for j = 1, ..., N − 1.

For the sake of completeness, we note that if A is reducible, this structure is a forest, with one tree for each irreducible block; otherwise, it is a tree. We assume without loss of generality that the matrix is irreducible. As an example, consider the pattern symmetric matrix A shown in Fig. 2.1(a). The factors and the corresponding elimination tree are shown in Figs. 2.1(b) and 2.1(c).

Our algorithms for efficiently computing a given set of entries of A^{-1} rely on the elimination tree structure. We take advantage of the following result, which is rewritten from [16, Theorem 2.1].

Corollary 2.2. Assume b is a sparse vector and L is a lower triangular matrix. Then the indices of the nonzero elements of the solution vector x of Lx = b are equal to the indices of the nodes of the elimination tree that are in the paths from the nodes corresponding to the nonzero entries of b to the root.

We will use the corollary in the following way: when b is sparse, we need to solve only the equations corresponding to the predicted nonzero entries, while setting the other entries of the solution vector x to zero. We note that, when b contains a single nonzero entry, say in its ith component, the equations to be solved are those that correspond to the nodes in the unique path from node i to the root. In any case, assuming the matrix is irreducible, the last entry x_N, corresponding to the root of the tree, is nonzero. Consider, for example, Lx = e_3 for the lower triangular factor L shown in Fig. 2.1(b). Clearly x_1, x_2 = 0 and x_3 = 1/ℓ_{33}. These in turn imply that x_4 = 0 and x_5, x_6 ≠ 0. The nonzero entries correspond to the nodes that are in the unique path from node 3 to the root node 6, as seen in Fig. 2.1(c).
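Corollary 2.2 reduces structure prediction to path traversals in the tree; a Python sketch using a parent array like the one returned above (the helper name is ours):

```python
def solve_structure(parent, nonzero_rows):
    """Indices of the (predicted) nonzeros of x in Lx = b, where
    nonzero_rows holds the nonzero indices of b: the union of the tree
    paths from those nodes to the root (Corollary 2.2)."""
    struct = set()
    for node in nonzero_rows:
        while node != -1 and node not in struct:   # stop on a visited path
            struct.add(node)
            node = parent[node]
    return sorted(struct)

# With parent = [2, 3, 3, -1] from above: b = e_0 yields x nonzero on {0, 2, 3}.
print(solve_structure([2, 3, 3, -1], [0]))   # [0, 2, 3]
```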

With Corollary 2.2, we are halfway through solving (1.1) efficiently: for a particular entry of the inverse, we solve only the relevant equations involving the L factor (and hence access only the necessary parts of it). Next, we show which equations involving U need to be solved when only a particular entry of the solution is requested, thereby specifying the whole solution process for an entry of the inverse.

Lemma 2.3. In order to obtain the ith component of the solution to Uz = y, that is, z_i = (U^{-1}y)_i, one has to solve the equations corresponding to the nodes that are in the unique path from the highest node in struct(y) ∩ ancestors(i) to i, where struct(y) denotes the nodes associated with the nonzero entries of y, and ancestors(i) denotes the set of ancestors of node i in the elimination tree.

Proof. We first show (1) that the set of components of z involved in the computation of z_i is the set of ancestors of i in the elimination tree. We then (2) reduce this set using the structure of y.

(1) We prove by top-down induction on the tree that the only components involved in the computation of any component z_l of z are the ancestors of l in the elimination tree. The root node z_N is computed as z_N = y_N / u_{NN} (thus requiring no other component of z) and has no ancestor in the tree. For any node l, following a left-looking scheme, z_l is computed as

    z_l = ( y_l − Σ_{k=l+1}^{N} u_{lk} z_k ) / u_{ll} = ( y_l − Σ_{k : u_{lk} ≠ 0} u_{lk} z_k ) / u_{ll}.    (2.1)

All the nodes in the set K_l = {k : u_{lk} ≠ 0} are ancestors of l, by definition of the elimination tree (since struct(U^T) = struct(L)). Thus, by applying the induction hypothesis to all the nodes in K_l, all the required nodes are ancestors of l.

(2) The pattern of y can be exploited to show that some components of z which are involved in the computation of z_i are zero. Denoting by k_i the highest node in struct(y) ∩ ancestors(i), that is, the highest ancestor of i such that y_{k_i} ≠ 0, we have

    z_k = 0 if k ∈ ancestors(k_i),
    z_k ≠ 0 if k ∈ ancestors(i) \ ancestors(k_i).

Both statements are proved by induction using the same left-looking scheme. Therefore, the only components of z required lie on the path between i and k_i, the highest node in struct(y) ∩ ancestors(i).

In particular, when y_N is nonzero, as would be the case when y is the vector obtained after forward elimination for the factorization of an irreducible matrix, Lemma 2.3 states that we need to solve the equations that correspond to the nodes that lie in the unique path from the root node to node i. Consider the U given in Fig. 2.1(b), and suppose we want to compute (Uz = e_6)_2. As we are interested in z_2, we have to compute z_3 and z_5 to be able to solve the second equation; in order to compute z_3, we have to compute z_6 as well. Therefore, we have to solve equation (2.1) for nodes 2, 3, 5, and 6. As seen in Fig. 2.1(c), these correspond to the nodes that are in the unique path from node 6 (the highest ancestor) to node 2. We also note that variable z_4 would be nonzero; however, we will not compute it, because it does not play a role in determining z_2.

One can formulate a^{-1}_{ij} in three different ways, each involving linear system solutions with L and U. Consider the general formula

    a^{-1}_{ij} = e_i^T A^{-1} e_j = e_i^T U^{-1} L^{-1} e_j,    (2.2)

and the three possible ways of parenthesizing these equations. Our method, as shown in (1.1), corresponds to the parenthesization

    ( U^{-1} ( L^{-1} e_j ) )_i.    (2.3)

The other two parenthesizations are

    a^{-1}_{ij} = ( ( e_i^T U^{-1} ) L^{-1} )_j    (2.4)
               = ( e_i^T U^{-1} ) ( L^{-1} e_j ).    (2.5)

We have the following theorem regarding the equivalence of these three parenthesizations when L and U are computed and stored in such a way that their sparsity patterns are the transposes of each other. In a more general case, the three parenthesizing schemes may differ when L and U^T have different patterns.

Theorem 2.4. The three parenthesizations for computing a^{-1}_{ij} given in equations (2.3) to (2.5) access the same parts of the factor L and the same parts of the factor U.

Proof. Consider the following four computations, on which the different parenthesizations are based:

    v^T = e_i^T U^{-1},    (2.6)
    w_i = ( U^{-1} y )_i,    (2.7)
    y = L^{-1} e_j,    (2.8)
    z_j = ( v L^{-1} )_j.    (2.9)

As v = U^{-T} e_i and the pattern of U^T is equal to the pattern of L, by Corollary 2.2, computing v requires accessing the U factors associated with the nodes in the unique path between node i and the root node. Consider z^T = L^{-T} v^T. As the pattern of L^T is equal to the pattern of U, by Lemma 2.3, computing z_j requires accessing the L factors associated with the nodes in the unique path from node j to the highest node in struct(v) ∩ ancestors(j). As v_N = (U^{-T} e_i)_N ≠ 0 (since A is irreducible), this requires accessing the L factors from node j to the root (the Nth node). With similar arguments, y_N ≠ 0, and computing w_i = (U^{-1} y)_i requires accessing the U factors associated with the nodes in the unique path between node i and the root. These observations, combined with another application of Corollary 2.2, this time for L^{-1} e_j, complete the proof.

We now combine Corollary 2.2 and Lemma 2.3 to obtain the theorem.


Figure 2.2. Traversal of the elimination tree for the computation of $a^{-1}_{23}$. In the first step, $Ly = e_3$ is solved, yielding $y_3$, $y_5$, and $y_6 \neq 0$. Then $(Uz = y)_2$ is found by computing $z_6$, $z_5$, $z_3$, and finally $z_2$.

Theorem 2.5 (Factors to load for computing a particular entry of the inverse; Property 8.9 in [24]). To compute a particular entry $a^{-1}_{ij}$ of $A^{-1}$, the only factors which have to be loaded are the $L$ factors on the path from node $j$ up to the root node, and the $U$ factors on the path going back from the root to node $i$.

This theorem establishes the efficiency of the proposed computation scheme: we solve only the equations required for a requested entry of the inverse, both in the forward and the backward solve phases. We illustrate the theorem on the previous example in Fig. 2.2: as discussed before, $Ly = e_3$ yields the nonzero vector entries $y_3$, $y_5$, and $y_6$, and then $z_2 = (Uz = y)_2$ is found after computing $z_6$, $z_5$, $z_3$.

We note that the third parenthesization (25) can be advantageous while comput-ing only the diagonal entries with the factorizations LLT or LDLT (with a diagonalD matrix) because in these cases we need to compute a single vector and computethe square of its norm This formulation can also be useful in a parallel setting wherethe solves with L and U can be computed in parallel whereas in the other two for-mulations the solves with one of the factors has to wait for the solves with the otherone to be completed

We also note that, if triangular solution procedures for $e_i^T U^{-1}$ and $L^{-1} e_j$ are available, then one can benefit from the third parenthesization in certain cases. If the number of row and column indices concerned by the requested entries is smaller than the number of these requested entries, many of the calculations can be reused: one computes a set of vectors of the form $e_i^T U^{-1}$ and $L^{-1} e_j$ for the different $i$ and $j$, and obtains the requested entries of the inverse by computing the inner products of these vectors. We do not consider this computational scheme in this paper, because such separate solves with $L$ and $U$ are not available within MUMPS.

2.2. Problem definition. We now address the computation of multiple entries and introduce the partitioning problem. We first discuss the diagonal case (that is, computing a set of diagonal entries) and comment on the general case later, in Section 4.

As seen in the previous section, in order to compute $a^{-1}_{ii}$ using the formulation

$$y = L^{-1} e_i, \qquad a^{-1}_{ii} = \left( U^{-1} y \right)_i,$$

we have to access the parts of $L$ that correspond to the nodes in the unique path from node $i$ to the root, and then access the parts of $U$ that correspond to the nodes in the same path. As discussed above, these are the necessary and sufficient parts of $L$ and $U$ that are needed. In other words, we know how to solve efficiently for a single requested diagonal entry of the inverse. Now suppose that we are to compute a set $R$ of diagonal entries of the inverse. As said in Section 1, using the equations above entails storing a dense vector for each requested entry. If $|R|$ is small, then we could again identify all the parts of $L$ and $U$ that need to be loaded for at least one requested entry in $R$, and then solve for all of $R$ at once, accessing the necessary and sufficient parts of $L$ and $U$ only once. However, $|R|$ is usually large in the application areas mentioned in Section 1: one often wants to compute a large set of entries, such as the whole diagonal of the inverse (in that case $|R| = N$). Storing that many dense vectors is not feasible; therefore, the computations proceed in epochs, where at each epoch a limited number of diagonal entries are computed. This entails accessing (loading) some parts of $L$ and $U$ multiple times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors is minimized.

We now formally introduce the problem. Let $T$ be the elimination tree on $N$ nodes, where the factors associated with each node are stored on disk (out-of-core). Let $P(i)$ denote the set of nodes in the unique path from node $i$ to the root $r$, including both nodes $i$ and $r$. Let $w(i)$ denote the cost of loading the parts of the factors $L$ or $U$ associated with node $i$ of the elimination tree; similarly, let $w(i,j)$ denote the sum of the costs of the nodes in the path from node $i$ to node $j$. The cost of solving for $a^{-1}_{ii}$ is therefore

$$\mathrm{cost}(i) = \sum_{k \in P(i)} 2 \times w(k) = 2 \times w(i, r). \qquad (2.10)$$

If we solve for a set $R$ of diagonal entries at once, then the overall cost is

$$\mathrm{cost}(R) = \sum_{i \in P(R)} 2 \times w(i), \quad \text{where } P(R) = \bigcup_{i \in R} P(i).$$

We use $B$ to denote the maximum number of diagonal entries that can be computed at an epoch. This is the number of dense vectors that we must hold, and so it is limited by the available storage.

The TreePartitioning problem is formally defined as follows: given a tree $T$ with $N$ nodes, a set $R = \{i_1, \ldots, i_m\}$ of nodes in the tree, and an integer $B \le m$, partition $R$ into a number of subsets $R_1, R_2, \ldots, R_K$ so that $|R_k| \le B$ for all $k$, and the total cost

$$\mathrm{cost}(R) = \sum_{k=1}^{K} \mathrm{cost}(R_k) \qquad (2.11)$$

is minimum. The number of subsets $K$ is not specified, but obviously $K \ge \lceil m/B \rceil$. Without loss of generality, we can assume that there is a one-to-one correspondence between $R$ and the leaf nodes of $T$. Indeed, if there is a leaf node $i$ with $i \notin R$, then we can delete node $i$ from $T$; similarly, if there is an internal node $i$ with $i \in R$, then we create a leaf node $i'$ of zero weight and make it an additional child of $i$. For ease of discussion and formulation, for each requested node (leaf or not) of the elimination tree, we add a leaf node with zero weight. To clarify the execution scheme, we now specify the algorithm that computes the diagonal entries of the inverse specified by a given $R_k$. We first find $P(R_k)$; we then post-order the nodes in $P(R_k)$, start loading the associated $L$ factors from the disk, and perform the forward solves with $L$. When we reach the root node, we have $|R_k|$ dense vectors, and we start loading the associated $U$ factors from the disk and perform backward substitutions along the paths that we traversed (in reverse order) during the forward substitutions.
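The cost (2.11) of a candidate partition is easy to evaluate from the tree alone, without touching the factors. A small sketch follows (Python; the parent and weight maps are a hypothetical encoding of the tree and node costs):

    def epoch_cost(parent, w, Rk):
        """cost(R_k): twice the total weight of P(R_k), the union of the root
        paths of the entries in R_k (equations (2.10) and (2.11))."""
        union = set()
        for i in Rk:
            n = i
            while n is not None and n not in union:   # stop where paths merge
                union.add(n)
                n = parent[n]
        return 2 * sum(w[n] for n in union)

    def partition_cost(parent, w, parts):
        """Total cost of a partition {R_1, ..., R_K} as in (2.11)."""
        return sum(epoch_cost(parent, w, Rk) for Rk in parts)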

3. Partitioning methods and models. As discussed above, partitioning the requested entries into blocks to minimize the cost of loading the factors corresponds to the TreePartitioning problem. In this section we focus on the case where all of the requested entries are on the diagonal of the inverse. As noted before, in this case the partial forward and backward solves correspond to visiting the same path in the elimination tree. We analyse the TreePartitioning problem in detail for this case and show that it is NP-complete; we also show that the case with block size $B = 2$ is solvable in polynomial time. We provide two heuristics, one with an approximation guarantee (in the sense that we can prove that it is at worst twice as bad as optimal) and the other somewhat better in practice; we also introduce a hypergraph partitioning-based formulation which is more general than the other heuristics.

Before introducing the algorithms and models, we present a lower bound for the cost of an optimal partition. Let $\mathrm{nl}(i)$ denote the number of leaves of the subtree rooted at node $i$, which can be computed as follows:

$$\mathrm{nl}(i) = \begin{cases} 1 & \text{if } i \text{ is a leaf node}, \\ \sum_{j \in \mathrm{children}(i)} \mathrm{nl}(j) & \text{otherwise}. \end{cases} \qquad (3.1)$$

We note that, as all the leaf nodes correspond to the requested diagonal entries of the inverse, $\mathrm{nl}(i)$ corresponds to the number of forward and backward solves that have to be performed at node $i$.

Given the number of forward and backward solves that pass through a node $i$, it is easy to derive the following lower bound on the amount of factors loaded.

Theorem 3.1 (Lower bound on the amount of factors to load). Let $T$ be a node-weighted tree, $w(i)$ the weight of node $i$, $B$ the maximum allowed size of a part, and $\mathrm{nl}(i)$ the number of leaf nodes in the subtree rooted at $i$. Then we have the following lower bound, denoted by $\eta$, on the optimal solution $c^*$ of the TreePartitioning problem:

$$\eta = 2 \times \sum_{i \in T} w(i) \times \left\lceil \frac{\mathrm{nl}(i)}{B} \right\rceil \le c^*.$$

Proof. Follows easily by noting that each node $i$ has to be loaded at least $\lceil \mathrm{nl}(i)/B \rceil$ times.

As the formula involves $\mathrm{nl}(\cdot)$, the lower bounds for wide and shallow trees will usually be smaller than the lower bounds for tall and skinny trees. Each internal node is on a path from (at least) one leaf node; therefore $\lceil \mathrm{nl}(i)/B \rceil$ is at least 1, and $2 \times \sum_i w(i) \le c^*$.

Figure 3.1 illustrates the notion of the number of leaves of a subtree and the computation of the lower bound. Entries $a^{-1}_{11}$, $a^{-1}_{33}$, and $a^{-1}_{44}$ are requested, and the elimination tree of Figure 2.1(c) is modified accordingly to have leaves (with zero weights) corresponding to these entries. The numbers $\mathrm{nl}(i)$ are shown next to the nodes.


Figure 3.1. Number of leaves of the subtrees rooted at each node of a transformed elimination tree. The nodes corresponding to the requested diagonal entries of the inverse are shaded, and a leaf node is added for each such entry. Each node is annotated with the number of leaves in the corresponding subtree, resulting in a lower bound of $\eta = 14$ with $B = 2$.

Suppose that each internal node has unit weight and that the block size is 2. Then the lower bound is

$$\eta = 2 \times \left( \left\lceil \tfrac{1}{2} \right\rceil + \left\lceil \tfrac{1}{2} \right\rceil + \left\lceil \tfrac{2}{2} \right\rceil + \left\lceil \tfrac{3}{2} \right\rceil + \left\lceil \tfrac{3}{2} \right\rceil \right) = 14.$$
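The bound is computed in one bottom-up pass over the tree. A minimal sketch (Python) reproduces the example; the children-map encoding of the tree of Figure 3.1 is our own illustrative reconstruction.

    from math import ceil

    def eta_lower_bound(children, weight, B, root):
        """Lower bound of Theorem 3.1: eta = 2 * sum_i w(i) * ceil(nl(i)/B)."""
        nl = {}
        def count_leaves(n):
            kids = children.get(n, [])
            # nl(i) = 1 for a leaf, else the sum over the children (eq. (3.1))
            nl[n] = 1 if not kids else sum(count_leaves(c) for c in kids)
            return nl[n]
        count_leaves(root)
        return 2 * sum(weight[n] * ceil(nl[n] / B) for n in nl)

    # Figure 3.1: unit-weight internal nodes; zero-weight leaves 'l1', 'l3',
    # 'l4' added for the requested entries a^{-1}_{11}, a^{-1}_{33}, a^{-1}_{44}.
    children = {6: [5], 5: [3, 4], 4: [1, 'l4'], 3: ['l3'], 1: ['l1']}
    weight = {6: 1, 5: 1, 4: 1, 3: 1, 1: 1, 'l1': 0, 'l3': 0, 'l4': 0}
    print(eta_lower_bound(children, weight, B=2, root=6))   # -> 14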

Recall that we have transformed the elimination tree in such a way that the requested entries now correspond to the leaves, and each leaf corresponds to a requested entry. We have the following computational complexity result.

Theorem 3.2. The TreePartitioning problem is NP-complete.

Proof. We consider the associated decision problem: given a tree $T$ with $m$ leaves, a value of $B$, and a cost bound $c$, does there exist a partitioning $S$ of the $m$ leaves into subsets whose size does not exceed $B$ and such that $\mathrm{cost}(S) \le c$? It is clear that this problem belongs to NP: if we are given the partition $S$, it is easy to check in polynomial time that it is valid and that its cost meets the bound $c$. We now have to prove that the problem is NP-hard.

To establish the completeness, we use a reduction from 3-PARTITION [14], which is NP-complete in the strong sense. Consider an instance $I_1$ of 3-PARTITION: given a set $\{a_1, \ldots, a_{3p}\}$ of $3p$ integers and an integer $Z$ such that $\sum_{1 \le j \le 3p} a_j = pZ$, does there exist a partition of $\{1, \ldots, 3p\}$ into $p$ disjoint subsets $K_1, \ldots, K_p$, each with three elements, such that for all $1 \le i \le p$, $\sum_{j \in K_i} a_j = Z$?

We build the following instance $I_2$ of our problem: the tree is a three-level tree composed of $N = 1 + 3p + pZ$ nodes; the root $v_r$, of cost $w_r$, has $3p$ children $v_i$, of identical cost $w_v$, for $1 \le i \le 3p$. In turn, each $v_i$ has $a_i$ children, each being a leaf node of zero cost. This instance $I_2$ of the TreePartitioning problem is shown in Fig. 3.2. We let $B = Z$ and ask whether there exists a partition of leaf nodes of cost $c = p\,w_r + 3p\,w_v$. Here $w_r$ and $w_v$ are arbitrary values (we can take $w_r = w_v = 1$). We note that the cost $c$ corresponds to the lower bound shown in Theorem 3.1: in this lower bound, each internal node $v_i$ is loaded only once, and the root is loaded $p$ times, since it has $pZ = pB$ leaves below it. Note that the size of $I_2$ is polynomial in the size of $I_1$: indeed, because 3-PARTITION is NP-complete in the strong sense, we can encode $I_1$ in unary, and then the size of the instance is $O(pZ)$.


Figure 3.2. The instance of the TreePartitioning problem corresponding to a given 3-PARTITION problem. The weight of each node is shown next to the node. The minimum cost of a solution with $B = Z$ to the TreePartitioning problem is $p \times w_r + 3p \times w_v$, which is only possible when the children of each $v_i$ are all in the same part and when the children of three different internal nodes, say $v_i, v_j, v_k$, are put in the same part. This corresponds to putting the numbers $a_i, a_j, a_k$, which sum up to $Z$, into a set for the 3-PARTITION problem.

Now we show that $I_1$ has a solution if and only if $I_2$ has a solution. Suppose first that $I_1$ has a solution $K_1, \ldots, K_p$. The partition of leaf nodes corresponds exactly to the subsets $K_i$: we build $p$ subsets $S_i$ whose leaves are the children of the vertices $v_j$ with $j \in K_i$. Suppose now that $I_2$ has a solution. To meet the cost bound, each internal node has to be loaded only once, and the root at most $p$ times. This means that the partition involves at most $p$ subsets to cover all leaves. Because there are $pZ$ leaves, each subset is of size exactly $Z$. Because each internal node is loaded only once, all its leaves belong to the same subset. Altogether, we have found a solution to $I_1$, which concludes the proof.

We can further show that we cannot get a close approximation to the optimal solution in polynomial time.

Theorem 3.3. Unless P=NP, there is no $1 + o(\frac{1}{N})$ polynomial approximation algorithm for trees with $N$ nodes in the TreePartitioning problem.

Proof. Assume that there exists a polynomial $\left(1 + \frac{\varepsilon(N)}{N}\right)$-approximation algorithm for trees with $N$ nodes, where $\lim_{N \to \infty} \varepsilon(N) = 0$. Let $N_0$ be such that $\varepsilon(N) < 1$ for $N \ge N_0$. Consider an arbitrary instance $I_0$ of 3-PARTITION, with a set $\{a_1, \ldots, a_{3p}\}$ of $3p$ integers and an integer $Z$ such that $\sum_{1 \le j \le 3p} a_j = pZ$. Without loss of generality, assume that $a_i \ge 2$ for all $i$ (hence $Z \ge 6$). We ask if we can partition the $3p$ integers of $I_0$ into $p$ triples of the same sum $Z$. Now we build an instance $I_1$ of 3-PARTITION by adding $X$ times the integer $Z - 2$ and $2X$ times the integer 1 to $I_0$, where

$$X = \max\left( \left\lceil \frac{N_0 - 1}{Z + 3} \right\rceil - p,\ 1 \right).$$

Hence $I_1$ has $3p + 3X$ integers, and we ask whether these can be partitioned into $p + X$ triples of the same sum $Z$. Clearly, $I_0$ has a solution if and only if $I_1$ does (the integer $Z - 2$ can only be in a set with two 1s).

We build an instance $I_2$ of TreePartitioning from $I_1$ exactly as we did in the proof of Theorem 3.2, with $w_r = w_v = 1$ and $B = Z$. The only difference is that the value $p$ in the proof has been replaced by $p + X$ here; therefore the three-level tree now has $N = 1 + 3(p+X) + (p+X)Z$ nodes. Note that $X$ has been chosen so that $N \ge N_0$. Just as in the proof of Theorem 3.2, $I_1$ has a solution if and only if the optimal cost for the tree is $c = 4(p+X)$; otherwise the optimal cost is at least $4(p+X) + 1$.

If $I_1$ has a solution, then, because $N \ge N_0$, the approximation algorithm will return a cost at most

$$\left(1 + \frac{\varepsilon(N)}{N}\right) c \le \left(1 + \frac{1}{N}\right) 4(p+X) = 4(p+X) + \frac{4(p+X)}{N}.$$

But $\frac{4(p+X)}{N} = \frac{4(N-1)}{(Z+3)N} \le \frac{4}{9} < 1$, so that the approximation algorithm can be used to determine whether $I_1$, and hence $I_0$, has a solution. This is a contradiction unless P=NP.

3.1. A partitioning based on a post-order. Consider again the case where some entries in the diagonal of the inverse are requested. As said before, the problem of minimizing the size of the factors to be loaded corresponds to the TreePartitioning problem. Consider the heuristic PoPart, shown in Algorithm 1, for this problem.

Algorithm 1 PoPart: a post-order based partitioning
Input: $T = (V, E, r)$ with $F$ leaves; each requested entry corresponds to a leaf node with a zero weight.
Input: $B$, the maximum allowable size of a part.
Output: $\Pi_{PO} = \{R_1, \ldots, R_K\}$, where $K = \lceil F/B \rceil$, a partition of the leaf nodes.
1: compute a post-order
2: $L \leftarrow$ the leaf nodes, sorted according to their rank in the post-order
3: $R_k \leftarrow \{L(i) : (k-1) \times B + 1 \le i \le \min(k \times B, F)\}$ for $k = 1, \ldots, \lceil F/B \rceil$

As seen in Algorithm 1, the PoPart heuristic first orders the leaf nodes according to their post-order. It then puts the first $B$ leaves in the first part, the next $B$ leaves in the second part, and so on. This simple partitioning approach results in $\lceil F/B \rceil$ parts for a tree with $F$ leaf nodes and puts $B$ nodes in each part, except possibly in the last one. We have the following theorem, which states that this simple heuristic obtains results that are at most twice the cost of an optimal solution.
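Once a post-order is available, the heuristic is only a few lines of code. A minimal sketch follows (Python, reusing the hypothetical children-map encoding of the earlier sketches):

    def po_part(children, root, requested_leaves, B):
        """PoPart (Algorithm 1): chop the post-ordered leaves into parts of size B."""
        order, stack = [], [root]
        while stack:                     # DFS from the root ...
            n = stack.pop()
            order.append(n)
            stack.extend(children.get(n, []))
        order.reverse()                  # ... reversed, this is a valid post-order
        leaves = [n for n in order if n in requested_leaves]
        return [leaves[i:i + B] for i in range(0, len(leaves), B)]

    # With the tree of Figure 3.1: po_part(children, 6, {'l1', 'l3', 'l4'}, 2)
    # returns [['l3', 'l1'], ['l4']]; leaves of a subtree stay consecutive.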

Theorem 3.4. Let $\Pi_{PO}$ be the partition obtained by the algorithm PoPart, and let $c^*$ be the cost of an optimal solution; then

$$\mathrm{cost}(\Pi_{PO}) \le 2 \times c^*.$$

Proof. Consider node $i$. Because the leaves of the subtree rooted at $i$ are sorted consecutively in $L$, the factors of node $i$ will be loaded at most $\lceil \mathrm{nl}(i)/B \rceil + 1$ times. Therefore, the overall cost is at most

$$\mathrm{cost}(\Pi_{PO}) \le 2 \times \sum_i w(i) \times \left( \left\lceil \frac{\mathrm{nl}(i)}{B} \right\rceil + 1 \right) \le \eta + 2 \times \sum_i w(i) \le 2 \times c^*.$$

We note that the factor of two in the approximation guarantee is rather loose in practical settings, as $\sum_i w(i)$ would be much smaller than the lower bound $\eta$ for a practical $B$ and a large number of nodes.


3.2. A special case: two items per part. In this section, we propose algorithms that solve the partitioning problem exactly when $B = 2$, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for $B = 2^k$ in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph $G = (V, V \times V)$ on the leaves of a given tree, and let the edge $(i,j)$ represent the decision to put the leaf nodes $i$ and $j$ together in a part. Given this definition of the vertices and edges, we associate the value $m(i,j) = \mathrm{cost}(\{i,j\})$ with the edge $(i,j)$ if $i \neq j$, and $m(i,i) = \sum_{n \in V} w(n)$ (or any sufficiently large number). Then a minimum weighted matching in $G$ defines a partitioning of the vertices in $V$ with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of $O(|V|^{5/2})$ and $O(|V|^2)$ memory requirements. Therefore, we propose another exact algorithm for $B = 2$.

The proposed algorithm, Match, proceeds from the parents of the leaf nodes to the root. At each internal node $n$, those leaf nodes that are in the subtree rooted at $n$ and have not yet been put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node $n$, one of them (arbitrarily chosen) is passed to parent($n$). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for $B = 2$. The memory and run time requirements are $O(|V|)$. We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: an exact algorithm for $B = 2$
Input: $T = (V, E, r)$ with the root $r$ and $F$ leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_2 = \{R_1, \ldots, R_K\}$, where $K = \lceil F/2 \rceil$.
1: for each leaf node $\ell$ do
2:   add $\ell$ to list(parent($\ell$))
3: compute a post-order
4: $k \leftarrow 1$
5: for each non-leaf node $n$ in post-order do
6:   if $n \neq r$ and list($n$) contains an odd number of vertices then
7:     $\ell \leftarrow$ the node with the least weight in list($n$)
8:     move $\ell$ to list(parent($n$)); add $w(n)$ to the weight of $\ell$   ▷ relay it to the father
9:   else if $n = r$ and list($r$) contains an odd number of vertices then
10:    $\ell \leftarrow$ a node with the least weight in list($n$)
11:    make $\ell$ a singleton
12:  for $i = 1$ to |list($n$)| by 2 do
13:    put the $i$th and $(i+1)$st vertices in list($n$) into $R_k$; increment $k$   ▷ match the $i$th and $(i+1)$st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent($n$) gives the parent of node $n$, and list($n$) is an array associated with node $n$; the sum of the sizes of the list($\cdot$) arrays is $|V|$. The modification is that, when there is an odd number of leaf nodes to partition at node $n$, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node $i$, when Match processes node $n$, is defined as $w(i,n) - w(n)$: the sum of the weights of the nodes in the unique path between nodes $i$ and $n$, including $i$ but excluding $n$. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
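A compact recursive rendering of Match is given below (Python, same hypothetical tree encoding as before); each call returns the (at most one) leaf relayed to the father, together with its cumulative weight.

    def match(children, weight, root):
        """Match (Algorithm 2): pair the leaves two by two at their least
        common ancestors; an odd leftover is relayed upwards, its cumulative
        weight increased by the weight of the relaying node (line 8)."""
        parts = []
        def visit(n):
            kids = children.get(n, [])
            if not kids:                      # a (zero-weight) requested leaf
                return [(0, n)]
            lst = []
            for c in kids:                    # unmatched leaves relayed from below
                lst.extend(visit(c))
            lst.sort(key=lambda t: t[0])      # least cumulative weight first
            carry = []
            if len(lst) % 2 == 1:
                cw, leaf = lst.pop(0)
                if n == root:
                    parts.append([leaf])      # leftover leaf becomes a singleton
                else:
                    carry = [(cw + weight[n], leaf)]
            for i in range(0, len(lst), 2):   # match consecutive pairs
                parts.append([lst[i][1], lst[i + 1][1]])
            return carry
        visit(root)
        return parts

    # On the tree of Figure 3.1 this returns [['l4', 'l1'], ['l3']], whose
    # cost, 14, equals the lower bound eta computed earlier.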

3.3. A heuristic for a more general case. We propose a heuristic algorithm for $B = 2^k$, for some $k$: the BiseMatch algorithm, shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after $\log B = k$ steps BiseMatch obtains nodes that represent at most $B - 1$ other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: a heuristic algorithm for $B = 2^k$
Input: $T = (V, E, r)$ with the root $r$ and $F$ leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_{2^k} = \{R_1, \ldots, R_K\}$, where $|R_i| \le B$.
1: for level = 1 to $k$ do
2:   $M \leftarrow$ Match($T$)
3:   for each pair $(i,j) \in M$: remove the leaf node $j$ from $T$, and mark the leaf node $i$ as representative
4: clean up the tree $T$ so that all leaf nodes correspond to some requested entry
5: each remaining leaf node $i$ corresponds to a part $R_i$, where the nodes that are represented by $i$ are put in $R_i$

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes $i$ and $j$ are matched at their least common ancestor $\mathrm{lca}(i,j)$: if $w(i, \mathrm{lca}(i,j)) \ge w(j, \mathrm{lca}(i,j))$, then we designate $i$ to be the representative of the two by adding $(i,j)$ to $M$; otherwise, we designate $j$ to be the representative by adding $(j,i)$ to $M$. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is $O(|V| \log B)$, with an $O(|V|)$ memory requirement.
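Building on the match() sketch above (which lists each matched pair by increasing cumulative weight at the least common ancestor, so the second leaf of a pair is precisely the representative chosen by the rule just described), a sketch of BiseMatch follows; the tree-mutation bookkeeping is our own, not prescribed by the text.

    def bise_match(children, weight, root, k):
        """BiseMatch (Algorithm 3) for B = 2^k: k rounds of Match on a
        shrinking tree; each pair's representative absorbs its partner."""
        children = {n: list(cs) for n, cs in children.items()}   # working copy
        parent = {root: None}
        for n, cs in children.items():
            for c in cs:
                parent[c] = n
        group = {l: [l] for l in parent if l not in children}    # leaf -> represented
        for _ in range(k):
            for part in match(children, weight, root):
                if len(part) < 2:
                    continue                         # singleton stays principal
                other, rep = part                    # rep has the larger w(., lca)
                group[rep].extend(group.pop(other))
                p = parent[other]
                children[p].remove(other)            # remove 'other' from T
                while p is not None and not children[p]:
                    del children[p]                  # clean-up (line 4): drop
                    p, q = parent[p], p              # bare internal nodes
                    if p is not None:
                        children[p].remove(q)
        return list(group.values())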

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph $H = (V, N)$ is defined as a set of vertices $V$ and a set of nets $N$. Every net is a subset of vertices. Weights can be associated with vertices: we use $w(j)$ to denote the weight of the vertex $v_j$. Costs can be associated with nets: we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a $K$-way vertex partition of $H = (V, N)$ if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives $V$. In $\Pi$, a net is said to connect a part if it has at least one vertex in that part. The connectivity set $\Lambda(i)$ of a net $h_i$ is the set of parts connected by $h_i$; the connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In $\Pi$, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

$$\mathrm{cutsize}(\Pi) = \sum_{h_i \in N} \left( \lambda(i) - 1 \right) \times c(h_i). \qquad (3.2)$$

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balance constraint on the part weights:

$$\frac{W_{\max} - W_{\mathrm{avg}}}{W_{\mathrm{avg}}} \le \varepsilon.$$

Here $W_{\max}$ is the largest part weight, $W_{\mathrm{avg}}$ is the average part weight, and $\varepsilon$ is a predetermined imbalance ratio. This problem is NP-hard [19].

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition of the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path($h$) to denote the path (or the set of nodes of the tree) corresponding to a net $h$. A vertex $i$ (corresponding to the leaf node $i$ in the tree) will be in a net $h$ if the solve for $a^{-1}_{ii}$ passes through path($h$); in other words, if path($h$) $\subset P(i)$, then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among $\lambda(n)$ parts, then the factors corresponding to the nodes in path($h_n$) will have to be loaded $\lambda(n)$ times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, N_D)$ for the diagonal case. Let $T = (V, E)$ be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in $T$. As we are interested in putting at most $B$ solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally: there is a net in $N_D$ for each internal node of $T$. The net $h_n$ corresponding to the node $n$ contains the set of vertices which correspond to the leaf nodes of the subtree $T(n)$. The cost of $h_n$ is equal to the weight of node $n$, i.e., $c(h_n) = w(n)$. This model can be simplified as follows: if a net $h_n$ contains the same vertices as the net $h_j$, where $j = \mathrm{parent}(n)$, that is, if the subtree rooted at node $n$ and the subtree rooted at its father $j$ have the same set of leaf nodes, then the net $h_j$ can be removed, and its cost can be added to the cost of the net $h_n$.


Figure 3.3. The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

This way, the net $h_n$ represents the node $n$ and its parent $j$. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets correspond to disjoint paths in the tree $T$.

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore $V = \{1, 2, 5\}$ and $N = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, and $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is

cutsize($V_1, V_2$) = $c(h_1) \times (\lambda(h_1) - 1) + c(h_2) \times (\lambda(h_2) - 1) + c(h_4) \times (\lambda(h_4) - 1) + c(h_5) \times (\lambda(h_5) - 1)$
    = $c(h_4) \times (2 - 1) + c(h_5) \times (2 - 1)$
    = $c(h_4) + c(h_5)$.

Consider the first part, $V_1 = \{2\}$: we have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, $V_2 = \{1, 5\}$: for this part, we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is $w(4) + w(5)$, which is equal to the cutsize $c(h_4) + c(h_5)$.
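The computation in this example is mechanical once the nets are in hand. The small sketch below (Python) reproduces it, with unit node weights assumed for concreteness (the text keeps the weights symbolic, so the numbers are only illustrative):

    def cutsize(nets, cost, part_of):
        """Connectivity-1 metric (3.2): sum over nets of (lambda - 1) * c(net)."""
        return sum((len({part_of[v] for v in pins}) - 1) * cost[h]
                   for h, pins in nets.items())

    # Hypergraph of Figure 3.3; c(h2) carries the weights of nodes 2 and 3.
    nets = {'h1': {1}, 'h2': {2}, 'h4': {1, 2}, 'h5': {1, 2, 5}}
    cost = {'h1': 1, 'h2': 2, 'h4': 1, 'h5': 1}
    part_of = {2: 'V1', 1: 'V2', 5: 'V2'}            # V1 = {2}, V2 = {1, 5}
    print(cutsize(nets, cost, part_of))              # -> c(h4) + c(h5) = 2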

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred by that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. We believe, though, that one could devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty here is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,

$$y = L^{-1} e_j, \qquad a^{-1}_{ij} = \left( U^{-1} y \right)_i,$$

only one $y$ vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for $i$ and $k$. This means that for the forward solves with $L$ we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to $\mathrm{lca}(i,k)$ only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.

If we were to exclude the case when more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that, when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node $j$ of the elimination tree). Adding these two lower bounds then yields a lower bound for the general case. However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs are the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that repeated indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices $i$ and $j$ where the entry $a^{-1}_{ij}$ is requested.

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{\mathrm{fwd}}$) and backward ($h^{\mathrm{bwd}}$) solves are shown. Note that the nets $h^{\mathrm{fwd}}_3$, as well as $h^{\mathrm{bwd}}_3$, $h^{\mathrm{bwd}}_4$, and $h^{\mathrm{bwd}}_8$, are removed.


Figure 4.1. Example of the hypergraph model for the general case: $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded in light grey; the nodes which correspond to the vertices of the hypergraph for the backward solves are shaded in dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$).

A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$. The cutsize is $c(h^{\mathrm{bwd}}_5) + c(h^{\mathrm{fwd}}_4) + c(h^{\mathrm{fwd}}_5)$. Consider the computation of $a^{-1}_{62}$: we need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, with each hyperedge repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some further matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we chose satisfy the following characteristics: $10000 \le N \le 100000$, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17], and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We experimented with block sizes $B \in \{2, 4, 8, 16, 32, 64, 128, 256\}$ and assigned random weights in the range [1, 200] to the tree nodes. Then, for each $P \in \{0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00\}$, we created 10 instances (except for $P = 1.00$) by randomly selecting $P \times N$ integers between 1 and $N$ and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, $B$, and $P$, we have 10 different trees to partition, resulting in a total of $10 \times 6 \times 8 \times 61 + 8 \times 61 = 29768$ TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with $B \in \{4, 16, 64, 256\}$ (the last row relates to all $B$ values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances, and computed the ratios of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, $B$, and $P$, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same $B$ and $P$. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with $B = 2$ (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of $56 \times 61$ triplets did PoPart obtain better results than BiseMatch. For all $P$, the performance of BiseMatch with respect to the lower bound becomes worse as $B$ increases. Although there are fluctuations in the performance of PoPart for small values of $P$, e.g., for 0.05 and 0.10, for larger values of $P$ the performance also becomes worse with larger values of $B$. We suspect that the lower bound might be loose for large values of $B$. For all $B$, the performance of BiseMatch with respect to the lower bound improves when $P$ increases; a similar trend is observable for PoPart, except for a small deviation for $B = 256$. Recall that the trees we use here come from a nested dissection ordering, and such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that, when a high percentage of the diagonal entries is requested, the trees do not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of $P$ for a given $B$.

5.2. Practical tests with a direct solver. We have implemented the PoPart heuristic in Fortran and integrated it into the MUMPS solver [3].


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the averages, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                       PoPart                     BiseMatch
    B     P      min     max     avg        min     max     avg
    4   0.05  1.0011  1.1751  1.0278     1.0000  1.0139  1.0013
        0.10  1.0005  1.1494  1.0192     1.0000  1.0073  1.0005
        0.20  1.0003  1.0945  1.0119     1.0000  1.0052  1.0003
        0.40  1.0001  1.0585  1.0072     1.0000  1.0031  1.0001
        0.60  1.0001  1.0449  1.0053     1.0000  1.0019  1.0001
        0.80  1.0000  1.0367  1.0043     1.0000  1.0029  1.0001
        1.00  1.0000  1.0491  1.0038     1.0000  1.0101  1.0002
   16   0.05  1.0050  1.1615  1.0592     1.0000  1.0482  1.0113
        0.10  1.0026  1.1780  1.0485     1.0000  1.0553  1.0075
        0.20  1.0016  1.2748  1.0374     1.0000  1.0334  1.0035
        0.40  1.0007  1.1898  1.0246     1.0000  1.0230  1.0016
        0.60  1.0005  1.1431  1.0186     1.0000  1.0166  1.0010
        0.80  1.0004  1.1136  1.0154     1.0000  1.0190  1.0011
        1.00  1.0003  1.1052  1.0133     1.0000  1.0096  1.0008
   64   0.05  1.0132  1.1581  1.0800     1.0000  1.0797  1.0275
        0.10  1.0101  1.1691  1.0715     1.0002  1.0584  1.0196
        0.20  1.0054  1.1389  1.0599     1.0001  1.0506  1.0125
        0.40  1.0030  1.1843  1.0497     1.0000  1.0437  1.0079
        0.60  1.0020  1.2362  1.0407     1.0000  1.1022  1.0072
        0.80  1.0015  1.3018  1.0383     1.0000  1.0344  1.0044
        1.00  1.0014  1.2087  1.0315     1.0000  1.0141  1.0024
  256   0.05  1.0050  1.1280  1.0651     1.0000  1.0867  1.0342
        0.10  1.0127  1.1533  1.0721     1.0003  1.0911  1.0314
        0.20  1.0133  1.1753  1.0730     1.0002  1.0722  1.0257
        0.40  1.0093  1.1598  1.0668     1.0003  1.0540  1.0187
        0.60  1.0068  1.1621  1.0602     1.0002  1.0572  1.0174
        0.80  1.0068  1.1314  1.0563     1.0001  1.0515  1.0120
        1.00  1.0043  1.1203  1.0495     1.0001  1.0677  1.0118
  Over all triplets
              1.0000  1.3018  1.0359     1.0000  1.1110  1.0079

The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments were performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results were obtained on a dual-core Intel Core2 Duo P8800 processor with a 2.80 GHz clock speed; we used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1; the names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with $B = 16$ and 64. In this table,


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                          Total size of the loaded        Running time of the
                   Lower  factors (MBytes)                solution phase (s)
  matrix        B  bound     NoES   ES-Nat  ES-PoP       NoES   ES-Nat  ES-PoP
  CESR21532    16   5403    63313     7855    5422     1113.8     69.4    39.3
               64   1371    15828     2596    1389      359.6     38.1    16.6
  CESR46799    16   2399   114051     3158    2417     3962.3     76.7    47.7
               64    620    28512     1176     635      866.3     51.3    28.5
  CESR72358    16   1967   375737     6056    2008    10800.9    263.7    71.8
               64    528    93934     4796     571     3174.0    274.1    52.0
  CESR148286   16   8068  1595645    16595    8156    43396.7    720.7   268.5
               64   2092   398911    11004    2179    14049.3    726.7   199.8
  af23560      16  16720   114672    17864   16745     2080.6    241.1   197.6
               64   4215    28668     5245    4245      668.5    121.0    59.5
  ecl32        16  95478   618606   141533   95566    12184.7   2726.3  1760.6
               64  23943   154651    43429   24046     3525.5    974.1   482.9
  stokes64     16    721     8503     1026     726      131.2     14.2     8.5
               64    185     2125      425     189       48.8     10.2     4.1
  boyd1        16   2028    75521     4232    2031    16551.2    389.8   214.9
               64    515    18880     1406     518     5492.7    230.5   121.2

the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size $B$) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total volume of loaded factors is once again reduced significantly, with a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be explained intuitively as follows: within an epoch, computations are performed on a union of paths, hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes that are far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence the slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions use direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                          Total size of the loaded     Running time of the
                   Lower  factors (MBytes)             solution phase (s)
  matrix        B  bound   ES-PoP    ES-HP            ES-PoP    ES-HP
  CESR21532    16    563     1782      999              56.6     16.9
               64    164      703      464              28.3     13.6
  CESR46799    16    264      549      416              32.2     25.1
               64     93      232      195              35.5     25.3
  CESR72358    16    242     1124      868             122.9     73.8
               64    116      794      598              99.5     72.8
  CESR148286   16    905     3175     2693             426.0    321.7
               64    345     2080     1669             281.0    235.8
  af23560      16   1703     3579     2463             109.2     66.4
               64    458     1219     1003              47.1     34.3
  ecl32        16   9617    22514    12615             507.7    265.2
               64   2483     7309     4664             199.2    119.9
  stokes64     16     77      188      149               2.9      2.3
               64     26       75       74               1.9      1.7
  boyd1        16    205      481      258              39.0     34.4
               64     55      198       93              25.9     24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with this model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of $N/10$ off-diagonal entries (no two in the same column) are computed with $B = 16, 64$. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on the post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We therefore think that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed.


Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition $\{\{1,2\},\{3\}\}$ is better than the partition $\{\{1,3\},\{2\}\}$, because the total size of the loaded factors is 5 units (4 for the first block and 1 for the second) versus 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for each of 4 nodes and 1 operation for each of 2 nodes) versus 8 (decomposed as 2 × 2 + 1 × 4).

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition $\{\{1,2\},\{3\}\}$ is better than the partition $\{\{1,3\},\{2\}\}$ in the out-of-core context, but in the in-core case the second partition results in fewer operations.
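The two metrics are easy to state side by side. In the sketch below (Python), paths[i] denotes the node set $P(i)$ of Section 2.2, and the unit per-node, per-right-hand-side operation count mirrors the accounting used in the figure; both conventions are our own simplifications.

    def oc_load(blocks, paths, w):
        """Out-of-core metric: each node in the union of the paths of a block
        is loaded once per block (cf. (2.11))."""
        return sum(2 * sum(w[n] for n in set().union(*(paths[i] for i in blk)))
                   for blk in blocks)

    def ic_flops(blocks, paths):
        """In-core metric: block solves process the union of the structures,
        once per right-hand side in the block."""
        return sum(len(blk) * len(set().union(*(paths[i] for i in blk)))
                   for blk in blocks)

    # Two partitions of the same requested entries generally rank differently
    # under oc_load and ic_flops, which is exactly the point of Figure 6.1.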

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an $LU$ factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N,N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $N$ linear systems and require considerably more memory. On the other hand, if only a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order so as to compute all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from the paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We describe a very simple heuristic PoPart which is based on a post-ordering ofthe elimination tree and then a partitioning of the nodes in sequential parts accordingto this post-order We showed that this is a 2-approximation algorithm Althoughwe showed that the TreePartitioning problem cannot be approximated arbitrarilyclosely there remains a gap to ll and in future work we will strive at designingapproximation algorithms with a better ratio

We presented an exact algorithm for the case when two entries are computed at atime By using this exact algorithm repeatedly we developed another heuristic Bise-Match for partition sizes that are powers of two We performed extensive tests onthe heuristics and have concluded that both PoPart and BiseMatch perform verywell on average where the worst case the performance of the BiseMatch is betterBy comparing the performance of these heuristics with computable lower bounds wesaw that they give very eective partitionings We implemented the PoPart heuris-tic within the MUMPS solver and reported experimental results with MUMPS Theseconrmed the eectiveness of the PoPart heuristic

The heuristics PoPart and BiseMatch were designed for the case where onlydiagonal entries of the inverse are requested To accommodate the case when o-diagonal entries are wanted we have proposed a formulation based on a hypergraphpartitioning In this model a hypergraph is built so that the cutsize of the partitioncorresponds exactly to the increase in the total size of factors loaded Although the sizeof the hypergraph model can be large the model is powerful enough to represent boththe diagonal and the o-diagonal cases We also performed tests with the hypergraphmodel and concluded that it can be used eectively for cases where a small numberof entries in the inverse are requested

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing how to compute off-diagonal inverse entries efficiently, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15. doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.


Figure 2.1. A pattern symmetric matrix A, its factors, and the associated elimination tree. (a) The matrix A, whose nonzeros are shown with blue circles. (b) The pattern of L+U, where the filled-in entries are shown with red squares. (c) The corresponding elimination tree, where the children of a node are drawn below the node itself.

A topological ordering of a rooted tree is an ordering of its nodes such that any node is numbered before its parent. A post-order of a tree is a topological ordering where the nodes in each subtree are numbered consecutively. The least common ancestor of two nodes i and j in a rooted tree, lca(i,j), is the lowest numbered node that lies at the intersection of the unique paths from node i and node j to the root. The ceiling function ⌈x⌉ gives the smallest integer greater than or equal to x. For a set S, |S| denotes the cardinality of S.

2.1. Elimination tree and sparse triangular solves. When the matrix A is structurally symmetric with a zero-free diagonal, the elimination tree represents the storage and computational requirements of its sparse factorization. There are a few equivalent definitions of the elimination tree [21]. We prefer the following one for the purposes of this paper.

Definition 2.1. Assume A = LU, where A is a sparse, structurally symmetric N × N matrix. Then the elimination tree T(A) of A is a tree of N nodes, with the ith node corresponding to the ith column of L, and where the parent relations are defined as follows:

parent(j) = min{i : i > j and ℓ_{ij} ≠ 0}, for j = 1, ..., N − 1.

For the sake of completeness, we note that if A is reducible, this structure is a forest with one tree for each irreducible block; otherwise it is a tree. We assume without loss of generality that the matrix is irreducible. As an example, consider the pattern symmetric matrix A shown in Fig. 2.1(a). The factors and the corresponding elimination tree are shown in Figs. 2.1(b) and 2.1(c).
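Definition 2.1 translates directly into an algorithm. The following is a minimal sketch (in Python; ours, not the authors' code) of the classical elimination tree computation with path compression, in the spirit of [21]; the input layout rows_lower is an assumption of the sketch.

def elimination_tree(N, rows_lower):
    # rows_lower[i] lists the column indices k < i with a_ik != 0;
    # A is assumed structurally symmetric with a zero-free diagonal.
    parent = [None] * N     # parent[j] as in Definition 2.1; None at a root
    ancestor = [None] * N   # path-compressed ancestor links
    for i in range(N):
        for k in rows_lower[i]:
            r = k
            # climb from k to the current root of its subtree,
            # compressing the traversed path onto i
            while ancestor[r] is not None and ancestor[r] != i:
                t = ancestor[r]
                ancestor[r] = i
                r = t
            if ancestor[r] is None:
                ancestor[r] = i
                parent[r] = i   # i becomes the parent of that subtree's root
    return parent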

Our algorithms for efficiently computing a given set of entries in A^{-1} rely on the elimination tree structure. We take advantage of the following result, which is rewritten from [16, Theorem 2.1].

Corollary 2.2. Assume b is a sparse vector and L is a lower triangular matrix. Then the indices of the nonzero elements of the solution vector x of Lx = b are equal to the indices of the nodes of the elimination tree that are in the paths from the nodes corresponding to nonzero entries of b to the root.

We will use the corollary in the following way: when b is sparse, we need to solve only the equations corresponding to the predicted nonzero entries, while setting the other entries of the solution vector x to zero. We note that when b contains a single nonzero


entry, say in its ith component, the equations to be solved are those that correspond to the nodes in the unique path from node i to the root. In any case, assuming the matrix is irreducible, the last entry x_N, corresponding to the root of the tree, is nonzero. Consider, for example, Lx = e_3 for the lower triangular factor L shown in Fig. 2.1(b). Clearly x_1, x_2 = 0 and x_3 = 1/ℓ_{33}. These in turn imply that x_4 = 0 and x_5, x_6 ≠ 0. The nonzero entries correspond to the nodes that are in the unique path from node 3 to the root node 6, as seen in Fig. 2.1(c).
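In code, this observation for a single nonzero right-hand side b = e_i amounts to following parent pointers; a small illustrative sketch (ours), with parent as produced by the elimination tree computation above:

def path_to_root(parent, i):
    # Nodes whose equations must be solved in Lx = e_i (Corollary 2.2);
    # with 0-based indices, the path {3, 5, 6} of the example corresponds
    # to path_to_root(parent, 2) == [2, 4, 5].
    path = []
    while i is not None:
        path.append(i)
        i = parent[i]
    return path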

With Corollary 2.2 we are halfway through solving (1.1) efficiently for a particular entry of the inverse: we solve only the relevant equations involving the L factor (and hence only access the necessary parts of it). Next we show which equations involving U need to be solved when only a particular entry of the solution is requested, thereby specifying the whole solution process for an entry of the inverse.

Lemma 2.3. In order to obtain the ith component of the solution to Uz = y, that is, z_i = (U^{-1}y)_i, one has to solve the equations corresponding to the nodes that are in the unique path from the highest node in struct(y) ∩ ancestors(i) to i, where struct(y) denotes the nodes associated with the nonzero entries of y, and ancestors(i) denotes the set of ancestors of node i in the elimination tree.

Proof. We first show (1) that the set of components of z involved in the computation of z_i is the set of ancestors of i in the elimination tree. We then (2) reduce this set using the structure of y.

(1) We prove by top-down induction on the tree that the only components involved in the computation of any component z_l of z are the ancestors of l in the elimination tree. The root component z_N is computed as z_N = y_N/u_{NN} (thus it requires no other component of z), and the root has no ancestor in the tree.

For any node l, following a left-looking scheme, z_l is computed as

z_l = ( y_l − Σ_{k=l+1}^{N} u_{lk} z_k ) / u_{ll} = ( y_l − Σ_{k : u_{lk} ≠ 0} u_{lk} z_k ) / u_{ll}.   (2.1)

All the nodes in the set K_l = {k : u_{lk} ≠ 0} are ancestors of l, by definition of the elimination tree (since struct(U^T) = struct(L)). Thus, by applying the induction hypothesis to all the nodes in K_l, all the required nodes are ancestors of l.

(2) The pattern of y can be exploited to show that some components of z which are involved in the computation of z_i are zero. Denoting by k_i the highest node in struct(y) ∩ ancestors(i), that is, the highest ancestor of i such that y_{k_i} ≠ 0, we have

z_k = 0 if k ∈ ancestors(k_i),
z_k ≠ 0 if k ∈ ancestors(i) \ ancestors(k_i).

Both statements are proved by induction using the same left-looking scheme. Therefore, the only components of z required lie on the path between i and k_i, the highest node in struct(y) ∩ ancestors(i).

In particular, when y_N is nonzero, as would be the case when y is the vector obtained after forward elimination for the factorization of an irreducible matrix, Lemma 2.3 states that we need to solve the equations that correspond to the nodes that lie in the unique path from the root node to node i. Consider the U given in Fig. 2.1(b)


and suppose we want to compute (Uz = e_6)_2. As we are interested in z_2, we have to compute z_3 and z_5 to be able to solve the second equation; in order to compute z_3, we have to compute z_6 as well. Therefore we have to solve equation (2.1) for nodes 2, 3, 5, and 6. As seen in Fig. 2.1(c), these correspond to the nodes that are in the unique path from node 6 (the highest ancestor) to node 2. We also note that variable z_4 would be nonzero; however, we do not compute it, because it does not play a role in determining z_2.

One can formulate a^{-1}_{ij} in three different ways, each involving linear system solutions with L and U. Consider the general formula

a^{-1}_{ij} = e_i^T A^{-1} e_j = e_i^T U^{-1} L^{-1} e_j   (2.2)

and the three possible ways of parenthesizing these equations. Our method, as shown in (1.1), corresponds to the parenthesization

( U^{-1} ( L^{-1} e_j ) )_i .   (2.3)

The other two parenthesizations are

a^{-1}_{ij} = ( ( e_i^T U^{-1} ) L^{-1} )_j   (2.4)

and

a^{-1}_{ij} = ( e_i^T U^{-1} ) ( L^{-1} e_j ).   (2.5)

We have the following theorem regarding the equivalence of these three parenthesizations when L and U are computed and stored in such a way that their sparsity patterns are the transposes of each other. In a more general case, the three parenthesizing schemes may differ when L and U^T have different patterns.

Theorem 2.4. The three parenthesizations for computing a^{-1}_{ij} given in equations (2.3) to (2.5) access the same parts of the factor L and the same parts of the factor U.

Proof. Consider the following four computations on which the different parenthesizations are based:

v^T = e_i^T U^{-1},   (2.6)
w_i = (U^{-1} y)_i,   (2.7)
y = L^{-1} e_j,   (2.8)
z_j = (v L^{-1})_j.   (2.9)

As v = U^{-T} e_i, and the pattern of U^T is equal to the pattern of L, by Corollary 2.2, computing v requires accessing the U factors associated with the nodes in the unique path between node i and the root node. Consider z^T = L^{-T} v^T. As the pattern of L^T is equal to the pattern of U, by Lemma 2.3, computing z_j requires accessing the L factors associated with the nodes in the unique path from node j to the highest node in struct(v) ∩ ancestors(j). As v_N = (U^{-T} e_i)_N ≠ 0 (since A is irreducible), this requires accessing the L factors from node j to the root (the Nth node). With similar arguments, y_N ≠ 0, and computing w_i = (U^{-1} y)_i requires accessing the U factors associated with the nodes in the unique path between node i and the root. These observations, combined with another application of Corollary 2.2, this time for L^{-1} e_j, complete the proof.

We now combine Corollary 2.2 and Lemma 2.3 to obtain the theorem.


Figure 2.2. Traversal of the elimination tree for the computation of a^{-1}_{23}. In the first step, Ly = e_3 is solved, yielding y_3, y_5, y_6 ≠ 0. Then (Uz = y)_2 is found by computing z_6, z_5, z_3, and finally z_2.

Theorem 2.5 (factors to load for computing a particular entry of the inverse; Property 8.9 in [24]). To compute a particular entry a^{-1}_{ij} of A^{-1}, the only factors which have to be loaded are the L factors on the path from node j up to the root node, and the U factors on the path going back from the root to node i.

This theorem establishes the efficiency of the proposed computation scheme: we

solve only the equations required for a requested entry of the inverse, both in the forward and backward solve phases. We illustrate the above theorem on the previous example in Fig. 2.2. As discussed before, Ly = e_3 yields nonzero vector entries y_3, y_5, and y_6, and then z_2 = (Uz = y)_2 is found after computing z_6, z_5, z_3.

We note that the third parenthesization (2.5) can be advantageous when computing only the diagonal entries with the factorizations LL^T or LDL^T (with a diagonal matrix D), because in these cases we need to compute a single vector and take the square of its norm. This formulation can also be useful in a parallel setting, where the solves with L and U can be computed in parallel, whereas in the other two formulations the solves with one of the factors have to wait for the solves with the other one to be completed.

We also note that if triangular solution procedures for e_i^T U^{-1} and L^{-1} e_j are available, then one can benefit from the third parenthesization in certain cases. If the number of row and column indices concerned by the requested entries is smaller than the number of these requested entries, many of the calculations can be reused: one computes a set of vectors of the form e_i^T U^{-1} and L^{-1} e_j for the different i and j, and obtains the requested entries of the inverse by computing the inner products of these vectors. We do not consider this computational scheme in this paper, because such separate solves with L and U are not available within MUMPS.

2.2. Problem definition. We now address the computation of multiple entries and introduce the partitioning problem. We first discuss the diagonal case (that is, computing a set of diagonal entries) and comment on the general case later, in Section 4.

As seen in the previous section, in order to compute a^{-1}_{ii} using the formulation

y = L^{-1} e_i ,
a^{-1}_{ii} = (U^{-1} y)_i ,

we have to access the parts of L that correspond to the nodes in the unique path from node i to the root, and then access the parts of U that correspond to the nodes in the same path. As discussed above, these are the necessary and sufficient parts of L


and U that are needed. In other words, we know how to solve efficiently for a single requested diagonal entry of the inverse. Now suppose that we are to compute a set R of diagonal entries of the inverse. As said in Section 1, using the equations above entails storing a dense vector for each requested entry. If |R| is small, then we could again identify all the parts of L and U that need to be loaded for at least one requested entry in R, and then solve for all of R at once, accessing the necessary and sufficient parts of L and U only once. However, |R| is usually large: in the application areas mentioned in Section 1, one often wants to compute a large set of entries, such as the whole diagonal of the inverse (in that case |R| = N). Storing that many dense vectors is not feasible; therefore the computations proceed in epochs, where at each epoch a limited number of diagonal entries are computed. This entails accessing (loading) some parts of L and U multiple times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors is minimized.

We now formally introduce the problem. Let T be the elimination tree on N nodes, where the factors associated with each node are stored on disks (out-of-core). Let P(i) be the set of nodes in the unique path from node i to the root r, including both nodes i and r. Let w(i) denote the cost of loading the parts of the factors L or U associated with node i of the elimination tree. Similarly, let w(i,j) denote the sum of the costs of the nodes in the path from node i to node j. The cost of solving for a^{-1}_{ii} is therefore

cost(i) = Σ_{k ∈ P(i)} 2 × w(k) = 2 × w(i, r).   (2.10)

If we solve for a set R of diagonal entries at once, then the overall cost is

cost(R) = Σ_{i ∈ P(R)} 2 × w(i), where P(R) = ∪_{i ∈ R} P(i).
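These formulas can be transcribed directly; a sketch (ours), with parent and w as before, where the factor 2 accounts for the L and U parts of each node:

def cost_of_part(parent, w, Rk):
    # cost(Rk): every node on P(Rk), the union of the root paths of the
    # entries in Rk, is loaded exactly once for this part.
    nodes = set()
    for i in Rk:
        j = i
        while j is not None and j not in nodes:
            nodes.add(j)
            j = parent[j]
    return 2 * sum(w[n] for n in nodes)

def total_cost(parent, w, parts):
    # overall cost of a partition {R1, ..., RK}, as in (2.11)
    return sum(cost_of_part(parent, w, Rk) for Rk in parts)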

We use B to denote the maximum number of diagonal entries that can be computed in an epoch. This is the number of dense vectors that we must hold, and so it is limited by the available storage.

The TreePartitioning problem is formally defined as follows: given a tree T with N nodes, a set R = {i_1, ..., i_m} of nodes in the tree, and an integer B ≤ m, partition R into a number of subsets R_1, R_2, ..., R_K so that |R_k| ≤ B for all k, and the total cost

cost({R_1, ..., R_K}) = Σ_{k=1}^{K} cost(R_k)   (2.11)

is minimum. The number of subsets K is not specified, but obviously K ≥ ⌈m/B⌉. Without loss of generality, we can assume that there is a one-to-one correspondence between R and the leaf nodes of T. Indeed, if there is a leaf node i with i ∉ R, then we can delete node i from T. Similarly, if there is an internal node i with i ∈ R, then we create a leaf node i′ of zero weight and make it an additional child of i. For ease of discussion and formulation, for each requested node (leaf or not) of the elimination tree, we add a leaf node with zero weight. To clarify the execution scheme, we now specify the algorithm that computes the diagonal entries of the inverse specified by a given R_k. We first


find P(R_k); we then post-order the nodes in P(R_k), and start loading the associated L factors from the disk and performing the forward solves with L. When we reach the root node, we have |R_k| dense vectors, and we start loading the associated U factors from the disk and performing the backward substitutions along the paths that we traversed (in reverse order) during the forward substitutions.

3. Partitioning methods and models. As discussed above, partitioning the requested entries into blocks to minimize the cost of loading factors corresponds to the TreePartitioning problem. In this section we focus on the case where all of the requested entries are on the diagonal of the inverse. As noted before, in this case the partial forward and backward solves correspond to visiting the same path in the elimination tree. We analyse the TreePartitioning problem in detail for this case and show that it is NP-complete; we also show that the case where the block size B = 2 is solvable in polynomial time. We provide two heuristics, one with an approximation guarantee (in the sense that we can prove that it is at worst twice as bad as optimal) and the other being somewhat better in practice; we also introduce a hypergraph partitioning-based formulation which is more general than the other heuristics.

Before introducing the algorithms and models, we present a lower bound for the cost of an optimal partition. Let nl(i) denote the number of leaves of the subtree rooted at node i, which can be computed as follows:

nl(i) = 1, if i is a leaf node;
nl(i) = Σ_{j ∈ children(i)} nl(j), otherwise.   (3.1)

We note that, as all the leaf nodes correspond to the requested diagonal entries of the inverse, nl(i) corresponds to the number of forward and backward solves that have to be performed at node i.

Given the number of forward and backward solves that pass through a node i, it is easy to derive the following lower bound on the amount of factors loaded.

Theorem 3.1 (lower bound on the amount of factors to load). Let T be a node-weighted tree, w(i) be the weight of node i, B be the maximum allowed size of a part, and nl(i) be the number of leaf nodes in the subtree rooted at i. Then we have the following lower bound, denoted by η, on the optimal solution c of the TreePartitioning problem:

η = 2 × Σ_{i ∈ T} w(i) × ⌈nl(i)/B⌉ ≤ c.

Proof. The bound follows easily by noting that each node i has to be loaded at least ⌈nl(i)/B⌉ times.

As the formula includes nl(·), the lower bounds for wide and shallow trees will usually be smaller than the lower bounds for tall and skinny trees. Each internal node is on a path from (at least) one leaf node, therefore ⌈nl(i)/B⌉ is at least 1, and 2 × Σ_i w(i) ≤ c.

Figure 3.1 illustrates the notion of the number of leaves of a subtree and the computation of the lower bound. Entries a^{-1}_{11}, a^{-1}_{33}, and a^{-1}_{44} are requested, and the elimination tree of Figure 2.1(c) is modified accordingly to have leaves (with zero weights) corresponding to these entries. The numbers nl(i) are shown next to the nodes.

Figure 3.1. Number of leaves of the subtrees rooted at each node of a transformed elimination tree. The nodes corresponding to the requested diagonal entries of the inverse are shaded, and a leaf node is added for each such entry. Each node is annotated with the number of leaves in the corresponding subtree, resulting in a lower bound of η = 14 with B = 2.

Suppose that each internal node has unit weight and that the block size is 2. Then the lower bound is

η = 2 × ( ⌈1/2⌉ + ⌈1/2⌉ + ⌈2/2⌉ + ⌈3/2⌉ + ⌈3/2⌉ ) = 14.
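The lower bound is cheap to evaluate. A sketch (ours): children maps a node to its list of children, and w[i] = 0 for the added leaves; with the tree of Figure 3.1 encoded this way, the function returns 14 for B = 2.

from math import ceil

def lower_bound(children, w, root, B):
    # eta = 2 * sum_i w(i) * ceil(nl(i)/B), with nl(i) computed by (3.1)
    nl = {}
    def count(i):
        kids = children.get(i, [])
        nl[i] = 1 if not kids else sum(count(j) for j in kids)
        return nl[i]
    count(root)
    return 2 * sum(w[i] * ceil(nl[i] / B) for i in nl)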

Recall that we have transformed the elimination tree in such a way that the requested entries now correspond to the leaves, and each leaf corresponds to a requested entry. We have the following computational complexity result.

Theorem 3.2. The TreePartitioning problem is NP-complete.

Proof. We consider the associated decision problem: given a tree T with m leaves, a value of B, and a cost bound c, does there exist a partitioning S of the m leaves into subsets whose size does not exceed B and such that cost(S) ≤ c? It is clear that this problem belongs to NP: if we are given the partition S, it is easy to check in polynomial time that it is valid and that its cost meets the bound c. We now have to prove that the problem is NP-hard.

To establish the completeness, we use a reduction from 3-PARTITION [14], which is NP-complete in the strong sense. Consider an instance I_1 of 3-PARTITION: given a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1≤j≤3p} a_j = pZ, does there exist a partition of {1, ..., 3p} into p disjoint subsets K_1, ..., K_p, each with three elements, such that Σ_{j ∈ K_i} a_j = Z for all 1 ≤ i ≤ p?

We build the following instance I_2 of our problem: the tree is a three-level tree composed of N = 1 + 3p + pZ nodes; the root v_r, of cost w_r, has 3p children v_i, each of cost w_v, for 1 ≤ i ≤ 3p. In turn, each v_i has a_i children, each being a leaf node of zero cost. This instance I_2 of the TreePartitioning problem is shown in Fig. 3.2. We let B = Z and ask whether there exists a partition of leaf nodes of cost c = p × w_r + 3p × w_v. Here w_r and w_v are arbitrary values (we can take w_r = w_v = 1). We note that the cost c corresponds to the lower bound shown in Theorem 3.1: in this lower bound, each internal node v_i is loaded only once, and the root is loaded p times, since it has pZ = pB leaves below it. Note that the size of I_2 is polynomial in the size of I_1: indeed, because 3-PARTITION is NP-complete in the strong sense, we can encode I_1 in unary, and the size of the instance is O(pZ).


Figure 3.2. The instance of the TreePartitioning problem corresponding to a given 3-PARTITION problem. The weight of each node is shown next to the node. The minimum cost of a solution with B = Z to the TreePartitioning problem is p × w_r + 3p × w_v, which is only possible when the children of each v_i are all in the same part, and when the children of three different internal nodes, say v_i, v_j, v_k, are put in the same part. This corresponds to putting the numbers a_i, a_j, a_k into a set for the 3-PARTITION problem, which sums up to Z.

Now we show that I_1 has a solution if and only if I_2 has a solution. Suppose first that I_1 has a solution K_1, ..., K_p. The partition of leaf nodes corresponds exactly to the subsets K_i: we build p subsets S_i whose leaves are the children of the vertices v_j with j ∈ K_i. Suppose now that I_2 has a solution. To meet the cost bound, each internal node has to be loaded only once, and the root at most p times. This means that the partition involves at most p subsets to cover all leaves. Because there are pZ leaves, each subset is of size exactly Z. Because each internal node is loaded only once, all its leaves belong to the same subset. Altogether, we have found a solution to I_1, which concludes the proof.

We can further show that we cannot get a close approximation to the optimal solution in polynomial time.

Theorem 3.3. Unless P=NP, there is no 1 + o(1/N) polynomial approximation algorithm for trees with N nodes in the TreePartitioning problem.

Proof. Assume that there exists a polynomial (1 + ε(N)/N)-approximation algorithm for trees with N nodes, where lim_{N→∞} ε(N) = 0. Let ε(N) < 1 for N ≥ N_0. Consider an arbitrary instance I_0 of 3-PARTITION, with a set {a_1, ..., a_{3p}} of 3p integers and an integer Z such that Σ_{1≤j≤3p} a_j = pZ. Without loss of generality, assume that a_i ≥ 2 for all i (hence Z ≥ 6). We ask if we can partition the 3p integers of I_0 into p triples of the same sum Z. Now we build an instance I_1 of 3-PARTITION by adding X times the integer Z − 2 and 2X times the integer 1 to I_0, where X = max( ⌈(N_0 − 1)/(Z + 3)⌉ − p, 1 ). Hence I_1 has 3p + 3X integers, and we ask whether these can be partitioned into p + X triples of the same sum Z. Clearly, I_0 has a solution if and only if I_1 does (the integer Z − 2 can only be in a set with two 1s).

We build an instance I_2 of TreePartitioning from I_1 exactly as we did in the proof of Theorem 3.2, with w_r = w_v = 1 and B = Z. The only difference is that the value p in that proof has been replaced by p + X here; therefore the three-level tree now has N = 1 + 3(p+X) + (p+X)Z nodes. Note that X has been chosen so that N ≥ N_0. Just as in the proof of Theorem 3.2, I_1 has a solution if and only if the optimal cost for the tree is c = 4(p+X); otherwise the optimal cost is at least 4(p+X) + 1.

If I_1 has a solution, then because N ≥ N_0, the approximation algorithm will


return a cost of at most

(1 + ε(N)/N) × c ≤ (1 + 1/N) × 4(p+X) = 4(p+X) + 4(p+X)/N.

But 4(p+X)/N = 4(N−1)/((Z+3)N) ≤ 4/9 < 1, so the approximation algorithm can be used to determine whether I_1, and hence I_0, has a solution. This is a contradiction unless P=NP.

3.1. A partitioning based on post-order. Consider again the case where some entries on the diagonal of the inverse are requested. As said before, the problem of minimizing the size of the factors to be loaded corresponds to the TreePartitioning problem. Consider the heuristic PoPart, shown in Algorithm 1, for this problem.

Algorithm 1 PoPart: A post-order based partitioning

Input: T = (V, E, r) with F leaves; each requested entry corresponds to a leaf node with a zero weight.
Input: B, the maximum allowable size of a part.
Output: Π_PO = {R_1, ..., R_K}, where K = ⌈F/B⌉, a partition of the leaf nodes.
1: compute a post-order of T
2: L ← sort the leaf nodes according to their rank in post-order
3: R_k = { L(i) : (k − 1) × B + 1 ≤ i ≤ min(k × B, F) }, for k = 1, ..., ⌈F/B⌉

As seen in Algorithm 1, the PoPart heuristic first orders the leaf nodes according to their post-order. It then puts the first B leaves in the first part, the next B leaves in the second part, and so on. This simple partitioning approach results in ⌈F/B⌉ parts for a tree with F leaf nodes, and puts B nodes in each part, except maybe in the last one. We have the following theorem, which states that this simple heuristic obtains results that are at most twice the cost of an optimum solution.
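A Python rendition of Algorithm 1 (a sketch, assuming postorder_rank maps each leaf to its rank in the chosen post-order):

def po_part(leaves, postorder_rank, B):
    # PoPart: sort the leaves by post-order rank, then cut into chunks of B
    L = sorted(leaves, key=lambda leaf: postorder_rank[leaf])
    return [L[k:k + B] for k in range(0, len(L), B)]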

Theorem 3.4. Let Π_PO be the partition obtained by the algorithm PoPart, and let c be the cost of an optimum solution. Then

cost(Π_PO) ≤ 2 × c.

Proof. Consider node i. Because the leaves of the subtree rooted at i are sorted consecutively in L, the factors of node i will be loaded at most ⌈nl(i)/B⌉ + 1 times. Therefore the overall cost is at most

cost(Π_PO) ≤ 2 × Σ_i w(i) × ( ⌈nl(i)/B⌉ + 1 )
           ≤ η + 2 × Σ_i w(i)
           ≤ 2 × c.

We note that the factor of two in the approximation guarantee would be rather loose in practical settings, as Σ_i w(i) would be much smaller than the lower bound η with a practical B and a large number of nodes.


3.2. A special case: two items per part. In this section we propose algorithms to solve the partitioning problem exactly when B = 2, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for B = 2^k in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph G = (V, V × V) on the leaves of a given tree, and assume that the edge (i,j) represents the decision to put the leaf nodes i and j together in a part. Given this definition of the vertices and edges, we associate the value m(i,j) = cost({i,j}) with the edge (i,j) if i ≠ j, and m(i,i) = Σ_{n ∈ V} w(n) (or any sufficiently large number). Then a minimum weighted matching in G defines a partitioning of the vertices in V with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of O(|V|^{5/2}) and O(|V|^2) memory requirements. Therefore we propose yet another exact algorithm for B = 2.

The proposed algorithm, Match, proceeds from the parents of the leaf nodes towards the root. At each internal node n, those leaf nodes that are in the subtree rooted at n and have not yet been put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node n, one of them (arbitrarily) is passed to parent(n). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and run time requirements are O(|V|). We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: An exact algorithm for B = 2

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_2 = {R_1, ..., R_K}, where K = ⌈F/2⌉.
1: for each leaf node ℓ do
2:   add ℓ to list(parent(ℓ))
3: compute a post-order
4: k ← 1
5: for each non-leaf node n in post-order do
6:   if n ≠ r and list(n) contains an odd number of vertices then
7:     ℓ ← the node with the least weight in list(n)
8:     move ℓ to list(parent(n)), adding w(n) to the weight of ℓ    ▷ relay it to the father
9:   else if n = r and list(r) contains an odd number of vertices then
10:    ℓ ← a node with the least weight in list(n)
11:    make ℓ a singleton
12:  for i = 1 to |list(n)| by 2 do
13:    put the ith and (i+1)st vertices of list(n) into R_k; increment k    ▷ match the ith and (i+1)st elements of the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n; the sum of the sizes of the list(·) arrays is |V|. The modification is that, when there is an odd number of leaf nodes to partition


at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i when Match processes node n is defined as w(i,n) − w(n), the sum of the weights of the nodes in the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
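A compact rendition of Algorithm 2 (a sketch, ours; it pairs the pending leaves in sorted order, which is one of the arbitrary matchings the algorithm allows):

def match_leaves(children, w, root, is_leaf):
    # Pair leaves bottom-up; an odd leftover, chosen by least cumulative
    # weight, is relayed to the father with w(n) added (line 8 of Algorithm 2).
    parts = []
    def visit(n):
        pending = []            # [cumulative_weight, leaf] unpaired below n
        for c in children.get(n, []):
            if is_leaf(c):
                pending.append([0, c])   # added leaves have zero weight
            else:
                pending.extend(visit(c))
        pending.sort()                    # least cumulative weight first
        leftover = []
        if len(pending) % 2 == 1:
            if n != root:
                odd = pending.pop(0)
                odd[0] += w[n]            # its path now also includes n
                leftover = [odd]
            else:
                parts.append([pending.pop(0)[1]])   # singleton at the root
        for k in range(0, len(pending), 2):
            parts.append([pending[k][1], pending[k + 1][1]])
        return leftover
    visit(root)
    return parts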

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case B = 2^k for some k; the BiseMatch algorithm is shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after log B = k steps, BiseMatch obtains nodes that represent at most B − 1 other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: A heuristic algorithm for B = 2^k

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1: for level = 1 to k do
2:   M ← Match(T)
3:   for each pair (i,j) ∈ M, remove the leaf node j from T and mark the leaf node i as representative
4: clean up the tree T so that all leaf nodes correspond to some requested entry
5: each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i,j), if w(i, lca(i,j)) ≥ w(j, lca(i,j)), then we designate i to be the representative of the two by adding (i,j) to M; otherwise we designate j to be the representative by adding (j,i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph H = (V, N) is defined as a set of vertices V and a set of nets N. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote


the weight of the vertex v_j. Costs can be associated with nets; we use c(h_i) to denote the cost associated with the net h_i.

Π = {V_1, ..., V_K} is a K-way vertex partition of H = (V, N) if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net h_i is the set of parts connected by h_i. The connectivity λ(i) = |Λ(i)| of a net h_i is the number of parts connected by h_i. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

cutsize(Π) = Σ_{h_i ∈ N} (λ(i) − 1) × c(h_i).   (3.2)

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on the part weights:

(W_max − W_avg) / W_avg ≤ ε.

Here W_max is the largest part weight, W_avg is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].
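For concreteness, the cutsize (3.2) of a given partition can be computed as in the following sketch (ours): nets maps each net to its set of vertices, cost gives c(·), and part_of maps each vertex to its part index.

def cutsize(nets, cost, part_of):
    # connectivity-1 metric: sum over the nets of (lambda - 1) * c(net)
    total = 0
    for h, pins in nets.items():
        lam = len({part_of[v] for v in pins})   # connectivity lambda(h)
        total += (lam - 1) * cost[h]
    return total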

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, that is, c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed and its cost added to


Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

the cost of the net h_n. This way, the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of the removed nets added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets correspond to disjoint paths in the tree T.
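A sketch of the construction (ours). For brevity it skips the chain-merging and single-vertex simplifications just described; these reduce the model size but leave the cutsize unchanged, since nets with identical pin sets have identical connectivity, and single-vertex nets have λ = 1.

def build_diag_hypergraph(children, w, root, requested):
    # One net h_n per internal node n of the transformed tree T: its pins
    # are the requested leaves of the subtree T(n), its cost is w(n).
    nets, cost = {}, {}
    def leaves_below(n):
        if n in requested:      # by construction every leaf is requested
            return frozenset([n])
        pins = frozenset().union(*(leaves_below(c) for c in children[n]))
        nets[n], cost[n] = pins, w[n]
        return pins
    leaves_below(root)
    return nets, cost

Combined with the cutsize function of Section 3.4.1, this evaluates the partition of the example below to c(h_4) + c(h_5), as in the worked computation.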

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (the net h_3 is removed according to the rule described above, and the cost of h_2 includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, and h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

cutsize(V_1, V_2) = c(h_1) × (λ(h_1) − 1) + c(h_2) × (λ(h_2) − 1) + c(h_4) × (λ(h_4) − 1) + c(h_5) × (λ(h_5) − 1)
                  = c(h_4) × (2 − 1) + c(h_5) × (2 − 1)
                  = c(h_4) + c(h_5).

Consider the first part, V_1 = {2}: we have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, V_2 = {1, 5}: for this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

y = L^{-1} e_j ,
a^{-1}_{ij} = (U^{-1} y)_i ,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1}y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i,k) only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.

If we were to exclude the case when more than one entry in the same column of A^{-1} is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: indeed, we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

occurrence to a dierent leaf node (we add two zero-weighted leaf nodes to the nodej of the elimination tree) Then adding these two lower bounds yields a lower boundfor the general case However in our experience we have found this lower bound tobe loose Note that applying this lower bound to the case where only diagonal entriesare requested yields the lower bound given in Theorem 31

The PoPart and the BiseMatch heuristics do not naturally generalize to theo-diagonal case because the generalized problem has a more sophisticated underlyingstructure However the hypergraph partitioning-based approach works for arbitraryentries of the inverse The idea is to model the forward and backward solves with twodierent hypergraphs and then to partition these two hypergraphs simultaneously Ithas been shown how to partition two hypergraphs simultaneously in [27] The essentialidea which is rened in [28] is to build a composite hypergraph by amalgamatingthe relevant vertices of the two hypergraphs while keeping the nets intact In ourcase the two hypergraphs would be the model for the diagonal entries associatedwith the column subscripts (forward phase) and the model for the diagonal entriesassociated with the row subscripts (backward phase) again assuming that the sameindices are distinguished by associating them with dierent leaf nodes We have thento amalgamate any two vertices i and j where the entry aminus1

ij is requested

Figure 41 shows an example where the requested entries are aminus171 a

minus162 and aminus1

95 The transformed elimination tree and the nets of the hypergraphs associated with theforward (hfwd) and backward (hbwd) solves are shown Note that the nets hfwd

3 aswell as hbwd

3 hbwd4 and hbwd

8 are removed The nodes of the tree which correspond to

18 AMESTOY ET AL

h2

h4

h5

h1

V1

V2

h10

h9

h7

h6

fwd

fwd

fwd

fwd

bwd

bwd

bwd

bwd

5 9

1 7

2 6h

4

h1

h2

4

5

3

2

1 8

7

9

6

h7h

6

10 h10

bwd

h9

bwd

bwdbwd

fwd

fwd

fwd

h5

fwd

2

1

6 7

5

9

h5

bwd

h5

bwd

Figure 41 Example of hypergraph model for the general case aminus171 a

minus162 and aminus1

95 are requested

the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = { {a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}} }. The cutsize is c(h^bwd_5) + c(h^fwd_4) + c(h^fwd_5). Consider the computation of a^{-1}_{62}: we need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, with each hyperedge repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some additional matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from


the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range 1-200 to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries on the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances, and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases; a similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                 PoPart                    BiseMatch
B     P      min     max     avg      min     max     avg
4     0.05   1.0011  1.1751  1.0278   1.0000  1.0139  1.0013
      0.10   1.0005  1.1494  1.0192   1.0000  1.0073  1.0005
      0.20   1.0003  1.0945  1.0119   1.0000  1.0052  1.0003
      0.40   1.0001  1.0585  1.0072   1.0000  1.0031  1.0001
      0.60   1.0001  1.0449  1.0053   1.0000  1.0019  1.0001
      0.80   1.0000  1.0367  1.0043   1.0000  1.0029  1.0001
      1.00   1.0000  1.0491  1.0038   1.0000  1.0101  1.0002
16    0.05   1.0050  1.1615  1.0592   1.0000  1.0482  1.0113
      0.10   1.0026  1.1780  1.0485   1.0000  1.0553  1.0075
      0.20   1.0016  1.2748  1.0374   1.0000  1.0334  1.0035
      0.40   1.0007  1.1898  1.0246   1.0000  1.0230  1.0016
      0.60   1.0005  1.1431  1.0186   1.0000  1.0166  1.0010
      0.80   1.0004  1.1136  1.0154   1.0000  1.0190  1.0011
      1.00   1.0003  1.1052  1.0133   1.0000  1.0096  1.0008
64    0.05   1.0132  1.1581  1.0800   1.0000  1.0797  1.0275
      0.10   1.0101  1.1691  1.0715   1.0002  1.0584  1.0196
      0.20   1.0054  1.1389  1.0599   1.0001  1.0506  1.0125
      0.40   1.0030  1.1843  1.0497   1.0000  1.0437  1.0079
      0.60   1.0020  1.2362  1.0407   1.0000  1.1022  1.0072
      0.80   1.0015  1.3018  1.0383   1.0000  1.0344  1.0044
      1.00   1.0014  1.2087  1.0315   1.0000  1.0141  1.0024
256   0.05   1.0050  1.1280  1.0651   1.0000  1.0867  1.0342
      0.10   1.0127  1.1533  1.0721   1.0003  1.0911  1.0314
      0.20   1.0133  1.1753  1.0730   1.0002  1.0722  1.0257
      0.40   1.0093  1.1598  1.0668   1.0003  1.0540  1.0187
      0.60   1.0068  1.1621  1.0602   1.0002  1.0572  1.0174
      0.80   1.0068  1.1314  1.0563   1.0001  1.0515  1.0120
      1.00   1.0043  1.1203  1.0495   1.0001  1.0677  1.0118
Over all triplets
             1.0000  1.3018  1.0359   1.0000  1.1110  1.0079

the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16, 64. In this table,


Table 5.2
The total size of the loaded factors and the execution times with MUMPS, with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                              Total size of the loaded        Running time of the
                              factors (MBytes)                solution phase (s)
matrix        B   Lower bound    NoES    ES-Nat  ES-PoP      NoES   ES-Nat  ES-PoP
CESR21532    16          5403    63313     7855    5422     11138      694     393
             64          1371    15828     2596    1389      3596      381     166
CESR46799    16          2399   114051     3158    2417     39623      767     477
             64           620    28512     1176     635      8663      513     285
CESR72358    16          1967   375737     6056    2008    108009     2637     718
             64           528    93934     4796     571     31740     2741     520
CESR148286   16          8068  1595645    16595    8156    433967     7207    2685
             64          2092   398911    11004    2179    140493     7267    1998
af23560      16         16720   114672    17864   16745     20806     2411    1976
             64          4215    28668     5245    4245      6685     1210     595
ecl32        16         95478   618606   141533   95566    121847    27263   17606
             64         23943   154651    43429   24046     35255     9741    4829
stokes64     16           721     8503     1026     726      1312      142      85
             64           185     2125      425     189       488      102      41
boyd1        16          2028    75521     4232    2031    165512     3898    2149
             64           515    18880     1406     518     54927     2305    1212

In this table, the values in the column "Lower bound" are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational schemes where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3: The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                           Total size of the           Running time of the
                    Lower  loaded factors (MBytes)     solution phase (s)
  matrix        B   bound  ES-PoP    ES-HP             ES-PoP    ES-HP

  CESR21532    16     563    1782      999                566      169
               64     164     703      464                283      136
  CESR46799    16     264     549      416                322      251
               64      93     232      195                355      253
  CESR72358    16     242    1124      868               1229      738
               64     116     794      598                995      728
  CESR148286   16     905    3175     2693               4260     3217
               64     345    2080     1669               2810     2358
  af23560      16    1703    3579     2463               1092      664
               64     458    1219     1003                471      343
  ecl32        16    9617   22514    12615               5077     2652
               64    2483    7309     4664               1992     1199
  stokes64     16      77     188      149                 29       23
               64      26      75       74                 19       17
  boyd1        16     205     481      258                390      344
               64      55     198       93                259      242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed.


[Figure 6.1 shows a small assembly tree with three requested leaves, labelled 1, 2, and 3; the drawing itself is not reproduced here.]

Figure 6.1: Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1, 2}, {3} is better than the partition {1, 3}, {2}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size in the in-core context: the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well in out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
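The two metrics can be made concrete with a small sketch; the encoding below (a dictionary of per-leaf path sets and per-node weights) is our own illustrative assumption, not part of the MUMPS implementation. The out-of-core metric charges each node of a block's union of paths once per block, whereas the in-core metric charges it once per right-hand side in the block, as in the caption of Figure 6.1.

    # Out-of-core metric: factors on the union of paths are loaded once per block.
    def loaded_volume(parts, paths, weight):
        return sum(sum(weight[n] for n in set().union(*(paths[i] for i in part)))
                   for part in parts)

    # In-core metric: every node of the union performs one operation per entry
    # in the processed block.
    def operation_count(parts, paths):
        return sum(len(part) * len(set().union(*(paths[i] for i in part)))
                   for part in parts)

With such functions one can check on small examples, like that of Figure 6.1, that the two metrics can rank two partitions in opposite orders.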

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped on a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N,N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least N linear systems and require considerably more memory. On the other hand, if a set of entries in the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree and then a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed differences that apply when the factorization is held in-core. Although we have taken the first steps toward showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15. doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, (1993).

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, Presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.


entry, say in its ith component, the equations to be solved are those that correspond to the nodes in the unique path from node i to the root. In any case, assuming the matrix is irreducible, the last entry $x_N$, corresponding to the root of the tree, is nonzero. Consider for example $Lx = e_3$ for the lower triangular factor L shown in Fig. 2.1(b). Clearly $x_1 = x_2 = 0$ and $x_3 = 1/\ell_{33}$. These in turn imply that $x_4 = 0$ and $x_5, x_6 \neq 0$. The nonzero entries correspond to the nodes that are in the unique path from node 3 to the root node 6, as seen in Fig. 2.1(c).

With Corollary 2.2, we are halfway through solving (1.1) efficiently for a particular entry of the inverse: we solve only the relevant equations involving the L factor (and hence only access the necessary parts of it). Next we show which equations involving U need to be solved when only a particular entry of the solution is requested, thereby specifying the whole solution process for an entry of the inverse.

Lemma 2.3. In order to obtain the ith component of the solution to $Uz = y$, that is $z_i = (U^{-1}y)_i$, one has to solve the equations corresponding to the nodes that are in the unique path from the highest node in $\mathrm{struct}(y) \cap \mathrm{ancestors}(i)$ to i, where $\mathrm{struct}(y)$ denotes the nodes associated with the nonzero entries of y, and $\mathrm{ancestors}(i)$ denotes the set of ancestors of node i in the elimination tree.

Proof. We first show (1) that the set of components of z involved in the computation of $z_i$ is the set of ancestors of i in the elimination tree. We then (2) reduce this set using the structure of y.

(1) We prove by top-down induction on the tree that the only components involved in the computation of any component $z_l$ of z are the ancestors of l in the elimination tree. The root component $z_N$ is computed as $z_N = y_N / u_{NN}$ (thus it requires no other component of z), and the root has no ancestor in the tree.

For any node l, following a left-looking scheme, $z_l$ is computed as

\[ z_l = \Big( y_l - \sum_{k=l+1}^{N} u_{lk} z_k \Big) \Big/ u_{ll} = \Big( y_l - \sum_{k \,:\, u_{lk} \neq 0} u_{lk} z_k \Big) \Big/ u_{ll} . \qquad (2.1) \]

All the nodes in the set $K_l = \{ k : u_{lk} \neq 0 \}$ are ancestors of l, by definition of the elimination tree (since $\mathrm{struct}(U^T) = \mathrm{struct}(L)$). Thus, by applying the induction hypothesis to all the nodes in $K_l$, all the required nodes are ancestors of l.

(2) The pattern of y can be exploited to show that some components of z which are involved in the computation of $z_i$ are zero. Denoting by $k_i$ the highest node in $\mathrm{struct}(y) \cap \mathrm{ancestors}(i)$, that is, the highest ancestor of i such that $y_{k_i} \neq 0$, we have

\[ z_k = 0 \ \text{if}\ k \in \mathrm{ancestors}(k_i) , \qquad z_k \neq 0 \ \text{if}\ k \in \mathrm{ancestors}(i) \setminus \mathrm{ancestors}(k_i) . \]

Both statements are proved by induction using the same left-looking scheme. Therefore the only components of z required lie on the path between i and $k_i$, the highest node in $\mathrm{struct}(y) \cap \mathrm{ancestors}(i)$.

In particular, when $y_N$ is nonzero, as would be the case when y is the vector obtained after forward elimination for the factorization of an irreducible matrix, Lemma 2.3 states that we need to solve the equations that correspond to nodes that lie in the unique path from the root node to node i. Consider the U given in Fig. 2.1(b),


and suppose we want to compute $(Uz = e_6)_2$. As we are interested in $z_2$, we have to compute $z_3$ and $z_5$ to be able to solve the second equation; in order to compute $z_3$, we have to compute $z_6$ as well. Therefore we have to solve equation (2.1) for nodes 2, 3, 5, and 6. As seen in Fig. 2.1(c), these correspond to the nodes that are in the unique path from node 6 (the highest ancestor) to node 2. We also note that variable $z_4$ would be nonzero; however, we will not compute it, because it does not play a role in determining $z_2$.

One can formulate $a^{-1}_{ij}$ in three different ways, each involving linear system solutions with L and U. Consider the general formula

\[ a^{-1}_{ij} = e_i^T A^{-1} e_j = e_i^T U^{-1} L^{-1} e_j , \qquad (2.2) \]

and the three possible ways of parenthesizing these equations. Our method, as shown in (1.1), corresponds to the parenthesization

\[ \big( U^{-1} ( L^{-1} e_j ) \big)_i . \qquad (2.3) \]

The other two parenthesizations are

\[ a^{-1}_{ij} = \big( ( e_i^T U^{-1} ) \, L^{-1} \big)_j \qquad (2.4) \]

and

\[ a^{-1}_{ij} = ( e_i^T U^{-1} ) \, ( L^{-1} e_j ) . \qquad (2.5) \]

We have the following theorem regarding the equivalence of these three parenthesizations when L and U are computed and stored in such a way that their sparsity patterns are the transposes of each other. In a more general case, the three parenthesizing schemes may differ when L and $U^T$ have different patterns.

Theorem 2.4. The three parenthesizations for computing $a^{-1}_{ij}$ given in equations (2.3) to (2.5) access the same parts of the factor L and the same parts of the factor U.

Proof. Consider the following four computations, on which the different parenthesizations are based:

\[ v^T = e_i^T U^{-1} , \qquad (2.6) \]
\[ w_i = ( U^{-1} y )_i , \qquad (2.7) \]
\[ y = L^{-1} e_j , \qquad (2.8) \]
\[ z_j = ( v L^{-1} )_j . \qquad (2.9) \]

As $v = U^{-T} e_i$ and the pattern of $U^T$ is equal to the pattern of L, by Corollary 2.2, computing v requires accessing the U factors associated with the nodes in the unique path between node i and the root node. Consider $z^T = L^{-T} v^T$. As the pattern of $L^T$ is equal to the pattern of U, by Lemma 2.3, $z_j$ requires accessing the L factors associated with the nodes in the unique path from node j to the highest node in $\mathrm{struct}(v) \cap \mathrm{ancestors}(j)$. As $v_N = (U^{-T} e_i)_N \neq 0$ (since A is irreducible), this requires accessing the L factors from node j to the root (the Nth node). With similar arguments, $y_N \neq 0$, and computing $w_i = (U^{-1} y)_i$ requires accessing the U factors associated with the nodes in the unique path between node i and the root. These observations, combined with another application of Corollary 2.2, this time for $L^{-1} e_j$, complete the proof.
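As a quick sanity check (not part of the original experiments), the equality of the three parenthesizations can be verified numerically on a small dense example; the matrix, the indices, and the use of SciPy's triangular solvers below are our own illustrative choices.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    rng = np.random.default_rng(0)
    A = rng.random((6, 6)) + 6 * np.eye(6)  # diagonally dominant, hence nonsingular
    P, L, U = lu(A)                         # A = P L U
    A = P.T @ A                             # fold the row permutation so that A = L U

    i, j = 1, 4
    e_i, e_j = np.eye(6)[i], np.eye(6)[j]

    y = solve_triangular(L, e_j, lower=True)       # L^{-1} e_j
    v = solve_triangular(U.T, e_i, lower=True)     # v = U^{-T} e_i

    p1 = solve_triangular(U, y)[i]                 # (2.3): (U^{-1}(L^{-1} e_j))_i
    p2 = solve_triangular(L.T, v)[j]               # (2.4): ((e_i^T U^{-1}) L^{-1})_j
    p3 = v @ y                                     # (2.5): (e_i^T U^{-1})(L^{-1} e_j)
    assert np.isclose(p1, p3) and np.isclose(p2, p3)
    assert np.isclose(p3, np.linalg.inv(A)[i, j])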

We now combine Corollary 2.2 and Lemma 2.3 to obtain the theorem.


[Figure 2.2 shows the elimination tree of the example, annotated with the forward solve $Ly = e_3$ and the backward solve $(Uz = y)_2$; the drawing itself is not reproduced here.]

Figure 2.2: Traversal of the elimination tree for the computation of $a^{-1}_{23}$. In the first step, $Ly = e_3$ is solved, yielding $y_3$, $y_5$, and $y_6 \neq 0$. Then $(Uz = y)_2$ is found by computing $z_6$, $z_5$, $z_3$, and finally $z_2$.

Theorem 2.5 (factors to load for computing a particular entry of the inverse; Property 8.9 in [24]). To compute a particular entry $a^{-1}_{ij}$ of $A^{-1}$, the only factors which have to be loaded are the L factors on the path from node j up to the root node, and the U factors on the path going back from the root to node i.

This theorem establishes the efficiency of the proposed computation scheme: we solve only the equations required for a requested entry of the inverse, both in the forward and backward solve phases. We illustrate the above theorem on the previous example in Fig. 2.2. As discussed before, $Ly = e_3$ yields nonzero vector entries $y_3$, $y_5$, and $y_6$, and then $z_2 = (Uz = y)_2$ is found after computing $z_6$, $z_5$, and $z_3$.

We note that the third parenthesization (2.5) can be advantageous while computing only the diagonal entries with the factorizations $LL^T$ or $LDL^T$ (with a diagonal matrix D), because in these cases we need to compute a single vector and then the square of its norm. This formulation can also be useful in a parallel setting, where the solves with L and U can be computed in parallel, whereas in the other two formulations the solves with one of the factors have to wait for the solves with the other one to be completed.

We also note that if the triangular solution procedures for $e_i^T U^{-1}$ and $L^{-1} e_j$ are available, then one can benefit from the third parenthesization in certain cases. If the number of row and column indices concerned by the requested entries is smaller than the number of these requested entries, many of the calculations can be reused if one computes a set of vectors of the form $e_i^T U^{-1}$ and $L^{-1} e_j$ for different i and j, and obtains the requested entries of the inverse by computing the inner products of these vectors. We do not consider this computational scheme in this paper, because such separate solves with L and U are not available within MUMPS.

2.2. Problem definition. We now address the computation of multiple entries and introduce the partitioning problem. We first discuss the diagonal case (that is, computing a set of diagonal entries) and comment on the general case later, in Section 4.

As seen in the previous section, in order to compute $a^{-1}_{ii}$ using the formulation

\[ y = L^{-1} e_i , \qquad a^{-1}_{ii} = ( U^{-1} y )_i , \]

we have to access the parts of L that correspond to the nodes in the unique path from node i to the root, and then access the parts of U that correspond to the nodes in the same path. As discussed above, these are the necessary and sufficient parts of L


and U that are needed. In other words, we know how to solve efficiently for a single requested diagonal entry of the inverse. Now suppose that we are to compute a set R of diagonal entries of the inverse. As said in Section 1, using the equations above entails storing a dense vector for each requested entry. If |R| is small, then we could again identify all the parts of L and U that need to be loaded for at least one requested entry in R, and then solve for all of R at once, accessing the necessary and sufficient parts of L and U only once. However, |R| is usually large: in the application areas mentioned in Section 1, one often wants to compute a large set of entries, such as the whole diagonal of the inverse (in that case |R| = N). Storing that many dense vectors is not feasible; therefore the computations proceed in epochs, where at each epoch a limited number of diagonal entries are computed. This entails accessing (loading) some parts of L and U multiple times in different epochs, according to the entries computed in the corresponding epochs. The main combinatorial problem then becomes that of partitioning the requested entries into blocks in such a way that the overall cost of loading the factors is minimized.

We now formally introduce the problem. Let T be the elimination tree on N nodes, where the factors associated with each node are stored on disks (out-of-core). Let P(i) be the set of nodes in the unique path from node i to the root r, including both nodes i and r. Let w(i) denote the cost of loading the parts of the factors L or U associated with node i of the elimination tree. Similarly, let w(i, j) denote the sum of the costs of the nodes in the path from node i to node j. The cost of solving for $a^{-1}_{ii}$ is therefore

\[ \mathrm{cost}(i) = \sum_{k \in P(i)} 2 \times w(k) = 2 \times w(i, r) . \qquad (2.10) \]

If we solve for a set R of diagonal entries at once, then the overall cost is

\[ \mathrm{cost}(R) = \sum_{i \in P(R)} 2 \times w(i) , \quad \text{where } P(R) = \bigcup_{i \in R} P(i) . \]

We use B to denote the maximum number of diagonal entries that can be computed in an epoch. This is the number of dense vectors that we must hold, and so it is limited by the available storage.
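The path and cost computations of (2.10) are straightforward to express in code; the sketch below, with its parent-pointer encoding of the tree (the root r satisfying parent[r] == r), is our own illustrative convention.

    # Nodes on the unique path from node i to the root (both included).
    def path_to_root(parent, i):
        P = [i]
        while parent[P[-1]] != P[-1]:
            P.append(parent[P[-1]])
        return P

    # Cost of solving for all diagonal entries indexed by R in a single epoch:
    # every node on the union of the paths is loaded once for L and once for U.
    def cost(parent, w, R):
        nodes = set().union(*(path_to_root(parent, i) for i in R))
        return 2 * sum(w[n] for n in nodes)

The cost of a partition R_1, ..., R_K is then simply the sum of cost(parent, w, R_k) over the parts, as in (2.11) below.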

The TreePartitioning problem is formally defined as follows: given a tree T with N nodes, a set $R = \{i_1, \ldots, i_m\}$ of nodes in the tree, and an integer $B \le m$, partition R into a number of subsets $R_1, R_2, \ldots, R_K$ so that $|R_k| \le B$ for all k, and the total cost

\[ \mathrm{cost}(R) = \sum_{k=1}^{K} \mathrm{cost}(R_k) \qquad (2.11) \]

is minimum. The number of subsets, K, is not specified, but obviously $K \ge \lceil m/B \rceil$. Without loss of generality, we can assume that there is a one-to-one correspondence between R and the leaf nodes in T. Indeed, if there is a leaf node i where $i \notin R$, then we can delete node i from T. Similarly, if there is an internal node i where $i \in R$, then we create a leaf node i′ of zero weight and make it an additional child of i. For ease of discussion and formulation, for each requested node (leaf or not) of the elimination tree, we add a leaf node with zero weight. To clarify the execution scheme, we now specify the algorithm that computes the diagonal entries of the inverse specified by a given $R_k$. We first


find $P(R_k)$; we then post-order the nodes in $P(R_k)$, start loading the associated L factors from the disk, and perform the forward solves with L. When we reach the root node, we have $|R_k|$ dense vectors, and we start loading the associated U factors from the disk and performing backward substitutions along the paths that we traversed (in reverse order) during the forward substitutions.

3. Partitioning methods and models. As discussed above, partitioning the requested entries into blocks to minimize the cost of loading factors corresponds to the TreePartitioning problem. In this section, we focus on the case where all of the requested entries are on the diagonal of the inverse. As noted before, in this case the partial forward and backward solves correspond to visiting the same path in the elimination tree. We analyse the TreePartitioning problem in detail for this case and show that it is NP-complete; we also show that the case where the block size B = 2 is polynomial time solvable. We provide two heuristics, one with an approximation guarantee (in the sense that we can prove that it is at worst twice as bad as optimal) and the other being somewhat better in practice; we also introduce a hypergraph partitioning-based formulation which is more general than the other heuristics.

Before introducing the algorithms and models, we present a lower bound for the cost of an optimal partition. Let nl(i) denote the number of leaves of the subtree rooted at node i, which can be computed as follows:

\[ nl(i) = \begin{cases} 1 , & i \text{ is a leaf node} , \\ \sum_{j \in \mathrm{children}(i)} nl(j) , & \text{otherwise} . \end{cases} \qquad (3.1) \]

We note that, as all the leaf nodes correspond to the requested diagonal entries of the inverse, nl(i) corresponds to the number of forward and backward solves that have to be performed at node i.

Given the number of forward and backward solves that pass through a node i, it is easy to define the following lower bound on the amount of the factors loaded.

Theorem 3.1 (lower bound on the amount of factors to load). Let T be a node-weighted tree, w(i) be the weight of node i, B be the maximum allowed size of a part, and nl(i) be the number of leaf nodes in the subtree rooted at i. Then we have the following lower bound, denoted by η, on the optimal solution $c^*$ of the TreePartitioning problem:

\[ \eta = 2 \times \sum_{i \in T} w(i) \times \left\lceil \frac{nl(i)}{B} \right\rceil \le c^* . \]

Proof. Follows easily by noting that each node i has to be loaded at least $\lceil nl(i)/B \rceil$ times.

As the formula includes nl(·), the lower bounds for wide and shallow trees will usually be smaller than the lower bounds for tall and skinny trees. Each internal node is on a path from (at least) one leaf node; therefore $\lceil nl(i)/B \rceil$ is at least 1, and $2 \times \sum_i w(i) \le c^*$.

Figure 3.1 illustrates the notion of the number of leaves of a subtree and the computation of the lower bound. Entries $a^{-1}_{11}$, $a^{-1}_{33}$, and $a^{-1}_{44}$ are requested, and the elimination tree of Figure 2.1(c) is modified accordingly to have leaves (with zero weights) corresponding to these entries. The numbers nl(i) are shown next to the nodes.


[Figure 3.1 shows the transformed elimination tree with each node annotated by its nl value; the drawing itself is not reproduced here.]

Figure 3.1: Number of leaves of the subtrees rooted at each node of a transformed elimination tree. The nodes corresponding to the requested diagonal entries of the inverse are shaded, and a leaf node is added for each such entry. Each node is annotated with the number of leaves in the corresponding subtree, resulting in a lower bound of η = 14 with B = 2.

Suppose that each internal node has unit weight and that the block size is 2. Then the lower bound is

\[ \eta = 2 \times \left( \left\lceil \tfrac{1}{2} \right\rceil + \left\lceil \tfrac{1}{2} \right\rceil + \left\lceil \tfrac{2}{2} \right\rceil + \left\lceil \tfrac{3}{2} \right\rceil + \left\lceil \tfrac{3}{2} \right\rceil \right) = 14 . \]
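This computation is easy to automate; the sketch below reuses the parent-pointer encoding introduced earlier (our own convention) and evaluates η by a bottom-up count of the leaves.

    from math import ceil

    def lower_bound(parent, w, B):
        children = {n: [] for n in parent}
        for n, p in parent.items():
            if n != p:
                children[p].append(n)
        nl = {}
        def count(n):                     # nl(i) of (3.1), by recursion
            nl[n] = sum(count(c) for c in children[n]) if children[n] else 1
            return nl[n]
        root = next(n for n, p in parent.items() if n == p)
        count(root)
        return 2 * sum(w[n] * ceil(nl[n] / B) for n in parent)

On the tree of Figure 3.1 (unit weights on the internal nodes, zero weights on the added leaves), this returns 14 for B = 2.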

Recall that we have transformed the elimination tree in such a way that the requested entries now correspond to the leaves, and each leaf corresponds to a requested entry. We have the following computational complexity result.

Theorem 3.2. The TreePartitioning problem is NP-complete.

Proof. We consider the associated decision problem: given a tree T with m leaves, a value of B, and a cost bound c, does there exist a partitioning S of the m leaves into subsets whose size does not exceed B and such that $\mathrm{cost}(S) \le c$? It is clear that this problem belongs to NP: if we are given the partition S, it is easy to check in polynomial time that it is valid and that its cost meets the bound c. We now have to prove that the problem is in the NP-complete subset.

To establish the completeness, we use a reduction from 3-PARTITION [14], which is NP-complete in the strong sense. Consider an instance $I_1$ of 3-PARTITION: given a set $\{a_1, \ldots, a_{3p}\}$ of 3p integers and an integer Z such that $\sum_{1 \le j \le 3p} a_j = pZ$, does there exist a partition of $\{1, \ldots, 3p\}$ into p disjoint subsets $K_1, \ldots, K_p$, each with three elements, such that for all $1 \le i \le p$, $\sum_{j \in K_i} a_j = Z$?

We build the following instance $I_2$ of our problem: the tree is a three-level tree composed of $N = 1 + 3p + pZ$ nodes; the root $v_r$, of cost $w_r$, has 3p children $v_i$, of same cost $w_v$, for $1 \le i \le 3p$. In turn, each $v_i$ has $a_i$ children, each being a leaf node of zero cost. This instance $I_2$ of the TreePartitioning problem is shown in Fig. 3.2. We let B = Z and ask whether there exists a partition of leaf nodes of cost $c = p w_r + 3p w_v$. Here $w_r$ and $w_v$ are arbitrary values (we can take $w_r = w_v = 1$). We note that the cost c corresponds to the lower bound shown in Theorem 3.1: in this lower bound, each internal node $v_i$ is loaded only once, and the root is loaded p times, since it has pZ = pB leaves below it. Note that the size of $I_2$ is polynomial in the size of $I_1$: indeed, because 3-PARTITION is NP-complete in the strong sense, we can encode $I_1$ in unary, and the size of the instance is O(pZ).


[Figure 3.2 shows the three-level tree: the root r with weight $w_r$, its children $v_1, \ldots, v_{3p}$ with weight $w_v$, and $a_i$ zero-cost leaves below each $v_i$; the drawing itself is not reproduced here.]

Figure 3.2: The instance of the TreePartitioning problem corresponding to a given 3-PARTITION problem. The weight of each node is shown next to the node. The minimum cost of a solution for B = Z to the TreePartitioning problem is $p \times w_r + 3p \times w_v$, which is only possible when the children of each $v_i$ are all in the same part, and when the children of three different internal nodes, say $v_i$, $v_j$, $v_k$, are put in the same part. This corresponds to putting the numbers $a_i$, $a_j$, $a_k$ into a set for the 3-PARTITION problem, which sums up to Z.

Now we show that $I_1$ has a solution if and only if $I_2$ has a solution. Suppose first that $I_1$ has a solution $K_1, \ldots, K_p$. The partition of leaf nodes corresponds exactly to the subsets $K_i$: we build p subsets $S_i$ whose leaves are the children of the vertices $v_j$ with $j \in K_i$. Suppose now that $I_2$ has a solution. To meet the cost bound, each internal node has to be loaded only once, and the root at most p times. This means that the partition involves at most p subsets to cover all the leaves. Because there are pZ leaves, each subset is of size exactly Z. Because each internal node is loaded only once, all its leaves belong to the same subset. Altogether, we have found a solution to $I_1$, which concludes the proof.

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 3.3. Unless P=NP, there is no 1 + o(1/N) polynomial approximation algorithm for trees with N nodes in the TreePartitioning problem.

Proof. Assume that there exists a polynomial $1 + \frac{\varepsilon(N)}{N}$ approximation algorithm for trees with N nodes, where $\lim_{N \to \infty} \varepsilon(N) = 0$. Let $\varepsilon(N) < 1$ for $N \ge N_0$. Consider an arbitrary instance $I_0$ of 3-PARTITION, with a set $\{a_1, \ldots, a_{3p}\}$ of 3p integers and an integer Z such that $\sum_{1 \le j \le 3p} a_j = pZ$. Without loss of generality, assume that $a_i \ge 2$ for all i (hence $Z \ge 6$). We ask if we can partition the 3p integers of $I_0$ into p triples of the same sum Z. Now we build an instance $I_1$ of 3-PARTITION by adding X times the integer $Z - 2$ and 2X times the integer 1 to $I_0$, where $X = \max\!\left( \left\lceil \frac{N_0 - 1}{Z + 3} \right\rceil - p , \ 1 \right)$. Hence $I_1$ has $3p + 3X$ integers, and we ask whether these can be partitioned into $p + X$ triples of the same sum Z. Clearly, $I_0$ has a solution if and only if $I_1$ does (the integer $Z - 2$ can only be in a set with two 1s).

We build an instance $I_2$ of TreePartitioning from $I_1$ exactly as we did in the proof of Theorem 3.2, with $w_r = w_v = 1$ and B = Z. The only difference is that the value p in that proof has been replaced by $p + X$ here; therefore the three-level tree now has $N = 1 + 3(p + X) + (p + X)Z$ nodes. Note that X has been chosen so that $N \ge N_0$. Just as in the proof of Theorem 3.2, $I_1$ has a solution if and only if the optimal cost for the tree is $c^* = 4(p + X)$; otherwise the optimal cost is at least $4(p + X) + 1$.

If $I_1$ has a solution, and because $N \ge N_0$, the approximation algorithm will


return a cost at most

\[ \left( 1 + \frac{\varepsilon(N)}{N} \right) c^* \le \left( 1 + \frac{1}{N} \right) 4(p + X) = 4(p + X) + \frac{4(p + X)}{N} . \]

But $\frac{4(p+X)}{N} = \frac{4(N-1)}{(Z+3)N} \le \frac{4}{9} < 1$, so that the approximation algorithm can be used to determine whether $I_1$, and hence $I_0$, has a solution. This is a contradiction unless P=NP.

3.1. A partitioning based on post-order. Consider again the case where some entries in the diagonal of the inverse are requested. As said before, the problem of minimizing the size of the factors to be loaded corresponds to the TreePartitioning problem. Consider the heuristic PoPart, shown in Algorithm 1, for this problem.

Algorithm 1: PoPart: a post-order based partitioning
Input: T = (V, E, r) with F leaves; each requested entry corresponds to a leaf node with a zero weight.
Input: B, the maximum allowable size of a part.
Output: $\Pi_{PO} = \{R_1, \ldots, R_K\}$, where $K = \lceil F/B \rceil$, a partition of the leaf nodes.
1: compute a post-order
2: L ← sort the leaf nodes according to their rank in post-order
3: $R_k = \{ L(i) : (k-1) \times B + 1 \le i \le \min\{k \times B, F\} \}$ for $k = 1, \ldots, \lceil F/B \rceil$
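Algorithm 1 amounts to a few lines of code once the leaves are available in post-order; the sketch below is a direct transcription (the argument is assumed to hold the requested leaves already sorted by their post-order rank).

    # PoPart: chop the post-ordered list of requested leaves into blocks of size B.
    def popart(leaves_in_postorder, B):
        L = leaves_in_postorder
        return [L[k * B : (k + 1) * B] for k in range((len(L) + B - 1) // B)]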

As seen in Algorithm 1, the PoPart heuristic first orders the leaf nodes according to their post-order. It then puts the first B leaves in the first part, the next B leaves in the second part, and so on. This simple partitioning approach results in $\lceil F/B \rceil$ parts for a tree with F leaf nodes, and puts B nodes in each part, except maybe in the last one. We have the following theorem, which states that this simple heuristic obtains results that are at most twice the cost of an optimum solution.

Theorem 3.4. Let $\Pi_{PO}$ be the partition obtained by the algorithm PoPart, and $c^*$ be the cost of an optimum solution; then

\[ \mathrm{cost}(\Pi_{PO}) \le 2 \times c^* . \]

Proof. Consider node i. Because the leaves of the subtree rooted at i are sorted consecutively in L, the factors of node i will be loaded at most $\lceil nl(i)/B \rceil + 1$ times. Therefore the overall cost is at most

\[ \mathrm{cost}(\Pi_{PO}) \le 2 \times \sum_i w(i) \times \left( \left\lceil \frac{nl(i)}{B} \right\rceil + 1 \right) \le \eta + 2 \times \sum_i w(i) \le 2 \times c^* . \]

We note that the factor of two in the approximation guarantee would be rather loose in practical settings, as $\sum_i w(i)$ would be much smaller than the lower bound η with a practical B and a large number of nodes.


3.2. A special case: two items per part. In this section, we propose algorithms to solve the partitioning problem exactly when B = 2, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for $B = 2^k$ in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph $G = (V, V \times V)$ of the leaves of a given tree, and assume that the edge (i, j) represents the decision to put the leaf nodes i and j together in a part. Given this definition of the vertices and edges, we associate the value $m(i, j) = \mathrm{cost}(\{i, j\})$ with the edge (i, j) if $i \neq j$, and $m(i, i) = \sum_{n \in V} w(n)$ (or any sufficiently large number). Then a minimum weighted matching in G defines a partitioning of the vertices in V with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of $O(|V|^{5/2})$ and $O(|V|^2)$ memory requirements. Therefore we propose yet another exact algorithm for B = 2.

The proposed algorithm, Match, proceeds from the parents of the leaf nodes to the root. At each internal node n, those leaf nodes that are in the subtree rooted at n and which have not yet been put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node n, one of them (arbitrarily) is passed to parent(n). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and the run time requirements are O(|V|). We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2: Match: an exact algorithm for B = 2
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_2 = \{R_1, \ldots, R_K\}$, where $K = \lceil F/2 \rceil$.
1: for each leaf node ℓ do
2:   add ℓ to list(parent(ℓ))
3: compute a post-order
4: k ← 1
5: for each non-leaf node n in post-order do
6:   if n ≠ r and list(n) contains an odd number of vertices then
7:     ℓ ← the node with least weight in list(n)
8:     move ℓ to list(parent(n)); add w(n) to the weight of ℓ    ▷ relay it to the father
9:   else if n = r and list(r) contains an odd number of vertices then
10:    ℓ ← a node with the least weight in list(n)
11:    make ℓ a singleton
12:  for i = 1 to |list(n)| by 2 do
13:    put the ith and (i+1)st vertices in list(n) into $R_k$; increment k    ▷ match the ith and (i+1)st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n; the sum of the sizes of the list(·) arrays is |V|. The modification is that, when there is an odd number of leaf nodes to partition at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i when Match processes node n is defined as $w(i, n) - w(n)$: the sum of the weights of the nodes in the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
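In code, Algorithm 2 can be sketched as follows; the parent-pointer encoding and the representation of list(n) as a list of (cumulative weight, leaf) pairs are our own conventions, and postorder is assumed to enumerate the internal nodes bottom-up, with the root last.

    # Match: optimal partition for B = 2 (a sketch of Algorithm 2).
    def match(parent, w, leaves, postorder):
        lists = {n: [] for n in parent}
        for leaf in leaves:
            lists[parent[leaf]].append((0, leaf))
        parts = []
        for n in postorder:
            pending = sorted(lists[n])           # sort by cumulative weight
            if len(pending) % 2 == 1:
                cw, leaf = pending.pop(0)        # least cumulative weight
                if parent[n] != n:               # relay it to the father
                    lists[parent[n]].append((cw + w[n], leaf))
                else:
                    parts.append([leaf])         # at the root: make a singleton
            parts.extend([a[1], b[1]]
                         for a, b in zip(pending[0::2], pending[1::2]))
        return parts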

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case $B = 2^k$ for some k: the BiseMatch algorithm, shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after $\log B = k$ steps, BiseMatch obtains nodes that represent at most $B - 1$ other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3: BiseMatch: a heuristic algorithm for $B = 2^k$
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_{2^k} = \{R_1, \ldots, R_K\}$, where $|R_i| \le B$.
1: for level = 1 to k do
2:   M ← Match(T)
3:   for each pair (i, j) ∈ M: remove the leaf node j from T, and mark the leaf node i as representative
4: clean up the tree T so that all leaf nodes correspond to some requested entry
5: each remaining leaf node i corresponds to a part $R_i$, where the nodes that are represented by i are put in $R_i$

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if $w(i, \mathrm{lca}(i, j)) \ge w(j, \mathrm{lca}(i, j))$, then we designate i to be the representative of the two by adding (i, j) to M; otherwise we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is $O(|V| \log B)$, with an O(|V|) memory requirement.

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph $H = (V, \mathcal{N})$ is defined as a set of vertices V and a set of nets $\mathcal{N}$. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote the weight of the vertex $v_j$. Costs can be associated with nets; we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a K-way vertex partition of $H = (V, \mathcal{N})$ if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net $h_i$ is the set of parts connected by $h_i$; the connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

\[ \mathrm{cutsize}(\Pi) = \sum_{h_i \in \mathcal{N}} ( \lambda(i) - 1 ) \times c(h_i) . \qquad (3.2) \]

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

\[ \frac{W_{\max} - W_{\mathrm{avg}}}{W_{\mathrm{avg}}} \le \varepsilon . \]

Here $W_{\max}$ is the largest part weight, $W_{\mathrm{avg}}$ is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].
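Both the objective (3.2) and the constraint are easy to evaluate for a given partition; the sketch below, with nets represented as vertex sets and a part_of map, is our own illustrative encoding.

    # Connectivity-1 cutsize of (3.2): each net pays its cost once per extra part.
    def cutsize(nets, cost, part_of):
        return sum(cost[h] * (len({part_of[v] for v in net}) - 1)
                   for h, net in nets.items())

    # Balance constraint on part weights, for a given imbalance ratio eps.
    def is_balanced(weights, parts, eps):
        W = [sum(weights[v] for v in p) for p in parts]
        avg = sum(W) / len(W)
        return (max(W) - avg) / avg <= eps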

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for $a^{-1}_{ii}$ passes through path(h). In other words, if $\mathrm{path}(h) \subset P(i)$, then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among λ(n) parts, then the factors corresponding to the nodes in $\mathrm{path}(h_n)$ will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, \mathcal{N}_D)$ for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally. There is a net in $\mathcal{N}_D$ for each internal node of T. The net $h_n$ corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of $h_n$ is equal to the weight of node n, i.e., $c(h_n) = w(n)$. This model can be simplified as follows: if a net $h_n$ contains the same vertices as the net $h_j$, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net $h_j$ can be removed and its cost can be added to


[Figure 3.3 shows, on the left, the modified elimination tree and, on the right, the vertices {1, 2, 5} and the nets $h_1$, $h_2$, $h_4$, $h_5$ of the hypergraph, together with a partition into $V_1$ and $V_2$; the drawing itself is not reproduced here.]

Figure 3.3: The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

the cost of the net $h_n$. This way, the net $h_n$ represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.
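The construction just described can be sketched as follows; the parent-pointer encoding is the same illustrative convention as before, and the returned nets map each surviving internal node to the set of requested leaves below it.

    # Build the nets of H_D: one candidate net per internal node (the leaves of
    # its subtree), then absorb equal parent nets along chains, and finally
    # drop single-vertex nets.
    def build_nets(parent, w, leaves):
        children, nets, cost = {}, {}, {}
        for n, p in parent.items():
            if n != p:
                children.setdefault(p, []).append(n)
        def leafset(n):
            if n not in children:
                return frozenset([n]) if n in leaves else frozenset()
            s = frozenset().union(*(leafset(c) for c in children[n]))
            nets[n], cost[n] = s, w[n]
            return s
        leafset(next(n for n, p in parent.items() if n == p))
        for n in list(nets):
            p = parent[n]
            while n in nets and p in nets and p != n and nets[p] == nets[n]:
                cost[n] += cost.pop(p)
                nets.pop(p)
                p = parent[p]
        return {n: s for n, s in nets.items() if len(s) > 1}, cost

On the example of Figure 3.3, this would merge the net of node 3 into that of node 2 and drop the construction down to the nets $h_1$, $h_2$, $h_4$, and $h_5$ described below.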

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore $V = \{1, 2, 5\}$ and $\mathcal{N} = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, and $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is

\[ \begin{aligned} \mathrm{cutsize}(V_1, V_2) &= c(h_1) \times (\lambda(h_1) - 1) + c(h_2) \times (\lambda(h_2) - 1) + c(h_4) \times (\lambda(h_4) - 1) + c(h_5) \times (\lambda(h_5) - 1) \\ &= c(h_4) \times (2 - 1) + c(h_5) \times (2 - 1) = c(h_4) + c(h_5) . \end{aligned} \]

Consider the first part, $V_1 = \{2\}$: we have to load the factors associated with the nodes 2, 3, 4, and 5. Consider now the second part, $V_2 = \{1, 5\}$: for this part, we have to load the factors associated with the nodes 1, 4, and 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize $c(h_4) + c(h_5)$.

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,

\[ y = L^{-1} e_j , \qquad a^{-1}_{ij} = ( U^{-1} y )_i , \]

only one y vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.
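The shared portion of the backward solve is delimited by the least common ancestor of the two row indices, which is cheap to obtain from parent pointers; a minimal sketch (reusing path_to_root from the earlier sketch) is given below.

    # Least common ancestor of nodes i and k in the elimination tree: the first
    # ancestor of k that also lies on the path from i to the root.
    def lca(parent, i, k):
        ancestors_of_i = set(path_to_root(parent, i))
        n = k
        while n not in ancestors_of_i:
            n = parent[n]
        return n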

If we were to exclude the case where more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: indeed, we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that, when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices $i$ and $j$ where the entry $a^{-1}_{ij}$ is requested.
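A minimal sketch of this construction (our own illustrative code with hypothetical names; the chain-contraction simplification used for the diagonal model is omitted):

    def path_to_root(n, parent):
        """Nodes on the path from n to the root; parent[root] is None."""
        out = []
        while n is not None:
            out.append(n)
            n = parent[n]
        return out

    def composite_hypergraph(parent, w, requests):
        """requests: list of (i, j) pairs, one per requested a^{-1}_{ij}.
        Vertex v stands for request v; net h^fwd_n (resp. h^bwd_n)
        contains the requests whose forward (resp. backward) solve
        traverses tree node n. Returns a list of (cost, vertex set)."""
        fwd, bwd = {}, {}
        for v, (i, j) in enumerate(requests):
            for n in path_to_root(j, parent):   # forward phase: column j
                fwd.setdefault(n, set()).add(v)
            for n in path_to_root(i, parent):   # backward phase: row i
                bwd.setdefault(n, set()).add(v)
        # amalgamation is implicit: the same vertex v appears in nets of
        # both families; single-vertex nets can never be cut, so drop them
        return [(w[n], frozenset(vs))
                for nets in (fwd, bwd)
                for n, vs in nets.items() if len(vs) > 1]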

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{fwd}$) and backward ($h^{bwd}$) solves are shown. Note that the nets $h^{fwd}_3$, as well as $h^{bwd}_3$, $h^{bwd}_4$, and $h^{bwd}_8$, are removed. The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$).

[Figure 4.1: Example of the hypergraph model for the general case; $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.]

A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$. The cut size is $c(h^{bwd}_5) + c(h^{fwd}_4) + c(h^{fwd}_5)$. Consider the computation of $a^{-1}_{62}$: we need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.
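To restate the accounting behind this example (our summary; $\lambda(n)$ denotes the number of parts whose solve paths traverse node $n$):

$$\mathrm{load}(\Pi) \;=\; \sum_{n} w(n)\,\lambda(n) \;=\; \sum_{n} w(n) \;+\; \sum_{n} w(n)\,(\lambda(n) - 1),$$

where the last sum is exactly the cutsize of $\Pi$ in the composite hypergraph: the total volume loaded equals the unavoidable one load per node plus the cutsize.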

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from


the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we chose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17], and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}, and assigned random weights in the range [1, 200] to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.
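For concreteness, the quantities compared in these experiments can be sketched in a few lines (hypothetical helper names; the actual experiments were run in Matlab):

    from math import ceil

    def path_to_root(n, parent):
        out = []
        while n is not None:
            out.append(n)
            n = parent[n]
        return out

    def partition_cost(parts, parent, w):
        # cost of a partition: each part loads the union of the paths of
        # its leaves, once for the forward and once for the backward solve
        total = 0
        for Rk in parts:
            nodes = set()
            for leaf in Rk:
                nodes.update(path_to_root(leaf, parent))
            total += 2 * sum(w[n] for n in nodes)
        return total

    def po_part(leaves_in_postorder, B):
        # PoPart: cut the post-ordered leaves into consecutive blocks of B
        L = leaves_in_postorder
        return [L[k:k + B] for k in range(0, len(L), B)]

    def lower_bound(parent, w, nl, B):
        # Theorem 3.1: eta = 2 * sum_i w(i) * ceil(nl(i) / B)
        return 2 * sum(w[i] * ceil(nl[i] / B) for i in w)

The reported ratios are then partition_cost(...) / lower_bound(...) for each heuristic.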

We summarize the results in Table 5.1, giving results for B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances, and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of the 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                      PoPart                  BiseMatch
   B     P     min     max     avg     min     max     avg
   4  0.05  1.0011  1.1751  1.0278  1.0000  1.0139  1.0013
      0.10  1.0005  1.1494  1.0192  1.0000  1.0073  1.0005
      0.20  1.0003  1.0945  1.0119  1.0000  1.0052  1.0003
      0.40  1.0001  1.0585  1.0072  1.0000  1.0031  1.0001
      0.60  1.0001  1.0449  1.0053  1.0000  1.0019  1.0001
      0.80  1.0000  1.0367  1.0043  1.0000  1.0029  1.0001
      1.00  1.0000  1.0491  1.0038  1.0000  1.0101  1.0002

  16  0.05  1.0050  1.1615  1.0592  1.0000  1.0482  1.0113
      0.10  1.0026  1.1780  1.0485  1.0000  1.0553  1.0075
      0.20  1.0016  1.2748  1.0374  1.0000  1.0334  1.0035
      0.40  1.0007  1.1898  1.0246  1.0000  1.0230  1.0016
      0.60  1.0005  1.1431  1.0186  1.0000  1.0166  1.0010
      0.80  1.0004  1.1136  1.0154  1.0000  1.0190  1.0011
      1.00  1.0003  1.1052  1.0133  1.0000  1.0096  1.0008

  64  0.05  1.0132  1.1581  1.0800  1.0000  1.0797  1.0275
      0.10  1.0101  1.1691  1.0715  1.0002  1.0584  1.0196
      0.20  1.0054  1.1389  1.0599  1.0001  1.0506  1.0125
      0.40  1.0030  1.1843  1.0497  1.0000  1.0437  1.0079
      0.60  1.0020  1.2362  1.0407  1.0000  1.1022  1.0072
      0.80  1.0015  1.3018  1.0383  1.0000  1.0344  1.0044
      1.00  1.0014  1.2087  1.0315  1.0000  1.0141  1.0024

 256  0.05  1.0050  1.1280  1.0651  1.0000  1.0867  1.0342
      0.10  1.0127  1.1533  1.0721  1.0003  1.0911  1.0314
      0.20  1.0133  1.1753  1.0730  1.0002  1.0722  1.0257
      0.40  1.0093  1.1598  1.0668  1.0003  1.0540  1.0187
      0.60  1.0068  1.1621  1.0602  1.0002  1.0572  1.0174
      0.80  1.0068  1.1314  1.0563  1.0001  1.0515  1.0120
      1.00  1.0043  1.1203  1.0495  1.0001  1.0677  1.0118

Over all triplets:  1.0000  1.3018  1.0359  1.0000  1.1110  1.0079

the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree, and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16 and 64.


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                          Total size of the               Running time of the
                   Lower  loaded factors (MBytes)         solution phase (s)
matrix         B   bound      NoES    ES-Nat   ES-PoP      NoES   ES-Nat  ES-PoP
CESR21532     16   540.3    6331.3     785.5    542.2    1113.8     69.4    39.3
              64   137.1    1582.8     259.6    138.9     359.6     38.1    16.6
CESR46799     16   239.9   11405.1     315.8    241.7    3962.3     76.7    47.7
              64    62.0    2851.2     117.6     63.5     866.3     51.3    28.5
CESR72358     16   196.7   37573.7     605.6    200.8   10800.9    263.7    71.8
              64    52.8    9393.4     479.6     57.1    3174.0    274.1    52.0
CESR148286    16   806.8  159564.5    1659.5    815.6   43396.7    720.7   268.5
              64   209.2   39891.1    1100.4    217.9   14049.3    726.7   199.8
af23560       16  1672.0   11467.2    1786.4   1674.5    2080.6    241.1   197.6
              64   421.5    2866.8     524.5    424.5     668.5    121.0    59.5
ecl32         16  9547.8   61860.6   14153.3   9556.6   12184.7   2726.3  1760.6
              64  2394.3   15465.1    4342.9   2404.6    3525.5    974.1   482.9
stokes64      16    72.1     850.3     102.6     72.6     131.2     14.2     8.5
              64    18.5     212.5      42.5     18.9      48.8     10.2     4.1
boyd1         16   202.8    7552.1     423.2    203.1   16551.2    389.8   214.9
              64    51.5    1888.0     140.6     51.8    5492.7    230.5   121.2

In this table, the values in the column "Lower bound" are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when the right-hand sides are reordered following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, with a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths, hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes that are far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour: for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                        Total size of the            Running time of the
                 Lower  loaded factors (MBytes)      solution phase (s)
matrix       B   bound    ES-PoP     ES-HP           ES-PoP     ES-HP
CESR21532   16    56.3     178.2      99.9             56.6      16.9
            64    16.4      70.3      46.4             28.3      13.6
CESR46799   16    26.4      54.9      41.6             32.2      25.1
            64     9.3      23.2      19.5             35.5      25.3
CESR72358   16    24.2     112.4      86.8            122.9      73.8
            64    11.6      79.4      59.8             99.5      72.8
CESR148286  16    90.5     317.5     269.3            426.0     321.7
            64    34.5     208.0     166.9            281.0     235.8
af23560     16   170.3     357.9     246.3            109.2      66.4
            64    45.8     121.9     100.3             47.1      34.3
ecl32       16   961.7    2251.4    1261.5            507.7     265.2
            64   248.3     730.9     466.4            199.2     119.9
stokes64    16     7.7      18.8      14.9              2.9       2.3
            64     2.6       7.5       7.4              1.9       1.7
boyd1       16    20.5      48.1      25.8             39.0      34.4
            64     5.5      19.8       9.3             25.9      24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and 64. The table displays the lower bound (given in Section 4), and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

[Figure 6.1: Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1,2},{3} is better than the partition {1,3},{2}, because the total size of the loaded factors is 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).]

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight, and that a unit number of operations is performed for each requested entry in a given set. The partition {1,2},{3} is better than the partition {1,3},{2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
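The two metrics can be put side by side as follows (an illustrative sketch under the figure's assumptions; not taken from the paper's software):

    def union_of_paths(Rk, parent):
        nodes = set()
        for leaf in Rk:
            n = leaf
            while n is not None:
                nodes.add(n)
                n = parent[n]
        return nodes

    def out_of_core_loads(parts, parent, w):
        # volume of factors loaded: each node of the union of paths
        # of a part is loaded once per part
        return sum(sum(w[n] for n in union_of_paths(Rk, parent))
                   for Rk in parts)

    def in_core_flops(parts, parent, w):
        # block solves operate on the union of the structures, so each
        # node of the union costs one operation per right-hand side
        return sum(len(Rk) * sum(w[n] for n in union_of_paths(Rk, parent))
                   for Rk in parts)

On the tree of Figure 6.1 these two functions rank the partitions {1,2},{3} and {1,3},{2} in opposite orders, as described above.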

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
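A sketch of the interleaving idea (our own illustration; the precise strategy of [24] may differ): instead of processing the parts one after the other, the right-hand sides are reordered round-robin, so that each processed block mixes entries mapped to different subtrees, and hence to different processors.

    from itertools import chain, zip_longest

    def interleave(parts, B):
        """Mix the parts round-robin, then re-block by B, so that every
        processed block touches several subtrees (more active processors,
        at the price of losing some locality gained by partitioning)."""
        mixed = chain.from_iterable(zip_longest(*parts))
        flat = [x for x in mixed if x is not None]
        return [flat[k:k + B] for k in range(0, len(flat), B)]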

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N,N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].
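For reference, these recurrences are commonly stated as follows (a sketch in our notation for a factorization $A = LDU$ with unit triangular $L$ and $U$; see [25, 13] for the original derivations). Since $A^{-1} = U^{-1}D^{-1}L^{-1}$, one has $Z = A^{-1} = D^{-1}L^{-1} + (I - U)Z = U^{-1}D^{-1} + Z(I - L)$, and hence, for entries in the pattern of $(L+U)^T$,

$$z_{ij} = -\sum_{k>i} u_{ik} z_{kj} \quad (i < j), \qquad z_{ij} = -\sum_{k>j} z_{ik} l_{kj} \quad (i > j), \qquad z_{jj} = d_{jj}^{-1} - \sum_{k>j} u_{jk} z_{kj},$$

which shows why the computation starts at $(N,N)$ and proceeds backwards: each entry needs only already computed entries with larger indices.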

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $N$ linear systems and require considerably more memory. On the other hand, if only a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order so as to obtain all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only the factors from the paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests of the heuristics and concluded that both PoPart and BiseMatch perform very well on average, while in the worst case the performance of BiseMatch is better. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of the factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing how to compute off-diagonal inverse entries efficiently, more work is needed in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15. doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533, Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications (1993).

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 6: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

6 AMESTOY ET AL

and suppose we want to compute (Uz = e6)2 As we are interested in z2 we have tocompute z3 and z5 to be able to solve the second equation in order to compute z3we have to compute z6 as well Therefore we have to solve equation 21 for nodes2 3 5 and 6 As seen in Fig 21(c) these correspond to the nodes that are in theunique path from node 6 (the highest ancestor) to node 2 We also note that variablez4 would be nonzero however we will not compute it because it does not play a rolein determining z2

One can formulate aminus1ij in three dierent ways each involving linear system solu-

tions with L and U Consider the general formula

aminus1ij = eT

i Aminus1ej

= eTi Uminus1Lminus1ej (22)

and the three possible ways of parenthesizing these equations Our method as shownin (11) corresponds to the parenthesization(

Uminus1(Lminus1ej

))i (23)

The other two parenthesizations are

aminus1ij =

((eTi Uminus1)Lminus1

)j

(24)

=(eTi Uminus1) (Lminus1ej

) (25)

We have the following theorem regarding the equivalence of these three parenthe-sizations when L and U are computed and stored in such way that their sparsitypatterns are the transposes of each other In a more general case the three parenthe-sizing schemes may dier when L and UT have dierent patterns

Theorem 24 The three parenthesizations for computing aminus1ij given in equa-

tions (23)-to-(25) access the same parts of the factor L and the same parts of the

factor U

Proof Consider the following four computations on which the dierent parenthe-sizations are based

vT = eTi Uminus1 (26)

wi = (Uminus1y)i (27)

y = Lminus1ej (28)

zj = (vLminus1)j (29)

As v = UminusT ei and the pattern of UT is equal to the pattern of L by Corollary 22computing v requires accessing the U factors associated with nodes in the unique pathbetween node i and the root node Consider zT = LminusT vT As the pattern of LT isequal to the pattern of U by Lemma 23 zj requires accessing the L factors associatedwith the nodes in the unique path from node j to the highest node in struct(v) capancestors(j) As vN = (UminusT ei)N 6= 0 (since A is irreducible) this requires accessingthe L factors from node j to the root (the Nth node) With similar argumentsyN 6= 0 and computing wi = (Uminus1y)i requires accessing the U factors associatedwith the nodes in the unique path between node i and the root These observationscombined with another application of Corollary 22 this time for Lminus1ej complete theproof

We now combine Corollary 22 and Lemma 23 to obtain the theorem

COMPUTING INVERSE ENTRIES 7

Uz =y)2

= eL y 3

6

2

3

5

4

1

(

Figure 22 Traversal of the elimination tree for the computation of aminus123 In the rst step

Ly = e3 is solved yielding y3 y5 and y6 6= 0 Then (Uz = y)2 is found by computing z6 z5 z3 andnally z2

Theorem 25 (Factors to load for computing a particular entry of the inverseProperty 89 in [24]) To compute a particular entry aminus1

ij in Aminus1 the only factors

which have to be loaded are the L factors on the path from node j up to the root node

and the U factors on the path going back from the root to node iThis theorem establishes the eciency of the proposed computation scheme we

solve only the equations required for a requested entry of the inverse both in theforward and backward solve phases We illustrate the above theorem on the previousexample in Fig 22 As discussed before Ly = e3 yields nonzero vector entries y3 y5and y6 and then z2 = (Uz = y)2 is found after computing z6 z5 z3

We note that the third parenthesization (25) can be advantageous while comput-ing only the diagonal entries with the factorizations LLT or LDLT (with a diagonalD matrix) because in these cases we need to compute a single vector and computethe square of its norm This formulation can also be useful in a parallel setting wherethe solves with L and U can be computed in parallel whereas in the other two for-mulations the solves with one of the factors has to wait for the solves with the otherone to be completed

We also note that if the triangular solution procedures for eTi Uminus1 and Lminus1ej are

available then one can benet from the third parenthesization in certain cases Ifthe number of row and column indices concerned by the requested entries is smallerthan the number of these requested entries many of the calculations can be reused ifone computes a set of vectors of the form eT

i Uminus1 and Lminus1ej for dierent i and j and

obtains the requested entries of the inverse by computing the inner products of thesevectors We do not consider this computational scheme in this paper because suchseparate solves with L and U are not available within MUMPS

22 Problem denition We now address the computation of multiple entriesand introduce the partitioning problem We rst discuss the diagonal case (thatis computing a set of diagonal entries) and comment on the general case later inSection 4

As seen in the previous section in order to compute aminus1ii using the formulation

y = Lminus1ei

aminus1ii = (Uminus1y)i

we have to access the parts of L that correspond to the nodes in the unique path fromnode i to the root and then access the parts of U that correspond to the nodes inthe same path As discussed above these are the necessary and sucient parts of L

8 AMESTOY ET AL

and U that are needed In other words we know how to solve eciently for a singlerequested diagonal entry of the inverse Now suppose that we are to compute a setR of diagonal entries of the inverse As said in Section 1 using the equations aboveentails storing a dense vector for each requested entry If |R| is small then we couldagain identify all the parts of L and U that are needed to be loaded at least for onerequested entry in R and then solve for all R at once accessing the necessary andsucient parts of L and U only once However |R| is usually large in the applicationareas mentioned in Section 1 one often wants to compute a large set of entries suchas the whole diagonal of the inverse (in that case |R| = N) Storing that manydense vectors is not feasible therefore the computations proceed in epochs where ateach epoch a limited number of diagonal entries are computed This entails accessing(loading) some parts of L and U multiple times in dierent epochs according to theentries computed in the corresponding epochs The main combinatorial problem thenbecomes that of partitioning the requested entries into blocks in such a way that theoverall cost of loading the factors is minimized

We now formally introduce the problem Let T be the elimination tree on Nnodes where the factors associated with each node are stored on disks (out-of-core)Let P (i) be the set of nodes in the unique path from node i to the root r includingboth nodes i and r Let w(i) denote the cost of loading the parts of the factors Lor U associated with node i of the elimination tree Similarly let w(i j) denote thesum of the costs of the nodes in the path from node i to node j The cost of solvingaminus1

ii is therefore

cost(i) =sum

kisinP (i)

2times w(k) = 2times w(i r) (210)

If we solve for a set R of diagonal entries at once then the overall cost is therefore

cost(R) =sum

iisinP (R)

2times w(i) where P (R) =⋃iisinR

P (i)

We use B to denote the maximum number of diagonal entries that can be computedat an epoch This is the number of dense vectors that we must hold and so is limitedby the available storage

The TreePartitioning problem is formally dened as follows given a tree Twith N nodes a set R = i1 im of nodes in the tree and an integer B le mpartition R into a number of subsets R1 R2 RK so that |Rk| le B for all k andthe total cost

cost(R) =Ksum

k=1

cost(Rk) (211)

is minimumThe number of subsets K is not specied but obviously K ge dm

B e Without lossof generality we can assume that there is a one-to-one correspondence between R andleaf nodes in T Indeed if there is a leaf node i where i isin R then we can delete nodei from T Similarly if there is an internal node i where i isin R then we create a leafnode iprime of zero weight and make it an additional child of i For ease of discussion andformulation for each requested node (leaf or not) of the elimination tree we add a leafnode with zero weight To clarify the execution scheme we now specify the algorithmthat computes the diagonal entries of the inverse specied by a given Rk We rst

COMPUTING INVERSE ENTRIES 9

nd P (Rk) we then post-order the nodes in P (Rk) and start loading the associatedL factors from the disk and perform the forward solves with L When we reach theroot node we have |Rk| dense vectors and we start loading the associated U factorsfrom the disk and perform backward substitutions along the paths that we traversed(in reverse order) during the forward substitutions

3 Partitioning methods and models As discussed above partitioning therequested entries into blocks to minimize the cost of loading factors corresponds tothe TreePartitioning problem In this section we will focus on the case whereall of the requested entries are on the diagonal of the inverse As noted before inthis case the partial forward and backward solves correspond to visiting the samepath in the elimination tree We analyse the TreePartitioning problem in detailfor this case and show that it is NP-complete we also show that the case where theblock size B = 2 is polynomial time solvable We provide two heuristics one with anapproximation guarantee (in the sense that we can prove that it is at worst twice asbad as optimal) and the other being somewhat better in practice we also introducea hypergraph partitioning-based formulation which is more general than the otherheuristics

Before introducing the algorithms and models we present a lower bound for thecost of an optimal partition Let nl(i) denote the number of leaves of the subtreerooted at node i which can be computed as follows

nl(i) =

1 i is a leaf nodesum

jisinchildren(i) nl(j) otherwise(31)

We note that as all the leaf nodes correspond to the requested diagonal entries of theinverse nl(i) corresponds to the number of forward and backward solves that have tobe performed at node i

Given the number of forward and backward solves that pass through a node i itis easy to dene the following lower bound on the amount of the factors loaded

Theorem 31 (Lower bound of the amount of factors to load) Let T be a

node weighted tree w(i) be the weight of node i B be the maximum allowed size of

a partition and nl(i) be the number of leaf nodes in the subtree rooted at i Then

we have the following lower bound denoted by η on the optimal solution c of the

TreePartitioning problem

η = 2timessumiisinT

w(i)timeslceilnl(i)B

rceille c

Proof Follows easily by noting that each node i has to be loaded at leastlceil

nl(i)B

rceiltimes

As the formula includes nl(middot) the lower bounds for wide and shallow trees willusually be smaller than the lower bounds for tall and skinny trees Each internalnode is on a path from (at least) one leaf node therefore dnl(i)Be is at least 1 and2times

sumi w(i) le c

Figure 31 illustrates the notion of the number of leaves of a subtree and thecomputation of the lower bound Entries aminus1

11 aminus133 and a

minus144 are requested and the

elimination tree of Figure 21(c) is modied accordingly to have leaves (with zeroweights) corresponding to these entries The numbers nl(i) are shown next to the

10 AMESTOY ET AL

6

3

5

4

1

3

2

3

1

14 3

11

1 1

Figure 31 Number of leaves of the subtrees rooted at each node of a transformed eliminationtree The nodes corresponding to the requested diagonal entries of the inverse are shaded and aleaf node is added for each such entry Each node is annotated with the number of leaves in thecorresponding subtree resulting in a lower bound of η = 14 with B = 2

nodes Suppose that each internal node has unit weight and that the block size is 2Then the lower bound is

η = 2times(lceil

12

rceil+lceil

12

rceil+lceil

22

rceil+lceil

32

rceil+lceil

32

rceil)= 14

Recall that we have transformed the elimination tree in such a way that therequested entries now correspond to the leaves and each leaf corresponds to a requestedentry We have the following computational complexity result

Theorem 32 The TreePartitioning problem is NP-complete

Proof We consider the associated decision problem given a tree T with m leavesa value of B and a cost bound c does there exist a partitioning S of the m leavesinto subsets whose size does not exceed B and such that cost(S) le c It is clear thatthis problem belongs to NP since if we are given the partition S it is easy to checkin polynomial time that it is valid and that its cost meets the bound c We now haveto prove that the problem is in the NP-complete subset

To establish the completeness we use a reduction from 3-PARTITION [14] whichis NP-complete in the strong sense Consider an instance I1 of 3-PARTITION givena set a1 a3p of 3p integers and an integer Z such that

sum1lejle3p aj = pZ does

there exist a partition of 1 3p into p disjoint subsets K1 Kp each withthree elements such that for all 1 le i le p

sumjisinKi

aj = ZWe build the following instance I2 of our problem the tree is a three-level tree

composed of N = 1 + 3p + pZ nodes the root vr of cost wr has 3p children viof same cost wv for 1 le i le 3p In turn each vi has ai children each being a leafnode of zero cost This instance I2 of the TreePartitioning problem is shown inFig 32 We let B = Z and ask whether there exists a partition of leaf nodes of costc = pwr + 3pwv Here wr and wv are arbitrary values (we can take wr = wv = 1)We note that the cost c corresponds to the lower bound shown in Theorem 31 inthis lower bound each internal node vi is loaded only once and the root is loaded ptimes since it has pZ = pB leaves below it Note that the size of I2 is polynomialin the size of I1 Indeed because 3-PARTITION is NP-complete in the strong sensewe can encode I1 in unary and the size of the instance is O(pZ)

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching as partitioning into blocks ofsize 2 can be described as a matching Consider the complete graph G = (V V times V )of the leaves of a given tree and assume that the edge (i j) represents the decisionto put the leaf nodes i and j together in a part Given this denition of the verticesand edges we associate the value m(i j) = cost(i j) to the edge (i j) if i 6= j andm(i i) =

sumnisinV w(n) (or any suciently large number) Then a minimum weighted

matching in G denes a partitioning of the vertices in V with the minimum cost (asdened in (211)) Although this is a short and immediate formulation it has a highrun time complexity of O(|V |52) and O(|V |2) memory requirements Therefore wepropose yet another exact algorithm for B = 2

The proposed algorithmMatch proceeds from the parents of the leaf nodes to theroot At each internal node n those leaf nodes that are in the subtree rooted at n andwhich are not put in a part yet are matched two by two (arbitrarily) and each pairis put in a part if there is an odd number of leaf nodes remaining to be partitionedat node n one of them (arbitrarily) is passed to parent(n) Clearly this algorithmattains the lower bound shown in Theorem 31 and hence it nds an optimal partitionfor B = 2 The memory and the run time requirements are O(|V |) We note that twoleaf nodes can be matched only at their least common ancestor

Algorithm 2 Match An exact algorithm for B = 2Input T = (VE r) with the root r and F leaves each requested entry corresponds

to a leaf node with a zero weightOutput Π2 = R1 RK where K = dF2e1 for each leaf node ` do2 add ` to list(parent(`))3 compute a postorder4 k larr 15 for each non-leaf node n in postorder do6 if n 6= r and list(n) contains an odd number of vertices then7 `larr the node with least weight in list(n)8 move ` to list(parent(n)) add w(n) to the weight of `

I relay it to the father9 else if n = r and list(r) contains an odd number of vertices then10 `larr a node with the least weight in list(n)11 make ` a singleton12 for i = 1 to |list(n)| by 2 do13 Put the ith and i+ 1st vertices in list(n) into Rk increment k

I Match the ith and i+ 1st elements in the list

We have slightly modied the basic algorithm and show this modied version inAlgorithm 2 The modications keep the running time and memory complexities thesame and they are realized to enable the use ofMatch as a building block for a moregeneral heuristic In this algorithm parent(n) gives the parent of node n and list(n)is an array associated with node n The sum of the sizes of the list(middot) arrays is |V |The modication is that when there are an odd number of leaf nodes to partition

14 AMESTOY ET AL

at node n the leaf node with the least cumulative weight is passed to the fatherThe cumulative weight of a leaf node i when Match processes node n is dened asw(i n)minusw(n) the sum of the weights of the nodes in the unique path between nodesi and n including i but excluding n This is easy to compute each time a leaf nodeis relayed to the father of the current node the weight of the current node is addedto the cumulative weight of the relayed leaf node (as shown in line 8) By doing thisthe leaf nodes which traverse longer paths before being partitioned are chosen beforethose with smaller weights

33 A heuristic for a more general case We propose a heuristic algorithmwhen B = 2k for some k the BiseMatch algorithm is shown in Algorithm 3 It isbased on a bisection approach At each bisection a matching among the leaf nodesis found by a call to Match Then one of the leaf nodes of each pair is removedfrom the tree the remaining one becomes a representative of the two and is called aprincipal node Since the remaining node at each bisection step is a representativeof the two representative nodes of the previous bisection step after logB = k stepsBiseMatch obtains nodes that represent at most Bminus 1 other nodes At the end thenodes represented by a principal node are included in the same part as the principalnode

Algorithm 3 BiseMatch A heuristic algorithm for B = 2k

Input T = (VE r) with the root r and F leaves each requested entry correspondsto a leaf node with a zero weight

Output Π2k = R1 RK where |Ri| le B1 for level = 1 to k do2 M larrMatch(T )3 for each pair (i j) isinM remove the leaf node j from T and mark the leaf node

i as representative4 Clean up the tree T so that all leaf nodes correspond to some requested entry5 Each remaining leaf node i corresponds to a part Ri where the nodes that are

represented by i are put in Ri

As seen in the algorithm at each stage a matching among the remaining leaves isfound by using the Match algorithm When leaf nodes i and j are matched at theirleast common ancestor lca(i j) if w(i lca(i j)) ge w(j lca(i j)) then we designate ito be the representative of the two by adding (i j) to M otherwise we designate j tobe the representative by adding (j i) toM With this choice theMatch algorithm isguided to make decisions at nodes close to the leaves The running time ofBiseMatchis O(|V | logB) with an O(|V |) memory requirement

34 Models based on hypergraph partitioning We show how the problemof nding an optimal partition of the requested entries can be transformed into ahypergraph partitioning problem Our aim is to develop a general model that canaddress both the diagonal and the o-diagonal cases Here we give the model againfor diagonal entries and defer the discussion for the o-diagonal case until Section 4We rst give some relevant denitions

341 Hypergraphs and the hypergraph partitioning problem A hyper-graph H = (VN ) is dened as a set of vertices V and a set of nets N Every net isa subset of vertices Weights can be associated with vertices We use w(j) to denote

COMPUTING INVERSE ENTRIES 15

the weight of the vertex vj Costs can be associated with nets We use c(hi) to denotethe cost associated with the net hi

Π = V1 VK is a K-way vertex partition of H = (VN ) if each part isnonempty parts are pairwise disjoint and the union of the parts gives V In Π a netis said to connect a part if it has at least one vertex in that part The connectivity

set Λ(i) of a net hi is the set of parts connected by hi The connectivity λ(i) = |Λ(i)|of a net hi is the number of parts connected by hi In Π the weight of a part is thesum of the weights of vertices in that part

In the hypergraph partitioning problem the objective is to minimize

cutsize(Π) =sum

hiisinN

(λ(i)minus 1)times c(hi) (32)

This objective function is widely used in the VLSI community [19] and in the scienticcomputing community [5 10 27] it is referred to as the connectivity-1 cutsize metricThe partitioning constraint is to satisfy a balancing constraint on part weights

Wmax minusWavg

Wavg6 ε

Here Wmax is the largest part weight Wavg is the average part weight and ε is apredetermined imbalance ratio This problem is NP-hard [19]

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, i.e., c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed, and its cost can be added to the cost of the net h_n.


Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

This way, the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one which is closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.
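As a concrete illustration, the construction and the merging rule can be sketched as follows (assuming the same parent-dictionary convention as in the sketch of Section 3.3; the function name is ours). Representing each net by the set of requested leaves below its node makes nets with identical leaf sets merge automatically, which realizes the compression of chains; the usage example encodes the tree of Figure 3.3, whose shape can be read off the discussion below.

    from collections import defaultdict

    def build_hypergraph(parent, w, requested_leaves):
        below = defaultdict(frozenset)     # node -> requested leaves below it
        for leaf in requested_leaves:
            v = parent[leaf]
            while v is not None:
                below[v] = below[v] | {leaf}
                v = parent[v]
        nets = defaultdict(int)            # leaf set -> accumulated cost c(h)
        for v, leafset in below.items():
            nets[leafset] += w[v]          # identical leaf sets merge, so the
                                           # costs of a chain accumulate
        return dict(nets)                  # single-vertex nets never add to the
                                           # cutsize and may also be dropped

    # The tree of Figure 3.3: leaf l_i stands for the requested entry of row i.
    parent = {"l1": 1, "l2": 2, "l5": 5, 1: 4, 2: 3, 3: 4, 4: 5, 5: None}
    w = {n: 1 for n in range(1, 6)}        # unit node weights for illustration
    print(build_hypergraph(parent, w, ["l1", "l2", "l5"]))
    # h1 = {l1} with cost 1, h2 = {l2} with cost 2 (nodes 2 and 3),
    # h4 = {l1, l2} with cost 1, and h5 = {l1, l2, l5} with cost 1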

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore, V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (net h_3 is removed according to the rule described above, and the cost of h_2 includes the weight of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, and h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

    cutsize(V_1, V_2) = c(h_1) × (λ(h_1) − 1) + c(h_2) × (λ(h_2) − 1)
                          + c(h_4) × (λ(h_4) − 1) + c(h_5) × (λ(h_5) − 1)
                      = c(h_4) × (2 − 1) + c(h_5) × (2 − 1)
                      = c(h_4) + c(h_5).

Consider the first part, V_1 = {2}. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, V_2 = {1, 5}. For this part, we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).
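This arithmetic is easy to verify mechanically; the following illustrative lines (unit node weights, so that c(h_1) = 1, c(h_2) = 2, c(h_4) = 1, c(h_5) = 1) recompute the cutsize of the partition above.

    def cutsize(nets, cost, part):
        # connectivity-1 metric of (3.2)
        return sum((len({part[v] for v in verts}) - 1) * cost[h]
                   for h, verts in nets.items())

    nets = {"h1": {1}, "h2": {2}, "h4": {1, 2}, "h5": {1, 2, 5}}
    cost = {"h1": 1, "h2": 2, "h4": 1, "h5": 1}
    part = {2: "V1", 1: "V2", 5: "V2"}
    print(cutsize(nets, cost, part))       # 2 = c(h4) + c(h5)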

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

    y = L^{-1} e_j,    a^{-1}_{ij} = (U^{-1} y)_i,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1} y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases). Avoiding the unnecessary repeated solves would only affect the operation count.
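For illustration, here is a small dense sketch of this observation, using NumPy and SciPy (this is not the MUMPS implementation, and the full backward solve is performed here, whereas the solver restricts it to the path from the root to each requested row index): one forward solve with L and one backward solve with U produce every requested entry of a column of A^{-1}.

    import numpy as np
    from scipy.linalg import solve_triangular

    def column_entries(L, U, rows, j):
        # Formula (1.1): y = L^{-1} e_j, then a^{-1}_{ij} = (U^{-1} y)_i.
        n = L.shape[0]
        e_j = np.zeros(n)
        e_j[j] = 1.0
        y = solve_triangular(L, e_j, lower=True)   # one forward solve per column
        z = solve_triangular(U, y, lower=False)    # one backward solve
        return {(i, j): z[i] for i in rows}

    # Toy check against the explicitly formed inverse.
    rng = np.random.default_rng(0)
    n = 6
    L = np.tril(rng.random((n, n)), -1) + np.eye(n)        # unit lower triangular
    U = np.triu(rng.random((n, n)), 1) + 2.0 * np.eye(n)   # upper triangular
    A = L @ U
    print(column_entries(L, U, rows=[0, 3], j=2))
    print(np.linalg.inv(A)[[0, 3], 2])                     # the same two values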

If we were to exclude the case where more than one entry in the same column of A^{-1} is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then, adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.
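A sketch of this generalized bound follows (hypothetical helper names; nl(v) counts, for one phase, the solves whose path passes through node v, one per occurrence of the corresponding index): the bound of Theorem 3.1 is applied once to the column indices (forward phase, L factors) and once to the row indices (backward phase, U factors), and the two results are added. For a purely diagonal request list [(i, i), ...], the two calls coincide, and their sum recovers the factor of two in Theorem 3.1.

    import math

    def eta_one_phase(parent, w, entry_nodes, B):
        nl = dict.fromkeys(w, 0)           # nl(v): solves passing through v
        for u in entry_nodes:
            v = u
            while v is not None:
                nl[v] += 1
                v = parent[v]
        return sum(w[v] * math.ceil(nl[v] / B) for v in w)

    def lower_bound_general(parent, w, requested, B):
        cols = [j for (_, j) in requested]  # drive the forward solves
        rows = [i for (i, _) in requested]  # drive the backward solves
        return (eta_one_phase(parent, w, cols, B)
                + eta_one_phase(parent, w, rows, B))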

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices i and j where the entry a^{-1}_{ij} is requested.
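Schematically, the amalgamation can be written as follows (illustrative names; each requested pair (i, j) is assumed to occur once, repeated indices having already been split into distinct leaf nodes as described above). The composite vertex of an entry a^{-1}_{ij} inherits membership of the forward nets through its column index j and of the backward nets through its row index i, while every net is kept intact.

    def composite_hypergraph(fwd_nets, bwd_nets, requested):
        # fwd_nets: name -> set of column indices (forward model);
        # bwd_nets: name -> set of row indices (backward model);
        # requested: list of (i, j) pairs, one composite vertex per pair.
        nets = {}
        for name, cols in fwd_nets.items():
            nets["fwd_" + name] = {(i, j) for (i, j) in requested if j in cols}
        for name, rows in bwd_nets.items():
            nets["bwd_" + name] = {(i, j) for (i, j) in requested if i in rows}
        return {name: verts for name, verts in nets.items() if verts}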

Figure 4.1 shows an example where the requested entries are a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95}. The transformed elimination tree and the nets of the hypergraphs associated with the forward (h^{fwd}) and backward (h^{bwd}) solves are shown. Note that the nets h^{fwd}_3, as well as h^{bwd}_3, h^{bwd}_4, and h^{bwd}_8, are removed.


Figure 4.1. Example of the hypergraph model for the general case: a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95} are requested.

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = {{a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}}}. The cutsize is c(h^{bwd}_5) + c(h^{fwd}_4) + c(h^{fwd}_5). Consider the computation of a^{-1}_{62}: we need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 25, and, in the UFL index, the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17], and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range [1, 200] to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that, for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances, and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound became worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3].


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                        PoPart                      BiseMatch
   B     P       min     max     avg         min     max     avg
   4  0.05    1.0011  1.1751  1.0278      1.0000  1.0139  1.0013
      0.10    1.0005  1.1494  1.0192      1.0000  1.0073  1.0005
      0.20    1.0003  1.0945  1.0119      1.0000  1.0052  1.0003
      0.40    1.0001  1.0585  1.0072      1.0000  1.0031  1.0001
      0.60    1.0001  1.0449  1.0053      1.0000  1.0019  1.0001
      0.80    1.0000  1.0367  1.0043      1.0000  1.0029  1.0001
      1.00    1.0000  1.0491  1.0038      1.0000  1.0101  1.0002
  16  0.05    1.0050  1.1615  1.0592      1.0000  1.0482  1.0113
      0.10    1.0026  1.1780  1.0485      1.0000  1.0553  1.0075
      0.20    1.0016  1.2748  1.0374      1.0000  1.0334  1.0035
      0.40    1.0007  1.1898  1.0246      1.0000  1.0230  1.0016
      0.60    1.0005  1.1431  1.0186      1.0000  1.0166  1.0010
      0.80    1.0004  1.1136  1.0154      1.0000  1.0190  1.0011
      1.00    1.0003  1.1052  1.0133      1.0000  1.0096  1.0008
  64  0.05    1.0132  1.1581  1.0800      1.0000  1.0797  1.0275
      0.10    1.0101  1.1691  1.0715      1.0002  1.0584  1.0196
      0.20    1.0054  1.1389  1.0599      1.0001  1.0506  1.0125
      0.40    1.0030  1.1843  1.0497      1.0000  1.0437  1.0079
      0.60    1.0020  1.2362  1.0407      1.0000  1.1022  1.0072
      0.80    1.0015  1.3018  1.0383      1.0000  1.0344  1.0044
      1.00    1.0014  1.2087  1.0315      1.0000  1.0141  1.0024
 256  0.05    1.0050  1.1280  1.0651      1.0000  1.0867  1.0342
      0.10    1.0127  1.1533  1.0721      1.0003  1.0911  1.0314
      0.20    1.0133  1.1753  1.0730      1.0002  1.0722  1.0257
      0.40    1.0093  1.1598  1.0668      1.0003  1.0540  1.0187
      0.60    1.0068  1.1621  1.0602      1.0002  1.0572  1.0174
      0.80    1.0068  1.1314  1.0563      1.0001  1.0515  1.0120
      1.00    1.0043  1.1203  1.0495      1.0001  1.0677  1.0118
 Over all
 triplets     1.0000  1.3018  1.0359      1.0000  1.1110  1.0079

The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16, 64. In this table,


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                          Total size of the                Running time of the
                  Lower   loaded factors (MBytes)          solution phase (s)
 matrix        B  bound      NoES   ES-Nat  ES-PoP       NoES  ES-Nat  ES-PoP
 CESR21532    16   5403     63313     7855    5422      11138     694     393
              64   1371     15828     2596    1389       3596     381     166
 CESR46799    16   2399    114051     3158    2417      39623     767     477
              64    620     28512     1176     635       8663     513     285
 CESR72358    16   1967    375737     6056    2008     108009    2637     718
              64    528     93934     4796     571      31740    2741     520
 CESR148286   16   8068   1595645    16595    8156     433967    7207    2685
              64   2092    398911    11004    2179     140493    7267    1998
 af23560      16  16720    114672    17864   16745      20806    2411    1976
              64   4215     28668     5245    4245       6685    1210     595
 ecl32        16  95478    618606   141533   95566     121847   27263   17606
              64  23943    154651    43429   24046      35255    9741    4829
 stokes64     16    721      8503     1026     726       1312     142      85
              64    185      2125      425     189        488     102      41
 boyd1        16   2028     75521     4232    2031     165512    3898    2149
              64    515     18880     1406     518      54927    2305    1212

the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process. These columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358, the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                          Total size of the          Running time of the
                  Lower   loaded factors (MBytes)    solution phase (s)
 matrix        B  bound    ES-PoP   ES-HP            ES-PoP   ES-HP
 CESR21532    16    563      1782     999               566     169
              64    164       703     464               283     136
 CESR46799    16    264       549     416               322     251
              64     93       232     195               355     253
 CESR72358    16    242      1124     868              1229     738
              64    116       794     598               995     728
 CESR148286   16    905      3175    2693              4260    3217
              64    345      2080    1669              2810    2358
 af23560      16   1703      3579    2463              1092     664
              64    458      1219    1003               471     343
 ecl32        16   9617     22514   12615              5077    2652
              64   2483      7309    4664              1992    1199
 stokes64     16     77       188     149                29      23
              64     26        75      74                19      17
 boyd1        16    205       481     258               390     344
              64     55       198      93               259     242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models using PaToH [11], with default options except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism; indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides


Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1, 2}, {3} is better than the partition {1, 3}, {2}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context, the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size. In the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight, and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
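The two metrics are easy to state in code. The tree below is a hypothetical stand-in with the same flavour as Figure 6.1 (we do not reproduce the figure's exact tree): each requested leaf is mapped to the set of unit-weight internal nodes on its path to the root, the loaded-factor metric charges each block the size of the union of its paths, and the operation-count metric charges that union once per entry in the block. The partition that is better out-of-core is the worse one in-core.

    paths = {1: {"a", "r"}, 2: {"c", "b", "a", "r"}, 3: {"r"}}

    def loaded_factors(partition):         # out-of-core metric
        return sum(len(set().union(*(paths[i] for i in block)))
                   for block in partition)

    def operation_count(partition):        # in-core metric
        return sum(len(block) * len(set().union(*(paths[i] for i in block)))
                   for block in partition)

    A = [{1, 2}, {3}]
    B = [{1, 3}, {2}]
    print(loaded_factors(A), loaded_factors(B))    # 5 6: A is better out-of-core
    print(operation_count(A), operation_count(B))  # 9 8: B is better in-core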

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped on a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
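As a toy illustration of the interleaving idea (our reading of [24], not its actual implementation): entries are drawn round-robin from the different blocks, so that each combined block of size B touches several subtrees, and hence several processors, at once; the grouping produced by the partitioning, and with it part of the factor reuse, is lost in the process.

    from itertools import zip_longest

    def interleave(blocks, B):
        mixed = [entry
                 for round_ in zip_longest(*blocks)  # one entry from each block
                 for entry in round_ if entry is not None]
        return [mixed[i:i + B] for i in range(0, len(mixed), B)]

    print(interleave([[1, 2, 3, 4], [5, 6, 7, 8]], 4))
    # [[1, 5, 2, 6], [3, 7, 4, 8]]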

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L + U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L + U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least N linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree, followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with the worst-case performance of BiseMatch being the better of the two. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of the factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: An inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533, Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, 1993.
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 7: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 7

Uz =y)2

= eL y 3

6

2

3

5

4

1

(

Figure 22 Traversal of the elimination tree for the computation of aminus123 In the rst step

Ly = e3 is solved yielding y3 y5 and y6 6= 0 Then (Uz = y)2 is found by computing z6 z5 z3 andnally z2

Theorem 25 (Factors to load for computing a particular entry of the inverseProperty 89 in [24]) To compute a particular entry aminus1

ij in Aminus1 the only factors

which have to be loaded are the L factors on the path from node j up to the root node

and the U factors on the path going back from the root to node iThis theorem establishes the eciency of the proposed computation scheme we

solve only the equations required for a requested entry of the inverse both in theforward and backward solve phases We illustrate the above theorem on the previousexample in Fig 22 As discussed before Ly = e3 yields nonzero vector entries y3 y5and y6 and then z2 = (Uz = y)2 is found after computing z6 z5 z3

We note that the third parenthesization (25) can be advantageous while comput-ing only the diagonal entries with the factorizations LLT or LDLT (with a diagonalD matrix) because in these cases we need to compute a single vector and computethe square of its norm This formulation can also be useful in a parallel setting wherethe solves with L and U can be computed in parallel whereas in the other two for-mulations the solves with one of the factors has to wait for the solves with the otherone to be completed

We also note that if the triangular solution procedures for eTi Uminus1 and Lminus1ej are

available then one can benet from the third parenthesization in certain cases Ifthe number of row and column indices concerned by the requested entries is smallerthan the number of these requested entries many of the calculations can be reused ifone computes a set of vectors of the form eT

i Uminus1 and Lminus1ej for dierent i and j and

obtains the requested entries of the inverse by computing the inner products of thesevectors We do not consider this computational scheme in this paper because suchseparate solves with L and U are not available within MUMPS

22 Problem denition We now address the computation of multiple entriesand introduce the partitioning problem We rst discuss the diagonal case (thatis computing a set of diagonal entries) and comment on the general case later inSection 4

As seen in the previous section in order to compute aminus1ii using the formulation

y = Lminus1ei

aminus1ii = (Uminus1y)i

we have to access the parts of L that correspond to the nodes in the unique path fromnode i to the root and then access the parts of U that correspond to the nodes inthe same path As discussed above these are the necessary and sucient parts of L

8 AMESTOY ET AL

and U that are needed In other words we know how to solve eciently for a singlerequested diagonal entry of the inverse Now suppose that we are to compute a setR of diagonal entries of the inverse As said in Section 1 using the equations aboveentails storing a dense vector for each requested entry If |R| is small then we couldagain identify all the parts of L and U that are needed to be loaded at least for onerequested entry in R and then solve for all R at once accessing the necessary andsucient parts of L and U only once However |R| is usually large in the applicationareas mentioned in Section 1 one often wants to compute a large set of entries suchas the whole diagonal of the inverse (in that case |R| = N) Storing that manydense vectors is not feasible therefore the computations proceed in epochs where ateach epoch a limited number of diagonal entries are computed This entails accessing(loading) some parts of L and U multiple times in dierent epochs according to theentries computed in the corresponding epochs The main combinatorial problem thenbecomes that of partitioning the requested entries into blocks in such a way that theoverall cost of loading the factors is minimized

We now formally introduce the problem Let T be the elimination tree on Nnodes where the factors associated with each node are stored on disks (out-of-core)Let P (i) be the set of nodes in the unique path from node i to the root r includingboth nodes i and r Let w(i) denote the cost of loading the parts of the factors Lor U associated with node i of the elimination tree Similarly let w(i j) denote thesum of the costs of the nodes in the path from node i to node j The cost of solvingaminus1

ii is therefore

cost(i) =sum

kisinP (i)

2times w(k) = 2times w(i r) (210)

If we solve for a set R of diagonal entries at once then the overall cost is therefore

cost(R) =sum

iisinP (R)

2times w(i) where P (R) =⋃iisinR

P (i)

We use B to denote the maximum number of diagonal entries that can be computedat an epoch This is the number of dense vectors that we must hold and so is limitedby the available storage

The TreePartitioning problem is formally dened as follows given a tree Twith N nodes a set R = i1 im of nodes in the tree and an integer B le mpartition R into a number of subsets R1 R2 RK so that |Rk| le B for all k andthe total cost

cost(R) =Ksum

k=1

cost(Rk) (211)

is minimumThe number of subsets K is not specied but obviously K ge dm

B e Without lossof generality we can assume that there is a one-to-one correspondence between R andleaf nodes in T Indeed if there is a leaf node i where i isin R then we can delete nodei from T Similarly if there is an internal node i where i isin R then we create a leafnode iprime of zero weight and make it an additional child of i For ease of discussion andformulation for each requested node (leaf or not) of the elimination tree we add a leafnode with zero weight To clarify the execution scheme we now specify the algorithmthat computes the diagonal entries of the inverse specied by a given Rk We rst

COMPUTING INVERSE ENTRIES 9

nd P (Rk) we then post-order the nodes in P (Rk) and start loading the associatedL factors from the disk and perform the forward solves with L When we reach theroot node we have |Rk| dense vectors and we start loading the associated U factorsfrom the disk and perform backward substitutions along the paths that we traversed(in reverse order) during the forward substitutions

3 Partitioning methods and models As discussed above partitioning therequested entries into blocks to minimize the cost of loading factors corresponds tothe TreePartitioning problem In this section we will focus on the case whereall of the requested entries are on the diagonal of the inverse As noted before inthis case the partial forward and backward solves correspond to visiting the samepath in the elimination tree We analyse the TreePartitioning problem in detailfor this case and show that it is NP-complete we also show that the case where theblock size B = 2 is polynomial time solvable We provide two heuristics one with anapproximation guarantee (in the sense that we can prove that it is at worst twice asbad as optimal) and the other being somewhat better in practice we also introducea hypergraph partitioning-based formulation which is more general than the otherheuristics

Before introducing the algorithms and models we present a lower bound for thecost of an optimal partition Let nl(i) denote the number of leaves of the subtreerooted at node i which can be computed as follows

nl(i) =

1 i is a leaf nodesum

jisinchildren(i) nl(j) otherwise(31)

We note that as all the leaf nodes correspond to the requested diagonal entries of theinverse nl(i) corresponds to the number of forward and backward solves that have tobe performed at node i

Given the number of forward and backward solves that pass through a node i itis easy to dene the following lower bound on the amount of the factors loaded

Theorem 31 (Lower bound of the amount of factors to load) Let T be a

node weighted tree w(i) be the weight of node i B be the maximum allowed size of

a partition and nl(i) be the number of leaf nodes in the subtree rooted at i Then

we have the following lower bound denoted by η on the optimal solution c of the

TreePartitioning problem

η = 2timessumiisinT

w(i)timeslceilnl(i)B

rceille c

Proof Follows easily by noting that each node i has to be loaded at leastlceil

nl(i)B

rceiltimes

As the formula includes nl(middot) the lower bounds for wide and shallow trees willusually be smaller than the lower bounds for tall and skinny trees Each internalnode is on a path from (at least) one leaf node therefore dnl(i)Be is at least 1 and2times

sumi w(i) le c

Figure 31 illustrates the notion of the number of leaves of a subtree and thecomputation of the lower bound Entries aminus1

11 aminus133 and a

minus144 are requested and the

elimination tree of Figure 21(c) is modied accordingly to have leaves (with zeroweights) corresponding to these entries The numbers nl(i) are shown next to the

10 AMESTOY ET AL

6

3

5

4

1

3

2

3

1

14 3

11

1 1

Figure 31 Number of leaves of the subtrees rooted at each node of a transformed eliminationtree The nodes corresponding to the requested diagonal entries of the inverse are shaded and aleaf node is added for each such entry Each node is annotated with the number of leaves in thecorresponding subtree resulting in a lower bound of η = 14 with B = 2

nodes Suppose that each internal node has unit weight and that the block size is 2Then the lower bound is

η = 2times(lceil

12

rceil+lceil

12

rceil+lceil

22

rceil+lceil

32

rceil+lceil

32

rceil)= 14

Recall that we have transformed the elimination tree in such a way that therequested entries now correspond to the leaves and each leaf corresponds to a requestedentry We have the following computational complexity result

Theorem 32 The TreePartitioning problem is NP-complete

Proof We consider the associated decision problem given a tree T with m leavesa value of B and a cost bound c does there exist a partitioning S of the m leavesinto subsets whose size does not exceed B and such that cost(S) le c It is clear thatthis problem belongs to NP since if we are given the partition S it is easy to checkin polynomial time that it is valid and that its cost meets the bound c We now haveto prove that the problem is in the NP-complete subset

To establish the completeness we use a reduction from 3-PARTITION [14] whichis NP-complete in the strong sense Consider an instance I1 of 3-PARTITION givena set a1 a3p of 3p integers and an integer Z such that

sum1lejle3p aj = pZ does

there exist a partition of 1 3p into p disjoint subsets K1 Kp each withthree elements such that for all 1 le i le p

sumjisinKi

aj = ZWe build the following instance I2 of our problem the tree is a three-level tree

composed of N = 1 + 3p + pZ nodes the root vr of cost wr has 3p children viof same cost wv for 1 le i le 3p In turn each vi has ai children each being a leafnode of zero cost This instance I2 of the TreePartitioning problem is shown inFig 32 We let B = Z and ask whether there exists a partition of leaf nodes of costc = pwr + 3pwv Here wr and wv are arbitrary values (we can take wr = wv = 1)We note that the cost c corresponds to the lower bound shown in Theorem 31 inthis lower bound each internal node vi is loaded only once and the root is loaded ptimes since it has pZ = pB leaves below it Note that the size of I2 is polynomialin the size of I1 Indeed because 3-PARTITION is NP-complete in the strong sensewe can encode I1 in unary and the size of the instance is O(pZ)

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching as partitioning into blocks ofsize 2 can be described as a matching Consider the complete graph G = (V V times V )of the leaves of a given tree and assume that the edge (i j) represents the decisionto put the leaf nodes i and j together in a part Given this denition of the verticesand edges we associate the value m(i j) = cost(i j) to the edge (i j) if i 6= j andm(i i) =

sumnisinV w(n) (or any suciently large number) Then a minimum weighted

matching in G denes a partitioning of the vertices in V with the minimum cost (asdened in (211)) Although this is a short and immediate formulation it has a highrun time complexity of O(|V |52) and O(|V |2) memory requirements Therefore wepropose yet another exact algorithm for B = 2

The proposed algorithmMatch proceeds from the parents of the leaf nodes to theroot At each internal node n those leaf nodes that are in the subtree rooted at n andwhich are not put in a part yet are matched two by two (arbitrarily) and each pairis put in a part if there is an odd number of leaf nodes remaining to be partitionedat node n one of them (arbitrarily) is passed to parent(n) Clearly this algorithmattains the lower bound shown in Theorem 31 and hence it nds an optimal partitionfor B = 2 The memory and the run time requirements are O(|V |) We note that twoleaf nodes can be matched only at their least common ancestor

Algorithm 2 Match An exact algorithm for B = 2Input T = (VE r) with the root r and F leaves each requested entry corresponds

to a leaf node with a zero weightOutput Π2 = R1 RK where K = dF2e1 for each leaf node ` do2 add ` to list(parent(`))3 compute a postorder4 k larr 15 for each non-leaf node n in postorder do6 if n 6= r and list(n) contains an odd number of vertices then7 `larr the node with least weight in list(n)8 move ` to list(parent(n)) add w(n) to the weight of `

I relay it to the father9 else if n = r and list(r) contains an odd number of vertices then10 `larr a node with the least weight in list(n)11 make ` a singleton12 for i = 1 to |list(n)| by 2 do13 Put the ith and i+ 1st vertices in list(n) into Rk increment k

I Match the ith and i+ 1st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same; they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n; the sum of the sizes of the list(·) arrays is |V|. The modification is that, when there is an odd number of leaf nodes to partition at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i when Match processes node n is defined as w(i, n) − w(n): the sum of the weights of the nodes on the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
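The following Python sketch restates Algorithm 2 under a simple adjacency representation (hypothetical `parent`/`children` maps and a node-weight map `w`); it is intended as a reading aid, not as the reference implementation.

    def match(parent, children, w, leaves, root):
        """Algorithm 2 (Match): pair leaves two by two at their least common
        ancestors; odd leftovers are relayed upwards, least cumulative weight first."""
        pending = {}                                  # list(n) of Algorithm 2
        cumw = {l: 0.0 for l in leaves}               # cumulative weights w(i, n) - w(n)
        for l in leaves:
            pending.setdefault(parent[l], []).append(l)
        parts = []
        for n in internal_postorder(children, root):
            lst = pending.get(n, [])
            if len(lst) % 2 == 1:
                l = min(lst, key=cumw.get)            # least cumulative weight (lines 7/10)
                lst.remove(l)
                if n == root:
                    parts.append({l})                 # singleton at the root (line 11)
                else:
                    cumw[l] += w[n]                   # line 8: relay to the father
                    pending.setdefault(parent[n], []).append(l)
            parts.extend({lst[i], lst[i + 1]} for i in range(0, len(lst), 2))
        return parts

    def internal_postorder(children, root):
        """Internal nodes in a children-before-parent order."""
        order, stack = [], [root]
        while stack:
            n = stack.pop()
            order.append(n)
            stack.extend(c for c in children.get(n, ()) if c in children)
        return reversed(order)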

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case $B = 2^k$ for some k; the BiseMatch algorithm is shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after log B = k steps BiseMatch obtains nodes that represent at most B − 1 other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: a heuristic algorithm for B = 2^k
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R1, ..., RK}, where |Ri| ≤ B.
1: for level = 1 to k do
2:   M ← Match(T)
3:   for each pair (i, j) ∈ M, remove the leaf node j from T and mark the leaf node i as representative
4: Clean up the tree T so that all leaf nodes correspond to some requested entry
5: Each remaining leaf node i corresponds to a part Ri, where the nodes that are represented by i are put in Ri

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if w(i, lca(i, j)) ≥ w(j, lca(i, j)) we designate i to be the representative of the two by adding (i, j) to M; otherwise we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.
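Continuing the sketch above, a compact version of BiseMatch can reuse `match`; the tie-breaking below compares path weights to the root, which is equivalent to comparing w(i, lca(i, j)) and w(j, lca(i, j)) since the two paths coincide above the least common ancestor. The representation is the same hypothetical one as before, and the handling of a leftover root singleton (kept for the next round) differs slightly from the text.

    def bisematch(parent, children, w, leaves, root, k):
        """Algorithm 3 (BiseMatch) for B = 2**k: k rounds of matching; in each
        pair the heavier-path leaf survives as principal node and absorbs the
        leaves represented by the removed one."""
        represented = {l: {l} for l in leaves}
        current = set(leaves)
        for _ in range(k):
            for pair in match(parent, children, w, current, root):
                if len(pair) != 2:
                    continue                          # root singleton: retry next round
                i, j = pair
                if pathw(parent, w, j, root) > pathw(parent, w, i, root):
                    i, j = j, i                       # i becomes the principal node
                represented[i] |= represented.pop(j)
                current.discard(j)
        return list(represented.values())

    def pathw(parent, w, n, root):
        """Weight of the path strictly above n, up to and including the root."""
        total = 0
        while n != root:
            n = parent[n]
            total += w[n]
        return total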

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph $H = (V, N)$ is defined as a set of vertices $V$ and a set of nets $N$. Every net is a subset of vertices. Weights can be associated with vertices; we use $w(j)$ to denote the weight of the vertex $v_j$. Costs can be associated with nets; we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a K-way vertex partition of $H = (V, N)$ if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives $V$. In $\Pi$, a net is said to connect a part if it has at least one vertex in that part. The connectivity set $\Lambda(i)$ of a net $h_i$ is the set of parts connected by $h_i$; the connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In $\Pi$, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

$\mathrm{cutsize}(\Pi) = \sum_{h_i \in N} (\lambda(i) - 1) \times c(h_i). \quad (3.2)$

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

$\frac{W_{\max} - W_{avg}}{W_{avg}} \le \varepsilon.$

Here $W_{\max}$ is the largest part weight, $W_{avg}$ is the average part weight, and $\varepsilon$ is a predetermined imbalance ratio. This problem is NP-hard [19].
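In code, the objective and the constraint read as follows; this is our own small sketch, where `nets` maps a net to its set of vertices, `cost` maps a net to its cost, and `part_of` assigns each vertex to a part (all hypothetical names).

    def cutsize(nets, cost, part_of):
        """Connectivity-1 metric (3.2): net h contributes (lambda(h) - 1) * c(h)."""
        return sum(cost[h] * (len({part_of[v] for v in pins}) - 1)
                   for h, pins in nets.items())

    def is_balanced(part_of, w, eps):
        """Balance constraint: (W_max - W_avg) / W_avg <= eps."""
        part_weight = {}
        for v, p in part_of.items():
            part_weight[p] = part_weight.get(p, 0) + w[v]
        w_avg = sum(part_weight.values()) / len(part_weight)
        return (max(part_weight.values()) - w_avg) / w_avg <= eps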

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition of the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for $a^{-1}_{ii}$ passes through path(h); in other words, if path(h) ⊂ P(i), then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among $\lambda(n)$ parts, then the factors corresponding to the nodes in path($h_n$) will have to be loaded $\lambda(n)$ times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, N_D)$ for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally. There is a net in $N_D$ for each internal node of T. The net $h_n$ corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of $h_n$ is equal to the weight of node n, i.e., $c(h_n) = w(n)$. This model can be simplified as follows. If a net $h_n$ contains the same vertices as the net $h_j$ where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net $h_j$ can be removed and its cost added to the cost of the net $h_n$; this way the net $h_n$ represents both the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of leaf nodes with a single child), as such nets cannot contribute to the cutsize. We note that the remaining nets correspond to disjoint paths in the tree T.

[Figure 3.3: The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.]

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore $V = \{1, 2, 5\}$ and $N = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, and $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is

$\mathrm{cutsize}(V_1, V_2) = c(h_1) \times (\lambda(h_1) - 1) + c(h_2) \times (\lambda(h_2) - 1) + c(h_4) \times (\lambda(h_4) - 1) + c(h_5) \times (\lambda(h_5) - 1) = c(h_4) \times (2 - 1) + c(h_5) \times (2 - 1) = c(h_4) + c(h_5).$

Consider the first part, $V_1 = \{2\}$: we have to load the factors associated with the nodes 2, 3, 4, and 5. Consider now the second part, $V_2 = \{1, 5\}$: for this part we have to load the factors associated with the nodes 1, 4, and 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is $w(4) + w(5)$, which is equal to the cutsize $c(h_4) + c(h_5)$.

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. We believe, though, that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.
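A sketch of the construction just described, under the same hypothetical tree representation as in the earlier sketches: one net per internal node, chain nets merged downwards, single-vertex nets dropped. (Applied to the tree of Figure 3.3 it folds node 3 into $h_2$ as in the text; unlike the figure, it also drops the single-vertex nets $h_1$ and $h_2$, which can never be cut.)

    from collections import deque

    def build_diagonal_hypergraph(children, w, root):
        """Vertices: leaves (requested entries). One net per internal node n with
        pins = leaves of T(n) and cost w(n); a net equal to its child's net is
        merged into it (costs accumulate down the chain); single-pin nets, which
        can never be cut, are dropped."""
        def leaves_below(n):
            kids = children.get(n, ())
            return (frozenset().union(*map(leaves_below, kids))
                    if kids else frozenset([n]))
        nets = {n: leaves_below(n) for n in children if children.get(n)}
        cost = {n: w[n] for n in nets}
        queue = deque([root])                 # top-down, so costs flow to chain bottoms
        while queue:
            n = queue.popleft()
            queue.extend(c for c in children.get(n, ()) if c in nets)
            for c in children.get(n, ()):
                if n in nets and c in nets and nets[c] == nets[n]:
                    cost[c] += cost.pop(n)    # fold the parent's cost into the child's net
                    del nets[n]
                    break
        for n in [m for m in nets if len(nets[m]) == 1]:
            del nets[n], cost[n]
        return nets, cost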


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well, but an added difficulty here is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, for two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,

$y = L^{-1} e_j, \qquad a^{-1}_{ij} = (U^{-1} y)_i,$

only one $y$ vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for $i$ and $k$. This means that for the forward solves with $L$ we can perform only one solve, and that for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.
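As a concrete illustration of formula (1.1) and of the shared forward solve, here is a sketch using SciPy's sparse triangular solves; it is ours, it assumes factors without the pivoting and permutations a real multifrontal solver would apply, and it performs a full backward solve where a sparsity-exploiting solver would restrict itself to the tree paths of the requested rows.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import spsolve_triangular

    def entries_in_column(L, U, j, rows):
        """a^{-1}_{ij} for several rows i in the same column j of A^{-1} = (LU)^{-1},
        with the forward solve y = L^{-1} e_j performed once and shared by all rows."""
        e_j = np.zeros(L.shape[0])
        e_j[j] = 1.0
        y = spsolve_triangular(L, e_j, lower=True)    # one forward solve per column
        z = spsolve_triangular(U, y, lower=False)     # backward solve
        return {i: z[i] for i in rows}

    # Toy factors of A = L @ U (no pivoting; for illustration only).
    L = csr_matrix([[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
    U = csr_matrix([[4.0, 1.0, 0.0], [0.0, 5.0, 2.0], [0.0, 0.0, 6.0]])
    print(entries_in_column(L, U, j=0, rows=[1, 2]))  # a^{-1}_{10} and a^{-1}_{20}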

If we were to exclude the case when more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices. We can extend the model to cover the case of multiple entries in a column by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve: when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node $j$ of the elimination tree). Then adding the two lower bounds yields a lower bound for the general case. However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]; the essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs while keeping the nets intact. In our case, the two hypergraphs are the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that repeated indices are distinguished by associating them with different leaf nodes. We then amalgamate any two vertices $i$ and $j$ for which the entry $a^{-1}_{ij}$ is requested.

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{\mathrm{fwd}}$) and backward ($h^{\mathrm{bwd}}$) solves are shown; note that the nets $h^{\mathrm{fwd}}_3$, $h^{\mathrm{bwd}}_3$, $h^{\mathrm{bwd}}_4$, and $h^{\mathrm{bwd}}_8$ are removed. The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$). A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$; its cutsize is $c(h^{\mathrm{bwd}}_5) + c(h^{\mathrm{fwd}}_4) + c(h^{\mathrm{fwd}}_5)$. Consider the computation of $a^{-1}_{62}$: we need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once, were we able to solve for all of them in a single pass), and in the backward solution the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

[Figure 4.1: Example of the hypergraph model for the general case; $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.]

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, with each hyperedge repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]; we use this to investigate the performance of PoPart on practical cases using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second contains some further matrices that are publicly available. In the third set of experiments, we carry out experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 25, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256} and assigned random weights in the range [1, 200] to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph); our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P (e.g., for 0.05 and 0.10), for larger values of P the performance also becomes worse with larger values of B; we suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves as P increases; a similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering, and such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries is requested the tree structure is not much changed by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average over 61 different matrices of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

           PoPart                    BiseMatch
B    P     min     max     avg       min     max     avg
4    0.05  1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
     0.10  1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
     0.20  1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
     0.40  1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
     0.60  1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
     0.80  1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
     1.00  1.0000  1.0491  1.0038    1.0000  1.0101  1.0002
16   0.05  1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
     0.10  1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
     0.20  1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
     0.40  1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
     0.60  1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
     0.80  1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
     1.00  1.0003  1.1052  1.0133    1.0000  1.0096  1.0008
64   0.05  1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
     0.10  1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
     0.20  1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
     0.40  1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
     0.60  1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
     0.80  1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
     1.00  1.0014  1.2087  1.0315    1.0000  1.0141  1.0024
256  0.05  1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
     0.10  1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
     0.20  1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
     0.40  1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
     0.60  1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
     0.80  1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
     1.00  1.0043  1.1203  1.0495    1.0001  1.0677  1.0118
Over all triplets: 1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments were performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory; the benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results were obtained on a dual-core Intel Core2 Duo P8800 processor with a 2.80 GHz clock speed; we used only one of the cores and did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1; the names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16 and B = 64. In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited, under, respectively, the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, with a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared with the natural partitioning. This can be explained intuitively as follows: within an epoch, computations are performed on a union of paths, and the natural ordering is likely to have more trouble with increasing epoch size because it combines nodes that are far from each other.

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                        Total size of the loaded      Running time of the
                 Lower  factors (MBytes)              solution phase (s)
matrix       B   bound  NoES     ES-Nat  ES-PoP       NoES    ES-Nat  ES-PoP
CESR21532    16  5403   63313    7855    5422         11138   694     393
             64  1371   15828    2596    1389         3596    381     166
CESR46799    16  2399   114051   3158    2417         39623   767     477
             64  620    28512    1176    635          8663    513     285
CESR72358    16  1967   375737   6056    2008         108009  2637    718
             64  528    93934    4796    571          31740   2741    520
CESR148286   16  8068   1595645  16595   8156         433967  7207    2685
             64  2092   398911   11004   2179         140493  7267    1998
af23560      16  16720  114672   17864   16745        20806   2411    1976
             64  4215   28668    5245    4245         6685    1210    595
ecl32        16  95478  618606   141533  95566        121847  27263   17606
             64  23943  154651   43429   24046        35255   9741    4829
stokes64     16  721    8503     1026    726          1312    142     85
             64  185    2125     425     189          488     102     41
boyd1        16  2028   75521    4232    2031         165512  3898    2149
             64  515    18880    1406    518          54927   2305    1212

We see that the execution times are roughly proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. The out-of-core executions use direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                        Total size of the loaded   Running time of the
                 Lower  factors (MBytes)           solution phase (s)
matrix       B   bound  ES-PoP  ES-HP              ES-PoP  ES-HP
CESR21532    16  563    1782    999                566     169
             64  164    703     464                283     136
CESR46799    16  264    549     416                322     251
             64  93     232     195                355     253
CESR72358    16  242    1124    868                1229    738
             64  116    794     598                995     728
CESR148286   16  905    3175    2693               4260    3217
             64  345    2080    1669               2810    2358
af23560      16  1703   3579    2463               1092    664
             64  458    1219    1003               471     343
ecl32        16  9617   22514   12615              5077    2652
             64  2483   7309    4664               1992    1199
stokes64     16  77     188     149                29      23
             64  26     75      74                 19      17
boyd1        16  205    481     258                390     344
             64  55     198     93                 259     242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed, with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We therefore think that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. First, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

[Figure 6.1: Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context the partition {1, 2}, {3} is better than the partition {1, 3}, {2} because the total size of loaded factors is 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).]

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum number of extra operations is introduced. In the out-of-core case things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
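The two metrics can be written down directly; the helper below (ours, with hypothetical `parent` and `w` maps as in the earlier sketches, and with the block operation count modeled as in the caption of Figure 6.1) evaluates both for a given partition of the leaves, so that reversals like the one above can be checked mechanically.

    def out_of_core_cost(parent, w, parts):
        """Total factor loads: sum over parts of 2 * w(P(R_k)), as in (2.11)."""
        return sum(2 * sum(w[n] for n in union_of_paths(parent, part))
                   for part in parts)

    def in_core_ops(parent, w, parts):
        """Block operation count: each node of the union of paths is processed
        once per entry of the block (the metric of Figure 6.1, up to a constant)."""
        return sum(len(part) * sum(w[n] for n in union_of_paths(parent, part))
                   for part in parts)

    def union_of_paths(parent, leaves):
        nodes = set()
        for n in leaves:
            while n is not None and n not in nodes:
                nodes.add(n)
                n = parent[n]
        return nodes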

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated their effect on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, differ from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if an arbitrary set of entries of the inverse is requested, any implementation based on the equations of Takahashi et al. must set up the necessary data structures and determine the computational order to compute all the entries needed to obtain those requested; this seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this partitioning problem is NP-complete, so that it is necessary to develop heuristics.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order, and we showed that it is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests of the heuristics and conclude that both PoPart and BiseMatch perform very well on average, with BiseMatch being better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of PoPart.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made first steps towards the efficient computation of off-diagonal inverse entries, more work is needed in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15. doi:10.1016/j.parco.2009.06.001
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Comm. ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM J. Matrix Analysis and Applications, (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 8: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

8 AMESTOY ET AL

and U that are needed In other words we know how to solve eciently for a singlerequested diagonal entry of the inverse Now suppose that we are to compute a setR of diagonal entries of the inverse As said in Section 1 using the equations aboveentails storing a dense vector for each requested entry If |R| is small then we couldagain identify all the parts of L and U that are needed to be loaded at least for onerequested entry in R and then solve for all R at once accessing the necessary andsucient parts of L and U only once However |R| is usually large in the applicationareas mentioned in Section 1 one often wants to compute a large set of entries suchas the whole diagonal of the inverse (in that case |R| = N) Storing that manydense vectors is not feasible therefore the computations proceed in epochs where ateach epoch a limited number of diagonal entries are computed This entails accessing(loading) some parts of L and U multiple times in dierent epochs according to theentries computed in the corresponding epochs The main combinatorial problem thenbecomes that of partitioning the requested entries into blocks in such a way that theoverall cost of loading the factors is minimized

We now formally introduce the problem Let T be the elimination tree on Nnodes where the factors associated with each node are stored on disks (out-of-core)Let P (i) be the set of nodes in the unique path from node i to the root r includingboth nodes i and r Let w(i) denote the cost of loading the parts of the factors Lor U associated with node i of the elimination tree Similarly let w(i j) denote thesum of the costs of the nodes in the path from node i to node j The cost of solvingaminus1

ii is therefore

cost(i) =sum

kisinP (i)

2times w(k) = 2times w(i r) (210)

If we solve for a set R of diagonal entries at once then the overall cost is therefore

cost(R) =sum

iisinP (R)

2times w(i) where P (R) =⋃iisinR

P (i)

We use B to denote the maximum number of diagonal entries that can be computedat an epoch This is the number of dense vectors that we must hold and so is limitedby the available storage

The TreePartitioning problem is formally dened as follows given a tree Twith N nodes a set R = i1 im of nodes in the tree and an integer B le mpartition R into a number of subsets R1 R2 RK so that |Rk| le B for all k andthe total cost

cost(R) =Ksum

k=1

cost(Rk) (211)

is minimumThe number of subsets K is not specied but obviously K ge dm

B e Without lossof generality we can assume that there is a one-to-one correspondence between R andleaf nodes in T Indeed if there is a leaf node i where i isin R then we can delete nodei from T Similarly if there is an internal node i where i isin R then we create a leafnode iprime of zero weight and make it an additional child of i For ease of discussion andformulation for each requested node (leaf or not) of the elimination tree we add a leafnode with zero weight To clarify the execution scheme we now specify the algorithmthat computes the diagonal entries of the inverse specied by a given Rk We rst

COMPUTING INVERSE ENTRIES 9

nd P (Rk) we then post-order the nodes in P (Rk) and start loading the associatedL factors from the disk and perform the forward solves with L When we reach theroot node we have |Rk| dense vectors and we start loading the associated U factorsfrom the disk and perform backward substitutions along the paths that we traversed(in reverse order) during the forward substitutions

3 Partitioning methods and models As discussed above partitioning therequested entries into blocks to minimize the cost of loading factors corresponds tothe TreePartitioning problem In this section we will focus on the case whereall of the requested entries are on the diagonal of the inverse As noted before inthis case the partial forward and backward solves correspond to visiting the samepath in the elimination tree We analyse the TreePartitioning problem in detailfor this case and show that it is NP-complete we also show that the case where theblock size B = 2 is polynomial time solvable We provide two heuristics one with anapproximation guarantee (in the sense that we can prove that it is at worst twice asbad as optimal) and the other being somewhat better in practice we also introducea hypergraph partitioning-based formulation which is more general than the otherheuristics

Before introducing the algorithms and models we present a lower bound for thecost of an optimal partition Let nl(i) denote the number of leaves of the subtreerooted at node i which can be computed as follows

nl(i) =

1 i is a leaf nodesum

jisinchildren(i) nl(j) otherwise(31)

We note that as all the leaf nodes correspond to the requested diagonal entries of theinverse nl(i) corresponds to the number of forward and backward solves that have tobe performed at node i

Given the number of forward and backward solves that pass through a node i itis easy to dene the following lower bound on the amount of the factors loaded

Theorem 31 (Lower bound of the amount of factors to load) Let T be a

node weighted tree w(i) be the weight of node i B be the maximum allowed size of

a partition and nl(i) be the number of leaf nodes in the subtree rooted at i Then

we have the following lower bound denoted by η on the optimal solution c of the

TreePartitioning problem

η = 2timessumiisinT

w(i)timeslceilnl(i)B

rceille c

Proof Follows easily by noting that each node i has to be loaded at leastlceil

nl(i)B

rceiltimes

As the formula includes nl(middot) the lower bounds for wide and shallow trees willusually be smaller than the lower bounds for tall and skinny trees Each internalnode is on a path from (at least) one leaf node therefore dnl(i)Be is at least 1 and2times

sumi w(i) le c

Figure 31 illustrates the notion of the number of leaves of a subtree and thecomputation of the lower bound Entries aminus1

11 aminus133 and a

minus144 are requested and the

elimination tree of Figure 21(c) is modied accordingly to have leaves (with zeroweights) corresponding to these entries The numbers nl(i) are shown next to the

10 AMESTOY ET AL

6

3

5

4

1

3

2

3

1

14 3

11

1 1

Figure 31 Number of leaves of the subtrees rooted at each node of a transformed eliminationtree The nodes corresponding to the requested diagonal entries of the inverse are shaded and aleaf node is added for each such entry Each node is annotated with the number of leaves in thecorresponding subtree resulting in a lower bound of η = 14 with B = 2

nodes Suppose that each internal node has unit weight and that the block size is 2Then the lower bound is

η = 2times(lceil

12

rceil+lceil

12

rceil+lceil

22

rceil+lceil

32

rceil+lceil

32

rceil)= 14

Recall that we have transformed the elimination tree in such a way that therequested entries now correspond to the leaves and each leaf corresponds to a requestedentry We have the following computational complexity result

Theorem 32 The TreePartitioning problem is NP-complete

Proof We consider the associated decision problem given a tree T with m leavesa value of B and a cost bound c does there exist a partitioning S of the m leavesinto subsets whose size does not exceed B and such that cost(S) le c It is clear thatthis problem belongs to NP since if we are given the partition S it is easy to checkin polynomial time that it is valid and that its cost meets the bound c We now haveto prove that the problem is in the NP-complete subset

To establish the completeness we use a reduction from 3-PARTITION [14] whichis NP-complete in the strong sense Consider an instance I1 of 3-PARTITION givena set a1 a3p of 3p integers and an integer Z such that

sum1lejle3p aj = pZ does

there exist a partition of 1 3p into p disjoint subsets K1 Kp each withthree elements such that for all 1 le i le p

sumjisinKi

aj = ZWe build the following instance I2 of our problem the tree is a three-level tree

composed of N = 1 + 3p + pZ nodes the root vr of cost wr has 3p children viof same cost wv for 1 le i le 3p In turn each vi has ai children each being a leafnode of zero cost This instance I2 of the TreePartitioning problem is shown inFig 32 We let B = Z and ask whether there exists a partition of leaf nodes of costc = pwr + 3pwv Here wr and wv are arbitrary values (we can take wr = wv = 1)We note that the cost c corresponds to the lower bound shown in Theorem 31 inthis lower bound each internal node vi is loaded only once and the root is loaded ptimes since it has pZ = pB leaves below it Note that the size of I2 is polynomialin the size of I1 Indeed because 3-PARTITION is NP-complete in the strong sensewe can encode I1 in unary and the size of the instance is O(pZ)

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph $G = (V, V \times V)$ on the leaves of a given tree, and assume that the edge $(i, j)$ represents the decision to put the leaf nodes $i$ and $j$ together in a part. Given this definition of the vertices and edges, we associate the value $m(i, j) = \mathrm{cost}(\{i, j\})$ with the edge $(i, j)$ if $i \ne j$, and $m(i, i) = \sum_{n \in V} w(n)$ (or any sufficiently large number). Then a minimum weighted matching in $G$ defines a partitioning of the vertices in $V$ with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of $O(|V|^{5/2})$ and $O(|V|^2)$ memory requirements. Therefore we propose yet another exact algorithm for $B = 2$.

The proposed algorithm Match proceeds from the parents of the leaf nodes to the root. At each internal node $n$, those leaf nodes that are in the subtree rooted at $n$ and which have not yet been put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node $n$, one of them (arbitrarily) is passed to $parent(n)$. Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for $B = 2$. The memory and the run time requirements are $O(|V|)$. We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: An exact algorithm for $B = 2$
Input: $T = (V, E, r)$ with the root $r$ and $F$ leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_2 = \{R_1, \ldots, R_K\}$, where $K = \lceil F/2 \rceil$.
1: for each leaf node $\ell$ do
2:   add $\ell$ to $list(parent(\ell))$
3: compute a post-order
4: $k \leftarrow 1$
5: for each non-leaf node $n$ in post-order do
6:   if $n \ne r$ and $list(n)$ contains an odd number of vertices then
7:     $\ell \leftarrow$ the node with the least weight in $list(n)$
8:     move $\ell$ to $list(parent(n))$, adding $w(n)$ to the weight of $\ell$   ▷ relay it to the father
9:   else if $n = r$ and $list(r)$ contains an odd number of vertices then
10:    $\ell \leftarrow$ a node with the least weight in $list(n)$
11:    make $\ell$ a singleton
12:  for $i = 1$ to $|list(n)|$ by 2 do
13:    put the $i$th and $(i+1)$st vertices in $list(n)$ into $R_k$; increment $k$   ▷ match the $i$th and $(i+1)$st elements in the list

We have slightly modified the basic algorithm, and we show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, $parent(n)$ gives the parent of node $n$, and $list(n)$ is an array associated with node $n$; the sum of the sizes of the $list(\cdot)$ arrays is $|V|$. The modification is that, when there are an odd number of leaf nodes to partition at node $n$, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node $i$ when Match processes node $n$ is defined as $w(i, n) - w(n)$: the sum of the weights of the nodes on the unique path between nodes $i$ and $n$, including $i$ but excluding $n$. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen before those with smaller weights.
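The following Python sketch mirrors Algorithm 2, under the same assumed tree representation as before (`parent` is the inverse of `children`). Each pending leaf carries its cumulative weight, updated as in line 8, and the sketch returns parts as lists of (leaf, cumulative weight) entries so that a caller such as the BiseMatch sketch below can reuse the weights; all names are ours.

```python
# A sketch of Match (Algorithm 2) for B = 2; representation is assumed, not MUMPS's.
def match(children, parent, root, w):
    parts, lists, order = [], {}, []
    def visit(node):                      # collect leaves; post-order the rest
        kids = children.get(node, [])
        for k in kids:
            visit(k)
        if kids:
            order.append(node)
        else:
            lists.setdefault(parent[node], []).append([node, 0.0])
    visit(root)
    for n in order:
        pending = lists.get(n, [])
        if len(pending) % 2 == 1:
            pending.sort(key=lambda e: e[1])
            entry = pending.pop(0)        # least cumulative weight (line 7)
            if n != root:
                entry[1] += w[n]          # relay it to the father (line 8)
                lists.setdefault(parent[n], []).append(entry)
            else:
                parts.append([entry])     # odd leaf left at the root: singleton
        for i in range(0, len(pending), 2):
            parts.append([pending[i], pending[i + 1]])   # match two by two
    return parts
```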

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case where $B = 2^k$ for some $k$; the BiseMatch algorithm is shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after $\log B = k$ steps BiseMatch obtains nodes that represent at most $B - 1$ other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: A heuristic algorithm for $B = 2^k$
Input: $T = (V, E, r)$ with the root $r$ and $F$ leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: $\Pi_{2^k} = \{R_1, \ldots, R_K\}$, where $|R_i| \le B$.
1: for $level = 1$ to $k$ do
2:   $M \leftarrow$ Match($T$)
3:   for each pair $(i, j) \in M$: remove the leaf node $j$ from $T$ and mark the leaf node $i$ as representative
4: clean up the tree $T$ so that all leaf nodes correspond to some requested entry
5: each remaining leaf node $i$ corresponds to a part $R_i$, where the nodes that are represented by $i$ are put in $R_i$

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes $i$ and $j$ are matched at their least common ancestor $lca(i, j)$, if $w(i, lca(i, j)) \ge w(j, lca(i, j))$ then we designate $i$ to be the representative of the two by adding $(i, j)$ to $M$; otherwise we designate $j$ to be the representative by adding $(j, i)$ to $M$. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is $O(|V| \log B)$, with an $O(|V|)$ memory requirement.
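A sketch of the bisection driver follows, built on the match() sketch above. After each round, the leaf on the heavier path of every pair stays in the tree as the principal node and absorbs the set represented by its partner; the discarded leaf is removed with a simplified clean-up that prunes ancestors left without children. This is our own illustration of the scheme, not the authors' implementation.

```python
# A sketch of BiseMatch (Algorithm 3) for B = 2**k, using match() above.
def remove_leaf(children, parent, root, leaf):
    """Remove a leaf; prune any ancestor that becomes childless (clean-up)."""
    node = leaf
    while node != root and not children.get(node, []):
        p = parent[node]
        children[p].remove(node)
        node = p

def bise_match(children, parent, root, w, k):
    leaves = [n for n in parent if not children.get(n, [])]
    rep = {leaf: [leaf] for leaf in leaves}    # nodes represented by a principal
    for _ in range(k):                         # log2(B) bisection rounds
        for part in match(children, parent, root, w):
            if len(part) < 2:                  # singleton: nothing to merge
                continue
            (i, wi), (j, wj) = part
            if wj > wi:                        # keep the heavier path, w(i, lca)
                i, j = j, i
            rep[i].extend(rep.pop(j))          # i becomes the principal node
            remove_leaf(children, parent, root, j)
    return list(rep.values())                  # parts of size at most 2**k
```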

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model again for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph $H = (V, N)$ is defined as a set of vertices $V$ and a set of nets $N$. Every net is a subset of vertices. Weights can be associated with vertices; we use $w(j)$ to denote the weight of the vertex $v_j$. Costs can be associated with nets; we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a $K$-way vertex partition of $H = (V, N)$ if each part is nonempty, the parts are pairwise disjoint, and the union of the parts gives $V$. In $\Pi$, a net is said to connect a part if it has at least one vertex in that part. The connectivity set $\Lambda(i)$ of a net $h_i$ is the set of parts connected by $h_i$. The connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In $\Pi$, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize
\[
\mathrm{cutsize}(\Pi) = \sum_{h_i \in N} (\lambda(i) - 1) \times c(h_i). \tag{3.2}
\]
This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:
\[
\frac{W_{\max} - W_{\mathrm{avg}}}{W_{\mathrm{avg}}} \le \varepsilon.
\]
Here $W_{\max}$ is the largest part weight, $W_{\mathrm{avg}}$ is the average part weight, and $\varepsilon$ is a predetermined imbalance ratio. This problem is NP-hard [19].

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use $path(h)$ to denote the path (or the set of nodes of the tree) corresponding to a net $h$. A vertex $i$ (corresponding to the leaf node $i$ in the tree) will be in a net $h$ if the solve for $a^{-1}_{ii}$ passes through $path(h)$. In other words, if $path(h) \subset P(i)$, then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among $\lambda(n)$ parts, then the factors corresponding to the nodes in $path(h_n)$ will have to be loaded $\lambda(n)$ times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, N_D)$ for the diagonal case. Let $T = (V, E)$ be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in $T$. As we are interested in putting at most $B$ solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally. There is a net in $N_D$ for each internal node of $T$. The net $h_n$ corresponding to the node $n$ contains the set of vertices which correspond to the leaf nodes of the subtree $T(n)$. The cost of $h_n$ is equal to the weight of node $n$, i.e., $c(h_n) = w(n)$. This model can be simplified as follows: if a net $h_n$ contains the same vertices as the net $h_j$, where $j = parent(n)$, that is, if the subtree rooted at node $n$ and the subtree rooted at its father $j$ have the same set of leaf nodes, then the net $h_j$ can be removed and its cost can be added to the cost of the net $h_n$. This way the net $h_n$ represents the node $n$ and its parent $j$. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree $T$.

Figure 3.3: The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.
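A compact sketch of this construction is given below, under the assumed `children`/`parent` dictionaries used earlier: one net per internal node holding the leaves of its subtree, followed by the chain-merging step (a net identical to its parent's is kept and absorbs the parent's cost) and the removal of single-vertex nets. All names are hypothetical.

```python
# A sketch of building H_D from the (modified) elimination tree.
def build_hypergraph(children, parent, root, w):
    nets, cost = {}, {}
    def leaves_below(node):
        kids = children.get(node, [])
        if not kids:
            return frozenset([node])
        s = frozenset().union(*(leaves_below(k) for k in kids))
        nets[node], cost[node] = s, w[node]    # one net per internal node
        return s
    leaves_below(root)
    for n in list(nets):                       # merge chains of identical nets
        if n not in nets:
            continue                           # already absorbed into a child
        j = parent.get(n)
        while j in nets and nets[j] == nets[n]:
            cost[n] += cost.pop(j)             # net h_n absorbs h_j's cost
            del nets[j]
            j = parent.get(j)
    return ({n: nets[n] for n in nets if len(nets[n]) > 1},  # drop unit nets
            cost)
```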

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore $V = \{1, 2, 5\}$ and $N = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, and $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is
\[
\begin{aligned}
\mathrm{cutsize}(V_1, V_2) &= c(h_1) \times (\lambda(h_1) - 1) + c(h_2) \times (\lambda(h_2) - 1) + c(h_4) \times (\lambda(h_4) - 1) + c(h_5) \times (\lambda(h_5) - 1) \\
&= c(h_4) \times (2 - 1) + c(h_5) \times (2 - 1) = c(h_4) + c(h_5).
\end{aligned}
\]

Consider the first part, $V_1 = \{2\}$: we have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, $V_2 = \{1, 5\}$: for this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is $w(4) + w(5)$, which is equal to the cutsize $c(h_4) + c(h_5)$.
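The arithmetic of this example is easy to check mechanically; the short sketch below evaluates the connectivity-1 metric (3.2) on the nets of Figure 3.3, with hypothetical net costs chosen so that, under unit internal-node weights, $c(h_4) + c(h_5) = w(4) + w(5) = 2$.

```python
# A sketch of the connectivity-1 cutsize (3.2) on the nets of Figure 3.3.
def cutsize(nets, cost, part):
    total = 0
    for h, pins in nets.items():
        lam = len({part[v] for v in pins})    # connectivity lambda(h)
        total += (lam - 1) * cost[h]
    return total

nets = {'h1': {1}, 'h2': {2}, 'h4': {1, 2}, 'h5': {1, 2, 5}}
cost = {'h1': 1, 'h2': 2, 'h4': 1, 'h5': 1}   # assumed unit node weights;
                                              # h2 carries w(2) + w(3)
part = {2: 0, 1: 1, 5: 1}                     # V1 = {2}, V2 = {1, 5}
print(cutsize(nets, cost, part))              # c(h4) + c(h5) = 2
```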

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,
\[
y = L^{-1} e_j, \qquad a^{-1}_{ij} = (U^{-1} y)_i,
\]
only one $y$ vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for $i$ and $k$. This means that for the forward solves with $L$ we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to $lca(i, k)$ only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.
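For a single entry, the two solves in (1.1) are straightforward to express with standard sparse triangular kernels. The sketch below uses SciPy and assumes an unpermuted factorization $A = LU$; the pivoting permutations that a real solver such as MUMPS applies are left out for clarity.

```python
# A sketch of (1.1): y = L^{-1} e_j, then a^{-1}_{ij} = (U^{-1} y)_i.
import numpy as np
from scipy.linalg import lu
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve_triangular

def inverse_entry(L, U, i, j):
    """(A^{-1})_{ij} from factors A = L U via two triangular solves (0-based)."""
    e_j = np.zeros(L.shape[0])
    e_j[j] = 1.0
    y = spsolve_triangular(L, e_j, lower=True)        # forward solve with L
    return spsolve_triangular(U, y, lower=False)[i]   # backward solve, entry i

# Tiny check against the dense inverse, on a matrix where LU needs no pivoting.
A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
P, Ld, Ud = lu(A)                                     # here P is the identity
print(inverse_entry(csr_matrix(Ld), csr_matrix(Ud), 2, 0),
      np.linalg.inv(A)[2, 0])
```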

If we were to exclude the case when more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries, and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node $j$ of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We then have to amalgamate any two vertices $i$ and $j$ where the entry $a^{-1}_{ij}$ is requested.

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{\mathrm{fwd}}$) and backward ($h^{\mathrm{bwd}}$) solves are shown. Note that the nets $h^{\mathrm{fwd}}_3$, as well as $h^{\mathrm{bwd}}_3$, $h^{\mathrm{bwd}}_4$, and $h^{\mathrm{bwd}}_8$, are removed. The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$).

Figure 4.1: Example of the hypergraph model for the general case: $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.

A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$. The cutsize is $c(h^{\mathrm{bwd}}_5) + c(h^{\mathrm{fwd}}_4) + c(h^{\mathrm{fwd}}_5)$. Consider the computation of $a^{-1}_{62}$: we need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: $10000 \le N \le 100000$, the average number of nonzeros per row is greater than or equal to 2.5, and, in the UFL index, the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes $B \in \{2, 4, 8, 16, 32, 64, 128, 256\}$. We have assigned random weights in the range [1, 200] to the tree nodes. Then, for each $P \in \{0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00\}$, we have created 10 instances (except for $P = 1.00$) by randomly selecting $P \times N$ integers between 1 and $N$ and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, $B$, and $P$, we have 10 different trees to partition, resulting in a total of $10 \times 6 \times 8 \times 61 + 8 \times 61 = 29768$ TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with $B \in \{4, 16, 64, 256\}$ (the last row relates to all $B$ values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, $B$, and $P$, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same $B$ and $P$. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with $B = 2$ (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of $56 \times 61$ triplets did PoPart obtain better results than BiseMatch. For all $P$, the performance of BiseMatch with respect to the lower bound becomes worse as $B$ increases. Although there are fluctuations in the performance of PoPart for small values of $P$, e.g., for 0.05 and 0.10, for larger values of $P$ the performance also becomes worse with larger values of $B$. We suspect that the lower bound might be loose for large values of $B$. For all $B$, the performance of BiseMatch with respect to the lower bound improves when $P$ increases; a similar trend is observable for PoPart, except for a small deviation for $B = 256$. Recall that the trees we use here come from a nested dissection ordering, and such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that, when a high percentage of the diagonal entries is requested, the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of $P$ for a given $B$.

Table 5.1: The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                      PoPart                    BiseMatch
  B     P      min     max     avg       min     max     avg
  4   0.05   1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
      0.10   1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
      0.20   1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
      0.40   1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
      0.60   1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
      0.80   1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
      1.00   1.0000  1.0491  1.0038    1.0000  1.0101  1.0002
 16   0.05   1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
      0.10   1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
      0.20   1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
      0.40   1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
      0.60   1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
      0.80   1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
      1.00   1.0003  1.1052  1.0133    1.0000  1.0096  1.0008
 64   0.05   1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
      0.10   1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
      0.20   1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
      0.40   1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
      0.60   1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
      0.80   1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
      1.00   1.0014  1.2087  1.0315    1.0000  1.0141  1.0024
256   0.05   1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
      0.10   1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
      0.20   1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
      0.40   1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
      0.60   1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
      0.80   1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
      1.00   1.0043  1.1203  1.0495    1.0001  1.0677  1.0118
Over all triplets  1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1; the names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16, 64.

Table 5.2: The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                                 Total size of the            Running time of the
                             loaded factors (MBytes)          solution phase (s)
matrix        B  Lower bound     NoES   ES-Nat  ES-PoP      NoES  ES-Nat  ES-PoP
CESR21532    16      5403       63313     7855    5422     11138     694     393
             64      1371       15828     2596    1389      3596     381     166
CESR46799    16      2399      114051     3158    2417     39623     767     477
             64       620       28512     1176     635      8663     513     285
CESR72358    16      1967      375737     6056    2008    108009    2637     718
             64       528       93934     4796     571     31740    2741     520
CESR148286   16      8068     1595645    16595    8156    433967    7207    2685
             64      2092      398911    11004    2179    140493    7267    1998
af23560      16     16720      114672    17864   16745     20806    2411    1976
             64      4215       28668     5245    4245      6685    1210     595
ecl32        16     95478      618606   141533   95566    121847   27263   17606
             64     23943      154651    43429   24046     35255    9741    4829
stokes64     16       721        8503     1026     726      1312     142      85
             64       185        2125      425     189       488     102      41
boyd1        16      2028       75521     4232    2031    165512    3898    2149
             64       515       18880     1406     518     54927    2305    1212

In this table, the values in the column "Lower bound" are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence the slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

Table 5.3: The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries is computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                                Total size of the      Running time of the
                            loaded factors (MBytes)    solution phase (s)
matrix        B  Lower bound    ES-PoP    ES-HP         ES-PoP    ES-HP
CESR21532    16       563        1782      999             566      169
             64       164         703      464             283      136
CESR46799    16       264         549      416             322      251
             64        93         232      195             355      253
CESR72358    16       242        1124      868            1229      738
             64       116         794      598             995      728
CESR148286   16       905        3175     2693            4260     3217
             64       345        2080     1669            2810     2358
af23560      16      1703        3579     2463            1092      664
             64       458        1219     1003             471      343
ecl32        16      9617       22514    12615            5077     2652
             64      2483        7309     4664            1992     1199
stokes64     16        77         188      149              29       23
             64        26          75       74              19       17
boyd1        16       205         481      258             390      344
             64        55         198       93             259      242

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism; indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

Figure 6.1: Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} because the total size of the loaded factors is 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight, and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.

6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N, N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $N$ linear systems and require considerably more memory. On the other hand, if only a set of entries in the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order to compute all the entries that are needed to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.

We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of the factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries in the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and we showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey; PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Comm. ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM J. Matrix Analysis and Applications, 14 (1993), pp. 334–352.

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, Presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 9: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 9

nd P (Rk) we then post-order the nodes in P (Rk) and start loading the associatedL factors from the disk and perform the forward solves with L When we reach theroot node we have |Rk| dense vectors and we start loading the associated U factorsfrom the disk and perform backward substitutions along the paths that we traversed(in reverse order) during the forward substitutions

3 Partitioning methods and models As discussed above partitioning therequested entries into blocks to minimize the cost of loading factors corresponds tothe TreePartitioning problem In this section we will focus on the case whereall of the requested entries are on the diagonal of the inverse As noted before inthis case the partial forward and backward solves correspond to visiting the samepath in the elimination tree We analyse the TreePartitioning problem in detailfor this case and show that it is NP-complete we also show that the case where theblock size B = 2 is polynomial time solvable We provide two heuristics one with anapproximation guarantee (in the sense that we can prove that it is at worst twice asbad as optimal) and the other being somewhat better in practice we also introducea hypergraph partitioning-based formulation which is more general than the otherheuristics

Before introducing the algorithms and models we present a lower bound for thecost of an optimal partition Let nl(i) denote the number of leaves of the subtreerooted at node i which can be computed as follows

nl(i) =

1 i is a leaf nodesum

jisinchildren(i) nl(j) otherwise(31)

We note that as all the leaf nodes correspond to the requested diagonal entries of theinverse nl(i) corresponds to the number of forward and backward solves that have tobe performed at node i

Given the number of forward and backward solves that pass through a node i itis easy to dene the following lower bound on the amount of the factors loaded

Theorem 31 (Lower bound of the amount of factors to load) Let T be a

node weighted tree w(i) be the weight of node i B be the maximum allowed size of

a partition and nl(i) be the number of leaf nodes in the subtree rooted at i Then

we have the following lower bound denoted by η on the optimal solution c of the

TreePartitioning problem

η = 2timessumiisinT

w(i)timeslceilnl(i)B

rceille c

Proof Follows easily by noting that each node i has to be loaded at leastlceil

nl(i)B

rceiltimes

As the formula includes nl(middot) the lower bounds for wide and shallow trees willusually be smaller than the lower bounds for tall and skinny trees Each internalnode is on a path from (at least) one leaf node therefore dnl(i)Be is at least 1 and2times

sumi w(i) le c

Figure 31 illustrates the notion of the number of leaves of a subtree and thecomputation of the lower bound Entries aminus1

11 aminus133 and a

minus144 are requested and the

elimination tree of Figure 21(c) is modied accordingly to have leaves (with zeroweights) corresponding to these entries The numbers nl(i) are shown next to the

10 AMESTOY ET AL

6

3

5

4

1

3

2

3

1

14 3

11

1 1

Figure 31 Number of leaves of the subtrees rooted at each node of a transformed eliminationtree The nodes corresponding to the requested diagonal entries of the inverse are shaded and aleaf node is added for each such entry Each node is annotated with the number of leaves in thecorresponding subtree resulting in a lower bound of η = 14 with B = 2

nodes Suppose that each internal node has unit weight and that the block size is 2Then the lower bound is

η = 2times(lceil

12

rceil+lceil

12

rceil+lceil

22

rceil+lceil

32

rceil+lceil

32

rceil)= 14

Recall that we have transformed the elimination tree in such a way that therequested entries now correspond to the leaves and each leaf corresponds to a requestedentry We have the following computational complexity result

Theorem 32 The TreePartitioning problem is NP-complete

Proof We consider the associated decision problem given a tree T with m leavesa value of B and a cost bound c does there exist a partitioning S of the m leavesinto subsets whose size does not exceed B and such that cost(S) le c It is clear thatthis problem belongs to NP since if we are given the partition S it is easy to checkin polynomial time that it is valid and that its cost meets the bound c We now haveto prove that the problem is in the NP-complete subset

To establish the completeness we use a reduction from 3-PARTITION [14] whichis NP-complete in the strong sense Consider an instance I1 of 3-PARTITION givena set a1 a3p of 3p integers and an integer Z such that

sum1lejle3p aj = pZ does

there exist a partition of 1 3p into p disjoint subsets K1 Kp each withthree elements such that for all 1 le i le p

sumjisinKi

aj = ZWe build the following instance I2 of our problem the tree is a three-level tree

composed of N = 1 + 3p + pZ nodes the root vr of cost wr has 3p children viof same cost wv for 1 le i le 3p In turn each vi has ai children each being a leafnode of zero cost This instance I2 of the TreePartitioning problem is shown inFig 32 We let B = Z and ask whether there exists a partition of leaf nodes of costc = pwr + 3pwv Here wr and wv are arbitrary values (we can take wr = wv = 1)We note that the cost c corresponds to the lower bound shown in Theorem 31 inthis lower bound each internal node vi is loaded only once and the root is loaded ptimes since it has pZ = pB leaves below it Note that the size of I2 is polynomialin the size of I1 Indeed because 3-PARTITION is NP-complete in the strong sensewe can encode I1 in unary and the size of the instance is O(pZ)

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph $G = (V, V \times V)$ of the leaves of a given tree, and assume that the edge $(i, j)$ represents the decision to put the leaf nodes $i$ and $j$ together in a part. Given this definition of the vertices and edges, we associate the value $m(i, j) = \mathrm{cost}(\{i, j\})$ to the edge $(i, j)$ if $i \ne j$, and $m(i, i) = \sum_{n \in V} w(n)$ (or any sufficiently large number). Then a minimum weighted matching in $G$ defines a partitioning of the vertices in $V$ with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of $O(|V|^{5/2})$ and $O(|V|^2)$ memory requirements. Therefore, we propose yet another exact algorithm for B = 2.
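A sketch of this matching formulation using networkx (a library choice of ours; the text does not tie the formulation to any library). For simplicity it assumes an even number of leaves, so the large self-loop weights m(i, i) are not needed; cost(i, j) stands for the cost function of (2.11).

    import itertools
    import networkx as nx

    def match_by_mwpm(leaves, cost):
        # Build the complete graph on the leaves with m(i, j) = cost(i, j).
        G = nx.Graph()
        for i, j in itertools.combinations(leaves, 2):
            # Negate the costs: a maximum-weight perfect matching on -m
            # is a minimum-weight perfect matching on m.
            G.add_edge(i, j, weight=-cost(i, j))
        M = nx.max_weight_matching(G, maxcardinality=True)
        return [list(pair) for pair in M]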

The proposed algorithm Match proceeds from the parents of the leaf nodes to the root. At each internal node $n$, those leaf nodes that are in the subtree rooted at $n$ and which are not yet put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node $n$, one of them (arbitrarily) is passed to parent($n$). Clearly, this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and run time requirements are $O(|V|)$. We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2 Match: an exact algorithm for B = 2

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_2 = {R_1, ..., R_K}, where K = ⌈F/2⌉.
1: for each leaf node ℓ do
2:     add ℓ to list(parent(ℓ))
3: compute a post-order
4: k ← 1
5: for each non-leaf node n in post-order do
6:     if n ≠ r and list(n) contains an odd number of vertices then
7:         ℓ ← the node with the least weight in list(n)
8:         move ℓ to list(parent(n)); add w(n) to the weight of ℓ    ▷ relay it to the father
9:     else if n = r and list(r) contains an odd number of vertices then
10:        ℓ ← a node with the least weight in list(n)
11:        make ℓ a singleton
12:    for i = 1 to |list(n)| by 2 do
13:        put the ith and (i+1)st vertices in list(n) into R_k; increment k    ▷ match the ith and (i+1)st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same; they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent($n$) gives the parent of node $n$, and list($n$) is an array associated with node $n$; the sum of the sizes of the list($\cdot$) arrays is $|V|$. The modification is that, when there is an odd number of leaf nodes to partition at node $n$, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node $i$, when Match processes node $n$, is defined as $w(i, n) - w(n)$: the sum of the weights of the nodes on the unique path between nodes $i$ and $n$, including $i$ but excluding $n$. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes that have traversed longer (heavier) paths before being partitioned are matched before those with smaller cumulative weights.
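The following Python sketch of Match follows Algorithm 2; the tree is assumed to be given by a parent array, node weights w(n), a post-order list of the internal nodes, and the set of (zero-weight) leaf nodes, a data layout of our choosing.

    from collections import defaultdict

    def match(parent, weight, postorder, leaves, root):
        lists = defaultdict(list)            # list(n) of Algorithm 2
        cumw = {l: 0.0 for l in leaves}      # cumulative weights w(i, n) - w(n)
        for l in leaves:
            lists[parent[l]].append(l)
        parts = []
        for n in postorder:                  # internal nodes, bottom-up
            pending = lists[n]
            if len(pending) % 2 == 1:
                l = min(pending, key=lambda x: cumw[x])
                pending.remove(l)
                if n != root:
                    cumw[l] += weight[n]     # relay it to the father
                    lists[parent[n]].append(l)
                else:
                    parts.append([l])        # odd leaf left at the root
            # match the remaining (even number of) leaves two by two
            parts.extend(pending[i:i + 2] for i in range(0, len(pending), 2))
        return parts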

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case when $B = 2^k$ for some $k$; the BiseMatch algorithm is shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then, one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after $\log B = k$ steps BiseMatch obtains nodes that represent at most $B-1$ other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3 BiseMatch: a heuristic algorithm for B = 2^k

Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1: for level = 1 to k do
2:     M ← Match(T)
3:     for each pair (i, j) ∈ M: remove the leaf node j from T and mark the leaf node i as representative
4: clean up the tree T so that all leaf nodes correspond to some requested entry
5: each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes $i$ and $j$ are matched at their least common ancestor $\mathrm{lca}(i, j)$, if $w(i, \mathrm{lca}(i, j)) \ge w(j, \mathrm{lca}(i, j))$, then we designate $i$ to be the representative of the two by adding $(i, j)$ to $M$; otherwise, we designate $j$ to be the representative by adding $(j, i)$ to $M$. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is $O(|V| \log B)$, with an $O(|V|)$ memory requirement.
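A skeleton of BiseMatch reusing the match() sketch above; for brevity it keeps the first element of each pair as the representative, omitting the heavier-path-to-the-lca selection rule just described.

    def bise_match(parent, weight, postorder, leaves, root, B):
        k = B.bit_length() - 1               # B = 2^k
        rep = {l: [l] for l in leaves}       # nodes represented by each leaf
        current = list(leaves)
        for _ in range(k):
            pairs = [p for p in match(parent, weight, postorder, current, root)
                     if len(p) == 2]
            for i, j in pairs:               # keep i, remove j from the tree
                rep[i].extend(rep.pop(j))
                current.remove(j)
        return list(rep.values())            # parts of size at most B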

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here, we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph $H = (V, N)$ is defined as a set of vertices $V$ and a set of nets $N$. Every net is a subset of vertices. Weights can be associated with vertices; we use $w(j)$ to denote the weight of the vertex $v_j$. Costs can be associated with nets; we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a $K$-way vertex partition of $H = (V, N)$ if each part is nonempty, parts are pairwise disjoint, and the union of the parts gives $V$. In $\Pi$, a net is said to connect a part if it has at least one vertex in that part. The connectivity set $\Lambda(i)$ of a net $h_i$ is the set of parts connected by $h_i$. The connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In $\Pi$, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

$$\mathrm{cutsize}(\Pi) = \sum_{h_i \in N} (\lambda(i) - 1) \times c(h_i). \qquad (3.2)$$

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

$$\frac{W_{\max} - W_{\mathrm{avg}}}{W_{\mathrm{avg}}} \le \varepsilon.$$

Here, $W_{\max}$ is the largest part weight, $W_{\mathrm{avg}}$ is the average part weight, and $\varepsilon$ is a predetermined imbalance ratio. This problem is NP-hard [19].
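In code, the connectivity-1 metric (3.2) is straightforward to evaluate; the dictionary-based layout below is only illustrative.

    def cutsize(nets, cost, part_of):
        # nets: net id -> iterable of vertices; cost: net id -> c(h);
        # part_of: vertex -> part id. Returns sum of (lambda(h) - 1) * c(h).
        total = 0
        for h, verts in nets.items():
            lam = len({part_of[v] for v in verts})   # connectivity lambda(h)
            total += (lam - 1) * cost[h]
        return total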

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path($h$) to denote the path (or the set of nodes of the tree) corresponding to a net $h$. A vertex $i$ (corresponding to the leaf node $i$ in the tree) will be in a net $h$ if the solve for $a^{-1}_{ii}$ passes through path($h$). In other words, if path($h$) $\subset P(i)$, then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among $\lambda(n)$ parts, then the factors corresponding to the nodes in path($h_n$) will have to be loaded $\lambda(n)$ times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, N_D)$ for the diagonal case. Let $T = (V, E)$ be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in $T$. As we are interested in putting at most $B$ solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally. There is a net in $N_D$ for each internal node of $T$. The net $h_n$ corresponding to the node $n$ contains the set of vertices which correspond to the leaf nodes of the subtree $T(n)$. The cost of $h_n$ is equal to the weight of node $n$, i.e., $c(h_n) = w(n)$. This model can be simplified as follows: if a net $h_n$ contains the same vertices as the net $h_j$, where $j = \mathrm{parent}(n)$, that is, if the subtree rooted at node $n$ and the subtree rooted at its father $j$ have the same set of leaf nodes, then the net $h_j$ can be removed, and its cost can be added to the cost of the net $h_n$. This way, the net $h_n$ represents the node $n$ and its parent $j$. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree $T$.

Figure 3.3. The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.
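A sketch of the construction of $H_D$, including the chain simplification: keying each net by its set of leaves merges a node's net with its parent's whenever the two subtrees have the same leaves, and single-vertex nets are dropped. The children/weight containers are our own illustrative layout.

    def build_diagonal_hypergraph(children, weight, root):
        costs = {}   # net (frozenset of requested leaves) -> accumulated cost

        def leaves_below(n):
            kids = children.get(n, [])
            if not kids:                      # leaf: a requested entry
                return frozenset([n])
            s = frozenset().union(*(leaves_below(c) for c in kids))
            if len(s) > 1:                    # single-vertex nets are dropped
                costs[s] = costs.get(s, 0) + weight[n]   # chain nets merge here
            return s

        leaves_below(root)
        return costs

On the example of Figure 3.3, this variant also drops $h_1$ and $h_2$, which cannot contribute to the cutsize anyway.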

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore, $V = \{1, 2, 5\}$ and $N = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, and $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is

$$\begin{aligned}
\mathrm{cutsize}(V_1, V_2) &= c(h_1) \times (\lambda(h_1) - 1) + c(h_2) \times (\lambda(h_2) - 1)\\
&\quad + c(h_4) \times (\lambda(h_4) - 1) + c(h_5) \times (\lambda(h_5) - 1)\\
&= c(h_4) \times (2 - 1) + c(h_5) \times (2 - 1) = c(h_4) + c(h_5).
\end{aligned}$$

Consider the first part, $V_1 = \{2\}$. We have to load the factors associated with the nodes 2, 3, 4, and 5. Consider now the second part, $V_2 = \{1, 5\}$. For this part, we have to load the factors associated with the nodes 1, 4, and 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is $w(4) + w(5)$, which is equal to the cutsize $c(h_4) + c(h_5)$.

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore, we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,
$$y = L^{-1} e_j, \qquad a^{-1}_{ij} = \left(U^{-1} y\right)_i,$$
only one $y$ vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for $i$ and $k$. This means that for the forward solves with $L$ we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to $\mathrm{lca}(i, k)$ only once. Clearly, this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases). Avoiding the unnecessary repeated solves would only affect the operation count.
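As a naive illustration of sharing the forward solve, the SciPy sketch below computes several entries of one column of the inverse; it assumes L and U are available as sparse CSR triangular factors and, unlike the actual solver, neither exploits the sparsity of e_j nor truncates the backward solve at the least common ancestor.

    import numpy as np
    from scipy.sparse.linalg import spsolve_triangular

    def column_inverse_entries(L, U, j, rows):
        n = L.shape[0]
        e_j = np.zeros(n)
        e_j[j] = 1.0
        y = spsolve_triangular(L, e_j, lower=True)    # one shared forward solve
        z = spsolve_triangular(U, y, lower=False)     # one backward solve
        return {i: z[i] for i in rows}                # entries (A^{-1})_{ij}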

If we were to exclude the case when more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries and once for the row indices of the requested entries. We extend the model to cover the case of multiple entries in the same column by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve: when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node $j$ of the elimination tree). Then, adding the two lower bounds (for rows and for columns) yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices $i$ and $j$ where the entry $a^{-1}_{ij}$ is requested.
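A minimal sketch of the amalgamation step, with a labeling convention of our own: the t-th requested entry (i, j) contributes the leaf ('col', t) to the forward hypergraph and ('row', t) to the backward one, and the two are merged into a single composite vertex.

    def composite_hypergraph(fwd_nets, bwd_nets, requested):
        # fwd_nets/bwd_nets: lists of (net, cost) pairs, nets being sets of
        # occurrence labels; requested: list of (i, j) pairs.
        vertex = {}
        for t, (i, j) in enumerate(requested):
            vertex[('col', t)] = vertex[('row', t)] = t   # amalgamated vertex
        merge = lambda net: frozenset(vertex[v] for v in net)
        # nets are kept intact; only their vertices are relabeled
        return [(merge(n), c) for n, c in fwd_nets + bwd_nets]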

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{\mathrm{fwd}}$) and backward ($h^{\mathrm{bwd}}$) solves are shown. Note that the nets $h^{\mathrm{fwd}}_3$, as well as $h^{\mathrm{bwd}}_3$, $h^{\mathrm{bwd}}_4$, and $h^{\mathrm{bwd}}_8$, are removed. The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$).

Figure 4.1. Example of the hypergraph model for the general case: $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.

A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$. The cut size is $c(h^{\mathrm{bwd}}_5) + c(h^{\mathrm{fwd}}_4) + c(h^{\mathrm{fwd}}_5)$. Consider the computation of $a^{-1}_{62}$. We need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once if we were able to solve for all of them in a single pass), and in the backward solution, the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7], briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range [1, 200] to tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that, for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.
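The instance generation described above can be sketched as follows (our reconstruction of the setup, with a hypothetical seed):

    import random

    def requested_entry_sets(N, seed=0):
        rng = random.Random(seed)
        instances = {}
        for P in (0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00):
            m = int(P * N)
            reps = 1 if P == 1.00 else 10    # 10 random draws except P = 1.00
            instances[P] = [rng.sample(range(1, N + 1), m) for _ in range(reps)]
        return instances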

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for P = 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24].

Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the averages, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                    PoPart                       BiseMatch
  B      P      min      max      avg       min      max      avg
  4    0.05   1.0011   1.1751   1.0278    1.0000   1.0139   1.0013
       0.10   1.0005   1.1494   1.0192    1.0000   1.0073   1.0005
       0.20   1.0003   1.0945   1.0119    1.0000   1.0052   1.0003
       0.40   1.0001   1.0585   1.0072    1.0000   1.0031   1.0001
       0.60   1.0001   1.0449   1.0053    1.0000   1.0019   1.0001
       0.80   1.0000   1.0367   1.0043    1.0000   1.0029   1.0001
       1.00   1.0000   1.0491   1.0038    1.0000   1.0101   1.0002
 16    0.05   1.0050   1.1615   1.0592    1.0000   1.0482   1.0113
       0.10   1.0026   1.1780   1.0485    1.0000   1.0553   1.0075
       0.20   1.0016   1.2748   1.0374    1.0000   1.0334   1.0035
       0.40   1.0007   1.1898   1.0246    1.0000   1.0230   1.0016
       0.60   1.0005   1.1431   1.0186    1.0000   1.0166   1.0010
       0.80   1.0004   1.1136   1.0154    1.0000   1.0190   1.0011
       1.00   1.0003   1.1052   1.0133    1.0000   1.0096   1.0008
 64    0.05   1.0132   1.1581   1.0800    1.0000   1.0797   1.0275
       0.10   1.0101   1.1691   1.0715    1.0002   1.0584   1.0196
       0.20   1.0054   1.1389   1.0599    1.0001   1.0506   1.0125
       0.40   1.0030   1.1843   1.0497    1.0000   1.0437   1.0079
       0.60   1.0020   1.2362   1.0407    1.0000   1.1022   1.0072
       0.80   1.0015   1.3018   1.0383    1.0000   1.0344   1.0044
       1.00   1.0014   1.2087   1.0315    1.0000   1.0141   1.0024
256    0.05   1.0050   1.1280   1.0651    1.0000   1.0867   1.0342
       0.10   1.0127   1.1533   1.0721    1.0003   1.0911   1.0314
       0.20   1.0133   1.1753   1.0730    1.0002   1.0722   1.0257
       0.40   1.0093   1.1598   1.0668    1.0003   1.0540   1.0187
       0.60   1.0068   1.1621   1.0602    1.0002   1.0572   1.0174
       0.80   1.0068   1.1314   1.0563    1.0001   1.0515   1.0120
       1.00   1.0043   1.1203   1.0495    1.0001   1.0677   1.0118
Over all triplets    1.0000   1.3018   1.0359    1.0000   1.1110   1.0079

In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16 and B = 64.

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                                 Total size of the loaded        Running time of the
                                 factors (MBytes)                solution phase (s)
matrix        B   Lower bound      NoES    ES-Nat   ES-PoP        NoES   ES-Nat   ES-PoP
CESR21532    16       5403        63313      7855     5422      1113.8     69.4     39.3
             64       1371        15828      2596     1389       359.6     38.1     16.6
CESR46799    16       2399       114051      3158     2417      3962.3     76.7     47.7
             64        620        28512      1176      635       866.3     51.3     28.5
CESR72358    16       1967       375737      6056     2008     10800.9    263.7     71.8
             64        528        93934      4796      571      3174.0    274.1     52.0
CESR148286   16       8068      1595645     16595     8156     43396.7    720.7    268.5
             64       2092       398911     11004     2179     14049.3    726.7    199.8
af23560      16      16720       114672     17864    16745      2080.6    241.1    197.6
             64       4215        28668      5245     4245       668.5    121.0     59.5
ecl32        16      95478       618606    141533    95566     12184.7   2726.3   1760.6
             64      23943       154651     43429    24046      3525.5    974.1    482.9
stokes64     16        721         8503      1026      726       131.2     14.2      8.5
             64        185         2125       425      189        48.8     10.2      4.1
boyd1        16       2028        75521      4232     2031     16551.2    389.8    214.9
             64        515        18880      1406      518      5492.7    230.5    121.2

In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process, respectively under the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and under the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.

Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                                Total size of the loaded    Running time of the
                                factors (MBytes)            solution phase (s)
matrix        B   Lower bound     ES-PoP     ES-HP           ES-PoP     ES-HP
CESR21532    16       563           1782       999             56.6      16.9
             64       164            703       464             28.3      13.6
CESR46799    16       264            549       416             32.2      25.1
             64        93            232       195             35.5      25.3
CESR72358    16       242           1124       868            122.9      73.8
             64       116            794       598             99.5      72.8
CESR148286   16       905           3175      2693            426.0     321.7
             64       345           2080      1669            281.0     235.8
af23560      16      1703           3579      2463            109.2      66.4
             64       458           1219      1003             47.1      34.3
ecl32        16      9617          22514     12615            507.7     265.2
             64      2483           7309      4664            199.2     119.9
stokes64     16        77            188       149              2.9       2.3
             64        26             75        74              1.9       1.7
boyd1        16       205            481       258             39.0      34.4
             64        55            198        93             25.9      24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. First, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} because the total sizes of the loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well in out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N, N)$ and proceeding in reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].
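For reference, a sketch of these equations in our notation (assuming an $A = LDU$ factorization with unit triangular factors; see [13, 25] for the precise forms):

\begin{align*}
Z = A^{-1} = D^{-1}L^{-1} + (I - U)Z \quad &\Rightarrow\quad z_{ij} = \delta_{ij}\, d_{ii}^{-1} - \sum_{k>i} u_{ik}\, z_{kj} \quad (i \le j),\\
Z = U^{-1}D^{-1} + Z(I - L) \quad &\Rightarrow\quad z_{ij} = \delta_{ij}\, d_{jj}^{-1} - \sum_{k>j} z_{ik}\, l_{kj} \quad (i \ge j).
\end{align*}

Starting from $z_{NN} = 1/d_{NN}$ and proceeding in reverse Crout order, every entry in the pattern of $(L+U)^T$ is obtained from already computed entries with larger indices.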

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $N$ linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of the factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed differences that apply when the factorization is held in-core. Although we have made the first steps toward showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, 1993.
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the $sp^3d^5s^*$ tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 10: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

10 AMESTOY ET AL

6

3

5

4

1

3

2

3

1

14 3

11

1 1

Figure 31 Number of leaves of the subtrees rooted at each node of a transformed eliminationtree The nodes corresponding to the requested diagonal entries of the inverse are shaded and aleaf node is added for each such entry Each node is annotated with the number of leaves in thecorresponding subtree resulting in a lower bound of η = 14 with B = 2

nodes Suppose that each internal node has unit weight and that the block size is 2Then the lower bound is

η = 2times(lceil

12

rceil+lceil

12

rceil+lceil

22

rceil+lceil

32

rceil+lceil

32

rceil)= 14

Recall that we have transformed the elimination tree in such a way that therequested entries now correspond to the leaves and each leaf corresponds to a requestedentry We have the following computational complexity result

Theorem 32 The TreePartitioning problem is NP-complete

Proof We consider the associated decision problem given a tree T with m leavesa value of B and a cost bound c does there exist a partitioning S of the m leavesinto subsets whose size does not exceed B and such that cost(S) le c It is clear thatthis problem belongs to NP since if we are given the partition S it is easy to checkin polynomial time that it is valid and that its cost meets the bound c We now haveto prove that the problem is in the NP-complete subset

To establish the completeness we use a reduction from 3-PARTITION [14] whichis NP-complete in the strong sense Consider an instance I1 of 3-PARTITION givena set a1 a3p of 3p integers and an integer Z such that

sum1lejle3p aj = pZ does

there exist a partition of 1 3p into p disjoint subsets K1 Kp each withthree elements such that for all 1 le i le p

sumjisinKi

aj = ZWe build the following instance I2 of our problem the tree is a three-level tree

composed of N = 1 + 3p + pZ nodes the root vr of cost wr has 3p children viof same cost wv for 1 le i le 3p In turn each vi has ai children each being a leafnode of zero cost This instance I2 of the TreePartitioning problem is shown inFig 32 We let B = Z and ask whether there exists a partition of leaf nodes of costc = pwr + 3pwv Here wr and wv are arbitrary values (we can take wr = wv = 1)We note that the cost c corresponds to the lower bound shown in Theorem 31 inthis lower bound each internal node vi is loaded only once and the root is loaded ptimes since it has pZ = pB leaves below it Note that the size of I2 is polynomialin the size of I1 Indeed because 3-PARTITION is NP-complete in the strong sensewe can encode I1 in unary and the size of the instance is O(pZ)

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching as partitioning into blocks ofsize 2 can be described as a matching Consider the complete graph G = (V V times V )of the leaves of a given tree and assume that the edge (i j) represents the decisionto put the leaf nodes i and j together in a part Given this denition of the verticesand edges we associate the value m(i j) = cost(i j) to the edge (i j) if i 6= j andm(i i) =

sumnisinV w(n) (or any suciently large number) Then a minimum weighted

matching in G denes a partitioning of the vertices in V with the minimum cost (asdened in (211)) Although this is a short and immediate formulation it has a highrun time complexity of O(|V |52) and O(|V |2) memory requirements Therefore wepropose yet another exact algorithm for B = 2

The proposed algorithmMatch proceeds from the parents of the leaf nodes to theroot At each internal node n those leaf nodes that are in the subtree rooted at n andwhich are not put in a part yet are matched two by two (arbitrarily) and each pairis put in a part if there is an odd number of leaf nodes remaining to be partitionedat node n one of them (arbitrarily) is passed to parent(n) Clearly this algorithmattains the lower bound shown in Theorem 31 and hence it nds an optimal partitionfor B = 2 The memory and the run time requirements are O(|V |) We note that twoleaf nodes can be matched only at their least common ancestor

Algorithm 2 Match An exact algorithm for B = 2Input T = (VE r) with the root r and F leaves each requested entry corresponds

to a leaf node with a zero weightOutput Π2 = R1 RK where K = dF2e1 for each leaf node ` do2 add ` to list(parent(`))3 compute a postorder4 k larr 15 for each non-leaf node n in postorder do6 if n 6= r and list(n) contains an odd number of vertices then7 `larr the node with least weight in list(n)8 move ` to list(parent(n)) add w(n) to the weight of `

I relay it to the father9 else if n = r and list(r) contains an odd number of vertices then10 `larr a node with the least weight in list(n)11 make ` a singleton12 for i = 1 to |list(n)| by 2 do13 Put the ith and i+ 1st vertices in list(n) into Rk increment k

I Match the ith and i+ 1st elements in the list

We have slightly modied the basic algorithm and show this modied version inAlgorithm 2 The modications keep the running time and memory complexities thesame and they are realized to enable the use ofMatch as a building block for a moregeneral heuristic In this algorithm parent(n) gives the parent of node n and list(n)is an array associated with node n The sum of the sizes of the list(middot) arrays is |V |The modication is that when there are an odd number of leaf nodes to partition

14 AMESTOY ET AL

at node n the leaf node with the least cumulative weight is passed to the fatherThe cumulative weight of a leaf node i when Match processes node n is dened asw(i n)minusw(n) the sum of the weights of the nodes in the unique path between nodesi and n including i but excluding n This is easy to compute each time a leaf nodeis relayed to the father of the current node the weight of the current node is addedto the cumulative weight of the relayed leaf node (as shown in line 8) By doing thisthe leaf nodes which traverse longer paths before being partitioned are chosen beforethose with smaller weights

33 A heuristic for a more general case We propose a heuristic algorithmwhen B = 2k for some k the BiseMatch algorithm is shown in Algorithm 3 It isbased on a bisection approach At each bisection a matching among the leaf nodesis found by a call to Match Then one of the leaf nodes of each pair is removedfrom the tree the remaining one becomes a representative of the two and is called aprincipal node Since the remaining node at each bisection step is a representativeof the two representative nodes of the previous bisection step after logB = k stepsBiseMatch obtains nodes that represent at most Bminus 1 other nodes At the end thenodes represented by a principal node are included in the same part as the principalnode

Algorithm 3 BiseMatch: a heuristic algorithm for B = 2^k
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1:  for level = 1 to k do
2:     M ← Match(T)
3:     for each pair (i, j) ∈ M: remove the leaf node j from T and mark the leaf node i as representative
4:  Clean up the tree T so that all leaf nodes correspond to some requested entry
5:  Each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if w(i, lca(i, j)) ≥ w(j, lca(i, j)) then we designate i to be the representative of the two by adding (i, j) to M; otherwise we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.
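The bisection loop itself is short. The following sketch reuses the match routine above; path_weight and lca are assumed helpers (returning, respectively, the weight w(i, n) of the path from i to n and the least common ancestor of two leaves), and the tree cleanup of Algorithm 3 is not modeled.

import math

def bise_match(parent, weight, root, leaves, internal_postorder, B):
    # Heuristic for B = 2**k (Algorithm 3, sketched): repeated matchings,
    # keeping one principal node per matched pair.
    represented = {l: [l] for l in leaves}
    active = list(leaves)
    for _ in range(int(math.log2(B))):
        pairs = [p for p in match(parent, weight, root, active,
                                  internal_postorder) if len(p) == 2]
        for i, j in pairs:
            # The leaf with the heavier path to the lca stays principal.
            a = lca(i, j)
            if path_weight(i, a) < path_weight(j, a):
                i, j = j, i
            represented[i].extend(represented.pop(j))
            active.remove(j)                  # j is removed from the tree
    return list(represented.values())         # parts of size at most B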

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here again we give the model for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph H = (V, N) is defined as a set of vertices V and a set of nets N. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote the weight of the vertex v_j. Costs can be associated with nets; we use c(h_i) to denote the cost associated with the net h_i.

Π = {V_1, ..., V_K} is a K-way vertex partition of H = (V, N) if each part is nonempty, parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net h_i is the set of parts connected by h_i. The connectivity λ(i) = |Λ(i)| of a net h_i is the number of parts connected by h_i. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

cutsize(Π) = ∑_{h_i∈N} (λ(i) − 1) × c(h_i).   (3.2)

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

(W_max − W_avg) / W_avg ≤ ε.

Here W_max is the largest part weight, W_avg is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].
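Both the objective (3.2) and the balance constraint are straightforward to evaluate for a given partition; the following is a small sketch in our own notation, with parts as sets of vertices and nets given as a mapping from a net identifier to its vertex set.

def cutsize(parts, nets, cost):
    # Connectivity-1 metric: sum over nets of (lambda(h) - 1) * c(h).
    total = 0
    for h, verts in nets.items():
        connectivity = sum(1 for p in parts if p & verts)   # lambda(h)
        total += (connectivity - 1) * cost[h]
    return total

def imbalance(parts, w):
    # (W_max - W_avg) / W_avg, to be kept below epsilon.
    weights = [sum(w[v] for v in p) for p in parts]
    w_avg = sum(weights) / len(weights)
    return (max(weights) - w_avg) / w_avg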

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, i.e., c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed and its cost can be added to the cost of the net h_n.


Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

This way the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one which is closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.
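The construction of the nets, including the chain-merging and single-vertex rules just described, can be sketched as follows. The names are our own; the modified elimination tree is assumed to be given by parent pointers, and the chain merging is modeled as collapsing each chain onto its deepest net.

def build_nets(parent, weight, leaves):
    # Candidate net of an internal node: the requested leaves of its subtree.
    subleaves, children = {}, {}
    for l in leaves:
        below, n = l, parent[l]
        while n is not None:
            subleaves.setdefault(n, set()).add(l)
            children.setdefault(n, set()).add(below)
            below, n = n, parent[n]
    nets = {}
    for n in subleaves:
        # Keep h_n only if n is the deepest node of its chain ...
        deepest = all(subleaves.get(c, {c}) != subleaves[n]
                      for c in children[n])
        # ... and the net has more than one vertex.
        if not deepest or len(subleaves[n]) < 2:
            continue
        cost, j = weight[n], parent[n]
        while j is not None and subleaves.get(j) == subleaves[n]:
            cost += weight[j]          # absorb the ancestors of the chain
            j = parent[j]
        nets[n] = (frozenset(subleaves[n]), cost)
    return nets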

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (net h_3 is removed according to the rule described above, and the cost of h_2 includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, and h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

cutsize({V_1, V_2}) = c(h_1) × (λ(h_1) − 1) + c(h_2) × (λ(h_2) − 1)
                        + c(h_4) × (λ(h_4) − 1) + c(h_5) × (λ(h_5) − 1)
                    = c(h_4) × (2 − 1) + c(h_5) × (2 − 1)
                    = c(h_4) + c(h_5) .

Consider the first part V_1 = {2}. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part V_2 = {1, 5}. For this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

y = L^{-1} e_j ,
a^{-1}_{ij} = (U^{-1} y)_i ,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1} y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.
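In terms of loaded factors, the cost of processing a block of arbitrary entries is therefore governed by the unions of root paths of the column indices (forward phase) and of the row indices (backward phase). A sketch in our notation, with P(n) denoting the set of nodes from n to the root:

def loaded_factor_size(parent, weight, entries):
    # entries: requested (i, j) pairs of A^{-1} processed as one block.
    def root_path(n):
        path = set()
        while n is not None:
            path.add(n)
            n = parent[n]
        return path

    fwd, bwd = set(), set()
    for i, j in entries:
        fwd |= root_path(j)     # L factors read by the forward solves
        bwd |= root_path(i)     # U factors read by the backward solves
    return sum(weight[n] for n in fwd) + sum(weight[n] for n in bwd)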

If we were to exclude the case when more than one entry in the same column of A^{-1} is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: indeed, we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices of the requested entries, and adding these two lower bounds yields a lower bound for the general case. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve: when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]; the essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that repeated indices are distinguished by associating them with different leaf nodes. We then have to amalgamate any two vertices i and j where the entry a^{-1}_{ij} is requested.

Figure 4.1 shows an example where the requested entries are a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95}. The transformed elimination tree and the nets of the hypergraphs associated with the forward (h^fwd) and backward (h^bwd) solves are shown. Note that the nets h^fwd_3, as well as h^bwd_3, h^bwd_4, and h^bwd_8, are removed.


Figure 4.1. Example of the hypergraph model for the general case: a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95} are requested.

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = {{a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}}}. The cutsize is c(h^bwd_5) + c(h^fwd_4) + c(h^fwd_5). Consider the computation of a^{-1}_{62}. We need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range [1, 200] to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound becomes worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases; a similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average over 61 different matrices of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                        PoPart                        BiseMatch
B      P        min      max      avg         min      max      avg
4      0.05   1.0011   1.1751   1.0278      1.0000   1.0139   1.0013
       0.10   1.0005   1.1494   1.0192      1.0000   1.0073   1.0005
       0.20   1.0003   1.0945   1.0119      1.0000   1.0052   1.0003
       0.40   1.0001   1.0585   1.0072      1.0000   1.0031   1.0001
       0.60   1.0001   1.0449   1.0053      1.0000   1.0019   1.0001
       0.80   1.0000   1.0367   1.0043      1.0000   1.0029   1.0001
       1.00   1.0000   1.0491   1.0038      1.0000   1.0101   1.0002
16     0.05   1.0050   1.1615   1.0592      1.0000   1.0482   1.0113
       0.10   1.0026   1.1780   1.0485      1.0000   1.0553   1.0075
       0.20   1.0016   1.2748   1.0374      1.0000   1.0334   1.0035
       0.40   1.0007   1.1898   1.0246      1.0000   1.0230   1.0016
       0.60   1.0005   1.1431   1.0186      1.0000   1.0166   1.0010
       0.80   1.0004   1.1136   1.0154      1.0000   1.0190   1.0011
       1.00   1.0003   1.1052   1.0133      1.0000   1.0096   1.0008
64     0.05   1.0132   1.1581   1.0800      1.0000   1.0797   1.0275
       0.10   1.0101   1.1691   1.0715      1.0002   1.0584   1.0196
       0.20   1.0054   1.1389   1.0599      1.0001   1.0506   1.0125
       0.40   1.0030   1.1843   1.0497      1.0000   1.0437   1.0079
       0.60   1.0020   1.2362   1.0407      1.0000   1.1022   1.0072
       0.80   1.0015   1.3018   1.0383      1.0000   1.0344   1.0044
       1.00   1.0014   1.2087   1.0315      1.0000   1.0141   1.0024
256    0.05   1.0050   1.1280   1.0651      1.0000   1.0867   1.0342
       0.10   1.0127   1.1533   1.0721      1.0003   1.0911   1.0314
       0.20   1.0133   1.1753   1.0730      1.0002   1.0722   1.0257
       0.40   1.0093   1.1598   1.0668      1.0003   1.0540   1.0187
       0.60   1.0068   1.1621   1.0602      1.0002   1.0572   1.0174
       0.80   1.0068   1.1314   1.0563      1.0001   1.0515   1.0120
       1.00   1.0043   1.1203   1.0495      1.0001   1.0677   1.0118
Over all triplets  1.0000   1.3018   1.0359      1.0000   1.1110   1.0079

the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed, with B = 16, 64. In this table,


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                         Total size of the                 Running time of the
                Lower    loaded factors (MBytes)           solution phase (s)
matrix      B   bound      NoES    ES-Nat  ES-PoP        NoES   ES-Nat  ES-PoP
CESR21532   16   5403     63313      7855    5422       11138      694     393
            64   1371     15828      2596    1389        3596      381     166
CESR46799   16   2399    114051      3158    2417       39623      767     477
            64    620     28512      1176     635        8663      513     285
CESR72358   16   1967    375737      6056    2008      108009     2637     718
            64    528     93934      4796     571       31740     2741     520
CESR148286  16   8068   1595645     16595    8156      433967     7207    2685
            64   2092    398911     11004    2179      140493     7267    1998
af23560     16  16720    114672     17864   16745       20806     2411    1976
            64   4215     28668      5245    4245        6685     1210     595
ecl32       16  95478    618606    141533   95566      121847    27263   17606
            64  23943    154651     43429   24046       35255     9741    4829
stokes64    16    721      8503      1026     726        1312      142      85
            64    185      2125       425     189         488      102      41
boyd1       16   2028     75521      4232    2031      165512     3898    2149
            64    515     18880      1406     518       54927     2305    1212

the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a postorder (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths, hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                         Total size of the             Running time of the
                Lower    loaded factors (MBytes)       solution phase (s)
matrix      B   bound     ES-PoP    ES-HP             ES-PoP    ES-HP
CESR21532   16    563       1782      999                566      169
            64    164        703      464                283      136
CESR46799   16    264        549      416                322      251
            64     93        232      195                355      253
CESR72358   16    242       1124      868               1229      738
            64    116        794      598                995      728
CESR148286  16    905       3175     2693               4260     3217
            64    345       2080     1669               2810     2358
af23560     16   1703       3579     2463               1092      664
            64    458       1219     1003                471      343
ecl32       16   9617      22514    12615               5077     2652
            64   2483       7309     4664               1992     1199
stokes64    16     77        188      149                 29       23
            64     26         75       74                 19       17
boyd1       16    205        481      258                390      344
            64     55        198       93                259      242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. First, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides


Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context, the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size. In the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed; conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight, and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
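The two metrics are easy to compare on any given tree and partition. The sketch below (in our own notation) evaluates both, charging each node of a block's union of paths once for the out-of-core volume, and once per right-hand side of the block for the in-core operation count, as in the caption of Figure 6.1.

def block_costs(parent, weight, blocks):
    # blocks: a partition of the leaf nodes into lists of leaves.
    def root_path(n):
        path = set()
        while n is not None:
            path.add(n)
            n = parent[n]
        return path

    out_of_core = in_core = 0
    for block in blocks:
        union = set().union(*(root_path(l) for l in block))
        volume = sum(weight[n] for n in union)
        out_of_core += volume             # each factor loaded once per block
        in_core += volume * len(block)    # each node touched |block| times
    return out_of_core, in_core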

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L + U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L + U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if only a subset of the entries of the inverse is requested, any implementation based on the equations by Takahashi et al. must set up the necessary data structures and determine the computational order to obtain all the entries that are needed to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch being the better of the two in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work is needed in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, 14 (1993), pp. 334–352.

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 11: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 11

a3pa1 a2

wr

wv

r

v2v1v3p

wv wv

Figure 32 The instance of the TreePartitioning problem corresponding to a given 3-PARTITION PROBLEM The weight of each node is shown next to the node The minimum costof a solution for B = Z to the TreePartitioning problem is ptimeswr +3ptimeswv which is only possiblewhen the children of each vi are all in the same part and when the children of three dierent internalnodes say vi vj vk are put in the same part This corresponds to putting the numbers ai aj ak

into a set for the 3-PARTITION problem which sums up to Z

Now we show that I1 has a solution if and only if I2 has a solution Suppose rstthat I1 has a solution K1 Kp The partition of leaf nodes corresponds exactly tothe subsets Ki we build p subsets Si whose leaves are the children of vertices vj withj isin Ki Suppose now that I2 has a solution To meet the cost bound each internalnode has to be loaded only once and the root at most p times This means that thepartition involves at most p subsets to cover all leaves Because there are pZ leaveseach subset is of size exactly Z Because each internal node is loaded only once all itsleaves belong to the same subset Altogether we have found a solution to I1 whichconcludes the proof

We can further show that we cannot get a close approximation to the optimalsolution in polynomial time

Theorem 33 Unless P=NP there is no 1 + o( 1N ) polynomial approximation

for trees with N nodes in the TreePartitioning problem

Proof Assume that there exists a polynomial 1+ ε(N)N approximation algorithm for

trees with N nodes where limNrarrinfin ε(N) = 0 Let ε(N) lt 1 for N ge N0 Consider anarbitrary instance I0 of 3-PARTITION with a set a1 a3p of 3p integers and aninteger Z such that

sum1lejle3p aj = pZ Without loss of generality assume that ai ge 2

for all i (hence Z ge 6) We ask if we can partition the 3p integers of I0 into p triples ofthe same sum Z Now we build an instance I1 of 3-PARTITION by adding X times

the integer Zminus 2 and 2X times the integer 1 to I0 where X = max(lceil

N0minus1Z+3

rceilminus p 1

)

Hence I1 has 3p+3X integers and we ask whether these can be partitioned into p+Xtriples of the same sum Z Clearly I0 has a solution if and only if I1 does (the integerZ minus 2 can only be in a set with two 1s)

We build an instance I2 of TreePartitioning from I1 exactly as we did in theproof of Theorem 32 with wr = wv = 1 and B = Z The only dierence is that thevalue p in the proof has been replaced by p + X here therefore the three-level treenow has N = 1 + 3(p+X) + (p+X)Z nodes Note that X has been chosen so thatN ge N0 Just as in the proof of Theorem 32 I1 has a solution if and only if theoptimal cost for the tree is c = 4(p+X) and otherwise the optimal cost is at least4(p+X) + 1

If I1 has a solution and because N ge N0 the approximation algorithm will

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching as partitioning into blocks ofsize 2 can be described as a matching Consider the complete graph G = (V V times V )of the leaves of a given tree and assume that the edge (i j) represents the decisionto put the leaf nodes i and j together in a part Given this denition of the verticesand edges we associate the value m(i j) = cost(i j) to the edge (i j) if i 6= j andm(i i) =

sumnisinV w(n) (or any suciently large number) Then a minimum weighted

matching in G denes a partitioning of the vertices in V with the minimum cost (asdened in (211)) Although this is a short and immediate formulation it has a highrun time complexity of O(|V |52) and O(|V |2) memory requirements Therefore wepropose yet another exact algorithm for B = 2

The proposed algorithmMatch proceeds from the parents of the leaf nodes to theroot At each internal node n those leaf nodes that are in the subtree rooted at n andwhich are not put in a part yet are matched two by two (arbitrarily) and each pairis put in a part if there is an odd number of leaf nodes remaining to be partitionedat node n one of them (arbitrarily) is passed to parent(n) Clearly this algorithmattains the lower bound shown in Theorem 31 and hence it nds an optimal partitionfor B = 2 The memory and the run time requirements are O(|V |) We note that twoleaf nodes can be matched only at their least common ancestor

Algorithm 2 Match An exact algorithm for B = 2Input T = (VE r) with the root r and F leaves each requested entry corresponds

to a leaf node with a zero weightOutput Π2 = R1 RK where K = dF2e1 for each leaf node ` do2 add ` to list(parent(`))3 compute a postorder4 k larr 15 for each non-leaf node n in postorder do6 if n 6= r and list(n) contains an odd number of vertices then7 `larr the node with least weight in list(n)8 move ` to list(parent(n)) add w(n) to the weight of `

I relay it to the father9 else if n = r and list(r) contains an odd number of vertices then10 `larr a node with the least weight in list(n)11 make ` a singleton12 for i = 1 to |list(n)| by 2 do13 Put the ith and i+ 1st vertices in list(n) into Rk increment k

I Match the ith and i+ 1st elements in the list

We have slightly modied the basic algorithm and show this modied version inAlgorithm 2 The modications keep the running time and memory complexities thesame and they are realized to enable the use ofMatch as a building block for a moregeneral heuristic In this algorithm parent(n) gives the parent of node n and list(n)is an array associated with node n The sum of the sizes of the list(middot) arrays is |V |The modication is that when there are an odd number of leaf nodes to partition

14 AMESTOY ET AL

at node n the leaf node with the least cumulative weight is passed to the fatherThe cumulative weight of a leaf node i when Match processes node n is dened asw(i n)minusw(n) the sum of the weights of the nodes in the unique path between nodesi and n including i but excluding n This is easy to compute each time a leaf nodeis relayed to the father of the current node the weight of the current node is addedto the cumulative weight of the relayed leaf node (as shown in line 8) By doing thisthe leaf nodes which traverse longer paths before being partitioned are chosen beforethose with smaller weights

33 A heuristic for a more general case We propose a heuristic algorithmwhen B = 2k for some k the BiseMatch algorithm is shown in Algorithm 3 It isbased on a bisection approach At each bisection a matching among the leaf nodesis found by a call to Match Then one of the leaf nodes of each pair is removedfrom the tree the remaining one becomes a representative of the two and is called aprincipal node Since the remaining node at each bisection step is a representativeof the two representative nodes of the previous bisection step after logB = k stepsBiseMatch obtains nodes that represent at most Bminus 1 other nodes At the end thenodes represented by a principal node are included in the same part as the principalnode

Algorithm 3 BiseMatch A heuristic algorithm for B = 2k

Input T = (VE r) with the root r and F leaves each requested entry correspondsto a leaf node with a zero weight

Output Π2k = R1 RK where |Ri| le B1 for level = 1 to k do2 M larrMatch(T )3 for each pair (i j) isinM remove the leaf node j from T and mark the leaf node

i as representative4 Clean up the tree T so that all leaf nodes correspond to some requested entry5 Each remaining leaf node i corresponds to a part Ri where the nodes that are

represented by i are put in Ri

As seen in the algorithm at each stage a matching among the remaining leaves isfound by using the Match algorithm When leaf nodes i and j are matched at theirleast common ancestor lca(i j) if w(i lca(i j)) ge w(j lca(i j)) then we designate ito be the representative of the two by adding (i j) to M otherwise we designate j tobe the representative by adding (j i) toM With this choice theMatch algorithm isguided to make decisions at nodes close to the leaves The running time ofBiseMatchis O(|V | logB) with an O(|V |) memory requirement

34 Models based on hypergraph partitioning We show how the problemof nding an optimal partition of the requested entries can be transformed into ahypergraph partitioning problem Our aim is to develop a general model that canaddress both the diagonal and the o-diagonal cases Here we give the model againfor diagonal entries and defer the discussion for the o-diagonal case until Section 4We rst give some relevant denitions

341 Hypergraphs and the hypergraph partitioning problem A hyper-graph H = (VN ) is dened as a set of vertices V and a set of nets N Every net isa subset of vertices Weights can be associated with vertices We use w(j) to denote

COMPUTING INVERSE ENTRIES 15

the weight of the vertex vj Costs can be associated with nets We use c(hi) to denotethe cost associated with the net hi

Π = V1 VK is a K-way vertex partition of H = (VN ) if each part isnonempty parts are pairwise disjoint and the union of the parts gives V In Π a netis said to connect a part if it has at least one vertex in that part The connectivity

set Λ(i) of a net hi is the set of parts connected by hi The connectivity λ(i) = |Λ(i)|of a net hi is the number of parts connected by hi In Π the weight of a part is thesum of the weights of vertices in that part

In the hypergraph partitioning problem the objective is to minimize

cutsize(Π) =sum

hiisinN

(λ(i)minus 1)times c(hi) (32)

This objective function is widely used in the VLSI community [19] and in the scienticcomputing community [5 10 27] it is referred to as the connectivity-1 cutsize metricThe partitioning constraint is to satisfy a balancing constraint on part weights

Wmax minusWavg

Wavg6 ε

Here Wmax is the largest part weight Wavg is the average part weight and ε is apredetermined imbalance ratio This problem is NP-hard [19]

342 The model We build a hypergraph whose partition according to thecutsize (32) corresponds to the total size of the factors loaded Clearly the requestedentries (which correspond to the leaf nodes) are going to be the vertices of the hyper-graph so that a vertex partition will dene a partition on the requested entries Themore intricate part of the model is the denition of the nets The nets correspondto edge disjoint paths in the tree starting from a given node (not necessarily a leaf)and going up to one of its ancestors (not necessarily the root) each net is associatedwith a cost corresponding to the total size of the nodes in the corresponding pathWe use path(h) to denote the path (or the set of nodes of the tree) corresponding toa net h A vertex i (corresponding to the leaf node i in the tree) will be in a net hif the solve for aminus1

ii passes through path(h) In other words if path(h) sub P (i) thenvi isin h Therefore if the vertices of a net hn are partitioned among λ(n) parts thenthe factors corresponding to the nodes in path(hn) will have to be loaded λ(n) timesAs we load a factor at least once the extra cost incurred by a partitioning is λ(n)minus 1for the net hn Given this observation it is easy to see the equivalence between thetotal size of the loaded factors and the cutsize of a partition plus the total weight ofthe tree

We now dene the hypergraph HD = (VDND) for the diagonal case Let T =(VE) be the tree corresponding to the modied elimination tree so that the requestedentries correspond to the leaf nodes Then the vertex set VD corresponds to the leafnodes in T As we are interested in putting at most B solves together we assign a unitweight to each vertex of HD The nets are best described informally There is a netin ND for each internal node of T The net hn corresponding to the node n containsthe set of vertices which correspond to the leaf nodes of subtree T (n) The cost ofhn is equal to the weight of node n ie c(hn) = w(n) This model can be simpliedas follows if a net hn contains the same vertices as the net hj where j = parent(n)that is if the subtree rooted at node n and the subtree rooted at its father j have thesame set of leaf nodes then the net hj can be removed and its cost can be added to

16 AMESTOY ET AL

h4

h1

h2

h2

h4

h5

h1

5

h5

V1

V2

4

5

3

2

1

2

1

2

1

5

Figure 33 The entries aminus111 a

minus122 and a

minus155 are requested For each requested entry a leaf

node is added to the elimination tree as shown on the left The hypergraph model for the requestedentries is build as shown on the right

the cost of the net hn This way the net hn represents the node n and its parent jThis process can be applied repeatedly so that the nets associated with the nodes ina chain except the rst one (the one closest to a leaf) and the last one (the one whichis closest to the root) can be removed and the cost of those removed nets can beadded to that of the rst one After this transformation we can also remove the netswith single vertices (these correspond to fathers of the leaf nodes with a single child)as these nets cannot contribute to the cutsize We note that the remaining nets willcorrespond to disjoint paths in the tree T

[Figure 3.3: The entries a⁻¹₁₁, a⁻¹₂₂ and a⁻¹₅₅ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.]

Figure 3.3 shows an example of such a hypergraph: the requested entries are a⁻¹₁₁, a⁻¹₂₂ and a⁻¹₅₅. Therefore V = {1, 2, 5} and N = {h₁, h₂, h₄, h₅} (net h₃ is removed according to the rule described above, and the cost of h₂ includes the weight of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h₁ = {1}, h₂ = {2}, h₄ = {1, 2}, h₅ = {1, 2, 5}. Given, for example, the partition V₁ = {2} and V₂ = {1, 5} shown on the left of the figure, the cutsize is

    cutsize(V₁, V₂) = c(h₁)×(λ(h₁)−1) + c(h₂)×(λ(h₂)−1) + c(h₄)×(λ(h₄)−1) + c(h₅)×(λ(h₅)−1)
                    = c(h₄)×(2−1) + c(h₅)×(2−1)
                    = c(h₄) + c(h₅).

Consider the first part, V₁ = {2}. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, V₂ = {1, 5}. For this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h₄) + c(h₅).
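As a check, the example of Figure 3.3 can be replayed with the cutsize sketch given earlier (the numeric costs below are illustrative placeholders, not values from the paper):

    nets = {"h1": {1}, "h2": {2}, "h4": {1, 2}, "h5": {1, 2, 5}}
    cost = {"h1": 2, "h2": 7, "h4": 4, "h5": 5}   # e.g., c(h2) = w(2) + w(3)
    part = {1: 1, 2: 0, 5: 1}                     # V1 = {2}, V2 = {1, 5}
    print(cutsize(nets, cost, part))              # cost["h4"] + cost["h5"] = 9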

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a⁻¹ᵢⱼ and a⁻¹ₖⱼ, that is, two entries in the same column of A⁻¹. As seen from the formula (1.1), reproduced below for convenience,

    y = L⁻¹eⱼ ,   a⁻¹ᵢⱼ = (U⁻¹y)ᵢ ,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U⁻¹y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to lca(i, k) only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases). Avoiding the unnecessary repeated solves would only affect the operation count.
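For illustration, here is a dense-vector sketch of this observation using SciPy's triangular solves; it ignores the sparsity exploitation discussed elsewhere in the paper and is not the MUMPS implementation:

    import numpy as np
    from scipy.sparse.linalg import spsolve_triangular

    def inverse_entries_in_column(L, U, j, rows):
        """Entries (A^{-1})_{ij} for several i in one column j of the inverse;
        a single forward solve with L (CSR sparse) serves all of them."""
        n = L.shape[0]
        e_j = np.zeros(n)
        e_j[j] = 1.0
        y = spsolve_triangular(L, e_j, lower=True)    # one y vector suffices
        z = spsolve_triangular(U, y, lower=False)     # one backward solve
        return {(i, j): z[i] for i in rows}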

If we were to exclude the case when more than one entry in the same column of A⁻¹ is requested, then we can immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem will remain NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say a⁻¹ᵢⱼ and a⁻¹ₖⱼ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices i and j where the entry a⁻¹ᵢⱼ is requested.

Figure 4.1 shows an example where the requested entries are a⁻¹₇₁, a⁻¹₆₂ and a⁻¹₉₅. The transformed elimination tree and the nets of the hypergraphs associated with the forward (hᶠʷᵈ) and backward (hᵇʷᵈ) solves are shown. Note that the nets h₃ᶠʷᵈ as well as h₃ᵇʷᵈ, h₄ᵇʷᵈ and h₈ᵇʷᵈ are removed.


[Figure 4.1: Example of the hypergraph model for the general case: a⁻¹₇₁, a⁻¹₆₂ and a⁻¹₉₅ are requested.]

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a⁻¹ᵢⱼ). A partition is given in the right-hand figure: Π = {{a⁻¹₆₂}, {a⁻¹₇₁, a⁻¹₉₅}}. The cutsize is c(h₅ᵇʷᵈ) + c(h₄ᶠʷᵈ) + c(h₅ᶠʷᵈ). Consider the computation of a⁻¹₆₂. We need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a⁻¹₇₁ and a⁻¹₉₅: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once if we were able to solve for all of them in a single pass), and in the backward solution the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 2.5, and, in the UFL index, the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We have ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We have experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We have assigned random weights in the range [1, 200] to tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we have created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.
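The random instance generation described above is straightforward to reproduce. A sketch follows (the exact random streams of the experiments are, of course, not reproduced, and the original experiments were done in Matlab):

    import random

    def requested_diagonal_entries(N, P, seed=None):
        """Pick round(P*N) distinct indices in 1..N as requested entries."""
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, N + 1), round(P * N)))

    def random_node_weights(nodes, seed=None):
        """Assign each tree node a random integer weight in [1, 200]."""
        rng = random.Random(seed)
        return {n: rng.randint(1, 200) for n in nodes}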

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for some other five triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound became worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.



Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average over 61 different matrices of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                     PoPart                    BiseMatch
   B     P      min     max     avg       min     max     avg
   4    0.05  1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
        0.10  1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
        0.20  1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
        0.40  1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
        0.60  1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
        0.80  1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
        1.00  1.0000  1.0491  1.0038    1.0000  1.0101  1.0002

   16   0.05  1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
        0.10  1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
        0.20  1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
        0.40  1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
        0.60  1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
        0.80  1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
        1.00  1.0003  1.1052  1.0133    1.0000  1.0096  1.0008

   64   0.05  1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
        0.10  1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
        0.20  1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
        0.40  1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
        0.60  1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
        0.80  1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
        1.00  1.0014  1.2087  1.0315    1.0000  1.0141  1.0024

   256  0.05  1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
        0.10  1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
        0.20  1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
        0.40  1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
        0.60  1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
        0.80  1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
        1.00  1.0043  1.1203  1.0495    1.0001  1.0677  1.0118

   Over all triplets  1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16, 64.

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                          Total size of the                  Running time of the
                 Lower    loaded factors (MBytes)            solution phase (s)
   matrix     B  bound     NoES     ES-Nat  ES-PoP     NoES    ES-Nat  ES-PoP
   CESR21532  16   5403    63313      7855    5422    11138      694     393
              64   1371    15828      2596    1389     3596      381     166
   CESR46799  16   2399   114051      3158    2417    39623      767     477
              64    620    28512      1176     635     8663      513     285
   CESR72358  16   1967   375737      6056    2008   108009     2637     718
              64    528    93934      4796     571    31740     2741     520
   CESR148286 16   8068  1595645     16595    8156   433967     7207    2685
              64   2092   398911     11004    2179   140493     7267    1998
   af23560    16  16720   114672     17864   16745    20806     2411    1976
              64   4215    28668      5245    4245     6685     1210     595
   ecl32      16  95478   618606    141533   95566   121847    27263   17606
              64  23943   154651     43429   24046    35255     9741    4829
   stokes64   16    721     8503      1026     726     1312      142      85
              64    185     2125       425     189      488      102      41
   boyd1      16   2028    75521      4232    2031   165512     3898    2149
              64    515    18880      1406     518    54927     2305    1212

In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process. These columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond respectively to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                          Total size of the            Running time of the
                 Lower    loaded factors (MBytes)      solution phase (s)
   matrix     B  bound    ES-PoP   ES-HP              ES-PoP   ES-HP
   CESR21532  16    563     1782     999                 566     169
              64    164      703     464                 283     136
   CESR46799  16    264      549     416                 322     251
              64     93      232     195                 355     253
   CESR72358  16    242     1124     868                1229     738
              64    116      794     598                 995     728
   CESR148286 16    905     3175    2693                4260    3217
              64    345     2080    1669                2810    2358
   af23560    16   1703     3579    2463                1092     664
              64    458     1219    1003                 471     343
   ecl32      16   9617    22514   12615                5077    2652
              64   2483     7309    4664                1992    1199
   stokes64   16     77      188     149                  29      23
              64     26       75      74                  19      17
   boyd1      16    205      481     258                 390     344
              64     55      198      93                 259     242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition on the hypergraph models using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think therefore that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed.


[Figure 6.1: Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} because the total size of loaded factors is 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).]

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well in the out-of-core case should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
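The two metrics can be stated side by side. The sketch below assumes a parent array and the unit-cost conventions of Figure 6.1 (loading cost w(n) = 1 and one operation per node per right-hand side), and makes the divergence easy to test on small trees:

    def path_to_root(parent, leaf):
        """Internal nodes on the path from a (zero-weight) leaf to the root."""
        nodes, n = set(), parent[leaf]
        while n is not None:
            nodes.add(n)
            n = parent[n]
        return nodes

    def volume_loaded(parent, blocks):       # out-of-core metric: factors loaded
        return sum(len(set.union(*(path_to_root(parent, l) for l in b)))
                   for b in blocks)          # blocks assumed nonempty

    def operation_count(parent, blocks):     # in-core metric: operations
        return sum(len(b) * len(set.union(*(path_to_root(parent, l) for l in b)))
                   for b in blocks)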

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped on a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A⁻¹, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)ᵀ, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach is later extended [23] for a set of entries of the inverse, rather than the whole set in the pattern of (L + U)ᵀ. The algorithm has been implemented in a multifrontal-like approach [8].
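For reference, here is a dense sketch of the recurrences of Takahashi et al., written for an LDU factorization with unit triangular L and U (an assumption made here for clarity; a real implementation, as in [13, 25], would restrict the computation to the pattern of (L + U)ᵀ and exploit sparsity):

    import numpy as np

    def takahashi_inverse(L, D, U):
        """Z = A^{-1} for A = L·diag(D)·U, computed from entry (n, n) backwards
        via Z = D^{-1}L^{-1} + (I - U)Z and Z = U^{-1}D^{-1} + Z(I - L)."""
        n = len(D)
        Z = np.zeros((n, n))
        for i in range(n - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                if i <= j:    # upper triangle (and diagonal)
                    diag = 1.0 / D[i] if i == j else 0.0
                    Z[i, j] = diag - U[i, i + 1:] @ Z[i + 1:, j]
                else:         # lower triangle
                    Z[i, j] = -(Z[i, j + 1:] @ L[j + 1:, j])
        return Z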

If all entries in the pattern of (L + U)ᵀ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least N linear systems and require considerably more memory. On the other hand, if a set of entries in the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree and then a partitioning of the nodes in sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries in the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have taken the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15. doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical methods for least squares problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission 1, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Comm. ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM J. Matrix Analysis and Applications (1993).

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial algorithms for integrated circuit layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, Load Balancing, and Matrix Ordering in a Parallel Hybrid Solver, Presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 12: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

12 AMESTOY ET AL

return a cost at most(1 +

ε(N)N

)c le

(1 +

1N

)4(p+X) = 4(p+X) +

4(p+X)N

But 4(p+X)N = 4(Nminus1)

(Z+3)N le49 lt 1 so that the approximation algorithm can be used to

determine whether I1 and hence I0 has a solution This is a contradiction unlessP=NP

31 A partitioning based on post-order Consider again the case wheresome entries in the diagonal of the inverse are requested As said before the problem ofminimizing the size of the factors to be loaded corresponds to the TreePartitioningproblem Consider the heuristic PoPart shown in Algorithm 1 for this problem

Algorithm 1 PoPart A post-order based partitioning

Input T = (VE r) with F leaves each requested entry corresponds to a leaf nodewith a zero weight

Input B the maximum allowable size of a partOutput ΠPO = R1 RK where K = dFBe a partition on the leaf nodes1 compute a post-order2 L larr sort the leaf nodes according to their rank in post-order3 Rk = L(i) (k minus 1)timesB + 1 le i le mink timesBF for k = 1 dFBe

As seen in Algorithm 1 the PoPart heuristic rst orders the leaf nodes accordingto their post-order It then puts the rst B leaves in the rst part the next B leavesin the second part and so on This simple partitioning approach results in dFBeparts for a tree with F leaf nodes and puts B nodes in each part except maybe inthe last one We have the following theorem which states that this simple heuristicobtains results that are at most twice the cost of an optimum solution

Theorem 34 Let ΠPO be the partition obtained by the algorithm PoPart and

c be the cost of an optimum solution then

cost(ΠPO) le 2times c

Proof Consider node i Because the leaves of the subtree rooted at i are sorted

consecutively in L the factors of node i will be loaded at mostlceil

nl(i)B

rceil+ 1 times

Therefore the overall cost is at most

cost(ΠPO) le 2timessum

i

w(i)times(lceil

nl(i)B

rceil+ 1)

le η + 2timessum

i

w(i)

le 2times c

We note that the factor two in the approximation guarantee would be rather loosein practical settings as

sumi w(i) would be much smaller than the lower bound η with

a practical B and a large number of nodes

COMPUTING INVERSE ENTRIES 13

32 A special case Two items per part In this section we propose algo-rithms to solve the partitioning problem exactly when B = 2 so that we are able touse a matching to dene the epochs These algorithms will serve as a building blockfor B = 2k in the next subsection

One of the algorithms is based on graph matching as partitioning into blocks ofsize 2 can be described as a matching Consider the complete graph G = (V V times V )of the leaves of a given tree and assume that the edge (i j) represents the decisionto put the leaf nodes i and j together in a part Given this denition of the verticesand edges we associate the value m(i j) = cost(i j) to the edge (i j) if i 6= j andm(i i) =

sumnisinV w(n) (or any suciently large number) Then a minimum weighted

matching in G denes a partitioning of the vertices in V with the minimum cost (asdened in (211)) Although this is a short and immediate formulation it has a highrun time complexity of O(|V |52) and O(|V |2) memory requirements Therefore wepropose yet another exact algorithm for B = 2

The proposed algorithmMatch proceeds from the parents of the leaf nodes to theroot At each internal node n those leaf nodes that are in the subtree rooted at n andwhich are not put in a part yet are matched two by two (arbitrarily) and each pairis put in a part if there is an odd number of leaf nodes remaining to be partitionedat node n one of them (arbitrarily) is passed to parent(n) Clearly this algorithmattains the lower bound shown in Theorem 31 and hence it nds an optimal partitionfor B = 2 The memory and the run time requirements are O(|V |) We note that twoleaf nodes can be matched only at their least common ancestor

Algorithm 2 Match An exact algorithm for B = 2Input T = (VE r) with the root r and F leaves each requested entry corresponds

to a leaf node with a zero weightOutput Π2 = R1 RK where K = dF2e1 for each leaf node ` do2 add ` to list(parent(`))3 compute a postorder4 k larr 15 for each non-leaf node n in postorder do6 if n 6= r and list(n) contains an odd number of vertices then7 `larr the node with least weight in list(n)8 move ` to list(parent(n)) add w(n) to the weight of `

I relay it to the father9 else if n = r and list(r) contains an odd number of vertices then10 `larr a node with the least weight in list(n)11 make ` a singleton12 for i = 1 to |list(n)| by 2 do13 Put the ith and i+ 1st vertices in list(n) into Rk increment k

I Match the ith and i+ 1st elements in the list

We have slightly modied the basic algorithm and show this modied version inAlgorithm 2 The modications keep the running time and memory complexities thesame and they are realized to enable the use ofMatch as a building block for a moregeneral heuristic In this algorithm parent(n) gives the parent of node n and list(n)is an array associated with node n The sum of the sizes of the list(middot) arrays is |V |The modication is that when there are an odd number of leaf nodes to partition

14 AMESTOY ET AL

at node n the leaf node with the least cumulative weight is passed to the fatherThe cumulative weight of a leaf node i when Match processes node n is dened asw(i n)minusw(n) the sum of the weights of the nodes in the unique path between nodesi and n including i but excluding n This is easy to compute each time a leaf nodeis relayed to the father of the current node the weight of the current node is addedto the cumulative weight of the relayed leaf node (as shown in line 8) By doing thisthe leaf nodes which traverse longer paths before being partitioned are chosen beforethose with smaller weights

33 A heuristic for a more general case We propose a heuristic algorithmwhen B = 2k for some k the BiseMatch algorithm is shown in Algorithm 3 It isbased on a bisection approach At each bisection a matching among the leaf nodesis found by a call to Match Then one of the leaf nodes of each pair is removedfrom the tree the remaining one becomes a representative of the two and is called aprincipal node Since the remaining node at each bisection step is a representativeof the two representative nodes of the previous bisection step after logB = k stepsBiseMatch obtains nodes that represent at most Bminus 1 other nodes At the end thenodes represented by a principal node are included in the same part as the principalnode

Algorithm 3 BiseMatch A heuristic algorithm for B = 2k

Input T = (VE r) with the root r and F leaves each requested entry correspondsto a leaf node with a zero weight

Output Π2k = R1 RK where |Ri| le B1 for level = 1 to k do2 M larrMatch(T )3 for each pair (i j) isinM remove the leaf node j from T and mark the leaf node

i as representative4 Clean up the tree T so that all leaf nodes correspond to some requested entry5 Each remaining leaf node i corresponds to a part Ri where the nodes that are

represented by i are put in Ri

As seen in the algorithm at each stage a matching among the remaining leaves isfound by using the Match algorithm When leaf nodes i and j are matched at theirleast common ancestor lca(i j) if w(i lca(i j)) ge w(j lca(i j)) then we designate ito be the representative of the two by adding (i j) to M otherwise we designate j tobe the representative by adding (j i) toM With this choice theMatch algorithm isguided to make decisions at nodes close to the leaves The running time ofBiseMatchis O(|V | logB) with an O(|V |) memory requirement

34 Models based on hypergraph partitioning We show how the problemof nding an optimal partition of the requested entries can be transformed into ahypergraph partitioning problem Our aim is to develop a general model that canaddress both the diagonal and the o-diagonal cases Here we give the model againfor diagonal entries and defer the discussion for the o-diagonal case until Section 4We rst give some relevant denitions

341 Hypergraphs and the hypergraph partitioning problem A hyper-graph H = (VN ) is dened as a set of vertices V and a set of nets N Every net isa subset of vertices Weights can be associated with vertices We use w(j) to denote

COMPUTING INVERSE ENTRIES 15

the weight of the vertex vj Costs can be associated with nets We use c(hi) to denotethe cost associated with the net hi

Π = V1 VK is a K-way vertex partition of H = (VN ) if each part isnonempty parts are pairwise disjoint and the union of the parts gives V In Π a netis said to connect a part if it has at least one vertex in that part The connectivity

set Λ(i) of a net hi is the set of parts connected by hi The connectivity λ(i) = |Λ(i)|of a net hi is the number of parts connected by hi In Π the weight of a part is thesum of the weights of vertices in that part

In the hypergraph partitioning problem the objective is to minimize

cutsize(Π) =sum

hiisinN

(λ(i)minus 1)times c(hi) (32)

This objective function is widely used in the VLSI community [19] and in the scienticcomputing community [5 10 27] it is referred to as the connectivity-1 cutsize metricThe partitioning constraint is to satisfy a balancing constraint on part weights

Wmax minusWavg

Wavg6 ε

Here Wmax is the largest part weight Wavg is the average part weight and ε is apredetermined imbalance ratio This problem is NP-hard [19]

342 The model We build a hypergraph whose partition according to thecutsize (32) corresponds to the total size of the factors loaded Clearly the requestedentries (which correspond to the leaf nodes) are going to be the vertices of the hyper-graph so that a vertex partition will dene a partition on the requested entries Themore intricate part of the model is the denition of the nets The nets correspondto edge disjoint paths in the tree starting from a given node (not necessarily a leaf)and going up to one of its ancestors (not necessarily the root) each net is associatedwith a cost corresponding to the total size of the nodes in the corresponding pathWe use path(h) to denote the path (or the set of nodes of the tree) corresponding toa net h A vertex i (corresponding to the leaf node i in the tree) will be in a net hif the solve for aminus1

ii passes through path(h) In other words if path(h) sub P (i) thenvi isin h Therefore if the vertices of a net hn are partitioned among λ(n) parts thenthe factors corresponding to the nodes in path(hn) will have to be loaded λ(n) timesAs we load a factor at least once the extra cost incurred by a partitioning is λ(n)minus 1for the net hn Given this observation it is easy to see the equivalence between thetotal size of the loaded factors and the cutsize of a partition plus the total weight ofthe tree

We now dene the hypergraph HD = (VDND) for the diagonal case Let T =(VE) be the tree corresponding to the modied elimination tree so that the requestedentries correspond to the leaf nodes Then the vertex set VD corresponds to the leafnodes in T As we are interested in putting at most B solves together we assign a unitweight to each vertex of HD The nets are best described informally There is a netin ND for each internal node of T The net hn corresponding to the node n containsthe set of vertices which correspond to the leaf nodes of subtree T (n) The cost ofhn is equal to the weight of node n ie c(hn) = w(n) This model can be simpliedas follows if a net hn contains the same vertices as the net hj where j = parent(n)that is if the subtree rooted at node n and the subtree rooted at its father j have thesame set of leaf nodes then the net hj can be removed and its cost can be added to

16 AMESTOY ET AL

h4

h1

h2

h2

h4

h5

h1

5

h5

V1

V2

4

5

3

2

1

2

1

2

1

5

Figure 33 The entries aminus111 a

minus122 and a

minus155 are requested For each requested entry a leaf

node is added to the elimination tree as shown on the left The hypergraph model for the requestedentries is build as shown on the right

the cost of the net hn This way the net hn represents the node n and its parent jThis process can be applied repeatedly so that the nets associated with the nodes ina chain except the rst one (the one closest to a leaf) and the last one (the one whichis closest to the root) can be removed and the cost of those removed nets can beadded to that of the rst one After this transformation we can also remove the netswith single vertices (these correspond to fathers of the leaf nodes with a single child)as these nets cannot contribute to the cutsize We note that the remaining nets willcorrespond to disjoint paths in the tree T

Figure 33 shows an example of such a hypergraph the requested entries are aminus111

aminus122 and a

minus155 Therefore V = 1 2 5 and N = h1 h2 h4 h5 (net h3 is removed

according to the rule described above and the cost of h2 includes the weight of nodes2 and 3) Each net contains the leaf vertices which belong to the subtree rooted atits associated node therefore h1 = 1 h2 = 2 h4 = 1 2 h5 = 1 2 5 Givenfor example the partition V1 = 2 and V2 = 1 5 shown on the left of the gurethe cutsize is

cutsize(V1 V2) = c(h1)times (λ(h1)minus 1) + c(h2)times (λ(h2)minus 1)+ c(h4)times (λ(h4)minus 1) + c(h5)times (λ(h5)minus 1)

= c(h4)times (2minus 1) + c(h5)times (2minus 1)= c(h4) + c(h5)

Consider the rst part V1 = 2 We have to load the factors associated with thenodes 2 3 4 5 Consider now the second part V2 = 1 5 For this part we haveto load the factors associated with the nodes 1 4 5 Hence the factors associatedwith the nodes 4 and 5 are loaded twice while the factors associated with all other(internal) nodes are loaded only once Since we have to access each node at least oncethe extra cost due to the given partition is w(4) +w(5) which is equal to the cutsizec(h4) + c(h5)

The model encodes the partitioning problem exactly in the sense that the cutsizeof a partition is equal to the overhead incurred due to that partition however it canlead to huge amount of data and computation Therefore we envisage the use of thismodel only for the cases where a small set of entries is requested But we believe thatone can devise special data structures to hold the resulting hypergraphs as well asspecialized algorithms to partition them

COMPUTING INVERSE ENTRIES 17

4 The o-diagonal case As discussed before the formulation for the diagonalcase carries over to the o-diagonal case as well But an added diculty in this caseis related to the actual implementation of the solver Assume that we have to solvefor aminus1

ij and aminus1kj that is two entries in the same column of Aminus1 As seen from the

formula (11) reproduced below for conveniencey = Lminus1ej

aminus1ij = (Uminus1y)i

only one y vector suces Similarly one can solve for the common nonzero entries inUminus1y only once for i and k This means that for the forward solves with L we canperform only one solve and for the backward solves we can solve for the variables inthe path from the root to lca(i k) only once Clearly this will reduce the operationcount In an out-of-core context we will load the same factors whether or not wekeep the same number of right-hand sides throughout the computation (both in theforward and backward substitution phases) Avoiding the unnecessary repeated solveswould only aect the operation count

If we were to exclude the case when more than one entry in the same column ofAminus1 is requested then we can immediately generalize our results and models devel-oped for the case of diagonal entries to the o-diagonal case Of course the partition-ing problem will remain NP-complete (it contains the instances with diagonal entriesas a particular case) The lower bound can also be generalized to yield a lower boundfor the case with arbitrary entries Indeed we only have to apply the same reasoningtwice one for the column indices of the requested entries and one for the row indicesof the requested entries We can extend the model to cover the case of multiple entriesby duplicating nodes although this does result in solving for multiple vectors insteadof a potential single solve We extend the model by noting that when indices arerepeated (say aminus1

ij and aminus1kj are requested) we can distinguish them by assigning each

occurrence to a dierent leaf node (we add two zero-weighted leaf nodes to the nodej of the elimination tree) Then adding these two lower bounds yields a lower boundfor the general case However in our experience we have found this lower bound tobe loose Note that applying this lower bound to the case where only diagonal entriesare requested yields the lower bound given in Theorem 31

The PoPart and the BiseMatch heuristics do not naturally generalize to theo-diagonal case because the generalized problem has a more sophisticated underlyingstructure However the hypergraph partitioning-based approach works for arbitraryentries of the inverse The idea is to model the forward and backward solves with twodierent hypergraphs and then to partition these two hypergraphs simultaneously Ithas been shown how to partition two hypergraphs simultaneously in [27] The essentialidea which is rened in [28] is to build a composite hypergraph by amalgamatingthe relevant vertices of the two hypergraphs while keeping the nets intact In ourcase the two hypergraphs would be the model for the diagonal entries associatedwith the column subscripts (forward phase) and the model for the diagonal entriesassociated with the row subscripts (backward phase) again assuming that the sameindices are distinguished by associating them with dierent leaf nodes We have thento amalgamate any two vertices i and j where the entry aminus1

ij is requested

Figure 41 shows an example where the requested entries are aminus171 a

minus162 and aminus1

95 The transformed elimination tree and the nets of the hypergraphs associated with theforward (hfwd) and backward (hbwd) solves are shown Note that the nets hfwd

3 aswell as hbwd

3 hbwd4 and hbwd

8 are removed The nodes of the tree which correspond to

18 AMESTOY ET AL

h2

h4

h5

h1

V1

V2

h10

h9

h7

h6

fwd

fwd

fwd

fwd

bwd

bwd

bwd

bwd

5 9

1 7

2 6h

4

h1

h2

4

5

3

2

1 8

7

9

6

h7h

6

10 h10

bwd

h9

bwd

bwdbwd

fwd

fwd

fwd

h5

fwd

2

1

6 7

5

9

h5

bwd

h5

bwd

Figure 41 Example of hypergraph model for the general case aminus171 a

minus162 and aminus1

95 are requested

the vertices of the hypergraph for the forward solves are shaded with light grey thosenodes which correspond to the vertices of the hypergraph for the backward solvesare shaded with dark grey The composite hypergraph is shown in the right-handgure The amalgamation of light and dark grey vertices is done according to therequested entries (vertex i and vertex j are amalgamated for a requested entry aminus1

ij )

A partition is given in the right-hand gure Π = aminus162 a

minus171 a

minus195 The cut size

is c(hbwd5 ) + c(hfwd

4 ) + c(hfwd5 ) Consider the computation of aminus1

62 We need to load theL factors associated with the nodes 2 3 4 and 5 and the U factors associated with5 4 3 and 6 Now consider the computation of aminus1

71 and aminus195 the L factors associated

with 1 4 and 5 and the U factors associated with 5 10 8 7 and 9 are loaded Inthe forward solution the L factors associated with 4 and 5 are loaded twice (insteadof once if we were able to solve for all of them in a single pass) and in the backwardsolution the U factor associated with 5 is loaded twice (instead of once) The cutsizeagain corresponds to these extra loads

We note that building such a hypergraph for the case where only diagonal entriesare requested yields the hypergraph of the previous section where each hyperedge isrepeated twice

5 Experiments We conduct three sets of experiments In the rst set wecompare the quality of the results obtained by the PoPart and BiseMatch heuris-tics using Matlab implementations of these algorithms For these experiments wecreated a large set of TreePartitioning problems each of which is associated withcomputing some diagonal entries in the inverse of a sparse matrix In the second setof experiments we use an implementation of PoPart in Fortran that we integratedinto the MUMPS solver [3] We use this to investigate the performance of PoPart onpractical cases using the out-of-core option of MUMPS In this set of experimentswe compute a set of entries from the diagonal of the inverse of matrices from twodierent data sets the rst set contains a few matrices coming from the astrophysicsapplication [7] briey described in Section 1 the second set contains some more ma-trices that are publicly available In the third set of experiments we carry out someexperiments with the hypergraph model for the o-diagonal case

51 Assessing the heuristics Our rst set of experiments compares the heu-ristics PoPart and the BiseMatch which were discussed in Sections 31 and 33We have implemented these heuristics in Matlab We use a set of matrices from

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound became worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of


Table 5.1. The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent, over 61 different matrices, the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                 PoPart                      BiseMatch
  B     P     min     max     avg       min     max     avg
  4   0.05  1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
      0.10  1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
      0.20  1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
      0.40  1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
      0.60  1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
      0.80  1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
      1.00  1.0000  1.0491  1.0038    1.0000  1.0101  1.0002
 16   0.05  1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
      0.10  1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
      0.20  1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
      0.40  1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
      0.60  1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
      0.80  1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
      1.00  1.0003  1.1052  1.0133    1.0000  1.0096  1.0008
 64   0.05  1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
      0.10  1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
      0.20  1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
      0.40  1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
      0.60  1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
      0.80  1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
      1.00  1.0014  1.2087  1.0315    1.0000  1.0141  1.0024
256   0.05  1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
      0.10  1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
      0.20  1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
      0.40  1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
      0.60  1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
      0.80  1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
      1.00  1.0043  1.1203  1.0495    1.0001  1.0677  1.0118
Over all triplets   1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24]. In this section we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection, with very different nonzero patterns yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16 and B = 64.


Table 5.2. The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                        Total size of the loaded        Running time of the
               Lower    factors (MBytes)                solution phase (s)
matrix      B  bound     NoES     ES-Nat  ES-PoP        NoES    ES-Nat  ES-PoP
CESR21532   16   5403    63313     7855     5422       11138      694     393
            64   1371    15828     2596     1389        3596      381     166
CESR46799   16   2399   114051     3158     2417       39623      767     477
            64    620    28512     1176      635        8663      513     285
CESR72358   16   1967   375737     6056     2008      108009     2637     718
            64    528    93934     4796      571       31740     2741     520
CESR148286  16   8068  1595645    16595     8156      433967     7207    2685
            64   2092   398911    11004     2179      140493     7267    1998
af23560     16  16720   114672    17864    16745       20806     2411    1976
            64   4215    28668     5245     4245        6685     1210     595
ecl32       16  95478   618606   141533    95566      121847    27263   17606
            64  23943   154651    43429    24046       35255     9741    4829
stokes64    16    721     8503     1026      726        1312      142      85
            64    185     2125      425      189         488      102      41
boyd1       16   2028    75521     4232     2031      165512     3898    2149
            64    515    18880     1406      518       54927     2305    1212

In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational schemes where the sparsity of the right-hand side vectors is exploited to speed up the solution process, respectively with the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and with the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a postorder (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358, the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3. The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                        Total size of the loaded    Running time of the
               Lower    factors (MBytes)            solution phase (s)
matrix      B  bound    ES-PoP   ES-HP              ES-PoP   ES-HP
CESR21532   16    563     1782     999                 566     169
            64    164      703     464                 283     136
CESR46799   16    264      549     416                 322     251
            64     93      232     195                 355     253
CESR72358   16    242     1124     868                1229     738
            64    116      794     598                 995     728
CESR148286  16    905     3175    2693                4260    3217
            64    345     2080    1669                2810    2358
af23560     16   1703     3579    2463                1092     664
            64    458     1219    1003                 471     343
ecl32       16   9617    22514   12615                5077    2652
            64   2483     7309    4664                1992    1199
stokes64    16     77      188     149                  29      23
            64     26       75      74                  19      17
boyd1       16    205      481     258                 390     344
            64     55      198      93                 259     242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed.


[Figure 6.1 about here] Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
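The following sketch makes this contrast concrete by evaluating both metrics under the model just described (the small tree is our own example, chosen to exhibit the reversal, not the exact tree of Figure 6.1):

```python
# Loading volume of a block: total weight of the union of the paths of its
# entries (loaded once per block). In-core flops: that same union weight,
# charged once per entry in the block.
def path_to_root(node, parent):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def costs(partition, parent, weight):
    out_of_core = in_core = 0
    for block in partition:
        union = set()
        for leaf in block:
            union.update(path_to_root(leaf, parent))
        w = sum(weight[n] for n in union)
        out_of_core += w             # factors of the union loaded once
        in_core += len(block) * w    # every entry traverses the union
    return out_of_core, in_core

# chain r-c2-c3-c4; entry 1 hangs below c2, entry 2 below c4, entry 3 below r
parent = {1: 'c2', 2: 'c4', 3: 'r', 'c4': 'c3', 'c3': 'c2', 'c2': 'r', 'r': None}
weight = {1: 0, 2: 0, 3: 0, 'c2': 1, 'c3': 1, 'c4': 1, 'r': 1}
for part in ([{1, 2}, {3}], [{1, 3}, {2}]):
    print(part, costs(part, parent, weight))
# {{1,2},{3}}: loads 5, flops 9; {{1,3},{2}}: loads 6, flops 8 -- reversed rankings
```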

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
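A minimal sketch of the interleaving idea (our illustration of the principle, not the exact scheme of [24]): given blocks that are each local to one subtree, rebuild blocks of the same size by drawing entries from all of them in round-robin fashion, so that every processor owns work in every epoch.

```python
from itertools import chain, zip_longest

def interleave(blocks, B):
    # round-robin over the original blocks, then re-chop into blocks of size B
    rr = [x for x in chain(*zip_longest(*blocks)) if x is not None]
    return [rr[i:i + B] for i in range(0, len(rr), B)]

blocks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # one block per subtree
print(interleave(blocks, 4))  # [[1, 5, 9, 2], [6, 10, 3, 7], [11, 4, 8, 12]]
```

Note how each new block mixes entries from several subtrees, which is precisely what erodes the locality that the partitioning had built, as discussed above.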

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L + U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L + U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L + U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least N linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order to compute all the entries that are needed to obtain those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15. doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.


3.2. A special case: Two items per part. In this section we propose algorithms to solve the partitioning problem exactly when B = 2, so that we are able to use a matching to define the epochs. These algorithms will serve as a building block for B = 2^k in the next subsection.

One of the algorithms is based on graph matching, as partitioning into blocks of size 2 can be described as a matching. Consider the complete graph G = (V, V × V) of the leaves of a given tree, and assume that the edge (i, j) represents the decision to put the leaf nodes i and j together in a part. Given this definition of the vertices and edges, we associate the value m(i, j) = cost(i, j) with the edge (i, j) if i ≠ j, and m(i, i) = Σ_{n∈V} w(n) (or any sufficiently large number). Then a minimum weighted matching in G defines a partitioning of the vertices in V with the minimum cost (as defined in (2.11)). Although this is a short and immediate formulation, it has a high run time complexity of O(|V|^{5/2}) and O(|V|^2) memory requirements. Therefore we propose yet another exact algorithm for B = 2.

The proposed algorithm Match proceeds from the parents of the leaf nodes to the root. At each internal node n, those leaf nodes that are in the subtree rooted at n and which are not yet put in a part are matched two by two (arbitrarily), and each pair is put in a part; if there is an odd number of leaf nodes remaining to be partitioned at node n, one of them (arbitrarily) is passed to parent(n). Clearly this algorithm attains the lower bound shown in Theorem 3.1, and hence it finds an optimal partition for B = 2. The memory and run time requirements are O(|V|). We note that two leaf nodes can be matched only at their least common ancestor.

Algorithm 2: Match. An exact algorithm for B = 2.
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_2 = {R_1, ..., R_K}, where K = ⌈F/2⌉.
1:  for each leaf node ℓ do
2:      add ℓ to list(parent(ℓ))
3:  compute a postorder
4:  k ← 1
5:  for each non-leaf node n in postorder do
6:      if n ≠ r and list(n) contains an odd number of vertices then
7:          ℓ ← the node with least weight in list(n)
8:          move ℓ to list(parent(n)); add w(n) to the weight of ℓ        ▷ relay it to the father
9:      else if n = r and list(r) contains an odd number of vertices then
10:         ℓ ← a node with the least weight in list(n)
11:         make ℓ a singleton
12:     for i = 1 to |list(n)| by 2 do
13:         put the i-th and (i+1)-st vertices in list(n) into R_k; increment k   ▷ match the i-th and (i+1)-st elements in the list

We have slightly modified the basic algorithm and show this modified version in Algorithm 2. The modifications keep the running time and memory complexities the same, and they are made to enable the use of Match as a building block for a more general heuristic. In this algorithm, parent(n) gives the parent of node n, and list(n) is an array associated with node n. The sum of the sizes of the list(·) arrays is |V|. The modification is that, when there are an odd number of leaf nodes to partition


at node n, the leaf node with the least cumulative weight is passed to the father. The cumulative weight of a leaf node i when Match processes node n is defined as w(i, n) − w(n): the sum of the weights of the nodes on the unique path between nodes i and n, including i but excluding n. This is easy to compute: each time a leaf node is relayed to the father of the current node, the weight of the current node is added to the cumulative weight of the relayed leaf node (as shown in line 8). By doing this, the leaf nodes which traverse longer paths before being partitioned are chosen for matching before those with smaller cumulative weights.
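For concreteness, here is a runnable sketch of Algorithm 2 (data layout is ours: parent maps each node to its father, None for the root; w gives node weights; the postorder lists the non-leaf nodes, children first):

```python
from collections import defaultdict

def match_leaves(parent, w, leaves, postorder):
    lists = defaultdict(list)             # node -> (leaf, cumulative weight)
    for leaf in leaves:
        lists[parent[leaf]].append((leaf, w[leaf]))
    parts = []
    for n in postorder:
        pending = lists.pop(n, [])
        if len(pending) % 2 == 1:
            pending.sort(key=lambda lw: lw[1])
            leaf, cw = pending.pop(0)     # least cumulative weight
            if parent[n] is not None:
                lists[parent[n]].append((leaf, cw + w[n]))  # relay to father
            else:
                parts.append([leaf])      # odd leaf at the root: a singleton
        parts.extend([a[0], b[0]] for a, b in zip(pending[0::2], pending[1::2]))
    return parts

# chain r-a-b with leaves 1 and 2 below b, and leaf 3 below r
parent = {1: 'b', 2: 'b', 3: 'r', 'b': 'a', 'a': 'r', 'r': None}
w = {1: 0, 2: 0, 3: 0, 'b': 1, 'a': 1, 'r': 1}
print(match_leaves(parent, w, [1, 2, 3], ['b', 'a', 'r']))  # [[1, 2], [3]]
```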

3.3. A heuristic for a more general case. We propose a heuristic algorithm for the case B = 2^k for some k: the BiseMatch algorithm, shown in Algorithm 3. It is based on a bisection approach. At each bisection, a matching among the leaf nodes is found by a call to Match. Then one of the leaf nodes of each pair is removed from the tree; the remaining one becomes a representative of the two and is called a principal node. Since the remaining node at each bisection step is a representative of the two representative nodes of the previous bisection step, after log₂ B = k steps BiseMatch obtains nodes that represent at most B − 1 other nodes. At the end, the nodes represented by a principal node are included in the same part as the principal node.

Algorithm 3: BiseMatch. A heuristic algorithm for B = 2^k.
Input: T = (V, E, r) with the root r and F leaves; each requested entry corresponds to a leaf node with a zero weight.
Output: Π_{2^k} = {R_1, ..., R_K}, where |R_i| ≤ B.
1:  for level = 1 to k do
2:      M ← Match(T)
3:      for each pair (i, j) ∈ M: remove the leaf node j from T and mark the leaf node i as representative
4:  clean up the tree T so that all leaf nodes correspond to some requested entry
5:  each remaining leaf node i corresponds to a part R_i, where the nodes that are represented by i are put in R_i

As seen in the algorithm, at each stage a matching among the remaining leaves is found by using the Match algorithm. When leaf nodes i and j are matched at their least common ancestor lca(i, j), if w(i, lca(i, j)) ≥ w(j, lca(i, j)), then we designate i to be the representative of the two by adding (i, j) to M; otherwise we designate j to be the representative by adding (j, i) to M. With this choice, the Match algorithm is guided to make decisions at nodes close to the leaves. The running time of BiseMatch is O(|V| log B), with an O(|V|) memory requirement.
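A sketch of the bisection loop, reusing match_leaves from the previous sketch (for brevity we keep the first element of each matched pair as the representative, instead of the lca-based weight comparison described above):

```python
def bisematch(parent, w, leaves, postorder, k):
    represented = {leaf: [leaf] for leaf in leaves}
    for _ in range(k):                                 # log2(B) levels
        pairs = [p for p in match_leaves(parent, w, leaves, postorder)
                 if len(p) == 2]
        for i, j in pairs:
            represented[i].extend(represented.pop(j))  # i now represents j
            leaves = [l for l in leaves if l != j]
    return list(represented.values())                  # parts of size <= 2**k

# with the toy tree of the previous sketch and B = 4 (k = 2):
print(bisematch(parent, w, [1, 2, 3], ['b', 'a', 'r'], 2))  # [[3, 1, 2]]
```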

3.4. Models based on hypergraph partitioning. We show how the problem of finding an optimal partition of the requested entries can be transformed into a hypergraph partitioning problem. Our aim is to develop a general model that can address both the diagonal and the off-diagonal cases. Here we give the model again for diagonal entries and defer the discussion of the off-diagonal case until Section 4. We first give some relevant definitions.

3.4.1. Hypergraphs and the hypergraph partitioning problem. A hypergraph H = (V, N) is defined as a set of vertices V and a set of nets N. Every net is a subset of vertices. Weights can be associated with vertices; we use w(j) to denote the weight of the vertex v_j. Costs can be associated with nets; we use c(h_i) to denote the cost associated with the net h_i.

Π = {V_1, ..., V_K} is a K-way vertex partition of H = (V, N) if each part is nonempty, parts are pairwise disjoint, and the union of the parts gives V. In Π, a net is said to connect a part if it has at least one vertex in that part. The connectivity set Λ(i) of a net h_i is the set of parts connected by h_i. The connectivity λ(i) = |Λ(i)| of a net h_i is the number of parts connected by h_i. In Π, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

    cutsize(Π) = Σ_{h_i ∈ N} (λ(i) − 1) × c(h_i).    (3.2)

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

    (W_max − W_avg) / W_avg ≤ ε.

Here W_max is the largest part weight, W_avg is the average part weight, and ε is a predetermined imbalance ratio. This problem is NP-hard [19].
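The objective (3.2) is straightforward to evaluate; the following helper (ours) is used in the worked example later in this section:

```python
# connectivity-1 cutsize: each net pays its cost once per extra part it connects
def cutsize(nets, cost, part_of):
    total = 0
    for name, pins in nets.items():
        lam = len({part_of[v] for v in pins})  # connectivity lambda of the net
        total += (lam - 1) * cost[name]
    return total
```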

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly, the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net h. A vertex i (corresponding to the leaf node i in the tree) will be in a net h if the solve for a^{-1}_{ii} passes through path(h). In other words, if path(h) ⊂ P(i), then v_i ∈ h. Therefore, if the vertices of a net h_n are partitioned among λ(n) parts, then the factors corresponding to the nodes in path(h_n) will have to be loaded λ(n) times. As we load a factor at least once, the extra cost incurred by a partitioning is λ(n) − 1 for the net h_n. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph H_D = (V_D, N_D) for the diagonal case. Let T = (V, E) be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set V_D corresponds to the leaf nodes in T. As we are interested in putting at most B solves together, we assign a unit weight to each vertex of H_D. The nets are best described informally. There is a net in N_D for each internal node of T. The net h_n corresponding to the node n contains the set of vertices which correspond to the leaf nodes of the subtree T(n). The cost of h_n is equal to the weight of node n, i.e., c(h_n) = w(n). This model can be simplified as follows: if a net h_n contains the same vertices as the net h_j, where j = parent(n), that is, if the subtree rooted at node n and the subtree rooted at its father j have the same set of leaf nodes, then the net h_j can be removed and its cost can be added to the cost of the net h_n. This way the net h_n represents the node n and its parent j. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to fathers of the leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree T.

[Figure 3.3 about here] Figure 3.3. The entries a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55} are requested. For each requested entry, a leaf node is added to the elimination tree as shown on the left. The hypergraph model for the requested entries is built as shown on the right.
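A compact sketch of this construction (names ours): we generate the raw net of every internal node, then merge nets with identical pin sets, which accumulates the costs exactly as the chain rule above prescribes, and finally drop single-pin nets.

```python
from collections import defaultdict

def build_hypergraph(children, w, leaves):
    raw = {}                                  # internal node -> leaf pins
    def collect(n):
        if not children.get(n):               # a leaf of the modified tree
            return frozenset([n] if n in leaves else [])
        pins = frozenset().union(*map(collect, children[n]))
        raw[n] = pins
        return pins
    root = (set(children) - {c for cs in children.values() for c in cs}).pop()
    collect(root)
    merged = defaultdict(int)
    for n, pins in raw.items():
        merged[pins] += w[n]                  # nodes in a chain share a pin set
    return {pins: c for pins, c in merged.items() if len(pins) > 1}

# the tree of Figure 3.3 with unit node weights (our assumption); leaf 'ei'
# is the zero-weight leaf attached to node i for the requested entry a^-1_ii
children = {1: ['e1'], 2: ['e2'], 3: [2], 4: [1, 3], 5: [4, 'e5']}
w = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
print(build_hypergraph(children, w, {'e1', 'e2', 'e5'}))
# two multi-pin nets remain: {e1,e2} with cost 1 (h4), {e1,e2,e5} with cost 1 (h5)
```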

Figure 3.3 shows an example of such a hypergraph: the requested entries are a^{-1}_{11}, a^{-1}_{22}, and a^{-1}_{55}. Therefore V = {1, 2, 5} and N = {h_1, h_2, h_4, h_5} (net h_3 is removed according to the rule described above, and the cost of h_2 includes the weights of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore h_1 = {1}, h_2 = {2}, h_4 = {1, 2}, h_5 = {1, 2, 5}. Given, for example, the partition V_1 = {2} and V_2 = {1, 5} shown on the left of the figure, the cutsize is

    cutsize(V_1, V_2) = c(h_1) × (λ(h_1) − 1) + c(h_2) × (λ(h_2) − 1) + c(h_4) × (λ(h_4) − 1) + c(h_5) × (λ(h_5) − 1)
                      = c(h_4) × (2 − 1) + c(h_5) × (2 − 1)
                      = c(h_4) + c(h_5).

Consider the first part, V_1 = {2}. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, V_2 = {1, 5}. For this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is w(4) + w(5), which is equal to the cutsize c(h_4) + c(h_5).
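With the cutsize helper sketched in Section 3.4.1, this bookkeeping can be verified numerically (unit node weights assumed, so that c(h_2) = w(2) + w(3) = 2):

```python
nets = {'h1': {1}, 'h2': {2}, 'h4': {1, 2}, 'h5': {1, 2, 5}}
cost = {'h1': 1, 'h2': 2, 'h4': 1, 'h5': 1}
part_of = {2: 'V1', 1: 'V2', 5: 'V2'}          # V1 = {2}, V2 = {1, 5}
print(cutsize(nets, cost, part_of))             # c(h4) + c(h5) = 2
```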

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.

4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for a^{-1}_{ij} and a^{-1}_{kj}, that is, two entries in the same column of A^{-1}. As seen from the formula (1.1), reproduced below for convenience,

    y = L^{-1} e_j,
    a^{-1}_{ij} = (U^{-1} y)_i,

only one y vector suffices. Similarly, one can solve for the common nonzero entries in U^{-1} y only once for i and k. This means that for the forward solves with L we can perform only one solve, and for the backward solves we can solve for the variables on the path from the root to lca(i, k) only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases); avoiding the unnecessary repeated solves would only affect the operation count.
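The following toy sketch illustrates the shared forward solve (a dense-stored example with a hand-rolled LU without pivoting, ours; it is not the MUMPS machinery):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve_triangular

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
L, U = np.eye(3), A.copy()                  # Doolittle LU, no pivoting needed
for c in range(3):
    for r in range(c + 1, 3):
        L[r, c] = U[r, c] / U[c, c]
        U[r, :] -= L[r, c] * U[c, :]
j, i, k = 0, 1, 2                           # entries (i, j) and (k, j), same column
e_j = np.zeros(3); e_j[j] = 1.0
y = spsolve_triangular(csr_matrix(L), e_j, lower=True)   # one shared forward solve
z = spsolve_triangular(csr_matrix(U), y, lower=False)    # one backward solve
print(z[i], z[k])                                        # a^-1_ij and a^-1_kj
print(np.linalg.inv(A)[i, j], np.linalg.inv(A)[k, j])    # agreement check
```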

If we were to exclude the case when more than one entry in the same column of A^{-1} is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries: indeed, we only have to apply the same reasoning twice, once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say a^{-1}_{ij} and a^{-1}_{kj} are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node j of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs, while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices i and j where the entry a^{-1}_{ij} is requested.
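A small sketch of the amalgamation step (names ours; we assume, as in the text, that repeated row or column indices have already been split into distinct leaves):

```python
def amalgamate(requested, fwd_nets, bwd_nets):
    vertex = {}                      # original leaf -> composite vertex id
    for vid, (i, j) in enumerate(requested):
        vertex[('col', j)] = vertex[('row', i)] = vid   # fuse the two leaves
    nets = {}
    for name, pins in list(fwd_nets.items()) + list(bwd_nets.items()):
        nets[name] = {vertex[p] for p in pins}          # nets stay intact
    return nets

# entries a^-1_71, a^-1_62, a^-1_95 as in Figure 4.1 (only a few nets shown)
requested = [(7, 1), (6, 2), (9, 5)]
fwd = {'h4_fwd': {('col', 1), ('col', 2)},
       'h5_fwd': {('col', 1), ('col', 2), ('col', 5)}}
bwd = {'h5_bwd': {('row', 6), ('row', 7), ('row', 9)}}
print(amalgamate(requested, fwd, bwd))
# {'h4_fwd': {0, 1}, 'h5_fwd': {0, 1, 2}, 'h5_bwd': {0, 1, 2}}
```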

Figure 4.1 shows an example where the requested entries are a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95}. The transformed elimination tree and the nets of the hypergraphs associated with the forward (h^{fwd}) and backward (h^{bwd}) solves are shown. Note that the nets h^{fwd}_3 as well as h^{bwd}_3, h^{bwd}_4, and h^{bwd}_8 are removed.

[Figure 4.1 about here] Figure 4.1. Example of the hypergraph model for the general case: a^{-1}_{71}, a^{-1}_{62}, and a^{-1}_{95} are requested.

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex i and vertex j are amalgamated for a requested entry a^{-1}_{ij}). A partition is given in the right-hand figure: Π = {{a^{-1}_{62}}, {a^{-1}_{71}, a^{-1}_{95}}}. The cutsize is c(h^{bwd}_5) + c(h^{fwd}_4) + c(h^{fwd}_5). Consider the computation of a^{-1}_{62}. We need to load the L factors associated with the nodes 2, 3, 4, and 5, and the U factors associated with 5, 4, 3, and 6. Now consider the computation of a^{-1}_{71} and a^{-1}_{95}: the L factors associated with 1, 4, and 5, and the U factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the L factors associated with 4 and 5 are loaded twice (instead of once, if we were able to solve for all of them in a single pass), and in the backward solution, the U factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5 Experiments We conduct three sets of experiments In the rst set wecompare the quality of the results obtained by the PoPart and BiseMatch heuris-tics using Matlab implementations of these algorithms For these experiments wecreated a large set of TreePartitioning problems each of which is associated withcomputing some diagonal entries in the inverse of a sparse matrix In the second setof experiments we use an implementation of PoPart in Fortran that we integratedinto the MUMPS solver [3] We use this to investigate the performance of PoPart onpractical cases using the out-of-core option of MUMPS In this set of experimentswe compute a set of entries from the diagonal of the inverse of matrices from twodierent data sets the rst set contains a few matrices coming from the astrophysicsapplication [7] briey described in Section 1 the second set contains some more ma-trices that are publicly available In the third set of experiments we carry out someexperiments with the hypergraph model for the o-diagonal case

51 Assessing the heuristics Our rst set of experiments compares the heu-ristics PoPart and the BiseMatch which were discussed in Sections 31 and 33We have implemented these heuristics in Matlab We use a set of matrices from

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 51 by giving results with B isin 4 16 64 256(the last row relates to all B values mentioned in the previous paragraph) Oursubsequent discussion relates to the complete set of experiments and not just thoseshown in Table 51 In order to create this table we computed the lower bound forall tree partitioning instances and computed the ratio of the costs found by PoPartand BiseMatch to the lower bound Next for a triplet of a matrix B and P wetook the average of the ratios of the 10 random instances and stored that averageresult for the triplet Then we took the minimum the maximum and the average ofthe 61 dierent triplets with the same B and P As seen in this table both heuristicsobtain results that are close to the lower bound PoParts average result is about104 times the lower bound and BiseMatchs average result is about 101 timesthe lower bound The PoPart heuristic attains the exact lower bound in only onetriplet while BiseMatch attains the lower bound for all instances with B = 2 (recallthat it is based on the exact algorithm Match) and for some other ve triplets Themaximum deviation from the lower bound is about 10 with BiseMatch whereas itis 30 with PoPart Given that in most cases the algorithms perform close to theaverage gures we conclude that both are ecient enough to be useful in the contextof the out-of-core solver

The BiseMatch heuristic almost always obtains better results than PoPart inonly 7 out of 56 times 61 triplets did PoPart obtain better results than BiseMatchFor all P the performance of BiseMatch with respect to the lower bound becameworse as B increases Although there are uctuations in the performance of PoPartfor small values of P eg for 005 and 010 for larger values of P the performancealso becomes worse with larger values of B We suspect that the lower bound mightbe loose for large values of B For all B the performance of BiseMatch with respectto the lower bound improves when P increases A similar trend is observable forPoPart except for a small deviation for B = 256 Recall that the trees we usehere come from a nested dissection ordering Such ordering schemes are known toproduce wide and balanced trees This fact combined with the fact that when a highpercentage of the diagonal entries are requested the trees will not have their structurechanged much by removal and addition of leaf nodes may explain why the heuristicsperform better at larger values of P for a given B

52 Practical tests with a direct solver We have implemented the heuristicPoPart in Fortran and integrated it into the MUMPS solver [3] The implementation of

20 AMESTOY ET AL

Table 51The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower

bound The numbers represent the average over 61 dierent matrices of the ratios of the results ofthe heuristics to the lower bound discussed in the text The column B corresponds to the maximumallowable block size the column P corresponds to the requested percentage of diagonal entries

B P PoPart BiseMatchmin max avg min max avg

4 005 10011 11751 10278 10000 10139 10013010 10005 11494 10192 10000 10073 10005020 10003 10945 10119 10000 10052 10003040 10001 10585 10072 10000 10031 10001060 10001 10449 10053 10000 10019 10001080 10000 10367 10043 10000 10029 10001100 10000 10491 10038 10000 10101 10002

16 005 10050 11615 10592 10000 10482 10113010 10026 11780 10485 10000 10553 10075020 10016 12748 10374 10000 10334 10035040 10007 11898 10246 10000 10230 10016060 10005 11431 10186 10000 10166 10010080 10004 11136 10154 10000 10190 10011100 10003 11052 10133 10000 10096 10008

64 005 10132 11581 10800 10000 10797 10275010 10101 11691 10715 10002 10584 10196020 10054 11389 10599 10001 10506 10125040 10030 11843 10497 10000 10437 10079060 10020 12362 10407 10000 11022 10072080 10015 13018 10383 10000 10344 10044100 10014 12087 10315 10000 10141 10024

256 005 10050 11280 10651 10000 10867 10342010 10127 11533 10721 10003 10911 10314020 10133 11753 10730 10002 10722 10257040 10093 11598 10668 10003 10540 10187060 10068 11621 10602 10002 10572 10174080 10068 11314 10563 10001 10515 10120100 10043 11203 10495 10001 10677 10118

Over all triplets 10000 13018 10359 10000 11110 10079

the computation of a set of inverse entries exploiting sparsity within MUMPS is describedin [24] In this section we give results obtained by using MUMPS with the out-of-coreoption and a nested dissection ordering provided by MeTiS [18] All experiments havebeen performed with direct IO access to les so that we can guarantee eective diskaccess independently of both the size of the factors and the size of the main memoryThe benets resulting from the use of direct IO mechanisms during the solution phaseare discussed in [2] All results are obtained on a dual-core Intel Core2 Duo P8800processor having a 280 GHz clock speed We have used only one of the cores andwe did not use threaded BLAS We use four matrices from the real life astrophysicsapplication [7] briey described in Section 1 The names of these matrices start withCESR and continue with the size of the matrix We use an additional set of fourmatrices (af23560 ecl32 stokes64 boyd1) from the UFL sparse matrix collectionwith very dierent nonzero patterns yielding a set of elimination trees with varyingstructural properties (such as height and width of the tree and variations in nodedegrees)

Table 52 shows the total size of the factors loaded and the execution time of thesolution phase of MUMPS with dierent settings and partitions All diagonal entriesof the inverse of the given matrices are computed with B = 16 64 In this table

COMPUTING INVERSE ENTRIES 21

Table 52The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes All diagonal entries are requested The out-of-core executions use direct IO accessto the les NoES refers to the traditional solution without exploiting sparsity The columns ES-Natand ES-PoP refer to exploiting the sparsity of the right-hand side vectors under respectively anatural partitioning and the PoPart heuristic

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound NoES ES-Nat ES-PoP NoES ES-Nat ES-PoPCESR21532 16 5403 63313 7855 5422 11138 694 393

64 1371 15828 2596 1389 3596 381 166CESR46799 16 2399 114051 3158 2417 39623 767 477

64 620 28512 1176 635 8663 513 285CESR72358 16 1967 375737 6056 2008 108009 2637 718

64 528 93934 4796 571 31740 2741 520CESR148286 16 8068 1595645 16595 8156 433967 7207 2685

64 2092 398911 11004 2179 140493 7267 1998af2356 16 16720 114672 17864 16745 20806 2411 1976

64 4215 28668 5245 4245 6685 1210 595ecl32 16 95478 618606 141533 95566 121847 27263 17606

64 23943 154651 43429 24046 35255 9741 4829stokes64 16 721 8503 1026 726 1312 142 85

64 185 2125 425 189 488 102 41boyd1 16 2028 75521 4232 2031 165512 3898 2149

64 515 18880 1406 518 54927 2305 1212

the values in column Lower bound are computed according to Theorem 31 Thecolumn NoES corresponds to the computational scheme where the sparsity of theright-hand side vectors involved in computing the diagonal entries is not exploitedThe columns ES-Nat and ES-PoP correspond to the computational scheme wherethe sparsity of the right-hand side vectors are exploited to speed up the solutionprocess These columns correspond respectively to the natural partitioning (theindices are partitioned in the natural order into blocks of size B) and to the PoPartheuristic As seen in column ES-Nat most of the gain in the total size of the factorsloaded and in the execution time is due to exploiting the sparsity of the right-handside vectors Furthermore when reordering the right-hand sides following a postorder(PoPart) the total number of loaded factors is once again reduced signicantlyresulting in a noticeable impact on the execution time We also see that the larger theblock size the better PoPart performs compared to the natural partitioning Thiscan be intuitively explained as follows within an epoch computations are performedon a union of paths hence the natural ordering is likely to have more trouble withincreasing epoch size because it will combine nodes far from each other

We see that the execution times are proportional to the total volume of loadedfactors On a majority of problems the execution time is largely dominated by thetime spent reading the factors from the disk which explains this behaviour forexample on matrix ecl32 95 of the time is spent on IO On a few problemsonly the time spent in IO represents a signicant but not dominant part of theruntime hence slightly dierent results (eg on CESR72358 the time for loading thefactors represents less than a third of the total time for ES-PoP) On such matricesincreasing the block size is likely to increase the number of operations and thus thetime as explained in Section 61 for the in-core case

22 AMESTOY ET AL

Table 53The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes A random set of N10 o-diagonal entries are computed with MUMPS Out-Of-Coreexecutions are with direct IO access to the les The columns ES-PoP and ES-HP correspondrespectively to the case where a post-order on the column indices and a hypergraph partitioningroutine are used to partition the requested entries into blocks of size 16 and 64

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound ES-PoP ES-HP ES-PoP ES-HPCESR21532 16 563 1782 999 566 169

64 164 703 464 283 136CESR46799 16 264 549 416 322 251

64 93 232 195 355 253CESR72358 16 242 1124 868 1229 738

64 116 794 598 995 728CESR148286 16 905 3175 2693 4260 3217

64 345 2080 1669 2810 2358af23560 16 1703 3579 2463 1092 664

64 458 1219 1003 471 343ecl32 16 9617 22514 12615 5077 2652

64 2483 7309 4664 1992 1199stokes64 16 77 188 149 29 23

64 26 75 74 19 17boyd1 16 205 481 258 390 344

64 55 198 93 259 242

53 Hypergraph model We have performed a set of experiments with thehypergraph model introduced in Section 4 in an attempt to see how it performs inpractice and to set a base case method for future developments Table 53 summa-rizes some tests with the model The matrices are the same as before The testsare again conducted using the out-of-core option of MUMPS and standard settingsincluding an ordering based on nested dissection A random selection of N10 o-diagonal entries (no two in the same column) are computed with B = 16 64 Thetable displays the lower bound (given in Section 4) and the total size of the factorsloaded with a PoPart partition on the column indices and with a partition on thehypergraph models using PaToH [11] with default options except that we have re-quested a tighter balance among part sizes As expected the formulation based onhypergraph partitioning obtains better results than that based on post-order Thereis however a huge dierence between the lower bounds and the performance of theheuristics We performed additional tests with the hypergraph model on the diagonalcase and observed that its performance was similar to that of PoPart (which wehave shown to be very eective) We think therefore that the performance of thehypergraph based formulation should be again reasonable and that the lower boundis too loose to judge the eectiveness of the heuristic However as pointed out beforehypergraph models can rapidly become huge so further studies are needed

6. Similar problems and related work. In this section we briefly present extensions of and variations on the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

[Figure 6.1: a small assembly tree with three requested entries 1, 2, 3; the drawing itself is not recoverable from the text extraction.] Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context the partition {1, 2}, {3} is better than the partition {1, 3}, {2}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations on each of 4 nodes and 1 operation on each of 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

A first observation can be made about the block size. In the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed; conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations. A sketch evaluating both metrics on a small hypothetical tree is given below.
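The following minimal sketch makes the two cost functions explicit under the model just stated: the loading cost charges each node of the union of paths once per block, while the operation count charges it once per requested entry in the block. The tree is hypothetical, chosen so that the two metrics disagree as in Figure 6.1, though with slightly different numbers.

```python
def internal_path(v, parent):
    """Internal nodes on the path from the father of leaf v to the root
    (requested entries are zero-weight leaves)."""
    nodes, v = set(), parent[v]
    while v != -1:
        nodes.add(v)
        v = parent[v]
    return nodes

def block_costs(partition, parent, weight):
    """Return (loaded factor size, operation count) for a partition of the
    requested leaves."""
    loads = ops = 0
    for block in partition:
        union = set().union(*(internal_path(v, parent) for v in block))
        vol = sum(weight[u] for u in union)
        loads += vol             # every node of the union is loaded once
        ops += len(block) * vol  # one operation per entry per node
    return loads, ops

# Hypothetical tree: root 1; chain 1 - 2 - 3 - 4; leaf 5 under 2, leaf 6
# under 4, leaf 7 under the root; unit weights on internal nodes.
parent = {1: -1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 4, 7: 1}
weight = {1: 1, 2: 1, 3: 1, 4: 1}
for part in ([[5, 6], [7]], [[5, 7], [6]]):
    print(part, block_costs(part, parent, weight))
# [[5, 6], [7]] -> (5, 9): fewer loads, better out-of-core
# [[5, 7], [6]] -> (6, 8): fewer operations, better in-core
```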

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated their effect on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, differ from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed; a minimal sketch is given below. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
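As a minimal illustration of the idea (not the actual strategy of [24]), the sketch below interleaves the entries of the blocks in round-robin fashion, so that each resulting block draws entries from several original blocks and hence, typically, from several subtrees mapped to different processors; the blocks shown are hypothetical.

```python
from itertools import zip_longest

def interleave(blocks):
    """Round-robin the entries of the given blocks into new blocks of the
    same sizes, mixing entries that the original partition kept apart."""
    stream = [v for tup in zip_longest(*blocks) for v in tup if v is not None]
    out, start = [], 0
    for b in blocks:
        out.append(stream[start:start + len(b)])
        start += len(b)
    return out

# Three blocks, each confined to one subtree (hence one processor).
print(interleave([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# [[1, 4, 7], [2, 5, 8], [3, 6, 9]]: every block now touches all subtrees
```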

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L+U)^T, starting from entry (N,N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L+U)^T. The algorithm has been implemented in a multifrontal-like approach [8].
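For reference, a standard way of writing these equations is the following sketch; it assumes a factorization A = LDU with unit triangular L and U, and writes Z = A^{-1}.

```latex
% Takahashi-style recurrences, obtained from UZ = D^{-1}L^{-1} and
% ZL = U^{-1}D^{-1} with Z = A^{-1} and A = LDU:
\begin{align*}
  z_{ij} &= -\sum_{k>i} u_{ik}\, z_{kj}, & i &< j,\\
  z_{ij} &= -\sum_{k>j} z_{ik}\, l_{kj}, & i &> j,\\
  z_{jj} &= d_{jj}^{-1} - \sum_{k>j} u_{jk}\, z_{kj}.
\end{align*}
% Starting from z_{NN} and proceeding in reverse Crout order, the sums only
% involve already computed entries of Z; Erisman and Tinney [13] show that,
% for entries in the pattern of (L+U)^T, they stay within that pattern.
```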

If all entries in the pattern of (L+U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if an arbitrary set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. must set up the necessary data structures and determine the computational order needed to obtain all the entries required for those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix, from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch being better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards the efficient computation of off-diagonal inverse entries, more work is needed in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15. doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.
[11] ———, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Comm. ACM, 18 (1975), pp. 177–179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM J. Matrix Analysis and Applications, (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.
[28] ———, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, Presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.



Page 15: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 15

the weight of the vertex $v_j$. Costs can be associated with nets; we use $c(h_i)$ to denote the cost associated with the net $h_i$.

$\Pi = \{V_1, \ldots, V_K\}$ is a $K$-way vertex partition of $H = (V, N)$ if each part is nonempty, parts are pairwise disjoint, and the union of the parts gives $V$. In $\Pi$, a net is said to connect a part if it has at least one vertex in that part. The connectivity set $\Lambda(i)$ of a net $h_i$ is the set of parts connected by $h_i$. The connectivity $\lambda(i) = |\Lambda(i)|$ of a net $h_i$ is the number of parts connected by $h_i$. In $\Pi$, the weight of a part is the sum of the weights of the vertices in that part.

In the hypergraph partitioning problem, the objective is to minimize

\[ \mathrm{cutsize}(\Pi) = \sum_{h_i \in N} (\lambda(i) - 1)\, c(h_i). \tag{3.2} \]

This objective function is widely used in the VLSI community [19] and in the scientific computing community [5, 10, 27]; it is referred to as the connectivity-1 cutsize metric. The partitioning constraint is to satisfy a balancing constraint on part weights:

\[ \frac{W_{\max} - W_{\mathrm{avg}}}{W_{\mathrm{avg}}} \le \varepsilon. \]

Here $W_{\max}$ is the largest part weight, $W_{\mathrm{avg}}$ is the average part weight, and $\varepsilon$ is a predetermined imbalance ratio. This problem is NP-hard [19].
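As a concrete illustration, a minimal sketch (our own data layout, not tied to any particular tool) of the two quantities just defined:

```python
# Minimal sketch: evaluate cutsize (3.2) and the balance constraint for a
# partition. Hypothetical layout: nets maps a net id to (cost, vertex set),
# weights maps a vertex to its weight, parts is a list of vertex sets.
def cutsize(nets, parts):
    total = 0
    for cost, members in nets.values():
        lam = sum(1 for part in parts if members & part)  # connectivity lambda(i)
        total += (lam - 1) * cost                         # (lambda(i) - 1) * c(h_i)
    return total

def is_balanced(weights, parts, eps):
    part_weights = [sum(weights[v] for v in part) for part in parts]
    w_avg = sum(part_weights) / len(parts)
    return (max(part_weights) - w_avg) / w_avg <= eps
```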

3.4.2. The model. We build a hypergraph whose partition according to the cutsize (3.2) corresponds to the total size of the factors loaded. Clearly the requested entries (which correspond to the leaf nodes) are going to be the vertices of the hypergraph, so that a vertex partition will define a partition on the requested entries. The more intricate part of the model is the definition of the nets. The nets correspond to edge-disjoint paths in the tree, starting from a given node (not necessarily a leaf) and going up to one of its ancestors (not necessarily the root); each net is associated with a cost corresponding to the total size of the nodes in the corresponding path. We use path(h) to denote the path (or the set of nodes of the tree) corresponding to a net $h$. A vertex $i$ (corresponding to the leaf node $i$ in the tree) will be in a net $h$ if the solve for $a^{-1}_{ii}$ passes through path(h). In other words, if path(h) ⊂ P(i), then $v_i \in h$. Therefore, if the vertices of a net $h_n$ are partitioned among $\lambda(n)$ parts, then the factors corresponding to the nodes in path($h_n$) will have to be loaded $\lambda(n)$ times. As we load a factor at least once, the extra cost incurred by a partitioning is $\lambda(n) - 1$ for the net $h_n$. Given this observation, it is easy to see the equivalence between the total size of the loaded factors and the cutsize of a partition plus the total weight of the tree.

We now define the hypergraph $H_D = (V_D, N_D)$ for the diagonal case. Let $T = (V, E)$ be the tree corresponding to the modified elimination tree, so that the requested entries correspond to the leaf nodes. Then the vertex set $V_D$ corresponds to the leaf nodes in $T$. As we are interested in putting at most $B$ solves together, we assign a unit weight to each vertex of $H_D$. The nets are best described informally. There is a net in $N_D$ for each internal node of $T$. The net $h_n$ corresponding to the node $n$ contains the set of vertices which correspond to the leaf nodes of the subtree $T(n)$. The cost of $h_n$ is equal to the weight of node $n$, i.e., $c(h_n) = w(n)$. This model can be simplified as follows: if a net $h_n$ contains the same vertices as the net $h_j$, where $j = \mathrm{parent}(n)$, that is, if the subtree rooted at node $n$ and the subtree rooted at its parent $j$ have the same set of leaf nodes, then the net $h_j$ can be removed and its cost can be added to the cost of the net $h_n$.

Figure 3.3. The entries $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$ are requested. For each requested entry, a leaf node is added to the elimination tree, as shown on the left. The hypergraph model for the requested entries is built as shown on the right.

This way, the net $h_n$ represents the node $n$ and its parent $j$. This process can be applied repeatedly, so that the nets associated with the nodes in a chain, except the first one (the one closest to a leaf) and the last one (the one closest to the root), can be removed, and the cost of those removed nets can be added to that of the first one. After this transformation, we can also remove the nets with single vertices (these correspond to parents of leaf nodes with a single child), as these nets cannot contribute to the cutsize. We note that the remaining nets will correspond to disjoint paths in the tree $T$.
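The construction just described can be sketched as follows, under simplifying assumptions of our own: the modified elimination tree is given as a parent dictionary (root mapped to None), w holds the factor sizes, and the requested entries are exactly the leaves. Since leaf sets of tree nodes are laminar, two nets have identical vertex sets exactly when their nodes lie on a chain, so grouping nets by leaf set implements the repeated merging rule.

```python
# Sketch of building H_D (Section 3.4.2) with the net-compression rules above.
def build_hypergraph(parent, w, leaves):
    subtree = {}                                  # internal node -> leaves below it
    for leaf in leaves:                           # climb each leaf-to-root path
        node = parent[leaf]
        while node is not None:
            subtree.setdefault(node, set()).add(leaf)
            node = parent[node]
    nets = {}                                     # frozenset of leaves -> summed cost
    for node, members in subtree.items():         # merge nets with equal vertex sets
        key = frozenset(members)
        nets[key] = nets.get(key, 0) + w[node]
    # optional: single-vertex nets never contribute to the cutsize, so drop them
    return {members: cost for members, cost in nets.items() if len(members) > 1}
```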

Figure 3.3 shows an example of such a hypergraph: the requested entries are $a^{-1}_{11}$, $a^{-1}_{22}$, and $a^{-1}_{55}$. Therefore $V = \{1, 2, 5\}$ and $N = \{h_1, h_2, h_4, h_5\}$ (net $h_3$ is removed according to the rule described above, and the cost of $h_2$ includes the weight of nodes 2 and 3). Each net contains the leaf vertices which belong to the subtree rooted at its associated node; therefore $h_1 = \{1\}$, $h_2 = \{2\}$, $h_4 = \{1, 2\}$, $h_5 = \{1, 2, 5\}$. Given, for example, the partition $V_1 = \{2\}$ and $V_2 = \{1, 5\}$ shown on the left of the figure, the cutsize is

\[
\begin{aligned}
\mathrm{cutsize}(V_1, V_2) &= c(h_1)(\lambda(h_1) - 1) + c(h_2)(\lambda(h_2) - 1) + c(h_4)(\lambda(h_4) - 1) + c(h_5)(\lambda(h_5) - 1) \\
&= c(h_4)(2 - 1) + c(h_5)(2 - 1) \\
&= c(h_4) + c(h_5).
\end{aligned}
\]

Consider the first part, $V_1 = \{2\}$. We have to load the factors associated with the nodes 2, 3, 4, 5. Consider now the second part, $V_2 = \{1, 5\}$. For this part we have to load the factors associated with the nodes 1, 4, 5. Hence the factors associated with the nodes 4 and 5 are loaded twice, while the factors associated with all other (internal) nodes are loaded only once. Since we have to access each node at least once, the extra cost due to the given partition is $w(4) + w(5)$, which is equal to the cutsize $c(h_4) + c(h_5)$.

The model encodes the partitioning problem exactly, in the sense that the cutsize of a partition is equal to the overhead incurred due to that partition; however, it can lead to a huge amount of data and computation. Therefore we envisage the use of this model only for the cases where a small set of entries is requested. But we believe that one can devise special data structures to hold the resulting hypergraphs, as well as specialized algorithms to partition them.


4. The off-diagonal case. As discussed before, the formulation for the diagonal case carries over to the off-diagonal case as well. But an added difficulty in this case is related to the actual implementation of the solver. Assume that we have to solve for $a^{-1}_{ij}$ and $a^{-1}_{kj}$, that is, two entries in the same column of $A^{-1}$. As seen from the formula (1.1), reproduced below for convenience,

\[ y = L^{-1} e_j, \qquad a^{-1}_{ij} = (U^{-1} y)_i, \]

only one $y$ vector suffices. Similarly, one can solve for the common nonzero entries in $U^{-1} y$ only once for $i$ and $k$. This means that for the forward solves with $L$ we can perform only one solve, and for the backward solves we can solve for the variables in the path from the root to $\mathrm{lca}(i, k)$ only once. Clearly this will reduce the operation count. In an out-of-core context, we will load the same factors whether or not we keep the same number of right-hand sides throughout the computation (both in the forward and backward substitution phases). Avoiding the unnecessary repeated solves would only affect the operation count.

If we were to exclude the case when more than one entry in the same column of $A^{-1}$ is requested, then we could immediately generalize our results and models developed for the case of diagonal entries to the off-diagonal case. Of course, the partitioning problem remains NP-complete (it contains the instances with diagonal entries as a particular case). The lower bound can also be generalized to yield a lower bound for the case with arbitrary entries. Indeed, we only have to apply the same reasoning twice: once for the column indices of the requested entries and once for the row indices of the requested entries. We can extend the model to cover the case of multiple entries by duplicating nodes, although this does result in solving for multiple vectors instead of a potential single solve. We extend the model by noting that when indices are repeated (say $a^{-1}_{ij}$ and $a^{-1}_{kj}$ are requested), we can distinguish them by assigning each occurrence to a different leaf node (we add two zero-weighted leaf nodes to the node $j$ of the elimination tree). Then adding these two lower bounds yields a lower bound for the general case. However, in our experience, we have found this lower bound to be loose. Note that applying this lower bound to the case where only diagonal entries are requested yields the lower bound given in Theorem 3.1.

The PoPart and the BiseMatch heuristics do not naturally generalize to the off-diagonal case, because the generalized problem has a more sophisticated underlying structure. However, the hypergraph partitioning-based approach works for arbitrary entries of the inverse. The idea is to model the forward and backward solves with two different hypergraphs, and then to partition these two hypergraphs simultaneously. It has been shown how to partition two hypergraphs simultaneously in [27]. The essential idea, which is refined in [28], is to build a composite hypergraph by amalgamating the relevant vertices of the two hypergraphs while keeping the nets intact. In our case, the two hypergraphs would be the model for the diagonal entries associated with the column subscripts (forward phase) and the model for the diagonal entries associated with the row subscripts (backward phase), again assuming that the same indices are distinguished by associating them with different leaf nodes. We have then to amalgamate any two vertices $i$ and $j$ where the entry $a^{-1}_{ij}$ is requested.

Figure 4.1 shows an example where the requested entries are $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$. The transformed elimination tree and the nets of the hypergraphs associated with the forward ($h^{\mathrm{fwd}}$) and backward ($h^{\mathrm{bwd}}$) solves are shown. Note that the nets $h^{\mathrm{fwd}}_3$, as well as $h^{\mathrm{bwd}}_3$, $h^{\mathrm{bwd}}_4$, and $h^{\mathrm{bwd}}_8$, are removed.

Figure 4.1. Example of the hypergraph model for the general case: $a^{-1}_{71}$, $a^{-1}_{62}$, and $a^{-1}_{95}$ are requested.

The nodes of the tree which correspond to the vertices of the hypergraph for the forward solves are shaded with light grey; those nodes which correspond to the vertices of the hypergraph for the backward solves are shaded with dark grey. The composite hypergraph is shown in the right-hand figure. The amalgamation of light and dark grey vertices is done according to the requested entries (vertex $i$ and vertex $j$ are amalgamated for a requested entry $a^{-1}_{ij}$).

A partition is given in the right-hand figure: $\Pi = \{\{a^{-1}_{62}\}, \{a^{-1}_{71}, a^{-1}_{95}\}\}$. The cut size is $c(h^{\mathrm{bwd}}_5) + c(h^{\mathrm{fwd}}_4) + c(h^{\mathrm{fwd}}_5)$. Consider the computation of $a^{-1}_{62}$. We need to load the $L$ factors associated with the nodes 2, 3, 4, and 5, and the $U$ factors associated with 5, 4, 3, and 6. Now consider the computation of $a^{-1}_{71}$ and $a^{-1}_{95}$: the $L$ factors associated with 1, 4, and 5, and the $U$ factors associated with 5, 10, 8, 7, and 9 are loaded. In the forward solution, the $L$ factors associated with 4 and 5 are loaded twice (instead of once if we were able to solve for all of them in a single pass), and in the backward solution the $U$ factor associated with 5 is loaded twice (instead of once). The cutsize again corresponds to these extra loads.

We note that building such a hypergraph for the case where only diagonal entries are requested yields the hypergraph of the previous section, where each hyperedge is repeated twice.

5. Experiments. We conduct three sets of experiments. In the first set, we compare the quality of the results obtained by the PoPart and BiseMatch heuristics, using Matlab implementations of these algorithms. For these experiments, we created a large set of TreePartitioning problems, each of which is associated with computing some diagonal entries in the inverse of a sparse matrix. In the second set of experiments, we use an implementation of PoPart in Fortran that we integrated into the MUMPS solver [3]. We use this to investigate the performance of PoPart on practical cases, using the out-of-core option of MUMPS. In this set of experiments, we compute a set of entries from the diagonal of the inverse of matrices from two different data sets: the first set contains a few matrices coming from the astrophysics application [7] briefly described in Section 1; the second set contains some more matrices that are publicly available. In the third set of experiments, we carry out some experiments with the hypergraph model for the off-diagonal case.

5.1. Assessing the heuristics. Our first set of experiments compares the heuristics PoPart and BiseMatch, which were discussed in Sections 3.1 and 3.3. We have implemented these heuristics in Matlab. We use a set of matrices from the University of Florida (UFL) sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices). The matrices we choose satisfy the following characteristics: 10000 ≤ N ≤ 100000, the average number of nonzeros per row is greater than or equal to 25, and in the UFL index the posdef field is set to 1. At the time of writing, there were a total of 61 matrices satisfying these properties. We ordered the matrices using the metisnd routine of the Mesh Partitioning Toolbox [17] and built the elimination tree associated with the ordered matrices using the etree function of Matlab. We experimented with block sizes B ∈ {2, 4, 8, 16, 32, 64, 128, 256}. We assigned random weights in the range 1–200 to the tree nodes. Then, for each P ∈ {0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00}, we created 10 instances (except for P = 1.00) by randomly selecting P × N integers between 1 and N and designating them as the requested entries in the diagonal of the inverse. Notice that for a given triplet of a matrix, B, and P, we have 10 different trees to partition, resulting in a total of 10 × 6 × 8 × 61 + 8 × 61 = 29768 TreePartitioning problems.

We summarize the results in Table 5.1, giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average over the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.
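For clarity, the aggregation used to produce Table 5.1 can be sketched as follows (all names are ours):

```python
from statistics import mean

# Sketch of Table 5.1's aggregation: average the 10 instance ratios of each
# (matrix, B, P) triplet, then take min/max/mean over the 61 matrices.
def summarize(ratios):
    """ratios: dict (matrix, B, P) -> list of cost / lower-bound ratios."""
    per_triplet = {key: mean(vals) for key, vals in ratios.items()}
    by_bp = {}
    for (matrix, B, P), avg in per_triplet.items():
        by_bp.setdefault((B, P), []).append(avg)
    return {key: (min(v), max(v), mean(v)) for key, v in by_bp.items()}
```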

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound became worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that when a high percentage of the diagonal entries are requested the trees will not have their structure changed much by the removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24].


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent, over 61 different matrices, the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                    PoPart                    BiseMatch
  B     P      min     max     avg       min     max     avg
  4   0.05   1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
      0.10   1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
      0.20   1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
      0.40   1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
      0.60   1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
      0.80   1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
      1.00   1.0000  1.0491  1.0038    1.0000  1.0101  1.0002
 16   0.05   1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
      0.10   1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
      0.20   1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
      0.40   1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
      0.60   1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
      0.80   1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
      1.00   1.0003  1.1052  1.0133    1.0000  1.0096  1.0008
 64   0.05   1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
      0.10   1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
      0.20   1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
      0.40   1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
      0.60   1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
      0.80   1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
      1.00   1.0014  1.2087  1.0315    1.0000  1.0141  1.0024
256   0.05   1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
      0.10   1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
      0.20   1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
      0.40   1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
      0.60   1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
      0.80   1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
      1.00   1.0043  1.1203  1.0495    1.0001  1.0677  1.0118
Over all triplets  1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as the height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16, 64. In this table,


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                           Total size of the loaded factors (MBytes)    Running time of the solution phase (s)
matrix        B   Lower bound      NoES    ES-Nat   ES-PoP                NoES   ES-Nat   ES-PoP
CESR21532    16       5403        63313      7855     5422               11138      694      393
             64       1371        15828      2596     1389                3596      381      166
CESR46799    16       2399       114051      3158     2417               39623      767      477
             64        620        28512      1176      635                8663      513      285
CESR72358    16       1967       375737      6056     2008              108009     2637      718
             64        528        93934      4796      571               31740     2741      520
CESR148286   16       8068      1595645     16595     8156              433967     7207     2685
             64       2092       398911     11004     2179              140493     7267     1998
af23560      16      16720       114672     17864    16745               20806     2411     1976
             64       4215        28668      5245     4245                6685     1210      595
ecl32        16      95478       618606    141533    95566              121847    27263    17606
             64      23943       154651     43429    24046               35255     9741     4829
stokes64     16        721         8503      1026      726                1312      142       85
             64        185         2125       425      189                 488      102       41
boyd1        16       2028        75521      4232     2031              165512     3898     2149
             64        515        18880      1406      518               54927     2305     1212

the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational schemes where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a postorder (PoPart), the total size of the loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.
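The two schemes compared in the table can be sketched as follows (illustrative code of our own; PoPart's additional mechanics are reduced here to ordering the requested entries by post-order rank before blocking):

```python
# Sketch: natural partitioning vs. a post-order-based partitioning of the
# requested diagonal entries into blocks of size at most B.
def natural_partition(requested, B):
    return [requested[k:k + B] for k in range(0, len(requested), B)]

def postorder_partition(requested, postorder_rank, B):
    """postorder_rank: node id -> rank of the node in a post-order of the tree."""
    ordered = sorted(requested, key=lambda node: postorder_rank[node])
    return [ordered[k:k + B] for k in range(0, len(ordered), B)]
```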

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour: for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                           Total size of the loaded factors (MBytes)    Running time of the solution phase (s)
matrix        B   Lower bound    ES-PoP    ES-HP                         ES-PoP    ES-HP
CESR21532    16        563         1782      999                            566      169
             64        164          703      464                            283      136
CESR46799    16        264          549      416                            322      251
             64         93          232      195                            355      253
CESR72358    16        242         1124      868                           1229      738
             64        116          794      598                            995      728
CESR148286   16        905         3175     2693                           4260     3217
             64        345         2080     1669                           2810     2358
af23560      16       1703         3579     2463                           1092      664
             64        458         1219     1003                            471      343
ecl32        16       9617        22514    12615                           5077     2652
             64       2483         7309     4664                           1992     1199
stokes64     16         77          188      149                             29       23
             64         26           75       74                             19       17
boyd1        16        205          481      258                            390      344
             64         55          198       93                            259      242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16, 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. First, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides


Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1, 2}, {3} is better than the partition {1, 3}, {2} because the total sizes of the loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).
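The two metrics discussed here can be made explicit with a small sketch (ours; unit node weights assumed, as in Figure 6.1; the tree is given as a parent dictionary with the root mapped to None, and a block is a set of requested leaves):

```python
# Sketch of the two cost models of Section 6.1 for one block of right-hand
# sides: out-of-core cost = size of the union of the root paths it touches;
# in-core cost = that size times the number of right-hand sides in the block.
def block_costs(block, parent):
    touched = set()
    for leaf in block:
        node = leaf
        while node is not None:            # walk the path from leaf to root
            touched.add(node)
            node = parent[node]
    loads = len(touched)                   # factors loaded (out-of-core metric)
    flops = len(touched) * len(block)      # operations performed (in-core metric)
    return loads, flops
```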

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
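A rough sketch of the interleaving idea (ours; [24] should be consulted for the actual strategy): entries are picked round-robin from the existing blocks, so that each new block mixes entries mapped to different subtrees.

```python
from itertools import zip_longest

# Sketch: rebuild blocks of size B by taking one entry from each original
# block in turn, spreading tree-distant entries across every block.
def interleave(blocks, B):
    mixed = [x for group in zip_longest(*blocks) for x in group if x is not None]
    return [mixed[k:k + B] for k in range(0, len(mixed), B)]
```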

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N, N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].
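For reference, the Takahashi et al. relations can be stated as follows (standard form, in our notation; the paper itself does not reproduce them). With $A = LDU$, $L$ unit lower and $U$ unit upper triangular, and $Z = A^{-1}$:

```latex
% Takahashi et al. identity and its elementwise form (our notation):
Z = D^{-1}L^{-1} + (I - U)Z,
\qquad\text{so that}\qquad
z_{ij} = -\sum_{k>i} u_{ik}\, z_{kj} \quad (i < j),
\qquad
z_{jj} = d_{jj}^{-1} - \sum_{k>j} u_{jk}\, z_{kj}.
```

Starting from $z_{NN} = d_{NN}^{-1}$ and sweeping backwards, Erisman and Tinney [13] show that, for entries in the pattern of $(L+U)^T$, these sums only involve other entries of $Z$ in that pattern.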

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $n$ linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree and then a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of a partition corresponds exactly to the increase in the total size of the factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.


[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Comm. ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM J. Matrix Analysis and Applications, (1993).

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: from boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 16: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

16 AMESTOY ET AL

h4

h1

h2

h2

h4

h5

h1

5

h5

V1

V2

4

5

3

2

1

2

1

2

1

5

Figure 33 The entries aminus111 a

minus122 and a

minus155 are requested For each requested entry a leaf

node is added to the elimination tree as shown on the left The hypergraph model for the requestedentries is build as shown on the right

the cost of the net hn This way the net hn represents the node n and its parent jThis process can be applied repeatedly so that the nets associated with the nodes ina chain except the rst one (the one closest to a leaf) and the last one (the one whichis closest to the root) can be removed and the cost of those removed nets can beadded to that of the rst one After this transformation we can also remove the netswith single vertices (these correspond to fathers of the leaf nodes with a single child)as these nets cannot contribute to the cutsize We note that the remaining nets willcorrespond to disjoint paths in the tree T

Figure 33 shows an example of such a hypergraph the requested entries are aminus111

aminus122 and a

minus155 Therefore V = 1 2 5 and N = h1 h2 h4 h5 (net h3 is removed

according to the rule described above and the cost of h2 includes the weight of nodes2 and 3) Each net contains the leaf vertices which belong to the subtree rooted atits associated node therefore h1 = 1 h2 = 2 h4 = 1 2 h5 = 1 2 5 Givenfor example the partition V1 = 2 and V2 = 1 5 shown on the left of the gurethe cutsize is

cutsize(V1 V2) = c(h1)times (λ(h1)minus 1) + c(h2)times (λ(h2)minus 1)+ c(h4)times (λ(h4)minus 1) + c(h5)times (λ(h5)minus 1)

= c(h4)times (2minus 1) + c(h5)times (2minus 1)= c(h4) + c(h5)

Consider the rst part V1 = 2 We have to load the factors associated with thenodes 2 3 4 5 Consider now the second part V2 = 1 5 For this part we haveto load the factors associated with the nodes 1 4 5 Hence the factors associatedwith the nodes 4 and 5 are loaded twice while the factors associated with all other(internal) nodes are loaded only once Since we have to access each node at least oncethe extra cost due to the given partition is w(4) +w(5) which is equal to the cutsizec(h4) + c(h5)

The model encodes the partitioning problem exactly in the sense that the cutsizeof a partition is equal to the overhead incurred due to that partition however it canlead to huge amount of data and computation Therefore we envisage the use of thismodel only for the cases where a small set of entries is requested But we believe thatone can devise special data structures to hold the resulting hypergraphs as well asspecialized algorithms to partition them

COMPUTING INVERSE ENTRIES 17

4 The o-diagonal case As discussed before the formulation for the diagonalcase carries over to the o-diagonal case as well But an added diculty in this caseis related to the actual implementation of the solver Assume that we have to solvefor aminus1

ij and aminus1kj that is two entries in the same column of Aminus1 As seen from the

formula (11) reproduced below for conveniencey = Lminus1ej

aminus1ij = (Uminus1y)i

only one y vector suces Similarly one can solve for the common nonzero entries inUminus1y only once for i and k This means that for the forward solves with L we canperform only one solve and for the backward solves we can solve for the variables inthe path from the root to lca(i k) only once Clearly this will reduce the operationcount In an out-of-core context we will load the same factors whether or not wekeep the same number of right-hand sides throughout the computation (both in theforward and backward substitution phases) Avoiding the unnecessary repeated solveswould only aect the operation count

If we were to exclude the case when more than one entry in the same column ofAminus1 is requested then we can immediately generalize our results and models devel-oped for the case of diagonal entries to the o-diagonal case Of course the partition-ing problem will remain NP-complete (it contains the instances with diagonal entriesas a particular case) The lower bound can also be generalized to yield a lower boundfor the case with arbitrary entries Indeed we only have to apply the same reasoningtwice one for the column indices of the requested entries and one for the row indicesof the requested entries We can extend the model to cover the case of multiple entriesby duplicating nodes although this does result in solving for multiple vectors insteadof a potential single solve We extend the model by noting that when indices arerepeated (say aminus1

ij and aminus1kj are requested) we can distinguish them by assigning each

occurrence to a dierent leaf node (we add two zero-weighted leaf nodes to the nodej of the elimination tree) Then adding these two lower bounds yields a lower boundfor the general case However in our experience we have found this lower bound tobe loose Note that applying this lower bound to the case where only diagonal entriesare requested yields the lower bound given in Theorem 31

The PoPart and the BiseMatch heuristics do not naturally generalize to theo-diagonal case because the generalized problem has a more sophisticated underlyingstructure However the hypergraph partitioning-based approach works for arbitraryentries of the inverse The idea is to model the forward and backward solves with twodierent hypergraphs and then to partition these two hypergraphs simultaneously Ithas been shown how to partition two hypergraphs simultaneously in [27] The essentialidea which is rened in [28] is to build a composite hypergraph by amalgamatingthe relevant vertices of the two hypergraphs while keeping the nets intact In ourcase the two hypergraphs would be the model for the diagonal entries associatedwith the column subscripts (forward phase) and the model for the diagonal entriesassociated with the row subscripts (backward phase) again assuming that the sameindices are distinguished by associating them with dierent leaf nodes We have thento amalgamate any two vertices i and j where the entry aminus1

ij is requested

Figure 41 shows an example where the requested entries are aminus171 a

minus162 and aminus1

95 The transformed elimination tree and the nets of the hypergraphs associated with theforward (hfwd) and backward (hbwd) solves are shown Note that the nets hfwd

3 aswell as hbwd

3 hbwd4 and hbwd

8 are removed The nodes of the tree which correspond to

18 AMESTOY ET AL

h2

h4

h5

h1

V1

V2

h10

h9

h7

h6

fwd

fwd

fwd

fwd

bwd

bwd

bwd

bwd

5 9

1 7

2 6h

4

h1

h2

4

5

3

2

1 8

7

9

6

h7h

6

10 h10

bwd

h9

bwd

bwdbwd

fwd

fwd

fwd

h5

fwd

2

1

6 7

5

9

h5

bwd

h5

bwd

Figure 41 Example of hypergraph model for the general case aminus171 a

minus162 and aminus1

95 are requested

the vertices of the hypergraph for the forward solves are shaded with light grey thosenodes which correspond to the vertices of the hypergraph for the backward solvesare shaded with dark grey The composite hypergraph is shown in the right-handgure The amalgamation of light and dark grey vertices is done according to therequested entries (vertex i and vertex j are amalgamated for a requested entry aminus1

ij )

A partition is given in the right-hand gure Π = aminus162 a

minus171 a

minus195 The cut size

is c(hbwd5 ) + c(hfwd

4 ) + c(hfwd5 ) Consider the computation of aminus1

62 We need to load theL factors associated with the nodes 2 3 4 and 5 and the U factors associated with5 4 3 and 6 Now consider the computation of aminus1

71 and aminus195 the L factors associated

with 1 4 and 5 and the U factors associated with 5 10 8 7 and 9 are loaded Inthe forward solution the L factors associated with 4 and 5 are loaded twice (insteadof once if we were able to solve for all of them in a single pass) and in the backwardsolution the U factor associated with 5 is loaded twice (instead of once) The cutsizeagain corresponds to these extra loads

We note that building such a hypergraph for the case where only diagonal entriesare requested yields the hypergraph of the previous section where each hyperedge isrepeated twice

5 Experiments We conduct three sets of experiments In the rst set wecompare the quality of the results obtained by the PoPart and BiseMatch heuris-tics using Matlab implementations of these algorithms For these experiments wecreated a large set of TreePartitioning problems each of which is associated withcomputing some diagonal entries in the inverse of a sparse matrix In the second setof experiments we use an implementation of PoPart in Fortran that we integratedinto the MUMPS solver [3] We use this to investigate the performance of PoPart onpractical cases using the out-of-core option of MUMPS In this set of experimentswe compute a set of entries from the diagonal of the inverse of matrices from twodierent data sets the rst set contains a few matrices coming from the astrophysicsapplication [7] briey described in Section 1 the second set contains some more ma-trices that are publicly available In the third set of experiments we carry out someexperiments with the hypergraph model for the o-diagonal case

51 Assessing the heuristics Our rst set of experiments compares the heu-ristics PoPart and the BiseMatch which were discussed in Sections 31 and 33We have implemented these heuristics in Matlab We use a set of matrices from

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 51 by giving results with B isin 4 16 64 256(the last row relates to all B values mentioned in the previous paragraph) Oursubsequent discussion relates to the complete set of experiments and not just thoseshown in Table 51 In order to create this table we computed the lower bound forall tree partitioning instances and computed the ratio of the costs found by PoPartand BiseMatch to the lower bound Next for a triplet of a matrix B and P wetook the average of the ratios of the 10 random instances and stored that averageresult for the triplet Then we took the minimum the maximum and the average ofthe 61 dierent triplets with the same B and P As seen in this table both heuristicsobtain results that are close to the lower bound PoParts average result is about104 times the lower bound and BiseMatchs average result is about 101 timesthe lower bound The PoPart heuristic attains the exact lower bound in only onetriplet while BiseMatch attains the lower bound for all instances with B = 2 (recallthat it is based on the exact algorithm Match) and for some other ve triplets Themaximum deviation from the lower bound is about 10 with BiseMatch whereas itis 30 with PoPart Given that in most cases the algorithms perform close to theaverage gures we conclude that both are ecient enough to be useful in the contextof the out-of-core solver

The BiseMatch heuristic almost always obtains better results than PoPart inonly 7 out of 56 times 61 triplets did PoPart obtain better results than BiseMatchFor all P the performance of BiseMatch with respect to the lower bound becameworse as B increases Although there are uctuations in the performance of PoPartfor small values of P eg for 005 and 010 for larger values of P the performancealso becomes worse with larger values of B We suspect that the lower bound mightbe loose for large values of B For all B the performance of BiseMatch with respectto the lower bound improves when P increases A similar trend is observable forPoPart except for a small deviation for B = 256 Recall that the trees we usehere come from a nested dissection ordering Such ordering schemes are known toproduce wide and balanced trees This fact combined with the fact that when a highpercentage of the diagonal entries are requested the trees will not have their structurechanged much by removal and addition of leaf nodes may explain why the heuristicsperform better at larger values of P for a given B

52 Practical tests with a direct solver We have implemented the heuristicPoPart in Fortran and integrated it into the MUMPS solver [3] The implementation of

20 AMESTOY ET AL

Table 51The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower

bound The numbers represent the average over 61 dierent matrices of the ratios of the results ofthe heuristics to the lower bound discussed in the text The column B corresponds to the maximumallowable block size the column P corresponds to the requested percentage of diagonal entries

B P PoPart BiseMatchmin max avg min max avg

4 005 10011 11751 10278 10000 10139 10013010 10005 11494 10192 10000 10073 10005020 10003 10945 10119 10000 10052 10003040 10001 10585 10072 10000 10031 10001060 10001 10449 10053 10000 10019 10001080 10000 10367 10043 10000 10029 10001100 10000 10491 10038 10000 10101 10002

16 005 10050 11615 10592 10000 10482 10113010 10026 11780 10485 10000 10553 10075020 10016 12748 10374 10000 10334 10035040 10007 11898 10246 10000 10230 10016060 10005 11431 10186 10000 10166 10010080 10004 11136 10154 10000 10190 10011100 10003 11052 10133 10000 10096 10008

64 005 10132 11581 10800 10000 10797 10275010 10101 11691 10715 10002 10584 10196020 10054 11389 10599 10001 10506 10125040 10030 11843 10497 10000 10437 10079060 10020 12362 10407 10000 11022 10072080 10015 13018 10383 10000 10344 10044100 10014 12087 10315 10000 10141 10024

256 005 10050 11280 10651 10000 10867 10342010 10127 11533 10721 10003 10911 10314020 10133 11753 10730 10002 10722 10257040 10093 11598 10668 10003 10540 10187060 10068 11621 10602 10002 10572 10174080 10068 11314 10563 10001 10515 10120100 10043 11203 10495 10001 10677 10118

Over all triplets 10000 13018 10359 10000 11110 10079

the computation of a set of inverse entries exploiting sparsity within MUMPS is describedin [24] In this section we give results obtained by using MUMPS with the out-of-coreoption and a nested dissection ordering provided by MeTiS [18] All experiments havebeen performed with direct IO access to les so that we can guarantee eective diskaccess independently of both the size of the factors and the size of the main memoryThe benets resulting from the use of direct IO mechanisms during the solution phaseare discussed in [2] All results are obtained on a dual-core Intel Core2 Duo P8800processor having a 280 GHz clock speed We have used only one of the cores andwe did not use threaded BLAS We use four matrices from the real life astrophysicsapplication [7] briey described in Section 1 The names of these matrices start withCESR and continue with the size of the matrix We use an additional set of fourmatrices (af23560 ecl32 stokes64 boyd1) from the UFL sparse matrix collectionwith very dierent nonzero patterns yielding a set of elimination trees with varyingstructural properties (such as height and width of the tree and variations in nodedegrees)

Table 52 shows the total size of the factors loaded and the execution time of thesolution phase of MUMPS with dierent settings and partitions All diagonal entriesof the inverse of the given matrices are computed with B = 16 64 In this table

COMPUTING INVERSE ENTRIES 21

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                            Total size of the                 Running time of the
                         loaded factors (MBytes)               solution phase (s)
matrix          B   Lower bound     NoES   ES-Nat  ES-PoP       NoES  ES-Nat  ES-PoP
CESR21532      16       5403       63313     7855    5422     1113.8    69.4    39.3
               64       1371       15828     2596    1389      359.6    38.1    16.6
CESR46799      16       2399      114051     3158    2417     3962.3    76.7    47.7
               64        620       28512     1176     635      866.3    51.3    28.5
CESR72358      16       1967      375737     6056    2008    10800.9   263.7    71.8
               64        528       93934     4796     571     3174.0   274.1    52.0
CESR148286     16       8068     1595645    16595    8156    43396.7   720.7   268.5
               64       2092      398911    11004    2179    14049.3   726.7   199.8
af23560        16      16720      114672    17864   16745     2080.6   241.1   197.6
               64       4215       28668     5245    4245      668.5   121.0    59.5
ecl32          16      95478      618606   141533   95566    12184.7  2726.3  1760.6
               64      23943      154651    43429   24046     3525.5   974.1   482.9
stokes64       16        721        8503     1026     726      131.2    14.2     8.5
               64        185        2125      425     189       48.8    10.2     4.1
boyd1          16       2028       75521     4232    2031    16551.2   389.8   214.9
               64        515       18880     1406     518     5492.7   230.5   121.2

In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational schemes where the sparsity of the right-hand side vectors is exploited to speed up the solution process, respectively under the natural partitioning (the indices are partitioned in their natural order into blocks of size B) and under the PoPart heuristic. As seen in the column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when the right-hand sides are reordered following a post-order (PoPart), the total volume of loaded factors is once again reduced significantly, with a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be explained intuitively as follows: within an epoch, computations are performed on a union of paths, so the natural ordering is likely to have more trouble with increasing epoch size, because it combines nodes that are far from each other.
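
To make the two partitioning schemes concrete, the following Python sketch (our own illustration, using a parent/children tree representation rather than the MUMPS data structures) builds a PoPart-style partition: it orders the requested nodes by a post-order of the elimination tree and cuts that sequence into consecutive blocks of size B; the natural partitioning is obtained by passing the requested indices in their original order. The cost function charges each block one load of the factors of every node on the union of the paths from its entries to the root, which is the out-of-core metric used throughout this section.

def postorder(children, root):
    # Post-order of the tree: children are visited before their parent.
    order, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for c in children.get(node, []):
                stack.append((c, False))
    return order

def blocks_cost(parent, weight, seq, B):
    # Total factor volume loaded when `seq` is processed in consecutive
    # blocks of size B: each block loads, once, every node on the union
    # of the paths from its entries to the root (parent[root] is None).
    total = 0
    for k in range(0, len(seq), B):
        union = set()
        for v in seq[k:k + B]:
            while v is not None:
                union.add(v)
                v = parent[v]
        total += sum(weight[u] for u in union)
    return total

def popart_cost(parent, children, root, weight, requested, B):
    rank = {v: i for i, v in enumerate(postorder(children, root))}
    return blocks_cost(parent, weight, sorted(requested, key=rank.get), B)

In this sketch, calling blocks_cost directly on the requested indices in their natural order plays the role of ES-Nat, while popart_cost plays the role of ES-PoP.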

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                            Total size of the         Running time of the
                         loaded factors (MBytes)       solution phase (s)
matrix          B   Lower bound   ES-PoP   ES-HP        ES-PoP   ES-HP
CESR21532      16        563        1782     999          56.6    16.9
               64        164         703     464          28.3    13.6
CESR46799      16        264         549     416          32.2    25.1
               64         93         232     195          35.5    25.3
CESR72358      16        242        1124     868         122.9    73.8
               64        116         794     598          99.5    72.8
CESR148286     16        905        3175    2693         426.0   321.7
               64        345        2080    1669         281.0   235.8
af23560        16       1703        3579    2463         109.2    66.4
               64        458        1219    1003          47.1    34.3
ecl32          16       9617       22514   12615         507.7   265.2
               64       2483        7309    4664         199.2   119.9
stokes64       16         77         188     149           2.9     2.3
               64         26          75      74           1.9     1.7
boyd1          16        205         481     258          39.0    34.4
               64         55         198      93          25.9    24.2

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base-case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph models obtained using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We therefore think that the performance of the hypergraph-based formulation is again reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.
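
For reference, the cutsize reported by such a partitioner can be taken as the standard connectivity metric of hypergraph partitioning (the "connectivity-1" metric of [10]): a net of cost c that is touched by k parts contributes c(k - 1) to the cutsize, which in our model corresponds to the extra loads of the associated factor beyond the first. A minimal sketch of this computation, with our own data layout (a net is a cost together with its vertex list):

def cutsize(nets, part):
    # nets: list of (cost, vertices) pairs; part[v]: part index of vertex v.
    # Connectivity-1 metric: each net pays its cost once for every part
    # it touches beyond the first, i.e., once per extra factor load.
    total = 0
    for cost, vertices in nets:
        touched = {part[v] for v in vertices}
        total += cost * (len(touched) - 1)
    return total

# A net of cost 3 spanning two parts is cut and adds 3 (one extra load);
# a net contained in a single part adds nothing.
print(cutsize([(3, [0, 1, 2]), (1, [2, 3])], {0: 0, 1: 0, 2: 1, 3: 1}))  # 3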

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. First we show the difference between the in-core and the out-of-core cases; then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context the relevant metric is the number of floating-point operations (flops) performed.


Figure 6.1 (the tree drawing itself is not reproduced; it shows a small assembly tree with three requested nodes, labeled 1, 2, and 3). Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context the partition {1, 2}, {3} is better than the partition {1, 3}, {2}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); in the in-core context, however, the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 x 2 + 1 x 4).

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size. In the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed; conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore the choice of a block size will be different in each case: for in-core we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
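
The two metrics can be written down directly; the sketch below (again our own notation: a parent map for the tree and per-node factor weights, not taken from any solver) makes the trade-off explicit. Under the unit-cost model of Figure 6.1, the in-core cost charges every node of a block's union of paths once per entry of the block, whereas the out-of-core cost charges it once per block; hence blocks of size one minimize the flop count, while a single large block minimizes the loaded volume.

def union_of_paths(parent, block):
    # Union of the paths from each requested node in the block to the root.
    union = set()
    for v in block:
        while v is not None:
            union.add(v)
            v = parent[v]
    return union

def out_of_core_cost(parent, weight, partition):
    # Each block loads the factors of every node in its union exactly once.
    return sum(sum(weight[u] for u in union_of_paths(parent, b))
               for b in partition)

def in_core_flops(parent, weight, partition):
    # Every node in the union is processed once per right-hand side
    # in the block (the unit-cost model of Figure 6.1).
    return sum(len(b) * sum(weight[u] for u in union_of_paths(parent, b))
               for b in partition)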

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, differ from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case not much work has been done. To compute entries in A^{-1} one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N x N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L+U)^T, starting from entry (N, N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L+U)^T. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of (L+U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if only a subset of the entries of the inverse is requested, any implementation based on the equations by Takahashi et al. must set up the necessary data structures and determine the computational order for all the entries on which the requested ones depend; this seems to be a rather time-consuming operation.
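
As a concrete baseline for this "traditional" route, the sketch below (our own SciPy illustration, not the implementation used in this paper) computes a requested set of entries with one sparse direct solve per distinct requested column, since a^{-1}_{ij} is the i-th component of the solution of A x = e_j:

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def inverse_entries(A, requested):
    # One forward/backward solve per distinct requested column j:
    # column j of A^{-1} is the solution of A x = e_j.
    lu = splu(A)                        # sparse LU factorization of A
    n = A.shape[0]
    columns, result = {}, {}
    for (i, j) in requested:
        if j not in columns:
            e_j = np.zeros(n)
            e_j[j] = 1.0
            columns[j] = lu.solve(e_j)  # triangular solves with L and U
        result[(i, j)] = columns[j][i]
    return result

A = csc_matrix(np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]))
print(inverse_entries(A, [(0, 0), (2, 0), (1, 1)]))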

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from the paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS; these confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards the efficient computation of off-diagonal inverse entries, more work is needed in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey, 1999. PaToH is available at http://bmi.osu.edu/~umit/software.htm.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 17: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 17

4 The o-diagonal case As discussed before the formulation for the diagonalcase carries over to the o-diagonal case as well But an added diculty in this caseis related to the actual implementation of the solver Assume that we have to solvefor aminus1

ij and aminus1kj that is two entries in the same column of Aminus1 As seen from the

formula (11) reproduced below for conveniencey = Lminus1ej

aminus1ij = (Uminus1y)i

only one y vector suces Similarly one can solve for the common nonzero entries inUminus1y only once for i and k This means that for the forward solves with L we canperform only one solve and for the backward solves we can solve for the variables inthe path from the root to lca(i k) only once Clearly this will reduce the operationcount In an out-of-core context we will load the same factors whether or not wekeep the same number of right-hand sides throughout the computation (both in theforward and backward substitution phases) Avoiding the unnecessary repeated solveswould only aect the operation count

If we were to exclude the case when more than one entry in the same column ofAminus1 is requested then we can immediately generalize our results and models devel-oped for the case of diagonal entries to the o-diagonal case Of course the partition-ing problem will remain NP-complete (it contains the instances with diagonal entriesas a particular case) The lower bound can also be generalized to yield a lower boundfor the case with arbitrary entries Indeed we only have to apply the same reasoningtwice one for the column indices of the requested entries and one for the row indicesof the requested entries We can extend the model to cover the case of multiple entriesby duplicating nodes although this does result in solving for multiple vectors insteadof a potential single solve We extend the model by noting that when indices arerepeated (say aminus1

ij and aminus1kj are requested) we can distinguish them by assigning each

occurrence to a dierent leaf node (we add two zero-weighted leaf nodes to the nodej of the elimination tree) Then adding these two lower bounds yields a lower boundfor the general case However in our experience we have found this lower bound tobe loose Note that applying this lower bound to the case where only diagonal entriesare requested yields the lower bound given in Theorem 31

The PoPart and the BiseMatch heuristics do not naturally generalize to theo-diagonal case because the generalized problem has a more sophisticated underlyingstructure However the hypergraph partitioning-based approach works for arbitraryentries of the inverse The idea is to model the forward and backward solves with twodierent hypergraphs and then to partition these two hypergraphs simultaneously Ithas been shown how to partition two hypergraphs simultaneously in [27] The essentialidea which is rened in [28] is to build a composite hypergraph by amalgamatingthe relevant vertices of the two hypergraphs while keeping the nets intact In ourcase the two hypergraphs would be the model for the diagonal entries associatedwith the column subscripts (forward phase) and the model for the diagonal entriesassociated with the row subscripts (backward phase) again assuming that the sameindices are distinguished by associating them with dierent leaf nodes We have thento amalgamate any two vertices i and j where the entry aminus1

ij is requested

Figure 41 shows an example where the requested entries are aminus171 a

minus162 and aminus1

95 The transformed elimination tree and the nets of the hypergraphs associated with theforward (hfwd) and backward (hbwd) solves are shown Note that the nets hfwd

3 aswell as hbwd

3 hbwd4 and hbwd

8 are removed The nodes of the tree which correspond to

18 AMESTOY ET AL

h2

h4

h5

h1

V1

V2

h10

h9

h7

h6

fwd

fwd

fwd

fwd

bwd

bwd

bwd

bwd

5 9

1 7

2 6h

4

h1

h2

4

5

3

2

1 8

7

9

6

h7h

6

10 h10

bwd

h9

bwd

bwdbwd

fwd

fwd

fwd

h5

fwd

2

1

6 7

5

9

h5

bwd

h5

bwd

Figure 41 Example of hypergraph model for the general case aminus171 a

minus162 and aminus1

95 are requested

the vertices of the hypergraph for the forward solves are shaded with light grey thosenodes which correspond to the vertices of the hypergraph for the backward solvesare shaded with dark grey The composite hypergraph is shown in the right-handgure The amalgamation of light and dark grey vertices is done according to therequested entries (vertex i and vertex j are amalgamated for a requested entry aminus1

ij )

A partition is given in the right-hand gure Π = aminus162 a

minus171 a

minus195 The cut size

is c(hbwd5 ) + c(hfwd

4 ) + c(hfwd5 ) Consider the computation of aminus1

62 We need to load theL factors associated with the nodes 2 3 4 and 5 and the U factors associated with5 4 3 and 6 Now consider the computation of aminus1

71 and aminus195 the L factors associated

with 1 4 and 5 and the U factors associated with 5 10 8 7 and 9 are loaded Inthe forward solution the L factors associated with 4 and 5 are loaded twice (insteadof once if we were able to solve for all of them in a single pass) and in the backwardsolution the U factor associated with 5 is loaded twice (instead of once) The cutsizeagain corresponds to these extra loads

We note that building such a hypergraph for the case where only diagonal entriesare requested yields the hypergraph of the previous section where each hyperedge isrepeated twice

5 Experiments We conduct three sets of experiments In the rst set wecompare the quality of the results obtained by the PoPart and BiseMatch heuris-tics using Matlab implementations of these algorithms For these experiments wecreated a large set of TreePartitioning problems each of which is associated withcomputing some diagonal entries in the inverse of a sparse matrix In the second setof experiments we use an implementation of PoPart in Fortran that we integratedinto the MUMPS solver [3] We use this to investigate the performance of PoPart onpractical cases using the out-of-core option of MUMPS In this set of experimentswe compute a set of entries from the diagonal of the inverse of matrices from twodierent data sets the rst set contains a few matrices coming from the astrophysicsapplication [7] briey described in Section 1 the second set contains some more ma-trices that are publicly available In the third set of experiments we carry out someexperiments with the hypergraph model for the o-diagonal case

51 Assessing the heuristics Our rst set of experiments compares the heu-ristics PoPart and the BiseMatch which were discussed in Sections 31 and 33We have implemented these heuristics in Matlab We use a set of matrices from

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 51 by giving results with B isin 4 16 64 256(the last row relates to all B values mentioned in the previous paragraph) Oursubsequent discussion relates to the complete set of experiments and not just thoseshown in Table 51 In order to create this table we computed the lower bound forall tree partitioning instances and computed the ratio of the costs found by PoPartand BiseMatch to the lower bound Next for a triplet of a matrix B and P wetook the average of the ratios of the 10 random instances and stored that averageresult for the triplet Then we took the minimum the maximum and the average ofthe 61 dierent triplets with the same B and P As seen in this table both heuristicsobtain results that are close to the lower bound PoParts average result is about104 times the lower bound and BiseMatchs average result is about 101 timesthe lower bound The PoPart heuristic attains the exact lower bound in only onetriplet while BiseMatch attains the lower bound for all instances with B = 2 (recallthat it is based on the exact algorithm Match) and for some other ve triplets Themaximum deviation from the lower bound is about 10 with BiseMatch whereas itis 30 with PoPart Given that in most cases the algorithms perform close to theaverage gures we conclude that both are ecient enough to be useful in the contextof the out-of-core solver

The BiseMatch heuristic almost always obtains better results than PoPart inonly 7 out of 56 times 61 triplets did PoPart obtain better results than BiseMatchFor all P the performance of BiseMatch with respect to the lower bound becameworse as B increases Although there are uctuations in the performance of PoPartfor small values of P eg for 005 and 010 for larger values of P the performancealso becomes worse with larger values of B We suspect that the lower bound mightbe loose for large values of B For all B the performance of BiseMatch with respectto the lower bound improves when P increases A similar trend is observable forPoPart except for a small deviation for B = 256 Recall that the trees we usehere come from a nested dissection ordering Such ordering schemes are known toproduce wide and balanced trees This fact combined with the fact that when a highpercentage of the diagonal entries are requested the trees will not have their structurechanged much by removal and addition of leaf nodes may explain why the heuristicsperform better at larger values of P for a given B

52 Practical tests with a direct solver We have implemented the heuristicPoPart in Fortran and integrated it into the MUMPS solver [3] The implementation of

20 AMESTOY ET AL

Table 51The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower

bound The numbers represent the average over 61 dierent matrices of the ratios of the results ofthe heuristics to the lower bound discussed in the text The column B corresponds to the maximumallowable block size the column P corresponds to the requested percentage of diagonal entries

B P PoPart BiseMatchmin max avg min max avg

4 005 10011 11751 10278 10000 10139 10013010 10005 11494 10192 10000 10073 10005020 10003 10945 10119 10000 10052 10003040 10001 10585 10072 10000 10031 10001060 10001 10449 10053 10000 10019 10001080 10000 10367 10043 10000 10029 10001100 10000 10491 10038 10000 10101 10002

16 005 10050 11615 10592 10000 10482 10113010 10026 11780 10485 10000 10553 10075020 10016 12748 10374 10000 10334 10035040 10007 11898 10246 10000 10230 10016060 10005 11431 10186 10000 10166 10010080 10004 11136 10154 10000 10190 10011100 10003 11052 10133 10000 10096 10008

64 005 10132 11581 10800 10000 10797 10275010 10101 11691 10715 10002 10584 10196020 10054 11389 10599 10001 10506 10125040 10030 11843 10497 10000 10437 10079060 10020 12362 10407 10000 11022 10072080 10015 13018 10383 10000 10344 10044100 10014 12087 10315 10000 10141 10024

256 005 10050 11280 10651 10000 10867 10342010 10127 11533 10721 10003 10911 10314020 10133 11753 10730 10002 10722 10257040 10093 11598 10668 10003 10540 10187060 10068 11621 10602 10002 10572 10174080 10068 11314 10563 10001 10515 10120100 10043 11203 10495 10001 10677 10118

Over all triplets 10000 13018 10359 10000 11110 10079

the computation of a set of inverse entries exploiting sparsity within MUMPS is describedin [24] In this section we give results obtained by using MUMPS with the out-of-coreoption and a nested dissection ordering provided by MeTiS [18] All experiments havebeen performed with direct IO access to les so that we can guarantee eective diskaccess independently of both the size of the factors and the size of the main memoryThe benets resulting from the use of direct IO mechanisms during the solution phaseare discussed in [2] All results are obtained on a dual-core Intel Core2 Duo P8800processor having a 280 GHz clock speed We have used only one of the cores andwe did not use threaded BLAS We use four matrices from the real life astrophysicsapplication [7] briey described in Section 1 The names of these matrices start withCESR and continue with the size of the matrix We use an additional set of fourmatrices (af23560 ecl32 stokes64 boyd1) from the UFL sparse matrix collectionwith very dierent nonzero patterns yielding a set of elimination trees with varyingstructural properties (such as height and width of the tree and variations in nodedegrees)

Table 52 shows the total size of the factors loaded and the execution time of thesolution phase of MUMPS with dierent settings and partitions All diagonal entriesof the inverse of the given matrices are computed with B = 16 64 In this table

COMPUTING INVERSE ENTRIES 21

Table 52The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes All diagonal entries are requested The out-of-core executions use direct IO accessto the les NoES refers to the traditional solution without exploiting sparsity The columns ES-Natand ES-PoP refer to exploiting the sparsity of the right-hand side vectors under respectively anatural partitioning and the PoPart heuristic

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound NoES ES-Nat ES-PoP NoES ES-Nat ES-PoPCESR21532 16 5403 63313 7855 5422 11138 694 393

64 1371 15828 2596 1389 3596 381 166CESR46799 16 2399 114051 3158 2417 39623 767 477

64 620 28512 1176 635 8663 513 285CESR72358 16 1967 375737 6056 2008 108009 2637 718

64 528 93934 4796 571 31740 2741 520CESR148286 16 8068 1595645 16595 8156 433967 7207 2685

64 2092 398911 11004 2179 140493 7267 1998af2356 16 16720 114672 17864 16745 20806 2411 1976

64 4215 28668 5245 4245 6685 1210 595ecl32 16 95478 618606 141533 95566 121847 27263 17606

64 23943 154651 43429 24046 35255 9741 4829stokes64 16 721 8503 1026 726 1312 142 85

64 185 2125 425 189 488 102 41boyd1 16 2028 75521 4232 2031 165512 3898 2149

64 515 18880 1406 518 54927 2305 1212

the values in column Lower bound are computed according to Theorem 31 Thecolumn NoES corresponds to the computational scheme where the sparsity of theright-hand side vectors involved in computing the diagonal entries is not exploitedThe columns ES-Nat and ES-PoP correspond to the computational scheme wherethe sparsity of the right-hand side vectors are exploited to speed up the solutionprocess These columns correspond respectively to the natural partitioning (theindices are partitioned in the natural order into blocks of size B) and to the PoPartheuristic As seen in column ES-Nat most of the gain in the total size of the factorsloaded and in the execution time is due to exploiting the sparsity of the right-handside vectors Furthermore when reordering the right-hand sides following a postorder(PoPart) the total number of loaded factors is once again reduced signicantlyresulting in a noticeable impact on the execution time We also see that the larger theblock size the better PoPart performs compared to the natural partitioning Thiscan be intuitively explained as follows within an epoch computations are performedon a union of paths hence the natural ordering is likely to have more trouble withincreasing epoch size because it will combine nodes far from each other

We see that the execution times are proportional to the total volume of loadedfactors On a majority of problems the execution time is largely dominated by thetime spent reading the factors from the disk which explains this behaviour forexample on matrix ecl32 95 of the time is spent on IO On a few problemsonly the time spent in IO represents a signicant but not dominant part of theruntime hence slightly dierent results (eg on CESR72358 the time for loading thefactors represents less than a third of the total time for ES-PoP) On such matricesincreasing the block size is likely to increase the number of operations and thus thetime as explained in Section 61 for the in-core case

22 AMESTOY ET AL

Table 53The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes A random set of N10 o-diagonal entries are computed with MUMPS Out-Of-Coreexecutions are with direct IO access to the les The columns ES-PoP and ES-HP correspondrespectively to the case where a post-order on the column indices and a hypergraph partitioningroutine are used to partition the requested entries into blocks of size 16 and 64

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound ES-PoP ES-HP ES-PoP ES-HPCESR21532 16 563 1782 999 566 169

64 164 703 464 283 136CESR46799 16 264 549 416 322 251

64 93 232 195 355 253CESR72358 16 242 1124 868 1229 738

64 116 794 598 995 728CESR148286 16 905 3175 2693 4260 3217

64 345 2080 1669 2810 2358af23560 16 1703 3579 2463 1092 664

64 458 1219 1003 471 343ecl32 16 9617 22514 12615 5077 2652

64 2483 7309 4664 1992 1199stokes64 16 77 188 149 29 23

64 26 75 74 19 17boyd1 16 205 481 258 390 344

64 55 198 93 259 242

53 Hypergraph model We have performed a set of experiments with thehypergraph model introduced in Section 4 in an attempt to see how it performs inpractice and to set a base case method for future developments Table 53 summa-rizes some tests with the model The matrices are the same as before The testsare again conducted using the out-of-core option of MUMPS and standard settingsincluding an ordering based on nested dissection A random selection of N10 o-diagonal entries (no two in the same column) are computed with B = 16 64 Thetable displays the lower bound (given in Section 4) and the total size of the factorsloaded with a PoPart partition on the column indices and with a partition on thehypergraph models using PaToH [11] with default options except that we have re-quested a tighter balance among part sizes As expected the formulation based onhypergraph partitioning obtains better results than that based on post-order Thereis however a huge dierence between the lower bounds and the performance of theheuristics We performed additional tests with the hypergraph model on the diagonalcase and observed that its performance was similar to that of PoPart (which wehave shown to be very eective) We think therefore that the performance of thehypergraph based formulation should be again reasonable and that the lower boundis too loose to judge the eectiveness of the heuristic However as pointed out beforehypergraph models can rapidly become huge so further studies are needed

6 Similar problems and related work In this section we briey presentextensions and variations of the problem Firstly we show the dierence betweenthe in-core and the out-of-core cases Then we address the problem of exploitingparallelism indeed the partitionings presented above can limit tree-parallelism andwe propose a simple heuristic to remedy this

61 In-core case In an in-core context the relevant metric is the number ofoating-point operations (ops) performed When processing several right-hand sides

COMPUTING INVERSE ENTRIES 23

3

1

2

Figure 61 Each internal node has unit weight and the number of operations performed ateach node is equal to the number of entries in the processed block In the out-of-core context thepartition 1 2 3 is better than the partition 1 3 2 because the total size of loaded factorsare 5 units (4 for the rst block and 1 for the second) vs 6 units (2 + 4) however in the in-corecontext the situation is reversed because the total operation counts are 10 (2 operations for 4 nodesand 1 operation for 2 nodes) vs 8 (decomposed as 2times 2 + 1times 4)

at the same time block computations are performed on the union of the structuresof these vectors hence there are more computations than there would be if theseright-hand sides were processed one-by-one of course the interest is to benet fromdense kernels such as the BLAS and thus to process the right-hand sides by blocksof reasonable size

A rst observation can be made about the block size in the in-core context theoptimal block size in terms of the number of oating-point operations is one becauseno extra operations are performed Conversely putting all the right-hand sides in asingle block represents the worst case because a maximum amount of extra operationsis introduced In the out-of-core case things are completely dierent processing allthe right-hand sides in one shot is the best strategy because all the nodes (pieces offactors stored on the hard drive) are loaded only once conversely processing all theright-hand sides one by one implies accessing each node a maximum number of timesTherefore the choice of a block size will be dierent in each case for in-core we willchoose a block size which gives a good trade-o between dense kernel eciency andthe number of extra operations introduced for out-of-core we will try to maximizethe block size (constrained by the available memory)

One might think that for a given block size partitions that perform well in out-of-core should be ecient in the in-core case as well Unfortunately this is not thecase and Figure 61 provides a counter-example The tree shown corresponds tothe usual assembly tree Assume that each internal node has unit weight and thata unit number of operations is performed for each requested entry in a given setThe partition 1 2 3 is better than the partition 1 3 2 in the out-of-corecontext but in the in-core case the second partition results in fewer operations

In a dierent context [29] a post-order based method and a hypergraph parti-tioning model have been shown to be useful in reducing the number of (redundant)operations in supernodal triangular solves with many sparse right-hand sides We havenot investigated the eects on the computation of the entries of the inverse howeveras outlined above the objectives in the out-of-core case for which our heuristics areproposed are dierent than the objective of minimizing the operation count

24 AMESTOY ET AL

62 Parallelism In MUMPS a subtree to subcube mapping [15] is performedon the lower part of the tree during the analysis phase Nodes in the lower part ofthe tree are likely to be mapped on a single processor whereas nodes in the upperpart of the tree are mapped onto several processors Since the partitioning strategiesdescribed above tend to put together nodes which are close in the elimination tree fewprocessors (probably only one) will be active in the lower part of tree when processinga block of right-hand sides An interleaving strategy was suggested in [24] to remedythis situation it consists in interleaving the dierent blocks of right-hand sides so thatevery processor will be active when a block of entries is computed The main drawbackof this strategy is that it tends to lose the overlapping gained by partitioning Whenlocal disks are attached to processors then the gain in global bandwidth balances thisloss Further work is needed to design strategies which partition the requested entriesso that parallelism is ensured without increasing either the number of operations orthe number of disk accesses

63 Related work Most of the related work addresses the case of computingthe whole diagonal of the inverse of a given matrix Among these in the studiesregarding the applications mentioned in Section 1 the diagonal entries are computedusing direct methods Tang and Saad [26] address the same problem (computing thediagonal of the inverse) with an iterative method focusing on matrices whose inverseshave a decay property

For the general o-diagonal case not much work has been done To computeentries in Aminus1 one can use the algorithm given in [13] This algorithm relies onequations derived by Takahashi et al [25] Given an LU factorization of an N times Nsparse matrix A the algorithm computes the parts of the inverse of A that correspondto the positions of the nonzeros of (L+U)T starting from entry (NN) and proceedingin a reverse Crout order At every step an entry of the inverse is computed usingthe factors L and U and the already computed entries of the inverse This approachis later extended [23] for a set of entries of the inverse rather than the whole set inthe pattern of (L+ U)T The algorithm has been implemented in a multifrontal-likeapproach [8]

If all entries in the pattern of (L + U)T are requested then the method thatimplements the algorithm in [13] might be advantageous whereas methods based onthe traditional solution of linear systems have to solve at least n linear systems andrequire considerably more memory On the other hand if a set of entries in the inverseis requested any implementation based on the equations by Takahashi et al shouldset up necessary data structures and determine the computational order to computeall the entries that are necessary to compute those requested This seems to be arather time consuming operation

7 Conclusion We have addressed the problem of eciently computing someentries of the inverse of a sparse matrix from computed sparse factors and using theelimination tree We have shown that only factors from paths between a node and theroot are required both in the forward and the backward substitution phases We thenexamined the ecient computation of multiple entries of the inverse particularly inthe case where the factors are held out-of-core

We have proposed several strategies for minimizing the cost of computing multipleinverse entries The issue here is that memory considerations restrict how many entriescan be computed simultaneously so that we need to partition the requested entriesto respect this constraint We showed that this problem is NP-complete so that it isnecessary to develop heuristics for doing this partitioning

COMPUTING INVERSE ENTRIES 25

We describe a very simple heuristic PoPart which is based on a post-ordering ofthe elimination tree and then a partitioning of the nodes in sequential parts accordingto this post-order We showed that this is a 2-approximation algorithm Althoughwe showed that the TreePartitioning problem cannot be approximated arbitrarilyclosely there remains a gap to ll and in future work we will strive at designingapproximation algorithms with a better ratio

We presented an exact algorithm for the case when two entries are computed at atime By using this exact algorithm repeatedly we developed another heuristic Bise-Match for partition sizes that are powers of two We performed extensive tests onthe heuristics and have concluded that both PoPart and BiseMatch perform verywell on average where the worst case the performance of the BiseMatch is betterBy comparing the performance of these heuristics with computable lower bounds wesaw that they give very eective partitionings We implemented the PoPart heuris-tic within the MUMPS solver and reported experimental results with MUMPS Theseconrmed the eectiveness of the PoPart heuristic

The heuristics PoPart and BiseMatch were designed for the case where onlydiagonal entries of the inverse are requested To accommodate the case when o-diagonal entries are wanted we have proposed a formulation based on a hypergraphpartitioning In this model a hypergraph is built so that the cutsize of the partitioncorresponds exactly to the increase in the total size of factors loaded Although the sizeof the hypergraph model can be large the model is powerful enough to represent boththe diagonal and the o-diagonal cases We also performed tests with the hypergraphmodel and concluded that it can be used eectively for cases where a small numberof entries in the inverse are requested

We briey described a technique to improve the performance for parallel ex-ecution and showed dierences that apply when the factorization is held in-coreAlthough we have made the rst steps for showing the ecient computation of o-diagonal inverse entries more work should be done in that case to obtain practicalalgorithms when many entries are requested

REFERENCES

[1] E Agullo On the out-of-core factorization of large sparse matrices PhD thesis Ecole Nor-male Supeacuterieure de Lyon 2008

[2] P R Amestoy I S Duff A Guermouche and Tz Slavova Analysis of the solu-tion phase of a parallel multifrontal approach Parallel Computing 36 (2010) pp 315doi101016jparco200906001

[3] P R Amestoy I S Duff J Koster and J-Y LExcellent A fully asynchronousmultifrontal solver using distributed dynamic scheduling SIAM Journal on Matrix Analysisand Applications 23 (2001) pp 1541

[4] P R Amestoy A Guermouche J-Y LExcellent and S Pralet Hybrid schedulingfor the parallel solution of linear systems Parallel Computing 32 (2006) pp 136156

[5] C Aykanat A Pinar and Uuml V Ccedilatalyuumlrek Permuting sparse rectangular matrices intoblock-diagonal form SIAM Journal on Scientic Computing 25 (2004) pp 18601879

[6] Aring Bjoumlrck Numerical methods for least squares problems Society for Industrial Mathematics1996

[7] L Bouchet J-P Roques P Mandrou A Strong R Diehl F Lebrun and R Ter-rier INTEGRAL SPI observation of the galactic central radian Contribution of discretesources and implication for the diuse emission 1 The Astrophysical Journal 635 (2005)pp 11031115

[8] Y E Campbell and T A Davis Computing the sparse inverse subset an inverse multi-frontal approach Tech Report TR-95-021 CIS Dept Univ of Florida 1995

[9] S Cauley J Jain C K Koh and V Balakrishnan A scalable distributed method forquantum-scale device simulation Journal of Applied Physics 101 (2007) p 123715

26 AMESTOY ET AL

[10] Uuml V Ccedilatalyuumlrek and C Aykanat Hypergraph-partitioning-based decomposition for par-allel sparse-matrix vector multiplication IEEE Transactions on Parallel and DistributedSystems 10 (1999) pp 673693

[11] PaToH A Multilevel Hypergraph Partitioning Tool Version 30 Bilkent UniversityDepartment of Computer Engineering Ankara 06533 Turkey PaToH is available at httpbmiosuedu~umitsoftwarehtm 1999

[12] I S Duff A M Erisman C W Gear and J K Reid Sparsity structure and Gaussianelimination SIGNUM Newsletter 23 (1988) pp 28

[13] A M Erisman and W F Tinney On computing certain elements of the inverse of a sparsematrix Comm ACM 18 (1975) pp 177179

[14] M R Garey and D S Johnson Computers and Intractability A Guide to the Theory ofNP-Completeness W H Freeman amp Co New York NY USA 1979

[15] G A Geist and E G Ng Task scheduling for parallel sparse Cholesky factorization In-ternational Journal of Parallel Programming 18 (1989) pp 291314

[16] J R Gilbert and J W H Liu Elimination structures for unsymmetric sparse LU factorsSIAM J Matrix Analysis and Applications (1993)

[17] J R Gilbert G L Miller and S-H Teng Geometric mesh partitioning Implementa-tion and experiments SIAM Journal on Scientic Computing 19 (1998) pp 20912110

[18] G Karypis and V Kumar MeTiS A Software Package for Partitioning UnstructuredGraphs Partitioning Meshes and Computing Fill-Reducing Orderings of Sparse MatricesVersion 40 University of Minnesota Department of Comp Sci and Eng Army HPCResearch Center Minneapolis 1998

[19] T Lengauer Combinatorial algorithms for integrated circuit layout John Wiley amp Sons IncNew York NY USA 1990

[20] L Lin J Lu L Ying R Car and W E Fast algorithm for extracting the diagonal ofthe inverse matrix with application to the electronic structure analysis of metallic systemsCommunications in Mathematical Sciences 7 (2009) pp 755777

[21] J W H Liu The role of elimination trees in sparse factorization SIAM Journal on MatrixAnalysis and Applications 11 (1990) pp 134172

[22] M Luisier A Schenk W Fichtner and G Klimeck Atomistic simulation of nanowiresin the sp3d5slowast tight-binding formalism From boundary conditions to strain calculationsPhysical Review B 74 (2006) p 205323

[23] H Niessner and K Reichert On computing the inverse of a sparse matrix InternationalJournal for Numerical Methods in Engineering 19 (1983) pp 15131526

[24] Tz Slavova Parallel triangular solution in an out-of-core multifrontal approach for solv-ing large sparse linear systems PhD thesis Institut National Polytechnique de ToulouseToulouse France 2009

[25] K Takahashi J Fagan and M Chin Formation of a sparse bus impedance matrix andits application to short circuit study in Proceedings 8th PICA Conference MinneapolisMinnesota 1973

[26] J Tang and Y Saad A probing method for computing the diagonal of the matrix inverseTech Report umsi-2010-42 Minnesota Supercomputer Institute University of MinnesotaMinneapolis MN 2009

[27] B Uccedilar and C Aykanat Encapsulating multiple communication-cost metrics in partition-ing sparse rectangular matrices for parallel matrix-vector multiplies SIAM Journal onScientic Computing 25 (2004) pp 18371859

[28] Revisiting hypergraph models for sparse matrix partitioning SIAM Review 49 (2007)pp 595603

[29] I Yamazaki X S Li and E G Ng Partitioning Load Balancing and Matrix Orderingin a Parallel Hybrid Solver Presentation at SIAM Conference on Parallel Processing forScientic Computing (PP10) 2010

Page 18: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

18 AMESTOY ET AL

h2

h4

h5

h1

V1

V2

h10

h9

h7

h6

fwd

fwd

fwd

fwd

bwd

bwd

bwd

bwd

5 9

1 7

2 6h

4

h1

h2

4

5

3

2

1 8

7

9

6

h7h

6

10 h10

bwd

h9

bwd

bwdbwd

fwd

fwd

fwd

h5

fwd

2

1

6 7

5

9

h5

bwd

h5

bwd

Figure 41 Example of hypergraph model for the general case aminus171 a

minus162 and aminus1

95 are requested

the vertices of the hypergraph for the forward solves are shaded with light grey thosenodes which correspond to the vertices of the hypergraph for the backward solvesare shaded with dark grey The composite hypergraph is shown in the right-handgure The amalgamation of light and dark grey vertices is done according to therequested entries (vertex i and vertex j are amalgamated for a requested entry aminus1

ij )

A partition is given in the right-hand gure Π = aminus162 a

minus171 a

minus195 The cut size

is c(hbwd5 ) + c(hfwd

4 ) + c(hfwd5 ) Consider the computation of aminus1

62 We need to load theL factors associated with the nodes 2 3 4 and 5 and the U factors associated with5 4 3 and 6 Now consider the computation of aminus1

71 and aminus195 the L factors associated

with 1 4 and 5 and the U factors associated with 5 10 8 7 and 9 are loaded Inthe forward solution the L factors associated with 4 and 5 are loaded twice (insteadof once if we were able to solve for all of them in a single pass) and in the backwardsolution the U factor associated with 5 is loaded twice (instead of once) The cutsizeagain corresponds to these extra loads

We note that building such a hypergraph for the case where only diagonal entriesare requested yields the hypergraph of the previous section where each hyperedge isrepeated twice

5 Experiments We conduct three sets of experiments In the rst set wecompare the quality of the results obtained by the PoPart and BiseMatch heuris-tics using Matlab implementations of these algorithms For these experiments wecreated a large set of TreePartitioning problems each of which is associated withcomputing some diagonal entries in the inverse of a sparse matrix In the second setof experiments we use an implementation of PoPart in Fortran that we integratedinto the MUMPS solver [3] We use this to investigate the performance of PoPart onpractical cases using the out-of-core option of MUMPS In this set of experimentswe compute a set of entries from the diagonal of the inverse of matrices from twodierent data sets the rst set contains a few matrices coming from the astrophysicsapplication [7] briey described in Section 1 the second set contains some more ma-trices that are publicly available In the third set of experiments we carry out someexperiments with the hypergraph model for the o-diagonal case

51 Assessing the heuristics Our rst set of experiments compares the heu-ristics PoPart and the BiseMatch which were discussed in Sections 31 and 33We have implemented these heuristics in Matlab We use a set of matrices from

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 5.1 by giving results with B ∈ {4, 16, 64, 256} (the last row relates to all B values mentioned in the previous paragraph). Our subsequent discussion relates to the complete set of experiments, and not just those shown in Table 5.1. In order to create this table, we computed the lower bound for all tree partitioning instances and computed the ratio of the costs found by PoPart and BiseMatch to the lower bound. Next, for a triplet of a matrix, B, and P, we took the average of the ratios of the 10 random instances and stored that average result for the triplet. Then we took the minimum, the maximum, and the average of the 61 different triplets with the same B and P. As seen in this table, both heuristics obtain results that are close to the lower bound: PoPart's average result is about 1.04 times the lower bound, and BiseMatch's average result is about 1.01 times the lower bound. The PoPart heuristic attains the exact lower bound in only one triplet, while BiseMatch attains the lower bound for all instances with B = 2 (recall that it is based on the exact algorithm Match) and for five other triplets. The maximum deviation from the lower bound is about 10% with BiseMatch, whereas it is 30% with PoPart. Given that in most cases the algorithms perform close to the average figures, we conclude that both are efficient enough to be useful in the context of the out-of-core solver.

The BiseMatch heuristic almost always obtains better results than PoPart: in only 7 out of 56 × 61 triplets did PoPart obtain better results than BiseMatch. For all P, the performance of BiseMatch with respect to the lower bound became worse as B increases. Although there are fluctuations in the performance of PoPart for small values of P, e.g., for 0.05 and 0.10, for larger values of P the performance also becomes worse with larger values of B. We suspect that the lower bound might be loose for large values of B. For all B, the performance of BiseMatch with respect to the lower bound improves when P increases. A similar trend is observable for PoPart, except for a small deviation for B = 256. Recall that the trees we use here come from a nested dissection ordering. Such ordering schemes are known to produce wide and balanced trees. This fact, combined with the fact that, when a high percentage of the diagonal entries are requested, the trees will not have their structure changed much by removal and addition of leaf nodes, may explain why the heuristics perform better at larger values of P for a given B.

5.2. Practical tests with a direct solver. We have implemented the heuristic PoPart in Fortran and integrated it into the MUMPS solver [3]. The implementation of the computation of a set of inverse entries exploiting sparsity within MUMPS is described in [24].


Table 5.1
The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower bound. The numbers represent the average, over 61 different matrices, of the ratios of the results of the heuristics to the lower bound discussed in the text. The column B corresponds to the maximum allowable block size; the column P corresponds to the requested percentage of diagonal entries.

                         PoPart                   BiseMatch
   B      P      min     max     avg       min     max     avg
   4   0.05   1.0011  1.1751  1.0278    1.0000  1.0139  1.0013
       0.10   1.0005  1.1494  1.0192    1.0000  1.0073  1.0005
       0.20   1.0003  1.0945  1.0119    1.0000  1.0052  1.0003
       0.40   1.0001  1.0585  1.0072    1.0000  1.0031  1.0001
       0.60   1.0001  1.0449  1.0053    1.0000  1.0019  1.0001
       0.80   1.0000  1.0367  1.0043    1.0000  1.0029  1.0001
       1.00   1.0000  1.0491  1.0038    1.0000  1.0101  1.0002

  16   0.05   1.0050  1.1615  1.0592    1.0000  1.0482  1.0113
       0.10   1.0026  1.1780  1.0485    1.0000  1.0553  1.0075
       0.20   1.0016  1.2748  1.0374    1.0000  1.0334  1.0035
       0.40   1.0007  1.1898  1.0246    1.0000  1.0230  1.0016
       0.60   1.0005  1.1431  1.0186    1.0000  1.0166  1.0010
       0.80   1.0004  1.1136  1.0154    1.0000  1.0190  1.0011
       1.00   1.0003  1.1052  1.0133    1.0000  1.0096  1.0008

  64   0.05   1.0132  1.1581  1.0800    1.0000  1.0797  1.0275
       0.10   1.0101  1.1691  1.0715    1.0002  1.0584  1.0196
       0.20   1.0054  1.1389  1.0599    1.0001  1.0506  1.0125
       0.40   1.0030  1.1843  1.0497    1.0000  1.0437  1.0079
       0.60   1.0020  1.2362  1.0407    1.0000  1.1022  1.0072
       0.80   1.0015  1.3018  1.0383    1.0000  1.0344  1.0044
       1.00   1.0014  1.2087  1.0315    1.0000  1.0141  1.0024

 256   0.05   1.0050  1.1280  1.0651    1.0000  1.0867  1.0342
       0.10   1.0127  1.1533  1.0721    1.0003  1.0911  1.0314
       0.20   1.0133  1.1753  1.0730    1.0002  1.0722  1.0257
       0.40   1.0093  1.1598  1.0668    1.0003  1.0540  1.0187
       0.60   1.0068  1.1621  1.0602    1.0002  1.0572  1.0174
       0.80   1.0068  1.1314  1.0563    1.0001  1.0515  1.0120
       1.00   1.0043  1.1203  1.0495    1.0001  1.0677  1.0118

Over all triplets   1.0000  1.3018  1.0359    1.0000  1.1110  1.0079

In this section, we give results obtained by using MUMPS with the out-of-core option and a nested dissection ordering provided by MeTiS [18]. All experiments have been performed with direct I/O access to files, so that we can guarantee effective disk access independently of both the size of the factors and the size of the main memory. The benefits resulting from the use of direct I/O mechanisms during the solution phase are discussed in [2]. All results are obtained on a dual-core Intel Core2 Duo P8800 processor having a 2.80 GHz clock speed. We have used only one of the cores, and we did not use threaded BLAS. We use four matrices from the real-life astrophysics application [7] briefly described in Section 1. The names of these matrices start with CESR and continue with the size of the matrix. We use an additional set of four matrices (af23560, ecl32, stokes64, boyd1) from the UFL sparse matrix collection with very different nonzero patterns, yielding a set of elimination trees with varying structural properties (such as height and width of the tree and variations in node degrees).

Table 5.2 shows the total size of the factors loaded and the execution time of the solution phase of MUMPS with different settings and partitions. All diagonal entries of the inverse of the given matrices are computed with B = 16 and B = 64.


Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                             Total size of the                 Running time of the
                         loaded factors (MBytes)               solution phase (s)
matrix        B   Lower bound      NoES   ES-Nat  ES-PoP      NoES  ES-Nat  ES-PoP
CESR21532    16        5403       63313     7855    5422     11138     694     393
             64        1371       15828     2596    1389      3596     381     166
CESR46799    16        2399      114051     3158    2417     39623     767     477
             64         620       28512     1176     635      8663     513     285
CESR72358    16        1967      375737     6056    2008    108009    2637     718
             64         528       93934     4796     571     31740    2741     520
CESR148286   16        8068     1595645    16595    8156    433967    7207    2685
             64        2092      398911    11004    2179    140493    7267    1998
af23560      16       16720      114672    17864   16745     20806    2411    1976
             64        4215       28668     5245    4245      6685    1210     595
ecl32        16       95478      618606   141533   95566    121847   27263   17606
             64       23943      154651    43429   24046     35255    9741    4829
stokes64     16         721        8503     1026     726      1312     142      85
             64         185        2125      425     189       488     102      41
boyd1        16        2028       75521     4232    2031    165512    3898    2149
             64         515       18880     1406     518     54927    2305    1212

In this table, the values in the column "Lower bound" are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process; these columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a post-order (PoPart), the total number of loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other. The two schemes are sketched below.
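The following is a minimal sketch of the two partitioning schemes compared in the table (function names are ours): both cut the requested diagonal indices into blocks of size B, but PoPart first sorts them by their rank in a post-order of the elimination tree.

    def natural_blocks(requested, B):
        requested = sorted(requested)                        # natural index order
        return [requested[k:k + B] for k in range(0, len(requested), B)]

    def popart_blocks(requested, post, B):
        # post[v] is the rank of node v in a post-order of the elimination tree
        ordered = sorted(requested, key=lambda v: post[v])
        return [ordered[k:k + B] for k in range(0, len(ordered), B)]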

We see that the execution times are proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence the slightly different results (e.g., on CESR72358 the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                            Total size of the         Running time of the
                        loaded factors (MBytes)       solution phase (s)
matrix        B   Lower bound   ES-PoP   ES-HP        ES-PoP   ES-HP
CESR21532    16        563        1782     999           566     169
             64        164         703     464           283     136
CESR46799    16        264         549     416           322     251
             64         93         232     195           355     253
CESR72358    16        242        1124     868          1229     738
             64        116         794     598           995     728
CESR148286   16        905        3175    2693          4260    3217
             64        345        2080    1669          2810    2358
af23560      16       1703        3579    2463          1092     664
             64        458        1219    1003           471     343
ecl32        16       9617       22514   12615          5077    2652
             64       2483        7309    4664          1992    1199
stokes64     16         77         188     149            29      23
             64         26          75      74            19      17
boyd1        16        205         481     258           390     344
             64         55         198      93           259     242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model obtained using PaToH [11] with default options, except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed. A sketch of the objective the partitioner works on is given below.
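For reference, the cut size minimized here is the standard connectivity-1 metric; a generic sketch (our naming) follows, where nets[t] is the set of requested entries whose forward or backward solve touches tree node t (each tree node contributes a forward and a backward net in the model of Section 4), w[t] is the size of the factors at t, and part_of[e] is the block of entry e.

    def cutsize(nets, w, part_of):
        total = 0
        for t, pins in nets.items():
            connectivity = len({part_of[e] for e in pins})
            total += w[t] * (connectivity - 1)   # extra loads caused by node t
        return total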

6. Similar problems and related work. In this section, we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism: indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed.


[Figure 6.1 about here: a small assembly tree with leaf nodes 1, 2, and 3.]

Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context, the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

When processing several right-hand sides at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well in out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations. The two metrics can be sketched as follows.
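This is a sketch under the figure's assumptions (unit factor sizes, one operation per requested entry per node); nodes[e] denotes the set of tree nodes on the paths required for entry e, and the names are ours.

    def ooc_loads(parts, nodes):
        # Out-of-core: each block loads the union of its paths exactly once.
        return sum(len(set().union(*(nodes[e] for e in part)))
                   for part in parts)

    def ic_ops(parts, nodes):
        # In-core: every node in the union is processed once per entry in the block.
        return sum(len(part) * len(set().union(*(nodes[e] for e in part)))
                   for part in parts)

A partition minimizing ooc_loads need not minimize ic_ops, which is exactly the reversal exhibited by Figure 6.1.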

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped on a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides, so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.
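A minimal sketch of one possible interleaving, as a round-robin over the original blocks; this is our simplification of the strategy in [24], not its actual implementation. Since each original block is local to a subtree, a rebuilt block mixes entries whose subtrees are mapped to different processors.

    from itertools import chain, zip_longest

    def interleave(blocks, B):
        # Take one entry from each block in turn, then re-cut into blocks of size B.
        mixed = [e for e in chain(*zip_longest(*blocks)) if e is not None]
        return [mixed[k:k + B] for k in range(0, len(mixed), B)]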

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in $A^{-1}$, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an $N \times N$ sparse matrix $A$, the algorithm computes the parts of the inverse of $A$ that correspond to the positions of the nonzeros of $(L+U)^T$, starting from entry $(N, N)$ and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors $L$ and $U$ and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of $(L+U)^T$. The algorithm has been implemented in a multifrontal-like approach [8].

If all entries in the pattern of $(L+U)^T$ are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least $n$ linear systems and require considerably more memory. On the other hand, if a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. should set up the necessary data structures and determine the computational order to compute all the entries that are necessary to compute those requested. This seems to be a rather time-consuming operation.
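To make the "traditional solution of linear systems" baseline concrete, the following is a minimal sketch (ours, not the paper's MUMPS implementation, and without exploiting path sparsity): factorize once, then one sparse solve per requested column, reading off the requested rows. It uses SciPy's splu; the function name and the grouping by column are our own.

    import numpy as np
    from scipy.sparse.linalg import splu

    def selected_inverse_entries(A, requests):
        # requests is an iterable of (i, j) pairs; returns {(i, j): a^{-1}_{ij}}.
        lu = splu(A.tocsc())                 # one sparse LU factorization
        n = A.shape[0]
        by_column = {}
        for i, j in requests:                # group requested rows by column j
            by_column.setdefault(j, []).append(i)
        entries = {}
        for j, rows in by_column.items():
            e_j = np.zeros(n)
            e_j[j] = 1.0
            x = lu.solve(e_j)                # x = inv(A) e_j, one solve per column
            for i in rows:
                entries[(i, j)] = x[i]
        return entries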

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix, from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree and then a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and have concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution and showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.
[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 19: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 19

the University of Florida (UFL) sparse matrix collection (httpwwwciseufleduresearchsparsematrices) The matrices we choose satisfy the followingcharacteristics 10000 le N le 100000 the average number of nonzeros per row isgreater than or equal to 25 and in the UFL index the posdef eld is set to 1At the time of writing there were a total of 61 matrices satisfying these propertiesWe have ordered the matrices using the metisnd routine of the Mesh PartitioningToolbox [17] and built the elimination tree associated with the ordered matricesusing the etree function of Matlab We have experimented with block sizes B isin2 4 8 16 32 64 128 256 We have assigned random weights in the range 1200 totree nodes Then for each P isin 005 010 020 040 060 080 100 we have created10 instances (except for P = 100) by randomly selecting P times N integers between 1and N and designating them as the requested entries in the diagonal of the inverseNotice that for a given triplet of a matrix B and P we have 10 dierent trees topartition resulting in a total of 10times 6times 8times 61 + 8times 61 = 29768 TreePartitioningproblems

We summarize the results in Table 51 by giving results with B isin 4 16 64 256(the last row relates to all B values mentioned in the previous paragraph) Oursubsequent discussion relates to the complete set of experiments and not just thoseshown in Table 51 In order to create this table we computed the lower bound forall tree partitioning instances and computed the ratio of the costs found by PoPartand BiseMatch to the lower bound Next for a triplet of a matrix B and P wetook the average of the ratios of the 10 random instances and stored that averageresult for the triplet Then we took the minimum the maximum and the average ofthe 61 dierent triplets with the same B and P As seen in this table both heuristicsobtain results that are close to the lower bound PoParts average result is about104 times the lower bound and BiseMatchs average result is about 101 timesthe lower bound The PoPart heuristic attains the exact lower bound in only onetriplet while BiseMatch attains the lower bound for all instances with B = 2 (recallthat it is based on the exact algorithm Match) and for some other ve triplets Themaximum deviation from the lower bound is about 10 with BiseMatch whereas itis 30 with PoPart Given that in most cases the algorithms perform close to theaverage gures we conclude that both are ecient enough to be useful in the contextof the out-of-core solver

The BiseMatch heuristic almost always obtains better results than PoPart inonly 7 out of 56 times 61 triplets did PoPart obtain better results than BiseMatchFor all P the performance of BiseMatch with respect to the lower bound becameworse as B increases Although there are uctuations in the performance of PoPartfor small values of P eg for 005 and 010 for larger values of P the performancealso becomes worse with larger values of B We suspect that the lower bound mightbe loose for large values of B For all B the performance of BiseMatch with respectto the lower bound improves when P increases A similar trend is observable forPoPart except for a small deviation for B = 256 Recall that the trees we usehere come from a nested dissection ordering Such ordering schemes are known toproduce wide and balanced trees This fact combined with the fact that when a highpercentage of the diagonal entries are requested the trees will not have their structurechanged much by removal and addition of leaf nodes may explain why the heuristicsperform better at larger values of P for a given B

52 Practical tests with a direct solver We have implemented the heuristicPoPart in Fortran and integrated it into the MUMPS solver [3] The implementation of

20 AMESTOY ET AL

Table 51The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower

bound The numbers represent the average over 61 dierent matrices of the ratios of the results ofthe heuristics to the lower bound discussed in the text The column B corresponds to the maximumallowable block size the column P corresponds to the requested percentage of diagonal entries

B P PoPart BiseMatchmin max avg min max avg

4 005 10011 11751 10278 10000 10139 10013010 10005 11494 10192 10000 10073 10005020 10003 10945 10119 10000 10052 10003040 10001 10585 10072 10000 10031 10001060 10001 10449 10053 10000 10019 10001080 10000 10367 10043 10000 10029 10001100 10000 10491 10038 10000 10101 10002

16 005 10050 11615 10592 10000 10482 10113010 10026 11780 10485 10000 10553 10075020 10016 12748 10374 10000 10334 10035040 10007 11898 10246 10000 10230 10016060 10005 11431 10186 10000 10166 10010080 10004 11136 10154 10000 10190 10011100 10003 11052 10133 10000 10096 10008

64 005 10132 11581 10800 10000 10797 10275010 10101 11691 10715 10002 10584 10196020 10054 11389 10599 10001 10506 10125040 10030 11843 10497 10000 10437 10079060 10020 12362 10407 10000 11022 10072080 10015 13018 10383 10000 10344 10044100 10014 12087 10315 10000 10141 10024

256 005 10050 11280 10651 10000 10867 10342010 10127 11533 10721 10003 10911 10314020 10133 11753 10730 10002 10722 10257040 10093 11598 10668 10003 10540 10187060 10068 11621 10602 10002 10572 10174080 10068 11314 10563 10001 10515 10120100 10043 11203 10495 10001 10677 10118

Over all triplets 10000 13018 10359 10000 11110 10079

the computation of a set of inverse entries exploiting sparsity within MUMPS is describedin [24] In this section we give results obtained by using MUMPS with the out-of-coreoption and a nested dissection ordering provided by MeTiS [18] All experiments havebeen performed with direct IO access to les so that we can guarantee eective diskaccess independently of both the size of the factors and the size of the main memoryThe benets resulting from the use of direct IO mechanisms during the solution phaseare discussed in [2] All results are obtained on a dual-core Intel Core2 Duo P8800processor having a 280 GHz clock speed We have used only one of the cores andwe did not use threaded BLAS We use four matrices from the real life astrophysicsapplication [7] briey described in Section 1 The names of these matrices start withCESR and continue with the size of the matrix We use an additional set of fourmatrices (af23560 ecl32 stokes64 boyd1) from the UFL sparse matrix collectionwith very dierent nonzero patterns yielding a set of elimination trees with varyingstructural properties (such as height and width of the tree and variations in nodedegrees)

Table 52 shows the total size of the factors loaded and the execution time of thesolution phase of MUMPS with dierent settings and partitions All diagonal entriesof the inverse of the given matrices are computed with B = 16 64 In this table

COMPUTING INVERSE ENTRIES 21

Table 52The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes All diagonal entries are requested The out-of-core executions use direct IO accessto the les NoES refers to the traditional solution without exploiting sparsity The columns ES-Natand ES-PoP refer to exploiting the sparsity of the right-hand side vectors under respectively anatural partitioning and the PoPart heuristic

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound NoES ES-Nat ES-PoP NoES ES-Nat ES-PoPCESR21532 16 5403 63313 7855 5422 11138 694 393

64 1371 15828 2596 1389 3596 381 166CESR46799 16 2399 114051 3158 2417 39623 767 477

64 620 28512 1176 635 8663 513 285CESR72358 16 1967 375737 6056 2008 108009 2637 718

64 528 93934 4796 571 31740 2741 520CESR148286 16 8068 1595645 16595 8156 433967 7207 2685

64 2092 398911 11004 2179 140493 7267 1998af2356 16 16720 114672 17864 16745 20806 2411 1976

64 4215 28668 5245 4245 6685 1210 595ecl32 16 95478 618606 141533 95566 121847 27263 17606

64 23943 154651 43429 24046 35255 9741 4829stokes64 16 721 8503 1026 726 1312 142 85

64 185 2125 425 189 488 102 41boyd1 16 2028 75521 4232 2031 165512 3898 2149

64 515 18880 1406 518 54927 2305 1212

the values in column Lower bound are computed according to Theorem 31 Thecolumn NoES corresponds to the computational scheme where the sparsity of theright-hand side vectors involved in computing the diagonal entries is not exploitedThe columns ES-Nat and ES-PoP correspond to the computational scheme wherethe sparsity of the right-hand side vectors are exploited to speed up the solutionprocess These columns correspond respectively to the natural partitioning (theindices are partitioned in the natural order into blocks of size B) and to the PoPartheuristic As seen in column ES-Nat most of the gain in the total size of the factorsloaded and in the execution time is due to exploiting the sparsity of the right-handside vectors Furthermore when reordering the right-hand sides following a postorder(PoPart) the total number of loaded factors is once again reduced signicantlyresulting in a noticeable impact on the execution time We also see that the larger theblock size the better PoPart performs compared to the natural partitioning Thiscan be intuitively explained as follows within an epoch computations are performedon a union of paths hence the natural ordering is likely to have more trouble withincreasing epoch size because it will combine nodes far from each other

We see that the execution times are proportional to the total volume of loadedfactors On a majority of problems the execution time is largely dominated by thetime spent reading the factors from the disk which explains this behaviour forexample on matrix ecl32 95 of the time is spent on IO On a few problemsonly the time spent in IO represents a signicant but not dominant part of theruntime hence slightly dierent results (eg on CESR72358 the time for loading thefactors represents less than a third of the total time for ES-PoP) On such matricesincreasing the block size is likely to increase the number of operations and thus thetime as explained in Section 61 for the in-core case

22 AMESTOY ET AL

Table 53The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes A random set of N10 o-diagonal entries are computed with MUMPS Out-Of-Coreexecutions are with direct IO access to the les The columns ES-PoP and ES-HP correspondrespectively to the case where a post-order on the column indices and a hypergraph partitioningroutine are used to partition the requested entries into blocks of size 16 and 64

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound ES-PoP ES-HP ES-PoP ES-HPCESR21532 16 563 1782 999 566 169

64 164 703 464 283 136CESR46799 16 264 549 416 322 251

64 93 232 195 355 253CESR72358 16 242 1124 868 1229 738

64 116 794 598 995 728CESR148286 16 905 3175 2693 4260 3217

64 345 2080 1669 2810 2358af23560 16 1703 3579 2463 1092 664

64 458 1219 1003 471 343ecl32 16 9617 22514 12615 5077 2652

64 2483 7309 4664 1992 1199stokes64 16 77 188 149 29 23

64 26 75 74 19 17boyd1 16 205 481 258 390 344

64 55 198 93 259 242

53 Hypergraph model We have performed a set of experiments with thehypergraph model introduced in Section 4 in an attempt to see how it performs inpractice and to set a base case method for future developments Table 53 summa-rizes some tests with the model The matrices are the same as before The testsare again conducted using the out-of-core option of MUMPS and standard settingsincluding an ordering based on nested dissection A random selection of N10 o-diagonal entries (no two in the same column) are computed with B = 16 64 Thetable displays the lower bound (given in Section 4) and the total size of the factorsloaded with a PoPart partition on the column indices and with a partition on thehypergraph models using PaToH [11] with default options except that we have re-quested a tighter balance among part sizes As expected the formulation based onhypergraph partitioning obtains better results than that based on post-order Thereis however a huge dierence between the lower bounds and the performance of theheuristics We performed additional tests with the hypergraph model on the diagonalcase and observed that its performance was similar to that of PoPart (which wehave shown to be very eective) We think therefore that the performance of thehypergraph based formulation should be again reasonable and that the lower boundis too loose to judge the eectiveness of the heuristic However as pointed out beforehypergraph models can rapidly become huge so further studies are needed

6 Similar problems and related work In this section we briey presentextensions and variations of the problem Firstly we show the dierence betweenthe in-core and the out-of-core cases Then we address the problem of exploitingparallelism indeed the partitionings presented above can limit tree-parallelism andwe propose a simple heuristic to remedy this

61 In-core case In an in-core context the relevant metric is the number ofoating-point operations (ops) performed When processing several right-hand sides

COMPUTING INVERSE ENTRIES 23

3

1

2

Figure 61 Each internal node has unit weight and the number of operations performed ateach node is equal to the number of entries in the processed block In the out-of-core context thepartition 1 2 3 is better than the partition 1 3 2 because the total size of loaded factorsare 5 units (4 for the rst block and 1 for the second) vs 6 units (2 + 4) however in the in-corecontext the situation is reversed because the total operation counts are 10 (2 operations for 4 nodesand 1 operation for 2 nodes) vs 8 (decomposed as 2times 2 + 1times 4)

at the same time block computations are performed on the union of the structuresof these vectors hence there are more computations than there would be if theseright-hand sides were processed one-by-one of course the interest is to benet fromdense kernels such as the BLAS and thus to process the right-hand sides by blocksof reasonable size

A rst observation can be made about the block size in the in-core context theoptimal block size in terms of the number of oating-point operations is one becauseno extra operations are performed Conversely putting all the right-hand sides in asingle block represents the worst case because a maximum amount of extra operationsis introduced In the out-of-core case things are completely dierent processing allthe right-hand sides in one shot is the best strategy because all the nodes (pieces offactors stored on the hard drive) are loaded only once conversely processing all theright-hand sides one by one implies accessing each node a maximum number of timesTherefore the choice of a block size will be dierent in each case for in-core we willchoose a block size which gives a good trade-o between dense kernel eciency andthe number of extra operations introduced for out-of-core we will try to maximizethe block size (constrained by the available memory)

One might think that for a given block size partitions that perform well in out-of-core should be ecient in the in-core case as well Unfortunately this is not thecase and Figure 61 provides a counter-example The tree shown corresponds tothe usual assembly tree Assume that each internal node has unit weight and thata unit number of operations is performed for each requested entry in a given setThe partition 1 2 3 is better than the partition 1 3 2 in the out-of-corecontext but in the in-core case the second partition results in fewer operations

In a dierent context [29] a post-order based method and a hypergraph parti-tioning model have been shown to be useful in reducing the number of (redundant)operations in supernodal triangular solves with many sparse right-hand sides We havenot investigated the eects on the computation of the entries of the inverse howeveras outlined above the objectives in the out-of-core case for which our heuristics areproposed are dierent than the objective of minimizing the operation count

24 AMESTOY ET AL

62 Parallelism In MUMPS a subtree to subcube mapping [15] is performedon the lower part of the tree during the analysis phase Nodes in the lower part ofthe tree are likely to be mapped on a single processor whereas nodes in the upperpart of the tree are mapped onto several processors Since the partitioning strategiesdescribed above tend to put together nodes which are close in the elimination tree fewprocessors (probably only one) will be active in the lower part of tree when processinga block of right-hand sides An interleaving strategy was suggested in [24] to remedythis situation it consists in interleaving the dierent blocks of right-hand sides so thatevery processor will be active when a block of entries is computed The main drawbackof this strategy is that it tends to lose the overlapping gained by partitioning Whenlocal disks are attached to processors then the gain in global bandwidth balances thisloss Further work is needed to design strategies which partition the requested entriesso that parallelism is ensured without increasing either the number of operations orthe number of disk accesses

63 Related work Most of the related work addresses the case of computingthe whole diagonal of the inverse of a given matrix Among these in the studiesregarding the applications mentioned in Section 1 the diagonal entries are computedusing direct methods Tang and Saad [26] address the same problem (computing thediagonal of the inverse) with an iterative method focusing on matrices whose inverseshave a decay property

For the general o-diagonal case not much work has been done To computeentries in Aminus1 one can use the algorithm given in [13] This algorithm relies onequations derived by Takahashi et al [25] Given an LU factorization of an N times Nsparse matrix A the algorithm computes the parts of the inverse of A that correspondto the positions of the nonzeros of (L+U)T starting from entry (NN) and proceedingin a reverse Crout order At every step an entry of the inverse is computed usingthe factors L and U and the already computed entries of the inverse This approachis later extended [23] for a set of entries of the inverse rather than the whole set inthe pattern of (L+ U)T The algorithm has been implemented in a multifrontal-likeapproach [8]

If all entries in the pattern of (L + U)T are requested then the method thatimplements the algorithm in [13] might be advantageous whereas methods based onthe traditional solution of linear systems have to solve at least n linear systems andrequire considerably more memory On the other hand if a set of entries in the inverseis requested any implementation based on the equations by Takahashi et al shouldset up necessary data structures and determine the computational order to computeall the entries that are necessary to compute those requested This seems to be arather time consuming operation

7 Conclusion We have addressed the problem of eciently computing someentries of the inverse of a sparse matrix from computed sparse factors and using theelimination tree We have shown that only factors from paths between a node and theroot are required both in the forward and the backward substitution phases We thenexamined the ecient computation of multiple entries of the inverse particularly inthe case where the factors are held out-of-core

We have proposed several strategies for minimizing the cost of computing multipleinverse entries The issue here is that memory considerations restrict how many entriescan be computed simultaneously so that we need to partition the requested entriesto respect this constraint We showed that this problem is NP-complete so that it isnecessary to develop heuristics for doing this partitioning

COMPUTING INVERSE ENTRIES 25

We describe a very simple heuristic PoPart which is based on a post-ordering ofthe elimination tree and then a partitioning of the nodes in sequential parts accordingto this post-order We showed that this is a 2-approximation algorithm Althoughwe showed that the TreePartitioning problem cannot be approximated arbitrarilyclosely there remains a gap to ll and in future work we will strive at designingapproximation algorithms with a better ratio

We presented an exact algorithm for the case when two entries are computed at atime By using this exact algorithm repeatedly we developed another heuristic Bise-Match for partition sizes that are powers of two We performed extensive tests onthe heuristics and have concluded that both PoPart and BiseMatch perform verywell on average where the worst case the performance of the BiseMatch is betterBy comparing the performance of these heuristics with computable lower bounds wesaw that they give very eective partitionings We implemented the PoPart heuris-tic within the MUMPS solver and reported experimental results with MUMPS Theseconrmed the eectiveness of the PoPart heuristic

The heuristics PoPart and BiseMatch were designed for the case where onlydiagonal entries of the inverse are requested To accommodate the case when o-diagonal entries are wanted we have proposed a formulation based on a hypergraphpartitioning In this model a hypergraph is built so that the cutsize of the partitioncorresponds exactly to the increase in the total size of factors loaded Although the sizeof the hypergraph model can be large the model is powerful enough to represent boththe diagonal and the o-diagonal cases We also performed tests with the hypergraphmodel and concluded that it can be used eectively for cases where a small numberof entries in the inverse are requested

We briey described a technique to improve the performance for parallel ex-ecution and showed dierences that apply when the factorization is held in-coreAlthough we have made the rst steps for showing the ecient computation of o-diagonal inverse entries more work should be done in that case to obtain practicalalgorithms when many entries are requested

REFERENCES

[1] E Agullo On the out-of-core factorization of large sparse matrices PhD thesis Ecole Nor-male Supeacuterieure de Lyon 2008

[2] P R Amestoy I S Duff A Guermouche and Tz Slavova Analysis of the solu-tion phase of a parallel multifrontal approach Parallel Computing 36 (2010) pp 315doi101016jparco200906001

[3] P R Amestoy I S Duff J Koster and J-Y LExcellent A fully asynchronousmultifrontal solver using distributed dynamic scheduling SIAM Journal on Matrix Analysisand Applications 23 (2001) pp 1541

[4] P R Amestoy A Guermouche J-Y LExcellent and S Pralet Hybrid schedulingfor the parallel solution of linear systems Parallel Computing 32 (2006) pp 136156

[5] C Aykanat A Pinar and Uuml V Ccedilatalyuumlrek Permuting sparse rectangular matrices intoblock-diagonal form SIAM Journal on Scientic Computing 25 (2004) pp 18601879

[6] Aring Bjoumlrck Numerical methods for least squares problems Society for Industrial Mathematics1996

[7] L Bouchet J-P Roques P Mandrou A Strong R Diehl F Lebrun and R Ter-rier INTEGRAL SPI observation of the galactic central radian Contribution of discretesources and implication for the diuse emission 1 The Astrophysical Journal 635 (2005)pp 11031115

[8] Y E Campbell and T A Davis Computing the sparse inverse subset an inverse multi-frontal approach Tech Report TR-95-021 CIS Dept Univ of Florida 1995

[9] S Cauley J Jain C K Koh and V Balakrishnan A scalable distributed method forquantum-scale device simulation Journal of Applied Physics 101 (2007) p 123715

26 AMESTOY ET AL

[10] Uuml V Ccedilatalyuumlrek and C Aykanat Hypergraph-partitioning-based decomposition for par-allel sparse-matrix vector multiplication IEEE Transactions on Parallel and DistributedSystems 10 (1999) pp 673693

[11] PaToH A Multilevel Hypergraph Partitioning Tool Version 30 Bilkent UniversityDepartment of Computer Engineering Ankara 06533 Turkey PaToH is available at httpbmiosuedu~umitsoftwarehtm 1999

[12] I S Duff A M Erisman C W Gear and J K Reid Sparsity structure and Gaussianelimination SIGNUM Newsletter 23 (1988) pp 28

[13] A M Erisman and W F Tinney On computing certain elements of the inverse of a sparsematrix Comm ACM 18 (1975) pp 177179

[14] M R Garey and D S Johnson Computers and Intractability A Guide to the Theory ofNP-Completeness W H Freeman amp Co New York NY USA 1979

[15] G A Geist and E G Ng Task scheduling for parallel sparse Cholesky factorization In-ternational Journal of Parallel Programming 18 (1989) pp 291314

[16] J R Gilbert and J W H Liu Elimination structures for unsymmetric sparse LU factorsSIAM J Matrix Analysis and Applications (1993)

[17] J R Gilbert G L Miller and S-H Teng Geometric mesh partitioning Implementa-tion and experiments SIAM Journal on Scientic Computing 19 (1998) pp 20912110

[18] G Karypis and V Kumar MeTiS A Software Package for Partitioning UnstructuredGraphs Partitioning Meshes and Computing Fill-Reducing Orderings of Sparse MatricesVersion 40 University of Minnesota Department of Comp Sci and Eng Army HPCResearch Center Minneapolis 1998

[19] T Lengauer Combinatorial algorithms for integrated circuit layout John Wiley amp Sons IncNew York NY USA 1990

[20] L Lin J Lu L Ying R Car and W E Fast algorithm for extracting the diagonal ofthe inverse matrix with application to the electronic structure analysis of metallic systemsCommunications in Mathematical Sciences 7 (2009) pp 755777

[21] J W H Liu The role of elimination trees in sparse factorization SIAM Journal on MatrixAnalysis and Applications 11 (1990) pp 134172

[22] M Luisier A Schenk W Fichtner and G Klimeck Atomistic simulation of nanowiresin the sp3d5slowast tight-binding formalism From boundary conditions to strain calculationsPhysical Review B 74 (2006) p 205323

[23] H Niessner and K Reichert On computing the inverse of a sparse matrix InternationalJournal for Numerical Methods in Engineering 19 (1983) pp 15131526

[24] Tz Slavova Parallel triangular solution in an out-of-core multifrontal approach for solv-ing large sparse linear systems PhD thesis Institut National Polytechnique de ToulouseToulouse France 2009

[25] K Takahashi J Fagan and M Chin Formation of a sparse bus impedance matrix andits application to short circuit study in Proceedings 8th PICA Conference MinneapolisMinnesota 1973

[26] J Tang and Y Saad A probing method for computing the diagonal of the matrix inverseTech Report umsi-2010-42 Minnesota Supercomputer Institute University of MinnesotaMinneapolis MN 2009

[27] B Uccedilar and C Aykanat Encapsulating multiple communication-cost metrics in partition-ing sparse rectangular matrices for parallel matrix-vector multiplies SIAM Journal onScientic Computing 25 (2004) pp 18371859

[28] Revisiting hypergraph models for sparse matrix partitioning SIAM Review 49 (2007)pp 595603

[29] I Yamazaki X S Li and E G Ng Partitioning Load Balancing and Matrix Orderingin a Parallel Hybrid Solver Presentation at SIAM Conference on Parallel Processing forScientic Computing (PP10) 2010

Page 20: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

20 AMESTOY ET AL

Table 51The performance of the proposed PoPart and BiseMatch heuristics with respect to the lower

bound The numbers represent the average over 61 dierent matrices of the ratios of the results ofthe heuristics to the lower bound discussed in the text The column B corresponds to the maximumallowable block size the column P corresponds to the requested percentage of diagonal entries

B P PoPart BiseMatchmin max avg min max avg

4 005 10011 11751 10278 10000 10139 10013010 10005 11494 10192 10000 10073 10005020 10003 10945 10119 10000 10052 10003040 10001 10585 10072 10000 10031 10001060 10001 10449 10053 10000 10019 10001080 10000 10367 10043 10000 10029 10001100 10000 10491 10038 10000 10101 10002

16 005 10050 11615 10592 10000 10482 10113010 10026 11780 10485 10000 10553 10075020 10016 12748 10374 10000 10334 10035040 10007 11898 10246 10000 10230 10016060 10005 11431 10186 10000 10166 10010080 10004 11136 10154 10000 10190 10011100 10003 11052 10133 10000 10096 10008

64 005 10132 11581 10800 10000 10797 10275010 10101 11691 10715 10002 10584 10196020 10054 11389 10599 10001 10506 10125040 10030 11843 10497 10000 10437 10079060 10020 12362 10407 10000 11022 10072080 10015 13018 10383 10000 10344 10044100 10014 12087 10315 10000 10141 10024

256 005 10050 11280 10651 10000 10867 10342010 10127 11533 10721 10003 10911 10314020 10133 11753 10730 10002 10722 10257040 10093 11598 10668 10003 10540 10187060 10068 11621 10602 10002 10572 10174080 10068 11314 10563 10001 10515 10120100 10043 11203 10495 10001 10677 10118

Over all triplets 10000 13018 10359 10000 11110 10079

the computation of a set of inverse entries exploiting sparsity within MUMPS is describedin [24] In this section we give results obtained by using MUMPS with the out-of-coreoption and a nested dissection ordering provided by MeTiS [18] All experiments havebeen performed with direct IO access to les so that we can guarantee eective diskaccess independently of both the size of the factors and the size of the main memoryThe benets resulting from the use of direct IO mechanisms during the solution phaseare discussed in [2] All results are obtained on a dual-core Intel Core2 Duo P8800processor having a 280 GHz clock speed We have used only one of the cores andwe did not use threaded BLAS We use four matrices from the real life astrophysicsapplication [7] briey described in Section 1 The names of these matrices start withCESR and continue with the size of the matrix We use an additional set of fourmatrices (af23560 ecl32 stokes64 boyd1) from the UFL sparse matrix collectionwith very dierent nonzero patterns yielding a set of elimination trees with varyingstructural properties (such as height and width of the tree and variations in nodedegrees)

Table 52 shows the total size of the factors loaded and the execution time of thesolution phase of MUMPS with dierent settings and partitions All diagonal entriesof the inverse of the given matrices are computed with B = 16 64 In this table

COMPUTING INVERSE ENTRIES 21

Table 5.2
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. All diagonal entries are requested. The out-of-core executions use direct I/O access to the files. NoES refers to the traditional solution without exploiting sparsity. The columns ES-Nat and ES-PoP refer to exploiting the sparsity of the right-hand side vectors under, respectively, a natural partitioning and the PoPart heuristic.

                          Total size of the              Running time of the
                   Lower  loaded factors (MBytes)        solution phase (s)
matrix         B   bound     NoES   ES-Nat  ES-PoP      NoES  ES-Nat  ES-PoP
CESR21532     16    5403    63313     7855    5422     11138     694     393
              64    1371    15828     2596    1389      3596     381     166
CESR46799     16    2399   114051     3158    2417     39623     767     477
              64     620    28512     1176     635      8663     513     285
CESR72358     16    1967   375737     6056    2008    108009    2637     718
              64     528    93934     4796     571     31740    2741     520
CESR148286    16    8068  1595645    16595    8156    433967    7207    2685
              64    2092   398911    11004    2179    140493    7267    1998
af23560       16   16720   114672    17864   16745     20806    2411    1976
              64    4215    28668     5245    4245      6685    1210     595
ecl32         16   95478   618606   141533   95566    121847   27263   17606
              64   23943   154651    43429   24046     35255    9741    4829
stokes64      16     721     8503     1026     726      1312     142      85
              64     185     2125      425     189       488     102      41
boyd1         16    2028    75521     4232    2031    165512    3898    2149
              64     515    18880     1406     518     54927    2305    1212

In this table, the values in the column Lower bound are computed according to Theorem 3.1. The column NoES corresponds to the computational scheme where the sparsity of the right-hand side vectors involved in computing the diagonal entries is not exploited. The columns ES-Nat and ES-PoP correspond to the computational scheme where the sparsity of the right-hand side vectors is exploited to speed up the solution process. These columns correspond, respectively, to the natural partitioning (the indices are partitioned in the natural order into blocks of size B) and to the PoPart heuristic. As seen in column ES-Nat, most of the gain in the total size of the factors loaded and in the execution time is due to exploiting the sparsity of the right-hand side vectors. Furthermore, when reordering the right-hand sides following a postorder (PoPart), the total size of the loaded factors is once again reduced significantly, resulting in a noticeable impact on the execution time. We also see that the larger the block size, the better PoPart performs compared to the natural partitioning. This can be intuitively explained as follows: within an epoch, computations are performed on a union of paths; hence the natural ordering is likely to have more trouble with increasing epoch size, because it will combine nodes far from each other.
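To make the comparison concrete, here is a minimal sketch (in Python; the function and variable names are ours) of the two partitioning schemes: the natural partitioning slices the requested indices in their original order, whereas PoPart first sorts them by their rank in a post-order of the elimination tree and then slices.

    # Sketch of the two partitioning schemes compared in Table 5.2.
    # `requested` is the list of requested (diagonal) indices; `postorder[v]`
    # gives the rank of node v in a post-order traversal of the elimination
    # tree, assumed to be available from the analysis phase.

    def natural_partition(requested, B):
        """Slice the indices in their natural order into blocks of size B."""
        return [requested[i:i + B] for i in range(0, len(requested), B)]

    def po_part(requested, postorder, B):
        """PoPart: sort by post-order rank, then slice into blocks of size B.
        Consecutive nodes in a post-order are close in the tree, so the
        factors loaded for one block overlap as much as possible."""
        by_po = sorted(requested, key=lambda v: postorder[v])
        return [by_po[i:i + B] for i in range(0, len(by_po), B)]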

We see that the execution times are roughly proportional to the total volume of loaded factors. On a majority of problems, the execution time is largely dominated by the time spent reading the factors from the disk, which explains this behaviour; for example, on matrix ecl32, 95% of the time is spent on I/O. On a few problems only, the time spent in I/O represents a significant but not dominant part of the runtime, hence slightly different results (e.g., on CESR72358, the time for loading the factors represents less than a third of the total time for ES-PoP). On such matrices, increasing the block size is likely to increase the number of operations, and thus the time, as explained in Section 6.1 for the in-core case.


Table 5.3
The total size of the loaded factors and execution times with MUMPS with two different computational schemes. A random set of N/10 off-diagonal entries are computed with MUMPS. Out-of-core executions are with direct I/O access to the files. The columns ES-PoP and ES-HP correspond, respectively, to the case where a post-order on the column indices and a hypergraph partitioning routine are used to partition the requested entries into blocks of size 16 and 64.

                          Total size of the          Running time of the
                   Lower  loaded factors (MBytes)    solution phase (s)
matrix         B   bound   ES-PoP    ES-HP            ES-PoP    ES-HP
CESR21532     16     563     1782      999               566      169
              64     164      703      464               283      136
CESR46799     16     264      549      416               322      251
              64      93      232      195               355      253
CESR72358     16     242     1124      868              1229      738
              64     116      794      598               995      728
CESR148286    16     905     3175     2693              4260     3217
              64     345     2080     1669              2810     2358
af23560       16    1703     3579     2463              1092      664
              64     458     1219     1003               471      343
ecl32         16    9617    22514    12615              5077     2652
              64    2483     7309     4664              1992     1199
stokes64      16      77      188      149                29       23
              64      26       75       74                19       17
boyd1         16     205      481      258               390      344
              64      55      198       93               259      242

5.3. Hypergraph model. We have performed a set of experiments with the hypergraph model introduced in Section 4, in an attempt to see how it performs in practice and to set a base case method for future developments. Table 5.3 summarizes some tests with the model. The matrices are the same as before. The tests are again conducted using the out-of-core option of MUMPS and standard settings, including an ordering based on nested dissection. A random selection of N/10 off-diagonal entries (no two in the same column) are computed with B = 16 and B = 64. The table displays the lower bound (given in Section 4) and the total size of the factors loaded with a PoPart partition on the column indices and with a partition of the hypergraph model using PaToH [11], with default options except that we have requested a tighter balance among part sizes. As expected, the formulation based on hypergraph partitioning obtains better results than that based on post-order. There is, however, a huge difference between the lower bounds and the performance of the heuristics. We performed additional tests with the hypergraph model on the diagonal case and observed that its performance was similar to that of PoPart (which we have shown to be very effective). We think, therefore, that the performance of the hypergraph-based formulation should again be reasonable, and that the lower bound is too loose to judge the effectiveness of the heuristic. However, as pointed out before, hypergraph models can rapidly become huge, so further studies are needed.
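As a rough illustration of the model (the precise construction is in Section 4; the data structures and names below are our own, hypothetical ones), the hypergraph has one vertex per requested entry and one net per elimination-tree node, the net of node v containing every requested entry whose forward or backward solve path passes through v; with net weights equal to factor sizes, the connectivity-minus-one cutsize of a partition counts exactly the extra factor loads beyond the lower bound.

    # Minimal sketch: build the hypergraph for the off-diagonal case and
    # evaluate the connectivity-minus-1 cutsize of a candidate partition.
    # requests: list of (i, j) entries of the inverse; parent[v] is the
    # parent of node v in the elimination tree (None at the root);
    # weight[v] is the size of the factors at node v.

    def build_hypergraph(requests, parent):
        nets = {}  # tree node -> set of request ids whose paths contain it
        for r, (i, j) in enumerate(requests):
            for start in (j, i):      # forward path from j, backward path to i
                v = start
                while v is not None:  # walk up to the root
                    nets.setdefault(v, set()).add(r)
                    v = parent[v]
        return nets

    def cutsize(nets, weight, part):
        """part[r] is the block of request r; each net v costs
        weight[v] * (number of blocks it connects - 1)."""
        return sum(weight[v] * (len({part[r] for r in rs}) - 1)
                   for v, rs in nets.items())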

6. Similar problems and related work. In this section we briefly present extensions and variations of the problem. Firstly, we show the difference between the in-core and the out-of-core cases. Then we address the problem of exploiting parallelism; indeed, the partitionings presented above can limit tree-parallelism, and we propose a simple heuristic to remedy this.

6.1. In-core case. In an in-core context, the relevant metric is the number of floating-point operations (flops) performed. When processing several right-hand sides


[Figure 6.1: a small assembly tree whose three requested nodes are labelled 1, 2, and 3.]

Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}}, because the total sizes of loaded factors are 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); however, in the in-core context the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 x 2 + 1 x 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides by blocks of reasonable size.

A first observation can be made about the block size: in the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing all the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will be different in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well out-of-core should be efficient in the in-core case as well. Unfortunately, this is not the case, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {{1, 2}, {3}} is better than the partition {{1, 3}, {2}} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
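Both metrics are easy to evaluate mechanically for a candidate partition; the following sketch (our own encoding, assuming `paths[e]` gives the set of tree nodes on the solve path of requested entry e, as in Figure 6.1 with unit weights) computes the two costs discussed above.

    # Generic evaluators for the two metrics: loaded-factor volume
    # (out-of-core) and operation count (in-core).

    def out_of_core_cost(partition, paths, weight):
        """Each block loads the factors on the union of its paths once."""
        return sum(sum(weight[v] for v in set().union(*(paths[e] for e in block)))
                   for block in partition)

    def in_core_ops(partition, paths):
        """Every node processed with a block performs one operation per
        entry of the block (the convention of Figure 6.1)."""
        return sum(len(block) * len(set().union(*(paths[e] for e in block)))
                   for block in partition)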

In a different context [29], a post-order-based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated the effects on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, are different from the objective of minimizing the operation count.


6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation; it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlap gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses. A simple sketch of the interleaving idea is given below.
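As a rough illustration (this is our own simplification, not the exact scheme of [24]): one can rebuild the blocks by taking entries round-robin from the locality-preserving blocks, so that each new block mixes entries from different subtrees and hence keeps more processors active.

    from itertools import zip_longest

    def interleave(blocks, B):
        """Take entries round-robin across the locality-preserving blocks,
        then re-slice into blocks of size B: each new block now spans many
        subtrees, so more processors are active per block. Assumes the
        entries themselves are never None."""
        merged = [e for group in zip_longest(*blocks) for e in group
                  if e is not None]
        return [merged[i:i + B] for i in range(0, len(merged), B)]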

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N x N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L+U)^T, starting from entry (N,N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L+U)^T. The algorithm has been implemented in a multifrontal-like approach [8].
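For reference, the identities in question can be stated as follows (our transcription, written for an LDU factorization A = LDU with L unit lower and U unit upper triangular); they follow from A^{-1} = U^{-1} D^{-1} L^{-1} and allow the entries of Z = A^{-1} in the pattern of (L+U)^T to be computed from previously computed ones:

    % Takahashi-style identities (our transcription); Z = A^{-1}, A = LDU.
    % Multiplying Z = U^{-1}D^{-1}L^{-1} on the left by U (resp. on the
    % right by L) gives Z = D^{-1}L^{-1} + (I - U)Z and
    % Z = U^{-1}D^{-1} + Z(I - L), so, entrywise,
    \begin{align*}
      z_{ij} &= \delta_{ij}\, d_{ii}^{-1} - \sum_{k > i} u_{ik}\, z_{kj},
        && i \le j \quad \text{(upper part)},\\
      z_{ij} &= \delta_{ij}\, d_{jj}^{-1} - \sum_{k > j} z_{ik}\, l_{kj},
        && i \ge j \quad \text{(lower part)}.
    \end{align*}

Processing the entries in reverse Crout order guarantees that every z_{kj} or z_{ik} appearing on the right-hand side has already been computed.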

If all entries in the pattern of (L+U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if only a set of entries of the inverse is requested, any implementation based on the equations by Takahashi et al. has to set up the necessary data structures and determine the computational order to compute all the entries that are needed to obtain those requested. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix, from computed sparse factors and using the elimination tree. We have shown that only factors from paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.


We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree and then a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, with BiseMatch performing better in the worst case. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case when off-diagonal entries are wanted, we proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.

We briefly described a technique to improve the performance for parallel execution, and showed the differences that apply when the factorization is held in-core. Although we have made the first steps toward showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.
[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3-15, doi:10.1016/j.parco.2009.06.001.
[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15-41.
[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136-156.
[5] C. Aykanat, A. Pınar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860-1879.
[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.
[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103-1115.
[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.
[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.
[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673-693.
[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.
[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2-8.
[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177-179.
[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.
[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291-314.
[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications (1993).
[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091-2110.
[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Computer Science and Engineering, Army HPC Research Center, Minneapolis, 1998.
[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.
[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755-777.
[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.
[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513-1526.
[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.
[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.
[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.
[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837-1859.
[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595-603.
[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, Presentation at SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.

Page 21: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

COMPUTING INVERSE ENTRIES 21

Table 52The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes All diagonal entries are requested The out-of-core executions use direct IO accessto the les NoES refers to the traditional solution without exploiting sparsity The columns ES-Natand ES-PoP refer to exploiting the sparsity of the right-hand side vectors under respectively anatural partitioning and the PoPart heuristic

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound NoES ES-Nat ES-PoP NoES ES-Nat ES-PoPCESR21532 16 5403 63313 7855 5422 11138 694 393

64 1371 15828 2596 1389 3596 381 166CESR46799 16 2399 114051 3158 2417 39623 767 477

64 620 28512 1176 635 8663 513 285CESR72358 16 1967 375737 6056 2008 108009 2637 718

64 528 93934 4796 571 31740 2741 520CESR148286 16 8068 1595645 16595 8156 433967 7207 2685

64 2092 398911 11004 2179 140493 7267 1998af2356 16 16720 114672 17864 16745 20806 2411 1976

64 4215 28668 5245 4245 6685 1210 595ecl32 16 95478 618606 141533 95566 121847 27263 17606

64 23943 154651 43429 24046 35255 9741 4829stokes64 16 721 8503 1026 726 1312 142 85

64 185 2125 425 189 488 102 41boyd1 16 2028 75521 4232 2031 165512 3898 2149

64 515 18880 1406 518 54927 2305 1212

the values in column Lower bound are computed according to Theorem 31 Thecolumn NoES corresponds to the computational scheme where the sparsity of theright-hand side vectors involved in computing the diagonal entries is not exploitedThe columns ES-Nat and ES-PoP correspond to the computational scheme wherethe sparsity of the right-hand side vectors are exploited to speed up the solutionprocess These columns correspond respectively to the natural partitioning (theindices are partitioned in the natural order into blocks of size B) and to the PoPartheuristic As seen in column ES-Nat most of the gain in the total size of the factorsloaded and in the execution time is due to exploiting the sparsity of the right-handside vectors Furthermore when reordering the right-hand sides following a postorder(PoPart) the total number of loaded factors is once again reduced signicantlyresulting in a noticeable impact on the execution time We also see that the larger theblock size the better PoPart performs compared to the natural partitioning Thiscan be intuitively explained as follows within an epoch computations are performedon a union of paths hence the natural ordering is likely to have more trouble withincreasing epoch size because it will combine nodes far from each other

We see that the execution times are proportional to the total volume of loadedfactors On a majority of problems the execution time is largely dominated by thetime spent reading the factors from the disk which explains this behaviour forexample on matrix ecl32 95 of the time is spent on IO On a few problemsonly the time spent in IO represents a signicant but not dominant part of theruntime hence slightly dierent results (eg on CESR72358 the time for loading thefactors represents less than a third of the total time for ES-PoP) On such matricesincreasing the block size is likely to increase the number of operations and thus thetime as explained in Section 61 for the in-core case

22 AMESTOY ET AL

Table 53The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes A random set of N10 o-diagonal entries are computed with MUMPS Out-Of-Coreexecutions are with direct IO access to the les The columns ES-PoP and ES-HP correspondrespectively to the case where a post-order on the column indices and a hypergraph partitioningroutine are used to partition the requested entries into blocks of size 16 and 64

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound ES-PoP ES-HP ES-PoP ES-HPCESR21532 16 563 1782 999 566 169

64 164 703 464 283 136CESR46799 16 264 549 416 322 251

64 93 232 195 355 253CESR72358 16 242 1124 868 1229 738

64 116 794 598 995 728CESR148286 16 905 3175 2693 4260 3217

64 345 2080 1669 2810 2358af23560 16 1703 3579 2463 1092 664

64 458 1219 1003 471 343ecl32 16 9617 22514 12615 5077 2652

64 2483 7309 4664 1992 1199stokes64 16 77 188 149 29 23

64 26 75 74 19 17boyd1 16 205 481 258 390 344

64 55 198 93 259 242

53 Hypergraph model We have performed a set of experiments with thehypergraph model introduced in Section 4 in an attempt to see how it performs inpractice and to set a base case method for future developments Table 53 summa-rizes some tests with the model The matrices are the same as before The testsare again conducted using the out-of-core option of MUMPS and standard settingsincluding an ordering based on nested dissection A random selection of N10 o-diagonal entries (no two in the same column) are computed with B = 16 64 Thetable displays the lower bound (given in Section 4) and the total size of the factorsloaded with a PoPart partition on the column indices and with a partition on thehypergraph models using PaToH [11] with default options except that we have re-quested a tighter balance among part sizes As expected the formulation based onhypergraph partitioning obtains better results than that based on post-order Thereis however a huge dierence between the lower bounds and the performance of theheuristics We performed additional tests with the hypergraph model on the diagonalcase and observed that its performance was similar to that of PoPart (which wehave shown to be very eective) We think therefore that the performance of thehypergraph based formulation should be again reasonable and that the lower boundis too loose to judge the eectiveness of the heuristic However as pointed out beforehypergraph models can rapidly become huge so further studies are needed

6 Similar problems and related work In this section we briey presentextensions and variations of the problem Firstly we show the dierence betweenthe in-core and the out-of-core cases Then we address the problem of exploitingparallelism indeed the partitionings presented above can limit tree-parallelism andwe propose a simple heuristic to remedy this

61 In-core case In an in-core context the relevant metric is the number ofoating-point operations (ops) performed When processing several right-hand sides

COMPUTING INVERSE ENTRIES 23

3

1

2

Figure 61 Each internal node has unit weight and the number of operations performed ateach node is equal to the number of entries in the processed block In the out-of-core context thepartition 1 2 3 is better than the partition 1 3 2 because the total size of loaded factorsare 5 units (4 for the rst block and 1 for the second) vs 6 units (2 + 4) however in the in-corecontext the situation is reversed because the total operation counts are 10 (2 operations for 4 nodesand 1 operation for 2 nodes) vs 8 (decomposed as 2times 2 + 1times 4)

at the same time block computations are performed on the union of the structuresof these vectors hence there are more computations than there would be if theseright-hand sides were processed one-by-one of course the interest is to benet fromdense kernels such as the BLAS and thus to process the right-hand sides by blocksof reasonable size

A rst observation can be made about the block size in the in-core context theoptimal block size in terms of the number of oating-point operations is one becauseno extra operations are performed Conversely putting all the right-hand sides in asingle block represents the worst case because a maximum amount of extra operationsis introduced In the out-of-core case things are completely dierent processing allthe right-hand sides in one shot is the best strategy because all the nodes (pieces offactors stored on the hard drive) are loaded only once conversely processing all theright-hand sides one by one implies accessing each node a maximum number of timesTherefore the choice of a block size will be dierent in each case for in-core we willchoose a block size which gives a good trade-o between dense kernel eciency andthe number of extra operations introduced for out-of-core we will try to maximizethe block size (constrained by the available memory)

One might think that for a given block size partitions that perform well in out-of-core should be ecient in the in-core case as well Unfortunately this is not thecase and Figure 61 provides a counter-example The tree shown corresponds tothe usual assembly tree Assume that each internal node has unit weight and thata unit number of operations is performed for each requested entry in a given setThe partition 1 2 3 is better than the partition 1 3 2 in the out-of-corecontext but in the in-core case the second partition results in fewer operations

In a dierent context [29] a post-order based method and a hypergraph parti-tioning model have been shown to be useful in reducing the number of (redundant)operations in supernodal triangular solves with many sparse right-hand sides We havenot investigated the eects on the computation of the entries of the inverse howeveras outlined above the objectives in the out-of-core case for which our heuristics areproposed are dierent than the objective of minimizing the operation count

24 AMESTOY ET AL

62 Parallelism In MUMPS a subtree to subcube mapping [15] is performedon the lower part of the tree during the analysis phase Nodes in the lower part ofthe tree are likely to be mapped on a single processor whereas nodes in the upperpart of the tree are mapped onto several processors Since the partitioning strategiesdescribed above tend to put together nodes which are close in the elimination tree fewprocessors (probably only one) will be active in the lower part of tree when processinga block of right-hand sides An interleaving strategy was suggested in [24] to remedythis situation it consists in interleaving the dierent blocks of right-hand sides so thatevery processor will be active when a block of entries is computed The main drawbackof this strategy is that it tends to lose the overlapping gained by partitioning Whenlocal disks are attached to processors then the gain in global bandwidth balances thisloss Further work is needed to design strategies which partition the requested entriesso that parallelism is ensured without increasing either the number of operations orthe number of disk accesses

63 Related work Most of the related work addresses the case of computingthe whole diagonal of the inverse of a given matrix Among these in the studiesregarding the applications mentioned in Section 1 the diagonal entries are computedusing direct methods Tang and Saad [26] address the same problem (computing thediagonal of the inverse) with an iterative method focusing on matrices whose inverseshave a decay property

For the general o-diagonal case not much work has been done To computeentries in Aminus1 one can use the algorithm given in [13] This algorithm relies onequations derived by Takahashi et al [25] Given an LU factorization of an N times Nsparse matrix A the algorithm computes the parts of the inverse of A that correspondto the positions of the nonzeros of (L+U)T starting from entry (NN) and proceedingin a reverse Crout order At every step an entry of the inverse is computed usingthe factors L and U and the already computed entries of the inverse This approachis later extended [23] for a set of entries of the inverse rather than the whole set inthe pattern of (L+ U)T The algorithm has been implemented in a multifrontal-likeapproach [8]

If all entries in the pattern of (L + U)T are requested then the method thatimplements the algorithm in [13] might be advantageous whereas methods based onthe traditional solution of linear systems have to solve at least n linear systems andrequire considerably more memory On the other hand if a set of entries in the inverseis requested any implementation based on the equations by Takahashi et al shouldset up necessary data structures and determine the computational order to computeall the entries that are necessary to compute those requested This seems to be arather time consuming operation

7 Conclusion We have addressed the problem of eciently computing someentries of the inverse of a sparse matrix from computed sparse factors and using theelimination tree We have shown that only factors from paths between a node and theroot are required both in the forward and the backward substitution phases We thenexamined the ecient computation of multiple entries of the inverse particularly inthe case where the factors are held out-of-core

We have proposed several strategies for minimizing the cost of computing multipleinverse entries The issue here is that memory considerations restrict how many entriescan be computed simultaneously so that we need to partition the requested entriesto respect this constraint We showed that this problem is NP-complete so that it isnecessary to develop heuristics for doing this partitioning

COMPUTING INVERSE ENTRIES 25

We describe a very simple heuristic PoPart which is based on a post-ordering ofthe elimination tree and then a partitioning of the nodes in sequential parts accordingto this post-order We showed that this is a 2-approximation algorithm Althoughwe showed that the TreePartitioning problem cannot be approximated arbitrarilyclosely there remains a gap to ll and in future work we will strive at designingapproximation algorithms with a better ratio

We presented an exact algorithm for the case when two entries are computed at atime By using this exact algorithm repeatedly we developed another heuristic Bise-Match for partition sizes that are powers of two We performed extensive tests onthe heuristics and have concluded that both PoPart and BiseMatch perform verywell on average where the worst case the performance of the BiseMatch is betterBy comparing the performance of these heuristics with computable lower bounds wesaw that they give very eective partitionings We implemented the PoPart heuris-tic within the MUMPS solver and reported experimental results with MUMPS Theseconrmed the eectiveness of the PoPart heuristic

The heuristics PoPart and BiseMatch were designed for the case where onlydiagonal entries of the inverse are requested To accommodate the case when o-diagonal entries are wanted we have proposed a formulation based on a hypergraphpartitioning In this model a hypergraph is built so that the cutsize of the partitioncorresponds exactly to the increase in the total size of factors loaded Although the sizeof the hypergraph model can be large the model is powerful enough to represent boththe diagonal and the o-diagonal cases We also performed tests with the hypergraphmodel and concluded that it can be used eectively for cases where a small numberof entries in the inverse are requested

We briey described a technique to improve the performance for parallel ex-ecution and showed dierences that apply when the factorization is held in-coreAlthough we have made the rst steps for showing the ecient computation of o-diagonal inverse entries more work should be done in that case to obtain practicalalgorithms when many entries are requested

REFERENCES

[1] E Agullo On the out-of-core factorization of large sparse matrices PhD thesis Ecole Nor-male Supeacuterieure de Lyon 2008

[2] P R Amestoy I S Duff A Guermouche and Tz Slavova Analysis of the solu-tion phase of a parallel multifrontal approach Parallel Computing 36 (2010) pp 315doi101016jparco200906001

[3] P R Amestoy I S Duff J Koster and J-Y LExcellent A fully asynchronousmultifrontal solver using distributed dynamic scheduling SIAM Journal on Matrix Analysisand Applications 23 (2001) pp 1541

[4] P R Amestoy A Guermouche J-Y LExcellent and S Pralet Hybrid schedulingfor the parallel solution of linear systems Parallel Computing 32 (2006) pp 136156

[5] C Aykanat A Pinar and Uuml V Ccedilatalyuumlrek Permuting sparse rectangular matrices intoblock-diagonal form SIAM Journal on Scientic Computing 25 (2004) pp 18601879

[6] Aring Bjoumlrck Numerical methods for least squares problems Society for Industrial Mathematics1996

[7] L Bouchet J-P Roques P Mandrou A Strong R Diehl F Lebrun and R Ter-rier INTEGRAL SPI observation of the galactic central radian Contribution of discretesources and implication for the diuse emission 1 The Astrophysical Journal 635 (2005)pp 11031115

[8] Y E Campbell and T A Davis Computing the sparse inverse subset an inverse multi-frontal approach Tech Report TR-95-021 CIS Dept Univ of Florida 1995

[9] S Cauley J Jain C K Koh and V Balakrishnan A scalable distributed method forquantum-scale device simulation Journal of Applied Physics 101 (2007) p 123715

26 AMESTOY ET AL

[10] Uuml V Ccedilatalyuumlrek and C Aykanat Hypergraph-partitioning-based decomposition for par-allel sparse-matrix vector multiplication IEEE Transactions on Parallel and DistributedSystems 10 (1999) pp 673693

[11] PaToH A Multilevel Hypergraph Partitioning Tool Version 30 Bilkent UniversityDepartment of Computer Engineering Ankara 06533 Turkey PaToH is available at httpbmiosuedu~umitsoftwarehtm 1999

[12] I S Duff A M Erisman C W Gear and J K Reid Sparsity structure and Gaussianelimination SIGNUM Newsletter 23 (1988) pp 28

[13] A M Erisman and W F Tinney On computing certain elements of the inverse of a sparsematrix Comm ACM 18 (1975) pp 177179

[14] M R Garey and D S Johnson Computers and Intractability A Guide to the Theory ofNP-Completeness W H Freeman amp Co New York NY USA 1979

[15] G A Geist and E G Ng Task scheduling for parallel sparse Cholesky factorization In-ternational Journal of Parallel Programming 18 (1989) pp 291314

[16] J R Gilbert and J W H Liu Elimination structures for unsymmetric sparse LU factorsSIAM J Matrix Analysis and Applications (1993)

[17] J R Gilbert G L Miller and S-H Teng Geometric mesh partitioning Implementa-tion and experiments SIAM Journal on Scientic Computing 19 (1998) pp 20912110

[18] G Karypis and V Kumar MeTiS A Software Package for Partitioning UnstructuredGraphs Partitioning Meshes and Computing Fill-Reducing Orderings of Sparse MatricesVersion 40 University of Minnesota Department of Comp Sci and Eng Army HPCResearch Center Minneapolis 1998

[19] T Lengauer Combinatorial algorithms for integrated circuit layout John Wiley amp Sons IncNew York NY USA 1990

[20] L Lin J Lu L Ying R Car and W E Fast algorithm for extracting the diagonal ofthe inverse matrix with application to the electronic structure analysis of metallic systemsCommunications in Mathematical Sciences 7 (2009) pp 755777

[21] J W H Liu The role of elimination trees in sparse factorization SIAM Journal on MatrixAnalysis and Applications 11 (1990) pp 134172

[22] M Luisier A Schenk W Fichtner and G Klimeck Atomistic simulation of nanowiresin the sp3d5slowast tight-binding formalism From boundary conditions to strain calculationsPhysical Review B 74 (2006) p 205323

[23] H Niessner and K Reichert On computing the inverse of a sparse matrix InternationalJournal for Numerical Methods in Engineering 19 (1983) pp 15131526

[24] Tz Slavova Parallel triangular solution in an out-of-core multifrontal approach for solv-ing large sparse linear systems PhD thesis Institut National Polytechnique de ToulouseToulouse France 2009

[25] K Takahashi J Fagan and M Chin Formation of a sparse bus impedance matrix andits application to short circuit study in Proceedings 8th PICA Conference MinneapolisMinnesota 1973

[26] J Tang and Y Saad A probing method for computing the diagonal of the matrix inverseTech Report umsi-2010-42 Minnesota Supercomputer Institute University of MinnesotaMinneapolis MN 2009

[27] B Uccedilar and C Aykanat Encapsulating multiple communication-cost metrics in partition-ing sparse rectangular matrices for parallel matrix-vector multiplies SIAM Journal onScientic Computing 25 (2004) pp 18371859

[28] Revisiting hypergraph models for sparse matrix partitioning SIAM Review 49 (2007)pp 595603

[29] I Yamazaki X S Li and E G Ng Partitioning Load Balancing and Matrix Orderingin a Parallel Hybrid Solver Presentation at SIAM Conference on Parallel Processing forScientic Computing (PP10) 2010

Page 22: ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN …mumps.enseeiht.fr/doc_n7/RT_APO_10_06.pdf · ON COMPUTING INVERSE ENTRIES OF A SPARSE MATRIX IN AN OUT-OF-CORE ENVIRONMENT

22 AMESTOY ET AL

Table 53The total size of the loaded factors and execution times with MUMPS with two dierent computa-

tional schemes A random set of N10 o-diagonal entries are computed with MUMPS Out-Of-Coreexecutions are with direct IO access to the les The columns ES-PoP and ES-HP correspondrespectively to the case where a post-order on the column indices and a hypergraph partitioningroutine are used to partition the requested entries into blocks of size 16 and 64

Total size of the Running time of theLower loaded factors (MBytes) solution phase (s)

matrix B bound ES-PoP ES-HP ES-PoP ES-HPCESR21532 16 563 1782 999 566 169

64 164 703 464 283 136CESR46799 16 264 549 416 322 251

64 93 232 195 355 253CESR72358 16 242 1124 868 1229 738

64 116 794 598 995 728CESR148286 16 905 3175 2693 4260 3217

64 345 2080 1669 2810 2358af23560 16 1703 3579 2463 1092 664

64 458 1219 1003 471 343ecl32 16 9617 22514 12615 5077 2652

64 2483 7309 4664 1992 1199stokes64 16 77 188 149 29 23

64 26 75 74 19 17boyd1 16 205 481 258 390 344

64 55 198 93 259 242

53 Hypergraph model We have performed a set of experiments with thehypergraph model introduced in Section 4 in an attempt to see how it performs inpractice and to set a base case method for future developments Table 53 summa-rizes some tests with the model The matrices are the same as before The testsare again conducted using the out-of-core option of MUMPS and standard settingsincluding an ordering based on nested dissection A random selection of N10 o-diagonal entries (no two in the same column) are computed with B = 16 64 Thetable displays the lower bound (given in Section 4) and the total size of the factorsloaded with a PoPart partition on the column indices and with a partition on thehypergraph models using PaToH [11] with default options except that we have re-quested a tighter balance among part sizes As expected the formulation based onhypergraph partitioning obtains better results than that based on post-order Thereis however a huge dierence between the lower bounds and the performance of theheuristics We performed additional tests with the hypergraph model on the diagonalcase and observed that its performance was similar to that of PoPart (which wehave shown to be very eective) We think therefore that the performance of thehypergraph based formulation should be again reasonable and that the lower boundis too loose to judge the eectiveness of the heuristic However as pointed out beforehypergraph models can rapidly become huge so further studies are needed

6 Similar problems and related work In this section we briey presentextensions and variations of the problem Firstly we show the dierence betweenthe in-core and the out-of-core cases Then we address the problem of exploitingparallelism indeed the partitionings presented above can limit tree-parallelism andwe propose a simple heuristic to remedy this

61 In-core case In an in-core context the relevant metric is the number ofoating-point operations (ops) performed When processing several right-hand sides

COMPUTING INVERSE ENTRIES 23

3

1

2

Figure 61 Each internal node has unit weight and the number of operations performed ateach node is equal to the number of entries in the processed block In the out-of-core context thepartition 1 2 3 is better than the partition 1 3 2 because the total size of loaded factorsare 5 units (4 for the rst block and 1 for the second) vs 6 units (2 + 4) however in the in-corecontext the situation is reversed because the total operation counts are 10 (2 operations for 4 nodesand 1 operation for 2 nodes) vs 8 (decomposed as 2times 2 + 1times 4)

at the same time block computations are performed on the union of the structuresof these vectors hence there are more computations than there would be if theseright-hand sides were processed one-by-one of course the interest is to benet fromdense kernels such as the BLAS and thus to process the right-hand sides by blocksof reasonable size

A rst observation can be made about the block size in the in-core context theoptimal block size in terms of the number of oating-point operations is one becauseno extra operations are performed Conversely putting all the right-hand sides in asingle block represents the worst case because a maximum amount of extra operationsis introduced In the out-of-core case things are completely dierent processing allthe right-hand sides in one shot is the best strategy because all the nodes (pieces offactors stored on the hard drive) are loaded only once conversely processing all theright-hand sides one by one implies accessing each node a maximum number of timesTherefore the choice of a block size will be dierent in each case for in-core we willchoose a block size which gives a good trade-o between dense kernel eciency andthe number of extra operations introduced for out-of-core we will try to maximizethe block size (constrained by the available memory)

One might think that for a given block size partitions that perform well in out-of-core should be ecient in the in-core case as well Unfortunately this is not thecase and Figure 61 provides a counter-example The tree shown corresponds tothe usual assembly tree Assume that each internal node has unit weight and thata unit number of operations is performed for each requested entry in a given setThe partition 1 2 3 is better than the partition 1 3 2 in the out-of-corecontext but in the in-core case the second partition results in fewer operations

In a dierent context [29] a post-order based method and a hypergraph parti-tioning model have been shown to be useful in reducing the number of (redundant)operations in supernodal triangular solves with many sparse right-hand sides We havenot investigated the eects on the computation of the entries of the inverse howeveras outlined above the objectives in the out-of-core case for which our heuristics areproposed are dierent than the objective of minimizing the operation count

24 AMESTOY ET AL

62 Parallelism In MUMPS a subtree to subcube mapping [15] is performedon the lower part of the tree during the analysis phase Nodes in the lower part ofthe tree are likely to be mapped on a single processor whereas nodes in the upperpart of the tree are mapped onto several processors Since the partitioning strategiesdescribed above tend to put together nodes which are close in the elimination tree fewprocessors (probably only one) will be active in the lower part of tree when processinga block of right-hand sides An interleaving strategy was suggested in [24] to remedythis situation it consists in interleaving the dierent blocks of right-hand sides so thatevery processor will be active when a block of entries is computed The main drawbackof this strategy is that it tends to lose the overlapping gained by partitioning Whenlocal disks are attached to processors then the gain in global bandwidth balances thisloss Further work is needed to design strategies which partition the requested entriesso that parallelism is ensured without increasing either the number of operations orthe number of disk accesses

63 Related work Most of the related work addresses the case of computingthe whole diagonal of the inverse of a given matrix Among these in the studiesregarding the applications mentioned in Section 1 the diagonal entries are computedusing direct methods Tang and Saad [26] address the same problem (computing thediagonal of the inverse) with an iterative method focusing on matrices whose inverseshave a decay property

For the general o-diagonal case not much work has been done To computeentries in Aminus1 one can use the algorithm given in [13] This algorithm relies onequations derived by Takahashi et al [25] Given an LU factorization of an N times Nsparse matrix A the algorithm computes the parts of the inverse of A that correspondto the positions of the nonzeros of (L+U)T starting from entry (NN) and proceedingin a reverse Crout order At every step an entry of the inverse is computed usingthe factors L and U and the already computed entries of the inverse This approachis later extended [23] for a set of entries of the inverse rather than the whole set inthe pattern of (L+ U)T The algorithm has been implemented in a multifrontal-likeapproach [8]

If all entries in the pattern of (L + U)T are requested then the method thatimplements the algorithm in [13] might be advantageous whereas methods based onthe traditional solution of linear systems have to solve at least n linear systems andrequire considerably more memory On the other hand if a set of entries in the inverseis requested any implementation based on the equations by Takahashi et al shouldset up necessary data structures and determine the computational order to computeall the entries that are necessary to compute those requested This seems to be arather time consuming operation

7 Conclusion We have addressed the problem of eciently computing someentries of the inverse of a sparse matrix from computed sparse factors and using theelimination tree We have shown that only factors from paths between a node and theroot are required both in the forward and the backward substitution phases We thenexamined the ecient computation of multiple entries of the inverse particularly inthe case where the factors are held out-of-core

We have proposed several strategies for minimizing the cost of computing multipleinverse entries The issue here is that memory considerations restrict how many entriescan be computed simultaneously so that we need to partition the requested entriesto respect this constraint We showed that this problem is NP-complete so that it isnecessary to develop heuristics for doing this partitioning

COMPUTING INVERSE ENTRIES 25

We describe a very simple heuristic PoPart which is based on a post-ordering ofthe elimination tree and then a partitioning of the nodes in sequential parts accordingto this post-order We showed that this is a 2-approximation algorithm Althoughwe showed that the TreePartitioning problem cannot be approximated arbitrarilyclosely there remains a gap to ll and in future work we will strive at designingapproximation algorithms with a better ratio

We presented an exact algorithm for the case when two entries are computed at atime By using this exact algorithm repeatedly we developed another heuristic Bise-Match for partition sizes that are powers of two We performed extensive tests onthe heuristics and have concluded that both PoPart and BiseMatch perform verywell on average where the worst case the performance of the BiseMatch is betterBy comparing the performance of these heuristics with computable lower bounds wesaw that they give very eective partitionings We implemented the PoPart heuris-tic within the MUMPS solver and reported experimental results with MUMPS Theseconrmed the eectiveness of the PoPart heuristic

The heuristics PoPart and BiseMatch were designed for the case where onlydiagonal entries of the inverse are requested To accommodate the case when o-diagonal entries are wanted we have proposed a formulation based on a hypergraphpartitioning In this model a hypergraph is built so that the cutsize of the partitioncorresponds exactly to the increase in the total size of factors loaded Although the sizeof the hypergraph model can be large the model is powerful enough to represent boththe diagonal and the o-diagonal cases We also performed tests with the hypergraphmodel and concluded that it can be used eectively for cases where a small numberof entries in the inverse are requested

We briey described a technique to improve the performance for parallel ex-ecution and showed dierences that apply when the factorization is held in-coreAlthough we have made the rst steps for showing the ecient computation of o-diagonal inverse entries more work should be done in that case to obtain practicalalgorithms when many entries are requested

REFERENCES

[1] E Agullo On the out-of-core factorization of large sparse matrices PhD thesis Ecole Nor-male Supeacuterieure de Lyon 2008

[2] P R Amestoy I S Duff A Guermouche and Tz Slavova Analysis of the solu-tion phase of a parallel multifrontal approach Parallel Computing 36 (2010) pp 315doi101016jparco200906001

[3] P R Amestoy I S Duff J Koster and J-Y LExcellent A fully asynchronousmultifrontal solver using distributed dynamic scheduling SIAM Journal on Matrix Analysisand Applications 23 (2001) pp 1541

[4] P R Amestoy A Guermouche J-Y LExcellent and S Pralet Hybrid schedulingfor the parallel solution of linear systems Parallel Computing 32 (2006) pp 136156

[5] C Aykanat A Pinar and Uuml V Ccedilatalyuumlrek Permuting sparse rectangular matrices intoblock-diagonal form SIAM Journal on Scientic Computing 25 (2004) pp 18601879

[6] Aring Bjoumlrck Numerical methods for least squares problems Society for Industrial Mathematics1996

[7] L Bouchet J-P Roques P Mandrou A Strong R Diehl F Lebrun and R Ter-rier INTEGRAL SPI observation of the galactic central radian Contribution of discretesources and implication for the diuse emission 1 The Astrophysical Journal 635 (2005)pp 11031115

[8] Y E Campbell and T A Davis Computing the sparse inverse subset an inverse multi-frontal approach Tech Report TR-95-021 CIS Dept Univ of Florida 1995

[Figure: an assembly tree whose three requested nodes are labeled 1, 2, and 3.]

Figure 6.1. Each internal node has unit weight, and the number of operations performed at each node is equal to the number of entries in the processed block. In the out-of-core context, the partition {1, 2}, {3} is better than the partition {1, 3}, {2} because the total size of the loaded factors is 5 units (4 for the first block and 1 for the second) vs. 6 units (2 + 4); in the in-core context, however, the situation is reversed, because the total operation counts are 10 (2 operations for 4 nodes and 1 operation for 2 nodes) vs. 8 (decomposed as 2 × 2 + 1 × 4).

at the same time, block computations are performed on the union of the structures of these vectors; hence there are more computations than there would be if these right-hand sides were processed one by one. Of course, the interest is to benefit from dense kernels, such as the BLAS, and thus to process the right-hand sides in blocks of reasonable size.

A first observation can be made about the block size. In the in-core context, the optimal block size in terms of the number of floating-point operations is one, because no extra operations are performed. Conversely, putting all the right-hand sides in a single block represents the worst case, because a maximum amount of extra operations is introduced. In the out-of-core case, things are completely different: processing all the right-hand sides in one shot is the best strategy, because all the nodes (pieces of factors stored on the hard drive) are loaded only once; conversely, processing the right-hand sides one by one implies accessing each node a maximum number of times. Therefore, the choice of a block size will differ in each case: for in-core, we will choose a block size which gives a good trade-off between dense kernel efficiency and the number of extra operations introduced; for out-of-core, we will try to maximize the block size (constrained by the available memory).

One might think that, for a given block size, partitions that perform well in the out-of-core case should be efficient in the in-core case as well. Unfortunately, this is not so, and Figure 6.1 provides a counter-example. The tree shown corresponds to the usual assembly tree. Assume that each internal node has unit weight and that a unit number of operations is performed for each requested entry in a given set. The partition {1, 2}, {3} is better than the partition {1, 3}, {2} in the out-of-core context, but in the in-core case the second partition results in fewer operations.
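
To make the two cost metrics concrete, the sketch below evaluates both on a toy chain-shaped tree; the tree, the names, and the unit-weight cost model are illustrative assumptions rather than the exact instance of Figure 6.1. A block is assumed to touch every node on the union of the paths from its requested nodes to the root.

def path_to_root(v, parent):
    # Collect the nodes on the path from v to the root (parent[root] is None).
    nodes = set()
    while v is not None:
        nodes.add(v)
        v = parent[v]
    return nodes

def touched(block, parent):
    # Union of the root paths of all requested nodes in a block.
    nodes = set()
    for v in block:
        nodes |= path_to_root(v, parent)
    return nodes

def out_of_core_cost(partition, parent):
    # Each block loads every factor on its union of root paths once.
    return sum(len(touched(b, parent)) for b in partition)

def in_core_ops(partition, parent):
    # Each touched node performs one operation per entry in the block.
    return sum(len(b) * len(touched(b, parent)) for b in partition)

# A chain-shaped tree r - a - 3 - 1 - v - w - 2 (node -> parent).
parent = {"r": None, "a": "r", "3": "a", "1": "3", "v": "1", "w": "v", "2": "w"}
P1 = [{"1", "2"}, {"3"}]  # loads 10 units, performs 17 operations
P2 = [{"1", "3"}, {"2"}]  # loads 11 units, performs 15 operations
for P in (P1, P2):
    print(out_of_core_cost(P, parent), in_core_ops(P, parent))

On this instance the first partition loads fewer factors while the second performs fewer operations, the same kind of reversal that the figure illustrates.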

In a different context [29], a post-order based method and a hypergraph partitioning model have been shown to be useful in reducing the number of (redundant) operations in supernodal triangular solves with many sparse right-hand sides. We have not investigated their effect on the computation of the entries of the inverse; however, as outlined above, the objectives in the out-of-core case, for which our heuristics are proposed, differ from the objective of minimizing the operation count.

6.2. Parallelism. In MUMPS, a subtree-to-subcube mapping [15] is performed on the lower part of the tree during the analysis phase. Nodes in the lower part of the tree are likely to be mapped onto a single processor, whereas nodes in the upper part of the tree are mapped onto several processors. Since the partitioning strategies described above tend to put together nodes which are close in the elimination tree, few processors (probably only one) will be active in the lower part of the tree when processing a block of right-hand sides. An interleaving strategy was suggested in [24] to remedy this situation: it consists in interleaving the different blocks of right-hand sides so that every processor will be active when a block of entries is computed. The main drawback of this strategy is that it tends to lose the overlapping gained by partitioning. When local disks are attached to processors, the gain in global bandwidth balances this loss. Further work is needed to design strategies which partition the requested entries so that parallelism is ensured without increasing either the number of operations or the number of disk accesses.

6.3. Related work. Most of the related work addresses the case of computing the whole diagonal of the inverse of a given matrix. Among these, in the studies regarding the applications mentioned in Section 1, the diagonal entries are computed using direct methods. Tang and Saad [26] address the same problem (computing the diagonal of the inverse) with an iterative method, focusing on matrices whose inverses have a decay property.

For the general off-diagonal case, not much work has been done. To compute entries in A^{-1}, one can use the algorithm given in [13]. This algorithm relies on equations derived by Takahashi et al. [25]. Given an LU factorization of an N × N sparse matrix A, the algorithm computes the parts of the inverse of A that correspond to the positions of the nonzeros of (L+U)^T, starting from entry (N,N) and proceeding in a reverse Crout order. At every step, an entry of the inverse is computed using the factors L and U and the already computed entries of the inverse. This approach was later extended [23] to a set of entries of the inverse, rather than the whole set in the pattern of (L+U)^T. The algorithm has been implemented in a multifrontal-like approach [8].
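
For reference, the identities underlying [13, 25] can be stated as follows; this is the standard formulation for a factorization A = LDU with unit-diagonal L and U and Z = A^{-1} (the notation here is ours, not taken from [13]):

\[ Z = D^{-1}L^{-1} + (I-U)Z, \qquad Z = U^{-1}D^{-1} + Z(I-L), \]

so that, starting from z_{NN} = d_{NN}^{-1} and proceeding in reverse Crout order,

\[ z_{ij} = -\sum_{k>i} u_{ik}\, z_{kj} \quad (i<j), \qquad z_{ij} = -\sum_{k>j} z_{ik}\, l_{kj} \quad (i>j), \qquad z_{ii} = d_{ii}^{-1} - \sum_{k>i} u_{ik}\, z_{ki}. \]

When the requested entries lie in the pattern of (L+U)^T, these recurrences reference only entries of Z that are themselves in that pattern, which is what makes the approach economical.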

If all entries in the pattern of (L+U)^T are requested, then the method that implements the algorithm in [13] might be advantageous, whereas methods based on the traditional solution of linear systems have to solve at least n linear systems and require considerably more memory. On the other hand, if only a subset of the entries of the inverse is requested, any implementation based on the equations by Takahashi et al. must set up the necessary data structures and determine the computational order so as to obtain all the entries on which the requested ones depend. This seems to be a rather time-consuming operation.

7. Conclusion. We have addressed the problem of efficiently computing some entries of the inverse of a sparse matrix from computed sparse factors, using the elimination tree. We have shown that only factors from the paths between a node and the root are required, both in the forward and the backward substitution phases. We then examined the efficient computation of multiple entries of the inverse, particularly in the case where the factors are held out-of-core.

We have proposed several strategies for minimizing the cost of computing multiple inverse entries. The issue here is that memory considerations restrict how many entries can be computed simultaneously, so that we need to partition the requested entries to respect this constraint. We showed that this problem is NP-complete, so that it is necessary to develop heuristics for doing this partitioning.

We described a very simple heuristic, PoPart, which is based on a post-ordering of the elimination tree followed by a partitioning of the nodes into sequential parts according to this post-order. We showed that this is a 2-approximation algorithm. Although we showed that the TreePartitioning problem cannot be approximated arbitrarily closely, there remains a gap to fill, and in future work we will strive to design approximation algorithms with a better ratio.
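
As a minimal sketch (the function name and data layout are ours), PoPart reduces to sorting the requested nodes by their rank in a post-order traversal of the elimination tree and cutting the sequence into consecutive blocks:

def popart(requested, postorder_rank, block_size):
    # Order the requested nodes as they appear in a post-order traversal
    # of the elimination tree, then cut into consecutive parts.
    seq = sorted(requested, key=lambda v: postorder_rank[v])
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]

Nodes that are consecutive in a post-order tend to share long portions of their paths to the root, which is the intuition behind the approximation guarantee.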

We presented an exact algorithm for the case when two entries are computed at a time. By using this exact algorithm repeatedly, we developed another heuristic, BiseMatch, for partition sizes that are powers of two. We performed extensive tests on the heuristics and concluded that both PoPart and BiseMatch perform very well on average, while in the worst case BiseMatch performs better. By comparing the performance of these heuristics with computable lower bounds, we saw that they give very effective partitionings. We implemented the PoPart heuristic within the MUMPS solver and reported experimental results with MUMPS. These confirmed the effectiveness of the PoPart heuristic.
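
A sketch of the matching-based construction follows; it is illustrative only: the pairing step uses NetworkX's min_weight_matching, and the pairwise cost function (for instance, the total factor size on the union of the two groups' root paths) is supplied by the caller.

import networkx as nx

def bisematch(groups, pair_cost, rounds):
    # Repeatedly pair the current groups through a minimum-weight perfect
    # matching on the pairwise merge costs; after `rounds` doublings the
    # parts have size 2**rounds (assumes the group count stays even).
    for _ in range(rounds):
        G = nx.Graph()
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                G.add_edge(i, j, weight=pair_cost(groups[i], groups[j]))
        pairs = nx.min_weight_matching(G)
        groups = [groups[i] | groups[j] for i, j in pairs]
    return groups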

The heuristics PoPart and BiseMatch were designed for the case where only diagonal entries of the inverse are requested. To accommodate the case where off-diagonal entries are wanted, we have proposed a formulation based on hypergraph partitioning. In this model, a hypergraph is built so that the cutsize of the partition corresponds exactly to the increase in the total size of factors loaded. Although the size of the hypergraph model can be large, the model is powerful enough to represent both the diagonal and the off-diagonal cases. We also performed tests with the hypergraph model and concluded that it can be used effectively for cases where a small number of entries of the inverse are requested.
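
In the usual connectivity metric (stated here in standard hypergraph-partitioning notation rather than quoted from the full construction), if the net associated with tree node v has weight w(v), the size of its factor, and connects \lambda_v parts of a partition \Pi, then, summing over the nodes required by at least one requested entry,

\[ \mathrm{load}(\Pi) \;=\; \sum_{v} w(v)\,\lambda_v \;=\; \sum_{v} w(v) \;+\; \sum_{v} w(v)\,(\lambda_v - 1), \]

so the cutsize \sum_v w(v)(\lambda_v - 1) is exactly the increase over the ideal cost of loading each required factor once.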

We briefly described a technique to improve the performance of parallel execution, and we showed the differences that apply when the factorization is held in-core. Although we have made the first steps towards showing the efficient computation of off-diagonal inverse entries, more work should be done in that case to obtain practical algorithms when many entries are requested.

REFERENCES

[1] E. Agullo, On the out-of-core factorization of large sparse matrices, PhD thesis, École Normale Supérieure de Lyon, 2008.

[2] P. R. Amestoy, I. S. Duff, A. Guermouche, and Tz. Slavova, Analysis of the solution phase of a parallel multifrontal approach, Parallel Computing, 36 (2010), pp. 3–15. doi:10.1016/j.parco.2009.06.001.

[3] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L'Excellent, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41.

[4] P. R. Amestoy, A. Guermouche, J.-Y. L'Excellent, and S. Pralet, Hybrid scheduling for the parallel solution of linear systems, Parallel Computing, 32 (2006), pp. 136–156.

[5] C. Aykanat, A. Pinar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM Journal on Scientific Computing, 25 (2004), pp. 1860–1879.

[6] Å. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics, 1996.

[7] L. Bouchet, J.-P. Roques, P. Mandrou, A. Strong, R. Diehl, F. Lebrun, and R. Terrier, INTEGRAL SPI observation of the galactic central radian: Contribution of discrete sources and implication for the diffuse emission, The Astrophysical Journal, 635 (2005), pp. 1103–1115.

[8] Y. E. Campbell and T. A. Davis, Computing the sparse inverse subset: an inverse multifrontal approach, Tech. Report TR-95-021, CIS Dept., Univ. of Florida, 1995.

[9] S. Cauley, J. Jain, C. K. Koh, and V. Balakrishnan, A scalable distributed method for quantum-scale device simulation, Journal of Applied Physics, 101 (2007), p. 123715.

[10] Ü. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Transactions on Parallel and Distributed Systems, 10 (1999), pp. 673–693.

[11] Ü. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999.

[12] I. S. Duff, A. M. Erisman, C. W. Gear, and J. K. Reid, Sparsity structure and Gaussian elimination, SIGNUM Newsletter, 23 (1988), pp. 2–8.

[13] A. M. Erisman and W. F. Tinney, On computing certain elements of the inverse of a sparse matrix, Communications of the ACM, 18 (1975), pp. 177–179.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman & Co., New York, NY, USA, 1979.

[15] G. A. Geist and E. G. Ng, Task scheduling for parallel sparse Cholesky factorization, International Journal of Parallel Programming, 18 (1989), pp. 291–314.

[16] J. R. Gilbert and J. W. H. Liu, Elimination structures for unsymmetric sparse LU factors, SIAM Journal on Matrix Analysis and Applications, 1993.

[17] J. R. Gilbert, G. L. Miller, and S.-H. Teng, Geometric mesh partitioning: Implementation and experiments, SIAM Journal on Scientific Computing, 19 (1998), pp. 2091–2110.

[18] G. Karypis and V. Kumar, MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0, University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Center, Minneapolis, 1998.

[19] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley & Sons, Inc., New York, NY, USA, 1990.

[20] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of metallic systems, Communications in Mathematical Sciences, 7 (2009), pp. 755–777.

[21] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.

[22] M. Luisier, A. Schenk, W. Fichtner, and G. Klimeck, Atomistic simulation of nanowires in the sp3d5s* tight-binding formalism: From boundary conditions to strain calculations, Physical Review B, 74 (2006), p. 205323.

[23] H. Niessner and K. Reichert, On computing the inverse of a sparse matrix, International Journal for Numerical Methods in Engineering, 19 (1983), pp. 1513–1526.

[24] Tz. Slavova, Parallel triangular solution in an out-of-core multifrontal approach for solving large sparse linear systems, PhD thesis, Institut National Polytechnique de Toulouse, Toulouse, France, 2009.

[25] K. Takahashi, J. Fagan, and M. Chin, Formation of a sparse bus impedance matrix and its application to short circuit study, in Proceedings of the 8th PICA Conference, Minneapolis, Minnesota, 1973.

[26] J. Tang and Y. Saad, A probing method for computing the diagonal of the matrix inverse, Tech. Report umsi-2010-42, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2009.

[27] B. Uçar and C. Aykanat, Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies, SIAM Journal on Scientific Computing, 25 (2004), pp. 1837–1859.

[28] B. Uçar and C. Aykanat, Revisiting hypergraph models for sparse matrix partitioning, SIAM Review, 49 (2007), pp. 595–603.

[29] I. Yamazaki, X. S. Li, and E. G. Ng, Partitioning, load balancing, and matrix ordering in a parallel hybrid solver, presentation at the SIAM Conference on Parallel Processing for Scientific Computing (PP10), 2010.
