Parallel decomposition of unstructured FEM-meshes

CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 10(1), 53–72 (JANUARY 1998)

Parallel decomposition of unstructuredFEM-meshesRALF DIEKMANN∗, DERK MEYER AND BURKHARD MONIEN

Department of Mathematics and Computer Science, University of Paderborn, Germany

SUMMARYWe present a parallel algorithm for static and dynamic partitioning of unstructured FEM-meshes. The method consists of two parts. First a fast but inaccurate sequential clustering isdetermined which is used, together with a simple mapping heuristic, to map the mesh initiallyonto the processors of a parallel system.

The second part of the method uses a massively parallel algorithm to remap and optimize themesh decomposition, taking several cost functions into account which reflect the characteristicsof the underlying hardware and the requirements of the numerical solution method supposedto run after the decomposition. The parallel algorithm first calculates the number of nodes thathave to be migrated between pairs of clusters in order to obtain an optimal load balancing. In asecond step, nodes to be migrated are chosen according to cost functions optimizing the amountof necessary communication and the shapes of subdomains. The latter criterion is extremelyimportant for the convergence behavior of certain numerical solution methods, especially forpreconditioned conjugate gradient methods.

The parallel parts of the method are implemented in C under Parix to run on the ParsytecGC systems. Results on up to 64 processors are presented and compared to those of otherexisting methods. 1998 John Wiley & Sons, Ltd.

1. INTRODUCTION

Finite difference (FDM), finite element (FEM) and boundary element (BEM) methodsare probably the most important techniques for numerical simulation in mechanical andelectrical engineering, physics, chemistry and biology. The finite element method is usedfor stability calculations in many areas, e.g. car and plane construction and constructionengineering. 95% of all stability proofs in engine production use FEM. Simulations of heatconduction, fluid dynamics, diffusion, sound or earthquake wave propagation and chemicalreactions make use of finite element or boundary element methods.

The core of all of these methods is the discretization of the domain of interest into amesh of finite elements. The partial differential equation used to describe the physicalproblem is approximated by a set of simple functions on these elements. One of the majordisadvantages of FEM is the high amount of computation time required to simulate problemsof practical size with sufficient accuracy, especially in the 3D case. For typical applicationsin computational fluid dynamics (CFD), for example, the mesh can reach sizes of severalmillions of elements. Therefore there are strong efforts to use modern supercomputers forsuch kinds of simulations in order to reduce computation times to reasonable magnitudesand to provide sufficiently large amounts of memory.

The domain decomposition method embodies large potentials for a parallelization ofFEM methods. In this data-parallel approach, the domain of interest is partitioned into∗Contact Address: Ralf Diekmann, Department of Computer Science, University of Paderborn, Furstenallee 11,

D-33102 Paderborn, Germany.

CCC 1040–3108/98/010053–20 $17.50 Received 24 January 19961998 John Wiley & Sons, Ltd. Revised 11 September 1996

54 R. DIEKMANN, D. MEYER AND B. MONIEN

smaller subdomains, either before the mesh generation or afterwards (mesh partitioning).The subdomains are assigned to processors which calculate the corresponding part of theapproximation.

The mesh decomposition and assignments of subdomains to processors can be modeled asa graph embedding problem where a large graph (the mesh) has to be mapped onto a smallerone (the processor network)[1]. Unfortunately, this mapping problem is NP-complete andthere exist almost no efficient sequential or parallel heuristics solving it sufficiently[2]. Withgrowing performance of interconnection networks and especially with the establishment ofindependent routing networks, it is appropriate to reduce the mapping problem to the taskof partitioning the graph (the FEM-mesh) into as many equal sized (or weighted) clusters asthere are numbers of processors and to minimize the number of edges crossing the partitionboundaries[1]. The general partitioning problem is still very hard to solve and the existingheuristics like those in [3–5] are often very time- and space-consuming, even if they areparallel. Therefore, the partitioning is often handled by recursive bisection.

There are several disadvantages attached to recursive bisection. Compared to direct k-partitioning the results can be worse, especially if the number of partitions is very large[6].The time (and space) needed to compute a single bisection grows linearly with the numberof nodes (in the best case). Thus for very large graphs that have to be mapped onto largeparallel systems there is the need of efficient parallel partitioning algorithms.

For the special application of partitioning FEM-meshes heuristics have to be flexiblewith respect to the measures they optimize. Graph partitioning in general is a combinatorialoptimization problem. Normal partitioning heuristics minimize the total cut size, i.e. thenumber of edges crossing partition boundaries. This makes sense in order to reduce theamount of necessary communication of the parallel FEM-simulation, but there are severalother measures which are often much more important than cut size and which dependvery much on the numerical solution method used for the simulation[7]. Examples are theaspect ratio of subdomains (i.e. the ratio between the length of the longest and the shortestboundary-segment, where a segment is a part of the boundary leading to a single neighbor),or their convexity. In particular, preconditioned conjugate gradient solvers (PCG) which usethe subdomain-structure to determine preconditioning matrices usually benefit from wellshaped clusters[8]. Their convergence behavior is determined by the form of subdomainsand the smoothness of inner boundaries.

Some of the most efficient numerical methods use adaptively refining meshes whichchange their structure during runtime in order to adapt to the characteristics of the physicalproblem[9,10]. Parallel adaptive environments have to include the ability to cope with thesechanging meshes, too.

1.1. The new method

The heuristic described in this paper tries to optimize the shape of subdomains in additionto communication amounts. It is built in a modular way in order to allow the replacementof certain parts like, for example, the cost functions which are optimized. It is embed-ded in a larger research project which aims to design an efficient and flexible frame forthe development of massively parallel adaptive finite element simulation codes[8]. Suchcompletely parallel FEM-codes consist of parallel mesh generation, parallel partitioningand load balancing and parallel numerical solvers. Thus the parallel partitioning plays animportant role within such a code and is a determining factor for the overall efficiency.

PARALLEL DECOMPOSITION OF UNSTRUCTURED FEM-MESHES 55

The method consists of two parts. The first simulates a parallel mesh generation and hadto be designed because there are currently no parallel mesh generators for arbitrary complexdomains available. It performs a fast but inaccurate sequential clustering of the mesh andmaps the resulting clusters to the processors of a parallel system. We implemented severalclustering and mapping strategies in order to simulate the behavior of different parallelmesh generators and evaluated the influence of the initial clustering on the followingparallel partitioning algorithm.

The second part of the method consists of a massively parallel algorithm which takes theclustering as input (or the output of a parallel mesh generator, if available) and optimizesthe load distribution together with several other measures resulting from the numericalsolution method kept in mind. The parallel phase again consists of two steps. The firstone calculates the amount of nodes (or elements) of the mesh that have to be migratedbetween pairs of subdomains. The second step performs the physical node (or element)migration optimizing communication demands and subdomain shapes. It is followed by anode-exchange phase which further tries to optimize the decomposition.

If the numerical solver is using adaptive meshes, this second phase can be used aftereach refinement/derefinement to restore optimal load balancing.

1.2. Related work

A large number of efficient graph partitioning heuristics have been designed in the past, mostof them for recursive bisection but also some for direct k-partitioning. See [3,4,11–16] foroverviews of different techniques. Many of the most efficient methods have been collectedinto libraries like Chaco by Hendrickson/Leland[17], Metis by Karypis/Kumar[13], Scotchby Pellegrini/Roman[18], Party by Preis/Diekmann[19], and others[20–22]. We do notdiscuss these algorithms and tools here but concentrate on methods explicitly designed topartition and map numerical grids.

Farhat describes in [7] a simple and efficient sequential algorithm for the partitioning ofFEM-meshes. This front-technique is a breadth-first-search based method which is widelyused by engineers and has influenced the development of a number of other partitioningtools[20,23]. We describe it in more detail in Section 2.1, as it is, among others, used withinour sequential clustering phase.

Recursive Orthogonal Bisection (ROB) and Unbalanced Recursive Bisection (URB) usenode co-ordinates to partition a mesh and neglect the graph structure[24]. Both methodsare fast and easy to implement whereas, URB offers greater flexibility. There are ways toparallelize both methods but the parallelization requires the co-ordinates of all nodes to beheld on each processor. Both parallel versions can be used in an adaptive environment.

Walshaw and Berzins apply Recursive Spectral Bisection (RSB), which was introduced bySimon et al.[16,25] to adaptive refining meshes by combining the method with a contractionscheme[26]. The sequential algorithm performs a remapping, taking the existing clusteringinto account.

Van Driessche and Roose generalize RSB and apply it to refining meshes by introducingvirtual vertices and virtual edges[27]. This formulation allows the minimization of nodetransfers in order to obtain a balanced load after a refining step.

Incremental Graph Partitioning (IGP) by Ou and Ranka uses a linear-programmingformulation of the load balancing problem arising if refinements occur[15]. As they areable to optimize cut size in addition to load balance, the method produces results comparable


to those of RSB, and as they additionally include multilevel ideas, the algorithm is able torun in parallel, too.

The partitioning algorithm included in the Archimedes environment[28] is based on ageometric approach where a d-dimensional mesh is projected onto a (d + 1)-dimensionalsphere and partitioned by searching for a center point and cutting hyper-planes[29]. Thissequential method uses global information on the node co-ordinates as well as the graphstructure to obtain good partitions. Currently there are no parallel versions available anduse in an adaptive simulation is not directly possible.

The PREDIVIDER[23] uses the Farhat front-technique starting from several randomlychosen points and optimizes the decomposition afterwards by a node-exchange strategy.The method is neither parallel nor used in an adaptive environment.

PUL-SM and PUL-DM are software tools designed at the EPCC to support parallelunstructured mesh calculations[21]. The partitioning with PUL-DM is done sequentiallyand there is currently no support for adaptive refining meshes.

Ou, Ranka and Fox introduce Index Based Partitioning (IBP)[30,31], where they reducethe partitioning problem to index sorting which is solved by a parallel version of Sample-Sort. Like ROB and URB this method ignores structural information but can be used tohandle adaptively refining meshes in a parallel environment. Its results are comparable tothose of recursive co-ordinate bisection and the method is faster than recursive multilevelspectral bisection[17,32], especially if the number of processors is large[31]. IBP caneasily be applied to heterogeneous computing environments like, for example, workstationclusters which may, additionally, change their size and structure during run-time[33].

Hammond’s Cyclic Pairwise Exchange (CPE) heuristic[4] maps the mesh randomly ona hypercube and optimizes the embedding by a pairwise exchange of nodes until a localminimum of a cost function counting the amount of resulting communication is achieved.

Finally, Walshaw et al. designed the software tool JOSTLE[22], which is able to supportadaptively changing meshes. The method uses Farhat’s algorithm for a sequential clustering,tries to optimize the subdomain shapes, restores an optimal load balancing and finally triesto minimize the communication demands (cut size) by a parallelizable version of the KL-algorithm. The last three steps can be applied if adaptive numerical methods are used.A sequential version of the method performs well compared to existing methods and isoften faster[22]. The method has been implemented in parallel, too, showing similar goodresults[34].

1.3. Organization

The heuristic described here is implemented in a massively data-parallel environment. It isable to perform static partitioning as well as dynamic repartitioning and can therefore beused to handle adaptively changing meshes within a completely parallel FEM-code.

We will only describe the methods used to partition and repartition meshes and themain focus will be on the initial parallel partitioning. Software engineering aspects likeapplication interfaces and integrations into parallel FEM environments are beyond thescope of this paper.

The following section describes the different sequential algorithms for initial clusteringwe tested together with some mapping strategies. Section 3 describes the parallel heuristicand is followed by a section with results compared to existing methods.


2. SEQUENTIAL PREPROCESSING

In a completely parallel FEM-code the mesh generation should be done in parallel, too,but as mesh generation itself is a very complex problem and an area of active research,only a small number of parallel mesh generators exist[35]. If the mesh generation isperformed sequentially or if existing meshes are supposed to be used, the elements have tobe distributed over the processors of the massively parallel system (which is supposed to bea distributed memory machine) prior to any parallel partitioning algorithm. This requires afirst and fast initial mapping of the mesh to the processor graph. In some applications thismapping is done randomly[4] but as the mesh contains a lot of structure it is recommended touse this additional information in order to help the following parallel partition optimizationphase.

In our method the initial assignment of elements to processors is done in two steps: firsta clustering of the mesh is determined and afterwards the clusters are mapped, one-to-one,onto the processor graph.

2.1. Fast initial clustering

In principle, any sequential graph partitioning algorithm can be used to calculate the initialclustering. We decided to use three different methods which are fast, differ in their generalbehavior (i.e. in the characteristics of their output) and take benefit from the geometricinformation available with the node co-ordinates.

We rate the methods according to the load (i.e. variance in cluster size), shape of sub-domains, cut size (i.e. total number of edges connecting different subdomains), connectionof subdomains (i.e. each subdomain should consist of only one connected component) anddegree and structure of the resulting cluster graph they produce (the cluster graph containsa node for each cluster and edges expressing neighboring relations between clusters).

2.1.1. Box decomposition

The Box method is an iterative variant of recursive orthogonal bisection (ROB) or recursiveco-ordinate bisection (cf. Section 1.3 and [24,33]). It determines the smallest surroundingbox of the domain and partitions it into nx ·ny ·nz = P (= no. of processors) ‘subboxes’.The number of boxes in each direction is variable but it is recommended to choose themaccording to the ratio of the side-length of the surrounding box or according to the size ofthe dimensions of the processor network, if it is fixed. The boxes are determined in a waythat each of them contains the same number of nodes of the mesh.

The method generates equal sized clusters, well shaped subdomains which are notnecessarily connected, often bad cut size and a well structured cluster graph. Because asorting of nodes according to their co-ordinates is necessary, the box-method needs timeO(n logn) if n is the number of nodes.

2.1.2. Voronoi diagrams

The Voronoi method determines the Voronoi diagram to P nodes of the mesh. Each of theVoronoi cells then represents one subdomain. The choice of the P nodes uses two strategies:Feder’s and Green’s method to determine a set of points with largest geometrical distance


between each other (farthest point method[36]) and a pure random choice. Experimentsshowed for our application a slightly better cut size if random choice is used.

The method produces a non-balanced clustering with good cut size, well shaped sub-domains (although they need not be convex) which are not necessarily connected and anirregular structured cluster graph. The running time of the Voronoi method is O(nP ).

2.1.3. Farhat’s algorithm

Farhat’s algorithm[7,20] uses a BFS-like front technique to determine the clusters one afteranother. Starting from non-clustered nodes connected to the boundary of partition i− 1 itinserts nodes into cluster i according to their number of external edges, i.e. according tothe number of edges leading to non-clustered nodes.

The algorithm produces compact and balanced partitions which are not necessarilyconnected, a reasonable cut size and an irregular structured cluster graph. The running timeis O(|E|) (if E is the set of edges in the mesh and |E| > n).

2.2. Mapping to the parallel machine

The parallel machine is supposed to be a distributed memory MIMD-system with a two-or three-dimensional grid interconnection network.

The assignment of clusters to processors turns out to be a 1:1-mapping problem ofan arbitrary graph (the cluster graph) to a 3D-grid. Minimizing the dilation of such anembedding is still NP-complete[2].

2.2.1. GEOM MAP

With GEOM MAP we introduce a simple, co-ordinate based, mapping heuristic of geometricgraphs to grids. The co-ordinates of the nodes of the cluster graph correspond to the centersof gravity of the clusters. A 2D- or 3D-box system corresponding to the size (and dimension)of the processor topology is mapped over the cluster graph, and the box boundaries areadjusted until each box contains exactly one node.

Figure 1 shows an example for the mapping of an eight node cluster graph onto a4× 2–grid and the resulting communication structure.

0

1

2

3

4

5

6 7

0 12 3

46 7 5

Figure 1. The GEOM MAP algorithm


2.2.2. Bokhari’s algorithm

Bokhari introduced an iterative improvement heuristic for graph mappings[37]. The heuris-tic repeatedly exchanges pairs of nodes, increasing the number of dilation-1-edges of theembedding until no further improvement is possible. It then permutes

√P randomly chosen

nodes and applies the hill climbing search again. If no further improvement is achieved, thealgorithm terminates. The method is fairly slow if the processor network is large (O(P 3)).It has to be applied starting on an initial mapping. We chose the identity and the solutionof GEOM MAP as the starting solution.

2.3. First comparison

Figure 2 shows the resulting communication structure if the graph airfoil1 d is clusteredinto 64 parts using the Voronoi method and mapped onto an (8×8)-grid using the differentmapping algorithms (the test-set of graphs is described in Section 4.1).

Figure 2. Results of the different mapping variants (left to right): the graph airfoil1 d, the clustergraph of a 64-partition determined by the Voronoi method and its embeddings into an 8 × 8 grid

using Identity + Bokhari, GEOM MAP and GEOM MAP + Bokhari

Figure 3 shows the dilation and the amount of nearest neighbor communication (i.e.edges mapped without dilation) achieved by the different mapping algorithms. It can beobserved that GEOM MAP produces the lowest maximal dilation of all methods whereas thecombination of GEOM MAP and Bokhari results in the largest amount of directly mappededges. Bokhari’s algorithm maximizes the number of directly mapped edges, often at theexpense of some largely dilated ones. GEOM MAP favors low maximal dilation because ofits use of geometric information. It can serve as good starting solution to Bokhari if nearestneighbor communication has to be maximized and is recommended alone if the maximaldilation is important.

3. PARALLEL REMAPPING

The parallel partition optimization heuristic is the most important part of the whole method.It can be used within a completely parallel FEM environment optimizing the decompositionof parallel mesh generators and especially with adaptive simulations if the mesh (andtherefore the load distribution) changes during runtime.

The heuristic consists of two different parts, a balancing flow calculation and a nodemigration phase. The migration phase again splits into a number of steps.


Figure 3. Dilation and amount of nearest neighbor communication produced by different mappingsapplied to the clustering of the Voronoi method


3.1. Flow calculation

The balancing flow calculation uses the Generalized Dimension Exchange (GDE) method,which was introduced by Xu and Lau in 1991 (see, for example, [38,39]). This iterativemethod is completely parallel and local. It requires no global information and convergeson certain networks, such as grids, very fast.

Pi

PjLink

ExchangeLoad x Load y Load Load

+ λ ∗ y(1−λ) ∗ x x(1−λ) ∗ y + λ ∗

Pi

PjLink

Sweep

Figure 4. The basic operation of the GDE algorithm

Figure 4 shows the basic operation of the GDE algorithm. In each step a number ofdisjoint pairs of processors Pi and Pj in parallel balance their load over their commonedge according to an exchange parameter λ. A collection of steps in which each edge ofthe network is included once is called a sweep. It was shown that the expected numberof sweeps on certain networks is very low if an appropriate value of λ is chosen[38]. For(k1 × k2 × k3)-grids with k1 ≥ k2 ≥ k3, for example, a value of λ = 1

1+sin(π/k1)leads to

an optimal number of O(k1) sweeps. It can be shown[39] that the GDE method convergesasymptotically faster than, for example, the Diffusion method[40].

Unfortunately, the optimalλ-values are not known for arbitrary graphs, and wrong valueslead to decreased convergence rates, but if the algorithm is run on the processor network(grid) it can happen that it determines a flow between processors whose clusters are notadjacent in the cluster graph.

a) b)

c)

0 12 3

456 7

20

4

1

20

9

9

131

1010

29

350175175

220 210 170 220

80

0 12 3

456 7

20

4

1

20

9

131

1010

29

39

350175175

220 210 170 220

80

0 12 3

456 7

20

4

1

131

10

29

30

111

350175175

220 210 170 220

80

Figure 5. (a) The example of Figure 1 with artificial load and the flow calculated by the GDE-method; (b) transfer of the flow to the cluster-graph; (c) elimination of cycles

Figure 5(a) shows such an example. The grid-edges between clusters 0 and 7 and between3 and 5 do not belong to the cluster graph (cf. Figure 1). A migration of nodes betweensuch clusters would lead to bad subdomain shapes and to non-connected clusters.


We overcome this problem by transferring the flow on grid edges which do not belong tothe cluster graph to cluster edges on shortest paths between the corresponding subdomains.In Figure 5(b) the flow on grid-edge 7 → 0 is transferred to the path 7 → 3 → 0 in thecluster graph and 3 → 5 is transferred to 3 → 4 → 5. The transfer of the flow to thecluster graph can result in cyclic flow which is eliminated in order to reduce the number ofnecessary node migrations (cf. Figure 5(c)).

All this flow calculation is done logically. The result of this phase is the number ofnodes of the mesh that have to be migrated between neighboring subdomains in order toachieve an optimal load balancing. The physical node migration takes place after the flowcalculation and is discussed in the next section.

3.2. Node migration

The node migration phase again consists of four different steps which have to be appliediteratively. Only the second one realizes the balancing flow; the others try to optimize theshape and communication demands of the decomposition.

C1

C2

C1

C2

airfoil airfoil

Figure 6. Constructing connected subdomains

3.2.1. Constructing connected subdomains

The initial clustering can result in subdomains containing non-connected parts of the mesh.This first step tries to construct clusters which consist of only one connected componenteach. Every processor calculates in parallel the connected components of its part of themesh. To each such component the longest common boundary to components on otherprocessors is determined. The smallest components are then moved to the correspondingprocessors which contain the clusters with the longest common boundary and merged withthese clusters (cf. Figure 6).

This phase results in an increased load deviation between processors but improves thecut size, the shape of subdomains and the degree of the cluster graph.

3.2.2. Realizing the flow

The flow calculation determines the number of nodes to move over each edge of the clustergraph. The nodes to be migrated are chosen iteratively according to their benefit in cut sizeand subdomain shape. The cost function used to value the nodes consists of two parts – onecounting the cut size and one measuring the change of the shape if a node is moved.

Let D1, . . . , DP be the subdomains assigned to the P processors. For a node v ∈ Di letg(v,Dj) be the change in cut size if v is moved to clusterDj . We define δi to be the center


of gravity of cluster Di, i.e. δi is the mean of all node co-ordinates in subdomain Di. Letd(v, w) be the geometrical distance between two arbitrary points v and w. The radius ofa subdomain Di is defined as ri =

√(|Di|/π) in the 2D-case and ri = 3

√3|Di|/(4π) in

the 3D-case. In this measure, the size (area or volume) of a cluster is approximated by itsnumber of nodes. The idea to optimize the shape is based on a migration of nodes lying faraway from the center of gravity. We normalize the distance of a node from the center of itscluster by the radius of the corresponding subdomain[41] and set the change in shape tothe changing value of this measure for the originating and the destinating cluster of a node:

f(v,Dj) =d(δi, v)

ri− 1 +

d(δj , v)

rj− 1

Then the benefit b(v,Dj) of a migration of node v ∈ Di to cluster Dj is a weightedcombination of f and g:

b(v,Dj) = ω1 · g(v,Dj) + ω2 · f(v,Dj)

The processors determine the nodes to be moved to neighboring clusters according to thiscost function step by step, i.e. they perform movements of single nodes, updating the radiusof the subdomains after each movement.

Several problems occur if the node migration takes place in an asynchronous parallelenvironment. Because of some inconsistencies in node placement information the structureof the cluster graph may change and sometimes a cluster may split into disconnectedcomponents. These problems can be avoided by a careful choice of nodes to be migrated[42].

3.2.3. Eliminating node-chains

Near domain boundaries, especially near inner boundaries but sometimes also betweensubdomain boundaries, the described node migration strategies may produce small node‘offshoots’ (cf. Figure 7). This is mainly because they overweight the cut size in comparisonto the domain shape if the mesh is very irregular. To avoid such chains which worsen theshape very much we use an extra search for these cases. The chains are easy to detectbecause they consist of nodes on subdomain boundaries which are only connected to nodesalso lying on boundaries. The detected chains are moved in order to optimize the shape.The resulting load imbalance can be removed by a second flow calculation and migration.

airfoil

Figure 7. Node-chains


3.2.4. Improving the shape

After the first three steps have produced balanced and connected subdomains, the fourthphase uses a hill-climbing approach to further optimize the decomposition. All processorsdetermine, in parallel, pairs of nodes that can be exchanged to improve the costs of thepartition. This node exchange is done very carefully to avoid the generation of node-chainsor disconnected components.

3.3. The parallel algorithm

Figure 8 shows the overall structure of the parallel partition optimization heuristic Par2. Itconstructs connected components and then repeats to calculate and realize a flow of nodesbetween subdomains until the partition is balanced. This repetition is necessary because insome cases a determined flow cannot be realized directly. (If, for example, during the nodemigration a cluster would become empty, the node movement is stopped.)

After the first balanced partition is achieved, the algorithm searches for node chains andeliminates them. This makes a further flow calculation and node migration phase necessary.

Finally the algorithm tries to further improve the partition by node exchanges.The next section shows the benefits of the four different steps and a comparison of results

to those obtained by existing methods.

Procedure Par2

Obtain initial clustering;REPEAT

Construct connected subdomains;REPEAT

Determine balancing flow; /* GDE /*Realize flow;

UNTIL (Partition balanced);IF (first loop) Eliminate node chains;

UNTIL (second loop);Improve partition;

Figure 8. The structure of the parallel heuristic

4. RESULTS

4.1. The benchmark suite

Table 1 shows important characteristics of the graphs chosen as the benchmark suite. Thesame meshes are used as test graphs in several other publications[3,4,11,16,17,25,28].

The table gives the number of nodes (|V |), number of edges (|E|), maximal node degree,number of elements, type of elements (number of nodes belonging to one element, allmeshes are homogeneous in their type of elements), and the geometric dimension.

For most of our measurements we chose to partition the dual graphs (or element-graph)to the above given (denoted by mesh d). Such a dual graph consists of one node per elementof the original mesh connected to all nodes representing neighboring elements. Thus, the


Table 1. The benchmark suite

Mesh |V | |E| deg #Elm. Elm. deg. Dim.

grid2 3296 6432 5 3136 4 2airfoil1 4253 12289 9 8034 3 2whitaker3 9800 28989 8 19190 3 2crack 10240 30380 9 20141 3 2big 15606 45878 10 30269 3 2

number of nodes of the dual graph equals the number of elements of the original one, andits degree corresponds to the element type (neighboring elements have to share a commonedge).

The choice of dual graphs results from the requirements of most numerical simulationmethods to split along element boundaries.

4.2. Cost functions

In the literature, the quality of the partitions obtained is measured by several differentcost functions. We chose the most important ones to compare the results of Par2 to otherpartitioning heuristics (cf. Table 2). Note that Par2 optimizes the cost function shown inSection 3.2 and not directly those shown here.

Table 2. Cost functions

Function Description Ref.

Γ1 Total cut size [3,11–14,16,24,25,30,43,44]Γ2 Maximal cut size [15,43]Γ3 Degree of cluster graph [4,17,43]Γ4 Aspect ratioΓ5 Shape

The total cut size (Γ1) is the sum of all mesh edges connecting different subdomains. It is ameasure for the total amount of data that has to be transferred in each step of the numericalalgorithm. Γ2 is the maximal number of external edges of one cluster and correspondsto the largest amount of data a single processor has to transfer in each communicationstep. The maximal degree of the cluster graph (Γ3) is equal to the maximal number ofcommunications that have to be initiated by a single processor. In some hardware topologiesthis time dominates the message transfer time (if messages are small) and is therefore veryimportant.

If the throughput of the communication network is large enough and if the communicationstructure of the application produces no ‘hot-spots’, then a combination of Γ2 and Γ3 cangive a measure for the time needed to complete a single communication step[45]. Certainload balancing heuristics minimize Γ2 explicitly[15]. Γ3 is often minimized indirectly assome partitioning heuristics optimize the dilation if the cluster graph is mapped onto ahypercube[4,17].

Γ4 and Γ5 measure the shape of the subdomains. The aspect ratio of a cluster is (in thiscase) defined as the ratio between its longest and shortest boundary segments. A segment isa part of the boundary leading to a single neighboring cluster. For Γ4 its length is measured


in numbers of nodes. The aspect ratio of the partition (Γ4) then gives the maximum aspectratio of all subdomains. Γ5 determines the difference between the area (or volume) of thelargest inner circle (or ball) and the area (or volume) of the smallest outer circle (or ball) ofa domain. This measures the similarity of the shape of a domain to the ideal shape (circleor ball).

4.3. Influence of initial clustering

Table 3 shows the best cut sizes for various test graphs obtained by Par2 after initialclustering with one of the presented algorithms (cf. Section 2). It can be seen that balancedk-partitions resulting from the Box (resp. Farhat) method have been improved significantly.This shows that Par2 is able to optimize even balanced decompositions. The best initial datadistribution depends very much on the problem structure, the number of subdomains and theconsidered cost function. The last fact can be observed from Table 4, which shows resultsof the different measures if the parallel heuristic is started on the three initial clusterings.The Voronoi method produces, in general, very good results, but note that it also producesvery unbalanced partitions which cause large node movements (cf. Section 4.4).

Table 3. Comparison of achieved cut sizes after different initial data distributions. Values in bracketsindicate the initial cut size after clustering, other values the best cut size after parallel optimization.

Best values in each row are framed

Measure: Γ1 k Box Farhat Voronoi

airfoil1 d 8 207 (272) 329 (423) 238 (230)

crack d 8 463 (556) 437 (611) 426 (421)

big d 8 535 (709) 719 (848) 494 (519)

airfoil1 d 32 666 (870) 639 (684) 619 (630)

crack d 32 1154 (1405) 1146 (1483) 1150 (1143)

big d 32 1478 (1950) 1384 (1454) 1525 (1375)

Table 4. Comparison of several cost functions after different data distributions. Numbers in bracketsdenote the initial value after clustering, and other numbers the value after parallel optimization

Graph: crack dInitial k Γ1 Γ2 Γ3 Γ4 Γ5

Farhat 8 437 (611) 171 (211) 6 (6) 4.8 (15.4) 0.8 (2.6)

Box 8 463 (556) 164 (197) 4 (4) 79.0 (82.0) 0.6 (0.6)

Voronoi 8 426 (421) 152 (155) 6 (6) 11.8 (13.5) 0.6 (0.6)

Farhat 64 1634 (2121) 79 (130) 7 (21) 21.0 (25.0) 0.2 (1.3)

Box 64 1653 (1925) 98 (160) 9 (8) 22.0 (34.0) 0.1 (0.5)

Voronoi 64 1674 (1663) 76 (94) 8 (8) 21.0 (20.0) 0.2 (0.2)


4.4. Amount of flow and saving by cycle elimination

Table 5 shows the number of edges with flow, the total amount of flow, the number of cyclesand the saving by cycle elimination if one of the benchmark graphs is initially partitionedwith the Voronoi method and mapped to grids of different sizes by GEOM MAP.

It can be observed that the total number of moved nodes is – depending on k – in theorder of the size of the graph. This is, of course, different if other initial clusterings andmappings are used. Note that a movement of a node in the grid to a processor in distancetwo counts for two units in the amount of flow. Thus the mapping is of importance.

Table 5. The amount of flow if the graph big d with |V | = 30269 nodes is partitioned with theVoronoi method and mapped to a grid with GEOM MAP

k #Flow-edges Amount of flow # Cycles Savings

8 36 21759 9 294716 133 43656 66 1055632 303 48851 144 1162364 620 58175 388 32656

It can also be observed that the search for cycles and their elimination saves a large amountof node movements (up to 50% on certain examples) even if the number of subdomainsis small. However, this also shows that the calculated flow is by far not the best possible.Determining the ‘best’ flow which optimizes the shapes with the least amount of nodemovements turns out to be a difficult task.

4.5. Performance of the different steps of the parallel heuristic

The parallel algorithm as shown in Figure 8 can be divided into four steps:

Step 1: Initial clustering.Step 2: First loop (construction of a balanced partition with connected subdomains).Step 3: Second loop (elimination of node chains and rebalancing).Step 4: Shape optimization by node exchange.

Table 6 shows the cut size (Γ1) after the four steps of the algorithm for two of thebenchmark graphs, all three initial clusterings and differing numbers of k. In can beobserved that all three phases of the parallel partition optimization (steps 2–4) contributesignificantly to the overall result. The cut size of initially balanced partitions is improved

Table 6. The cut size (Γ1) after the different steps of the parallel optimization phase

Mesh: grid2 dStep

Initial k 1 2 3 4

Farhat 32 681 669 636 619Box 64 1006 972 955 936Voronoi 16 381 387 387 383

Mesh: airfoil1 dStep

Initial k 1 2 3 4

Farhat 8 255 219 220 213Box 32 870 703 694 666Voronoi 16 416 423 392 369


by constructing connected subdomains, and even if the partition is initially not balanced,as is the case if the Voronoi method is used, the cut size stays the same or can be improved(cf. also Section 4.3). The elimination of node chains and the final smoothing of boundariesby node exchange improve the results further.

4.6. Comparison to other heuristics

We compare the performance of Par2 to a number of other efficient and often used parti-tioning heuristics. The HS-heuristic[11] is our own implementation[19], and the versionsof KL[46], the Inertial method (In)[16], the Spectral method (SP)[25] and the Multilevelmethod (ML)[47] are implemented in the Chaco1 library of partitioning algorithms[17].The results of Farhat’s algorithm[7] are obtained by an own implementation which is alsoused to determine one of the initial clusterings.

Table 7. Comparison of achieved cut sizes with different partitioning heuristics for k = 16 (smallnumbers indicate running times in seconds on SS10/50 for sequential heuristics and 16 T805 for

Par2)

Graph KL HS In In+KL SP SP+KL ML(+KL) Farhat Par2

grid2 d 357 350 432 330 373 328 317 406 3790.96 0.61 0.10 1.02 5.42 6.61 2.89 0.38 8.2

airfoil1 d 860 764 503 400 372 309 325 491 3820.48 0.16 0.20 0.86 21.29 21.20 2.98 0.76 20.5

whitaker3 d 1733 2224 696 623 729 615 657 654 6517.11 0.58 0.37 1.76 70.83 72.12 6.02 1.73 33.2

crack d 1738 2162 797 639 671 577 615 1010 6601.75 0.69 0.45 2.03 74.14 82.60 6.47 1.53 47.1

big d 1875 1543 1219 995 863 675 605 1050 9132.00 0.62 1.05 3.31 124.8 131.34 8.05 2.26 22.5

Table 7 shows results of the different methods if the benchmark graphs are split into16 parts and cut size is counted. The results of Par2 are the best achieved with any ofthe different initial clusterings. The results of the partitioning heuristics are obtained byrecursive bisection with the best found parameter settings.

It can be observed that the efficient combinations of global and local heuristics likeSP+KL and ML+KL outperform all other methods. The pure local methods KL and HSbehave poorly on the last four meshes. This is because the maximal degree of these graphsis three (cf. Section 4.1).

The results of Par2 are convincing – always better than those of Farhat’s algorithmand comparable to SP or IN+KL. The timing results seem to be poor, but this is mainlydue to the differing architectures of the sequential and parallel machines. All sequentialmeasurements were taken on Sun SS10/50 workstations; the parallel algorithm ran on anetwork of T805 transputers. We show no further time measurements because Par2 is notdesigned to obtain speedups but to produce high quality clusterings in a massively parallelenvironment.

1 Sandia National Laboratories. Use granted by License Agreement No. 93-N00053-021.


Table 8. Comparison of different cost functions

Graph: airfoil1 d, k = 8Par2 SP In+KL HS

Γ1 207 251 259 379

Γ2 65 86 95 309

Γ3 4 4 5 5

Γ4 1.0 1.4 1.4 1.6

Γ5 5.5 5.2 8.2 72.0

Table 8 shows results of the different cost functions if the mesh airfoil1 d is partitionedinto eight clusters using Par2, Spectral, Inertial with KL, and HS. In nearly all measuresthe parallel heuristic is able to produce improved results. Note that Γ2 and Γ3, in particular,expressing the structure of the cluster graph, are not directly optimized by Par2, but theresults are nevertheless significantly better than those of the other methods. The values ofΓ4 and Γ5 are taken into account indirectly by the cost function of the parallel method.Regarding the aspect ratio, Par2 is able to achieve an optimal result. To be fair we have tomention that all the other methods are pure graph partitioning heuristics which optimizecut size only, but the example shows that cut size is not the only important measure.

Figure 9. A zoom into a partition obtained with SP (left) and with Par2 (right)

Figure 9 shows an example where the benefits of shape optimization are directly visible.The left partition of airfoil1 is obtained by the Spectral Method, and the right one by Par2.Although the cut sizes of both partitions are nearly equal, the shape of the subdomainsproduced by Par2 is much better. What is not directly visible is the fact that the subdomainsin the right picture are connected, whereas they are not in the left example.


5. CONCLUSIONS

We have presented the parallel partitioning heuristic Par2 for unstructured FEM-meshes.The data-parallel heuristic runs in an asynchronous massively parallel environment andis able to optimize mesh-mappings and to solve the load balancing problem if adaptivelyrefining/derefining meshes are used.

Measurements using several benchmark FEM-meshes together with different initialplacements of meshes to processors have shown that Par2 produces results in cut sizewhich are comparable to many other existing methods. If the shape of subdomains andthe structure of the resulting cluster graph are considered, our method generates resultswhich are significant improvements compared to those of pure partitioning heuristics. Thisis mainly due to the fact that Par2 optimizes not only cut size but also partition shapes.

In particular, preconditioned conjugate gradient solvers should be able to take benefitfrom this shape optimization. Nevertheless, distributed load balancing methods like the onepresented here still have to prove their suitability in the context of parallel adaptive FEMcalculations.

ACKNOWLEDGEMENTS

This work was partly supported by the DFG Sonderforschungsbereich 376 ‘Massive Paral-lelitat: Algorithmen, Entwurfsmethoden, Anwendungen’ and the EC-ESPRIT Long TermResearch Project 20244 (ALCOM-IT).

We would like to thank Friedrich-Gerhard Buchholz, Hinderk van Lengen and therest of the structural mechanics group in Paderborn who gave us, together with WolfgangBorchers and Uwe Dralle, some insight into the demands of FEM-simulations. ChengzhongXu provided a great deal of help on the GDE method during his stay in Paderborn. The codeof the Chaco library was provided by Bruce Hendrickson and Robert Leland. Horst Simon,Alex Pothen and Steven Hammond made additional tools and FEM-meshes available.Finally we want to thank our colleagues Reinhard Luling, Jurgen Schulze, Robert Preis,Carsten Spraner, Chris Walshaw and Ralf Weickenmeier for many helpful discussions.

REFERENCES

1. B. Monien, R. Diekmann, R. Feldmann, R. Klasing, R. Luling, K. Menzel, T. Romke andU.-P. Schroeder, ‘Efficient use of parallel & distributed systems: From theory to practice’, inJ. van Leeuwen (Ed.), Comput. Sci. Today, Springer LNCS 1000, 1995, pp. 62–77.

2. R. Diekmann, R. Luling and A. Reinefeld, ‘Distributed combinatorial optimization’, Proc. ofSofsem’93, Czech Republik, 1993, pp. 33–60.

3. R. Diekmann, R. Luling, B. Monien and C. Spraner, ‘Combining helpful sets and parallelsimulated annealing for the graph-partitioning problem’, Parallel Algorithms Appl., 8, 61–84(1996).

4. S. W. Hammond, ‘Mapping unstructured grid computations to massively parallel computers’,Tech. Rep. No. 92.14, RIACS, NASA Ames Res. Center, June 1992.

5. J. E. Savage and M. G. Wloka, ‘Parallelism in graph-partitioning’, J. Parallel Distrib. Comput.,13, 257–272 (1991).

6. H. D. Simon and S. H. Teng, ‘How good is recursive bisection’, Tech. Rep. RIACS, NASAAmes Research Center, June 1993.

7. C. Farhat, ‘A simple and efficient automatic FEM domain decomposer’, Comput. Struct., 28,(5), 579–602 (1988).


8. R. Diekmann, U. Dralle, F. Neugebauer and T. Romke, ‘PadFEM: A portable parallel FEM-tool’, Proc. Int. Conf. on High-Performance Computing and Networking (HPCN-Europe ’96),Springer LNCS 1067, 1996, pp. 580-585.

9. P. Bastian, S. Lang and K. Eckstein, ‘Parallel adaptive multigrid methods in structural mechan-ics’, Numer. Lin. Algebr. Appl., to be published.

10. J. De Keyser and D. Roose, ‘Run-time load balancing techniques for a parallel unstructuredmulti-grid Euler solver with adaptive grid refinement’, Parallel Comput., 21, 179–198 (1995).

11. R. Diekmann, B. Monien and R. Preis, ‘Using helpful sets to improve graph bisections’, in Hsu,Rosenberg, Sotteau (Eds.), ‘Interconnection networks and mapping and scheduling parallelcomputations’, DIMACS Ser. Discrete Math. Theor. Comput. Sci., AMS, 21, 57–73 (1995).

12. B. Hendrickson and R. Leland, ‘An improved spectral graph partitioning algorithm for mappingparallel computations’ SIAM J. Sci. Comput., 16, (2), 452–469 (1995).

13. G. Karypis and V. Kumar, ‘A fast and high quality multilevel scheme for partitioning irregulargraphs’, Tech. Rep. 95-035, Dept. of Comp. Science, University of Minnesota, 1995.

14. O. C. Martin and S. W. Otto, ‘Partitioning of unstructured meshes for load balancing’, Concur-rency: Pract. Exp. 7, (4), 303–314 (1995).

15. C.-W. Ou and S. Ranka, ‘Parallel incremental graph partitioning’, Tech. Rep. SCCS-652, NPAC,Syracuse University, 1994.

16. H. D. Simon, ‘Partitioning of unstructured problems for parallel processing’, Comput. Syst. Eng.,2, 135–148 (1991).

17. B. Hendrickson and R. Leland, ‘The Chaco user’s guide’, Tech. Rep. SAND93-2339, SandiaNational Laboratories, Nov. 1993.

18. F. Pellegrini and J. Roman, ‘Scotch: A software package for static mapping by dual recur-sive bipartitioning of process and architecture graphs’, Proc. Int. Conf. on High-PerformanceComputing and Networking (HPCN-Europe ’96), LNCS 1067, Springer-Verlag, April 1996,pp. 493-498.

19. R. Preis and R. Diekmann, ‘The PARTY partitioning-library, users guide’, Technical Report,University of Paderborn, 1996.

20. C. Farhat and H.D. Simon, ‘TOP/DOMDEC - a software tool for mesh partitioning and parallelprocessing’, Tech. Rep. RNR-93-011, NASA Ames Research Center, 1993.

21. S. Trewin, ‘PUL-DM prototype user guide’, Tech. Rep., Edinburgh Parallel Computing Center,1993.

22. C. Walshaw, M. Cross and M. G. Everett, ‘A parallelisable algorithm for optimising unstructuredmesh partitions’, Math. Res. Rep., Univ. of Greenwich, London, 1995.

23. H. van Lengen and J. Krome, ‘Automatische Netzeinteilungsalgorithmen zum effektiven Ein-satz der parallelen Substrukturtechnik’, in R. Flieger and R. Grebe (Eds.), Parallele Datenver-arbeitung aktuell: TAT ’94. IOS Press, 1994, pp. 327–336 (in German).

24. M. T. Jones and P. E. Plassmann, ‘Parallel algorithms for the adaptive refinement and partition-ing of unstructured meshes’, Proc. Scalable High Performance Computing Conference, IEEEComputer Society Press, 1994, pp. 478–485.

25. A. Pothen, H. D. Simon and K. P. Liu, ‘Partitioning sparse matrices with eigenvectors of graphs’,SIAM J. Matrix Anal. Appl., 11/3, 430–452 (1990).

26. C. Walshaw and M. Berzins, ‘Dynamic load-balancing for PDE solvers on adaptive unstructuredmeshes’, Concurrency: Pract. Exp. 7, (1), 17–28 (1995).

27. R. Van Driessche and D. Roose, ‘An improved spectral bisection algorithm and its applicationto dynamic load balancing’, Parallel Comput. 21, 29–48 (1995).

28. G. E. Blelloch, A. Feldmann, O. Ghattas, J. R. Gilbert, G. L. Miller, D. R. O’Hallaron,E. J. Schwabe, J. R. Shewchuk and S. -H. Teng, ‘Automated parallel solution of unstructuredPDE problems’, Commun. ACM, to be published.

29. G. L. Miller, S.-H. Teng, W. Thurston and S. A. Vavasis, ‘Automatic mesh partitioning’, inGeorge, Gilbert, Liu (Eds.), Graph Theory and Sparse Matrix Computation, Springer, 1993.

30. C.-W. Ou and S. Ranka, ‘Parallel remapping algorithms for adaptive problems’, Tech. Rep.CRCP-TR94506, Rice University, 1994.

31. C.-W. Ou, S. Ranka and G. C. Fox, ‘Fast and parallel mapping algorithms for irregular problems’,Tech. Rep. SCCS-729, NPAC, Syracuse University, 1995.


32. S. T. Barnard and H. D. Simon, ‘Fast multilevel implementation of recursive spectral bisectionfor partitioning unstructured problems’, Concurrency: Pract. Exp., 6, (2), 101–117 (1994).

33. M. Kaddoura, C.-W. Ou and S. Ranka, ‘Mapping unstructured computational graphs for adaptiveand nonuniform computational environments’, Tech. Rep. SCCS-725, NPAC, Syracuse Univer-sity, 1995; to be published in IEEE Parallell Distrib. Technology.

34. C. Walshaw, M. Cross and M. G. Everett, ‘Parallel partitioning of unstructured meshes’, Proc.of Parallel CFD ’96, 1996.

35. N. Chrisochoides, G. C. Fox and J. Thompson, ‘Parallel grid generation on distributed memoryMIMD machines for 3-dimensional general domains’, Tech. Rep. SCCS-672, NPAC, SyracuseUniversity, 1993.

36. T. Feder and D.H. Greene, ‘Optimal algorithms for approximate clustering’, ACM Symp. onTheory of Computing (STOC), 1988.

37. S. H. Bokhari, ‘On the mapping problem’, IEEE Trans. Commun., 30, (3), 207–214 (1981).38. C. Xu and F. Lau, ‘The generalized dimension exchange method for load balancing in k-ary

n-cubes and variants’, J. Parallel Distrib. Comput., 24, (1), 72–85 (1995).39. C. Xu, B. Monien, R. Luling and F. Lau, ‘Nearest-neighbor algorithms for load-balancing in

parallel Computers’, Concurrency: Pract. Exp., 7, (7), 707–736 (1995).40. J. E. Boillat and P. G. Kropf, ‘A fast distributed mapping algorithm’, Proc. of CONPAR’90,

Springer LNCS 457, 1990.41. N. Chrisochoides, C. E. Houstis, E. N. Houstis, S. K. Kortesis and J. R. Rice, ‘Automatic load

balanced partitioning strategies for PDE computations’, Proc. of ACM Int. Conf. on Supercom-puting, 1989, pp. 99–107.

42. D. Meyer, ‘Datenparallele k-Partitionierung unstrukturierter Finite Elemente Netze’, MasterThesis, CS-Dept., University of Paderborn, 1995 (in German).

43. N. Floros, J. R. Reeve, J. Clinckemaillie, S. Vlachoutsis and G. Lonsdale, ‘Comparative effi-ciencies of domain decompositions’, Tech. Rep. Univ. of Southampton, 1994.

44. R. D. Williams, ‘Performance of dynamic load balancing algorithms for unstructered meshcalculations’, Concurrency: Pract. Exp., 3, 457–481 (1991).

45. B. Monien, R. Diekmann and R. Luling, ‘Communication throughput of interconnection net-works’, Proc. 19th Int. Symp. on Mathematical Foundations of Computer Science (MFCS ’94),Springer LNCS 841, 1994, pp. 72–86.

46. B. W. Kernighan and S. Lin, ‘An effective heuristic procedure for partitioning graphs’, BellSystems Tech. J., 291–308 (1970).

47. B. Hendrickson and R. Leland, ‘A multilevel algorithm for partitioning graphs’, Tech. Rep.SAND93-1301, Sandia National Laboratories, Oct. 1993.

Documents

Parallel decomposition of unstructured FEM-meshes