
DYNAMIC LOAD BALANCING FOR PARALLEL ADAPTIVE MESH


DYNAMIC LOAD BALANCING FOR PARALLEL ADAPTIVE MESH REFINEMENT

BY

XIANGYANG LI

B.Eng., Tsinghua University, Beijing, 1995
B.Eco., Tsinghua University, Beijing, 1995

THESIS

Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1999

Urbana, Illinois


© Copyright by Xiangyang Li 1999


Dynamic Load Balancing for Parallel Adaptive Mesh Refinement
Xiangyang Li, M.S.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1999
Shang-Hua Teng, Advisor

A key step in the finite element method is to generate a high-quality mesh that is as small as possible for an input domain. Several meshing methods and heuristics have been developed and implemented. Methods based on advancing front, Delaunay triangulations, and quadtrees/octrees are among the most popular. Advancing front uses simple data structures and is efficient. Unfortunately, in general, it does not provide any guarantee on the size and quality of the mesh it produces. On the other hand, the sphere-packing based Delaunay methods generate a well-shaped mesh whose size is within a constant factor of the optimal.

Adaptive mesh refinement is a key problem in large-scale numerical calculations. The need for adaptive mesh refinement can introduce load imbalance among processors, where the load measures the amount of work required by the refinement itself as well as by the numerical calculations thereafter. We present a dynamic load balancing algorithm that balances the work for refinement and for the subsequent computation while minimizing the communication overhead (including the overhead of moving submeshes around). The main ingredient of our method is a technique for estimating the size and the element distribution of the refined mesh before we actually generate it. Based on this estimation, we can reduce the dynamic load balancing problem to a collection of static partitioning problems, one for each processor. Each processor can then, in parallel, locally apply a static partitioning algorithm to generate the basic units of submeshes for load rebalancing. We then model the communication cost of moving submeshes by a condensed and much smaller subdomain graph, and apply a static partitioning algorithm to generate the final partition.


TO MY FAMILY


Acknowledgements

First and foremost, I would like to thank Professor Shang-Hua Teng for being a wonderful advisor. His insights, invaluable help, and guidance enlightened me in various aspects of the work presented in this thesis. I would like to acknowledge the constant assistance and encouragement of my best friends. Special thanks go to Alper Ungor, who was also conducting research on mesh generation algorithms, for his instructive comments and perfectionist habits. The discussions with him not only enhanced my research abilities but also improved my English a lot. I also thank Marcia, my officemate and friend, for making the office such a nice place to stay. I would also like to thank the Department of Computer Science and the CSAR project at the University of Illinois for the computing resources.

Last but not least, my parents, my wife, and my daughter deserve particular recognition for being the driving force in my life. I thank my wife Chen Min for her love and support, my parents and sister for their unconditional encouragement and belief in my ongoing studies, and my sweet daughter Sophia for ensuring that there was never a dull moment while writing this thesis.


Table of Contents

Chapter
1 Introduction
  1.1 Mesh Generation
  1.2 Graph Partition
  1.3 Parallel Mesh Refinement and Load Balancing
  1.4 Thesis Outline
2 Mesh Generation
  2.1 Introduction
  2.2 Structured and Unstructured Mesh
  2.3 Quality of Mesh
  2.4 Control Space
    2.4.1 Geometric Features
    2.4.2 Numerical Spacing
    2.4.3 Control Spacing Function
  2.5 Conformality and Size of Mesh
  2.6 Delaunay Triangulation
  2.7 Mesh Generation Method
    2.7.1 Advancing Front Methods
    2.7.2 Sphere Packing Methods
3 Graph Partition
  3.1 Introduction
  3.2 Background and Good Separators
  3.3 Graph Partition Algorithm
    3.3.1 Level-structure Partitioning
    3.3.2 The Spectral Partitioning Algorithm
    3.3.3 Geometric Approach
    3.3.4 A Multilevel Algorithm
4 Load Balancing
  4.1 Introduction
  4.2 Dynamic Balanced Quadtrees
  4.3 Modeling Adaptive Refinement with Dynamic Quadtree
  4.4 Reduce Dynamic Load Balancing to Static Partitioning
  4.5 Subdomain Size Estimation
    4.5.1 Affected Boxes of Refining a Box
    4.5.2 Size Estimation of Balanced Quadtree
  4.6 Sampling Boxes from T �
  4.7 Subdomain Partitioning
  4.8 Subdomain Redistribution
  4.9 Remeshing Unstructured Meshes
5 Conclusions
Bibliography
Vita


List of Figures

2.1 A typical example of an unstructured mesh.
3.1 The Delaunay mesh from a random point set.
3.2 The 16-way partition of its vertices.
3.3 The 16-way partition of its triangle elements.
4.1 A balanced quadtree T in two dimensions, and a 4-way partition.
4.2 Quadtree T′ after refining T, and T � after balancing T′. (a): assume the splitting depths of b1, b2, b3 and b4 are 1, 1, 2 and 2, respectively. (b): assume that after balancing T′, we maintain the current partition of T.
4.3 The two templates for splitting boxes in region(b), and how the refinement of b influences the splitting of leaf-boxes contained in region(b).
4.4 The three scenarios for two edge-neighbor leaf-boxes. (a): �(b, b1) = 1; (b): �(b, b1) = 0; (c): �(b, b1) = −1.
4.5 The five scenarios for two corner-neighbor leaf-boxes b1 and b. (a): �(b, b1) = 2; (b): �(b, b1) = 1; (c): �(b, b1) = 0; (d): �(b, b1) = −1; (e): �(b, b1) = −2.
4.6 An example of the pressures of leaf-box b, and examples of edge-boxes, corner-boxes, and center-boxes.
4.7 A typical example of a subdomain partition for T.
4.8 Constructing the subdomain graph from the subdomain partition.
4.9 An example of subdomain redistribution and refinement of a subdomain.


Chapter 1

Introduction

Many problems in computational science and engineering, geographic information systems, and computer graphics are based on structured and/or unstructured meshes in two or three dimensions. The meshes can be quite large, often containing millions of elements. The size of a mesh is usually determined by the size of the machine available to solve the problem and by the accuracy requirement of the problem. An essential step in numerical simulation is to find a proper discretization of a continuous domain. This is the problem of mesh generation [4, 40], which is a key component in computer simulation of physical and engineering problems. The following six basic steps are usually used to conduct a numerical simulation by finite element, finite difference, and finite volume methods.

1. Mathematical modeling: define the continuous domain and the partial differential equations (PDEs) over the domain that accurately model the physical or engineering problem;
2. Geometric modeling: approximate the continuous domain with a discrete description;
3. Mesh generation: decompose the interior of the domain into a mesh M of simple and "well-shaped" elements such as boxes and simplices;
4. Numerical approximation: construct a system of linear or nonlinear equations over M for the governing PDEs;
5. Numerical solution: solve the system of equations and estimate the error of the solution;


6. Adaptive refinement: based on the error estimation, if necessary, refine the mesh and repeat steps 4 through 6 over the refined mesh.

1.1 Mesh Generation

Meshing [4, 40, 26, 27, 28, 43, 44] can be defined as the process of breaking up a physical domain into smaller subdomains (elements) in order to facilitate the numerical solution of a partial differential equation. Meshing is used in a wide variety of applications, such as solid modeling, computer-aided design, graphical rendering, and scientific computation. The principal application of interest here is the finite element method. For problems with complex geometric boundaries and with solutions that change rapidly, we need an unstructured mesh with varying local topology and spacing in order to reduce the problem size. A good unstructured meshing algorithm uses elements of properly chosen size and shape that adapt to the complex geometry and to the required solution accuracy. In doing so, it generates meshes that are numerically sound and also as small as possible. Several meshing methods and heuristics have been developed, implemented, and applied to problems such as steady-state and transient compressible inviscid flow simulations.

Over the years, meshing methods such as those based on advancing front, Delaunay triangulations, and quadtrees/octrees have become popular because they are effective in practical applications. However, these methods do not have equal strengths. For example, advancing front [6, 30, 31] uses simple data structures, is efficient, and is relatively easy to implement. It offers a high-quality point-placement strategy and preserves the integrity of the boundary. Unfortunately, it does not provide any general guarantee on the size and quality of the mesh it produces. On the other hand, more sophisticated methods such as quadtree/octree refinement [4, 40, 52] and Delaunay methods [8, 9, 10, 35, 43, 44] generate a well-shaped mesh whose size is within a constant factor of the optimal.
Li et al. [28] recently developed a new mesh generation algorithm called the biting method. It combines the strengths of advancing front and these provably good meshing methods: it not only guarantees the quality of the generated mesh, but also solves the problems caused by the advancing front method.

The particular type of Delaunay method that they use in conjunction with advancing front is the sphere-packing method. It first constructs a well-spaced point set by computing a sphere packing of the domain and then uses the Delaunay triangulation of this point set as the final mesh. Two methods have been developed to generate the well-spaced point set. The first applies particle simulation [45, 46, 47] to find a stable configuration of a set of energetic spheres. The second uses quadtree/octree refinement to obtain an oversample of the input domain, and then applies a properly defined maximal independent set to create the sphere packing [3, 5, 27, 37, 50]. Both in theory and in practice, the second approach is faster.

1.2 Graph Partition

Parallel computing has become a critical component of the computing technology of this decade. To reduce the time spent waiting at synchronization events, the workload of each processor should be balanced. The process of balancing the workload and reducing the synchronization wait time consists of four parts [12]:

1. Identifying enough concurrency in the decomposition and overcoming Amdahl's Law;
2. Deciding how to manage the concurrency, statically or dynamically;
3. Determining the granularity at which to exploit the concurrency;
4. Reducing the serialization and synchronization cost.

The computation and data dependencies of many scientific computing problems can be modeled by a graph (directed or undirected). The vertices of the graph are the basic data units or computational tasks. Two vertices are connected if there is a data dependency between them or a computational dependency between the tasks. We can then use graph partitioning to identify the concurrency in a given parallel computing problem.
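The following Python sketch is a minimal illustration of this graph model. It evaluates a given vertex-to-processor assignment using the quantities a partition is usually judged by: the load of each part (optionally with vertex weights), the number of cut edges, and a max-over-average imbalance factor. It also reports the set of neighboring parts of each part, which, under message packing, approximates how many messages each processor sends. The function name and the imbalance measure are illustrative assumptions, not definitions from this thesis.

```python
# Illustrative sketch; the function name and return layout are assumptions.
from collections import defaultdict


def partition_quality(edges, part, weight=None):
    """Evaluate a vertex partition of an undirected graph.

    edges  : iterable of (u, v) vertex pairs
    part   : dict mapping every vertex to its part (processor) id
    weight : optional dict of vertex weights; each vertex counts 1 if omitted
    Returns (load per part, number of cut edges, max/avg imbalance,
             set of neighboring parts for each part).
    """
    load = defaultdict(float)
    for v, p in part.items():
        load[p] += 1.0 if weight is None else weight[v]

    cut = 0
    neighbors = defaultdict(set)
    for u, v in edges:
        pu, pv = part[u], part[v]
        if pu != pv:               # edge crosses the partition: it is in the cut
            cut += 1
            neighbors[pu].add(pv)  # with message packing, one message per
            neighbors[pv].add(pu)  # neighboring part is enough

    avg = sum(load.values()) / len(load)
    imbalance = max(load.values()) / avg  # 1.0 means perfectly balanced
    return dict(load), cut, imbalance, dict(neighbors)


# A 4-cycle 0-1-2-3-0 split into {0, 1} and {2, 3}.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(partition_quality(ring, {0: 0, 1: 0, 2: 1, 3: 1}))
# -> ({0: 2.0, 1: 2.0}, 2, 1.0, {0: {1}, 1: {0}})
```

Passing `weight` changes only the load and imbalance figures. This mirrors the weighted-vertex model, in which work is not proportional to vertex count.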
A partition of a graph into subgraphs leads to a decomposition of the data and/or tasks associated with a computational problem, and the subgraphs can be mapped to the processors of a multiprocessor. Graph partitioning also plays an important role in serial algorithm design, especially for problems with a strong divide-and-conquer structure.

For parallel computing, the ideal assignment of data and/or tasks to processors is one in which the workload of every processor is approximately equal and the communication during the computation is minimized. Hence, the following two objectives are usually stated for a partitioning algorithm. Partition a given graph into a specified number of subgraphs such that the subgraphs have roughly equal numbers of vertices and few edges join different subgraphs to each other (these edges form the cut of the partition). Here we assume that each vertex models a basic unit of workload. In the context of parallel computation, the size of a subgraph determines the computational workload that a processor has to perform, and the number of cut edges measures the communication volume of the algorithm. In a serial algorithm, equal-sized subgraphs lead to the lowest worst-case running time, and the number of cut edges measures the cost of combining the partial solutions into the global solution.

More general objective functions need to be considered for many complicated problems. For example, the workload is not always proportional to the number of vertices in the subgraph. Hence the work associated with a subgraph may be modeled more accurately by attaching a weight to each vertex. The load balancing requirement then becomes: partition the graph into subgraphs such that every subgraph has approximately the same total vertex weight. Note that for data communication we can use message packing, i.e., pack all the messages destined for the same processor into one. The communication cost of the algorithm might then be modeled by the number of subgraphs to which a given subgraph is connected, by the number of boundary vertices of the subgraph, or by similar variants.

If the graph is a geometric graph, the shape of the subgraphs is sometimes also important for a good solution. In other words, the shape of the subgraphs (e.g., their aspect ratios) may be an important parameter in some algorithms (e.g., for the convergence of a domain decomposition method).
In some applications, it may be essential to partition into connected subgraphs.

1.3 Parallel Mesh Refinement and Load Balancing

Once we have a discretization (mesh) of the domain, differential equations for flow, waves, and heat distribution are approximated by finite difference, finite element, or finite volume formulations. To properly approximate a continuous function, a mesh must conform to the boundary of the region and be fine enough. In addition, each individual element of the mesh must be well shaped. A common shape criterion for elements is that the angles of each element are not too small, or that the aspect ratio of each element is bounded [49]. The aspect ratio of a simplex is defined as its maximum side length divided by its minimum altitude.

We consider issues and algorithms for adaptive mesh refinement (cf. Step 6 in the paradigm above). The general scenario is the following. We start with an initial well-shaped mesh M0 for an input domain and differential equations. We then form the numerical system from the mesh M0 and the original differential equations. We partition mesh M0 and map the submeshes and their fraction of the numerical system onto a parallel machine. By solving the numerical system in parallel, we obtain an initial numerical solution S0. If S0 is within the error tolerance, we return it. Otherwise, an error estimation of S0 generates a refinement spacing function h1 over the domain, which defines the expected size of mesh elements in each region of the domain. In other words, if in some region the error estimate of S0 is too large compared to the error tolerance or to the average error, we refine the mesh elements in that region so that the next computation produces a more accurate solution there. Therefore, we need to properly refine the initial mesh M0 according to h1 to generate another well-shaped mesh M1.

The requirement for refinement introduces load imbalance among the processors of the parallel machine. Some submeshes, after refinement, might be much larger than others. The workload of a processor in the next stage of computation is the sum of the time the processor spends refining its submesh and the time it needs to solve its fraction of the numerical system over the refined mesh M1.

In this thesis, we present a dynamic load balancing algorithm to ensure that the computation at each stage of the refinement is balanced and optimized. The basic idea of our algorithm is as follows. Our algorithm estimates the size and distribution of M1 before it is actually generated.
Based on this estimation, we can compute a quality partition of M1 before we generate it. The partition of M1 can be projected back to M0, which divides the submesh on each processor into one or more subsubmeshes. Our algorithm first moves these subsubmeshes to the proper processors and only then performs the refinement. This is more efficient than moving M1, because M0 is usually smaller than M1. Note that this approach treats the mesh refinement cost as part of the workload to be balanced. In partitioning M1, we take into account the communication cost of moving these subsubmeshes as well as the communication cost of solving the numerical system over M1.
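The sketch below makes concrete the idea of moving coarse pieces before refinement. It assigns pieces of M0 to processors, using their estimated refined sizes as the measure of work. It is deliberately much simpler than the method developed in this thesis, which partitions a subdomain graph with a static partitioner. Here a hypothetical greedy rule keeps a piece on its current processor whenever that processor stays under a capacity of (1 + tol) times the average load. Otherwise, the rule sends the piece to the currently least-loaded processor. Migration cost is counted in coarse (unrefined) size, reflecting the observation that M0 is usually smaller than M1. The tuple layout, `tol`, and the greedy rule are all assumptions made for illustration.

```python
# Hypothetical greedy stand-in for the thesis's partitioning step;
# the data layout and the rule are illustrative assumptions.
def rebalance(pieces, nprocs, tol=0.05):
    """Assign coarse mesh pieces to processors by estimated refined work.

    pieces : list of (piece_id, owner, coarse_size, est_refined_size)
    nprocs : number of processors
    tol    : allowed overload above the average refined load
    Returns (assignment dict, per-processor refined load, migrated coarse volume).
    """
    total = sum(p[3] for p in pieces)
    cap = (1.0 + tol) * total / nprocs
    load = [0.0] * nprocs
    assign, moved = {}, 0

    # Place the pieces with the largest predicted work first.
    for pid, owner, coarse, est in sorted(pieces, key=lambda p: -p[3]):
        if load[owner] + est <= cap:
            dest = owner  # stays put: no migration cost
        else:
            dest = min(range(nprocs), key=lambda q: load[q])
        if dest != owner:
            moved += coarse  # ship the unrefined piece, before refinement
        assign[pid] = dest
        load[dest] += est
    return assign, load, moved


pieces = [("a", 0, 10, 40), ("b", 0, 10, 40), ("c", 1, 10, 10), ("d", 1, 10, 10)]
print(rebalance(pieces, nprocs=2))
# -> ({'a': 0, 'b': 1, 'c': 1, 'd': 0}, [50.0, 50.0], 20)
```

In this example, processor 0 owns two pieces expected to grow to 40 units each, and processor 1 owns two pieces expected to grow to 10. The greedy rule migrates one piece in each direction and ends with equal loads. Unlike the subdomain-graph approach, it ignores the communication cost of solving the numerical system over M1.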


1.4 Thesis Outline

The thesis is organized as follows. Chapter 2 introduces the basic concepts of mesh generation and of mesh quality measures, as well as some basic mesh generation algorithms. Graph partitioning concepts and some widely used algorithms are introduced in Chapter 3. Chapter 4 introduces an abstract problem that models parallel adaptive mesh refinement. That chapter also presents an algorithm to estimate the size and distribution of the refined mesh before it is generated. In addition, it presents a technique to reduce dynamic load balancing for mesh refinement to a collection of static partitioning problems. This reduction makes use of the information produced by the size estimation algorithm. It first applies static partitioning to divide the submesh (subdomain) on each processor into a set of subsubmeshes (subsubdomains) according to the projection of the final partition onto the current mesh. We introduce the notion of a subdomain graph to combine the communication cost of moving these subsubmeshes with the communication cost of solving the subsequent numerical system. We then apply a static partitioning algorithm to complete the definition of the partition of the refined mesh. We also extend our algorithm from the abstract problem to unstructured meshes. Chapter 5 concludes the thesis with a discussion of some future research directions.


Chapter 2

Mesh Generation

2.1 Introduction

Decomposition of a geometric input into simpler objects is fundamental in many areas, such as solid modeling, computer-aided design, graphical rendering, and scientific computation. For example, an essential step in the numerical simulation of physical and engineering problems is to find a proper discretization of a continuous domain. This is the problem of mesh generation [4, 40, 26, 27, 43, 44].

Finding an optimal triangulation is a particular mesh generation problem in computational geometry. The most often used optimization criteria [4] include the following:
- maximizing the minimum angle among all elements of the partition (solved by the well-known Delaunay triangulation [48]);
- minimizing the maximum angle [14];
- minimizing the maximum min-containment ellipse [13];
- minimizing the total length (an outstanding open problem in the field [17]).

Variants of these problems allow one to add extra vertices, called Steiner points, in order to further improve the quality of the solution.

Mesh generation is a great example of interdisciplinary research. Its development builds upon advances in computational and combinatorial geometry, data structures, numerical analysis, and scientific applications. Its success is justified not only by mathematical proofs about the quality and numerical relevance of geometry-based meshing algorithms, but also by the performance of meshing software in real applications. It embraces both provably good algorithms and practical heuristics.
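The max-min-angle property of the Delaunay triangulation can be observed locally. Consider a convex quadrilateral ABCD. The diagonal AC is locally Delaunay exactly when D does not lie inside the circumcircle of triangle ABC. If D does lie inside, flipping to the diagonal BD does not decrease the smallest angle of the two triangles. The sketch below uses the standard incircle determinant to test this, then compares the minimum angles of the two triangulations of a flat rhombus. The coordinates are arbitrary example values.

```python
# Illustrative sketch: local Delaunay test and the max-min-angle property.
import math


def incircle(a, b, c, d):
    """Positive iff d is strictly inside the circumcircle of the
    counterclockwise triangle (a, b, c); zero if the points are cocircular."""
    m = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        m.append((dx, dy, dx * dx + dy * dy))
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def min_angle(a, b, c):
    """Smallest interior angle of triangle abc, in degrees."""
    def angle_at(p, q, r):
        v = (q[0] - p[0], q[1] - p[1])
        w = (r[0] - p[0], r[1] - p[1])
        cos_t = (v[0] * w[0] + v[1] * w[1]) / (math.hypot(*v) * math.hypot(*w))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
    return min(angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b))


# Convex quadrilateral A, B, C, D in counterclockwise order (a flat rhombus).
A, B, C, D = (-2, 0), (0, -1), (2, 0), (0, 1)
with_ac = min(min_angle(A, B, C), min_angle(A, C, D))
with_bd = min(min_angle(A, B, D), min_angle(B, C, D))
print(incircle(A, B, C, D) > 0, round(with_ac, 2), round(with_bd, 2))
# -> True 26.57 53.13
```

Here D lies inside the circumcircle of ABC, so AC is not a Delaunay edge. Flipping to BD, which is Delaunay, raises the smallest angle from about 26.57° to about 53.13°. Production codes evaluate this determinant with exact or adaptive-precision arithmetic. The floating-point version above is only for illustration.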


2.2 Structured and Unstructured Mesh

The simplest form of a mesh is a structured mesh. Structured meshes were widely used in the early stages of scientific computing because the meshes, and the linear systems over them, are easy to construct. There are two types of structured meshes: geometrically structured and topologically structured meshes. Examples of geometrically structured meshes are regular Cartesian grids and uniform hexagonal grids. In these meshes, all elements are geometrically alike: they have similar sizes and similar shapes. Because of this grid property, the domain to which a geometrically structured mesh can be applied cannot be too complicated. A mesh is topologically structured if its topological structure is isomorphic to that of a geometrically structured mesh. For example, we can apply a conformal transformation to a structured grid to generate a topologically structured mesh. Topologically structured meshes improve the ability to approximate complicated input domains.

Structured grids are easy to generate and manipulate, which allows simple data structures that reduce programming complexity. In addition, the numerical theory for these types of discretization is well understood. However, it is not easy to use a structured mesh to approximate a complicated domain or a numerical solution that changes rapidly. The use of structured regular grids limits the applicability of numerical methods to problems whose domains are simple and whose solution functions are smooth.

The other type of mesh is the unstructured mesh. It has varying local topology and spacing in order to reduce the problem size. For problems with complex geometric boundaries and with solutions that change rapidly, we need to use an unstructured mesh.
For example, in modeling the combustion of material in a rocket, we need a much denser and more accurate discretization near the combustion boundary, while it is desirable not to waste mesh points in regions of low activity. The adaptability of unstructured meshes comes with new challenges, especially for 3D problems: the numerical theory becomes more difficult and the algorithmic design becomes harder.

The most general and versatile mesh is an unstructured triangular mesh, in which each element is a simplex, i.e., a triangle in 2D or a tetrahedron in 3D. In general, a k-simplex is the convex hull of k + 1 affinely independent points, and it has dimension k. A triangular mesh is a triangulation of the input domain (e.g., a polygon), along with some extra points, called Steiner points. A triangulation, or simplicial complex, is a decomposition of a domain into a collection of interior-disjoint simplices such that two simplices can only intersect at a common lower-dimensional simplex, i.e., neighboring elements are conformal at their boundaries. Combinatorially, a triangulation T can be expressed as a PLS of a set $S_T$ of simplices: if s is a simplex in $S_T$, then all of its lower-dimensional faces, which are themselves simplices, also belong to $S_T$.

Figure 2.1 below gives a typical example of an unstructured mesh, taken from a paper by Borouchaki.

Figure 2.1: A typical example of an unstructured mesh.

A triangulation T conforms to the boundary of a domain $\Omega$ if each boundary polytope of $\Omega$ is a union of some simplices in $S_T$. A triangulation T is the constrained triangulation of a domain $\Omega$ if all mesh vertices are domain vertices and each domain boundary polytope of $\Omega$ is a union of some simplices in $S_T$. A triangulation T covers a domain $\Omega$ if each domain boundary polytope of $\Omega$ is a union of some simplices in $S_T$ and each boundary segment of $\Omega$ is an edge of some simplex of $S_T$.

An area of great potential is automatic adaptation of the mesh, in which the solution process continues until the required accuracy has been reached, without intervention by the analyst. Such mesh adaptation generally means that in certain areas of the input domain the size of the elements is decreased (or increased) and the order of the elements may be increased (or decreased). In concept, such adaptation is most appealing, but there are difficulties when complex practical situations are considered.

2.3 Quality of Mesh

Numerical approximation errors depend on the quality of the mesh, while the time and space requirements of numerical algorithms are a function of the number of mesh elements. To properly approximate a continuous function, a mesh must conform to the boundaries of the region and be fine enough. In addition, each individual element of the mesh must be well shaped. A common shape criterion for elements is that the angles of each element are not too small, or that the aspect ratio of each element is bounded [1, 4, 49]. Certain requirements on the size and neighborhood relations of the mesh vertices should be imposed in order to obtain a high-quality mesh with a small number of elements. Babuška and Aziz [1] showed that one should avoid large angles to obtain a high-quality mesh. Specifically, they showed that the finite element method converges if the maximum angle of all mesh elements is bounded away from $\pi$, i.e., there exists a constant $\theta_0$ such that every angle is at most $\pi - \theta_0$. Strang and Fix [49] also showed that the finite element method converges if the smallest angle of the mesh elements is bounded below by a constant $\theta$.
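The two angle conditions can be tested element by element. The sketch below computes the interior angles of a triangle with the law of cosines. It then reports whether the smallest angle is at least a chosen bound $\theta$ and whether the largest angle is at most $\pi - \theta_0$. The threshold values in the example are arbitrary.

```python
# Illustrative sketch; thresholds and function names are assumptions.
import math


def triangle_angles(a, b, c):
    """Interior angles (radians) at vertices a, b, c, via the law of cosines."""
    la, lb, lc = math.dist(b, c), math.dist(c, a), math.dist(a, b)

    def acos_clamped(x):
        return math.acos(max(-1.0, min(1.0, x)))  # guard against rounding error

    alpha = acos_clamped((lb * lb + lc * lc - la * la) / (2 * lb * lc))
    beta = acos_clamped((la * la + lc * lc - lb * lb) / (2 * la * lc))
    return alpha, beta, math.pi - alpha - beta


def angle_conditions(a, b, c, theta, theta0):
    """(smallest angle >= theta, largest angle <= pi - theta0)."""
    angles = triangle_angles(a, b, c)
    return min(angles) >= theta, max(angles) <= math.pi - theta0


theta, theta0 = math.radians(20), math.radians(30)
print(angle_conditions((0, 0), (1, 0), (0.5, math.sqrt(3) / 2), theta, theta0))  # (True, True)
print(angle_conditions((0, 0), (1, 0), (0, 0.1), theta, theta0))               # (False, True)
print(angle_conditions((0, 0), (1, 0), (0.5, 0.01), theta, theta0))            # (False, False)
```

The thin right triangle satisfies the maximum-angle condition but violates the minimum-angle condition. The nearly flat triangle violates both.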
Note that both of the above conditions can be reduced to bounding the smallest angle of all mesh elements. In two dimensions, if every angle is at least $\theta$, then, since the three angles sum to $\pi$, every angle is at most $\pi - 2\theta$.

The following definition of the aspect ratio, popularized by Mitchell and Vavasis [40], applies uniformly in any dimension.

Definition 2.3.1 (aspect ratio) The aspect ratio of a simplex $T$ is the ratio $R_T / r_T$, where $R_T$ and $r_T$ are the radii of the smallest ball containing $T$ and the largest ball contained in $T$, respectively. The aspect ratio of a triangular mesh $M$ is the largest aspect ratio among its elements. $M$ is well shaped for a constant $\alpha > 1$ if its aspect ratio is at most $\alpha$.
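For a triangle, both radii in Definition 2.3.1 have closed forms. The inradius is the area divided by the semi-perimeter. The smallest enclosing disk is the circumcircle when the triangle is acute, and otherwise the disk whose diameter is the longest edge. The following minimal 2D sketch assumes nondegenerate triangles and a mesh given as a point list plus index triples.

```python
# Illustrative 2D sketch; the mesh representation is an assumption.
import math


def aspect_ratio(a, b, c):
    """R_T / r_T of a nondegenerate 2D triangle (Definition 2.3.1)."""
    la, lb, lc = math.dist(b, c), math.dist(c, a), math.dist(a, b)
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2
    r_in = area / ((la + lb + lc) / 2)     # inradius = area / semi-perimeter
    s1, s2, s3 = sorted((la, lb, lc))
    if s3 * s3 >= s1 * s1 + s2 * s2:
        r_enc = s3 / 2                     # right/obtuse: longest edge is a diameter
    else:
        r_enc = la * lb * lc / (4 * area)  # acute: the circumcircle
    return r_enc / r_in


def mesh_aspect_ratio(points, triangles):
    """Largest element aspect ratio of a triangular mesh."""
    return max(aspect_ratio(*(points[i] for i in t)) for t in triangles)


print(aspect_ratio((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))  # about 2.0
print(aspect_ratio((0, 0), (1, 0), (0, 1)))                   # about 2.414, i.e. 1 + sqrt(2)
```

For an obtuse triangle, the circumcircle is larger than the smallest enclosing disk, so using the circumradius there would overstate $R_T$. The branch on the longest edge avoids that.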


In this thesis, we measure the quality of a triangular mesh by the radius-edge aspect ratio defined by Miller, Talmor, Teng, and Walkington [35, 36].

Definition 2.3.2 (radius-edge ratio) The radius-edge aspect ratio of a simplex is the ratio of its circumradius to the length of its shortest edge. A mesh $M$ is $\rho$-well-shaped for a constant $\rho > 1$ if the radius-edge aspect ratio of each of its elements is bounded from above by $\rho$.

In two dimensions, these definitions are equivalent in the sense that if a triangle is bounded away from being ill shaped under one aspect ratio, it is bounded away under the other as well. In three dimensions, they are not equivalent. A sliver, a nearly flat tetrahedron whose four vertices lie close to a great circle of its circumsphere, has a good radius-edge ratio but a poor aspect ratio. If there are slivers in the mesh, the finite element method may fail to converge to the solution. The control volume method guarantees convergence even for meshes containing slivers.

2.4 Control Space

The first stage of an adaptive finite element scheme consists of creating an initial mesh of a given domain $\Omega$. A size specification field is deduced before mesh generation, i.e., the desired mesh element size is specified in the vicinity of each mesh vertex. The size specification can be derived from previous numerical results or from local geometric features. For surface meshes, it can also be derived from the geometric error, which measures the gap between the facetization and the real surface. As shown in [35, 43], the spacing function for a well-shaped mesh should be smooth in the sense that it changes slowly as a function of distance. Formally, a function $f$ is Lipschitz with a constant $\beta$ if for any two points $x, y$ in the domain, $|f(x) - f(y)| \le \beta \|x - y\|$. Each domain $\Omega$ and differential equation $u$ define a desired local spacing within the domain. This spacing specifies, for example, the expected element size in a given neighborhood or the point density near a point.
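Whether a sampled spacing function is smooth in this sense can be checked empirically. The sketch below returns the largest ratio $|f(x) - f(y)| / \|x - y\|$ over all pairs of samples. Because it inspects only finitely many pairs, the result is a lower bound on the true Lipschitz constant, not a certificate. The example spacing function $h(x) = 0.1 + 0.5\|x\|$ is made up for illustration.

```python
# Illustrative sketch; the spacing function h is a hypothetical example.
import math
from itertools import combinations


def lipschitz_estimate(samples):
    """Largest |f(x) - f(y)| / ||x - y|| over all pairs of (point, value) samples.

    Only finitely many pairs are inspected, so the result is a lower bound
    on the true Lipschitz constant of f, not a proof of smoothness.
    """
    best = 0.0
    for (x, fx), (y, fy) in combinations(samples, 2):
        d = math.dist(x, y)
        if d > 0:
            best = max(best, abs(fx - fy) / d)
    return best


def h(p):
    """Hypothetical spacing: fine near the origin, coarsening with distance."""
    return 0.1 + 0.5 * math.hypot(*p)


grid = [(0.25 * i, 0.25 * j) for i in range(5) for j in range(5)]
print(lipschitz_estimate([(p, h(p)) for p in grid]))  # about 0.5
```

By the reverse triangle inequality, this $h$ is exactly 0.5-Lipschitz. Sample pairs on a ray through the origin attain that value.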
In this section, we discuss how to determine the local spacing from the geometry of Ω and the numerical condition of u.

2.4.1 Geometric Features

The geometry of the boundary of Ω also contributes to the local spacing of a well-shaped mesh. In two dimensions, we assume that Ω is given as a planar straight-line graph (PSLG), which is a collection of line segments and points in the plane, closed under intersection. Suppose Ω is described by a PSLG S. Ruppert [43] introduced the following concept, called the local feature size.

Definition 2.4.1 Given a PSLG S, the local feature size at a point x, $\mathrm{lfs}_S(x)$ or simply $\mathrm{lfs}(x)$, is the radius of the smallest disk centered at x that intersects two non-incident vertices or segments of S.¹

Note that adding new Steiner vertices does not change the value of the lfs() function, since it is determined by the input. Ruppert observed that lfs changes slowly within the domain. Formally, a function f() is Lipschitz with coefficient γ if for any two points x, y in the domain, $|f(x) - f(y)| \le \gamma \|x - y\|$. The Lipschitz coefficient of lfs is bounded from above by 1 [43]. In addition, lfs is maximal in the following sense.

Lemma 2.4.1 If f is a 1-Lipschitz function over a domain Ω such that $f(x) \le \mathrm{lfs}(x)$ for each point x on ∂Ω, then $f(x) \le \mathrm{lfs}(x)$ for every $x \in \Omega$.

There are several ways to describe the spacing function of a well-shaped mesh M over a domain Ω:

Definition 2.4.2 (Edge-length function, $el_M$) For each point $x \in \Omega$, $el_M(x)$ is equal to the length of the longest edge among all mesh simplex elements that contain x (note that points on the lower-dimensional faces of a simplex are contained in more than one element).

Definition 2.4.3 (Nearest-neighbor function, $nn_M$) Let x be a point in Ω. There are two cases: (1) if x is a mesh point, then $nn_M(x)$ is equal to the distance from x to the nearest other mesh point in M; (2) if x is not a mesh point, then $nn_M(x)$ is equal to the distance from x to the second-closest mesh point in M.

Lemma 2.4.2 ([35]) If M is α-well-shaped, then there exist constants $c_1$ and $c_2$ depending only on α such that for all points $x \in \Omega$,
$$c_1\, el_M(x) \le nn_M(x) \le c_2\, el_M(x).$$

¹ Ruppert also gave a modified definition that uses the geodesic distance to the two nearest non-incident portions of the input, to handle the "two arms" situation [43]. The geodesic distance is measured along the shortest path that stays within the domain to be triangulated.
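Definition 2.4.3 can be evaluated by brute force. Both of its cases reduce to the same rule: $nn_M(x)$ is the second-smallest distance from x to the mesh points, because a mesh point is at distance 0 from itself. The Python sketch below assumes the mesh points are distinct (x, y) tuples and that there are at least two of them; `nn_value` is a hypothetical name, and a practical implementation would use a spatial search structure instead of scanning every point.

```python
import math

def nn_value(x, mesh_points):
    """Nearest-neighbor value nn_M(x) of Definition 2.4.3.

    If x is a mesh point, the smallest distance is 0 (x itself) and the
    second-smallest is the distance to its nearest other mesh point;
    otherwise the second-smallest is the distance to the second-closest
    mesh point. Assumes at least two distinct mesh points.
    """
    distances = sorted(math.dist(x, p) for p in mesh_points)
    return distances[1]

mesh = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(nn_value((0.0, 0.0), mesh))  # 1.0 (mesh point: nearest other point)
print(nn_value((0.5, 0.0), mesh))  # 0.5 (non-mesh point: second-closest point)
print(nn_value((0.0, 2.0), mesh))  # 2.0
```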

For convenience, when the context is clear, we will use nn(x) rather than $nn_M(x)$ to denote the nearest-neighbor value of point x in mesh M.

2.4.2 Numerical Spacing

The numerical condition is usually obtained from an a priori error analysis, or from an a posteriori error analysis based on an initial numerical simulation. It defines a numerical spacing function, denoted by nsf(x), for each point x in the domain Ω. From the interpolation viewpoint, the value of nsf(x) is determined by the eigenvalues of the Hessian matrix of u [49]. Locally at point x, u can be approximated by a quadratic function
$$u(x + dx) \approx \tfrac{1}{2}\, dx\, H\, dx^T + dx \cdot \nabla u(x) + u(x),$$
where H is the Hessian matrix of u, the matrix of second partial derivatives. The spacing of the mesh points required by the accuracy of the discretization near x should depend on the reciprocal of the square root of the largest eigenvalue of H at x.

For example, in adaptive numerical simulation, we estimate the eigenvalues of the Hessian matrix at a certain set of points in Ω based on the solution of the previous iteration, and then extend the spacing requirement from these points to the entire domain. From the new spacing and the old spacing function deduced from the previous mesh, we obtain the refinement or coarsening factor for each mesh point. We can then use the simultaneous refinement and coarsening method of Li et al. [27] to generate a mesh that satisfies the new control space requirement.

2.4.3 Control Spacing Function

The local feature size lfs and the numerical condition nsf together define the global control spacing function. Notice, however, that the Lipschitz coefficient of nsf may not be bounded by a constant. Using the technique of Miller, Talmor, and Teng [32], we can define a new numerical spacing function $\overline{nsf}()$ as follows: for each point x,
$$\overline{nsf}(x) = \min\Big(nsf(x),\ \min_{y \in \Omega}\big(nsf(y) + \|x - y\|\big)\Big). \qquad (2.1)$$
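Equation (2.1) can be illustrated on a finite set of sample points, replacing the minimization over all $y \in \Omega$ by a minimization over the samples; that discretization is an assumption made here for illustration, and `lipschitz_regularize` is a hypothetical name. This O(n²) brute-force Python sketch only shows the idea; in practice the values would live on a background mesh.

```python
import math

def lipschitz_regularize(points, values):
    """Discrete analogue of Equation (2.1).

    For each sample x_i returns min_j (values[j] + |x_i - x_j|).
    Taking j = i gives values[i], so the outer min with nsf(x) in (2.1)
    is already included. The result is 1-Lipschitz on the samples.
    """
    return [
        min(v_j + math.dist(x_i, x_j) for x_j, v_j in zip(points, values))
        for x_i in points
    ]

pts = [(0.0,), (1.0,), (3.0,)]
nsf = [1.0, 10.0, 10.0]  # jumps by 9 over distance 1: not 1-Lipschitz
print(lipschitz_regularize(pts, nsf))  # [1.0, 2.0, 4.0]
```

The spacing grows by at most one unit per unit distance away from the fine region, which is exactly the smoothing that (2.1) enforces.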

The Lipschitz coefficient of $\overline{nsf}()$ is at most 1. In addition, $\overline{nsf}()$ is the best possible in the sense that for any 1-Lipschitz function g over the domain Ω, if $g(x) \le nsf(x)$ pointwise in Ω, then $g(x) \le \overline{nsf}(x)$ pointwise.

The global control spacing function gns() can then be defined as
$$gns(x) = \min\big(\mathrm{lfs}(x), \overline{nsf}(x)\big), \qquad (2.2)$$
where gns stands for Geometric and Numerical Spacing [51]. The function gns() captures both the numerical and the geometric requirements for a well-shaped adaptive mesh.

Lemma 2.4.3 If f() and g() are $\gamma_1$- and $\gamma_2$-Lipschitz, respectively, over Ω, then f() + g() is $(\gamma_1 + \gamma_2)$-Lipschitz, and min(f(), g()) and max(f(), g()) are $\max(\gamma_1, \gamma_2)$-Lipschitz.

Therefore, gns is 1-Lipschitz.

For mesh generation, we do not need to compute these spacing functions exactly. A common approach to approximating gns() is to store discrete values at the vertices of a background mesh, such as a quadtree/octree decomposition of the domain. When we need to evaluate the function at an arbitrary point in the domain, we simply interpolate these discrete values.

2.5 Conformality and Size of Mesh

Given the control spacing specification f, a generated mesh M must conform to the control spacing f, in addition to being well shaped. For the generated mesh M, we use the nearest-neighbor value nn(x) to denote the spacing at point x. We then define the conformality of point x as follows.

Definition 2.5.1 (Point Conformality $C_{f,M}$) Assume that f is the control spacing used to generate mesh M. For any point x in the domain Ω, let nn(x) be the nearest-neighbor value of point x in M. Then the conformality of point x is
$$\min\left(\frac{nn(x)}{f(x)}, \frac{f(x)}{nn(x)}\right). \qquad (2.3)$$

To define the conformality of the mesh M, we can use either the minimum or the average conformality value over all mesh vertices. If a mesh M conforms to the control spacing well, the conformality of every point x in the mesh should be bounded below by a constant; the closer this constant is to 1, the better M conforms to the control spacing. We say that a mesh perfectly conforms to the control spacing if the conformality of every mesh vertex p of M is equal to 1.

The following lemma of Miller et al. [34] estimates the size of a generated well-shaped mesh.

Lemma 2.5.1 (Size of a Well-shaped Mesh [34]) If M is an α-well-shaped mesh of n elements, then
$$n = \Theta\left(\int_\Omega \frac{dA}{nn_M^2}\right). \qquad (2.4)$$

For any well-shaped mesh M that conforms to a given spacing function f, the following lemma bounds the number of mesh elements of M.

Lemma 2.5.2 (Size of Mesh with Respect to the Control Spacing [51]) There exists a constant c such that if M is a well-shaped mesh of n elements over a domain Ω that satisfies the control spacing function f(), then
$$n \le c \int_\Omega \frac{dA}{f^2}. \qquad (2.5)$$
Here, a mesh M satisfies the control spacing if the conformality of the mesh is bounded below by a constant.

2.6 Delaunay Triangulation

For mesh generation, there is a variety of optimization objectives. In particular, the finite element method requires that the minimum angle of the mesh be bounded below by a constant θ. There are two approaches to generating a mesh: the first generates a set of points in the domain and then connects these points to form the final mesh; the second generates the mesh elements at the same time as it generates the mesh vertices. A typical example of the first approach is the sphere-packing method [3, 5, 27, 37, 50, 28]. Advancing front methods [6, 30, 31] are the most widely used examples of the second approach.

Given a set of points in the domain, the Delaunay triangulation [48] of the point set maximizes the minimum angle for two-dimensional domains. Assume that P is a point set in $\mathbb{R}^d$. The simplex defined by d + 1 affinely independent points from P is a Delaunay simplex if the circumsphere of the simplex contains no point from P in its interior. The union of all Delaunay simplices forms the Delaunay diagram DT(P). If P is not degenerate, then DT(P) is a triangulation of the convex hull of P. Let DB(P) denote the set of circumspheres of the simplices of DT(P). By definition, no point of P lies in the interior of any sphere from DB(P).

The geometric dual of the Delaunay diagram is the Voronoi diagram, which consists of a set of polyhedra $V_1, \ldots, V_n$, one for each point in P. $V_i$ is called the Voronoi cell of $p_i$, and $p_i$ is called the center of $V_i$. Geometrically, $V_i$ is the set of points in $\mathbb{R}^d$ whose distance to $p_i$ is less than or equal to their distance to any other point in P. The Delaunay triangulation has several desirable properties for mesh generation. For example, among all triangulations of a point set in 2D, the Delaunay triangulation maximizes the minimum angle. In any dimension, it always contains the nearest-neighbor graph of the point set, i.e., in the Delaunay triangulation every point is directly connected to its nearest neighbors. The Delaunay triangulation also contains the minimum spanning tree of the point set.

2.7 Mesh Generation Method

A mesh generator usually does two things:
• Point set generation: places Steiner points within or on the boundary of the domain.
• Element generation: forms the elements of the mesh by triangulating the point set or by using some other element-formation procedure.

Some mesh generation algorithms construct the point set and then triangulate it, but most mesh generation algorithms merge these two functions and generate the point set implicitly as part of the mesh generation phase. We will discuss both of these approaches in the next two sections.
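Returning to the Delaunay property of Section 2.6, the empty-circumsphere condition can be tested in the plane with the standard in-circle determinant: for a counterclockwise triangle abc, the determinant below is positive exactly when d lies strictly inside the circumcircle. The Python sketch uses hypothetical function names, checks every triangle against every other point by brute force, and relies on plain floating point rather than robust exact predicates, so it is only suitable for well-separated inputs.

```python
def orient(a, b, c):
    """Twice the signed area of triangle abc; positive if counterclockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def incircle(a, b, c, d):
    """Positive iff d lies strictly inside the circumcircle of the
    counterclockwise triangle abc (lifted 3x3 determinant)."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a0, a1, a2), (b0, b1, b2), (c0, c1, c2) = rows
    return (a0 * (b1 * c2 - b2 * c1)
            - a1 * (b0 * c2 - b2 * c0)
            + a2 * (b0 * c1 - b1 * c0))

def is_delaunay(points, triangles, eps=1e-12):
    """Check the empty-circumcircle property of every triangle
    against every point that is not one of its vertices."""
    for tri in triangles:
        a, b, c = (points[i] for i in tri)
        if orient(a, b, c) < 0:
            b, c = c, b  # make the triangle counterclockwise
        for m, p in enumerate(points):
            if m not in tri and incircle(a, b, c, p) > eps:
                return False
    return True

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -0.2)]
print(is_delaunay(pts, [(0, 1, 2), (0, 3, 1)]))  # False: point 3 is inside circle(0, 1, 2)
print(is_delaunay(pts, [(0, 3, 2), (3, 1, 2)]))  # True after flipping the shared edge
```

Flipping the shared edge of a non-Delaunay pair, as in the second call, is the basic operation of the classical edge-flip algorithm for constructing a 2D Delaunay triangulation.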

2.7.1 Advancing Front Methods

Advancing front methods construct a mesh of a domain by moving a front from its boundary towards its interior. Such a method first generates an initial front, typically by constructing a surface mesh for the boundary of the domain. It then creates new elements one at a time or a layer at a time and updates the front with the newly created faces [16, 15, 6, 30, 31]. In the one-element-at-a-time model, it chooses a face of the current front and introduces a new mesh element with that face as its base. It can use another vertex on the front, or insert a new Steiner point in the interior, as the additional vertex of the new element. The base face, and potentially some other faces on the front (if the additional vertex is an existing one), are removed from the front, and some faces of the new element are added to the front. This process is repeated until the front is empty, i.e., all fronts have merged with each other and the domain is fully meshed. Hence the method generates the field points and their connectivity simultaneously.

Note that the initial front does not have to be a single component. For example, for a domain with holes, an initial front can also be built for the boundary of each hole.

The selection of the base face and the placement of the new mesh vertex are the two key steps of any advancing front method. These two steps must ensure that the new mesh element is valid and well-shaped, and must keep the front in good condition so that well-shaped elements can be created later. Faces in clefts and small faces are given priority as base faces to satisfy these requirements.

Once the base face is chosen, we need to decide where to place the new vertex. Recall that in a well-shaped triangular mesh, points must be well spaced [50, 33], which implies that for each base face we can only place the Steiner point in a particular region near the base face so that the new element is well-shaped. Call this region the feasible region. Some points in the feasible region make the new element slightly larger (by a constant factor) than other points do. This is where the control spacing function is useful: it helps us decide whether to create a larger or a smaller new element.

Paving [6] is one of the popular advancing front methods. It uses a number of operations to tightly control the moving front and ensure mesh validity and quality. These operations include row choice, closure check, row generation, smoothing, seaming, row adjustment, intersection, and cleanup [6]. The size of the elements in the mesh is determined by the spacing of the nodes on the paving boundary as it propagates. The spacing on the paving boundary is initially defined by the fixed node spacing on the corresponding exterior boundary.

Advancing front methods can be combined with Delaunay or quadtree/octree refinement. For instance, these refinement techniques can be used to generate a good decomposition of the input domain, and advancing front can then be applied to mesh each subdomain. We can also use quadtree/octree refinement to generate the set of points for creating new elements.

2.7.2 Sphere Packing Methods

At a high level, the sphere-packing method fills an input domain with a set of spheres whose centers provide a good vertex set for a quality Delaunay mesh. It can be used to generate meshes for various quality conditions.

For example, Bern, Mitchell, and Ruppert [5] use sphere packing to triangulate an n-vertex polygonal region (potentially with holes) so that no element has an angle larger than π/2. They show that this can be done with O(n) triangles, improving a previous result that uses O(n²) triangles [2].

The algorithm first packs a set of spheres within the domain such that each gap between them is surrounded by at most four tangent spheres². It then defines the mesh points as the union of the centers of these spheres, the tangency points, and one point within each gap, and triangulates these points locally. Notice that for nonobtuse triangulation, one does not need to consider the control spacing function; therefore, such meshes may have elements with very bad aspect ratio. A similar sphere-packing based method has been developed by Bern and Eppstein [3] for quadrilateral meshes.

Shimada and Gossard [46] have developed a sphere-packing method called bubble mesh to generate triangular meshes in two and three dimensions.
Their packing scheme is based on simulating particles that interact with each other through repulsive/attractive forces. They first define a proximity-based force among the spheres, and then find a stable configuration by moving or deleting spheres. However, their method provides no theoretical bound on either the running time of the algorithm or the quality of the meshes it generates.

² This thesis uses the word sphere in its general meaning; it denotes a circle for two-dimensional domains and a (d − 1)-dimensional sphere for d-dimensional domains.

Miller et al. [35, 36] have designed a sphere-packing based meshing method that combines two well-known methods, quadtree and Delaunay refinement. First, they apply a balanced quadtree refinement to approximate the spacing function f(). Second, they oversample a set of points in the domain to define a set of overlapping spheres. Then, they compute a maximal set of non-overlapping spheres from this set to obtain a sphere packing. Finally, they compute the Delaunay triangulation of the centers of these spheres.

Suppose f() is the desired edge-length or nearest-neighbor function of a well-shaped mesh for a domain Ω. We now introduce some definitions to capture the quality of a sphere packing. Let B(x, r) denote the sphere centered at point x with radius r.

Definition 2.7.1 (β-Packing) Let β be a positive real constant. A set S of spheres is a β-packing with centers P of Ω with respect to a spacing function f() if
• for each point p of P, B(p, f(p)/2) ∈ S;
• the interiors of any two spheres s₁ and s₂ in S do not overlap; and
• for each point q ∈ Ω, there is a sphere in S that overlaps with B(q, βf(q)/2).

The following structure theorem of Miller, Talmor, and Teng [34] states an equivalence between β-sphere packings and well-shaped meshes.

Theorem 2.7.1 (Sphere Packing and Well-Shaped Meshes)
1. For any positive constant β, there exists a constant α depending only on β such that if f() is a spacing function with Lipschitz constant 1 over a domain Ω and S is a β-packing with respect to f(), then the Delaunay triangulation M of the centers of S is an α-well-shaped mesh; in addition, for each point p in Ω, $nn_M(p) = \Theta(f(p))$, where the constant in Θ depends only on β.
2. For any positive constant α, there exists a constant β depending only on α such that if M is an α-well-shaped mesh, then the set of spheres
$$S = \{B(p, nn_M(p)/2) : p \text{ a mesh point of } M\}$$
is a β-packing with respect to $nn_M$.
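The conditions of Definition 2.7.1 can be checked mechanically for a candidate packing. In the Python sketch below (hypothetical names), the first condition holds by construction since the spheres are built as B(p, f(p)/2); the third condition quantifies over every point of Ω, so it is only checked at a finite set of sample points, which is an assumption of this sketch. Spheres are treated as closed balls, so touching counts as overlapping in condition 3 but not in condition 2.

```python
import math

def is_beta_packing(centers, f, samples, beta, eps=1e-12):
    """Check Definition 2.7.1 for the spheres B(p, f(p)/2), p in centers.

    Condition 1 holds by construction; condition 3 is only checked at
    the given sample points rather than at every point of the domain.
    """
    radii = [f(p) / 2 for p in centers]
    # Condition 2: interiors of any two spheres do not overlap.
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - eps:
                return False
    # Condition 3: B(q, beta f(q)/2) meets at least one sphere of the packing.
    for q in samples:
        reach = beta * f(q) / 2
        if not any(math.dist(q, p) <= r + reach + eps
                   for p, r in zip(centers, radii)):
            return False
    return True

def unit(p):
    """Constant spacing function f = 1."""
    return 1.0

line = [(float(k),) for k in range(3)]      # centers 0, 1, 2 on a line
samples = [(k / 10,) for k in range(21)]     # sample points in [0, 2]
print(is_beta_packing(line, unit, samples, beta=1.0))              # True
print(is_beta_packing([(0.0,), (0.5,)], unit, samples, beta=1.0))  # False: spheres overlap
```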

Chapter 3 Graph Partition

3.1 Introduction

Identifying enough concurrency in the decomposition of data and/or tasks is one of the main objectives of parallel algorithm design. The computation and data dependencies of many scientific computing problems can be modeled by a graph (directed or undirected): the vertices of the graph are the basic data units or computing tasks, and two vertices are connected if there is a data dependency between them or a computational dependency between the tasks. Graph partitioning plays an important role in parallel computing by identifying the concurrency in a given problem. A partition of a graph into subgraphs leads to a decomposition of the data and/or tasks associated with a computational problem, and the subgraphs can be mapped to the processors of a multiprocessor. Graph partitioning also plays an important role in serial algorithm design by means of the divide-and-conquer paradigm.

For parallel computing, the ideal assignment of data and/or tasks to processors is one in which the workload of every processor is approximately equal and the communication during the computation is minimized. Hence, the following two objectives are usually stated for a partitioning algorithm. First, the subgraphs should have roughly equal numbers of vertices. Second, few edges should join different subgraphs to each other, i.e., the partition should have a small cut.

In the context of parallel computation, we assume that every vertex of the original graph denotes a basic unit of computational workload. Hence the size of a subgraph determines the computational workload that a processor has to perform. The number of cut edges measures the communication volume caused by the partition. In a serial algorithm, equal-sized subgraphs lead to the lowest worst-case running time, and the number of cut edges measures the cost of combining the partial solutions to compute the global solution.

More general objective functions may need to be considered for many problems. The work associated with a subgraph may be modeled more accurately by attaching a weight to each vertex and then equipartitioning the weights. The communication costs in the algorithm might be modeled more accurately by the number of subgraphs a given subgraph is connected to, the number of boundary vertices, or similar variants. In addition, in some algorithms the shape of the subgraphs may be an important parameter; for example, the convergence of a domain decomposition method depends on the shape quality of each partition. In some applications, it may be essential to partition into connected subgraphs.

In Figures 3.1, 3.2, and 3.3, we show the partition of a mesh into sixteen subgraphs computed by the METIS partitioning algorithm. This mesh was generated using the mesh generator written by Xiangyang Li at the University of Illinois at Urbana-Champaign. The generator computes the Delaunay triangulation of an input point set. It also supports constrained edges and holes in the input, and provides functions for Delaunay refinement and for the functional coarsening and functional refinement ideas from the paper [26] of Li et al.
Figure 3.2 shows that each subgraph in the partitioned mesh is connected, and it appears to the eye as a good partition of the mesh. Quantitative measures such as the number of edges cut by the partition show that the METIS partitioning algorithm does generate partitions of good quality for many finite element meshes.

Figure 3.1: The Delaunay mesh from a random point set.

Figure 3.2: The 16-way partition of its vertices.

Figure 3.3: The 16-way partition of its triangle elements.

3.2 Background and Good Separators

There are two different types of separators for a graph G = (V, E): edge separators and vertex separators. A set of edges of E is an edge separator of G if removing these edges partitions G into at least two disconnected subgraphs. A set of vertices of V is a vertex separator of G if, after removing these vertices and the edges incident to them, G is divided into two disconnected subgraphs.

Given a graph G = (V, E) and a vertex set B ⊆ V, let $\bar{B}$ denote the vertices of V not in B, i.e., V − B. A partition of a connected graph G = (V, E) is a division of its vertices into two sets A and B. The set of edges joining vertices in A to vertices in B is an edge separator that we denote by C(A, B). For any partition (A, B), let s(A, B) be the number of cut edges; we call it the size of the cut. The removal of these edges disconnects the graph into two or more connected components. In applications such as domain decomposition, the set of vertices B would be mapped to one set of processors and the set of vertices A to another, and s(A, B) is a measure of the volume of communication necessary between the two groups of processors.

Unweighted graphs cannot model all cases of parallel computation. For some problems, different edges have different communication costs, rather than every edge having equal or similar cost. Hence, we need an edge-weighted graph to model the data and/or task decomposition. Let $W_E$ be the weight function for all edges of the graph. A partition of a connected weighted graph G = (V, E, W_E) is a division of its vertices into two sets A and B. The set of edges joining vertices in A to vertices in B is an edge separator that we still denote by C(A, B). For any partition (A, B), let w(A, B) be the total edge weight of the cut edges; we call it the weight of the cut. Note that unweighted graph partitioning is the special case of edge-weighted graph partitioning in which all edge weights are set to one.

Hence, one goal in partitioning a graph in these applications is to minimize the total weight of the edges cut by the partition, so as to keep communication costs in the algorithm small. A second goal is to balance the computational workload between the two sets of processors. This is achieved by prescribing the number of vertices in A and B to within a tolerance. If the sizes of A and B differ by at most 1, the partition is called a bisection. For α in the range [0, 1/2], a partition is called an α-separator if min(|A|, |B|) ≥ αn.
The minimum weight w(A, B) over all bisections (A, B) is called the bisection width of the graph G.

In some applications, even the edge-weighted graph cannot model the problem. A typical case is that the computational cost of a subgraph is not proportional to the number of vertices assigned to it, because every vertex can have its own computational cost. We can then use a graph with weighted vertices and weighted edges to model the problem. Let $W_V$ and $W_E$ be the weight functions for the vertices and edges of the graph, respectively. A partition of a connected vertex- and edge-weighted graph G = (V, E, W_V, W_E) is a division of its vertices into two sets A and B. The set of edges joining vertices in A to vertices in B is an edge separator that we still denote by C(A, B). For any partition (A, B), let w(A, B) be the total edge weight of the cut edges; we call it the weight of the cut. Let w(A) denote the total vertex weight of all vertices in subgraph A; we still call it the size of the subgraph. Note that edge-weighted graph partitioning is the special case of this definition in which all vertex weights are set to one. The objective of this partitioning is to minimize the total weight of the edges cut by the partition, so as to keep communication costs in the algorithm small. A second goal is to balance the computational workload between the two sets of processors, i.e., to balance w(A) and w(B), where the computational workload of a subgraph is the sum of its vertex weights.

In applications such as nested dissection, a vertex separator is desired: a set of vertices S whose removal disconnects the graph into two parts with no edge joining a vertex in one part to a vertex in the other. Here the two goals are that the separator should have a small number of vertices and, as before, that the two parts should not differ by too many vertices.

Note that every balanced partition has at least two requirements: minimize the weight of the cut, and balance the sizes of all partitioned subgraphs. Also notice that, when formulating the problem as a linear program, we can accept only one objective function subject to constraints. To resolve this, the following approaches can be used. The first approach is to make balance one of the constraints. Let ε < 1 be an imbalance tolerance such that, for a partition (A, B), the sizes of A and B are both at most 1 + ε times the average size, i.e., w(A) ≤ (1 + ε)w(G)/2 and w(B) ≤ (1 + ε)w(G)/2, where w(G) is the sum of the vertex weights of graph G. Then the objective function is to minimize the weight of the cut.
The second approach is to make the weight of the cut a constraint and then minimize the largest size of the partitioned subgraphs.

3.3 Graph Partition Algorithm

The oldest separator result is for trees and is due to Jordan in 1869: every tree has a single vertex that separates it into two parts, with no part containing more than two thirds of the vertices. Lipton and Tarjan [29] showed that every planar graph has a vertex separator of size at most $\sqrt{8n}$ separating it into two parts, with no part having more than two thirds of the vertices. These results extend to two-dimensional finite element graphs. Vertex separators of size $O(\sqrt{gn})$ also exist for graphs of bounded genus g and for graphs with certain excluded minors. Miller, Teng, Thurston, and Vavasis [38, 39] have recently extended these results to a more general class of graphs embedded in d dimensions. They show that for every k-ply neighborhood system there is a vertex separator of size $O(k^{1/d} n^{1-1/d})$, and the largest subgraph generated has at most (d + 1)n/(d + 2) vertices, where n is the number of vertices of the original graph. This generalizes the result of Lipton and Tarjan to higher dimensions. We give a brief summary of some widely used graph partitioning algorithms; the survey by Pothen [42] gives more detailed descriptions. Most of them are designed for unweighted graphs, although some of them can be generalized to weighted graphs.

3.3.1 Level-structure Partitioning

Let p, q be any two vertices of a connected graph G = (V, E), and let d(p, q) denote the length of the shortest path connecting p and q. The diameter of the graph is defined as the maximum value of d(p, q) over all vertices p, q of V. An early algorithm for computing vertex separators was provided in SPARSPAK [11], a library of routines for solving sparse systems of equations by direct methods. This algorithm directly computes a vertex separator, from which an edge separator can also be computed. Notice also that it balances the partition only heuristically, and only for unweighted graphs. The algorithm is as follows.

Algorithm Level-structure Partitioning
1. Find a pseudo-peripheral vertex v in the graph, i.e., one of a pair of vertices that are approximately at the greatest distance from each other in the graph. This can be done by breadth-first search.
2. Use a breadth-first search from v to partition the vertices into levels.
3. Choose the vertices in the median level as the vertex separator.

The level of a vertex is its level in the breadth-first search tree. The vertex v belongs to the zeroth level, and all new neighbors of vertices in the ith level belong to the (i + 1)th level, for i = 0, 1, .... Here u is a new neighbor of vertex w if u has not been visited by the breadth-first search before w is visited. Note that this approach does not guarantee that the parts have approximately equal size.

The following slight variant partitions the vertices into roughly equal sets: chooses theseparator to be the vertices in the smallest level k such that the levels 0; 1; � � � ; k togethercontain more than half the vertices. Another improvement is to remove from the separatorthose vertices in level k that are adjacent to vertices in level k � 1 but not to vertices in levelk + 1.The algorithm for computing the vertex separators is quite fast, because it only involvesa few breadth �rst searches to compute the pseudo-peripheral vertex and the level structures.Unfortunately, the quality of the separators is quite poor, relative to the other algorithmsdescribed here. Within SPARSPAK, the separator algorithm was used to compute nesteddissection orderings. Experience had shown that the orderings generated from this separatoralgorithm were not competitive with the Multiple Minimum Degree (MMD) orderings in termsof the storage and the arithmetic required for factoring a sparse matrix [41].3.3.2 The Spectral Partitioning AlgorithmWe consider the bisection problem for unweighted graph: Partition the vertices of a graphG = (V;E) into two sets B1 and B2 to minimize the number of cut edges, i.e., edges with oneendpoint in B1 and the other in B2. Assume that the graph has even number of vertices. Thebisection problem can be formulated as the minimization of a quadratic objective function bymeans of the Laplacian matrix Q = Q(G) of the graph G. Let d(i) denote the degree of a vertexi, i.e., the number of vertices adjacent to i. The Laplacian matrix Q can be expressed as twomatrices associated with graph G: Q = D � A, where A is the adjacency matrix of G, and Dis the n� n diagonal matrix of the degrees of vertices of G. In other words, the element qij ofQ satis�es the following:qij = 8>>>><>>>>: �1 if edge (i; j) 2 E, and i 6= j0 if edge (i; j) 2 E, and i 6= jd(i) if i = j: :Note that the Laplacian matrix Q of graph G is symmetric positive semide�nite. 
In otherwords, the eigenvalue of the matrix is non-negative. Let x be a n-vector with component xi = 128

Page 37: DYNAMIC LOAD BALANCING FOR PARALLEL ADAPTIVE MESH

if i 2 B1, and xi = �1 if i 2 B2. Then we have thatxTQx = X(i;j)2E(xi � xj)2:Thus the bisection problem is equivalent to the problem of minimizing the quadratic form xTQxover n-vectors x with components xi 2 f1;�1g and Pni=1 xi = 0, because we assume that thegraph G has even number of vertices. If G has odd number of vertices, then the constraincondition can be Pni=1 xi = 1. Here we assume that the cardinality of B1 is larger than that ofB2. The other case is totally symmetric. Then we have the general formula about the constraincondition Pni=1 xi � 1.From linear programming, assuming that G has even number of vertices, we haveminxi2f1;�1g;Pni=1 xi=0 xTQx � minPni=1 x2i=n;Pni=1 xi=0xTQx= x2TQx2= �2(Q)x2Tx2= n�2(Q);where x2 is the eigenvector corresponding to the smallest positive eigenvalue of the Laplacianmatrix Q(G). If G has odd number of vertices, we have same result.The minimizer of the relaxed problem is the second eigenvector of the Laplacian. It isshowed [7] that the closest partition vector to the second eigenvector is obtained by roundingthe most positive n=2 components of the latter to +1, and the remaining components to �1.The discussion above suggests the following algorithm to partition a graph: compute asecond eigenvector of the Laplacian of the graph, and then partition the vertices into twosubsets by the median eigenvector component. The detail of the algorithm is given following.Algorithm Spectral Partitioning1. Compute the Laplacian matrix Q of the graph G.2. Compute the second eigenvalue and its corresponding eigenvector of the Lapla-cian matrix of the graph. 29
3. Obtain the partition vector closest to the second eigenvector by rounding the $n/2$ most positive components of the latter to $+1$ and the remaining components to $-1$; that is, partition the vertices into two subsets by the median eigenvector component.

3.3.3 Geometric Approach

Finite element or finite difference meshes embedded in space contain geometric information about the coordinates of the mesh points. Algorithms for partitioning meshes by bisecting along coordinate axes have been considered by Simon, Williams, and many others. A parallel nested dissection algorithm based on this idea has been described by Heath and Raghavan [20]. These algorithms have the virtue of being fast and easy to implement in parallel; however, the quality of the separators obtained by such straight-line cuts is poor relative to the other algorithms, especially for adapted meshes.

We now describe a geometric partitioning algorithm designed by Miller, Teng, Thurston, and Vavasis [38], and implemented by Gilbert, Miller, and Teng [18]. This algorithm computes a separator by using a circle rather than a straight line to cut the mesh. Given a graph embedded in $d$-dimensional space, we disregard the edges of the graph and view the graph as a collection of vertices. Since the graph is embedded in $d$-dimensional space, each vertex has a set of geometric coordinates attached to it.

Let $\{B_1, B_2, \ldots, B_n\}$ be a set of closed balls in $\mathbb{R}^d$. We obtain an overlap graph by creating a vertex for each ball and joining two vertices with an edge if the corresponding balls intersect. The set of balls is a $k$-ply neighborhood system if every point of the domain lies strictly inside at most $k$ balls.

It is possible to include another parameter $\alpha$, which determines by how much the radius of each ball can be enlarged to make two balls intersect, and thus obtain a more general construction of overlap graphs. Let $B(x, r)$ denote the ball centered at point $x$ with radius $r$. Assume that ball $B_i$, centered at $c_i$, has radius $r_i$.
An overlap graph (with parameters $\alpha$ and $k$) is a $k$-ply neighborhood system in which a vertex represents each ball and an edge joins two vertices if the corresponding balls intersect when the smaller of the two balls is expanded by a factor $\alpha$.
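The overlap-graph construction can be sketched directly from the definition. In this sketch, balls are given as (center, radius) pairs, two balls are joined when they intersect after the smaller one is expanded by the factor $\alpha$, and the ply is only sampled at caller-chosen points; checking the $k$-ply condition exactly over all points is not attempted. The names `overlap_edges` and `ply_at` are hypothetical.

```python
import math
from itertools import combinations


def overlap_edges(balls, alpha=1.0):
    """Edges of the overlap graph; balls[i] = (center, radius).

    Balls i and j are joined if they intersect once the smaller one is
    expanded by the factor alpha (closed balls, so touching counts).
    """
    edges = []
    for i, j in combinations(range(len(balls)), 2):
        (ci, ri), (cj, rj) = balls[i], balls[j]
        small, large = min(ri, rj), max(ri, rj)
        if math.dist(ci, cj) <= large + alpha * small:
            edges.append((i, j))
    return edges


def ply_at(point, balls):
    """Number of balls whose interior strictly contains the point."""
    return sum(1 for c, r in balls if math.dist(point, c) < r)


balls = [((0.0, 0.0), 1.0), ((2.0, 0.0), 1.0), ((5.0, 0.0), 1.0)]
print(overlap_edges(balls, alpha=1.0))  # [(0, 1)]
print(overlap_edges(balls, alpha=2.0))  # [(0, 1), (1, 2)]
```

With $\alpha = 1$ the construction reduces to the plain intersection graph of the balls; a larger $\alpha$ adds edges between balls that are close but disjoint.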

Overlap graphs can be embedded in space by locating the vertex corresponding to ball $B_i$ at the point $p_i$, the center of the ball. Planar graphs, two-dimensional meshes, and three-dimensional meshes of bounded aspect ratio can be represented by overlap graphs for suitable choices of the parameters $\alpha$ and $k$. Miller et al. [38] proved that an $(\alpha, k)$-overlap graph has a vertex separator of size $O(\alpha k^{1/d} n^{1-1/d})$; the larger of the two parts obtained by removing the separator contains at most a $(d+1)/(d+2)$ fraction of the vertices of the original graph.

A center point of a given set of points is a point (not necessarily one of the given points) such that every hyperplane through it divides the given set of points approximately evenly into two subsets. "Approximately evenly" here means that the worst-case ratio of the sizes of the two subsets is $d : 1$. It can be proved that every finite point set in $\mathbb{R}^d$ has a center point, and the proof yields a polynomial-time algorithm that employs linear programming ($O(n^d)$ inequalities in $d$ variables) to compute the center point. However, this algorithm is too slow to be practical, and heuristics to compute approximate center points are used instead. One such approximation is the following sampling algorithm.

Algorithm Approximating the Center Point
1. Select a subset $S$ of size $l$ uniformly at random from the given point set $P$.
2. Compute a center point $c_S$ of $S$, using the linear programming algorithm for center points.
3. Output $c_S$.

It can be shown that for any constant $0 < \epsilon < 1$, the above algorithm computes a $(d + \epsilon) : 1$ center point with high probability, provided that $l > q(\epsilon, d)$, where $q$ is a function that does not depend on the size $n$ of $P$. Therefore, we can approximate a center point in randomized constant time.

For convenience in the following exposition, we first scale and translate the given points so that the transformed coordinates lie between $-1$ and $+1$.
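The center-point condition can be checked numerically for a candidate point in the plane. The sketch below is an illustration under stated assumptions, not the linear-programming algorithm referred to above: it samples a fixed number of directions (the `num_dirs` parameter is an assumption) and, for each, counts the points in the closed half-plane on one side of the line through the candidate. A center point in $\mathbb{R}^d$ must leave at least $n/(d+1)$ points in every such half-space, which is the $d : 1$ ratio stated above. Both function names are hypothetical.

```python
import math


def sampled_halfspace_depth(points, c, num_dirs=360):
    """Minimum, over sampled directions u, of #{p : u . (p - c) >= 0}.

    Only num_dirs directions are tried, so this is an upper bound on the
    true (Tukey) depth of c, not an exact computation.
    """
    best = len(points)
    for k in range(num_dirs):
        theta = 2 * math.pi * k / num_dirs
        ux, uy = math.cos(theta), math.sin(theta)
        count = sum(1 for px, py in points
                    if ux * (px - c[0]) + uy * (py - c[1]) >= 0)
        best = min(best, count)
    return best


def looks_like_center_point(points, c, d=2, num_dirs=360):
    """Check the d:1 condition: every sampled closed half-plane through c
    contains at least n/(d+1) of the points."""
    return sampled_halfspace_depth(points, c, num_dirs) >= len(points) / (d + 1)


square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(sampled_halfspace_depth(square, (1, 1)))  # 2
print(looks_like_center_point(square, (5, 5)))  # False
```

Because only sampled directions are checked, a `True` result is evidence rather than proof that the candidate is a center point.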
An outline of the geometric partitioning algorithm is as follows:

Algorithm Geometric Separator
1. Project the input points in $\mathbb{R}^d$ to the surface of the unit sphere in $\mathbb{R}^{d+1}$ centered at the origin. The "north pole" of the sphere has coordinates $(0, 0, \ldots, 0, 1)$, and
a point $p$ is projected to the surface of the sphere along the line through $p$ and the north pole.
2. Compute a center point of the projected points on the surface of the $(d+1)$-dimensional sphere. The center point lies in the interior of the sphere.
3. Move the center point to the origin of the sphere in two steps. First, rotate the projected points about the origin in $\mathbb{R}^{d+1}$ so that the center point becomes a point $(0, 0, \ldots, 0, r)$ on the $(d+1)$st axis. Second, dilate the points on the surface of the sphere so that the center point becomes the origin.
4. Choose a random great circle on the unit sphere in $\mathbb{R}^{d+1}$.
5. Transform the great circle in $\mathbb{R}^{d+1}$ to a circle in $\mathbb{R}^d$ by reversing the dilation and the rotation and projecting back from $\mathbb{R}^{d+1}$ to $\mathbb{R}^d$.
6. A vertex separator is the set of vertices that lie sufficiently "close" to the separating circle. ("Close" means that the circle intersects the balls corresponding to the vertices when the balls are expanded by a factor $\alpha$.) An edge separator is the set of edges cut by the separating circle.

The geometric algorithm has several advantages. It examines only the vertices of the graph and makes no use of the edges except to compute the quality of the generated separators. The computations (projecting up to $\mathbb{R}^{d+1}$, finding an approximate center point, rotating and dilating the points, projecting down) involve simple operations on the points and the repeated computation of the null vector of a $(d+2) \times (d+2)$ matrix. This also makes the algorithm attractive on a parallel computer. However, the fact that the algorithm makes no use of the edge information is also a weakness when partitioning edge-weighted meshes, where the weight of the cut edges needs to be minimized.

3.3.4 A Multilevel Algorithm

Multilevel algorithms for graph partitioning are similar in spirit to multigrid algorithms for solving linear systems of equations.
We view the given graph, with a large number of vertices, as the finest graph in a sequence of coarse graphs to be computed. Given a "fine" graph, we obtain a "coarse" graph with fewer vertices by a suitable shrinking (contraction) procedure. We construct a sequence of coarser graphs until the size of the last graph computed thus far is
small enough. A high-quality partitioning algorithm such as the spectral algorithm is used to partition the coarsest graph. The partition of a coarse graph is then used to partition the finer graph immediately preceding it in the sequence by reversing the shrinking step used to coarsen. We call this an uncoarsening step. Next, this "rough" partition of the fine graph is refined by a vertex-swapping algorithm that moves vertices between the parts to reduce the number of cut edges. We call this a refinement step. The uncoarsening and refinement steps are repeated for each successive pair of fine and coarse graphs in the sequence until a partition of the given graph is computed.

Hendrickson and Leland [21, 22] implemented such an algorithm and included it in their CHACO software. The multilevel algorithm has also been implemented by Karypis and Kumar [24], who further refined the algorithm and included an implementation in their METIS software. They have also described a parallel implementation of this algorithm and provided an analysis [23, 25]. A detailed description of the algorithm follows.

Algorithm Multilevel Partitioning
1. If the input graph $G$ is not small, coarsen $G$ to reduce its size, and let $G'$ be the coarsened graph. Otherwise, compute a partition of $G$ directly with another algorithm, such as spectral partitioning, and return it.
2. Recursively call the multilevel partitioning algorithm on $G'$. Let $(A', B')$ be the resulting partition.
3. Project the partition $(A', B')$ of $G'$ back to a partition $(A, B)$ of $G$. Note that we record, for each vertex of $G'$, which vertices of $G$ were contracted to form it.
4. Refine the projected partition $(A, B)$ to obtain a better partition of $G$, and return it as the partition of $G$.

Note that the multilevel partitioning algorithm inherently has to support graphs with weighted vertices and weighted edges.
The total weight of the edges cut by a partition of a coarsened graph is equal to the total weight of the edges cut by the corresponding (projected) partition of the preceding finer graph. The weight of a vertex in the coarsened graph is the sum of the weights of the vertices that were contracted to form it.
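The weight bookkeeping in the coarsening step can be illustrated with a single contraction pass. This sketch matches each unmatched vertex with the unmatched neighbor joined by its heaviest edge, which is one common choice of matching but an assumption here, not necessarily the scheme used in Chaco or METIS; merged vertices add their weights, edges inside a merged pair disappear, and parallel edges add their weights. The hypothetical helper `cut_weight` then confirms that a partition of the coarse graph and its projection onto the fine graph cut the same total edge weight.

```python
def coarsen(vweight, eweight):
    """One contraction step.

    vweight: {vertex: weight}; eweight: {frozenset({u, v}): weight}.
    Returns (coarse vweight, coarse eweight, cmap), where cmap maps each
    fine vertex to the coarse vertex it was contracted into.
    """
    adj = {v: [] for v in vweight}
    for e, w in eweight.items():
        u, v = tuple(e)
        adj[u].append((v, w))
        adj[v].append((u, w))
    cmap, next_id = {}, 0
    for v in sorted(vweight):
        if v in cmap:
            continue
        free = [(u, w) for u, w in adj[v] if u not in cmap]
        cmap[v] = next_id
        if free:
            partner, _ = max(free, key=lambda uw: uw[1])  # heaviest free edge
            cmap[partner] = next_id
        next_id += 1
    cvw = {}
    for v, w in vweight.items():  # coarse vertex weight = sum of merged weights
        cvw[cmap[v]] = cvw.get(cmap[v], 0) + w
    cew = {}
    for e, w in eweight.items():
        u, v = tuple(e)
        if cmap[u] != cmap[v]:  # edges inside a matched pair vanish
            key = frozenset((cmap[u], cmap[v]))
            cew[key] = cew.get(key, 0) + w  # parallel edges add up
    return cvw, cew, cmap


def cut_weight(eweight, part):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for e, w in eweight.items() if len({part[v] for v in e}) == 2)


vw = {0: 1, 1: 1, 2: 1, 3: 1}
ew = {frozenset({0, 1}): 5, frozenset({1, 2}): 1, frozenset({2, 3}): 5}
cvw, cew, cmap = coarsen(vw, ew)
coarse_part = {0: "A", 1: "B"}
fine_part = {v: coarse_part[cmap[v]] for v in vw}
print(cvw, cut_weight(cew, coarse_part), cut_weight(ew, fine_part))  # {0: 2, 1: 2} 1 1
```

In the example, the two heavy edges are contracted away and only the light middle edge survives in the coarse graph, with its weight unchanged.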

Chapter 4

Load Balancing

4.1 Introduction

Adaptive mesh refinement is a key problem in large-scale numerical calculations. The need for adaptive mesh refinement can introduce load imbalance among processors, where the load measures the amount of work required by the refinement itself as well as by the numerical calculations thereafter.

In this chapter, we present an abstract problem to model the process of parallel adaptive mesh refinement. This abstract problem is general enough: it uses balanced quadtrees and octrees to represent well-shaped meshes, and it allows quadtrees and octrees to grow dynamically and adaptively to approximate the process of adaptive refinement of unstructured meshes. The model is also simple enough geometrically to provide a good framework for the design of mesh refinement algorithms. In the next section, we give a brief description of the load balancing problem for adaptive meshing and of our algorithm. In Section 4.2, we present the details of the dynamic load balancing algorithm for this abstract problem. In Section 4.9, we show how algorithms for this abstract problem can be used for general unstructured mesh refinement.

We consider issues and algorithms for adaptive mesh refinement (cf. Step 6 of the paradigm on page 1). The general scenario is the following. We start with an initial well-shaped mesh $M_0$ for an input domain and differential equations. Then we form the numerical system from the mesh $M_0$ and the original differential equations. We partition the mesh $M_0$ and map the submeshes and their fractions of the numerical system onto a parallel machine. By solving the
numerical system in parallel, we obtain an initial numerical solution $S_0$. If $S_0$ is within the error tolerance, we return it. Otherwise, an error estimation of $S_0$ generates a refinement spacing function $h_1$ over the domain, which defines the expected size of the mesh elements in each region of the domain. In other words, if in some region the estimated error of $S_0$ is too large compared to the error tolerance or the average error, we refine the mesh elements in that region so that the next computation generates a more accurate solution there. Therefore, we need to refine the initial mesh $M_0$ according to $h_1$ to generate another well-shaped mesh $M_1$. Note that for some applications it may be desirable to also coarsen the mesh in some regions, so that the size of the mesh is approximately minimized without affecting the accuracy of the solution. In this thesis, however, we assume that the error estimation generates no coarsening requirements.

Note that the refinement requirements are often not uniformly distributed over the domain. Hence, refinement introduces load imbalance among the processors of the parallel machine: some submeshes, after refinement, may be much larger than others. The workload of a processor in the next stage is the sum of the time the processor spends refining its submesh and the time it needs to solve its fraction of the numerical system over the refined mesh $M_1$.

In this chapter, we present a dynamic load balancing algorithm that ensures that the computation at each stage of the refinement is balanced and optimized. The basic idea of our algorithm is as follows. Our algorithm first estimates the size and distribution of $M_1$ before it is actually generated. Based on this estimate, we can compute a quality partition of $M_1$ before we generate it. Note that the partition is computed based on the mesh $M_0$; here $M_0$ acts like a coarsened version of $M_1$.
Every vertex of $M_0$ is assigned a weight that approximates how many vertices the mesh $M_1$ will have around it. This idea is similar to the multilevel partitioning algorithm. The partition of $M_1$ can be projected back to $M_0$, which divides the submesh on each processor into one or more subsubmeshes. Our algorithm first moves these subsubmeshes to the proper processors and then performs the refinement. This is more efficient than moving $M_1$, because $M_0$ is usually smaller than $M_1$. Note that this approach treats the mesh refinement cost as part of the workload to be balanced. In partitioning $M_1$ (actually the weighted mesh $M_0$), we take into account the communication cost of moving these subsubmeshes as well as the communication cost of solving the numerical system over $M_1$.

4.2 Dynamic Balanced Quadtrees

The basic data structure for quadtree/octree-based adaptive mesh refinement is a box, i.e., a square in two dimensions and a cube in three dimensions. A box is a $d$-dimensional cube embedded in an axis-parallel manner in $\mathbb{R}^d$. Initially, there is a large $d$-dimensional box, which we call the top-box, that contains the interior of the domain and a neighborhood around the domain. A box may be split, meaning that it is replaced by $2^d$ equal-sized boxes whose side length is one half of that of the original box. These smaller boxes are called the children boxes of the original box. A sequence of splittings starting at the top-box generates a $2^d$-tree, i.e., a quadtree in two dimensions and an octree in three dimensions. The leaf-boxes of a $2^d$-tree are those that have no children; the other boxes, which have children, are internal-boxes. The size of a $2^d$-tree $T$, denoted by $size(T)$, is the number of leaf-boxes of $T$. The depth of a box $b$ in $T$, denoted by $depth(b)$, is the number of splittings needed to generate $b$ from the top-box; hence the depth of the top-box is 0. Two leaf-boxes of a $2^d$-tree are neighbors if they have a $(d-1)$-dimensional intersection. A $2^d$-tree is balanced iff for any two neighbor leaf-boxes $b_1$, $b_2$, $|depth(b_1) - depth(b_2)| \le 1$. Figure 4.1(a) shows an example of a balanced quadtree $T$. Boxes $b_1$ and $b_2$ are neighbors, but $b_1$ and $b_3$ are not. The depths of boxes $b_1$, $b_2$, and $b_3$ are 2, 3, and 2, respectively.

Figure 4.1: A balanced quadtree T in two dimensions, and a 4-way partition.

Figure 4.1(b) shows a 4-way partition of the quadtree. If we have four processors, we can use this 4-way partition to map the quadtree onto these processors. A good partition should balance the computation load and reduce the communication cost. In general, if we have $k$ processors, we map the quadtree using a good $k$-way partition.
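A leaf-box can be encoded by its depth and integer grid position, which makes the definitions of Section 4.2 easy to state in code. In this sketch (an assumed encoding, limited to quadtrees), the box `(depth, i, j)` covers $[i, i+1] \times [j, j+1]$ scaled by $2^{-depth}$ inside a unit top-box; two leaves are neighbors when they share an edge segment of positive length, and the tree is balanced when all neighbors differ in depth by at most one. The function names are hypothetical, and the pairwise check is quadratic in $size(T)$, so it is only meant for small examples.

```python
from itertools import combinations


def split(box):
    """Replace a box by its 2^d = 4 children (quadtree case)."""
    depth, i, j = box
    return [(depth + 1, 2 * i + di, 2 * j + dj) for di in (0, 1) for dj in (0, 1)]


def extent(box, max_depth):
    """Integer coordinates (x0, x1, y0, y1) on a grid of resolution 2^-max_depth."""
    depth, i, j = box
    s = 1 << (max_depth - depth)
    return i * s, (i + 1) * s, j * s, (j + 1) * s


def are_neighbors(a, b, max_depth):
    """True if boxes a and b share a (d-1)-dimensional face segment."""
    ax0, ax1, ay0, ay1 = extent(a, max_depth)
    bx0, bx1, by0, by1 = extent(b, max_depth)
    touch_x = ax1 == bx0 or bx1 == ax0
    touch_y = ay1 == by0 or by1 == ay0
    overlap_x = min(ax1, bx1) > max(ax0, bx0)
    overlap_y = min(ay1, by1) > max(ay0, by0)
    return (touch_x and overlap_y) or (touch_y and overlap_x)


def is_balanced(leaves):
    """|depth(b1) - depth(b2)| <= 1 for every pair of neighbor leaf-boxes."""
    max_depth = max(b[0] for b in leaves)
    return all(abs(a[0] - b[0]) <= 1
               for a, b in combinations(leaves, 2)
               if are_neighbors(a, b, max_depth))


def refine_leaf(leaves, box):
    """Split one leaf-box of the tree (given as its list of leaves)."""
    return [b for b in leaves if b != box] + split(box)


leaves = split((0, 0, 0))                # four depth-1 leaves
leaves = refine_leaf(leaves, (1, 0, 0))
print(len(leaves), is_balanced(leaves))  # 7 True
leaves = refine_leaf(leaves, (2, 1, 1))
print(len(leaves), is_balanced(leaves))  # 10 False
```

The second split creates depth-3 boxes that share edges with depth-1 leaves, which violates the balance condition.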

4.3 Modeling Adaptive Refinement with Dynamic Quadtree

A balanced $2^d$-tree can be viewed as a well-shaped mesh in $\mathbb{R}^d$. Conversely, a well-shaped mesh in $\mathbb{R}^d$ can be approximated by a balanced $2^d$-tree. In fact, most quadtree/octree-based mesh generation algorithms first construct a balanced $2^d$-tree over an input domain and then apply a local warping procedure to build the final triangular mesh [4, 40]. We model mesh refinement by allowing a $2^d$-tree to grow adaptively and dynamically. When the accuracy requirement of a problem changes during the numerical simulation, we need to refine the mesh accordingly. In particular, an error estimation of the computation from the previous stage generates a new spacing function over the input domain. The new spacing function defines the expected size of the mesh elements in each region for the formulation in the next stage. In the context of a $2^d$-tree, the refinement requires that some leaf-boxes be split into a collection of boxes of a certain size while globally maintaining that the resulting $2^d$-tree is still balanced. We model the refinement of a $2^d$-tree as follows:

Definition 4.3.1 (Adaptive Refinement of $2^d$-trees) The input to the problem has two parts: (1) a balanced $2^d$-tree $T$, and (2) a list of non-negative integers $\rho$, one for each leaf-box, i.e., associated with each leaf-box $b$ is an integer $\rho(b)$. The input pair $T$ and $\rho$ defines a new balanced tree $T^*$, which can be constructed by the following two-phase procedure.
1. Construct $T'$ from $T$ by dividing each leaf-box $b$ of $T$ into $2^{d\rho(b)}$ equal-sized boxes, i.e., by a uniform depth-$\rho(b)$ splitting.
2. Construct $T^*$ by balancing $T'$.
The output of the problem is $T^*$.

As shown in Figure 4.2(a), the $2^d$-tree $T'$ built from $T$ (by dividing each leaf-box $b$ of $T$ into $2^{d\rho(b)}$ equal-sized boxes) is not balanced in general.
Hence, to maintain the well-shaped condition, we need to balance the tree, as shown in Figure 4.2(b). The refinement most likely introduces load imbalance among the processors, which is reflected in the work for both the refinement and the computations thereafter. In our example, processors P2 and P4 will have much more work to do during the refinement than processors P1 and P3. The computation load for the next stage will be imbalanced as well if we maintain the current partition.
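The two-phase procedure of Definition 4.3.1 can be sketched with a simple box encoding: `(depth, i, j)` denotes a $2^{-depth}$-sized square of a unit top-box (an assumed encoding, quadtrees only, with hypothetical function names). Phase 1 splits each leaf-box $b$ uniformly to depth $\rho(b)$, so $size(T') = \sum_b 4^{\rho(b)}$; phase 2 repeatedly splits any leaf that has a neighbor at least two levels deeper until no such leaf remains. The quadratic neighbor scan keeps the sketch short and is not how a parallel implementation would balance the tree.

```python
from itertools import combinations


def split_to(box, levels):
    """Uniform depth-`levels` splitting: 4**levels equal-sized children."""
    depth, i, j = box
    m = 1 << levels
    return [(depth + levels, i * m + di, j * m + dj)
            for di in range(m) for dj in range(m)]


def are_neighbors(a, b, max_depth):
    """Boxes share an edge segment of positive length (quadtree case)."""
    (da, ia, ja), (db, ib, jb) = a, b
    sa, sb = 1 << (max_depth - da), 1 << (max_depth - db)
    ax0, ax1, ay0, ay1 = ia * sa, (ia + 1) * sa, ja * sa, (ja + 1) * sa
    bx0, bx1, by0, by1 = ib * sb, (ib + 1) * sb, jb * sb, (jb + 1) * sb
    touch_x = ax1 == bx0 or bx1 == ax0
    touch_y = ay1 == by0 or by1 == ay0
    overlap_x = min(ax1, bx1) > max(ax0, bx0)
    overlap_y = min(ay1, by1) > max(ay0, by0)
    return (touch_x and overlap_y) or (touch_y and overlap_x)


def refine(leaves, rho):
    """Phase 1: T' from T by splitting each leaf b to depth rho[b]."""
    out = []
    for b in leaves:
        out.extend(split_to(b, rho.get(b, 0)))
    return out


def balance(leaves):
    """Phase 2: split coarse leaves until neighbors differ by <= 1 level."""
    leaves = set(leaves)
    while True:
        max_depth = max(b[0] for b in leaves)
        too_coarse = set()
        for a, b in combinations(leaves, 2):
            if are_neighbors(a, b, max_depth) and abs(a[0] - b[0]) >= 2:
                too_coarse.add(min(a, b))  # tuples compare depth first: the shallower box
        if not too_coarse:
            return sorted(leaves)
        for b in too_coarse:
            leaves.remove(b)
            leaves.update(split_to(b, 1))


T = split_to((0, 0, 0), 1)       # four depth-1 leaves
rho = {(1, 0, 0): 2}             # refine one leaf by two levels
T1 = refine(T, rho)
T_star = balance(T1)
print(len(T1), len(T_star))      # 19 25
```

Here $size(T') = 16 + 1 + 1 + 1 = 19$; balancing splits the two depth-1 leaves that share an edge with the refined quadrant, giving $size(T^*) = 25$.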

Figure 4.2: Quadtree $T'$ after refining $T$, and $T^*$ after balancing $T'$. (a): assume the splitting depths of $b_1$, $b_2$, $b_3$, and $b_4$ are 1, 1, 2, and 2, respectively. (b): assume that after balancing $T'$, we maintain the current partition of $T$.

Therefore, as an integral part of parallel adaptive computation, we need to dynamically repartition the domain for both the refinement and the computations of the next stage. Note, however, that we do not need to perform load balancing after every adaptive mesh refinement. We can use a semi-dynamic approach: we can perform load balancing after every constant number of mesh refinements, or when the load is very unbalanced (as determined by some load indicator algorithm).

To balance the work for refinement, we need to partition a $2^d$-tree before we actually refine it. In the next section, we present an efficient method to estimate the size and element distribution of a refined $2^d$-tree without actually generating it.

4.4 Reduce Dynamic Load Balancing to Static Partitioning

The original $2^d$-tree $T$ is distributed across a parallel machine based on a partition of $T$. Assume that we have $k$ processors, that we have divided $T$ into $k$ subdomains $S_1, \ldots, S_k$, and that we have mapped $S_i$ onto processor $i$. A good partition is in general balanced, i.e., the subdomains $S_1, \ldots, S_k$ have approximately the same size. In addition, the number of boundary boxes, i.e., leaf-boxes that have neighbors located on different processors, should be small.

A simple-minded way to refine a $2^d$-tree (or a mesh in general) for a new spacing function is to have each processor refine its own subdomain to collectively construct $T^*$: $T'$ is first constructed without communication, and then $T^*$ is built when $T'$ is made into a balanced $2^d$-tree. Note that the construction of $T^*$ from $T'$ needs communication among processors. The original $k$-way partition of $T$ naturally defines a $k$-way partition $(S'_1, \ldots, S'_k)$ of $T'$ and a $k$-way partition $(S^*_1, \ldots, S^*_k)$ of $T^*$. However, these partitions may no longer be balanced. If this is the case, then some processors will have much more work for the refinement than others and may need more memory than others. In addition, the computation of the next stage will no longer be balanced either. Note also that the set of boundary boxes changes during the construction of $T'$ and $T^*$, and the number of boundary boxes may not be as small as it should be. In this approach, to balance the computation for the next stage, we could repartition $T^*$ and distribute it according to the new partition. One shortcoming of this approach is that $T^*$ could potentially be much larger than $T$, and hence redistributing $T^*$ could be considerably more expensive than redistributing $T$.

We would like to have a mechanism to simultaneously balance the work for refinement and for the computation of the next stage. To do so, we need to properly partition $T^*$ before we actually generate it. Furthermore, we need a dynamic load balancing scheme that is simple enough for efficient parallel implementation. In this section, we present an algorithm that effectively reduces the dynamic load balancing problem to a collection of static partitioning problems. We first give a high-level description of our approach; details are given in subsequent subsections.

Repartitioning Method
Input: (1) a balanced $2^d$-tree $T$ that is mapped onto $k$ processors according to a $k$-way partition $S_1, \ldots, S_k$, and (2) a list of non-negative integers $\rho$, one for each leaf-box.
1. In parallel, processor $i$ estimates the size and the element distribution of its subdomains $S'_i$ and $S^*_i$ without constructing them.
2. Collectively, all processors estimate the size of $T^*$. Assume this estimate is $N$. Let $W = \beta (N/k)$ for a small predefined positive constant $\beta < 1$.
3. In parallel, if the estimated size of $S^*_i$ is more than $W$, processor $i$ applies the geometric partitioning algorithm of Miller, Teng, Thurston, and Vavasis [38, 19] to implicitly partition $S^*_i$ into a collection of subsubdomains $S^*_{i,1}, \ldots, S^*_{i,L_i}$. We can naturally project this partition back to $S_i$ to generate subsubdomains
$S_{i,1}, \ldots, S_{i,L_i}$. We use the geometric partitioning algorithm because it does not need the mesh itself to compute the partition; all it needs is a small sample of the mesh elements in $T^*$, and this sample can be generated by our size estimation method.
4. We now have a collection of subsubdomains, where $S_{i,1}, \ldots, S_{i,L_i}$ are generated by processor $i$. To dynamically balance the work of the refinement and the computations thereafter, we remap these subsubdomains to generate a partition of $T$ such that the projected work for the refinement and the computations thereafter is balanced across processors. We would also like to minimize the overhead of moving these subsubdomains.
We introduce the notion of a subdomain graph over these subsubdomains. In the subdomain graph, each subsubdomain is a vertex whose weight is its projected size in $T^*$. If two subsubdomains share some boundary boxes, we include a weighted edge between them, whose weight is the number of boundary boxes in $T^*$ between them. For each processor $i$, we also introduce a vertex $p_i$ in the subdomain graph. The weight of $p_i$ is a reasonably large constant, to be defined later. Vertex $p_i$ is connected to the vertices for subsubdomains $S_{i,1}, \ldots, S_{i,L_i}$, and the weights on these edges are the communication costs of moving $S_{i,1}, \ldots, S_{i,L_i}$, respectively, to any other processor.
5. We construct a $k$-way partition of the subdomain graph using a standard static graph partitioning algorithm, such as those provided in Chaco and Metis [21, 24]. The partition of the subdomain graph defines a new distribution of $T$, and hence of $T^*$, before the refinement. Note that the subdomain graph is small: it has $L_1 + \cdots + L_k$ subsubdomain vertices and $k$ processor vertices.
6. Move each subsubdomain to the processor given by the partition of the subdomain graph.
7. In parallel, each processor refines and balances its subdomain.
8. Solve the resulting numerical system for the next stage.
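The subdomain graph of step 4 can be assembled from per-subsubdomain estimates before any refinement takes place. In this sketch, a subsubdomain $S_{i,l}$ is named by the pair `(i, l)` and the processor vertex $p_i$ by `("p", i)`. The inputs (estimated sizes in $T^*$, shared boundary-box counts with each unordered pair listed once, per-subsubdomain moving costs, and a single `proc_weight` standing in for the "reasonably large constant") are all assumed to come from the size estimation; the function name and argument layout are hypothetical. The output is a vertex-weight map and an edge-weight map that would still have to be converted to the input format of a static partitioner such as Chaco or Metis.

```python
def subdomain_graph(est_size, shared_boundary, move_cost, proc_weight):
    """Build the weighted subdomain graph of the repartitioning method.

    est_size:        {(i, l): projected size of S_{i,l} in T*}
    shared_boundary: {((i, l), (j, m)): boundary boxes in T* between them};
                     each unordered pair should appear once
    move_cost:       {(i, l): cost of moving S_{i,l} off processor i}
    proc_weight:     weight given to every processor vertex p_i
    """
    vweight = dict(est_size)
    eweight = {}
    for (s, t), boxes in shared_boundary.items():
        if boxes > 0:
            key = frozenset((s, t))
            eweight[key] = eweight.get(key, 0) + boxes
    for s in est_size:
        p = ("p", s[0])                      # processor currently holding s
        vweight[p] = proc_weight
        eweight[frozenset((p, s))] = move_cost[s]
    return vweight, eweight


est = {(0, 0): 40, (0, 1): 35, (1, 0): 10}
shared = {((0, 0), (0, 1)): 6, ((0, 1), (1, 0)): 4}
cost = {(0, 0): 40, (0, 1): 35, (1, 0): 10}
vw, ew = subdomain_graph(est, shared, cost, proc_weight=1000)
print(len(vw), len(ew))  # 5 5
```

If a partition separates $S_{i,l}$ from $p_i$, the cut edge between them charges the cost of moving $S_{i,l}$ off processor $i$, so moving overhead and boundary communication enter the same cut objective.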

4.5 Subdomain Size Estimation

We now estimate the size of the quadtree $T^*$ obtained by refining and balancing the quadtree $T$. Our technique can be directly extended to general $2^d$-trees. The key ingredient of our estimation algorithm is the fact that refining a box $b$ can cause only the neighbors of $b$ and a very small portion of other boxes to be refined for balancing.

4.5.1 Affected Boxes of Refining a Box

The following definition specifies the region around $b$ that will mostly have to be refined for balancing.

Definition 4.5.1 (region(b)) For each leaf-box $b$ of a balanced $2^d$-tree, the effect region of $b$, denoted by $region(b)$, is the set of all boxes that share at least one point with $b$.

Clearly, the size of $region(b)$ is a constant if the dimension $d$ is fixed.

Lemma 4.5.1 For any leaf-box $b$ of a quadtree $T$, $|region(b)| \le 12$, and the region of $b$ can be computed in constant time.

Proof: We use the following observation. For each edge $e$ of leaf-box $b$, there are at most two other leaf-boxes that intersect the interior of $e$. For each vertex $v$ of $b$, there is at most one leaf-box that shares only $v$ with $b$. Therefore, there are at most $4 \cdot 2 + 4 \cdot 1 = 12$ leaf-boxes in $region(b)$ in two dimensions. □

Similarly, there are at most $6 \cdot 4 + 12 \cdot 2 + 8 \cdot 1 = 56$ leaf-boxes in $region(b)$ in three dimensions. For a general $2^d$-tree, there are at most $\sum_{k=0}^{d-1} 2^k f(k)$ leaf-boxes in $region(b)$, where $f(k)$ is the number of $k$-dimensional faces of a $d$-dimensional box.

The refinement of a box $b$ affects not only the boxes in $region(b)$; the refinement of boxes in $region(b)$ may in turn cause other boxes to be refined for balancing.

Definition 4.5.2 (pyramid(b)) The pyramid of a leaf-box $b$ of a quadtree $T$, denoted by $pyramid(b)$, consists of the following boxes.
1. If a leaf-box $b_1 \notin region(b)$ shares a $(d-1)$-dimensional face with a box $b_2$ in $region(b)$, $\rho(b_1) = 0$, and the side length of $b$ is one half of that of $b_1$, then $b_1$ is in $pyramid(b)$.
2. If a leaf-box $b_2$ shares a $(d-1)$-dimensional face with a leaf-box $b_1$ in $pyramid(b)$, $\rho(b_1) = 0$, and the side length of $b_1$ is one half of that of $b_2$, then $b_2$ is in $pyramid(b)$.

Figure 4.3 shows two templates for splitting boxes in $region(b)$ and a typical example of refinement and balancing. Figure 4.3(a) shows that the refinement of $b$ causes the neighbor leaf-box $b_1$ to be split, assuming $\rho(b) = 5$ and $\rho(b_1) = 0$. We call the box $b_1$ the neighbor of $b$. Figure 4.3(b) shows that the refinement of $b$ causes $b_1$ to be split, assuming $\rho(b) = 6$ and $\rho(b_1) = 0$. We call the box $b_1$ the corner neighbor of $b$. Figure 4.3(c) gives a typical example of how the refinement of $b$ influences the splitting of the leaf-boxes contained in $region(b)$.
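The face-counting bound stated after Lemma 4.5.1 can be evaluated mechanically. A $d$-dimensional box has $f(k) = \binom{d}{k} 2^{d-k}$ faces of dimension $k$, and the general bound assumes that at most $2^k$ leaf-boxes touch $b$ only through a given $k$-dimensional face. The short sketch below (hypothetical function names) evaluates $\sum_{k=0}^{d-1} 2^k f(k)$; it is an arithmetic check, not an algorithm for computing $region(b)$.

```python
from math import comb


def faces(d, k):
    """Number of k-dimensional faces of a d-dimensional box."""
    return comb(d, k) * 2 ** (d - k)


def region_bound(d):
    """Upper bound on |region(b)|: sum over k < d of 2^k * f(k)."""
    return sum(2 ** k * faces(d, k) for k in range(d))


print([region_bound(d) for d in (1, 2, 3)])  # [2, 12, 56]
```

For $d = 2$ this reproduces the bound of 12 from the lemma, and for $d = 3$ it gives $8 \cdot 1 + 12 \cdot 2 + 6 \cdot 4 = 56$.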

Figure 4.3: The two templates for splitting boxes in $region(b)$, and how the refinement of $b$ influences the splitting of leaf-boxes contained in $region(b)$.

Lemma 4.5.2 The set of boxes that we need to split in the construction of $T^*$ due to the refinement of a leaf-box $b$ is contained in $region(b) \cup pyramid(b)$.

Proof: When $b$ is refined, all neighbors of $b$ and the boxes that share only a vertex with $b$ may need to be refined due to the imbalance introduced by the refinement of $b$. According to our analysis above, we can use one of the two templates to refine each box in $region(b)$. Note that no matter how much deeper $b$ is refined, the boundary edges of $region(b)$ are split at most once (as shown in Figure 4.3(c)). Hence, by the definition of a balanced quadtree, every box $b_1$ not in $region(b)$ needs to be split at most once. Moreover, the boxes that need to be refined must be "coarser" than the refined neighbor. Hence, among the boxes not in $region(b)$, only the boxes in $pyramid(b)$ need to be refined because of the refinement of $b$. □

If $b_1$ is a leaf-box contained in $pyramid(b)$, then we need to split $b_1$ at most once. If $b_1$ is a leaf-box in $region(b)$, then $b_1$ can only be split geometrically away from the face shared
between $b_1$ and $b$ (see Figure 4.3(a)) or geometrically away from a corner shared by $b_1$ and $b$ (see Figure 4.3(b)). We call $b$ the pressure box of $b_1$. The depth of splitting is a function of $\rho(b)$ and $\rho(b_1)$. As shown in Figure 4.3(c), after the refinement, the impact of $b$ weakens geometrically with distance from $b$ within $region(b)$.

4.5.2 Size Estimation of Balanced Quadtree

We now consider how the refinement of one leaf-box may influence the splitting of its neighboring leaf-boxes. Suppose $b$ and $b_1$ are two leaf-boxes that are neighbors in $T$. Let $\sigma(b, b_1) = \max(\rho(b) - \rho(b_1), 0)$; we call it the edge pressure of $b$ on $b_1$. Let $\delta(b, b_1) = depth(b) - depth(b_1)$. Clearly, $-1 \le \delta(b, b_1) \le 1$ by the definition of a balanced quadtree. Let $\gamma(b, b_1) = 0$ if $\delta(b, b_1) \ge 0$, and $\gamma(b, b_1) = -1$ if $\delta(b, b_1) = -1$. There are three cases for the relation between box $b$ and its neighbor boxes, as shown in Figure 4.4.

Figure 4.4: The three scenarios for two edge-neighbor leaf-boxes. (a): $\delta(b, b_1) = 1$; (b): $\delta(b, b_1) = 0$; (c): $\delta(b, b_1) = -1$.

For convenience, leaf-boxes $b$ and $b_1$ of a quadtree are called edge neighbors if they intersect on an edge, and corner neighbors if they share only a vertex. Let $pressure(b, b_1)$ be the required splitting depth of leaf-box $b_1$ due to the imbalance caused by the refinement of leaf-box $b$. We can compute it as follows.

Lemma 4.5.3

$$pressure(b, b_1) = \begin{cases} \max(\sigma(b, b_1) - \delta(b, b_1) - 1,\ 0) & \text{if } b \text{ and } b_1 \text{ are edge neighbors}, \\ \max(\sigma(b, b_1) - \delta(b, b_1) - 2,\ 0) & \text{if } b \text{ and } b_1 \text{ are corner neighbors}. \end{cases}$$

Proof: The formula follows directly from the definition of a balanced quadtree by analyzing the three cases for edge neighbors (Figure 4.4) and the five cases for corner neighbors (Figure 4.5). □

The unit boxes of leaf-box $b$ are the smaller boxes obtained by uniformly splitting $b$ to depth $\rho(b)$, i.e., boxes whose side length is $2^{-\rho(b)}$ times that of $b$. There are three kinds of unit boxes. Corner-boxes are located at the four corners of $b$. Edge-boxes touch the boundary of $b$ along exactly one edge. All other unit boxes are center-boxes. Figure 4.6 shows an example of these unit boxes.

Let leaf-box $b_1$ be an edge neighbor of $b$, and assume that the refinement of $b_1$ causes an imbalance between $b$ and $b_1$. We can use the template of Figure 4.3(a) to split the edge-boxes of $b$ into smaller boxes of varying sizes to eliminate the imbalance. Note that $2^{\rho(b)} - 2$ edge-boxes need to be split along one edge of $b$. Let $a_k$ be the number of smaller boxes introduced in splitting an edge-box of $b$ to depth $k$ using the template shown in Figure 4.3(a). Then $a_k$ can be computed as follows.

Lemma 4.5.4 (Edge Boxes Splitting) The number of smaller boxes introduced in splitting an edge-box of $b$ is

$$a_k = 3 \cdot 2^k - 3, \tag{4.1}$$

where $k$ is the splitting pressure from the edge neighbor.

Proof: Clearly, $a_0 = 0$. To split one level deeper, we only need to split the boxes that touch the common edge of $b$ and $b_1$; there are $2^{k-1}$ such boxes. Splitting each of them once introduces four smaller boxes but removes their parent, a net gain of three boxes. Hence $a_k = a_{k-1} + 3 \cdot 2^{k-1}$, which implies $a_k = 3 \cdot 2^k - 3$. □

The refinement of one leaf-box may also influence the splitting of leaf-boxes that share a corner with it. Suppose $b$ and $b_1$ are two leaf-boxes that share a corner in $T$. There are five cases, as shown in Figure 4.5.

Let $b_2$ be a leaf-box that is a common neighbor of $b$ and $b_1$.
Let $length(b)$ denote the side length of $b$. We have $length(b) \le 2 \cdot length(b_2)$ and $length(b_2) \le 2 \cdot length(b_1)$,
according to the definition of a balanced $2^d$-tree. Hence, $length(b) \le 4 \cdot length(b_1)$. Similarly, $length(b_1) \le 4 \cdot length(b)$. Therefore, there are only five cases for $b$ and $b_1$ in a quadtree.

Figure 4.5: The five scenarios for two corner-neighbor leaf-boxes $b_1$ and $b$. (a): $\delta(b, b_1) = 2$; (b): $\delta(b, b_1) = 1$; (c): $\delta(b, b_1) = 0$; (d): $\delta(b, b_1) = -1$; (e): $\delta(b, b_1) = -2$.

We now consider the splitting of a corner-box of $b$ to eliminate the imbalance caused by the refinement of boxes in $region(b)$. Let $b_3$ be a corner neighbor of $b$, and let $b_0$ be the corner-box of $b$ that shares a vertex $v$ with $b$ and $b_3$, as shown in Figure 4.6. Let $s_k$ be the number of boxes introduced in $b_0$ if we split it to depth $k$ according to the pressure of $b_3$. The following lemma computes $s_k$.

Lemma 4.5.5 (Corner Boxes Splitting) The number of smaller boxes introduced by splitting $b_0$ is

$$s_k = 3 \cdot k, \tag{4.2}$$

where $k$ is the splitting pressure from $b_3$.

Proof: Clearly, $s_0 = 0$. At each level only the box at the corner is split again, so $s_k = s_{k-1} + 3$, which implies $s_k = 3k$. □

We now consider the case in which two edge neighbors of $b$ cause the corner-box to be split. Without loss of generality, let $b_1$ and $b_2$ be two neighbors of $b$ such that $b$, $b_1$, and $b_2$ intersect at the vertex $v$ of $b$, and let $b_0$ be the corner-box of $b$ that also contains $v$. Let $k_1 = pressure(b_1, b)$ and $k_2 = pressure(b_2, b)$, both computed from Lemma 4.5.3. Let $c(k_1, k_2)$ be the number of smaller boxes split in $b_0$ by pressure $k_1$ from $b_1$ and pressure $k_2$ from $b_2$; we do not consider the corner pressure in $c(k_1, k_2)$. The following lemma computes $c(k_1, k_2)$.

Figure 4.6: An example of the pressures on leaf-box $b$ (with splitting depth $\sigma(b) = 3$) from its edge neighbors $b_1$ and $b_2$ and its corner neighbor $b_3$, with examples of edge-boxes, corner-boxes (such as $b_0$), and center-boxes.

Lemma 4.5.6 (Shared Corner Boxes Splitting) The number of smaller boxes introduced in $b_0$ by pressure $k_1$ from $b_1$ and pressure $k_2$ from $b_2$ is
$$c(k_1, k_2) = 3\,(2^{k_1} + 2^{k_2}) - 3k_2 - 6, \qquad (4.3)$$
where we assume $k_1 \ge k_2$.

Proof: Clearly $c(k_1, k_2) = c(k_2, k_1)$. If $k_1 = k_2 = k$, then $c(0, 0) = 0$, and going one level deeper splits the $2 \cdot 2^{k-1} - 1 = 2^k - 1$ boxes along the two outer edges of $b_0$, each a net gain of three boxes, so $c(k, k) = c(k-1, k-1) + 3(2^k - 1)$. Hence $c(k, k) = 3 \cdot 2^{k+1} - 3k - 6$.

If $k_1 \ne k_2$, then w.l.o.g. assume $k_1 > k_2$. The splitting of the corner-box can be viewed as two steps: (1) split it to depth $k_2$ in the directions of both $b_1$ and $b_2$; (2) split the much smaller boundary boxes generated in step (1) to depth $k_1 - k_2$ along the common edge of $b$ and $b_1$. There are $2^{k_2}$ such boundary boxes to be split in step (2), and by Lemma 4.5.4 each contributes $a_{k_1 - k_2}$ boxes. The total number of smaller boxes is therefore
$$c(k_1, k_2) = c(k_2, k_2) + 2^{k_2} a_{k_1 - k_2} = 3 \cdot 2^{k_2+1} - 3k_2 - 6 + 3 \cdot 2^{k_1} - 3 \cdot 2^{k_2} = 3\,(2^{k_1} + 2^{k_2}) - 3k_2 - 6.$$
□

We now count the corner pressure from the corner neighbor $b_3$ together with the edge pressures from the two edge neighbors $b_1$ and $b_2$, as shown in Figure 4.6. Let $k_1$ and $k_2$ be the edge pressures from $b_1$ and $b_2$, respectively, and let $k_3$ be the corner pressure from the corner neighbor $b_3$ of $b$. Let $v$ be the common vertex of $b$, $b_1$, $b_2$, and $b_3$, and let $g(k_1, k_2, k_3)$ be the number of much smaller boxes introduced in the corner-box $b_0$ of $b$ due to the refinement of $b_1$, $b_2$, and $b_3$. From the lemmas above, we obtain the following lemma.

Lemma 4.5.7 (Both Edge Boxes and Corner Boxes Splitting) The number of small boxes introduced in $b_0$ by the corner pressure from the corner neighbor $b_3$ and the edge pressures from the two edge neighbors $b_1$ and $b_2$ is
$$g(k_1, k_2, k_3) = \begin{cases} c(k_1, k_2) + s_{k_3 - k_1} & \text{if } k_3 > k_1, \\ c(k_1, k_2) & \text{otherwise,} \end{cases} \qquad (4.4)$$
where $k_1 \ge k_2$ are the edge pressures from the two edge neighbors and $k_3$ is the corner pressure from the corner neighbor.

Proof: We have $g(k_1, k_2, k_3) = g(k_2, k_1, k_3)$, so w.l.o.g. assume $k_1 \ge k_2$. If $k_3 \le k_1$, the splitting required by the pressure from $b_1$ already satisfies the pressure from $b_3$, so $g(k_1, k_2, k_3) = c(k_1, k_2)$. If $k_3 > k_1$, we must split the smallest box at the corner of $b_0$ again to depth $k_3 - k_1$, using the split pattern of Figure 4.3(b); by Lemma 4.5.5 this step introduces $s_{k_3 - k_1}$ much smaller boxes. Hence the total number of boxes introduced by the pressures of $b_1$, $b_2$, and $b_3$ (Figure 4.6) is
$$g(k_1, k_2, k_3) = c(k_1, k_2) + s_{k_3 - k_1} = 3\,(2^{k_1} + 2^{k_2}) - 3k_2 - 6 + 3\,(k_3 - k_1).$$
□

As shown in Figure 4.6, the number of smaller boxes split at a corner has three parts: one from the pressure of leaf-box $b_1$, one from $b_2$, and one from $b_3$.

Let $F(T, \sigma)$ be the set of leaf-boxes $b$ for which there exists a leaf-box $b_1$ with $\sigma(b_1) > 0$ and $b \in \mathrm{pyramid}(b_1)$, but no leaf-box $b_2$ with $\sigma(b_2) > 0$ and $b \in \mathrm{region}(b_2)$. In other words, $F(T, \sigma)$ is the set of boxes that must be split once only because they lie in the pyramid of some refined box. Then
$$F(T, \sigma) = \bigcup_{b:\, \sigma(b) > 0} \mathrm{pyramid}(b) \; - \bigcup_{b:\, \sigma(b) > 0} \mathrm{region}(b).$$
Let $f(T, \sigma)$ be the cardinality of $F(T, \sigma)$. By computing $\mathrm{pyramid}(b_1)$ for every $b_1$ with $\sigma(b_1) > 0$, we can compute $f(T, \sigma)$ in $O(\mathrm{size}(T)^2)$ time.

For a leaf-box $b$, let $V(b) = \{v_1, v_2, v_3, v_4\}$ be the vertex set of $b$, and let $\mathrm{Intr}(v_i)$ be the number of smaller boxes introduced in the corner-box $b_0$ that shares $v_i$ with $b$.
$\mathrm{Intr}(v_i)$ can be computed in constant time using the analysis above. Let $E(b) = \{e_1, e_2, e_3, e_4\}$ be the edge set of $b$, and let $\mathrm{Intr}(e_i)$ be the number of smaller boxes introduced in the edge-boxes that touch $b$ along $e_i$. $\mathrm{Intr}(e_i)$ can also be computed in constant time. We then have the following theorem for $\mathrm{size}(T^*)$.

Theorem 4.5.8 Suppose $T$ is a balanced quadtree and $\sigma$ is a list of non-negative integers for its leaf-boxes. The size of the quadtree $T^*$ obtained by refining and balancing $T$ is
$$\mathrm{size}(T^*) = \sum_{b \in T} \Big( \sum_{v_i \in V(b)} \mathrm{Intr}(v_i) + \sum_{e_i \in E(b)} \mathrm{Intr}(e_i) + 2^{2\sigma(b)} \Big) + 4 f(T, \sigma),$$
where $\mathrm{Intr}(v_i) = g(k_1, k_2, k_3)$ and $\mathrm{Intr}(e_i) = (2^{\sigma(b)} - 2)\, a_k$. Here $k_1$ and $k_2$ are the edge pressures from the two edge neighbors $b_1$ and $b_2$, $k_3$ is the corner pressure from the corner neighbor $b_3$ of $b$, and $v_i$ is the common vertex of $b$, $b_1$, $b_2$, and $b_3$; $k$ is the edge pressure from the edge neighbor $b_4$, and $e_i$ is the common edge of $b$ and $b_4$.

Proof: The leaf-boxes of $T^*$ come from three sources. The first is the refinement of each box $b$ itself, which creates $2^{2\sigma(b)}$ small boxes. The second is the re-refinement caused by the neighbors in $\mathrm{region}(b)$: an edge neighbor sharing $e_i$ with $b$ introduces $\mathrm{Intr}(e_i)$ boxes, the neighbors meeting at $v_i$ introduce $\mathrm{Intr}(v_i)$ boxes, and these counts do not overlap when summed over all $b$. The last is the re-refinement of the boxes in $\mathrm{pyramid}(b)$ that are not otherwise split: each box of $F(T, \sigma)$ is split once into 4 smaller boxes, which gives the term $4 f(T, \sigma)$. Note that $b_1 \in \mathrm{region}(b)$ implies $b \in \mathrm{region}(b_1)$. □

Let $Z$ denote the set of all leaf-boxes $b$ of $T$ with $\sigma(b) = 0$ such that every box $b_1 \in \mathrm{region}(b)$ has $\sigma(b_1) = 0$. Since $b_1 \in \mathrm{region}(b)$ also implies $b \in \mathrm{region}(b_1)$, the boxes in $Z$ are not split by the refinement of their neighbors, and clearly $f(T, \sigma) \le |Z|$. In practice, $f(T, \sigma)$ is usually much smaller than
$$\sum_{b \in T} \Big( \sum_{v_i \in V(b)} \mathrm{Intr}(v_i) + \sum_{e_i \in E(b)} \mathrm{Intr}(e_i) + 2^{2\sigma(b)} \Big),$$
so we suggest the following approximation of $\mathrm{size}(T^*)$.

Lemma 4.5.9 The quantity
$$\mathrm{size}_{app}(T^*) = \sum_{b \in T} \Big( \sum_{v_i \in V(b)} \mathrm{Intr}(v_i) + \sum_{e_i \in E(b)} \mathrm{Intr}(e_i) + 2^{2\sigma(b)} \Big)$$
approximates $\mathrm{size}(T^*)$ well. If a fraction $\alpha$ of the boxes of $T$ belongs to $F(T, \sigma)$, then
$$\mathrm{size}_{app}(T^*) \ge \frac{1 - \alpha}{1 + 3\alpha}\, \mathrm{size}(T^*).$$

Proof: It follows from the discussion above that $\mathrm{size}_{app}(T^*) \le \mathrm{size}(T^*)$. Since at least a fraction $\alpha$ of the boxes of $T$ must be split once, $\mathrm{size}(T^*) \ge (1 + 3\alpha)\, \mathrm{size}(T)$. Hence
$$\mathrm{size}(T^*) = \mathrm{size}_{app}(T^*) + 4 f(T, \sigma) \le \mathrm{size}_{app}(T^*) + 4\alpha\, \mathrm{size}(T) \le \mathrm{size}_{app}(T^*) + \frac{4\alpha}{1 + 3\alpha}\, \mathrm{size}(T^*),$$
which implies $\mathrm{size}_{app}(T^*) \ge \frac{1 - \alpha}{1 + 3\alpha}\, \mathrm{size}(T^*)$. □

Our algorithm for estimating the size of $T^*$ runs in time linear in $\mathrm{size}(T)$: $\mathrm{region}(b)$ can be computed in constant time for each leaf-box $b$, and so can $\mathrm{Intr}(v_i)$ and $\mathrm{Intr}(e_i)$. We omit the contribution of splitting in $\mathrm{pyramid}(b)$ because (1) $F(T, \sigma)$ is very small (only the splitting of the boxes in $F(T, \sigma)$ is left out of the estimation formula of Lemma 4.5.9), and (2) computing the contribution of the boxes in $F(T, \sigma)$ is more expensive.

4.6 Sampling Boxes from $T^*$

Note that we have not yet generated the quadtree $T^*$. Using the size estimation of $T^*$, we can approximately sample a random leaf-box of $T^*$. For a leaf-box $b$ of $T$, let $k_e$, $k_s$, $k_w$, and $k_n$ be the pressures on $b$ from its four edge-neighbor leaf-boxes, and let $k_{se}$, $k_{sw}$, $k_{nw}$, and $k_{ne}$ be the pressures on $b$ from its four corner-neighbor leaf-boxes; any of these numbers may be 0. Using the lemmas of the size estimation, we can compute the number of boxes introduced by splitting in $b$ due to the refinement of its neighbor leaf-boxes. Let $c_1, \dots, c_4$ be the numbers of small boxes produced by splitting the four corner-boxes of $b$, let $c_5, \dots, c_8$ be the numbers of small boxes produced in its four groups of edge-boxes, and let $c_9$ be the number of center-boxes of $b$. The set of small boxes counted by $c_i$ is called block $i$. Let $(x, y)$ be the center point of $b$ and let $h$ be its side length.
Let $h' = h / 2^{\sigma(b)}$ be the side length of the unit boxes obtained by splitting $b$ to depth $\sigma(b)$.
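Before turning to the sampling itself, the block counts can be made concrete. The following is a minimal Python sketch, not the thesis implementation: it evaluates the closed forms of Lemmas 4.5.4 to 4.5.7, cross-checks two of them against the recurrences used in their proofs, and assembles $c_1, \dots, c_9$ as totals of small boxes per block (one unrefined unit box plus the boxes introduced in it). It assumes $\sigma(b) \ge 1$; the function names, the order of the nine blocks, and which two edge pressures meet at each corner (for example $k_n$ and $k_e$ at the northeast corner) are our own assumptions.

```python
def a(k: int) -> int:
    """Lemma 4.5.4: boxes introduced in one edge-box by edge pressure k."""
    return 3 * 2 ** k - 3


def s(k: int) -> int:
    """Lemma 4.5.5: boxes introduced in a corner-box by corner pressure k."""
    return 3 * k


def c(k1: int, k2: int) -> int:
    """Lemma 4.5.6: two edge pressures meeting at one corner-box."""
    k1, k2 = max(k1, k2), min(k1, k2)  # the lemma assumes k1 >= k2
    return 3 * (2 ** k1 + 2 ** k2) - 3 * k2 - 6


def g(k1: int, k2: int, k3: int) -> int:
    """Lemma 4.5.7: two edge pressures plus the corner pressure k3."""
    hi = max(k1, k2)
    return c(k1, k2) + (s(k3 - hi) if k3 > hi else 0)


# Recurrences from the proofs, used only to cross-check the closed forms.
def a_rec(k: int) -> int:
    return 0 if k == 0 else a_rec(k - 1) + 3 * 2 ** (k - 1)


def c_rec(k1: int, k2: int) -> int:
    k1, k2 = max(k1, k2), min(k1, k2)
    diag = sum(3 * (2 ** j - 1) for j in range(1, k2 + 1))  # c(k2, k2)
    return diag + 2 ** k2 * a_rec(k1 - k2)


def block_counts(sigma, kn, ke, ks, kw, kne, knw, ksw, kse):
    """Total small boxes per block: [NE, NW, SW, SE corners, N, E, S, W edges, center]."""
    if sigma < 1:
        raise ValueError("corner, edge and center boxes require sigma(b) >= 1")
    e = 2 ** sigma - 2  # unit boxes along one edge, excluding the two corners
    corners = [1 + g(kn, ke, kne), 1 + g(kn, kw, knw),
               1 + g(ks, kw, ksw), 1 + g(ks, ke, kse)]
    edges = [e * (1 + a(k)) for k in (kn, ke, ks, kw)]
    return corners + edges + [e * e]
```

For each leaf-box, the nine totals sum to $2^{2\sigma(b)} + \sum_i \mathrm{Intr}(v_i) + \sum_i \mathrm{Intr}(e_i)$, the per-box term of Theorem 4.5.8, so the same numbers can serve both the size estimate and the sampling step.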

Page 58: DYNAMIC LOAD BALANCING FOR PARALLEL ADAPTIVE MESH

To sample a leaf-box of $T^*$, we generate a uniformly random integer $r$ with $1 \le r \le \sum_{i=1}^{9} c_i$ and let $i$ be the index with $\sum_{j=1}^{i-1} c_j < r \le \sum_{j=1}^{i} c_j$; the value $r$ thus specifies the block $i$ in which the random small box lies.

If the selected block is the center block, i.e., $r > \sum_{j=1}^{8} c_j$, let $t = r - \sum_{j=1}^{8} c_j - 1$ and $e = 2^{\sigma(b)} - 2$, and let $m, n$ be the integers with $0 \le m, n \le e - 1$ and $t = m e + n$. In other words, the selected small box lies in row $m$ and column $n$ of the center block of $b$. The upper-left corner of the center block of $b$ is $(x - h/2 + h',\; y + h/2 - h')$, so the center point of the selected small box is $(x - h/2 + (n + 3/2) h',\; y + h/2 - (m + 3/2) h')$, and its side length is $h'$.

If the selected block is an edge block, i.e., $\sum_{j=1}^{4} c_j < r \le \sum_{j=1}^{8} c_j$, let $t = r - \sum_{i=1}^{j-1} c_i - 1$, where $5 \le j \le 8$ and $0 \le t < c_j$, and let $k_j$ be the corresponding pressure on $b$ from its neighbor leaf-box. We consider only the north edge block; the other three edge blocks are handled symmetrically. If $e \sum_{i=1}^{k_j} 2^i \le t < c_j$, the box lies in the top row along the north edge: let $n = t - e \sum_{i=1}^{k_j} 2^i$; the center point of the selected small box is $(x - h/2 + h' + (n + 1/2)\, h'/2^{k_j},\; y + h/2 - (1/2)\, h'/2^{k_j})$, and its side length is $h'/2^{k_j}$. If $t < e \sum_{i=1}^{k_j} 2^i$, write $t = e \sum_{i=1}^{m-1} 2^i + n$ with $1 \le m \le k_j$ and $0 \le n < e \cdot 2^m$. The center point of the selected small box is then
$$\Big(x - \frac{h}{2} + h' + \big(n + \tfrac{1}{2}\big) h' 2^{-m},\; y + \frac{h}{2} - h' + \sum_{i=1}^{m} 2^{-i} h' - 2^{-m-1} h'\Big) = \Big(x - \frac{h}{2} + h' + \big(n + \tfrac{1}{2}\big) h' 2^{-m},\; y + \frac{h}{2} - \frac{3h'}{2^{m+1}}\Big),$$
and its side length is $h'/2^m$.

If the selected block is a corner block, assume w.l.o.g. that the pressures on this corner-box are $k_1$, $k_2$, and $k_3$, where $k_1$ and $k_2$ are the pressures from the edge-neighbor boxes and $k_1 \ge k_2$. We consider the upper-right corner block $b_0$; the other three corner blocks can be sampled similarly.
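The block selection and the center- and north-edge-block formulas above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis code: it uses the 0-based offset $t$, assumes $\sigma(b) \ge 2$ whenever a center or edge block is sampled (for $\sigma(b) = 1$ those blocks are empty), and its function names are hypothetical. Corner blocks, which rely on the label construction that follows, are omitted.

```python
import random


def pick_block(counts, r):
    """Return (i, t): the 1-based block i with prefix(i-1) < r <= prefix(i),
    and the 0-based offset t = r - prefix(i-1) - 1 inside that block."""
    before = 0
    for i, c in enumerate(counts, start=1):
        if r <= before + c:
            return i, r - before - 1
        before += c
    raise ValueError("r exceeds the total number of boxes")


def sample_block(counts, rng=random):
    """Draw r uniformly from 1..sum(counts) and locate its block."""
    return pick_block(counts, rng.randint(1, sum(counts)))


def center_box(x, y, h, sigma, t):
    """Center block: t = m*e + n gives row m, column n; returns (cx, cy, side)."""
    hp = h / 2 ** sigma  # h', side of a unit box
    e = 2 ** sigma - 2   # center boxes per row (requires sigma >= 2)
    m, n = divmod(t, e)
    return x - h / 2 + (n + 1.5) * hp, y + h / 2 - (m + 1.5) * hp, hp


def north_edge_box(x, y, h, sigma, k, t):
    """North edge block under edge pressure k; returns (cx, cy, side)."""
    hp = h / 2 ** sigma
    e = 2 ** sigma - 2
    lower = e * (2 ** (k + 1) - 2)  # boxes in the rows of side h'/2 .. h'/2^k
    if t >= lower:                  # top row along the north edge of b
        side = hp / 2 ** k
        n = t - lower
        return x - h / 2 + hp + (n + 0.5) * side, y + h / 2 - 0.5 * side, side
    m = 1                           # find the row m whose boxes have side h'/2^m
    while t >= e * 2 ** m:
        t -= e * 2 ** m
        m += 1
    side = hp / 2 ** m
    return x - h / 2 + hp + (t + 0.5) * side, y + h / 2 - 3 * hp / 2 ** (m + 1), side
```

A leaf-box of $T^*$ would then be drawn by calling `sample_block` on $c_1, \dots, c_9$ and dispatching on the returned block index. Enumerating every $t$ of a block with these functions should tile that block exactly, which is a convenient way to check the index arithmetic.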
Imagine that we split $b_0$ as follows. First, split $b_0$ into four equal boxes and label each by its position $(i, j)$ relative to the center of $b_0$: $i = 1$ if it lies in the right half and $i = -1$ if it lies in the left half; $j = 1$ if it lies in the upper half and $j = -1$ if it lies in the lower half. The four boxes thus receive the labels $(1, 1)$, $(1, -1)$, $(-1, 1)$, and $(-1, -1)$. In each subsequent split we only split boxes along the outer edges of $b_0$, and every new box inherits the label sequence of its parent box. After $b_0$ is balanced, every split small box carries a label sequence $(a_1, b_1)(a_2, b_2) \cdots (a_s, b_s)$. The labels of the split boxes satisfy the following lemma.

Lemma 4.6.1 The label $(a_1, b_1)(a_2, b_2) \cdots (a_s, b_s)$ of each leaf box split in the corner-box $b_0$ according to the pressures $k_1$, $k_2$, and $k_3$ satisfies at least one of the following: (1) $b_i = 1$ for $1 \le i < s$, and $s < k_1$; (2) $a_i = 1$ for $1 \le i < s$, and $s < k_2$; (3) $a_i = 1$ and $b_i = 1$ for $1 \le i < s$, and $s < k_3$.

Proof: We only split the boxes along the outer edges of $b_0$, that is, only the rightmost or uppermost boxes. Hence the label of every leaf-box satisfies the lemma. □

We call a label feasible if it satisfies at least one of the three conditions above. To sample a split box in $b_0$ uniformly at random, it suffices to generate a feasible label $(a_1, b_1)(a_2, b_2) \cdots (a_s, b_s)$ uniformly at random. The center point of the corresponding box is $\big(x_0 + \sum_{i=1}^{s} a_i 2^{-(i+1)} h',\; y_0 + \sum_{i=1}^{s} b_i 2^{-(i+1)} h'\big)$, where $(x_0, y_0)$ is the center point of $b_0$ and $h' = h \cdot 2^{-\sigma(b)}$ is the side length of $b_0$; here $x_0 = x + h' (2^{\sigma(b) - 1} - 1/2)$ and $y_0 = y + h' (2^{\sigma(b) - 1} - 1/2)$. The side length of the sampled small box is $h'/2^s$.

4.7 Subdomain Partitioning

We first review the basic concepts of graph partitioning. Suppose we have a weighted graph $G = (V, E, w)$, where $V$ is the set of vertices, $E$ is the set of edges, and $w$ assigns a positive weight to each vertex and each edge. A $k$-way partition of $G$ is a division of its vertices into $k$ subsets $V_1, \dots, V_k$. The set of edges whose endpoints lie in two different subsets is called the edge-separator of the partition. The goal of graph partitioning is to find a $k$-way partition such that (1) the subsets $V_i$ have approximately equal total weight, and (2) the separator is small. Several software packages are available for graph partitioning [21, 24]. However, most of these algorithms require the full combinatorial description of the input graph.

To partition each subdomain according to its size and element distribution in $T^*$, we do not have its final combinatorial structure available before the refinement is actually performed. What we do have is a geometric approximation of its size and element distribution.
Fortunately, the geometric information is sufficient for us to use the geometric partitioning algorithm of Miller-Teng-Thurston-Vavasis [38, 39].

Recall that the original $k$-way partition of $T$ defines a $k$-way partition $(S'_1, \dots, S'_k)$ of $T'$ and a $k$-way partition $(S^*_1, \dots, S^*_k)$ of $T^*$. However, these partitions may no longer be balanced. Our approach is to use the estimated element distribution of $T^*$ to implicitly divide each subdomain of $(S^*_1, \dots, S^*_k)$ into subsubdomains of approximately equal size. The subsubdomain decomposition is described explicitly using $T$ and its initial partition $S_1, \dots, S_k$, and the subsubdomains are the units of the final partition.

In particular, we use the size estimation algorithm of Section 4.5 to estimate the size and element distribution of each leaf-box in $T$. This estimation allows us to sample a random leaf-box of $T^*$ within each leaf-box of $T$, and thereby to obtain a sample of random leaf-boxes of $S^*_i$. We then apply the geometric mesh partitioning algorithm to this sample to obtain a proper multiway partition of $S^*_i$. This multiway partition is described by a partition tree of separating spheres, so we can use the same set of separating spheres to build a multiway partition $(S_{i,1}, \dots, S_{i,L_i})$ of $S_i$. Details of the geometric mesh partitioning algorithm using samples can be found in [38] or Chapter 3.
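To illustrate how partitioning a sample can be carried back to $T$, the sketch below substitutes recursive coordinate bisection, a simpler geometric method, for the separating spheres of [38]; it is not the Miller-Teng-Thurston-Vavasis algorithm. Because the samples are drawn uniformly from the leaf-boxes of $T^*$, cutting them into groups of equal count approximates equal estimated size in $T^*$. The cuts are kept as a tree so that any point can be routed to a part; routing a leaf-box of $T$ by its center point is our assumption, and all names are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# A cut tree is None (a finished part) or (axis, cut, lower_subtree, upper_subtree).
CutTree = Optional[tuple]


def rcb(points: List[Point], levels: int, axis: int = 0) -> CutTree:
    """Bisect the sample at the median of alternating coordinates, `levels` times.

    Assumes distinct coordinates so that every median cut separates the halves.
    """
    if levels == 0:
        return None
    if len(points) < 2 ** levels:
        raise ValueError("need at least 2**levels sample points")
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    cut = (pts[mid - 1][axis] + pts[mid][axis]) / 2  # halfway between the middle samples
    return (axis, cut,
            rcb(pts[:mid], levels - 1, 1 - axis),
            rcb(pts[mid:], levels - 1, 1 - axis))


def part_of(tree: CutTree, p: Point) -> int:
    """Route a point (a sample, or e.g. a leaf-box center of T) to its part number."""
    part = 0
    while tree is not None:
        axis, cut, lower, upper = tree
        bit = 1 if p[axis] > cut else 0
        part = 2 * part + bit
        tree = upper if bit else lower
    return part
```

With `levels` rounds, the parts are numbered $0$ to $2^{\text{levels}} - 1$, and each receives an equal share of the sample when the sample size is divisible by $2^{\text{levels}}$.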

Figure 4.7: A typical example of a subdomain partition for $T$ over processors $P_1$ through $P_4$.

After estimating the size of each subdomain, we use the sampling technique to select leaf-boxes of $T^*$ uniformly at random. We can then use the sphere-based technique to partition the sampled leaf-boxes; the partition of the sampled leaf-boxes of $T^*$ induces a partition of the subdomains of $T$. Figure 4.7 gives an example of the partition of each subdomain according to the spacing function generated from the error estimation of the last solution.

4.8 Subdomain Redistribution

After we have divided each subdomain of $T$ into a collection of subsubdomains, we need to redistribute them to the proper processors to balance the load and minimize communication. We introduce a subdomain graph $SG$ to model the redistribution of these subsubdomains. This graph is a weighted graph whose node set has two parts. The first part has one node for each subsubdomain that we have generated; we refer to these as subdomain nodes. The weight of each subdomain node equals the estimated size of the subsubdomain in $T^*$. The second part has one node for each processor; we call these processor nodes and discuss their weight below.

Two subdomain nodes are connected in $SG$ if their subsubdomains are directly connected by boundary boxes. The weight of the edge between them equals the number of shared boundary leaf-boxes times a scalar determined by the communication cost of solving the numerical system on the parallel computer.

Each processor node is connected in $SG$ to the subdomain nodes of all of its subsubdomains. The weight on the edge between a processor node and a subdomain node is the cost of moving the subsubdomain to any other processor.

We now return to the weight of a processor node. Let $W$ be the total weight of all subdomain nodes in $SG$, and let $w = (1/2 + \epsilon)\, W/k$ for a predefined positive constant $\epsilon$. The constant $\epsilon$ is also a function of the constant $\delta$ used in the repartition method of Section 3, for example $\epsilon = \delta/2$. The weight of the processor nodes is chosen to ensure that, in the subsequent partition of $SG$, no two processor nodes are assigned to the same subset; this is why we choose a weight larger than $0.5\, W/k$. However, if the weight is too large, it might disturb the balance of the final partition produced by some static partitioning algorithms. An example of a subdomain partition and its subdomain graph is given in Figure 4.8.
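Here is a minimal sketch of assembling $SG$ from the quantities just described, assuming the subsubdomains come with their current processor, estimated size in $T^*$, and shared-boundary counts. The scalar `comm_scale` and the callback `move_cost` are hypothetical stand-ins: the text says only that the former is set by the communication cost of the numerical solve and that the latter is the cost of moving a subsubdomain. The result is a plain node/edge weight map that would still have to be converted to the input format of whichever static partitioner is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class SubdomainGraph:
    node_weight: Dict[str, float] = field(default_factory=dict)
    edge_weight: Dict[Tuple[str, str], float] = field(default_factory=dict)


def build_subdomain_graph(
    sizes: Dict[str, float],             # subsubdomain -> estimated size in T*
    owner: Dict[str, int],               # subsubdomain -> current processor (1..k)
    shared: Dict[Tuple[str, str], int],  # pair -> number of shared boundary leaf-boxes
    k: int,
    eps: float,
    comm_scale: float,                   # hypothetical communication scalar
    move_cost: Callable[[str], float],   # hypothetical cost of moving a subsubdomain
) -> SubdomainGraph:
    sg = SubdomainGraph()
    # Subdomain nodes weigh their estimated size in T*.
    sg.node_weight.update(sizes)
    # Processor nodes weigh w = (1/2 + eps) * W / k.
    total = sum(sizes.values())
    for p in range(1, k + 1):
        sg.node_weight[f"P{p}"] = (0.5 + eps) * total / k
    # Subdomain-subdomain edges: shared boundary leaf-boxes times the scalar.
    for pair, count in shared.items():
        sg.edge_weight[pair] = comm_scale * count
    # Processor-subdomain edges: cost of moving the subsubdomain away.
    for sub, proc in owner.items():
        sg.edge_weight[(f"P{proc}", sub)] = move_cost(sub)
    return sg
```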

Figure 4.8: Constructing the subdomain graph from a subdomain partition. (a) A subdomain partition over processors P1 through P4. (b) The resulting subdomain graph, with processor nodes P1 through P4 and subdomain nodes S11, S12, S21 through S25, S31, S32, and S41 through S46.

In Figure 4.8, we use the following notation. Node $P_i$ denotes processor $i$. Node $S_{ij}$ denotes the $j$th subsubdomain, generated by the subdomain partition algorithm, of the subdomain associated with processor $i$. Note that the subdomain graph is very small: it has only $O(k)$ nodes, so the cost of partitioning it is small as well.

We can use any static graph partitioning algorithm, such as those provided in Chaco and Metis, to divide the nodes of $SG$ into $k$ subsets of roughly equal total weight. Because of the weight we assigned to the processor nodes, each subset of the $k$-way partition contains exactly one processor node. Hence this partition defines a redistribution map of subsubdomains among the processors of the parallel computer: a subsubdomain is moved to the processor whose processor node lies in the same subset. An edge between a processor node and one of its subdomain nodes is cut exactly when that subsubdomain is moved, so the weights on the edges between processor nodes and subdomain nodes faithfully account for the migration cost in the partition.

Figure 4.9 gives an example of a quadtree $T$ and its redistribution over the four processors. After the redistribution, each processor refines and balances its new subdomain and then solves its part of the numerical system for the next stage.
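Given any $k$-way partition of $SG$ (for example, one returned by Chaco or Metis), the redistribution map and the resulting migration cost can be read off as in the following minimal sketch. The dictionary-based inputs and function names are hypothetical; the check that each subset holds exactly one processor node mirrors the property that the processor-node weights are meant to guarantee.

```python
from typing import Callable, Dict


def redistribution_map(part_of: Dict[str, int], owner: Dict[str, int], k: int) -> Dict[str, int]:
    """Map each subsubdomain to the processor whose node shares its subset.

    part_of: subset index of every SG node; processor nodes are named "P1".."Pk".
    owner:   current processor of each subsubdomain.
    """
    proc_of_part: Dict[int, int] = {}
    for p in range(1, k + 1):
        part = part_of[f"P{p}"]
        if part in proc_of_part:  # the processor-node weights are meant to prevent this
            raise ValueError(f"subset {part} contains two processor nodes")
        proc_of_part[part] = p
    return {sub: proc_of_part[part_of[sub]] for sub in owner}


def migration_cost(dest: Dict[str, int], owner: Dict[str, int],
                   move_cost: Callable[[str], float]) -> float:
    """Total weight of cut processor-subdomain edges, i.e. of moved subsubdomains."""
    return sum(move_cost(sub) for sub in owner if dest[sub] != owner[sub])
```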

Figure 4.9: An example of subdomain redistribution and refinement of the subdomains over processors P1 through P4.

4.9 Remeshing Unstructured Meshes

Our parallel adaptive $2^d$-tree refinement algorithm can be extended to general unstructured meshes. In this section, we outline our approach.

It follows from a series of works by Bern-Eppstein-Gilbert [4], Mitchell-Vavasis [40], and Miller-Talmor-Teng-Walkington [35] that, given a well-shaped mesh $M$ in $\mathbb{R}^d$, there is a balanced $2^d$-tree $T_M$ that approximates $M$. In particular, there are three positive constants $c > 1$, $\beta_1 < 1$, and $\beta_2 > 1$ such that:
• For each element $e$ in $M$, the number of leaf-boxes of $T_M$ that intersect $e$ is at most $c$.
• For each leaf-box $b$ in $T_M$, the number of elements of $M$ that intersect $b$ is at most $c$.
• If a leaf-box $b$ of $T_M$ intersects $e$, then $\beta_1\, \mathrm{area}(e) \le \mathrm{area}(b) \le \beta_2\, \mathrm{area}(e)$.

Given a well-shaped mesh $M$, we can construct $T_M$ in time linear in the size of $M$. Moreover, this computation can be sped up optimally on multiple processors. We can then use the following strategy to design a parallel adaptive refinement algorithm for unstructured meshes.

Parallel Refinement Method
Input: (1) a well-shaped mesh $M$ mapped onto $k$ processors according to a $k$-way partition $M_1, \dots, M_k$, and (2) a spacing function $f$ defining the new spacing at each vertex of $M$.
1. In parallel, generate $T_M$.
2. Project the $k$-way partition $M_1, \dots, M_k$ onto $T_M$ to obtain a $k$-way partition of $T_M$.
3. For each vertex $v \in M$, compute the ratio $r_v$ of the current spacing at $v$ to $f(v)$.
4. For each box $b \in T_M$, let $\sigma(b)$ be the logarithm of the average ratio of all vertices of $M$ that lie inside $b$.
5. Apply our $2^d$-tree load balancing algorithm to compute a $k$-way partition of $T_M$, and project it back onto $M$ to obtain a new $k$-way partition of $M$. This $k$-way partition will be balanced for $M^*$, the refined mesh of $M$.
6. Permute $M$ according to this new partition; each processor applies a sequential mesh refinement algorithm to its own submesh, and the processors collaboratively refine the boundary elements between submeshes.
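Steps 3 and 4 can be sketched as follows. The text does not fix the base of the logarithm or how a non-integer value becomes the non-negative integer $\sigma(b)$; this minimal sketch assumes base 2 (each quadtree split halves the box side), rounds up, clamps at 0, and treats boxes containing no vertex as unrefined. The function and argument names are hypothetical.

```python
import math
from typing import Dict, Iterable


def vertex_ratios(current: Dict[int, float], target: Dict[int, float]) -> Dict[int, float]:
    """Step 3: r_v = current spacing at v divided by the new spacing f(v)."""
    return {v: current[v] / target[v] for v in current}


def refinement_depth(box_vertices: Iterable[int], ratios: Dict[int, float]) -> int:
    """Step 4: sigma(b) from the average ratio of the vertices of M inside box b."""
    vs = list(box_vertices)
    if not vs:
        return 0  # assumption: boxes containing no vertex are not refined
    avg = sum(ratios[v] for v in vs) / len(vs)
    return max(0, math.ceil(math.log2(avg)))  # assumption: base 2, rounded up
```

For example, a box whose vertices currently have twice the requested spacing on average gets $\sigma(b) = 1$, that is, one uniform split.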

Chapter 5
Conclusions

In this thesis, we present a dynamic load balancing algorithm for parallel adaptive mesh refinement. The main objective of this research is to develop effective algorithms that are simple enough to implement. We focus on reducing dynamic load balancing to static partitioning in a black-box fashion and on reducing parallel mesh refinement to a collection of traditional sequential mesh refinements. We show how the estimation of the size and element distribution of a refined mesh can be used for this objective.

There are several directions in which the method presented in this thesis can be extended and improved.
• In our abstract model for adaptive mesh refinement, we assume that each leaf-box is split uniformly in $T'$. In practice, we may need to split each leaf-box according to a given pattern. Although uniform splitting can be used to over-estimate the effect of refinement, it is desirable to have a more accurate method for estimating the refined mesh.
• The scheme developed in this thesis for unstructured mesh refinement first builds a balanced $2^d$-tree to approximate the unstructured mesh. This can be cumbersome: we have to spend time building or maintaining the $2^d$-tree, and we need extra space to store the approximation. It is desirable to have a more direct method to estimate the size and element distribution of unstructured meshes.

• In our current model for adaptive refinement, we assume that the mesh is made finer in every region. For certain applications such as flow simulation, the densely decomposed subregions move around as a function of time, so some regions will be "de-refined", i.e., coarsened. We need to extend our adaptive refinement scheme to handle mixed adaptive refinement and coarsening.

Bibliography

[1] I. Babuška and A. K. Aziz. On the angle condition in the finite element method. SIAM J. Numer. Anal., 13(2):214–226, 1976.
[2] M. Bern and D. Eppstein. Polynomial-size non-obtuse triangulation of polygons. Int. J. Comp. Geom. Appl., 2:241–255, 1992.
[3] M. Bern and D. Eppstein. Quadrilateral meshing by circle packing. In 6th International Meshing Roundtable, pages 7–20, 1997.
[4] M. Bern, D. Eppstein, and J. R. Gilbert. Provably good mesh generation. In the 31st Annual Symposium on Foundations of Computer Science, IEEE, pages 231–241, 1990.
[5] M. Bern, S. Mitchell, and J. Ruppert. Linear-size nonobtuse triangulation of polygons. In Proc. of 10th Symp. on Computational Geometry, pages 221–230, New York, 1994.
[6] T. D. Blacker. Paving: a new approach to automated quadrilateral mesh generation. Int. Jour. for Numerical Methods in Eng., 32:811–847, 1991.
[7] T. F. Chan, P. Ciarlet Jr., and W. K. Szeto. On the near optimality of the recursive spectral bisection method for graph partitioning, Feb 1993.
[8] S. W. Cheng, T. K. Dey, H. Edelsbrunner, M. A. Facello, and S.-H. Teng. Silver exudation. In Proc. 15th ACM Symposium on Computational Geometry, 1999. To appear.
[9] L. P. Chew. Guaranteed-quality Delaunay meshing in 3D (short version). In 13th ACM Sym. on Comp. Geometry, pages 391–393, 1997.

[10] L. Paul Chew, N. Chrisochoides, and F. Sukup. Parallel constrained Delaunay meshing. In S. A. Canann and S. Saigal, editors, Trends in Unstructured Mesh Generation, pages 89–96, 1997.
[11] E. C. Chu, J. A. George, J. W. Liu, and E. G. Ng. User's guide for SPARSPAK-A: Waterloo sparse linear equations package. Technical Report CS8436, Computer Science, University of Waterloo, Ontario, Canada, 1984.
[12] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1998.
[13] E. F. D'Azevedo and R. B. Simpson. On optimal interpolation triangle incidences. In SIAM Journal on Scientific and Statistical Computing, volume 10, pages 1063–1075, 1995.
[14] H. Edelsbrunner, T. S. Tan, and R. Waupotitsch. A polynomial time algorithm for the minmax angle triangulation. In the 6th Symposium on Computational Geometry, ACM, pages 44–52, 1990.
[15] P. J. Frey and H. Borouchaki. Surface mesh evaluation. In 6th International Meshing Roundtable, pages 363–374, 1997.
[16] P. J. Frey, H. Borouchaki, and P. L. George. Delaunay tetrahedralization using an advancing-front approach. In 6th International Meshing Roundtable, 1997.
[17] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[18] J. R. Gilbert, G. L. Miller, and S. H. Teng. Geometric mesh partitioning: Implementation and experiments. Technical Report CSL-94-13, Xerox Palo Alto Research Center, 1994.
[19] J. R. Gilbert, G. L. Miller, and S. H. Teng. Geometric mesh partitioning: Implementation and experiments. To appear, 1998.
[20] M. T. Heath and P. Raghavan. A Cartesian parallel nested dissection algorithm. SIAM J. Matrix Anal. Appl., 1995.
[21] B. Hendrickson and R. Leland. The Chaco user's guide, version 1.0. Technical Report SAND93-2339, Sandia National Laboratories, Albuquerque, NM, 1993.

[22] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. Technical Report SAND93-1301, Sandia National Laboratories, Albuquerque, NM, 1993.
[23] G. Karypis and V. Kumar. Analysis of multilevel graph partitioning. Technical Report 95037, Computer Science Department, University of Minnesota, 1995.
[24] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report 95035, Computer Science Department, University of Minnesota, 1995.
[25] G. Karypis and V. Kumar. Parallel multilevel graph partitioning. Technical Report 95036, Computer Science Department, University of Minnesota, 1995.
[26] X. Y. Li and S. H. Teng. Dynamic load balancing for parallel adaptive mesh refinement. In 5th International Symposium on Solving Irregularly Structured Problems in Parallel, pages 144–155, Berkeley, 1998.
[27] X. Y. Li, S. H. Teng, and A. Üngör. Simultaneous refinement and coarsening: adaptive meshing with moving boundaries. In 7th International Meshing Roundtable, pages 201–210, Dearborn, Mich., 1998.
[28] X. Y. Li, S. H. Teng, and A. Üngör. Biting: advancing front meets sphere packing. In 2nd Symposium on Trends in Unstructured Mesh Generation, Boulder, Colorado, 1999.
[29] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. Appl. Math., 36:177–189, 1979.
[30] R. Löhner. Progress in grid generation via the advancing front technique. Engineering with Computers, 12:186–210, 1996.
[31] R. Löhner and P. Parikh. Three dimensional grid generation by the advancing-front method. Int. J. Numer. Meth. Fluids, 8:1135–1149, 1988.
[32] G. L. Miller, D. Talmor, and S.-H. Teng. Optimal good aspect ratio coarsening for unstructured meshes. In 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 538–547. ACM-SIAM, 1997.

[33] G. L. Miller, D. Talmor, and S. H. Teng. Data generation for geometric algorithms on non-uniform distributions. International Journal of Computational Geometry and Applications, 1998. Accepted and to appear.
[34] G. L. Miller, D. Talmor, and S.-H. Teng. Optimal coarsening of unstructured meshes. Journal of Algorithms, 1998. Invited and accepted to a special issue for SODA 97.
[35] G. L. Miller, D. Talmor, S.-H. Teng, and N. Walkington. A Delaunay based numerical method for three dimensions: generation, formulation, and partition. In Proc. 27th Annu. ACM Sympos. Theory Comput., pages 683–692, 1995.
[36] G. L. Miller, D. Talmor, S.-H. Teng, and N. Walkington. On the radius-edge condition in the control volume method. SIAM J. on Numerical Analysis, 1998. Accepted and to appear.
[37] G. L. Miller, D. Talmor, S.-H. Teng, N. Walkington, and H. Wang. Control volume meshes using sphere packing: generation, refinement, and coarsening. In 5th International Meshing Roundtable, pages 47–61. Sandia National Laboratories, 1996.
[38] G. L. Miller, S. H. Teng, W. Thurston, and S. A. Vavasis. Automatic mesh partitioning. In A. George, J. Gilbert, and J. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms, pages 57–84. Springer-Verlag, 1993. IMA Volumes in Mathematics and its Applications.
[39] G. L. Miller, S. H. Teng, W. Thurston, and S. A. Vavasis. Geometric separators for finite element meshes. SIAM J. Scientific Computing, 1998.
[40] S. A. Mitchell and S. A. Vavasis. Quality mesh generation in three dimensions. In ACM Symposium on Computational Geometry, pages 212–221, 1992.
[41] A. Pothen, H. D. Simon, and L. Wang. Spectral nested dissection. Technical report, Computer Science, Pennsylvania State University, University Park, PA, 1992.
[42] Alex Pothen. Graph partition algorithms with application to scientific computing, 1997.
[43] J. Ruppert. A new and simple algorithm for quality 2-dimensional mesh generation. In Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 83–92, 1992.

[44] J. R. Shewchuk. Tetrahedral mesh generation by Delaunay refinement. In 14th Annual ACM Symposium on Computational Geometry, pages 86–95, 1998.
[45] K. Shimada. Physically-based Mesh Generation: Automated Triangulation of Surfaces and Volumes via Bubble Packing. PhD thesis, MIT, Cambridge, 1993.
[46] K. Shimada and D. C. Gossard. Bubble mesh: automated triangular meshing of non-manifold geometry by sphere-packing. In Third Symp. on Solid Modeling and Appl., pages 409–419, 1995.
[47] K. Shimada, A. Yamada, and T. Itoh. Anisotropic triangular meshing of parametric surfaces via close packing of ellipsoidal bubbles. In 6th International Meshing Roundtable, pages 375–390, 1997.
[48] R. Sibson. Locally equiangular triangulations. In Computer Journal, volume 21, pages 243–245, 1978.
[49] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice-Hall, 1973.
[50] D. Talmor. Well-Spaced Points for Numerical Methods. PhD thesis, Carnegie Mellon University, 1997.
[51] S. H. Teng and C. W. Wong. Unstructured mesh generation: Theory, practice, and perspectives. International Journal of Computational Geometry and Applications, 1999.
[52] M. A. Yerry and M. S. Shephard. A modified quadtree approach to finite element mesh generation. In IEEE Computer Graphics and Applications, volume 3, pages 39–46, 1983.

Vita

Xiangyang Li was born on October 28, 1971 in Taixin City, Jiangsu Province, China. He placed first in the Jiangsu Province Junior Mathematics Competition in 1988 and 1989, and was then awarded the title of Outstanding Junior Science Researcher of the Yangzhou City Area, China, in 1990. He received his Bachelor of Engineering degree in Computer Science and Technology, with a minor in Economics (Enterprise Management), from Tsinghua University in Beijing in 1995. He spent one year as a graduate student in the Computer Science Department of Tsinghua University before moving to the University of Minnesota for another year of graduate study in Computer Science. In August 1997, Xiangyang Li transferred to the Computer Science Department of the University of Illinois at Urbana-Champaign to continue his graduate study. Throughout his stay at the University of Illinois, he was affiliated as a research assistant with the Center for Simulation of Advanced Rockets. The primary focus of his research, supervised by Professor Shang-Hua Teng, was mesh generation algorithms and their applications, in addition to algorithm design and analysis in areas such as cryptography and data compression.
