ParMA: Towards Massively Parallel Partitioning of Unstructured Meshes Cameron Smith, Min Zhou, and Mark S. Shephard Rensselaer Polytechnic Institute, USA

ParMA: Towards Massively Parallel Partitioning of Unstructured Meshes Cameron Smith, Min Zhou, and Mark S. Shephard Rensselaer Polytechnic Institute, USA SIAM PP12 Friday, February 17 MS51 Challenges in Massively Parallel Simulations using Unstructured Meshes - Part II of III Background Geometry-Based Analysis Distributed Mesh Representation Mesh Partitioning Partitioning using Mesh Adjacencies Partition Improvement Application Requirements Algorithm Strong Scaling Results Test Cases Closing Remarks and Ongoing Developments 2 Outline Geometry, Attribute: analysis domain Mesh: 0-3D topological entities and adjacencies Field: distribution of solution over mesh Common requirements: data traversal, arbitrarily attachable user data, data grouping, etc. Complete representation: store sufficient entities and adjacencies to get any adjacency in O(1) time critical for evolving meshes Geometry-Based Analysis Regions Faces Vertices Edges Geometric model Mesh Distributed Mesh Representation: FMDB Geometric model Partition model Distributed mesh Flexible Distributed Mesh Database A FASTMath iMesh/iMeshP Implementation i M 0 j M 1 1 P 0 P 2 P inter-process part boundary intra-process part boundary Proc j Proc i Web: Distributed Mesh Representation: Partition Model Partition Model: a conceptual model existing between a geometric model and distributed mesh Partition (model) entity: a topological entity in the partition model, P i d, representing a group of mesh entities of dimension d with the same residence parts. Remote copy: copy of an owned entity on an adjacent part Partition ModelPartition Model Entity Classification M 1 j Parallel simulation requires that the mesh be distributed with equal work-load and minimum inter-part communications Observations on graph-based dynamic balancing Parallel construction and balancing of graph with small cuts takes reasonable time Graph/hyper-graph partitions are powerful for unstructured meshes, however they use one type of mesh entity as the graph nodes, hence the balance of other mesh entities may not be optimal Accounting for multiple criteria and or multiple interactions is not obvious Hypergraphs allows edges to connect more that two vertices has been used to help account for migration communication costs Schloegel and Karypis (2002) discuss an effective optimization method for three, or fewer, constraints 6 Mesh Partitioning Partitioning using Mesh Adjacencies Mesh adjacencies are a more complete representation then a standard partition graph All mesh entities can be considered (graph has to decide what defines graph nodes, information on the adjacencies that define the graph edges lost) Any adjacency obtained in O(1) time, as apposed to having to construct multiple graphs (assuming use of a complete mesh adjacency structure) Possible advantages Avoid graph construction (assuming you have needed adjacencies) Account for multiple entity types important for the solve process - typically the most computationally expensive step Easy to use with diffusive procedures Disadvantage Lack of well developed algorithms for parallel partitioning operations directly from mesh adjacencies 7 Partition Improvement Improve scaling of applications by reducing imbalances through exchange of mesh regions between neighboring parts Current algorithm focused on improved scalability of the solve by accounting for balance of multiple entity types Imbalance is limited to a small number of heavily loaded parts, referred to as spikes, which limit the scalability of applications Application defined priority list of entity types such that imbalance of high priority types is not increased when balancing lower priority types Similar approaches can be used to: Improve balance during mesh adaptation likely want extensions past diffusive methods Supporting Two-level partitioning heterogeneous resources 8 Example of C0, linear shape function finite elements Assembly sensitive to mesh element imbalances Solve sensitive to mesh vertex imbalances since vertices hold the dof dominant computation Heaviest loaded part dictates solver performance Element-based partitioning results in spikes of dofs Application Requirements element imbalance increased from 2.64% to 4.54% dof imbalance reduced from 14.7% to 4.92% 9 Algorithm Input: Priority list of mesh entity types to be balanced (region, face, edge, vertex) Partitioned mesh with complete representation and communication, computation and migration weights for each entity Algorithm: From high to low priority if separated by > (different groups) From low to high dimension entity types if separated by = (same group) Compute migration schedule (Collective) Select regions for migration (Embarrassingly Parallel) Migrate selected regions (Collective) Ex) Rgn>Face=Edge>Vtx is the users input Step 1: improve balance for mesh regions Step 2.1: improve balance for mesh edges Step 2.2: improve balance for mesh faces Step 3: improve balance for mesh vertices 10 Migration Schedule Coordination needed to migrate regions between parts without stepping on toes Migrate computational load to the correct part Apply Hu and Blakes diffusive solution algorithm to determine low migration cost migration schedule that balances computational load for a given mesh entity type. - Green parts are overweight by 10 - White parts are underweight by 10 - Yellow parts have average weights - The diffusive solution is noted on each edge Flow solution minimizing two-norm of data movement. [Fig. 23 from CRPC Parallel Computing Handbook] 11 Compute the weighted partition connectivity matrix (Laplacian Matrix) for input partitioned mesh Define the partition graph G(V,E) where v i is the i th part in V and e ij is an edge in E if parts v i and v j are adjacent Restrict migration between parts i and j if j is absolutely heavily loaded for entities of higher priority by imposing weight Compute the RHS vector; b(i) = l avg l i l i computational load for current entity type on part i l avg average computational load for current entity type Compute the diffusive solution; solve K = b Migration load from part i to j is k ij ( i j ) 12 Computing Migration Schedule ; where a > 1 [Pg. 23 of Y. Hu, R. Blake, D. Emerson. An optimal migration algorithm for dynamic load balancing] Transitional Part Boundary Satisfy the migration load with one migration by selecting a set of mesh regions on the part boundaries. Transitional part boundary defined as the set of regions with faces classified on the partition model face Update the transitional part boundary after each region selection Remove the selected region from the transitional part boundary Add the regions that are face adjacent to the removed region that are not marked for migration Update the selection heuristics of adjacent regions 13 Migration Set Growth T0T0 T N : N th traversal of transitional part boundary Source part in white (translucent) Destination part in blue Migration set in green Transitional part boundary in red T7T7 T 13 14 Apply KL/FM like greedy heuristic to measure the relative change, or gain, in communication cost if a given set of mesh regions is migrated Gain(region) = Initial(region) - Final(region) Initial(region) = sum of communication weights of entities on exterior of region set and classified on the partition model face Final(region) = sum of communication weights of entities on exterior of region set and classified on the partition model face after the region is migrated With uniform costs Spike and Ridge features are migrated Select sets that have large ratio of computational cost to migration cost high bang for the buck Region Selection Vertex bounded Spikes Edge Bounded Ridges 15 Region Balancing Migration load on source and destination parts are equivalent Vertex, Edge and Face Balancing Requires accounting for remote copies such that the migration load is satisfied on the source part and the destination part Track mesh entities in the migration set on the initial part boundary and the transitional part boundary 16 Satisfying Migration Load Migrating the red regions from part 1 to part 0 adds three new vertices, marked in green, to part 0. On part 1 two vertices, marked in white, are removed. Part 0 Part 1 17 Region Selection Algorithm Strong Scaling 1B Mesh up to 160k Cores AAA 1B elements: effective partitioning at extreme scale with and without ParMA (uniform weights, iterative migration using simple schedule) Full system Without ParMAwith ParMA ParMA (see graph) 18 Input: Vtx>Region Test Cases 133M region mesh on 16k parts Table 1: Users input Table 2:Balance of partitions Table 3: Time usage and iterations (tests on Jaguar Cray XT5 system) 19 Demonstrated partition improvement Many opportunities remain Preprocessing of Part Adjacencies Detecting disconnected parts Absorb small disconnected parts into adjacent parts then run balancing procedures Detecting part adjacencies with low surface area Remove adjacency when computing migration schedule without disconnecting partition detecting paths between parts Region Selection on Partition Model Edges Account for how the load and communication balance changes when entities are classified on partition model edges and selected for migration Supporting Predictive Load Balancing Testing the balancing procedures when computational loads represent mesh adaptation size field requests Supporting Two-level partitioning Finding efficient combination of procedures to rebalance to the compute nodes and then within each node 20 Closing Remarks and Ongoing Developments Thank You 21 For More Information Contact:

Documents

ParMA: Towards Massively Parallel Partitioning of Unstructured Meshes Cameron Smith, Min Zhou, and Mark S. Shephard Rensselaer Polytechnic Institute, USA