

Architecture Aware Resource Allocation for Structured Grid Applications: Flood Modelling Case

Vaibhav Saxena, Thomas George, Yogish Sabharwal
IBM Research

New Delhi, India
{vaibhavsaxena,thomasgeorge,ysabharwal}@in.ibm.com

Lucas Villa Real
IBM Research

Sao Paulo, Brazil
[email protected]

Abstract—Numerous problems in science and engineering involve discretizing the problem domain as a regular structured grid and make use of domain decomposition techniques to obtain solutions faster using high performance computing. However, load imbalance among the various processing nodes can cause severe degradation in application performance. This problem is exacerbated when the computational workload is non-uniform and the processing nodes have varying computational capabilities. In this paper, we present novel local search algorithms for regular partitioning of a structured mesh onto heterogeneous compute nodes in a distributed setting. The algorithms seek to assign larger workloads to processing nodes having higher computational capabilities while maintaining the regular structure of the mesh in order to achieve a better load balance.

We also propose a distributed memory (MPI) parallelization architecture that can be used to achieve a parallel implementation of scientific modeling software requiring structured grids on heterogeneous processing resources involving CPUs and GPUs. Our implementation can make use of the available CPU cores and multiple GPUs of the underlying platform simultaneously. Empirical evaluation on real-world flood modeling domains on a heterogeneous architecture comprising multicore CPUs and GPUs suggests that the proposed partitioning approach can provide a performance improvement of up to 8× over a naive uniform partitioning.

I. INTRODUCTION

Most scientific problems arising in engineering design and the study of natural phenomena, such as computational fluid dynamics, flood modeling, and weather forecasting, require the application of computational models and simulations on a physical domain that is discretized into a mesh at a fine spatio-temporal resolution. The large size of the resulting grids and the complex computations involved at each point in the grid make it imperative to use distributed memory parallel computing techniques that decompose the original problem domain into smaller sub-domains, which can be separately modeled by each processing unit. An important aspect in such domain-decomposition approaches is to merge the solutions of individual sub-domains so as to ensure consistency at the boundary points. This requires communication among processors responsible for neighboring sub-domains, which can be performed effectively, and also allows for a simpler software design, if the domains involved have regular geometries. Therefore, in a number of real-world applications, the computational models operate on regular structured 2D or 3D grids encompassing the original physical domains of interest, and these parent domains are further subdivided into regular shaped sub-domains (see Figure 2).

Fig. 1. Flood Model Input Domain bounding box partitioned into 2×2 sub-domains.

Flood modeling is one such exemplary application, where the watershed basins that need to be modeled are highly irregular but are represented as rectangular 2D grids corresponding to the bounding box (see Figure 1).

Existing work on domain-decomposition based high performance computing techniques primarily deals with a homogeneous set of processors and strives to achieve load balance among the different processors. In the case of structured grids with uniform workloads across the grid points, this problem can be addressed in a straightforward manner by assigning sub-domains of equal size to each processor. However, in a large number of applications, there is variable computational load across the points, either due to the dynamic nature of the phenomena being modeled, differences in the static attributes of the grid points, or some of the points not being relevant to the modeling. For example, in the case of flood modeling, the grid points that lie in the original irregular watershed basin contribute much more to the computational load than the points included solely to regularize the domain. Similarly, in the case of weather forecasting, heavily populated urban regions need to be modeled more accurately than sea regions and hence result in a higher computational load. To address such scenarios, [1] proposed a greedy heuristic approach for partitioning structured grids with variable but known workload, based on slicing one dimension at a time so as to equalize the workload in each slice.

In recent years, there has been increasing interest in an alternative computing paradigm involving heterogeneous processors that include a mix of GPUs and CPUs with varying computing power. These heterogeneous systems are easily accessible via Grid computing, HPC clusters, and cloud services. However, there has been limited research on parallel computing techniques that partition a domain across heterogeneous processors.


Fig. 2. Regular partitioning of a 2D structured grid.

Karypis et al. [2] proposed a graph partitioning approach for such a scenario that is applicable to both irregular and regular grids, but significant code changes are required to use the resulting partitioning, since the resulting partitions might not meet the application software's requirement for regular geometries. On the other hand, regularizing the partitioning by considering bounding boxes is likely to result in overlapping partitions with workloads disproportionate to the processing power.

A. Our Contributions

In this paper, we propose a parallel computing approach for applications modeling regular grids that can operate effectively on a heterogeneous set of processors and is especially tailored to applications that require the sub-domains to have a regular partitioning. To demonstrate the effectiveness of our methodology, we focus on flood modeling, an application area that is becoming increasingly important due to the changing global climate and the frequent incidence of flood disasters [3]. Our choice was motivated by the fact that most of the existing flood modeling software packages operate only on rectangular sub-domains. Furthermore, accurate and timely flood forecasts for different parts of the world require modeling at a fine-grained resolution with computational costs optimized to provide sufficient lead time with limited monetary expenses, which makes it a good test case for the computing scenario being considered.

We next highlight the main contributions of this paper.

• We design and present novel local search algorithms for regular partitioning of a structured mesh in a simulation or modeling application for allocation to processing nodes in a distributed setting. Our algorithms are designed for heterogeneous computation environments where the processing nodes have different computational capabilities; the algorithms seek to assign larger workloads to processing nodes having higher computation capabilities while maintaining the regular structure of the mesh in order to achieve a better load balance.

• We also propose a distributed memory (MPI) parallelization architecture that can be used to achieve a parallel implementation of scientific modeling software requiring regular shaped domains on heterogeneous processing resources involving CPUs and GPUs.

• We implemented the proposed MPI-based parallelization approach and partitioning techniques in IFM, which is a state-of-the-art overland flood model. More specifically, we implemented a version of IFM on the GPU using CUDA to make use of the high computational power available on GPUs. Our implementation can make use of both the available CPU cores and multiple GPUs of the underlying platform simultaneously.

• We simulated a heterogeneous set of processors using multi-threaded ranks and GPUs and performed experiments on three real-world domains to evaluate the efficacy of the proposed method. Our experiments demonstrate that the proposed approach can effectively capture the variations in processor workloads and processing speeds and provide an improvement of up to 8× over a 2D uniform partitioning of the grid.

The rest of the paper is organized as follows: Section II formally introduces the problem of regular partitioning of structured grids on heterogeneous processors, followed by the domain partitioning approach based on local search heuristics in Section III. Section IV discusses a software architecture for a hybrid MPI/OpenMP/GPU implementation of a structured grid application. Details of the flood modeling code are presented in Section V. Section VI discusses related work on partitioning strategies for heterogeneous architectures. Section VII describes the empirical evaluation of our partitioning schemes and GPU implementation on real world domains. Concluding remarks are presented in Section VIII.

II. PROBLEM DEFINITION

In this section, we formally state the problem of parallelizing a structured grid application across heterogeneous processor resources. We focus on two dimensional grids for ease of presentation, but the problem definition as well as the solution approach can be generalized to structured grids of higher dimensions.

Let C denote a heterogeneous processor configuration of p processing nodes ⟨c1, . . . , cp⟩ associated with processing speeds ⟨s1, s2, . . . , sp⟩. In practice, the processing speeds si might not be known exactly and have to be estimated. However, we assume that the slowest processor is known and is chosen to be c1.

Let G denote a structured grid of size xmax × ymax. A regular m × n partitioning P of the structured grid G is defined by m−1 cuts on the X-coordinate, x1, x2, . . . , xm−1, and n−1 cuts on the Y-coordinate, y1, y2, . . . , yn−1, such that 0 = x0 < x1 < x2 < . . . < xm−1 < xm = xmax and 0 = y0 < y1 < y2 < . . . < yn−1 < yn = ymax. For all i = 0, . . . , m−1 and j = 0, . . . , n−1, let bij denote the (i, j)th block of the partition, i.e., the rectangular region defined by the diagonally opposite pair of points (xi, yj) and (xi+1, yj+1). Let B denote the set of all the blocks in P, i.e., B = {bij | 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1}. For a partition P where the number of blocks is identical to the number of nodes in the processor configuration C, let τ : B → {1, . . . , p} be any mapping of the blocks in partitioning P to the processors in C, and let t(b, s) denote the computation time for processing block b on a processor of speed s. Then, ignoring the communication costs, the overall computation time for processing the grid G by dividing it into blocks based on P and assigning them to processors in C following τ, denoted by T(G, C, P, τ), is determined by the maximum of the block processing times, i.e.,

\[ T(G, C, P, \tau) = \max_{b \in B} \; t\bigl(b, s_{\tau(b)}\bigr). \]

Page 3: New Architecture Aware Resource Allocation for Structured Grid … · 2015. 7. 3. · hybrid MPI/OpenMP/GPU implementation of a structured grid application. Details of the flood

The parallelization problem essentially involves (a) designing a methodology to seamlessly process different blocks of the grid on different processors, and (b) finding the optimal choice of P and τ that minimizes the overall computation time for a given choice of structured grid G and processor configuration C, i.e.,

\[ \operatorname*{arg\,min}_{P,\,\tau} \; T(G, C, P, \tau). \]
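For concreteness, a minimal sketch of this representation is given below (Python, with hypothetical helper names of our own): a partitioning is stored as its interior cut positions, the blocks are enumerated from the cuts, and the objective is simply the maximum of the per-block times t(b, s_τ(b)).

```python
from itertools import product

def blocks(x_cuts, y_cuts, xmax, ymax):
    """Enumerate the rectangular blocks b_ij of a regular partitioning.

    x_cuts = [x_1, ..., x_{m-1}], y_cuts = [y_1, ..., y_{n-1}]; each block is
    returned as its diagonally opposite corners ((x_i, y_j), (x_{i+1}, y_{j+1})).
    """
    xs = [0] + list(x_cuts) + [xmax]
    ys = [0] + list(y_cuts) + [ymax]
    return [((xs[i], ys[j]), (xs[i + 1], ys[j + 1]))
            for i, j in product(range(len(xs) - 1), range(len(ys) - 1))]

def overall_time(block_times):
    """T(G, C, P, tau): the slowest block determines the overall time."""
    return max(block_times)
```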

III. GRID PARTITIONING APPROACH

In this section, we describe our approach for addressing the partitioning problem described in the previous section. Our solution methodology is based on a local search algorithm that starts with a good initial partitioning and, in each iteration, constructs a new grid partitioning using a local heuristic, estimates the simulation time corresponding to the new partitioning with the optimal processor assignment, and accepts it if it results in a lower estimate. Section III-A describes how we estimate the best simulation time for a given partitioning using empirical estimates of the workloads and processing speeds of the different processors. Section III-B provides details of the local search algorithm.

A. Estimating Simulation Time

Given a partitioning P of a structured grid G and a processor configuration C, the simulation time depends on the workload associated with each of the partitions as well as the assignment of partitions to the processors. To minimize the simulation time, the assignment should be chosen so that partitions with smaller workloads are assigned to processors with smaller processing speeds and partitions with larger workloads are assigned to those with higher processing speeds. In order to do so, we need to estimate the workload associated with each partition as well as the processing speed of each of the individual processors.

1) Estimating the workload (procedure est_work): For any grid partition, the workload depends on the computation associated with the cells in the grid. In the case of applications where the grid cells are all homogeneous, the workload is directly proportional to the number of cells in the block, i.e., (x′ − x) × (y′ − y) for a rectangular block defined by the diagonally opposite end points (x, y) and (x′, y′).

However, for a number of applications, the computational effort associated with a cell could vary based on other attributes, e.g., whether the grid cell lies in an urban populated region, a rural region, or the sea in the case of flood modeling. Assuming there are k different categories of cells with variable computation effort, the workload W(b) for a block b can be modeled as a linear function of the number of cells of the different types in the block. More precisely, let Nr(b) denote the number of cells of type r in block b. Then, the workload can be assumed to have the form

\[ W(b) = \sum_{r=1}^{k} \alpha_r N_r(b). \]

The weights αr in the above equation can be estimated from empirical data using statistical methods such as linear least squares regression. In particular, we collect the time taken to process multiple different grid blocks on the slowest

TABLE I. FLOOD MODELING: COMPUTATION TIMES FOR PARTITIONS WITH THE SAME NUMBER OF TOTAL CELLS BUT VARYING ACTIVE AND INACTIVE CELLS.

Total Cells   Active Cells   Inactive Cells   Computation Time
1641334       163            1641171          115.237
1641334       36545          1604789          134.471
1641334       307597         1333737          231.392
1641334       964903         676431           488.637

(or reference) processor in the configuration C. For each of the blocks, we also separately compute the number of cells of each type. We then fit a regression model with the processing time as the response or dependent variable and the number of cells of each type as the covariates or independent variables.

Example. In the case of flood modeling, as mentioned earlier, there are usually two types of cells: active cells that lie in the original watershed basin and inactive cells that are required just to regularize the domain. Since water flow computations are only required over the active cells and the inactive cells require minimal processing, the two types of cells contribute very differently to the overall workload of a block. Table I shows the computation times for blocks with varying numbers of active and inactive cells on the same processor, clearly indicating the differential effect. Therefore, a more accurate estimate of the workload is a weighted average of the number of active and inactive cells in the partition block, with a considerably higher weight associated with the active cells.
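As an illustration, the weights αr can be obtained with ordinary least squares. The sketch below uses the measurements from Table I (two cell types: active and inactive) and NumPy's least-squares solver; the procedure name est_work follows the paper, while the code itself is an illustrative sketch rather than the IFM implementation.

```python
import numpy as np

# Benchmark blocks from Table I: columns are [active cells, inactive cells];
# the response is the measured computation time on the reference processor.
cell_counts = np.array([[163,    1641171],
                        [36545,  1604789],
                        [307597, 1333737],
                        [964903, 676431]], dtype=float)
times = np.array([115.237, 134.471, 231.392, 488.637])

# Fit W(b) = alpha_active * N_active(b) + alpha_inactive * N_inactive(b)
alpha, *_ = np.linalg.lstsq(cell_counts, times, rcond=None)

def est_work(n_cells_by_type):
    """Workload estimate for a block, given its per-type cell counts."""
    return float(np.dot(alpha, n_cells_by_type))
```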

2) Estimating Processing Speed (procedure est_speed): Given a processor configuration C, it is relatively straightforward to identify the slowest processor, but typically hard to estimate the relative processing speeds of the different processors. Often, one might find that a processor with n cores does not necessarily give a speedup of n relative to a single core. The speedup also does not necessarily remain constant across different jobs. For the purpose of the current work, we assume that the speedup of a processor relative to the slowest one is more or less constant and estimate it from empirical timings. For the application of interest, we consider a set of multiple different grid sizes {g1, . . . , gh} and compute the processing times for each of the processors in the configuration C with each of the grids. Let til denote the processing time for the lth grid on the ith processor; then the processing speed of the ith processor can be computed as

\[ s_i = \frac{1}{h} \sum_{l=1}^{h} \frac{t_{1l}}{t_{il}}, \]

where the first processor is chosen to be the slowest one, so that s1 = 1.
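A minimal sketch of est_speed under these assumptions (NumPy; t[i, l] holds the measured time of benchmark grid g_l on processor c_i, with row 0 the slowest processor):

```python
import numpy as np

def est_speed(t):
    """Relative processing speeds from benchmark timings.

    t[i, l] is the time of grid g_l on processor c_i; row 0 must be the
    slowest (reference) processor, so the returned speeds satisfy speeds[0] == 1.
    """
    t = np.asarray(t, dtype=float)
    return (t[0] / t).mean(axis=1)   # s_i = (1/h) * sum_l t_{1l} / t_{il}
```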

3) Estimating the simulation time (procedure est_time): Given a partitioning P and a processor configuration C, we estimate the simulation time as follows. We first estimate the work for each partition block and the speed of each processor using the sub-procedures est_work and est_speed described above. Next, we sort the partition blocks in order of their work estimates and sort the processing nodes in order of their speeds. We then allocate the processing nodes to the partition blocks using the sorted orderings, allocating the processing node with a higher speed to a partition block with a larger work estimate.


We now estimate the time taken for simulating a partition block by dividing its work estimate by the speed of the processor allocated to it. The partition block that takes the maximum time determines the processing time for the full simulation of the partitioning P with processor configuration C.

More formally, let the partition blocks be denoted by B = {b1, b2, . . . , bp} and the processing nodes by c1, c2, . . . , cp. Let w1, w2, . . . , wp denote the work estimates for the blocks b1, b2, . . . , bp respectively. Further, let ⟨ℓ1, ℓ2, . . . , ℓp⟩ be a permutation of ⟨1, 2, . . . , p⟩ that yields the sorted order of the work estimates, i.e., wℓ1 ≥ wℓ2 ≥ . . . ≥ wℓp. Similarly, let ⟨q1, q2, . . . , qp⟩ be a permutation of ⟨1, 2, . . . , p⟩ that yields the sorted order of the processing node speeds, i.e., sq1 ≥ sq2 ≥ . . . ≥ sqp. We now allocate processing node cqi to partition block bℓi for all i = 1, 2, . . . , p. The time estimate ti for processor cqi is calculated as ti = wℓi/sqi. Finally, the simulation time is determined by max1≤i≤p{ti}.
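The est_time procedure then amounts to sorting and pairing, as in the sketch below (work_estimates and speeds would come from the est_work and est_speed sketches above):

```python
def est_time(work_estimates, speeds):
    """Estimated simulation time of a partitioning under the greedy assignment.

    work_estimates[i]: estimated work w_i of block b_i
    speeds[i]:         estimated relative speed s_i of processing node c_i
    """
    w_sorted = sorted(work_estimates, reverse=True)  # w_{l1} >= w_{l2} >= ...
    s_sorted = sorted(speeds, reverse=True)          # s_{q1} >= s_{q2} >= ...
    # The fastest node gets the heaviest block; each pair takes w / s time.
    return max(w / s for w, s in zip(w_sorted, s_sorted))
```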

B. The Local Search Algorithm

Our local search algorithm starts with an initial partitioning P = ⟨x1, x2, . . . , xm−1, y1, y2, . . . , yn−1⟩. The initial partitioning is a uniform partitioning obtained by taking the x-cuts and y-cuts at equal intervals, i.e., xi = i · xmax/m for i = 1, . . . , m−1 and yi = i · ymax/n for i = 1, . . . , n−1. At each step the algorithm tries to locally improve the solution; it generates a new partitioning by shifting one of the cuts by a fixed number (∆) of cells. Thus the new partitioning P′ = ⟨x′1, x′2, . . . , x′m−1, y′1, y′2, . . . , y′n−1⟩ is obtained in one of the following 2(m−1) + 2(n−1) ways:

a. For some x-cut, i ∈ [1, m−1], set x′i = xi + ∆. Set x′j = xj ∀ j ∈ [1, m−1], j ≠ i, and y′j = yj ∀ j ∈ [1, n−1].

b. For some x-cut, i ∈ [1, m−1], set x′i = xi − ∆. Set x′j = xj ∀ j ∈ [1, m−1], j ≠ i, and y′j = yj ∀ j ∈ [1, n−1].

c. For some y-cut, i ∈ [1, n−1], set y′i = yi + ∆. Set y′j = yj ∀ j ∈ [1, n−1], j ≠ i, and x′j = xj ∀ j ∈ [1, m−1].

d. For some y-cut, i ∈ [1, n−1], set y′i = yi − ∆. Set y′j = yj ∀ j ∈ [1, n−1], j ≠ i, and x′j = xj ∀ j ∈ [1, m−1].

If the new partitioning P′ yields a better simulation time according to the sub-procedure est_time, it is taken to be the new partitioning, i.e., the current partitioning P is replaced with P′. Note that the new partitioning P′ may turn out to be infeasible, as the blocks may overlap or fall out of range; in this case the new partitioning is simply discarded. The algorithm examines the 2(m−1) + 2(n−1) new partitions described above in the following order: it first examines the (m−1) new partitions obtained by shifting one x-cut right by ∆ (as in case (a) above), one at a time; it then examines the (m−1) new partitions obtained by shifting one x-cut left by ∆ (as in case (b) above), one at a time; next it examines the (n−1) new partitions obtained by shifting one y-cut up by ∆ (as in case (c) above), one at a time; finally it examines the (n−1) new partitions obtained by shifting one y-cut down by ∆ (as in case (d) above), one at a time (see Figure 3 for an illustration). As soon as the algorithm finds a better partitioning, it takes the new partitioning to be the current partitioning and repeats the above process with the new partitioning. It continues to improve the partitioning in this manner until no more improvement is possible. This yields a locally optimal solution.
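Putting the moves together, the basic local search can be sketched as follows. This is a simplified sketch that reuses the est_* helpers above through a single evaluate callback and does not reproduce the exact move-examination order of the paper.

```python
def local_search(x_cuts, y_cuts, xmax, ymax, delta, evaluate):
    """Hill climbing over cut positions.

    x_cuts/y_cuts: lists of interior cut positions (the current partitioning P).
    evaluate:      maps (x_cuts, y_cuts) to an estimated simulation time,
                   e.g. via est_work, est_speed, and est_time above.
    """
    def feasible(xs, ys):
        # 0 < x_1 < ... < x_{m-1} < xmax, and likewise for the y-cuts.
        return (all(0 < a < b for a, b in zip(xs, xs[1:] + [xmax])) and
                all(0 < a < b for a, b in zip(ys, ys[1:] + [ymax])))

    best = evaluate(x_cuts, y_cuts)
    improved = True
    while improved:
        improved = False
        # 2(m-1) + 2(n-1) candidate moves: shift one cut by +delta or -delta.
        for cuts, other, is_y in ((x_cuts, y_cuts, False), (y_cuts, x_cuts, True)):
            for i in range(len(cuts)):
                for step in (delta, -delta):
                    cand = cuts[:i] + [cuts[i] + step] + cuts[i + 1:]
                    xs, ys = (other, cand) if is_y else (cand, other)
                    if not feasible(xs, ys):
                        continue          # overlapping / out-of-range cuts: discard
                    t = evaluate(xs, ys)
                    if t < best:          # accept the first improving move
                        best, x_cuts, y_cuts = t, xs, ys
                        improved = True
                        break
                if improved:
                    break
            if improved:
                break
    return x_cuts, y_cuts, best
```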

As local search algorithms only yield a locally optimal solution, we consider several optimizations that tend to search a larger space of solutions (partitions) and select the best.

1) Dynamic ∆: One of the factors influencing the local optimum that the algorithm settles into is the ∆ factor used in shifting the cuts. We start with a large value of ∆ in order to cover a wider search space of partitions. Once the algorithm settles into a local optimum with this value of ∆, we reduce ∆ by a small factor and refine the solution using the local search heuristic again until it settles into a local optimum. This process is repeated, gradually reducing ∆, until we hit a specified lower limit on the value of ∆.

More formally, the algorithm takes inputs ∆max, ∆min, and a factor f. It starts with ∆ = ∆max. It runs the local search heuristic with the current value of ∆ until it reaches a local optimum. It then refines ∆ by setting it to ∆/f and reruns the local search heuristic, taking the output partitioning of the previous iteration as the input partitioning for this iteration. This is repeated until ∆ becomes smaller than ∆min.
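A sketch of this schedule wrapped around the local_search sketch above; the default ∆max = 80, ∆min = 10, and f = 2 mirror the values used later in the experiments and are assumptions about reasonable settings, not prescribed constants.

```python
def local_search_dynamic_delta(x_cuts, y_cuts, xmax, ymax, evaluate,
                               delta_max=80, delta_min=10, f=2):
    """Rerun the local search with a shrinking step size; each stage starts
    from the local optimum of the previous stage."""
    best = None
    delta = delta_max
    while delta >= delta_min:
        x_cuts, y_cuts, best = local_search(
            x_cuts, y_cuts, xmax, ymax, delta, evaluate)
        delta //= f   # refine the step: delta <- delta / f
    return x_cuts, y_cuts, best
```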

2) Order of cut selection: The order in which the cuts are examined in each step while considering a new partition plays an important role in arriving at the local optimum solution. By default, the algorithm always considers the x-cuts before the y-cuts in every iteration. However, this can lead to the partitioning being over-fitted in the x-dimension. To avoid this, we consider a variant of the algorithm wherein it remembers the cut at which it found the improved solution in the last iteration. While searching for an improved partitioning in the next iteration, the algorithm starts from the last cut that was examined and proceeds forward. Thus, at any point of the algorithm, all the cuts have been examined the same number of times (±1).

3) Randomized initial partitions: One of the standard techniques used to avoid local minima with local search heuristics is to start with multiple initial solutions (partitions). In order to achieve this, we generate multiple starting partitions, where each partition is obtained by drawing the (m−1) x-cuts uniformly at random from the range [0, xmax] and the (n−1) y-cuts uniformly at random from the range [0, ymax]. The local search heuristic is executed with each of these starting partitions.
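One random restart could be generated as in the sketch below (distinct integer cuts drawn from the interior of each range and sorted; the helper name is ours).

```python
import random

def random_initial_partition(m, n, xmax, ymax):
    """Draw (m-1) x-cuts from (0, xmax) and (n-1) y-cuts from (0, ymax)."""
    x_cuts = sorted(random.sample(range(1, xmax), m - 1))
    y_cuts = sorted(random.sample(range(1, ymax), n - 1))
    return x_cuts, y_cuts
```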

IV. SOFTWARE ARCHITECTURE

In this section, we describe a distributed memory software architecture for parallelizing 2D structured grid applications across multiple processing nodes. Since GPUs provide high computational power and are proving to be increasingly effective for a variety of applications, we also discuss code modifications that can be performed to add GPU support to structured grid applications. These modifications are especially tailored to the CUDA programming model supported on Nvidia GPUs, but can also be extended to other GPU programming models such as OpenCL.

A. Distributed Memory (MPI) Parallelization

Figure 2 depicts a rectangular grid that is divided into sub-domains/partitions. Each of these sub-domains is processed independently by a different MPI process (or rank).


[Figure 3 panels: (a) Case a: shift x-cut left; (b) Case b: shift x-cut right; (c) Case c: shift y-cut up; (d) Case d: shift y-cut down.]

Fig. 3. Obtaining new partitions by shifting cuts.

The parallel processing of these sub-domains requires communication of the shared boundary cells (halo region) with neighboring partitions in each time step to synchronize the different MPI processes.

The processing begins with the master processor performing the initialization and setting up the partitions or sub-domains. Each iteration of a partition is performed by an MPI process (or rank), which can be chosen to run on a processing node comprising one or more CPU cores or a GPU, with the details specified via a run-time mapping file. In the case of sub-domains assigned to a CPU with multiple cores, the processing is parallelized using multiple OpenMP threads. The number of parallel CPU threads to be used is the same as the number of cores in the processing node and is specified in the mapping file. In the case of sub-domains assigned to a GPU, the partition is first communicated to the host CPU and the processing is done using a GPU-specific implementation that is designed to present the same interface as the CPU implementation. On a multi-GPU system, one can run different MPI processes on different GPUs. Table II shows a sample heterogeneous map file for 9 MPI processes. The second column specifies the processing node details. For this particular example, it specifies that MPI ranks 0 and 1 will use GPU processing nodes and the remaining ranks will use processing nodes with 4, 2, and 1 cores (threads) respectively.

B. GPU Implementation

Application software for most large-scale structured grid applications either already supports OpenMP-based parallelization or can be readily modified to do so, but typically has limited GPU support.

TABLE II. SAMPLE PROCESS MAP FILE FOR AN MPI/OPENMP/GPU SUPPORTED APPLICATION.

MPI Rank   GPU / Number of CPU cores
0-1        GPU
2-3        4
4-6        2
7-8        1

To enable processing on GPUs, one has to compile all the required classes and their member functions for the GPU as well. This also requires the addition of new class data and function members and a separate implementation of member functions for the GPU wherever necessary or beneficial in order to exploit the multiple GPU cores. For example, array copy (through memcpy) and array initialization (through memset) benefit substantially from an implementation that uses multiple threads on the GPU. Similarly, the OpenMP based threaded portions of the code can also be implemented to use the GPU threads on the device. These modifications enable the same interface for the CPU and GPU implementations of the routines, so that the same code works on both CPU and GPU without any (or with minimal) routine name changes.

Copying data from the host CPU to the GPU and back is a relatively time consuming process. Therefore, it is imperative that we keep data copy costs as low as possible. In our implementation, the data required for performing a simulation is transferred from the host CPU to the GPU only once,


at the beginning of the simulation. However, each iteration of the simulation requires the MPI processes to exchange halo information among themselves. To enable this halo exchange, the MPI processes using GPUs first transfer the halo region data from device memory to host CPU memory and then exchange it with the neighboring MPI processes. Similarly, the halo data received from another MPI process is transferred from host to device before continuing the simulation iterations on the device. The halo data transfer between the GPU and the host incurs very little (less than 1%) overhead on the simulation time, as the halo data exchanged between the partitions is small and only occurs on the partition boundaries. Another feature that needs to be supported is the writing of output files. Typically, intermediate data is written out at a pre-specified time interval, for example, water heights at every hour of a 48 hour simulation. To enable this, we first transfer the data from the GPU to the host and then write it to the output files using parallel I/O operations [4].
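The halo exchange pattern described above can be illustrated with mpi4py on host-side NumPy buffers. This is a schematic sketch with hypothetical function and neighbour-rank names, showing only the north/south exchange of a row-decomposed block, not the IFM code; in the GPU path, device-to-host copies of the halo rows would precede these calls and host-to-device copies would follow them.

```python
import numpy as np
from mpi4py import MPI

def exchange_halo_rows(local, top, bottom, comm=MPI.COMM_WORLD):
    """Exchange one-row halos with the top/bottom neighbours of a 2D block.

    `local` has shape (rows + 2, cols); row 0 and row -1 are ghost rows.
    `top`/`bottom` are neighbour ranks, or MPI.PROC_NULL on the domain boundary.
    """
    send_up, send_down = local[1].copy(), local[-2].copy()
    recv_up, recv_down = np.empty_like(send_up), np.empty_like(send_down)
    # Send my top interior row up; receive my bottom ghost row from below.
    comm.Sendrecv(send_up, dest=top, recvbuf=recv_down, source=bottom)
    # Send my bottom interior row down; receive my top ghost row from above.
    comm.Sendrecv(send_down, dest=bottom, recvbuf=recv_up, source=top)
    local[0], local[-1] = recv_up, recv_down
```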

V. INTEGRATED FLOOD MODELLING SYSTEM

Integrated Flood Model (IFM) is a hydrological model developed at IBM Research that provides high resolution flood forecasts for a target region. IFM estimates the water heights over the grid points of a 2D grid encompassing the region of interest to identify the flood prone areas. IFM consists of three main components: (a) a Weather Model, (b) a Soil Model, and (c) an Overland Routing Model. Figure 4 shows the information transfer between these three components. The flood simulation proceeds in a time stepped fashion. Essentially, the modeling work flow involves obtaining precipitation forecasts from the weather model and mapping the precipitation data to a Digital Elevation Model (DEM) of the area of interest. The soil model estimates the surface-runoff water height (i.e., the water available for overland flow) based on the incoming precipitation, the absorption properties of the soil (e.g., soil moisture deficit), and other relevant land use and environmental attributes such as temperature. These surface runoff estimates are input to the overland flood routing engine, which calculates the water inflows and outflows in a two-dimensional grid through various steps based on topological characteristics. The remnant water flow from a simulation step is then fed back to the soil model to more accurately determine the water height in the next simulation step. The intermediate estimated values of water heights over the 2D grid are written out from the simulation at a pre-specified output frequency. In the serial implementation of IFM, the overland routing computations dominate the execution time. The IFM code is implemented in C/C++ using an object-oriented approach. More description of the overland routing component can be found in [1]; it is briefly summarized here.

A. Overland Flood Routing

Overland water movement in IFM is implemented as diffusive routing, which allows the distribution of lateral inflow in both space and time [5] with a significant reduction in computation cost. To be specific, the inflow in the X direction for the ith cell, denoted by OLR_X[i], is given by the Manning formula [6] as

\[ \mathrm{OLR}_X[i] = \frac{1}{N[j]} \sqrt{S[j,i]} \; \Delta d \; H[j]^{\eta}, \tag{1} \]

Fig. 4. IBM Integrated Flood Modeling System.

where the jth cell adjoins the ith cell along the X direction, H[j] denotes the water height or surface runoff of the jth cell estimated by the soil model, N[j] is the Manning friction coefficient determined by land use data, η = 5/3 is based on the laminar and mixed laminar-turbulent conditions of the flow, ∆d is the distance between the two cells, and the terrain slope S[j, i] indicates a net dip towards the ith cell (i.e., S[j, i] > 0). This slope is calculated as

\[ S[j,i] = \frac{1}{\Delta d}\bigl(H[j] + h[j] - H[i] - h[i]\bigr), \tag{2} \]

where h[i] denotes the natural elevation of the ith cell. The above routing is implemented on a 2D grid along both the X and Y directions. First, the flow rate is calculated in the X direction (row-wise), letting the fluid flow from cell i to its neighbors or vice versa. Then, the flow rate is calculated in the Y direction (column-wise) to determine the in-flow OLR_Y[i], and the resulting in-flows are used to re-estimate the water height H[i].
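A schematic NumPy rendering of equations (1)-(2) for one X-direction (row-wise) pass is shown below. It is a simplified illustration of the stencil, handling only flow from a cell j into its right-hand neighbour i where the water surface dips towards i; the production IFM kernel also handles the reverse direction and the Y pass.

```python
import numpy as np

def olr_x(H, h, N, dd, eta=5.0 / 3.0):
    """X-direction inflow for one grid row (simplified, left-to-right only).

    H: surface-runoff water height, h: terrain elevation,
    N: Manning friction coefficient, dd: cell spacing (delta d).
    """
    olr = np.zeros_like(H)
    # Slope from cell j towards its right neighbour i = j + 1   (equation (2))
    S = (H[:-1] + h[:-1] - H[1:] - h[1:]) / dd
    j = np.nonzero(S > 0.0)[0]   # donor cells: water surface dips towards i
    i = j + 1                    # receiving cells
    # Manning-type inflow from j into i                          (equation (1))
    olr[i] = (1.0 / N[j]) * np.sqrt(S[j]) * dd * H[j] ** eta
    return olr
```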

B. Implementation

As mentioned earlier, the original problem domains for flood modeling can be highly irregular. To facilitate a simpler software design and lower communication costs, most flood modeling software works with bounding boxes. We adopt the same approach in IFM. The IFM code uses a hybrid MPI/OpenMP approach for parallel processing of the partitions. Each MPI process handles one partition and exchanges halo region data with neighboring partitions. Each partition is further processed in parallel using multiple OpenMP threads. IFM has been shown to scale well on a Blue Gene/P up to 8K processor cores [1]. We also ported the soil and water routing components of IFM to the GPU and tested our implementation on a multi-core system with two Nvidia Kepler based K20X GPUs. We did not use the unified memory feature of CUDA available on the Kepler GPUs for our current GPU support.

VI. RELATED WORK

Our current work involves heterogeneous partitioning of structured grids using local search algorithms and is primarily motivated by the need to scale up flood/hydrological modeling applications in an economical way. There is a fair bit of existing research related to our current work, which can be broadly categorized into three areas: (a) heterogeneous domain decomposition, (b) local search algorithms, and (c) parallelization of hydrological models.


Heterogeneous Domain Decomposition. Domain decomposition based approaches are widely used for parallelizing scientific computing applications on distributed memory machines. There currently exist a number of approaches, such as recursive bisection [7] and graph partitioning [8], that cater to both regular and irregular domains and yield very good performance on homogeneous architectures. However, heterogeneous architectures have received considerably less attention. The work of Karypis et al. [8], [2] is primarily targeted at the scenario where the workload is non-uniform across the grid cells and the processing nodes are heterogeneous. Though their formulation is general and suited for unstructured meshes, the resulting partitions are not regular, which is a constraint for a number of applications such as flood modeling and weather forecasting. Crandall et al. [9] present a non-uniform grid partitioning approach suited for regular structured grids based on the concept of fair binary recursive bisection. For both of the above works, the success of the partitioning algorithm is contingent on knowing the exact workloads and relative speeds of the participating processors. These papers do not present any approach for determining the workload and speed via principled statistical estimation as we do in the current work. Chronopoulos et al. [10] derive heuristic algorithms for both homogeneous and heterogeneous systems while taking into account processor speed and memory capacity. Their experimental evaluation is limited to a pure MPI based implementation of a CFD code. In our current work, we present a hybrid MPI/OpenMP/GPU implementation of a flood routing code and provide significant improvements over a naive partitioning scheme on a heterogeneous system.

Local Search Algorithms. Our current work is primarily motivated by weather forecasting and flood modeling applications, where the problem domains are represented as rectangular grids and structured partitionings are preferred for domain decomposition to keep the communication costs low. Graph partitioning algorithms are typically not suitable for this scenario, while recursive bisection techniques run the risk of sub-optimal results due to the sequential nature of the search. Local search algorithms, on the other hand, have been shown to be very successful in achieving near-optimal solutions for many computationally hard problems [11], [12], [13]. These algorithms work by iteratively applying small modifications (local moves) to the current solution in the hope of finding a better one. They move from solution to solution in the search space (of candidate solutions) using such local moves, until a solution is found that cannot be improved by applying a local move. The main advantages of these algorithms are that (i) for many problems, they are the best performing algorithms used in practice, (ii) they can examine a large amount of the search space in a short amount of time, and (iii) they are easy to understand and implement. Our current partitioning algorithm is based on such a local search approach, where the local moves involve modifying one of the partition cuts and the goodness of a partitioning is measured in terms of estimated simulation time.

Parallelization of Hydrological Models. There is already a large body of literature [14], [15], [16] on using domain decomposition to improve the scalability of hydrological models via message-passing interfaces (MPI). All of the existing hydrological modeling approaches involve partitioning into regular rectangular shaped sub-domains.

TABLE III. DETAILS OF THE HYDROLOGICAL DOMAINS.

Name         Size        Total Cells   Active Cells   % Active
Dominican    3186x7396   23563656      9331474        39.6%
LakeGeorge   4219x2313   9758547       5078870        52%
Queensland   3674x7145   26250730      15737768       60%

Yu [16] presented an approach to parallelize a two-dimensional diffusion-based flood inundation model wherein the target region is divided spatially into sub-regions of equal size and dimension according to the number of processors available. Several ways of spatially decomposing the simulation domain are evaluated, such as splitting it into a number of fixed columns, rows, or different configurations of quadrants. Singhal et al. [17] presented a hybrid MPI-OpenMP version that incorporates a master-slave model of MPI workload balancing for independent watershed basins and OpenMP based shared memory parallelization within each basin. With this hybrid approach, the speedup was reported to be up to 13× on a 16 core machine. While this kind of approach works on moderately sized systems with large amounts of shared memory, it is not scalable to large independent watershed basins, due to memory limitations at a single processor and load imbalance caused by a wide range of basin sizes. Singhal et al. [1] presented a scalable and load balanced partitioning approach that takes into consideration the total workload at each processing unit. However, this work is limited to homogeneous architectures. Our work attempts to combine the workload based load balancing approach in [1] with local search algorithms to obtain domain decompositions that perform well on heterogeneous architectures comprising multicore CPUs and GPUs.

VII. EMPIRICAL EVALUATION

In this section we present the results of an empirical evaluation of the proposed partitioning approach on real world hydrological modeling domains.

A. Experimental Setup

1) Domains: In order to evaluate our partitioning approach, we created a test bed of three real life hydrological modeling domains from different regions of the world. Details about the size of each domain and its number of total and active cells are provided in Table III. The domain Dominican covers the entire island of Hispaniola in the Caribbean and encompasses the countries Dominican Republic and Haiti. The LakeGeorge domain covers the entire watershed of Lake George in the State of New York, USA and was used for studies concerning the transportation of road salt into the lake. The Queensland domain covers the entire State of Queensland in Australia. This domain has been used for various studies on stream networks and for the delineation of watersheds. The percentage of active cells in these domains varies from 40% to 60%.

2) Hardware and Software Configuration: All the experiments were performed on a dual-socket 64-bit RHEL 6.4 based system containing two eight-core Intel Xeon E5-2665 processors, for a total of 16 cores, each running at 2.40 GHz. Each core can run two threads simultaneously through hyperthreading. The system has 64 GB of RAM and contains two Nvidia Tesla K20X GPUs connected through a PCIe Gen2 x16 interface. Each GPU has 14 multiprocessors, each with 192 single-precision (SP) CUDA cores, for a total of 2688 SP CUDA cores per device.


There are also 64 double-precision (DP) units per multiprocessor, for a total of 896 DP units per device. Each GPU has 6 GB of device memory, of which 5.625 GB is available to the user. The system has the CUDA 6.5 toolkit installed. We used the Intel C++ compiler (version 14.0.3) to compile the host code. Since the original IFM code used double precision floating point operations, all the computations that were offloaded to the GPU also used double precision floating point operations. For all I/O operations, the Parallel netCDF [4] toolkit was used. We experimented with two processor configurations: a 3x3 and a 3x4 heterogeneous configuration that used 16 CPU cores along with both GPUs. The details of the 3x3 configuration are shown in Table II. For the 3x4 configuration, we removed a 4-core processing node and added a 2-core processing node and 3 single-core processing nodes to the map file in Table II. This guaranteed that the total number of CPU cores is still 16.

3) Simulation Parameters: For all the experiments, we simulated the domains using a uniform precipitation rate for a period of one hour with a time step of one second. Only the final height values at the end of the simulation were output. All times reported in this section refer only to the computation time for the 3600 iterations, since the other times are very low for large simulations.

B. Experimental Results

We begin with the results of the statistical estimation of workload and processor speeds. Then we discuss the performance of our optimized GPU implementation. This is followed by a comprehensive analysis of the performance obtained using our local search algorithm with different parameters and optimizations. We use the Naive partitioning scheme as the baseline against which to compare our performance; in this partitioning scheme all the partitions are configured to have the same size by setting xi = i · xmax/m for i = 1, . . . , m−1 and yi = i · ymax/n for i = 1, . . . , n−1. Finally, we analyze the performance of our algorithm with different weighing schemes for the active cells and processor speeds, demonstrating the importance of associating weights properly with both the active cells and the processor speeds for the local search algorithm to perform effectively.

1) Estimation of Workload and Processor Speeds: For our flood modeling application the workload is modeled as a weighted average of the active and inactive cells. Fitting a linear regression model on the slowest processor timings (single thread) gives us the weights. The relative ratio of the weight of inactive to active cells was found to be 0.15. We also estimated the processing speeds for the different types of processing nodes and found that a GPU processing node was about 32x, a 4-core processing node about 3.2x, and a 2-core processing node about 1.9x faster than a single-core processing node.

2) GPU Implementation: Figure 5 shows the speedup obtained when using 1 and 2 GPUs (with 1 and 2 MPI ranks respectively) over the existing CPU based implementation using all 16 available CPU cores of the system. We used three different input domains for this experiment, having different numbers of total cells and different active vs. inactive cell ratios, as described in Table III. On the X-axis, the domains are arranged in decreasing order of their total active cells, with the Queensland domain having the most active cells.

[Figure 5 bar chart: speedup of 1 GPU and 2 GPUs relative to the 16-core CPU implementation for the Queensland, Dominican, and LakeGeorge domains.]

Fig. 5. GPU Implementation Performance.

For these experiments, we used naive 2D partitioning to divide the input domains into sub-domains when using 2 MPI processes. These domains were able to fit in the available GPU memory, allowing us to use even a single GPU (with 1 MPI rank) for the simulations without any partitioning of the domain.

The speedup with 1 GPU over the CPU-only implementation varies from 2.3 up to 3.8. As expected, the speedup improves for bigger datasets, as this results in better utilization of the GPU's high computing power and amortizes the smaller overheads.

The speedups obtained with 2 GPUs are close to double those obtained with a single GPU, except for the Queensland domain, where the use of 2 GPUs only provides about 1.5x speedup over the single GPU case. This is due to the imbalanced workload (partitions) assigned to the 2 GPUs by the naive partitioning approach: one GPU gets more than double the active cells assigned to the other GPU. This highlights the need for a better partitioning approach when dividing the input domain across multiple processors.

3) Evaluation of Local Search Heuristic Approach:

• Varying the ∆ parameter. We evaluated our local search algorithm by varying the ∆ parameter. The simulation timing results are displayed in Figure 6 for two different domains (Queensland and Dominican); ∆10, ∆20, ∆40, and ∆80 correspond to runs of our local search algorithm with the value of ∆ set to 10, 20, 40, and 80 respectively. We see that the local search algorithm provides a significant improvement in performance over the Naive algorithm. As mentioned before, a larger value of ∆ tends to explore a larger space globally, while smaller values of ∆ tend to refine the solution to a local optimum. Our results show that larger values of ∆ tend to yield better optima. In the case of the Queensland domain, we see that the solution tends to gradually improve as the value of ∆ is increased, with ∆80 showing up to 30% improvement over ∆10¹. For the Dominican domain, we see that the performance of ∆10 and ∆20 is almost the same, with ∆20 performing slightly worse than ∆10.

¹ Performance is calculated as 1/time.


[Figure 6: simulation time (MaxTime) under the Naive, ∆10, ∆20, ∆40, ∆80, ∆V, ∆V-XY, and ∆V-FT-XY partitionings. (a) Queensland domain (b) Dominican domain]

Fig. 6. Simulation time using the local search algorithm with different parameters and optimizations on 9 processors in a 3×3 configuration.

[Figure 7: performance relative to Naive under the Naive, NoWtCells, NoWtProcs, and WtCellsProcs weighing schemes for the LakeGeorge, Dominican, and Queensland domains. (a) 9 processors in 3×3 configuration (b) 12 processors in 3×4 configuration]

Fig. 7. Comparison of performance using the local search algorithm with different weighing mechanisms for active cells and processors.

Similar performance is observed for ∆40 and ∆80 as well. Overall, we see that better results are achieved with higher values of ∆ (∆40 and ∆80). With ∆ = 80, we obtain up to 100% improvement in performance for the Queensland domain and up to 50% for the Dominican domain over the Naive partitioning scheme.

We next evaluated the effect of different optimizations on the local search algorithm. The simulation timing results with the different optimizations are shown in Figure 6 for the Queensland and Dominican domains.

• Dynamic ∆. The first optimization was to dynamically vary the ∆ parameter. We range ∆ from 80 to 10 in stages, progressively reducing it by a factor of 2; the optimum partitioning from each stage is used as the input partitioning for the next stage. In Figure 6, we see that the local search algorithm with dynamically varying ∆ (∆V) performs better than with any fixed value of ∆ (∆10, ∆20, ∆40, ∆80). For the Queensland domain, ∆V performs about 40-65% better than the fixed values of ∆, whereas for the Dominican domain the improvement varies from 15-50%.

• Order of cut selection. We next analyze the optimization wherein the local search algorithm tries to alter the cuts along both the X and Y dimensions in a balanced manner while searching for an improved partition. This optimization was applied along with the dynamic ∆ optimization. In Figure 6, we observe a significant improvement in performance with this optimization (∆V-XY). For the Queensland domain, the performance improves by a dramatic 240% over the dynamic ∆ optimization (∆V); however, there is very little improvement for the Dominican domain.

• Further fine-tuning. We next took our algorithm incorporating both the dynamic ∆ and order of cut selection optimizations discussed previously and further fine-tuned it to use different ranges in the dynamic ∆ optimization, picking the best among the resulting partitions as the final partition; we used the following ranges in the dynamic ∆ optimization: (i) 80 to 10, (ii) 40 to 10, (iii) 20 to 10, and (iv) a fixed value of 10. The results in Figure 6 indicate that this optimization


(∆V-FT-XY) did not lead to any significant improvement for the Queensland domain but did lead to an improvement of about 30% for the Dominican domain over ∆V-XY.

Thus we see that, with all the optimizations incorporated (∆V-FT-XY), the local search heuristic improves the performance of the flood modelling simulations by a significant factor of 4 to 8 over the naive algorithm (Naive), depending on the domain. We also conducted experiments taking random partitions instead of uniform partitions as the initial partitions for the local search heuristic, but our results indicated a degradation in performance with random partitions.

4) Effect of weighing schemes: Our algorithm uses weights for two features: (i) the factor increase in the computation cost of active cells in comparison to inactive cells, and (ii) the relative speed of the processors, with the slowest processor having speed 1. We study the effect of these weighing schemes in our local search algorithm by considering one of them at a time. Figure 7 displays the results comparing the performance of the local search algorithm with different weighing schemes for the different domains. The scheme NoWtCells corresponds to the scheme wherein we only associate weights with the processor speeds and associate a uniform weight of 1 with all the cells. The scheme NoWtProcs, on the other hand, corresponds to the scheme that ignores the speeds of the processors (associating a speed of 1 with all the processors) but associates the weights with the active cells. Lastly, the scheme WtCellsProcs corresponds to our algorithm wherein both of the above weights are incorporated. For each of the above schemes, we ran the local search algorithm with all the different optimizations discussed above and report the time from the best performing case. The results in the figure are scaled relative to the Naive algorithm. As can be seen, both the NoWtCells scheme and the NoWtProcs scheme perform only marginally better than the Naive algorithm; in comparison, the WtCellsProcs scheme, incorporating weights for both these features, shows significant improvement. For the case of 9 processors in a 3×3 configuration, the improvement with the former two schemes is up to a factor of 2 for some cases; with the WtCellsProcs scheme we see a significant factor of 4-8 improvement in performance over Naive. For the case of 12 processors in a 3×4 configuration, the improvement with the former two schemes is up to a factor of 1.5, whereas with the WtCellsProcs scheme we see a 2-2.5 factor improvement in all the cases.

VIII. CONCLUSION

Numerous problems in science and engineering involve discretizing the problem domain as a regular structured grid and making use of domain decomposition techniques to obtain solutions faster using high performance computing. While most of the past work has targeted homogeneous architectures, there is a lot of interest in applying the cheap computing power available in GPUs and/or cloud resources to scientific applications. In this work, we presented a novel partitioning approach based on local search heuristics. Our algorithms are designed for heterogeneous computation environments where the processing nodes have different computational capabilities. We also address the problem of estimating the relative workload and processing capabilities of the processing nodes via linear least squares regression.

Empirical evaluation of our proposed approach on real-world domains demonstrates that we can obtain a performance improvement of up to 8× over a naive 2D partitioning approach that does not use the information regarding the variable workload or processing power. Though inspired by the constraints of the flood-modeling problem, the proposed approach is applicable to a wide range of applications that use structured grids. In this work, we ignored the communication costs that could become a factor while scaling to a larger number of processors. However, the workload estimation approach can be extended to include the computation and communication costs on the target architecture in a similar way.

REFERENCES

[1] S. Singhal, S. Aneja, F. Liu, L. V. Real, and T. George, "IFM: A Scalable High Resolution Flood Modeling Framework," in Euro-Par, 2014, pp. 692–703.
[2] I. Moulitsas and G. Karypis, "Architecture aware partitioning algorithms," in Algorithms and Architectures for Parallel Processing, 8th International Conference, ICA3PP 2008, Cyprus, June 9-11, 2008, Proceedings, 2008, pp. 42–53.
[3] "10 costliest floods worldwide ordered by overall losses." [Online]. Available: http://www.munichre.com/app pages/www/@res/pdf/NatCatService/significant natural catastrophes/2012/NatCatSERVICE significant floods eco en.pdf
[4] J. Li, W.-k. Liao, A. Choudhary, R. Ross, R. Thakur, W. Gropp, R. Latham, A. Siegel, B. Gallagher, and M. Zingale, "Parallel netCDF: A High-Performance Scientific I/O Interface," in SC, 2003.
[5] R. Moussa and C. Bocquillon, "Algorithms for solving the diffusive wave flood routing equation," Hydrological Processes, vol. 10, no. 1, pp. 105–123, 1996.
[6] B. Yen, Channel Flow Resistance: Centennial of Manning's Formula. Water Resources Pub., 1992.
[7] H. D. Simon and S.-H. Teng, "How good is recursive bisection?" SIAM J. Sci. Comput., vol. 18, no. 5, pp. 1436–1445, Sep. 1997. [Online]. Available: http://dx.doi.org/10.1137/S1064827593255135
[8] G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," J. Parallel Distrib. Comput., vol. 48, no. 1, pp. 96–129, 1998.
[9] P. Crandall and M. J. Quinn, "Non-uniform 2-d grid partitioning for heterogeneous parallel architectures," in Proceedings of IPPS '95, The 9th International Parallel Processing Symposium, April 25-28, 1995, Santa Barbara, California, USA, 1995, pp. 428–435.
[10] A. T. Chronopoulos, D. Grosu, A. M. Wissink, M. Benche, and J. Liu, "An efficient 3d grid based scheduling for heterogeneous systems," J. Parallel Distrib. Comput., vol. 63, no. 9, pp. 827–837, 2003.
[11] E. Aarts and J. K. Lenstra, Eds., Local Search in Combinatorial Optimization, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1997.
[12] H. Hoos and T. Stützle, Stochastic Local Search: Foundations & Applications. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2004.
[13] P. Toth and D. Vigo, Eds., The Vehicle Routing Problem. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2001.
[14] B. F. Sanders, J. E. Schubert, and R. L. Detwiler, "ParBreZo: A parallel, unstructured grid, Godunov-type, shallow-water code for high-resolution flood inundation modeling at the regional scale," Advances in Water Resources, vol. 33, no. 12, pp. 1456–1467, 2010.
[15] J. C. Neal, T. J. Fewtrell, P. D. Bates, and N. G. Wright, "A comparison of three parallelisation methods for 2D flood inundation models," Environ. Model. Softw., vol. 25, no. 4, pp. 398–411, Apr. 2010.
[16] D. Yu, "Parallelization of a two-dimensional flood inundation model based on domain decomposition," Environmental Modelling & Software, vol. 25, no. 8, pp. 935–945, 2010.
[17] S. Singhal, L. Villa Real, T. George, S. Aneja, and Y. Sabharwal, "A hybrid parallelization approach for high resolution operational flood forecasting," in HiPC'13.