An optimized GPU implementation of a 2D free surface simulation … · 2014-09-23 · An optimized GPU implementation of a 2D free surface simulation model on unstructured meshes

An optimized GPU implementation of a 2D free surface

simulation model on unstructured meshes

A. Lacasta∗, M. Morales-Hernandez∗, J. Murillo∗, P. Garcia-Navarro∗,

Abstract

This work is related with the implementation of a finite volume method to solve

the 2D shallow water equations on Graphic Processing Units (GPU). The strategy

is fully oriented to work efficiently with unstructured meshes which are widely

used in many fields of Engineering. Due to the design of the GPU cards, struc-

tured meshes are better suited to work with than unstructured meshes. In order

to overcome this situation, some strategies are proposed and analyzed in terms of

computational gain, by means of introducing certain ordering on the unstructured

meshes. The necessity of performing the simulations using unstructured instead

of structured meshes is also justified by means of some test cases with analytic

solution.

Keywords: GPU, Finite volume methods, Unsteady flow, Unstructured meshes,

Dry/wet boundaries, CUDA, High Performance Computing

1. Introduction1

Physically based simulations of complex systems usually require large com-2

putational facilities to be completed in a reasonable time. Moreover when the3

simulated phenomenon is unsteady and based on a dynamical estimation of the up-4

dating time step, the computational performance is an important topic to be taken5

into account. One of the most widespread strategies to reduce the computational6

cost is the use of parallel techniques, involving a suitable number of processors.7

Since CPU frequencies seem to be reaching their maximum capacity [1], nowa-8

days Many-Core parallel techniques appear to be an interesting option.9

10

∗LIFTEC-EINA,CSIC-Universidad Zaragoza

Email addresses: [email protected] (A. Lacasta), [email protected] (M.

Morales-Hernandez), [email protected] (J. Murillo), [email protected] (P.

Garcia-Navarro)

Preprint submitted to Advances in Engineering Software June 2, 2014

In recent years, Graphic Processing Unit (GPU) has been used to accelerate the11

calculations because of its inherent vector-oriented designing. This paradigm is12

known as General-Purpose Computing on Graphics Processing Unit (GPGPU) and13

it is widely used for a very large range of applications in CFD such as [2],[3],[4]14

and [5] as well as other environmental applications such [6]. In the present work,15

special attention is paid to the application of this GPUs to unsteady flows of inter-16

est in hydraulics. Shallow Water models in particular are widely used to simulate17

surface geophysical flows. These situations usually involve large size domains18

and long time scales. Practical applications require a compromise between spatial19

accuracy and computational efficiency. In order to achieve the necessary spatial20

resolution, rather fine grids become necessary in many cases requiring more data21

storage, increasing proportionally the number of operations and reducing the allow-22

able time step size for explicit calculations. When, at the same time, a reasonable23

computational time is desired, the use of GPU codes is one of the options for com-24

puting large space and temporal domain problems.25

26

The idea of accelerating the calculations in unsteady hydraulic simulation us-27

ing multiple CPU was recently reported in [7],[8] or [9] as well as using GPU in28

[10][11][12][5]. Although a very good compromise between number of CPU used29

and performance is offered by the former option, the cost of using multiple CPU30

is significant due to the hardware investment and associated use. Alternatively, the31

GPU technology offers the performance of smaller clusters with less disbursement32

[13]. The main difficulty, and apparent drawback, when porting codes from CPU33

to GPU, is the cell order required by the GPU to process data efficiently. This34

drawback is not present when dealing with structured meshes due to the inherent35

order and a simple and efficient implementation is relatively easy to be obtained.36

37

Despite the wide use of structured meshes, complex geometries for internal or38

external boundaries are problematic to be represented if not using unstructured39

meshes. Moreover, when dealing with topographic representation some recent40

works [14] have shown the benefit of using unstructured meshes in unsteady hy-41

draulic simulations over irregular topography. The quality of the numerical results42

is sensitive to the grid resolution. Hence grid refinement is clearly an option to43

modify the whole resolution. In that sense, adaptive grid refinement, readily avail-44

able when using triangular unstructured meshes [15], designed to follow local bed45

variations or irregular boundaries can be very useful. The present work is moti-46

vated by the implementation in GPU of a code able to perform unsteady hydraulic47

simulations on variable density triangular unstructured meshes.48

49

The performance of GPU based calculations with Double Precision (double)50

2

is lower than those that use Single Precision (float) [16], [5]. In the particu-51

lar case of the 2D Shallow Water Equations with source terms [17], [18], the use52

of float is not always desirable. In fact, when simulating complex topography53

cases, wave propagation over dry beds represents a numerical challenge. The past54

experience with the dynamical stability control of such transient flows involving55

wet/dry fronts indicates that double precision is always required. All the perfor-56

mance analysis presented will deal with that kind of data.57

58

In the first part of the text, the governing equations are outlined. They are fol-59

lowed by a description of the finite volume updating scheme used. Then, the most60

relevant general aspects of the implementation in GPU are identified. The par-61

ticular difficulties encountered when dealing with triangular unstructured meshes62

and some improvements to overcome them are detailed in the following section.63

Finally, they are applied to two test cases in order to prove their behaviour when64

using unstructured meshes.65

66

2. Mathematical model/Governing equations67

The two-dimensional SW equations, which represent depth averaged mass and68

momentum conservation, can be written as follows:69

∂U

∂t+

−→∇E = H (1)

where70

U = (h, qx, qy)T (2)

are the conserved variables with h representing the water depth, qx = hu , qy = hv71

and u = (u, v) the depth averaged velocity vector along the (x, y) coordinates72

respectively. The fluxes of these variables are E=(F,G) given by:73

F =

(qx,

q2xh

+1

2gh2,

qxqyh

)T

, G =

(qy,

qxqyh

,q2yh

+1

2gh2

)T

(3)

where g is the acceleration due to the gravity.74

75

The source terms of the system are the bed slope and the friction terms:76

H =

(0, −gh

∂z

∂x− τb,x

ρw, −gh

∂z

∂y− τb,y

ρw

)T

(4)

3

Where τb,x and τb,y are the components of the bed friction stress and ρw is the77

water density. These friction losses in both (x, y) axis are written in terms of the78

Manning’s roughness coefficient n:79

τb,xρw

= ghn2u

√u2 + v2

h4/3,

τb,yρw

= ghn2v

√u2 + v2

h4/3(5)

3. Numerical scheme80

The numerical resolution of system (??) can be obtained by means of the first81

order upwind finite volume scheme. Integrating in a volume or grid cell Ω the82

numerical scheme can be expressed compactly:83

∂

∂t

∫

Ω

U dΩ+

NE∑

k=1

(δE − T)k · nk lk = 0 (6)

It is possible to define a Jacobian matrix Jk of the normal flux at each edge as84

a result of the local linearization85

δ(E · n)k = JkδUk (7)

and to diagonalize it in terms of matrices P and Λ, formed by its eigenvalues λm86

and eigenvectors em respectively:87

P =

1 0 1u− c nx −cny u+ c nx

v − c ny c nx v + c ny

, Λ =

λ1 0 0

0 λ2 0

0 0 λ3

,

e1 =

1u− c nx

v − c ny

, e2 =

0−c ny

c nx

, e3 =

1u+ c nx

v + c ny

,

λ1 = u · n− c, λ2 = u · n, λ3 = u · n+ c

(8)

where u · n = u nx + v ny. The definition of the averaged variables is as follows88

[19]:89

uk =ui√hi + uj

√hj√

hi +√

hj, vk =

vi√hi + vj

√hj√

hi +√

hj, ck =

√ghi + hj

2(9)

4

The difference across the edge k can be projected onto the eigenvectors basis90

[18]:91

δUk = Uj −Ui = PkAk (10)

where Ak = (α1, α2, α3)Tk contains the set of wave strengths. Following the92

same procedure with the source terms [18]93

(Tn)k = PkBk (11)

where Bk = (β1, β2, β3)Tk contains the source strengths.94

95

More information about the values of the wave and the source strengths as well as96

the entropy fix can be found in [18]. The contributions due to the fluxes and the97

source terms are combined in a compact expression98

(γm)k =(λm αm − βm

)

k(12)

The 2D numerical upwind explicit scheme is formulated using only the contri-99

butions that arrive to the cell:100

γ−k =1

2

[1− sign

(λk

)]γk (13)

so that the finite volume approach for the updating of a single cell of area Ωi is101

[20]:102

Un+1i = U

ni − ∆t

Ωi

NE∑

k=1

3∑

m=1

(γ−m eml

)nk

(14)

Considering that in the explicit scheme (14) each k cell edge is used to deliver103

information between a pair of neighbouring cells of different size, the time step104

size compatible with numerical stability is limited by105

∆t ≤ ∆tλ ∆tλ =min(χi, χj)

maxm=1,2,3 |λm|(15)

so that the following dimensionless quantity is defined106

CFL =∆t

∆tλ≤ 1 (16)

to control the numerical stability of the method.107

108

5

Considering unstructured meshes, the distance χi in each cell i must consider109

the volume of the cell and the length of the shared k edges.110

χi =Ωi

maxk=1,NE lk(17)

For more details, see [21, 18].111

112

4. Implementation on GPU113

The GPU contains a large number of processors working all together apply-114

ing the same operation over different elements. In order to program using this115

paradigm, NVIDIA has developed CUDA (Compute Unified Device Architecture)116

[22] that abstracts some aspects of the hardware, allowing programmers to develop117

general purpose programs efficiently.118

119

There are two main points to understand the performance of GPUs by means120

of CUDA. The first is based on the way CUDA applications are developed. The121

basic element to be processed is called Thread. Threads are identified by labels122

ranging between 0 and BlockDim. The group of Threads is called Block, and it123

contains a (recommended) 32 multiple number of Threads. Finally any group of124

Blocks is called Grid. The second aspect of interest is the hardware architecture.125

The minimum unit is the Streaming Processor (SP), where a single Thread is exe-126

cuted. A group of SP’s form the Streaming Multiprocessor (SM), typically with 32127

SP’s. Finally, a GPU is composed by between 2 and 16 SM’s.128

129

NVIDIA GPUs are generally formed by Streaming Multiprocessors (Typically130

14-16 in the Tesla Series) which contain Streaming Processors (32 in the case of131

the Tesla) [23]. The GPU distributes the Blocks among the SMs. The SMs in turn132

assigns the Threads to the SP’s. All SP’s inside the multiprocessor perform the133

same operations at the same time, but each of them applies it to a different element134

inside a vector. The designing of the GPU is the reason of the recommendation of135

configure blockDim multiple of 32. The set of 32 threads processed in a SM is136

called warp.137

138

CUDA for C is an extension of the C Standard programming language which139

includes the possibility of specifying three key abstractions in the execution: hier-140

archy of thread groups, shared memories and barrier synchronization. These are141

exposed to the programmer as a set of extensions that may be introduced in the142

6

code. The most significant change in the syntax is the neccessary parameters for143

the execution of the functions. In particular, the functions include:144

1 // Function calling with blocksPerGrid blocks of size145

2 // threadsPerBlock using streamId. If no stream is specified,146

3 // the third and fourth parameter may be ommited.147

4 functionName148

5 <<< // This is always included in CUDA kernels149

6 blocksPerGrid, // Number of Blocks150

7 threadsPerBlock, // Number of Threads within a block151

8 0, // This parameter must be included if stream152

9 // ID is specified. It is 0 by default.153

10 streamId // Identifier of the Execution Stream154

11 >>> // This is always included in CUDA kernels155

12 (parameters);156

where blocksPerGrid is the number of blocks of size threadsPerBlock157

launched in the stream streamId. Moreover, inside each function, each thread158

may establish its identifier taking into account its own threaId.x and blockId.x159

(in the case of 2D or 3D blocks, it is possible to get access to the .y, .z ID). This160

provides each thread the way of establishing a unique access as, for instance, the161

value of a vector V:162

1 ...163

2 float V[size];164

3 ...165

4 int ID;166

5 float velocity;167

6168

7 // Each thread may build its own unique Identifier as169

8 ID = threadIdx.x+(blockIdx.x*blockDim.x);170

9 // Each thread wants a position of vector v171

10 v[ID]=velocity172

More details of the standard can be found in [22].173

174

The implementation of numerical models using GPU requires to take into ac-175

count four important aspects:176

• Number of elements to be processed: The number of Blocks and the num-177

ber of Threads within each Block are parameters to be tuned by the pro-178

gramer. They determine the maximum number of elements the GPU can179

process (Blocks times Threads per Block). This number must be greater180

than or equal to the number of elements to be processed.181

• Bottlenecks: In order to process all the operations following the GPU paradigm,182

special attention must be paid to the shared information between the process-183

ing elements.184

7

• Data transfer reduction: The communication between CPU and GPU is185

very slow. In general, all the operations must take place inside the GPU,186

otherwise the overhead caused by data transfers may generate such a cost that187

the global performance of the implementation can be lower than on CPU.188

• Floating Point data precision: The GPU arithmetic performance is halved189

when using double precision data. Many applications require double pre-190

cision because of numerical aspects but there exist many others for which191

simple precision is enough to develop the calculations. When single preci-192

sion is acceptable, performance can be almost doubled on GPU.193

The main loop of our implementation is shown in Listing 1 where the fun-194

damental of the programming and the general aspect of the simulation code are195

shown.196

Listing 1: Overview of the CUDA implementation.

1 ...197

2 // Configuration of the parameters198

3 Threads=256;199

4 edgeBlocks=nEdge/Threads;200

5 cellBlocks=nCell/Threads;201

6 while(t<tmax)202

7 // Calculate the fluxes203

8 calculateEdgeFluxes<<<edgeBlocks,Threads,0,executionStream204

>>>(...);205

9 // Establish the minimum dt obtaining the ID of the206

10 // minimum dt207

11 // (*) Explained at Listing 3208

12 cublasIdamin(...,nEdge,vDt,1,id);209

13 // And assign it210

14 newDt<<<1,1,0,executionStream>>>(dt,vDt,id);211

15 // Update the elapsed time (in GPU)212

16 updateT<<<1,1,0,executionStream>>>(cuda_t,dt);213

17 // Retrieves the value of t to CPU214

18 cudaMemcpy(t,cuda_t,sizeof(double),cudaMemcpyDeviceToHost);215

19 // Update the cell values216

20 assignFluxes<<<cellBlocks,Threads,0,executionStream>>>(...);217

21 // Verify whether it is neccessary to dump data and218

22 // if so, process it.219

23 // (*) Detailed in Listing 4220

24 if(t<t_dump)221

25 // Copy of cell variables to a GPU222

26 // stored buffer223

27 cudaMemcpy(..., cudaMemcpyDeviceToDevice);224

28 // Establishing the barrier to ensure the copy of the225

29 // data to the buffer226

8

30 cudaStreamSynchronize(copyStream);227

31 // Copy the data to the CPU buffer228

32 cudaMemcpyAsync(..., cudaMemcpyDeviceToHost,copyStream);229

33 // Create a parallel CPU stream in order to control230

34 // the disk-transfer231

35 pThread_create( &diskThread, ...);232

36 233

37 234

First, the number of elements of the Blocks is defined statically at the begin-235

ing attending to the criterion of occupancy of the streaming multiprocessors (See236

CUDA GPU Occupancy Calculator [23]). For our purposes, the amount of 96237

Threads/Block and 256 Threads/Block has been identified in general as the best238

configuration for meshes with less than 100000 elements and more than 100000239

elements respectively.240

241

Bottlenecks appear when reduction patterns are present in the algorithms. The242

necessity of using reduction functions during the computation can be implemented243

using cublas. cublas library has high-level functions that work retrieving re-244

sults to GPU or to CPU. When interested in using them without taking out the data245

from the GPU, thus must be specified as in Listing 2246

Listing 2: Cublas configuration

1 cublasSetPointerMode_v2(handle, CUBLAS_POINTER_MODE_DEVICE)247

stating that all results have to be returned to the GPU memory.248

249

1 2 3 n-2 n-1 n

0.13 0.45 0.05 0.62 0.78 0.11

cublasIdamin() 3

dt[1..n] Δt=dt[3]

Figure 1: Gathering minimum ∆t for all the domain

The proposed model formulated as in (14) requires the computations of the250

minimum global ∆t that verifies the stability condition 16 when running along all251

the cell edges. It is achieved by means of the functions cublasIdamin. Figure252

1 illustrates an example of the behaviour of this function obtaining the minimum253

among all of them. Details are shown in Listing 3.254

255

Listing 3: Gathering ∆t operation

1 __global__ void newDt(double *dt,double *vDt, int *id)256

9

2 // As cublasIdamin returns it value following257

3 // 1-based indexing, we must substract 1258

4 *dt=vDt[*id-1];259

5 260

6 ....261

7 cublasIdamin(handle,*npared,vDt,1,id);262

8 newDt<<<1,1>>>(dt,vDt,d_id);263

9 ...264

As the calculation is controlled by CPU, it is necessary to transfer the updated265

tn+1. After ∆t is calculated, the updating operation can be performed as in Listing266

4 hence the updated value of tn+1 is transferred to the CPU.267

268

Listing 4: Updating t

1 __global__ void updateT(double *dt,double *t)269

2 *t=*t+*dt;270

3 271

∆t is the only variable that has to be transferred every time-step due to the272

control of the global calculation by the CPU. When data dumping is required, all273

variables at the cells must be transferred to the CPU. This transfer is slow. Although274

out of the scope of this work, this particular operation may be optimized making275

use of concurrent execution and asynchronous data transfer.276

5. Issues related to implementation on unstructured meshes277

Structured meshes have proved suitable for GPU computations [11]. Indeed,278

the GPUs have been developed optimizing the accesses to memory when this kind279

of structures (grids) are used. The main question when calculating with structured280

meshes to solve the Shallow Water equations is whether they provide good results281

when the topography is complicated. It is more difficult to apply local refinement282

of the mesh if neccesary and, when solving river basins, the angulosity of the mesh283

could lead to artificial viscosity near the shores. Moreover, when the simulation284

requires to respect some kind of structure, the flexibility of the mesh allows fitting285

to represent it propertly [15]286

287

The main advantage of structured meshes is the inherent order existent in their288

creation. Neighbouring cells are usually near in memory. When the solver works289

by edges, this point is very important in order to get the coalesced access to the290

main memory in GPU [24].291

292

10

The way the calculations are made over unstructured meshes is not the same293

as that when using structured meshes. The work in [25] provides a way to make294

the calculations using the NSEW scheme, implying that the order of the cells is295

inherent to the manner of accessing the data. However, when using unstructured296

meshes, the most common way of performing the calculation is by solving the297

fluxes by edges rather than by cells because the number of operations is nearly298

halved. In the present work three main aspects have been identified as relevant to299

construct an optimal solver in GPU when working with unstructured meshes:300

• Cell ordering: The way cells are ordered is important when accessing data301

from two neighboring cells.302

• AoS vs SoA: The choice of using Array of Structures or Structure of Arrays303

may improve the performance of the solver304

• Edges ordering: From the previous two points, when calculating by edges,305

the ordering of these shows relevant results.306

In consequence, there is a limitation because of the variable wet-dry bound-307

aries that appear in unsteady cases and may decrease the overall performance of308

the code. When ordering in the cells is not proximal, Warp divergence is probable.309

Warp divergence occurs when two threads have different evaluation in a control310

flow structure. Within a warp, the hardware in not capable of executing if and311

else statements and serialization of execution paths is required. More details312

about warp divergence can be found in [26]. This implies that when two elements313

to be processed must apply different operation by means of a flow control structure314

(i.e. if... else... such in the case of wet-dry frontier), first those threads315

that satisfy the condition will apply the first operation and after this, those Threads316

which enter in the else condition, will apply the other operations. This special317

issues may have important impact on the performance of the application. The order318

of the cells, will avoid partially this limitation and it is next discussed.319

320

5.1. Cell Ordering321

Cell ordering has relevant weight on the way the cells are connected. This is322

controlled by the connectivity matrix, defined as:323

mi,j =

1 if cell i and cell j are neighbours and i < j0 otherwise

(18)

Coalesced memory access or memory coalescing refers to combining multiple324

memory accesses into a single transaction. Every successive 128 bytes ( 32 single325

11

precision words ) memory can be accessed by a warp (32 consecutive threads) in a326

single transaction. Among the conditions that may result in uncoalesced load, i.e.,327

memory access becomes serialized, the more problematic is when memory access328

is sparse [23]. Structured meshes, in general, have coalesced-pattern implicit in329

their construction and this allows not only to know implicitly the index of the cell330

given the edge but also an ordered manner of performing the memory access. This331

is illustrated by the sketch in Figure 2.332

......

... ...

32 Threads Warp

Cell Data Array

Figure 2: Access pattern for most common structured meshes

On the other hand, unstructured meshes require to have an auxiliar index vector333

to perform the calculations by edges. Taking this into account, the coalescence for334

the access to the cells given the edge depend on their design. An example of the335

access pattern is described in Figure 3. In contrast to Figure 2, unstructured meshes336

require an the connectivity between edges and cells.337

...

...

...

...

...

...... ... ... ... ... ...

32 Threads Warp

Wall Neighbouring

Vector

Cell Data Array

Figure 3: Access pattern for unstructured meshes

Two 32-triangular element meshes are defined in Figure 4. The first is a trian-338

gular structured mesh and the second is a triangular unstructured mesh constructed339

using Triangle [27]. Also Figure 4 shows the connectivity matrix of both meshes.340

That of the structured mesh is very close to a banded matrix whereas that of the341

unstructured mesh does not have that pattern. This implies that, for the structured342

mesh, the memory will be more ordered and the memory transactions will be per-343

formed faster.344

345

When dealing with unstructured meshes, some manipulations can be applied346

to change the connectivity matrix to make it become closer to a tridiagonal matrix.347

The RCM (Reverse Cutchil-McKee) algorithm transforms a sparse matrix into a348

banded matrix form with a small bandwidth. This bandwidth can be measured as:349

12

0

5

10

15

20

25

30

0 5 10 15 20 25 30 0

5

10

15

20

25

30

0 5 10 15 20 25 30

Figure 4: Upper: Structured mesh (left) and unstructured mesh (right). Lower: Connectivity matrix

for the structured mesh (left) and for the unstructured mesh (right). (Note that there exists a point if

m(i, j) = 1).

ϕ = max|f(vi)− f(vj)| : vivj ∈ E (19)

where |f(vi) − f(vj)| defines the distance between elements vi and vj . For our350

purpose, f(vi) is the index of the element vi and the difference of the indexes of351

two neighboring elements vi and vj represents the distance in memory allocation352

for those elements.353

354

In terms of matrices, the graph bandwidth is the bandwidth of the symmetric355

matrix, which represents the adjacency matrix of the graph. Applying this trans-356

formation to the unstructured mesh of Figure 4, the bandwidth goes from ϕ = 29357

(Original bandwidth) to ϕ = 5 (Figure 5), so, in the worst case, two variables358

13

related to the same edge, will be allocated at most 6 memory positions apart.359

0

5

10

15

20

25

30

0 5 10 15 20 25 30

Figure 5: Unstructured RCM processed mesh (left) and connectivity matrix (right)

5.2. AoS vs SoA360

A very frequent question when working with arrays is the use of Structrure of361

Arrays (SoA) or Arrays of Structures (AoS). This is more important when working362

with GPU. In Figure 6 the difference between both options is displayed. Taking363

into account that the considered variables are U = h, hu, hv, it is necessary to364

store them ordered by cells. While SoA stores first all h1..NC for the NC mesh365

cells and then hu1..NC and hv1..NC , AoS stores h, hu, hv1 then h, hu, hv2366

and so on up to NC .367

368

Figure 6: Example of configuration of GPU Main Memory for Structure of Arrays and Array of

Structures for 5 cells

14

Contrary to the CPU, the GPU has a very small cache and, moreover, the ac-369

cesses are improved in groups of 32 elements. Therefore, when a single Thread370

accesses any memory address, it will move a group of 32 elements starting in that371

memory address to the cache. If the contiguous Threads access the near memory372

addresses, there is no need to access the memory again, because the GPU have373

already brought the following 31 elements to the cache.374

375

Listing 5: Cell access example function. Detailed the access to the variables of interest

1 __global__ void cellAccess(double *h, double *hu, ..., double *w)376

377

2 int i = ThreadIdx.x+(BlockIdx.x*BlockDim.x);378

3 ...379

4 // SoA380

5 h=h[i];381

6 hu=hu[i];382

7 hv=hv[i];383

8 ...384

9 // AoS385

10 h=w[3*i];386

11 hu=w[3*i+1];387

12 hv=w[3*i+2];388

13 ...389

14 390

When the GPU performs the memory load in order to obtain the value of h391

(for instance) for a given identifier, if the memory is ordered as AoS (Listing 5,392

lines 10-12), the accesses pattern is 3-displaced. This is so because, given a Thread393

accessing to position i, the neigbouring Thread in the Warp will need to access394

position i+3. However, if the memory is ordered as SoA (Listing 5, lines 5-7) the395

access to position i, will be made concurrently to the access to position i+1 by the396

next Thread and the coalesced access will be automatically produced. It is impor-397

tant to note that if, as in the example, data types are double, internally two cycles398

are required for a Warp. This is so because the 16 first Threads will access the first399

128 bytes and the second 16, the next 128 bytes, covering each 16 the amount of400

bandwidth allowed by the Caching GPU Load Operation. For the case of AoS, the401

first 6 Threads will hit in the access, but the 6 next Threads will need to load 128402

bytes again because of the structure of the memory space. Although AoS is clearer403

conceptually, SoA improves memory load operations in this type of problems and404

is thus recommended in GPU implementations.405

406

Bearing in mind that the default mode of the load operation in the GPU is407

Caching, it attempts to hit the load in L1 and L2 cache with load granularity of408

15

128-Byte. If the data type used is double, it takes 8 bytes to store data and for a409

Warp, if the data are contiguous, it is necessary to access twice the main memory.410

On the other hand, when using float only one access is required. For our case,411

it was decided to write operations as appears in Listing 1. It is worth stressing that412

the cache mechanism wil not be sufficient to compensate a bad coalescence.413

5.3. Edge Ordering414

The third point deals with edge ordering. Edges are defined by the label of415

their neighbouring cells cid1 and cid2, and by their own label i. When operating416

on edges, functions like the one in Listing 6 are used. Here, the cell identifiers are417

stored in two vectors using an SoA approach to improve memory access. The way418

that accesses are made are the last point where coalescence is neccessary.419

420

Listing 6: Edges access example function. Detailed the access to the variables of interest

1 __global__ void edgeAccess(int *connectivityMatrix1, int *421

connectivityMatrix2...)422

2 int i = ThreadIdx.x+(BlockIdx.x*BlockDim.x);423

3 ...424

4 cid1=connectivityMatrix1[i];425

5 cid2=connectivityMatrix2[i];426

6 ...427

7 h1=h[cid1];428

8 h2=h[cid2];429

9 ...430

10 431

The lack of coalescence happens because the Threads in a Block will be as-432

signed consecutive edge identifiers, but will access variables which have been in-433

dexed by cells. See an example in Listing 6 where a function needs to operate on434

a variable at both sides of the edge. One of the Threads will access edge i, retrieve435

the neighboring cells cid1[i] and cid2[i], then access the cell-indexed depth vec-436

tor: h[cid1[i]]. The next Thread, operating on i + 1 will retrieve cid1[i + 1] and437

cid2[i + 1] in a coalesced way. There is no guarantee however, that cid1[i] and438

cid1[i + 1] are close to one another, and thus coalescence problems happen when439

this Thread requires access h[cid1[i+ 1]].440

441

Listing 7 is presented as a way to force coalescence by means of edges or-442

dering. Note that the matrix is symmetric and an element cid1, cid2 appears in443

cid2, cid1. To avoid this replication, it is neccessary to establish the matrix as a444

triangular matrix. Here only the element that verifies cid1 < cid2 has been kept.445

446

16

Listing 7: Ordering method for the connectivity matrix based on qsort algorithm

1 struct connectivityMatrix447

2 int cid1; int cid2;448

3 ;449

4450

5 // The necessary comparison operator451

6 // for the qsort operation452

7 int cmp(const connectivityMatrix *idx1, const connectivityMatrix *453

idx2)454

8 if(idx1->cid1<idx2->cid1)455

9 return(0);456

10 else457

11 if(idx1->cid1>idx2->cid1)458

12 return(1);459

13 else460

14 if(idx1->cid2<=idx2->cid2)461

15 return(0);462

16 else463

17 return(1);464

18 465

19 466

20 467

21 468

22469

23470

24 void createEdgeIdentifiers(int *connectivityMatrix1, int *471

connectivityMatrix2)472

25 // It is very useful to remove elements under the473

26 // main diagonal. It means, remove duplicated474

27 // elements in order to avoid replication of the475

28 // edges. After this operation has been applied,476

29 // there exist nEdges elements not replicated.477

30 // NOTE: This operation has been ommited478

31 edges=(connectivityMatrix*)malloc(sizeof(connectivityMatrix)*479

nEdges);480

32 for(i=0;i<nEdges;i++)481

33 edges[i].cid1=connectivityMatrix1[i];482

34 edges[i].cid2=connectivityMatrix2[i];483

35 484

36 // qsort is defined in stdlib. cmp is the function485

37 // previously defined which defined the criterion of486

38 // order487

39 qsort(edges,nEdges,sizeof(connectivityMatrix),cmp);488

40 for(i=0;i<nEdges;i++)489

41 connectivityMatrix1[i]=edges[i].cid1;490

42 connectivityMatrix2[i]=edges[i].cid2;491

43 492

44 493

17

With this ordering, defining the kth edge as wk = (cid1, cid2) , 1 < k <494

nEdges, and cid1 and cid2 being the identifiers of the neighbouring cells for that495

edge, the following property is enforced:496

wk−1 = (cida, cidb)|cida < cid1 ∨ [(cida = cid1) ∧ (cidb ≤ cid2)] (20)

Therefore, when Thread i in a Warp accesses position idx and Thread i + 1497

to the position idx+1, the cell data that these Threads must load are as close as498

possible. Applying this technique to the unstructured mesh presented earlier, it is499

possible to observe in Figure 7 how the proximity of the numbering can be ob-500

tained.501

502

0 5 10 15 20 25 30

0

5

10

15

20

25

30

35

0

5

10

15

20

25

30

35

40

Edge identifier

0 5 10 15 20 25 30

0

5

10

15

20

25

30

35

0

5

10

15

20

25

30

35

40

Edge identifier

Figure 7: Unstructured mesh RCM processed edges ordering. Original ordering (left), qsort post-

procesed ordering (right)

These three optimization strategies, cell ordering, the use of SoA and the edges503

ordering, help to achieve the proximity in the memory space, allowing Threads the504

accessing to the nearest possible taking the advantage of the coalescence within a505

Warp.506

6. Results507

In order to analyze the performance of our strategy, two test cases have been508

selected. Both test cases have been run on an Intel Core 2 Quad CPU Q9550 @509

2.83 GHz using an OpenMP shared memory parallelized implementation. In or-510

der to explode the Multi-threading of the processor, the comparisons include the511

18

simulation using 1 core and 4 cores. Furthermore, the GPU used for our tests is512

a NVIDIA Tesla C2070. The code of the GPU version is the same for all the513

combinations, implemented with SoA similarly to Listing 1. On the other hand,514

four combinations of GPU calculations have been tested for unstructured meshes.515

The first one (called GPU) makes an standard preprocess of the mesh, allocating516

data ordered by cell numbering and building the edges by node connectivity (i.e.517

following cells ordering). The second one (called GPU+RCM) have the cell num-518

bering reordered by the RCM method. The third (GPU+edgeOrd) orders only the519

edges. The fourth applies both RCM and Edge Ordering (GPU+RCM+edgeOrd).520

For Squared structured meshes there are two tests. One without ordering and the521

second one applying the edge ordering. As the meshes have been built as presented522

on the examples, the bandwidth, as defined in (19) is proportional to the square root523

of the number of cells.524

525

6.1. Test Case 1526

Test Case 1 is oriented to demonstrate the influence of the mesh when solving527

the shallow water equation and to analyze the behaviour of the GPU in squared528

structured meshes and triangular unstructured meshes. This test case is fully ex-529

plained in [28] and it is related to the oscillating evolution of the free surface in530

a radially symmetrical frictionless paraboloid. The solution is defined with the531

parameters r0 = 10m, a = 25.0m, h0 = 2.5m and L = 100m. Squared and532

triangular unstructured meshes are processed using RCM. The main properties of533

the triangular unstructured meshes and the reduction of their bandwidth are shown534

in Table 1. The results for this case are displayed for t=T/4, T/2, 3T/4 and T535

taking into account that ω = 2π/T and then, T = 11.2192s in Figure 8 and Figure536

9. Results show differences between both discretizations. The quadrilateral mesh537

creates preferential directionality along the axis. This has special interest with the538

evolution of the problem and much more when flooding cases are simulated.539

540

Table 3 and Table 4 show the total time t required by the complete simulation in541

all the implementations together with the time used per time step (t/ts). According542

to Table 3 quadrilateral grids require less steps due to the CFL condition. In that543

mesh, χ is three times bigger than the smallest one for the unstructured mesh and544

then the time step is three times larger. This is the main reason that explains that,545

for this case, the absolute execution time t for the squared mesh is smaller than546

for the unstructured mesh with the same number of cells. When the computational547

cost of each time step is analyzed, there is another point to have in consideration.548

For the same number of cells, the implicated edges in the calculus is bigger for the549

19

(a) t = T/4 (b) t = T/2

(c) t = 3T/4 (d) t = T

Figure 8: Comparison of Water Depth for structured (lower-left) and unstructured (upper-right)

meshes

structured mesh.550

551

Comparing CPU and GPU time-step execution cost, the speed-up goes from552

20

(a) t = T/4 (b) t = T/2

(c) t = 3T/4 (d) t = T

Figure 9: Comparison of Velocity for structured (lower-left) and unstructured (upper-right) meshes

15x to 30x comparing the mono-core implementation and from 6-11x comparing553

with the 4 core version. As many authors mention (e.g [29]), squared meshes have554

better results because of their implicit order in the creation of the mesh but ap-555

plying our strategies with unstructured meshes, the performance may be doubled556

21

compared against the non-optimized version going from 30x to nearly 60x for com-557

parison respect to the 1 core run time and from 8x to 20x compared with OpenMP558

version using 4 core.559

560

Mesh

10000 40000 100000 200000 400000

nCells 10015 40030 100423 200797 400680

ϕ0 9525 39794 100401 200735 400283

ϕRCM 118 243 372 522 752

Table 1: Bandwidth analysis for the triangular unstructured meshes used in Test Case 1

0.1

1

10

100

1000

10000

10000 Cells 40000 Cells 100000 Cells 200000 Cells 400000 Cells

Log(

Com

puta

tiona

l Cos

t) (

s)

CPU 1 CoreCPU 4 Core

GPUGPU+RCM

GPU+EdgeOrderingGPU+RCM+EdgeOrdering

0

10

20

30

40

50

60

4 Core GPU GPU+RCM GPU+EdgeOrdaring GPU+RCM+EdgeOrdering

Spe

ed-U

p


Figure 10: Computational cost of the different implementations using different mesh sizes (upper)

and speed-up (lower) for test case 1

It is very important to note that the size of the computational domain varies561

along the simulation. In the code, there are some control flow statements in order562

to apply calculations when a pair of cells are dry. This implies that there exists563

Thread divergence because no calculations are made when two neighboring cells564

are dry and, as explained in section 5, this may reduce the GPU performance. By565

means of the purposed technique, this limitation is minimized as ordering the cells566

22

improves locallity making more probable the application of the same calculation567

for all the Threads in the Warp, and the Thread divergence limitation is strongly568

reduced. In other words, the reordering leads to the warp to process cells with the569

same propierties and then the if or else action will be applied in the whole warp.570

6.2. Test Case 2571

In the second test case, a dam break problem in a channel with two bumps is572

considered, and numerical simulations are compared with experimental data. The573

experimental data were obtained by Aureli et al. [30]. The geometry leads to574

highly unsteady flow, including shock formation, strong surface curvatures and re-575

verse flow. Therefore, this case is useful to simulate it using a uniform refined mesh576

and because of it character of rapidly propagation, a good compromise between the577

different meshes is obtained, allowing good measurements in the comparison of a578

full computational load case. The flume was rectangular in section, 1.0 m wide,579

0.5 m high and 7.0 m long. The dam was placed at x=2.25 m from the head of580

the reservoir (x=0). The water surface elevation in the reservoir is 0.45m. Dry581

conditions are considered downstream the gate. The slopes of both obstacles are582

symmetric. Friction losses are computed using the Manning roughness coefficient583

n = 0.007 sm−1/3. As in the previous case, the computational domain is repre-584

sented in unstructured meshes with cell count ranging from 10000 to 400000 cells,585

setting CFL=0.9 in all cases. Measurements of water depth versus time were586

made based on video recordings of the flow. There are three gauging points: P1587

at x=1.40m, P2 at x=2.25m and P3 at x=4.5m Also, velocity measurements were588

done at gauging point P1.589

590

Mesh

10000 40000 100000 200000 400000

nCells 10171 40033 100138 200440 400186

ϕ0 10094 36613 100098 199658 400106

ϕRCM 39 86 129 204 262

Table 2: Bandwidth analysis for the triangular unstructured meshes used in Test Case 2

Computed and measured data are in good agreement (Figure 13). Differences591

can be explained by the limits of the mathematical model, that considers purely592

hydrostatic forces along the flow. On the other hand, computational effort is also593

reduced by means of this techniques. When applying RCM (Table 2, Figure 11), ϕ594

varies 3 orders of magnitude, leading to a good coalesced pattern for this mesh.595

596

23

Figure 11: RCM method impact for the 100000 cells mesh used in Test Case 2. Original mesh

(upper) and RCM processed mesh (lower)

The behaviour of the technique is very similar to the one found in the previous597

test case. Those domains containing less than 50000 cells, offer gain values of598

around 15% when applying all the proposed techniques. When using more than599

100000 computational cells, the speed-up reaches values around 60x representing600

an increase of around a 23%. It is very important to note that, as in the previous601

case, applying only RCM is not enough. In fact, the sole application of the RCM602

could lead to a worse performance because although the cell access pattern will603

be coalesced, the edge access pattern could be less coalesced as a consequence of604

RCM. In that case, the speed-up could be lower than for computing without any605

modification in the mesh.606

607

Contrary to the previous test case where the wet domain changes all the time,608

in this second test case all the domain is included almost immediatly after the gate609

removal at the beginning in the calculation. This avoids Thread divergences and610

hence the speed-up for the not optimized version is higher.611

7. Conclusions612

A general implementation of a finite volume scheme on Graphic Processors613

Unit has been analyzed in this work. Unstructured meshes are useful for many614

fields of the engineering but GPU designing is oriented to structured meshes. The615

results have been compared with an OpenMP implementation. In this work, some616

24

1

10

100

1000

10000

100000


Log(

Com

puta

tiona

l Cos

t) (

s)

CPU 1 CoreCPU 4 Core

GPUGPU+RCM

GPU+EdgeOrderingGPU+RCM+EdgeOrdering

0

10

20

30

40

50

60

70

4 Core GPU GPU+RCM GPU+EdgeOrdaring GPU+RCM+EdgeOrdering

Spe

ed-U

p


Figure 12: Computational cost of the different implementations using different mesh sizes (upper)

and speed-up (lower) for test case 2

improvements on the way of preprocessing the meshes have demonstrated good617

results, improving the performance of the GPU implementation when unstructured618

meshes are used. Specifically the use of SoA for the definition of variables, and619

the application of cell and edge ordering algorithms have been shown to have a620

noticeable influence on GPU performance. When not all the domain is used, these621

strategies allow coalesced memory to avoid Thread divergences. When almost622

all the domain is used, the profits of the coalesced memory access is obtained623

increasing in both cases the performance between 10% − 25% with small meshes624

and with no moving boundaries and 60%−100% with more than 100000 cells and625

variable domain.626

These results provide a very usuful technique that lead to have a very efficient627

implementation on modern GPUs. It is notting worth that OpenMP parallel version628

is really easy to be implemented in a reasonable time while the CUDA version629

require more time as well as to be carefull with details related to the architecture.630

The proposed strategy can be extended to other solvers based on calculations by631

edges and also to other methods where calculations and storage are performed by632

nodes to simulate unsteady flows on unstructured meshes.633

25

0

0.1

0.2

0.3

0.4

0.5

0 2 4 6 8 10 12 14

h (m

)

t (s)

P1, x=1.4m

-0.2

0

0.2

0.4

0.6

0 2 4 6 8 10 12 14

u (m

/s)

t (s)

P3, x=4.5m

0

0.1

0.2

0.3

0.4

0.5

0 2 4 6 8 10 12 14

h (m

)

t (s)

P1, x=1.4m

0

0.1

0.2

0.3

0.4

0.5

0 2 4 6 8 10 12 14

h (m

)

t (s)

P2, x=2.25m

Figure 13: Water level evolution in time. Experimental data (− • −) and numerical results (−−)

for Test Case 2

26

Figure 14: Evolution of the Test Case 2 for t=1s., 2s., 3s., 4s., 5s., and 8s.

8. Acknowledgments634

This work has been partially supported by the European Social Fund (Aragon635

Government). We also thank NVIDIA for the donation of hardware facilities.636

27

References637

[1] A. Danowitz, K. Kelley, J. Mao, J. P. Stevenson, M. Horowitz, CPU DB:638

Recording microprocessor history, Queue 10 (2012) 10:10–10:27.639

[2] T. Tomczak, K. Zadarnowska, Z. Koza, M. Matyka, L. Miroslaw, Accelera-640

tion of iterative navier-stokes solvers on graphics processing units, Interna-641

tional Journal of Computational Fluid Dynamics 27 (????).642

[3] J. M. Domínguez, A. J. Crespo, M. Gómez-Gesteira, Optimization strategies643

for CPU and GPU implementations of a smoothed particle hydrodynamics644

method, Computer Physics Communications 184 (2013) 617 – 627.645

[4] W.-Y. Liang, T.-J. Hsieh, M. T. Satria, Y.-L. Chang, J.-P. Fang, C.-C. Chen,646

C.-C. Han, A GPU-based simulation of tsunami propagation and inunda-647

tion, in: Proceedings of the 9th International Conference on Algorithms and648

Architectures for Parallel Processing, ICA3PP ’09, Springer-Verlag, Berlin,649

Heidelberg, 2009, pp. 593–603.650

[5] M. J. Castro, S. Ortega, M. de la Asunción, J. M. Mantas, J. M. Gallardo,651

GPU computing for shallow water flow simulation based on finite volume652

schemes, Comptes Rendus Mécanique 339 (2011) 165 – 184.653

[6] M. Abouali, J. Timmermans, J. E. Castillo, B. Z. Su, A high performance654

GPU implementation of surface energy balance system (sebs) based on cuda-655

c, Environmental Modelling & Software 41 (2013) 134 – 138.656

[7] A. Lacasta, P. García-Navarro, J. Burguete, J. Murillo, Preprocess static sub-657

domain decomposition in practical cases of 2D unsteady hydraulic simula-658

tion, Computers & Fluids 80 (2013) 225 – 232. Selected contributions of the659

23rd International Conference on Parallel Fluid Dynamics ParCFD2011.660

[8] B. F. Sanders, J. E. Schubert, R. L. Detwiler, Parbrezo: A parallel, unstruc-661

tured grid, godunov-type, shallow-water code for high-resolution flood in-662

undation modeling at the regional scale, Advances in Water Resources 33663

(2010) 1456 – 1467.664

[9] M. Castro, J. García-Rodríguez, J. González-Vida, C. Parés, A parallel 2d665

finite volume scheme for solving systems of balance laws with nonconserva-666

tive products: Application to shallow flows, Computer Methods in Applied667

Mechanics and Engineering 195 (2006) 2788 – 2815.668

28

[10] A. J. Kalyanapu, S. Shankar, E. R. Pardyjak, D. R. Judi, S. J. Burian, Assess-669

ment of GPU computational enhancement to a 2d flood model, Environmen-670

tal Modelling & Software 26 (2011) 1009 – 1016.671

[11] A. R. Brodtkorb, M. L. Sætra, M. Altinakar, Efficient shallow water simu-672

lations on GPUs: Implementation, visualization, verification, and validation,673

Computers and Fluids 55 (2012) 1 – 12.674

[12] S. Rostrup, H. D. Sterck, Parallel hyperbolic pde simulation on clusters: Cell675

versus GPU, Computer Physics Communications 181 (2010) 2164 – 2179.676

[13] D. Jacobsen, J. C. Thibault, I. Senocak, An MPI-CUDA Implementation for677

Massively Parallel Incompressible Flow Computations on Multi-GPU Clus-678

ters, in: American Institute of Aeronautics and Astronautics (AIAA) 48th679

Aerospace Science Meeting Proceedings.680

[14] D. Caviedes-Voullième, P. García-Navarro, J. Murillo, Influence of mesh681

structure on 2d full shallow water equations and scs curve number simulation682

of rainfall/runoff events, Journal of Hydrology 448–449 (2012) 39 – 59.683

[15] J. E. Schubert, B. F. Sanders, M. J. Smith, N. G. Wright, Unstructured mesh684

generation and landcover-based resistance for hydrodynamic modeling of ur-685

ban flooding, Advances in Water Resources 31 (2008) 1603 – 1621.686

[16] N. Whitehead, A. Fit-Florea, Precision & performance: Floating point and687

ieee 754 compliance for nvidia GPUs, Technical white paper by NVIDA,688

2011.689

[17] J. Murillo, P. García-Navarro, J. Burguete, P. Brufau, The influence of source690

terms on stability, accuracy and conservation in two-dimensional shallow691

flow simulation using triangular finite volumes, International Journal for Nu-692

merical Methods in Fluids 54 (2007) 543–590.693

[18] J. Murillo, P. Garcia-Navarro, Weak solutions for partial differential equa-694

tions with source terms: Application to the shallow water equations, Journal695

of Computational Physics Volume: 229 Issue: 11 Pages: 4327-4368 (2010).696

[19] P. Roe, Approximate riemann solvers, parameter vectors, and difference697

schemes, Journal of Computational Physics 43 (1981) 357 – 372.698

[20] M. Morales-Hernández, P. García-Navarro, J. Burguete, P. Brufau, A conser-699

vative strategy to couple 1d and 2d models for shallow water flow simulation,700

Computers & Fluids 81 (2013) 26 – 44.701

29

[21] J. Murillo, P. García-Navarro, J. Burguete, Time step restrictions for well-702

balanced shallow water solutions in non-zero velocity steady states, Interna-703

tional Journal for Numerical Methods in Fluids 60 (2009) 1351–1377.704

[22] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2011.705

[23] P. N. Glaskowsky, Nvidia ’ s fermi : The first complete GPU computing ar-706

chitecture, A white paper prepared under contract with NVIDIA Corporation707

(2009) 1–26.708

[24] F. De Vuyst, C. Labourdette, Some examples for cfd instant computations on709

GPU, lbm for incompressible fluids, vfm for compressible fluid, in: Work-710

shop Trends in physical and numerical modeling of multiphase flows,CFD711

and its experimental validation for multiphase flows.712

[25] A. Brodtkorb, T. Hagen, M. Saetra, GPU programming strategies and trends713

in GPU computing, Journal of Parallel and Distributed Computing (In Press)714

(2012).715

[26] M. Pharr, R. Fernando, GPU Gems 2: Programming Techniques for High-716

Performance Graphics and General-Purpose Computation (GPU Gems),717

Addison-Wesley Professional, 2005.718

[27] J. R. Shewchuk, Triangle: Engineering a 2D Quality Mesh Generator and719

Delaunay Triangulator, in: M. C. Lin, D. Manocha (Eds.), Applied Compu-720

tational Geometry: Towards Geometric Engineering, volume 1148 of Lecture721

Notes in Computer Science, Springer-Verlag, 1996, pp. 203–222. From the722

First ACM Workshop on Applied Computational Geometry.723

[28] W. C. Thacker, Some exact solutions to the nonlinear shallow-water wave724

equations, Journal of Fluid Mechanics 107 (1981) 499–508.725

[29] D. Komatitsch, G. Erlebacher, D. Göddeke, D. Michéa, High-order finite-726

element seismic wave propagation modeling with mpi on a large GPU cluster,727

Journal of Computational Physics 229 (2010) 7692 – 7714.728

[30] F. Aureli, P. Mignosa, M. Tomirotti, Dam-break flows in presence of abrupt729

bottom variations, in: Proceedings of The XXVIII IAHR International730

Congress.731

30

GPU Speed-Up

t (ms) t/ts vs. CPU1Core vs. CPU4Core

10000 Cells

CPU1Core - Unst. Mesh 4991.81 5.57

CPU1Core - Sq. Mesh 3953.16 7.40

CPU4Core - Unst. Mesh 1783.99 1.99 2.79

CPU4Core - Sq. Mesh 1470.98 2.75 2.69

GPU - Unst. Mesh 295.56 0.33 16.89 5.43

GPU+RCM - Unst. Mesh 306.89 0.34 16.27 5.23

GPU+edgeOrd - Unst. Mesh 283.04 0.32 17.64 5.67

GPU+RCM+edgeOrd - Unst. Mesh 267.63 0.30 18.65 6.00

GPU - Sq. Mesh 178.28 0.33 22.17 8.25

GPU+edgeOrd - Sq. Mesh 177.29 0.33 22.30 8.67

40000 Cells


CPU1Core - Sq. Mesh 35323.78 32.53


CPU4Core - Sq. Mesh 14054.87 12.94 2.51

GPU - Unst. Mesh 1921.82 0.96 30.23 8.99

GPU+RCM - Unst. Mesh 2120.12 1.06 27.41 8.15



GPU - Sq. Mesh 839.92 0.77 42.06 16.74


100000 Cells


CPU1Core - Sq. Mesh 160182.18 88.21


CPU4Core - Sq. Mesh 66227.14 36.47 2.42

GPU - Unst. Mesh 8349.61 2.53 31.63 10.57

GPU+RCM - Unst. Mesh 9174.20 2.78 28.78 9.62



GPU - Sq. Mesh 3392.74 1.87 47.21 19.52


200000 Cells


CPU1Core - Sq. Mesh 455589.91 176.59


CPU4Core - Sq. Mesh 170974.87 66.27 2.66

GPU - Unst. Mesh 26533.18 5.13 31.69 11.26

GPU+RCM - Unst. Mesh 28943.84 5.59 29.05 10.32



GPU - Sq. Mesh 8998.99 3.49 50.63 19.00


400000 Cells


CPU1Core - Sq. Mesh 1299827.96 352.73


CPU4Core - Sq. Mesh 525174.53 142.51 2.48

GPU - Unst. Mesh 109699.40 11.13 30.28 11.58

GPU+RCM - Unst. Mesh 117526.78 11.92 28.27 10.81



GPU - Sq. Mesh 24195.92 6.57 53.72 21.70


Table 3: Test Case 1: CPU and GPU total execution time (t) and per time step (t/ts) together with

performance of the GPU version using different configurations. CPU1Core a sequential version on

an Intel Core 2 Quad CPU Q9550 @ 2.83 GHz and CPU4Core is the OpenMP Implemantation

using 4 Cores

31

GPU Speed-Up

t (ms) t/ts vs. CPU1Core vs. CPU4Core

10000 Cells

CPU1Core 41532.30 7.49

CPU4Core 36165.99 2.94 2.51

GPU 1969.88 0.36 21.08 7.36

GPU+RCM 2013.51 0.36 20.63 7.20

GPU+edgeOrd 1882.64 0.34 22.06 7.70

GPU+RCM+edgeOrd 1738.89 0.31 23.88 8.34

40000 Cells

CPU1Core 505401.30 33.40

CPU4Core 417096.00 15.51 2.15

GPU 13762.87 0.91 36.72 11.63

GPU+RCM 15138.44 1.00 33.39 10.58

GPU+edgeOrd 12613.27 0.83 40.07 12.69

GPU+RCM+edgeOrd 11102.60 0.73 45.52 14.42

100000 Cells

CPU1Core 2295837.22 88.48

CPU4Core 1835517.78 36.50 2.42

GPU 51589.11 1.99 44.50 12.46

GPU+RCM 58088.87 2.24 39.52 11.07

GPU+edgeOrd 46221.68 1.78 49.67 13.91

GPU+RCM+edgeOrd 41278.49 15.57 55.62 44.47

200000 Cells

CPU1Core 4949436.05 180.39

CPU4Core 3912997.63 79.13 2.27

GPU 103658.95 3.78 47.75 17.65

GPU+RCM 120357.69 4.39 41.12 15.20

GPU+edgeOrd 93125.04 3.39 53.15 19.64

GPU+RCM+edgeOrd 82877.25 3.02 59.72 22.06

400000 Cells

CPU1Core 14448850.66 359.22

CPU4Core 11717537.88 169.62 2.11

GPU 306335.43 7.62 47.17 17.43

GPU+RCM 350376.72 8.71 41.24 15.24

GPU+edgeOrd 269668.53 6.70 53.58 19.80

GPU+RCM+edgeOrd 234851.10 5.84 61.52 22.74

Table 4: Test Case 2: CPU and GPU total execution time (t) and per time step (t/ts) together with

performance of the GPU version using different configurations. CPU1Core a sequential version on

an Intel Core 2 Quad CPU Q9550 @ 2.83 GHz and CPU4Core is the OpenMP Implemantation

using 4 Cores

32

Documents

An optimized GPU implementation of a 2D free surface simulation … · 2014-09-23 · An optimized GPU implementation of a 2D free surface simulation model on unstructured meshes