
Journal of Computational and Applied Mathematics 270 (2014) 109–120


Solving a large scale radiosity problem on GPU-based parallel computers

Eduardo D’Azevedo a, Zhiang Hu b, Shi-Quan Su c,∗, Kwai Wong c

a Computer Science and Mathematics Division, Oak Ridge, TN 37831, United States
b Chinese University of Hong Kong, Shatin, Hong Kong, China
c Joint Institute for Computational Sciences, University of Tennessee, Knoxville, TN, United States

Article info

Article history:
Received 4 October 2013
Received in revised form 24 January 2014

Keywords:
Radiosity
View factor calculation
Cholesky decomposition
Out-of-core algorithm
Hybrid multicore/GPU system

Abstract

The radiosity equation has been used widely in computer graphics and thermal engineering applications. The equation is simple to formulate but is challenging to solve when the number of Lambertian surfaces associated with an application becomes large. In this paper, we present the algorithms to compute the view factors and solve the set of radiosity equations using an out-of-core Cholesky decomposition method. This work details the algorithmic procedures of the computation of the view factors and the Cholesky solver. The data layout of the radiosity matrix follows the block cyclic decomposition scheme used in ScaLAPACK. The parallel computation of the view factors on the GPUs extends the algorithms based on a serial community code called view3d. To handle large matrices that exceed the device memory on the GPU, an out-of-core algorithm for parallel Cholesky factorization is implemented. A performance study conducted on Keeneland, a hybrid CPU/GPU cluster at the National Institute for Computational Sciences composed of 264 nodes of multicore CPUs and GPUs, is shown and discussed.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

The average amount of radiating energy on a set of diffusely reflecting and absorbing Lambertian surfaces can be obtained by solving a linear system of radiosity equations. As the number of Lambertian surfaces associated with an application becomes large, iterative methods with progressive radiosity techniques [1] are typically used to avoid the cost of factorizing a huge matrix. On the other hand, in order to maintain realistic diffuse reflection in a scene, the ideal solution is to account for the intensity of radiation coming from every surface, resorting to solving an all-to-all dense matrix. For an unsteady thermal application in which the mesh configuration remains unchanged, the triangular factor of the radiosity matrix can be computed once and reused subsequently in every time step. Although direct solvers are expensive to use, they are efficient and readily available on parallel supercomputers.

Most of the supercomputers nowadays, including a number of the fastest computers in the world, are composed of accelerators connected to the CPU host unit via a PCI-Express bus. A single accelerator, such as the latest GPU by NVIDIA and MIC by Intel, often has peak performance beyond 1 TFLOPS, generally a ten-fold difference to that of the CPU unit. To attain high throughput on the accelerators, the underlying numerical algorithms must take into account the architecture of the entire heterogeneous system.

∗ Corresponding author. Tel.: +1 9193954869.
E-mail addresses: [email protected] (E. D’Azevedo), [email protected] (Z. Hu), [email protected] (S.-Q. Su), [email protected] (K. Wong).

http://dx.doi.org/10.1016/j.cam.2014.02.011
0377-0427/© 2014 Elsevier B.V. All rights reserved.


Fig. 1. Schematic diagram of a computer node of Keeneland.

It must also strike a balance between the cost of computation and communication as well as pay attention to the hierarchical memory structure of the multi-CPU and GPU/MIC processors, particularly on a large scale distributed memory system. Fig. 1 shows the schematic layout of a heterogeneous supercomputer named Keeneland at the National Institute for Computational Sciences. Keeneland is a 264-node cluster delivering a total peak double precision performance of 615 TFLOPS. Each node has 32 GB of host memory, two Intel Sandy Bridge CPUs and three NVIDIA M2090 GPUs. Each M2090 GPU has a peak performance of 665 GFLOPS and 6 GB of device memory. To harness the computing power of Keeneland, we borrow the idea of the out-of-core algorithm, which effectively uses the largest amount of the host memory, progressively moves chunks of data to the device memory, and then performs BLAS3 operations on the GPUs to their maximum capacity.

The radiosity matrix requires the computation of the view factors that depend only on the geometry and orientation of the two interacting surfaces. Walton [2] has listed several commonly used algorithms to compute the view factors in his view3d code. Calculating the view factors is a computationally intensive process but can be done in parallel. Based on his serial view factor algorithms, we have extended Walton’s work to compute the view factors on a parallel computer equipped with GPU accelerators. In the parallel code, every processor calculates – taking into account the effect of obstruction – the portion of view factors arranged in the block cyclic data decomposition format used in ScaLAPACK.

Given the reciprocity property of the view factors, the radiosity matrix can be rewritten as a symmetric positive definite (SPD) matrix. Taking advantage of the architecture of the hybrid multicore and multi-GPU system of Keeneland, a new parallel out-of-core Cholesky decomposition solver is built to solve the matrix. There are many implementations devoted to dense linear algebra libraries for heterogeneous systems [3–13]. In general, the performance of the referenced numerical library functions is tuned and optimized [14–20] to attain maximum throughput. The out-of-core procedure adopts both the left-looking and the right-looking algorithms in the factorization steps. The left-looking algorithm minimizes the communication cost [21] to update the part of the matrix on the GPUs while the right-looking one gives better computing performance on the GPUs.

In Section 2 of this paper, we present the approaches and some results of computing the view factors in parallel. In Section 3, the radiosity matrix is constructed from the view factor matrix and rewritten as an SPD matrix. Section 4 describes a hybrid CPU/GPU based Cholesky decomposition algorithm for solving the SPD matrix. Section 5 shows and discusses the performance of the parallel Cholesky procedure on Keeneland. Section 6 highlights the conclusions and some of the future work in progress.

2. View factors computation

The transient thermal condition of a set of objects in an open space environment is governed by the following energy equations. Eq. (1) defines the energy principles and Eq. (2) provides the boundary closure to the system.

\rho C_p \frac{\partial T}{\partial t} - \nabla \cdot [k(x)\nabla T] - s(x, t) = 0  \quad (1)

k\nabla T \cdot n + h_{\mathrm{conv}}(T - T_{\mathrm{atm}}) + \sigma\epsilon(T^4 - T^4_{\mathrm{ref}}) + f(t) \cdot n = 0.  \quad (2)

Here ρ, Cp, k, ϵ, σ denote density, specific heat, conductivity, thermal emissivity and the Stefan–Boltzmann constant, respectively. The convective terms in Eq. (2) are handled through a series of correlations that depend on the surface velocity distribution calculated by the fluid flow simulation. The radiation effect between diffusely reflecting and absorbing surfaces can be formulated by the radiosity equations that require computing the view factors between all the participating surfaces.

2.1. View factor formulations in view3d

Walton [22] presented four approaches to compute the view factors between two surfaces illustrated in Fig. 2. They are double area integral (2AI), double line integral (2LI), single area integral (1AI), and single line integral (1LI).

The fundamental expression of (2AI) for a view factor between two Lambertian surfaces is

F_{1\to 2} = \frac{1}{\pi A_1} \int_{A_1}\int_{A_2} \frac{\cos\theta_1 \cos\theta_2}{r^2}\, dA_2\, dA_1,  \quad (3)


Fig. 2. Geometry of the view factor between two surfaces, adopted from Walton.

where A1 and A2 are the areas of surfaces 1 and 2, θ1 and θ2 are the angles between the unit normals n1 and n2 to the surface differential elements dA1 and dA2 and the vector, r⃗, between those differential elements.

Hottel and Sarofim [23] also state that, by Stokes’ theorem, the double area integration can be converted into a double contour integration (2LI).

F_{1\to 2} = \frac{1}{2\pi A_1} \oint_{C_1}\oint_{C_2} \ln(|r|)\, d\vec{v}_2 \cdot d\vec{v}_1,  \quad (4)

where C1 and C2 are the boundary contours of the surfaces and r is the distance between dv⃗1 and dv⃗2, which are vector differential elements on the two contours.

The 1AI formula derived by Hottel and Sarofim [23] is

F_{1\to 2} = \frac{1}{2\pi A_1} \int_{A_2} \sum_{i=1}^{E_2} \vec{g}_i \cdot \vec{n}_1,  \quad (5)

where n⃗1 is the unit normal to polygon A1 and g⃗i is a vector whose magnitude is equal to the angle subtended by an edge of polygon A2.

Consider a and b to be two consecutive vertices defining an edge of A2 and p the point representing dA1. Let b⃗ be the vector from p to b and a⃗ the vector from p to a. Also let c⃗ = a⃗ × b⃗, e = ‖c⃗‖, and d = a⃗ · b⃗. Then the direction of g⃗ is given by c⃗/e and its magnitude by tan(g) = e/d, which leads to

F_{1\to 2} = \frac{1}{2\pi A_1} \int_{A_2} \sum_{i=1}^{E_2} \frac{\vec{c}_i \cdot \vec{n}_1}{e_i} \tan^{-1}\!\left(\frac{e_i}{d_i}\right).  \quad (6)

Using the above formula and replacing its integral by a summation over finite areas gives,

F_{1\to 2} \approx \frac{1}{2\pi A_1} \sum_{j=1}^{N_2} \sum_{i=1}^{E_2} \frac{\vec{c}_i \cdot \vec{n}_1}{e_i} \left(\frac{\pi}{2} - \tan^{-1}\frac{d_i}{e_i}\right) |A_j|.  \quad (7)

Walton also tested the following 1LI formula presented by Mitalas and Stephenson [24]:

F_{1\to 2} = \frac{1}{2\pi A_1} \sum_{p=1}^{E_1} \sum_{q=1}^{E_2} \cos(\phi_{pq}) \int \left( t\cos(h)\ln(t) + s\cos(g)\ln(s) + u f - r \right) d\vec{v}_1,  \quad (8)

where vectors s, t, and u and angles f, g, and h are functions derived analytically.

2.2. Numerical Gaussian quadrature

Walton indicated that when the relative separation – the distance between the two surfaces divided by the sum of the radii enclosing both surfaces – is greater than three, the numerical Gaussian quadrature approach produces very accurate view factors quickly. In order to avoid insufficient accuracy and a waste of computation time, an adaptive integration technique is used. The procedure begins with one edge division, k = 1, and then computes successive values of the view factor with an increasing number of divisions until |AF[k+1] − AF[k]| < ϵAmin is satisfied, where AF = F1→2 × A1 and ϵ is defined by the user.

The order of computation for these algorithms is: 2AI—O(N^4), 2LI—O(8N^2), 1AI—O(4N^2), and 1LI—O(8N), where N is the number of edge divisions. The 1AI and 1LI algorithms are used in the parallel computation.
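To make the 1AI edge-sum construction and the adaptive stopping rule concrete, the following Python/NumPy sketch evaluates the point-to-polygon contribution used in Eqs. (6)–(7) and refines the quadrature until the convergence test above is met. It is only an illustration under simplifying assumptions (planar, unobstructed, convex polygons; a user-supplied subdivision routine), and the function names are ours, not part of view3d.

    import numpy as np

    def point_to_polygon_factor(p, n1, poly2):
        # View factor from a differential element at point p with unit normal n1
        # to a convex polygon poly2 (sequence of 3D vertices), via the edge sum
        # c = a x b, e = ||c||, d = a.b of Section 2.1. Note arctan2(e, d) equals
        # pi/2 - arctan(d/e), the form used in Eq. (7).
        total = 0.0
        m = len(poly2)
        for i in range(m):
            a = np.asarray(poly2[i], dtype=float) - p
            b = np.asarray(poly2[(i + 1) % m], dtype=float) - p
            c = np.cross(a, b)
            e = np.linalg.norm(c)
            if e < 1e-30:                      # degenerate edge contributes nothing
                continue
            d = np.dot(a, b)
            total += np.dot(c, n1) / e * np.arctan2(e, d)
        return total / (2.0 * np.pi)

    def adaptive_view_factor(subdivide, n1, poly2, A1, Amin, eps=1e-4, kmax=8):
        # Increase the number of edge divisions k until |AF[k+1] - AF[k]| < eps*Amin.
        # subdivide(k) must return (points, areas) of the sub-elements of the
        # integrated surface for k edge divisions (problem-specific, assumed given).
        AF_prev = None
        for k in range(1, kmax + 1):
            pts, areas = subdivide(k)
            AF = sum(point_to_polygon_factor(q, n1, poly2) * a
                     for q, a in zip(pts, areas))
            if AF_prev is not None and abs(AF - AF_prev) < eps * Amin:
                break
            AF_prev = AF
        return AF / A1                         # F_1->2 = AF / A1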


Fig. 3. Obstruction over two surfaces, adopted from Walton.

2.3. View factor calculation in obstructed surfaces

Obstructed surfaces, if they exist, are handled by a modification of the single area integration method, 1AI. As the obstruction casts a shadow from the first surface onto the plane of the second surface, shown in Fig. 3, the view factor F1→2 is computed by Eq. (7) summed around the edges of the unshaded area. When multiple obstructions occur, the procedure accounts for all the shadows, overlapping or not, forming a complex polygon which has to be decomposed into many simple convex polygons for computation. Geometric data of the polygons are expressed in homogeneous coordinates to facilitate the numerical processing of the convex polygons.

Adaptive integration is also applied to control the number of points used for the 1AI integration in the obstructed case. In an N surface problem there are potentially N − 2 view obstructions for every pair of surfaces. Besides skipping pairs of surfaces that do not face each other and clipping surfaces that are only partially visible to one another, the following steps are employed to reduce the number of potential obstructing surfaces:

1. Potential list: A potential obstructing list is created by excluding all surfaces that can never cause an obstruction—a surface for which all other surfaces are on the same side of it.

2. Before j-loop: the elements Fij in the lower part of the view factor matrix are computed by rows from i = 1 to i = N and within each row from j = 1 to j = i − 1. When the elements of row i are to be computed, a reduced list of obstructing surfaces is created excluding surfaces behind surface i.

3. Further tests: the list of potential obstructions is reduced further by three more tests: the cone radius test, the box test, and the orientation test (a sketch of the box test is given after this list).
(a) The cone radius test excludes the surfaces in the list that are completely outside of the cone or cylinder containing surface i and surface j.
(b) The box test excludes the surfaces in the list which are outside the minimum box containing surface i and surface j.
(c) The orientation test removes all surfaces k from the list such that: (1) surface k is entirely behind surface j, (2) both surface i and surface j are entirely in front of surface k, or (3) both surface i and surface j are entirely behind surface k.
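As a concrete illustration of the box test in (b), the following Python/NumPy sketch discards a candidate surface k when its axis-aligned bounding box does not intersect the minimum box containing surfaces i and j. It is a hedged sketch under our own conventions (each surface given as an array of vertex coordinates), not the view3d implementation.

    import numpy as np

    def box_test_keep(surf_i, surf_j, surf_k, tol=0.0):
        # surf_* are (nv, 3) arrays of vertex coordinates (our assumption).
        # Returns True if surface k may still obstruct the pair (i, j), i.e. its
        # bounding box overlaps the minimum box containing surfaces i and j.
        pair = np.vstack((surf_i, surf_j))
        lo, hi = pair.min(axis=0) - tol, pair.max(axis=0) + tol
        k_lo, k_hi = surf_k.min(axis=0), surf_k.max(axis=0)
        # boxes overlap only if their intervals overlap along every axis
        return bool(np.all(k_hi >= lo) and np.all(k_lo <= hi))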

2.4. Parallel view factor calculation

Based on the algorithms used in the view3d code, a parallel version of the code is built to run on a hybrid CPU/GPU-basedcomputer. The basic procedure of the parallel code is given as follows:

1. The root process reads the input file and broadcasts the surfaces’ information to the other processes.
2. Create a potential obstruction list.
3. Geometric data are distributed in a block cyclic fashion to the GPUs for computation.
4. Each process calculates the portion of view factors according to its portion of the decomposed data. Thread-based calculation is used on the GPUs.
5. Each process sends the result of its own portion to a parallel solver routine for factorization.

The following procedure highlights the handling of obstructing surfaces in parallel. The computation of the obstruction list is performed on the CPU cores.

1. Each processor independently creates the list of relevant surfaces to test according to the data decomposition. The complexity of this method depends on the data decomposition. The ScaLAPACK two-dimensional block cyclic data decomposition is used to evenly distribute the work. The complexity is around O(N^2/M), where N is the number of surfaces and M is the number of processors.


Fig. 4. ScaLAPACK data decomposition; every processor gets a 4 × 3 matrix.

Fig. 5. Multiple object configuration.

2. We use a column-major data order to store the result. The parallel code calculates one sub-column at a time to obtain the modified view factor matrix. Before calculating the view factors of each column, we exclude the surfaces in the potential obstructing list that are behind the surface corresponding to the column.

3. The cone radius test, box test, and orientation test will be conducted to reduce the potential obstructing list.

2.5. Data decomposition and performance in view factor calculation

After the view factors are calculated, they are rearranged to form an SPD radiosity matrix. The traditional block cyclic decomposition scheme defined in ScaLAPACK is used to compute the view factors. There are three primary parameters in the ScaLAPACK library: the number of rows and columns in the processor grid, P and Q, and the block size, NB, which is optimized to the cache size of a specific CPU. Fig. 4 shows an example with 6 processors, P = 2, Q = 3, NB = 1.
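As an illustration of the 2D block cyclic layout, the following sketch maps a global matrix entry (i, j) to the coordinates of the owning process for given P, Q and NB; it reproduces the assignment of Fig. 4 for P = 2, Q = 3, NB = 1. The helper is ours (zero-based indices, zero source offsets), not a ScaLAPACK routine.

    def block_cyclic_owner(i, j, P, Q, NB):
        # Process (p, q) that owns global entry (i, j) in a 2D block cyclic
        # distribution with block size NB (0-based indices, zero offsets).
        return (i // NB) % P, (j // NB) % Q

    # Example: on a 2 x 3 grid with NB = 1, entry (1, 2) is owned by process (1, 2),
    # and entry (2, 3) by process (0, 0).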

We used two benchmark tests to examine the speed-up of the parallel view factor code. The first case is a simple L-shape object consisting of 20000 surfaces in which all view factors are unobstructed. The speed-up is almost linear in computing the view factors. The time to construct the list of obstructions is almost constant for the L-shape object because there is no obstruction: the detection algorithm spends O(N^2) checks to make sure that no surface can be an obstruction to other surfaces. The second test is a multiple-object test (M-Object), shown in Fig. 5, in which obstruction exists. The results are shown in Table 1. In this test, the cost of detecting the obstruction list is reduced by the fact that once a surface A is detected to be a potential obstruction of any other surface, the detection procedure stops, adds it to the list of potential obstructions, and proceeds to the next surface.

3. Radiosity system of linear equations

Assuming a gray diffuse environment is represented by N surfaces, the radiosity Ri on surface i is determined by the sum of the emitted and the reflected energy,

\epsilon_i R_i A_i = E_i A_i + \rho_i \sum_{j=1}^{N} R_j A_j F_{ij}, \quad \text{for each } i = 1, 2, \ldots, N,  \quad (9)


Table 1
Timing of the view factor calculation of the L-shape test and M-object test.

No. of processes   Obstruction, L shape (s)   Obstruction, M object (s)   View factor, L shape (s)   View factor, M object (s)
4                  140.9                      752                         113.2                      736.7
8                  101.6                      513.5                       56.9                       496.5
16                 75.0                       273.6                       29.3                       257.5
32                 60.4                       145.9                       14.4                       128.8
64                 53.2                       80.3                        7.1                        63.9
128                49.8                       48.2                        3.7                        31.0
256                48.1                       35.3                        1.8                        17.4
1024               47.3                       22.3                        0.4                        4.9

where the radiosity is being calculated on surface i, j represents a single surface, Ei is the emitted energy, ρ is the reflectivity, ϵ is the emissivity, which equals (1 − ρ), and Fij is the view factor between surfaces i and j. In thermal simulations Ei can often be represented by the Stefan–Boltzmann law,

E_i = \epsilon \sigma T_i^4.  \quad (10)

Eq. (9) can be rewritten as

\sum_j \left( \delta_{ij} A_i - \phi_i A_j F_{ij} \right) R_j = \phi_i \sum_j G_{ij} R_j = \sigma T_i^4 A_i,  \quad (11)

where φ = ρ/ϵ. This is a system of linear equations, Gx = b, in which G is a symmetric positive definite matrix because of the reciprocity property of the view factors, AiFij = AjFji. They are given by

G = \begin{pmatrix}
\dfrac{A_1}{\phi_1} - A_1 F_{11} & -A_2 F_{12} & \cdots & -A_N F_{1N} \\
-A_1 F_{21} & \dfrac{A_2}{\phi_2} - A_2 F_{22} & \cdots & -A_N F_{2N} \\
\vdots & & \ddots & \vdots \\
-A_1 F_{N1} & -A_2 F_{N2} & \cdots & \dfrac{A_N}{\phi_N} - A_N F_{NN}
\end{pmatrix},  \quad (12)

x = (R_1, R_2, \ldots, R_N)^T, \qquad b = \left( \sigma T_1^4 A_1/\phi_1, \ldots, \sigma T_N^4 A_N/\phi_N \right)^T.  \quad (13)

Since G is an SPD matrix, the Cholesky factorization can be used to solve the system of radiosity equations.
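For a small problem that fits in one node's memory, the assembly of G and the Cholesky solve can be sketched in a few lines of Python/NumPy/SciPy. This is only a dense, in-memory illustration of Eqs. (12)–(13), with our own function name and argument conventions, not the distributed out-of-core solver of Section 4.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_radiosity(F, A, T, rho, eps, sigma=5.670e-8):
        # F: (N, N) view factor matrix satisfying reciprocity A_i F_ij = A_j F_ji
        # (which makes G symmetric); A: areas, T: temperatures,
        # rho: reflectivities, eps: emissivities.
        phi = rho / eps                            # phi_i = rho_i / eps_i
        G = -F * A[np.newaxis, :]                  # off-diagonal: G_ij = -A_j F_ij
        G[np.diag_indices_from(G)] = A / phi - np.diag(F) * A   # A_i/phi_i - A_i F_ii
        b = sigma * T**4 * A / phi                 # b_i = sigma T_i^4 A_i / phi_i
        c, low = cho_factor(G)                     # Cholesky factorization of SPD G
        return cho_solve((c, low), b)              # radiosities R_1, ..., R_N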

4. Cholesky solver on CPU/GPU parallel computers

Many supercomputers nowadays are augmented with GPUs as accelerators. Applications running on GPUs generally rely on a number of in-house numerical libraries developed by the vendors to attain the peak performance of the accelerating device. One of the most commonly used libraries is the CUDA Basic Linear Algebra Subroutines (cuBLAS) library distributed by NVIDIA. The cuBLAS library provides a set of numerical linear algebra operations such as matrix–vector multiplication and matrix–matrix multiplication. The MAGMA [9,25] library, a counterpart of the LAPACK library for GPUs, uses a 1D block-cyclic layout to distribute the matrix. On the other hand, the ScaLAPACK library only runs on the CPUs of a distributed parallel computer. A GPU-based out-of-core solver using cuBLAS and ScaLAPACK was recently developed to perform the LU factorization of a general matrix [12]. Here, we extend that work to factor an SPD matrix using Cholesky decomposition.

4.1. Cholesky factorization

Consider a block partitioning of the matrix G where G11 is a k × k block. There are several steps for performing the Cholesky factorization of the matrix G.

G = \begin{pmatrix} G_{11} & G_{21}^T \\ G_{21} & G_{22} \end{pmatrix}
  = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
    \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}.  \quad (14)

1. Perform the Cholesky factorization of the diagonal block G11 to obtain L11:

G_{11} \leftarrow L_{11} L_{11}^T.  \quad (15)

2. Then perform a triangular solve to obtain L21:

L_{21} \leftarrow G_{21} L_{11}^{-T}.  \quad (16)

3. Update the remaining submatrix:

G_{22} \leftarrow G_{22} - L_{21} L_{21}^T.  \quad (17)

4. Recursively factor the remaining part of G22.

Note that the Cholesky factorization of a 1 × 1 matrix is just the scalar square root. If the partition chooses G11 to be a small block, then we call this a ‘‘right-looking’’ algorithm since the bulk of the remaining work is in the updating and further factorization of G22. If the partition chooses G22 to be a small block, then we call this a ‘‘left-looking’’ algorithm since the bulk of the work is in producing L21 for updating G22 ← G22 − L21 L21^T. A ‘‘right-looking’’ algorithm has more opportunities for parallelism whereas a ‘‘left-looking’’ algorithm has less data movement and better cache reuse [26,27].
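As a single-node illustration of steps 1–4 in the right-looking order (small leading block), the following Python/NumPy sketch factors a dense SPD matrix held in memory. It ignores the distributed layout and the GPU offload, so it is not the paper's solver.

    import numpy as np

    def blocked_cholesky(G, nb=64):
        # Right-looking blocked Cholesky: factor the diagonal block (step 1),
        # triangular solve for the panel below it (step 2), update the trailing
        # matrix (step 3), and continue on the trailing matrix (step 4).
        L = np.tril(np.array(G, dtype=float))
        n = L.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            L[k:e, k:e] = np.linalg.cholesky(L[k:e, k:e])        # step 1: L11
            if e < n:
                # step 2: L21 <- G21 * L11^{-T}
                L[e:, k:e] = np.linalg.solve(L[k:e, k:e], L[e:, k:e].T).T
                # step 3: G22 <- G22 - L21 * L21^T
                L[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
        return np.tril(L)   # discard scratch values left above the diagonal

    # Suggested check: for an SPD matrix G,
    # np.allclose(blocked_cholesky(G), np.linalg.cholesky(G)) should hold.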

4.2. Out-of-core factorization

Out-of-core (OOC) factorization methods have been studied in the past [28,29], in which secondary memory is used to augment the storage and hence to solve a large matrix which cannot be stored in the primary memory. In a CPU/GPU hybrid system, we view the device memory on the GPUs as ‘‘primary’’ storage and the CPU memory as ‘‘secondary’’ storage. For example, the Keeneland machine has a total of 32 GB of CPU memory and 18 GB (3 GPUs, 6 GB each) of GPU device memory on each node. One of the advantages of the OOC algorithm is the ability to factor a matrix as large as the amount of CPU memory on all the available nodes of a distributed computer and harness the computational power of the GPUs on the nodes.

The OOC algorithm distributes the matrix G in the ScaLAPACK 2D block cyclic format but considers the matrix G in CPU host memory as partitioned into block column panels, where each panel can fit entirely in the GPU device memory. A right-looking in-core algorithm is used to compute the Cholesky factorization of the matrix panel in GPU memory, and a left-looking algorithm is used to organize the communication and data movement between the CPU host and the GPU device when performing updates from the previously factored panels.

The details of the algorithm are described below.

1. The OOC algorithm begins by considering the matrix on the host memory as partitioned into multiple column panels (called ‘‘Y’’ panels) such that each of them can fit entirely in the GPU device memory. A panel is then copied to the GPU device one at a time for computation. The number of ‘‘Y’’ panels is determined by the ratio between the available memory on the CPU and the available memory on the GPU. For simplicity, the width of the ‘‘Y’’ panels is chosen to be a multiple of NB ∗ Q, where NB is the ScaLAPACK block size on the process grid of shape P × Q.

2. Suppose the current ‘‘Y’’ panel is in GPU memory and all the earlier panels of G to the left of the current ‘‘Y’’ panel have already been factored and replaced by the corresponding part of the ‘‘L’’ factor. The update step divides the already factored part of L into multiple smaller ‘‘X’’ stripes of width NB. A part of each ‘‘X’’ stripe is transferred to the GPU to perform the symmetric rank-k update described in step 3 of the Cholesky algorithm above.

3. The left-looking variant of the Cholesky factorization is used to update the current panel in the GPU device memory. The pseudocode is given as follows (a simplified host-only sketch is given after this list):
Loop over the ‘‘Y’’ panels: {
(a) Copy one ‘‘Y’’ panel from matrix G on the CPU host to a mirrored buffer on the GPU device using cudaGetMatrix (see Box 1a in Fig. 6).
(b) Loop over the ‘‘X’’ stripes: {
Using an ‘‘X’’ stripe on the CPU host, perform the equivalent of the PDSYRK update operation on the block column panel on the GPU device by calling the routine Cpdsyrk_hhd developed in this effort (see Box 1b). }
(c) Perform the ‘‘in-core’’ Cholesky factorization to factorize the current ‘‘Y’’ panel on the GPU (see Box 1c). }
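The data flow of this loop can be mimicked on a single node with the following Python/NumPy sketch, in which the ‘‘device’’ is modeled by a separate panel buffer, the PDSYRK-style update is a plain matrix product, and the in-core step is an unblocked Cholesky of the panel. It is an assumption-laden simulation of the control flow (all names are ours), not the MPI/cuBLAS implementation.

    import numpy as np

    def ooc_cholesky_sim(G, panel_width, nb):
        # Simulate the out-of-core left-looking panel loop on one dense SPD matrix.
        n = G.shape[0]
        L = np.tril(np.array(G, dtype=float))          # "host" copy, lower triangle
        for c0 in range(0, n, panel_width):            # loop over "Y" panels
            w = min(panel_width, n - c0)
            Y = L[c0:, c0:c0 + w].copy()               # (a) copy panel host -> "device"
            for s0 in range(0, c0, nb):                # (b) loop over factored "X" stripes
                s1 = min(s0 + nb, c0)
                X = L[c0:, s0:s1]                      # part of an "X" stripe on the host
                Y -= X @ L[c0:c0 + w, s0:s1].T         # PDSYRK-like update Y = Y - X X^T
            for j in range(w):                         # (c) in-core factorization (unblocked)
                Y[j, j] = np.sqrt(Y[j, j])
                Y[j + 1:, j] /= Y[j, j]
                Y[j + 1:, j + 1:w] -= np.outer(Y[j + 1:, j], Y[j + 1:w, j])
            L[c0:, c0:c0 + w] = np.tril(Y)             # copy the factored panel back
        return L

    # Suggested check: for an SPD matrix G,
    # np.allclose(ooc_cholesky_sim(G, 96, 32), np.linalg.cholesky(G)) should hold.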

4.3. In-core factorization

The panel in the device memory is divided into ‘‘X’’ block column stripes. Each stripe has the width of the block width NB. The right-looking variant is used to factorize the panel stripe by stripe. The pseudocode is given as follows:

Loop over the ‘‘X’’ stripes: {
(a) Copy one diagonal block with size NB (labeled by the black square in Box 2a in Fig. 7) from the GPU to the CPU using cublasGetMatrix.
(b) Factorize the diagonal block with the routine pdpotrf from ScaLAPACK (see Box 2b).


Fig. 6. Steps for overall ‘‘out-of-core’’ approach.

(c) Perform triangular solves using pdtrsm from ScaLAPACK to form the rest of the ‘‘X’’ stripe below the diagonal part on the CPU (see Box 2c).
(d) Update the remaining unfactored part of the ‘‘Y’’ panel on the GPU using a symmetric rank-k update performed on the GPU (see Box 2d). }

4.4. Performing updates to the matrix

Most of the time of the OOC Cholesky factorization is spent performing the equivalent of the ScaLAPACK PDSYRK parallel symmetric rank-k update operation, Y = Y − X × X^T, on the GPU. The matrix X resides in the host memory at the beginning, and the resulting Y matrix is in the GPU device memory.

1. Routine PDGEADD in ScaLAPACK is used to perform a broadcast of one stripe of data across the Q processors. This data is then copied from the CPU to the GPU device memory (see Box 3a in Fig. 8).
2. Similarly, PDGEADD is used to perform a broadcast of the transpose of the same stripe down the P processors in the same processor column. This data is then copied from the CPU to the GPU device memory (see Box 3b).
3. The ‘‘Y’’ panel to be updated is recursively divided into rectangular off-diagonal matrices or square diagonal blocks (see Box 3c).
4. The cublasDSYRK subroutine in cuBLAS is used to update the diagonal blocks.
5. The cublasDGEMM subroutine in cuBLAS is used to update the off-diagonal rectangular blocks. A serial sketch of this splitting is given below.
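The following Python/NumPy sketch shows, in serial form and without the PDGEADD broadcasts or the cuBLAS calls, how the update Y = Y − X X^T is split into symmetric updates of square diagonal blocks (the role of cublasDSYRK) and plain products for the rectangular blocks below them (the role of cublasDGEMM). A non-recursive variant of the splitting is used, only the lower triangle of Y is touched, and the function name is ours.

    import numpy as np

    def syrk_update_lower(Y, X, nb):
        # Y: (m, w) block column panel (m >= w) aligned so that its first w rows
        # contain the blocks on the global diagonal; only the lower triangle of
        # that top square is meaningful. X: (m, k) stripe already on the "device".
        # Applies Y <- Y - X @ X.T without writing above the diagonal.
        m, w = Y.shape
        for d0 in range(0, w, nb):
            d1 = min(d0 + nb, w)
            # square diagonal block: symmetric (SYRK-like) update, lower part only
            Y[d0:d1, d0:d1] -= np.tril(X[d0:d1, :] @ X[d0:d1, :].T)
            if d1 < m:
                # rectangular block below it: GEMM-like update
                Y[d1:, d0:d1] -= X[d1:, :] @ X[d0:d1, :].T
        return Y

In the paper's Cpdsyrk_hhd, X and its transpose reach the process grid through the PDGEADD broadcasts of steps 1–2 above, and the two block types are dispatched to cublasDSYRK and cublasDGEMM, respectively.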

5. Numerical results

Two sets of numerical experiments of the OOC Cholesky factorization solver were run on the Keeneland GPU cluster to determine the scalability and performance characteristics.

The first experiment was performed on a fixed 6 × 6 processor grid on 3 nodes (a total of 9 GPUs). To fully utilize the remaining cores, 12 MPI tasks are spawned on a CPU node (4 MPI tasks sharing a GPU). Using 4 MPI tasks per GPU achieves a small (around 10%) improvement compared to using only a single MPI task per GPU. A separate set of experiments showed that a block size NB = 64 is close to optimal for a wide range of configurations. Each M2090 GPU is configured with 6 GB of device memory. Enabling error correcting code (ECC) reduces the amount of available memory by about 10%.


Fig. 7. Steps for in-core factorization.

Fig. 8. Steps for performing equivalent of PDSYRK parallel symmetric rank-k update on GPU.


Fig. 9. Performance at different matrix sizes N on a fixed 6 by 6 processor grid. We show the data from two ways of cutting the panels. The block size is 64. We also show the comparison with ScaLAPACK, which uses all the CPU cores on the 3 nodes.

About 1215 MB of GPU device memory per MPI task is dedicated to the in-core ‘‘Y’’ panel. The remaining device memory is used as communication buffers or temporary work space. One variant of the algorithm uses a variable width for the ‘‘Y’’ panel in GPU memory: as the factorization proceeds, only the lower triangular part needs to be accessed, so for the same amount of device memory a wider ‘‘Y’’ panel can be accommodated. Fig. 9 shows the performance as the matrix size N is increased. The maximum value of N = 100608 corresponds to using about 27 GB (out of 32 GB) on each node. Since the amount of CPU memory is less than twice the available GPU memory, at most two ‘‘Y’’ panels are needed. Both variants achieve about 200 GFLOPS/GPU for N = 100608, but the simpler fixed panel width achieves a slightly higher performance. We also show the results of calling ScaLAPACK. The ScaLAPACK call uses all 6 CPUs, which have 48 cores in total. The processor grid is 8 by 6, and the block size is the same. The results are shown in GFLOPS/CPU for comparison.

There are three major time consuming parts in the factorization process: the diagonal factorization, the trailing matrix update, and the communication. In our approach, the diagonal factorization and the trailing matrix update typically take about 51% of the wall time, and the communication takes about 46%. There is overlap between the diagonal factorization and the trailing matrix update, so their individual timings depend on how that overlap is attributed.

The second set of experiments studies the weak scaling behavior by increasing the size of square P × P processor grids while keeping the amount of memory per node fixed. Each MPI task uses about 1.9 GB of host memory together with the same fixed amount of 147456 kB of GPU memory for the ‘‘Y’’ panel. The block size is NB = 64, and the matrix size is chosen as N = 240 ∗ NB ∗ P, where P = 6n for n from 1 to 6. Fig. 10 shows the performance as the processor grid size increases. As n increases, the cost of communication increases and the performance per GPU drops from 200 GFLOPS to about 170 GFLOPS.

The above data suggests the following desirable properties of the ‘‘out-of-core’’ algorithm:

1. It can fully utilize both the host and device memory resources to solve a matrix as large as the whole host memory, which is far larger than the device memory.
2. It optimizes the data movement between the host and the device, takes advantage of data locality as much as possible, and performs all of the floating point intensive operations on the high throughput accelerators.
3. It can scale up across nodes and maintain a uniform performance across a large range of node counts.
4. It offers the user the flexibility to tune the memory ratio between host and device, so it can approach its best performance over a wider range of problem sizes.

6. Conclusions and future work

This paper details the algorithms to compute view factors in parallel by extending the serial community code view3d. The parallel implementation of these algorithms is shown to be scalable by decomposing the data in the block cyclic format used in the ScaLAPACK library. The view factor calculation creates a large dense SPD matrix that can be efficiently solved on the distributed memory Keeneland GPU cluster. A variant of the OOC algorithm is demonstrated to solve problems that are larger than the amount of GPU device memory. It achieves about 200 GFLOPS for each NVIDIA M2090 GPU. Detailed profiling data suggests that further performance improvements are possible with a look-ahead algorithm to reduce the amount of time spent in the critical path as well as by overlapping data transfer with computation.


Fig. 10. Performance at different matrix sizes N. We fix the memory usage on each node and increase the size of the P × P processor grid, where P = 6n, n = 1, 2, 3, 4, 5, 6.

Acknowledgments

This research project is supported by the National Science Foundation, Department of Energy, Oak Ridge National Laboratory, Chinese University of Hong Kong, and the University of Tennessee. This research used resources of the Keeneland Computing Facility at the Georgia Institute of Technology, which is supported by the National Science Foundation under Contract OCI-0910735. The submitted manuscript has been authored by a contractor of the US Government under Contract No. DE-AC05-00OR22725. Accordingly, the US Government retains a non-exclusive, royalty-free licence to publish or reproduce the published form of this contribution, or allow others to do so, for US Government purposes.

References

[1] M. Cohen, S.E. Chen, J.R. Wallace, D.P. Greenberg, A progressive refinement approach to fast radiosity image generation, in: Proceedings of SIGGRAPH’88, Computer Graphics 22 (4) (1988) 75–84.
[2] G.N. Walton, Algorithms for calculating radiation view factors between plane convex polygons with obstructions, Tech. Rep. NBSIR 86-3463, National Bureau of Standards, 1987; shortened report in Fundamentals and Applications of Radiation Heat Transfer, HTD-Vol. 72, American Society of Mechanical Engineers, 1986.
[3] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, S. Tomov, Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects, J. Phys. Conf. Ser. 180 (2009) 012037.
[4] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, QR factorization on a multicore node enhanced with multiple GPU accelerators, in: IPDPS 2011, Alaska, USA.
[5] M. Fogue, F.D. Igual, E.S. Quintana-Ortí, R.V.D. Geijn, Retargeting PLAPACK to clusters with hardware accelerators, FLAME Working Note 42.
[6] J.R. Humphrey, D.K. Price, K.E. Spagnoli, A.L. Paolini, E.J. Kelmelis, CULA: hybrid GPU accelerated linear algebra routines, in: SPIE Defense and Security Symposium, DSS.
[7] P. Jetley, L. Wesolowski, F. Gioachin, L.V. Kale, T.R. Quinn, Scaling hierarchical n-body simulations on GPU clusters, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC’10, 2010, pp. 1–11.
[8] G. Quintana-Ortí, F.D. Igual, E.S. Quintana-Ortí, R.A. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, in: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’09, 2009, pp. 121–130.
[9] R. Nath, S. Tomov, J. Dongarra, An improved MAGMA GEMM for Fermi GPUs, Int. J. High Perform. Comput. Appl. 24 (4) (2010) 511–515.
[10] F. Song, J. Dongarra, A scalable framework for heterogeneous GPU-based clusters, in: Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, ACM, 2012, pp. 91–100.
[11] F. Song, S. Tomov, J. Dongarra, Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems, in: Proceedings of the 26th ACM International Conference on Supercomputing, ACM, 2012, pp. 365–376.
[12] E. D’Azevedo, J.C. Hill, Parallel LU factorization on GPU cluster, Proc. Comput. Sci. 9 (2012) 67–75.
[13] R.F. Barrett, T.H.F. Chan, E.F. D’Azevedo, E.F. Jaeger, K. Wong, R.Y. Wong, Complex version of high performance computing LINPACK benchmark (HPL), Concurr. Comput.: Pract. Exp. 22 (5) (2010) 537–587.
[14] M. Bach, M. Kretz, V. Lindenstruth, D. Rohr, Optimized HPL for AMD GPU and multi-core CPU usage, Comput. Sci.-Res. Dev. 22 (5) (2010) 537–587.
[15] M. Fatica, Scaling hierarchical n-body simulations on GPU clusters, in: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC’10, 2010, pp. 1–11.
[16] R. Nath, S. Tomov, J. Dongarra, An improved MAGMA GEMM for Fermi graphics processing units, Int. J. High Perform. Comput. Appl. 24 (4) (2010) 511–515.
[17] J. Ohmura, T. Miyoshi, I. Hidetsugu, T. Yoshinaga, Computation–communication overlap of Linpack on a GPU-accelerated PC cluster, IEICE Trans. Inf. Syst. 94 (12) (2011) 2319–2327.
[18] V. Volkov, J. Demmel, LU, QR and Cholesky Factorizations Using Vector Capabilities of GPUs, Tech. Rep. UCB/EECS-2008-49, University of California, Berkeley, CA, 2008.
[19] D. Rohr, M. Bach, M. Kretz, V. Lindenstruth, Multi-GPU DGEMM and HPL on highly energy efficient clusters, IEEE Micro 99 (2011) 1.
[20] F. Wang, C.Q. Yang, Y.F. Du, J. Chen, H.Z. Yi, W.X. Xu, Optimizing LINPACK benchmark on GPU-accelerated petascale supercomputer, J. Comput. Sci. Tech. 26 (5) (2011) 854–865.
[21] G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Communication-optimal parallel and sequential Cholesky decomposition, in: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA’09, 2009, pp. 245–252.
[22] G.N. Walton, Calculation of Obstructed View Factors by Adaptive Integration, Tech. Rep. NISTIR 6925, National Institute of Standards and Technology, Gaithersburg, MD, 2002.
[23] H.C. Hottel, A. Sarofim, Radiative Transfer, McGraw-Hill, New York, NY, 1967.
[24] G.T. Mitalas, D.G. Stephenson, Fortran IV Programs to Calculate Radiant Interchange Factors, Tech. Rep. BDR-25, National Research Council of Canada, Division of Building Research, Ottawa, Canada, 1966.
[25] I. Yamazaki, S. Tomov, J. Dongarra, One-sided dense matrix factorizations on a multicore with multiple GPU accelerators, Proc. Comput. Sci. 9 (2012) 37–46.
[26] S. Toledo, F.G. Gustavson, The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computation, in: IOPADS’96: Proceedings of the Fourth Workshop on I/O in Parallel and Distributed Systems, 1996, pp. 28–40.
[27] J. Dongarra, S. Hammarling, D. Walker, Key concepts for parallel out of core LU factorization, Parallel Comput. 23 (1997) 49–70.
[28] E. D’Azevedo, J. Dongarra, The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines, Concurr. Comput.: Pract. Exp. 12 (2000) 1481–1493.
[29] B. Gunter, W. Reiley, R. van de Geijn, Implementation of Out-of-Core Cholesky and QR Factorizations with POOCLAPACK, Tech. Rep. CS-TR-00-21, University of Texas at Austin, Austin, TX, USA, 2000.