10
DATA PARALLEL COMPUTING 2011 1 CUDA raytracing algorithm for visualizing discrete element model output Anders Damsgaard Christensen, 20062213 Abstract—A raytracing algorithm is constructed using the CUDA API for visualizing output from a CUDA discrete element model, which outputs spatial information in dynamic particle systems. The raytracing algorithm is optimized with constant memory and compilation flags, and performance is measured as a function of the number of particles and the number of pixels. The execution time is compared to equivalent CPU code, and the speedup under a variety of conditions is found to have a mean value of 55.6 times. Index Terms—CUDA, discrete element method, raytracing I. I NTRODUCTION V ISUALIZING systems containing many spheres using traditional object-order graphics rendering can often re- sult in very high computational requirements, as the usual automated approach is to construct a meshed surface with a specified resolution for each sphere. The memory requirements are thus quite high, as each surface will consist of many vertices. Raytracing [1] is a viable alternative, where spheric entities are saved as data structures with a centre coordinate and a radius. The rendering is performed on the base of these values, which results in a perfectly smooth surfaced sphere. To accelerate the rendering, the algorithm is constructed utilizing the CUDA API [2], where the problem is divided into n × m threads, corresponding to the desired output image resolution. Each thread iterates through all particles and applies a simple shading model to determine the final RGB values of the pixel. Previous studies of GPU or CUDA implementations of ray tracing algorithms reported major speedups, compared to corresponding CPU applications (e.g. [3], [4], [5], [6]). None of the software was however found to be open-source and GPL licensed, so a simple raytracer was constructed, customized to render particles, where the data was stored in a specific data format. A. Discrete Element Method The input particle data to the raytracer is the output of a custom CUDA-based Discrete Element Method (DEM) appli- cation currently in development. The DEM model is used to numerically simulate the response of a drained, soft, granular sediment bed upon normal stresses and shearing velocities similar to subglacial environments under ice streams [7]. In contrast to laboratory experiments on granular material, the discrete element method [8] approach allows close monitoring of the progressive deformation, where all involved physical Contact: [email protected] Webpage: http://users-cs.au.dk/adc Manuscript, last revision: October 18, 2011. parameters of the particles and spatial boundaries are readily available for continuous inspection. The discrete element method (DEM) is a subtype of molecu- lar dynamics (MD), and discretizes time into sufficiently small timesteps, and treats the granular material as discrete grains, interacting through contact forces. Between time steps, the particles are allowed to overlap slightly, and the magnitude of the overlap and the kinematic states of the particles is used to compute normal- and shear components of the contact force. The particles are treated as spherical entities, which simplifies the contact search. The spatial simulation domain is divided using a homogeneous, uniform, cubic grid, which greatly reduces the amount of possible contacts that are checked during each timestep. The grid-particle list is sorted using Thrust 1 , and updated each timestep. The new particle positions and kinematic values are updated by inserting the resulting force and torque into Newton’s second law, and using a Taylor- based second order integration scheme to calculate new linear and rotational accelerations, velocities and positions. B. Application usage The CUDA DEM application is a command line executable, and writes updated particle information to custom binary files with a specific interval. This raytracing algorithm is constructed to also run from the command line, be non- interactive, and write output images in the PPM image format. This format is chosen to allow rendering to take place on cluster nodes with CUDA compatible devices. Both the CUDA DEM and raytracing applications are open- source 2 , although still under heavy development. This document consists of a short introduction to the basic mathematics behind the ray tracing algorithm, an explaination of the implementation using the CUDA API [2] and a presen- tation of the results. The CUDA device source code and C++ host source code for the ray tracing algorithm can be found in the appendix, along with instructions for compilation and execution of the application. II. RAY TRACING ALGORITHM The goal of the ray tracing algorithm is to compute the shading of each pixel in the image [9]. This is performed by creating a viewing ray from the eye into the scene, finding the closest intersection with a scene object, and computing the resulting color. The general structure of the program is demonstrated in the following pseudo-code: 1 http://code.google.com/p/thrust/ 2 http://users-cs.au.dk/adc/files/sphere.tar.gz

DATA PARALLEL COMPUTING 2011 1 CUDA … · CUDA raytracing algorithm for visualizing discrete element model output Anders Damsgaard Christensen, 20062213 ... and writes updated particle

Embed Size (px)

Citation preview

DATA PARALLEL COMPUTING 2011 1

CUDA raytracing algorithm for visualizingdiscrete element model output

Anders Damsgaard Christensen, 20062213

Abstract—A raytracing algorithm is constructed using theCUDA API for visualizing output from a CUDA discrete elementmodel, which outputs spatial information in dynamic particlesystems. The raytracing algorithm is optimized with constantmemory and compilation flags, and performance is measured asa function of the number of particles and the number of pixels.The execution time is compared to equivalent CPU code, and thespeedup under a variety of conditions is found to have a meanvalue of 55.6 times.

Index Terms—CUDA, discrete element method, raytracing

I. INTRODUCTION

V ISUALIZING systems containing many spheres usingtraditional object-order graphics rendering can often re-

sult in very high computational requirements, as the usualautomated approach is to construct a meshed surface with aspecified resolution for each sphere. The memory requirementsare thus quite high, as each surface will consist of manyvertices. Raytracing [1] is a viable alternative, where sphericentities are saved as data structures with a centre coordinateand a radius. The rendering is performed on the base of thesevalues, which results in a perfectly smooth surfaced sphere. Toaccelerate the rendering, the algorithm is constructed utilizingthe CUDA API [2], where the problem is divided into n×mthreads, corresponding to the desired output image resolution.Each thread iterates through all particles and applies a simpleshading model to determine the final RGB values of the pixel.

Previous studies of GPU or CUDA implementations ofray tracing algorithms reported major speedups, compared tocorresponding CPU applications (e.g. [3], [4], [5], [6]). Noneof the software was however found to be open-source and GPLlicensed, so a simple raytracer was constructed, customized torender particles, where the data was stored in a specific dataformat.

A. Discrete Element Method

The input particle data to the raytracer is the output of acustom CUDA-based Discrete Element Method (DEM) appli-cation currently in development. The DEM model is used tonumerically simulate the response of a drained, soft, granularsediment bed upon normal stresses and shearing velocitiessimilar to subglacial environments under ice streams [7]. Incontrast to laboratory experiments on granular material, thediscrete element method [8] approach allows close monitoringof the progressive deformation, where all involved physical

Contact: [email protected]: http://users-cs.au.dk/adcManuscript, last revision: October 18, 2011.

parameters of the particles and spatial boundaries are readilyavailable for continuous inspection.

The discrete element method (DEM) is a subtype of molecu-lar dynamics (MD), and discretizes time into sufficiently smalltimesteps, and treats the granular material as discrete grains,interacting through contact forces. Between time steps, theparticles are allowed to overlap slightly, and the magnitude ofthe overlap and the kinematic states of the particles is used tocompute normal- and shear components of the contact force.The particles are treated as spherical entities, which simplifiesthe contact search. The spatial simulation domain is dividedusing a homogeneous, uniform, cubic grid, which greatlyreduces the amount of possible contacts that are checkedduring each timestep. The grid-particle list is sorted usingThrust1, and updated each timestep. The new particle positionsand kinematic values are updated by inserting the resultingforce and torque into Newton’s second law, and using a Taylor-based second order integration scheme to calculate new linearand rotational accelerations, velocities and positions.

B. Application usage

The CUDA DEM application is a command line executable,and writes updated particle information to custom binaryfiles with a specific interval. This raytracing algorithm isconstructed to also run from the command line, be non-interactive, and write output images in the PPM image format.This format is chosen to allow rendering to take place oncluster nodes with CUDA compatible devices.

Both the CUDA DEM and raytracing applications are open-source2, although still under heavy development.

This document consists of a short introduction to the basicmathematics behind the ray tracing algorithm, an explainationof the implementation using the CUDA API [2] and a presen-tation of the results. The CUDA device source code and C++host source code for the ray tracing algorithm can be foundin the appendix, along with instructions for compilation andexecution of the application.

II. RAY TRACING ALGORITHM

The goal of the ray tracing algorithm is to compute theshading of each pixel in the image [9]. This is performed bycreating a viewing ray from the eye into the scene, findingthe closest intersection with a scene object, and computingthe resulting color. The general structure of the program isdemonstrated in the following pseudo-code:

1http://code.google.com/p/thrust/2http://users-cs.au.dk/adc/files/sphere.tar.gz

DATA PARALLEL COMPUTING 2011 2

f o r each p i x e l docompute v iewing r a y o r i g i n and d i r e c t i o ni t e r a t e t h r o u g h o b j e c t s and f i n d t h e c l o s e s t h i ts e t p i x e l c o l o r t o v a l u e computed from h i t ←↩

p o i n t , l i g h t , n

The implemented code does not utilize recursive rays, sincethe modeled material grains are matte in appearance.

A. Ray generation

The rays are in vector form defined as:

p(t) = e + t(s− e) (1)

The perspective can be either orthograpic, where all viewingrays have the same direction, but different starting points,or use perspective projection, where the starting point isthe same, but the direction is slightly different [9]. For thepurposes of this application, a perspective projection waschosen, as it results in the most natural looking image. Theray data structures were held flexible enough to allow an easyimplementation of orthographic perspective, if this is desiredat a later point.

The ray origin e is the position of the eye, and is constant.The direction is unique for each ray, and is computed using:

s− e = −dw + uu + vv (2)

where {u,v,w} are the orthonormal bases of the cameracoordinate system, and d is the focal length [9]. The cameracoordinates of pixel (i, j) in the image plane, u and v, arecalculated by:

u = l + (r − l)(i+ 0.5)/n

v = b+ (t− b)(j + 0.5)/m

where l, r, t and b are the positions of the image borders (left,right, top and bottom) in camera space. The values n and mare the number of pixels in each dimension.

B. Ray-sphere intersection

Given a sphere with a center c, and radius R, a equation canbe constrained, where p are all points placed on the spheresurface:

(p− c) · (p− c)−R2 = 0 (3)

By substituting the points p with ray equation 1, and rearrang-ing the terms, a quadratic equation emerges:

(d · d)t2 + 2d · (e− c)t+ (e− c) · (e− c)−R2 = 0 (4)

The number of ray steps t is the only unknown, so the numberof intersections is found by calculating the determinant:

∆ = (2(d · (e− c)))2 − 4(d · d)((e− c) · (e− c)−R2 (5)

A negative value denotes no intersection between the sphereand the ray, a value of zero means that the ray touches thesphere at a single point (ignored in this implementation), anda positive value denotes that there are two intersections, onewhen the ray enters the sphere, and one when it exits. In thecode, a conditional branch checks wether the determinant is

positive. If this is the case, the distance to the intersection inray “steps” is calculated using:

t =−d · (e− c)±√η

(d · d)(6)

where

η = (d · (e− c))2 − (d · d)((e− c) · (e− c)−R2)

Only the smallest intersection (tminus) is calculated, sincethis marks the point where the sphere enters the particle. Ifthis value is smaller than previous intersection distances, theintersection point p and surface normal n at the intersectionpoint is calculated:

p = e + tminusd (7)

n = 2(p− c) (8)

The intersection distance in vector steps (tminus) is saved inorder to allow comparison of the distance with later intersec-tions.

C. Pixel shading

The pixel is shaded using Lambertian shading [9], where thepixel color is proportional to the angle between the light vector(l) and the surface normal. An ambient shading component isadded to simulate global illumination, and prevent that thespheres are completely black:

L = kaIa + kdId max(0, (n · l)) (9)

where the a and d subscripts denote the ambient and diffusive(Lambertian) components of the ambient/diffusive coefficients(k) and light intensities (I). The pixel color L is calculatedonce per color channel.

D. Computational implementation

The above routines were first implemented in CUDA fordevice execution, and afterwards ported to a CPU C++ equiv-alent, used for comparing performance. The CPU raytracingalgorithm was optimized to shared-memory parallelism usingOpenMP [10]. The execution method can be chosen whenlaunching the raytracer from the command line, see theappendix for details. In the CPU implementation, all data wasstored in linear arrays of the right size, ensuring 100% memoryefficiency.

III. CUDA IMPLEMENTATION

When constructing the algorithm for execution on theGPGPU device, the data-parallel nature of the problem (SIMD:single instruction, multiple data) is used to deconstruct therendering task into a single thread per pixel. Each threaditerates through all particles, and ends up writing the resultingcolor to the image memory.

The application starts by reading the discrete elementmethod data from a custom binary file. The particle data,consisting of position vectors in three-dimensional Euclideanspace (R3) and particle radii, is stored together in a float4array, with the particle radius in the w position. This has

DATA PARALLEL COMPUTING 2011 3

Fig. 1. Sample output of GPU raytracer rendering of 512 particles.

Fig. 2. Sample output of GPU raytracer rendering of 50 653 particles.

large advantages to storing the data in separate float3 andfloat arrays; Using float4 (instead of float3) dataallows coalesced memory access [11] to the arrays of datain device memory, resulting in efficient memory requests andtransfers [12], and the data access pattern is coherent andconvenient. Other three-component vectors were also storedas float4 for the same reasons, even though this sometimescaused a slight memory redundancy. The image data is savedin a three-channel linear unsigned char array.

The algorithm starts by allocating memory on the devicefor the particle data, the ray parameters, and the image RGB

values. Afterwards, all particle data is transfered from the host-to the device memory.

All pixel values are initialized to [R,G,B] = [0, 0, 0], whichserves as the image background color. Afterwards, a kernel isexecuted with a thread for each pixel, testing for intersectionsbetween the pixel’s viewing ray and all particles, and returningthe closest particle. This information is used when computingthe shading of the pixel.

After all pixel values have been computed, the image datais transfered back to the host memory, and written to thedisk. The application ends by liberating dynamically allocatedmemory on both the device and the host.

A. Thread and block layout

The thread/block layout passed during kernel launches isarranged in the following manner:

dim3 t h r e a d s ( 1 6 , 16) ;dim3 b l o c k s ( ( wid th +15) / 1 6 , ( h e i g h t +15) / 1 6 ) ;

The image pixel position of the thread can be determinedfrom the thread- and block index and dimensions. The layoutcorresponds to a thread tile size of 256, and a dynamicnumber of blocks, ensured to fit the image dimensions withonly small eventual redundancy [13]. Since this method willinitialize extra threads in most situations, all kernels (withreturn type void) start by checking wether the thread-/blockindex actually falls inside of the image dimensions:

i n t i = t h r e a d I d x . x + b l o c k I d x . x ∗ ←↩blockDim . x ;

i n t j = t h r e a d I d x . y + b l o c k I d x . y ∗ ←↩blockDim . y ;

unsigned i n t mempos = x + y ∗ blockDim . x ←↩∗ gridDim . x ;

i f ( mempos > p i x e l s )re turn ;

The linear memory position (mempos) is used as the indexwhen reading or writing to the linear arrays residing in globaldevice memory.

B. Image output

After completing all pixel shading computations on thedevice, the image data is transfered back to the host memory,and together with a header written to a PPM3 image file. Thisfile is converted to the PNG format using ImageMagick.

C. Performance

Since this simple raytracing algorithm generates a singlenon-recursive ray for each pixel, which in turn checks allspheres for intersection, the application is expected to scalein the form of O(n×m×N), where n and m are the outputimage dimensions in pixels, and N is the number of particles.

The data transfer between the host and device is kept ata bare minimum, as the intercommunication is considered abottleneck in relation to the potential device performance[11].

3http://paulbourke.net/dataformats/ppm/

DATA PARALLEL COMPUTING 2011 4

1

10

100

1000

10000

100000

1e+06

1 10 100 1000 10000 100000

Tim

e [m

s]

Particles

Device kernelsHost<->device transfer and allocation

Host kernels

Fig. 3. Performance scaling with varying particle numbers at imagedimensions 800 by 800 pixels.

0.1

1

10

100

1000

10000

100000

1e+06

100 1000 10000 100000 1e+06 1e+07

Tim

e [m

s]

Pixels

Device kernelsHost<->device transfer and allocation

Host kernels

Fig. 4. Performance scaling with varying image dimensions (n ×m) with5832 particles.

Thread synchronization points are only inserted were neces-sary, and the code is optimized by the compilers to the targetarchitecture (see appendix).

The host execution time was profiled using a clock()based CPU timer from time.h, which was normalized usingthe constant CLOCKS_PER_SEC.

The device execution time was profiled using twocudaEvent_t timers, one measuring the time spent inthe entire device code section, including device memoryallocation, data transfer to- and from the host, executionof the kernels, and memory deallocation. The other timeronly measured time spent in the kernels. The threads weresynchronized before stopping the timers. A simple CPU timerusing clock() will not work, since control is returned to thehost thread before the device code has completed all tasks.

Figures 3 and 4 show the profiling results, where the numberof particles and the image dimensions were varied. Withexception of executions with small image dimensions, thekernel execution time results agree with the O(n ×m × N)scaling predicion.

The device memory allocation and data transfer was alsoprofiled, and turns out to be only weakly dependant on the

particle numbers (fig. 3), but more strongly correlated toimage dimensions (fig. 4). As with kernel execution times, theexecution time converges against an overhead value at smallimage dimensions.

The CPU time spent in the host kernels proves to belinear with the particle numbers, and linear with the imagedimensions. This is due to the non-existant overhead caused byinitialization of the device code, and reduced memory transfer.

The ratio between CPU computational times and the sum ofthe device kernel execution time and the host—device memorytransfer and additional memory allocation was calculated, andhad a mean value of 55.6 and a variance of 739 out of the 11comparative measurements presented in the figures. It shouldbe noted, that the smallest speedups were recorded when usingvery small image dimensions, probably unrealistic in real use.

As the number of particles are not known by compilation,it is not possible to store particle positions and -radii in con-stant memory. Shared memory was also on purpose avoided,since the memory per thread block (64 kb) would not besufficient in rendering of simulations containing containingmore than 16 000 particles (16 000 float4 values). Theconstant memory was however utilized for storing the camerarelated parameters; the orthonormal base vectors, the observerposition, the image dimensions, the focal length, and the lightvector.

Previous GPU implementations often rely on k-D trees,constructed as an sorting method for static scene objects[3],[5]. A k-D tree implementation would drastically reduce theglobal memory access induced by each thread, so it is thereforethe next logical step with regards to optimizing the ray tracingalgorithm presented here.

IV. CONCLUSION

This document presented the implementation of a basicray tracing algorithm, utilizing the highly data-parallel na-ture of the problem when porting the work load to CUDA.Performance tests showed the expected, linear correlationbetween image dimensions, particle numbers and executiontime. Comparisons with an equivalent CPU algorithm showedlarge speedups, typically up to two orders of magnitude. Thisspeedup did not come at a cost of less correct results.

The final product will come into good use during furtherdevelopment and completion of the CUDA DEM particlemodel, and is ideal since it can be used for offline renderingon dedicated, heterogeneous GPU-CPU computing nodes. Theincluded device code will be the prefered method of execution,whenever the host system allows it.

APPENDIX ATEST ENVIRONMENT

The raytracing algorithm was developed, tested and profiledon a mid 2010 Mac Pro with a 2.8 Ghz Quad-Core IntelXeon CPU and a NVIDIA Quadro 4000 for Mac, dedicatedto CUDA applications. The CUDA driver was version 4.0.50,the CUDA compilation tools release 4.0, V0.2.1221. The GCCtools were version 4.2.1. Each CPU core is multithreaded bytwo threads for a total of 8 threads.

DATA PARALLEL COMPUTING 2011 5

The CUDA source code was compiled with nvcc, andlinked to g++ compiled C++ source code with g++. For allbenchmark tests, the code was compiled with the followingcommands:

g++ −c −Wall −O3 −a r c h x86 64 −fopenmp . . .nvcc −c −u s e f a s t m a t h −gencode ←↩

a r c h =compute 20 , code=sm 20 −m64 . . .g++ −a r c h x86 64 −l c u d a − l c u d a r t −fopenmp ←↩

∗ . o −o r t

When profiling device code performance, the application wasexecuted two times, and the time of the second run was noted.This was performed to avoid latency caused by device driverinitialization.

The host system was measured to have a memory bandwidthof 4642.1 MB/s when transfering data from the host to thedevice, and 3805.6 MB/s when transfering data from thedevice to the host.

REFERENCES

[1] T. Whitted, “An improved illumination model for shaded display,”Communications of the ACM, vol. 23, no. 6, pp. 343–349, 1980.

[2] NVIDIA, CUDA C Programming Guide, 3rd ed., NVIDIA Corporation:Santa Clara, CA, USA, Oct 2010.

[3] D. Horn, J. Sugerman, M. Houston, and P. Hanrahan, “Interactive k-Dtree GPU raytracing,” Association for Computing Machinery, Inc., pp.167–174 pp., 2007.

[4] M. Shih, Y. Chiu, Y. Chen, and C. Chang, “Real-time ray tracing withcuda,” Algorithms and Architectures for Parallel Processing, pp. 327–337, 2009.

[5] S. Popov, J. Gunther, H. Seidel, and P. Slusallek, “Stackless kd-treetraversal for high performance gpu ray tracing,” in Computer GraphicsForum, vol. 26, no. 3. Wiley Online Library, 2007, pp. 415–424.

[6] D. Luebke and S. Parker, “Interactive ray tracing with cuda,” NVIDIATechnical Presentation, SIGGRAPH, 2008.

[7] D. Evans, E. Phillips, J. Hiemstra, and C. Auton, “Subglacial till:formation, sedimentary characteristics and classification,” Earth-ScienceReviews, vol. 78, no. 1-2, pp. 115–176, 2006.

[8] P. Cundall and O. Strack, “A discrete numerical model for granularassemblies,” Geotechnique, vol. 29, pp. 47–65, 1979.

[9] P. Shirley, M. Ashikhmin, M. Gleicher, S. Marschner, E. Reinhard,K. Sung, W. Thompson, and P. Willemsen, Fundamentals of computergraphics, 3rd ed. AK Peters, 2009.

[10] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: portableshared memory parallel programming. The MIT Press, 2007, vol. 10.

[11] NVIDIA, CUDA C Best Practices Guide Version, 3rd ed., NVIDIACorporation: Santa Clara, CA, USA, Aug 2010, cA Patent 95,050.

[12] L. Nyland, M. Harris, and J. Prins, “Fast n-body simulation with cuda,”GPU gems, vol. 3, pp. 677–695, 2007.

[13] J. Sanders and E. Kandrot, CUDA by example. Addison-Wesley, 2010.

APPENDIX BSOURCE CODE

The entire source code, as well as input data files, can befound in the following archive http://users-cs.au.dk/adc/files/sphere-rt.tar.gz The source code is built and run with thecommands:

makemake run

With the make run command, the Makefile uses ImageMag-ick to convert the PPM file to PNG format, and the OS Xcommand open to display the image. Other input data files areincluded with other particle number magnitudes. The syntaxfor the raytracer application is the following:

. / r t <CPU | G P U> <sphe re−b i n a r y . bin> ←↩<width> <h e i g h t> <o u t p u t−image . ppm>

This appendix contains the following source code files:

LISTINGS

1 rt kernel.h . . . . . . . . . . . . . . . . . . . . 52 rt kernel.cu . . . . . . . . . . . . . . . . . . . 53 rt kernel cpu.h . . . . . . . . . . . . . . . . . 74 rt kernel cpu.cpp . . . . . . . . . . . . . . . . 7

A. CUDA raytracing source code

Listing 1. rt kernel.h1 # i f n d e f RT KERNEL H2 # d e f i n e RT KERNEL H34 # i n c l u d e <v e c t o r f u n c t i o n s . h>56 / / C o n s t a n t s7 c o n s t a n t f l o a t 4 c o n s t u ;8 c o n s t a n t f l o a t 4 c o n s t v ;9 c o n s t a n t f l o a t 4 cons t w ;

10 c o n s t a n t f l o a t 4 c o n s t e y e ;11 c o n s t a n t f l o a t 4 c o n s t i m g p l a n e ;12 c o n s t a n t f l o a t c o n s t d ;13 c o n s t a n t f l o a t 4 c o n s t l i g h t ;141516 / / Host p r o t o t y p e f u n c t i o n s1718 e x t er n ”C”19 void c a m e r a I n i t ( f l o a t 4 eye , f l o a t 4 l o o k a t , f l o a t ←↩

imgw , f l o a t h w r a t i o ) ;2021 e x t er n ”C”22 void c h e c k F o r C u d a E r r o r s ( c o n s t char∗ ←↩

c h e c k p o i n t d e s c r i p t i o n ) ;2324 e x t er n ”C”25 i n t r t ( f l o a t 4 ∗ p , c o n s t unsigned i n t np ,26 rgb∗ img , c o n s t unsigned i n t width , c o n s t ←↩

unsigned i n t h e i g h t ,27 f3 o r i g o , f3 L , f3 eye , f3 l o o k a t , f l o a t imgw ) ;2829 # e n d i f

Listing 2. rt kernel.cu1 # i n c l u d e <i o s t r e a m>2 # i n c l u d e <c u t i l m a t h . h>3 # i n c l u d e ” h e a d e r . h ”4 # i n c l u d e ” r t k e r n e l . h ”56 i n l i n e h o s t d e v i c e f l o a t 3 f 4 t o f 3 ( f l o a t 4 ←↩

i n )7 {8 re turn m a k e f l o a t 3 ( i n . x , i n . y , i n . z ) ;9 }

1011 i n l i n e h o s t d e v i c e f l o a t 4 f 3 t o f 4 ( f l o a t 3 ←↩

i n )12 {13 re turn m a k e f l o a t 4 ( i n . x , i n . y , i n . z , 0 . 0 f ) ;14 }1516 / / K e r ne l f o r i n i t i a l i z i n g image d a t a17 g l o b a l void i m a g e I n i t ( unsigned char∗ img , ←↩

unsigned i n t p i x e l s )18 {19 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x20 i n t x = t h r e a d I d x . x + b l o c k I d x . x ∗ blockDim . x ;21 i n t y = t h r e a d I d x . y + b l o c k I d x . y ∗ blockDim . y ;22 unsigned i n t mempos = x + y ∗ blockDim . x ∗ gridDim . x ;23 i f ( mempos > p i x e l s )24 re turn ;2526 img [ mempos∗4] = 0 ; / / Red c h a n n e l27 img [ mempos∗4 + 1] = 0 ; / / Green c h a n n e l28 img [ mempos∗4 + 2] = 0 ; / / Blue c h a n n e l

DATA PARALLEL COMPUTING 2011 6

29 }3031 / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s32 g l o b a l void r a y I n i t P e r s p e c t i v e ( f l o a t 4 ∗ r a y o r i g o ,33 f l o a t 4 ∗ r a y d i r e c t i o n ,34 f l o a t 4 eye ,35 unsigned i n t width ,36 unsigned i n t h e i g h t )37 {38 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x39 i n t i = t h r e a d I d x . x + b l o c k I d x . x ∗ blockDim . x ;40 i n t j = t h r e a d I d x . y + b l o c k I d x . y ∗ blockDim . y ;41 unsigned i n t mempos = i + j ∗ blockDim . x ∗ gridDim . x ;42 i f ( mempos > wid th∗ h e i g h t )43 re turn ;4445 / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e46 f l o a t p u = c o n s t i m g p l a n e . x + ( c o n s t i m g p l a n e . y − ←↩

c o n s t i m g p l a n e . x )47 ∗ ( i + 0 . 5 f ) / w id th ;48 f l o a t p v = c o n s t i m g p l a n e . z + ( c o n s t i m g p l a n e .w − ←↩

c o n s t i m g p l a n e . z )49 ∗ ( j + 0 . 5 f ) / h e i g h t ;5051 / / Wr i t e r a y o r i g o and d i r e c t i o n t o g l o b a l memory52 r a y o r i g o [ mempos ] = c o n s t e y e ;53 r a y d i r e c t i o n [ mempos ] = −c o n s t d∗cons t w + ←↩

p u∗ c o n s t u + p v∗ c o n s t v ;54 }5556 / / Check w e t h e r t h e p i x e l ’ s v i ewing r a y i n t e r s e c t s ←↩

wi th t h e s p h e r e s ,57 / / and shade t h e p i x e l c o r r e s p o n d i n g l y58 g l o b a l void r a y I n t e r s e c t S p h e r e s ( f l o a t 4 ∗ r a y o r i g o ,59 f l o a t 4 ∗ ←↩

r a y d i r e c t i o n ,60 f l o a t 4 ∗ p ,61 unsigned char∗ img ,62 unsigned i n t p i x e l s ,63 unsigned i n t np )64 {65 / / Compute p i x e l p o s i t i o n from t h r e a d I d x / b l o c k I d x66 i n t x = t h r e a d I d x . x + b l o c k I d x . x ∗ blockDim . x ;67 i n t y = t h r e a d I d x . y + b l o c k I d x . y ∗ blockDim . y ;68 unsigned i n t mempos = x + y ∗ blockDim . x ∗ gridDim . x ;69 i f ( mempos > p i x e l s )70 re turn ;7172 / / Read r a y d a t a from g l o b a l memory73 f l o a t 3 e = f 4 t o f 3 ( r a y o r i g o [ mempos ] ) ;74 f l o a t 3 d = f 4 t o f 3 ( r a y d i r e c t i o n [ mempos ] ) ;75 / / f l o a t s t e p = l e n g t h ( d ) ;7677 / / D i s t a n c e , i n r a y s t e p s , be tween o b j e c t and eye ←↩

i n i t i a l i z e d wi th a l a r g e v a l u e78 f l o a t t d i s t = 1 e10 f ;7980 / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n81 f l o a t 3 n ;8283 / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s84 f l o a t 3 p ;8586 / / I t e r a t e t h r o u g h a l l p a r t i c l e s87 f o r ( unsigned i n t i =0 ; i<np ; i ++) {8889 / / Read s p h e r e c o o r d i n a t e and r a d i u s90 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ;91 f l o a t R = p [ i ] . w;9293 / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 − 4AC94 f l o a t D e l t a = ←↩

( 2 . 0 f∗d o t ( d , ( e−c ) ) ) ∗ ( 2 . 0 f∗d o t ( d , ( e−c ) ) ) / / Bˆ295 − 4 . 0 f∗d o t ( d , d ) / / −4∗A96 ∗ ( d o t ( ( e−c ) , ( e−c ) ) − R∗R) ; / / C9798 / / I f t h e d e t e r m i n a n t i s p o s i t i v e , t h e r e a r e two ←↩

s o l u t i o n s99 / / One where t h e l i n e e n t e r s t h e sphe re , and one ←↩

where i t e x i t s100 i f ( D e l t a > 0 . 0 f ) {101102 / / C a l c u l a t e r o o t s , S h i r l e y 2009 p . 77103 f l o a t t m inus = ( ( d o t (−d , ( e−c ) ) − s q r t ( ←↩

d o t ( d , ( e−c ) )∗d o t ( d , ( e−c ) ) − d o t ( d , d )

104 ∗ ( d o t ( ( e−c ) , ( e−c ) ) − R∗R) ) ) / ←↩d o t ( d , d ) ) ;

105106 / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n ←↩

p r e v i o u s v a l u e s107 i f ( f a b s ( t minus ) < t d i s t ) {108 p = e + t minus∗d ;109 t d i s t = f a b s ( t minus ) ;110 n = n o r m a l i z e ( 2 . 0 f ∗ ( p − c ) ) ; / / S u r f a c e normal111 }112113 } / / End of s o l u t i o n b ra nc h114115 } / / End of p a r t i c l e loop116117 / / Wr i t e p i x e l c o l o r118 i f ( t d i s t < 1 e10 ) {119120 / / L a m b e r t i a n s h a d i n g p a r a m e t e r s121 f l o a t d o t p r o d = f a b s ( d o t ( n , f 4 t o f 3 ( c o n s t l i g h t ) ) ) ;122 f l o a t I d = 4 0 . 0 f ; / / L i g h t i n t e n s i t y123 f l o a t k d = 5 . 0 f ; / / D i f f u s e c o e f f i c i e n t124125 / / Ambient s h a d i n g126 f l o a t k a = 1 0 . 0 f ;127 f l o a t I a = 5 . 0 f ;128129 / / Wr i t e s h a d i n g model v a l u e s t o p i x e l c o l o r ←↩

c h a n n e l s130 img [ mempos∗4] = ( unsigned char ) ( ( k d ∗ I d ←↩

∗ d o t p r o d131 + k a ∗ I a ) ∗0.48 f ) ;132 img [ mempos∗4 + 1] = ( unsigned char ) ( ( k d ∗ I d ←↩

∗ d o t p r o d133 + k a ∗ I a ) ∗0.41 f ) ;134 img [ mempos∗4 + 2] = ( unsigned char ) ( ( k d ∗ I d ←↩

∗ d o t p r o d135 + k a ∗ I a ) ∗0.27 f ) ;136 }137 }138139140 e x t er n ”C”141 h o s t void c a m e r a I n i t ( f l o a t 4 eye , f l o a t 4 l o o k a t , ←↩

f l o a t imgw , f l o a t h w r a t i o )142 {143 / / Image d i m e n s i o n s i n wor ld s p a c e ( l , r , b , t )144 f l o a t 4 imgp lane = m a k e f l o a t 4 (−0.5 f∗imgw , ←↩

0 . 5 f∗imgw , −0.5 f∗imgw∗hw ra t i o , ←↩0 . 5 f∗imgw∗h w r a t i o ) ;

145146 / / The view v e c t o r147 f l o a t 4 view = eye − l o o k a t ;148149 / / C o n s t r u c t t h e camera view o r t h o n o r m a l base150 f l o a t 4 v = m a k e f l o a t 4 ( 0 . 0 f , 1 . 0 f , 0 . 0 f , 0 . 0 f ) ; / / ←↩

v : P o i n t i n g upward151 f l o a t 4 w = −view / l e n g t h ( view ) ; / / w: ←↩

P o i n t i n g backwards152 f l o a t 4 u = m a k e f l o a t 4 ( c r o s s ( m a k e f l o a t 3 ( v . x , v . y , ←↩

v . z ) ,153 m a k e f l o a t 3 (w. x , w. y , w. z ) ) ,154 0 . 0 f ) ; / / u : P o i n t i n g r i g h t155156 / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h157 f l o a t d = l e n g t h ( view ) ∗0 .8 f ;158159 / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e )160 f l o a t 4 l i g h t = ←↩

n o r m a l i z e (−1.0 f∗eye∗m a k e f l o a t 4 ( 1 . 0 f , 0 . 2 f , ←↩0 . 6 f , 0 . 0 f ) ) ;

161162 s t d : : c o u t << ” T r a n s f e r i n g camera v a l u e s t o ←↩

c o n s t a n t memory\n ” ;163164 cudaMemcpyToSymbol ( ” c o n s t u ” , &u , s i z e o f ( u ) ) ;165 cudaMemcpyToSymbol ( ” c o n s t v ” , &v , s i z e o f ( v ) ) ;166 cudaMemcpyToSymbol ( ” cons t w ” , &w, s i z e o f (w) ) ;167 cudaMemcpyToSymbol ( ” c o n s t e y e ” , &eye , s i z e o f ( eye ) ) ;168 cudaMemcpyToSymbol ( ” c o n s t i m g p l a n e ” , &imgplane , ←↩

s i z e o f ( imgp lane ) ) ;169 cudaMemcpyToSymbol ( ” c o n s t d ” , &d , s i z e o f ( d ) ) ;170 cudaMemcpyToSymbol ( ” c o n s t l i g h t ” , &l i g h t , ←↩

s i z e o f ( l i g h t ) ) ;171 }172

DATA PARALLEL COMPUTING 2011 7

173 / / Check f o r CUDA e r r o r s174 e x t er n ”C”175 h o s t void c h e c k F o r C u d a E r r o r s ( c o n s t char∗ ←↩

c h e c k p o i n t d e s c r i p t i o n )176 {177 c u d a E r r o r t e r r = c u d a G e t L a s t E r r o r ( ) ;178 i f ( e r r != c u d a S u c c e s s ) {179 s t d : : c o u t << ”\nCuda e r r o r d e t e c t e d , c h e c k p o i n t : ←↩

” << c h e c k p o i n t d e s c r i p t i o n180 << ”\n E r r o r s t r i n g : ” << ←↩

c u d a G e t E r r o r S t r i n g ( e r r ) << ”\n ” ;181 e x i t ( EXIT FAILURE ) ;182 }183 }184185186 / / Wrapper f o r t h e r t k e r n e l187 e x t er n ”C”188 h o s t i n t r t ( f l o a t 4 ∗ p , unsigned i n t np ,189 rgb∗ img , unsigned i n t width , ←↩

unsigned i n t h e i g h t ,190 f3 o r i g o , f3 L , f3 eye , f3 l o o k a t , f l o a t ←↩

imgw ) {191192 us ing s t d : : c o u t ;193194 c o u t << ” I n i t i a l i z i n g CUDA:\ n ” ;195196 / / I n i t i a l i z e GPU t imes t amp r e c o r d e r s197 f l o a t t1 , t 2 ;198 c u d a E v e n t t t1 go , t2 go , t 1 s t o p , t 2 s t o p ;199 c u d a E v e n t C r e a t e (& t1 go ) ;200 c u d a E v e n t C r e a t e (& t2 go ) ;201 c u d a E v e n t C r e a t e (& t 2 s t o p ) ;202 c u d a E v e n t C r e a t e (& t 1 s t o p ) ;203204 / / S t a r t t i m e r 1205 cudaEven tRecord ( t1 go , 0 ) ;206207 / / A l l o c a t e memory208 c o u t << ” A l l o c a t i n g d e v i c e memory\n ” ;209 s t a t i c f l o a t 4 ∗ p ; / / P a r t i c l e p o s i t i o n s ←↩

( x , y , z ) and r a d i u s (w)210 s t a t i c unsigned char ∗ img ; / / RGBw v a l u e s i n ←↩

image211 s t a t i c f l o a t 4 ∗ r a y o r i g o ; / / Ray o r i g o ( x , y , z )212 s t a t i c f l o a t 4 ∗ r a y d i r e c t i o n ; / / Ray d i r e c t i o n ←↩

( x , y , z )213 cudaMal loc ( ( void ∗∗)& p , np∗ s i z e o f ( f l o a t 4 ) ) ;214 cudaMal loc ( ( void ∗∗)& img , ←↩

wid th∗ h e i g h t ∗4∗ s i z e o f ( unsigned char ) ) ;215 cudaMal loc ( ( void ∗∗)& r a y o r i g o , ←↩

wid th∗ h e i g h t∗ s i z e o f ( f l o a t 4 ) ) ;216 cudaMal loc ( ( void ∗∗)& r a y d i r e c t i o n , ←↩

wid th∗ h e i g h t∗ s i z e o f ( f l o a t 4 ) ) ;217218 / / T r a n s f e r p a r t i c l e d a t a219 c o u t << ” T r a n s f e r i n g p a r t i c l e d a t a : h o s t −> ←↩

d e v i c e\n ” ;220 cudaMemcpy ( p , p , np∗ s i z e o f ( f l o a t 4 ) , ←↩

cudaMemcpyHostToDevice ) ;221222 / / Check f o r e r r o r s a f t e r memory a l l o c a t i o n223 c h e c k F o r C u d a E r r o r s ( ”CUDA e r r o r a f t e r memory ←↩

a l l o c a t i o n ” ) ;224225 / / Ar range t h r e a d / b l o c k s t r u c t u r e226 unsigned i n t p i x e l s = wid th∗ h e i g h t ;227 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) wid th ;228 dim3 t h r e a d s ( 1 6 , 1 6 ) ;229 dim3 b l o c k s ( ( wid th +15) / 1 6 , ( h e i g h t +15) / 1 6 ) ;230231 / / S t a r t t i m e r 2232 cudaEven tRecord ( t2 go , 0 ) ;233234 / / I n i t i a l i z e image t o background c o l o r235 i m a g e I n i t<<< b locks , t h r e a d s >>>( img , p i x e l s ) ;236237 / / I n i t i a l i z e camera238 c a m e r a I n i t ( m a k e f l o a t 4 ( eye . x , eye . y , eye . z , 0 . 0 f ) ,239 m a k e f l o a t 4 ( l o o k a t . x , l o o k a t . y , ←↩

l o o k a t . z , 0 . 0 f ) ,240 imgw , h w r a t i o ) ;241 c h e c k F o r C u d a E r r o r s ( ”CUDA e r r o r a f t e r c a m e r a I n i t ” ) ;242243 / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n

244 r a y I n i t P e r s p e c t i v e <<< b locks , t h r e a d s >>>(245 r a y o r i g o , r a y d i r e c t i o n ,246 m a k e f l o a t 4 ( eye . x , eye . y , eye . z , 0 . 0 f ) ,247 width , h e i g h t ) ;248249 / / F ind c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s250 r a y I n t e r s e c t S p h e r e s <<< b locks , t h r e a d s >>>(251 r a y o r i g o , r a y d i r e c t i o n ,252 p , img , p i x e l s , np ) ;253254255 / / Make s u r e a l l t h r e a d s a r e done b e f o r e c o n t i n u i n g ←↩

CPU c o n t r o l s e q u e n c e256 c u d a T h r e a d S y n c h r o n i z e ( ) ;257258 / / Check f o r e r r o r s259 c h e c k F o r C u d a E r r o r s ( ”CUDA e r r o r a f t e r k e r n e l ←↩

e x e c u t i o n ” ) ;260261 / / S top t i m e r 2262 cudaEven tRecord ( t 2 s t o p , 0 ) ;263 c u d a E v e n t S y n c h r o n i z e ( t 2 s t o p ) ;264265 / / T r a n s f e r image d a t a from d e v i c e t o h o s t266 c o u t << ” T r a n s f e r i n g image d a t a : d e v i c e −> h o s t\n ” ;267 cudaMemcpy ( img , img , ←↩

wid th∗ h e i g h t ∗4∗ s i z e o f ( unsigned char ) , ←↩cudaMemcpyDeviceToHost ) ;

268269 / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory270 c u d a F r e e ( p ) ;271 c u d a F r e e ( img ) ;272 c u d a F r e e ( r a y o r i g o ) ;273 c u d a F r e e ( r a y d i r e c t i o n ) ;274275 / / S top t i m e r 1276 cudaEven tRecord ( t 1 s t o p , 0 ) ;277 c u d a E v e n t S y n c h r o n i z e ( t 1 s t o p ) ;278279 / / C a l c u l a t e t ime s p e n t i n t 1 and t 2280 cudaEven tE lapsedTime (& t1 , t1 go , t 1 s t o p ) ;281 cudaEven tE lapsedTime (& t2 , t2 go , t 2 s t o p ) ;282283 / / Re po r t t ime s p e n t284 c o u t << ” Time s p e n t on e n t i r e GPU r o u t i n e : ”285 << t 1 << ” ms\n ” ;286 c o u t << ” − K e r n e l s : ” << t 2 << ” ms\n ”287 << ” − Memory a l l o c . and t r a n s f e r : ” << t1−t 2 ←↩

<< ” ms\n ” ;288289 / / Re tu rn s u c c e s s f u l l y290 re turn 0 ;291 }

B. CPU raytracing source code

Listing 3. rt kernel cpu.h1 # i f n d e f RT KERNEL CPU H2 # d e f i n e RT KERNEL CPU H34 # i n c l u d e <v e c t o r f u n c t i o n s . h>56 / / Host p r o t o t y p e f u n c t i o n s78 void c a m e r a I n i t ( f l o a t 3 eye , f l o a t 3 l o o k a t , f l o a t ←↩

imgw , f l o a t h w r a t i o ) ;9

10 i n t r t c p u ( f l o a t 4 ∗ p , c o n s t unsigned i n t np ,11 rgb∗ img , c o n s t unsigned i n t width , c o n s t ←↩

unsigned i n t h e i g h t ,12 f3 o r i g o , f3 L , f3 eye , f3 l o o k a t , f l o a t imgw ) ;1314 # e n d i f

Listing 4. rt kernel cpu.cpp1 # i n c l u d e <i o s t r e a m>2 # i n c l u d e <math . h>3 # i n c l u d e <t ime . h>4 # i n c l u d e <v e c t o r f u n c t i o n s . h>5 # i n c l u d e <c u t i l m a t h . h>6 # i n c l u d e ” h e a d e r . h ”7 # i n c l u d e ” r t k e r n e l c p u . h ”89 / / C o n s t a n t s

DATA PARALLEL COMPUTING 2011 8

10 f l o a t 3 c o n s t c u ;11 f l o a t 3 c o n s t c v ;12 f l o a t 3 cons tc w ;13 f l o a t 3 c o n s t c e y e ;14 f l o a t 4 c o n s t c i m g p l a n e ;15 f l o a t c o n s t c d ;16 f l o a t 3 c o n s t c l i g h t ;1718 i n l i n e f l o a t 3 f 4 t o f 3 ( f l o a t 4 i n )19 {20 re turn m a k e f l o a t 3 ( i n . x , i n . y , i n . z ) ;21 }2223 i n l i n e f l o a t 4 f 3 t o f 4 ( f l o a t 3 i n )24 {25 re turn m a k e f l o a t 4 ( i n . x , i n . y , i n . z , 0 . 0 f ) ;26 }2728 i n l i n e f l o a t l e n g t h f 3 ( f l o a t 3 i n )29 {30 re turn s q r t ( i n . x∗ i n . x + i n . y∗ i n . y + i n . z∗ i n . z ) ;31 }3233 / / K e r n e l f o r i n i t i a l i z i n g image d a t a34 void i m a g e I n i t c p u ( unsigned char∗ img , unsigned i n t ←↩

p i x e l s )35 {36 f o r ( unsigned i n t mempos =0; mempos<p i x e l s ; ←↩

mempos++) {37 img [ mempos∗4] = 0 ; / / Red c h a n n e l38 img [ mempos∗4 + 1] = 0 ; / / Green c h a n n e l39 img [ mempos∗4 + 2] = 0 ; / / Blue c h a n n e l40 }41 }4243 / / C a l c u l a t e r a y o r i g i n s and d i r e c t i o n s44 void r a y I n i t P e r s p e c t i v e c p u ( f l o a t 3 ∗ r a y o r i g o ,45 f l o a t 3 ∗ r a y d i r e c t i o n ,46 f l o a t 3 eye ,47 unsigned i n t width ,48 unsigned i n t h e i g h t )49 {50 i n t i ;51 # pragma omp p a r a l l e l f o r52 f o r ( i =0 ; i<wid th ; i ++) {53 f o r ( unsigned i n t j =0 ; j<h e i g h t ; j ++) {5455 unsigned i n t mempos = i + j∗ h e i g h t ;5657 / / C a l c u l a t e p i x e l c o o r d i n a t e s i n image p l a n e58 f l o a t p u = c o n s t c i m g p l a n e . x + ←↩

( c o n s t c i m g p l a n e . y − c o n s t c i m g p l a n e . x )59 ∗ ( i + 0 . 5 f ) / w id th ;60 f l o a t p v = c o n s t c i m g p l a n e . z + ←↩

( c o n s t c i m g p l a n e .w − c o n s t c i m g p l a n e . z )61 ∗ ( j + 0 . 5 f ) / h e i g h t ;6263 / / Wr i t e r a y o r i g o and d i r e c t i o n t o g l o b a l memory64 r a y o r i g o [ mempos ] = c o n s t c e y e ;65 r a y d i r e c t i o n [ mempos ] = −c o n s t c d∗cons tc w + ←↩

p u∗ c o n s t c u + p v∗ c o n s t c v ;66 }67 }68 }6970 / / Check w e t h e r t h e p i x e l ’ s v i ewing r a y i n t e r s e c t s ←↩

wi th t h e s p h e r e s ,71 / / and shade t h e p i x e l c o r r e s p o n d i n g l y72 void r a y I n t e r s e c t S p h e r e s c p u ( f l o a t 3 ∗ r a y o r i g o ,73 f l o a t 3 ∗ r a y d i r e c t i o n ,74 f l o a t 4 ∗ p ,75 unsigned char∗ img ,76 unsigned i n t p i x e l s ,77 unsigned i n t np )78 {79 i n t mempos ;80 # pragma omp p a r a l l e l f o r81 f o r ( mempos =0; mempos<p i x e l s ; mempos++) {8283 / / Read r a y d a t a from g l o b a l memory84 f l o a t 3 e = r a y o r i g o [ mempos ] ;85 f l o a t 3 d = r a y d i r e c t i o n [ mempos ] ;86 / / f l o a t s t e p = l e n g t h f 3 ( d ) ;8788 / / D i s t a n c e , i n r a y s t e p s , be tween o b j e c t and eye ←↩

i n i t i a l i z e d wi th a l a r g e v a l u e

89 f l o a t t d i s t = 1 e10 f ;9091 / / S u r f a c e normal a t c l o s e s t s p h e r e i n t e r s e c t i o n92 f l o a t 3 n ;9394 / / I n t e r s e c t i o n p o i n t c o o r d i n a t e s95 f l o a t 3 p ;9697 / / I t e r a t e t h r o u g h a l l p a r t i c l e s98 f o r ( unsigned i n t i =0 ; i<np ; i ++) {99

100 / / Read s p h e r e c o o r d i n a t e and r a d i u s101 f l o a t 3 c = f 4 t o f 3 ( p [ i ] ) ;102 f l o a t R = p [ i ] . w;103104 / / C a l c u l a t e t h e d i s c r i m i n a n t : d = Bˆ2 − 4AC105 f l o a t D e l t a = ←↩

( 2 . 0 f∗d o t ( d , ( e−c ) ) ) ∗ ( 2 . 0 f∗d o t ( d , ( e−c ) ) ) ←↩/ / Bˆ2

106 − 4 . 0 f∗d o t ( d , d ) / / −4∗A107 ∗ ( d o t ( ( e−c ) , ( e−c ) ) − R∗R) ; / / C108109 / / I f t h e d e t e r m i n a n t i s p o s i t i v e , t h e r e a r e ←↩

two s o l u t i o n s110 / / One where t h e l i n e e n t e r s t h e sphe re , and ←↩

one where i t e x i t s111 i f ( D e l t a > 0 . 0 f ) {112113 / / C a l c u l a t e r o o t s , S h i r l e y 2009 p . 77114 f l o a t t m inus = ( ( d o t (−d , ( e−c ) ) − s q r t ( ←↩

d o t ( d , ( e−c ) )∗d o t ( d , ( e−c ) ) − d o t ( d , d )115 ∗ ( d o t ( ( e−c ) , ( e−c ) ) − R∗R) ) ) / d o t ( d , d ) ) ;116117 / / Check w e t h e r i n t e r s e c t i o n i s c l o s e r t h a n ←↩

p r e v i o u s v a l u e s118 i f ( f a b s ( t minus ) < t d i s t ) {119 p = e + t minus∗d ;120 t d i s t = f a b s ( t minus ) ;121 n = n o r m a l i z e ( 2 . 0 f ∗ ( p − c ) ) ; / / S u r f a c e normal122 }123124 } / / End of s o l u t i o n b ra nc h125126 } / / End of p a r t i c l e loop127128 / / Wr i t e p i x e l c o l o r129 i f ( t d i s t < 1 e10 ) {130131 / / L a m b e r t i a n s h a d i n g p a r a m e t e r s132 f l o a t d o t p r o d = f a b s ( d o t ( n , c o n s t c l i g h t ) ) ;133 f l o a t I d = 4 0 . 0 f ; / / L i g h t i n t e n s i t y134 f l o a t k d = 5 . 0 f ; / / D i f f u s e c o e f f i c i e n t135136 / / Ambient s h a d i n g137 f l o a t k a = 1 0 . 0 f ;138 f l o a t I a = 5 . 0 f ;139140 / / Wr i t e s h a d i n g model v a l u e s t o p i x e l c o l o r ←↩

c h a n n e l s141 img [ mempos∗4] = ( unsigned char ) ( ( k d ∗ ←↩

I d ∗ d o t p r o d142 + k a ∗ I a ) ∗0.48 f ) ;143 img [ mempos∗4 + 1] = ( unsigned char ) ( ( k d ∗ ←↩

I d ∗ d o t p r o d144 + k a ∗ I a ) ∗0.41 f ) ;145 img [ mempos∗4 + 2] = ( unsigned char ) ( ( k d ∗ ←↩

I d ∗ d o t p r o d146 + k a ∗ I a ) ∗0.27 f ) ;147 }148 }149 }150151152 void c a m e r a I n i t c p u ( f l o a t 3 eye , f l o a t 3 l o o k a t , f l o a t ←↩

imgw , f l o a t h w r a t i o )153 {154 / / Image d i m e n s i o n s i n wor ld s p a c e ( l , r , b , t )155 f l o a t 4 imgp lane = m a k e f l o a t 4 (−0.5 f∗imgw , ←↩

0 . 5 f∗imgw , −0.5 f∗imgw∗hw ra t i o , ←↩0 . 5 f∗imgw∗h w r a t i o ) ;

156157 / / The view v e c t o r158 f l o a t 3 view = eye − l o o k a t ;159160 / / C o n s t r u c t t h e camera view o r t h o n o r m a l base

DATA PARALLEL COMPUTING 2011 9

161 f l o a t 3 v = m a k e f l o a t 3 ( 0 . 0 f , 1 . 0 f , 0 . 0 f ) ; / / v : ←↩P o i n t i n g upward

162 f l o a t 3 w = −view / l e n g t h f 3 ( view ) ; / / w: ←↩P o i n t i n g backwards

163 f l o a t 3 u = c r o s s ( m a k e f l o a t 3 ( v . x , v . y , v . z ) , ←↩m a k e f l o a t 3 (w. x , w. y , w. z ) ) ; / / u : P o i n t i n g r i g h t

164165 / / F o c a l l e n g t h 20% of eye v e c t o r l e n g t h166 f l o a t d = l e n g t h f 3 ( view ) ∗0 .8 f ;167168 / / L i g h t d i r e c t i o n ( p o i n t s t o w a r d s l i g h t s o u r c e )169 f l o a t 3 l i g h t = ←↩

n o r m a l i z e (−1.0 f∗eye∗m a k e f l o a t 3 ( 1 . 0 f , 0 . 2 f , ←↩0 . 6 f ) ) ;

170171 s t d : : c o u t << ” T r a n s f e r i n g camera v a l u e s t o ←↩

c o n s t a n t memory\n ” ;172173 c o n s t c u = u ;174 c o n s t c v = v ;175 cons tc w = w;176 c o n s t c e y e = eye ;177 c o n s t c i m g p l a n e = imgp lane ;178 c o n s t c d = d ;179 c o n s t c l i g h t = l i g h t ;180 }181182183 / / Wrapper f o r t h e r t a l g o r i t h m184 i n t r t c p u ( f l o a t 4 ∗ p , unsigned i n t np ,185 rgb∗ img , unsigned i n t width , unsigned i n t ←↩

h e i g h t ,186 f3 o r i g o , f3 L , f3 eye , f3 l o o k a t , f l o a t imgw ) {187188 us ing s t d : : c o u t ;189190 c o u t << ” I n i t i a l i z i n g CPU r a y t r a c e r :\ n ” ;191192 / / I n i t i a l i z e GPU t imes t amp r e c o r d e r s193 f l o a t t1 go , t2 go , t 1 s t o p , t 2 s t o p ;194195 / / S t a r t t i m e r 1196 t1 go = c l o c k ( ) ;197198 / / A l l o c a t e memory199 c o u t << ” A l l o c a t i n g d e v i c e memory\n ” ;200 s t a t i c unsigned char ∗ img ; / / RGBw v a l u e s i n ←↩

image201 s t a t i c f l o a t 3 ∗ r a y o r i g o ; / / Ray o r i g o ( x , y , z )202 s t a t i c f l o a t 3 ∗ r a y d i r e c t i o n ; / / Ray d i r e c t i o n ←↩

( x , y , z )203 img = new unsigned char [ wid th∗ h e i g h t ∗4 ] ;204 r a y o r i g o = new f l o a t 3 [ wid th∗ h e i g h t ] ;205 r a y d i r e c t i o n = new f l o a t 3 [ wid th∗ h e i g h t ] ;206207 / / Ar range t h r e a d / b l o c k s t r u c t u r e208 unsigned i n t p i x e l s = wid th∗ h e i g h t ;209 f l o a t h w r a t i o = ( f l o a t ) h e i g h t / ( f l o a t ) wid th ;210211 / / S t a r t t i m e r 2212 t2 go = c l o c k ( ) ;213214 / / I n i t i a l i z e image t o background c o l o r215 i m a g e I n i t c p u ( img , p i x e l s ) ;216217 / / I n i t i a l i z e camera218 c a m e r a I n i t c p u ( m a k e f l o a t 3 ( eye . x , eye . y , eye . z ) ,219 m a k e f l o a t 3 ( l o o k a t . x , l o o k a t . y , ←↩

l o o k a t . z ) ,220 imgw , h w r a t i o ) ;221222 / / C o n s t r u c t r a y s f o r p e r s p e c t i v e p r o j e c t i o n223 r a y I n i t P e r s p e c t i v e c p u (224 r a y o r i g o , r a y d i r e c t i o n ,225 m a k e f l o a t 3 ( eye . x , eye . y , eye . z ) ,226 width , h e i g h t ) ;227228 / / F ind c l o s e s t i n t e r s e c t i o n between r a y s and s p h e r e s229 r a y I n t e r s e c t S p h e r e s c p u (230 r a y o r i g o , r a y d i r e c t i o n ,231 p , img , p i x e l s , np ) ;232233 / / S top t i m e r 2234 t 2 s t o p = c l o c k ( ) ;235236 memcpy ( img , img , s i z e o f ( unsigned char )∗ p i x e l s ∗4) ;

237238 / / F ree d y n a m i c a l l y a l l o c a t e d d e v i c e memory239 d e l e t e [ ] img ;240 d e l e t e [ ] r a y o r i g o ;241 d e l e t e [ ] r a y d i r e c t i o n ;242243 / / S top t i m e r 1244 t 1 s t o p = c l o c k ( ) ;245246 / / Re po r t t ime s p e n t247 c o u t << ” Time s p e n t on e n t i r e CPU r a y t r a c i n g ←↩

r o u t i n e : ”248 << ( t 1 s t o p−t 1 go ) / CLOCKS PER SEC∗1000.0 << ” ←↩

ms\n ” ;249 c o u t << ” − F u n c t i o n s : ” << ←↩

( t 2 s t o p−t 2 go ) / CLOCKS PER SEC∗1000.0 << ” ms\n ” ;250251 / / Re tu rn s u c c e s s f u l l y252 re turn 0 ;253 }

DATA PARALLEL COMPUTING 2011 10

APPENDIX CBLOOPERS

This section contains pictures of the unfinished raytracer ina malfunctioning state, which can provide interesting, howeverunusable, results.

Fig. 5. An end result encountered way too many times, e.g. after inadvertentlydividing the focal length with zero.

Fig. 6. Twisted appearance caused by wrong thread/block layout.

Fig. 7. A problem with the distance determination caused the particles tobe displayed in a wrong order.

Fig. 8. Colorful result caused by forgetting to normalize the normal vectorcalculation.

Fig. 9. Another colorful result with wrong order of particles, cause by usingabs() (meant for integers) instead of fabs() (meant for floats).