Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Many-core Architectures

(Regular Submission)

Kai Xiao
University of Notre Dame
[email protected]

Danny Z. Chen∗
University of Notre Dame
[email protected]

X. Sharon Hu
University of Notre Dame
[email protected]

Bo Zhou
Altera Corp
[email protected]

Abstract

Shared memory many-core processors such as GPUs have been extensively used in accelerating computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared memory many-core architectures, non-trivial modifications are often needed in order to match the execution patterns of the target algorithms with the characteristics of many-core architectures. 3D curve traversal is a fundamental process in many applications, and is commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g., kd-trees). However, curve traversal using hierarchical data structures needs to conduct repeated hierarchical searches. Such a search process is time-consuming on shared memory many-core architectures since it incurs considerable amounts of expensive memory accesses and execution divergence. In this paper, we propose a novel spatial decomposition based data structure, called Shell, which completely avoids hierarchical search for 3D curve traversal. In Shell, a structure is built on the boundary of each region in the decomposed space, which allows any curve traversing in a region to find the next neighboring region to traverse using table lookup schemes, without any hierarchical search. While our 3D curve traversal approach works for other spatial decomposition paradigms and many-core processors, we illustrate it using kd-tree decomposition on GPU and compare with the fastest known kd-tree searching algorithms for ray traversal. Analysis and experimental results show that our approach improves ray traversal performance considerably over the kd-tree searching approaches.

Keywords: Many-core architecture, GPU, data structure, spatial decomposition, 3D curve traversal.

∗The research of D.Z. Chen was supported in part by NSF under Grants CCF-0916606 and CCF-1217906.

1 Introduction

3D geometric scenes are involved in many applications, in which large numbers of curve traversal (e.g., ray traversal) operations are frequently conducted. Spatial decomposition based data structures have been developed on shared memory many-core architectures (e.g., general purpose graphics processing units (GPGPUs)) for accelerating curve traversal solutions. In this paper, we present a new efficient spatial decomposition based data structure for 3D curve traversal that better exploits the characteristics of shared memory many-core architectures and avoids the hierarchical searches that are commonly performed in other known spatial decomposition data structures.

Shared memory many-core processors present great opportunities to speed up computation-intensive applications by parallelization [19, 20, 22, 26]. Recent GPGPUs, such as those from NVIDIA, AMD, and Intel, leverage massively parallel architectures based on single-instruction, multiple-data (SIMD) processor cores to achieve high performance [9]. In this paper, we use an NVIDIA GPU with the Fermi architecture as the model for illustration and experiments, but our solutions can be applied to other types of shared memory many-core processors, such as those with an MIMD architecture (e.g., Intel SCC [7]).

Due to the specific characteristics of GPU, sequential or parallel algorithms straightforwardly ported to GPU often suffer from a number of performance bottlenecks, such as poor memory access efficiency and execution divergence, and thus utilize only a fraction of the GPU computation power [1]. A shared memory many-core processor, especially a GPU, commonly contains hundreds of cores (e.g., the NVIDIA GTX570 contains 480 cores [23]). All these cores are connected to one storage component such as a shared cache or main memory. During execution, the memory bandwidth shared by multiple cores is often insufficient to support a large number of simultaneous memory requests. Hence, memory access efficiency on GPUs is much more crucial for performance than on traditional pipelined processors such as CPUs. For example, in the NVIDIA Fermi GPU, the latency of an off-chip memory transaction is 400-600 times longer than that of the fastest computational instructions. Since computational operations can be executed simultaneously by multiple cores but memory transactions need to be processed sequentially by the memory controller shared by these cores, the effective memory transaction latency is usually even worse. Besides memory access efficiency, execution divergence is another major performance bottleneck. The SIMD architecture used in GPU attains the most benefit when a group of cores (32-48 in NVIDIA Fermi) follows the same instruction flow. But when the execution paths running on different cores diverge due to branches (e.g., conditional statements), their parallel execution can no longer be sustained and serial execution takes place instead. Such divergent execution paths severely deteriorate performance [30]. Achieving good performance on SIMD systems requires a solid understanding of both the architecture and the execution patterns of the target algorithms [18].

Curve traversal traces the trajectory of a curve through a geometric scene. A curve can be a line segment, a ray, or a general algebraic curve. Ray traversal is a special case of curve traversal and a fundamental process in many applications, such as graphics ray tracing [2, 29] and radiation dose calculation [5, 31]. In this paper, we use ray traversal as an example to illustrate the problem and our solution for 3D curve traversal on many-core architectures. Our approach can be easily extended to traverse other types of curves.

Since applications using ray traversal often involve large numbers of rays and repeatedly conduct the traversal process for such rays (e.g., to generate high-resolution images in graphics ray tracing), the execution speed of ray traversal is critical for such applications. A number of data structures have been developed to partition and organize geometric objects in 3D scenes to improve the efficiency of ray traversal. One important class of such data structures is based on spatial decomposition (e.g., grids [10, 17], octrees and kd-trees [28]). A spatial decomposition scheme partitions a geometric scene into a set of regions, each containing a (small) number of objects. The subdivided regions are normally organized by a hierarchical structure which represents the geometric relationship among those regions. The kd-tree is a commonly used hierarchical structure due to its ability to match the subdivided regions with the distribution of geometric objects in the scene. It is also considered to be the data structure that provides the fastest known ray traversal speed in static scenes because of its efficient hierarchical search mechanism [32].

Quite a few algorithms have been developed for 3D ray traversal using kd-tree structures, in which repeated hierarchical searches in the tree are needed to find the neighboring regions for traversal [11, 25]. For kd-tree based ray traversal on a traditional CPU architecture, a stack is normally used to store the tree nodes along a search path. On GPU, due to the lack of stack support and the limited capacity of on-chip memory, a stack based approach typically allocates stacks in off-chip memory (e.g., global memory). The long access latency of the off-chip memory can easily become a performance bottleneck for stack-based ray traversal on GPUs [16]. Several stack-less kd-tree ray traversal algorithms on GPU have been proposed to address this challenge. For example, Foley et al. [8] and Horn et al. [15] proposed a restart traversal scheme on kd-trees which starts the tree search from the root, and uses a push-down method to move the "root" down to the minimum subtree to be searched. Horn et al. [15] also developed a short-stack approach which builds a small circular stack in the on-chip memory of GPU, and combines it with the push-down method (PD-SS). Popov et al. [24] proposed a kd-rope algorithm [13] using the concept of neighbor links, where each leaf node in the kd-tree contains not only the region R it represents but also a set of pointers, each pointing to a minimum subtree that contains all neighboring regions touching one of the boundaries (or faces) of R. Santos et al. [25] improved the kd-rope implementation, making it the fastest known kd-tree searching ray traversal approach on GPU (called kd-rope++).

Although these algorithms considerably improved the kd-tree searching 3D ray traversal performance on GPU, all of them still rely on hierarchical search to find the next traversal region when a ray crosses a region boundary (i.e., when a ray exits from a region). During ray traversal, each search operation incurs reading a tree node, and the search for the next neighboring traversal region can visit O(H) nodes in a kd-tree, where H is the height of the tree. Note that each reading of a visited tree node may take multiple memory transactions to obtain the entire information package for the node because each GPU memory transaction has a limited size (e.g., 128 bytes). Also, the threads for different rays may follow different sequences of visited tree nodes, which can result in execution divergence. Hence, the performance of kd-tree searching ray traversal on GPU can suffer considerably from these issues.

We propose a new spatial decomposition based data structure, called Shell, which completely eliminates hierarchical search, hence leading to an efficient solution for 3D curve traversal on GPU. The purpose of tree search is to find the next neighboring region when a curve (say, a ray) crosses a region boundary, by using the geometric information of the curve and regions. To avoid tree search, Shell provides a neighboring region locating mechanism based on table lookup techniques, which replaces hierarchical search and finds the next region for any traversing ray, allowing the ray to directly access the next region's information.

Generally speaking, given the set of decomposed 3D regions in a hierarchical structure (say, a kd-tree), Shell focuses on the neighboring relationship among the regions for all leaf tree nodes, called leaf regions. For each leaf region R, the information of its neighboring leaf regions is captured in Shell by geometric structures called arrangements, with one arrangement per boundary face of R. An arrangement is a partition of a face F of R into a set of 2D regions called cells, such that each cell C covers an area of F touching a neighboring leaf region R′ of R and contains information of R′. When a ray exits R by crossing F in C, it acquires the neighboring region information of R′ by accessing C. Of course, schemes for quickly finding C (which is hit by the ray) are needed to avoid tree searches and other performance bottlenecks incurred by hierarchical searching data structures (for this, we apply various table lookup schemes).

There are two key factors in designing the Shell table lookup schemes: the ray traversal speed and memory usage. The ray traversal speed essentially depends on how efficiently a cell can be located when a ray crosses a region face through that cell; the memory usage is related to the total number of cells on the faces of all leaf regions in Shell (i.e., the sizes of the arrangements or lookup tables for all leaf regions). Both factors are affected by the partition schemes for generating the arrangements. We seek to balance the ray traversal performance with good memory usage in Shell, and present a set of partition schemes, including (from simple to sophisticated) uniform grids, multi-level uniform grids, and compressed non-uniform grids, to deal with different neighboring region settings. Each such scheme can be viewed as an extension of the simpler ones for obtaining a good trade-off between ray traversal performance and memory usage for a more complicated neighboring region setting. We also exploit several memory accessing techniques [3, 6, 14] to further reduce the memory transactions used in Shell based ray traversal.

Given a geometric scene with N objects, suppose a kd-tree decomposition partitions it into M leaf regions with a tree height of O(log M) using O(M + N) memory. A ray traversal in such a kd-tree takes O(log M) search steps (i.e., the number of visited tree nodes or memory transactions involved) each time a ray crosses a region boundary. In comparison, using Shell, only one processing step (or O(1) memory transactions) without any tree search is needed to find the next region. The Shell memory usage is O(M + N + U), where U is the total number of cells in the arrangements for all leaf regions. Using judicious partition and compression schemes, the O(M + N + U) memory bound of Shell can be made in practice comparable to the O(M + N) memory bound of the kd-tree. Since many applications use massive numbers of traversing rays, each of which can cross many regions in the scenes, reducing the number of memory transactions from O(log M) to O(1) per region crossing for each ray on GPU can be significant.

To demonstrate the effectiveness of our approach, we implemented our Shell data structure based on a kd-tree decomposition with all our arrangement schemes. Our experiments were conducted on ray traversals in several benchmark geometric scenes [27] used in graphics rendering. The experimental results show that our Shell approach outperforms the fastest known kd-tree searching ray traversal approaches on GPU by over 2X to 5X. Although ray traversal, kd-tree decomposition, and the GPU architecture are used to illustrate our method, our Shell data structure and ray traversal algorithms can be easily extended to traversing 3D curves of other types, using different spatial decomposition schemes, and on other many-core architectures.

2 Ray Traversal Using Kd-tree Decomposition

In this section, we give a general discussion of ray traversal based on spatial decomposition (say, a kd-tree). We also briefly illustrate the PD-SS [15] and kd-rope [13] ray traversal algorithms on GPU.

Given a spatially decomposed scene, we consider the regions corresponding to all leaf tree nodes, called leaf regions (e.g., in Figure 1(a), a 2D scene is decomposed into 18 leaf regions). We can use a dual graph G to represent all leaf regions, as follows: Each vertex in G corresponds to exactly one leaf region, and two vertices are connected by an edge in G if and only if the two corresponding regions share any common boundary portion. With this graph model, the ray traversal problem for a ray r is to find a path (i.e., a sequence of vertices) in G determined by the trajectory of r (i.e., the sequence of regions intersected by r, in the order of the corresponding vertices on the path). To obtain the path in G for r, a key issue, which we call the next-region issue, is, at each vertex v (for a region R_v) of the path, to find the next vertex v′ (for a region R_v′) on the path, i.e., the region R_v′ that the ray r enters after region R_v. This issue can be resolved by using the geometric location information of r traversing in R_v and the neighboring leaf regions of R_v. In implementation, resolving this issue means mapping a point on a boundary face of R_v (from which r exits R_v) to a memory address (which stores the information of the neighboring region R_v′ traversed next by r). How to resolve the next-region issue effectively is an essential task for any ray traversal solution.

Figure 1(b) shows the kd-tree representing the spatial decomposition in Figure 1(a). Since a kd-tree organizes the relationship among its regions in a tree structure, hierarchical search is the basic process for solving the next-region issue used by any kd-tree searching ray traversal algorithm. For example, when ray1 leaves region F in Figure 1(a), the next region D (and the memory address storing the information of region D) should be found based on the exit point of ray1 from a boundary face of region F.

However, a hierarchical search accesses a number of internal nodes in the kd-tree, which impacts ray traversal performance since it incurs considerable memory transactions and divergent branches. Although considered the most efficient kd-tree searching ray traversal algorithms on GPU, PD-SS and kd-rope inherit the hierarchical search mechanism from the kd-tree structure to find neighboring regions. In the worst case, both algorithms still visit O(H) nodes to find a neighboring region, where H is the height of the kd-tree. Further, accessing each internal node takes multiple memory transactions (usually 2-4 for different kd-tree implementations), which is a considerable overhead for ray traversal on GPU. Moreover, the numbers of nodes accessed by different rays using the same algorithm can be quite different (e.g., in PD-SS, ray1 accesses 28 nodes and ray2 accesses 20 nodes), which means that the rays can have different workloads (the numbers of tree nodes accessed). In parallel execution on GPU, since many rays are often simultaneously processed by processor cores on an SIMD architecture, the workload variance of the rays can cause execution divergence and hence deteriorate the ray traversal performance.
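The cost pattern described above can be made concrete with a small sketch. The code below is not PD-SS or kd-rope; it is a plain root-to-leaf point location on a kd-tree, written only to show that the number of nodes read depends on the depth of the target leaf and so differs from ray to ray. The `KdNode` layout, the field names, and the tiny example tree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class KdNode:
    # Hypothetical node layout, for illustration only.
    axis: int = -1                     # split axis (0=x, 1=y, 2=z); -1 marks a leaf
    split: float = 0.0                 # position of the splitting plane
    below: Optional["KdNode"] = None   # child holding coordinates < split
    above: Optional["KdNode"] = None   # child holding coordinates >= split
    region_id: int = -1                # leaf region identifier (leaves only)


def locate_leaf(root: KdNode, point: Sequence[float]) -> Tuple[int, int]:
    """Descend from the root to the leaf containing `point`.

    Returns (region_id, number of tree nodes read on the way)."""
    node, visited = root, 1
    while node.axis != -1:
        node = node.below if point[node.axis] < node.split else node.above
        visited += 1
    return node.region_id, visited


# A tiny unbalanced tree, so that two query points end at different depths.
root = KdNode(axis=0, split=0.5,
              below=KdNode(region_id=0),
              above=KdNode(axis=1, split=0.5,
                           below=KdNode(region_id=1),
                           above=KdNode(axis=2, split=0.5,
                                        below=KdNode(region_id=2),
                                        above=KdNode(region_id=3))))

shallow = locate_leaf(root, (0.2, 0.9, 0.9))   # region 0, 2 nodes read
deep = locate_leaf(root, (0.9, 0.9, 0.9))      # region 3, 4 nodes read
```

Two threads running these two queries side by side would leave the loop after different numbers of iterations, which is the workload variance the text attributes to hierarchical search.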

To eliminate the hierarchical search overhead for 3D ray traversal on GPU, we develop a new data structure for resolving the next-region issue based on table lookup schemes. In designing a GPU table lookup scheme for the next-region issue, we need to consider two key factors: (1) It should allow the next neighboring region to be found quickly (in particular, with only a few memory transactions); (2) it should not take too much GPU memory (large memory usage can limit its application to scenes with large numbers of objects). We present a series of schemes to address these two factors for various neighboring region settings.

3 The Shell Data Structure

We propose a new data structure, called Shell, for directly finding the next neighboring regions without any tree search. By avoiding hierarchical search, Shell reduces not only memory transactions but also execution divergence. Given a kd-tree decomposed scene, based on our dual graph model, Shell focuses on the leaf regions stored at all leaf nodes of the kd-tree. For each leaf region R, we put geometric structures, called arrangements (our "tables"), on the boundary of R (intuitively, the "Shell" of region R) to capture the information of all neighboring leaf regions of R. When a ray r hits a boundary face F of R to exit from R, the arrangement for F allows r to quickly find the neighboring region of R to traverse next.

The arrangements are a key structure of Shell. They essentially provide table lookup schemes for mapping the geometric information of a ray r traversing in a region R and the neighboring regions of R to a GPU memory address storing the information of the next neighboring region of R traversed by r. For each face F of every leaf region R, an arrangement A(F) partitions F into a set of 2D areas, called cells, such that each cell touches only one neighboring region of R on F. All cells of A(F) are organized by a specific data structure such that given any point p on F (e.g., the hit point of any ray), the cell of A(F) containing p can be found quickly. As a table lookup scheme, the cells of A(F) form a data table, and each cell gives (say) a pointer to its corresponding neighboring region of R. Ideally, the following features are desired from good arrangements: (1) An arbitrary ray r can quickly find the cell C containing its hit point on a face of R (say, with only a few memory transactions), and (2) the memory requirement for storing all cells is not too high. There is a trade-off between these two features. Depending on different settings of neighboring regions, we propose a set of partition and compression schemes for building arrangements to achieve good performance with respect to the ray traversal speed and memory usage of Shell. Since the GPU memory architecture prefers a simple memory layout for efficient addressing, we choose arrays as the main data structure in GPU for storing arrangements based on all our partition schemes.

Figure 2(b) illustrates the Shell structure based on the spatial decomposition in Figure 2(a). Here, to illustrate the idea of a partition scheme for arrangements of neighboring regions, we use a simple uniform grid as an example to partition all region boundaries in the scene. Each leaf region is represented by a Shell-Region (SR) (e.g., see a Shell-Region in Figure 2(c)). Every Shell-Region uses an arrangement to partition each of its boundary faces into a set of cells (e.g., the shaded small boxes around the boundary of the Shell-Region in Figure 2(c)). Each cell is called a Shell-Unit (SU), which covers an area on its boundary face and contains a pointer to a neighboring Shell-Region touching that cell on the face. In Figure 2(c), the highlighted Shell-Unit contains a pointer to Shell-Region D. The memory layout of Shell, shown in Figure 2(d), consists of an array of all Shell-Regions and an array of all Shell-Units stored in the GPU global memory. The Shell-Units of the same boundary face of a Shell-Region are all allocated as a group in consecutive memory space, and the memory address of the first Shell-Unit in the group is stored in the corresponding Shell-Region structure. Thus, any Shell-Unit can be addressed by computing its memory offset from the first Shell-Unit in the same group. When a ray r leaves a region through a face F, the neighboring region entered next by r is found by accessing the Shell-Unit in which r crosses F.
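A minimal sketch of this two-array layout is given below, using Python lists in place of GPU global memory and list indices in place of memory addresses. The class name `ShellLayout`, the face labels, and the method names are hypothetical; only the idea of one contiguous Shell-Unit group per face, addressed by a base stored in the Shell-Region, is taken from the description above.

```python
from typing import Dict, List


class ShellLayout:
    """Two flat arrays: one of Shell-Regions, one of Shell-Units.

    List indices stand in for memory addresses (an assumption of this sketch)."""

    def __init__(self) -> None:
        self.units: List[int] = []               # Shell-Unit array: one neighbor region index per cell
        self.regions: List[Dict[str, int]] = []  # Shell-Region array: face label -> index of first unit

    def add_region(self, faces: Dict[str, List[int]]) -> int:
        """Append one Shell-Region; the units of each face are stored contiguously."""
        first_unit = {}
        for face, neighbor_ids in faces.items():
            first_unit[face] = len(self.units)   # base address of this face's group
            self.units.extend(neighbor_ids)
        self.regions.append(first_unit)
        return len(self.regions) - 1

    def next_region(self, region: int, face: str, offset: int) -> int:
        """Base address plus offset: two array reads, no tree search."""
        return self.units[self.regions[region][face] + offset]


layout = ShellLayout()
a = layout.add_region({"+x": [1, 1, 2, 2]})   # region 0: its +x face touches regions 1 and 2
b = layout.add_region({"-x": [0, 0]})         # region 1: its -x face touches region 0
c = layout.add_region({"-x": [0, 0]})         # region 2: its -x face touches region 0
```

How the offset itself is computed from a ray's hit point depends on the partition scheme used for the face.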

3.1 The Structure of an Individual Shell-Region

A Shell-Region contains information of a 3D leaf region and the arrangements of its six 2D boundary faces. To ensure that a Shell-Region can be quickly loaded from memory, its memory requirement should ideally be no bigger than the size of a single memory transaction in the hardware architecture (e.g., on the NVIDIA GTX570 GPU, an off-chip memory transaction can load 128 consecutive bytes into the on-chip L1 cache). Essentially, the leaf region information of a Shell-Region R consists of its geometric location and size, as well as a pointer to the list of its associated geometric objects. As R is an axis-aligned box in 3D, its geometric location and size can be represented by the two vertices on a diagonal of the box: V0 (the vertex with the smallest coordinates in the x, y, and z directions of R) and V1 (the vertex with the largest coordinates in the x, y, and z directions of R).
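As a rough size check, the sketch below packs one Shell-Region record with Python's `struct` module and compares it with the 128-byte transaction size quoted above. The field widths (32-bit floats and 32-bit indices), the field order, and the choice to store one first-Shell-Unit index per face are assumptions of this sketch, not the layout used in the actual implementation.

```python
import struct

# Hypothetical packed layout (little-endian, no padding):
#   6 x float32  V0 = (x, y, z) followed by V1 = (x, y, z)
#   1 x uint32   index of the region's list of geometric objects
#   6 x uint32   index of the first Shell-Unit of each of the six faces
SHELL_REGION_FORMAT = "<6fI6I"
TRANSACTION_BYTES = 128   # transaction size quoted in the text for the GTX570


def pack_shell_region(v0, v1, object_list, face_bases):
    """Serialize one Shell-Region record into bytes."""
    return struct.pack(SHELL_REGION_FORMAT, *v0, *v1, object_list, *face_bases)


def unpack_shell_region(blob):
    """Inverse of pack_shell_region: returns (v0, v1, object_list, face_bases)."""
    fields = struct.unpack(SHELL_REGION_FORMAT, blob)
    return fields[0:3], fields[3:6], fields[6], fields[7:13]


record_size = struct.calcsize(SHELL_REGION_FORMAT)   # 6*4 + 4 + 6*4 bytes
blob = pack_shell_region((0.0, 0.0, 0.0), (1.0, 0.5, 2.0), 7, (0, 4, 8, 12, 16, 20))
```

Under these assumptions a record occupies 52 bytes, which leaves room within one transaction for extra per-face data such as grid resolutions.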

An arrangement of a region face consists of a geometric partition scheme subdividing the face into a set of cells (i.e., Shell-Units), and a data structure for mapping the geometric locations of these cells to the memory addresses of the corresponding Shell-Units in the GPU global memory. The memory address of the first Shell-Unit for each face is stored in its Shell-Region, and is used as the base for addressing the other Shell-Units of the face. The partition scheme determines the time and memory usage of a ray traversal algorithm using Shell. There is a trade-off between these two factors. For example, storing more Shell-Units may make the cell locating process easier and quicker, and hence speed up ray traversal, at the cost of more memory. To achieve a good balance between traversal time and memory usage for different neighboring region settings, we propose a series of partition schemes, including (from simple to sophisticated) uniform grid, multi-level uniform grids, and non-uniform grid, as well as a grid compression scheme. Each scheme is built on top of the simpler ones and reduces memory usage by using slightly more computational operations for locating cells (but not more memory transactions), to handle a more complicated neighboring region setting, as presented below.

3.1.1 Uniform Grid Schemes

A uniform grid partitions each region face into a matrix of cells (pixels) of the same size. This arrangement easily maps the geometric locations of the face to the memory addresses of the pixels. Each pixel is a cell representing a Shell-Unit, which contains a pointer to the neighboring region touching this pixel.

Using a uniform grid for all region boundaries is the simplest approach to build a Shell structure. Figure 2(b) gives an example for this on a 2D scene with 8 regions. Such a Shell structure can be viewed as a combination of the kd-tree and uniform grid approaches.
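The mapping from a hit point to a pixel, and from the pixel to a Shell-Unit, can be sketched as follows. The function names, the row-major ordering of the pixels, and the clamping of points that lie exactly on the far edge of the face are choices made for this sketch; the hit point is assumed to lie on the face.

```python
def uniform_grid_offset(u, v, face_min, face_max, nx, ny):
    """Map a hit point (u, v) on a face to the row-major index of its pixel.

    face_min and face_max are the (u, v) corners of the face; nx and ny are
    the numbers of pixels along u and v. The min() calls place points lying
    exactly on the far edge into the last pixel."""
    i = min(int((u - face_min[0]) / (face_max[0] - face_min[0]) * nx), nx - 1)
    j = min(int((v - face_min[1]) / (face_max[1] - face_min[1]) * ny), ny - 1)
    return j * nx + i


def next_region_uniform(face_units, u, v, face_min, face_max, nx, ny):
    """face_units is the contiguous group of Shell-Units (neighbor ids) of the face."""
    return face_units[uniform_grid_offset(u, v, face_min, face_max, nx, ny)]


# A face of size 4 x 2 with 4 x 2 pixels: the left half touches region 5,
# the right half touches region 9 (made-up region ids).
units = [5, 5, 9, 9,
         5, 5, 9, 9]
```

Here `next_region_uniform(units, 1.9, 0.1, (0.0, 0.0), (4.0, 2.0), 4, 2)` selects a pixel in the left half and therefore returns region 5, using only arithmetic and a single table read.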

The uniform grid scheme provides a fast accessing mechanism for the Shell-Units, but it can use a large amount of memory. It is therefore suitable when each region face has a simple distribution of neighboring regions (say, for scenes containing well-shaped objects with a relatively regular distribution). However, for a complicated scene, the region faces often have a large number of neighbors with non-regular distributions. Hence, the uniform grid may have to adopt as its pixel size the finest resolution required by the neighbor distributions on all region faces, and the number of pixels in the grid (and the Shell memory space) can be quite large.

To address this issue, instead of using one uniform grid partition for all region faces, we can use multiple uniform grids with different levels of resolution to partition the region faces depending on different neighboring region settings. This is called the multi-level uniform grid scheme, which provides some flexibility to the region faces with simple neighbor distributions so that they can use coarse grid resolutions and thus store fewer pixels (Shell-Units). For the region faces with complicated neighbor distributions, we still must use uniform pixels of sufficiently fine resolutions. The resolution of each grid used in the multi-level uniform grid scheme needs to be stored for each face of every Shell-Region, and is used to calculate the pixel hit by a ray on the face and the memory address of the corresponding Shell-Unit.

Compared with a uniform grid, the multi-level grid scheme can reduce the number of Shell-Units on the region faces and hence save memory space, with a very small time overhead.
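The memory effect can be illustrated with a small count of Shell-Units. The sketch below assumes that the resolution each face needs is already known, and the six resolutions listed are made up; it compares one shared resolution against per-face resolutions for the faces of a single region.

```python
def shell_unit_count_single(required):
    """Single uniform grid: every face uses the finest resolution required by any face."""
    nx = max(res[0] for res in required)
    ny = max(res[1] for res in required)
    return nx * ny * len(required)


def shell_unit_count_multilevel(required):
    """Multi-level uniform grids: each face stores only its own resolution."""
    return sum(nx * ny for nx, ny in required)


# Hypothetical required resolutions (nx, ny) for the six faces of one region.
required = [(1, 1), (1, 1), (2, 2), (1, 2), (8, 8), (1, 1)]

single = shell_unit_count_single(required)          # 8 * 8 * 6 = 384 Shell-Units
multilevel = shell_unit_count_multilevel(required)  # 1 + 1 + 4 + 2 + 64 + 1 = 73 Shell-Units
```

In this example one complicated face forces the shared grid to 8 x 8 on every face, while the per-face grids pay that price only where it is needed.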

3.1.2 Non-uniform Grid Scheme

Even with the multi-level uniform grid scheme, each region face is still partitioned into a matrix of pixels with the same size. For a boundary face touching multiple neighboring regions, the grid resolution is still restricted to what is required to distinguish the smallest ones, which can still lead to high memory usage. Observe that multiple pixels of a uniform grid on a boundary face often touch the same neighboring region and hence store multiple copies of the pointer to the same corresponding Shell-Region. To reduce such duplications, we propose a non-uniform grid scheme which aims to merge pixels of a uniform size on each face sharing the same neighboring region into a larger cell to save memory space.

A non-uniform grid is built on the partition of a uniform grid. Figure 3(a) shows a boundary face (for a 3D scene) with 16 neighboring regions (marked by heavy lines) and originally partitioned into a uniform grid. For a boundary face F with a uniform grid, we use the set of vertical or horizontal lines along the projected boundaries of the neighboring regions on F (e.g., the solid bold red boxes in Figure 3(a)) to partition F. Note that only a subset of the lines for the uniform grid (e.g., the solid green lines in Figure 3(a)) is actually aligned with such projected region boundaries on F, while the other lines for the uniform grid are needed due to the resolution restriction of the uniform grid. A non-uniform grid uses the lines aligning with the projected neighboring region boundaries to form a new grid partition, whose cells (the boxes bounded by solid green lines in Figure 3(a)) may cover multiple pixels of the uniform grid.

We need a decoding mechanism to map the indices from the uniform grid to the non-uniform grid, so that any cell of the non-uniform grid (hit by a ray) can be located using the pixel indices in the uniform grid. To support this mechanism, the partition lines in each axis of the uniform grid are indexed by a bit sequence, where each bit is set to 1 if the corresponding line is used in the non-uniform grid partition and to 0 otherwise. Every bit sequence is then stored in integer format, called a coordinate integer, in the Shell-Region (e.g., Coord.x in Figure 3(a)). During ray traversal, suppose a ray r exits the Shell-Region through a cell C(i, j) in the uniform grid for a face F. Then the following is done: (1) The two coordinate integers Coord.x and Coord.y for F are obtained; (2) using Coord.x (resp., Coord.y), find the number of 1's in the bit sequence of Coord.x (resp., Coord.y) from its left end up to position i (resp., j), denoted by i′ (resp., j′). Then cell C′(i′, j′) is the one in the non-uniform grid of F hit by the ray r. To avoid using any branch operations (which may cause execution divergence), we design a short procedure to accomplish Step (2) (see Figure 3(b)). Note that given the two coordinate integers, the decoding process uses no memory transaction to map the pixel indices in the uniform grid to cell indices in the non-uniform grid.
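Step (2) amounts to a prefix population count, which can be sketched without branches as below. This is not the procedure of Figure 3(b), which is not reproduced here; the bit ordering (least significant bit first rather than from the left end), the convention that bit k-1 describes the line between pixels k-1 and k, and the function names are all assumptions of this sketch.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def nonuniform_index(coord: int, i: int) -> int:
    """Cell index in the non-uniform grid for uniform pixel index i.

    Assumed convention: bit k-1 of `coord` is 1 when the grid line between
    pixels k-1 and k is kept in the non-uniform grid. The mask keeps the bits
    of all lines lying before pixel i, so no branch is needed."""
    return popcount(coord & ((1 << i) - 1))


def decode_cell(coord_x: int, coord_y: int, i: int, j: int):
    """Map uniform pixel indices (i, j) to non-uniform cell indices (i', j')."""
    return nonuniform_index(coord_x, i), nonuniform_index(coord_y, j)


# 8 pixels along x; lines kept before pixels 2 and 5 give cells {0,1}, {2,3,4}, {5,6,7}.
coord_x = (1 << 1) | (1 << 4)
# 4 pixels along y; the line kept before pixel 3 gives cells {0,1,2}, {3}.
coord_y = 1 << 2
```

With these coordinate integers, `decode_cell(coord_x, coord_y, 5, 3)` yields the cell indices `(2, 1)`, using only a mask, an AND, and a bit count once the two integers are in registers.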

3.1.3 Grid Compression Scheme

Even with a non-uniform grid scheme, the Shell structure still tends to use more memory than a kd-tree structure. A grid is commonly represented as a matrix, with each of its elements storing a value. On a face F with K neighboring regions partitioned into a grid of size m × n (a uniform or non-uniform grid), its matrix representation stores m × n Shell-Units that point to the K neighboring Shell-Regions of F. To further reduce the memory, we employ a grid compression scheme similar to the sparse matrix compression method [4], which improves the memory usage for F from O(m × n + K) to O(m + n + K) = O(m + n).

Note that for any boundary face F, the projected shape of each neighboring region on F is an axis-parallel rectangle. Consider a non-uniform grid Gn on F. The cells in each row or column of Gn contain pointers to a set of neighboring regions touching F. Observe that for any row Ri and any column Cj of Gn, the neighboring region R′ pointed to by the cell C(i, j) in Gn appears in both the set of neighboring regions pointed to by the cells of Ri and the set of neighboring regions pointed to by the cells of Cj; further, R′ is the only neighboring region appearing in both these two sets (this follows from the rectangular shapes of the projected neighboring regions on F). Based on this observation, we use the following grid compression scheme. (1) Store pointers to all neighboring Shell-Regions of F in an indexed array AF. (2) For each row (or column) of Gn, build a bit sequence as follows: the sequence has K bits, one for each neighboring Shell-Region of F, in their order as stored in the array AF; each bit is set to 1 if and only if the corresponding neighboring Shell-Region appears in that row (or column) of Gn. Every such bit sequence is stored as an integer, called a neighbor integer. Thus, we have an array of pointers to the set of K neighboring Shell-Regions of F, and two sequences of neighbor integers (one for the rows and one for the columns of Gn). Since there are K neighboring Shell-Regions touching F and Gn has m rows and n columns on F, the memory usage of the above grid compression scheme for F is clearly O(m + n + K). (Here, we assume that the number K of neighboring regions touching F is not too big, which is usually the case in practice.)
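The construction of the neighbor integers can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the construction code used in our implementation: the helper build_neighbor_integers is hypothetical, the non-uniform grid is given as a small matrix of indices into AF, and bit k of a neighbor integer (counted from the least significant bit) stands for the k-th entry of AF.

```python
def build_neighbor_integers(grid):
    """Return (row_ints, col_ints) for a non-uniform grid.

    grid[i][j] is the index (into the array AF) of the neighboring
    region behind cell (i, j). Assumption: bit k, counted from the
    least significant bit, represents AF[k].
    """
    row_ints = [0] * len(grid)
    col_ints = [0] * len(grid[0])
    for i, row in enumerate(grid):
        for j, k in enumerate(row):
            row_ints[i] |= 1 << k  # region k appears in row i
            col_ints[j] |= 1 << k  # region k appears in column j
    return row_ints, col_ints


# Hypothetical face with K = 4 neighbors, each an axis-parallel rectangle:
# region 0 spans two cells of row 0, region 1 spans both cells of column 2.
grid = [[0, 0, 1],
        [2, 3, 1]]
row_ints, col_ints = build_neighbor_integers(grid)
```

For this 2 × 3 example the scheme stores 2 + 3 neighbor integers plus 4 pointers instead of 6 matrix entries; the saving grows with the grid size.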

For a ray r hitting a point on F to exit, we use the following "decoding" process to find the next Shell-Region traversed by r: (1) compute the pixel indices in the uniform grid for the hit point of r on F; (2) find the cell indices (for a row Ri and a column Cj) in the non-uniform grid Gn; (3) take the neighbor integers for Ri and Cj, and perform a logic AND operation on them to identify the unique neighboring Shell-Region pointed to by the cells in both Ri and Cj. Note that both the uniform grid and non-uniform grid of F involved in our decoding process above are only used conceptually, i.e., they are only concepts for helping our computation but are not actual structures that we maintain explicitly in Shell. Figure 4 illustrates how the grid compression scheme is applied based on the non-uniform grid in Figure 3.
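Step (3) of this decoding process can be sketched in a few lines. The Python fragment below is an illustration only and is not the procedure of Figure 4(b); the function decode_neighbor, the sample neighbor integers, and the least-significant-bit ordering of the bits are assumptions made for the example.

```python
def decode_neighbor(row_int: int, col_int: int, a_f):
    """Return the entry of AF shared by a row and a column neighbor integer.

    Assumption: bit k, counted from the least significant bit,
    represents a_f[k], and exactly one bit survives the AND.
    """
    common = row_int & col_int   # the single region present in both sets
    k = common.bit_length() - 1  # index of the surviving bit
    return a_f[k]


# Hypothetical face with four neighbors and a 2 x 3 non-uniform grid.
a_f = ["R0", "R1", "R2", "R3"]
row_ints = [0b0011, 0b1110]
col_ints = [0b0101, 0b1001, 0b0010]
next_region = decode_neighbor(row_ints[1], col_ints[1], a_f)
```

Here the row integer 1110 and the column integer 1001 share only bit 3, so the cell in row 1 and column 1 resolves to the fourth entry of the array.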

3.2 GPU Memory Layout and Memory Bound of Shell

Since the GPU memory architecture prefers simple memory layout schemes in order to implement easy and efficient addressing, we use arrays to store the Shell data structure. The Shell memory layout consists of an array of Shell-Regions and an array of Shell-Units, addressed by their indices (e.g., see Figure 2(c)).

A Shell-Region representation contains sufficient information to address any of its Shell-Units. A leaf region in 3D has six 2D boundary faces neighboring with other leaf regions. In a Shell-Region representation, each face stores a pointer to (the first of) its corresponding group of Shell-Units and stores the structure produced by the partition and table lookup schemes for that face. A partition scheme consists of the structure type (uniform grid, non-uniform grid, with or without grid compression), dimensions, resolutions (if using a uniform grid), coordinate integers (if using a non-uniform grid), and pointers to three arrays (for the neighbor integers of the rows and columns of a non-uniform grid and for the pointers to the neighboring Shell-Regions in the array AF, if using a grid compression scheme). The memory size of a Shell-Region representation is constant (no more than 64 bytes) and is designed to be no bigger than the length of a memory transaction in the GPU hardware (e.g., 128 bytes). Thus, a single off-chip memory transaction can load a Shell-Region representation onto the on-chip memory (e.g., cache) during ray traversal.

All Shell-Units on a boundary face of a Shell-Region are grouped and allocated onto consecutive memory locations in the Shell-Unit array based on a predetermined sequence specified by the specific partition scheme for that face. For each Shell-Unit, its memory offset to the address of the first Shell-Unit in the group can be calculated from its geometric location on the face. Since the base address of each group of the Shell-Units is stored in the corresponding Shell-Region, any Shell-Unit can be addressed easily. With a grid partition scheme, Shell-Units on a face are represented as a matrix and stored as a 1D array in the GPU memory. With the grid compression scheme, Shell-Units are in the array of neighboring Shell-Region IDs (whose pointer is stored in its Shell-Region). Each Shell-Unit only contains a pointer to the neighboring Shell-Region that it touches, and hence can always be loaded by one memory transaction.

Suppose for a 3D scene with N objects (for specific applications) represented by Shell, we need to store M Shell-Regions and U Shell-Units besides the N objects, where M is the number of leaf regions in the spatially decomposed scene and U is the total number of Shell-Units. Then clearly Shell uses O(M + N + U) memory. Note that in the non-uniform grid compression scheme, the number of cells on each boundary face is equal to the number of its neighboring regions, and the total size of the three arrays (for neighbor integers of the rows and columns of a non-uniform grid and for pointers to the neighboring Shell-Regions) is proportional to the total size of these neighboring regions. Thus, U is proportional to the number of neighboring leaf region pairs in the spatial decomposition (i.e., the number of edges in the dual graph G).

4 Ray Traversal and Construction of Shell

In this section, we show how to use the Shell data structure effectively in ray traversal. As we pointed out in Section 2, ray traversal essentially maps the geometric location information of a point on a boundary face of a region R (at which a ray r exits R) to a memory address containing information of the neighboring region R′ of R traversed next by r. Hence, the main task of the Shell based ray traversal is to efficiently "decode" the information stored in the Shell structure so as to obtain the needed memory address.

4.1 Locating the Next Traversing Region

In the Shell data structure, the Shell-Unit corresponding to the cell containing the hit point of a ray r on a boundary face F of a region R stores a pointer to (i.e., the memory address of) the neighboring region of R touching that Shell-Unit; we call this Shell-Unit a target Shell-Unit and denote it by SUt. Thus, locating the next region entered by r can be accomplished in three steps: (I) find the hit point p of r on F; (II) determine the memory location of the target Shell-Unit SUt containing p; (III) access SUt to obtain the address of the Shell-Region for the next traversing region of r. Step (I) is taken by all ray traversal algorithms regardless of the data structure used and Step (III) is trivial. Below, we elaborate how Step (II) is done.

The Shell-Units of a boundary face F, which we call a Shell-Unit group, are stored consecutively in the memory as part of an array. Hence, the memory location of SUt can be computed based on the memory address of the Shell-Unit group for face F (denoted as Mbase) and the offset of SUt from the first element in the Shell-Unit group. Since Mbase has already been read into the on-chip memory upon r entering the current region, the main hurdle is to determine the offset. The process of computing the offset is equivalent to decoding the Shell structure associated with the arrangement on F. The actual decoding method depends on the specific partition and compression schemes used for F. Our decoding solutions for two cases (the uncompressed uniform grid scheme and compressed non-uniform grid scheme) are given in Algorithms 1 and 2 in Appendix 2. Decoding methods for other cases can be easily derived from these two algorithms.
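For the uncompressed uniform grid case, the offset computation can be sketched as below. This Python fragment is an illustration in the spirit of Lines 6 to 8 of Algorithm 1, not our GPU code; the function target_unit_address and its parameters (hit coordinates measured from the face's lower corner, face extents, and a row-major layout with cols cells per row) are assumptions made for the example.

```python
def target_unit_address(m_base, hit_u, hit_v, face_w, face_h, cols, rows):
    """Return the index of the target Shell-Unit in the Shell-Unit array.

    Assumptions: (hit_u, hit_v) is the hit point measured from the lower
    corner of the face, the face is face_w by face_h, and its Shell-Unit
    group is stored row by row with `cols` units per row.
    """
    # Pixel indices (i, j); min() keeps a hit on the far edge inside the grid.
    j = min(int(hit_u / face_w * cols), cols - 1)
    i = min(int(hit_v / face_h * rows), rows - 1)
    m_off = i * cols + j   # Moff = i * (row size of G) + j
    return m_base + m_off  # Maddr = Mbase + Moff


# Hypothetical 4 x 2 face split into 8 x 4 cells, group starting at index 100.
addr = target_unit_address(100, 2.5, 1.2, 4.0, 2.0, 8, 4)
```

Only arithmetic on values already held on chip is involved, so the single memory access of this case is the final load of the Shell-Unit at the computed address.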

4.2 Ray Traversal Based on Shell

The ray traversal algorithm using the Shell data structure is shown in Algorithm 3, Appendix 2. In an ideal case, locating the next traversing region using the Shell data structure takes only two memory transactions: one for accessing a Shell-Region (Line 10 of Algorithm 3) and one for a Shell-Unit (Line 9 of Algorithm 1, or Line 5 of Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2). However, the Shell decoding for the more sophisticated schemes may need more than two memory requests. For example, the compressed non-uniform grid scheme incurs three additional memory requests: one access to the grid coordinates (Line 7 of Algorithm 2) and two accesses for decoding the target Shell-Unit (Lines 1–2 of Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2). By properly aligning the memory layout of the Shell-Region structure and the coordinate integers for the non-uniform grid or the neighbor integers for a compressed grid, we can fulfill the five sequentially issued memory accesses by only two memory transactions with cache. (Note that one memory transaction loads a group of data to cache, which can satisfy multiple memory accesses if the accesses are aligned properly and the data stays in the cache [23].)

Although we are able to fulfill the memory accesses with two memory transactions, the limited cache size in a many-core processor can induce conflict misses, which cause a part of the cached data (brought in by a memory transaction) to be replaced before it is used. This is commonly referred to as cache thrashing. For example, when the Shell-Region information is accessed, the memory transaction brings in not only the Shell-Region information but also the grid information. However, due to the sequentiality of the two reads, the grid information may be replaced by data accessed by some other threads. This problem is exacerbated when a large number of threads are active simultaneously and competing for cache access. To overcome this challenge, we judiciously reduce the number of simultaneously executing threads for targeted sections of the code by controlling the number of threads to be launched. (A careful balance is needed so that this reduction would not degrade the overall performance; details are omitted due to the page limit.)

The combined effort above guarantees that only two memory transactions are used for locating the next traversing region. For comparison, consider a geometric scene decomposed into M regions and a ray traversing through K leaf regions. A kd-tree searching traversal algorithm accesses O(K × log M) tree nodes; each access takes 2–4 memory transactions depending on the kd-tree implementation. With the Shell ray traversal algorithm, the ray accesses K Shell-Regions; each access needs only two memory transactions.

4.3 Construction of the Shell Data Structure

Before ray traversal, the Shell data structure needs to be constructed. Since the ray traversal process typically needs to be performed repeatedly for a large number of rays (e.g., in graphics ray tracing and radiation dose calculation applications), the construction is best performed as a preprocessing task, especially for static scenes. As a common practice, we accomplish this task on the CPU of a CPU+GPU platform.

The construction starts by obtaining a spatial decomposition based on a cost model such as the Surface Area Heuristic (SAH) [21] and building the corresponding kd-tree. The Shell construction scheme processes each leaf region R as follows. We first identify all neighboring leaf regions touching each face of R. Every face is partitioned into a uniform grid, whose resolution is determined according to the number and distribution of its neighboring regions. Then, depending on the size of the scene and the memory capacity, the non-uniform grid partition and/or grid compression schemes, if needed, are applied to each face. Finally, the memory space for the Shell-Region and Shell-Units of R is allocated and initialized accordingly. The Shell data structure consists of the Shell-Regions and Shell-Units for all leaf regions of the decomposed scene. The Shell data structure is then transferred to the GPU for ray traversal applications.

5 Evaluation

To evaluate the proposed Shell approach for ray traversal on GPU, we implemented our Shell data structure and Shell based ray traversal algorithm. We also implemented two state-of-the-art kd-tree searching ray traversal algorithms: PD-SS and kd-rope. For kd-rope, we implemented its latest version, kd-rope++ [25], which is the fastest known kd-tree searching ray traversal approach. Although the performance of PD-SS is slightly lower than that of kd-rope, it takes much less memory space. To ensure the quality of our PD-SS and kd-rope implementations, we tested them on graphics ray tracing applications and achieved rendering performance results comparable to those published in related work (e.g., [11, 25]). The hardware platform used in our evaluation is an NVIDIA GTX570 graphics card (Fermi architecture, 480 cores, 1.6GHz core frequency, 1GB device memory). Figure 5 shows the set of geometric scenes used in our experiments. These scenes are commonly found in the study of graphics rendering approaches. The properties of the kd-tree decomposition for these scenes are listed in Columns 2–4 in Table 1. Two different Shell data structures are built on the kd-tree decomposition of each scene. Shell-1 aims to minimize the memory usage by adopting the non-uniform grid partition and grid compression schemes whenever they can save memory space in building the arrangement for neighboring regions on each boundary face. Shell-2 aims for a good balance between memory usage and ray traversal performance: it uses those sophisticated schemes only if they reduce memory usage by a certain portion (e.g., more than 20%). All data presented for each scene are obtained by traversing a set of rays generated to render an image of the scene with a resolution of 1024 × 1024 from graphics ray tracing. Below we analyze and compare various metrics for the Shell based and the traditional kd-tree searching ray traversal algorithms. These metrics include the ray traversal speed, number of accessed nodes, number of divergent branches, and memory usage.

Figure 6 compares the ray traversal performance using PD-SS, kd-rope, and our Shell based algorithms. The data in Figure 6 show that Shell achieves a speedup of 2.6X–5.1X over PD-SS and 2.2X–4.3X over kd-rope. Furthermore, comparing the performance of Shell-1 and Shell-2, we see that a larger Shell data structure tends to lead to faster ray traversal.

The Shell based ray traversal algorithm gains its performance advantage through removing many expensive memory accesses to internal nodes in the kd-tree searching ray traversal methods. As demonstrated in Figure 7, Shell based ray traversal accesses on average 4.2X and 3.5X fewer nodes than PD-SS and kd-rope, respectively. Although the kd-rope approach uses neighbor links to reduce accessing internal nodes, it still needs to visit a number of internal nodes when a region boundary face has multiple neighboring regions.

Another factor contributing to the performance improvement of the Shell based ray traversal is that the Shell data structure removes a significant amount of execution divergence. To illustrate this, Figure 8(a)-(b) summarizes the number of branches and number of divergent branches, respectively, for each algorithm. Shell based ray traversal incurs on average 85% fewer branches and 70% fewer divergent branches. However, Shell based ray traversal cannot completely eliminate execution divergence since different partition schemes are involved. The more sophisticated schemes a Shell structure uses (e.g., compressed non-uniform grid), the more divergent branches it may have. This is shown by comparing the Shell-1 and Shell-2 results in Figure 8.

The memory usage of the Shell data structure is strongly dependent on the schemes chosen to generate the arrangements for neighboring regions on region boundaries. Our proposed partition schemes, together with the grid compression method, provide a trade-off framework for memory usage and ray traversal performance. In our experimental study, Shell-1 minimizes the memory requirement while Shell-2 aims to achieve a good balance between memory usage and ray traversal performance.

Table 1 summarizes the properties and memory usage of PD-SS, kd-rope, Shell-1, and Shell-2 for each of the geometric scenes in Figure 5. A kd-tree can be directly used by PD-SS. For kd-rope, the kd-tree needs to be augmented in the leaf nodes by storing neighbor links and bounding boxes for each of them. As shown in the column "size(kd-rope)" in Table 1, the kd-tree supporting kd-rope uses about 3X more memory than that supporting PD-SS. The Shell-1 data structure in Table 1 applies a non-uniform grid and the grid compression scheme to as many region boundaries as possible. Its memory usage is about 40% higher than PD-SS but 60% smaller than kd-rope. The Shell-2 data structure stores more Shell-Units and uses the more sophisticated schemes only if they can save memory usage for a region boundary by more than 20%. For example, on a region face F, suppose a uniform grid partitions it into x cells and the non-uniform grid scheme can reduce the number of cells to y. Then Shell-2 adopts the non-uniform grid on F only if (y ÷ x) < 0.8. Although Shell-2 uses on average 30% more memory than Shell-1, its ray traversal performance is 20%–45% better than Shell-1. Furthermore, Shell-2 not only uses 50%–60% less memory than kd-rope, but also achieves a 2.2X–4.3X speedup over kd-rope (see Figure 6).
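The Shell-2 selection rule stated above amounts to a one-line test per face. The Python sketch below is illustrative only; the function name use_non_uniform_grid and the threshold parameter are hypothetical, and the cell counts in the example are those quoted for the face in Figure 3 (21 × 18 uniform cells reduced to 8 × 7 non-uniform cells).

```python
def use_non_uniform_grid(x_cells: int, y_cells: int, threshold: float = 0.8) -> bool:
    """Shell-2 rule: adopt the non-uniform grid only if y / x < threshold."""
    return y_cells / x_cells < threshold


# Face of Figure 3: 21 x 18 uniform cells versus 8 x 7 non-uniform cells.
adopt = use_non_uniform_grid(21 * 18, 8 * 7)
```

For that face the ratio is 56/378, roughly 0.15, so the non-uniform grid would be adopted under this rule; a face whose cell count drops only from 100 to 90 would keep its uniform grid.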

References

[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the 1st ACM Conference on High Performance Graphics, pages 145–149, 2009.

[2] E. Bethel and M. Howison. Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning. International Journal of High Performance Computing Applications, 26(4):399–412, 2012.

[3] G. Blelloch, P. Gibbons, and S. Vardhan. Combinable memory-block transactions. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'08, pages 23–34, 2008.

[4] A. Buluc, J. Fineman, M. Frigo, J. Gilbert, and C. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the 21st ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'09, pages 233–244, 2009.

[5] Q. Chen, M. Chen, and W. Lu. Ultrafast convolution/superposition using tabulated and exponential kernel. Medical Physics, 38:1150–1161, 2011.

[6] P. Chuong, F. Ellen, and V. Ramachandran. A universal construction for wait-free transaction friendly data structures. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'10, pages 335–344, 2010.

[7] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl. Evaluation and improvements of programming models for the Intel SCC many-core processor. In 2011 International Conference on High Performance Computing and Simulation, pages 525–532, 2011.

[8] T. Foley and J. Sugerman. Kd-tree acceleration structures for a GPU raytracer. In Proceedings of Graphics Hardware, pages 15–22, 2005.

[9] W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Trans. Archit. Code Optim., 6(2):1–37, 2009.

[10] S. Guntury and P. J. Narayanan. Raytracing dynamic scenes on the GPU using grids. IEEE Transactions on Visualization and Computer Graphics, 18(1):5–16, 2012.

[11] M. Hapala and V. Havran. Review: Kd-tree traversal algorithms for ray tracing. Computer Graphics Forum, 30(1):199–213, 2011.

[12] V. Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Czech Technical University, Nov. 2000.

[13] V. Havran, J. Bittner, and J. Zara. Ray tracing with rope trees. In Proceedings of Spring Conference on Computer Graphics, pages 130–139, 1998.

[14] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'10, pages 355–364, 2010.

[15] D. R. Horn, J. Sugerman, M. Houston, and P. Hanrahan. Interactive k-d tree GPU ray tracing. In Proceedings of the Symposium on Interactive 3D Games and Graphics, pages 167–174, 2007.

[16] D. M. Hughes and I. S. Lim. Kd-jump: A path-preserving stackless traversal for faster isosurface raytracing on GPUs. IEEE Transactions on Visualization and Computer Graphics, 15(6):1555–1562, 2009.

[17] J. Kalojanov, M. Billeter, and P. Slusallek. Two-level grids for ray tracing on GPUs. Computer Graphics Forum, 30(2):307–314, 2011.

[18] D. Kopta, J. Spjut, E. Brunvand, and A. Davis. Efficient MIMD architectures for high-performance ray tracing. In 2010 IEEE International Conference on Computer Design, pages 9–16, 2010.

[19] Y. Krishnakumar, T. Prasad, K. Kumar, P. Raju, and B. Kiranmai. Realization of a parallel operating SIMD-MIMD architecture for image processing application. In 2011 International Conference on Computer, Communication and Electrical Technology, pages 98–102, 2011.

[20] M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'08, pages 217–228, 2008.

[21] J. D. MacDonald and K. S. Booth. Heuristic for ray tracing using space subdivision. Visual Computer, 6:153–165, 1990.

[22] T. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe. The 48-core SCC processor: The programmer's view. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, 2010.

[23] NVIDIA Corporation. NVIDIA CUDA C programming guide version 5.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, 2013.

[24] S. Popov, J. Gunther, H. P. Seidel, and P. Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3):415–424, 2007. (Proc. Eurographics 2007).

[25] A. Santos, J. M. Teixeira, T. Farias, V. Teichrieb, and J. Kelner. Understanding the efficiency of kd-tree ray traversal techniques over a GPGPU architecture. International Journal of Parallel Programming, 40(3):331–352, 2012.

[26] F. Song and J. Dongarra. A scalable framework for heterogeneous GPU-based clusters. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'12, 2012.

[27] Stanford Computer Graphics Laboratory. The Stanford 3D scanning repository. URL: http://graphics.stanford.edu/data/3Dscanrep/, 2012.

[28] J. Tsakok, W. Bishop, and A. Kennings. Kd-tree traversal techniques. In Proceedings of the IEEE Symposium on Interactive Ray Tracing, pages 190–194, 2008.

[29] I. Wald, W. R. Mark, J. Gunther, S. Boulos, T. Ize, W. Hunt, S. G. Parker, and P. Shirley. State of the art in ray tracing animated scenes. Computer Graphics Forum, 28(6):1691–1722, 2009.

[30] K. Xiao, B. Zhou, D. Z. Chen, and X. S. Hu. Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation. Medical Physics, 39:7619–7626, 2012.

[31] B. Zhou, X. S. Hu, and D. Z. Chen. Memory-efficient volume ray tracing on GPU for radiotherapy. In IEEE 9th Symposium on Application Specific Processors, pages 46–51, 2011.

[32] M. Zlatuska and V. Havran. Ray tracing on a GPU with CUDA – Comparative study of three algorithms. In Proceedings of Computer Graphics, Visualization and Computer Vision, pages 69–75, 2010.

A Appendix 1: Figures

Figure 1: (a) A 2D scene with 18 spatially decomposed regions and two traversing rays, ray1 and ray2. (b) The kd-tree representation of the scene in (a) with the leaf nodes of the tree corresponding to the decomposed regions in (a). (c) Sequences of regions traversed by ray1 and ray2, and the tree nodes accessed by these rays using the PD-SS and kd-rope algorithms.

Figure 2: (a) A 2D scene with 8 spatially decomposed regions. (b) A uniform-grid based Shell structure for the decomposed scene in (a). (c) The Shell-Region for region B in (a) and one of its Shell-Units adjacent to region D. (d) The memory layout of Shell, where each element in the Shell-Region array contains a pointer to the first Shell-Unit associated with that Shell-Region, and each element in the Shell-Unit array contains a pointer to its neighboring Shell-Region (indicated by the dashed arrows).

Figure 3: (a) A non-uniform grid for a boundary face F with 16 neighboring regions. F is partitioned by all the vertical and horizontal lines into a uniform grid scheme. The solid bold red boxes are the projected boundaries of the neighboring regions on F. The solid green lines are the lines used by the non-uniform grid. The set of 21 × 18 Shell-Units using the uniform grid on F is reduced to 8 × 7 Shell-Units using the non-uniform grid. (b) The decoding procedure for converting the cell indices from a uniform grid to a non-uniform grid. By applying this procedure, Cell(11, 12) (containing point P in (a)) in the uniform grid is converted to Cell′(4, 5) in the non-uniform grid.

Figure 4: (a) A compressed non-uniform grid for the boundary face F in Figure 3(a). The grid compression scheme stores 7 + 8 + 16 = 31 integers for F (i.e., the neighbor integers for 7 rows and 8 columns, and an array of pointers to the 16 neighboring regions). (b) The decoding procedure for obtaining the pointer to the neighboring Shell-Region touching a given cell. By applying this procedure, the pointer to the neighboring Shell-Region of Cell(4, 5) (containing point P in (a)) is found at AF[8].

Figure 5: Geometric scenes used in the experiments of this paper. From left to right: Bunny (69K objects), Dragon (871K objects), Buddha (1.08M objects), from the Stanford 3D Scanning Repository [27], and Balls5 (66K objects) used in [12].

Figure 6: The execution time of the PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The detailed information for the data structures (including kd-tree, Shell-1, and Shell-2) is given in Table 1.

Figure 7: The total numbers of nodes (including internal and leaf nodes) accessed by all rays during the execution of PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The same set of rays is used by all these algorithms on a scene. Note that Shell-1 and Shell-2 are based on the same kd-tree decomposition, and hence the Shell based algorithms access the same number of nodes (leaf regions) using these two Shell data structures.

Figure 8: The numbers of (a) branches and (b) divergent branches during the execution of PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The Shell based algorithms incur smaller numbers of branches and divergent branches, and Shell-2 has even smaller such numbers than Shell-1 because Shell-2 uses less sophisticated schemes.

B Appendix 2: Algorithms

Algorithm 1 Locating the next traversing region using a uniform grid
1: Input: ray R, boundary face F, Shell-Region SR.
2: Output: target Shell-Unit SUt.
3: compute the point P where R intersects F and exits from SR;
4: obtain the uniform grid G for F;
5: Mbase = the address of the Shell-Unit group for F;
6: compute the location index (i, j) of the pixel in G containing P; /* The pixel at (i, j) represents the target Shell-Unit. */
7: Moff = i × (row size of G) + j;
8: Maddr = Mbase + Moff; /* Compute the memory address Maddr for the target Shell-Unit. */
9: SUt = SU_array[Maddr]; /* Load and return the target Shell-Unit, which is the address of the next Shell-Region SRN. */
10: return SUt;

Algorithm 2 Locating the next traversing region using a compressed non-uniform grid
1: Input: ray R, boundary face F, Shell-Region SR.
2: Output: target Shell-Unit SUt.
3: compute the point P where R intersects F and exits from SR;
4: obtain the uniform grid G for F;
5: compute the location index (i, j) of the pixel in G containing P;
6: obtain the non-uniform grid Gn for F;
7: obtain Coord.x and Coord.y for F;
8: compute index (i′, j′) by calling Procedure 1 in Figure 3(b) using (i, j); /* Decode the non-uniform grid. */
9: compute pointer ID by calling Procedure 2 in Figure 4(b) using (i′, j′); /* Decode the compressed grid. */
10: return ID; /* ID points to the next Shell-Region to traverse, which serves as the target Shell-Unit. */

Algorithm 3 Ray traversal algorithm based on Shell
1: Input: ray R, Shell data structure with an SR (Shell-Region) array and an SU (Shell-Unit) array.
2: Output: traversal of R.
3: initialization, to obtain the first Shell-Region SR0 for traversing R;
4: SR = SR0;
5: more_region_to_traverse = true;
6: while more_region_to_traverse do
7: compute the boundary face F of SR on which R exits SR;
8: obtain the partition and compression scheme S used on F;
9: use the corresponding algorithm for S to compute the target SUt for entering the next Shell-Region SRN; /* E.g., if F uses a compressed non-uniform grid, then SUt = Algorithm 2(R, F, SR). */
10: SR = SR_array[SUt]; /* The next Shell-Region SRN is obtained and is set as SR for the next iteration. */
11: if the traversal of R is finished then
12: more_region_to_traverse = false; /* If the traversal of R is not finished, then continue to the next iteration. */
13: end if
14: end while
15: end
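To make the control flow of Algorithm 3 concrete, the following Python sketch traverses a toy scene. It is a simplified illustration under several assumptions, not our GPU implementation: the scene is 2D rather than 3D, every face uses an uncompressed uniform grid, the traversal ends when the ray leaves the scene (marked by the hypothetical sentinel OUTSIDE), the ray direction is nonzero, and all names (build_scene, traverse, the dictionary layout of a region) are invented for the example.

```python
OUTSIDE = -1  # hypothetical sentinel: the face borders no other region


def build_scene():
    """Return (regions, su_array) for three boxes A=0, B=1, C=2 in 2D."""
    su_array, regions = [], []

    def add_region(lo, hi, neighbours):
        faces = {}
        for face, units in neighbours.items():
            faces[face] = (len(su_array), len(units))  # (Mbase, resolution)
            su_array.extend(units)                     # consecutive group
        regions.append({"lo": lo, "hi": hi, "faces": faces})

    # Faces are keyed by (axis, side): axis 0 = x, 1 = y; side 0 = low, 1 = high.
    add_region((0, 0), (1, 2), {(0, 0): [OUTSIDE], (0, 1): [1, 2],
                                (1, 0): [OUTSIDE], (1, 1): [OUTSIDE]})
    add_region((1, 0), (2, 1), {(0, 0): [0], (0, 1): [OUTSIDE],
                                (1, 0): [OUTSIDE], (1, 1): [2]})
    add_region((1, 1), (2, 2), {(0, 0): [0], (0, 1): [OUTSIDE],
                                (1, 0): [1], (1, 1): [OUTSIDE]})
    return regions, su_array


def traverse(origin, direction, regions, su_array, start):
    """Return the list of region indices visited by the ray."""
    visited, sr = [], start
    while sr != OUTSIDE:
        visited.append(sr)
        region = regions[sr]
        # Find the face through which the ray exits the current region.
        best_t, face = float("inf"), None
        for axis in (0, 1):
            d = direction[axis]
            if d == 0:
                continue
            side = 1 if d > 0 else 0
            bound = region["hi"][axis] if side else region["lo"][axis]
            t = (bound - origin[axis]) / d
            if t < best_t:
                best_t, face = t, (axis, side)
        # Decode the uniform grid on that face into a Shell-Unit address.
        other = 1 - face[0]
        p = origin[other] + best_t * direction[other]
        m_base, res = region["faces"][face]
        extent = region["hi"][other] - region["lo"][other]
        cell = min(int((p - region["lo"][other]) / extent * res), res - 1)
        sr = su_array[m_base + cell]  # the Shell-Unit names the next region
    return visited


regions, su_array = build_scene()
path = traverse((0.5, 0.25), (1.0, 1.0), regions, su_array, 0)
```

In this scene region A has two neighbors (B and C) on its right face, so that face carries a two-cell grid; the sample ray leaves A through the lower cell, enters B, exits B through its top face into C, and then leaves the scene. No tree node is visited at any point: each step is one region lookup followed by one Shell-Unit lookup.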

C Appendix 3: Tables

         kd-tree                                                          Shell-1             Shell-2
Scenes   leaves(a)   INs(b)   objects(c)   size(PD-SS)   size(kd-rope)   SUs(d)   size       SUs(d)   size
Bunny    154K        401K     5.5          9.2MB         30.1MB          2.8M     13.4MB     3.9M     17MB
Dragon   920K        3.5M     2.8          55.8MB        167MB           13M      75MB       19M      98MB
Buddha   1.3M        4.8M     2.4          78.1MB        241MB           20M      112MB      24M      141MB
Balls5   57K         189K     2.3          3.6MB         11.9MB          0.96M    4.7MB      1.2M     6.1MB
(a) The total number of leaf nodes (i.e., leaf regions).
(b) The total number of internal tree nodes.
(c) The average number of objects contained in each leaf node.
(d) The total number of Shell-Units.

Table 1: The properties and memory usage of the kd-tree and Shell data structures built for the scenes in Figure 5. Columns 2, 3, and 4 illustrate the properties of the kd-tree. Column 5 shows the memory usage of the kd-tree, which is used by the PD-SS algorithm. Column 6 shows the memory space used by kd-rope, which stores more information (e.g., the bounding box information and neighbor links) in the leaf nodes of the kd-tree. For the Shell-1 and Shell-2 data structures, Columns 7 and 9 illustrate the total numbers of Shell-Units that they contain, and Columns 8 and 10 show their total memory requirement, respectively.
