Master Thesis
Ray Tracing Complex Scenes on Graphics Hardware
Student: Pei-Lun Lee
Advisors: Yung-Yu Chuang, PhD. and Ming Ouhyoung, PhD.
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Abstract
Recently, due to the advancement of graphics hardware, the graphics processing unit (GPU) has been used intensively for general computation beyond the graphics display it was originally designed for. Among these uses, ray tracing on the GPU is one of the promising applications owing to its inherent properties of parallel computation and shared data access. However, most existing research on GPU ray tracing focuses on accelerating ray tracing itself while assuming the scene is not too complex. In this thesis, we present a two-level hierarchy of data structures to render complex scenes using a collaborative CPU/GPU system. We compare its performance with a CPU-only implementation and conclude that this method is efficient and easy to implement.
Contents
1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Thesis organization

2 Related work
  2.1 Rendering complex scenes
  2.2 General purpose computation on graphics hardware
  2.3 Ray tracing on graphics hardware

3 Algorithm
  3.1 Ray-scene intersection
  3.2 Complex scenes
  3.3 The two-level acceleration structure
  3.4 Traversal of two-level structure
  3.5 Rendering

4 Results
  4.1 The scenes
  4.2 Experiments and results
    4.2.1 Dragons
    4.2.2 Dragonball
    4.2.3 Plants
    4.2.4 Powerplant
  4.3 Conclusions

5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work
List of Figures
3.1 Interface of ray-scene intersection on GPU
3.2 Details of ray-scene intersection on GPU
3.3 Multi-texture approach
3.4 The two-level algorithm
3.5 Partition of a level 1 structure using a KD-tree of the bunny model

4.1 The order and position of each object of the dragon scene
4.2 The function-generated camera path used in the dragon scene
4.3 (a) Plants scene viewed from the top. (b) The leaf nodes of the level 1 KD-tree of the plants scene.
4.4 Path used to walk through the powerplant scene
4.5 (a) Rendering time of different implementations of the ray tracer on the dragon scenes. The two vertical lines are the two limits on texture size of the single-texture GPU ray tracer and the 2-level structure. (b) The texture size distribution of the dragon scenes.
4.6 (a) Per-frame rendering time of the 10-dragon scene along the camera path with different acceleration structures. (b) Number of times the underlying GPU ray tracer is called per frame.
4.7 Dragonball: (a) Texture size and number of nodes under different partitions. (b) Rendering time of the 2-level structures of different node sizes.
4.8 (a) Rendering time of the dragonball scene partitioned into 3 nodes. (b) Rendering time of the 2-level structure only.
4.9 Dragonball: number of times the GPU BVH is called per frame
4.10 Plants: rendering time per frame (a) CPU KDT and 2-level (b) 2-level only
4.11 Plants: (a) number of times the GPU BVH is accessed per frame (b) number of rays generated per frame
4.12 Plants: (a) percentage of active texture size (b) percentage of incremental texture size
4.13 Per-frame rendering time of the powerplant (a) CPU KDT and 2-level (b) 2-level only
4.14 Powerplant: (a) number of times the GPU BVH is called per frame (b) total number of rays generated per frame
4.15 Powerplant: (a) percentage of active textures (b) percentage of incremental textures
Chapter 1
Introduction
1.1 Motivation
Ray tracing is a classic algorithm in computer graphics for generating photo-realistic images by emulating the natural phenomena behind human visual perception. The most time-consuming part of the ray tracing algorithm is the computation of ray-scene intersection. In ray tracing, the computation of each ray is independent of the others, so it is well suited to parallel computing.
Recently the graphics processing unit (GPU) has been gaining computation power faster than the CPU, and the gap continues to grow. Its programmability makes it a great platform for general computing, and ray tracing is one of the targets of interest. While more and more ray tracing algorithms are implemented on the GPU, all of them assume that the scene data fits into the on-board memory of the GPU. However, in many practical cases such as production rendering, the scene contains more geometry than GPU memory can hold. In this thesis, we present a GPU-assisted ray tracing system that takes complex scenes as input, which makes GPU ray tracers more practical to use.
1.2 Problem statement
Given scene data whose memory consumption is larger than the capacity of GPU memory, we propose a data structure for data management and generate images using a GPU-assisted ray tracing algorithm.
1.3 Thesis organization
In the following chapters, we introduce the related work on GPU ray tracing and complex scene rendering, then describe the proposed method and show the experiments and results, and lastly conclude the work and discuss some future directions of research.
Chapter 2
Related work
In this chapter, we survey the related work on different topics including rendering
complex scenes, general purpose computation on graphics hardware, and ray tracing
on graphics hardware. We also survey some of the recent work on ray tracing in the
first section.
2.1 Rendering complex scenes
[PKGH97] presented a set of algorithms to improve the data locality of the ray tracing algorithm, so that more complex scenes can be rendered within a given memory capacity. They organize a geometry cache in two-level regular voxel grids with different granularities. The coarse ones are termed geometry grids, which contain finer grids called acceleration grids inside. We use a scheme similar to theirs, except that their goal is to manage the access pattern between disk and main memory, while we deal with system memory to GPU memory management. In addition, they use regular grids at both levels, but in our scheme either level can be any data structure.
In [WPS+03], the authors reported that realtime ray tracing has already been achieved on different hardware architectures, including shared memory systems, PC clusters, programmable GPUs, and custom hardware. They also presented interactive global illumination algorithms as one of the applications utilizing realtime ray tracing.
[WDS04] visualizes an extraordinarily complex model, the Boeing 777, which contains 350 million individual triangles. They use a combination of several techniques, including a realtime ray tracer, a low-level out-of-core caching and demand loading strategy, and a hierarchical approximation scheme for geometry not yet loaded. They render the full model at several frames per second on a single PC.
[WSS05] presents a programmable ray tracing hardware architecture, which combines the flexibility of general purpose CPUs with the efficiency of GPUs.
They implemented a prototype FPGA version running at 66 MHz that can render images at interactive rates. The limitation of their system is similar to that of other ray tracers pursuing performance: the scene must be static and contain only triangles.
2.2 General purpose computation on graphics hardware
[OLG+05] surveyed the development of the new research domain of general purpose computation on graphics processors, or GPGPU. They described the techniques used in mapping general-purpose computation to graphics hardware, surveyed a broad range of GPGPU applications, and pointed out the trends of next-generation GPGPU research.
[GRH+05] presented an improved bitonic sorting network algorithm on the GPU. They analyzed the cache efficiency of the algorithm to obtain a better data access pattern. The performance of their implementation was higher than that of a CPU quicksort routine.
[FSH04] analyzed the suitability of GPUs for dense matrix-matrix multiplication. They implemented a near-optimal algorithm but found that the utilization of the arithmetic resources was relatively low compared to the CPU counterpart. The bottleneck was the limited memory bandwidth, so they concluded that the GPU was currently not a suitable platform for such computation.
2.3 Ray tracing on graphics hardware
[CHH02] was one of the first works on ray tracing with programmable graphics hardware. They implemented only ray-triangle intersection on the GPU and let the CPU do ray generation and shading. Ray data were stored in texture memory, and vertex data were sent through the graphics pipeline as drawing primitives. In terms of intersections per second, their implementation exceeded the fastest CPU-based ray tracer at that time. The main bottleneck of their implementation is the slow readback of data from GPU to main memory, which is required each time a computation is invoked.
[PBMH02] mapped the entire ray tracing algorithm onto the GPU, from ray generation, acceleration data structure traversal, and ray-triangle intersection to shading. They chose the uniform grid as the acceleration data structure. All data, including rays, geometry, and the grid, were stored in texture memory. Still, they suffered from a low utilization
rate of computational resources. They also presented the concept that a GPU can be viewed as a stream processor [Pur04]. Under this model, fragment programs are the kernels, and data are streams. Their work inspired several later efforts, including [Chr05] and [KL05], which follow their framework.
[WSE04] used the GPU to implement a nonlinear ray tracing algorithm, which extends ray tracing to handle curved light rays and can be used to visualize gravitational phenomena.
[TF05] implemented a kd-tree traversal algorithm on the GPU. Their modified algorithm can traverse a kd-tree without a stack, making it suitable for the GPU, which lacks support for such complex data structures. They also use it in a ray tracer that beats the performance of one using a uniform grid.
[TS05] compared how the performance of a GPU-based ray tracer is affected by different acceleration data structures. They chose three structures to compare: the uniform grid, the kd-tree, and bounding volume hierarchies. They found that bounding volume hierarchies are the most efficient in most cases.
[CHCH06] uses a threaded bounding volume hierarchy built from a geometry image, which can be efficiently traversed and constructed entirely on the GPU and allows for dynamic geometry and level of detail.
NVIDIA Gelato [NVI05] is a hardware-accelerated non-real-time high-quality
renderer which leverages the NVIDIA GPU as a floating point math processor to
render images faster than comparable renderers.
Chapter 3
Algorithm
In this chapter, we describe in detail the algorithms we use. Before we start, we list
some of the design decisions we made.
Design Decision
• CPU/GPU job division
Although we utilize the computation power of the GPU, we do not aim to design a system that runs completely on the GPU. Instead we offload some tasks to the CPU and focus our experiments on the speedup of ray-scene intersection.
GPU
– Traversal of level 2 structures
CPU
– Generating camera rays
– Shading
– Spawning secondary rays
• Scene geometry
We choose triangles as the only type of geometric primitive. This simplifies the problem and lets us focus on the specific parts of the system. Also, limiting the number of primitive types to one keeps the GPU intersection code simple.
• Scene complexity
We do not aim for arbitrarily complex scenes. The main subject of this thesis is memory management between the GPU and the host, not between disk and memory, so the data size of the scenes cannot exceed the capacity of main memory.
• Illumination model

Since we compute the shading on the CPU side, we do not want the cost of shading to be too expensive. Besides, realistic illumination is not the main objective, so we choose simple Whitted ray tracing as the illumination model.
3.1 Ray-scene intersection
The most important component in our system is the ray-scene intersection. This is the part where we mainly utilize the computation power of the GPU. The ideal interface of ray-scene intersection on the GPU is shown in figure 3.1. Note that multiple rays are fed into the GPU at the same time. This is due to the SIMD nature of the fragment processors on the GPU: they are most efficient when processing multiple data at the same time. This interface resembles the ray engine in [CHH02]. A more detailed framework of our GPU intersection block is shown in figure 3.2. We implemented the methods described in [TS05]. The scene data is first converted into an acceleration data structure and then stored in the texture memory of the GPU as textures. The rays are also stored in textures. The traversal of the acceleration structure and the intersection test, written in shader code, are executed in the fragment processors. The intersection results are output to the framebuffer, which has a texture attached, and are then read back to main memory.
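The batch interface of figure 3.1 can be sketched as a function taking a queue of rays and returning one intersection record per ray. The following is a CPU stand-in of our own, not the thesis implementation: a brute-force Möller–Trumbore test replaces the GPU traversal, and all names are illustrative.

```python
# Batch ray-scene intersection: a queue of rays in, one hit record per ray out.

def _sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def _dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(orig, d, tri, eps=1e-9):
    """Moller-Trumbore test; returns the hit distance t or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    h = _cross(d, e2)
    a = _dot(e1, h)
    if abs(a) < eps:                      # ray parallel to triangle plane
        return None
    f = 1.0 / a
    s = _sub(orig, v0)
    u = f * _dot(s, h)
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = f * _dot(d, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = f * _dot(e2, q)
    return t if t > eps else None

def intersect_batch(rays, triangles):
    """For each (origin, direction) ray, return (t, triangle index) or None."""
    out = []
    for orig, d in rays:
        best = None
        for i, tri in enumerate(triangles):
            t = ray_triangle(orig, d, tri)
            if t is not None and (best is None or t < best[0]):
                best = (t, i)
        out.append(best)
    return out
```

On the GPU the per-ray loop body runs in parallel across fragment processors, which is why the interface takes many rays per call.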
In our implementation, we find the performance of BVH to be the best of the three GPU acceleration structures in [TS05], which is consistent with their conclusion. Therefore we choose BVH as the underlying GPU intersection engine in our system. To implement the BVH ray tracer on the GPU, several hardware and software features are required. First, the GPU must support Shader Model 3.0, since dynamic branching and looping instructions are used in the fragment shaders. To store data in textures, floating point precision in textures must be supported, as in the OpenGL extension GL_ARB_texture_float. To output data to textures in a fragment shader, we use the OpenGL framebuffer object, which supports render-to-texture. We also use occlusion queries to determine whether the computation on the GPU is finished. Last, we use early Z-culling, an optimization in the graphics pipeline, to prevent pixels whose computation is already finished from entering the fragment processors again. All the above techniques are described in [TS05] and [GPG05].
Figure 3.1: Interface of ray-scene intersection on GPU
Figure 3.2: Details of ray-scene intersection on GPU
3.2 Complex scenes
The scheme described in the last section works fine with scenes that fit into texture memory; we call it the single-texture approach. However, as the scene geometry gets more complex, two problems may be confronted:

1. A single texture may exceed the maximum texture size
2. Textures may exceed the capacity of the on-board texture memory of the GPU
In the OpenGL specification, the maximum texture size is 4096 by 4096. In our implementation, we use 32-bit floating point textures to store data. As a result, the largest single texture is 256MB. Take BVH, one of our implemented acceleration structures on the GPU, for example: it takes somewhere around 64 to 67 bytes per triangle. By calculation, the 256MB limit is reached at around 4 million triangles converted into a single-texture representation of the BVH structure. That is, if we use the simple BVH implementation, we can handle scenes with at most 4 million triangles.
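The arithmetic above can be checked directly; the 64-67 bytes per triangle figure comes from the text, and 66 is our own midpoint choice.

```python
# Back-of-the-envelope check of the single-texture limit quoted above.

MAX_DIM = 4096                      # OpenGL maximum texture dimension
BYTES_PER_TEXEL = 16                # 4 channels x 32-bit float
BYTES_PER_TRIANGLE = 66             # midpoint of the 64-67 byte range

max_texture_bytes = MAX_DIM * MAX_DIM * BYTES_PER_TEXEL
max_triangles = max_texture_bytes // BYTES_PER_TRIANGLE

print(max_texture_bytes // 2**20)   # 256 (MB)
print(max_triangles)                # roughly 4 million
```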
Figure 3.3: Multi-texture approach
Another limitation is the capacity of texture memory. The latest GPU has at most 512MB of texture memory, that is, we can have two 256MB textures. However, with PCI-Express technology, the bandwidth between the GPU and the host is greatly improved compared to the last-generation AGP bus. The graphics manufacturers thus provide driver-level mapping from graphics memory to main memory. That is, we can use part of the main memory as texture memory, just like virtual memory uses the hard disk as memory. For example, a GPU may have only 64MB of memory on board, but to a user program there may seem to be 256MB of texture memory, since the driver transparently swaps texture data in and out between the GPU and the host. These features are quite useful in low-end graphics products, since the cost of memory can be reduced while keeping performance comparable to products with the same amount of actual on-board memory. However, by our experiment, the high-end GPU we use (nVidia GeForce 7800) also has this feature. The available memory is 256MB, but we can allocate more than 1GB of textures without error. So this feature should be implemented in software and independent of the hardware design.
In sum, the limit on GPU memory size is resolved by the driver. The real limit remaining is the size of a single texture. To get rid of this limit, a direct solution is to tile smaller textures into a larger one. For example, if we have data stored in an array that is represented as a texture of size 8192 by 8192, we can divide it into 4 textures sized 4096 by 4096, and then override the texture look-up function in the shader code as in algorithm 1.
In this way, which we call the multi-texture approach, by mapping the texture coordinates of a larger texture to those of multiple smaller textures, we can virtually access
Algorithm 1 MyTex2D(tex, x, y)
  if x < 4096 and y < 4096 then
    return tex2D(tex0, x, y)
  else if x >= 4096 and y < 4096 then
    return tex2D(tex1, x - 4096, y)
  else if x < 4096 and y >= 4096 then
    return tex2D(tex2, x, y - 4096)
  else
    return tex2D(tex3, x - 4096, y - 4096)
  end if
Figure 3.4: The two-level algorithm
a texture that is larger than the maximum size in the specification. And by using this function, we can easily make GPU intersection code capable of handling complex scenes by just sending it larger textures.
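The lookup of algorithm 1 can be mirrored on the host side. The sketch below is our own illustration with the tile size left as a parameter (the shader version hard-codes 4096); a tile is modeled as a mapping from local (row, column) to a texel value.

```python
# Host-side sketch of the multi-texture lookup: a virtual (2*tile)^2 texture
# is emulated by a 2x2 grid of smaller tiles.

def my_tex2d(tiles, x, y, tile=4096):
    """tiles[ty][tx] maps local (row, col) to a texel value."""
    tx, ty = x // tile, y // tile          # which tile the coordinate falls in
    return tiles[ty][tx][(y % tile, x % tile)]
```

tiles[0][0] plays the role of tex0 in algorithm 1, tiles[0][1] of tex1, tiles[1][0] of tex2, and tiles[1][1] of tex3.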
Nevertheless, this approach does not perform as efficiently as it looks. In the fragment processors, where this code is executed, instructions are issued in SIMD fashion, which means that each time we call the MyTex2D function in a fragment shader, 4 texture fetch commands are executed, and 3 of the 4 texels are discarded by the conditional branching. In addition, blocks of all 4 textures are potentially loaded into texture memory and consume the bandwidth of the PCI-Express bus. The more tiles a texture is split into, the worse the performance.
Overall, the multi-texture approach is not a practical solution. So we come up with another solution, the two-level acceleration structure, which is introduced in the following section.
3.3 The two-level acceleration structure
The main idea of the proposed algorithm is to build a two-level hierarchy of acceleration data structures, composed of a coarser structure at the top level and finer structures at each leaf node of the coarser one. The coarser level (level 1) structure is traversed by the CPU, while the finer level (level 2) structures are traversed by the GPU, as shown in figure 3.4. Through this partition, we can split large scene data into smaller chunks that fit into the limited memory of the GPU. This method is similar to [PKGH97], where both the coarser and finer levels are uniform grids. In theory, any acceleration data structure can be used at either level of this algorithm.
Assumptions
We make the following assumptions about the scene:

1. Scenes are composed purely of triangles
2. Scene data is larger than the texture memory on the GPU but smaller than system main memory

For the first point, we handle only one type of primitive because the GPU uses a SIMD architecture, and two intersection routines mean twice the cost in GPU programming. Besides, all other primitives can be subdivided into triangles.
Second, from the viewpoint of GPU computation, the memory hierarchy is GPU L1 cache, GPU L2 cache, texture memory, system memory, and disk storage. The main target of this research is the boundary between texture memory and system memory, so we assume that our data fits into main memory and do not explicitly handle memory management between memory and disk.
Construction
Given a set of triangles T, its 2-level acceleration data structure (AS_L1, AS^i_L2) can be computed with algorithm 2.

To build a level 1 structure, one can use a top-down algorithm such as a KD-tree and set the maximum number of triangles in a leaf node to a desired threshold. Alternatively, one can start from a complete acceleration structure, or from a structure built with a bottom-up algorithm such as bounding volume hierarchies (BVH) [GS87], and merge leaf nodes from bottom to top until the memory size or the number of nodes meets the requirement.
Algorithm 2 Construction of 2-level acceleration structure
1: Build AS_L1 on T
2: for each leaf node L_i of AS_L1 do
3:   build AS^i_L2 on the triangles of L_i
4: end for
Figure 3.5: Partition of a level 1 structure using KD-tree of the bunny model
Figure 3.5 is a visualization of a level 1 structure. Each color patch corresponds to a leaf node of the KD-tree. As the figure shows, the partition is rather coarse, so each leaf node contains many triangles compared to a normal KD-tree used in an ordinary ray tracing algorithm.
A level 2 structure is no different from an acceleration data structure used in other GPU-based ray tracers. The level 2 structures are built from the triangles of the leaf nodes of the level 1 structure; each leaf node has one level 2 structure associated with it. The constructed level 2 structures AS^i_L2 must be converted to the texture form that can be traced by a GPU ray tracer, and the texture representation of a level 2 structure must not exceed the size limitation of a single texture. In OpenGL, this limitation is 4096 by 4096 texels, which can store up to 256MB of data if 32-bit floating point textures are used. So when building the level 1 structure, the size of a leaf node must be chosen so that its level 2 structure is not too large.
3.4 Traversal of two-level structure
Here we give the algorithm to traverse a 2-level acceleration structure with the same interface as the ray-scene intersection described in section 3.1. That is, given a queue of rays R and a 2-level structure, compute and return the intersection of each
ray in R.
Algorithm 3 Intersect(R, AS_L1, AS^i_L2)
 1: for each ray r in R do
 2:   Add r to Q_L1
 3: end for
 4: while Q_L1 is not empty do
 5:   for each ray r in Q_L1 do
 6:     p = IntersectL1(r, AS_L1)
 7:     if p is a leaf node with queue Q^i_L2 then
 8:       Add r to Q^i_L2
 9:     else
10:       Intersect[r] = NULL
11:       Add r to I
12:     end if
13:     Delete r from Q_L1
14:   end for
15:   for each non-empty Q^i_L2 do
16:     S = IntersectL2(Q^i_L2, AS^i_L2)
17:     for each ray s in S do
18:       if s is a hit then
19:         Add s to I
20:       else
21:         Add s to Q_L1
22:       end if
23:     end for
24:   end for
25: end while
26: Return I
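The queue management of algorithm 3 can be sketched as runnable code. The toy below is our own simplification, not the thesis implementation: the level 1 structure is reduced to a per-ray ordered list of leaf intervals along the ray (which also provides the 'restart' behavior described later), and each level 2 GPU pass is reduced to a lookup table of (leaf, ray id) to hit distance.

```python
# Toy sketch of the two-level intersection loop (Algorithm 3).

def intersect_l1(ray):
    """Return the next leaf the ray enters, advancing its restart position."""
    for tmin, tmax, leaf in ray["leaves"]:
        if tmax > ray["pos"]:
            ray["pos"] = tmax              # restart point for the next call
            return leaf
    return None                            # past the last leaf: a miss

def intersect_two_level(rays, l2_hits):
    q_l1 = list(rays)                      # level 1 ray queue
    results = {}
    while q_l1:
        q_l2 = {}                          # one queue per level 2 structure
        for r in q_l1:
            leaf = intersect_l1(r)
            if leaf is None:
                results[r["id"]] = None
            else:
                q_l2.setdefault(leaf, []).append(r)
        q_l1 = []
        for leaf, queue in q_l2.items():   # one "GPU pass" per non-empty queue
            for r in queue:
                hit = l2_hits.get((leaf, r["id"]))
                if hit is not None:
                    results[r["id"]] = hit
                else:
                    q_l1.append(r)         # continue in the level 1 structure
    return results
```

Because the leaf intervals are visited in ray order, as with a KD-tree level 1 structure, the first level 2 hit is also the nearest, so a hit ends the ray's traversal.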
We maintain a ray queue for the level 1 structure and for each level 2 structure. That is, if the level 1 structure has n leaf nodes, there are n + 1 ray queues. First, all rays input for an intersection query are put into the level 1 ray queue. Then each ray in the level 1 ray queue is traversed in the level 1 structure to find the nearest leaf node hit by the ray. If a ray hits some leaf node, we move that ray to the ray queue of the level 2 structure that node corresponds to; if the ray does not hit anything, its traversal ends here. After this step, the level 1 ray queue is empty and some of the level 2 queues hold rays. The level 2 structures with non-empty ray queues are traversed and intersection-tested on the GPU. For rays reported hit during the traversal of a level 2 structure, we update their intersection state, and the traversal can stop there if we use a level 1 structure that traverses along the direction of the ray, such as a uniform grid or KD-tree. Otherwise, if a ray misses in the level 2
traversal, it should continue in the level 1 structure, starting not from the root but from the leaf node following the current one. To do this we modify the level 1 traversal, which we describe in the following.
Traversal of level 1 data structure
As stated above, the traversal of the level 1 structure must continue from the last position of the previously visited node, or it becomes an infinite loop. Our solution is to store a current-position field for each ray, and during traversal we skip all the nodes that precede the current node. This method is used by [TF05] to remove the recursion of KD-tree traversal in their GPU ray tracer, which they call 'KD-Restart'; we use it in a CPU algorithm instead. This method adds extra cost to the traversal of the level 1 structure, but since in our system a level 1 structure is usually very small, containing only tens of leaf nodes, these additional computational costs are negligible.
Algorithm 4 IntersectL1(r, AS_L1)
  p = nearest leaf node from position[r] along r
  position[r] = next leaf node after p along r
  Return p
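The per-ray 'restart' bookkeeping of algorithm 4 can be sketched as follows. This is our own abstraction: the level 1 KD-tree is reduced to the ordered list of leaf intervals (tmin, tmax, leaf) that the ray pierces, and position[r] is stored directly on the ray.

```python
# Sketch of restart-based level 1 traversal (Algorithm 4).

def intersect_l1(ray, leaf_intervals):
    """Return the next unvisited leaf along the ray, or None past the last."""
    for tmin, tmax, leaf in leaf_intervals:
        if tmax > ray["position"]:         # skip leaves already visited
            ray["position"] = tmax         # restart point for the next call
            return leaf
    return None
```

Repeated calls walk the ray through its leaves in order without any recursion stack, which is the point of KD-Restart.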
Taking the effects of the traversal 'restart' into account, we compare 3 acceleration structures as candidates for the level 1 structure: the uniform grid, the KD-tree, and the BVH.
• Uniform grid: To store the 'restart' position, the xyz index of a grid cell must be saved per ray, and it can be encoded into one integer. The traversal order is along the direction of the ray. The space partition is regular, which can generate empty nodes or extremely uneven node sizes.
• KD-tree: The 'restart' position consists of 2 floating point numbers, the tmin and tmax of a node along the direction of the ray. The space partition is adaptive and each leaf node has similar size. The traversal order is along the ray direction, which is a plus.
• BVH: When traversed, a BVH is usually flattened from the original tree structure into an array in some fixed search order, such as depth-first. We save the index into this array as the 'restart' position. A BVH is not traversed along the ray direction. The space partition is adaptive.
From this comparison, we choose the KD-tree as the level 1 structure, since it has both ray-order traversal and adaptiveness.
3.5 Rendering
Traditionally, ray tracing algorithms are recursive. In our system, since the interface of the ray-scene intersection takes multiple rays instead of a single ray, it is difficult to use in a recursive ray tracing algorithm. Here we use the classic Whitted ray tracer as an example and give two algorithms that utilize the 2-level structure. One is called 'batched', which interfaces with the 2-level structure through the ray-scene intersection block and is thus unaware of the underlying acceleration data structure. The other is called 'mixed', which integrates the rendering with the traversal of the 2-level structure for better utilization.
Batched
Algorithm 5 Rendering-batched(R)
 1: i = 0
 2: while i < max_iterations do
 3:   i = i + 1
 4:   R = Intersect(R)
 5:   for each ray r in R do
 6:     if r is a hit then
 7:       Add the shaded color to the pixel associated with r
 8:       Add each spawned ray to S
 9:     else
10:       Add the background color to the pixel associated with r
11:     end if
12:   end for
13:   R = S
14:   S = empty queue
15: end while
In algorithm 5, we maintain 2 ray queues, R and S. R stores the rays for the current iteration, while S stores the secondary rays for the next iteration. At each iteration, the ray-scene intersection is called to get the intersections of R. For each ray that hits, we compute its shading and its spawned secondary rays, such as reflection and refraction, and add these new rays to S. Then we move the rays in S to R and
continue with the next iteration. The rendering process repeats until the predefined maximum number of iterations is reached or there are no rays left to process.
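The two-queue iteration of algorithm 5 can be sketched in runnable form. The `intersect` and `shade` callbacks below stand in for the ray-scene intersection block and the CPU shader; their exact signatures and the scalar color model are our own illustration, not the thesis interface.

```python
# Toy sketch of the batched rendering loop (Algorithm 5).

def render_batched(rays, intersect, shade, max_iterations=5):
    image = {}                               # pixel -> accumulated color
    for _ in range(max_iterations):
        if not rays:
            break
        hits = intersect(rays)               # one batch query per iteration
        secondary = []
        for ray, hit in zip(rays, hits):
            if hit is None:
                image.setdefault(ray["pixel"], 0.0)   # background (0.0 here)
            else:
                color, spawned = shade(ray, hit)
                image[ray["pixel"]] = image.get(ray["pixel"], 0.0) + color
                secondary.extend(spawned)
        rays = secondary                     # spawned rays feed the next pass
    return image
```

Because the loop only calls `intersect` on a whole queue, it works unchanged whether the intersection is a CPU KD-tree, a single-texture GPU tracer, or the 2-level structure, which is exactly the point made below.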
Note that this algorithm is not aware of the underlying acceleration structure. The intersect procedure can be a CPU implementation, a single-texture GPU implementation, or the 2-level structure described here. We also use this algorithm with the CPU KD-tree implementation in the experiments and results in the next chapter.
Mixed
Algorithm 6 Rendering-mixed(R)
 1: for each eye ray r do
 2:   Initialize position[r]
 3:   Add reference[r] to Q_L1
 4: end for
 5: while at least one of Q_L1 and the Q^i_L2 is not empty do
 6:   for each ray r in Q_L1 do
 7:     p = traceL1(r, AS_L1)
 8:     if p is a leaf node AS^i_L2 of AS_L1 then
 9:       position[r] = next(r, p, AS_L1)
10:       Add r to Q^i_L2
11:     end if
12:     Delete r from Q_L1
13:   end for
14:   Choose an AS^i_L2 with non-empty Q^i_L2
15:   S = traceL2(Q^i_L2, AS^i_L2)
16:   for each ray s in S do
17:     if s is a hit then
18:       Process the intersection
19:       Spawn secondary rays T
20:       Add T to Q_L1
21:     else
22:       Add s to Q_L1
23:     end if
24:     Delete s from Q^i_L2
25:   end for
26: end while
In algorithm 6, we integrate the rendering of algorithm 5 with the traversal of the 2-level structure in algorithm 3. The key difference is that after we get the intersections from a level 2 structure, we do not proceed to the intersection of the next non-empty node. Instead, we return to the level 1 structure to process the newly spawned secondary rays. This fills the ray queues of the level 2 structures with more rays and reduces the number of passes needed to process a level 2 structure. The step 'choose a level 2 structure' in the algorithm can be implemented in many ways. [PKGH97] suggests a cost function to determine the best node. We adopt the simplest method: choose the node with the most rays in its queue.
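The node-selection step can be written down directly. The sketch below implements only the simplest heuristic named in the text, picking the level 2 structure with the most queued rays; the dictionary layout of the queues is our own.

```python
# Node selection for the mixed renderer: most-rays-first heuristic.

def choose_next_node(queues):
    """queues: leaf id -> list of queued rays; returns a leaf id or None."""
    nonempty = {leaf: rays for leaf, rays in queues.items() if rays}
    if not nonempty:
        return None
    return max(nonempty, key=lambda leaf: len(nonempty[leaf]))
```

Processing the fullest queue first amortizes the fixed cost of a GPU pass over as many rays as possible.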
Chapter 4
Results
In this chapter we perform various experiments on a variety of scenes. First, we list the environment setup:
Development environment
• CPU: Intel Pentium 4 3.0 GHz with HyperThreading
• RAM: 2.5GB DDR2 SDRAM
• OS: Microsoft Windows XP Professional SP2
• GPU: nVidia GeForce 7800 GTX with 256MB memory
• Compiler: Microsoft Visual C++ .Net
• Cg compiler: nVidia Cg Compiler 1.5 Beta 2
• Driver: nVidia ForceWare 84.21
Setup
• The acceleration structures we use in the 2-level structure are a KD-tree as the level 1 structure and BVH as the level 2 structure. The reasons are described in the previous chapter: the KD-tree has ray-order traversal and adaptiveness, so it is the best candidate for the level 1 structure. As for the level 2 structure, we choose BVH since its performance is the best among all existing GPU implementations.
• Unless specifically mentioned, all the level 1 KD-trees are constructed using 1/10 of the total number of triangles as the maximum number of triangles per leaf node, and the BVHs are constructed by shuffling the insertion order of the triangles 10 times and choosing the tree with the lowest value of the cost function in [GS87].
• The CPU KDT in the results is our own implementation, based on the KD-tree traversal code in [PH04]. The interface of the ray-scene intersection function of the KD-tree is modified from a single ray to multiple rays to match the interface of the GPU algorithms. This ensures that the only difference between the CPU and GPU implementations is the ray-scene intersection, which is our main focus.
• All the images rendered by our program are 512 by 512 pixels in size; no multisampling or antialiasing is applied.
• Our program takes large data as input and consumes a lot of memory. We turn on the 3GB option of Windows XP [Mic05] to get more user memory space, and we replace the malloc function of the Visual C++ standard library with the allocator from [Lea00]. Without these two adjustments, our program runs out of memory on some large scenes.
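The multiple-ray intersection interface mentioned above can be sketched like this; `Ray`, `Hit`, and `intersectBatch` are hypothetical names, and the traversal body is a placeholder for the actual KD-tree walk or GPU dispatch.

```cpp
#include <cstddef>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };
struct Hit { bool hit; float t; int triangle; };

// Batched interface: trace a whole queue of rays in one call, so that the
// CPU KD-tree back end and the GPU back end are interchangeable behind it.
// The per-ray body here is a stand-in for the real traversal.
std::vector<Hit> intersectBatch(const std::vector<Ray>& rays) {
    std::vector<Hit> hits(rays.size());
    for (std::size_t i = 0; i < rays.size(); ++i)
        hits[i] = Hit{false, 0.0f, -1};   // placeholder "miss" per ray
    return hits;
}
```

With one result slot per input ray, either back end can fill the batch without the caller knowing which implementation ran.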
4.1 The scenes
Table 4.1 summarizes the statistics of the scenes. The numbers in the texture size row are the sums of the sizes of the level 2 BVH textures of each scene. The number of triangles and texture size of the dragon scene are listed later in detail, since it contains more than one scene. The lights we use for rendering are all point light sources. We look at each scene and its characteristics in the following.
Dragons
This is a set of scalable scenes, which we use to model scenes of different degrees of complexity. The models are composed of a plane and a dragon model (from the Stanford 3D Scanning Repository [Sta06]). There are 21 scenes in total, containing from 1 to 21 dragons. Each dragon has about 87 thousand triangles. The position and order of each dragon object are shown in figure
Scene          dragon   dragonball   plants       powerplant
Triangles      varied   7,364,817    14,119,952   12,748,510
Texture size   varied   ~470MB(*)    877MB        793MB
Light sources  1        1            1            6
Reflection     yes      yes          yes          no
Refraction     no       yes          no           no

Table 4.1: Scene statistics
16 17 18
10 11 12
 4  5  6
 3  1  2
 9  8  7
15 14 13
21 20 19

Figure 4.1: The order and position of each object of the dragon scene
Figure 4.2: The function-generated camera path used in the dragon scenes
obj  Triangles   Tex. size    obj  Triangles    Tex. size
 1   1,076,214    68.3        11    9,790,354     616
 2   1,947,628   123          12   10,661,768     670
 3   2,819,042   177          13   11,533,182     725
 4   3,690,456   232          14   12,404,596     779
 5   4,561,870   286*         15   13,276,010     834
 6   5,433,284   341          16   14,147,424     890
 7   6,304,698   396          17   15,018,838     944
 8   7,176,112   451          18   15,890,252     970
 9   8,047,526   506          19   16,761,666   1,020*
10   8,918,940   560          20   17,633,080   1,080
                              21   18,504,494   1,130

Table 4.2: Number of triangles and total texture size (in MB) of the dragon scenes
4.1, from 1 to 21. In this order, we try to make the shape as regular as possible while keeping each object in the same position when others are added. To preserve the complexity of the scenes, we do not use object instancing; all triangles are individual. There is only one light source in these scenes. The plane is specular and reflects rays. We render the scene using a camera path similar to [TS05], which surrounds the scene, moving up and down according to a sine curve while looking at the center of the scene. We render 100 frames in total along the path. Figure 4.2 shows the curve of camera positions.
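Such a function-generated path can be sketched as a circle around the scene center whose height follows a sine curve; the radius, base height, amplitude, and sine frequency below are placeholder values, not our exact parameters.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Generate `frames` camera positions circling the scene center at `radius`,
// with the height oscillating on a sine curve, similar to the path of [TS05].
// Every frame looks at the scene center (0, 0, 0).
std::vector<Vec3> cameraPath(int frames, float radius, float baseHeight,
                             float amplitude) {
    std::vector<Vec3> path;
    path.reserve(static_cast<std::size_t>(frames));
    for (int f = 0; f < frames; ++f) {
        float t = 2.0f * 3.14159265f * f / frames;   // angle around the scene
        path.push_back(Vec3{radius * std::cos(t),
                            baseHeight + amplitude * std::sin(3.0f * t),
                            radius * std::sin(t)});
    }
    return path;
}
```

A 100-frame path then only needs `cameraPath(100, r, h, a)` and a look-at matrix per frame.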
Table 4.2 lists the number of triangles and the total size of the level 2 BVH textures of each scene in the set. Two points should be noted. First, from the scene with 4 dragons to the one with 5, the texture size crosses the 256MB line, the capacity of the on-board texture memory of the GPU we use. This is also the maximum size of a single texture. Although the BVH texture size would be different if a single BVH structure were built instead of a 2-level structure, the numbers are very close; our experiment confirms that a single BVH structure of the 5-dragon scene exceeds 256MB. This means the previous GPU ray tracer using BVH as its acceleration structure cannot directly handle the scenes with 5 or more dragons. The second point is at 19 dragons, where the texture size exceeds 1GB. We discuss this later with the results.
Dragonball
In this scene, we put another dragon model in the center. This dragon model, also known as the Asian Dragon or xyzrgb-dragon, also comes from the Stanford 3D Scanning Repository [Sta06]. With more than 7 million triangles, it is the most complex single object we use. We put two spheres by the sides of the dragon model; the red one is specular and the blue one is transparent. There is another squeezed sphere representing a lens, which is nearly transparent. Below these objects is a wave-shaped surface standing for water. This water surface is specular, and its wave shape makes the directions of the reflected rays random and incoherent. In order to compute smooth reflected and refracted rays, we must obtain smooth normal vectors at each vertex and interpolate them to get per-pixel normal vectors. This increases the memory consumption of each triangle, so we store per-vertex normals only in this scene. In the other three scenes, normal vectors are computed per triangle, since they have at most flat reflections and per-vertex normals are unnecessary. We use the same function-generated path as the one used in the first dragon scene.
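The per-pixel normal computation described above amounts to barycentric interpolation of the three vertex normals of the hit triangle; the vector helpers and the function name below are illustrative, not our actual identifiers.

```cpp
#include <cmath>

struct V3 { float x, y, z; };

static V3 add(V3 a, V3 b)      { return V3{a.x + b.x, a.y + b.y, a.z + b.z}; }
static V3 scale(V3 a, float s) { return V3{a.x * s, a.y * s, a.z * s}; }

static V3 normalize(V3 v) {
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return scale(v, 1.0f / len);
}

// Per-pixel normal from the three vertex normals of the hit triangle and
// the barycentric coordinates (b1, b2) of the hit point.
V3 interpolateNormal(V3 n0, V3 n1, V3 n2, float b1, float b2) {
    float b0 = 1.0f - b1 - b2;
    return normalize(add(add(scale(n0, b0), scale(n1, b1)), scale(n2, b2)));
}
```

The renormalization matters: the weighted sum of unit normals is generally not unit length.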
Plants
This scene is from PBRT [PH04], where it is used for a realistic ray-traced image. There is a large amount of object instancing in this scene. As with the dragon scenes, we avoid object instancing and duplicate each individual object for the sake of scene complexity. The scene is composed of a terrain generated from a height field, a reflective surface representing water, and a variety of duplicated plant objects. The main source of complexity is the plant models, and they are distributed only in a corner of the terrain, along the river. See figure 4.3(a) for a view from the top. Figure 4.3(b) visualizes the level 1 KD-tree of the scene. Each color patch represents a leaf node. Since we make the nodes contain as equal a number of triangles as possible, we can see that most of the geometry gathers in one region.
Figure 4.3: (a) Plants scene viewed from the top. (b) The leaf nodes of the level 1 KD-tree of the plants scene.
The original copy of this scene contains more than 19 million triangles and is the most complex model in our experiments. It works on our system, but its scale is beyond what we can handle efficiently, since its memory footprint exceeds 1GB and textures are swapped out to the hard disk. Therefore, we simplify the scene by reducing the number of instanced objects. We keep the four apple trees untouched, since they are the most visible objects. The final version has around 14 million triangles, and the visual difference from the original scene is barely observable. We test this scene with a fly-by camera path, which starts from the top of the scene, descends to ground level, moves along the river, and then circles the area where the plants reside. This path has 255 frames in total.
Powerplant
This model from the UNC walkthrough project [UNC01] contains more than 12 million triangles and features complex indoor scenes. Unlike the other scenes we use, which are generally flat shaped, the powerplant model is complex in each dimension, especially vertically. We test this scene with a camera path that walks through it. The path is illustrated in figure 4.4. The red curve shows the movement of the camera position along the path. The camera starts from outside the building, circles to the opposite side of the building, then goes into the building, passes through several different rooms, sometimes stopping to look around, and ends inside the building. The length of the path is 275 frames. This is also the only scene with more than one light source. We put six lights in this scene: one
Figure 4.4: Path used to walk through the powerplant scene
is outside the building at the top, representing sunlight; four are put in different rooms inside the building for interior illumination; and the last one is mounted onto the camera, preventing the image from being too dark when walking through rooms that no other light covers. There are no specular or transparent surfaces in this scene.
4.2 Experiments and results
4.2.1 Dragons
We use this scene mainly to test how different implementations of the ray tracer perform as the complexity of the input model increases. The complexity of our models ranges from 1 million to 18 million triangles, and the texture representation ranges from less than 100MB to more than 1GB of memory. The tested data structures are: KD-tree on CPU, the proposed 2-level structure (which has two variants of rendering methods, batched and mixed), and a single-texture version of BVH on GPU, which is described in 3.1 and is presented for reference.

Figure 4.5(a) shows the rendering time of the increasingly complex models on different acceleration structures. Each number is generated by rendering the model with the camera path described above and averaging the rendering time over 100 frames. As expected, the rendering time increases as the scene gets more complex. There are some exceptions, especially for the KD-tree, whose curve rises steadily and then drops suddenly at 9 dragons. This is likely due to the symmetry of the 9-dragon model, which results in a better tree than those of fewer triangles.
Figure 4.5: (a) Rendering time of different implementations of the ray tracer on the dragon scenes. The two vertical lines are the two limits on texture size of the single-texture GPU ray tracer and the 2-level structure. (b) The texture size distribution of the dragon scenes.
Another thing to note is that although the time complexity of traversing the spatial index structures used here, such as KD-tree or BVH, should be logarithmic in the number of objects, all the curves in figure 4.5(a) appear almost linear. This is because when one more dragon model is added to the scene, not only does the number of triangles increase, but so does the area of visible pixels, which is proportional to the number of eye rays; and the traversal time is linear in the number of rays. So the trend of these curves is reasonable.
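A back-of-the-envelope model makes this plausible: with k dragons, the per-ray traversal cost grows like log(k·T) while the number of eye rays grows roughly linearly in k, so the total is about k·log(k·T), which is close to linear over k = 1..21. The constants below are arbitrary illustration values, not measured quantities.

```cpp
#include <cmath>

// Modeled frame time for k dragons: rays(k) * costPerRay(k).
// raysPerObject is an arbitrary unit; trisPerObject is the ~87k triangles
// of one dragon mentioned in the text.
double modelTime(int k) {
    const double raysPerObject = 1.0;      // visible area grows with k
    const double trisPerObject = 87000.0;  // triangles per dragon
    return raysPerObject * k * std::log2(trisPerObject * k);
}
```

Going from 1 to 21 dragons multiplies the ray count by 21 but the log factor only by about 1.27, so the modeled curve looks nearly linear.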
We can also see in figure 4.5(a) that GPU BVH performs consistently better than CPU KDT, as we expected. The 2-level structure can be viewed as a hybrid of CPU KDT and GPU BVH, and its performance is somewhere between that of these two methods. We can also say that the 2-level structure takes more of the advantage of GPU BVH: its speedup over CPU KDT is about 200 to 500 percent.
One more thing to note in figure 4.5(a) is the two dashed vertical lines. They correspond to the two horizontal lines in figure 4.5(b), which mark two texture sizes. One is at 256MB, the maximum size of a single texture. Since GPU BVH uses a single texture to store the acceleration structure of the whole scene, it is unable to process 5 or more dragons. Even though the 4-dragon scene is within the 256MB limit, the rendering time of GPU BVH is not as good as for the smaller scenes; it is even slower than that of the 2-level structure, which in theory has more data to process. A possible reason is that the single texture of the 4-dragon scene, plus other memory already in use such as the framebuffer and shader programs, exceeds the capacity of the texture memory. The texture and other resources get swapped in and out during rendering, causing the jump in rendering time. The 2-level structures also show a jump in rendering time at around 17 to 18 dragons,
Figure 4.6: (a) Per-frame rendering time of the 10-dragon scene along the camera path with different acceleration structures. (b) Number of times the underlying GPU ray tracer is called per frame.
where the texture memory consumption is about 900MB. At this point in the experiment, we observed frequent hard disk access, which is not seen when scenes of lower complexity are processed. We infer that the amount of main memory the GPU driver allocates for mapping texture memory has an upper limit; when more is demanded, the system swaps main-memory pages out to the page file on disk, just as virtual memory works. We guess this limit to be 4 times the size of the on-board texture memory, which would be 1GB in our case. That is why scenes with textures over 1GB still work, but the processing time increases very quickly, even exceeding that of CPU KDT at 19 dragons. We tried to find a way to adjust this allocation size, but in vain.
In figure 4.5(a) we can see that, of the two variants of the 2-level structure, the mixed renderer outperforms the batched one. We take a closer look in figure 4.6, choosing a scene of average size and examining its behavior on a per-frame scale. Figure 4.6(a) shows the per-frame rendering time of the 3 structures. The rises and falls of the curves reflect the different camera angles at each frame. Overall, we can see that 'mixed' is consistently faster than 'batched'. In figure 4.6(b), we plot the number of accesses to the underlying GPU BVH per frame; the number for 'mixed' is approximately half that for 'batched'. This is reflected in the rendering time, since the more times the GPU BVH is called, the more read-backs of data from the GPU take place.
4.2.2 Dragonball
In this scene, we test the effect of the leaf node size of the level 1 structure on the rendering time. In general, the larger the node size, the fewer the number of
Figure 4.7: Dragonball: (a) Texture size and number of nodes under different partitions. (b) Rendering time of the 2-level structures for different node sizes.
Figure 4.8: (a) Rendering time of the dragonball scene partitioned into 3 nodes. (b) Rendering time of the 2-level structures only.
nodes, so the number of calls to the GPU is also fewer; but larger nodes have a higher cost of swapping their textures. Conversely, a smaller node size generates more nodes with lower swapping cost while increasing the number of GPU calls. To find a balance point, we build several 2-level structures of the dragonball model with different node sizes. By setting the maximum number of leaf nodes when constructing the level 1 KD-tree, we can control the size and the number of nodes. We choose 13 different settings; the average texture size and the number of nodes of these settings are shown in figure 4.7(a). The largest nodes are to the left of the graph, and the node size decreases to the right. The partition with the largest node size has only 2 nodes, each with a texture size of about 250MB. At the other extreme, the one with the smallest node size has more than 400 nodes.
Figure 4.7(b) shows the resulting rendering time. The rendering time rises as the number of nodes increases, and the best setting appears to be the one with the fewest nodes. This is not what we expected; we thought the minimum would be somewhere in the middle. A possible reason is that the cost of calling the shader multiple times is much greater than that of swapping a large texture, or that the driver swaps textures much faster than we thought.
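This observation can be phrased as a simple cost model; the overhead constants are invented for illustration. Note that if every node's texture is swapped in once per frame, the total swap cost n · (totalMB/n) · swapCostPerMB is independent of the node count n, while the per-call overhead grows with n, which is consistent with the monotone trend in figure 4.7(b).

```cpp
// Modeled per-frame cost of partitioning a scene whose textures total
// `totalMB` megabytes into `n` nodes.  More nodes mean more shader
// invocations (fixed overhead each) but smaller individual texture swaps.
// callOverhead and swapCostPerMB are arbitrary illustration constants.
double partitionCost(int n, double totalMB,
                     double callOverhead, double swapCostPerMB) {
    double nodeMB = totalMB / n;            // average texture size per node
    double calls = static_cast<double>(n);  // at least one pass per node
    return calls * callOverhead + calls * nodeMB * swapCostPerMB;
}
```

Under this model the swap term is constant, so the minimum is always at the smallest n, matching the experiment.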
We take the 2-level structure with 3 nodes and compare its per-frame rendering
Figure 4.9: Dragonball: number of times the GPU BVH is called per frame
Figure 4.10: Plants: rendering time per frame. (a) CPU KDT and 2-level. (b) 2-level only.
time with that of CPU KDT in figure 4.8(a). The 2-level structure is about 3 times faster than CPU KDT. The comparison of the two 2-level variants is in figure 4.8(b); the mixed renderer is slightly faster in this scene, too. Figure 4.9 shows the number of GPU calls for the same setting.
4.2.3 Plants
In the plants scene we perform the same experiment as for the powerplant: we use a camera path to walk through the scene and record statistics. The main differences between these two scenes are as follows:

• Plants has more triangles than powerplant

• Plants is flat shaped with little occlusion, while powerplant has complex occlusion

• Plants has only one light source, but also a specular surface

Figure 4.10(a) gives the per-frame rendering time of the scene. On average, the 2-level structures are 2.5 times faster than CPU KDT. Figure 4.10(b) shows the
Figure 4.11: Plants: (a) number of times GPU BVH is accessed per frame. (b) number of rays generated per frame.
Figure 4.12: Plants: (a) percentage of active texture size. (b) percentage of incremental texture size.
rendering time of the 2-level structures alone; 'mixed' performs slightly better than 'batched'.

Figure 4.11(a) shows the number of GPU calls per frame. The numbers for 'mixed' are about half those for 'batched'. The numbers are relatively low around the 120th frame. This is because the camera moves to an edge of the scene and faces outward, which reduces the rendering of the frame to a single texture, similar to what happens when rendering the indoor frames of the powerplant. This is also reflected in the rendering time: the minimum point of figure 4.10(b) is also the minimum point of figure 4.11(a).
Figure 4.11(b) shows the number of rays generated at each frame. Note that the number is much lower than that of the powerplant scene, and so is the rendering time. Figure 4.12(a) plots the percentage of active texture size. Except for a few frames that utilize a low ratio of the textures, most frames have a relatively high percentage compared with the powerplant scene, since the distribution of objects is flat and there is little occlusion in this scene.
Figure 4.13: Per-frame rendering time of powerplant. (a) CPU KDT and 2-level. (b) 2-level only.
4.2.4 Powerplant
The powerplant scene takes the most time to render of all our scenes, although it is not the most complex in geometry, because it has the most light sources and thus generates many more shadow rays than the other scenes.
Figure 4.13(a) shows the per-frame rendering time of powerplant. The gap between CPU KDT and the 2-level structure widens even further, to more than an order of magnitude. This should not be interpreted as the 2-level structure being supreme, but rather as CPU KDT performing abnormally poorly in this scene. The KD-tree is not well constructed and contains leaf nodes with thousands of triangles, which results in poor traversal performance. Although the level 1 KD-tree of the 2-level structure is built from the same KD-tree used in CPU KDT, it does not suffer from the same problem, because the level 1 KD-tree takes only a few nodes near the root of the original KD-tree.
Figure 4.13(b) is the same graph as figure 4.13(a) with only the 2-level structures, which makes it easier to compare the two variants. It is not obvious which of the 2-level renderers is faster in this scene: in some frames one is faster, and in other frames the opposite. On average, 'batched' is slightly faster than 'mixed', but not by much. During the walkthrough, most of the time the camera is inside the building and only a few nodes are visible, which reduces the rendering of a complex scene to a simple case. See figure 4.14(a) for the number of times the GPU BVH is called during rendering. There is not much difference between the two curves compared to the other scenes; only in the first 30 frames is there an obvious gap. The numbers there are also relatively high, since in those frames the camera still remains outside the building and more of the nodes are involved, where 'mixed' can take advantage, although this is not much reflected in the rendering time. When the camera enters the building, the number of GPU calls drops quickly as the
Figure 4.14: Powerplant: (a) number of times GPU BVH is called per frame. (b) total number of rays generated per frame.
Figure 4.15: Powerplant: (a) percentage of active textures. (b) percentage of incremental textures.
number of visible nodes is also reduced. Figure 4.14(b) counts the number of rays generated at each frame, including camera rays and secondary rays. The two curves are identical, as we expected; otherwise the produced images would not be the same, which would mean something had gone wrong. There are a couple of valleys, which result from the camera looking in directions where more eye rays shoot into the background. This is also reflected in the rendering time, where a drop in the number of rays maps to a drop in the rendering time.
Figure 4.15(a) plots the percentage of the total texture size that is active at each frame. This gives a clue to the potential amount of data transfer at each frame.
4.3 Conclusions
In this chapter, we performed several experiments with different scenes and draw the following conclusions:
• The 2-level structure scales smoothly on complex scenes

• The upper bound on the input scene size is limited by the amount of memory allocated by the GPU driver for texture memory mapping

• The mixed renderer of the 2-level structure is slightly better in rendering time

• The mixed renderer makes about 50% fewer accesses to the GPU shader than the batched approach

• The 2-level structure reduces texture usage in situations such as inside the powerplant building

• The 2-level structure performs better with a larger node size than with a smaller one
Chapter 5
Conclusions and future work
5.1 Conclusions

We present a two-level acceleration structure that utilizes the computational power of the GPU for ray tracing complex scenes. Although we do not compare our method to an advanced CPU implementation of a ray tracing system, the 2-level algorithm achieves a 2.5 to 5 times speedup over our own CPU KD-tree implementation. We design several experiments to test our system on scenes with different degrees of complexity and materials. We also investigate how to choose a good threshold when building the 2-level structure.
5.2 Future work
In our implementation, we delegate memory management between main memory and texture memory to the driver, and it works quite well except for the limit on the pre-allocated memory size. An alternative is to allocate large blocks in texture memory and use the glTexSubImage function to update sub-blocks. Through this technique we could handle memory management explicitly and make possible things the driver cannot do, such as prefetching textures.
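A sketch of this proposed explicit management, under the assumption of fixed-size sub-blocks: keep one large resident texture, track which block caches which level 2 structure, and overwrite a block on demand. `BlockTable` and its round-robin eviction are illustrative; the glTexSubImage2D upload is shown as a comment, since this sketch omits GL context setup, and `xOffsetOf`/`nodeData` are hypothetical helpers.

```cpp
#include <cstddef>
#include <vector>

// One resident texture divided into fixed-size blocks; each block caches the
// BVH texture of one level 2 node (-1 when free).
struct BlockTable {
    std::vector<int> owner;  // owner[b] = node currently cached in block b
    int clock_ = 0;          // trivial round-robin eviction cursor

    explicit BlockTable(int blocks) : owner(blocks, -1) {}

    // Return the block holding `node`, uploading it on a miss.  A real
    // manager could also prefetch nodes whose queues are filling up.
    int fetch(int node) {
        for (std::size_t b = 0; b < owner.size(); ++b)
            if (owner[b] == node) return static_cast<int>(b);  // cache hit
        int victim = clock_++ % static_cast<int>(owner.size());
        owner[victim] = node;
        // Upload would go here, e.g.:
        // glTexSubImage2D(GL_TEXTURE_2D, 0, xOffsetOf(victim), 0,
        //                 blockW, blockH, GL_RGBA, GL_FLOAT, nodeData(node));
        return victim;
    }
};
```

Because the big texture is allocated once, updates never trigger the driver's own paging; the application decides exactly which blocks to overwrite and when.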
We implement only the ray-scene intersection on the GPU, while keeping other tasks on the CPU. According to our profiling, the workload of the CPU is higher than that of the GPU, so we might gain improvements by moving some computation from the CPU to the GPU; for example, shading and the computation of reflection and refraction are good candidates.
Also, in this work we demonstrate only a simple illumination model; we would like to extend it to more advanced global illumination, such as Monte Carlo ray tracing or photon mapping, for better and more practical image quality.
References
[AK87] James Arvo and David Kirk. Fast ray tracing by ray classification.
SIGGRAPH Comput. Graph., 21(4):55–64, 1987.
[App68] Arthur Appel. Some techniques for shading machine renderings of
solids. In AFIPS 1968 Spring Joint Computer Conf., volume 32, pages
37–45, 1968.
[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fata-
halian, Mike Houston, and Pat Hanrahan. Brook for gpus: stream
computing on graphics hardware. ACM Trans. Graph., 23(3):777–786,
2004.
[CHCH06] Nathan A. Carr, Jared Hoberock, Keenan Crane, and John C. Hart.
Fast gpu ray tracing of dynamic meshes using geometry images. In
Proceedings of Graphics Interface 2006, 2006.
[CHH02] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In
HWWS ’02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware, pages 37–46, Aire-la-Ville,
Switzerland, Switzerland, 2002. Eurographics Association.
[Chr05] Martin Christen. Implementing ray tracing on gpu. Master’s thesis,
University of Applied Sciences Basel, Switzerland, 2005.
[DWS05] Andreas Dietrich, Ingo Wald, and Philipp Slusallek. Large-Scale CAD
Model Visualization on a Scalable Shared-Memory Architecture. In
Gunther Greiner, Joachim Hornegger, Heinrich Niemann, and Marc
Stamminger, editors, Proceedings of 10th International Fall Workshop
- Vision, Modeling, and Visualization (VMV) 2005, pages 303–310, Er-
langen, Germany, November 2005. Akademische Verlagsgesellschaft
Aka.
[FSH04] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the effi-
ciency of gpu algorithms for matrix-matrix multiplication. In HWWS
’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS confer-
ence on Graphics hardware, pages 133–137, New York, NY, USA,
2004. ACM Press.
[FvDFH90] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.
Hughes. Computer graphics: principles and practice (2nd ed.).
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,
1990.
[GPG05] GPGPU.org. http://www.gpgpu.org, 2005.
[GRH+05] Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David
Tuft, and Dinesh Manocha. A cache-efficient sorting algorithm for
database and data mining computations using graphics processors.
Technical report, UNC, 2005.
[GS87] Jeffrey Goldsmith and John Salmon. Automatic creation of object hi-
erarchies for ray tracing. IEEE Computer Graphics and Applications,
7:14–20, 1987.
[KL05] Filip Karlsson and Carl Johan Ljungstedt. Ray tracing fully im-
plemented on programmable graphics hardware. Master’s thesis,
Chalmers University of Technology, 2005.
[Lea00] Doug Lea. A memory allocator, 2000.
http://g.oswego.edu/dl/html/malloc.html.
[Mic05] Microsoft. http://www.microsoft.com/whdc/system/platform/server/pae/paemem.mspx,
2005.
[NVI05] NVIDIA. http://www.nvidia.com/page/gelato.html, 2005.
[OLG+05] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, pages 21–51, August 2005.
[MT97] Tomas Möller and Ben Trumbore. Fast, minimum storage ray-triangle intersection. J. Graph. Tools, 2(1):21–28, 1997.
[PBMH02] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray
tracing on programmable graphics hardware. ACM Transactions on
Graphics, 21(3):703–712, July 2002. ISSN 0730-0301 (Proceedings
of ACM SIGGRAPH 2002).
[PH04] Matt Pharr and Greg Humphreys. Physically Based Rendering. Mor-
gan Kauffman, 2004.
[PKGH97] Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Render-
ing complex scenes with memory-coherent ray tracing. In SIGGRAPH
’97: Proceedings of the 24th annual conference on Computer graph-
ics and interactive techniques, pages 101–108, New York, NY, USA,
1997. ACM Press/Addison-Wesley Publishing Co.
[Pur04] Timothy John Purcell. Ray tracing on a stream processor. PhD thesis, Stanford University, 2004.
[Sta06] Stanford. The stanford 3d scanning repository, 2006.
http://graphics.stanford.edu/data/3Dscanrep/.
[TF05] Tim Foley and Jeremy Sugerman. KD-tree acceleration structures for a GPU raytracer. In Proceedings of Graphics Hardware, 2005.
[TS05] Niels Thrane and Lars Ole Simonsen. A comparison of acceleration
structures for gpu assisted ray tracing. Master’s thesis, University of
Aarhus, 2005.
[UNC01] UNC. The walkthru project, 2001. http://www.cs.unc.edu/~walk/.
[WDS04] Ingo Wald, Andreas Dietrich, and Philipp Slusallek. An Interactive Out-of-Core Rendering Framework for Visualizing Massively Complex Models. In Proceedings of the Eurographics Symposium on Rendering, 2004.
[Whi80] Turner Whitted. An improved illumination model for shaded display.
Commun. ACM, 23(6):343–349, 1980.
[WIK+06] Ingo Wald, Thiago Ize, Andrew Kensler, Aaron Knoll, and Steven G. Parker. Ray Tracing Animated Scenes using Coherent Grid Traversal. ACM SIGGRAPH 2006, 2006.
37
[WPS+03] Ingo Wald, Timothy J. Purcell, Joerg Schmittler, Carsten Benthin, and
Philipp Slusallek. Realtime Ray Tracing and its use for Interactive
Global Illumination. In Eurographics State of the Art Reports, 2003.
[WSBW01] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner.
Interactive rendering with coherent ray tracing. In A. Chalmers and T.-
M. Rhyne, editors, EG 2001 Proceedings, volume 20(3), pages 153–
164. Blackwell Publishing, 2001.
[WSE04] D. Weiskopf, T. Schafhitzel, and T. Ertl. GPU-Based Nonlinear Ray
Tracing. Computer Graphics Forum (Eurographics 2004), 23(3):625–
633, 2004.
[WSS05] Sven Woop, Jörg Schmittler, and Philipp Slusallek. Rpu: a pro-
grammable ray processing unit for realtime ray tracing. ACM Trans.
Graph., 24(3):434–444, 2005.