Master Thesis

Ray Tracing Complex Scenes on Graphics Hardware

Student: Pei-Lun Lee
Advisors: Yung-Yu Chuang, PhD., Ming Ouhyoung, PhD.

Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan






Abstract

Recently, owing to the advancement of graphics hardware, the graphics processing unit (GPU) is intensively used for general computation beyond the graphics display it was originally designed for. Among such applications, ray tracing on the GPU is one of the most promising, owing to its inherent properties of parallel computation and shared data access. However, most of the existing research on GPU ray tracing focuses on accelerating ray tracing itself while assuming the scene is not too complex. In this thesis, we present a two-level hierarchy of data structures to render complex scenes using a collaborative CPU/GPU system. We compare its performance with a CPU-only implementation and conclude that this method is efficient and easy to implement.


Contents

1 Introduction
1.1 Motivation
1.2 Problem statement
1.3 Thesis organization

2 Related work
2.1 Rendering complex scenes
2.2 General purpose computation on graphics hardware
2.3 Ray tracing on graphics hardware

3 Algorithm
3.1 Ray-scene intersection
3.2 Complex scenes
3.3 The two-level acceleration structure
3.4 Traversal of two-level structure
3.5 Rendering

4 Results
4.1 The scenes
4.2 Experiments and results
4.2.1 Dragons
4.2.2 Dragonball
4.2.3 Plants
4.2.4 Powerplant
4.3 Conclusions

5 Conclusions and future work
5.1 Conclusions
5.2 Future work

List of Figures

3.1 Interface of ray-scene intersection on GPU
3.2 Details of ray-scene intersection on GPU
3.3 Multi-texture approach
3.4 The two-level algorithm
3.5 Partition of a level 1 structure using a KD-tree of the bunny model

4.1 The order and position of each object of the dragon scene
4.2 The function-generated camera path used in the dragon scene
4.3 (a) Plants scene viewed from the top. (b) The leaf nodes of the level 1 KD-tree of the plants scene.
4.4 Path used to walk through the powerplant scene
4.5 (a) Rendering time of different implementations of the ray tracer on the dragon scenes. The two vertical lines are the two limits on texture size of the single-texture GPU ray tracer and the 2-level structure. (b) The texture size distribution of the dragon scenes.
4.6 (a) Per-frame rendering time of the 10-dragon scene along the camera path with different acceleration structures. (b) Number of times the underlying GPU ray tracer is called per frame.
4.7 Dragonball: (a) Texture size and number of nodes under different partitions. (b) Rendering time of the 2-level structures with different node sizes.
4.8 (a) Rendering time of the dragonball scene partitioned into 3 nodes. (b) Rendering time of the 2-level structure only.
4.9 Dragonball: number of times the GPU BVH is called per frame
4.10 Plants: rendering time per frame (a) CPU KDT and 2-level (b) 2-level only
4.11 Plants: (a) number of times the GPU BVH is accessed per frame (b) number of rays generated per frame
4.12 Plants: (a) percentage of active texture size (b) percentage of incremental texture size
4.13 Per-frame rendering time of powerplant (a) CPU KDT and 2-level (b) 2-level only
4.14 Powerplant: (a) number of times the GPU BVH is called per frame (b) total number of rays generated per frame
4.15 Powerplant: (a) percentage of active textures (b) percentage of incremental textures


Chapter 1

Introduction

1.1 Motivation

Ray tracing is a classic algorithm in computer graphics for generating photo-realistic images by emulating the natural phenomenon of human visual perception. The most time-consuming part of the ray tracing algorithm is the computation of ray-scene intersections. In ray tracing, the computation of each ray is independent of the others, so the algorithm is well suited to parallel computing.

Recently, the graphics processing unit (GPU) has been gaining computation power faster than the CPU, and the gap continues to grow. Its programmability makes it a great platform for general computing, and ray tracing is one of the targets of interest. While more and more ray tracing algorithms are implemented on the GPU, all of them assume that the scene data fits into the on-board memory of the GPU. However, in many practical cases such as production rendering, the scene contains more complex geometry than what GPU memory can hold. In this thesis, we present a GPU-assisted ray tracer system that takes complex scenes as input, which makes GPU ray tracing more practical to use.

1.2 Problem statement

Given scene data whose memory consumption is larger than the capacity of GPU memory, we propose a data structure for data management and generate images using a GPU-assisted ray tracing algorithm.


1.3 Thesis organization

In the following chapters, we introduce the related work on GPU ray tracing and complex scene rendering, then describe the proposed method and show the experiments and results, and lastly conclude the work and discuss some future directions of research.


Chapter 2

Related work

In this chapter, we survey the related work on different topics including rendering complex scenes, general purpose computation on graphics hardware, and ray tracing on graphics hardware. We also survey some of the recent work on ray tracing in the first section.

2.1 Rendering complex scenes

[PKGH97] presented a set of algorithms to improve the data locality of the ray tracing algorithm, so that more complex scenes can be rendered within a given memory capacity. They organize a geometry cache in two-level regular voxel grids of different granularities: the coarse grids, termed geometry grids, contain finer grids called acceleration grids. We use a scheme similar to theirs, except that their goal is to manage the access pattern between disk and main memory, while we deal with system memory to GPU memory management. In addition, they use regular grids at both levels, whereas in our scheme either level can be any data structure.

In [WPS+03], they reported that realtime ray tracing had already been achieved on different hardware architectures, including shared memory systems, PC clusters, programmable GPUs, and custom hardware. They also presented interactive global illumination algorithms as one of the applications utilizing realtime ray tracing.

[WDS04] visualizes an extraordinarily complex model, the Boeing 777, which contains 350 million individual triangles. They use a combination of several techniques, including a realtime ray tracer, a low-level out-of-core caching and demand loading strategy, and a hierarchical approximation scheme for geometry not yet loaded. They render the full model at several frames per second on a single PC.

[WSS05] presents an architecture for programmable ray tracing hardware, which combines the flexibility of general purpose CPUs and the efficiency of GPUs. They implemented a prototype FPGA version running at 66 MHz that can render images at interactive rates. The limitation of their system is similar to that of other ray tracers pursuing performance: the scene must be static and contain only triangles.

2.2 General purpose computation on graphics hardware

[OLG+05] surveyed the development of a new research domain, general-purpose computation on graphics processors, or GPGPU. They described the techniques used in mapping general-purpose computation to graphics hardware, surveyed a broad range of GPGPU applications, and pointed out the trends of next-generation GPGPU research.

[GRH+05] presented an improved bitonic sorting network algorithm on the GPU. They analyzed the cache efficiency of the algorithm to obtain a better data access pattern. The performance of their implementation was higher than that of a CPU quicksort routine.

[FSH04] analyzed the suitability of GPUs for dense matrix-matrix multiplication. They implemented a near-optimal algorithm but found that the utilization of the arithmetic resources was relatively low compared to the CPU counterpart. The bottleneck was the limited memory bandwidth, so they concluded that the GPU was not yet a suitable platform for such computation.

2.3 Ray tracing on graphics hardware

[CHH02] was one of the first works on ray tracing with programmable graphics hardware. They implemented only the ray-triangle intersection on the GPU, and let the CPU do the ray generation and shading. Ray data were stored in texture memory, and vertex data were sent through the graphics pipeline as drawing primitives. In terms of intersections per second, their implementation exceeded the fastest CPU-based ray tracer at that time. The main bottleneck of their implementation is the slow readback of data from GPU to main memory, which is required each time a computation is invoked.

[PBMH02] mapped the entire ray tracing algorithm onto the GPU, from ray generation, acceleration data structure traversal, and ray-triangle intersection to shading. They chose the uniform grid as the acceleration data structure. All data, including rays, geometry, and the grid, were stored in texture memory. Still, they suffered from a low utilization rate of the computation resources. They also presented the concept that a GPU can be viewed as a stream processor [Pur04]: under this model, fragment programs are the kernels, and data are the streams. Their work inspired several later works, including [Chr05] and [KL05], which follow their framework.

[WSE04] used the GPU to implement a nonlinear ray tracing algorithm, which extends ray tracing to handle curved light rays and can be used to visualize gravitational phenomena.

[TF05] implemented a kd-tree traversal algorithm on the GPU. Their modified algorithm can traverse a kd-tree without a stack, making it suitable for the GPU, which lacks support for such complex data structures. They also used it in a ray tracer that beat the performance of one using a uniform grid.

[TS05] compared how the performance of a GPU-based ray tracer is affected by different acceleration data structures. They chose three structures to compare: the uniform grid, the kd-tree, and bounding volume hierarchies, and found that bounding volume hierarchies are the most efficient in most cases.

[CHCH06] uses a threaded bounding volume hierarchy built from a geometry image, which can be efficiently traversed and constructed entirely on the GPU and allows for dynamic geometry and level of detail.

NVIDIA Gelato [NVI05] is a hardware-accelerated, non-real-time, high-quality renderer which leverages the NVIDIA GPU as a floating point math processor to render images faster than comparable renderers.


Chapter 3

Algorithm

In this chapter, we describe in detail the algorithms we use. Before we start, we list some of the design decisions we made.

Design decisions

• CPU/GPU job division

Although we utilize the computation power of the GPU, we do not aim to design a system that runs completely on the GPU. Instead we offload some of the tasks to the CPU and focus our experiments on the speedup of the ray-scene intersection.

GPU:

– Traversal of level 2 structures

CPU:

– Generating camera rays

– Shading

– Spawning secondary rays

• Scene geometry

We choose triangles to be the only type of geometric primitive. This simplifies the problem and lets us focus on the specific part of the system; limiting the number of primitive types to one also suits the SIMD execution of the GPU.

• Scene complexity

We do not aim for arbitrarily complex scenes. The main subject of this thesis is memory management between the GPU and the host, not between disk and memory, so the data size of a scene cannot exceed the capacity of main memory.


• Illumination model

Since we compute the shading on the CPU side, we do not want the cost of shading to be too expensive. Besides, realistic illumination is not the main objective, so we choose simple Whitted ray tracing as the illumination model.

3.1 Ray-scene intersection

The most important component in our system is the ray-scene intersection. This is the part where we mainly utilize the computation power of the GPU. The ideal interface of ray-scene intersection on the GPU is shown in Figure 3.1. Note that multiple rays are fed into the GPU at the same time. This is due to the SIMD nature of the fragment processors on the GPU: they are most efficient when processing multiple data elements at the same time. This interface resembles the ray engine in [CHH02]. A more detailed framework of our GPU intersection block is shown in Figure 3.2. We implemented the methods described in [TS05]. The scene data is first converted into an acceleration data structure and then stored in the texture memory of the GPU as textures. The rays are also stored in textures. The traversal of the acceleration structure and the intersection tests, written in shader code, are executed in the fragment processors. The intersection results are output to the framebuffer, which has a texture attached, and then read back to main memory.

In our implementation, we find the performance of the BVH to be the best of the 3 GPU acceleration structures in [TS05], which is consistent with their conclusion. Therefore we choose the BVH as the underlying GPU intersection engine in our system. To implement the BVH ray tracer on the GPU, several hardware and software features are required. First, the GPU must support Shader Model 3.0, since dynamic branching and looping instructions are used in the fragment shaders. To store data in textures, floating point precision in textures must be supported, as in the OpenGL extension GL_ARB_texture_float. To output data to textures in a fragment shader, we use the OpenGL framebuffer object, which supports render-to-texture. We also use occlusion queries to determine whether the computation on the GPU is over. Last, we use early z-culling, an optimization in the graphics pipeline, to keep pixels whose computation is already finished from entering the fragment processors again. All the above techniques are described in [TS05] and [GPG05].


Figure 3.1: Interface of ray-scene intersection on GPU

Figure 3.2: Details of ray-scene intersection on GPU

3.2 Complex scenes

The scheme described in the last section works fine with scenes that fit into texture memory; we call it the single-texture approach. However, as the scene geometry gets more complex, two problems may be confronted:

1. A single texture may exceed the maximum texture size

2. The textures may exceed the capacity of the on-board texture memory of the GPU

In the OpenGL specification, the maximum texture size is 4096 by 4096. In our implementation, we use 32-bit floating point textures to store data. As a result, the largest single texture is 256MB. Take the BVH, one of our implemented acceleration structures on the GPU, for example: it takes somewhere around 64 to 67 bytes per triangle. By calculation, the 256MB limit is reached at around 4 million triangles converted into a single-texture representation of the BVH structure. That is, if we use the simple BVH implementation, we can handle scenes with at most 4 million triangles.
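This arithmetic can be checked with a quick back-of-envelope sketch. One assumption here that the text does not state explicitly: the 256MB figure implies four 32-bit float channels per texel, i.e. 16 bytes per texel.

```python
# Back-of-envelope check of the single-texture triangle limit.
# Assumption (not stated in the text): each texel holds four 32-bit
# float channels, i.e. 16 bytes, which yields the 256MB figure.
max_side = 4096
bytes_per_texel = 4 * 4                        # 4 channels x 4-byte floats
texture_bytes = max_side * max_side * bytes_per_texel
bytes_per_triangle = 64                        # lower end of the 64-67 range
print(texture_bytes // 2**20)                  # 256 (MB)
print(texture_bytes // bytes_per_triangle)     # 4194304, about 4 million
```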


Figure 3.3: Multi-texture approach

Another limitation is the capacity of texture memory. The latest GPUs have at most 512MB of texture memory, that is, room for two 256MB textures. However, with PCI-Express technology, the bandwidth between GPU and host is greatly improved compared to the last-generation AGP bus. Graphics manufacturers thus provide driver-level mapping from graphics memory to main memory. That is, we can use part of main memory as texture memory, just like virtual memory uses the hard disk as memory. For example, a GPU may have only 64MB of memory on board, but to a user program there may seem to be 256MB of texture memory, since the driver transparently swaps texture data in and out between GPU and host. These features are quite useful in low-end graphics products, since the cost of memory can be reduced while the performance stays comparable to products with the same amount of actual on-board memory. However, by our experiments, the high-end GPU we use (nVidia GeForce 7800) also has this feature: the available memory is 256MB, but we can allocate more than 1GB of textures without error. So this feature should be implemented in software and independent of the hardware design.

In sum, the limit on GPU memory size is resolved by the driver. The real remaining limit is the size of a single texture. To get rid of this limit, a direct solution is to tile smaller textures into a larger one. For example, if we have data stored in an array represented as a texture of size 8192 by 8192, we can divide it into 4 textures sized 4096 by 4096, and then override the texture look-up function in shader code as in Algorithm 1.

In this way, which we call the multi-texture approach, by mapping the texture coordinates of a larger texture to those of multiple smaller textures, we can virtually access


Algorithm 1 MyTex2D(tex, x, y)
if x < 4096 and y < 4096 then
  Return tex2D(tex0, x, y)
else if x >= 4096 and y < 4096 then
  Return tex2D(tex1, x - 4096, y)
else if x < 4096 and y >= 4096 then
  Return tex2D(tex2, x, y - 4096)
else
  Return tex2D(tex3, x - 4096, y - 4096)
end if
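As an illustrative sketch (not the thesis's actual shader code), the same quadrant lookup can be emulated on the CPU; the tile size is a parameter here so the function can be exercised on small arrays:

```python
def my_tex2d(tiles, x, y, tile=4096):
    """Look up a virtual (2*tile x 2*tile) texture split into 4 quadrants.

    tiles = (tex0, tex1, tex2, tex3), each indexable as tex[y][x],
    laid out as in Algorithm 1: tex0 holds low x / low y, tex1 high x /
    low y, tex2 low x / high y, tex3 high x / high y.
    """
    if x < tile and y < tile:
        return tiles[0][y][x]
    if x >= tile and y < tile:
        return tiles[1][y][x - tile]
    if x < tile and y >= tile:
        return tiles[2][y - tile][x]
    return tiles[3][y - tile][x - tile]
```

On the GPU this branching is exactly the efficiency problem discussed below: under SIMD execution all four fetches may be issued even though three results are discarded.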

Figure 3.4: The two-level algorithm

a texture that is larger than the maximum size in the specification. Using this function, we can easily make GPU intersection code capable of handling complex scenes by simply sending larger textures.

Nevertheless, this approach does not perform as efficiently as it looks. In the fragment processors, where this code is executed, instructions are issued in SIMD fashion, which means that each time we call the MyTex2D function in a fragment shader, 4 texture fetch commands are executed, and 3 of the 4 fetched texels are discarded by the conditional branching. In addition, blocks of all 4 textures are potentially loaded into texture memory and consume the bandwidth of the PCI-Express bus. The more tiles a texture is split into, the worse the performance. Overall, multi-texture is not a practical solution, so we come up with another solution, the two-level acceleration structure, which is introduced in the following section.


3.3 The two-level acceleration structure

The main idea of the proposed algorithm is to build a two-level hierarchy of acceleration data structures, composed of a coarser structure at the top level and finer structures at each leaf node of the coarser one. The coarser level (level 1) is traversed by the CPU, while the finer level (level 2) is traversed by the GPU, as shown in Figure 3.4. Through this partition, we can split large scene data into smaller chunks that fit into the limited memory of the GPU. This method is similar to [PKGH97], where both the coarser and finer levels are uniform grids. In theory, any acceleration data structure can be used at either level in this algorithm.

Assumptions

We make the following assumptions on the scene:

1. Scenes are composed purely of triangles

2. Scene data is larger than the texture memory on the GPU but smaller than system main memory

For the first point, we handle only one type of primitive because the GPU uses a SIMD architecture, and two intersection routines would mean twice the cost in GPU programming. Besides, all other primitives can be subdivided into triangles.

Second, from the viewpoint of GPU computation, the memory hierarchy is: GPU L1 cache, GPU L2 cache, texture memory, system memory, and disk storage. The main target of this research is the texture memory to system memory level, so we assume that our data fits into main memory and do not explicitly handle memory management between memory and disk.

Construction

Given a set of triangles T, its 2-level acceleration data structure (AS_L1, {AS_L2^i}) can be computed with Algorithm 2.

To build a level 1 structure, one can use a top-down algorithm such as a KD-tree, setting the maximum number of triangles in a leaf node to a desired threshold. Alternatively, starting from a complete acceleration structure or one built with a bottom-up algorithm such as the bounding volume hierarchy (BVH) [GS87], one can merge leaf nodes from bottom to top until the memory size or the number of nodes meets the requirement.


Algorithm 2 Construction of 2-level acceleration structure
1: Build AS_L1 on T
2: for each leaf node L_i of AS_L1 do
3:   Build AS_L2^i on the triangles of L_i
4: end for
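A minimal sketch of this construction, with stand-ins for the real structures: a 1-D median split on triangle centroids plays the role of the level 1 KD-tree, and a plain triangle list plays the role of the per-leaf BVH texture that the thesis actually builds.

```python
def build_level1(tris, max_leaf, axis=0):
    """Coarse top-down split on triangle centroids (KD-tree stand-in)."""
    if len(tris) <= max_leaf:
        return {"leaf": True, "tris": tris}
    tris = sorted(tris, key=lambda t: t["centroid"][axis])
    mid = len(tris) // 2
    return {"leaf": False,
            "left": build_level1(tris[:mid], max_leaf, axis),
            "right": build_level1(tris[mid:], max_leaf, axis)}

def build_two_level(tris, max_leaf):
    """Algorithm 2: build AS_L1, then one level 2 structure per leaf."""
    root = build_level1(tris, max_leaf)
    leaves = []
    def visit(node):
        if node["leaf"]:
            # placeholder for building the per-leaf BVH and its texture
            node["level2"] = list(node["tris"])
            leaves.append(node)
        else:
            visit(node["left"])
            visit(node["right"])
    visit(root)
    return root, leaves
```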

Figure 3.5: Partition of a level 1 structure using KD-tree of the bunny model

Figure 3.5 visualizes a level 1 structure. Each color patch corresponds to a leaf node of the KD-tree. As the figure shows, the partition is rather coarse, so each leaf node contains many triangles compared to a normal KD-tree used in an ordinary ray tracing algorithm.

A level 2 structure is no different from an acceleration data structure used in other GPU-based ray tracers. The level 2 structures are built from the triangles of the leaf nodes of the level 1 structure; each leaf node has one level 2 structure associated with it. The constructed level 2 structures AS_L2^i must be converted to the texture form that can be traced by a GPU ray tracer, and the texture representation of a level 2 structure must not exceed the size limitation of a single texture. In OpenGL, this limitation is 4096 by 4096 texels, which can store up to 256MB of data if 32-bit floating point textures are used. So when building the level 1 structure, the size of each leaf node must be chosen so that its level 2 structure is not too large.

3.4 Traversal of two-level structure

Here we give the algorithm to traverse a 2-level acceleration structure with the same interface as the ray-scene intersection described in Section 3.1. That is, given a queue of rays R and a 2-level structure, compute and return the intersection of each ray in R.

Algorithm 3 Intersect(R, AS_L1, AS_L2^i)
1: for each ray r in R do
2:   Add r to Q_L1
3: end for
4: while Q_L1 is not empty do
5:   for each ray r in Q_L1 do
6:     p = IntersectL1(r, AS_L1)
7:     if p is a leaf node Q_L2^i then
8:       Add r to Q_L2^i
9:     else
10:      Intersect[r] = NULL
11:      Add r to I
12:    end if
13:    Delete r from Q_L1
14:  end for
15:  for each non-empty Q_L2^i do
16:    S = IntersectL2(Q_L2^i, AS_L2^i)
17:    for each ray s in S do
18:      if s is a hit then
19:        Add s to I
20:      else
21:        Add s to Q_L1
22:      end if
23:    end for
24:  end for
25: end while
26: Return I
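This queue-based loop can be rendered as a toy sketch (all names illustrative): a plain ordered list of leaves stands in for the level 1 KD-tree, so every ray simply visits leaves in index order, and hit_in_leaf stands in for the batched GPU BVH pass over one leaf's ray queue.

```python
def two_level_intersect(rays, leaves, hit_in_leaf):
    """Toy version of Algorithm 3; rays must be hashable (e.g. ids)."""
    position = {r: 0 for r in rays}        # per-ray 'restart' cursor
    q_l1 = list(rays)                      # the level 1 ray queue
    q_l2 = {i: [] for i in range(len(leaves))}  # one queue per leaf
    hits = {}
    while q_l1:
        for r in list(q_l1):
            i = position[r]                # next unvisited leaf for r
            q_l1.remove(r)
            if i < len(leaves):
                position[r] = i + 1        # advance the restart cursor
                q_l2[i].append(r)
            else:
                hits[r] = None             # ray left the scene: a miss
        for i, queue in q_l2.items():
            if not queue:
                continue
            for r in queue:                # the batched pass over leaf i
                h = hit_in_leaf(r, leaves[i])
                if h is not None:
                    hits[r] = h
                else:
                    q_l1.append(r)         # miss: resume level 1 traversal
            q_l2[i] = []
    return hits
```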

We maintain a ray queue for the level 1 structure and for each level 2 structure; that is, if the level 1 structure has n leaf nodes, there are n + 1 ray queues. First, all rays input for the intersection query are put into the level 1 ray queue. Then each ray in the level 1 ray queue is traversed in the level 1 structure to find the nearest leaf node hit by the ray. If a ray hits some leaf node, we move that ray to the ray queue of the level 2 structure that node corresponds to; if the ray does not hit anything, its traversal ends here. After this step, the level 1 ray queue is empty and some of the level 2 queues hold rays. The level 2 structures with non-empty ray queues are traversed and intersection-tested on the GPU. For a ray reported hit during the traversal of a level 2 structure, we update its intersection state, and it can stop the traversal if we use a level 1 structure that traverses along the direction of the ray, such as a uniform grid or KD-tree. Otherwise, if a ray misses in the level 2


traversal, it should continue its traversal in the level 1 structure, not starting from the root but from the next leaf node after the current one. To do this we have to modify the level 1 traversal, which we describe in the following.

Traversal of level 1 data structure

As stated above, the traversal of the level 1 structure must continue from the last position of the previously visited node, or it would loop forever. Our solution is to store a current-position field for each ray and, during traversal, skip all the nodes that precede the current node. This method is used by [TF05] to remove the recursion of KD-tree traversal in their GPU ray tracer, which they call 'KD-Restart'; we use it in a CPU algorithm instead. The method adds extra cost to the traversal of the level 1 structure, but since in our system a level 1 structure is usually very small, containing only tens of leaf nodes, this additional computational cost does not concern us.

Algorithm 4 IntersectL1(r, AS_L1)
p = nearest leaf node from position[r] along r
position[r] = next leaf node after p along r
Return p

Taking the effects of the traversal 'restart' into account, we compare 3 acceleration structures as candidates for the level 1 structure: the uniform grid, the KD-tree, and the BVH.

• Uniform grid: To store the 'restart' position, the xyz index of a grid cell must be saved per ray, which can be encoded into one integer. The traversal order is along the direction of the ray. The space partition is regular, which can generate empty nodes or extremely uneven node sizes.

• KD-Tree: The 'restart' position consists of 2 floating point numbers, the tmin and tmax of a node along the direction of the ray. The space partition is adaptive and each leaf node has a similar size. The traversal order is along the ray direction, which is a plus.

• BVH: When traversed, a BVH is usually flattened from the original tree structure into an array in some fixed search order, such as depth-first. We save the index into this array as the 'restart' position. A BVH is not traversed along the ray direction. The space partition is adaptive.


From this comparison, we choose the KD-tree as the level 1 structure, since it has the properties of ray-order traversal and adaptiveness.

3.5 Rendering

Traditionally, a ray tracing algorithm is recursive. In our system, since the interface of the ray-scene intersection takes multiple rays instead of a single ray, it is difficult to use in a recursive ray tracing algorithm. Here we use the classic Whitted ray tracer as an example and give two algorithms that utilize the 2-level structure. One is called 'batched', which interfaces with the 2-level structure through the ray-scene intersection block and is thus unaware of the underlying acceleration data structure. The other is called 'mixed', which integrates the rendering with the traversal of the 2-level structure to get better utilization.

Batched

Algorithm 5 Rendering-batched(R)
1: i ← 0
2: while i < max_iterations do
3:   i ← i + 1
4:   R ← Intersect(R)
5:   for each ray r in R do
6:     if r is a hit then
7:       Add the shaded color to the pixel associated with r
8:       Add each spawned ray into S
9:     else
10:      Add the background color to the pixel associated with r
11:    end if
12:  end for
13:  R ← S
14:  S ← empty queue
15: end while

In Algorithm 5, we basically maintain 2 ray queues, R and S. R stores the rays for the current iteration while S stores the secondary rays for the next iteration. At each iteration, the ray-scene intersection is called to get the intersections of R. For each ray that hits, we compute its shading and its spawned secondary rays, such as reflection and refraction, and add these new rays to S. Then we move the rays in S to R and continue with the next iteration. The rendering process continues until the predefined maximum number of iterations is reached or there are no rays left to process.

Note that this algorithm is not aware of the underlying acceleration structure. The intersect procedure can be a CPU implementation, a single-texture GPU implementation, or the 2-level structure described here. We also use this algorithm with the CPU KD-tree implementation in the experiments and results in the next chapter.
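A compact sketch of the batched loop, with illustrative names: intersect and shade are caller-supplied stand-ins for the ray-scene intersection block and the Whitted shading step, and rays are small dicts carrying a pixel coordinate.

```python
def render_batched(rays, intersect, shade, background, max_iters=5):
    """Algorithm 5: iterate ray generations until done or max_iters."""
    image = {}
    for _ in range(max_iters):
        if not rays:
            break
        hits = intersect(rays)             # one batched intersection call
        secondary = []
        for ray, hit in zip(rays, hits):
            pixel = ray["pixel"]
            if hit is None:
                image[pixel] = image.get(pixel, 0.0) + background
            else:
                color, spawned = shade(ray, hit)
                image[pixel] = image.get(pixel, 0.0) + color
                secondary.extend(spawned)  # reflection/refraction rays
        rays = secondary                   # R <- S for the next iteration
    return image
```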

Mixed

Algorithm 6 Rendering-mixed(R)
1: for each eye ray r do
2:   Initialize position[r]
3:   Add reference[r] to Q_L1
4: end for
5: while at least one of Q_L1 and all Q_L2^i is not empty do
6:   for each ray r in Q_L1 do
7:     p ← TraceL1(r, AS_L1)
8:     if p is a leaf node AS_L2^i of AS_L1 then
9:       position[r] ← next(r, p, AS_L1)
10:      Add r to Q_L2^i
11:    end if
12:    Delete r from Q_L1
13:  end for
14:  Choose an AS_L2^i with non-empty Q_L2^i
15:  S ← TraceL2(Q_L2^i, AS_L2^i)
16:  for each ray s_i in S do
17:    if s_i is a hit then
18:      Process the intersection
19:      Spawn secondary rays T
20:      Add T to Q_L1
21:    else
22:      Add s_i to Q_L1
23:    end if
24:    Delete s_i from Q_L2^i
25:  end for
26: end while

In Algorithm 6, we integrate the rendering of Algorithm 5 with the traversal of the 2-level structure in Algorithm 3. The key difference is that after we get the intersections from a level 2 structure, we do not proceed to the intersection of the next non-empty node. Instead, we return to the level 1 structure to process the newly spawned secondary rays. This fills the ray queues of the level 2 structures with more rays and reduces the number of passes needed to process a level 2 structure. The line 'choose a level 2 structure' in the algorithm can be implemented in many ways; [PKGH97] suggests a cost function to determine the best node. We adopt the simplest method: choose the node with the most rays in its queue.
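This node-selection rule is easy to state in code (an illustrative helper; queues maps each level 2 node id to its list of queued rays):

```python
def pick_node(queues):
    """Return the id of the level 2 structure with the fullest ray queue,
    or None if every queue is empty (the 'most rays' heuristic)."""
    best = max(queues, key=lambda i: len(queues[i]), default=None)
    return best if best is not None and queues[best] else None
```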


Chapter 4

Results

In this chapter we perform various experiments on a variety of scenes. First, we list the environment setup:

Development environment

• CPU: Intel Pentium 4 3.0 GHz with HyperThreading

• RAM: 2.5GB DDR2 SDRAM

• OS: Microsoft Windows XP Professional SP2

• GPU: nVidia GeForce 7800 GTX with 256MB memory

• Compiler: Microsoft Visual C++ .Net

• Cg compiler: nVidia Cg Compiler 1.5 Beta 2

• Driver: nVidia ForceWare 84.21

Setup

• The acceleration structures we use in the 2-level structure are a KD-tree as the level 1 structure and BVHs as the level 2 structures, for the reasons described in the previous chapter: the KD-tree supports ray-order traversal and adapts to the geometry, making it the best candidate for the level 1 structure, while the BVH performs best among all existing GPU implementations as the level 2 structure.


• Unless specifically mentioned, all level 1 KD-trees are constructed using 1/10 of the total number of triangles as the maximum number of triangles per leaf node, and the BVHs are constructed by shuffling the triangle insertion order 10 times and choosing the tree with the lowest value of the cost function in [GS87].

• The CPU KDT in the results is our own implementation based on the KD-tree traversal code in [PH04]. The interface of the KD-tree ray-scene intersection function is modified from a single ray to multiple rays to match the interface of the GPU algorithms. This ensures that the only difference between the CPU and GPU implementations is the ray-scene intersection, which is our main focus.

• All the images rendered by our program are 512 by 512 pixels in size, and no multisampling or antialiasing is performed.

• Our program takes large data as input and consumes a lot of memory. We turn on the 3GB option of Windows XP [Mic05] to get more user memory space, and we replace the malloc function in the Visual C++ standard library with the allocator from [Lea00] for better memory allocation. Without these two tunings, our program runs out of memory on some large scenes.
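The BVH construction described in the setup above (shuffle the insertion order, keep the cheapest tree) can be sketched as follows. This is an illustrative skeleton, not our builder: the cost callable stands in for building a BVH from a given insertion order and evaluating the [GS87] cost function on it.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <random>
#include <vector>

// Try `trials` random insertion orders and return the one whose resulting
// tree scores lowest under `cost`; the original order is the baseline.
std::vector<int> bestInsertionOrder(std::vector<int> order, int trials,
                                    const std::function<double(const std::vector<int>&)>& cost,
                                    unsigned seed = 0) {
    std::mt19937 rng(seed);                 // fixed seed for reproducible builds
    std::vector<int> best = order;
    double bestCost = cost(best);
    for (int i = 0; i < trials; ++i) {
        std::shuffle(order.begin(), order.end(), rng);
        double c = cost(order);
        if (c < bestCost) { bestCost = c; best = order; }
    }
    return best;
}
```

Since the initial order is evaluated first, the chosen order can never score worse than the input one.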

4.1 The scenes

Table 4.1 summarizes the statistics of the scenes. The numbers in the texture size row are the sums of the sizes of the level 2 BVH textures of each scene. The number of triangles and texture size of the dragon scene are listed in detail later, since it contains more than one scene. The lights we use for rendering are all point light sources. We'll look at each scene and its characteristics in the following.

Dragons

This is a set of scalable scenes, which we use to model different degrees of complexity. The models are composed of a plane and dragon models (from the Stanford 3D Scanning Repository [Sta06]). There are 21 scenes in total, containing from 1 to 21 dragons. Each dragon has about 87 thousand triangles. The position and order of each dragon object are shown in figure


Scene          dragon   dragonball   plants       powerplant
Triangles      varied   7,364,817    14,119,952   12,748,510
Texture size   varied   ~470MB(*)    877MB        793MB
Light sources  1        1            1            6
Reflection     yes      yes          yes          no
Refraction     no       yes          no           no

Table 4.1: Scene statistics

16 17 18

10 11 12

4 5 6

3 1 2

9 8 7

15 14 13

21 20 19

Figure 4.1: The order and position of each object of the dragon scene


Figure 4.2: The function generated camera path used in the dragon scene

obj  Triangles    Tex. size    obj  Triangles    Tex. size
1    1,076,214    68.3         11   9,790,354    616
2    1,947,628    123          12   10,661,768   670
3    2,819,042    177          13   11,533,182   725
4    3,690,456    232          14   12,404,596   779
5    4,561,870    286*         15   13,276,010   834
6    5,433,284    341          16   14,147,424   890
7    6,304,698    396          17   15,018,838   944
8    7,176,112    451          18   15,890,252   970
9    8,047,526    506          19   16,761,666   1,020*
10   8,918,940    560          20   17,633,080   1,080
                               21   18,504,494   1,130

Table 4.2: Number of triangles and total texture size (in MB) of the dragon scenes

4.1, from 1 to 21. In this order, we try to make the shape as regular as possible while keeping each object in the same position as others are added. To preserve the complexity of the scenes, we do not use object instancing, and all triangles are individual. There is only one light source in these scenes. The plane

is specular and reflects rays. We render the scene using a camera path similar to [TS05], which circles the scene, moving up and down along a sine curve while looking at the center of the scene. We render 100 frames in total along the path. Figure 4.2 shows the curve of the camera positions.
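A path of this kind is easy to generate procedurally. The following sketch produces one orbit position per frame; the radius, base height, and sine amplitude parameters are illustrative, not the settings used in the thesis:

```cpp
#include <cassert>
#include <cmath>

struct CamPos { double x, y, z; };

// Position for parameter t in [0, 1): circle the scene center at radius r
// while the height oscillates `cycles` times along a sine curve.
CamPos pathPosition(double t, double r, double baseHeight, double amplitude, int cycles) {
    const double twoPi = 8.0 * std::atan(1.0);
    return { r * std::cos(twoPi * t),
             baseHeight + amplitude * std::sin(twoPi * cycles * t),
             r * std::sin(twoPi * t) };
}
```

Evaluating t = frame / 100.0 for frames 0..99 would yield 100 positions around the scene, with the camera always aimed at the scene center.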

Table 4.2 lists the number of triangles and the total size of the level 2 BVH textures of each scene in the set. Here we should note two points. First, going from the scene with 4 dragons to 5, the texture size crosses the 256MB line, the capacity of the on-board texture memory of the GPU we use and also the maximum size of a single texture. Although the BVH texture size will be different


when a single BVH structure is built instead of a 2-level structure, the numbers are very close. Our experiment confirmed that a single BVH structure of the scene with 5 dragons exceeds 256MB. This means previous GPU ray tracers using a BVH as the acceleration structure cannot directly handle the scenes with 5 or more dragons here. The second point is at 19, where the texture size exceeds 1GB. We'll discuss this later with the results.

Dragonball

In this scene, we put another dragon model in the center. This dragon model, also known as the Asian Dragon or xyzrgb-dragon, also comes from the Stanford 3D Scanning Repository [Sta06]. It has more than 7 million triangles itself, which is the

most complex single object we use. We put two spheres by the sides of the dragon

model, the red one is specular and the blue one is transparent. There is another

squeezed sphere representing a lens, which is nearly transparent. Below these objects is a wave-shaped surface representing water. This water surface is specular, and its wave shape makes the directions of the reflected rays random and incoherent. In order to compute smooth reflected and refracted rays, we

must obtain smooth normal vectors at each vertex and interpolate them to get per-pixel normal vectors. This increases the memory consumption of each triangle, so we store per-vertex normals only in this scene. In the other three scenes, normal vectors are computed per triangle, since those scenes have at most flat reflections and per-vertex normals are unnecessary. We use the same function-generated path as the one used in the first dragon scene.
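The per-pixel normal described above is obtained by barycentric interpolation of the three vertex normals at the hit point. A minimal sketch (Vec3 and the (u, v) hit coordinates are illustrative names; this is not the thesis code):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Interpolate the shading normal at a hit with barycentric coordinates
// (u, v) from a ray-triangle test, then renormalize the result.
Vec3 shadingNormal(Vec3 n0, Vec3 n1, Vec3 n2, double u, double v) {
    double w = 1.0 - u - v;               // weight of the first vertex
    Vec3 n = { w * n0.x + u * n1.x + v * n2.x,
               w * n0.y + u * n1.y + v * n2.y,
               w * n0.z + u * n1.z + v * n2.z };
    double len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return { n.x / len, n.y / len, n.z / len };
}
```

The renormalization matters: the weighted sum of three unit normals is generally not itself a unit vector.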

Plants

This scene is from PBRT [PH04] and is used there to render a realistic ray-traced image. There is a large amount of object instancing in this scene. As with the dragon scene, we avoid object instancing and instead duplicate each individual object for the sake of scene complexity. The scene is composed of a terrain generated from a height field,

a reflective surface representing water, and a variety of duplicated plant objects. The main source of complexity is the plant models, and they are distributed only in one corner of the terrain, along the river. See figure 4.3(a) for a view from the top. Figure 4.3(b) visualizes the level 1 KD-tree of the scene. Each color patch represents a leaf node. Since we make the nodes contain as close to the same number of triangles as possible, we can see that most of the geometry gathers in one region.


(a) (b)

Figure 4.3: (a) Plants scene viewed from the top. (b) The leaf nodes of the level 1 KD-tree of the plants scene.

The original copy of this scene contains more than 19 million triangles and is the most complex model in our experiments. It works on our system, but its scale is beyond what we can handle efficiently, since its memory footprint exceeds 1GB and textures are swapped out to the hard disk. Therefore, we choose to simplify the scene by reducing the number of instanced objects. We keep the four apple trees untouched, since they are the most visible objects. The final version has around 14 million triangles, and the visual difference from the original scene is barely observable. We tested this scene with a fly-by camera path, which starts from the top of the scene, descends to ground level, moves along the river, and then circles the area where the plants reside. This path has 255 frames in total.

Powerplant

This model from the UNC walkthrough project [UNC01] contains more than 12 million triangles and features complex indoor scenes. Unlike the other scenes we used, which are generally flat shaped, the powerplant model is complex in every dimension, especially vertically. We tested this scene with a camera path that walks through it. The path is illustrated in figure 4.4. The red curve shows the movement of the camera position along the path. The camera starts from outside the building, circles to the opposite side of the building, and then goes into the building, passes through several different rooms, sometimes stopping and looking around, and ends inside the building. The length of the path we use is 275 frames. This is also the only scene that has more than one light source. We put six lights in this scene: one


Figure 4.4: Path used to walk through the powerplant scene

is outside the building at the top, representing sunlight; four are placed in different rooms inside the building for interior illumination; and the last one is mounted onto the camera, preventing the image from being too dark when walking through rooms that no other light covers. There are no specular or transparent surfaces in this scene.

4.2 Experiments and results

4.2.1 Dragons

We use this scene mainly to test how different ray tracer implementations perform as the complexity of the input model increases. The complexity of our models ranges from 1 million to 18 million triangles, and the texture representation ranges from less than 100MB to more than 1GB of memory. The tested data structures are: the KD-tree on the CPU, the proposed 2-level structure (which has two rendering variants: batched and mixed), and a single-texture version of the BVH on the GPU, which is described in 3.1 and is present for reference.

Figure 4.5(a) shows the rendering time of the increasingly complex models on different acceleration structures. Each number is generated by rendering the model with the camera path described above and averaging the rendering time over 100 frames. As expected, the rendering time curves rise as the scene gets more complex. There are some exceptions, especially for the KD-tree, whose curve rises steadily and then drops suddenly at 9 dragons. The reason may be the symmetry of the 9-dragon model, which results in a better tree than those of models with fewer triangles.


[Figure 4.5 charts omitted: (a) rendering time in seconds versus number of objects, with curves for CPU KDT, 2-level (batched), 2-level (mixed), and GPU BVH; (b) texture size in MB versus number of dragons.]

Figure 4.5: (a) Rendering time of different implementations of ray tracer on thedragon scenes. The two vertical lines are the two limit on texture size of the singletexture GPU ray tracer and the 2-level structure. (b) The texture size distribution ofthe dragon scenes.

Another thing to note is that although the time complexity of traversing the spatial index structures used here, such as the KD-tree or BVH, should be logarithmic in the number of objects, all the curves in figure 4.5(a) appear almost linear. This is because when one more dragon model is added to the scene, not only does the number of triangles increase, but also the area of visible pixels, which is proportional to the number of eye rays; and the traversal time is linear in the number of rays. So the trend of these curves is reasonable.

We can also see in figure 4.5(a) that GPU BVH performs consistently better than CPU KDT, as we expected. The 2-level structure can be viewed as a hybrid of CPU KDT and GPU BVH, and its performance lies somewhere between those two methods. Or rather, the 2-level structure takes more of the advantage of GPU BVH: its speedup over the CPU KDT is about 200 to 500 percent.

One more thing to note in figure 4.5(a) is the two dashed vertical lines. They correspond to the two horizontal lines in figure 4.5(b), which mark two texture sizes. One is at 256MB, the maximum size of a single texture. Since the GPU BVH stores the acceleration structure of the whole scene in a single texture, it is unable to process more than 5 dragons. Even though the 4-dragon scene is within the 256MB limit, the rendering time of the GPU BVH is not as good as for smaller scenes, and is even slower than that of the 2-level structure, which theoretically has more data to process. A possible reason is that the single 4-dragon texture, plus other memory already in use such as the framebuffer and shader programs, exceeds the capacity of the texture memory. Textures and other resources get swapped in and out during rendering, causing the jump in time. The 2-level structures also show a jump in rendering time at around 17 to 18 dragons,


[Figure 4.6 charts omitted: (a) per frame time in seconds over frames 1-91, with curves labeled time(kdt), time(ray engine), and time(whitted); (b) number of GPU accesses per frame for batched and mixed.]

Figure 4.6: (a) Per frame rendering time of the 10-dragon scene along the camerapath with different acceleration structures. (b) Number of times the underlying GPUray tracer is called per frame.

where the texture memory consumption is about 900MB. At this point during the experiment, we observed frequent hard disk access, which is not seen when scenes of lower complexity are processed. We infer that the size of the main memory the GPU driver allocates for mapping texture memory has an upper limit, and when more is demanded, the system swaps main memory out to the page file on disk, as with virtual memory. We guess this limit to be 4 times the size of the on-board texture memory, which would be 1GB in our case. That is why scenes with more than 1GB of textures still work, but the processing time increases very quickly, even exceeding CPU KDT at 19 dragons. We tried to find a way to adjust this main memory allocation size, but in vain.

In figure 4.5(a) we can also see that of the two variants of the 2-level structure, the mixed renderer outperforms the batched one. We take a closer look in figure 4.6. We choose a scene of average size and examine its behavior on a per frame scale.

Figure 4.6(a) shows the per frame rendering time of the 3 structures. The rise and fall of the curves reflects the different camera angles at each frame. Overall, we can see that 'mixed' is consistently faster than 'batched'. In figure 4.6(b), we plot the number of accesses to the underlying GPU BVH per frame; the number for 'mixed' is approximately half that of 'batched'. This is reflected in the rendering time, since the more times the GPU BVH is called, the more read-backs of data from the GPU take place.

4.2.2 Dragonball

In this scene, we test the effect of the level 1 leaf node size on the rendering time. In general, the larger the node size, the fewer the



Figure 4.7: Dragonball: (a) Texture size and number of nodes under different parti-tion (b) The rendering time of the 2-level structures of different node size


Figure 4.8: (a) Rendering time of the dragonball scene partitioned into 3 nodes (b) Rendering time of the 2-level structures only

nodes, so the number of calls to the GPU is also fewer, but larger nodes have a higher texture swapping cost. Conversely, a smaller node size generates more nodes with lower swapping cost while increasing the number of GPU calls. To find a balance point, we build several 2-level structures of the dragonball model with different node sizes. By setting the maximum number of leaf nodes when constructing the level 1 KD-tree, we can control the size and number of nodes. We choose 13 different settings; the average texture size and the number of nodes of these settings are shown in figure 4.7(a). The largest nodes are at the left of the graph, getting smaller toward the right. The partition with the largest node size has only 2 nodes, each with a texture size of about 250MB. At the other extreme, the one with the smallest node size has more than 400 nodes.

Figure 4.7(b) shows the resulting rendering times. The rendering time rises as the number of nodes increases, and the best setting appears to be the one with the fewest nodes. This is not what we expected; we thought the minimum would be somewhere in the middle. A possible reason is that the cost of calling the shader multiple times is much greater than that of swapping a large texture, or the driver's swapping performance is much better than we thought.

We take the 2-level structure with 3 nodes and compare its per frame rendering



Figure 4.9: Dragonball: number of times GPU BVH called per frame


Figure 4.10: Plants: rendering time per frame (a) CPU KDT and 2-level (b) 2-levelonly

time with CPU KDT in figure 4.8(a). The 2-level structure is about 3 times faster than the CPU KDT. The comparison of the 2-level variants is in figure 4.8(b); the mixed renderer is slightly faster in this scene, too. Figure 4.9 shows the number of GPU calls for the scene with the same setting.

4.2.3 Plants

In the plants scene we run the same experiment as for the powerplant: we use a camera path to walk through the scene and record the statistics. The main differences between these two scenes are as follows:

• Plants has more triangles than powerplant

• Plants is flat shaped with little occlusion, while the powerplant has complex occlusion

• Plants has only one light source but also a specular surface

Figure 4.10(a) gives the per frame rendering time of the scene. On average

the 2-level structures are 2.5 times faster than CPU KDT. Figure 4.10(b) shows the



Figure 4.11: Plants: (a) number of times GPU BVH is accessed per frame (b)number of rays generated per frame


Figure 4.12: Plants: (a) percentage of active texture size (b) percentage of incre-mental texture size

rendering time of the 2-level structures alone. The 'mixed' variant performs slightly better than 'batched'.

Figure 4.11(a) shows the number of GPU calls per frame. The numbers for 'mixed' are about half those of 'batched'. The numbers are relatively low around the 120th frame. This is because the camera moves to an edge of the scene and faces outward, which reduces the rendering of the frame to a single texture, similar to what happens when rendering the indoor frames of the powerplant. This is also reflected in the rendering time: the minimum point of 4.10(b) is also the minimum point of 4.11(a).

Figure 4.11(b) shows the number of rays generated at each frame. Note that the number is much lower than in the powerplant scene, and so is the rendering time. Figure 4.12(a) plots the percentage of active texture size. Except for a few frames that utilize a low ratio of the textures, most frames have a relatively high percentage compared with the powerplant scene, since the distribution of objects is flat and there is little occlusion in this scene.



Figure 4.13: Per frame rendering time of powerplant (a) CPU KDT and 2-level (b)2-level only

4.2.4 Powerplant

The powerplant scene takes the most time to render of all our scenes, although it is not the most complex in geometry, because it has the most light sources and thus generates many more shadow rays than the other scenes.

Figure 4.13(a) shows the per frame rendering time of the powerplant. The gap between CPU KDT and the 2-level structure widens even more, to more than an order of magnitude. This should not be interpreted as the 2-level structure being supreme; rather, the CPU KDT performs abnormally poorly in this scene. The KD-tree is not well constructed and contains leaf nodes with thousands of triangles, which results in poor traversal performance. Although the level 1 KD-tree of the 2-level structure is built from the same KD-tree used in CPU KDT, it does not suffer from the same problem, because the level 1 KD-tree takes only a few nodes near the root of the original KD-tree.

Figure 4.13(b) is the same graph as figure 4.13(a) with only the 2-level structures, which makes the two variants easier to compare. It is hard to say which 2-level renderer is faster in this scene; in some frames one is faster, and in other frames the opposite. On average, 'batched' is slightly faster than 'mixed', but not by much. This is because during the walkthrough, the camera is inside the building most of the time and only a few nodes are visible, which reduces the rendering of a complex scene to a simple case. See figure 4.14(a) for the number of times the GPU BVH is called during rendering. There is not much difference between the two curves compared with the other scenes. Only in the first 30 frames can we see an obvious gap. The numbers there are also relatively high, since in these frames the camera still remains outside the building and more of the nodes are involved, where 'mixed' can take advantage, although this is not much reflected in the rendering time. When the camera enters the building, the number of GPU calls drops quickly as the



Figure 4.14: Powerplant: (a) number of times GPU BVH is called per frame (b)total number of rays generated per frame


Figure 4.15: Powerplant: (a) percentage of active textures (b) percentage of incre-mental textures

number of visible nodes is also reduced. Figure 4.14(b) counts the number of rays generated at each frame, including camera rays and secondary rays. The two curves are identical, as we expected; otherwise the images produced would not be the same, which would mean something had gone wrong. There are a couple of valleys, which result from the camera looking in directions where more eye rays shoot into the background. This is also reflected in the rendering time, where a drop in the number of rays maps to a drop in the rendering time.

Figure 4.15(a) plots the ratio of the total size of textures active in a frame to the total texture size. This gives a clue to the potential amount of data transfer at each frame.

4.3 Conclusions

In this chapter, we performed several experiments with different scenes and draw the following conclusions:

• The 2-level structure scales smoothly on complex scenes


• The upper bound on the input scene size is limited by the amount of memory allocated by the GPU driver for texture memory mapping

• The mixed renderer of the 2-level structure is slightly better on rendering time

• The mixed renderer makes about 50% fewer accesses to the GPU shader than the batched approach

• The 2-level structure reduces texture usage in situations such as inside the powerplant building

• The 2-level structure with a larger node size renders faster than one with a smaller node size


Chapter 5

Conclusions and future work

5.1 Conclusions

We present a two-level acceleration structure that utilizes the computational power of the GPU for ray tracing complex scenes. Although we do not compare our method with an advanced CPU implementation of a ray tracing system, the 2-level algorithm achieves a 2.5 to 5 times speedup over our own CPU KD-tree implementation. We design several experiments to test our system on scenes of different degrees of complexity and with different materials. We also investigate how to choose a good threshold when building the 2-level structure.

5.2 Future work

In our implementation, we leave the memory management between main memory and texture memory to the driver, and it works quite well except for the limit on the pre-allocated memory size. An alternative is to allocate large blocks in texture memory and use the glTexSubImage function to update sub-blocks. With this technique we can handle memory management explicitly and do things the driver cannot, such as prefetching textures.
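For example, if the level 2 textures were packed as fixed-size tiles in one large atlas, updating one tile would only need the sub-region arithmetic below followed by a single glTexSubImage2D call. This is a sketch of the idea under that assumed layout, not our implementation:

```cpp
#include <cassert>

struct SubRegion { int xoffset, yoffset, width, height; };

// Sub-region of tile i in an atlas that is atlasW texels wide and packs
// blockW x blockH tiles row by row; these four values are exactly the
// xoffset/yoffset/width/height arguments of glTexSubImage2D.
SubRegion tileRegion(int i, int blockW, int blockH, int atlasW) {
    int perRow = atlasW / blockW;                  // tiles per atlas row
    return { (i % perRow) * blockW, (i / perRow) * blockH, blockW, blockH };
}
```

The update itself would then be glTexSubImage2D(GL_TEXTURE_2D, 0, r.xoffset, r.yoffset, r.width, r.height, format, type, pixels), which rewrites only that tile and leaves the rest of the atlas untouched.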

We implement only the ray-scene intersection on the GPU, while keeping the other tasks on the CPU. According to our profiling, the workload of the CPU is higher than that of the GPU. We might obtain improvements by moving some of the computation from the CPU to the GPU; for example, shading and the computation of reflection and refraction are good candidates.

Also, in this work we demonstrate only a simple illumination model. We would like to extend it to more advanced global illumination, such as Monte Carlo ray tracing or photon mapping, for better and more practical image quality.


References

[AK87] James Arvo and David Kirk. Fast ray tracing by ray classification.

SIGGRAPH Comput. Graph., 21(4):55–64, 1987.

[App68] Arthur Appel. Some techniques for shading machine renderings of

solids. In AFIPS 1968 Spring Joint Computer Conf., volume 32, pages

37–45, 1968.

[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fata-

halian, Mike Houston, and Pat Hanrahan. Brook for gpus: stream

computing on graphics hardware. ACM Trans. Graph., 23(3):777–786,

2004.

[CHCH06] Nathan A. Carr, Jared Hoberock, Keenan Crane, and John C. Hart.

Fast gpu ray tracing of dynamic meshes using geometry images. In

Proceedings of Graphics Interface 2006, 2006.

[CHH02] Nathan A. Carr, Jesse D. Hall, and John C. Hart. The ray engine. In

HWWS ’02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS

conference on Graphics hardware, pages 37–46, Aire-la-Ville,

Switzerland, Switzerland, 2002. Eurographics Association.

[Chr05] Martin Christen. Implementing ray tracing on gpu. Master’s thesis,

University of Applied Sciences Basel, Switzerland, 2005.

[DWS05] Andreas Dietrich, Ingo Wald, and Philipp Slusallek. Large-Scale CAD

Model Visualization on a Scalable Shared-Memory Architecture. In

Gunther Greiner, Joachim Hornegger, Heinrich Niemann, and Marc

Stamminger, editors, Proceedings of 10th International Fall Workshop

- Vision, Modeling, and Visualization (VMV) 2005, pages 303–310, Er-

langen, Germany, November 2005. Akademische Verlagsgesellschaft

Aka.


[FSH04] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the effi-

ciency of gpu algorithms for matrix-matrix multiplication. In HWWS

’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS confer-

ence on Graphics hardware, pages 133–137, New York, NY, USA,

2004. ACM Press.

[FvDFH90] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.

Hughes. Computer graphics: principles and practice (2nd ed.).

Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA,

1990.

[GPG05] GPGPU.org. http://www.gpgpu.org, 2005.

[GRH+05] Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David

Tuft, and Dinesh Manocha. A cache-efficient sorting algorithm for

database and data mining computations using graphics processors.

Technical report, UNC, 2005.

[GS87] Jeffrey Goldsmith and John Salmon. Automatic creation of object hi-

erarchies for ray tracing. IEEE Computer Graphics and Applications,

7:14–20, 1987.

[KL05] Filip Karlsson and Carl Johan Ljungstedt. Ray tracing fully im-

plemented on programmable graphics hardware. Master’s thesis,

Chalmers University of Technology, 2005.

[Lea00] Doug Lea. A memory allocator, 2000.

http://g.oswego.edu/dl/html/malloc.html.

[Mic05] Microsoft. http://www.microsoft.com/whdc/system/platform/server/pae/paemem.mspx,

2005.

[NVI05] NVIDIA. http://www.nvidia.com/page/gelato.html, 2005.

[OLG+05] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris,

Jens Kruger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of

general-purpose computation on graphics hardware. In Eurographics

2005, State of the Art Reports, pages 21–51, aug 2005.

[MT97] Tomas Möller and Ben Trumbore. Fast, minimum storage ray-triangle intersection. J. Graph. Tools, 2(1):21–28, 1997.


[PBMH02] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray

tracing on programmable graphics hardware. ACM Transactions on

Graphics, 21(3):703–712, July 2002. ISSN 0730-0301 (Proceedings

of ACM SIGGRAPH 2002).

[PH04] Matt Pharr and Greg Humphreys. Physically Based Rendering. Mor-

gan Kauffman, 2004.

[PKGH97] Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Render-

ing complex scenes with memory-coherent ray tracing. In SIGGRAPH

’97: Proceedings of the 24th annual conference on Computer graph-

ics and interactive techniques, pages 101–108, New York, NY, USA,

1997. ACM Press/Addison-Wesley Publishing Co.

[Pur04] Timothy John Purcell. Ray tracing on a stream processor. PhD thesis,

2004. Adviser-Patrick M. Hanrahan.

[Sta06] Stanford. The stanford 3d scanning repository, 2006.

http://graphics.stanford.edu/data/3Dscanrep/.

[TF05] Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu raytracer. In Proceedings of Graphics Hardware, 2005.

[TS05] Niels Thrane and Lars Ole Simonsen. A comparison of acceleration

structures for gpu assisted ray tracing. Master’s thesis, University of

Aarhus, 2005.

[UNC01] UNC. The walkthru project, 2001. http://www.cs.unc.edu/~walk/.

[WDS04] Ingo Wald, Andreas Dietrich, and Philipp Slusallek. An Interactive

Out-of-Core Rendering Framework for Visualizing Massively Com-

plex Models. In Proceedings of the Eurographics Symposium on Ren-

dering, 2004. (to appear).

[Whi80] Turner Whitted. An improved illumination model for shaded display.

Commun. ACM, 23(6):343–349, 1980.

[WIK+06] Ingo Wald, Thiago Ize, Andrew Kensler, Aaron Knoll, and Steven G

Parker. Ray Tracing Animated Scenes using Coherent Grid Traversal.

ACM SIGGRAPH 2006, 2006.


[WPS+03] Ingo Wald, Timothy J. Purcell, Joerg Schmittler, Carsten Benthin, and

Philipp Slusallek. Realtime Ray Tracing and its use for Interactive

Global Illumination. In Eurographics State of the Art Reports, 2003.

[WSBW01] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner.

Interactive rendering with coherent ray tracing. In A. Chalmers and T.-

M. Rhyne, editors, EG 2001 Proceedings, volume 20(3), pages 153–

164. Blackwell Publishing, 2001.

[WSE04] D. Weiskopf, T. Schafhitzel, and T. Ertl. GPU-Based Nonlinear Ray

Tracing. Computer Graphics Forum (Eurographics 2004), 23(3):625–

633, 2004.

[WSS05] Sven Woop, Jörg Schmittler, and Philipp Slusallek. Rpu: a pro-

grammable ray processing unit for realtime ray tracing. ACM Trans.

Graph., 24(3):434–444, 2005.
