Interactive Ray Tracing: From bad joke to old news David Luebke University of Virginia


Besides Parallelization

Besides parallelizing the algorithm, what else can we do to accelerate ray tracing?
– Amortize the cost of shooting rays
– Use ray tracing selectively

Amortize the cost of rays

The Render Cache
– Work by Bruce Walter, currently at Cornell; also by Reinhard et al. (Utah)
– Basic idea:
    – Cache ray “hits” as shaded 3D points
    – Reproject points for new viewpoint
        – Now many pixels already have color!
    – Shoot rays for newly uncovered pixels
    – Shoot rays to update stale pixels
– Show demo(?)
– Web page w/ good examples, source: http://www.graphics.cornell.edu/research/interactive/rendercache/
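The reprojection step can be sketched in a few lines. This is only an illustration, not the Render Cache implementation: the camera model (an eye looking down -z with no rotation), the `focal` parameter, and the names `reproject` and `pixels_needing_rays` are all assumptions made for the example, and aging of stale points is not modeled.

```python
import math

def reproject(cache, eye, width, height, focal):
    """Project cached shaded points into the image for a new eye position.

    Assumed camera (hypothetical): at `eye`, looking down -z, no rotation,
    focal length `focal` in pixels. Returns rows of colors; None = empty.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for (px, py, pz), color in cache:
        x, y, z = px - eye[0], py - eye[1], pz - eye[2]
        if z >= 0:                      # at or behind the camera plane
            continue
        col = math.floor(width / 2 + focal * x / -z)
        row = math.floor(height / 2 - focal * y / -z)
        if 0 <= col < width and 0 <= row < height and -z < depth[row][col]:
            depth[row][col] = -z        # nearest cached point wins the pixel
            image[row][col] = color
    return image

def pixels_needing_rays(image):
    """Pixels left empty after reprojection; these get fresh rays first."""
    return [(r, c) for r, line in enumerate(image)
            for c, color in enumerate(line) if color is None]

# Two cached hits that land on the same pixel after the eye moves.
cache = [((0.0, 0.0, -1.0), "red"), ((0.0, 0.0, -2.0), "blue")]
image = reproject(cache, eye=(0.5, 0.0, 0.0), width=4, height=4, focal=2.0)
```

Here both cached points fall on one pixel and the nearer one is kept; every other pixel is reported by `pixels_needing_rays` as a candidate for a new ray.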

Amortize the cost: Tole et al.

Tole et al. extend these ideas to path tracing
– Cache ray hits in object space as Gouraud-shaded vertices
– Designed for very slow sampling schemes (full bidirectional path tracing)
    – Pick pixels to sample carefully
    – Use OpenGL hardware to display current solution as it is gradually updated
– Show the Tole video

Amortize The Cost: Frameless Rendering

Eliminate frames altogether
– If you can render 1/3 of the pixels in a vertical retrace period:
    – Double buffering displays a new frame after 3 vertical refreshes
    – Single buffering causes horizontal tearing artifacts
    – Frameless rendering updates pixels as soon as they are computed…
    – …but computes them in a randomized order to avoid coherent tearing artifacts
– Show the Utah video
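A minimal sketch of the randomized update order, assuming a hypothetical `shade(pixel)` callback and a fixed budget of pixels per refresh (neither appears in the slides):

```python
import random

def frameless_schedule(width, height, pixels_per_refresh, seed=0):
    """Split all pixels, in random order, into one batch per display refresh."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    random.Random(seed).shuffle(pixels)   # random order avoids coherent tearing
    return [pixels[i:i + pixels_per_refresh]
            for i in range(0, len(pixels), pixels_per_refresh)]

def render_frameless(width, height, pixels_per_refresh, shade):
    """Write each pixel into the visible buffer as soon as it is computed."""
    framebuffer = {}
    for batch in frameless_schedule(width, height, pixels_per_refresh):
        for pixel in batch:
            framebuffer[pixel] = shade(pixel)   # no back buffer, no swap
    return framebuffer

# A 6x4 image with a budget of 1/3 of the pixels per refresh.
batches = frameless_schedule(6, 4, pixels_per_refresh=8)
```

With a budget of one third of the image per refresh, every pixel is refreshed once every three refreshes, but no contiguous region updates as a block.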

Shoot Rays Selectively

Use ray tracing selectively to augment a traditional interactive pipeline
– Ex: use rays for shadows only
– Ex: use ray tracing to calculate corrective textures where necessary (e.g., shiny objects)

Summary So Far

Interactive ray tracing is a reality
– Parker et al. 1999 (SGI supercomputer)
– Wald et al. 2001 (Cluster of PCs)

Why IRT?
– Complex/realistic shading
– Big data
– Decoupled sampling

Summary So Far

How IRT?
– Ray tracing is embarrassingly parallel
    – Field of VAX/Cray joke
    – But memory coherence is a problem
– Brute force: shared-memory supercomputer
– Slightly smarter: distributed cluster
    – Fan-in, latency, model sharing are issues
– Amortize cost
    – Cache/reuse samples, frameless rendering
– Use selectively
    – Shadows only, corrective textures

Moving to Hardware

Next topic: moving ray tracing to the GPU
– Why do this?

Two papers:
– Ray Engine (Carr et al., U. Illinois)
– Ray Tracing On Programmable Graphics Hardware (Purcell et al., Stanford)
    – I stole most of the following slides from this talk

Related Work: The Ray Engine

Nathan Carr, Jesse Hall, John Hart (University of Illinois)

Basic idea: use the fragment hardware!
– Ray intersection is a crossbar:
    – Intersect a bunch of rays with a bunch of triangles, keep closest hit on each ray
– Triangle rasterization is a crossbar:
    – Intersect a bunch of pixels with a bunch of triangles, keep closest hit at each pixel

(a) Naive: intersect all rays w/ all polys
(b) Acceleration structures break crossbar grid up into a sparse block structure, but blocks are still dense crossbars
(c) Result: a series of points on the crossbar, max 1 per ray (closest wins)

(a) Each pixel potentially intersected with each poly
(b) Modern hardware

Ray Engine

Map ray casting crossbar to rasterization crossbar
– Distribute rays across pixels
    – Ray-origins texture
    – Ray-directions texture
– Broadcast a stream of triangles as the vertex data interpolated across screen-filling quads
    – Quad color: triangle id
    – Quad multi-texture coords: triangle vertices a,b, normal n, edges ab, ac, bc
– Output: color = triangle id, alpha = intersect, z = t value
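The crossbar mapping can be illustrated on the CPU. The sketch below is not the Ray Engine code: `ray_engine_pass` and the `intersect(ray, tri)` callback (returning a hit distance or `None`) are hypothetical names, and the "triangles" in the usage example are stand-in lookup tables rather than geometry.

```python
import math

def ray_engine_pass(rays, triangles, intersect):
    """Brute-force crossbar: every triangle is tested against every ray.

    `intersect(ray, tri)` is a caller-supplied test (hypothetical interface)
    returning the hit distance t, or None for a miss.
    """
    # One output record per pixel: color = triangle id, alpha = hit flag, z = t.
    out = [{"color": None, "alpha": 0, "z": math.inf} for _ in rays]
    for tri_id, tri in enumerate(triangles):       # one screen-filling quad per triangle
        for pixel, ray in enumerate(rays):         # fragment program runs at each pixel
            t = intersect(ray, tri)
            if t is not None and t < out[pixel]["z"]:   # depth test keeps closest hit
                out[pixel] = {"color": tri_id, "alpha": 1, "z": t}
    return out

# Stand-in data: each "triangle" is a table of ray -> hit distance.
rays = [0, 1, 2]
triangles = [{0: 5.0, 1: 2.0}, {1: 1.0}]
hits = ray_engine_pass(rays, triangles, lambda ray, tri: tri.get(ray))
```

The inner comparison plays the role of the z-buffer: at most one record survives per ray, and it belongs to the closest triangle.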

Ray Engine

Bulk of ray tracing computation is intersection

CPU handles bounding volume traversal, recursion, etc.

GPU does ray intersection on bundles of rays and triangles handed to it by CPU
– NV_FENCE to keep both humming

Sometimes the CPU should intersect rays!

Why Ray Tracing?

Global illumination

Good shadows!
– Doom 3 will be using shadow volumes
    – Expensive!
– Shadow maps are hard to use and prone to artifacts

Efficient ray tracing based shadows could be the next killer feature for GPUs

Doom 3 [id Software]

Why Ray Tracing?

Output-sensitive algorithm
– Sublinear in depth complexity

Selective sampling
– Frameless rendering [Bishop et al. 1994]
– Render Cache [Walter et al. 1995]
– Shading Cache [Tole et al. 2002]

Interactive on clusters of PCs [Wald et al. 2001] and supercomputers [Parker et al. 1999]

Power Plant [Wald et al. 2001]

Beyond Moore’s Law

Yearly growth well above Moore’s Law (1.5)

NVIDIA Historicals (courtesy of Kurt Akeley)

Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128           5  -         100  -
1H98    Riva ZX            5  1.0       100  1.0
2H98    Riva TNT           5  1.0       180  3.2
1H99    Riva TNT2          8  1.0       333  3.4
2H99    GeForce           15  3.5       480  2.1
1H00    GeForce2 GTS      25  2.8       666  1.9
2H00    GeForce2 Ultra    31  1.5      1000  2.3
1H01    GeForce3          40  1.7      3200  10.2
1H02    GeForce4          65  1.6      4800  1.5
                              1.8            2.4
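The slide does not say how the "Yr rate" columns were computed. Assuming they are compound annualized growth factors between consecutive products (an assumption, though it reproduces most of the listed values), the arithmetic looks like this; `yearly_rate` is a name chosen for the example.

```python
def yearly_rate(old, new, years):
    """Annualized growth factor, assuming compound growth over `years`."""
    return (new / old) ** (1.0 / years)

# Consecutive half-year steps in the MF/s column.
tnt = yearly_rate(100, 180, 0.5)         # Riva ZX -> Riva TNT
geforce3 = yearly_rate(1000, 3200, 0.5)  # GeForce2 Ultra -> GeForce3

# Whole table, 2H97 to 1H02 (4.5 years).
overall_mt = yearly_rate(5, 65, 4.5)
overall_mf = yearly_rate(100, 4800, 4.5)
```

Under this reading the bottom-row values are the growth factors over the full 4.5 years, both above the 1.5 per year that the slide attributes to Moore's Law.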

Graphics Pipeline

[Figure: Traditional Pipeline (stage labels: Application, Command, Geometry, Rasterization, Texture, Fragment, Display) beside the Programmable Fragment Pipeline (Fragment Input, Fragment Program with Textures and Registers, Fragment Output)]

Contributions

Map complete ray tracer onto GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline

Abstract programmable fragment processor as a stream processor

Map ray tracing to streaming computation

Show that streaming GPU-based ray tracer is competitive with CPU-based ray tracer

Assumptions

Static scenes
Triangle primitives only
Uniform grid acceleration structure

Stream Programming Model

Programmable fragment processor is essentially a stream processor

Kernels and streams
– Stream is a set of data records
– Kernels operate on records
– Streams connect kernels together
– Kernels can read global memory

[Figure: a kernel reads an input record stream and globals, and writes an output record stream]
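As a rough illustration of the kernel/stream vocabulary (not of any GPU API), streams can be modeled as lists of records and kernels as functions applied to each record independently. The names `run_kernel`, `scale`, and `offset` are made up for this sketch.

```python
def run_kernel(kernel, stream, global_memory):
    """Apply a kernel to every record of the input stream independently.

    Kernels may read `global_memory` but only write their own output record.
    """
    return [kernel(record, global_memory) for record in stream]

def scale(record, g):
    return record * g["factor"]

def offset(record, g):
    return record + g["bias"]

global_memory = {"factor": 2, "bias": 1}   # read-only data shared by all records
input_stream = [1, 2, 3]

# The output stream of one kernel is the input stream of the next.
scaled = run_kernel(scale, input_stream, global_memory)
result = run_kernel(offset, scaled, global_memory)
```

Because no record depends on another, each kernel invocation could run in parallel, which is the property the fragment processor exploits.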

Streaming Ray Tracer (Simplified)

[Figure: kernel pipeline]
Generate Eye Rays (globals: Camera) → rays
Traverse Acceleration Structure (globals: Grid) → ray-voxel pairs
Intersect Triangles (globals: Triangles) → hits
Shade Hits and Generate Shading Rays (globals: Materials) → pixels

Eye Ray Generator

[Figure: camera, screen, and scene; the Generate Eye Rays kernel reads Camera and outputs rays]

Traverser

[Figure: camera, screen, and scene; the Traverse Acceleration Structure kernel reads Grid, takes rays as input, and outputs ray-voxel pairs]

Intersector

[Figure: camera, screen, and scene; the Intersect Triangles kernel reads Triangles, takes ray-voxel pairs as input, and outputs hits]

Intersection Code

float4 Intersect( float3 ro, float3 rd, int listpos, float4 h ) {
    float tri_id = texture( listpos, trilist );
    float3 v0 = texture( tri_id, v0 );
    float3 v1 = texture( tri_id, v1 );
    float3 v2 = texture( tri_id, v2 );
    float3 edge1 = v1 - v0;
    float3 edge2 = v2 - v0;
    float3 pvec = Cross( rd, edge2 );
    float det = Dot( edge1, pvec );
    float inv_det = 1/det;
    float3 tvec = ro - v0;
    float u = Dot( tvec, pvec ) * inv_det;
    float3 qvec = Cross( tvec, edge1 );
    float v = Dot( rd, qvec ) * inv_det;
    float t = Dot( edge2, qvec ) * inv_det;
    // determine if valid hit by checking
    // u,v > 0 and u+v < 1
    // set hit data into h based on valid hit
    return float4( {t,u,v,id} );
}
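The kernel above leaves the validity test as comments. A runnable CPU version of the same arithmetic might look like the following sketch. The vertex data is passed in directly instead of being fetched from textures, and the guard against a zero determinant, the `t > 0` check, and the "closer than the current best hit" check are assumptions added for the example.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(ro, rd, v0, v1, v2, tri_id, h, eps=1e-9):
    """Return the updated hit record (t, u, v, id); `h` is the best hit so far."""
    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    pvec = cross(rd, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:              # ray parallel to the triangle plane (added guard)
        return h
    inv_det = 1.0 / det
    tvec = sub(ro, v0)
    u = dot(tvec, pvec) * inv_det
    qvec = cross(tvec, edge1)
    v = dot(rd, qvec) * inv_det
    t = dot(edge2, qvec) * inv_det
    # Valid hit: u,v > 0 and u+v < 1 (from the slide), plus t in front of the
    # origin and closer than the previous hit (assumed for this sketch).
    valid = u > 0 and v > 0 and u + v < 1 and 0 < t < h[0]
    return (t, u, v, tri_id) if valid else h

no_hit = (math.inf, 0.0, 0.0, -1)
tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri, tri_id=7, h=no_hit)
miss = intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri, tri_id=7, h=no_hit)
```

The first ray points straight down at the triangle and hits at distance 1; the second passes outside it and leaves the incoming hit record unchanged.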


Ray Tracing on a GPU

Store scene data in texture memory
– Dependent texturing is key

Multipass rendering for flow control
– Branching would eliminate this need

Scene in Texture Memory

[Figure: scene data layout in textures]
Uniform Grid (3D Luminance Texture): vox0 vox1 vox2 vox3 vox4 vox5 … voxM, holding values 0 4 11 38 … 564
Triangle List (1D Luminance Texture): entries grouped by voxel (vox0, vox2, …), holding values 0 3 1 3 7 21 216 …
Triangles (3x 1D RGB Textures): v0, v1, v2, each holding xyz for tri0 tri1 tri2 tri3 tri4 tri5 … triN
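One way to read the layout is as a chain of dependent lookups: the grid entry gives a position in the triangle list, the list gives triangle ids, and the ids index the three vertex textures. The sketch below models textures as Python lists. The tiny scene and the `-1` end-of-list marker are assumptions for the example, since the slide does not show how a voxel's list is terminated.

```python
# Hypothetical two-voxel scene; textures are modeled as plain lists.
grid = [0, 3]                     # per voxel: start position in tri_list
tri_list = [0, 2, -1, 1, -1]      # triangle ids per voxel, -1 ends a list (assumed)
v0 = [(0, 0, 0), (1, 0, 0), (0, 0, 1)]   # first vertex of tri0..tri2
v1 = [(1, 0, 0), (1, 1, 0), (1, 0, 1)]   # second vertex
v2 = [(0, 1, 0), (0, 1, 0), (0, 1, 1)]   # third vertex

def triangles_in_voxel(voxel):
    """Follow grid -> triangle list -> vertex textures for one voxel."""
    pos = grid[voxel]                      # lookup 1: grid texture
    found = []
    while tri_list[pos] != -1:
        tri_id = tri_list[pos]             # lookup 2: depends on lookup 1
        found.append((tri_id, v0[tri_id], v1[tri_id], v2[tri_id]))  # depends on lookup 2
        pos += 1
    return found

ids_voxel0 = [tri[0] for tri in triangles_in_voxel(0)]
ids_voxel1 = [tri[0] for tri in triangles_in_voxel(1)]
```

Each fetch uses the result of the previous one as its address, which is why dependent texturing is the enabling feature.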

Texture As Memory

Currently limited in size: 128 MB
– About 3M triangles @ 36 bytes per triangle

Uniform grid
– Maps naturally to 3D textures
– Requires 4 levels of dependent texture lookups

1D textures limited in length
– Emulate larger address space with 2D textures

Want integer addressing, not floating point
– Efficient access without interpolation
– Integer arithmetic
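Two pieces of arithmetic sit behind these bullets. The sketch assumes that 36 bytes per triangle means three vertices of three 4-byte floats, and that a long 1D array is wrapped row by row into a 2D texture of some fixed width; the width of 2048 and the function names are chosen only for illustration.

```python
BYTES_PER_TRIANGLE = 3 * 3 * 4   # 3 vertices x 3 floats x 4 bytes = 36 (assumed layout)

def max_triangles(memory_bytes):
    """Upper bound on triangles if texture memory held vertex data only."""
    return memory_bytes // BYTES_PER_TRIANGLE

def to_2d(index, width):
    """Map a 1D element index to (x, y) in a 2D texture `width` texels wide."""
    return (index % width, index // width)

def from_2d(x, y, width):
    """Inverse of to_2d."""
    return y * width + x

capacity = max_triangles(128 * 2**20)
coords = to_2d(5000, 2048)
```

The bound comes out a little above 3.7 million triangles, in line with the slide's rough figure once room for the grid and triangle list is set aside. The address mapping needs only integer modulo and division, which is one reason integer arithmetic is on the wish list.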

Streaming Flow Control

[Figure: mapping the graphics pipeline onto the stream model]
– Application and Geometry Stages
– Rasterization
– Fragments (Input Stream)
– Fragment Program (Kernel), reading Texture (Globals)
– Fragment Program Output (Output Stream)

Multiple Rendering Passes

Pass 1: Generate Eye Rays
– Draw quad
– Rasterize
– Run fragment program
– Save to offscreen buffer (rays)

Pass 2: Traverse
– Draw quad
– Rasterize
– Restore (rays)
– Run fragment program
– Save to offscreen buffer (ray-voxel pairs)

Streaming Ray Tracer

[Figure: kernel pipeline: Generate Eye Rays, Traverse Acceleration Structure, Intersect Triangles, Shade Hits and Generate Shading Rays; globals: Camera, Grid, Triangles, Materials]

Multipass Optimization

Reduce the number of passes
– Choose to traverse or intersect based on work to be done for each type of pass
    – Connection Machine ray tracer [Delany 1988]
    – Intersect once 20% of active rays need intersecting

Make each pass less expensive
– Most passes involve only a few rays
– Early fragment kill based on fragment mask
    – Saves compute and bandwidth
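The 20% rule can be written as a small scheduling function. This is a sketch of the decision only: the per-ray state labels (`"traverse"`, `"intersect"`, `"done"`) and the function name are hypothetical, and the threshold is taken from the slide.

```python
def choose_pass(ray_states, threshold=0.20):
    """Pick the next pass type from the per-ray states.

    Each state is "traverse", "intersect", or "done" (labels assumed here).
    Returns None when no ray is active.
    """
    active = [s for s in ray_states if s != "done"]
    if not active:
        return None
    waiting = sum(1 for s in active if s == "intersect")
    # Run an intersection pass once enough active rays are waiting for one.
    return "intersect" if waiting / len(active) >= threshold else "traverse"

few_waiting = choose_pass(["traverse"] * 9 + ["intersect"])
enough_waiting = choose_pass(["traverse"] * 4 + ["intersect"] + ["done"] * 5)
finished = choose_pass(["done"] * 3)
```

Rays in the state that is not served by the chosen pass are the ones an early fragment kill would skip.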

Scene Statistics

v – average number of voxels a ray pierces
t – average triangles a ray intersects
s – average number of shading evaluations per ray
P – number of rendering passes

v    93.93   130.7   81.29   26.11   14.41
t    13.88   47.90   34.07   40.46    2.52
s     0.82    0.97    0.96    1.00    0.44
P     1085    2835    1999    1198    2443

C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask
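The cost formula is straightforward to evaluate once the constants are known. The slide does not define R or the C terms; the sketch below reads R as the number of rays and Cr, Cv, Ct, Cs, Cmask as per-ray costs of ray generation, one voxel step, one triangle test, one shading evaluation, and one masked-out pass. That reading, and the unit costs in the example, are assumptions.

```python
def total_cost(R, v, t, s, P, Cr, Cv, Ct, Cs, Cmask):
    """C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask, as written on the slide."""
    useful_work = R * (Cr + v * Cv + t * Ct + s * Cs)
    masking_overhead = R * P * Cmask     # every ray is touched in every pass
    return useful_work + masking_overhead

# Made-up unit costs, only to show how the terms combine.
example = total_cost(R=10, v=2, t=3, s=1, P=4, Cr=1, Cv=1, Ct=1, Cs=1, Cmask=1)
```

The second term grows with the pass count P regardless of how much real work each ray does, which is why reducing passes and making masked fragments cheap both matter.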

Performance Estimates

Pentium III 800 MHz CPU implementation
– 20M intersections/s [Wald et al. 2001]

Simulated performance
– 2G instructions/s and 8GB/s bandwidth
– Instruction limited
    – 56M intersections/s
– Nearly bandwidth limited
    – 222M intersections/s

Streaming ray tracing is compute limited!
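The slide does not show how the two simulated rates were derived. As a sanity check under stated assumptions only: if each intersection had to move one 36-byte triangle (the per-triangle size quoted on the "Texture As Memory" slide), 8GB/s would allow about 222M intersections/s, and 56M intersections/s at 2G instructions/s implies roughly 36 instructions per intersection. This is an observed consistency, not the authors' stated derivation.

```python
def rate_limit(resource_per_second, resource_per_intersection):
    """Intersections/s if one resource were the only bottleneck."""
    return resource_per_second / resource_per_intersection

# Assumed: 36 bytes of traffic per intersection.
bandwidth_bound = rate_limit(8e9, 36)

# Implied by the slide's instruction-limited figure.
instructions_per_intersection = 2e9 / 56e6

# The achieved rate is the smaller of the two bounds.
achieved = min(bandwidth_bound, rate_limit(2e9, instructions_per_intersection))
```

Since the instruction bound is about four times lower than the bandwidth bound, instructions run out first, matching the slide's conclusion that the streaming ray tracer is compute limited.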

Demo Analysis

Prototype Performance (ATI R300)
– 500K to 1.4M raycasts/s
– 94M intersections/s
– Only three weeks of coding effort

ATI Radeon 8500 GPU (R200)
– 114M intersections/s [Carr et al. 2002]
– Fixed point operations
– Only ray-triangle intersection kernel

Summary

Programmable GPU is a stream processor

Ray tracing can be a streaming computation

Complete ray tracer can map onto the GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline

Streaming GPU-based ray tracer is competitive with CPU-based ray tracer

Architectural Results

Fragment mask proposed for efficient multipass
– Stream buffer eliminates this need

Stream data should not go through standard texture cache

Triangles cache well for primary rays, secondary less so

Branching architecture
– More cache coherence than the multipass architecture for scene data
– Reduces memory bandwidth for stream data
– But has its own costs…

Final Thoughts

Ray tracing maps into current GPU architecture
– Does not require fundamentally different hardware
– Hybrid algorithms possible

What else can the GPU do?
– Given you can do ray tracing, you can do anything
– Fluid flow, molecular dynamics, etc.

GPU performance increase will continue to outpace CPU performance increase
