Besides Parallelization
Besides parallelizing the algorithm, what else can we do to accelerate ray tracing?
– Amortize the cost of shooting rays
– Use ray tracing selectively
Amortize the cost of rays
The Render Cache
– Work by Bruce Walter, currently at Cornell; also by Reinhard et al. (Utah)
– Basic idea:
    – Cache ray “hits” as shaded 3D points
    – Reproject points for new viewpoint
    – Now many pixels already have color!
    – Shoot rays for newly uncovered pixels
    – Shoot rays to update stale pixels
– Show demo(?)
– Web page w/ good examples, source: http://www.graphics.cornell.edu/research/interactive/rendercache/
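The reprojection step can be sketched in a few lines. This is a minimal illustration, not the Render Cache implementation: the pinhole camera looking down -z, the `focal` parameter, and the names `project` and `reproject` are all assumptions made for the example.

```python
import math

def project(point, eye, focal, width, height):
    """Project a world-space point for an assumed pinhole camera at `eye` looking down -z."""
    x, y, z = (point[i] - eye[i] for i in range(3))
    if z >= 0:                      # point is behind the camera
        return None
    depth = -z
    px = math.floor(width / 2 + focal * x / depth)
    py = math.floor(height / 2 - focal * y / depth)
    if 0 <= px < width and 0 <= py < height:
        return px, py, depth
    return None

def reproject(cache, eye, focal, width, height):
    """Reproject cached shaded points; return the reused colors and the uncovered pixels."""
    nearest = {}                    # pixel -> (depth, color) of the closest cached point
    for point, color in cache:
        hit = project(point, eye, focal, width, height)
        if hit is None:
            continue
        px, py, depth = hit
        if (px, py) not in nearest or depth < nearest[(px, py)][0]:
            nearest[(px, py)] = (depth, color)
    image = {pixel: color for pixel, (depth, color) in nearest.items()}
    holes = [(x, y) for y in range(height) for x in range(width) if (x, y) not in image]
    return image, holes

# Three cached ray hits; two of them land on the same pixel, and the nearer one wins.
cache = [((0.0, 0.0, -2.0), "red"), ((0.0, 0.0, -5.0), "blue"), ((1.0, 0.0, -2.0), "green")]
image, holes = reproject(cache, eye=(0.0, 0.0, 0.0), focal=2.0, width=4, height=4)
```

Only the pixels listed in `holes` would need fresh rays; a real system would also age the cached points so that stale pixels get resampled.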
Amortize the cost: Tole et al.
Tole et al. extend these ideas to path tracing
– Cache ray hits in object space as Gouraud-shaded vertices
– Designed for very slow sampling schemes (full bidirectional path tracing)
    – Pick pixels to sample carefully
    – Use OpenGL hardware to display current solution as it is gradually updated
– Show the Tole video
Amortize The Cost: Frameless Rendering
Eliminate frames altogether
– If you can render 1/3 of the pixels in a vertical retrace period:
    – Double buffering displays a new frame after 3 vertical refreshes
    – Single buffering causes horizontal tearing artifacts
    – Frameless rendering updates pixels as soon as they are computed…
    – …but computes them in a randomized order to avoid coherent tearing artifacts
– Show the Utah video
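As a rough illustration of the randomized update order, the sketch below shuffles the pixel list once and writes a fixed budget of pixels per refresh straight into the displayed buffer. The function name `frameless_updates`, the `shade` callback, and the fixed seed are assumptions for the example, not part of the original system.

```python
import random

def frameless_updates(width, height, pixels_per_refresh, shade, seed=0):
    """Yield one batch of (pixel, color) updates per display refresh, in random pixel order."""
    order = [(x, y) for y in range(height) for x in range(width)]
    random.Random(seed).shuffle(order)      # random order avoids coherent tearing edges
    for start in range(0, len(order), pixels_per_refresh):
        yield [(p, shade(p)) for p in order[start:start + pixels_per_refresh]]

# 24 pixels with a budget of 8 per refresh: one third of the image per retrace.
framebuffer = {}
batches = list(frameless_updates(6, 4, pixels_per_refresh=8, shade=lambda p: p[0] + p[1]))
for batch in batches:
    framebuffer.update(batch)               # pixels become visible as soon as they are computed
```

Every pixel is refreshed once per three refreshes, the same rate as double buffering in the slide's example, but no refresh ever shows a sharp boundary between old and new image regions.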
Shoot Rays Selectively
Use ray tracing selectively to augment a traditional interactive pipeline
– Ex: use rays for shadows only
– Ex: use ray tracing to calculate corrective textures where necessary (e.g., shiny objects)
Summary So Far
Interactive ray tracing is a reality
– Parker et al. 1999 (SGI supercomputer)
– Wald et al. 2001 (Cluster of PCs)
Why IRT?
– Complex/realistic shading
– Big data
– Decoupled sampling
Summary So Far
How IRT?
– Ray tracing is embarrassingly parallel
    – Field of VAX/Cray joke
    – But memory coherence is a problem
– Brute force: shared-memory supercomputer
– Slightly smarter: distributed cluster
    – Fan-in, latency, model sharing are issues
– Amortize cost
    – Cache/reuse samples, frameless rendering
– Use selectively
    – Shadows only, corrective textures
Moving to Hardware
Next topic: moving ray tracing to the GPU
– Why do this?
Two papers:
– Ray Engine (Carr et al., U. Illinois)
– Ray Tracing On Programmable Graphics Hardware (Purcell et al., Stanford)
    – I stole most of the following slides from this talk
Related Work: The Ray Engine
Nathan Carr, Jesse Hall, John Hart (University of Illinois)
Basic idea: use the fragment hardware!
– Ray intersection is a crossbar:
    – Intersect a bunch of rays with a bunch of triangles, keep closest hit on each ray
– Triangle rasterization is a crossbar:
    – Intersect a bunch of pixels with a bunch of triangles, keep closest hit at each pixel
(a) Naive: intersect all rays w/ all polys
(b) Acceleration structures break crossbar grid up into a sparse block structure, but blocks are still dense crossbars
(c) Result: a series of points on the crossbar, max 1 per ray (closest wins)
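The naive crossbar in (a) and the result in (c) can be sketched as an all-pairs loop that keeps the nearest hit per ray. To keep the example short, the primitive is a stand-in (an axis-aligned rectangle facing rays that travel along +z) rather than a triangle; `closest_hits` and `hit_rect` are hypothetical names.

```python
def closest_hits(rays, prims, intersect):
    """Naive crossbar: test every ray against every primitive, keep the nearest hit per ray."""
    result = []
    for ray in rays:
        best = None                                  # (t, primitive id) or None
        for prim_id, prim in enumerate(prims):
            t = intersect(ray, prim)
            if t is not None and (best is None or t < best[0]):
                best = (t, prim_id)                  # closest wins, like a z-buffer test
        result.append(best)
    return result

def hit_rect(ray, rect):
    """Stand-in primitive test: the ray starts at (x, y, 0) and travels along +z."""
    x, y = ray
    xmin, xmax, ymin, ymax, z = rect
    if z > 0 and xmin <= x <= xmax and ymin <= y <= ymax:
        return z                                     # hit distance t
    return None

rays = [(0.5, 0.5), (2.5, 0.5), (9.0, 9.0)]
rects = [(0, 3, 0, 1, 5.0), (0, 1, 0, 1, 2.0)]
hits = closest_hits(rays, rects, hit_rect)           # at most one entry per ray
```

With rays fixed to a pixel grid, the same loop is what rasterization with a depth buffer computes, which is the correspondence the slide points out.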
Ray Engine
Map ray casting crossbar to rasterization crossbar
– Distribute rays across pixels
    – Ray-origins texture
    – Ray-directions texture
– Broadcast a stream of triangles as the vertex data interpolated across screen-filling quads
    – Quad color: triangle id
    – Quad multi-texture coords: triangle vertices a, b, normal n, edges ab, ac, bc
– Output: color = triangle id, alpha = intersect, z = t value
Ray Engine
Bulk of ray tracing computation is intersection
CPU handles bounding volume traversal, recursion, etc.
GPU does ray-intersection on bundles of rays and triangles handed to it by CPU
– NV_FENCE to keep both humming
Sometimes the CPU should intersect rays!
Why Ray Tracing?
Global illumination
Good shadows!
– Doom 3 will be using shadow volumes
    – Expensive!
– Shadow maps are hard to use and prone to artifacts
Efficient ray-tracing-based shadows could be the next killer feature for GPUs
Doom 3 [id Software]
Why Ray Tracing?
Output-sensitive algorithm
– Sublinear in depth complexity
Selective sampling
– Frameless rendering [Bishop et al. 1994]
– Render Cache [Walter et al. 1999]
– Shading Cache [Tole et al. 2002]
Interactive on clusters of PCs [Wald et al. 2001] and supercomputers [Parker et al. 1999]
Power Plant [Wald et al. 2001]
Beyond Moore’s Law
Yearly growth well above Moore’s Law (1.5)
NVIDIA Historicals (courtesy of Kurt Akeley)
Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128           5     -      100     -
1H98    Riva ZX            5    1.0     100    1.0
2H98    Riva TNT           5    1.0     180    3.2
1H99    Riva TNT2          8    1.0     333    3.4
2H99    GeForce           15    3.5     480    2.1
1H00    GeForce2 GTS      25    2.8     666    1.9
2H00    GeForce2 Ultra    31    1.5    1000    2.3
1H01    GeForce3          40    1.7    3200   10.2
1H02    GeForce4          65    1.6    4800    1.5
                                1.8            2.4
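The 1.8 and 2.4 figures are consistent with a compound yearly growth rate taken between the first and last table rows. The check below is a sketch under that assumption; the helper name `yearly_rate` is hypothetical, and the slide does not say how its rates were derived.

```python
def yearly_rate(start, end, years):
    """Compound growth factor per year that takes `start` to `end` over `years` years."""
    return (end / start) ** (1.0 / years)

# 2H97 to 1H02 spans nine half-year seasons, i.e. 4.5 years.
years = 4.5
triangle_rate = yearly_rate(5, 65, years)        # MT/s column, Riva 128 to GeForce4
fragment_rate = yearly_rate(100, 4800, years)    # MF/s column, Riva 128 to GeForce4

# A single half-year step annualizes the same way, e.g. GeForce2 Ultra to GeForce3 (MF/s).
geforce3_rate = yearly_rate(1000, 3200, 0.5)

print(round(triangle_rate, 1), round(fragment_rate, 1), round(geforce3_rate, 1))
# prints: 1.8 2.4 10.2
```

Both overall rates sit above the 1.5 per year that the slide uses for Moore's Law.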
Graphics Pipeline
[Figure: two pipeline diagrams. Traditional Pipeline: Application, Command, Geometry, Rasterization, Texture, Fragment, Display. Programmable Fragment Pipeline: Fragment Input, Fragment Program (with Textures and Registers), Fragment Output.]
Contributions
Map complete ray tracer onto GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline
Abstract programmable fragment processor as a stream processor
Map ray tracing to streaming computation
Show that streaming GPU-based ray tracer is competitive with CPU-based ray tracer
Stream Programming Model
Programmable fragment processor is essentially a stream processor
Kernels and streams
– Stream is a set of data records
– Kernels operate on records
– Streams connect kernels together
– Kernels can read global memory
[Figure: a kernel reads an input record stream and globals and writes an output record stream]
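One way to picture the model is with Python generators standing in for kernels: each kernel consumes a stream of records, may read (but not write) a shared table of globals, and produces a new stream. The kernel names and record fields below are invented for illustration only.

```python
def generate(n):
    """Source kernel: emits a stream of records."""
    for i in range(n):
        yield {"id": i, "value": float(i)}

def scale(stream, globals_):
    """Kernel: transforms one record at a time using read-only global memory."""
    for rec in stream:
        yield {"id": rec["id"], "value": rec["value"] * globals_["gain"]}

def keep_large(stream, globals_):
    """Kernel: passes on only the records that meet a global threshold."""
    for rec in stream:
        if rec["value"] >= globals_["threshold"]:
            yield rec

globals_ = {"gain": 2.0, "threshold": 4.0}
# Streams connect kernels: the output of one kernel is the input of the next.
out = list(keep_large(scale(generate(5), globals_), globals_))
```

No kernel sees more than one record at a time, which is what lets a fragment processor run many copies of the kernel in parallel.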
Streaming Ray Tracer (Simplified)
[Figure: kernel pipeline. Generate Eye Rays (globals: Camera) → rays → Traverse Acceleration Structure (globals: Grid) → ray-voxel pairs → Intersect Triangles (globals: Triangles) → hits → Shade Hits and Generate Shading Rays (globals: Materials) → pixels]
Intersection Code
float4 Intersect( float3 ro, float3 rd, int listpos, float4 h ) {
    float tri_id = texture( listpos, trilist );
    float3 v0 = texture( tri_id, v0 );
    float3 v1 = texture( tri_id, v1 );
    float3 v2 = texture( tri_id, v2 );
    float3 edge1 = v1 - v0;
    float3 edge2 = v2 - v0;
    float3 pvec = Cross( rd, edge2 );
    float det = Dot( edge1, pvec );
    float inv_det = 1/det;
    float3 tvec = ro - v0;
    float u = Dot( tvec, pvec ) * inv_det;
    float3 qvec = Cross( tvec, edge1 );
    float v = Dot( rd, qvec ) * inv_det;
    float t = Dot( edge2, qvec ) * inv_det;
    // determine if valid hit by checking
    // u,v > 0 and u+v < 1
    // set hit data into h based on valid hit
    return float4( {t,u,v,id} );
}
[Figure: the Intersect Triangles kernel reads ray-voxel pairs and Triangles and writes hits]
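The Intersect routine above has the structure of the Möller-Trumbore ray-triangle test. The following is a CPU-side sketch of the same arithmetic in plain Python, with the validity check that the slide leaves as a comment filled in. The degenerate-determinant guard, the `t > 0` test, and the inclusive edge comparisons are choices made for this example, not taken from the slide.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(ro, rd, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit in front of the ray origin, or None for a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(rd, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(ro, v0)
    u = dot(tvec, pvec) * inv_det      # first barycentric coordinate
    qvec = cross(tvec, edge1)
    v = dot(rd, qvec) * inv_det        # second barycentric coordinate
    t = dot(edge2, qvec) * inv_det     # distance along the ray
    if u < 0 or v < 0 or u + v > 1 or t <= 0:
        return None
    return (t, u, v)

tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = intersect((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)
miss = intersect((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)
```

On the GPU of the time the early returns would not be available, so the kernel computes everything and folds the validity test into the value it writes to `h`.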
Ray Tracing on a GPU
Store scene data in texture memory
– Dependent texturing is key
Multipass rendering for flow control
– Branching would eliminate this need
Scene in Texture Memory
[Figure: scene layout in texture memory]
Uniform Grid (3D luminance texture): vox0 vox1 vox2 … voxM, each holding an offset into the triangle list (0 4 11 38 … 564)
Triangle List (1D luminance texture): triangle indices (0 3 1 3 7 21 216 …)
Triangles (3x 1D RGB textures): v0, v1, v2, each holding one xyz per triangle (tri0 tri1 tri2 … triN)
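The layout amounts to a chain of dependent lookups: the grid gives a position in the triangle list, the list gives a triangle id, and the id indexes the three vertex arrays. The sketch below uses Python lists in place of textures. All data values are made up, and the `END` terminator is an assumption, since the slide does not show how a voxel's list ends.

```python
GRID_DIMS = (2, 1, 1)              # nx, ny, nz of a hypothetical tiny grid
END = -1                           # assumed list terminator, not taken from the slides

grid = [0, 3]                      # per voxel: start offset into tri_list
tri_list = [0, 2, END, 1, END]     # triangle ids, one run per voxel
v0 = [(0, 0, 0), (1, 0, 0), (0, 0, 1)]   # first vertex of each triangle
v1 = [(1, 0, 0), (2, 0, 0), (1, 0, 1)]   # second vertex of each triangle
v2 = [(0, 1, 0), (1, 1, 0), (0, 1, 1)]   # third vertex of each triangle

def voxel_index(x, y, z):
    """Flatten 3D voxel coordinates into an index into `grid`."""
    nx, ny, _ = GRID_DIMS
    return x + nx * (y + ny * z)

def triangles_in_voxel(x, y, z):
    """Follow the lookups grid -> triangle list -> vertex arrays for one voxel."""
    pos = grid[voxel_index(x, y, z)]             # lookup: voxel -> list offset
    found = []
    while tri_list[pos] != END:
        tri_id = tri_list[pos]                   # lookup: list entry -> triangle id
        found.append((tri_id, v0[tri_id], v1[tri_id], v2[tri_id]))  # lookup: id -> vertices
        pos += 1
    return found

first_voxel = triangles_in_voxel(0, 0, 0)
second_voxel = triangles_in_voxel(1, 0, 0)
```

Each lookup depends on the result of the previous one, which is why dependent texturing is the feature that makes this layout usable on the GPU.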
Texture As Memory
Currently limited in size: 128MB
– About 3M triangles @ 36 bytes per triangle
Uniform grid
– Maps naturally to 3D textures
– Requires 4 levels of dependent texture lookups
1D textures limited in length
– Emulate larger address space with 2D textures
Want integer addressing, not floating point
– Efficient access without interpolation
Integer arithmetic
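Emulating a long 1D address space with a 2D texture comes down to a divide and a remainder. The sketch below assumes a row-major layout and a hypothetical texture width of 2048; it also shows where 36 bytes per triangle comes from if each triangle is stored as three vertices of three 32-bit floats.

```python
TEXTURE_WIDTH = 2048               # hypothetical maximum texture width

def to_2d(address, width=TEXTURE_WIDTH):
    """Map a linear address to (x, y) texel coordinates, row-major."""
    return address % width, address // width

def to_1d(x, y, width=TEXTURE_WIDTH):
    """Inverse mapping: texel coordinates back to the linear address."""
    return y * width + x

texel = to_2d(5000)                # address 5000 falls in row 2 at column 904

# Three vertices, three floats each, four bytes per float.
bytes_per_triangle = 3 * 3 * 4
```

On hardware that only offers floating-point texture coordinates, the divide and remainder have to be approximated with float arithmetic, which is one reason the slide asks for integer addressing.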
Streaming Flow Control
[Figure: mapping of the pipeline to the stream model. Application and Geometry Stages → Rasterization → Fragments (Input Stream) → Fragment Program (Kernel) → Fragment Program Output (Output Stream); Texture (Globals) feeds the Fragment Program]
Streaming Ray Tracer
[Figure: kernels Generate Eye Rays, Traverse Acceleration Structure, Intersect Triangles, Shade Hits and Generate Shading Rays, with globals Camera, Grid, Triangles, Materials]
Multipass Optimization
Reduce the number of passes
– Choose to traverse or intersect based on work to be done for each type of pass
    – Connection Machine ray tracer [Delany 1988]
    – Intersect once 20% of active rays need intersecting
Make each pass less expensive
– Most passes involve only a few rays
– Early fragment kill based on fragment mask
    – Saves compute and bandwidth
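The pass-selection rule can be sketched as a small scheduling function. This is a simplified illustration that assumes each ray is in one of three states ("traverse", "intersect", "done"); the state names and the function `choose_pass` are hypothetical, and only the 20% threshold comes from the slide.

```python
def choose_pass(states, threshold=0.20):
    """Pick the next pass type, or None when no ray is active."""
    active = [s for s in states if s != "done"]
    if not active:
        return None
    waiting = sum(1 for s in active if s == "intersect")
    # Run an intersect pass once enough active rays are waiting for one.
    return "intersect" if waiting / len(active) >= threshold else "traverse"

def fragment_mask(states, pass_type):
    """Per-ray mask: True where the ray takes part in this pass, False where it is killed early."""
    return [s == pass_type for s in states]

states = ["traverse"] * 8 + ["intersect"] * 2
next_pass = choose_pass(states)
mask = fragment_mask(states, next_pass)
```

Rays whose mask entry is False are skipped for the whole pass, which is where the compute and bandwidth savings come from.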
Scene Statistics
v – average number of voxels a ray pierces
t – average triangles a ray intersects
s – average number of shading evaluations per ray
P – number of rendering passes

One column per test scene:
v   93.93   130.7   81.29   26.11   14.41
t   13.88   47.90   34.07   40.46    2.52
s    0.82    0.97    0.96    1.00    0.44
P    1085    2835    1999    1198    2443

C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask
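The cost formula is easy to evaluate directly. In the sketch below the symbol meanings beyond v, t, s, and P are assumptions: R is taken to be the number of rays, Cr, Cv, Ct, and Cs the per-ray costs of generation, one voxel step, one triangle test, and one shading evaluation, and Cmask the per-ray cost of the mask check in each pass. The numeric costs are invented purely to exercise the formula.

```python
def total_cost(R, P, v, t, s, Cr, Cv, Ct, Cs, Cmask):
    """C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask"""
    kernel_work = R * (Cr + v * Cv + t * Ct + s * Cs)
    mask_work = R * P * Cmask          # every ray pays the mask check in every pass
    return kernel_work + mask_work

# Toy numbers chosen so the result is easy to check by hand:
# 10 * (4 + 2*2 + 1*6 + 0.5*8) + 10 * 3 * 1 = 180 + 30 = 210
cost = total_cost(R=10, P=3, v=2, t=1, s=0.5, Cr=4, Cv=2, Ct=6, Cs=8, Cmask=1)
```

Because the mask term grows with P, scenes that need thousands of passes can spend a noticeable share of their total cost on the mask check alone, even when most passes touch only a few rays.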
Performance Estimates
Pentium III 800 MHz CPU implementation
– 20M intersections/s [Wald et al. 2001]
Simulated performance
– 2G instructions/s and 8GB/s bandwidth
– Instruction limited
    – 56M intersections/s
– Nearly bandwidth limited
    – 222M intersections/s
Streaming ray tracing is compute limited!
Demo Analysis
Prototype Performance (ATI R300)
– 500K to 1.4M raycasts/s
– 94M intersections/s
– Only three weeks of coding effort
ATI Radeon 8500 GPU (R200)
– 114M intersections/s [Carr et al. 2002]
– Fixed point operations
– Only ray-triangle intersection kernel
Summary
Programmable GPU is a stream processor
Ray tracing can be a streaming computation
Complete ray tracer can map onto the GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline
Streaming GPU-based ray tracer is competitive with CPU-based ray tracer
Architectural Results
Fragment mask proposed for efficient multipass
– Stream buffer eliminates this need
Stream data should not go through standard texture cache
Triangles cache well for primary rays, secondary less so
Branching architecture
– More cache coherence than the multipass architecture for scene data
– Reduces memory bandwidth for stream data
– But has its own costs…
Final Thoughts
Ray tracing maps into current GPU architecture
– Does not require fundamentally different hardware
– Hybrid algorithms possible
What else can the GPU do?
– Given you can do ray tracing, you can do anything
– Fluid flow, molecular dynamics, etc.
GPU performance increase will continue to outpace CPU performance increase