Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,

Stefan Popov High Performance GPU Ray Tracing

Real-time Ray Tracing on GPU with BVH-based Packet Traversal

Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek


Background GPUs attractive for ray tracing

High computational power Shading oriented architecture

GPU ray tracers Carr – the ray engine Purcell – Full ray tracing on the GPU, based on grids Ernst – KD trees with parallel stack Carr, Thrane & Simonsen – BVH Foley, Horn, Popov – KD trees - stackless traversal


Motivation So far

Interactive RT on GPU, but Limited model size No dynamic scene support

The G80 – new approach to the GPU High performance general purpose processor with

graphics extensions PRAM architecture

BVH allow for Dynamic/deformable scenes Small memory footprint Goal: Recursive ordered traversal of BVH on the G80


GPU Architecture (G80) Multi-threaded scalar architecture

12K HW threads Threads cover latencies

Off-chip memory ops Instruction dependencies

4 or 16 cycles to issue instr. 16 (multi-)cores

8-wide SIMD 128 scalar cores in total Cores process threads in 32 wide

SIMD chunks

……

Multi-Core 1

……

Chunk Pool…

……

…

Multi-Core 16

……

Chunk Pool…

……

…


GPU Architecture (G80) Scalar register file (8K)

Partitioned among running threads

Shared memory (16KB) On-chip, 0 cycle latency

On-board memory (768MB) Large latency (~ 200 cycles) R/W from within thread Un-cached

Read-only L2 cache (128KB) On chip, shared among all

threads

On-board memory

Multi-Core 1

……Shared

Mem

ory

RegistersRegisters

……

Multi-Core 16

L2 Cache (128KB)


Programming the G80 CUDA

C based language with parallel extensions GPU utilization at 100% only if

Enough threads are present (>> 12K) Every thread uses less than 10 registers and 5

words (32 bit) of shared memory Enough computations per transferred word of data

Bandwidth << computational power Adequate memory access pattern to allow read

combining


Performance Bottlenecks Efficient per-thread stack implementation

Shared memory too small – will limit parallelism On-board memory – uncached

Need enough computations between stack ops Efficient memory access pattern

Use texture caches However, only few words of cache / thread

Read successive memory locations in successive threads of a chunk Single roundtrip to memory (read combining)

Cover latency with enough computations


Ray Tracing on the G80 Map each ray to one thread

Enough threads to keep the GPU busy Recursive ray tracing

Use per-thread stack stored on on-board memory Efficient, since enough computations are present

But how to do the traversal ? Skip pointers (Thrane) – no ordered traversal Geometric images (Carr) – single mesh only Shared stack traversal


SIMD Packet Traversal of BVH Traverse a node with the whole packet At an internal node:

Intersect all rays with both children and determine traversal order

Push far child (if any) on a stack and descend to the near one with the packet

At a leaf: Intersect all rays with contained geometry Pop next node to visit from the stack


PRAM Basics The PRAM model

Implicitly synchronized processors (threads)

Shared memory between all processors

Basic PRAM operations Parallel OR in O(1) Parallel reduction in

O(log N)

false truefalse true

false

true

11 912 32

44 20

+ +

+64 20 11 9

11 9


PRAM Packet Traversal of BVH The G80 – PRAM machine on chunk level

Map packet chunk, ray thread Threads behave as in the single ray traversal

At leaf: Intersect with geometry. Pop next node from stack

At node: Decide which children to visit and in what order. Push far child

Difference: How rays choose which node to visit first

Might not be the one they want to


PRAM Packet Traversal of BVH Choose child traversal order

PRAM OR to determine if all rays agree on visiting the same node first The result is stored in shared memory

In case of divergence: choose child with more ray candidates Use PRAM SUM on +/- 1 for each thread, -1 left node Look at result’s sign

Guarantees synchronous traversal of BVH


PRAM Packet Traversal of BVH Stack:

Near & far child – the same for all threads => store once

Keep stack in shared memory. Only few bits per thread!

Only Thread 0 does all stack ops. Reading data:

All threads work with the same node / triangle Sequential threads bring in sequential words Single load operation. Single round trip to memory

Implementable in CUDA


Results


Analysis Coherent branch decisions / memory access Small footprint of the data structure

Can trace up to 12 million triangle models Program becomes compute bound

Determined by over/under-clocking the core/memory No frustums required

Good for secondary rays, bad for primary Can use rasterization for primary rays

Implicit SIMD – easy shader programming Running on a GPU – shading “for free”


Dynamic Scenes Update parts / whole BVH and geometry on

GPU Use GPU for RT and CPU for BVH construction /

refitting Construct BVH using binning

Similar to Wald RT07 / Popov RT06 Bin all 3 dimensions using SIMD

Results in > 10% better trees Measured as SAH quality, not FPS Speed loss is almost negligible


Results


Conclusions New recursive PRAM BVH traversal algorithm

Very well suited for the new generation of GPUs No additional pre-computed data required

First GPU ray tracer to handle large models Previous implementations were limited to < 300K

Can handle dynamic scenes By using the CPU to update the geometry / BVH


Future Work More features

Shaders, adaptive anti-aliasing, … Global illumination

Code optimizations Current implementation uses too many registers


Thank you!


CUDA Hello World__global__ void addArrays(int *arr1, int *arr2){

unsigned t = threadIdx.x + blockIdx.x * blockDim.x;arr1[t] += arr2[t];

}

int main(){

int *inArr1 = malloc(4194304), *inArr2 = malloc(4194304);int *ta1, *ta2;cudaMalloc((void**)&ta1, 4194304); cudaMalloc((void**)&ta2, 4194304);

for(int i = 0; i < 4194304; i++){ inArr1[i] = rand(); inArr2[i] = rand(); }

cudaMemcpy(ta1, inArr1, 4194304, cudaMemcpyHostToDevice);cudaMemcpy(ta2, inArr2, 4194304, cudaMemcpyHostToDevice);

addArrays<<<dim3(4194304 / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);

cudaMemcpy(inArr1, ta1, 4194304, cudaMemcpyDeviceToHost);for(int i = 0; i < 4194304; i++) printf("%d ", inArr1[i]);

return 0;}

Documents

Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,