Stefan Popov High Performance GPU Ray Tracing
Real-time Ray Tracing on GPU with BVH-based Packet Traversal
Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek

Background
- GPUs are attractive for ray tracing
  - High computational power
  - Shading-oriented architecture
- GPU ray tracers
  - Carr: the ray engine
  - Purcell: full ray tracing on the GPU, based on grids
  - Ernst: KD trees with parallel stack
  - Carr, Thrane & Simonsen: BVH
  - Foley, Horn, Popov: KD trees with stackless traversal

Motivation
- So far
  - Interactive RT on GPU, but
    - Limited model size
    - No dynamic scene support
- The G80: a new approach to the GPU
  - High-performance general-purpose processor with graphics extensions
  - PRAM architecture
- BVHs allow for
  - Dynamic/deformable scenes
  - Small memory footprint
- Goal: recursive ordered traversal of BVH on the G80

GPU Architecture (G80)
- Multi-threaded scalar architecture
  - 12K HW threads
  - Threads cover latencies
    - Off-chip memory ops
    - Instruction dependencies
  - 4 or 16 cycles to issue an instruction
- 16 (multi-)cores
  - 8-wide SIMD
  - 128 scalar cores in total
  - Cores process threads in 32-wide SIMD chunks
[Figure: Multi-Core 1 through Multi-Core 16, each with its own chunk pool]

GPU Architecture (G80)
- Scalar register file (8K)
  - Partitioned among running threads
- Shared memory (16KB)
  - On-chip, 0 cycle latency
- On-board memory (768MB)
  - Large latency (~200 cycles)
  - R/W from within thread
  - Un-cached
- Read-only L2 cache (128KB)
  - On-chip, shared among all threads
[Figure: on-board memory, L2 cache (128KB), and Multi-Core 1 through Multi-Core 16, each with registers and shared memory]

Programming the G80
- CUDA
  - C-based language with parallel extensions
- GPU utilization at 100% only if
  - Enough threads are present (>> 12K)
  - Every thread uses less than 10 registers and 5 words (32 bit) of shared memory
  - Enough computations per transferred word of data
    - Bandwidth << computational power
  - Adequate memory access pattern to allow read combining

Performance Bottlenecks
- Efficient per-thread stack implementation
  - Shared memory too small: will limit parallelism
  - On-board memory: uncached
    - Need enough computations between stack ops
- Efficient memory access pattern
  - Use texture caches
    - However, only a few words of cache per thread
  - Read successive memory locations in successive threads of a chunk
    - Single round trip to memory (read combining)
    - Cover latency with enough computations
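
The access pattern named on this slide can be illustrated with a minimal CUDA sketch. It is not taken from the talk: the kernel names `copyCoalesced` and `copyStrided` and the host helper `runCoalescedCopy` are hypothetical. In the first kernel, successive threads read successive words, which is the pattern that allows read combining; the second kernel is a counter-example in which neighbouring threads read words far apart.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Successive threads read successive 32-bit words, so the loads issued by
// one chunk fall into a contiguous range and can be combined.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Counter-example: neighbouring threads read words `stride` apart, so the
// loads of one chunk are scattered over many separate memory segments.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(int)(((long long)i * stride) % n)];
}

// Hypothetical host helper: runs the coalesced copy on n words and reports
// whether the output equals the input.
bool runCoalescedCopy(int n) {
    std::vector<float> src(n), dst(n);
    for (int i = 0; i < n; ++i) src[i] = (float)i;

    float *dIn, *dOut;
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, src.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copyCoalesced<<<blocks, threads>>>(dIn, dOut, n);

    cudaMemcpy(dst.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return src == dst;
}
```

Both kernels move the same amount of data; only the mapping from thread index to address differs, which is why the access pattern rather than the data volume decides whether a chunk needs one round trip or many.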

Ray Tracing on the G80
- Map each ray to one thread
  - Enough threads to keep the GPU busy
- Recursive ray tracing
  - Use per-thread stack stored in on-board memory
  - Efficient, since enough computations are present
- But how to do the traversal?
  - Skip pointers (Thrane): no ordered traversal
  - Geometry images (Carr): single mesh only
  - Shared stack traversal

SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node:
  - Intersect all rays with both children and determine traversal order
  - Push far child (if any) on a stack and descend to the near one with the packet
- At a leaf:
  - Intersect all rays with contained geometry
  - Pop next node to visit from the stack

PRAM Basics
- The PRAM model
  - Implicitly synchronized processors (threads)
  - Shared memory between all processors
- Basic PRAM operations
  - Parallel OR in O(1)
  - Parallel reduction in O(log N)
[Figure: parallel OR over true/false flags; parallel sum reduction of 11, 9, 12, 32 into partial sums 20 and 44, then 64]
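
The two basic operations can be sketched in CUDA using shared memory and barriers. This is an illustration under assumptions, not the authors' code: the names `chunkOr`, `chunkSum`, `pramDemo`, and `runPramDemo` are hypothetical, and the demo runs a single 32-thread block so that one block stands in for one chunk. The OR relies on all participating threads writing the same value to one shared word; the sum is a tree reduction with log2(32) = 5 steps.

```cuda
#include <cuda_runtime.h>

#define CHUNK 32  // threads per SIMD chunk; the demo launches exactly one

// Parallel OR: every thread whose flag is set writes the same value to one
// shared word, so the result is known after a constant number of steps.
__device__ int chunkOr(bool flag, volatile int *sharedWord) {
    if (threadIdx.x == 0) *sharedWord = 0;
    __syncthreads();
    if (flag) *sharedWord = 1;
    __syncthreads();
    int result = *sharedWord;
    __syncthreads();  // everyone has read the word before it is reused
    return result;
}

// Parallel sum: tree reduction over a shared array in log2(CHUNK) steps.
__device__ int chunkSum(int value, volatile int *sharedArr) {
    sharedArr[threadIdx.x] = value;
    __syncthreads();
    for (int s = CHUNK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sharedArr[threadIdx.x] += sharedArr[threadIdx.x + s];
        __syncthreads();
    }
    int result = sharedArr[0];
    __syncthreads();
    return result;
}

__global__ void pramDemo(const int *in, int *orOut, int *sumOut) {
    __shared__ int word;
    __shared__ int arr[CHUNK];
    int v = in[threadIdx.x];
    int anyLarge = chunkOr(v > 30, &word);  // is any input larger than 30?
    int total = chunkSum(v, arr);
    if (threadIdx.x == 0) { *orOut = anyLarge; *sumOut = total; }
}

// Hypothetical host helper: the inputs are 11, 9, 12, 32 followed by zeros.
void runPramDemo(int *orResult, int *sumResult) {
    int h[CHUNK] = {11, 9, 12, 32};
    int res[2] = {-1, -1};
    int *dIn, *dOut;
    cudaMalloc((void**)&dIn, sizeof(h));
    cudaMalloc((void**)&dOut, sizeof(res));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    pramDemo<<<1, CHUNK>>>(dIn, dOut, dOut + 1);
    cudaMemcpy(res, dOut, sizeof(res), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    *orResult = res[0];
    *sumResult = res[1];
}
```

With the inputs above, `runPramDemo` yields a sum of 64 and an OR result of 1, since 32 is the only value larger than 30.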

PRAM Packet Traversal of BVH
- The G80: a PRAM machine on chunk level
- Map each packet to a chunk and each ray to a thread
- Threads behave as in the single ray traversal
  - At a leaf: intersect with geometry, pop next node from stack
  - At a node: decide which children to visit and in what order, push far child
- Difference: how rays choose which node to visit first
  - It might not be the one a given ray would choose on its own

PRAM Packet Traversal of BVH
- Choose child traversal order
  - PRAM OR to determine if all rays agree on visiting the same node first
    - The result is stored in shared memory
  - In case of divergence: choose child with more ray candidates
    - Use PRAM SUM on +/- 1 for each thread, -1 for the left node
    - Look at the sign of the result
  - Guarantees synchronous traversal of BVH
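
The voting scheme on this slide can be sketched as follows. The sketch makes several assumptions that are not in the talk: the names `chooseFirstChild`, `voteKernel`, and `runVote` are hypothetical, inactive rays vote 0, ties go to the left child, and a single 32-thread block stands in for one chunk.

```cuda
#include <cuda_runtime.h>

#define CHUNK 32  // threads per chunk; the demo launches exactly one chunk

// vote: -1 if this thread's ray wants the left child first, +1 for the
// right child, 0 if the ray is inactive (the 0 case is an assumption).
// Returns 0 (left first) or 1 (right first), identical for every thread.
__device__ int chooseFirstChild(int vote, volatile int *flags, volatile int *votes) {
    if (threadIdx.x < 2) flags[threadIdx.x] = 0;
    __syncthreads();
    if (vote < 0) flags[0] = 1;  // parallel OR: some ray prefers left
    if (vote > 0) flags[1] = 1;  // parallel OR: some ray prefers right
    __syncthreads();

    int first;
    if (!(flags[0] && flags[1])) {
        first = flags[1] ? 1 : 0;  // all active rays agree
    } else {
        // Divergence: sum the +/-1 votes. The branch condition comes from
        // shared memory, so all threads take it together and may use barriers.
        votes[threadIdx.x] = vote;
        __syncthreads();
        for (int s = CHUNK / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) votes[threadIdx.x] += votes[threadIdx.x + s];
            __syncthreads();
        }
        first = votes[0] > 0 ? 1 : 0;  // sign of the sum; ties go left
    }
    __syncthreads();  // shared words stay intact until every thread has read them
    return first;
}

__global__ void voteKernel(const int *in, int *out) {
    __shared__ int flags[2];
    __shared__ int votes[CHUNK];
    int first = chooseFirstChild(in[threadIdx.x], flags, votes);
    if (threadIdx.x == 0) *out = first;
}

// Hypothetical host helper: the first numLeft threads vote left, the next
// numRight threads vote right, and the remaining threads are inactive.
int runVote(int numLeft, int numRight) {
    int h[CHUNK];
    for (int i = 0; i < CHUNK; ++i)
        h[i] = i < numLeft ? -1 : (i < numLeft + numRight ? 1 : 0);
    int *dIn, *dOut, first = -1;
    cudaMalloc((void**)&dIn, sizeof(h));
    cudaMalloc((void**)&dOut, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    voteKernel<<<1, CHUNK>>>(dIn, dOut);
    cudaMemcpy(&first, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return first;
}
```

Because every thread receives the same return value, all rays of the chunk descend into the same child next, which is what keeps the traversal synchronous even when individual rays would have preferred the other order.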

PRAM Packet Traversal of BVH
- Stack:
  - Near & far child are the same for all threads => store once
  - Keep stack in shared memory. Only a few bits per thread!
  - Only thread 0 does all stack ops
- Reading data:
  - All threads work with the same node / triangle
  - Sequential threads bring in sequential words
  - Single load operation, single round trip to memory
- Implementable in CUDA
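
A minimal sketch of the shared stack and the cooperative node load, under stated assumptions: the node size of 8 words, the stack depth of 32, and all names (`ChunkState`, `push`, `pop`, `loadNode`, `stackDemo`, `runStackDemo`) are hypothetical, and one 32-thread block again stands in for one chunk. The actual node layout used in the talk is not given on the slides.

```cuda
#include <cuda_runtime.h>

#define CHUNK 32
#define NODE_WORDS 8    // hypothetical node size in 32-bit words
#define STACK_DEPTH 32  // hypothetical maximum stack depth

struct ChunkState {
    int stack[STACK_DEPTH];  // one stack for the whole chunk
    int top;                 // number of entries on the stack
    float node[NODE_WORDS];  // node currently processed by all threads
};

// Only thread 0 touches the stack; the barrier publishes the change.
__device__ void push(ChunkState *s, int nodeIdx) {
    if (threadIdx.x == 0) s->stack[s->top++] = nodeIdx;
    __syncthreads();
}

// Thread 0 pops; every thread receives the popped index (-1 if empty).
__device__ int pop(ChunkState *s, volatile int *result) {
    if (threadIdx.x == 0) *result = (s->top > 0) ? s->stack[--s->top] : -1;
    __syncthreads();
    int r = *result;
    __syncthreads();
    return r;
}

// Thread t fetches word t of the node: sequential threads, sequential words.
__device__ void loadNode(ChunkState *s, const float *nodes, int nodeIdx) {
    __syncthreads();  // nobody is still reading the previous node
    if (threadIdx.x < NODE_WORDS)
        s->node[threadIdx.x] = nodes[nodeIdx * NODE_WORDS + threadIdx.x];
    __syncthreads();
}

__global__ void stackDemo(const float *nodes, int *out) {
    __shared__ ChunkState s;
    __shared__ int popped;
    if (threadIdx.x == 0) s.top = 0;
    __syncthreads();

    push(&s, 3);
    push(&s, 5);
    int idx = pop(&s, &popped);   // most recently pushed node
    loadNode(&s, nodes, idx);
    int next = pop(&s, &popped);
    int last = pop(&s, &popped);  // stack is empty by now

    if (threadIdx.x == CHUNK - 1) {  // a thread that did none of the stack ops
        out[0] = idx;
        out[1] = (int)s.node[NODE_WORDS - 1];
        out[2] = next;
        out[3] = last;
    }
}

// Hypothetical host helper: word i of the node array holds the value i.
void runStackDemo(int out[4]) {
    const int words = 8 * NODE_WORDS;
    float h[words];
    for (int i = 0; i < words; ++i) h[i] = (float)i;
    float *dNodes;
    int *dOut;
    cudaMalloc((void**)&dNodes, sizeof(h));
    cudaMalloc((void**)&dOut, 4 * sizeof(int));
    cudaMemcpy(dNodes, h, sizeof(h), cudaMemcpyHostToDevice);
    stackDemo<<<1, CHUNK>>>(dNodes, dOut);
    cudaMemcpy(out, dOut, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dNodes);
    cudaFree(dOut);
}
```

The design point is that the stack costs a fixed amount of shared memory per chunk rather than per thread, and a node is fetched once for all 32 rays instead of 32 times.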

Results

Analysis
- Coherent branch decisions / memory access
- Small footprint of the data structure
  - Can trace models of up to 12 million triangles
- Program becomes compute bound
  - Determined by over/under-clocking the core/memory
- No frustums required
  - Good for secondary rays, bad for primary
    - Can use rasterization for primary rays
- Implicit SIMD: easy shader programming
- Running on a GPU: shading "for free"

Dynamic Scenes
- Update parts / whole BVH and geometry on GPU
- Use GPU for RT and CPU for BVH construction / refitting
- Construct BVH using binning
  - Similar to Wald RT07 / Popov RT06
  - Bin all 3 dimensions using SIMD
    - Results in > 10% better trees
      - Measured as SAH quality, not FPS
    - Speed loss is almost negligible
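
As an illustration of binned SAH construction, here is a host-side sketch that evaluates split candidates along all three axes and keeps the cheapest one. It is a generic textbook-style version, not the builder from the talk: it is scalar rather than SIMD, the bin count of 16 is arbitrary, constant factors of the SAH cost are omitted, and the names `Box`, `Split`, `unitBoxAt`, and `findBestSplit` are hypothetical.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>

struct Box { float lo[3], hi[3]; };
struct Split { int axis; int bin; float cost; };  // split after bin `bin`

const int BINS = 16;  // hypothetical bin count

static Box emptyBox() {
    Box b;
    for (int k = 0; k < 3; ++k) { b.lo[k] = FLT_MAX; b.hi[k] = -FLT_MAX; }
    return b;
}

static void grow(Box &b, const Box &o) {
    for (int k = 0; k < 3; ++k) {
        b.lo[k] = std::min(b.lo[k], o.lo[k]);
        b.hi[k] = std::max(b.hi[k], o.hi[k]);
    }
}

static float area(const Box &b) {
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Demo helper: axis-aligned unit cube whose lower corner is (x, 0, 0).
Box unitBoxAt(float x) {
    Box b = {{x, 0.0f, 0.0f}, {x + 1.0f, 1.0f, 1.0f}};
    return b;
}

// Bins primitive centroids on each axis and returns the split minimising
// area(left) * count(left) + area(right) * count(right).
// axis == -1 in the result means that no valid split was found.
Split findBestSplit(const std::vector<Box> &prims) {
    float cLo[3] = {FLT_MAX, FLT_MAX, FLT_MAX};
    float cHi[3] = {-FLT_MAX, -FLT_MAX, -FLT_MAX};
    for (const Box &p : prims)
        for (int k = 0; k < 3; ++k) {
            float c = 0.5f * (p.lo[k] + p.hi[k]);
            cLo[k] = std::min(cLo[k], c);
            cHi[k] = std::max(cHi[k], c);
        }

    Split best = {-1, -1, FLT_MAX};
    for (int axis = 0; axis < 3; ++axis) {
        float extent = cHi[axis] - cLo[axis];
        if (extent <= 0.0f) continue;  // all centroids coincide on this axis

        Box binBox[BINS];
        int binCount[BINS] = {0};
        for (int b = 0; b < BINS; ++b) binBox[b] = emptyBox();
        for (const Box &p : prims) {
            float c = 0.5f * (p.lo[axis] + p.hi[axis]);
            int b = std::min(BINS - 1, (int)(BINS * (c - cLo[axis]) / extent));
            grow(binBox[b], p);
            binCount[b]++;
        }

        // Right-to-left sweep: area and count of everything in bins b..BINS-1.
        float rightArea[BINS];
        int rightCount[BINS];
        Box acc = emptyBox();
        int n = 0;
        for (int b = BINS - 1; b > 0; --b) {
            grow(acc, binBox[b]);
            n += binCount[b];
            rightCount[b] = n;
            rightArea[b] = n > 0 ? area(acc) : 0.0f;
        }

        // Left-to-right sweep: evaluate the split after each bin.
        acc = emptyBox();
        n = 0;
        for (int b = 0; b < BINS - 1; ++b) {
            grow(acc, binBox[b]);
            n += binCount[b];
            if (n == 0 || rightCount[b + 1] == 0) continue;
            float cost = area(acc) * n + rightArea[b + 1] * rightCount[b + 1];
            if (cost < best.cost) { best.axis = axis; best.bin = b; best.cost = cost; }
        }
    }
    return best;
}
```

For four unit cubes placed at x = 0, 1, 10, and 11, the cheapest split separates the two clusters along the x axis. A full builder would apply `findBestSplit` recursively and fall back to a leaf when the returned axis is -1.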

Results

Conclusions
- New recursive PRAM BVH traversal algorithm
  - Very well suited for the new generation of GPUs
  - No additional pre-computed data required
- First GPU ray tracer to handle large models
  - Previous implementations were limited to < 300K triangles
- Can handle dynamic scenes
  - By using the CPU to update the geometry / BVH

Future Work
- More features
  - Shaders, adaptive anti-aliasing, …
  - Global illumination
- Code optimizations
  - Current implementation uses too many registers

Thank you!
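
CUDA Hello World

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

const int N = 4194304;  // number of ints per array

// Each thread adds one element of arr2 to the matching element of arr1.
__global__ void addArrays(int *arr1, int *arr2){
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main(){
    size_t bytes = N * sizeof(int);  // allocation and copy sizes are in bytes
    int *inArr1 = (int*)malloc(bytes), *inArr2 = (int*)malloc(bytes);
    int *ta1, *ta2;
    cudaMalloc((void**)&ta1, bytes); cudaMalloc((void**)&ta2, bytes);
    for(int i = 0; i < N; i++){ inArr1[i] = rand(); inArr2[i] = rand(); }
    cudaMemcpy(ta1, inArr1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, bytes, cudaMemcpyHostToDevice);
    // N is a multiple of 512, so every thread maps to a valid element.
    addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);
    cudaMemcpy(inArr1, ta1, bytes, cudaMemcpyDeviceToHost);
    for(int i = 0; i < N; i++) printf("%d ", inArr1[i]);
    cudaFree(ta1); cudaFree(ta2); free(inArr1); free(inArr2);
    return 0;
}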