1
Latency considerations of depth-first GPU ray tracing
Michael Guthe
University of Bayreuth, Visual Computing
2
Depth-first GPU ray tracing
Based on a bounding box or spatial hierarchy
Recursive traversal, usually using a stack
Threads inside a warp may access different data
They may also diverge
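The traversal pattern these bullets describe can be sketched on the CPU side. This is a minimal illustration, not the kernel's actual code: the node layout, the 64-entry stack, and all names are assumptions for the sketch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical node layout: each node stores an AABB; leaves store a
// primitive index (prim >= 0), inner nodes store two child indices.
struct AABB { float lo[3], hi[3]; };
struct Node { AABB box; int left, right, prim; };

// Slab test: does the ray o + t*d (t in [0, tmax]) hit the box?
bool hitBox(const AABB& b, const float o[3], const float inv[3], float tmax) {
    float t0 = 0.f, t1 = tmax;
    for (int a = 0; a < 3; ++a) {
        float tn = (b.lo[a] - o[a]) * inv[a];
        float tf = (b.hi[a] - o[a]) * inv[a];
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
    }
    return t0 <= t1;
}

// Iterative depth-first traversal with an explicit stack -- the pattern
// the slide names: each thread pops a node, tests it, and pushes its
// children, so threads in a warp fetch different nodes and may diverge.
std::vector<int> traverse(const std::vector<Node>& nodes,
                          const float o[3], const float d[3], float tmax) {
    float inv[3] = {1.f / d[0], 1.f / d[1], 1.f / d[2]};
    std::vector<int> hits;
    int stack[64], sp = 0;
    stack[sp++] = 0;                                // root at index 0
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!hitBox(n.box, o, inv, tmax)) continue;
        if (n.prim >= 0) hits.push_back(n.prim);    // leaf: record hit
        else { stack[sp++] = n.left; stack[sp++] = n.right; }
    }
    return hits;
}
```

On the GPU the stack lives in local memory per thread, which is one source of the scattered accesses discussed later.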
3
Performance Analysis
What limits performance of the trace kernel? Device memory bandwidth?
[Chart: device memory bandwidth utilization for primary, AO, and diffuse rays on GTX 480, GTX 680, and GTX Titan; y-axis 0-12%]
Obviously not!
4
Performance Analysis
What limits performance of the trace kernel? Maximum (warp) instructions per clock?
[Chart: achieved warp instructions per clock for primary, AO, and diffuse rays on GTX 480, GTX 680, and GTX Titan; y-axis 0-80%]
Not really!
5
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Three possible reasons:
Instruction fetch, e.g. due to branches
Memory latency (a.k.a. data request), mainly due to random access
Read-after-write latency (a.k.a. execution dependency): on Kepler it takes 22 clock cycles until a result is written back to a register
6
Performance Analysis
Why doesn’t the kernel fully utilize the cores? Profiling shows:
[Chart: stall reasons (instruction fetch, memory latency, execution dependency) for primary, AO, and diffuse rays on GTX 480, GTX 680, and GTX Titan; y-axis 0-70%]
Memory & RAW latency limit performance!
7
Reducing Latency
Standard solution for latency: increase occupancy
Not an option here due to register pressure
Relocate memory accesses
Automatically performed by the compiler, but not across iterations of a while loop
Loop unrolling for the triangle test
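The unrolling idea can be sketched as follows, assuming a Möller-Trumbore-style triangle test; the structures, names, and two-wide unroll factor are illustrative, not the paper's code. The point is that both triangle fetches are issued before either test completes, so the second load overlaps the first test's dependency chain.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical triangle layout: one vertex plus two edge vectors.
struct Tri { float v0[3], e1[3], e2[3]; };

static void cross(const float a[3], const float b[3], float r[3]) {
    r[0] = a[1]*b[2] - a[2]*b[1];
    r[1] = a[2]*b[0] - a[0]*b[2];
    r[2] = a[0]*b[1] - a[1]*b[0];
}
static float dot(const float a[3], const float b[3]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Möller-Trumbore ray/triangle test (hit anywhere in front of the origin).
bool intersect(const Tri& t, const float o[3], const float d[3]) {
    float p[3], q[3], s[3];
    cross(d, t.e2, p);
    float det = dot(t.e1, p);
    if (det > -1e-7f && det < 1e-7f) return false;  // parallel
    float inv = 1.f / det;
    for (int k = 0; k < 3; ++k) s[k] = o[k] - t.v0[k];
    float u = dot(s, p) * inv;
    if (u < 0.f || u > 1.f) return false;
    cross(s, t.e1, q);
    float v = dot(d, q) * inv;
    if (v < 0.f || u + v > 1.f) return false;
    return dot(t.e2, q) * inv > 1e-7f;              // hit in front of origin
}

int countHits(const std::vector<Tri>& tris, const float o[3], const float d[3]) {
    int hits = 0;
    size_t i = 0, n = tris.size();
    for (; i + 1 < n; i += 2) {
        // Both loads issued up front: the fetch of b overlaps the test
        // of a instead of waiting a full loop iteration.
        Tri a = tris[i];
        Tri b = tris[i + 1];
        hits += intersect(a, o, d);
        hits += intersect(b, o, d);
    }
    if (i < n) hits += intersect(tris[i], o, d);    // odd leftover
    return hits;
}
```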
8
Reducing Latency
Instruction-level parallelism
Not directly supported by the GPU
Increases the number of eligible warps, same effect as higher occupancy
We can even afford to spend a few more registers
Wider trees
A 4-ary tree means 4 independent instruction paths
Almost doubles the number of eligible warps during node tests
Higher widths increase the number of node tests; 4 is the optimum
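A sketch of the 4-ary node test; the node layout (four child boxes stored side by side) and all names are assumptions. The per-child slab tests share no intermediate registers, so the scheduler sees four independent dependency chains per node instead of one, which is what hides the read-after-write latency.

```cpp
// Hypothetical 4-wide node: bounds of up to four children plus their
// indices; child[c] < 0 marks an unused slot.
struct Node4 {
    float lo[4][3], hi[4][3];
    int child[4];
};

// Returns a 4-bit mask of intersected children. Each iteration is an
// independent instruction stream -- no value flows between children --
// so the four slab tests can be interleaved to hide latency.
int testNode4(const Node4& n, const float o[3], const float inv[3], float tmax) {
    int mask = 0;
    for (int c = 0; c < 4; ++c) {
        if (n.child[c] < 0) continue;
        float t0 = 0.f, t1 = tmax;
        for (int a = 0; a < 3; ++a) {
            float tn = (n.lo[c][a] - o[a]) * inv[a];
            float tf = (n.hi[c][a] - o[a]) * inv[a];
            if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
            if (tn > t0) t0 = tn;
            if (tf < t1) t1 = tf;
        }
        if (t0 <= t1) mask |= 1 << c;
    }
    return mask;
}
```

In a real kernel the loop over c would be unrolled so the compiler can actually interleave the four chains.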
9
Reducing Latency
Tree construction
Start from the root
Recursively pull the largest child up
Special rules for leaves to reduce memory consumption
Goal: 4 child nodes whenever possible
10
Reducing Latency
Overhead: sorting the intersected nodes
Can use two independent paths with a parallel merge sort
We don't need sorting for occlusion rays
[Figure: sorting three hit distances (0.2, 0.3, 0.7) into front-to-back traversal order]
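For up to four child hits, the merge sort reduces to a small compare-exchange network; this sketch uses the standard 4-element network, which the talk's kernel may implement differently. The first two compare-exchanges are the two independent paths the slide mentions: they touch disjoint registers and can overlap.

```cpp
#include <utility>

// Compare-exchange of two distances, keeping the child ids in step.
static inline void cmpSwap(float& a, float& b, int& ia, int& ib) {
    if (b < a) { std::swap(a, b); std::swap(ia, ib); }
}

// Sorts hit distances d[0..3] ascending (front to back) together with
// their child ids, using the 5-comparator merge network for n = 4.
void sort4(float d[4], int id[4]) {
    cmpSwap(d[0], d[1], id[0], id[1]);   // path 1 \  independent,
    cmpSwap(d[2], d[3], id[2], id[3]);   // path 2 /  can overlap
    cmpSwap(d[0], d[2], id[0], id[2]);   // merge stage
    cmpSwap(d[1], d[3], id[1], id[3]);
    cmpSwap(d[1], d[2], id[1], id[2]);
}
```

For occlusion (AO/shadow) rays any hit terminates traversal, so this sort is skipped entirely.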
11
Results
Improved instructions per clock
Doesn't directly translate into speedup
[Charts: warp instructions per clock before (left) and after (right) the optimization, for primary, AO, and diffuse rays on GTX 480, GTX 680, and GTX Titan; y-axis 0-80%]
12
Results
Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", 2012
[Charts: million rays per second for primary and AO rays on GTX 480, GTX 680, and GTX Titan, Aila et al. (left) vs. ours (right); y-axis 0-600]
Sibenik, 80k tris.
13
Results
Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", 2012
[Charts: million rays per second for primary and AO rays on GTX 480, GTX 680, and GTX Titan, Aila et al. (left) vs. ours (right); y-axis 0-450]
Fairy forest, 174k tris.
14
Results
Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", 2012
[Charts: million rays per second for primary and AO rays on GTX 480, GTX 680, and GTX Titan, Aila et al. (left) vs. ours (right); y-axis 0-700]
Conference, 283k tris.
15
Results
Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", 2012
[Charts: million rays per second for primary and AO rays on GTX 480, GTX 680, and GTX Titan, Aila et al. (left) vs. ours (right); y-axis 0-300]
San Miguel, 11M tris.
16
Results
Latency is still the performance limiter
Mostly the memory latency improved
[Charts: stall reasons (instruction fetch, memory latency, execution dependency) for primary, AO, and diffuse rays before (left) and after (right) the optimization, on GTX 480, GTX 680, and GTX Titan; y-axis 0-70%]