Foveated Ray Tracing for VR on Multiple GPUs

FOVEATED RAY TRACING FOR VR ON MULTIPLE GPUS

TAKAHIRO HARADA, AMD 12/2014

2 | DEC 3, 2014

INTRO

• Ray Tracing + Foveated rendering + VR + Multiple GPUs == A lot of GPU compute!!
• Compute fills a texture
• Use GL/CL interop to display

GPU RAY TRACING

• Everything is written in compute
• Our renderer is 100% OpenCL
  ‒ Win, Linux, OSX
  ‒ GPU, CPU
• High quality rendering compared to raster graphics

GPU RAY TRACING

• A single big kernel
  ‒ Easy to port
  ‒ Works
• Do you write only 1 pixel shader??
• Drawbacks
  ‒ Performance (SIMD divergence; low GPU occupancy because it uses too many VGPRs)
  ‒ Maintainability
  ‒ Extensibility
  ‒ Portability
  ‒ Debugging
• Multiple kernel implementation

IMPLEMENTATION CHOICES

HOW MANY WGS CAN WE EXECUTE PER SIMD (AMD GPU)

• 10 wavefronts (64 WIs each) per SIMD is the max
• It depends on the local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n work groups
  ‒ 1 wavefront, 256 VGPRs :( :(
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs :)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB :( :(
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB :)
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)

• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 16KB/SIMD (= 64KB/4)
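The occupancy numbers above follow from integer division of the shared resources. A minimal sketch, assuming one wavefront per work group and ignoring any hardware allocation granularity; the function name and constants are illustrative, with values taken from the bullets (10 wavefronts max, 256 VGPRs per lane, a 16KB LDS share):

```python
MAX_WAVES_PER_SIMD = 10          # slide: 10 wavefronts per SIMD is the max
VGPRS_PER_LANE = 256             # slide: 64KB / 64 lanes / 4 bytes
LDS_SHARE_BYTES = 16 * 1024      # slide: 16KB of LDS shared among work groups


def waves_per_simd(vgprs_per_work_item, lds_bytes_per_work_group=0):
    """Estimate resident wavefronts per SIMD (assumes 1 wavefront per work group)."""
    if not 1 <= vgprs_per_work_item <= VGPRS_PER_LANE:
        raise ValueError("VGPR count must be in 1..256")
    # The register file is split evenly among resident wavefronts.
    limit = min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_work_item)
    # LDS is split the same way; whichever resource runs out first wins.
    if lds_bytes_per_work_group > 0:
        limit = min(limit, LDS_SHARE_BYTES // lds_bytes_per_work_group)
    return limit


for vgprs in (256, 128, 64, 25):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wavefront(s)")
```

A kernel using 64 VGPRs and 8KB of LDS would be limited to 2 wavefronts by LDS, not by registers, which is why both resources need checking.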

GPU RAY TRACING

Single kernel implementation

Host Code

    launch( RayTraceKernel );

Device Code

    __kernel void RayTraceKernel();

Multiple kernel implementation

Host Code

    launch( PrimaryRayGenKernel );
    while(1)
    {
        launch( TraceKernel );
        if( !any( hits ) ) break;
        launch( SampleLightKernel );
        launch( TraceKernel );
        launch( AccumulateDIKernel );
        launch( SampleNextRayKernel );
    }

Device Code

    __kernel void PrimaryRayGenKernel()
    __kernel void TraceKernel()
    __kernel void SampleLightKernel()

RAY TRACING + VR

• Ray tracing is flexible
• Raster graphics, single proj matrix
• Can cast rays in arbitrary directions
• Easy to set up VR
• But performance isn't good enough
• Computation cost
  ‒ Scene complexity
  ‒ # of samples (rays)

Fully ray traced but using baked textures :)

• To speed it up,
  ‒ Reduce # of pixels to be shaded
• Pixel shading (sample) reduction
  ‒ Sample reuse (left & right)
  ‒ Foveated rendering

SAMPLE REUSE

FOVEATED RENDERING

• We can only see clearly where we are looking
• Shading at full rate everywhere is a waste of computation
• Steps
  ‒ Create a density map
  ‒ Ray trace 1 sample for each area
  ‒ Reconstruct the full resolution image

IMPLEMENTATION DETAIL

1. DENSITY MAP DATA STRUCTURE

• 2 data structures are precomputed
• Array<float2> samples( M )
  ‒ Sample position
  ‒ Normalized coordinate (x, y)
• Array<NeighborInfo> neighborInfo( N )
  ‒ For frame reconstruction
  ‒ Sample id[k]
  ‒ Sample weight[k]
• # of pixels: N
• # of samples: M
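One possible way to precompute the two arrays, as a rough sketch. The slides do not say how neighbors or weights are chosen; the brute-force k-nearest search and the normalized inverse-squared-distance weights below are assumptions made for illustration, as are the names `build_neighbor_info`, `sample_ids`, and `weights`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NeighborInfo:
    sample_ids: List[int]    # Sample id[k]
    weights: List[float]     # Sample weight[k], normalized to sum to 1


def build_neighbor_info(samples: List[Tuple[float, float]],
                        width: int, height: int, k: int = 3) -> List[NeighborInfo]:
    """Return one NeighborInfo per pixel (N = width * height), row-major."""
    info = []
    for py in range(height):
        for px in range(width):
            # Pixel center in the same normalized coordinates as the samples.
            cx, cy = (px + 0.5) / width, (py + 0.5) / height
            nearest = sorted(
                ((sx - cx) ** 2 + (sy - cy) ** 2, i)
                for i, (sx, sy) in enumerate(samples)
            )[:k]
            raw = [1.0 / (d2 + 1e-8) for d2, _ in nearest]   # assumed weighting
            total = sum(raw)
            info.append(NeighborInfo([i for _, i in nearest],
                                     [w / total for w in raw]))
    return info
```

Because the sample pattern is fixed, this cost is paid once offline; at run time the reconstruction pass only reads the table.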

2. ASSIGN A UNIQUE INDEX FOR EACH SAMPLE

• Execute a work item for each sample in the pattern
• Check which samples are in the rendered area
• Use atomic inc to get a unique index
  ‒ Count: # of samples
  ‒ Unique indices

Because multiple samples are taken in one render(), each sample needs a unique index that identifies its storage location.

[Figure: Samples, Ray, and Color buffers for the rendering area; the Samples buffer holds sample IDs 0, 5, 7, 2, 10, 23, 22 and Count = 7]
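The index assignment can be emulated on the CPU as follows. This is a sketch only: the serial loop stands in for one work item per sample, the plain integer `count` stands in for the atomic counter, and the function name and the `(x0, y0, x1, y1)` area layout are assumptions. On a GPU the order in which slots are handed out is not deterministic, which is fine because only uniqueness matters.

```python
def assign_unique_indices(samples, area):
    """Compact the IDs of samples inside `area` into consecutive slots."""
    x0, y0, x1, y1 = area
    sample_ids = [None] * len(samples)   # worst case: every sample is inside
    count = 0                            # stands in for the atomic counter
    for sid, (x, y) in enumerate(samples):        # one "work item" per sample
        if x0 <= x < x1 and y0 <= y < y1:
            slot = count                 # atomic inc returns the old value...
            count += 1                   # ...and bumps the counter
            sample_ids[slot] = sid       # slot also indexes the ray/color buffers
    return sample_ids[:count], count
```

The returned count is the number of work items to launch for the later per-sample stages.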

3. GENERATE PRIMARY RAYS

• Execute a work item for each sample in the range
• Read sampleID
• Read sample coordinates
• Generate a primary ray
• Store to ray buffer

[Figure: Samples, Ray, and Color buffers; Count = 7]
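A sketch of this stage under a simple pinhole camera. The camera model is an assumption (eye at the origin, looking down -z, y up, 90 degree vertical field of view); the slides do not specify one, and an HMD would use per-eye parameters. Function and parameter names are illustrative.

```python
import math


def generate_primary_ray(sample_xy, fov_y_deg=90.0, aspect=1.0):
    """Map a normalized sample position (x, y) in [0, 1]^2 to (origin, direction)."""
    x, y = sample_xy
    half_h = math.tan(math.radians(fov_y_deg) * 0.5)
    half_w = half_h * aspect
    dx = (2.0 * x - 1.0) * half_w        # left .. right
    dy = (1.0 - 2.0 * y) * half_h        # y = 0 is the top of the image
    dz = -1.0
    inv_len = 1.0 / math.sqrt(dx * dx + dy * dy + dz * dz)
    return (0.0, 0.0, 0.0), (dx * inv_len, dy * inv_len, dz * inv_len)


def generate_primary_rays(samples, sample_ids):
    """One 'work item' per compacted slot: read sampleID, read coords, store ray."""
    return [generate_primary_ray(samples[sid]) for sid in sample_ids]
```

Because each ray is built directly from the sample coordinate, irregular sample patterns cost nothing extra here, unlike a rasterizer tied to a regular pixel grid.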

4. RAY TRACE

• Execute a work item for each generated ray
• Trace ray + Shade

[Figure: Samples, Ray, and Color buffers; Count = 7]

5. RECONSTRUCT FRAME BUFFER

• Execute a work item for each pixel
• Weighted blend of k neighbors
• Go through the list of neighbors and fetch the computed pixel colors

[Figure: Input and Output images]
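The per-pixel blend can be sketched like this. For simplicity it assumes colors are stored per sample ID; in the pipeline described earlier the sample ID would first be mapped to its unique storage index. The function name and argument layout are illustrative.

```python
def reconstruct(neighbor_ids, neighbor_weights, sample_colors):
    """Blend each pixel from its k neighbor samples.

    neighbor_ids[p] and neighbor_weights[p] are the lists for pixel p;
    sample_colors[sid] is the shaded (r, g, b) of sample sid.
    """
    frame = []
    for ids, weights in zip(neighbor_ids, neighbor_weights):  # one work item per pixel
        r = g = b = 0.0
        for sid, w in zip(ids, weights):
            cr, cg, cb = sample_colors[sid]
            r += w * cr
            g += w * cg
            b += w * cb
        frame.append((r, g, b))
    return frame
```

This is a pure gather: each pixel only reads, so no atomics are needed in this pass.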

6. APPLY DISTORTION AND RENDER LR

• Render to LR
• Execute a work item for each pixel in the frame buffer
• Check if it is L or R
• Look up pixel value
• Chromatic separation
• Barrel distortion
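A sketch of the lookup-coordinate computation for this pass. The slides do not give the distortion model; the radial polynomial, the coefficients `k1` and `k2`, and the per-channel scale factors below are placeholder assumptions, and real values would come from the headset's lens calibration.

```python
def distort(u, v, k1=0.22, k2=0.24, scale=1.0):
    """Radially scale eye-local coordinates (u, v) in [-1, 1]; the center is fixed."""
    r2 = u * u + v * v
    f = scale * (1.0 + k1 * r2 + k2 * r2 * r2)
    return u * f, v * f


def lookup_coords(px, py, width, height, chroma_scale=(0.996, 1.0, 1.014)):
    """Return (eye, [(u, v) for R, G, B]) for an output pixel.

    The left half of the frame buffer is eye 0 (L), the right half is eye 1 (R).
    Each color channel gets a slightly different radial scale, which is one way
    to realize the chromatic separation step.
    """
    half = width // 2
    eye = 0 if px < half else 1
    u = ((px - eye * half) + 0.5) / half * 2.0 - 1.0
    v = (py + 0.5) / height * 2.0 - 1.0
    return eye, [distort(u, v, scale=s) for s in chroma_scale]
```

Each returned (u, v) is where that channel would be fetched from the reconstructed per-eye image.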

RESULT

• # of samples is reduced to 5% of full rate shading
• Could make it faster (10~30fps)
• Still not fast enough for VR
• Reduce the # of samples further?

USING MULTIPLE GPUS FOR LATENCY CRITICAL APPLICATION

HOW TO USE MULTIPLE GPUS

• Alternate frame rendering
  ‒ Assign each frame to one GPU
  ‒ Time to finish a frame doesn't change
• Frame split
  ‒ Split a frame and all GPUs work on the frame
  ‒ Can reduce the time to finish a frame
• Frame split is better for our purpose

CHALLENGE OF FRAME SPLIT

• Load balancing issue
• One GPU might finish immediately while another keeps running
• Workload of each pixel can be different
• Foveated rendering makes it worse
  ‒ Shading point density is not uniform on the screen

SEMI STATIC LOAD BALANCING

• Load balance once per frame rendering step
• Use statistics from the previous frame to load balance
• Start from an even split
• At each frame
  ‒ Render the assigned area
  ‒ Each GPU i reports the # of samples processed, $n_i$, and the time to complete the work, $t_i$
  ‒ Compute the processing speed of GPU i: $p_i = n_i / t_i$
  ‒ With perfect load balancing, the time to finish the work is $t = \sum_i n_i / \sum_i p_i$
  ‒ The work GPU i can process in time $t$ is $n'_i = t \, p_i$
  ‒ Compute the next frame split from the CDF of the sample distribution

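[Figure: CDF of the sample distribution; horizontal axis: Area, split into A0, A1, A2, A3; vertical axis: # of Samples, with per-GPU sample counts n0, n1, n2, n3]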
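The split computation can be sketched as below, following p_i = n_i/t_i, t = sum n_i / sum p_i, and a per-GPU target of t p_i. Splitting the frame into vertical strips and describing the sample distribution by a per-column histogram are assumptions; the slides only say the split comes from the CDF. All names are illustrative.

```python
def next_split(samples_done, times, samples_per_column):
    """Return (targets, boundaries) for the next frame.

    samples_done[i], times[i]: what GPU i reported for the previous frame.
    samples_per_column[c]: # of samples in screen column c (its running sum is the CDF).
    boundaries[i] is the first column that belongs to GPU i + 1.
    """
    speeds = [n / t for n, t in zip(samples_done, times)]        # p_i
    t_ideal = sum(samples_done) / sum(speeds)                    # t
    targets = [t_ideal * p for p in speeds]                      # t * p_i
    scale = sum(samples_per_column) / sum(targets)               # match histogram total
    boundaries, cum, cum_target, col = [], 0, 0.0, 0
    for target in targets[:-1]:
        cum_target += target * scale
        # Walk the CDF until adding the next column would exceed this GPU's share.
        while (col < len(samples_per_column)
               and cum + samples_per_column[col] <= cum_target + 1e-6):
            cum += samples_per_column[col]
            col += 1
        boundaries.append(col)
    return targets, boundaries


# Two GPUs did 200 samples each, but GPU 1 took 3x longer.
targets, boundaries = next_split([200, 200], [1.0, 3.0],
                                 [10, 20, 70, 200, 70, 20, 10])
print(targets, boundaries)
```

In this example the faster GPU is given the strip up to and including the dense foveal column, and the slower GPU gets the sparse remainder.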

APPLYING TO FOVEATED RENDERING

• Samples inside the assigned area of the frame buffer are not enough
• Samples in the other areas are not in the GPU's memory
• We need to reconstruct the frame buffer from neighbor samples
• Gather samples which have at least 1 neighbor in the assigned area
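One way to read the gather rule: a GPU needs every sample that appears in the neighbor list of at least one pixel it owns. A sketch under that reading, assuming the assigned area is a range of columns and the neighbor lists are stored per pixel in row-major order (both assumptions; names are illustrative):

```python
def samples_needed(neighbor_ids, width, x_begin, x_end):
    """Return sorted sample IDs referenced by any pixel in columns [x_begin, x_end)."""
    needed = set()
    for pixel, ids in enumerate(neighbor_ids):
        px = pixel % width                 # column of this pixel
        if x_begin <= px < x_end:
            needed.update(ids)             # includes samples lying outside the strip
    return sorted(needed)
```

Samples near a strip boundary end up traced by both neighboring GPUs, which trades a little redundant work for avoiding a cross-GPU exchange before reconstruction.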

RESULT

• More than 60fps on 4 GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample

Crytek Sponza (0.26M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs

Rungholt (6.7M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs

CLOSING THE TALK

• Showed an example of a rendering pipeline 100% written in GPU compute
• Showed how to extend a ray tracer for VR
• Showed fully manual usage of multiple GPUs
  ‒ vs. fully automatic by the driver (Crossfire)

CLOSING THE TALK

• Foveated Real-Time Ray Tracing for Virtual Reality Headset
• Ray Tracing Irregularly Distributed Samples on Multiple GPUs
• http://research.lighttransport.com/foveated-real-time-ray-tracing-for-virtual-reality-headset/index.html
• Thanks to Masahiro Fujita @ Light Transport Entertainment Inc.
