takahiro-harada
2 | DEC 3, 2014
INTRO
• Ray Tracing + Foveated rendering + VR + Multiple GPUs == A lot of GPU compute!!
• Compute fills a texture
• Use GL/CL interop to display
GPU RAY TRACING
• Everything is written in compute
• Our renderer is 100% OpenCL
  ‒ Win, Linux, OSX
  ‒ GPU, CPU
• High quality rendering compared to raster graphics
GPU RAY TRACING
IMPLEMENTATION CHOICES
• A single big kernel
  ‒ Easy to port
  ‒ Works
• Do you write only 1 pixel shader??
• Drawbacks
  ‒ Performance, limited by SIMD divergence and GPU occupancy (uses too many VGPRs)
  ‒ Maintainability
  ‒ Extensibility
  ‒ Portability
  ‒ Debugging
• Multiple kernel implementation
HOW MANY WGS CAN WE EXECUTE PER SIMD (AMD GPU)
• 10 wavefronts (64 WIs each) per SIMD is the max
• It depends on the local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n work groups
  ‒ 1 wavefront, 256 VGPRs :( :(
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs :)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB :( :(
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB :)
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 32KB/SIMD
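The occupancy arithmetic above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `maxWavefrontsPerSimd` is a hypothetical helper, each work group is taken to be one wavefront, the budgets are the 256 VGPRs and 16KB LDS share quoted on the slide, and any register allocation granularity of real hardware is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: upper bound on resident wavefronts per SIMD,
// computed only from the figures quoted on the slide.
int maxWavefrontsPerSimd(int vgprsPerWorkItem, int ldsBytesPerWorkGroup) {
    const int kMaxWavefronts = 10;     // cap per SIMD
    const int kVgprBudget = 256;       // VGPRs per SIMD lane
    const int kLdsBudget = 16 * 1024;  // LDS share assumed per SIMD, in bytes
    assert(vgprsPerWorkItem > 0 && vgprsPerWorkItem <= kVgprBudget);
    assert(ldsBytesPerWorkGroup >= 0 && ldsBytesPerWorkGroup <= kLdsBudget);

    // Each limit is "budget divided by usage", rounded down.
    int byVgpr = kVgprBudget / vgprsPerWorkItem;
    int byLds = (ldsBytesPerWorkGroup == 0)
                    ? kMaxWavefronts
                    : kLdsBudget / ldsBytesPerWorkGroup;
    return std::min({kMaxWavefronts, byVgpr, byLds});
}
```

A kernel using 64 VGPRs and no LDS would be limited to 4 wavefronts in this model; adding 8KB of LDS per work group would drop it to 2.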
GPU RAY TRACING
Single kernel implementation
  Host Code
    launch( RayTraceKernel );
  Device Code
    __kernel void RayTraceKernel();

Multiple kernel implementation
  Host Code
    launch( PrimaryRayGenKernel );
    while(1)
    {
        launch( TraceKernel );
        if( !any( hits ) ) break;
        launch( SampleLightKernel );
        launch( TraceKernel );
        launch( AccumulateDIKernel );
        launch( SampleNextRayKernel );
    }
  Device Code
    __kernel void PrimaryRayGenKernel()
    __kernel void TraceKernel()
    __kernel void SampleLightKernel()
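The control flow of the multiple kernel host loop can be mimicked on the CPU. The sketch below is a toy model in C++, not the renderer's code: the `Path` struct and every function in it are hypothetical stand-ins, each function plays the role of one kernel launch over all paths, and light sampling plus the shadow trace are reduced to a comment.

```cpp
#include <cassert>
#include <vector>

// Toy per-path state; a real renderer would keep rays, hits, and
// radiance in separate GPU buffers.
struct Path {
    int bouncesLeft;  // how many more surfaces this path will hit
    bool hit;         // result of the latest trace
    int radiance;     // number of direct-light contributions so far
};

void traceKernel(std::vector<Path>& paths) {
    for (Path& p : paths) {
        assert(p.bouncesLeft >= 0);
        p.hit = p.bouncesLeft > 0;
    }
}

bool anyHit(const std::vector<Path>& paths) {
    for (const Path& p : paths) if (p.hit) return true;
    return false;
}

void accumulateDIKernel(std::vector<Path>& paths) {
    for (Path& p : paths) if (p.hit) p.radiance += 1;
}

void sampleNextRayKernel(std::vector<Path>& paths) {
    for (Path& p : paths) if (p.hit) --p.bouncesLeft;
}

// Host loop: returns the number of bounce iterations executed.
int renderPaths(std::vector<Path>& paths) {
    int iterations = 0;
    while (true) {
        traceKernel(paths);          // closest-hit trace
        if (!anyHit(paths)) break;   // every path has terminated
        // SampleLightKernel and the shadow TraceKernel would run here.
        accumulateDIKernel(paths);
        sampleNextRayKernel(paths);
        ++iterations;
    }
    return iterations;
}
```

The loop runs as long as the longest path is alive, which is why short paths leave work items idle unless the ray buffer is compacted between launches.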
RAY TRACING + VR
• Ray tracing is flexible
• Raster graphics: a single projection matrix
• Can cast rays in arbitrary directions
• Easy to set up VR
• But performance isn’t good enough
• Computation cost
  ‒ Scene complexity
  ‒ # of samples (rays)
Fully ray traced but using baked textures:)
• To speed it up
  ‒ Reduce # of pixels to be shaded
• Pixel shading (sample) reduction
  ‒ Sample reuse (left & right)
  ‒ Foveated rendering
FOVEATED RENDERING
• We can only see clearly where we are looking
• Shading at full rate everywhere is a waste of computation
• Steps
  ‒ Create a density map
  ‒ Ray trace 1 sample for each area
  ‒ Reconstruct the full resolution image
1. DENSITY MAP DATA STRUCTURE
• 2 data structures are precomputed
• Array<float2> samples( M )
  ‒ Sample position
  ‒ Normalized coordinate (x, y)
• Array<NeighborInfo> neighborInfo( N )
  ‒ For frame reconstruction
  ‒ Sample id[k]
  ‒ Sample weight[k]
• # of pixels: N
• # of samples: M
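One possible C++ layout for the two precomputed arrays is sketched below. The field names follow the slide, but the value `K = 4`, the `DensityMap` wrapper, and the helpers `neighborsOf` and `isConsistent` are assumptions added for illustration, as is the idea that the k weights of a pixel sum to 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int K = 4;  // neighbors per pixel; the slide only calls this k

struct Float2 { float x, y; };

// Per-pixel record used for frame reconstruction.
struct NeighborInfo {
    int id[K];        // indices into the sample array
    float weight[K];  // blend weight of each neighbor
};

struct DensityMap {
    std::vector<Float2> samples;             // M entries, normalized (x, y)
    std::vector<NeighborInfo> neighborInfo;  // N entries, one per pixel
};

const NeighborInfo& neighborsOf(const DensityMap& map, std::size_t pixel) {
    assert(pixel < map.neighborInfo.size());
    return map.neighborInfo[pixel];
}

// Sanity check: every id refers to a sample and the weights sum to ~1
// (normalized weights are an assumption of this sketch).
bool isConsistent(const DensityMap& map) {
    const int numSamples = static_cast<int>(map.samples.size());
    for (const NeighborInfo& n : map.neighborInfo) {
        float sum = 0.0f;
        for (int j = 0; j < K; ++j) {
            if (n.id[j] < 0 || n.id[j] >= numSamples) return false;
            sum += n.weight[j];
        }
        if (std::fabs(sum - 1.0f) > 1e-4f) return false;
    }
    return true;
}
```

Because both arrays depend only on the sampling pattern and the resolution, they can be built once and reused every frame.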
2. ASSIGN A UNIQUE INDEX FOR EACH SAMPLE
• Execute a work item for each sample in the pattern
• Check which samples are in the rendered area
• Use an atomic increment to get a unique index
  ‒ Count: # of samples
  ‒ Unique indices
As multiple samples are taken for a render(), unique indices are necessary to identify the storage location
[Figure: sample IDs (0, 5, 7, 2, 10, 23, 22) inside the rendering area, the sample count (7), and the Ray and Color buffers]
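The index assignment can be sketched on the CPU with `std::atomic`. This is a minimal C++ illustration, assuming a rectangular rendered area in normalized coordinates; `Rect` and `compactSamples` are hypothetical names, and the loop body stands in for one GPU work item.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct Float2 { float x, y; };
struct Rect { float x0, y0, x1, y1; };  // rendered area, normalized coordinates

// Returns the IDs of the samples inside `area`. Position i of the result
// is the unique storage index (ray and color slot) given to that sample.
std::vector<int> compactSamples(const std::vector<Float2>& samples,
                                const Rect& area) {
    assert(area.x0 <= area.x1 && area.y0 <= area.y1);
    std::vector<int> sampleIds(samples.size());
    std::atomic<int> count{0};
    // One loop iteration corresponds to one work item.
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const Float2& s = samples[i];
        bool inside = s.x >= area.x0 && s.x < area.x1 &&
                      s.y >= area.y0 && s.y < area.y1;
        if (inside) {
            int slot = count.fetch_add(1);  // atomic increment
            sampleIds[static_cast<std::size_t>(slot)] = static_cast<int>(i);
        }
    }
    sampleIds.resize(static_cast<std::size_t>(count.load()));
    return sampleIds;
}
```

Run sequentially, the slots come out in pattern order; on a GPU the order depends on scheduling, but each sample still receives a distinct slot and `count` ends up as the number of samples to trace.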
3. GENERATE PRIMARY RAYS
• Execute a work item for each sample in the range
• Read the sampleID
• Read the sample coordinates
• Generate a primary ray
• Store it to the ray buffer
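The steps above can be sketched as follows in C++. The camera model is an assumption (a pinhole camera at the origin looking down -z, with hypothetical `fovYRadians` and `aspect` parameters); the talk does not say how rays are set up for the headset.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };
struct Ray { Float3 origin, dir; };

// Assumed pinhole camera: maps a normalized sample position in [0, 1]^2
// to a unit-length direction.
Ray generatePrimaryRay(Float2 s, float fovYRadians, float aspect) {
    float halfH = std::tan(0.5f * fovYRadians);
    float px = (2.0f * s.x - 1.0f) * halfH * aspect;
    float py = (1.0f - 2.0f * s.y) * halfH;
    float len = std::sqrt(px * px + py * py + 1.0f);
    return Ray{{0.0f, 0.0f, 0.0f}, {px / len, py / len, -1.0f / len}};
}

std::vector<Ray> generatePrimaryRays(const std::vector<Float2>& samples,
                                     const std::vector<int>& sampleIds,
                                     float fovYRadians, float aspect) {
    std::vector<Ray> rays(sampleIds.size());
    // One loop iteration corresponds to one work item.
    for (std::size_t i = 0; i < sampleIds.size(); ++i) {
        int id = sampleIds[i];  // read the sampleID
        assert(id >= 0 && id < static_cast<int>(samples.size()));
        const Float2& s = samples[static_cast<std::size_t>(id)];  // read coordinates
        rays[i] = generatePrimaryRay(s, fovYRadians, aspect);     // store to ray buffer
    }
    return rays;
}
```

Since a ray is generated from an arbitrary (x, y) position rather than a pixel grid, the same routine works for any sampling pattern.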
4. RAY TRACE
• Execute a work item for each generated ray
• Trace ray + Shade
5. RECONSTRUCT FRAME BUFFER
• Execute a work item for each pixel
• Weighted blend of k neighbors
• Go through the list of neighbors and fetch the computed pixel color
[Figure: input and output of the reconstruction]
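The reconstruction pass amounts to a gather per pixel. A minimal C++ sketch, assuming `K = 4` neighbors and the hypothetical names `Color` and `reconstruct`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int K = 4;  // neighbors per pixel; the value is an assumption

struct Color { float r, g, b; };
struct NeighborInfo { int id[K]; float weight[K]; };

// For each pixel, blend the traced colors of its k neighbor samples.
std::vector<Color> reconstruct(const std::vector<NeighborInfo>& neighborInfo,
                               const std::vector<Color>& sampleColor) {
    std::vector<Color> frame(neighborInfo.size(), Color{0.0f, 0.0f, 0.0f});
    // One loop iteration corresponds to one work item (one pixel).
    for (std::size_t p = 0; p < neighborInfo.size(); ++p) {
        const NeighborInfo& n = neighborInfo[p];
        for (int j = 0; j < K; ++j) {
            assert(n.id[j] >= 0 && n.id[j] < static_cast<int>(sampleColor.size()));
            const Color& c = sampleColor[static_cast<std::size_t>(n.id[j])];
            frame[p].r += n.weight[j] * c.r;
            frame[p].g += n.weight[j] * c.g;
            frame[p].b += n.weight[j] * c.b;
        }
    }
    return frame;
}
```

Each pixel only reads, so work items never conflict and no atomics are needed in this step.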
6. APPLY DISTORTION AND RENDER LR
• Render to L and R (left and right eye)
• Execute a work item for each pixel in the frame buffer
• Check if it is L or R
• Look up the pixel value
• Chromatic separation
• Barrel distortion
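The talk does not give the distortion formula, so the C++ sketch below uses a common radial polynomial as a stand-in. Everything in it is an assumption: the side-by-side eye layout, the coefficients `k1` and `k2`, and the per-channel scale factors used for chromatic separation.

```cpp
#include <cassert>

struct Float2 { float x, y; };
struct ChannelCoords { Float2 r, g, b; };

// Assumed side-by-side layout: left half of the frame buffer is the left eye.
bool isLeftEye(int px, int width) {
    assert(width > 0 && px >= 0 && px < width);
    return px < width / 2;
}

// Radial distortion of a lens-centered coordinate in [-1, 1]^2:
// p' = p * (1 + k1 * r^2 + k2 * r^4).
Float2 barrel(Float2 p, float k1, float k2) {
    float r2 = p.x * p.x + p.y * p.y;
    float s = 1.0f + k1 * r2 + k2 * r2 * r2;
    return Float2{p.x * s, p.y * s};
}

// Chromatic separation: red and blue look up the source image at a
// slightly different radius than green.
ChannelCoords lookupCoords(Float2 p, float k1, float k2,
                           float redScale, float blueScale) {
    Float2 g = barrel(p, k1, k2);
    Float2 r = Float2{g.x * redScale, g.y * redScale};
    Float2 b = Float2{g.x * blueScale, g.y * blueScale};
    return ChannelCoords{r, g, b};
}
```

The lens center is left unchanged by `barrel`, and points farther out are pushed farther, which pre-compensates the opposite distortion of the headset lens.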
RESULT
• # of samples is reduced to 5% of full rate shading
• Could make it faster (10~30fps)
• Still not fast enough for VR
• Reduction of more samples?
HOW TO USE MULTIPLE GPUS
• Alternate frame rendering
  ‒ Assign the rendering of a whole frame to one GPU
  ‒ Time to finish a frame doesn’t change
• Frame split
  ‒ Split a frame and all GPUs work on the frame
  ‒ Can reduce the time to finish a frame
• Frame split is better for our purpose
CHALLENGE OF FRAME SPLIT
• Load balancing issue
• A GPU finishes immediately, another might keep running forever
• Workload of each pixel can be different
• Foveated rendering makes it worse
  ‒ Shading point density is not uniform on the screen
SEMI STATIC LOAD BALANCING
• Load balancing once for each frame rendering step
• Use statistics from the previous frame to load balance
• Start from an even split
• At each frame
  ‒ Render the assigned area
  ‒ Each GPU reports # of samples processed (n_i) and time to complete the work (t_i)
  ‒ Compute the processing speed for GPU i: p_i = n_i / t_i
  ‒ With perfect load balancing, the time to finish the work is t = sum n_i / sum p_i
  ‒ The work GPU i can process in time t is n_i = t p_i
  ‒ Compute the next frame split from the CDF of the sample distribution
[Figure: # of samples (n0, n1, n2, n3) versus area (A0, A1, A2, A3)]
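The load balancing arithmetic can be sketched in C++ as below. `balancedSampleCounts` follows the formulas on the slide; `splitFromCdf` is a hypothetical helper that assumes the frame is split into vertical strips and that a per-column sample histogram is available, neither of which is stated in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Samples each GPU should process next frame: n_i = t * p_i, with
// p_i = n_i / t_i and t = sum n_i / sum p_i.
std::vector<double> balancedSampleCounts(const std::vector<double>& n,
                                         const std::vector<double>& time) {
    assert(!n.empty() && n.size() == time.size());
    std::vector<double> p(n.size());
    double sumN = 0.0, sumP = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        assert(time[i] > 0.0);
        p[i] = n[i] / time[i];  // processing speed of GPU i
        sumN += n[i];
        sumP += p[i];
    }
    double t = sumN / sumP;     // ideal time to finish the frame
    std::vector<double> next(n.size());
    for (std::size_t i = 0; i < n.size(); ++i) next[i] = t * p[i];
    return next;
}

// Walks the cumulative sample count over screen columns and places a
// boundary once a GPU's target is filled. GPU i renders columns
// [bounds[i], bounds[i + 1]).
std::vector<int> splitFromCdf(const std::vector<double>& samplesPerColumn,
                              const std::vector<double>& target) {
    assert(!target.empty());
    std::vector<int> bounds(1, 0);
    double cumulative = 0.0, goal = 0.0;
    std::size_t col = 0;
    for (std::size_t g = 0; g + 1 < target.size(); ++g) {
        goal += target[g];
        while (col < samplesPerColumn.size() &&
               cumulative + samplesPerColumn[col] <= goal) {
            cumulative += samplesPerColumn[col];
            ++col;
        }
        bounds.push_back(static_cast<int>(col));
    }
    bounds.push_back(static_cast<int>(samplesPerColumn.size()));
    return bounds;
}
```

For example, if two GPUs each processed 100 samples but took 1 and 3 time units, the faster one would be assigned about 150 samples and the slower one about 50 in the next frame.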
APPLYING TO FOVEATED RENDERING
• Samples inside the assigned area of the frame buffer are not enough
• Samples in the other areas are not in the GPU's memory
• We need to reconstruct the frame buffer from neighbor samples
• Gather samples which have at least 1 neighbor in the assigned area
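One way to read the gather step is "collect every sample that some pixel of the assigned area lists as a neighbor". The C++ sketch below implements that reading; the name `gatherSamples`, the value `K = 4`, and describing the assigned area as a contiguous pixel index range are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int K = 4;  // neighbors per pixel; the value is an assumption

struct NeighborInfo { int id[K]; float weight[K]; };

// IDs of the samples a GPU must trace to reconstruct the pixels
// [pixelBegin, pixelEnd) assigned to it.
std::vector<int> gatherSamples(const std::vector<NeighborInfo>& neighborInfo,
                               int numSamples, int pixelBegin, int pixelEnd) {
    assert(numSamples >= 0);
    assert(0 <= pixelBegin && pixelBegin <= pixelEnd);
    assert(pixelEnd <= static_cast<int>(neighborInfo.size()));
    std::vector<char> needed(static_cast<std::size_t>(numSamples), 0);
    for (int p = pixelBegin; p < pixelEnd; ++p) {
        const NeighborInfo& n = neighborInfo[static_cast<std::size_t>(p)];
        for (int j = 0; j < K; ++j) {
            assert(n.id[j] >= 0 && n.id[j] < numSamples);
            needed[static_cast<std::size_t>(n.id[j])] = 1;  // mark as required
        }
    }
    std::vector<int> ids;
    for (int s = 0; s < numSamples; ++s) {
        if (needed[static_cast<std::size_t>(s)]) ids.push_back(s);
    }
    return ids;
}
```

Samples near a split boundary end up in the lists of both neighboring GPUs, so a small amount of work is duplicated in exchange for avoiding sample transfers between GPUs.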
RESULT
• More than 60fps on 4 GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample
Crytek Sponza (0.26M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs
Rungholt (6.7M tris): ~12ms/frame, 32 shadow rays/sample, 4x AMD FirePro W9000 GPUs
CLOSING THE TALK
• Showed an example of a rendering pipeline written 100% in GPU compute
• Showed how to extend a ray tracer for VR
• Showed a fully manual usage of multiple GPUs
  ‒ As opposed to fully automatic handling by the driver (Crossfire)
• Foveated Real-Time Ray Tracing for Virtual Reality Headset
• Ray Tracing Irregularly Distributed Samples on Multiple GPUs
• http://research.lighttransport.com/foveated-real-time-ray-tracing-for-virtual-reality-headset/index.html
• Thanks to Masahiro Fujita @ Light Transport Entertainment Inc.