Getting The Best Out Of D3D12
Evan Hart, Principal Engineer, NVIDIA
Dave Oldcorn, D3D12 Technical Lead, AMD
Prerequisites
● An interest in D3D12
Ideally, already looked at D3D12
● Experienced Graphics Programmer
● Console programming experience
Beneficial, not required
Brief D3D12 Overview
The ‘What’ of D3D12
● Broad rethinking of the API
● Much closer to HW realities
● Model is more explicit
Less driver magic
“With great power comes great responsibility.”
● D3D12 answers many developer requests
● Be ready to use it wisely and it can reward you
Console Vs PC
● D3D12 offers a great porting story
More of the explicit control console devs crave
Much less driver interference
● Still a heterogeneous environment
Need to test carefully
Heed API and tool warnings (exposed corners)
Game will run on HW you never tested
Central Objects to D3D12
● Command Lists
● Bundles
● Pipeline State Objects
● Root Signature and Descriptor Tables
● Resource Heaps
Using Bundles And Lists
[Diagram labels: Draw, Dispatch, Bundle, Command List, Frame]
Command Lists & Bundles
● Bundle
Small object recording a few commands
Great for reuse, but a subset of commands
Like drawing 3 meshes in an object
● Command List
Useful for recording/submitting commands
Used to execute bundles and other commands
Pipeline State Object
● Collates most render state
Shaders, raster, blend
● All packaged and swapped together
Pipeline State Object
[Diagram: Pipeline State containing Pixel Shader, Vertex Shader, Rasterizer State, Depth State, Blend State, Input Layout, Topology, RT Format, Geometry Shader, Hull Shader, Domain Shader, Compute Shader]
Root Signature & Descriptor Tables
● New method for resource setting
● Flexible interface
Methods for changing large blocks
Methods for small bits quickly
Indexing and open-ended tables enable “bindless”-like behaviour
Resource Heaps
● New memory management primitive
● Tie multiple related resources into one heap
● App controls residency on the heap
Somewhat coarse
● Enables console-like memory aliasing
New HW Features
● Conservative Rasterization
● Raster Ordered Views
● Typed UAV
● PS write of stencil reference
● Volume tiled resources
Advice for the D3D12 Dev
Practical Developer Advice
● Small nuggets on key issues
● Advice is from experience
Multiple engines have done trial ports
Many months of experimentation
• Driver, API, and app level
Efficient Submission
● Record commands in parallel
● Reuse fragments via bundles
● Taking over some driver/runtime work
Make sure your code is efficient (and parallel)
● Submit in batches with ExecuteCommandLists
Submit throughout the frame
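The submission advice above can be sketched without any D3D12 headers. This is only a model: `RecordedList`, `RecordInParallel`, and `SubmitInBatches` are hypothetical names, strings stand in for GPU commands, and a real engine would record into per-thread command lists and hand each batch to the queue in one call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list; not a D3D12 type.
struct RecordedList {
    std::vector<std::string> commands;
};

// Each worker records into its own list, so recording needs no locking.
inline std::vector<RecordedList> RecordInParallel(std::size_t workerCount,
                                                  std::size_t drawsPerWorker) {
    std::vector<RecordedList> lists(workerCount);
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, drawsPerWorker] {
            for (std::size_t d = 0; d < drawsPerWorker; ++d) {
                lists[w].commands.push_back(
                    "draw " + std::to_string(w) + ":" + std::to_string(d));
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return lists;
}

// Models batched submission: each group of up to batchSize lists counts as
// one submit call. Returns the number of submit calls made.
inline std::size_t SubmitInBatches(const std::vector<RecordedList>& lists,
                                   std::size_t batchSize,
                                   std::vector<std::string>& gpuQueue) {
    assert(batchSize > 0);
    std::size_t submitCalls = 0;
    for (std::size_t first = 0; first < lists.size(); first += batchSize) {
        const std::size_t last = std::min(first + batchSize, lists.size());
        for (std::size_t i = first; i < last; ++i) {
            gpuQueue.insert(gpuQueue.end(),
                            lists[i].commands.begin(), lists[i].commands.end());
        }
        ++submitCalls;
    }
    return submitCalls;
}
```

Four workers with a batch size of two give two submit calls per frame in this model; the GPU-side order follows list order, not the order in which threads finished recording.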
Engine organisation
● Consider task oriented engines
Divide rendering into tasks
Run CPU tasks to build command lists
Use dependencies to order GPU submission
Also helps with resource barriers
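One way to read "use dependencies to order GPU submission" is as a topological sort over render tasks. The sketch below is an assumption about how such an engine might be organised, not the presenters' design; `RenderTask` and `SubmissionOrder` are hypothetical names, and the ordering uses Kahn's algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical render task: CPU work that builds one command list.
struct RenderTask {
    std::string name;
    std::vector<std::size_t> dependsOn;  // tasks whose GPU work must be submitted first
};

// Kahn's algorithm. Returns task indices in a valid submission order,
// or an empty vector if the dependencies contain a cycle.
inline std::vector<std::size_t> SubmissionOrder(const std::vector<RenderTask>& tasks) {
    std::vector<std::size_t> pending(tasks.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        for (std::size_t dep : tasks[i].dependsOn) {
            assert(dep < tasks.size());
            dependents[dep].push_back(i);
            ++pending[i];
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t t = ready.front();
        ready.pop();
        order.push_back(t);
        for (std::size_t d : dependents[t]) {
            if (--pending[d] == 0) ready.push(d);
        }
    }
    if (order.size() != tasks.size()) order.clear();  // cycle detected
    return order;
}
```

The CPU side of each task can still run in parallel; only the submission order is constrained. The same dependency edges are a natural place to decide where resource barriers are needed.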
Threading: Done Badly
[Diagram: thread timelines. Labels: Game Thread, Render Thread (Command List 0, Command List 1, Submit, Submit, Create Resource, Present), Async Thread, Worker Thread, Aux Thread (x3)]
App render code, runtime, driver all on one!
Threading: Done Well
[Diagram: thread timelines. Labels: Master Render Thread, Game Thread, Command List 0 to 3, Submit CL0 to CL3, CreateResource (x2), Compile PSO, Present]
Many solutions, key is parallelism!
PSO Practicalities
● Merged state removes driver validation costs
● Don’t needlessly thrash state
Just because it is a PSO, doesn’t mean every state needs to flip in HW
• Avoid toggling compute/graphics
• Avoid toggling tessellation
Use sensible defaults for don’t care fields
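A possible way to avoid needless toggling is to group draws by pipeline before recording. The sketch below assumes draw order is free to change (for example, opaque geometry); `DrawItem`, `CountTessellationToggles`, and `GroupByPipeline` are hypothetical names, and sorting is one technique among several rather than something the slides prescribe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical draw record; psoId stands in for a pipeline state object.
struct DrawItem {
    bool tessellated;
    int psoId;
    int meshId;
};

// Counts how often consecutive draws flip tessellation on or off.
inline std::size_t CountTessellationToggles(const std::vector<DrawItem>& draws) {
    std::size_t toggles = 0;
    for (std::size_t i = 1; i < draws.size(); ++i) {
        if (draws[i].tessellated != draws[i - 1].tessellated) ++toggles;
    }
    return toggles;
}

// Groups draws by (tessellated, psoId). stable_sort keeps the original
// relative order of draws that share a pipeline.
inline void GroupByPipeline(std::vector<DrawItem>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return std::tie(a.tessellated, a.psoId) <
                                std::tie(b.tessellated, b.psoId);
                     });
}
```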
Creating PSOs
● PSO creation can be costly
Probably means a compile
● Streaming threads should handle PSO
Gather state and create on async threads
Prevents stalls
Can handle specializations too
Deferred PSO Update
● “Quick first compile; better answer later”
Simple / generic / free initial shader
Start the compile of the better result
Substitute PSO when it’s ready
● Generic / specialized especially useful
Precompile the generic case
More optimal path for special cases, compiled on low priority thread
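The "substitute PSO when it's ready" idea can be sketched with a future that is polled without blocking. `PsoHandle` and `DeferredPso` are hypothetical; the sketch assumes the specialized compile runs elsewhere (a low-priority thread or job) and delivers its result through a `std::promise` or `std::async` with `std::launch::async`.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <utility>

// Hypothetical handle standing in for a compiled pipeline state object.
struct PsoHandle {
    int id;
};

class DeferredPso {
public:
    DeferredPso(PsoHandle generic, std::future<PsoHandle> specialized)
        : current_(generic), pending_(std::move(specialized)) {}

    // Called at draw time. Never blocks: returns the generic PSO until the
    // specialized compile has finished, then swaps it in once.
    PsoHandle Get() {
        if (pending_.valid() &&
            pending_.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            current_ = pending_.get();  // pending_ becomes invalid after get()
        }
        return current_;
    }

private:
    PsoHandle current_;
    std::future<PsoHandle> pending_;
};
```

A future created with `std::launch::deferred` never reports `ready` from `wait_for`, so this wrapper would keep returning the generic PSO in that case.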
Bundle Advice
● Aim for a moderate size (~12 draws)
Some potential overhead with setup
● Limit resource binding inheritance when possible
Enables more complete cooking of bundle
Lists Advice
● Aim for a decent size
Typically hundreds of draw calls
● Submit together when feasible
● Don’t expect lots of list reuse
Per-frame changes + overlap limitation
Post-processing might be an exception
• Still need 2-3 copies of that list
Using Command Allocators
Allocators and Lists
● Invisible consumers of GPU memory
● Hold on to memory until Destroy
● Reuse on similar data
Warm list == no allocation during list creation
● Destroy on different data
Reuse on disparate cases grows all lists to size of worst case over time
[Chart: List / Allocator memory usage across resets. Labels: Initial 100 draws; Reset; Same 100 draws (guaranteed no new allocations); Different 100 draws; 200 draws; 5 draws]
Allocator Advice
● Allocators are fastest when warm
Keep reusing allocator with lists of equal size
● Need 2T + N allocators minimum
T -> threads creating command lists
N -> extra pool for bundles
All lists/bundles on an allocator freed together
• Need to double/triple buffer for reusing the allocators
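The buffering requirement can be sketched as a pool that only hands back an allocator once the GPU has passed the fence value recorded when it was released. This is a model under assumptions: `AllocatorPool` and `MinimumAllocators` are hypothetical, allocators are represented by indices, and fence values are assumed to be released in increasing order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical pool of command allocators, identified by index.
class AllocatorPool {
public:
    // Returns a warm allocator the GPU has finished with, or creates a new one.
    std::size_t Acquire(std::uint64_t completedFence) {
        if (!inFlight_.empty() && inFlight_.front().fence <= completedFence) {
            const std::size_t id = inFlight_.front().id;
            inFlight_.pop_front();
            return id;  // reused: keeps the memory it grew to last time
        }
        return created_++;  // nothing safe to reuse yet
    }

    // Call after submitting the lists recorded with this allocator.
    void Release(std::size_t id, std::uint64_t fenceValue) {
        inFlight_.push_back(Entry{id, fenceValue});
    }

    std::size_t CreatedCount() const { return created_; }

private:
    struct Entry {
        std::size_t id;
        std::uint64_t fence;
    };
    std::deque<Entry> inFlight_;
    std::size_t created_ = 0;
};

// The slide's lower bound: 2T + N, with T recording threads and N bundle pools.
inline std::size_t MinimumAllocators(std::size_t threads, std::size_t bundlePools) {
    return 2 * threads + bundlePools;
}
```

With one recording thread and the GPU one frame behind, the pool settles at two allocators, which matches the factor of two in 2T + N.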
Root Signature
● Carefully lay out root signature
Group tables by frequency of change
Most frequent changes early in signature
● Standardize slots
Signature change costs
[Diagram: example root signature. Entries: Per-Draw Table Pointer, Per-Material Table Pointer, Per-Frame Table Pointer, ConstantBuffer pointer (modelview matrix, skinning), per-draw constants. Table contents shown: Tex, ConstBuf (shader params), ConstBuf (camera, eye...)]
Root Signature Cnt’d
● Place single items which change per-draw in the root arguments
● Costs of setting new table vary across HW
Cost varies from nearly 0 to O(N) work where N is items in table
● Avoid changes to individual items in tables
Requires app to instance table if in flight
Try to update whole table atomically
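The layout rule from the root signature slides (group by frequency of change, most frequent first) can be expressed as a simple ordering step. `RootSlot` and `OrderByChangeFrequency` are hypothetical names, and the update counts are illustrative inputs an engine would have to measure or estimate itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one root signature entry.
struct RootSlot {
    std::string name;
    std::size_t updatesPerFrame;  // assumed estimate of how often it is rebound
};

// Orders slots so the most frequently changed entries come first.
// stable_sort keeps the declared order for slots with equal frequency.
inline void OrderByChangeFrequency(std::vector<RootSlot>& slots) {
    std::stable_sort(slots.begin(), slots.end(),
                     [](const RootSlot& a, const RootSlot& b) {
                         return a.updatesPerFrame > b.updatesPerFrame;
                     });
}
```

Keeping the resulting order identical across root signatures is one way to act on the "standardize slots" point.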
Managing Resources with Heaps
● Committed
Monolithic, D3D11-style
● Placed
Offset in existing heap
● Reserved
Mapped to heaps like tiled resources
[Diagram labels: Resource [VA], Heap (x3), G-buffer, Postprocess buffer]
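Placing resources at offsets in a heap, and later reusing the same range for a different resource, can be modelled with a bump allocator. `HeapBumpAllocator` is hypothetical, and the 64 KB alignment is an assumption for ordinary (non-MSAA) resources. A real implementation would create the resources at these offsets and would also need an aliasing barrier when switching between resources that share memory.

```cpp
#include <cassert>
#include <cstdint>

// Assumed placement alignment for ordinary resources: 64 KB.
constexpr std::uint64_t kPlacementAlignment = 64 * 1024;
constexpr std::uint64_t kOutOfSpace = ~0ull;

inline std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical linear sub-allocator over one heap.
class HeapBumpAllocator {
public:
    explicit HeapBumpAllocator(std::uint64_t heapSize) : size_(heapSize) {}

    // Returns the aligned offset for a resource of the given size,
    // or kOutOfSpace if it does not fit in the remaining heap.
    std::uint64_t Place(std::uint64_t bytes) {
        const std::uint64_t offset = AlignUp(cursor_, kPlacementAlignment);
        if (offset > size_ || bytes > size_ - offset) return kOutOfSpace;
        cursor_ = offset + bytes;
        return offset;
    }

    // Rewinds so later placements alias the memory of earlier ones.
    void Reset() { cursor_ = 0; }

private:
    std::uint64_t size_;
    std::uint64_t cursor_ = 0;
};
```

For example, a G-buffer placed early in the frame and a postprocess buffer placed after `Reset()` would both land at offset 0 and share the same heap memory.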
Choosing a resource type:
Committed
Need per-resource residency
Don’t need aliasing
Placed
Cheaper create / destroy
Can group in heaps of similar residency
Want to alias over others
Small resources
Tiled / Reserved
Need flexibility of memory management
Can tolerate CPU and GPU overheads of ResourceMap
Resource tips
● Committed gives driver more knowledge
● Tiled resources have separate caps
Need to prepare for HW without it
● Memory might be segmented
Cannot allocate entire space in a single heap
Residency tips
● MakeResident: Batch these up
Expect CPU and GPU cost for page table updates
● MakeUnresident
Cost of move may be deferred; may be seen on future MakeResident
Working Set Management
● Application has much more control in D3D12
● Directly tells the video memory manager which resources are required
● App can be sharper on memory than before
On D3D11, working set per frame typically much smaller than the set of registered resources
Less likely to end up with object in slow memory
Working to a budget
● “Budget” is the memory you can use
● Get under the budget using residency
MakeUnresident makes object candidate to swap to system memory
It is much cheaper to make a resource unresident and later resident again than to destroy and re-create it
● Tiled resources can drop mip levels dynamically
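Staying under a budget by dropping residency could look like the sketch below. The least-recently-used policy is an assumption, not something the slides specify, and `TrackedResource` and `TrimToBudget` are hypothetical. In the released API the slides' MakeUnresident is exposed as `ID3D12Device::Evict`; this model only flips a flag where a real engine would batch those calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping entry for one heap or committed resource.
struct TrackedResource {
    int id;
    std::uint64_t bytes;
    std::uint64_t lastUsedFrame;
    bool resident;
};

// Marks least-recently-used resources non-resident until the resident total
// fits the budget. Resources used in the current frame are never chosen.
// Returns the ids that were made non-resident, oldest first.
inline std::vector<int> TrimToBudget(std::vector<TrackedResource>& resources,
                                     std::uint64_t budgetBytes,
                                     std::uint64_t currentFrame) {
    std::uint64_t residentBytes = 0;
    std::vector<TrackedResource*> candidates;
    for (TrackedResource& r : resources) {
        if (!r.resident) continue;
        residentBytes += r.bytes;
        if (r.lastUsedFrame < currentFrame) candidates.push_back(&r);
    }
    std::sort(candidates.begin(), candidates.end(),
              [](const TrackedResource* a, const TrackedResource* b) {
                  return a->lastUsedFrame < b->lastUsedFrame;
              });
    std::vector<int> evicted;
    for (TrackedResource* r : candidates) {
        if (residentBytes <= budgetBytes) break;
        r->resident = false;
        residentBytes -= r->bytes;
        evicted.push_back(r->id);
    }
    return evicted;
}
```

The returned list is what an engine would pass to a single batched residency call, in line with the earlier advice to batch residency changes.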
Barriers & Hazards
● Most objects stay in one state from creation
Don’t insert redundant barriers
● Always specify the right set of target units
Allows for minimal barrier
● Group barriers into same Barrier call
Will take the worst case of all, rather than potentially incurring multiple sequential barriers
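Two of the points above, skipping redundant barriers and grouping the rest into one call, can be modelled with a small batching helper. `State`, `Transition`, and `BarrierBatch` are hypothetical; in a real renderer `Flush` would pass the whole array to a single `ResourceBarrier` call on the command list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical resource states; not the D3D12 enumeration.
enum class State { Common, RenderTarget, ShaderResource, UnorderedAccess, CopyDest };

struct Transition {
    int resourceId;
    State before;
    State after;
};

class BarrierBatch {
public:
    // Queues a transition, dropping it if the state does not change.
    void Add(const Transition& t) {
        if (t.before == t.after) return;  // redundant barrier
        pending_.push_back(t);
    }

    // Issues all queued transitions as one barrier call.
    // Returns how many transitions that call contained.
    std::size_t Flush() {
        if (pending_.empty()) return 0;
        ++calls_;
        const std::size_t count = pending_.size();
        pending_.clear();
        return count;
    }

    std::size_t Calls() const { return calls_; }

private:
    std::vector<Transition> pending_;
    std::size_t calls_ = 0;
};
```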
Barriers enhance concurrency
● Resources both read and written in a given draw created dependency between draws
Most common case was UAV used in adjacent dispatches
[Diagram: D3D11 dispatches 0 to 2 and draws 0 to 3, shown as a logical view of draws and a GPU timeline of draws, with a barrier]
Barrier enables overlap
● Explicit barrier eliminates issue
App tells API when a true dependency exists, rather than it being assumed
[Diagram: Dispatch 0, Dispatch 1, Dispatch 2 shown twice: logical view of dispatches, and dispatches with explicit barrier control]
CPU side
● D3D12 simplifies picture
Easier to associate driver effort with application actions
Less likely that driver itself is the bottleneck
● Be aware of your system buses
GPU side
● Environment is new
Less familiar without console experience
Interesting new hardware limits are now accessible
● Use the tools
Wrap up
Get Ready
● D3D12 done right isn’t just an API port
More so when referring to consoles
● Good engine design offers a lot of opportunity
● The power you’ve been asking for is here
Questions