Getting The Best Out Of D3D12 Evan Hart, Principal Engineer, NVIDIA Dave Oldcorn, D3D12 Technical Lead, AMD


Page 1: Getting the-best-out-of-d3 d12

Getting The Best Out Of D3D12

Evan Hart, Principal Engineer, NVIDIA

Dave Oldcorn, D3D12 Technical Lead, AMD

Page 2: Getting the-best-out-of-d3 d12

Prerequisites

● An interest in D3D12

Ideally, already looked at D3D12

● Experienced Graphics Programmer

● Console programming experience

Beneficial, not required

Page 3: Getting the-best-out-of-d3 d12

Brief D3D12 Overview

Page 4: Getting the-best-out-of-d3 d12

The ‘What’ of D3D12

● Broad rethinking of the API

● Much closer to HW realities

● Model is more explicit

Less driver magic

Page 5: Getting the-best-out-of-d3 d12

“With great power comes great responsibility.”

● D3D12 answers many developer requests

● Be ready to use it wisely and it can reward you

Page 6: Getting the-best-out-of-d3 d12

Console Vs PC

● D3D12 offers a great porting story

More of the explicit control console devs crave

Much less driver interference

● Still a heterogeneous environment

Need to test carefully

Heed API and tool warnings (exposed corners)

Game will run on HW you never tested

Page 7: Getting the-best-out-of-d3 d12

Central Objects to D3D12

● Command Lists

● Bundles

● Pipeline State Objects

● Root Signature and Descriptor Tables

● Resource Heaps

Page 8: Getting the-best-out-of-d3 d12

Using Bundles And Lists

[Diagram: Draw and Dispatch calls grouped into a Bundle, Bundles into a Command List, Command Lists into a Frame.]

Page 9: Getting the-best-out-of-d3 d12

Command Lists & Bundles

● Bundle

Small object recording a few commands

Great for reuse, but a subset of commands

Like drawing 3 meshes in an object

● Command List

Useful for recording/submitting commands

Used to execute bundles and other commands

Page 10: Getting the-best-out-of-d3 d12

Pipeline State Object

● Collates most render state

Shaders, raster, blend

● All packaged and swapped together

Page 11: Getting the-best-out-of-d3 d12

Pipeline State Object

[Diagram: a single Pipeline State object containing Pixel Shader, Vertex Shader, Rasterizer State, Depth State, Blend State, Input Layout, Topology, RT Format, Geometry Shader, Hull Shader, Domain Shader and Compute Shader.]

Page 12: Getting the-best-out-of-d3 d12

Root Signature & Descriptor Tables

● New method for resource setting

● Flexible interface

Methods for changing large blocks

Methods for small bits quickly

Indexing and open-ended tables enable “bindless”-like behaviour

Page 13: Getting the-best-out-of-d3 d12

Resource Heaps

● New memory management primitive

● Tie multiple related resources into one heap

● App controls residency on the heap

Somewhat coarse

● Enables console-like memory aliasing

Page 14: Getting the-best-out-of-d3 d12

New HW Features

● Conservative Rasterization

● Raster Ordered Views

● Typed UAV

● PS write of stencil reference

● Volume tiled resources

Page 15: Getting the-best-out-of-d3 d12

Advice for the D3D12 Dev

Page 16: Getting the-best-out-of-d3 d12

Practical Developer Advice

● Small nuggets on key issues

● Advice is from experience

Multiple engines have done trial ports

Many months of experimentation

• Driver, API, and app level

Page 17: Getting the-best-out-of-d3 d12

Efficient Submission

● Record commands in parallel

● Reuse fragments via bundles

● Taking over some driver/runtime work

Make sure your code is efficient (and parallel)

● Submit in batches with ExecuteCommandLists

Submit throughout the frame
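The submission advice above can be sketched without the real API. The following is a minimal model, not D3D12 code: `CommandList`, `Queue`, `recordInParallel` and `executeBatch` are hypothetical stand-ins, with one worker thread recording each list and a single call submitting the whole batch in order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: an ordered list of command names.
struct CommandList {
    std::vector<std::string> commands;
};

// Record `listCount` command lists in parallel, one worker thread per list.
// Each thread writes only to its own list, so no locking is needed.
std::vector<CommandList> recordInParallel(std::size_t listCount, std::size_t drawsPerList) {
    assert(listCount > 0);
    std::vector<CommandList> lists(listCount);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < listCount; ++i) {
        workers.emplace_back([&lists, i, drawsPerList] {
            for (std::size_t d = 0; d < drawsPerList; ++d) {
                lists[i].commands.push_back("draw " + std::to_string(i) + "." + std::to_string(d));
            }
        });
    }
    for (auto& w : workers) w.join();
    return lists;
}

// Hypothetical queue: one call submits a whole batch, preserving list order.
struct Queue {
    std::size_t submitCalls = 0;
    std::vector<std::string> executed;

    void executeBatch(const std::vector<CommandList>& batch) {
        ++submitCalls;
        for (const auto& cl : batch) {
            executed.insert(executed.end(), cl.commands.begin(), cl.commands.end());
        }
    }
};
```

Recording order across threads is irrelevant here; only the order of lists within the submitted batch decides the order of execution.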

Page 18: Getting the-best-out-of-d3 d12

Engine organisation

● Consider task oriented engines

Divide rendering into tasks

Run CPU tasks to build command lists

Use dependencies to order GPU submission

Also helps with resource barriers
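One way to picture "use dependencies to order GPU submission" is a topological sort over render tasks. This is a sketch under assumptions: `RenderTask` and `submissionOrder` are hypothetical names, and Kahn's algorithm is just one reasonable choice for deriving the order.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical render task: CPU work that builds one command list.
struct RenderTask {
    std::string name;
    std::vector<std::size_t> dependsOn; // indices of tasks whose GPU work must run first
};

// Kahn's algorithm: returns task indices in a valid GPU submission order.
std::vector<std::size_t> submissionOrder(const std::vector<RenderTask>& tasks) {
    std::vector<std::size_t> pending(tasks.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        for (std::size_t dep : tasks[i].dependsOn) {
            assert(dep < tasks.size());
            ++pending[i];
            dependents[dep].push_back(i);
        }
    }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        if (pending[i] == 0) ready.push(i);
    }
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t t = ready.front();
        ready.pop();
        order.push_back(t);
        for (std::size_t d : dependents[t]) {
            if (--pending[d] == 0) ready.push(d);
        }
    }
    assert(order.size() == tasks.size()); // a dependency cycle would leave tasks unscheduled
    return order;
}
```

The same dependency edges are a natural place to hang resource barriers, since each edge marks a point where one task's output becomes another task's input.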

Page 19: Getting the-best-out-of-d3 d12

Threading: Done Badly

[Diagram: a single Render Thread records Command List 0 and Command List 1, performs both Submits, Create Resource and Present, alongside a Game Thread and three Aux Threads.]

App render code, runtime, driver all on one!

Page 20: Getting the-best-out-of-d3 d12

Threading: Done Well

[Diagram: timelines for a Master Render Thread, a Worker Thread, an Async Thread and the Game Thread. Command Lists 0 to 3 are recorded and submitted (SubmitCL0 to SubmitCL3) across the threads, alongside CreateResource, Compile PSO and Present.]

Many solutions, key is parallelism!

Page 21: Getting the-best-out-of-d3 d12

PSO Practicalities

● Merged state removes driver validation costs

● Don’t needlessly thrash state

Just because it is a PSO, doesn’t mean every state needs to flip in HW

• Avoid toggling compute/graphics

• Avoid toggling tessellation

Use sensible defaults for don’t care fields

Page 22: Getting the-best-out-of-d3 d12

Creating PSOs

● PSO creation can be costly

Probably means a compile

● Streaming threads should handle PSO

Gather state and create on async threads

Prevents stalls

Can handle specializations too

Page 23: Getting the-best-out-of-d3 d12

Deferred PSO Update

● “Quick first compile; better answer later”

Simple / generic / free initial shader

Start the compile of the better result

Substitute PSO when it’s ready

● Generic / specialized especially useful

Precompile the generic case

More optimal path for special cases, compiled on low priority thread
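The deferred update pattern can be sketched as a small cache that hands out the generic PSO until the specialized one is ready. Everything here is hypothetical (`Pso`, `DeferredPsoCache`, `compileOne`), and the background compile is modelled as an explicit call rather than a real low-priority thread.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for a compiled pipeline state.
struct Pso {
    std::string label;
};

class DeferredPsoCache {
public:
    explicit DeferredPsoCache(Pso generic) : generic_(std::move(generic)) {}

    // Returns the specialized PSO if it is ready. Otherwise queues the
    // compile (once per key) and returns the precompiled generic PSO,
    // so the caller never stalls on a compile.
    const Pso& get(const std::string& key) {
        auto it = ready_.find(key);
        if (it != ready_.end()) return it->second;
        if (requested_.insert(key).second) pending_.push_back(key);
        return generic_;
    }

    // Stand-in for the low-priority compile thread: finishes one queued
    // compile. Returns false when nothing is queued.
    bool compileOne() {
        if (pending_.empty()) return false;
        std::string key = pending_.front();
        pending_.pop_front();
        assert(ready_.count(key) == 0);
        ready_.emplace(key, Pso{"specialized:" + key});
        return true;
    }

private:
    Pso generic_;
    std::unordered_map<std::string, Pso> ready_;
    std::unordered_set<std::string> requested_;
    std::deque<std::string> pending_;
};
```

In a real engine `compileOne` would run on a worker thread and `ready_` would need synchronisation; the substitution logic in `get` stays the same.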

Page 24: Getting the-best-out-of-d3 d12

Using Bundles And Lists

[Diagram: Draw and Dispatch calls grouped into a Bundle, Bundles into a Command List, Command Lists into a Frame.]

Page 25: Getting the-best-out-of-d3 d12

Bundle Advice

● Aim for a moderate size (~12 draws)

Some potential overhead with setup

● Limit resource binding inheritance when possible

Enables more complete cooking of bundle

Page 26: Getting the-best-out-of-d3 d12

Lists Advice

● Aim for a decent size

Typically hundreds of draw calls

● Submit together when feasible

● Don’t expect lots of list reuse

Per-frame changes + overlap limitation

Post-processing might be an exception

• Still need 2-3 copies of that list

Page 27: Getting the-best-out-of-d3 d12

Using Command Allocators

Page 28: Getting the-best-out-of-d3 d12

Allocators and Lists

● Invisible consumers of GPU memory

● Hold on to memory until Destroy

● Reuse on similar data

Warm list == no allocation during list creation

● Destroy on different data

Reuse on disparate cases grows all lists to size of worst case over time

[Chart: List / Allocator memory usage across resets. Cases shown: Initial 100 draws; Reset, then the same 100 draws (guaranteed no new allocations); a different 100 draws; 200 draws; 5 draws.]
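The memory behaviour described above can be modelled in a few lines. This is an illustrative model only: `AllocatorModel` is hypothetical, and the assumption that every draw costs a fixed number of bytes is a simplification, not how a driver sizes commands.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a command allocator: memory is only ever grown,
// and is held until the allocator is destroyed.
struct AllocatorModel {
    std::size_t capacity = 0; // bytes held until destruction
    std::size_t used = 0;     // bytes consumed since the last reset
    std::size_t growths = 0;  // number of times recording had to allocate

    // Reset makes the memory reusable but does not release it.
    void reset() { used = 0; }

    // Record `draws` commands of `bytesPerDraw` each (assumed fixed cost).
    void record(std::size_t draws, std::size_t bytesPerDraw) {
        assert(bytesPerDraw > 0);
        used += draws * bytesPerDraw;
        if (used > capacity) {
            capacity = used; // grow to fit; never shrinks
            ++growths;
        }
    }
};
```

Recording the same workload after a reset reuses the warm memory, while a single 200-draw frame permanently raises the footprint even if later frames record only 5 draws.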

Page 29: Getting the-best-out-of-d3 d12

Allocator Advice

● Allocators are fastest when warm

Keep reusing allocator with lists of equal size

● Need 2T + N allocators minimum

T -> threads creating command lists

N -> extra pool for bundles

All lists/bundles on an allocator freed together

• Need to double/triple buffer for reusing the allocators
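A small sketch of the allocator count and a per-frame rotation follows. It assumes the factor 2 in 2T + N is the number of frames in flight (double buffering), which the slide implies but does not state; `minimumAllocators` and `allocatorIndex` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>

// Minimum allocator count: one per recording thread per buffered frame,
// plus a separate pool for bundles. With framesInFlight == 2 this is 2T + N.
std::size_t minimumAllocators(std::size_t recordingThreads,
                              std::size_t bundlePools,
                              std::size_t framesInFlight = 2) {
    assert(framesInFlight >= 2);
    return framesInFlight * recordingThreads + bundlePools;
}

// Which per-thread allocator a given thread uses in a given frame.
// The same slot comes around again only after framesInFlight frames,
// by which time the GPU is assumed to have finished with it.
std::size_t allocatorIndex(std::size_t frame,
                           std::size_t thread,
                           std::size_t threads,
                           std::size_t framesInFlight = 2) {
    assert(thread < threads);
    return (frame % framesInFlight) * threads + thread;
}
```

The rotation alone does not make reuse safe; a real engine would still wait on a fence before resetting the allocator for the slot it is about to reuse.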

Page 30: Getting the-best-out-of-d3 d12

Root Signature

● Carefully lay out root signature

Group tables by frequency of change

Most frequent changes early in signature

● Standardize slots

Signature change costs

[Diagram: example root signature layout. Root entries: Per-Draw Table pointer, Per-Material Table pointer, Per-Frame Table pointer, ConstantBuffer pointer (modelview matrix, skinning) and per-draw constants. The tables reference textures (Tex) and constant buffers (ConstBuf) holding shader params or camera, eye... data.]
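The layout rule (most frequently changed entries first) can be illustrated with a stable sort. `Frequency`, `RootParam` and `layOutByFrequency` are hypothetical names for this sketch; a real root signature would be described through the D3D12 root signature structures instead.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Lower value == changes more often.
enum class Frequency { PerDraw = 0, PerMaterial = 1, PerFrame = 2 };

// Hypothetical description of one root signature entry.
struct RootParam {
    std::string name;
    Frequency changes;
};

// Order entries so the most frequently changed come first. The sort is
// stable, so entries with equal frequency keep their declared order.
std::vector<RootParam> layOutByFrequency(std::vector<RootParam> params) {
    auto moreFrequent = [](const RootParam& a, const RootParam& b) {
        return a.changes < b.changes;
    };
    std::stable_sort(params.begin(), params.end(), moreFrequent);
    assert(std::is_sorted(params.begin(), params.end(), moreFrequent));
    return params;
}
```

Applying one such ordering across all shaders is also a cheap way to standardize slots, so that switching PSOs rarely forces a signature change.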

Page 31: Getting the-best-out-of-d3 d12

Root Signature Cont’d

● Place single items which change per-draw in the root arguments

● Costs of setting new table vary across HW

Cost varies from nearly 0 to O(N) work where N is items in table

● Avoid changes to individual items in tables

Requires app to instance table if in flight

Try to update whole table atomically

Page 32: Getting the-best-out-of-d3 d12

Managing Resources with Heaps

● Committed

Monolithic, D3D11-style

● Placed

Offset in existing heap

● Reserved

Mapped to heaps like tiled resources

[Diagram: a Resource [VA] range mapped onto three Heaps, with a G-buffer and a Postprocess buffer shown in a heap.]
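"Offset in existing heap" can be sketched as a bump allocator over a heap. This is a model, not the API: `HeapModel` and `place` are hypothetical, and the 64 KB figure is D3D12's default resource placement alignment (some resource types require a different alignment, which this sketch ignores).

```cpp
#include <cassert>
#include <cstdint>

// Default placement alignment for D3D12 resources (64 KB).
constexpr std::uint64_t kPlacementAlignment = 64 * 1024;
// Sentinel returned when the heap cannot fit the request.
constexpr std::uint64_t kNoSpace = UINT64_MAX;

// Hypothetical heap model handing out aligned offsets for placed resources.
class HeapModel {
public:
    explicit HeapModel(std::uint64_t sizeInBytes) : size_(sizeInBytes) {
        assert(sizeInBytes > 0);
    }

    // Returns the aligned offset for a resource of `bytes`, or kNoSpace.
    std::uint64_t place(std::uint64_t bytes) {
        std::uint64_t offset =
            (next_ + kPlacementAlignment - 1) / kPlacementAlignment * kPlacementAlignment;
        if (bytes == 0 || offset > size_ || bytes > size_ - offset) return kNoSpace;
        next_ = offset + bytes;
        return offset;
    }

private:
    std::uint64_t size_;
    std::uint64_t next_ = 0; // first byte not yet handed out
};
```

Aliasing follows from the same idea: two resources whose lifetimes never overlap can be given the same offset, provided the application handles the transition between them.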

Page 33: Getting the-best-out-of-d3 d12

Choosing a resource type:

● Committed

Need per-resource residency

Don’t need aliasing

● Placed

Cheaper create / destroy

Can group in heaps of similar residency

Want to alias over others

Small resources

● Tiled / Reserved

Need flexibility of memory management

Can tolerate CPU and GPU overheads of ResourceMap

Page 34: Getting the-best-out-of-d3 d12

Resource tips

● Committed gives driver more knowledge

● Tiled resources have separate caps

Need to prepare for HW without it

● Memory might be segmented

Cannot allocate entire space in a single heap

Page 35: Getting the-best-out-of-d3 d12

Residency tips

● MakeResident

Batch these up

Expect CPU and GPU cost for page table updates

● MakeUnresident

Cost of move may be deferred; may be seen on future MakeResident

Page 36: Getting the-best-out-of-d3 d12

Working Set Management

● Application has much more control in D3D12

● Directly tells the video memory manager which resources are required

● App can be sharper on memory than before

On D3D11, working set per frame typically much smaller than registered resource

Less likely to end up with object in slow memory

Page 37: Getting the-best-out-of-d3 d12

Working to a budget

● “Budget” is the memory you can use

● Get under the budget using residency

MakeUnresident makes an object a candidate to swap to system memory

It is much cheaper to make an object unresident and later resident again than to destroy and re-create it

● Tiled resources can drop mip levels dynamically
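Getting under budget through residency might look like the sketch below, which picks least recently used heaps for a single batched MakeUnresident call (the call that shipped as `ID3D12Device::Evict`). The LRU policy, `TrackedHeap` and `trimToBudget` are assumptions made for illustration; the slides do not prescribe an eviction policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bookkeeping record for one heap.
struct TrackedHeap {
    std::string name;
    std::uint64_t bytes;
    std::uint64_t lastUsedFrame;
    bool resident;
};

// Marks least-recently-used heaps as non-resident until the resident total
// fits the budget. Heaps used in the current frame are never chosen.
// Returns the names to pass to one batched "make unresident" call.
std::vector<std::string> trimToBudget(std::vector<TrackedHeap>& heaps,
                                      std::uint64_t budget,
                                      std::uint64_t currentFrame) {
    std::uint64_t residentBytes = 0;
    std::vector<TrackedHeap*> candidates;
    for (TrackedHeap& h : heaps) {
        assert(h.lastUsedFrame <= currentFrame);
        if (!h.resident) continue;
        residentBytes += h.bytes;
        if (h.lastUsedFrame < currentFrame) candidates.push_back(&h);
    }
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const TrackedHeap* a, const TrackedHeap* b) {
                         return a->lastUsedFrame < b->lastUsedFrame;
                     });
    std::vector<std::string> evicted;
    for (TrackedHeap* h : candidates) {
        if (residentBytes <= budget) break;
        h->resident = false;
        residentBytes -= h->bytes;
        evicted.push_back(h->name);
    }
    return evicted;
}
```

Because the heaps are kept rather than destroyed, bringing one back later is a residency call rather than a full re-creation.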

Page 38: Getting the-best-out-of-d3 d12

Barriers & Hazards

● Most objects stay in one state from creation

Don’t insert redundant barriers

● Always specify the right set of target units

Allows for minimal barrier

● Group barriers into same Barrier call

Will take the worst case of all, rather than potentially incurring multiple sequential barriers
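Two of the points above, skipping redundant barriers and grouping the rest into one call, can be sketched with a small state tracker. `State`, `Transition` and `BarrierBatcher` are hypothetical; the real API expresses these as resource barrier structures passed to the command list in a single array.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

enum class State { RenderTarget, ShaderResource, UnorderedAccess, CopyDest };

// Hypothetical record of one state transition.
struct Transition {
    int resource;
    State before;
    State after;
};

class BarrierBatcher {
public:
    // Register a resource with the state it was created in.
    void setInitial(int resource, State s) { current_[resource] = s; }

    // Queue a transition only if the resource is not already in state `s`.
    void require(int resource, State s) {
        auto it = current_.find(resource);
        assert(it != current_.end());
        if (it->second == s) return; // redundant barrier: skip it
        pending_.push_back({resource, it->second, s});
        it->second = s;
    }

    // Hand back everything queued so far as ONE batch, to be issued
    // in a single barrier call.
    std::vector<Transition> flush() {
        std::vector<Transition> batch;
        batch.swap(pending_);
        if (!batch.empty()) ++flushCalls_;
        return batch;
    }

    std::size_t flushCalls() const { return flushCalls_; }

private:
    std::unordered_map<int, State> current_;
    std::vector<Transition> pending_;
    std::size_t flushCalls_ = 0;
};
```

Calling `flush` once before the draws that need the new states keeps the number of barrier calls at one per dependency point rather than one per resource.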

Page 39: Getting the-best-out-of-d3 d12

Barriers enhance concurrency

● In D3D11, a resource both read and written in a given draw created a dependency between draws

Most common case was a UAV used in adjacent dispatches

[Diagram: logical view of draws (Draw 0 to Draw 3) against the GPU timeline of draws, where the draws overlap; D3D11 dispatches (Dispatch 0 to Dispatch 2) shown with a barrier.]

Page 40: Getting the-best-out-of-d3 d12

Barrier enables overlap

● Explicit barrier eliminates issue

App tells API when a true dependency exists, rather than it being assumed

[Diagram: logical view of dispatches (Dispatch 0 to Dispatch 2) against the same dispatches with explicit barrier control.]

Page 41: Getting the-best-out-of-d3 d12

CPU side

● D3D12 simplifies picture

Easier to associate driver effort with application actions

Less likely that driver itself is the bottleneck

● Be aware of your system buses

Page 42: Getting the-best-out-of-d3 d12

GPU side

● Environment is new

Less familiar without console experience

Interesting new hardware limits are now accessible

● Use the tools

Page 43: Getting the-best-out-of-d3 d12

Wrap up

Page 44: Getting the-best-out-of-d3 d12

Get Ready

● D3D12 done right isn’t just an API port

More so when referring to consoles

● Good engine design offers a lot of opportunity

● The power you’ve been asking for is here

Page 45: Getting the-best-out-of-d3 d12

Questions