Fast Tessellated Rendering on Fermi GF100 Hot3D @ HPG 2010 Tim Purcell

Slides - High Performance Graphics 2013

Page 1: Slides - High Performance Graphics 2013

Fast Tessellated Rendering on Fermi GF100

Hot3D @ HPG 2010

Tim Purcell

Page 2: Slides - High Performance Graphics 2013

Outline

DX10 state of the art

DX11 tessellation

Fermi GF100 architecture

Results

Demo


Page 3: Slides - High Performance Graphics 2013

State of the Art: DX10 pipeline

Image from Far Cry® 2, courtesy of Ubisoft

Pixels are meticulously shaded

Pipeline: Input Assembler → Vertex Shader → Geometry Shader → Rasterizer/Interpolator → Pixel Shader → Output Merger

Page 4: Slides - High Performance Graphics 2013

State of the Art: DX10 pipeline

Image from Far Cry® 2, courtesy of Ubisoft

Pixels are meticulously shaded, but geometric detail is modest

Page 5: Slides - High Performance Graphics 2013

Tessellation in DirectX 11

Hull shader

Computes the tessellation factor

Runs pre-expansion

Explicitly parallel across control points

Pipeline (from input assembly): Vertex → Patch Assembly → Hull → Tessellator → Domain → Primitive Assembly → Geometry

Diagram label: control points
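The slide says only that the hull shader computes a tessellation factor for each patch before any expansion happens. As a minimal sketch of what such a computation might look like, the Python below uses a distance-based heuristic. The heuristic, the `near` and `far` parameters, and the function name are assumptions for illustration and do not come from the talk. The upper limit of 64 is the Direct3D 11 maximum tessellation factor.

```python
import math

MAX_TESS_FACTOR = 64.0  # Direct3D 11 upper limit for a tessellation factor


def tess_factor(control_points, eye, near=1.0, far=100.0):
    """Hypothetical heuristic: patches closer to the eye get a higher factor."""
    n = len(control_points)
    # Centroid of the patch's control points
    center = [sum(p[i] for p in control_points) / n for i in range(3)]
    dist = math.dist(center, eye)
    # t is 0 at or inside `near` and 1 at or beyond `far`
    t = min(max((dist - near) / (far - near), 0.0), 1.0)
    return 1.0 + (1.0 - t) * (MAX_TESS_FACTOR - 1.0)
```

Because the factor depends only on the patch and the camera, it can be evaluated once per patch, which matches the "runs pre-expansion" point above.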

Page 6: Slides - High Performance Graphics 2013

Tessellation in DirectX 11

Fixed-function tessellation stage

Configured by LOD output from Hull Shader

Produces triangles and lines

Pipeline (from input assembly): Vertex → Patch Assembly → Hull → Tessellator → Domain → Primitive Assembly → Geometry

Diagram label: control points
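To make the expansion step concrete, here is a simplified sketch of a tessellator for a quad domain. It assumes a single integer factor `n` applied uniformly. The real Direct3D 11 tessellator takes separate edge and inside factors and supports fractional partitioning, none of which is modeled here. The function name is hypothetical.

```python
def tessellate_quad(n):
    """Uniformly split the unit (u, v) square into an n x n grid of cells.

    Returns the (u, v) points and the triangles as index triples.
    """
    points = [(i / n, j / n) for j in range(n + 1) for i in range(n + 1)]
    triangles = []
    for j in range(n):
        for i in range(n):
            a = j * (n + 1) + i  # lower-left corner of the cell
            b = a + 1            # lower-right
            c = a + (n + 1)      # upper-left
            d = c + 1            # upper-right
            triangles += [(a, b, d), (a, d, c)]
    return points, triangles
```

A factor of `n` yields `(n + 1) ** 2` domain points and `2 * n * n` triangles, which is the data expansion the later slides are concerned with.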

Page 7: Slides - High Performance Graphics 2013

Tessellation in DirectX 11

Domain shader

Runs post-expansion

Maps (u,v) to (x,y,z,w)

Implicitly parallel

Pipeline (from input assembly): Vertex → Patch Assembly → Hull → Tessellator → Domain → Primitive Assembly → Geometry

Diagram label: control points
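The mapping from (u,v) to (x,y,z,w) can be sketched as follows. This example assumes the simplest possible surface, a bilinear patch over four corner control points, followed by a 4x4 view-projection transform. Real domain shaders typically evaluate higher-order patches or apply displacement. The function name and argument layout are hypothetical.

```python
def domain_shader(u, v, corners, view_proj):
    """Evaluate a bilinear patch at (u, v) and return clip-space (x, y, z, w).

    corners: p00, p10, p01, p11 as (x, y, z) tuples (assumed layout).
    view_proj: 4x4 matrix as a list of rows.
    """
    p00, p10, p01, p11 = corners
    pos = [
        (1 - u) * (1 - v) * p00[k] + u * (1 - v) * p10[k]
        + (1 - u) * v * p01[k] + u * v * p11[k]
        for k in range(3)
    ] + [1.0]
    return tuple(sum(view_proj[r][c] * pos[c] for c in range(4)) for r in range(4))
```

Each call depends only on its own (u,v) and the shared patch data, so every generated vertex can be evaluated independently. That is one way to read "implicitly parallel".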

Page 8: Slides - High Performance Graphics 2013

NVIDIA DX10 Logical Pipeline

[Diagram: Primitive Distributor, parallel Vertex Shader (VS) and Geometry Shader (GS) units, Viewport Transform and Setup, parallel Raster and Pixel Shader (PS) units, Crossbar, and ROP/FB partitions, connected by data-flow arrows]

Page 9: Slides - High Performance Graphics 2013

DX10 Logical Pipeline + Tessellation

[Diagram: the DX10 logical pipeline with a Tessellation Shader (TS) block added to each VS/GS unit. The Tessellation Shader includes the Hull Shader, Tessellator, and Domain Shader. Arrows show data flow.]

Page 10: Slides - High Performance Graphics 2013

Problems With DX10 + Tess. (1)

Data expansion in TS can lead to excessive buffering requirements after GS

Have to drain GS0 before GS1 to maintain API ordering

[Diagram: parallel units TS0/GS0 through TSn/GSn]
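The buffering problem can be illustrated with a toy simulation. It assumes every unit produces primitives at the same fixed rate and that only the unit whose turn it is in API order may drain. The rates, the time-stepped model, and the function name are assumptions for illustration, not a description of any real hardware.

```python
def peak_buffer(prims_per_unit, produce_rate=1, drain_rate=1):
    """Peak number of buffered primitives when units must drain in API order."""
    remaining = list(prims_per_unit)       # primitives each unit still has to produce
    buffered = [0] * len(prims_per_unit)   # primitives waiting after each unit
    current = 0                            # only this unit is allowed to drain
    peak = 0
    while current < len(buffered):
        for i in range(len(remaining)):
            made = min(produce_rate, remaining[i])
            remaining[i] -= made
            buffered[i] += made
        peak = max(peak, sum(buffered))
        buffered[current] -= min(drain_rate, buffered[current])
        # Advance to the next unit once the current one is finished and empty
        while (current < len(buffered)
               and remaining[current] == 0 and buffered[current] == 0):
            current += 1
    return peak
```

In this model, four units that each emit one primitive need room for four primitives, while four units that each expand a patch into 100 primitives need room for 301, because the later units keep producing while the first one drains.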

Page 11: Slides - High Performance Graphics 2013

Problems With DX10 + Tess. (2)

Setup-limited prim rate

Already a problem for CPU-generated geometry

Tessellation generates even more primitives

[Diagram: TS0/GS0 through TSn/GSn feed a single Viewport / Setup stage, which feeds Raster0 through Rasterm]

Page 12: Slides - High Performance Graphics 2013

Fermi GF100 Logical Pipeline

[Diagram blocks: Primitive Distributor; Task Distributor; parallel VS/HS units; Work Distribution Crossbar; parallel Tessellator / Viewport Transform / DS / GS units; Raster Engines, each with several Attribute Setup / PS units; Crossbar; ROP/FB partitions. Highlighted: Hull Shader, Domain Shader, and Raster Engine (includes Edge Setup, Rasterizer, and Zcull). Arrows show data flow.]

Page 13: Slides - High Performance Graphics 2013

Fermi GF100 Logical Pipeline

[Diagram: Fermi GF100 logical pipeline, highlighting the Polymorph Engine (includes Tessellator, Viewport Transform, and Attribute Setup). Arrows show data flow.]

Page 14: Slides - High Performance Graphics 2013

Fermi GF100 Tessellation Features 1

Task Distributor

Task ~= Hull Shader output

Control points plus tessellation factor

Pre-expansion

Distributes tasks for subsequent geometry processing

Expand patch into primitives

Optional GS

Reduces buffering requirements

[Diagram: Fermi GF100 logical pipeline]

Page 15: Slides - High Performance Graphics 2013

Fermi GF100 Tessellation Features 2

Parallel Viewport Transform

Includes clipping and culling

Eliminate serialization point

[Diagram: Fermi GF100 logical pipeline]

Page 16: Slides - High Performance Graphics 2013

Fermi GF100 Tessellation Features 3

Parallel Rasterization

Including Edge Setup, Rasterization, and Zcull

Improved primitive throughput

Multiple primitives per clock

Screen mapped for load balancing

[Diagram: Fermi GF100 logical pipeline]

Page 17: Slides - High Performance Graphics 2013

Screen Mapped Rasterization

Each block is a tile of pixels

Rasterizers uniquely own pixel tiles

Note: small primitives can overlap multiple tiles

[Figure: the screen divided into a grid of pixel tiles, each labeled 0 to 3 for the rasterizer (Rasterizer 0 to Rasterizer 3) that owns it, with primitives A and B drawn over the grid]
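Unique tile ownership can be sketched as a pure function from pixel coordinates to a rasterizer index. The tile size and the diagonal interleave pattern below are assumptions chosen for illustration. The slide does not give the actual GF100 tile dimensions or mapping.

```python
TILE_SIZE = 16        # hypothetical tile edge, in pixels
NUM_RASTERIZERS = 4   # the figure shows rasterizers 0 to 3


def tile_owner(tile_x, tile_y):
    """Assumed interleave: shift the 0..3 pattern by one tile on each row."""
    return (tile_x - tile_y) % NUM_RASTERIZERS


def pixel_owner(x, y):
    """Every pixel belongs to exactly one tile, hence exactly one rasterizer."""
    return tile_owner(x // TILE_SIZE, y // TILE_SIZE)
```

Because ownership depends only on position, two primitives that touch the same pixel are always handled by the same rasterizer.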

Page 18: Slides - High Performance Graphics 2013

Challenge: Maintaining API Order

Parallelization is easy

Maintaining API ordering is a challenge

Work Distribution Crossbar (WDX)

Similar to Pomegranate [Eldridge et al. 2000]

[Diagram: Fermi GF100 logical pipeline]

Page 19: Slides - High Performance Graphics 2013

Work Distribution Crossbar

OWDX

Uses primitive bounding box to determine which rasterizers get which primitives

SWDX

Reconstructs API order

Rasterizers uniquely own pixels

No subsequent sorting for ordering required

[Diagram: each Tessellator / Viewport Transform / DS / GS unit is paired with an OWDX block, and each Raster Engine with its Attribute Setup / PS units is paired with an SWDX block, with the Work Distribution Crossbar between them]
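The two halves of the crossbar can be sketched together. The slides do not describe the hardware mechanism, so this example makes several assumptions: primitives carry an API sequence number, tiles are 16 pixels wide with a diagonal four-way interleave, and order is restored per rasterizer by sorting on the sequence number. All names are hypothetical.

```python
from collections import defaultdict

TILE, N_RAST = 16, 4  # hypothetical tile size and rasterizer count


def owner(tile_x, tile_y):
    """Assumed tile-to-rasterizer mapping."""
    return (tile_x - tile_y) % N_RAST


def rasterizers_for_bbox(x0, y0, x1, y1):
    """OWDX side: rasterizers owning any tile touched by an inclusive pixel bbox."""
    return {owner(tx, ty)
            for ty in range(y0 // TILE, y1 // TILE + 1)
            for tx in range(x0 // TILE, x1 // TILE + 1)}


def route(arrivals):
    """arrivals: (seq, bbox) pairs in arrival order, which may differ from API order.

    Returns, per rasterizer, the sequence numbers it must process, in API order.
    """
    inbox = defaultdict(list)
    for seq, bbox in arrivals:
        for r in rasterizers_for_bbox(*bbox):
            inbox[r].append(seq)
    # SWDX side: restore API order independently for each rasterizer
    return {r: sorted(seqs) for r, seqs in inbox.items()}
```

Since each rasterizer owns its pixels exclusively, ordering only has to be restored within each rasterizer's stream. No global sort across rasterizers is needed afterwards.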

Page 20: Slides - High Performance Graphics 2013

Handling Load Imbalance

Load imbalance is a concern for all region-based architectures because they are non-adaptive

Static imbalance

Different screen regions get differing amounts of work

Mitigated by relatively small pixel tiles

Dynamic imbalance

Lots of tiny primitives in the same screen region at the same time

Mitigated by buffering

– We can only buffer so much…
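The effect of tile size on static imbalance can be sketched numerically. The example assumes four rasterizers with a diagonal tile interleave and measures imbalance as the busiest rasterizer's pixel count divided by the mean. The mapping, the metric, and the function name are all assumptions for illustration.

```python
def imbalance(region, tile, n_rast=4):
    """Max/mean pixel load per rasterizer for a busy rectangle.

    region: (x0, y0, x1, y1) in pixels, upper bounds exclusive.
    A result of 1.0 means perfectly balanced; n_rast means one rasterizer does it all.
    """
    x0, y0, x1, y1 = region
    load = [0] * n_rast
    for y in range(y0, y1):
        for x in range(x0, x1):
            load[((x // tile) - (y // tile)) % n_rast] += 1
    return max(load) / (sum(load) / n_rast)
```

For a busy 64 x 64 pixel region at the screen origin, this model gives a ratio of 4.0 with 64-pixel tiles, 2.0 with 32-pixel tiles, and 1.0 with 16-pixel tiles, which is the trend the "relatively small pixel tiles" point describes.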

Page 21: Slides - High Performance Graphics 2013

Results


Page 22: Slides - High Performance Graphics 2013

Historical Triangle Rate

[Chart: triangle rate in Mtris/sec (axis 0 to 2000) versus chip release date (Jan-00 to Jan-10), with GF100 labeled]

Page 23: Slides - High Performance Graphics 2013

Results – Synthetic Performance Test

[Chart: triangles per clock (axis 0 to 3) versus triangle size in pixels (axis 0 to 16), with two series: Z + per-vertex color, and Z-only]

Page 24: Slides - High Performance Graphics 2013

Results – Island Demo

Frame average of 1.94 triangles per clock

Page 25: Slides - High Performance Graphics 2013

Results – Unigine Heaven

2.34 triangles per clock in highly tessellated areas
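The results slides report triangles per clock, while the historical chart uses Mtris/sec. The two are related by the clock frequency, which the slides do not state. A minimal conversion sketch follows. `clock_mhz` is a hypothetical input, and the 700 MHz value is made up purely to show the arithmetic.

```python
def mtris_per_sec(tris_per_clock, clock_mhz):
    """Convert triangles per clock to millions of triangles per second.

    clock_mhz is an assumed input; the slides do not give a clock frequency.
    """
    return tris_per_clock * clock_mhz


# Illustration only: 700 MHz is a made-up clock, not a figure from the talk.
island = mtris_per_sec(1.94, 700.0)
heaven = mtris_per_sec(2.34, 700.0)
```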

Page 26: Slides - High Performance Graphics 2013

Demo Time


Page 27: Slides - High Performance Graphics 2013

Summary

Tessellation

Increases geometric realism without burdening the CPU

Fermi GF100 is designed for tessellation

Eliminates serialization points

Dramatically improved primitive throughput

Next-generation visual fidelity


Page 28: Slides - High Performance Graphics 2013

Acknowledgements

Steve Molnar

Ziyad Hakura

Andreas Dietrich (demo help)

Entire Fermi GF100 team


Page 29: Slides - High Performance Graphics 2013

Questions?
