Developing Efficient Graphics Software

Intent of Course

• Identify application and hardware interaction

• Quantify and optimize interaction

• Identify efficient software structure

• Balance software and hardware system component use

Outline

• 1:35 Hardware and graphics architecture and performance

• 2:05 Software and system performance

• Break

• 2:55 Software profiling and performance analysis

• 3:20 C/C++ language issues

• 3:50 Graphics techniques and algorithms

• 4:40 Performance hints

Speakers

• Applications Consulting Engineers for SGI

– optimizing, differentiating, graphics

• Keith Cok, Bob Kuehne, Thomas True, Alan Commike

Hardware & Graphics Architecture & Performance

Bob Kuehne, SGI

Course Overview

Why is your application drawing so slowly?

• Could actually be the graphics

• Could be the data traversal

• Could be something entirely different

Tour Guide

Platform architecture & components

• CPU

• Memory

• Graphics

Graphics performance

• Measurements: triangle rate, fill rate, misc.

• Reproduce & maximize

Bottlenecks & Balance

Bottlenecks

• Find them

• Eliminate them (sort of - move them around)

Balance

• Understand hardware architecture

• Fully utilize hardware

Yin & Yang

• “Yin and yang are the two primal cosmic principles of the universe”

• “The best state for everything in the universe is a state of harmony represented by a balance of yin and yang.”

– Skeptic's Dictionary -- http://skepdic.com/yinyang.html

Write Once Run Everywhere?

My application ran fast on that platform! Why is this one so slow?

• Different platforms require different tuning

• Different platforms implement hardware differently

– Macro: Architecture & features

– Micro: Storage capacities, buffers, & caches

– Effect: Bandwidth & latency

Latency & Bandwidth

Definitions:

• Latency: time required to communicate a unit of data

• Bandwidth: data transferred per unit time

Example:

• Latency bottleneck: s t s t s t s t s t s t

• Bandwidth bottleneck: s t t t t t t

[Figure legend: each cell is one unit of time; s: texture setup time; t: texture download time]
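The two timelines can be compared with a little arithmetic. The sketch below is only an illustration: it assumes each texture costs a fixed setup time plus a fixed download time, and the function names and unit costs are hypothetical rather than taken from the course.

```c
#include <assert.h>

/* Total time when every texture pays its own setup: s t s t s t ... */
double time_with_setup_per_texture(int textures, double setup, double download)
{
    assert(textures >= 0);
    return textures * (setup + download);
}

/* Total time when all textures share one setup: s t t t t ... */
double time_with_single_setup(int textures, double setup, double download)
{
    assert(textures >= 0);
    return (textures > 0 ? setup : 0.0) + textures * download;
}
```

With six textures and unit costs, `time_with_setup_per_texture(6, 1.0, 1.0)` gives 12 time units and `time_with_single_setup(6, 1.0, 1.0)` gives 7. In the first case the per-transfer setup dominates; in the second the total is bounded by how fast the data itself can move.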

Platform: Software View

[Figure: block diagram with the components CPU, memory, graphics, I/O, net, and misc.]

Platform: PCI, AGP

[Figure: two block diagrams. The first shows CPU, memory, glue logic, and a PCI bus with disk, net, graphics, and I/O. The second shows CPU, memory, glue logic, an AGP port, and PCI, with the same disk, net, graphics, and I/O blocks.]

Platform: UMA, Switched Hub

[Figure: two block diagrams, one labeled UMA and one for a switched hub, each built from CPU, memory, glue logic, graphics, disk, net, and I/O blocks.]

Platform: The Points

Why learn about hardware?

• To understand how your app interacts with it

• To best utilize the hardware

• Potentially can use extra hardware features

Where?

• Platform documentation

• Talk with hardware vendor

CPU: Overview

CPU Operation

• Data transferred from main memory to registers

• CPU works on data in registers

Latency

• Registers: 0 (free)

• Level-1 (L1) cache: 1

• Level-2 (L2) cache: 10x L1

• Main memory: 100x L1

[Figure: CPU, registers (R), L1, L2, main memory]

CPU, Cache, and Memory

Caches designed to exploit data locality

• Temporal locality

• Spatial locality

[Figure: hierarchy of main memory, L2, L1, registers, and CPU]

Memory: Cache & Logical Flow

[Figure: flowchart with the tests "In register?", "In L1?", and "In L2?" and the actions "Compute", "Copy to register (1)", "Copy to L1 (10)", and "Copy to L2 (100)". The numbers in parentheses are relative costs.]

Memory: Cache & Physical Flow

[Figure: a page in main memory, the L2 cache, the L1 cache, and the CPU registers]

Memory: Allocation & Pools

• List elements are often allocated as-needed

– This leads to spatial disparity

• Mitigated by use of application memory management

– Bad: malloc, malloc, malloc, malloc, ...

– Good: pools - pool_init, pool_alloc, ...

• Graphics example:

– Vertices, normals, textures, etc.
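The slide names pool_init and pool_alloc without defining them. A minimal sketch of what such a pool might look like follows; the pool_t type, the signatures, and pool_destroy are assumptions made for illustration, not an API from the course. One malloc reserves a contiguous block, and each allocation hands out the next element, so elements allocated together sit next to each other in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size pool: one malloc up front, then sequential hand-out. */
typedef struct {
    unsigned char *base;  /* start of the contiguous block */
    size_t elem_size;     /* bytes per element (use sizeof(T) to keep alignment) */
    size_t capacity;      /* number of elements the block can hold */
    size_t used;          /* number of elements handed out so far */
} pool_t;

/* Returns 1 on success, 0 if the block could not be allocated. */
int pool_init(pool_t *p, size_t elem_size, size_t capacity)
{
    assert(p != NULL && elem_size > 0 && capacity > 0);
    p->base = malloc(elem_size * capacity);
    p->elem_size = elem_size;
    p->capacity = capacity;
    p->used = 0;
    return p->base != NULL;
}

/* Returns the next free element, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    if (p->base == NULL || p->used == p->capacity)
        return NULL;
    return p->base + p->elem_size * p->used++;
}

/* Releases the whole pool at once. */
void pool_destroy(pool_t *p)
{
    free(p->base);
    p->base = NULL;
    p->used = 0;
    p->capacity = 0;
}
```

A vertex list built with `pool_alloc` walks memory in order, which is the spatial locality the caches are designed to exploit. This sketch does not support freeing individual elements; that is a deliberate simplification.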

Memory: Graphics! Vertex Arrays

[Figure: "Vertex Array Cache Behavior". X axis: number of array vertices. Y axis: time to traverse. Series: Platform 0 interleaved, Platform 0 non-interleaved, Platform 1 interleaved, Platform 1 non-interleaved.]
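For readers unfamiliar with the two layouts in the chart, the sketch below shows the same vertex data stored non-interleaved (one array per attribute) and interleaved (one record per vertex). The type and function names are hypothetical, and the code makes no claim about which layout is faster; as the chart indicates, that depends on the platform and should be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: position and normal of one vertex are adjacent. */
typedef struct {
    float position[3];
    float normal[3];
} vertex_t;

/* Non-interleaved traversal: two separate flat arrays (x y z, x y z, ...). */
float sum_non_interleaved(const float *positions, const float *normals, size_t count)
{
    assert(positions != NULL && normals != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
        for (size_t c = 0; c < 3; c++)
            sum += positions[3 * i + c] + normals[3 * i + c];
    return sum;
}

/* Interleaved traversal: one array of records, read front to back. */
float sum_interleaved(const vertex_t *vertices, size_t count)
{
    assert(vertices != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
        for (size_t c = 0; c < 3; c++)
            sum += vertices[i].position[c] + vertices[i].normal[c];
    return sum;
}
```

Both functions touch the same values in the same logical order. The difference is that the non-interleaved version reads from two memory streams at once, while the interleaved version reads from one.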

Graphics: Pipe

[Figure: a FIFO feeding the pipeline stages xf, light, clip, rast, fx, fops]

• xf: world to screen

• light: apply light

• clip: clip to view

• rast: convert to pixels

• fx: apply texture, etc.

• fops: test pixel ops

Graphics: Pipe & Akeley Taxonomy

• G - Generate geometric data

• T - Traverse data structures

• X - Transform primitives world to screen

• R - Rasterize triangles to pixels

• D - Display framebuffer on output device

[Figure: the stages G, T, X, R, D drawn as a pipeline]

Graphics: Hardware

4 types of hardware are common

• G-TXRD: all hardware

• GT-XRD: transformation and rasterization in graphics hardware

• GTX-RD: transformation on the host, screen-space operations in graphics hardware

• GTXR-D: all software

Graphics: Performance

Benchmarks

• “Trust, but verify.” - an ex-president

Definitions

• Triangle rate: speed at which primitives are transformed (X)

• Fill rate: speed at which primitives are rasterized (R)

– Depth complexity: number of times pixel filled

Caveats

• Quantization, fastpath

Graphics: Quantization

• Frame quantization is the result of swapbuffers occurring at the next vertical retrace.

– Necessary to avoid image artifacts such as tearing

• Example: 100 Hz display refresh

Graphics: Quantization

[Figure: timeline of graphics frames against the vertical retraces t0 through t7 of a 100 Hz display, where each interval tn is 1/100 second. The synchronized rows are labeled 100 Hz, 50 Hz, 50 Hz, and 33 Hz; an unsynchronized (no-sync) row is labeled 120 Hz.]
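The quantized rates in the figure follow from rounding each frame's render time up to a whole number of refresh periods. The sketch below works through that arithmetic; the function name is hypothetical, and it assumes a single application that always waits for the next retrace before swapping.

```c
#include <assert.h>

/* Effective frame rate when every swap waits for the next vertical retrace.
 * refresh_hz:      display refresh rate, e.g. 100.0
 * render_seconds:  time the application needs to draw one frame */
double quantized_frame_rate(double refresh_hz, double render_seconds)
{
    assert(refresh_hz > 0.0 && render_seconds > 0.0);
    double periods = render_seconds * refresh_hz;  /* render time in refresh periods */
    long whole = (long)periods;
    if ((double)whole < periods)
        whole++;                                   /* round up to the next retrace */
    return refresh_hz / (double)whole;
}
```

On a 100 Hz display, a frame that renders in 8 ms (fast enough for roughly 120 Hz unsynchronized) is shown at 100 Hz, while a frame that takes 12 ms drops straight to 50 Hz, and one that takes 25 ms runs at about 33 Hz. Small changes in render time near a period boundary therefore cause large steps in frame rate.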

Graphics: Fastpath

Definition

• Fastpath: the most optimized path through graphics hardware

Example

• fast path: float verts, float norms, ABGR textures, z-test

• less fast path: float verts, float norms, RGBA textures, z-test

Graphics: Fastpath Example

• Fast path is often synonymous with ideal path.

– Real usage of graphics falls on a continuum.

• Must quantify what hardware can do

– Quality & speed

Graphics: Fastpath Points

[Figure: a continuum from "Fast path (hardware)" to "Slow path (software)", with axes labeled Speed and Quality and the question "Where is your application?"]

Graphics Hardware: Testing

Duplicate performance numbers simply:

• Good: build a simple test program

• Better: glPerf - http://www.spec.org

Maximize performance in an app:

• Good: Use fast API extensions

• Better: Create an “is-fast” test, use what is verified as fast

Graphics Hardware: “Is-Fast”

Test each platform to determine fast path

• Once, per-machine, test primitives and modes

– Vertex array format, texture format, display list, etc.

• Store data in database

– Detect hardware changes or time-to-live

• Read data from database at startup

– Check database or re-generate data

Graphics Hardware: “Is-Fast”

Pseudo-code

    if ( new_machine() || hardware_changed() ) {
        test_interesting_modes();
        store_in_database();
    }
    else {  // have database entry
        get_performance_data_from_database();
    }

    // use the modes & primitives that are "fast" when rendering
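The measuring step of an "is-fast" test can be sketched as below. Everything here is an assumption for illustration: render_mode_fn stands for a function that draws one test frame in a particular mode, and the two helpers are hypothetical names. The standard clock() call measures CPU time only, so a real test would more likely use a wall-clock timer and wait for the graphics pipeline to drain (for example with glFinish in OpenGL) before stopping the timer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical: draws one test frame using a particular mode or format. */
typedef void (*render_mode_fn)(void);

/* Time `iterations` calls of one candidate mode, in seconds of CPU time. */
double time_mode(render_mode_fn mode, int iterations)
{
    assert(mode != NULL && iterations > 0);
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        mode();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return the index of the candidate with the smallest measured time. */
size_t fastest_mode(const double *seconds, size_t count)
{
    assert(seconds != NULL && count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}
```

The index returned by `fastest_mode` is what would be stored in the database, keyed by a description of the hardware, so that later runs can skip the measurement.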

Think Globally, Act Locally

Think globally

• Know the platforms & graphics hardware

• Use hardware effectively in your app

• Balance hardware utilization

Act locally

• Use in-cache data

• Understand hardware & graphics fastpaths

• Balance quality vs. performance

Software and System Performance

Thomas J. True, SGI

A Four Step Process

Quantify

System Evaluation

Graphics Analysis

Bottleneck Elimination

Quantify: Characterize

• Application Space

• Primitive Types

• Primitive Counts

• Rendering Characteristics

• Frame Rate

Quantify: Compare

[Figure: chart comparing "My Performance" with "Ideal Performance" for triangle rate and fill rate]

Examine System Configuration

Resources

• Memory

• Disk

Setup

• Display

• Network

Graphics Analysis: Ideal Performance

• Keep graphics pipeline full.

• 100% CPU utilization running application code.

• 100% graphics utilization.

Graphics Analysis: Graphics Bound

[Figure: two gauges, each scaled from 0 to 100]

• Graphics subsystem processes data slower than CPU can feed it.

• Graphics subsystem issues an interrupt which causes the CPU to stall.

• Data processing within application stops until graphics subsystem can again accept data.

Graphics Analysis

Geometry Limited

• Limited by the rate at which vertices can be transformed and clipped.

Fill Limited

• Limited by the rate at which transformed vertices can be rasterized.

Graphics Analysis: CPU Bound

[Figure: two gauges, each scaled from 0 to 100]

• CPU at 100% utilization but can’t feed graphics fast enough.

• Graphics subsystem at less than 100% utilization.

• All CPU cycles consumed by data processing.

Graphics Analysis: Determination Techniques

• Remove graphics API calls.

• Shrink graphics window.

• Reduce geometry processing requirements.

• Use system monitoring tool.

Graphics Analysis

[Figure: decision flowchart. Experiment nodes: Start; Remove graphics API calls; Shrink graphics window; Reduce geometry load; Remove rendering calls; Use system monitoring tool. Outcome nodes: Graphics performance problem; Performance problem not graphics; Graphics bound: fill limited; Graphics bound: geometry limited; Graphics bound: ?; Fallen off fast path; Excessive or unexpected CPU activity. Each edge is marked either "frame rate increase" or "no change in frame rate".]
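One way to read the determination techniques is as a small decision procedure over measured frame rates. The sketch below encodes one plausible ordering of the experiments; the exact edges of the flowchart are not recoverable here, so the ordering, the type names, and the improvement threshold are all assumptions.

```c
#include <assert.h>

typedef enum {
    BOTTLENECK_NOT_GRAPHICS,      /* look at CPU activity with a system monitor */
    BOTTLENECK_FILL_LIMITED,
    BOTTLENECK_GEOMETRY_LIMITED,
    BOTTLENECK_GRAPHICS_OTHER     /* e.g. the application has fallen off the fast path */
} bottleneck_t;

/* Frame rates (frames per second) measured under each experiment. */
typedef struct {
    double baseline;           /* unmodified application */
    double no_graphics_calls;  /* graphics API calls removed */
    double small_window;       /* graphics window shrunk */
    double reduced_geometry;   /* less geometry sent */
} measurements_t;

/* A trial counts as an improvement only if it beats the baseline by the
 * given fraction, so that measurement noise is not mistaken for a speedup. */
static int improved(double baseline, double trial, double threshold)
{
    return trial > baseline * (1.0 + threshold);
}

bottleneck_t classify(const measurements_t *m, double threshold)
{
    assert(m != 0 && m->baseline > 0.0 && threshold >= 0.0);
    if (!improved(m->baseline, m->no_graphics_calls, threshold))
        return BOTTLENECK_NOT_GRAPHICS;
    if (improved(m->baseline, m->small_window, threshold))
        return BOTTLENECK_FILL_LIMITED;
    if (improved(m->baseline, m->reduced_geometry, threshold))
        return BOTTLENECK_GEOMETRY_LIMITED;
    return BOTTLENECK_GRAPHICS_OTHER;
}
```

For example, an application that runs at 30 frames per second, jumps to 90 with graphics calls removed, and reaches 60 in a small window would be classified as fill limited.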

Graphics Analysis: Graphics Architecture: GTXR-D (aka Dumb Frame Buffer)

• CPU does everything.

• Typically CPU bound.

• To remedy, buy a “real” graphics board.

Graphics Analysis: Graphics Architecture: GTX-RD

• Screen space operations performed by graphics.

• Object-space to screen-space transform on host.

• Can easily become CPU bound.

“Roughly 100 single-precision floating point operations are required to transform, light, clip test, project and map an object-space vertex to screen-space.” - K. Akeley & T. Jermoluk

• Beware of fast-path and slow-path issues.
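The quoted figure of roughly 100 floating point operations per vertex gives a quick upper bound on how many vertices a host CPU can transform. The sketch below does that division; the function names are hypothetical, and the bound ignores memory traffic and everything else the application does, so real rates will be lower.

```c
#include <assert.h>

/* Rough per-vertex cost quoted from Akeley & Jermoluk. */
#define FLOPS_PER_VERTEX 100.0

/* Upper bound on vertices transformed per second for a given sustained
 * floating point rate (operations per second). */
double max_vertices_per_second(double sustained_flops)
{
    assert(sustained_flops >= 0.0);
    return sustained_flops / FLOPS_PER_VERTEX;
}

/* The same bound expressed per frame at a target frame rate. */
double max_vertices_per_frame(double sustained_flops, double frames_per_second)
{
    assert(frames_per_second > 0.0);
    return max_vertices_per_second(sustained_flops) / frames_per_second;
}
```

As a made-up example, a CPU sustaining 200 million floating point operations per second could transform at most 2 million vertices per second, or 40,000 vertices per frame at 50 frames per second, before doing any other work.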

Graphics Analysis: Graphics Architecture: GTX-RD

• If Graphics Bound:

– Reduce per-pixel operations.

– Reduce depth complexity.

– Use native-format data.

• If CPU Bound:

– Reduce scene complexity.

– Use more efficient graphics algorithms.

Graphics Analysis: Graphics Architecture: GT-XRD

• Transformation and rasterization performed by graphics.

• Can be CPU or graphics bound.

• Beware of fast-path and slow-path issues.

• Subject to host bandwidth limitations.

Graphics Analysis: Graphics Architecture: GT-XRD

• If Graphics Bound:

– Move lighting back to CPU.

– Use native data formats within application.

– Use display lists or vertex arrays.

– Use less expensive lighting modes.

• If CPU Bound:

– Move lighting from CPU to graphics subsystem.

– Do matrix operations in graphics hardware.

– Profile in search of computational performance issues.

Bottleneck Elimination: Bottlenecks

• Understanding them is crucial to effective tuning.

• They will always exist; tune for balance.

• Not always a bad thing.

Bottleneck Elimination: Graphics

• Use native graphics formats.

• Remove excessive state changes.

• Package graphics primitives efficiently.

• Use textures that fit in texture cache.

• Don’t use unnecessary rendering modes.

• Decrease depth complexity.

• Cull out excessive geometry.

Bottleneck Elimination: Memory

• Don’t allocate memory in rendering loop.

• Avoid copying and repackaging of graphics data.

• Organize graphics data.

• Avoid memory fragmentation.

Bottleneck Elimination: Memory Bandwidth and Fragmentation

Independent Triangles

9 vertices: 504 bytes

Triangle Strip


Vertex Array


Vertex = RGBA+XYZW+XYZ+STR = 56 bytes

Bottleneck Elimination: Code and Language

• Use native data types.

• Avoid contention for a single shared resource.

• Avoid application bottlenecks in non-graphics code.

• Reduce API call overhead.

Bottleneck Elimination: API Call Overhead

Independent Triangles

(XYZW + RGBA + XYZ + STR) * 9 vertices: 36 function calls

Triangle Strips

(XYZW + RGBA + XYZ + STR) * 5 vertices: 20 function calls

Vertex Array

5 function calls

Display List

1 function call
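The byte and call counts in the two comparisons follow from the vertex counts. The sketch below reproduces the arithmetic for a 56-byte vertex (RGBA + XYZW + XYZ + STR as four-byte floats) sent with four calls per vertex. The function names are hypothetical, and the strip formula uses the standard property that a strip of n triangles needs n + 2 vertices.

```c
#include <assert.h>
#include <stddef.h>

enum {
    BYTES_PER_VERTEX = 56,  /* RGBA (16) + XYZW (16) + XYZ (12) + STR (12) */
    CALLS_PER_VERTEX = 4    /* one call each for color, position, normal, texcoord */
};

/* Independent triangles repeat shared vertices: three per triangle. */
size_t independent_vertices(size_t triangles)
{
    return 3 * triangles;
}

/* A triangle strip reuses the previous two vertices of each triangle. */
size_t strip_vertices(size_t triangles)
{
    return triangles == 0 ? 0 : triangles + 2;
}

size_t vertex_bytes(size_t vertices)
{
    return vertices * BYTES_PER_VERTEX;
}

/* Per-vertex calls only; begin/end calls are not counted, matching the slide. */
size_t per_vertex_calls(size_t vertices)
{
    return vertices * CALLS_PER_VERTEX;
}
```

For three triangles this gives 9 vertices, 504 bytes, and 36 calls as independent triangles, against 5 vertices, 280 bytes, and 20 calls as a strip. The 280-byte figure is derived here rather than taken from the slide.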

Conclusion: Performance Tuning Is an Iterative Process

Quantify

System Evaluation

Graphics Analysis

Bottleneck Elimination

Conclusion: It’s all about balance!

Profiling and Performance Analysis

Keith Cok, SGI

Profile and Performance Analysis

• Profiling points out code areas that take up most time

• Imperative for well balanced application

• Points out code and system bottlenecks

Two Methods of Software Profiling

Basic block

• A section of code that has one entry and one exit

• Measures ideal time

Statistical sampling

• Interrupts program execution and examines current location

• Measures actual CPU cycles spent executing a line of code

How Do You Profile Code?

• Compile/link with compiler optimizations turned on

– cc foo.c -use_all_optimization_flags ....

• Instrument the code

– Unix: pixie foo.exe -> foo.exe.pixie

– Visual Studio: embedded in tool suite

• Run the application with relevant data sets

– foo.exe.pixie - args -> produces results data file

Profiling: Finding the Hot Spot

Function list, in descending order by exclusive ideal time

excl.% cum.% instructions calls function (dso: file, line)

[1] 10.3% 10.3% 190583064 11484 GL_CreateSurfaceLightmap (foo: gl_rsurf.c, 1293)

[2] 8.9% 19.2% 173920781 3203 S_Update_ (foo: snd_dma.c, 848)

[3] 8.2% 27.4% 145950460 338787 R_RenderBrushPoly (foo: gl_rsurf.c, 641)

[4] 5.9% 33.3% 97798122 1975976 __sin (libm.so: sin.c, 194)

[5] 4.1% 37.4% 82310479 240 GL_LoadTexture (foo: gl_draw.c, 990)

[6] 3.4% 40.8% 50786176 1204269 __glMgrim_Begin (libGLcore.so: mgras_prim.c, 221)

[7] 3.2% 44.0% 58099072 16797 R_DrawAliasModel (foo: gl_rmain.c, 232)

[8] 3.1% 47.1% 53832546 290970 R_RecursiveWorldNode (foo: gl_rsurf.c, 894)

[9] 3.1% 50.2% 43855299 437627 R_CullBox (foo: gl_rlight.c, 313; compiled in gl_rmain.c)

[10] 2.8% 53.0% 44666700 30981 EmitWaterPolys (foo: gl_warp.c, 187)

Profiling: Fixing the Hot Spot

What do you look for?

• Common sub-expressions

• Loop invariant code

• Repeated pointer de-referencing

• Global variables and cache misses

• “Thin” loops

Profiling Example

    // Code the old way (starts at line 19 of blahdso.c)
    void old_loop() {
        sum = 0;
        for (i = 0; i < NUM; i++)
            sum += x[i];
        printf("sum = %f\n", sum);
    }

    // Code the new way (starts at line 27 of blahdso.c)
    void new_loop() {
        sum = 0;
        ii = NUM % 4;                    // elements left over after unrolling by 4
        for (i = 0; i < ii; i++)
            sum += x[i];
        for (i = ii; i < NUM; i += 4) {  // unrolled: four elements per iteration
            sum += x[i];
            sum += x[i+1];
            sum += x[i+2];
            sum += x[i+3];
        }
        printf(" sum = %f\n", sum);
    }

Profiling Example: Profile Results

With old_loop:

         cycles  instructions  calls  function (dso: file, line)
    [1]    6160          6168      1  old_loop (blahdso.so: blahdso.c, 19)
    [2]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)

With new_loop:

         cycles  instructions  calls  function (dso: file, line)
    [1]    4869          8714      1  setup_data (blahdso.so: blahdso.c, 11)
    [2]    4625          4891      1  new_loop (blahdso.so: blahdso.c, 27)

Profile Example: Line Analysis

Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0;i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i +=4)
         7            1  new_loop  ii = NUM%4;

Profile Example: Visual C++/Intel

    Function Time(s)  Percent of Run Time  Hit Count  Function
               0.410                 39.4          1  _old_loop
               0.249                 23.9          1  _new_loop

Statistical vs. Basic Block Profile

    void ijk_loop() {   // kji_loop and ikj_loop differ only in loop nesting order
        sum = 0;
        for (i = 0; i < YNUM; i++)
            for (j = 0; j < YNUM; j++)
                for (k = 0; k < YNUM; k++)
                    sum += y[i][j][k];
        printf("sum = %f\n", sum);
    }

Basic Block vs. Statistical Sampling

Basic Block:

         Percent    cycles      inst  calls  function
    [1]    25.3%  51141434  37101028      1  ijk_loop  foo.c, 47
    [2]    25.3%  51141434  37101028      1  kji_loop  foo.c, 57
    [3]    25.3%  51141434  37101028      1  ikj_loop  foo.c, 66

Statistical Sampling:

         Percent  Samples  Procedure   Function
    [1]    38.0%     2700  kji_loop    foo.c, 57
    [2]    23.9%     1700  setup_data  foo.c, 15
    [3]    19.7%     1400  ikj_loop    foo.c, 66
    [4]    18.3%     1300  ijk_loop    foo.c, 47
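The two profiles disagree because the three loops execute the same instructions but touch memory in different orders. C stores arrays in row-major order, so the last index is contiguous in memory. The sketch below contrasts the ijk and kji orders; the array size, the fill helper, and the function names are chosen for illustration and are not the code that produced the profile.

```c
#include <assert.h>

#define YNUM 8  /* illustrative size only */

static float y[YNUM][YNUM][YNUM];

/* Set every element to 1 so the expected sum is YNUM * YNUM * YNUM. */
void fill_y(void)
{
    for (int i = 0; i < YNUM; i++)
        for (int j = 0; j < YNUM; j++)
            for (int k = 0; k < YNUM; k++)
                y[i][j][k] = 1.0f;
}

/* Innermost loop runs over the last index: consecutive addresses (stride 1). */
float ijk_sum(void)
{
    float sum = 0.0f;
    for (int i = 0; i < YNUM; i++)
        for (int j = 0; j < YNUM; j++)
            for (int k = 0; k < YNUM; k++)
                sum += y[i][j][k];
    return sum;
}

/* Innermost loop runs over the first index: each step jumps YNUM * YNUM floats. */
float kji_sum(void)
{
    float sum = 0.0f;
    for (int k = 0; k < YNUM; k++)
        for (int j = 0; j < YNUM; j++)
            for (int i = 0; i < YNUM; i++)
                sum += y[i][j][k];
    return sum;
}
```

Both functions return the same sum and execute the same number of additions, which is why basic-block counting reports identical ideal times. Once the array is larger than the cache, the strided kji order misses far more often, and only the sampling profile, which measures real elapsed cycles, shows that cost.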

Now We Know About Hot Spots...

What do we do next?

• Use compilers to fine-tune code

• Use knowledge of language to optimize

• Hand-tune code

Profiling is fun, hard, and iterative and it can be highly effective

Compiler and Language Issues

Keith Cok, SGI
Bob Kuehne, SGI

Compiler and Language Issues

Compiler Optimizations:

• Occur within a compromise of speed and memory space vs. time to compile and link

• An iterative process to discover what does and doesn’t work

• Important to keep at it

Compiler Issues: Trade-Offs

• Trade-offs:

– Round-off vs. needed precision

– Inter-procedural analysis vs. link time

– Pointer aliasing vs. coding constraints

– Optimizing for processor architectures vs. work of multiple binaries (support, test)

• Explore compilers other than your first choice

• Different source code - different flags

Compiler and Language Issues

Comments on 32 vs. 64 bit code

• Benefits of 64 bit code:

– Increased address space

– Higher precision

• Downsides of 64 bit code:

– Application memory footprint

– Need to port which can be difficult!

• Performance issues

Language Issues

• Data Management

• Unrolling loops

• Arrays

• Temporary variables

• Pointer aliasing

Language Issues: Data Management

Manipulate data structures efficiently since graphics IS data

    // Before: the large member sits between the links and the key
    struct str {
        struct str *next;
        struct str *prev;
        large_type  foo;
        int         key;
    };

    // After: next, prev, and key are adjacent; the large member comes last
    struct str {
        struct str *next;
        struct str *prev;
        int         key;
        large_type  foo;
    };

Language Issues: Data Management

Pack data efficiently

    typedef struct foo {
        char  aa;   // 8 bits + 24 pad
        float bb;   // 32 bits
        char  cc;   // 8 bits + 24 pad
        float dd;   // 32 bits
        char  ee;   // 8 bits + 24 pad
    } foo_t;        // 160 bits

    typedef struct foo_better {
        float bb;   // 32 bits
        char  aa;   // 8 bits
        char  cc;   // 8 bits
        char  ee;   // 8 bits + 8 pad
        float dd;   // 32 bits
    } foo_better_t; // 96 bits
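The padding figures on this slide depend on the compiler and target, so it is worth checking them rather than trusting the comments. The sketch below declares both layouts and reports their sizes with sizeof; the helper names are hypothetical. The only thing it asserts at compile time is that the reordered struct is no larger than the original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct foo {
    char  aa;
    float bb;
    char  cc;
    float dd;
    char  ee;
};

struct foo_better {
    float bb;
    char  aa;
    char  cc;
    char  ee;
    float dd;
};

/* Grouping the chars can only remove padding, never add it. */
static_assert(sizeof(struct foo_better) <= sizeof(struct foo),
              "reordered struct should not be larger");

/* Bytes saved per element by the reordering on this platform. */
size_t bytes_saved_per_element(void)
{
    return sizeof(struct foo) - sizeof(struct foo_better);
}

/* Print the sizes in bits so they can be compared with the slide. */
void print_struct_sizes(void)
{
    printf("foo: %zu bits, foo_better: %zu bits\n",
           sizeof(struct foo) * 8, sizeof(struct foo_better) * 8);
}
```

On a typical platform with 4-byte, 4-byte-aligned floats, this reports 160 and 96 bits, which matches the slide. Across an array of a million such records the difference is several megabytes of memory traffic.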

Language Issues: Data Management

Examine your arrays and note their caching behavior

• Break up large arrays into smaller sub-arrays for better memory access patterns

• Understand the implications of data layout and cache behavior

Language Issues: Loop Unrolling

Profiling Example

    // Code the old way
    void old_loop() {
        sum = 0;
        for (i = 0; i < NUM; i++)
            sum += x[i];
        printf("sum = %f\n", sum);
    }

    // Code the new way
    void new_loop() {
        sum = 0;
        ii = NUM % 4;                    // elements left over after unrolling by 4
        for (i = 0; i < ii; i++)
            sum += x[i];
        for (i = ii; i < NUM; i += 4) {  // unrolled: four elements per iteration
            sum += x[i];
            sum += x[i+1];
            sum += x[i+2];
            sum += x[i+3];
        }
        printf(" sum = %f\n", sum);
    }

Profile Example: Line Analysis

Line list, in descending order by time

    cycles  invocations  function  line
      4096         1024  old_loop  sum += x[i];
      2061         1024  old_loop  for (i = 0;i < NUM; i++)
       978          256  new_loop  sum += x[i+3];
       968          256  new_loop  sum += x[i+2];
       968          256  new_loop  sum += x[i+1];
       968          256  new_loop  sum += x[i];
       733          256  new_loop  for (i = ii; i < NUM; i +=4)
         7            1  new_loop  ii = NUM%4;

Language Issues: Loop Unrolling

Issues with loop unrolling:

• Code complexity

• Clutter

• Compiler may/may not do this

• Flags may affect compiler time spent optimizing

Only “thin” loops gain performance

Use application knowledge to take advantage of loop unrolling

Language Issues: Local temporary variables

Use local temporary variables to avoid repeatedly de-referencing a pointer structure

Example:

    x = global_ptr->record_str->a;
    y = global_ptr->record_str->b;

Use:

    tmp = global_ptr->record_str;
    x = tmp->a;
    y = tmp->b;

Language Issues: Using tmp vars for global vars within a function

    // Accumulate in a local temporary, then store once
    void tr_point(FLOAT *old_pt, FLOAT *m, FLOAT *new_pt)
    {
        FLOAT *c1, *c2, *c3, *c4, *op, *np, tmp;
        int j;

        c1 = m; c2 = m+4; c3 = m+8; c4 = m+12;
        for (j = 0, np = new_pt; j < 4; j++) {
            op = old_pt;
            tmp  = *op++ * *c1++;
            tmp += *op++ * *c2++;
            tmp += *op++ * *c3++;
            *np++ = tmp + (*op * *c4++);
        }
    }

    // Versus accumulating through the output pointer on every step
        for (j = 0, np = new_pt; j < 4; j++) {
            op = old_pt;
            *np  = *op++ * *c1++;
            *np += *op++ * *c2++;
            *np += *op++ * *c3++;
            *np++ += *op * *c4++;
        }

Language Issues: Pointer Aliasing

• Pointers are aliases when they point to potentially overlapping regions of memory

• If the regions never overlap, the compiler may optimize for that case; in general, though, it cannot assume this

• The compiler can't tell when pointers are aliased

• Use the restrict keyword or a compiler option

[Figure: "in" and "out" buffers shown for aliased pointers and for unaliased pointers. With unaliased pointers, compilers may use parallelism and pipelining.]

    void process_data(float * restrict in, float * restrict out, float gain) {
        int i;
        for (i = 0; i < NSAMPS; i++) {
            out[i] = in[i] * gain;
        }
    }
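
A small sketch of why the compiler has to be cautious without restrict. The function name scale_samples is illustrative and not from the course. It is the same loop as process_data, written without restrict, so calling it with overlapping buffers is legal and each store can feed a later load:

```c
#include <stddef.h>

/* Same loop as process_data, but without restrict: the compiler must
 * assume that out may overlap in, so it has to keep the loads and
 * stores in program order (or add a runtime overlap check). */
void scale_samples(const float *in, float *out, float gain, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * gain;   /* may overwrite a later in[i] if the buffers overlap */
}
```

Called as scale_samples(buf, buf + 1, 2.0f, 4), every output depends on the previous store, so the iterations cannot run in parallel. Declaring both pointers restrict is a promise that this never happens.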

C++: General Issues

• Language features

– RTTI, safe casts, etc.

• Use const, mutable, volatile, & inline

– hints to compilers

• Object construction

– arrays, default constructors, arguments, etc.

• Method invocation issues

– operators, overloads, conversion, etc.

C++: Virtual Functions

• Good - used to invoke child method when managing base-class handles

• Expensive - incur an additional pointer de-reference

– one, find VTBL; two, find method; then invoke

– bad for caching

• Use when necessary, but not for common objects

– Good for ‘large’ methods that do lots of work

– Bad for ‘small’ methods, like a vertex query

C++: Exceptions & Templates

Exceptions

• Great for error checking

• Performance penalty

– Additional stack information required

Templates

• Great for code re-use

• Memory penalty

– Across libraries, across object files

Code & Language Issues: The End

Balance

• Know your compiler

– Features & performance

• Know your language

• Know your app


Idioms and Application Architectures

Alan Commike, SGI

Starting Quote

The best-tuned, most efficient bubble sort is still a bubble sort. Additional tweaking won't improve performance.

Change The Algorithm! - Commike ’99

Introduction

To write an efficient graphics application, one must:

• Understand the platform

• Use graphics efficiently

• Write good code

Use efficient application structures and algorithms

Outline

• Outline

• Background

• Culling

• Level of Detail (LOD) management

• Application architectures

Application Architectures: Rendering Path

• Application work, culling, LOD, drawing

• Pipelined rendering path

[Figure: pipelined rendering path. Stages App, Cull, LOD, and Draw for Frame 0, Frame 1, and Frame 2, plotted against time steps T0 through T5.]

Application Architectures: Target Frame Rate

A target frame rate attempts to bound the maximum render time

• Control Culling and LOD aggressiveness

• Maintain a constant frame rate

• Achieve an acceptable interactive frame rate
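
One way to picture "control Culling and LOD aggressiveness" is a feedback loop on measured frame time. The sketch below is only an illustration: FrameGovernor, governor_update, and the 10% / 2% step sizes are assumptions, not something the course prescribes.

```c
/* Hypothetical frame-time governor. bias = 1.0 means nominal detail;
 * larger values ask the cull and LOD stages to be more aggressive. */
typedef struct {
    double target_ms;   /* frame time budget, e.g. 1000.0 / 60.0 */
    double bias;        /* current aggressiveness */
    double min_bias;    /* clamp limits for bias */
    double max_bias;
} FrameGovernor;

/* Call once per frame with the measured frame time in milliseconds. */
double governor_update(FrameGovernor *g, double measured_ms)
{
    double load = measured_ms / g->target_ms;

    if (load > 1.0)
        g->bias *= 1.10;      /* over budget: back off quickly */
    else if (load < 0.8)
        g->bias *= 0.98;      /* comfortably under budget: restore detail slowly */

    if (g->bias < g->min_bias) g->bias = g->min_bias;
    if (g->bias > g->max_bias) g->bias = g->max_bias;
    return g->bias;
}
```

Reacting faster to overruns than to slack is a design choice that favors a steady frame rate over peak image quality.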

Graphics Idioms

• Culling

– Removing geometry that isn't visible

• Level of Detail Management

– Reducing geometric complexity

Culling

Don’t draw what you can’t see

Culling: Culling Types

Use one. Use all. Pipeline them together.

• View Frustum Culling

• Backface Culling

• Contribution Culling

• Occlusion Culling

Culling: Bounding Volumes

Test against a bounding volume, not individual primitives

• Can be bounding sphere, box, oriented box, or any enclosing volume

• Hierarchical bounding volumes to reduce cull time

• Spheres are fast, boxes are more accurate

– Use a combination of both
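
A minimal sketch of why spheres are fast: testing one against a plane is a single dot product and two comparisons. The types and the name classify_sphere are illustrative, and the planes are assumed to have unit-length normals pointing into the visible volume.

```c
typedef struct { float nx, ny, nz, d; } Plane;   /* inside when nx*x + ny*y + nz*z + d >= 0 */
typedef struct { float x, y, z, radius; } Sphere;
typedef enum { CULL_OUT, CULL_PARTIAL, CULL_IN } CullResult;

/* Classifies a bounding sphere against a set of planes (six for a view frustum). */
CullResult classify_sphere(const Sphere *s, const Plane *planes, int num_planes)
{
    CullResult result = CULL_IN;
    int i;

    for (i = 0; i < num_planes; i++) {
        /* Signed distance from the sphere center to the plane. */
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return CULL_OUT;          /* entirely behind one plane: cull */
        if (dist < s->radius)
            result = CULL_PARTIAL;    /* straddles this plane */
    }
    return result;
}
```

The test is conservative: a sphere near a frustum corner can be reported as partial even though it is outside. In a hierarchy, CULL_OUT and CULL_IN let the traversal stop at that node; only CULL_PARTIAL needs to descend, possibly with a tighter box test.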

Culling: View Frustum

Graphics pipeline clips data that falls outside the View Frustum

If it will be clipped, don’t bother drawing

Culling: View Frustum Usefulness

• Improves geometry rate

– Culled vertices are not transformed, lit, or clipped

• Improves host download rate

– Less data moved from memory into graphics

• Does not change fill rate

– Triangles outside the View Frustum would not have been drawn anyway

Culling: View Frustum Implementation

• Transform vertices to clip coordinates (in OpenGL multiply by Model-View and Projection matrix)

• Check each vertex against View Frustum

• Geometry is either In, Out, or Partial

• Render In and Partial
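
The steps above can be sketched with per-vertex outcodes. This assumes the vertices (typically the corners of a bounding box) are already in clip coordinates, where a point is inside when -w <= x, y, z <= w; the type and function names are illustrative.

```c
typedef struct { float x, y, z, w; } ClipVertex;
typedef enum { GEOM_OUT, GEOM_PARTIAL, GEOM_IN } GeomClass;

/* One bit per frustum plane the vertex lies outside of. */
static unsigned outcode(const ClipVertex *v)
{
    unsigned code = 0;
    if (v->x < -v->w) code |= 1u;
    if (v->x >  v->w) code |= 2u;
    if (v->y < -v->w) code |= 4u;
    if (v->y >  v->w) code |= 8u;
    if (v->z < -v->w) code |= 16u;
    if (v->z >  v->w) code |= 32u;
    return code;
}

/* Out: every vertex is outside the same plane. In: no vertex is outside
 * any plane. Everything else is treated as Partial (conservative). */
GeomClass classify_vertices(const ClipVertex *v, int count)
{
    unsigned all_and = ~0u, all_or = 0u;
    int i;

    for (i = 0; i < count; i++) {
        unsigned c = outcode(&v[i]);
        all_and &= c;
        all_or  |= c;
    }
    if (count > 0 && all_and != 0) return GEOM_OUT;
    if (all_or == 0) return GEOM_IN;
    return GEOM_PARTIAL;
}
```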

Culling: Skip the Clip

In software transform systems (GTX-RD), skip the clip

• Partial and In geometry classified

– Pipe renders Partial as usual

– Pipe can render In without a View Frustum clip

• Might be a hint to render

• Can improve geometry rates if not already fill-limited

Culling: Backface

Only half of any closed polyhedron is visible at any one time

Don’t render what you can’t see

Culling: Backface Usefulness

• Improves fill rate when using a native implementation

– Primitives are transformed and lit before culling

• Helps both geometry and fill with an application-specific algorithm

– More computationally expensive

– Balance graphics and CPU work

• This may not work well when you can enter closed geometry or need two-sided lighting
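
A minimal sketch of an application-side backface test, assuming outward-facing face normals and an eye position expressed in the same (object) space. Vec3 and is_backfacing are illustrative names.

```c
typedef struct { float x, y, z; } Vec3;

/* Returns 1 if the face points away from the eye, 0 otherwise.
 * normal need not be unit length; only the sign of the dot product matters. */
int is_backfacing(Vec3 normal, Vec3 point_on_face, Vec3 eye)
{
    Vec3 to_eye;
    to_eye.x = eye.x - point_on_face.x;
    to_eye.y = eye.y - point_on_face.y;
    to_eye.z = eye.z - point_on_face.z;
    return (normal.x * to_eye.x + normal.y * to_eye.y + normal.z * to_eye.z) <= 0.0f;
}
```

Faces rejected this way are never sent to the pipeline, which is where the geometry-rate saving comes from; the cost is one dot product per face on the CPU.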

Lava. Hot!

Random Quote

Try not. Do, or do not. There is no try.

- Yoda ’80

Culling: Contribution

If it’s too small to make a difference, don’t render it

Culling: Contribution Usefulness

• Improves geometry rate

– Culled vertices are not transformed, lit, or clipped

• Improves host download rate

– Less data moved from memory into graphics

• Does not change fill rate

– Screen space projection already minimal

– Removes few pixels from rasterization stage

Culling: Contribution Implementation

Don’t render items that fall below a size threshold

• Screen space size of bounding volume

• A less computationally expensive approach

– Distance to object combined with some notion of global object size
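
The cheaper approach can be sketched as a ratio test: bounding radius divided by distance to the eye, compared against a threshold. The names and the idea of tuning min_ratio per application are assumptions; squaring both sides avoids a square root.

```c
typedef struct { float x, y, z; } Vec3;

/* Returns 1 if radius / distance < min_ratio, i.e. the object is judged
 * too small on screen to be worth drawing. */
int contribution_cull(Vec3 center, float radius, Vec3 eye, float min_ratio)
{
    float dx = center.x - eye.x;
    float dy = center.y - eye.y;
    float dz = center.z - eye.z;
    float dist_sq = dx * dx + dy * dy + dz * dz;

    /* radius / dist < min_ratio  <=>  radius^2 < min_ratio^2 * dist^2 */
    return radius * radius < min_ratio * min_ratio * dist_sq;
}
```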

Culling: Occlusion

If you can’t see it, don’t draw it

[Figure: front and side views]

Culling: Occlusion Goals

Find the optimal set of occluders that will enable drawing the minimal number of occludees

• Occluders: The geometry that is visible

• Occludees: The geometry that is not visible

• Use general purpose occlusion culling algorithms

• Use application specific spatial knowledge if possible

Culling: Occlusion Culling Usefulness

• Can improve both transform-limited and fill-limited applications

• Computationally expensive

– Beware of time trade-offs

• Possible hardware support

Culling: General Occlusion Culling

• Used for arbitrary scenes

• Can improve both transform-limited and fill-limited applications

• Computationally expensive for arbitrary scenes

Culling: Occlusion Spatial Partitioning

“Cell and Portal” Culling

• Spatial organization leads to Cells and Portals

• Games that move from room to room

• Architectural walkthroughs

LOD: Overview

After culling, need to draw what is left

• Still too much geometry:

– Use multiple Levels of Detail, i.e. multi-resolution objects

• Match geometric complexity to visible on-screen space coverage

• Reduce geometric complexity to maintain target frame rate

LOD: Issues

• Generating LODs:

– Height Fields vs 3D objects

– View-Dependent: nice, but compute intensive

– View-Independent: fast, memory intensive

• Need to decide which LOD level to use

– Not trivial!

• Need smooth transitions between levels

– Geomorphs

LOD: Height Fields

• Generally thought of as infinite terrain

• Specialized algorithms can be used

LOD: 3D Models

• General purpose simplification algorithm

• Can use on height fields also

• Some recent real-time view-dependent algorithms

• Also used for compression

[Figure: a model at 1024, 256, 64, and 16 triangles]

LOD: When to switch LOD levels

The ability to generate LOD models is not sufficient on its own

• Need to know when to use which LOD level

– Single constant hard metric: distance from eye

– Multiple heuristics: cost, benefit, rankings

• Can bias LODs to ensure frame rate targets are reached
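
A minimal sketch of the single hard metric with a bias applied. The switch-distance table and the name select_lod are assumptions; level 0 is taken to be the most detailed model.

```c
/* switch_dist has num_levels - 1 ascending entries: beyond switch_dist[i],
 * level i + 1 replaces level i. bias > 1.0 makes objects switch to coarser
 * levels sooner, which is one way to pull the frame time back under target. */
int select_lod(float distance, const float *switch_dist, int num_levels, float bias)
{
    int level = 0;
    float d = distance * bias;

    while (level < num_levels - 1 && d > switch_dist[level])
        level++;
    return level;
}
```

With switch distances {10, 50, 200} and four levels, an object 60 units away uses level 2 at bias 1.0 and level 3 at bias 4.0.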

LOD: Level determination

• Determine system rendering characteristics

• Determine cost of rendering each object

• Render objects with highest benefit while remaining under the target frame rate

Level determination can be time consuming!

“take the time to time the time taken to reduce the rendering time”

Going, and going, and going...

LOD: Determining cost of rendering

Cost is affected by many factors

• Graphics hardware: published benchmarks, startup tests

• Number of vertices: primarily a function of LOD algorithm

• Rendering Quality: lighting, shading, wire frame, anti-aliasing, etc.

• Global Factors: total texture memory, dirty internal state

LOD: Benefit Function

Cost alone is not good enough, need benefit also

• Rendered size of object

• Error tolerance between LOD level and reference model

• Importance in scene

• Frame-to-frame coherency

LOD: The Optimal LODs

For all Objects, at each LOD Level, rendered with each RenderType

Maximize the Benefit function:
    Benefit(Object, Level, RenderType)

Subject to:
    Cost(Object, Level, RenderType) <= TargetFrameTime (the time per frame allowed by the target frame rate)
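
One way to write this down, assuming that benefits and costs add across objects (that additivity is an assumption made here, not something the slide states). The symbols $O_i$ for object $i$, $\ell_i$ for its chosen level, $r_i$ for its render type, and $T_{\text{frame}}$ for the frame time budget are introduced only for this sketch:

```latex
\max_{\{(\ell_i,\, r_i)\}} \; \sum_{i} \mathrm{Benefit}(O_i, \ell_i, r_i)
\quad \text{subject to} \quad
\sum_{i} \mathrm{Cost}(O_i, \ell_i, r_i) \;\le\; T_{\text{frame}},
\qquad T_{\text{frame}} = \frac{1}{\text{target frame rate}}
```

For a 60 Hz target, $T_{\text{frame}} \approx 16.7$ ms has to cover every selected object.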

LOD: Optimal Optimizations

• Simulated Annealing

• Monte Carlo Simulations

• Simplex Searches

Dude, can you spare a few dozen CPUs?

LOD: Trade-offs

Don’t have enough time to run full LOD optimization problem and render the scene

• Simplify cost and benefit functions

• Simplify optimization problem into a ranking of Benefit/Cost

• Use frame-to-frame coherency

• Be sure to consider time taken to calculate LODs
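
The ranking idea can be sketched as a greedy pass: sort by benefit per unit cost, then accept candidates until the frame budget is used up. This is a simplification under stated assumptions: each Candidate is one object at one already-chosen level, the benefit and cost values come from the application, and all names are illustrative.

```c
#include <stdlib.h>

typedef struct {
    int   object_id;
    float benefit;    /* application-defined estimate */
    float cost;       /* estimated render time in ms, must be > 0 */
    int   selected;   /* set by choose_within_budget */
} Candidate;

/* qsort comparator: highest benefit/cost first. */
static int by_value_desc(const void *a, const void *b)
{
    const Candidate *ca = a, *cb = b;
    float va = ca->benefit / ca->cost;
    float vb = cb->benefit / cb->cost;
    return (va < vb) - (va > vb);
}

/* Sorts candidates by benefit/cost and greedily marks those that fit in
 * the budget. Returns the total cost of the selected candidates. */
float choose_within_budget(Candidate *c, size_t n, float budget_ms)
{
    float used = 0.0f;
    size_t i;

    qsort(c, n, sizeof *c, by_value_desc);
    for (i = 0; i < n; i++) {
        c[i].selected = (used + c[i].cost <= budget_ms);
        if (c[i].selected)
            used += c[i].cost;
    }
    return used;
}
```

A greedy ranking is not guaranteed to find the optimal set, but it runs in O(n log n) per frame, and frame-to-frame coherency means the previous frame's order is usually a good starting point.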

Application Architectures: Multi-Threading

• More stages give more time to cull or generate LODs

• Each stage adds latency

[Figure: Frame 0, Frame 1, and Frame 2 plotted against time steps T0 through T5.]

Application Architectures: Multi-Threading

• Hard part is data synchronization

• Watch out for memory bloat

Application Architectures: Scene Graphs

A scene graph is the basic data structure holding the description of your scene

• Cull-able, sort-able, and can contain multi-resolution objects

• Hierarchical Bounding Volumes

• Statistics gathering and timing infrastructure

• For large scenes can do memory management and database paging

Application Architectures: Trade-offs

• Quality

• Speed

• Memory

• Complexity

Conclusion: Most importantly - Think about balance!

Performance Hints

Keith Cok, SGI

Performance Hints: Pipeline Management

• Avoid round trips to graphics server

– Cache own state/attribute information

– Avoid pipeline queries (e.g., glGet*)

– Flush buffer efficiently (glFlush vs. glFinish)

• Reduce state changes. Sort by expense. For example, sort geometry by type (triangles, quads, etc.) and then by color

• Eliminate unused attributes
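
A minimal sketch of caching your own state, assuming a single color attribute. ColorCache and its functions are hypothetical; set_fn is a placeholder for whatever wrapper issues the real API call. Redundant sets are filtered out, and queries are answered from the shadow copy instead of a round trip.

```c
#include <string.h>

/* Placeholder for the function that really changes pipeline state. */
typedef void (*SetColorFn)(const float rgba[4]);

typedef struct {
    float      rgba[4];       /* shadow copy of the current color */
    int        valid;         /* 0 until the first set */
    SetColorFn set_fn;        /* may be NULL in this sketch */
    unsigned   calls_issued;  /* how many sets actually reached the pipeline */
} ColorCache;

void cache_set_color(ColorCache *c, const float rgba[4])
{
    if (c->valid && memcmp(c->rgba, rgba, sizeof c->rgba) == 0)
        return;                           /* redundant change: skip the call */
    memcpy(c->rgba, rgba, sizeof c->rgba);
    c->valid = 1;
    c->calls_issued++;
    if (c->set_fn)
        c->set_fn(rgba);
}

/* Answers the query locally, with no pipeline round trip. */
const float *cache_get_color(const ColorCache *c)
{
    return c->rgba;
}
```

The cache is only correct if every change to that attribute goes through it, which is the usual cost of this technique.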

Performance Hints: Debugging

Detect graphics errors:

    #ifdef DEBUG
    /* do { } while (0) makes GLEND() behave as a single statement */
    #define GLEND() do { \
        GLenum err; \
        glEnd(); \
        err = glGetError(); \
        if (err != GL_NO_ERROR) \
            printf("%s\n", (const char *)gluErrorString(err)); \
        assert(err == GL_NO_ERROR); \
    } while (0)
    #else
    #define GLEND() glEnd()
    #endif

Performance Hints: Geometry

• Maximize data between glBegin/glEnd

– Sort geometry by type (triangle, quad, etc.) and group them together

– Find best fit for length of glBegin/glEnd pair

• Use strip primitives (GL_TRIANGLE_STRIP, ...) to reduce geometry data sent to the pipeline

• Avoid GL_POLYGON. Use specific geometric primitives instead (GL_TRIANGLES, GL_QUADS, etc.)

• Use GL_FASTEST with glHint calls where possible
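
A quick bit of arithmetic behind the strip recommendation. The helper names are illustrative: independent triangles send three vertices each, while a single strip sends two vertices to start and one more per triangle.

```c
/* Vertices sent for n triangles drawn as independent triangles. */
unsigned long verts_independent(unsigned long tris)
{
    return 3UL * tris;
}

/* Vertices sent for n triangles drawn as one triangle strip. */
unsigned long verts_strip(unsigned long tris)
{
    return tris ? tris + 2UL : 0UL;
}
```

For 100 triangles that is 300 vertices versus 102, so long strips approach a threefold reduction in vertex data.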

Performance Hints: Geometry

• Use flat display lists for static geometry. Deep display lists may induce unwanted memory thrashing

• Use API matrix operations instead of your own

• Use texture to simulate complex geometry

• Use vertex arrays. Test vertex, interleaved, precompiled arrays

Performance Hints: Geometry

• Pass one normal (not 3 or 4) per flat shaded polygon

• Use a data format suitable for quick transfer to the graphics subsystem

• Disable unneeded operations (alpha blending, depth, stencil, dithering, fog, etc.)

Performance Hints: Lighting

• Reduce lighting requirements:

– Use as few lights as possible

– Use directional (infinite) lighting. Use glLightfv(GL_LIGHTn, GL_POSITION, pos) with pos = {x, y, z, 0}

– Use positional lights rather than spot lights

– Use one-sided lighting when possible (be aware of issues associated with normals)

– Don’t change material properties frequently

Performance Hints: Lighting

• Use normalized normal vectors

– Supply unit length vectors

– Don’t enable GL_NORMALIZE

– Don’t scale using model-view matrix

• Pre-multiply geometry, if possible
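
A minimal sketch of pre-multiplying static geometry once at load time, so a scale does not have to live in the model-view matrix. It assumes an affine 4x4 matrix in OpenGL's column-major layout and tightly packed xyz positions; pretransform_points is an illustrative name. Normals would need separate handling and are not covered here.

```c
#include <stddef.h>

/* Applies the affine part of a column-major 4x4 matrix to count xyz
 * points in place. The bottom row of m is assumed to be 0 0 0 1. */
void pretransform_points(const float m[16], float *xyz, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        float *p = xyz + 3 * i;
        float x = p[0], y = p[1], z = p[2];
        p[0] = m[0] * x + m[4] * y + m[8]  * z + m[12];
        p[1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        p[2] = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
}
```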

Performance Hints: Visuals/Pixel Formats

• Pick the correct visual. Use hardware accelerated visuals

• Structure windows and contexts to maximize performance (app may block after context swaps)

• Put GUI elements in overlay planes to avoid unwanted graphics window refreshes

Performance Hints: Buffers

• Turn off depth buffer when possible

• Use HW accelerated off-screen buffer for backing-store

• Use stencil buffer for interactive picking and quick re-render (see course notes for full algorithm)

• Use color/depth buffer data for interactive editing of complex scenes (see course notes for full algorithm)

Performance Hints: Textures

• Be aware of texture sizes

– Reduce texture resolution

– Use texture LOD extension (OpenGL 1.2)

• Use texture objects. Create textures once

• Don’t swap textures frequently, if possible

– Mosaic multiple textures into one large texture

– Sort geometry by texture
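
Sorting by texture can be sketched with qsort over a draw list. DrawItem and its fields are hypothetical; texture_id stands in for whatever handle the application uses (for example a texture object name). The function also counts how many binds a draw loop would issue if it only rebinds when the texture changes.

```c
#include <stdlib.h>

typedef struct {
    unsigned texture_id;   /* hypothetical texture handle */
    int      mesh_index;   /* hypothetical index of the geometry to draw */
} DrawItem;

/* qsort comparator: ascending texture_id. */
static int by_texture(const void *a, const void *b)
{
    const DrawItem *da = a, *db = b;
    return (da->texture_id > db->texture_id) - (da->texture_id < db->texture_id);
}

/* Sorts the draw list by texture and returns the number of texture
 * binds needed when drawing it in the sorted order. */
size_t sort_by_texture(DrawItem *items, size_t n)
{
    size_t i, binds = 0;

    qsort(items, n, sizeof *items, by_texture);
    for (i = 0; i < n; i++)
        if (i == 0 || items[i].texture_id != items[i - 1].texture_id)
            binds++;
    return binds;
}
```

A list using textures 3, 1, 3, 1, 2 would bind five times in its original order and three times after sorting.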

Performance Hints: Textures

• Use texture as an additional data lookup to simulate more complex data:

– Lighting, geometry, color, clipping, application-space data

• Use glTexSubImage to replace part of a texture rather than creating a whole new texture

• Avoid expensive texture filter modes

• Use texture lookup tables instead of multi-channel textures

Conclusion

Know how your application works within the system

• Don’t let caches, latencies, bandwidths, etc. slow you down

• Know how fast you can go

• Identify system performance characteristics

• Work your compiler

• Get all you can out of the hardware

Questions and Answers
